Background: Over the past decade, billions of sensors in connected devices have been used to transform physical signals and information into the digital world. Due to limited computing power, sensors integrated into remote embedded devices often transmit raw, unprocessed data to their hosts. However, the high energy cost of wireless data transmission limits both the autonomy of these devices and the available bandwidth. Improving their energy efficiency could open up a range of new applications and reduce their environmental footprint. One way to do so is to move data processing from the remote host to the local sensor node, so that data transfers are limited to the structured, valuable data actually needed.
The von Neumann architecture separates processing and storage, requiring data to be shuttled back and forth between them for signal processing or inference in a neural network. Data communication between memory and processing units already accounts for one-third of the energy consumed in scientific computing. To overcome this von Neumann communication bottleneck, in-memory computing architectures are being explored, in which memory, logic, and processing operations are co-located. In-memory computing devices are particularly well suited to performing vector-matrix multiplication, a critical operation in data processing and the most computationally intensive step of machine-learning algorithms.
By using the physical layer of the memory to perform multiply-accumulate (MAC) operations, such an architecture overcomes the von Neumann communication bottleneck. So far, this processing strategy has been applied to solving linear and differential equations, signal and image processing, and accelerating artificial neural networks. However, the search for the best materials and devices for this type of processor is still ongoing.
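To see why a memory array maps directly onto vector-matrix multiplication, it helps to write down the idealized crossbar model: weights are stored as conductances, inputs are applied as voltages, and each column current is a MAC result. A minimal NumPy sketch of this idea (illustrative values, not from the paper):

```python
import numpy as np

# Idealized crossbar model: each stored weight is a conductance G[i, j]
# (siemens) and the input vector is applied as row voltages V[i] (volts).
rng = np.random.default_rng(0)
G = rng.uniform(1e-9, 1e-6, size=(32, 32))   # 32 x 32 conductance matrix
V = rng.uniform(-0.1, 0.1, size=32)          # input voltage vector

# Ohm's law plus Kirchhoff's current law: each column current is a
# multiply-accumulate (MAC) of the inputs with that column's conductances,
# so reading all column currents at once yields the vector-matrix product.
I = V @ G

assert np.allclose(I, [np.dot(V, G[:, j]) for j in range(32)])
```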
Presentation of the results. In view of this, the team of Professor Andras Kis at the Swiss Federal Institute of Technology in Lausanne (EPFL) recently reported an integrated 32 × 32 vector-matrix multiplier that uses monolayer MoS2 as the channel material and contains 1,024 floating-gate field-effect transistors (FGFETs). The wafer-scale fabrication process described in the paper achieves high yield and low device-to-device variation, prerequisites for practical applications.
Statistical analysis highlights the potential for multilevel and analog storage with a single programming pulse, allowing the accelerator to be programmed using an efficient open-loop scheme. The paper also demonstrates reliable discrete signal processing performed in parallel. The work was published in the top journal Nature Electronics under the title "A large-scale integrated vector-matrix multiplication processor based on monolayer molybdenum disulfide memories".
Guide to the figures.
Figure 1. Description and characterization of the devices and matrices. (a) 3D rendering of FGFETs connected into matrix arrays. (b) Cross-sectional 3D view of an FGFET. (c) Optical image of the memory matrix configuration. (d) IDS-VG hysteresis curves for the 851 operating devices. (e) Three-dimensional map of the on- and off-currents across the 32 × 32 chip.
In this work, a charge-based memory with monolayer MoS2 as the channel material is used to realize in-memory computing. Specifically, FGFETs are fabricated to take advantage of the electrostatic sensitivity of 2D semiconductors. To reach a larger array, the FGFETs are integrated in a matrix in which individual storage elements are addressed by selecting the corresponding rows and columns. Figures 1a and b show a 3D rendering of the memory matrix and the detailed structure of each FGFET, respectively. The matrix configuration allows a denser topology and maps directly onto vector-matrix multiplication.
The memory is controlled by a local 2 nm/40 nm Cr/Pt gate fabricated with a gate-first method. This enables improved dielectric growth by atomic layer deposition and minimizes the number of process steps the 2D channel is exposed to, resulting in higher yields. The floating gate is a 5 nm Pt layer sandwiched between a 30 nm HfO2 blocking oxide and a 7 nm HfO2 tunnel oxide.
Next, vias are etched in the HfO2 to electrically connect the bottom (M1) and top (M2) metal layers. This is necessary to route the source and drain signals without overlap. Wafer-scale MOCVD-grown MoS2 is then transferred onto the gate stack and etched to form the transistor channels. Finally, 2 nm/60 nm Ti/Au is patterned and evaporated on top, forming the source and drain contacts of the transistors as well as the second metal layer. Figure 1c shows an optical image of the fabricated chip, with 32 rows and 32 columns for a total of 1,024 memories.
The memory in this work is based on standard flash memory. The storage mechanism relies on shifting the neutral threshold voltage (VTH0) by changing the amount of charge (ΔQ) in the trapping layer, i.e., the Pt floating gate. When a large positive or negative bias is applied to the gate, the band alignment begins to favor tunneling of electrons between the semiconductor and the floating gate, changing the carrier concentration in the trapping layer. The memory window (ΔVTH) is defined as the difference between the threshold voltages of the forward and backward sweeps. Because the storage effect depends entirely on charge-based processes, flash memory tends to have better reliability and repeatability than emerging memories that rely on material changes, such as resistive random-access memory and phase-change memory.
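As a first-order picture (the standard textbook floating-gate relation, assumed here rather than quoted from the paper), the threshold shift is set by the charge stored on the floating gate:

$$\Delta V_{\mathrm{TH}} \approx -\frac{\Delta Q}{C_{\mathrm{B}}}$$

where $C_{\mathrm{B}}$ is the capacitance of the blocking oxide (here, the 30 nm HfO2 layer) between the control gate and the floating gate; injecting electrons ($\Delta Q < 0$) shifts the threshold voltage upward, and removing them shifts it back down, which is what produces the memory window.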
Figure 1d shows the IDS-VG sweep measured for every device. The yield of the process is 83.1%, and the devices are statistically similar. The relatively high off-state current is due to the limited resolution of the analog-to-digital converter used in the setup; high-resolution single-device measurements confirm that typical off-state currents are in the picoamp range. Figure 1e shows the distribution of on- and off-currents across the memory matrix. Taken at VDS = 100 mV, the on- and off-currents form two distinct planes and are well distributed throughout the matrix. The devices have a statistically similar memory window of ΔVTH = 4.30 ± 0.25 V.
Figure 2. Open-loop programming. (a) Schematic of the two-phase operation of the open-loop programming scheme. (b) Distribution of the output state (WOUT) on a linear scale. (c) Distribution of the output state (WOUT) on a log10 scale. (d) 3D map of the log10 value of WOUT as a function of device position for different programming voltages. (e) Empirical cumulative distribution function (ECDF) for each programmed state.
The similarity of these devices enables a statistical study of the programming behavior of the memory. In an in-memory computing setting, open-loop programming analysis is fundamental: standard write-verify methods can be too time-consuming when programming large flash arrays, so a statistical understanding of the stored state under open-loop conditions is critical to improving performance and speed. In the experiments, each device is addressed individually by selecting the corresponding row (i) and column (j). The analog switches on the device interface board maintain a low-impedance path on the selected row (i) and column (j) and a high-impedance path on the remaining rows and columns. This ensures that the potential difference is applied only to the desired device, avoiding unwanted programming.
For the same reason, device programming and reading are separated into two distinct phases. During the programming phase, the corresponding gate line (row) and source line (column) are selected, and a programming pulse with parameters tpulse and Vpulse is applied to the gate. Due to the tunneling nature of the device, only two terminals are needed to generate the band bending required for charge injection into the floating gate. After the pulse, the gate voltage is set to Vread, which is low enough to prevent reprogramming of the memory state.
During the reading phase, the drain line is also connected and the conductance is read out by applying a voltage VDS to the drain. This two-phase process is necessary because three-terminal devices are used: the gate and drain share the same row, so engaging both the gate and drain lines biases the entire row. If a high voltage were present on the gate when the drain line is connected, the whole row would be reprogrammed, destroying the information stored in the memory. Figure 2a depicts this two-phase programming process.
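A minimal simulation of this two-phase sequence is sketched below. The array, pulse response, and thresholds are hypothetical stand-ins; the real chip is driven through analog switches on a device interface board, whose control software the article does not describe.

```python
import numpy as np

V_READ = -3.0   # gate voltage during reading (V): low enough not to reprogram
V_DS = 1.0      # drain-source voltage during reading (V)
T_PULSE = 0.1   # programming pulse width (s)

conductance = np.full((32, 32), 1e-9)  # all devices start in the erased state

def program_cell(i, j, v_pulse):
    """Phase 1: select row i (gate line) and column j (source line), pulse
    the gate. The drain line stays disconnected, so nothing else is biased."""
    # Toy pulse response: stronger pulses -> higher programmed conductance.
    conductance[i, j] = 1e-9 * 10 ** min(abs(v_pulse), 6.0)
    # Afterwards the gate returns to V_READ, preventing further programming.

def read_cell(i, j):
    """Phase 2: only now is the drain line connected; with the gate already
    at V_READ, connecting the drain cannot reprogram the rest of the row."""
    return conductance[i, j] * V_DS  # read current I = G * V_DS

program_cell(3, 7, v_pulse=-5.0)
print(f"I(3,7) = {read_cell(3, 7):.2e} A")
```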
For the subsequent measurements, Vread = -3 V, VDS = 1 V, and tpulse = 100 ms are used. Before each measurement, the memory is reset by applying a positive 10 V pulse, which puts the device into a low-conductance state. This reset scheme improves the programming reliability of the devices by an order of magnitude: programming a bit yields about 500 errors per million operations, while programming the erased state yields about one error per million. Figures 2b and c show the distributions of the stored state on linear and logarithmic scales after pulses of different amplitude. An increase in pulse amplitude is accompanied by a higher stored-state value and a greater spread on the linear scale.
On the other hand, analyzing the logarithm of the state values shows that the memory has well-defined storage states. The memory therefore has the potential for multilevel storage without write-verify algorithms, especially on a logarithmic scale. Figure 2d shows the spatial distribution of the states across the chip: for each programming voltage, the memory states form a uniform plane. Finally, Figure 2e shows the empirical cumulative distribution function (ECDF) on a logarithmic scale. As mentioned earlier, these results support the possibility of multilevel programming and show that the memory elements can store analog weights for in-memory computation.
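The ECDF itself is simple to compute; a short sketch with synthetic, log-normally spread states standing in for the measured WOUT distributions (the actual spreads are those in Fig. 2):

```python
import numpy as np

def ecdf(values):
    """Empirical cumulative distribution function of programmed states."""
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Synthetic stand-in: one log-normal cloud of states per programming voltage,
# mimicking the well-separated log10 distributions reported in Fig. 2c-e.
rng = np.random.default_rng(1)
for k, v_prog in enumerate((-4.0, -5.0, -6.0, -7.0)):
    w_out = rng.lognormal(mean=k, sigma=0.3, size=851)  # 851 working devices
    x, y = ecdf(np.log10(w_out))
    print(f"V_prog = {v_prog} V -> median log10(W_OUT) = {x[len(x) // 2]:.2f}")
```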
Figure 3. MAC operation. (a) Output memory state (WOUT, with programming error bars) versus programming voltage (Vprog). (b) Normalized YEXP versus YTHEORY, comparing the experimental and theoretical results of MAC operations.
With the open-loop analysis complete (Figure 3a), the memory state (WOUT) is plotted against the programming voltage (Vprog). Four equally spaced states (two-bit resolution) are defined and programmed as discrete weights into the matrix for vector-matrix multiplication.
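A sketch of this discretization step, quantizing continuous target weights to four evenly spaced levels (the actual level values used in the paper are not reproduced here):

```python
import numpy as np

def quantize_2bit(w):
    """Snap each weight to the nearest of four evenly spaced levels."""
    levels = np.linspace(w.min(), w.max(), 4)       # four target states
    idx = np.abs(w[..., None] - levels).argmin(-1)  # index of nearest level
    return levels[idx]

w = np.random.default_rng(2).uniform(0.0, 1.0, size=(4, 4))
print(quantize_2bit(w))  # discrete weights ready to program into the matrix
```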
To analyze how well the processor performs vector-matrix operations, the normalized theoretical values (YTHEORY) are compared with the normalized experimental values (YEXP) obtained over several dot-product operations (Fig. 3b). For YEXP = a·YTHEORY + b, a linear regression over the experimental points yields a = 0.988 ± 0.008 and b = -0.129 ± 0.003; the shaded area corresponds to the 95% confidence interval.
An ideal processor would converge to a = 1 and b = 0, with the confidence interval collapsing onto the linear fit. Here, the processor shows linear behavior close to the ideal case, although the experimental values exhibit some spread and a slight nonlinearity. This behavior is explained by the non-idealities of the memories and by the quantization error due to the limited resolution of the states. The offset in parameter b can be attributed to the intrinsic offset of the transimpedance amplifier and to memory leakage at YTHEORY = 0, but it does not affect the observed linear trend.
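This kind of check is easy to reproduce on synthetic data; here the reported gain and offset (a = 0.988, b = -0.129) generate fake measurements, which a straight-line fit then recovers:

```python
import numpy as np

# Synthetic data only: the reported gain error, offset, and some noise
# stand in for the device non-idealities discussed above.
rng = np.random.default_rng(3)
y_theory = rng.uniform(-1.0, 1.0, size=200)
y_exp = 0.988 * y_theory - 0.129 + rng.normal(0.0, 0.02, size=200)

a, b = np.polyfit(y_theory, y_exp, deg=1)
print(f"a = {a:.3f}, b = {b:.3f}")  # an ideal processor gives a = 1, b = 0
```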
As a result, MAC operations can be performed with reasonable precision. This operation underlies various kinds of algorithms, such as signal processing and inference in artificial neural networks.
Figure 4. In-memory signal processing. (a) Illustration of convolution-based signal processing with different kernels (low-pass, high-pass, and identity filters). (b) Comparison of the theoretical kernel weights with the experimental weights transferred to memory conductances. (c) Comparison of the fast Fourier transform (FFT) of the theoretical output signal of each kernel with the experimental output.
Next, the accelerator is configured to perform signal processing, demonstrating a real-world application. For signal processing, the input signal (x) is convolved with a kernel (h) to obtain the processed signal (y). Depending on the kernel elements, different types of processing can be implemented.
Here, three different kernels are used, performing low-pass filtering, high-pass filtering, and identity (pass-through). All kernels operate in parallel in a single processing cycle, demonstrating the efficiency of the processor in solving data-centric problems through parallel processing. More kernels could be added in parallel, limited only by the size of the matrix. Figure 4a shows the convolution operation and the different kernels used to process the input signal.
The strategy for encoding negative kernel values into memory conductances is to split the kernel (h) into a kernel (h+) containing only the positive values and a kernel (h-) containing the absolute values of the negative entries, so that only positive numbers, which map directly onto conductance values (g), need to be stored. After processing, the output of the negative kernel (y-) is subtracted from that of the positive kernel (y+) to obtain the final signal (y). Figure 4b compares the original weights with the weights transferred into the memory matrix using the open-loop programming scheme described earlier. To simplify the transfer, the weights of each kernel are normalized to their maximum. Good agreement between the original and experimental values is observed.
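A minimal sketch of this differential encoding, using an example signed kernel (the paper's exact kernel values are not reproduced here):

```python
import numpy as np

def split_kernel(h):
    """Split a signed kernel into two positive-only kernels: h = h_plus - h_minus."""
    h = np.asarray(h, dtype=float)
    h_plus = np.clip(h, 0.0, None)    # positive entries, zero elsewhere
    h_minus = np.clip(-h, 0.0, None)  # |negative entries|, zero elsewhere
    return h_plus, h_minus

h = np.array([-1.0, 2.0, -1.0])       # example high-pass-like kernel
h_plus, h_minus = split_kernel(h)

x = np.sin(np.linspace(0.0, 8.0 * np.pi, 256))  # toy input signal
y = np.convolve(x, h_plus, "same") - np.convolve(x, h_minus, "same")
assert np.allclose(y, np.convolve(x, h, "same"))  # matches the signed kernel
```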
Next, to verify the effectiveness of the processing, the input signal (x) is constructed as a sum of sine waves of different frequencies. In this way, the behavior of each filter at different frequencies can be checked without generating an overly complex signal. Since the signal has both positive and negative values, its amplitude must fall within the linear region of device operation.
Therefore, the signal range around Vread = 0 is limited to -100 mV to +100 mV. Figure 4c shows the fast Fourier transforms of the simulated and experimental processed signals. The gray lines in both the simulated and measured spectra show the FFT for each kernel, providing a guide to the behavior of each operation. The experimental outputs of the three filters agree well with the theoretical values and the prototype filters.
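The frequency-domain check can be reproduced in the same spirit; a sketch with an assumed sample rate and a crude moving-average low-pass standing in for the programmed kernel:

```python
import numpy as np

fs = 1000.0                                   # sample rate (Hz), assumed
t = np.arange(0.0, 1.0, 1.0 / fs)
x = sum(np.sin(2.0 * np.pi * f * t) for f in (10.0, 50.0, 200.0))
x *= 0.1 / np.abs(x).max()                    # keep within the +/-100 mV range

h_low = np.ones(9) / 9.0                      # crude moving-average low-pass
y = np.convolve(x, h_low, "same")

# Compare spectra of input and filtered output, as in Fig. 4c.
freqs = np.fft.rfftfreq(len(t), 1.0 / fs)
dominant = freqs[np.argmax(np.abs(np.fft.rfft(y)))]
print(f"dominant tone after low-pass: {dominant:.0f} Hz")  # the 10 Hz component
```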
Summary and outlook.
This paper reports the large-scale integration of a 2D material as the semiconductor channel in an in-memory processor. Device characterization and the statistical similarity of the programmed states under open-loop programming demonstrate the reliability and repeatability of the devices. The processor performs vector-matrix multiplication, and its function is illustrated through discrete signal processing. This approach can enable in-memory processors to reap the benefits of 2D materials and bring new capabilities to edge devices for the Internet of Things.
Bibliographic information. A large-scale integrated vector-matrix multiplication processor based on monolayer molybdenum disulfide memories. Nat. Electron., 2023. DOI: 10.1038/s41928-023-01064-1