Attention is a concept studied scientifically across multiple disciplines, including psychology, neuroscience, and, more recently, machine learning. While each discipline may have developed its own definition of attention, a core quality they all agree on is that attention is a mechanism that makes biological and artificial neural systems more flexible.
The study of attention originated in the field of psychology. Observations of behavior drawn from such studies can help researchers infer the mental processes behind the observed behavioral patterns. The concept of attention in machine learning is loosely inspired by the mental mechanism of attention in the human brain.
The idea is to build an artificial neural network that performs well on tasks whose inputs may vary in length, size, or structure, and that can even handle several different tasks. It is in this spirit that attention mechanisms in machine learning are said to be inspired by psychology, rather than because they replicate the biology of the human brain.
In an encoder-decoder architecture, the encoder's task is to generate a vector representation of the input, while the decoder's task is to convert this vector representation into an output. Attention mechanisms connect the two.
Neural network architectures that implement attention mechanisms take different forms depending on the specific application they are built for. Natural language processing (NLP) and computer vision are among the most popular applications.
An early application of attention in NLP was machine translation, where the goal is to translate an input sentence in the source language into an output sentence in the target language. In this context, the encoder generates a set of context vectors, one for each word in the source sentence. The decoder, in turn, reads these context vectors to generate the output sentence in the target language, one word at a time.
For long or structurally complex sequences, representing the input with a single fixed-length vector is particularly problematic, because such sequences are forced into the same number of dimensions as shorter or simpler ones.
This creates a bottleneck where the decoder's access to the information provided by the input (the information available in the fixed-length encoded vector) is limited. On the other hand, preserving the length of the input sequence during encoding allows the decoder to utilize its most relevant parts in a flexible way.
The latter is how the attention mechanism works.
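The contrast between a fixed-length summary and attention over per-word vectors can be made concrete with a small sketch. The code below is illustrative only: it assumes dot-product scoring followed by a softmax, which is one common choice and not something the text above prescribes, and the names attend, mean_pool, encoder_states, and query are hypothetical, as are the toy values.

```python
import math


def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(query, encoder_states):
    """One decoder step: weight every encoder vector by its relevance to the query.

    Assumption: relevance is a plain dot product between the decoder's
    query vector and each encoder vector.
    """
    scores = [dot(query, h) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context is the weighted sum of the encoder vectors.
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


def mean_pool(encoder_states):
    """Fixed-length baseline: one summary vector, identical at every decoder step."""
    n = len(encoder_states)
    dim = len(encoder_states[0])
    return [sum(h[i] for h in encoder_states) / n for i in range(dim)]


# Toy example: three source words, each encoded as a 2-dimensional vector.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]  # hypothetical decoder state at the current output step

weights, context = attend(query, encoder_states)
summary = mean_pool(encoder_states)
```

Here the second encoder vector has a zero dot product with the query, so it receives the smallest weight, while mean_pool returns the same summary no matter what the decoder is currently looking for. The number of weights grows with the number of source words, whereas the context vector keeps the dimensionality of a single encoder vector.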