Honestly, my heart just skipped a beat: the power of this neural network is unbelievably good. These images are so cool, and the point here is that they aren't photoshopped or human-created; they are AI-generated by a new model called DALL·E. DALL·E takes a piece of text and outputs a picture that matches that text. What is super astounding is the quality of these images, and what's even more astounding is the range of capabilities this model has. …
Performers are a new class of models that approximate Transformers. They do so without running into the classic Transformer bottleneck: the attention matrix has space and compute requirements that are quadratic in the size of the input, which limits how much input (text or images) you can feed into the model. Performers get around this problem with a technique called Fast Attention Via Positive Orthogonal Random Features (FAVOR+). They are linear architectures, fully compatible with regular Transformers, and come with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence, and low estimation variance. Performers are capable of provably accurate and practical estimation of regular (softmax) full-rank attention with only linear space and time complexity, without relying on priors such as sparsity or low-rankness. …
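To make the linear-complexity idea concrete, here is a minimal NumPy sketch of random-feature attention in the spirit of FAVOR+. It is not the authors' implementation: it uses plain Gaussian random features rather than the orthogonal ones from the paper, and names like `favor_attention` and `num_features` are mine.

```python
import numpy as np

def positive_random_features(x, proj):
    # phi(x) = exp(w^T x - ||x||^2 / 2) / sqrt(m): positive features whose dot
    # products approximate the softmax kernel exp(q . k) in expectation.
    m = proj.shape[0]
    norm = np.sum(x ** 2, axis=-1, keepdims=True) / 2.0
    return np.exp(x @ proj.T - norm) / np.sqrt(m)

def favor_attention(Q, K, V, num_features=256, seed=0):
    """Approximate softmax attention in O(L * m * d) instead of O(L^2 * d)."""
    d = Q.shape[-1]
    rng = np.random.default_rng(seed)
    # The paper additionally orthogonalizes these rows to reduce variance; omitted here.
    proj = rng.standard_normal((num_features, d))
    q_prime = positive_random_features(Q / d ** 0.25, proj)   # (L, m)
    k_prime = positive_random_features(K / d ** 0.25, proj)   # (L, m)
    kv = k_prime.T @ V                                        # (m, d_v): no L x L matrix
    normalizer = q_prime @ k_prime.sum(axis=0)                # (L,): softmax denominator
    return (q_prime @ kv) / normalizer[:, None]

L, d = 1024, 64
Q, K, V = (np.random.randn(L, d) for _ in range(3))
out = favor_attention(Q, K, V)    # shape (L, 64)
```

The key point is that `k_prime.T @ V` is only (num_features × d_v), so the L×L attention matrix is never materialized and cost grows linearly with the sequence length.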
Transformers work really well for NLP, but they are limited by the memory and compute requirements of the expensive quadratic attention computation in the encoder block. Images are therefore much harder for Transformers, because an image is a raster of pixels and there are many, many pixels to an image; the rasterization of images is a problem in itself, even for Convolutional Neural Networks. To feed an image into a Transformer, every single pixel has to attend to every single other pixel (that is exactly what the attention mechanism does), so for an image that is roughly 255² pixels, attention will cost you on the order of 255⁴ operations, which is practically impossible even on current hardware. So people have resorted to other techniques such as local attention and global attention. …
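To see the scale of the problem, here is a tiny back-of-the-envelope calculation; the 16×16 patch size in the comparison is simply the one commonly used by Vision Transformers, included for contrast.

```python
# Size of the attention matrix: one token per pixel vs. one token per 16x16 patch.
side = 255                            # the 255 x 255 example image from the text
num_pixels = side * side              # ~65k tokens if every pixel is a token
pixel_attn_entries = num_pixels ** 2  # ~4.2e9 entries per head, per layer

patch = 16
num_patches = (side // patch) ** 2    # ~225 tokens when using 16x16 patches
patch_attn_entries = num_patches ** 2 # ~50k entries, easily tractable

print(f"pixel-level attention matrix: {pixel_attn_entries:,} entries")
print(f"patch-level attention matrix: {patch_attn_entries:,} entries")
```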
The authors believe that underspecification is one of the key reasons for the poor behavior of Machine Learning models when deployed in real-world domains. Think of it like this: you have a big training set, you train your model on it, and then you test it on a test set, which usually comes from the same distribution. The caveat is that when you deploy the trained model to production, the real-world data distribution can be very different, and the model might not perform as well. The underspecification problem the authors identify is when all the models produced by your training procedure work equally well on the test set, yet perform very differently in the real world. One of the models (say, from a particular random seed) might perform well even in the real world while the others don't. This shows that the train-test pipeline is underspecified: the train-test split simply doesn't capture what matters in real-world data. …
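Here is a purely illustrative Python sketch of the phenomenon, not an experiment from the paper: the same model is trained with several random seeds, and each seed is scored on an i.i.d. test set and on a shifted "deployment" set where a spurious feature stops being predictive. The toy data generator and all names are my own.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def make_data(n, spurious_strength):
    # One "core" feature that truly predicts y, plus one "spurious" feature whose
    # correlation with y is controlled by spurious_strength.
    y = rng.integers(0, 2, size=n)
    core = y + 0.5 * rng.standard_normal(n)
    spurious = spurious_strength * y + 0.5 * rng.standard_normal(n)
    return np.stack([core, spurious], axis=1), y

X_train, y_train = make_data(2000, spurious_strength=1.0)    # spurious cue helps
X_test, y_test = make_data(1000, spurious_strength=1.0)      # same distribution
X_deploy, y_deploy = make_data(1000, spurious_strength=0.0)  # cue disappears

for seed in range(5):
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed)
    clf.fit(X_train, y_train)
    # In the spirit of the paper: compare how tightly the test scores cluster
    # versus how much the deployment scores spread across seeds.
    print(seed, clf.score(X_test, y_test), clf.score(X_deploy, y_deploy))
```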
At a high level, this paper proposes to construct Knowledge Graphs, structured objects that are usually built by human experts, automatically: a pre-trained language model is simply run over a corpus to extract the knowledge graph without human supervision. The cool thing about this paper is that there is no training involved; the entire knowledge graph is extracted by running the corpus through the model once. So a single forward pass through the pre-trained language model constructs the Knowledge Graph.
In this paper, the authors design an unsupervised approach called MAMA that successfully recovers the factual knowledge stored in Language Models to build Knowledge Graphs from scratch. MAMA constructs a KG with a single forward pass of a pre-trained LM (without fine-tuning) over a textual corpus. …
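As a drastically simplified illustration of the "single forward pass, no training" idea (this is not the paper's Match-and-Map pipeline, which uses beam search over attention matrices; the sentence, entity pair, and crude scoring rule below are my own):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

sentence = "Paris is the capital of France."
head, tail = "Paris", "France"

inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)               # one forward pass, no fine-tuning

# Average attention over layers and heads -> a (seq_len, seq_len) matrix.
attn = torch.stack(out.attentions).mean(dim=0)[0].mean(dim=0)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head_idx = tokens.index(head.lower())
tail_idx = tokens.index(tail.lower())

# Score tokens strictly between head and tail by how much the tail attends to them,
# and keep the strongest ones (in sentence order) as a rough relation phrase.
scores = {i: attn[tail_idx, i].item() for i in range(head_idx + 1, tail_idx)}
top = sorted(sorted(scores, key=scores.get, reverse=True)[:3])
relation = " ".join(tokens[i] for i in top)
print((head, relation, tail))
```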
In recent times we have seen Transformers take over image classification (do check out my Medium Post - Vision Transformers ❤️), but it came either at the cost of downsampling the image into 16×16 patches or of throwing a massive amount of data at the problem. LambdaNetworks are computationally efficient and simple to implement using direct calls to operations available in modern neural network libraries. The attention mechanism is a very general computational framework, essentially a dynamic routing of information, but the authors avoid materializing expensive attention maps because doing so is computationally very costly.
Lambda layers take the global context and summarize/abstract it first, without looking at the query. The context is summarized into a lower-dimensional linear function, represented as a matrix (whose dimensions you can choose), which is then multiplied by the query. So the whole operation is a linear function of the query, as opposed to the attention operation, which looks at interactions between queries and keys and applies a softmax over them, making it a non-linear function. …
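Here is a minimal NumPy sketch of the content part of a lambda layer: the keys are softmax-normalized over context positions, summarized into a single matrix, and that matrix is applied to every query. Position lambdas and multi-query heads from the paper are omitted, and the projection names `Wq`, `Wk`, `Wv` are mine.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lambda_layer_content(X, C, d_k=16, d_v=32, seed=0):
    """Content-only lambda layer sketch: summarize the context C into a (d_k, d_v)
    linear map ("lambda"), then apply it to each query derived from X."""
    rng = np.random.default_rng(seed)
    d = X.shape[-1]
    Wq = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d_v)) / np.sqrt(d)

    Q = X @ Wq                      # (n, d_k) queries
    K = softmax(C @ Wk, axis=0)     # (m, d_k), normalized over context positions
    V = C @ Wv                      # (m, d_v)

    lam = K.T @ V                   # (d_k, d_v): the lambda, a linear function of the context
    return Q @ lam                  # (n, d_v): every query is transformed by the same matrix

n, m, d = 64, 256, 128              # queries, context positions, feature dim
X, C = np.random.randn(n, d), np.random.randn(m, d)
Y = lambda_layer_content(X, C)      # (64, 32)
```

Note that `lam` has shape (d_k, d_v) no matter how many context positions there are, which is exactly why no queries × context attention map is ever built.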