ViT (architecture) - Dictionaries - Klavogonki - online typing trainer game

ViT (architecture)

A vision transformer (ViT) is a transformer designed for computer vision. A ViT decomposes an input image into a series of patches (rather than text into tokens), serializes each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.
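
The patch step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the names `patchify` and `embed`, the 8x8 RGB image, the 4x4 patch size, and the embedding width of 16 are all assumptions chosen for the example, and the projection weights are random rather than learned. Position embeddings and a class token, which practical ViTs usually add at this stage, are left out.

```python
import random

def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    flat.extend(image[row][col])  # append all channels of this pixel
            patches.append(flat)
    return patches

def embed(patches, weight):
    """Project every flattened patch with one matrix multiplication: patches @ weight."""
    return [
        [sum(p * w for p, w in zip(vec, column)) for column in zip(*weight)]
        for vec in patches
    ]

# Hypothetical sizes: 8x8 RGB image, 4x4 patches, 16-dimensional embeddings.
rng = random.Random(0)
H = W = 8
C, P, D = 3, 4, 16
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
weight = [[rng.gauss(0.0, 0.02) for _ in range(D)] for _ in range(P * P * C)]

patches = patchify(image, P)    # 4 patches, each a vector of 4*4*3 = 48 numbers
tokens = embed(patches, weight) # 4 embeddings, each of dimension 16
print(len(patches), len(patches[0]), len(tokens), len(tokens[0]))  # 4 48 4 16
```

Each 48-number patch vector is reduced to 16 numbers, and the resulting list of vectors is what the transformer encoder would consume in place of token embeddings.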

The attention mechanism in a ViT repeatedly transforms the representation vectors of image patches, incorporating more and more semantic relations between the patches of an image. This is analogous to natural language processing, where representation vectors flowing through a transformer incorporate more and more relations between words, from syntax to semantics.

The above architecture turns an image into a sequence of vector representations. To use these for downstream applications, an additional head needs to be trained to interpret them. For example, to use it for classification, one can add a shallow MLP on top of it that outputs a probability distribution over classes.
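
To make the two ideas above concrete, here is a hedged sketch of one single-head self-attention step over patch embeddings, followed by a small classification head. Everything named here (`self_attention`, `classify`, the sizes, the random weights) is illustrative. A real encoder layer also has multiple heads, residual connections, layer normalization, and a feed-forward block, and it is stacked many times. Mean-pooling the patch vectors before the head is one possible choice assumed for this example; a dedicated class token is another common one.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product attention: every patch attends to every patch."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]       # transpose of k
    scores = matmul(q, k_t)                    # pairwise patch similarities
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)                  # each output mixes all patches

def classify(tokens, w_hidden, w_out):
    """Mean-pool the patch vectors, then apply a one-hidden-layer MLP with softmax."""
    pooled = [sum(col) / len(tokens) for col in zip(*tokens)]
    hidden = [max(0.0, h) for h in matmul([pooled], w_hidden)[0]]  # ReLU
    return softmax(matmul([hidden], w_out)[0])

def random_matrix(rng, rows, cols):
    """Random weights standing in for trained parameters."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes: 4 patches, width 8, hidden layer of 16, 3 classes.
rng = random.Random(0)
N, D, HIDDEN, CLASSES = 4, 8, 16, 3
tokens = random_matrix(rng, N, D)
w_q, w_k, w_v = (random_matrix(rng, D, D) for _ in range(3))

mixed = self_attention(tokens, w_q, w_k, w_v)
probs = classify(mixed, random_matrix(rng, D, HIDDEN), random_matrix(rng, HIDDEN, CLASSES))
print(len(mixed), len(mixed[0]), probs)
```

With untrained weights the probabilities are meaningless, but the shapes show the flow: a sequence of patch vectors goes in, a sequence of the same shape comes out of attention, and the head reduces it to one distribution over classes.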
