Self-attention

- Allows the inputs to interact with each other ("self") and find out who they should pay more attention to.
- The number of input vectors is not fixed → works for text, speech, and graph inputs.
- Used together with fully connected network layers; the key building block of the general Transformer.
- Output types
  - Sequence labeling: one label per input vector
  - Whole sequence → one label
  - Sequence-to-sequence (the model decides the output length): the Transformer
- Self-attention captures dependencies and relationships within the input sequence.
  - Scoring functions: dot product (used below) or additive
  - Per vector: $q^i = W^q a^i$, $k^i = W^k a^i$, $v^i = W^v a^i$
  - Attention score: $\alpha_{i,j} = q^i \cdot k^j$, normalized with softmax to $\alpha'_{i,j}$
  - Output: $b^i = \sum_j \alpha'_{i,j} v^j$
  - Matrix form (inputs $a^i$ stacked as columns of $I$): $Q = W^q I$, $K = W^k I$, $V = W^v I$, then $A = K^T Q$, $A' = \text{softmax}(A)$, $O = V A'$ (see the NumPy sketch below)
- Multi-head self-attention
  - Each head has its own projections: $q^i \to q^{i,1}, q^{i,2}$ (likewise for $k^i$ and $v^i$)
  - The head outputs are combined: $b^{i,1}, b^{i,2} \to b^i$ (second sketch below)
- Positional encoding: add a positional vector $e^i$ to each input $a^i$ (third sketch below)
- Truncated self-attention: attend only within a limited window rather than the whole sequence
- vs. Recurrent Neural Network: self-attention is better (parallel computation, easier access to distant positions)
- Graph Neural Network: Graph → Matrix → Vector
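A minimal NumPy sketch of the matrix form above, using the column-vector convention from the notes ($I$ holds the $a^i$ as columns). The function name, the dimension sizes, and the $1/\sqrt{d_k}$ scaling line are illustrative additions, not from the notes.

```python
import numpy as np

def self_attention(I, W_q, W_k, W_v):
    """Single-head self-attention in matrix form (column-vector convention).

    I   : (d_in, n)  input vectors a^1..a^n stacked as columns
    W_q : (d_k, d_in), W_k : (d_k, d_in), W_v : (d_v, d_in)
    Returns O : (d_v, n), one output column b^i per input vector.
    """
    Q = W_q @ I                     # Q = W^q I
    K = W_k @ I                     # K = W^k I
    V = W_v @ I                     # V = W^v I
    A = K.T @ Q                     # dot-product scores, A = K^T Q
    A = A / np.sqrt(K.shape[0])     # scaling used in the Transformer (omitted in the notes)
    A = np.exp(A - A.max(axis=0, keepdims=True))
    A_prime = A / A.sum(axis=0, keepdims=True)   # column-wise softmax -> A'
    return V @ A_prime              # O = V A'

# Illustrative shapes only.
rng = np.random.default_rng(0)
n, d_in, d_k, d_v = 4, 8, 6, 6
I = rng.normal(size=(d_in, n))
W_q, W_k, W_v = (rng.normal(size=(d_k, d_in)),
                 rng.normal(size=(d_k, d_in)),
                 rng.normal(size=(d_v, d_in)))
O = self_attention(I, W_q, W_k, W_v)
print(O.shape)  # (6, 4): one b^i per input vector
```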
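A sketch of multi-head self-attention built on the `self_attention` function above. The per-head weight list and the output projection `W_o` that merges $b^{i,1}, b^{i,2}$ into $b^i$ are assumed names used only for illustration.

```python
def multi_head_self_attention(I, heads, W_o):
    """heads : list of (W_q, W_k, W_v) tuples, one per head
               (separate projections give q^{i,1}, q^{i,2}, ... per position)
    W_o   : (d_out, num_heads * d_v) projection combining b^{i,1}, b^{i,2} -> b^i
    """
    # Run each head independently, then stack the head outputs row-wise.
    B = np.concatenate([self_attention(I, W_q, W_k, W_v)
                        for W_q, W_k, W_v in heads], axis=0)
    return W_o @ B                  # final b^i for every position, as columns

# Two heads, reusing the shapes from the previous sketch.
heads = [(rng.normal(size=(d_k, d_in)),
          rng.normal(size=(d_k, d_in)),
          rng.normal(size=(d_v, d_in))) for _ in range(2)]
W_o = rng.normal(size=(d_in, 2 * d_v))
print(multi_head_self_attention(I, heads, W_o).shape)  # (8, 4)
```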
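The notes only say that a positional vector $e^i$ is added to each $a^i$; the sinusoidal encoding below is the common Transformer choice and is shown purely as one possible realization, not as what the notes prescribe.

```python
import numpy as np

def add_positional_encoding(I):
    """Add a sinusoidal positional vector e^i to each input column a^i.

    I : (d, n) inputs as columns; returns an array of the same shape.
    """
    d, n = I.shape
    pos = np.arange(n)[None, :]                      # position index of each column
    dim = np.arange(d)[:, None]                      # feature dimension index
    angle = pos / (10000 ** (2 * (dim // 2) / d))
    E = np.where(dim % 2 == 0, np.sin(angle), np.cos(angle))  # e^1..e^n as columns
    return I + E                                     # a^i + e^i
```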