Self-attention

Allows the inputs to interact with each other ("self") and find out which inputs they should pay more attention to
The number of input vectors is not fixed → can process words (text), speech, and graphs

Fully connected network

General Transformer

Types
  • Sequence labeling (one label per input vector)
  • Whole sequence → one label
Self-attention
Captures dependencies and relationships within the input sequence. Two common ways to compute the attention score α (see the sketch after this list):
  • Dot product

  • Additive
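
A minimal NumPy sketch of both scoring functions; the vector size and the additive-attention weights (W, w) are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                  # size of q and k (toy value)
q, k = rng.normal(size=d), rng.normal(size=d)

# Dot-product score: how well the query matches the key
alpha_dot = q @ k

# Additive score: pass [q; k] through a small feed-forward layer
W = rng.normal(size=(d, 2 * d))        # hypothetical projection weights
w = rng.normal(size=d)                 # hypothetical output weights
alpha_add = w @ np.tanh(W @ np.concatenate([q, k]))

print(alpha_dot, alpha_add)
```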

q^i = W^q a^i

k^i = W^k a^i

v^i = W^v a^i

\alpha_{i,j} = (q^i)^T k^j

\alpha^\prime = softmax(\alpha)

b^i = \sum_j \alpha^\prime_{i,j} v^j
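
A sketch of the per-vector equations above for one query position; the sequence length, feature size, and random weights are toy values for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 3, 4                                  # sequence length and feature size (toy values)
a = rng.normal(size=(N, d))                  # input vectors a^1 ... a^N as rows
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

q = a @ Wq.T                                 # q^i = W^q a^i
k = a @ Wk.T                                 # k^i = W^k a^i
v = a @ Wv.T                                 # v^i = W^v a^i

# Scores for query position 1 (index 0): alpha_{1,j} = (q^1)^T k^j
alpha = np.array([q[0] @ k[j] for j in range(N)])
alpha_prime = np.exp(alpha) / np.exp(alpha).sum()   # softmax over j

b1 = alpha_prime @ v                         # b^1 = sum_j alpha'_{1,j} v^j
print(b1.shape)                              # (d,)
```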

Q = W^q I

K = W^k I

V = W^v I

A = K^T Q

A^\prime = softmax(A)

O = V A^\prime

(I stacks the input vectors a^i as columns; the softmax is applied column-wise)
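
The same computation in matrix form, following the column-vector convention above; a NumPy sketch with toy sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 3, 4
I = rng.normal(size=(d, N))          # I = [a^1 ... a^N], one input per column
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q = Wq @ I                           # Q = W^q I
K = Wk @ I                           # K = W^k I
V = Wv @ I                           # V = W^v I

A = K.T @ Q                          # attention scores, A = K^T Q
A_prime = np.exp(A) / np.exp(A).sum(axis=0, keepdims=True)   # column-wise softmax
O = V @ A_prime                      # O = V A'
print(O.shape)                       # (d, N): one output column per input
```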

Multi-head self-attention

q^i → q^{i,1}, q^{i,2} (one query per head; k^i and v^i are split the same way)

b^{i,1}, b^{i,2} → b^i (concatenate the per-head outputs, then apply a linear transform)
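
A two-head sketch under the same toy setup; splitting q/k/v by slicing the feature dimension and mixing the heads with a final transform W^O are assumptions here, not the only possible design:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, heads = 3, 4, 2
d_head = d // heads
I = rng.normal(size=(d, N))                  # inputs as columns
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Wo = rng.normal(size=(d, d))                 # hypothetical output transform W^O

Q, K, V = Wq @ I, Wk @ I, Wv @ I

head_outputs = []
for h in range(heads):
    rows = slice(h * d_head, (h + 1) * d_head)           # rows belonging to head h
    Qh, Kh, Vh = Q[rows], K[rows], V[rows]                # q^{i,h}, k^{i,h}, v^{i,h}
    A = Kh.T @ Qh
    A_prime = np.exp(A) / np.exp(A).sum(axis=0, keepdims=True)
    head_outputs.append(Vh @ A_prime)                     # b^{i,h} for every i

B = np.concatenate(head_outputs, axis=0)     # stack b^{i,1} and b^{i,2}
O = Wo @ B                                   # combine the heads into b^i
print(O.shape)                               # (d, N)
```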

Positional Encoding

Positional vector e^i, added to the input a^i
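
A sketch of adding e^i to the inputs; the sinusoidal formula used here is one common hand-crafted choice (the positional vectors can also be learned):

```python
import numpy as np

def sinusoidal_positions(n_pos, d):
    """e^i for i = 0 .. n_pos-1: sin/cos at different frequencies."""
    pos = np.arange(n_pos)[:, None]                   # position index i
    dim = np.arange(d)[None, :]                       # feature index
    angle = pos / np.power(10000, (2 * (dim // 2)) / d)
    return np.where(dim % 2 == 0, np.sin(angle), np.cos(angle))

N, d = 3, 4
a = np.zeros((N, d))                                  # toy inputs a^i as rows
a_plus_pos = a + sinusoidal_positions(N, d)           # a^i + e^i
print(a_plus_pos)
```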

Truncated self-attention
Attend only within a window around each position (useful for very long sequences); see the sketch below
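
A sketch of the truncation: compute the score matrix, then mask everything outside a window before the softmax (the window size here is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(4)
N, d, window = 6, 4, 1               # each position attends only to neighbours within +/-1
I = rng.normal(size=(d, N))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = Wq @ I, Wk @ I, Wv @ I

A = K.T @ Q                          # score matrix, A[j, i] = k^j . q^i
mask = np.abs(np.arange(N)[:, None] - np.arange(N)[None, :]) <= window
A = np.where(mask, A, -np.inf)       # drop scores outside the window

A_prime = np.exp(A) / np.exp(A).sum(axis=0, keepdims=True)
O = V @ A_prime
print(O.shape)                       # (d, N)
```

In practice the out-of-window scores are simply never computed, which is what saves computation on long sequences; the mask here just makes the idea explicit.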
Recurrent Neural Network

Self-attention is usually preferred: every position can be processed in parallel, while an RNN must step through the sequence one element at a time

Graph Neural Network

Graph → matrix → vector: nodes carry feature vectors, and the adjacency matrix tells which attention scores to compute (only between connected nodes); see the sketch below
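
A hedged sketch of that connection: treat nodes as input vectors and let a made-up adjacency matrix decide which attention scores are kept, so each node only attends to its neighbours:

```python
import numpy as np

rng = np.random.default_rng(5)
N, d = 4, 4
adj = np.array([[1, 1, 0, 0],        # hypothetical adjacency matrix
                [1, 1, 1, 0],        # (self-loops included so every node
                [0, 1, 1, 1],        #  attends at least to itself)
                [0, 0, 1, 1]])
I = rng.normal(size=(d, N))          # node feature vectors as columns
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = Wq @ I, Wk @ I, Wv @ I

A = K.T @ Q
A = np.where(adj > 0, A, -np.inf)    # keep scores only along graph edges
A_prime = np.exp(A) / np.exp(A).sum(axis=0, keepdims=True)
O = V @ A_prime                      # updated node vectors
print(O.shape)                       # (d, N)
```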