Self-attention

- Allows the inputs to interact with each other ("self") and find out who they should pay more attention to.
- The number of input vectors is not fixed → works for text, speech, and graph inputs.
- Used together with fully connected network layers; the key building block of the general Transformer.
- Output types
  - Sequence labeling: one label per input vector
  - Whole sequence → one label
  - Sequence-to-sequence (the model decides the output length): the Transformer
- Self-attention captures dependencies and relationships within the input sequence.
  - Scoring functions: dot product (used below) or additive
  - Per vector: $q^i = W^q a^i$, $k^i = W^k a^i$, $v^i = W^v a^i$
  - Attention score: $\alpha_{i,j} = q^i \cdot k^j$, normalized with softmax to $\alpha'_{i,j}$
  - Output: $b^i = \sum_j \alpha'_{i,j} v^j$
  - Matrix form (inputs $a^i$ stacked as columns of $I$): $Q = W^q I$, $K = W^k I$, $V = W^v I$, then $A = K^T Q$, $A' = \text{softmax}(A)$, $O = V A'$ (see the NumPy sketch below)
- Multi-head self-attention
  - Each head has its own projections: $q^i \to q^{i,1}, q^{i,2}$ (likewise for $k^i$ and $v^i$)
  - The head outputs are combined: $b^{i,1}, b^{i,2} \to b^i$ (second sketch below)
- Positional encoding: add a positional vector $e^i$ to each input $a^i$ (third sketch below)
- Truncated self-attention: attend only within a limited window rather than the whole sequence
- vs. Recurrent Neural Network: self-attention is better (parallel computation, easier access to distant positions)
- Graph Neural Network: Graph → Matrix → Vector
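A minimal NumPy sketch of the matrix form above, using the column-vector convention from the notes ($I$ holds the $a^i$ as columns). The function name, the dimension sizes, and the $1/\sqrt{d_k}$ scaling line are illustrative additions, not from the notes.

```python
import numpy as np

def self_attention(I, W_q, W_k, W_v):
    """Single-head self-attention in matrix form (column-vector convention).

    I   : (d_in, n)  input vectors a^1..a^n stacked as columns
    W_q : (d_k, d_in), W_k : (d_k, d_in), W_v : (d_v, d_in)
    Returns O : (d_v, n), one output column b^i per input vector.
    """
    Q = W_q @ I                     # Q = W^q I
    K = W_k @ I                     # K = W^k I
    V = W_v @ I                     # V = W^v I
    A = K.T @ Q                     # dot-product scores, A = K^T Q
    A = A / np.sqrt(K.shape[0])     # scaling used in the Transformer (omitted in the notes)
    A = np.exp(A - A.max(axis=0, keepdims=True))
    A_prime = A / A.sum(axis=0, keepdims=True)   # column-wise softmax -> A'
    return V @ A_prime              # O = V A'

# Illustrative shapes only.
rng = np.random.default_rng(0)
n, d_in, d_k, d_v = 4, 8, 6, 6
I = rng.normal(size=(d_in, n))
W_q, W_k, W_v = (rng.normal(size=(d_k, d_in)),
                 rng.normal(size=(d_k, d_in)),
                 rng.normal(size=(d_v, d_in)))
O = self_attention(I, W_q, W_k, W_v)
print(O.shape)  # (6, 4): one b^i per input vector
```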
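A sketch of multi-head self-attention built on the `self_attention` function above. The per-head weight list and the output projection `W_o` that merges $b^{i,1}, b^{i,2}$ into $b^i$ are assumed names used only for illustration.

```python
def multi_head_self_attention(I, heads, W_o):
    """heads : list of (W_q, W_k, W_v) tuples, one per head
               (separate projections give q^{i,1}, q^{i,2}, ... per position)
    W_o   : (d_out, num_heads * d_v) projection combining b^{i,1}, b^{i,2} -> b^i
    """
    # Run each head independently, then stack the head outputs row-wise.
    B = np.concatenate([self_attention(I, W_q, W_k, W_v)
                        for W_q, W_k, W_v in heads], axis=0)
    return W_o @ B                  # final b^i for every position, as columns

# Two heads, reusing the shapes from the previous sketch.
heads = [(rng.normal(size=(d_k, d_in)),
          rng.normal(size=(d_k, d_in)),
          rng.normal(size=(d_v, d_in))) for _ in range(2)]
W_o = rng.normal(size=(d_in, 2 * d_v))
print(multi_head_self_attention(I, heads, W_o).shape)  # (8, 4)
```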
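The notes only say that a positional vector $e^i$ is added to each $a^i$; the sinusoidal encoding below is the common Transformer choice and is shown purely as one possible realization, not as what the notes prescribe.

```python
import numpy as np

def add_positional_encoding(I):
    """Add a sinusoidal positional vector e^i to each input column a^i.

    I : (d, n) inputs as columns; returns an array of the same shape.
    """
    d, n = I.shape
    pos = np.arange(n)[None, :]                      # position index of each column
    dim = np.arange(d)[:, None]                      # feature dimension index
    angle = pos / (10000 ** (2 * (dim // 2) / d))
    E = np.where(dim % 2 == 0, np.sin(angle), np.cos(angle))  # e^1..e^n as columns
    return I + E                                     # a^i + e^i
```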