This vignette describes how to implement the attention mechanism - which forms the basis of transformers - in the R language.
It follows the same steps as the Simple Self-Attention from Scratch vignette but,
rather than relying on the helper functions defined in the attention package,
implements everything in base R.
The code is translated from the Python original by Stefania Cristina (University of Malta) in her post The Attention Mechanism from Scratch.
We begin by generating encoder representations of four different words.
# encoder representations of four different words
word_1 = matrix(c(1,0,0), nrow=1)
word_2 = matrix(c(0,1,0), nrow=1)
word_3 = matrix(c(1,1,0), nrow=1)
word_4 = matrix(c(0,0,1), nrow=1)
Next, we stack the word embeddings into a single array (in this case
a matrix), which we call words.
# stacking the word embeddings into a single array
words = rbind(word_1,
              word_2,
              word_3,
              word_4)
Let’s see what this looks like.
print(words)
#>      [,1] [,2] [,3]
#> [1,]    1    0    0
#> [2,]    0    1    0
#> [3,]    1    1    0
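#> [4,]    0    0    1
Next, we initialize the weight matrices W_Q, W_K, and W_V with random integers.
Because floor() is applied to uniform draws between 0 and 3, the entries are 0, 1, or 2.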
# initializing the weight matrices (with random values)
set.seed(0)
W_Q = matrix(floor(runif(9, min=0, max=3)),nrow=3,ncol=3)
W_K = matrix(floor(runif(9, min=0, max=3)),nrow=3,ncol=3)
W_V = matrix(floor(runif(9, min=0, max=3)),nrow=3,ncol=3)
Next, we generate the Queries (Q), Keys
(K), and Values (V). The %*%
operator performs matrix multiplication. You can view the R help
page using help('%*%') (or see the online An
Introduction to R).
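Because the word embeddings contain only zeros and ones, the product with a weight matrix has a simple reading: a one-hot row vector selects one row of the matrix, and a vector with two ones returns the sum of two rows. The sketch below illustrates this with a made-up matrix; W_demo, word_a, and word_b are hypothetical names used only here and are not part of the original walkthrough.

```r
# Hypothetical 3x3 matrix for illustration only (filled column by column)
W_demo = matrix(1:9, nrow=3, ncol=3)

# a one-hot row vector picks out the matching row of W_demo
word_a = matrix(c(1,0,0), nrow=1)
picked = word_a %*% W_demo

# a row vector with two ones returns the sum of the two matching rows
word_b = matrix(c(1,1,0), nrow=1)
summed = word_b %*% W_demo

print(picked)
print(summed)
```

Note that %*% is not the same as *, which multiplies two arrays element by element.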
# generating the queries, keys and values
Q = words %*% W_Q
K = words %*% W_K
V = words %*% W_V
Following this, we score the Queries (Q) against the Key
(K) vectors (which are transposed for the multiplication
using t(); see help('t') for more info).
# scoring the query vectors against all key vectors
scores = Q %*% t(K)
print(scores)
#>      [,1] [,2] [,3] [,4]
#> [1,]    6    4   10    5
#> [2,]    4    6   10    6
#> [3,]   10   10   20   11
#> [4,]    3    1    4    2
We now calculate the maximum value for each row and preserve the
structure (i.e. the 4 rows, now with only one column, which
contains the maximum value for the corresponding row).
# calculate the max for each row of the scores matrix
maxs = as.matrix(apply(scores, MARGIN=1, FUN=max))
print(maxs)
#>      [,1]
#> [1,]   10
#> [2,]   10
#> [3,]   20
#> [4,]    4
As you can see, the value for each row in maxs is the
maximum value of the corresponding row in scores.
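The row maxima are used to shift the scores before exponentiating. The shift does not change the softmax result, since the common factor cancels between numerator and denominator, but it keeps exp() from overflowing when scores are large. The following is a minimal sketch of that effect; softmax_row is a hypothetical helper, the large values are made up for illustration, and the scaling by the square root of the key dimension is left out to keep the example short.

```r
# Hypothetical helper (not part of the walkthrough): softmax of one score
# vector, optionally shifted by its maximum before exponentiating.
softmax_row = function(x, shift = TRUE) {
  if (shift) x = x - max(x)
  exp(x) / sum(exp(x))
}

small = c(6, 4, 10, 5)   # moderate scores
large = small + 1000     # made-up scores: same differences, huge magnitude

# for moderate scores, both variants agree
same_small = isTRUE(all.equal(softmax_row(small, shift = TRUE),
                              softmax_row(small, shift = FALSE)))

# for large scores, exp() overflows to Inf without the shift, and Inf/Inf is NaN
unshifted_large = softmax_row(large, shift = FALSE)
shifted_large = softmax_row(large, shift = TRUE)

print(same_small)
print(unshifted_large)
print(shifted_large)
```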
We now generate the weights matrix.
# initialize weights matrix
weights = matrix(0, nrow=4, ncol=4)
# computing the weights by a softmax operation
for (i in 1:dim(scores)[1]) {
  weights[i,] = exp((scores[i,]-maxs[i,]) / ncol(K) ^ 0.5)/sum(exp((scores[i,]-maxs[i,]) / ncol(K) ^ 0.5))
}
Let’s have a look at the weights matrix.
print(weights)
#>             [,1]        [,2]      [,3]        [,4]
#> [1,] 0.083717538 0.026383741 0.8429010 0.046997679
#> [2,] 0.025449248 0.080752324 0.8130461 0.080752324
#> [3,] 0.003072728 0.003072728 0.9883811 0.005473487
#> [4,] 0.273384789 0.086157735 0.4869837 0.153473823
Finally, we compute the attention as a weighted sum of
the value vectors (which are combined in the matrix V).
# computing the attention by a weighted sum of the value vectors
attention = weights %*% V
Now we can view the results using:
print(attention)
#>          [,1]     [,2]        [,3]
#> [1,] 2.816517 1.900235 0.046997679
#> [2,] 2.732294 1.757743 0.080752324
#> [3,] 2.985308 1.988381 0.005473487
#> [4,] 2.400826 1.674211 0.153473823
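As a recap, the individual steps can be collected into a single function. The version below is only a sketch: the name self_attention_sketch and its signature are assumptions made for illustration, and the explicit loop is replaced by sweep() and rowSums(), which should be equivalent to the row-by-row softmax used above.

```r
# Hypothetical wrapper (name and signature are assumptions for illustration):
# scaled dot-product self-attention for a matrix of row-wise word embeddings.
self_attention_sketch = function(words, W_Q, W_K, W_V) {
  # queries, keys and values
  Q = words %*% W_Q
  K = words %*% W_K
  V = words %*% W_V

  # score queries against keys and scale by the square root of the key dimension
  scaled = (Q %*% t(K)) / sqrt(ncol(K))

  # subtract each row's maximum before exponentiating
  shifted = sweep(scaled, MARGIN = 1, STATS = apply(scaled, 1, max), FUN = "-")

  # row-wise softmax: every row of weights sums to 1
  expd = exp(shifted)
  weights = expd / rowSums(expd)

  list(weights = weights, attention = weights %*% V)
}

# same inputs as in the walkthrough
words = rbind(c(1,0,0), c(0,1,0), c(1,1,0), c(0,0,1))
set.seed(0)
W_Q = matrix(floor(runif(9, min=0, max=3)), nrow=3, ncol=3)
W_K = matrix(floor(runif(9, min=0, max=3)), nrow=3, ncol=3)
W_V = matrix(floor(runif(9, min=0, max=3)), nrow=3, ncol=3)

result = self_attention_sketch(words, W_Q, W_K, W_V)
print(rowSums(result$weights))
print(result$attention)
```

A quick sanity check on any softmax output is that each row of the weights matrix sums to 1, which is what the rowSums() call prints.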