Transformers learn in-context through gradient descent (R)

Can someone help me understand the reasoning in the paper *Transformers learn in-context by gradient descent*? The authors first assume a “reference” linear model with some weight \( W \), and then show that the loss of this model after one gradient descent step equals the loss of the *original* model evaluated on the “transformed data.” Then, in the main result (Proposition 1), the authors manually construct the weights \( K \), \( Q \), and \( V \) such that a forward pass of a single-head self-attention layer maps every token to this “transformed data.”
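To make that first step concrete, here is the identity as I reconstruct it (my notation, so it may differ slightly from the paper's): with

\[
L(W) = \frac{1}{2N} \sum_{i=1}^{N} \lVert W x_i - y_i \rVert^2, \qquad
\Delta W = -\eta \nabla_W L(W) = -\frac{\eta}{N} \sum_{i=1}^{N} (W x_i - y_i)\, x_i^\top,
\]

and since \( (W + \Delta W)\, x_i - y_i = W x_i - (y_i - \Delta W x_i) \), the updated model \( W + \Delta W \) on the original data \( (x_i, y_i) \) has exactly the same loss as the un-updated model \( W \) on the transformed data \( (x_i,\, y_i - \Delta W x_i) \).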

My question is: how does this construction prove that transformers can perform gradient descent during in-context learning (ICL)? Is the output of the forward pass (i.e., the “transformed data”) considered a new prediction? I would have expected the new prediction to be the one made by the *updated* weight. I can't follow the logic here.
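To check my own reading, here is a minimal NumPy sketch of the Proposition 1 construction as I understand it (my simplifications: scalar targets, reference weight \( W_0 = 0 \), linear attention without softmax, and the scalar \( \eta/N \) standing in for the paper's projection matrix \( P \); all variable names are mine, not the paper's). The point is that one forward pass writes \( y_j - \Delta W x_j \) into every token's \( y \)-channel, so the query token (which starts with \( y = 0 \)) ends up holding minus the updated model's prediction:

```python
# Minimal sketch (my reading of Prop. 1): one forward pass of a linear
# self-attention layer with hand-constructed weights reproduces one step
# of gradient descent on the in-context least-squares loss.
# My assumptions: scalar targets, reference weight W_0 = 0, no softmax,
# and a scalar eta/N in place of the paper's projection matrix P.
import numpy as np

rng = np.random.default_rng(0)
d, N, eta = 3, 8, 0.1                          # input dim, context size, LR

X = rng.normal(size=(N, d))                    # in-context inputs x_i
y = X @ rng.normal(size=d)                     # their targets y_i
x_q = rng.normal(size=d)                       # query input

# Tokens e_j = (x_j, y_j); the query token carries y = 0.
E = np.vstack([np.hstack([X, y[:, None]]),
               np.hstack([x_q, 0.0])])         # shape (N+1, d+1)

# One GD step on L(W) = 1/(2N) * sum_i (W.x_i - y_i)^2, taken at W_0 = 0:
# dW = -eta * grad L = (eta/N) * sum_i y_i x_i
dW = eta / N * (X.T @ y)

# Hand-constructed attention weights: keys/queries read out the x-part of
# a token, values read out (W_0 x_i - y_i) = -y_i.
W_K = W_Q = np.hstack([np.eye(d), np.zeros((d, 1))])
W_V = np.hstack([np.zeros((1, d)), -np.ones((1, 1))])

# Linear (softmax-free) self-attention: keys/values come from the N
# context tokens only, queries from every token, scaled by eta/N.
K, Q, V = E[:N] @ W_K.T, E @ W_Q.T, E[:N] @ W_V.T
update = eta / N * (Q @ K.T) @ V               # added to each y-channel

y_new = E[:, -1] + update[:, 0]                # transformed targets

# Every token now holds y_j - dW.x_j; in particular the query token holds
# -(dW . x_q), i.e. minus the prediction of the updated model W_0 + dW.
assert np.allclose(y_new[:N], y - X @ dW)
assert np.allclose(y_new[-1], -(dW @ x_q))
print("forward pass == one GD step:", True)
```

If this sketch is right, the “transformed data” is not itself the prediction; the prediction of the updated weight is read off (up to a sign) from the query token's transformed \( y \)-entry. I'd appreciate confirmation that this is the intended reading.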

https://preview.redd.it/cztva19y05kd1.png?width=1046&format=png&auto=webp&s=0196944994516f480670ba9d29be91f8f55fc6f9

https://preview.redd.it/oihuv48405kd1.png?width=1728&format=png&auto=webp&s=8cfc35bf8aa433d53f9d7b5bc5faef3c3e4fba8a

submitted by /u/mziycfh