(R) Why and when tying embedding (a story)

Hello, fellow Redditors! I want to share a little research journey that took place during my work. Instead of presenting it like a traditional research paper, I’ll try to make it more engaging and fun to read. I hope you find this approach interesting, and I’d love to hear your thoughts and feedback in the comments!

This should be an 11 min. read

Background

Many of you might already be familiar with a technique called Weight Tying (WT), which was first proposed here. In simple terms, WT works by sharing the weights between the input embedding layer and the output embedding layer (also known as the unembedding or pre-softmax layer). This technique is primarily used in the context of language modeling and offers two significant advantages:

  1. It reduces the memory footprint by eliminating one of the two largest parameter matrices in large language models (LLMs).
  2. It often results in better and faster outcomes.
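To make this concrete, here is a minimal PyTorch-style sketch of what weight tying amounts to; the class name, layer choices, and sizes are purely illustrative, not taken from any particular model:

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy language model illustrating weight tying (WT)."""

    def __init__(self, vocab_size: int, d_model: int, tie_weights: bool = True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)              # input embedding
        self.backbone = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for a transformer stack
        self.unembed = nn.Linear(d_model, vocab_size, bias=False)   # output (pre-softmax) projection
        if tie_weights:
            # Reuse a single (vocab_size x d_model) matrix for both roles,
            # eliminating one of the two largest parameter matrices.
            self.unembed.weight = self.embed.weight

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden, _ = self.backbone(self.embed(token_ids))
        return self.unembed(hidden)  # logits over the vocabulary at each position

model = TinyLM(vocab_size=32_000, d_model=512)
assert model.unembed.weight is model.embed.weight  # one shared parameter tensor
```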

While the first benefit is widely accepted, the second is a bit more complex. In fact, some LLMs use WT, while others do not. For example, I believe that Gemma uses WT, whereas LLaMA does not. This raises the question: why is that?

If you are interested, I found particularly insightful perspectives on this topic in this Reddit post.

Origin of the Idea

Earlier this year, I began exploring how to formalize the concept of semantic equivalence in neural networks. Interestingly, we can adapt the classical notion of semantics, commonly used in programming languages (see here). In programming-language theory, two programs are considered semantically equivalent if, regardless of the context in which they are executed, they yield the same resulting context. To borrow from denotational semantics, we can express this as:

p_1 ≡ p_2  iff  for all contexts ρ: ⟦p_1⟧ρ = ⟦p_2⟧ρ

This can be read as: “Program p_1 is semantically equivalent to p_2 if and only if, for all contexts ρ, the evaluation of p_1 with ρ produces the same result as the evaluation of p_2 with ρ.”

But how do we adapt this notion to our scenario? Let’s consider a simple example from Masked Language Modeling (MLM):

The [MASK] of water is half empty/full.

It’s clear that we can use either “empty” or “full” in this sentence without changing the outcome distribution of the masked token. Therefore, we can say that “empty” and “full” are semantically equivalent in this context (“The [MASK] of water is half ___”). Realizing that two tokens are semantically equivalent if they can be swapped without affecting the output distribution, I arrived at this definition:

σ_1 sem.eqv. σ_2  iff  for all contexts ρ: p(· | ρ[σ_1]) = p(· | ρ[σ_2]), where ρ[σ] denotes the context ρ with its blank filled by σ.

Preliminary experiments

With this notion in mind, I wanted to explore how a neural network would encode these semantic equivalences in its weights. I suspected that embeddings for semantically equivalent tokens would naturally become closer to each other during training. This intuition was partly based on my knowledge that BERT embeddings capture similar relationships, where words like “learn,” “learning,” and “learned” are clustered together in the embedding space (see here).

To test this idea, I designed a simple experiment. The goal was to train a Masked Language Model (MLM) on a binary parity problem. Consider a string like 10011D, where there are three 1s, indicating that the string is odd. Along with the binary string, I included a parity label (D for odd and E for even). For instance, other examples could be 11000E and 00100D. Then, I introduced a twist: I randomly swapped the symbol 1 with either A or B with equal probability. So, from a string like 10011D, you might get something like A00BAD. Finally, I masked one of the symbols and trained a model to predict the masked symbol. This process resulted in a dataset like the following:

Sample Label
00A?00E A
00A?00E B
00B?00E A
00B?00E B
0BB?A0D 0
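For concreteness, here is a minimal sketch of this generation procedure; the string length, masking scheme, and function name are illustrative simplifications rather than the exact experimental code:

```python
import random

def make_example(length: int = 6) -> tuple[str, str]:
    """Build one masked parity sample: returns (masked_string, label)."""
    bits = [random.choice("01") for _ in range(length)]
    parity = "D" if bits.count("1") % 2 == 1 else "E"                   # D = odd, E = even
    symbols = [random.choice("AB") if b == "1" else "0" for b in bits]  # each 1 becomes A or B
    masked = random.randrange(length)                                   # mask one string position
    label = symbols[masked]
    symbols[masked] = "?"
    return "".join(symbols) + parity, label

random.seed(0)
dataset = [make_example() for _ in range(10_000)]
print(dataset[:3])  # a few (masked_string, label) pairs shaped like the table above
```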

In this setup, symbols A and B are semantically equivalent by design—swapping A with B does not change the outcome distribution. As expected, the embeddings for A and B converged to be close to each other, while both remained distinct from the embedding of 0. Interestingly, this behavior was also observed in the output embeddings, which neatly aligns with the principles of the Weight Tying technique.
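By “close” I mean something like the following check on the embedding matrices; the attribute names (embed, unembed) and the tok2id vocabulary mapping are placeholders in the spirit of the toy sketch above:

```python
import torch
import torch.nn.functional as F

def pairwise_cosine(weight: torch.Tensor, tok2id: dict, a: str, b: str) -> float:
    """Cosine similarity between the embedding rows of two symbols."""
    va, vb = weight[tok2id[a]], weight[tok2id[b]]
    return F.cosine_similarity(va, vb, dim=0).item()

# Tracking these during training shows the behavior described above, e.g.:
# pairwise_cosine(model.embed.weight, tok2id, "A", "B")    -> drifts toward 1
# pairwise_cosine(model.embed.weight, tok2id, "A", "0")    -> stays clearly lower
# pairwise_cosine(model.unembed.weight, tok2id, "A", "B")  -> also drifts toward 1
```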

Formalizing the behavior

If it were up to me, I would have been content writing a paper on the observation that MLMs learn semantic relationships in both the input and output embedding layers. However, to publish in a reputable conference, a bit of mathematical rigor is usually required (even though math isn’t my strongest suit). So, I attempted to formalize this behavior.

Output Embeddings

When it came to the output embeddings, I couldn’t prove that two semantically equivalent symbols must be close in the output embedding space. However, I did manage to prove that they would be close under the following condition:

for all contexts ρ: p(σ_1 | ρ) = p(σ_2 | ρ)

Interestingly, this result is purely about conditional probability and doesn’t directly involve labels or semantics. However, since it provided some insight, I was reasonably satisfied and decided to move on.

Input Embeddings

For the input embeddings, I was able to prove that two semantically equivalent symbols would indeed be close to each other in the input embedding space. However, the assumptions required for this proof were so restrictive that they would likely never hold in a real-world scenario. So, it ended up being a “junk” theorem, written more for the sake of publication than for practical application. Despite this, the intuition behind it still feels compelling.

The idea is simple: if two symbols are semantically equivalent—meaning they can be swapped without affecting the model’s output—the easiest way to ensure this is by giving them identical embeddings. In this way, the network’s output remains unchanged by definition.

Proving this theorem, however, was a real challenge. I spent several days in the lab working on it, only to have my results scrutinized by colleagues and find errors. It took me about two to three weeks to produce a proof that could withstand their reviews. Despite the struggles, I remember this period as a particularly enjoyable part of my PhD journey.

The First Draft

Armed with these two theorems—one for the output embeddings and one for the input embeddings—I began writing the first draft of my paper. My goal was to convey the idea that LLMs are semantic learners. I started by introducing the concept of semantic equivalence, followed by the theorem related to input embeddings. Next, I presented the output embedding theorem.

However, as I progressed, I realized that I was missing something crucial: experimental evidence to support the output embedding theorem. While the theoretical groundwork was in place, without empirical validation, the argument felt incomplete (at least this is what a reviewer would say).

Back to the experiments (First time)

As I mentioned earlier, I proved the following implication (though I’m omitting some of the hypotheses here):

(for all contexts ρ: p(σ_1 | ρ) = p(σ_2 | ρ))  ⟹  the output embeddings of σ_1 and σ_2 are close

So, I decided to rerun the experiments, this time closely monitoring the output embeddings. As expected, the output embeddings of A and B did indeed converge, becoming close to each other.

This finding was quite fascinating to me. On one hand, we have semantically equivalent symbols that are close in the input embedding space. On the other hand, we have conditionally equivalent symbols—those with the same conditional probability across all contexts (for all ρ: p(σ_1 | ρ) = p(σ_2 | ρ))—that are close in the output space.

Back to the Draft (First Time)

With these new experiments in hand, I revised the draft, introducing the concept of conditional equivalence and the theorem connecting it to output embeddings. This allowed me to clearly articulate how conditional equivalence is reflected in the output embeddings.

As I was writing, it struck me that the Weight Tying (WT) technique is often employed in these scenarios. But this led to a new question: what happens if we use WT with symbols that are semantically equivalent but not conditionally equivalent? On one hand, these symbols should be close in the input embedding space. On the other hand, they should be far apart in the output embedding space because they have different conditional probabilities. However, with WT, the input and output spaces are tied together, making it impossible for two symbols to be simultaneously close and far apart.

This realization sent me back to the experiments to investigate this particular setting.

Back to the Experiments (Second Time)

In our previous experiments, we established that the probability of seeing A is the same as B in any given context. Now, let’s introduce another layer of complexity by replacing the symbol 0 with symbols X and Y, but this time, X will be more probable than Y. This changes our dataset to something like this:

Sample Label
XYA?XXE A
XXA?XYE B
Y?BAXYE X
XXBBX?E Y
XBB?AYD 0
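Under my reading, the only change to the earlier generator is how 0 is rewritten; the 80/20 split below is an arbitrary illustrative choice, since the exact ratio does not matter as long as X is more probable than Y:

```python
import random

def make_skewed_example(length: int = 6, p_x: float = 0.8) -> tuple[str, str]:
    """Like make_example, but 0 is rewritten as X (prob p_x) or Y (prob 1 - p_x)."""
    bits = [random.choice("01") for _ in range(length)]
    parity = "D" if bits.count("1") % 2 == 1 else "E"
    symbols = [
        random.choice("AB") if b == "1"
        else ("X" if random.random() < p_x else "Y")
        for b in bits
    ]
    masked = random.randrange(length)
    label = symbols[masked]
    symbols[masked] = "?"
    return "".join(symbols) + parity, label
```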

When we train an MLM model on this dataset, it’s easy to observe that in the input embedding space, X and Y become close to each other, just like A and B. This is because X and Y are semantically equivalent. However, unlike A and B, X and Y do not get close in the output embedding space because they have different conditional probabilities.

Now, what happens if we tie the embeddings? We observe that A and B converge more quickly, while X and Y remain distanced from each other. Additionally, we noticed that training becomes a bit more unstable—the distance between X and Y fluctuates significantly during training. Overall, the untied model tends to perform better, likely because it avoids the conflicting requirements imposed by weight tying.
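In code terms, this comparison boils down to flipping the tying flag from the toy sketch above and logging the same pairwise cosines on the new data, roughly:

```python
# Vocabulary for the second experiment: X, Y, A, B, ?, D, E -> 7 symbols.
tied   = TinyLM(vocab_size=7, d_model=64, tie_weights=True)
untied = TinyLM(vocab_size=7, d_model=64, tie_weights=False)
# Train both on the X/Y data and track per step:
#   cosine(A, B) in the input and output matrices -> grows in both models (faster when tied)
#   cosine(X, Y) in the output matrix             -> stays low; under WT the X-Y distance also fluctuates
```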

Back to the Draft, Again (Third Time)

I was quite pleased with the results we obtained, so I eagerly incorporated them into the paper. As I was revising, I also discussed the idea that Weight Tying (WT) should be used only when conditionally equivalent symbols are also semantically equivalent. This can be expressed as:

for all σ_1, σ_2: (for all ρ: p(σ_1 | ρ) = p(σ_2 | ρ))  iff  (for all ρ: p(· | ρ[σ_1]) = p(· | ρ[σ_2]))

Or, more concisely:

σ_1 cnd.eqv. σ_2  iff  σ_1 sem.eqv. σ_2

While discussing this property, I realized that my explanation closely mirrored the hypothesis that “similar words have similar contexts”. This concept, which I later discovered is known as the Distributional Hypothesis, made the whole paper click together. I then restructured the work around this central concept.

If we accept the formalization of the Distributional Hypothesis as σ_1 sem.eqv. σ_2 iff. σ_1 cnd.eqv. σ_2, then it follows that WT should be employed only when this hypothesis holds true.

Submission & Reviews

With ICML 2024 being the next major conference on the horizon, we decided to submit our work there. Most of the reviews were helpful and positive, but a recurring critique was the lack of “large-scale” experiments.

I simply do not understand this obsession with experiments that require hundreds of GPUs. I submitted a mostly theoretical paper aiming to explain a well-known phenomenon that is already supported by a vast literature; isn’t a small controlled experiment (included mostly to make the paper self-contained) enough when backed by that literature?

Well, I am a nobody in the research community, and this is my first publication at a “big” conference like ICML, so I complied (kind of: it is still a single-GPU experiment, since I do not have access to more than that), even though these experiments add practically nothing.

In the end, I was thrilled to have the paper accepted as a spotlight poster. It was a huge milestone for me, making me feel like a genuine researcher in the field. I dedicated almost a month to preparing the poster and video presentation, which can be viewed here. The effort was well worth it!

Conference & Presentation

On the day of the conference, I arrived around 9 A.M. with only an hour of sleep from the flight—naturally, I was too excited to rest properly. I made it to the conference a bit late, and during the first tutorial session, I struggled to stay awake despite the coffee. A little before lunch, I headed back to the hotel to catch a few hours of sleep. In the evening, I attended the great tutorial presentation on the Physics of Language Models.

Over the next days, I made a few friends and talked to a lot of people, including Alfredo Canziani, an incredible AI communicator, and Randall Balestriero, an incredible scientist in the field. I also saw Michael Bronstein, but of course he was always surrounded, and I could not bring myself to talk to him.

The last poster session was my time to present, and I was quite nervous, as it was my first time presenting at such a conference. To my surprise, many attendees weren’t familiar with the Distributional Hypothesis—a concept I assumed everyone would know, even though I hadn’t known the term myself. This made me question the effectiveness of my paper’s “marketing” (presentation, title, etc.). Perhaps I should have emphasized the “semantics” aspect more.

One particularly memorable interaction was with a tall guy from DeepMind. He listened to part of my presentation and then pointed out, in a very polite manner, that my theorems might not be correct. I was confident in the proofs, which had been reviewed by a PhD student in mathematics who had won some math competitions. We debated back and forth until I understood his argument, which involved a specific construction of the embedding matrices. He was right, but his construction violated one of the theorems’ assumptions. You have to know that I was not even showing these hypotheses on the poster, because I did not believe anyone would be interested in such details. This guy practically had a deeper understanding of my theorems than I did, without hearing half of the presentation and without seeing the full hypotheses. In conclusion, DeepMind has some seriously impressive people working there.

Conclusions

  • Use Weight Tying Only When the Distributional Hypothesis Holds.
  • DeepMind Has Some Incredible People
  • Do not go to the tutorials with 1hr of sleep (3hr are okay though).
  • Writing the Paper is Crucial: While I previously believed that experiments should come first, I now realize the importance of writing down your ideas early. Putting thoughts into words often clarifies and integrates concepts in ways experiments alone may not. This is perhaps the most valuable lesson I’ve learned from this paper.

Limitations & Future works

If you’re considering using the WT technique, you might wonder: when does the DH actually hold? Does it apply to your specific problem? Does it apply to natural language tasks in general?

Answering these questions can be challenging and may not always be feasible. Consequently, this work may have limited practical utility: it simply shifts the question from “when should I apply WT?” to “when does the DH hold?”. However, I suspect that the DH only partially holds for natural language, which might explain why not all LLMs use WT.

So, my idea is that it might be more useful to train with WT up to a certain point and then untie the embeddings, letting differences emerge between tokens that are conditionally eqv. but not semantically eqv. (or vice versa). Unfortunately, I lack the GPU resources to train a meaningful LLM to test this hypothesis (I am from a very small lab, not even a machine learning lab to be fair). If anyone is interested in exploring this idea or knows of similar work, I would greatly appreciate hearing about it.
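For anyone who wants to try this, the untying step itself is cheap; here is a sketch of what I have in mind, assuming a model shaped like the toy sketch from earlier:

```python
import torch.nn as nn

def untie(model) -> None:
    """Give the output projection its own trainable copy of the shared matrix."""
    model.unembed.weight = nn.Parameter(model.embed.weight.detach().clone())
    # If an optimizer already exists, remember to register the new parameter with it.

# Hypothetical schedule (UNTIE_AT_STEP is a tunable choice, not something validated here):
# for step, batch in enumerate(loader):
#     if step == UNTIE_AT_STEP:
#         untie(model)
#     ... usual forward / backward / update ...
```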

submitted by /u/f14-bertolotti