Hi! Today I’m writing about a topic I’ve been discussing with friends a lot. Since ChatGPT came out, I feel like a lot of people intuitively understand some of its limitations but don’t necessarily know why or how they came about. Here, I’d like to start with how we began working with text in the first place, move on to the motivation behind the GPT series, and then cover some of the fundamental problems that still plague ChatGPT because of that history.
Now, first of all, how do you work with text?
Bag of Words
The first approach people came up with was simple: split a text like “the dog is on the table” on spaces to get “the”, “dog”, “is”, “on”, and so on, then tell our model that “the” occurred twice, “dog” occurred once, “on” occurred once, etc. With one slot per word in our vocabulary, we can represent this sentence as a bag of words, for example [2, 1, 1, 1, 1] for the vocabulary (the, dog, is, on, table).
So, we can represent any sentence, any book, any text as a fixed-length list of counts, one per vocabulary word. AIs love this. In fact, most classic models can’t work without fixed input shapes. On the other hand, if we just replaced each word in a text with a number, like “dog” -> 2 and “is” -> 0, and made a sequence like 2, 0, etc., the size of the input would change with the length of the text. These models do not know how to handle arbitrary sizes by themselves, so they can’t work with it.
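As a minimal sketch of the difference, assuming a tiny hand-picked vocabulary (the names `VOCAB`, `bag_of_words`, and `to_ids` are made up for illustration), the snippet below shows that the count vector keeps the same length for any text while the ID sequence grows with it.

```python
# Hypothetical toy vocabulary; a real one would hold many thousands of words.
VOCAB = ["the", "dog", "is", "on", "table", "cat"]

def bag_of_words(text):
    """Return one count per vocabulary word, so the length is always len(VOCAB)."""
    words = text.lower().split()
    return [words.count(v) for v in VOCAB]

def to_ids(text):
    """Replace each known word with its vocabulary index; the length follows the text."""
    index = {word: i for i, word in enumerate(VOCAB)}
    return [index[w] for w in text.lower().split() if w in index]

short = "the dog is on the table"
longer = "the dog is on the table the cat is on the dog"

print(bag_of_words(short))    # [2, 1, 1, 1, 1, 0]
print(bag_of_words(longer))   # [4, 2, 2, 2, 1, 1]
print(len(to_ids(short)), len(to_ids(longer)))  # 6 12
```

Both bag-of-words vectors have six entries, while the ID sequences have six and twelve.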
Now, what’s the problem with this approach? The main problem is that it completely ignores word order. Let’s say you have “A can of worms”. In this form, the model can’t distinguish it from “A worms of can”. So, is there a way to keep the ordering of our text and possibly work with arbitrarily long text?
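To make the ordering problem concrete, here is a small sketch using Python’s standard `collections.Counter` as the bag (the helper name `bag` is just for illustration). The two phrases are different strings but produce identical bags.

```python
from collections import Counter

def bag(text):
    """Count words after lowercasing and splitting on whitespace."""
    return Counter(text.lower().split())

original = "A can of worms"
scrambled = "A worms of can"

print(original == scrambled)            # False: the strings differ
print(bag(original) == bag(scrambled))  # True: the bags are identical
```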
Recurrent Neural Networks (RNNs)
Now, while there were some things in between, the next big advancement was Recurrent Neural Networks. The main ideas behind them were:
- Neural Networks are pretty awesome
- Can we use neural networks to remember the ordering of text?
What we get is something like the one below!
For each new word, the network keeps a memory of “what has occurred before now”, which is the blue arrow pointing right, and then takes in the next word. So in a way, you can think of each block as having an understanding of all the text up to that block and being able to work with it.
This has been used for a lot of tasks like translation, text classification, and so on, but there was one big limitation for recurrent neural networks: they forget.
The blue arrow that passes information from block to block has a fixed size, since the network can’t work with different input sizes. So, what happens when we try to squeeze a long text into a fixed-size memory? The AI forgets the text near the beginning and mainly remembers the text close to our current block. Some techniques, such as Long Short-Term Memory (LSTM) networks, helped, but there was no huge breakthrough until attention came along.
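Here is a rough sketch of that fixed-size memory, assuming a plain tanh RNN cell with random, untrained weights. The sizes `HIDDEN_SIZE` and `EMBED_SIZE` and the random “word vectors” are toy assumptions, not values from any real model. The point is only that the memory has the same size after 5 words as after 500.

```python
import math
import random

HIDDEN_SIZE = 4  # size of the memory passed between blocks (assumed toy value)
EMBED_SIZE = 3   # size of each word vector (assumed toy value)

random.seed(0)

def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W_h = random_matrix(HIDDEN_SIZE, HIDDEN_SIZE)  # memory -> memory weights (untrained)
W_x = random_matrix(HIDDEN_SIZE, EMBED_SIZE)   # word -> memory weights (untrained)

def rnn_step(hidden, word_vec):
    """One block: mix the old memory with the new word, then squash with tanh."""
    new_hidden = []
    for i in range(HIDDEN_SIZE):
        total = sum(W_h[i][j] * hidden[j] for j in range(HIDDEN_SIZE))
        total += sum(W_x[i][k] * word_vec[k] for k in range(EMBED_SIZE))
        new_hidden.append(math.tanh(total))
    return new_hidden

def run_rnn(word_vecs):
    """Feed the words in order, carrying the memory from block to block."""
    hidden = [0.0] * HIDDEN_SIZE
    for vec in word_vecs:
        hidden = rnn_step(hidden, vec)
    return hidden

def random_text(num_words):
    return [[random.uniform(-1, 1) for _ in range(EMBED_SIZE)] for _ in range(num_words)]

short_memory = run_rnn(random_text(5))
long_memory = run_rnn(random_text(500))
print(len(short_memory), len(long_memory))  # 4 4
```

Everything the network knows about those 500 words has to fit into four numbers, which is why early words get overwritten.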
GPT-2/Attention Is All You Need
“Attention Is All You Need” by Google and the subsequent GPT-2 by OpenAI were what turned me into an AI hobbyist, and in my opinion they were probably some of the biggest news ever to happen to AI. But the idea is very simple. If we are having trouble remembering our first word when we are working with text near the end, why don’t we make it so that we always “see” the first word? And that is exactly what attention is! For every single word in our sentence, we can compute how important every other word in the sentence is (how much “attention” we pay to it). You might notice that this gets rid of the forgetting issue, at least within the window of text the model is given, since the model always “sees” all of that text.
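A minimal sketch of the attention computation is below. It assumes scaled dot-product attention on hand-made 2-dimensional word vectors and leaves out the learned projections, multiple heads, and the causal mask that a real GPT-2 uses, so treat it as an illustration of the weighting idea only.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # How much this word attends to every word in the sentence.
        weights = softmax([dot(q, k) / scale for k in keys])
        # Weighted average of the value vectors.
        out = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical vectors for a three-word sentence.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(vectors, vectors, vectors)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row is one word’s attention over all three words. No row ever drops a word entirely, which is the “always sees everything” property described above.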
This mechanism is pretty much what was used to make GPT-2. The main idea is that, given all the text that came before, we predict the next word. And just like that, we have a text generator that doesn’t forget! This approach was so good at generating text at the time that OpenAI, the company whose original goal was to open-source all their research, felt the need to hold back a model release for the first time. It was a shift in how people thought of AI, and it also led organizations like Hugging Face, LAION, EleutherAI, Harmonai, and Stability AI to try to fill the gaps that OpenAI left, which is pretty exciting in my opinion, especially as an open-source contributor.
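The “predict the next word, then feed it back in” loop can be sketched as follows. To keep it runnable without any model weights, a simple bigram counter stands in for the transformer, so `predict_next` here only looks at the last word, whereas GPT-2 attends to the whole context. The corpus and function names are made up for illustration.

```python
from collections import Counter, defaultdict

CORPUS = "the dog is on the table . the dog is on the floor ."  # made-up training text

def train_bigram(text):
    """Count which word follows which in the training text."""
    follows = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows

def predict_next(model, context):
    """Stand-in for GPT-2: pick the most frequent follower of the last word."""
    candidates = model.get(context[-1])
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

def generate(model, prompt, max_new_words=5):
    words = prompt.split()
    for _ in range(max_new_words):
        nxt = predict_next(model, words)
        if nxt is None:
            break
        words.append(nxt)  # the prediction becomes part of the next input
    return " ".join(words)

model = train_bigram(CORPUS)
print(generate(model, "the dog", max_new_words=3))  # the dog is on the
```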
For more detailed math and intuition, I did write a bit on it in my Understanding GPT-2 series here. There are also a lot of awesome resources on the inner workings out there, so feel free to look around.
Now, what are the issues with this approach? One limitation of GPT-2/attention is that, while it can keep track of the contribution of each word to every other word, just holding all those scores takes sequence length times sequence length entries! For GPT-2’s 1,024-token window, that is already about a million scores per attention matrix.
This means that if we double the amount of text our transformer can see at once, the attention step needs about 4 times the compute and memory. And compute is not cheap.
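The arithmetic is easy to check. The sketch below counts the entries of a single full attention matrix (one head in one layer, which is a simplification) for a few context lengths; the function name is just for illustration.

```python
def attention_entries(seq_len):
    """Number of word-to-word scores in one full attention matrix."""
    return seq_len * seq_len

for n in (1024, 2048, 4096):
    print(n, attention_entries(n))
# 1024 1048576
# 2048 4194304
# 4096 16777216

print(attention_entries(2048) / attention_entries(1024))  # 4.0
```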
However, while the above is an issue, there is a larger underlying problem in text generation.
The Fundamental Problem of Text Generation
The fundamental problem of GPT-2, GPT-3, and pretty much any text generation model is the data. Not surprisingly, these transformer models take a lot of data to train. Open-source models like GPT-J and GPT-NeoX that try to compete with them typically use datasets like the Pile, which is an 825 GiB dataset. Now, when we think about the average text we post online, is it high quality? Does it rival classics like War and Peace or any of Stephen King’s novels? Even if we don’t go to that level, when we feed data into our model to train, our model will at the very best be about as good a text generator as the average text in that data. We might get lucky and get some good text from time to time, but the models we train are limited by our data. We could train on high-quality data only, but there isn’t enough of it to train a model from scratch.
But you might say, hold on a minute. ChatGPT’s responses seem to be above average. Same with GPT-4’s responses. In fact, it may still have a forgetting problem, but it seems to be generally helpful. The question is, how did they do that?
ChatGPT/InstructGPT: Reinforcement Learning from Human Feedback (RLHF)
Now, this idea was popularized for language models by OpenAI’s InstructGPT, but it exploded in popularity right around when ChatGPT came along. The main idea is just this:
- Let’s have humans score the outputs of our ChatGPT model for a given input text
- Let’s train a second model to predict the human score for any output text from ChatGPT. We now have no humans in the loop!
- Let’s train ChatGPT to generate text that gets as high a score as possible from the scoring model in the second step.
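The three steps above can be sketched with a deliberately tiny toy. Everything here is an assumption made for illustration: the human scores are invented, the “reward model” is just an average score per word rather than a neural network, the “policy” is a probability distribution over three fixed candidate replies rather than a language model, and the update is a plain policy gradient, whereas InstructGPT used PPO.

```python
import math

# Step 1: made-up human scores (1 = bad, 5 = good) for a few model outputs.
HUMAN_SCORES = {
    "sure here is a clear answer": 5.0,
    "here is a helpful answer": 4.0,
    "no idea": 1.0,
    "figure it out yourself": 1.0,
}

def train_reward_model(labels):
    """Step 2: learn an average score per word as a stand-in for a neural reward model."""
    totals, counts = {}, {}
    for text, score in labels.items():
        for word in text.split():
            totals[word] = totals.get(word, 0.0) + score
            counts[word] = counts.get(word, 0) + 1
    word_scores = {w: totals[w] / counts[w] for w in totals}
    default = sum(labels.values()) / len(labels)  # score for unseen words

    def reward(text):
        words = text.split()
        return sum(word_scores.get(w, default) for w in words) / len(words)

    return reward

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def tune_policy(candidates, reward, steps=200, lr=0.5):
    """Step 3: shift probability toward outputs the reward model scores highly."""
    logits = [0.0] * len(candidates)
    rewards = [reward(c) for c in candidates]
    for _ in range(steps):
        probs = softmax(logits)
        baseline = sum(p * r for p, r in zip(probs, rewards))
        # Exact gradient of the expected reward with respect to each logit.
        logits = [l + lr * p * (r - baseline)
                  for l, p, r in zip(logits, probs, rewards)]
    return softmax(logits)

reward = train_reward_model(HUMAN_SCORES)
# No human ever scored these candidates; only the reward model does.
candidates = ["here is a clear answer", "no idea yourself", "a helpful idea"]
probs = tune_policy(candidates, reward)
best = candidates[probs.index(max(probs))]
print(best)  # here is a clear answer
```

Notice that the final behavior depends entirely on the numbers in `HUMAN_SCORES`. Change who does the scoring and you change what the tuned model prefers.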
So now, we can make above-average text! Or can we?
Problem With ChatGPT/RLHF
Now, what’s the catch in the above approach? You might have caught it. It’s the first step, where we have humans label which generated text is better than another. Depending on the culture and biases of the people who do the labeling, we can see interesting differences between models. In ChatGPT, we notice that the model is very much against violence or saying anything political. In a model like Open Assistant, which is an open-source alternative to ChatGPT, they gamified the process of RLHF, which led to more liberal responses for “antagonistic” prompts. Now, countries like China are making their own variants of ChatGPT that align with their values.
What’s the commonality here? RLHF is not a technique to increase the performance of the model. It is more of a personalization approach that reflects the values of the community/company that made it!
For this problem, there is no clear solution except improving the quality of the data. And in fact, it may not even be a big issue, but I do feel that it’s important to understand.
What about memory?
Now, the other elephant in the room: what about the memory issue with transformers/attention? Is there no solution to that? OpenAI may have one, as their context length can go as high as 32,768 tokens (32 times GPT-2!). But their technique is currently a secret. However, the open-source community has some guesses.
In AutoGPT here, people came up with an interesting idea: why don’t we keep a database that stores all our conversations? Then, when we ask our ChatGPT model a question, we can attach the relevant bits of information from the database to the prompt. Here, the lookup can be thought of as a Google search over our own history.
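A bare-bones sketch of this retrieve-then-attach idea is below. It is not AutoGPT’s actual code: the `ConversationMemory` class is hypothetical, and ranking by shared words is a simple stand-in for the embedding-based similarity search such systems typically use.

```python
def tokenize(text):
    """Lowercase, drop basic punctuation, and return the set of words."""
    return set(text.lower().replace("?", "").replace(".", "").split())

class ConversationMemory:
    """Hypothetical store that ranks past messages by word overlap with the question."""

    def __init__(self):
        self.messages = []

    def add(self, message):
        self.messages.append(message)

    def search(self, question, top_k=2):
        q = tokenize(question)
        scored = [(len(q & tokenize(m)), m) for m in self.messages]
        scored = [item for item in scored if item[0] > 0]  # drop unrelated messages
        scored.sort(key=lambda item: item[0], reverse=True)
        return [m for _, m in scored[:top_k]]

def build_prompt(memory, question):
    """Attach the retrieved messages in front of the question."""
    context = "\n".join(memory.search(question))
    return f"Relevant notes:\n{context}\n\nQuestion: {question}"

memory = ConversationMemory()
memory.add("My dog is named Biscuit.")
memory.add("I prefer tea over coffee.")
memory.add("The dog goes to the vet on Friday.")

question = "When does my dog go to the vet?"
print(build_prompt(memory, question))
```

Only the two dog-related messages end up in the prompt, so the model sees a short, relevant slice of history instead of the whole conversation.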
Now, the problem here is that while retrieving this information is pretty nice, we don’t know whether what we retrieved is the best information for answering our question.
The next idea came from a paper, “Scaling Transformer to 1M tokens and beyond with RMT”, which created a lot of buzz. The basic idea was: can we combine RNNs with attention? They demonstrated that you can have attention blocks process one segment of text at a time and pass a memory along an RNN-style chain without forgetting too much, which was pretty impressive.
I do need to look into it more, but the obvious catch here is that we are compressing 1 million tokens’ worth of information into the RNN’s fixed-size blue arrow. So there will always be some information that is lost.
There is a very good video about this here.
In my opinion, we can hook up our model to some kind of database where it can choose to store information, similar to how Facebook demonstrated that AI can use “tools” like Google. I think this memory problem can be solved by having the AI choose what information it wants to store and retrieve.
Also, there are techniques like Longformer, which ask: do we really need the full sequence length times sequence length matrix?
I do suspect that GPT-4 uses something like the Longformer approach, where they compute only some entries of the matrix instead of the full sequence length squared.
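To see how much skipping entries can save, here is a small sketch that counts the scores needed when each word only attends to neighbors within a fixed window. It covers just the sliding-window part of the Longformer idea and leaves out its global-attention tokens; the window size of 256 is an arbitrary assumption.

```python
def full_attention_entries(seq_len):
    """Every word attends to every word."""
    return seq_len * seq_len

def sliding_window_entries(seq_len, window):
    """Count pairs (i, j) where word j is within `window` positions of word i."""
    total = 0
    for i in range(seq_len):
        lo = max(0, i - window)
        hi = min(seq_len - 1, i + window)
        total += hi - lo + 1
    return total

n, w = 4096, 256  # assumed toy values
full = full_attention_entries(n)
sparse = sliding_window_entries(n, w)
print(full, sparse, round(full / sparse, 1))
```

With a fixed window, doubling the sequence length roughly doubles the number of entries instead of quadrupling it.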
But overall, it’s a pretty exciting field!
While ChatGPT may be pretty awesome, it has 2 fundamental problems which, if resolved, might be the start of AGI!