Understanding the Current State of AI for Stories Generation

Isamu Isozaki
19 min read · May 15, 2024


The goal of this article is to do a literature review of AI story generation for a presentation in the Huggingface reading group.

One thing I am pretty curious about is whether computers can generate good stories. I don’t have a clear research definition of a good story, but I am a fan of reading great books by amazing authors such as Neil Gaiman, Madeline Miller, Higashino Keigo, and Miyabe Miyuki, and I am very interested in whether it’s possible to have a never-ending supply of stories that can influence us.

Guesses

Before moving on, I have 3 guesses for ways to make, or at least make progress toward making, AI that writes good stories.

  1. Do a simulation. The closest thing I know to AI generating decent stories is the game Dwarf Fortress, and I think this video gives a good explanation of it. The basic idea is that each NPC is a character, their interactions become history, and with some bizarreness, that history becomes a story. It is popular to the extent that there are YouTube channels dedicated to recording these stories from Dwarf Fortress. The main con of this idea is that the properties of the simulation restrict the types of stories we can generate.

2. Have an AI determine whether a story is good for a particular user. I do like reading stories online, and I think there is currently a lot of demand for figuring out which stories readers will like before they read them. I haven’t seen any site recommend stories based on the contents of the story. For example, on sites like 小説家になろう (“Let’s Become a Novelist”), recommendations usually rely on tags, the author’s popularity, and bookmarks, which I think is a gap in both the research landscape and the story industry as a whole.

This is because authors, much like YouTubers, can maximize their views for the same amount of work by figuring out the optimal titles that get viewers to read the story (which is why light novels have long titles) or the optimal publishing time. Another common flaw, which also appears in published work, is that the initial premise and idea are good but the later chapters stagnate because the full plot was never developed, and this is hard to evaluate up front. I also believe a lot of great stories are written with no one to read them, which makes them impossible to evaluate. The main con of this approach is that this “good story” metric is ultimately subjective.

3. Take inspiration from history. I remember, while reading “A World Undone” by G.J. Meyer, feeling deeply invested in the story of how WW1 started and how it ended. I believe many of our great stories take inspiration from historical or real events, including the author’s personal experience. Take, for example, “Flowers for Algernon”:

The author used his personal experience of working in research and teaching mentally disabled people for the characters, and used 1960s America as the setting.

So now, let’s get to the actual literature review!

Literature Review

Now, for this literature review, I’m following this GitHub repository, as I couldn’t find a survey paper covering the work after GPT-4. I am planning to start off with “GROVE: A Retrieval-augmented Complex Story Generation Framework with A Forest of Evidence”.

GROVE: A Retrieval-augmented Complex Story Generation Framework with A Forest of Evidence

First of all, this paper starts with past approaches to story generation.

Past Approaches

One approach is story-ending generation.

For example, in “Story Ending Generation with Incremental Encoding and Commonsense Knowledge”, the task is to generate the final ending sentence given the sentences leading up to it. The idea was that this would require the model to have an understanding of the story.

What the authors seem to do is, for a given word in a sentence (in this case, when encoding “Halloween”), do something like RAG but with knowledge graphs to retrieve a set of connected words for “common sense”. A weighted sum of those word representations then gives a graph embedding, which, combined with the knowledge from the previous sentences, gives the model enough information to generate the final sentence! I am honestly a bit impressed that the authors made this work, since this was before even GPT-2 was out.
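As a rough sketch of what I mean by a weighted sum of neighbor representations (this is my own simplification, not the paper’s exact formulation, and the vectors below are random stand-ins for pretrained embeddings):

```python
import numpy as np

def graph_embedding(word_vec, neighbor_vecs):
    """Attention-weighted sum of knowledge-graph neighbor embeddings.

    word_vec: (d,) embedding of the word being encoded (e.g. "Halloween").
    neighbor_vecs: (n, d) embeddings of words connected to it in the graph.
    """
    scores = neighbor_vecs @ word_vec              # relevance of each neighbor
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over neighbors
    return weights @ neighbor_vecs                 # weighted sum = graph embedding

# Toy usage: random vectors standing in for embeddings of "Halloween" and its
# knowledge-graph neighbors such as "costume" or "candy".
rng = np.random.default_rng(0)
halloween = rng.normal(size=64)
neighbors = rng.normal(size=(5, 64))
print(graph_embedding(halloween, neighbors).shape)  # (64,)
```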

Another more interesting approach was “Controllable Neural Story Plot Generation via Reward Shaping” where the basic idea is that the authors wanted the final sentence to contain a certain verb that indicates resolution. The example the authors used was “Suppose our goal is to generate a plot in which one character admires another. Events that contain the verb meet are more likely to appear nearby events that contain admire, whereas events that contain the verb leave are likely to appear farther away.”

And so the authors decided on a reward function of

r(v) = l_s / d_s(v, g)

where l_s is the length of the story s and d_s(v, g) is the distance between verb v and the goal verb g in the story. This value is highest when the distance between our verb and our goal verb is minimal.
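In code, that reward looks roughly like this (my reconstruction from the description above, not the paper’s exact implementation):

```python
def verb_reward(l_s: int, d_s: float) -> float:
    """My reading of the reward above: r(v) = l_s / d_s(v, g).

    l_s: length of the story s (number of events).
    d_s: distance (in events) between verb v and the goal verb g in story s.
    Verbs that occur closer to the goal verb get a larger reward.
    """
    return l_s / max(d_s, 1.0)  # guard against a zero distance

# "meet" one event away from "admire" scores higher than "leave" three events away.
print(verb_reward(l_s=5, d_s=1))  # 5.0
print(verb_reward(l_s=5, d_s=3))  # ~1.67
```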

Now, the paper mentions another class of methods: having an AI fill in the gaps of an outline. One paper that was particularly interesting was “Psychology-guided Controllable Story Generation”, which turns the protagonist and their needs and emotions into stories.

Another is “PLOTMACHINES: Outline-Conditioned Generation with Dynamic Plot State Tracking”, where the authors convert interactions between plot points into a story like so

There is also another line of work on injecting common sense into stories, but at least judging from “Enhancing Pre-Trained Language Representations with Rich Knowledge for Machine Reading Comprehension”, it seems pretty similar to RAG.

Back to GROVE

Overall, I think it’s pretty evident that most of these works, while interesting from an engineering perspective, require a lot of human control to make them work, and even then they are extremely small in scale in what they accomplish. GROVE, on the other hand, tries to make a lot of these processes automatic with 3 main components:

Retrieval Repository

Here, the authors collected a lot of stories from WritingPrompts and extracted “conditions”, such as plot, mood, genre, and subject, to pair with each story. I believe the conditions were extracted with GPT-4.

To retrieve a story, we first have a target condition set C, and then we find the story with the most similar conditions via cosine similarity, which is pretty much just RAG.
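A minimal sketch of that retrieval step, assuming a sentence-embedding function (the `embed` below is a random stand-in, not a real model):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a sentence-embedding model; replace with a real encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

def retrieve(target_conditions: list[str], repository: list[dict], k: int = 2) -> list[dict]:
    """Return the k stories whose extracted conditions are most similar to the target set C."""
    query = embed(" ".join(target_conditions))
    scored = [(float(embed(" ".join(s["conditions"])) @ query), s) for s in repository]
    return [s for _, s in sorted(scored, key=lambda x: -x[0])[:k]]

repo = [
    {"story": "...", "conditions": ["plot: heist", "mood: tense", "genre: thriller"]},
    {"story": "...", "conditions": ["plot: romance", "mood: light", "genre: comedy"]},
]
print(retrieve(["genre: thriller", "mood: tense"], repo, k=1))
```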

But now, how do we generate a story using this?

Evidence Forest Construction via Asking Why

  1. We first have the LLM generate an initial story given some condition C and retrieved story samples from the Retrieval Repository.
  2. We have the LLM point out unclear parts of the initial story. The prompt for this is

“Here is a story: [STORY]. When analyzing fictional stories, it is okay to mention the negative aspects. Pretend to be a writer, and without further ado, point out N missing background information in the story with N simple sentences one by one.”

3. Iteratively ask why for each unclear part and have GPT-4 produce “evidence”. This is recursive, because each piece of evidence might itself contain an unclear part. So, by recursively asking until there are no ambiguities left, we get an evidence tree for the given ambiguity! The prompt for this is

“Here is a story: [STORY]. A missing detail is: [EVIDENCE CHAIN]. Except for pure coincidence, point out b factual pieces of background information that compensate the story one by one. Each additional piece of information should be in one short sentence and only contain factual information without opinions or judgments.”

One important part to note here is that the evidence trees are not per story but per ambiguity. In the above prompt, the ambiguity is introduced as the root node of the evidence tree: “A missing detail is:” is followed by the ambiguity, and then by the “evidence” trying to resolve it. As there are N ambiguities, we call the result an “evidence forest”.
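Here is a minimal sketch of how I picture the construction, using the two prompts quoted above. The `llm` helper is hypothetical (it would wrap a GPT-4 call and return a list of sentences), and I use a fixed recursion depth instead of stopping only when no ambiguities remain:

```python
AMBIGUITY_PROMPT = (
    "Here is a story: {story}. When analyzing fictional stories, it is okay to mention the "
    "negative aspects. Pretend to be a writer, and without further ado, point out {n} missing "
    "background information in the story with {n} simple sentences one by one."
)
EVIDENCE_PROMPT = (
    "Here is a story: {story}. A missing detail is: {chain}. Except for pure coincidence, "
    "point out {b} factual pieces of background information that compensate the story one by one."
)

def llm(prompt: str) -> list[str]:
    raise NotImplementedError("hypothetical helper: call GPT-4 and split the reply into sentences")

def build_tree(story: str, chain: list[str], b: int, depth: int) -> dict:
    """Recursively ask why: each node is a piece of evidence, children explain it further."""
    node = {"evidence": chain[-1], "children": []}
    if depth == 0:
        return node
    for ev in llm(EVIDENCE_PROMPT.format(story=story, chain=" ".join(chain), b=b)):
        node["children"].append(build_tree(story, chain + [ev], b, depth - 1))
    return node

def build_forest(story: str, n: int = 3, b: int = 2, depth: int = 2) -> list[dict]:
    """One evidence tree per ambiguity; the N trees together form the evidence forest."""
    ambiguities = llm(AMBIGUITY_PROMPT.format(story=story, n=n))
    return [build_tree(story, [a], b, depth) for a in ambiguities]
```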

Evidence Chains-supported Story Rewriting

  1. The optimal evidence chain is selected by concatenating all the evidence chains and feeding them to an LLM, which picks the “most suitable evidence chain” for the initial story; this is repeated for all N evidence trees. The reason multiple chains are not selected is to avoid logical contradictions.
  2. The story is rewritten with the given evidence

“Here is a story: [STORY]. Here is the missing background information: [EVIDENCE CHAINS]. Pretend to be a writer and complete the story by including the given information. Modify the necessary sentences in the story and repeat the unrelated parts to include the given background information.”

One question I had was whether this was just distilling the retrieved stories into some new stories, but according to the ablation study, where they evaluated with humans, removing the retrieval part does not seem to affect the results that much.

In addition, they did run a plagiarism checker and no plagiarism was detected, which I found pretty interesting.

Also, they seem to claim that they outperform humans in an evaluation with people holding an MS in English Literature! I am personally curious what the limitation is, as I couldn’t find the generated stories online as of now.

Personally, I’m pretty impressed that this method was able to avoid contradictions between trees. More fundamentally, though, it raises the question of whether the only thing missing for GPT-4 to generate great stories is some logical understanding of the plot, and I am a bit skeptical that an LLM has enough logical capability to do this for complicated plots.

Another slight criticism I have of this work is that it only works for short stories, i.e. stories that fit in the LLM’s context length even before accounting for the trees. But overall I think it’s an interesting method.

Now, back to the premise. Can GPT 4 really create creative plots?

Creating Suspenseful Stories: Iterative Planning with Large Language Models

In “Creating Suspenseful Stories: Iterative Planning with Large Language Models” the authors argue that GPT 4/LLMs, at least initially, have issues generating suspenseful stories. The reason for this is that “suspense is an affective response to a cognitive state that only comes about under certain circumstances” which are

  1. There must be a protagonist the reader empathizes with
  2. That protagonist must face a high possibility of an outcome undesired by the protagonist and the reader
  3. The quantity and quality (roughly in terms of expected probability) of ways of avoiding that undesired outcome are reduced

Now, the main idea of this paper seems to be to resolve this with prompts like so

So essentially, given a goal, we prompt the LLM for the protagonist’s best action and why it fails, over and over again. I think this paper’s main contribution is its characterization of what suspense is, and I do find it a bit interesting that GPT-4 can’t do this from the start. One main critique I have is that, in my opinion, this restricts what kind of suspenseful writing can be produced. For example, I think this can theoretically produce plots like those of Stephen King novels, but I don’t think the writing will be there. Happy to be proven wrong, though.
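A rough sketch of the loop I have in mind (this is my simplification with made-up prompts, and `llm` is a hypothetical completion helper, not the paper’s actual prompting scheme):

```python
def llm(prompt: str) -> str:
    raise NotImplementedError("hypothetical helper: return a one-sentence completion from an LLM")

def suspense_loop(protagonist: str, goal: str, steps: int = 3) -> list[str]:
    """Repeatedly propose the protagonist's best remaining action, then have it fail,
    shrinking the space of ways to avoid the undesired outcome (the third condition above)."""
    story_so_far = f"{protagonist} wants to {goal}."
    events = [story_so_far]
    for _ in range(steps):
        action = llm(f"{story_so_far}\nWhat is the best action {protagonist} can take "
                     "to achieve the goal? Answer in one sentence.")
        failure = llm(f"{story_so_far}\n{protagonist} tries: {action}\n"
                      "Explain in one sentence why this attempt fails.")
        story_so_far += f" {action} {failure}"
        events += [action, failure]
    return events
```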

Another thing LLMs lack may be pacing in the stories they generate.

Improving Pacing in Long-Form Story Planning

In “Improving Pacing in Long-Form Story Planning”, the authors found that LLM stories have unnatural pacing in that “Existing LLM-based systems for writing longform stories or story outlines frequently suffer from unnatural pacing, whether glossing over important events or over-elaborating on insignificant details, resulting in a jarring experience for the reader”

To solve this, the authors came up with a “Concrete Outline Control”

which works with Project Gutenberg data and produces paragraph-level summaries and chapter-level summaries. The next step is to train a concreteness evaluator to predict which of two texts is vaguer. Since the chapter-level summaries have to compress more information, the authors treat them as the vaguer ones.
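A minimal sketch of what such pairwise training could look like, assuming paired (paragraph-level, chapter-level) summaries; the tiny encoder here is a stand-in, not the authors’ actual model:

```python
import torch
import torch.nn as nn

class ConcretenessScorer(nn.Module):
    """Scores a text; higher = more concrete. A real system would use a pretrained encoder."""
    def __init__(self, vocab_size: int = 30522, dim: int = 128):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim)  # toy bag-of-tokens encoder
        self.head = nn.Linear(dim, 1)

    def forward(self, token_ids):                    # (batch, seq_len) -> (batch,)
        return self.head(self.emb(token_ids)).squeeze(-1)

model = ConcretenessScorer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MarginRankingLoss(margin=1.0)

# Toy batch: paragraph-level (concrete) vs chapter-level (vague) summaries as token ids.
concrete = torch.randint(0, 30522, (4, 32))
vague = torch.randint(0, 30522, (4, 32))
target = torch.ones(4)                               # "the concrete text should score higher"

loss = loss_fn(model(concrete), model(vague), target)
loss.backward()
opt.step()
```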

Now, how can we use this to improve our outlines?

Basically, the idea seems to be

  1. Find the vaguest part of the outline by comparing every fact to each other.
  2. Generate/Expand on it using GPT-4 for it to be more concrete.

And iterate until it’s no longer vague. This felt a bit like GROVE, but with the added work of training a classifier; then again, this kind of long-term planning may not be an issue in GROVE-like systems, which target shorter stories.
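Putting the two steps together, the refinement loop might look something like this (a sketch under my own assumptions; `vagueness` would wrap the trained evaluator and `expand_with_gpt4` a GPT-4 call, both hypothetical here):

```python
def vagueness(item: str) -> float:
    raise NotImplementedError("score with the trained concreteness evaluator; lower = vaguer")

def expand_with_gpt4(item: str) -> list[str]:
    raise NotImplementedError("ask GPT-4 to expand the item into more concrete sub-events")

def refine_outline(outline: list[str], threshold: float, max_iters: int = 10) -> list[str]:
    """Repeatedly find the vaguest outline item and expand it until nothing is too vague."""
    for _ in range(max_iters):
        vaguest = min(outline, key=vagueness)        # compare every item against the others
        if vagueness(vaguest) >= threshold:
            break                                    # nothing is "too vague" any more
        i = outline.index(vaguest)
        outline[i:i + 1] = expand_with_gpt4(vaguest) # replace it with more concrete sub-events
    return outline
```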

Now, what about the final step of actually generating the story? The idea seems to be to follow a paper called “DOC: Improving Long Story Coherence With Detailed Outline Control”, which I saw referenced a lot. The basic idea seems to be: given an outline, use an LLM to expand the outline and then generate a story. GROVE didn’t have to do this, as it was not generating long stories.

I am curious here about how writing style and coherence hold up, so I will talk about this later on.

However, one thing I wanted to show is that LLMs, without some prompting to force it out of them, can’t generate stories with certain emotions like suspense, and may be lacking in storytelling as well.

How about for the logical side? How much logical processing can LLMs do in the context of stories?

Large Language Models Fall Short: Understanding Complex Relationships in Detective Narratives

In “Large Language Models Fall Short: Understanding Complex Relationships in Detective Narratives” the authors found that current LLMs have “limitations in inferencing complex relationships and handling longer narratives”.

To accomplish this, the authors first noticed that evaluating this with existing datasets is hard as

  1. The LLMs were trained on those stories, which I presume are classics like those in Project Gutenberg
  2. The stories are too simple, e.g. children’s stories, where “characters and their relationships are typically introduced when the characters first appear in the narrative”

However, in more complicated stories, the information about each character is “incomplete and uncertain” and can even be contradictory, for example because of lies!

So when LLMs work on these more complicated plots, the researchers observe that they misidentify relationships, which causes them to misunderstand the entire story.

To solve this, the authors came up with Conan, which stands for COntextual Narrative ANalysis.

Essentially, as input, we have narratives involving k characters, along with the background story of each character told from that character’s own perspective. The task of the LLM is to

  1. Extract all the characters in the narrative. This was a challenge for the LLMs, as characters might have different identities/aliases at different points of the story, and LLMs have issues keeping track of this.
  2. Link character relationships from the perspective of each character. One limitation observed was, as found in “The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A””, that even if an LLM can say A is the parent of B, it doesn’t necessarily make the connection that B is the child of A.
  3. Infer relationships between characters that are never explicitly stated. LLMs definitely struggle here, especially in the face of inconsistent information.

Now, one part I’m curious about is how such a dataset was constructed so that there is only one correct answer for the resolution of relationships. I found this a bit funny, but the researchers got all their stories from the background narratives of murder mystery games and selected 100 out of 2,135. To mitigate the reversal curse, all relationships were human-labeled by fans of detective narratives.

The authors measured the quality of the labels by evaluating inter-annotator agreement on specific narratives. The only places where labelers had low agreement were subjective labels, e.g. “father’s friend of x” might be labelled as “acquaintance of x”. The figure for the pipeline is below.

The results of LLMs on this dataset are as follows:

where the corruption rate is the rate at which the LLMs do not generate output in the right format. I’m not sure what “before” and “after” refer to from reading the paper, but they mention it’s post-processing to make the format correct. I think the results make sense in that GPT-4 is the best but still performs pretty badly.

Here,

  1. AllTogether=directly output a relationship graph, including characters and their relationships
  2. DirRelation=extract the characters from the narrative first, then utilise it alongside the narrative script to generate the relationship network
  3. PairRelation=initially extract characters, inquire about the relationship of each character pair, and finally, aggregate the results, merging them into a comprehensive relationship map

So one common thing I noticed here is that the chain-of-thought-style prompting isn’t working too well: DirRelation seems decent, but PairRelation consistently gets the worst results.
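For concreteness, here is a rough sketch of how I understand the three strategies above, with `llm` as a hypothetical helper and made-up prompt wording:

```python
from itertools import combinations

def llm(prompt: str) -> str:
    raise NotImplementedError("hypothetical helper: return an LLM completion as a string")

def all_together(narrative: str) -> str:
    """AllTogether: ask for the full relationship graph in one shot."""
    return llm(f"{narrative}\nOutput the full character relationship graph directly.")

def dir_relation(narrative: str) -> str:
    """DirRelation: extract characters first, then ask for the whole network."""
    characters = llm(f"{narrative}\nList all characters, separated by commas.")
    return llm(f"{narrative}\nCharacters: {characters}\nNow output their relationship network.")

def pair_relation(narrative: str) -> dict:
    """PairRelation: extract characters, query every pair, then aggregate into one map."""
    characters = llm(f"{narrative}\nList all characters, separated by commas.").split(", ")
    graph = {}
    for a, b in combinations(characters, 2):   # one query per character pair
        graph[(a, b)] = llm(f"{narrative}\nWhat is the relationship between {a} and {b}?")
    return graph
```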

Another part that surprised me is that on some metrics, like character extraction, GPT-4 does not get the best result

which I found very strange. The final metric I want to talk a bit about is the narrative length vs F1 score.

and it does seem that the F1 score is highest at around 20,000 characters (not tokens!), but beyond that the LLMs consistently lose performance. The authors hypothesized that this is the effect described in “Lost in the Middle: How Language Models Use Long Contexts”, where LLMs “tend to put more attention on the beginning and the end of long inputs, often ignoring information in the middle”, which might be because humans behave that way.

Overall, I think this was a pretty interesting paper on the limitations of LLMs for not just stories but logic in general.

Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers

To highlight current LLMs’ capabilities further, “Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers” found that for stories that are not shared online, SOTA models like GPT-4, Claude-2.1, and Llama 2 70B made mistakes in over 50% of summaries, with results like so

where the scores, out of 4, were given by the original authors of the stories.

Now, we did find that LLMs are not currently capable of generating compelling stories, but let’s say this issue is solved. How can LLMs generate long compelling stories?

DOC: Improving Long Story Coherence With Detailed Outline Control

The first paper that tried this was “DOC: Improving Long Story Coherence With Detailed Outline Control”

As mentioned before, the 2 parts of this paper are getting the detailed outline and then generating the story. This is an improvement on a paper called “Re3: Generating longer stories with recursive reprompting and revision”

Detailed Outline

Given an event in the outline, say “Sue flies to Italy”, the detailed outliner expands this event into a tree where the event is the root, and continues the expansion recursively. This does feel a bit like GROVE, but for the next events rather than for ambiguities. This is called “Breadth-First Expansion”.

Now, while we do this, we don’t want to generate contradictions within the outline. So all of the current node’s ancestors, and its already-generated children, are kept in the context when generating new children.

Finally, with filtering and reranking, we managed to expand the outline!
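Here is a minimal sketch of how I picture the breadth-first expansion with ancestors and earlier children kept in context (my own simplification; `llm` is a hypothetical helper and the prompt wording is made up):

```python
def llm(prompt: str) -> str:
    raise NotImplementedError("hypothetical helper: return one sub-event as a string")

class Node:
    def __init__(self, event: str, parent: "Node | None" = None):
        self.event, self.parent, self.children = event, parent, []

    def ancestors(self) -> list[str]:
        node, out = self.parent, []
        while node:
            out.append(node.event)
            node = node.parent
        return out[::-1]

def expand(root: Node, depth: int, branching: int = 3) -> Node:
    frontier = [root]
    for _ in range(depth):                     # breadth-first: expand one level at a time
        next_frontier = []
        for node in frontier:
            for _ in range(branching):
                context = node.ancestors() + [node.event] + [c.event for c in node.children]
                prompt = ("Outline so far: " + " -> ".join(context) +
                          f"\nAdd the next, more detailed sub-event under '{node.event}', "
                          "consistent with everything above.")
                child = Node(llm(prompt), parent=node)
                node.children.append(child)    # earlier siblings stay in context
                next_frontier.append(child)
        frontier = next_frontier
    return root

# expand(Node("Sue flies to Italy"), depth=2)  # would build a two-level detailed outline
```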

Here, one interesting part is that the model is made to explicitly generate the setting and characters, which might be the authors noticing that LLMs otherwise have issues with character memory/relationships.

Another interesting part is the authors made the model explicitly mention the character developments for each passage over time.

But overall, that was the main interesting part I got from skimming. Let me know if I missed anything interesting! The code for DOC is here.

End-to-end Story Plot Generator

One issue with DOC is that it requires “hundreds to thousands of calls to LLMs (e.g., OpenAI API) in the planning stage of the story plot, which is costly and takes at least several minutes”. Meta’s improvement on this was “End-to-end Story Plot Generator”. The main contributions were to

  1. Replace expensive OpenAI API calls with local LLMs to generate plots
  2. Do supervised fine-tuning on the prompts for the generated plots so that these plots can be generated 10x faster.
  3. Finally, do RLHF/DPO with a reward model so that the generated plots are even better.

Now, finally, what are some efforts to train LLMs to be good storywriters?

SWAG: Storytelling With Action Guidance

Before talking about this, I assume you have a basic understanding of pretraining, SFT, and RLHF. If you don’t, I recommend you check out this blog post first.

“SWAG: Storytelling With Action Guidance” has an interesting approach that makes story writing a bit like a reinforcement learning problem: one LLM generates story content/actions, and another LLM chooses the best action out of the candidates.

Now how do we do this? First of all, given a state, we need a list of candidate choices and a way to choose the best option. To do this, we start with a prompt for the story, P, and the current situation, S, like so

Then we prompt GPT-4 and Mixtral to generate the first paragraph to make a dataset.

Then, we have GPT-4 choose which action is best, and that choice becomes our label.
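A sketch of how I imagine one such training example being built (the `gpt4` and `mixtral` helpers are hypothetical wrappers around the respective APIs, and the prompt wording is mine):

```python
def gpt4(prompt: str) -> str:
    raise NotImplementedError("hypothetical wrapper around the GPT-4 API")

def mixtral(prompt: str) -> str:
    raise NotImplementedError("hypothetical wrapper around a Mixtral endpoint")

def make_example(story_prompt: str, current_state: str) -> dict:
    """Generate candidate next paragraphs with two models, then let GPT-4 pick the label."""
    gen_prompt = (f"Story premise: {story_prompt}\nStory so far: {current_state}\n"
                  "Write the next paragraph.")
    candidates = [gpt4(gen_prompt), mixtral(gen_prompt)]
    choice_prompt = (f"Story premise: {story_prompt}\nStory so far: {current_state}\n"
                     "Candidate continuations:\n" +
                     "\n".join(f"{i}: {c}" for i, c in enumerate(candidates)) +
                     "\nWhich candidate continues the story best? Answer with its number only.")
    label = int(gpt4(choice_prompt))
    return {"state": current_state, "candidates": candidates, "label": label}
```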

Next, we do pretraining on the data, which basically means next-token prediction on our base dataset of long stories.

For supervised fine-tuning (SFT), training just used the state and all the paragraphs that were generated. This means that, given the story and prompt, we try generating the paragraph and compute our LLM’s loss on it.

To adapt the open-source model, the authors had an issue with the context length being too large. To fix this, they used “LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models”

where my basic understanding is that only the highlighted parts of the attention matrix are computed instead of the full attention, so we can encode much longer contexts for the same amount of compute. I’m a bit curious whether there were any issues with the positional embeddings for this, or whether RoPE was used.

Now, finally, we do RLHF/DPO to make the model more incentivized to generate the correct paragraph!

Note that all of this was only to train the discriminator for the actions! For the base model, the authors just used an open-source base model/GPT-3.5/GPT-4, and it seems that having this discriminator in the loop does help produce better stories when evaluated by humans.
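At inference time, my understanding is that the loop looks roughly like this (a sketch with names of my own choosing, not the paper’s API):

```python
def generate_candidates(base_model, state: str, n: int = 4) -> list[str]:
    raise NotImplementedError("sample n candidate next paragraphs from the base LLM")

def score(discriminator, state: str, candidate: str) -> float:
    raise NotImplementedError("discriminator's preference score for this continuation")

def write_story(base_model, discriminator, prompt: str, paragraphs: int = 10) -> str:
    """The base model writes candidate continuations; the discriminator picks the best one."""
    story = prompt
    for _ in range(paragraphs):
        candidates = generate_candidates(base_model, story)
        best = max(candidates, key=lambda c: score(discriminator, story, c))
        story += "\n\n" + best
    return story
```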

On the other hand, I’m not sure why the authors didn’t use the discriminator itself to try to generate a story but I may have missed it.

Weaver: Foundation Models for Creative Writing

The final paper I want to talk about is “Weaver: Foundation Models for Creative Writing”

The Weaver paper found an issue with current LLMs in creative writing: the data. For example, the annotators during the RLHF stage are not professional writers/content creators and focus only on producing “helpful and harmless responses”, which removes creativity.

The authors argue that the above makes LLMs handicapped in creativity while LLMs are powerful in other domains such as writing code and answering general questions.

Data for SFT

One of the main contributions of Weaver seems to be how the data was collected. For SFT/alignment, they took inspiration from “LongForm: Optimizing instruction tuning for long text generation with corpus extraction” to generate instructions from human-written passages, like so

This has the benefit of making an instruct dataset more cheaply and cleanly. Now, as the authors observed that crowd workers are not the best for creative writing purposes, the crowd workers were instead used to collect high-quality text data, such as stories, articles, and blog posts, which reduces the cost of making an instruct dataset while increasing quality!
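A minimal sketch of the instruction back-generation idea as I understand it (the `llm` helper and the prompt wording are mine, not Weaver’s actual pipeline):

```python
def llm(prompt: str) -> str:
    raise NotImplementedError("hypothetical helper: return an LLM completion")

def make_sft_pair(passage: str) -> dict:
    """Turn an existing human-written passage into an (instruction, response) training pair."""
    instruction = llm(
        "Here is a passage written by a human author:\n"
        f"{passage}\n"
        "Write the instruction a user could have given to produce exactly this passage."
    )
    # The human-written text becomes the target response, so no crowd worker has to write creatively.
    return {"instruction": instruction, "response": passage}
```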

After this, they filtered the data using rule-based and ML-based methods and then tried mixing all the data in an even ratio.

2 additional facts that were interesting to me were

  1. The authors included training to specifically work with RAG instead of relying on prompt engineering by adding context etc.
  2. The authors trained on a function-calling dataset.

Constitutional DPO

Another part that was interesting to me was Constitutional DPO.

First, human experts annotate principles for a certain domain and task, like below,

as well as one case adhering to the principle and one case violating it, with a rationale.

Then, given a filtered instruction-output pair from the instruct dataset, we ask GPT-4, given the principles, why the output is of good quality. Then we ask GPT-4 to generate a response that violates the principle with minimal modifications.

Then, it’s exactly the same as DPO. This training method apparently produces very low-noise preference data, which seems awesome.
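Here is how I picture a single preference pair being constructed before the standard DPO step (a sketch under my own assumptions; `gpt4` is a hypothetical wrapper and the prompts are not quoted from the paper):

```python
def gpt4(prompt: str) -> str:
    raise NotImplementedError("hypothetical wrapper around the GPT-4 API")

def make_dpo_pair(instruction: str, good_response: str, principle: str) -> dict:
    """Build a (chosen, rejected) pair: the dataset response vs a minimally modified violation."""
    rationale = gpt4(f"Principle: {principle}\nResponse: {good_response}\n"
                     "Explain why this response follows the principle.")
    bad_response = gpt4(f"Principle: {principle}\nResponse: {good_response}\n"
                        "Rewrite the response with minimal modifications so that it violates "
                        "the principle.")
    # Standard DPO then trains on these (chosen, rejected) pairs for the same instruction.
    return {"prompt": instruction, "chosen": good_response,
            "rejected": bad_response, "rationale": rationale}
```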

My main critique of this work is

  1. The benchmark they used, WriteBench, appears to have had its repository deleted, and it is also their own benchmark; that said, humans also seem to consistently prefer this model in creative writing.
  2. More fundamentally, one thing this model does, pretraining, was interesting, since I thought that would cause catastrophic forgetting of what the model already knows. And at least from the papers above, it seems that logical knowledge, and maybe even math, is pretty important in story writing, so overall I am curious whether this approach is more valid than, say, a generalist one.

But overall I do like this paper.

Conclusion

I think this blog and my previous blog, a literature review on graphs, go together, since it seems to me that one main piece missing from storytelling is understanding the plot, which I feel should be offloaded to another model specializing in graphs. But that may be a future paper. I did have some guesses in the beginning, and my takeaway is that we are not quite there yet, but we are close.
