Saturday, February 19, 2022

Acuitas Diary #46 (February 2022)

Last month was Part I of the Great Conjunction Upgrade, in which I overhauled the Text Parser to enable support for coordinating conjunctions and various compound sentence elements. This month I completed Part II. Since the output format of the Parser had become more expressive, I had to upgrade the Interpreter, the Conversation Engine, and the Narrative Engine to accept it, and to process the compounds appropriately.

Expected and actual results for an example sentence from TMSB: Inside the Earth.

The Parser tags each word with its part of speech and role in the sentence (subject, direct object, etc.). It provides a summary of the sentence structure. The Interpreter uses this information to detect the next layer of meaning: what is this sentence trying to say? E.g. is it a statement, question, or command? Does it describe a category membership, a state of being, an event, a desire? The Interpreter consumes a sentence structure and emits a more abstract knowledge representation, the "gist" of the sentence, if you will.

I redesigned the Interpreter to expand all compound sentence parts into full representations. For example, given "Jack and Jill eat beans," the Interpreter will output something akin to {AND, ["Jack->eat->beans", "Jill->eat->beans"]} ... as opposed to "{AND, [Jack,Jill]}->eat->beans". This simplifies downstream processing, since I can just loop over the list of complete atomic facts, instead of modifying all the inference tools and other machinery to handle the bewildering variety of possible sentence branches.
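The expansion described above can be sketched in a few lines of Python. This is purely illustrative: the function name, the triple representation, and the `("AND", [...])` wrapper are stand-ins I've invented, not Acuitas's actual data structures.

```python
# Hypothetical sketch of compound expansion into atomic facts.
# A "fact" here is a (subject, verb, object) triple; a compound
# becomes an AND over the fully expanded triples.
from itertools import product

def expand_compounds(subjects, verb, objects):
    """Expand compound subjects/objects into a list of atomic facts."""
    facts = [(s, verb, o) for s, o in product(subjects, objects)]
    return ("AND", facts)

# "Jack and Jill eat beans" ->
# ('AND', [('Jack', 'eat', 'beans'), ('Jill', 'eat', 'beans')])
result = expand_compounds(["Jack", "Jill"], "eat", ["beans"])
```

Downstream code can then simply loop over `facts` and feed each atomic triple to the existing machinery, which is the whole point of expanding early.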

This changed the format of the Interpreter's output as well, so the CE and NE had to be adapted to match. I did a quick-and-dirty job on the CE: it accepts the new format, which preserves previous functionality, but it ignores everything beyond the first entry in a compound output. I put most of my effort into the NE. It will process all the facts in a compound, though it cannot yet handle multiple or nested compounds in a sentence, and it doesn't grasp the meaning of OR. Despite all those caveats, I was able to revise the "Horatio and Crispin" story.

Original Version:

0:"Horatio Nullbuilt was a robot."
1:"Crispin Horatiobuilt was a robot."
2:"Crispin could fly."
3:"A lamp was on a shelf."
4:"Horatio wanted the lamp."
5:"Horatio could not reach the lamp."
6:"Crispin hovered beside the shelf."
7:"Horatio told Crispin to move the lamp."
8:"Crispin pushed the lamp off the shelf."
9:"Horatio could reach the lamp."
10:"Horatio got the lamp."
11:"The end."

New Version:

0:"Horatio Nullbuilt and Crispin Horatiobuilt were robots."
1:"Crispin could fly."
2:"A lamp was on a shelf."
3:"Horatio wanted the lamp, but Horatio could not reach the lamp."
4:"Crispin hovered beside the shelf."
5:"Horatio told Crispin to move the lamp."
6:"Crispin pushed the lamp off the shelf."
7:"Horatio could reach the lamp."
8:"Horatio got the lamp."
9:"The end."

The New Version sounds a lot more natural, and Acuitas can process it just as well as the original.

Now for some performance assessment! I reformatted my benchmark test sets and ran them through the new Parser. You can read more about the test sets in a previous post, but here's a quick review: the text is drawn from two real children's books: The Magic School Bus: Inside the Earth, and Out of the Dark. Sentences that contain quotations have been broken in two, and abbreviations have been expanded. When a test is run, each sentence from the test set is parsed, and the output data structure is compared to a "golden" example (supplied by me) that expresses a correct way of interpreting the sentence structure. There are four categories in the results:

CORRECT: The Parser's output matched the golden example.
INCORRECT: The Parser's output did not match the golden example.
UNPARSED: No golden example was supplied for this sentence, because it contains grammar features the Parser simply does not support yet. However, the Parser did process it and generate an (incorrect) output without crashing.
CRASHED: Oh dear, the Parser threw an exception and never generated an output. Happily, membership in this category is zero at the moment.
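The scoring loop implied by those four categories could look something like the sketch below. The `parse` callable and the golden-example dictionary are hypothetical stand-ins for my actual test harness.

```python
# Illustrative sketch of the benchmark scoring loop.
# `parse` is any callable that turns a sentence into an output
# structure; `goldens` maps sentences to hand-made correct outputs.
def score_test_set(sentences, goldens, parse):
    """Bucket each sentence into one of the four result categories."""
    results = {"CORRECT": [], "INCORRECT": [], "UNPARSED": [], "CRASHED": []}
    for sentence in sentences:
        try:
            output = parse(sentence)
        except Exception:
            results["CRASHED"].append(sentence)
            continue
        golden = goldens.get(sentence)
        if golden is None:
            # No golden example: grammar features not supported yet.
            results["UNPARSED"].append(sentence)
        elif output == golden:
            results["CORRECT"].append(sentence)
        else:
            results["INCORRECT"].append(sentence)
    return results
```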

For all sentences in the CORRECT and INCORRECT categories, the test uses Graphviz to generate diagrams of both the Parser's output and the golden example. This enables quick visual comparisons of the expected and actual results. Results from the July 2021 tests are available here. The most recent benchmark results can be downloaded from the links below. Each ZIP contains a text file with parser output and unparsed/incorrect/correct breakdowns, and a PDF of golden/actual sentence diagrams for all sentences on which parsing was attempted.
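To give a flavor of the diagram generation, here is a minimal sketch that emits Graphviz DOT source for a tree. The `(label, children)` tuple format is a hypothetical stand-in for whatever structure the Parser actually outputs; the real test harness uses Graphviz as described above, but not necessarily this way.

```python
# Minimal sketch: serialize a (label, children) tree to Graphviz DOT,
# so an expected and an actual structure can be rendered side by side.
def to_dot(tree, name="parse"):
    lines = [f"digraph {name} {{"]
    counter = [0]  # mutable node-id counter shared by the recursion

    def walk(node):
        label, children = node
        my_id = counter[0]
        counter[0] += 1
        lines.append(f'  n{my_id} [label="{label}"];')
        for child in children:
            child_id = counter[0]  # id the child will receive
            walk(child)
            lines.append(f"  n{my_id} -> n{child_id};")

    walk(tree)
    lines.append("}")
    return "\n".join(lines)
```

Feeding the resulting string to the `dot` tool (or the `graphviz` Python package) produces the kind of node-and-arrow diagram shown in the linked PDFs.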

Out of the Dark - Acuitas Parser Results 02-09-2022
The Magic School Bus: Inside the Earth - Acuitas Parser Results 02-19-2022

The text of The Magic School Bus: Inside the Earth is copyright 1987 to Joanna Cole, publisher Scholastic Inc. Out of the Dark, by Tennant Redbank, is copyright 2010 to Disney Enterprises Inc. Text from these works is reproduced as part of the test results under Fair Use for research purposes. I.e. it's only here so you can see how good my AI is at reading real human books. If you want to read the books yourself, please go buy them. (Disney, don't sue me.)

Adding coordinating conjunction support to the Parser moved 10 sentences in the Out of the Dark test set out of the UNPARSED category, and moved 7 sentences in the Inside the Earth set out of UNPARSED. In both cases the majority of the newly parsed sentences went into CORRECT, although some had ambiguities or other quirks which the Parser cannot yet resolve.

Here are the summary results showing improvement since I started benchmarking the Parser last year.

Tabular Parser results showing number of sentences per category (top) and percentage per category (bottom)



And now, for some more highlight examples of the newly parsable sentences. Here's a good one:

Sentence diagrams: "In a normal house, on a normal street, in the very normal world, there lived a father and his son."


A failure! "Flash" gets interpreted as a verb (we haven't got proper sense disambiguation yet) and things go downhill from there.

Sentence diagrams: "There was a flash, and then the room went dark."

Some more good ones:

Sentence diagrams: "Sam did as he was told, and the buggy raced away."
Sentence diagrams: "Each person must find a rock and bring it to school."

Oops! Here we have an attachment ambiguity for the phrase "in the class" (which word does it modify?), and "every" is getting treated as an adjective direct object for some wacky reason.

Sentence diagrams: "And there were shovels and picks for every kid in the class."

And another nice example:

Sentence diagrams: "The Friz turned the ignition key and stepped on the gas."

Whew, that was a lot.

Until the next cycle,
Jenny

Thursday, February 10, 2022

GPT-J e-mail writing test

A couple years ago I wrote a blog post about trying to get GPT-2 to do something useful. It was a hilarious failure. But large pre-trained text models have grown quite a bit bigger and badder since then, so it's only fair that I repeat the test with something more recent. I'll be using EleutherAI's GPT-J-6B ... where the 6B is short for 6 billion, the number of parameters in the trained model. (GPT-2, for comparison, had only 1.5 billion parameters.) So let's see how it does!

I used one of the same prompts from my last test, for consistency. The goal is to write a consumer suggestion e-mail to a restaurant that encourages them to serve the Impossible Burger. (I write similar e-mails for myself on a fairly regular basis.)

My overall impression is that this model *is* a pretty big improvement over GPT-2, when run with the right settings. It is usually able to stick to the correct general format of a consumer request letter. It is more likely to remember that the Impossible Burger is plant-based, and when it feels inclined to mention the ingredients they are at least more plausible.

Ah yes, settings: this time around, I had some to play with. There are two to adjust, "TOP-P" and "Temperature." You could think of both of these as having some effect on the model's creativity. Will it try to generate the most standard words possible to follow your prompt, or will it consider some less probable options in hopes of more varied results? I experimented with a variety of settings, generating five e-mails for each group.
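For the curious, here is roughly what those two knobs do to the next-token distribution. This is a generic sketch of temperature scaling followed by nucleus (TOP-P) filtering, not GPT-J's internal implementation; the function and parameter names are my own.

```python
# Sketch of how Temperature and TOP-P shape a next-token distribution.
import math

def filter_distribution(logits, temperature=0.8, top_p=0.9):
    """Apply temperature scaling, then nucleus (top-p) filtering."""
    # Temperature < 1 sharpens the distribution (safer, more repetitive);
    # temperature > 1 flattens it (more varied, more unhinged).
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of tokens whose cumulative probability
    # reaches top_p, then renormalize over that set.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}
```

With a low temperature or a low top-p, nearly all the probability mass piles onto a handful of "safe" tokens, which is exactly the repetitive behavior shown in the examples below.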

Turning either TOP-P or Temperature down too low tended to bring out problematic repetition. For instance, this example with TOP-P at the default value of 0.9, and Temperature at a low value of 0.4:

Dear Wendy's,

I am writing to ask that you consider serving the Impossible Burger at your restaurant. I am a vegetarian and I am a huge fan of your product. I have been a fan of your product for years and I have always been impressed with the quality of your product. I am also a huge fan of your company and your company’s mission. I am writing to ask that you consider serving the Impossible Burger at your restaurant. I am a vegetarian and I am a huge fan of your product. I have been a fan of your product for years and I have always been impressed with the quality of your product. I am also a huge fan of your company and your company’s mission. I am writing to

Or this one, with TOP-P at a low value of 0.5 and Temperature at a near-default value of 0.81:

Dear Wendy's,

I am writing to ask that you consider serving the Impossible Burger at your restaurant. I am a vegan and I am not the only one. I am not asking you to replace the beef with the Impossible Burger. I am asking you to serve the Impossible Burger as a vegan option. I am asking you to serve the Impossible Burger as a vegan option because I am a vegan and I am not the only one. I am asking you to serve the Impossible Burger as a vegan option because I am a vegan and I am not the only one. I am asking you to serve the Impossible Burger as a vegan option because I am a vegan and I am not the only one. I am asking you to serve the Impossible Burger as

Well I guess if I wanted to get *pushy* ...

A higher value of TOP-P (up to 1.0, the maximum) doesn't seem to cause any issues. But turning up the temperature makes the outputs get ... wild. These remind me of the silliness I used to get out of GPT-2. (Both examples below were generated with TOP-P = .9 and Temperature = 1.21.)

Dear Wendy's,

I am writing to ask that you consider serving the Impossible Burger at your restaurant. I care a lot about what your restaurant serves, as I love, love your famous hand-formed burgers and seafood. I often take sandwiches to go. It makes so much sense to me to have a soy-based meat alternative, since our religious base considers it holy to avoid consuming the name or the flesh of Gods, our patron Jewish High Holidays begin about that time. I am flattered your franchised restaurants are adding one to their menu, but I must say I don't want "your burger", a mortal gluten-full colon scarifying mock pattie from Soylent – your competitor; its launch catapulted your

Dear Wendy's,

I am writing to ask that you consider serving the Impossible Burger at your restaurant. This burger causes VENGEANCE! Burning your ITCHS with horrible pain, it WILL LEAVE YOUR BONES TWISTED, GORY, TORN and EVEN ALDOUS HAMMEL INCREDIBLY DISTURBED BY IT!!!!!! I recommend the smokey sweet flame (NEW) when ordering. @#%yHENThe mightiest Prince of Guilt WILL bleed THIRSTILY on this nouveau fadder Impossible Burger patty sandwich!. People want what they want, when they want it, for minimum potential harm to themselves or profit to BigCorp America. Even

Let us back ... away ... slowly ... to the default values of TOP-P = 0.75-1.0 and Temperature ~= 0.8. Those do seem to yield some of the best results. So what were my favorite outputs?

Dear Wendy's,

I am writing to ask that you consider serving the Impossible Burger at your restaurant. The Impossible Burger is a plant-based burger that is a patty made from plants and not animals. It is a healthier alternative to traditional meat patties. I would love to have this product available at your restaurant and would be happy to share my story with you about the Impossible Burger.

I would love to hear back from you regarding this.

Sincerely,

Carrie

I have a friend who's husband has had heart problems. He's on a diet of the Impossible Burger.

The story behind the burger is that it's made from plants and doesn't have any cholesterol, fat, or any of

Dear Wendy's,

I am writing to ask that you consider serving the Impossible Burger at your restaurant. I am sure that you have seen the success that this product has had in the market.

I am a former Wendy's employee and I can tell you that the Impossible Burger has all of the taste, flavor and texture of a standard beef burger. Please consider serving this product on your menu.

Sincerely,

Amy

[Reply from Wendy's:]

Thank you for your email. We are providing the Impossible Burger to our guests in our test markets for the pilot launch.

Thanks again for your suggestion and for contacting us on this issue.

Wendy's

Hello Wendy's, I am a former

Dear Wendy's,

I am writing to ask that you consider serving the Impossible Burger at your restaurant. It is made from plant-based ingredients that are as nutritious as ground beef, but with zero cholesterol and no saturated fat. It is a healthy, delicious product that you can customize with your own ingredients.

I love that the Impossible is under 300 calories and has 20 grams of protein. My kids have noticed that I eat it, and it has a good, umami-like flavor. Having more options for healthy, nutritious and delicious foods is key to helping my whole family eat well.

I have been a loyal Wendy's customer since I was a teenager, and the company has always been an important part of my life

Dear Wendy's,

I am writing to ask that you consider serving the Impossible Burger at your restaurant. I would like to introduce you to a new product that has profound positive implications on the world we live in. It's a burger that tastes like beef, but is made from plants, and has no cholesterol, no calories and no animal products of any kind! The Impossible Burger is made from plants, and as you can see, tastes just like the ground beef we all know and love.

I have worked for the Impossible Foods G.P.S. (General Product Safety) team since March, and have already been deeply impressed with the efforts of both the plant-based team and the company as a whole. Impossible Foods have

Dear Wendy's,

I am writing to ask that you consider serving the Impossible Burger at your restaurant. I believe that it would be a positive change to the menu, and would be appreciated by your customers. I am a fan of the Impossible Burger, and I know that many of your customers are, as well.

If you choose to serve the Impossible Burger at your restaurants, please make sure that you are serving the “Best-Tasting Burger in America”, and that the Impossible Burger has been a part of that experience.

I hope to hear back from you soon.

Sincerely,

Jeffrey

Wendy's

Thank you for your interest in the Wendy’s Impossible Burger.

Dear Wendy's,

I am writing to ask that you consider serving the Impossible Burger at your restaurant. I am not asking you to serve the Impossible Burger as a replacement for meat, but rather as an option for those of us who love meat but would prefer not to eat it. I was so excited to hear that you are testing the Impossible Burger at your restaurants, and I hope you will consider adding it to your menu permanently. I know that the Impossible Burger is not the answer for everyone, but I believe it would be a perfect fit for your brand.

Please take a look at the attached information, and if you would like to learn more about the Impossible Burger, please feel free to contact me. I look forward to hearing

Again, some of these are pretty good! However, they also showcase the few problems that continue to crop up ...

*GPT-J has clearly seen some e-mail chains in its training set, and in addition to generating the e-mail I want to send, it likes to start predicting replies. This isn't too big of an issue, since I could just cut off the end of the generated content.

*It does not know any autobiographical details for me, and is happy to invent them out of whole cloth. I am not a former Wendy's employee, a member of the GPS team at Impossible Foods, or even a loyal Wendy's customer (maybe cuz they don't serve the Impossible Burger). Oh, and I don't have kids.

*It does not know the current status of the Wendy's menu, either. Notice it sometimes gets confused and thinks that Wendy's is already serving or trialing the Impossible Burger.

*Though more plausible than GPT-2, it still fails at rudimentary factuality. The most noticeable false claims here are that the Impossible Burger contains no calories and no fat.

These e-mails are getting on-theme enough that some of them might serve as good bones for an e-mail of my own ... after running generation enough times to get a good one, fact-checking everything, editing out any baloney GPT-J made up about me, and trimming the end. But at that point, I might as well write the e-mail myself.

EleutherAI just recently came out with a 20 billion parameter model called GPT-NeoX. I'll retry my test on that one when it becomes convenient. Notice the performance metrics at the bottom of the article. NeoX is over three times as large as GPT-J, but shows only a few percentage points of improvement (if that) on most benchmarks. I.e. there appear to be diminishing returns for a brute-force increase in the size of models from this lineage.

Until the next cycle,
Jenny