NOTE: The Feedburner e-mail subscription service is being sunset this month, so if you are subscribed to the blog by e-mail, this will be the last e-mailed blog post you receive. Please consider following directly with a Blogger account or following on social media.
This month marks the culmination of a major overhaul of the Text Parser and Interpreter, which I've been working on since the beginning of the year. As part of that, I have my first attempt at formal benchmarking to show off. I tested the Parser's ability to analyze sentences from a children's book.
Some quick background about these modules: the job of what I call the "Parser" is to take raw text input and turn it into the equivalent of a diagrammed sentence. It tags each word with its part of speech, its role in the sentence (subject, direct object, etc.), and its structural relationships to other words. The "Interpreter" operates on the Parser's output and tries to find meaning. Based on the sentence's discovered structure (and possibly some key words), the Interpreter categorizes it as a general kind of statement, question, or imperative. For instance, "A cat is an animal" is a statement that establishes a type relationship. "I ate pizza" is a statement that describes an event.
My primary goal for the overhauls was not to add new features, but to pave their way by correcting some structural weaknesses. So despite being a great deal of work, they aren't very exciting to talk about ... I would have to get too deep into minutiae to really describe what I did. The Parser got rearchitected to ease the changing of its "best guess" sentence structure as new information arrives. I also completely changed the output format to better represent the full structure of the sentence (more on this later). The Interpreter overhaul was perhaps even more fundamental. Instead of trying to assign just one category per sentence, the Interpreter now walks a tree structure, finding very general categories of which the sentence is a member before progressing to more specific ones. All the memberships and feature tags that apply to the sentence are now included in the output, which should make things easier for modules like Narrative and Executive that need to know sentence properties.
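To make the tree-walk idea a little more concrete, here's a minimal sketch of how a general-to-specific classification like that could look in Python. The names and structure are my own illustration for this post, not Acuitas' actual code:

# Illustrative sketch of a general-to-specific category walk.
class Category:
    def __init__(self, name, test, children=None):
        self.name = name              # e.g. 'statement', 'type_relationship'
        self.test = test              # function that takes a parse and returns True/False
        self.children = children or []

def classify(parse, root):
    """Collect every category label that applies to the parsed sentence."""
    labels = []
    frontier = [root]
    while frontier:
        node = frontier.pop()
        if node.test(parse):
            labels.append(node.name)
            frontier.extend(node.children)  # only descend into branches that matched
    return labels

The output is the full list of memberships, from broad to narrow, rather than a single category.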
Now on to the benchmarking! For a test set, I wanted some examples of simplified, but natural (i.e. not designed to be read by AIs) human text. So I bought children's books. I have two of the original Magic School Bus titles, and two of Disney's Tron Legacy tie-in picture books. These are all "early reader" books, but by the standards of my project they are still very challenging ... even here, the diversity and complexity of the sentences are staggering. So you might wonder why I didn't grab something even more entry-level. My reason is that books for even younger readers tend to rely too heavily on the pictures. Taken out of context, their sentences would be incomplete or simply uninteresting. And that won't work for Acuitas ... he's blind.
So instead I've got books that are well above his reading level, and early results from the Parser on these datasets are going to be dismal. That's okay. It gives me an end goal to work toward.
How does the test work? If you feed the Parser a sentence, such as "I deeply want to eat a pizza," as an output it produces a data structure like this:
{'subj': [{'ix': [0], 'token': 'i', 'mod': []}],
 'dobj': [{'ix': [3, 4, 5, 6],
           'token': {'subj': [{'ix': [], 'token': '<impl_rflx>', 'mod': []}],
                     'dobj': [{'ix': [6], 'token': 'pizza',
                               'mod': [{'ix': [5], 'token': 'a', 'mod': []}],
                               'ps': 'noun'}],
                     'verb': [{'ix': [4], 'token': 'eat', 'mod': []}],
                     'type': 'inf'},
           'mod': []}],
 'verb': [{'ix': [2], 'token': 'want',
           'mod': [{'ix': [1], 'token': 'deeply', 'mod': []}]}]}
Again, this is expressing the information you would need to diagram the sentence. It shows that the adverb "deeply" modifies the verb "want," that the infinitive phrase "to eat a pizza" functions as the main sentence's direct object, blah blah blah. To make a test set, I transcribe all the sentences from one of the books and create these diagram-structures for them. Then I run a script that feeds all the sentences to the Parser and compares its outputs with the diagram-structures I made. If the Parser's diagram-structure is an exact match for mine, the sentence scores as correct.
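Incidentally, since both structures are plain nested dictionaries and lists, the "exact match" check itself is trivial; something like this (my own sketch, not the actual test script) is all it takes:

def structures_match(parser_output, golden_structure):
    # Python's == compares nested dicts and lists element by element,
    # so this one comparison is the entire exact-match check.
    return parser_output == golden_structure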
For the test, the Parser runs in a standalone "simulator" mode that makes it independent of Acuitas' Executive and other main threads. It still uses Acuitas' semantic database, but cannot edit it.
There are actually three possible score categories: "correct," "incorrect," and "unparsed." The "unparsed" category is for sentences which contain grammar that I already know the Parser simply doesn't support. (The most painful example: coordinating conjunctions. It can't parse sentences with "and" in them!) I don't bother trying to generate golden diagram-structures for these sentences, but I still have the test script shove them through the Parser to make sure they don't provoke a crash. This produces a fourth score category, "crashed," whose membership we hope is always ZERO. Sentences that have supported grammar but score "incorrect" are failing due to linguistic ambiguities or other quirks the Parser can't yet handle.
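Putting the categories together, the core of the test script's loop looks roughly like the sketch below. The details are my own assumptions (parse_sentence stands in for the standalone Parser's entry point, and the test-case format is invented); only the four score buckets come from the real test:

from collections import Counter

def run_benchmark(test_cases, parse_sentence):
    """test_cases: list of (sentence, golden_structure) pairs, where a golden
    structure of None marks a sentence with known-unsupported grammar."""
    tally = Counter()
    for sentence, golden in test_cases:
        try:
            output = parse_sentence(sentence)
        except Exception:
            tally['crashed'] += 1      # we hope this bucket stays at zero
            continue
        if golden is None:
            tally['unparsed'] += 1     # unsupported grammar, e.g. sentences with "and"
        elif output == golden:
            tally['correct'] += 1
        else:
            tally['incorrect'] += 1
    return tally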
Since the goal was to parse natural text, I tried to avoid grooming the test sentences, with two exceptions. The Parser does not yet support quotations or abbreviations. So I expanded all the abbreviations and broke sentences that contained quotations into two. For example, 'So everyone was happy when Ms. Frizzle announced, "Today we start something new."' becomes 'So everyone was happy when Miz Frizzle announced.' and 'Today we start something new.'
It is also worth noting that my Magic School Bus test sets only contain the "main plot" text. I've left out the "science reports" and the side dialogue between the kids. Maybe I'll build test sets that contain these eventually, but for now it would be too much work.
[Pie chart: results of the Text Parser benchmark on the data set "The Magic School Bus: Inside the Earth." 37% Unattempted, 28% Incorrect, 33% Correct.]
On to the results!
So far I have fully completed just one test set, namely The Magic School Bus: Inside the Earth, consisting of 98 sentences. The Parser scores roughly one out of three on this one, with no crashes. It also parses the whole book in 0.71 seconds (averaged over 10 runs). That's probably not a stellar performance, but it's much faster than a human reading, and that's all I really want.
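For reference, the timing number is just averaged wall-clock time over repeated full-book runs. A small harness along these lines would produce it (parse_sentence is again a placeholder for the standalone Parser's entry point):

import time

def time_full_book(sentences, parse_sentence, runs=10):
    start = time.perf_counter()
    for _ in range(runs):
        for sentence in sentences:
            parse_sentence(sentence)
    return (time.perf_counter() - start) / runs   # average seconds per full pass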
Again, dismal. But we'll see how this improves over the coming years!
Until the next cycle,
Jenny