Monday, January 31, 2022

Acuitas Diary #45 (January 2022)

I mentioned last October that I had started reworking the Text Parser to add support for branching and coordinating conjunctions. My objective for this month was to finish that. It proved to be a little too ambitious, but I did get the output format altered to be branch-friendly, and the parser now supports two-part branches in a number of key places. It can now process compound subjects, verbs, direct objects, adjectives, and adverbs; branching before or after the verb; and compound sentences with two fully separate independent clauses. What's missing? Compound prepositions, objects of prepositions, nouns of direct address, dependent clauses, and things inside clauses ... as well as comma-separated lists, compounds with more than two members, nested compounds, and probably some other stuff I haven't even thought of yet.

Even though I pared down what I wanted to accomplish, it was still a rush to get that much done, make sure all the old parser functionality was compatible with the new features, and update the visualizer to work with the new output format. That's why this blog post is going up on the final day of the month ... I literally just finished about an hour ago. The work included an update of the Out of the Dark benchmark data, and the Parser can now correctly diagram a sentence that was previously in the "unparseable" category:

Sentence diagrams: "Kevin was a busy man, but he always made time for his son."

Just the one example doesn't give a good sense of what the Parser can do now, but maybe I can give some more expansive results next month, when I'm not as pressed for time.

Why is all this so hard? Partly because nothing in my Parser was originally designed to support compounds, so there are many different pieces that have to be reworked. More importantly, because it is not a simple matter to decide what a conjunction is actually joining. Consider these two example sentences:

The people of the city and the country are different.
The residents of the city and the folk of the country are different.

Same basic meaning, different structure. In the first sentence, the conjunction "and" joins "city" and "country" to form a compound object of the preposition "of," whose phrase modifies the single subject. In the second sentence, the conjunction joins "residents" and "folk" into a compound subject. A naïve parser would try to link "city" and "folk" instead, to generate the comically wrong reading "residents of the folk of the country." To know this is wrong, you have to access not just the sentence structure, but also the meanings of the words. Then you can spot some parallelism: residents and folk are both collective nouns for persons, while city and country are both locations. And it is not possible to be a resident of the folk (unless you are a tapeworm, I suppose). With access to the semantic database, the Parser could eventually make judgments like this, but that's for later because it's complicated, argh. For now, I've just got some of the basic structural support in place.
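To make the parallelism idea concrete, here is a minimal Python sketch. It is not the Parser's actual code: the `CATEGORY` table is a hypothetical stand-in for the semantic database, and `pick_left_conjunct` is an invented helper name. It assumes the candidate nouns to the left of the conjunction have already been collected, nearest first, and it picks the first one whose category matches the noun on the right.

```python
# Hypothetical category lookup standing in for a real semantic database.
CATEGORY = {
    "people": "person",
    "residents": "person",
    "folk": "person",
    "city": "location",
    "country": "location",
}


def pick_left_conjunct(candidates, right_noun, category=CATEGORY):
    """Choose which earlier noun the conjunction joins to right_noun.

    candidates must be ordered nearest-first. If no candidate shares a
    category with right_noun (or right_noun is unknown), fall back to
    the nearest noun, which is the purely structural guess.
    """
    target = category.get(right_noun)
    for noun in candidates:
        if target is not None and category.get(noun) == target:
            return noun
    return candidates[0]


# "The people of the city and the country are different."
first = pick_left_conjunct(["city", "people"], "country")

# "The residents of the city and the folk of the country are different."
second = pick_left_conjunct(["city", "residents"], "folk")

print(first, second)
```

In the first sentence the nearest noun already matches, so the conjunction forms a compound object of the preposition. In the second, "city" is skipped because it is a location and "folk" is not, so the conjunction attaches to "residents" and forms a compound subject. A real implementation would need far richer category information than a flat lookup table.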

Now I need to get the other parts of the text processing chain (Interpreter, Conversation Engine, and Narrative Engine) ready to accept the new output. I think that will end up occupying much of next month.

Until the next cycle,
Jenny
