Word Parsing, Part 2

In the previous episode, I rambled about the history and completion of my mechanical word parser.  This time, we go section by section through the deceptively short synopsis.

As before, the code lives at https://github.com/ology/Lingua-Word-Parser and also at https://metacpan.org/release/Lingua-Word-Parser.

UPDATE: A Web GUI can now be found at https://github.com/ology/Word-Part



We start with the standard preamble for Perl, followed by loading my word parser module:
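(Nothing exotic; a minimal sketch of that opening.)

    #!/usr/bin/env perl
    use strict;
    use warnings;

    use Lingua::Word::Parser;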

Right off, we create an object with attributes for the lexicon file and the word to parse:
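(A minimal sketch; I am assuming the constructor attributes are named file and word.)

    my $parser = Lingua::Word::Parser->new(
        file => 'lexicon.dat',   # the word-part lexicon
        word => 'abiotically',   # the word to parse
    );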


This lexicon is made of regular expressions (indicating prefix, affix, suffix, or combining vowel), each paired with a definition.  The word can be anything that the dictionary knows about.  (Yes, it’s a loaded deck, but that is the point: using a finite lexicon.)  It looks like this:
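(An illustrative excerpt, not the shipped file verbatim: a regex on the left, anchored where it marks a prefix or suffix, and the definition on the right.)

    ^a     not, without
    bio    life
    tic    pertaining to
    al     of, like
    ly$    in the manner of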


The first method of the code captures the known parts of the word:
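(Assuming the method is named knowns, as I recall from the distribution's synopsis.)

    # Find every lexicon entry that matches somewhere in the word.
    my $known = $parser->knowns;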

Printing Dumper($known) shows more information than is (currently) used to process a word.  Here is the shape of an entry for the made-up test word “abiotically”:
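(A hypothetical entry for the bio match; the field names are illustrative, not the module's actual keys.)

    {
        part => 'bio',
        defn => 'life',
        mask => '01110000000',   # "bio" lit up inside "abiotically"
    }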


In particular, we use the set of masks for parsing the word into known parts.  These are found by iterating over each and every entry of our simple word part lexicon.  Oof!

The mask that is created shows a single known “chunk” with ones for the known characters and zeros for the unknowns.
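For example, assuming the lexicon matches the parts a-, bio, -tic, -al, and -ly, the masks over the eleven characters of “abiotically” would be:

    10000000000   a    (not, without)
    01110000000   bio  (life)
    00001110000   tic  (pertaining to)
    00000001100   al   (of, like)
    00000000011   ly   (in the manner of)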

Next up, we create all possible combinations of known parts that do not overlap.  This is where the mechanical, combinatorial bits come into play.
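On the parser object this is a single call (assuming, as with knowns, that power() is invoked as a method):

    # Generate every non-overlapping combination of known parts.
    my $combos = $parser->power;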

This produces a long list: every set of non-overlapping bitstrings that, when combined, makes up all or part of the original word.

The meat of the power() subroutine is the iteration over the powerset of masks, skipping any set that contains overlapping known parts.
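The module's actual code is on GitHub; here is a self-contained sketch of the same technique.  It uses Algorithm::Combinatorics for the powerset (my stand-in; the module may generate it differently) and adds a deliberately overlapping otic mask so the filter has something to reject:

    use strict;
    use warnings;
    use Algorithm::Combinatorics qw(subsets);

    # Masks for a, bio, otic, tic, ly over "abiotically".
    my @masks = qw( 10000000000 01110000000 00011110000 00001110000 00000000011 );

    my @combos;
    SET: for my $set ( subsets(\@masks) ) {
        next SET unless @$set;
        # Compare every pair of masks in the candidate set.
        for my $i ( 0 .. $#$set - 1 ) {
            for my $j ( $i + 1 .. $#$set ) {
                my ( $x, $y ) = map { oct "0b$_" } @{$set}[ $i, $j ];
                # XOR turns off shared positions, OR keeps them on,
                # so inequality means these two masks overlap.
                next SET if ( $x ^ $y ) != ( $x | $y );
            }
        }
        push @combos, $set;
    }

    print "@$_\n" for @combos;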

You can see the mechanical iteration.  The crucial comparison of masks to determine overlap is done with an XOR and an OR expression: XOR turns off any position where both masks have a one, while OR keeps it on.  If the two results are not the same, there is an overlap!

Lastly, we score the combinations.

I spent a lot of time thinking about this, but in the end it came down to two ratios: “known chunks vs unknown chunks” and “known characters vs unknown characters.”

For each (non-overlapping) combination, we OR the masks together, then count up the four measures: known chunks, unknown chunks, known characters, and unknown characters.

In order to do that, we run-length encode an “un-digitized” version of the combined mask.  For example, 011100 becomes ukkkuu, which is finally converted to u1k3u2.
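Here is my reconstruction of those two transformations (not the module's literal code):

    # "Un-digitize": 0 (unknown) becomes u, 1 (known) becomes k.
    my $mask = '011100';
    ( my $encoded = $mask ) =~ tr/01/uk/;    # 011100 -> ukkkuu

    # Run-length encode each contiguous run as character + length.
    $encoded =~ s/((.)\2*)/ $2 . length($1) /ge;    # -> u1k3u2

From u1k3u2 the four tallies can be read straight off: one known chunk of three characters, and two unknown chunks totaling three characters.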

By tallying these numbers, it is possible to get two ratios: a two-dimensional familiarity score that effectively considers the relative sizes of contiguous known and unknown spans.

When the word is reconstructed into something meaningful (that is, the ratios plus the word itself with its parts delimited), we get something like this:
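(An illustrative rendering; the real output format and score values differ in detail.)

    # chunks known:unknown = 5:0, chars known:unknown = 11:0
    a.bio.tic.al.ly
        a   = not, without
        bio = life
        tic = pertaining to
        al  = of, like
        ly  = in the manner of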


It seems easy enough when you look at a word and the parts jump out at you like differently colored LEGOs.  But convincing a computer to do it takes a bit of head scratching.
