Creating and Training a New Parser¶
Extending the AlphaBeta class¶
TODO
A skeleton for extending the AlphaBeta class and defining a new parser class can be found here: https://github.com/hyperbase/hyperbase/blob/master/skeletons/parser_xx.py
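Pending the section above being filled in, the shape of such a subclass can be pictured with a purely illustrative sketch. The `AlphaBeta` stand-in below exists only so the example is self-contained; the actual class and constructor signature come from the linked `parser_xx.py` skeleton, which is authoritative:

```python
# Illustrative only: a stand-in base class so this sketch runs on its own.
# In a real setup you would subclass hyperbase's AlphaBeta class directly,
# following the parser_xx.py skeleton linked above; the attribute and
# constructor signature here are assumptions, not hyperbase's actual API.
class AlphaBeta:
    def __init__(self, lang: str):
        self.lang = lang


class ParserXX(AlphaBeta):
    """Hypothetical parser for the language code 'xx'."""

    def __init__(self) -> None:
        super().__init__(lang="xx")
```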
Collecting a corpus of sample texts¶
TODO
Files should be named by category and then number. For example: wikipedia1.txt, wikipedia2.txt, news1.txt, and so on.
This allows for the generation of balanced training datasets later on, as well as the testing of accuracy by category.
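As an illustration, the category and number can be recovered from such filenames with a simple regex. The helper below is hypothetical, not part of hyperbase:

```python
import re


def split_sample_name(filename: str) -> tuple[str, int]:
    """Hypothetical helper: split a sample filename such as
    'wikipedia1.txt' into its category ('wikipedia') and number (1)."""
    m = re.fullmatch(r"([a-z-]+)(\d+)\.txt", filename)
    if m is None:
        raise ValueError(f"unexpected sample filename: {filename}")
    return m.group(1), int(m.group(2))


print(split_sample_name("wikipedia1.txt"))     # ('wikipedia', 1)
print(split_sample_name("non-fiction12.txt"))  # ('non-fiction', 12)
```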
Extracting sentences¶
To extract sentences from a collected text sample, use the extract-sentences script. For example:
$ python -m hyperbase.scripts.extract-sentences --parser .parser_xx.ParserXX --infile parser-training-data/xx/text-samples/wikipedia1.txt --outfile parser-training-data/xx/sentences/wikipedia1.txt
Annotating sentences to generate a parser training dataset¶
To annotate the extracted sentences and produce a training dataset for the parser, use the generate-parser-training-data script. For example:
$ python -m hyperbase.scripts.generate-parser-training-data --parser .parser_xx.ParserXX --indir parser-training-data/xx/sentences --outfile parser-training-data/xx/sentence-parses.json
Splitting into training and testing datasets¶
To split the sentence parses dataset into training (two thirds) and testing (one third) datasets, use the split-parser-training-data script:
$ python -m hyperbase.scripts.split-parser-training-data --parser .parser_xx.ParserXX --infile parser-training-data/xx/sentence-parses.json
The files sentence-parses-train.json and sentence-parses-test.json will be created in the same directory as the original file, in this case parser-training-data/xx.
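The behavior of the split can be pictured with a short sketch. This is not hyperbase's implementation; it assumes the parses file holds a JSON list and that the split is random:

```python
import json
import random


def split_parses(infile: str, seed: int = 0) -> None:
    """Illustrative sketch of a two-thirds-train / one-third-test split.

    The real split-parser-training-data script is authoritative; this
    sketch assumes the input file contains a JSON list of parses.
    """
    with open(infile) as f:
        parses = json.load(f)
    # Shuffle deterministically, then cut at the two-thirds mark.
    random.Random(seed).shuffle(parses)
    cut = len(parses) * 2 // 3
    stem = infile.rsplit(".json", 1)[0]
    with open(f"{stem}-train.json", "w") as f:
        json.dump(parses[:cut], f)
    with open(f"{stem}-test.json", "w") as f:
        json.dump(parses[cut:], f)
```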
Generating alpha training data¶
To generate alpha training data from the sentence parses, use the generate-alpha-training-data script. For example:
$ python -m hyperbase.scripts.generate-alpha-training-data --parser .parser_xx.ParserXX --infile parser-training-data/xx/sentence-parses.json --outfile parser-training-data/xx/atoms.csv
Note that atoms-train.csv and atoms-test.csv can likewise be generated by running this script on the train and test files produced by the previous split.
Testing the alpha stage¶
With the train and test datasets generated above, the accuracy of the alpha stage can now be measured using the test-alpha script:
$ python -m hyperbase.scripts.test-alpha --parser .parser_xx.ParserXX --infile parser-training-data/xx/atoms-test.csv --training_data parser-training-data/xx/atoms-train.csv
Results are reported per category as well as overall. For example:
news accuracy: 0.962852897473997 [648 correct out of 673]
science accuracy: 0.9427083333333334 [543 correct out of 576]
fiction accuracy: 0.9581881533101045 [275 correct out of 287]
non-fiction accuracy: 0.9338235294117647 [254 correct out of 272]
wikipedia accuracy: 0.9482288828337875 [696 correct out of 734]
overall accuracy: 0.950432730133753 [2416 correct out of 2542]
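The overall figure is simply the pooled counts across categories. As a sanity check on the example output above:

```python
# Per-category (correct, total) counts taken from the example output above.
counts = {
    "news": (648, 673),
    "science": (543, 576),
    "fiction": (275, 287),
    "non-fiction": (254, 272),
    "wikipedia": (696, 734),
}
correct = sum(c for c, _ in counts.values())
total = sum(t for _, t in counts.values())
print(correct, total, correct / total)  # 2416 2542 ≈ 0.9504
```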
Manual parser testing¶
The full parser can also be tested manually. An interactive script is provided for this; it takes as input a sentence file generated by the extract-sentences script discussed above. To obtain meaningful results, be sure to use a text corpus different from the one used to train the parser:
$ python -m hyperbase.scripts.manual-parser-test --parser .parser_xx.ParserXX --infile parser-training-data/xx/sentences/manual-test-sentences.txt --outfile parser-training-data/xx/manual-test-results.csv
It also makes sense to create sentence files per text category, so that accuracy can be assessed across different kinds of text.