Creating and Training a New Parser¶
Extending the AlphaBeta class¶
TODO
A skeleton for extending the AlphaBeta class and defining a new parser class can be found here: https://github.com/hyperbase/hyperbase/blob/master/skeletons/parser_xx.py
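Pending the section above being filled in, the shape of such a subclass can be pictured with a purely illustrative sketch. The `AlphaBeta` stand-in below exists only so the example is self-contained; the actual class and constructor signature come from the linked `parser_xx.py` skeleton, which is authoritative:

```python
# Illustrative only: a stand-in base class so this sketch runs on its own.
# In a real setup you would subclass hyperbase's AlphaBeta class directly,
# following the parser_xx.py skeleton linked above; the attribute and
# constructor signature here are assumptions, not hyperbase's actual API.
class AlphaBeta:
    def __init__(self, lang: str):
        self.lang = lang


class ParserXX(AlphaBeta):
    """Hypothetical parser for the language code 'xx'."""

    def __init__(self) -> None:
        super().__init__(lang="xx")
```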
Collecting a corpus of sample texts¶
TODO
Files should be named by category and then number. For example: wikipedia1.txt, wikipedia2.txt, news1.txt, and so on.
This allows for the generation of balanced training datasets later on, as well as the testing of accuracy by category.
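As an illustration, the category and number can be recovered from such filenames with a simple regex. The helper below is hypothetical, not part of hyperbase:

```python
import re


def split_sample_name(filename: str) -> tuple[str, int]:
    """Hypothetical helper: split a sample filename such as
    'wikipedia1.txt' into its category ('wikipedia') and number (1)."""
    m = re.fullmatch(r"([a-z-]+)(\d+)\.txt", filename)
    if m is None:
        raise ValueError(f"unexpected sample filename: {filename}")
    return m.group(1), int(m.group(2))


print(split_sample_name("wikipedia1.txt"))     # ('wikipedia', 1)
print(split_sample_name("non-fiction12.txt"))  # ('non-fiction', 12)
```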
Extracting sentences¶
To extract sentences from a collected text sample, use the extract-sentences script. For example:
$ python -m hyperbase.scripts.extract-sentences --parser .parser_xx.ParserXX --infile parser-training-data/xx/text-samples/wikipedia1.txt --outfile parser-training-data/xx/sentences/wikipedia1.txt
Annotating sentences to generate a parser training dataset¶
To annotate the extracted sentences and produce a training dataset for the parser, use the generate-parser-training-data script. For example:
$ python -m hyperbase.scripts.generate-parser-training-data --parser .parser_xx.ParserXX --indir parser-training-data/xx/sentences --outfile parser-training-data/xx/sentence-parses.json
Splitting into training and testing datasets¶
To split the sentence parses dataset into training (two thirds) and testing (one third) datasets, use the split-parser-training-data script:
$ python -m hyperbase.scripts.split-parser-training-data --parser .parser_xx.ParserXX --infile parser-training-data/xx/sentence-parses.json
The files sentence-parses-train.json and sentence-parses-test.json will be created in the same directory as the original file, in this case parser-training-data/xx.
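The behavior of the split can be pictured with a short sketch. This is not hyperbase's implementation; it assumes the parses file holds a JSON list and that the split is random:

```python
import json
import random


def split_parses(infile: str, seed: int = 0) -> None:
    """Illustrative sketch of a two-thirds-train / one-third-test split.

    The real split-parser-training-data script is authoritative; this
    sketch assumes the input file contains a JSON list of parses.
    """
    with open(infile) as f:
        parses = json.load(f)
    # Shuffle deterministically, then cut at the two-thirds mark.
    random.Random(seed).shuffle(parses)
    cut = len(parses) * 2 // 3
    stem = infile.rsplit(".json", 1)[0]
    with open(f"{stem}-train.json", "w") as f:
        json.dump(parses[:cut], f)
    with open(f"{stem}-test.json", "w") as f:
        json.dump(parses[cut:], f)
```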
Generating alpha training data¶
To generate alpha training data from the sentence parses, use the generate-alpha-training-data script. For example:
$ python -m hyperbase.scripts.generate-alpha-training-data --parser .parser_xx.ParserXX --infile parser-training-data/xx/sentence-parses.json --outfile parser-training-data/xx/atoms.csv
Note that atoms-train.csv and atoms-test.csv can likewise be generated by running this script on the train and test files produced by the previous split.
Testing the alpha stage¶
With the train and test datasets generated above, the accuracy of the alpha stage can now be measured using the test-alpha script:
$ python -m hyperbase.scripts.test-alpha --parser .parser_xx.ParserXX --infile parser-training-data/xx/atoms-test.csv --training_data parser-training-data/xx/atoms-train.csv
Results are reported per category as well as overall. For example:
news accuracy: 0.962852897473997 [648 correct out of 673]
science accuracy: 0.9427083333333334 [543 correct out of 576]
fiction accuracy: 0.9581881533101045 [275 correct out of 287]
non-fiction accuracy: 0.9338235294117647 [254 correct out of 272]
wikipedia accuracy: 0.9482288828337875 [696 correct out of 734]
overall accuracy: 0.950432730133753 [2416 correct out of 2542]
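The overall figure is simply the pooled counts across categories. As a sanity check on the example output above:

```python
# Per-category (correct, total) counts taken from the example output above.
counts = {
    "news": (648, 673),
    "science": (543, 576),
    "fiction": (275, 287),
    "non-fiction": (254, 272),
    "wikipedia": (696, 734),
}
correct = sum(c for c, _ in counts.values())
total = sum(t for _, t in counts.values())
print(correct, total, correct / total)  # 2416 2542 ≈ 0.9504
```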
Manual parser testing¶
The full parser can also be tested manually. An interactive script is provided for this; it takes as input a sentence file generated by the extract-sentences script discussed above. To obtain meaningful results, be sure to use a text corpus different from the one used to train the parser:
$ python -m hyperbase.scripts.manual-parser-test --parser .parser_xx.ParserXX --infile parser-training-data/xx/sentences/manual-test-sentences.txt --outfile parser-training-data/xx/manual-test-results.csv
It also makes sense to create sentence files per text category, so that accuracy can be assessed across different kinds of text.