As I indicated earlier, I tested my solution on 30 sentences from KANDEL’s book. I corrected all the mistakes my English parser program made on these examples. Here are the statistics for them.
Each line gives the range of sentence sequence numbers in the book (min to max), followed by the word and sentence counts.
min=20890; max=20895; 94 words; 5 sentences
min=20885; max=20890; 82 words; 5 sentences
min=20880; max=20885; 70 words; 5 sentences
min=20875; max=20880; 135 words; 5 sentences
min=15035; max=15045; 228 words; 10 sentences
A control set test is necessary to check how the program behaves when it parses ‘new’ or unexpected text. It shows how closely the program has been fitted to the examples it was tuned on. When a solution is developed against a fixed number of test cases, its success depends on both the quality and the quantity of those cases. If you select too few cases, the solution will not work on new input. Even if you choose many cases, they have to be linearly independent so that together they cover the whole input space.
I chose the control examples from different parts of KANDEL’s book so that they were likely written by different authors.
min=15062; max=15072; 227 words; 10 sentences; 5 mistakes; 2.2%
min=15125; max=15135; 200 words; 10 sentences; 5 mistakes; 2.5%
min=33948; max=33958; 245 words; 10 sentences; 8 mistakes; 3.3%
min=2679; max=2689; 153 words; 10 sentences; 5 mistakes; 3.3%
min=42150; max=42160; 232 words; 10 sentences; 5 mistakes; 2.2%
The result: about 2.6% of the words (28 of 1,057) are parsed wrongly. More importantly, almost half of the sentences contain at least one mistake, possibly only one.
The average sentence length is 21 words. Shorter sentences tend to have fewer errors.
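The figures above can be rechecked with a few lines of Python. This is only a sketch: the tuples repeat the control-set numbers listed earlier, while the names `CONTROL_SAMPLES` and `summarize` are my own and are not part of the parser program.

```python
# Control-set samples as listed above:
# (min, max, words, sentences, mistakes)
CONTROL_SAMPLES = [
    (15062, 15072, 227, 10, 5),
    (15125, 15135, 200, 10, 5),
    (33948, 33958, 245, 10, 8),
    (2679, 2689, 153, 10, 5),
    (42150, 42160, 232, 10, 5),
]


def summarize(samples):
    """Pool the samples and return totals, error rate (%) and sentence length."""
    words = sum(s[2] for s in samples)
    sentences = sum(s[3] for s in samples)
    mistakes = sum(s[4] for s in samples)
    return {
        "words": words,
        "sentences": sentences,
        "mistakes": mistakes,
        # Pooled rate: total mistakes over total words, not a mean of the
        # per-sample percentages.
        "error_rate_pct": 100.0 * mistakes / words,
        "words_per_sentence": words / sentences,
    }


if __name__ == "__main__":
    for lo, hi, words, _, mistakes in CONTROL_SAMPLES:
        print(f"{lo}-{hi}: {100.0 * mistakes / words:.1f}%")
    print(summarize(CONTROL_SAMPLES))
```

Pooling the counts weights each sample by its length, which seems the fairer choice here because the samples range from 153 to 245 words.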
This shows that my model has to be improved until there is at most 1 mistake in ten sentences. At 21 words per sentence, ten sentences are about 210 words, so this corresponds to approximately 1 / 200 = 0.5% mistakes.
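As a rough check on this target, the sketch below compares it with the control-set result. The function name `target_error_rate` is hypothetical. The inputs are the 28 mistakes in 1,057 words from the control set and the average sentence length of 21 words.

```python
def target_error_rate(mistakes, sentences, words_per_sentence):
    """Word-level error rate implied by `mistakes` per `sentences` sentences."""
    return mistakes / (sentences * words_per_sentence)


# Current control-set rate: total mistakes divided by total words.
current = 28 / 1057

# Target: at most 1 mistake in 10 sentences of about 21 words each.
target = target_error_rate(mistakes=1, sentences=10, words_per_sentence=21)

# Factor by which the error rate still has to fall.
reduction = current / target

if __name__ == "__main__":
    print(f"current = {100 * current:.2f}%")
    print(f"target  = {100 * target:.2f}%")
    print(f"needed reduction factor = {reduction:.1f}")
```

Under these assumptions the word-level error rate would have to fall by a factor of between five and six.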
I will continue example testing, but I will also do theoretical work to outline the linearly independent test cases, so that I can catch cases I might not encounter by chance.
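For the example testing part, one possible way to keep drawing control samples from different parts of the book is to split the sentence sequence into equal regions and pick one window of ten sentences in each. This is a sketch only: `pick_windows` is a hypothetical helper, and the total of 45,000 sentences is an assumed figure, not the real size of the book. It spreads the samples out but does not replace the theoretical work on independent cases.

```python
import random


def pick_windows(total_sentences, n_windows, window=10, seed=0):
    """Pick one window of `window` consecutive sentences from each of
    `n_windows` equal-sized regions of the book.

    Returns (min, max) pairs with max - min == window, the same convention
    as the sample lists above.
    """
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    region = total_sentences // n_windows
    if region < window:
        raise ValueError("regions are smaller than the window size")
    windows = []
    for i in range(n_windows):
        first_start = i * region
        last_start = (i + 1) * region - window  # keeps the window in region i
        start = rng.randint(first_start, last_start)  # inclusive on both ends
        windows.append((start, start + window))
    return windows


if __name__ == "__main__":
    # 45000 is an assumed sentence count, used only for illustration.
    for lo, hi in pick_windows(total_sentences=45000, n_windows=5):
        print(f"min={lo}; max={hi}")
```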