I had indicated that I tested my solution with 30 sentences from KANDEL’s book.
I corrected all the mistakes of my English parser program by working on these
examples. Here is the statistical information about them.
Each entry gives the range of sentence sequence numbers in the book (min to max),
followed by the number of words and sentences in that range.
min=20890; max=20895; 94 words; 5 sentences
min=20885; max=20890; 82 words; 5 sentences
min=20880; max=20885; 70 words; 5 sentences
min=20875; max=20880; 135 words; 5 sentences
min=15035; max=15045; 228 words; 10 sentences
A control set test is necessary to check how the program behaves when it is
used to parse ‘new’ or unexpected texts. This shows how closely the program has
been fitted to the examples it was built on. When a solution is produced using a
fixed number of test cases, its success depends on the quality and the quantity
of those cases. If you select too few cases, your solution will not work. Even
if you choose many cases, they have to be linearly independent so that together
they cover the input space as a whole.
I chose examples from different parts of KANDEL’s book so that they were
possibly written by different authors.
min=15062; max=15072; 227 words; 10 sentences; 5 mistakes; 2.2%
min=15125; max=15135; 200 words; 10 sentences; 5 mistakes; 2.5%
min=33948; max=33958; 245 words; 10 sentences; 8 mistakes; 3.3%
min=2679; max=2689; 153 words; 10 sentences; 5 mistakes; 3.3%
min=42150; max=42160; 232 words; 10 sentences; 5 mistakes; 2.2%
The result: about 2.6% of the words (28 of 1,057) are parsed wrong.
More importantly, almost half of the sentences have at least one mistake,
and possibly only one.
The average sentence length is 21 words. Shorter sentences tend to have fewer errors.
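
The figures above can be rechecked with a few lines of Python. This is only a sketch: the variable and function names are my own for illustration, and the pooled rate divides total mistakes by total words, which is one of several ways to average and need not match a figure obtained differently.

```python
# Control-set samples as listed above:
# (min, max, words, sentences, mistakes)
samples = [
    (15062, 15072, 227, 10, 5),
    (15125, 15135, 200, 10, 5),
    (33948, 33958, 245, 10, 8),
    (2679, 2689, 153, 10, 5),
    (42150, 42160, 232, 10, 5),
]


def error_rate(words, mistakes):
    """Return the percentage of words that were parsed wrong."""
    return 100.0 * mistakes / words


# Per-sample error rate, rounded to one decimal place.
per_sample = [round(error_rate(w, m), 1) for _, _, w, _, m in samples]

# Pooled totals across all five samples.
total_words = sum(s[2] for s in samples)
total_sentences = sum(s[3] for s in samples)
total_mistakes = sum(s[4] for s in samples)

pooled_rate = error_rate(total_words, total_mistakes)
average_sentence_length = total_words / total_sentences

print("per-sample %:", per_sample)
print("totals:", total_words, "words,", total_sentences, "sentences,",
      total_mistakes, "mistakes")
print("pooled %:", round(pooled_rate, 2))
print("average sentence length:", round(average_sentence_length, 1))
```

Pooling weights each sample by its word count, so a short sample with a high error rate does not dominate the overall figure.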
This shows that my model has to be improved so that there is at most 1 mistake
in ten sentences. With about 21 words per sentence, this corresponds to
approximately 1/200 = 0.5% of the words.
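
The target can be worked through in a small Python sketch. The function names are hypothetical, and the second function rests on a strong simplifying assumption that is not part of my measurements: every word is parsed wrong independently with the same probability. Under that assumption it estimates what share of sentences would contain at least one mistake.

```python
AVG_SENTENCE_LENGTH = 21  # words per sentence, as reported in the text


def word_error_rate(mistakes, sentences, avg_sentence_length):
    """Percentage of wrong words for a given number of mistakes and sentences."""
    return 100.0 * mistakes / (sentences * avg_sentence_length)


def share_of_sentences_with_error(word_error_pct, sentence_length):
    """Fraction of sentences with at least one mistake.

    ASSUMPTION: each word fails independently with the same probability.
    Real parser errors are probably correlated, so this is only a rough guide.
    """
    p = word_error_pct / 100.0
    return 1.0 - (1.0 - p) ** sentence_length


# Target: at most 1 mistake in 10 sentences.
target_pct = word_error_rate(1, 10, AVG_SENTENCE_LENGTH)

# Observed: 28 mistakes in 1057 words, summed over the five control samples.
observed_pct = 100.0 * 28 / 1057

print("target word error %:", round(target_pct, 2))
print("observed word error %:", round(observed_pct, 2))
print("share of sentences with an error at target rate:",
      round(share_of_sentences_with_error(target_pct, AVG_SENTENCE_LENGTH), 2))
print("share of sentences with an error at observed rate:",
      round(share_of_sentences_with_error(observed_pct, AVG_SENTENCE_LENGTH), 2))
```

If the independence assumption roughly holds, the observed word error rate is in line with almost half of the sentences containing a mistake, and the target rate with roughly one sentence in ten.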
I will continue testing with examples, but I will also do theoretical work to
outline the linearly independent test cases, so that I can catch cases that I
might not encounter by chance.