Friday, 9 May 2014

LANGANA-E English Language Parser project progresses May 2014

This is part of a dictionary that indicates only the types of English words.  At this stage only the words beginning with the letters W-X-Y-Z are included.  The other letters will be posted as the work progresses.

This effort is part of an ongoing process that runs in parallel with my Turkish language processing package LANGANA.  I have two aims for LANGANA.  The first is to make a program that reads texts, then parses and converts them into a pseudo-language output which it can later use to answer questions about the text.  The second is to make a quality translation engine for Turkish-English and vice versa.

I parsed the last 30 000 lines of the Webster dictionary, which is publicly available.  Then I wrote a small converter to extract the word names and types.  My parser is approx. 1000 lines long.  In the beginning I progressed by 30-40 successfully parsed lines at a time, and it took many hours to do this.  Recently I have seen 2000 lines parsed successfully in a matter of 10 minutes.  I am looking forward to more improvements and to finishing this dictionary in a couple of months at most.
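
To illustrate the kind of extraction described above, here is a minimal sketch in Python.  It is not the LANGANA-E parser itself.  It assumes a layout in which each headword stands alone on a line in capitals and the following non-empty line carries a type abbreviation such as "n." or "v. t." after a comma.  The function name, the regular expressions and the sample lines are all made up for the illustration.

```python
import re

# Assumed layout (hypothetical): a headword alone on a line in capitals,
# then a line such as 'Wa"ter, n. [AS. ...]' carrying the type abbreviation.
HEADWORD = re.compile(r"^[A-Z][A-Z' -]*$")
WORD_TYPE = re.compile(r",\s*(n|v\. t|v\. i|a|adv|prep|conj|interj|pron)\.")

def extract_word_types(lines):
    """Map each headword (lower-cased) to the set of type abbreviations found."""
    result = {}
    current = None
    expect_type = False
    for raw in lines:
        line = raw.strip()
        if HEADWORD.match(line):
            # A new entry starts; the same word may appear more than once.
            current = line.lower()
            result.setdefault(current, set())
            expect_type = True
        elif expect_type and line:
            # Only the first non-empty line after the headword is inspected.
            match = WORD_TYPE.search(line)
            if match:
                result[current].add(match.group(1))
            expect_type = False
    return result

# Made-up sample lines in the assumed layout.
sample = [
    "WATER",
    'Wa"ter, n. [AS. ...]',
    "",
    "The fluid which descends from the clouds in rain.",
    "",
    "WATER",
    'Wa"ter, v. t.',
    "",
    "To wet or supply with water.",
    "",
    "YAWN",
    "Yawn, v. i.",
]
result = extract_word_types(sample)
```

A real dictionary file has many irregular entries, so a sketch like this would only cover the simplest cases; that is why the number of successfully parsed lines grows slowly at first.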

The second group of letters, namely S-T-U-V, has been added.  This has been a considerable endeavour, as these letters are explained in approx. 240 000 lines in Webster (1910 version).  My current parser parses approx. 270 000 lines and lists the word types of 25 000 - 30 000 English words.  The whole of Webster is 1 000 000 lines.  I have reached a point of saturation in the development of the parser, and it has become fairly straightforward, if not easy, to proceed.  I am looking forward to finishing the parser in 1-2 months' time.

After the parser is finished I will do fine tuning to decide which items will be included in the output.  I will then put the output into a MySQL database and proceed with the rest of my plans.
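
As a rough sketch of the database step, the Python example below stores word/type pairs in a table.  It uses the sqlite3 module from the standard library as a stand-in for MySQL so that it runs without a server.  The table name, the column names and the sample data are assumptions made for the illustration, not the final design.

```python
import sqlite3

def store_word_types(conn, word_types):
    """Insert (word, type) pairs into a table and return the number of pairs."""
    # Hypothetical schema: one row per word/type pair.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_type ("
        "word TEXT NOT NULL, type TEXT NOT NULL, "
        "PRIMARY KEY (word, type))"
    )
    rows = [(word, t) for word, types in word_types.items() for t in sorted(types)]
    # INSERT OR IGNORE is SQLite syntax; it skips pairs that are already stored.
    conn.executemany(
        "INSERT OR IGNORE INTO word_type (word, type) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(rows)

# Made-up sample data, stored in an in-memory database.
conn = sqlite3.connect(":memory:")
count = store_word_types(conn, {"water": {"n", "v. t"}, "yawn": {"v. i"}})
water_types = conn.execute(
    "SELECT type FROM word_type WHERE word = ? ORDER BY type", ("water",)
).fetchall()
```

With MySQL the statements would need small changes, for example INSERT IGNORE in place of INSERT OR IGNORE, and the placeholder style depends on the driver that is used.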

I will make the output publicly available, as the Webster 1910 version is, but I will provide the letter S by e-mail only, to requests clearly identified as non-profit.