Monday 10 February 2014

How to determine the subject of a Turkish sentence


The subject is a noun, proper noun, noun phrase, adjective phrase, or pronoun.

The subject is the first occurrence of one of these in a sentence or sub-sentence.

A sentence consists of one or more sub-sentences and may or may not contain a verb.

Example output of 100 sentences:


USER REQUIREMENTS:
1- Detect sub-sentences.
2- Detect beginning of subject and accumulate the words that form it.
3- Detect the end of the phrase and print SUBJECT information.
4- Print the words of the sentence in the order of input.
5- DB access management

1- Detect sub-sentences.
1.1- A sub-sentence begins after a conjunction or one of ',;:'.
1.2- 'and' and ',' may be used between nouns or adjectives without indicating a new sub-sentence.
1.3- A sub-sentence begins at a comma if the previous word is a standard verb.
1.4- A sub-sentence begins, with or without a comma, if the previous word is a verb derived from a noun/proper noun/adjective/adverb.
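
Rules 1.1 and 1.2 can be sketched roughly as follows. This is only an illustration, not the program described in this post: the (word, tag) token format, the tag names, and the function name split_sub_sentences are all assumptions. Rules 1.3 and 1.4, which depend on the verb type, are left out, and the separator is kept at the end of the sub-sentence it closes so that the input order is preserved.

```python
# Hypothetical token format: (word, tag). The tag names are assumptions.
SEPARATORS = {",", ";", ":"}
NOMINAL_TAGS = {"noun", "propernoun", "adjective"}
LIST_JOINERS = {",", "ve"}  # 've' is Turkish for 'and'


def split_sub_sentences(tokens):
    """Split a tagged sentence into sub-sentences (rules 1.1 and 1.2 only)."""
    sub_sentences, current = [], []
    for i, (word, tag) in enumerate(tokens):
        current.append((word, tag))
        # Only separators and conjunctions can close a sub-sentence (rule 1.1).
        if word not in SEPARATORS and tag != "conjunction":
            continue
        prev_tag = tokens[i - 1][1] if i > 0 else None
        next_tag = tokens[i + 1][1] if i + 1 < len(tokens) else None
        # Rule 1.2: 've' or ',' between nouns/adjectives joins a list instead.
        joins_list = (
            word in LIST_JOINERS
            and prev_tag in NOMINAL_TAGS
            and next_tag in NOMINAL_TAGS
        )
        if not joins_list:
            sub_sentences.append(current)
            current = []
    if current:
        sub_sentences.append(current)
    return sub_sentences


example = [
    ("Ali", "propernoun"), ("ve", "conjunction"), ("Ayşe", "propernoun"),
    ("geldi", "verb"), (",", "punctuation"),
    ("sonra", "adverb"), ("gittiler", "verb"),
]
parts = split_sub_sentences(example)
print([[word for word, _ in part] for part in parts])
```

Here 've' stays inside the first sub-sentence because it sits between two proper nouns, while the comma after the verb closes it.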

2- Detect beginning of subject and accumulate the words that form it.
2.1- A subject begins with a simple noun without an extension, a noun with an '-i' (family) extension, or a
pronoun/nounFromVerb, with exceptions for other word types or dictionary-related cases.
2.2- A subject continues with noun/proper noun/adjective/pronoun/nounFromVerb/adjectiveFromVerb words with some exceptions.
2.3- Conjunctions and ',;:' have to be handled when met during the Subject element scan.
2.4- If 'and' or ',' is met in the subject between valid subject elements, it has to be added to the subject.
2.5- A plural noun ends the phrase.
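
A minimal sketch of rules 2.1 to 2.5, ignoring every exception mentioned above. The Word class, the tag names, the suffix labels ("" for no extension, "i", "plural") and the function name collect_subject are assumptions made for this illustration. It also assumes that the plural noun of rule 2.5 is kept as the last word of the subject.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    tag: str          # assumed tag names, e.g. "noun", "adjective", "verb"
    suffix: str = ""  # assumed labels: "" = no extension, "i", "plural", ...


STARTERS = {"noun", "propernoun", "pronoun", "nounFromVerb"}
CONTINUERS = STARTERS | {"adjective", "adjectiveFromVerb"}
JOINERS = {"ve", ","}  # 've' is Turkish for 'and'


def collect_subject(words):
    """Return the words of the first subject candidate (rules 2.1-2.5)."""
    subject = []
    i = 0
    # Rule 2.1: skip ahead to an extensionless or '-i' noun-like word.
    while i < len(words) and not (
        words[i].tag in STARTERS and words[i].suffix in ("", "i")
    ):
        i += 1
    while i < len(words):
        word = words[i]
        if word.text in JOINERS:
            # Rule 2.4: keep 've' / ',' only if a valid element follows.
            following = words[i + 1] if i + 1 < len(words) else None
            if following is None or following.tag not in CONTINUERS:
                break
            subject.append(word)
            i += 1
            continue
        if word.tag not in CONTINUERS:
            break  # Rule 2.3: any other conjunction or separator stops the scan.
        subject.append(word)  # Rule 2.2
        if word.tag == "noun" and word.suffix == "plural":
            break  # Rule 2.5: a plural noun ends the phrase.
        i += 1
    return subject


example = [
    Word("küçük", "adjective"), Word("kedi", "noun"),
    Word("ve", "conjunction"), Word("köpekler", "noun", "plural"),
    Word("bahçede", "noun", "locative"), Word("oynuyor", "verb"),
]
subject = collect_subject(example)
print(" ".join(word.text for word in subject))
```

The leading adjective is skipped because, under rule 2.1 as sketched here, only a noun-like word can open the subject.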

3- Detect the end of the phrase and print SUBJECT information.
3.1- The end of a phrase is detected when a word with extension is met.
3.2- The reaction is determined according to the type of the word and its extension.
3.3- If the word is not noun/proper noun/valid pronoun/valid nounFromVerb/valid phraseEndingAdverb
or it has no extension (with exceptions),
it is the first word after the end of the subject.
//ÖZNE (SUBJECT) =====>  tavşanlar
Kumluk kıyılarda sessiz,
SESSİZ(adverb) OTURAN(adjFromVerb) tavşanlar(noun),
taştan(noun) yontulmuş küçük,
boz heykellere benziyorlardı.
3.4- If the word is noun/proper noun/pronoun/nounFromVerb/validPhraseEndingAdverb
and if the previous word is noun/proper noun/valid pronoun
and the previous word has a valid phrase ending extension (with exceptions) or is a valid nounFromVerb
and if the current word is a valid phrase ending word or has multiple extensions listed in the exceptions
then this word ends the previous phrase, hence the subject is not valid
//büyük asfalt vilâyet şosesİ tarafLARInda  //alçak bir dalI önÜnde
else this is not a valid word, so it does not belong to the previous phrase, hence the subject is valid
            //bazı kızlar önümüzDEN
3.5- The word 'a' (bir) has to be handled as an exception in all of these requirements.  Bir is both a noun and
an adjective in Turkish. The postposition 'like' (gibi) also requires special handling.
 // bir yandan  da bir at gibi
3.6- Nouns and adjectives can be interleaved in Turkish.  A function is needed to exclude
 adjectives at the end of the Subject.
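
The function asked for in 3.6 could look roughly like this. It is a sketch under assumptions: the subject is taken to be a list of (word, tag) pairs, the tag name "adjective" and the function name strip_trailing_adjectives are hypothetical, and adjectives derived from verbs are not treated specially.

```python
def strip_trailing_adjectives(subject):
    """Drop adjectives from the end of a subject, keeping interleaved ones."""
    end = len(subject)
    # Walk back from the last word while it is tagged as an adjective.
    while end > 0 and subject[end - 1][1] == "adjective":
        end -= 1
    return subject[:end]


interleaved = [
    ("ev", "noun"), ("büyük", "adjective"),
    ("bahçe", "noun"), ("eski", "adjective"),
]
trimmed = strip_trailing_adjectives(interleaved)
print(" ".join(word for word, _ in trimmed))
```

Only the trailing adjective is removed; the adjective between the two nouns stays, since nouns and adjectives may be interleaved inside the subject.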

4- Print the words of the sentence in the order of input.
4.1- After all the tests, if the subject has not yet been printed for this sub-sentence,
print the Subject.
4.2- Print the current word's info.
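
A small sketch of requirement 4, assuming the sub-sentence is a list of (word, tag) pairs and the subject is already known as a list of words. The function name format_sub_sentence and the exact line layout are assumptions, loosely modelled on the example in 3.3. Lines are returned rather than printed so the caller decides where they go.

```python
def format_sub_sentence(words, subject):
    """Return the output lines for one sub-sentence, words in input order."""
    lines = []
    subject_printed = False
    for word, tag in words:
        # 4.1: emit the subject once per sub-sentence if not done yet.
        if subject and not subject_printed:
            lines.append("SUBJECT =====> " + " ".join(subject))
            subject_printed = True
        # 4.2: emit the current word's info.
        lines.append(f"{word}({tag})")
    return lines


sub_sentence = [("tavşanlar", "noun"), ("benziyorlardı", "verb")]
for line in format_sub_sentence(sub_sentence, ["tavşanlar"]):
    print(line)
```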

5- DB access management
5.1- Get sentence and sentence struct info.
5.2- Manage the input sentence in a buffer of arrays.
5.3- Buffer listing utilities.
5.4- Main program driver to read each sentence and process them.
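
The pieces in 5.1 to 5.4 might fit together as in the sketch below. Nothing here reflects the actual database: the in-memory SQLite database, the `words` table and its columns, and the function names load_sentences, list_buffer, run and demo_connection are all assumptions made for illustration.

```python
import sqlite3


def load_sentences(conn):
    """5.1/5.2: read tagged words and buffer them per sentence, in input order."""
    rows = conn.execute(
        "SELECT sentence_id, word, tag FROM words "
        "ORDER BY sentence_id, position"
    ).fetchall()
    buffer = {}
    for sentence_id, word, tag in rows:
        buffer.setdefault(sentence_id, []).append((word, tag))
    return buffer


def list_buffer(buffer):
    """5.3: return one printable line per buffered sentence."""
    return [
        f"{sentence_id}: " + " ".join(word for word, _ in words)
        for sentence_id, words in buffer.items()
    ]


def run(conn, process):
    """5.4: read each sentence and hand it to the `process` callback."""
    return [process(words) for words in load_sentences(conn).values()]


def demo_connection():
    """Build a throwaway database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE words "
        "(sentence_id INTEGER, position INTEGER, word TEXT, tag TEXT)"
    )
    conn.executemany(
        "INSERT INTO words VALUES (?, ?, ?, ?)",
        [
            (1, 0, "tavşanlar", "noun"), (1, 1, "benziyorlardı", "verb"),
            (2, 0, "kızlar", "noun"), (2, 1, "geldi", "verb"),
        ],
    )
    return conn


conn = demo_connection()
for line in list_buffer(load_sentences(conn)):
    print(line)
```

Passing the per-sentence work in as a callback keeps the driver independent of how sub-sentences and subjects are detected.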