Monday, 10 February 2014
How to determine the subject of a Turkish sentence
The subject is a noun, proper noun, noun phrase, adjective phrase, or pronoun.
The subject is the first occurrence of one of these in a sentence or sub-sentence.
A sentence consists of one or more sub-sentences and may or may not contain a verb.
Example output of 100 sentences:
USER REQUIREMENTS:
1- Detect sub-sentences.
2- Detect beginning of subject and accumulate the words that form it.
3- Detect the end of the phrase and print SUBJECT information.
4- Print the words of the sentence in the order of input.
5- DB access management
1- Detect sub-sentences.
1.1- A sub-sentence begins after a conjunction or one of ',;:'.
1.2- 'and' and ',' may be used between nouns or adjectives without indicating a sub-sentence.
1.3- A sub-sentence begins at a comma if the previous word is a standard verb.
1.4- A sub-sentence begins, with or without a comma, if the previous word is a verb derived from a noun/proper noun/adjective/adverb.
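Rules 1.1 to 1.4 can be sketched roughly as follows. This is only an illustration, not the program described in this post: the `(word, tag)` token shape, the tag names (`conj`, `verb_derived`, ...) and the function name are assumptions, and a separator is kept with the sub-sentence it closes so that no word is lost.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, hypothetical POS tag)

NOMINAL_TAGS = {"noun", "propn", "adj"}
LIST_JOINERS = {"ve", ","}      # 've' is Turkish for 'and' (rule 1.2)
PUNCT_BREAKS = {",", ";", ":"}  # rule 1.1


def split_sub_sentences(tokens: List[Token]) -> List[List[Token]]:
    """Split a tagged sentence into sub-sentences (rules 1.1 to 1.4)."""
    subs: List[List[Token]] = []
    current: List[Token] = []

    def close() -> None:
        nonlocal current
        if current:
            subs.append(current)
            current = []

    for i, (word, tag) in enumerate(tokens):
        prev_tag = tokens[i - 1][1] if i > 0 else None
        next_tag = tokens[i + 1][1] if i + 1 < len(tokens) else None
        is_separator = tag == "conj" or word in PUNCT_BREAKS
        # Rule 1.2: 've' or ',' between nouns/adjectives only joins a list.
        joins_list = (word in LIST_JOINERS
                      and prev_tag in NOMINAL_TAGS
                      and next_tag in NOMINAL_TAGS)
        if is_separator and not joins_list:
            if not current and subs:
                # The sub-sentence was already closed by a derived verb.
                subs[-1].append((word, tag))
            else:
                current.append((word, tag))
                close()
        else:
            current.append((word, tag))
            # Rule 1.4: a derived verb closes the sub-sentence, comma or not.
            if tag == "verb_derived":
                close()
    close()
    return subs
```

Rule 1.3 needs no special case here: a comma that follows a verb is never a list joiner, so it always closes the sub-sentence.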
2- Detect beginning of subject and accumulate the words that form it.
2.1- A subject begins with a simple noun without an extension (suffix), a noun with an extension from the '-i' family,
a pronoun, or a nounFromVerb; exceptions from other word types and dictionary-related exceptions also apply.
2.2- A subject continues with noun/proper noun/adjective/pronoun/nounFromVerb/adjectiveFromVerb words with some exceptions.
2.3- Conjunctions and ',;:' have to be handled when met during the Subject element scan.
2.4- If 'and' or ',' is met in the subject between valid subject elements, it has to be added to the subject.
2.5- A plural noun ends the phrase.
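A minimal sketch of rules 2.1 to 2.5, under stated assumptions: the `Word` type, the tag names and the suffix labels (`i_family`, `plural`) are hypothetical. Letting a plural noun start a subject is also an assumption, made so that one-word subjects are found. The exception lists and the end-of-phrase tests of section 3 are left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Word:
    text: str
    tag: str                        # hypothetical POS tag
    suffixes: Tuple[str, ...] = ()  # hypothetical suffix labels


START_TAGS = {"noun", "propn", "pron", "noun_from_verb"}       # rule 2.1
CONTINUE_TAGS = START_TAGS | {"adj", "adj_from_verb"}          # rule 2.2
START_SUFFIXES = {"i_family", "plural"}                        # assumption
JOINERS = {"ve", ","}               # 've' is Turkish for 'and' (rule 2.4)


def can_start(word: Word) -> bool:
    """Rule 2.1: bare word, or one carrying only allowed suffixes."""
    if word.tag not in START_TAGS:
        return False
    return not word.suffixes or set(word.suffixes) <= START_SUFFIXES


def accumulate_subject(words: List[Word]) -> List[Word]:
    """Collect the words of the first subject candidate in a sub-sentence."""
    i = 0
    while i < len(words) and not can_start(words[i]):
        i += 1
    subject: List[Word] = []
    while i < len(words):
        word = words[i]
        if word.tag in CONTINUE_TAGS:
            subject.append(word)
            # Rule 2.5: a plural noun ends the phrase.
            if word.tag == "noun" and "plural" in word.suffixes:
                break
        elif (word.text in JOINERS and subject
              and i + 1 < len(words) and words[i + 1].tag in CONTINUE_TAGS):
            # Rule 2.4: keep 've' / ',' only between valid subject elements.
            subject.append(word)
        else:
            break
        i += 1
    return subject
```

For example, with "dün Ali ve çocuklar geldi" tagged as adverb, proper noun, conjunction, plural noun and verb, the function returns the words "Ali ve çocuklar".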
3- Detect the end of the phrase and print SUBJECT information.
3.1- The end of a phrase is detected when a word with an extension is met.
3.2- The reaction is determined according to the type of the word and its extension.
3.3- If the word is not a noun/proper noun/valid pronoun/valid nounFromVerb/valid phraseEndingAdverb,
or if it has no word extension (with exceptions),
it is the first word after the end of the subject.
//ÖZNE =====> tavşanlar
Kumluk kıyılarda sessiz,
SESSİZ(adverb) OTURAN(adjFromVerb) tavşanlar(noun),
taştan(noun) yontulmuş küçük,
boz heykellere benziyorlardı.
3.4- If the word is a noun/proper noun/pronoun/nounFromVerb/validPhraseEndingAdverb
and the previous word is a noun/proper noun/valid pronoun
and the previous word has a valid phrase-ending extension (with exceptions) or is a valid nounFromVerb
and the current word is a valid phrase-ending word or has multiple extensions listed in the exceptions,
then this word ends the previous phrase, hence the subject is not valid.
//büyük asfalt vilâyet şosesİ tarafLARInda //alçak bir dalI önÜnde
else this is not a valid word, so it does not belong to the previous phrase, hence the subject is valid.
//bazı kızlar önümüzDEN
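One possible reading of rule 3.4, written as a small predicate. Everything here is an assumption made for illustration: the `Word` type, the suffix labels (`poss`, `loc`, `abl`, `dat`) and the choice of the possessive as the "valid phrase ending extension". The exception lists mentioned in the rule are not modelled.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Word:
    text: str
    tag: str                        # hypothetical POS tag
    suffixes: Tuple[str, ...] = ()  # hypothetical suffix labels


NOMINAL_TAGS = {"noun", "propn", "pron", "noun_from_verb"}
LINKING_SUFFIXES = {"poss"}            # e.g. şose-si, dal-ı (assumed)
CASE_SUFFIXES = {"loc", "abl", "dat"}  # e.g. taraf-lar-ı-nda (assumed)


def subject_survives(prev: Word, cur: Word) -> bool:
    """Return False when `cur` takes over the phrase that ended at `prev`."""
    prev_links = (prev.tag in NOMINAL_TAGS
                  and bool(LINKING_SUFFIXES & set(prev.suffixes)))
    cur_ends = (cur.tag in NOMINAL_TAGS
                and bool(LINKING_SUFFIXES & set(cur.suffixes))
                and bool(CASE_SUFFIXES & set(cur.suffixes)))
    # Both conditions hold: the candidate is part of a larger phrase
    # ("şosesi taraflarında"), so it is not the subject.
    return not (prev_links and cur_ends)
```

With these labels, "şosesi taraflarında" rejects the candidate, while "kızlar önümüzden" keeps "kızlar" as the subject because "kızlar" carries no linking extension.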
3.5- The word 'bir' ('a') has to be handled as an exception in all of these requirements: 'bir' is both a noun and
an adjective in Turkish. The postposition 'gibi' ('like') also requires special handling.
// bir yandan da bir at gibi
3.6- Nouns and adjectives can be interleaved in Turkish, so a function is needed to exclude
adjectives at the end of the Subject.
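The function asked for in 3.6 could look like the sketch below. The `(word, tag)` token shape and the tag name `adj` are assumptions. Dropping a 've' or ',' left dangling at the end is an extra step added here, not something the requirement states.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, hypothetical POS tag)

JOINERS = {"ve", ","}    # 've' is Turkish for 'and'


def trim_trailing_adjectives(subject: List[Token]) -> List[Token]:
    """Drop adjectives (and dangling joiners) from the end of a subject."""
    end = len(subject)
    while end > 0:
        word, tag = subject[end - 1]
        if tag == "adj" or word in JOINERS:
            end -= 1
        else:
            break
    return subject[:end]
```

Adjectives in front of or between nouns are left alone; only the tail of the accumulated subject is trimmed.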
4- Print the words of the sentence in the order of input.
4.1- At the end of all the tests if subject is not printed for this sub-sentence yet, then
print the Subject.
4.2- Print the current word's info.
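Requirement 4 can be sketched as a loop that emits each word in input order and prints the subject exactly once. The function name, the `known_at` parameter (index of the word at which the subject became known, or `None`) and the `word(tag)` output shape are assumptions. The `//ÖZNE =====>` prefix is the one used in the example above; ÖZNE is Turkish for "subject".

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, hypothetical POS tag)


def render_sub_sentence(words: List[Token],
                        subject: List[str],
                        known_at: Optional[int]) -> List[str]:
    """Return the output lines for one sub-sentence, in input order."""
    lines: List[str] = []
    printed = False
    for i, (word, tag) in enumerate(words):
        # 4.1: print the subject once, as soon as it is known.
        if subject and not printed and known_at is not None and i >= known_at:
            lines.append("//ÖZNE =====> " + " ".join(subject))
            printed = True
        # 4.2: print the current word's info.
        lines.append(f"{word}({tag})")
    # 4.1: if the subject was never printed during the scan, print it now.
    if subject and not printed:
        lines.append("//ÖZNE =====> " + " ".join(subject))
    return lines
```

Returning the lines instead of printing them directly keeps the sketch easy to check; a caller can simply print each returned line.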
5- DB access management
5.1- Get sentence and sentence struct info.
5.2- Manage the input sentence in a buffer of arrays.
5.3- Buffer listing utilities.
5.4- Main program driver to read each sentence and process them.
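The post does not say which database or schema is used, so the following is only a sketch of items 5.1 to 5.4 using Python's built-in `sqlite3` module. The table `sentences(id, text)`, the whitespace tokenisation and all function names are hypothetical, and the sentence struct info of 5.1 is not modelled.

```python
import sqlite3
from typing import Callable, Iterator, List, TypeVar

T = TypeVar("T")


def fetch_sentences(conn: sqlite3.Connection) -> Iterator[str]:
    """5.1: read sentences in storage order (hypothetical schema)."""
    for (text,) in conn.execute("SELECT text FROM sentences ORDER BY id"):
        yield text


def to_buffer(sentence: str) -> List[str]:
    """5.2: hold the sentence as an array of words."""
    return sentence.split()


def list_buffer(buffer: List[str]) -> str:
    """5.3: one-line listing of the buffer with word positions."""
    return " | ".join(f"{i}:{word}" for i, word in enumerate(buffer))


def process_all(conn: sqlite3.Connection,
                handler: Callable[[List[str]], T]) -> List[T]:
    """5.4: driver that reads each sentence and hands it to `handler`."""
    return [handler(to_buffer(sentence)) for sentence in fetch_sentences(conn)]
```

The `handler` argument stands in for the subject detection of requirements 1 to 4, which keeps the database code separate from the linguistic rules.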