“To generate a summary, we need a brief, accurate representation of the contents of a given electronic text document.”
At present, most students and researchers face the following problem:
It is very difficult for students and researchers to read every newly published paper to keep up with the latest progress, and when they begin a new research project they must spend a long time reading the papers related to it. The main objective of the proposed project is to develop a domain-independent, automatic text extraction system that addresses this problem.
We score the sentences of the given text both linguistically and statistically, without using NLP, and generate a summary made up of the highest-scoring sentences. The program reads its input from a text file and writes the summary to another text file. Our main task was to develop a scoring algorithm that works efficiently and produces good results for a wide range of text types. The only way to arrive at it was to summarize texts manually and then check the chosen sentences for common traits, which were then translated into program logic.
SCORING ALGORITHM:
Our program works on the logic described below:
a. WORD SCORING:
1. Stop Words: It is almost impossible to write English text without these insignificant words. They are ignored when scoring sentences because they convey no real idea of the theme of the text. Some examples are: I, a, an, of, am, the, et cetera.
2. Cue Words: Cue words are the words used when concluding a passage of text or introducing a summary. They are useful when scoring sentences that provide closure to a given subject. A few examples are: thus, hence, summary, conclusion, etc.
3. Basic Dictionary Words: Without these words, a sentence in English would have little meaning. A set of about 850 of the most commonly used English words is defined as the Basic Dictionary words. These words play an important role in creating a sensible summary and form the backbone of our algorithm, so they are very useful when scoring sentences.
4. Proper Nouns: In most cases, proper nouns form the central theme of a given text. They are very difficult to identify without linguistic methods, but in many cases we identified them successfully. They are very important when scoring sentences because they carry much of the meaning of the summary.
5. Keywords: The user can supply a particular word, the keyword, on which the generated summary should focus. In the absence of NLP, we do our best to produce a result based on this keyword.
6. Word Frequency: After the basic scores have been allotted, the final score of each word is calculated from its frequency of occurrence in the document. Words that are repeated more often receive higher importance because they leave a stronger impression of the context.
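The word-scoring rules above can be sketched roughly as follows. This is only an illustration under stated assumptions: the weights, the tiny word lists, the capitalisation test for proper nouns, and the choice to multiply the basic score by the frequency are all hypothetical, since the description does not give the actual values or formula.

```python
import re
from collections import Counter

# Illustrative word lists; the real lists would be much longer.
STOP_WORDS = {"i", "a", "an", "of", "am", "the", "is", "in"}
CUE_WORDS = {"thus", "hence", "summary", "conclusion"}
BASIC_WORDS = {"water", "make", "book", "go"}  # stand-in for the 850-word list

# Assumed category weights, not taken from the project description.
KEYWORD_WEIGHT = 4.0
CUE_WEIGHT = 3.0
PROPER_WEIGHT = 2.0
BASIC_WEIGHT = 1.0
OTHER_WEIGHT = 0.5


def base_score(word, is_proper, keyword=None):
    """Return the assumed basic score of one lower-cased word."""
    if word in STOP_WORDS:
        return 0.0  # stop words are ignored
    if keyword is not None and word == keyword.lower():
        return KEYWORD_WEIGHT
    if word in CUE_WORDS:
        return CUE_WEIGHT
    if is_proper:
        return PROPER_WEIGHT
    if word in BASIC_WORDS:
        return BASIC_WEIGHT
    return OTHER_WEIGHT


def word_scores(text, keyword=None):
    """Map each lower-cased word to basic score times frequency (assumed rule)."""
    tokens = re.findall(r"[A-Za-z]+", text)
    freq = Counter(t.lower() for t in tokens)
    # Crude proper-noun guess: a word that never appears starting in lower case.
    seen_lowercase = {t.lower() for t in tokens if t[0].islower()}
    return {
        word: base_score(word, word not in seen_lowercase, keyword) * count
        for word, count in freq.items()
    }
```

Checking the categories in a fixed order means a capitalised cue word at the start of a sentence, such as "Thus", is still treated as a cue word rather than as a proper noun.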
b. SENTENCE SCORING:
1. Primary Score: Using the methods above we calculate the final score of each word, and the sum of the word scores in a sentence gives the sentence score. This gives long sentences a big advantage over shorter ones, which may be equally important.
2. Final Score: We obtain the final score by multiplying the primary score by the ratio “average length / current length”, which nullifies the above drawback to a large extent. The most important aspect has been the successful merger of definition-based and frequency-based categorization of words into one efficient algorithm that generates as complete a summary as possible for a given meaningful text.
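As a minimal sketch of the two sentence scores, assuming that sentence length is measured in words and that sentences are split naively on ".", "!" and "?" (neither detail is stated in the description), the calculation might look like this. The function names and the word_score dictionary are hypothetical.

```python
import re


def split_sentences(text):
    """Naive sentence splitter; real text would need more careful handling."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_scores(text, word_score):
    """Return (sentence, final_score) pairs in document order.

    word_score maps a lower-cased word to its score; unknown words count as 0.
    """
    scored = []
    for sentence in split_sentences(text):
        tokens = re.findall(r"[A-Za-z]+", sentence.lower())
        if tokens:
            scored.append((sentence, tokens))
    if not scored:
        return []
    # Average sentence length in words (assumed unit of length).
    average = sum(len(tokens) for _, tokens in scored) / len(scored)
    results = []
    for sentence, tokens in scored:
        primary = sum(word_score.get(w, 0.0) for w in tokens)
        # Final score = primary score * (average length / current length).
        final = primary * average / len(tokens)
        results.append((sentence, final))
    return results


def summarize(text, word_score, n=1):
    """Keep the n highest-scoring sentences, in their original order."""
    results = sentence_scores(text, word_score)
    ranked = sorted(results, key=lambda pair: pair[1], reverse=True)
    top = {sentence for sentence, _ in ranked[:n]}
    return [sentence for sentence, _ in results if sentence in top]
```

With this normalization, a short sentence whose words all score well can outrank a long sentence that accumulates a larger primary score only because it contains more words.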
OPTIMIZATION:
1. We take the 850 Basic English words as input from a file, sort them lexicographically, and then use binary search on the sorted list, which takes O(log n) time per lookup.
2. We store the entered text in two types of data structures:
a. Red Black Tree
b. Hash Table
We have also compared these two types of data structures.
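The lookup step can be illustrated with a short sketch. It assumes the word list has already been read into memory (the file handling is omitted), and the function names are hypothetical. Python's standard library has no red-black tree, so only the hash-table side of the text storage is shown, using a plain dict.

```python
from bisect import bisect_left


def load_sorted_words(words):
    """Lower-case, de-duplicate and sort the words lexicographically."""
    return sorted({w.lower() for w in words})


def contains_binary(sorted_words, word):
    """Binary search on the sorted list: O(log n) comparisons per lookup."""
    target = word.lower()
    i = bisect_left(sorted_words, target)
    return i < len(sorted_words) and sorted_words[i] == target


def count_words(tokens):
    """Store word counts in a hash table (dict): average O(1) per update."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts
```

A balanced tree such as a red-black tree would give O(log n) insertions and lookups while keeping the words in sorted order, whereas the hash table gives average constant-time access without any ordering.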