Extractive Text Summarization using NLTK in Python

theb2bnews

April 6, 2021

Let’s first understand what text summarization means. Here is the definition:

“Automatic text summarization is the task of producing a concise and fluent summary while preserving key information content and overall meaning.” (Text Summarization Techniques: A Brief Survey, 2017)

Contributed by: Nitin Kumar

Need for Text Summarization in Python
Approaches used for Text Summarization
Steps for Implementation
Complete Code

Need for Text Summarization in Python:

Today, many organizations, whether online retailers, government and private-sector bodies, catering and tourism businesses, or other institutions that offer customer services, care about their customers and ask for feedback every time we use their services. These companies may receive enormous amounts of user feedback every day, and it would be tedious for management to sit down and analyze each item.

Technology has now reached a point where machines can take over many such tasks. The field that makes this possible is Machine Learning, and with Natural Language Processing (NLP) machines have become capable of working with human language. Text analytics is an active area of research today.

One such application of text analytics and NLP is a text summarizer, which summarizes and shortens the text in user feedback. An algorithm reduces a body of text while keeping its original meaning, or at least giving good insight into the original text.

There are broadly two different approaches that are used for text summarization:

Extractive Summarization
Abstractive Summarization

Extractive Summarization:

In this approach, we identify the important sentences or phrases in the original text and extract only those. The extracted sentences form our summary.

Abstractive Summarization:

In this approach, we generate new sentences from the original text. This is in contrast to the extractive approach described above. A sentence generated through the abstractive approach might not even be present in the original text.

We will focus on extractive methods, which work by identifying the important sentences or excerpts in the text and reproducing them verbatim as part of the summary. No new text is generated; only existing text is used in the summarization process.

Steps for Implementation:

Step 1: Import Required Libraries – Two NLTK modules are needed to build the text summarizer: one for the corpus of stop words and one for the tokenizers.

Terms Used:

Corpus
A corpus is a collection of text. It could be any data set containing text, such as the poems of a certain poet or the collected works of a certain author. In this case, we are going to use a data set of predetermined stop words.
Tokenizers
A tokenizer divides a text into a series of tokens. There are three main tokenizers: word, sentence, and regex. We will only use the word and sentence tokenizers.
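
As a rough sketch of Step 1, the imports below pull in the stop-word corpus and the two tokenizers named above. The helper names split_sentences and split_words and the regex fallbacks are assumptions added for illustration, so that the snippet still runs when NLTK or its data files (fetched with nltk.download) are missing; they are not part of NLTK.

```python
import re

try:
    # The two NLTK modules Step 1 refers to: corpus and tokenize.
    from nltk.corpus import stopwords  # stop-word lists, used in Step 2
    from nltk.tokenize import sent_tokenize, word_tokenize
except ImportError:  # NLTK is not installed
    stopwords = sent_tokenize = word_tokenize = None


def split_sentences(text):
    """Split text into sentences with NLTK, or with a naive regex fallback."""
    if sent_tokenize is not None:
        try:
            return sent_tokenize(text)
        except LookupError:  # tokenizer data has not been downloaded
            pass
    # Fallback (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence):
    """Split a sentence into word and punctuation tokens."""
    if word_tokenize is not None:
        try:
            return word_tokenize(sentence)
        except LookupError:  # tokenizer data has not been downloaded
            pass
    # Fallback (assumption): runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)


text = "NLTK is a toolkit. It tokenizes text."
sentences = split_sentences(text)
print(sentences)
print(split_words(sentences[1]))
```

The regex fallbacks are much cruder than NLTK's tokenizers (they will mis-split abbreviations such as "Dr."), so they are only a stand-in for quick experiments.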

Step 2: Remove stop words and store the remaining words in a separate list

Stop Word

A stop word is any word, such as is, a, an, the, or for, that adds little to the meaning of a sentence. For example, let’s say we have the sentence:

Greatlearning is one of the most useful websites for datascience aspirants.

After removing stop words, we can narrow the number of words and preserve the meaning as follows:

['Greatlearning', 'one', 'useful', 'websites', 'datascience', 'aspirants', '.']
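
The filtering in Step 2 might look like the sketch below. The function names are hypothetical; load_stop_words wraps NLTK's stopwords.words("english") and needs the stop-word corpus to be downloaded first, so the demonstration uses a tiny hand-written stop-word set (an assumption, not the real NLTK list) and tokens typed out by hand.

```python
def load_stop_words():
    """Return NLTK's English stop words as a set.

    Requires the corpus to be present: nltk.download("stopwords").
    """
    from nltk.corpus import stopwords
    return set(stopwords.words("english"))


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]


# Tokens of the example sentence, written out by hand for this demo.
tokens = ["Greatlearning", "is", "one", "of", "the", "most", "useful",
          "websites", "for", "datascience", "aspirants", "."]

# Illustrative subset only; in practice pass load_stop_words() instead.
demo_stop_words = {"is", "a", "an", "the", "for", "of", "most"}

filtered = remove_stop_words(tokens, demo_stop_words)
print(filtered)
# ['Greatlearning', 'one', 'useful', 'websites', 'datascience', 'aspirants', '.']
```

Holding the stop words in a set keeps each membership check cheap, which matters once the input grows beyond a single sentence.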

Step 3: Create a frequency table of words

A Python dictionary keeps a record of how many times each word appears in the text after the stop words are removed. We can then apply this dictionary to every sentence to find out which sentences carry the most relevant content in the overall text.
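
One possible way to build the frequency table is sketched below. The function name build_frequency_table is hypothetical, and lowercasing the words and skipping punctuation tokens are assumptions made here so that "Summaries" and "summaries" count as the same word.

```python
from collections import Counter


def build_frequency_table(tokens, stop_words):
    """Count how often each non-stop word occurs, ignoring case."""
    counts = Counter()
    for token in tokens:
        word = token.lower()
        # Skip stop words and tokens with no letters or digits (punctuation).
        if word in stop_words or not any(ch.isalnum() for ch in word):
            continue
        counts[word] += 1
    return counts


tokens = ["Summaries", "save", "time", ".",
          "Good", "summaries", "save", "the", "reader", "time", "."]
freq_table = build_frequency_table(tokens, stop_words={"the"})
print(dict(freq_table))
# {'summaries': 2, 'save': 2, 'time': 2, 'good': 1, 'reader': 1}
```

Counter is a dictionary subclass, so the result can be used anywhere a plain dictionary of word counts is expected.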

Step 4: Assign a score to each sentence depending on the words it contains and the frequency table.

We can use the sent_tokenize() method to create the list of sentences. We also need a dictionary to hold the score of each sentence; we will later go through this dictionary to generate the summary.
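
A minimal sketch of the scoring step follows, assuming a sentence's score is simply the sum of the frequencies of its words. The function name score_sentences is hypothetical, and a plain regex stands in for NLTK's word tokenizer to keep the snippet free of data downloads.

```python
import re


def score_sentences(sentences, freq_table):
    """Map each sentence to the sum of the frequencies of its words."""
    scores = {}
    for sentence in sentences:
        # Stand-in for word_tokenize: lowercase runs of word characters.
        words = re.findall(r"\w+", sentence.lower())
        # Words missing from the table (for example stop words) add nothing.
        scores[sentence] = sum(freq_table.get(word, 0) for word in words)
    return scores


freq_table = {"summaries": 2, "save": 2, "time": 2, "good": 1, "reader": 1}
sentences = ["Summaries save time.", "Good summaries save the reader time."]

scores = score_sentences(sentences, freq_table)
print(scores)
# {'Summaries save time.': 6, 'Good summaries save the reader time.': 8}
```

A plain sum favors long sentences. Dividing each score by the number of words in the sentence is a common variant if that bias turns out to be a problem.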

Step 5: Choose a threshold to compare the sentence scores against

A simple way to compare the scores is to find the average score of a sentence. The average itself can be a good threshold.

Apply the threshold value and store the qualifying sentences, in their original order, in the summary.
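
The thresholding step could be sketched as follows. The function name build_summary and the factor parameter are hypothetical additions; with factor=1.0 the threshold is exactly the average score described above, and a sentence is kept only if it scores strictly above it.

```python
def build_summary(sentences, scores, factor=1.0):
    """Join the sentences that score above factor * average, in original order."""
    if not scores:
        return ""
    average = sum(scores.values()) / len(scores)
    threshold = factor * average
    selected = [s for s in sentences if scores.get(s, 0) > threshold]
    return " ".join(selected)


sentences = ["Summaries save time.",
             "Good summaries save the reader time.",
             "That is all."]
scores = {sentences[0]: 6, sentences[1]: 8, sentences[2]: 0}

summary = build_summary(sentences, scores)
print(summary)
# Summaries save time. Good summaries save the reader time.
```

Raising factor above 1.0 gives a shorter summary. Note that with a strict comparison, a text whose sentences all score the same produces an empty summary.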

Find Complete Code Here:

Input Text:

Output Summary:

This brings us to the end of this blog on text summarization in Python. We hope that you were able to learn more about the concept. If you wish to learn more such concepts, do take up the Python for Machine Learning free online course offered by Great Learning Academy.