Natural Language Processing with Python - N-Grams
When we analyze a text, it is crucial to identify the words that are relevant. As a first approximation, we can assume that a word is more relevant if it appears more frequently in the corpus.
But not every important concept can be expressed by a single word. For example, if the word “Artificial” commonly appears together with “Intelligence” in a text, the pair “Artificial Intelligence” forms a semantic unit that adds context to the analysis.
đź““Notes
What are N-Grams?
An n-gram is simply a sequence of N consecutive words, and depending on the length N it receives a different name (see the sketch after this list):
- 1-gram: Unigram (“Van”)
- 2-gram: Bigram (“Van Helsing”)
- 3-gram: Trigram (“Temerous Van Helsing”), (“Senior Van Helsing”)
[!Note] The relevance we can assign to an n-gram comes from how often it repeats in a corpus (text), and words are relevant when they are semantically rich.
- ✅Relevant words: (“relevance”, “majestic”, “people”)
- ❌Irrelevant words: (“the”, “who”, “,”, “are”)
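A minimal sketch of the idea above: listing the unigrams, bigrams, and trigrams of a toy sentence (the sentence is made up; `nltk.util.ngrams` is the same helper used in the examples below).

```python
from nltk.util import ngrams

# Toy token sequence (not from the book)
tokens = "van helsing hunts the vampire".split()

# n = 1, 2, 3 -> unigrams, bigrams, trigrams
for n in (1, 2, 3):
    print(f"{n}-grams:", list(ngrams(tokens, n)))
```

Each n-gram is returned as a tuple of N consecutive tokens, e.g. `('van', 'helsing')`.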
📝Examples
1. How to calculate it using Python?
- Step 1: Import the modules
```python
import nltk
from nltk.util import ngrams  # bigrams, trigrams
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter
import requests
```
- Step 2: Download the book “Dracula” 🧛
```python
url = "https://gutenberg.org/files/345/345-0.txt"
book = requests.get(url).text.lower()
# print(book)
```
- Observation:
- We can see that the text contains a lot of irrelevant content: punctuation characters, common words, whitespace, etc. (one way to trim the Project Gutenberg boilerplate is sketched below).
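Since the raw download also includes the Project Gutenberg license header and footer, here is an optional sketch for trimming them. It assumes the file uses the usual `*** START OF ...` / `*** END OF ...` marker lines (lowercased here because the text was lowercased); this step is not required for the rest of the examples.

```python
# Optional: keep only the body of the novel by slicing between the
# Project Gutenberg start/end markers (assumption: the standard
# "*** START OF ..." / "*** END OF ..." lines are present)
start = book.find("*** start")
end = book.find("*** end")
if start != -1 and end != -1:
    book = book[start:end]
```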
2. Preprocessing of the book text (corpus)
- Step 1: Tokenization - Convert Text to a List of Words
```python
nltk.download("punkt_tab")
tokens = word_tokenize(book, language="english")
tokens[1000:1010]
```
- Observation:
- There are still irrelevant tokens like (“wallachs”, “,”, “who”, “are”, “the”)
[!NOTE] These irrelevant words are called stopwords.
- Step 2: Stopwords - Irrelevant Words
```python
nltk.download("stopwords")
common_words = set(stopwords.words("english"))
list(common_words)[-10:]
```
- Observation:
- We can list the common words of a language by downloading the stopwords corpus with nltk.download("stopwords") and setting the language of our preference.
- Step 3: Remove Words with Irrelevant Meaning
```python
words = []
for word in tokens:
    # Keep only alphabetic tokens that are not stopwords
    if word.isalpha() and word not in common_words:
        words.append(word)
words[1000:1010]
```
- Observation:
- Once we remove the punctuation and stopwords, we keep only the relevant words, which makes the list semantically much richer (a more compact version of this filter is sketched below).
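As a side note, the same filtering can be written as a list comprehension; this is just a sketch of an equivalent, more idiomatic version, assuming `tokens` and `common_words` from the previous steps.

```python
# Equivalent to the loop above: keep alphabetic, non-stopword tokens
words = [word for word in tokens
         if word.isalpha() and word not in common_words]
```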
3. Extract Frequent N-Grams
- E.g.: Bigrams
```python
# Build the list of bigrams (renamed so we do not shadow the
# imported `ngrams` function)
bigrams = list(ngrams(words, 2))

# Count the frequency of each bigram
counts = Counter(bigrams)

# N-grams with the highest frequency
print("\n---The top 5 most frequent n-grams:---\n")
for ngram, frequency in counts.most_common(n=5):
    print(frequency, ngram)
```
- Output:
```
---The top 5 most frequent n-grams:---

315 ('van', 'helsing')
103 ('could', 'see')
87 ('madam', 'mina')
66 ('lord', 'godalming')
58 ('friend', 'john')
```
- Observation:
- Now we can see the most frequent bigrams and the number of times they appear in the book “Dracula” 🧛. The same idea extends to trigrams, as sketched below.
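The trigram version only changes the value of `n`; a minimal sketch, reusing the imports and the `words` list from the steps above (the actual counts depend on the corpus, so the output is omitted here):

```python
# Same counting as before, but for trigrams: n = 3 instead of 2
trigram_counts = Counter(ngrams(words, 3))

print("\n---The top 5 most frequent trigrams:---\n")
for trigram, frequency in trigram_counts.most_common(n=5):
    print(frequency, trigram)
```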
🖇️Additional Resources
See the full source code in action in my GitHub repository: