N-Grams - NLP in Python

Beyond single words, text insights with n-grams

Posted by Dan Guitron on July 2, 2025

Natural Language Processing with Python - N-Grams


When we analyze a text, it is crucial to identify the words that are relevant. A reasonable assumption is that a word is more relevant the more frequently it appears in a corpus.

But not every important concept can be captured by a single word. For example, if the word "Artificial" appears in a text and we notice it is commonly followed by "Intelligence", then "Artificial Intelligence" forms a single semantic unit that adds context to the analysis.

📓Notes


What are N-Grams?

An n-gram is simply a sequence of N consecutive words, and sequences of different lengths have different names (see the small sketch after this list):

  • 1-gram: Unigram ("Van")
  • 2-gram: Bigram ("Van", "Helsing")
  • 3-gram: Trigram ("Temerous", "Van", "Helsing"), ("Senior", "Van", "Helsing")
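
To make this concrete, here is a minimal sketch of how nltk.util.ngrams produces these sequences from a short, made-up token list (the sentence below is only an illustration, not from the book):

from nltk.util import ngrams

sentence = ["artificial", "intelligence", "is", "changing", "everything"]

print(list(ngrams(sentence, 1)))  # unigrams: ('artificial',), ('intelligence',), ...
print(list(ngrams(sentence, 2)))  # bigrams: ('artificial', 'intelligence'), ...
print(list(ngrams(sentence, 3)))  # trigrams: ('artificial', 'intelligence', 'is'), ...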

[!Note] The relevance we assign to an n-gram depends on how often it repeats in a corpus (text), and words themselves are relevant when they are semantically rich.

  • ✅Relevant words: ("relevance", "majestic", "people")
  • ❌Irrelevant words: ("the", "who", ",", "are")

📝Examples


1. How to calculate n-grams using Python?
  • Step 1: Import the modules
import nltk
from nltk.util import ngrams  # bigrams, trigrams
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter
import requests
  • Step 2: Download the book “Dracula” 🧛
url = "https://gutenberg.org/files/345/345-0.txt"
book = requests.get(url).text.lower()
# print(book)
  • Observation:
    • The raw text contains a lot of irrelevant content: license boilerplate, punctuation characters, stray whitespace, etc. An optional trimming step is sketched below.
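
One optional cleanup, sketched here under the assumption that the file uses the usual Project Gutenberg start/end markers (the exact strings can vary between files, so verify them first), is to cut away the license header and footer before tokenizing:

# Assumed Project Gutenberg markers (lowercased, since the book text is lowercased)
start_marker = "*** start of the project gutenberg ebook"
end_marker = "*** end of the project gutenberg ebook"

start = book.find(start_marker)
end = book.find(end_marker)
if start != -1 and end != -1:
    # keep only the body of the novel
    book = book[start + len(start_marker):end]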
2. Preprocessing of the book text (corpus)
  • Step 1: Tokenization - Convert Text to a List of Words
nltk.download("punkt_tab")
tokens = word_tokenize(book, language="english")
tokens[1000:1010]
  • Observation:
    • We have irrelevant words like (“wallachs”, “,”, “who”, “are”, “the”)

[!NOTE] These irrelevant words are called stopwords.

  • Step 2: Stopwords - Irrelevant Words
nltk.download("stopwords")
common_words = set(stopwords.words("english"))
list(common_words)[-10:]
  • Observation:
    • We can list the common words used in English by downloading the corpus with nltk.download("stopwords") and selecting the language of our preference.
  • Step 3: Remove Words with Irrelevant Meaning
# Keep only alphabetic tokens that are not stopwords
words = []
for word in tokens:
    if word.isalpha() and word not in common_words:
        words.append(word)

words[1000:1010]
  • Observation:
    • Once punctuation and stopwords are removed, we are left with the relevant words, which makes the list semantically much richer. A quick frequency check on these single words is sketched below.
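
As a quick sanity check on the idea that frequent words are relevant, we can count the cleaned single words with the Counter already imported above; this is just a side check, not part of the n-gram pipeline:

# Most frequent single words (unigrams) after removing stopwords
word_counts = Counter(words)
print(word_counts.most_common(5))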
3. Extract Frequent N-Grams
  1. Example: Bigrams
# Use a new name so we don't shadow the ngrams function
bigrams = list(ngrams(words, n=2))

# Count the frequency of each bigram
counts = Counter(bigrams)

# The most frequent bigrams
print("\n---The top 5 most frequent bigrams:---\n")
for bigram, frequency in counts.most_common(5):
    print(frequency, bigram)
  • Output:
---The top 5 most frequent bigrams:---

315 ('van', 'helsing')
103 ('could', 'see')
87 ('madam', 'mina')
66 ('lord', 'godalming')
58 ('friend', 'john')
  • Observation:
    • Now we can see the most frequent bigrams and the number of times they appear in the book “Dracula” 🧛. The same recipe works for trigrams, as sketched below.
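
To compare with trigrams, the same steps apply with n=3; a minimal sketch, reusing the ngrams function and the cleaned words list from above:

# Count the frequency of each trigram
trigram_counts = Counter(ngrams(words, n=3))

print("\n---The top 5 most frequent trigrams:---\n")
for trigram, frequency in trigram_counts.most_common(5):
    print(frequency, trigram)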

🖇️Additional Resources


See the full source code in action in my GitHub repository: