Keyword Extractor from HTML Content — Step-by-Step Guide

By Ravi Vishwakarma, 27 May 2026 (updated 27 May 2026)

Search engines, SEO tools, recommendation systems, and AI applications often need to identify the most important keywords from web pages. Since websites are built using HTML, extracting meaningful keywords from HTML content is a common task in web scraping and natural language processing (NLP).

In this guide, you’ll learn how to build a simple keyword extractor from HTML content step by step using Python.

What Is a Keyword Extractor?

A keyword extractor is a program that:

  • Reads text content
  • Identifies important words or phrases
  • Removes unnecessary words
  • Returns the most relevant keywords

For example, from this sentence:

“Artificial intelligence is transforming software development.”

A keyword extractor may return:

  • artificial intelligence
  • software development
  • transforming

Why Extract Keywords from HTML?

Keyword extraction from HTML is useful for:

  • SEO analysis
  • Search engines
  • Content categorization
  • Web scraping
  • Recommendation systems
  • Semantic search
  • AI document processing

Project Overview

We will build a Python application that:

  • Reads HTML content
  • Removes HTML tags
  • Cleans the text
  • Removes stop words
  • Extracts keywords
  • Displays top keywords

Technologies Used

We’ll use:

  • Python
  • BeautifulSoup
  • NLTK
  • Counter (from Python's built-in collections module)

Step 1 — Install Required Libraries

Install the required packages.

pip install beautifulsoup4 nltk

Step 2 — Import Required Modules

# Import BeautifulSoup for HTML parsing
from bs4 import BeautifulSoup

# Import regular expressions
import re

# Import stopwords from nltk
from nltk.corpus import stopwords

# Import tokenizer
from nltk.tokenize import word_tokenize

# Import Counter for counting keyword frequency
from collections import Counter

# Import nltk package
import nltk

Step 3 — Download NLTK Data

Run this once.

# Download tokenizer data
nltk.download('punkt')

# Recent NLTK releases load the tokenizer from 'punkt_tab' instead
nltk.download('punkt_tab')

# Download stopwords dataset
nltk.download('stopwords')

Step 4 — Sample HTML Content

We’ll use a sample HTML page.

# Sample HTML content
html_content = """
<html>
<head>
    <title>AI and Machine Learning</title>
</head>
<body>
    <h1>Artificial Intelligence</h1>
    <p>
        Artificial intelligence and machine learning are transforming
        software development and modern applications.
    </p>
</body>
</html>
"""

Step 5 — Remove HTML Tags

Use BeautifulSoup to extract plain text.

# Parse HTML content
soup = BeautifulSoup(html_content, 'html.parser')

# Extract plain text
text = soup.get_text()

# Print extracted text
print(text)

Step 6 — Clean the Text

We remove:

  • Special characters
  • Numbers
  • Extra spaces

# Convert text to lowercase
text = text.lower()

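# Replace special characters and numbers with a space
# (deleting them outright would fuse hyphenated words together)
text = re.sub(r'[^a-z\s]', ' ', text)

# Collapse extra spaces and trim both ends
text = re.sub(r'\s+', ' ', text).strip()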

# Print cleaned text
print(text)
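
To see what the cleaning step does to a messy string, here is a small standalone sketch. The helper name `clean_text` and the sample string are illustrative assumptions, not part of the tutorial code. This variant replaces each removed character with a space, so a hyphenated term splits into separate words rather than fusing into one.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase the text, drop non-letters, and collapse whitespace."""
    text = text.lower()
    # Replace anything that is not a-z or whitespace with a space
    text = re.sub(r'[^a-z\s]', ' ', text)
    # Collapse runs of whitespace and trim both ends
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical input containing digits, a hyphen, and punctuation
sample = "Top 10 AI-powered tools (2026)!"
cleaned = clean_text(sample)
print(cleaned)  # top ai powered tools
```

Digits and punctuation disappear, and "AI-powered" becomes the two tokens "ai" and "powered".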

Step 7 — Tokenize the Text

Tokenization splits text into individual words.

# Split text into words
words = word_tokenize(text)

# Print tokenized words
print(words)

Step 8 — Remove Stop Words

Stop words are common words like:

  • the
  • is
  • and
  • are

These words usually don’t help identify keywords.

# Load English stop words
stop_words = set(stopwords.words('english'))

# Remove stop words
filtered_words = [
    word for word in words
    if word not in stop_words
]

# Print filtered words
print(filtered_words)

Step 9 — Count Keyword Frequency

Now count how often each keyword appears.

# Count word frequency
word_counts = Counter(filtered_words)

# Print top 10 keywords
print(word_counts.most_common(10))

Full Working Code

# Import libraries
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
import nltk

# Download nltk datasets
# ('punkt_tab' is what recent NLTK releases use for tokenizing)
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

# Sample HTML content
html_content = """
<html>
<head>
    <title>AI and Machine Learning</title>
</head>
<body>
    <h1>Artificial Intelligence</h1>
    <p>
        Artificial intelligence and machine learning are transforming
        software development and modern applications.
    </p>
</body>
</html>
"""

# Parse HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Extract text
text = soup.get_text()

# Convert to lowercase
text = text.lower()

# Replace special characters and numbers with a space
text = re.sub(r'[^a-z\s]', ' ', text)

# Collapse extra spaces and trim both ends
text = re.sub(r'\s+', ' ', text).strip()

# Tokenize words
words = word_tokenize(text)

# Load stop words
stop_words = set(stopwords.words('english'))

# Remove stop words
filtered_words = [
    word for word in words
    if word not in stop_words
]

# Count frequency
word_counts = Counter(filtered_words)

# Print top keywords
print("Top Keywords:")

# Display keywords
for word, count in word_counts.most_common(10):
    print(word, ":", count)

Example Output

Top Keywords:
machine : 2
learning : 2
artificial : 2
intelligence : 2
ai : 1
transforming : 1
software : 1
development : 1
modern : 1
applications : 1

How It Works Internally

The workflow looks like this:

HTML Content
      ↓
Remove HTML Tags
      ↓
Clean Text
      ↓
Tokenize Words
      ↓
Remove Stop Words
      ↓
Count Frequencies
      ↓
Top Keywords
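
The same workflow can be sketched as a single function. The version below is a dependency-free approximation, not the tutorial's code: it swaps BeautifulSoup for the standard library's `html.parser`, NLTK's tokenizer for a regular expression, and NLTK's stop-word list for a tiny hand-written set. The names `TextExtractor`, `extract_keywords`, and `STOP_WORDS` are assumptions made for this illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stop-word list (assumption); NLTK's list is far larger
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in"}

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def extract_keywords(html: str, top_n: int = 10):
    """Run the pipeline: strip tags, clean, tokenize, filter, count."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join text nodes and lowercase them
    text = " ".join(parser.parts).lower()
    # Keeping only runs of letters cleans and tokenizes in one step
    words = re.findall(r"[a-z]+", text)
    # Remove stop words
    filtered = [word for word in words if word not in STOP_WORDS]
    # Count frequencies and return the top entries
    return Counter(filtered).most_common(top_n)

html = (
    "<html><body><h1>Artificial Intelligence</h1>"
    "<p>Artificial intelligence and machine learning are transforming software.</p>"
    "<script>var x = 1;</script></body></html>"
)
keywords = extract_keywords(html, top_n=3)
print(keywords)
```

Wrapping the steps in one function makes it easier to swap a single stage later, for example a better tokenizer, without touching the rest.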

Improvements You Can Add

The basic keyword extractor works for small, clean inputs, but real-world pages usually call for the improvements below.

1. Use TF-IDF

TF-IDF (term frequency-inverse document frequency) scores a word higher when it appears often in one document but rarely in the rest of the collection.

Instead of only counting frequency, it considers:

  • Word frequency in the document
  • Word rarity across documents

Useful library:

scikit-learn
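
To make the idea concrete, here is a minimal pure-Python sketch of one common TF-IDF formulation: term frequency multiplied by log(N / document frequency). It is not scikit-learn's implementation, which uses a smoothed variant of the formula. The function name `tf_idf` and the three toy documents are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each word
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # (share of the document) * log(total docs / docs containing word)
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

# Toy corpus (hypothetical), already tokenized and lowercased
docs = [
    ["machine", "learning", "models"],
    ["machine", "translation", "systems"],
    ["machine", "vision", "models"],
]
scores = tf_idf(docs)
# Rank the words of the first document by score, highest first
print(sorted(scores[0].items(), key=lambda item: item[1], reverse=True))
```

In this corpus "machine" appears in every document, so it scores zero, while "learning" appears in only one and ranks first for that document. Plain frequency counting would have treated all three words as equally important.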

2. Extract Key Phrases

Instead of single words, extract phrases like:

  • machine learning
  • artificial intelligence

Libraries:

  • RAKE
  • spaCy
  • KeyBERT
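
Before reaching for one of those libraries, a rough approximation is to count adjacent word pairs (bigrams). The sketch below is a simple stand-in, not how RAKE, spaCy, or KeyBERT work internally. The function name `top_bigrams`, the short stop-word set, and the sample sentence are assumptions made for illustration.

```python
from collections import Counter

# Small illustrative stop-word list (assumption)
STOP_WORDS = {"and", "are", "is", "the", "of"}

def top_bigrams(words, top_n=5):
    """Count adjacent word pairs that contain no stop words."""
    pairs = [
        f"{first} {second}"
        for first, second in zip(words, words[1:])
        if first not in STOP_WORDS and second not in STOP_WORDS
    ]
    return Counter(pairs).most_common(top_n)

# Tokens are taken BEFORE stop-word removal so pairs stay truly adjacent
words = (
    "artificial intelligence and machine learning are transforming "
    "software development and artificial intelligence tools"
).split()
phrases = top_bigrams(words)
print(phrases)
```

Pairing words from the unfiltered token list matters: if stop words were removed first, "intelligence machine" would be counted as a phrase even though the two words never sat next to each other.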

3. Ignore Navigation Content

Real HTML pages contain:

  • Menus
  • Headers
  • Footers
  • Ads

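You can remove unnecessary tags before calling get_text():

# Drop non-content elements so their text is not counted
for tag in soup(['script', 'style', 'nav', 'header', 'footer', 'aside']):
    tag.decompose()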

4. Use Stemming or Lemmatization

Convert related words into base forms.

Example:

  • running → run
  • applications → application
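
The effect on keyword counts can be sketched as follows. The lookup table `BASE_FORMS` is a hand-written stand-in for a real stemmer or lemmatizer, and both it and the helper `normalize` are assumptions made for this illustration.

```python
from collections import Counter

# Hand-written lookup standing in for a real stemmer or lemmatizer (assumption)
BASE_FORMS = {
    "running": "run",
    "runs": "run",
    "applications": "application",
}

def normalize(word):
    """Map a word to its base form, leaving unknown words unchanged."""
    return BASE_FORMS.get(word, word)

words = ["running", "run", "runs", "applications", "application", "modern"]
# Count base forms so that variants of one word share a single entry
counts = Counter(normalize(word) for word in words)
print(counts.most_common())
```

Without normalization, "running", "run", and "runs" would each be counted once. With it, they merge into a single keyword with a count of three. In practice the lookup would be replaced by something like NLTK's PorterStemmer or WordNetLemmatizer. Note that a stemmer can return truncated stems rather than dictionary words.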

5. Process Live URLs

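You can fetch real websites using requests (install it with pip install requests).

import requests

# Set a timeout so a slow server cannot hang the script
response = requests.get("https://example.com", timeout=10)
# Raise an exception on 4xx/5xx responses instead of parsing an error page
response.raise_for_status()
html_content = response.text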

Real-World Use Cases

Keyword extraction is widely used in:

Industry | Usage
SEO | Content optimization
Search Engines | Indexing pages
E-commerce | Product tagging
AI Chatbots | Context understanding
News Platforms | Article categorization
Recommendation Systems | Topic matching

Challenges in Keyword Extraction

Noisy HTML

Web pages often contain irrelevant content.

Duplicate Words

Repeated keywords may not always indicate importance.

Context Understanding

Simple frequency counting cannot fully understand meaning.

Multilingual Content

Different languages require different NLP models.

Advanced Alternatives

Modern AI systems use embedding-based methods and transformer models for smarter keyword extraction.

Popular tools include:

  • spaCy
  • KeyBERT
  • Hugging Face Transformers

These methods compare the meaning of words and phrases rather than relying only on word frequency.

Final Thoughts

Building a keyword extractor from HTML content is an excellent beginner-friendly NLP and web scraping project.

In this tutorial, you learned how to:

  • Parse HTML
  • Clean text
  • Tokenize words
  • Remove stop words
  • Extract important keywords

This foundational project can later evolve into:

  • SEO analyzers
  • Semantic search engines
  • AI content tools
  • Recommendation systems
  • Document analysis platforms
