Keyword Extractor from HTML Content — Step-by-Step Guide

Search engines, SEO tools, recommendation systems, and AI applications often need to identify the most important keywords from web pages. Since websites are built using HTML, extracting meaningful keywords from HTML content is a common task in web scraping and natural language processing (NLP).

In this guide, you’ll learn how to build a simple keyword extractor from HTML content step by step using Python.

What Is a Keyword Extractor?

A keyword extractor is a program that:

Reads text content
Identifies important words or phrases
Removes unnecessary words
Returns the most relevant keywords

For example, from this sentence:

“Artificial intelligence is transforming software development.”

A keyword extractor may return:

artificial intelligence
software development
transforming

Why Extract Keywords from HTML?

Keyword extraction from HTML is useful for:

SEO analysis
Search engines
Content categorization
Web scraping
Recommendation systems
Semantic search
AI document processing

Project Overview

We will build a Python application that:

Reads HTML content
Removes HTML tags
Cleans the text
Removes stop words
Extracts keywords
Displays top keywords

Technologies Used

We’ll use:

Python
BeautifulSoup
NLTK
Counter (from Python’s built-in collections module)

Step 1 — Install Required Libraries

Install the required packages.

pip install beautifulsoup4 nltk

Step 2 — Import Required Modules

# Import BeautifulSoup for HTML parsing
from bs4 import BeautifulSoup

# Import regular expressions
import re

# Import stopwords from nltk
from nltk.corpus import stopwords

# Import tokenizer
from nltk.tokenize import word_tokenize

# Import Counter for counting keyword frequency
from collections import Counter

# Import nltk package
import nltk

Step 3 — Download NLTK Data

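Run this once. NLTK stores the data locally, so later runs do not download it again.

# Download tokenizer data
# ('punkt_tab' is required by NLTK 3.9 and later; 'punkt' covers older versions)
nltk.download('punkt')
nltk.download('punkt_tab')

# Download stopwords dataset
nltk.download('stopwords')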

Step 4 — Sample HTML Content

We’ll use a sample HTML page.

# Sample HTML content
html_content = """
<html>
<head>
    <title>AI and Machine Learning</title>
</head>
<body>
    <h1>Artificial Intelligence</h1>
    <p>
        Artificial intelligence and machine learning are transforming
        software development and modern applications.
    </p>
</body>
</html>
"""

Step 5 — Remove HTML Tags

Use BeautifulSoup to extract plain text.

# Parse HTML content
soup = BeautifulSoup(html_content, 'html.parser')

# Extract plain text
text = soup.get_text()

# Print extracted text
print(text)
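
If the markup has no whitespace between tags, get_text() with its default arguments joins neighbouring text nodes directly. The sketch below uses a made-up one-line snippet (compact_html is an illustrative name, not part of the sample page) to show how the separator and strip arguments of get_text() avoid that.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with no whitespace between the two elements
compact_html = "<h1>Artificial Intelligence</h1><p>Machine learning</p>"
soup = BeautifulSoup(compact_html, "html.parser")

# Default call: text nodes are concatenated with nothing in between
joined = soup.get_text()

# separator=" " puts a space between text nodes; strip=True trims each one
spaced = soup.get_text(separator=" ", strip=True)

print(repr(joined))
print(repr(spaced))
```

In this snippet the default call glues "Intelligence" and "Machine" into one token, which would later be counted as a single bogus keyword.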

Step 6 — Clean the Text

We remove:

Special characters
Numbers
Extra spaces
# Convert text to lowercase
text = text.lower()

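# Replace special characters and numbers with spaces
# (the text is already lowercase, so a-z is enough; using a space keeps
# hyphenated words such as "state-of-the-art" from being glued together)
text = re.sub(r'[^a-z\s]', ' ', text)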

# Remove extra spaces
text = re.sub(r'\s+', ' ', text)

# Print cleaned text
print(text)
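
One detail worth checking in this step is what the removed characters are replaced with. The sketch below compares an empty string with a single space on a made-up sentence (raw is an illustrative value, not taken from the sample page).

```python
import re

# Hypothetical input containing hyphens, a slash, punctuation, and digits
raw = "State-of-the-art AI/ML tools, version 2.0!"
lowered = raw.lower()

# Replacing with an empty string glues the neighbouring letters together
glued = re.sub(r"\s+", " ", re.sub(r"[^a-z\s]", "", lowered)).strip()

# Replacing with a space keeps the word boundaries intact
spaced = re.sub(r"\s+", " ", re.sub(r"[^a-z\s]", " ", lowered)).strip()

print(glued)
print(spaced)
```

With the empty-string version, "state-of-the-art" and "ai/ml" each collapse into a single made-up word, so the space version is usually the safer choice for keyword counting.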

Step 7 — Tokenize the Text

Tokenization splits text into individual words.

# Split text into words
words = word_tokenize(text)

# Print tokenized words
print(words)
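
Once the text contains only lowercase letters and single spaces, a plain str.split() gives a similar token list without needing the NLTK tokenizer data. Treat this as a rough substitute rather than an equivalent: word_tokenize applies extra rules, for example around contractions. The cleaned string below is an illustrative value.

```python
# Illustrative cleaned text: lowercase letters separated by single spaces
cleaned = "artificial intelligence and machine learning are transforming software"

# Split on whitespace; no NLTK data download is needed for this
tokens = cleaned.split()

print(tokens)
```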

Step 8 — Remove Stop Words

Stop words are common words like:

the
is
and
are

These words usually don’t help identify keywords.

# Load English stop words
stop_words = set(stopwords.words('english'))

# Remove stop words
filtered_words = [
    word for word in words
    if word not in stop_words
]

# Print filtered words
print(filtered_words)
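
Web pages often repeat words that are not in a general stop word list but still carry no topical meaning. The sketch below shows one possible way to extend the stop word set. Both sets are small stand-ins: base_stop_words imitates a few entries from the NLTK list, and custom_stop_words is a hypothetical list chosen only for this example.

```python
# Small stand-in for stopwords.words('english'); the real list is much longer
base_stop_words = {"the", "is", "and", "are", "to"}

# Hypothetical site-specific noise words, chosen only for this example
custom_stop_words = {"click", "subscribe", "menu"}

# Set union combines both lists into one lookup set
stop_words = base_stop_words | custom_stop_words

words = ["click", "to", "subscribe", "machine", "learning", "and", "python"]
filtered_words = [word for word in words if word not in stop_words]

print(filtered_words)
```

Keeping the stop words in a set makes each membership check fast, which matters once pages grow to thousands of tokens.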

Step 9 — Count Keyword Frequency

Now count how often each keyword appears.

# Count word frequency
word_counts = Counter(filtered_words)

# Print top 10 keywords
print(word_counts.most_common(10))
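
A short sketch of how Counter behaves may help when reading the results. most_common(n) returns a list of (word, count) tuples sorted by count, and words with equal counts keep the order in which they were first seen. The word list is an illustrative value.

```python
from collections import Counter

words = ["machine", "learning", "machine", "ai", "learning", "python"]
word_counts = Counter(words)

# The two most frequent words as (word, count) tuples
top_two = word_counts.most_common(2)

# A Counter returns 0 for words it has never seen instead of raising KeyError
missing = word_counts["missing"]

print(top_two)
print(missing)
```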

Full Working Code

# Import libraries
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
import nltk

# Download nltk datasets ('punkt_tab' is needed by NLTK 3.9 and later)
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

# Sample HTML content
html_content = """
<html>
<head>
    <title>AI and Machine Learning</title>
</head>
<body>
    <h1>Artificial Intelligence</h1>
    <p>
        Artificial intelligence and machine learning are transforming
        software development and modern applications.
    </p>
</body>
</html>
"""

# Parse HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Extract text
text = soup.get_text()

# Convert to lowercase
text = text.lower()

# Replace special characters and numbers with spaces
text = re.sub(r'[^a-z\s]', ' ', text)

# Remove extra spaces
text = re.sub(r'\s+', ' ', text)

# Tokenize words
words = word_tokenize(text)

# Load stop words
stop_words = set(stopwords.words('english'))

# Remove stop words
filtered_words = [
    word for word in words
    if word not in stop_words
]

# Count frequency
word_counts = Counter(filtered_words)

# Print top keywords
print("Top Keywords:")

# Display keywords
for word, count in word_counts.most_common(10):
    print(word, ":", count)

Example Output

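Top Keywords:
machine : 2
learning : 2
artificial : 2
intelligence : 2
ai : 1
transforming : 1
software : 1
development : 1
modern : 1
applications : 1

Note that get_text() also returns the text of the <title> tag, so “machine” and “learning” are counted twice and “ai” appears in the list.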

How It Works Internally

The workflow looks like this:

HTML Content
      ↓
Remove HTML Tags
      ↓
Clean Text
      ↓
Tokenize Words
      ↓
Remove Stop Words
      ↓
Count Frequencies
      ↓
Top Keywords
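
The workflow above can be sketched as a single reusable function. This is one possible arrangement, not the only one: the name extract_keywords and its parameters are illustrative, the stop words are passed in by the caller, and tokenization defaults to str.split so the sketch runs without NLTK data (word_tokenize can be passed as tokenize instead).

```python
import re
from collections import Counter

from bs4 import BeautifulSoup


def extract_keywords(html, stop_words, top_n=10, tokenize=str.split):
    """Return the top_n (word, count) pairs found in an HTML string."""
    # Remove HTML tags, keeping a space between neighbouring text nodes
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    # Clean text: lowercase, then replace anything except letters with a space
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Tokenize and drop stop words
    words = [word for word in tokenize(text) if word not in stop_words]
    # Count frequencies and keep the most common entries
    return Counter(words).most_common(top_n)


# Illustrative input and a tiny stand-in stop word set
sample = "<h1>Machine learning</h1><p>Machine learning and the web.</p>"
top = extract_keywords(sample, stop_words={"and", "the"}, top_n=2)
print(top)
```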

Improvements You Can Add

The basic extractor handles simple pages like the sample above, but real-world pages usually need more than raw word counts. Here are some improvements to consider.

1. Use TF-IDF

TF-IDF (term frequency-inverse document frequency) weights words by how distinctive they are for a document, not just by how often they appear.

Instead of only counting frequency, it considers:

Word frequency in document
Word rarity across documents

Useful library:

scikit-learn
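
To make the idea concrete, here is a hand-rolled sketch of a simplified TF-IDF score, tf * log(N / df). It is meant only to illustrate the two factors listed above. scikit-learn’s implementation uses a smoothed variant of the formula, so its numbers will differ. The documents and the function name tf_idf are illustrative.

```python
import math
from collections import Counter

# Three tiny, already tokenized documents (illustrative data)
docs = [
    ["machine", "learning", "models"],
    ["machine", "translation", "systems"],
    ["learning", "python", "python"],
]


def tf_idf(doc, corpus):
    """Score each word in doc with a simplified tf * log(N / df) formula."""
    counts = Counter(doc)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        tf = count / len(doc)                      # frequency in this document
        df = sum(1 for d in corpus if word in d)   # documents containing the word
        scores[word] = tf * math.log(n_docs / df)  # rarer words score higher
    return scores


scores = tf_idf(docs[2], docs)
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

In the third document, "python" outranks "learning" because it is both more frequent in that document and absent from the others.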

2. Extract Key Phrases

Instead of single words, extract phrases like:

machine learning
artificial intelligence

Libraries:

RAKE
spaCy
KeyBERT
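
Before reaching for one of these libraries, a simple baseline is to count two-word sequences (bigrams). The sketch below is plain n-gram counting, not the RAKE or KeyBERT algorithm. It pairs neighbouring words before stop words are removed, so that words separated by a stop word are not treated as a phrase. The sentence and stop word set are illustrative.

```python
from collections import Counter

text = ("artificial intelligence and machine learning are "
        "transforming machine learning tools")
words = text.split()

# Tiny stand-in stop word set for this example
stop_words = {"and", "are"}

# Pair each word with its successor; skip pairs that contain a stop word
bigrams = [
    (first, second)
    for first, second in zip(words, words[1:])
    if first not in stop_words and second not in stop_words
]

phrase_counts = Counter(" ".join(pair) for pair in bigrams)
print(phrase_counts.most_common(3))
```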

3. Ignore Navigation Content

Real HTML pages contain:

Menus
Headers
Footers
Ads

You can remove these elements before calling get_text():

for tag in soup(['script', 'style', 'nav', 'header', 'footer', 'aside']):
    tag.decompose()
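
A self-contained sketch of this cleanup on a made-up page follows. The markup and the exact tag list are assumptions for illustration. Real sites often mark menus and ads with class names instead, which would need additional selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical page with navigation, a script, and a footer around the content
page = """
<html><body>
<nav>Home About Contact</nav>
<script>var tracking = true;</script>
<p>Machine learning tutorial</p>
<footer>Copyright notice</footer>
</body></html>
"""

soup = BeautifulSoup(page, "html.parser")

# Calling soup([...]) is shorthand for soup.find_all([...])
for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
    tag.decompose()  # remove the tag and everything inside it

# Only the main content is left for keyword extraction
text = soup.get_text(separator=" ", strip=True)
print(text)
```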

4. Use Stemming or Lemmatization

Convert related words into base forms.

Example:

running → run
applications → application
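
A minimal stemming sketch using NLTK’s PorterStemmer, which works without any extra data download. Stems are not always dictionary words. Lemmatization (for example NLTK’s WordNetLemmatizer, which needs the wordnet data) returns dictionary forms instead. The word list is illustrative.

```python
from collections import Counter

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["running", "runs", "run", "applications", "application"]

# Reduce each word to its stem so related forms share one key
stems = [stemmer.stem(word) for word in words]

stem_counts = Counter(stems)
print(stem_counts.most_common())
```

Counting stems instead of raw words merges "running", "runs", and "run" into a single entry, so related forms no longer compete with each other in the ranking.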

5. Process Live URLs

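You can fetch real websites using the requests library (install it with pip install requests).

import requests

# The timeout keeps the script from hanging; raise_for_status() stops on 4xx/5xx errors
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()
html_content = response.text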

Real-World Use Cases

Keyword extraction is widely used in:

Industry	Usage
SEO	Content optimization
Search Engines	Indexing pages
E-commerce	Product tagging
AI Chatbots	Context understanding
News Platforms	Article categorization
Recommendation Systems	Topic matching

Challenges in Keyword Extraction

Noisy HTML

Web pages often contain irrelevant content.

Duplicate Words

Repeated keywords may not always indicate importance.

Context Understanding

Simple frequency counting cannot fully understand meaning.

Multilingual Content

Different languages require different NLP models.

Advanced Alternatives

Modern AI systems use embedding-based methods and transformer models for smarter keyword extraction.

Popular tools include:

spaCy
KeyBERT
Hugging Face Transformers

Embedding- and transformer-based methods compare the meaning of candidate keywords with the document as a whole, rather than relying only on word frequency.

Final Thoughts

Building a keyword extractor from HTML content is an excellent beginner-friendly NLP and web scraping project.

In this tutorial, you learned how to:

Parse HTML
Clean text
Tokenize words
Remove stop words
Extract important keywords

This foundational project can later evolve into:

SEO analyzers
Semantic search engines
AI content tools
Recommendation systems
Document analysis platforms
