Search engines, SEO tools, recommendation systems, and AI applications often need to identify the most important keywords from web pages. Since websites are built using HTML, extracting meaningful keywords from HTML content is a common task in web scraping and natural language processing (NLP).
In this guide, you’ll learn how to build a simple keyword extractor from HTML content step by step using Python.
What Is a Keyword Extractor?
A keyword extractor is a program that:
- Reads text content
- Identifies important words or phrases
- Removes unnecessary words
- Returns the most relevant keywords
For example, from this sentence:
“Artificial intelligence is transforming software development.”
A keyword extractor may return:
- artificial intelligence
- software development
- transforming
Why Extract Keywords from HTML?
Keyword extraction from HTML is useful for:
- SEO analysis
- Search engines
- Content categorization
- Web scraping
- Recommendation systems
- Semantic search
- AI document processing
Project Overview
We will build a Python application that:
- Reads HTML content
- Removes HTML tags
- Cleans the text
- Removes stop words
- Extracts keywords
- Displays top keywords
Technologies Used
We’ll use:
- Python
- BeautifulSoup
- NLTK
- Counter (from Python's built-in collections module)
Step 1 — Install Required Libraries
Install the required packages.
pip install beautifulsoup4 nltk
Step 2 — Import Required Modules
# Import BeautifulSoup for HTML parsing
from bs4 import BeautifulSoup
# Import regular expressions
import re
# Import stopwords from nltk
from nltk.corpus import stopwords
# Import tokenizer
from nltk.tokenize import word_tokenize
# Import Counter for counting keyword frequency
from collections import Counter
# Import nltk package
import nltk
Step 3 — Download NLTK Data
Run this once.
# Download tokenizer data
nltk.download('punkt')
# NLTK 3.9 and later load the tokenizer from 'punkt_tab' instead
nltk.download('punkt_tab')
# Download stopwords dataset
nltk.download('stopwords')
Step 4 — Sample HTML Content
We’ll use a sample HTML page.
# Sample HTML content
html_content = """
<html>
<head>
<title>AI and Machine Learning</title>
</head>
<body>
<h1>Artificial Intelligence</h1>
<p>
Artificial intelligence and machine learning are transforming
software development and modern applications.
</p>
</body>
</html>
"""
Step 5 — Remove HTML Tags
Use BeautifulSoup to extract plain text.
# Parse HTML content
soup = BeautifulSoup(html_content, 'html.parser')
# Extract plain text
text = soup.get_text()
# Print extracted text
print(text)
Step 6 — Clean the Text
We remove:
- Special characters
- Numbers
- Extra spaces
# Convert text to lowercase
text = text.lower()
# Remove special characters and numbers
text = re.sub(r'[^a-zA-Z\s]', '', text)
# Remove extra spaces
text = re.sub(r'\s+', ' ', text)
# Print cleaned text
print(text)
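One side effect of the pattern above is worth seeing in isolation: because every match is replaced with an empty string, words joined by punctuation get fused together. The sketch below compares that behavior with a variant that substitutes a space instead. The helper names `clean_strict` and `clean_spaced` and the sample sentence are made up for illustration, and both helpers add a final `strip()` that the tutorial code does not have.

```python
import re

def clean_strict(text):
    """Tutorial-style cleaning: delete non-letters, then collapse spaces."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

def clean_spaced(text):
    """Variant: turn non-letters into spaces so joined words stay separate."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "AI-powered search, built in 2024!"
print(clean_strict(sample))   # aipowered search built in
print(clean_spaced(sample))   # ai powered search built in
```

Which variant is better depends on the content: the spaced version keeps hyphenated terms as separate words, but it also splits contractions such as "don't" into two tokens.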
Step 7 — Tokenize the Text
Tokenization splits text into individual words.
# Split text into words
words = word_tokenize(text)
# Print tokenized words
print(words)
Step 8 — Remove Stop Words
Stop words are common words like:
- the
- is
- and
- are
These words usually don’t help identify keywords.
# Load English stop words
stop_words = set(stopwords.words('english'))
# Remove stop words
filtered_words = [
    word for word in words
    if word not in stop_words
]
# Print filtered words
print(filtered_words)
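Web pages often contain site-specific filler words that a general stop-word list does not cover. Here is a minimal sketch of extending the filter, assuming a small hand-written set stands in for `stopwords.words('english')` so the example runs without NLTK data. `BASE_STOP_WORDS`, `DOMAIN_STOP_WORDS`, and `filter_tokens` are hypothetical names, not part of NLTK.

```python
# Stand-in for set(stopwords.words('english')); the real list is much longer.
BASE_STOP_WORDS = {"the", "is", "and", "are", "a", "to", "of", "in"}

# Hypothetical site-specific words that carry no topical meaning.
DOMAIN_STOP_WORDS = {"click", "subscribe", "home", "menu"}

def filter_tokens(tokens, min_length=2):
    """Drop stop words and tokens shorter than min_length characters."""
    stop_words = BASE_STOP_WORDS | DOMAIN_STOP_WORDS
    return [
        token for token in tokens
        if token not in stop_words and len(token) >= min_length
    ]

tokens = ["click", "the", "menu", "to", "read", "machine", "learning", "news", "x"]
print(filter_tokens(tokens))  # ['read', 'machine', 'learning', 'news']
```

The `min_length` default of 2 is deliberate: it removes stray single letters while keeping short but meaningful tokens such as "ai".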
Step 9 — Count Keyword Frequency
Now count how often each keyword appears.
# Count word frequency
word_counts = Counter(filtered_words)
# Print top 10 keywords
print(word_counts.most_common(10))
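`most_common` sorts by count, and on Python 3.7 or later words with equal counts keep the order in which they were first seen. The sketch below shows that ordering on a small hand-made token list. It also converts raw counts into a share of all tokens, which makes pages of different lengths easier to compare. The token list and the `relative_frequency` helper are illustrative only.

```python
from collections import Counter

def relative_frequency(tokens, top_n=3):
    """Return (word, count, share_of_all_tokens) for the top_n words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [
        (word, count, round(count / total, 2))
        for word, count in counts.most_common(top_n)
    ]

tokens = ["ai", "machine", "learning", "machine", "learning", "software"]
for word, count, share in relative_frequency(tokens):
    print(word, count, share)
# machine 2 0.33
# learning 2 0.33
# ai 1 0.17
```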
Full Working Code
# Import libraries
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
import nltk
# Download nltk datasets ('punkt_tab' is what NLTK 3.9+ loads)
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
# Sample HTML content
html_content = """
<html>
<head>
<title>AI and Machine Learning</title>
</head>
<body>
<h1>Artificial Intelligence</h1>
<p>
Artificial intelligence and machine learning are transforming
software development and modern applications.
</p>
</body>
</html>
"""
# Parse HTML
soup = BeautifulSoup(html_content, 'html.parser')
# Extract text
text = soup.get_text()
# Convert to lowercase
text = text.lower()
# Remove special characters
text = re.sub(r'[^a-zA-Z\s]', '', text)
# Remove extra spaces
text = re.sub(r'\s+', ' ', text)
# Tokenize words
words = word_tokenize(text)
# Load stop words
stop_words = set(stopwords.words('english'))
# Remove stop words
filtered_words = [
    word for word in words
    if word not in stop_words
]
# Count frequency
word_counts = Counter(filtered_words)
# Print top keywords
print("Top Keywords:")
# Display keywords
for word, count in word_counts.most_common(10):
    print(word, ":", count)
Example Output
Top Keywords:
machine : 2
learning : 2
artificial : 2
intelligence : 2
ai : 1
transforming : 1
software : 1
development : 1
modern : 1
applications : 1
How It Works Internally
The workflow looks like this:
HTML Content
↓
Remove HTML Tags
↓
Clean Text
↓
Tokenize Words
↓
Remove Stop Words
↓
Count Frequencies
↓
Top Keywords
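The workflow above can be sketched as a single reusable function. To keep the example free of third-party dependencies, this version swaps in Python's built-in `html.parser` for BeautifulSoup, a plain `split()` for `word_tokenize`, and a tiny hand-written stop-word set for NLTK's list. Those substitutions, and the names `TextCollector` and `extract_keywords`, are assumptions made for this sketch.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny stand-in for NLTK's English stop-word list (assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}

class TextCollector(HTMLParser):
    """Collects the text between tags; a stdlib stand-in for BeautifulSoup."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def extract_keywords(html, top_n=10):
    # 1. Remove HTML tags
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    text = " ".join(collector.parts)
    # 2. Clean text: lowercase, keep letters and whitespace only
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # 3. Tokenize on whitespace (simpler than word_tokenize)
    words = text.split()
    # 4. Remove stop words
    words = [w for w in words if w not in STOP_WORDS]
    # 5. Count frequencies and return the top keywords
    return Counter(words).most_common(top_n)

page = "<h1>Machine Learning</h1><p>Machine learning and software.</p>"
print(extract_keywords(page, top_n=2))  # [('machine', 2), ('learning', 2)]
```

Wrapping the steps in one function makes it easy to run the same pipeline over many pages and to replace any single stage later.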
Improvements You Can Add
The basic extractor works for simple pages, but several additions make it more useful on real-world content.
1. Use TF-IDF
TF-IDF helps identify important words more intelligently.
Instead of only counting frequency, it considers:
- Word frequency in document
- Word rarity across documents
Useful library:
- scikit-learn
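To make the two factors concrete, here is a from-scratch sketch of the textbook formula, where tf is the word's share of the document and idf is log(N / number of documents containing the word). This is not scikit-learn's implementation: its `TfidfVectorizer` uses a smoothed, normalized variant by default, so its scores will differ. The `tf_idf` function and the three toy documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Score each word in each tokenized document with plain TF-IDF.

    tf  = occurrences of the word / number of tokens in the document
    idf = log(number of documents / number of documents containing the word)
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

docs = [
    ["machine", "learning", "software"],
    ["machine", "learning", "models"],
    ["machine", "vision", "cameras"],
]
scores = tf_idf(docs)
# Highest-scoring word of the first document
print(max(scores[0], key=scores[0].get))  # software
```

"machine" appears in every document, so its idf is log(1) = 0 and it scores zero everywhere, while "software" is unique to the first document and ranks highest there.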
2. Extract Key Phrases
Instead of single words, extract phrases like:
- machine learning
- artificial intelligence
Libraries:
- RAKE
- spaCy
- KeyBERT
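Before reaching for one of those libraries, a rough idea of phrase extraction can be had by counting adjacent word pairs (bigrams). The sketch below is a simple frequency heuristic, not the RAKE algorithm and not what spaCy or KeyBERT do. `top_bigrams` and the small stop-word set are hypothetical.

```python
from collections import Counter

# Tiny stand-in for a full stop-word list (assumption for this sketch).
STOP_WORDS = {"and", "are", "is", "the", "of"}

def top_bigrams(tokens, top_n=5):
    """Count adjacent word pairs in which neither word is a stop word."""
    pairs = zip(tokens, tokens[1:])
    phrases = [
        f"{first} {second}"
        for first, second in pairs
        if first not in STOP_WORDS and second not in STOP_WORDS
    ]
    return Counter(phrases).most_common(top_n)

tokens = ("artificial intelligence and machine learning are transforming "
          "software development and machine learning tools").split()
print(top_bigrams(tokens, top_n=2))
# [('machine learning', 2), ('artificial intelligence', 1)]
```

Note that the pairs are built from the unfiltered token list. Removing stop words first would make words look adjacent when they were not, producing phrases such as "intelligence machine".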
3. Ignore Navigation Content
Real HTML pages contain:
- Menus
- Headers
- Footers
- Ads
You can remove unnecessary tags:
for tag in soup(['script', 'style', 'nav', 'footer']):
    tag.decompose()
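To show what that filtering does without needing BeautifulSoup installed, here is a sketch of the same idea using Python's built-in `html.parser`. It tracks how deeply the parser is nested inside a skipped tag and only collects text when that depth is zero. `MainTextParser`, `main_text`, and the sample page are illustrative, and the approach assumes the skipped tags are properly closed.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style", "nav", "footer"}

class MainTextParser(HTMLParser):
    """Collects text, ignoring anything nested inside SKIPPED_TAGS."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # greater than 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def main_text(html):
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

page = """
<nav><a href="/">Home</a> <a href="/blog">Blog</a></nav>
<p>Keyword extraction from <b>HTML</b> pages.</p>
<footer>Copyright notice</footer>
"""
print(main_text(page))  # Keyword extraction from HTML pages.
```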
4. Use Stemming or Lemmatization
Convert related word forms to a common base so that they are counted as one keyword.
Example:
- running → run
- applications → application
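The effect on keyword counts can be illustrated with a much cruder stand-in for a real stemmer or lemmatizer: merge a plural into its singular only when both forms actually occur in the data. This heuristic is an assumption made for the sketch and handles only a trailing "s". For anything serious, use the stemmers and lemmatizers that ship with NLTK or spaCy. `fold_plurals` is a hypothetical helper.

```python
from collections import Counter

def fold_plurals(counts):
    """Merge 'words' into 'word' when both forms occur in the counts."""
    folded = Counter()
    for word, count in counts.items():
        singular = word[:-1]
        if word.endswith("s") and singular in counts:
            folded[singular] += count
        else:
            folded[word] += count
    return folded

counts = Counter(["application", "applications", "applications", "class", "models"])
print(fold_plurals(counts).most_common())
# [('application', 3), ('class', 1), ('models', 1)]
```

Because a word is only shortened when its singular form was also seen, "class" is left intact, but so is "models", which a real lemmatizer would reduce to "model".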
5. Process Live URLs
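You can fetch live pages with the requests library (install it with pip install requests).
import requests
# Fetch the page; the timeout keeps the call from hanging indefinitely
response = requests.get("https://example.com", timeout=10)
# Raise an exception for 4xx/5xx responses
response.raise_for_status()
html_content = response.text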
Real-World Use Cases
Keyword extraction is widely used in:
| Industry | Usage |
|---|---|
| SEO | Content optimization |
| Search Engines | Indexing pages |
| E-commerce | Product tagging |
| AI Chatbots | Context understanding |
| News Platforms | Article categorization |
| Recommendation Systems | Topic matching |
Challenges in Keyword Extraction
Noisy HTML
Web pages often contain irrelevant content.
Duplicate Words
Repeated keywords may not always indicate importance.
Context Understanding
Simple frequency counting cannot fully understand meaning.
Multilingual Content
Different languages require different NLP models.
Advanced Alternatives
Modern AI systems use embedding-based methods and transformer models for smarter keyword extraction.
Popular tools include:
- spaCy
- KeyBERT
- NLTK
- Transformers
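Embedding-based tools such as KeyBERT compare meaning rather than raw word frequency, while spaCy and NLTK provide the linguistic preprocessing (tokenization, tagging, lemmatization) that such pipelines build on.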
Final Thoughts
Building a keyword extractor from HTML content is an excellent beginner-friendly NLP and web scraping project.
In this tutorial, you learned how to:
- Parse HTML
- Clean text
- Tokenize words
- Remove stop words
- Extract important keywords
This foundational project can later evolve into:
- SEO analyzers
- Semantic search engines
- AI content tools
- Recommendation systems
- Document analysis platforms