What is Cross-Validation?
Updated on 05-Mar-2026
Anubhav Kumar
Cross-Validation is a technique used in Machine Learning to check how well a model will perform on new, unseen data.
In simple terms, it helps answer an important question:
“Is my model really good, or did it just memorize the training data?”
This is important because many models suffer from Overfitting, where they perform very well on training data but fail on new data.
Cross-validation helps prevent this problem.
Simple Idea (Layman Example)
Imagine a teacher wants to test students.
Instead of asking questions only from one chapter, the teacher randomly selects questions from different chapters to see if students truly understand the subject.
Cross-validation does the same thing with data.
It repeatedly splits the dataset into training and testing parts to evaluate the model more reliably.
How Cross-Validation Works
The dataset is divided into multiple parts called folds.
The model is trained on some folds and tested on the remaining fold.
This process repeats several times.
Example with 5 folds: in the first round, the model trains on folds 1–4 and tests on fold 5.
In the next round, it trains on folds 1, 2, 3, and 5 and tests on fold 4.
This continues until every fold has been used as a test set exactly once.
Finally, the performance scores from all rounds are averaged.
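The splitting described above can be sketched in plain Python. This is a minimal illustration of the idea; in practice you would typically use a library helper such as scikit-learn's KFold:

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any leftover samples.
        end = n_samples if fold == k - 1 else start + fold_size
        test_idx = indices[start:end]
        train_idx = indices[:start] + indices[end:]
        yield train_idx, test_idx

# With 10 samples and 5 folds, every round trains on 8 rows and tests on 2
for train_idx, test_idx in kfold_indices(10, 5):
    print(len(train_idx), len(test_idx))  # prints 8 2 in each round
```

Note that every index appears in exactly one test fold, which is what guarantees each data point is tested once.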
K-Fold Cross-Validation
The most common type is K-Fold Cross-Validation.
Steps:
1. Split the dataset into K equal parts, called folds.
2. Train the model on K − 1 folds and test it on the remaining fold.
3. Repeat the process K times, so every fold is used as the test set exactly once.
4. Average the K scores to get the final performance estimate.
Example: if K = 5, the data is split into 5 folds, the model is trained and tested 5 times, and the 5 scores are averaged.
This gives a more stable and reliable model evaluation.
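The final averaging step is simple; as a quick sketch (the five per-fold accuracy scores below are hypothetical):

```python
from statistics import mean

# Hypothetical accuracy scores from the 5 rounds of 5-fold cross-validation
fold_scores = [0.82, 0.79, 0.85, 0.81, 0.83]

# The cross-validated estimate is the mean of the per-fold scores
cv_score = mean(fold_scores)
print(round(cv_score, 2))  # prints 0.82
```

Looking at the spread of the individual fold scores (here 0.79 to 0.85) also hints at how much the estimate varies with the data split.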
Example
Suppose you build a Decision Tree model.
Using 5-Fold Cross-Validation, the dataset is divided into 5 folds.
In each round, the tree is trained on 4 folds and tested on the remaining fold, so every part of the data is used for testing exactly once.
The average accuracy from all 5 rounds gives the final model performance.
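The whole procedure can be sketched end to end in plain Python. For brevity this uses a toy one-nearest-neighbour rule as a stand-in for the decision tree, on a small hand-made dataset; the cross-validation loop itself is the point:

```python
# Toy dataset: one feature per row, with binary labels (low values -> 0, high -> 1)
X = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
y = [0,   0,   0,   0,   0,   1,   1,   1,   1,   1]

def predict_1nn(train_x, train_y, x):
    """Predict the label of the closest training point (stand-in model)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

k = 5
n = len(X)
fold_size = n // k
scores = []
for fold in range(k):
    test_idx = range(fold * fold_size, (fold + 1) * fold_size)
    train_idx = [i for i in range(n) if i not in test_idx]
    # "Train" on 4 folds, then score the held-out fold
    correct = sum(
        predict_1nn([X[i] for i in train_idx], [y[i] for i in train_idx], X[j]) == y[j]
        for j in test_idx
    )
    scores.append(correct / fold_size)

print(sum(scores) / k)  # prints 1.0 on this easily separated data
```

Swapping the toy predictor for a real model (e.g. a scikit-learn DecisionTreeClassifier) leaves the loop structure unchanged.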
Types of Cross-Validation
1. K-Fold Cross-Validation
Most commonly used technique.
2. Stratified K-Fold
Used for classification problems, especially with imbalanced classes: each fold is built so that it preserves the overall class distribution of the dataset.
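A minimal sketch of the stratification idea in plain Python (real projects would use scikit-learn's StratifiedKFold):

```python
from collections import defaultdict

def stratified_kfold_indices(labels, k):
    """Yield (train, test) index lists; each fold keeps the class mix."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    # Deal each class's indices round-robin so every fold gets its fair share
    for idxs in by_class.values():
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test

# 6 samples of class 0 and 3 of class 1: every test fold gets 2 zeros and 1 one
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1]
for train, test in stratified_kfold_indices(labels, 3):
    print([labels[i] for i in test])  # prints [0, 0, 1] in each round
```

Plain K-Fold on the same data could easily produce a test fold with no class-1 samples at all, which is exactly what stratification prevents.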
3. Leave-One-Out Cross-Validation (LOOCV)
Here, each fold contains exactly one row: the model is trained on all rows except one and tested on that single row.
This repeats for every row, so a dataset with N rows requires N training runs, which makes LOOCV expensive on large datasets.
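LOOCV is simply K-Fold with K equal to the number of rows; a minimal sketch:

```python
def loo_indices(n_samples):
    """Leave-One-Out: each round tests on exactly one row."""
    for i in range(n_samples):
        train = [j for j in range(n_samples) if j != i]
        yield train, [i]

# With 4 rows there are 4 rounds, each holding out a different single row
for train, test in loo_indices(4):
    print(train, test)
```

The first round prints [1, 2, 3] [0], the second [0, 2, 3] [1], and so on, which makes the N-training-runs cost easy to see.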
Real-World Usage
Cross-validation is used in many machine learning tasks, such as model selection, hyperparameter tuning, and comparing different algorithms on the same dataset.
It is also commonly used with algorithms like Decision Tree, Random Forest, and Support Vector Machine.
One-Line Summary
Cross-Validation is a method to test how well a machine learning model will perform on unseen data by repeatedly training and testing it on different parts of the dataset.