Help with Writing a Query to Remove Duplicates from a Table in SQL Server

Question

Ravi Vishwakarma · Answer

To remove duplicates from a table in SQL Server, you typically need to identify the duplicates based on certain columns and then delete the extra rows while retaining one instance of each duplicate group. Here’s a step-by-step approach:

Step 1: Identify Duplicates

First, identify the duplicate rows based on the columns that define a duplicate. You can use the ROW_NUMBER() window function to assign a sequential number (1, 2, 3, ...) to each row within a group of duplicates.

Step 2: Delete Duplicates

Use a common table expression (CTE) or a subquery to delete rows where the row number is greater than 1, thereby retaining only one instance of each duplicate group.

Example

Assume you have a table named MyTable with columns ID, Column1, and Column2, and you want to remove duplicates based on Column1 and Column2.

Step 1: Identify Duplicates

WITH CTE AS (
    SELECT 
        ID,
        Column1,
        Column2,
        ROW_NUMBER() OVER (PARTITION BY Column1, Column2 ORDER BY ID) AS RowNum
    FROM 
        MyTable
)
SELECT * FROM CTE WHERE RowNum > 1;

This query assigns a row number to each row within each group of duplicates defined by Column1 and Column2. Rows with RowNum greater than 1 are considered duplicates.
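Before deleting anything, it can also help to see how many copies each group has. A minimal sketch, assuming the same MyTable, Column1, and Column2 as above (DuplicateCount is just an illustrative alias):

```sql
-- List each (Column1, Column2) combination that occurs more than once,
-- together with the number of rows in that group.
SELECT
    Column1,
    Column2,
    COUNT(*) AS DuplicateCount
FROM
    MyTable
GROUP BY
    Column1,
    Column2
HAVING
    COUNT(*) > 1
ORDER BY
    DuplicateCount DESC;
```

If this returns no rows, there are no duplicates on those two columns and nothing needs to be deleted.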

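Another example, using an Orders table where rows that share the same ID are treated as duplicates:

WITH Order_CTE AS (
    SELECT OrderId, OrderName, OrderFrom, ID,
    -- Ordering by the partition column (ID) would leave the numbering arbitrary,
    -- so order by a column that differs within the group (OrderId is assumed here).
    ROW_NUMBER() OVER(PARTITION BY ID ORDER BY OrderId DESC) AS RowNo
    FROM Orders
)
SELECT * FROM Order_CTE WHERE RowNo <> 1;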

Step 2: Delete Duplicates

WITH CTE AS (
    SELECT 
        ID,
        Column1,
        Column2,
        ROW_NUMBER() OVER (PARTITION BY Column1, Column2 ORDER BY ID) AS RowNum
    FROM 
        MyTable
)
DELETE FROM CTE WHERE RowNum > 1;

Explanation:

WITH CTE AS (...): Defines a Common Table Expression (CTE) named CTE.
ROW_NUMBER() OVER (PARTITION BY Column1, Column2 ORDER BY ID) AS RowNum: Assigns a unique row number to each row within the partition defined by Column1 and Column2. Rows are ordered by ID.
DELETE FROM CTE WHERE RowNum > 1: Deletes, through the CTE, the rows of the underlying MyTable whose RowNum is greater than 1. This removes the duplicates and keeps only the first row of each group (the one with the lowest ID).

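The same delete for the Orders example:

WITH Order_CTE AS (
    SELECT OrderId, OrderName, OrderFrom, ID,
    -- Order by a column that differs within the group (OrderId is assumed here)
    -- so that the row kept for each ID is predictable.
    ROW_NUMBER() OVER(PARTITION BY ID ORDER BY OrderId DESC) AS RowNo
    FROM Orders
)
--SELECT * FROM Order_CTE WHERE RowNo <> 1;
DELETE FROM Order_CTE WHERE RowNo <> 1; -- Delete duplicate data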

Important Note:

Ensure that you have a backup of your data before performing delete operations. Once a DELETE is committed, it cannot be undone without restoring from a backup.
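One way to rehearse the delete safely is to run it inside an explicit transaction and check the result before committing. The following is only a sketch under the same assumptions as above (MyTable, Column1, Column2, ID); it ends with ROLLBACK so that running it as written changes nothing.

```sql
BEGIN TRANSACTION;

WITH CTE AS (
    SELECT
        ROW_NUMBER() OVER (PARTITION BY Column1, Column2 ORDER BY ID) AS RowNum
    FROM
        MyTable
)
DELETE FROM CTE WHERE RowNum > 1;

-- Number of rows removed by the DELETE just above.
SELECT @@ROWCOUNT AS RowsDeleted;

-- Should now return no rows if every duplicate group was reduced to one row.
SELECT Column1, Column2, COUNT(*) AS RemainingCount
FROM MyTable
GROUP BY Column1, Column2
HAVING COUNT(*) > 1;

-- Undo the delete. Replace with COMMIT TRANSACTION once the result looks right.
ROLLBACK TRANSACTION;
```

Keep in mind that an open transaction holds locks on the affected rows, so it is best not to leave it open longer than needed on a busy table.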

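If your table has a primary key or a unique identifier (like ID in this example), this approach works well, because ordering by that column lets you control which row of each group is kept. If it does not, you may need to adapt the ORDER BY clause to suit your specific table schema.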
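For a table with no unique column at all, one possible adaptation is to give ROW_NUMBER() a constant ordering. This sketch assumes the rows in each group are identical in every column, so it does not matter which copy survives:

```sql
WITH CTE AS (
    SELECT
        -- ORDER BY (SELECT NULL) satisfies the required ORDER BY clause
        -- without sorting on any real column; the numbering within each
        -- group is therefore arbitrary.
        ROW_NUMBER() OVER (
            PARTITION BY Column1, Column2
            ORDER BY (SELECT NULL)
        ) AS RowNum
    FROM
        MyTable
)
DELETE FROM CTE WHERE RowNum > 1;
```

If the copies differ in columns outside the PARTITION BY list and you care which one is kept, order by one of those columns instead.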
