Era of New Technology: Removing Duplicates from a Table in SQL

Checking for Duplicates:

On any version of SQL Server, you can identify duplicates using a simple query, with GROUP BY and HAVING, as follows:

DECLARE @table TABLE (data VARCHAR(20))
INSERT INTO @table VALUES ('not duplicate row')
INSERT INTO @table VALUES ('duplicate row')
INSERT INTO @table VALUES ('duplicate row')

SELECT data, COUNT(data) nr
FROM @table
GROUP BY data
HAVING COUNT(data) > 1

The result indicates that there are two occurrences of the row containing the “duplicate row” text:

data                 nr
-------------------- ----
duplicate row        2

Removing Duplicate Rows in SQL Server:

The following sections present a variety of techniques for removing duplicates from SQL Server database tables, depending on the nature of the table design.

Tables with no primary key:

When you have duplicates in a table that has no primary key defined, and you are using an older version of SQL Server, such as SQL Server 2000, you do not have an easy way to identify a single row. Any WHERE clause that matches one copy of a duplicated row matches every other copy as well, so a plain DELETE statement cannot remove just one of them.

You can, however, use the SET ROWCOUNT 1 command, which will restrict the subsequent DELETE statement to removing only one row. For example:

DECLARE @table TABLE (data VARCHAR(20))
INSERT INTO @table VALUES ('not duplicate row')
INSERT INTO @table VALUES ('duplicate row')
INSERT INTO @table VALUES ('duplicate row')

SET ROWCOUNT 1

DELETE FROM @table WHERE data = 'duplicate row'

SET ROWCOUNT 0

In the above example, only one row is deleted. Consequently, there will be one remaining row with the content “duplicate row”. If you have more than one duplicate of a particular row, set ROWCOUNT to the number of surplus copies, that is, the number of occurrences minus one. Note that after the delete, you should reset the ROWCOUNT to 0 so that subsequent queries are not affected.
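As a minimal sketch of adjusting ROWCOUNT to the number of surplus copies, the batch below computes that number into a variable first. The variable name @surplus and the extra third copy of the row are illustrative assumptions, not part of the original example. The guard matters because SET ROWCOUNT 0 means "no limit", so running the DELETE with a count of zero would remove every matching row.

```sql
DECLARE @table TABLE (data VARCHAR(20))
INSERT INTO @table VALUES ('not duplicate row')
INSERT INTO @table VALUES ('duplicate row')
INSERT INTO @table VALUES ('duplicate row')
INSERT INTO @table VALUES ('duplicate row')

-- surplus copies = occurrences minus the one row we want to keep
-- (@surplus is a hypothetical name chosen for this sketch)
DECLARE @surplus INT

SELECT @surplus = COUNT(*) - 1
FROM @table
WHERE data = 'duplicate row'

-- guard: SET ROWCOUNT 0 would lift the limit and delete every copy
IF @surplus > 0
BEGIN
    SET ROWCOUNT @surplus

    DELETE FROM @table WHERE data = 'duplicate row'

    SET ROWCOUNT 0
END

-- check the result: each value should now appear once
SELECT data, COUNT(*) AS nr
FROM @table
GROUP BY data
```

On SQL Server 2005 and later, DELETE TOP (@surplus) is the equivalent form, and Microsoft documents SET ROWCOUNT as deprecated for DELETE, INSERT and UPDATE statements.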

To remove all duplicates in a single pass, the following code will work, but it is likely to be horrendously slow if the table is large and contains many duplicates, because every iteration re-evaluates the GROUP BY subquery and deletes only a single row:

DECLARE @table TABLE (data VARCHAR(20))

INSERT INTO @table VALUES ('not duplicate row')
INSERT INTO @table VALUES ('duplicate row')
INSERT INTO @table VALUES ('duplicate row')

SET NOCOUNT ON

SET ROWCOUNT 1

WHILE 1 = 1
BEGIN
    -- ROWCOUNT is 1, so each pass removes one surplus row
    DELETE FROM @table
    WHERE data IN (SELECT data
                   FROM @table
                   GROUP BY data
                   HAVING COUNT(*) > 1)

    -- stop once no value occurs more than once
    IF @@ROWCOUNT = 0
        BREAK
END

SET ROWCOUNT 0

When cleaning up a table that has a large number of duplicate rows, a better approach is to select just a distinct list of the duplicates, delete all occurrences of those duplicate entries from the original and then insert the list into the original table.

DECLARE @table TABLE(data VARCHAR(20))

INSERT INTO @table VALUES ('not duplicate row')
INSERT INTO @table VALUES ('duplicate row')
INSERT INTO @table VALUES ('duplicate row')
INSERT INTO @table VALUES ('second duplicate row')
INSERT INTO @table VALUES ('second duplicate row')

SELECT data
INTO #duplicates
FROM @table
GROUP BY data
HAVING COUNT(*) > 1

-- delete all rows that are duplicated

DELETE o
FROM @table o INNER JOIN #duplicates d
    ON d.data = o.data

-- insert one row for every duplicate set

INSERT INTO @table(data)
SELECT data
FROM #duplicates

-- clean up the temporary table

DROP TABLE #duplicates

As a variation of this technique, you could select all the data, without duplicates, into a new table, drop the original table, and then rename the new table to match the name of the original table:

CREATE TABLE duplicateTable3(data VARCHAR(20))

INSERT INTO duplicateTable3 VALUES ('not duplicate row')
INSERT INTO duplicateTable3 VALUES ('duplicate row')
INSERT INTO duplicateTable3 VALUES ('duplicate row')
INSERT INTO duplicateTable3 VALUES ('second duplicate row')
INSERT INTO duplicateTable3 VALUES ('second duplicate row')

SELECT DISTINCT data
INTO tempTable
FROM duplicateTable3
GO

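-- the unique rows are now in tempTable, so the original can be dropped
-- (no TRUNCATE is needed first; DROP TABLE removes the data as well)
DROP TABLE duplicateTable3

-- give the de-duplicated table the original name
EXEC sp_rename 'tempTable', 'duplicateTable3'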

In this solution, the SELECT DISTINCT will select all the rows from our table except for the duplicates. These rows are immediately inserted into a table named tempTable. This is a temporary table in the sense that we will use it to temporarily store the unique rows. However, it is not a true temporary table (i.e. one that lives in the temporary database), because we need the table to exist in the current database, so that it can later be renamed, using sp_rename.

The sp_rename system stored procedure is an absolutely horrible way of renaming textual objects, such as stored procedures, views, functions and triggers, because it changes the object's name in the catalog but not in the stored definition text, leaving the two inconsistent. However, it works well for non-textual schema objects, such as tables.

New Techniques for Removing Duplicate Rows in SQL Server 2005

SQL Server 2005 has introduced the row_number() function, which provides an alternative means of identifying duplicates. Rewriting the first example, for tables with no primary key, we can now assign a row number to each row in a duplicate group, with a command such as:

DECLARE @duplicateTable4 TABLE (data VARCHAR(20))

INSERT INTO @duplicateTable4 VALUES ('not duplicate row')
INSERT INTO @duplicateTable4 VALUES ('duplicate row')
INSERT INTO @duplicateTable4 VALUES ('duplicate row')
INSERT INTO @duplicateTable4 VALUES ('second duplicate row')
INSERT INTO @duplicateTable4 VALUES ('second duplicate row')

SELECT data, row_number() OVER ( PARTITION BY data ORDER BY data ) AS nr
FROM @duplicateTable4

The result will show:

data                 nr
-------------------- --------------------
duplicate row        1
duplicate row        2
not duplicate row    1
second duplicate row 1
second duplicate row 2
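As a small extension of the query above, and assuming SQL Server 2005 or later, a windowed COUNT(*) can sit next to row_number() so that every row shows both the size of its group and its own sequence number within that group. The column name occurrences is an illustrative choice, not something from the original query.

```sql
DECLARE @duplicateTable4 TABLE (data VARCHAR(20))

INSERT INTO @duplicateTable4 VALUES ('not duplicate row')
INSERT INTO @duplicateTable4 VALUES ('duplicate row')
INSERT INTO @duplicateTable4 VALUES ('duplicate row')
INSERT INTO @duplicateTable4 VALUES ('second duplicate row')
INSERT INTO @duplicateTable4 VALUES ('second duplicate row')

-- occurrences: how many rows share this value (same for every row in the group)
-- nr: position of this particular row within its group
SELECT data,
       COUNT(*) OVER ( PARTITION BY data ) AS occurrences,
       row_number() OVER ( PARTITION BY data ORDER BY data ) AS nr
FROM @duplicateTable4
```

Unlike the GROUP BY query, this returns one row per table row rather than one row per group, which is what makes the individual duplicates addressable.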

In the row_number() query above, we specify both a partitioning and an ordering for the row_number() function. Because row_number() is a ranking window function, the PARTITION BY and the ORDER BY in the OVER clause are used only to determine the value of the nr column, and they do not affect the row order of the query.

Also, while the query looks similar to our previous GROUP BY query, there is a big difference concerning the returned rows. With GROUP BY you must use an aggregate on the columns that are not listed after the GROUP BY, and you get one row per group. With the OVER clause there is no such restriction, and you can get access to the individual rows in the groups specified by the PARTITION BY clause. This gives us access to the individual duplicate rows, so we can get not only the number of occurrences, but also a sequence number for the individual duplicates. Since the rows within a partition are identical, it does not matter which of them receives the sequence number 1.

To remove only the surplus rows, we can put the above query into a CTE or a subquery and delete every row whose nr is greater than 1. The CTE approach is as follows:

DECLARE @duplicateTable4 TABLE (data VARCHAR(20))

INSERT INTO @duplicateTable4 VALUES ('not duplicate row')
INSERT INTO @duplicateTable4 VALUES ('duplicate row')
INSERT INTO @duplicateTable4 VALUES ('duplicate row')
INSERT INTO @duplicateTable4 VALUES ('second duplicate row')
INSERT INTO @duplicateTable4 VALUES ('second duplicate row')

-- the statement before a CTE must be terminated, hence the leading semicolon
;WITH numbered
AS ( SELECT data, row_number() OVER ( PARTITION BY data ORDER BY data ) AS nr
     FROM @duplicateTable4
   )
DELETE FROM numbered
WHERE nr > 1
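The subquery form mentioned earlier could look like the following sketch, which numbers the rows in a derived table and deletes through its alias. It is written as an untested illustration of the same idea, and the final SELECT is added only to check the result.

```sql
DECLARE @duplicateTable4 TABLE (data VARCHAR(20))

INSERT INTO @duplicateTable4 VALUES ('not duplicate row')
INSERT INTO @duplicateTable4 VALUES ('duplicate row')
INSERT INTO @duplicateTable4 VALUES ('duplicate row')
INSERT INTO @duplicateTable4 VALUES ('second duplicate row')
INSERT INTO @duplicateTable4 VALUES ('second duplicate row')

-- same numbering as the CTE, written as a derived table in the FROM clause;
-- deleting through the alias removes the underlying table rows
DELETE numbered
FROM ( SELECT data, row_number() OVER ( PARTITION BY data ORDER BY data ) AS nr
       FROM @duplicateTable4
     ) AS numbered
WHERE nr > 1

-- check the result: one row per distinct value should remain
SELECT data
FROM @duplicateTable4
```

The CTE and the derived table express the same delete; the CTE tends to read more easily once the numbering query grows.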

This solution will even work with large objects, if you stick to the new large object types introduced in SQL Server 2005: i.e. use VARCHAR(MAX) instead of TEXT, NVARCHAR(MAX) instead of NTEXT, and VARBINARY(MAX) instead of IMAGE. These new types serve the same purpose as the deprecated TEXT, NTEXT and IMAGE types, and they have the advantage that their values can be compared, so you will be able to use them with both DISTINCT and row_number().
