« Items to do from today's meeting | Main | Del.icio.us Technology Revealed »

Removing Duplicates From a Database

You've inserted multiple copies of the same data into a table. You want to get rid of all but one copy of each, and you want to keep the one with the lowest ID. You are using a DMBS with sub-selects, such as PostgreSQL.

You will do this:

select min(wine_id) as canonical from
        wine, (select distinct wine_name from wine) as names
        where wine.wine_name = names.wine_name
        group by names.wine_name;

If that looks right, you will proceed:

delete from wine where wine_id not in
    (select min(wine_id) as canonical from
        wine, (select distinct wine_name from wine) as names
        where wine.wine_name = names.wine_name
        group by names.wine_name);

More generally, if you have a table T with pkey T_ID and the desired data is unique on some column UNIQUE_COLUMN, it's:

select min(T_ID) as canonical from
        T, (select distinct UNIQUE_COLUMN from T) as equiv_classes
        where T.UNIQUE_COLUMN = equiv_classes.UNIQUE_COLUMN
        group by equiv_classes.UNIQUE_COLUMN;

Post a comment