Big Data Cleansing: Duplicate Removal

Over years of using various information systems, a large European bank had accumulated millions of customer records. Some records contained typos; others were logical duplicates. The bank needed an effective way to cleanse the data and bring it into the format required for further operation.
Industry: Banking and Finance
Region: Western Europe
Effort: <= 1 man-year


The customer records had been gathered from different sources and had inconsistent address formats, typos, and errors. The bank needed clean, structured information free of errors and duplicates.

The data, however, were private: the bank could not give us direct access to the customer records. Yet designing an effective cleansing algorithm for data at this scale requires understanding the data's specific characteristics.


We implemented an algorithm that divided records likely to be duplicates into clusters.
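The source does not describe the actual clustering method, so the following is only an illustrative sketch of the general idea: normalize each record, group records into blocks by a cheap key to limit comparisons, and then cluster similar records within each block. The `normalize`, `cluster_duplicates`, and threshold values are all assumptions for this example.

```python
import difflib
from collections import defaultdict


def normalize(name: str) -> str:
    # Crude normalization: lowercase, drop punctuation, collapse whitespace.
    return " ".join(
        "".join(c for c in name.lower() if c.isalnum() or c.isspace()).split()
    )


def cluster_duplicates(records, threshold=0.85):
    """Group records whose normalized forms are similar enough.

    Blocking on the first character keeps pairwise comparisons tractable;
    a production system would use phonetic keys or n-gram indexes instead.
    """
    blocks = defaultdict(list)
    for rec in records:
        blocks[normalize(rec)[:1]].append(rec)

    clusters = []
    for block in blocks.values():
        block_clusters = []
        for rec in block:
            for cluster in block_clusters:
                # Compare against the cluster's first record as representative.
                ratio = difflib.SequenceMatcher(
                    None, normalize(rec), normalize(cluster[0])
                ).ratio()
                if ratio >= threshold:
                    cluster.append(rec)
                    break
            else:
                block_clusters.append([rec])
        clusters.extend(block_clusters)
    return clusters
```

For example, `cluster_duplicates(["John Smith", "john  smith.", "Jon Smith", "Mary Jones"])` groups the three Smith variants into one cluster and leaves "Mary Jones" in its own.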

The algorithm then selected the best record in each cluster, redirected all links from the duplicate records to it, and finally removed the duplicates.
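How the "best" record was chosen is not specified in the source; a common heuristic is to keep the most complete record. The sketch below assumes that heuristic, plus simple data shapes (record dicts with an `"id"` field and a flat list of referencing ids), purely for illustration.

```python
def completeness(record: dict) -> int:
    # Score a record by how many of its fields are filled in.
    return sum(1 for v in record.values() if v not in (None, ""))


def merge_clusters(clusters, references):
    """Keep the most complete record per cluster and remap references.

    `clusters` is a list of lists of record dicts (each with an 'id');
    `references` is a list of record ids used elsewhere in the system.
    Both shapes are assumptions for this sketch.
    """
    remap = {}
    survivors = []
    for cluster in clusters:
        best = max(cluster, key=completeness)
        survivors.append(best)
        for rec in cluster:
            if rec is not best:
                # Duplicate records disappear; their ids now point at the survivor.
                remap[rec["id"]] = best["id"]
    new_refs = [remap.get(r, r) for r in references]
    return survivors, new_refs
```

After remapping, the duplicate records can be deleted safely because nothing in the system refers to them anymore.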

It took several iterations to develop a sufficiently fast method. The initial versions of the algorithm were not effective, so we wrote additional code that gathered a number of aggregate metrics from the data, allowing us to improve the algorithm without accessing the data directly.
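The source does not say which metrics were collected. One plausible approach, sketched here under that assumption, is a profiling script that runs on the data owner's side and returns only aggregates (field fill rates, coarse length histograms), never raw values; the field names are illustrative.

```python
from collections import Counter


def profile_records(records):
    """Compute aggregate, privacy-preserving metrics over customer records.

    Returns only statistics, never raw field values, so it can run on the
    data owner's side and share the result with the algorithm developers.
    """
    metrics = {
        "total": len(records),
        "fill_rate": {},            # fraction of records with each field filled
        "name_length_hist": Counter(),  # name lengths bucketed by tens of chars
    }
    fields = set().union(*(r.keys() for r in records)) if records else set()
    for field in fields:
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        metrics["fill_rate"][field] = filled / len(records)
    for r in records:
        metrics["name_length_hist"][len(r.get("name", "")) // 10] += 1
    return metrics
```

Aggregates like these are enough to tune normalization rules, blocking keys, and similarity thresholds without ever seeing a customer's name or address.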
