Big Data Cleansing: Duplicate Removal

A large European bank, having used various information systems over the years, had accumulated millions of customer records. Some records contained typos, while others were logical duplicates. The bank needed an effective way to cleanse the data and bring it to the required format for further operations.
Industry:
Banking and Finance
Region:
Germany
Technologies:
Java
Volume:
0.5 man-years

Challenge

The client was looking for a custom software solution to help the bank consolidate customer records collected from various sources. The records had inconsistent address formats, typos, and other errors. The bank required clean, structured information free of errors and duplicates.

However, due to privacy concerns, the bank couldn't provide SCD with direct access to the customer records. Creating an effective algorithm for cleansing such data volumes therefore required careful consideration of the data's specifics without ever inspecting it directly.

Solution

To solve the problem, we implemented an algorithm that identifies likely duplicate records and groups them into clusters.
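One common way to form such clusters is blocking on a normalized key. The sketch below is a minimal, hypothetical illustration (the record fields, the `Customer` type, and the choice of name plus postal code as the key are assumptions, not the bank's actual schema); real matching would add fuzzy comparison within each cluster.

```java
import java.util.*;
import java.util.stream.*;

public class DuplicateClustering {

    // Hypothetical customer record; the real schema is not public.
    public record Customer(long id, String name, String postalCode) {}

    // Normalize a field: lower-case it and drop everything
    // except letters and digits, so formatting differences
    // ("10 115" vs. "10115") do not split a cluster.
    static String normalize(String s) {
        return s == null ? "" : s.toLowerCase().replaceAll("[^\\p{L}\\p{N}]", "");
    }

    // Group records into candidate duplicate clusters keyed by
    // the normalized (name, postal code) pair.
    public static Map<String, List<Customer>> cluster(List<Customer> records) {
        return records.stream().collect(Collectors.groupingBy(
                c -> normalize(c.name()) + "|" + normalize(c.postalCode())));
    }
}
```

Blocking keeps the pairwise-comparison cost manageable on millions of records: expensive fuzzy matching only runs inside each small cluster, not across the whole data set.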

The algorithm then selected the best record in each cluster, redirected all references from the duplicate records to the chosen one, and removed the duplicates.
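The merge step above can be sketched as follows. This is an assumed, simplified version: the survivor is picked by a field-completeness score with ties broken by lowest id, and the reference rewrite is expressed as a duplicate-id to survivor-id map; the actual selection criteria were not disclosed.

```java
import java.util.*;

public class DuplicateMerge {

    // Hypothetical customer record with two illustrative fields.
    public record Customer(long id, String name, String email) {}

    // Completeness score: number of non-blank fields.
    static int completeness(Customer c) {
        int score = 0;
        if (c.name() != null && !c.name().isBlank()) score++;
        if (c.email() != null && !c.email().isBlank()) score++;
        return score;
    }

    // For one cluster, pick the most complete record (lowest id wins
    // ties) and return a map from each duplicate id to the survivor id;
    // callers would use it to rewrite references before deleting rows.
    public static Map<Long, Long> mergeCluster(List<Customer> cluster) {
        Customer best = cluster.stream()
                .max(Comparator.comparingInt(DuplicateMerge::completeness)
                        .thenComparing(Comparator.comparingLong(Customer::id).reversed()))
                .orElseThrow();
        Map<Long, Long> remap = new HashMap<>();
        for (Customer c : cluster) {
            if (c.id() != best.id()) remap.put(c.id(), best.id());
        }
        return remap;
    }
}
```

Returning a remap table rather than mutating records in place makes the link-rewriting step auditable: the table can be reviewed, applied to foreign keys, and only then are the duplicate rows deleted.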

Several iterations were necessary to develop an efficient method. Early versions of the algorithm performed poorly, so we wrote additional code to gather aggregate metrics from the data, which let us refine the algorithm without accessing the records directly.
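The kind of metrics that can be gathered without exposing raw records might look like the sketch below: aggregate statistics such as a field's fill rate and a bucketed value-length histogram, which the bank could compute on-site and share. The specific metrics and helper names here are illustrative assumptions, not the ones actually used.

```java
import java.util.*;

public class FieldMetrics {

    // Fill rate: fraction of non-blank values in a column.
    public static double fillRate(List<String> column) {
        if (column.isEmpty()) return 0.0;
        long filled = column.stream()
                .filter(v -> v != null && !v.isBlank())
                .count();
        return (double) filled / column.size();
    }

    // Histogram of value lengths, bucketed so that exact
    // values cannot be reconstructed from the counts.
    public static Map<Integer, Long> lengthHistogram(List<String> column, int bucket) {
        Map<Integer, Long> hist = new TreeMap<>();
        for (String v : column) {
            int len = v == null ? 0 : v.length();
            hist.merge((len / bucket) * bucket, 1L, Long::sum);
        }
        return hist;
    }
}
```

Statistics like these reveal, for example, which fields are too sparse or too noisy to serve as matching keys, guiding algorithm tuning while the raw customer data stays with the bank.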