Big Data Cleansing: Duplicate Removal
Challenge
The client, a bank, needed a custom software solution to consolidate customer records collected from various sources. The records used inconsistent address formats and contained typos and other errors; the bank required clean, structured information free of errors and duplicates.
However, due to privacy concerns, the bank could not grant SCD direct access to the customer records. Designing an effective algorithm for cleansing data at this volume therefore required careful study of the records' characteristics without ever examining them directly.
Solution
To remove duplicates effectively, we implemented an algorithm that identifies likely duplicate records and groups them into clusters.
The algorithm then selected the best record in each cluster, redirected all links from the duplicate records to the chosen one, and removed the duplicates.
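The cluster-select-redirect flow described above can be sketched as follows. This is a simplified illustration, not the production algorithm: the record fields, the normalization-based clustering key, and the completeness scoring rule are all assumptions made for the example.

```python
# Sketch of the duplicate-removal flow: group records into clusters by a
# normalized key, keep the "best" (most complete) record per cluster, and
# redirect links from discarded duplicates to the survivor.
# Fields and scoring are illustrative assumptions, not the client's schema.
from collections import defaultdict

def normalize(record):
    """Build a cluster key from name + address, ignoring case,
    punctuation, and extra whitespace (a simple blocking strategy)."""
    raw = f"{record['name']} {record['address']}".lower()
    cleaned = "".join(c for c in raw if c.isalnum() or c.isspace())
    return " ".join(cleaned.split())

def completeness(record):
    """Score a record by how many of its fields are filled in."""
    return sum(1 for value in record.values() if value)

def deduplicate(records, links):
    """Return surviving records and links rewritten to point at them."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[normalize(rec)].append(rec)

    survivors, redirect = [], {}
    for group in clusters.values():
        best = max(group, key=completeness)   # pick the best record
        survivors.append(best)
        for rec in group:                     # map every duplicate id
            redirect[rec["id"]] = best["id"]  # to the survivor's id

    fixed_links = [redirect.get(target, target) for target in links]
    return survivors, fixed_links

records = [
    {"id": 1, "name": "John Smith", "address": "12 Main St.", "phone": ""},
    {"id": 2, "name": "JOHN  SMITH", "address": "12 Main St", "phone": "555-0101"},
    {"id": 3, "name": "Ann Lee", "address": "4 Oak Ave", "phone": "555-0102"},
]
survivors, fixed_links = deduplicate(records, links=[1, 3])
```

Here records 1 and 2 normalize to the same key, record 2 survives as the more complete of the pair, and the link to record 1 is redirected to record 2.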
Developing an efficient method took several iterations. The initial versions of the algorithm performed poorly, so we wrote additional code that gathered aggregate metrics from the data, letting us improve the algorithm without accessing the records directly.
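Because the raw records could not be shared, metric-gathering code of roughly this shape could run on the bank's side and export only aggregate statistics, never customer data. The specific metrics shown (field fill rates, cluster-size histogram) are assumptions chosen to illustrate the idea.

```python
# Hypothetical profiling step: compute only aggregate, anonymized metrics
# (field fill rates and a cluster-size histogram) that can be exported to
# tune the dedup algorithm without exposing any customer records.
from collections import Counter

def profile(records, cluster_key):
    """Return aggregate metrics for a batch of records.

    cluster_key: a function mapping a record to its cluster key,
    so cluster-size statistics match whatever grouping is in use.
    """
    fill = Counter()
    for rec in records:
        for field, value in rec.items():
            if value:
                fill[field] += 1

    groups = Counter(cluster_key(r) for r in records)
    sizes = Counter(groups.values())  # histogram: cluster size -> count

    n = len(records)
    return {
        "records": n,
        "fill_rate": {field: count / n for field, count in fill.items()},
        "cluster_sizes": dict(sizes),  # e.g. {1: singletons, 2: pairs, ...}
    }
```

A spike of large clusters or a low fill rate on a key field would signal that the clustering key is too coarse or that a field cannot be relied on for matching.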