Big Data Cleansing: Removing Duplicates
Description
A large European bank had used various information systems over the years and, as a result, had accumulated millions of customer records. Some of these records contained typos, while others were logical duplicates. The bank needed an efficient, fast solution to clean the data, remove duplicates, and bring it into a structured format for further operations.
Challenge
The main challenge was to quickly develop an algorithm that could efficiently handle a large volume of data and remove duplicates at high speed. Another key challenge was that we did not have direct access to the actual data, only similar, simulated data, which made testing difficult.
Solution
To deliver an effective duplicate removal solution within strict deadlines, we quickly implemented an algorithm that identified duplicate records by grouping them into clusters. The algorithm then selected the best record in each cluster, replaced all links to the duplicate records with a link to the chosen one, and removed the duplicates.
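The sketch below illustrates the general idea in Python under simplifying assumptions: the record fields, the blocking key (a normalized email), and the quality score are illustrative placeholders, not the bank's actual schema or matching logic.

```python
from collections import defaultdict

# Hypothetical customer records; the field names are illustrative assumptions.
records = [
    {"id": 1, "name": "Anna Schmidt", "email": "anna.schmidt@example.com"},
    {"id": 2, "name": "Ana Schmidt",  "email": "anna.schmidt@example.com"},
    {"id": 3, "name": "Peter Novak",  "email": "p.novak@example.com"},
]
# Links from other tables that reference customer ids (also illustrative).
links = [{"account": "A-100", "customer_id": 2},
         {"account": "A-200", "customer_id": 3}]

def cluster_key(rec):
    """Group candidate duplicates by a normalized blocking key (here: email)."""
    return rec["email"].strip().lower()

def record_quality(rec):
    """Score used to pick the 'best' record in a cluster (here: non-empty fields)."""
    return sum(1 for value in rec.values() if value)

# 1. Group records into clusters of potential duplicates.
clusters = defaultdict(list)
for rec in records:
    clusters[cluster_key(rec)].append(rec)

# 2. Pick the best record per cluster and map every duplicate id to it.
id_remap = {}
survivors = []
for group in clusters.values():
    best = max(group, key=record_quality)
    survivors.append(best)
    for rec in group:
        id_remap[rec["id"]] = best["id"]

# 3. Re-point links to the surviving records; the duplicates can then be dropped.
for link in links:
    link["customer_id"] = id_remap.get(link["customer_id"], link["customer_id"])

print(survivors)
print(links)
```

In a production setting the blocking key and quality score would be tuned to the real data, and the clustering would typically combine several fuzzy-matching criteria rather than a single exact key.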
Several iterations were needed to arrive at an efficient method. Early versions of the algorithm were not effective enough, so we wrote additional code to gather metrics from the data and used them to improve the algorithm without accessing the data directly.
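As a rough illustration of that feedback loop, aggregate metrics like the ones below could be computed on the real data and shared without exposing individual records; the specific metrics shown are assumptions for the sketch, not the bank's actual reporting.

```python
from collections import Counter

def cluster_metrics(clusters):
    """Aggregate, privacy-preserving statistics about a clustering run.

    `clusters` maps a blocking key to the list of records grouped under it,
    as produced in the sketch above.
    """
    sizes = [len(group) for group in clusters.values()]
    return {
        "total_records": sum(sizes),
        "total_clusters": len(sizes),
        "duplicate_records": sum(size - 1 for size in sizes if size > 1),
        "cluster_size_histogram": dict(Counter(sizes)),
        "largest_cluster": max(sizes, default=0),
    }
```

Comparing such metrics across algorithm versions makes it possible to judge whether a change improves duplicate detection even when the development team never sees the underlying records.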
Looking for an automated data cleansing solution?
Connect with us, and the first consultation will be provided free of charge.