Big Data Cleansing: Duplicate Removal
Challenge
The client, a bank, needed a custom software solution to consolidate customer records collected from various sources. The records used inconsistent address formats and contained typos and other errors; the bank required clean, structured information free of errors and duplicates.
However, due to privacy concerns, the bank could not grant SCD direct access to the customer records. Designing an effective algorithm for cleansing data at this volume therefore required careful study of the records' characteristics without ever examining them directly.
Solution
To remove duplicates effectively, we implemented an algorithm that identifies likely duplicate records and groups them into clusters.
The algorithm then selected the best record in each cluster, redirected all links from the duplicate records to the chosen one, and removed the duplicates.
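The cluster-select-redirect flow described above can be sketched as follows. This is a simplified illustration, not the production algorithm: the record fields, the normalization-based clustering key, and the completeness scoring rule are all assumptions made for the example.

```python
# Sketch of the duplicate-removal flow: group records into clusters by a
# normalized key, keep the "best" (most complete) record per cluster, and
# redirect links from discarded duplicates to the survivor.
# Fields and scoring are illustrative assumptions, not the client's schema.
from collections import defaultdict

def normalize(record):
    """Build a cluster key from name + address, ignoring case,
    punctuation, and extra whitespace (a simple blocking strategy)."""
    raw = f"{record['name']} {record['address']}".lower()
    cleaned = "".join(c for c in raw if c.isalnum() or c.isspace())
    return " ".join(cleaned.split())

def completeness(record):
    """Score a record by how many of its fields are filled in."""
    return sum(1 for value in record.values() if value)

def deduplicate(records, links):
    """Return surviving records and links rewritten to point at them."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[normalize(rec)].append(rec)

    survivors, redirect = [], {}
    for group in clusters.values():
        best = max(group, key=completeness)   # pick the best record
        survivors.append(best)
        for rec in group:                     # map every duplicate id
            redirect[rec["id"]] = best["id"]  # to the survivor's id

    fixed_links = [redirect.get(target, target) for target in links]
    return survivors, fixed_links

records = [
    {"id": 1, "name": "John Smith", "address": "12 Main St.", "phone": ""},
    {"id": 2, "name": "JOHN  SMITH", "address": "12 Main St", "phone": "555-0101"},
    {"id": 3, "name": "Ann Lee", "address": "4 Oak Ave", "phone": "555-0102"},
]
survivors, fixed_links = deduplicate(records, links=[1, 3])
```

Here records 1 and 2 normalize to the same key, record 2 survives as the more complete of the pair, and the link to record 1 is redirected to record 2.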
Developing an efficient method took several iterations. The initial versions of the algorithm performed poorly, so we wrote additional code that gathered aggregate metrics from the data, letting us improve the algorithm without accessing the records directly.
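Because the raw records could not be shared, metric-gathering code of roughly this shape could run on the bank's side and export only aggregate statistics, never customer data. The specific metrics shown (field fill rates, cluster-size histogram) are assumptions chosen to illustrate the idea.

```python
# Hypothetical profiling step: compute only aggregate, anonymized metrics
# (field fill rates and a cluster-size histogram) that can be exported to
# tune the dedup algorithm without exposing any customer records.
from collections import Counter

def profile(records, cluster_key):
    """Return aggregate metrics for a batch of records.

    cluster_key: a function mapping a record to its cluster key,
    so cluster-size statistics match whatever grouping is in use.
    """
    fill = Counter()
    for rec in records:
        for field, value in rec.items():
            if value:
                fill[field] += 1

    groups = Counter(cluster_key(r) for r in records)
    sizes = Counter(groups.values())  # histogram: cluster size -> count

    n = len(records)
    return {
        "records": n,
        "fill_rate": {field: count / n for field, count in fill.items()},
        "cluster_sizes": dict(sizes),  # e.g. {1: singletons, 2: pairs, ...}
    }
```

A spike of large clusters or a low fill rate on a key field would signal that the clustering key is too coarse or that a field cannot be relied on for matching.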