The customer records had been gathered from different sources, so they came in inconsistent address formats and contained typos and other errors. The bank needed clean, structured data, free of errors and duplicates.
But the data was private: the bank could not give us direct access to the customer records. Yet to design an effective cleansing algorithm for data at this volume, you need to understand its specifics.
We implemented an algorithm that grouped likely duplicate records into clusters.
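To make the clustering step concrete, here is a minimal Python sketch. The field names (`last_name`, `postcode`, `full_name`, `address`), the blocking key, and the similarity threshold are all assumptions for illustration, not the bank's actual schema or our production logic. The idea is standard: block records on a cheap normalized key so that not every pair has to be compared, then fuzzily compare records within each block.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def normalize(value: str) -> str:
    """Lowercase and drop punctuation/whitespace so format differences
    alone do not prevent a match."""
    return "".join(ch for ch in value.lower() if ch.isalnum())

def blocking_key(record: dict) -> str:
    """Cheap coarse key: only records sharing a key are ever compared.
    Field names here are hypothetical."""
    return normalize(record["last_name"])[:4] + normalize(record["postcode"])

def cluster_candidates(records: list[dict], threshold: float = 0.85) -> list[list[dict]]:
    """Group records into clusters of likely duplicates: block first to
    avoid O(n^2) comparisons, then fuzzily compare within each block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)

    clusters = []
    for block in blocks.values():
        remaining = list(block)
        while remaining:
            seed = remaining.pop()
            cluster = [seed]
            for other in remaining[:]:  # iterate over a copy while removing
                score = SequenceMatcher(
                    None,
                    normalize(seed["full_name"] + seed["address"]),
                    normalize(other["full_name"] + other["address"]),
                ).ratio()
                if score >= threshold:
                    cluster.append(other)
                    remaining.remove(other)
            clusters.append(cluster)
    return clusters
```

Blocking is what keeps this tractable: instead of comparing every record against every other, expensive fuzzy comparisons run only inside small blocks of plausible candidates.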
The algorithm then selected the best record in each cluster, repointed all links from the duplicate records to that one, and finally removed the duplicates.
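The survivor-selection and link-repointing step could look like the following sketch. The "most complete record wins" heuristic and the flat shape of the `links` mapping are assumptions made here for illustration; the actual criteria for the best record were more involved.

```python
def completeness(record: dict) -> int:
    """Score a record by how many of its fields are filled in.
    A simple illustrative heuristic for choosing the 'best' record."""
    return sum(1 for value in record.values() if value not in (None, ""))

def merge_cluster(cluster: list[dict], links: dict[int, int]) -> list[int]:
    """Keep the most complete record, repoint links, return IDs to delete.

    `links` maps a referencing row (e.g. an account ID) to the customer
    ID it points at; this structure is a hypothetical simplification.
    """
    survivor = max(cluster, key=completeness)
    doomed = {rec["id"] for rec in cluster} - {survivor["id"]}
    for ref, target in links.items():
        if target in doomed:
            links[ref] = survivor["id"]  # rewire the reference to the survivor
    return sorted(doomed)  # caller deletes these duplicate records
```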
It took several iterations to develop a fast enough method. The initial versions of the algorithm were not effective enough, so we wrote additional code that gathered a set of metrics from the data, which let us improve the algorithm without ever accessing the records themselves.
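One way to collect such metrics without exporting any customer values is to profile each column into aggregates only, as in this hypothetical sketch; the specific metrics shown (null rate, format masks, length statistics) are illustrative, not necessarily what was gathered:

```python
from collections import Counter
import re

def profile_column(values: list[str]) -> dict:
    """Reduce a column to aggregate metrics so no customer value has to
    leave the bank; only these summaries are exported."""
    non_empty = [v for v in values if v]
    # Mask digits as '9' and letters as 'A' to expose the value's format,
    # e.g. '12345' -> '99999' and 'SW1A 1AA' -> 'AA9A 9AA'.
    formats = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v)) for v in non_empty
    )
    return {
        "null_rate": 1 - len(non_empty) / max(len(values), 1),
        "top_formats": formats.most_common(5),
        "avg_length": sum(len(v) for v in non_empty) / max(len(non_empty), 1),
    }
```

Aggregates like format masks reveal, for instance, how many distinct address or postcode layouts the data contains, which is enough to tune normalization rules and matching thresholds without seeing a single real record.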