Remove Duplicate Rows

One interesting practical algorithmic problem I solved as part of my work at Oracle is the optimization of existing logic for identifying and removing duplicates. The problem statement is as follows: we have multiple rows (around 100,000) of the following form:

| Schedule Name   | Description          | Currency | Schedule Type | Start Date | End Date |
| --------------- | -------------------- | -------- | ------------- | ---------- | -------- |
| Rate Schedule 1 | Test Rate Schedule 1 | USD      | Person        | 08/20/15   | 11/30/18 |
| Rate Schedule 2 | Test Rate Schedule 2 | INR      | Job           | 08/20/15   | 11/30/18 |
| Rate Schedule 1 | Test Rate Schedule 1 | USD      | Person        | 08/20/15   | 11/30/18 |

Here, Row 1 and Row 3 are duplicates. The initial approach used to identify duplicates was a brute-force pairwise comparison: each cell value in the 1st row is compared with the corresponding cell value in the next n−1 rows, each cell value in the 2nd row with the corresponding cell value in the next n−2 rows, each cell value in the 3rd row with the corresponding cell value in the next n−3 rows, and so on. The asymptotic time complexity of this implementation is Θ(n²), and it was the costliest step in the entire web service.

I proposed two alternatives to reduce the asymptotic time complexity. Both share a precomputation step for each row: concatenate the values in the row using a pair of rare special characters ($@) as a separator, hash the resulting string with the SHA-256 algorithm, and store the digests in a new column (one per row). The table above would then become something like the following:

| Schedule Name   | Description          | Currency | Schedule Type | Start Date | End Date | Hash Value |
| --------------- | -------------------- | -------- | ------------- | ---------- | -------- | ---------- |
| Rate Schedule 1 | Test Rate Schedule 1 | USD      | Person        | 08/20/15   | 11/30/18 | 41d43f3dcd2e9a1789812221c5a11490d8e699dc228e284a5f5f979367610cf3 |
| Rate Schedule 2 | Test Rate Schedule 2 | INR      | Job           | 08/20/15   | 11/30/18 | ecd008cef30a767952b1a6bfa7db95ae18cdc7a326882042bf8ab8b81150ce95 |
| Rate Schedule 1 | Test Rate Schedule 1 | USD      | Person        | 08/20/15   | 11/30/18 | 41d43f3dcd2e9a1789812221c5a11490d8e699dc228e284a5f5f979367610cf3 |

For example, the hash value for Row 1 is obtained by computing:

SHA256("Rate Schedule 1$@Test Rate Schedule 1$@USD$@Person$@08/20/15$@11/30/18") = 41d43f3dcd2e9a1789812221c5a11490d8e699dc228e284a5f5f979367610cf3
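This precomputation can be sketched in Python (a minimal sketch; the helper name `row_hash` is my own, and the exact digest you get depends on the separator convention used):

```python
import hashlib

def row_hash(row):
    # Join the cell values with the "$@" separator and hash with SHA-256.
    joined = "$@".join(row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

row1 = ["Rate Schedule 1", "Test Rate Schedule 1", "USD", "Person", "08/20/15", "11/30/18"]
row2 = ["Rate Schedule 2", "Test Rate Schedule 2", "INR", "Job", "08/20/15", "11/30/18"]
row3 = ["Rate Schedule 1", "Test Rate Schedule 1", "USD", "Person", "08/20/15", "11/30/18"]

print(row_hash(row1) == row_hash(row3))  # True: identical rows share a digest
print(row_hash(row1) == row_hash(row2))  # False: different rows get different digests
```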

Two rows are treated as duplicates if and only if their hash values are the same (ignoring the astronomically unlikely event of a SHA-256 collision).
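For contrast, the original pairwise-comparison step that this hashing scheme replaces can be sketched as follows (a minimal Python sketch; the function name is my own):

```python
def find_duplicate_rows(rows):
    # Compare every row against each later row, cell by cell:
    # Θ(n²) row comparisons in total.
    duplicates = set()
    n = len(rows)
    for i in range(n):
        for j in range(i + 1, n):
            if rows[i] == rows[j]:  # element-wise comparison of cell values
                duplicates.add(j)
    return duplicates

rows = [
    ["Rate Schedule 1", "Test Rate Schedule 1", "USD", "Person", "08/20/15", "11/30/18"],
    ["Rate Schedule 2", "Test Rate Schedule 2", "INR", "Job", "08/20/15", "11/30/18"],
    ["Rate Schedule 1", "Test Rate Schedule 1", "USD", "Person", "08/20/15", "11/30/18"],
]
print(find_duplicate_rows(rows))  # {2}: Row 3 (index 2) duplicates Row 1
```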

The two approaches proposed to remove duplicate rows are as follows:

(i) Sort the rows by hash value and remove adjacent duplicates, which takes Θ(n lg n).

(ii) Place all the values of the new hash column into a hash set (hs), reporting a duplicate whenever a value is already present. Each lookup in hs takes O(1) amortized time, so n lookups take O(n).
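Both approaches can be sketched in Python (a minimal sketch under the same "$@" separator assumption; function names are my own):

```python
import hashlib

def row_hash(row):
    return hashlib.sha256("$@".join(row).encode("utf-8")).hexdigest()

def dedup_by_sorting(rows):
    # Approach 1: sort by hash value, then drop each row whose hash
    # repeats the previous one. Θ(n lg n) overall.
    ordered = sorted(rows, key=row_hash)
    kept = []
    for row in ordered:
        if not kept or row_hash(kept[-1]) != row_hash(row):
            kept.append(row)
    return kept

def dedup_by_hash_set(rows):
    # Approach 2: keep a set of hashes seen so far; each membership
    # test is O(1) amortized, so n rows take O(n) overall.
    seen, kept = set(), []
    for row in rows:
        h = row_hash(row)
        if h not in seen:
            seen.add(h)
            kept.append(row)
    return kept

rows = [
    ["Rate Schedule 1", "Test Rate Schedule 1", "USD", "Person", "08/20/15", "11/30/18"],
    ["Rate Schedule 2", "Test Rate Schedule 2", "INR", "Job", "08/20/15", "11/30/18"],
    ["Rate Schedule 1", "Test Rate Schedule 1", "USD", "Person", "08/20/15", "11/30/18"],
]
print(len(dedup_by_sorting(rows)), len(dedup_by_hash_set(rows)))  # 2 2
```

One practical difference worth noting: the sort-based variant reorders the rows by hash value, while the hash-set variant preserves the original upload order.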

Approach 1 worked better for this practical use case, up to the maximum number of rows we support for upload in this web service, even though it is asymptotically slower than Approach 2. The most likely reason Approach 1 beats Approach 2 here is that the constant factors hidden by O-notation are larger for the O(n) hash-set approach than for the O(n lg n) sort-based approach, at least up to the maximum limit supported by our web service (n = 100,000).
