#### Remove Duplicate Rows

One interesting practical algorithmic problem I solved as part of my work at Oracle was optimizing the existing logic for identifying and removing duplicate rows. The problem statement is as follows: we have multiple rows (around 100,000) of the following form:

| Schedule Name | Description | Currency | Schedule Type | Start Date | End Date |
| --- | --- | --- | --- | --- | --- |
| Rate Schedule 1 | Test Rate Schedule 1 | USD | Person | 08/20/15 | 11/30/18 |
| Rate Schedule 2 | Test Rate Schedule 2 | INR | Job | 08/20/15 | 11/30/18 |
| Rate Schedule 1 | Test Rate Schedule 1 | USD | Person | 08/20/15 | 11/30/18 |

Here, Row 1 and Row 3 are duplicates. The initial approach to identifying duplicates was a pairwise comparison in the style of selection sort: compare each cell value of the 1st row with the corresponding cell value of each of the remaining n-1 rows, each cell value of the 2nd row with those of the remaining n-2 rows, each cell value of the 3rd row with those of the remaining n-3 rows, and so on. The asymptotic time complexity of this implementation is Θ(n^{2}), and it was the costliest step in the entire web service.

I proposed two alternatives to reduce the time complexity. Both share a pre-computation step per row: concatenate the values in each row, separated by a pair of rare special characters ($@), hash the resulting string using the SHA-256 algorithm, and form a new column holding these digests (one per row). The table above would then become the following:

| Schedule Name | Description | Currency | Schedule Type | Start Date | End Date | Hash Value |
| --- | --- | --- | --- | --- | --- | --- |
| Rate Schedule 1 | Test Rate Schedule 1 | USD | Person | 08/20/15 | 11/30/18 | 41d43f3dcd2e9a1789812221c5a11490d8e699dc228e284a5f5f979367610cf3 |
| Rate Schedule 2 | Test Rate Schedule 2 | INR | Job | 08/20/15 | 11/30/18 | ecd008cef30a767952b1a6bfa7db95ae18cdc7a326882042bf8ab8b81150ce95 |
| Rate Schedule 1 | Test Rate Schedule 1 | USD | Person | 08/20/15 | 11/30/18 | 41d43f3dcd2e9a1789812221c5a11490d8e699dc228e284a5f5f979367610cf3 |

For example, the hash value for Row 1 is obtained as:

SHA256(Rate Schedule 1$@Test Rate Schedule 1$@USD$@Person$@08/20/15$@11/30/18) = 41d43f3dcd2e9a1789812221c5a11490d8e699dc228e284a5f5f979367610cf3
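The pre-computation step above can be sketched in Python (the production service was not written in Python, and the list-of-strings row representation here is an assumption for illustration):

```python
import hashlib

SEPARATOR = "$@"  # the pair of rare special characters used as a delimiter

def row_hash(row):
    """Concatenate the row's cell values with the '$@' separator and
    return the SHA-256 digest as a hex string."""
    joined = SEPARATOR.join(row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

row1 = ["Rate Schedule 1", "Test Rate Schedule 1", "USD",
        "Person", "08/20/15", "11/30/18"]
row3 = ["Rate Schedule 1", "Test Rate Schedule 1", "USD",
        "Person", "08/20/15", "11/30/18"]

# Identical rows produce identical digests.
print(row_hash(row1) == row_hash(row3))  # prints True
```
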

Two rows are considered duplicates if and only if their hash values are the same. (Strictly speaking, equal hashes imply equal rows only up to the probability of a SHA-256 collision, which is negligible for this purpose.)

The two approaches proposed to remove duplicate rows are as follows:

(i) Sort the rows by hash value and remove adjacent duplicates, which takes Θ(n*lg(n)) time.

(ii) Place all the values of the new Hash Value column in a hash set (*hs*), checking each value against *hs* as it is inserted. Each lookup in *hs* takes O(1) amortized time, so 'n' lookups take O(n) time in total.
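Both approaches can be sketched in Python (a minimal illustration, assuming rows are lists of strings; the production implementation was not in Python):

```python
import hashlib

def row_hash(row):
    """SHA-256 digest of the row's values joined with the '$@' separator."""
    return hashlib.sha256("$@".join(row).encode("utf-8")).hexdigest()

def dedupe_by_sorting(rows):
    """Approach 1: sort (hash, index) pairs so equal hashes become
    adjacent, keep the first row of each run -- Theta(n lg n)."""
    pairs = sorted((row_hash(r), i) for i, r in enumerate(rows))
    keep, prev = [], None
    for h, i in pairs:
        if h != prev:
            keep.append(i)
            prev = h
    return [rows[i] for i in sorted(keep)]  # restore original row order

def dedupe_by_hash_set(rows):
    """Approach 2: one pass over the rows with a hash set of digests;
    each membership check is O(1) amortized, so O(n) in total."""
    seen, unique = set(), []
    for r in rows:
        h = row_hash(r)
        if h not in seen:
            seen.add(h)
            unique.append(r)
    return unique
```

For the three-row table above, either function returns only the first two rows, since Row 3 shares a digest with Row 1.
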

Approach 1 worked better for this practical use case, up to the maximum number of rows the web service supports for upload (n = 100,000), even though it is asymptotically slower than Approach 2. The likely reason Approach 1 beats Approach 2 here is that the constant factors hidden by O-notation are larger for the O(n) hash-set approach than for the Θ(n*lg(n)) sorting approach at these input sizes.