Problem Description
A DataFrame called customers has the columns customer_id, name, and email, and some rows share the same email address. The task is to remove the rows with duplicate email addresses, keeping only the first occurrence of each email.
Key Insights
- We are filtering duplicates based on the email column.
- Only the first occurrence of each email should be retained.
- This is analogous to a "distinct" operation where order matters (i.e., the first appearance is kept).
- Use appropriate data structures (like a set) to keep track of seen emails if solving without a built-in function.
Space and Time Complexity
Time Complexity: O(n), where n is the number of rows. Each row is processed once, and each set lookup or insertion takes O(1) time on average.
Space Complexity: O(n) in the worst case, when all emails are unique and every one of them is stored in the set of seen emails.
Solution
The solution involves scanning through the rows of the DataFrame and recording emails that have already been encountered. Each time a row is processed, check if the email is in the set of seen emails:
- If not, add the email to the set and include the row in the output.
- If it is already in the set, skip the row.
This approach ensures that only the first occurrence of each email is kept.
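The scan described above can be sketched in plain Python as follows. This is a minimal illustration, not a definitive implementation: it assumes each row is a dict with an "email" key, and the function name drop_duplicate_emails_manual and the sample rows are hypothetical, made up for this example.

```python
def drop_duplicate_emails_manual(rows):
    """Return the rows whose email has not appeared earlier in the list."""
    seen = set()   # emails encountered so far
    kept = []      # rows retained, in their original order
    for row in rows:
        email = row["email"]
        if email not in seen:
            # First time this email appears: remember it and keep the row.
            seen.add(email)
            kept.append(row)
        # Otherwise the email was seen before, so the row is skipped.
    return kept


# Hypothetical sample data; rows 4 and 5 share an email.
rows = [
    {"customer_id": 1, "name": "Ella", "email": "emily@example.com"},
    {"customer_id": 2, "name": "David", "email": "michael@example.com"},
    {"customer_id": 3, "name": "Zachary", "email": "sarah@example.com"},
    {"customer_id": 4, "name": "Alice", "email": "john@example.com"},
    {"customer_id": 5, "name": "Finn", "email": "john@example.com"},
]
result = drop_duplicate_emails_manual(rows)
```

Because the rows are visited in order and a row is appended only on the first sighting of its email, the output preserves the original ordering and drops only the later duplicates.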
In pandas, the built-in drop_duplicates method achieves this when it is restricted to the email column, since it keeps the first occurrence of each value by default. For languages without such a utility, the manual approach uses a set or hash table to track seen emails.
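As a short sketch of the pandas route, the example below calls drop_duplicates with subset="email" and keep="first". The wrapper name drop_duplicate_emails and the sample DataFrame are assumptions made for illustration; only the column names come from the problem statement.

```python
import pandas as pd


def drop_duplicate_emails(customers: pd.DataFrame) -> pd.DataFrame:
    # Compare rows on the email column only and keep the first row
    # for each distinct email (keep="first" is also the default).
    return customers.drop_duplicates(subset="email", keep="first")


# Hypothetical sample data; customer_id 4 and 5 share an email.
customers = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 4, 5],
        "name": ["Ella", "David", "Zachary", "Alice", "Finn"],
        "email": [
            "emily@example.com",
            "michael@example.com",
            "sarah@example.com",
            "john@example.com",
            "john@example.com",
        ],
    }
)
deduped = drop_duplicate_emails(customers)
```

The call returns a new DataFrame and leaves customers unchanged. The surviving rows keep their original index labels, so a caller that wants a fresh 0-based index can chain reset_index(drop=True).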