
Drop Duplicate Rows

Number: 3071

Difficulty: Easy

Paid? No

Companies: N/A


Problem Description

Given a DataFrame called customers with columns customer_id, name, and email, some rows share the same email address. The task is to remove the rows with duplicate email addresses, keeping only the first occurrence of each email.


Key Insights

  • We are filtering duplicates based on the email column.
  • Only the first occurrence of each email should be retained.
  • This is analogous to a "distinct" operation where order matters (i.e., the first appearance is kept).
  • Use appropriate data structures (like a set) to keep track of seen emails if solving without a built-in function.

Space and Time Complexity

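Time Complexity: O(n) on average, where n is the number of rows, since each row is processed once and each hash-based membership check takes expected constant time.

Space Complexity: O(n) in the worst case, when all emails are unique and every one of them is stored in the set of seen emails.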


Solution

The solution scans the rows of the DataFrame in order and records every email already encountered in a set. For each row, check whether its email is in the set of seen emails:

  • If not, add the email to the set and include the row in the output.
  • If it is already in the set, skip the row.

This approach ensures that only the first occurrence of any duplicate email is kept.
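The scan described above can be sketched as follows. This is a minimal illustration, not the recommended pandas idiom: the function name drop_duplicate_rows_manual and the keep_mask list are hypothetical, and the sketch assumes the email column contains no missing values.

```python
import pandas as pd

def drop_duplicate_rows_manual(customers: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper: keeps the first row for each email using a set.
    seen = set()       # emails encountered so far
    keep_mask = []     # one boolean per row: True means "keep this row"
    for email in customers["email"]:
        if email in seen:
            keep_mask.append(False)   # repeated email, skip the row
        else:
            seen.add(email)
            keep_mask.append(True)    # first occurrence, keep the row
    # A boolean list of the same length as the DataFrame selects rows by position.
    return customers.loc[keep_mask]

sample = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Alice", "Finn", "Violet"],
    "email": ["john@example.com", "john@example.com", "alice@example.com"],
})
print(drop_duplicate_rows_manual(sample))
```

Row order and the original index labels are preserved, because rows are only filtered and never reordered.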

In pandas, a built-in method called drop_duplicates can achieve this. For languages without such utilities, a manual approach using a set or hash table is used to track seen emails.
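As a related sketch, pandas also provides DataFrame.duplicated, which returns a boolean Series marking repeated rows. Negating that mask should give the same rows as drop_duplicates with keep='first'. The function name drop_duplicate_rows_mask is an illustrative assumption.

```python
import pandas as pd

def drop_duplicate_rows_mask(customers: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper: True for every row whose email already appeared earlier.
    is_repeat = customers.duplicated(subset=["email"], keep="first")
    # Keep only the rows that are not repeats.
    return customers[~is_repeat]

sample = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Alice", "Finn", "Violet"],
    "email": ["john@example.com", "john@example.com", "alice@example.com"],
})
print(drop_duplicate_rows_mask(sample))
```

Having the mask available separately can be handy when the duplicate rows themselves need to be inspected before they are dropped.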


Code Solutions

# Python solution using pandas
import pandas as pd

def drop_duplicate_rows(customers: pd.DataFrame) -> pd.DataFrame:
    # Drop duplicate rows based on the 'email' column, keeping the first occurrence.
    return customers.drop_duplicates(subset=['email'], keep='first')

# Example usage:
data = {
    'customer_id': [1, 2, 3, 4, 5, 6],
    'name': ['Ella', 'David', 'Zachary', 'Alice', 'Finn', 'Violet'],
    'email': ['emily@example.com', 'michael@example.com', 'sarah@example.com', 'john@example.com', 'john@example.com', 'alice@example.com']
}
customers_df = pd.DataFrame(data)
result_df = drop_duplicate_rows(customers_df)
print(result_df)