Find Duplicate File in System

Number: 609

Difficulty: Medium

Paid? No

Companies: Meta, Dropbox, Google, Netflix


Problem Description

Given a list of directory info strings, each containing a directory path followed by one or more files with their contents (formatted as "fileName(content)"), return all groups of duplicate files in the system. Two files are considered duplicates if they have exactly the same content. Each file path is constructed by combining the directory path and the file name.
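To make the format concrete, here is a small illustrative input (not taken from the original problem statement) and the grouping it should produce:

```python
# Two directories; "root/a/1.txt" and "root/c/3.txt" share the content "abcd",
# so they form one duplicate group. "root/a/2.txt" has unique content ("efgh")
# and is not reported.
paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)"]
expected = [["root/a/1.txt", "root/c/3.txt"]]
```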


Key Insights

  • Parse each input string by separating the directory path from the file details.
  • Extract the file name and content using string manipulation (split on spaces, then use the parentheses as delimiters).
  • Use a hashmap (or dictionary) to map file content to a list of full file paths.
  • Return only those groups that have more than one file path.
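The parsing described in the first two insights can be sketched as follows (the entry value is illustrative):

```python
# Parse a single directory-info string into (full_path, content) pairs.
entry = "root/a 1.txt(abcd) 2.txt(efgh)"
directory, *file_infos = entry.split(" ")  # first token is the directory
parsed = []
for info in file_infos:
    name, content = info.split("(")        # "1.txt(abcd)" -> "1.txt", "abcd)"
    parsed.append((directory + "/" + name, content[:-1]))  # drop trailing ')'
print(parsed)  # [('root/a/1.txt', 'abcd'), ('root/a/2.txt', 'efgh')]
```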

Space and Time Complexity

Time Complexity: O(N * K), where N is the number of directory strings and K is their average length (including the file details), since every character is processed a constant number of times.
Space Complexity: O(N * K) for storing the hashmap mapping file content to file paths.


Solution

We iterate over each directory info string from the input. For each string, we:

  1. Split the string to obtain the directory path and the file information segments.
  2. For each file segment, extract the file name and content by locating the '(' separator.
  3. Construct the full file path by appending the file name to the directory path.
  4. Map the file content to a list of these full paths using a hashmap.
  5. Finally, return the lists from the hashmap that contain more than one file path, indicating duplicate files.

This approach leverages simple string manipulation and a hashmap to efficiently group and identify duplicate files.


Code Solutions

# Python solution with detailed comments
from typing import List

class Solution:
    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        # Hashmap to map file content to list of file paths
        content_to_paths = {}
        # Process each directory information string
        for path in paths:
            # Split the string by spaces: first part is the directory, remaining are file entries
            parts = path.split(" ")
            directory = parts[0]
            # Process each file entry in the current directory
            for file_info in parts[1:]:
                # Split file_info into file name and content based on '('
                file_name, file_content = file_info.split("(")
                # Remove the closing parenthesis ')' from file_content
                file_content = file_content[:-1]
                # Construct the full file path
                full_path = directory + "/" + file_name
                # Map file content to corresponding file paths
                if file_content not in content_to_paths:
                    content_to_paths[file_content] = []
                content_to_paths[file_content].append(full_path)
        # Only return groups that have more than one file (duplicates)
        return [group for group in content_to_paths.values() if len(group) > 1]
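The same idea can be written a little more compactly with collections.defaultdict, which removes the explicit membership check. This is a standalone sketch (the function name find_duplicate and the sample paths are illustrative), shown with a quick usage check:

```python
from collections import defaultdict
from typing import List

def find_duplicate(paths: List[str]) -> List[List[str]]:
    # Map each file content to the list of full paths holding that content
    content_to_paths = defaultdict(list)
    for entry in paths:
        directory, *file_infos = entry.split(" ")
        for info in file_infos:
            name, content = info.split("(")
            content_to_paths[content[:-1]].append(directory + "/" + name)
    # Keep only contents shared by two or more files
    return [group for group in content_to_paths.values() if len(group) > 1]

paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)"]
print(find_duplicate(paths))
# → [['root/a/1.txt', 'root/c/3.txt'], ['root/a/2.txt', 'root/c/d/4.txt']]
```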