Problem Description
Given a list of absolute folder paths, build the hierarchical folder structure and identify duplicate subtrees (folders with identical subfolder structures). If two or more folders are identical (they have the same non-empty set of same subfolders recursively), then mark all such folders and their entire subtrees for deletion. After a one-time deletion on all marked folders, return the paths of the remaining folders.
Key Insights
- Build a tree (trie) structure to represent the folder hierarchy.
- Serialize each subtree into a unique string that captures the folder name and its sorted children’s serialization.
- Use a hash map to count how many times each subtree serialization appears.
- Mark nodes (and their subtrees) for deletion if their serialization occurs more than once.
- Traverse the tree again to output all remaining folder paths, skipping any marked nodes.
Space and Time Complexity
Time Complexity: O(N * L * log(degree)) where N is the number of folder nodes, L is the average length of each serialization and log factor comes from sorting the children. Space Complexity: O(N * L) due to the tree structure and storage of serialization strings.
Solution
We first build a tree from the given paths, where each node represents a folder and has a dictionary of children nodes. We then traverse the tree from the bottom-up (postorder DFS) to compute a canonical string representation (serialization) for each subtree. This serialization includes the folder name and the sorted serializations of its children. We count the occurrences of each serialization in a global dictionary. If any subtree has a count greater than one, we mark that node for deletion. Finally, we perform another DFS traversal to collect all remaining folder paths by skipping over any subtree marked for deletion.