We use cookies (including Google cookies) to personalize ads and analyze traffic. By continuing to use our site, you accept our Privacy Policy.

Repeated DNA Sequences

Number: 187

Difficulty: Medium

Paid? No

Companies: Amazon, LinkedIn, Bloomberg, Grammarly, Google, Apple, Tesla


Problem Description

Given a string representing a DNA sequence (composed only of 'A', 'C', 'G', and 'T'), find all 10-letter-long sequences (substrings) that occur more than once in the DNA molecule. The answer can be returned in any order.


Key Insights

  • Use a fixed-size sliding window (of length 10) to extract every possible 10-letter sequence.
  • Use a hash set (or hash map) to keep track of sequences that have been seen.
  • A second hash set is used to store sequences that are repeated.
  • Optimization can be done with bit manipulation (mapping characters to bits), but simple string slicing works under the given constraints.
  • This approach ensures O(n) time complexity by processing each substring once.

Space and Time Complexity

Time Complexity: O(n), where n is the length of the string. Space Complexity: O(n) in the worst-case scenario, due to storage of sequences. (Note: Using bit manipulation can reduce the space overhead.)


Solution

The solution uses a sliding window of size 10 to traverse the input string. As each 10-letter sequence is generated, it is checked against a hash set containing previously seen sequences. If a sequence has been seen before, it is added to a separate hash set for repeated sequences. Finally, the repeated sequences are returned as a list. This approach efficiently handles the problem by using constant time operations for checking and storing sequences.


Code Solutions

def findRepeatedDnaSequences(s):
    # If the string is shorter than 10 characters, no sequence can repeat.
    if len(s) < 10:
        return []
    seen = set()      # To keep track of sequences that have been seen
    repeated = set()  # To store sequences that occur more than once
    
    # Slide a window of size 10 over the string.
    for i in range(len(s) - 9):
        seq = s[i:i+10]  # Extract the substring of length 10
        # If sequence is already seen, add to the repeated set.
        if seq in seen:
            repeated.add(seq)
        else:
            seen.add(seq)
    
    # Convert the set of repeated sequences into a list and return.
    return list(repeated)

# Example usage:
print(findRepeatedDnaSequences("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"))
← Back to All Questions