We use cookies (including Google cookies) to personalize ads and analyze traffic. By continuing to use our site, you accept our Privacy Policy.

Shortest Matching Substring

Number: 3692

Difficulty: Hard

Paid? No

Companies: N/A


Problem Description

Given a string s and a pattern p that contains exactly two '*' wildcards (each matching any sequence of zero or more characters), return the length of the shortest contiguous substring of s that matches p. The empty substring is considered valid. If no substring of s can match p, return -1.

For example, if s = "abaacbaecebce" and p = "bacce", one matching substring is "baecebce" (of length 8).


Key Insights

  • The pattern p always contains exactly two '*' characters, so p can be split into three parts: A, B, and C (which could be empty).
  • A valid match is any substring T of s that can be partitioned as: T = (any prefix) + A + (any string) + B + (any string) + C + (any suffix). In terms of matching, T must contain A, then later B, then later C (with possibly zero characters in between).
  • To obtain the shortest matching substring, you want to pick occurrences of A, B, and C that are “as close together as possible.”
  • A good approach is to first find all occurrences (starting indices) of A, B, and C in s. Then, for each occurrence of A you can binary–search for the next occurrence of B that appears after the end of A, and similarly find an occurrence of C after the end of B.
  • The challenge is handling edge cases when any of the segments (A, B, or C) is empty (recall that p might begin or end with '*' or, in an extreme, be just "**", where the answer would be 0).
  • Efficient substring search (using for example KMP) and binary search on sorted lists of occurrence indices keeps the solution within acceptable complexity bounds.

Space and Time Complexity

Time Complexity: O(n + m + occ*log(occ)), where n = s.length, m = p.length, and occ is the number of occurrences found for a given segment (in worst-case O(n)). Space Complexity: O(n) to store the occurrence indices.


Solution

We start by splitting p into three segments: A, B, and C, where p = A + '' + B + '' + C. In s, for each segment we use a KMP-based search (or any efficient substring search algorithm) to compute a sorted list of start indices where that segment occurs. (If the segment is empty, we conceptually allow it to “match” at every position; however, for the purpose of minimizing substring length we only need to consider boundary conditions.)

For each occurrence index a for A (if A is nonempty; if empty, any index may serve as a starting point), we determine the end index of the A match (a + |A|). Then, using binary search on all occurrences of B, we try to find an occurrence b such that b ≥ a + |A|. For the found b, the match for B extends to b + |B|. Then, again by binary search on all occurrences of C, we find c such that c ≥ b + |B|. The candidate matching substring spans from a to c + |C| - 1 and its length is (c + |C| - a).

We iterate over all possible occurrences of A in s and take the minimum such substring length. If we cannot form a valid chain, we return -1. Special handling is done when one or more segments are empty. For example, if p equals "**", the answer is 0 as the empty substring is valid.

This solution uses:

  • KMP (Knuth-Morris-Pratt) algorithm for efficient pattern searching.
  • Binary search to quickly find the next occurrence index that meets our boundary requirements.

A similar strategy is applied for each of the target languages.


Code Solutions

Below are code solutions in Python, JavaScript, C++, and Java with line-by-line comments.

# Define a helper function to perform KMP search for all occurrences 
def kmp_search(text, pattern):
    # If the pattern is empty, return all indices as possible match positions.
    if not pattern:
        return list(range(len(text) + 1))
    # Build the longest prefix suffix (lps) array
    m = len(pattern)
    lps = [0] * m
    j = 0
    # Construct lps for pattern
    for i in range(1, m):
        while j > 0 and pattern[i] != pattern[j]:
            j = lps[j-1]
        if pattern[i] == pattern[j]:
            j += 1
            lps[i] = j
    # KMP search in text
    occurrences = []
    j = 0
    for i in range(len(text)):
        while j > 0 and text[i] != pattern[j]:
            j = lps[j-1]
        if text[i] == pattern[j]:
            j += 1
            if j == m:
                # Append start index of occurrence.
                occurrences.append(i - m + 1)
                j = lps[j-1]
    return occurrences

# Use binary search to get the smallest index in sorted_list that is >= target.
import bisect
def find_ge(sorted_list, target):
    idx = bisect.bisect_left(sorted_list, target)
    return sorted_list[idx] if idx < len(sorted_list) else None

def shortest_matching_substring(s, p):
    # Split the pattern into three parts: A, B, C; p always has 2 '*' characters.
    parts = p.split('*')
    if len(parts) != 3:
        # Problem constraint ensures exactly 2 '*' so this should not happen.
        return -1
    A, B, C = parts[0], parts[1], parts[2]
    
    # Special case: if all parts are empty i.e. p == "**", the empty substring matches.
    if A == "" and B == "" and C == "":
        return 0

    # Compute occurrences for each part using KMP. If part is empty, we treat it specially.
    occ_A = kmp_search(s, A) if A else list(range(len(s)+1))
    occ_B = kmp_search(s, B) if B else list(range(len(s)+1))
    occ_C = kmp_search(s, C) if C else list(range(len(s)+1))
    
    # Ensure the occurrences lists are sorted.
    occ_B.sort()
    occ_C.sort()
    
    min_length = float('inf')
    # Iterate over all occurrences for A (or all positions if A is empty)
    for a in occ_A:
        # When A is nonempty, match for A ends at a_end.
        a_end = a + len(A)
        # Find in occ_B a starting index b that is >= a_end.
        b = find_ge(occ_B, a_end)
        if b is None:
            continue  # No valid B occurrence found for this a.
        b_end = b + len(B)
        # Find in occ_C a starting index c that is >= b_end.
        c = find_ge(occ_C, b_end)
        if c is None:
            continue
        # The candidate substring spans from index a to c_end.
        c_end = c + len(C)
        candidate_length = c_end - a
        if candidate_length < min_length:
            min_length = candidate_length

    return min_length if min_length != float('inf') else -1

# Example usage:
print(shortest_matching_substring("abaacbaecebce", "ba*c*ce"))  # Expected output 8
print(shortest_matching_substring("baccbaadbc", "cc*baa*adb"))  # Expected output -1
print(shortest_matching_substring("a", "**"))                   # Expected output 0
print(shortest_matching_substring("madlogic", "*adlogi*"))       # Expected output 6
← Back to All Questions