LeetCode Solutions Blog

HTML Entity Parser

Number: 1526

Difficulty: Medium

Paid? No

Problem Description

Given a string that may contain HTML entities (special sequences starting with "&" and ending with ";"), replace each recognized entity with its corresponding character. Only a fixed set of known HTML entities should be replaced. Unrecognized sequences should remain unchanged.

Key Insights

Use a hash table (dictionary) to map HTML entity strings to their corresponding characters.
Instead of using repetitive string replaces (which might lead to overlapping issues), a single pass scan through the string ensures linear time replacement.
When encountering an ampersand ("&"), check if the following substring matches any of the known entities.
If a valid entity is found, append its corresponding character to the result; otherwise, simply append the current character.

Space and Time Complexity

Time Complexity: O(n), where n is the length of the input text, since we process each character at most once. Space Complexity: O(n) for storing the output string; the additional dictionary uses constant space.

Solution

The solution employs a single pass scan algorithm combined with a dictionary to map HTML entities to the actual characters. As we traverse the string, we check for the presence of an entity starting at every '&'. If a valid entity is detected, we substitute it; if not, we continue. This approach eliminates the overhead of multiple searches and replacements, keeping the solution efficient and straightforward.

Code Solutions

# Define the function to parse HTML entities
def entityParser(text):
    # Mapping of known HTML entities to their corresponding characters
    entity_to_char = {
        "&quot;": "\"",  # Quotation Mark
        "&apos;": "'",   # Single Quote Mark
        "&amp;": "&",    # Ampersand
        "&gt;": ">",     # Greater Than Sign
        "&lt;": "<",     # Less Than Sign
        "&frasl;": "/"   # Slash
    }
    
    result = []  # List to accumulate result characters
    i = 0  # Current index in the text
    
    while i < len(text):
        # If the current character is '&', check for valid entity
        if text[i] == '&':
            # Try all entity lengths by checking dictionary keys
            found = False
            # Iterate through each possible entity from the mapping keys
            for entity, char in entity_to_char.items():
                # Check if the substring from i with length of the entity matches
                if text.startswith(entity, i):
                    result.append(char)  # Append the corresponding character
                    i += len(entity)     # Move the index past the entity
                    found = True
                    break
            # If no matching entity found, append the ampersand normally
            if not found:
                result.append(text[i])
                i += 1
        else:
            # Current character is not '&'; append normally
            result.append(text[i])
            i += 1
    
    # Join the list into a string and return
    return "".join(result)

# Example usage:
input_text = "&amp; is an HTML entity but &amp;ambassador; is not."
print(entityParser(input_text))