We use cookies (including Google cookies) to personalize ads and analyze traffic. By continuing to use our site, you accept our Privacy Policy.

HTML Entity Parser

Number: 1526

Difficulty: Medium

Paid? No

Companies: Amazon, Google, Oracle


Problem Description

Given a string that may contain HTML entities (special sequences starting with "&" and ending with ";"), replace each recognized entity with its corresponding character. Only a fixed set of known HTML entities should be replaced. Unrecognized sequences should remain unchanged.


Key Insights

  • Use a hash table (dictionary) to map HTML entity strings to their corresponding characters.
  • Instead of using repetitive string replaces (which might lead to overlapping issues), a single pass scan through the string ensures linear time replacement.
  • When encountering an ampersand ("&"), check if the following substring matches any of the known entities.
  • If a valid entity is found, append its corresponding character to the result; otherwise, simply append the current character.

Space and Time Complexity

Time Complexity: O(n), where n is the length of the input text, since we process each character at most once. Space Complexity: O(n) for storing the output string; the additional dictionary uses constant space.


Solution

The solution employs a single pass scan algorithm combined with a dictionary to map HTML entities to the actual characters. As we traverse the string, we check for the presence of an entity starting at every '&'. If a valid entity is detected, we substitute it; if not, we continue. This approach eliminates the overhead of multiple searches and replacements, keeping the solution efficient and straightforward.


Code Solutions

# Define the function to parse HTML entities
def entityParser(text):
    # Mapping of known HTML entities to their corresponding characters
    entity_to_char = {
        """: "\"",  # Quotation Mark
        "'": "'",   # Single Quote Mark
        "&": "&",    # Ampersand
        ">": ">",     # Greater Than Sign
        "&lt;": "<",     # Less Than Sign
        "&frasl;": "/"   # Slash
    }
    
    result = []  # List to accumulate result characters
    i = 0  # Current index in the text
    
    while i < len(text):
        # If the current character is '&', check for valid entity
        if text[i] == '&':
            # Try all entity lengths by checking dictionary keys
            found = False
            # Iterate through each possible entity from the mapping keys
            for entity, char in entity_to_char.items():
                # Check if the substring from i with length of the entity matches
                if text.startswith(entity, i):
                    result.append(char)  # Append the corresponding character
                    i += len(entity)     # Move the index past the entity
                    found = True
                    break
            # If no matching entity found, append the ampersand normally
            if not found:
                result.append(text[i])
                i += 1
        else:
            # Current character is not '&'; append normally
            result.append(text[i])
            i += 1
    
    # Join the list into a string and return
    return "".join(result)

# Example usage:
input_text = "&amp; is an HTML entity but &amp;ambassador; is not."
print(entityParser(input_text))
← Back to All Questions