
UTF-8 Validation

Number: 393

Difficulty: Medium

Paid? No

Companies: Meta, Google


Problem Description

Given an array of integers in which each integer represents one byte of data (only its least significant 8 bits are used), determine whether the sequence is a valid UTF-8 encoding. In UTF-8, each character is encoded in 1 to 4 bytes. A single-byte character starts with a 0 bit. For a multi-byte character, the first byte gives the total number of bytes as a run of that many 1's followed by a 0, and each subsequent byte starts with the bits "10".
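To make these bit patterns concrete, the sketch below renders each integer as its 8-bit binary form. The helper name `to_bits` is hypothetical and introduced only for illustration; it is not part of the problem or the solution.

```python
def to_bits(data):
    """Return each integer's least significant 8 bits as a binary string."""
    # `to_bits` is a hypothetical helper used only to visualize the input.
    return [format(value & 0xFF, "08b") for value in data]


# 197 starts with "110" (a 2-byte character) and 130 starts with "10".
print(to_bits([197, 130, 1]))   # ['11000101', '10000010', '00000001']
# 235 starts with "1110" (a 3-byte character), but 4 does not start with "10".
print(to_bits([235, 140, 4]))   # ['11101011', '10001100', '00000100']
```

Reading the leading bits of each string shows why the first input is valid and the second is not.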


Key Insights

  • The first byte of a character tells how many bytes the character comprises based on the count of leading 1's.
  • For a valid multi-byte character, each subsequent byte must start with "10".
  • Use bit masks to check the most significant bits in each byte.
  • Incrementally validate each byte in the data array by tracking how many continuation bytes are expected.
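One way to apply the bit masks mentioned above is to compare the top bits of a byte against each allowed prefix. This is a minimal sketch; the function name `classify_byte` and its return convention (0 for a continuation byte, -1 for any other pattern) are assumptions made for this example.

```python
def classify_byte(byte):
    """Return the character length a first byte announces (1 to 4),
    0 for a continuation byte, or -1 for any other pattern."""
    # `classify_byte` and its return codes are hypothetical, for illustration.
    byte &= 0xFF                       # keep only the low 8 bits
    if (byte & 0b10000000) == 0b00000000:   # 0xxxxxxx
        return 1
    if (byte & 0b11000000) == 0b10000000:   # 10xxxxxx
        return 0
    if (byte & 0b11100000) == 0b11000000:   # 110xxxxx
        return 2
    if (byte & 0b11110000) == 0b11100000:   # 1110xxxx
        return 3
    if (byte & 0b11111000) == 0b11110000:   # 11110xxx
        return 4
    return -1                               # five or more leading 1's


print([classify_byte(b) for b in (1, 130, 197, 235, 240, 248)])
# [1, 0, 2, 3, 4, -1]
```

Each mask keeps one more bit than the prefix it tests, so the terminating 0 of the prefix is checked along with the leading 1's.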

Space and Time Complexity

Time Complexity: O(n), where n is the number of bytes in the data array.

Space Complexity: O(1), since no extra space proportional to input size is used.


Solution

We iterate through each integer in the data array. If we are not in the middle of a multi-byte character, we count the leading 1's in the current byte to determine how many bytes the character occupies. A count of 0 means a single-byte character, which is valid by itself. A count of 1 is invalid for a first byte, because the "10" prefix is reserved for continuation bytes. Otherwise, the first byte of a multi-byte character must have between 2 and 4 leading 1's. We then check that each of the following continuation bytes has "10" as its two most significant bits. If any byte fails these conditions, or if the data ends while continuation bytes are still expected, we return false.

Key data structures and techniques used:

  • Bit manipulation to extract and check relevant bits from bytes.
  • A simple counter to track the number of expected continuation bytes.
  • Iteration over the data array with constant time checks.
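To see how the counter evolves, the following sketch records the number of continuation bytes still expected after each input byte. The name `trace_remaining` and the use of -1 to mark the failing byte are assumptions made for this illustration.

```python
def trace_remaining(data):
    """Return the continuation-byte counter after each byte.

    A -1 entry marks the byte at which validation fails; the trace stops there.
    """
    # `trace_remaining` is a hypothetical helper, not the solution itself.
    remaining = 0
    trace = []
    for value in data:
        byte = value & 0xFF                 # only the low 8 bits carry data
        if remaining:
            if (byte >> 6) != 0b10:         # continuation must be 10xxxxxx
                trace.append(-1)
                return trace
            remaining -= 1
        else:
            ones = 0
            mask = 0b10000000
            while mask & byte:              # count leading 1's
                ones += 1
                mask >>= 1
            if ones == 1 or ones > 4:       # not a legal first byte
                trace.append(-1)
                return trace
            remaining = max(ones - 1, 0)    # bytes still owed after this one
        trace.append(remaining)
    return trace


print(trace_remaining([197, 130, 1]))   # [1, 0, 0]
print(trace_remaining([235, 140, 4]))   # [2, 1, -1]
print(trace_remaining([235, 140]))      # [2, 1]
```

A sequence is valid only if the trace contains no -1 and ends at 0; the last example ends at 1, so the data was cut off mid-character.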

Code Solutions

def validUtf8(data):
    # Initialize the count for the number of bytes in the current UTF-8 character.
    byte_count = 0
    
    # Iterate through each integer which represents a byte in the data.
    for byte in data:
        # If we are expecting continuation bytes
        if byte_count:
            # A continuation byte starts with '10': after keeping the low 8 bits,
            # shifting right by 6 must leave exactly 0b10.
            if ((byte & 0xFF) >> 6) != 0b10:
                return False
            # Decrease count of continuation bytes needed
            byte_count -= 1
        else:
            # Count the number of leading 1's in the byte
            mask = 0b10000000
            while mask & byte:
                byte_count += 1
                mask >>= 1
            # For a 1-byte character, the count will be 0
            if byte_count == 0:
                continue
            # A first byte of a multi-byte character has 2 to 4 leading 1's; exactly 1 is the
            # continuation pattern '10', and more than 4 exceeds the 4-byte limit.
            if byte_count == 1 or byte_count > 4:
                return False
            # We have already processed the first byte, so decrement to count remaining continuation bytes.
            byte_count -= 1
            
    # If all bytes are processed and no expected continuation bytes remain, it's valid.
    return byte_count == 0

# Example Usage:
# print(validUtf8([197, 130, 1]))  # Expected True
# print(validUtf8([235, 140, 4]))    # Expected False
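As a rough cross-check, the same inputs can be passed to Python's built-in strict UTF-8 decoder. This is only a sketch: the function name `strict_utf8` is hypothetical, and the built-in decoder enforces rules beyond this problem's structural checks (for example, it rejects overlong encodings), so the two can disagree.

```python
def strict_utf8(data):
    """Return True if Python's strict UTF-8 decoder accepts the byte sequence."""
    # `strict_utf8` is a hypothetical helper; it is stricter than the problem requires.
    try:
        bytes(value & 0xFF for value in data).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(strict_utf8([197, 130, 1]))   # True
print(strict_utf8([235, 140, 4]))   # False
# Passes this problem's structural rules (110xxxxx followed by 10xxxxxx),
# but it is an overlong encoding, which the strict decoder rejects.
print(strict_utf8([0xC0, 0x80]))    # False
```

The decoder agrees with the expected results for both examples above, while the last input shows where the problem's simplified rules are more permissive than real UTF-8.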