Problem Description
Given an array of integers where each integer represents one byte of data (using only its least significant 8 bits), determine whether the sequence is a valid UTF-8 encoding. In UTF-8, a character is encoded in 1 to 4 bytes. A single-byte character starts with a 0 bit. For a multi-byte character, the first byte starts with as many 1 bits as there are bytes in the character, followed by a 0, and each subsequent byte starts with the bits "10".
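To make the byte layout concrete, the sketch below prints the bit patterns Python produces when it encodes a few characters. The helper name byte_patterns is made up for this illustration and is not part of the problem statement.

```python
def byte_patterns(text: str) -> list[str]:
    """Return each UTF-8 byte of `text` as an 8-bit binary string."""
    return [format(b, "08b") for b in text.encode("utf-8")]


# One byte: the leading bit is 0.
print(byte_patterns("A"))  # ['01000001']

# Three bytes: the first byte starts with 1110, the rest start with 10.
print(byte_patterns("€"))  # ['11100010', '10000010', '10101100']
```

In the three-byte case the first byte carries three leading 1's, matching the total byte count, and both following bytes carry the "10" prefix.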
Key Insights
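- The count of leading 1's in the first byte of a character gives its length: 0 means a single byte, and 2, 3, or 4 mean that many bytes in total. A count of 1 (a stray continuation byte) or more than 4 is invalid.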
- For a valid multi-byte character, each subsequent byte must start with "10".
- Use bit masks to check the most significant bits in each byte.
- Incrementally validate each byte in the data array by tracking how many continuation bytes are expected.
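The mask checks mentioned above might look like the following. This is a minimal sketch: the function names count_leading_ones and is_continuation are chosen here for illustration only.

```python
def count_leading_ones(byte: int) -> int:
    """Count consecutive 1 bits starting from the most significant of 8 bits."""
    byte &= 0xFF  # keep only the least significant 8 bits
    mask = 0x80   # 10000000: start at the top bit
    count = 0
    while mask and (byte & mask):
        count += 1
        mask >>= 1  # move on to the next lower bit
    return count


def is_continuation(byte: int) -> bool:
    """True if the top two bits of the byte are '10'."""
    return (byte & 0xC0) == 0x80  # 0xC0 = 11000000, 0x80 = 10000000


print(count_leading_ones(0b11100010))  # 3
print(is_continuation(0b10000010))     # True
```

Masking with 0xFF first means integers larger than 255 are judged only on their low 8 bits, as the problem requires.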
Space and Time Complexity
Time Complexity: O(n), where n is the number of bytes in the data array.
Space Complexity: O(1), since no extra space proportional to input size is used.
Solution
We iterate through each integer in the data array, looking only at its least significant 8 bits. If we are not in the middle of a multi-byte character, we count the leading 1's in the current byte to determine how many bytes the character occupies. A count of 0 means a single-byte character, which is valid by itself. A count of 2, 3, or 4 starts a multi-byte character of that length, so we expect one fewer continuation byte to follow. A count of 1 or more than 4 is invalid. While continuation bytes are expected, each byte must have "10" as its two most significant bits. If any byte fails these checks, or if the array ends while continuation bytes are still expected, we return false. Otherwise we return true.
Key data structures and techniques used:
- Bit manipulation to extract and check relevant bits from bytes.
- A simple counter to track the number of expected continuation bytes.
- Iteration over the data array with constant time checks.
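Putting the pieces together, one possible implementation is sketched below. The function name valid_utf8 and the variable remaining are assumptions made for this example. It checks only the structural rules described here and does not reject overlong encodings or out-of-range code points, which a full UTF-8 decoder would.

```python
def valid_utf8(data: list[int]) -> bool:
    """Check that `data` follows the UTF-8 leading/continuation byte pattern."""
    remaining = 0  # continuation bytes still expected for the current character

    for value in data:
        byte = value & 0xFF  # only the least significant 8 bits matter

        if remaining == 0:
            # Start of a new character: classify by leading bits.
            if (byte & 0x80) == 0x00:    # 0xxxxxxx: single byte
                continue
            elif (byte & 0xE0) == 0xC0:  # 110xxxxx: 2-byte character
                remaining = 1
            elif (byte & 0xF0) == 0xE0:  # 1110xxxx: 3-byte character
                remaining = 2
            elif (byte & 0xF8) == 0xF0:  # 11110xxx: 4-byte character
                remaining = 3
            else:
                # 10xxxxxx (stray continuation) or 5+ leading 1's
                return False
        else:
            # Inside a multi-byte character: must be 10xxxxxx.
            if (byte & 0xC0) != 0x80:
                return False
            remaining -= 1

    # A trailing incomplete character is invalid.
    return remaining == 0


print(valid_utf8([197, 130, 1]))  # True
print(valid_utf8([235, 140, 4]))  # False
```

In the first call, 197 (11000101) opens a 2-byte character, 130 (10000010) completes it, and 1 is a single-byte character. In the second, 235 (11101011) opens a 3-byte character, but 4 (00000100) does not start with "10".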