GoSuB Browser Progress, pt8
Finally, the tokenizer passes all html5lib-tests and is merged into the main branch: https://github.com/jaytaph/gosub-browser/
🏁 Tests completed: Ran 6805 tests, 2770 assertions, 2748 succeeded, 22 failed (18 position failures)
Of course, when I say all html5lib tests, I mean almost all of them. The problem comes from the fact that Rust does not accept invalid UTF-8 sequences in strings and chars. Surrogate code points (0xD800 through 0xDFFF) are not valid characters on their own, so a char cannot hold them directly. I solved some of those issues by using an "element" enum instead of a char in the tokenizer. An element can be a UTF-8 character, a surrogate (not valid UTF-8, but stored as a u16), or an EOF. It also makes the tokenizer a bit easier to read, as we no longer need Option<char>, where None would signal EOF.
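To make the idea concrete, here is a minimal sketch of what such an element type could look like. The names (Element, Utf8, Surrogate, Eof, element_from_code_point) are made up for illustration and are not necessarily what the gosub code uses, and mapping out-of-range values to U+FFFD is an assumption of this sketch.

```rust
/// One unit of tokenizer input. Hypothetical names, not gosub's actual API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Element {
    /// A valid Unicode scalar value, which fits in a Rust `char`.
    Utf8(char),
    /// A lone surrogate (0xD800..=0xDFFF), which `char` cannot hold.
    Surrogate(u16),
    /// End of input, replacing the `None` of an `Option<char>`.
    Eof,
}

/// Classify a raw code point; `None` marks the end of the input.
pub fn element_from_code_point(cp: Option<u32>) -> Element {
    match cp {
        None => Element::Eof,
        // The surrogate range always fits in 16 bits, so the cast is lossless.
        Some(v @ 0xD800..=0xDFFF) => Element::Surrogate(v as u16),
        Some(v) => match char::from_u32(v) {
            Some(c) => Element::Utf8(c),
            // Values above 0x10FFFF are not code points at all.
            // Assumption of this sketch: map them to U+FFFD.
            None => Element::Utf8(char::REPLACEMENT_CHARACTER),
        },
    }
}
```

With something like this, the tokenizer can match on a single value per step instead of unwrapping an Option first, and a surrogate never has to be forced into a char.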
Anyway, because we handle surrogates in a nasty way (with unsafe and unchecked conversions), it can happen that two surrogate chars are converted into a single non-BMP character (code point 0x10000 or higher). In those cases the character count of the string is off by one (two surrogate characters became one single character), so the test reports an incorrect line/col position during testing (even though the test itself passes).
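The off-by-one is easy to reproduce in isolation. The sketch below is not the tokenizer's code: it uses safe standard-library calls instead of the unsafe conversions mentioned above, and the function names are hypothetical. It only shows the arithmetic that turns a surrogate pair into one character, and how the counts drift apart.

```rust
/// Combine a high and a low surrogate into one supplementary-plane char.
/// Returns `None` if the two values do not form a valid pair.
pub fn combine_surrogates(high: u16, low: u16) -> Option<char> {
    if !(0xD800..=0xDBFF).contains(&high) || !(0xDC00..=0xDFFF).contains(&low) {
        return None;
    }
    // Standard UTF-16 decoding: 10 bits from each half, offset by 0x10000.
    let cp = 0x10000 + (((high as u32) - 0xD800) << 10) + ((low as u32) - 0xDC00);
    char::from_u32(cp)
}

/// Return (number of input units, number of items after UTF-16 decoding).
/// A valid pair collapses into one item; a lone surrogate still counts as one
/// (it decodes to an error item), so only pairs shift the count.
pub fn unit_and_char_counts(units: &[u16]) -> (usize, usize) {
    let decoded = char::decode_utf16(units.iter().copied()).count();
    (units.len(), decoded)
}
```

For example, the input 'a', 0xD83D, 0xDE00 is three units, but after conversion only two characters remain ('a' and U+1F600), so a column counter that counts converted characters ends up one behind a counter that counts input units.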
For now, I'm happy enough with the tokenizer as is, and I'm going to play around a bit with tokenizing larger sites to see if that works well enough.
Next stop: DOM generation
About jaytaph
Codemuser extraordinaire