CodeMusings of jaytaph

GoSuB Browser Progress, pt8

Last updated at 2023-09-01 08:05:21

Finally, the tokenizer passes all html5lib-tests and is merged into the main branch: https://github.com/jaytaph/gosub-browser/

🏁 Tests completed: Ran 6805 tests, 2770 assertions, 2748 succeeded, 22 failed (18 position failures)

Of course, when I say all html5lib tests, I mean... almost all of them. The problem comes from the fact that Rust does not accept invalid UTF-8 sequences in strings and chars. This means that surrogate code points (between 0xD800 and 0xDFFF) on their own are invalid and thus cannot be stored in a char. I solved some of those issues by using an "element" enum instead of a char in the tokenizer. This element can either be a UTF-8 character, a surrogate (not valid UTF-8, so stored as a u16), or an EOF. It also makes the tokenizer a bit easier to read, as we do not need Option<char> anymore, where None would signal EOF.
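To make the idea concrete, here is a minimal sketch of what such an element type could look like. The names (`Element`, `from_code_point`, `describe`) are made up for illustration and are not taken from the gosub source; the real enum may look quite different.

```rust
/// Hypothetical tokenizer input element (illustrative names, not gosub's).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Element {
    /// A valid Unicode scalar value, which fits in a `char`.
    Utf8(char),
    /// A lone surrogate (0xD800..=0xDFFF), which `char` refuses to hold.
    Surrogate(u16),
    /// End of the input stream.
    Eof,
}

impl Element {
    /// Classify a raw code point using only safe conversions.
    /// Returns `None` for values above 0x10FFFF, which are not code points.
    pub fn from_code_point(cp: u32) -> Option<Element> {
        if (0xD800..=0xDFFF).contains(&cp) {
            Some(Element::Surrogate(cp as u16))
        } else {
            char::from_u32(cp).map(Element::Utf8)
        }
    }
}

/// A tokenizer-style match: EOF is just another arm, so there is no
/// `Option<char>` to unwrap.
pub fn describe(e: Element) -> &'static str {
    match e {
        Element::Utf8('<') => "tag open",
        Element::Utf8(_) => "character",
        Element::Surrogate(_) => "lone surrogate",
        Element::Eof => "end of input",
    }
}
```

With something like this, a state in the tokenizer can match on the element directly and handle end-of-input in the same place as every other case.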

Anyway, because we handle surrogates in a nasty way (with unsafe and unchecked conversions), it can happen that two surrogate characters get converted into a single non-BMP character (value 0x10000 or higher). In those cases the character count of the string will be off by one (two surrogate characters become one single character), so the test will report an incorrect line/col position during testing (even though the test itself passes).
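The off-by-one is easy to reproduce in isolation. The sketch below is not the tokenizer's actual code: it uses safe arithmetic and the standard library's `std::char::decode_utf16` instead of unsafe conversions, and the function names (`combine_surrogates`, `counts`) are hypothetical.

```rust
/// Combine a high and a low surrogate into one supplementary code point.
/// Returns `None` if the two values do not form a valid high/low pair.
pub fn combine_surrogates(high: u16, low: u16) -> Option<char> {
    if !(0xD800..=0xDBFF).contains(&high) || !(0xDC00..=0xDFFF).contains(&low) {
        return None;
    }
    // Standard UTF-16 pair decoding: 10 bits from each surrogate,
    // offset by 0x10000.
    let cp = 0x10000 + (((high as u32) - 0xD800) << 10) + ((low as u32) - 0xDC00);
    char::from_u32(cp)
}

/// Return (number of input code units, number of items after decoding).
/// Every paired surrogate makes the second number one smaller than the first.
pub fn counts(units: &[u16]) -> (usize, usize) {
    let decoded = std::char::decode_utf16(units.iter().copied()).count();
    (units.len(), decoded)
}
```

For the input `[0x61, 0xD83D, 0xDE00, 0x62]` ("a", a surrogate pair, "b") there are four input units but only three characters after decoding, so any column position computed from the decoded string lands one short of what the test expects.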

For now, I'm happy enough with the tokenizer as is, and I'm going to play around a bit with tokenizing larger sites to see if that works well enough.

Next stop: DOM generation

rust gosub tokenizer utf8

