GoSuB Browser Progress, pt7

Created at 2023-08-27 09:10:47

It seems I've hit (another) snag. On the bright side, more than 5000 of the 6000 tokenizer tests from html5lib-tests are passing, so that's a big win.

There are 2 issues that need a proper redesign of the input stream. First, my line/col position is sometimes off by one when creating a parse error. This happens frequently, and I can't seem to get rid of it. Sometimes the test wants the current position, and sometimes it wants the position of the character we just read. I can't get this right at the moment without adding more edge cases, which seems very wrong.
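One way to avoid scattering "+1/-1" edge cases might be to let the reader remember both positions, so the code that raises a parse error picks the one it needs. Below is a minimal sketch, not the actual GoSuB code: `Position`, `PositionedReader` and its methods are hypothetical names, and 1-based line/col counting is an assumption.

```rust
/// A line/column pair. 1-based counting is an assumption of this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Position {
    pub line: usize,
    pub col: usize,
}

/// Hypothetical reader that tracks the position of the next character
/// and the position of the character that was read last.
pub struct PositionedReader {
    chars: Vec<char>,
    index: usize,
    current: Position,
    previous: Position,
}

impl PositionedReader {
    pub fn new(input: &str) -> Self {
        let start = Position { line: 1, col: 1 };
        Self {
            chars: input.chars().collect(),
            index: 0,
            current: start,
            previous: start,
        }
    }

    /// Reads one character and advances the position.
    /// Returns None at end of input and leaves both positions untouched.
    pub fn read(&mut self) -> Option<char> {
        let c = *self.chars.get(self.index)?;
        self.index += 1;
        // The character we just read started where `current` pointed.
        self.previous = self.current;
        if c == '\n' {
            self.current = Position { line: self.current.line + 1, col: 1 };
        } else {
            self.current.col += 1;
        }
        Some(c)
    }

    /// Position of the next character to be read.
    pub fn current_position(&self) -> Position {
        self.current
    }

    /// Position of the character that was read last.
    pub fn previous_position(&self) -> Position {
        self.previous
    }
}
```

With something like this, a tokenizer state that reports an error about the character it just consumed would call `previous_position()`, while a state that reports on what comes next would call `current_position()`.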

Secondly, and I guess this is the bigger issue: I'm using Rust's "char" type to fetch a character from the input stream. A char is a Unicode scalar value, which means it cannot hold surrogate code points (U+D800 through U+DFFF). However, some of the tests need to check whether such a code point is sent, and if so, whether a parse error occurs. So maybe I have to let go of char and use u32 instead? But this seems wrong as well.
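A possible middle ground between char and a bare u32 is a small enum that keeps char for everything valid and only falls back to a raw value for surrogates. This is a sketch under assumptions: `Element` and `from_code_point` are made-up names, and it presumes the stream can hand over raw code points as u32. The only standard library call it leans on is `char::from_u32`, which returns None for surrogates and for values above U+10FFFF.

```rust
/// Hypothetical stream element: a valid Unicode scalar value,
/// or a lone surrogate code point that `char` cannot represent.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Element {
    Scalar(char),
    Surrogate(u16),
}

impl Element {
    /// Converts a raw code point into an Element.
    /// Returns None for values above U+10FFFF, which are not code points at all.
    pub fn from_code_point(cp: u32) -> Option<Element> {
        match char::from_u32(cp) {
            Some(c) => Some(Element::Scalar(c)),
            // char::from_u32 rejects surrogates, so catch them here.
            None if (0xD800..=0xDFFF).contains(&cp) => Some(Element::Surrogate(cp as u16)),
            None => None,
        }
    }

    /// True when this element is a surrogate, i.e. a candidate for a parse error.
    pub fn is_surrogate(&self) -> bool {
        matches!(self, Element::Surrogate(_))
    }
}
```

The tokenizer could then keep matching on `Element::Scalar(c)` for the normal cases and treat `Element::Surrogate(_)` as the trigger for the parse error the tests expect.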

Another issue I've found is that when I do some preprocessing of the stream (mostly replacing 0x0D 0x0A with 0x0A and getting rid of all remaining 0x0D's), the line/col can end up off. I think that from a user's point of view, you want the correct line/col based on the USER INPUT stream, not the preprocessed stream, as that stream is probably never shown back to the user anyway.
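One idea is to record the original position of every character while preprocessing, so the tokenizer sees the cleaned-up stream but errors still point into the user's input. The sketch below is only an illustration with hypothetical names (`SourcePos`, `normalize_with_positions`). It assumes 1-based positions, and it turns a lone 0x0D into 0x0A the way the HTML standard's newline normalization does; if your preprocessing drops lone 0x0D's instead, that branch would need to change.

```rust
/// Line/column in the ORIGINAL (unprocessed) input. 1-based is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SourcePos {
    pub line: usize,
    pub col: usize,
}

/// Normalizes newlines and pairs each output character with the
/// position it had in the original input.
pub fn normalize_with_positions(input: &str) -> Vec<(char, SourcePos)> {
    let mut out = Vec::new();
    let mut pos = SourcePos { line: 1, col: 1 };
    let mut chars = input.chars().peekable();

    while let Some(c) = chars.next() {
        match c {
            '\r' => {
                // CRLF or a lone CR becomes one LF, reported at the CR's position.
                out.push(('\n', pos));
                if chars.peek() == Some(&'\n') {
                    chars.next(); // swallow the LF of a CRLF pair
                }
                pos = SourcePos { line: pos.line + 1, col: 1 };
            }
            '\n' => {
                out.push(('\n', pos));
                pos = SourcePos { line: pos.line + 1, col: 1 };
            }
            _ => {
                out.push((c, pos));
                pos.col += 1;
            }
        }
    }
    out
}
```

The cost is one position per character in memory. A leaner variant could store only the offsets where the two streams diverge, but that is a trade-off I have not measured.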

So my focus is on the input stream again, and hopefully I don't need to change anything in the tokenizer. Once this is running, the main remaining test failures are about doctype handling, which I haven't implemented yet.

rust gosub utf-8

Author: jaytaph