On broken HTML

From time to time the idea is put forth that Opera (and Firefox, for that matter) needs to start dealing with bad code. There are two problems with that view:

Opera already deals with quite a bit of “bad code” (but there’s always room for improvement)
Just dealing with bad code isn’t enough: you have to deal with it the same way someone else does.

#2 is the tough part.

The rules for dealing with good code are, for the most part, specific. If you encounter well-formed HTML, you can be reasonably sure you know what the author meant. But there are very few rules for dealing with bad code. Trying to "deal with it" means trying to guess what the author meant, and sometimes different assumptions are equally as likely.

Example:


<p><b>Here's some text</i> and here's some more.</p>

Did the author close the non-existent italics by mistake, meaning to close the bold? Or did he open bold by mistake, intending to open italics? Or is the closing italics tag left over from copy-and-paste? Depending on what assumptions the browser makes, it should display it as:

Here’s some text and here’s some more.

Here’s some text and here’s some more.

Here’s some text and here’s some more.

And that’s just a simple example. It gets wilder when you throw in issues like inline vs. block elements. A paragraph should never appear inside a tag for text formatting, like bold or italic. By all rights, starting a new paragraph (or more precisely, ending the previous one) should also revert to plain formatting. But a lot of old pages expect the formatting to continue into the next paragraph, because way back when, a P tag was a double-line break, not a container.

Now, suppose that Browser A always makes the first assumption, and Browser B always makes the second. If someone tests their code in Browser A, and it happens to be what they want it to do, they won’t necessarily notice that their code is broken. The result: the site looks wrong in Browser B, and the page author — who thinks the page is fine, since he tested it in Browser A — blames Browser B.

Multiply that scenario by millions of pages and you have a large chunk of the web as we know it today.

So the solution isn’t just to “handle bad code.” It’s to handle that bad code in the same way that the dominant browser handles it. And since there’s no document you can look to for guidance, that means taking every possible chunk of bad code, running it through the other browser, and seeing what it does to it.

And there are a lot of ways to break code!

Even Microsoft did this back when IE was new. At the time, lots of people were writing broken code and testing it by seeing whether it looked right in Netscape. So IE had to make the same assumptions Netscape did on certain things. Once IE became established, they diverged.

Some relevant articles:

The Residual Style Problem (Surfin’ Safari)
Tag Soup: How UAs handle <x> <y> </x> </y> (Hixie’s Natural Log)
XML Error Handling in Web Browsers (Surfin’ Safari, mostly about HTML error handling)

This post originally appeared on Confessions of a Web Developer, my blog at the My Opera community.

K-Squared Ramblings

Sci-fi, comics, humor, photos…it's all fair game.

Related Posts:

Leave a Reply Cancel reply