Regex can't parse HTML

Almo · Nov 21, 2011

So, apparently regex can't parse HTML. At least, not all of it. But people kept asking how to do it anyway, which precipitated this answer on StackOverflow:

http://stackoverflow.com/questions/...ept-xhtml-self-contained-tags/1732454#1732454

For those who are curious about why this is true:

Some guy on the internet said:
I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular expression). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar - you can't possibly hope to make this work. But many will try, some will claim success and others will find the fault and totally mess you up.

But:

Some other guy on the internet said:
Chuck Norris can parse HTML with regex.

And if you're wondering about this bit:

Jon Skeet cannot parse HTML using regular expressions.

StackOverflow has a reputation system. Jon Skeet has the highest rep there, which is saying something because they have TONS of users.

And if you want to see something really weird, you can go to this chat room at StackOverflow:

http://chat.stackoverflow.com/rooms/7/c

Open a JavaScript console on that page, and type:

Eggs.Cthulu("<[^>.*]");

Which will spray pieces of the post I linked to at the top all over the page.

There are some VERY strange things out there on the net...

Dessi · Nov 21, 2011

For a start, well-formed HTML requires balanced tags:

Code:

<html>
    <body>
        Hello, <b><i>world!</i></b>
    </body>
</html>

Tags can be nested arbitrarily deep, and must be balanced.

You may already be aware that regular expressions define a finite state machine, and finite state machines cannot match arbitrarily deeply nested constructs like balanced parentheses or balanced braces.
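To make that concrete, here is a minimal Python sketch. The names `DEPTH_TWO`, `regex_balanced`, and `counter_balanced` are made up for illustration. The pattern hard-codes two levels of parentheses, which is all a plain regular expression can do: every extra level has to be spelled out by hand. The counter version keeps an unbounded depth count, which is exactly the part a finite state machine cannot do.

```python
import re

# Hypothetical pattern: accepts parentheses nested at most two levels deep.
# Supporting a third level would mean writing out another layer by hand.
DEPTH_TWO = re.compile(r"(?:\((?:\([^()]*\)|[^()])*\)|[^()])*")


def regex_balanced(text):
    """True if text is balanced AND never nests deeper than two levels."""
    return DEPTH_TWO.fullmatch(text) is not None


def counter_balanced(text):
    """True if parentheses are balanced at any depth (uses an unbounded counter)."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a close with no matching open
                return False
    return depth == 0


print(regex_balanced("(a(b)c)"))   # two levels: the pattern copes
print(regex_balanced("((()))"))    # three levels: the pattern gives up
print(counter_balanced("((()))"))  # the counter does not care about depth
```

The same limitation applies to nested tags, only with tag names instead of parentheses.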

In practice, some regex implementations, like those in Perl and .NET, support recursive or balanced matching. Such expressions go beyond what a finite automaton can match.

Even recursive regexes aren't sufficient to deal with some of HTML's peculiarities, because a parser also needs to handle ill-formed HTML:

Improper nesting:

Code:

Hello, <b><i>world!</b></i>
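Spotting this kind of mistake takes a stack, not a pattern. A rough sketch using Python's standard `html.parser` module follows; the class `NestingChecker` and the helper `find_mismatches` are hypothetical names, and the sketch assumes every start tag expects a close tag, so void elements written without a trailing slash would need extra handling.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Records close tags that do not match the most recently opened tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tag names
        self.mismatches = []  # (expected, found) pairs

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            expected = self.open_tags[-1] if self.open_tags else None
            self.mismatches.append((expected, tag))


def find_mismatches(html):
    """Return the list of (expected, found) close-tag mismatches in html."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.mismatches


print(find_mismatches("Hello, <b><i>world!</i></b>"))
print(find_mismatches("Hello, <b><i>world!</b></i>"))
```

The first call reports nothing; the second reports that `</b>` arrived while `<i>` was still open.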

Open tags with no close tags:

Code:

<p>Hello<br />
Line break

<p>World

HTML comments: in particular, comments don't nest, and HTML disallows the text "--" inside a comment:

Code:

Invalid: <!--Hello <!--world--> -->

Valid: <!--Hello <!- - world - -> -->
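You can watch a tokenizer trip over the "nested" comment. This sketch again leans on Python's standard `html.parser`; `CommentCollector` and `split_comments` are made-up names. The comment is expected to end at the first `-->`, so the intended outer terminator leaks into the page as ordinary text.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collects comment bodies and any text that falls outside them."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.text = []

    def handle_comment(self, data):
        self.comments.append(data)

    def handle_data(self, data):
        self.text.append(data)


def split_comments(html):
    """Return (list of comment bodies, concatenated leftover text)."""
    collector = CommentCollector()
    collector.feed(html)
    collector.close()
    return collector.comments, "".join(collector.text)


comments, leftover = split_comments("<!--Hello <!--world--> -->")
print(comments)  # the comment stops at the first "-->"
print(leftover)  # whatever follows is treated as visible text
```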

Attributes in their various forms, with double quotes, single quotes, no quotes, or no value:

Code:

<img src='image.png' height=100 width="200" />

<option selected>Text</option>
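A parser normalises all of these quoting styles for you. Here is a small sketch with Python's standard `html.parser`; the names `AttributeCollector` and `collect_attributes` are hypothetical. Quoted and unquoted values come back as plain strings, and a valueless attribute comes back as `None`.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collects (tag, attributes) for every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when absent
        self.seen.append((tag, dict(attrs)))


def collect_attributes(html):
    """Return a list of (tag, {attribute: value}) tuples found in html."""
    collector = AttributeCollector()
    collector.feed(html)
    collector.close()
    return collector.seen


sample = """<img src='image.png' height=100 width="200" />
<option selected>Text</option>"""
for tag, attrs in collect_attributes(sample):
    print(tag, attrs)
```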

Some tags are self-closing, some require close tags:

Code:

Valid: <img src="self-closes.png" />

Invalid: <img src="close-tag.png"></img>

Validity depends on the doctype: <img src="unclosed.png">

There are more than enough gotchas to make any non-trivial parsing with regex (e.g. traversing up and down the DOM) pretty much impossible. Use a real HTML parser instead.
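As one possible illustration of "a real HTML parser", here is a sketch that pulls the visible text out of the ill-formed snippets above using Python's standard `html.parser`. `TextExtractor` and `visible_text` are made-up names. Note that this module is an event-based tokenizer and does not build a DOM; for actual tree traversal you would want a library that does, which is outside this sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content and ignores the tags around it."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    """Return the text of html with runs of whitespace collapsed."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join("".join(extractor.chunks).split())


# Unclosed <p> tags and badly nested <b>/<i> do not stop the tokenizer.
print(visible_text("<p>Hello<br />\nLine break\n\n<p>World"))
print(visible_text("Hello, <b><i>world!</b></i>"))
```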

This is The End · Nov 22, 2011

Just looking at some of those makes me get all nervous...

Dessi said:
Code:

Hello, <b><i>world!</b></i>

........must fix!

Wowbagger · Nov 22, 2011

Oh yeah?! Well Charles Nelson Reilly figured out how to parse HTML with a regex, before HTML and regex were even invented! So there!!!

Almo · Nov 22, 2011

HTML is really ugly.

NewtonTrino · Nov 22, 2011

Almo said:
HTML is really ugly.

Yes and XML inherits all that garbage.

I welcome our JSON overlords.

Gazpacho · Nov 23, 2011

Almo said:
HTML is really ugly.

You would prefer, perhaps, a binary format?

jj · Nov 24, 2011

Well, could you write something in yacc?

(JUST KIDDING)

Beerina · Nov 24, 2011

Dessi said:
For a start, well-formed HTML requires balanced tags:

Code:

<html>
    <body>
        Hello, <b><i>world!</i></b>
    </body>
</html>

Tags can be nested arbitrarily deep, and must be balanced.

You may already be aware that regular expressions define a finite state machine, and finite state machines cannot match arbitrarily deeply nested constructs like balanced parentheses or balanced braces.

In practice, some regex implementations, like those in Perl and .NET, support recursive or balanced matching. Such expressions go beyond what a finite automaton can match.

Even recursive regexes aren't sufficient to deal with some of HTML's peculiarities, because a parser also needs to handle ill-formed HTML:

Improper nesting:

Code:

Hello, <b><i>world!</b></i>

Open tags with no close tags:

Code:

<p>Hello<br />
Line break

<p>World

HTML comments: in particular, comments don't nest, and HTML disallows the text "--" inside a comment:

Code:

Invalid: <!--Hello <!--world--> -->

Valid: <!--Hello <!- - world - -> -->

Attributes in their various forms, with double quotes, single quotes, no quotes, or no value:

Code:

<img src='image.png' height=100 width="200" />

<option selected>Text</option>

Some tags are self-closing, some require close tags:

Code:

Valid: <img src="self-closes.png" />

Invalid: <img src="close-tag.png"></img>

Validity depends on the doctype: <img src="unclosed.png">

There are more than enough gotchas to make any non-trivial parsing with regex (e.g. traversing up and down the DOM) pretty much impossible. Use a real HTML parser instead.

"Oh, what a tangled web we weave, when e'er we practice to deviate from LISP."

Wowbagger · Nov 24, 2011

Gazpacho said:
You would prefer, perhaps, a binary format?

Well, that could save some bandwidth...

Almo · Nov 28, 2011

Gazpacho said:
You would prefer, perhaps, a binary format?

I think I would prefer something that was more rigid in its definition. HTML is really loose.

Pulvinar · Nov 28, 2011

Almo said:
I think I would prefer something that was more rigid in its definition. HTML is really loose.

Yep, it is rather ridiculous. This very page has 24 errors and 23 warnings according to the W3C validator. If every browser rejected pages like this outright, they would have been cleaned up.

Beerina · Dec 4, 2011

Of course, any browser worth its salt is very forgiving -- those that don't render crappy code as well as the competition lose out.

Almo · Dec 5, 2011

Beerina said:
Of course, any browser worth its salt is very forgiving -- those that don't render crappy code as well as the competition lose out.

Which of course causes all sorts of problems. It's a Catch-22.
