• Quick note - the problem with Youtube videos not embedding on the forum appears to have been fixed, thanks to ZiprHead. If you do still see problems let me know.

Regex can't part HTML

Almo

Masterblazer
Joined
Aug 30, 2005
Messages
6,846
Location
Montreal, Quebec
So, apparently Regex can't parse HTML. At least, not all of it. But people who keep asking how to do it precipitated this answer on StackOverflow:

http://stackoverflow.com/questions/...ept-xhtml-self-contained-tags/1732454#1732454

For those who are curious about why this is true:

Some guy on the internet said:
I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular expression). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar - you can't possibly hope to make this work. But many will try, some will claim success and others will find the fault and totally mess you up.


But:

Some other guy on the internet said:
Chuck Norris can parse HTML with regex.


And if you're wondering about this bit:

Jon Skeet cannot parse HTML using regular expressions.

StackOverflow has a reputation system. Jon Skeet has the highest rep there, which is saying something because they have TONS of users.

And if you want to see something really weird, you can go to this chat room at StackOverflow:

http://chat.stackoverflow.com/rooms/7/c

Open a Javascript console at it, and type:

Eggs.Cthulu("<[^>.*]");

Which will spray pieces of the post I linked to at the top all over the page.

There are some VERY strange things out there on the net...
 
For a start, well-formed HTML requires balanced tags:

Code:
<html>
    <body>
        Hello, <b><i>world!</i></b>
    </body>
</html>

Tags can be nested arbitrarily deep, and must be balanced.

You may already be aware that regular expressions define a finite state machine, and state machines, and fsms cannot match arbitrarily deep nested constructs like balanced parenthesis or balanced braces.

In practice, some regex implementations like perl and .NET support balanced regex matching. These expressions are no longer finite automata anymore.

Even recursive regexes aren't sufficient to deal with some of HTML's peculiarities, because it would need to handle ill-formed HTML:

Improper nesting:
Code:
Hello, <b><i>world!</b></i>

Open tags with no close tags:
Code:
<p>Hello<br />
Line break

<p>World

HTML comments, in particular, HTML comments don't nest, and HTML disallows the text "--" inside another comment:
Code:
Invalid: <!--Hello <!--world--> -->

Valid: <!--Hello <!- - world - -> -->

Attributes in their various forms, with double doubles, single quotes, no quotes, or no value:
Code:
<img src='image.png' height=100 width="200" />

<option selected>Text</option>

Some tags are self-closing, some require close tags:
Code:
Valid: <img src="self-closes.png" />

Invalid: <img src="close-tag.png"></img>

Validity depends on the doctype: <img src="unclosed.png">

There are more than enough gotchas to make any non-trivial parsing with regex (e.g. travesing up and down the DOM) pretty much impossible. Use a real HTML parser instead.
 
Last edited:
Well, could you write something in yacc?

(JUST KIDDING)
 
For a start, well-formed HTML requires balanced tags:

Code:
<html>
    <body>
        Hello, <b><i>world!</i></b>
    </body>
</html>

Tags can be nested arbitrarily deep, and must be balanced.

You may already be aware that regular expressions define a finite state machine, and state machines, and fsms cannot match arbitrarily deep nested constructs like balanced parenthesis or balanced braces.

In practice, some regex implementations like perl and .NET support balanced regex matching. These expressions are no longer finite automata anymore.

Even recursive regexes aren't sufficient to deal with some of HTML's peculiarities, because it would need to handle ill-formed HTML:

Improper nesting:
Code:
Hello, <b><i>world!</b></i>

Open tags with no close tags:
Code:
<p>Hello<br />
Line break

<p>World

HTML comments, in particular, HTML comments don't nest, and HTML disallows the text "--" inside another comment:
Code:
Invalid: <!--Hello <!--world--> -->

Valid: <!--Hello <!- - world - -> -->

Attributes in their various forms, with double doubles, single quotes, no quotes, or no value:
Code:
<img src='image.png' height=100 width="200" />

<option selected>Text</option>

Some tags are self-closing, some require close tags:
Code:
Valid: <img src="self-closes.png" />

Invalid: <img src="close-tag.png"></img>

Validity depends on the doctype: <img src="unclosed.png">

There are more than enough gotchas to make any non-trivial parsing with regex (e.g. travesing up and down the DOM) pretty much impossible. Use a real HTML parser instead.

"Oh, what a tangled web we weave, when e'er we practice to deviate from LISP."
 
I think I would prefer something that were more rigid in its definition. HTML is really loose.

Yep, it is rather ridiculous. This very page has 24 errors and 23 warnings according to the W3C validator. If every browser rejected this outright, it would have been cleaned.
 
Of course, any browser worth its salt is very forgiving -- those that don't render crappy code as well as the competition lose out.
 

Back
Top Bottom