
Help needed: Program or configure a web / XML crawler

Oystein (Penultimate Amazing; joined Dec 9, 2009; 18,903 messages)
Hey y'all,

I need someone familiar with some easy web-crawling and XML-reading techniques who can program or customize a crawler for me to read out data records from a website.

Here is the site (may take a minute to load, if you are on a weak computer or connection):

www.ae911truth.org/signatures/ae.html

This page has >2,300 links to personal profiles; each link carries a local path in a data-link attribute like this (I am snipping out other attributes):
Code:
<a data-link="xml/supporters/U/KenGorskiEl-PasoTXUS.xml.txt"></a>

And that resolves to full URLs such as
Code:
http://www.ae911truth.org/signatures/xml/supporters/U/KenGorskiEl-PasoTXUS.xml.txt

Those .txt files contain stuff like this:

<?xml version="1.0" encoding="UTF-8"?>

<person>
<first_name><![CDATA[Ken]]></first_name>
<middle_name></middle_name>
<last_name><![CDATA[Gorski]]></last_name>
<title></title>
<degree><![CDATA[B Architecture Professional Degree, University of Kansas, 1972]]></degree>
<city><![CDATA[El Paso]]></city>
<state><![CDATA[TX]]></state>
<country><![CDATA[US]]></country>
<occupation_status>Degreed + Licensed</occupation_status>
<tech_biography><![CDATA[I'm a licensed architect and AIA member.]]></tech_biography>
<statement_911><![CDATA[I am supportive of the intent for a complete investigation of the 9/11. Questionable structural and architectural explanations have heretofore been provided to the public.]]></statement_911>
<photo></photo>
<license_info><![CDATA[6477 TX]]></license_info>
</person>

Which I want to have translated into a simpler CSV/spreadsheet-like format, such as
Code:
url|first_name|middle_name|last_name|title|degree|city|state|country|occupation_status|tech_biography|statement_911|photo|license_info
xml/supporters/U/KenGorskiEl-PasoTXUS.xml.txt|Ken||Gorski||B Architecture Professional Degree, University of Kansas, 1972|El Paso|TX|US|Degreed + Licensed|I'm a licensed architect and AIA member.|I am supportive of the intent for a complete investigation of the 9/11. Questionable structural and architectural explanations have heretofore been provided to the public.||6477 TX
(The same info must go into the same column every time. I believe that all .txt files contain tags for every data item, so it would suffice to output just the data without headers, provided you emit an empty field, delimited by "|" signs, when a tag contains no CDATA.)

Then the same for same sorts of linked profiles on
Code:
http://www.ae911truth.org/signatures/general.html
http://www.ae911truth.org/signatures/general.html#A
http://www.ae911truth.org/signatures/general.html#B
...

I would like to have a little tool that I can run on these URLs whenever I need to update my database: Input is the page with all the names and links, output a list with all profiles. Either something you program from scratch, or perhaps you can recommend a freeware tool that does just that sort of thing and can be configured by a half-witted fellow like myself.


Thanks!!
 
Are you looking for a program (or set of programs) that works in Windows, or in Linux? My first impulse would be to write a Perl or Python program, both of which are supported natively on Linux but require downloading extra stuff for Windows. In addition, the program I'd write would run on the command line (in Windows, that's cmd.exe).
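For what it's worth, a minimal sketch of such a crawler in Python (stdlib only) could look like the following. The base URL, the data-link markup, and the field names are taken from the samples in this thread; treat them as assumptions to verify against the live page:

```python
# Sketch of the crawl: scrape data-link paths from the index page,
# fetch each person's .xml.txt, and emit one pipe-delimited row each.
import re
import urllib.request
import xml.etree.ElementTree as ET

BASE = "http://www.ae911truth.org/signatures/"
FIELDS = ["first_name", "middle_name", "last_name", "title", "degree",
          "city", "state", "country", "occupation_status",
          "tech_biography", "statement_911", "photo", "license_info"]

def profile_links(page_html):
    """Extract the local .xml.txt paths from data-link attributes."""
    return re.findall(r'data-link="([^"]+\.xml\.txt)"', page_html)

def xml_to_row(link, xml_text):
    """Turn one <person> record into a pipe row; empty tag -> empty field."""
    person = ET.fromstring(xml_text)
    values = [(person.findtext(f) or "").strip() for f in FIELDS]
    return "|".join([link] + values)

def crawl(index_url):
    """Yield one output row per linked profile on the index page."""
    html = urllib.request.urlopen(index_url).read().decode("utf-8")
    for link in profile_links(html):
        xml_text = urllib.request.urlopen(BASE + link).read().decode("utf-8")
        yield xml_to_row(link, xml_text)
```

ElementTree returns CDATA content as plain element text, so no special handling is needed for the `<![CDATA[...]]>` wrappers.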
 
I would like to have a little tool that I can run on these URLs whenever I need to update my database: Input is the page with all the names and links, output a list with all profiles. Either something you program from scratch, or perhaps you can recommend a freeware tool that does just that sort of thing and can be configured by a half-witted fellow like myself.
A script like that would be very easy. If I have time, I'll throw something together for you.

Do you have a language preference? C#? Ruby? Something else?
 
Are you looking for a program (or set of programs) that works in Windows, or in Linux? My first impulse would be to write a Perl or Python program, both of which are supported natively on Linux but require downloading extra stuff for Windows. In addition, the program I'd write would run on the command line (in Windows, that's cmd.exe).

I currently work on Windows 7. cmd.exe wouldn't be a problem. The extra stuff ... if it's nothing that gets permanently loaded to eat up resources and interfere with whatever - and if it's both free and legal - no prob.
 
It's quite easy actually, turns out the data already is in xml format (albeit not standard).
Example: http://www.ae911truth.org/signatures/xml/arch.xml.txt

Oooh, this makes the task very simple ^_^ Note that this XML document is missing the tech_biography, occupation_status, statement_911, and photo nodes. It's easy enough to use the XSL 'document(url)' function to query the full XML and populate the data in the output.

Use this site to perform a simple XSL transform: http://xslttest.appspot.com/

Inputs:

URL: http://www.ae911truth.org/signatures/xml/arch.xml.txt
XSLT:

Code:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="no" method="text" />
    <xsl:strip-space elements="*"/>

    <xsl:template match="/">
      <xsl:text>url|first_name|middle_name|last_name|title|degree|city|state|country|occupation_status|tech_biography|statement_911|photo|license_info
</xsl:text>
      <xsl:apply-templates />
    </xsl:template>

    <xsl:template match="person/link">
        <xsl:variable name="row" select="document(concat('http://www.ae911truth.org/signatures/', .))/person" />
        <xsl:value-of select="normalize-space(.)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/first_name)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/middle_name)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/last_name)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/title)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/degree)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/city)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/state)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/country)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/occupation_status)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/tech_biography)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/statement_911)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/photo)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/license_info)" /><xsl:text>
</xsl:text>
    </xsl:template>

    <xsl:template match="text()" />
</xsl:stylesheet>

This will output exactly what you want. You can save the result to a file and view it in Excel or LibreOffice Calc. ^_^

These steps can be automated easily with a short script which applies this transform and pipes its results to a file, saving you some of the repetition of applying the transform manually.
 
Here's a short PowerShell script to generate an output file; PowerShell should already be available to you with Windows 7.

Code:
$source = "http://www.ae911truth.org/signatures/xml/arch.xml.txt"
$output = "list.csv"
[xml]$people = Invoke-WebRequest $source
$people.root.person | Select-Object @{label="First Name";Expression={$_.first_name.'#cdata-section'}}, 
@{label="Middle Name";Expression={$_.middle_name.'#cdata-section'}}, 
@{label="Last Name";Expression={$_.last_name.'#cdata-section'}}, 
@{label="title";Expression={$_.title.'#cdata-section'}}, 
@{label="Degree";Expression={$_.degree.'#cdata-section'}}, 
@{label="City";Expression={$_.city.'#cdata-section'}}, 
@{label="State";Expression={$_.state.'#cdata-section'}}, 
@{label="Country";Expression={$_.country.'#cdata-section'}} , 
@{label="License Information";Expression={$_.license_info.'#cdata-section'}} | Export-Csv $output
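For anyone not on Windows, a rough stdlib-only Python equivalent of the PowerShell above might look like this. It assumes the index document wraps one `<person>` element per record under a `<root>` element, as the `$people.root.person` line implies:

```python
# Read the index XML and write the same columns as the PowerShell script.
import csv
import urllib.request
import xml.etree.ElementTree as ET

COLUMNS = ["first_name", "middle_name", "last_name", "title", "degree",
           "city", "state", "country", "license_info"]

def index_to_rows(xml_text):
    """Yield one dict per <person> record; missing/empty tags -> ''."""
    root = ET.fromstring(xml_text)
    for person in root.iter("person"):
        yield {c: (person.findtext(c) or "").strip() for c in COLUMNS}

def save_csv(source, output):
    """Fetch the index XML from `source` and write it out as CSV."""
    xml_text = urllib.request.urlopen(source).read().decode("utf-8")
    with open(output, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(index_to_rows(xml_text))
```

Usage would be e.g. `save_csv("http://www.ae911truth.org/signatures/xml/arch.xml.txt", "list.csv")`.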
 
It's quite easy actually, turns out the data already is in xml format (albeit not standard).
Example: http://www.ae911truth.org/signatures/xml/arch.xml.txt
Here as a html table with bio and statements:
http://temp.oan.co/arch.html

1. THANKS!

2. How did you stumble upon the first URL? ../arch.xml.txt? The petition signers list has five categories. What you post there is only the first - "arch" = "Architects (Degreed & Licensed – Active & Retired)". There are four more:
- ???.xml.txt = "Architectural Professionals (Degreed Only)"
- eng.xml.txt = "Engineers (Degreed & Licensed – Active & Retired)"
- ???.xml.txt = "Engineering Professionals (Degreed Only)"
- ???.xml.txt = "Non-U.S. Architects and Engineers & Architectural and Engineering Professionals"
I'd need all five, plus the "General Public"

3. Your html table is almost good. I am missing:
- <link> (e.g. "xml/supporters/U/MichaelStokerPark-CityUTUS.xml.txt")
- <occupation_status> (e.g. "Degreed + Licensed")
- <display_order>
- <photo> (if that ever contains a value)
And of course the other person categories
And a way for me to do it myself in the future!

4. I have to check something: Sometimes, the <tech_biography> and <statement_911> fields contained more than one paragraph. At least that was the case before they changed the web publishing format to what it is now. It seems like your html only has a single paragraph per entry. Gotta check if it really is so, and if the problem is with your translation, or with their publishing.

5. Thanks again!
 
Oooh, this makes the task very simple ^_^ Note that this XML document is missing the tech_biography, occupation_status, statement_911, and photo nodes. It's easy enough to use the XSL 'document(url)' function to query the full XML and populate the data in the output.

Use this site to perform a simple XSL transform: http://xslttest.appspot.com/

Inputs:

URL: http://www.ae911truth.org/signatures/xml/arch.xml.txt
XSLT:

Code:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="no" method="text" />
    <xsl:strip-space elements="*"/>

    <xsl:template match="/">
      <xsl:text>url|first_name|middle_name|last_name|title|degree|city|state|country|occupation_status|tech_biography|statement_911|photo|license_info
</xsl:text>
      <xsl:apply-templates />
    </xsl:template>

    <xsl:template match="person/link">
        <xsl:variable name="row" select="document(concat('http://www.ae911truth.org/signatures/', .))/person" />
        <xsl:value-of select="normalize-space(.)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/first_name)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/middle_name)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/last_name)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/title)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/degree)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/city)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/state)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/country)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/occupation_status)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/tech_biography)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/statement_911)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/photo)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/license_info)" /><xsl:text>
</xsl:text>
    </xsl:template>

    <xsl:template match="text()" />
</xsl:stylesheet>

This will output exactly what you want. You can save the result to a file and view it in Excel or LibreOffice Calc. ^_^

These steps can be automated easily with a short script which applies this transform and pipes its results to a file, saving you some of the repetition of applying the transform manually.

Whoa, excellent!!

I got 2 problems left:
a) Gotta find the URLs for the lists other than the 300-something "Architects (Licensed)"
b) Some special characters get screwed up, at least in my system (language:DE - not sure if that causes the problem). Any ideas where and how to fix this?
 
Whoa, excellent!!

I got 2 problems left:
a) Gotta find the URLs for the lists other than the 300-something "Architects (Licensed)"
Chrome's Javascript console is fabulous for digging into the different sources a page depends on. ^_^

One of the javascript files refers to an API page, which links to the lists of all engineers and the general public:

http://www.ae911truth.org/signatures/api.xml.txt

Be aware that my XSL issues an HTTP request to retrieve specific data on each record. Running this script will issue 1000s of HTTP requests in a very short period of time, which might be interpreted as a DoS by a sysadmin. Try not to run your script more than once per day.
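If the per-record fetches are scripted outside the XSL tool (for example in Python, as suggested earlier in the thread), a fixed pause between requests keeps the load polite. The one-second default below is an arbitrary choice, not a site requirement:

```python
# Throttled fetching: sleep between consecutive requests so thousands of
# record lookups don't arrive in a burst that looks like a DoS.
import time
import urllib.request

def polite_fetch(urls, delay=1.0, fetch=None):
    """Fetch each URL in turn, sleeping `delay` seconds between requests."""
    if fetch is None:
        fetch = lambda u: urllib.request.urlopen(u).read()
    for i, url in enumerate(urls):
        if i:                      # no pause before the very first request
            time.sleep(delay)
        yield fetch(url)
```

The `fetch` parameter just makes the downloader swappable (handy for testing without a network connection).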

b) Some special characters get screwed up, at least in my system (language:DE - not sure if that causes the problem). Any ideas where and how to fix this?
Try setting the encoding in the XSL to UTF-8:

Code:
<xsl:output indent="no" method="text" encoding="UTF-8" />
 
Chrome's Javascript console is fabulous for digging into the different sources a page depends on. ^_^

One of the javascript files refers to an API page, which links to the lists of all engineers and the general public:

http://www.ae911truth.org/signatures/api.xml.txt
Great! Thanks again!
Be aware that my XSL issues an HTTP request to retrieve specific data on each record. Running this script will issue 1000s of HTTP requests in a very short period of time, which might be interpreted as a DoS by a sysadmin. Try not to run your script more than once per day.
As tempting as this is... :p
I am running it a few times today, and certainly to retrieve the five "professional" or "architects and engineers" categories, but will get at the general public day by day ;)
Try setting the encoding in the XSL to UTF-8:

Code:
<xsl:output indent="no" method="text" encoding="UTF-8" />

This didn't work, for example when someone uses the typographic apostrophe, "didn’t", instead of the standard "didn't".
Example: http://www.ae911truth.org/signatures/xml/supporters/U/GrazynaSamborskaCherry-HillNJUS.xml.txt


ETA: Or try to run the script on http://www.ae911truth.org/signatures/xml/xmisc-.xml.txt which contains all those with last-name initials not included in the English 26-letter alphabet (the German umlauts, Czech consonants..); it generates only a handful of HTTP requests and gets you lots of special characters.
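For what it's worth, this garbling pattern is consistent with UTF-8 bytes being decoded as Windows-1252 somewhere along the way (exactly where the online tool does this is a guess). A tiny Python illustration:

```python
# The curly apostrophe U+2019 is three bytes in UTF-8; decoding those
# bytes with the wrong (Windows-1252) codec produces the classic mojibake.
raw = "didn\u2019t".encode("utf-8")   # bytes as the server sends them
good = raw.decode("utf-8")            # round-trips cleanly
bad = raw.decode("cp1252")            # yields 'didnâ€™t', the garbled form
```

So the fix has to happen wherever the bytes are first decoded, not in the XSL output declaration, which only affects the final serialization.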
 
Great! Thanks again!

As tempting as this is... :p
I am running it a few times today, and certainly to retrieve the five "professional" or "architects and engineers" categories, but will get at the general public day by day ;)
There are 30+ files on http://www.ae911truth.org/signatures/api.xml.txt , what are the filenames of the documents you want to read? Once I have that, I can tweak my script slightly to pull data only from those files.

This didn't work, for example when someone uses the typographic apostrophe, "didn’t", instead of the standard "didn't".
Example: http://www.ae911truth.org/signatures/xml/supporters/U/GrazynaSamborskaCherry-HillNJUS.xml.txt


ETA: Or try to run the script on http://www.ae911truth.org/signatures/xml/xmisc-.xml.txt which contains all those with last-name initials not included in the English 26-letter alphabet (the German umlauts, Czech consonants..); it generates only a handful of HTTP requests and gets you lots of special characters.
I'm a bit busy at work today to debug this at the moment. Will have a look later ^_^
 
There are 30+ files on http://www.ae911truth.org/signatures/api.xml.txt , what are the filenames of the documents you want to read? Once I have that, I can tweak my script slightly to pull data only from those files.
I want all of them! :D

The first five:

<category>xml/arch.xml.txt</category>
<category>xml/archprof.xml.txt</category>
<category>xml/eng.xml.txt</category>
<category>xml/engprof.xml.txt</category>
<category>xml/non-us.xml.txt</category>
are my main interest (those are the architects and engineers that "Architects & Engineers for 9/11 Truth" flaunt as an argument from authority). They total a bit over 2300 persons right now (slowly growing).
I may want to download these at least once a month, sometimes more often.

The rest are "Other Supporters" or "General Public", grouped into files by last name initial, and of secondary interest. I want to get a full download "soon", but then only occasionally, perhaps every half year. It wouldn't hurt if I don't fetch them all on the same day. I can manually run the script, one letter of the alphabet per day, and be done by year's end.

I'm a bit busy at work today to debug this at the moment. Will have a look later ^_^
I am in no hurry at all :)
 
In the meantime, I googled other online XSL test tools and tried about 5 or 6. Some of them had no trouble returning the foreign characters correctly. Some wouldn't execute Dessi's script at all, and none called the referenced xml documents to actually crawl the hundreds of person profiles.

So it seems http://xslttest.appspot.com/ alone is buggy with charset/encoding, while being the only test tool I found that does the crawling I am after :boggled:
The files contain too many of those foreign characters, I can't very well work with this bug. :(


Perhaps someone could tutor me on somehow running this locally (?) or some other solution?
 
Did you try my script?

Sorry, I had forgotten to check for other replies after I got hooked on the test tool...

Trying it now - did it work for you?
First off, I don't know PowerShell at all. Just found it, started it, copied & pasted your script into the upper of 3 windows, and hit F5 to execute.
I get an error message - in German. I translate:
The term "Invoke-WebRequest" was not recognized as the name of a cmdlet, a function, a script file, or an executable program.
Check the spelling of the name, or whether the path is correct (if included), and repeat the action
At line:3 Character:33
+ [xml]$people = Invoke-WebRequest <<<< $source
+ CategoryInfo : ObjectNotFound: (Invoke-WebRequest:String) [], CommandNotFoundException
+ FullyQualifiedErrorId : CommandNotFoundException

Here is the code (I actually edited it a bit - different output file name, and three more fields - category, link and display_order)

Code:
$source = "http://www.ae911truth.org/signatures/xml/arch.xml.txt"
$output = "arch.csv"
[xml]$people = Invoke-WebRequest $source
$people.root.person | Select-Object @{label="First Name";Expression={$_.first_name.'#cdata-section'}}, 
@{label="Middle Name";Expression={$_.middle_name.'#cdata-section'}}, 
@{label="Last Name";Expression={$_.last_name.'#cdata-section'}}, 
@{label="title";Expression={$_.title.'#cdata-section'}}, 
@{label="Degree";Expression={$_.degree.'#cdata-section'}}, 
@{label="City";Expression={$_.city.'#cdata-section'}}, 
@{label="State";Expression={$_.state.'#cdata-section'}}, 
@{label="Country";Expression={$_.country.'#cdata-section'}} , 
@{label="Category";Expression={$_.category.'#cdata-section'}} , 
@{label="URL";Expression={$_.link.'#cdata-section'}} , 
@{label="DisplayOrder";Expression={$_.display_order.'#cdata-section'}} , 
@{label="License Information";Expression={$_.license_info.'#cdata-section'}} | Export-Csv $output
 
Those changes should be fine; they're working OK for me using PowerShell 4.0

It turns out that the Invoke-WebRequest cmdlet was introduced in PowerShell version 3.0. If you're getting an error, then it sounds like you may have an earlier version. Though it also sounds like you're using the PowerShell ISE, so I thought that meant you were using version 3.0 at least. Colour me a little confused.

You can type $psversiontable to find your version.

You can upgrade to the latest version here

http://www.microsoft.com/en-gb/download/details.aspx?id=40855

Let me know how it pans out.
 
Just a little amendment to your changes as category and display order aren't stored within character data sections in the xml.

It ran OK but when I looked at the output those columns were blank. This fixes it.

Code:
$source = "http://www.ae911truth.org/signatures/xml/arch.xml.txt"
$output = "arch.csv"
[xml]$people = Invoke-WebRequest $source
$people.root.person | Select-Object @{label="First Name";Expression={$_.first_name.'#cdata-section'}}, 
@{label="Middle Name";Expression={$_.middle_name.'#cdata-section'}}, 
@{label="Last Name";Expression={$_.last_name.'#cdata-section'}}, 
@{label="title";Expression={$_.title.'#cdata-section'}}, 
@{label="Degree";Expression={$_.degree.'#cdata-section'}}, 
@{label="City";Expression={$_.city.'#cdata-section'}}, 
@{label="State";Expression={$_.state.'#cdata-section'}}, 
@{label="Country";Expression={$_.country.'#cdata-section'}} , 
@{label="Category";Expression={$_.category}} , 
@{label="URL";Expression={$_.link.'#cdata-section'}} , 
@{label="DisplayOrder";Expression={$_.display_order}} , 
@{label="License Information";Expression={$_.license_info.'#cdata-section'}} | Export-Csv $output
 
