
Help needed: Program or configure a web / XML crawler

Oystein (Penultimate Amazing; joined Dec 9, 2009; 18,903 messages)
Hey y'all,

I need someone familiar with some easy web-crawling and XML-reading techniques who can program or customize a crawler for me to read out data records from a website.

Here is the site (may take a minute to load, if you are on a weak computer or connection):

www.ae911truth.org/signatures/ae.html

This page has >2,300 links to personal profiles; each link carries a local path in a data-link attribute like this (I am snipping out other attributes):
Code:
<a data-link="xml/supporters/U/KenGorskiEl-PasoTXUS.xml.txt"></a>

And that resolves to full URLs such as
Code:
http://www.ae911truth.org/signatures/xml/supporters/U/KenGorskiEl-PasoTXUS.xml.txt

Those .txt files contain stuff like this:

<?xml version="1.0" encoding="UTF-8"?>

<person>
<first_name><![CDATA[Ken]]></first_name>
<middle_name></middle_name>
<last_name><![CDATA[Gorski]]></last_name>
<title></title>
<degree><![CDATA[B Architecture Professional Degree, University of Kansas, 1972]]></degree>
<city><![CDATA[El Paso]]></city>
<state><![CDATA[TX]]></state>
<country><![CDATA[US]]></country>
<occupation_status>Degreed + Licensed</occupation_status>
<tech_biography><![CDATA[I'm a licensed architect and AIA member.]]></tech_biography>
<statement_911><![CDATA[I am supportive of the intent for a complete investigation of the 9/11. Questionable structural and architectural explanations have heretofore been provided to the public.]]></statement_911>
<photo></photo>
<license_info><![CDATA[6477 TX]]></license_info>
</person>

Which I want to have translated into a simpler CSV/spreadsheet-like format, such as
Code:
url|first_name|middle_name|last_name|title|degree|city|state|country|occupation_status|tech_biography|statement_911|photo|license_info
xml/supporters/U/KenGorskiEl-PasoTXUS.xml.txt|Ken||Gorski||B Architecture Professional Degree, University of Kansas, 1972|El Paso|TX|US|Degreed + Licensed|I'm a licensed architect and AIA member.|I am supportive of the intent for a complete investigation of the 9/11. Questionable structural and architectural explanations have heretofore been provided to the public.||6477 TX
(The same info must go into the same column every time. I believe that all .txt files contain tags for every data item, so it would suffice to output just the data without headers, provided you emit an empty field, delimited by "|" signs, when a tag contains no CDATA.)

Then the same for same sorts of linked profiles on
Code:
http://www.ae911truth.org/signatures/general.html
http://www.ae911truth.org/signatures/general.html#A
http://www.ae911truth.org/signatures/general.html#B
...

I would like to have a little tool that I can run on these URLs whenever I need to update my database: Input is the page with all the names and links, output a list with all profiles. Either something you program from scratch, or perhaps you can recommend a freeware tool that does just that sort of thing and can be configured by a half-witted fellow like myself.


Thanks!!
 
Are you looking for a program (or set of programs) that works in Windows, or in Linux? My first impulse would be to write a Perl or Python program, both of which are supported natively on Linux but require downloading extra stuff for Windows. In addition, the program I'd write would run on the command line (in Windows, that's cmd.exe).
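For what it's worth, a minimal sketch of such a crawler in Python (stdlib only) could look like the following. The base URL, the data-link markup, and the field names are taken from the samples in this thread; treat them as assumptions to verify against the live page:

```python
# Sketch of the crawl: scrape data-link paths from the index page,
# fetch each person's .xml.txt, and emit one pipe-delimited row each.
import re
import urllib.request
import xml.etree.ElementTree as ET

BASE = "http://www.ae911truth.org/signatures/"
FIELDS = ["first_name", "middle_name", "last_name", "title", "degree",
          "city", "state", "country", "occupation_status",
          "tech_biography", "statement_911", "photo", "license_info"]

def profile_links(page_html):
    """Extract the local .xml.txt paths from data-link attributes."""
    return re.findall(r'data-link="([^"]+\.xml\.txt)"', page_html)

def xml_to_row(link, xml_text):
    """Turn one <person> record into a pipe row; empty tag -> empty field."""
    person = ET.fromstring(xml_text)
    values = [(person.findtext(f) or "").strip() for f in FIELDS]
    return "|".join([link] + values)

def crawl(index_url):
    """Yield one output row per linked profile on the index page."""
    html = urllib.request.urlopen(index_url).read().decode("utf-8")
    for link in profile_links(html):
        xml_text = urllib.request.urlopen(BASE + link).read().decode("utf-8")
        yield xml_to_row(link, xml_text)
```

ElementTree returns CDATA content as plain element text, so no special handling is needed for the `<![CDATA[...]]>` wrappers.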
 
I would like to have a little tool that I can run on these URLs whenever I need to update my database: Input is the page with all the names and links, output a list with all profiles. Either something you program from scratch, or perhaps you can recommend a freeware tool that does just that sort of thing and can be configured by a half-witted fellow like myself.
A script like that would be very easy. If I have time, I'll throw something together for you.

Do you have a language preference? C#? Ruby? Something else?
 
Are you looking for a program (or set of programs) that works in Windows, or in Linux? My first impulse would be to write a Perl or Python program, both of which are supported natively on Linux but require downloading extra stuff for Windows. In addition, the program I'd write would run on the command line (in Windows, that's cmd.exe).

I currently work on Windows 7. cmd.exe wouldn't be a problem. The extra stuff ... if it's nothing that gets permanently loaded to eat up resources and interfere with whatever - and if it's both free and legal - no prob.
 
It's quite easy actually, turns out the data already is in xml format (albeit not standard).
Example: http://www.ae911truth.org/signatures/xml/arch.xml.txt

Oooh, this makes the task very simple ^_^ Note that this XML document is missing the tech_biography, occupation_status, statement_911, and photo nodes. It's easy enough to use the XSL 'document(url)' function to query the full XML and populate the data in the output.

Use this site to perform a simple XSL transform: http://xslttest.appspot.com/

Inputs:

URL: http://www.ae911truth.org/signatures/xml/arch.xml.txt
XSLT:

Code:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="no" method="text" />
    <xsl:strip-space elements="*"/>

    <xsl:template match="/">
      <xsl:text>url|first_name|middle_name|last_name|title|degree|city|state|country|occupation_status|tech_biography|statement_911|photo|license_info
</xsl:text>
      <xsl:apply-templates />
    </xsl:template>

    <xsl:template match="person/link">
        <xsl:variable name="row" select="document(concat('http://www.ae911truth.org/signatures/', .))/person" />
        <xsl:value-of select="normalize-space(.)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/first_name)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/middle_name)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/last_name)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/title)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/degree)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/city)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/state)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/country)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/occupation_status)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/tech_biography)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/statement_911)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/photo)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/license_info)" /><xsl:text>
</xsl:text>
    </xsl:template>

    <xsl:template match="text()" />
</xsl:stylesheet>

This will output exactly what you want. You can save the result to a file and view it in Excel or LibreOffice Calc. ^_^

These steps can be automated easily with a short script which applies this transform and pipes its results to a file, saving you some of the repetition of applying the transform manually.
 
Here's a short PowerShell script to generate an output file; PowerShell should already be available to you with Windows 7.

Code:
$source = "http://www.ae911truth.org/signatures/xml/arch.xml.txt"
$output = "list.csv"
[xml]$people = Invoke-WebRequest $source
$people.root.person | Select-Object @{label="First Name";Expression={$_.first_name.'#cdata-section'}}, 
@{label="Middle Name";Expression={$_.middle_name.'#cdata-section'}}, 
@{label="Last Name";Expression={$_.last_name.'#cdata-section'}}, 
@{label="title";Expression={$_.title.'#cdata-section'}}, 
@{label="Degree";Expression={$_.degree.'#cdata-section'}}, 
@{label="City";Expression={$_.city.'#cdata-section'}}, 
@{label="State";Expression={$_.state.'#cdata-section'}}, 
@{label="Country";Expression={$_.country.'#cdata-section'}} , 
@{label="License Information";Expression={$_.license_info.'#cdata-section'}} | Export-Csv $output
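For anyone not on Windows, a rough stdlib-only Python equivalent of the PowerShell above might look like this. It assumes the index document wraps one `<person>` element per record under a `<root>` element, as the `$people.root.person` line implies:

```python
# Read the index XML and write the same columns as the PowerShell script.
import csv
import urllib.request
import xml.etree.ElementTree as ET

COLUMNS = ["first_name", "middle_name", "last_name", "title", "degree",
           "city", "state", "country", "license_info"]

def index_to_rows(xml_text):
    """Yield one dict per <person> record; missing/empty tags -> ''."""
    root = ET.fromstring(xml_text)
    for person in root.iter("person"):
        yield {c: (person.findtext(c) or "").strip() for c in COLUMNS}

def save_csv(source, output):
    """Fetch the index XML from `source` and write it out as CSV."""
    xml_text = urllib.request.urlopen(source).read().decode("utf-8")
    with open(output, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(index_to_rows(xml_text))
```

Usage would be e.g. `save_csv("http://www.ae911truth.org/signatures/xml/arch.xml.txt", "list.csv")`.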
 
It's quite easy actually, turns out the data already is in xml format (albeit not standard).
Example: http://www.ae911truth.org/signatures/xml/arch.xml.txt
Here as a html table with bio and statements:
http://temp.oan.co/arch.html

1. THANKS!

2. How did you stumble upon the first URL? ../arch.xml.txt? The petition signers list has five categories. What you post there is only the first - "arch" = "Architects (Degreed & Licensed – Active & Retired)". There are four more:
- ???.xml.txt = "Architectural Professionals (Degreed Only)"
- eng.xml.txt = "Engineers (Degreed & Licensed – Active & Retired)"
- ???.xml.txt = "Engineering Professionals (Degreed Only)"
- ???.xml.txt = "Non-U.S. Architects and Engineers & Architectural and Engineering Professionals"
I'd need all five, plus the "General Public"

3. Your html table is almost good. I am missing:
- <link> (e.g. "xml/supporters/U/MichaelStokerPark-CityUTUS.xml.txt")
- <occupation_status> (e.g. "Degreed + Licensed")
- <display_order>
- <photo> (if that ever contains a value)
And of course the other person categories
And a way for me to do it myself in the future!

4. I have to check something: Sometimes, the <tech_biography> and <statement_911> fields contained more than one paragraph. At least that was the case before they changed the web publishing format to what it is now. It seems like your html only has a single paragraph per entry. Gotta check if it really is so, and if the problem is with your translation, or with their publishing.

5. Thanks again!
 
Oooh, this makes the task very simple ^_^ Note that this XML document is missing the tech_biography, occupation_status, statement_911, and photo nodes. It's easy enough to use the XSL 'document(url)' function to query the full XML and populate the data in the output.

Use this site to perform a simple XSL transform: http://xslttest.appspot.com/

Inputs:

URL: http://www.ae911truth.org/signatures/xml/arch.xml.txt
XSLT:

Code:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="no" method="text" />
    <xsl:strip-space elements="*"/>

    <xsl:template match="/">
      <xsl:text>url|first_name|middle_name|last_name|title|degree|city|state|country|occupation_status|tech_biography|statement_911|photo|license_info
</xsl:text>
      <xsl:apply-templates />
    </xsl:template>

    <xsl:template match="person/link">
        <xsl:variable name="row" select="document(concat('http://www.ae911truth.org/signatures/', .))/person" />
        <xsl:value-of select="normalize-space(.)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/first_name)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/middle_name)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/last_name)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/title)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/degree)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/city)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/state)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/country)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/occupation_status)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/tech_biography)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/statement_911)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/photo)" /><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space($row/license_info)" /><xsl:text>
</xsl:text>
    </xsl:template>

    <xsl:template match="text()" />
</xsl:stylesheet>

This will output exactly what you want. You can save the result to a file and view it in Excel or LibreOffice Calc. ^_^

These steps can be automated easily with a short script which applies this transform and pipes its results to a file, saving you some of the repetition of applying the transform manually.

Whoa, excellent!!

I got 2 problems left:
a) Gotta find the URLs for the lists other than the 300-something "Architects (Licensed)"
b) Some special characters get screwed up, at least in my system (language:DE - not sure if that causes the problem). Any ideas where and how to fix this?
 
Whoa, excellent!!

I got 2 problems left:
a) Gotta find the URLs for the lists other than the 300-something "Architects (Licensed)"
Chrome's Javascript console is fabulous for digging into the different sources a page depends on. ^_^

One of the javascript files refers to an API page, which links to the lists of all engineers and the general public:

http://www.ae911truth.org/signatures/api.xml.txt

Be aware that my XSL issues an HTTP request to retrieve specific data on each record. Running this script will issue 1000s of HTTP requests in a very short period of time, which might be interpreted as a DoS by a sysadmin. Try not to run your script more than once per day.
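If the per-record fetches are scripted outside the XSL tool (for example in Python, as suggested earlier in the thread), a fixed pause between requests keeps the load polite. The one-second default below is an arbitrary choice, not a site requirement:

```python
# Throttled fetching: sleep between consecutive requests so thousands of
# record lookups don't arrive in a burst that looks like a DoS.
import time
import urllib.request

def polite_fetch(urls, delay=1.0, fetch=None):
    """Fetch each URL in turn, sleeping `delay` seconds between requests."""
    if fetch is None:
        fetch = lambda u: urllib.request.urlopen(u).read()
    for i, url in enumerate(urls):
        if i:                      # no pause before the very first request
            time.sleep(delay)
        yield fetch(url)
```

The `fetch` parameter just makes the downloader swappable (handy for testing without a network connection).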

b) Some special characters get screwed up, at least in my system (language:DE - not sure if that causes the problem). Any ideas where and how to fix this?
Try setting the encoding in the XSL to UTF-8:

Code:
<xsl:output indent="no" method="text" encoding="UTF-8" />
 
Chrome's Javascript console is fabulous for digging into the different sources a page depends on. ^_^

One of the javascript files refers to an API page, which links to the lists of all engineers and the general public:

http://www.ae911truth.org/signatures/api.xml.txt
Great! Thanks again!
Be aware that my XSL issues an HTTP request to retrieve specific data on each record. Running this script will issue 1000s of HTTP requests in a very short period of time, which might be interpreted as a DoS by a sysadmin. Try not to run your script more than once per day.
As tempting as this is... :p
I am running it a few times today, and certainly to retrieve the five "professional" or "architects and engineers" categories, but will get at the general public day by day ;)
Try setting the encoding in the XSL to UTF-8:

Code:
<xsl:output indent="no" method="text" encoding="UTF-8" />

This didn't work, for example when someone uses the typographic apostrophe, "didn’t", instead of the standard "didn't".
Example: http://www.ae911truth.org/signatures/xml/supporters/U/GrazynaSamborskaCherry-HillNJUS.xml.txt


ETA: Or try to run the script on http://www.ae911truth.org/signatures/xml/xmisc-.xml.txt which contains all those with last-name initials not included in the English 26-letter alphabet (the German umlauts, Czech consonants..); it generates only a handful of HTTP requests and gets you lots of special characters.
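For what it's worth, this garbling pattern is consistent with UTF-8 bytes being decoded as Windows-1252 somewhere along the way (exactly where the online tool does this is a guess). A tiny Python illustration:

```python
# The curly apostrophe U+2019 is three bytes in UTF-8; decoding those
# bytes with the wrong (Windows-1252) codec produces the classic mojibake.
raw = "didn\u2019t".encode("utf-8")   # bytes as the server sends them
good = raw.decode("utf-8")            # round-trips cleanly
bad = raw.decode("cp1252")            # yields 'didnâ€™t', the garbled form
```

So the fix has to happen wherever the bytes are first decoded, not in the XSL output declaration, which only affects the final serialization.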
 
Great! Thanks again!

As tempting as this is... :p
I am running it a few times today, and certainly to retrieve the five "professional" or "architects and engineers" categories, but will get at the general public day by day ;)
There are 30+ files on http://www.ae911truth.org/signatures/api.xml.txt , what are the filenames of the documents you want to read? Once I have that, I can tweak my script slightly to pull data only from those files.

This didn't work, for example when someone uses the typographic apostrophe, "didn’t", instead of the standard "didn't".
Example: http://www.ae911truth.org/signatures/xml/supporters/U/GrazynaSamborskaCherry-HillNJUS.xml.txt


ETA: Or try to run the script on http://www.ae911truth.org/signatures/xml/xmisc-.xml.txt which contains all those with last-name initials not included in the English 26-letter alphabet (the German umlauts, Czech consonants..); it generates only a handful of HTTP requests and gets you lots of special characters.
I'm a bit busy at work today to debug this at the moment. Will have a look later ^_^
 
There are 30+ files on http://www.ae911truth.org/signatures/api.xml.txt , what are the filenames of the documents you want to read? Once I have that, I can tweak my script slightly to pull data only from those files.
I want all of them! :D

The first five:

<category>xml/arch.xml.txt</category>
<category>xml/archprof.xml.txt</category>
<category>xml/eng.xml.txt</category>
<category>xml/engprof.xml.txt</category>
<category>xml/non-us.xml.txt</category>
are my main interest (those are the architects and engineers that "Architects & Engineers for 9/11 Truth" flaunt as an argument from authority). They total a bit over 2300 persons right now (slowly growing).
I may want to download these at least once a month, sometimes more often.

The rest are "Other Supporters" or "General Public", grouped into files by last name initial, and of secondary interest. I want to get a full download "soon", but then only occasionally, perhaps every half year. It wouldn't hurt if I don't fetch them all on the same day. I can manually run the script, one letter of the alphabet per day, and be done by year's end.

I'm a bit busy at work today to debug this at the moment. Will have a look later ^_^
I am in no hurry at all :)
 
In the meantime, I googled other online XSL test tools and tried about 5 or 6. Some of them had no trouble returning the foreign characters correctly. Some wouldn't execute Dessi's script at all, and none called the referenced xml documents to actually crawl the hundreds of person profiles.

So it seems http://xslttest.appspot.com/ alone is buggy with charset/encoding, while being the only test tool I found that does the crawling I am after :boggled:
The files contain too many of those foreign characters, I can't very well work with this bug. :(


Perhaps someone could tutor me on somehow running this locally (?) or some other solution?
 
Did you try my script?

Sorry, I had forgotten to check for other replies after I got hooked on the test tool...

Trying it now - did it work for you?
First off, I don't know PowerShell at all. Just found it, started it, copied & pasted your script into the upper of 3 windows, and hit F5 to execute.
I get an error message - in German. I translate:
The term "Invoke-WebRequest" was not recognized as the name of a cmdlet, a function, a script file, or an executable program.
Check the spelling of the name, or whether the path is correct (if included), and repeat the action
At line:3 Character:33
+ [xml]$people = Invoke-WebRequest <<<< $source
+ CategoryInfo : ObjectNotFound: (Invoke-WebRequest:String) [], CommandNotFoundException
+ FullyQualifiedErrorId : CommandNotFoundException

Here is the code (I actually edited it a bit - different output file name, and three more fields - category, link and display_order)

Code:
$source = "http://www.ae911truth.org/signatures/xml/arch.xml.txt"
$output = "arch.csv"
[xml]$people = Invoke-WebRequest $source
$people.root.person | Select-Object @{label="First Name";Expression={$_.first_name.'#cdata-section'}}, 
@{label="Middle Name";Expression={$_.middle_name.'#cdata-section'}}, 
@{label="Last Name";Expression={$_.last_name.'#cdata-section'}}, 
@{label="title";Expression={$_.title.'#cdata-section'}}, 
@{label="Degree";Expression={$_.degree.'#cdata-section'}}, 
@{label="City";Expression={$_.city.'#cdata-section'}}, 
@{label="State";Expression={$_.state.'#cdata-section'}}, 
@{label="Country";Expression={$_.country.'#cdata-section'}} , 
@{label="Category";Expression={$_.category.'#cdata-section'}} , 
@{label="URL";Expression={$_.link.'#cdata-section'}} , 
@{label="DisplayOrder";Expression={$_.display_order.'#cdata-section'}} , 
@{label="License Information";Expression={$_.license_info.'#cdata-section'}} | Export-Csv $output
 
Those changes should be fine; they're working OK for me using PowerShell 4.0

It turns out that the Invoke-WebRequest cmdlet was introduced in PowerShell version 3.0. If you're getting an error, then it sounds like you may have an earlier version. Though it also sounds like you're using the PowerShell ISE, so I thought that meant you were using version 3.0 at least. Colour me a little confused.

You can type $psversiontable to find your version.

You can upgrade to the latest version here

http://www.microsoft.com/en-gb/download/details.aspx?id=40855

Let me know how it pans out.
 
Just a little amendment to your changes as category and display order aren't stored within character data sections in the xml.

It ran OK but when I looked at the output those columns were blank. This fixes it.

Code:
$source = "http://www.ae911truth.org/signatures/xml/arch.xml.txt"
$output = "arch.csv"
[xml]$people = Invoke-WebRequest $source
$people.root.person | Select-Object @{label="First Name";Expression={$_.first_name.'#cdata-section'}}, 
@{label="Middle Name";Expression={$_.middle_name.'#cdata-section'}}, 
@{label="Last Name";Expression={$_.last_name.'#cdata-section'}}, 
@{label="title";Expression={$_.title.'#cdata-section'}}, 
@{label="Degree";Expression={$_.degree.'#cdata-section'}}, 
@{label="City";Expression={$_.city.'#cdata-section'}}, 
@{label="State";Expression={$_.state.'#cdata-section'}}, 
@{label="Country";Expression={$_.country.'#cdata-section'}} , 
@{label="Category";Expression={$_.category}} , 
@{label="URL";Expression={$_.link.'#cdata-section'}} , 
@{label="DisplayOrder";Expression={$_.display_order}} , 
@{label="License Information";Expression={$_.license_info.'#cdata-section'}} | Export-Csv $output
 
