
Help needed: Program or configure a NEW web crawler

Oystein

Hi guys,

This is an update to a thread I opened a bit over three years ago. The ae911truth.org website has a new design, which has changed the way it displays its list of petition signers - a list I want to read out and download into a table format from time to time.

So I need someone familiar with some easy web-crawling and XML-reading techniques who can program or customize a crawler for me to read out data records from a website.

Here is the new site:
http://www.ae911truth.org/signatures/#/AE/

This page has almost 3000 links to personal profiles; each link has a local href like this:
http://www.ae911truth.org/signatures/#/AE/RichardGageLafayetteCAUS

Previously, those links referred to .txt files which contained XML, and the job was to parse the XML. Now, the content seems to be hidden behind Cloudflare somehow.

These petition signature records come from a database, and the field names used to be visible in the XML files. Now I struggle to even see the markup text that my browser (Chrome on Win7) ends up parsing.

What I want to get, with the help of a script that I can run on my amateur Win7 notebook with free amateur tools, is a table in CSV/spreadsheet form, similar to this:
Code:
url|first_name|middle_name|last_name|title|degree|city|state|country|occupation_status|tech_biography|statement_911|photo|license_info
xml/supporters/U/KenGorskiEl-PasoTXUS.xml.txt|Ken||Gorski||B Architecture Professional Degree, University of Kansas, 1972|El Paso|TX|US|Degreed + Licensed|I'm a licensed architect and AIA member.|I am supportive of the intent for a complete investigation of the 9/11. Questionable structural and architectural explanations have heretofore been provided to the public.||6477 TX
(The same info must go into the same column every time; I believe that all .txt files contain tags for every data item, so it would suffice to output just the data without headers, provided you output an empty field, i.e. just the "|" delimiter, when a tag contains no CDATA.)

And the same then for
http://www.ae911truth.org/signatures/#/General/A/
http://www.ae911truth.org/signatures/#/General/B/
etc.


Am I making sense?

HELP!

Thx ;)
 
It appears that the page is using some pretty complex JavaScript to build the page on the fly. The JavaScript code is sort of obfuscated, too. At least, it is devoid of line terminators, spaces, indentation or other formatting, so making any kind of sense of it is going to take a lot of work. Getting at the actual data is not likely to be an easy task. It's certainly not one I would want to take on.
 
It appears that the page is using some pretty complex JavaScript to build the page on the fly. The JavaScript code is sort of obfuscated, too. At least, it is devoid of line terminators, spaces, indentation or other formatting, so making any kind of sense of it is going to take a lot of work. Getting at the actual data is not likely to be an easy task. It's certainly not one I would want to take on.
I had the same impression, but I know next to nothing about such scripting. At some point, a script that runs locally on my computer (JS runs locally, doesn't it?) must access a data source, and I hope that source can be identified from the scripts.
 
It can be if you're Google, but it's usually not worth it, which is why people generally don't do it.

I would suggest a simpler mechanism: ctrl+a, ctrl+c, ctrl+v into a text editor. Then parse out the names every couple of lines.
 
It can be if you're Google, but it's usually not worth it, which is why people generally don't do it.

I would suggest a simpler mechanism: ctrl+a, ctrl+c, ctrl+v into a text editor. Then parse out the names every couple of lines.
In a former version, prior to 2015, they generated a static HTML page every day. I could download it and do a number of search & replace actions in a text editor, plus a bit of spreadsheet wizardry, to extract what I needed. This was possible because each paragraph of data content was preceded by a field name. That was an hour of manual work, or more.
Then, between 2015 and a few days ago, XML parsing via script cut down on the manual time.

Now, however, with no apparent field names anywhere, doing this manually would mean looking at ca. 3000 records individually - that's a couple of full work days. Unrealistic.
 
Some quick fiddling with perl got me some reasonable results at the first level - getting a scrape of the upfront data from the root page.

More work would be needed to follow the individual URL links to get the specific details within. (my perl skillz are very rusty)
 
Unfortunately, because the page content is now created using JavaScript and not straight HTML, it's much more difficult to scrape. In addition, the comments made by the signers are no longer part of the page, but hidden behind an AJAX call linked to a mouse click on the signer's box. That approach has the advantage of being parsimonious with the site's bandwidth, but at the cost of making the page unviewable in a non-JavaScript-capable browser like elinks.

It would be straightforward to put together a script that could parse a file created using Ctrl+A > Ctrl+C > (switch to text editor) > Ctrl+V > save, but that would lose the comments made by the individual signers. What that script could do, though, is alert you to new signers, so you could manually click their boxes and copy and paste their comments to your list.
 
Some quick fiddling with perl got me some reasonable results at the first level - getting a scrape of the upfront data from the root page.

More work would be needed to follow the individual URL links to get the specific details within. (my perl skillz are very rusty)

Ok, sounds like a non-rusty scriptor might have a shot at digging up what I need? Thanks for that glimpse of hope!


Unfortunately, because the page content is now created using JavaScript and not straight HTML, it's much more difficult to scrape. In addition, the comments made by the signers are no longer part of the page, but hidden behind an AJAX call linked to a mouse click on the signer's box. That approach has the advantage of being parsimonious with the site's bandwidth, but at the cost of making the page unviewable in a non-JavaScript-capable browser like elinks.
Doesn't a browser parse HTML at the end of all scripting - I mean don't JS, AJAX etc. dynamically create HTML for the browser to display? Or am I thinking too much old-school?

It would be straightforward to put together a script that could parse a file created using Ctrl+A > Ctrl+C > (switch to text editor) > Ctrl+V > save, but that would lose the comments made by the individual signers. What that script could do, though, is alert you to new signers, so you could manually click their boxes and copy and paste their comments to your list.
That would assume, among other things, that the order of signatures stays the same, right?

Right now my situation is that I last downloaded from the old design on January 1st, and I had planned to do a new download on April 1st, as an end-of-quarter thing. I know from their published counts that there should be only about a dozen new signatures during those three months - but it would be a hassle to find them.
 
Ok, sounds like a non-rusty scriptor might have a shot at digging up what I need? Thanks for that glimpse of hope!



Doesn't a browser parse HTML at the end of all scripting - I mean don't JS, AJAX etc. dynamically create HTML for the browser to display? Or am I thinking too much old-school?


That would assume, among other things, that the order of signatures stays the same, right?

Right now my situation is that I last downloaded from the old design on January 1st, and I had planned to do a new download on April 1st, as an end-of-quarter thing. I know from their published counts that there should be only about a dozen new signatures during those three months - but it would be a hassle to find them.

In Chrome, I find that if I right-click and "view page source", in the body I just get a bunch of <script> tags that source JavaScript files on cloudshare. However, if I right-click and "inspect", I can see the generated HTML.
 
In Chrome, I find that if I right-click and "view page source", in the body I just get a bunch of <script> tags that source JavaScript files on cloudshare. However, if I right-click and "inspect", I can see the generated HTML.

Ah - good idea!
I am starting to play around there...

I see that https://www.ae911truth.org/signatures/static/js/main.min.js appears to talk a lot about the data fields I want to extract - though I can't find yet where they are extracted from.
 
Ah - good idea!
I am starting to play around there...

I see that https://www.ae911truth.org/signatures/static/js/main.min.js appears to talk a lot about the data fields I want to extract - though I can't find yet where they are extracted from.

I used a "man-in-the-middle" proxy to trace all the traffic generated by the page, and noticed a call to this URL:

https://siteupgrade2.cloudaccess.host/signatures/AE.json

That returns 2.9 MB of JSON formatted data, but with no line breaks. The good news is it has all the data you're looking for, already formatted (although not formatted to your specifications.) No more page scraping!
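
If you want to look at the raw file yourself first, something along these lines should work from a cmd.exe window once gow is installed (wget and sed come with gow; the output file names here are just ones I made up):
Code:
REM Fetch the JSON once by hand (sketch, not a finished script)
wget --no-check-certificate -O AE.json https://siteupgrade2.cloudaccess.host/signatures/AE.json

REM The file has no line breaks, so optionally break it up before opening it in an editor
sed "s/},/},\n/g" AE.json > AE_readable.json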

I'm working on a program now to transform the JSON file into the format you requested. My plan is to use utilities available in the GNU on Windows project, which is a set of over 100 common Unix/Linux utilities for use in the Windows cmd.exe shell. Included in this list are sed (a streaming editor) and gawk (a pattern-based text-processing utility), which usually make a project like this very easy to do.
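
In rough outline, the whole thing should end up as a single pipeline of those gow tools, something like the sketch below (illustrative only - cleanup.sed and reformat.awk are placeholder names; the real scripts are in the posts further down):
Code:
REM Rough shape of the intended pipeline (placeholder file names)
wget --no-check-certificate -O - https://siteupgrade2.cloudaccess.host/signatures/AE.json | sed -f cleanup.sed | gawk -f reformat.awk > ae911_ae.txt
REM cleanup.sed  would insert line breaks so that awk sees one key/value pair per line
REM reformat.awk would collect each record's fields and print one bar-separated line per signer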
 
I've got it mostly working (and it will work on Windows). There are actually 28 separate files:
  • The main Architects and Engineers list
  • A short "VIP" list
  • The list of public signers in 26 separate files (A - Z)
How do you want the output? All in one huge file, or one file for the Architects and Engineers list and a second file for the VIP and Public (A-Z) lists?
 
I've got it mostly working (and it will work on Windows). There are actually 28 separate files:
  • The main Architects and Engineers list
  • A short "VIP" list
  • The list of public signers in 26 separate files (A - Z)
How do you want the output? All in one huge file, or one file for the Architects and Engineers list and a second file for the VIP and Public (A-Z) lists?
Wow - excellent!

I keep the A&E separate from the general population (VIP and A through Z). Two files.
Within the A&E, there are five categories: licensed architects, "professional" (i.e. unlicensed) architects, licensed engineers, engineering professionals, and non-US A&E. I would usually track that as a field.

Thank you thank you thank you!!!
 
I used a "man-in-the-middle" proxy to trace all the traffic generated by the page, and noticed a call to this URL:

https://siteupgrade2.cloudaccess.host/signatures/AE.json

That returns 2.9 MB of JSON formatted data, but with no line breaks. The good news is it has all the data you're looking for, already formatted (although not formatted to your specifications.) No more page scraping!
...

I am only now looking at that file (previously, I was on my smartphone and Tapatalk, no fun looking at huge linebreakless files on that).

That looks indeed like it has it all. One record looks like this:
Code:
"AlanHaymondGreenwichNYUS":{"category":"ARCH","city":"Greenwich","country":"US","degree":"B Arch, Rensselaer Polytechnic","discipline":"Architect","first_name":"Alan","hash":"NjAxMzg3NTk=","last_name":"Haymond","license_info":"025143","occupation_status":"Degreed + Licensed","state":"NY","statement_911":"Suspicious on 9/11 about the collapses and the size of the original hole in the Pentagon. Thoroughly convinced of cover up by April '02 - too many unanswered questions. Recommend David Ray Griffin's books.","supporter_title":"Architect","tech_biography":"From northern Virginia, graduate of RPI in Troy, NY, homeowner and father of 2 in upstate NY, carpenter and home renovator, 15 years in architectural firms (12 years licensed), primarily design large medical office buildings."},
Which I'd need parsed similar to this:
Code:
id|category|city|country|degree|discipline|first_name|hash|last_name|license_info|occupation_status|state|statement_911|supporter_title|tech_biography
AlanHaymondGreenwichNYUS|ARCH|Greenwich|US|B Arch, Rensselaer Polytechnic|Architect|Alan|NjAxMzg3NTk=|Haymond|025143|Degreed + Licensed|NY|Suspicious on 9/11 about the collapses and the size of the original hole in the Pentagon. Thoroughly convinced of cover up by April '02 - too many unanswered questions. Recommend David Ray Griffin's books.|Architect|From northern Virginia, graduate of RPI in Troy, NY, homeowner and father of 2 in upstate NY, carpenter and home renovator, 15 years in architectural firms (12 years licensed), primarily design large medical office buildings.

Some records have more fields, some fewer. For example, some have a "photo_file" (typical value: "bwil32761553.jpg"), and some are missing the "license_info".

The spreadsheet I want generated in the end should have a first line with all field names that appear anywhere, then one line per record, with an empty cell (e.g. "||" in a pipe-separated text file) where a field does not appear in a record or is empty.
 
Blue Mountain,

you have surely noticed that the AE.json file has five sections:
"ARCH":{}
"ARCHPROF":{}
"ENG":{}
"ENGPROF":{}
"NON-US":{}
Each pair of brackets contains the records of signatories with the same "category". It's probably best to just ignore these sections.


Let's see, they have right now:
"category"|Number|Change since Jan 01
"ARCH"|528 records|-2
"ARCHPROF"|141records|+1
"ENG"|555 records|+2
"ENGPROF"|1049 records|+4
"NON-US"|699 records|+3
SUM:|2972 records|+8

Yep, that's the current count on http://ae911truth.org/ :)

The record identifier, such as "AaronAshkinazyRooseveltNJUS", happily coincides with the URL I recorded from the old web design: "xml/supporters/U/AaronAshkinazyRooseveltNJUS.xml.txt".
Of course, that's not a permanent ID - it changes when they edit the name, city, state or country. Still, it allows me to identify new, deleted and changed records since I last downloaded the old version, three months ago.
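
(A rough sketch of how that quarterly comparison could look once the downloads exist as bar-separated text files - the dated file names are just examples, and it assumes gow's diff is on the PATH:)
Code:
REM Hypothetical comparison of two quarterly downloads (file names are placeholders)
diff ae911_ae_2017-01-01.txt ae911_ae_2017-04-01.txt > ae_changes.txt
REM In the output, lines prefixed "<" appear only in the January file (deleted or old versions),
REM lines prefixed ">" appear only in the April file (new or changed records).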

Next questions:
  • I wonder what the field "hash" is?
  • What is (are) the file name(s) for the "General" population (VIP and letters A-Z)? (I actually expected there to be two more files: one for names that start with a digit 0-9, and one for names that start with some special character, such as letters with diacritics.)
 
Blue Mountain,

you have surely noticed that the AE.json file has five sections:
"ARCH":{}
"ARCHPROF":{}
"ENG":{}
"ENGPROF":{}
"NON-US":{}
Each pair of brackets contains the records of signatories with the same "category". It's probably best to just ignore these sections.
In fact, the section title is repeated in the "category" key on each entry, so it's already part of the dataset.

Next questions:
  • I wonder what the field "hash" is?
  • What is (are) the file name(s) for the "General" population (VIP and letters A-Z)? (I actually expected there to be two more files: one for names that start with a digit 0-9, and one for names that start with some special character, such as letters with diacritics.)
Technically, the hash field is a base-64 encoded value. Base-64 is often used when encoding binary data intended to be sent over a channel that can handle only a limited character set. For example, pictures sent in email are almost always base-64 encoded, because email has traditionally been limited to a small set of printable text characters, while binary data such as a picture can use all 256 possible byte values.

As it turns out, the decoded value is actually a text field consisting of eight digits. This is probably a sequence number and, I suspect, is both unique and permanent. Even if the ID field itself changes, chances are the hash field won't. I'll set up the program so it will put the decoded value in the hash field.
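
As a worked example, the hash from the AlanHaymond record quoted earlier, "NjAxMzg3NTk=", should decode like this (same multiply/divide arithmetic the awk decoder later in the thread uses):
Code:
"NjAxMzg3NTk=" taken four characters at a time:
  N j A x  ->  13 35  0 49  ->  bytes 54 48 49  ->  "601"
  M z g 3  ->  12 51 32 55  ->  bytes 51 56 55  ->  "387"
  N T k =  ->  13 19 36 pad ->  bytes 53 57     ->  "59"
Decoded value: "60138759" - eight digits, as described above.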

The URLs of the files all follow the same pattern: https://siteupgrade2.cloudaccess.host/signatures/ followed by the list name plus ".json" - AE.json, OTHER-VIP.json, OTHER-A.json through OTHER-Z.json, and OTHER-_.json (the same list names that appear in the download script below).

The process I'm putting together will download all the files for you, so you won't need to do it yourself.
 
Well, this has certainly been an interesting little project! I think I have everything needed to download the data files and re-format them into your desired format.

First, download and install gow (GNU on Windows). Follow this link to download the installer, and run it.

Next, create a folder in which you will save four files. I recommend you put it into your Documents folder and call it AE911 Signatures.

Copy the text below, paste it into a text editor, and save to a file named get_ae911.bat in the folder you created in the previous step. This is the main script, and the one you'll run when everything is set up.
Code:
@echo off
REM  --------------------------------------------------------------------------
REM  Batch file: get_ae911.bat
REM  Author:     Blue Mountain at internationalskeptics.com/forums
REM  Date:       April 2017
REM  Purpose:    Downloads various JSON files used by ae911truth.org (hosted
REM              at siteupgrade2.cloudaccess.host/signatures) and creates
REM              from them two bar-separated text files containing the
REM              signatories of ae911truth's petition:
REM              * ae911_ae.txt - Architects and Engineers
REM              * ae911_public.txt - members of the general public
REM  License:    I, the author of this file and the three subsidiary files
REM              (_ae911.bat, _ae911.sed, and _ae911.awk), grant all who may
REM              wish to use these files for any purpose whatsoever the right
REM              to do so. Redistribution is freely permitted, with the request
REM              this header remain intact.
REM  Warranty:   No warranty, expressed or implied. These files are guaranteed
REM              only to occupy disc space. If anything breaks, you get to keep
REM              both the pieces.
REM  --------------------------------------------------------------------------

REM Retrieve and format the list of Architects and Engineers
del ae911_ae.txt 2>del_stderr.txt
call _ae911.bat AE ae911_ae.txt

REM Retrieve and format the Public (A-Z) lists
del ae911_public.txt 2>del_stderr.txt
for %%L in (VIP A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _) DO call _ae911.bat OTHER-%%L ae911_public.txt

del del_stderr.txt

The next file is named _ae911.bat (note the leading '_'). The file must have this name or things won't work:
Code:
REM  Sub-process #1 for "get_ae911.bat" to retrieve a JSON file containing
REM  data from ae911truth.org and format it into a .txt file

REM  Parameters:
REM   %1: JSON file to process, without the trailing '.json'
REM   %2: Name of file to which output is appended

echo Processing %1 list and writing to %2
set URL=https://siteupgrade2.cloudaccess.host/signatures
wget --quiet --no-check-certificate -O - %URL%/%1.json 2>wget-stderr.txt | sed -f _ae911.sed | awk -f _ae911.awk -v list=%1 >>%2
del wget-stderr.txt

The third file is named _ae911.sed (note the leading '_'). The file must have this name or things won't work. The contents are quite cryptic: they're a collection of search-and-replace statements used by sed (streaming editor), which was installed earlier as part of gow.
Code:
# Sub-process #2 for "get_ae911.bat": receives a very long JSON string (no
# line breaks) and translates it into something useful

# Add a line break after an opening '{'
s/{/{\n/g

# Add a line break before closing '}'
s/}/\n}/g

# Add a line break between a comma and a double-quote
s/,"/,\n"/g

# Add a space after double-quote+colon. Not really necessary, but it helps to
# separate key-value pairs more cleanly
s/":/": /g

# Change '|' into '/' or some lines in the final output will have too many columns
s,|,/,g

The fourth and last file is pretty large, so I'll put it into a new post.
 
The fourth and last file does most of the work. It's written in a language called awk, which is named after the first letter of the last names of its authors. It's rather cryptic, too, but it's really powerful.

Save this to a file named _ae911.awk (note the leading '_'). The file must have this name or things won't work. Ensure you get all 160 lines of the file.

Instructions for running all this are in my next post.

Code:
# Sub-process #3 for "get_ae911.bat": receives a JSON string formatted by
# "_ae911.sed" and formats it into a bar-separated list. Output is to stdout;
# the caller is expected to redirect it to a file.

# JSON text as formatted by "_ae911.sed" (some lines are broken for readability):
# "AlanHaymondGreenwichNYUS": {
# "category": "ARCH",
# "city": "Greenwich",
# "country": "US",
# "degree": "B Arch, Rensselaer Polytechnic",
# "discipline": "Architect",
# "first_name": "Alan",
# "hash": "NjAxMzg3NTk=",
# "last_name": "Haymond",
# "license_info": "025143",
# "occupation_status": "Degreed + Licensed",
# "state": "NY",
# "statement_911": "Suspicious on 9/11 about the collapses and the size of the
#   original hole in the Pentagon. Thoroughly convinced of cover up by April
#   '02 - too many unanswered questions. Recommend David Ray Griffin's books.",
# "supporter_title": "Architect",
# "tech_biography": "From northern Virginia, graduate of RPI in Troy, NY,
#   homeowner and father of 2 in upstate NY, carpenter and home renovator, 15
#   years in architectural firms (12 years licensed), primarily design large
#   medical office buildings."

# Desired output format (lines are broken for readability):
# id|category|city|country|degree|discipline|
#   first_name|hash|last_name|license_info|occupation_status|state|
#   statement_911|
#   supporter_title|
#   tech_biography
# AlanHaymondGreenwichNYUS|ARCH|Greenwich|US|B Arch, Rensselaer Polytechnic|
#   Architect|Alan|NjAxMzg3NTk=|Haymond|025143|Degreed + Licensed|NY|
#   Suspicious on 9/11 about the collapses and the size of the original hole in 
#       the Pentagon. Thoroughly convinced of cover up by April '02 - too many
#       unanswered questions. Recommend David Ray Griffin's books.|
#   Architect|
#   From northern Virginia, graduate of RPI in Troy, NY, homeowner and father
#       of 2 in upstate NY, carpenter and home renovator, 15 years in architectural
#       firms (12 years licensed), primarily design large medical office buildings.

# The Base64 decoder was written by Shane Kerr; https://github.com/shane-kerr/AWK-base64decode

BEGIN {
    signer["first_name"] = ""

    # Fields known to be in the JSON stream
    field_list = "id|category|city|country|degree|discipline|first_name|hash|"
    field_list = field_list "last_name|license_info|occupation_status|state|"
    field_list = field_list "statement_911|supporter_title|tech_biography"

    # Print the field_list as a heading if "list" (passed as a variable on the
    # command line) is "AE" or "OTHER-VIP"
    if (list == "AE" || list == "OTHER-VIP") { print field_list }

    # Break up the list into individual fields
    split(field_list, fields, /\|/)

    # base64 lookup table
    # load symbols based on the alphabet
    for (i=0; i<26; i++) {
        BASE64[sprintf("%c", i+65)] = i
        BASE64[sprintf("%c", i+97)] = i+26
    }
    # load digits 0-9
    for (i=0; i<10; i++) { BASE64[sprintf("%c", i+48)] = i+52 }
    # and finally our two additional characters and our padding character
    BASE64["+"] = 62; BASE64["/"] = 63; BASE64["="] = -1
}

# The main function to decode Base64 data.
#
# Arguments:
# * encoded - the Base64 string
# * result - an array to return the binary data in
#
# We exit on error. For other use cases this should be changed to
# returning an error code somehow.
function base64decode(encoded, result) {
    n = 1
    while (length(encoded) >= 4) {
        g0 = BASE64[substr(encoded, 1, 1)]
        g1 = BASE64[substr(encoded, 2, 1)]
        g2 = BASE64[substr(encoded, 3, 1)]
        g3 = BASE64[substr(encoded, 4, 1)]
        if (g0 == "") {
            printf("Unrecognized character %c in Base 64 encoded string\n",
                   g0) >> "/dev/stderr"
            exit 1
        }
        if (g1 == "") {
            printf("Unrecognized character %c in Base 64 encoded string\n",
                   g1) >> "/dev/stderr"
            exit 1
        }
        if (g2 == "") {
            printf("Unrecognized character %c in Base 64 encoded string\n",
                   g2) >> "/dev/stderr"
            exit 1
        }
        if (g3 == "") {
            printf("Unrecognized character %c in Base 64 encoded string\n",
                   g3) >> "/dev/stderr"
            exit 1
        }

        # we don't have bit shifting in AWK, but we can achieve the same
        # results with multiplication, division, and modulo arithmetic
        result[n++] = (g0 * 4) + int(g1 / 16)
        if (g2 != -1) {
            result[n++] = ((g1 * 16) % 256) + int(g2 / 4)
            if (g3 != -1) { result[n++] = ((g2 * 64) % 256) + g3 }
        }
        encoded = substr(encoded, 5)
    }
    if (length(encoded) != 0) {
        printf("Extra characters at end of Base 64 encoded string: \"%s\"\n",
               encoded) >> "/dev/stderr"
        exit 1
    }
}

# Main code: runs once for every line of input
{ 
    # A line in format |"text": {| gets "text" added to the signer array as an ID
    if (match($0, /^"([^"]+)": {$/, a)) { signer["id"] = a[1] }

    # A line in format |"field_name": "text"| gets added to the signer array
    if (match($0, /^"([[:alnum:]_]+)": "(.*)",?$/, a)) {
        # Base64 decode the "hash" field
        if (a[1] == "hash") {
            base64decode(a[2], x)
            a[2] = ""
            for (i=1; i<=length(x); i++) { a[2] = a[2] sprintf("%c", x[i]) }
        }
        # Trim the left and right spaces from the value in a[2]
        signer[a[1]] = gensub(/[[:space:]]*$/, "", 1, gensub(/^[[:space:]]*/, "", 1, a[2]))
    }
}

# When we find a closing "}", output a line
/^}/{
    if (signer["first_name"] != "") {
        line = ""
        # Go through each field name in "fields"
        for (i=1; i<=length(fields); i++) {
            field_name = fields[i]
            # Add field value to "line" if the signer array has a field with this name
            if (field_name in signer) { line = line signer[field_name] }
            # Add the | to the line
            line = line "|"
        }
        # Print the line, but not the final "|"
        print substr(line, 1, length(line)-1)
    }

    # Set up for the next signatory
    delete signer
}
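
Until then, as a rough preview of what running it should look like (a sketch only; the folder path is just the example from the setup steps above, and it assumes gow put its utilities on your PATH):
Code:
REM Minimal sketch of a run - adjust the path to wherever you saved the four files
cd "%USERPROFILE%\Documents\AE911 Signatures"
get_ae911.bat
REM Afterwards the folder should contain ae911_ae.txt and ae911_public.txt,
REM ready to import into a spreadsheet with "|" as the field separator.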
 
