Help needed: Program or configure a NEW web crawler

Blue Mountain · Apr 5, 2018

How to run this process

Using Windows Explorer, open the parent folder of the folder where you put the four files. If you followed my suggestion in post #19, that will be your Documents folder.
Right-click on the folder where you saved the four files and select Command Prompt Here (that’s a new menu item installed by gow)
In the command window, type get_ae911.bat

If everything works as expected, you should get the following output:

Code:

Processing AE list and writing to ae911_ae.txt
Processing OTHER-VIP list and writing to ae911_public.txt
Processing OTHER-A list and writing to ae911_public.txt
Processing OTHER-B list and writing to ae911_public.txt
Processing OTHER-C list and writing to ae911_public.txt
Processing OTHER-D list and writing to ae911_public.txt
Processing OTHER-E list and writing to ae911_public.txt
Processing OTHER-F list and writing to ae911_public.txt
Processing OTHER-G list and writing to ae911_public.txt
Processing OTHER-H list and writing to ae911_public.txt
Processing OTHER-I list and writing to ae911_public.txt
Processing OTHER-J list and writing to ae911_public.txt
Processing OTHER-K list and writing to ae911_public.txt
Processing OTHER-L list and writing to ae911_public.txt
Processing OTHER-M list and writing to ae911_public.txt
Processing OTHER-N list and writing to ae911_public.txt
Processing OTHER-O list and writing to ae911_public.txt
Processing OTHER-P list and writing to ae911_public.txt
Processing OTHER-Q list and writing to ae911_public.txt
Processing OTHER-R list and writing to ae911_public.txt
Processing OTHER-S list and writing to ae911_public.txt
Processing OTHER-T list and writing to ae911_public.txt
Processing OTHER-U list and writing to ae911_public.txt
Processing OTHER-V list and writing to ae911_public.txt
Processing OTHER-W list and writing to ae911_public.txt
Processing OTHER-X list and writing to ae911_public.txt
Processing OTHER-Y list and writing to ae911_public.txt
Processing OTHER-Z list and writing to ae911_public.txt
Processing OTHER-_ list and writing to ae911_public.txt

And finally, if everything went as expected, you should have two new files in the folder:

ae911_ae.txt
ae911_public.txt

Good luck!

By the way this is bread-and-butter work for an experienced Unix/Linux system administrator. I pretty much live and breathe this stuff, so overall this was a simple project.

Oystein · Apr 5, 2018

I work late shift today, currently on my way to work. Will try this all when I return home in about 10 hours?

Worm · Apr 5, 2018

Oystein · Apr 5, 2018

This is super awesome!! It works, and does what I expected!

I first struggled a bit with creating the correct file names: Using Windows Explorer, the way to create a Editor editable text file is, of course, to right-click in the new directory and click "New" - "Text File". Those, however, are created with a hidden extension ".txt", such that when I type in "get_ae911.bat" as file name, I see "get_ae911.bat", but it's actually "get_ae911.bat.txt". Consequently, when I first tried to run the command line command "get_ae911.bat", I got a message that the command was not found or misspelled.
Luckily, I remembered the most basic DOS commands: dir and copy. So fixed that.

I love the output! It even gets the Unicode-8 thing right - previously, I struggled with getting the right character coding all the way to my spreadsheet, such that the foreign characters would come up as generic boxes or something, not n-tilde, e-accent-grave or u-umlaut etc.

This process also works a LOT faster than what I did in the previous two incarnations of the signatures list!

I still have some post-processing to do; for example, in the "country" field, a country like The Netherlands comes as

NL
The Netherlands
Netherlands
Holland

That's of course a problem within the AE911 data, they are not consistent. I change everything to ISO country codes ("NL" being correct for The Netherlands, e.g.) manually, which is a good thing, as manually working through the files keeps me familiar with the content, and sometimes alerts me to interesting developments.

However, I have a couple of ideas how your output could be optimized (mainly to minimize my post-processing):

Your output does not contain the "photo" field. This is lowest priority - I haven't even checked yet whether the image files named there are still available at all
The order of fields is a bit awkward - for example, "city" and "country" are next to each other, but "state" is several columns to the right. Should naturally be city-state-country. Or: "first_name" and "last_name" are separated by "hash". Is it easily possible to rearrange the order of fields? Then I would suggest the following: category - hash - id - first_name - last_name - supporter_title - degree - city - state - country - license_info - tech_biography - statement_911 - occupation_status - discipline
The Public list got 21986 records - but ae911truth.org currently counts 22130 "Members of the Public". Don't know if the problem is on their side or yours. It's past midnight here, so I won't analyze this any further tonight.

I may look into your code more thoroughly over the weekend and see whether I can figure how to make the changes I suggest.

Again, thanks a thousand times!!!

Blue Mountain · Apr 6, 2018

Oystein said:
This is super awesome!! It works, and does what I expected!

Great to hear that it works!

I first struggled a bit with creating the correct file names: Using Windows Explorer, the way to create a Editor editable text file is, of course, to right-click in the new directory and click "New" - "Text File". Those, however, are created with a hidden extension ".txt", such that when I type in "get_ae911.bat" as file name, I see "get_ae911.bat", but it's actually "get_ae911.bat.txt". Consequently, when I first tried to run the command line command "get_ae911.bat", I got a message that the command was not found or misspelled.
Luckily, I remembered the most basic DOS commands: dir and copy. So fixed that.

That's one of the many things I dislike about Windows. In fact, my first order of business on any new Windows installation I do is turn that feature off.

I love the output! It even gets the Unicode-8 thing right - previously, I struggled with getting the right character coding all the way to my spreadsheet, such that the foreign characters would come up as generic boxes or something, not n-tilde, e-accent-grave or u-umlaut etc.

This process also works a LOT faster than what I did in the previous two incarnations of the signatures list!

Chalk up a couple of points for Unix

. One of the things I've discovered over the years is operating systems derived from Unix are much more capable of handling large datasets and file lists than Windows is. Part of it is the Unix idea of a "stream", where lines of text are acted upon and as soon as possible sent to the next command in a pipeline. So if your pipeline has three different programs in it (like this one: wget | sed | awk), the third program is handling the first line of output at the same time the second program is handling the second line of output, while the first program is generating the third line of output. I rather suspect PowerShell might approach this, but I don't know PowerShell at all, which is why I used the Unix utilities.

I still have some post-processing to do; for example, in the "country" field, a country like The Netherlands comes as

NL

The Netherlands

Netherlands

Holland

That's of course a problem within the AE911 data, they are not consistent. I change everything to ISO country codes ("NL" being correct for The Netherlands, e.g.) manually, which is a good thing, as manually working through the files keeps me familiar with the content, and sometimes alerts me to interesting developments.

In my next post I've given you a new copy of _ae911.awk that addresses this. Look for the following section in the file:

Code:

# Country name normalisation
country_ISO["The Netherlands"] = "NL";
country_ISO["Netherlands"] = "NL";
country_ISO["Holland"] = "NL";

Simply add new entries as required. Capitalisation doesn't matter: "Holland", "HOLLAND", "holland", and even "HoLLaND" in the input will all get treated as "Holland" and changed to "NL".

However, I have a couple of ideas how your output could be optimized (mainly to minimize my post-processing):

Your output does not contain the "photo" field. This is lowest priority - I haven't even checked yet whether the image files named there are still available at all

The order of fields is a bit awkward - for example, "city" and "country" are next to each other, but "state" is several columns to the right. Should naturally be city-state-country. Or: "first_name" and "last_name" are separated by "hash". Is it easily possible to rearrange the order of fields? Then I would suggest the following: category - hash - id - first_name - last_name - supporter_title - degree - city - state - country - license_info - tech_biography - statement_911 - occupation_status - discipline

The Public list got 21986 records - but ae911truth.org currently counts 22130 "Members of the Public". Don't know if the problem is on their side or yours. It's past midnight here, so I won't analyze this any further tonight.

I may look into your code more thoroughly over the weekend and see whether I can figure how to make the changes I suggest.

The "photo" field is now called "photo_file."

In the new _ae911.awk file in my next post there are the following lines:

Code:

# Fields to include in the output; they are written in the order presented here
field_list = ""
field_list = field_list "category|"
field_list = field_list "hash|"
field_list = field_list "id|"
field_list = field_list "first_name|"
field_list = field_list "last_name|"
field_list = field_list "supporter_title|"
field_list = field_list "degree|"
field_list = field_list "city|"
field_list = field_list "state|"
field_list = field_list "country|"
field_list = field_list "license_info|"
field_list = field_list "tech_biography|"
field_list = field_list "statement_911|"
field_list = field_list "occupation_status|"
field_list = field_list "discipline|"
field_list = field_list "photo_file|"

I've ordered them the way you suggested, but if you want to change them just change the order in which the lines appear.

I'm puzzled by the discrepancy in the count of "public" records. When I ran this on my Windows 10 system--and, as far as I know, all Windows systems should work the same, given that we're actually using Unix tools--I got 21,987 lines in ae911_public.txt, which includes the header line. So that matches the count on the web site.

At first I thought perhaps one of the files wasn't being processed, perhaps the "VIP" or "_" file. But the difference in numbers is 144, and none of the input files has exactly that number of records. The closest is "Z" at 141. If the difference was a multiple of 28 (the number of files we're processing) I'd suspect the code is managing to skip some records, but 144/28 is 5.143. I rather doubt the processing is skipping 5.143 records in every file.

Here's a check that you can do. At the command line, type the following:

Code:

wc -l ae911_public.txt

(the "-l" part is "-L" but in lower case.) That will tell you how many lines are in the ae911_public.txt. Accounting for the heading line, if that count is different from the number of rows you get when you import it into a spreadsheet, then the fault is with the spreadsheet program and not the process I've written.

Again, thanks a thousand times!!!

You're welcome! Like I said, this has been a very interesting little project.

Blue Mountain · Apr 6, 2018

Here's the new _ae911.awk file.

Code:

# Sub-process #3 for "get_ae911.bat" to receive a JSON string formatted by
# "_ae911.sed" and format it into a bar separated list. Output is to stdout;
# caller is expected to redirect it a file.

# JSON text as formatted by "_ae911.sed" (some lines are broken for readability):
# "AlanHaymondGreenwichNYUS": {
# "category": "ARCH",
# "city": "Greenwich",
# "country": "US",
# "degree": "B Arch, Rensselaer Polytechnic",
# "discipline": "Architect",
# "first_name": "Alan",
# "hash": "NjAxMzg3NTk=",
# "last_name": "Haymond",
# "license_info": "025143",
# "occupation_status": "Degreed + Licensed",
# "state": "NY",
# "statement_911": "Suspicious on 9/11 about the collapses and the size of the
#   original hole in the Pentagon. Thoroughly convinced of cover up by April
#   '02 - too many unanswered questions. Recommend David Ray Griffin's books.",
# "supporter_title": "Architect",
# "tech_biography": "From northern Virginia, graduate of RPI in Troy, NY,
#   homeowner and father of 2 in upstate NY, carpenter and home renovator, 15
#   years in architectural firms (12 years licensed), primarily design large
#   medical office buildings."

# Desired output format (lines are broken for readability):
# id|category|city|country|degree|discipline|
#   first_name|hash|last_name|license_info|occupation_status|state|
#   statement_911|
#   supporter_title|
#   tech_biography
# AlanHaymondGreenwichNYUS|ARCH|Greenwich|US|B Arch, Rensselaer Polytechnic|
#   Architect|Alan|NjAxMzg3NTk=|Haymond|025143|Degreed + Licensed|NY|
#   Suspicious on 9/11 about the collapses and the size of the original hole in 
#       the Pentagon. Thoroughly convinced of cover up by April '02 - too many
#       unanswered questions. Recommend David Ray Griffin's books.|
#   Architect|
#   From northern Virginia, graduate of RPI in Troy, NY, homeowner and father
#       of 2 in upstate NY, carpenter and home renovator, 15 years in architectural
#       firms (12 years licensed), primarily design large medical office buildings.

# The Base64 decoder was written by Shane Kerr; https://github.com/shane-kerr/AWK-base64decode

BEGIN {
    signer["first_name"] = ""

    # Fields to include in the output; they are written in the order presented here
    field_list = ""
    field_list = field_list "category|"
    field_list = field_list "hash|"
    field_list = field_list "id|"
    field_list = field_list "first_name|"
    field_list = field_list "last_name|"
    field_list = field_list "supporter_title|"
    field_list = field_list "degree|"
    field_list = field_list "city|"
    field_list = field_list "state|"
    field_list = field_list "country|"
    field_list = field_list "license_info|"
    field_list = field_list "tech_biography|"
    field_list = field_list "statement_911|"
    field_list = field_list "occupation_status|"
    field_list = field_list "discipline|"
    field_list = field_list "photo_file|"

    # Remove a trailing "|" from field_list if there is one
    sub(/\|$/, "", field_list)

    # Print the field_list as a heading if "list" (passed as a variable on the
    # command line) is "AE" or "OTHER-VIP"
    if (list == "AE" || list == "OTHER-VIP") { print field_list }

    # Break up the list into individual fields
    split(field_list, fields, /\|/)

    # Country name normalisation
    country_ISO["The Netherlands"] = "NL";
    country_ISO["Netherlands"] = "NL";
    country_ISO["Holland"] = "NL";

    # base64 lookup table
    # load symbols based on the alphabet
    for (i=0; i<26; i++) {
        BASE64[sprintf("%c", i+65)] = i
        BASE64[sprintf("%c", i+97)] = i+26
    }
    # load digits 0-9
    for (i=0; i<10; i++) { BASE64[sprintf("%c", i+48)] = i+52 }
    # and finally our two additional characters and our padding character
    BASE64["+"] = 62; BASE64["/"] = 63; BASE64["="] = -1

    # Lower-case all the keys in country_ISO
    for (c in country_ISO) {
        country_ISO[tolower(c)] = country_ISO[c]
        delete country_ISO[c]
    }
}

# The main function to decode Base64 data.
#
# Arguments:
# * encoded - the Base64 string
# * result - an array to return the binary data in
#
# We exit on error. For other use cases this should be changed to
# returning an error code somehow.
function base64decode(encoded, result) {
    n = 1
    while (length(encoded) >= 4) {
        for (i=1; i<=4; i++) {
            c = substr(encoded, i, 1)
            if (c in BASE64) {
                g[i-1] = BASE64[c]
            } else {
                printf("Unrecognized character \"%c\" in Base 64 encoded string\n",
                    c) >>"/dev/stderr"
                exit 1
            }
        }
        # we don't have bit shifting in AWK, but we can achieve the same
        # results with multiplication, division, and modulo arithmetic
        result[n++] = (g[0] * 4) + int(g[1] / 16)
        if (g[2] != -1) {
            result[n++] = ((g[1] * 16) % 256) + int(g[2] / 4)
            if (g[3] != -1) { result[n++] = ((g[2] * 64) % 256) + g[3] }
        }
        encoded = substr(encoded, 5)
    }
    if (length(encoded) != 0) {
        printf("Extra characters at end of Base 64 encoded string: \"%s\"\n",
               encoded) >> "/dev/stderr"
        exit 1
    }
}

# Main code: runs once for every line of input
{ 
    # A line in format |"text": {| gets "text" added to the signer array as an ID
    if (match($0, /^"([^"]+)": {$/, a)) { signer["id"] = a[1] }

    # A line in format |"field_name": "text"| gets added to the signer array
    if (match($0, /^"([[:alnum:]_]+)": "(.*)",?$/, a)) {
        # Normalise the country name field
        if (a[1] == "country") {
            a[2] = tolower(a[2])
            if (a[2] in country_ISO) { a[2] = country_ISO[a[2]] }
        }
        # Base64 decode the "hash" field
        if (a[1] == "hash") {
            base64decode(a[2], x)
            a[2] = ""
            for (i=1; i<=length(x); i++) { a[2] = a[2] sprintf("%c", x[i]) }
        }
        # Trim the left and right spaces from the value in a[2]
        signer[a[1]] = gensub(/[[:space:]]*$/, "", 1, gensub(/^[[:space:]]*/, "", 1, a[2]))
    }
}

# When we find a closing "}", output a line
/^}/{
    if (signer["first_name"] != "") {
        line = ""
        # Go through each field name in "fields"
        for (i=1; i<=length(fields); i++) {
            field_name = fields[i]
            # Add field value to "line" if the signer array has a field with this name
            if (field_name in signer) { line = line signer[field_name] }
            # Add the | to the line
            line = line "|"
        }
        # Print the line, but not the final "|"
        print substr(line, 0, length(line)-1)
    }

    # Set up for the next signatory
    delete signer
}

Bonus! Here's a list of most of the non-ISO country names; you can just copy and paste this into _ae911.awk. Two notes:

Note they all use XX as the ISO country code. You'll had to do the translation yourself. There are 177 entries in the list--have fun!
The list includes only those entries where the country name was longer than two characters. It's entirely possible some people entered a bad two character country code when signing the petition. If you find any of those, simply add them to the country_ISO list. For example, if someone in Germany entered DW instead of DE, add the following line to the program:
Code:
```
country_ISO["DW"] = "DE"
```

Code:

    country_ISO["afghanistan"] = "XX"
    country_ISO["algeria"] = "XX"
    country_ISO["aotearoa/new zealand"] = "XX"
    country_ISO["argentina"] = "XX"
    country_ISO["auatralia"] = "XX"
    country_ISO["aus"] = "XX"
    country_ISO["australia"] = "XX"
    country_ISO["austria"] = "XX"
    country_ISO["bahamas"] = "XX"
    country_ISO["bahrain"] = "XX"
    country_ISO["barbados"] = "XX"
    country_ISO["belgium"] = "XX"
    country_ISO["belize"] = "XX"
    country_ISO["bolivia"] = "XX"
    country_ISO["bosnia/us"] = "XX"
    country_ISO["brasil"] = "XX"
    country_ISO["brazil"] = "XX"
    country_ISO["brescia, italy"] = "XX"
    country_ISO["bul"] = "XX"
    country_ISO["bulgaria"] = "XX"
    country_ISO["cayman islands"] = "XX"
    country_ISO["cda"] = "XX"
    country_ISO["cdn"] = "XX"
    country_ISO["chile"] = "XX"
    country_ISO["china"] = "XX"
    country_ISO["colombia"] = "XX"
    country_ISO["costa rica"] = "XX"
    country_ISO["country.ca"] = "XX"
    country_ISO["croatia"] = "XX"
    country_ISO["cymru, britain"] = "XX"
    country_ISO["cyprus"] = "XX"
    country_ISO["czechia"] = "XX"
    country_ISO["czech rep."] = "XX"
    country_ISO["czech republic"] = "XX"
    country_ISO["czech republik"] = "XX"
    country_ISO["denmark"] = "XX"
    country_ISO["deutschland"] = "XX"
    country_ISO["ecuador"] = "XX"
    country_ISO["egypt"] = "XX"
    country_ISO["england"] = "XX"
    country_ISO["england - uk."] = "XX"
    country_ISO["england, uk"] = "XX"
    country_ISO["espaã±a"] = "XX"
    country_ISO["españa"] = "XX"
    country_ISO["estonia"] = "XX"
    country_ISO["europe"] = "XX"
    country_ISO["faroe islands"] = "XX"
    country_ISO["finland"] = "XX"
    country_ISO["føroyar"] = "XX"
    country_ISO["fra"] = "XX"
    country_ISO["france"] = "XX"
    country_ISO["france/vietnam"] = "XX"
    country_ISO["french"] = "XX"
    country_ISO["ger"] = "XX"
    country_ISO["germany"] = "XX"
    country_ISO["germany, berlin"] = "XX"
    country_ISO["great britain"] = "XX"
    country_ISO["great britian"] = "XX"
    country_ISO["greece"] = "XX"
    country_ISO["greenland"] = "XX"
    country_ISO["hawaiian kingdom"] = "XX"
    country_ISO["hong kong"] = "XX"
    country_ISO["hong kong sar china"] = "XX"
    country_ISO["hungary"] = "XX"
    country_ISO["iceland"] = "XX"
    country_ISO["india"] = "XX"
    country_ISO["indonesia"] = "XX"
    country_ISO["irak"] = "XX"
    country_ISO["iraq"] = "XX"
    country_ISO["ireland"] = "XX"
    country_ISO["israel"] = "XX"
    country_ISO["italia"] = "XX"
    country_ISO["italy"] = "XX"
    country_ISO["itay"] = "XX"
    country_ISO["jamaica wi"] = "XX"
    country_ISO["japan"] = "XX"
    country_ISO["jordan"] = "XX"
    country_ISO["kenya"] = "XX"
    country_ISO["kingdom of morocco"] = "XX"
    country_ISO["korea(south)"] = "XX"
    country_ISO["ksa"] = "XX"
    country_ISO["kuwait"] = "XX"
    country_ISO["lebanon"] = "XX"
    country_ISO["liechtenstein"] = "XX"
    country_ISO["lisbon"] = "XX"
    country_ISO["lithuania"] = "XX"
    country_ISO["luxembourg"] = "XX"
    country_ISO["malaysia"] = "XX"
    country_ISO["malta"] = "XX"
    country_ISO["martinique"] = "XX"
    country_ISO["mauritius"] = "XX"
    country_ISO["mex"] = "XX"
    country_ISO["mexico"] = "XX"
    country_ISO["méxico"] = "XX"
    country_ISO["montserrat"] = "XX"
    country_ISO["morocco"] = "XX"
    country_ISO["mozambique"] = "XX"
    country_ISO["nederland"] = "XX"
    country_ISO["nepal"] = "XX"
    country_ISO["netherland"] = "XX"
    country_ISO["netherlandas"] = "XX"
    country_ISO["netherlands antilles"] = "XX"
    country_ISO["netherlands, the"] = "XX"
    country_ISO["new zealand."] = "XX"
    country_ISO["new zealand"] = "XX"
    country_ISO["newzealand"] = "XX"
    country_ISO["nigeria"] = "XX"
    country_ISO["nld"] = "XX"
    country_ISO["northern ireland"] = "XX"
    country_ISO["norway"] = "XX"
    country_ISO["nowhere"] = "XX"
    country_ISO["nzd"] = "XX"
    country_ISO["oman"] = "XX"
    country_ISO["pakistan"] = "XX"
    country_ISO["palestine"] = "XX"
    country_ISO["panama"] = "XX"
    country_ISO["peru"] = "XX"
    country_ISO["perú"] = "XX"
    country_ISO["philippines"] = "XX"
    country_ISO["poland"] = "XX"
    country_ISO["portugal"] = "XX"
    country_ISO["p.r.c."] = "XX"
    country_ISO["principality of andorra"] = "XX"
    country_ISO["qatar"] = "XX"
    country_ISO["rep of ireland"] = "XX"
    country_ISO["rep' of ireland"] = "XX"
    country_ISO["republic of ireland"] = "XX"
    country_ISO["republic of kazakhstan"] = "XX"
    country_ISO["republic of korea"] = "XX"
    country_ISO["romania"] = "XX"
    country_ISO["rsa"] = "XX"
    country_ISO["russia"] = "XX"
    country_ISO["russian federation"] = "XX"
    country_ISO["s. africa"] = "XX"
    country_ISO["saudi arabia"] = "XX"
    country_ISO["scotland"] = "XX"
    country_ISO["scotland / uk"] = "XX"
    country_ISO["scotland (uk)"] = "XX"
    country_ISO["scotland, uk"] = "XX"
    country_ISO["serbia"] = "XX"
    country_ISO["singapore"] = "XX"
    country_ISO["slovakia"] = "XX"
    country_ISO["slovakia - eu"] = "XX"
    country_ISO["sloveinja"] = "XX"
    country_ISO["slovenia"] = "XX"
    country_ISO["south africa"] = "XX"
    country_ISO["south korea"] = "XX"
    country_ISO["spain"] = "XX"
    country_ISO["sri lanka"] = "XX"
    country_ISO["swden"] = "XX"
    country_ISO["sweden"] = "XX"
    country_ISO["switzerland"] = "XX"
    country_ISO["swizerland"] = "XX"
    country_ISO["syria"] = "XX"
    country_ISO["taiwan"] = "XX"
    country_ISO["thailand"] = "XX"
    country_ISO["the czech republic"] = "XX"
    country_ISO["tunisia"] = "XX"
    country_ISO["turkey"] = "XX"
    country_ISO["turkiye"] = "XX"
    country_ISO["uae"] = "XX"
    country_ISO["u k"] = "XX"
    country_ISO["u.k."] = "XX"
    country_ISO["u.k"] = "XX"
    country_ISO["ukraine"] = "XX"
    country_ISO["uk/spain"] = "XX"
    country_ISO["united arab emirates"] = "XX"
    country_ISO["united kingdom"] = "XX"
    country_ISO["united kingdom (england)"] = "XX"
    country_ISO["united kingdom of great britai"] = "XX"
    country_ISO["uruguay"] = "XX"
    country_ISO["usa (expatriate)"] = "XX"
    country_ISO["venezuela"] = "XX"
    country_ISO["vietnam"] = "XX"
    country_ISO["wales"] = "XX"
    country_ISO["wales (uk)"] = "XX"
    country_ISO["wales, uk"] = "XX"

xterra · Apr 7, 2018

Blue mountain,

where is this command

wc -l

coming from? If I type it plus a file name, Windows returns "not recognized."

Blue Mountain · Apr 7, 2018

Childlike Empress said:
His own words. What else to you think someone would "need" this information for?

Good job, though.

Thanks for the info. As for needing the information, it could be merely an academic exercise to chart the growth of the petition and attempt to see how the "statement_911" field changes over time. For example, is there a trend toward paranoid conspiracy thinking, or are responders now leaning toward the idea that the original investigation was somehow flawed (despite the report of the 9/11 Commission.)

Oystein, what's the nature of your "stalk and harass" activities? Are you merely engaging some people by email, or are you doing other things such as doxxing and informing employers?

Blue Mountain · Apr 7, 2018

xterra said:
Blue mountain,

where is this command

coming from? If I type it plus a file name, Windows returns "not recognized."

It's from gow (GNU on Windows) that I mentioned at the start of post #19.

xterra · Apr 7, 2018

Thanks. I seem to have missed that piece of information.

I just looked at his documentation.

Oystein · Apr 7, 2018

Blue Mountain said:
...
All the information is right there on the site. For people who want to take the time and effort, it's possible to get all its contents by clicking, selecting, copying, and pasting. It just would be incredibly tedious.

Less tedious, to be sure, than following up by finding out phone number, street addresses, email, employers...

For a while, several years ago, I did the tedious work of searching each of the (licensed) Profesional Engineneer's license in the public online lookup resources provided by the 50 states. Today, that's 555 professional engineers (under 20% of the total 2972), back then I think 300 - 330. Over the course of several months I got a list of these nearly complete. The results were relatively simple:

The proportion of civil engineers on the list i roughly equal the proportion of civil engineers among all P.E.s (i.e. a bit more than half of them)
Almost all (something like 95%) of the signer listed as P.E. have indeed at some point in their lives been P.E.s, i.e. very few cases of inflated or wrong credentials.
Not few are retired though, or licenses have expired, have been revoked, etc. - which never concerned me, as the point is that they have, in the course of their academic and professional careers, achieved enough expertise to pass the licensing exam in their respective fields of engineering.
AE911T has recruited well under 0.1% of the P.E. population - they are a tiny, tiny fringe of loons

That was a lot of work back then, and I have no intention of doing another comprehenive search for additional data like this.

One possibility is that I might, in the future, contact the occasional new signer, or the occasional drop-out, to ask them politely if they are aware that they are listed by AE911, whether they agreed to this, and whether the information was actually provided by them (particularly, whether the personal statement is theirs, or has been ghost-written by AE). And to ask the deleted signatories why their names were removed.
I have never done this, but suspect that, by and large, the entries are legit; and that the drop-outs surely have not all come to realize the folly of AE911T's message. I suspect some have developed into even madder Truthers, to whom the message of AE911T isn't crazy enough. You know, schisms within one religion can be more vicious than the differences between different religions.

Oystein · Apr 7, 2018

Back to the process. I have three problems right now:

1. It doesn't finish filling the ae911_public.txt file. It processes OTHER-A, OTHER-B, OTHER-C, ... but then stops sooner or later with the message (again, I translate from German):

The process cannot access this file, because it is being used by another process

I guess the problem is not with your code, but with my computer/Windows - each letter's output gets written into the same file, which is opened and closed for each letter - and apparently doesn't get closed fast enough by Windows to become available in time for the next letter file-access.

A couple of times now, it stopped after the letter "X", which has the least output of all letters, so I guess that's a problem. I edited get_ae911.bat to put the letter X in last place, behind the underscore, and then the process completed!

2. The "Country name normalisation" just doesn't work. I have tried a couple of changes, no result.
I noticed that in the "new version" of the .awk file, you have these lines:

Code:

    # Country name normalisation
    country_ISO["The Netherlands"] = "NL";
    country_ISO["Netherlands"] = "NL";
    country_ISO["Holland"] = "NL";

Note: Each line ends with a semicolon.
But then your long list of country name instances doesn't have the semicolons:

Code:

    country_ISO["afghanistan"] = "XX"
    country_ISO["algeria"] = "XX"
    country_ISO["aotearoa/new zealand"] = "XX"

etc.
I added the semicolons. Didn't help.

3. (Extremely minor - I can fix this with a simple "search and replace" in any editor or spreadsheet program): Sometimes there is a double-quote (") - it's output not simply as a double-quote but with a leading backslash, like this: \"

Blue Mountain · Apr 7, 2018

Oystein said:
Back to the process. I have three problems right now:

1. It doesn't finish filling the ae911_public.txt file. It processes OTHER-A, OTHER-B, OTHER-C, ... but then stops sooner or later with the message (again, I translate from German):

The process cannot access this file, because it is being used by another process
I guess the problem is not with your code, but with my computer/Windows - each letter's output gets written into the same file, which is opened and closed for each letter - and apparently doesn't get closed fast enough by Windows to become available in time for the next letter file-access.

A couple of times now, it stopped after the letter "X", which has the least output of all letters, so I guess that's a problem. I edited get_ae911.bat to put the letter X in last place, behind the underscore, and then the process completed!

That was a good bit of troubleshoooting. I did the testing on a version of WIndows 10 running as a virtual machine under a Linux host, which probably made things run slow enough that Windows was able to close the file between cycles. If it gets to be too much of a hassle, in _ae911.bat try adding a "sync" statement (part of gow) after "wget:"

Code:

REM  Sub-process #1 for "get_ae911.bat" to retrieve a JSON file containing
REM  data from ae911truth.org and format it into a .txt file

REM  Parameters:
REM   %1: JSON file to process, without the trailing '.json'
REM   %2: Name of file to which output is appended

echo Processing %1 list and writing to %2
set URL=https://siteupgrade2.cloudaccess.host/signatures
wget --quiet --no-check-certificate -O - %URL%/%1.json 2>wget-stderr.txt | sed -f _ae911.sed | awk -f _ae911.awk -v list=%1 >>%2
del wget-stderr.txt
sync

In Unix, "sync" forces writes to the disk. I don't know if it actually works on Windows, but it may slow things down just enough to get Windows to close the file. If that doesn't work, try replacing the "sync" with something that takes a bit of time to do, such as:

Code:

powershell Start-Sleep -Milliseconds 10

The idea here is to have the batch file do something in order to give Windows time to close the file. Increase the value after -Milliseconds to 100 if things still don't work. Increase by 100 until they do.

2. The "Country name normalisation" just doesn't work. I have tried a couple of changes, no result.

<snip>

I added the semicolons. Didn't help.

I added the semicolons by accident to the AWK program. They're actually are a Perl thing. I do a lot of programming in Perl, where they're required to end a statement. AWK doesn't need them if the statement ends with a newline.

The real culprit is technical in nature, but can be fixed by making the following change (replace the 6 lines in your current code with these ones):

Code:

# Lower-case all the keys in country_ISO
for (c in country_ISO) {
    country_ISO[tolower(c)] = country_ISO[c]
    if (tolower(c) != c) { delete country_ISO[c] }
}

3. (Extremely minor - I can fix this with a simple "search and replace" in any editor or spreadsheet program): Sometimes there is a double-quote (") - it's output not simply as a double-quote but with a leading backslash, like this: \"

For now I recommend you simply do a search and replace.

Oystein · Apr 8, 2018

Blue Mountain said:
... If it gets to be too much of a hassle, in _ae911.bat try adding a "sync" statement (part of gow) after "wget:"
... If that doesn't work, try replacing the "sync" with something that takes a bit of time to do, such as:

Code:

powershell Start-Sleep -Milliseconds 10

The idea here is to have the batch file do something in order to give Windows time to close the file. Increase the value after -Milliseconds to 100 if things still don't work. Increase by 100 until they do.

sync didn't improve the situation, neither did sleep of up to half a second.
From time to time, often enough, the process finishes.
A bit of a hassle, but considering how much time I save with your help relative to the previous processes, that is absolutely manageable. In any case, I'd know how to break the batch up to run only a few, or just one, letter at a time.

So I am fine.

Blue Mountain said:
...
The real culprit is technical in nature, but can be fixed by making the following change (replace the 6 lines in your current code with these ones):

Code:

# Lower-case all the keys in country_ISO for (c in country_ISO) { country_ISO[tolower(c)] = country_ISO[c] if (tolower(c) != c) { delete country_ISO[c] } }

Yay, that works! Thanks

Blue Mountain said:
For now I recommend you simply do a search and replace.

No problem.

I am all set and happy and thoughts and prayers and all that!

Oystein · May 2, 2018

Blue Mountain said:
I used a "man-in-the-middle" proxy to trace all the traffic generated by the page, and noticed a call to this URL:

https:// siteupgrade2. cloudaccess. host /signatures /AE.json

That returns 2.9 MB of JSON formatted data, but with no line breaks. The good news is it has all the data you're looking for, already formatted (although not formatted to your specifications.) No more page scraping!

I'm working on a program now to transform the JSON file into the format you requested. ....

About 4 weeks later. Problem!

I ran your process on April 07. It returned 2972 records, even though, at that time, their homepage showed the number 2,975 of records.

By today, the number of the homepage has increased to 2,983, but the process still returns only 2972 records - and they are exactly the same 2972 as in early April!

So apparently, the .json file has not been changed. Has their online list changed? I would not know how to find that out. BUT!
In Chrome, I can "Save as..." the page, and get 1,163 KB of HTML; in those, I find the individual signers, starting with a tag:
<div class="supporter-tile col-xs-12 col-sm-6 col-md-4" data-filter="richard gage aia, architect b arch us" data-id="RichardGageLafayetteCAUS">
of which the hilighted is the same for every "supporter tile".
And this file has 2,982 supporter tiles!

So it appears that processing the file https:// siteupgrade2. cloudaccess. host /signatures /AE.json does NOT read the current roster of signatories!

Blue Mountain · May 3, 2018

I checked the site again using my man-in-the-middle proxy. They've changed the hosting site for the JSON files from siteupgrade2.cloudaccess.host to ae911truth.org. Change the URL line in _ae911.bat to read as follows:

Code:

set URL=https://ae911truth.org/signatures

When I ran the updated script tonight, I got 2,983 lines in ae911_ae.txt and 21,999 lines in ae911_public.txt (both counts include the heading line.)

Oystein · May 3, 2018

I feel stupid

I do exactly what you toold me - but both output files contain only the first (header) line.

Here is a copy&paste of my _ae911.bat:

Code:

REM  Sub-process #1 for "get_ae911.bat" to retrieve a JSON file containing
REM  data from ae911truth.org and format it into a .txt file

REM  Parameters:
REM   %1: JSON file to process, without the trailing '.json'
REM   %2: Name of file to which output is appended

echo Processing %1 list and writing to %2
REM current URL as of 2018-05-03
set URL=https://ae911truth.org/signatures

REM transitional URL no longer valid
REM set URL=https://siteupgrade2.cloudaccess.host/signatures

wget --quiet --no-check-certificate -O - %URL%/%1.json 2>wget-stderr.txt | sed -f _ae911.sed | awk -f _ae911.awk -v list=%1 >>%2
del wget-stderr.txt

REM Next line to wait a bit for Windows to close ae911_public.txt
powershell Start-Sleep -Milliseconds 500

When I REM the first "set URL" and un-REM the second, I get the same 2,277 KB of AE signatories from siteupgrade2.cloudaccess.host/signatures as before.

My browser opens https://www.ae911truth.org/signatures/AE.json just fine.

I've tried to add "www.", I've tried to go by "http" instead of "https".

Blue Mountain · May 3, 2018

Oystein said:
I feel stupid

I do exactly what you told me - but both output files contain only the first (header) line.

Here is a copy&paste of my _ae911.bat:
(snipped)

When I REM the first "set URL" and un-REM the second, I get the same 2,277 KB of AE signatories from siteupgrade2.cloudaccess.host/signatures as before.

My browser opens https://www.ae911truth.org/signatures/AE.json just fine.

Oh! No need to feel stupid ... it appears the web site has upgraded its TLS security layer to a version that's unsupported by the older version of wget used by GoW. I've re-written it using the powershell Invoke-WebRequest function. Here's the new _ae911.bat file.

Code:

REM  Sub-process #1 for "get_ae911.bat" to retrieve a JSON file containing
REM  data from ae911truth.org and format it into a .txt file

REM  Parameters:
REM   %1: JSON file to process, without the trailing '.json'
REM   %2: Name of file to which output is appended

echo Processing %1 list and writing to %2
set URL=https://www.ae911truth.org/signatures
powershell (Invoke-Webrequest -Uri https://ae911truth.org/signatures/%1.json -UseBasicParsing).content | sed -f _ae911.sed | awk -f _ae911.awk -v list=%1 >>%2

I've tested this version using Windows. (I tested my earlier fix using Linux and assumed the same change would work under Windows :blush:

.)

BTW, if you've added all the country codes to the country_ISO array in _ae911.sed, could you post it here, please?

Oystein · May 5, 2018

Blue Mountain said:
Oh! No need to feel stupid ... it appears the web site has upgraded its TLS security layer to a version that's unsupported by the older version of wget used by GoW. I've re-written it using the powershell Invoke-WebRequest function. Here's the new _ae911.bat file.

Code:

REM Sub-process #1 for "get_ae911.bat" to retrieve a JSON file containing REM data from ae911truth.org and format it into a .txt file REM Parameters: REM %1: JSON file to process, without the trailing '.json' REM %2: Name of file to which output is appended echo Processing %1 list and writing to %2 set URL=https://www.ae911truth.org/signatures powershell (Invoke-Webrequest -Uri https://ae911truth.org/signatures/%1.json -UseBasicParsing).content | sed -f _ae911.sed | awk -f _ae911.awk -v list=%1 >>%2

I've tested this version using Windows. (I tested my earlier fix using Linux and assumed the same change would work under Windows .)

BTW, if you've added all the country codes to the country_ISO array in _ae911.sed, could you post it here, please?

Thanks again!
In the "powershell ..." line, you want to exchange the literal "https://ae911truth.org/signatures" with the parameter %URL%
Process runs to finish (I no longer have the problem that often the process would exit because ae911_public.txt could not get closed in time)
Unfortunately, the output no longer is Unicode UTF-8. I get "hieroglyphs" in place of diacritical characters when I open the .txt output file with MS Excel or LibreOffice, no matter what code I specify (I also tried UTF-7, UTF-16, Windows ANSI, several "Western European" codes, others). The standard Windows Editor gives me "Baden WǬrttemberg" instead of "Baden Württemberg" when I open as UTF-8.
I'll check whether I have the country-code translating list complete shortly!

Oystein · May 5, 2018

Blue Mountain said:
...
BTW, if you've added all the country codes to the country_ISO array in _ae911.sed, could you post it here, please?

You mean _ae911.awk
Here it is:

Code:

    # Country name normalisation
    country_ISO["afghanistan"] = "AF"
    country_ISO["algeria"] = "DZ"
    country_ISO["aotearoa/new zealand"] = "NZ"
    country_ISO["argentina"] = "AR"
    country_ISO["auatralia"] = "AU"
    country_ISO["aus"] = "AU"
    country_ISO["australia"] = "AU"
    country_ISO["austria"] = "AT"
    country_ISO["bahamas"] = "BS"
    country_ISO["bahrain"] = "BH"
    country_ISO["barbados"] = "BB"
    country_ISO["belgium"] = "BE"
    country_ISO["belize"] = "BZ"
    country_ISO["bolivia"] = "BO"
    country_ISO["brasil"] = "BR"
    country_ISO["brazil"] = "BR"
    country_ISO["brescia, italy"] = "IT"
    country_ISO["bul"] = "BG"
    country_ISO["bulgaria"] = "BG"
    country_ISO["cayman islands"] = "KY"
    country_ISO["cda"] = "CA"
    country_ISO["cdn"] = "CA"
    country_ISO["chile"] = "CL"
    country_ISO["china"] = "CN"
    country_ISO["colombia"] = "CO"
    country_ISO["costa rica"] = "CR"
    country_ISO["country.ca"] = "US"
    country_ISO["croatia"] = "HR"
    country_ISO["cymru, britain"] = "GB"
    country_ISO["cyprus"] = "CY"
    country_ISO["czechia"] = "CZ"
    country_ISO["czech rep."] = "CZ"
    country_ISO["czech republic"] = "CZ"
    country_ISO["czech republik"] = "CZ"
    country_ISO["denmark"] = "DK"
    country_ISO["deutschland"] = "DE"
    country_ISO["ecuador"] = "EC"
    country_ISO["egypt"] = "EG"
    country_ISO["england"] = "GB"
    country_ISO["england - uk."] = "GB"
    country_ISO["england, uk"] = "GB"
    country_ISO["espaã±a"] = "ES"
    country_ISO["españa"] = "ES"
    country_ISO["estonia"] = "EE"
    country_ISO["faroe islands"] = "FO"
    country_ISO["finland"] = "FI"
    country_ISO["føroyar"] = "FO"
    country_ISO["fra"] = "FR"
    country_ISO["france"] = "FR"
    country_ISO["france/vietnam"] = "VN"
    country_ISO["french"] = "FR"
    country_ISO["ger"] = "DE"
    country_ISO["germany"] = "DE"
    country_ISO["germany, berlin"] = "DE"
    country_ISO["great britain"] = "GB"
    country_ISO["great britian"] = "GB"
    country_ISO["greece"] = "GR"
    country_ISO["greenland"] = "GL"
    country_ISO["hawaiian kingdom"] = "US"
    country_ISO["holland"] = "NL"
    country_ISO["hong kong"] = "CN"
    country_ISO["hong kong sar china"] = "CN"
    country_ISO["hungary"] = "HU"
    country_ISO["iceland"] = "IS"
    country_ISO["india"] = "IN"
    country_ISO["indonesia"] = "ID"
    country_ISO["irak"] = "IQ"
    country_ISO["iraq"] = "IQ"
    country_ISO["ireland"] = "IE"
    country_ISO["israel"] = "IL"
    country_ISO["italia"] = "IT"
    country_ISO["italy"] = "IT"
    country_ISO["itay"] = "IT"
    country_ISO["jamaica wi"] = "JM"
    country_ISO["japan"] = "JP"
    country_ISO["jordan"] = "JO"
    country_ISO["kenya"] = "KE"
    country_ISO["kingdom of morocco"] = "MA"
    country_ISO["korea(south)"] = "KR"
    country_ISO["ksa"] = "SA"
    country_ISO["kuwait"] = "KW"
    country_ISO["lebanon"] = "LB"
    country_ISO["liechtenstein"] = "LI"
    country_ISO["lisbon"] = "PT"
    country_ISO["lithuania"] = "LT"
    country_ISO["luxembourg"] = "LU"
    country_ISO["malaysia"] = "MY"
    country_ISO["malta"] = "MT"
    country_ISO["martinique"] = "MQ"
    country_ISO["mauritius"] = "MU"
    country_ISO["mex"] = "MX"
    country_ISO["mexico"] = "MX"
    country_ISO["méxico"] = "MX"
    country_ISO["montserrat"] = "MS"
    country_ISO["morocco"] = "MA"
    country_ISO["mozambique"] = "MZ"
    country_ISO["nederland"] = "NL"
    country_ISO["nepal"] = "NP"
    country_ISO["netherland"] = "NL"
    country_ISO["netherlandas"] = "NL"
    country_ISO["netherlands"] = "NL"
    country_ISO["netherlands antilles"] = "AN"
    country_ISO["netherlands, the"] = "NL"
    country_ISO["new zealand."] = "NZ"
    country_ISO["new zealand"] = "NZ"
    country_ISO["newzealand"] = "NZ"
    country_ISO["nigeria"] = "NG"
    country_ISO["nld"] = "NL"
    country_ISO["northern ireland"] = "GB"
    country_ISO["norway"] = "NO"
    country_ISO["nzd"] = "NZ"
    country_ISO["oman"] = "OM"
    country_ISO["pakistan"] = "PK"
    country_ISO["palestine"] = "PS"
    country_ISO["panama"] = "PA"
    country_ISO["peru"] = "PE"
    country_ISO["perú"] = "PE"
    country_ISO["philippines"] = "PH"
    country_ISO["poland"] = "PL"
    country_ISO["portugal"] = "PT"
    country_ISO["pr"] = "US"
    country_ISO["p.r.c."] = "CN"
    country_ISO["principality of andorra"] = "AD"
    country_ISO["qatar"] = "QA"
    country_ISO["rep of ireland"] = "IE"
    country_ISO["rep' of ireland"] = "IE"
    country_ISO["republic of ireland"] = "IE"
    country_ISO["republic of kazakhstan"] = "KZ"
    country_ISO["republic of korea"] = "KR"
    country_ISO["romania"] = "RO"
    country_ISO["rsa"] = "ZA"
    country_ISO["russia"] = "RU"
    country_ISO["russian federation"] = "RU"
    country_ISO["s. africa"] = "ZA"
    country_ISO["saudi arabia"] = "SA"
    country_ISO["scotland"] = "GB"
    country_ISO["scotland / uk"] = "GB"
    country_ISO["scotland (uk)"] = "GB"
    country_ISO["scotland, uk"] = "GB"
    country_ISO["serbia"] = "RS"
    country_ISO["singapore"] = "SG"
    country_ISO["slovakia"] = "SK"
    country_ISO["slovakia - eu"] = "SK"
    country_ISO["sloveinja"] = "SI"
    country_ISO["slovenia"] = "SI"
    country_ISO["south africa"] = "ZA"
    country_ISO["south korea"] = "KR"
    country_ISO["spain"] = "ES"
    country_ISO["sri lanka"] = "LK"
    country_ISO["swden"] = "SE"
    country_ISO["sweden"] = "SE"
    country_ISO["switzerland"] = "CH"
    country_ISO["swizerland"] = "CH"
    country_ISO["syria"] = "SY"
    country_ISO["taiwan"] = "TW"
    country_ISO["thailand"] = "TH"
    country_ISO["the czech republic"] = "CZ"
    country_ISO["the netherlands"] = "NL"
    country_ISO["tunisia"] = "TN"
    country_ISO["turkey"] = "TR"
    country_ISO["turkiye"] = "TR"
    country_ISO["uae"] = "AE"
    country_ISO["u k"] = "GB"
    country_ISO["u.k."] = "GB"
    country_ISO["u.k"] = "GB"
    country_ISO["ukraine"] = "UA"
    country_ISO["united arab emirates"] = "AE"
    country_ISO["united kingdom"] = "GB"
    country_ISO["united kingdom (england)"] = "GB"
    country_ISO["united kingdom of great britai"] = "GB"
    country_ISO["uruguay"] = "UY"
    country_ISO["usa (expatriate)"] = "US"
    country_ISO["venezuela"] = "VE"
    country_ISO["vietnam"] = "VN"
    country_ISO["wales"] = "GB"
    country_ISO["wales (uk)"] = "GB"
    country_ISO["wales, uk"] = "GB"

Currently, with the output not being UTF-8, this list doesn't catch the country specifiers that contain diacriticals, such as "føroyar", "españa" or "perú". Also, the slash in "scotland / uk" is not recognized.

A few non-ISO country designators are not contained, as they don't actually denote any country, such as "nowhere" or "Europe". Such cases still require manual checking.

I occasionally made somewhat arbitrary choices about how to define countries: Puerto Rico is considered part of "US", Hong Kong is part of "CN" (China), but "netherlands antilles" are their own entity "AN" rather than NL. Similarly, I have "martinique" as a separate entity "MQ" rather than lumping it to "FR".

Help needed: Program or configure a NEW web crawler

Resident Skeptical Hobbit

Penultimate Amazing

Illuminator

Penultimate Amazing

Resident Skeptical Hobbit

Resident Skeptical Hobbit

So far, so good...

Resident Skeptical Hobbit

Resident Skeptical Hobbit

So far, so good...

Penultimate Amazing

Penultimate Amazing

Resident Skeptical Hobbit

Penultimate Amazing

Penultimate Amazing

Resident Skeptical Hobbit

Penultimate Amazing

Resident Skeptical Hobbit

Penultimate Amazing

Penultimate Amazing