Help needed: Program or configure a NEW web crawler

I've finally found some time to update the process to save the retrieved pages in UTF-8. Here's the updated _ae911.bat file:

Code:
REM  Sub-process #1 for "get_ae911.bat" to retrieve a JSON file containing
REM  data from ae911truth.org and format it into a .txt file
REM  Parameters:
REM   %1: JSON file to process, without the trailing '.json'
REM   %2: Name of file to which output is appended

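echo Processing %1 list and writing to %2
set URL=https://ae911truth.org/signatures
REM  Download the JSON file and save its content to a temporary file
powershell "(Invoke-WebRequest -Uri %URL%/%1.json -UseBasicParsing).Content | Set-Content -Path _temp.txt"
REM  Reformat the JSON with sed and awk, appending the result to the output file
sed -f _ae911.sed _temp.txt | awk -f _ae911.awk -v list=%1 >>%2
del _temp.txt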

Unfortunately, it didn't solve as many encoding issues as I had hoped. At some point, a lot of the text appears to have been incorrectly converted from UTF-8 to ISO-8859-1. The result of that incorrect conversion was then stored in a Windows database encoded as Windows-1252, so characters that are invalid in ISO-8859-1 got converted to "?". That's why a word like "École" now appears as "�?cole" in the spreadsheet.
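
To make that chain of conversions concrete, here is a minimal Python sketch that replays it on a single word. The function name `garble` is hypothetical, and the last step (the stored bytes later being read back as UTF-8) is an assumption rather than something confirmed above. It is only one plausible way the "�" could end up in the spreadsheet.

```python
def garble(text: str) -> str:
    """Replay the suspected chain of wrong conversions on one string."""
    # Step 1: the text starts out correctly encoded as UTF-8.
    utf8_bytes = text.encode("utf-8")
    # Step 2: those bytes are wrongly interpreted as ISO-8859-1,
    # so each byte becomes one character.
    misread = utf8_bytes.decode("iso-8859-1")
    # Step 3: the misread text is stored as Windows-1252; characters
    # that have no Windows-1252 mapping are replaced with "?".
    stored = misread.encode("cp1252", errors="replace")
    # Step 4 (assumption): the stored bytes are later read as UTF-8,
    # so a lead byte without its continuation byte becomes U+FFFD.
    return stored.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(garble("École"))
```

Under these assumptions, `garble("École")` returns the replacement character U+FFFD followed by "?cole", which matches what shows up in the spreadsheet. A lowercase "é" passes through this particular chain unchanged, because both of its misread characters exist in Windows-1252. If the chain is right, that would explain why only some accented letters are damaged.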
 
I've finally found some time to update the process to save the retrieved pages in UTF-8. Here's the updated _ae911.bat file:

...
Thanks yet again - how can I ever repay the debt? I'll probably not try this before Sunday, will let you know.

Unfortunately, it didn't solve as many encoding issues as I had hoped. At some point, a lot of the text appears to have been incorrectly converted from UTF-8 to ISO-8859-1. The result of that incorrect conversion was then stored in a Windows database encoded as Windows-1252, so characters that are invalid in ISO-8859-1 got converted to "?". That's why a word like "École" now appears as "�?cole" in the spreadsheet.
Ah ok, yeah, that makes kinda sense. I have been struggling with those "�?" long before last month's change.
 
