
Importing html posts to database

nimzov

I post on many forums, and I would like to transfer my messages from these forums to my private forum. My forum runs vBulletin; the database is MySQL 4.1.21, accessed through PHP 4.4.4.

I know how to do that between databases with an import. The problem is that I do not have access to the database of the forum on which I am posting.

Is it possible to automate the transfer of my messages displayed on an HTML page to my database?

Thanks for your help.

nimzo
 
I'd advise that you check the usage policies for the forums you're posting on before considering this. Depending on the policy, you might be committing a breach of copyright.
 
These are my messages that I want to copy. I think I own the copyright on my messages. And it's not different from copy-pasting them. I am looking for technical more than legal help.

nimzo
 
You are also likely to piss off those boards with an attempt to use some sort of script to pluck all your posts off the boards.
 
Not really.

On some board (not this one) I have made numerous posts and very long messages on water fluoridation, for example. When I read the thread I simply (?) want to extract my messages and put them in a database, with subject, body, date, title, etc. each in its own field. I just want to automate the text extraction from the HTML page. This is not much different from doing it manually.

I do not want to copy or download the forum like a webcopier program.

nimzo
 
There still might be a problem, depending on the policy. Look out for something that says "all content is owned by..."

But yeah, you're after technical help and I can't help you there. So I'll shut up now.
 
I will probably have to write a Perl script with these awful regular expressions that I have so much trouble with. :eek:

I thought there might be some piece of software that could help me extract the content of the html page.

And yes, I am willing to fight in court to get my messages back. :)

Thanks.

nimzo
 
nimzo said:
On some board (not this one) I have made numerous posts and very long messages on water fluoridation, for example. When I read the thread I simply (?) want to extract my messages and put them in a database, with subject, body, date, title, etc. each in its own field. I just want to automate the text extraction from the HTML page. This is not much different from doing it manually.

Except doing it automagically may cause the server to balk at it and get pissy at an attempt to remote clone.

nimzo said:
I do not want to copy or download the forum like a webcopier program.

There is no difference other than how you treat the information once it is retrieved. It is the automagic retrieval that may cause an issue - you should definitely ask first.
 
Maybe I was not clear. I do not want to automate the retrieval. I know this may be problematic. What I want to automate is the parsing of the HTML page I am reading on the screen. I want to extract the information from that page.

In other words, the process I want to automate is parsing the HTML page (which has already been served by the server) and extracting the content. I do not want to automate the interaction with the server.

nimzo said:
Is it possible to automate the transfer of my messages displayed on an html page to my database ?
The question of automation in my OP has to do with regular expressions and reading a page, not with interacting with a server.

nimzo
 
Then it really depends on the specifics of the HTML that the various forums generate. There are various Perl libraries for XML parsing and the like.
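To make the "depends on the HTML" point concrete, here is a minimal sketch in Python (not Perl) that pulls author, date and body out of a page already saved to disk. The markup it expects is purely an assumption for illustration: each post in a `<div class="post">` containing elements with the hypothetical classes `author`, `date` and `body`. A real forum will use different tags and class names, and badly nested HTML would confuse the simple depth counting used here.

```python
from html.parser import HTMLParser

# Hypothetical class names; adjust to whatever the real forum emits.
FIELDS = ("author", "date", "body")
# Tags that never have a closing tag, so they must not change the depth.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class PostExtractor(HTMLParser):
    """Collect one dict per <div class="post"> found in the page."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self.current = None     # dict for the post being read, if any
        self.field = None       # name of the field being read, if any
        self.depth = 0          # open tags inside the current post
        self.field_depth = 0    # depth at which the current field opened

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            if tag == "br" and self.field is not None:
                self.current[self.field] += "\n"
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.current is None:
            if "post" in classes:
                self.current = dict.fromkeys(FIELDS, "")
                self.depth = 1
            return
        self.depth += 1
        if self.field is None:
            for name in FIELDS:
                if name in classes:
                    self.field, self.field_depth = name, self.depth
                    break

    def handle_endtag(self, tag):
        if tag in VOID or self.current is None:
            return
        if self.field is not None and self.depth == self.field_depth:
            self.field = None            # the field's own tag just closed
        self.depth -= 1
        if self.depth == 0:              # the post's own tag just closed
            self.posts.append({k: v.strip() for k, v in self.current.items()})
            self.current = None

    def handle_data(self, data):
        if self.field is not None:
            self.current[self.field] += data


def extract_my_posts(html, author):
    """Return only the posts whose author field matches `author`."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return [post for post in parser.posts if post["author"] == author]
```

Calling `extract_my_posts(open("thread.html").read(), "nimzo")` on a saved page (the file name is just an example) would give a list of dicts, one per matching post, ready to be written to a database or a file.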
 
If I'm reading this correctly, you want to convert HTML to text, right?

There are a number of open-source programs that will do that, the most notable of which is Lynx. Lynx is available for both Windows and Linux. However, it doesn't handle frames; if the forums contain frames, then w3m - which is Linux only, as far as I know - would work.

None of these programs handle scripting (Java/JavaScript) or, of course, images.

On the other hand, if you want just the straight HTML, then wget - pretty standard for Linux, available for Windows from gnuwin32.sourceforge.net - can be used to grab the web pages from the command line.
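If plain text is all that is needed and installing Lynx is a hurdle, the same kind of conversion can be roughed out in a few lines. This is only a sketch in Python using the standard library's `html.parser`; it is far cruder than Lynx or w3m (no table layout, no link handling), and the choice of which tags start a new line is my own assumption.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keep the text of a page, drop the tags, skip script/style content."""

    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3"}  # start a new line
    SKIP = {"script", "style"}                               # content is ignored

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skipping += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skipping:
            self.skipping -= 1

    def handle_data(self, data):
        if not self.skipping:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text, one non-empty line per block element."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

The result still has to be split into subject, date and body afterwards, which is where the regular expressions come in.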
 
If it were me, and I had the time (and enough posts to matter), I'd use Firefox and write an extension that added a "save to database" button to any post in my name.

That's what'd get my vote.
 
I've done this sort of thing a billion times using Perl and WWW::Mechanize or just LWP.

If you need some help on those "godawful regexps" just ask.
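For a feel of what such a regexp might look like, here is a small sketch. It is written in Python rather than Perl, though the pattern syntax is close to Perl's, and the header format it matches ("Posted by NAME on DATE, TIME - TITLE") is invented for illustration; each forum prints this line differently.

```python
import re

# Hypothetical header line format; every forum prints this differently.
HEADER = re.compile(
    r"Posted by (?P<author>\S+) on "
    r"(?P<date>\d{1,2} \w{3} \d{4}), "
    r"(?P<time>\d{2}:\d{2}) - "
    r"(?P<title>.+)"
)


def parse_header(line):
    """Return a dict of the named fields, or None if the line does not match."""
    match = HEADER.search(line)
    return match.groupdict() if match else None
```

Named groups keep the pattern readable: each `(?P<name>...)` part becomes a key in the returned dict, which maps naturally onto database columns.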
 
Thanks everyone for your suggestions.

Writing a Firefox extension looks a bit complicated to me, and I could not find Lynx binaries for W2k.

I have limited experience with Perl, so I think I will go that way with scribble's help.

I will get back when I get stuck in RegEx. :)

Thanks again.

nimzo
 
The "DOS" versions of Lynx should run on Win2K (in particular, one says it runs on Windows NT - that'll work fine on Win2K or WinXP).
 
If the website uses HTML tables, you can select the table in IE, copy it, and paste it into Excel. From there you can manipulate it, export it as CSV, and then load it into MySQL with a "LOAD DATA INFILE ..." statement. How useful this is depends on the formatting of the HTML table in question. A table copied from Firefox won't land in separate cells in Excel. I don't know if that will be of any help or not.
 
Hi.

I like the idea but unfortunately the Paste Special (HTML) in Excel (2000) pastes each paragraph of the message body in a different cell.

But I will look more closely into it.

Thanks for the suggestion.

nimzo
 
Unfortunately, it sounds like the table isn't formatted in a way that will make importing easy. If you do manage to get the Excel sheet the way you want it, the rest is easy. I wanted to add more specifics on the SQL statement to my earlier comments, so here's an example:
LOAD DATA LOCAL INFILE 'test.csv'
INTO TABLE results_2007
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
-- name one table column per field in the file, in file order
(column1, column2, column3);
If you are importing paragraphs, the text will contain a lot of commas, so separate the fields with something other than a comma and use that as the field terminator. Tabs would work best. A file saved on Windows may also need '\r\n' as the line terminator.
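As a sketch of the tab-separated variant, the Python below writes extracted posts in a form that `LOAD DATA` can read with `FIELDS TERMINATED BY '\t'`. It assumes MySQL's default `ESCAPED BY '\\'` setting, under which `\t`, `\n` and `\\` inside a field are read back as a tab, a newline and a backslash. The column names (`subject`, `body`) and the file name are made up for the example.

```python
def escape_field(value):
    """Escape one field so tabs and newlines inside it survive LOAD DATA."""
    value = value.replace("\r\n", "\n")      # normalise Windows line endings
    return (
        value.replace("\\", "\\\\")          # backslash first, or we double-escape
             .replace("\t", "\\t")
             .replace("\n", "\\n")
    )


def to_tsv(posts, columns):
    """Return the posts as tab-separated text, one post per line."""
    lines = ("\t".join(escape_field(post[c]) for c in columns) for post in posts)
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    posts = [{"subject": "Re: water", "body": "One, two.\nSecond paragraph."}]
    # newline="" stops Python from turning "\n" into "\r\n" on Windows.
    with open("posts.tsv", "w", newline="", encoding="utf-8") as handle:
        handle.write(to_tsv(posts, ["subject", "body"]))
```

Because commas are no longer special, paragraphs full of commas go through untouched, and a multi-paragraph message stays in a single field instead of being split across rows.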
Good luck.
 
