
How does Google search work?

Nova Land

I recently learned that my assumption about how Google search works is wrong.

I thought that when one typed a phrase in quotation marks into Google's search window that Google would provide a hit list of all the searchable sites which contained that exact phrase. Thus, a search for, say, "shot in the dark" would turn up more hits than a search for "a shot in the dark", because all the sites which turned up as hits in a search for "a shot in the dark" would also turn up with hits in a search for "shot in the dark".
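That assumption can be checked on a toy scale. Below is a minimal sketch, assuming a literal word-by-word phrase match over a made-up three-page corpus (`PAGES`, `contains_phrase` and `exact_phrase_hits` are hypothetical names for illustration, not anything Google exposes). Under literal matching, every page that matches the longer phrase must also match the shorter one.

```python
def contains_phrase(text: str, phrase: str) -> bool:
    """True if the words of `phrase` appear consecutively, in order, in `text`."""
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

# Hypothetical mini-corpus, invented for illustration.
PAGES = {
    "page1": "it was a shot in the dark",
    "page2": "one shot in the dark missed",
    "page3": "nothing relevant here",
}

def exact_phrase_hits(phrase: str) -> set:
    """Names of all pages that literally contain the phrase."""
    return {name for name, text in PAGES.items() if contains_phrase(text, phrase)}

shorter = exact_phrase_hits("shot in the dark")
longer = exact_phrase_hits("a shot in the dark")

# Every hit for the longer phrase is also a hit for the shorter one.
print(longer <= shorter)
```

So if a search engine reports more hits for the longer phrase, the reported count is presumably not a literal count of matching pages.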

But apparently that is not always the case. For example, a search for "snow in Alaska" gives 812 hits, but a search for "all the snow in Alaska" gives 7800 hits.

Can anyone explain why this is so?



> hat tip to Myriad for providing the "all the snow in Alaska" example, over in this post in Puzzles <
 
Just a hunch.... "all the snow in Alaska" is an actual quote from a verse. There may be some mechanism for looking up cites or quotes that tells the search engine that you have to get "X" number of the words in the expression?
With "all the snow in Alaska", the first x hits are for the quotation by Bing Crosby. Googling "snow in Alaska" turns the quotation up too, but not from any of the famous quotation pages, and only well into the hits, not in the top twenty...

Other searches adding "all the" or other modifiers show what you'd expect... higher numbers of hits without adding in "all the". Even "all the tea in China" versus "tea in China"....

Anyhow - like I said, just a hunch on the quotation-within-quotes thing.
 
Note the 'similar pages' links within Google search results. Presumably this link represents a slew of pages that are not returned by the current search but actually ('technically') do match your search. Remember that it's pointless to return pages that are so similar as to be rationally considered equal in content.

Also note that Google does not index overly abundant words such as 'is' and 'a'.
 
Note the 'similar pages' links within Google search results. Presumably this link represents a slew of pages that are not returned by the current search but actually ('technically') do match your search. Remember that it's pointless to return pages that are so similar as to be rationally considered equal in content.


But Google returns not just a listing of hits but also a number of hits. It was my impression that, while the listing may omit similar pages, the number provided is inclusive of these items. Am I wrong on that?

[quick check]

Oooh, yes I am. This is interesting. The initial search results for "All the snow in Alaska" turned up a claimed 7810 hits -- but only turned up 7 pages (which, at 10 listings to a page, is at most 70 hits, considerably fewer than 7810). I clicked the page 7 link to get to the end of the listings, because that often reduces the number of pages returned as similar links get removed from the listing. Guess what? That reduced the claimed number of hits to 5 pages and 41 hits.

I think the initial claim of 7810 hits is erroneous, and that's where the problem lies. But Myriad (over in the Puzzle thread where this came up) mentioned having found other examples where this occurred. I should go back over there and ask what other examples he found...

Also note that Google does not index overly abundant words such as 'is' and 'a'.

Are you sure about that? I got significantly different hit counts for "flash in the pan" and "a flash in the pan", which is the phrase I was initially testing, and this occurred for several other phrases I tested where both searches were identical except that one search had a leading "a" and the other didn't.
 
So where's Myriad, then, since he started all this trouble?

(I'm really curious to know the answer. Don't we have any Google-Miners on here?)
 
Teh intarwebz gotz fairyz 2 get ur pagez.


Well yes, of course, everyone knows that. But I'm trying to figure out what rules these fairyz operate by. I'd just assumed that since the custom title fairyz here are pretty much infallible that the Google search fairyz were as well. But turning up more hits for "all the snow in Alaska" than for "snow in Alaska" just seems wrong.
 
I have my own questions about Google. I set up a test forum once, on my server at home, and of course used my name when setting up an account. No links to it at all, zero, zilch. No one knew it was there but me. One day I google my name (as we are all wont to do from time to time) and this test forum was the #3 hit. I always thought Google rated sites based on the number of links to that site but there is no way this was the case in this instance. It didn't even have a URL, just an IP address, at the time. I was completely surprised to see it there. This was only about 6 months ago.

Just checked, it's now the #1 Google hit for my name. Very strange. Of course now it has a URL but still, it's just on my server at home and no links to it whatsoever, and it was just put up so I could see what it looked like.
 
I always thought Google rated sites based on the number of links to that site but there is no way this was the case in this instance.

Exactly how Google rates sites is highly secret. In this case it might be that Google doesn't like the look of anything else with your name on it. In addition, your name probably features fairly prominently on the site, which may mean that Google thinks it is more focused on you while other hits are just passing mentions.
 
There's a big difference between the science of how a large number of documents are searched for a list of words and how Google includes and ranks pages. Which do you want to know about?

If your question is the former, I recommend "The Art of Computer Programming, Volume 3: Sorting and Searching" by Donald Knuth.
 
I have my own questions about Google. I set up a test forum once, on my server at home, and of course used my name when setting up an account. No links to it at all, zero, zilch. No one knew it was there but me. One day I google my name (as we are all wont to do from time to time) and this test forum was the #3 hit. I always thought Google rated sites based on the number of links to that site but there is no way this was the case in this instance...


I'm as ignorant of how Google does rankings as I am of how they do searches, but I seem to recall reading (when I downloaded the Google search bar) that having the search bar on one's computer lets Google personalize the hit list it returns, so that the sites one is more likely to be interested in (based on what one did or didn't click on from previous searches) come up higher and the sites one is less likely to be interested in (again based on what one did or didn't click on from previous searches) appear lower on the list. I wonder if that might be at play in your site moving up to the #1 spot in the searches you did.

A simple test of this would be to have a friend do a search on your name from their computer -- or for you to do this search on a library computer -- and see if your site still comes out as the first one listed.
 
There's a big difference between the science of how a large number of documents are searched for a list of words and how Google includes and ranks pages. Which do you want to know about?


The former.

If your question is the former, I recommend "The Art of Computer Programming, Volume 3: Sorting and Searching" by Donald Knuth.


Is this by any chance available as a DC or Marvel comic book?
 
Imagine that for every possible input word we form an incredibly long binary number (made up of 0's and 1's). Every digit of the binary number represents a document: 1 if the word appears in that document, 0 if it doesn't. So, if we have a million webpages that are made up of a thousand different words, we will have a thousand binary numbers that are each a million digits long. Now, the thing that you have to realize here is that even though we are dealing with a lot of big numbers, each number is going to be very sparse. Even though we need to know whether or not "rhubarb" is on every single document in our inventory, it is only going to be on a few documents.
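The one-binary-number-per-word idea can be sketched in a few lines. This is only an illustration under assumed names and data (`DOCS`, `build_bitmaps` and `docs_with_all` are made up here), with Python integers standing in for the long binary numbers: bit i of a word's number is 1 when document i contains that word.

```python
from collections import defaultdict

# Hypothetical documents, invented for illustration.
DOCS = [
    "rhubarb pie recipe",        # document 0
    "snow in alaska",            # document 1
    "all the snow in alaska",    # document 2
    "apple pie recipe",          # document 3
]

def build_bitmaps(docs):
    """Map each word to an int whose bit i is 1 when document i contains the word."""
    bitmaps = defaultdict(int)
    for doc_id, text in enumerate(docs):
        for word in set(text.split()):
            bitmaps[word] |= 1 << doc_id
    return dict(bitmaps)

def docs_with_all(bitmaps, words, n_docs):
    """AND the per-word bitmaps together and list the documents still set."""
    result = (1 << n_docs) - 1  # start with every document as a candidate
    for word in words:
        result &= bitmaps.get(word, 0)  # an unknown word matches no document
    return [i for i in range(n_docs) if (result >> i) & 1]

BITMAPS = build_bitmaps(DOCS)
print(bin(BITMAPS["rhubarb"]))  # only bit 0 is set: a sparse number
print(docs_with_all(BITMAPS, ["snow", "alaska"], len(DOCS)))
```

Note that ANDing the bitmaps only finds documents containing all the words somewhere; matching an exact quoted phrase would also need word positions, which this sketch leaves out.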

Forming and storing these large numbers takes a long time. In fact, adding new documents takes a whole lot of time, because you have to lengthen the number for every word in your dictionary. However, there are ways of representing these numbers so that they are compressed and we can perform logical operations on them very quickly (AND, OR, negation). This is what makes searching fast. It is called k-Nearest-Neighbor with P-trees. Have I mentioned magic yet? Anyway, here's a paper I wrote in college on document search algorithms. It's amazingly short! Let me know if you can't read Word files.
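To give a feel for why a sparse representation helps, here is a rough sketch that swaps in a simpler technique than the P-trees named above: each word keeps only a sorted list of the document IDs it appears in (a posting list) instead of a million-digit number. The lists and function names are hypothetical, and this is not the method from the attached paper.

```python
def intersect(a, b):
    """AND: IDs present in both sorted lists, found with a single merge pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: IDs present in either list, returned in sorted order."""
    return sorted(set(a) | set(b))

def negate(a, n_docs):
    """NOT: every document ID from 0 to n_docs - 1 that is absent from the list."""
    present = set(a)
    return [d for d in range(n_docs) if d not in present]

# Hypothetical posting lists for a collection of one million documents.
N_DOCS = 1_000_000
RHUBARB = [12, 40_117, 983_002]
PIE = [12, 77, 40_117, 500_000]

print(intersect(RHUBARB, PIE))  # documents containing both words
print(union(RHUBARB, PIE))      # documents containing either word
```

Storing three IDs for "rhubarb" rather than a million mostly-zero digits is the whole saving, and the AND only has to walk the two short lists. Negation is the expensive case in this form, since its result is almost the entire collection.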
 
