How does Google search work?

Nova Land · May 26, 2007

I recently learned that my assumption about how Google search works is wrong.

I thought that when one typed a phrase in quotation marks into Google's search window that Google would provide a hit list of all the searchable sites which contained that exact phrase. Thus, a search for, say, "shot in the dark" would turn up more hits than a search for "a shot in the dark", because all the sites which turned up as hits in a search for "a shot in the dark" would also turn up with hits in a search for "shot in the dark".
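The assumption above can be sketched in a few lines of Python. This is only an illustration of the reasoning, using naive substring matching on three made-up pages; `phrase_hits` and the sample `pages` are hypothetical and say nothing about how Google actually matches phrases or counts hits.

```python
def phrase_hits(pages, phrase):
    """Return the set of page indices whose text contains the exact phrase."""
    phrase = phrase.lower()
    return {i for i, text in enumerate(pages) if phrase in text.lower()}

# Made-up sample pages, for illustration only.
pages = [
    "It was a shot in the dark.",
    "Another shot in the dark missed.",
    "Nothing relevant here.",
]

short_hits = phrase_hits(pages, "shot in the dark")
long_hits = phrase_hits(pages, "a shot in the dark")

# Every page containing the longer phrase also contains the shorter one,
# so under exact matching the longer phrase can never have more hits.
print(sorted(short_hits), sorted(long_hits))
print(long_hits <= short_hits)
```

Under this model the longer phrase's hits are always a subset of the shorter phrase's hits, which is why a larger reported count for the longer phrase looks wrong.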

But apparently that is not always the case. For example, a search for "snow in Alaska" gives 812 hits, but a search for "all the snow in Alaska" gives 7800 hits.

Can anyone explain why this is so?

> hat tip to Myriad for providing the "all the snow in Alaska" example, over in this post in Puzzles <

Foolmewunz · May 26, 2007

Just a hunch.... "all the snow in Alaska" is an actual quote from a verse. There may be some mechanism for looking up cites or quotes that tells the search engine that you have to get "X" amount of the words in the expression?
Using "all the snow in Alaska" the first x hits are from the quotation by Bing Crosby, but googling "snow in Alaska", it turns up, but not from any of the famous quotation pages, and well into the hits, not in the top twenty...

Other searches adding "all the" or other modifiers show what you'd expect... higher numbers of hits without adding in "all the". Even "all the tea in China" versus "tea in China"....

Anyhow - like I said, just a hunch on the quotation within quotes, thing.

rockoon · May 26, 2007

Note the 'similar pages' links within google search results. Presumably this link represents a slew of pages that are not returned by the current search but actually ('technically') do match your search. Remember that it's pointless to return pages that are so similar as to be rationally considered equal in content.

Also note that google does not index overly-abundant words such as 'is' and 'a'

Nova Land · May 26, 2007

rockoon said:
Note the 'similar pages' links within google search results. Presumably this link represents a slew of pages that are not returned by the current search but actually ('technically') do match your search. Remember that it's pointless to return pages that are so similar as to be rationally considered equal in content.

But Google returns not just a listing of hits but also a number of hits. It was my impression that, while the listing may omit similar searches, the number provided is inclusive of these items. Am I wrong on that?

[quick check]

Oooh, yes I am. This is interesting. The initial search results for "All the snow in Alaska" turned up a claimed 7810 hits -- but only turned up 7 pages (which, at 10 listings to a page, would be considerably less than 7810). I clicked the page 7 link to get to the end of the listings, because that often reduces the number of pages returned as similar links get removed from the listing. Guess what? That reduced the claimed number of hits to 5 pages and 41 hits.

I think the initial claim of 7810 hits is erroneous, and that's where the problem lies. But Myriad (over in the Puzzle thread where this came up) mentioned having found other examples where this occurred. I should go back over there and inquire what some other examples he found were...

Also note that google does not index overly-abundant words such as 'is' and 'a'

Are you sure about that? I got significantly different hit-counts for "flash in the pan" and "a flash in the pan", which is the phrase I was initially testing, and this occurred for several other phrases I tested where both searches were identical except that one search had a leading "a" and the other didn't.

Ducky · May 26, 2007

Teh intarwebz gotz fairyz 2 get ur pagez.

Foolmewunz · May 26, 2007

So where's Myriad, then, since he started all this trouble?

(I'm really curious to know the answer. Don't we have any Google-Miners on here?)

Nova Land · May 26, 2007

fowlsound said:
Teh intarwebz gotz fairyz 2 get ur pagez.

Well yes, of course, everyone knows that. But I'm trying to figure out what rules these fairyz operate by. I'd just assumed that since the custom title fairyz here are pretty much infallible that the Google search fairyz were as well. But turning up more hits for "all the snow in Alaska" than for "snow in Alaska" just seems wrong.

Starthinker · May 26, 2007

I have my own questions about Google. I set up a test forum once, on my server at home, and of course used my name when setting up an account. No links to it at all, zero, zilch. No one knew it was there but me. One day I google my name (as we are all wont to do from time to time) and this test forum was the #3 hit. I always thought Google rated sites based on the number of links to that site but there is no way this was the case in this instance. It didn't even have a URL, just an IP address, at the time. I was completely surprised to see it there. This was only about 6 months ago.

Just checked, it's now the #1 Google hit for my name. Very strange. Of course now it has a URL but still, it's just on my server at home and no links to it whatsoever, and it was just put up so I could see what it looked like.

geni · May 26, 2007

Starthinker said:
I always thought Google rated sites based on the number of links to that site but there is no way this was the case in this instance.

Exactly how google rates sites is highly secret. In this case it might be that google doesn't like the look of anything else with your name on. In addition your name probably features fairly prominently on the site which may mean that google thinks it is more focused on you while other hits are just passing mentions.

rdaneel · May 26, 2007

geni said:
Exactly how google rates sites is highly secret.

Actually, they explain it all here.

strathmeyer · May 26, 2007

There's a big difference between the science of how a large number of documents are searched for a list of words and how Google includes and ranks pages. Which do you want to know about?

If your question is the former, I recommend "The Art of Computer Programming, Volume 3: Sorting and Searching" by Donald Knuth.

Nova Land · May 27, 2007

Starthinker said:
I have my own questions about Google. I set up a test forum once, on my server at home, and of course used my name when setting up an account. No links to it at all, zero, zilch. No one knew it was there but me. One day I google my name (as we are all wont to do from time to time) and this test forum was the #3 hit. I always thought Google rated sites based on the number of links to that site but there is no way this was the case in this instance...

I'm as ignorant of how Google does rankings as I am of how they do searches, but I seem to recall reading (when I downloaded the Google search bar) that having the search bar on one's computer lets Google personalize the hit list it returns, so that the sites one is more likely to be interested in (based on what one did or didn't click on from previous searches) come up higher and the sites one is less likely to be interested in (again based on what one did or didn't click for previous searches) appear lower on the list. I wonder if that might be at play in your site moving up to the # 1 spot in the searches you did.

A simple test of this would be to have a friend do a search on your name from their computer -- or for you to do this search on a library computer -- and see if your site still comes out as the first one listed.

Nova Land · May 27, 2007

strathmeyer said:
There's a big difference between the science of how a large number of documents are searched for a list of words and how Google includes and ranks pages. Which do you want to know about?

The former.

If your question is the former, I recommend "The Art of Computer Programming, Volume 3: Sorting and Searching" by Donald Knuth.

Is this by any chance available as a DC or Marvel comic book?

rockoon · May 27, 2007

Nova Land said:
Is this by any chance available as a DC or Marvel comic book?

It contains a lot of calculus glyphs.. does that count?

strathmeyer · May 27, 2007

Imagine that for every possible input word we form an incredibly long binary (made up of 0's and 1's) number. Every digit of the binary number represents a document. So, if we have a million webpages that are made up of a thousand different words, we will have a thousand binary numbers that are a million digits long. Now, the thing that you have to realize here is that even though we are dealing with a lot of big numbers, each number is going to be very sparse. Even though we need to know whether or not "rhubarb" is on every single document in our inventory, it is only going to be on a few documents.

Forming and storing these large numbers takes a long time. In fact, adding new documents takes a whole lot of time, because you have to lengthen the number for every word in your dictionary. However, there are ways of representing these numbers so that they are compressed and we can perform logical operations on them very quickly (and, nor, negation). This is what makes searching fast. It is called k-Nearest-Neighbor with P-trees. Have I mentioned magic, yet? Anyway, here's a paper I wrote on document search algorithms in college. It's amazingly short! Let me know if you can't read Word files.
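To make the "one binary number per word" idea concrete, here is a minimal Python sketch. It stores each word's bit vector as a plain uncompressed integer and combines vectors with bitwise AND, OR, and NOT. The names `build_index` and `matching_docs` and the four sample documents are made up for illustration; this is not the P-tree representation mentioned above and not a description of what Google actually runs.

```python
def build_index(docs):
    """Map each word to an int whose bit i is set when docs[i] contains it."""
    index = {}
    for i, text in enumerate(docs):
        for word in set(text.lower().split()):
            index[word] = index.get(word, 0) | (1 << i)
    return index

def matching_docs(bits):
    """Turn a bit vector back into a sorted list of document numbers."""
    result = []
    i = 0
    while bits:
        if bits & 1:
            result.append(i)
        bits >>= 1
        i += 1
    return result

# Made-up sample documents, for illustration only.
docs = [
    "all the snow in alaska",
    "snow in alaska is deep",
    "rhubarb pie recipe",
    "tea in china",
]
index = build_index(docs)
all_docs = (1 << len(docs)) - 1  # mask with one bit per document

snow_and_alaska = index["snow"] & index["alaska"]    # AND
snow_or_tea = index["snow"] | index["tea"]           # OR
in_not_snow = index["in"] & (all_docs & ~index["snow"])  # AND with negation

print(matching_docs(snow_and_alaska))
print(matching_docs(snow_or_tea))
print(matching_docs(in_not_snow))
print(bin(index["rhubarb"]))  # sparse: only one bit set
```

Note that this only answers "which documents contain these words". An exact-phrase search would also need word positions, which this sketch leaves out.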
