How does Google work so fast? Any techies with an inside track?

by AlmostAtheist 10 Replies latest jw friends

  • AlmostAtheist
    AlmostAtheist

    I'm banging on an Oracle database with a table holding 200,000,000 records. Just getting a count of those bad boys takes several seconds.

    It got me to wondering about Google. They index 8 billion pages. How many gazillion records must exist describing all those pages? Maybe I could see sending in a request and getting the answer back in a coupla weeks. But they find the pages that contain my search terms (and exclude the ones that I put a minus sign in front of) and return a nicely formatted page complete with ads in less than a second. How are they doing that?

    I understand that the technology's been around for years and Google's not doing anything especially new. But I'm still curious. I can't believe that just having more computers in your server farm somehow makes you able to search that much faster. When it comes down to it, you still have to go hit an index, crawl across it pulling in all the matches, join it to the matches for the other terms in the query, fetch back the title and description of the 100 pages on page 1, and come up with a reasonable guess of how many more pages there are. There's a ton of stuff happening there and it's insanely fast.
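
    For what it's worth, the steps listed above (hit an index, pull the matches for each term, combine them, drop the minus-sign terms) are roughly what a textbook inverted index does. Below is a toy Python sketch of that idea only. It is not a description of Google's actual system, and the names PAGES, build_index and search are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus: page id -> page text.
PAGES = {
    1: "oracle database index tuning",
    2: "google search index",
    3: "pigeon cluster search",
    4: "oracle search tips",
}

def build_index(pages):
    """Map each term to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.split():
            index[term].add(page_id)
    return index

def search(index, query):
    """Return sorted page ids matching every plain term and none of the '-' terms."""
    terms = query.split()
    required = [t for t in terms if not t.startswith("-")]
    excluded = [t[1:] for t in terms if t.startswith("-")]
    if not required:
        return []
    # Intersect the smallest posting sets first to keep intermediate results small.
    postings = sorted((index.get(t, set()) for t in required), key=len)
    hits = set(postings[0])
    for posting in postings[1:]:
        hits &= posting
    # Remove every page that contains an excluded term.
    for t in excluded:
        hits -= index.get(t, set())
    return sorted(hits)

INDEX = build_index(PAGES)
```

    With this toy data, search(INDEX, "search -google") looks up one set per term and never scans the page text itself, which is the point: the cost depends on how many pages match, not on how many pages exist.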

    Does anybody know how they pull it off?

    Dave

  • Valis
  • AlmostAtheist
    AlmostAtheist

    So you think rather than search all 8 billion pages, they only search the 15-20 non-porn pages if your query seems unporny?

    Yeah, that makes some sense, now that you mention it.

    Dave

  • New Worldly Translation
    New Worldly Translation

    It's thought that Google keeps the entire network in ultra-fast RAM distributed among thousands of servers (conservative estimate: 500 terabytes). It also has a cache of the most commonly searched-for terms, whose results can be returned immediately. Nothing is written to disk, as seek times are too slow to be effective.

    No one knows for sure, though, as it's a closely guarded secret.
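
    The "cache of common searches" part is easy to picture even without inside knowledge. Here is a small Python sketch of a least-recently-used cache sitting in front of a slow lookup. Everything in it (QueryCache, slow_search, the capacity of 2) is an assumption for illustration, not anything Google has published.

```python
from collections import OrderedDict

class QueryCache:
    """Tiny LRU cache: keeps the most recently used query results in memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query, compute):
        if query in self._entries:
            self.hits += 1
            self._entries.move_to_end(query)  # mark as most recently used
            return self._entries[query]
        self.misses += 1
        result = compute(query)  # the slow path
        self._entries[query] = result
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return result

def slow_search(query):
    # Hypothetical stand-in for the expensive index lookup.
    return f"results for {query!r}"

cache = QueryCache(capacity=2)
cache.get("pigeons", slow_search)    # miss: computed and stored
cache.get("pigeons", slow_search)    # hit: served from memory
cache.get("oracle", slow_search)     # miss
cache.get("squirrels", slow_search)  # miss: cache is full, so "pigeons" is evicted
```

    The repeated "pigeons" query never touches slow_search the second time. A real cache would also need to expire entries when the underlying index changes, which this sketch ignores.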

  • Pole
    Pole

    I don't think anyone not working for Google is going to answer your question in a sufficiently informative way :-).

    For my petty projects, though, I generally try to follow these two rules, which I'm sure you know all too well:

    1) Hardware (that's pretty obvious)
    2) Powerful indexing and caching algorithms

    The biggest database I've ever worked on developing has 120,000,000+ records served by MySQL. It is relatively fast only because we had to think really hard about proper indexing and caching results at some stage. You can kill any machine/farm if you don't consider the issues of caching and indexing.
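
    To make the indexing point concrete, here is a small Python sketch using the standard-library sqlite3 module as a stand-in (not MySQL or Oracle, and not the actual project mentioned above). The table, column and index names are invented. It asks the database for its query plan before and after adding an index on the filtered column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, owner TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO records (owner, amount) VALUES (?, ?)",
    [(f"user{i % 100}", i) for i in range(10_000)],
)

def plan(sql):
    """Return SQLite's query-plan description for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " ".join(row[-1] for row in rows)

QUERY = "SELECT amount FROM records WHERE owner = 'user7'"

plan_before = plan(QUERY)  # no usable index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_records_owner ON records (owner)")
plan_after = plan(QUERY)   # the planner can now search via idx_records_owner
```

    The exact wording of the plan text differs between SQLite versions, so treat it as a hint rather than a stable format. The general lesson carries over to bigger engines: check the plan, not just the stopwatch.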

    Edited to add: Any improvement if you run the same query on your 200 million Oracle table for the second time?

    Pole

  • SixofNine
    SixofNine

    I have it from a good source that while the other services (Yahoo, MSN, AOL, About, etc.) labor away with the traditional and trusty mice, Google found a way to harness the boundless energy of squirrels.

  • joenobody
    joenobody

    I do a lot of Oracle work as well and wanted to know the answer.

    Google's own answer is hilarious!

    http://www.google.com/technology/pigeonrank.html

  • AlmostAtheist
    AlmostAtheist
    "Any improvement if you run the same query on your 200 million Oracle table for the second time?"

    Yeah, it cuts the time in half on the second run. But of course this isn't much help for my real problem, since it's the first run that needs to speed up. To get what I need, I have to join several tables, then group by fields from two tables. It's so slow, Oracle bombs out saying the snapshot is too old. (It reports the data as it existed at the time the query began, so if any changes occur while your query is running, it rolls those changes back long enough to tell you what the data looked like at the start of your query. When the rollback information is no longer available, Oracle reports the error and terminates the query.)

    I've read that "Pigeon Cluster (PC)" article on Google before. You're right, it's hilarious! Maybe I should try pigeons on this Oracle project...

    I can buy that "running in memory" thing, that would speed things up considerably. But you'd need a horrendous amount of RAM, more than could be addressable on a single machine. So the multiple machines would need to be networked, which would be a bottleneck. And you couldn't just crawl through each machine looking for what you needed. There'd have to be an index into the machines to know which one to hit to get the bits of data you wanted. What a setup... I'd love to see how it works. Trouble is, once you know how the "magic" works, it doesn't seem so magical anymore. Maybe I'll just believe the Pigeon story and leave it at that.
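
    The "index into the machines" idea above can be sketched without any magic: hash each search term to decide which machine owns it, so a lookup goes to exactly one place. This Python toy simulates the machines as in-memory dictionaries. NUM_SHARDS, shard_for, add_page and lookup are all hypothetical names, and real systems may well partition by document rather than by term.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical; stands in for "thousands of servers"

def shard_for(term):
    """Pick the shard that owns a term, using a stable hash of the term."""
    return zlib.crc32(term.encode("utf-8")) % NUM_SHARDS

# Each shard holds its own slice of the term -> page-ids index in memory.
shards = [defaultdict(set) for _ in range(NUM_SHARDS)]

def add_page(page_id, text):
    """Route every term of a page to the shard that owns that term."""
    for term in text.split():
        shards[shard_for(term)][term].add(page_id)

def lookup(term):
    """Go straight to the one shard that can hold the term; the others are never touched."""
    return sorted(shards[shard_for(term)].get(term, set()))

add_page(1, "oracle snapshot too old")
add_page(2, "pigeon cluster")
add_page(3, "oracle pigeon")
```

    The hash itself is the "index into the machines": no lookup table is needed to know where a term lives. The network hop is still there, but it is one hop to one machine instead of a crawl across all of them.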

    Dave

  • Elsewhere
    Elsewhere

    There are two basic ways to arrange data in a database: Flat and Relational.

    Flat is blazing fast, but relational is far more flexible.

    I'm guessing that what the search engines are doing is a nightly process where they scan the relational joins for common search patterns, then populate flat tables with the data for the next day's searches. Basically they take advantage of both worlds... flat and relational.

    An example of this would be accounting records... To calculate the end balance for an account using a relational database, one would have to sum all of the transactions since the day the account was created. It doesn't take a great leap to conclude that this would take forever if you had a few thousand accounts which had been open for a few years. The solution to this is to create a flat account summary table and, every night, sum only that day's transactions for each account and add the result to the account's record in the summary table. The result would be extremely fast response times when looking up account balances.
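
    The accounting example can be sketched in a few lines. This uses Python's built-in sqlite3 module purely as a convenient stand-in. The table names, the dates and the nightly_rollup function are made up, and amounts are whole numbers to keep the arithmetic exact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (account TEXT, day TEXT, amount INTEGER);
    CREATE TABLE account_summary (account TEXT PRIMARY KEY, balance INTEGER NOT NULL);
""")

def record(account, day, amount):
    conn.execute("INSERT INTO transactions VALUES (?, ?, ?)", (account, day, amount))

def nightly_rollup(day):
    """Fold one day's transactions into the flat summary table."""
    totals = conn.execute(
        "SELECT account, SUM(amount) FROM transactions WHERE day = ? GROUP BY account",
        (day,),
    ).fetchall()
    for account, total in totals:
        updated = conn.execute(
            "UPDATE account_summary SET balance = balance + ? WHERE account = ?",
            (total, account),
        )
        if updated.rowcount == 0:  # first activity for this account
            conn.execute("INSERT INTO account_summary VALUES (?, ?)", (account, total))

def balance(account):
    """Single-row read from the summary table; the transaction history is not scanned."""
    row = conn.execute(
        "SELECT balance FROM account_summary WHERE account = ?", (account,)
    ).fetchone()
    return row[0] if row else 0

record("acct-1", "2005-06-01", 100)
record("acct-1", "2005-06-01", -30)
record("acct-2", "2005-06-01", 50)
nightly_rollup("2005-06-01")
record("acct-1", "2005-06-02", 25)
nightly_rollup("2005-06-02")
```

    Each rollup only touches one day's rows, and balance() reads a single summary record. The trade-off is that the summary is only as fresh as the last rollup, and running the rollup twice for the same day would double-count, so a real job would need to track which days it has already processed.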

  • Pole
    Pole

    That's right Elsewhere, although I've also heard of the object-oriented DB approach (everything has to be OO these days). I've also used temporary "flat" tables generated from relational ones to speed up common searches on large tables.

    Pole
