About the Ask Jeeves / Ask.com crawler

Published: 12 August 2006
Author: Serban Ghita

IAC Search & Media (formerly Ask Jeeves, Inc.) was founded in 1996 in Berkeley, California by David Warthen, CTO and veteran software developer, and Garrett Gruener, venture capitalist at Alta Partners and founder of Virtual Microsystems.

The original software was implemented by Gary Chevsky, who worked on question answering and information retrieval technologies.

The company was incorporated in June 1996. The staff consisted of the two founders, Dave Warthen and Garrett Gruener, plus Gary, the software architect, and three content editors, all jammed into a couple of rooms in a historic building in downtown Berkeley, a block away from the UC Berkeley campus.

Gary Chevsky wrote a funny account on the Ask Jeeves blog of how they started the business:

"It had all the glory of a startup - the greasy smell from a Chinese restaurant downstairs, the occasional cockroaches crawling across my desk, the turn-of-the-century elevator that required a human operator, the folding bed for sleeping in the office, and a bunch of people enthused about building something different." - cited from Ask Blog (2005/04)

April 1997 - Ask.com was released as a friends-only beta

The site ran on two Dell servers under Gary's desk. He says he was alerted by pager when there were problems on the site. On occasion, when he couldn't determine what was wrong, he hooked his debugger up to the live boxes, set a breakpoint, and stepped through the code against live queries. That means that if you ran a query that triggered a problem, you would never get your results, because Gary was busy debugging that query.

After a few weeks, Ask.com was handling a couple of thousand queries a day. Then Yahoo gave it a "Cool Pick of the Week" mention and traffic doubled (from 4K to 8K). Other people started noticing, and before long the site was handling 150K queries a day.

As the site's fame grew, the Ask.com staff began talking to Dell about using Ask's technology on Dell's corporate site to answer tech support queries, and then to AltaVista, the king of search back then, which wanted to use the technology to answer popular questions.

1998 - Ask.com reached 1 million queries per day

1998-1999 - Ask.com went public and had 800 employees

1999 was a turning point for Ask.com because of the emergence of another search engine: Google

2001 - the number of employees shrank to 200 and the company had financial problems

At this point everybody expected Ask.com to die (many blamed the search engine's dated image, namely the logo); the share price fell from $190 to $1 in 16 months.

2005 - Ask.com was bought by IAC/InterActiveCorp for $1.85 billion ($28.24 a share).

Technical overview

1999 - Ask acquired Direct Hit, a Massachusetts company that had developed the world's first "click popularity" search technology, which was licensed to MSN and Lycos, among others

2001 - Ask acquired Teoma, a 10-person start-up out of Rutgers University in New Brunswick, New Jersey, and its unique index and search relevancy technology. Teoma was the first, and is still the only, major search technology based upon the clustering concept of subject-specific popularity: ExpertRank. In fact, Teoma means "expert" in Gaelic.

Ask.com uses the ExpertRank algorithm to rank its results: "Our ExpertRank algorithm provides relevant search results by identifying the most authoritative sites on the Web. With Ask search technology, it's not just about who's biggest: it's about who's best. Our ExpertRank algorithm goes beyond mere link popularity (which ranks pages based on the sheer volume of links pointing to a particular page) to determine popularity among pages considered to be experts on the topic of your search. This is known as subject-specific popularity. Identifying topics (also known as "clusters"), the experts on those topics, and the popularity of millions of pages amongst those experts -- at the exact moment your search query is conducted -- requires many additional calculations that other search engines do not perform. The result is world-class relevance that often offers a unique editorial flavor compared to other search engines."

ExpertRank is in fact the algorithm that ran on Teoma, and it has been part of Ask.com since 2001.
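The difference between plain link popularity and subject-specific popularity can be illustrated with a toy example. This is only a sketch of the idea described in the quote, not Ask's actual ExpertRank implementation, which the source does not describe. The link graph, the host names and the set of "expert" pages are all made up for illustration.

```python
# Toy link graph: each key is a page, each value is the list of pages it links to.
# All names are hypothetical.
LINKS = {
    "portal.example": ["bigshop.example"],
    "news.example": ["bigshop.example"],
    "blog.example": ["bigshop.example"],
    "birdclub.example": ["fieldguide.example"],
    "ornithology.example": ["fieldguide.example", "bigshop.example"],
}

# Assumption: some earlier step has already decided which pages are
# experts on the topic of the query (here: birds).
EXPERTS_ON_BIRDS = {"birdclub.example", "ornithology.example"}


def link_popularity(links):
    """Count inbound links per page, whoever the linking page is."""
    counts = {}
    for targets in links.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


def subject_specific_popularity(links, experts):
    """Count inbound links per page, but only links coming from expert pages."""
    expert_links = {src: dst for src, dst in links.items() if src in experts}
    return link_popularity(expert_links)


overall = link_popularity(LINKS)
among_experts = subject_specific_popularity(LINKS, EXPERTS_ON_BIRDS)
```

In this toy graph bigshop.example wins on raw link counts (4 links against 2), while fieldguide.example wins among the bird experts (2 links against 1), which is the kind of reordering the quote describes.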

About the crawler

  • The crawler goes to a Web address (URL) and downloads the HTML page.
  • The crawler follows hyperlinks from the page, which are URLs on the same site or on different sites.
  • The crawler adds new URLs to its list of URLs to be crawled. It continually repeats this function, discovering new URLs, following links, and downloading them.
  • The crawler excludes some URLs if it has downloaded a sufficient number from the Web site or if it appears that the URL might be a duplicate of another URL already downloaded.
  • The files of crawled URLs are then built into a search catalog. These URLs are displayed as part of the search results on sites powered by Ask's search technology when a relevant match is made.
  • The crawler will download only one page at a time from your site (specifically, from your IP address). After it receives a page, it will pause a certain amount of time before downloading the next page. This delay time may range from 0.1 second to hours. The quicker your site responds to the crawler when it asks for pages, the shorter the delay.
  • It obeys the noarchive tag (<META NAME="ROBOTS" CONTENT="NOARCHIVE"> or <META NAME="TEOMA" CONTENT="NOARCHIVE">). It also obeys the other robots meta tag variations.
  • It obeys the 1994 Robots Exclusion Standard.
  • It follows 301 and 302 redirects.
  • It follows HREF links, SRC links and redirects.
  • It can crawl dynamic URLs (they are subject to duplicate detection).
  • It supports gzip and other compression formats.
  • It supports the "Crawl-Delay" robots.txt directive.
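The download, follow and queue steps in the list above can be sketched as a small breadth-first loop. This is a minimal illustration under stated assumptions, not Ask's crawler: the function names, the regular expression used to find links, the per-site cap and the in-memory FAKE_WEB used in place of real HTTP requests are all hypothetical.

```python
import re
from collections import deque
from urllib.parse import urldefrag, urljoin, urlsplit

# Naive matcher for HREF and SRC attribute values (a real crawler would use an HTML parser).
LINK_RE = re.compile(r'(?:href|src)\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(base_url, html):
    """Return absolute URLs from HREF/SRC attributes, with #fragments removed."""
    links = []
    for raw in LINK_RE.findall(html):
        absolute, _fragment = urldefrag(urljoin(base_url, raw))
        links.append(absolute)
    return links


def crawl(seeds, fetch, max_pages_per_site=100, max_pages=1000):
    """Breadth-first crawl; fetch(url) returns the page HTML, or None on failure."""
    frontier = deque(seeds)   # URLs waiting to be crawled
    seen = set(seeds)         # exact-match duplicate check only
    per_site = {}             # pages downloaded so far, per host
    catalog = {}              # url -> html, the raw material for the search catalog
    while frontier and len(catalog) < max_pages:
        url = frontier.popleft()
        host = urlsplit(url).netloc
        if per_site.get(host, 0) >= max_pages_per_site:
            continue          # enough pages from this site already
        html = fetch(url)
        if html is None:
            continue
        per_site[host] = per_site.get(host, 0) + 1
        catalog[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return catalog


# A three-page fake web so the sketch runs without network access.
FAKE_WEB = {
    "http://a.example/": '<a href="/one">1</a> <a href="http://b.example/">b</a>',
    "http://a.example/one": '<a href="/">home</a> <img src="/logo.png">',
    "http://b.example/": '<a href="http://a.example/one#top">back</a>',
}

catalog = crawl(["http://a.example/"], FAKE_WEB.get)
capped = crawl(["http://a.example/"], FAKE_WEB.get, max_pages_per_site=1)
```

With the default cap all three fake pages end up in the catalog. With max_pages_per_site=1 the second page on a.example is skipped. The sketch leaves out the per-site pause between requests and any near-duplicate detection.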

The bot signature is: Mozilla/2.0 (compatible; Ask Jeeves/Teoma).
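A site owner who wants to slow this bot down or keep it out of certain directories can do so in robots.txt. The sketch below uses Python's standard urllib.robotparser to show how such rules would be read. It assumes the bot identifies itself in robots.txt as "Teoma" (the name used in the META tag variant above); the robots.txt content, the example URLs and the polite_delay helper with its factor of 10 are made up for illustration and are not Ask's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site that wants Teoma to go slowly.
ROBOTS_TXT = """\
User-agent: Teoma
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

ROBOTS_TOKEN = "Teoma"  # assumed robots.txt name for the Ask crawler

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def polite_delay(response_seconds, crawl_delay=None,
                 factor=10.0, floor=0.1, ceiling=3600.0):
    """Seconds to wait before the next request to the same site.

    Assumption: the wait grows with the site's response time and is
    clamped between `floor` and `ceiling`; a Crawl-delay value from
    robots.txt acts as a lower bound.
    """
    delay = min(max(response_seconds * factor, floor), ceiling)
    if crawl_delay is not None:
        delay = max(delay, float(crawl_delay))
    return delay


may_fetch_private = parser.can_fetch(ROBOTS_TOKEN, "http://www.example.com/private/a.html")
may_fetch_index = parser.can_fetch(ROBOTS_TOKEN, "http://www.example.com/index.html")
site_delay = parser.crawl_delay(ROBOTS_TOKEN)          # taken from the Crawl-delay line
wait = polite_delay(0.2, crawl_delay=site_delay)       # fast site, but robots.txt says 10
```

Note that Python's parser matches on the short product name, not on the full signature string, which is why the sketch passes "Teoma" and not the whole Mozilla/2.0 line.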

TODO: Why would Ask.com use such a signature? Sites that check browser credentials might block the bot from certain pages, or redirect it to pages with no content or with text like "Your browser is too old."