Verasys 2k

Published: 16 July 2006
Author: Serban Ghita
About the project

The project started as my interest in database mining grew. I was always puzzled by how Google, Yahoo and other major search engines gather and show the results for a given query. Every database developer knows that finding and returning the right results takes more than a simple regex query. The results MUST be relevant to the user.

In order to do that you have to think about certain aspects before you start developing your extractor script:

  • what is the size of the database (scalability issues)
  • what type of data rows are we searching (important for the way we rank the results)
  • are there any extra sorting measures that we have to bear in mind (maybe you find two pages from the same URL that rank the same, and maybe you want to show them both, <<Google style>>, as they may be relevant to the user)

At first I knew I could never match the power of a real search engine, so I started small. I gathered 400+ .ro sites that share the same topic: tourism and travel. I chose this topic because I knew it would give me a varied set of sites and pages:

  • sites ranging from small (1-20 unique pages) to medium-sized (15,000+ unique pages)
  • pages written in bad HTML as well as pages laid out fully in CSS
  • URLs ranging from ones filled with variables and session IDs to clean, custom-made URLs
  • all kinds of redirects: HTTP header redirects, JavaScript redirects, etc.

As time passed, I ran into a lot of trouble dealing with:

  • creating the right sitemap, gathering all the internal links from the site
  • all sorts of redirects: HTTP header, JavaScript and meta redirects; frames were a problem too
  • sloppy code: not everyone knows how to write a decent page (not every page has an <html> tag at the beginning or at the end; the same applies to basic tags like <head>, <title> and <body>)
  • ugly URLs that carried all sorts of characters and garbage variables, not to mention session IDs; I also realised that not every page has a standard extension (e.g. .htm, .html, .php)
  • timeouts and networking problems (I still have those, because I am quite limited at timeout management and networking is not really my field)
  • diversity of languages, charsets
  • html entities and custom HTML tags
  • stripping the tags
  • database growth
  • data processing, which is hardware consuming (regex segmentation faults, etc.)
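
To make the "ugly URLs" problem concrete, here is a minimal sketch of URL normalization, written in Python for illustration rather than in the PHP the crawler uses. The function name `normalize_url` and the list of session parameter names are assumptions made up for this example, not taken from the VERASYS 2k code.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of session-style parameter names; a real crawl would tune this.
SESSION_PARAMS = {"phpsessid", "sid", "sessionid", "jsessionid"}

def normalize_url(url):
    """Return a canonical form of url so duplicates can be detected."""
    parts = urlsplit(url)
    # Keep only the query parameters that are not session identifiers.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in SESSION_PARAMS]
    query.sort()  # the same parameters in a different order map to one URL
    path = parts.path or "/"
    # Lowercase scheme and host; drop the fragment, which never reaches the server.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path,
                       urlencode(query), ""))

print(normalize_url("HTTP://Example.RO/page.php?b=2&PHPSESSID=abc123&a=1#top"))
```

Two links that differ only by session ID, parameter order or fragment then collapse to the same sitemap entry. Sorting the parameters is a design choice that assumes their order does not matter to the site, which is usually but not always true.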

I ran the project online for 3 months, until November 2005, and during this time other problems appeared:

  • database design (transition from old concepts to new concepts takes time if you are dealing with a lot of data)
  • result caching (I had to cache the results for the queries, because the database could not cope with URL -> PAGE -> WORD ranking and ordering on the fly)
  • slow results for multiple terms in a query (ranking takes more computation)
  • certain terms returned unwanted results because of spamming techniques
  • refreshing and updating the database
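
The result caching mentioned above can be sketched roughly as follows. This is an illustrative in-memory version in Python, not the storage the project actually used (the post does not say); the class name `QueryCache`, the one-hour lifetime and the sample query are all assumptions.

```python
import time

class QueryCache:
    """Tiny in-memory cache for search results with a time-to-live."""

    def __init__(self, ttl_seconds=3600, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable so the expiry logic can be tested
        self._store = {}     # normalized query -> (expires_at, results)

    @staticmethod
    def _key(query):
        # Treat "Hotel  Mamaia" and "hotel mamaia" as the same query.
        return " ".join(query.lower().split())

    def get_or_compute(self, query, compute):
        key = self._key(query)
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]    # fresh cached results, no ranking needed
        results = compute(key)   # the expensive ranking step
        self._store[key] = (now + self.ttl, results)
        return results

cache = QueryCache(ttl_seconds=3600)
results = cache.get_or_compute("Hotel  Mamaia", lambda q: ["page for " + q])
```

The expiry time ties caching to the last item in the list: cached results go stale whenever the database is refreshed, so the lifetime has to be shorter than the update cycle, or the cache has to be emptied after each update.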

The project is temporarily offline because of the research time that I have to invest in order to get things going. I realised that a crawler is mainly made of 3 components:

  1. the fetcher (grabs HTML documents and helps in creating the sitemap)
  2. the processor (reads the locally saved HTML files and computes the ranks for each word in every page)
  3. the searcher (handles queries and returns results)
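
As a rough illustration of how the last two components fit together, here is a small Python sketch. The fetcher is replaced by a plain dictionary of already downloaded text, and the rank is simple term frequency, which is only a stand-in: the post does not describe the actual ranking formula. The function names `process` and `search` are made up for this example.

```python
import re
from collections import Counter, defaultdict

def process(pages):
    """Processor step: build word -> {url: rank} from already fetched text."""
    index = defaultdict(dict)
    for url, text in pages.items():
        words = re.findall(r"\w+", text.lower())
        for word, count in Counter(words).items():
            # Assumed rank: the share of the page's words that are this word.
            index[word][url] = count / len(words)
    return index

def search(index, query):
    """Searcher step: add up the ranks of every query term per page."""
    scores = defaultdict(float)
    for term in re.findall(r"\w+", query.lower()):
        for url, rank in index.get(term, {}).items():
            scores[url] += rank
    # Highest score first; ties are broken by URL so the order is stable.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Stand-in for the fetcher: URL -> text with the tags already stripped.
pages = {
    "a.html": "hotel hotel sea",
    "b.html": "hotel mountain cabin lake",
}
index = process(pages)
print(search(index, "hotel"))
```

Keeping the index as word -> page -> rank means the searcher only touches the pages that contain a query term, which is why the heavy work sits in the processor and can be done offline.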

As you can see, all 3 major processes are very complex; each has its own problems and challenges. Of course, those of you who work in the search engine industry might say: "what about the spam cleaner, the duplicate cleaner ... " or other intermediate processes that might exist between the processor and the searcher. I did not mention those because I will try to discuss them later, when I have some practical examples and some experience with those processes.

All the aspects described above come from situations encountered with the VERASYS 2k crawler from mid 2005 until February 2006. At that point VERASYS 2k was, and still is, a project that crawls certain web pages and analyses them.

At the beginning of the new year (2006) came another challenge for the 2k engine. As we progressed with the code for iT PROMO (the biggest Romanian IT&C e-commerce shop), I knew I had to give visitors a fast and reliable way of finding the desired product on the site. The only solution, besides an easy navigational scheme, was to add a search engine that would find products, and find them well.

This is when I split the project in 2 parts and reused the theory from the first VERASYS 2k engine. Not only the theory, but also the powerful concept that drives every major search engine: "Find fast and relevant results for the query input". Finding the results fast would not be a major problem, because we are talking about a database with 90,000+ records. So we have a half static / half dynamic database: the structure remains the same all the time and 20% of the records change or get updated daily.

The second component, and the major problem, is returning relevant results for the query input. Now that's a real challenge! I took a quick look at other e-commerce shops but could not find any references or similar examples (something to build upon), so I guess I am at least the first who attempts this. The final result (and yes, I still have some improvements to make) can be seen in the iT PROMO section 'Cauta in site' (that's 'Search the site' in English).

The features that this little 2k search has are:

  • all sorts of advanced search (any/all/exact match, refining by category, price range, etc.)
  • real time ranking
  • real time spelling correction (not all terms covered yet), synonyms
  • suggestions
  • multiple term search
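
One simple way to get spelling suggestions like the ones listed above is fuzzy matching against the list of known terms. The sketch below uses Python's standard `difflib` module; it is an assumption about how such a feature could work, not a description of the iT PROMO search, and the function name `suggest`, the sample vocabulary and the 0.75 cutoff are all made up for the example.

```python
import difflib

def suggest(term, vocabulary, limit=3):
    """Return known terms that look like `term`, best match first."""
    term = term.lower()
    if term in vocabulary:
        return [term]   # already a known term, nothing to correct
    # cutoff is a similarity threshold between 0 and 1; 0.75 is an arbitrary choice.
    return difflib.get_close_matches(term, vocabulary, n=limit, cutoff=0.75)

# Hypothetical vocabulary, e.g. the distinct words found in product names.
vocabulary = ["monitor", "mouse", "keyboard", "printer"]
print(suggest("moniter", vocabulary))
```

Comparing against every known term is fine for a small vocabulary; with many thousands of distinct words a precomputed index of misspellings would likely be needed to keep the correction real time.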

More information on the VERASYS 2k crawler

The crawler is made using:

  • PHP
  • MySQL
  • XML

PHP and MySQL may both sound like "limitations" to you, but I developed 2k on these software packages because they are part of my daily job. I guess the only thing without frontiers, and fancy, might be the XML part, but don't expect too much, because I've used only basic XML.

At present I use VERASYS 2k for research, such as statistics about HTML tags and attributes on a web page.

TODO

Any feedback sent to my e-mail address would be highly appreciated!

  • create a database structure with HTML tags (including deprecated ones) and attributes, and link them together
  • create the script that parses already saved HTML in order to extract the statistics about PAGE -> TAG -> ATTRIBUTE
  • create a method of learning "new" attributes
  • find a way to parse HTML documents using DTD (Document Type Definition) files
  • find a way to parse and validate HTML documents just like the W3C Validator does
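
The PAGE -> TAG -> ATTRIBUTE statistics from the list above could start from something like this Python sketch, built on the standard `html.parser` module. It only counts what it sees; it does not use a DTD and does not validate anything, so it covers the second item at most. The names `TagStats` and `page_stats` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

class TagStats(HTMLParser):
    """Count every start tag and every (tag, attribute) pair in a page."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.attributes = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag and attribute names already lowercased.
        self.tags[tag] += 1
        for name, _value in attrs:
            self.attributes[(tag, name)] += 1

def page_stats(html):
    """Return (tag counts, (tag, attribute) counts) for one saved page."""
    parser = TagStats()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.attributes

# Sloppy input on purpose: no <html>, an unclosed <p>, uppercase tag names.
tags, attributes = page_stats('<p class="a">Hi <A HREF="x" target="_blank">link</A><p>unclosed <br/>')
print(tags.most_common())
```

Because the parser is tolerant of missing and unclosed tags, it still produces counts for the kind of sloppy pages described earlier. An attribute that shows up in these counts but is missing from the tag database would be a candidate "new" attribute.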
