The SygolBot Spider

Information about the SygolBot spider is available at http://www.robotstxt.org/wc/active/html/sygol.html, but that listing is out of date. The correct information is:

Name SygolBot
Cover Page http://www.sygol.com
Details Page http://www.sygol.com/who.asp
Operational Status active
Description Very standard robot: it gets all words and links from a page, then indexes the words and stores the links for further crawling.
Robot Purpose indexing: gather pages for the Sygol search engine
Software Type standalone
Software Platform All Windows from 95 to latest.
Software Language Visual Basic
Availability none
Owner's Name Giorgio Galeotti
Owner's Home Page http://www.sygol.com
Owner's Email Address i n f o @ s y g o l . c o m
Exclusion Protocol yes
Exclusion Tag SygolBot
Supports NOINDEX Yes
Robot Host http://www.sygol.com
HTTP From No
HTTP User-Agent 1 SygolBot http://www.sygol.com
HTTP User-Agent 2 SygolBot http://www.sygol.it
History It all started in 1999 as a hobby to try crawling the web and putting together a good search engine with very little hardware resources.
Environment Hobby
Identifier sygol
Updated Mon, 07 Jun 2004 14:50:01 GMT
Update By Giorgio Galeotti

SYGOL respects the Robots Exclusion standard!
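For site owners, the following is a minimal sketch of how a robots.txt rule addressed to the SygolBot exclusion tag would be interpreted. It uses Python's standard urllib.robotparser, not Sygol's own code (which is written in Visual Basic and not published), and the example.com URLs and the /private/ path are purely illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: blocks SygolBot from /private/ and leaves
# every other robot unrestricted. The paths are made up for this example.
ROBOTS_TXT = """\
User-agent: SygolBot
Disallow: /private/

User-agent: *
Disallow:
"""


def sygolbot_may_fetch(robots_txt: str, url: str) -> bool:
    """Return True if the rules in robots_txt let SygolBot fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("SygolBot", url)


if __name__ == "__main__":
    for url in ("http://www.example.com/index.html",
                "http://www.example.com/private/page.html"):
        print(url, sygolbot_may_fetch(ROBOTS_TXT, url))
```

A robot that follows the Robots Exclusion standard, as SYGOL states it does, would skip the second URL and crawl the first.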

Before downloading any page from a domain, the spiders look for that domain's exclusions in the local cache (entries are at most 3 weeks old). If the cache holds nothing for the domain, the spiders look for a robots.txt file on the domain. If a robots.txt file is found that disallows something, the spiders will:

  1. Delete the old exclusions, if any, from the cache to make room for the newly found ones.

  2. Store the new exclusions in the cache to minimize bandwidth.

  3. Mark all URLs in the domain inadvertently spidered in the past as 'to be deleted from the index'.

  4. Within minutes, delete from the index all the excluded pages from point 3.
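The four steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the CrawlerState class, its in-memory cache and index, and the function names are hypothetical stand-ins for Sygol's unpublished internals. It also assumes that step 3 marks only those URLs that match a newly found exclusion, which is how step 4 reads.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class CrawlerState:
    # Hypothetical layout: domain -> (time stored, disallowed path prefixes).
    cache: dict = field(default_factory=dict)
    # Hypothetical stand-in for the search index: the set of indexed URLs.
    index: set = field(default_factory=set)
    # URLs marked 'to be deleted from the index'.
    to_delete: set = field(default_factory=set)


def is_excluded(url, domain, prefixes):
    """True if url belongs to domain and its path starts with a disallowed prefix."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return parts.hostname == domain and any(path.startswith(p) for p in prefixes)


def apply_exclusions(state, domain, prefixes, now):
    """Steps 1 to 3: replace the cached exclusions and mark matching URLs."""
    prefixes = [p for p in prefixes if p]            # ignore empty Disallow values
    state.cache.pop(domain, None)                    # step 1: drop old exclusions
    state.cache[domain] = (now, prefixes)            # step 2: store the new ones
    marked = {u for u in state.index if is_excluded(u, domain, prefixes)}
    state.to_delete |= marked                        # step 3: mark for deletion
    return marked


def purge_marked(state):
    """Step 4: remove every marked URL from the index; return how many were removed."""
    removed = len(state.to_delete & state.index)
    state.index -= state.to_delete
    state.to_delete.clear()
    return removed
```

In this sketch purge_marked would be run by a separate periodic job, which matches the "within minutes" delay between marking and deletion.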

If a robots.txt file is not found, or it is found but does not disallow anything, nothing is cached, so the spiders look for the robots.txt file again before downloading each and every page in the domain. There are too many domains on the Internet to store this "allows everything" information for all of them in a database.
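The overall lookup policy, including the case where nothing is cached, might look like the sketch below. The names exclusions_for and fetch_disallows are hypothetical, and fetch_disallows is an injected placeholder for whatever downloads and parses robots.txt. Treating a cache entry older than 3 weeks as missing is an assumption drawn from the "at most 3 weeks old" remark.

```python
from typing import Callable, Dict, List, Optional, Tuple

MAX_AGE_SECONDS = 21 * 24 * 3600  # cache entries are "at most 3 weeks old"

# domain -> (time stored, disallowed path prefixes); hypothetical layout
Cache = Dict[str, Tuple[float, List[str]]]


def exclusions_for(
    domain: str,
    cache: Cache,
    fetch_disallows: Callable[[str], Optional[List[str]]],
    now: float,
) -> List[str]:
    """Return the disallowed prefixes for domain, consulting the cache first.

    fetch_disallows(domain) is assumed to return None when no robots.txt
    exists, or the list of disallowed prefixes (possibly empty) otherwise.
    """
    entry = cache.get(domain)
    if entry is not None:
        stored_at, prefixes = entry
        if now - stored_at <= MAX_AGE_SECONDS:
            return prefixes          # fresh cached exclusions: no download needed
        del cache[domain]            # assumed: stale entries are treated as missing

    prefixes = fetch_disallows(domain) or []
    if prefixes:
        # Only domains that disallow something are cached.
        cache[domain] = (now, prefixes)
    # "Allows everything" results are deliberately not stored, so the next
    # page from this domain triggers another robots.txt lookup.
    return prefixes
```

The trade-off is visible in the last branch: domains that restrict nothing cost one extra robots.txt request per page, in exchange for a cache that holds only the domains that actually restrict crawling.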

Notes