Author Topic: our scrape engine  (Read 1292 times)

lacadaemon
« on: April 05, 2012, 06:27:33 PM »
this is a cross-post from the thread "your user control panel is not reporting peer stats". a brief explanation of our backend scrape engine:

i guess i should explain the new system to give you an idea of the challenge for stat reporting. the h33t system reports on all trackers for all torrents. there are approximately 27,000 trackers registered in the h33t db, many of which are bad trackers or pid trackers (tracker per torrent). of those 27,000 trackers only about 6,000 are interesting for scraping stats. of those 6,000 good trackers only about 11%, exactly 662, have fullscrape enabled, which permits me to call a scrape on all their torrents at once. it is important to know that the bittorrent protocol informally requires that i only use fullscrape to pull torrent stats. other sites like isohunt do scrape individual torrents, but this is considered an abuse of the protocol because of the strain it puts on remote sites and the network as a whole
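for anyone curious what a fullscrape looks like on the wire, here is a minimal python sketch. it is not the h33t engine (that is C++), and the function names are made up for illustration. it assumes the usual tracker convention: the scrape url is the announce url with "announce" swapped for "scrape" in the last path segment, a scrape with no info_hash parameter returns every torrent, and the reply is a bencoded dictionary with a "files" key

```python
from typing import Optional


def scrape_url(announce_url: str) -> Optional[str]:
    """Derive the scrape URL from an announce URL, or None if the tracker
    does not follow the announce -> scrape naming convention."""
    head, sep, tail = announce_url.rpartition("/")
    if not sep or not tail.startswith("announce"):
        return None
    return head + "/scrape" + tail[len("announce"):]


def bdecode(data: bytes, i: int = 0):
    """Decode one bencoded value at offset i; return (value, next_offset)."""
    c = data[i:i + 1]
    if c == b"i":  # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if c in (b"l", b"d"):  # list or dictionary, both terminated by "e"
        items = []
        i += 1
        while data[i:i + 1] != b"e":
            item, i = bdecode(data, i)
            items.append(item)
        if c == b"l":
            return items, i + 1
        # dictionaries are stored as alternating key, value entries
        return dict(zip(items[0::2], items[1::2])), i + 1
    # byte string: <length>:<bytes>
    colon = data.index(b":", i)
    length = int(data[i:colon])
    start = colon + 1
    return data[start:start + length], start + length


def parse_fullscrape(body: bytes) -> dict:
    """Map hex info_hash -> (seeders, leechers, completed downloads)."""
    decoded, _ = bdecode(body)
    stats = {}
    for info_hash, s in decoded.get(b"files", {}).items():
        stats[info_hash.hex()] = (
            s.get(b"complete", 0),
            s.get(b"incomplete", 0),
            s.get(b"downloaded", 0),
        )
    return stats
```

one http request per tracker brings back stats for every torrent it knows about, which is why fullscrape is so much kinder to the remote site than one request per torrent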

those 662 trackers place over 5,500,000 individual torrent stats into the db and it is those stats that i display on the site. it takes the scrape engine 2 hours 45 minutes (165 minutes) to scrape those 662 trackers and write the 5,500,000 results into the db. that is an average of 4 trackers per minute (or 15 seconds per tracker), and an average of 33,333 torrent stats per minute (or 556 per second). the scrape engine is single-threaded C++ and the server is an 8GB RAM quad-core 2.4GHz Xeon. the db is 100% dynamic in that new trackers and torrents are constantly being uploaded by members, and the system must decide between new good trackers and new bad trackers whilst it runs the fullscrape on the good ones. all this is supported by a 16GB RAM master database server with 16x 2.8GHz cpu cores, crunching not only the torrent stat writes/updates but also pushing out those stats for the main site search engine and other queries. it is interesting for everyone to know that this big server was paid for by the h33t community back in the days when i ran a fundraiser for equipment. she is a beauty and keeps us alive
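if you want to check the averages yourself, the arithmetic is simple enough to sketch in a few lines of python. the function name is just for illustration, the inputs are the figures quoted above

```python
def cycle_metrics(trackers: int, stats: int, cycle_minutes: float) -> dict:
    """Average throughput figures for one full scrape cycle."""
    return {
        "trackers_per_minute": trackers / cycle_minutes,
        "seconds_per_tracker": cycle_minutes * 60 / trackers,
        "stats_per_minute": stats / cycle_minutes,
        "stats_per_second": stats / (cycle_minutes * 60),
        "cycles_per_day": 24 * 60 / cycle_minutes,
    }


metrics = cycle_metrics(trackers=662, stats=5_500_000, cycle_minutes=165)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

the last figure works out to a little under 9 full cycles per 24 hours when the engine runs back to back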

i should also mention that the system scrapes each torrent individually at the point of upload, for the trackers contained inside the torrent. the scrape engine then updates the torrent stats with all other trackers on the torrent when it gets round to that torrent in its cycle

there are ways to increase the frequency of torrent stats, to reduce the 165 minute cycle and bring the stats closer to live. for example: code a new multi-threaded scrape engine and/or create multiple instances of the scrape. there are external conditions to be considered, most importantly that fullscrape of a remote tracker should not be abused. we are already scraping all trackers roughly 9 times per 24 hours, and an increase in this could be considered spam and/or abuse by the administrators of those remote trackers
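to make the trade-off concrete, here is a rough python sketch of a multi-threaded scrape cycle that still refuses to hit any tracker more often than a chosen minimum interval. this is an illustration only, not the h33t engine: the class name, the `fetch` callback and the `min_interval_s` parameter are all hypothetical

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


class PoliteScraper:
    """Scrape trackers in parallel, but never the same tracker more than
    once per min_interval_s seconds."""

    def __init__(self, fetch: Callable[[str], object], min_interval_s: float,
                 workers: int = 8, clock: Callable[[], float] = time.monotonic):
        self.fetch = fetch                  # hypothetical: does the actual fullscrape
        self.min_interval_s = min_interval_s
        self.workers = workers
        self.clock = clock
        self.last_scraped: Dict[str, float] = {}

    def due(self, trackers: List[str]) -> List[str]:
        """Trackers never scraped, or last scraped at least min_interval_s ago."""
        now = self.clock()
        return [t for t in trackers
                if t not in self.last_scraped
                or now - self.last_scraped[t] >= self.min_interval_s]

    def _safe_fetch(self, tracker: str):
        # a dead or misbehaving tracker must not abort the whole cycle
        try:
            return self.fetch(tracker)
        except Exception:
            return None

    def run_cycle(self, trackers: List[str]) -> Dict[str, object]:
        """Scrape every tracker that is due; return results keyed by tracker."""
        todo = self.due(trackers)
        started = self.clock()
        results = {}
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            for tracker, outcome in zip(todo, pool.map(self._safe_fetch, todo)):
                # stamp failures too, so a broken tracker is not hammered
                self.last_scraped[tracker] = started
                if outcome is not None:
                    results[tracker] = outcome
        return results
```

with `min_interval_s=165 * 60` the remote trackers see the same request rate as today and the threads only shorten the time each cycle takes. bringing the stats closer to live means lowering that interval, and that is exactly the part the remote admins might object to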

it is a complex system and i am open to suggestions. i built this system working alone. it has been very much a voyage of discovery, and hence difficult to discuss with the community because i did not always know what was possible. i don't know how many other sites have a stats capability even close to h33t in terms of frequency of updates and accuracy. if somebody shows me a site with better stats then i will consider investing in a system that increases the frequency to once per hour. atm i am investing all the learning from h33t into a new system, because i think it is a better use of my resources to design a torrent site that fixes all the problems and challenges of our now aging legacy system. in other words, i am writing a system that puts my 6 years' worth of experience of this site to use: it is how i would have built h33t in the beginning if i knew then what i know now. needless to say there is a ton of stuff that will be a first and found on no other site

i hope you enjoy this small insight into the h33t backend system. maybe i will also do a short description of the front end systems, which are significantly more complex in terms of integrated services. i am particularly proud of how the h33t interface appears so simple while presenting a large, totally dynamic and complex data set. and of course it has to be fast