Hacker News

Their dump would be more useful if they...

(1) Used a preexisting aggregate web content format. Their ad hoc format is simple enough, but can't handle content with NULLs, and loses valuable information (such as time of capture -- you can't trust server 'Date' headers -- and resolved IP address at time of collection).

They could use the Internet Archive classic 'ARC' format (not to be confused with the older compression format of the same name):

http://www.archive.org/web/researcher/ArcFileFormat.php

Or the newer, more involved and chatty but still relatively straightforward 'WARC' format:

http://archive-access.sourceforge.net/warc/
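To make the point about lost metadata concrete, here is a hand-built sketch of a minimal WARC/1.0 'response' record, showing where the capture time and resolved IP live. This is illustrative only -- a real archive should use a maintained library that also handles digests and record IDs.

```python
from datetime import datetime, timezone

def warc_response_record(url: str, ip: str, payload: bytes) -> bytes:
    """Build a minimal WARC/1.0 'response' record by hand (sketch)."""
    # WARC-Date records the time of capture, independent of any
    # (untrustworthy) server 'Date' header.
    date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {url}\r\n"
        f"WARC-Date: {date}\r\n"
        f"WARC-IP-Address: {ip}\r\n"   # resolved IP at collection time
        "Content-Type: application/http; msgtype=response\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    )
    return headers.encode("utf-8") + payload + b"\r\n\r\n"

record = warc_response_record("http://example.com/", "93.184.216.34",
                              b"HTTP/1.1 200 OK\r\n\r\nhello")
```

An ad hoc line-oriented dump format has nowhere standard to put these fields, which is exactly the information that gets lost.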

(2) Explained how the 3.2 million pages in their initial dump were chosen. (That's only a tiny sliver of the web; where did they start and what did they decide to collect and put in this dataset?)

(FYI, I work at the Internet Archive.)



I don't want to hijack the thread, but I'd love to see an "Ask someone who works at the Internet Archive" post.


Also a list of indexed files would be interesting. I've recently looked for a spider of a couple of large sites (for us to use for demos for potential customers), but I'd like to know what's in there before sorting out space to decompress a 22 GB file.

I notice the Internet Archive page you link mentions that those files are no longer accessible -- are there similar places where you can grab spidered content?


Bulk data access to the historic archive (other than via the public Wayback Machine) is currently only available by special arrangement with research projects. We don't really have a good system for enabling such access, so it happens rarely, on a case-by-case basis.

If you just need fresh web content, it's not hard to collect for yourself quite a bit of broad material in a short period on a small budget, with an open source crawler.

The data from Dotbot might be good, or potential data feeds from Wikia Search/Grub.


If I'm not mistaken, Wikia does not want you to download their crawls using a robot: http://soap.grub.org/arcs/robots.txt


They may not want automated downloads, but that robots.txt is in the wrong place. The standard only provides for robots.txt to be found and respected at the root (/robots.txt), not any subdirectory-lookalike path (/arcs/robots.txt).
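A quick sketch of the rule: a standards-compliant robot derives the robots.txt location from the host alone, so a file at a subdirectory path is never consulted.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(any_url: str) -> str:
    """Where a compliant robot looks for robots.txt: always at the
    root of the host, never at a subdirectory-lookalike path."""
    parts = urlsplit(any_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

# A rules file at /arcs/robots.txt is simply never fetched:
robots_txt_url("http://soap.grub.org/arcs/some-crawl.arc.gz")
# -> "http://soap.grub.org/robots.txt"
```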


I also find it strange that there are no 304 status codes. Does that mean they blindly re-download each page whenever they update their index?

Unless you are Google or Yahoo with thousands of servers, you could save some time by only processing pages that have actually been modified.
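The mechanism is a conditional GET: save the validators from the previous response and send them back on the re-fetch. The record shape below is hypothetical, but the `If-None-Match`/`If-Modified-Since` headers are standard HTTP; a server that supports them answers 304 Not Modified with an empty body, so unchanged pages cost almost nothing.

```python
def conditional_headers(last_fetch: dict) -> dict:
    """Build HTTP headers for a conditional re-fetch from validators
    saved off the previous response (hypothetical record shape)."""
    headers = {}
    if last_fetch.get("etag"):
        headers["If-None-Match"] = last_fetch["etag"]
    if last_fetch.get("last_modified"):
        headers["If-Modified-Since"] = last_fetch["last_modified"]
    return headers

h = conditional_headers({"etag": '"abc123"',
                         "last_modified": "Tue, 15 Nov 1994 12:45:26 GMT"})
```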


gojomo, I have looked at the file formats. Could you propose some off-the-shelf web spiders? I would like to accumulate a lot of text for NLP research.


I don't work at Internet Archive, but I've found Heritrix is great. Crawling is all about crazy wonky edge cases and it deals with lots of them.

Use the version 1.x series, not the new version 2. 1.x is less convoluted and easier to use; there might be some second-system effect at work (http://en.wikipedia.org/wiki/Second-system_effect), I think.


Try out Nutch - http://lucene.apache.org/nutch/ - or the one archive.org uses, Heritrix - http://crawler.archive.org/


At the Internet Archive we've created Heritrix for 'archival quality' crawling -- especially when you want to get every media type, and sites to complete/arbitrary depth, in large but polite crawls. (It's possible, but not usual for us, to configure it to only collect textual content.)

The Nutch crawler is also reasonable for broad survey crawls. HTTrack is also reasonable for 'mirroring' large groups of sites to a filesystem directory tree.


Could you outline how to configure it to collect only textual content?


Very roughly:

(1) Add a scope rule that throws out discovered URIs with popular non-textual extensions (.gif, .jpe?g, .mp3, etc.) before they are even queued.

(2) Add a 'mid-fetch' rule to FetchHTTP module that early-cancels any fetches with unwanted MIME types. (These rules run after HTTP headers are available.)

(3) Add a processor rule to whatever writes your content to disk (usually ARCWriterProcessor) that skips writing results of unwanted MIME types (such as the early-cancelled non-textual fetches above).

Followup questions should go to the Heritrix project discussion list, http://groups.yahoo.com/group/archive-crawler/ .


Also, what do you mean by "broad survey crawls"?


'Broad' crawl means roughly, "I'm interested in the whole web, let the crawler follow links everywhere." Even starting with a small seed set, such crawls quickly expand to touch hundreds of thousands or millions of sites.

Even as a fairly large operation, you might be happy with a representative/well-linked set of 10 million, 100 million, 1 billion, etc. URLs -- which is only a subset of the whole web, hence a 'survey'.
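The core dynamic is just breadth-first frontier expansion from a seed set, capped at a budget. A toy sketch (the link map stands in for real fetching and link extraction):

```python
from collections import deque

def broad_survey(seeds, links, limit):
    """Expand a frontier breadth-first from a small seed set.

    `links` maps a URL to the URLs it links to (a stand-in for real
    fetch/parse). Even a few seeds quickly fan out across many hosts;
    `limit` caps the survey at a fixed number of URLs.
    """
    seen, frontier = set(), deque(seeds)
    while frontier and len(seen) < limit:
        url = frontier.popleft()
        if url in seen:
            continue
        seen.add(url)
        frontier.extend(u for u in links.get(url, ()) if u not in seen)
    return seen

links = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"]}
survey = broad_survey(["a"], links, 4)
```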

A contrasting kind of crawl would focus on some smaller set of sites/domains you want to crawl as deeply and completely as possible. You might invest weeks or many months to collect these in a gradual, polite manner.



