Nutch and crawling millions of websites
Can I use Nutch 1.10 to crawl millions of websites over several rounds?
I don't understand the database that is created when Nutch 1.10 is launched. Is it enough to crawl the important data of each site?
I have a file containing my list of URLs that is 2 gigabytes in size.
Yes, you can; that is exactly what Nutch was built for. However, crawling millions of websites takes time and space, and in order to do it you need to set up your environment correctly.
In Nutch 1.x the "crawling database" (the CrawlDb), which holds the URLs already visited, the URL frontier (the next URLs to visit), etc., is persisted on the Hadoop filesystem. This is the place where you'll first inject your list of URLs.
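As a sketch of that first injection step (the `urls/` and `crawl/` paths here are examples, not required names; this assumes you are in the Nutch runtime directory of a working 1.x install):

```shell
# Put the seed URLs into a directory of plain-text files,
# one URL per line (here a hypothetical example seed).
mkdir -p urls
echo "http://example.com/" > urls/seed.txt

# Inject the seeds into the crawl database (CrawlDb).
# With Hadoop configured, crawl/crawldb lives on HDFS;
# in local mode it is just a directory on disk.
bin/nutch inject crawl/crawldb urls
```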
In addition, in order to view the indexed data, you can use Solr (or Elasticsearch).
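For example, with a Solr core already running, indexing a fetched segment might look roughly like this (the Solr URL and segment path are placeholders; the exact indexing options changed across 1.x releases, so check the tutorial for your version):

```shell
# Push parsed segment data into Solr via the indexer
# (assumes the indexer-solr plugin is enabled in nutch-site.xml).
bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch \
  crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
```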
I recommend first going through the Nutch 1.x tutorial with a short list of URLs to get to know how to use Nutch and its plugins.
After that, set up a Hadoop cluster following the tutorials on the Hadoop site, and crawl away!
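Once everything is in place, the bundled crawl script can drive the inject/generate/fetch/parse/updatedb cycle for a given number of rounds, which is how you'd run the "several rounds" you asked about (paths, Solr URL, and round count below are illustrative):

```shell
# Run 10 crawl rounds over the seeds in urls/, storing state in crawl/
# and indexing results into the given Solr core after each round.
bin/crawl urls crawl http://localhost:8983/solr/nutch 10
```

Each round generates a new fetch list from the CrawlDb, fetches and parses it, and merges the results back, so the database grows incrementally rather than crawling everything at once.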