Nutch and crawling millions of websites
Can I use Nutch 1.10 to crawl millions of websites over several rounds?
I don't understand the database that is created when Nutch 1.10 is launched. Is it enough to crawl the important data of each site?
I have a file containing my list of URLs that is 2 gigabytes in size.
Yes, you can; that is exactly what Nutch was built for. However, crawling millions of websites takes time and space, and in order to do it you need to set up your environment correctly.
In Nutch 1.x the "crawling database" (the CrawlDb), which holds the URLs already visited, the URL frontier (the next URLs to visit), etc., is persisted on the Hadoop filesystem. This is the place where you'll first inject your list of URLs.
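As a sketch of that first injection step (the `urls/` and `crawl/` paths here are examples, not required names; this assumes you are in the Nutch runtime directory of a working 1.x install):

```shell
# Put the seed URLs into a directory of plain-text files,
# one URL per line (here a hypothetical example seed).
mkdir -p urls
echo "http://example.com/" > urls/seed.txt

# Inject the seeds into the crawl database (CrawlDb).
# With Hadoop configured, crawl/crawldb lives on HDFS;
# in local mode it is just a directory on disk.
bin/nutch inject crawl/crawldb urls
```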
In addition, in order to view the indexed data, you can use Solr (or Elasticsearch).
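For example, with a Solr core already running, indexing a fetched segment might look roughly like this (the Solr URL and segment path are placeholders; the exact indexing options changed across 1.x releases, so check the tutorial for your version):

```shell
# Push parsed segment data into Solr via the indexer
# (assumes the indexer-solr plugin is enabled in nutch-site.xml).
bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch \
  crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
```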
I recommend first going through the Nutch 1.x tutorial with a short list of URLs to get to know how to use Nutch and its plugins.
After that, set up a Hadoop cluster following the tutorials on the Hadoop site, and crawl away!
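Once everything is in place, the bundled crawl script can drive the inject/generate/fetch/parse/updatedb cycle for a given number of rounds, which is how you'd run the "several rounds" you asked about (paths, Solr URL, and round count below are illustrative):

```shell
# Run 10 crawl rounds over the seeds in urls/, storing state in crawl/
# and indexing results into the given Solr core after each round.
bin/crawl urls crawl http://localhost:8983/solr/nutch 10
```

Each round generates a new fetch list from the CrawlDb, fetches and parses it, and merges the results back, so the database grows incrementally rather than crawling everything at once.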