C# - Using threads to parse multiple HTML pages faster
Here's what I'm trying to do:
- Get one HTML page URL that contains multiple links
- Visit each link
- Extract data from the visited link and create an object using it
So far I have done it the simple, slow way:
    public List<Link> SearchLinks(string name)
    {
        List<Link> foundLinks = new List<Link>();
        // GetHtmlDocument() returns an HtmlDocument for the input URL.
        HtmlDocument doc = GetHtmlDocument(AU_SEARCH_URL + FixSpaces(name));
        var link_list = doc.DocumentNode.SelectNodes(@"/html/body/div[@id='parent-container']/div[@id='main-content']/ol[@id='searchresult']/li/h2/a");
        foreach (var link in link_list)
        {
            // TODO: threads
            // GetObject() creates an object using the data gathered.
            foundLinks.Add(GetObject(link.InnerText, link.Attributes["href"].Value, GetLatestEpisode(link.Attributes["href"].Value)));
        }
        return foundLinks;
    }
To make it faster and more efficient I need to implement threads, but I'm not sure how I should approach it, because I can't just start threads at random; I need to wait for them to finish. Thread.Join() kind of solves the "wait for threads to finish" problem, but then I think it is not fast anymore, because each thread would be launched only after the earlier one has finished.
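On the Thread.Join() worry: Join only serializes the work if each thread is joined immediately after it is started. Below is a minimal sketch, not the code from the question, in which every thread is started first and the joins happen afterwards. JoinSketch, Process and ProcessAll are hypothetical names, and Process merely simulates the download-and-parse step with a sleep.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class JoinSketch
{
    // Hypothetical stand-in for the per-link work (download + parse).
    public static string Process(string url)
    {
        Thread.Sleep(100); // simulate network latency
        return "parsed:" + url;
    }

    public static string[] ProcessAll(IReadOnlyList<string> urls)
    {
        var results = new string[urls.Count];
        var threads = new List<Thread>();

        // Phase 1: start every thread without waiting on any of them.
        for (int i = 0; i < urls.Count; i++)
        {
            int index = i; // copy so each closure captures its own value
            var worker = new Thread(() => results[index] = Process(urls[index]));
            threads.Add(worker);
            worker.Start();
        }

        // Phase 2: only now wait; the threads have been running concurrently.
        foreach (var worker in threads)
        {
            worker.Join();
        }
        return results;
    }
}
```

Each thread writes to its own slot in the results array, so no lock is needed and the output keeps the input order. One thread per link is only reasonable for a small number of links.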
The simplest way to offload the work to multiple threads would be to use Parallel.ForEach() in place of your current loop. Something like this:
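    Parallel.ForEach(link_list, link =>
    {
        string href = link.Attributes["href"].Value;
        var item = GetObject(link.InnerText, href, GetLatestEpisode(href));
        // List<T> is not thread-safe, so guard the shared list while adding.
        lock (foundLinks)
        {
            foundLinks.Add(item);
        }
    });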
I'm not sure if there are other threading concerns in your overall code. (Note, for example, that this would no longer guarantee that the data is added to foundLinks in the same order.) But as long as there's nothing explicitly preventing concurrent work from taking place, this would take advantage of threading over multiple CPU cores to process the work.
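If the original order matters, one option is PLINQ with AsOrdered(), which runs the work in parallel but returns results in source order and avoids sharing a list between threads. This is a rough sketch under assumptions: OrderedParallelSketch, BuildItem and BuildAll are hypothetical names, BuildItem stands in for the GetObject(...)/GetLatestEpisode(...) calls, and the degree of parallelism of 4 is an arbitrary illustrative value.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

public static class OrderedParallelSketch
{
    // Hypothetical stand-in for the per-link fetch-and-build step.
    public static string BuildItem(string href)
    {
        Thread.Sleep(50); // simulate a slow page download
        return href.ToUpperInvariant();
    }

    public static List<string> BuildAll(IEnumerable<string> hrefs)
    {
        return hrefs
            .AsParallel()                  // process items on multiple threads
            .AsOrdered()                   // keep results in source order
            .WithDegreeOfParallelism(4)    // assumed cap, tune as needed
            .Select(href => BuildItem(href))
            .ToList();
    }
}
```

Because each item is produced by the Select projection and collected by ToList(), no lock is required here.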