java - How to extract content from a PDF that has both text and images?
I have a PDF file that has two types of pages: normal text pages, and pages that come from scanned documents. The text content can be extracted using either PDFBox or Tika. However, these libraries can't do OCR, so I need Tess4J for the scanned pages. How can I combine Tess4J and PDFBox (or Tika) in order to extract the content of both the text pages and the scanned pages?
Edited: I found the following solution, but it doesn't work well.
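import java.io.File;
import java.util.Map;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;
import org.apache.pdfbox.pdmodel.markedcontent.PDPropertyList;

// PDFBox 1.8 API. 'pages' is assumed to hold the document's PDPage objects.
for (PDPage page : pages) {
    PDPropertyList pageFonts = page.getResources().getProperties();
    Map<String, PDXObjectImage> img = page.getResources().getImages();
    for (Map.Entry<String, PDXObjectImage> entry : img.entrySet()) {
        String k = entry.getKey();
        System.out.println(k);
        PDXObjectImage ci = entry.getValue();
        // write2file appends the image's own suffix (jpg, png, ...) to the name
        ci.write2file(k);
        // Use the same suffix here instead of assuming ".jpg"
        File imageFile = new File(k + "." + ci.getSuffix());
        Tesseract instance = Tesseract.getInstance();
        try {
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}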
The problem is that although the image files are saved with good quality, Tess4J does not work properly on them and extracts nonsense. Tess4J is able to OCR the same pages if the PDF itself is passed to it in the first place.
In summary, extracting the images with PDFBox affects the quality of the OCR process. Does anyone know why?
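For the first part of the question (combining a text extractor with OCR), one possible approach is to decide page by page: keep the embedded text when a page has enough of it, and fall back to OCR otherwise. The sketch below shows only that routing logic, using plain Java so it runs without either library. The class name HybridExtractor, the two callbacks, and the minChars threshold are hypothetical names introduced for illustration; they are not part of PDFBox, Tika, or Tess4J.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

/** Sketch: per-page routing between a text-layer extractor and an OCR fallback. */
final class HybridExtractor {

    /**
     * @param pageCount number of pages in the document
     * @param textLayer returns the embedded text of a 0-based page (may be null or blank)
     * @param ocr       returns OCR text for a 0-based page; only called when needed
     * @param minChars  hypothetical threshold: fewer non-whitespace characters
     *                  than this and the page is treated as scanned
     */
    static List<String> extract(int pageCount,
                                IntFunction<String> textLayer,
                                IntFunction<String> ocr,
                                int minChars) {
        List<String> pages = new ArrayList<>(pageCount);
        for (int i = 0; i < pageCount; i++) {
            String text = textLayer.apply(i);
            if (countNonWhitespace(text) >= minChars) {
                pages.add(text);          // page has a usable text layer
            } else {
                pages.add(ocr.apply(i));  // assume a scanned page and run OCR
            }
        }
        return pages;
    }

    /** Counts characters that are not whitespace; null counts as zero. */
    private static int countNonWhitespace(String s) {
        if (s == null) {
            return 0;
        }
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isWhitespace(s.charAt(i))) {
                n++;
            }
        }
        return n;
    }
}
```

To wire this up, the textLayer callback could wrap PDFBox's PDFTextStripper restricted to a single page with setStartPage and setEndPage (both 1-based) before calling getText. The ocr callback could render the whole page to a BufferedImage (PDPage.convertToImage in PDFBox 1.8, or PDFRenderer.renderImageWithDPI in PDFBox 2.x) and pass it to Tess4J's doOCR. Rendering the full page is a different technique from extracting the embedded image objects as in the code above, and whether it resolves the quality problem described here is an assumption that would need testing on the actual file.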