java - How to extract content from a PDF having both text and images?


I have a PDF file that has two types of pages: normal text pages, and pages coming from scanned documents. The text content can be extracted using either PDFBox or Tika. However, these libraries can't do OCR, so I need Tess4J for the scanned pages. How can I combine Tess4J and PDFBox (or Tika) in order to extract content from both the text pages and the scanned pages?
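One way to frame the combination is a per-page fallback: try the text layer first, and run OCR only on pages where it comes back empty or nearly empty. The following is a minimal sketch of that control flow only, not a confirmed solution. All names in it (`HybridPdfExtractionSketch`, `looksScanned`, `extractAll`, `MIN_TEXT_CHARS`) are hypothetical, and the two extractors are passed in as plain functions so the sketch stays independent of any library. In practice the text-layer function might wrap PDFBox's `PDFTextStripper` limited to a single page via `setStartPage` and `setEndPage`, and the OCR function might wrap Tess4J's `doOCR` on an image of that page; both wrappers are assumptions here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical sketch: none of these names come from PDFBox, Tika, or Tess4J.
class HybridPdfExtractionSketch {

    // Assumed threshold: a page whose text layer has fewer non-whitespace
    // characters than this is treated as a scanned page.
    static final int MIN_TEXT_CHARS = 20;

    // Returns true when the text layer is missing or too short to be real text.
    static boolean looksScanned(String textLayer) {
        if (textLayer == null) {
            return true;
        }
        int count = 0;
        for (int i = 0; i < textLayer.length(); i++) {
            if (!Character.isWhitespace(textLayer.charAt(i))) {
                count++;
            }
        }
        return count < MIN_TEXT_CHARS;
    }

    // Walks pages 1..pageCount, using the text layer when it looks usable
    // and falling back to the OCR function otherwise.
    static List<String> extractAll(int pageCount,
                                   IntFunction<String> textLayer,
                                   IntFunction<String> ocr) {
        List<String> out = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            String text = textLayer.apply(page);
            out.add(looksScanned(text) ? ocr.apply(page) : text);
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-ins for the real extractors: page 2 pretends to be a scan.
        IntFunction<String> textLayer =
                p -> p == 2 ? "" : "Text layer of page " + p + " with enough characters.";
        IntFunction<String> ocr = p -> "[OCR result for page " + p + "]";
        extractAll(3, textLayer, ocr).forEach(System.out::println);
    }
}
```

The threshold is a guess. A scanned page can still carry a few stray characters in its text layer, which is why the check counts characters instead of testing for an empty string.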

Edited: I found a solution, as follows, but it doesn't work well.

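    import java.io.File;
    import java.util.Map;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;
    import net.sourceforge.tess4j.Tesseract;
    import net.sourceforge.tess4j.TesseractException;

    // Fragment for PDFBox 1.8.x; it belongs inside a method that declares IOException.
    // `pages` is assumed to be the List<PDPage> taken from the document catalog.
    Tesseract instance = Tesseract.getInstance();

    for (PDPage page : pages) {
        Map<String, PDXObjectImage> images = page.getResources().getImages();

        for (Map.Entry<String, PDXObjectImage> entry : images.entrySet()) {
            String name = entry.getKey();
            System.out.println(name);

            // write2file appends the image's own suffix, which is not always "jpg",
            // so build the file name from getSuffix() instead of hard-coding it.
            PDXObjectImage image = entry.getValue();
            image.write2file(name);
            File imageFile = new File(name + "." + image.getSuffix());

            try {
                String result = instance.doOCR(imageFile);
                System.out.println(result);
            } catch (TesseractException e) {
                System.err.println(e.getMessage());
            }
        }
    }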

The problem is that although the image files that are saved have good quality, Tess4J does not operate well on them and extracts nonsense. Tess4J is able to OCR them, however, if the PDF itself is passed to it in the first place.

In summary, extracting the images with PDFBox affects the quality of the OCR process. Does anyone know why?
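A possible first step in narrowing this down is to look at what was actually written to disk before it reaches Tess4J. The sketch below uses only `javax.imageio` to report an image file's pixel dimensions and bits per pixel. The class and method names (`ExtractedImageCheck`, `describe`) are hypothetical, and whether size or colour depth is the cause here is an assumption, not something established above.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of PDFBox or Tess4J.
class ExtractedImageCheck {

    // Returns "<width>x<height>, <bits> bits/pixel" for an image file,
    // or a short note when ImageIO has no reader that can decode it.
    static String describe(File imageFile) throws IOException {
        BufferedImage img = ImageIO.read(imageFile);
        if (img == null) {
            return "not decodable by ImageIO";
        }
        return img.getWidth() + "x" + img.getHeight()
                + ", " + img.getColorModel().getPixelSize() + " bits/pixel";
    }
}
```

Calling `ExtractedImageCheck.describe(imageFile)` just before `doOCR` in the loop above would show, for each extracted image, whether it is a full page at a reasonable size or something much smaller.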

