java - How to extract content from a PDF having both text and images?

- March 15, 2014

I have a PDF file that has two types of pages: normal text pages and pages coming from scanned documents. The text content can be extracted using either PDFBox or Tika. However, these libraries can't do OCR, so I need Tess4J. How can I combine Tess4J and PDFBox (or Tika) in order to extract content from both the text pages and the scanned pages?

Edit: I found the following solution, but it doesn't work well:

for (PDPage page : pages) {
    PDPropertyList pageFonts = page.getResources().getProperties();
    Map<String, PDXObjectImage> img = page.getResources().getImages();

    Set<String> keys = img.keySet();
    Iterator iter = keys.iterator();

    while (iter.hasNext()) {
        String k = (String) iter.next();
        System.out.println(k);

        PDXObjectImage ci = img.get(k);
        ci.write2file(k);

        File imageFile = new File(k + ".jpg");
        Tesseract instance = Tesseract.getInstance();

        try {
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}

The problem is that although the image files are saved with good quality, Tess4J does not operate on them well and extracts nonsense. Tess4J is able to OCR them, however, if the PDF is passed in the first place.

In summary, extracting the images with PDFBox affects the quality of the OCR process. Does anyone know why?
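For the mixed text/scanned case described above, one possible way to combine the two libraries is a per-page fallback: take the text layer when a page has one, and send the page to OCR otherwise. The sketch below shows only that decision logic using the Java standard library. The names `HybridExtractor`, `looksScanned`, `MIN_TEXT_CHARS`, and the two callbacks are hypothetical and not part of PDFBox or Tess4J. In a real program the `textLayer` callback might wrap a `PDFTextStripper` limited to a single page with `setStartPage`/`setEndPage`, and the `ocr` callback might render the whole page to an image (for example with `PDPage.convertToImage` in PDFBox 1.8) before calling `doOCR`, rather than OCR-ing the embedded image objects one by one. Whether that also fixes the quality problem is an assumption, not something verified here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

/** Per-page fallback: use the text layer when present, otherwise OCR the page. */
class HybridExtractor {

    // Hypothetical threshold: a page whose text layer has fewer non-whitespace
    // characters than this is treated as a scanned page.
    static final int MIN_TEXT_CHARS = 20;

    /** Returns true when the extracted text layer is missing or nearly empty. */
    static boolean looksScanned(String layerText) {
        if (layerText == null) {
            return true;
        }
        int count = 0;
        for (int i = 0; i < layerText.length(); i++) {
            if (!Character.isWhitespace(layerText.charAt(i))) {
                count++;
            }
        }
        return count < MIN_TEXT_CHARS;
    }

    /**
     * Collects one string per page (0-based page index).
     * textLayer: stand-in for text extraction (e.g. backed by PDFBox or Tika).
     * ocr: stand-in for OCR (e.g. backed by Tess4J); only called for scanned pages.
     */
    static List<String> extract(int pageCount,
                                IntFunction<String> textLayer,
                                IntFunction<String> ocr) {
        List<String> result = new ArrayList<>();
        for (int page = 0; page < pageCount; page++) {
            String text = textLayer.apply(page);
            result.add(looksScanned(text) ? ocr.apply(page) : text);
        }
        return result;
    }
}
```

Keeping the two extractors behind callbacks means the OCR step is only paid for on pages that need it, and the threshold can be tuned for documents whose scanned pages carry a few stray characters in the text layer.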
