How to extract images from a file using Apache Tika?
I have a PDF (or another type of file such as .doc, .ppt, etc.) that contains text and images. How can I extract the image files using Tika?
Can I run OCR on the extracted images using Tess4J or another library?
This is how I call Tika:
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(writeLimit);
    Metadata metadata = new Metadata();
    InputStream stream = new FileInputStream("file.pdf");
    parser.parse(stream, handler, metadata);
P.S. I have tika-app.jar.
The way to do this is:
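    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.parser.ocr.TesseractOCRConfig;
    import org.apache.tika.parser.pdf.PDFParserConfig;
    import org.apache.tika.sax.BodyContentHandler;

    InputStream stream = new FileInputStream(inputFile);
    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

    TesseractOCRConfig config = new TesseractOCRConfig();
    PDFParserConfig pdfConfig = new PDFParserConfig();
    // Inline image extraction is off by default in PDFParserConfig;
    // enable it so images embedded in the PDF reach the OCR parser.
    pdfConfig.setExtractInlineImages(true);

    ParseContext parseContext = new ParseContext();
    parseContext.set(TesseractOCRConfig.class, config);
    parseContext.set(PDFParserConfig.class, pdfConfig);
    // Needed to make sure recursive parsing of embedded documents happens!
    parseContext.set(Parser.class, parser);

    Metadata metadata = new Metadata();
    parser.parse(stream, handler, metadata, parseContext);
    String text = handler.toString().trim();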
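The question also asks about running OCR on images that have already been extracted. One option, sketched here under some assumptions, is to call the tesseract command-line program directly from Java with ProcessBuilder. The class name TesseractCli and its method names are made up for illustration. The sketch assumes tesseract is installed and on PATH, and it uses the CLI form `tesseract <image> stdout -l <lang>`, which prints the recognized text to standard output.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika or Tess4J.
public class TesseractCli {

    /** Builds the argument list for: tesseract <image> stdout -l <language>. */
    static List<String> buildCommand(String imagePath, String language) {
        return Arrays.asList("tesseract", imagePath, "stdout", "-l", language);
    }

    /** Runs tesseract on one image file and returns the recognized text. */
    static String ocr(String imagePath, String language)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(imagePath, language));
        // Send tesseract's diagnostics to this process's stderr so that
        // stdout carries only the recognized text.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();
        byte[] output;
        try (InputStream in = process.getInputStream()) {
            output = readAll(in);
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return new String(output, StandardCharsets.UTF_8).trim();
    }

    /** Reads a stream to the end (kept manual so it also compiles on Java 8). */
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java TesseractCli <image-file>");
            return;
        }
        System.out.println(ocr(args[0], "eng"));
    }
}
```

Tika's own TesseractOCRParser also works by invoking the tesseract executable, which is why the binary has to be reachable through PATH for the Tika code to produce OCR text.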
1) Make sure you have Tesseract installed, using 'tesseract-ocr-setup-3.05.00dev.exe' from: https://sourceforge.net/projects/tesseract-ocr-alt/files/
Then make sure its install path (it gets installed under Program Files on Windows) is added to the PATH environment variable. Restart Windows if needed. After that you can pass in any file (yes, any!) and it will be extracted.
2) Download tess4j-3.0.0.jar from: https://sourceforge.net/projects/tess4j/?source=typ_redirect
and reference the jar using:
    <dependency>
        <groupId>net.sourceforge.tess4j</groupId>
        <artifactId>tess4j</artifactId>
        <version>3.0.0</version>
    </dependency>
Then add these:
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>1.13</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.tika/tika-parsers -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.13</version>
    </dependency>
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.5</version>
    </dependency>
    <dependency>
        <groupId>com.github.jai-imageio</groupId>
        <artifactId>jai-imageio-core</artifactId>
        <version>1.3.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/net.java.dev.jna/jna -->
    <dependency>
        <groupId>net.java.dev.jna</groupId>
        <artifactId>jna</artifactId>
        <version>4.2.2</version>
    </dependency>
    <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>1.2.11</version>
    </dependency>
However, if you are using Ubuntu, Tesseract should be installed using apt-get. That will work.
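Because the setup depends on the tesseract binary being reachable through the PATH environment variable, it can help to check that before parsing anything. The following is a minimal sketch using only the Java standard library. The class name TesseractLocator and the method findOnPath are hypothetical names chosen for this example, and the file names "tesseract" and "tesseract.exe" are the assumed names of the executable on Linux and Windows.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not part of Tika or Tess4J.
public class TesseractLocator {

    /**
     * Searches each directory of a PATH-style string for an executable file
     * with one of the given names and returns the first match, if any.
     */
    static Optional<Path> findOnPath(String pathValue, String... names) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : names) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                        return Optional.of(candidate);
                    }
                } catch (InvalidPathException e) {
                    // Skip malformed PATH entries (for example, stray quotes).
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> hit =
                findOnPath(System.getenv("PATH"), "tesseract", "tesseract.exe");
        System.out.println(hit.map(p -> "Found tesseract at " + p)
                              .orElse("tesseract is not on PATH"));
    }
}
```

If the check reports that tesseract is missing, fix the installation or the PATH entry first. On Windows, remember that a changed PATH is only seen by processes started after the change.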