python - Converting a PDF file to Base64 to index into Elasticsearch -

- July 15, 2011

I need to index PDFs into Elasticsearch. For that, I need to convert the files to Base64. I will be using the attachment mapping.

I used the following Python code to convert the file to a Base64-encoded string:

```python
from elasticsearch import Elasticsearch
import base64
import constants


def index_pdf(pdf_filename):
    encoded = ""
    with open(pdf_filename) as f:
        data = f.readlines()
        for line in data:
            encoded += base64.b64encode(f.readline())
    return encoded


if __name__ == "__main__":
    encoded_pdf = index_pdf("test.pdf")
    index_dsl = {
        "pdf_id": "1",
        "text": encoded_pdf
    }
    constants.es_client.index(
            index=constants.index_name,
            doc_type=constants.type_name,
            body=index_dsl,
            id="1"
    )
```

Creating the index and indexing the document both work fine. The only issue is that I don't think the file has been encoded in the right way. I tried encoding the file using online tools, and I get a different encoding that is bigger than the one I get using Python.

Here is the PDF file.

I tried querying the text data as suggested in the documentation of the plugin.

```
GET index_pdf/pdf/_search
{
  "query": {
    "match": {
      "text": "piece text"
    }
  }
}
```

It gives me 0 hits. How should I go about it?

The encoding snippet is incorrect: it opens the PDF file in "text" mode.
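Besides the file mode, the loop in the question has a second problem that would explain why the Python result is so much smaller than what an online encoder produces: `f.readlines()` already consumes the whole file, so every later `f.readline()` call returns an empty string and nothing gets encoded. The following is only a sketch of that effect, written for Python 3; the sample bytes and the temporary file are hypothetical stand-ins for the real PDF.

```python
import base64
import os
import tempfile


def broken_encode(path):
    """Mirror of the question's loop, opened in binary mode so it runs on Python 3."""
    encoded = b""
    with open(path, "rb") as f:
        data = f.readlines()  # reads every line; the handle is now at end of file
        for _ in data:
            # At end of file, f.readline() returns b"", so nothing is appended.
            encoded += base64.b64encode(f.readline())
    return encoded


def correct_encode(path):
    """Read the raw bytes once and encode them in a single call."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read())


# Hypothetical sample data standing in for a real PDF.
sample = b"%PDF-1.4\nline two\n\x00\x01\x02 binary tail"
fd, sample_path = tempfile.mkstemp(suffix=".pdf")
with os.fdopen(fd, "wb") as tmp:
    tmp.write(sample)

broken = broken_encode(sample_path)
correct = correct_encode(sample_path)
os.remove(sample_path)

# The broken version is empty; the correct one round-trips to the original bytes.
print(len(broken), len(correct))
```

Decoding the result and comparing it with the original bytes, as the test on `correct` would, is a quick way to confirm that an encoder is doing the right thing.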

Depending on the file size, you could simply open the file in binary mode and use the string `encode` method. For example:

```python
def pdf_encode(pdf_filename):
    # Python 2 only: in Python 3, bytes objects have no .encode() method.
    return open(pdf_filename, "rb").read().encode("base64")
```
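The `.encode("base64")` call above only exists on Python 2 strings. On Python 3, a roughly equivalent sketch would use `base64.b64encode` from the standard library. Unlike the Python 2 codec, it does not insert line breaks, and decoding the result to `str` is an assumption made here so that the value can be placed directly in a JSON document body.

```python
import base64


def pdf_encode(pdf_filename):
    """Return the file's contents as a Base64 text string."""
    with open(pdf_filename, "rb") as f:  # binary mode: bytes are read unchanged
        raw = f.read()
    # b64encode returns bytes; decode to str so it can be used in a JSON body.
    return base64.b64encode(raw).decode("ascii")
```

The returned string can then be used as the `"text"` value of the document in the question's `index_dsl` dictionary.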

Or, if the file is large, you would have to break the encoding into chunks. I did not check whether there is a module that does this, but it could be as simple as the example code below:

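```python
import base64


def chunk_24_read(pdf_filename):
    # Yield the file 3 bytes (24 bits) at a time, i.e. one Base64 group each.
    with open(pdf_filename, "rb") as f:
        byte = f.read(3)
        while byte:
            yield byte
            byte = f.read(3)


def pdf_encode(pdf_filename):
    encoded = ""
    length = 0
    for data in chunk_24_read(pdf_filename):
        # Decode to text so the loop yields characters on Python 2 and 3.
        for char in base64.b64encode(data).decode("ascii"):
            # Start a new line after every 76 output characters.
            if length and length % 76 == 0:
                encoded += "\n"
                length = 0
            encoded += char
            length += 1
    return encoded
```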
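Reading three bytes at a time and appending one character at a time can be slow for large files. Because Base64 encodes each 3-byte group independently, any chunk size that is a multiple of 3 gives pieces that can simply be concatenated, and only the final piece may carry `=` padding. The sketch below relies on that property; the function name `pdf_encode_chunked` and the `chunk_size` parameter are hypothetical, and it assumes the consumer accepts Base64 without the 76-character line breaks that the code above inserts.

```python
import base64


def pdf_encode_chunked(pdf_filename, chunk_size=3 * 1024):
    """Base64-encode a file piece by piece instead of reading it all at once."""
    if chunk_size <= 0 or chunk_size % 3 != 0:
        raise ValueError("chunk_size must be a positive multiple of 3")
    parts = []
    with open(pdf_filename, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            parts.append(base64.b64encode(chunk).decode("ascii"))
    # Joining once at the end avoids repeated string concatenation.
    return "".join(parts)
```

Collecting the pieces in a list and joining them once keeps the cost linear in the file size, whereas repeated `+=` on a string may copy the accumulated result many times.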
