python - Converting a PDF file to Base64 to index into Elasticsearch
I need to index PDFs into Elasticsearch. For that, I need to convert the files to Base64. I will be using the attachment mapping.
I used the following Python code to convert the file to a Base64 encoded string:
from elasticsearch import Elasticsearch
import base64
import constants

def index_pdf(pdf_filename):
    encoded = ""
    with open(pdf_filename) as f:
        data = f.readlines()
        for line in data:
            encoded += base64.b64encode(f.readline())
    return encoded

if __name__ == "__main__":
    encoded_pdf = index_pdf("test.pdf")
    index_dsl = {
        "pdf_id": "1",
        "text": encoded_pdf
    }
    constants.es_client.index(
        index=constants.index_name,
        doc_type=constants.type_name,
        body=index_dsl,
        id="1"
    )
The creation of the index and the document indexing both work fine. The only issue is that I don't think the file has been encoded in the right way. I tried encoding the file using online tools, and I get a different encoding which is bigger compared to the one I get using Python.
Here is the PDF file.
I tried querying the text data as suggested in the documentation of the plugin.
GET index_pdf/pdf/_search
{
  "query": {
    "match": {
      "text": "piece text"
    }
  }
}
It gives me 0 hits. How should I go about it?
The encoding snippet is incorrect: it opens the PDF file in "text" mode, but a PDF is binary data.
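A quick way to tell whether an encoder produced the right string is to decode it again and compare the result with the original bytes. Unwrapped Base64 of n bytes is always 4 * ceil(n / 3) characters long, so a noticeably shorter string means data was lost. The following is a minimal Python 3 sketch; the helper name check_base64 and the sample bytes are made up for illustration.

```python
import base64


def check_base64(raw, encoded):
    """Return True if `encoded` is the unwrapped Base64 form of the bytes `raw`."""
    # Every 3 input bytes become 4 output characters; the last group is padded.
    expected_len = 4 * ((len(raw) + 2) // 3)
    if len(encoded) != expected_len:
        return False
    # Round trip: decoding must give back exactly the original bytes.
    return base64.b64decode(encoded) == raw


# Illustrative binary payload, not a real PDF.
raw = b"%PDF-1.4\n\x00\xff binary payload"
good = base64.b64encode(raw).decode("ascii")

print(check_base64(raw, good))  # a correct encoding passes
print(check_base64(raw, ""))    # an empty or truncated encoding fails
```

Running the same check against the string produced by an encoder and the bytes read with open(pdf_filename, "rb") should show quickly whether the encoder is at fault.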
Depending on the file size, you could just open the file in binary mode and use the string encode method. Example:
def pdf_encode(pdf_filename):
    # Python 2 only: the "base64" codec on str is not available on Python 3 bytes.
    with open(pdf_filename, "rb") as f:
        return f.read().encode("base64")
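On Python 3, bytes objects have no encode method, so the same idea is usually written with the base64 module instead. This is a minimal sketch of that variant; the name pdf_encode_py3 is hypothetical and not part of the original answer.

```python
import base64


def pdf_encode_py3(pdf_filename):
    """Read the whole file in binary mode and return its Base64 form as str."""
    with open(pdf_filename, "rb") as f:
        # b64encode returns bytes; decode to str so the value can go into a JSON body.
        return base64.b64encode(f.read()).decode("ascii")
```

Unlike the "base64" codec, base64.b64encode does not insert line breaks, so the result is one continuous string.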
Or, if the file could be large and you have to break the encoding up into chunks (I did not look to see whether there is a module for this), it is as simple as the example code below:
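import base64

def chunk_24_read(pdf_filename):
    # Yield the file 3 bytes (24 bits) at a time; 3 bytes map to exactly 4 Base64 characters.
    with open(pdf_filename, "rb") as f:
        byte = f.read(3)
        while byte:
            yield byte
            byte = f.read(3)

def pdf_encode(pdf_filename):
    encoded = ""
    length = 0
    for data in chunk_24_read(pdf_filename):
        # decode() makes the loop yield characters on Python 3 as well as Python 2.
        for char in base64.b64encode(data).decode("ascii"):
            # Start a new line after every 76 characters of output.
            if length and length % 76 == 0:
                encoded += "\n"
                length = 0
            encoded += char
            length += 1
    return encoded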
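Reading only 3 bytes per call is slow for big files. Any chunk size that is a multiple of 3 works the same way, because such a chunk encodes to whole 4-character groups with no padding, so the pieces can be joined directly. The sketch below assumes Python 3 and a regular file on disk; the name pdf_encode_chunked and the default chunk size are illustrative choices. It also leaves out the 76-character line wrapping, on the assumption that the consumer wants one unbroken string.

```python
import base64


def pdf_encode_chunked(pdf_filename, chunk_size=3 * 1024):
    """Base64-encode a file piece by piece and return one unwrapped str."""
    if chunk_size % 3 != 0:
        # Other sizes would put "=" padding in the middle of the output.
        raise ValueError("chunk_size must be a multiple of 3")
    parts = []
    with open(pdf_filename, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            parts.append(base64.b64encode(chunk).decode("ascii"))
    # Only the final chunk can be shorter, so padding appears only at the end.
    return "".join(parts)
```

For a regular file the result should be identical to encoding the whole file in one call, while only one chunk of raw data is held in memory at a time.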