Twitter JSON data in Hadoop


I have streamed Twitter data into HDFS using Flume. The twitter-agent configuration:

# setting properties of agent
twitter-agent.sources=source1
twitter-agent.channels=channel1
twitter-agent.sinks=sink1

# configuring sources
twitter-agent.sources.source1.type=com.cloudera.flume.source.TwitterSource
twitter-agent.sources.source1.channels=channel1
twitter-agent.sources.source1.consumerKey=<consumer-key>
twitter-agent.sources.source1.consumerSecret=<consumer-secret>
twitter-agent.sources.source1.accessToken=<access-token>
twitter-agent.sources.source1.accessTokenSecret=<access-token-secret>
twitter-agent.sources.source1.keywords= morning, night, hadoop, bigdata

# configuring channels
twitter-agent.channels.channel1.type=memory
twitter-agent.channels.channel1.capacity=10000
twitter-agent.channels.channel1.transactionCapacity=100

# configuring sinks
twitter-agent.sinks.sink1.channel=channel1
twitter-agent.sinks.sink1.type=hdfs
twitter-agent.sinks.sink1.hdfs.path=flume/tweets
twitter-agent.sinks.sink1.rollSize=0
twitter-agent.sinks.sink1.rollCount=10000
twitter-agent.sinks.sink1.batchSize=1000
twitter-agent.sinks.sink1.fileType=DataStream
twitter-agent.sinks.sink1.writeFormat=Text

The Twitter data streamed successfully. Every FlumeData file in HDFS looks like this:

seq!org.apache.hadoop.io.longwritable"org.apache.hadoop.io.byteswritable�	���^�kd��h?�tn ���h{"in_reply_to_status_id_str":null,"in_reply_to_status_id":null,"created_at":"tue jun 23 15:09:32 +0000 2015","in_reply_to_user_id_str":null,"source":"<a href=\"http://tweetlogix.com\" rel=\"nofollow\">tweetlogix<\/a>","retweet_count":0,"retweeted":false,"geo":null,"filter_level":"low","in_reply_to_screen_name":null,"id_str":"613363262709723139","in_reply_to_user_id":null,"favorite_count":0,"id":613363262709723139,"text":"morning.","place":null,"lang":"en","favorited":false,"possibly_sensitive":false,"coordinates":null,"truncated":false,"timestamp_ms":"1435072172225","entities":{"urls":[],"hashtags":[],"user_mentions":[],"trends":[],"symbols":[]},"contributors":null,"user":{"utc_offset":-14400,"friends_count":195,"profile_image_url_https":"https://pbs.twimg.com/profile_images/613121771093532673/ma5npv6x_normal.jpg","listed_count":16,"profile_background_image_url":"http://pbs.twimg.com/profile_background_images/378800000045222063/847094549362b20f2b1e3c1ff137a80f.png","default_profile_image":false,"favourites_count":891,"description":"see, on way piece of burger burger king.....","created_at":"sat apr 30 00:51:06 +0000 2011","is_translator":false,"profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/378800000045222063/847094549362b20f2b1e3c1ff137a80f.png","protected":false,"screen_name":"nilesdontcurrr","id_str":"290266873","profile_link_color":"ff0000","id":290266873,"geo_enabled":false,"profile_background_color":"ffffff","lang":"en","profile_sidebar_border_color":"ffffff","profile_text_color":"34aa7a","verified":false,"profile_image_url":"http://pbs.twimg.com/profile_images/613121771093532673/ma5npv6x_normal.jpg","time_zone":"eastern time (us & 
canada)","url":null,"contributors_enabled":false,"profile_background_tile":true,"profile_banner_url":"https://pbs.twimg.com/profile_banners/290266873/1432844093","statuses_count":68154,"follow_request_sent":null,"followers_count":4611,"profile_use_background_image":true,"default_profile":false,"following":null,"name":"niles.","location":"new york city.","profile_sidebar_fill_color":"afdfb7","notifications":null}}

When I parse the JSON data in Hive, I get this error:

Caused by: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('S' (code 83)): expected a valid value (number, String, array, object, 'true', 'false' or 'null') at [Source: java.io.StringReader@5fdcaa40; line: 1, column: 2]

I think the error occurs because of the first line in every FlumeData file, which starts with SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable followed by binary bytes. Is that right?

Isn't Twitter JSON data supposed to start with {"in_reply_to_status_id_str":......} ?
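One way to check this suspicion is to look at the first line of a file copied out of HDFS. The sketch below is only an illustration: it relies on the fact that a Hadoop SequenceFile begins with the bytes SEQ, and the function name classify_first_line is made up for this example.

```python
import json

# A Hadoop SequenceFile starts with the three ASCII bytes "SEQ".
SEQUENCE_FILE_MAGIC = b"SEQ"


def classify_first_line(first_line: bytes) -> str:
    """Return 'sequencefile', 'json', or 'unknown' for the first line of a file.

    Hypothetical helper for illustration; it only inspects the bytes it is given.
    """
    if first_line.startswith(SEQUENCE_FILE_MAGIC):
        return "sequencefile"
    try:
        parsed = json.loads(first_line.decode("utf-8").strip())
    except (UnicodeDecodeError, ValueError):
        # Not valid UTF-8 or not valid JSON (JSONDecodeError is a ValueError).
        return "unknown"
    # A tweet record is a JSON object, so anything else is not what we expect.
    return "json" if isinstance(parsed, dict) else "unknown"
```

With a local copy of one output file (the path is whatever you copied it to), the first line can be read with open(path, "rb").readline() and passed to the function. A result of 'sequencefile' would mean the sink did not write plain text.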

Flume is generating the files in binary (SequenceFile) format instead of text format, because a few of the properties in the config file are not set correctly, including the two properties below.

twitter-agent.sinks.sink1.fileType=DataStream
twitter-agent.sinks.sink1.writeFormat=Text

The correct way to set these properties is with the hdfs. prefix, as below.

twitter-agent.sinks.sink1.hdfs.fileType=DataStream
twitter-agent.sinks.sink1.hdfs.writeFormat=Text
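Since a missing hdfs. prefix is easy to overlook, a small check over the config text can flag it. This is a rough sketch, not part of Flume: the function name and the key list are assumptions chosen for this example (the list covers only some common HDFS sink settings), and it does not verify that the sink type is actually hdfs.

```python
from typing import List

# Partial, hand-picked list of HDFS sink settings that need the "hdfs." prefix.
HDFS_SINK_KEYS = {
    "path", "rollSize", "rollCount", "rollInterval",
    "batchSize", "fileType", "writeFormat",
}


def find_unprefixed_hdfs_keys(config_text: str) -> List[str]:
    """Return property names of the form <agent>.sinks.<sink>.<key>
    where <key> is an HDFS sink setting written without "hdfs."."""
    flagged = []
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and non-property lines
        name = line.split("=", 1)[0].strip()
        parts = name.split(".")
        # A correctly prefixed key has five parts (..., "hdfs", key), so it is skipped.
        if len(parts) == 4 and parts[1] == "sinks" and parts[3] in HDFS_SINK_KEYS:
            flagged.append(name)
    return flagged
```

Run against the sink section of the configuration in the question, this would also report rollSize, rollCount and batchSize, which suggests those settings need the same prefix.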
