Twitter JSON data in Hadoop
I have streamed Twitter data into HDFS using Flume. The twitter-agent configuration:
# Setting properties of the agent
twitter-agent.sources=source1
twitter-agent.channels=channel1
twitter-agent.sinks=sink1

# Configuring sources
twitter-agent.sources.source1.type=com.cloudera.flume.source.TwitterSource
twitter-agent.sources.source1.channels=channel1
twitter-agent.sources.source1.consumerKey=<consumer-key>
twitter-agent.sources.source1.consumerSecret=<consumer-secret>
twitter-agent.sources.source1.accessToken=<access-token>
twitter-agent.sources.source1.accessTokenSecret=<access-token-secret>
twitter-agent.sources.source1.keywords= morning, night, hadoop, bigdata

# Configuring channels
twitter-agent.channels.channel1.type=memory
twitter-agent.channels.channel1.capacity=10000
twitter-agent.channels.channel1.transactionCapacity=100

# Configuring sinks
twitter-agent.sinks.sink1.channel=channel1
twitter-agent.sinks.sink1.type=hdfs
twitter-agent.sinks.sink1.hdfs.path=flume/tweets
twitter-agent.sinks.sink1.rollSize=0
twitter-agent.sinks.sink1.rollCount=10000
twitter-agent.sinks.sink1.batchSize=1000
twitter-agent.sinks.sink1.fileType=DataStream
twitter-agent.sinks.sink1.writeFormat=Text
The Twitter data streamed successfully. Every FlumeData file in HDFS looks like this:
seq!org.apache.hadoop.io.longwritable"org.apache.hadoop.io.byteswritable� ���^�kd��h?�tn ���h{"in_reply_to_status_id_str":null,"in_reply_to_status_id":null,"created_at":"tue jun 23 15:09:32 +0000 2015","in_reply_to_user_id_str":null,"source":"<a href=\"http://tweetlogix.com\" rel=\"nofollow\">tweetlogix<\/a>","retweet_count":0,"retweeted":false,"geo":null,"filter_level":"low","in_reply_to_screen_name":null,"id_str":"613363262709723139","in_reply_to_user_id":null,"favorite_count":0,"id":613363262709723139,"text":"morning.","place":null,"lang":"en","favorited":false,"possibly_sensitive":false,"coordinates":null,"truncated":false,"timestamp_ms":"1435072172225","entities":{"urls":[],"hashtags":[],"user_mentions":[],"trends":[],"symbols":[]},"contributors":null,"user":{"utc_offset":-14400,"friends_count":195,"profile_image_url_https":"https://pbs.twimg.com/profile_images/613121771093532673/ma5npv6x_normal.jpg","listed_count":16,"profile_background_image_url":"http://pbs.twimg.com/profile_background_images/378800000045222063/847094549362b20f2b1e3c1ff137a80f.png","default_profile_image":false,"favourites_count":891,"description":"see, on way piece of burger burger king.....","created_at":"sat apr 30 00:51:06 +0000 2011","is_translator":false,"profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/378800000045222063/847094549362b20f2b1e3c1ff137a80f.png","protected":false,"screen_name":"nilesdontcurrr","id_str":"290266873","profile_link_color":"ff0000","id":290266873,"geo_enabled":false,"profile_background_color":"ffffff","lang":"en","profile_sidebar_border_color":"ffffff","profile_text_color":"34aa7a","verified":false,"profile_image_url":"http://pbs.twimg.com/profile_images/613121771093532673/ma5npv6x_normal.jpg","time_zone":"eastern time (us & 
canada)","url":null,"contributors_enabled":false,"profile_background_tile":true,"profile_banner_url":"https://pbs.twimg.com/profile_banners/290266873/1432844093","statuses_count":68154,"follow_request_sent":null,"followers_count":4611,"profile_use_background_image":true,"default_profile":false,"following":null,"name":"niles.","location":"new york city.","profile_sidebar_fill_color":"afdfb7","notifications":null}}
When I parse the JSON data in Hive, I get this error:
Caused by: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('S' (code 83)): expected a valid value (number, String, array, object, 'true', 'false' or 'null') at [Source: java.io.StringReader@5fdcaa40; line: 1, column: 2]
I think the error is caused by the first line of every FlumeData file:
seq!org.apache.hadoop.io.longwritable"org.apache.hadoop.io.byteswritable� ���^�kd��h?�tn ���h
Is that right? Shouldn't the Twitter JSON data start with {"in_reply_to_status_id_str":......}?
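One way to test this suspicion is to look at the first few bytes of a FlumeData file. Hadoop SequenceFiles begin with the magic bytes SEQ, whereas a plain-text tweet file should begin with an opening brace. The following is a minimal Python sketch of that check; the function names are made up for illustration, and it assumes the file has first been copied out of HDFS (for example with hdfs dfs -get) so that it can be read locally.

```python
# Illustrative helpers (hypothetical names, not part of Flume or Hadoop).
SEQUENCE_FILE_MAGIC = b"SEQ"  # every Hadoop SequenceFile starts with these bytes


def looks_like_sequence_file(first_bytes: bytes) -> bool:
    """Return True if the data starts with the SequenceFile magic bytes."""
    return first_bytes.startswith(SEQUENCE_FILE_MAGIC)


def looks_like_json_object(first_bytes: bytes) -> bool:
    """Return True if the first non-whitespace byte opens a JSON object."""
    return first_bytes.lstrip().startswith(b"{")


def describe(path: str) -> str:
    """Classify a local copy of a FlumeData file by its leading bytes."""
    with open(path, "rb") as handle:
        head = handle.read(64)
    if looks_like_sequence_file(head):
        return "SequenceFile (binary)"
    if looks_like_json_object(head):
        return "plain-text JSON"
    return "unknown"
```

If describe reports a SequenceFile, the Hive JSON SerDe is being handed the binary header rather than a tweet, which would match the parse error at line 1, column 2.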
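Flume is generating the files in binary SequenceFile format instead of text format, because several properties in the config file are not set correctly, including the two below. HDFS sink settings must carry the hdfs. prefix; without it the sink ignores them and falls back to its defaults (fileType=SequenceFile, writeFormat=Writable).
twitter-agent.sinks.sink1.fileType=DataStream
twitter-agent.sinks.sink1.writeFormat=Text
The correct way to set these properties is shown below. The rollSize, rollCount, and batchSize settings need the same hdfs. prefix.
twitter-agent.sinks.sink1.hdfs.fileType=DataStream
twitter-agent.sinks.sink1.hdfs.writeFormat=Text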
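Because a missing hdfs. prefix produces no error, this kind of mistake is easy to overlook. As a rough sketch, a few lines of Python can scan an agent configuration for HDFS sink settings written without the prefix. The function name and the list of checked keys are assumptions chosen for illustration; this is not a Flume tool, and the key list is not exhaustive.

```python
# Common HDFS sink settings that must be written as "hdfs.<name>".
# Illustrative subset only; compared case-insensitively.
HDFS_SINK_KEYS = {
    "path", "filetype", "writeformat",
    "rollsize", "rollcount", "rollinterval", "batchsize",
}


def find_unprefixed_hdfs_keys(config_text):
    """Return HDFS sink properties that are missing the 'hdfs.' prefix."""
    props = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()

    # Collect sinks declared as type hdfs, e.g. "twitter-agent.sinks.sink1".
    hdfs_sinks = {
        key[: -len(".type")]
        for key, value in props.items()
        if ".sinks." in key and key.endswith(".type") and value == "hdfs"
    }

    suspects = []
    for key in props:
        for sink in hdfs_sinks:
            prefix = sink + "."
            if key.startswith(prefix) and key[len(prefix):].lower() in HDFS_SINK_KEYS:
                suspects.append(key)
    return sorted(suspects)
```

Run against the configuration from the question, a check like this would flag the fileType, writeFormat, rollSize, rollCount, and batchSize lines, while leaving hdfs.path, type, and channel alone.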