Log Cleaning with Hadoop MapReduce (Part 1)
Published: 2019-06-08


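1. Log formats
The logs I have worked with so far fall into three types: web request logs, event-tracking logs, and backend system logs.
1.1 Request logs
A request log records the resource requests sent or submitted to the server when a user visits a site, either by opening a URL or by clicking an element on a page.
(Forum log)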
27.38.53.84 - - [30/May/2013:23:37:57 +0800] "GET /uc_server/data/avatar/000/00/50/90_avatar_small.jpg HTTP/1.1" 200 1828218.28.247.140 - - [30/May/2013:23:37:57 +0800] "GET /static/image/common/swfupload.swf?preventswfcaching=1369928282717 HTTP/1.1" 200 13333123.147.245.79 - - [30/May/2013:23:37:57 +0800] "GET /static/js/swfupload.queue.js?y7a HTTP/1.1" 304 -182.242.227.232 - - [30/May/2013:23:37:56 +0800] "GET /misc.php?mod=patch&action=ipnotice&inajax=1&ajaxtarget=ip_notice HTTP/1.1" 200 65183.67.254.204 - - [30/May/2013:23:37:56 +0800] "POST /forum.php?mod=post&action=newthread&fid=72&extra=&topicsubmit=yes&inajax=1 HTTP/1.1" 200 425110.255.113.85 - - [30/May/2013:23:37:59 +0800] "GET /uc_server/avatar.php?uid=26294&size=middle HTTP/1.1" 301 -111.37.4.243 - - [30/May/2013:23:37:58 +0800] "POST /source/plugin/pcmgr_url_safeguard/url_api.inc.php HTTP/1.1" 200 1300125.82.229.229 - - [30/May/2013:23:38:05 +0800] "GET /uc_server/data/avatar/000/07/18/34_avatar_middle.jpg HTTP/1.1" 200 3790122.70.237.247 - - [30/May/2013:23:38:03 +0800] "GET /forum.php?mod=image&aid=18696&size=300x300&key=3e12991ed5ff7ecd&nocache=yes&type=fixnone&ramdom=dZqQb HTTP/1.1" 200 39594111.37.4.243 - - [30/May/2013:23:38:04 +0800] "GET /forum.php?mod=misc&action=postreview&do=support&tid=11228&pid=44989&hash=29c64660&infloat=yes&handlekey=login&referer=http%3A%2F%2Fbbs.itcast.cn%2Fforum.php%3Fmod%3Dviewthread%26tid%3D11228&inajax=1&ajaxtarget=fwin_content_login HTTP/1.1" 302 -49.5.1.14 - - [30/May/2013:23:38:09 +0800] "GET /api/connect/like.php HTTP/1.1" 200 722

 

(Online store log)
183.49.46.228 - - [18/Sep/2013:06:49:23 +0000] "-" 400 0 "-" "-"
163.177.71.12 - - [18/Sep/2013:06:49:33 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
163.177.71.12 - - [18/Sep/2013:06:49:36 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
60.208.6.156 - - [18/Sep/2013:06:49:48 +0000] "GET /wp-content/uploads/2013/07/rcassandra.png HTTP/1.0" 200 185524 "http://cos.name/category/software/packages/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
222.68.172.190 - - [18/Sep/2013:06:50:08 +0000] "-" 400 0 "-" "-"
58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] "GET /nodejs-socketio-chat/ HTTP/1.1" 200 10818 "http://www.google.com/url?sa=t&rct=j&q=nodejs%20%E5%BC%82%E6%AD%A5%E5%B9%BF%E6%92%AD&source=web&cd=1&cad=rja&ved=0CCgQFjAA&url=%68%74%74%70%3a%2f%2f%62%6c%6f%67%2e%66%65%6e%73%2e%6d%65%2f%6e%6f%64%65%6a%73%2d%73%6f%63%6b%65%74%69%6f%2d%63%68%61%74%2f&ei=rko5UrylAefOiAe7_IGQBw&usg=AFQjCNG6YWoZsJ_bSj8kTnMHcH51hYQkAA&bvm=bv.52288139,d.aGc" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
58.215.204.118 - - [18/Sep/2013:06:51:36 +0000] "GET /wp-includes/js/jquery/jquery-migrate.min.js?ver=1.2.1 HTTP/1.1" 304 0 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] "GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1" 304 0 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
58.248.178.212 - - [18/Sep/2013:06:51:40 +0000] "GET /wp-includes/js/comment-reply.min.js?ver=3.6 HTTP/1.1" 200 786 "http://blog.fens.me/nodejs-grunt-intro/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; MDDR; InfoPath.2; .NET4.0C)"
180.168.34.26 - - [18/Sep/2013:07:11:08 +0000] "-" 400 0 "-" "-"
180.168.34.26 - - [18/Sep/2013:07:11:08 +0000] "-" 400 0 "-" "-"
50.116.27.194 - - [18/Sep/2013:07:11:29 +0000] "POST /wp-cron.php?doing_wp_cron=1379488288.8893849849700927734375 HTTP/1.0" 200 0 "-" "WordPress/3.6; http://blog.fens.me"
222.35.232.69 - - [18/Sep/2013:16:14:17 +0000] "GET /wp-content/uploads/2013/05/favicon.ico HTTP/1.1" 200 1150 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
114.252.89.91 - - [18/Sep/2013:16:14:20 +0000] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 58 "http://blog.fens.me/wp-admin/post.php?post=2445&action=edit&message=10" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36"
58.209.132.183 - - [18/Sep/2013:16:29:17 +0000] "GET /images/2.jpg HTTP/1.1" 200 105089 "http://image.baidu.com/i?ct=503316480&z=&tn=baiduimagedetail&ipn=d&word=%E6%B5%99%E6%B1%9F%E5%AE%89%E5%90%89&step_word=&ie=utf-8&in=17038&cl=2&lm=-1&st=&pn=0&rn=1&di=47839122900&ln=1998&fr=&&fmq=1379521091792_R&ic=&s=&se=&sme=0&tab=&width=&height=&face=&is=&istype=&ist=&jit=&objurl=http%3A%2F%2Fnews.eastday.com%2Feastday%2F06news%2Fchina%2Fzh2green%2Fanji%2Fnode327399%2Fimages%2F01517676.jpg" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729)"
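Before writing any MapReduce code, it helps to pin down how a single request-log record breaks into fields. The following is a minimal sketch in plain Java, not the code from this series: the class name AccessLogParser, its method names, and the field order are my own choices. It assumes every record is "ip - - [time] "request" status bytes", optionally followed by a quoted referer and user agent as in the store log, and it maps a byte count of "-" to 0, which is also an assumption.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the original post) that parses one request-log record.
final class AccessLogParser {

    // ip, two unused fields, [time], "request", status, bytes,
    // then an optional "referer" "user-agent" pair (present in the store log only).
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\d+|-)"
        + "(?: \"([^\"]*)\" \"([^\"]*)\")?$");

    private static final DateTimeFormatter LOG_TIME =
        DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    // Returns {ip, time, method, url, status, bytes, referer, agent},
    // or null when the line does not match and should be dropped.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        // A well-formed request is "METHOD URL PROTOCOL"; bad requests are logged as "-".
        String[] request = m.group(3).split(" ");
        boolean wellFormed = request.length == 3;
        return new String[] {
            m.group(1),
            m.group(2),
            wellFormed ? request[0] : "",
            wellFormed ? request[1] : "",
            m.group(4),
            "-".equals(m.group(5)) ? "0" : m.group(5),
            m.group(6) == null ? "" : m.group(6),
            m.group(7) == null ? "" : m.group(7)
        };
    }

    // Converts a log timestamp such as "30/May/2013:23:38:09 +0800" to ISO-8601 with offset.
    static String toIsoTime(String logTime) {
        return OffsetDateTime.parse(logTime, LOG_TIME).toString();
    }

    public static void main(String[] args) {
        String line = "49.5.1.14 - - [30/May/2013:23:38:09 +0800] "
            + "\"GET /api/connect/like.php HTTP/1.1\" 200 722";
        String[] fields = parse(line);
        System.out.println(Arrays.toString(fields));
        System.out.println(toIsoTime(fields[1]));
    }
}
```

Returning null for lines that do not match keeps the decision to drop or count bad records with the caller, which is where a Mapper would normally make it.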

 

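1.2 Event-tracking logs
Event-tracking (instrumentation) logs are a technique used by e-commerce sites: as a user browses the products shown on screen, the site proactively records the list of products exposed, the dwell time, the products clicked, the components clicked, and so on. The data serves operations and helps optimize the store layout. Common event-tracking logs are browse, click, and impression logs.
(Browse)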
2018-08-28 11:59:58,263 - site: leeyk99, ip: 188.133.207.46, refer: https://m.leeyk99.com/ru/user/login?redirection=%2Fru%2FSneakers-c-1913.html%3Ficn%3Dsneakers%26ici%3Dmru_navbar15menu01dir02&prot=1, agent: Mozilla/5.0 (Linux; Android 5.1.1; SM-G531H Build/LMY48B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.91 Mobile Safari/537.36, body: {"device_type":"m","home_site":"leeyk99","sub_site":"mru","language":"ru","money_type":"RUB","device_country":"","app_versions":"","network_type":"","ip":"","screen_pixel":"360X640","screen_size":"","device_class":0,"device_brand":"","device_name":"","device_model":"","os_type":0,"os_name":"Android","os_versions":"5.1.1","browser_name":"Chrome","browser_versions":"68.0.3440.91","session_id":"","timestamp":1535428798994,"local_time":"2018/8/28 10:59:58","device_id":"","cookie_id":"5BCE0E1F_DAFD_2E64_F24E_B3B6D5D6BAC5","member_id":"","login":0,"page_id":3,"page_name":"page_real_class","page_param":{"category_id":"1913","source_category_id":"1745"},"start_time":1535428764401,"end_time":1535428798994,"tab_page_id":"page_real_class1535428764401"}
2018-08-28 11:59:58,272 - site: leeyk99, ip: 74.205.199.213, refer: https://m.leeyk99.com/us/Watermelon-Print-Round-Beach-Blanket-p-365584-cat-1866.html, agent: Mozilla/5.0 (Linux; Android 6.0; HUAWEI CAM-L21 Build/HUAWEICAM-L21) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.91 Mobile Safari/537.36, body: {"device_type":"m","home_site":"leeyk99","sub_site":"mus","language":"en","money_type":"USD","device_country":"","app_versions":"","network_type":"","ip":"","screen_pixel":"360X640","screen_size":"","device_class":0,"device_brand":"","device_name":"","device_model":"","os_type":0,"os_name":"Android","os_versions":"6.0","browser_name":"Chrome","browser_versions":"68.0.3440.91","session_id":"","timestamp":1535428797165,"local_time":"2018/8/27 20:59:57","device_id":"","cookie_id":"B66A47CF_5522_DC84_F221_F0848C812BCA","member_id":"","login":0,"page_id":7,"page_name":"page_goods_detail","page_param":{"goods_id":365584,"traceid":"sm`1535428371336`B66A47CF_5522_DC84_F221_F0848C812BCA"},"start_time":1535428797165,"end_time":"","tab_page_id":"page_goods_detail1535428797165"}
2018-08-28 11:59:58,274 - site: leeyk99, ip: 99.174.207.56, refer: https://m.leeyk99.com/us/Striped-Ringer-Tee-p-469810-cat-1738.html?rrec=true, agent: Mozilla/5.0 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.0 Mobile/15E148 Safari/604.1, body: {"device_type":"m","home_site":"leeyk99","sub_site":"mus","language":"en","money_type":"USD","device_country":"","app_versions":"","network_type":"","ip":"","screen_pixel":"375X667","screen_size":"","device_class":0,"device_brand":"","device_name":"","device_model":"","os_type":0,"os_name":"iOS","os_versions":"11.4.1","browser_name":"Mobile Safari","browser_versions":"11.0","session_id":"","timestamp":1535428797977,"local_time":"2018/8/27 22:59:57","device_id":"","cookie_id":"D56B15A4_37D3_9164_CA60_3B4CDB382F2D","member_id":"","login":0,"page_id":7,"page_name":"page_goods_detail","page_param":{"goods_id":469810,"traceid":"sm`1535428730780`D56B15A4_37D3_9164_CA60_3B4CDB382F2D"},"start_time":1535428797977,"end_time":"","tab_page_id":"page_goods_detail1535428797977"}
2018-08-28 11:59:58,293 - site: leeyk99, ip: 172.56.35.21, refer: https://m.leeyk99.com/us/FB-US-Striped-20180402-A-D7-vc-64042.html?utm_source=facebook.com&utm_medium=cpc&utm_campaign=fbadsus_20180408_mobmpa_Food_FB-US-Striped-20180402-A-D7-vc-64042_3554_&url_from=fbadsus_20180408_mobmpa_Food_FB-US-Striped-20180402-A-D7-vc-64042_3554_, agent: Mozilla/5.0 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15G77 Instagram 24.0.0.14.205 (iPhone7,2; iOS 11_4_1; en_US; en-US; scale=2.00; gamut=normal; 750x1334), body: {"device_type":"m","home_site":"leeyk99","sub_site":"mus","language":"en","money_type":"USD","device_country":"","app_versions":"","network_type":"","ip":"","screen_pixel":"375X667","screen_size":"","device_class":0,"device_brand":"","device_name":"","device_model":"","os_type":0,"os_name":"iOS","os_versions":"11.4.1","browser_name":"WebKit","browser_versions":"605.1.15","session_id":"","timestamp":1535428797377,"local_time":"2018/8/27 23:59:57","device_id":"","cookie_id":"3BFCB287_A97B_AA24_DBCC_86DC346D3100","member_id":"","login":0,"page_id":2,"page_name":"page_virtual_class","page_param":{"category_id":"64042"},"start_time":1535428797377,"end_time":"","tab_page_id":"page_virtual_class1535428797377"}

 

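Click and impression logs look much like the browse log. What gets collected varies slightly with the tracking requirements, but the core of each record is the content of body.
Because the content of event-tracking logs is designed to requirements, the format is tidy and regular, and a Hive regular expression is generally enough to filter the data and load it into a table. For this browse log, you only need to create one table and specify the following regex to load and use the log: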
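Since body carries the useful fields, one typical derived value is the dwell time, end_time minus start_time. Here is a rough sketch of that calculation in plain Java. The class name BrowseBody and its methods are hypothetical, and it pulls numbers out with a regex only to stay dependency-free; a real JSON parser would be more robust. Note that in the samples above end_time is an empty string when the page has not been left yet, so the result is optional.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the original post) that reads numeric fields out of body.
final class BrowseBody {

    // Finds "name":<digits> in the JSON text; a value such as "" counts as missing.
    static OptionalLong numericField(String body, String name) {
        Matcher m = Pattern.compile("\"" + Pattern.quote(name) + "\":(\\d+)").matcher(body);
        return m.find() ? OptionalLong.of(Long.parseLong(m.group(1))) : OptionalLong.empty();
    }

    // Dwell time in milliseconds, or empty when either timestamp is missing.
    static OptionalLong dwellMillis(String body) {
        OptionalLong start = numericField(body, "start_time");
        OptionalLong end = numericField(body, "end_time");
        if (start.isPresent() && end.isPresent()) {
            return OptionalLong.of(end.getAsLong() - start.getAsLong());
        }
        return OptionalLong.empty();
    }

    public static void main(String[] args) {
        // Abbreviated from the first browse record; most fields are left out.
        String finished = "{\"page_id\":3,\"start_time\":1535428764401,"
            + "\"end_time\":1535428798994}";
        // Abbreviated from the second browse record, where end_time is "".
        String open = "{\"page_id\":7,\"start_time\":1535428797165,\"end_time\":\"\"}";
        System.out.println(dwellMillis(finished));
        System.out.println(dwellMillis(open));
    }
}
```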
'input.regex'='([0-9\\.\\- :,]+) \\- site: ([\\w]+), ip: ([0-9\\.\\- :,]+), refer: (.*), agent: (.*), body: ([\\[\\{].*[\\}\\]])'
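To check what each capture group of this pattern picks up before creating the table, the same expression can be tried in plain Java. This is only a sketch: the class BrowseLogRegex is hypothetical, the pattern text is copied as-is from the property above into a Java string literal, and it assumes the whole line has to match, which is how I understand Hive's regex handling to behave.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the original post) for trying the browse-log regex.
final class BrowseLogRegex {

    // Same pattern text as the 'input.regex' property; the six groups are
    // time, site, ip, refer, agent, body.
    static final Pattern BROWSE = Pattern.compile(
        "([0-9\\.\\- :,]+) \\- site: ([\\w]+), ip: ([0-9\\.\\- :,]+), "
        + "refer: (.*), agent: (.*), body: ([\\[\\{].*[\\}\\]])");

    // Returns the six captured columns, or null if the line does not match.
    static String[] split(String line) {
        Matcher m = BROWSE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] cols = new String[m.groupCount()];
        for (int i = 0; i < cols.length; i++) {
            cols[i] = m.group(i + 1);
        }
        return cols;
    }

    public static void main(String[] args) {
        // Abbreviated from the first browse record; refer, agent and body are shortened.
        String line = "2018-08-28 11:59:58,263 - site: leeyk99, ip: 188.133.207.46, "
            + "refer: https://m.leeyk99.com/ru/user/login?prot=1, "
            + "agent: Mozilla/5.0 (Linux; Android 5.1.1) AppleWebKit/537.36 (KHTML, like Gecko), "
            + "body: {\"device_type\":\"m\",\"page_id\":3}";
        System.out.println(Arrays.toString(split(line)));
    }
}
```

The commas inside the user agent, as in "(KHTML, like Gecko)", do not break the match because the pattern splits on the literal labels ", agent: " and ", body: " and not on commas alone.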

 
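1.3 Backend system logs
Backend system logs are written by the system itself. Typically a front end or another system calls a backend API, and the backend records the request or the result it returned. This data usually follows a format agreed between the systems, so it is very regular and can also be processed directly with Hive regular expressions.
For example:
Format 1: (result information)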
2018-07-03 06:50:00,142 [XNIO-2 task-28] INFO  com.leeyk99.bi.abt.rest.CoreApiController - 1A42F7C6_B904_A334_AB87_5A69A7034DA0  leeyk99PcRealClass 66 158

 

Format 2: (API request information)
2018-07-03 20:39:46,043 [XNIO-2 task-211] INFO  com.leeyk99.bi.abt.filter.LogFilter - GET  /api/v1/bi/abt?cid=973EA838_E20E_74E4_41AB_E218DA91D73E&uid=&site=mtw&terminal=leeyk99-M&lan=zh-tw took 1ms and returned 200
 
(In 1.2 and 1.3, leeyk99 replaces a company brand name in the source data.)
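Both backend formats share the prefix "timestamp [thread] LEVEL logger - message", and only the message differs. As an illustration of how regular they are, here is a small sketch in plain Java. The class BackendLogParser and its method names are hypothetical, and since the post does not say what the columns of a format 1 message mean, the sketch only splits them on whitespace without naming them.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the original post) for the two backend log formats.
final class BackendLogParser {

    // timestamp [thread] LEVEL logger - message
    private static final Pattern PREFIX = Pattern.compile(
        "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}) \\[([^\\]]+)\\] "
        + "(\\w+)\\s+(\\S+) - (.*)$");

    // Format 2 message: METHOD uri took <n>ms and returned <status>
    private static final Pattern REQUEST = Pattern.compile(
        "^(\\w+)\\s+(\\S+) took (\\d+)ms and returned (\\d{3})$");

    // Returns {timestamp, thread, level, logger, message}, or null if the line does not match.
    static String[] parsePrefix(String line) {
        Matcher m = PREFIX.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3), m.group(4), m.group(5)};
    }

    // Returns {method, uri, elapsedMillis, status} for a format 2 message, or null.
    static String[] parseRequest(String message) {
        Matcher m = REQUEST.matcher(message.trim());
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3), m.group(4)};
    }

    // Splits a format 1 message into its whitespace-separated columns.
    static String[] splitResult(String message) {
        return message.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String line = "2018-07-03 06:50:00,142 [XNIO-2 task-28] INFO  "
            + "com.leeyk99.bi.abt.rest.CoreApiController - "
            + "1A42F7C6_B904_A334_AB87_5A69A7034DA0  leeyk99PcRealClass 66 158";
        String[] prefix = parsePrefix(line);
        System.out.println(Arrays.toString(prefix));
        System.out.println(Arrays.toString(splitResult(prefix[4])));
    }
}
```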
 
For processing well-structured log data with Hive regular expressions, see:  (cnblogs) or  (Evernote)
This post covers cleaning request logs with Hadoop MapReduce.
 
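2. Collecting and loading request logs
As for collecting log files, the data warehouse team generally does not pull them from production systems directly. Operations or a dedicated team collects the logs and lands them on HDFS, on S3, or on an interface machine; the warehouse then ingests these files and cleans and processes them.
The ELK stack (Elasticsearch, Logstash, Kibana) provides a complete solution. All three are open source and work well together, and they cover many use cases efficiently. The stack targets platform or system users who need to view and monitor logs and track how a system is running.
Flume, which comes from Cloudera, is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive volumes of log data. Common combinations include: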
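  • Flume + Kafka + Storm + MySQL for a real-time big data system
  • Flume + HDFS + Kafka + Storm for real-time recommendation, anti-crawler, and similar services
  • Flume + Hadoop + Hive for offline analysis of user browsing paths on a website
  • Flume + Logstash + Kafka + Spark Streaming for real-time log processing and analysis
  • Flume + Spark + ELK for a real-time monitoring platform
FTP file transfer is also an important way to serve files, but it may not suit large volumes of logs, unless the logs are archived offline first and then transferred to an interface machine for third parties to pick up.
I have no experience yet with real-time collection and similar modes.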
 
3. Configuring the Maven-Hadoop environment
3.1. Project initialization
<groupId>com.leeyk99.udp</groupId>
<artifactId>hadoop-mapreduce</artifactId>
<version>1.0-SNAPSHOT</version>

 

Goal: create a Maven project and configure the JAR files that the Hadoop runtime needs.
 
3.2. Configuring pom.xml
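Configure the JAR files Hadoop needs to run. Note that hadoop-core 1.2.1 is a Hadoop 1.x artifact while the other Hadoop dependencies are 2.7.6; mixing the two lines may cause classpath conflicts, so it is worth checking whether hadoop-core is actually needed.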
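<project>
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.leeyk99.udp</groupId>
    <artifactId>hadoop-mapreduce</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
            <version>1.2.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.6</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
        </dependency>
    </dependencies>
</project>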

 

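For how to configure automatic JAR downloads for Maven projects in IDEA, see my notes.
After the automatic download, IDEA has pulled in many JAR files for this Maven project (listed under External Libraries). Besides the core artifacts we configured ourselves, the dependencies they require were downloaded as well, which saves us from fetching them one by one.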
 
 
 
 
 

Reposted from: https://www.cnblogs.com/leeyuki/p/9560793.html
