An Interesting Look Through the nginx Access Log

Why would checking a log be interesting?

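This is just a small personal site with no name recognition, so its traffic is hardly worth mentioning, though I keep tweaking it anyway. While doing a routine check of the nginx access log over the past couple of days, I noticed some fun things about search engines, so I am writing them down here.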

Grepping the log for baidu turns up the following requests:

123.125.71.73 - - [21/Jul/2013:20:00:42 +0000] "GET /tag/jsf/feed HTTP/1.1" 200 6527 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
220.181.108.139 - - [21/Jul/2013:20:47:12 +0000] "GET / HTTP/1.1" 301 5 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
220.181.108.185 - - [21/Jul/2013:22:47:11 +0000] "GET / HTTP/1.1" 301 5 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
220.181.108.148 - - [21/Jul/2013:23:47:10 +0000] "GET / HTTP/1.1" 301 5 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
220.181.108.141 - - [22/Jul/2013:00:47:26 +0000] "GET / HTTP/1.1" 200 20205 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
220.181.108.162 - - [22/Jul/2013:02:48:38 +0000] "GET / HTTP/1.1" 200 20205 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
123.125.71.76 - - [22/Jul/2013:03:47:26 +0000] "GET / HTTP/1.1" 200 32655 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
220.181.108.169 - - [22/Jul/2013:03:48:58 +0000] "GET / HTTP/1.1" 200 49405 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
123.125.71.116 - - [22/Jul/2013:05:48:40 +0000] "GET / HTTP/1.1" 200 20205 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
220.181.108.159 - - [22/Jul/2013:05:49:04 +0000] "GET / HTTP/1.1" 200 39185 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
123.125.71.100 - - [22/Jul/2013:06:47:54 +0000] "GET / HTTP/1.1" 200 20205 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
220.181.108.161 - - [22/Jul/2013:06:48:29 +0000] "GET / HTTP/1.1" 200 20205 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
123.125.71.97 - - [22/Jul/2013:08:45:48 +0000] "GET / HTTP/1.1" 301 5 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
220.181.108.158 - - [22/Jul/2013:08:46:23 +0000] "GET / HTTP/1.1" 301 5 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
123.125.71.72 - - [22/Jul/2013:09:55:43 +0000] "GET / HTTP/1.1" 200 20205 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
220.181.108.155 - - [22/Jul/2013:09:57:15 +0000] "GET / HTTP/1.1" 200 49405 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
123.125.71.89 - - [22/Jul/2013:11:46:58 +0000] "GET / HTTP/1.1" 301 5 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
220.181.108.183 - - [22/Jul/2013:11:49:08 +0000] "GET / HTTP/1.1" 301 5 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
123.125.71.101 - - [22/Jul/2013:13:46:29 +0000] "GET / HTTP/1.1" 200 20205 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
220.181.108.170 - - [22/Jul/2013:13:47:52 +0000] "GET / HTTP/1.1" 200 20205 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
123.125.71.106 - - [22/Jul/2013:22:45:43 +0800] "GET / HTTP/1.1" 200 49405 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
220.181.108.159 - - [22/Jul/2013:22:49:27 +0800] "GET / HTTP/1.1" 200 20205 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
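
For readers who would rather script the grep step than run it by hand, here is a minimal Python sketch of the same filter. The function name `grep_lines` and the two sample lines (copied from the log above) are illustrative assumptions, not part of the original workflow; it does a case-insensitive substring match on each line, roughly what `grep -i baidu` would do.

```python
import re
from typing import Iterable, Iterator


def grep_lines(lines: Iterable[str], keyword: str) -> Iterator[str]:
    """Yield every line that contains `keyword`, ignoring case.

    `grep_lines` is a hypothetical helper name; it mimics `grep -i`.
    """
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    for line in lines:
        if pattern.search(line):
            # Drop the trailing newline so the result prints cleanly.
            yield line.rstrip("\n")


# Two sample entries taken from the logs quoted in this post.
sample = [
    '220.181.108.139 - - [21/Jul/2013:20:47:12 +0000] "GET / HTTP/1.1" 301 5 "-" '
    '"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"\n',
    '66.249.75.156 - - [21/Jul/2013:19:59:41 +0000] "GET /robots.txt HTTP/1.1" 200 59 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"\n',
]

for hit in grep_lines(sample, "baidu"):
    print(hit)
```

To run it against a real log, pass an open file object instead of `sample`; the location of the access log depends on the nginx configuration, so no path is assumed here.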

Now let's look at one day of requests from Google:

66.249.75.156 - - [21/Jul/2013:19:59:41 +0000] "GET /robots.txt HTTP/1.1" 200 59 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.156 - - [21/Jul/2013:19:59:41 +0000] "GET /category/java HTTP/1.1" 200 24484 "-" "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)"
66.249.75.156 - - [22/Jul/2013:07:54:46 +0000] "GET /robots.txt HTTP/1.1" 200 59 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.156 - - [22/Jul/2013:07:54:46 +0000] "GET /author/bilxio HTTP/1.1" 200 51551 "-" "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)"
66.249.75.156 - - [22/Jul/2013:09:41:52 +0000] "GET /feed HTTP/1.1" 200 36036 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.156 - - [22/Jul/2013:09:41:53 +0000] "GET / HTTP/1.1" 200 51169 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.156 - - [22/Jul/2013:10:14:10 +0000] "GET /backup-vps-to-dropbox.html HTTP/1.1" 200 16969 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.156 - - [22/Jul/2013:10:14:59 +0000] "GET /category/linux HTTP/1.1" 200 14244 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.156 - - [22/Jul/2013:10:15:02 +0000] "GET /wp-content/uploads/2013/07/2013-07-21_163424.jpg HTTP/1.1" 200 62249 "-" "Googlebot-Image/1.0"
66.249.75.156 - - [22/Jul/2013:10:15:12 +0000] "GET /wp-content/uploads/2013/07/2013-07-21_165536.jpg HTTP/1.1" 200 178591 "-" "Googlebot-Image/1.0"
66.249.75.156 - - [22/Jul/2013:10:15:26 +0000] "GET /wp-content/uploads/2013/07/2013-07-21_162104.jpg HTTP/1.1" 200 95795 "-" "Googlebot-Image/1.0"
66.249.75.156 - - [22/Jul/2013:10:15:51 +0000] "GET /2013/07 HTTP/1.1" 200 14054 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.156 - - [22/Jul/2013:10:16:47 +0000] "GET /tag/vps HTTP/1.1" 200 14190 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.156 - - [22/Jul/2013:10:17:43 +0000] "GET /tag/backup HTTP/1.1" 200 14205 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.156 - - [22/Jul/2013:10:18:39 +0000] "GET /tag/linux-2 HTTP/1.1" 200 14204 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.156 - - [22/Jul/2013:10:19:01 +0000] "GET /wp-content/themes/twentytwelve/my/highlight.min.css HTTP/1.1" 200 1860 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.156 - - [22/Jul/2013:10:19:02 +0000] "GET /wp-content/themes/twentytwelve/style.css?ver=3.5.1 HTTP/1.1" 200 35282 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.156 - - [22/Jul/2013:10:19:38 +0000] "GET /tag/dropbox HTTP/1.1" 200 14210 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.156 - - [22/Jul/2013:10:20:45 +0000] "GET /backup-vps-to-dropbox.html/feed HTTP/1.1" 200 821 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.156 - - [22/Jul/2013:10:25:59 +0000] "GET /category/linux/feed HTTP/1.1" 200 4920 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.156 - - [22/Jul/2013:10:26:12 +0000] "GET /tag/backup/feed HTTP/1.1" 200 4917 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.156 - - [22/Jul/2013:10:26:43 +0000] "GET /tag/dropbox/feed HTTP/1.1" 200 4919 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.156 - - [22/Jul/2013:10:26:46 +0000] "GET /tag/linux-2/feed HTTP/1.1" 200 4917 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.156 - - [22/Jul/2013:10:27:32 +0000] "GET /tag/vps/feed HTTP/1.1" 200 4911 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.84.212 - - [22/Jul/2013:22:20:02 +0800] "GET / HTTP/1.1" 301 5 "-" "Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0 Google (+https://developers.google.com/+/web/snippet/)"
66.249.84.121 - - [22/Jul/2013:22:20:07 +0800] "GET / HTTP/1.1" 200 51169 "-" "Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0 Google (+https://developers.google.com/+/web/snippet/)"
66.249.84.212 - - [22/Jul/2013:22:20:07 +0800] "GET /favicon.ico HTTP/1.1" 200 0 "-" "Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0 Google (+https://developers.google.com/+/web/snippet/)"

Comparing the two, a few interesting things stand out:

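  1. Baidu's requests do not include robots.txt at all, while the very first request from Google fetches robots.txt.
  2. Baidu only ever uses one crawler, while Google uses more than one. At a glance there are at least four, judging by the User-Agent strings in the Google log above: Googlebot/2.1, Googlebot-Mobile/2.1, Googlebot-Image/1.0, and the fetcher whose User-Agent ends in "Google (+https://developers.google.com/+/web/snippet/)".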
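
The two observations above can also be checked mechanically. The following is a rough Python sketch, not the method used for this post: it assumes the log is in nginx's combined format (which matches the entries quoted above), and the names `LOG_RE`, `CRAWLERS`, `classify`, and `summarize` are hypothetical. The crawler tokens are taken from the User-Agent strings in the logs; any crawler not in that table is simply ignored.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Combined log format:
# ip - user [time] "method path protocol" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# (token found in the User-Agent, search engine). Order matters: the more
# specific Googlebot-* tokens must be tried before plain "Googlebot".
CRAWLERS = [
    ("Googlebot-Mobile", "google"),
    ("Googlebot-Image", "google"),
    ("Googlebot", "google"),
    ("developers.google.com/+/web/snippet", "google"),
    ("Baiduspider", "baidu"),
]


def classify(agent: str) -> Optional[Tuple[str, str]]:
    """Return (token, engine) for a known crawler User-Agent, else None."""
    for token, engine in CRAWLERS:
        if token in agent:
            return token, engine
    return None


def summarize(lines: Iterable[str]) -> Dict[str, dict]:
    """Per engine: which crawler tokens appeared, and whether robots.txt was fetched."""
    summary = defaultdict(lambda: {"crawlers": set(), "robots_txt": False})
    for line in lines:
        match = LOG_RE.match(line)
        if match is None:
            continue  # skip lines that are not in combined format
        hit = classify(match.group("agent"))
        if hit is None:
            continue  # not one of the crawlers listed above
        token, engine = hit
        summary[engine]["crawlers"].add(token)
        if match.group("path") == "/robots.txt":
            summary[engine]["robots_txt"] = True
    return dict(summary)
```

Feeding `summarize` the lines of an access log yields one entry per search engine, so "how many distinct crawlers" becomes `len(result["google"]["crawlers"])` and "did it ever ask for robots.txt" becomes `result["baidu"]["robots_txt"]`. Matching on substrings of the User-Agent is a deliberate simplification: the header is self-reported, so this only describes what a client claims to be.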

Complaints about how Baidu honors robots.txt are everywhere online, so I won't add to them here. It just leaves me feeling helpless.

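Google's crawlers, on the other hand, are quite interesting. Looking closely, they use as many as four distinct User-Agents, including a dedicated crawler for mobile access, which surprised me. Its UA is amusing too: it claims to be an iPhone (the exact string is: iPhone; U; CPU iPhone OS 4_1 like Mac OS X). It makes me imagine Google stacking a pile of iPhones in a rack to act as spiders, crawling all the sites that adapt to mobile devices. I'm really curious.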

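PS: The analysis above rests on one premise: this site had not been submitted to any search engine. The crawling strategy may well be different after submission. By the time I wrote this paragraph, the site had been registered with the major search engines, and I am still watching how the crawlers behave.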

© 2022, Bill X.