Nutch: Installation, Usage, and an Introduction to Custom Development ( by quqi99 )

Posted by 安哥网络 on 2014-9-3 16:28:17

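Author: 张华 (Zhang Hua), published 2007-05-24 ( http://blog.csdn.net/quqi99 )

Copyright notice: this article may be reposted freely, provided the repost includes a hyperlink to the original source, the author information, and this copyright notice.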

1 Nutch (on Windows)

1.1 Installing Nutch

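Reference: http://www.blogjava.net/dev2dev/archive/2006/02/01/29415.aspx (a detailed guide to installing Nutch on Windows)

The scripts that ship with Nutch need a Linux-like environment, so Cygwin must be installed first to emulate one.

1) Install Cygwin.

2) Download nutch-0.9.tar.gz and extract it with WinRAR, for example to g:/nutch-0.9

3) Set up Nutch: open Cygwin and run:

cd /cygdrive/g/nutch-0.9 (that is, change to the directory where Nutch was extracted)

bin/nutch (runs the nutch script; invoked without arguments it should print its usage message, which confirms the setup works)

4) Done.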

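1.2 Getting Started with Nutch

Reference: http://blog.csdn.net/zjzcl/archive/2006/02/06/593138.aspx

A first try at using Nutch (covers both fetching and searching)

Note: use JDK 1.5. JDK 1.4 fails with the error: unsupported major.minor version 49.0 (class file version 49.0 corresponds to Java 5, so an older JVM cannot load the classes).

Set the environment variable: NUTCH_JAVA_HOME = c:/jdk1.5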
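Before running any Nutch command it can save time to confirm that NUTCH_JAVA_HOME really points at a JDK. The following is a minimal sketch for the Cygwin shell. The function name check_java_home is made up for this example and is not part of Nutch, and c:/jdk1.5 is simply the path used above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Nutch): report whether a directory
# looks like a JDK home by checking for bin/java or bin/java.exe.
check_java_home() {
    local home="$1"
    if [ -z "$home" ]; then
        echo "NUTCH_JAVA_HOME is not set"
        return 1
    fi
    if [ -x "$home/bin/java" ] || [ -x "$home/bin/java.exe" ]; then
        echo "ok: $home"
        return 0
    fi
    echo "no java executable under $home/bin"
    return 1
}

# Keep an already exported value; otherwise fall back to the path from
# the article. Adjust it to wherever the JDK is actually installed.
export NUTCH_JAVA_HOME="${NUTCH_JAVA_HOME:-c:/jdk1.5}"
check_java_home "$NUTCH_JAVA_HOME" || true
```

If the check prints the "no java executable" message, the variable most likely points at a JRE-less directory or at the wrong drive.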

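1.2.1 Crawling a Small Number of Sites

1) In the Nutch installation directory, create a file url.txt that lists the top-level URLs of the sites to crawl. Its content is:

http://www.aerostrong.com.cn

2) Edit conf/crawl-urlfilter.txt and change the MY.DOMAIN.NAME part: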

# accept hosts in MY.DOMAIN.NAME

#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

+^http://www.aerostrong.com.cn

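3) Run the crawl. The Nutch scripts are all Linux shell scripts, so running them on Windows requires Cygwin. Open Cygwin and run:

cd /cygdrive/g/nutch-0.9 (that is, change to the directory where Nutch was extracted)

bin/nutch crawl url.txt -dir crawled -depth 3 -threads 4 >& crawl.log

Parameters: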

-dir dir names the directory to put the crawl in.

-depth depth indicates the link depth from the root page that should be crawled.

-delay delay determines the number of seconds between accesses to each host.

-threads threads determines the number of threads that will fetch in parallel.
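The accept rule written in step 2 can be tried out before starting a crawl. The sketch below approximates a single "+regex" line with grep -E; Nutch itself evaluates the file with Java regular expressions, so this is only a rough check. The helper url_accepted and the test URLs are invented for illustration, and the stricter pattern (escaped dots plus a trailing slash) is a suggested variant, not what the article uses.

```shell
#!/usr/bin/env bash
# Approximate one "+regex" accept rule with grep -E.
# url_accepted is a hypothetical helper, not a Nutch command.
url_accepted() {
    local pattern="$1" url="$2"
    if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
        echo accept
    else
        echo reject
    fi
}

loose='^http://www.aerostrong.com.cn'       # rule as written in step 2
strict='^http://www\.aerostrong\.com\.cn/'  # assumed variant: dots escaped, slash added

url_accepted "$loose"  'http://www.aerostrong.com.cn/index.html'
url_accepted "$loose"  'http://wwwXaerostrong.com.cn/'
url_accepted "$strict" 'http://wwwXaerostrong.com.cn/'
url_accepted "$strict" 'http://www.aerostrong.com.cn.example.org/'
```

An unescaped dot matches any character, so the loose rule also accepts the made-up host wwwXaerostrong.com.cn, while the strict one rejects it. Note that the strict variant needs a slash after the host name, so a seed URL written without one would only match if it is normalized to end in "/" first.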

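1.2.2 Crawling the Whole Web

Reference: http://hedong.3322.org/archives/000247.html (Trying out Nutch)

1. Concepts:

1) web database: the pages Nutch knows about, together with the links found in those pages. (The injector adds pages to it from DMOZ. DMOZ, also called The Open Directory Project or ODP, is a human-edited directory that supplies results or data to search engines.)

2) segments: a segment is a set of pages that is fetched and indexed as a single unit. It contains the following kinds of data:

Fetchlist: the set of names of these pages

Fetcher output: the set of files for these pages

Index: the index output in Lucene format

2. Building the web database and the segments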

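Initial setup
	mkdir db	create a directory to hold the web database
	mkdir segments
	bin/nutch admin db -create	create a new, empty database (this step failed for me; these commands follow the older Nutch 0.7 tutorial, and the 0.9 script does not appear to provide admin)
First round of fetching
	bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000	take URLs from the DMOZ list and add them to the database
	bin/nutch generate db segments	generate a fetchlist from the database contents
	s1=`ls -d segments/2* | tail -1`	the new fetchlist is in the last directory; save its name
	bin/nutch fetch $s1	fetch the pages with the robot
	bin/nutch updatedb db $s1	update the database with the fetch results
Second round of fetching
	bin/nutch analyze db 5	analyze the page links, 5 iterations
	bin/nutch generate db segments -topN 1000	generate a new fetchlist from the 1000 top-ranked URLs
	s2=`ls -d segments/2* | tail -1`	then fetch, update, and analyze the links for 2 iterations
	bin/nutch fetch $s2
	bin/nutch updatedb db $s2
Third round of fetching
	bin/nutch analyze db 2
	bin/nutch generate db segments -topN 1000
	s3=`ls -d segments/2* | tail -1`
	bin/nutch fetch $s3
	bin/nutch updatedb db $s3
	bin/nutch analyze db 2	(to prepare for the next round?)
Index and remove duplicates
	bin/nutch index $s1
	bin/nutch index $s2
	bin/nutch index $s3
	bin/nutch dedup segments dedup.tmp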
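The s1/s2/s3 assignments above all rely on the same idiom: list the segment directories and keep the last name. The sketch below isolates that idiom so it can be tried without Nutch. It assumes segment directories are named by timestamp, so that the last name in sorted order is the newest; latest_segment and the directory names are made up for the demonstration.

```shell
#!/usr/bin/env bash
# latest_segment is a hypothetical helper wrapping the idiom from the
# steps above: list entries whose names start with "2" and keep the last.
latest_segment() {
    # ls sorts names alphabetically, so for timestamp-style names the
    # last line is the most recent segment. Prints nothing if none exist.
    ls -d "$1"/2* 2>/dev/null | tail -1
}

# Demonstration in a throwaway directory with made-up segment names.
demo=$(mktemp -d)
mkdir -p "$demo/segments/20070524101500" "$demo/segments/20070524113000"
s1=$(latest_segment "$demo/segments")
echo "newest segment: ${s1##*/}"
rm -rf "$demo"
```

Because the selection depends only on name order, it picks the wrong directory if an older segment is copied in under a later-sorting name, so it is worth echoing the variable before passing it to fetch.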

1.2.3 Searching

1) Drop the nutch-0.9.war file into Tomcat's deployment directory.

2) Edit the configuration file to point at the index (nutch-site.xml under WEB-INF/classes):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<nutch-conf>
  <property>
    <name>searcher.dir</name>
    <value>G:/nutch-0.9/crawled</value>
  </property>
</nutch-conf>

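Note: if the following error appears after copying the configuration above, the copied text carried stray whitespace or a different encoding; retyping the file by hand fixes it: java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence

3) Test at http://172.17.1.122:8081

Note: if Chinese text in the query string comes out garbled, the cause has little to do with Nutch and lies mainly with Tomcat. Edit Tomcat's server.xml and add these attributes to the Connector element: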

URIEncoding="UTF-8" useBodyEncodingForURI="true"

1.3 Custom Development with Nutch

References:

http://www.mysoo.com.cn/news/2007/200721679.shtml Implementing a Google-style search engine

http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html Introduction to Nutch, Part 1: Crawling

http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html Introduction to Nutch, Part 2: Searching

Resources

A collected list of Nutch-related material: http://www.gispark.com/html/spatial/2006/1008/294.html

Excerpted from: http://blog.csdn.net/quqi99/article/details/1624210
