yang7229693

Compiling nutch-1.0 in Eclipse

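I found the solution in the Nutch section of the wiki (http://wiki.apache.org/nutch/RunNutchInEclipse1.0). It is in English, but fortunately not hard to follow. Working through it, I found out why my earlier attempts based on other online guides had failed: unless you modify the source, importing nutch-1.0 produces two errors, and those articles never mention this. Below is the procedure that worked for me.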
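1. Configure the cygwin environment variable. This step is important: if you skip it, you will later get the error "Failed to get the current user's information" or 'Login failed: Cannot run program "bash"'.
2. Create a new project with any name, choose "Create project from existing source", and point it at your nutch-1.0 directory.
3. Click Next, switch to "Libraries", click the "Add Class Folder..." button, and select "conf" from the list. Note that many posts I have read describe this step differently.
4. Switch to "Order and Export", find "conf", and move it to the top.
5. Switch to "Source" and set the output folder to Nutch/bin/tmp_build (adjust this to your own setup), then click Finish to complete the import.
6. Edit nutch-default.xml, nutch-site.xml and crawl-urlfilter.txt.
7. Download the MP3 and RTF jar files from http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/ and http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/, and copy them into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively.
8. Refresh the project a few times, right-click the project folder and choose Build Path -> Configure Build Path.... In the dialog that opens, switch to Libraries, choose Add Jars..., and add the jar files you just downloaded to the project.
9. At this point the project will normally still show two errors. In the official nutch 1.0 release these two problems were left unfixed because of licensing issues. What follows is the most important part.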
Edit RTFParseFactory.java under src\plugin\parse-rtf\src\java\org\apache\nutch\parse\rtf:
Add import org.apache.nutch.parse.ParseResult;
Change public Parse getParse(Content content) {
to public ParseResult getParse(Content content) {
Change return new ParseStatus(ParseStatus.FAILED,
                              ParseStatus.FAILED_EXCEPTION,
                              e.toString()).getEmptyParse(conf);
to return new ParseStatus(ParseStatus.FAILED,
                          ParseStatus.FAILED_EXCEPTION,
                          e.toString()).getEmptyParseResult(content.getUrl(), getConf());
Change return new ParseImpl(text,
                            new ParseData(ParseStatus.STATUS_SUCCESS,
                                          title,
                                          OutlinkExtractor.getOutlinks(text, this.conf),
                                          content.getMetadata(),
                                          metadata));
to return ParseResult.createParseResult(content.getUrl(),
                                        new ParseImpl(text,
                                                      new ParseData(ParseStatus.STATUS_SUCCESS,
                                                                    title,
                                                                    OutlinkExtractor.getOutlinks(text, this.conf),
                                                                    content.getMetadata(),
                                                                    metadata)));

Edit TestRTFParser.java under src\plugin\parse-rtf\src\test\org\apache\nutch\parse\rtf:
Change parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content);
to parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content).get(urlString);
After this step, the project in Eclipse should show no more errors.
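To see why the test now needs the extra .get(urlString): after the change, getParse returns a container of parses keyed by URL instead of a single parse, so the caller has to look its document up by URL. The sketch below illustrates only that shape. SimpleParse and SimpleParseResult are simplified stand-ins written for this example; they are not the Nutch classes, and the URL is made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: stand-in types, NOT org.apache.nutch.parse.Parse/ParseResult.
class ParseResultSketch {

    // Stand-in for a single parsed document.
    static final class SimpleParse {
        final String text;

        SimpleParse(String text) {
            this.text = text;
        }
    }

    // Stand-in for a container that maps URLs to their parses.
    static final class SimpleParseResult {
        private final Map<String, SimpleParse> parses = new LinkedHashMap<>();

        // Builds a result holding exactly one parse, stored under its URL.
        static SimpleParseResult createParseResult(String url, SimpleParse parse) {
            SimpleParseResult result = new SimpleParseResult();
            result.parses.put(url, parse);
            return result;
        }

        // Returns the parse stored for the URL, or null if there is none.
        SimpleParse get(String url) {
            return parses.get(url);
        }
    }

    // "New style" parser method: returns the container, not the parse itself.
    static SimpleParseResult getParse(String url, String rawText) {
        return SimpleParseResult.createParseResult(url, new SimpleParse(rawText.trim()));
    }

    public static void main(String[] args) {
        String urlString = "file:/tmp/test.rtf"; // hypothetical URL
        // The caller picks its document out of the container by URL.
        SimpleParse parse = getParse(urlString, "  hello rtf  ").get(urlString);
        System.out.println(parse.text);
    }
}
```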
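10. Choose Run -> Run As -> Java Application and, in the Select Java Application dialog, pick Crawl - org.apache.nutch.crawl. The first run does nothing useful because no arguments are set. Next, choose Run -> Run Configurations.... Under Java Application on the left there is now a Crawl entry; select it.
Switch to Arguments. Program arguments holds the parameters to set: enter urls -dir crawl -depth 3 -topN 50 (adjust to your own situation; urls refers to your seed links). Under VM arguments, enter -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
Then click Run. Normally, if nothing unexpected happens, it runs without trouble and you can watch the pages being fetched. I did hit one problem here, though: the Java heap size. Check logs/hadoop.log; if it contains something like java.lang.OutOfMemoryError: Java heap space, this is usually the cause. Go to Eclipse -> Window -> Preferences -> Java -> Installed JREs -> edit -> Default VM arguments
and set it to -Xms5m -Xmx250m, where Xms is the minimum (initial) heap size and Xmx is the maximum heap size.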
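A quick way to confirm that the heap setting reached the JVM is to print the limits from a small class run with the same JRE and run configuration. This is a minimal sketch; the class name HeapInfo is made up for the example. The reported maximum is typically close to the -Xmx value but may not match it exactly.

```java
// Prints the heap limits of the JVM this class is running in.
class HeapInfo {

    private static final long MIB = 1024L * 1024L;

    // Maximum heap the JVM will attempt to use, in MiB (governed by -Xmx).
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("max heap (MiB):       " + maxHeapMiB());
        System.out.println("allocated heap (MiB): " + rt.totalMemory() / MIB);
        System.out.println("free in allocated:    " + rt.freeMemory() / MIB);
    }
}
```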
Below are some errors you may run into.
Eclipse: Cannot create project content in workspace
The nutch source code must be outside the workspace folder. My first attempt was to download the code with Eclipse (svn) into my workspace. When I tried to create the project from existing code, Eclipse would not let me do it from source code inside the workspace. I then used source code outside my workspace and it worked fine.
plugin dir not found
Make sure you set your plugin.folders property correctly. Instead of a relative path you can also use an absolute one, in nutch-default.xml or, perhaps better, in nutch-site.xml:
<property>
  <name>plugin.folders</name>
  <value>/home/....../nutch-0.9/src/plugin</value>
</property>
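When the plugin directory is not found, it can help to check what value the configuration file really contains and whether that path exists. The sketch below reads a property from a file laid out as property/name/value elements and tests each listed directory. The class PluginFoldersCheck is hypothetical and uses only the JDK, not Nutch's own configuration classes; treating the value as a comma-separated list is an assumption.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: inspects plugin.folders without starting Nutch.
class PluginFoldersCheck {

    // Returns the trimmed <value> of the first <property> whose <name> matches.
    static Optional<String> readProperty(String xmlText, String wanted) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xmlText)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                NodeList names = prop.getElementsByTagName("name");
                NodeList values = prop.getElementsByTagName("value");
                if (names.getLength() > 0 && values.getLength() > 0
                        && wanted.equals(names.item(0).getTextContent().trim())) {
                    return Optional.of(values.item(0).getTextContent().trim());
                }
            }
            return Optional.empty();
        } catch (Exception e) {
            throw new IllegalArgumentException("not a readable configuration file", e);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: PluginFoldersCheck <path to nutch-site.xml>");
            return;
        }
        String xml = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        Optional<String> folders = readProperty(xml, "plugin.folders");
        if (!folders.isPresent()) {
            System.out.println("plugin.folders is not set in " + args[0]);
            return;
        }
        // Assumption: several folders may be listed, separated by commas.
        for (String entry : folders.get().split(",")) {
            String dir = entry.trim();
            System.out.println(dir + " -> "
                    + (Files.isDirectory(Paths.get(dir)) ? "exists" : "NOT FOUND"));
        }
    }
}
```

Relative entries are resolved against the working directory of the process, which is one reason an absolute path is easier to reason about.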
No plugins loaded during unit tests in Eclipse
During unit testing, Eclipse ignored conf/nutch-site.xml in favor of src/test/nutch-site.xml, so you might need to add the plugin directory configuration to that file as well.
Unit tests work in eclipse but fail when running ant in the command line
Suppose your unit tests work perfectly in Eclipse, but every one of them fails when you run ant test on the command line, including the ones you haven't modified. Check whether you defined the plugin.folders property in hadoop-site.xml. If so, try removing it from that file and adding it directly to nutch-site.xml.
Run ant test again. That should have solved the problem.
If that didn't solve the problem, are you testing a plugin? If so, did you add the plugin to the list of packages in plugin\build.xml, on the test target?
classNotFound
• open the class itself, right-click
• refresh the build dir
debugging hadoop classes
• Sometimes it makes sense to also have the Hadoop classes available during debugging. To do so, you can check out the Hadoop sources on your machine and attach them to the hadoop-xxx.jar. Alternatively, you can:
o Remove the hadoopXXX.jar from your classpath libraries
o Check out the Hadoop branch that is used within nutch
o Configure a Hadoop project in Eclipse, similar to the nutch project
o Add the Hadoop project as a dependent project of the nutch project
o You can now also set breakpoints within Hadoop classes, such as InputFormat implementations.
Failed to get the current user's information
On Windows, if the crawler throws an exception complaining it "Failed to get the current user's information" or 'Login failed: Cannot run program "bash"', it is likely you forgot to set the PATH to point to cygwin. Open a new command line window (All Programs > Accessories > Command Prompt) and type "bash". This should start cygwin. If it doesn't, type "path" to see your path. The cygwin bin directory (e.g., C:\cygwin\bin) should appear in it. See the steps for adding this to your PATH at the top of the wiki article under "For Windows Users". After setting the PATH, you will likely need to restart Eclipse so it picks up the new PATH.
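The same check can be made from inside Eclipse, which shows the PATH that Eclipse itself inherited rather than the one in a fresh command prompt. The sketch below searches a PATH-style string for an executable. PathCheck and findOnPath are hypothetical names for this example and are not part of Nutch, Hadoop or cygwin.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper: looks for an executable on a PATH-style string.
class PathCheck {

    // Returns the first regular file named one of `names` found in the
    // directories of `pathValue` (entries separated by File.pathSeparator).
    static Optional<Path> findOnPath(String pathValue, String... names) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : names) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate)) {
                        return Optional.of(candidate);
                    }
                } catch (InvalidPathException e) {
                    // Skip PATH entries that are not valid paths on this platform.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Run this from Eclipse to see the PATH Eclipse was started with.
        Optional<Path> bash = findOnPath(System.getenv("PATH"), "bash.exe", "bash");
        if (bash.isPresent()) {
            System.out.println("bash found at " + bash.get());
        } else {
            System.out.println("bash is not on PATH; add the cygwin bin directory and restart Eclipse");
        }
    }
}
```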
Comments
#5 fs08ab 2015-07-10
Hi, where is import org.apache.nutch.parse.ParseResult? I can't find it.
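#4 zyy571137 2010-07-13
I deployed Nutch-1.0 in MyEclipse 6.5 following your method and fixed some of the errors, but some problems remain: in RTFParseFactory.java under src\plugin\parse-rtf\src\java\org\apache\nutch\parse\rtf, the classes import com.etranslate.tm.processing.rtf.ParseException;
import com.etranslate.tm.processing.rtf.RTFParser; are missing, and in
RTFParserDelegateImpl.java the class import com.etranslate.tm.processing.rtf.RTFParserDelegate; cannot be found either. Did you run into these problems?

Haha, they played a little joke on us. The jar files we downloaded have to be unpacked and renamed before they work.
The original jar does not have a normal directory layout inside! O(∩_∩)O~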
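#3 xubogang 2010-04-17
This is excellently written; I have just finished the setup. I would like to mention one thing to others reading this, and you could add it to the post as well:
add cygwin/bin to the path. I had not done so and got the following error:
Cannot run program "whoami": CreateProcess error

Still, many thanks for your article. O(∩_∩)O Thanks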
#2 ibc789 2010-01-09
I deployed Nutch-1.0 in MyEclipse 6.5 following your method and fixed some of the errors, but some problems remain: in RTFParseFactory.java under src\plugin\parse-rtf\src\java\org\apache\nutch\parse\rtf, the classes import com.etranslate.tm.processing.rtf.ParseException;
import com.etranslate.tm.processing.rtf.RTFParser; are missing, and in
RTFParserDelegateImpl.java the class import com.etranslate.tm.processing.rtf.RTFParserDelegate; cannot be found either. Did you run into these problems?
#1 jettang 2009-11-04
Very detailed.

I searched through a lot of material, and yours is the only write-up that gets it right; it even covers the RTF and MP3 fixes.
