Nutch 2 solrclean alternative: an HBase Ruby script (Gora)

If you use Nutch 2 as the crawler together with Solr and are struggling to set up solrclean to remove 404 and other dead pages, you can use the following Ruby script as an alternative. The script runs against the HBase datastore (Nutch with Gora).

I modified the check_meta.rb file to create nutchsolrclean.rb under hbase-0.90.4/bin ($HBASE_HOME/bin). It fetches the rows for URLs that returned 404 and deletes them using a shell script (deleteall.sh).
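The selection step can be sketched in plain Ruby. This is not the original nutchsolrclean.rb: it is a minimal illustration that uses in-memory hashes in place of an HBase scan, and the field names (:key, :http_code), the sample row keys and the list of "dead" codes are all hypothetical.

```ruby
# Hypothetical sample rows standing in for the result of an HBase scan.
# A real run would read these from the Nutch table, not from a literal array.
SAMPLE_ROWS = [
  { key: "com.example.www:http/",        http_code: 200 },
  { key: "com.example.www:http/missing", http_code: 404 },
  { key: "com.example.www:http/gone",    http_code: 410 },
].freeze

# Codes treated as "page does not exist" (an assumption; adjust as needed).
DEAD_CODES = [404, 410].freeze

# Return the row keys whose status code marks the page as dead.
def dead_row_keys(rows, dead_codes = DEAD_CODES)
  rows.select { |row| dead_codes.include?(row[:http_code]) }
      .map { |row| row[:key] }
end

puts dead_row_keys(SAMPLE_ROWS)
```

Keeping the filter separate from the delete makes it easy to print and review the candidate rows before anything is removed.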

hbase-0.90.4/bin/nutchsolrclean.rb

Source: nutchsolrclean source

hbase-0.90.4/bin/deleteall.sh

Source: deleteall source
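As a rough idea of what the delete step has to produce, the sketch below turns a list of row keys into HBase shell deleteall commands. It is an illustration only, not the original deleteall.sh. The helper name deleteall_commands is hypothetical, and the table name "webpage" is an assumption (it is the usual Nutch 2 default, so check your own Gora mapping).

```ruby
# Build one HBase shell `deleteall` command per row key.
# The default table name "webpage" is an assumption, not taken from the post.
def deleteall_commands(row_keys, table = "webpage")
  row_keys.map do |key|
    # Escape backslashes first, then single quotes, so the key remains a
    # single quoted literal when the HBase shell reads the command.
    escaped = key.gsub("\\") { "\\\\" }.gsub("'") { "\\'" }
    "deleteall '#{table}', '#{escaped}'"
  end
end

# Hypothetical row keys, for illustration only.
puts deleteall_commands(["com.example.www:http/missing", "com.example.www:http/gone"])
```

The printed lines can be reviewed first and then fed to the HBase shell, for example by piping them into ./hbase shell.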

Run the following commands to perform the clean:

>> cd /xyz/abc/hbase-0.90.4/bin

>> ./hbase org.jruby.Main nutchsolrclean.rb

The above command deletes all rows for 404 and other nonexistent pages from the Nutch database (HBase).