Nutch custom authentication: cookie and session management to crawl secure enterprise websites
Cookie-based authentication
After a lot of research to find a plugin for cookie management and for custom NTLM/basic or other SAML-compliant authentication, I had to give up and build my own quick patch.
I am sharing this post so that anyone who has some time can turn it into a plugin and share it with the community. If you need an urgent solution, use the same patch or code in your own projects for enterprise session- or form-based authentication when crawling.
Step 1: Define config properties
Cookie Management
I created the two config properties below to manage the session with cookies (in nutch-default.xml or nutch-site.xml).
<property>
<name>http.auth.csv.cookienames</name>
<value>_SessionId,JSessionID</value>
</property>
<property>
<name>http.auth.cookie.policy</name>
<value>netscape</value>
</property>
http.auth.csv.cookienames - Comma-separated names of the cookies that carry the authenticated session on a secure website.
http.auth.cookie.policy - Cookie policy Nutch uses to read cookies and keep them for the rest of the crawl. (In my use, the code worked without problems with the netscape policy.)
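To make the role of http.auth.csv.cookienames concrete, here is a minimal sketch of how a patch could pick the configured session cookies out of a login response. It is not the patch itself: SessionCookieFilter and its methods are hypothetical names, and it uses only the JDK's java.net.HttpCookie parser rather than Nutch's HTTP client classes.

```java
import java.net.HttpCookie;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not part of Nutch): keeps only the cookies whose
// names are listed in the http.auth.csv.cookienames property value.
class SessionCookieFilter {

    // Split the comma-separated property value into a set of cookie names.
    static Set<String> parseCookieNames(String csv) {
        Set<String> names = new LinkedHashSet<>();
        if (csv == null) {
            return names;
        }
        for (String part : csv.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    // From raw Set-Cookie header values, keep name -> value for the
    // configured session cookies and ignore everything else.
    static Map<String, String> select(String csv, List<String> setCookieHeaders) {
        Set<String> wanted = parseCookieNames(csv);
        Map<String, String> kept = new LinkedHashMap<>();
        for (String header : setCookieHeaders) {
            for (HttpCookie cookie : HttpCookie.parse(header)) {
                if (wanted.contains(cookie.getName())) {
                    kept.put(cookie.getName(), cookie.getValue());
                }
            }
        }
        return kept;
    }

    // Build the value of the Cookie request header for later fetches.
    static String toCookieHeader(Map<String, String> cookies) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : cookies.entrySet()) {
            if (sb.length() > 0) {
                sb.append("; ");
            }
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }
}
```

In a real patch the property value would come from the Nutch configuration, and the resulting header would be attached to every request made after login. Matching here is case-sensitive, so the names in the property must match the cookie names exactly as the server sends them.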
Form Authentication
Enterprise websites that use a normal login form (or a combination of single sign-on and a form) can use the following properties to authenticate against the website.
The properties below define the location of the login form (URL) and the credentials required to authenticate.
<property>
<name>http.auth.csv.urls</name>
<value>http://foobar.com/SSO/login.jsp?logintype=normal,http://foobar.com/login.jsp?logintype=normal</value>
</property>
The credentials (user/password) used with the form URLs above are defined as follows. The first URL is the login URL; the remaining URLs can cover single sign-on or intermediate steps needed to complete authentication. Only the first URL needs the username/password; the rest rely on the cookies set by earlier steps.
httpclient-auth.xml
<auth-configuration>
<credentials username="enterpriseuser" password="passwordxyz">
<default scheme="NTLM"/>
</credentials>
</auth-configuration>
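The login sequence described for http.auth.csv.urls can be sketched roughly as below: the property is split on commas, the first URL is marked as the one that receives the username/password, and the remaining URLs are visited with cookies only. LoginPlan and Step are hypothetical names for illustration, and the sketch assumes that none of the URLs contains a comma.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Nutch): turns the value of
// http.auth.csv.urls into an ordered list of login steps.
class LoginPlan {

    // One request in the login sequence.
    static final class Step {
        final String url;
        final boolean sendCredentials; // true only for the first URL

        Step(String url, boolean sendCredentials) {
            this.url = url;
            this.sendCredentials = sendCredentials;
        }
    }

    // Assumption: URLs are separated by commas and contain no commas themselves.
    static List<Step> fromProperty(String csvUrls) {
        List<Step> steps = new ArrayList<>();
        if (csvUrls == null) {
            return steps;
        }
        for (String part : csvUrls.split(",")) {
            String url = part.trim();
            if (url.isEmpty()) {
                continue;
            }
            // The first non-empty URL is the login form; later ones use cookies.
            steps.add(new Step(url, steps.isEmpty()));
        }
        return steps;
    }
}
```

A fetcher would walk the steps in order before the crawl starts, submitting the credentials from httpclient-auth.xml on the first step and carrying the session cookies through the rest.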