Nutch Custom Authentication cookies session management to crawl secure enterprise websites

Java code changes Step 2 >>

Cookie based authentication

After a lot of research to find a plugin for cookie management and custom NTLM/basic or other SAML-compliant authentication, I had to give up and build my own quick patch.

I am sharing this post so that someone with some time can turn it into a plugin and share it with the community. If you need an urgent solution, use the same patch or code in your own projects for enterprise session/form-based authentication when crawling.

Step 1: Define config properties

Cookie Management

I created the two config properties below to manage sessions with cookies (in nutch-default.xml or nutch-site.xml).

<property>

<name>http.auth.csv.cookienames</name>

<value>_SessionId,JSessionID</value>

</property>

<property>

<name>http.auth.cookie.policy</name>

<value>netscape</value>

</property>

http.auth.csv.cookienames - Comma-separated names of the cookies that carry the auth session on a secure website.

http.auth.cookie.policy - Cookie policy Nutch uses to read cookies and maintain them for the rest of the crawl. (The code works flawlessly with the netscape policy.)
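To illustrate how the two properties above might be used, here is a minimal sketch (not the actual patch) that keeps only the configured session cookies from a response and turns them into a Cookie request header. The class name SessionCookieFilter and its methods are hypothetical, and the sketch uses the JDK's java.net.HttpCookie parser rather than the cookie policy classes Nutch's HTTP client would apply.

```java
import java.net.HttpCookie;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Nutch: keeps only the cookies whose names
// appear in the http.auth.csv.cookienames property value.
class SessionCookieFilter {

    // Splits the comma-separated property value into a set of cookie names.
    static Set<String> parseCookieNames(String csv) {
        Set<String> names = new LinkedHashSet<>();
        for (String name : csv.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                names.add(trimmed);
            }
        }
        return names;
    }

    // Builds a Cookie request header value from Set-Cookie response values,
    // dropping every cookie that is not a configured session cookie.
    static String buildCookieHeader(String csvNames, List<String> setCookieValues) {
        Set<String> wanted = parseCookieNames(csvNames);
        List<String> pairs = new ArrayList<>();
        for (String value : setCookieValues) {
            for (HttpCookie cookie : HttpCookie.parse(value)) {
                if (wanted.contains(cookie.getName())) {
                    pairs.add(cookie.getName() + "=" + cookie.getValue());
                }
            }
        }
        return String.join("; ", pairs);
    }

    public static void main(String[] args) {
        // Example Set-Cookie values; the cookie values are made up.
        List<String> responseCookies = Arrays.asList(
                "JSessionID=abc123; Path=/; HttpOnly",
                "tracking=xyz; Path=/");
        System.out.println(buildCookieHeader("_SessionId,JSessionID", responseCookies));
    }
}
```

The name match here is case-sensitive, so the values in http.auth.csv.cookienames would need to match the cookie names exactly as the server sends them.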

Form Authentication

Enterprise websites that use a normal login form (or a combination of single sign-on and a form) can use the following properties to authenticate against the website.

The properties below define the location of the login form (URL) and the credentials required to authenticate.

<property>

<name>http.auth.csv.urls</name>

<value>http://foobar.com/SSO/login.jsp?logintype=normal,http://foobar.com/login.jsp?logintype=normal</value>

</property>

Credentials (user/password) are used to authenticate against the form URLs above. The first URL is the login URL; the following URLs may be needed for single sign-on or intermediate steps that complete authentication. The first URL needs the username/password, and the remaining URLs use cookies.
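The ordering rule for the URL list can be sketched as follows. This is only an illustration under stated assumptions, not the patch itself: the LoginPlan and Step names are hypothetical, and splitting on commas assumes that none of the URLs in http.auth.csv.urls contains a literal comma.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: turns the http.auth.csv.urls value
// into an ordered list of login steps.
class LoginPlan {

    // One request in the login sequence.
    static final class Step {
        final String url;
        final boolean sendCredentials; // true only for the first URL

        Step(String url, boolean sendCredentials) {
            this.url = url;
            this.sendCredentials = sendCredentials;
        }
    }

    // The first URL receives the username/password; every later URL relies
    // on the session cookies obtained from the earlier steps.
    static List<Step> fromCsv(String csvUrls) {
        List<Step> steps = new ArrayList<>();
        for (String raw : csvUrls.split(",")) {
            String url = raw.trim();
            if (url.isEmpty()) {
                continue;
            }
            steps.add(new Step(url, steps.isEmpty()));
        }
        return steps;
    }

    public static void main(String[] args) {
        String value = "http://foobar.com/SSO/login.jsp?logintype=normal,"
                + "http://foobar.com/login.jsp?logintype=normal";
        for (Step step : fromCsv(value)) {
            System.out.println((step.sendCredentials ? "credentials " : "cookies     ") + step.url);
        }
    }
}
```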

httpclient-auth.xml

<auth-configuration>

<credentials username="enterpriseuser" password="passwordxyz">

<default scheme="NTLM"/>

</credentials>

</auth-configuration>
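For readers who want to check which values a file like the one above carries, here is a small sketch that reads the username, password and default scheme with the JDK's DOM parser. It is not how Nutch itself loads httpclient-auth.xml; the AuthConfigReader class is hypothetical and only handles the first credentials element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of Nutch: reads the first <credentials>
// element of an httpclient-auth.xml document.
class AuthConfigReader {

    // Returns {username, password, default scheme}; the scheme is an empty
    // string when no <default> element or scheme attribute is present.
    static String[] readFirstCredentials(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element credentials = (Element) doc.getElementsByTagName("credentials").item(0);
            if (credentials == null) {
                throw new IllegalStateException("no <credentials> element found");
            }
            Element defaults = (Element) credentials.getElementsByTagName("default").item(0);
            String scheme = defaults == null ? "" : defaults.getAttribute("scheme");
            return new String[] {
                credentials.getAttribute("username"),
                credentials.getAttribute("password"),
                scheme
            };
        } catch (IllegalStateException e) {
            throw e;
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked exceptions.
            throw new IllegalStateException("could not parse auth configuration", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<auth-configuration>"
                + "<credentials username=\"enterpriseuser\" password=\"passwordxyz\">"
                + "<default scheme=\"NTLM\"/>"
                + "</credentials>"
                + "</auth-configuration>";
        String[] values = readFirstCredentials(xml);
        System.out.println(values[0] + " / " + values[2]);
    }
}
```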
