Nutch custom authentication: cookie session management for crawling secure enterprise websites
Cookie-based authentication
After a lot of research looking for a plugin that handles cookie management and custom NTLM/basic or other SAML-compliant authentication, I had to give up and build my own quick patch.
I am sharing this post so that someone with some time can turn it into a plugin and share it with the community. If you need an urgent solution, use the same patch or code in your own projects for enterprise session- or form-based authentication when crawling.
Step 1: Define config properties
I created the two config properties below to manage the session with cookies (in nutch-default.xml or nutch-site.xml).
http.auth.csv.cookienames - Defines the names of the cookies that carry the authenticated session on a secure website
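As a rough illustration, this property could be declared in nutch-site.xml along the following lines. The cookie names JSESSIONID and SMSESSION are placeholders, and the comma-separated value format is an assumption based on the "csv" in the property name; substitute whichever cookies your site's login actually sets.

```xml
<!-- Sketch only: the value below is hypothetical, not taken from the patch. -->
<!-- Assumes a comma-separated list, as the "csv" in the property name suggests. -->
<property>
  <name>http.auth.csv.cookienames</name>
  <value>JSESSIONID,SMSESSION</value>
  <description>Names of the cookies that carry the authenticated session.</description>
</property>
```

Values set in nutch-site.xml override those in nutch-default.xml, so site-specific cookie names are probably best kept in nutch-site.xml.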
For enterprise websites that use a normal login form (or a combination of single sign-on and a form), the following properties can be used to authenticate against the website.
The properties below define the location of the login form (URL) and the credentials required to authenticate.
<credentials username="enterpriseuser" password="passwordxyz">