Moving or Migrating Historic Data into a Big Data Platform
I was moving terabytes of data into a Big Data platform the first time we started using the cluster. It is a real pain in the neck to stage the data in the cloud before moving it into the platform. We were using Hortonworks on the Azure platform.
I tried various options:
- Creating an FTP/SFTP location with HDFS as storage, which would have been the easiest way to stream data directly from the source to HDFS - Failed (I didn't find a reliable solution)
- Moving files directly to the DataNode filesystem - Failed (I had to split the source files into smaller pieces for each DataNode; planning and organizing was a pain)
- Creating a Windows VM in Azure and attaching 50 1TB disks as a staging location, or using Blob storage with unlimited space - Success (I was able to create about 50TB of temporary staging space)
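The successful option above can also be scripted with the Azure CLI rather than clicked through the portal. A minimal sketch, assuming an existing VM; the resource group name, VM name, and disk names are hypothetical placeholders:

```shell
# Sketch: attach 50 new managed 1TB data disks to an existing staging VM.
# "StageRG" and "StageVM" are hypothetical names; adjust to your environment.
for i in $(seq 1 50); do
  az vm disk attach \
    --resource-group StageRG \
    --vm-name StageVM \
    --name "stagedisk$i" \
    --new \
    --size-gb 1024
done
```

Inside Windows, the attached disks can then be pooled into a single large volume (for example with Storage Spaces) to act as one staging area.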
Below are the steps to migrate data into the Big Data platform. Once all the steps are complete, upload the files using FileZilla (an FTP client), use a Samba mount to mount the drive on the HDFS master, and pipe the zip files into HDFS with a command similar to the one below.
cat X.gz | gzip -d | hdfs dfs -put - Y
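The single command above can be wrapped in a loop over all the staged files. A minimal sketch, assuming the Samba share is mounted at /mnt/stage and the HDFS target directory is /data/raw (both hypothetical paths):

```shell
#!/bin/sh
# Decompress each staged .gz file and stream it straight into HDFS,
# without writing the uncompressed data to local disk first.
STAGE_DIR=${STAGE_DIR:-/mnt/stage}   # hypothetical Samba mount point
HDFS_DIR=${HDFS_DIR:-/data/raw}      # hypothetical HDFS target directory

for f in "$STAGE_DIR"/*.gz; do
  [ -e "$f" ] || continue            # skip if the glob matched nothing
  name=$(basename "$f" .gz)
  # gzip -dc writes the decompressed stream to stdout;
  # "hdfs dfs -put -" reads stdin and writes it to the HDFS path.
  gzip -dc "$f" | hdfs dfs -put - "$HDFS_DIR/$name"
done
```

Streaming through the pipe this way means the local disk only ever holds the compressed files, which is what makes the 50TB staging area go further.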
Step 1: Create a Windows VM
Log in to Azure, click New resource, and under Load More select Windows Server.
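If you prefer the command line, the same VM can be created with the Azure CLI. A sketch under assumed names; the resource group, VM name, image alias, and credentials below are all hypothetical placeholders:

```shell
# Sketch: create the staging Windows Server VM via the Azure CLI.
# Names and credentials are placeholders; pick your own region and password.
az group create --name StageRG --location eastus

az vm create \
  --resource-group StageRG \
  --name StageVM \
  --image Win2016Datacenter \
  --admin-username stageadmin \
  --admin-password '<strong-password-here>'
```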