Unless you've been living on a different planet, Map-Reduce is not a new term in your vocabulary. Hadoop is an open source Map-Reduce framework, implemented in Java, for processing large amounts of data in parallel. Although the framework itself is written in Java, Map-Reduce applications need not be.
Setting up Hadoop on your personal computer
In this section, we will set up Hadoop on a GNU/Linux machine in pseudo-distributed mode. In this mode, everything runs on a single machine, but each Hadoop daemon runs in a separate Java process.
Requirements
At a minimum, you need Java 1.6.x (Sun's is preferred), sshd, and rsync. If you're running openSUSE, you can install all of them by running the following command:
shell> sudo zypper in java-1_6_0-sun openssh rsync
Once you have the prerequisites, download the Hadoop distribution from one of Apache's Download Mirrors. Since we're planning to use Pig with Hadoop, be sure to download one of the 0.18.x versions: at the time of this writing, the latest version of Pig (0.3.0) works only with Hadoop 0.18.x.
Installing Hadoop
Unpack the downloaded Hadoop distribution. If Java is not installed under its standard path, edit /path/to/hadoop/conf/hadoop-env.sh and set JAVA_HOME to the root of your Java installation. Next, edit /path/to/hadoop/conf/hadoop-site.xml and paste the following lines inside the <configuration> element:
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
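If you would rather generate the whole file than paste into it, the three properties above can be written out in one step. This is only a sketch: write_hadoop_site is a made-up helper name, not something Hadoop ships, and it overwrites any hadoop-site.xml already present in the directory you give it.

```shell
#!/bin/sh
# write_hadoop_site is a hypothetical helper name, not part of Hadoop.
# It writes a complete pseudo-distributed hadoop-site.xml into the given
# conf directory, overwriting any existing file.
write_hadoop_site() {
    conf_dir=$1
    cat > "$conf_dir/hadoop-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Default file system: HDFS on this machine -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <!-- One copy of each block, since there is only one datanode -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- Address of the Map-Reduce job tracker -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
EOF
}
```

After defining the function, you would call it as write_hadoop_site /path/to/hadoop/conf.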
Starting Hadoop
Check if you can ssh to your machine without a passphrase:
shell> ssh localhost
If you are prompted for a password or passphrase, run the following commands to create a passphrase-less key and authorize it:
shell> ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
shell> cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
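Running the cat command above more than once appends the same key again each time, and sshd may ignore an authorized_keys file whose permissions are too open. The sketch below covers both cases. authorize_key is a made-up helper name, and the permission modes are the commonly recommended ones rather than anything specific to Hadoop.

```shell
#!/bin/sh
# authorize_key is a hypothetical helper: it appends a public key to
# authorized_keys only if that exact line is not already there, then
# tightens permissions on the directory and the file.
authorize_key() {
    pub_key=$1
    ssh_dir=$2
    auth_file="$ssh_dir/authorized_keys"
    mkdir -p "$ssh_dir"
    touch "$auth_file"
    # -F: fixed strings, -x: whole-line match, -f: read patterns from file
    if ! grep -qxFf "$pub_key" "$auth_file"; then
        cat "$pub_key" >> "$auth_file"
    fi
    chmod 700 "$ssh_dir"
    chmod 600 "$auth_file"
}
```

You would call it as authorize_key ~/.ssh/id_dsa.pub ~/.ssh. Afterwards, ssh -o BatchMode=yes localhost true should succeed without prompting; BatchMode makes ssh fail instead of asking for a password.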
Format a new distributed file-system:
shell> ./bin/hadoop namenode -format
If you get an exception such as INFO util.MetricsUtil: Unable to obtain hostName java.net.UnknownHostException, check whether your hostname has an entry in the /etc/hosts file:
shell> grep `hostname` /etc/hosts
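The grep above also matches commented-out lines and hostnames that merely contain yours as a substring. A stricter check might look like the sketch below; host_in_hosts is a made-up helper name. It ignores comments and compares whole fields after the address column.

```shell
#!/bin/sh
# host_in_hosts is a hypothetical helper: it succeeds only if NAME appears
# as a complete hostname or alias field in the given hosts file.
host_in_hosts() {
    awk -v name="$1" '
        {
            sub(/#.*/, "")                # drop comments, full-line or trailing
            for (i = 2; i <= NF; i++)     # field 1 is the address, skip it
                if ($i == name) found = 1
        }
        END { exit !found }
    ' "$2"
}
```

A typical call would be host_in_hosts "$(hostname)" /etc/hosts || echo "hostname missing from /etc/hosts". If the entry is missing, adding your hostname to /etc/hosts as root is the usual fix.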
Now that the configuration is done, you're ready to start the Hadoop daemons:
shell> ./bin/start-all.sh
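In pseudo-distributed mode, start-all.sh should leave five daemons running: NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker. Assuming the JDK's jps tool is on your PATH, a sketch like the following reports any that are missing. missing_daemons is a made-up helper name, and it reads jps output on standard input.

```shell
#!/bin/sh
# missing_daemons is a hypothetical helper: it reads `jps` output on stdin
# and prints the name of every expected Hadoop daemon that is not listed.
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # jps prints "<pid> <main class name>" per line; compare the name exactly
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { ok = 1 } END { exit !ok }'; then
            echo "$daemon"
        fi
    done
}
```

You would run it as jps | missing_daemons. No output means all five daemons are up; otherwise, look at the named daemon's log file under the Hadoop logs directory.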
If everything went well, you can list the files in the newly created HDFS:
shell> ./bin/hadoop fs -ls /
You should see output similar to the following:
Found 1 items
drwxr-xr-x - yourunixusername supergroup 0 2009-09-05 22:10 /tmp
A Quick Test to find out if everything is fine
Open your favorite browser and go to http://localhost:50070/. If everything went fine, the NameNode status page should show one live datanode. If it shows none, the datanode failed to start, and its log file should tell you why. For example, if you have downgraded from a newer version of Hadoop (to be able to run Pig), you may find ERROR org.apache.hadoop.dfs.DataNode: org.apache.hadoop.dfs.IncorrectVersionException: Unexpected version of storage directory in the datanode log. In that case, stop the Hadoop cluster, remove the Hadoop data directory (typically /tmp/hadoop-yourunixusername), and run the setup (namenode -format) again.
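The recovery steps for the version mismatch can be strung together as in the sketch below. reset_hdfs is a made-up helper name. Be aware that it deletes everything stored in HDFS, so it only makes sense on a throwaway setup like this one.

```shell
#!/bin/sh
# reset_hdfs is a hypothetical helper: it stops the cluster, removes the
# data directory, reformats the namenode, and starts the cluster again.
# WARNING: all data stored in HDFS is lost.
reset_hdfs() {
    hadoop_home=$1
    data_dir=$2
    # Refuse to run with a missing argument so rm -rf never sees an empty path
    if [ -z "$hadoop_home" ] || [ -z "$data_dir" ]; then
        echo "usage: reset_hdfs HADOOP_HOME DATA_DIR" >&2
        return 1
    fi
    # Each step runs only if the previous one succeeded
    "$hadoop_home/bin/stop-all.sh" &&
    rm -rf "$data_dir" &&
    "$hadoop_home/bin/hadoop" namenode -format &&
    "$hadoop_home/bin/start-all.sh"
}
```

If your data directory is in the typical location, the call would be reset_hdfs /path/to/hadoop "/tmp/hadoop-$(whoami)".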
Pig
Pig is a platform for analyzing large data sets. Pig Latin is a SQL-like language that lets you specify a sequence of data transformations (split, join, filter) over large sets of data. The Pig engine compiles Pig Latin into Map-Reduce to be run on Hadoop.
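To give a feel for what such a sequence of transformations looks like, here is a small sketch that writes a three-statement Pig Latin script to a file. The file names adults.pig and people.txt and the (name, age) schema are invented for this example.

```shell
#!/bin/sh
# Write a small Pig Latin script. The file names adults.pig and people.txt
# and the (name, age) schema are made up for this example.
cat > adults.pig <<'EOF'
-- Load tab-separated (name, age) records; tab is the default delimiter.
people = LOAD 'people.txt' AS (name:chararray, age:int);
-- Keep only the records whose age is at least 18.
adults = FILTER people BY age >= 18;
-- Print the result; this is the statement that triggers the Map-Reduce job.
DUMP adults;
EOF
```

Once Pig is set up as described in the next section, running pig adults.pig should execute the script, provided you have first copied the input into HDFS with something like ./bin/hadoop fs -put people.txt people.txt.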
That was a quick overview of Pig. In the following section, I'll show you how to set up Pig on the locally running Hadoop cluster.
Setting up Pig to run in Map-Reduce mode to use the local Hadoop cluster
Assuming Hadoop is already installed, we have satisfied the basic requirements to run Pig. Pig can be downloaded from one of Apache's Download Mirrors.
Unpack the downloaded distribution. The Pig launcher script is located under the bin directory. To use Pig with the installed Hadoop cluster, set the PIG_CLASSPATH variable to include Hadoop's conf directory, and set PIG_HADOOP_VERSION to the appropriate Hadoop version.
For example, on my system, Hadoop 0.18.3 is unpacked (installed) under $HOME/bin/hadoop-0.18.3, and Pig under $HOME/bin/pig-0.3.0. So, I use the following wrapper script to launch Pig:
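#!/bin/sh
# Location of the unpacked Pig distribution.
PIG_PATH=$HOME/bin/pig-0.3.0
# Put the Pig core jar and Hadoop's conf directory on Pig's classpath,
# then pass all command-line arguments through to the real launcher.
PIG_CLASSPATH=$PIG_PATH/pig-0.3.0-core.jar:$HOME/bin/hadoop-0.18.3/conf \
PIG_HADOOP_VERSION=0.18.3 \
"$PIG_PATH/bin/pig" "$@"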
Save this script as pig somewhere on your PATH and make it executable. Running pig should then print something like:
2009-09-05 23:17:41,113 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2009-09-05 23:17:41,428 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001
grunt>
To test, run the ls command at the grunt prompt to check that Pig can see HDFS:
shell> pig
2009-09-05 23:22:13,845 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2009-09-05 23:22:14,197 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001
grunt> ls /
hdfs://localhost:9000/tmp <dir>
grunt>
This is my first post on this topic. Going forward, I'm planning to post more articles. Until then, keep Grunting!