Hadoop and Pig on your laptop or personal computer

Yes, I'm writing this article a second time: there have been some changes in Hadoop since the first version, and I'll also show you a better way to install it.

For this installation tutorial I'll use two popular server operating systems, CentOS and FreeBSD. Yes, Hadoop runs on FreeBSD without any glitches, albeit unsupported according to its documentation. This blog post assumes that you're running either CentOS 5.4 or FreeBSD 8.0-RELEASE.

As per the standard convention, shell> refers to commands to be entered as a normal user and shell# refers to commands to be entered as the superuser (root). I'll assume that you're running a Bourne-compatible shell (either Bash or Ksh); lines beginning with # are comments for reference.

For CentOS, you can do:


shell> cat /etc/redhat-release

CentOS release 5.4 (Final)

On FreeBSD, you can run:


shell> uname -mrs

FreeBSD 8.0-RELEASE i386

System Requirements

You'll want a computer with at least one gigabyte (1 GB) of RAM.

On the software side, you need the following:

  • Java version 1.6 (Sun JDK preferred)
  • sshd (installed and running)
  • rsync

On CentOS the prerequisites can be met by installing the packages using yum(1):


shell> sudo yum install java-1.6.0-openjdk openssh-server rsync

# After the installation is successful ...

shell> sudo /sbin/service sshd start

On FreeBSD it is best to install them from the ports collection.
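For example, assuming an up-to-date ports tree, something like the following should work (the port origins java/openjdk6 and net/rsync are my assumption for this release; verify them with whereis(1) before building):


shell# cd /usr/ports/java/openjdk6 && make install clean

shell# cd /usr/ports/net/rsync && make install clean

sshd(8) is part of the FreeBSD base system; enable it by adding sshd_enable="YES" to /etc/rc.conf.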

At the time of writing, the stable version of Pig is 0.5.0, and it works with the latest Hadoop release, 0.20.1.

Installing Hadoop

We'll create a dedicated user to run Hadoop.

On CentOS you can use the useradd(8) command to add a new user.


shell# groupadd hadoop

shell# useradd -g hadoop -s /bin/sh -m hadoop

shell# /usr/bin/passwd hadoop

# Set some password for the user

On FreeBSD you can use the pw(8) command to add a new user.


shell# pw groupadd hadoop

shell# pw useradd hadoop -g hadoop -s /bin/sh -m

shell# /usr/bin/passwd hadoop

# Set some password for the user

Download and unpack Hadoop version 0.20.1 under a directory, say /usr/local/sfw, such that it is installed under /usr/local/sfw/hadoop-0.20.1. Henceforth I'll refer to /usr/local/sfw/hadoop-0.20.1 as $HADOOP_PATH, the hadoop user as $HADOOP_USER and the hadoop group as $HADOOP_GROUP.
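For example (the URL below points at the Apache archive; you may prefer a mirror from the Hadoop download page, and on FreeBSD you can substitute fetch(1) for wget):


shell# mkdir -p /usr/local/sfw

shell# cd /usr/local/sfw

shell# wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz

shell# tar -xzf hadoop-0.20.1.tar.gz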

Set up the necessary permissions for $HADOOP_USER.


shell# mkdir $HADOOP_PATH/logs

shell# chown $HADOOP_USER:$HADOOP_GROUP $HADOOP_PATH/logs

Edit $HADOOP_PATH/conf/hadoop-env.sh and set the JAVA_HOME environment variable.
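With the OpenJDK package installed above on CentOS, the line would look like the following (on FreeBSD the OpenJDK port typically lands under /usr/local/openjdk6, so adjust the path accordingly):


# Point JAVA_HOME at your JDK; this path matches the CentOS OpenJDK package

export JAVA_HOME=/usr/lib/jvm/jre-1.6.0-openjdk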

The Hadoop configuration has been split into multiple files now. The older hadoop-site.xml has been deprecated in favor of these new files.

Edit $HADOOP_PATH/conf/core-site.xml and overwrite it with the following contents:


<?xml version="1.0"?>

<configuration>

    <property>

        <name>fs.default.name</name>

        <value>hdfs://localhost:9000</value>

    </property>

</configuration>

Edit $HADOOP_PATH/conf/hdfs-site.xml and overwrite it with the following contents:


<?xml version="1.0"?>

<configuration>

    <property>

        <name>dfs.replication</name>

        <value>1</value>

    </property>

</configuration>

Edit $HADOOP_PATH/conf/mapred-site.xml and overwrite it with the following contents:


<?xml version="1.0"?>

<configuration>

    <property>

        <name>mapred.job.tracker</name>

        <value>localhost:9001</value>

    </property>

</configuration>

Starting Hadoop

Check if you can ssh to your machine without a passphrase as $HADOOP_USER.


shell> su - $HADOOP_USER

shell> ssh localhost

If not, create a passphraseless ssh key using ssh-keygen(1).


shell> whoami

hadoop

shell> ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

shell> cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

If you still cannot ssh, the permissions on the files under $HOME/.ssh may be incorrect. Unfortunately, on some systems ssh(1) fails silently without any warning.


# As user $HADOOP_USER

shell> chmod 711 $HOME/.ssh

shell> chmod 600 $HOME/.ssh/*

Format a new distributed file-system:


# As user $HADOOP_USER

shell> PATH=$PATH:$HADOOP_PATH/bin; export PATH

shell> hadoop namenode -format

In case you get an exception like:


INFO util.MetricsUtil: Unable to obtain hostName java.net.UnknownHostException

Check whether your hostname has an entry in the /etc/hosts file and add one if it is not found.

shell> grep `hostname` /etc/hosts
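If the grep prints nothing, a quick fix is to map your hostname to the loopback address (a sketch; substitute your machine's real IP address if it has a fixed one):


shell# echo "127.0.0.1 `hostname`" >> /etc/hosts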

Now that you're done with the configuration, you're ready to start Hadoop.


shell> $HADOOP_PATH/bin/start-all.sh
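You can confirm that the daemons came up using jps(1), which ships with the JDK (for this single-node setup I'd expect the five processes listed below):


shell> jps

# Expect NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker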

If everything went well, you can check the list of files in the newly created HDFS. You'll see output similar to the following:


shell> ./bin/hadoop fs -ls /

Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2009-09-05 22:10 /tmp

Starting Hadoop on boot

To start Hadoop every time the machine boots, I have created rc(8) scripts for CentOS and FreeBSD. They can be downloaded from my GitHub samples repository. The CentOS version has its name suffixed with -centos54, whereas the FreeBSD version has a -freebsd8 suffix.

Installing Hadoop boot script on CentOS

As root, copy the downloaded hadoop_rc_script-centos54 to /etc/rc.d/init.d/hadoop.

Carefully review and set the HADOOP_PATH and HADOOP_USER values appropriately in the script and save it.

Next, run chkconfig(8) to add the script to the appropriate run levels.


shell# /sbin/chkconfig --add hadoop

That's it: Hadoop will start automatically at boot. If you want to start it manually, you can use service(8).


shell# /sbin/service hadoop start

Installing Hadoop boot script on FreeBSD

As root, copy the downloaded hadoop_rc_script-freebsd8 to /etc/rc.d/hadoop.

Carefully review and set the HADOOP_PATH and HADOOP_USER values appropriately in the script and save it.

If you want to start Hadoop manually, you can use the following command:


shell# /etc/rc.d/hadoop start

To have Hadoop start automatically upon boot, edit /etc/rc.conf and add hadoop_enable="YES". That's it.
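If you prefer not to edit the file by hand, appending the line works just as well:


shell# echo 'hadoop_enable="YES"' >> /etc/rc.conf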

Installing Pig

Download and unpack Pig version 0.5.0 under a directory, say /usr/local/sfw, such that it is installed under /usr/local/sfw/pig-0.5.0. Henceforth I'll refer to /usr/local/sfw/pig-0.5.0 as $PIG_PATH.
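For example (the archive URL below is my assumption; grab the tarball from the Pig releases page if it differs, and on FreeBSD substitute fetch(1) for wget):


shell# cd /usr/local/sfw

shell# wget http://archive.apache.org/dist/hadoop/pig/pig-0.5.0/pig-0.5.0.tar.gz

shell# tar -xzf pig-0.5.0.tar.gz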

To use Pig with the Hadoop cluster we just installed, PIG_CLASSPATH needs to include the installed Hadoop's configuration directory. So I have the following shell function in $HADOOP_USER's $HOME/.profile to invoke Pig with the appropriate environment.


pig() {
    # Install paths; adjust if you unpacked Hadoop or Pig elsewhere
    HADOOP_PATH=/usr/local/sfw/hadoop-0.20.1
    PIG_PATH=/usr/local/sfw/pig-0.5.0
    # Set JAVA_HOME and PIG_CLASSPATH for this invocation only
    JAVA_HOME=/usr/lib/jvm/jre-1.6.0-openjdk \
    PIG_CLASSPATH=$PIG_PATH/pig-0.5.0-core.jar:$HADOOP_PATH/conf \
    $PIG_PATH/bin/pig "$@"
}

To test, you can run the ls command at the grunt prompt to check that things are working.

shell> pig
2010-02-03 22:32:51,685 [main] INFO  org.apache.pig.Main - Logging error messages to: /usr/home/hadoop/pig_1265216571684.log
2010-02-03 22:32:52,325 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2010-02-03 22:32:52,867 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001
grunt> ls /
hdfs://localhost:9000/tmp	
grunt>
