Hadoop HDFS in Standalone Mode

My previous Hadoop example operated against the local filesystem, even though I had formatted a local HDFS partition. To operate against the local HDFS partition, the namenode and datanode must first be running. I mostly followed these instructions to start those processes. Here’s the most relevant part that I hadn’t done yet.

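# Format the namenode (this initializes fresh HDFS metadata)
hdfs namenode -format
# Start the namenode (runs in the foreground, so use its own terminal)
hdfs namenode
# Start a datanode (also in the foreground, in another terminal)
hdfs datanode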

I was then ready to add some directories and data to the local HDFS partition. I got the idea of using http://www.gutenberg.org/ for sample data from this article.

[watrous@myhost ~]$ hdfs dfs -mkdir -p /user/watrous/gutenberg
13/11/15 16:27:30 INFO namenode.FSEditLog: Number of transactions: 4 Total time for transactions(ms): 1 Number of transactions batched in Syncs: 0 Number of syncs: 2 SyncTimes(ms): 14
[watrous@myhost ~]$ hdfs dfs -ls -R /
drwxr-xr-x   - watrous supergroup          0 2013-11-14 23:16 /user
drwxr-xr-x   - watrous supergroup          0 2013-11-14 23:16 /user/watrous
drwxr-xr-x   - watrous supergroup          0 2013-11-14 23:16 /user/watrous/gutenberg
[watrous@myhost ~]$ hdfs dfs -copyFromLocal /home/watrous/pg20417.txt /user/watrous/gutenberg
13/11/14 23:16:35 INFO hdfs.StateChange: BLOCK* allocateBlock: /user/watrous/gutenberg/pg20417.txt._COPYING_. BP-1860796918-15.50.2.93-1384470255181 blk_1073741825_1001{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[127.0.0.1:50010|RBW]]}
13/11/14 23:16:35 INFO datanode.DataNode: Receiving BP-1860796918-15.50.2.93-1384470255181:blk_1073741825_1001 src: /127.0.0.1:47033 dest: /127.0.0.1:50010
13/11/14 23:16:36 INFO DataNode.clienttrace: src: /127.0.0.1:47033, dest: /127.0.0.1:50010, bytes: 674570, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_1914205347_1, offset: 0, srvID: DS-819252937-15.50.2.93-50010-1384470785167, blockid: BP-1860796918-15.50.2.93-1384470255181:blk_1073741825_1001, duration: 26478978
13/11/14 23:16:36 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 127.0.0.1:50010 is added to blk_1073741825_1001{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[127.0.0.1:50010|RBW]]} size 0
13/11/14 23:16:36 INFO datanode.DataNode: PacketResponder: BP-1860796918-15.50.2.93-1384470255181:blk_1073741825_1001, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
13/11/14 23:16:36 INFO hdfs.StateChange: DIR* completeFile: /user/watrous/gutenberg/pg20417.txt._COPYING_ is closed by DFSClient_NONMAPREDUCE_1914205347_1
[watrous@myhost ~]$ hdfs dfs -ls /user/watrous/gutenberg
Found 1 items
-rw-r--r--   3 watrous supergroup     674570 2013-11-14 23:16 /user/watrous/gutenberg/pg20417.txt

Accessing HDFS from MapReduce classes

It turns out that no modifications are required in the Mapper and Reducer classes to work with files in HDFS. When a new Path object is created with a single String, the constructor parses it to determine whether it contains a scheme and authority. The global configuration is then used to qualify the path, which means applying the scheme of the configured default filesystem (hdfs once HDFS is the default) if the path doesn’t already provide one. Here are two examples where I run the same MapReduce job with different paths, first a local path, then an HDFS path.

[watrous@myhost ~]$ hadoop jar HadoopExample.jar /opt/mount/input/sample/ ~/output.multiple
...
13/11/14 22:44:48 INFO mapred.MapTask: Processing split: file:/opt/mount/input/sample/catalina.out-20131027.gz:0+4416223
...
[watrous@myhost ~]$ hadoop jar HadoopExample.jar /user/watrous/log_myhost.com /user/watrous/output_myhost.com
...
13/11/15 17:07:21 INFO mapred.MapTask: Processing split: hdfs://localhost/user/watrous/log_myhost.com/catalina.out-20131029.gz:0+8924461
...

It is also possible to eliminate the ambiguity by spelling out the hdfs scheme and authority in the paths, as shown here.

[watrous@myhost ~]$ hadoop jar HadoopExample.jar hdfs://localhost/user/watrous/log_myhost.com hdfs://localhost/user/watrous/output_myhost.com-withscheme
...
13/11/15 21:49:21 INFO mapred.MapTask: Processing split: hdfs://localhost/user/watrous/log_myhost.com/catalina.out-20131031.gz:0+8118929
...
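
The qualification step described above can be sketched in plain Java. This is only an illustration built on java.net.URI, not Hadoop's actual Path implementation: the class name PathQualifierSketch and the qualify helper are hypothetical, and the sketch assumes absolute paths only (Hadoop also resolves relative paths against a working directory, which is left out here).

```java
import java.net.URI;
import java.net.URISyntaxException;

public class PathQualifierSketch {

    // Hypothetical helper (not a Hadoop API): if the path already carries a
    // scheme it is returned unchanged; otherwise the scheme and authority of
    // the default filesystem URI are applied to it.
    // Assumption: 'path' is absolute (starts with "/") when it has no scheme.
    public static URI qualify(String path, URI defaultFs) {
        URI uri = URI.create(path);
        if (uri.getScheme() != null) {
            return uri; // already fully qualified, nothing to add
        }
        try {
            return new URI(defaultFs.getScheme(), defaultFs.getAuthority(),
                           uri.getPath(), null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot qualify path: " + path, e);
        }
    }

    public static void main(String[] args) {
        URI localDefault = URI.create("file:///");
        URI hdfsDefault = URI.create("hdfs://localhost");

        // Same kind of bare path, two different default filesystems.
        System.out.println(qualify("/opt/mount/input/sample/", localDefault));
        System.out.println(qualify("/user/watrous/log_myhost.com", hdfsDefault));

        // An explicit scheme wins regardless of the default.
        System.out.println(qualify("hdfs://localhost/user/watrous/log_myhost.com", localDefault));
    }
}
```

Run as written, the first call prints file:/opt/mount/input/sample/ and the other two print hdfs://localhost/user/watrous/log_myhost.com, which mirrors the file: and hdfs: prefixes in the "Processing split" log lines above.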

View results in HDFS

To view the results in HDFS, first list the output directory to find the path to the result file, then use a command such as -cat to print it to stdout.

[watrous@myhost ~]$ hdfs dfs -ls /user/watrous/output_myhost.com
Found 2 items
-rw-r--r--   3 watrous supergroup          0 2013-11-15 17:08 /user/watrous/output_myhost.com/_SUCCESS
-rw-r--r--   3 watrous supergroup      17893 2013-11-15 17:08 /user/watrous/output_myhost.com/part-r-00000

With the path in hand, it’s now easy to print the results.

[watrous@myhost ~]$ hdfs dfs -cat /user/watrous/output_myhost.com/part-r-00000
2013-10-26 19:18:05,669 1
2013-10-26 19:31:00,452 1
2013-10-27 06:09:33,748 11
2013-10-27 06:09:33,749 25
2013-10-27 06:09:33,750 26
...
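
Once the output has been printed or copied out of HDFS, it can be post-processed with ordinary code. The following is a minimal sketch, assuming each line is a key followed by a numeric value, separated by a tab (the usual default for text output) or by a space as rendered above. The class name ReducerOutputSketch and its parse method are hypothetical and not part of the original job.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ReducerOutputSketch {

    // Hypothetical helper: splits each line at its last tab or space, treating
    // the left part as the key and the right part as a numeric value.
    // Lines without a separator (such as an elided "...") are skipped.
    // Assumption: the value is always an integer that fits in a long.
    public static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            int split = Math.max(trimmed.lastIndexOf('\t'), trimmed.lastIndexOf(' '));
            if (split < 0) {
                continue; // no key/value pair on this line
            }
            String key = trimmed.substring(0, split).trim();
            long value = Long.parseLong(trimmed.substring(split + 1));
            counts.put(key, value);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample lines taken from the -cat output shown above.
        List<String> sample = Arrays.asList(
            "2013-10-26 19:18:05,669 1",
            "2013-10-26 19:31:00,452 1",
            "2013-10-27 06:09:33,748 11",
            "2013-10-27 06:09:33,749 25",
            "2013-10-27 06:09:33,750 26");

        Map<String, Long> counts = parse(sample);
        long total = counts.values().stream().mapToLong(Long::longValue).sum();
        System.out.println(counts.size() + " keys, total " + total);
    }
}
```

For the five sample lines this prints "5 keys, total 64". Splitting at the last separator matters here because the keys themselves contain a space between the date and the time.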

About Daniel Watrous

I'm a Software & Electrical Engineer and online entrepreneur.
