Thursday, March 12, 2015

Hadoop Part 4


1.  What is the default block size in HDFS ?
As of the Hadoop 2.x releases (e.g. Hadoop 2.4.0), the default block size in HDFS is 128 MB; prior to that it was 64 MB.
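On a live cluster, the configured value can be checked with the getconf tool; note that the property is named dfs.blocksize in Hadoop 2.x (older releases used dfs.block.size). With default settings this prints 134217728, i.e. 128 MB in bytes.

$ hdfs getconf -confKey dfs.blocksize
134217728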
2.  What is the benefit of large block size in HDFS ?
The main benefit of a large block size is reduced seek overhead: the time to transfer a large file of multiple blocks is dominated by the disk transfer rate rather than by seek time.
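For example, assuming a seek time of 10 ms and a transfer rate of 100 MB/s, reading a 128 MB block takes about 1.28 seconds, so the seek adds under 1% overhead; with 1 MB blocks, a seek would be needed for every megabyte and roughly half the time would be spent seeking.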
3.  What are the overheads of maintaining too large Block size ?
Usually in the MapReduce framework, each map task operates on one block at a time. So, having too few (very large) blocks results in too few map tasks running in parallel, each for a longer time, which slows down overall job performance.
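To make this concrete: with a hypothetical 10 GB input file, a 128 MB block size yields about 80 map tasks that can run in parallel, whereas a 1 GB block size yields only 10.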
4.  If a file of size 10 MB is copied on to HDFS with a block size of 128 MB, then how much storage will be allocated to the file on HDFS ?
Even though the HDFS block size is 128 MB, a file which is smaller than a single block doesn’t occupy the full block size. So, in this case, the file will occupy just 10 MB, not 128 MB.
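This can be verified with the fsck command (covered in question 9 below); the path /data/10mb-file here is just an illustrative example:

$ hadoop fs -put 10mb-file /data/10mb-file
$ hadoop fsck /data/10mb-file -files -blocks

The resulting block report shows a single block of about 10 MB rather than a full 128 MB block.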
5.  What are the benefits of block structure concept in HDFS ?
  • The main benefit is the ability to store very large files, even larger than any single disk (node), as the file is broken into blocks and distributed across various nodes of the cluster.
  • Another important advantage is simplicity of storage management: as the blocks are of fixed size, it is easy to calculate how many can be stored on a given disk (see the example below).
  • The block replication feature provides fault tolerance.
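For example, assuming a 4 TB disk and 128 MB blocks, the disk can hold 4 TB / 128 MB = 32,768 blocks, regardless of how many files those blocks belong to.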
6.  What if we upgrade to a Hadoop version whose default block size is higher than the current version’s default block size, say from 64 MB (Hadoop 0.20.2) to 128 MB (Hadoop 2.4.0) ?
All the existing files are kept at a block size of 64 MB, but any new files copied on to the upgraded Hadoop are broken into blocks of 128 MB.
7.  What is Block replication ?
Block replication is a way of maintaining multiple copies of the same block across various nodes of the cluster to achieve fault tolerance. Even if one of the DataNodes holding a block dies, the block’s data can still be obtained from the other live DataNodes that hold a copy of it.
8.  What is default replication factor and how to configure it ?
The default replication factor in fully distributed HDFS is 3.
This can be configured at site level with the dfs.replication property in the hdfs-site.xml file.
The replication factor can also be set at file level with the FS command below.

$ hadoop fs -setrep N /filename
In the above command, ‘N’ is the new replication factor for the file “/filename”.
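As a hypothetical example, the command below sets the replication factor of /user/hadoop/data.txt to 2; the optional -w flag makes the command wait until each block actually reaches the new replication factor:

$ hadoop fs -setrep -w 2 /user/hadoop/data.txt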
9. What is the use of fsck command in HDFS ?
The HDFS fsck command is useful to get the file and block details of the file system. Its syntax is:

$ hadoop fsck <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]

Below are the command options and their purpose.

-move          Move corrupted files to /lost+found.
-delete        Delete corrupted files.
-openforwrite  Print out files opened for write.
-files         Print out files being checked.
-blocks        Print out block report.
-locations     Print out locations for every block.
-racks         Print out network topology for data-node locations.
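For example, the following checks the entire file system (path /) and prints the files, their blocks and the locations of every block:

$ hadoop fsck / -files -blocks -locations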
10.  What is HDFS distributed copy (distcp) ?
distcp is a utility that launches MapReduce jobs to copy large amounts of data within a single HDFS cluster or between HDFS clusters.
The syntax for using this tool is:

$ hadoop distcp hdfs://namenodeX/src hdfs://namenodeY/dest
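Here namenodeX and namenodeY are placeholders for the actual NameNode hosts. distcp also accepts options such as -update (copy only files that are missing or have changed at the destination) and -overwrite (unconditionally overwrite files at the destination), for example:

$ hadoop distcp -update hdfs://namenodeX/src hdfs://namenodeY/dest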


source: Link
