Thursday, March 12, 2015

Hadoop Part 1


1. What is Big Data?

Big data is a vast amount of data (generally GBs or TBs in size) that exceeds the regular processing capacity of traditional computing servers and requires a special parallel processing mechanism. This data is too big and its rate of growth keeps accelerating. It can be structured or unstructured data that legacy databases may not be able to process.
2. What is Hadoop?
Hadoop is an open-source framework from the Apache Software Foundation for storing and processing large-scale data, usually called Big Data, using clusters of commodity hardware.
3. Who uses Hadoop?

Big organizations in which data grows exponentially day by day require a platform like Hadoop to process such huge data. For example, companies such as Facebook, Google, Amazon, Twitter, IBM, and LinkedIn use Hadoop technology to solve their big data processing problems.

4. What is commodity hardware?

Commodity hardware is inexpensive hardware that is not of especially high quality or high availability. Hadoop can be installed on any commodity hardware; we don't need supercomputers or high-end hardware to work with Hadoop. Commodity hardware includes sufficient RAM, because some services will be running in RAM.
5. What is the basic difference between traditional RDBMS and Hadoop?
Traditional RDBMS is used for transactional systems to report and archive data, whereas Hadoop is an approach to storing huge amounts of data in a distributed file system and processing it.

RDBMS is useful when we want to seek one record from big data, whereas Hadoop is useful when we want the big data in one shot and perform analysis on it later.
6. What are the modes in which Hadoop can run?
Hadoop can run in three modes.
  • Standalone or local mode - No daemons run in this mode and everything runs in a single JVM.
  • Pseudo-distributed mode - All the Hadoop daemons run on a local machine, simulating a cluster on a small scale (a sample configuration for this mode is sketched below).
  • Fully distributed mode - A cluster of machines is set up in master/slave architecture to distribute and process the data across various nodes of commodity hardware.
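For pseudo-distributed mode, for example, HDFS is typically pointed at the local machine in core-site.xml. The snippet below is only a minimal sketch; the fs.defaultFS property and port 9000 are the values commonly used in Hadoop 2.x setups, not values taken from this article.

core-site.xml (pseudo-distributed mode):
<configuration>
  <property>
    <!-- all HDFS clients and daemons talk to the local NameNode -->
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>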
7. What are main components/projects in Hadoop architecture ?
  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • HDFS: Hadoop distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
8. List the important default configuration files in a Hadoop cluster?
The default configuration files in a Hadoop cluster are:
  • core-default.xml
  • hdfs-default.xml
  • yarn-default.xml
  • mapred-default.xml
9. List the important site-specific configuration files in a Hadoop cluster?
In order to override the default value of any Hadoop configuration property, we need to provide the value in a site-specific configuration file. Below are the four site-specific .xml configuration files and the environment variable setup file.
  • core-site.xml : Common properties are configured in this file.
  • hdfs-site.xml : Site-specific HDFS properties are configured in this file.
  • yarn-site.xml : YARN-specific properties are configured in this file.
  • mapred-site.xml : MapReduce framework-specific properties are defined in this file.
  • hadoop-env.sh : Hadoop environment variables are set up in this file.
All these configuration files should be placed in Hadoop's configuration directory, etc/hadoop, under Hadoop's home directory.
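For example, overriding the HDFS block replication factor is usually done in hdfs-site.xml. The sketch below assumes the standard dfs.replication property; the value 2 is only an illustrative choice.

hdfs-site.xml:
<configuration>
  <property>
    <!-- overrides the default replication factor of 3 -->
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>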
10. How many Hadoop daemon processes run on a Hadoop system?
As of the hadoop-2.5.0 release, three Hadoop daemon processes run on a Hadoop cluster.
  • NameNode daemon - Only one daemon runs for the entire Hadoop cluster.
  • Secondary NameNode daemon - Only one daemon runs for the entire Hadoop cluster.
  • DataNode daemon - One DataNode daemon runs on each DataNode in the Hadoop cluster.
11. How to start all Hadoop daemons at a time?
The $ start-dfs.sh command can be used to start all the Hadoop daemons from the terminal at once.
12. If some Hadoop daemons are already running and we need to start one remaining daemon process, what commands should be used?
Instead of start-dfs.sh, which triggers all three Hadoop daemons at a time, we can also start each daemon separately with the commands below.
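Assuming a Hadoop 2.x installation with the sbin scripts on the PATH, the individual HDFS daemons can be started as follows:

$ hadoop-daemon.sh start namenode
$ hadoop-daemon.sh start secondarynamenode
$ hadoop-daemon.sh start datanode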
13. How to stop all three Hadoop daemons at a time?
The stop-dfs.sh command stops all three of the above daemon processes at once.
14. What commands need to be used to bring down a single Hadoop daemon?
The hadoop-daemon.sh commands below can be used to bring down each Hadoop daemon separately.
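Mirroring the start commands above (again assuming a Hadoop 2.x installation), each daemon can be stopped individually:

$ hadoop-daemon.sh stop namenode
$ hadoop-daemon.sh stop secondarynamenode
$ hadoop-daemon.sh stop datanode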
15. How many YARN daemon processes run on a cluster?
Two types of YARN daemons run on a Hadoop cluster in master/slave fashion.
  • ResourceManager - Master daemon process.
  • NodeManager - One slave daemon process per node in the cluster.
16. How to start YARN daemon processes on a Hadoop cluster?
The YARN daemons can be started by running the $ start-yarn.sh command from the terminal.
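If the YARN daemons need to be started one at a time, the per-daemon script yarn-daemon.sh (the YARN counterpart of hadoop-daemon.sh in Hadoop 2.x) can be used:

$ yarn-daemon.sh start resourcemanager
$ yarn-daemon.sh start nodemanager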
17. How to verify whether the daemon processes are running or not?
By using Java's $ jps command, we can check which Java processes are running on a machine. This command lists all the daemon processes running on the machine along with their process IDs.
18. How to bring down the YARN daemon processes?
Using the $ stop-yarn.sh command, we can bring down both of the YARN daemon processes running on a machine.
19. Can we start both the Hadoop daemon processes and the YARN daemon processes with a single command?
Yes, we can start all five of the above-mentioned daemon processes (3 Hadoop + 2 YARN) with the single command $ start-all.sh.
20. Can we stop all the above five daemon processes with a single command?
Yes, by using the $ stop-all.sh command, all five of the above daemon processes can be brought down in a single shot.
21. Which operating systems are supported for Hadoop deployment?
The only supported operating system for Hadoop's production deployment is Linux. However, with some additional software, Hadoop can be deployed on Windows for test environments.
22. How are the various components of a Hadoop cluster deployed in production?
Both the NameNode and the ResourceManager can be deployed on a master node, while DataNodes and NodeManagers can be deployed on multiple slave nodes.

Only one master node is needed for the NameNode and the ResourceManager. The number of slave nodes for DataNodes and NodeManagers depends on the size of the cluster.

One more node, with the same hardware specifications as the master node, is needed for the Secondary NameNode.
23. What is structured and unstructured data?
Structured data is the data that is easily identifiable as it is organized in a structure. The most common form of structured data is a database where specific information is stored in tables, that is, rows and columns.

Unstructured data refers to any data that cannot be identified easily. It could be in the form of images, videos, documents, email, logs and random text. It is not in the form of rows and columns.
24. Is the NameNode also commodity hardware?
No. The NameNode can never be commodity hardware because the entire HDFS relies on it. It is the single point of failure in HDFS. The NameNode has to be a high-availability machine.
25. What is the difference between the jps and jps -lm commands?
The jps command returns the process ID and short name of each running Java process. jps -lm additionally shows the fully qualified main class name (-l) and the arguments passed to the main method (-m), as shown below.

hadoop1@ubuntu-1:~$ jps
5314 SecondaryNameNode
5121 DataNode
5458 Jps
4995 NameNode
hadoop1@ubuntu-1:~$ jps -lm
5314 org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
5121 org.apache.hadoop.hdfs.server.datanode.DataNode
5473 sun.tools.jps.Jps -lm
4995 org.apache.hadoop.hdfs.server.namenode.NameNode

