Friday, March 13, 2015

Hadoop Part 6

1. Why is Hadoop useful?

Hadoop is fault tolerant: when a node is lost, the system simply redirects work to another node that holds a replica of the data and resumes processing. Hadoop is also schema-less and can absorb data of all types, sources, and structures, allowing for deeper analysis.
2. Which directory does Hadoop install to?
Hadoop is installed in /usr/lib/hadoop-0.20/; you can change into it with cd /usr/lib/hadoop-0.20/.

3. What are the four modules that make up the Apache Hadoop framework?

  • Hadoop Common, which contains the common utilities and libraries necessary for Hadoop’s other modules.
  • Hadoop YARN, the framework’s platform for resource management and job scheduling.
  • Hadoop Distributed File System, or HDFS, which stores information on commodity machines.
  • Hadoop MapReduce, a programming model used to process large-scale sets of data.

4. Which modes can Hadoop be run in? List a few features for each mode.

  • Standalone, or local mode, which is one of the least commonly used environments. When it is used, it’s usually only for running MapReduce programs. Standalone mode lacks a distributed file system, and uses a local file system instead.
  • Pseudo-distributed mode, which runs all daemons on a single machine. It is most commonly used in QA and development environments.
  • Fully distributed mode, which is most commonly used in production environments. Unlike pseudo-distributed mode, fully distributed mode runs all daemons on a cluster of machines rather than a single one.

5. Where are Hadoop’s configuration files located?

Hadoop’s configuration files can be found inside the conf sub-directory of the Hadoop installation directory (e.g. /usr/lib/hadoop-0.20/conf).

6. List Hadoop’s three configuration files.

  • hdfs-site.xml
  • core-site.xml
  • mapred-site.xml
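These files set key properties such as fs.default.name (in core-site.xml), dfs.replication (in hdfs-site.xml), and mapred.job.tracker (in mapred-site.xml). As a minimal sketch, assuming those standard 0.20-era property names, a small Java program can print whatever values are on its classpath:

    // Sketch: print a few common Hadoop settings. A value comes back null
    // if the corresponding *-site.xml is not on the classpath.
    import org.apache.hadoop.conf.Configuration;

    public class ShowConf {
        public static void main(String[] args) {
            Configuration conf = new Configuration(); // loads core-site.xml from the classpath
            System.out.println("fs.default.name    = " + conf.get("fs.default.name"));
            System.out.println("dfs.replication    = " + conf.get("dfs.replication"));
            System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker"));
        }
    }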

7. What are “slaves” and “masters” in Hadoop?

In Hadoop, slaves is a configuration file listing the hosts that run task tracker and datanode daemons. The masters file lists the hosts that run the secondary namenode.

8. What is /etc/init.d?

/etc/init.d is the Linux directory that holds service (daemon) start/stop scripts. In Hadoop, you use these scripts to check the status of the daemons or to start and stop them.

9. What is a Namenode?

The Namenode sits at the center of the Hadoop distributed file system (HDFS) cluster. It manages the metadata for the file system and keeps track of the datanodes, but it does not store the data itself.

10. How many Namenodes can run on a single Hadoop cluster?

Only one Namenode process can run on a single Hadoop cluster. The file system will go offline if this Namenode goes down.

11. What is a datanode?

Unlike the Namenode, a datanode actually stores data within the Hadoop distributed file system. Each datanode runs in its own Java virtual machine process.

12. How many datanodes can run on a single Hadoop cluster?

A single Hadoop cluster can contain many datanodes, but each slave node runs only one datanode process.

13. What is job tracker in Hadoop?

Job tracker is used to submit and track jobs in MapReduce.

14. How many job tracker processes can run on a single Hadoop cluster?

Only one job tracker process can run on a single Hadoop cluster. Like a datanode, the job tracker runs in its own Java virtual machine process. If the job tracker goes down, all currently active jobs stop.

15. What sorts of actions does the job tracker process perform?

  • Client applications submit jobs to the job tracker.
  • The job tracker communicates with the Namenode to determine the location of the data.
  • The job tracker finds task tracker nodes with open slots for the data.
  • The job tracker submits the work to the chosen task tracker nodes.
  • The job tracker monitors the task tracker nodes for signs of activity (heartbeats). If a task tracker does not appear active enough, the work is transferred to a different task tracker node.
  • The job tracker receives a notification from the task tracker if a job has failed. From there, the job tracker may resubmit the job elsewhere, as described above; if it does not, it may blacklist either the job or the task tracker.

16. How does the job tracker schedule a job for a task tracker?

When a client application submits a job to the job tracker, the job tracker looks for task tracker nodes with empty slots, preferring the servers that host the datanodes containing the data to be processed.

17. What does mapred.job.tracker do?

mapred.job.tracker is a configuration property, set in mapred-site.xml, that specifies the host and port of the node acting as the job tracker process; clients and task trackers use it to locate the job tracker.

18. What is “PID”?

PID stands for Process ID.

19. What is “jps”?

jps is a JDK command that lists the Java processes running on a machine; in Hadoop, it is used to check that the task tracker, job tracker, datanode, and Namenode daemons are running.

20. Is there another way to check whether Namenode is working?

Besides the jps command, you can also use: /etc/init.d/hadoop-0.20-namenode status.

21. How would you restart Namenode?

To restart the Namenode, you could either:

  • switch to the hdfs user (su - hdfs), then run /etc/init.d/hadoop-0.20-namenode stop followed by /etc/init.d/hadoop-0.20-namenode start, or
  • simply run stop-all.sh and then start-all.sh.

22. What is “fsck”?

fsck stands for File System Check. Running hadoop fsck / reports on the health of the files and blocks stored in HDFS.

23. What are the port numbers for job tracker, task tracker, and Namenode?

The default web UI port number for the job tracker is 50030, for the task tracker 50060, and for the Namenode 50070.

24. What is a “map” in Hadoop?

In Hadoop, a map is a phase in HDFS query solving. A map reads data from an input location and outputs a key-value pair according to the input type.

25. What is a “reducer” in Hadoop?

In Hadoop, a reducer collects the output generated by the mapper, processes it, and creates a final output of its own.

26. What are the parameters of mappers and reducers?

The four parameters for mappers (illustrated in the sketch after these lists) are:
  • LongWritable (input)
  • Text (input)
  • Text (intermediate output)
  • IntWritable (intermediate output)
The four parameters for reducers are:
  • Text (intermediate output)
  • IntWritable (intermediate output)
  • Text (final output)
  • IntWritable (final output)
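These are the types used by the classic word-count example. A minimal sketch, assuming the newer org.apache.hadoop.mapreduce API:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper<LongWritable, Text, Text, IntWritable>: the input key is the
    // line's byte offset, the input value is the line of text, and the
    // intermediate output is a (word, 1) pair.
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer<Text, IntWritable, Text, IntWritable>: sums the counts for
    // each word and emits the final (word, total) pair.
    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }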

27. Is it possible to rename the output file, and if so, how?

Yes, it is possible to rename the output file by utilizing a multi-format output class (for example, MultipleOutputs in the newer MapReduce API, as sketched below).
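A hedged sketch of that approach: the named output "counts" below is just an illustration, registered in the driver with MultipleOutputs.addNamedOutput(job, "counts", TextOutputFormat.class, Text.class, IntWritable.class).

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class RenamingReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private MultipleOutputs<Text, IntWritable> out;

        @Override
        protected void setup(Context context) {
            out = new MultipleOutputs<Text, IntWritable>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            // Output lands in files named counts-r-00000, counts-r-00001, ...
            // instead of the default part-r-NNNNN.
            out.write("counts", key, new IntWritable(sum));
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            out.close(); // flush the named outputs
        }
    }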

28. List the network requirements for using Hadoop.

  • Secure Shell (SSH) for launching server processes
  • Password-less SSH connection

29. Which port does SSH work on?

SSH works on port 22 by default.

30. What is streaming in Hadoop?

As part of the Hadoop framework, streaming is a feature that lets engineers write MapReduce jobs in any language, as long as that language can read standard input and write standard output. Even though Hadoop is Java-based, the chosen language doesn’t have to be Java; it can be Perl, Ruby, etc. If you want deeper customization of MapReduce, however, Java must be used. A streaming-style mapper is sketched below.
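To make the contract concrete, here is a minimal streaming-style mapper written (just to show that any language works) in plain Java with no Hadoop imports: it reads lines on standard input and writes tab-separated key/value pairs on standard output, which is what Hadoop Streaming expects of its mapper executable.

    import java.util.Scanner;

    // A streaming mapper is just an executable that reads stdin and writes
    // key<TAB>value lines to stdout; Hadoop Streaming handles the rest.
    public class StreamMapper {
        public static void main(String[] args) {
            Scanner in = new Scanner(System.in);
            while (in.hasNextLine()) {
                for (String token : in.nextLine().split("\\s+")) {
                    if (!token.isEmpty()) {
                        System.out.println(token + "\t1"); // emit (word, 1)
                    }
                }
            }
        }
    }

The same program could equally be a Perl or Ruby script; streaming only cares about the standard input/output contract.
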
31. What is the difference between an InputSplit and an HDFS block?

An InputSplit and an HDFS block both refer to the division of data, but the InputSplit is the logical division while the HDFS block is the physical division. For example, with a 64 MB block size, a 100 MB file occupies two physical blocks, but a record that crosses the block boundary still belongs to exactly one logical InputSplit.

32. What does the file hadoop-metrics.properties do?

The hadoop-metrics.properties file controls metrics reporting in Hadoop: it configures where and how the daemons report their runtime metrics.

33. What is fault tolerance?
Ans: Suppose you have a file stored in a system, and due to some technical problem that file gets destroyed; there is then no way to get the data back. To avoid such situations, Hadoop introduced the feature of fault tolerance in HDFS. In Hadoop, when we store a file, it automatically gets replicated at two other locations as well (a default replication factor of three). So even if one or two of the systems collapse, the file is still available on the third system. A sketch of inspecting this replication factor from code follows.
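The replication factor can be inspected or changed per file through the HDFS Java API. A minimal sketch (the path and factor here are made-up values for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/user/demo/data.txt"); // hypothetical file
            FileStatus st = fs.getFileStatus(p);
            System.out.println("current replication = " + st.getReplication());
            fs.setReplication(p, (short) 3); // ask HDFS to keep three copies
        }
    }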

34. What are the key features of HDFS?
Ans: HDFS is highly fault-tolerant, offers high throughput, suits applications with large data sets, provides streaming access to file system data, and can be built out of commodity hardware.

35. What is HDFS?
Ans: HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

36. What are the core components of Hadoop?
Ans: The core components of Hadoop are HDFS and MapReduce. HDFS is used to store large data sets, and MapReduce is used to process such large data sets.

37. What is structured and unstructured data?
Ans: Structured data is data that is easily identifiable because it is organized in a structure. The most common form of structured data is a database, where specific information is stored in tables, that is, rows and columns. Unstructured data refers to any data that cannot be identified easily. It could be in the form of images, videos, documents, email, logs, and random text. It is not in the form of rows and columns.

38. What is the basic difference between a traditional RDBMS and Hadoop?
Ans: A traditional RDBMS is used for transactional systems to report and archive data, whereas Hadoop is an approach to storing huge amounts of data in a distributed file system and processing it. An RDBMS is useful when you want to seek out one record in big data, whereas Hadoop is useful when you want to take big data in one shot and perform analysis on it later.

39. Give examples of some companies that are using the Hadoop structure.
Ans: A lot of companies are using the Hadoop structure, such as Cloudera, EMC, MapR, Hortonworks, Amazon, Facebook, eBay, Twitter, Google, and so on.

40. What are some of the characteristics of the Hadoop framework?
Ans: The Hadoop framework is written in Java. It is designed to solve problems that involve analyzing large data sets (e.g. petabytes). The programming model is based on Google’s MapReduce, and the infrastructure is based on Google’s distributed file system (GFS). Hadoop handles large files and high data throughput and supports data-intensive distributed applications. Hadoop is scalable, as more nodes can easily be added to it.

41. Why do we need Hadoop?
Ans: Every day a large amount of unstructured data is dumped into our machines. The major challenge is not storing large data sets in our systems but retrieving and analyzing the big data in our organizations, especially data present in different machines at different locations. This is where the need for Hadoop arises. Hadoop has the ability to analyze the data present in different machines at different locations very quickly and in a very cost-effective way. It uses the concept of MapReduce, which enables it to divide a query into small parts and process them in parallel. This is also known as parallel computing.

42. What is Hadoop?
Ans: Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.

43. How big is big data?
Ans: With time, data volume is growing exponentially. Earlier we talked about megabytes or gigabytes, but the time has arrived when we talk about data volume in terms of terabytes, petabytes, and even zettabytes! Global data volume was around 1.8 ZB in 2011 and is expected to be 7.9 ZB in 2015. It is also said that the volume of global information doubles every two years!

44. Can you give a detailed overview of the big data being generated by Facebook?
Ans: As of December 31, 2012, there were 1.06 billion monthly active users on Facebook and 680 million mobile users. On average, 3.2 billion likes and comments are posted every day on Facebook, and 72% of the web audience is on Facebook. And why not! There are so many activities going on Facebook: wall posts, sharing images and videos, writing comments, liking posts, etc. In fact, Facebook started using Hadoop in mid-2009 and was one of its initial users.

45. Can you give some examples of big data?
Ans: There are many real-life examples of big data! Facebook generates 500+ terabytes of data per day, the NYSE (New York Stock Exchange) generates about 1 terabyte of new trade data per day, and a jet airliner collects 10 terabytes of sensor data for every 30 minutes of flying time. All of these are day-to-day examples of big data!

46. What is an InputSplit in Hadoop?
Ans: When a Hadoop job runs, it splits the input files into chunks and assigns each split to a mapper to process. Each such chunk is called an InputSplit.

47. What is the Secondary NameNode?
Ans: The Secondary NameNode usually runs on a separate physical machine because it requires plenty of CPU and as much memory as the NameNode to perform its merges. It periodically merges the namespace image with the edit log and keeps a copy of the merged namespace image, which can be used if the NameNode fails.

48. What is a TaskTracker?
Ans: TaskTrackers run the tasks given to them by the JobTracker and send progress reports back to it, which lets the JobTracker keep a record of the overall progress of each job. If a task fails, the JobTracker can reschedule it on another TaskTracker.

49. What is a DataNode?
Ans: DataNodes are the workers, i.e. slaves, of the file system. They store and retrieve blocks of data when told to by clients or by the NameNode, which holds the metadata for the files and directories.

50. What is the NameNode?
Ans: An HDFS cluster has two types of nodes, which work as master and slaves; the NameNode is the master. The NameNode stores the filesystem namespace, which contains the filesystem tree and the metadata for all of the files and directories in the tree. This information is stored on the local disk in the form of two files: the namespace image and the edit log. The NameNode also knows the DataNodes on which the actual data is located. Without the NameNode, the filesystem cannot be used.
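Since the NameNode alone serves this metadata, a directory listing never touches the DataNodes. A minimal sketch of such a metadata query (the directory is a made-up example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Every field printed here comes from the NameNode's metadata;
            // block contents on the DataNodes are never read.
            for (FileStatus st : fs.listStatus(new Path("/user/demo"))) {
                System.out.println(st.getPath() + "  len=" + st.getLen()
                        + "  repl=" + st.getReplication());
            }
        }
    }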

51. What is MapReduce?
Ans: MapReduce is the processing model in Hadoop; it can process any type of data, i.e. structured and unstructured. It derives mainly from the divide-and-conquer strategy. A MapReduce job is a unit of work that consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into two kinds of tasks: map tasks and reduce tasks. A minimal driver tying these pieces together is sketched below.
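For instance, a minimal driver for the word-count mapper and reducer sketched under question 26 shows all three pieces of the unit of work — input data, the MapReduce program, and configuration (the input and output paths are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();          // configuration
            Job job = new Job(conf, "word count");             // one job = one unit of work
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);         // map task
            job.setReducerClass(WordCountReducer.class);       // reduce task
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/user/demo/in"));    // input data
            FileOutputFormat.setOutputPath(job, new Path("/user/demo/out")); // must not exist yet
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }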
