Friday, March 13, 2015

Mapreduce Interview Questions

1.  What is Mapreduce ?
Mapreduce is a framework for processing big data (huge data sets) using a large number of commodity computers. It processes the data in two phases, namely the Map phase and the Reduce phase. The programming model is inherently parallel and can easily process large-scale data on commodity hardware.
It is tightly integrated with the Hadoop Distributed File System (HDFS), so processing is distributed across the data nodes of the cluster.
2. What is YARN ?
YARN stands for Yet Another Resource Negotiator and is also called Next Generation Mapreduce, Mapreduce 2 or MRv2.
It was introduced in the hadoop 0.23 release to overcome the scalability shortcomings of the classic Mapreduce framework, by splitting the responsibilities of the Job Tracker into a global Resource Manager and a per-application Application Master.
3.  What is data serialization ?
Serialization is the process of converting object data into byte stream data for transmission over a network across different nodes in a cluster or for persistent data storage.
4. What is deserialization of data ?
Deserialization is the reverse process of serialization; it converts byte stream data back into object data, for example when reading data from HDFS. Hadoop provides Writables for serialization and deserialization purposes.
5.  What are the Key/Value Pairs in Mapreduce framework ?
The Mapreduce framework implements a data model in which data is represented as key/value pairs. Both the input to and the output from the mapreduce framework must be in the form of key/value pairs.
6.  What are the constraints to Key and Value classes in Mapreduce ?
Any data type used for a Value field in a mapper or reducer must implement the org.apache.hadoop.io.Writable interface so that the field can be serialized and deserialized.
Key fields must additionally be comparable with each other, so they must implement hadoop's org.apache.hadoop.io.WritableComparable interface, which in turn extends hadoop's Writable interface and java's java.lang.Comparable interface.
7.  What are the main components of Mapreduce Job ?
  •  Main driver class, which provides the job configuration parameters.
  • Mapper class, which must extend the org.apache.hadoop.mapreduce.Mapper class and provide an implementation for the map() method.
  • Reducer class, which should extend the org.apache.hadoop.mapreduce.Reducer class and override the reduce() method (a minimal skeleton is sketched below).
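The sketch below assumes a hypothetical word-count style job; the class names TokenMapper, SumReducer and WordCountClasses are illustrative, not from the original post.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountClasses {

  // Mapper: splits each input line into words and emits (word, 1).
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}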
8.  What are the Main configuration parameters that user need to specify to run Mapreduce Job ?
At a high level, the user of the mapreduce framework needs to specify the following:
    • The  job’s input location(s) in the distributed file system.
    • The  job’s output location in the distributed file system.
    • The input format.
    • The output format.
    • The class containing the map function.
    • The class containing the reduce function (optional).
    • The JAR file containing the mapper, reducer and driver classes (a minimal driver sketch follows below).
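The driver sketch below assumes the hypothetical TokenMapper and SumReducer classes from the previous answer; the job name and argument handling are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);             // JAR with mapper/reducer/driver classes

    job.setMapperClass(WordCountClasses.TokenMapper.class);
    job.setReducerClass(WordCountClasses.SumReducer.class);

    job.setInputFormatClass(TextInputFormat.class);       // input format
    job.setOutputFormatClass(TextOutputFormat.class);     // output format
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // input location(s) in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output location in HDFS

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}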
9.  What are the main components of Job flow in YARN architecture ?
Mapreduce job flow on YARN involves below components.
    • Client node, which submits the Mapreduce job.
    • The YARN Resource Manager, which allocates the cluster resources to jobs.
    • The YARN Node Managers, which launch and monitor the tasks of jobs.
    • The MapReduce Application Master, which coordinates the tasks running in the MapReduce job.
    • The HDFS file system, which is used for sharing job files between the above entities.
10.  What is the role of Application Master in YARN architecture ?
Application Master performs the role of negotiating resources from the Resource Manager and working with the Node Manager(s) to execute and monitor the tasks.
The Application Master requests containers for all map and reduce tasks. Once containers are assigned to tasks, it starts them by notifying the corresponding Node Managers. The Application Master also collects progress information from all tasks and propagates the aggregated values to the client node or user.
The Application Master is specific to a single application, which is a single job in classic mapreduce or a cycle of jobs. Once the job execution is completed, the Application Master ceases to exist.
11.  What is identity Mapper ?
Identity Mapper is the default Mapper class provided by hadoop. When no mapper class is specified in a Mapreduce job, this mapper is executed.
It does not process, manipulate or perform any computation on the input data; it simply writes the input data to the output. Its class name is org.apache.hadoop.mapred.lib.IdentityMapper.
12.  What is identity Reducer ?
It is the reduce-phase counterpart of the Identity Mapper. It simply passes the input key/value pairs through to the output directory. Its class name is org.apache.hadoop.mapred.lib.IdentityReducer.
When no reducer class is specified in a Mapreduce job, this class is picked up by the job automatically.
13.  What is chain Mapper ?
Chain Mapper is a special implementation of the Mapper class through which a set of mapper classes can be run in a chain, within a single map task.
In this chained execution, the output of the first mapper becomes the input of the second mapper, the output of the second becomes the input of the third, and so on until the last mapper.
Its class name is org.apache.hadoop.mapreduce.lib.chain.ChainMapper.
14.  What is chain reducer ?
Chain Reducer is the counterpart of Chain Mapper on the reduce side: it allows a single reducer followed by a chain of mappers to be run within a single reduce task. Unlike Chain Mapper, it does not chain reducers; only one reducer is executed, and its output is then passed through a chain of mappers.
Its class name is org.apache.hadoop.mapreduce.lib.chain.ChainReducer.
15.  How can we mention multiple mappers and reducer classes in Chain Mapper or Chain Reducer classes ?
In Chain Mapper,
    • ChainMapper.addMapper() method is used to add mapper classes.
In ChainReducer,
    • ChainReducer.setReducer() method is used to specify the single reducer class.
    • ChainReducer.addMapper() method can be used to add mapper classes that run after the reducer (see the sketch below).
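The sketch below shows how these calls fit together in a driver, assuming hypothetical mapper and reducer classes AMap, BMap, XReduce and CMap; the key/value classes chosen here are also illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;

public class ChainDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "chain example");
    job.setJarByClass(ChainDriver.class);

    // Map task: AMap runs first, its output feeds BMap, all within one map task.
    ChainMapper.addMapper(job, AMap.class, LongWritable.class, Text.class,
        Text.class, Text.class, new Configuration(false));
    ChainMapper.addMapper(job, BMap.class, Text.class, Text.class,
        Text.class, Text.class, new Configuration(false));

    // Reduce task: XReduce runs first, then CMap is applied to its output.
    ChainReducer.setReducer(job, XReduce.class, Text.class, Text.class,
        Text.class, Text.class, new Configuration(false));
    ChainReducer.addMapper(job, CMap.class, Text.class, Text.class,
        Text.class, Text.class, new Configuration(false));

    // ... set input/output paths and formats, then submit as usual.
  }
}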
16.  What is side data distribution in Mapreduce framework ?
The extra read-only data needed by a mapreduce job to process the main data set is called side data.
There are two ways to make side data available to all the map or reduce tasks.
    • Job Configuration
    • Distributed cache
17.  How to distribute side data using job configuration ?
Side data can be distributed by setting arbitrary key-value pairs in the job configuration, using the various setter methods on the Configuration object (e.g. set(), setInt(), setBoolean()).

In the task, we can retrieve the data from the configuration returned by the Context's getConfiguration() method, as sketched below.
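The sketch below assumes a hypothetical configuration property named "myjob.delimiter" that the driver sets with conf.set("myjob.delimiter", "|"); the property name and default are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SideDataMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private String delimiter;

  @Override
  protected void setup(Context context) {
    // Retrieve the small piece of side data set in the job configuration by the driver.
    delimiter = context.getConfiguration().get("myjob.delimiter", ",");
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(java.util.regex.Pattern.quote(delimiter));
    context.write(new Text(fields[0]), new IntWritable(1));
  }
}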
18.  When can we use side data distribution by Job Configuration and when it is not supposed ?
Side data distribution through the job configuration is useful only when we need to pass a small piece of metadata to the map/reduce tasks.

We shouldn't use this mechanism for transferring more than a few kilobytes of data, because it puts pressure on memory usage, particularly in a system running hundreds of jobs.
19.  What is Distributed Cache in Mapreduce ?
The distributed cache mechanism is an alternative way of distributing side data: it copies files and archives to the task nodes in time for the tasks to use them when they run.

To save network bandwidth, files are normally copied to any particular node once per job.
20.  How to supply files or archives to mapreduce job in distributed cache mechanism ?
The files that need to be distributed can be specified as a comma-separated list of URIs as the argument to the -files option of the hadoop job command. The files can be on the local file system or on HDFS.

Archive files (ZIP files, tar files, and gzipped tar files) can also be copied to the task nodes through the distributed cache by using the -archives option. These are unarchived on the task node.

The -libjars option adds JAR files to the classpath of the mapper and reducer tasks.
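The same distribution can also be requested programmatically from the driver; a minimal sketch, where the HDFS paths are purely illustrative assumptions:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class CacheDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "distributed cache example");
    job.setJarByClass(CacheDriver.class);

    job.addCacheFile(new URI("/apps/lookup/countries.txt"));   // equivalent of -files
    job.addCacheArchive(new URI("/apps/lookup/geo.tar.gz"));   // equivalent of -archives
    job.addFileToClassPath(new Path("/apps/lib/parser.jar"));  // equivalent of -libjars

    // ... set mapper/reducer classes and input/output paths, then submit as usual.
  }
}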
 21.  How distributed cache works in Mapreduce Framework ?
When a mapreduce job is submitted with distributed cache options, the node managers copy the files specified by the -files, -archives and -libjars options from the distributed cache to a local disk. The files are said to be localized at this point.

The local.cache.size property can be configured to set the cache size on the local disk of the node managers. Files are localized under the ${hadoop.tmp.dir}/mapred/local directory on the node manager nodes.
22.  What will hadoop do when a task is failed in a list of suppose 50 spawned tasks ?
Hadoop will restart the map or reduce task on another node manager; only if the same task fails four times (by default) will the whole job be killed. The maximum number of attempts for map tasks and reduce tasks can be configured with the below properties in the mapred-site.xml file.
mapreduce.map.maxattempts
mapreduce.reduce.maxattempts
The default value for both of the above properties is 4.
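These limits can also be raised for a single job from the driver; a minimal sketch, where the value 8 is just an arbitrary example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryConfigExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.map.maxattempts", 8);     // default is 4
    conf.setInt("mapreduce.reduce.maxattempts", 8);  // default is 4
    Job job = Job.getInstance(conf, "retry example");
    // ... set classes and paths, then submit as usual.
  }
}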

23.  Consider case scenario: In Mapreduce system, HDFS block size is 256 MB and we have 3 files of size 256 KB, 266 MB and 500 MB then how many input splits will be made by Hadoop framework ?
Hadoop will make 5 splits as follows
- 1 split for 256 KB file
- 2 splits for 266 MB file  (1 split of size 256 MB and another split of size 10 MB)
- 2 splits for 500 MB file  (1 Split of size 256 MB and another of size 244 MB)
24.  Why can’t we just have the file in HDFS and have the application read it instead of distributed cache ?
The distributed cache copies the file to every node manager once, at the start of the job. If a node manager then runs 10 or 50 map or reduce tasks, they all reuse the same local copy of the file.

On the other hand, if the file had to be read from HDFS inside the job, then every map or reduce task would access it from HDFS; a node manager running 100 map tasks would read the file from HDFS 100 times. Accessing the file from the node manager's local file system is much faster than reading it from the HDFS data nodes.
25.  What mechanism does the Hadoop framework provide to synchronize changes made in the Distributed Cache during the run time of the application ?
The distributed cache mechanism only copies read-only data needed by a mapreduce job; files that are updated are not supported. So there is no mechanism to synchronize changes made to distributed cache files, because changes to them are simply not allowed.

26.  After a restart of the namenode, Mapreduce jobs that worked fine before the restart started failing. What could be wrong ?
The cluster could be in safe mode after the restart of the namenode. The administrator needs to wait for the namenode to exit safe mode before submitting the jobs again. This is a very common mistake made by Hadoop administrators.
27.  What do you always have to specify for a MapReduce job ?
  1. The classes for the mapper and reducer. 
  2. The classes for the mapper, reducer, and combiner.
  3. The classes for the mapper, reducer, partitioner, and combiner.
  4. None; all classes have default implementations.
28.  How many times will a combiner be executed ?
  1. At least once.
  2. Zero or one times.
  3. Zero, one, or many times. 
  4. It’s configurable.
29.  You have a mapper that, for each key, produces an integer value, and the following set of reduce operations:
Reducer A: outputs the sum of the set of integer values.
Reducer B: outputs the maximum of the set of values.
Reducer C: outputs the mean of the set of values.
Reducer D: outputs the difference between the largest and smallest values in the set.
Which of these reduce operations could safely be used as a combiner ?
  1. All of them.
  2. A and B. 
  3. A, B, and D.
  4. C and D.
  5. None of them.
Explanation: Reducer C cannot be used because, if such a reduction were to occur, the final reducer would receive from the combiner a series of means with no knowledge of how many items were used to generate them, so the overall mean would be impossible to calculate.
Reducer D is more subtle: selecting a maximum or a minimum on its own is safe as a combiner operation, but the goal here is the overall difference between the largest and smallest values for each key, and that does not compose. Each combiner sees only a subset of the values for a key, so the difference it computes describes only that subset; these sub-ranges have little value in isolation, and the final reducer cannot reconstruct the overall difference from them.
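If a mean really is needed, a common workaround is to make the intermediate value carry a partial (sum, count) pair instead of a bare mean, because sums and counts compose while means do not. A minimal sketch, assuming map output values encoded as "sum,count" in a Text (an illustrative convention, not from the post):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Combiner: merges partial (sum, count) pairs per key without losing information.
public class MeanCombiner extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0, count = 0;
    for (Text v : values) {
      String[] parts = v.toString().split(",");
      sum += Long.parseLong(parts[0]);
      count += Long.parseLong(parts[1]);
    }
    context.write(key, new Text(sum + "," + count));
  }
}

// Final reducer: only the last step divides, so the overall mean stays exact.
class MeanReducer extends Reducer<Text, Text, Text, DoubleWritable> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0, count = 0;
    for (Text v : values) {
      String[] parts = v.toString().split(",");
      sum += Long.parseLong(parts[0]);
      count += Long.parseLong(parts[1]);
    }
    context.write(key, new DoubleWritable((double) sum / count));
  }
}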
30.  What is Uber task in YARN ?
If the job is small, the application master may choose to run its tasks in the same JVM as itself, since it judges that the overhead of allocating new containers and running tasks in them would outweigh the gain from running them in parallel, compared to running them sequentially on one node. (This is different from Mapreduce 1, where small jobs never run on a single tasktracker.)
Such a job is said to be uberized, or run as an Uber task.
31.  How to configure Uber Tasks ?
By default, a job is considered small if it has fewer than 10 mappers, only one reducer, and an input size smaller than one HDFS block. These values may be changed for a job by setting mapreduce.job.ubertask.maxmaps, mapreduce.job.ubertask.maxreduces, and mapreduce.job.ubertask.maxbytes.

It is also possible to disable Uber tasks entirely by setting mapreduce.job.ubertask.enable to false.
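A minimal sketch of setting these thresholds per job from the driver; all values are arbitrary examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class UberConfigExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.job.ubertask.enable", true);
    conf.setInt("mapreduce.job.ubertask.maxmaps", 5);
    conf.setInt("mapreduce.job.ubertask.maxreduces", 1);
    conf.setLong("mapreduce.job.ubertask.maxbytes", 64L * 1024 * 1024); // 64 MB
    Job job = Job.getInstance(conf, "uber example");
    // ... set classes and paths, then submit as usual.
  }
}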
32.  What are the ways to debug a failed mapreduce job ?
Commonly there are two ways:
    1. By using the mapreduce job counters.
    2. By looking at the syslogs in the YARN web UI for the actual error messages or status.
33.  What is the importance of heartbeats in HDFS/Mapreduce Framework ?
A heartbeat in a master/slave architecture is a signal indicating that a slave node is alive. Datanodes send heartbeats to the Namenode, and node managers send heartbeats to the Resource Manager, to tell the master nodes that they are still alive.

If the Namenode or the Resource Manager does not receive a heartbeat from a slave node, it decides that there is some problem with that data node or node manager and that it cannot perform the assigned work; the master (namenode or resource manager) then reassigns the work to other live nodes.
34.  Can we rename the output file ?
Yes, we can rename the output file, for example by using the MultipleOutputs class (or by implementing a custom output format).
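A minimal sketch using MultipleOutputs in a reducer to control the output file name; the base name "report" and the key/value types are illustrative assumptions:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class RenamingReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private MultipleOutputs<Text, IntWritable> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    // Output files are named report-r-00000, report-r-00001, ... instead of part-r-*.
    mos.write(key, new IntWritable(sum), "report");
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();
  }
}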
35.  What are the default input and output file formats in Mapreduce jobs ?
If the input or output file formats are not specified, the defaults are text files: TextInputFormat for input and TextOutputFormat for output.

36. What are the methods in the Mapper class and order of their invocation?
The Mapper class contains a run() method, which calls its setup() method once, then calls the map() method once for each input key/value pair, and finally calls its cleanup() method. We can override all of the above methods in our code.
Each of these methods can access the job’s configuration data by using Context.getConfiguration().
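The default run() method looks roughly like this (simplified; recent hadoop versions wrap the loop in a try/finally):

public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);
}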
37. What are the methods in the Reducer class and order of their invocation?
run(Context context)
{
 setup(context);
 while (context.nextKey())
 {
 reduce(context.getCurrentKey(), context.getValues(), context);
 }
 cleanup(context);
}
setup(Context context)
reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
cleanup(Context context)

The Reducer class contains the run() method, which calls its setup() method once, then calls the reduce() method once for each key (with all of that key's values), and finally calls its cleanup() method. We can override all of the above methods in our code.
38. How can we add the arbitrary key-value pairs in your mapper?
We can set arbitrary (key, value) pairs of configuration data in our Job with Job.getConfiguration().set("key", "val"), and we can retrieve this data in the mapper with Context.getConfiguration().get("key").
This kind of retrieval is typically done in the Mapper's setup() method.
39. Which object can be used to get the progress of a particular job?
Context
40. How can we control particular key should go in a specific reducer?
We can control which keys (and hence which records) are processed by a particular Reducer by implementing a custom Partitioner class, as sketched below.
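A minimal sketch of a custom partitioner; the routing rule (keys starting with A–M go to reducer 0, the rest to reducer 1) is an illustrative assumption:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (numPartitions == 1) {
      return 0;                       // only one reducer, nothing to route
    }
    char first = Character.toUpperCase(key.toString().charAt(0));
    return (first <= 'M') ? 0 : 1;    // assumes the job runs with two reduce tasks
  }
}
// In the driver: job.setPartitionerClass(AlphabetPartitioner.class);
//                job.setNumReduceTasks(2);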
41. What is NLineInputFormat?
NLineInputFormat splits the input so that each mapper receives exactly 'N' lines; that is, each input split contains N lines.
42. What is the difference between an HDFS Block and Input Split?
HDFS Block is the physical division of the data and Input Split is the logical division of the data.
43. What is KeyValueTextInputFormat?
In KeyValueTextInputFormat, each line in the text file is a record. The first separator character (a tab by default) divides each line: everything before the separator is the key and everything after it is the value, both of type Text.
44. Why can't we do aggregation (addition) in a mapper? Why do we require a reducer for that?
Each mapper sees only its own input split, and the values for a given key may be spread across many splits and therefore across many mapper instances; no sorting or grouping of values by key is done on the mapper side. So a mapper cannot compute the full aggregate for a key. Grouping and sorting of all the values for a key happens only after the shuffle, on the reducer side, which is why the addition has to be done in the reducer (optionally assisted by a combiner, which performs partial aggregation on the map output).
45. Can we process different input file directories with different input formats, like some text files and some sequence files in a single MR job?
Yes, we can implement this with the MultipleInputs.addInputPath() method in the job driver class. We might set up the input as follows:

MultipleInputs.addInputPath(job, inputPath1, TextInputFormat.class, Mapper1.class);
MultipleInputs.addInputPath(job, inputPath2, SequenceFileInputFormat.class, Mapper2.class);

Here Mapper1 class handles TextInputFormat data and Mapper2 class handles SequenceFileInputFormat data.
46. What is the need for serialization in Mapreduce ?
Below are the two reasons serialization is needed in Hadoop:
  • In a Hadoop cluster, data is stored only in binary stream format; object-structured data can't be stored directly on the hadoop data nodes.
  • Only binary stream data can be transferred across the data nodes of a hadoop cluster. So serialization is needed to convert the object-structured data into binary stream format.
47. How do the nodes in a hadoop cluster communicate with each other?
  • Inter-process communication between nodes in a hadoop cluster is implemented using Remote Procedure Calls (RPC).
Inter-process communication happens in the below three stages:
  • The RPC protocol uses serialization to convert the message from the source node into binary stream data.
  • The binary stream data is transferred to the remote destination node.
  • The destination node then uses deserialization to convert the binary stream data back into object-structured data and reads it.
48. What is the Hadoop in built serialization framework ?
Writables are hadoop's own serialization format; they serialize data into a compact size and ensure fast transfer across nodes. Writables are written in Java and are supported only by Java.

49. What is Writable and its methods in hadoop library ?

Writable is an interface in the hadoop library and it declares the below two methods for serializing and de-serializing data:
write(DataOutput out) – writes data into the DataOutput binary stream.
readFields(DataInput in) – reads data from the DataInput binary stream.

package org.apache.hadoop.io;
import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;

public interface Writable
{
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

50. What is WritableComparable in hadoop library ?
The WritableComparable interface is a sub-interface of the Writable and java.lang.Comparable interfaces.

package org.apache.hadoop.io;

public interface WritableComparable<T> extends Writable, Comparable<T>
{  }
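A minimal sketch of a custom key type implementing WritableComparable; the IntPair class and its fields are illustrative assumptions, not part of the hadoop library:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class IntPair implements WritableComparable<IntPair> {
  private int first;
  private int second;

  public IntPair() { }                        // no-arg constructor required by the framework

  public IntPair(int first, int second) {
    this.first = first;
    this.second = second;
  }

  @Override
  public void write(DataOutput out) throws IOException {    // serialization
    out.writeInt(first);
    out.writeInt(second);
  }

  @Override
  public void readFields(DataInput in) throws IOException { // deserialization
    first = in.readInt();
    second = in.readInt();
  }

  @Override
  public int compareTo(IntPair other) {                      // defines the key ordering
    int cmp = Integer.compare(first, other.first);
    return (cmp != 0) ? cmp : Integer.compare(second, other.second);
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof IntPair)) {
      return false;
    }
    IntPair p = (IntPair) o;
    return first == p.first && second == p.second;
  }

  @Override
  public int hashCode() {
    return 31 * first + second;
  }
}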


