Top 20 Hadoop Interview Questions and Answers

1. What is the Hadoop framework?
Hadoop is a large-scale distributed batch processing infrastructure. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores. Hadoop is also designed to efficiently distribute large amounts of work across a set of machines.

2. How large an amount of work?
Orders of magnitude larger than most existing systems handle. Hundreds of gigabytes of data constitute the low end of Hadoop scale; Hadoop is built to process "web-scale" data on the order of hundreds of gigabytes to terabytes or petabytes.

3. What problems can Hadoop solve?
Hadoop allows you to process massive amounts of data very quickly. It is a distributed processing engine that leverages data locality: it was designed to execute transformations and processes where the data actually resides. From an analytics perspective, Hadoop also lets you load raw data and define its structure at query time (schema on read). This makes Hadoop fast, flexible, and able to support many types of analysis.

4. What is MapReduce in Hadoop?
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner.
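The map → shuffle → reduce flow described above can be sketched in a few lines of pure Python (no Hadoop required; this is a local word-count illustration, and all names are made up for the example):

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in an input line
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: sum the counts for one key
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog"]
mapped = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(result["the"])  # 2
```

In real Hadoop each input chunk is mapped on a different node, and the shuffle moves data across the network, but the data flow is the same.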

5. What tasks does the MapReduce framework perform?
The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.

6. What does MapReduce consist of?
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing failed tasks. The slaves execute the tasks as directed by the master.

7. What is the Distributed Cache in the MapReduce framework?
DistributedCache is a facility provided by the MapReduce framework to cache files (text, archives, jars, and so on) needed by applications. It distributes application-specific, large, read-only files efficiently.

8. What is IsolationRunner in Hadoop?
IsolationRunner is a utility to help debug MapReduce programs. To use the IsolationRunner, first set keep.failed.task.files to true. IsolationRunner will then re-run the failed task in a single JVM, which can be run under a debugger, over precisely the same input.

9. What is Namenode in Hadoop?
Namenode is the node which stores the filesystem metadata i.e. which file maps to what block locations and which blocks are stored on which datanode. It is important for this file system to store its metadata reliably. Furthermore, while the file data is accessed in a write once and read many model, the metadata structures (e.g., the names of files and directories) can be modified by a large number of clients concurrently. It is important that this information is never desynchronized. Therefore, it is all handled by a single machine, called the NameNode. The NameNode stores all the metadata for the file system.

10. What is DataNode in Hadoop?
The data node is where the actual data resides. Some interesting traits of data nodes are as follows:
  • All datanodes send a heartbeat message to the namenode every 3 seconds to say that they are alive. If the namenode does not receive a heartbeat from a particular data node for 10 minutes, then it considers that data node to be dead/out of service and initiates replication of blocks which were hosted on that data node to be hosted on some other data node.
  • The data nodes can talk to each other to rebalance data, move and copy data around and keep the replication high.
  • When the datanode stores a block of information, it maintains a checksum for it as well. The data nodes update the namenode with the block information periodically and before updating verify the checksums. If the checksum is incorrect for a particular block i.e. there is a disk level corruption for that block, it skips that block while reporting the block information to the namenode. In this way, namenode is aware of the disk level corruption on that datanode and takes steps accordingly.
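The checksum behavior in the last bullet can be sketched as follows (a toy Python model: CRC32 stands in for Hadoop's actual block checksums, and the function names are illustrative, not part of the Hadoop API):

```python
import zlib

def store_block(data: bytes):
    # When a block is stored, keep a checksum alongside it
    return {"data": data, "checksum": zlib.crc32(data)}

def blocks_to_report(blocks):
    # Before reporting block information to the namenode, re-verify each
    # checksum and skip corrupted blocks; the namenode then learns about
    # the disk-level corruption by the block's absence from the report
    return [b for b in blocks if zlib.crc32(b["data"]) == b["checksum"]]

good = store_block(b"block-0 contents")
bad = store_block(b"block-1 contents")
bad["data"] = b"block-1 corrupted!"   # simulate disk-level corruption
print(len(blocks_to_report([good, bad])))  # 1
```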

11. What is the Secondary NameNode?
The secondary NameNode stores the latest checkpoint in a directory structured the same way as the primary NameNode's directory, so that the checkpointed image is always ready to be read by the primary NameNode if necessary.

12. What is JobTracker in Hadoop?
The primary function of the job tracker is resource management (managing the task trackers), tracking resource availability and task life cycle management (tracking its progress, fault tolerance etc.) 

13. What is TaskTracker in Hadoop? 
The TaskTracker executes each Mapper/Reducer task as a child process in a separate JVM. The child task inherits the environment of the parent TaskTracker. The user can specify additional options for the child JVM via the mapred.child.java.opts configuration property.

14. What if the JobTracker machine is down?
In classic Hadoop (MRv1), the JobTracker is a single point of failure from an execution point of view: if it goes down, no jobs can be scheduled or tracked until it is restarted.

15. What is Job Configuration(JobConf) in Hadoop? 
JobConf(Job Configuration) represents a MapReduce job configuration. JobConf is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution.

16. What are the Hadoop configuration files?
Hadoop configuration is driven by two types of important configuration files:
  1. Read-only default configuration: src/core/core-default.xml, src/hdfs/hdfs-default.xml and src/mapred/mapred-default.xml.
  2. Site-specific configuration: conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml.
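A site-specific file overrides the read-only defaults. As a minimal sketch, conf/core-site.xml might set the classic HDFS filesystem URI (the hostname and port below are placeholders):

```xml
<?xml version="1.0"?>
<!-- conf/core-site.xml: site-specific overrides of core-default.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:9000</value>
  </property>
</configuration>
```

Properties not set here fall back to the values in the corresponding *-default.xml file.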

17. What is shuffling in MapReduce?
The framework guarantees that the input to each reducer is sorted by key. The process by which the output of the mappers is sorted and transferred across the network to the reducers is known as the shuffle.

What is partitioning?
Partitioning determines which reducer receives each intermediate key. The default HashPartitioner sends a key to reducer hash(key) mod R, where R is the number of reduce tasks, so all values for a given key end up at the same reducer.
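A hash-partitioner-style sketch in pure Python (illustrative only; Hadoop's HashPartitioner uses the Java key's hashCode, not Python's hash):

```python
def partition(key: str, num_reducers: int) -> int:
    # Hash-partitioner style: the same key always maps to the same reducer
    return hash(key) % num_reducers

pairs = [("apple", 1), ("banana", 1), ("apple", 1)]
num_reducers = 4

# Route each intermediate pair to its reducer's bucket
buckets = {r: [] for r in range(num_reducers)}
for key, value in pairs:
    buckets[partition(key, num_reducers)].append((key, value))

# Both "apple" records land in the same bucket
apple_bucket = partition("apple", num_reducers)
print(buckets[apple_bucket].count(("apple", 1)))  # 2
```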

18. What is Hadoop Streaming?
Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
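Streaming executables simply read lines on stdin and write tab-separated key/value lines on stdout. The sketch below simulates the map | sort | reduce pipeline locally in Python (the hadoop command in the comment is illustrative; the mapper/reducer bodies are what would go in the streaming scripts):

```python
# A streaming job might be launched as (paths illustrative):
#   hadoop jar hadoop-streaming.jar -input in -output out \
#       -mapper mapper.py -reducer reducer.py
import io

def mapper(stdin, stdout):
    # mapper.py: emit "word<TAB>1" for each word on stdin
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def reducer(stdin, stdout):
    # reducer.py: input arrives sorted by key; sum each run of equal keys
    current, total = None, 0
    for line in stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                stdout.write(f"{current}\t{total}\n")
            current, total = key, 0
        total += int(value)
    if current is not None:
        stdout.write(f"{current}\t{total}\n")

# Simulate the pipeline locally: map | sort | reduce
mapped = io.StringIO()
mapper(io.StringIO("the quick the\n"), mapped)
shuffled = io.StringIO("".join(sorted(mapped.getvalue().splitlines(keepends=True))))
reduced = io.StringIO()
reducer(shuffled, reduced)
print(reduced.getvalue())
```

Because the contract is just lines on stdin/stdout, the same scripts work as shell utilities, which is the point of Streaming.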

19. What is Hadoop Pipes? 
Hadoop Pipes is a SWIG-compatible C++ API for implementing MapReduce applications (non-JNI based).

20. What is Reporter?
Reporter is a facility for MapReduce applications to report progress, set application-level status messages and update Counters.
Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive.