Hadoop Interview Questions
(5 questions)
Q1. What is Hadoop?
“Hadoop is an open-source distributed framework designed to store and process large volumes of data across multiple machines.”
- It uses HDFS for distributed storage, where data is split into blocks and stored across different nodes, and MapReduce for parallel processing of that data.
- Instead of relying on a single expensive system, Hadoop distributes both data and computation across a cluster of commodity machines.
- This makes it fault-tolerant, as data is replicated across nodes, and horizontally scalable, allowing us to handle growing data by simply adding more machines.
- In simple terms, Hadoop solves the problem of processing data that is too large for a single machine by distributing it across a cluster (see the example commands after this list).
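- Quick illustration (a minimal sketch, assuming a running cluster with the stock MapReduce examples jar; the /demo paths and the local file name are placeholders):
hdfs dfs -mkdir -p /demo/input
hdfs dfs -put sales.txt /demo/input
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /demo/input /demo/output
hdfs dfs -cat /demo/output/part-r-00000
- The put command stores the file as replicated blocks across DataNodes, and the wordcount job then runs map tasks in parallel on the nodes that hold those blocks before reducing the results.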
Q2. What are the components of Hadoop?
“Hadoop architecture consists of multiple components that work together to handle distributed storage, processing, and cluster coordination.”
- NameNode (Master)
- Manages metadata — file structure, block distribution, and DataNode locations
- Does not store actual data
- If it fails, the cluster becomes inaccessible
- Standby NameNode (High Availability)
- Acts as a backup for the active NameNode
- Stays in sync using Edit Logs via JournalNodes
- Takes over immediately in case of failure
- DataNodes (Workers)
- Store the actual data blocks on local disks
- Send heartbeat signals every 3 seconds to report health
- JournalNodes
- Maintain shared Edit Logs
- Ensure both NameNodes remain synchronized
- ZooKeeper
- Handles cluster coordination and leader election
- Ensures high availability of NameNode
- YARN (Resource Manager)
- Manages CPU and memory across the cluster
- Decides where tasks should run
- Enables processing frameworks like MapReduce
- Heartbeat Mechanism
- DataNodes send heartbeat every 3 seconds
- If multiple heartbeats are missed, the node is marked dead and data is re-replicated
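- A few commands that surface these components in practice (a rough sketch; nn1 is a placeholder NameNode service ID from a typical HA config):
hdfs dfsadmin -report
yarn node -list
hdfs haadmin -getServiceState nn1
- The first lists every DataNode with its capacity and last heartbeat, the second lists the NodeManagers that YARN can schedule on, and the third reports whether a given NameNode is currently active or standby.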
Q3. What is Safe Mode in Hadoop?
“Safe mode in Hadoop is a read-only state of the NameNode, where no write operations are allowed to ensure data integrity.”
- What happens in Safe Mode?
- No write operations (no create, delete, or modify)
- Cluster is running, but not fully operational
- When does it occur?
- On startup/restart:
- NameNode loads FS Image and replays Edit Log
- Waits for DataNodes to send block reports
- Verifies replication before becoming active
- Automatically:
- Triggered when the fraction of blocks meeting the minimum replication falls below the configured threshold
- Checkpointing (related concept)
- Process of merging Edit Log + FS Image
- Creates an updated snapshot for faster recovery
- Done by Standby NameNode (HA) or Secondary NameNode (non-HA)
- Exit condition
- Once sufficient block replication is confirmed → NameNode exits safe mode
- Manual commands (optional to mention)
- Enter: hdfs dfsadmin -safemode enter
- Leave: hdfs dfsadmin -safemode leave
- Check: hdfs dfsadmin -safemode get
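- Two related commands that often come up in practice (a sketch, assuming HDFS admin rights):
hdfs dfsadmin -safemode wait
hdfs fsck / | grep -i 'under-replicated'
- The wait option blocks until the NameNode leaves safe mode, which is handy in startup scripts, and fsck shows whether under-replicated blocks are what is keeping it there.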
Q4. What is FS Image and Edit Log in Hadoop?
“FS Image and Edit Log are the two core files that together maintain the complete state of the NameNode.”
- FS Image
- It is a snapshot of the file system metadata at a specific point in time.
- It stores details like directory structure, file names, block information, permissions, and replication factor.
- One important point is that it does not store DataNode locations, as those are rebuilt dynamically when DataNodes send block reports.
- You can think of it as a photograph of the file system.
- Edit Log
- It is a transaction log that records every change made after the last FS Image.
- This includes operations like file creation, deletion, rename, and permission updates.
- It keeps growing continuously until a checkpoint happens.
- You can think of it as a diary of all recent changes.
- How they work together
- On startup, the NameNode first loads the FS Image as the base state.
- Then it replays the Edit Log to apply all recent changes.
- This combination reconstructs the latest state of the file system in memory.
- Checkpointing (important concept)
- Over time, the Edit Log becomes large, which can slow down restarts.
- So Hadoop performs checkpointing, where the Edit Log is merged into the FS Image to create a fresh, compact snapshot.
- This is triggered based on time interval or number of transactions.
- Who performs checkpointing
- In non-HA setup, it is done by the Secondary NameNode
- In HA setup, it is done by the Standby NameNode
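- Both files can be inspected, and checkpointing can be forced manually (a sketch; the file names under the NameNode's metadata directory are illustrative, real ones carry transaction IDs):
hdfs oiv -p XML -i fsimage_0000000000000042000 -o fsimage.xml
hdfs oev -i edits_0000000000000042001-0000000000000042100 -o edits.xml
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave
- oiv and oev (the offline image and edits viewers) dump the FS Image and Edit Log into readable XML, and saveNamespace writes a fresh FS Image, which only works while the NameNode is in safe mode.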
Q5. What is Split Brain Scenario in Hadoop?
“Split brain is a critical failure scenario in Hadoop HA where both the Active and Standby NameNode believe they are active at the same time and start writing simultaneously.”
- How it happens (high level)
- Due to issues like network glitch or GC pause, the Active NameNode becomes temporarily unresponsive.
- The ZKFC (ZooKeeper Failover Controller) on that node either marks it unhealthy or loses its ZooKeeper session, so the active lock is released.
- ZooKeeper triggers failover, and the Standby NameNode becomes Active.
- Meanwhile, the old Active recovers and still thinks it is Active.
- Now both NameNodes try to write to Journal Nodes at the same time.
- Why it is dangerous
- Two NameNodes create conflicting metadata.
- Clients may get incorrect block locations.
- Can lead to data corruption or inconsistent state.
- Role of ZKFC (important point)
- Runs on each NameNode and monitors its health locally.
- Communicates with ZooKeeper to manage failover.
- Acts as a bridge between NameNode and ZooKeeper.
- How Hadoop prevents split brain
- Fencing (first protection):
- Old Active is forcefully stopped (e.g., kill process or power off).
- ZooKeeper lock:
- Only one NameNode can hold the active lock at a time.
- Epoch number (final safety):
- JournalNodes accept writes only from the NameNode with the highest epoch number, so stale writes from the old Active are rejected.
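- In practice, fencing is configured through the dfs.ha.fencing.methods property (e.g., sshfence), and failover can be observed or exercised manually (a sketch; nn1/nn2 are placeholder NameNode service IDs):
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
hdfs haadmin -failover nn1 nn2
- The failover command promotes nn2 to active and, if nn1 does not step down cleanly, the configured fencing method is applied so that only one NameNode can keep writing.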