Hadoop Interview Questions

(5 questions)

Q1. What is Hadoop?

“Hadoop is an open-source distributed framework designed to store and process large volumes of data across multiple machines.”

  • It uses HDFS for distributed storage, where data is split into blocks and stored across different nodes, and MapReduce for parallel processing of that data.

  • Instead of relying on a single expensive system, Hadoop distributes both data and computation across a cluster of commodity machines.

  • This makes it fault-tolerant, as data is replicated across nodes, and horizontally scalable, allowing us to handle growing data by simply adding more machines.

  • In simple terms, Hadoop solves the problem of processing data that is too large for a single machine by distributing it across a cluster.
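The MapReduce model mentioned above can be illustrated with a minimal single-machine sketch (plain Python, no Hadoop APIs — in a real cluster the map and reduce phases run in parallel on different nodes): a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in a line
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big cluster", "big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```

The same three phases underlie the classic Hadoop word-count job; Hadoop's contribution is running them across many machines with fault tolerance.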

Q2. What are the components of Hadoop?

“Hadoop architecture consists of multiple components that work together to handle distributed storage, processing, and cluster coordination.”

  • NameNode (Master)
    • Manages metadata — file structure, block distribution, and DataNode locations
    • Does not store actual data
    • If it fails (without High Availability), the cluster becomes inaccessible
  • Standby NameNode (High Availability)
    • Acts as a backup for the active NameNode
    • Stays in sync using Edit Logs via JournalNodes
    • Takes over immediately in case of failure
  • DataNodes (Workers)
    • Store the actual data blocks on local disks
    • Send heartbeat signals every 3 seconds to report health
  • JournalNodes
    • Maintain shared Edit Logs
    • Ensure both NameNodes remain synchronized
  • ZooKeeper
    • Handles cluster coordination and leader election
    • Ensures high availability of NameNode
  • YARN (Resource Management)
    • Its ResourceManager allocates CPU and memory across the cluster
    • Decides where application tasks should run
    • Supports processing frameworks like MapReduce
  • Heartbeat Mechanism
    • DataNodes send heartbeat every 3 seconds
    • If multiple heartbeats are missed, the node is marked dead and data is re-replicated
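The heartbeat mechanism above can be sketched in plain Python. The 3-second interval is from the text; the ~10.5-minute dead-node timeout mirrors the HDFS default behavior (an assumption here, tunable via cluster configuration):

```python
HEARTBEAT_INTERVAL = 3          # seconds, as stated in the text
DEAD_NODE_TIMEOUT = 10.5 * 60   # assumed ~10.5-minute HDFS default

def is_dead(last_heartbeat, now):
    # A DataNode is considered dead once no heartbeat has
    # arrived within the timeout window
    return (now - last_heartbeat) > DEAD_NODE_TIMEOUT

def nodes_to_rereplicate(last_heartbeats, now):
    # Blocks stored on dead nodes must be re-replicated elsewhere
    return [node for node, ts in last_heartbeats.items() if is_dead(ts, now)]

now = 1000.0
heartbeats = {"dn1": now - 2.0, "dn2": now - 700.0}  # dn2 silent ~11.6 min
print(nodes_to_rereplicate(heartbeats, now))  # ['dn2']
```

The long timeout (many missed heartbeats, not just one) is what prevents a brief network hiccup from triggering unnecessary re-replication.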


Q3. What is Safe Mode in Hadoop?

“Safe mode in Hadoop is a read-only state of the NameNode, where no write operations are allowed to ensure data integrity.”

  • What happens in Safe Mode?
    • No write operations (no create, delete, or modify)
    • Cluster is running, but not fully operational
  • When does it occur?
    • On startup/restart:
      • NameNode loads FS Image and replays Edit Log
      • Waits for DataNodes to send block reports
      • Verifies replication before becoming active
    • Automatically:
      • Triggered when the fraction of blocks meeting minimum replication drops below the configured threshold
  • Checkpointing (related concept)
    • Process of merging Edit Log + FS Image
    • Creates an updated snapshot for faster recovery
    • Done by Standby NameNode (HA) or Secondary NameNode (non-HA)
  • Exit condition
    • Once sufficient block replication is confirmed → NameNode exits safe mode
  • Manual commands (optional to mention)
    • Enter: hdfs dfsadmin -safemode enter
    • Leave: hdfs dfsadmin -safemode leave
    • Check: hdfs dfsadmin -safemode get
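The exit condition above can be sketched as a simple threshold check (plain Python; the 0.999 default mirrors HDFS's `dfs.namenode.safemode.threshold-pct` setting, stated here as an assumption):

```python
THRESHOLD_PCT = 0.999  # assumed default of dfs.namenode.safemode.threshold-pct

def can_leave_safe_mode(blocks_meeting_min_replication, total_blocks):
    # The NameNode exits safe mode once enough blocks have
    # satisfied minimum replication, per DataNode block reports
    if total_blocks == 0:
        return True
    return blocks_meeting_min_replication / total_blocks >= THRESHOLD_PCT

print(can_leave_safe_mode(995, 1000))   # False: only 99.5% reported
print(can_leave_safe_mode(1000, 1000))  # True
```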


Q4. What is FS Image and Edit Log in Hadoop?

“FS Image and Edit Log are the two core files that together maintain the complete state of the NameNode.”

  • FS Image
    • It is a snapshot of the file system metadata at a specific point in time.
    • It stores details like directory structure, file names, block information, permissions, and replication factor.
    • One important point is that it does not store DataNode locations, as those are rebuilt dynamically when DataNodes send block reports.
    • You can think of it as a photograph of the file system.
  • Edit Log
    • It is a transaction log that records every change made after the last FS Image.
    • This includes operations like file creation, deletion, rename, and permission updates.
    • It keeps growing continuously until a checkpoint happens.
    • You can think of it as a diary of all recent changes.
  • How they work together
    • On startup, the NameNode first loads the FS Image as the base state.
    • Then it replays the Edit Log to apply all recent changes.
    • This combination reconstructs the latest state of the file system in memory.
  • Checkpointing (important concept)
    • Over time, the Edit Log becomes large, which can slow down restarts.
    • So Hadoop performs checkpointing, where the Edit Log is merged into the FS Image to create a fresh, compact snapshot.
    • This is triggered based on time interval or number of transactions.
  • Who performs checkpointing
    • In non-HA setup, it is done by the Secondary NameNode
    • In HA setup, it is done by the Standby NameNode
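The "photograph plus diary" idea above can be sketched in plain Python: load the snapshot, replay the log, and checkpoint by merging the two (metadata is modeled as a simple dict; real FS Image and Edit Log formats are binary and far richer):

```python
def replay(fsimage, edit_log):
    # Startup: load the FS Image snapshot, then replay each logged
    # transaction to rebuild the latest in-memory state
    state = dict(fsimage)
    for op, path in edit_log:
        if op == "create":
            state[path] = {}
        elif op == "delete":
            state.pop(path, None)
    return state

def checkpoint(fsimage, edit_log):
    # Checkpointing: merge the Edit Log into a fresh FS Image
    # and start a new, empty Edit Log
    return replay(fsimage, edit_log), []

fsimage = {"/data/a.txt": {}}
edit_log = [("create", "/data/b.txt"), ("delete", "/data/a.txt")]
new_image, new_log = checkpoint(fsimage, edit_log)
print(sorted(new_image))  # ['/data/b.txt']
```

Note why checkpointing matters: without it, every restart would replay an ever-growing log; after the merge, restart only needs the compact new image plus a short log.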


Q5. What is Split Brain Scenario in Hadoop?

“Split brain is a critical failure scenario in Hadoop HA where both the Active and Standby NameNode believe they are active at the same time and start writing simultaneously.”

  • How it happens (high level)
    • Due to issues like network glitch or GC pause, the Active NameNode becomes temporarily unresponsive.
    • Because of this, the ZKFC (ZooKeeper Failover Controller) on that node can no longer confirm its health, and its session with ZooKeeper expires.
    • ZooKeeper triggers failover, and the Standby NameNode becomes Active.
    • Meanwhile, the old Active recovers and still thinks it is Active.
    • Now both NameNodes try to write to Journal Nodes at the same time.
  • Why it is dangerous
    • Two NameNodes create conflicting metadata.
    • Clients may get incorrect block locations.
    • Can lead to data corruption or inconsistent state.
  • Role of ZKFC (important point)
    • Runs on each NameNode and monitors its health locally.
    • Communicates with ZooKeeper to manage failover.
    • Acts as a bridge between NameNode and ZooKeeper.
  • How Hadoop prevents split brain
    • Fencing (first protection):
      • Old Active is forcefully stopped (e.g., kill process or power off).
    • ZooKeeper lock:
      • Only one NameNode can hold the active lock at a time.
    • Epoch number (final safety):
      • Journal Nodes accept writes only from the latest active NameNode.
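The epoch-number safeguard above can be sketched in plain Python (a toy model of one Journal Node; real Journal Nodes run the quorum protocol across several machines):

```python
class JournalNode:
    # A Journal Node tracks the highest epoch it has promised to;
    # writes carrying a stale (lower) epoch are rejected, which
    # fences out an old Active NameNode after failover.
    def __init__(self):
        self.promised_epoch = 0
        self.edits = []

    def new_epoch(self, epoch):
        # Called by a NameNode when it becomes Active
        if epoch > self.promised_epoch:
            self.promised_epoch = epoch
            return True
        return False

    def write(self, epoch, edit):
        if epoch < self.promised_epoch:
            return False  # stale writer: rejected
        self.edits.append(edit)
        return True

jn = JournalNode()
jn.new_epoch(1)              # original Active holds epoch 1
jn.new_epoch(2)              # failover: new Active bumps epoch to 2
print(jn.write(1, "op-A"))   # False: old Active is fenced out
print(jn.write(2, "op-B"))   # True: current Active's write accepted
```

Even if fencing and the ZooKeeper lock both fail, the stale epoch guarantees the old Active's writes never reach the shared Edit Log.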