What is Hadoop, and what are its main components?

Hadoop is an open-source distributed framework designed to store and process massive amounts of data across a cluster of commodity machines. Its architecture is built around a small set of cooperating components for storage, coordination, and resource management.

What is Hadoop?

  • Hadoop is an open-source distributed framework designed to store and process massive amounts of data across a cluster of commodity machines.
  • Its core is HDFS (Hadoop Distributed File System) for distributed storage and MapReduce for distributed processing, with YARN managing cluster resources since Hadoop 2. Instead of relying on one powerful, expensive machine, Hadoop spreads both the data and the computation across many affordable nodes.
  • Two key strengths make it production-grade:
    • Fault-tolerant: each data block is automatically replicated across multiple nodes (three copies by default), so the failure of a single machine does not lose data
    • Horizontally scalable — you can increase capacity simply by adding more nodes to the cluster, with no downtime
  • A simple way to say it in an interview:
    • "Hadoop solves the problem of storing and processing data that is too big to fit on a single machine, by distributing both the storage and the computation across a cluster."
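To make the storage side concrete, here is a minimal Python sketch of how a file might be cut into blocks and replicated across nodes. The 128 MB block size and replication factor of 3 are the HDFS defaults; the function names, the node names, and the round-robin placement are assumptions made for illustration (real HDFS uses a rack-aware placement policy).

```python
BLOCK_SIZE_MB = 128  # HDFS default block size (dfs.blocksize)
REPLICATION = 3      # HDFS default replication factor (dfs.replication)


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block a file would be split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical placement for illustration only; real HDFS is rack-aware.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + i) % len(nodes)] for i in range(replication)]
        for block_id in range(num_blocks)
    }


def surviving_replicas(placement, failed_node):
    """Return the replica holders that remain for each block after one node fails."""
    return {
        block_id: [node for node in holders if node != failed_node]
        for block_id, holders in placement.items()
    }


blocks = split_into_blocks(300)          # a 300 MB file
nodes = ["dn1", "dn2", "dn3", "dn4"]     # hypothetical DataNode names
placement = place_replicas(len(blocks), nodes)
after_failure = surviving_replicas(placement, "dn2")

print(blocks)
print(after_failure)
```

In this sketch every block still has at least two live copies after dn2 fails, which is the property the fault-tolerance bullet describes.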

What are the components of Hadoop?

Hadoop architecture is built around the following main components:

  • NameNode: The master node of HDFS. It stores all the metadata: which blocks make up each file, and which DataNodes hold each block. It never stores the actual file data. If the NameNode goes down and there is no standby, HDFS becomes unavailable because clients can no longer locate any block.
  • Standby NameNode: A hot standby for the Active NameNode. It continuously reads the Edit Log from the Journal Nodes so it stays closely in sync. If the Active NameNode fails, the Standby can take over quickly without losing committed metadata changes.
  • DataNodes: The worker nodes that actually store the data blocks on their local disks. They send a heartbeat to the NameNode every 3 seconds by default, and periodically send block reports listing the blocks they hold.
  • Journal Nodes: A small quorum of nodes that hold the shared Edit Log. The Active NameNode writes every namespace change to a majority of the Journal Nodes, and the Standby NameNode reads those edits. This is how the Standby stays in sync with every change the Active makes.
  • ZooKeeper: Acts as the coordination service for high availability. Together with the ZKFailoverController (ZKFC) process that runs alongside each NameNode and monitors its health, it handles leader election (deciding which NameNode becomes Active) and enables automatic failover.
  • YARN (Yet Another Resource Negotiator): The resource management layer. It decides which node has available CPU and memory to run a task. MapReduce runs on top of YARN.
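  • Heartbeat mechanism: Every DataNode sends a heartbeat signal to the NameNode every 3 seconds by default. If the NameNode receives no heartbeat from a DataNode for longer than the dead-node timeout (two recheck intervals of 5 minutes plus 10 heartbeat intervals, about 10 minutes 30 seconds with default settings), it marks that node as dead and automatically re-replicates its blocks from the remaining replicas onto other healthy nodes.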
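The heartbeat timing can be sketched in a few lines of Python. The two defaults correspond to the HDFS settings dfs.heartbeat.interval (3 seconds) and dfs.namenode.heartbeat.recheck-interval (5 minutes); the HeartbeatMonitor class and the node names are hypothetical and only loosely mimic what the NameNode tracks internally.

```python
DEFAULT_HEARTBEAT_INTERVAL_S = 3   # dfs.heartbeat.interval default
DEFAULT_RECHECK_INTERVAL_S = 300   # dfs.namenode.heartbeat.recheck-interval default (5 min)


def dead_node_timeout_s(heartbeat_interval_s=DEFAULT_HEARTBEAT_INTERVAL_S,
                        recheck_interval_s=DEFAULT_RECHECK_INTERVAL_S):
    """Seconds of silence after which a DataNode is considered dead."""
    return 2 * recheck_interval_s + 10 * heartbeat_interval_s


class HeartbeatMonitor:
    """Hypothetical, simplified stand-in for the NameNode's liveness tracking."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node name -> time of last heartbeat, in seconds

    def record_heartbeat(self, node, now_s):
        self.last_seen[node] = now_s

    def dead_nodes(self, now_s):
        """Return nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now_s - seen > self.timeout_s
        )


timeout = dead_node_timeout_s()
monitor = HeartbeatMonitor(timeout)
monitor.record_heartbeat("dn1", now_s=0)
monitor.record_heartbeat("dn2", now_s=0)
monitor.record_heartbeat("dn1", now_s=600)  # dn1 keeps reporting, dn2 goes silent

print(timeout)
print(monitor.dead_nodes(now_s=700))
```

Here dn2 has been silent for 700 seconds, which exceeds the 630-second timeout, so it is the only node reported as dead. In a real cluster, this is the point at which the NameNode would schedule re-replication of that node's blocks.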

Interview tip: Most candidates forget YARN. Mentioning it immediately separates you from the crowd.

Figure: Components of Hadoop