1 GB * 1024 = 1 TB, and 1 TB * 1024 = 1 PB (Petabyte), i.e. 1 PB = 1024 * 1024 GB.
Facebook processes approximately 7 PB of data daily, and the webpages stored on internet servers are estimated to total around 20 PB.
Before data reached this scale, what we used was a DS (Distributed System): divide the data and process it in parallel. It was not a platform; we just connected different machines together to work.
Whenever multiple machines cooperate with one another, the problem of failures arises:
Network failures.
Individual compute nodes may overheat, crash, or experience hard-drive failures.
Data may be corrupted during transmission .
Clocks may not be synchronised.
Locks may not be released.
If any of these issues occurs, we have to write code to fix it ourselves. The platform that handles all of the above for us is Hadoop.
Predecessors of Hadoop :
Grid Computing : Data is stored at one place and multiple CPUs work together in parallel to process it.
MPI - It gives control to the programmer, but it requires explicitly handling the mechanics of the data flow, exposed via low-level C routines such as sockets; the sketch below illustrates what this looks like.
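To make "handling the data flow yourself" concrete, here is a minimal sketch in Java (plain sockets stand in for MPI's C routines so all examples here stay in one language). The class name, port number, and payload are hypothetical, chosen only for illustration.

// Sketch: moving data between machines by hand, the way low-level
// message-passing code must. The port and payload are made up.
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class ManualDataFlow {

    // Receiver side: the programmer must listen, frame, and decode the bytes.
    static void receive(int port) throws Exception {
        try (ServerSocket server = new ServerSocket(port);
             Socket peer = server.accept();
             DataInputStream in = new DataInputStream(peer.getInputStream())) {
            int count = in.readInt();              // how many values follow
            for (int i = 0; i < count; i++) {
                System.out.println("got " + in.readDouble());
            }
        }
    }

    // Sender side: the programmer must connect, encode, and ship every value.
    static void send(String host, int port, double[] data) throws Exception {
        try (Socket peer = new Socket(host, port);
             DataOutputStream out = new DataOutputStream(peer.getOutputStream())) {
            out.writeInt(data.length);
            for (double d : data) {
                out.writeDouble(d);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Thread receiver = new Thread(() -> {
            try {
                receive(9000);
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        receiver.start();
        Thread.sleep(500);                         // crude wait for the listener
        send("localhost", 9000, new double[] {1.0, 2.0, 3.0});
        receiver.join();
    }
}

Notice that none of this code is about the actual computation; it is all plumbing, and handling failures (a dead peer, a dropped connection) would require even more code.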
To solve all these problems we have platforms that enable coders to work on the actual problem, while everything else is handled by the platform.
HADOOP - It is a platform (MR, i.e. MapReduce, workloads run on it).
SPARK - It is a platform (different tools for different needs).
HADOOP :
Data is distributed and processed in parallel where the data is independent.
Processing in Hadoop operates only at a high level, i.e. programmers think in terms of data models (such as key-value pairs for MR).
The MapReduce framework spares the programmer from having to think about failures, i.e. the Hadoop framework's implementation detects failed tasks and redistributes them to healthy nodes; see the word-count sketch after this list.
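To make the key-value model concrete, here is a minimal sketch of the classic word count against Hadoop's standard MapReduce Java API (org.apache.hadoop.mapreduce); the input and output paths are supplied on the command line.

// Word count: map emits (word, 1) pairs, reduce sums the counts per word.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for every word in the input split, emit the pair (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);          // key-value pair out
            }
        }
    }

    // Reduce: sum all the counts that arrived for a given word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that the programmer writes only map and reduce over key-value pairs; there is no failure-handling code here, because the framework re-runs any failed task on a healthy node.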
History :