1 GB * 1024 = 1 TB, and 1 TB * 1024 = 1 PB (Petabyte), i.e. 1 PB = 1024 * 1024 GB.
Facebook processes approximately 7 PB of data daily, and the webpages stored on internet servers are estimated to total around 20 PB.
Before data reached this scale, what we used was a DS (Distributed System): divide the data and process it in parallel. It was not a platform; we just connected different machines together to work.
Whenever multiple machines cooperate with one another, the problem of failures arises:
Network failures.
Individual compute nodes may overheat, crash, or experience hard-drive failures.
Data may be corrupted during transmission .
Clocks may not be synchronised.
Locks may not be released.
If any of these issues occurs, we have to write code to fix it ourselves. The platform that handles all of the above for us is Hadoop.
Predecessors of Hadoop :
Grid Computing : Data is stored at one place and multiple CPUs work together in parallel to process it.
MPI - It gives control to the programmer, but it requires explicitly handling the mechanics of the data flow, exposed via low-level C routines such as sockets; the sketch below illustrates what this looks like.
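To make "handling the data flow yourself" concrete, here is a minimal sketch in Java (plain sockets stand in for MPI's C routines so all examples here stay in one language). The class name, port number, and payload are hypothetical, chosen only for illustration.

// Sketch: moving data between machines by hand, the way low-level
// message-passing code must. The port and payload are made up.
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class ManualDataFlow {

    // Receiver side: the programmer must listen, frame, and decode the bytes.
    static void receive(int port) throws Exception {
        try (ServerSocket server = new ServerSocket(port);
             Socket peer = server.accept();
             DataInputStream in = new DataInputStream(peer.getInputStream())) {
            int count = in.readInt();              // how many values follow
            for (int i = 0; i < count; i++) {
                System.out.println("got " + in.readDouble());
            }
        }
    }

    // Sender side: the programmer must connect, encode, and ship every value.
    static void send(String host, int port, double[] data) throws Exception {
        try (Socket peer = new Socket(host, port);
             DataOutputStream out = new DataOutputStream(peer.getOutputStream())) {
            out.writeInt(data.length);
            for (double d : data) {
                out.writeDouble(d);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Thread receiver = new Thread(() -> {
            try {
                receive(9000);
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        receiver.start();
        Thread.sleep(500);                         // crude wait for the listener
        send("localhost", 9000, new double[] {1.0, 2.0, 3.0});
        receiver.join();
    }
}

Notice that none of this code is about the actual computation; it is all plumbing, and handling failures (a dead peer, a dropped connection) would require even more code.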
To solve all these problems we have platforms that enable coders to work on the actual problem, while everything else is handled by the platform.
HADOOP - It is a platform (MR, i.e. MapReduce, workloads run on it).
SPARK - It is a platform (different tools for different needs).
HADOOP :
Data is distributed and processed in parallel where the data is independent.
Processing in Hadoop operates only at a high level, i.e. programmers think in terms of data models (such as key-value pairs for MR).
The MapReduce framework spares the programmer from having to think about failures, i.e. the Hadoop framework's implementation detects failed tasks and redistributes them to healthy nodes; see the word-count sketch after this list.
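To make the key-value model concrete, here is a minimal sketch of the classic word count against Hadoop's standard MapReduce Java API (org.apache.hadoop.mapreduce); the input and output paths are supplied on the command line.

// Word count: map emits (word, 1) pairs, reduce sums the counts per word.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for every word in the input split, emit the pair (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);          // key-value pair out
            }
        }
    }

    // Reduce: sum all the counts that arrived for a given word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that the programmer writes only map and reduce over key-value pairs; there is no failure-handling code here, because the framework re-runs any failed task on a healthy node.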
History :