Zhaoliang Meng is the author of the books ‘How to Build Hadoop on the Cloud Amazon AWS in One Hour’, ‘How to Create QlikView Dashboards in 30 Minutes’, ‘Oracle SQL and Efficient Entry Guidelines’, and ‘How to Build a Website in 2 Hours’. He is interested in Big Data, data analytics, cloud, business intelligence, and data science, and he likes to build websites and mobile apps. He is passionate about writing articles and sharing them with friends on the internet, and he currently lives in Columbus, Ohio, US. Please feel free to reach out to me (firstname.lastname@example.org) if you have any questions.
What is the biggest opportunity on the internet in the next ten years? We may think it should be something like Big Data, cloud, data science, Java development, or Python development. As we know, the internet is growing fast: Facebook now sees 100 million hours of daily watch time, and more than 250 billion photos have been uploaded to Facebook, with about 350 million new photos added every day. Most important of all, Facebook generates 4 petabytes (Big Data) of data per day. As more and more companies grow business datasets like Facebook's, they need the space and the systems to handle the big data problem.
It doesn't matter what you will be working on in the future. Whether you are a cloud engineer, a data scientist, or a Python developer, the first problem is how to handle the big data that the business systems generate every day. Big data will be the foundation of business and IT infrastructure, and Apache Hadoop is a proven solution for big data.
Do we have enough people, time, money, and other resources to build a data center for the business's big data? Of course NOT. Only a few companies, such as Facebook, Google, Microsoft, Amazon, and Alibaba, have enough money to build their own data centers. But we can use the cloud services from these companies, such as Amazon AWS. Cloud services are easy to use and provide high speed, security, agility, reliability, and scalability at low cost.
Big Data and cloud technologies are big areas, and this book will not discuss the architecture of Big Data and the cloud or how they work. We know time is expensive, and nobody has enough of it to learn every technology. This book shows you how to build a Hadoop (Big Data) system on the Amazon AWS cloud easily by following a series of steps. You don't know how big data and the cloud work? You don't have coding experience? You don't have any IT experience? That is fine; this book is just for you. It will show you how to build your own Hadoop (big data) system on the cloud.
What is the learning experience you never forget? Making mistakes, right? When we only read something, we forget it within a few days; when we do it, we remember it. And we remember it even better if we make some mistakes along the way. So let's do it, make some mistakes, and have fun too.
09/12/2019 in Columbus, OH, US
1 Hadoop for Big Data
1.1 What is Hadoop
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
Apache Hadoop is an open-source distributed processing framework that manages data processing and storage for big data applications running on clustered systems.
1.2 Hadoop Architecture
Hadoop is made up of several modules that are supported by a large ecosystem of technologies. The Hadoop ecosystem is a platform, or suite of tools, that provides various services to solve big data problems.
Hadoop follows a master-slave architecture for data storage and distributed data processing, using HDFS (Hadoop Distributed File System) and MapReduce respectively. The master node for data storage in Hadoop HDFS is the NameNode, and the master node for parallel processing of data with Hadoop MapReduce is the JobTracker.
The slave nodes in the Hadoop architecture are the other machines in the Hadoop cluster, which store data and perform complex computations.
Every slave node runs a TaskTracker daemon and a DataNode daemon that synchronize their processes with the JobTracker and NameNode respectively. In a Hadoop deployment, the master and slave systems can be set up in the cloud or on-premises.
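To make the client's view of this architecture concrete, here is a minimal sketch that connects to the master (NameNode) and lists the root directory. It assumes Hadoop's Java client is on the classpath; the hostname namenode and port 9000 are hypothetical placeholders for your cluster's values.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListRoot {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Clients talk to the master (NameNode) for all metadata operations.
            // "namenode" and port 9000 are placeholders for your cluster's values.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
            fs.close();
        }
    }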
1.3 Hadoop HDFS Application Architecture
A file on HDFS is split into multiple blocks, and each block is replicated within the Hadoop cluster. A block on HDFS is a blob of data within the underlying file system, with a default size of 64 MB (128 MB in newer Hadoop versions). The size of a block can be increased, for example up to 256 MB, based on the requirements.
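As a hedged illustration of block sizing, a client can request a non-default block size for a single file at write time. This is a minimal sketch: the dfs.blocksize property name applies to Hadoop 2 and later, and the path is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteWithBlockSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Request 256 MB blocks for files created by this client (value in bytes).
            conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
            FileSystem fs = FileSystem.get(conf);
            try (FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
                out.writeUTF("hello hdfs");
            }
            fs.close();
        }
    }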
Hadoop HDFS (Hadoop Distributed File System) stores the application data and file system metadata separately on dedicated servers. NameNode and DataNode are the two critical components of the Hadoop HDFS architecture .
Application data is stored on servers referred to as DataNodes, and file system metadata is stored on a server referred to as the NameNode.
HDFS replicates the file content on multiple DataNodes based on the replication factor to ensure reliability of data. The NameNode and DataNodes communicate with each other using TCP-based protocols. For the Hadoop architecture to be performance-efficient, HDFS must satisfy certain prerequisites (a short sketch of the replication factor follows this list):
- All the hard drives should have high throughput.
- The network should be fast enough to manage intermediate data transfers and block replication.
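As a minimal, hedged sketch of the replication factor in practice, a client can change the replication factor of an existing file; the NameNode then arranges extra copies or removes surplus ones. The path and factor below are illustrative only.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            // Assumes fs.defaultFS in the client configuration points at the cluster.
            FileSystem fs = FileSystem.get(new Configuration());
            // Ask the NameNode to keep 3 copies of every block of this file.
            boolean ok = fs.setReplication(new Path("/data/example.txt"), (short) 3);
            System.out.println("replication change accepted: " + ok);
            fs.close();
        }
    }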
1.4 Hadoop NameNode and DataNode
The NameNode keeps track of where the data blocks for a particular file are stored.
- NameNode is the centerpiece of HDFS.
- NameNode is also known as the Master.
- NameNode stores only the metadata of HDFS: the directory tree of all files in the file system, and the locations of the files' blocks across the cluster.
- NameNode does not store the actual data or the dataset; the data itself is stored in the DataNodes.
- NameNode knows the list of blocks and their locations for any given file in HDFS, and with this information it knows how to reconstruct the file from its blocks (see the sketch after this list).
- NameNode is critical to HDFS: when the NameNode is down, HDFS and the Hadoop cluster are inaccessible and considered down.
- NameNode is a single point of failure in a Hadoop cluster.
- NameNode is usually configured with a lot of memory because the block locations are held in main memory.
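To make that metadata role concrete, here is a minimal, hedged sketch that asks the NameNode for the block locations of a file; the path is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/example.txt"));
            // The NameNode answers from its in-memory metadata: which DataNodes
            // hold each block of this file.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }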
DataNodes store the data and return it on request.
- DataNode is responsible for storing the actual data in HDFS.
- DataNode is also known as the Slave.
- NameNode and DataNode are in constant communication (see the sketch after this list).
- When a DataNode starts up, it announces itself to the NameNode along with the list of blocks it is responsible for.
- When a DataNode goes down, it does not affect the availability of data or the cluster; the NameNode arranges replication of the blocks managed by the DataNode that is no longer available.
- DataNode is usually configured with a lot of hard disk space because the actual data is stored on the DataNodes.
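As a hedged sketch of that constant communication, a client can ask the NameNode for the state of the DataNodes it tracks through heartbeats; this uses the HDFS-specific DistributedFileSystem class and assumes the default file system is HDFS.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    public class DataNodeReport {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // The NameNode tracks every DataNode via heartbeats and block reports.
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            for (DatanodeInfo node : dfs.getDataNodeStats()) {
                System.out.println(node.getHostName()
                        + "  capacity=" + node.getCapacity()
                        + "  remaining=" + node.getRemaining());
            }
            fs.close();
        }
    }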
1.5 When to Use Hadoop
1.5.1 Data Size and Data Diversity
When you are dealing with huge volumes of data coming from various sources and in a variety of formats, you can say that you are dealing with Big Data. In this case, Hadoop is the right technology for you.
1.5.2 Future Planning
It is all about getting ready for the challenges you may face in the future. If you anticipate Hadoop as a future need, you should plan accordingly. To implement Hadoop on your data, you should first understand the level of complexity of the data and the rate at which it is going to grow. So you need cluster planning. You may begin by building a small or medium cluster as per the data (in GBs or a few TBs) available at present, and scale up your cluster in the future depending on the growth of your data.
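As a rough, hedged illustration of cluster sizing (the numbers below are made up): raw HDFS capacity is roughly the data size times the replication factor, plus headroom for temporary and intermediate data.

    // Back-of-the-envelope HDFS capacity planning with illustrative numbers.
    public class ClusterSizing {
        public static void main(String[] args) {
            double dataTb = 2.0;          // data you have today, in TB (made up)
            int replicationFactor = 3;    // HDFS default replication
            double headroom = 1.25;       // ~25% extra for temporary/intermediate data
            double rawCapacityTb = dataTb * replicationFactor * headroom;
            // 2 TB * 3 * 1.25 = 7.5 TB of raw disk across all DataNodes
            System.out.printf("Raw HDFS capacity needed: %.1f TB%n", rawCapacityTb);
        }
    }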
1.5.3 Multiple Frameworks for Big Data
There are various tools for various purposes. Hadoop can be integrated with multiple analytic tools to get the best out of it: Mahout for machine learning, R and Python for analytics and visualization, Spark for real-time processing, MongoDB and HBase for NoSQL databases, Pentaho for BI, and so on.
1.5.4 Lifetime Data Available
When you want your data to stay live and available for its whole lifetime, this can be achieved using Hadoop's scalability. There is no limit to the size of cluster that you can have. You can increase the size at any time as per your need by adding DataNodes to it, at minimal cost.
1.6 When NOT to use Hadoop
1.6.1 Real Time Analytics
If you want to do real-time analytics, where you expect results quickly, Hadoop should not be used directly. This is because Hadoop works on batch processing, so the response time is high.
Since Hadoop cannot be used for real-time analytics by itself, people explored and developed a way to keep the strength of Hadoop (HDFS) while making the processing (near) real time. The industry-accepted way is to store the Big Data in HDFS and run Spark on top of it. By using Spark, the processing can be done very quickly.
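A minimal, hedged sketch of that pattern using Spark's Java API is below; the NameNode address and file path are hypothetical, and the job only counts lines.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.SparkSession;

    public class SparkOverHdfs {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("SparkOverHdfs")
                    .getOrCreate();
            // Spark reads the data directly from HDFS blocks, in parallel.
            // The NameNode address and path are placeholders for your cluster.
            Dataset<String> lines = spark.read().textFile("hdfs://namenode:9000/data/events.log");
            System.out.println("line count: " + lines.count());
            spark.stop();
        }
    }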
1.6.2 Not a Replacement for Existing Infrastructure
Hadoop is not a replacement for your existing data processing infrastructure. However, you can use Hadoop along with it.
All the historical big data can be stored in Hadoop HDFS, where it can be processed and transformed into structured, manageable data. After processing the data in Hadoop, you then send the output to relational database technologies for BI, decision support, reporting, and so on.
1.6.3 Multiple Smaller Datasets
The Hadoop framework is not recommended for small structured datasets, as other tools available on the market, such as MS Excel or an RDBMS, can do this work quite easily and at a faster pace than Hadoop. For small data analytics, Hadoop can be costlier than other tools.
1.6.4 Novice Hadoopers
Unless you have a good understanding of the Hadoop framework, it is not advisable to use Hadoop in production. Hadoop is a technology that should come with a disclaimer: “Handle with care”. You should know it before you use it.
1.6.5 Where Security is the primary Concern
Many enterprises, especially within highly regulated industries dealing with sensitive data, aren't able to move as quickly as they would like toward implementing Big Data projects and Hadoop.
There are multiple ways to ensure that your sensitive data stays secure with the elephant (Hadoop). One is to encrypt your data while moving it into Hadoop: you can write a MapReduce program using any encryption algorithm that encrypts the data and stores it in HDFS.
Finally, you use the encrypted data for further MapReduce processing to get relevant insights.
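A minimal, hedged sketch of such a job is below: a map-only MapReduce mapper that AES-encrypts each input line and emits it Base64-encoded. The class name, key, and cipher choice are illustrative; real deployments should load keys from a key management service and prefer an authenticated cipher mode.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.security.GeneralSecurityException;
    import java.util.Base64;
    import javax.crypto.Cipher;
    import javax.crypto.spec.SecretKeySpec;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only job: encrypt each input line and write the ciphertext to HDFS.
    public class EncryptMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        private Cipher cipher;

        @Override
        protected void setup(Context context) throws IOException {
            try {
                // Illustrative 16-byte AES key; never hard-code keys in real jobs.
                byte[] key = "0123456789abcdef".getBytes(StandardCharsets.UTF_8);
                // The provider default AES mode (ECB) is weak; use AES/GCM in production.
                cipher = Cipher.getInstance("AES");
                cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
            } catch (GeneralSecurityException e) {
                throw new IOException("cipher setup failed", e);
            }
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            try {
                byte[] encrypted = cipher.doFinal(line.toString().getBytes(StandardCharsets.UTF_8));
                // Base64 keeps the ciphertext printable in a plain-text output file.
                context.write(NullWritable.get(), new Text(Base64.getEncoder().encodeToString(encrypted)));
            } catch (GeneralSecurityException e) {
                throw new IOException("encryption failed", e);
            }
        }
    }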