How to Build Hadoop on Cloud Amazon AWS in one hour


《How to Build Hadoop on Cloud Amazon AWS in one hour》 is available on Amazon


image

About Author

Zhaoliang Meng is the author of the books 'How to Build Hadoop on the Cloud Amazon AWS in One Hour', 'How to Create QlikView Dashboards in 30 Minutes', 'Oracle SQL and Efficient Entry Guidelines' and 'How to Build a Website in 2 Hours'. He is interested in Big Data, data analytics, cloud computing, business intelligence and data science, and likes to build websites and mobile apps. He is passionate about writing articles and sharing them online, and he currently lives in Columbus, Ohio, US. Please feel free to reach out to him (zhao.liang.meng@hotmail.com) if you have any questions.

Preface

What is the biggest opportunity on the internet in the next ten years? We may think it should be something like Big Data, Cloud, Data Scientist, Java Developer, Python Developer, etc. As we know, the internet is growing fast: Facebook now sees 100 million hours of daily video watch time, and more than 250 billion photos have been uploaded to Facebook, which works out to about 350 million photos per day. Most importantly, Facebook generates 4 petabytes of data (Big Data) per day. As more and more companies grow business datasets like Facebook's, they need the space and systems to handle the big data problem.

It doesn't matter what you will be working on in the future. Whether you are a Cloud Engineer, a Data Scientist or a Python Developer, the first problem is how to handle the big data generated every day by the business systems. This means big data will be the foundation of the business and IT infrastructure, and Apache Hadoop is the solution for big data.

Do we have enough people, time, money and other resources to build a data center for the business's big data? Of course not. Only a few companies like Facebook, Google, Microsoft, Amazon and Alibaba have enough money to build their own data centers. But we can use the cloud services of these companies, like Amazon AWS. Cloud services are easy to use and provide high speed, security, agility, reliability and scalability at low cost.

Big Data and the Cloud are big areas; this book will not talk about the architecture of Big Data and the Cloud or how they work internally. We know time is expensive, and nobody has enough of it to learn every technology. This book shows you how to build a Hadoop (Big Data) system on the Amazon AWS cloud easily by following these steps. You don't know how big data and the Cloud work; you don't have coding experience; you don't have any IT experience. That is fine; this book is just for you. It will tell you how to build your own Hadoop (big data) system on the Cloud.

What learning experience do you never forget? Making mistakes, right? When we only read something, we forget it within a few days; when we do it, we remember. And we remember even better when we make some mistakes along the way. So let's do it, make some mistakes, and have fun too.

Zhaoliang Meng

09/12/2019 in Columbus, OH, US

1 Hadoop for big data

1.1 What is Hadoop

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.

Apache Hadoop is an open-source distributed processing framework that manages data processing and storage for big data applications running on clustered systems.

image

1.2 Hadoop Architecture

Hadoop is made up of several modules that are supported by a large ecosystem of technologies. The Hadoop ecosystem is a platform, or suite, that provides various services to solve big data problems.


Hadoop follows a master-slave architecture for data storage and distributed data processing, using HDFS (Hadoop Distributed File System) and MapReduce respectively. The master node for data storage in Hadoop HDFS is the NameNode, and the master node for parallel processing of data with Hadoop MapReduce is the JobTracker.
image
The slave nodes in the Hadoop architecture are the other machines in the Hadoop cluster which store data and perform complex computations.

Every slave node runs a TaskTracker daemon and a DataNode, which synchronize their work with the JobTracker and the NameNode respectively. In a Hadoop deployment, the master and slave systems can be set up in the cloud or on-premises.

1.3 Hadoop HDFS Application Architecture

A file in HDFS is split into multiple blocks, and each block is replicated within the Hadoop cluster. A block in HDFS is a blob of data within the underlying file system, with a default size of 64 MB in older Hadoop releases (128 MB in Hadoop 2 and later). The block size can be extended up to 256 MB based on the requirements.
image
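Once the cluster built in the later chapters is running, you can inspect the block size and block placement yourself. A minimal sketch, assuming HDFS is up and using a hypothetical file path:

$ hdfs getconf -confKey dfs.blocksize
$ hdfs fsck /user/ubuntu/sample.txt -files -blocks -locations

The first command prints the configured block size in bytes; the second shows how a file's blocks are distributed across the DataNodes.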
Hadoop HDFS (Hadoop Distributed File System) stores the application data and the file system metadata separately on dedicated servers. NameNode and DataNode are the two critical components of the Hadoop HDFS architecture.

Application data is stored on servers referred to as DataNodes and file system metadata is stored on servers referred to as NameNode .

HDFS replicates the file content on multiple DataNodes based on the replication factor to ensure reliability of the data. The NameNode and DataNodes communicate with each other using TCP-based protocols. For the Hadoop architecture to be performance-efficient, HDFS must satisfy certain prerequisites:

  • All hard drives should have high throughput.
  • Good network speed to manage intermediate data transfers and block replication.
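The replication factor mentioned above can also be changed per file after the cluster is up; for example, the standard setrep command below (the path is only a hypothetical example) re-replicates one file to 2 copies and waits for the change to complete:

$ hdfs dfs -setrep -w 2 /user/ubuntu/sample.txt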

1.4 Hadoop NameNode and DataNode

image
The NameNode keeps track of where the data blocks are stored for a particular file.

  • NameNode is the centerpiece of HDFS.

  • NameNode is also known as the Master

  • NameNode only stores the metadata of HDFS – the directory tree of all files in the file system, and tracks the files across the cluster .

  • NameNode does not store the actual data or the dataset. The data itself is actually stored in the DataNodes.

  • NameNode knows the list of blocks and their locations for any given file in HDFS. With this information, the NameNode knows how to construct the file from its blocks.

  • The NameNode is so critical to HDFS that when it is down, the HDFS/Hadoop cluster is inaccessible and considered down.

  • NameNode is a single point of failure in Hadoop cluster.

  • NameNode is usually configured with a lot of memory because the block locations are held in main memory.

image
DataNodes store the data and return it on request.

  • DataNode is responsible for storing the actual data in HDFS.

  • DataNode is also known as the Slave

  • NameNode and DataNode are in constant communication .

  • When a DataNode starts up, it announces itself to the NameNode along with the list of blocks it is responsible for.

  • When a DataNode is down, it does not affect the availability of the data or the cluster. The NameNode will arrange replication for the blocks managed by the DataNode that is not available.

  • DataNode is usually configured with a lot of hard disk space because the actual data is stored in the DataNode.
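Once the cluster built in the later chapters is running, you can see this NameNode/DataNode relationship for yourself with the standard admin report, run from the NameNode:

$ hdfs dfsadmin -report

It lists the live and dead DataNodes, together with the capacity and usage that the NameNode currently tracks for each of them.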

1.5 When to Use Hadoop

1.5.1 Data Size and Data Diversity

When you are dealing with huge volumes of data coming from various sources and in a variety of formats then you can say that you are dealing with Big Data. In this case, Hadoop is the right technology for you.
image

1.5.2 Future Planning

It is all about getting ready for the challenges you may face in the future. If you anticipate Hadoop as a future need, you should plan accordingly. To implement Hadoop on your data, you should first understand the level of complexity of the data and the rate at which it is going to grow. So you need cluster planning: it may begin with building a small or medium cluster (for the data in GBs or a few TBs available at present) that you scale up in the future depending on the growth of your data.
image

1.5.3 Multiple Frameworks for Big Data

There are various tools for various purposes. Hadoop can be integrated with multiple analytic tools to get the best out of it, like Mahout for machine learning, R and Python for analytics and visualization, Spark for real-time processing, MongoDB and HBase for NoSQL databases, Pentaho for BI, etc.

1.5.4 Lifetime Data Available

When you want your data to stay live and available indefinitely, this can be achieved using Hadoop's scalability. There is no limit to the size of the cluster you can have. You can increase its size anytime, as per your need, by adding DataNodes at minimal cost.
image

1.6 When NOT to use Hadoop

1.6.1 Real Time Analytics

If you want to do real-time analytics, where you expect results quickly, Hadoop should not be used directly. This is because Hadoop works on batch processing, so the response time is high.

Since Hadoop cannot be used for real-time analytics on its own, people explored and developed a way to use the strength of Hadoop (HDFS) while making the processing real time. The industry-accepted approach is to store the big data in HDFS and mount Spark over it. By using Spark, the processing can be done in real time and very quickly.
image

1.6.2 Not a Replacement for Existing Infrastructure

Hadoop is not a replacement for your existing data processing infrastructure. However, you can use Hadoop along with it.

All the historical big data can be stored in Hadoop HDFS, where it can be processed and transformed into structured, manageable data. After processing the data in Hadoop, you send the output to relational database technologies for BI, decision support, reporting, etc.
image

1.6.3 Multiple Smaller DataSets

The Hadoop framework is not recommended for small structured datasets, as other tools available in the market, such as MS Excel or an RDBMS, can do this work quite easily and faster than Hadoop. For small-scale data analytics, Hadoop can be costlier than these tools.
image
image

1.6.4 Novice Hadoopers

Unless you have a good understanding of the Hadoop framework, it is not suggested to use Hadoop for production. Hadoop is a technology that should come with the disclaimer "Handle with care". Know it before you use it, or else you will end up like the kid in the picture below.
image

1.6.5 Where Security is the primary Concern

Many enterprises, especially within highly regulated industries dealing with sensitive data, aren't able to move as quickly as they would like towards implementing Big Data projects and Hadoop.
image
There are multiple ways to ensure that your sensitive data is secure with the elephant (Hadoop). Encrypt your data while moving it to Hadoop. You can easily write a MapReduce program using any encryption algorithm that encrypts the data and stores it in HDFS.

Finally, you use the encrypted data for further MapReduce processing to get relevant insights.
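As an alternative to rolling your own encryption inside a MapReduce job, recent Hadoop versions also ship HDFS transparent encryption zones. A minimal sketch, assuming a Hadoop KMS is configured and using example key and path names:

$ hadoop key create mykey
$ hdfs dfs -mkdir /secure
$ hdfs crypto -createZone -keyName mykey -path /secure

Anything written under /secure is then encrypted at rest without changes to the jobs that read or write it.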
image


2 Setup 4 Linux Servers on Amazon AWS for Hadoop

2.1 Cloud Benefits

Why do many companies want to build their services on the cloud? Let us look at the benefits of the cloud.
image

2.2 Diagram of 4 Ubuntu Linux Servers for Hadoop

image

2.3 Setup 4 Ubuntu Linux servers for Hadoop on the Amazon AWS EC2

Log in at aws.amazon.com.
image

2.3.1 Choose Amazon Machine Image (AMI) on the Amazon Elastic Compute Cloud (Amazon EC2)

Step 1 : Click “EC2” service
image
Step 2 : Click “Launch Instance”
image
Step 3 : Choose “Free tier only” if we want to save some money.
image
Step 4 : Click "Select" on "Ubuntu Server 18.04 LTS (HVM), SSD Volume Type".
image

2.3.2 Choose Instance Type

Choose the CPU and memory for the Linux server. For building this Hadoop system, keep the default: 1 vCPU and 1 GB of memory. Then click "Next: Configure Instance Details".
image

2.3.3 Configure Instance

Change the "1" to "4" in "Number of instances"; this means we want to set up 4 Ubuntu Linux servers for Hadoop. Then click "Next: Add Storage".
image

2.3.4 Add Storage

In "Size (GiB)", keep the default 8 GiB of storage, then click "Next: Add Tags".
image
2.3.5 Add Tags

Click the “click to add a Name tag”.
image
Fill in "Name" in the "Key" field and "Node" in the "Value" field, then click "Next: Configure Security Group".
image

2.3.6 Configure Security Group

Fill in the name (Example: “BigData”) in “Security group name:”, and the description (Example: “Hadoop for Zhaoliang”) in the “Description”.

Choose the “All traffic” from the dropdown list in “Type”, and then click “Review and Launch”.
image

2.3.7 Review the Instances

Click the “Launch”
image
Choose "Create a new key pair" from the dropdown list, fill in the name (example: "Zhaoliang-Hadoop-Key") in "Key pair name", and then click "Download Key Pair" to save this key to your local desktop. (It will be used to log in to the Ubuntu Linux servers without a password.)
image
After downloading "Zhaoliang-Hadoop-Key.pem" to the desktop, click "Launch Instances".
image
When these instances are running, you can scroll down and click the “View Instances”.
image
You can click the pen icon to change the name of each instance.

Rename these 4 instances to "NameNode", "DataNode1", "DataNode2" and "DataNode3".

image
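If you prefer the command line to the AWS console, the same four instances can also be launched with the AWS CLI. A rough sketch, assuming the CLI is installed and configured; the AMI ID is a placeholder and differs by region:

$ aws ec2 run-instances --image-id ami-xxxxxxxx --count 4 --instance-type t2.micro --key-name Zhaoliang-Hadoop-Key --security-groups BigData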


3 Setup PuTTY and WinSCP to Connect the Hadoop Cluster

3.1 What is PuTTY

PuTTY is a free and open-source terminal emulator, serial console, and network file transfer application. It supports several network protocols, including SCP, SSH, Telnet, rlogin, and raw socket connections.
image

3.2 What is WinSCP

WinSCP is a free and open-source SFTP , FTP, WebDAV, Amazon S3 and SCP client for Microsoft Windows. Its main function is secure file transfer between a local and a remote computer. Beyond this, WinSCP offers basic file manager and file synchronization functionality.
image

3.3 Generate the private key through PuTTYgen

Step 1 : Open the “PuTTYgen” software in your Windows or MacOS.
image
After opening "PuTTYgen", you will see the picture below.
image
Step 2 : Load "Zhaoliang-Hadoop-Key.pem" from the menu "File" -> "Load private key".
image
Step 3 : After loading the "Zhaoliang-Hadoop-Key.pem" file, click "Save private key".
image
Click “Yes” in the popup dialogue of “Are you sure you want to save this key without a passphrase to protect it?”
image
Save the key to your desktop; for example, "Zhaoliang-Hadoop-Key.ppk" is my private key.
image

3.4 Connect to the Hadoop NameNode via PuTTY

Step 1 : Check the NameNode Public DNS (Domain Name System) and copy it, for example: "ec2-3-14-126-2.us-east-2.compute.amazonaws.com".
image
Step 2 : Open “PuTTY” software.

Create a new session and input "ubuntu@ec2-3-14-126-2.us-east-2.compute.amazonaws.com"; this means the username will be "ubuntu" and the server name will be "ec2-3-14-126-2.us-east-2.compute.amazonaws.com" (your NameNode public DNS). You can name it "HadoopNameNode" in "Saved Sessions" if you want.
image
Click "SSH" on the left side, then click "Auth", and click "Browse" to select the private key (example: "Zhaoliang-Hadoop-Key.ppk"; this key connects to the Amazon EC2 Linux servers without a password).

Then click “Open”
image
image
Click “Yes” in the popup dialogue of “PuTTY Security Alert”. Now you’re in the Hadoop NameNode Server.
image
Step 3 : Using the same method, log in to the Hadoop DataNode1 Server, Hadoop DataNode2 Server and Hadoop DataNode3 Server. See the picture below.
image
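If you are on macOS or Linux, or have OpenSSH installed on Windows 10, you do not need PuTTY at all; you can connect directly with the downloaded .pem key (the DNS name below is the example NameNode address from above, and the key path assumes it was saved to your desktop):

$ chmod 400 ~/Desktop/Zhaoliang-Hadoop-Key.pem
$ ssh -i ~/Desktop/Zhaoliang-Hadoop-Key.pem ubuntu@ec2-3-14-126-2.us-east-2.compute.amazonaws.com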

3.5 Connect to the Hadoop NameNode Server through WinSCP to transfer files

Step 1 : Start the "WinSCP" software and click "New Site" to connect to the Hadoop NameNode Server.
image
Step 2 : Copy and paste the NameNode public DNS into "Host name:", input "ubuntu" in "User name:" (the Ubuntu Linux system has the default username "ubuntu"), and then click "Advanced" to select the private key from your desktop.
image
Choose "SSH" -> "Authentication" on the left side, browse to the private key (example: "Zhaoliang-Hadoop-Key.ppk"), then click "OK".
image
You can click “Save” to save this new site to “NameNode”
image
After name it “NameNode”, and then click “OK”
image
And then click “Login” to connect the Hadoop NameNode Server.
image
Click “Yes” in the popup dialogue “Continue connecting to an unknown server and add its host key to cache?”
image
You will see below picture after you connect the Hadoop NameNode Server.
image
Note: If you cannot see the hidden folders and files, please choose "Show hidden files" in the "WinSCP" software.
image
Click the menu “Options”->”Preferences…”

Click the “Panels”, and then choose the “Show hidden files”. And then “OK”.

image

3.6 Connect to Hadoop DataNode1, DataNode2 and DataNode3 server through WinSCP

Follow the method above to connect to DataNode1, DataNode2 and DataNode3, just like the Hadoop NameNode Server. See the picture below.

image


4 Setup Passwordless SSH for these 4 Ubuntu Linux Servers

4.1 Edit the "config" file on your desktop

This "config" file will be uploaded to the 4 node servers so that the 4 servers can talk to each other through the private key without a password.
"config" file content:

Host namenode
    HostName namenode_public_dns
    User ubuntu
    IdentityFile ~/.ssh/AWS_private_key.pem
Host datanode1
    HostName datanode1_public_dns
    User ubuntu
    IdentityFile ~/.ssh/AWS_private_key.pem
Host datanode2
    HostName datanode2_public_dns
    User ubuntu
    IdentityFile ~/.ssh/AWS_private_key.pem
Host datanode3
    HostName datanode3_public_dns
    User ubuntu
    IdentityFile ~/.ssh/AWS_private_key.pem

Below picture is the example of “config” file.
image
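Note that AWS_private_key.pem above is only a placeholder. On these servers the IdentityFile line should point to the key you actually downloaded, so a filled-in stanza would look like the following (the HostName is a placeholder for your own public DNS):

Host datanode1
    HostName ec2-xx-xx-xx-xx.us-east-2.compute.amazonaws.com
    User ubuntu
    IdentityFile ~/.ssh/Zhaoliang-Hadoop-Key.pem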

4.2 Upload the "config" file and private key to the Hadoop NameNode Server

Drag and drop to upload the private key "Zhaoliang-Hadoop-Key.pem" and the "config" file to the /home/ubuntu/.ssh/ folder on the NameNode Server.

image
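If you prefer the command line to WinSCP, the same upload can be done with scp from your desktop, assuming OpenSSH is available (replace the DNS name with your NameNode's public DNS):

$ scp -i Zhaoliang-Hadoop-Key.pem config Zhaoliang-Hadoop-Key.pem ubuntu@ec2-3-14-126-2.us-east-2.compute.amazonaws.com:~/.ssh/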

4.3 Change the permission of the private key on NameNode for security.

Execute below command:
$ sudo chmod 600 ~/.ssh/Zhaoliang-Hadoop-Key.pem
image

4.4 Copy “config” and private key files to DataNode1, DataNode2 and DataNode3 Servers from NameNode Server

Execute below commands in NameNode:

$ scp ~/.ssh/Zhaoliang-Hadoop-Key.pem ~/.ssh/config datanode1:~/.ssh

$scp ~/.ssh/Zhaoliang-Hadoop-Key.pem ~/.ssh/config datanode2:~/.ssh

$scp ~/.ssh/Zhaoliang-Hadoop-Key.pem ~/.ssh/config datanode3:~/.ssh

image

4.5 Generate a new SSH key pair on the NameNode

Execute below command on the NameNode:

$ ssh-keygen -f ~/.ssh/id_rsa -t rsa -P ""
image

This command generates two files, "id_rsa" (the private key) and "id_rsa.pub" (the public key).
image

4.6 Add the "id_rsa.pub" public key to the "authorized_keys" file on the NameNode server

You will see two lines in the "authorized_keys" file (/home/ubuntu/.ssh/authorized_keys) after executing the below command.

Execute below command on NameNode:

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
image

4.7 Add the "id_rsa.pub" public key to the DataNode1, DataNode2 and DataNode3 Servers from the NameNode Server

Execute below commands on the NameNode:

$ cat ~/.ssh/id_rsa.pub | ssh datanode1 'cat >> ~/.ssh/authorized_keys'

$ cat ~/.ssh/id_rsa.pub | ssh datanode2 'cat >> ~/.ssh/authorized_keys'

$ cat ~/.ssh/id_rsa.pub | ssh datanode3 'cat >> ~/.ssh/authorized_keys'
image
After executing these 3 commands, please check the "authorized_keys" file (/home/ubuntu/.ssh/authorized_keys) on the DataNode1, DataNode2 and DataNode3 servers; each should have two lines.
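If you want to confirm this from the NameNode without logging in to each DataNode, a quick check using the passwordless configuration already in place is:

$ ssh datanode1 "wc -l ~/.ssh/authorized_keys"
$ ssh datanode2 "wc -l ~/.ssh/authorized_keys"
$ ssh datanode3 "wc -l ~/.ssh/authorized_keys"

Each command should report 2 lines.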

4.8 Test that the NameNode, DataNode1, DataNode2 and DataNode3 servers can connect to each other without a password

Execute "ssh datanode1" on the NameNode Server; it will connect to the DataNode1 Server without a password.

Execute below command on the NameNode:

$ssh datanode1

Now you are on the DataNode1 server; execute the below command:

$ ssh datanode2

Now you are on the DataNode2 server; execute the below command:

$ ssh datanode3

Now you are on the DataNode3 server; execute the below command:

$ ssh namenode

Now you are back on the NameNode server, coming from DataNode3.

See the example in the picture below: "ssh datanode2" connects to the DataNode2 Server from the DataNode1 Server.

image


5 Install Hadoop in the 4 Linux Servers

5.1 Check the latest stable Hadoop version and supported JDK version

As we know, the Hadoop software is built in Java, so it needs a JDK (JVM) to run.

See the versions of Hadoop in https://hadoop.apache.org/releases.html

Now the latest stable Hadoop version is 3.1.2 (September 2019), and it only supports JDK 8 (also written as JDK 1.8). Please don't install a JDK version that is not supported by Hadoop.

image

5.2 Update packages on the NameNode, DataNode1, DataNode2 and DataNode3 Servers

Execute below command on the NameNode:

$ sudo apt-get update
image
And then do the same on the DataNode1, DataNode2 and DataNode3 Servers.

5.3 Install the JDK8 on the NameNode, DataNode1, DataNode2, DataNode3 servers

Hadoop 3.1.2 only supports JDK 8, so let's install JDK 8.

Execute below command on the NameNode:

$ sudo apt-get install openjdk-8-jdk
image
Confirm the JDK version after installing.

Execute below command on the NameNode:

$ java -version
image
Please do the same on the DataNode1, DataNode2 and DataNode3 Servers.
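Because passwordless SSH between the nodes is already set up, you can run the same update and JDK installation on all three DataNodes from the NameNode in one go; a small sketch:

$ for node in datanode1 datanode2 datanode3; do ssh $node "sudo apt-get update && sudo apt-get install -y openjdk-8-jdk"; done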

5.4 Download the latest stable Hadoop version 3.1.2 from Apache Hadoop website on NameNode, DataNode1, DataNode2 and DataNode3 servers.

The below command downloads Hadoop version 3.1.2 from the Apache Hadoop website to the "Downloads" folder on the NameNode server.

Execute below command on the NameNode:

$ wget https://archive.apache.org/dist/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz -P ~/Downloads
image
Please do the same on the DataNode1, DataNode2 and DataNode3 servers.

5.5 Uncompress the Hadoop tar file into the /usr/local folder on NameNode, DataNode1, DataNode2 and DataNode3.

Execute below command on the NameNode:

$sudo tar zxvf ~/Downloads/hadoop-* -C /usr/local
image
Rename the /usr/local/hadoop-* folder to /usr/local/hadoop (example: the /usr/local/hadoop-3.1.2/ folder becomes /usr/local/hadoop/).

Execute below command on the NameNode:

$sudo mv /usr/local/hadoop-* /usr/local/hadoop
image
Please do the same on the DataNode1, DataNode2 and DataNode3 Server.

5.6 Set the JDK and Hadoop environment variables on NameNode, DataNode1, DataNode2 and DataNode3 server.

Add the below script to the /home/ubuntu/.profile file on all nodes. You can open /home/ubuntu/.profile in the "WinSCP" software, copy and paste the script into it, and then click the "Save" icon.

Add these variables to the /home/ubuntu/.profile file of the NameNode Server first, to set up the Hadoop environment through the "WinSCP" software.

Content:

export JAVA_HOME=/usr
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

See below example of picture.
image
Load the variables. The below command makes the Linux shell read the variables (the Java and Hadoop environment variables) and enables them.

Execute below command on the NameNode:

$ . ~/.profile
image

Validate the Java and Hadoop environment variables.

Execute below commands on the NameNode:

$ echo $HADOOP_HOME

$echo $JAVA_HOME

$echo $HADOOP_CONF_DIR
image
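You can also confirm that the Hadoop binaries are now on the PATH:

$ hadoop version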
Please do the same on the DataNode1, DataNode2 and DataNode3 Servers.


6 Hadoop NameNode Configuration

6.1 Change the Hadoop folder ownership to the ‘ubuntu’ user from ‘root’ user on NameNode, DataNode1, DataNode2 and DataNode3 servers

Execute below command on the NameNode:

$ sudo chown -R ubuntu /usr/local/hadoop
image
Check the ownership in "WinSCP"; as the picture below shows, the owner is now 'ubuntu'.
image
Please do the same on the DataNode1, DataNode2 and DataNode3 Servers.

6.2 Change the JAVA_HOME environment in Hadoop on NameNode, DataNode1, DataNode2 and DataNode3 Servers.

Now that you have permission to change the Hadoop files, change the JAVA_HOME environment variable in $HADOOP_CONF_DIR/hadoop-env.sh (/usr/local/hadoop/etc/hadoop/hadoop-env.sh).

Add the below content to hadoop-env.sh (/usr/local/hadoop/etc/hadoop/hadoop-env.sh) on the NameNode through the "WinSCP" software:

export JAVA_HOME=/usr

Change "#export JAVA_HOME" to "export JAVA_HOME=/usr", as shown in the picture below, and then click the "Save" icon.
image
Please do the same on the DataNode1, DataNode2 and DataNode3 Servers.

6.3 Configure the fs.defaultFS property on the NameNode Server only

Find the $HADOOP_CONF_DIR/core-site.xml file (in the /usr/local/hadoop/etc/hadoop folder) and change the configuration element.

Add the below configuration content to the core-site.xml file (/usr/local/hadoop/etc/hadoop/core-site.xml) on the NameNode.

Content:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode_public_dns:9000</value>
  </property>
</configuration>

Note : Change namenode_public_dns to your NameNode Public DNS; see the example below.
image

6.4 Add all hostnames to the /etc/hosts file on the NameNode Server only

Please change the ownership (permissions) of the /etc/hosts file on the NameNode Server before adding the hostnames to it.

Change the ownership to “ubuntu” user from “root” user.

Execute below command on the NameNode:

$ sudo chown ubuntu /etc/hosts
image
Add below content to the /etc/hosts file through WinSCP.

Content:

#NameNode

“NameNode Private IPs” “NameNode Public DNS”

#DataNode

“DataNode1 Private IPs” “DataNode1 Public DNS”

“DataNode2 Private IPs” “DataNode2 Public DNS”

“DataNode3 Private IPs” “DataNode3 Public DNS”

See below example “DataNode3 Private IPs” and “DataNode3 Public DNS” on the DataNode3 Server.
image
Example of adding the private IPs and hostname (public DNS) to the /etc/hosts file of the NameNode Server.
image
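In plain text, the finished file might contain lines like the following; the private IPs and the DataNode DNS names here are hypothetical examples, so use your own values:

#NameNode
172.31.20.10 ec2-3-14-126-2.us-east-2.compute.amazonaws.com
#DataNode
172.31.20.11 ec2-18-220-10-11.us-east-2.compute.amazonaws.com
172.31.20.12 ec2-18-220-10-12.us-east-2.compute.amazonaws.com
172.31.20.13 ec2-18-220-10-13.us-east-2.compute.amazonaws.com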
Change back the ownership to “root” user from “ubuntu” user for security.

Execute below command on the NameNode:

$sudo chown root /etc/hosts
image

6.5 Configure the replication number on the NameNode Server only

Usually, we can define how many copies of the data to keep in the Hadoop system. Since we have 3 DataNode servers, set up 3 copies for replication.

Add the below content to the $HADOOP_CONF_DIR/hdfs-site.xml (/usr/local/hadoop/etc/hadoop/hdfs-site.xml) file on the NameNode Server through the "WinSCP" software.

Content:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
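Many Hadoop setups also point dfs.namenode.name.dir at the NameNode data directory created in the next section. If your hdfs-site.xml does not set it yet, you can add a second property inside the same configuration element (the path matches the directory from section 6.6):

<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
</property>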

6.6 Create the hadoop data directory on the NameNode Server only

Create the hadoop data directory in the $HADOOP_HOME directory.

Execute below command on the NameNode:

$ sudo mkdir -p $HADOOP_HOME/hadoop_data/hdfs/namenode
image

6.7 Create a file named "masters" on the NameNode Server only

Create a file named "masters" in the $HADOOP_CONF_DIR directory.

Execute below command on the NameNode:

$ echo "namenode" | cat >> $HADOOP_CONF_DIR/masters
image

6.8 Create a file named "slaves" on the NameNode Server only

Create a file named "slaves" and add the hostnames of DataNode1, DataNode2 and DataNode3 to the $HADOOP_CONF_DIR/slaves file.

Execute below command on the NameNode:

$ echo "datanode1" | cat >> $HADOOP_CONF_DIR/slaves

$ echo "datanode2" | cat >> $HADOOP_CONF_DIR/slaves

$ echo "datanode3" | cat >> $HADOOP_CONF_DIR/slaves
image
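Note: Hadoop 3.x reads the list of worker hosts from $HADOOP_CONF_DIR/workers rather than the older "slaves" file, so it is safest to copy the same hostnames into a "workers" file as well:

$ cp $HADOOP_CONF_DIR/slaves $HADOOP_CONF_DIR/workers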
Check the “masters” and “slaves” files on the NameNode Server through “WinSCP”, see below picture.

image
