Build Hadoop Cluster


  1. Prepare the Ubuntu environment
docker pull ubuntu
  2. Start a container
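The start command is not shown in the original; a likely invocation, assuming the container is named hadoop_template to match the docker cp target below:
docker run -dit --name hadoop_template ubuntu /bin/bash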
  3. Download Hadoop, in this case hadoop-3.3.1-aarch64.tar.gz; the official deployment guide is available here
  4. Copy Hadoop into the Docker container
docker cp hadoop-3.3.1-aarch64.tar.gz hadoop_template:/root
  5. Log in to the container, then extract and rename
tar xvf hadoop-3.3.1-aarch64.tar.gz -C /usr/local/
mv /usr/local/hadoop-3.3.1/ /usr/local/hadoop
  6. Install Hadoop dependencies
apt update
apt install openjdk-8-jdk openssh-server
  7. Configure Hadoop runtime environment variables and parameters in etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root

and etc/hadoop/core-site.xml

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://hadoop1:9000</value>
        </property>
</configuration>

etc/hadoop/hdfs-site.xml (optional)

<configuration>
        <property>
                <name>dfs.permissions.enabled</name>
                <value>false</value>
        </property>
</configuration>

etc/hadoop/yarn-site.xml

<configuration>

<!-- Site specific YARN configuration properties -->
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>hadoop1</value>
        </property>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>

</configuration>

etc/hadoop/mapred-site.xml

<configuration>
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
        <property>
                <name>mapreduce.application.classpath</name>
                <value>
                        /usr/local/hadoop/etc/hadoop,
                        /usr/local/hadoop/share/hadoop/common/*,
                        /usr/local/hadoop/share/hadoop/common/lib/*,
                        /usr/local/hadoop/share/hadoop/hdfs/*,
                        /usr/local/hadoop/share/hadoop/hdfs/lib/*,
                        /usr/local/hadoop/share/hadoop/mapreduce/*,
                        /usr/local/hadoop/share/hadoop/mapreduce/lib/*,
                        /usr/local/hadoop/share/hadoop/yarn/*,
                        /usr/local/hadoop/share/hadoop/yarn/lib/*
                </value>
        </property>
</configuration>

etc/hadoop/workers

hadoop1
hadoop2
hadoop3
  8. Configure passwordless SSH login and a bash startup command
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
echo service ssh start >> ~/.bashrc
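Depending on the SSH defaults in the image, the first connection to each node may pause for host-key confirmation, which can stall start-all.sh later; one way to avoid this (an addition, not part of the original steps) is:
printf 'Host *\n  StrictHostKeyChecking no\n' >> ~/.ssh/config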
  9. Export the image
docker commit yourcontainerid hadoop_template
  10. Create a Docker network so containers can reach each other by hostname
docker network create -d bridge hadoop
  11. Create the cluster
docker run -dit --network hadoop -h hadoop1 hadoop_template /bin/bash
docker run -dit --network hadoop -h hadoop2 hadoop_template /bin/bash
docker run -dit --network hadoop -h hadoop3 hadoop_template /bin/bash
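The next steps run inside the hadoop1 container; the original does not show how to get a shell there, but a typical way is (yourcontainerid as reported by docker ps):
docker ps
docker exec -it yourcontainerid /bin/bash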
  12. Initialize Hadoop on the NameNode
cd /usr/local/hadoop
./bin/hdfs namenode -format
  13. Start the cluster
./sbin/start-all.sh

The project includes these modules:
- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Node assignment after configuration

| Node | HDFS NameNode | HDFS DataNode | YARN |
| --- | --- | --- | --- |
| hadoop1 | NameNode, SecondaryNameNode | DataNode | ResourceManager, NodeManager |
| hadoop2 | | DataNode | NodeManager |
| hadoop3 | | DataNode | NodeManager |
  14. Node maintenance
apt install net-tools #install netstat
netstat -tunlp #show listening ports
jps #list Java processes
./bin/hdfs dfsadmin -safemode leave #leave safe mode

  15. Application development
./bin/hdfs dfs -mkdir -p /demo/demoInput
echo "this is a demo file" > demo.txt
docker cp demo.txt hadoop1:/root
./bin/hdfs dfs -put ~/demo.txt /demo/demoInput #a second copy can be added to increase the number of mappers
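For example, a second copy could be uploaded under a different name (demo2.txt is only an illustrative name):
./bin/hdfs dfs -put ~/demo.txt /demo/demoInput/demo2.txt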

The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
Input and Output types of a MapReduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class JobDemo {

    public static class DemoMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final IntWritable one = new IntWritable(1);
        private final Text character = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Emit a <character, 1> pair for every character of the input line, spaces included.
            for (char c : value.toString().toCharArray()) {
                character.set(Character.toString(c));
                context.write(character, one);
            }
        }
    }

    public static class DemoReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> iterable, Context context) throws IOException, InterruptedException {
            // Sum the counts emitted for this character by the mappers/combiners.
            int sum = 0;
            for (IntWritable value : iterable) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration, "demo"); // Job(Configuration, String) is deprecated
        job.setCombinerClass(DemoReducer.class);
        job.setJarByClass(JobDemo.class);
        job.setMapperClass(DemoMapper.class);
        job.setReducerClass(DemoReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(2); // control the number of reducers
        FileInputFormat.addInputPath(job, new Path("/demo/demoInput/"));
        FileOutputFormat.setOutputPath(job, new Path("/demo/demoOutput/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
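The jar used below is assumed to come from an ordinary Maven build of this class (the project layout itself is not shown in the original):
cd ~/IdeaProjects/mapreduce-demo && mvn package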
docker cp ~/IdeaProjects/mapreduce-demo/target/mapreduce-demo-1.0-SNAPSHOT.jar hadoop1:/root
./bin/hadoop jar ~/mapreduce-demo-1.0-SNAPSHOT.jar JobDemo
./bin/hdfs dfs -ls /demo/demoOutput
./bin/hdfs dfs -get /demo/demoOutput/part-r-00000 ~/out
docker cp hadoop1:/root/out Desktop
 	4
a	1
d	1
e	2
f	1
h	1
i	3
l	1
m	1
o	1
s	2
t	1
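When more than one reducer is configured (as with setNumReduceTasks(2) above), the result is split across several part-r-* files; they can be inspected together with:
./bin/hdfs dfs -cat /demo/demoOutput/part-r-*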
./bin/hdfs dfs -rm -r -f /demo/demoOutput
./sbin/start-all.sh
  16. Web management UIs

Hadoop's built-in web UIs redirect between nodes, and mapping every container's ports to the host is inconvenient in Docker, so a proxy is used for access instead.

docker pull shadowsocks/shadowsocks-libev
docker run -p 8388:8388 -p 8388:8388/udp -d --restart always -h hadoop --network hadoop -e DNS_ADDRS=127.0.0.11 shadowsocks/shadowsocks-libev

Once the host connects to port 8388 through a Shadowsocks client, http://hadoop1:9870/explorer.html#/ can be accessed.
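The YARN ResourceManager UI should be reachable through the same proxy at http://hadoop1:8088/ (8088 is the default ResourceManager web port).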