Hadoop 2.7.3 Installation

This post documents the installation and basic configuration of Hadoop 2.7.3.

Installation Environment

  1. JDK 1.8
  2. Hadoop 2.7.3
  3. CentOS release 6.7 (Final) * 3

    hostname    ip
    master      172.168.170.84
    slave1      172.168.170.88
    slave2      172.168.170.89

Required Software

  1. Install the JDK (download link)
  2. Install ssh
    Hadoop uses ssh for login authentication between the nodes of the cluster, and passwordless ssh login also has to be set up. (The apt-get commands here are for Ubuntu/Debian; see the note after this list for CentOS.)
    sudo apt-get install ssh
  3. Install rsync
    Ubuntu 12.10 already ships with rsync.
    sudo apt-get install rsync
  4. Download Hadoop
    Download the matching Hadoop release from the official mirrors.
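The apt-get commands above apply to Ubuntu/Debian. On the CentOS 6.7 hosts listed in the environment section, the equivalents would presumably be installed with yum (package names assumed):

    # assumed CentOS equivalents of the apt-get commands above
    sudo yum install openssh-server openssh-clients
    sudo yum install rsync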

Installing Hadoop

  1. Create the hadoop group and user
    sudo addgroup hadoop
    sudo adduser --ingroup hadoop hadoop
    Log back in to Linux as the hadoop user.
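    addgroup and adduser are Debian-style commands; on CentOS the same result would presumably be achieved with (a sketch, options assumed):
    # assumed CentOS equivalents of the commands above
    sudo groupadd hadoop
    sudo useradd -g hadoop -m hadoop
    sudo passwd hadoop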
  2. Unpack Hadoop into the directory /home/hadoop/local/opt
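    For example, assuming the downloaded archive hadoop-2.7.3.tar.gz sits in the current directory:
    mkdir -p /home/hadoop/local/opt
    tar -xzf hadoop-2.7.3.tar.gz -C /home/hadoop/local/opt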
  3. Configure the Hadoop environment variables
    export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64/
    export PATH=$PATH:$JAVA_HOME/bin
    export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

    export HADOOP_HOME=$HOME/local/opt/hadoop-2.7.3
    export HADOOP_HDFS_HOME=$HADOOP_HOME
    export HADOOP_MAPRED_HOME=$HADOOP_HOME
    export HADOOP_YARN_HOME=$HADOOP_HOME
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
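    These export lines are typically appended to the hadoop user's ~/.bashrc (location assumed) and reloaded before continuing:
    # reload the profile and verify that the hadoop command is on the PATH
    source ~/.bashrc
    hadoop version
    If the daemons are later started over ssh, it may also be necessary to hard-code JAVA_HOME in etc/hadoop/hadoop-env.sh, since the login-shell environment is not always propagated to remotely started processes.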
  4. Go into the hadoop-2.7.3/etc/hadoop directory and edit core-site.xml
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
      </property>
      <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
      </property>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/local/var/hadoop/tmp</value>
      </property>
    </configuration>
  5. Edit hdfs-site.xml
    <configuration>
      <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master:9001</value>
        <description>View HDFS status through the web interface</description>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/hadoop/local/var/hadoop/hdfs/namenode</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/hadoop/local/var/hadoop/hdfs/datanode</value>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
        <description>Keep 1 replica of each block</description>
      </property>
      <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
      </property>
    </configuration>
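    The local directories referenced above can optionally be created up front so that ownership and permissions are correct before the daemons start (paths taken from the configuration above):
    mkdir -p /home/hadoop/local/var/hadoop/tmp
    mkdir -p /home/hadoop/local/var/hadoop/hdfs/namenode
    mkdir -p /home/hadoop/local/var/hadoop/hdfs/datanode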
  6. Edit mapred-site.xml
    This file configures MapReduce jobs. Since Hadoop 2.x uses the YARN framework, mapreduce.framework.name must be set to yarn for a distributed deployment. (mapred.map.tasks and mapred.reduce.tasks set the number of map and reduce tasks, respectively.)
    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
      <property>
        <name>mapreduce.jobhistory.address</name>
        <value>master:10020</value>
      </property>
      <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:19888</value>
      </property>
    </configuration>
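    Note that the stock 2.7.3 distribution ships only mapred-site.xml.template in etc/hadoop; the file above can be created from it first:
    cp mapred-site.xml.template mapred-site.xml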
  7. Edit yarn-site.xml
    <configuration>
      <!-- Site specific YARN configuration properties -->
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
      <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8032</value>
      </property>
      <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8030</value>
      </property>
      <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8031</value>
      </property>
      <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>master:8033</value>
      </property>
      <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>master:8088</value>
      </property>
      <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
      </property>
    </configuration>
  8. Edit the slaves file (etc/hadoop/slaves), listing the hostnames of the slave nodes, one per line
    slave1
    slave2
  9. Edit the /etc/hosts file so that every node's hostname resolves to its IP address.
    127.0.0.1 localhost
    172.168.170.84 master
    172.168.170.88 slave1
    172.168.170.89 slave2
  10. Set up passwordless ssh login between the nodes
    Generate a key pair on the master node and append the public key to .ssh/authorized_keys.
    ssh-keygen -t rsa
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    Then send the /etc/hosts file and the .ssh/authorized_keys file from master to slave1 and slave2 (writing to /etc/hosts on the target hosts requires root privileges).
    scp /etc/hosts hadoop@slave1:/etc/hosts
    scp /home/hadoop/.ssh/authorized_keys hadoop@slave1:/home/hadoop/.ssh/authorized_keys
    scp /home/hadoop/.ssh/authorized_keys hadoop@slave2:/home/hadoop/.ssh/authorized_keys
    Once that is done, passwordless login can be tested with:
    ssh slave1
    ssh slave2
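    If the login still prompts for a password, the usual cause is file permissions: with sshd's default StrictModes setting, the key files must not be group- or world-writable.
    # tighten permissions on every node (a common fix; shown as an assumption about the failure mode)
    chmod 700 ~/.ssh
    chmod 600 ~/.ssh/authorized_keys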
  11. Copy the hadoop-2.7.3 directory to slave1 and slave2
    scp -r /home/hadoop/local/opt/hadoop-2.7.3 hadoop@slave1:/home/hadoop/local/opt/
    scp -r /home/hadoop/local/opt/hadoop-2.7.3 hadoop@slave2:/home/hadoop/local/opt/
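    The environment variables from step 3 also have to be present on slave1 and slave2. One way to do that (assuming they were placed in ~/.bashrc as suggested above) is to copy the profile as well:
    scp ~/.bashrc hadoop@slave1:~/
    scp ~/.bashrc hadoop@slave2:~/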

Starting Hadoop

  1. On the master node, format the NameNode as the hadoop user
    hdfs namenode -format
    # Watch the console output; "Exiting with status 0" means formatting succeeded.
    # On errors, delete the temporary files under the var directory first, then rerun the command.
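    A minimal sketch of that recovery step, assuming all local state lives under /home/hadoop/local/var/hadoop as configured above:
    # wipe previous NameNode/DataNode/tmp state, then format again
    rm -rf /home/hadoop/local/var/hadoop
    hdfs namenode -format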
  2. Start Hadoop
    # start HDFS
    start-dfs.sh
    # start the YARN distributed computing framework
    start-yarn.sh
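    The JobHistoryServer that appears in the jps output below is not started by start-yarn.sh; if it is wanted, it can be launched separately (the daemon script ships in the sbin directory of the 2.7.3 distribution):
    # optional: start the MapReduce JobHistory server
    mr-jobhistory-daemon.sh start historyserver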
  3. Check the cluster processes with the jps command
    On the master node:
    Jps
    NameNode
    ResourceManager
    SecondaryNameNode
    JobHistoryServer
    On the slave nodes:
    Jps
    DataNode
    NodeManager
  4. Check the cluster status at the following URLs (the NameNode and ResourceManager web UIs):
    http://172.168.170.84:50070
    http://172.168.170.84:8088
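As a final optional check (the examples jar path is assumed from the standard 2.7.3 layout), HDFS capacity can be queried and the bundled pi example submitted to the cluster:

    # report DataNode capacity and run a small MapReduce job end to end
    hdfs dfsadmin -report
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 2 10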