
Setting Up a Hadoop Cluster

Environment Preparation

Virtual machine: VMware Fusion 12 (used to install the OS; VMware Fusion assigns a fixed IP address at install time and uses NAT for network access, which saves a lot of time)
OS: CentOS 7 (I originally wanted CentOS 6 for its smaller footprint, but the CentOS 6 repositories are no longer maintained, which makes it painful to use; CentOS 7 is maintained until 2024)
Hadoop version: 2.7.7
JDK: jdk-8u291-linux-x64.rpm
Spark version: spark-3.0.3-bin-hadoop2.7
Scala version: scala-2.13.6
Hive version: apache-hive-2.3.9
MySQL version: mysql-community-5.7

Installing Java

First, install the downloaded JDK with the rpm command:

rpm -ivh jdk-8u291-linux-x64.rpm

The package installs Java under /usr. Next, configure the Java environment variables:

vim ~/.bashrc

export JAVA_HOME=/usr/java/latest
export PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/sbin

source ~/.bashrc
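
As a quick sanity check (optional), these commands should show the newly installed JDK on the PATH:

java -version
which java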

Setting Up Passwordless SSH Login

Once the previous step is done, you can clone the whole system and avoid installing Java over and over. Use VMware Fusion's clone feature to clone two more virtual machines.

Run the following on each machine:

ssh-keygen
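
If you prefer to skip the prompts entirely, ssh-keygen can also be run non-interactively (a small shortcut; -N "" sets an empty passphrase):

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa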

Just press Enter through all the prompts (or use the non-interactive form above). Then change the hostname on each machine (pick any names you like, as long as you can tell them apart):

On the first machine:
vim /etc/hostname

master

On the second machine:
vim /etc/hostname

slave01

On the third machine:
vim /etc/hostname

slave02
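
On CentOS 7 the same thing can be done with hostnamectl instead of editing the file, for example (run the matching command on each machine):

hostnamectl set-hostname master     # on the first machine
hostnamectl set-hostname slave01    # on the second machine
hostnamectl set-hostname slave02    # on the third machine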

Then edit the hosts file (-.-.-.- stands for each machine's IP address):

vim /etc/hosts

-.-.-.- master
-.-.-.- slave01
-.-.-.- slave02

Then run ssh-copy-id on each machine to set up passwordless access:

ssh-copy-id master
ssh-copy-id slave01
ssh-copy-id slave02

Then try logging in with the ssh command to confirm it works.
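
For example, a small loop from master double-checks that none of the hosts still asks for a password (hostnames as configured above):

for h in master slave01 slave02; do ssh $h hostname; done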

Installing Hadoop

Extract the Hadoop tarball with tar and move it to /opt:

tar -zxvf hadoop-2.7.7.tar.gz 
mv hadoop-2.7.7 /opt

Edit core-site.xml:

<configuration>
        <!-- Put the namenode on master; otherwise it may fail to start -->
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://master:9000</value>
        </property>
        <property>
                <name>hadoop.tmp.dir</name>
                <value>/opt/hadoop-2.7.7/data/tmp</value>
        </property>
        <!-- Used to initialize the hiveserver2 login credentials -->
        <!-- "root" can be replaced with another user name -->
        <property>
                <name>hadoop.proxyuser.root.hosts</name>
                <value>*</value>
        </property>
        <property>
                <name>hadoop.proxyuser.root.groups</name>
                <value>*</value>
        </property>
        <!-- Allow browsing HDFS directories and files from the web UI -->
        <property>
                <name>hadoop.http.staticuser.user</name>
                <value>root</value>
        </property>
        <property>
                <name>dfs.permissions.enabled</name>
                <value>false</value>
        </property>
</configuration>

Edit hadoop-env.sh: find the export JAVA_HOME line and change it to

export JAVA_HOME=/usr/java/latest

Edit hdfs-site.xml:

<configuration>
        <!-- With three machines, two replicas are enough -->
        <property>
                <name>dfs.replication</name>
                <value>2</value>
        </property>
        <property>
                <name>dfs.namenode.secondary.http-address</name>
                <value>slave01:50090</value>
        </property>
</configuration>

Edit mapred-site.xml:

<configuration>
	<property>
		<name>mapreduce.framework.name</name>
		<value>yarn</value>
	</property>
</configuration>

Edit yarn-site.xml:

<configuration>

<!-- Site specific YARN configuration properties -->
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
        <!-- Specify where the resourcemanager runs -->
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>slave01</value>
        </property>
</configuration>
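
One step not shown above that start-dfs.sh/start-yarn.sh rely on is the list of worker hosts in etc/hadoop/slaves. Assuming the DataNodes and NodeManagers should run on the two slaves (add master as well if it should also store blocks):

vim /opt/hadoop-2.7.7/etc/hadoop/slaves

slave01
slave02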

Distribute Hadoop to the other machines:

scp -r /opt/hadoop-2.7.7 root@slave01:/opt
scp -r /opt/hadoop-2.7.7 root@slave02:/opt

Configure the Hadoop environment variables (in ~/.bashrc):

export HADOOP_HOME=/opt/hadoop-2.7.7
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
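
After sourcing ~/.bashrc, a quick check that the Hadoop binaries are on the PATH:

source ~/.bashrc
hadoop version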

Now we can happily start Hadoop. First, format the namenode:

hdfs namenode -format

Then run the start-all.sh script. When it finishes, run jps to check whether the NameNode, SecondaryNameNode, DataNode, and ResourceManager processes have started:

start-all.sh (this has now been split into two steps, start-dfs.sh and start-yarn.sh; I'm taking the lazy shortcut here)

jps

Check master:50070 in the web UI.

You can see your DataNodes there and check whether the master has detected them. At this point none will be detected, because of the firewall.

Disable the firewall:

systemctl stop firewalld

systemctl disable firewalld
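
The firewall has to be stopped on all three machines; from master the two slaves can be handled over ssh, for example:

for h in slave01 slave02; do
    ssh $h "systemctl stop firewalld && systemctl disable firewalld"
done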

Restart Hadoop (stop-all.sh, then start-all.sh again), and the DataNodes will now be detected.
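
The registered DataNodes can also be verified from the command line:

hdfs dfsadmin -report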

Installing Spark

tar -zxvf spark-3.0.3-bin-hadoop2.7.tgz -C /opt
tar -zxvf scala-2.13.6.tgz -C /opt

Edit spark-env.sh (located in /opt/spark-3.0.3-bin-hadoop2.7/conf):

export SCALA_HOME=/opt/scala-2.13.6
export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=/opt/hadoop-2.7.7
export HADOOP_CONF_DIR=/opt/hadoop-2.7.7/etc/hadoop
SPARK_MASTER_IP=master
SPARK_LOCAL_DIRS=/opt/spark-3.0.3-bin-hadoop2.7
SPARK_DRIVER_MEMORY=1G

Edit the slaves file (located in /opt/spark-3.0.3-bin-hadoop2.7/conf):

slave01
slave02

Configure the Spark environment variables:

vim ~/.bashrc

export SPARK_HOME=/opt/spark-3.0.3-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin

source ~/.bashrc

Also copy hive-site.xml (created in the Hive section below) into /opt/spark-3.0.3-bin-hadoop2.7/conf:

cp hive-site.xml /opt/spark-3.0.3-bin-hadoop2.7/conf

Distribute the files to slave01 and slave02 (being lazy here, only the commands for slave01 are shown):

scp ~/.bashrc root@slave01:~/.bashrc
scp -r /opt/spark-3.0.3-bin-hadoop2.7 root@slave01:/opt
scp -r /opt/scala-2.13.6 root@slave01:/opt

Start Spark:

cd /opt/spark-3.0.3-bin-hadoop2.7/sbin
./start-all.sh

Visit master:8080 in the web UI.
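
To confirm the cluster actually accepts jobs, you can submit the bundled SparkPi example (a quick sketch; the examples jar name below matches the spark-3.0.3-bin-hadoop2.7 distribution, adjust it if yours differs):

spark-submit --master spark://master:7077 \
    --class org.apache.spark.examples.SparkPi \
    $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.3.jar 10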

Installing Hive

tar -zxvf apache-hive-2.3.9-bin.tar.gz -C /opt

Edit hive-env.sh:

export JAVA_HOME=/usr/java/latest
export HIVE_HOME=/opt/apache-hive-2.3.9-bin
export HADOOP_HOME=/opt/hadoop-2.7.7

Edit hive-site.xml:

<configuration>
        <property>
                <name>javax.jdo.option.ConnectionURL</name>
                <value>jdbc:mysql://master:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionDriverName</name>
                <value>com.mysql.jdbc.Driver</value>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionUserName</name>
                <value>root</value>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionPassword</name>
                <value>123456</value>
        </property>

        <property>
                <name>hive.metastore.warehouse.dir</name>
                <value>/opt/apache-hive-2.3.9-bin/warehouse</value>
        </property>

        <property>
                <name>hive.exec.scratchdir</name>
                <value>tmp</value>
        </property>

        <property>
                <name>hive.querylog.location</name>
                <value>/opt/apache-hive-2.3.9-bin/log</value>
        </property>

        <property>
                <name>hive.metastore.schema.verification</name>
                <value>false</value>
        </property>

        <property>
                <name>hive.cli.print.current.db</name>
                <value>true</value>
        </property>

        <property>
                <name>hive.cli.print.header</name>
                <value>true</value>
        </property>

        <property>
                <name>hive.groupby.skewindata</name>
                <value>true</value>
        </property>

        <property>
                <name>hbase.zookeeper.quorum</name>
                <value>master:2181,slave01:2181,slave02:2181</value>
        </property>

        <!-- hiveserver2 settings -->
        <property>
                <name>hive.server2.thrift.port</name>
                <value>10000</value>
        </property>
        <property>
                <name>hive.server2.thrift.bind.host</name>
                <value>master</value>
        </property>
        <property>
                <name>hive.server2.thrift.client.user</name>
                <value>root</value>
        </property>
        <property>
                <name>hive.server2.thrift.client.password</name>
                <value>123456</value>
        </property>
        <property>
                <name>datanucleus.schema.autoCreateAll</name>
                <value>true</value>
        </property>
</configuration>

Installing MySQL to Store the Hive Metadata

yum install mysql-community-server

service mysqld start

grep "password" /var/log/mysqld.log    # get the temporary MySQL root password
mysql -uroot -p

Change the MySQL access privileges:

grant all privileges on *.* to root@"%" identified by "***";
grant all privileges on *.* to root@"localhost" identified by "***";
grant all on *.* to 'root'@'master' identified by '***';
flush privileges;
exit;

Set MySQL to start on boot:
systemctl enable mysqld

Distribute the mysql-connector driver and hive-site.xml to the Hive and Spark directories:

cp mysql-connector-java-5.1.47-bin.jar /opt/apache-hive-2.3.9-bin/lib

Distribute the connector to spark/jars on master, slave01, and slave02:
cp /opt/apache-hive-2.3.9-bin/lib/mysql-connector-java-5.1.47-bin.jar /opt/spark-3.0.3-bin-hadoop2.7/jars
scp  /opt/apache-hive-2.3.9-bin/lib/mysql-connector-java-5.1.47-bin.jar root@slave01:/opt/spark-3.0.3-bin-hadoop2.7/jars
scp  /opt/apache-hive-2.3.9-bin/lib/mysql-connector-java-5.1.47-bin.jar root@slave02:/opt/spark-3.0.3-bin-hadoop2.7/jars

Distribute hive-site.xml to spark/conf:
cp  /opt/apache-hive-2.3.9-bin/conf/hive-site.xml /opt/spark-3.0.3-bin-hadoop2.7/conf
scp /opt/apache-hive-2.3.9-bin/conf/hive-site.xml root@slave01:/opt/spark-3.0.3-bin-hadoop2.7/conf
scp /opt/apache-hive-2.3.9-bin/conf/hive-site.xml root@slave02:/opt/spark-3.0.3-bin-hadoop2.7/conf
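
Before the first start, the metastore schema can also be initialized explicitly with schematool. This is optional here because datanucleus.schema.autoCreateAll is set to true above, and it assumes Hive's bin directory is on the PATH (the post does not show adding it, so set it the same way as HADOOP_HOME if needed):

export HIVE_HOME=/opt/apache-hive-2.3.9-bin
export PATH=$PATH:$HIVE_HOME/bin

schematool -dbType mysql -initSchema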

Starting Hive and HiveServer2

hive                        --> this creates the hive database in MySQL

hive --service metastore    --> initializes the metastore metadata
hive --service hiveserver2  --> starts the hiveserver2 service (these two commands are long-running services; you can use nohup to keep them running in the background)
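
For example, to keep both services in the background (the log file locations are just a choice):

nohup hive --service metastore   > /tmp/hive-metastore.log 2>&1 &
nohup hive --service hiveserver2 > /tmp/hiveserver2.log    2>&1 &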
Check whether hiveserver2 is up:
netstat -nl | grep 10000

Connect to the Hive database:
beeline

!connect jdbc:hive2://master:10000
Enter the user name and password configured in hive-site.xml.
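
Equivalently, the connection can be made in one line, with the user and password from hive-site.xml above:

beeline -u jdbc:hive2://master:10000 -n root -p 123456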

References
1. https://stackoverflow.com/questions/43180305/cannot-connect-to-hive-using-beeline-user-root-cannot-impersonate-anonymous