Production environment versions
jdk-1.7.0_71
SCALA-2.11.8
ZOOKEEPER-3.4.6
HADOOP-2.6.0
SPARK-2.1.0
HIVE-1.2.1
HBASE-1.0.2
mysql-5.6.33
kafka_2.10-0.8.2.0
Cluster setup
I. Server preparation
1. Mount the data disk (as root)
Data disk device names are assigned by the system. On I/O-optimized instances they start at /dev/vdb and increase through /dev/vdz. If the device name looks like /dev/xvd* (where * is any letter from a to z), the instance is not I/O-optimized.
Check for the data disk
After running the command below, if /dev/vdb does not appear, the instance has no data disk; confirm that a data disk is actually attached.
fdisk -l
Partition the data disk (a single partition is usually enough)
fdisk -u /dev/vdb
  p — show the current partition table
  n — create a new partition
  p — choose "primary" as the partition type
  Enter the partition number: since only one partition is created, enter 1.
  First sector: press Enter to accept the default of 2048.
  Last sector: since only one partition is created, press Enter to accept the default.
  p — review the planned partition layout
  w — write the partition table and exit
Check the new partition
fdisk -lu /dev/vdb
-----------------------------------------------------------
Disk /dev/vdb: 21.5 GB, 21474836480 bytes, 41943040 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x3e60020e
   Device Boot      Start         End      Blocks   Id  System
/dev/vdb1            2048    41943039    20970496   83  Linux
Create a file system on the new partition
If the disk needs to be shared between Linux, Windows, and macOS, a VFAT file system can be created with mkfs.vfat instead.
mkfs.ext4 /dev/vdb1
Back up the /etc/fstab file
cp /etc/fstab /etc/fstab.bak
Append the new partition to /etc/fstab (the mount point /app matches the mount step below)
echo "/dev/vdb1 /app ext4 defaults 0 0" >> /etc/fstab
Verify the new entry
cat /etc/fstab
Mount the file system
mkdir -p /app   # create the mount point if it does not exist yet
mount /dev/vdb1 /app
# To unmount the file system later: umount /app
Check disk usage
If the new file system appears in the output, the mount succeeded.
df -h
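Before relying on a reboot, the fstab entry can be validated directly; a quick check, assuming the entry added above:
umount /app          # unmount so that mount -a re-reads /etc/fstab
mount -a             # mounts everything listed in /etc/fstab; an error here means the entry is wrong
df -h | grep vdb1    # the partition should be mounted on /app again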
2. Create the user (run as root)
# Create the working user and its home directory; the later steps assume user hadoop with home /app/hadoop
useradd -d /app/hadoop -m hadoop
passwd hadoop
3. Change the hostname (run as root)
vim /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=hadoop03
4. Edit the hosts file (run as root)
vim /etc/hosts
10.0.0.99 hadoop01
10.0.0.100 hadoop02
10.0.0.101 hadoop03
10.0.0.102 hadoop04
After modifying the files above, reboot the server.
5. Configure passwordless SSH login
Generate an RSA key pair
ssh-keygen -t rsa
Copy the public key to the other servers
Collect every server's public key into authorized_keys on one machine (including the local machine's own key), then copy that file to the other servers.
scp .ssh/id_rsa.pub hadoop@hadoop02:/app/hadoop/id_rsa.pub
cat id_rsa.pub >> ~/.ssh/authorized_keys
Fix the permissions of the .ssh directory
chmod -R 700 ~/.ssh
Alternative method
ssh-copy-id -i ~/.ssh/id_rsa.pub app@192.168.1.233
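With four nodes the key exchange can also be scripted; a minimal sketch, assuming the hadoop user and the hostnames defined in /etc/hosts above:
# run once on every node: push this node's public key to all nodes (including itself)
for host in hadoop01 hadoop02 hadoop03 hadoop04; do
  ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@"$host"
done
# verify that login no longer prompts for a password
ssh hadoop@hadoop02 hostname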
II. JDK and Scala environment setup
1. Copy the JDK and Scala packages to the server and extract them
scp jdk-1.7.0_71.tar hadoop@10.0.0.99:/app/java
tar -xvf jdk-1.7.0_71.tar
2. Configure environment variables
vim /etc/profile
# add the following
export JAVA_HOME=/app/java/jdk1.7.0_71
export JRE_HOME=/app/java/jdk1.7.0_71/jre
export SCALA_HOME=/app/scala/scala-2.11.8
export CLASSPATH=$CLASSPATH:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$SCALA_HOME/bin:$PATH
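After saving /etc/profile, reload it and confirm that the tools resolve:
source /etc/profile
java -version      # should report 1.7.0_71
scala -version     # should report 2.11.8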
III. ZooKeeper cluster setup
1. Download and extract the ZooKeeper package
tar -xvf zookeeper-3.4.6.tar
2. Create the data and logs directories
mkdir data
mkdir logs
3. Configure zoo.cfg
cp zoo_sample.cfg zoo.cfg
# edit zoo.cfg as follows
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataLogDir=/app/hadoop/zookeeper3.4.6/logs
dataDir=/app/hadoop/zookeeper3.4.6/data
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
maxClientCnxns=500
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
autopurge.purgeInterval=24
server.1=hadoop01:2888:3888
server.2=hadoop02:2888:3888
server.3=hadoop03:2888:3888
# minimum session timeout (ms)
minSessionTimeout=4000
# maximum session timeout (ms)
maxSessionTimeout=100000
4. Create the data/myid file
echo 1 > data/myid   # run from the ZooKeeper root; 1 is the id for hadoop01 (server.1)
5. Copy the files to the other servers
scp -r zookeeper3.4.6/ hadoop@hadoop02:/app/hadoop/
6. Edit the myid file on each server
Set it to the n value of that server's server.n entry in zoo.cfg, e.g. as shown below.
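A sketch of setting myid on each node, assuming ZooKeeper is installed at /app/hadoop/zookeeper3.4.6 on every server (matching the dataDir above):
# hadoop01 already holds 1; hadoop02 and hadoop03 map to server.2 and server.3
ssh hadoop@hadoop02 'echo 2 > /app/hadoop/zookeeper3.4.6/data/myid'
ssh hadoop@hadoop03 'echo 3 > /app/hadoop/zookeeper3.4.6/data/myid'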
7. Start ZooKeeper
./zkServer.sh start
# The cluster status only reports healthy once all nodes in the ensemble have been started.
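Once all nodes are up, each node's role can be checked with:
./zkServer.sh status
# expected: one node reports "Mode: leader", the other two report "Mode: follower"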
8. Start on boot
# TODO
9. Why an odd number of servers
Fault tolerance
Create/update/delete operations require acknowledgement from more than half of the servers, so consider the following cases:
2 servers: at least 2 must be up (half of 2 is 1, so a majority is at least 2); no failure can be tolerated.
3 servers: at least 2 must be up (half of 3 is 1.5, so a majority is at least 2); 1 failure can be tolerated.
4 servers: at least 3 must be up (half of 4 is 2, so a majority is at least 3); 1 failure can be tolerated.
5 servers: at least 3 must be up (half of 5 is 2.5, so a majority is at least 3); 2 failures can be tolerated.
6 servers: at least 4 must be up (half of 6 is 3, so a majority is at least 4); 2 failures can be tolerated.
So 3 and 4 servers both tolerate at most 1 failure, and 5 and 6 servers both tolerate at most 2 failures,
yet a 4-server cluster clearly costs more than a 3-server one, and 6 more than 5. This follows directly from the majority-vote requirement.
Split-brain prevention
A ZooKeeper ensemble may contain multiple followers and observers, but must have exactly one leader.
If the leader fails, the remaining servers elect a new leader by majority vote.
Network partition scenarios:
A 3-server cluster, all healthy, with 1 server partitioned away from the other 2: the 2 connected servers form a majority and can elect a leader.
A 4-server cluster, all healthy, with 2 servers partitioned away from the other 2: neither side reaches the required majority of 3, so no leader can be elected and the cluster cannot serve.
A 5-server cluster, all healthy, with 2 servers partitioned away from the other 3: the 3 connected servers form a majority and can elect a leader.
A 6-server cluster, all healthy, with 3 servers partitioned away from the other 3: neither side reaches the required majority of 4, so no leader can be elected and the cluster cannot serve.
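The same rule as a quick calculation, where the quorum is the smallest strict majority, floor(n/2) + 1:
for n in 2 3 4 5 6; do
  quorum=$(( n / 2 + 1 ))    # smallest strict majority
  echo "$n servers: quorum=$quorum, tolerates $(( n - quorum )) failure(s)"
done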
IV. Hadoop cluster setup
1. Download the package and copy it to the servers
scp hadoop-2.6.0.tar.gz hadoop@10.0.0.99:/app/hadoop
2. Create the data directories
mkdir -p /app/data/hadoop/dfs/tmp
mkdir -p /app/data/hadoop/dfs/data
mkdir -p /app/data/hadoop/dfs/journal
mkdir -p /app/data/hadoop/dfs/name
3. Edit hadoop-env.sh
# set the Java installation path
export JAVA_HOME=/app/java/jdk1.7.0_71
4. Edit core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<!-- tmp directory -->
<name>hadoop.tmp.dir</name>
<value>/app/data/hadoop/dfs/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-cluster1</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>4096</value>
</property>
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.BZip2Codec,
org.apache.hadoop.io.compress.SnappyCodec
</value>
</property>
<!-- ZooKeeper quorum -->
<property>
<name>ha.zookeeper.quorum</name>
<value>hadoop01:2181,hadoop02:2181,hadoop03:2181</value>
</property>
</configuration>
5. Edit hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop01:9001</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/app/data/hadoop/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/app/data/hadoop/dfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>hadoop-cluster1</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.hosts.exclude</name>
<value>/app/hadoop/hadoop-2.6.0/etc/hadoop/excludes</value>
</property>
<property>
<name>dfs.ha.namenodes.hadoop-cluster1</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.hadoop-cluster1.nn1</name>
<value>hadoop01:9000</value>
</property>
<property>
<name>dfs.namenode.http-address.hadoop-cluster1.nn1</name>
<value>hadoop01:50070</value>
</property>
<property>
<name>dfs.namenode.rpc-address.hadoop-cluster1.nn2</name>
<value>hadoop02:9000</value>
</property>
<property>
<name>dfs.namenode.http-address.hadoop-cluster1.nn2</name>
<value>hadoop02:50070</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://hadoop02:8485;hadoop03:8485;hadoop04:8485/hadoop-cluster1</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.hadoop-cluster1</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/app/hadoop/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/app/data/hadoop/dfs/journal</value>
</property>
<!-- enable automatic failover when the active NameNode fails -->
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
</configuration>
6. Edit mapred-site.xml
cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
vim mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop01:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop01:19888</value>
</property>
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
</configuration>
7. Edit yarn-site.xml
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop01:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoop01:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoop01:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>hadoop01:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hadoop01:8088</value>
</property>
</configuration>
8. Update environment variables
vim /etc/profile
export HADOOP_COMMON_HOME=/app/hadoop/hadoop-2.6.0
export HADOOP_HOME=/app/hadoop/hadoop-2.6.0
export HADOOP_CONF_DIR=/app/hadoop/hadoop-2.6.0/etc/hadoop
export YARN_CONF_DIR=/app/hadoop/hadoop-2.6.0/etc/hadoop
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
export YARN_LOG_DIR=$HADOOP_LOG_DIR
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export CLASSPATH=$CLASSPATH:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$SCALA_HOME/bin:$MAVEN_HOME/bin:$ZK_HOME/bin:$HADOOP_COMMON_HOME/bin:$HADOOP_COMMON_HOME/sbin:$PATH
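Reload the profile and make sure the Hadoop binaries resolve:
source /etc/profile
hadoop version        # should print Hadoop 2.6.0
which hdfs yarn       # both should resolve under /app/hadoop/hadoop-2.6.0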
9. Distribute the Hadoop package to the other nodes
scp -r hadoop-2.6.0 hadoop@hadoop02:/app/hadoop/
scp -r hadoop-2.6.0 hadoop@hadoop03:/app/hadoop/
scp -r hadoop-2.6.0 hadoop@hadoop04:/app/hadoop/
10. Start the JournalNodes (on hadoop02, hadoop03, and hadoop04)
./sbin/hadoop-daemon.sh start journalnode
11. Format the ZKFC state in ZooKeeper
./bin/hdfs zkfc -formatZK
12. Format the NameNode
./bin/hdfs namenode -format
13. Start the DataNodes
./sbin/hadoop-daemon.sh start datanode
14. Start the NameNodes
# on namenode1 (hadoop01)
./sbin/hadoop-daemon.sh start namenode
# on namenode2 (hadoop02): sync the metadata from the active NameNode, then start
./bin/hdfs namenode -bootstrapStandby
./sbin/hadoop-daemon.sh start namenode
15. Start YARN
./start-yarn.sh
16. Check the cluster status, for example with the commands below.
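A few quick checks, run from $HADOOP_HOME (nn1 and nn2 are the NameNode IDs defined in hdfs-site.xml):
jps                                        # list the Java daemons running on each node
./bin/hdfs haadmin -getServiceState nn1    # expect "active" or "standby"
./bin/hdfs haadmin -getServiceState nn2
./bin/hdfs dfsadmin -report                # DataNode and capacity summary
./bin/yarn node -list                      # NodeManagers registered with the ResourceManager
The web UIs configured above (hadoop01:50070 for HDFS, hadoop01:8088 for YARN) show the same information.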
17. Open issues
Snappy native library support.
# install the gcc toolchain
yum install -y gcc-c++
V. Kafka cluster setup
1. Download the Kafka package and upload it to the servers
http://kafka.apache.org/downloads.html
2. Extract the tarball
tar -xvf kafka_2.10-0.8.2.0.tgz
3. Edit the configuration file
server.properties
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# see kafka.server.KafkaConfig for additional details and defaults
############################# Server Basics #############################
# The id of the broker. This must be set to a unique integer for each broker.
broker.id=1
############################# Socket Server Settings #############################
# The port the socket server listens on
port=9092
# Hostname the broker will bind to. If not set, the server will bind to all interfaces
#host.name=localhost
# Hostname the broker will advertise to producers and consumers. If not set, it uses the
# value for "host.name" if configured. Otherwise, it will use the value returned from
# java.net.InetAddress.getCanonicalHostName().
#advertised.host.name=<hostname routable by clients>
# The port to publish to ZooKeeper for clients to use. If this is not set,
# it will publish the same port that the broker binds to.
#advertised.port=<port accessible by clients>
# The number of threads handling network requests
num.network.threads=3
# The number of threads doing disk I/O
num.io.threads=8
# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=102400
# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=102400
# The maximum size of a request that the socket server will accept (protection against OOM)
socket.request.max.bytes=104857600
############################# Log Basics #############################
# A comma seperated list of directories under which to store log files
log.dirs=/app/hadoop/kafka_2.10-0.8.2.0/logs
# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=6
# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
# This value is recommended to be increased for installations with data dirs located in RAID array.
num.recovery.threads.per.data.dir=1
############################# Log Flush Policy #############################
# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk.
# There are a few important trade-offs here:
# 1. Durability: Unflushed data may be lost if you are not using replication.
# 2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
# 3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to exceessive seeks.
# The settings below allow one to configure the flush policy to flush data after a period of time or
# every N messages (or both). This can be done globally and overridden on a per-topic basis.
# The number of messages to accept before forcing a flush of data to disk
#log.flush.interval.messages=10000
# The maximum amount of time a message can sit in a log before we force a flush
#log.flush.interval.ms=1000
############################# Log Retention Policy #############################
# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.
# The minimum age of a log file to be eligible for deletion
log.retention.hours=48
# A size-based retention policy for logs. Segments are pruned from the log as long as the remaining
# segments don't drop below log.retention.bytes.
#log.retention.bytes=1073741824
# The maximum size of a log segment file. When this size is reached a new log segment will be created.
log.segment.bytes=1073741824
# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies
log.retention.check.interval.ms=300000
# By default the log cleaner is disabled and the log retention policy will default to just delete segments after their retention expires.
# If log.cleaner.enable=true is set the cleaner will be enabled and individual logs can then be marked for log compaction.
log.cleaner.enable=false
############################# Zookeeper #############################
# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=hadoop01:2181,hadoop02:2181,hadoop03:2181
# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000
4. Copy the package to the other nodes (remember to give each broker a unique broker.id in server.properties)
scp -r kafka_2.10-0.8.2.0 hadoop@hadoop03:/app/hadoop
5. Start Kafka on each server
./kafka-server-start.sh ../config/server.properties &
6. Verify the cluster
# create a test topic
./kafka-topics.sh --create --zookeeper hadoop01:2181 --replication-factor 3 --partitions 1 --topic wxtest
# on hadoop04, start a console producer that sends data via broker hadoop03
./kafka-console-producer.sh --broker-list hadoop03:9092 --topic wxtest
test for hadoop03
# stop the Kafka broker on hadoop03
./kafka-server-stop.sh
# start a consumer on hadoop02 and check that it receives the data produced from hadoop04
./kafka-console-consumer.sh --zookeeper hadoop01:2181 --topic wxtest --from-beginning
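To see where partition leadership moved after stopping the broker on hadoop03, the topic metadata can be inspected:
# the Leader and Isr columns show which brokers currently serve the partition
./kafka-topics.sh --describe --zookeeper hadoop01:2181 --topic wxtest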
7. Open issues
After Kafka is started, the ZooKeeper instance on hadoop02 goes down (still to be resolved).
VI. Install MySQL
./mysql_install_db --verbose --user=hadoop --defaults-file=/app/hadoop/mysql-5.6.33-linux-glibc2.5-x86_64/my.cnf --datadir=/app/data/mysql/data/ --basedir=/app/hadoop/mysql-5.6.33-linux-glibc2.5-x86_64 --pid-file=/app/data/mysql/data/mysql.pid --tmpdir=/app/data/mysql/tmp
cp support-files/mysql.server /etc/init.d/mysql
./mysqld_safe --defaults-file=/etc/my.cnf --socket=/app/data/mysql/tmp/mysql.sock --user=hadoop
./mysql -h localhost -S /app/data/mysql/tmp/mysql.sock -u root -p
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'rootbqs123' WITH GRANT OPTION;
create database hive;
alter database hive character set latin1;
VII. Hive setup
1. Download the Hive package, upload it to the server, and extract it
apache-hive-1.2.1-bin.tar.gz
http://hive.apache.org/downloads.html
tar -xvf apache-hive-1.2.1-bin.tar.gz
2. Edit hive-env.sh
Note: hive-env.sh does not exist initially; copy it from hive-env.sh.template.
cp hive-env.sh.template hive-env.sh
vim hive-env.sh
3. Edit hive-site.xml (a minimal sketch follows)
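hive-site.xml also has to be created by hand; a minimal sketch using the MySQL metastore prepared in the previous section (the MySQL host hadoop01, the hive/hive credentials, and the warehouse path are assumptions to adapt):
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hadoop01:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <!-- placeholder credentials: replace with the real MySQL account -->
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hadoop01:9083</value>
  </property>
</configuration>
The MySQL JDBC driver jar (mysql-connector-java) also needs to be copied into Hive's lib directory.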
4. Start the Hive metastore
# if -p is not specified, the default port 9083 is used
hive --service metastore -p <port_num>
# on the client, start the CLI with the hive command
hive
5. Notes
# Startup error: /tmp/hive on HDFS should be writable. Current permissions are: rwx--x--x
# The current user cannot write to that HDFS path; fix the permissions:
hadoop fs -chmod -R 777 /tmp
VIII. HBase setup
HBase is deployed as a distributed cluster with the following node layout:
| hadoop01 | hadoop02 | hadoop03 | hadoop04 |
|---|---|---|---|
| HMaster | HMaster (backup) | | |
| | regionserver | regionserver | regionserver |
1. Upload hbase-1.0.2-bin.tar.gz to the servers
Download: http://archive.apache.org/dist/hbase/hbase-1.0.2/
tar -xvf hbase-1.0.2-bin.tar.gz
2. Edit hbase-env.sh
#
#/**
# * Licensed to the Apache Software Foundation (ASF) under one
# * or more contributor license agreements. See the NOTICE file
# * distributed with this work for additional information
# * regarding copyright ownership. The ASF licenses this file
# * to you under the Apache License, Version 2.0 (the
# * "License"); you may not use this file except in compliance
# * with the License. You may obtain a copy of the License at
# *
# * http://www.apache.org/licenses/LICENSE-2.0
# *
# * Unless required by applicable law or agreed to in writing, software
# * distributed under the License is distributed on an "AS IS" BASIS,
# * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# * See the License for the specific language governing permissions and
# * limitations under the License.
# */
# Set environment variables here.
# This script sets variables multiple times over the course of starting an hbase process,
# so try to keep things idempotent unless you want to take an even deeper look
# into the startup scripts (bin/hbase, etc.)
# The java implementation to use. Java 1.7+ required.
export JAVA_HOME=/app/java/jdk1.7.0_71/
# Extra Java CLASSPATH elements. Optional.
# export HBASE_CLASSPATH=
# The maximum amount of heap to use. Default is left to JVM default.
export HBASE_HEAPSIZE=1G
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native/
export HBASE_LIBRARY_PATH=$HBASE_LIBRARY_PATH:$HBASE_HOME/lib/native/
# Uncomment below if you intend to use off heap cache. For example, to allocate 8G of
# offheap, set the value to "8G".
# export HBASE_OFFHEAPSIZE=1G
# Extra Java runtime options.
# Below are what we set by default. May only work with SUN JVM.
# For more on why as well as other possible settings,
# see http://wiki.apache.org/hadoop/PerformanceTuning
export HBASE_OPTS="$HBASE_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSInitiatingOccupancyOnly"
# Uncomment one of the below three options to enable java garbage collection logging for the server-side processes.
# This enables basic gc logging to the .out file.
# export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
# This enables basic gc logging to its own file.
# If FILE-PATH is not replaced, the log file(.gc) would still be generated in the HBASE_LOG_DIR .
# export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH>"
# This enables basic GC logging to its own file with automatic log rolling. Only applies to jdk 1.6.0_34+ and 1.7.0_2+.
# If FILE-PATH is not replaced, the log file(.gc) would still be generated in the HBASE_LOG_DIR .
export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M"
# Uncomment one of the below three options to enable java garbage collection logging for the client processes.
# This enables basic gc logging to the .out file.
# export CLIENT_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
# This enables basic gc logging to its own file.
# If FILE-PATH is not replaced, the log file(.gc) would still be generated in the HBASE_LOG_DIR .
# export CLIENT_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH>"
# This enables basic GC logging to its own file with automatic log rolling. Only applies to jdk 1.6.0_34+ and 1.7.0_2+.
# If FILE-PATH is not replaced, the log file(.gc) would still be generated in the HBASE_LOG_DIR .
# export CLIENT_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M"
# See the package documentation for org.apache.hadoop.hbase.io.hfile for other configurations
# needed setting up off-heap block caching.
# Uncomment and adjust to enable JMX exporting
# See jmxremote.password and jmxremote.access in $JRE_HOME/lib/management to configure remote password access.
# More details at: http://java.sun.com/javase/6/docs/technotes/guides/management/agent.html
# NOTE: HBase provides an alternative JMX implementation to fix the random ports issue, please see JMX
# section in HBase Reference Guide for instructions.
# export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
# export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10101"
# export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10102"
# export HBASE_THRIFT_OPTS="$HBASE_THRIFT_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10103"
# export HBASE_ZOOKEEPER_OPTS="$HBASE_ZOOKEEPER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10104"
# export HBASE_REST_OPTS="$HBASE_REST_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10105"
# File naming hosts on which HRegionServers will run. $HBASE_HOME/conf/regionservers by default.
# export HBASE_REGIONSERVERS=${HBASE_HOME}/conf/regionservers
# Uncomment and adjust to keep all the Region Server pages mapped to be memory resident
#HBASE_REGIONSERVER_MLOCK=true
#HBASE_REGIONSERVER_UID="hbase"
# File naming hosts on which backup HMaster will run. $HBASE_HOME/conf/backup-masters by default.
export HBASE_BACKUP_MASTERS=${HBASE_HOME}/conf/backup-masters
# Extra ssh options. Empty by default.
# export HBASE_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HBASE_CONF_DIR"
# Where log files are stored. $HBASE_HOME/logs by default.
# export HBASE_LOG_DIR=${HBASE_HOME}/logs
# Enable remote JDWP debugging of major HBase processes. Meant for Core Developers
# export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8070"
# export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8071"
# export HBASE_THRIFT_OPTS="$HBASE_THRIFT_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8072"
# export HBASE_ZOOKEEPER_OPTS="$HBASE_ZOOKEEPER_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8073"
# A string representing this instance of hbase. $USER by default.
# export HBASE_IDENT_STRING=$USER
# The scheduling priority for daemon processes. See 'man nice'.
# export HBASE_NICENESS=10
# The directory where pid files are stored. /tmp by default.
export HBASE_PID_DIR=/app/hadoop/hbase-1.0.2/pids
# Seconds to sleep between slave commands. Unset by default. This
# can be useful in large clusters, where, e.g., slave rsyncs can
# otherwise arrive faster than the master can service them.
# export HBASE_SLAVE_SLEEP=0.1
# Tell HBase whether it should manage it's own instance of Zookeeper or not.
export HBASE_MANAGES_ZK=false
# The default log rolling policy is RFA, where the log file is rolled as per the size defined for the
# RFA appender. Please refer to the log4j.properties file to see more details on this appender.
# In case one needs to do log rolling on a date change, one should set the environment property
# HBASE_ROOT_LOGGER to "<DESIRED_LOG LEVEL>,DRFA".
# For example:
# HBASE_ROOT_LOGGER=INFO,DRFA
# The reason for changing default to RFA is to avoid the boundary case of filling out disk space as
# DRFA doesn't put any cap on the log size. Please refer to HBase-5655 for more context.
3. Edit hbase-site.xml (a minimal sketch follows)
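A minimal hbase-site.xml sketch for this layout, assuming HBase stores its data on the HA nameservice hadoop-cluster1 and uses the external ZooKeeper ensemble configured earlier:
<configuration>
  <property>
    <!-- store HBase data on the HA HDFS nameservice -->
    <name>hbase.rootdir</name>
    <value>hdfs://hadoop-cluster1/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>hadoop01,hadoop02,hadoop03</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>
Since hbase-env.sh above sets HBASE_MANAGES_ZK=false, HBase relies entirely on this external quorum.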
4. Edit the regionservers file
List the servers that will run region servers:
hadoop02
hadoop03
hadoop04
5. Create the backup-masters file and add the standby master
hadoop02
6. Link the Hadoop configuration files
ln -s /app/hadoop/hadoop-2.6.0/etc/hadoop/hdfs-site.xml /app/hadoop/hbase-1.0.2/conf/
ln -s /app/hadoop/hadoop-2.6.0/etc/hadoop/core-site.xml /app/hadoop/hbase-1.0.2/conf/
7. Copy the package to the other servers
scp -r /app/hadoop/hbase-1.0.2/ hadoop@hadoop02:/app/hadoop/
scp -r /app/hadoop/hbase-1.0.2/ hadoop@hadoop03:/app/hadoop/
scp -r /app/hadoop/hbase-1.0.2/ hadoop@hadoop04:/app/hadoop/
8. Configure environment variables
vim /etc/profile
# add the following
export HBASE_HOME=/app/hadoop/hbase-1.0.2
export PATH=$HBASE_HOME/bin:$PATH
9. Start the services
Option 1:
start-hbase.sh
# startup log:
starting master, logging to /app/hadoop/hbase-1.0.2/logs/hbase-hadoop-master-hadoop01.out
hadoop04: starting regionserver, logging to /app/hadoop/hbase-1.0.2/bin/../logs/hbase-hadoop-regionserver-hadoop04.out
hadoop02: starting regionserver, logging to /app/hadoop/hbase-1.0.2/bin/../logs/hbase-hadoop-regionserver-hadoop02.out
hadoop03: starting regionserver, logging to /app/hadoop/hbase-1.0.2/bin/../logs/hbase-hadoop-regionserver-hadoop03.out
hadoop02: starting master, logging to /app/hadoop/hbase-1.0.2/bin/../logs/hbase-hadoop-master-hadoop02.out
Option 2:
# start the master
hbase-daemon.sh start master
# start a region server
hbase-daemon.sh start regionserver
10. Issues
Startup exception:
java.lang.RuntimeException: Failed construction of Regionserver: class org.apache.hadoop.hbase.regionserver.HRegionServer
    at org.apache.hadoop.hbase.regionserver.HRegionServer.constructRegionServer(HRegionServer.java:2523)
    at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:64)
    at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:87)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2538)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.constructRegionServer(HRegionServer.java:2521)
    ... 5 more
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.hbase.ipc.PhoenixRpcSchedulerFactory not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1905)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.<init>(RSRpcServices.java:769)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.createRpcServices(HRegionServer.java:575)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:492)
    ... 10 more
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.hbase.ipc.PhoenixRpcSchedulerFactory not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1811)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1903)
    ... 13 more
Cause: hbase-site.xml contains Phoenix-related settings, but the corresponding Phoenix jars are missing from the lib directory.
Solutions:
1. Remove the Phoenix-related configuration, or
2. Copy the Phoenix jars into each node's lib directory.
Option 1 was used to get the cluster started; the Phoenix configuration will be added back later.
11. Integrate Hive with HBase
Hive/HBase integration is implemented through the two systems' public APIs, with the actual work done by the hive-hbase-handler-*.jar in Hive's lib directory, so it is enough to copy that hive-hbase-handler-*.jar into hbase/lib.
# copy into the local HBase lib directory
cp /app/hadoop/apache-hive-1.2.1-bin/lib/hive-hbase-handler-1.2.1.jar /app/hadoop/hbase-1.0.2/lib/
# copy to the HBase lib directory on the other hosts
cd /app/hadoop/apache-hive-1.2.1-bin/lib
scp hive-hbase-handler-1.2.1.jar hadoop@hadoop02:/app/hadoop/hbase-1.0.2/lib/
- Test the integration
Open the Hive CLI and the HBase shell on different hosts:
hive
hbase shell
# create the test tables
Create the table 'wx_test_hive_hbase' in HBase:
create 'wx_test_hive_hbase','INFO'
Create the external table 'hive_wx_test_hive_hbase' in Hive:
create external table hive_wx_test_hive_hbase(
id string,
area_code string,
area_desc string
)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,INFO:areaCode,INFO:areaDesc")
TBLPROPERTIES("hbase.table.name" = "wx_test_hive_hbase");
create external table hive_wx_test(
id string,
area_code string
)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,INFO:areaCode")
TBLPROPERTIES("hbase.table.name" = "wx_test");
Insert data from Hive and from HBase respectively:
#hive
#hbase
# one column per put; the column names must match the hbase.columns.mapping above
put 'wx_test_hive_hbase','00001','INFO:areaCode','0001'
put 'wx_test_hive_hbase','00001','INFO:areaDesc','深圳'
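To check that both sides see the same data (table and column names as defined above):
# in the hbase shell
scan 'wx_test_hive_hbase'
# in the hive CLI, the row written from HBase should come back
select * from hive_wx_test_hive_hbase;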
IX. Spark cluster setup
1. Download Spark and upload it to the servers
https://archive.apache.org/dist/spark/spark-2.1.0/
2. Edit the spark-env.sh configuration file
Note: the distribution only ships spark-env.sh.template; copy it to spark-env.sh.
cp spark-env.sh.template spark-env.sh
# modify the configuration; a minimal sketch follows
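A minimal spark-env.sh sketch, assuming a standalone Spark master on hadoop01 running on top of the JDK, Scala, and Hadoop installations above (the memory and core values are placeholders):
export JAVA_HOME=/app/java/jdk1.7.0_71
export SCALA_HOME=/app/scala/scala-2.11.8
export HADOOP_HOME=/app/hadoop/hadoop-2.6.0
export HADOOP_CONF_DIR=/app/hadoop/hadoop-2.6.0/etc/hadoop
export SPARK_MASTER_HOST=hadoop01      # standalone master
export SPARK_WORKER_MEMORY=4g          # placeholder, adjust to the hardware
export SPARK_WORKER_CORES=2            # placeholder, adjust to the hardware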