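Installing and Using Hive

Introduction

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides a SQL-like query language (HiveQL), translating queries into MapReduce jobs for execution. Its main advantage is a low learning curve: simple MapReduce-style aggregations can be written quickly as SQL-like statements without developing dedicated MapReduce applications, which makes it well suited to statistical analysis in a data warehouse.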

Getting the software

# Using version 2.3.3 as an example
wget http://mirrors.hust.edu.cn/apache/hive/hive-2.3.3/apache-hive-2.3.3-bin.tar.gz

tar xvf apache-hive-2.3.3-bin.tar.gz -C /opt/

ln -s /opt/apache-hive-2.3.3-bin /opt/hive

Prerequisites

  • Install and start Hadoop
  • Install and start MySQL

Create the directories Hive needs

# Local filesystem
mkdir -p /opt/hive/tmp

# HDFS
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -mkdir -p /user/hive/tmp
hdfs dfs -mkdir -p /user/hive/log
hdfs dfs -chmod -R 777 /user/hive/warehouse
hdfs dfs -chmod -R 777 /user/hive/tmp
hdfs dfs -chmod -R 777 /user/hive/log

Configure environment variables

# Java environment variables
export JAVA_HOME=/opt/jdk1.8.0_121
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH

# Hadoop environment variables
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib:$HADOOP_COMMON_LIB_NATIVE_DIR"

# Hive environment variables
export HIVE_HOME=/opt/hive
export PATH=$PATH:$HIVE_HOME/bin
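
A quick way to confirm the exports above took effect is to check that each variable is set and points at a real directory. The following is a minimal sketch, not part of the original setup; the helper name check_env is hypothetical, and it only looks at the three *_HOME variables defined above.

```python
import os

# The three home directories exported in the section above.
REQUIRED_VARS = ("JAVA_HOME", "HADOOP_HOME", "HIVE_HOME")


def check_env(env=None):
    """Return a dict mapping each problematic variable to a short reason."""
    env = os.environ if env is None else env
    problems = {}
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems[name] = "not set"
        elif not os.path.isdir(value):
            problems[name] = "not a directory: " + value
    return problems


if __name__ == "__main__":
    # Prints nothing when all three variables look fine.
    for name, reason in check_env().items():
        print(name + ": " + reason)
```

Run it in a new shell after sourcing the profile file; an empty result means all three paths exist.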

Configure the MySQL server

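# Create the hive database (it stores the metastore)
mysql> create database hive default character set latin1;
# Create the hive user and grant it privileges
# (GRANT ... IDENTIFIED BY works on MySQL 5.x; MySQL 8 needs a separate CREATE USER)
mysql> grant all privileges on hive.* to hive@'%' identified by 'hive';
# Reload the privilege tables
mysql> flush privileges;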

Edit hive-site.xml

  • Create the configuration file from the template
cd /opt/hive/conf
cp hive-default.xml.template hive-site.xml
  • Configure the MySQL connection
<!-- JDBC driver class -->
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<!-- MySQL connection URL -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://172.16.7.191:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
</property>
<!-- MySQL user name -->
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<!-- Password for that user -->
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
</property>
  • Configure Hive's working directories
<property>  
<name>hive.exec.scratchdir</name>
<value>/user/hive/tmp</value>
</property>

<property>
<name>hive.querylog.location</name>
<value>/user/hive/log/hadoop</value>
<description>Location of Hive run time structured log file</description>
</property>

# Replace every ${system:java.io.tmpdir} with /opt/hive/tmp
# Replace every ${system:user.name} with ${user.name}
  • If you hit the exception MetaException(message:Version information not found in metastore. ), disable schema verification:
<name>hive.metastore.schema.verification</name>
<value>false</value>
  • If you hit the exception Required table missing : "DBS" in Catalog "" Schema ", let DataNucleus create the schema automatically:
<property>
<name>datanucleus.fixedDatastore</name>
<value>false</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>true</value>
</property>
<property>
<name>datanucleus.autoCreateTables</name>
<value>true</value>
</property>
<property>
<name>datanucleus.autoCreateColumns</name>
<value>true</value>
</property>
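
The hive-site.xml changes in this section can also be applied with a script instead of by hand. The following is a minimal sketch using only Python's standard library; the function names set_property and patch_hive_site are hypothetical, and the placeholder replacements and paths are the ones listed above. Note that ElementTree drops XML comments and the XML declaration when it rewrites the file, so keep a backup of the original.

```python
import xml.etree.ElementTree as ET


def set_property(root, name, value):
    """Set the <value> of the <property> called `name`, adding it if absent."""
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            node = prop.find("value")
            if node is None:
                node = ET.SubElement(prop, "value")
            node.text = value
            return
    prop = ET.SubElement(root, "property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value


def patch_hive_site(xml_text, overrides):
    """Return hive-site.xml text with placeholders replaced and overrides applied."""
    # Same two textual replacements described in the section above.
    xml_text = xml_text.replace("${system:java.io.tmpdir}", "/opt/hive/tmp")
    xml_text = xml_text.replace("${system:user.name}", "${user.name}")
    root = ET.fromstring(xml_text)
    for name, value in overrides.items():
        set_property(root, name, value)
    return ET.tostring(root, encoding="unicode")
```

A typical call would read /opt/hive/conf/hive-site.xml, pass a dict such as {"hive.exec.scratchdir": "/user/hive/tmp"}, and write the result back. Values are passed unescaped (a plain & in the JDBC URL); ElementTree writes it out as &amp;.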

Download the MySQL connector

wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.46.tar.gz
tar xvf mysql-connector-java-5.1.46.tar.gz
# Copy the connector jar from the extracted directory into /opt/hive/lib

Initialize Hive

# Initialize the metastore schema in MySQL (run once)
schematool -dbType mysql -initSchema

# Start the metastore service in the background
hive --service metastore &

Start the services

hive --service hiveserver2

# Hive Web Interface; removed in Hive 2.2.0, so this applies to older releases only
hive --service hwi
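
HiveServer2 can take a while to come up, so it helps to check that its port accepts connections before pointing a client at it. The following is a minimal sketch using only the standard library; the helper name port_open is hypothetical, and port 10000 is assumed because it is the port used by the Python client example in this post.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    print("HiveServer2 reachable:", port_open("127.0.0.1", 10000))
```

This only shows that something is listening on the port; it does not prove that the service behind it is a healthy HiveServer2.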

Test

╭─wuyue@wuyue-pc ~  
╰─$ hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apache-hive-2.3.3-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.8.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/opt/apache-hive-2.3.3-bin/lib/hive-common-2.3.3.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> create database test;
OK
Time taken: 3.685 seconds
hive> show databases;
OK
default
test
Time taken: 0.127 seconds, Fetched: 2 row(s)
hive>

Connecting from Python

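# Install the SASL system libraries first (Debian/Ubuntu)
sudo apt-get install sasl2-bin libsasl2-2 libsasl2-dev libsasl2-modules

# The [hive] extra also pulls in the thrift and sasl Python packages
pip install 'pyhive[hive]'

# Minimal example; requires HiveServer2 listening on port 10000
from pyhive import hive
conn = hive.connect(host='127.0.0.1', port=10000)
cur = conn.cursor()
cur.execute('SELECT current_timestamp')
print(cur.fetchall())
# Prints the current time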
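
PyHive cursors follow the Python DB-API 2.0 interface, so small helpers can be written against that interface rather than against Hive specifically. The following is a minimal sketch of one such helper; the name fetch_dicts is hypothetical, and the commented usage assumes a HiveServer2 instance reachable at 127.0.0.1:10000 as in the example above.

```python
def fetch_dicts(cursor, sql):
    """Run sql on a DB-API 2.0 cursor and return the rows as a list of dicts."""
    cursor.execute(sql)
    # cursor.description holds one tuple per column; the name comes first.
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Assumed usage against HiveServer2 (needs a running server):
#   from pyhive import hive
#   cur = hive.connect(host="127.0.0.1", port=10000).cursor()
#   print(fetch_dicts(cur, "SHOW DATABASES"))
```

Because the helper depends only on the DB-API, it can be tried out locally with any compliant driver, such as the sqlite3 module from the standard library, before running it against Hive.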