Hadoop 之 Hive-PySuper

数据仓库

为了分析数据而来，分析结果给企业决策提供支撑

数仓概念

数据仓库（英语：Data Warehouse，简称数仓、DW）,是一个用于存储、分析、报告的数据系统
数据仓库的目的是构建面向分析的集成化数据环境，分析结果为企业提供决策支持（Decision Support）

专注分许

数据仓库本身并不“生产”任何数据，其数据来源于不同外部系统
同时数据仓库自身也不需要“消费”任何的数据，其结果开放给各个外部应用使用
这也是为什么叫“仓库”，而不叫“工厂”的原因

OLTP 系统

OLTP系统的核心是面向业务，支持业务，支持事务

可以，但是没有必要

所有的业务操作可以分为读、写两种操作，一般来说读的压力明显大于写的压力

如果在OLTP环境直接开展各种分析，需要考虑：

数据分析也是对数据进行读取操作，会让读取压力倍增
OLTP仅存储数周或数月的数据
数据分散在不同系统不同表中，字段类型属性不统一

目的

为了更好的进行各种规模的数据分析，同时也不影响OLTP系统运行

此时需要构建一个集成统一的数据分析平台

目的：面向分析，支持分析，并且和OLTP系统解耦合

构建

数仓是一个用于存储、分析、报告的数据系统，目的是构建面向分析的集成化数据环境
这种面向分析、支持分析的系统称之为OLAP（联机分析处理）系统
数据仓库是OLAP系统的一种实现

特征

面向主题
- 较高层次上数据综合、归类并进行分析利用的抽象
- 抽象层次上对数据进行完整、一致和准确的描述
集成性
- 数据通常会分布在多个操作型系统中，彼此分散、独立、异构
- 必然要经过统一与综合，对数据进行抽取、清理、转换和汇总
- 要统一源数据中所有矛盾之处
- 进行数据综合和计算
非易失性
- 数据仓库是分析数据的平台，而不是创造数据的平台
- 数据仓库的数据反映的是一段相当长的时间内历史数据的内容
- 数据仓库中一般有大量的查询操作，但修改和删除操作很少
时变性
- 包含各种粒度的历史数据，数据可能与某个特定日期、星期、月份、季度或者年份有关
- 当业务变化后会失去时效性，因此数据仓库的数据需要随着时间更新，以适应决策的需要
- 从这个角度讲，数据仓库建设是一个项目，更是一个过程

开发语言（SQL）

SQL的难点不在语法，知道怎么用就行了

后面在具体情况下，再说说使用什么操作

Hive

建立在Hadoop之上的开源数据仓库系统
将存储在Hadoop文件中的结构化、半结构化数据文件映射为一张数据库表，基于表提供了一种类似SQL的查询模型
称为Hive查询语言（HQL），用于访问和分析存储在Hadoop文件中的大型数据集
Hive核心是将HQL转换为MapReduce程序，然后将程序提交到Hadoop群集执行
Hive由Facebook实现并开源

Hive 和 Hadoop

从功能来说，数据仓库软件，至少需要具备下述两种能力
- 存储数据的能力
- 分析数据的能力
Apache Hive作为一款大数据时代的数据仓库软件，当然也具备上述两种能力
只不过Hive并不是自己实现了上述两种能力，而是借助Hadoop
Hive利用HDFS存储数据，利用MapReduce查询分析数据
这样突然发现Hive没啥用，不过是套壳Hadoop罢了
其实不然，Hive的最大的魅力在于用户专注于编写HQL，Hive帮您转换成为MapReduce程序完成对数据的分析

模拟 Hive

Hive能将数据文件映射成为一张表，这个映射是指什么？
- 文件和表之间的对应关系
Hive软件本身到底承担了什么功能职责
- SQL语法解析编译成MapReduce

架构图

组件

用户接口

包括CLI、JDBC/ODBC、WebGUI

CLI(command line interface)为shell命令行
Hive中的Thrift服务器允许外部客户端通过网络与Hive进行交互
类似于JDBC或ODBC协议。WebGUI是通过浏览器访问Hive

#### 元数据存储

通常是存储在关系数据库如 mysql/derby中
Hive 中的元数据包括
- 表的名字
- 表的列和分区及其属性
- 表的属性（是否为外部表等）
- 表的数据所在目录等

驱动程序

语法解析器、计划编译器、优化器、执行器

完成 HQL 查询语句从词法分析、语法分析、编译、优化以及查询计划的生成
生成的查询计划存储在 HDFS 中，并在随后有执行引擎调用执行

执行引擎

Hive本身并不直接处理数据文件
而是通过执行引擎处理
当下Hive支持MapReduce、Tez、Spark3种执行引擎

元数据

又称中介数据、中继数据，为描述数据的数据（data about data）
主要是描述数据属性（property）的信息
用来支持如指示存储位置、历史数据、资源查找、文件记录等功能

Hive Metadata即Hive的元数据

包含用Hive创建的database、table、表的位置、类型、属性，字段顺序类型等元信息
元数据存储在关系型数据库中
如hive内置的Derby、或者第三方如MySQL等

元数据服务（Metastore）

管理metadata元数据，对外暴露服务地址
让各种客户端通过连接metastore服务，由metastore再去连接MySQL数据库来存取元数据
有了metastore服务，就可以有多个客户端同时连接
而且这些客户端不需要知道MySQL数据库的用户名和密码，只需要连接metastore 服务即可
某种程度上也保证了hive元数据的安全

配置metastore

	内嵌模式	本地模式	远程模式
Metastore单独配置、启动	否	否	是
Metadata存储介质	Derby	MySQL	MySQL

metastore服务配置有3种模式：

内嵌模式
本地模式
远程模式

区分3种配置方式的关键是弄清楚两个问题：

Metastore服务是否需要单独配置、单独启动？
Metadata是存储在内置的derby中，还是第三方RDBMS,比如MySQL

远程metastore

在生产环境中，建议用远程模式来配置Hive Metastore

在这种情况下，其他依赖hive的软件都可以通过Metastore访问hive

由于还可以完全屏蔽数据库层，因此这也带来了更好的可管理性/安全性

Remote Metastore 远程模式：

安装部署

安装 MySQL

在一台机器上安装MySQL，远程其他机器远程访问即可

卸载mariadb

# 查询本机的 mariadb， 返回什么删除什么
rpm -qa | grep mariadb
# mariadb-libs-5.5.68-1.el7.x86_64

# 删除
rpm -e mariadb-libs-5.5.68-1.el7.x86_64 --nodeps

# 再次查看
rpm -qa|grep mariadb

安装MySQL

# 创建MySQL文件夹
mkdir /hadoop/software/mysql && cd mysql

# 将MySQL安装包发送到此处

# 解压
tar xvf mysql-5.7.29-1.el7.x86_64.rpm-bundle.tar

# 安装依赖-libaio
yum -y install libaio

# 安装MySQL
rpm -ivh mysql-community-common-5.7.29-1.el7.x86_64.rpm mysql-community-libs-5.7.29-1.el7.x86_64.rpm mysql-community-client-5.7.29-1.el7.x86_64.rpm mysql-community-server-5.7.29-1.el7.x86_64.rpm

初始化MySQL

#初始化
mysqld --initialize

#更改所属组
chown mysql:mysql /var/lib/mysql -R

#启动mysql
systemctl start mysqld.service

#查看生成的临时root密码
cat  /var/log/mysqld.log

# 下面这句话会提示密码
[Note] A temporary password is generated for root@localhost: sOLjp4jO4n&7

# 汇总命令
mysqld --initialize && chown mysql:mysql /var/lib/mysql -R && systemctl start mysqld.service
cat  /var/log/mysqld.log

修改、授权

mysql -u root -p
Enter password:     #这里输入在日志中生成的临时密码

# 更新root密码
alter user user() identified by "root";

# 授权
use mysql;
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'root' WITH GRANT OPTION;
FLUSH PRIVILEGES;
exit;

# mysql的启动、关闭、状态查看
systemctl stop mysqld
systemctl status mysqld
systemctl start mysqld

# 开机自启
systemctl enable  mysqld                             

# 查看是否已经设置自启动成功
systemctl list-unit-files | grep mysqld
# mysqld.service                                enabled

卸载MySQL

# 关闭mysql服务
systemctl stop mysqld.service

# 查找安装mysql的rpm包
[root@node-211 ~]# rpm -qa | grep -i mysql      
mysql-community-libs-5.7.29-1.el7.x86_64
mysql-community-common-5.7.29-1.el7.x86_64
mysql-community-client-5.7.29-1.el7.x86_64
mysql-community-server-5.7.29-1.el7.x86_64

# 卸载
[root@node-211 ~]# yum remove mysql-community-libs-5.7.29-1.el7.x86_64 mysql-community-common-5.7.29-1.el7.x86_64 mysql-community-client-5.7.29-1.el7.x86_64 mysql-community-server-5.7.29-1.el7.x86_64

# 查看是否卸载干净
rpm -qa | grep -i mysql

# 查找mysql相关目录 删除
[root@node1 ~]# find / -name mysql
/var/lib/mysql
/var/lib/mysql/mysql
/usr/share/mysql

[root@node1 ~]# rm -rf /var/lib/mysql
[root@node1 ~]# rm -rf /var/lib/mysql/mysql
[root@node1 ~]# rm -rf /usr/share/mysql

# 删除默认配置 日志
rm -rf /etc/my.cnf 
rm -rf /var/log/mysqld.log

安装 Hive

解压安装包

tar zxvf apache-hive-3.1.2-bin.tar.gz

ls
# apache-hive-3.1.2-bin  hadoop-3.3.0  jdk1.8.0_241  mysql

版本差异

Hive与Hadoop之间guava版本差异

# 删除hive中的guava包
cd /hadoop/software/apache-hive-3.1.2-bin/
rm lib/guava-19.0.jar -rf

# 从hadoop中复制guava
cp ../hadoop-3.3.0/share/hadoop/common/lib/guava-27.0-jre.jar ./lib/

修改配置文件

hive-env.sh

cd /hadoop/software/apache-hive-3.1.2-bin/conf
mv hive-env.sh.template hive-env.sh

vi hive-env.sh
export HADOOP_HOME=/hadoop/software/hadoop-3.3.0
export HIVE_CONF_DIR=/hadoop/software/apache-hive-3.1.2-bin/conf
export HIVE_AUX_JARS_PATH=/hadoop/software/apache-hive-3.1.2-bin/lib

hive-site.xml

# 新建这个文件
vi hive-site.xml

<configuration>
<!-- 存储元数据mysql相关配置 -->
<property>
	<name>javax.jdo.option.ConnectionURL</name>
	<value>jdbc:mysql://node-211:3306/hive3?createDatabaseIfNotExist=true&amp;useSSL=false&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
</property>

<property>
	<name>javax.jdo.option.ConnectionDriverName</name>
	<value>com.mysql.jdbc.Driver</value>
</property>

<!-- 连接MySQL的 用户名 -->
<property>
	<name>javax.jdo.option.ConnectionUserName</name>
	<value>root</value>
</property>

<!-- 连接MySQL的 密码 -->
<property>
	<name>javax.jdo.option.ConnectionPassword</name>
	<value>root</value>
</property>

<!-- H2S运行绑定host -->
<property>
    <name>hive.server2.thrift.bind.host</name>
    <value>node-211</value>
</property>

<!-- 远程模式部署metastore metastore地址 -->
<property>
    <name>hive.metastore.uris</name>
    <value>thrift://node-211:9083</value>
</property>

<!-- 关闭元数据存储授权  --> 
<property>
    <name>hive.metastore.event.db.notification.api.auth</name>
    <value>false</value>
</property>
</configuration>

上传jar包

mysql-connector-java-5.1.32.jar

初始化元数据

cd /hadoop/software/apache-hive-3.1.2-bin/

bin/schematool -initSchema -dbType mysql -verbos
# 初始化成功会在mysql中创建74张表

创建hive存储目录

hadoop fs -mkdir /tmp
hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -chmod g+w /tmp
hadoop fs -chmod g+w /user/hive/warehouse

启动 hive

启动metastore服务

# 前台启动  关闭ctrl+c
/hadoop/software/apache-hive-3.1.2-bin/bin/hive --service metastore

# 前台启动开启debug日志
/hadoop/software/apache-hive-3.1.2-bin/bin/hive --service metastore --hiveconf hive.root.logger=DEBUG,console  

# 后台启动 进程挂起  关闭使用jps + kill -9
nohup /hadoop/software/apache-hive-3.1.2-bin/bin/hive --service metastore &

# 日志在当前目录下的nohup.out中
cat nohup.out

# 关闭，jps 查看进程号
jps
kill -9 jps_id

##### 启动 hiveserver2

远程模式下beeline通过 Thrift 连接到单独的HiveServer2服务上（官方推荐在生产环境中使用）
HiveServer2支持多客户端的并发和身份认证，旨在为开放API客户端如JDBC、ODBC提供更好的支持

关系

HiveServer2通过Metastore服务读写元数据
所以在远程模式下，启动HiveServer2之前必须先首先启动metastore服务
特别注意
- 远程模式下，Beeline客户端只能通过HiveServer2服务访问Hive
- 而bin/hive是通过Metastore服务访问的

# 前台带日志启动hiveserver2
/hadoop/software/apache-hive-3.1.2-bin/bin/hive --service hiveserver2 --hiveconf hive.root.logger=DEBUG,console

# 后台启动hiveserver2
nohup /hadoop/software/apache-hive-3.1.2-bin/bin/hive --service hiveserver2 &

# 注意 启动hiveserver2需要一定的时间
# 不要启动之后立即beeline连接 可能连接不上

测试连接

1、 beeline客户端连接

拷贝node-211上的安装包到beeline客户端机器上（node-212）

scp -r /hadoop/software/apache-hive-3.1.2-bin/ root@node-212:/hadoop/software/

error

Error: Could not open client transport with JDBC Uri: jdbc:hive2://node1:10000: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: root is not allowed to impersonate root (state=08S01,code=0)

<!--在hadoop的配置文件core-site.xml中添加如下属性：-->
<property>
        <name>hadoop.proxyuser.root.hosts</name>
      <value>*</value>
</property>
<property>
        <name>hadoop.proxyuser.root.groups</name>
        <value>*</value>
</property>

Hive 客户端

bin/hive

# 直接运行这条命令，就启动了hive
/hadoop/software/apache-hive-3.1.2-bin/bin/hive

bin/beeline

# beeline需要先连接hiveserver2
# 再通过hiveserver2 连接 metastore
# 而且需要手动连接
/hadoop/software/apache-hive-3.1.2-bin/bin/beeline
    
beeline> ! connect jdbc:hive2://node-211:10000
Connecting to jdbc:hive2://node-211:10000
Enter username for jdbc:hive2://node-211:10000: root
Enter password for jdbc:hive2://node-211:10000: # 这里直接回车就可以了

中文乱码

-- 注意：
-- 下面sql语句是需要在MySQL中执行
-- 修改Hive存储的元数据信息（metadata）

use hive3;
show tables;

alter table hive3.COLUMNS_V2 modify column COMMENT varchar(256) character set utf8;
alter table hive3.TABLE_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8;
alter table hive3.PARTITION_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8 ;
alter table hive3.PARTITION_KEYS modify column PKEY_COMMENT varchar(4000) character set utf8;
alter table hive3.INDEX_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8; 

# 需要删除表，重新创建才会生效

Hive SQL

DML

group by 语法限制，group by语法中的字段
- 要么是group by分组的字段
- 要么是被聚合函数应用的字段
- 避免出现一个字段多个值的歧义
having和where区别

	执行时间	可否使用聚合函数
where	分组前过滤	后面不可以使用聚合函数
having	分组后过滤	后面可以使用聚合函数

执行顺序：from > where > group > having > order > select

show databases;

use hadoop;

------------Hive SQL-DML-Load加载数据---------------
-- step1:建表
-- 建表 student_local 用于演示从本地加载数据
create table student_local
(
    num  int,
    name string,
    sex  string,
    age  int,
    dept string
) row format delimited fields terminated by ',';

-- 建表student_HDFS  用于演示从HDFS加载数据
create table student_HDFS
(
    num  int,
    name string,
    sex  string,
    age  int,
    dept string
) row format delimited fields terminated by ',';

-- 建议使用beeline客户端 可以显示出加载过程日志信息
-- step2:加载数据
-- 从本地加载数据  数据位于HS2（node1）本地文件系统  本质是hadoop fs -put上传操作
load data local inpath '/root/students.txt' into table hadoop.student_local;
select * from student_local;

-- 从HDFS加载数据  数据位于HDFS文件系统根目录下  本质是hadoop fs -mv 移动操作
-- 先把数据上传到HDFS上  hadoop fs -put students.txt /
load data inpath '/students.txt' into table hadoop.student_hdfs;
select * from student_hdfs;

------------Hive SQL-DML-Insert插入数据-----------------

-- step1:创建一张源表student
drop table if exists student;
create table student
(
    num  int,
    name string,
    sex  string,
    age  int,
    dept string
) row format delimited fields terminated by ',';

-- step2:加载数据
load data local inpath '/root/students.txt' into table student;
select * from student;

-- step3：创建一张目标表  只有两个字段
create table student_from_insert
(
    sno   int,
    sname string
);

-- 使用insert+select插入数据到新表中
insert into table student_from_insert select num,name from student;
select * from student_from_insert;


------------Hive SQL select查询基础语法------------
-- 创建表t_usa_covid19
drop table if exists t_usa_covid19;
CREATE TABLE t_usa_covid19
(
    count_date string,
    county     string,
    state      string,
    fips       int,
    cases      int,
    deaths     int
) row format delimited fields terminated by ",";

-- 将数据load加载到t_usa_covid19表对应的路径下
load data local inpath '/root/us-covid19-counties.dat' into table t_usa_covid19;

-- 1、select_expr
-- 查询所有字段或者指定字段
select * from t_usa_covid19;
select county, cases, deaths from t_usa_covid19;

-- 查询常数返回 此时返回的结果和表中字段无关
select 1 from t_usa_covid19;
-- 查询当前数据库
select current_database();

--省去from关键字


-- 2、ALL DISTINCT
-- 返回所有匹配的行
select state from t_usa_covid19;
-- 相当于
select all state from t_usa_covid19;

-- 返回所有匹配的行 去除重复的结果
select distinct state from t_usa_covid19;

-- 多个字段distinct 整体去重
select distinct county, state from t_usa_covid19;



-- 3、WHERE CAUSE
select * from t_usa_covid19 where 1 > 2; -- 1 > 2 返回false
select * from t_usa_covid19 where 1 == 1; -- 1 = 1 返回true

-- 找出来自于California州的疫情数据
select * from t_usa_covid19 where state = 'California';
--where条件中使用函数 找出州名字母长度超过10位的有哪些
select * from t_usa_covid19 where length(state) > 10;

-- 注意：where条件中不能使用聚合函数
-- --报错 SemanticException:Not yet supported place for UDAF 'count'
-- 聚合函数要使用它的前提是结果集已经确定。
-- 而where子句还处于“确定”结果集的过程中，因而不能使用聚合函数。
select state, sum(deaths) from t_usa_covid19 where sum(deaths) > 100 group by state;
--可以使用Having实现
select state, sum(deaths) from t_usa_covid19 group by state having sum(deaths) > 100;


-- 4、聚合操作
-- 统计美国总共有多少个县county
select county as itcast from t_usa_covid19;
-- 学会使用as 给查询返回的结果起个别名
select count(county) as county_cnts from t_usa_covid19;
-- 去重distinct
select count(distinct county) as county_cnts from t_usa_covid19;

-- 统计美国加州有多少个县
select count(county) from t_usa_covid19 where state = "California";
-- 统计德州总死亡病例数
select sum(deaths) from t_usa_covid19 where state = "Texas";
-- 统计出美国最高确诊病例数是哪个县
select max(cases) from t_usa_covid19;


-- 5、GROUP BY
select * from t_usa_covid19;

-- 根据state州进行分组 统计每个州有多少个县county
select count(county) from t_usa_covid19 where count_date = "2021-01-28" group by state;

-- 想看一下统计的结果是属于哪一个州的
select state, count(county) as county_nums from t_usa_covid19 where count_date = "2021-01-28" group by state;

-- 再想看一下每个县的死亡病例数，我们猜想很简单呀  把deaths字段加上返回  真实情况如何呢？
select state, count(county), sum(deaths) from t_usa_covid19 where count_date = "2021-01-28" group by state;
-- 很尴尬 sql报错了org.apache.hadoop.hive.ql.parse.SemanticException:Line 1:27 Expression not in GROUP BY key 'deaths'

-- 为什么会报错？？group by的语法限制
-- 结论：出现在GROUP BY中select_expr的字段：要么是GROUP BY分组的字段；要么是被聚合函数应用的字段。
-- deaths不是分组字段 报错
-- state是分组字段 可以直接出现在select_expr中

-- 被聚合函数应用
select state, count(county), sum(deaths) from t_usa_covid19 where count_date = "2021-01-28" group by state;


-- 6、having
-- 统计2021-01-28死亡病例数大于10000的州
select state, sum(deaths) from t_usa_covid19 where count_date = "2021-01-28"  and sum(deaths) > 10000 group by state;
-- where语句中不能使用聚合函数 语法报错

-- 先where分组前过滤，再进行group by分组， 分组后每个分组结果集确定 再使用having过滤
select state, sum(deaths) from t_usa_covid19 where count_date = "2021-01-28" group by state having sum(deaths) > 10000;
-- 这样写更好 即在group by的时候聚合函数已经作用得出结果 having直接引用结果过滤 不需要再单独计算一次了
select state, sum(deaths) as cnts from t_usa_covid19 where count_date = "2021-01-28" group by state having cnts > 10000;


-- 7、order by
-- 根据确诊病例数升序排序 查询返回结果
select * from t_usa_covid19;
select * from t_usa_covid19 order by cases;
-- 不写排序规则 默认就是asc升序
select * from t_usa_covid19 order by cases asc;

-- 根据死亡病例数倒序排序 查询返回加州每个县的结果
select * from t_usa_covid19 where state = "California" order by cases desc;
select state,count(county)  from t_usa_covid19 where count_date="2021-01-28" order by state;

-- 8、limit
-- 没有限制返回2021.1.28 加州的所有记录
select * from t_usa_covid19 where count_date = "2021-01-28" and state = "California";

-- 返回结果集的前5条
select * from t_usa_covid19 where count_date="2021-01-28" and state="California" limit 5;

-- 返回结果集从第1行开始 共3行
select * from t_usa_covid19 where count_date = "2021-01-28" and state = "California" limit 2,3;
-- 注意 第一个参数偏移量是从0开始的


-- 执行顺序
select state, sum(deaths) as cnts from t_usa_covid19 where count_date = "2021-01-28" group by state having cnts > 10000 limit 2;


------------Hive Join SQL 语法------------
-- Join语法练习 建表
drop table if exists employee_address;
drop table if exists employee_connection;
drop table if exists employee;

-- table1: 员工表
CREATE TABLE employee(id int,name string,deg string,salary int,dept string) row format delimited fields terminated by ',';
-- table2:员工家庭住址信息表
CREATE TABLE employee_address(id int,hno string,street string,city string) row format delimited fields terminated by ',';
-- table3:员工联系方式信息表
CREATE TABLE employee_connection(id int,phno string,email string) row format delimited fields terminated by ',';

-- 加载数据到表中
load data local inpath '/root/test-file/employee.txt' into table employee;
load data local inpath '/root/test-file/employee_address.txt' into table employee_address;
load data local inpath '/root/test-file/employee_connection.txt' into table employee_connection;

-- 验证数据
show tables;
select * from employee;

select * from employee_address;

select * from employee_connection;

-- 1、inner join
select e.id, e.name, e_a.city, e_a.street from employee e inner join employee_address e_a on e.id = e_a.id;

-- 等价于 inner join=join
select e.id, e.name, e_a.city, e_a.street from employee e join employee_address e_a on e.id = e_a.id;

-- 等价于 隐式连接表示法
select e.id, e.name, e_a.city, e_a.street from employee e, employee_address e_a where e.id = e_a.id;


-- 2、left join
select e.id, e.name, e_conn.phno, e_conn.email from employee e left join employee_connection e_conn on e.id = e_conn.id;

-- 等价于 left outer join
select e.id, e.name, e_conn.phno, e_conn.email from employee e left outer join employee_connection e_conn on e.id = e_conn.id;


-----------------Hive 常用的内置函数----------------------
show functions;
describe function extended count;


------------String Functions 字符串函数------------
-- 查看字符串长度
select length("spark");

-- 字符串反转
select reverse("spark");

-- 字符串拼接
select concat("angela", "baby");
-- 带分隔符字符串连接函数：concat_ws(separator, [string | array(string)]+)
select concat_ws('.', 'www', array('spark', 'cn'));

-- 字符串截取函数：
-- substr(str, pos[, len]) 或者 substring(str, pos[, len])
select substr("hadoop-spark", -5);
-- pos是从1开始的索引，如果为负数则倒着数
select substr("hadoop-spark", 2, 2);
-- 分割字符串函数: split(str, regex)
-- split针对字符串数据进行切割  返回是数组array  可以通过数组的下标取内部的元素 注意下标从0开始的
select split('apache hive', ' ');
select split('apache hive', ' ')[0];
select split('apache hive', ' ')[1];


----------- Date Functions 日期函数 -----------------
-- 获取当前日期: current_date
select current_date();
-- 获取当前UNIX时间戳函数: unix_timestamp
select unix_timestamp();
-- 日期转UNIX时间戳函数: unix_timestamp
select unix_timestamp("2011-12-07 13:01:03");
-- 指定格式日期转UNIX时间戳函数: unix_timestamp
select unix_timestamp('20111207 13:01:03', 'yyyyMMdd HH:mm:ss');
-- UNIX时间戳转日期函数: from_unixtime
select from_unixtime(1618238391);
select from_unixtime(0, 'yyyy-MM-dd HH:mm:ss');

-- 日期比较函数: datediff  日期格式要求'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'
select datediff('2022-12-01', '2022-01-01');
-- 日期增加函数: date_add
select date_add('2012-02-28', 10);
-- 日期减少函数: date_sub
select date_sub('2012-01-1', 10);


----Mathematical Functions 数学函数-------------
-- 取整函数: round  返回double类型的整数值部分 （遵循四舍五入）
select round(3.1415926);
-- 指定精度取整函数: round(double a, int d) 返回指定精度d的double类型
select round(3.1415926, 4);
-- 取随机数函数: rand 每次执行都不一样 返回一个0到1范围内的随机数
select rand();
-- 指定种子取随机数函数: rand(int seed) 得到一个稳定的随机数序列
select rand(5);


-----Conditional Functions 条件函数------------------
-- 使用之前课程创建好的student表数据
select * from student limit 3;

-- if条件判断: if(boolean testCondition, T valueTrue, T valueFalseOrNull)
select if(1 = 2, 100, 200);
select if(sex = '男', 'M', 'W') from student limit 3;

-- 空值转换函数: nvl(T value, T default_value)
select nvl("allen", "spark");
select nvl(null, "itcast");

-- 条件转换函数: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end;
select case sex when '男' then 'male' else 'female' end from student limit 3;

Join

根据两个或多个表中的列之间的关系，从这些表中共同组合查询数据

table_reference：是join查询中使用的表名
table_factor：与table_reference相同,是联接查询中使用的表名
join_condition：join查询关联的条件，如果在两个以上的表上需要连接，则使用AND关键字

inner join（内连接）

两张表的交集

只有进行连接的两个表中都存在与连接条件相匹配的数据才会被留下来

left join（左连接）

左外连接，两张表的并集

join时以左表的全部数据为准，右边与之关联，左表数据全部返回，右表：

关联上的显示返回

关联不上的显示null返回

函数

show functions：查看当下可用的所有函数
describe function extended funcname：查看函数的使用方式

内置函数

字符串函数
- length：字符串长度函数
- reverse：字符串反转函数
- concat：字符串连接函数
- concat_ws：带分隔符字符串连接函数
- substr、substring：字符串截取函数

select length("itcast");
select reverse("itcast");
select concat("angela","baby");

-- 带分隔符字符串连接函数：concat_ws(separator, [string | array(string)]+)
select concat_ws('.', 'www', array('itcast', 'cn'));

-- 字符串截取函数：substr(str, pos[, len]) 或者 substring(str, pos[, len])
select substr("angelababy",-2); --pos是从1开始的索引，如果为负数则倒着数
select substr("angelababy",2,2);

-- 分割字符串函数: split(str, regex)
select split('apache hive', ' ');

日期类型函数

-- 获取当前日期: current_date
select current_date();

-- 获取当前UNIX时间戳函数: unix_timestamp
select unix_timestamp();

-- 日期转UNIX时间戳函数: unix_timestamp
select unix_timestamp("2011-12-07 13:01:03");

-- 指定格式日期转UNIX时间戳函数: unix_timestamp
select unix_timestamp('20111207 13:01:03','yyyyMMdd HH:mm:ss');

-- UNIX时间戳转日期函数: from_unixtime
select from_unixtime(1618238391);
select from_unixtime(0, 'yyyy-MM-dd HH:mm:ss');

-- 日期比较函数: datediff 日期格式要求'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'
select datediff('2012-12-08','2012-05-09');

-- 日期增加函数: date_add
select date_add('2012-02-28',10);

-- 日期减少函数: date_sub
select date_sub('2012-01-1',10);

数值类型函数

-- 取整函数: round 返回double类型的整数值部分 （遵循四舍五入）
select round(3.1415926);

-- 指定精度取整函数: round(double a, int d) 返回指定精度d的double类型
select round(3.1415926,4);

-- 取随机数函数: rand 每次执行都不一样 返回一个0到1范围内的随机数
select rand();

-- 指定种子取随机数函数: rand(int seed) 得到一个稳定的随机数序列
select rand(3);

条件函数

-- 使用之前课程创建好的student表数据
select * from student limit 3;

-- if条件判断: if(boolean testCondition, T valueTrue, T valueFalseOrNull)
select if(1=2,100,200);
select if(sex ='男','M','W') from student limit 3;

-- 空值转换函数: nvl(T value, T default_value)
select nvl("allen","itcast");
select nvl(null,"itcast");

-- 条件转换函数: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end;
select case sex when '男' then 'male' else 'female' end from student limit 3;

用户定义函数

根据输入输出的行数区分

UDF（User-Defined-Function）：普通函数，一进一出
UDAF（User-Defined Aggregation Function）：聚合函数，多进一出
UDTF（User-Defined Table-Generating Functions）：表生成函数，一进多出

案例

show databases;

--------------1、建库-------------------
-- 如果数据库已存在就删除
drop database if exists db_msg cascade;

-- 创建数据库
create database db_msg;

-- 切换数据库
use db_msg;


--------------2、建表-------------------
-- 如果表已存在就删除
drop table if exists db_msg.tb_msg_source;

-- 建表
create table db_msg.tb_msg_source
(
    msg_time           string comment "消息发送时间",
    sender_name        string comment "发送人昵称",
    sender_account     string comment "发送人账号",
    sender_sex         string comment "发送人性别",
    sender_ip          string comment "发送人ip地址",
    sender_os          string comment "发送人操作系统",
    sender_phonetype   string comment "发送人手机型号",
    sender_network     string comment "发送人网络类型",
    sender_gps         string comment "发送人的GPS定位",
    receiver_name      string comment "接收人昵称",
    receiver_ip        string comment "接收人IP",
    receiver_account   string comment "接收人账号",
    receiver_os        string comment "接收人操作系统",
    receiver_phonetype string comment "接收人手机型号",
    receiver_network   string comment "接收人网络类型",
    receiver_gps       string comment "接收人的GPS定位",
    receiver_sex       string comment "接收人性别",
    msg_type           string comment "消息类型",
    distance           string comment "双方距离",
    message            string comment "消息内容"
) row format delimited fields terminated by '\t';

--------------3、加载数据-------------------
-- 上传数据文件到node1服务器本地文件系统（HS2服务所在机器）
-- shell: mkdir -p /root/test-file/project

-- 加载数据到表中
load data local inpath '/root/test-file/project/data1.tsv' into table db_msg.tb_msg_source;
load data local inpath '/root/test-file/project/data2.tsv' into table db_msg.tb_msg_source;

-- 查询表 验证数据文件是否映射成功
select * from tb_msg_source limit 10;

-- 统计行数
select count(*) as cnt from tb_msg_source;


--------------4、ETL数据清洗-------------------
-- 问题1：当前数据中，有一些数据的字段为空，不是合法数据
select msg_time, sender_name, sender_gps from db_msg.tb_msg_source where length(sender_gps) = 0 limit 10;

-- 问题2：需求中，需要统计每天、每个小时的消息量，但是数据中没有天和小时字段，只有整体时间字段，不好处理
select msg_time from db_msg.tb_msg_source limit 10;

-- 问题3：需求中，需要对经度和维度构建地区的可视化地图，但是数据中GPS经纬度为一个字段，不好处理
select sender_gps from db_msg.tb_msg_source limit 10;

-- ETL实现
-- 如果表已存在就删除
drop table if exists db_msg.tb_msg_etl;

-- 将Select语句的结果保存到新表中
create table db_msg.tb_msg_etl as
select *,
       substr(msg_time, 0, 10)   as dayinfo,    --获取天
       substr(msg_time, 12, 2)   as hourinfo,   --获取小时
       split(sender_gps, ",")[0] as sender_lng, --提取经度
       split(sender_gps, ",")[1] as sender_lat  --提取纬度
from db_msg.tb_msg_source

-- 过滤字段为空的数据
where length(sender_gps) > 0;

-- 验证ETL结果
select msg_time, dayinfo, hourinfo, sender_gps, sender_lng, sender_lat from db_msg.tb_msg_etl limit 10;


--------------5、需求指标统计分析-------------------
-- 需求：统计今日总消息量
create table if not exists tb_rs_total_msg_cnt comment "今日消息总量" as
select dayinfo, count(*) as total_msg_cnt
from db_msg.tb_msg_etl
group by dayinfo;

select * from tb_rs_total_msg_cnt;


-- 需求：统计今日每小时消息量、发送和接收用户数
create table if not exists tb_rs_hour_msg_cnt comment "每小时消息量趋势" as
select dayinfo,
       hourinfo,
       count(*)                         as total_msg_cnt,
       count(distinct sender_account)   as sender_usr_cnt,
       count(distinct receiver_account) as receiver_usr_cnt
from db_msg.tb_msg_etl
group by dayinfo, hourinfo;

select * from tb_rs_hour_msg_cnt;


-- 需求：统计今日各地区发送消息数据量
create table if not exists tb_rs_loc_cnt
    comment "今日各地区发送消息总量"
as
select dayinfo,
       sender_gps,
       cast(sender_lng as double) as longitude,
       cast(sender_lat as double) as latitude,
       count(*)                   as total_msg_cnt
from db_msg.tb_msg_etl
group by dayinfo, sender_gps, sender_lng, sender_lat;

select * from tb_rs_loc_cnt;


--需求：统计今日发送消息和接收消息的用户数
create table if not exists tb_rs_usr_cnt
    comment "今日发送消息人数、接受消息人数"
as
select dayinfo,
       count(distinct sender_account)   as sender_usr_cnt,
       count(distinct receiver_account) as receiver_usr_cnt
from db_msg.tb_msg_etl
group by dayinfo;

select * from tb_rs_usr_cnt;


--需求：统计今日发送消息最多的Top10用户
create table if not exists tb_rs_susr_top10
    comment "发送消息条数最多的Top10用户"
as
select dayinfo,
       sender_name as username,
       count(*)    as sender_msg_cnt
from db_msg.tb_msg_etl
group by dayinfo, sender_name
order by sender_msg_cnt desc
limit 10;

select * from tb_rs_susr_top10;

-- 需求：统计今日接收消息最多的Top10用户
create table if not exists tb_rs_rusr_top10
    comment "接受消息条数最多的Top10用户"
as
select dayinfo,
       receiver_name as username,
       count(*)      as receiver_msg_cnt
from db_msg.tb_msg_etl
group by dayinfo, receiver_name
order by receiver_msg_cnt desc
limit 10;

select * from tb_rs_rusr_top10;


-- 需求：统计发送人的手机型号分布情况
create table if not exists tb_rs_sender_phone
    comment "发送人的手机型号分布"
as
select dayinfo,
       sender_phonetype,
       count(distinct sender_account) as cnt
from tb_msg_etl
group by dayinfo, sender_phonetype;

select * from tb_rs_sender_phone;

-- 需求：统计发送人的设备操作系统分布情况
create table if not exists tb_rs_sender_os
    comment "发送人的OS分布"
as
select dayinfo,
       sender_os,
       count(distinct sender_account) as cnt
from tb_msg_etl
group by dayinfo, sender_os;

select * from tb_rs_sender_os;

Hadoop 之 Hive