Hive使用详解 | 每天，遇到更好的你

Hive Join

Hive 只支持等值连接（equality joins）、外连接（outer joins）和（left/right joins）。

Hive 不支持所有非等值的连接，因为非等值连接非常难转化到 map/reduce 任务。

另外，Hive 支持多于 2 个表的连接。

准备a.txt、b.txt、c.txt，表中字段默认’\t’分隔

1
2
3

[root@node1 hive-1.2.1]# vi a.txt
[root@node1 hive-1.2.1]# vi b.txt
[root@node1 hive-1.2.1]# vi c.txt

a.txt

b.txt

c.txt

在hive中创建a、b、c三张测试表，其中id代表学生的学号，name是学生姓名，score是学生成绩，book是学生拥有的书

三张表中，id都是主键
第一张表，表示某个班级的学生名单
第二张表，表示参加了某种考试的学生的成绩
第三张表，表示部分学生拥有的书

hive> create table a(
    > id bigint,
    > name string
    > )
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'

hive> create table b(
    > id bigint,
    > score string
    > )
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'

hive> create table c(
    > id bigint,
    > book string
    > )
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'

将a.txt、b.txt、c.txt文件中的数据导入a、b、c表中

1
2
3

hive> load data local inpath 'a.txt' into table a;
hive> load data local inpath 'b.txt' into table b;
hive> load data local inpath 'c.txt' into table c;

join的用法：

1	找出班级中参加了某种考试的学生姓名以及成绩。该需求希望做到的是，取表a和表b中均存在的记录，并使用主键id连结起来

hive> create table d as select a1.id as id_a, a1.name , b1.id as id_b , b1.score
    > from a a1
    > join b b1
    > on a1.id = b1.id;

输出结果：

1	hive> select * from d;

如果关联的结果有重复记录，那么记录会全部保留

查询班级中拥有书的同学姓名以及其拥有的书名

hive> create table e as select a1.id as id_a, a1.name , c1.id as id_c , c1.book
    > from a a1
    > join c c1
    > on a1.id = c1.id;

输出结果：

1	hive> select * from e;

只查询班级中拥有书的同学：

hive> create table f as select a1.id as id_a, a1.name
    > from a a1
    > join c c1
    > on a1.id = c1.id;

输出结果：

1	hive> select * from f;

发现：

1	结果集每个id保留了2条记录，有些数据是重复的，我们只需要保留1条就足够了，这时可以使用distinct对关联结果进行去重

可以使用distinct去重

hive> create table g as select distinct a1.id as id_a, a1.name
    > from a a1
    > join c c1
    > on a1.id = c1.id;

输出结果：

1	hive> select * from g;

Left outer join的使用

现在需求：

想要知道班级中哪些同学没有参加过考试，这时我们需要把班级的所有学生记录都去出来，同时，如果某学生参加考试，则列出其分数。
这里需要以a为左表，所取结果中应包含a的所有记录

hive> create table h as select distinct a1.id as id_a, a1.name, b1.id as id_b , b1.score
    > from a a1
    > left outer join b b1
    > on a1.id = b1.id;

输出结果

1	hive> select * from h;

1
2

在结果集中，如果某学生没有参加某考试，即其在b表中无相应记录，那么结果集的相应字段会被赋予NULL值
那么可以用另一种方式实现最初的需求：得到参加考试的学生姓名和考试成绩

hive> create table i as select distinct a1.id as id_a, a1.name, b1.id as id_b , b1.score
    > from a a1
    > left outer join b b1
    > on a1.id = b1.id
    > where b1.id is not null;

输出结果

1	hive> select * from i;

1	上述代码加上where b.id is not null,把在b表中值为空的记录删除，实现的其实就是传统sql中的exists in操作

需求：把班级里未参加考试的学生取出来，因为我们要通知它们补考。这个需求即取出a中存在但b中不存在的记录

hive> create table j as select distinct a1.id as id_a, a1.name, b1.id as id_b , b1.score
    > from a a1
    > left outer join b b1
    > on a1.id = b1.id
    > where b1.id is null;

输出结果：

1	hive> select * from i;

1 2	实现的其实就是传统sql中的exists not in操作，为了不让结果集中出现重复记录，可以使用distinct去重

Left semi join

1	left semi join,有一个限制条件，即右表的字段只能出现在on子句中，而不能在select和where字句中引用

需求：找出班级中参加考试的学生

1
2
3

hive> create table k as select distinct a1.id ,a1.name from a a1
    > left semi join b b1
    > on a1.id = b1.id;

输出结果：

1	hive> select * from k;

full outer join用法

1	full outer join实现全连接。

需求：取出班级中全体学生/参加考试的学生及其成绩。即取a中存在或b中存在的记录

hive> create table l as select distinct a1.id ,a1.name , b1.id as id_b ,b1.score
    > from a a1
    > full outer join b b1
    > on a1.id = b1.id;

输出结果：

1	hive> select * from l;

left outer join where is not null与left semi join的联系与区别：

两者均可实现exists in操作，

不同的是，

前者允许右表的字段在select或where子句中引用，而后者不允许。

除了left outer join，Hive QL中还有right outer join，
其功能与前者相当，只不过左表和右表的角色刚好相反。

另外，Hive QL中没有left join、right join、full join以及right semi join等操作。