本次主要是干DWS层的另外两个宽表,商品主题日统计宽表和用户主题日统计宽表。因为是销售商品给用户,而且销售主题日统计宽表在上次已经干完了,这另外两个宽表基本换汤不换药。还有MySQL的索引之类的一些东西,让我认识到SQL也是有性能(效率)的评级的,网上可以搜到MySQL的评级分类,以后基本将自己的SQL优化到const这种效率的SQL就很牛了。

Hive相关配置参数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
--分区
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
--分桶
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
--并行执行
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
--小文件合并
-- set mapred.max.split.size=2147483648;
-- set mapred.min.split.size.per.node=1000000000;
-- set mapred.min.split.size.per.rack=1000000000;
--矢量化查询
set hive.vectorized.execution.enabled=true;
--关联优化器
set hive.optimize.correlation=true;
--读取零拷贝
set hive.exec.orc.zerocopy=true;
--join数据倾斜
set hive.optimize.skewjoin=true;
-- set hive.skewjoin.key=100000;
set hive.optimize.skewjoin.compiletime=true;
set hive.optimize.union.remove=true;
-- group倾斜
set hive.groupby.skewindata=false;

变量命名规则

建表语法需要在Hive中写, 因为Presto不支持.
给变量起名主要有种规范, 大多数语言都支持前两种, 后两种, 部分语言支持, 部分语言不支持.

  • 规范1: 大驼峰命名法, 也叫 双峰驼命名法, 即: 每个单词的首字母都大写, 其它小写.
    例如: HelloWorld, MaxValue
  • 规范2: 小驼峰命名法, 也叫 单峰驼命名法, 即: 从第二个单词开始, 每个单词的首字母都大写, 其它小写.
    例如: helloWorld, maxValue, zhangSanAge
  • 规范3: 蛇形命名法, 单词间用下划线隔开.
    例如: max_value, min_value, hello_world, zhang_san_age
  • 规范4: 串行命名法, 单词间用中划线隔开.
    例如: max-value, min-value, zhang-san-age

好用的起名网站,参考: https://unbug.github.io/codelf/

DWS之商品主题日统计宽表

  • 主题需求

    • 指标

      1
      2
      3
      下单次数、下单件数、下单金额、被支付次数、被支付件数、被支付金额、被退款次数、被退款件数、被退款金额、被加入购物车次数、被加入购物车件数、被收藏次数、好评数、中评数、差评数

      --总共15个指标
    • 维度

      1
      日期(day)+商品
  • 本主题建表操作

    注意:建表操作需要在hive中执行,presto不支持hive的建表语法。

Step1建表

在DWS层创建dws_sku_daycount

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
create table yp_dws.dws_sku_daycount 
(
dt STRING,
sku_id string comment 'sku_id',
sku_name string comment '商品名称',
order_count bigint comment '被下单次数',
order_num bigint comment '被下单件数',
order_amount decimal(38,2) comment '被下单金额',
payment_count bigint comment '被支付次数',
payment_num bigint comment '被支付件数',
payment_amount decimal(38,2) comment '被支付金额',
refund_count bigint comment '被退款次数',
refund_num bigint comment '被退款件数',
refund_amount decimal(38,2) comment '被退款金额',
cart_count bigint comment '被加入购物车次数',
cart_num bigint comment '被加入购物车件数',
favor_count bigint comment '被收藏次数',
evaluation_good_count bigint comment '好评数',
evaluation_mid_count bigint comment '中评数',
evaluation_bad_count bigint comment '差评数'
) COMMENT '每日商品行为'
--PARTITIONED BY(dt STRING)
ROW format delimited fields terminated BY '\t'
stored AS orc tblproperties ('orc.compress' = 'SNAPPY');

分析字段

分析我们要的数据来源于哪些表的哪些字段

说白了就是一个一个字段的对,DWS层的数据基本是来源于DWB层的,但是也是有可能来源一DWD层,虽然这样做会导致血缘关系混乱,但是做这的一切都是利于分析。

  • 指标: 下单次数、下单件数、下单金额、被支付次数、被支付件数、被支付金额、被退款次数、被退款件数、被退款金额、被加入购物车次数、被加入购物车件数、被收藏次数、好评数、中评数、差评数
  • 维度: 日期 + 商品(id, 名字)

分析结论

根据建表的字段,我们可以分析出

下单次数、下单件数、下单金额为一组,

被支付次数、被支付件数、被支付金额为一组,

被退款次数、被退款件数、被退款金额为一组等等…..

这里我们发现前三组的字段中都有相同的字段,可以放到一个order_base表中

dt, order_id, goods_id, goods_name
buy_num(购买商品的数量), total_price

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
1: 下单次数、下单件数、下单金额
表: yp_dwb.dwb_order_detail
字段: dt, order_id, goods_id, goods_name,
buy_num(购买商品的数量), total_price(购买商品的价格)
2: 被支付次数、被支付件数、被支付金额
表: yp_dwb.dwb_order_detail
字段: dt, order_id, goods_id, goods_name,
buy_num, total_price, is_pay(0:未支付, 1:已支付)
3: 被退款次数、被退款件数、被退款金额
表: yp_dwb.dwb_order_detail
字段: dt, order_id, goods_id, goods_name,
buy_num, total_price, refund_id(不为null说明有退款)
4: 被加入购物车次数、被加入购物车件数
表: yp_dwd.fact_shop_cart
字段: id, buy_num, goods_id, end_date='9999-99-99'
5: 被收藏次数
表: yp_dwd.fact_goods_collect
字段: id, goods_id, end_date='9999-99-99'
6: 好评数、中评数、差评数
表: yp_dwd.fact_goods_evaluation_detail
字段: dt, goods_id, geval_scores_goods(商品评分: 0 ~ 10分)
评分规则: >=9 好评, >6 <9 中评, <=6 差评

Step2下单支付退款等指标

准备数据源

准备数据源,用于计算: 下单次数、下单件数、下单金额、 被支付次数、被支付件数、被支付金额、 被退款次数、被退款件数、被退款金额

这里是写链式CTE表达式,先写 order_base,因为我们发现前三组的字段中都有相同的字段,所以我们可以把这些相同的字段放到一个order_base表中。

1
2
3
4
5
6
7
8
9
10
11
12
13
with order_base as (
select
dt, -- 订单日期, 维度
order_id, -- 订单id
goods_id, -- 商品id,
goods_name, -- 商品名
buy_num, -- 购买商品的数量,
total_price, -- 购买商品的总价格
is_pay, -- 标记是否已支付, 0:未支付, 1:已支付
refund_id, -- 标记是否有退款, 不为null, 说明有退款.
row_number() over(partition by order_id, goods_id) rn -- 根据订单id, 商品id去重.
from yp_dwb.dwb_order_detail
),

下单-次数-件数-金额

1
2
3
4
5
6
7
8
9
10
11
12
order_count as (
select
dt, -- 日期维度
goods_id as sku_id, -- 商品id, 维度
goods_name as sku_name, -- 商品名, 维度
count(order_id) as order_count, -- 被下单次数
sum(buy_num) as order_num, -- 被下单件数
sum(total_price) as order_amount -- 被下单金额
from order_base
where rn = 1
group by dt, goods_id, goods_name
)

被支付-次数-件数-金额

1
2
3
4
5
6
7
8
9
10
11
12
payment_count as (
select
dt, -- 日期维度
goods_id as sku_id, -- 商品id, 维度
goods_name as sku_name, -- 商品名, 维度
count(order_id) as payment_count, -- 被支付次数
sum(buy_num) as payment_num, -- 被支付件数
sum(total_price) as payment_amount -- 被支付金额
from order_base
where rn = 1 and is_pay=1 -- rn=1 去重, is_pay = 1 已支付
group by dt, goods_id, goods_name
),

被退款次数、件数、金额

1
2
3
4
5
6
7
8
9
10
11
12
refund_count as (
select
dt, -- 日期维度
goods_id as sku_id, -- 商品id, 维度
goods_name as sku_name, -- 商品名, 维度
count(order_id) as refund_count, -- 被退款次数
sum(buy_num) as refund_num, -- 被退款件数
sum(total_price) as refund_amount -- 被退款金额
from order_base
where rn = 1 and refund_id is not null -- rn=1 去重, refund_id is not null 不为空, 说明有退款
group by dt, goods_id, goods_name
),

Step3购物车收藏等指标

被加入购物车次数、件数

1
2
3
4
5
6
7
8
9
10
cart_count as (
select
substring(create_time, 1, 10) as dt, -- 日期
goods_id as sku_id, -- 商品id
count(id) as cart_count, -- 被加入购物车次数
sum(buy_num) as cart_num -- 被加入购物车件数
from yp_dwd.fact_shop_cart
where end_date='9999-99-99'
group by substring(create_time, 1, 10), goods_id
),

被收藏次数

1
2
3
4
5
6
7
8
9
favor_count as (
select
substring(create_time, 1, 10) as dt, -- 日期
goods_id as sku_id, -- 商品id
count(id) as favor_count -- 被收藏次数
from yp_dwd.fact_goods_collect
where end_date='9999-99-99'
group by substring(create_time, 1, 10), goods_id
),

Step4好中差评相关指标

1
2
3
4
5
6
7
8
9
10
evaluation_count as (
select
substring(create_time, 1, 10) as dt, -- 日期
goods_id as sku_id, -- 商品id
count(if(geval_scores_goods >= 9, 1, null)) as evaluation_good_count, -- 好评数, if(评分 >= 9, 1, null), 因为count(1)会被统计, count(null)会被忽略
count(if(geval_scores_goods > 6 and geval_scores_goods < 9 , 1, null)) as evaluation_mid_count, -- 中评数
count(if(geval_scores_goods <= 6, 1, null)) as evaluation_bad_count -- 差评数
from yp_dwd.fact_goods_evaluation_detail
group by substring(create_time, 1, 10), goods_id
),

Step5完整实现

对上述的6个结果做合并, 即: 合并 order_count, payment_count, refund_count, cart_count, favor_count, evaluation_count

最终发现一个Bug,重复的数据会有三行

union all和 full outer join的区别。

pCpEESs.png

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
insert into hive.yp_dws.dws_sku_daycount
--订单明细表抽取字段,并且进行去重,作为后续的base基础数据
with order_base as (select
dt,
order_id, --订单id
goods_id, --商品id
goods_name,--商品名称
buy_num,--购买商品数量
total_price,--商品总金额(数量*单价)
is_pay,--支付状态(1表示已经支付)
row_number() over(partition by order_id,goods_id) as rn
from yp_dwb.dwb_order_detail),

--下单次数、件数、金额统计
order_count as (select
dt,goods_id as sku_id,goods_name as sku_name,
count(order_id) order_count,
sum(buy_num) order_num,
sum(total_price) order_amount
from order_base where rn =1
group by dt,goods_id,goods_name),

--订单状态,已支付
pay_base as(
select *,
row_number() over(partition by order_id, goods_id) rn
from yp_dwb.dwb_order_detail
where is_pay=1
),

--支付次数、件数、金额统计
payment_count as(
select dt, goods_id sku_id, goods_name sku_name,
count(order_id) payment_count,
sum(buy_num) payment_num,
sum(total_price) payment_amount
from pay_base
where rn=1
group by dt, goods_id, goods_name
),

--退款次数、件数、金额统计
refund_base as(
select *,
row_number() over(partition by order_id, goods_id) rn
from yp_dwb.dwb_order_detail
where refund_id is not null
),
-- 退款次数、件数、金额
refund_count as (
select dt, goods_id sku_id, goods_name sku_name,
count(order_id) refund_count,
sum(buy_num) refund_num,
sum(total_price) refund_amount
from refund_base
where rn=1
group by dt, goods_id, goods_name
),

-- 购物车次数、件数
cart_count as (
select substring(create_time, 1, 10) dt, goods_id sku_id,
count(id) cart_count,
sum(buy_num) cart_num
from yp_dwd.fact_shop_cart
where end_date = '9999-99-99'
group by substring(create_time, 1, 10), goods_id
),
-- 收藏次数
favor_count as (
select substring(c.create_time, 1, 10) dt, goods_id sku_id,
count(c.id) favor_count
from yp_dwd.fact_goods_collect c
where end_date='9999-99-99'
group by substring(c.create_time, 1, 10), goods_id
),
-- 好评、中评、差评数量
evaluation_count as (
select substring(geval_addtime, 1, 10) dt, e.goods_id sku_id,
count(if(geval_scores_goods >= 9, 1, null)) evaluation_good_count,
count(if(geval_scores_goods >6 and geval_scores_goods < 9, 1, null)) evaluation_mid_count,
count(if(geval_scores_goods <= 6, 1, null)) evaluation_bad_count
from yp_dwd.fact_goods_evaluation_detail e
group by substring(geval_addtime, 1, 10), e.goods_id
),

--合并结果集
unionall as (
select
dt, sku_id, sku_name,
order_count,
order_num,
order_amount,
0 as payment_count,
0 as payment_num,
0 as payment_amount,
0 as refund_count,
0 as refund_num,
0 as refund_amount,
0 as cart_count,
0 as cart_num,
0 as favor_count,
0 as evaluation_good_count,
0 as evaluation_mid_count,
0 as evaluation_bad_count
from order_count
union all
select
dt, sku_id, sku_name,
0 order_count,
0 order_num,
0 order_amount,
payment_count,
payment_num,
payment_amount,
0 as refund_count,
0 as refund_num,
0 as refund_amount,
0 as cart_count,
0 as cart_num,
0 as favor_count,
0 as evaluation_good_count,
0 as evaluation_mid_count,
0 as evaluation_bad_count
from payment_count
union all
select
dt, sku_id, sku_name,
0 order_count,
0 order_num,
0 order_amount,
0 as payment_count,
0 as payment_num,
0 as payment_amount,
refund_count,
refund_num,
refund_amount,
0 as cart_count,
0 as cart_num,
0 as favor_count,
0 as evaluation_good_count,
0 as evaluation_mid_count,
0 as evaluation_bad_count
from refund_count
union all
select
dt, sku_id, null as sku_name,
0 order_count,
0 order_num,
0 order_amount,
0 as payment_count,
0 as payment_num,
0 as payment_amount,
0 as refund_count,
0 as refund_num,
0 as refund_amount,
cart_count,
cart_num,
0 as favor_count,
0 as evaluation_good_count,
0 as evaluation_mid_count,
0 as evaluation_bad_count
from cart_count
union all
select
dt, sku_id, null as sku_name,
0 order_count,
0 order_num,
0 order_amount,
0 as payment_count,
0 as payment_num,
0 as payment_amount,
0 as refund_count,
0 as refund_num,
0 as refund_amount,
0 as cart_count,
0 as cart_num,
favor_count,
0 as evaluation_good_count,
0 as evaluation_mid_count,
0 as evaluation_bad_count
from favor_count
union all
select
dt, sku_id, null as sku_name,
0 order_count,
0 order_num,
0 order_amount,
0 as payment_count,
0 as payment_num,
0 as payment_amount,
0 as refund_count,
0 as refund_num,
0 as refund_amount,
0 as cart_count,
0 as cart_num,
0 as favor_count,
evaluation_good_count,
evaluation_mid_count,
evaluation_bad_count
from evaluation_count
)

select
dt, sku_id, max(sku_name),
sum(order_count),
sum(order_num),
sum(order_amount),
sum(payment_count),
sum(payment_num),
sum(payment_amount),
sum(refund_count),
sum(refund_num),
sum(refund_amount),
sum(cart_count),
sum(cart_num),
sum(favor_count),
sum(evaluation_good_count),
sum(evaluation_mid_count),
sum(evaluation_bad_count)
from unionall
group by dt, sku_id
--order by dt, sku_id
;

这里union all合并之后,通过dt,sku_id分组,通过max(string),sum(int)函数的原因是将数据进行合并。

如果直接使用full outer join就不用分组然后再累加了。

这里需要说明一下假设有stu表

id name money
1 null 0
2 张三 100
1
2
select max(name) from stu;   -- 最终结果为张三,因为根据哈希值进行比较的
select sum(money) from stu; -- 最终结果为100,因为另一条数据值为null

另一种写法:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
----注意,如果session无法选中presto数据源----------
----把sql文件语法方言支持调整为Generic sql模式----------

insert into hive.yp_dws.dws_sku_daycount
with order_cnt as (select
dt ,pay_time ,apply_date ,goods_id , goods_name , order_state ,is_pay ,refund_id ,refund_state ,
order_id ,buy_num ,total_price ,
row_number() over(partition by order_id , goods_id) as rn1
from hive.yp_dwb.dwb_order_detail),

goods_order_1 as (select
dt,
goods_id,
goods_name,
count(order_id) as order_count,
sum(buy_num) as order_num,
sum(total_price) as order_amount
from order_cnt where rn1 = 1
group by goods_id,goods_name , dt) ,

goods_pay_2 as (select
substring(pay_time,1,10) as dt,
goods_id,
goods_name,
count(order_id) as payment_count,
sum(buy_num) as payment_num,
sum(total_price) as payment_amount
from order_cnt where rn1 = 1 and order_state not in(1,7) and is_pay=1
group by goods_id,goods_name, substring(pay_time,1,10)),

goods_refund_3 as(select
substring(apply_date,1,10) as dt,
goods_id,
goods_name,
count(order_id) as refund_count,
sum(buy_num) as refund_num,
sum(total_price) as refund_amount
from order_cnt where rn1 = 1 and refund_id is not null and refund_state = 5
group by goods_id , goods_name , substring(apply_date,1,10)),

goods_cart_4 as (select
substring(c.create_time ,1,10) as dt,
c.goods_id ,
d1.goods_name ,
count(c.id) as cart_count,
sum(c.buy_num) as cart_num
from hive.yp_dwd.fact_shop_cart c join hive.yp_dwb.dwb_goods_detail d1
on c.goods_id = d1.id
group by c.goods_id , d1.goods_name , substring(c.create_time ,1,10)),

goods_favor_5 as (select
substring(gc.create_time ,1,10) as dt,
gc.goods_id ,
d2.goods_name ,
count(gc.id) as favor_count
from hive.yp_dwd.fact_goods_collect gc join hive.yp_dwb.dwb_goods_detail d2
on gc.goods_id = d2.id
group by gc.goods_id , d2.goods_name , substring(gc.create_time ,1,10)) ,

goods_eval_6 as (select
substring(g.create_time ,1,10) as dt,
g.goods_id,
d3.goods_name,
-- 低于 6分 差评 , 6~8分(包含) 8以上好评
count(
if( g.geval_scores_goods is null OR g.geval_scores_goods > 8 , g.id ,null )
) as evaluation_good_count,
count(
if( g.geval_scores_goods is not null and g.geval_scores_goods between 6 and 8 , g.id ,null )
) as evaluation_mid_count,
count(
if( g.geval_scores_goods is not null and g.geval_scores_goods < 6 , g.id ,null )
) as evaluation_bad_count
from hive.yp_dwd.fact_goods_evaluation_detail g join hive.yp_dwb.dwb_goods_detail d3
on g.goods_id = d3.id
group by g.goods_id,d3.goods_name,substring(g.create_time ,1,10)),

temp as (select
coalesce(goods_order_1.dt,goods_pay_2.dt,goods_refund_3.dt,goods_cart_4.dt,goods_favor_5.dt,goods_eval_6.dt) as dt,
coalesce(goods_order_1.goods_id,goods_pay_2.goods_id,goods_refund_3.goods_id,goods_cart_4.goods_id,goods_favor_5.goods_id,goods_eval_6.goods_id) as sku_id,
coalesce(goods_order_1.goods_name,goods_pay_2.goods_name,goods_refund_3.goods_name,goods_cart_4.goods_name,goods_favor_5.goods_name,goods_eval_6.goods_name) as sku_name,

coalesce(goods_order_1.order_count,0) as order_count,
coalesce(goods_order_1.order_num,0) as order_num,
coalesce(goods_order_1.order_amount,0) as order_amount,

coalesce(goods_pay_2.payment_count,0) as payment_count,
coalesce(goods_pay_2.payment_num,0) as payment_num,
coalesce(goods_pay_2.payment_amount,0) as payment_amount,

coalesce(goods_refund_3.refund_count,0) as refund_count,
coalesce(goods_refund_3.refund_num,0) as refund_num,
coalesce(goods_refund_3.refund_amount,0) as refund_amount,

coalesce(goods_cart_4.cart_count,0) as cart_count,
coalesce(goods_cart_4.cart_num,0) as cart_num,

coalesce(goods_favor_5.favor_count,0) as favor_count,

coalesce(goods_eval_6.evaluation_good_count,0) as evaluation_good_count,
coalesce(goods_eval_6.evaluation_mid_count,0) as evaluation_mid_count,
coalesce(goods_eval_6.evaluation_bad_count,0) as evaluation_bad_count
from goods_order_1
full join goods_pay_2 on goods_order_1.goods_id=goods_pay_2.goods_id and goods_order_1.dt = goods_pay_2.dt
full join goods_refund_3 on goods_order_1.goods_id = goods_refund_3.goods_id and goods_order_1.dt = goods_refund_3.dt
full join goods_cart_4 on goods_order_1.goods_id = goods_cart_4.goods_id and goods_order_1.dt = goods_cart_4.dt
full join goods_favor_5 on goods_order_1.goods_id = goods_favor_5.goods_id and goods_order_1.dt = goods_favor_5.dt
full join goods_eval_6 on goods_order_1.goods_id = goods_eval_6.goods_id and goods_order_1.dt = goods_eval_6.dt)

select
dt,
sku_id, sku_name,
sum(order_count),
sum(order_num),
sum(order_amount),
sum(payment_count),
sum(payment_num),
sum(payment_amount),
sum(refund_count),
sum(refund_num),
sum(refund_amount),
sum(cart_count),
sum(cart_num),
sum(favor_count),
sum(evaluation_good_count),
sum(evaluation_mid_count),
sum(evaluation_bad_count)
from temp
group by sku_id, sku_name,dt;

模板

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
with order_base as (
select
dt, -- 订单日期, 维度
order_id, -- 订单id
goods_id, -- 商品id,
goods_name, -- 商品名
buy_num, -- 购买商品的数量,
total_price, -- 购买商品的总价格
is_pay, -- 标记是否已支付, 0:未支付, 1:已支付
refund_id, -- 标记是否有退款, 不为null, 说明有退款.
row_number() over(partition by order_id, goods_id) rn -- 根据订单id, 商品id去重.
from yp_dwb.dwb_order_detail
),
order_count as (
select
日期,商品id,商品名,被下单次数,被下单件数,被下单金额
from order_base
where rn = 1 -- 按照订单id,商品id进行去重
group by dt, goods_id, goods_name
),
payment_count as (
select
日期,商品id,商品名 ,被支付次数,被支付件数,被支付金额
from order_base
where rn = 1 and is_pay=1 -- rn=1 去重, is_pay = 1 已支付
group by dt, goods_id, goods_name
),
refund_count as (
select
日期,商品id,商品名 ,被退款次数,被退款件数,被退款金额
from order_base
where rn = 1 and refund_id is not null -- rn=1 去重, refund_id is not null 不为空, 说明有退款
group by dt, goods_id, goods_name
),
cart_count as (
select
日期,商品id,被加入购物车次数,被加入购物车件数
where end_date='9999-99-99'
group by substring(create_time, 1, 10), goods_id
),
favor_count as (
select .....
from yp_dwd.fact_goods_collect
where end_date='9999-99-99'
group by substring(create_time, 1, 10), goods_id
),
evaluation_count as (
select 日期,商品id,好中差评个数
from yp_dwd.fact_goods_evaluation_detail
group by substring(create_time, 1, 10), goods_id
),
unionall as (
select 日期,商品id,商品名,被下单次数,被下单件数,被下单金额 (空白字段用null填充)
from order_count
union all
select ...
from payment_count
union all
select ......
)
select
dt, sku_id, max(sku_name),
sum(order_count),
sum(order_num),
sum(order_amount),
sum(payment_count),
sum(payment_num),
sum(payment_amount),
sum(refund_count),
sum(refund_num),
sum(refund_amount),
sum(cart_count),
sum(cart_num),
sum(favor_count),
sum(evaluation_good_count),
sum(evaluation_mid_count),
sum(evaluation_bad_count)
from unionall
group by dt, sku_id
--order by dt, sku_id

DWS之用户主题日统计宽表

建表:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
create table yp_dws.dws_user_daycount
(
dt STRING,
user_id string comment '用户 id',
login_count bigint comment '登录次数',
store_collect_count bigint comment '店铺收藏数量',
goods_collect_count bigint comment '商品收藏数量',
cart_count bigint comment '加入购物车次数',
cart_amount decimal(38,2) comment '加入购物车金额',
order_count bigint comment '下单次数',
order_amount decimal(38,2) comment '下单金额',
payment_count bigint comment '支付次数',
payment_amount decimal(38,2) comment '支付金额'
) COMMENT '每日用户行为'
--PARTITIONED BY(dt STRING)
ROW format delimited fields terminated BY '\t'
stored AS orc tblproperties ('orc.compress' = 'SNAPPY');

最终实现:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
insert into yp_dws.dws_user_daycount
-- 登录次数
with login_count as (
select
count(id) as login_count,
login_user as user_id, dt
from yp_dwd.fact_user_login
group by login_user, dt
),
-- 店铺收藏数
store_collect_count as (
select
count(id) as store_collect_count,
user_id, substring(create_time, 1, 10) as dt
from yp_dwd.fact_store_collect
where end_date='9999-99-99'
group by user_id, substring(create_time, 1, 10)
),
-- 商品收藏数
goods_collect_count as (
select
count(id) as goods_collect_count,
user_id, substring(create_time, 1, 10) as dt
from yp_dwd.fact_goods_collect
where end_date='9999-99-99'
group by user_id, substring(create_time, 1, 10)
),
-- 加入购物车次数和金额
cart_count_amount as (
select
count(cart.id) as cart_count,
sum(g.goods_promotion_price) as cart_amount,
buyer_id as user_id, substring(cart.create_time, 1, 10) as dt
from yp_dwd.fact_shop_cart cart, yp_dwb.dwb_goods_detail g
where cart.end_date='9999-99-99' and cart.goods_id=g.id
group by buyer_id, substring(cart.create_time, 1, 10)
),
-- 下单次数和金额
order_count_amount as (
select
count(o.id) as order_count,
sum(order_amount) as order_amount,
buyer_id as user_id, substring(create_date, 1, 10) as dt
from yp_dwd.fact_shop_order o, yp_dwd.fact_shop_order_address_detail od
where o.id=od.id
and o.is_valid=1 and o.end_date='9999-99-99' and od.end_date='9999-99-99'
group by buyer_id, substring(create_date, 1, 10)
),
-- 支付次数和金额
payment_count_amount as (
select
count(id) as payment_count,
sum(trade_true_amount) as payment_amount,
user_id, substring(create_time, 1, 10) as dt
from yp_dwd.fact_trade_record
where is_valid=1 and trade_type in (1,11) and status=1
group by user_id, substring(create_time, 1, 10)
)
select
-- dt,
user_id,
-- 登录次数
sum(login_count) login_count,
-- 店铺收藏数
sum(store_collect_count) store_collect_count,
-- 商品收藏数
sum(goods_collect_count) goods_collect_count,
-- 加入购物车次数和金额
sum(cart_count) cart_count,
sum(cart_amount) cart_amount,
-- 下单次数和金额
sum(order_count) order_count,
sum(order_amount) order_amount,
-- 支付次数和金额
sum(payment_count) payment_count,
sum(payment_amount) payment_amount
,dt
from
(
select lc.login_count,
0 store_collect_count,
0 goods_collect_count,
0 cart_count, 0 cart_amount,
0 order_count, 0 order_amount,
0 payment_count, 0 payment_amount,
user_id, dt
from login_count lc
union all
select
0 login_count,
scc.store_collect_count,
0 goods_collect_count,
0 cart_count, 0 cart_amount,
0 order_count, 0 order_amount,
0 payment_count, 0 payment_amount,
user_id, dt
from store_collect_count scc
union all
select
0 login_count,
0 store_collect_count,
gcc.goods_collect_count,
0 cart_count, 0 cart_amount,
0 order_count, 0 order_amount,
0 payment_count, 0 payment_amount,
user_id, dt
from goods_collect_count gcc
union all
select
0 login_count,
0 store_collect_count,
0 goods_collect_count,
cca.cart_count, cart_amount,
0 order_count, 0 order_amount,
0 payment_count, 0 payment_amount,
user_id, dt
from cart_count_amount cca
union all
select
0 login_count,
0 store_collect_count,
0 goods_collect_count,
0 cart_count, 0 cart_amount,
oca.order_count, order_amount,
0 payment_count, 0 payment_amount,
user_id, dt
from order_count_amount oca
union all
select
0 login_count,
0 store_collect_count,
0 goods_collect_count,
0 cart_count, 0 cart_amount,
0 order_count, 0 order_amount,
pca.payment_count, payment_amount,
user_id, dt
from payment_count_amount pca
) user_count
group by user_id, dt
;

索引

索引是什么

在数据库中,索引指的是提供指向存储在表的指定列中的数据值的指针(地址),数据库使用索引以找到特定值,然后顺指针找到包含该值的行。

  • 索引的优点
    • 大大提高查询的效率
  • 索引的缺点
    • 降低更新表的速度,如对表进行INSERT、UPDATE和DELETE。因为更新表时,MySQL不仅要保存数据,还要保存一下索引文件。MySQL支持自动更新索引。
    • 建立索引会占用磁盘空间的索引文件。

索引使用的注意事项

  • 当创建索引之后,根据具有索引的字段查询,效率才能体现。

MySQL索引

MySQL的索引分类

  • 普通索引

    普通索引是最基本的索引,它没有任何限制。

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    CREATE INDEX indexName ON table_name (column_name);

    --也可以在建表的时候同时指定索引
    CREATE TABLE `table` (
    `id` int(11) NOT NULL AUTO_INCREMENT ,
    `title` char(255) CHARACTER NOT NULL ,
    `content` text CHARACTER NULL ,
    `time` int(10) NULL DEFAULT NULL ,
    PRIMARY KEY (`id`),
    INDEX index_name (title(length))
    )

    --length为可选参数,表示索引的长度,只有字符串类型的字段才能指定索引长度
  • 唯一索引

    与普通索引类似,不同的就是:==索引列的值必须唯一==,但允许有空值。

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    CREATE UNIQUE INDEX indexName ON table(column(length));

    --建表的时候创建索引
    CREATE TABLE `table` (
    `id` int(11) NOT NULL AUTO_INCREMENT ,
    `title` char(255) CHARACTER NOT NULL ,
    `content` text CHARACTER NULL ,
    `time` int(10) NULL DEFAULT NULL ,
    UNIQUE indexName (title(length))
    );
  • 主键索引

    一种特殊的唯一索引,一个表只能有一个主键,不允许有空值。

    一般是在建表的时候同时创建主键索引。

    1
    2
    3
    4
    5
    CREATE TABLE `table` (
    `id` int(11) NOT NULL AUTO_INCREMENT ,
    `title` char(255) NOT NULL ,
    PRIMARY KEY (`id`)
    );
  • 组合索引

指多个字段上创建的索引,只有在查询条件中使用了创建索引时的第一个字段,索引才会被使用。

使用组合索引时==遵循最左前缀==集合。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
ALTER TABLE mytable ADD INDEX name_city_age (name(10),city,age);

--建表时,name长度为16,这里用10。这是因为一般情况下名字的长度不会超过10,这样会加速索引查询速度,还会减少索引文件的大小,提高INSERT的更新速度。

--建立这样的组合索引,其实是相当于分别建立了下面三组组合MySQL数据库索引
name,city,age
name,city
name

--为什么没有 city,age这样的组合索引呢?这是因为MySQL组合索引“最左前缀”的结果。简单的理解就是只从最左面的开始组合。并不是只要包含这三列的查询都会用到该组合索引,下面的几个SQL就会用到这个组合MySQL数据库索引:

SELECT * FROM mytable WHERE name="admin" AND city="郑州"

SELECT * FROM mytable WHERE name="admin"

--而下面几个则不会用到:
SELECT * FROM mytable WHERE city="郑州" AND name="admin"
SELECT * FROM mytable WHERE age=20 AND city="郑州"
SELECT * FROM mytable WHERE city="郑州"

Hive索引

Hive 0.7版本之后,开始支持索引,但是功能很弱。

Hive3.0版本之后,直接移除了索引

Hive中如何使用索引:

  • 使用物化视图 自动更新
  • 使用ORC、Parquet等格式文件。这些文件格式本身列式存储,内部特性支持查询效率的提高。

ORC之行组索引

Row Group Index

  • Row Group Index行组索引

    ORC为每个stripe建立的包含min/max值的索引,就称为Row Group Index,也叫min-max Index,或者Storage Index。

    1
    2
    3
    4
    5
    1、在建立ORC格式表时,指定表参数’orc.create.index’=’true’之后,便会建立Row Group Index;

    2、需要注意的是,为了使Row Group Index有效利用,向表中加载数据时,必须对需要使用索引的字段进行排序,否则,min/max会失去意义。

    3、另外,这种索引通常用于数值型字段的查询过滤优化上。
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    --CTAS
    CREATE TABLE t_text_2 stored AS ORC
    TBLPROPERTIES
    ('orc.compress'='SNAPPY',
    'orc.create.index'='true', --开启行组索引
    'orc.stripe.size'='10485760',
    'orc.row.index.stride'='10000') --索引条目之间的行数(必须> = 1000)
    AS
    SELECT xxxx
    FROM t_text_1
    DISTRIBUTE BY id sort BY id;

    --参数hive.optimize.index.filte 表示是否自动使用索引,默认为false(不使用);
    --如果不设置该参数为true,那么ORC的索引当然也不会使用。

    --执行sql之前,为了使用索引,应该设置下面的参数
    set hive.optimize.index.filter=true;

ORC之Bloom Filter过滤器

Bloom Filter Index

  • hive中布隆过滤器索引

    在建表时候,通过表参数==orc.bloom.filter.columns=”pcid”==来指定为那些字段建立BloomFilter索引;

    这样,在生成数据的时候,会在每个stripe中,为该字段建立BloomFilter的数据结构;

    当查询条件中包含==对该字段的=号过滤时候,先从BloomFilter中获取以下是否包含该值==,如果不包含,则跳过该stripe。

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    CREATE TABLE t_text_2 stored AS ORC 
    TBLPROPERTIES
    ('orc.compress'='SNAPPY',
    'orc.create.index'='true',
    "orc.bloom.filter.columns"="pcid", --布隆过滤器 对pcid进行索引
    'orc.stripe.size'='10485760',
    'orc.row.index.stride'='10000')
    AS
    SELECT xxxx
    FROM t_text_1
    DISTRIBUTE BY id sort BY id;


    --执行查询
    SET hive.optimize.index.filter=true;

    SELECT COUNT(1) FROM t_text_2 WHERE id >= 0 AND id <= 1000
    AND pcid IN ('0005E26F0DCCDB56F9041C','A');