本次主要是干DWS层的另外两个宽表，商品主题日统计宽表和用户主题日统计宽表。因为是销售商品给用户，而且销售主题日统计宽表在上次已经干完了，这另外两个宽表基本换汤不换药。还有MySQL的索引之类的一些东西，让我认识到SQL也是有性能(效率)的评级的，网上可以搜到MySQL的评级分类，以后基本将自己的SQL优化到const这种效率的SQL就很牛了。

Hive相关配置参数

--分区
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
--分桶
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
--并行执行
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
--小文件合并
-- set mapred.max.split.size=2147483648;
-- set mapred.min.split.size.per.node=1000000000;
-- set mapred.min.split.size.per.rack=1000000000;
--矢量化查询
set hive.vectorized.execution.enabled=true;
--关联优化器
set hive.optimize.correlation=true;
--读取零拷贝
set hive.exec.orc.zerocopy=true;
--join数据倾斜
set hive.optimize.skewjoin=true;
-- set hive.skewjoin.key=100000;
set hive.optimize.skewjoin.compiletime=true;
set hive.optimize.union.remove=true;
-- group倾斜
set hive.groupby.skewindata=false;

变量命名规则

建表语法需要在Hive中写, 因为Presto不支持.
给变量起名主要有四种规范, 大多数语言都支持前两种, 后两种, 部分语言支持, 部分语言不支持.

规范1: 大驼峰命名法, 也叫双峰驼命名法, 即: 每个单词的首字母都大写, 其它小写.
例如: HelloWorld, MaxValue

规范2: 小驼峰命名法, 也叫单峰驼命名法, 即: 从第二个单词开始, 每个单词的首字母都大写, 其它小写.
例如: helloWorld, maxValue, zhangSanAge

规范3: 蛇形命名法, 单词间用下划线隔开.
例如: max_value, min_value, hello_world, zhang_san_age

规范4: 串行命名法, 单词间用中划线隔开.
例如: max-value, min-value, zhang-san-age

好用的起名网站,参考: https://unbug.github.io/codelf/

DWS之商品主题日统计宽表

主题需求

指标

1
2
3

下单次数、下单件数、下单金额、被支付次数、被支付件数、被支付金额、被退款次数、被退款件数、被退款金额、被加入购物车次数、被加入购物车件数、被收藏次数、好评数、中评数、差评数

--总共15个指标

维度
1
日期（day）+商品

本主题建表操作

注意：建表操作需要在hive中执行，presto不支持hive的建表语法。

Step1建表

在DWS层创建dws_sku_daycount

create table yp_dws.dws_sku_daycount 
(
    dt STRING,
    sku_id string comment 'sku_id',
    sku_name string comment '商品名称',
    order_count bigint comment '被下单次数',
    order_num bigint comment '被下单件数',
    order_amount decimal(38,2) comment '被下单金额',
    payment_count bigint  comment '被支付次数',
    payment_num bigint comment '被支付件数',
    payment_amount decimal(38,2) comment '被支付金额',
    refund_count bigint  comment '被退款次数',
    refund_num bigint comment '被退款件数',
    refund_amount  decimal(38,2) comment '被退款金额',
    cart_count bigint comment '被加入购物车次数',
    cart_num bigint comment '被加入购物车件数',
    favor_count bigint comment '被收藏次数',
    evaluation_good_count bigint comment '好评数',
    evaluation_mid_count bigint comment '中评数',
    evaluation_bad_count bigint comment '差评数'
) COMMENT '每日商品行为'
--PARTITIONED BY(dt STRING)
ROW format delimited fields terminated BY '\t'
stored AS orc tblproperties ('orc.compress' = 'SNAPPY');

分析字段

分析我们要的数据来源于哪些表的哪些字段

说白了就是一个一个字段的对，DWS层的数据基本是来源于DWB层的，但是也是有可能来源一DWD层，虽然这样做会导致血缘关系混乱，但是做这的一切都是利于分析。

指标: 下单次数、下单件数、下单金额、被支付次数、被支付件数、被支付金额、被退款次数、被退款件数、被退款金额、被加入购物车次数、被加入购物车件数、被收藏次数、好评数、中评数、差评数

维度: 日期 + 商品(id, 名字)

分析结论

根据建表的字段，我们可以分析出

下单次数、下单件数、下单金额为一组，

被支付次数、被支付件数、被支付金额为一组，

被退款次数、被退款件数、被退款金额为一组等等…..

这里我们发现前三组的字段中都有相同的字段,可以放到一个order_base表中

dt, order_id, goods_id, goods_name
buy_num(购买商品的数量), total_price

组1: 下单次数、下单件数、下单金额
    表:  yp_dwb.dwb_order_detail
    字段: dt, order_id, goods_id, goods_name,
          buy_num(购买商品的数量), total_price(购买商品的价格)
组2: 被支付次数、被支付件数、被支付金额
    表:   yp_dwb.dwb_order_detail
    字段:  dt, order_id, goods_id, goods_name,
          buy_num, total_price, is_pay(0:未支付, 1:已支付)
组3: 被退款次数、被退款件数、被退款金额
    表:   yp_dwb.dwb_order_detail
    字段:  dt, order_id, goods_id, goods_name,
          buy_num, total_price, refund_id(不为null说明有退款)
组4: 被加入购物车次数、被加入购物车件数
    表:   yp_dwd.fact_shop_cart
    字段: id, buy_num, goods_id, end_date='9999-99-99'
组5: 被收藏次数
    表:  yp_dwd.fact_goods_collect
    字段: id, goods_id, end_date='9999-99-99'
组6: 好评数、中评数、差评数
    表:  yp_dwd.fact_goods_evaluation_detail
    字段: dt, goods_id, geval_scores_goods(商品评分: 0 ~ 10分)
        评分规则: >=9 好评,    >6 <9 中评,  <=6 差评

Step2下单支付退款等指标

准备数据源

准备数据源，用于计算: 下单次数、下单件数、下单金额、被支付次数、被支付件数、被支付金额、被退款次数、被退款件数、被退款金额

这里是写链式CTE表达式，先写 order_base，因为我们发现前三组的字段中都有相同的字段，所以我们可以把这些相同的字段放到一个order_base表中。

with order_base as (
    select
        dt,             -- 订单日期, 维度
        order_id,       -- 订单id
        goods_id,       -- 商品id,
        goods_name,     -- 商品名
        buy_num,        -- 购买商品的数量,
        total_price,    -- 购买商品的总价格
        is_pay,         -- 标记是否已支付, 0:未支付, 1:已支付
        refund_id,      -- 标记是否有退款, 不为null, 说明有退款.
        row_number() over(partition by order_id, goods_id) rn   -- 根据订单id, 商品id去重.
    from yp_dwb.dwb_order_detail
),

下单-次数-件数-金额

order_count as (
    select
           dt,                              -- 日期维度
           goods_id as sku_id,              -- 商品id, 维度
           goods_name as sku_name,          -- 商品名, 维度
           count(order_id) as order_count,  -- 被下单次数
           sum(buy_num) as order_num,       -- 被下单件数
           sum(total_price) as order_amount -- 被下单金额
    from order_base
    where rn = 1
    group by dt, goods_id, goods_name
)

被支付-次数-件数-金额

payment_count as (
    select
           dt,                              -- 日期维度
           goods_id as sku_id,              -- 商品id, 维度
           goods_name as sku_name,          -- 商品名, 维度
           count(order_id) as payment_count,  -- 被支付次数
           sum(buy_num) as payment_num,       -- 被支付件数
           sum(total_price) as payment_amount -- 被支付金额
    from order_base
    where rn = 1 and is_pay=1       -- rn=1 去重, is_pay = 1 已支付
    group by dt, goods_id, goods_name
),

被退款次数、件数、金额

refund_count as (
    select
           dt,                              -- 日期维度
           goods_id as sku_id,              -- 商品id, 维度
           goods_name as sku_name,          -- 商品名, 维度
           count(order_id) as refund_count,  -- 被退款次数
           sum(buy_num) as refund_num,       -- 被退款件数
           sum(total_price) as refund_amount -- 被退款金额
    from order_base
    where rn = 1 and refund_id is not null    -- rn=1 去重, refund_id is not null 不为空, 说明有退款
    group by dt, goods_id, goods_name
),

Step3购物车收藏等指标

被加入购物车次数、件数

cart_count as (
    select
        substring(create_time, 1, 10) as dt,    -- 日期
        goods_id as sku_id,                     -- 商品id
        count(id) as cart_count,                -- 被加入购物车次数
        sum(buy_num) as cart_num                -- 被加入购物车件数
    from yp_dwd.fact_shop_cart
    where end_date='9999-99-99'
    group by substring(create_time, 1, 10), goods_id
),

被收藏次数

favor_count as (
    select
        substring(create_time, 1, 10) as dt,    -- 日期
        goods_id as sku_id,                     -- 商品id
        count(id) as favor_count                -- 被收藏次数
    from yp_dwd.fact_goods_collect
    where end_date='9999-99-99'
    group by substring(create_time, 1, 10), goods_id
),

Step4好中差评相关指标

evaluation_count as (
    select
        substring(create_time, 1, 10) as dt,    -- 日期
        goods_id as sku_id,                     -- 商品id
        count(if(geval_scores_goods >= 9, 1, null)) as evaluation_good_count, -- 好评数, if(评分 >= 9, 1, null), 因为count(1)会被统计, count(null)会被忽略
        count(if(geval_scores_goods > 6 and geval_scores_goods < 9 , 1, null)) as evaluation_mid_count, -- 中评数
        count(if(geval_scores_goods <= 6, 1, null)) as evaluation_bad_count -- 差评数
    from yp_dwd.fact_goods_evaluation_detail
    group by substring(create_time, 1, 10), goods_id
),

Step5完整实现

对上述的6个结果做合并, 即: 合并 order_count, payment_count, refund_count, cart_count, favor_count, evaluation_count

最终发现一个Bug，重复的数据会有三行

union all和 full outer join的区别。

insert into hive.yp_dws.dws_sku_daycount
--订单明细表抽取字段，并且进行去重，作为后续的base基础数据
with order_base as (select
    dt,
    order_id, --订单id
    goods_id, --商品id
    goods_name,--商品名称
    buy_num,--购买商品数量
    total_price,--商品总金额（数量*单价）
    is_pay,--支付状态（1表示已经支付）
    row_number() over(partition by order_id,goods_id) as rn
from yp_dwb.dwb_order_detail),

--下单次数、件数、金额统计
order_count as (select
    dt,goods_id as sku_id,goods_name as sku_name,
    count(order_id) order_count,
    sum(buy_num) order_num,
    sum(total_price) order_amount
from order_base where rn =1
group by dt,goods_id,goods_name),

--订单状态,已支付
pay_base as(
    select *,
    row_number() over(partition by order_id, goods_id) rn
    from yp_dwb.dwb_order_detail
    where is_pay=1
),

--支付次数、件数、金额统计
payment_count as(
    select dt, goods_id sku_id, goods_name sku_name,
       count(order_id) payment_count,
       sum(buy_num) payment_num,
       sum(total_price) payment_amount
    from pay_base
    where rn=1
    group by dt, goods_id, goods_name
),

--退款次数、件数、金额统计
refund_base as(
    select *,
       row_number() over(partition by order_id, goods_id) rn
    from yp_dwb.dwb_order_detail
    where refund_id is not null
),
-- 退款次数、件数、金额
refund_count as (
    select dt, goods_id sku_id, goods_name sku_name,
       count(order_id) refund_count,
       sum(buy_num) refund_num,
       sum(total_price) refund_amount
    from refund_base
    where rn=1
    group by dt, goods_id, goods_name
),

-- 购物车次数、件数
cart_count as (
    select substring(create_time, 1, 10) dt, goods_id sku_id,
           count(id) cart_count,
            sum(buy_num) cart_num
    from yp_dwd.fact_shop_cart
    where end_date = '9999-99-99'
    group by substring(create_time, 1, 10), goods_id
),
-- 收藏次数
favor_count as (
    select substring(c.create_time, 1, 10) dt, goods_id sku_id,
           count(c.id) favor_count
    from yp_dwd.fact_goods_collect c
    where end_date='9999-99-99'
    group by substring(c.create_time, 1, 10), goods_id
),
-- 好评、中评、差评数量
evaluation_count as (
    select substring(geval_addtime, 1, 10) dt, e.goods_id sku_id,
           count(if(geval_scores_goods >= 9, 1, null)) evaluation_good_count,
           count(if(geval_scores_goods >6 and geval_scores_goods < 9, 1, null)) evaluation_mid_count,
           count(if(geval_scores_goods <= 6, 1, null)) evaluation_bad_count
    from yp_dwd.fact_goods_evaluation_detail e
    group by substring(geval_addtime, 1, 10), e.goods_id
),

--合并结果集
unionall as (
    select
        dt, sku_id, sku_name,
        order_count,
        order_num,
        order_amount,
        0 as payment_count,
        0 as payment_num,
        0 as payment_amount,
        0 as refund_count,
        0 as refund_num,
        0 as refund_amount,
        0 as cart_count,
        0 as cart_num,
        0 as favor_count,
        0 as evaluation_good_count,
        0 as evaluation_mid_count,
        0 as evaluation_bad_count
    from order_count
    union all
    select
        dt, sku_id, sku_name,
        0 order_count,
        0 order_num,
        0 order_amount,
        payment_count,
        payment_num,
        payment_amount,
        0 as refund_count,
        0 as refund_num,
        0 as refund_amount,
        0 as cart_count,
        0 as cart_num,
        0 as favor_count,
        0 as evaluation_good_count,
        0 as evaluation_mid_count,
        0 as evaluation_bad_count
    from payment_count
    union all
    select
        dt, sku_id, sku_name,
        0 order_count,
        0 order_num,
        0 order_amount,
        0 as payment_count,
        0 as payment_num,
        0 as payment_amount,
        refund_count,
        refund_num,
        refund_amount,
        0 as cart_count,
        0 as cart_num,
        0 as favor_count,
        0 as evaluation_good_count,
        0 as evaluation_mid_count,
        0 as evaluation_bad_count
    from refund_count
    union all
    select
        dt, sku_id, null as sku_name,
        0 order_count,
        0 order_num,
        0 order_amount,
        0 as payment_count,
        0 as payment_num,
        0 as payment_amount,
        0 as refund_count,
        0 as refund_num,
        0 as refund_amount,
        cart_count,
        cart_num,
        0 as favor_count,
        0 as evaluation_good_count,
        0 as evaluation_mid_count,
        0 as evaluation_bad_count
    from cart_count
    union all
    select
        dt, sku_id, null as sku_name,
        0 order_count,
        0 order_num,
        0 order_amount,
        0 as payment_count,
        0 as payment_num,
        0 as payment_amount,
        0 as refund_count,
        0 as refund_num,
        0 as refund_amount,
        0 as cart_count,
        0 as cart_num,
        favor_count,
        0 as evaluation_good_count,
        0 as evaluation_mid_count,
        0 as evaluation_bad_count
    from favor_count
    union all
    select
        dt, sku_id, null as sku_name,
        0 order_count,
        0 order_num,
        0 order_amount,
        0 as payment_count,
        0 as payment_num,
        0 as payment_amount,
        0 as refund_count,
        0 as refund_num,
        0 as refund_amount,
        0 as cart_count,
        0 as cart_num,
        0 as favor_count,
        evaluation_good_count,
        evaluation_mid_count,
        evaluation_bad_count
    from evaluation_count
)

select
    dt, sku_id, max(sku_name),
    sum(order_count),
    sum(order_num),
    sum(order_amount),
    sum(payment_count),
    sum(payment_num),
    sum(payment_amount),
    sum(refund_count),
    sum(refund_num),
    sum(refund_amount),
    sum(cart_count),
    sum(cart_num),
    sum(favor_count),
    sum(evaluation_good_count),
    sum(evaluation_mid_count),
    sum(evaluation_bad_count)
from unionall
group by dt, sku_id
--order by dt, sku_id
;

这里union all合并之后，通过dt，sku_id分组，通过max(string)，sum(int)函数的原因是将数据进行合并。

如果直接使用full outer join就不用分组然后再累加了。

这里需要说明一下假设有stu表

id name money

1 null 0

2 张三 100
1
2
select max(name) from stu;   -- 最终结果为张三，因为根据哈希值进行比较的
select sum(money) from stu;  -- 最终结果为100，因为另一条数据值为null

id	name	money
1	null	0
2	张三	100

另一种写法：

----注意，如果session无法选中presto数据源----------
----把sql文件语法方言支持调整为Generic sql模式----------

insert into hive.yp_dws.dws_sku_daycount
with order_cnt as (select
	dt ,pay_time ,apply_date ,goods_id , goods_name , order_state ,is_pay ,refund_id ,refund_state ,
	order_id ,buy_num ,total_price ,
	row_number() over(partition by order_id , goods_id) as rn1
from hive.yp_dwb.dwb_order_detail),

goods_order_1  as (select
	dt,
	goods_id,
	goods_name,
	count(order_id) as order_count,
	sum(buy_num) as order_num,
	sum(total_price) as order_amount
from order_cnt where rn1 = 1
group by goods_id,goods_name , dt) ,

goods_pay_2 as (select
	substring(pay_time,1,10) as dt,
	goods_id,
	goods_name,
	count(order_id) as payment_count,
	sum(buy_num) as payment_num,
	sum(total_price) as payment_amount
from order_cnt where rn1 = 1 and  order_state not in(1,7) and is_pay=1
group by goods_id,goods_name, substring(pay_time,1,10)),

goods_refund_3 as(select
	substring(apply_date,1,10) as dt,
	goods_id,
	goods_name,
	count(order_id) as refund_count,
	sum(buy_num) as refund_num,
	sum(total_price) as refund_amount
from order_cnt where rn1 = 1 and refund_id is not null and refund_state = 5
group by  goods_id , goods_name , substring(apply_date,1,10)),

goods_cart_4 as (select
	 substring(c.create_time ,1,10) as  dt,
	 c.goods_id  ,
	 d1.goods_name ,
	 count(c.id) as cart_count,
	 sum(c.buy_num) as cart_num
from hive.yp_dwd.fact_shop_cart c join hive.yp_dwb.dwb_goods_detail  d1
	on c.goods_id  = d1.id
group by c.goods_id  , d1.goods_name , substring(c.create_time ,1,10)),

goods_favor_5 as (select
	 substring(gc.create_time ,1,10) as dt,
	 gc.goods_id ,
	 d2.goods_name ,
	 count(gc.id) as favor_count
from hive.yp_dwd.fact_goods_collect gc join hive.yp_dwb.dwb_goods_detail  d2
	on gc.goods_id  = d2.id
group  by gc.goods_id , d2.goods_name , substring(gc.create_time ,1,10)) ,

goods_eval_6 as (select
	substring(g.create_time ,1,10)  as dt,
	g.goods_id,
	d3.goods_name,
	-- 低于 6分  差评 , 6~8分(包含)  8以上好评
	count(
		if( g.geval_scores_goods is null OR g.geval_scores_goods > 8 , g.id ,null )
		) as evaluation_good_count,
	count(
		if( g.geval_scores_goods is not null and g.geval_scores_goods between 6 and 8 , g.id ,null )
		) as evaluation_mid_count,
	count(
		if( g.geval_scores_goods is not null and g.geval_scores_goods < 6 , g.id ,null )
		) as evaluation_bad_count
from hive.yp_dwd.fact_goods_evaluation_detail  g join hive.yp_dwb.dwb_goods_detail  d3
	on  g.goods_id  = d3.id
group by  g.goods_id,d3.goods_name,substring(g.create_time ,1,10)),

temp as (select
	coalesce(goods_order_1.dt,goods_pay_2.dt,goods_refund_3.dt,goods_cart_4.dt,goods_favor_5.dt,goods_eval_6.dt) as dt,
	coalesce(goods_order_1.goods_id,goods_pay_2.goods_id,goods_refund_3.goods_id,goods_cart_4.goods_id,goods_favor_5.goods_id,goods_eval_6.goods_id) as sku_id,
	coalesce(goods_order_1.goods_name,goods_pay_2.goods_name,goods_refund_3.goods_name,goods_cart_4.goods_name,goods_favor_5.goods_name,goods_eval_6.goods_name) as sku_name,

	coalesce(goods_order_1.order_count,0) as order_count,
	coalesce(goods_order_1.order_num,0) as order_num,
	coalesce(goods_order_1.order_amount,0) as order_amount,

	coalesce(goods_pay_2.payment_count,0) as payment_count,
	coalesce(goods_pay_2.payment_num,0) as payment_num,
	coalesce(goods_pay_2.payment_amount,0) as payment_amount,

	coalesce(goods_refund_3.refund_count,0) as refund_count,
	coalesce(goods_refund_3.refund_num,0) as refund_num,
	coalesce(goods_refund_3.refund_amount,0) as refund_amount,

	coalesce(goods_cart_4.cart_count,0) as cart_count,
	coalesce(goods_cart_4.cart_num,0) as cart_num,

	coalesce(goods_favor_5.favor_count,0) as favor_count,

	coalesce(goods_eval_6.evaluation_good_count,0) as evaluation_good_count,
	coalesce(goods_eval_6.evaluation_mid_count,0) as evaluation_mid_count,
	coalesce(goods_eval_6.evaluation_bad_count,0) as evaluation_bad_count
from  goods_order_1
	full join goods_pay_2 on goods_order_1.goods_id=goods_pay_2.goods_id and goods_order_1.dt = goods_pay_2.dt
	full join goods_refund_3 on goods_order_1.goods_id = goods_refund_3.goods_id and goods_order_1.dt = goods_refund_3.dt
	full join goods_cart_4 on goods_order_1.goods_id = goods_cart_4.goods_id and goods_order_1.dt = goods_cart_4.dt
	full join goods_favor_5 on goods_order_1.goods_id = goods_favor_5.goods_id and goods_order_1.dt = goods_favor_5.dt
	full join goods_eval_6 on goods_order_1.goods_id = goods_eval_6.goods_id and goods_order_1.dt = goods_eval_6.dt)

select
	dt,
	sku_id, sku_name,
	sum(order_count),
	sum(order_num),
	sum(order_amount),
	sum(payment_count),
	sum(payment_num),
	sum(payment_amount),
	sum(refund_count),
	sum(refund_num),
	sum(refund_amount),
	sum(cart_count),
	sum(cart_num),
	sum(favor_count),
	sum(evaluation_good_count),
	sum(evaluation_mid_count),
	sum(evaluation_bad_count)
from  temp
group by sku_id, sku_name,dt;

模板

with order_base as (
    select
        dt,             -- 订单日期, 维度
        order_id,       -- 订单id
        goods_id,       -- 商品id,
        goods_name,     -- 商品名
        buy_num,        -- 购买商品的数量,
        total_price,    -- 购买商品的总价格
        is_pay,         -- 标记是否已支付, 0:未支付, 1:已支付
        refund_id,      -- 标记是否有退款, 不为null, 说明有退款.
        row_number() over(partition by order_id, goods_id) rn   -- 根据订单id, 商品id去重.
    from yp_dwb.dwb_order_detail
),
order_count as (
	select 
  日期，商品id，商品名，被下单次数，被下单件数，被下单金额 
  from order_base
  where rn = 1  -- 按照订单id，商品id进行去重
  group by dt, goods_id, goods_name
),
payment_count as (
	select 
  日期，商品id，商品名 ,被支付次数,被支付件数,被支付金额
  from order_base
  where rn = 1 and is_pay=1       -- rn=1 去重, is_pay = 1 已支付
	group by dt, goods_id, goods_name
),
refund_count as (
  select
  日期，商品id，商品名 ,被退款次数,被退款件数,被退款金额
  from order_base
  where rn = 1 and refund_id is not null    -- rn=1 去重, refund_id is not null 不为空, 说明有退款
  group by dt, goods_id, goods_name
),
cart_count as (
  select
  日期，商品id,被加入购物车次数,被加入购物车件数
  where end_date='9999-99-99'
  group by substring(create_time, 1, 10), goods_id
),
favor_count as (
    select .....
    from yp_dwd.fact_goods_collect
    where end_date='9999-99-99'
    group by substring(create_time, 1, 10), goods_id
),
evaluation_count as (
    select 日期，商品id，好中差评个数
    from yp_dwd.fact_goods_evaluation_detail
    group by substring(create_time, 1, 10), goods_id
),
unionall as (
	select  日期，商品id，商品名，被下单次数，被下单件数，被下单金额 (空白字段用null填充)
  	from order_count
union all
  select ...
  	from payment_count
union all
  select ......
) 
select
    dt, sku_id, max(sku_name),
    sum(order_count),
    sum(order_num),
    sum(order_amount),
    sum(payment_count),
    sum(payment_num),
    sum(payment_amount),
    sum(refund_count),
    sum(refund_num),
    sum(refund_amount),
    sum(cart_count),
    sum(cart_num),
    sum(favor_count),
    sum(evaluation_good_count),
    sum(evaluation_mid_count),
    sum(evaluation_bad_count)
from unionall
group by dt, sku_id
--order by dt, sku_id

DWS之用户主题日统计宽表

建表：

create table yp_dws.dws_user_daycount
(
    dt STRING,
    user_id string comment '用户 id',
    login_count bigint comment '登录次数',
    store_collect_count bigint comment '店铺收藏数量',
    goods_collect_count bigint comment '商品收藏数量',
    cart_count bigint comment '加入购物车次数',
    cart_amount decimal(38,2) comment '加入购物车金额',
    order_count bigint comment '下单次数',
    order_amount    decimal(38,2)  comment '下单金额',
    payment_count   bigint      comment '支付次数',
    payment_amount  decimal(38,2) comment '支付金额'
) COMMENT '每日用户行为'
--PARTITIONED BY(dt STRING)
ROW format delimited fields terminated BY '\t'
stored AS orc tblproperties ('orc.compress' = 'SNAPPY');

最终实现：

insert into yp_dws.dws_user_daycount
-- 登录次数
with login_count as (
   select
      count(id) as login_count,
      login_user as user_id, dt
   from yp_dwd.fact_user_login
   group by login_user, dt
),
-- 店铺收藏数
store_collect_count as (
   select
      count(id) as store_collect_count,
      user_id, substring(create_time, 1, 10) as dt
   from yp_dwd.fact_store_collect
   where end_date='9999-99-99'
   group by user_id, substring(create_time, 1, 10)
),
-- 商品收藏数
goods_collect_count as (
   select
      count(id) as goods_collect_count,
      user_id, substring(create_time, 1, 10) as dt
   from yp_dwd.fact_goods_collect
    where end_date='9999-99-99'
   group by user_id, substring(create_time, 1, 10)
),
-- 加入购物车次数和金额
cart_count_amount as (
   select
      count(cart.id) as cart_count,
      sum(g.goods_promotion_price) as cart_amount,
      buyer_id as user_id, substring(cart.create_time, 1, 10) as dt
   from yp_dwd.fact_shop_cart cart, yp_dwb.dwb_goods_detail g
   where cart.end_date='9999-99-99' and cart.goods_id=g.id
   group by buyer_id, substring(cart.create_time, 1, 10)
),
-- 下单次数和金额
order_count_amount as (
   select
      count(o.id) as order_count,
      sum(order_amount) as order_amount,
      buyer_id as user_id, substring(create_date, 1, 10) as dt
   from yp_dwd.fact_shop_order o, yp_dwd.fact_shop_order_address_detail od
   where o.id=od.id
     and o.is_valid=1 and o.end_date='9999-99-99' and od.end_date='9999-99-99'
   group by buyer_id, substring(create_date, 1, 10)
),
-- 支付次数和金额
payment_count_amount as (
   select
      count(id) as payment_count,
      sum(trade_true_amount) as payment_amount,
      user_id, substring(create_time, 1, 10) as dt
   from yp_dwd.fact_trade_record
   where is_valid=1 and trade_type in (1,11) and status=1
   group by user_id, substring(create_time, 1, 10)
)
select
--        dt,
       user_id,
      -- 登录次数
       sum(login_count) login_count,
       -- 店铺收藏数
       sum(store_collect_count) store_collect_count,
       -- 商品收藏数
       sum(goods_collect_count) goods_collect_count,
       -- 加入购物车次数和金额
       sum(cart_count) cart_count,
       sum(cart_amount) cart_amount,
       -- 下单次数和金额
       sum(order_count) order_count,
       sum(order_amount) order_amount,
       -- 支付次数和金额
       sum(payment_count) payment_count,
       sum(payment_amount) payment_amount
      ,dt
from
(
    select lc.login_count,
           0 store_collect_count,
           0 goods_collect_count,
           0 cart_count, 0 cart_amount,
           0 order_count, 0 order_amount,
           0 payment_count, 0 payment_amount,
           user_id, dt
    from login_count lc
    union all
    select
           0 login_count,
           scc.store_collect_count,
           0 goods_collect_count,
           0 cart_count, 0 cart_amount,
           0 order_count, 0 order_amount,
           0 payment_count, 0 payment_amount,
           user_id, dt
    from store_collect_count scc
    union all
    select
           0 login_count,
           0 store_collect_count,
           gcc.goods_collect_count,
           0 cart_count, 0 cart_amount,
           0 order_count, 0 order_amount,
           0 payment_count, 0 payment_amount,
           user_id, dt
    from goods_collect_count gcc
    union all
    select
           0 login_count,
           0 store_collect_count,
           0 goods_collect_count,
           cca.cart_count, cart_amount,
           0 order_count, 0 order_amount,
           0 payment_count, 0 payment_amount,
           user_id, dt
    from cart_count_amount cca
    union all
    select
           0 login_count,
           0 store_collect_count,
           0 goods_collect_count,
           0 cart_count, 0 cart_amount,
           oca.order_count, order_amount,
           0 payment_count, 0 payment_amount,
           user_id, dt
    from order_count_amount oca
    union all
    select
           0 login_count,
           0 store_collect_count,
           0 goods_collect_count,
           0 cart_count, 0 cart_amount,
           0 order_count, 0 order_amount,
           pca.payment_count, payment_amount,
           user_id, dt
    from payment_count_amount pca
) user_count
group by user_id, dt
;

索引

索引是什么

在数据库中，索引指的是提供指向存储在表的指定列中的数据值的指针（地址），数据库使用索引以找到特定值，然后顺指针找到包含该值的行。

索引的优点
- 大大提高查询的效率
索引的缺点
- 降低更新表的速度，如对表进行INSERT、UPDATE和DELETE。因为更新表时，MySQL不仅要保存数据，还要保存一下索引文件。MySQL支持自动更新索引。
- 建立索引会占用磁盘空间的索引文件。

索引使用的注意事项

当创建索引之后，根据具有索引的字段查询，效率才能体现。

MySQL索引

MySQL的索引分类

普通索引

普通索引是最基本的索引，它没有任何限制。

CREATE INDEX indexName ON table_name (column_name);

--也可以在建表的时候同时指定索引
CREATE TABLE `table` (
    `id` int(11) NOT NULL AUTO_INCREMENT ,
    `title` char(255) CHARACTER NOT NULL ,
    `content` text CHARACTER NULL ,
    `time` int(10) NULL DEFAULT NULL ,
    PRIMARY KEY (`id`),
    INDEX index_name (title(length))
)

--length为可选参数，表示索引的长度，只有字符串类型的字段才能指定索引长度

唯一索引

与普通索引类似，不同的就是：==索引列的值必须唯一==，但允许有空值。

CREATE UNIQUE INDEX indexName ON table(column(length));

--建表的时候创建索引
CREATE TABLE `table` (
    `id` int(11) NOT NULL AUTO_INCREMENT ,
    `title` char(255) CHARACTER NOT NULL ,
    `content` text CHARACTER NULL ,
    `time` int(10) NULL DEFAULT NULL ,
    UNIQUE indexName (title(length))
);

主键索引

一种特殊的唯一索引，一个表只能有一个主键，不允许有空值。

一般是在建表的时候同时创建主键索引。
1
2
3
4
5
CREATE TABLE `table` (
`id` int(11) NOT NULL AUTO_INCREMENT ,
`title` char(255) NOT NULL ,
PRIMARY KEY (`id`)
);
组合索引

指多个字段上创建的索引，只有在查询条件中使用了创建索引时的第一个字段，索引才会被使用。

使用组合索引时==遵循最左前缀==集合。

ALTER TABLE mytable ADD INDEX name_city_age (name(10),city,age);

--建表时，name长度为16，这里用10。这是因为一般情况下名字的长度不会超过10，这样会加速索引查询速度，还会减少索引文件的大小，提高INSERT的更新速度。

--建立这样的组合索引，其实是相当于分别建立了下面三组组合MySQL数据库索引
name,city,age
name,city
name 

--为什么没有 city，age这样的组合索引呢？这是因为MySQL组合索引“最左前缀”的结果。简单的理解就是只从最左面的开始组合。并不是只要包含这三列的查询都会用到该组合索引，下面的几个SQL就会用到这个组合MySQL数据库索引：

SELECT * FROM mytable WHERE name="admin" AND city="郑州"

SELECT * FROM mytable WHERE name="admin"

--而下面几个则不会用到：
SELECT * FROM mytable WHERE city="郑州" AND name="admin" 
SELECT * FROM mytable WHERE age=20 AND city="郑州"
SELECT * FROM mytable WHERE city="郑州"

Hive索引

Hive 0.7版本之后，开始支持索引，但是功能很弱。

Hive3.0版本之后，直接移除了索引

Hive中如何使用索引：

使用物化视图自动更新

使用ORC、Parquet等格式文件。这些文件格式本身列式存储，内部特性支持查询效率的提高。

ORC之行组索引

Row Group Index

Row Group Index行组索引

ORC为每个stripe建立的包含min/max值的索引，就称为Row Group Index，也叫min-max Index，或者Storage Index。

1、在建立ORC格式表时，指定表参数’orc.create.index’=’true’之后，便会建立Row Group Index；

2、需要注意的是，为了使Row Group Index有效利用，向表中加载数据时，必须对需要使用索引的字段进行排序，否则，min/max会失去意义。

3、另外，这种索引通常用于数值型字段的查询过滤优化上。

--CTAS
CREATE TABLE t_text_2 stored AS ORC 
TBLPROPERTIES
    ('orc.compress'='SNAPPY',
    'orc.create.index'='true', --开启行组索引
    'orc.stripe.size'='10485760',
    'orc.row.index.stride'='10000') 	--索引条目之间的行数（必须> = 1000）
AS 
SELECT xxxx
FROM t_text_1 
DISTRIBUTE BY id sort BY id;

--参数hive.optimize.index.filte 表示是否自动使用索引，默认为false（不使用）；
--如果不设置该参数为true，那么ORC的索引当然也不会使用。

--执行sql之前，为了使用索引，应该设置下面的参数
set hive.optimize.index.filter=true;

ORC之Bloom Filter过滤器

Bloom Filter Index

hive中布隆过滤器索引

在建表时候，通过表参数==orc.bloom.filter.columns=”pcid”==来指定为那些字段建立BloomFilter索引；

这样，在生成数据的时候，会在每个stripe中，为该字段建立BloomFilter的数据结构；

当查询条件中包含==对该字段的=号过滤时候，先从BloomFilter中获取以下是否包含该值==，如果不包含，则跳过该stripe。

CREATE TABLE t_text_2 stored AS ORC 
TBLPROPERTIES
    ('orc.compress'='SNAPPY',
    'orc.create.index'='true',
    "orc.bloom.filter.columns"="pcid", --布隆过滤器 对pcid进行索引
    'orc.stripe.size'='10485760',
    'orc.row.index.stride'='10000') 
AS 
SELECT xxxx
FROM t_text_1 
DISTRIBUTE BY id sort BY id;


--执行查询
SET hive.optimize.index.filter=true;

SELECT COUNT(1) FROM t_text_2 WHERE id >= 0 AND id <= 1000  
AND pcid IN ('0005E26F0DCCDB56F9041C','A');