数据准备

1
2
3
4
5
6
7
8
9
10
11
12
13
研发部,1,乔峰,男,20,5000
研发部,2,段誉,男,21,2800
研发部,3,虚竹,男,23,8000
研发部,4,阿紫,女,18,4000
销售部,5,扫地僧,男,85,9000
销售部,6,李秋水,女,33,4500
销售部,7,鸠摩智,男,50,3900
销售部,8,天山童姥,女,60,8900
销售部,9,慕容博,男,58,3400
人事部,10,丁春秋,男,90,7000
人事部,11,王语嫣,女,50,7700
人事部,12,阿朱,女,43,5500
人事部,13,无崖子,男,51,8800

题目

1
2
3
4
5
6
7
8
9
练习题目:
(1) 读入emp.txt文档,生成RDD
(2) 获得年龄大于50的员工
(3) 获得人事部性别为男的员工
(4) 获取比无崖子薪资高的员工信息
(5) 获得整个公司薪资最高的员工名字
(6) 获得每个部门的员工人数,并降序排序
(7) 获得每个部门的平均薪资
(8) 获得每个部门的最高薪水

答案

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
# (1) 读入emp.txt文档,生成RDD
fileRdd = sc.textFile('file:///export/data/employee.txt')
print(fileRdd.glom().collect())

# (2) 获得年龄大于50的员工
rather50 = fileRdd.filter(lambda line: int(line.split(",")[4]) > 50)
print(rather50.collect())

# (3) 获得人事部性别为男的员工
genderMale = fileRdd.filter(lambda line: line.split(",")[0] == '人事部' and line.split(",")[3] == '男')
print(genderMale.collect())

# (4) 获取比无崖子薪资高的员工信息
infoWuyazi = fileRdd.filter(lambda line: line.split(",")[2] == '无崖子')
salaryW = int(infoWuyazi.map(lambda x: x.split(",")[5]).collect()[0])
thanWuyazi = fileRdd.filter(lambda line: int(line.split(",")[5]) > salaryW)
print(thanWuyazi.collect())

# (5) 获得整个公司薪资最高的员工名字(使用top算子
greatest = fileRdd.top(1, lambda line: line.split(",")[5])
print(greatest)

# (6) 获得每个部门的员工人数,并降序排序 (思路&想法: 预期的结果: (部门名称, 总人数) 大概要使用reduceByKey算子吧
norepeatedRdd = fileRdd.distinct()
demo = norepeatedRdd.map(lambda line: (line.split(",")[0], 1))
resRdd = demo.reduceByKey(lambda x,y: x+y)
print(resRdd.glom().collect())

# (7) 获得每个部门的平均薪资 (思路&想法: 预期结果为(部门名称,平均薪资) ('部门', (薪资, 1))
primeRdd = fileRdd.map(lambda line: (line.split(",")[0], (line.split(",")[5],1)))
print(primeRdd.glom().collect())
secRdd = primeRdd.reduceByKey(lambda x,y: (int(x[0]) +int(y[0]), x[1]+y[1]))
print(secRdd.glom().collect())
resultRdd = secRdd.map(lambda line: (line[0], line[1][0]/line[1][1]))
print(resultRdd.glom().collect())

# (8)获得每个部门的最高薪水 (思路&想法: 预期结果为(部门名称,最高工资 -- 求每个部门,分组求topN(首先按照部门聚合, 再求组内top
salaryRdd = fileRdd.map(lambda lien: (lien.split(",")[0], lien.split(",")[5]))
groupRdd = salaryRdd.groupByKey()
show = groupRdd.map(lambda x: (x[0], *x[1]))
res = groupRdd.map(lambda x: (x[0], max(x[1])))
print(res.glom().collect())