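Data preparation

The records below are the contents of the data file. Judging from the exercises, each line holds, in order: department, employee ID, name, gender, age, salary. Department and gender values are kept in Chinese because the answer code matches on them: 研发部 = R&D, 销售部 = Sales, 人事部 = HR, 男 = male, 女 = female.

    研发部,1,乔峰,男,20,5000
    研发部,2,段誉,男,21,2800
    研发部,3,虚竹,男,23,8000
    研发部,4,阿紫,女,18,4000
    销售部,5,扫地僧,男,85,9000
    销售部,6,李秋水,女,33,4500
    销售部,7,鸠摩智,男,50,3900
    销售部,8,天山童姥,女,60,8900
    销售部,9,慕容博,男,58,3400
    人事部,10,丁春秋,男,90,7000
    人事部,11,王语嫣,女,50,7700
    人事部,12,阿朱,女,43,5500
    人事部,13,无崖子,男,51,8800

Exercises

(1) Read emp.txt and create an RDD from it
(2) Get the employees older than 50
(3) Get the male employees in the HR department (人事部)
(4) Get the records of employees who earn more than 无崖子
(5) Get the name of the highest-paid employee in the whole company
(6) Get the headcount of each department, sorted in descending order
(7) Get the average salary of each department
(8) Get the highest salary in each department

Answers

    from pyspark import SparkContext

    # In the pyspark shell `sc` already exists; getOrCreate() simply reuses it.
    sc = SparkContext.getOrCreate()

    # (1) Read the data file into an RDD (the exercise calls it emp.txt;
    #     the path below is the one this answer was run with)
    fileRdd = sc.textFile('file:///export/data/employee.txt')
    print(fileRdd.glom().collect())

    # (2) Employees older than 50
    rather50 = fileRdd.filter(lambda line: int(line.split(",")[4]) > 50)
    print(rather50.collect())

    # (3) Male employees in the HR department
    genderMale = fileRdd.filter(lambda line: line.split(",")[0] == '人事部' and line.split(",")[3] == '男')
    print(genderMale.collect())

    # (4) Employees who earn more than 无崖子
    infoWuyazi = fileRdd.filter(lambda line: line.split(",")[2] == '无崖子')
    salaryW = int(infoWuyazi.map(lambda x: x.split(",")[5]).collect()[0])
    thanWuyazi = fileRdd.filter(lambda line: int(line.split(",")[5]) > salaryW)
    print(thanWuyazi.collect())

    # (5) Name of the highest-paid employee (using the top operator)
    #     Salaries are compared as integers, not strings, and only the name
    #     field of the winning line is printed.
    greatest = fileRdd.top(1, key=lambda line: int(line.split(",")[5]))
    print(greatest[0].split(",")[2])

    # (6) Headcount per department, in descending order
    #     Idea: the expected result is (department, headcount), so map each
    #     line to (department, 1), sum with reduceByKey, then sort by count.
    norepeatedRdd = fileRdd.distinct()
    demo = norepeatedRdd.map(lambda line: (line.split(",")[0], 1))
    resRdd = demo.reduceByKey(lambda x, y: x + y).sortBy(lambda kv: kv[1], ascending=False)
    print(resRdd.collect())

    # (7) Average salary per department
    #     Idea: the expected result is (department, average salary), so start
    #     from (department, (salary, 1)) and sum both parts. The salary is
    #     converted to int up front so a single-employee department also works.
    primeRdd = fileRdd.map(lambda line: (line.split(",")[0], (int(line.split(",")[5]), 1)))
    print(primeRdd.glom().collect())
    secRdd = primeRdd.reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
    print(secRdd.glom().collect())
    resultRdd = secRdd.map(lambda line: (line[0], line[1][0] / line[1][1]))
    print(resultRdd.glom().collect())

    # (8) Highest salary per department
    #     Idea: the expected result is (department, highest salary). This is a
    #     grouped top-N: group by department first, then take the top value
    #     within each group. Salaries are compared as integers.
    salaryRdd = fileRdd.map(lambda line: (line.split(",")[0], int(line.split(",")[5])))
    groupRdd = salaryRdd.groupByKey()
    show = groupRdd.map(lambda x: (x[0], *x[1]))  # debugging view: (department, salary, salary, ...)
    res = groupRdd.map(lambda x: (x[0], max(x[1])))
    print(res.glom().collect())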
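To check what the Spark jobs should return, the same questions can be answered with plain Python on the thirteen sample records. The following is only a sketch, not part of the original answer: the column order (department, ID, name, gender, age, salary) is inferred from the exercises, and the helper `parse` and all variable names are made up for illustration. The per-department loop mirrors the (sum, count) idea used in exercise (7).

```python
from collections import defaultdict

# Sample records copied from the data section; column order is an assumption:
# department, id, name, gender, age, salary.
LINES = """研发部,1,乔峰,男,20,5000
研发部,2,段誉,男,21,2800
研发部,3,虚竹,男,23,8000
研发部,4,阿紫,女,18,4000
销售部,5,扫地僧,男,85,9000
销售部,6,李秋水,女,33,4500
销售部,7,鸠摩智,男,50,3900
销售部,8,天山童姥,女,60,8900
销售部,9,慕容博,男,58,3400
人事部,10,丁春秋,男,90,7000
人事部,11,王语嫣,女,50,7700
人事部,12,阿朱,女,43,5500
人事部,13,无崖子,男,51,8800""".splitlines()


def parse(line):
    """Split one record and convert the numeric fields to int (hypothetical helper)."""
    dept, emp_id, name, gender, age, salary = line.split(",")
    return dept, int(emp_id), name, gender, int(age), int(salary)


rows = [parse(line) for line in LINES]

# (2) names of employees older than 50
older_than_50 = [r[2] for r in rows if r[4] > 50]

# (3) names of male employees in HR
male_hr = [r[2] for r in rows if r[0] == "人事部" and r[3] == "男"]

# (4) names of employees paid more than 无崖子
wuyazi_salary = next(r[5] for r in rows if r[2] == "无崖子")
paid_more = [r[2] for r in rows if r[5] > wuyazi_salary]

# (5) name of the highest-paid employee
top_name = max(rows, key=lambda r: r[5])[2]

# (6)-(8) one pass per department: [headcount, salary sum, max salary]
stats = defaultdict(lambda: [0, 0, 0])
for dept, _, _, _, _, salary in rows:
    entry = stats[dept]
    entry[0] += 1
    entry[1] += salary
    entry[2] = max(entry[2], salary)

headcount = sorted(((d, s[0]) for d, s in stats.items()),
                   key=lambda kv: kv[1], reverse=True)
avg_salary = {d: s[1] / s[0] for d, s in stats.items()}
max_salary = {d: s[2] for d, s in stats.items()}

print(older_than_50, male_hr, paid_more, top_name, sep="\n")
print(headcount, avg_salary, max_salary, sep="\n")
```

On this data the averages come out as 4950 for 研发部, 5940 for 销售部 and 7250 for 人事部, which is what `resultRdd` in exercise (7) should also contain. Note that 鸠摩智 and 王语嫣 are exactly 50 and are therefore excluded from exercise (2).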