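On the last day of the National Day holiday, you received an offer from a weather bureau in Australia. On your first day at work, your boss Jason called you into his office and handed you a mysterious assignment. It turns out Jason has been quietly making a fortune: over the course of a year he silently collected a full year of weather data for Canberra, Australia, for 2007. As an honorary student of Data Maniac, you need to: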

Q1:

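We have already loaded this dataset in the background under the name 'weather'. Using the methods you learned in the last few lessons, get a quick overview of the data structure (number of observations and variables, variable types, and so on);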
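
As a rough illustration of what a quick structural overview can look like, here is a minimal base R sketch. The data frame `weather_demo` is a hypothetical three-row stand-in for the preloaded `weather` data, not the real dataset.

```r
# Hypothetical stand-in for the preloaded `weather` data frame
# (illustrative values only).
weather_demo <- data.frame(
  Date = as.Date(c("2007-11-01", "2007-11-02", "2007-11-03")),
  Rainfall = c(0.0, 3.6, 3.6),
  Sunshine = c(6.3, 9.7, 3.3),
  RainTomorrow = c("Yes", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

dim(weather_demo)            # number of observations (rows) and variables (columns)
names(weather_demo)          # variable names
str(weather_demo)            # compact overview: type and first values of each variable
sapply(weather_demo, class)  # class of each variable
```

Running the same four calls on `weather` itself should give you the counts and types the question asks about.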

Q2:

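You notice that the 'RainTomorrow' variable in this dataset looks interesting, and you wonder whether it has some hidden relationship with the 'Sunshine' and 'Rainfall' variables. To make the analysis easier, use a method from the dplyr package covered in class to keep only the following four variables: 'Date', 'Rainfall', 'Sunshine', 'RainTomorrow'. Name the result weather_cut;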
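
One way to keep only a few columns is `select()` from dplyr. The sketch below assumes a hypothetical two-row data frame `weather_demo` with one extra made-up column (`MinTemp`) so there is something to drop; on the real data you would start from `weather` instead.

```r
library(dplyr)

# Hypothetical stand-in for `weather`; MinTemp is a made-up extra column.
weather_demo <- data.frame(
  Date = as.Date(c("2007-11-01", "2007-11-02")),
  MinTemp = c(8.0, 14.0),
  Rainfall = c(0.0, 3.6),
  Sunshine = c(6.3, 9.7),
  RainTomorrow = c("Yes", "Yes")
)

# Keep the four variables of interest, selected by name.
weather_cut <- weather_demo %>%
  select(Date, Rainfall, Sunshine, RainTomorrow)

names(weather_cut)
```

Selecting by name rather than by column position keeps the code readable and still works if the column order of the source file changes.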

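Q3: To better understand the relationship between 'RainTomorrow' and 'Sunshine', use the spread function from the tidyr package to turn the two values of 'RainTomorrow', 'Yes' and 'No', into two variables, filled with the corresponding values of the original 'Sunshine' variable. If it works, the first six rows of the new dataset will show 'No' and 'Yes' as separate columns. Name this new dataset weather_forecast;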
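
To see what `spread()` does to the shape of the data, here is a small sketch. `weather_cut_demo` is a hypothetical six-row data frame whose values mirror the sample rows printed in this exercise. It is a stand-in, not the real `weather_cut`.

```r
library(tidyr)

# Hypothetical long-format stand-in for weather_cut.
weather_cut_demo <- data.frame(
  Date = as.Date("2007-11-01") + 0:5,
  Rainfall = c(0.0, 3.6, 3.6, 39.8, 2.8, 0.0),
  Sunshine = c(6.3, 9.7, 3.3, 9.1, 10.6, 8.2),
  RainTomorrow = c("Yes", "Yes", "Yes", "Yes", "No", "No")
)

# Each distinct value of RainTomorrow becomes its own column,
# filled with the matching Sunshine value (NA where there is none).
weather_forecast_demo <- spread(weather_cut_demo,
                                key = RainTomorrow, value = Sunshine)
head(weather_forecast_demo)
```

Each date has only one 'RainTomorrow' value, so every row ends up with a number in one of the new columns and NA in the other. Recent tidyr versions document `spread()` as superseded by `pivot_wider()`, but it still works and is the function this exercise asks for.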

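Q4: With the analysis done, you want to restore weather_forecast to its original shape, so you need the gather function from the tidyr package to turn it back. For reference, the first six rows of weather_forecast before gathering are:

        Date Rainfall   No Yes
1 2007-11-01      0.0   NA 6.3
2 2007-11-02      3.6   NA 9.7
3 2007-11-03      3.6   NA 3.3
4 2007-11-04     39.8   NA 9.1
5 2007-11-05      2.8 10.6  NA
6 2007-11-06      0.0  8.2  NA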
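
The reverse step can be sketched with `gather()`. The data frame `weather_forecast_demo` below is a hypothetical copy of the six sample rows printed above. The `na.rm = TRUE` variant is an optional extra, not something the exercise requires.

```r
library(tidyr)

# Hypothetical wide-format stand-in, copied from the six sample rows.
weather_forecast_demo <- data.frame(
  Date = as.Date("2007-11-01") + 0:5,
  Rainfall = c(0.0, 3.6, 3.6, 39.8, 2.8, 0.0),
  No = c(NA, NA, NA, NA, 10.6, 8.2),
  Yes = c(6.3, 9.7, 3.3, 9.1, NA, NA)
)

# Stack the No and Yes columns back into a key column (RainTomorrow)
# and a value column (Sunshine). Every date appears once per gathered column.
weather_long_demo <- gather(weather_forecast_demo,
                            key = "RainTomorrow", value = "Sunshine",
                            No, Yes)
nrow(weather_long_demo)

# Dropping rows whose gathered value is NA gets back to one row per date.
weather_restored_demo <- gather(weather_forecast_demo,
                                key = "RainTomorrow", value = "Sunshine",
                                No, Yes, na.rm = TRUE)
nrow(weather_restored_demo)
```

Without `na.rm = TRUE` the gathered result has twice as many rows as the wide table, half of them holding the NA placeholders that `spread()` introduced. Keep in mind that `na.rm = TRUE` would also drop any day whose 'Sunshine' reading was missing in the first place.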

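Q5: You notice that the weather_forecast dataset contains a lot of NAs. Use the method covered in class, any(is.na()), to check whether the 'Yes' and 'No' variables contain any NA, and use the summary function to find the number of NAs in each of the two variables;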
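
A minimal base R sketch of both checks, using a hypothetical two-column data frame `weather_forecast_demo` built from the sample rows in this exercise. The `colSums()` line is an extra not mentioned in the question, shown only as a cross-check.

```r
# Hypothetical stand-in for the No and Yes columns of weather_forecast.
weather_forecast_demo <- data.frame(
  No = c(NA, NA, NA, NA, 10.6, 8.2),
  Yes = c(6.3, 9.7, 3.3, 9.1, NA, NA)
)

# Does each column contain at least one NA?
any(is.na(weather_forecast_demo$Yes))
any(is.na(weather_forecast_demo$No))

# summary() adds an "NA's" entry for every numeric column that has missing values.
summary(weather_forecast_demo)

# Cross-check: count the NAs per column directly.
na_counts <- colSums(is.na(weather_forecast_demo))
na_counts
```

`any(is.na(x))` only answers yes or no, while `summary()` and `colSums(is.na(...))` tell you how many values are missing.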

Q6:

Having confirmed the NAs, you decide to replace every NA in the "Yes" and "No" variables with the mean of that variable;
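
Mean imputation can be sketched in a few lines of base R. The data frame `weather_forecast_demo` is again a hypothetical stand-in built from the sample rows, and the loop is just one of several equivalent ways to write it.

```r
# Hypothetical stand-in for the No and Yes columns of weather_forecast.
weather_forecast_demo <- data.frame(
  No = c(NA, NA, NA, NA, 10.6, 8.2),
  Yes = c(6.3, 9.7, 3.3, 9.1, NA, NA)
)

for (col in c("No", "Yes")) {
  values <- weather_forecast_demo[[col]]
  # Mean of the observed values only; na.rm = TRUE skips the NAs.
  col_mean <- mean(values, na.rm = TRUE)
  # Overwrite only the missing positions with that mean.
  values[is.na(values)] <- col_mean
  weather_forecast_demo[[col]] <- values
}

weather_forecast_demo
```

Forgetting `na.rm = TRUE` is the usual pitfall here: `mean()` of a vector containing NA returns NA, so the missing values would simply be replaced by NA again.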

Q7:

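With the NA problem solved, use the hist, plot, and boxplot functions to look for outliers in the 'No' variable of the weather_forecast dataset, and get a feel for which function is best suited to spotting outliers.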
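
To compare the three plots, here is a small sketch on a hypothetical vector `sunshine_demo` with made-up values, one of which (0.3) is deliberately far from the rest. The `boxplot.stats()` call is an extra that lists the points a box plot would flag.

```r
# Hypothetical sunshine values with one deliberately extreme reading.
sunshine_demo <- c(9.4, 8.2, 10.6, 9.1, 9.7, 8.8, 9.9, 0.3)

hist(sunshine_demo)     # distribution: an outlier shows up as an isolated bar
plot(sunshine_demo)     # values by index: an outlier sits far from the other points
boxplot(sunshine_demo)  # points beyond the whiskers are drawn individually

# The values that the box plot draws as individual points.
flagged <- boxplot.stats(sunshine_demo)$out
flagged
```

By default a box plot marks any point more than 1.5 times the box height beyond the box edges, so it often gives the most direct read on outliers, while the histogram and the index plot leave the judgment to your eye.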
