Load the dataset chicago.rds from moodle using the readRDS() function of R into a dataframe called chicago. Don’t display the code to laod the file in the knitted document. The dataset contains air pollution data from Chicago from a period Jan 1987 till end of 2005. To learn more about these measures visit http://www.airnow.gov/index.cfm?action=aqibasics.particle
Make sure the dataset is a dplyr object from here on. Perform basic data cleansing activities that deal with outliers, data entry errors and missing values if any.
Visualizing & treating outliers and missing Values
## No id variables; using all as measure variables
## Warning: Removed 4692 rows containing non-finite values (stat_boxplot).
## 0% 5% 10% 50% 90% 95% 97%
## 1.700000 5.700000 6.863214 14.657143 28.000000 33.000000 37.300000
## 99% 100%
## 42.572000 61.500000
## 0% 5% 10% 50% 90% 95% 97%
## 2.00000 12.00000 15.50000 30.27885 57.00000 67.50000 75.50000
## 99% 100%
## 89.00000 365.00000
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## combine, src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
## Loading required package: colorspace
## Loading required package: grid
## Loading required package: data.table
## -------------------------------------------------------------------------
## data.table + dplyr code now lives in dtplyr.
## Please library(dtplyr)!
## -------------------------------------------------------------------------
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, last
## The following objects are masked from 'package:reshape2':
##
## dcast, melt
## VIM is ready to use.
## Since version 4.0.0 the GUI is in its own package VIMGUI.
##
## Please use the package to use the new (and old) GUI.
## Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
## 'data.frame': 6940 obs. of 8 variables:
## $ city : chr "chic" "chic" "chic" "chic" ...
## $ tmpd : num 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...
## $ dptp : num 31.5 29.9 27.4 28.6 28.9 ...
## $ date : Date, format: "1987-01-01" "1987-01-02" ...
## $ pm25tmean2: num NA NA NA NA NA NA NA NA NA NA ...
## $ pm10tmean2: num 34 NA 34.2 47 NA ...
## $ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ...
## $ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ...
## city tmpd dptp
## 1 chic 31.5 31.500
## 2 chic 33.0 29.875
## 3 chic 33.0 27.375
## 4 chic 29.0 28.625
## 5 chic 32.0 28.875
## 6 chic 40.0 35.125
## date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## 1 1987-01-01 NA 34.00000 4.250000 19.98810
## 2 1987-01-02 NA NA 3.304348 23.19099
## 3 1987-01-03 NA 34.16667 3.333333 23.81548
## 4 1987-01-04 NA 47.00000 4.375000 30.43452
## 5 1987-01-05 NA NA 4.750000 30.33333
## 6 1987-01-06 NA 48.00000 5.833333 25.77233
## chicago
##
## 8 Variables 6940 Observations
## ---------------------------------------------------------------------------
## city
## n missing unique value
## 6940 0 1 chic
## ---------------------------------------------------------------------------
## tmpd
## n missing unique Info Mean .05 .10 .25 .50
## 6939 1 196 1 50.31 18.0 25.0 35.0 51.0
## .75 .90 .95
## 67.0 75.0 78.5
##
## lowest : -16.0 -12.0 -10.5 -8.0 -6.5
## highest: 89.0 89.5 90.0 91.5 92.0
## ---------------------------------------------------------------------------
## dptp
## n missing unique Info Mean .05 .10 .25 .50
## 6938 2 1245 1 40.34 9.243 16.875 27.000 39.875
## .75 .90 .95
## 55.750 64.700 68.200
##
## lowest : -25.62 -23.12 -23.00 -21.00 -18.88
## highest: 77.00 77.10 77.25 78.12 78.25
## ---------------------------------------------------------------------------
## date
## n missing unique
## 6940 0 6940
##
## lowest : 1987-01-01 1987-01-02 1987-01-03 1987-01-04 1987-01-05
## highest: 2005-12-27 2005-12-28 2005-12-29 2005-12-30 2005-12-31
## ---------------------------------------------------------------------------
## pm25tmean2
## n missing unique Info Mean .05 .10 .25 .50
## 2493 4447 1072 1 16.18 5.700 6.863 9.700 14.657
## .75 .90 .95
## 20.600 28.000 33.000
##
## lowest : 1.7 2.1 2.4 2.5 2.6, highest: 41.4 41.5 41.7 42.3 42.5
## ---------------------------------------------------------------------------
## pm10tmean2
## n missing unique Info Mean .05 .10 .25 .50
## 6698 242 818 1 33.69 12.00 15.50 21.50 30.28
## .75 .90 .95
## 42.00 57.00 67.50
##
## lowest : 2.00 3.00 4.00 4.50 5.00
## highest: 87.75 88.00 88.25 88.50 89.00
## ---------------------------------------------------------------------------
## o3tmean2
## n missing unique Info Mean .05 .10 .25 .50
## 6940 0 6048 1 19.44 3.692 5.417 10.073 18.522
## .75 .90 .95
## 27.001 34.343 39.643
##
## lowest : 0.1528 0.3611 0.4271 0.4915 0.5000
## highest: 61.6771 62.9697 63.1042 63.3750 66.5875
## ---------------------------------------------------------------------------
## no2tmean2
## n missing unique Info Mean .05 .10 .25 .50
## 6940 0 6198 1 25.23 12.98 15.42 19.65 24.56
## .75 .90 .95
## 30.14 35.56 39.55
##
## lowest : 6.158 6.385 6.726 6.804 7.065
## highest: 56.230 57.525 57.564 59.516 62.480
## ---------------------------------------------------------------------------
## Warning in data.matrix(x): NAs introduced by coercion
## Warning in hex(RGB(r, g, b), gamma = gamma, fixup = fixup, ...): 'gamma' is
## deprecated and has no effect
## Warning in hex(RGB(r, g, b), gamma = gamma, fixup = fixup, ...): 'gamma' is
## deprecated and has no effect
## Warning in hex(RGB(r, g, b), gamma = gamma, fixup = fixup, ...): 'gamma' is
## deprecated and has no effect
## Warning in hex(RGB(r, g, b), gamma = gamma, fixup = fixup, ...): 'gamma' is
## deprecated and has no effect
## Warning in hex(RGB(r, g, b), gamma = gamma, fixup = fixup, ...): 'gamma' is
## deprecated and has no effect
## Warning in hex(RGB(r, g, b), gamma = gamma, fixup = fixup, ...): 'gamma' is
## deprecated and has no effect
## Warning in hex(RGB(r, g, b), gamma = gamma, fixup = fixup, ...): 'gamma' is
## deprecated and has no effect
## chicago$pm25
## n missing unique Info Mean .05 .10 .25 .50
## 2493 4447 1072 1 16.18 5.700 6.863 9.700 14.657
## .75 .90 .95
## 20.600 28.000 33.000
##
## lowest : 1.7 2.1 2.4 2.5 2.6, highest: 41.4 41.5 41.7 42.3 42.5
## chicago$pm10
## n missing unique Info Mean .05 .10 .25 .50
## 6698 242 818 1 33.69 12.00 15.50 21.50 30.28
## .75 .90 .95
## 42.00 57.00 67.50
##
## lowest : 2.00 3.00 4.00 4.50 5.00
## highest: 87.75 88.00 88.25 88.50 89.00
## chicago$pm25
## n missing unique Info Mean .05 .10 .25 .50
## 6940 0 1072 1 16.22 5.714 6.971 9.750 14.750
## .75 .90 .95
## 20.609 28.000 33.000
##
## lowest : 1.7 2.1 2.4 2.5 2.6, highest: 41.4 41.5 41.7 42.3 42.5
## chicago$pm10
## n missing unique Info Mean .05 .10 .25 .50
## 6940 0 818 1 33.68 12.00 15.50 21.50 30.32
## .75 .90 .95
## 42.00 57.00 67.52
##
## lowest : 2.00 3.00 4.00 4.50 5.00
## highest: 87.75 88.00 88.25 88.50 89.00
Displaying the top 10 rows that have very high PM2.5 as it is extremely hazardous for health
## city tmpd dewpoint date pm25 pm10 ozone no2
## 154 chic 69.0 47.250 1987-06-03 42.5 42.0 23.313361 21.74306
## 343 chic 49.5 34.375 1987-12-09 42.5 21.0 9.041667 12.47500
## 450 chic 53.0 35.500 1988-03-25 42.5 25.0 22.002415 16.99583
## 535 chic 74.5 52.875 1988-06-18 42.5 55.0 41.231399 38.97894
## 648 chic 54.5 47.500 1988-10-09 42.5 25.5 11.940476 32.08514
## 876 chic 67.5 60.875 1989-05-25 42.5 33.0 26.303504 35.52494
## 1076 chic 19.5 6.500 1989-12-11 42.5 35.0 11.541667 24.66825
## 1138 chic 28.5 20.125 1990-02-11 42.5 29.5 15.875000 24.66667
## 1295 chic 78.5 67.625 1990-07-18 42.5 47.0 26.380328 29.24107
## 1296 chic 77.5 68.000 1990-07-19 42.5 47.5 18.490132 26.91071
Plotting each variable to check if other factors are linked to PM2.5 levels
## 'data.frame': 545 obs. of 9 variables:
## $ city : chr "chic" "chic" "chic" "chic" ...
## $ tmpd : num 29 29.5 25 27 22 37 36.5 38 31 28.5 ...
## $ dewpoint: num 22 20.1 21.5 24.8 20.6 ...
## $ date : Date, format: "1987-01-08" "1987-01-11" ...
## $ pm25 : num 33 38.8 34.5 36.5 33.4 ...
## $ pm10 : num 36 22 19 29.5 32.9 ...
## $ ozone : num 11.29 17.54 7.92 5.83 8.91 ...
## $ no2 : num 17 13.7 19.5 20.7 24.5 ...
## $ year : num 1987 1987 1987 1987 1987 ...
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:Hmisc':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
## Warning: Removed 1 rows containing missing values (geom_point).
From the graphical output, we observe a possible relation between temperature and PM25, PM10 levels. The Level of pollutants like PM2.5, PM10, NO2 and Ozone are high in Winter and low in Summer
attach(chicago)
## The following object is masked _by_ .GlobalEnv:
##
## year
chicago <- mutate(chicago, pm25_norm = (pm25 - mean(pm25)) / (sd(pm25)))
chicago <- mutate(chicago, pm10_norm = (pm10 - min(pm10)) / (max(pm10)-min(pm10)))
year = as.POSIXlt(chicago$date)$year + 1900 add a new column for the year of the observation.year = as.POSIXlt(chicago$date)$year + 1900
chicago <- mutate(chicago,year)
## Adding missing grouping variables: `year`
## # A tibble: 9 x 7
## year Temperature Dew_Point_Temperature PM2.5 PM10 Ozone
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1987 52.12603 41.73527 16.52341 39.12437 20.49656
## 2 1989 48.36712 38.19623 16.82721 38.60283 20.81377
## 3 1991 51.40274 41.13390 16.81330 36.59886 20.00840
## 4 1993 48.71918 40.85993 16.73578 36.41688 15.80195
## 5 1997 48.70959 39.92764 16.65269 30.70733 18.60359
## 6 1998 53.78356 44.26521 17.80983 30.92909 19.27065
## 7 1999 51.46301 40.98630 16.67534 31.25416 20.46593
## 8 2000 50.04098 40.25164 16.93471 30.48314 18.53101
## 9 2001 51.04384 41.31233 16.86804 32.36493 19.35421
## # ... with 1 more variables: NOx <dbl>
Displaying the years that have higher than average PM2.5 levels. Possibly, Chicago has observed heavy infrastructure development or some industrialization in these years that has led to an increase in pollution levels
pm25.quint. Use the cut fucntion that divides the range of x into intervals and codes the values in x according to which interval they fall.## # A tibble: 5 x 2
## pm25.quint o3
## <fctr> <dbl>
## 1 (1.66,9.86] 19.43551
## 2 (9.86,18] 19.43551
## 3 (18,26.2] 19.43551
## 4 (26.2,34.3] 19.43551
## 5 (34.3,42.5] 19.43551
From the above output, it seems there is no relationship between pm25 and ozone