Homework 1: Basic Data Wrangling

  1. Load the dataset chicago.rds from moodle using the readRDS() function of R into a dataframe called chicago. Don’t display the code to laod the file in the knitted document. The dataset contains air pollution data from Chicago from a period Jan 1987 till end of 2005. To learn more about these measures visit http://www.airnow.gov/index.cfm?action=aqibasics.particle

  2. Make sure the dataset is a dplyr object from here on. Perform basic data cleansing activities that deal with outliers, data entry errors and missing values if any.

Visualizing & treating outliers and missing Values

## No id variables; using all as measure variables
## Warning: Removed 4692 rows containing non-finite values (stat_boxplot).

##        0%        5%       10%       50%       90%       95%       97% 
##  1.700000  5.700000  6.863214 14.657143 28.000000 33.000000 37.300000 
##       99%      100% 
## 42.572000 61.500000
##        0%        5%       10%       50%       90%       95%       97% 
##   2.00000  12.00000  15.50000  30.27885  57.00000  67.50000  75.50000 
##       99%      100% 
##  89.00000 365.00000
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     combine, src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
## Loading required package: colorspace
## Loading required package: grid
## Loading required package: data.table
## -------------------------------------------------------------------------
## data.table + dplyr code now lives in dtplyr.
## Please library(dtplyr)!
## -------------------------------------------------------------------------
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, last
## The following objects are masked from 'package:reshape2':
## 
##     dcast, melt
## VIM is ready to use. 
##  Since version 4.0.0 the GUI is in its own package VIMGUI.
## 
##           Please use the package to use the new (and old) GUI.
## Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
## 'data.frame':    6940 obs. of  8 variables:
##  $ city      : chr  "chic" "chic" "chic" "chic" ...
##  $ tmpd      : num  31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...
##  $ dptp      : num  31.5 29.9 27.4 28.6 28.9 ...
##  $ date      : Date, format: "1987-01-01" "1987-01-02" ...
##  $ pm25tmean2: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ pm10tmean2: num  34 NA 34.2 47 NA ...
##  $ o3tmean2  : num  4.25 3.3 3.33 4.38 4.75 ...
##  $ no2tmean2 : num  20 23.2 23.8 30.4 30.3 ...
##   city tmpd   dptp
## 1 chic 31.5 31.500
## 2 chic 33.0 29.875
## 3 chic 33.0 27.375
## 4 chic 29.0 28.625
## 5 chic 32.0 28.875
## 6 chic 40.0 35.125
##         date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## 1 1987-01-01         NA   34.00000 4.250000  19.98810
## 2 1987-01-02         NA         NA 3.304348  23.19099
## 3 1987-01-03         NA   34.16667 3.333333  23.81548
## 4 1987-01-04         NA   47.00000 4.375000  30.43452
## 5 1987-01-05         NA         NA 4.750000  30.33333
## 6 1987-01-06         NA   48.00000 5.833333  25.77233
## chicago 
## 
##  8  Variables      6940  Observations
## ---------------------------------------------------------------------------
## city 
##       n missing  unique   value 
##    6940       0       1    chic 
## ---------------------------------------------------------------------------
## tmpd 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##    6939       1     196       1   50.31    18.0    25.0    35.0    51.0 
##     .75     .90     .95 
##    67.0    75.0    78.5 
## 
## lowest : -16.0 -12.0 -10.5  -8.0  -6.5
## highest:  89.0  89.5  90.0  91.5  92.0 
## ---------------------------------------------------------------------------
## dptp 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##    6938       2    1245       1   40.34   9.243  16.875  27.000  39.875 
##     .75     .90     .95 
##  55.750  64.700  68.200 
## 
## lowest : -25.62 -23.12 -23.00 -21.00 -18.88
## highest:  77.00  77.10  77.25  78.12  78.25 
## ---------------------------------------------------------------------------
## date 
##       n missing  unique 
##    6940       0    6940 
## 
## lowest : 1987-01-01 1987-01-02 1987-01-03 1987-01-04 1987-01-05
## highest: 2005-12-27 2005-12-28 2005-12-29 2005-12-30 2005-12-31 
## ---------------------------------------------------------------------------
## pm25tmean2 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##    2493    4447    1072       1   16.18   5.700   6.863   9.700  14.657 
##     .75     .90     .95 
##  20.600  28.000  33.000 
## 
## lowest :  1.7  2.1  2.4  2.5  2.6, highest: 41.4 41.5 41.7 42.3 42.5 
## ---------------------------------------------------------------------------
## pm10tmean2 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##    6698     242     818       1   33.69   12.00   15.50   21.50   30.28 
##     .75     .90     .95 
##   42.00   57.00   67.50 
## 
## lowest :  2.00  3.00  4.00  4.50  5.00
## highest: 87.75 88.00 88.25 88.50 89.00 
## ---------------------------------------------------------------------------
## o3tmean2 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##    6940       0    6048       1   19.44   3.692   5.417  10.073  18.522 
##     .75     .90     .95 
##  27.001  34.343  39.643 
## 
## lowest :  0.1528  0.3611  0.4271  0.4915  0.5000
## highest: 61.6771 62.9697 63.1042 63.3750 66.5875 
## ---------------------------------------------------------------------------
## no2tmean2 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##    6940       0    6198       1   25.23   12.98   15.42   19.65   24.56 
##     .75     .90     .95 
##   30.14   35.56   39.55 
## 
## lowest :  6.158  6.385  6.726  6.804  7.065
## highest: 56.230 57.525 57.564 59.516 62.480 
## ---------------------------------------------------------------------------

## Warning in data.matrix(x): NAs introduced by coercion
## Warning in hex(RGB(r, g, b), gamma = gamma, fixup = fixup, ...): 'gamma' is
## deprecated and has no effect

## Warning in hex(RGB(r, g, b), gamma = gamma, fixup = fixup, ...): 'gamma' is
## deprecated and has no effect

## Warning in hex(RGB(r, g, b), gamma = gamma, fixup = fixup, ...): 'gamma' is
## deprecated and has no effect

## Warning in hex(RGB(r, g, b), gamma = gamma, fixup = fixup, ...): 'gamma' is
## deprecated and has no effect

## Warning in hex(RGB(r, g, b), gamma = gamma, fixup = fixup, ...): 'gamma' is
## deprecated and has no effect

## Warning in hex(RGB(r, g, b), gamma = gamma, fixup = fixup, ...): 'gamma' is
## deprecated and has no effect

## Warning in hex(RGB(r, g, b), gamma = gamma, fixup = fixup, ...): 'gamma' is
## deprecated and has no effect

## chicago$pm25 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##    2493    4447    1072       1   16.18   5.700   6.863   9.700  14.657 
##     .75     .90     .95 
##  20.600  28.000  33.000 
## 
## lowest :  1.7  2.1  2.4  2.5  2.6, highest: 41.4 41.5 41.7 42.3 42.5
## chicago$pm10 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##    6698     242     818       1   33.69   12.00   15.50   21.50   30.28 
##     .75     .90     .95 
##   42.00   57.00   67.50 
## 
## lowest :  2.00  3.00  4.00  4.50  5.00
## highest: 87.75 88.00 88.25 88.50 89.00
## chicago$pm25 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##    6940       0    1072       1   16.22   5.714   6.971   9.750  14.750 
##     .75     .90     .95 
##  20.609  28.000  33.000 
## 
## lowest :  1.7  2.1  2.4  2.5  2.6, highest: 41.4 41.5 41.7 42.3 42.5
## chicago$pm10 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##    6940       0     818       1   33.68   12.00   15.50   21.50   30.32 
##     .75     .90     .95 
##   42.00   57.00   67.52 
## 
## lowest :  2.00  3.00  4.00  4.50  5.00
## highest: 87.75 88.00 88.25 88.50 89.00
  1. Extract and display those rows (dplyr will display first 10 rows and that is fine) that have a ‘seemingly’ high value of PM2.5. Note, I am deliberately using managerial speak here when I say ‘seemingly’ but its your job to translate this into something convincing. Think about dsiplaying 10 rows that have most meaning.

Displaying the top 10 rows that have very high PM2.5 as it is extremely hazardous for health

##      city tmpd dewpoint       date pm25 pm10     ozone      no2
## 154  chic 69.0   47.250 1987-06-03 42.5 42.0 23.313361 21.74306
## 343  chic 49.5   34.375 1987-12-09 42.5 21.0  9.041667 12.47500
## 450  chic 53.0   35.500 1988-03-25 42.5 25.0 22.002415 16.99583
## 535  chic 74.5   52.875 1988-06-18 42.5 55.0 41.231399 38.97894
## 648  chic 54.5   47.500 1988-10-09 42.5 25.5 11.940476 32.08514
## 876  chic 67.5   60.875 1989-05-25 42.5 33.0 26.303504 35.52494
## 1076 chic 19.5    6.500 1989-12-11 42.5 35.0 11.541667 24.66825
## 1138 chic 28.5   20.125 1990-02-11 42.5 29.5 15.875000 24.66667
## 1295 chic 78.5   67.625 1990-07-18 42.5 47.0 26.380328 29.24107
## 1296 chic 77.5   68.000 1990-07-19 42.5 47.5 18.490132 26.91071
  1. Explore this subset of high PM2.5 level days further and see if other factors in the data could be linked to these high PM2.5 days. What queries would you run?

Plotting each variable to check if other factors are linked to PM2.5 levels

## 'data.frame':    545 obs. of  9 variables:
##  $ city    : chr  "chic" "chic" "chic" "chic" ...
##  $ tmpd    : num  29 29.5 25 27 22 37 36.5 38 31 28.5 ...
##  $ dewpoint: num  22 20.1 21.5 24.8 20.6 ...
##  $ date    : Date, format: "1987-01-08" "1987-01-11" ...
##  $ pm25    : num  33 38.8 34.5 36.5 33.4 ...
##  $ pm10    : num  36 22 19 29.5 32.9 ...
##  $ ozone   : num  11.29 17.54 7.92 5.83 8.91 ...
##  $ no2     : num  17 13.7 19.5 20.7 24.5 ...
##  $ year    : num  1987 1987 1987 1987 1987 ...
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:Hmisc':
## 
##     combine
## The following object is masked from 'package:dplyr':
## 
##     combine
## Warning: Removed 1 rows containing missing values (geom_point).

From the graphical output, we observe a possible relation between temperature and PM25, PM10 levels. The Level of pollutants like PM2.5, PM10, NO2 and Ozone are high in Winter and low in Summer

  1. Create two new columns that normalize the PM2.5 column using z-score and the PM10 column using min-max normalization
attach(chicago)
## The following object is masked _by_ .GlobalEnv:
## 
##     year
chicago <- mutate(chicago, pm25_norm = (pm25 - mean(pm25)) / (sd(pm25)))
chicago <- mutate(chicago, pm10_norm = (pm10 - min(pm10)) / (max(pm10)-min(pm10)))
  1. Using the commandyear = as.POSIXlt(chicago$date)$year + 1900 add a new column for the year of the observation.
year = as.POSIXlt(chicago$date)$year + 1900
chicago <- mutate(chicago,year)
  1. Present key summary statistics of the pollutants by year
## Adding missing grouping variables: `year`
## # A tibble: 9 x 7
##    year Temperature Dew_Point_Temperature    PM2.5     PM10    Ozone
##   <dbl>       <dbl>                 <dbl>    <dbl>    <dbl>    <dbl>
## 1  1987    52.12603              41.73527 16.52341 39.12437 20.49656
## 2  1989    48.36712              38.19623 16.82721 38.60283 20.81377
## 3  1991    51.40274              41.13390 16.81330 36.59886 20.00840
## 4  1993    48.71918              40.85993 16.73578 36.41688 15.80195
## 5  1997    48.70959              39.92764 16.65269 30.70733 18.60359
## 6  1998    53.78356              44.26521 17.80983 30.92909 19.27065
## 7  1999    51.46301              40.98630 16.67534 31.25416 20.46593
## 8  2000    50.04098              40.25164 16.93471 30.48314 18.53101
## 9  2001    51.04384              41.31233 16.86804 32.36493 19.35421
## # ... with 1 more variables: NOx <dbl>

Displaying the years that have higher than average PM2.5 levels. Possibly, Chicago has observed heavy infrastructure development or some industrialization in these years that has led to an increase in pollution levels

  1. What are the average levels of Ozone (o3) within quintiles of PM25? Note you have to convert pm25 levels into quintile groups and strore these levels into a new variable – call it pm25.quint. Use the cut fucntion that divides the range of x into intervals and codes the values in x according to which interval they fall.
## # A tibble: 5 x 2
##    pm25.quint       o3
##        <fctr>    <dbl>
## 1 (1.66,9.86] 19.43551
## 2   (9.86,18] 19.43551
## 3   (18,26.2] 19.43551
## 4 (26.2,34.3] 19.43551
## 5 (34.3,42.5] 19.43551

From the above output, it seems there is no relationship between pm25 and ozone