Directions

The purpose of this assignment is to refresh your data screening skills. (Hint: take a look at your ANLY 500 notes)

Please load in the airquality.csv dataset. You will conduct data screening tasks on this dataset.

library(readr)
airquality <- read_csv("airquality.csv")

## Parsed with column specification:
## cols(
##   Ozone = col_double(),
##   Solar.R = col_double(),
##   Wind = col_double(),
##   Temp = col_double(),
##   Month = col_double(),
##   Day = col_double(),
##   Neighborhood = col_double()
## )

Dataset:

Daily air quality measurements in New York, May to September 1973. [Link] (https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/airquality)

Variables:

Daily readings of the following air quality values for May 1, 1973 (a Tuesday) to September 30, 1973.

- Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island. The range is 0-165 ppb.
- Solar.R: Solar radiation in Langleys in the frequency band 4000--7700 Angstroms from 0800 to 1200 hours at Central Park. The range is 0-330 Langleys.
- Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport
- Temp: Maximum daily temperature in degrees Fahrenheit at La Guardia Airport.
- Neighborhood: 1 = Brooklyn, 2 = Bronx

Data screening:

Accuracy:

a)  Include output and indicate how the data are not accurate.
b)  Include output to show how you fixed the accuracy errors, and describe what you did.

str(airquality) # the Neighborhood should be numeric type

## tibble [154 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Ozone       : num [1:154] 41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R     : num [1:154] 190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind        : num [1:154] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp        : num [1:154] 67 72 74 62 56 66 65 59 61 69 ...
##  $ Month       : num [1:154] 5 5 5 5 5 5 5 5 5 5 ...
##  $ Day         : num [1:154] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Neighborhood: num [1:154] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Ozone = col_double(),
##   ..   Solar.R = col_double(),
##   ..   Wind = col_double(),
##   ..   Temp = col_double(),
##   ..   Month = col_double(),
##   ..   Day = col_double(),
##   ..   Neighborhood = col_double()
##   .. )

airquality$Neighborhood <- as.factor(airquality$Neighborhood) # fix Neighborhood to factor
str(airquality)

## tibble [154 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Ozone       : num [1:154] 41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R     : num [1:154] 190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind        : num [1:154] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp        : num [1:154] 67 72 74 62 56 66 65 59 61 69 ...
##  $ Month       : num [1:154] 5 5 5 5 5 5 5 5 5 5 ...
##  $ Day         : num [1:154] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Neighborhood: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Ozone = col_double(),
##   ..   Solar.R = col_double(),
##   ..   Wind = col_double(),
##   ..   Temp = col_double(),
##   ..   Month = col_double(),
##   ..   Day = col_double(),
##   ..   Neighborhood = col_double()
##   .. )

summary(airquality)

##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  0.00   Min.   :  0.0   Min.   : 0.000   Min.   : 0.00  
##  1st Qu.: 18.00   1st Qu.:113.5   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.00   Median :203.0   Median : 9.700   Median :79.00  
##  Mean   : 41.77   Mean   :184.7   Mean   : 9.893   Mean   :77.38  
##  3rd Qu.: 63.00   3rd Qu.:258.5   3rd Qu.:11.500   3rd Qu.:84.75  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day       Neighborhood
##  Min.   :0.000   Min.   : 0.0   1:89        
##  1st Qu.:6.000   1st Qu.: 8.0   2:65        
##  Median :7.000   Median :16.0               
##  Mean   :6.948   Mean   :15.7               
##  3rd Qu.:8.000   3rd Qu.:23.0               
##  Max.   :9.000   Max.   :31.0               
##

Missing data:

a)  Include output that shows you have missing data.
b)  Include output and a description that shows what you did with the missing data.
    i)  Replace all participant data if they have less than 20% of missing data by row. 
    ii) You can leave out the other participants (i.e. you do not have to create allrows).

sum(is.na(airquality))

## [1] 44

summary(airquality) # most of the missing data is in Ozone and Solar.R variables.

##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  0.00   Min.   :  0.0   Min.   : 0.000   Min.   : 0.00  
##  1st Qu.: 18.00   1st Qu.:113.5   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.00   Median :203.0   Median : 9.700   Median :79.00  
##  Mean   : 41.77   Mean   :184.7   Mean   : 9.893   Mean   :77.38  
##  3rd Qu.: 63.00   3rd Qu.:258.5   3rd Qu.:11.500   3rd Qu.:84.75  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day       Neighborhood
##  Min.   :0.000   Min.   : 0.0   1:89        
##  1st Qu.:6.000   1st Qu.: 8.0   2:65        
##  Median :7.000   Median :16.0               
##  Mean   :6.948   Mean   :15.7               
##  3rd Qu.:8.000   3rd Qu.:23.0               
##  Max.   :9.000   Max.   :31.0               
##

ozone_percentmiss = sum(is.na(airquality$Ozone))/nrow(airquality)
ozone_percentmiss # the  missing percentage of Ozone variable is more than 5%

## [1] 0.2402597

hist(airquality$Ozone) # the distribution is skewed, replace NAs with median

library(DataCombine)
tmp <- DropNA(airquality, Var='Ozone')

## 37 rows dropped from the data frame because of missing values.

index1 <- is.na(airquality$Ozone)
median(airquality$Ozone)

## [1] NA

airquality$Ozone[index1] <- median(tmp$Ozone)
sum(is.na(airquality$Ozone))

## [1] 0

sr_percentmiss= sum(is.na(airquality$Solar.R))/nrow(airquality) 
sr_percentmiss# the  missing percentage of Solar.R variable is less than 5%, so we can replace NAs with 0

## [1] 0.04545455

index2 <- is.na(airquality$Solar.R) 
airquality$Solar.R[index2] <- 0
sum(is.na(airquality$Solar.R))

## [1] 0

sum(is.na(airquality))

## [1] 0

Outliers:

a)  Include a summary if your mahal scores are greater than the cutoff.
b)  What are the df for your Mahalanobis cutoff? 
c)  What is the cut off score for your Mahalanobis measure?
d)  How many outliers did you have?
e)  Delete all outliers.

outlier <- airquality[,-c(7)]
mahal <- mahalanobis(outlier,
                     colMeans(outlier,na.rm = TRUE),
                     cov(outlier))
cutoff = qchisq(1-.0001,ncol(outlier)) # df = 6
cutoff

## [1] 27.85634

summary(mahal < cutoff) # 2 outliers

##    Mode   FALSE    TRUE 
## logical       2     152

noout <- subset(outlier,mahal<cutoff)

Assumptions:

Additivity:

a)  Include the symnum bivariate correlation table of your continuous measures.
b)  Do you meet the assumption for additivity?

cor(noout)

##              Ozone     Solar.R        Wind       Temp       Month         Day
## Ozone    1.0000000  0.26993978 -0.51733552  0.6330348  0.18085840 -0.05424890
## Solar.R  0.2699398  1.00000000 -0.01879931  0.2850672 -0.02522364 -0.07092965
## Wind    -0.5173355 -0.01879931  1.00000000 -0.4593916 -0.17180036  0.04057959
## Temp     0.6330348  0.28506721 -0.45939162  1.0000000  0.42025126 -0.13337986
## Month    0.1808584 -0.02522364 -0.17180036  0.4202513  1.00000000 -0.01291092
## Day     -0.0542489 -0.07092965  0.04057959 -0.1333799 -0.01291092  1.00000000

library(corrplot)

## corrplot 0.84 loaded

corrplot(cor(noout))

symnum(cor(noout))

##         O S W T M D
## Ozone   1          
## Solar.R   1        
## Wind    .   1      
## Temp    ,   . 1    
## Month         . 1  
## Day               1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

# the assumption for additivity is met.

Linearity:

a)  Include a picture that shows how you might assess multivariate linearity.
b)  Do you think you've met the assumption for linearity?

random = rchisq(nrow(noout),6)
fake = lm(random~., data = noout)
summary(fake)

## 
## Call:
## lm(formula = random ~ ., data = noout)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.892 -2.365 -0.307  1.788  9.616 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  3.668249   2.946881   1.245  0.21522   
## Ozone        0.034939   0.012679   2.756  0.00661 **
## Solar.R     -0.005458   0.002775  -1.967  0.05113 . 
## Wind         0.180371   0.085811   2.102  0.03729 * 
## Temp         0.016704   0.038689   0.432  0.66658   
## Month       -0.252377   0.195996  -1.288  0.19992   
## Day          0.011468   0.028218   0.406  0.68505   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.037 on 145 degrees of freedom
## Multiple R-squared:  0.08401,    Adjusted R-squared:  0.04611 
## F-statistic: 2.216 on 6 and 145 DF,  p-value: 0.0447

standardized <- rstudent(fake)
qqnorm(standardized)
abline(0,1)

plot(fake,2)

# Yes. Linearity assumption is met.

Normality:

a)  Include a picture that shows how you might assess multivariate normality.
b)  Do you think you've met the assumption for normality?

library(moments)
skewness(noout)

##        Ozone      Solar.R         Wind         Temp        Month          Day 
##  1.377821097 -0.387550306  0.362194704 -0.367079196  0.009303634  0.015769806

kurtosis(noout)

##    Ozone  Solar.R     Wind     Temp    Month      Day 
## 4.427428 1.929602 3.092065 2.554183 1.704069 1.807454

hist(standardized,breaks = 15)

# The distribution is slightly right skewed.

Homogeneity/Homoscedasticity:

a)  Include a picture that shows how you might assess multivariate homogeneity.
b)  Do you think you've met the assumption for homogeneity? 
c)  Do you think you've met the assumption for homoscedasticity?

fitvalues = scale(fake$fitted.values)
plot(fake,1)

# The assumption for homogeneity is  met since the spread above the line is evenly spread below and above the line.
 # The assumption for homoscedasticity is met since the spread has no pattern.

HW1 - Data Screening

Zhuoxin Jiang

2020-07-25