The purpose of this assignment is to refresh your data screening skills. (Hint: take a look at your ANLY 500 notes)
Please load in the airquality.csv dataset. You will conduct data screening tasks on this dataset.
library(readr)
airquality <- read_csv("airquality.csv")
## Parsed with column specification:
## cols(
## Ozone = col_double(),
## Solar.R = col_double(),
## Wind = col_double(),
## Temp = col_double(),
## Month = col_double(),
## Day = col_double(),
## Neighborhood = col_double()
## )
Daily air quality measurements in New York, May to September 1973. [Link] (https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/airquality)
Daily readings of the following air quality values for May 1, 1973 (a Tuesday) to September 30, 1973.
- Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island. The range is 0-165 ppb.
- Solar.R: Solar radiation in Langleys in the frequency band 4000--7700 Angstroms from 0800 to 1200 hours at Central Park. The range is 0-330 Langleys.
- Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport
- Temp: Maximum daily temperature in degrees Fahrenheit at La Guardia Airport.
- Neighborhood: 1 = Brooklyn, 2 = Bronx
a) Include output and indicate how the data are not accurate.
b) Include output to show how you fixed the accuracy errors, and describe what you did.
str(airquality) # the Neighborhood should be numeric type
## tibble [154 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Ozone : num [1:154] 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R : num [1:154] 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num [1:154] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : num [1:154] 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : num [1:154] 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : num [1:154] 1 2 3 4 5 6 7 8 9 10 ...
## $ Neighborhood: num [1:154] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. Ozone = col_double(),
## .. Solar.R = col_double(),
## .. Wind = col_double(),
## .. Temp = col_double(),
## .. Month = col_double(),
## .. Day = col_double(),
## .. Neighborhood = col_double()
## .. )
airquality$Neighborhood <- as.factor(airquality$Neighborhood) # fix Neighborhood to factor
str(airquality)
## tibble [154 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Ozone : num [1:154] 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R : num [1:154] 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num [1:154] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : num [1:154] 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : num [1:154] 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : num [1:154] 1 2 3 4 5 6 7 8 9 10 ...
## $ Neighborhood: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. Ozone = col_double(),
## .. Solar.R = col_double(),
## .. Wind = col_double(),
## .. Temp = col_double(),
## .. Month = col_double(),
## .. Day = col_double(),
## .. Neighborhood = col_double()
## .. )
summary(airquality)
## Ozone Solar.R Wind Temp
## Min. : 0.00 Min. : 0.0 Min. : 0.000 Min. : 0.00
## 1st Qu.: 18.00 1st Qu.:113.5 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.00 Median :203.0 Median : 9.700 Median :79.00
## Mean : 41.77 Mean :184.7 Mean : 9.893 Mean :77.38
## 3rd Qu.: 63.00 3rd Qu.:258.5 3rd Qu.:11.500 3rd Qu.:84.75
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day Neighborhood
## Min. :0.000 Min. : 0.0 1:89
## 1st Qu.:6.000 1st Qu.: 8.0 2:65
## Median :7.000 Median :16.0
## Mean :6.948 Mean :15.7
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
a) Include output that shows you have missing data.
b) Include output and a description that shows what you did with the missing data.
i) Replace all participant data if they have less than 20% of missing data by row.
ii) You can leave out the other participants (i.e. you do not have to create allrows).
sum(is.na(airquality))
## [1] 44
summary(airquality) # most of the missing data is in Ozone and Solar.R variables.
## Ozone Solar.R Wind Temp
## Min. : 0.00 Min. : 0.0 Min. : 0.000 Min. : 0.00
## 1st Qu.: 18.00 1st Qu.:113.5 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.00 Median :203.0 Median : 9.700 Median :79.00
## Mean : 41.77 Mean :184.7 Mean : 9.893 Mean :77.38
## 3rd Qu.: 63.00 3rd Qu.:258.5 3rd Qu.:11.500 3rd Qu.:84.75
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day Neighborhood
## Min. :0.000 Min. : 0.0 1:89
## 1st Qu.:6.000 1st Qu.: 8.0 2:65
## Median :7.000 Median :16.0
## Mean :6.948 Mean :15.7
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
ozone_percentmiss = sum(is.na(airquality$Ozone))/nrow(airquality)
ozone_percentmiss # the missing percentage of Ozone variable is more than 5%
## [1] 0.2402597
hist(airquality$Ozone) # the distribution is skewed, replace NAs with median
library(DataCombine)
tmp <- DropNA(airquality, Var='Ozone')
## 37 rows dropped from the data frame because of missing values.
index1 <- is.na(airquality$Ozone)
median(airquality$Ozone)
## [1] NA
airquality$Ozone[index1] <- median(tmp$Ozone)
sum(is.na(airquality$Ozone))
## [1] 0
sr_percentmiss= sum(is.na(airquality$Solar.R))/nrow(airquality)
sr_percentmiss# the missing percentage of Solar.R variable is less than 5%, so we can replace NAs with 0
## [1] 0.04545455
index2 <- is.na(airquality$Solar.R)
airquality$Solar.R[index2] <- 0
sum(is.na(airquality$Solar.R))
## [1] 0
sum(is.na(airquality))
## [1] 0
a) Include a summary if your mahal scores are greater than the cutoff.
b) What are the df for your Mahalanobis cutoff?
c) What is the cut off score for your Mahalanobis measure?
d) How many outliers did you have?
e) Delete all outliers.
outlier <- airquality[,-c(7)]
mahal <- mahalanobis(outlier,
colMeans(outlier,na.rm = TRUE),
cov(outlier))
cutoff = qchisq(1-.0001,ncol(outlier)) # df = 6
cutoff
## [1] 27.85634
summary(mahal < cutoff) # 2 outliers
## Mode FALSE TRUE
## logical 2 152
noout <- subset(outlier,mahal<cutoff)
a) Include the symnum bivariate correlation table of your continuous measures.
b) Do you meet the assumption for additivity?
cor(noout)
## Ozone Solar.R Wind Temp Month Day
## Ozone 1.0000000 0.26993978 -0.51733552 0.6330348 0.18085840 -0.05424890
## Solar.R 0.2699398 1.00000000 -0.01879931 0.2850672 -0.02522364 -0.07092965
## Wind -0.5173355 -0.01879931 1.00000000 -0.4593916 -0.17180036 0.04057959
## Temp 0.6330348 0.28506721 -0.45939162 1.0000000 0.42025126 -0.13337986
## Month 0.1808584 -0.02522364 -0.17180036 0.4202513 1.00000000 -0.01291092
## Day -0.0542489 -0.07092965 0.04057959 -0.1333799 -0.01291092 1.00000000
library(corrplot)
## corrplot 0.84 loaded
corrplot(cor(noout))
symnum(cor(noout))
## O S W T M D
## Ozone 1
## Solar.R 1
## Wind . 1
## Temp , . 1
## Month . 1
## Day 1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
# the assumption for additivity is met.
a) Include a picture that shows how you might assess multivariate linearity.
b) Do you think you've met the assumption for linearity?
random = rchisq(nrow(noout),6)
fake = lm(random~., data = noout)
summary(fake)
##
## Call:
## lm(formula = random ~ ., data = noout)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.892 -2.365 -0.307 1.788 9.616
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.668249 2.946881 1.245 0.21522
## Ozone 0.034939 0.012679 2.756 0.00661 **
## Solar.R -0.005458 0.002775 -1.967 0.05113 .
## Wind 0.180371 0.085811 2.102 0.03729 *
## Temp 0.016704 0.038689 0.432 0.66658
## Month -0.252377 0.195996 -1.288 0.19992
## Day 0.011468 0.028218 0.406 0.68505
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.037 on 145 degrees of freedom
## Multiple R-squared: 0.08401, Adjusted R-squared: 0.04611
## F-statistic: 2.216 on 6 and 145 DF, p-value: 0.0447
standardized <- rstudent(fake)
qqnorm(standardized)
abline(0,1)
plot(fake,2)
# Yes. Linearity assumption is met.
a) Include a picture that shows how you might assess multivariate normality.
b) Do you think you've met the assumption for normality?
library(moments)
skewness(noout)
## Ozone Solar.R Wind Temp Month Day
## 1.377821097 -0.387550306 0.362194704 -0.367079196 0.009303634 0.015769806
kurtosis(noout)
## Ozone Solar.R Wind Temp Month Day
## 4.427428 1.929602 3.092065 2.554183 1.704069 1.807454
hist(standardized,breaks = 15)
# The distribution is slightly right skewed.
a) Include a picture that shows how you might assess multivariate homogeneity.
b) Do you think you've met the assumption for homogeneity?
c) Do you think you've met the assumption for homoscedasticity?
fitvalues = scale(fake$fitted.values)
plot(fake,1)
# The assumption for homogeneity is met since the spread above the line is evenly spread below and above the line.
# The assumption for homoscedasticity is met since the spread has no pattern.