The purpose of this assignment is to refresh your data screening skills. (Hint: take a look at your ANLY 500 notes)
Please load in the airquality.csv dataset. You will conduct data screening tasks on this dataset.
airquality <- read.csv("airquality.csv")
Daily air quality measurements in New York, May to September 1973. [Link] (https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/airquality)
Daily readings of the following air quality values for May 1, 1973 (a Tuesday) to September 30, 1973.
- Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island. The range is 0-165 ppb.
- Solar.R: Solar radiation in Langleys in the frequency band 4000--7700 Angstroms from 0800 to 1200 hours at Central Park. The range is 0-330 Langleys.
- Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport
- Temp: Maximum daily temperature in degrees Fahrenheit at La Guardia Airport.
- Neighborhood: 1 = Brooklyn, 2 = Bronx
a) Include output and indicate how the data are not accurate.
b) Include output to show how you fixed the accuracy errors, and describe what you did.
str(airquality)
## 'data.frame': 154 obs. of 7 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R : int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Neighborhood: int 1 1 1 1 1 1 1 1 1 1 ...
summary(airquality)
## Ozone Solar.R Wind Temp
## Min. : 0.00 Min. : 0.0 Min. : 0.000 Min. : 0.00
## 1st Qu.: 18.00 1st Qu.:113.5 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.00 Median :203.0 Median : 9.700 Median :79.00
## Mean : 41.77 Mean :184.7 Mean : 9.893 Mean :77.38
## 3rd Qu.: 63.00 3rd Qu.:258.5 3rd Qu.:11.500 3rd Qu.:84.75
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day Neighborhood
## Min. :0.000 Min. : 0.0 Min. :1.000
## 1st Qu.:6.000 1st Qu.: 8.0 1st Qu.:1.000
## Median :7.000 Median :16.0 Median :1.000
## Mean :6.948 Mean :15.7 Mean :1.422
## 3rd Qu.:8.000 3rd Qu.:23.0 3rd Qu.:2.000
## Max. :9.000 Max. :31.0 Max. :2.000
##
table(airquality$Neighborhood)
##
## 1 2
## 89 65
table(airquality$Day)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## 1 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## 26 27 28 29 30 31
## 5 5 5 5 5 3
table(airquality$Month)
##
## 0 5 6 7 8 9
## 1 31 30 31 31 30
which(airquality$Month == 0)
## [1] 108
airquality[108,]
## Ozone Solar.R Wind Temp Month Day Neighborhood
## 108 0 0 0 0 0 0 2
# Row 108 does not have any info but is just a row with zeros. Removing row 108 from the data
airquality <- airquality[-108,]
# The ranges for ozone and solar.r are 0-165 and 0-330. We have one observation for ozone that is above the ozone range and 2 observations for solar.r that are above the solar.r. range. Assuming these as inaccurate data, I will remove the rows for values that are above the max range.
which(airquality$Ozone >= 165)
## [1] 117
which(airquality$Solar.R >= 330)
## [1] 16 45
airquality <- airquality[c(-117,-16,-45),]
a) Include output that shows you have missing data.
b) Include output and a description that shows what you did with the missing data.
i) Replace all participant data if they have less than 20% of missing data by row.
ii) You can leave out the other participants (i.e. you do not have to create allrows).
summary(airquality)
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:113.5 1st Qu.: 7.400 1st Qu.:72.25
## Median : 31.50 Median :201.0 Median : 9.700 Median :79.00
## Mean : 41.27 Mean :183.5 Mean : 9.965 Mean :77.94
## 3rd Qu.: 62.50 3rd Qu.:257.0 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :135.00 Max. :323.0 Max. :20.700 Max. :97.00
## NA's :36 NA's :7
## Month Day Neighborhood
## Min. :5.000 Min. : 1.00 Min. :1.00
## 1st Qu.:6.000 1st Qu.: 8.00 1st Qu.:1.00
## Median :7.000 Median :16.00 Median :1.00
## Mean :7.007 Mean :15.75 Mean :1.42
## 3rd Qu.:8.000 3rd Qu.:23.00 3rd Qu.:2.00
## Max. :9.000 Max. :31.00 Max. :2.00
##
# We have missing data (NAs) in Ozone and Solar.R
# We don't have any data missing for Wind, Temp, Month, Day and Neighborhood
percentmiss = function(x){sum(is.na(x))/length(x)*100}
missingRows = apply(airquality,1,percentmiss) # calculates what % of data is missing from each row
table(missingRows)
## missingRows
## 0 14.2857142857143 28.5714285714286
## 109 39 2
# Removing the rows that have both Ozone and Solar.R missing. From the missing rows table above, we can see we have 2 rows where Ozone as well as Solar.R data is missing.
newairquality = subset(airquality, missingRows <= 20)
missingRowsNew = apply(newairquality, 1, percentmiss)
table(missingRowsNew)
## missingRowsNew
## 0 14.2857142857143
## 109 39
a) Include a summary if your mahal scores are greater than the cutoff.
b) What are the df for your Mahalanobis cutoff?
c) What is the cut off score for your Mahalanobis measure?
d) How many outliers did you have?
e) Delete all outliers.
scores <- newairquality[,(1:4)]
mahal = mahalanobis(scores,
colMeans(scores, na.rm=TRUE),
cov(scores, use ="pairwise.complete.obs")
)
cutoff = qchisq(1-.001,ncol(scores))
print(cutoff)
## [1] 18.46683
summary(mahal < cutoff)
## Mode TRUE NA's
## logical 109 39
scores_noout <- subset(scores, mahal < cutoff) # removes the outliers from scores
summary(scores_noout)
## Ozone Solar.R Wind Temp
## Min. : 1.0 Min. : 7.0 Min. : 2.300 Min. :57.00
## 1st Qu.: 18.0 1st Qu.:112.0 1st Qu.: 7.400 1st Qu.:71.00
## Median : 31.0 Median :203.0 Median : 9.700 Median :79.00
## Mean : 41.2 Mean :182.9 Mean : 9.985 Mean :77.89
## 3rd Qu.: 61.0 3rd Qu.:255.0 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :135.0 Max. :323.0 Max. :20.700 Max. :97.00
a) Include the symnum bivariate correlation table of your continuous measures.
b) Do you meet the assumption for additivity?
cor(scores_noout)
## Ozone Solar.R Wind Temp
## Ozone 1.0000000 0.3715898 -0.5972108 0.7354191
## Solar.R 0.3715898 1.0000000 -0.1275563 0.3217750
## Wind -0.5972108 -0.1275563 1.0000000 -0.4991125
## Temp 0.7354191 0.3217750 -0.4991125 1.0000000
symnum(cor(scores_noout))
## O S W T
## Ozone 1
## Solar.R . 1
## Wind . 1
## Temp , . . 1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
a) Include a picture that shows how you might assess multivariate linearity.
b) Do you think you've met the assumption for linearity?
random = rchisq(nrow(scores_noout), 7)
fake = lm(random~., data = scores_noout)
summary(fake)
##
## Call:
## lm(formula = random ~ ., data = scores_noout)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4913 -3.0751 -0.7287 2.1044 14.1708
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.419166 4.504689 -0.093 0.9260
## Ozone -0.013467 0.019605 -0.687 0.4937
## Solar.R -0.004195 0.004443 -0.944 0.3472
## Wind 0.082516 0.132486 0.623 0.5348
## Temp 0.100751 0.057875 1.741 0.0847 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.84 on 104 degrees of freedom
## Multiple R-squared: 0.03487, Adjusted R-squared: -0.002253
## F-statistic: 0.9393 on 4 and 104 DF, p-value: 0.4443
standardized = rstudent(fake)
qqnorm(standardized)
abline(0,1)
# The data looks more or less linear excepts at the ends.
a) Include a picture that shows how you might assess multivariate normality.
b) Do you think you've met the assumption for normality?
hist(standardized, breaks=10)
# The data does not look perfectly normal. It has a skew.
a) Include a picture that shows how you might assess multivariate homogeneity.
b) Do you think you've met the assumption for homogeneity?
c) Do you think you've met the assumption for homoscedasticity?
fitvalues = scale(fake$fitted.values)
plot(fitvalues, standardized)
abline(0,0)
abline(v = 0)
# The data looks homogeneous