Directions

The purpose of this assignment is to refresh your data screening skills. (Hint: take a look at your ANLY 500 notes)

Please load in the airquality.csv dataset. You will conduct data screening tasks on this dataset.

airquality <- read.csv("airquality.csv")
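This assumes airquality.csv sits in the current working directory; an optional check up front makes a missing-file error easier to diagnose before any screening starts:

stopifnot(file.exists("airquality.csv"))  # stops with a clear error if the file cannot be found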

Dataset:

Daily air quality measurements in New York, May to September 1973. [Link](https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/airquality)

Variables:

Daily readings of the following air quality values for May 1, 1973 (a Tuesday) to September 30, 1973.

- Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island. The range is 0-165 ppb.
- Solar.R: Solar radiation in Langleys in the frequency band 4000--7700 Angstroms from 0800 to 1200 hours at Central Park. The range is 0-330 Langleys.
- Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport.
- Temp: Maximum daily temperature in degrees Fahrenheit at LaGuardia Airport.
- Neighborhood: 1 = Brooklyn, 2 = Bronx

Data screening:

Accuracy:

a)  Include output and indicate how the data are not accurate.
b)  Include output to show how you fixed the accuracy errors, and describe what you did.
str(airquality)
## 'data.frame':    154 obs. of  7 variables:
##  $ Ozone       : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R     : int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind        : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp        : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month       : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Neighborhood: int  1 1 1 1 1 1 1 1 1 1 ...
summary(airquality)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  0.00   Min.   :  0.0   Min.   : 0.000   Min.   : 0.00  
##  1st Qu.: 18.00   1st Qu.:113.5   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.00   Median :203.0   Median : 9.700   Median :79.00  
##  Mean   : 41.77   Mean   :184.7   Mean   : 9.893   Mean   :77.38  
##  3rd Qu.: 63.00   3rd Qu.:258.5   3rd Qu.:11.500   3rd Qu.:84.75  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day        Neighborhood  
##  Min.   :0.000   Min.   : 0.0   Min.   :1.000  
##  1st Qu.:6.000   1st Qu.: 8.0   1st Qu.:1.000  
##  Median :7.000   Median :16.0   Median :1.000  
##  Mean   :6.948   Mean   :15.7   Mean   :1.422  
##  3rd Qu.:8.000   3rd Qu.:23.0   3rd Qu.:2.000  
##  Max.   :9.000   Max.   :31.0   Max.   :2.000  
## 
table(airquality$Neighborhood)
## 
##  1  2 
## 89 65
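A side note on Neighborhood: it is stored as an integer but is really categorical (1 = Brooklyn, 2 = Bronx). Converting it to a labelled factor is not required for the screening below, but would make later tables easier to read; a possible sketch (optional, not run in this write-up):

airquality$Neighborhood <- factor(airquality$Neighborhood,
                                  levels = c(1, 2),
                                  labels = c("Brooklyn", "Bronx"))  # optional relabelling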
table(airquality$Day)
## 
##  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
##  1  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5 
## 26 27 28 29 30 31 
##  5  5  5  5  5  3
table(airquality$Month)
## 
##  0  5  6  7  8  9 
##  1 31 30 31 31 30
# Day and Month each show a single impossible value of 0; find which row it is.
which(airquality$Month == 0)
## [1] 108
airquality[108,] 
##     Ozone Solar.R Wind Temp Month Day Neighborhood
## 108     0       0    0    0     0   0            2
# Row 108 contains no real measurements: every reading is zero, and Month = 0 and Day = 0 are impossible values. Removing row 108 from the data.
airquality <- airquality[-108,]

# The documented ranges are 0-165 ppb for Ozone and 0-330 Langleys for Solar.R. One Ozone value and two Solar.R values exceed those maximums. Treating these as inaccurate data, I remove the rows with values above the documented range.
which(airquality$Ozone >= 165)
## [1] 117
which(airquality$Solar.R >= 330)
## [1] 16 45
airquality <- airquality[c(-117,-16,-45),]  # drop the three out-of-range rows flagged above
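An alternative to deleting whole rows, not used above, is to blank out only the impossible readings so the rest of those rows can still feed into the missing-data step; which() is used so that NA comparisons are skipped safely:

# alternative sketch: mark out-of-range readings as missing instead of dropping the rows
airquality$Ozone[which(airquality$Ozone > 165)] <- NA
airquality$Solar.R[which(airquality$Solar.R > 330)] <- NA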

Missing data:

a)  Include output that shows you have missing data.
b)  Include output and a description that shows what you did with the missing data.
    i)  Replace all participant data if they have less than 20% of missing data by row. 
    ii) You can leave out the other participants (i.e. you do not have to create allrows).
    
summary(airquality) 
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:113.5   1st Qu.: 7.400   1st Qu.:72.25  
##  Median : 31.50   Median :201.0   Median : 9.700   Median :79.00  
##  Mean   : 41.27   Mean   :183.5   Mean   : 9.965   Mean   :77.94  
##  3rd Qu.: 62.50   3rd Qu.:257.0   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :135.00   Max.   :323.0   Max.   :20.700   Max.   :97.00  
##  NA's   :36       NA's   :7                                       
##      Month            Day         Neighborhood 
##  Min.   :5.000   Min.   : 1.00   Min.   :1.00  
##  1st Qu.:6.000   1st Qu.: 8.00   1st Qu.:1.00  
##  Median :7.000   Median :16.00   Median :1.00  
##  Mean   :7.007   Mean   :15.75   Mean   :1.42  
##  3rd Qu.:8.000   3rd Qu.:23.00   3rd Qu.:2.00  
##  Max.   :9.000   Max.   :31.00   Max.   :2.00  
## 
# We have missing data (NAs) in Ozone and Solar.R
# We don't have any data missing for Wind, Temp, Month, Day and Neighborhood
percentmiss = function(x){sum(is.na(x))/length(x)*100}
missingRows = apply(airquality,1,percentmiss)  # calculates what % of data is missing from each row
table(missingRows)
## missingRows
##                0 14.2857142857143 28.5714285714286 
##              109               39                2
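The same percentmiss() function can also be applied column-wise (MARGIN = 2) as a quick cross-check of which variables carry the missing values; this mirrors the NA counts already visible in summary() above:

apply(airquality, 2, percentmiss)  # percent missing per column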
# Keeping only the rows with no more than 20% missing data. From the table above, 2 rows are missing both Ozone and Solar.R (about 29% missing), so they are dropped; the 39 rows missing a single value (about 14%) are kept.
newairquality = subset(airquality, missingRows <= 20)
missingRowsNew = apply(newairquality, 1, percentmiss)
table(missingRowsNew)
## missingRowsNew
##                0 14.2857142857143 
##              109               39
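The directions call for replacing the missing values in rows that are under 20% missing; this write-up keeps those NAs (they drop out later when their Mahalanobis scores come back NA). If imputation were wanted instead, a sketch using the mice package on the continuous columns (package choice and default settings are assumptions, not part of the original analysis) could look like:

library(mice)
temp_nomiss <- mice(newairquality[ , c("Ozone", "Solar.R", "Wind", "Temp")])  # impute the remaining NAs
nomiss <- complete(temp_nomiss)  # one completed dataset with the NAs filled in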

Outliers:

a)  Include a summary of your mahal scores that are greater than the cutoff.
b)  What are the df for your Mahalanobis cutoff? 
c)  What is the cut off score for your Mahalanobis measure?
d)  How many outliers did you have?
e)  Delete all outliers. 
scores <- newairquality[, 1:4]  # the four continuous measures: Ozone, Solar.R, Wind, Temp
mahal = mahalanobis(scores,
                    colMeans(scores, na.rm=TRUE),
                    cov(scores, use ="pairwise.complete.obs")
                    )

cutoff = qchisq(1 - .001, ncol(scores))  # df = ncol(scores) = 4 continuous variables
print(cutoff)
## [1] 18.46683
summary(mahal < cutoff)
##    Mode    TRUE    NA's 
## logical     109      39
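To answer (d) directly, the scores at or above the cutoff can be counted; since the summary above shows no FALSE entries, that count is zero, and the only rows lost in the subset below are the 39 with missing values.

sum(mahal >= cutoff, na.rm = TRUE)  # number of multivariate outliers; zero here per the summary above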
scores_noout <- subset(scores, mahal < cutoff)  # keeps rows below the cutoff; rows where mahal is NA (missing Ozone or Solar.R) are dropped as well
summary(scores_noout)
##      Ozone          Solar.R           Wind             Temp      
##  Min.   :  1.0   Min.   :  7.0   Min.   : 2.300   Min.   :57.00  
##  1st Qu.: 18.0   1st Qu.:112.0   1st Qu.: 7.400   1st Qu.:71.00  
##  Median : 31.0   Median :203.0   Median : 9.700   Median :79.00  
##  Mean   : 41.2   Mean   :182.9   Mean   : 9.985   Mean   :77.89  
##  3rd Qu.: 61.0   3rd Qu.:255.0   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :135.0   Max.   :323.0   Max.   :20.700   Max.   :97.00

Assumptions:

Additivity:

a)  Include the symnum bivariate correlation table of your continuous measures.
b)  Do you meet the assumption for additivity? 
cor(scores_noout)
##              Ozone    Solar.R       Wind       Temp
## Ozone    1.0000000  0.3715898 -0.5972108  0.7354191
## Solar.R  0.3715898  1.0000000 -0.1275563  0.3217750
## Wind    -0.5972108 -0.1275563  1.0000000 -0.4991125
## Temp     0.7354191  0.3217750 -0.4991125  1.0000000
symnum(cor(scores_noout))
##         O S W T
## Ozone   1      
## Solar.R . 1    
## Wind    .   1  
## Temp    , . . 1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
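# No correlation comes close to 1 (the largest is Ozone-Temp at about .74), so the continuous measures are not redundant and the additivity assumption appears to be met.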

Linearity:

a)  Include a picture that shows how you might assess multivariate linearity.
b)  Do you think you've met the assumption for linearity?
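One caveat before the fake regression below: rchisq() draws new random numbers on each run, so the coefficients and plots will shift slightly from knit to knit. If reproducible output is wanted, a seed (any fixed value) can be set before the draw:

set.seed(1234)  # arbitrary value; only needed if the random draw should be reproducible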
random = rchisq(nrow(scores_noout), 7)

fake = lm(random~., data = scores_noout)
summary(fake)
## 
## Call:
## lm(formula = random ~ ., data = scores_noout)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4913 -3.0751 -0.7287  2.1044 14.1708 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.419166   4.504689  -0.093   0.9260  
## Ozone       -0.013467   0.019605  -0.687   0.4937  
## Solar.R     -0.004195   0.004443  -0.944   0.3472  
## Wind         0.082516   0.132486   0.623   0.5348  
## Temp         0.100751   0.057875   1.741   0.0847 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.84 on 104 degrees of freedom
## Multiple R-squared:  0.03487,    Adjusted R-squared:  -0.002253 
## F-statistic: 0.9393 on 4 and 104 DF,  p-value: 0.4443
standardized = rstudent(fake)
qqnorm(standardized)
abline(0,1)

# The points follow the line fairly well except at the ends, so the data look more or less linear and the linearity assumption is reasonably met.

Normality:

a)  Include a picture that shows how you might assess multivariate normality.
b)  Do you think you've met the assumption for normality? 
hist(standardized, breaks=10)

# The data do not look perfectly normal; the histogram shows some skew.

Homogeneity/Homoscedasticity:

a)  Include a picture that shows how you might assess multivariate homogeneity.
b)  Do you think you've met the assumption for homogeneity? 
c)  Do you think you've met the assumption for homoscedasticity? 
fitvalues = scale(fake$fitted.values)

plot(fitvalues, standardized) 
abline(0,0)
abline(v = 0)

# The residuals are spread fairly evenly around zero across the fitted values, so both homogeneity and homoscedasticity appear to be met.