Recipe 6: Analysis of Covariance

Matthew Macchi

Rensselaer Polytechnic Institute

10/30/14 Version 1

1. Setting

System under test

This recipe will conduct an experiment on the ecdat dataset. The experiment will attempt to investigate the Air Quality for Californian Metropolitan Areas dataset.

install.packages("Ecdat", repos='http://cran.us.r-project.org')
## 
## The downloaded binary packages are in
##  /var/folders/55/ql66yz5j3jzgkn6dmnb9sk1c0000gn/T//Rtmp8l0fei/downloaded_packages
library("Ecdat", lib.loc="/Library/Frameworks/R.framework/Versions/3.1/Resources/library")
## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange
x<-Airq

Factors and Levels

The data being examined has 30 observations of 6 variables.

head(x)
##   airq  vala  rain coas   dens  medi
## 1  104  2734 12.63  yes 1815.9  4397
## 2   85  2479 47.14  yes  804.9  5667
## 3  127  4845 42.77  yes 1907.9 15817
## 4  145 19734 33.18   no 1876.1 32698
## 5   84  4094 34.55  yes  340.9  6250
## 6  135  1850 14.81   no  335.5  4705
tail(x)
##    airq vala  rain coas   dens medi
## 25   74 5609 42.36  yes 2649.1 8947
## 26  124 3700 29.51   no 9642.9 5952
## 27   69 1396 42.92  yes 1105.5 4146
## 28  118 3023 41.32   no  910.8 3207
## 29  129 1515 31.22   no  379.6  853
## 30  129 1879 30.95   no  455.9  853
summary(x)
##       airq          vala            rain       coas         dens      
##  Min.   : 59   Min.   :  993   Min.   :12.6   no : 9   Min.   :  272  
##  1st Qu.: 81   1st Qu.: 1536   1st Qu.:31.0   yes:21   1st Qu.:  365  
##  Median :114   Median : 2630   Median :36.7            Median :  796  
##  Mean   :105   Mean   : 4188   Mean   :36.1            Mean   : 1729  
##  3rd Qu.:126   3rd Qu.: 4141   3rd Qu.:42.7            3rd Qu.: 1635  
##  Max.   :165   Max.   :19734   Max.   :68.1            Max.   :12958  
##       medi      
##  Min.   :  853  
##  1st Qu.: 3340  
##  Median : 4858  
##  Mean   : 9477  
##  3rd Qu.: 8715  
##  Max.   :59460
str(x)
## 'data.frame':    30 obs. of  6 variables:
##  $ airq: int  104 85 127 145 84 135 88 118 74 104 ...
##  $ vala: num  2734 2479 4845 19734 4094 ...
##  $ rain: num  12.6 47.1 42.8 33.2 34.5 ...
##  $ coas: Factor w/ 2 levels "no","yes": 2 2 2 1 2 1 2 1 2 2 ...
##  $ dens: num  1816 805 1908 1876 341 ...
##  $ medi: int  4397 5667 15817 32698 6250 4705 7165 4472 2658 33885 ...
x$airq <- as.numeric(x$airq)
x$medi <- as.numeric(x$medi)

Continuous variables (if any)

If a variable can take on any value between its minimum value and its maximum value, it is called a continuous variable; otherwise, it is called a discrete variable.

The continuous variables in this dataset are value added of companies, rain, average income per head, and population density. ### Response variables A response variable is defined as the outcome of a study. It is a variable you would be interested in predicting or forecasting. It is often called a dependent variable or predicted variable. In this instance, a response variable is air quality.

summary(x$airq)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      59      81     114     105     126     165

The Data: How is it organized and what does it look like?

The data is organized initially into an 6 column table: The columns are titled as follows: airq, vala, rain, coas, dens, medi. All data is numeric less coas which is textual.

str(x)
## 'data.frame':    30 obs. of  6 variables:
##  $ airq: num  104 85 127 145 84 135 88 118 74 104 ...
##  $ vala: num  2734 2479 4845 19734 4094 ...
##  $ rain: num  12.6 47.1 42.8 33.2 34.5 ...
##  $ coas: Factor w/ 2 levels "no","yes": 2 2 2 1 2 1 2 1 2 2 ...
##  $ dens: num  1816 805 1908 1876 341 ...
##  $ medi: num  4397 5667 15817 32698 6250 ...

Randomization

This data comes from a collection from Yves Croissant. Since this is the only information available in regards to background information about the data collection, it is entirely possible that this data might not be completely randomized or the experiment had a completely randomized design. It is also published on 9/4/2014, so this data is very recent. # 2. (Experimental) Design ### How will the experiment be organized and conducted to test the hypothesis? In order to conduct this experiment, I will conduct an analysis of covariance (ANCOVA). It will use one attribute as factor - coastal - which has two factors. It will also use multiple independent explanatory variables, such as value added of companies, rain, average income per head, and population density, in order to predict the response variable - air quality. ### What is the rationale for this design? Covariance is a measure of how much two variables change together and how strong the relationship is between them. Analysis of Covariance (ANCOVA) is a general linear model which blends ANOVA and regression. ANCOVA evaluates whether population means of a dependent variable are equal across levels of a categorical independent variable, while statistically controlling for the effects of other continuous variables that are not of primary interest, known as covariates. Therefore, when performing ANCOVA, the dependent variable means are adjusted to what they would be if all groups were equal on the covariates. In this design, the categorical factor is coastal, the continuous dependent response variable is air quality, and the covariates are the value of added companies, rain, average income per head, and population density. ### Randomize: What is the Randomization Scheme? The experiment was based on observations and therefore was not randomized. ### Replicate: Are there replicates and/or repeated measures? There are no replicates, but repeated measures do occur between the factors and levels. ### Block: Did you use blocking in the design? No blocking was performed in this design. # 3. (Statistical) Analysis ### (Exploratory Data Analysis) Graphics and descriptive summary

attach(x)

The plot below compares air quality, rian, and added value as dependent and independent variables. The graphs horizontal to a given variable use that variable as an independent variable, and the graphs vertical to the given variable uses that variable as a dependent variable. For example, the graph in the top right shows a somewhat of a positive correlation between Air quality and Added Value, with air quality as the response (independent) variable. Because Air quality is the response variable of interest, the top row of graphs is of most interest.

plot(x[,c(1,3,2)], main = "Cross Comparison of Air Quality, Rain (in inches), and Added Value (in thousands of $)")

plot of chunk unnamed-chunk-6 As before, the plot below compares three different continuous variables, but this time, rain, added value, and population densiy were evaluated as dependent and independent variables. The graphs horizonal to a given variable use that variable as an independent variable, and the graphs vertical to the given variable uses that variable as a dependent variable. For example, the graph in the bottom-middle placement shows a generally positive correlation between added value and population density, with added value as the response (independent) variable.

plot(x[,c(3,2,5)], main = "Cross Comparison of Rain (in inches), Added Value (in thousands of $) and Population Density (per squre mile)")

plot of chunk unnamed-chunk-7 The plot below shows the relationship between Air Quality and Rain. As stated before, it appears that there is not much of a correlation between Air Quality and Rain, with Air Quality as the response variable.

plot(airq~rain, main = "Air Quality by Rain", xlab = "Rain (in inches)", ylab= "Air Quality (in millibars)")

plot of chunk unnamed-chunk-8 The plot below shows the relationship between Added Value and Rain. As stated before, it appears that there is not much of a correlation between Added Value and Rain, with Added Value as the response variable.

plot(vala~rain, main = "Added Value by Rain", xlab = "Rain (in inches)", ylab= "Added Value (in thousands of dollars)")

plot of chunk unnamed-chunk-9 The plot below shows the relationship between Added Value and Air Quality. As stated before, it appears that there is not much of a correlation between Added Value and Air Quality, with Added Value as the response variable. There is a slight exponential look to the points, but this is likely just outliers in the data.

plot(vala~airq, main = "Added Value by Air Quality", xlab = "Air Quality (in millibars)", ylab= "Added Value (in thousands of dollars)")

plot of chunk unnamed-chunk-10 The boxplot below shows the relationship between the various coastal locations and their associated added values.

boxplot(vala~coas, main = "Added Value by Coastal Location Relativity", xlab = "Coastal Location", ylab = "Added Value (in thousands of dollars)")

plot of chunk unnamed-chunk-11 The boxplot below shows the relationship between the various coastal locations and their associated Air Qualities.

boxplot(airq~coas, main = "Air Quality by Coastal Location Relativity", xlab = "Coastal Location", ylab = "Air Quality (in millibars)")

plot of chunk unnamed-chunk-12 ### Testing At this point, I am introducitng the Analysis of Covariance (ANCOVA) test. This ANCOVA will investigate and examine a single-factor, multiple dependent variable analysis of covariance.

modelA <-lm(airq~vala + rain + dens + coas, data = x)
anova(modelA)
## Analysis of Variance Table
## 
## Response: airq
##           Df Sum Sq Mean Sq F value Pr(>F)   
## vala       1   2411    2411    4.21 0.0507 . 
## rain       1     11      11    0.02 0.8887   
## dens       1    194     194    0.34 0.5655   
## coas       1   5858    5858   10.24 0.0037 **
## Residuals 25  14308     572                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANCOVA Results

The summary of the ANCOVA gives the p-values for each individual factor, along with the p-value for the interactions between the factors. The null hypothesis states that the variation in the response variable, air quality, cannot be explained by anything other than randomization. The p-values in this ANCOVA support that this is the case. The p-values for added value, rain, population density, and coastal location are greater than an alpha of 0.05. Therefore, it is likely that the variation in air quality can be explained by these factors.

Tukey’s HSD test is a post-hoc test, meaning that it is performed after an analysis of variance (ANCOVA) test. The purpose of Tukey’s HSD test is to determine which groups in the sample differ. While ANOVA can tell the researcher whether groups in the sample differ, it cannot tell the researcher which groups differ. That is, if the results of ANOVA are positive in the sense that they state there is a significant difference among the groups, the obvious question becomes: Which groups in this sample differ significantly? It is not likely that all groups differ when compared to each other, only that a handful have significant differences. Tukey’s HSD can clarify to the researcher which groups among the sample in specific have significant differences.

tukey1 <-TukeyHSD(aov(airq~ coas, data=x))
tukey1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = airq ~ coas, data = x)
## 
## $coas
##          diff    lwr    upr p adj
## yes-no -29.48 -49.77 -9.185 0.006

Diagnostics/Model Adequacy Checking

To check the adequacy of using the ANCOVA as a means of analyzing this set of data I performed Quantile-Quantile (Q-Q) tests on the residual error to determine if the residuals followed a normal distribution. I also created an interaction plot to see if there was an interaction effect between the two factors.

The somewhat linear fit of the residuals in the first QQ plot in reference to ‘vala’ is an indication that the model is somewhat adequate for this analysis.

The third type of plot is a Residuals vs.Fits plot which is used to identify the linearity of the residual values and to detemrine if there are any outlying values.

qqnorm(residuals(modelA))
qqline(residuals(modelA))

plot of chunk unnamed-chunk-15

plot(fitted(modelA),residuals(modelA))

plot of chunk unnamed-chunk-15 The null of a Shapiro-Wilk test states that the data presented is normally distribued. Fortunately, the p-value is greated than an alpha of 0.05, therefore we must fail to reject the null hypothesis. This suggests that the air quality data is normally distribuetd.

shapiro.test(x$airq)
## 
##  Shapiro-Wilk normality test
## 
## data:  x$airq
## W = 0.9426, p-value = 0.1068

4. References to the literature

See course canvas site. ### A summary of, or pointer to, the raw data ### complete and documented R code