Recipe 2: Two or More Factors

Recipes for the Design of Experiments: Recipe Outline

as of August 28, 2014, superseding the version of August 24. Always use the most recent version.

Analysis on the average air pressure at storm center

Wei ZOU

RPI

10/1/2014

1. Setting

System under test

Choose one of the large datasets listed on the Realtime Board (e.g., babynames or nasaweather)
Make sure you have > 1000 data points
What is the problem that you were given?
In this study, we explore the effects of year of occurrence and storm type on the air pressure at the storm's center. To do so, a two-factor, multi-level approach is used to conduct the analysis of variance.

# Install nasaweather only if it is missing; naming a CRAN mirror avoids the
# "trying to use CRAN without setting a mirror" error when knitting
if (!requireNamespace("nasaweather", quietly = TRUE)) {
  install.packages("nasaweather", repos = "https://cloud.r-project.org")
}
# Load from the default library path rather than a machine-specific lib.loc
library("nasaweather")
data1<-data.frame(storms)
summary(data1)
##      name                year          month           day    
##  Length:2747        Min.   :1995   Min.   : 6.0   Min.   : 1  
##  Class :character   1st Qu.:1995   1st Qu.: 8.0   1st Qu.: 9  
##  Mode  :character   Median :1997   Median : 9.0   Median :18  
##                     Mean   :1997   Mean   : 8.8   Mean   :17  
##                     3rd Qu.:1999   3rd Qu.:10.0   3rd Qu.:25  
##                     Max.   :2000   Max.   :12.0   Max.   :31  
##       hour            lat            long           pressure   
##  Min.   : 0.00   Min.   : 8.3   Min.   :-107.3   Min.   : 905  
##  1st Qu.: 3.50   1st Qu.:17.2   1st Qu.: -77.6   1st Qu.: 980  
##  Median :12.00   Median :25.0   Median : -60.9   Median : 995  
##  Mean   : 9.06   Mean   :26.7   Mean   : -60.9   Mean   : 990  
##  3rd Qu.:18.00   3rd Qu.:33.9   3rd Qu.: -45.8   3rd Qu.:1004  
##  Max.   :18.00   Max.   :70.7   Max.   :   1.0   Max.   :1019  
##       wind           type              seasday   
##  Min.   : 15.0   Length:2747        Min.   :  3  
##  1st Qu.: 35.0   Class :character   1st Qu.: 84  
##  Median : 50.0   Mode  :character   Median :103  
##  Mean   : 54.7                      Mean   :103  
##  3rd Qu.: 70.0                      3rd Qu.:125  
##  Max.   :155.0                      Max.   :185

Factors and Levels

There are two factors in this analysis. “Year of occurrence” has 6 levels, one for each year from 1995 to 2000; “storm type” has 4 levels (Tropical Depression, Tropical Storm, Hurricane, and Extratropical).

data1$year <-as.factor(data1$year)
nlevels(data1$year)
## [1] 6
data1$type<-as.factor(data1$type)
nlevels(data1$type)
## [1] 4
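Before fitting a two-factor model it can help to see how many observations fall in each year-by-type cell. The following is a minimal sketch, assuming the nasaweather package is installed; the names `cell_counts` and `is_balanced` are introduced here for illustration only.

```r
# Assumes the nasaweather package is installed
library(nasaweather)

data1 <- data.frame(nasaweather::storms)
data1$year <- as.factor(data1$year)
data1$type <- as.factor(data1$type)

# Observations per year x type cell (rows: year, columns: type)
cell_counts <- table(data1$year, data1$type, dnn = c("year", "type"))
print(cell_counts)

# TRUE only if every cell holds the same number of observations
is_balanced <- length(unique(as.vector(cell_counts))) == 1
print(is_balanced)
```

Unequal cell counts mean the design is unbalanced, which matters later when reading sequential ANOVA tables.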

Continuous variables (if any)

There are a number of continuous variables in the dataset, for example “lat”, “long”, “pressure”, and “wind”.

Response variables

The response variable in this study is “pressure”, which is the air pressure at the center of the storm.

The Data: How is it organized and what does it look like?

The data originate from the National Hurricane Center's archive of Tropical Cyclone Reports. They include the name and date of each storm, the hour and location of each observation, the air pressure, and the storm type.

Randomization

The data record all storms that occurred during the given time period, so the characteristics of the whole population are captured, and we may treat these data as randomized.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

In this study, we explore whether the year of occurrence and the storm type affect the air pressure at the center of the storm. To do so, we conduct a two-factor, multi-level analysis of variance and estimate a linear regression model to capture the quantitative effects.
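The linear regression model mentioned here could be fitted as in the sketch below. It is not the author's code: it assumes the nasaweather package is installed, and the object name `fit` is illustrative. Because `aov` fits its model through `lm`, the ANOVA table from `lm` matches the one from `aov` for the same formula.

```r
# Assumes the nasaweather package is installed
library(nasaweather)

data1 <- data.frame(nasaweather::storms)
data1$year <- as.factor(data1$year)
data1$type <- as.factor(data1$type)

# Linear model with both main effects and their interaction
fit <- lm(pressure ~ year * type, data = data1)

# Sequential ANOVA table for the fitted model
print(anova(fit))

# With R's default treatment contrasts, each coefficient is a difference
# from the baseline cell (first level of year, first level of type)
print(head(coef(summary(fit))))
```

The coefficient table gives the size of each effect in pressure units, which the ANOVA table alone does not show.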

What is the rationale for this design?

We are interested in finding out whether there is a trend in air pressure according to the year of occurrence or the storm type, so that we can predict the effects of future storms more accurately.

Randomize: What is the Randomization Scheme?

The data attempt to capture the characteristics of the entire population. No experiment was specially designed, so there is no randomization scheme.

Replicate: Are there replicates and/or repeated measures?

No, there are no replicates/repeated measures.

Block: Did you use blocking in the design?

This study used all the samples to conduct the analysis, so there is no blocking in the design.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

In the first boxplot (by year of occurrence), the median air pressure at the storm center varies little across years, although every year has a number of outliers, especially 1998. In the second boxplot (by storm type), the median air pressure at the storm center differs considerably between storm types, suggesting that much of the variation in air pressure is associated with storm type.

boxplot(pressure~year, data1)

[Figure: boxplot of pressure by year]

boxplot(pressure~type,data1)

[Figure: boxplot of pressure by storm type]
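The boxplots can be complemented with a numeric summary per group. The sketch below assumes the nasaweather package is installed; `by_type` is an illustrative name, and the same pattern works for year by splitting on `data1$year` instead.

```r
# Assumes the nasaweather package is installed
library(nasaweather)

data1 <- data.frame(nasaweather::storms)
data1$type <- as.factor(data1$type)

# Count, mean, median, and standard deviation of pressure per storm type
by_type <- do.call(rbind, lapply(split(data1$pressure, data1$type), function(x) {
  c(n = length(x), mean = mean(x), median = median(x), sd = sd(x))
}))
print(round(by_type, 1))
```

Comparing the `sd` column across groups also gives a first impression of whether the equal-variance assumption of ANOVA is plausible.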

Estimation (of Parameters)

We fit three analysis-of-variance models: one with year of occurrence alone, one with storm type alone, and one with both factors and their interaction. In every case the p-values are very small (2e-07 or less), so the variation in air pressure at the storm center is very unlikely to be due to chance alone. We may say that the year of occurrence, the storm type, and their interaction all have an effect on the air pressure at the storm center.

model1<-aov(pressure~year,data1)
anova(model1)
## Analysis of Variance Table
## 
## Response: pressure
##             Df Sum Sq Mean Sq F value Pr(>F)    
## year         5  22560    4512    13.2  1e-12 ***
## Residuals 2741 937143     342                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model2<-aov(pressure~type,data1)
anova(model2)
## Analysis of Variance Table
## 
## Response: pressure
##             Df Sum Sq Mean Sq F value Pr(>F)    
## type         3 538001  179334    1166 <2e-16 ***
## Residuals 2743 421702     154                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model3<-aov(pressure~year*type,data1)
anova(model3)
## Analysis of Variance Table
## 
## Response: pressure
##             Df Sum Sq Mean Sq F value Pr(>F)    
## year         5  22560    4512   30.75 <2e-16 ***
## type         3 528623  176208 1200.88 <2e-16 ***
## year:type   15   8967     598    4.07  2e-07 ***
## Residuals 2723 399553     147                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
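Note that `anova()` reports sequential (Type I) sums of squares, so each term is adjusted only for the terms listed before it. That is why the sum of squares for type is 538001 in model2 but 528623 in model3, where type is entered after year. The sketch below, which assumes the nasaweather package is installed and uses illustrative object names, fits the same model with the factors in both orders to show the dependence.

```r
# Assumes the nasaweather package is installed
library(nasaweather)

data1 <- data.frame(nasaweather::storms)
data1$year <- as.factor(data1$year)
data1$type <- as.factor(data1$type)

# Same model, factors entered in opposite orders
tab_year_first <- anova(aov(pressure ~ year * type, data = data1))
tab_type_first <- anova(aov(pressure ~ type * year, data = data1))

# Main-effect sums of squares change with the order; residuals do not
print(tab_year_first[, c("Df", "Sum Sq")])
print(tab_type_first[, c("Df", "Sum Sq")])
```

If the main-effect rows differ noticeably between the two tables, conclusions about the individual factors should be stated with the order in mind.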

Diagnostics/Model Adequacy Checking

We use three checks to verify model adequacy. For model1, the normal Q-Q plot shows that the residuals generally follow the normality assumption. For model2, the plot of residuals against fitted values shows no obvious trend in the residuals, supporting the assumption of randomness. For model3, the interaction plot shows that an interaction effect does exist, which is consistent with the ANOVA results.

qqnorm(residuals(model1))

[Figure: normal Q-Q plot of model1 residuals]

plot(fitted(model2),residuals(model2))

[Figure: model2 residuals against fitted values]

interaction.plot(data1$year,data1$type,data1$pressure)

[Figure: interaction plot of mean pressure by year and storm type]
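The checks above are spread over three different models. As a possible extension, the sketch below applies the residual checks to the full two-factor model and adds two formal tests from base R. It assumes the nasaweather package is installed, and `res3` is an illustrative name. With several thousand observations these tests tend to flag even small departures, so the plots remain the primary evidence.

```r
# Assumes the nasaweather package is installed
library(nasaweather)

data1 <- data.frame(nasaweather::storms)
data1$year <- as.factor(data1$year)
data1$type <- as.factor(data1$type)

model3 <- aov(pressure ~ year * type, data = data1)
res3 <- residuals(model3)

# Normal Q-Q plot with a reference line
qqnorm(res3)
qqline(res3)

# Residuals against fitted values; look for trends or funnel shapes
plot(fitted(model3), res3, xlab = "Fitted pressure", ylab = "Residual")
abline(h = 0, lty = 2)

# Shapiro-Wilk test for normality of the residuals
print(shapiro.test(res3))

# Bartlett test for equal variances across storm types
print(bartlett.test(pressure ~ type, data = data1))
```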

4. References to the literature

http://cran.r-project.org/web/packages/nasaweather/nasaweather.pdf

5. Appendices

A summary of, or pointer to, the raw data

complete and documented R code