Recipe 2: Two-Factor Analysis

Recipes for the Design of Experiments: Recipe Outline

Hurricane Storms

Jane Braun

RPI

September 29th Version 1.0

1. Setting

System under test

Choose one of the large datasets listed on the Realtime Board (e.g., babynames or nasaweather)
Make sure you have > 1000 data What is the problem that you were given?

load("C:/Users/braunj6/Documents/Fall 2014/Design of Experiments/storms.rdata")
fix(storms)

x<-storms

Factors and Levels

This dataset has 16 different columns with various characteristics of interest, and 2747 observations. Some of the factors of interest include the month, the latitude, the type, and the category, and how they relate to the wind speed.

head(x)
##      name year month day hour  lat  long pressure wind                type
## 1 Allison 1995     6   3    0 17.4 -84.3     1005   30 Tropical Depression
## 2 Allison 1995     6   3    6 18.3 -84.9     1004   30 Tropical Depression
## 3 Allison 1995     6   3   12 19.3 -85.7     1003   35      Tropical Storm
## 4 Allison 1995     6   3   18 20.6 -85.8     1001   40      Tropical Storm
## 5 Allison 1995     6   4    0 22.0 -86.0      997   50      Tropical Storm
## 6 Allison 1995     6   4    6 23.3 -86.3      995   60      Tropical Storm
##   seasday category
## 1       3       -1
## 2       3       -1
## 3       3        0
## 4       3        0
## 5       4        0
## 6       4        0
str(x)
## 'data.frame':    2747 obs. of  12 variables:
##  $ name    : chr  "Allison" "Allison" "Allison" "Allison" ...
##  $ year    : num  1995 1995 1995 1995 1995 ...
##  $ month   : num  6 6 6 6 6 6 6 6 6 6 ...
##  $ day     : num  3 3 3 3 4 4 4 4 5 5 ...
##  $ hour    : num  0 6 12 18 0 6 12 18 0 6 ...
##  $ lat     : num  17.4 18.3 19.3 20.6 22 23.3 24.7 26.2 27.6 28.5 ...
##  $ long    : num  -84.3 -84.9 -85.7 -85.8 -86 -86.3 -86.2 -86.2 -86.1 -85.6 ...
##  $ pressure: num  1005 1004 1003 1001 997 ...
##  $ wind    : num  30 30 35 40 50 60 65 65 65 60 ...
##  $ type    : chr  "Tropical Depression" "Tropical Depression" "Tropical Storm" "Tropical Storm" ...
##  $ seasday : num  3 3 3 3 4 4 4 4 5 5 ...
##  $ category: num  -1 -1 0 0 0 0 1 1 1 0 ...

Continuous variables

There are many continuous variables in this dataset. They include:

Response variables

The response variable being analyzed is the wind speed of the storm.

The Data: How is it organized and what does it look like?

The data is organized into 12 columns, with 2747 observations. Each row does not have a unique hurricane associated with it, but rather the various hurricanes at different time stamps along their path.

Randomization

This data is not randomized because it is taken every 6 hours along the path of the hurricane.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

The experiment will use a multi-factor analtsis of variance by subsetting the data. It will use various attributes as factors, such as month, latitude, category, and pressure.

What is the rationale for this design?

The rationale for this type of design is that the goal was to analyze the difference in means betweens between the groups. The ANOVA was set up to include interaction between the factors. Therefore, we wanted to see what the variation was both among, and between groups.

Randomize: What is the Randomization Scheme?

Because the dataset was a set of observations, there was no randomization.

Replicate: Are there replicates and/or repeated measures?

Each hurricane was measured multiple times throughout its path. Since each hurricane was unique, it was not possible to have repeated measures or replicates of the entire experiment.

Block: Did you use blocking in the design?

Blocking was used for latitude, month, category, and pressure.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

# Frequency of storms by type
barplot(table(x$type), main = "Frequency of Each Type of Storm")

plot of chunk unnamed-chunk-3

# Frequency of storms by category
barplot(table(x$category), main = "Frequency of Category of Storm")

plot of chunk unnamed-chunk-3

plot(x$lat, x$wind)

plot of chunk unnamed-chunk-3

boxplot(x$wind ~ x$month)

plot of chunk unnamed-chunk-3

barplot(table(x$month), main = "Frequency of Storms by Month")

plot of chunk unnamed-chunk-3

Testing

The focus of this recipe was on a >2 factor, >2 level ANOVA test. Therefore, the factors being analyzed had to be mutiple levels.

#ANOVA Testing

#Comparing ranges latitudes of storms
# H0: There is no difference in arrival delays between storms
# Ha: The difference in means is not = 0
x$lat[x$lat<20 & x$lat>=0] = "0-20"
x$lat[x$lat<40 & x$lat>=20] = "20-40"
x$lat[x$lat<60 & x$lat>=40] = "40-60"
x$lat[x$lat<80 & x$lat>=60] = "60-80"

model1 <- aov(wind ~ lat*month*category*pressure, data = x)
summary(model1)
##                               Df  Sum Sq Mean Sq  F value  Pr(>F)    
## lat                            3   30949   10316   389.84 < 2e-16 ***
## month                          1    4073    4073   153.90 < 2e-16 ***
## category                       1 1670049 1670049 63108.94 < 2e-16 ***
## pressure                       1   36412   36412  1375.98 < 2e-16 ***
## lat:month                      3     247      82     3.11  0.0253 *  
## lat:category                   3     470     157     5.93  0.0005 ***
## month:category                 1       2       2     0.06  0.8047    
## lat:pressure                   3    1359     453    17.12 5.2e-11 ***
## month:pressure                 1      29      29     1.08  0.2979    
## category:pressure              1   18472   18472   698.03 < 2e-16 ***
## lat:month:category             3     284      95     3.57  0.0135 *  
## lat:month:pressure             3     211      70     2.65  0.0470 *  
## lat:category:pressure          3      68      23     0.86  0.4601    
## month:category:pressure        1      98      98     3.71  0.0541 .  
## lat:month:category:pressure    2     129      64     2.43  0.0880 .  
## Residuals                   2716   71873      26                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The summary of the ANOVA gives the p-values for each individual factor, along with the p-value for the interactions between the factors. The null hypothesis states that the variation in the response variable, wind, cannot be explained by anything other than randomization. The p-values in this ANOVA support that this is not the case. There are significant p-values for latitude, month, category, pressures, and all interactions, except:

Diagnostics/Model Adequacy Checking

par(mfrow = c(1,1))
qqnorm(residuals(model1))
qqline(residuals(model1))

plot of chunk unnamed-chunk-5 The Q-Q Normality Plot of the residuals shows that the data is fairly normally distributed, and therefore the normality assumption is upheld.

interaction.plot(x$lat, x$month, x$wind)

plot of chunk unnamed-chunk-6

interaction.plot(x$month, x$lat, x$wind)

plot of chunk unnamed-chunk-6 Because the lines of the interaction plots are not parallel and they cross multiple imes, there is evidence to support that there is interaction between the different treatment levels.

plot(fitted(model1),residuals(model1))

plot of chunk unnamed-chunk-7 The plot of the fitted values versus the residual values is generally quite scattered and random.

4. Contingencies

There are multiple contingencies that could be looked into for the nasaweather storm data. First, pressure could be affected not only by the storm itself, but also the location of the storm. For example, the air pressure for a hurricane in Florida would be different from the air pressure of a storm in the Northeast - due to altitude, humidity, etc. Therefore, in the future it might be important to block by location/altitude also.

For future analysis, a Tukey Test could also be used. Tukey's range tests are used alongside ANOVAs in order to find means that are significantly different from each other, by compairing pairs of means. This test compares the mean of each treatment level to the mean of the other treatment levels.

The null hypothesis for a Tukey test states that there is no difference between the means of a pair of data, while the alternative states that there is a significant difference between the means.

model2 <- aov(wind~lat, data=x)
summary(model2)
##               Df  Sum Sq Mean Sq F value  Pr(>F)    
## lat            3   30949   10316    15.7 4.1e-10 ***
## Residuals   2743 1803776     658                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(model2)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = wind ~ lat, data = x)
## 
## $lat
##                diff     lwr     upr  p adj
## 20-40-0-20    4.383   1.645  7.1203 0.0002
## 40-60-0-20   -4.490  -8.779 -0.2014 0.0360
## 60-80-0-20  -13.032 -25.893 -0.1708 0.0456
## 40-60-20-40  -8.873 -12.991 -4.7552 0.0000
## 60-80-20-40 -17.415 -30.220 -4.6096 0.0027
## 60-80-40-60  -8.542 -21.765  4.6821 0.3450

Based on the ANOVA, the p-value is less than an alpha of .05, therefore it supports the alternative hypothesis, that the variation in wind values can be explained by change in latitude.

The results of the Tukey Test display p-values less than an alpha, 0.05 between all pairs of latitude ranges, except between the 40-60 and 60-80 group. This shows that, for these pairs, there is a significant difference in means between these pairs.

4. References to the literature

N/A

5. Appendices

A summary of, or pointer to, the raw data

The data can be found at GitHub https://github.com/hadley/nasaweather

Complete and documented R code

See code above for complete R code