Choose one of the large datasets listed on the Realtime Board (e.g., babynames or nasaweather)
Make sure you have > 1000 data
What is the problem that you were given?
load("C:/Users/braunj6/Documents/Fall 2014/Design of Experiments/storms.rdata")
fix(storms)
x<-storms
This dataset has 16 different columns with various characteristics of interest, and 2747 observations. Some of the factors of interest include the month, the latitude, the type, and the category, and how they relate to the wind speed.
head(x)
## name year month day hour lat long pressure wind type
## 1 Allison 1995 6 3 0 17.4 -84.3 1005 30 Tropical Depression
## 2 Allison 1995 6 3 6 18.3 -84.9 1004 30 Tropical Depression
## 3 Allison 1995 6 3 12 19.3 -85.7 1003 35 Tropical Storm
## 4 Allison 1995 6 3 18 20.6 -85.8 1001 40 Tropical Storm
## 5 Allison 1995 6 4 0 22.0 -86.0 997 50 Tropical Storm
## 6 Allison 1995 6 4 6 23.3 -86.3 995 60 Tropical Storm
## seasday category
## 1 3 -1
## 2 3 -1
## 3 3 0
## 4 3 0
## 5 4 0
## 6 4 0
str(x)
## 'data.frame': 2747 obs. of 12 variables:
## $ name : chr "Allison" "Allison" "Allison" "Allison" ...
## $ year : num 1995 1995 1995 1995 1995 ...
## $ month : num 6 6 6 6 6 6 6 6 6 6 ...
## $ day : num 3 3 3 3 4 4 4 4 5 5 ...
## $ hour : num 0 6 12 18 0 6 12 18 0 6 ...
## $ lat : num 17.4 18.3 19.3 20.6 22 23.3 24.7 26.2 27.6 28.5 ...
## $ long : num -84.3 -84.9 -85.7 -85.8 -86 -86.3 -86.2 -86.2 -86.1 -85.6 ...
## $ pressure: num 1005 1004 1003 1001 997 ...
## $ wind : num 30 30 35 40 50 60 65 65 65 60 ...
## $ type : chr "Tropical Depression" "Tropical Depression" "Tropical Storm" "Tropical Storm" ...
## $ seasday : num 3 3 3 3 4 4 4 4 5 5 ...
## $ category: num -1 -1 0 0 0 0 1 1 1 0 ...
There are many continuous variables in this dataset. They include:
The response variable being analyzed is the wind speed of the storm.
The data is organized into 12 columns, with 2747 observations. Each row does not have a unique hurricane associated with it, but rather the various hurricanes at different time stamps along their path.
This data is not randomized because it is taken every 6 hours along the path of the hurricane.
The experiment will use a multi-factor analtsis of variance by subsetting the data. It will use various attributes as factors, such as month, latitude, category, and pressure.
The rationale for this type of design is that the goal was to analyze the difference in means betweens between the groups. The ANOVA was set up to include interaction between the factors. Therefore, we wanted to see what the variation was both among, and between groups.
Because the dataset was a set of observations, there was no randomization.
Each hurricane was measured multiple times throughout its path. Since each hurricane was unique, it was not possible to have repeated measures or replicates of the entire experiment.
Blocking was used for latitude, month, category, and pressure.
# Frequency of storms by type
barplot(table(x$type), main = "Frequency of Each Type of Storm")
# Frequency of storms by category
barplot(table(x$category), main = "Frequency of Category of Storm")
plot(x$lat, x$wind)
boxplot(x$wind ~ x$month)
barplot(table(x$month), main = "Frequency of Storms by Month")
The focus of this recipe was on a >2 factor, >2 level ANOVA test. Therefore, the factors being analyzed had to be mutiple levels.
#ANOVA Testing
#Comparing ranges latitudes of storms
# H0: There is no difference in arrival delays between storms
# Ha: The difference in means is not = 0
x$lat[x$lat<20 & x$lat>=0] = "0-20"
x$lat[x$lat<40 & x$lat>=20] = "20-40"
x$lat[x$lat<60 & x$lat>=40] = "40-60"
x$lat[x$lat<80 & x$lat>=60] = "60-80"
model1 <- aov(wind ~ lat*month*category*pressure, data = x)
summary(model1)
## Df Sum Sq Mean Sq F value Pr(>F)
## lat 3 30949 10316 389.84 < 2e-16 ***
## month 1 4073 4073 153.90 < 2e-16 ***
## category 1 1670049 1670049 63108.94 < 2e-16 ***
## pressure 1 36412 36412 1375.98 < 2e-16 ***
## lat:month 3 247 82 3.11 0.0253 *
## lat:category 3 470 157 5.93 0.0005 ***
## month:category 1 2 2 0.06 0.8047
## lat:pressure 3 1359 453 17.12 5.2e-11 ***
## month:pressure 1 29 29 1.08 0.2979
## category:pressure 1 18472 18472 698.03 < 2e-16 ***
## lat:month:category 3 284 95 3.57 0.0135 *
## lat:month:pressure 3 211 70 2.65 0.0470 *
## lat:category:pressure 3 68 23 0.86 0.4601
## month:category:pressure 1 98 98 3.71 0.0541 .
## lat:month:category:pressure 2 129 64 2.43 0.0880 .
## Residuals 2716 71873 26
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary of the ANOVA gives the p-values for each individual factor, along with the p-value for the interactions between the factors. The null hypothesis states that the variation in the response variable, wind, cannot be explained by anything other than randomization. The p-values in this ANOVA support that this is not the case. There are significant p-values for latitude, month, category, pressures, and all interactions, except:
par(mfrow = c(1,1))
qqnorm(residuals(model1))
qqline(residuals(model1))
The Q-Q Normality Plot of the residuals shows that the data is fairly normally distributed, and therefore the normality assumption is upheld.
interaction.plot(x$lat, x$month, x$wind)
interaction.plot(x$month, x$lat, x$wind)
Because the lines of the interaction plots are not parallel and they cross multiple imes, there is evidence to support that there is interaction between the different treatment levels.
plot(fitted(model1),residuals(model1))
The plot of the fitted values versus the residual values is generally quite scattered and random.
There are multiple contingencies that could be looked into for the nasaweather storm data. First, pressure could be affected not only by the storm itself, but also the location of the storm. For example, the air pressure for a hurricane in Florida would be different from the air pressure of a storm in the Northeast - due to altitude, humidity, etc. Therefore, in the future it might be important to block by location/altitude also.
For future analysis, a Tukey Test could also be used. Tukey's range tests are used alongside ANOVAs in order to find means that are significantly different from each other, by compairing pairs of means. This test compares the mean of each treatment level to the mean of the other treatment levels.
The null hypothesis for a Tukey test states that there is no difference between the means of a pair of data, while the alternative states that there is a significant difference between the means.
model2 <- aov(wind~lat, data=x)
summary(model2)
## Df Sum Sq Mean Sq F value Pr(>F)
## lat 3 30949 10316 15.7 4.1e-10 ***
## Residuals 2743 1803776 658
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(model2)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = wind ~ lat, data = x)
##
## $lat
## diff lwr upr p adj
## 20-40-0-20 4.383 1.645 7.1203 0.0002
## 40-60-0-20 -4.490 -8.779 -0.2014 0.0360
## 60-80-0-20 -13.032 -25.893 -0.1708 0.0456
## 40-60-20-40 -8.873 -12.991 -4.7552 0.0000
## 60-80-20-40 -17.415 -30.220 -4.6096 0.0027
## 60-80-40-60 -8.542 -21.765 4.6821 0.3450
Based on the ANOVA, the p-value is less than an alpha of .05, therefore it supports the alternative hypothesis, that the variation in wind values can be explained by change in latitude.
The results of the Tukey Test display p-values less than an alpha, 0.05 between all pairs of latitude ranges, except between the 40-60 and 60-80 group. This shows that, for these pairs, there is a significant difference in means between these pairs.
N/A
The data can be found at GitHub https://github.com/hadley/nasaweather
See code above for complete R code