This recipe conducts an experiment on the nasaweather dataset. It investigates the storms subset and compares the mean wind speed of storms in 1995 and 2000, in hopes of supporting or refuting the claim that storm wind speeds have generally increased over the years.
install.packages("nasaweather", repos='http://cran.us.r-project.org')
library("nasaweather")
x <- storms
head(x)
## name year month day hour lat long pressure wind type
## 1 Allison 1995 6 3 0 17.4 -84.3 1005 30 Tropical Depression
## 2 Allison 1995 6 3 6 18.3 -84.9 1004 30 Tropical Depression
## 3 Allison 1995 6 3 12 19.3 -85.7 1003 35 Tropical Storm
## 4 Allison 1995 6 3 18 20.6 -85.8 1001 40 Tropical Storm
## 5 Allison 1995 6 4 0 22.0 -86.0 997 50 Tropical Storm
## 6 Allison 1995 6 4 6 23.3 -86.3 995 60 Tropical Storm
## seasday
## 1 3
## 2 3
## 3 3
## 4 3
## 5 4
## 6 4
A factor of an experiment is a controlled independent variable: a variable whose levels are set by the experimenter. The single factor in this instance is the year being investigated.
The term level is also used for categorical variables. In this case, the levels of the factor are 1995 and 2000.
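As a minimal sketch of how the factor and its levels could be represented in R, the snippet below converts a short illustrative vector of years into a factor. The names `years_demo` and `year_f` are hypothetical, and the values are made up for illustration rather than taken from the dataset.

```r
# Illustrative values only; the real analysis uses the year column of storms.
years_demo <- c(1995, 1995, 2000, 1995, 2000)

# Treat year as a categorical factor with the two levels under study.
year_f <- factor(years_demo, levels = c(1995, 2000))

levels(year_f)   # the levels of the factor, stored as character labels
table(year_f)    # number of observations at each level
```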
tail(x)
## name year month day hour lat long pressure wind type
## 2742 Nadine 2000 10 21 6 33.3 -53.5 1000 50 Tropical Storm
## 2743 Nadine 2000 10 21 12 34.1 -52.3 1000 50 Tropical Storm
## 2744 Nadine 2000 10 21 18 34.8 -51.3 1000 45 Tropical Storm
## 2745 Nadine 2000 10 22 0 35.7 -50.5 1004 40 Extratropical
## 2746 Nadine 2000 10 22 6 37.0 -49.0 1005 40 Extratropical
## 2747 Nadine 2000 10 22 12 39.0 -47.0 1005 35 Extratropical
## seasday
## 2742 143
## 2743 143
## 2744 143
## 2745 144
## 2746 144
## 2747 144
summary(x)
## name year month day
## Length:2747 Min. :1995 Min. : 6.0 Min. : 1
## Class :character 1st Qu.:1995 1st Qu.: 8.0 1st Qu.: 9
## Mode :character Median :1997 Median : 9.0 Median :18
## Mean :1997 Mean : 8.8 Mean :17
## 3rd Qu.:1999 3rd Qu.:10.0 3rd Qu.:25
## Max. :2000 Max. :12.0 Max. :31
## hour lat long pressure
## Min. : 0.00 Min. : 8.3 Min. :-107.3 Min. : 905
## 1st Qu.: 3.50 1st Qu.:17.2 1st Qu.: -77.6 1st Qu.: 980
## Median :12.00 Median :25.0 Median : -60.9 Median : 995
## Mean : 9.06 Mean :26.7 Mean : -60.9 Mean : 990
## 3rd Qu.:18.00 3rd Qu.:33.9 3rd Qu.: -45.8 3rd Qu.:1004
## Max. :18.00 Max. :70.7 Max. : 1.0 Max. :1019
## wind type seasday
## Min. : 15.0 Length:2747 Min. : 3
## 1st Qu.: 35.0 Class :character 1st Qu.: 84
## Median : 50.0 Mode :character Median :103
## Mean : 54.7 Mean :103
## 3rd Qu.: 70.0 3rd Qu.:125
## Max. :155.0 Max. :185
If a variable can take on any value between its minimum value and its maximum value, it is called a continuous variable; otherwise, it is called a discrete variable.
In this instance, wind speed, the variable of interest, is treated as continuous because it is a measured quantity rather than a category. Latitude, longitude, and pressure are likewise continuous measurements.
A response variable is defined as the outcome of a study. It is a variable you would be interested in predicting or forecasting. It is often called a dependent variable or predicted variable. In this instance, the response variable is wind speed.
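One hedged way to check which columns are numeric and which are categorical is to inspect their classes. The sketch below uses a small toy data frame, `demo`, whose three rows are copied from the `head(x)` output above. The names `demo`, `col_classes`, and `numeric_cols` are hypothetical.

```r
# Toy data frame with a few storms columns; rows copied from head(x) above.
demo <- data.frame(
  name = c("Allison", "Allison", "Allison"),
  year = c(1995, 1995, 1995),
  wind = c(30, 30, 35),
  type = c("Tropical Depression", "Tropical Depression", "Tropical Storm"),
  stringsAsFactors = FALSE
)

# Class of every column: "numeric" for measurements, "character" for labels.
col_classes <- sapply(demo, class)
col_classes

# Names of the numeric columns, i.e. candidates for a response variable.
numeric_cols <- names(demo)[sapply(demo, is.numeric)]
numeric_cols
```

On the full dataset the same calls would be applied to `x` instead of `demo`.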
The data are organized initially into an 11-column table with the following columns: name, year, month, day, hour, lat, long, pressure, wind, type, and seasday. All columns are numeric except name and type, which are character. Since the experiment focuses on the years 1995 and 2000, the data are subset to include only values from those years.
# year is numeric, so compare against numbers rather than strings
y20 <- subset(x, year == 2000)
y95 <- subset(x, year == 1995)
summary(y20)
## name year month day
## Length:407 Min. :2000 Min. : 8.00 Min. : 1.0
## Class :character 1st Qu.:2000 1st Qu.: 8.00 1st Qu.:10.0
## Mode :character Median :2000 Median : 9.00 Median :17.0
## Mean :2000 Mean : 8.95 Mean :16.1
## 3rd Qu.:2000 3rd Qu.:10.00 3rd Qu.:22.0
## Max. :2000 Max. :10.00 Max. :30.0
## hour lat long pressure
## Min. : 0.00 Min. :10.3 Min. :-101.0 Min. : 941
## 1st Qu.: 6.00 1st Qu.:17.2 1st Qu.: -73.3 1st Qu.: 984
## Median :12.00 Median :27.1 Median : -56.5 Median : 997
## Mean : 9.17 Mean :27.6 Mean : -58.7 Mean : 993
## 3rd Qu.:18.00 3rd Qu.:35.4 3rd Qu.: -45.2 3rd Qu.:1007
## Max. :18.00 Max. :70.7 Max. : -4.0 Max. :1013
## wind type seasday
## Min. : 20.0 Length:407 Min. : 64
## 1st Qu.: 35.0 Class :character 1st Qu.: 83
## Median : 45.0 Mode :character Median :111
## Mean : 52.4 Mean :106
## 3rd Qu.: 65.0 3rd Qu.:123
## Max. :120.0 Max. :144
summary(y95)
## name year month day
## Length:724 Min. :1995 Min. : 6.00 Min. : 1
## Class :character 1st Qu.:1995 1st Qu.: 8.00 1st Qu.: 6
## Mode :character Median :1995 Median : 8.50 Median :16
## Mean :1995 Mean : 8.55 Mean :16
## 3rd Qu.:1995 3rd Qu.: 9.00 3rd Qu.:26
## Max. :1995 Max. :11.00 Max. :31
## hour lat long pressure
## Min. : 0 Min. : 8.3 Min. :-98.5 Min. : 919
## 1st Qu.: 0 1st Qu.:19.1 1st Qu.:-76.0 1st Qu.: 979
## Median :12 Median :25.1 Median :-60.2 Median : 995
## Mean : 9 Mean :27.1 Mean :-61.7 Mean : 989
## 3rd Qu.:18 3rd Qu.:33.6 3rd Qu.:-48.2 3rd Qu.:1004
## Max. :18 Max. :65.0 Max. : -1.0 Max. :1019
## wind type seasday
## Min. : 20.0 Length:724 Min. : 3.0
## 1st Qu.: 35.0 Class :character 1st Qu.: 75.8
## Median : 50.0 Mode :character Median : 92.5
## Mean : 54.2 Mean : 93.9
## 3rd Qu.: 70.0 3rd Qu.:122.0
## Max. :130.0 Max. :156.0
The data were recorded at 6-hour increments over the lifetime of each storm, for storms occurring from 1995 through 2000 (six storm seasons).
The null hypothesis is that there is no difference between the mean wind speeds of storms in 1995 and in 2000. The alternative is that there is a statistically significant difference. A two-sample t-test will be used to determine whether this difference exists, using an alpha of 0.05.
A two-sample t-test is used for this design. An independent t-test compares the mean of one sample with the mean of another to see whether there is a statistically significant difference between the two. As the name suggests, it applies when the samples are independent.
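To make the test concrete, the sketch below computes the pooled (equal-variance) two-sample t statistic by hand and compares it with `t.test(..., var.equal = TRUE)`. The vectors `a` and `b` are illustrative values, not storm data, and all variable names are hypothetical.

```r
# Illustrative samples only (not taken from the storms data).
a <- c(30, 35, 40, 50, 60, 65)
b <- c(25, 35, 45, 45, 50)

n_a <- length(a)
n_b <- length(b)

# Pooled variance: a weighted average of the two sample variances.
sp2 <- ((n_a - 1) * var(a) + (n_b - 1) * var(b)) / (n_a + n_b - 2)

# t statistic: difference in means divided by its pooled standard error.
t_manual <- (mean(a) - mean(b)) / sqrt(sp2 * (1 / n_a + 1 / n_b))
dof      <- n_a + n_b - 2

# Two-sided p-value from the t distribution.
p_manual <- 2 * pt(-abs(t_manual), dof)

# The built-in test should agree with the manual calculation.
res <- t.test(a, b, var.equal = TRUE)
all.equal(unname(res$statistic), t_manual)
all.equal(res$p.value, p_manual)
```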
There are no replicates, but repeated measures do occur: each storm is observed several times, at 6-hour intervals, within each of the two years.
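The repeated measurements can be seen by counting rows per storm. The sketch below does this on a toy data frame built from the wind values shown in the `head(x)` and `tail(x)` output, and also shows one possible way, not used in this analysis, to reduce each storm to a single mean value. The names `demo`, `obs_per_storm`, and `storm_means` are hypothetical.

```r
# Toy rows copied from head(x) and tail(x); not the full dataset.
demo <- data.frame(
  name = rep(c("Allison", "Nadine"), each = 6),
  year = rep(c(1995, 2000), each = 6),
  wind = c(30, 30, 35, 40, 50, 60, 50, 50, 45, 40, 40, 35)
)

# Number of 6-hourly observations recorded for each storm.
obs_per_storm <- table(demo$name)
obs_per_storm

# One mean wind speed per storm, so each storm contributes a single value.
storm_means <- aggregate(wind ~ name + year, data = demo, FUN = mean)
storm_means
```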
No blocking was necessary in the construction of this experiment.
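par(mfrow=c(1,2))
hist(y95$wind, main="1995", xlab="Wind Speed (knots)")
hist(y20$wind, main="2000", xlab="Wind Speed (knots)")
par(mfrow=c(1,1))
boxplot(y95$wind, y20$wind, main="Wind Speed by Year", xlab="Year", ylab="Wind Speed (knots)", names=c("1995","2000"))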
A two-sample t-test was conducted to compare the means of the two subsets.
t.test(y95$wind, y20$wind, var.equal=TRUE)
##
## Two Sample t-test
##
## data: y95$wind and y20$wind
## t = 1.187, df = 1129, p-value = 0.2356
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.143 4.642
## sample estimates:
## mean of x mean of y
## 54.16 52.41
Based on these preliminary results from the t-test (p = 0.2356, which is greater than the alpha of 0.05), we fail to reject the H0 that there is no difference between the mean wind speeds of storms in 1995 and 2000. However, we must check normality to ensure these results are valid.
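The decision rule can be checked directly against the numbers reported in the t-test output. The sketch below recomputes the two-sided p-value from the reported statistic and degrees of freedom, then applies the alpha of 0.05. The variable names are hypothetical, and the inputs are copied from the output above.

```r
# Values copied from the t.test output above.
t_stat <- 1.187
dof    <- 1129
alpha  <- 0.05
ci     <- c(-1.143, 4.642)

# Two-sided p-value recomputed from the reported statistic.
p_value <- 2 * pt(-abs(t_stat), dof)
p_value

# H0 is rejected only when the p-value falls below alpha.
reject_h0 <- p_value < alpha
reject_h0

# Equivalent check: a 95% interval that contains 0 means H0 is not rejected.
ci_has_zero <- ci[1] < 0 && ci[2] > 0
ci_has_zero
```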
As was seen with the histograms, the data do not appear to be normally distributed. However, we can draw a normal Q-Q plot with a reference line to check how closely the data follow that line.
qqnorm(y95$wind, ylab="Wind Speed (knots)")
qqline(y95$wind)
qqnorm(y20$wind, ylab="Wind Speed (knots)")
qqline(y20$wind)
Further, we can use the Shapiro-Wilk normality test to determine whether or not the data are normally distributed.
shapiro.test(y95$wind)
##
## Shapiro-Wilk normality test
##
## data: y95$wind
## W = 0.9427, p-value = 4.222e-16
shapiro.test(y20$wind)
##
## Shapiro-Wilk normality test
##
## data: y20$wind
## W = 0.928, p-value = 4.369e-13
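The Shapiro-Wilk null hypothesis is that the sample comes from a normal distribution, so a p-value below alpha is evidence against normality. As a sketch of how to read the result, the snippet below runs the test on two synthetic, deterministic samples (not the storms data) and then compares the two p-values reported above with the alpha of 0.05. All variable names are hypothetical.

```r
alpha <- 0.05

# Synthetic samples: evenly spaced quantiles of a normal distribution
# and of a right-skewed exponential distribution.
normal_like <- qnorm(ppoints(200))
skewed      <- qexp(ppoints(200))

sw_normal <- shapiro.test(normal_like)
sw_skewed <- shapiro.test(skewed)

# A large p-value gives no evidence against normality;
# a small one rejects it.
sw_normal$p.value > alpha
sw_skewed$p.value < alpha

# p-values copied from the two Shapiro-Wilk outputs above.
p_reported <- c(y95 = 4.222e-16, y20 = 4.369e-13)
normality_rejected <- p_reported < alpha
normality_rejected
```

Both reported p-values are far below 0.05, so normality is rejected for the 1995 and 2000 wind speeds, which agrees with the impression from the histograms.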
See course canvas site