Recipe 1

Matthew Macchi

Rensselaer Polytechnic Institute

9/18/14 Version 1

1. Setting

System under test

This recipe conducts an experiment on the nasaweather dataset. The experiment investigates the storms subset and compares the mean wind speed of storms in 1995 and 2000, in hopes of supporting or refuting the claim that storm wind speeds have generally increased over the years.

install.packages("nasaweather", repos='http://cran.us.r-project.org')
## 
## The downloaded binary packages are in
##  /var/folders/55/ql66yz5j3jzgkn6dmnb9sk1c0000gn/T//RtmpuQjEKE/downloaded_packages
library(nasaweather)
x <- storms
head(x)
##      name year month day hour  lat  long pressure wind                type
## 1 Allison 1995     6   3    0 17.4 -84.3     1005   30 Tropical Depression
## 2 Allison 1995     6   3    6 18.3 -84.9     1004   30 Tropical Depression
## 3 Allison 1995     6   3   12 19.3 -85.7     1003   35      Tropical Storm
## 4 Allison 1995     6   3   18 20.6 -85.8     1001   40      Tropical Storm
## 5 Allison 1995     6   4    0 22.0 -86.0      997   50      Tropical Storm
## 6 Allison 1995     6   4    6 23.3 -86.3      995   60      Tropical Storm
##   seasday
## 1       3
## 2       3
## 3       3
## 4       3
## 5       4
## 6       4

Factors and Levels

A factor of an experiment is a controlled independent variable; a variable whose levels are set by the experimenter. The single factor in this instance is the year of the storm.

The levels of a factor are the distinct values it takes. In this case, the levels of the factor are 1995 and 2000.

tail(x)
##        name year month day hour  lat  long pressure wind           type
## 2742 Nadine 2000    10  21    6 33.3 -53.5     1000   50 Tropical Storm
## 2743 Nadine 2000    10  21   12 34.1 -52.3     1000   50 Tropical Storm
## 2744 Nadine 2000    10  21   18 34.8 -51.3     1000   45 Tropical Storm
## 2745 Nadine 2000    10  22    0 35.7 -50.5     1004   40  Extratropical
## 2746 Nadine 2000    10  22    6 37.0 -49.0     1005   40  Extratropical
## 2747 Nadine 2000    10  22   12 39.0 -47.0     1005   35  Extratropical
##      seasday
## 2742     143
## 2743     143
## 2744     143
## 2745     144
## 2746     144
## 2747     144
summary(x)
##      name                year          month           day    
##  Length:2747        Min.   :1995   Min.   : 6.0   Min.   : 1  
##  Class :character   1st Qu.:1995   1st Qu.: 8.0   1st Qu.: 9  
##  Mode  :character   Median :1997   Median : 9.0   Median :18  
##                     Mean   :1997   Mean   : 8.8   Mean   :17  
##                     3rd Qu.:1999   3rd Qu.:10.0   3rd Qu.:25  
##                     Max.   :2000   Max.   :12.0   Max.   :31  
##       hour            lat            long           pressure   
##  Min.   : 0.00   Min.   : 8.3   Min.   :-107.3   Min.   : 905  
##  1st Qu.: 3.50   1st Qu.:17.2   1st Qu.: -77.6   1st Qu.: 980  
##  Median :12.00   Median :25.0   Median : -60.9   Median : 995  
##  Mean   : 9.06   Mean   :26.7   Mean   : -60.9   Mean   : 990  
##  3rd Qu.:18.00   3rd Qu.:33.9   3rd Qu.: -45.8   3rd Qu.:1004  
##  Max.   :18.00   Max.   :70.7   Max.   :   1.0   Max.   :1019  
##       wind           type              seasday   
##  Min.   : 15.0   Length:2747        Min.   :  3  
##  1st Qu.: 35.0   Class :character   1st Qu.: 84  
##  Median : 50.0   Mode  :character   Median :103  
##  Mean   : 54.7                      Mean   :103  
##  3rd Qu.: 70.0                      3rd Qu.:125  
##  Max.   :155.0                      Max.   :185
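
As a minimal sketch of what "factor" and "levels" mean in R, the year column can be converted to a factor and its levels listed. The small data frame demo below is an assumption for illustration only: it copies the name, year, and wind values of the twelve rows printed by head(x) and tail(x), so the block runs without the nasaweather package.

```r
# Illustrative data frame built from the rows shown by head(x) and tail(x).
# The name "demo" is hypothetical and not part of the original analysis.
demo <- data.frame(
  name = rep(c("Allison", "Nadine"), each = 6),
  year = rep(c(1995, 2000), each = 6),
  wind = c(30, 30, 35, 40, 50, 60, 50, 50, 45, 40, 40, 35)
)

# Treat year as a categorical factor rather than a number
demo$year_f <- factor(demo$year)

levels(demo$year_f)  # the levels of the factor
table(demo$year_f)   # number of observations at each level
```

With the full dataset, the same calls could be applied to the rows of x restricted to 1995 and 2000.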

Continuous variables (if any)

If a variable can take on any value between its minimum value and its maximum value, it is called a continuous variable; otherwise, it is called a discrete variable.

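In this instance, several variables are continuous, including lat, long, pressure, and wind. The one of interest is wind speed, which is treated as continuous even though the values shown above appear to be recorded in increments of 5.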

Response variables

A response variable is defined as the outcome of a study. It is a variable you would be interested in predicting or forecasting. It is often called a dependent variable or predicted variable. In this instance, the response variable is wind speed (the wind column).

The Data: How is it organized and what does it look like?

The data are initially organized as a table with 11 columns: name, year, month, day, hour, lat, long, pressure, wind, type, and seasday. All columns are numeric except name and type, which are character. Since the experiment focuses on 1995 and 2000, the data are subset to those two years.

# year is numeric, so compare against numbers rather than strings
y20 <- subset(x, year == 2000)
y95 <- subset(x, year == 1995)
summary(y20)
##      name                year          month            day      
##  Length:407         Min.   :2000   Min.   : 8.00   Min.   : 1.0  
##  Class :character   1st Qu.:2000   1st Qu.: 8.00   1st Qu.:10.0  
##  Mode  :character   Median :2000   Median : 9.00   Median :17.0  
##                     Mean   :2000   Mean   : 8.95   Mean   :16.1  
##                     3rd Qu.:2000   3rd Qu.:10.00   3rd Qu.:22.0  
##                     Max.   :2000   Max.   :10.00   Max.   :30.0  
##       hour            lat            long           pressure   
##  Min.   : 0.00   Min.   :10.3   Min.   :-101.0   Min.   : 941  
##  1st Qu.: 6.00   1st Qu.:17.2   1st Qu.: -73.3   1st Qu.: 984  
##  Median :12.00   Median :27.1   Median : -56.5   Median : 997  
##  Mean   : 9.17   Mean   :27.6   Mean   : -58.7   Mean   : 993  
##  3rd Qu.:18.00   3rd Qu.:35.4   3rd Qu.: -45.2   3rd Qu.:1007  
##  Max.   :18.00   Max.   :70.7   Max.   :  -4.0   Max.   :1013  
##       wind           type              seasday   
##  Min.   : 20.0   Length:407         Min.   : 64  
##  1st Qu.: 35.0   Class :character   1st Qu.: 83  
##  Median : 45.0   Mode  :character   Median :111  
##  Mean   : 52.4                      Mean   :106  
##  3rd Qu.: 65.0                      3rd Qu.:123  
##  Max.   :120.0                      Max.   :144
summary(y95)
##      name                year          month            day    
##  Length:724         Min.   :1995   Min.   : 6.00   Min.   : 1  
##  Class :character   1st Qu.:1995   1st Qu.: 8.00   1st Qu.: 6  
##  Mode  :character   Median :1995   Median : 8.50   Median :16  
##                     Mean   :1995   Mean   : 8.55   Mean   :16  
##                     3rd Qu.:1995   3rd Qu.: 9.00   3rd Qu.:26  
##                     Max.   :1995   Max.   :11.00   Max.   :31  
##       hour         lat            long          pressure   
##  Min.   : 0   Min.   : 8.3   Min.   :-98.5   Min.   : 919  
##  1st Qu.: 0   1st Qu.:19.1   1st Qu.:-76.0   1st Qu.: 979  
##  Median :12   Median :25.1   Median :-60.2   Median : 995  
##  Mean   : 9   Mean   :27.1   Mean   :-61.7   Mean   : 989  
##  3rd Qu.:18   3rd Qu.:33.6   3rd Qu.:-48.2   3rd Qu.:1004  
##  Max.   :18   Max.   :65.0   Max.   : -1.0   Max.   :1019  
##       wind           type              seasday     
##  Min.   : 20.0   Length:724         Min.   :  3.0  
##  1st Qu.: 35.0   Class :character   1st Qu.: 75.8  
##  Median : 50.0   Mode  :character   Median : 92.5  
##  Mean   : 54.2                      Mean   : 93.9  
##  3rd Qu.: 70.0                      3rd Qu.:122.0  
##  Max.   :130.0                      Max.   :156.0
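
The per-year subsetting above can also be sketched with a grouped summary. The block below is illustrative only: it uses a small hypothetical data frame, demo, built from the twelve wind values printed by head(x) and tail(x), so the means it produces describe those twelve rows and not the full 1995 and 2000 subsets.

```r
# Illustrative data copied from the head(x) and tail(x) output; "demo" is a
# hypothetical name used only for this sketch.
demo <- data.frame(
  year = rep(c(1995, 2000), each = 6),
  wind = c(30, 30, 35, 40, 50, 60, 50, 50, 45, 40, 40, 35)
)

# Subset one year at a time, as in the analysis
d95 <- subset(demo, year == 1995)
d20 <- subset(demo, year == 2000)

# Or compute the mean wind speed for every year in a single call
means <- tapply(demo$wind, demo$year, mean)
means
```

On the real data, tapply(x$wind, x$year, mean) would give the mean wind speed for each of the years in the dataset at once.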

Randomization

The data are observational, so no randomization was applied by the experimenter. Each storm was recorded at 6-hour intervals (hours 0, 6, 12, and 18), and the dataset covers the storm seasons of 1995 through 2000, six seasons in total.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

The null hypothesis is that there is no difference between the mean wind speed of storms in 1995 and the mean wind speed of storms in 2000. The alternative is that the two means differ. A two-sample t-test will be used to test for this difference at a significance level of alpha = 0.05.
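
Written out, the hypotheses and the pooled-variance test statistic take the following form. This is a sketch of the standard equal-variance two-sample t-test, which matches the var.equal=TRUE option used in the Testing section; the subscripts 95 and 00 are notation introduced here for the 1995 and 2000 subsets.

```latex
H_0 : \mu_{95} = \mu_{00}
\qquad \text{versus} \qquad
H_1 : \mu_{95} \neq \mu_{00}

t = \frac{\bar{x}_{95} - \bar{x}_{00}}
         {s_p \sqrt{\dfrac{1}{n_{95}} + \dfrac{1}{n_{00}}}},
\qquad
s_p^2 = \frac{(n_{95} - 1)\, s_{95}^2 + (n_{00} - 1)\, s_{00}^2}
             {n_{95} + n_{00} - 2}

\text{df} = n_{95} + n_{00} - 2 = 724 + 407 - 2 = 1129
```

H_0 is rejected at level alpha = 0.05 when the two-sided p-value for t on 1129 degrees of freedom falls below 0.05.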

What is the rationale for this design?

A two-sample t-test is used for this design. An independent two-sample t-test compares the mean of one sample with the mean of another to see whether there is a statistically significant difference between the two. As the name suggests, the test assumes that the samples are independent. Here the 1995 and 2000 observations come from different storms, although consecutive 6-hour readings within a single storm are likely correlated, which is a limitation of this design.

Randomize: What is the Randomization Scheme?

No randomization scheme was used. The data are observational: storms were recorded at 6-hour intervals as they occurred during the 1995 through 2000 seasons.

Replicate: Are there replicates and/or repeated measures?

There are no replicates. Repeated measures do occur, since each storm is recorded every 6 hours over its lifetime.

Block: Did you use blocking in the design?

No blocking was necessary in the construction of this experiment.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

par(mfrow=c(1,2))
hist(y95$wind)
hist(y20$wind)

[Figure: histograms of y95$wind and y20$wind, side by side]

boxplot(y95$wind, y20$wind, main="Wind Speed by Year", xlab="Year", ylab="Wind Speed (mph)", names=c("1995","2000"))

[Figure: boxplots of wind speed for 1995 and 2000]

Testing

A two-sample T-test was conducted to compare the means of the subsets.

t.test(y95$wind, y20$wind, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  y95$wind and y20$wind
## t = 1.187, df = 1129, p-value = 0.2356
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.143  4.642
## sample estimates:
## mean of x mean of y 
##     54.16     52.41

Based on these preliminary results from the t-test (p-value = 0.2356, which is greater than alpha = 0.05), we fail to reject the H0 that there is no difference between the mean wind speeds of storms in 1995 and 2000. However, we must check normality to ensure these results are valid.
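
The decision rule can be checked by hand from the numbers reported in the t-test output. The sketch below recomputes the two-sided p-value and the 95 percent confidence interval from the reported t statistic, degrees of freedom, and sample means; because those inputs are rounded, the results agree with the output only to about three decimal places. The variable names are introduced here for illustration.

```r
# Values copied from the t.test() output above
t_stat  <- 1.187
df      <- 1129
mean_95 <- 54.16
mean_20 <- 52.41
alpha   <- 0.05

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df)

# Standard error of the difference, recovered from t = difference / SE
diff_means <- mean_95 - mean_20
se_diff    <- diff_means / t_stat

# 95 percent confidence interval for the difference in means
ci <- diff_means + c(-1, 1) * qt(1 - alpha / 2, df = df) * se_diff

# Reject H0 only if the p-value is below alpha
reject_h0 <- p_value < alpha

p_value
ci
reject_h0
```

Since the p-value exceeds alpha and the confidence interval contains 0, the comparison does not reject H0.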

Diagnostics/Model Adequacy Checking

As was seen in the histograms, the data do not appear to be normally distributed. However, we can draw a normal Q-Q plot with a reference line to check how closely the data follow that line.

qqnorm(y95$wind, ylab="Wind Speed (mph)")
qqline(y95$wind)

[Figure: normal Q-Q plot of y95$wind with reference line]

qqnorm(y20$wind, ylab="Wind Speed (mph)")
qqline(y20$wind)

[Figure: normal Q-Q plot of y20$wind with reference line]
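
To make clear what a normal Q-Q plot displays, the sketch below reproduces the coordinates that qqnorm() computes: the observed values on the vertical axis against standard normal quantiles at the plotting positions given by ppoints() on the horizontal axis. The vector w is an assumption for illustration; it holds the six 1995 wind values printed by head(x), not the full y95$wind column.

```r
# Six wind values copied from the head(x) output (illustrative only)
w <- c(30, 30, 35, 40, 50, 60)

# qqnorm() can return its coordinates without drawing anything
qq <- qqnorm(w, plot.it = FALSE)

# Theoretical quantiles: standard normal quantiles at the plotting positions
theoretical <- qnorm(ppoints(length(w)))

# Each observation is paired with the theoretical quantile of its rank
cbind(theoretical = sort(qq$x), observed = sort(qq$y))
```

If the data were normal, these points would fall close to the straight line drawn by qqline().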

Further, we can use the Shapiro-Wilk Normality test to determine whether or not the data is Normally distributed.

shapiro.test(y95$wind)
## 
##  Shapiro-Wilk normality test
## 
## data:  y95$wind
## W = 0.9427, p-value = 4.222e-16
shapiro.test(y20$wind)
## 
##  Shapiro-Wilk normality test
## 
## data:  y20$wind
## W = 0.928, p-value = 4.369e-13
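
Both p-values are far below 0.05, so the Shapiro-Wilk test rejects normality for both years. One possible follow-up, which is not part of the original analysis, is to compare the two years with a test that does not assume normality, such as the Wilcoxon rank-sum test, and to relax the equal-variance assumption with Welch's t-test. The sketch below only illustrates the calls: w95 and w20 are hypothetical names holding the six wind values shown by head(x) and tail(x), so the results say nothing about the full dataset.

```r
# Six illustrative values per year, copied from the head(x) and tail(x) output
w95 <- c(30, 30, 35, 40, 50, 60)
w20 <- c(50, 50, 45, 40, 40, 35)

# Welch's t-test: the t.test() default, which does not assume equal variances
welch <- t.test(w95, w20)

# Wilcoxon rank-sum test: a non-parametric alternative.
# exact = FALSE requests the normal approximation, since the data contain ties.
rank_sum <- wilcox.test(w95, w20, exact = FALSE)

welch$p.value
rank_sum$p.value
```

On the real data, the corresponding calls would be t.test(y95$wind, y20$wind) and wilcox.test(y95$wind, y20$wind).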

4. References to the literature

See course canvas site

5. Appendices

A summary of, or pointer to, the raw data

complete and documented R code