Recipe for Descriptive Statistics

Ali Svoobda

RPI

9/17/14 V.1

1. Setting

System under test

This recipe examines the storms data from the nasaweather package. Specifically, the data are subset to keep only the midnight (hour=0) and noon (hour=12) observations, and the wind speed in each subset is examined.

To access the package and nasaweather datasets:

library(nasaweather)

Save storms data to workspace and create subsets:

storm <- storms
View(storm)  # note the capital V; view() is not an R function
# hour is stored as an integer, so compare against numbers rather than strings
AM <- subset(storm, hour == 0)
NOON <- subset(storm, hour == 12)

** Note ** Throughout this recipe, the AM subset refers to hour 0 (midnight) and the NOON subset refers to hour 12 (noon).
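The subsetting step can also be written with logical indexing, and the subset sizes can be checked against a tabulation of the hour column. The sketch below uses a small made-up data frame (`toy`, with illustrative values only) so that it runs without the nasaweather package; the same calls should apply to `storm`.

```r
# Toy stand-in for the storms data (illustrative values, not real observations)
toy <- data.frame(
  name = c("A", "A", "A", "A", "B", "B"),
  hour = c(0L, 6L, 12L, 18L, 0L, 12L),
  wind = c(30L, 30L, 35L, 40L, 50L, 65L)
)

# Logical indexing on the integer hour column
am   <- toy[toy$hour == 0, ]
noon <- toy[toy$hour == 12, ]

# Tabulate hour to see how many rows each subset should contain
table(toy$hour)
nrow(am)
nrow(noon)
```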

Factors and Levels

The factor in this experiment is hour, with 0 and 12 as the levels being examined. Other factors in the storms dataset include year (levels = 1995-2000), month (levels = 6-12), name (levels = Allison-Nadine), and type (levels = Tropical Depression, Tropical Storm, Hurricane, or Extratropical).
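One way to list the levels of a factor is to take the unique values of a column, or to convert a character column to a factor and call `levels()`. This is a minimal sketch on a made-up data frame (`toy` is hypothetical); on the real data the same calls would be applied to `storm$hour` and `storm$type`.

```r
# Toy stand-in with the same column names as the storms data (illustrative values)
toy <- data.frame(
  hour = c(0L, 6L, 12L, 18L, 0L, 12L),
  type = c("Tropical Depression", "Tropical Storm", "Hurricane",
           "Extratropical", "Tropical Storm", "Hurricane"),
  stringsAsFactors = FALSE
)

# Distinct values of an integer column
hour_levels <- sort(unique(toy$hour))
hour_levels

# Convert a character column to a factor to inspect and count its levels
type_f <- factor(toy$type)
levels(type_f)
table(type_f)
```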

First and last six rows of the midnight and noon subsets

head(AM)
##       name year month day hour  lat  long pressure wind
## 1  Allison 1995     6   3    0 17.4 -84.3     1005   30
## 5  Allison 1995     6   4    0 22.0 -86.0      997   50
## 9  Allison 1995     6   5    0 27.6 -86.1      988   65
## 13 Allison 1995     6   6    0 31.8 -82.8      993   30
## 17 Allison 1995     6   7    0 35.6 -75.9      992   40
## 21 Allison 1995     6   8    0 41.0 -67.7      982   50
##                   type seasday
## 1  Tropical Depression       3
## 5       Tropical Storm       4
## 9            Hurricane       5
## 13 Tropical Depression       6
## 17       Extratropical       7
## 21       Extratropical       8
tail(AM)
##         name year month day hour  lat  long pressure wind
## 2723 Michael 2000    10  18    0 30.4 -70.9      988   65
## 2727 Michael 2000    10  19    0 34.2 -67.8      983   75
## 2731 Michael 2000    10  20    0 48.0 -56.5      966   75
## 2737  Nadine 2000    10  20    0 28.7 -58.8     1008   30
## 2741  Nadine 2000    10  21    0 32.4 -55.2      999   50
## 2745  Nadine 2000    10  22    0 35.7 -50.5     1004   40
##                     type seasday
## 2723           Hurricane     140
## 2727           Hurricane     141
## 2731       Extratropical     142
## 2737 Tropical Depression     142
## 2741      Tropical Storm     143
## 2745       Extratropical     144
head(NOON)
##       name year month day hour  lat  long pressure wind           type
## 3  Allison 1995     6   3   12 19.3 -85.7     1003   35 Tropical Storm
## 7  Allison 1995     6   4   12 24.7 -86.2      987   65      Hurricane
## 11 Allison 1995     6   5   12 29.6 -84.7      990   60 Tropical Storm
## 15 Allison 1995     6   6   12 33.6 -80.0      995   35  Extratropical
## 19 Allison 1995     6   7   12 38.5 -71.0      988   45  Extratropical
## 23 Allison 1995     6   8   12 43.8 -63.7      989   50  Extratropical
##    seasday
## 3        3
## 7        4
## 11       5
## 15       6
## 19       7
## 23       8
tail(NOON)
##         name year month day hour  lat  long pressure wind
## 2729 Michael 2000    10  19   12 39.8 -61.6      979   75
## 2733 Michael 2000    10  20   12 51.0 -53.5      968   65
## 2735  Nadine 2000    10  19   12 26.2 -59.9     1009   25
## 2739  Nadine 2000    10  20   12 30.4 -57.2     1003   35
## 2743  Nadine 2000    10  21   12 34.1 -52.3     1000   50
## 2747  Nadine 2000    10  22   12 39.0 -47.0     1005   35
##                     type seasday
## 2729           Hurricane     141
## 2733       Extratropical     142
## 2735 Tropical Depression     141
## 2739      Tropical Storm     142
## 2743      Tropical Storm     143
## 2747       Extratropical     144

Continuous Variables

The continuous variable examined is wind, the storm's maximum sustained wind speed measured in knots. Other continuous variables in the dataset are day, latitude, longitude, pressure, and day of the hurricane season.

Response Variables

The response variable for this recipe is the wind speed. If further analysis were done, one of the other continuous variables could be treated as the response.

The Data: How is it organized and what does it look like?

The storms dataset contains 2747 observations of 11 variables. The data are organized chronologically. Each subset (AM and NOON) has one observation for every day a storm existed. After a storm ends, the next row of data corresponds to the first day of the next storm.
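The claim that each subset holds one observation per storm per day can be checked by looking for duplicated (name, year, month, day) combinations. The sketch below assumes a made-up subset (`toy_am`, illustrative values only); on the real data the same check would be run on `AM` and `NOON`.

```r
# Toy stand-in for the AM subset (illustrative values, not real observations)
toy_am <- data.frame(
  name  = c("A", "A", "A", "B", "B"),
  year  = c(1995L, 1995L, 1995L, 1995L, 1995L),
  month = c(6L, 6L, 6L, 7L, 7L),
  day   = c(3L, 4L, 5L, 1L, 2L),
  wind  = c(30L, 50L, 65L, 35L, 45L)
)

# A storm-day should appear at most once in the subset
key <- toy_am[, c("name", "year", "month", "day")]
any_dup <- anyDuplicated(key) > 0
any_dup

# Number of days observed for each storm
obs_per_storm <- aggregate(day ~ name + year, data = toy_am, FUN = length)
names(obs_per_storm)[3] <- "n_days"
obs_per_storm
```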

Structure of the storms dataset:

str(storm)
## Classes 'tbl_df', 'tbl' and 'data.frame':    2747 obs. of  11 variables:
##  $ name    : chr  "Allison" "Allison" "Allison" "Allison" ...
##  $ year    : int  1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
##  $ month   : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ day     : int  3 3 3 3 4 4 4 4 5 5 ...
##  $ hour    : int  0 6 12 18 0 6 12 18 0 6 ...
##  $ lat     : num  17.4 18.3 19.3 20.6 22 23.3 24.7 26.2 27.6 28.5 ...
##  $ long    : num  -84.3 -84.9 -85.7 -85.8 -86 -86.3 -86.2 -86.2 -86.1 -85.6 ...
##  $ pressure: int  1005 1004 1003 1001 997 995 987 988 988 990 ...
##  $ wind    : int  30 30 35 40 50 60 65 65 65 60 ...
##  $ type    : chr  "Tropical Depression" "Tropical Depression" "Tropical Storm" "Tropical Storm" ...
##  $ seasday : int  3 3 3 3 4 4 4 4 5 5 ...
head(storm)
##      name year month day hour  lat  long pressure wind                type
## 1 Allison 1995     6   3    0 17.4 -84.3     1005   30 Tropical Depression
## 2 Allison 1995     6   3    6 18.3 -84.9     1004   30 Tropical Depression
## 3 Allison 1995     6   3   12 19.3 -85.7     1003   35      Tropical Storm
## 4 Allison 1995     6   3   18 20.6 -85.8     1001   40      Tropical Storm
## 5 Allison 1995     6   4    0 22.0 -86.0      997   50      Tropical Storm
## 6 Allison 1995     6   4    6 23.3 -86.3      995   60      Tropical Storm
##   seasday
## 1       3
## 2       3
## 3       3
## 4       3
## 5       4
## 6       4
tail(storm)
##        name year month day hour  lat  long pressure wind           type
## 2742 Nadine 2000    10  21    6 33.3 -53.5     1000   50 Tropical Storm
## 2743 Nadine 2000    10  21   12 34.1 -52.3     1000   50 Tropical Storm
## 2744 Nadine 2000    10  21   18 34.8 -51.3     1000   45 Tropical Storm
## 2745 Nadine 2000    10  22    0 35.7 -50.5     1004   40  Extratropical
## 2746 Nadine 2000    10  22    6 37.0 -49.0     1005   40  Extratropical
## 2747 Nadine 2000    10  22   12 39.0 -47.0     1005   35  Extratropical
##      seasday
## 2742     143
## 2743     143
## 2744     143
## 2745     144
## 2746     144
## 2747     144

For more information about the storms dataset:

?storms
## starting httpd help server ... done

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

The null hypothesis is that there is no difference in mean wind speed between the midnight and noon observations. The alternative hypothesis is that the mean wind speeds at midnight and noon differ.

This will be tested by comparing the two subsets of data that were created in section 1 (AM and NOON). Once the necessary tests are performed, the assumptions will be checked to ensure a valid conclusion is reached.

What is the Rationale for this design?

The midnight (hour=0) and noon (hour=12) observations were chosen because they fall at opposite times of the day. The goal is to determine whether the wind speed differs significantly between these two times; in other words, does the time of day change the wind speed of storms?

Randomize: What is the Randomization Scheme?

The observations in the storms dataset were not randomized. Each storm was recorded every 6 hours (hours 0, 6, 12, and 18) for as long as it was tracked, including periods classified as tropical depression or extratropical.

Replicate: Are there replicates and/or repeated measures?

There are no replicates, as the observations are of natural occurrences that cannot be replicated by the experimenters. There are repeated measures, as all "named" storms from 1995 until 2000 were recorded at the same time intervals.

Block: Did you use blocking in the design?

Blocking was not required in creating the storms dataset.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

Summary statistics for the entire storms dataset (all hours):

summary(storm)
##      name                year          month           day    
##  Length:2747        Min.   :1995   Min.   : 6.0   Min.   : 1  
##  Class :character   1st Qu.:1995   1st Qu.: 8.0   1st Qu.: 9  
##  Mode  :character   Median :1997   Median : 9.0   Median :18  
##                     Mean   :1997   Mean   : 8.8   Mean   :17  
##                     3rd Qu.:1999   3rd Qu.:10.0   3rd Qu.:25  
##                     Max.   :2000   Max.   :12.0   Max.   :31  
##       hour            lat            long           pressure   
##  Min.   : 0.00   Min.   : 8.3   Min.   :-107.3   Min.   : 905  
##  1st Qu.: 3.50   1st Qu.:17.2   1st Qu.: -77.6   1st Qu.: 980  
##  Median :12.00   Median :25.0   Median : -60.9   Median : 995  
##  Mean   : 9.06   Mean   :26.7   Mean   : -60.9   Mean   : 990  
##  3rd Qu.:18.00   3rd Qu.:33.9   3rd Qu.: -45.8   3rd Qu.:1004  
##  Max.   :18.00   Max.   :70.7   Max.   :   1.0   Max.   :1019  
##       wind           type              seasday   
##  Min.   : 15.0   Length:2747        Min.   :  3  
##  1st Qu.: 35.0   Class :character   1st Qu.: 84  
##  Median : 50.0   Mode  :character   Median :103  
##  Mean   : 54.7                      Mean   :103  
##  3rd Qu.: 70.0                      3rd Qu.:125  
##  Max.   :155.0                      Max.   :185

Summary Statistics for each time subset (midnight and noon):

summary(AM)
##      name                year          month           day          hour  
##  Length:686         Min.   :1995   Min.   : 6.0   Min.   : 1   Min.   :0  
##  Class :character   1st Qu.:1995   1st Qu.: 8.0   1st Qu.: 9   1st Qu.:0  
##  Mode  :character   Median :1997   Median : 9.0   Median :18   Median :0  
##                     Mean   :1997   Mean   : 8.8   Mean   :17   Mean   :0  
##                     3rd Qu.:1999   3rd Qu.:10.0   3rd Qu.:25   3rd Qu.:0  
##                     Max.   :2000   Max.   :12.0   Max.   :31   Max.   :0  
##       lat            long           pressure         wind      
##  Min.   : 8.4   Min.   :-107.3   Min.   : 910   Min.   : 15.0  
##  1st Qu.:17.1   1st Qu.: -77.2   1st Qu.: 980   1st Qu.: 35.0  
##  Median :24.7   Median : -60.7   Median : 995   Median : 50.0  
##  Mean   :26.6   Mean   : -60.8   Mean   : 990   Mean   : 54.6  
##  3rd Qu.:34.0   3rd Qu.: -45.8   3rd Qu.:1004   3rd Qu.: 70.0  
##  Max.   :69.0   Max.   :   1.0   Max.   :1017   Max.   :155.0  
##      type              seasday   
##  Length:686         Min.   :  3  
##  Class :character   1st Qu.: 85  
##  Mode  :character   Median :103  
##                     Mean   :103  
##                     3rd Qu.:125  
##                     Max.   :185
summary(NOON)
##      name                year          month            day    
##  Length:691         Min.   :1995   Min.   : 6.00   Min.   : 1  
##  Class :character   1st Qu.:1995   1st Qu.: 8.00   1st Qu.: 9  
##  Mode  :character   Median :1997   Median : 9.00   Median :18  
##                     Mean   :1997   Mean   : 8.81   Mean   :17  
##                     3rd Qu.:1999   3rd Qu.:10.00   3rd Qu.:25  
##                     Max.   :2000   Max.   :12.00   Max.   :31  
##       hour         lat            long           pressure   
##  Min.   :12   Min.   : 8.6   Min.   :-104.0   Min.   : 914  
##  1st Qu.:12   1st Qu.:17.3   1st Qu.: -77.5   1st Qu.: 980  
##  Median :12   Median :25.0   Median : -60.9   Median : 995  
##  Mean   :12   Mean   :26.7   Mean   : -60.8   Mean   : 990  
##  3rd Qu.:12   3rd Qu.:33.9   3rd Qu.: -45.8   3rd Qu.:1004  
##  Max.   :12   Max.   :65.5   Max.   :  -4.4   Max.   :1019  
##       wind           type              seasday   
##  Min.   : 15.0   Length:691         Min.   :  3  
##  1st Qu.: 35.0   Class :character   1st Qu.: 85  
##  Median : 50.0   Mode  :character   Median :103  
##  Mean   : 54.5                      Mean   :103  
##  3rd Qu.: 70.0                      3rd Qu.:125  
##  Max.   :150.0                      Max.   :185

Mean wind speed for each time:

mean(AM$wind)
## [1] 54.63
mean(NOON$wind)
## [1] 54.54
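The group means, along with the sample size and standard deviation, can be computed in one pass with `tapply()`. As a small self-contained illustration, the sketch below uses only the six wind values printed by head(AM) and head(NOON) in section 1, so its results will not match the full-subset means above.

```r
# Wind speeds from the head(AM) and head(NOON) rows shown in section 1
wind_by_hour <- data.frame(
  hour = rep(c(0L, 12L), each = 6),
  wind = c(30, 50, 65, 30, 40, 50,   # head(AM)$wind
           35, 65, 60, 35, 45, 50)   # head(NOON)$wind
)

# Per-group sample size, mean, and standard deviation
group_n    <- tapply(wind_by_hour$wind, wind_by_hour$hour, length)
group_mean <- tapply(wind_by_hour$wind, wind_by_hour$hour, mean)
group_sd   <- tapply(wind_by_hour$wind, wind_by_hour$hour, sd)

cbind(n = group_n, mean = group_mean, sd = group_sd)
```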

Boxplots:

par(mfrow=c(1,2))
boxplot(AM$wind, main="AM Wind Speed")
boxplot(NOON$wind, main="NOON Wind Speed")

[Figure: side-by-side boxplots of AM and NOON wind speed]

Initial Analysis: From the graphs and descriptive summary, the wind speeds of the two groups (AM and NOON) appear very similar but tests should be done to see if the differences are significant.
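Drawing both groups in a single panel puts them on a shared y-axis, which can make the comparison easier to read than two separate panels. A minimal sketch using the formula interface of `boxplot()`, again on the six head() wind values from section 1 rather than the full subsets:

```r
# Wind speeds from the head(AM) and head(NOON) rows shown in section 1
wind_by_hour <- data.frame(
  hour = rep(c(0L, 12L), each = 6),
  wind = c(30, 50, 65, 30, 40, 50,   # head(AM)$wind
           35, 65, 60, 35, 45, 50)   # head(NOON)$wind
)

# One panel, one box per hour, shared y-axis
bp <- boxplot(wind ~ hour, data = wind_by_hour,
              names = c("AM (hour 0)", "NOON (hour 12)"),
              xlab = "Time of observation",
              ylab = "Wind speed (knots)",
              main = "Wind speed by time of day")

# The returned list holds the five-number summary of each box (row 3 is the median)
bp$stats
```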

Testing

The hypothesis will first be tested with a t-test. The null hypothesis for this test is that there will be no difference in means between the two groups being tested.

T-Test:

t.test(AM$wind, NOON$wind)
## 
##  Welch Two Sample t-test
## 
## data:  AM$wind and NOON$wind
## t = 0.0656, df = 1375, p-value = 0.9477
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.642  2.825
## sample estimates:
## mean of x mean of y 
##     54.63     54.54

Since the p-value (0.9477) is greater than the significance level (0.05), we fail to reject the null hypothesis.
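To see where the reported t, df, and p-value come from, the Welch statistic can be computed by hand and compared with `t.test()`. The sketch below uses simulated data (the vectors `x` and `y` and their parameters are assumptions made for illustration, not the storm observations).

```r
# Simulated samples standing in for the two groups (illustrative only)
set.seed(42)
x <- rnorm(30, mean = 55, sd = 25)
y <- rnorm(35, mean = 54, sd = 27)

# Per-group variance of the sample mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t statistic: difference in means over its standard error
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# Compare with the built-in test (Welch is the default, var.equal = FALSE)
res <- t.test(x, y)
c(t = t_manual, df = df_manual, p = p_manual)
res
```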

Diagnostics/Model Adequacy Checking

Visually inspect the normality of the wind speeds (all hours):

par(mfrow=c(1,1))
qqnorm(storm$wind)
qqline(storm$wind)

[Figure: normal Q-Q plot of wind speed with reference line]

Check whether the data follow the normality assumption of the t-test with a Shapiro-Wilk normality test:

shapiro.test(AM$wind)
## 
##  Shapiro-Wilk normality test
## 
## data:  AM$wind
## W = 0.9269, p-value < 2.2e-16
shapiro.test(NOON$wind)
## 
##  Shapiro-Wilk normality test
## 
## data:  NOON$wind
## W = 0.9263, p-value < 2.2e-16

We reject the null hypothesis that the data are normally distributed, since both p-values are less than the significance level. This violates the normality assumption of the t-test used above. See section 5: Contingencies.
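Because the t-test compares the two subsets, it can be useful to inspect normality separately for each group. This sketch splits a made-up, right-skewed sample by hour (`toy` and its simulated values are assumptions for illustration), runs `shapiro.test()` on each group, and draws one Q-Q plot per group.

```r
# Simulated right-skewed wind speeds for two hours (illustrative only)
set.seed(7)
toy <- data.frame(
  hour = rep(c(0L, 12L), each = 50),
  wind = 15 + rexp(100, rate = 1 / 40)
)

# Shapiro-Wilk test for each group
groups <- split(toy$wind, toy$hour)
sw <- lapply(groups, shapiro.test)
p_values <- sapply(sw, function(r) r$p.value)
p_values

# One Q-Q plot per group, side by side
par(mfrow = c(1, 2))
for (h in names(groups)) {
  qqnorm(groups[[h]], main = paste("Q-Q plot, hour", h))
  qqline(groups[[h]])
}
par(mfrow = c(1, 1))
```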

4. References to the Literature

None used.

5. Contingencies

Since the normality assumption of the t-test run in section 3 was violated, a different test is needed to compare the wind speeds at the two times. The Wilcoxon rank-sum test does not assume normality and can therefore be used:

wilcox.test(AM$wind, NOON$wind)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  AM$wind and NOON$wind
## W = 237709, p-value = 0.9247
## alternative hypothesis: true location shift is not equal to 0

The p-value is greater than the significance level so we fail to reject the null hypothesis that there is no true difference between the two groups.

We therefore find no evidence that the time of day causes a difference in the wind speed observed for storms.
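The W statistic reported by `wilcox.test()` is the sum of the ranks of the first sample in the pooled data, minus its smallest possible value. A minimal sketch that reproduces it by hand, using only the six head() wind values from section 1 (so W here will differ from the full-subset result above):

```r
# Wind speeds from the head(AM) and head(NOON) rows shown in section 1
x <- c(30, 50, 65, 30, 40, 50)   # head(AM)$wind
y <- c(35, 65, 60, 35, 45, 50)   # head(NOON)$wind

# Rank the pooled sample; tied values receive the average of their ranks
ranks <- rank(c(x, y))
n_x <- length(x)

# W = (sum of ranks of x) - n_x * (n_x + 1) / 2
w_manual <- sum(ranks[seq_len(n_x)]) - n_x * (n_x + 1) / 2

# With ties in a small sample, R warns that an exact p-value is unavailable
res <- suppressWarnings(wilcox.test(x, y))
w_manual
res$statistic
```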

6. Appendices

Link to raw data

https://github.com/hadley/nasaweather

Complete R Code

All included above.