Recipe 2: Factorial Experiment

This is an R Markdown document. Markdown is a simple formatting syntax for authoring web pages (click the MD toolbar button for help on Markdown). When you click the Knit HTML button a web page will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

Recipes for the Design of Experiments: Recipe Outline

as of August 28, 2014, superseding the version of August 24. Always use the most recent version.

NASA Weather: Analysis of Storm Pressures

Max Winkelman

Rensselaer Polytechnic Institute

October 2 2014

Version 1

1. Setting

NASA Weather

The data in this package are atmospheric and geographic measures from a 24 by 24 grid that encloses Central America. There are four sets of data. The data set to be analyzed in this recipe will be “Storms.”

Install the ‘nasaweather’ package

install.packages("nasaweather", repos='http://cran.us.r-project.org')
## package 'nasaweather' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Max\AppData\Local\Temp\RtmpGKc440\downloaded_packages
#downloads 'nasaweather' from cran website
library("nasaweather", lib.loc="C:/Program Files/R/R-3.1.1/library")
#references 'fueleconomy' in R library
storms<-storms
#assigns the variables in the data set, storms, to the variable named 'data'

Factors and Levels

Factor: Year and Month

Levels: 1995, 1996, 1997, 1998, 1999, and 2000 and June(6), July(7), August(8), September (9), October(10), November(11), and December(12)

head(storms)
##      name year month day hour  lat  long pressure wind                type
## 1 Allison 1995     6   3    0 17.4 -84.3     1005   30 Tropical Depression
## 2 Allison 1995     6   3    6 18.3 -84.9     1004   30 Tropical Depression
## 3 Allison 1995     6   3   12 19.3 -85.7     1003   35      Tropical Storm
## 4 Allison 1995     6   3   18 20.6 -85.8     1001   40      Tropical Storm
## 5 Allison 1995     6   4    0 22.0 -86.0      997   50      Tropical Storm
## 6 Allison 1995     6   4    6 23.3 -86.3      995   60      Tropical Storm
##   seasday
## 1       3
## 2       3
## 3       3
## 4       3
## 5       4
## 6       4
#displays the first 6 sets of variables 
tail(storms)
##        name year month day hour  lat  long pressure wind           type
## 2742 Nadine 2000    10  21    6 33.3 -53.5     1000   50 Tropical Storm
## 2743 Nadine 2000    10  21   12 34.1 -52.3     1000   50 Tropical Storm
## 2744 Nadine 2000    10  21   18 34.8 -51.3     1000   45 Tropical Storm
## 2745 Nadine 2000    10  22    0 35.7 -50.5     1004   40  Extratropical
## 2746 Nadine 2000    10  22    6 37.0 -49.0     1005   40  Extratropical
## 2747 Nadine 2000    10  22   12 39.0 -47.0     1005   35  Extratropical
##      seasday
## 2742     143
## 2743     143
## 2744     143
## 2745     144
## 2746     144
## 2747     144
#displays the last 6 sets of variables 
summary(storms)
##      name                year          month           day    
##  Length:2747        Min.   :1995   Min.   : 6.0   Min.   : 1  
##  Class :character   1st Qu.:1995   1st Qu.: 8.0   1st Qu.: 9  
##  Mode  :character   Median :1997   Median : 9.0   Median :18  
##                     Mean   :1997   Mean   : 8.8   Mean   :17  
##                     3rd Qu.:1999   3rd Qu.:10.0   3rd Qu.:25  
##                     Max.   :2000   Max.   :12.0   Max.   :31  
##       hour            lat            long           pressure   
##  Min.   : 0.00   Min.   : 8.3   Min.   :-107.3   Min.   : 905  
##  1st Qu.: 3.50   1st Qu.:17.2   1st Qu.: -77.6   1st Qu.: 980  
##  Median :12.00   Median :25.0   Median : -60.9   Median : 995  
##  Mean   : 9.06   Mean   :26.7   Mean   : -60.9   Mean   : 990  
##  3rd Qu.:18.00   3rd Qu.:33.9   3rd Qu.: -45.8   3rd Qu.:1004  
##  Max.   :18.00   Max.   :70.7   Max.   :   1.0   Max.   :1019  
##       wind           type              seasday   
##  Min.   : 15.0   Length:2747        Min.   :  3  
##  1st Qu.: 35.0   Class :character   1st Qu.: 84  
##  Median : 50.0   Mode  :character   Median :103  
##  Mean   : 54.7                      Mean   :103  
##  3rd Qu.: 70.0                      3rd Qu.:125  
##  Max.   :155.0                      Max.   :185
#displays a summary of the variables

Continuous variables:

The data set, ‘storms,’ was created with no specific experiment in particular. Certain variables such as “year,” “month,” “day,” “pressure,” and “wind” can be considered continuous variables. The storm “type” can be considered a categorical variable.

Response variables:

Depending on the analysis preformed on the data set, ‘storms,’ several variables can be considered response variables. For this sample recipe, “pressure” will be considered as a response variable.

The Data: How is it organized and what does it look like?

The data sets of NASA weather were observed from January of 1995 to December of 2000 and were obtained from the NASA Langley Research Center Atmospheric Sciences Data Center. The variables of “storms” are: “Name,” which displays the name of the storm, “Year,” “Month,” and “Day” all describe the data on which the data was recorded, “Latitude” and “Longitude,” describe the location of the storm, “Pressure” shows the pressure in psi, “Wind” shows the wind speed in mph, and “SeasDay.”

Randomization

The data from “storms” is organized based on the variables within each column; however, it can be assumed that the original data was gathered with proper randomization methods utilized by NASA.

2. Experimental Design

How will the experiment be organized and conducted to test the hypothesis?

The results of weather testing conducted by NASA were not acquired with any one specific experiment or hypothesis in mind. This data is made publicly available for anyone to use to test their own hypothesis. For this case, the data from “storms” will be analyzed to determine the extent of the difference in air pressure between storms occurring during different years (1995-2000) and different months (June-December). An analysis of variance test will be performed to determine the relationship between the sample means. The null hypothesis of this experiment is that the pressure means across all storms from different months and years are the same.

What is the rationale for this design?

The rationale behind the weather observations conducted by NASA was to gather various parameters from different storms and make them readily available. With no specific aim in mind, the data can be analyzed be anyone wishing to test a hypothesis relating to the data.

Randomize: What is the Randomization Scheme?

Since this data was gathered with no specific intention, the randomization scheme, if any, is unknown. The data used in this example will be selected based solely on the month and year of the storm. Pressure, in psi, will be the response variable.

Replicate: Are there replicates and/or repeated measures?

Yes, measurements of pressure, wind, latitude, and longitude were recorded for the same storms at several different time points. No replicates were performed in this data gathering.

Block: Did you use blocking in the design?

The original data set, while organized based on the variables that were measurement, was not arranged into experimental groups because no experiment was conducted. For this specific example, the storm data will be broken up into groups based on the month and the year that a storm had occurred.

3. Statistical Analysis

Exploratory Data Analysis: Graphics and Descriptive Summary

#Assign the data types 
storms$month=as.factor(storms$month)
#makes the variable "month" a factor
storms$year=as.factor(storms$year)
#makes the variable "year" a factor

#Boxplot
boxplot(pressure~month,data=storms, xlab="Month", ylab="Pressure (psi)")

plot of chunk unnamed-chunk-3

#boxplot of the air pressures from storms of each month
boxplot(pressure~year,data=storms, xlab="Year", ylab="Pressure (psi)")

plot of chunk unnamed-chunk-3

#boxplot of the air pressures from storms of each year

The boxplots above display the distribution of the variation of pressure that can be attributed to the month and the year that the storm occurred. No statistical inference be made about the data sets without performing a statistical test. However, due to the lack of sufficient sample size, it can be assummed that the mean pressure of storms that occurred in December will be significantly different that the pressure means from other months.

Testing

An analysis of variance (ANOVA) will be used to determine the statistical significance between the pressure means. The null hypothesis for all three ANOVA tests is that the mean pressure vectors of all samples are equal to each other. The first anova test will analyze the pressure variance as a result of the month. The second anova test will analyze the pressure variance as a result of the year. The third anova test will analyze the pressure variance as a result of the interaction between the month and the year. If the null hypothesis is rejected, the alternative hypothesis, which states that the mean vectors are not equal to each other, is accepted.

# ANOVA
#Month
model_month=aov(pressure~month,data=storms) 
anova(model_month)
## Analysis of Variance Table
## 
## Response: pressure
##             Df Sum Sq Mean Sq F value Pr(>F)    
## month        6  62467   10411    31.8 <2e-16 ***
## Residuals 2740 897236     327                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#performs an anova test for storm month

#Year
model_year=aov(pressure~year,data=storms) 
anova(model_year)
## Analysis of Variance Table
## 
## Response: pressure
##             Df Sum Sq Mean Sq F value Pr(>F)    
## year         5  22560    4512    13.2  1e-12 ***
## Residuals 2741 937143     342                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#performs an anova test for storm year


#Month and Year
model_month_year=aov(pressure~month*year,data=storms) 
anova(model_month_year)
## Analysis of Variance Table
## 
## Response: pressure
##              Df Sum Sq Mean Sq F value  Pr(>F)    
## month         6  62467   10411   34.24 < 2e-16 ***
## year          5  25274    5055   16.63 3.3e-16 ***
## month:year   18  45920    2551    8.39 < 2e-16 ***
## Residuals  2717 826042     304                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#performs an anova test for the interaction of storm month and year

ANOVA Results: The anova test that analyzed the variation in pressure as a result of the variation in the month that a storm occurred produced a p-value of 2e-16. This indicates that there is a very small probability that the variation of pressure data can be attributed to solely randomization. It is highly likely that the month during which a storm occurs has an effect on the pressure mean. The anova test that analyzed the variation in pressure as a result of the variation in the year that a storm occurred produced a p-value of 1e-12. This indicates that there is a very small probability that the variation of pressure data can be attributed to solely randomization. It is highly likely that the year during which a storm occurs has an effect on the pressure mean. The anova test that analyzed the variation in pressure as a result of the interaction of both the month and the year that a storm occurred produced a p-value of 2e-16. This indicates that there is a very small probability that the variation of pressure data can be attributed to solely randomization. It is highly likely that the interaction of the month and the year has an effect on the pressure mean. Without a post-hoc analysis, it is impossible to determine precisely which pressure means are significantly distinct.

Estimation of Parameters

Summary of all factors and levels

summary(storms$month)
##   6   7   8   9  10  11  12 
##  83 251 720 947 569 170   7
#displays the amount of data points in each month

june<-storms$month=="6" 
#assigns all of the data of month 6 to the vector, june
summary(storms[june,"pressure"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     982     994    1000    1000    1010    1020
#displays a summary of the air pressure data for the month of June

july<-storms$month=="7" 
#assigns all of the data of month 7 to the vector, july
summary(storms[july,"pressure"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     960     995    1000    1000    1010    1020
#displays a summary of the air pressure data for the month of july

august<-storms$month=="8" 
#assigns all of the data of month 8 to the vector, august
summary(storms[august,"pressure"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     929     979     995     990    1000    1010
#displays a summary of the air pressure data for the month of August

sept<-storms$month=="9" 
#assigns all of the data of month 9 to the vector, sept
summary(storms[sept,"pressure"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     921     971     990     985    1000    1020
#displays a summary of the air pressure data for the month of September

oct<-storms$month=="10" 
#assigns all of the data of month 10 to the vector, oct
summary(storms[oct,"pressure"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     905     982     996     991    1000    1020
#displays a summary of the air pressure data for the month of October

nov<-storms$month=="11" 
#assigns all of the data of month 11 to the vector, nov
summary(storms[nov,"pressure"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     933     987     998     993    1000    1010
#displays a summary of the air pressure data for the month of November

dec<-storms$month=="12" 
#assigns all of the data of month 12 to the vector, june
summary(storms[dec,"pressure"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     979     981     985     985     989     990
#displays a summary of the air pressure data for the month of December

#Year
summary(storms$year)
## 1995 1996 1997 1998 1999 2000 
##  724  536  186  483  411  407
#displays the amount of data points in each month

nfive<-storms$year=="1995" 
#assigns all of the data of 1995 to the vector, nfive
summary(storms[nfive,"pressure"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     919     979     995     989    1000    1020
#displays a summary of the air pressure data for the year of 1995 

nsix<-storms$year=="1996" 
#assigns all of the data of 1996 to the vector, nsix
summary(storms[nsix,"pressure"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     933     979     994     989    1000    1020
#displays a summary of the air pressure data for the year of 1996

nseven<-storms$year=="1997" 
#assigns all of the data of 1997 to the vector, nseven
summary(storms[nseven,"pressure"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     946     991    1000     997    1000    1010
#displays a summary of the air pressure data for the year of 1997

neight<-storms$year=="1998" 
#assigns all of the data of 1998 to the vector, neight
summary(storms[neight,"pressure"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     905     980     996     990    1000    1010
#displays a summary of the air pressure data for the year of 1998

nnine<-storms$year=="1999" 
#assigns all of the data of 1999 to the vector, nnine
summary(storms[nnine,"pressure"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     921     971     992     985    1000    1020
#displays a summary of the air pressure data for the year of 1999

tt<-storms$year=="2000" 
#assigns all of the data of 2000 to the vector, tt
summary(storms[tt,"pressure"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     941     984     997     993    1010    1010
#displays a summary of the air pressure data for the year of 2000

Diagnostics/Model Adequacy Checking

Quantile-Quantile (Q-Q) plots are graphs used to verify the distributional assumption for a set of data. Based on the theoretical distribution, the expected value for each datum is determined. If the data values in a set follow the theoretical distribution, then they will appear as a straight line on a Q-Q plot. When an anova is performed, it is done so with the assumption that the test statistic follows a normal distribution. Visualization of a Q-Q plot will further confirm if that assumption is correct for the anova tests that were performed.

#Month
qqnorm(residuals(model_month), ylab="Month Residuals")
qqline(residuals(model_month))

plot of chunk unnamed-chunk-6

#produces a Q-Q normal plot for the storm month with a normal fit line

#Year
qqnorm(residuals(model_year), ylab="Year Residuals")
qqline(residuals(model_year))

plot of chunk unnamed-chunk-6

#produces a Q-Q normal plot for the storm year with a normal fit line


#Month and Year
qqnorm(residuals(model_month_year), ylab="Month/Year Interaction Residuals")
qqline(residuals(model_month_year))

plot of chunk unnamed-chunk-6

#produces a Q-Q normal plot for the interaction between the storm month and year with a normal fit line

The Normal Q-Q plots for storm month, year, and month/year interaction returned a relatively linear relationship between the pressure data and the theoretical quantities, indicating that they follow the theoretical distribution. The tails of all three plots appear to deviate from the linear line on the Q-Q plot. This could indicate that the data is not normally distributed, however, no certain conclusion can be made by visual inspection. For an ANOVA, the data sets must follow a normal distribution. The relatively linear relationship for all data sets justifies the use of ANOVA to test for the significant difference.

Two Way Interaction Plots display the mean of the response for two-way combinations of factors, and can indicate if there is any interactions between them through visual inspection. Data sets that do not have any interaction will appear as perfectly parallel lines. Changes in slope and intersections are good indications of interactions.

interaction.plot(storms$month,storms$year,storms$pressure)

plot of chunk unnamed-chunk-7

#creates a plot that shows the interaction of the year and month on a storm's pressure

In the interaction plot above, there are clear observations of changes in slope and plot intersections, indicating some degree of interaction between the month and the year of a storm, in regards to the pressure data. This supports the results of the anova tests.

Residuals vs. Fits Plot is a common graph used in residual analysis. It is a scatter plot of residuals as a function of fitted values, or the estimated responses. These plots are used to identify linearity, outliers, and error variances.

#Month
plot(fitted(model_month),residuals(model_month)) 

plot of chunk unnamed-chunk-8

#Year
plot(fitted(model_year),residuals(model_year))

plot of chunk unnamed-chunk-8

#Month and Year
plot(fitted(model_month_year),residuals(model_month_year)) 

plot of chunk unnamed-chunk-8

The three residual plots above all show a large degree of variation of the residual values. However, none of the plots produce an abundance of distinct outliers. The lack of numerous outliers confirms the aduequacy of the fit of the model.

4. References to the literature

No literature was used in this sample recipe

5. Appendices

The raw data used in this statistical analysis are results of vehicle testing conducted by NASA. It can be readily accessed using R or RStudio. It is available as a downloadable package and can be found online at https://github.com/hadley/nasaweather