Recipe 1: Example of Descriptive Statistics


Recipes for the Design of Experiments: Recipe Outline

As of August 28, 2014, superseding the version of August 24. Always use the most recent version.

Design of Experiments

Trevor Manzanares

Rensselaer Polytechnic Institute

9/12/14

1. Setting

System under test

Choose one of the large datasets listed on the Realtime Board (e.g., babynames or nasaweather)
Make sure you have > 1000 data points. What is the problem that you were given?

remove(list=ls())
#install.packages("nasaweather", repos='http://cran.us.r-project.org')
#library("nasaweather", lib.loc="C:/Users/Trevor/Documents/R/win-library/3.1/nasaweather/R")
library("nasaweather", lib.loc="~/R/win-library/3.1")
#select subset of nasaweather data set
data(atmos)
#get information on atmos dataset
??atmos
## starting httpd help server ... done
#name dataset
x<-atmos
#view first few lines
head(x)
##     lat   long year month surftemp  temp pressure ozone cloudlow cloudmid
## 1 36.20 -113.8 1995     1    272.7 272.1      835   304      7.5     34.5
## 2 33.70 -113.8 1995     1    279.5 282.2      940   304     11.5     32.5
## 3 31.21 -113.8 1995     1    284.7 285.2      960   298     16.5     26.0
## 4 28.71 -113.8 1995     1    289.3 290.7      990   276     20.5     14.5
## 5 26.22 -113.8 1995     1    292.2 292.7     1000   274     26.0     10.5
## 6 23.72 -113.8 1995     1    294.1 293.6     1000   264     30.0      9.5
##   cloudhigh
## 1      26.0
## 2      20.0
## 3      16.0
## 4      13.0
## 5       7.5
## 6       8.0
#attach dataset for easy column referencing
attach(atmos)
## The following object is masked from package:datasets:
## 
##     pressure
#observe the structure of the data, i.e., how many variables there are and their types
str(atmos)
## Classes 'tbl_df', 'tbl' and 'data.frame':    41472 obs. of  11 variables:
##  $ lat      : num  36.2 33.7 31.2 28.7 26.2 ...
##  $ long     : num  -114 -114 -114 -114 -114 ...
##  $ year     : int  1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
##  $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ surftemp : num  273 280 285 289 292 ...
##  $ temp     : num  272 282 285 291 293 ...
##  $ pressure : num  835 940 960 990 1000 1000 1000 1000 1000 1000 ...
##  $ ozone    : num  304 304 298 276 274 264 258 252 250 250 ...
##  $ cloudlow : num  7.5 11.5 16.5 20.5 26 30 29.5 26.5 27.5 26 ...
##  $ cloudmid : num  34.5 32.5 26 14.5 10.5 9.5 11 17.5 18.5 16.5 ...
##  $ cloudhigh: num  26 20 16 13 7.5 8 14.5 19.5 22.5 21 ...
#mean of surftemp for each distinct value of temp (tapply groups by its second argument, so these are not monthly means)
tapply(surftemp,temp,mean)
## 269.1 269.7 270.3 270.9 271.5 272.1 272.7 273.2 273.8 274.4   275 275.6 
## 284.7 281.6 270.9 269.7 275.2 276.4 277.1 271.2 273.8 275.6 277.2 277.7 
## 276.1 276.7 277.3 277.8 278.4 278.9 279.5   280 280.5 281.1 281.6 282.2 
## 282.0 281.0 280.6 283.2 280.0 282.4 281.8 279.3 281.2 281.3 282.8 282.2 
## 282.7 283.2 283.7 284.2 284.7 285.2 285.8 286.3 286.8 287.3 287.8 288.3 
## 283.8 283.0 284.1 285.2 286.2 285.4 286.7 286.5 287.1 287.3 288.4 288.8 
## 288.8 289.3 289.8 290.2 290.7 291.2 291.7 292.2 292.7 293.2 293.6 294.1 
## 289.7 289.9 290.4 289.8 290.6 290.9 291.3 291.3 291.8 292.1 292.7 293.0 
## 294.6   295 295.5   296 296.5 296.9 297.4 297.8 298.3 298.7 299.2 299.6 
## 293.6 294.0 294.5 294.7 295.1 295.5 295.9 296.2 296.8 297.2 297.4 297.6 
## 300.1 300.5   301 301.4 301.9 302.3 302.8 303.2 303.6   304 304.5 304.9 
## 298.0 298.2 298.7 299.3 299.5 299.9 299.8 299.8 299.8 299.9 300.1 302.0 
## 305.3 305.8 306.2 306.6   307 307.5 307.9 308.3 308.7 309.1 309.6   310 
## 302.6 304.1 304.3 305.7 305.7 305.6 306.1 305.9 307.0 308.0 308.2 309.6
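Because tapply() groups its first argument by its second, the call above returns one mean per distinct temp value rather than twelve monthly values. If monthly means are wanted, grouping by month is one way to get them. The sketch below uses a small invented data frame (`toy`, a hypothetical stand-in for atmos) so that it runs without the nasaweather package; with the real data the analogous call would be `tapply(atmos$surftemp, atmos$month, mean)`.

```r
# Toy stand-in for atmos: invented values, two observations per month
toy <- data.frame(
  month    = rep(1:3, each = 2),
  surftemp = c(272, 274, 280, 284, 290, 294),
  temp     = c(271, 275, 281, 285, 291, 293)
)

# Mean of each temperature variable within each month
monthly_surftemp <- tapply(toy$surftemp, toy$month, mean)
monthly_temp     <- tapply(toy$temp, toy$month, mean)

print(monthly_surftemp)
print(monthly_temp)
```

The result is named by month, so the two vectors can be compared month by month.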

Factors and Levels

There are two factors, surftemp and temp. Both are continuous, so each can assume many levels.

head(x)
##     lat   long year month surftemp  temp pressure ozone cloudlow cloudmid
## 1 36.20 -113.8 1995     1    272.7 272.1      835   304      7.5     34.5
## 2 33.70 -113.8 1995     1    279.5 282.2      940   304     11.5     32.5
## 3 31.21 -113.8 1995     1    284.7 285.2      960   298     16.5     26.0
## 4 28.71 -113.8 1995     1    289.3 290.7      990   276     20.5     14.5
## 5 26.22 -113.8 1995     1    292.2 292.7     1000   274     26.0     10.5
## 6 23.72 -113.8 1995     1    294.1 293.6     1000   264     30.0      9.5
##   cloudhigh
## 1      26.0
## 2      20.0
## 3      16.0
## 4      13.0
## 5       7.5
## 6       8.0
tail(x)
##           lat  long year month surftemp  temp pressure ozone cloudlow
## 41467  -8.722 -56.2 2000    12    294.6 301.9      990   248      7.5
## 41468 -11.217 -56.2 2000    12    294.1 302.3      985   252      6.5
## 41469 -13.713 -56.2 2000    12    295.0 303.6      960   250      6.5
## 41470 -16.209 -56.2 2000    12    297.8 304.5      995   250     11.5
## 41471 -18.704 -56.2 2000    12    299.6 304.9      995   252     14.5
## 41472 -21.200 -56.2 2000    12    299.6 304.0      970   254     14.5
##       cloudmid cloudhigh
## 41467     32.0      40.5
## 41468     30.5      40.0
## 41469     28.5      40.5
## 41470     28.5      31.0
## 41471     23.0      26.0
## 41472     21.0      23.5
#view background on data
??atmos
#view summary statistics
summary(x)
##       lat              long             year          month      
##  Min.   :-21.20   Min.   :-113.8   Min.   :1995   Min.   : 1.00  
##  1st Qu.: -6.85   1st Qu.: -99.4   1st Qu.:1996   1st Qu.: 3.75  
##  Median :  7.50   Median : -85.0   Median :1998   Median : 6.50  
##  Mean   :  7.50   Mean   : -85.0   Mean   :1998   Mean   : 6.50  
##  3rd Qu.: 21.85   3rd Qu.: -70.6   3rd Qu.:1999   3rd Qu.: 9.25  
##  Max.   : 36.20   Max.   : -56.2   Max.   :2000   Max.   :12.00  
##                                                                  
##     surftemp        temp        pressure        ozone        cloudlow   
##  Min.   :266   Min.   :269   Min.   : 615   Min.   :232   Min.   : 0.5  
##  1st Qu.:294   1st Qu.:296   1st Qu.: 995   1st Qu.:254   1st Qu.:15.0  
##  Median :297   Median :299   Median :1000   Median :264   Median :23.5  
##  Mean   :296   Mean   :298   Mean   : 985   Mean   :267   Mean   :26.2  
##  3rd Qu.:299   3rd Qu.:301   3rd Qu.:1000   3rd Qu.:276   3rd Qu.:34.5  
##  Max.   :315   Max.   :310   Max.   :1000   Max.   :390   Max.   :84.5  
##                                                           NA's   :110   
##     cloudmid      cloudhigh   
##  Min.   : 0.0   Min.   : 0.0  
##  1st Qu.: 7.5   1st Qu.: 1.5  
##  Median :14.0   Median : 8.5  
##  Mean   :15.3   Mean   :12.0  
##  3rd Qu.:22.0   3rd Qu.:18.5  
##  Max.   :83.5   Max.   :62.5  
## 
attach(x)
## The following objects are masked from atmos:
## 
##     cloudhigh, cloudlow, cloudmid, lat, long, month, ozone,
##     pressure, surftemp, temp, year
## 
## The following object is masked from package:datasets:
## 
##     pressure
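The masking messages appear because the same columns are attached twice (once as atmos, once as x), on top of the built-in pressure dataset. One possible way to avoid this is to skip attach() and reach columns through the data frame with `$` or with(). The sketch below assumes a small data frame `toy` (a hypothetical name) whose values are copied from the first four rows shown by head(x).

```r
# Small stand-in for atmos; no attach() needed
toy <- data.frame(
  surftemp = c(272.7, 279.5, 284.7, 289.3),
  temp     = c(272.1, 282.2, 285.2, 290.7)
)

# Refer to a column explicitly with $ ...
mean_surftemp <- mean(toy$surftemp)

# ... or evaluate an expression inside the data frame with with()
mean_diff <- with(toy, mean(surftemp - temp))

print(mean_surftemp)
print(mean_diff)
```

Nothing is added to the search path this way, so there is nothing to mask.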

Continuous variables (if any)

latitude (num), longitude (num), year (int), month (int), surftemp (num), temp (num), pressure (num), ozone (num, response), cloudlow (num), cloudmid (num), cloudhigh (num)

Response variables

ozone

The Data: How is it organized and what does it look like?

The data are tabulated into 11 columns, with some missing data (110 NA values in cloudlow). All variables are numeric, except for two (year and month), which are integers.

Randomization

The data were collected as monthly averages for each variable from 01/95 to 12/00, a 6-year span.
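The missing values noted in the data description can be located column by column. The following is a minimal sketch on an invented data frame (`toy`, hypothetical values) rather than on atmos itself; the same calls should apply to the real data.

```r
# Invented stand-in for atmos with a few missing cloudlow values
toy <- data.frame(
  ozone    = c(304, 304, 298, 276, 274),
  cloudlow = c(7.5, NA, 16.5, NA, 26.0)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Summaries need na.rm = TRUE when a column has missing values
mean_cloudlow <- mean(toy$cloudlow, na.rm = TRUE)
print(mean_cloudlow)
```

Without `na.rm = TRUE`, mean() returns NA for any column that contains a missing value.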

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

The null hypothesis is that there is no difference between the mean Surface Temperature and the mean Near-Surface Air Temperature. The alternative is that there is a statistically significant difference. A two-sample t-test will be used to determine whether this difference exists, using an alpha of 0.05.

What is the rationale for this design?

The t-test is the most basic method of determining whether a statistically significant difference exists between the means of two samples.

Randomize: What is the Randomization Scheme?

The data sample was collected as monthly averages for each variable from 01/95 to 12/00.

Replicate: Are there replicates and/or repeated measures?

There is no replication, but there are repeated measures once a month for 6 years.

Block: Did you use blocking in the design?

No. Blocking was unnecessary, as both variables measure temperature.
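To make the planned test concrete, the equal-variance two-sample t statistic can be computed by hand and checked against t.test(). This is only a sketch on simulated samples (the vectors `a` and `b` and their parameters are invented, not taken from atmos). The degrees of freedom are n1 + n2 - 2; for two samples of 41472 observations that gives 82942, the df reported by the t-test in the Testing section.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated samples (not the atmos data)
a <- rnorm(50, mean = 296, sd = 5)
b <- rnorm(50, mean = 298, sd = 5)

n1 <- length(a)
n2 <- length(b)

# Pooled variance under the equal-variance assumption
sp2 <- ((n1 - 1) * var(a) + (n2 - 1) * var(b)) / (n1 + n2 - 2)

# Two-sample t statistic and its degrees of freedom
t_stat <- (mean(a) - mean(b)) / sqrt(sp2 * (1 / n1 + 1 / n2))
df <- n1 + n2 - 2

# Two-sided p-value
p_val <- 2 * pt(-abs(t_stat), df)

# Compare with the built-in test
built_in <- t.test(a, b, var.equal = TRUE)
print(c(by_hand = t_stat, built_in = unname(built_in$statistic)))
```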

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

Side-by-side boxplots to look for a difference between the two sample variables

par(mfrow=c(1,2))
boxplot(surftemp, main="surftemp")
boxplot(temp, main="airtemp")

[Figure: side-by-side boxplots of surftemp and airtemp]

Testing

#try a basic t test for variables surftemp and temp
t.test(surftemp,temp,var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  surftemp and temp
## t = -51.78, df = 82942, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.754 -1.626
## sample estimates:
## mean of x mean of y 
##     296.2     297.9

Based on these preliminary t-test results (p-value < 2.2e-16, below the alpha of 0.05), we reject H0 that there is no difference between the mean temperatures of surftemp and airtemp. However, we must check normality to ensure these results are valid.
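The decision above can also be read directly from the object that t.test() returns. The sketch below uses simulated samples (`s` and `a` are invented stand-ins for surftemp and temp, with arbitrary sample size and spread) and pulls out the p-value and confidence interval before applying the alpha = 0.05 rule.

```r
set.seed(2)  # arbitrary seed

# Simulated samples standing in for surftemp and temp
s <- rnorm(200, mean = 296.2, sd = 5)
a <- rnorm(200, mean = 297.9, sd = 5)

alpha <- 0.05
res <- t.test(s, a, var.equal = TRUE)

# Pull the pieces used in the write-up out of the result object
p_value <- res$p.value
ci      <- res$conf.int      # 95% CI for the difference in means
reject  <- p_value < alpha   # TRUE means reject H0 at level alpha

print(ci)
print(reject)
```

Storing the result this way keeps the reported numbers and the decision tied to the same test run.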

Estimation (of Parameters)

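# Shapiro-Wilk test of normality. H0 is that the data are normal, so normality is adequate only if p > 0.1; a small p-value is evidence against it.
# shapiro.test() accepts at most 5000 values, so only the first 4999 are tested.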
shapiro.test(temp[1:4999])
## 
##  Shapiro-Wilk normality test
## 
## data:  temp[1:4999]
## W = 0.867, p-value < 2.2e-16
shapiro.test(ozone[1:4999])
## 
##  Shapiro-Wilk normality test
## 
## data:  ozone[1:4999]
## W = 0.8925, p-value < 2.2e-16
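Because shapiro.test() accepts at most 5000 values, the calls above use the first 4999 rows, which (judging by head(x) and tail(x), where rows appear ordered by year and month) likely come only from the earliest months. A random subsample may be more representative. The following is a sketch on simulated, deliberately skewed data (`v` is invented, not a column of atmos).

```r
set.seed(3)  # arbitrary seed

# Simulated skewed data standing in for a column of atmos
v <- rexp(41472)

# shapiro.test() accepts between 3 and 5000 values, so draw a random subsample
idx <- sample(length(v), 5000)
sw <- shapiro.test(v[idx])
print(sw)
```

With the real data, `v` would be replaced by a column such as `atmos$temp`.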

Both Shapiro-Wilk tests reject normality (p-value < 2.2e-16). One of the primary assumptions of the t-test is that the data are normally distributed; since they are not, the results from the t-test are essentially invalidated.

Diagnostics/Model Adequacy Checking

qqnorm(temp,ylab="Temp")

[Figure: normal Q-Q plot of temp]

qqnorm(ozone,ylab="Ozone")

[Figure: normal Q-Q plot of ozone]
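A Q-Q plot is easier to judge with a reference line, since departures from normality show up as points bending away from it. The sketch below adds qqline() to a Q-Q plot of simulated skewed data (`v` is invented for illustration); for the plots above the analogous calls would be `qqline(temp)` and `qqline(ozone)`.

```r
set.seed(4)  # arbitrary seed

# Simulated skewed data standing in for a column of atmos
v <- rexp(1000)

# Q-Q plot; qqnorm() also returns the plotted coordinates invisibly
qq <- qqnorm(v, ylab = "Simulated values")

# Reference line through the first and third quartiles
qqline(v)
```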

4. References to the literature

??atmos

http://stat-computing.org/dataexpo/2006/

5. Appendices

A summary of, or pointer to, the raw data

complete and documented R code

library("nasaweather", lib.loc="~/R/win-library/3.1")