As of August 28, 2014, superseding the version of August 24. Always use the most recent version.
Choose one of the large datasets listed on the Realtime Board (e.g., babynames or nasaweather)
Make sure you have > 1000 data points.
What is the problem that you were given?
remove(list=ls())
#install.packages("nasaweather", repos='http://cran.us.r-project.org')
#library("nasaweather", lib.loc="C:/Users/Trevor/Documents/R/win-library/3.1/nasaweather/R")
library("nasaweather", lib.loc="~/R/win-library/3.1")
#select subset of nasaweather data set
data(atmos)
#get information on atmos dataset
??atmos
## starting httpd help server ... done
#name dataset
x<-atmos
#view first few lines
head(x)
## lat long year month surftemp temp pressure ozone cloudlow cloudmid
## 1 36.20 -113.8 1995 1 272.7 272.1 835 304 7.5 34.5
## 2 33.70 -113.8 1995 1 279.5 282.2 940 304 11.5 32.5
## 3 31.21 -113.8 1995 1 284.7 285.2 960 298 16.5 26.0
## 4 28.71 -113.8 1995 1 289.3 290.7 990 276 20.5 14.5
## 5 26.22 -113.8 1995 1 292.2 292.7 1000 274 26.0 10.5
## 6 23.72 -113.8 1995 1 294.1 293.6 1000 264 30.0 9.5
## cloudhigh
## 1 26.0
## 2 20.0
## 3 16.0
## 4 13.0
## 5 7.5
## 6 8.0
#attach dataset for easy column referencing
attach(atmos)
## The following object is masked from package:datasets:
##
## pressure
#observe the structure of the data, i.e., how many variables and their types
str(atmos)
## Classes 'tbl_df', 'tbl' and 'data.frame': 41472 obs. of 11 variables:
## $ lat : num 36.2 33.7 31.2 28.7 26.2 ...
## $ long : num -114 -114 -114 -114 -114 ...
## $ year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ surftemp : num 273 280 285 289 292 ...
## $ temp : num 272 282 285 291 293 ...
## $ pressure : num 835 940 960 990 1000 1000 1000 1000 1000 1000 ...
## $ ozone : num 304 304 298 276 274 264 258 252 250 250 ...
## $ cloudlow : num 7.5 11.5 16.5 20.5 26 30 29.5 26.5 27.5 26 ...
## $ cloudmid : num 34.5 32.5 26 14.5 10.5 9.5 11 17.5 18.5 16.5 ...
## $ cloudhigh: num 26 20 16 13 7.5 8 14.5 19.5 22.5 21 ...
#observe the mean of surftemp within each distinct value of temp (this groups by temp, not by month)
tapply(surftemp,temp,mean)
## 269.1 269.7 270.3 270.9 271.5 272.1 272.7 273.2 273.8 274.4 275 275.6
## 284.7 281.6 270.9 269.7 275.2 276.4 277.1 271.2 273.8 275.6 277.2 277.7
## 276.1 276.7 277.3 277.8 278.4 278.9 279.5 280 280.5 281.1 281.6 282.2
## 282.0 281.0 280.6 283.2 280.0 282.4 281.8 279.3 281.2 281.3 282.8 282.2
## 282.7 283.2 283.7 284.2 284.7 285.2 285.8 286.3 286.8 287.3 287.8 288.3
## 283.8 283.0 284.1 285.2 286.2 285.4 286.7 286.5 287.1 287.3 288.4 288.8
## 288.8 289.3 289.8 290.2 290.7 291.2 291.7 292.2 292.7 293.2 293.6 294.1
## 289.7 289.9 290.4 289.8 290.6 290.9 291.3 291.3 291.8 292.1 292.7 293.0
## 294.6 295 295.5 296 296.5 296.9 297.4 297.8 298.3 298.7 299.2 299.6
## 293.6 294.0 294.5 294.7 295.1 295.5 295.9 296.2 296.8 297.2 297.4 297.6
## 300.1 300.5 301 301.4 301.9 302.3 302.8 303.2 303.6 304 304.5 304.9
## 298.0 298.2 298.7 299.3 299.5 299.9 299.8 299.8 299.8 299.9 300.1 302.0
## 305.3 305.8 306.2 306.6 307 307.5 307.9 308.3 308.7 309.1 309.6 310
## 302.6 304.1 304.3 305.7 305.7 305.6 306.1 305.9 307.0 308.0 308.2 309.6
The two factors, surftemp and temp, are continuous and can each assume many levels.
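The `tapply` call above groups `surftemp` by each distinct `temp` value. A per-month summary could instead group by `month`. The following is a minimal sketch of that idea, run on a small synthetic data frame (`toy` and its values are hypothetical; only the column names are borrowed from atmos):

```r
# Synthetic stand-in for atmos: made-up values, same column names (assumption)
set.seed(1)
toy <- data.frame(
  month    = rep(1:12, times = 10),
  surftemp = rnorm(120, mean = 296, sd = 4),
  temp     = rnorm(120, mean = 298, sd = 4)
)

# Mean of each temperature variable within each calendar month
monthly_means <- aggregate(cbind(surftemp, temp) ~ month, data = toy, FUN = mean)
monthly_means
```

With the real data, replacing `toy` by `x` should give one row per month with the two monthly means side by side.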
head(x)
## lat long year month surftemp temp pressure ozone cloudlow cloudmid
## 1 36.20 -113.8 1995 1 272.7 272.1 835 304 7.5 34.5
## 2 33.70 -113.8 1995 1 279.5 282.2 940 304 11.5 32.5
## 3 31.21 -113.8 1995 1 284.7 285.2 960 298 16.5 26.0
## 4 28.71 -113.8 1995 1 289.3 290.7 990 276 20.5 14.5
## 5 26.22 -113.8 1995 1 292.2 292.7 1000 274 26.0 10.5
## 6 23.72 -113.8 1995 1 294.1 293.6 1000 264 30.0 9.5
## cloudhigh
## 1 26.0
## 2 20.0
## 3 16.0
## 4 13.0
## 5 7.5
## 6 8.0
tail(x)
## lat long year month surftemp temp pressure ozone cloudlow
## 41467 -8.722 -56.2 2000 12 294.6 301.9 990 248 7.5
## 41468 -11.217 -56.2 2000 12 294.1 302.3 985 252 6.5
## 41469 -13.713 -56.2 2000 12 295.0 303.6 960 250 6.5
## 41470 -16.209 -56.2 2000 12 297.8 304.5 995 250 11.5
## 41471 -18.704 -56.2 2000 12 299.6 304.9 995 252 14.5
## 41472 -21.200 -56.2 2000 12 299.6 304.0 970 254 14.5
## cloudmid cloudhigh
## 41467 32.0 40.5
## 41468 30.5 40.0
## 41469 28.5 40.5
## 41470 28.5 31.0
## 41471 23.0 26.0
## 41472 21.0 23.5
#view background on data
??atmos
#view summary statistics
summary(x)
## lat long year month
## Min. :-21.20 Min. :-113.8 Min. :1995 Min. : 1.00
## 1st Qu.: -6.85 1st Qu.: -99.4 1st Qu.:1996 1st Qu.: 3.75
## Median : 7.50 Median : -85.0 Median :1998 Median : 6.50
## Mean : 7.50 Mean : -85.0 Mean :1998 Mean : 6.50
## 3rd Qu.: 21.85 3rd Qu.: -70.6 3rd Qu.:1999 3rd Qu.: 9.25
## Max. : 36.20 Max. : -56.2 Max. :2000 Max. :12.00
##
## surftemp temp pressure ozone cloudlow
## Min. :266 Min. :269 Min. : 615 Min. :232 Min. : 0.5
## 1st Qu.:294 1st Qu.:296 1st Qu.: 995 1st Qu.:254 1st Qu.:15.0
## Median :297 Median :299 Median :1000 Median :264 Median :23.5
## Mean :296 Mean :298 Mean : 985 Mean :267 Mean :26.2
## 3rd Qu.:299 3rd Qu.:301 3rd Qu.:1000 3rd Qu.:276 3rd Qu.:34.5
## Max. :315 Max. :310 Max. :1000 Max. :390 Max. :84.5
## NA's :110
## cloudmid cloudhigh
## Min. : 0.0 Min. : 0.0
## 1st Qu.: 7.5 1st Qu.: 1.5
## Median :14.0 Median : 8.5
## Mean :15.3 Mean :12.0
## 3rd Qu.:22.0 3rd Qu.:18.5
## Max. :83.5 Max. :62.5
##
attach(x)
## The following objects are masked from atmos:
##
## cloudhigh, cloudlow, cloudmid, lat, long, month, ozone,
## pressure, surftemp, temp, year
##
## The following object is masked from package:datasets:
##
## pressure
latitude (num), longitude (num), year (int), month (int), surftemp (num), temp (num), pressure (num), ozone (num, response), cloudlow (num), cloudmid (num), cloudhigh (num)
### Response variables
ozone
### The Data: How is it organized and what does it look like?
The data are tabulated into 11 columns, with some missing data. All variables are numeric, except for two (year and month), which are integers.
### Randomization
The data were collected as monthly averages for each variable from 01/95 to 12/00, a 6-year span.
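The missing values mentioned above can be counted per column before any testing. The sketch below uses a tiny hypothetical data frame (`toy`); which atmos column actually holds the NAs is not assumed here, and the placement of `NA` in `cloudmid` is purely illustrative:

```r
# Hypothetical miniature data frame; the NA positions are invented for illustration
toy <- data.frame(
  surftemp = c(272.7, 279.5, 284.7, 289.3),
  cloudlow = c(7.5, 11.5, 16.5, 20.5),
  cloudmid = c(34.5, NA, 26.0, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Number of rows with no missing value in any column
complete_rows <- sum(complete.cases(toy))
complete_rows
```

Running `colSums(is.na(x))` on the real data should show which variable accounts for the 110 NA's reported by `summary(x)`.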
The null hypothesis is that there is no difference between the mean Surface Temperature (surftemp) and the mean Near-Surface Air Temperature (temp). The alternative is that there is a statistically significant difference. A two-sample t-test will be used to determine whether this difference exists, using an alpha of 0.05.
### What is the rationale for this design?
A t-test is one of the most basic methods for determining whether a statistically significant difference exists between the means of two groups.
### Randomize: What is the Randomization Scheme?
The data sample was collected as monthly averages for each variable from 01/95 to 12/00.
### Replicate: Are there replicates and/or repeated measures?
There is no replication, but there are repeated measures once a month for 6 years.
### Block: Did you use blocking in the design?
No, blocking was unnecessary, as both factors contribute to temperature.
par(mfrow=c(1,2))
boxplot(surftemp, main="surftemp")
boxplot(temp, main="airtemp")
### Testing
#try a basic t test for variables surftemp and temp
t.test(surftemp,temp,var.equal=TRUE)
##
## Two Sample t-test
##
## data: surftemp and temp
## t = -51.78, df = 82942, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.754 -1.626
## sample estimates:
## mean of x mean of y
## 296.2 297.9
Based on these preliminary results from the t-test (t = -51.78, p-value < 2.2e-16), we reject the H0 that there is no difference between the mean temperatures of surftemp and temp (air temperature). However, we must check normality to ensure these results are valid.
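The test above pools the variances (`var.equal=TRUE`), which is why its degrees of freedom equal 2 * 41472 - 2 = 82942. As a hedged comparison, the sketch below runs the pooled test, Welch's test (the `t.test` default), and a paired test on synthetic vectors. The vectors `surf` and `air` and the 1.7-degree offset are hypothetical, and treating the two temperatures as paired (same row, same location and month) is an assumption about the design rather than something the write-up states:

```r
# Synthetic paired measurements (hypothetical values, not the atmos data)
set.seed(42)
n    <- 200
surf <- rnorm(n, mean = 296, sd = 4)
air  <- surf + rnorm(n, mean = 1.7, sd = 1)  # assumed offset between the two readings

pooled <- t.test(surf, air, var.equal = TRUE)  # pooled variance, df = 2n - 2
welch  <- t.test(surf, air)                    # Welch correction, no equal-variance assumption
paired <- t.test(surf, air, paired = TRUE)     # tests the row-wise differences, df = n - 1

c(pooled = pooled$p.value, welch = welch$p.value, paired = paired$p.value)
```

On the real data the analogous calls would be `t.test(surftemp, temp)` and `t.test(surftemp, temp, paired = TRUE)`; whether the paired form is appropriate depends on whether each row is regarded as one matched observation.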
# Shapiro-Wilk test of normality. Normality is adequate if p > 0.1
shapiro.test(temp[1:4999])
##
## Shapiro-Wilk normality test
##
## data: temp[1:4999]
## W = 0.867, p-value < 2.2e-16
shapiro.test(ozone[1:4999])
##
## Shapiro-Wilk normality test
##
## data: ozone[1:4999]
## W = 0.8925, p-value < 2.2e-16
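One of the primary assumptions of the t-test is that the data are normally distributed. Both Shapiro-Wilk tests (run on temp and ozone) give p-value < 2.2e-16, so normality is rejected for those variables; since the data are not normally distributed, the results from the t-test are essentially invalidated.
### Diagnostics/Model Adequacy Checking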
qqnorm(temp,ylab="Temp")
qqnorm(ozone,ylab="Ozone")
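The Shapiro-Wilk calls earlier use the first 4999 rows because `shapiro.test` accepts between 3 and 5000 values. Since the rows are ordered by month and location, the first rows cover only part of the record. One possible alternative is to test a random subsample. The sketch below illustrates this on a synthetic, deliberately skewed vector (`temp_toy` is hypothetical and only matches atmos in length):

```r
# Synthetic, skewed stand-in for a temperature column (hypothetical values)
set.seed(2014)
temp_toy <- c(rnorm(30000, mean = 299, sd = 3), rnorm(11472, mean = 285, sd = 6))

# Draw 5000 row indices at random instead of taking the first rows
idx <- sample(length(temp_toy), 5000)

# Shapiro-Wilk on the random subsample; a small p-value is evidence against normality
sw <- shapiro.test(temp_toy[idx])
sw$p.value
```

On the real data the same pattern would be `shapiro.test(surftemp[sample(length(surftemp), 5000)])`, which would also cover `surftemp`, the other variable used in the t-test.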
library("nasaweather", lib.loc="~/R/win-library/3.1")