title: “Project3” author: “Chixian Ma” date: “Sunday, April 12, 2015” output: html_document
My dataset is the NASA weather data, which from the Hadley Wickham datasets. The full data package has four datasets: atmosphere, elevation, storms tracks, and glaciers. I choose the atmosphere dataset. This dataset has 7 variables: surface temperature, air temperature, air pressure, ozone, and cloud cover(low,mid,and high). The dataset has 41472 observations of the atmosphere condition of Central America monthly from 1995 to 2000.
https://github.com/hadley/nasaweather
load("C:/Users/Xianwei/Desktop/atmos.rdata")
attach(atmos)
## The following object is masked from package:datasets:
##
## pressure
head(atmos)
## lat long year month surftemp temp pressure ozone cloudlow
## 1 36.20000 -113.8 1995 1 272.7 272.1 835 304 7.5
## 2 33.70435 -113.8 1995 1 279.5 282.2 940 304 11.5
## 3 31.20870 -113.8 1995 1 284.7 285.2 960 298 16.5
## 4 28.71304 -113.8 1995 1 289.3 290.7 990 276 20.5
## 5 26.21739 -113.8 1995 1 292.2 292.7 1000 274 26.0
## 6 23.72174 -113.8 1995 1 294.1 293.6 1000 264 30.0
## cloudmid cloudhigh
## 1 34.5 26.0
## 2 32.5 20.0
## 3 26.0 16.0
## 4 14.5 13.0
## 5 10.5 7.5
## 6 9.5 8.0
plot(atmos[,5:9])
My independent variables are: surface temperature and air pressure.
My dependent variable is air temperature.
My null hypothesis \(H_0\) is that the variation of air temperature depends on randomness and cannot be explained by any of the independent variables: surface temperature and air pressure.
My \(H_1\) is that the variation of air temperature can be explained by any of the independent variables: surface temperature and air pressure.
I decided to use step-wise method to build my model. So I include the independent variables based on their correlation with the dependent variable.
cor(atmos[,5:7])
## surftemp temp pressure
## surftemp 1.0000000 0.8067645 0.3517102
## temp 0.8067645 1.0000000 0.4966186
## pressure 0.3517102 0.4966186 1.0000000
According to the correlation matrix, I include surface temperature first into my model. Second, I include air pressure into my model.
lm1 <- lm(temp~surftemp)
summary(lm1)
##
## Call:
## lm(formula = temp ~ surftemp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.8084 -1.4386 -0.0386 1.6246 18.0229
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.094233 0.869839 64.49 <2e-16 ***
## surftemp 0.816346 0.002936 278.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.793 on 41470 degrees of freedom
## Multiple R-squared: 0.6509, Adjusted R-squared: 0.6509
## F-statistic: 7.731e+04 on 1 and 41470 DF, p-value: < 2.2e-16
lm2 <- lm(temp~surftemp+pressure)
summary(lm2)
##
## Call:
## lm(formula = temp ~ surftemp + pressure)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.7275 -1.5675 -0.1542 1.5636 18.4211
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.666e+01 8.029e-01 70.57 <2e-16 ***
## surftemp 7.299e-01 2.895e-03 252.14 <2e-16 ***
## pressure 2.542e-02 2.993e-04 84.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.578 on 41469 degrees of freedom
## Multiple R-squared: 0.7026, Adjusted R-squared: 0.7026
## F-statistic: 4.898e+04 on 2 and 41469 DF, p-value: < 2.2e-16
par(mfrow=c(1,1))
residual <- residuals(lm2)
plot(fitted(lm2),residual,pch = 21, bg = 'blue', main = "Residual Fitted Plot", xlab = "Fitted values of Model", ylab = "Residuals")
abline(0,0, lwd = 2, col = 'red')
par(mfrow=c(1,1))
standresidual <- rstandard(lm2)
plot(fitted(lm2),standresidual,pch = 21, bg = 'blue', main = "Standardized Residual Fitted Plot", xlab = "Fitted values of Model", ylab = "Residuals")
abline(0,0, lwd = 2, col = 'red')
hist(resid(lm2), breaks = 15, main = "Histograms of Residuals", xlab = "Residuals", ylab = "Density")
boxplot(residuals(lm2), main = "Boxplot of Residuals")
par(mfrow = c(1,1))
qqnorm(residuals(lm2),main = "Q-Q Plot")
qqline(residuals(lm2))
library(het.test)
## Loading required package: vars
## Loading required package: MASS
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Loading required package: sandwich
## Loading required package: urca
## Loading required package: lmtest
dataset1 <- data.frame(x = surftemp, y = temp)
model1 <- VAR(dataset1, p =1)
whites.htest(model1)
##
## White's Test for Heteroskedasticity:
## ====================================
##
## No Cross Terms
##
## H0: Homoskedasticity
## H1: Heteroskedasticity
##
## Test Statistic:
## 1594.4028
##
## Degrees of Freedom:
## 12
##
## P-value:
## 0.0000
dataset2 <- data.frame(x = pressure, y = temp)
model2 <- VAR(dataset2, p =1)
whites.htest(model2)
##
## White's Test for Heteroskedasticity:
## ====================================
##
## No Cross Terms
##
## H0: Homoskedasticity
## H1: Heteroskedasticity
##
## Test Statistic:
## 6103.1103
##
## Degrees of Freedom:
## 12
##
## P-value:
## 0.0000
plot(surftemp,temp, xlab = "surface temperature", ylab = "air temperature")
abline(lm1)
plot(pressure,temp, xlab = "air pressure", ylab = "air temperature")
lm3 <- lm(temp~pressure)
abline(lm3)
To be finished in the final version
r2 <- summary(lm2)$r.squared
f2 <- r2/(1-r2)
First, I choose to use f2 = 2.36, alpha = 0.05, beta = 0.8, and 2 predictor to compute sample size in G * power. G * Power shows that I only need 8 observations to do the regression. So, I think I’d better choose to use rule of thumb and decided to use f2 = 0.02 in my project. Then, G * power computed that I need to have 485 observasions.