title: “Project3” author: “Chixian Ma” date: “Sunday, April 12, 2015” output: html_document

Dataset

My dataset is the NASA weather data, which from the Hadley Wickham datasets. The full data package has four datasets: atmosphere, elevation, storms tracks, and glaciers. I choose the atmosphere dataset. This dataset has 7 variables: surface temperature, air temperature, air pressure, ozone, and cloud cover(low,mid,and high). The dataset has 41472 observations of the atmosphere condition of Central America monthly from 1995 to 2000.

https://github.com/hadley/nasaweather

load("C:/Users/Xianwei/Desktop/atmos.rdata")
attach(atmos)
## The following object is masked from package:datasets:
## 
##     pressure
head(atmos)
##        lat   long year month surftemp  temp pressure ozone cloudlow
## 1 36.20000 -113.8 1995     1    272.7 272.1      835   304      7.5
## 2 33.70435 -113.8 1995     1    279.5 282.2      940   304     11.5
## 3 31.20870 -113.8 1995     1    284.7 285.2      960   298     16.5
## 4 28.71304 -113.8 1995     1    289.3 290.7      990   276     20.5
## 5 26.21739 -113.8 1995     1    292.2 292.7     1000   274     26.0
## 6 23.72174 -113.8 1995     1    294.1 293.6     1000   264     30.0
##   cloudmid cloudhigh
## 1     34.5      26.0
## 2     32.5      20.0
## 3     26.0      16.0
## 4     14.5      13.0
## 5     10.5       7.5
## 6      9.5       8.0
plot(atmos[,5:9])

Independent Variable

My independent variables are: surface temperature and air pressure.

Dependent variable

My dependent variable is air temperature.

Null hypothesis \(H_0\)

My null hypothesis \(H_0\) is that the variation of air temperature depends on randomness and cannot be explained by any of the independent variables: surface temperature and air pressure.

My \(H_1\) is that the variation of air temperature can be explained by any of the independent variables: surface temperature and air pressure.

Building the model

I decided to use step-wise method to build my model. So I include the independent variables based on their correlation with the dependent variable.

cor(atmos[,5:7])
##           surftemp      temp  pressure
## surftemp 1.0000000 0.8067645 0.3517102
## temp     0.8067645 1.0000000 0.4966186
## pressure 0.3517102 0.4966186 1.0000000

According to the correlation matrix, I include surface temperature first into my model. Second, I include air pressure into my model.

lm1 <- lm(temp~surftemp)
summary(lm1)
## 
## Call:
## lm(formula = temp ~ surftemp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.8084  -1.4386  -0.0386   1.6246  18.0229 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 56.094233   0.869839   64.49   <2e-16 ***
## surftemp     0.816346   0.002936  278.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.793 on 41470 degrees of freedom
## Multiple R-squared:  0.6509, Adjusted R-squared:  0.6509 
## F-statistic: 7.731e+04 on 1 and 41470 DF,  p-value: < 2.2e-16
lm2 <- lm(temp~surftemp+pressure)
summary(lm2)
## 
## Call:
## lm(formula = temp ~ surftemp + pressure)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.7275  -1.5675  -0.1542   1.5636  18.4211 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.666e+01  8.029e-01   70.57   <2e-16 ***
## surftemp    7.299e-01  2.895e-03  252.14   <2e-16 ***
## pressure    2.542e-02  2.993e-04   84.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.578 on 41469 degrees of freedom
## Multiple R-squared:  0.7026, Adjusted R-squared:  0.7026 
## F-statistic: 4.898e+04 on 2 and 41469 DF,  p-value: < 2.2e-16

Plot

Residual fitted plot

par(mfrow=c(1,1))
residual <- residuals(lm2)
plot(fitted(lm2),residual,pch = 21, bg = 'blue', main = "Residual Fitted Plot", xlab = "Fitted values of Model", ylab = "Residuals")
abline(0,0, lwd = 2, col = 'red')

Standardized Residual fitted plot

par(mfrow=c(1,1))
standresidual <- rstandard(lm2)
plot(fitted(lm2),standresidual,pch = 21, bg = 'blue', main = "Standardized Residual Fitted Plot", xlab = "Fitted values of Model", ylab = "Residuals")
abline(0,0, lwd = 2, col = 'red')

Histogram of Residuals

hist(resid(lm2), breaks = 15, main = "Histograms of Residuals", xlab = "Residuals", ylab = "Density")

Boxplot

boxplot(residuals(lm2), main = "Boxplot of Residuals")

Q-Q Plot

par(mfrow = c(1,1))
qqnorm(residuals(lm2),main = "Q-Q Plot")
qqline(residuals(lm2))

White’s test

library(het.test)
## Loading required package: vars
## Loading required package: MASS
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: sandwich
## Loading required package: urca
## Loading required package: lmtest
dataset1 <- data.frame(x = surftemp, y = temp)
model1 <- VAR(dataset1, p =1)
whites.htest(model1)
## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  1594.4028 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.0000
dataset2 <- data.frame(x = pressure, y = temp)
model2 <- VAR(dataset2, p =1)
whites.htest(model2)
## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  6103.1103 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.0000
plot(surftemp,temp, xlab = "surface temperature", ylab = "air temperature")
abline(lm1)

plot(pressure,temp, xlab = "air pressure", ylab = "air temperature")
lm3 <- lm(temp~pressure)
abline(lm3)

Interpret

To be finished in the final version

Interpret b0,b1 and R

Interpret of White’s test

Test of assumptions

Four issues

Causality

Sample Sizes

r2 <- summary(lm2)$r.squared
f2 <- r2/(1-r2)

First, I choose to use f2 = 2.36, alpha = 0.05, beta = 0.8, and 2 predictor to compute sample size in G * power. G * Power shows that I only need 8 observations to do the regression. So, I think I’d better choose to use rule of thumb and decided to use f2 = 0.02 in my project. Then, G * power computed that I need to have 485 observasions.

Collinearity

Measurement Error