title: “Project3” author: “Chixian Ma” date: “Sunday, April 12, 2015” output: html_document

Dataset

My dataset is the NASA weather data, which from the Hadley Wickham datasets. The full data package has four datasets: atmosphere, elevation, storms tracks, and glaciers. I choose the atmosphere dataset. This dataset has 7 variables: surface temperature, air temperature, air pressure, ozone, and cloud cover(low,mid,and high). The dataset has 41472 observations of the atmosphere condition of Central America monthly from 1995 to 2000.

https://github.com/hadley/nasaweather

load("C:/Users/machi_000/Desktop/Project3/atmos.rdata")
str(atmos)
## Classes 'tbl_df', 'tbl' and 'data.frame':    41472 obs. of  11 variables:
##  $ lat      : num  36.2 33.7 31.2 28.7 26.2 ...
##  $ long     : num  -114 -114 -114 -114 -114 ...
##  $ year     : int  1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
##  $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ surftemp : num  273 280 285 289 292 ...
##  $ temp     : num  272 282 285 291 293 ...
##  $ pressure : num  835 940 960 990 1000 1000 1000 1000 1000 1000 ...
##  $ ozone    : num  304 304 298 276 274 264 258 252 250 250 ...
##  $ cloudlow : num  7.5 11.5 16.5 20.5 26 30 29.5 26.5 27.5 26 ...
##  $ cloudmid : num  34.5 32.5 26 14.5 10.5 9.5 11 17.5 18.5 16.5 ...
##  $ cloudhigh: num  26 20 16 13 7.5 8 14.5 19.5 22.5 21 ...

Surface temperature and air temperature are in Kelvin (K = 273.15 + Centigrade) . Air pressure is in Pa.

Becasue the dataset is too large and in order to avoid issue of sample size, I use reduce the sample size to 485 (computed through G * Power and will be discussed in the Sample Size section).

N <- 485
M <- nrow(atmos)
set.seed(13)
indicies <- sample(M,N,replace = FALSE)
atmosnew <- atmos[indicies,]
plot(atmosnew[5:7])

Independent Variable

My independent variables are: surface temperature and air pressure.

Dependent variable

My dependent variable is air temperature.

Null hypothesis \(H_0\)

My null hypothesis \(H_0\) is that the variation of air temperature depends on randomness and cannot be explained by any of the independent variables: surface temperature and air pressure.

My \(H_1\) is that the variation of air temperature can be explained by any of the independent variables: surface temperature and air pressure.

Building the model

I decided to use step-wise method to build my model. So I add the independent variables based on their correlation with the dependent variable. The following correlation matrix shows the correlation of dependent variable (air temperature) and two independent variables (surface temperature and air pressure).

attach(atmosnew)
## The following object is masked from package:datasets:
## 
##     pressure
cor(atmosnew[,5:7])
##           surftemp      temp  pressure
## surftemp 1.0000000 0.8360641 0.3171675
## temp     0.8360641 1.0000000 0.5432380
## pressure 0.3171675 0.5432380 1.0000000

According to the correlation matrix, I include surface temperature first into my model.Surface temperature and air temperature are highly correlated,with the correlation of 0.83. Second, I include air pressure into my model.The correlation of air temperature and air pressure is 0.5432.

In addition, the correlation of surface temperature and air pressure is 0.317. So, include both independent may have suppression in my model and need to pay attention to it.

lm1 <- lm(temp~surftemp)
summary(lm1)
## 
## Call:
## lm(formula = temp ~ surftemp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.2166  -1.3515   0.0366   1.7175   6.8673 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.64788    7.67117   5.299 1.78e-07 ***
## surftemp     0.86808    0.02592  33.491  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.843 on 483 degrees of freedom
## Multiple R-squared:  0.699,  Adjusted R-squared:  0.6984 
## F-statistic:  1122 on 1 and 483 DF,  p-value: < 2.2e-16

The model is: air temperature is explained by surface temperature. bo of this model is 40.64788, which means when surface temperature is 0, the air temperature is 40.64788 K. b1 of this model is 0.86808 and shows that with one K increases in surface temperature, air temperature increases 0.86808.

Adjusted R^2 is 0.6984. R^2 explains that surface temperature can explain 69.84% of the variation of air temperature. Also, more independents are needed in the explanation of the air temperature.

Then, for the significant test, F-statistic of this model is 1122 on 1 and 483 DF. The p-value of this significant test is < 2.2e-16. So, we can reject the null hypothesis and say that surface temperature can explain the variation in air temperature.

lm2 <- lm(temp~surftemp+pressure)
summary(lm2)
## 
## Call:
## lm(formula = temp ~ surftemp + pressure)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3811 -1.4412 -0.1096  1.4301  6.6287 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.387489   6.490529   6.223 1.06e-09 ***
## surftemp     0.766271   0.023124  33.137  < 2e-16 ***
## pressure     0.030889   0.002225  13.882  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.406 on 482 degrees of freedom
## Multiple R-squared:  0.785,  Adjusted R-squared:  0.7841 
## F-statistic: 879.8 on 2 and 482 DF,  p-value: < 2.2e-16

The model is: air temperature is explained by surface temperature and air pressure. bo of this model is 40.38, which means when surface temperature is 0, the air temperature is 40.38 K. b1 of this model is 0.766 and shows that with one K increases in surface temperature, air temperature increases 0.766. b2 of this model is 0.0308. It shows that with one Pa increases in air pressure, air temperature increases 0.0308.

Adjusted R^2 is 0.7841. R^2 explains that surface temperature can explain 78,41% of the variation of air temperature. It is higher than the previous R^2.We can say that add air pressure into the model is reasonable. However, more independents are needed in the explanation of the air temperature.

Then, for the significant test, F-statistic of this model is 693.5 on 2 and 482 DF. The p-value of this significant test is < 2.2e-16. So, we can reject the null hypothesis and say that surface temperature and air pressure can explain the variation in air temperature. ## Plot

Residual vs fitted plot

par(mfrow=c(1,1))
residual <- residuals(lm2)
plot(fitted(lm2),residual,pch = 21, bg = 'blue', main = "Residual vs Fitted Plot", xlab = "Fitted values of Model", ylab = "Residuals")
abline(0,0, lwd = 2, col = 'red')

Standardized Residual vs fitted plot

par(mfrow=c(1,1))
standresidual <- rstandard(lm2)
plot(fitted(lm2),standresidual,pch = 21, bg = 'blue', main = "Standardized Residual vs Fitted Plot", xlab = "Fitted values of Model", ylab = "Residuals")
abline(0,0, lwd = 2, col = 'red')     

Interpret

Test of assumptions

Assumption 1. The distribution of residuals is normal (at each value of the outcome)

Histogram of Residuals

hist(resid(lm2), breaks = 15, main = "Histograms of Residuals", xlab = "Residuals", ylab = "Density")

The histogram of residual looks that the residual should be normal distributed. However, there maybe some outliers at left.

Boxplot

boxplot(residuals(lm2), main = "Boxplot of Residuals")

The boxplot shows there are several ouliers of residuals at left and right side. Then, it shows the mean of residual is 0, so we can say there is no skewness of residuals.

Q-Q Plot

par(mfrow = c(1,1))
qqnorm(residuals(lm2),main = "Q-Q Plot")
qqline(residuals(lm2),col = 'red')

The Q-Q plot looks almost linear except a little deviation from the lower and upper range of the distribution. There maybe some outliers at lower and upper range.

By looking through histrogram plot, Boxplot, and Q-Q plot, I think the distribution of residuals is normal except there are maybe some outliers.

Assumption 2: The variance of the residuals for every set of values for the predictor is equal.

To test this assumption, I first look at the standardized residual versus fitted plot. The plot looks like the data is kind of homoscedasticity, but I am not very sure. So, then, I do the white’s test.

White’s test

library(het.test)
## Warning: package 'het.test' was built under R version 3.1.3
## Loading required package: vars
## Warning: package 'vars' was built under R version 3.1.3
## Loading required package: MASS
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 3.1.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.1.3
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 3.1.3
## Loading required package: urca
## Warning: package 'urca' was built under R version 3.1.3
## Loading required package: lmtest
## Warning: package 'lmtest' was built under R version 3.1.3
dataset1 <- data.frame(x = surftemp, y = temp)
model1 <- VAR(dataset1, p =1)
whites.htest(model1)
## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  6.7208 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.8755
dataset2 <- data.frame(x = pressure, y = temp)
model2 <- VAR(dataset2, p =1)
whites.htest(model2)
## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  5.3382 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.9457

The results of White’s test shows the P-value are greater than 0.05. So, we cannot reject the null hypothesis and conclude that our data is homoscedastic. This indicates that the variance of the residuals for every set of values of the predictor is equal and the model satisfies this assumption.

Assumption 3: The error term is additive.

This assumption is had to test and need to know it from the theory. I searched on the internet for the theory and did’t find the two independet variables (surface temperature and air pressure) are interactive. So, I think the error term should be additive.

Assumption 4: At every value of the outcome the expected (mean) value of the residuals is zero.

This assumption measn that the independent variables and dependent variable should be perfect linear relationship. Looking at the residual versus fitted plot, I find there are more points at right part than lift part . It shows the points are not randam across the dynamic range. So, the independent variables and dependent variables are not linear relationship and assumption 4 is violated.

Assumption 5: The expected correlation between residuals, for any two cases, is 0.

This assumption means that all cases should be independent of one another. My data were collected in time order. I think the average air temperature in June and July should be correlated. So, this assumption is violated.

Assumption 6: All predictor variables are uncorrelated with the error term.

This assumption means that predictors are not correlated with the error term. So I plot the predictors vs residual to see their relationship.

par(mfrow=c(1,1))
standresidual <- rstandard(lm2)
plot(surftemp,standresidual,pch = 21, bg = 'blue', main = "surface temperature vs Standardized Residual", xlab = "surface temperature", ylab = "Residuals")
abline(0,0, lwd = 2, col = 'red')

plot(pressure,standresidual,pch = 21, bg = 'blue', main = "air pressure vs Standardized Residual", xlab = "air pressure", ylab = "Residuals")
abline(0,0, lwd = 2, col = 'red')

According to the plot, the predictors are not correlated with the residuals. So, assumption 6 is satisfied.

Assumption 7: No predictors are a perfect linear function of other predictors.

This assumption means that there is not perfect multicollinearity of independent variables. According to the correlation matrix, the correlation between two independent variables is 0.39. So there is a correlation between IVs, but it is nor perfect linear relationship. Assumption 7 is satisfied.

Assumption 8: The mean of the error term is zero.

par(mfrow=c(1,1))
standresidual <- rstandard(lm2)
plot(fitted(lm2),standresidual,pch = 21, bg = 'blue', main = "Standardized Residual vs Fitted Plot", xlab = "Fitted values of Model", ylab = "Residuals")
abline(0,0, lwd = 2, col = 'red')     

To test this assumption, I look at the standardized residual vs fitted plot. It looks like equal points above and under 0 and the distince from the points over 0 and under 0 to 0 should be almost the same. So,by looking at the plot, I think the mean of error term should be very close to 0.

mean(standresidual)
## [1] -0.0006030735

Then, I compute the standard residual and find it’s -0.0006030735. To conclude, I think assumption 8 is satisfied.

1. Causality

Surface temperature and air pressure are proximal and probabilistic causes of air temperature. They are Proximal causes because we can easily know the direct and immediate cause of something, but it’s very had to find the start point of the process. They are not determinate because everything has exception. Not every time when surface temperature increases and air pressure increases, air temperature increases.

Surface temperature and air pressure are related to air temperature because they are correlated with each other. Then, based on my knowledge, I think direction of influence should be surface temperature increase causes air temperature increase and air pressure increase causes air temperature increase. However, their relation is not isolate. Other factors also could cause air temperature to increase.

2. Sample Sizes

r2 <- summary(lm2)$r.squared
f2 <- r2/(1-r2)

My dataset is very larget. So, it may be waste time to do the regression with the whole data and also could increase the probability of Type I error. So, I need to reduce the sample size.

First, I choose to use f2 = 2.36, alpha = 0.05, beta = 0.8, and 2 predictor to compute sample size in G * power. G * Power shows that I only need 8 observations to do the regression. So, I think I’d better choose to use rule of thumb and decided to use f2 = 0.02 in my project. G * power computed that I need to have 485 observasions.

3. Collinearity

lm3 <- lm(surftemp~pressure)
R2 <- summary(lm3)$r.squared
plot(pressure,surftemp, pch = 21, bg = 'blue', main = "surface temperature vs air pressure", xlab = "surface temperature", ylab = "air pressure")

Tolerance <- 1-R2

Independent variables in my model have collinearity, but they are not perfect collinearity. According to the correlation matrix, their correlation is 0.317. Then, looking at the plot, I think IVs are not high correlated. In addition, R^2 is 0.1551 for using surface temperature as IV and air pressure as DV. Tolerance is 0.8994047. So, IVs in my model don’t have high collinearity.

4. Measurement Error

I think my model have measurement error. Because we can’t measure any variable perfectly. There is attenuation of correlation. The correlation we compute is lower than the actual correlation. Measurement error is hard to avoid and is very dangerous.