title: “Project3” author: “Chixian Ma” date: “Sunday, April 12, 2015” output: html_document
My dataset is the NASA weather data, which from the Hadley Wickham datasets. The full data package has four datasets: atmosphere, elevation, storms tracks, and glaciers. I choose the atmosphere dataset. This dataset has 7 variables: surface temperature, air temperature, air pressure, ozone, and cloud cover(low,mid,and high). The dataset has 41472 observations of the atmosphere condition of Central America monthly from 1995 to 2000.
https://github.com/hadley/nasaweather
load("C:/Users/machi_000/Desktop/Project3/atmos.rdata")
str(atmos)
## Classes 'tbl_df', 'tbl' and 'data.frame': 41472 obs. of 11 variables:
## $ lat : num 36.2 33.7 31.2 28.7 26.2 ...
## $ long : num -114 -114 -114 -114 -114 ...
## $ year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ surftemp : num 273 280 285 289 292 ...
## $ temp : num 272 282 285 291 293 ...
## $ pressure : num 835 940 960 990 1000 1000 1000 1000 1000 1000 ...
## $ ozone : num 304 304 298 276 274 264 258 252 250 250 ...
## $ cloudlow : num 7.5 11.5 16.5 20.5 26 30 29.5 26.5 27.5 26 ...
## $ cloudmid : num 34.5 32.5 26 14.5 10.5 9.5 11 17.5 18.5 16.5 ...
## $ cloudhigh: num 26 20 16 13 7.5 8 14.5 19.5 22.5 21 ...
Surface temperature and air temperature are in Kelvin (K = 273.15 + Centigrade) . Air pressure is in Pa.
Becasue the dataset is too large and in order to avoid issue of sample size, I use reduce the sample size to 485 (computed through G * Power and will be discussed in the Sample Size section).
N <- 485
M <- nrow(atmos)
set.seed(13)
indicies <- sample(M,N,replace = FALSE)
atmosnew <- atmos[indicies,]
plot(atmosnew[5:7])
My independent variables are: surface temperature and air pressure.
My dependent variable is air temperature.
My null hypothesis \(H_0\) is that the variation of air temperature depends on randomness and cannot be explained by any of the independent variables: surface temperature and air pressure.
My \(H_1\) is that the variation of air temperature can be explained by any of the independent variables: surface temperature and air pressure.
I decided to use step-wise method to build my model. So I add the independent variables based on their correlation with the dependent variable. The following correlation matrix shows the correlation of dependent variable (air temperature) and two independent variables (surface temperature and air pressure).
attach(atmosnew)
## The following object is masked from package:datasets:
##
## pressure
cor(atmosnew[,5:7])
## surftemp temp pressure
## surftemp 1.0000000 0.8360641 0.3171675
## temp 0.8360641 1.0000000 0.5432380
## pressure 0.3171675 0.5432380 1.0000000
According to the correlation matrix, I include surface temperature first into my model.Surface temperature and air temperature are highly correlated,with the correlation of 0.83. Second, I include air pressure into my model.The correlation of air temperature and air pressure is 0.5432.
In addition, the correlation of surface temperature and air pressure is 0.317. So, include both independent may have suppression in my model and need to pay attention to it.
lm1 <- lm(temp~surftemp)
summary(lm1)
##
## Call:
## lm(formula = temp ~ surftemp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.2166 -1.3515 0.0366 1.7175 6.8673
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.64788 7.67117 5.299 1.78e-07 ***
## surftemp 0.86808 0.02592 33.491 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.843 on 483 degrees of freedom
## Multiple R-squared: 0.699, Adjusted R-squared: 0.6984
## F-statistic: 1122 on 1 and 483 DF, p-value: < 2.2e-16
The model is: air temperature is explained by surface temperature. bo of this model is 40.64788, which means when surface temperature is 0, the air temperature is 40.64788 K. b1 of this model is 0.86808 and shows that with one K increases in surface temperature, air temperature increases 0.86808.
Adjusted R^2 is 0.6984. R^2 explains that surface temperature can explain 69.84% of the variation of air temperature. Also, more independents are needed in the explanation of the air temperature.
Then, for the significant test, F-statistic of this model is 1122 on 1 and 483 DF. The p-value of this significant test is < 2.2e-16. So, we can reject the null hypothesis and say that surface temperature can explain the variation in air temperature.
lm2 <- lm(temp~surftemp+pressure)
summary(lm2)
##
## Call:
## lm(formula = temp ~ surftemp + pressure)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3811 -1.4412 -0.1096 1.4301 6.6287
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.387489 6.490529 6.223 1.06e-09 ***
## surftemp 0.766271 0.023124 33.137 < 2e-16 ***
## pressure 0.030889 0.002225 13.882 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.406 on 482 degrees of freedom
## Multiple R-squared: 0.785, Adjusted R-squared: 0.7841
## F-statistic: 879.8 on 2 and 482 DF, p-value: < 2.2e-16
The model is: air temperature is explained by surface temperature and air pressure. bo of this model is 40.38, which means when surface temperature is 0, the air temperature is 40.38 K. b1 of this model is 0.766 and shows that with one K increases in surface temperature, air temperature increases 0.766. b2 of this model is 0.0308. It shows that with one Pa increases in air pressure, air temperature increases 0.0308.
Adjusted R^2 is 0.7841. R^2 explains that surface temperature can explain 78,41% of the variation of air temperature. It is higher than the previous R^2.We can say that add air pressure into the model is reasonable. However, more independents are needed in the explanation of the air temperature.
Then, for the significant test, F-statistic of this model is 693.5 on 2 and 482 DF. The p-value of this significant test is < 2.2e-16. So, we can reject the null hypothesis and say that surface temperature and air pressure can explain the variation in air temperature. ## Plot
par(mfrow=c(1,1))
residual <- residuals(lm2)
plot(fitted(lm2),residual,pch = 21, bg = 'blue', main = "Residual vs Fitted Plot", xlab = "Fitted values of Model", ylab = "Residuals")
abline(0,0, lwd = 2, col = 'red')
par(mfrow=c(1,1))
standresidual <- rstandard(lm2)
plot(fitted(lm2),standresidual,pch = 21, bg = 'blue', main = "Standardized Residual vs Fitted Plot", xlab = "Fitted values of Model", ylab = "Residuals")
abline(0,0, lwd = 2, col = 'red')
hist(resid(lm2), breaks = 15, main = "Histograms of Residuals", xlab = "Residuals", ylab = "Density")
The histogram of residual looks that the residual should be normal distributed. However, there maybe some outliers at left.
boxplot(residuals(lm2), main = "Boxplot of Residuals")
The boxplot shows there are several ouliers of residuals at left and right side. Then, it shows the mean of residual is 0, so we can say there is no skewness of residuals.
par(mfrow = c(1,1))
qqnorm(residuals(lm2),main = "Q-Q Plot")
qqline(residuals(lm2),col = 'red')
The Q-Q plot looks almost linear except a little deviation from the lower and upper range of the distribution. There maybe some outliers at lower and upper range.
By looking through histrogram plot, Boxplot, and Q-Q plot, I think the distribution of residuals is normal except there are maybe some outliers.
To test this assumption, I first look at the standardized residual versus fitted plot. The plot looks like the data is kind of homoscedasticity, but I am not very sure. So, then, I do the white’s test.
library(het.test)
## Warning: package 'het.test' was built under R version 3.1.3
## Loading required package: vars
## Warning: package 'vars' was built under R version 3.1.3
## Loading required package: MASS
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 3.1.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.1.3
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 3.1.3
## Loading required package: urca
## Warning: package 'urca' was built under R version 3.1.3
## Loading required package: lmtest
## Warning: package 'lmtest' was built under R version 3.1.3
dataset1 <- data.frame(x = surftemp, y = temp)
model1 <- VAR(dataset1, p =1)
whites.htest(model1)
##
## White's Test for Heteroskedasticity:
## ====================================
##
## No Cross Terms
##
## H0: Homoskedasticity
## H1: Heteroskedasticity
##
## Test Statistic:
## 6.7208
##
## Degrees of Freedom:
## 12
##
## P-value:
## 0.8755
dataset2 <- data.frame(x = pressure, y = temp)
model2 <- VAR(dataset2, p =1)
whites.htest(model2)
##
## White's Test for Heteroskedasticity:
## ====================================
##
## No Cross Terms
##
## H0: Homoskedasticity
## H1: Heteroskedasticity
##
## Test Statistic:
## 5.3382
##
## Degrees of Freedom:
## 12
##
## P-value:
## 0.9457
The results of White’s test shows the P-value are greater than 0.05. So, we cannot reject the null hypothesis and conclude that our data is homoscedastic. This indicates that the variance of the residuals for every set of values of the predictor is equal and the model satisfies this assumption.
This assumption is had to test and need to know it from the theory. I searched on the internet for the theory and did’t find the two independet variables (surface temperature and air pressure) are interactive. So, I think the error term should be additive.
This assumption measn that the independent variables and dependent variable should be perfect linear relationship. Looking at the residual versus fitted plot, I find there are more points at right part than lift part . It shows the points are not randam across the dynamic range. So, the independent variables and dependent variables are not linear relationship and assumption 4 is violated.
This assumption means that all cases should be independent of one another. My data were collected in time order. I think the average air temperature in June and July should be correlated. So, this assumption is violated.
This assumption means that there is not perfect multicollinearity of independent variables. According to the correlation matrix, the correlation between two independent variables is 0.39. So there is a correlation between IVs, but it is nor perfect linear relationship. Assumption 7 is satisfied.
par(mfrow=c(1,1))
standresidual <- rstandard(lm2)
plot(fitted(lm2),standresidual,pch = 21, bg = 'blue', main = "Standardized Residual vs Fitted Plot", xlab = "Fitted values of Model", ylab = "Residuals")
abline(0,0, lwd = 2, col = 'red')
To test this assumption, I look at the standardized residual vs fitted plot. It looks like equal points above and under 0 and the distince from the points over 0 and under 0 to 0 should be almost the same. So,by looking at the plot, I think the mean of error term should be very close to 0.
mean(standresidual)
## [1] -0.0006030735
Then, I compute the standard residual and find it’s -0.0006030735. To conclude, I think assumption 8 is satisfied.
Surface temperature and air pressure are proximal and probabilistic causes of air temperature. They are Proximal causes because we can easily know the direct and immediate cause of something, but it’s very had to find the start point of the process. They are not determinate because everything has exception. Not every time when surface temperature increases and air pressure increases, air temperature increases.
Surface temperature and air pressure are related to air temperature because they are correlated with each other. Then, based on my knowledge, I think direction of influence should be surface temperature increase causes air temperature increase and air pressure increase causes air temperature increase. However, their relation is not isolate. Other factors also could cause air temperature to increase.
r2 <- summary(lm2)$r.squared
f2 <- r2/(1-r2)
My dataset is very larget. So, it may be waste time to do the regression with the whole data and also could increase the probability of Type I error. So, I need to reduce the sample size.
First, I choose to use f2 = 2.36, alpha = 0.05, beta = 0.8, and 2 predictor to compute sample size in G * power. G * Power shows that I only need 8 observations to do the regression. So, I think I’d better choose to use rule of thumb and decided to use f2 = 0.02 in my project. G * power computed that I need to have 485 observasions.
lm3 <- lm(surftemp~pressure)
R2 <- summary(lm3)$r.squared
plot(pressure,surftemp, pch = 21, bg = 'blue', main = "surface temperature vs air pressure", xlab = "surface temperature", ylab = "air pressure")
Tolerance <- 1-R2
Independent variables in my model have collinearity, but they are not perfect collinearity. According to the correlation matrix, their correlation is 0.317. Then, looking at the plot, I think IVs are not high correlated. In addition, R^2 is 0.1551 for using surface temperature as IV and air pressure as DV. Tolerance is 0.8994047. So, IVs in my model don’t have high collinearity.
I think my model have measurement error. Because we can’t measure any variable perfectly. There is attenuation of correlation. The correlation we compute is lower than the actual correlation. Measurement error is hard to avoid and is very dangerous.