DATA 606 Data Final Project

Adam Gersowitz

Introduction

Over the past century, climate change has caused global temperatures to rise across the world. I’m going to examine if the year to year percent chage in temperature varies for coordinates in Central and South America. By doing this I will be able to establish if other factors cause the change in temperature to be more extreme or more mild. By doing this we may be able to establish which countries world wide will be hit the hardest by climate change in the future, which will help us preppare for potential socio-economic crises such as mass-migration and the crumbling of international trade relations.

I plan on running multiple variables found in this dataset through a multiple regression analysis to determine which factors impact the percent change in average air temperature the greatest. I will determine which of these fields will be used during the data exploration phase.

Data

The data I will be using was collected by NASA and is data from the 2006 ASA data expo. The data can be found at http://stat-computing.org/dataexpo/2006/. For this project the data was pulled from the nasaweather package (https://blog.rstudio.com/2014/07/23/new-data-packages/). The data runs from 1995-2000, so I will look at the changes from the data recorded in 1995 to the data in 2000.

Each case represents an atmospheric weather reading of coordinates in North, South, and Central America. There 41,472 cases in the data set. I will collect the dataset and create variaous dataframe pertaining to the average and percentage change in temperature.This is an observaitonal study because I’m collecting the data from temperature data that was collected in a time-series fashion. After transforming the data so we can look at it through the lense of percent change in annual air temperature from the beginning of the time frame (1995) to the end (2000), we end up with 576 cases.

The response variables are the temperature and surface temperatures which are numerical.The explanatory variable is the cooridnates (laititude and longitude) which is categorical, ozone (numerical), pressure(numerical), cloud coverage(numerical).

Hoepfully, this data will be generalizable to the rest of the world because it will take into account factors such as LAtitude which is a factor for all land masses worldwide. Some issues that will impact generalizeability is this data set is just of the Americas. This dataset encompases is a relatively small land mass that is close to the equator so it may have less variaiblity within it’s range but may have mroe extreme changes when compared to the rest of the world. The best subset to the world to generalize to will be land mass at similar latitude points. Additonally, the dataset is from over 20 years ago. There have been countless impacts on the environement since then which may render this data less useful than a more recent collection.

This data will not be able to be used to prove causality as it is an observational study of temperature change for a small percentage of the world.

Libraries Used

library(nasaweather)
library(sqldf)
library(ggplot2)
library(lubridate)
library(sf)
library(mapview)
library(rworldmap)
library(rnaturalearth)
library(rnaturalearthdata)
library(rgeos)
library(rgdal)
library(raster)
library(geosphere)
library(osmdata)

Data Prep

nrow(atmos)
## [1] 41472
length(atmos)
## [1] 11
temp <- atmos


# I will also create dataframes that store average information on temperature and other factors by date and location
temp<-sqldf("select a.*, lat||'-'||long ll from temp a")

mean_variables<-sqldf("select avg(cloudlow) mean_cloudlow, avg(cloudhigh) mean_cloudhigh,avg(cloudmid) mean_cloudmid,avg(pressure) mean_pressure, avg(ozone) mean_ozone, lat, long from temp group by lat,long")

temp<-sqldf("select b.*, mean_cloudlow, mean_cloudhigh,mean_cloudmid,mean_pressure,mean_ozone from mean_variables a join temp b on a.lat = b.lat and a.long = b.long")

meantemp_coordyear  <-sqldf("select avg(temp) mean_temp,avg(surftemp) mean_surftemp, lat,long,year,ll from temp group by lat,long, year,ll order by year asc")

#Lastly I'll create a dataframe that captures the percent change from the first year to the last year in temperature and other factors

temp_percent_change  <-sqldf(" select distinct tm.lat, tm.long,temp.mean_cloudlow,temp.mean_cloudmid,
temp.mean_cloudhigh,
temp.mean_pressure,
temp.mean_ozone,
                              ((mean_temp - last_yr_temp)/last_yr_temp)*100 as per_chng_temp,
                              ((mean_surftemp - last_yr_surftemp)/last_yr_surftemp)*100 as per_chng_surftemp
                              from
                              (select mean_temp,mean_surftemp,  year, lat, long, ll
                              from meantemp_coordyear x
                              where
                              year = (select max(year) from meantemp_coordyear y where x.lat = y.lat and x.long = y.long)) tm
                             join
                             (select mean_temp as last_yr_temp, mean_surftemp as last_yr_surftemp, lat, long, ll,
                            year from meantemp_coordyear a
                            where
                            year = (select min(year) from meantemp_coordyear b where a.lat = b.lat and a.long = b.long)) lm on
                             lm.lat = tm.lat and lm.long = tm.long
                             join temp on temp.ll = tm.ll and temp.year = tm.year
                             order by  tm.lat, tm.long
                             
                             ")

tpc<-as.data.frame(temp_percent_change)
# I need to add in data columns for the proximity to the Ocean, Proximity to the Tropics/Equator and population size of nearby #areas

#calculating how many laititudinal units away form each tropic and equator
tpc[,3:4]<-sapply(tpc[,3:4], as.numeric)
tpc<-sqldf("select a.*, case when per_chng_temp >0 then 'Increase in Average Annual Temperature' else 'Decrease in Average Annual Temperature' end posneg,
           case when per_chng_surftemp >0 then 'Increase in Average Annual Surface Temperature' else 'Decrease in Average Annual Surface Temperature'end posnegsurfd from tpc a")

Exploratory Data Analysis

I will analyze this data on a year to year basis because there will certainly be seasonality in the temperatures of the region. When we examine the relationship between latitude and change in temperature we can see there is a pattern in the change in temperature in relation to latitude that is not just linear. Since latitude is based on around radial distance from the equator we will create an additional dummy variable to determine if the coordinates are in the Northern or Southern Hemisphere.

We see that proximity to the equator does not have a clear correlation with changes in the air temperature over the timeframe. We also see that the Southern Hemisphere has more variablity and a higher percent change in temperature.

summary(tpc$per_chng_temp)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.53679 -0.04444  0.18820  0.35346  0.59789  4.95746
summary(tpc$per_chng_surftemp)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.01098 -0.44015 -0.20318 -0.17956 -0.02194  1.72443
tpc$equator<-abs(tpc$lat)
tpc<-sqldf("select a.*, case when lat>0 then 'Northern' else 'Southern' end hemisphere from tpc a")


pf<-ggplot(tpc, aes(x=tpc$equator, y=tpc$per_chng_temp), colour = "Red") + geom_point()+geom_smooth()
pf+labs(x="Radial Distance to Equator",y="% Change in Temperature", title = "% Change in Temperature in relation to Equator",colour = "Coordinates of Interest")

pf<-ggplot(tpc, aes(x=tpc$hemisphere,y=tpc$per_chng_temp), colour = "Red") + geom_point()
pf+labs(x="Hemisphere",y="% Change in Temperature", title = "% Change in Temperature in relation to Hemisphere",colour = "Coordinates of Interest")

When we plot Ozone and Pressure we see that there is a lot of clumping with low levels of Ozone and high amounts of pressure. We will include this data when doing our analysis.

pf<-ggplot(tpc, aes(tpc$mean_ozone,tpc$per_chng_temp)) + geom_point()+ geom_smooth()
pf+labs(x="Average Ozone",y="% Change in Temperature", title = "% Change in Temperature in relation to Ozone")

pf<-ggplot(tpc, aes(tpc$mean_pressure,tpc$per_chng_temp)) + geom_point()+ geom_smooth()
pf+labs(x="Average Pressure",y="% Change in Temperature", title = "% Change in Temperature in relation to Pressure")

In additon, when we plot the valence of the percent change in temperature over the world map of their coordinates we can plainly see that nearly all land locked coordinates saw an increase in average temperature while most ocean coordinates saw a decrease. When we graph the same grid by surface temperature the difference between ocean surface temperature and land surface temperature is even more drastic.

We can also see banding around the tropics and equator that we had suspected when we did our analysis above.

world <- ne_countries(scale = 50)

p<-ggplot(data = tpc) + geom_polygon(data=world, aes(x=long, y=lat, group=group),  color="white", lwd = .25) +
    geom_point(data = tpc, aes(x = long, y = lat, fill = posneg), size = 2, 
        shape = 23)  + scale_colour_manual(values=c("blue","red")) + geom_hline(yintercept=23.5, linetype="dashed", color = "Blue", label = "Cancer") + geom_hline(yintercept=-23.5, linetype="dashed", color = "Green", label = "Capricorn") + geom_hline(yintercept=0, color = "Red", label = "Equator") +
    coord_sf(xlim = c(-125, -50), ylim = c(-25, 40), expand = FALSE)

p+labs(x="Longitude",y="Latitude", title = "% Change in Temperature Mapped", colour =  "Positive or Negative Change")

p<-ggplot(data = tpc) + geom_polygon(data=world, aes(x=long, y=lat, group=group),  color="white", lwd = .25) +
    geom_point(data = tpc, aes(x = long, y = lat, fill = posnegsurfd), size = 2, 
        shape = 23)  + scale_colour_manual(values=c("blue","red")) + geom_hline(yintercept=23.5, linetype="dashed", color = "Blue", label = "Cancer") + geom_hline(yintercept=-23.5, linetype="dashed", color = "Green", label = "Capricorn") + geom_hline(yintercept=0, color = "Red", label = "Equator") +
    coord_sf(xlim = c(-125, -50), ylim = c(-25, 40), expand = FALSE)

p+labs(x="Longitude",y="Latitude", title = "% Change in Surface Temperature Mapped", colour =  "Positive or Negative Change")

When we plot the average percentage of days with low, medium, and high cloud coverage we get an interesting plot. There does seem to be some correlation with cloud cover and the percent change in temperature but it does not seem to be very clearly discernable. To combat this I will create a variable which accounts for the average monthly percentage of days that there was cloud cover. I will do this by simply adding the 3 percentages together which will give me the average percentage of time that there was cloud cover for certain coordinates.

pf<-ggplot(tpc, aes(x=tpc$per_chng_temp,y=tpc$mean_cloudlow), colour = "Red") + geom_point()+geom_smooth()
pf+labs(x="% Change in Temperature",y="Average % of days with low cloud coverage", title = "% Change in Temperature in relation to Cloud Coverage",colour = "Coordinates of Interest")

pf<-ggplot(tpc, aes(x=tpc$per_chng_temp,tpc$mean_cloudmid), colour = "Blue") + geom_point()+geom_smooth()
pf+labs(x="% Change in Temperature",y="Average % of days with middle cloud coverage", title = "% Change in Temperature in relation to cloud coverage",colour = "Coordinates of Interest")

pf<-ggplot(tpc, aes(x=tpc$per_chng_temp,tpc$mean_cloudhigh), colour = "Green") + geom_point()+geom_smooth()
pf+labs(x="% Change in Temperature",y="Average % of days with high cloud coverage", title = "% Change in Temperature in relation to Cloud Coverage",colour = "Coordinates of Interest")

pf<-ggplot(tpc, aes(x=tpc$per_chng_temp)) + 
  geom_smooth(aes(y=tpc$mean_cloudlow), colour = "Red") +
  geom_smooth(aes(y=tpc$mean_cloudmid), colour = "Blue") +
  geom_smooth(aes(y=tpc$mean_cloudhigh), colour = "Green") 
pf+labs(x="% Change in Temperature",y="Average Monthly % of Days with Type of Cloud Cover", title = "% Change in Temperature in relation to Cloud Cover",colour = "Average Cloud Cover over 5 Years")

tpc$cloud_per<-tpc$mean_cloudlow+tpc$mean_cloudmid+tpc$mean_cloudhigh

I will also create a variable that rates the height of the clouds when there is cloud cover. I will multiply the percentage of days with high cloud cover by 3, middle cloud cover will get multiplied by 2 and low cloud cover will get multiplied by 1. Then I will add those 3 results together and divide by the total percentage of days with cloud cover. This will give me a number from 1 to 3 where 1 is 100% of the time when there were clouds there was low cloud coverage and 3 would be 100% high cloud coverage. I then divide this by 3 to get a ratio form 0 to 1. We can see a slightly postive correlation between % of time with cloud cover and temperature change and an even stronger correlation with the height of the cloud cover. I’ll create a final variable call cloud that will multiply the ratio of percentage of cloud cover by the rating of cloud height that will get a single variable on the impact of cloud cover and use it in the regression analysis.

After examining all of the data in the dataset we will examine how radial distance to the equator,hemisphere, Ozone, pressure, and Cloud Cover impacted the change in air temperature over the 5 year span. Since the distance to the equator and tropics will have strong colinearity we will only use the distance to the equator as an indicator

tpc$cloud_height<-(((tpc$mean_cloudhigh*3)+(tpc$mean_cloudmid*2)+(tpc$mean_cloudlow))/tpc$cloud_per)/3

pf<-ggplot(tpc, aes(x=tpc$cloud_per,y=tpc$per_chng_temp)) + geom_point()+ geom_smooth()
pf+labs(x="Percentage of Time With Cloud Cover",y="% Change in Temperature", title = "% Change in Temperature in relation to Percentage of Days with Cloud Cover")

pf<-ggplot(tpc, aes(x=tpc$cloud_height,y=tpc$per_chng_temp)) + geom_point()+ geom_smooth()
pf+labs(x="Rating of Height when Cloud Cover Exists",y="% Change in Temperature", title = "% Change in Temperature in relation to Height when Cloud Cover Exists")

tpc$cloud<-(tpc$cloud_per/100)*tpc$cloud_height


pf<-ggplot(tpc, aes(x=tpc$cloud,y=tpc$per_chng_temp)) + geom_point()+ geom_smooth()
pf+labs(x="Impact of Cloud Coverage",y="% Change in Temperature", title = "% Change in Temperature in relation to Percentage of Days and height of Cloud Cover")

Inference

Conditions for Multiple Regression Analysis

The main conditons that need to be met to proceed with multiple regression analysis are that:

-Residuals are normally distributed -Independent Variables are not highly correlated -Variance of residuals are similar accross independent varialbles

By using the functions below we find that are model meets the conditions above. The histogram demonstrates the spread of the residuals is very narrow and slightly skewed right but nearly normal. The line plot shows the redisual variance is relatively consistent until the Quantiles get very high then it deviates greatly. This may be cause for concern, but should be close enough to satisfy the condition. We see that distance to the equator and ozone have high colinearity so we will remove ozone from the model to achieve acceptable VIF scores. I chose to leave equator as it would be a more understandable and basic explanatory variable.

With these assumptions satisfied we can proceed.

tpc_lm <- lm(per_chng_temp ~ equator+mean_ozone+hemisphere+cloud+mean_pressure, data = tpc)


qqnorm(tpc_lm$residuals)
qqline(tpc_lm$residuals)

hist(tpc_lm$residuals, breaks = 100)

car::vif(tpc_lm)
##       equator    mean_ozone    hemisphere         cloud mean_pressure 
##      6.944968      6.860725      1.256799      1.118724      1.102558
tpc_lm <- lm(per_chng_temp ~ equator+hemisphere+cloud+mean_pressure, data = tpc)


qqnorm(tpc_lm$residuals)
qqline(tpc_lm$residuals)

hist(tpc_lm$residuals, breaks = 100)

car::vif(tpc_lm)
##       equator    hemisphere         cloud mean_pressure 
##      1.307801      1.211483      1.111954      1.072062

Theoretical Inference

After summarizing this model we see that all variables are statistically signficiant so we will leave them in our final model.

summary(tpc_lm)
## 
## Call:
## lm(formula = per_chng_temp ~ equator + hemisphere + cloud + mean_pressure, 
##     data = tpc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8350 -0.2132 -0.0554  0.1556  3.6169 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.1946749  0.5576723   9.315  < 2e-16 ***
## equator            -0.0041958  0.0024338  -1.724   0.0853 .  
## hemisphereSouthern  0.3371065  0.0487391   6.917 1.24e-11 ***
## cloud               2.6021384  0.2329271  11.171  < 2e-16 ***
## mean_pressure      -0.0057896  0.0005333 -10.857  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5145 on 571 degrees of freedom
## Multiple R-squared:  0.3841, Adjusted R-squared:  0.3797 
## F-statistic: 89.01 on 4 and 571 DF,  p-value: < 2.2e-16

Conclusion

What we are left with is a multiple regression formula for the percent increase in temperature over a 5 year period.

\[ \begin{aligned} \widehat{score} &= \hat{\beta}_0 + \hat{\beta}_1 \times equator + \hat{\beta}_2 \times hemisphereSouthern + \hat{\beta}_3 \times cloud +\hat{\beta}_4 \times meanpressure \\ &= 5.195 + -.004 \times equator + .337 \times hemisphereSouthern + 2.602 \times cloud +-.006 \times meanpressure \end{aligned} \]

Our limited model state that frequent and high cloud coverage and southern hemisphere locations are the most influential factors on increase in average annual temperature.However, this model only produced an Adjusted R-squred of 0.3797. This means that only 38% of the variance in percent change in air temperature can be explained by the variables in this model.

In order to expand on this model and research we would need to look into much more diverse variables than those presented in this dataset. Since cloud height and coverage seems to have the greatest impact, perhaps we would want to look into factors that could cause high cloud coverage as potential underlying causes of the percent change.

Additionally, limitations due to the scope of these datasets being exclusively of the Americas may cuase the latitudinal based factors (equator and hemisphere) as ungeneralizable due to unseen factors. While this may be a good start to getting an understanding of what causes air temperature to increase based on coordinates it must be expanded on significantly to generalize the findings to the rest of the world.