There are many geographical factors that determine a locations average annual temperature. Latitude, elevation, proximity to the ocean and even a locations population density all play a role in temperature regulation. The purpose of this project is to visualize and perform a full regression analysis on these relationships. The region that will be explored is the eastern united states.

The full project can be accessed on github here.

require(readxl)
require(dplyr)
require(ggplot2)
require(sf)
require(olsrr)
require(rnaturalearth)
require(rnaturalearthdata)
require(plotly)
require(car)

1-Data Preperation

1.1 Load Data

Geographical Data was collected across 55 locations in the eastern united states. The data was collected from usclimatedata.com, Google Earth and WolframAlpha. The full data dictionary can be found here. A glimpse of the data is found below.

dirdata <- "/Users/endererimhan/Documents/Data Sets/"
climate_raw <- read_xlsx(paste0(dirdata, "Climate.XLSX"))
climate_raw <- as.data.frame(climate_raw)
glimpse(climate_raw)
## Rows: 55
## Columns: 8
## $ Latitude           <chr> "46°52′11″N", "45°39′32″N", "44°17′03″N", "44°15′39…
## $ Longtitude         <chr> "68°00′09″W", "68°42′29″W", "75°59′10″W", "72°34′24…
## $ City               <chr> "Caribou", "Millinocket", "Lake Placid", "Montpelie…
## $ State              <chr> "ME", "ME", "NY", "VE", "ME", "NY", "NY", "ME", "MA…
## $ Temp               <dbl> 39.65, 41.45, 42.15, 43.20, 45.65, 40.60, 48.40, 46…
## $ Elevation          <dbl> 449, 371, 1801, 525, 125, 1706, 505, 62, 1001, 482,…
## $ Nearest.Sea        <dbl> 95, 76, 182, 120, 18, 174, 227, 0, 83, 39, 0, 127, …
## $ Population.Density <dbl> 96.9, 811.0, 1761.0, 744.0, 337.0, 5.0, 5806.0, 313…

1.2 Preprocess Data

Before any analysis or visualization is done, we need to define some new variables. The first thing to notice is that the Latitude/Longitude data is in Degree Minute Second format. We want to transform this to decimal format to do clean analysis on. We will also transform Nearest.Sea into a Boolean variable called Proximity. If the location is within 30 miles of an ocean, we encode Proximity to be 1. Otherwise Proximity will be 0. The first 5 rows of our transformed data is found below.

climate_train <- climate_raw

climate_train <- climate_train %>% 
  mutate(Lat_deg = as.integer(substr(Latitude, start = 1, stop = 2)),
         Lat_min = as.integer(substr(Latitude, start = 4, stop = 5)),
         Lat_sec = as.integer(substr(Latitude, start = 7, stop = 8)),
         Lon_deg = as.integer(substr(Longtitude, start = 1, stop = 2)),
         Lon_min = as.integer(substr(Longtitude, start = 4, stop = 5)),
         Lon_sec = as.integer(substr(Longtitude, start = 7, stop =8)))

climate_train <- climate_train %>% 
  mutate(Latitude = Lat_deg + (Lat_min / 60) + (Lat_sec / 3600),
            Longtitude = -(Lon_deg + (Lon_min / 60) + (Lon_sec / 3600)))

climate_train <- climate_train %>% 
  mutate(Proximity = ifelse(Nearest.Sea > 30, 0, 1))

climate_train <- climate_train[, c(1:8,15)] 

glimpse(climate_train)
## Rows: 55
## Columns: 9
## $ Latitude           <dbl> 46.86972, 45.65889, 44.28417, 44.26083, 44.33306, 4…
## $ Longtitude         <dbl> -68.00250, -68.70806, -75.98611, -72.57333, -69.785…
## $ City               <chr> "Caribou", "Millinocket", "Lake Placid", "Montpelie…
## $ State              <chr> "ME", "ME", "NY", "VE", "ME", "NY", "NY", "ME", "MA…
## $ Temp               <dbl> 39.65, 41.45, 42.15, 43.20, 45.65, 40.60, 48.40, 46…
## $ Elevation          <dbl> 449, 371, 1801, 525, 125, 1706, 505, 62, 1001, 482,…
## $ Nearest.Sea        <dbl> 95, 76, 182, 120, 18, 174, 227, 0, 83, 39, 0, 127, …
## $ Population.Density <dbl> 96.9, 811.0, 1761.0, 744.0, 337.0, 5.0, 5806.0, 313…
## $ Proximity          <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, …

2 - Data Visualization

Below is a interactive map of the eastern united states with each point representing a location in our data.

world <- ne_countries(scale = "medium", returnclass = "sf")
map <- ggplot(data = world) +
  geom_sf() + 
  coord_sf(xlim = c(-87.5, -60.5), ylim = c(25.3, 47.5), expand = FALSE)

climate_map <- map + 
  geom_point(data = climate_train, aes(x = Longtitude, y = Latitude, color = Temp, group = City), size = 2.5) + 
  scale_color_gradient(low = "blue", high = "red")

ggplotly(climate_map)

Clearly the more south a location is, the warmer the average temperature of that location will be. Latitude will be used in our regression model.

Lets see how a locations elevation, latitude and average annual temperature interact with each other with the interactive plot below.

climate_scatter <- ggplot(data = climate_train, aes(x = log(Elevation), y = Latitude, color = Temp)) + 
  geom_point(aes(group = City )) + 
  scale_color_gradient(low = "blue", high = "red") + 
  xlab("Elevation in log scale") + ggtitle("Average Temperature vs Elevation and Latitude ")

ggplotly(climate_scatter, tooltip = c("group", "Latitude", "log(Elevation)", "Temp"))

When latitude remains fixed, average temperatures tend to decrease the higher the elevation. When elevation remains fixed, average temperatures tend to increase the more south a location is.

Lets see how average temperature and proximity to the ocean relate to each other.

climate_train %>% 
  mutate(Proximity = ifelse(Proximity == 0, "Far", "Close")) %>% 
  ggplot(aes(x = Proximity, y = Temp)) +
  geom_boxplot() +
  ggtitle("Average Temperature vs Proximity to Ocean") + 
  theme_classic()

Clearly there is a difference in average annual temperatures between locations close and far to the ocean. However this is misleading since we include locations in Florida in our data. All locations in Florida are close to the ocean and they are all our most southern locations. Thus latitude is a confounding variable to both proximity and average annual temperature.

In climatology it is known that the ocean plays a role in temperature regulation. There are less extremes in the temperature for coastal locations. We will put proximity into our regression model to see if it plays a role in changing the average annual temperature as well.

Lets see how population density plays a role in a locations average annual temperature.

pop_scatter <- ggplot(data = climate_train, aes(x = Population.Density, y = Temp)) + 
  geom_point(aes(group = City)) + 
  xlab("Population density") + ylab("Temperature") + ggtitle("Average Temperature vs Population Density")

ggplotly(pop_scatter, tooltip = c("group", "x", "y"))

It seems there is a slight nonlinear trend upward however statistically it is hard to tell what the relationship is. We need to take the other variables into account.

The more people per square mile a location has, the more the location becomes city-like. A city suffers from a phenomena known as “urban heat”. Congestion, concrete and tall buildings all play a role in keeping a city warmer than its surrounding location. Thus theoretically, population density should increase a locations average annual temperature. Lets see what role this variable has to play in our regression model.

Finally lets see the correlations among all our variables.

climate <- climate_train[,c(5, 1, 6, 8, 9)]
round(cor(climate), 2)
##                     Temp Latitude Elevation Population.Density Proximity
## Temp                1.00    -0.98     -0.47              -0.06      0.28
## Latitude           -0.98     1.00      0.34               0.16     -0.19
## Elevation          -0.47     0.34      1.00              -0.25     -0.58
## Population.Density -0.06     0.16     -0.25               1.00      0.34
## Proximity           0.28    -0.19     -0.58               0.34      1.00

Latitude is highly correlated with average annual temperature. Elevation has moderate correlation with our response variable as well. There doesn’t seem to be significant correlations between our predictor variables which is great news for regression modeling. It should be noted that proximity and elevation are moderately correlated. This makes sense since coastal locations are at sea level. In the case of our regression model, elevation may capture a chunk of the information proximity provides and the model may disregard proximity.

Now lets do some regression analysis.

3 - Building a regression model

Using the method of least squares we build a regression model with all our variables included. Lets check the 4 assumptions of linear regression on our model. All 4 assumptions need to be validated to continue statistical analysis.

full_model <- lm(Temp~., data = climate)
par(mfrow = c(2, 2))
plot(full_model)

  1. The residuals vs fitted plot shows that there is a linear relationship between the response and explanatory variables.

  2. The cross-sectional nature of data guarantees that the independence assumption is met.

  3. The scale-location plot plot shows the constant variance assumption is validated.

  4. The Q-Q plot above shows that the residuals follow a normal distribution.

Since the assumptions are all met we can continue with our analysis.

summary(full_model)
## 
## Call:
## lm(formula = Temp ~ ., data = climate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.99052 -0.96475 -0.00509  0.91126  3.05867 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.193e+02  1.108e+00 107.612  < 2e-16 ***
## Latitude           -1.651e+00  3.123e-02 -52.866  < 2e-16 ***
## Elevation          -2.560e-03  3.980e-04  -6.432 4.64e-08 ***
## Population.Density  1.269e-04  3.696e-05   3.433  0.00121 ** 
## Proximity           1.414e-01  4.051e-01   0.349  0.72855    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.127 on 50 degrees of freedom
## Multiple R-squared:  0.9869, Adjusted R-squared:  0.9859 
## F-statistic: 943.7 on 4 and 50 DF,  p-value: < 2.2e-16

Lets check if our model is significant. From the F-statistic above and its corresponding p-value we can conclude with 95% confidence that our model is significant.

Looking at the adjusted r-squared statistic, since it is close to 1, we conclude that the data fits our model very well.

Our coefficients are all significant except for proximity. This validates our theory above that proximity may be disregarded by our model. There are two explanations for why this may be the case:

  1. Since Proximity to the ocean and elevation are moderately correlated, elevation may capture most of the information that proximity has to offer.

  2. Coastal locations generally have more moderate temperatures than its inland counterparts due to the oceans ability to regulate temperatures. Statistically, the average annual temperature may not change much but there will be less extremes in day to day observations.

Before we disregard proximity from our model lets use a popular variable selection method to see if there is indeed a better model than the one above. We will use all possible regressions to compare how the model fit changes when we add or remove variables.

ols_step_all_possible(full_model)
##    Index N                                      Predictors    R-Square
## 1      1 1                                        Latitude 0.960942853
## 2      2 1                                       Elevation 0.220841048
## 4      3 1                                       Proximity 0.078509699
## 3      4 1                              Population.Density 0.003330396
## 5      5 2                              Latitude Elevation 0.983410755
## 6      6 2                     Latitude Population.Density 0.971827526
## 7      7 2                              Latitude Proximity 0.971050147
## 8      8 2                    Elevation Population.Density 0.252734121
## 9      9 2                             Elevation Proximity 0.220976985
## 10    10 2                    Population.Density Proximity 0.105279670
## 11    11 3           Latitude Elevation Population.Density 0.986895309
## 12    12 3                    Latitude Elevation Proximity 0.983844969
## 13    13 3           Latitude Population.Density Proximity 0.976109272
## 14    14 3          Elevation Population.Density Proximity 0.256206678
## 15    15 4 Latitude Elevation Population.Density Proximity 0.986927156
##    Adj. R-Square Mallow's Cp
## 1     0.96020593   98.382741
## 2     0.20613994 2929.066649
## 4     0.06112309 3473.444538
## 3    -0.01547469 3760.984497
## 5     0.98277271   14.449254
## 6     0.97074397   58.751889
## 7     0.96993669   61.725148
## 8     0.22399313 2809.084499
## 9     0.19101456 2930.546726
## 10    0.07086735 3373.056834
## 11    0.98612444    3.121805
## 12    0.98289467   14.788508
## 13    0.97470394   44.375400
## 14    0.21245413 2797.802933
## 15    0.98588133    5.000000

The table shows that Model 11 and Model 15, the model from above, both have the highest adjusted R-square. Lets turn to a different statistic to assess fit.

Mallow’s CP is desired to be as small as possible. Since Model 11 has the smaller Mallow’s CP of the two, Model 11 will be our new model. Lets call model 11 the reduced model. The reduced model will exclude proximity and only included Elevation, Latitude and Population Density as explanatory variables.

4 - Regression Analysis on Reduced Model

Lets check the 4 assumptions of linear regression on our reduced model.

reduced_model <- lm(Temp ~ Latitude + Elevation + Population.Density, data = climate)
par(mfrow = c(2, 2))
plot(reduced_model)

  1. The residuals vs fitted plot shows that there is a linear relationship between the response and explanatory variables.

  2. The cross-sectional nature of data guarantees that the independence assumption is met.

  3. The scale-location plot plot shows the constant variance assumption is validated.

  4. The Q-Q plot above shows that the residuals follow a normal distribution.

Since the assumptions are all met we can continue with our analysis.

summary(reduced_model)
## 
## Call:
## lm(formula = Temp ~ Latitude + Elevation + Population.Density, 
##     data = climate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.93555 -1.00440  0.00483  0.92129  3.11198 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.194e+02  1.067e+00 111.811  < 2e-16 ***
## Latitude           -1.652e+00  3.090e-02 -53.452  < 2e-16 ***
## Elevation          -2.629e-03  3.433e-04  -7.658    5e-10 ***
## Population.Density  1.303e-04  3.538e-05   3.683 0.000558 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.118 on 51 degrees of freedom
## Multiple R-squared:  0.9869, Adjusted R-squared:  0.9861 
## F-statistic:  1280 on 3 and 51 DF,  p-value: < 2.2e-16

From the F-statistic above and its corresponding p-value we can conclude with 95% confidence that our model is significant.

Looking at the adjusted r-squared statistic, since it is close to 1, we conclude that the data fits our model very well.

Our coefficients are all significant. The next step in the regression modeling process is to check to see if any remedial measures should be taken. First lets check to see if our model suffers from multi-collinearity. This is a fancy term for explanatory variables being highly correlated with each other.

A Variance Inflation Factor (VIF) measures multi-collinearity. If any VIF associated with each explanatory variable is bigger than 10, then remedial measures need to be taken.

vif(reduced_model)
##           Latitude          Elevation Population.Density 
##           1.215012           1.258275           1.147594

None of our VIFs are bigger than 10 thus we can continue with the model validation process. Lets check to see of their are any outliers in our model. Cooks distance is good at measuring if a data point is an outlier. Lets take a look at the chart below.

ols_plot_cooksd_bar(reduced_model)

Observation 6, 17, 20, 24 and 55 are all outliers. Lets check to see if these outliers influence our model using a statistic called DFFITS

ols_plot_dffits(reduced_model)

All our outliers seem to influence the model greatly. Lets take a look at each point to see why they may be outliers.

climate_train[c(6, 17, 20, 24, 55), c(-2, -7)]
##    Latitude        City State  Temp Elevation Population.Density Proximity
## 6  43.79083 Indian Lake    NY 40.60      1706                  5         0
## 17 40.76361    New York    NY 55.15        33              28491         1
## 20 39.29806   Baltimore    MD 58.45        33               7556         1
## 24 37.23806  Blacksburg    VA 51.50      2077               2240         0
## 55 25.76250       Miami    FL 77.05         7              12917         1
  1. Observation 6 is Indian Lake, NY. This location does not fit the pattern of its near neighbors. For example, Indian Lake is 3.5 degrees cooler than Lake Placid, NY even though it is one degree more south and 95 feet lower in elevation.

  2. Observation 17 is New York City. New York City has a population density twice that of the second largest observation.

  3. Observation 20, Baltimore, MD, does not fit the pattern of its near neighbors either. The location is very close to Washington DC but 2.75 degrees warmer but 1 degree higher in latitude. Baltimore is also significantly warmer than all locations 1 degree south of Baltimore.

  4. Observation 24 is Blacksburg, VA. It is the highest elevation in our data set with the second being Lake Placid, NY which is 200 feet lower in elevation. Blacksburg is also several degrees more south then Lake Placid.

  5. Observation 55 is Miami FL. Miami is the only observation with a unique latitude that is 2 latitude levels below the 2nd most southern location. Compared to Tampa, the 2nd warmest location in the data set, Miami is 3.7 degrees warmer.

These observations will be kept in the model. None of these points were observed in error and they are part of the population that our data was sampled from. The influential points are valid observations and are thus kept in the model.

We have concluded the diagnostics portion of regression modeling. Now lets interpret these results.

5 - Conclusion

Lets take a look at the linear equation the regression model outputted and dissect its meaning.

reduced_model$coefficients
##        (Intercept)           Latitude          Elevation Population.Density 
##       1.193535e+02      -1.651815e+00      -2.628519e-03       1.302771e-04

There is a significant linear relationship between a location’s annual temperature and its explanatory variables: Elevation, latitude and population density. Lets interpret the output above keeping note that this only applies locations in the eastern united states.

Keeping everything else constant, for every 100 feet increase in elevation, it is expected to see on average a 0.26 degree decrease in average annual temperature.

Keeping everything else constant, for every 1 degree decrease in latitude, it is expected to see on average a 1.65 degree increase in average annual temperature.

Finally, keeping everything else constant, for every 1000 people/mi^2 increase in population density, it is expected to see on average a 0.13 degree increase in annual temperature.