Climate Analysis of Indianapolis, IN and Asheville, NC

Aaron Hurley

2023-12-11

Business Understanding

The mission of the National Centers for Environmental Information (NCEI) is to provide “environmental data, products, and services covering the depths of the ocean to the surface of the sun to drive resilience, prosperity and equity for current and future generations.” NCEI is considered the climatology center for the United States. It aggregates weather, satellite, and other sensor observations from around the world and stores them in an archive. Its archive of historical record is the Global Historical Climatology Network (GHCN). Through the GHCN, individuals and organizations can request data from more than 100,000 stations in 180 countries, across time periods ranging from as short as a single day to more than 175 years.

NCEI also analyzes the data available through the GHCN. It provides climatological analysis of the records through drought and precipitation products, wildfire products, snow and ice products, temperature, storm, and wind products, as well as climate normals and other daily-to-monthly climatological products. These products are also available for download by the public. Recently, the National Oceanic and Atmospheric Administration (NOAA) – the parent organization of NCEI – decided to begin transitioning to the Cloud. One reason for the transition is to explore cloud computing, relieving NOAA of some of the burden of running the preponderance of weather and climate models in the United States. The data procured from NCEI includes a total of 114 columns for two sites within the United States.

Sites to Analyze

These sites include:

Indianapolis, IN (file KIND.csv)

Asheville, NC – Asheville Regional Airport (file KAVL.csv)

The original concept was to work on six disparate sites across the US; however, the complexity of the data proved difficult for the scope of the project, so the two sites above were identified as having high enough interest with viable data.

R is a powerful statistical and computational tool. The premise of this project is to determine whether R can analyze historical weather data, calculate each selected location's variance from the accepted historical climatology, and produce a short-range climate forecast for each location that can be validated against the accepted climate models' output for those locations.

Data Understanding

The following is a small subset of what the data looks like. The complete structure follows.

ashvlNC <- read.csv("C:/Users/elder/OneDrive/Documents/MSDA 621/ClimateProject/data/KAVL.csv")
indiaIN <- read.csv("C:/Users/elder/OneDrive/Documents/MSDA 621/ClimateProject/data/KIND.csv")
head(ashvlNC)[,c(2,3,4,6,7,27)]
##                                NAME LATITUDE LONGITUDE     DATE JULIAN_DATE
## 1 ASHEVILLE REGIONAL AIRPORT, NC US 35.43178 -82.53787 1/1/1950           1
## 2 ASHEVILLE REGIONAL AIRPORT, NC US 35.43178 -82.53787 1/2/1950           2
## 3 ASHEVILLE REGIONAL AIRPORT, NC US 35.43178 -82.53787 1/3/1950           3
## 4 ASHEVILLE REGIONAL AIRPORT, NC US 35.43178 -82.53787 1/4/1950           4
## 5 ASHEVILLE REGIONAL AIRPORT, NC US 35.43178 -82.53787 1/5/1950           5
## 6 ASHEVILLE REGIONAL AIRPORT, NC US 35.43178 -82.53787 1/6/1950           6
##   PRCP
## 1 0.00
## 2 0.00
## 3 0.01
## 4 0.02
## 5 0.00
## 6 0.28

Data Preparation

There are 50,311 rows of data across the two sites, beginning 1 Jan 1950 and running through 15 Sep 2023. Note that Asheville is missing several years of data (1/1/1955 - 8/31/1964). This should not impact the analysis, since we are primarily looking at weather conditions by Julian date.
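As an aside, the GHCN extract already carries a JULIAN_DATE column; if it did not, the day of year could be derived from DATE. A quick sketch in base R:

# Day of year from a parsed date (POSIXlt's yday is zero-based, hence + 1)
d <- as.Date("1/6/1950", "%m/%d/%Y")
as.POSIXlt(d)$yday + 1   # 6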

The original purpose of this data is to hold daily climatological data: the daily maximum and minimum temperatures, daily precipitation amount, snow amount, cloud cover, and solar radiation. There are roughly 118 variables in total; “roughly” because it depends on whether you count the station name, ID, latitude, longitude, and elevation as variables.

To understand the datasets, we can look at their structures. Due to their size, rather than looking at both, we will look at just one; Asheville, NC is a good example of the complicated structure of each location's dataset:

str(ashvlNC)
## 'data.frame':    23390 obs. of  120 variables:
##  $ STATION        : chr  "USW00003812" "USW00003812" "USW00003812" "USW00003812" ...
##  $ NAME           : chr  "ASHEVILLE REGIONAL AIRPORT, NC US" "ASHEVILLE REGIONAL AIRPORT, NC US" "ASHEVILLE REGIONAL AIRPORT, NC US" "ASHEVILLE REGIONAL AIRPORT, NC US" ...
##  $ LATITUDE       : num  35.4 35.4 35.4 35.4 35.4 ...
##  $ LONGITUDE      : num  -82.5 -82.5 -82.5 -82.5 -82.5 ...
##  $ ELEVATION      : num  646 646 646 646 646 ...
##  $ DATE           : chr  "1/1/1950" "1/2/1950" "1/3/1950" "1/4/1950" ...
##  $ JULIAN_DATE    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ ACMC           : logi  NA NA NA NA NA NA ...
##  $ ACMC_ATTRIBUTES: logi  NA NA NA NA NA NA ...
##  $ ACMH           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ ACMH_ATTRIBUTES: chr  "" "" "" "" ...
##  $ ACSC           : logi  NA NA NA NA NA NA ...
##  $ ACSC_ATTRIBUTES: logi  NA NA NA NA NA NA ...
##  $ ACSH           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ ACSH_ATTRIBUTES: chr  "" "" "" "" ...
##  $ AWND           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ AWND_ATTRIBUTES: chr  "" "" "" "" ...
##  $ FMTM           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ FMTM_ATTRIBUTES: chr  "" "" "" "" ...
##  $ FRGT           : logi  NA NA NA NA NA NA ...
##  $ FRGT_ATTRIBUTES: logi  NA NA NA NA NA NA ...
##  $ GAHT           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ GAHT_ATTRIBUTES: chr  "" "" "" "" ...
##  $ PGTM           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ PGTM_ATTRIBUTES: chr  "" "" "" "" ...
##  $ precipDelta    : num  -0.0745 -0.1228 -0.1363 -0.1083 -0.0455 ...
##  $ PRCP           : num  0 0 0.01 0.02 0 0.28 0.01 0 0 0.22 ...
##  $ PRCP_ATTRIBUTES: chr  ",,X," "T,,X," ",,X," ",,X," ...
##  $ PSUN           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ PSUN_ATTRIBUTES: chr  "" "" "" "" ...
##  $ snowDelta      : num  -0.03906 -0.02969 -0.06406 -0.04844 -0.00469 ...
##  $ SNOW           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SNOW_ATTRIBUTES: chr  ",,X," ",,X," ",,X," ",,X," ...
##  $ SNWD           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SNWD_ATTRIBUTES: chr  ",,X," ",,X," ",,X," ",,X," ...
##  $ tavgDelta      : num  -42.4 -41.7 -41.1 -40.4 -37.1 ...
##  $ TAVG           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ TAVG_ATTRIBUTES: chr  "" "" "" "" ...
##  $ tmaxDelta      : num  3.45 9.58 10.77 13.39 20.42 ...
##  $ TMAX           : int  54 59 59 61 68 62 47 52 54 60 ...
##  $ TMAX_ATTRIBUTES: chr  ",,X" ",,X" ",,X" ",,X" ...
##  $ tminDelta      : num  -5.41 8.17 18.06 28.11 31.94 ...
##  $ TMIN           : int  25 38 48 56 59 47 32 23 23 49 ...
##  $ TMIN_ATTRIBUTES: chr  ",,X" ",,X" ",,X" ",,X" ...
##  $ TOBS           : logi  NA NA NA NA NA NA ...
##  $ TOBS_ATTRIBUTES: logi  NA NA NA NA NA NA ...
##  $ TSUN           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ TSUN_ATTRIBUTES: chr  "" "" "" "" ...
##  $ WDF1           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ WDF1_ATTRIBUTES: chr  "" "" "" "" ...
##  $ WDF2           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ WDF2_ATTRIBUTES: chr  "" "" "" "" ...
##  $ WDF5           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ WDF5_ATTRIBUTES: chr  "" "" "" "" ...
##  $ WDFG           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ WDFG_ATTRIBUTES: chr  "" "" "" "" ...
##  $ WDFM           : logi  NA NA NA NA NA NA ...
##  $ WDFM_ATTRIBUTES: logi  NA NA NA NA NA NA ...
##  $ WESD           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ WESD_ATTRIBUTES: chr  "" "" "" "" ...
##  $ WSF1           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ WSF1_ATTRIBUTES: chr  "" "" "" "" ...
##  $ WSF2           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ WSF2_ATTRIBUTES: chr  "" "" "" "" ...
##  $ WSF5           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ WSF5_ATTRIBUTES: chr  "" "" "" "" ...
##  $ WSFG           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ WSFG_ATTRIBUTES: chr  "" "" "" "" ...
##  $ WSFM           : logi  NA NA NA NA NA NA ...
##  $ WSFM_ATTRIBUTES: logi  NA NA NA NA NA NA ...
##  $ WT01           : int  1 NA NA 1 NA 1 NA NA NA 1 ...
##  $ WT01_ATTRIBUTES: chr  ",,X" "" "" ",,X" ...
##  $ WT02           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ WT02_ATTRIBUTES: chr  "" "" "" "" ...
##  $ WT03           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ WT03_ATTRIBUTES: chr  "" "" "" "" ...
##  $ WT04           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ WT04_ATTRIBUTES: chr  "" "" "" "" ...
##  $ WT05           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ WT05_ATTRIBUTES: chr  "" "" "" "" ...
##  $ WT06           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ WT06_ATTRIBUTES: chr  "" "" "" "" ...
##  $ WT07           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ WT07_ATTRIBUTES: chr  "" "" "" "" ...
##  $ WT08           : int  1 NA NA NA NA 1 NA NA NA 1 ...
##  $ WT08_ATTRIBUTES: chr  ",,X" "" "" "" ...
##  $ WT09           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ WT09_ATTRIBUTES: chr  "" "" "" "" ...
##  $ WT10           : logi  NA NA NA NA NA NA ...
##  $ WT10_ATTRIBUTES: logi  NA NA NA NA NA NA ...
##  $ WT11           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ WT11_ATTRIBUTES: chr  "" "" "" "" ...
##  $ WT12           : logi  NA NA NA NA NA NA ...
##  $ WT12_ATTRIBUTES: logi  NA NA NA NA NA NA ...
##  $ WT13           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ WT13_ATTRIBUTES: chr  "" "" "" "" ...
##  $ WT14           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ WT14_ATTRIBUTES: chr  "" "" "" "" ...
##  $ WT15           : int  NA NA NA NA NA NA NA NA NA NA ...
##   [list output truncated]

Null Values

Additionally, there are many null values.

colSums(is.na(ashvlNC))
##         STATION            NAME        LATITUDE       LONGITUDE       ELEVATION 
##               0               0               0               0               0 
##            DATE     JULIAN_DATE            ACMC ACMC_ATTRIBUTES            ACMH 
##               0               0           23390           23390           11932 
## ACMH_ATTRIBUTES            ACSC ACSC_ATTRIBUTES            ACSH ACSH_ATTRIBUTES 
##               0           23390           23390           11931               0 
##            AWND AWND_ATTRIBUTES            FMTM FMTM_ATTRIBUTES            FRGT 
##            8917               0           13243               0           23390 
## FRGT_ATTRIBUTES            GAHT GAHT_ATTRIBUTES            PGTM PGTM_ATTRIBUTES 
##           23390           23270               0           10674               0 
##     precipDelta            PRCP PRCP_ATTRIBUTES            PSUN PSUN_ATTRIBUTES 
##               0               0               0           16451               0 
##       snowDelta            SNOW SNOW_ATTRIBUTES            SNWD SNWD_ATTRIBUTES 
##               0               1               0               1               0 
##       tavgDelta            TAVG TAVG_ATTRIBUTES       tmaxDelta            TMAX 
##               0           16893               0               0               0 
## TMAX_ATTRIBUTES       tminDelta            TMIN TMIN_ATTRIBUTES            TOBS 
##               0               0               0               0           23390 
## TOBS_ATTRIBUTES            TSUN TSUN_ATTRIBUTES            WDF1 WDF1_ATTRIBUTES 
##           23390           10614               0           11931               0 
##            WDF2 WDF2_ATTRIBUTES            WDF5 WDF5_ATTRIBUTES            WDFG 
##           13423               0           13435               0           15473 
## WDFG_ATTRIBUTES            WDFM WDFM_ATTRIBUTES            WESD WESD_ATTRIBUTES 
##               0           23390           23390           23173               0 
##            WSF1 WSF1_ATTRIBUTES            WSF2 WSF2_ATTRIBUTES            WSF5 
##           11931               0           13423               0           13435 
## WSF5_ATTRIBUTES            WSFG WSFG_ATTRIBUTES            WSFM WSFM_ATTRIBUTES 
##               0           15466               0           23390           23390 
##            WT01 WT01_ATTRIBUTES            WT02 WT02_ATTRIBUTES            WT03 
##           11833               0           19601               0           20752 
## WT03_ATTRIBUTES            WT04 WT04_ATTRIBUTES            WT05 WT05_ATTRIBUTES 
##               0           23206               0           23349               0 
##            WT06 WT06_ATTRIBUTES            WT07 WT07_ATTRIBUTES            WT08 
##           23231               0           23389               0           18639 
## WT08_ATTRIBUTES            WT09 WT09_ATTRIBUTES            WT10 WT10_ATTRIBUTES 
##               0           23346               0           23390           23390 
##            WT11 WT11_ATTRIBUTES            WT12 WT12_ATTRIBUTES            WT13 
##           23382               0           23390           23390           19698 
## WT13_ATTRIBUTES            WT14 WT14_ATTRIBUTES            WT15 WT15_ATTRIBUTES 
##               0           22933               0           23370               0 
##            WT16 WT16_ATTRIBUTES            WT17 WT17_ATTRIBUTES            WT18 
##           15121               0           23286               0           22181 
## WT18_ATTRIBUTES            WT19 WT19_ATTRIBUTES            WT21 WT21_ATTRIBUTES 
##               0           23262               0           23259               0 
##            WT22 WT22_ATTRIBUTES            WV01 WV01_ATTRIBUTES            WV03 
##           23305               0           23390           23390           23322 
## WV03_ATTRIBUTES            WV07 WV07_ATTRIBUTES            WV20 WV20_ATTRIBUTES 
##               0           23390           23390           23390           23390

Desired Columns to Keep

To simplify the data for each location, we can remove all of the columns except the following: NAME, DATE, JULIAN_DATE, PRCP, TMAX, and TMIN. These give us precipitation (snow and rain) and temperature data for each day of the year from 1/1/1950 to 9/15/2023.

Notes: TAVG (daily average temperature) was considered for analysis; however, it had far too few values to provide a useful dataset, so it was removed. SNOW was also strongly considered, but it, too, had too few rows of data for accurate predictions.

Revised Dataset

The revised dataset looks like this:

columnsToKeep <- c(2,6,7,27,40,43)
ashvlNCrevised <- ashvlNC[,columnsToKeep]
indiaINrevised <- indiaIN[,columnsToKeep]
str(ashvlNCrevised)
## 'data.frame':    23390 obs. of  6 variables:
##  $ NAME       : chr  "ASHEVILLE REGIONAL AIRPORT, NC US" "ASHEVILLE REGIONAL AIRPORT, NC US" "ASHEVILLE REGIONAL AIRPORT, NC US" "ASHEVILLE REGIONAL AIRPORT, NC US" ...
##  $ DATE       : chr  "1/1/1950" "1/2/1950" "1/3/1950" "1/4/1950" ...
##  $ JULIAN_DATE: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ PRCP       : num  0 0 0.01 0.02 0 0.28 0.01 0 0 0.22 ...
##  $ TMAX       : int  54 59 59 61 68 62 47 52 54 60 ...
##  $ TMIN       : int  25 38 48 56 59 47 32 23 23 49 ...
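As an aside, selecting by column name rather than position would guard against column-order changes in future GHCN extracts; a minimal equivalent in base R:

keepNames <- c("NAME", "DATE", "JULIAN_DATE", "PRCP", "TMAX", "TMIN")
ashvlNCrevised <- ashvlNC[, keepNames]
indiaINrevised <- indiaIN[, keepNames]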

Revised Null Values

Now, there are zero null values.

colSums(is.na(ashvlNCrevised))
##        NAME        DATE JULIAN_DATE        PRCP        TMAX        TMIN 
##           0           0           0           0           0           0
colSums(is.na(indiaINrevised))
##        NAME        DATE JULIAN_DATE        PRCP        TMAX        TMIN 
##           0           0           0           0           0           0

Caveats

The initial scope of the project contained six (6) locations, several of which had many null values. After simplifying the scope, there were no null values to worry about. However, the following is the approach that was used to handle null values within the greater project.

We will not remove rows from the dataset, because the climatological estimate depends on the longevity of the record. By interpolating the missing data, we can treat a discrete dataset as continuous, which allows us to analyze it in different ways.
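A minimal sketch of that interpolation approach in base R, assuming a generic numeric column x (interp_na is a hypothetical helper, not part of the original analysis):

# Linearly interpolate interior NAs against the row index
interp_na <- function(x) {
  idx <- seq_along(x)
  ok  <- !is.na(x)
  filled <- approx(idx[ok], x[ok], xout = idx)$y
  ifelse(ok, x, filled)
}
# e.g., ashvlNCrevised$TMAX <- interp_na(ashvlNCrevised$TMAX)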

Data Augmentation

In order to make the desired changes to the dataset, we will run a for() loop to compute the following new variables for each site: precipAvg, tmaxAvg, and tminAvg.

The “Avg” for each variable is calculated by the Julian day of the data and kept in a separate data frame, which is then used to calculate each variable's daily delta from the norm.

library(dplyr)  # provides filter()

AVLprecipAvg <- c()
AVLtmaxAvg <- c()
AVLtminAvg <- c()

INDprecipAvg <- c()
INDtmaxAvg <- c()
INDtminAvg <- c()

# For each Julian day, average each variable across all years on record
for(i in 1:366) {
  AVLfilteredVal <- filter(ashvlNCrevised, JULIAN_DATE == i)
  INDfilteredVal <- filter(indiaINrevised, JULIAN_DATE == i)

  AVLprecipAvg[i] <- mean(AVLfilteredVal$PRCP, na.rm = TRUE)
  AVLtmaxAvg[i] <- mean(AVLfilteredVal$TMAX, na.rm = TRUE)
  AVLtminAvg[i] <- mean(AVLfilteredVal$TMIN, na.rm = TRUE)
  
  INDprecipAvg[i] <- mean(INDfilteredVal$PRCP, na.rm = TRUE)
  INDtmaxAvg[i] <- mean(INDfilteredVal$TMAX, na.rm = TRUE)
  INDtminAvg[i] <- mean(INDfilteredVal$TMIN, na.rm = TRUE)
}

AVLavgDataFrame <- data.frame(AVLprecipAvg,AVLtmaxAvg,AVLtminAvg)
INDavgDataFrame <- data.frame(INDprecipAvg,INDtmaxAvg,INDtminAvg)
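For reference, the same daily normals could be computed more idiomatically with a dplyr group-and-summarize; a sketch for the Asheville frame (AVLavgDataFrame2 and the lowercase column names are illustrative):

# Equivalent to the loop above: average each variable within each Julian day
AVLavgDataFrame2 <- ashvlNCrevised %>%
  group_by(JULIAN_DATE) %>%
  summarize(precipAvg = mean(PRCP, na.rm = TRUE),
            tmaxAvg   = mean(TMAX, na.rm = TRUE),
            tminAvg   = mean(TMIN, na.rm = TRUE))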

Data Augmentation, cont

Thus, using the values procured previously, we can calculate each day's departure from normal by subtracting the average from the observed value. We will store these in the following variables for each site: date, precipDelta, tmaxDelta, and tminDelta.

AVLprecipDelta <- data.frame(matrix(ncol = 1, nrow = 0))
AVLtmaxDelta <- data.frame(matrix(ncol = 1, nrow = 0))
AVLtminDelta <- data.frame(matrix(ncol = 1, nrow = 0))
AVLdate <- data.frame(matrix(ncol = 1, nrow = 0))

INDprecipDelta <- data.frame(matrix(ncol = 1, nrow = 0))
INDtmaxDelta <- data.frame(matrix(ncol = 1, nrow = 0))
INDtminDelta <- data.frame(matrix(ncol = 1, nrow = 0))
INDdate <- data.frame(matrix(ncol = 1, nrow = 0))

for(i in 1:366) {
  # Subset each site to the current Julian day
  AVLfilter <- filter(ashvlNCrevised, JULIAN_DATE == i)
  INDfilter <- filter(indiaINrevised, JULIAN_DATE == i)

  # Accumulate dates and departures (observed value minus the daily normal)
  AVLdate <- rbind(AVLdate, data.frame(AVLfilter$DATE))
  AVLprecipDelta <- rbind(AVLprecipDelta, data.frame(AVLfilter$PRCP - AVLavgDataFrame$AVLprecipAvg[i]))
  AVLtmaxDelta <- rbind(AVLtmaxDelta, data.frame(AVLfilter$TMAX - AVLavgDataFrame$AVLtmaxAvg[i]))
  AVLtminDelta <- rbind(AVLtminDelta, data.frame(AVLfilter$TMIN - AVLavgDataFrame$AVLtminAvg[i]))
  if(i == 366) {
    # On the final pass, name the columns and assemble the departure frame
    colnames(AVLdate) <- c("DataDate")
    colnames(AVLprecipDelta) <- c("PrecipDelta")
    colnames(AVLtmaxDelta) <- c("TmaxDelta")
    colnames(AVLtminDelta) <- c("TminDelta")
    AVLdeparture <- data.frame(AVLdate, AVLprecipDelta, AVLtmaxDelta, AVLtminDelta)
  }
  
  INDdate <- rbind(INDdate, data.frame(INDfilter$DATE))
  INDprecipDelta <- rbind(INDprecipDelta, data.frame(INDfilter$PRCP - INDavgDataFrame$INDprecipAvg[i]))
  INDtmaxDelta <- rbind(INDtmaxDelta, data.frame(INDfilter$TMAX - INDavgDataFrame$INDtmaxAvg[i]))
  INDtminDelta <- rbind(INDtminDelta, data.frame(INDfilter$TMIN - INDavgDataFrame$INDtminAvg[i]))
  if(i == 366) {
    colnames(INDdate) <- c("DataDate")
    colnames(INDprecipDelta) <- c("PrecipDelta")
    colnames(INDtmaxDelta) <- c("TmaxDelta")
    colnames(INDtminDelta) <- c("TminDelta")
    INDdeparture <- data.frame(INDdate, INDprecipDelta, INDtmaxDelta, INDtminDelta)
  }
}

# Join the departures back onto the revised datasets by date
ashevilleFinalDataset <- inner_join(ashvlNCrevised, AVLdeparture, by = c("DATE" = "DataDate"))
indianapolisFinalDataset <- inner_join(indiaINrevised, INDdeparture, by = c("DATE" = "DataDate"))

# Parse the date strings and sort each dataset chronologically
ashevilleFinalDataset$DATE <- as.Date(ashevilleFinalDataset$DATE, "%m/%d/%Y")
indianapolisFinalDataset$DATE <- as.Date(indianapolisFinalDataset$DATE, "%m/%d/%Y")

ashevilleFinalDataset <- ashevilleFinalDataset[order(ashevilleFinalDataset$DATE),]
indianapolisFinalDataset <- indianapolisFinalDataset[order(indianapolisFinalDataset$DATE),]

# Shift temperatures by +273 so every value is strictly positive (see note below)
ashevilleFinalDataset$TMAX <- ashevilleFinalDataset$TMAX + 273
ashevilleFinalDataset$TMIN <- ashevilleFinalDataset$TMIN + 273
indianapolisFinalDataset$TMAX <- indianapolisFinalDataset$TMAX + 273
indianapolisFinalDataset$TMIN <- indianapolisFinalDataset$TMIN + 273

head(ashevilleFinalDataset)
##                                NAME       DATE JULIAN_DATE PRCP TMAX TMIN
## 1 ASHEVILLE REGIONAL AIRPORT, NC US 1950-01-01           1 0.00  327  298
## 2 ASHEVILLE REGIONAL AIRPORT, NC US 1950-01-02           2 0.00  332  311
## 3 ASHEVILLE REGIONAL AIRPORT, NC US 1950-01-03           3 0.01  332  321
## 4 ASHEVILLE REGIONAL AIRPORT, NC US 1950-01-04           4 0.02  334  329
## 5 ASHEVILLE REGIONAL AIRPORT, NC US 1950-01-05           5 0.00  341  332
## 6 ASHEVILLE REGIONAL AIRPORT, NC US 1950-01-06           6 0.28  335  320
##   PrecipDelta TmaxDelta TminDelta
## 1 -0.07453125  3.453125 -5.406250
## 2 -0.12281250  9.578125  8.171875
## 3 -0.13625000 10.765625 18.062500
## 4 -0.10828125 13.390625 28.109375
## 5 -0.04546875 20.421875 31.937500
## 6  0.17437500 13.890625 19.125000

Data Augmentation, cont

Note: 273 was added to the temperatures as a Kelvin-style offset. Because the raw values are in degrees Fahrenheit, this is not a true conversion to Kelvin; its purpose is simply to keep all temperatures positive so we can perform a Box-Cox transformation on the models.
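A quick sanity check that the offset leaves every value strictly positive (boxcox() requires a positive response; the minima also appear in the summaries below):

min(ashevilleFinalDataset$TMIN)     # 257
min(indianapolisFinalDataset$TMIN)  # 246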

The summary statistics for Asheville, NC (TMAX, TMIN, and PRCP)

summary(ashevilleFinalDataset$TMAX)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   281.0   329.0   343.0   340.3   353.0   374.0
summary(ashevilleFinalDataset$TMIN)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   257.0   305.0   318.0   317.8   332.0   348.0
summary(ashevilleFinalDataset$PRCP)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.131   0.060   5.290

The summary statistics for Indianapolis, IN (TMAX, TMIN, and PRCP)

summary(indianapolisFinalDataset$TMAX)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   262.0   319.0   339.0   335.6   354.0   378.0
summary(indianapolisFinalDataset$TMIN)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   246.0   303.0   317.0   316.5   333.0   354.0
summary(indianapolisFinalDataset$PRCP)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1139  0.0500  7.2000

Data Visualization

Although NCEI hosts climatology data, the National Oceanic and Atmospheric Administration (NOAA) processes and produces much of the analysis of the data. The following visualizations are the climatological averages for the respective cities we are interested in. Of note, these behave as you might expect - lower high and low temperatures in the winter and higher temperatures in the summer.

Also of note: higher latitudes generally have lower temperatures, while lower latitudes have higher temperatures; and locations near the ocean tend to have a smaller range of daily and seasonal temperatures, while inland locations show larger daily and seasonal variation.

Asheville, NC

This graph shows the average climatology for Asheville, NC.

Indianapolis, IN

This graph shows the average climatology for Indianapolis, IN.

Visualizations of Residuals

Additionally, because we computed each data point's departure from normal, we can visualize these residuals to see whether there are any trends we need to take into account when we model the data.

The following are the residuals by Julian Date:

For Asheville, North Carolina:

avlTmaxDModel <- lm(TmaxDelta ~ JULIAN_DATE,ashevilleFinalDataset)
plot(ashevilleFinalDataset$JULIAN_DATE, avlTmaxDModel$residuals, pch = 20, col = "blue")
abline(h = 0)

avlTminDModel <- lm(TminDelta ~ JULIAN_DATE,ashevilleFinalDataset)
plot(ashevilleFinalDataset$JULIAN_DATE, avlTminDModel$residuals, pch = 20, col = "blue")
abline(h = 0)

avlPrecipDModel <- lm(PrecipDelta ~ JULIAN_DATE,ashevilleFinalDataset)
plot(ashevilleFinalDataset$JULIAN_DATE, avlPrecipDModel$residuals, pch = 20, col = "blue")
abline(h = 0)

Visualizations of Residuals, cont

For Indianapolis, Indiana:

indTmaxDModel <- lm(TmaxDelta ~ JULIAN_DATE,indianapolisFinalDataset)
plot(indianapolisFinalDataset$JULIAN_DATE, indTmaxDModel$residuals, pch = 20, col = "blue")
abline(h = 0)

indTminDModel <- lm(TminDelta ~ JULIAN_DATE,indianapolisFinalDataset)
plot(indianapolisFinalDataset$JULIAN_DATE, indTminDModel$residuals, pch = 20, col = "blue")
abline(h = 0)

indPrecipDModel <- lm(PrecipDelta ~ JULIAN_DATE,indianapolisFinalDataset)
plot(indianapolisFinalDataset$JULIAN_DATE, indPrecipDModel$residuals, pch = 20, col = "blue")
abline(h = 0)

Time Series Plots

These are time series plots for each location:

For Asheville, North Carolina:

plot(ashevilleFinalDataset$DATE, ashevilleFinalDataset$TMAX, pch = 20, col = "blue")
abline(h = 0)

plot(ashevilleFinalDataset$DATE, ashevilleFinalDataset$TMIN, pch = 20, col = "blue")
abline(h = 0)

plot(ashevilleFinalDataset$DATE, ashevilleFinalDataset$PRCP, pch = 20, col = "blue")
abline(h = 0)

Time Series Plots, cont.

For Indianapolis, Indiana:

plot(indianapolisFinalDataset$DATE, indianapolisFinalDataset$TMAX, pch = 20, col = "blue")
abline(h = 0)

plot(indianapolisFinalDataset$DATE, indianapolisFinalDataset$TMIN, pch = 20, col = "blue")
abline(h = 0)

plot(indianapolisFinalDataset$DATE, indianapolisFinalDataset$PRCP, pch = 20, col = "blue")
abline(h = 0)

Time Series by Julian Date

Additionally, we can see the time series for each variable as follows, by Julian Date:

For Asheville, North Carolina:

plot(ashevilleFinalDataset$JULIAN_DATE, ashevilleFinalDataset$TMAX, pch = 20, col = "blue")
abline(h = 0)

plot(ashevilleFinalDataset$JULIAN_DATE, ashevilleFinalDataset$TMIN, pch = 20, col = "blue")
abline(h = 0)

plot(ashevilleFinalDataset$JULIAN_DATE, ashevilleFinalDataset$PRCP, pch = 20, col = "blue")
abline(h = 0)

Time Series by Julian Date, cont.

For Indianapolis, Indiana:

plot(indianapolisFinalDataset$JULIAN_DATE, indianapolisFinalDataset$TMAX, pch = 20, col = "blue")
abline(h = 0)

plot(indianapolisFinalDataset$JULIAN_DATE, indianapolisFinalDataset$TMIN, pch = 20, col = "blue")
abline(h = 0)

plot(indianapolisFinalDataset$JULIAN_DATE, indianapolisFinalDataset$PRCP, pch = 20, col = "blue")
abline(h = 0)

Modeling and Evaluation

R Packages

The following R packages are used in the analysis:

ggplot2 - this is used for graphical visualizations

dplyr - this is used for data manipulation (filter, arrange, etc.)

MASS - this is used for calculating BoxCox transformations

tidyr - this is used for general data-tidying functionality that doesn’t exist in base R
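For completeness, the corresponding library calls (presumably loaded in the report's hidden setup chunk) would be:

library(ggplot2)  # qplot(), stat_smooth()
library(dplyr)    # filter(), inner_join(), group_by()
library(MASS)     # boxcox()
library(tidyr)    # tidying helpers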

Data Modeling

The primary model we will use is linear regression. This makes the most sense for this dataset, as temperatures and atmospheric phenomena are continuous, which makes it difficult to justify bagging or boosting. Those could, however, be considered in future analyses.

The following visualizes and analyzes the slopes of the trend lines by Julian date and by calendar date, showing how temperatures and precipitation amounts trend through the period of record.

avlTMAXtrendSlope <- c()
avlTMINtrendSlope <- c()
avlPRCPtrendSlope <- c()

indTMAXtrendSlope <- c()
indTMINtrendSlope <- c()
indPRCPtrendSlope <- c()

# For each Julian day, regress each variable on calendar date and keep the slope
for(i in 1:366) {
  AVLfilteredData <- filter(ashevilleFinalDataset, JULIAN_DATE == i)
  INDfilteredData <- filter(indianapolisFinalDataset, JULIAN_DATE == i)

  avlTMAXtrendSlope[i] <- summary(lm(TMAX ~ DATE, AVLfilteredData))$coefficients[2,1]
  avlTMINtrendSlope[i] <- summary(lm(TMIN ~ DATE, AVLfilteredData))$coefficients[2,1]
  avlPRCPtrendSlope[i] <- summary(lm(PRCP ~ DATE, AVLfilteredData))$coefficients[2,1]

  indTMAXtrendSlope[i] <- summary(lm(TMAX ~ DATE, INDfilteredData))$coefficients[2,1]
  indTMINtrendSlope[i] <- summary(lm(TMIN ~ DATE, INDfilteredData))$coefficients[2,1]
  indPRCPtrendSlope[i] <- summary(lm(PRCP ~ DATE, INDfilteredData))$coefficients[2,1]
}

avlTMAXtrendSlope <- data.frame(avlTMAXtrendSlope)
avlTMINtrendSlope <- data.frame(avlTMINtrendSlope)
avlPRCPtrendSlope <- data.frame(avlPRCPtrendSlope)
indTMAXtrendSlope <- data.frame(indTMAXtrendSlope)
indTMINtrendSlope <- data.frame(indTMINtrendSlope)
indPRCPtrendSlope <- data.frame(indPRCPtrendSlope)


plot(1:366,avlTMAXtrendSlope[,1])
lines(predict(lm(avlTMAXtrendSlope ~ c(1:366),avlTMAXtrendSlope)), col = 'red')
abline(h=0)

plot(ashevilleFinalDataset$DATE, ashevilleFinalDataset$TmaxDelta, col = 'blue')
lines(ashevilleFinalDataset$DATE, predict(lm(TmaxDelta ~ DATE, ashevilleFinalDataset)), col = 'red')  # supply dates as x so the line aligns with the axis
abline(h=0)

plot(1:366,avlTMINtrendSlope[,1])
lines(predict(lm(avlTMINtrendSlope ~ c(1:366),avlTMINtrendSlope)), col = 'red')
abline(h=0)

plot(ashevilleFinalDataset$DATE, ashevilleFinalDataset$TminDelta, col = 'blue')
lines(ashevilleFinalDataset$DATE, predict(lm(TminDelta ~ DATE, ashevilleFinalDataset)), col = 'red')
abline(h=0)

plot(1:366,avlPRCPtrendSlope[,1])
lines(predict(lm(avlPRCPtrendSlope ~ c(1:366),avlPRCPtrendSlope)), col = 'red')
abline(h=0)

plot(ashevilleFinalDataset$DATE, ashevilleFinalDataset$PrecipDelta, col = 'blue')
lines(ashevilleFinalDataset$DATE, predict(lm(PrecipDelta ~ DATE, ashevilleFinalDataset)), col = 'red')
abline(h=0)

plot(1:366,indTMAXtrendSlope[,1])
lines(predict(lm(indTMAXtrendSlope ~ c(1:366),indTMAXtrendSlope)), col = 'red')
abline(h=0)

plot(indianapolisFinalDataset$DATE, indianapolisFinalDataset$TmaxDelta, col = 'blue')
lines(indianapolisFinalDataset$DATE, predict(lm(TmaxDelta ~ DATE, indianapolisFinalDataset)), col = 'red')
abline(h=0)

plot(1:366,indTMINtrendSlope[,1])
lines(predict(lm(indTMINtrendSlope ~ c(1:366),indTMINtrendSlope)), col = 'red')
abline(h=0)

plot(indianapolisFinalDataset$DATE, indianapolisFinalDataset$TminDelta, col = 'blue')
lines(indianapolisFinalDataset$DATE, predict(lm(TminDelta ~ DATE, indianapolisFinalDataset)), col = 'red')
abline(h=0)

plot(1:366,indPRCPtrendSlope[,1])
lines(predict(lm(indPRCPtrendSlope ~ c(1:366),indPRCPtrendSlope)), col = 'red')
abline(h=0)

plot(indianapolisFinalDataset$DATE, indianapolisFinalDataset$PrecipDelta, col = 'blue')
lines(indianapolisFinalDataset$DATE, predict(lm(PrecipDelta ~ DATE, indianapolisFinalDataset)), col = 'red')
abline(h=0)

Q-Q Plots

Additionally, we can look at the Q-Q plots for each regression:

# plot_list is assumed to collect the six regressions fitted above
plot_list <- list(avlTmaxDModel, avlTminDModel, avlPrecipDModel,
                  indTmaxDModel, indTminDModel, indPrecipDModel)
for(i in plot_list) {
  plot(i, which = c(2))  # which = 2 selects the normal Q-Q plot
}

Note: Cook’s distance was examined for each regression; no values near 1 were encountered, which tells us that outliers did not unduly influence the regressions performed.
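A sketch of that check, assuming plot_list holds the six regressions defined above:

# Largest Cook's distance per model; values approaching 1 would flag
# unduly influential points
sapply(plot_list, function(m) max(cooks.distance(m), na.rm = TRUE))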

Non-Linear Transformation of Data

When we begin to look at the non-linearity of the data, we can see the following. Note that there is no discernible trend in the precipitation, so no non-linear trend is computed for it:

#Asheville, NC
avlTmaxNLModel <- lm(I(-cos(TMAX*pi/180)/17) ~ c(1:length(ashevilleFinalDataset[,1])), ashevilleFinalDataset)
qplot(ashevilleFinalDataset$DATE, avlTmaxNLModel$residuals)+stat_smooth(method = "loess", span = 0.1, colour = I("red"), se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

plot(avlTmaxNLModel, which = c(2))

avlTminNLModel <- lm((-cos(TMIN*pi/180)/17) ~ c(1:length(ashevilleFinalDataset[,1])), ashevilleFinalDataset)
qplot(ashevilleFinalDataset$DATE, avlTminNLModel$residuals)+stat_smooth(method = "loess", span = 0.1, colour = I("red"), se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

plot(avlTminNLModel, which = c(2))

#Indianapolis, IN
indTmaxNLModel <- lm(I(-cos(TMAX*pi/180)/25) ~ c(1:length(indianapolisFinalDataset[,1])), indianapolisFinalDataset)
qplot(indianapolisFinalDataset$DATE, indTmaxNLModel$residuals)+stat_smooth(method = "loess", span = 0.1, colour = I("red"), se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

plot(indTmaxNLModel, which = c(2))

indTminNLModel <- lm((-cos(TMIN*pi/180)/23) ~ c(1:length(indianapolisFinalDataset[,1])), indianapolisFinalDataset)
qplot(indianapolisFinalDataset$DATE, indTminNLModel$residuals)+stat_smooth(method = "loess", span = 0.1, colour = I("red"), se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

plot(indTminNLModel, which = c(2))

Non-Linear Model Summaries

And from the summaries of the models, we see the following:

summary(avlTmaxNLModel)
## 
## Call:
## lm(formula = I(-cos(TMAX * pi/180)/17) ~ c(1:length(ashevilleFinalDataset[, 
##     1])), data = ashevilleFinalDataset)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.005663 -0.004973 -0.002723  0.003021  0.042150 
## 
## Coefficients:
##                                           Estimate Std. Error  t value Pr(>|t|)
## (Intercept)                             -5.316e-02  8.482e-05 -626.686  < 2e-16
## c(1:length(ashevilleFinalDataset[, 1])) -2.340e-08  6.281e-09   -3.726 0.000195
##                                            
## (Intercept)                             ***
## c(1:length(ashevilleFinalDataset[, 1])) ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.006486 on 23389 degrees of freedom
## Multiple R-squared:  0.0005933,  Adjusted R-squared:  0.0005505 
## F-statistic: 13.88 on 1 and 23389 DF,  p-value: 0.0001949
summary(avlTminNLModel)
## 
## Call:
## lm(formula = (-cos(TMIN * pi/180)/17) ~ c(1:length(ashevilleFinalDataset[, 
##     1])), data = ashevilleFinalDataset)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.015920 -0.009768 -0.002033  0.008235  0.054935 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                             -4.026e-02  1.406e-04 -286.38   <2e-16
## c(1:length(ashevilleFinalDataset[, 1])) -1.553e-07  1.041e-08  -14.92   <2e-16
##                                            
## (Intercept)                             ***
## c(1:length(ashevilleFinalDataset[, 1])) ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01075 on 23389 degrees of freedom
## Multiple R-squared:  0.009423,   Adjusted R-squared:  0.009381 
## F-statistic: 222.5 on 1 and 23389 DF,  p-value: < 2.2e-16
summary(indTmaxNLModel)
## 
## Call:
## lm(formula = I(-cos(TMAX * pi/180)/25) ~ c(1:length(indianapolisFinalDataset[, 
##     1])), data = indianapolisFinalDataset)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.006180 -0.005393 -0.003036  0.004102  0.039767 
## 
## Coefficients:
##                                              Estimate Std. Error  t value
## (Intercept)                                -3.381e-02  8.327e-05 -406.073
## c(1:length(indianapolisFinalDataset[, 1])) -3.010e-08  5.357e-09   -5.618
##                                            Pr(>|t|)    
## (Intercept)                                 < 2e-16 ***
## c(1:length(indianapolisFinalDataset[, 1])) 1.95e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.006831 on 26920 degrees of freedom
## Multiple R-squared:  0.001171,   Adjusted R-squared:  0.001134 
## F-statistic: 31.56 on 1 and 26920 DF,  p-value: 1.953e-08
summary(indTminNLModel)
## 
## Call:
## lm(formula = (-cos(TMIN * pi/180)/23) ~ c(1:length(indianapolisFinalDataset[, 
##     1])), data = indianapolisFinalDataset)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.013605 -0.008697 -0.001941  0.006558  0.047833 
## 
## Coefficients:
##                                              Estimate Std. Error t value
## (Intercept)                                -2.889e-02  1.220e-04  -236.9
## c(1:length(indianapolisFinalDataset[, 1])) -7.848e-08  7.846e-09   -10.0
##                                            Pr(>|t|)    
## (Intercept)                                  <2e-16 ***
## c(1:length(indianapolisFinalDataset[, 1]))   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01 on 26920 degrees of freedom
## Multiple R-squared:  0.003703,   Adjusted R-squared:  0.003666 
## F-statistic: 100.1 on 1 and 26920 DF,  p-value: < 2.2e-16

Box-Cox Transformations

We can also perform Box-Cox analysis on the data to see if there are any further transformations that are needed.
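The *_by_datapoint objects referenced below are the untransformed regressions of each temperature series on the observation index; they were fit earlier in the full analysis and are not shown in this excerpt. A plausible reconstruction, assuming fits analogous to the transformed models in the next step:

avlTMAX_by_datapoint <- lm(TMAX ~ c(1:length(ashevilleFinalDataset[,1])), ashevilleFinalDataset)
avlTMIN_by_datapoint <- lm(TMIN ~ c(1:length(ashevilleFinalDataset[,1])), ashevilleFinalDataset)
indTMAX_by_datapoint <- lm(TMAX ~ c(1:length(indianapolisFinalDataset[,1])), indianapolisFinalDataset)
indTMIN_by_datapoint <- lm(TMIN ~ c(1:length(indianapolisFinalDataset[,1])), indianapolisFinalDataset)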

boxcox(avlTMAX_by_datapoint)

boxcox(avlTMIN_by_datapoint)

boxcox(indTMAX_by_datapoint)

boxcox(indTMIN_by_datapoint)

Box-Cox, cont.

We can see that the plots imply we should transform the data by squaring it. We do this as follows:

avlTMAX_bc_by_datapoint <- lm(I(TMAX^2) ~ c(1:length(ashevilleFinalDataset[,1])), ashevilleFinalDataset)
avlTMIN_bc_by_datapoint <- lm(I(TMIN^2) ~ c(1:length(ashevilleFinalDataset[,1])), ashevilleFinalDataset)
indTMAX_bc_by_datapoint <- lm(I(TMAX^2) ~ c(1:length(indianapolisFinalDataset[,1])), indianapolisFinalDataset)
indTMIN_bc_by_datapoint <- lm(I(TMIN^2) ~ c(1:length(indianapolisFinalDataset[,1])), indianapolisFinalDataset)

boxcox(avlTMAX_bc_by_datapoint)

boxcox(avlTMIN_bc_by_datapoint)

boxcox(indTMAX_bc_by_datapoint)

boxcox(indTMIN_bc_by_datapoint)

Box-Cox, cont

Looking at these Box-Cox plots, we see that a further transformation is still needed, so we square again (raising the original temperatures to the fourth power):

avlTMAX_bc2_by_datapoint <- lm(I(TMAX^4) ~ c(1:length(ashevilleFinalDataset[,1])), ashevilleFinalDataset)
avlTMIN_bc2_by_datapoint <- lm(I(TMIN^4) ~ c(1:length(ashevilleFinalDataset[,1])), ashevilleFinalDataset)
indTMAX_bc2_by_datapoint <- lm(I(TMAX^4) ~ c(1:length(indianapolisFinalDataset[,1])), indianapolisFinalDataset)
indTMIN_bc2_by_datapoint <- lm(I(TMIN^4) ~ c(1:length(indianapolisFinalDataset[,1])), indianapolisFinalDataset)

boxcox(avlTMAX_bc2_by_datapoint)

boxcox(avlTMIN_bc2_by_datapoint)

boxcox(indTMAX_bc2_by_datapoint)

boxcox(indTMIN_bc2_by_datapoint)
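Rather than eyeballing the plots, the maximizing lambda can also be read off directly; a quick sketch using MASS's boxcox():

# Profile the likelihood without plotting and pick the lambda that maximizes it
bc <- boxcox(avlTMAX_bc2_by_datapoint, plotit = FALSE)
bc$x[which.max(bc$y)]  # a value near 1 would indicate no further transform is needed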

Box-Cox Summary

This shows an interesting relationship with time: the temperature is increasing very slowly, and we see an approximately linear relationship \(T^4 \propto C\), where \(C\) is the \(x\) value on our graph, i.e., the date.

This may be an interesting correlation to an equation used in determining the steady-state temperature of the atmosphere: the Stefan-Boltzmann equation for blackbody radiation, \(M = \epsilon \sigma T^{4}\), where \(\epsilon\) is the emissivity of the object, \(\sigma\) is the Stefan-Boltzmann constant, and \(T\) is the absolute temperature. The transformed quantity would then be representative of the “radiant exitance”, the amount of radiant energy leaving the object per unit area. Thus, this transformation is consistent with the Stefan-Boltzmann law for the lowest layer of the atmosphere.
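Written out, the implied chain is:

\[
M = \epsilon \sigma T^{4} \quad\Longrightarrow\quad T^{4} = \frac{M}{\epsilon\sigma} \propto M,
\]

so a slow, roughly linear rise in radiant exitance would appear as a linear trend in \(T^{4}\), which is what the Box-Cox result suggests.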

In other words, \(T^4 \propto C\), as shown earlier.

Future Prediction of Temperatures

Taking all these things into consideration, we see the slope of each of the trend lines for each variable:

Asheville (TMAX)

summary(lm(TMAX ~ DATE, ashevilleFinalDataset))$coefficients
##                 Estimate   Std. Error     t value   Pr(>|t|)
## (Intercept) 3.400584e+02 1.487104e-01 2286.715650 0.00000000
## DATE        3.447632e-05 1.412314e-05    2.441122 0.01464907

Asheville (TMIN)

summary(lm(TMIN ~ DATE, ashevilleFinalDataset))$coefficients
##                 Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept) 3.161580e+02 1.464641e-01 2158.6042 0.000000e+00
## DATE        2.158357e-04 1.390981e-05   15.5168 4.967061e-54

Indianapolis (TMAX)

summary(lm(TMAX ~ DATE, indianapolisFinalDataset))$coefficients
##                 Estimate   Std. Error     t value     Pr(>|t|)
## (Intercept) 3.351203e+02 1.596197e-01 2099.492467 0.000000e+00
## DATE        8.191076e-05 1.610141e-05    5.087178 3.658695e-07

Indianapolis (TMIN)

summary(lm(TMIN ~ DATE, indianapolisFinalDataset))$coefficients
##                 Estimate   Std. Error    t value    Pr(>|t|)
## (Intercept) 3.154695e+02 1.453157e-01 2170.92565 0.00000e+00
## DATE        1.582551e-04 1.465852e-05   10.79612 4.08063e-27

Forecasting, cont.

Thus, to estimate a forecast temperature for a future time, we multiply the number of days into the future by the slope (the DATE coefficient) of the trend line. The average temperature change over N years is then approximately 365 * (coefficient) * N, in degrees Fahrenheit (the units of the raw data). We see the following.
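A sketch of that arithmetic, using the Asheville TMAX coefficient from above:

# slope (deg/day) * 365 (days/year) * N (years) = projected change
avlTmaxSlope <- summary(lm(TMAX ~ DATE, ashevilleFinalDataset))$coefficients[2, 1]
round(avlTmaxSlope * 365 * 1, 2)  # ~0.01, the 1-year change reported below
round(avlTmaxSlope * 365 * 5, 2)  # ~0.06, the 5-year change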

The 1-year temperature changes are:

1 year Max Temp change for Asheville: 0.01

1 year Min Temp change for Asheville: 0.08

1 year Max Temp change for Indianapolis: 0.03

1 year Min Temp change for Indianapolis: 0.06

The 5-year temperature changes are:

5 year Max Temp change for Asheville: 0.06

5 year Min Temp change for Asheville: 0.39

5 year Max Temp change for Indianapolis: 0.15

5 year Min Temp change for Indianapolis: 0.29

Longer Term Forecasts

The 25-year temperature changes are:

25 year Max Temp change for Asheville: 0.31

25 year Min Temp change for Asheville: 1.97

25 year Max Temp change for Indianapolis: 0.75

25 year Min Temp change for Indianapolis: 1.44

The 50-year temperature changes are:

50 year Max Temp change for Asheville: 0.63

50 year Min Temp change for Asheville: 3.94

50 year Max Temp change for Indianapolis: 1.49

50 year Min Temp change for Indianapolis: 2.89

So What?

We see the following outcomes from this analysis, summarized in the sections below.

Summary and Deployment

The purpose of this research problem was to investigate how R handles temperature regressions across time, with the intent of estimating values for maximum and minimum average temperatures at future times.

According to the EPA, low temperatures have been increasing faster than high temperatures (https://www.epa.gov/climate-indicators/weather-climate): the world is not cooling off in the evenings as much as it used to. This is consistent with our finding that the minimum-temperature trends exceed the maximum-temperature trends at both sites.

Summary of Work

Insights from Analysis

Results:

Implications:

Limitations and Future Work

Limitations:

Future work: