Aaron Hurley
2023-12-11
The mission of the National Centers for Environmental Information (NCEI) is to provide “environmental data, products, and services covering the depths of the ocean to the surface of the sun to drive resilience, prosperity and equity for current and future generations.” NCEI is considered the climatology center for the United States. It fulfills this mission by aggregating weather, satellite, and other sensor observations from around the world and storing them in an archive. Its archive of historical record is the Global Historical Climatology Network (GHCN). Through the GHCN, individuals and organizations can request data from more than 100,000 stations in 180 countries, across time periods ranging from a single day to more than 175 years.
NCEI also analyzes the data it archives through the GHCN. It provides climatological analysis of the records through drought and precipitation products, wildfire products, snow and ice products, temperature, storm, and wind products, as well as climate normals and other daily-to-monthly climatological products. These products are available for download to the public. Recently, the National Oceanic and Atmospheric Administration (NOAA), the parent organization of NCEI, decided to begin transitioning to the cloud. One reason for the transition is to explore cloud computing, and thus relieve NOAA of some of the burden of running the preponderance of weather and climate models in the United States. The data procured from NCEI includes a total of 114 columns of data for two sites within the United States.
These sites are the Asheville Regional Airport in Asheville, North Carolina (KAVL) and Indianapolis, Indiana (KIND).
The original concept was to work on six disparate sites across the US; however, the complexity of the data proved difficult for the scope of the project, so the two sites above were identified as being of high enough interest with viable data.
R is a powerful statistical and computational tool. The premise of this project is to determine whether R can analyze historical weather data, calculate the amount of variance from the accepted historical climatology for each selected location, and produce a short-range climate forecast for each location that can be validated against the accepted climate models’ output for those locations.
The following is a small subset of what the data looks like. The complete structure follows.
# Read the daily GHCN CSV exports for each site
ashvlNC <- read.csv("C:/Users/elder/OneDrive/Documents/MSDA 621/ClimateProject/data/KAVL.csv")
indiaIN <- read.csv("C:/Users/elder/OneDrive/Documents/MSDA 621/ClimateProject/data/KIND.csv")
head(ashvlNC)[,c(2,3,4,6,7,27)]

## NAME LATITUDE LONGITUDE DATE JULIAN_DATE
## 1 ASHEVILLE REGIONAL AIRPORT, NC US 35.43178 -82.53787 1/1/1950 1
## 2 ASHEVILLE REGIONAL AIRPORT, NC US 35.43178 -82.53787 1/2/1950 2
## 3 ASHEVILLE REGIONAL AIRPORT, NC US 35.43178 -82.53787 1/3/1950 3
## 4 ASHEVILLE REGIONAL AIRPORT, NC US 35.43178 -82.53787 1/4/1950 4
## 5 ASHEVILLE REGIONAL AIRPORT, NC US 35.43178 -82.53787 1/5/1950 5
## 6 ASHEVILLE REGIONAL AIRPORT, NC US 35.43178 -82.53787 1/6/1950 6
## PRCP
## 1 0.00
## 2 0.00
## 3 0.01
## 4 0.02
## 5 0.00
## 6 0.28
There are 50,311 rows of data, beginning 01 Jan 1950 and running to 15 Sep 2023. Note that Asheville has several years of missing data (1/1/1955 - 8/31/1964). This should not impact the analysis since we are primarily looking at weather conditions by Julian date.
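As a sanity check (a hypothetical verification step, not part of the original analysis), the record span and the Asheville gap can be confirmed from the DATE column:

# Confirm the record span and find the fully missing years
ashvlDates <- as.Date(ashvlNC$DATE, "%m/%d/%Y")
range(ashvlDates)
# Years with no observations at all; given the 1/1/1955 - 8/31/1964 gap,
# this should return 1955 through 1963:
setdiff(1950:2023, unique(as.integer(format(ashvlDates, "%Y"))))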
The original purpose of this data is to hold daily climatological data: the daily maximum and minimum temperatures, daily precipitation amount, snow amount, cloud cover, solar radiation, and so on. There are roughly 118 variables in total; “roughly” because the exact count depends on whether you consider the station name, ID, latitude, longitude, and elevation as variables.
To understand the datasets, we can look at their structures. Due to their sizes, we will look at only one. Asheville, NC is a good example of the complicated structure of each location’s dataset:
## 'data.frame': 23390 obs. of 120 variables:
## $ STATION : chr "USW00003812" "USW00003812" "USW00003812" "USW00003812" ...
## $ NAME : chr "ASHEVILLE REGIONAL AIRPORT, NC US" "ASHEVILLE REGIONAL AIRPORT, NC US" "ASHEVILLE REGIONAL AIRPORT, NC US" "ASHEVILLE REGIONAL AIRPORT, NC US" ...
## $ LATITUDE : num 35.4 35.4 35.4 35.4 35.4 ...
## $ LONGITUDE : num -82.5 -82.5 -82.5 -82.5 -82.5 ...
## $ ELEVATION : num 646 646 646 646 646 ...
## $ DATE : chr "1/1/1950" "1/2/1950" "1/3/1950" "1/4/1950" ...
## $ JULIAN_DATE : int 1 2 3 4 5 6 7 8 9 10 ...
## $ ACMC : logi NA NA NA NA NA NA ...
## $ ACMC_ATTRIBUTES: logi NA NA NA NA NA NA ...
## $ ACMH : int NA NA NA NA NA NA NA NA NA NA ...
## $ ACMH_ATTRIBUTES: chr "" "" "" "" ...
## $ ACSC : logi NA NA NA NA NA NA ...
## $ ACSC_ATTRIBUTES: logi NA NA NA NA NA NA ...
## $ ACSH : int NA NA NA NA NA NA NA NA NA NA ...
## $ ACSH_ATTRIBUTES: chr "" "" "" "" ...
## $ AWND : num NA NA NA NA NA NA NA NA NA NA ...
## $ AWND_ATTRIBUTES: chr "" "" "" "" ...
## $ FMTM : int NA NA NA NA NA NA NA NA NA NA ...
## $ FMTM_ATTRIBUTES: chr "" "" "" "" ...
## $ FRGT : logi NA NA NA NA NA NA ...
## $ FRGT_ATTRIBUTES: logi NA NA NA NA NA NA ...
## $ GAHT : num NA NA NA NA NA NA NA NA NA NA ...
## $ GAHT_ATTRIBUTES: chr "" "" "" "" ...
## $ PGTM : int NA NA NA NA NA NA NA NA NA NA ...
## $ PGTM_ATTRIBUTES: chr "" "" "" "" ...
## $ precipDelta : num -0.0745 -0.1228 -0.1363 -0.1083 -0.0455 ...
## $ PRCP : num 0 0 0.01 0.02 0 0.28 0.01 0 0 0.22 ...
## $ PRCP_ATTRIBUTES: chr ",,X," "T,,X," ",,X," ",,X," ...
## $ PSUN : int NA NA NA NA NA NA NA NA NA NA ...
## $ PSUN_ATTRIBUTES: chr "" "" "" "" ...
## $ snowDelta : num -0.03906 -0.02969 -0.06406 -0.04844 -0.00469 ...
## $ SNOW : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SNOW_ATTRIBUTES: chr ",,X," ",,X," ",,X," ",,X," ...
## $ SNWD : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SNWD_ATTRIBUTES: chr ",,X," ",,X," ",,X," ",,X," ...
## $ tavgDelta : num -42.4 -41.7 -41.1 -40.4 -37.1 ...
## $ TAVG : int NA NA NA NA NA NA NA NA NA NA ...
## $ TAVG_ATTRIBUTES: chr "" "" "" "" ...
## $ tmaxDelta : num 3.45 9.58 10.77 13.39 20.42 ...
## $ TMAX : int 54 59 59 61 68 62 47 52 54 60 ...
## $ TMAX_ATTRIBUTES: chr ",,X" ",,X" ",,X" ",,X" ...
## $ tminDelta : num -5.41 8.17 18.06 28.11 31.94 ...
## $ TMIN : int 25 38 48 56 59 47 32 23 23 49 ...
## $ TMIN_ATTRIBUTES: chr ",,X" ",,X" ",,X" ",,X" ...
## $ TOBS : logi NA NA NA NA NA NA ...
## $ TOBS_ATTRIBUTES: logi NA NA NA NA NA NA ...
## $ TSUN : int NA NA NA NA NA NA NA NA NA NA ...
## $ TSUN_ATTRIBUTES: chr "" "" "" "" ...
## $ WDF1 : int NA NA NA NA NA NA NA NA NA NA ...
## $ WDF1_ATTRIBUTES: chr "" "" "" "" ...
## $ WDF2 : int NA NA NA NA NA NA NA NA NA NA ...
## $ WDF2_ATTRIBUTES: chr "" "" "" "" ...
## $ WDF5 : int NA NA NA NA NA NA NA NA NA NA ...
## $ WDF5_ATTRIBUTES: chr "" "" "" "" ...
## $ WDFG : int NA NA NA NA NA NA NA NA NA NA ...
## $ WDFG_ATTRIBUTES: chr "" "" "" "" ...
## $ WDFM : logi NA NA NA NA NA NA ...
## $ WDFM_ATTRIBUTES: logi NA NA NA NA NA NA ...
## $ WESD : num NA NA NA NA NA NA NA NA NA NA ...
## $ WESD_ATTRIBUTES: chr "" "" "" "" ...
## $ WSF1 : num NA NA NA NA NA NA NA NA NA NA ...
## $ WSF1_ATTRIBUTES: chr "" "" "" "" ...
## $ WSF2 : num NA NA NA NA NA NA NA NA NA NA ...
## $ WSF2_ATTRIBUTES: chr "" "" "" "" ...
## $ WSF5 : num NA NA NA NA NA NA NA NA NA NA ...
## $ WSF5_ATTRIBUTES: chr "" "" "" "" ...
## $ WSFG : num NA NA NA NA NA NA NA NA NA NA ...
## $ WSFG_ATTRIBUTES: chr "" "" "" "" ...
## $ WSFM : logi NA NA NA NA NA NA ...
## $ WSFM_ATTRIBUTES: logi NA NA NA NA NA NA ...
## $ WT01 : int 1 NA NA 1 NA 1 NA NA NA 1 ...
## $ WT01_ATTRIBUTES: chr ",,X" "" "" ",,X" ...
## $ WT02 : int NA NA NA NA NA NA NA NA NA NA ...
## $ WT02_ATTRIBUTES: chr "" "" "" "" ...
## $ WT03 : int NA NA NA NA NA NA NA NA NA NA ...
## $ WT03_ATTRIBUTES: chr "" "" "" "" ...
## $ WT04 : int NA NA NA NA NA NA NA NA NA NA ...
## $ WT04_ATTRIBUTES: chr "" "" "" "" ...
## $ WT05 : int NA NA NA NA NA NA NA NA NA NA ...
## $ WT05_ATTRIBUTES: chr "" "" "" "" ...
## $ WT06 : int NA NA NA NA NA NA NA NA NA NA ...
## $ WT06_ATTRIBUTES: chr "" "" "" "" ...
## $ WT07 : int NA NA NA NA NA NA NA NA NA NA ...
## $ WT07_ATTRIBUTES: chr "" "" "" "" ...
## $ WT08 : int 1 NA NA NA NA 1 NA NA NA 1 ...
## $ WT08_ATTRIBUTES: chr ",,X" "" "" "" ...
## $ WT09 : int NA NA NA NA NA NA NA NA NA NA ...
## $ WT09_ATTRIBUTES: chr "" "" "" "" ...
## $ WT10 : logi NA NA NA NA NA NA ...
## $ WT10_ATTRIBUTES: logi NA NA NA NA NA NA ...
## $ WT11 : int NA NA NA NA NA NA NA NA NA NA ...
## $ WT11_ATTRIBUTES: chr "" "" "" "" ...
## $ WT12 : logi NA NA NA NA NA NA ...
## $ WT12_ATTRIBUTES: logi NA NA NA NA NA NA ...
## $ WT13 : int NA NA NA NA NA NA NA NA NA NA ...
## $ WT13_ATTRIBUTES: chr "" "" "" "" ...
## $ WT14 : int NA NA NA NA NA NA NA NA NA NA ...
## $ WT14_ATTRIBUTES: chr "" "" "" "" ...
## $ WT15 : int NA NA NA NA NA NA NA NA NA NA ...
## [list output truncated]
Additionally, there are many null values.
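The per-column null counts below were presumably generated with something like this:

# Count NA values in every column of the Asheville dataset
colSums(is.na(ashvlNC))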
## STATION NAME LATITUDE LONGITUDE ELEVATION
## 0 0 0 0 0
## DATE JULIAN_DATE ACMC ACMC_ATTRIBUTES ACMH
## 0 0 23390 23390 11932
## ACMH_ATTRIBUTES ACSC ACSC_ATTRIBUTES ACSH ACSH_ATTRIBUTES
## 0 23390 23390 11931 0
## AWND AWND_ATTRIBUTES FMTM FMTM_ATTRIBUTES FRGT
## 8917 0 13243 0 23390
## FRGT_ATTRIBUTES GAHT GAHT_ATTRIBUTES PGTM PGTM_ATTRIBUTES
## 23390 23270 0 10674 0
## precipDelta PRCP PRCP_ATTRIBUTES PSUN PSUN_ATTRIBUTES
## 0 0 0 16451 0
## snowDelta SNOW SNOW_ATTRIBUTES SNWD SNWD_ATTRIBUTES
## 0 1 0 1 0
## tavgDelta TAVG TAVG_ATTRIBUTES tmaxDelta TMAX
## 0 16893 0 0 0
## TMAX_ATTRIBUTES tminDelta TMIN TMIN_ATTRIBUTES TOBS
## 0 0 0 0 23390
## TOBS_ATTRIBUTES TSUN TSUN_ATTRIBUTES WDF1 WDF1_ATTRIBUTES
## 23390 10614 0 11931 0
## WDF2 WDF2_ATTRIBUTES WDF5 WDF5_ATTRIBUTES WDFG
## 13423 0 13435 0 15473
## WDFG_ATTRIBUTES WDFM WDFM_ATTRIBUTES WESD WESD_ATTRIBUTES
## 0 23390 23390 23173 0
## WSF1 WSF1_ATTRIBUTES WSF2 WSF2_ATTRIBUTES WSF5
## 11931 0 13423 0 13435
## WSF5_ATTRIBUTES WSFG WSFG_ATTRIBUTES WSFM WSFM_ATTRIBUTES
## 0 15466 0 23390 23390
## WT01 WT01_ATTRIBUTES WT02 WT02_ATTRIBUTES WT03
## 11833 0 19601 0 20752
## WT03_ATTRIBUTES WT04 WT04_ATTRIBUTES WT05 WT05_ATTRIBUTES
## 0 23206 0 23349 0
## WT06 WT06_ATTRIBUTES WT07 WT07_ATTRIBUTES WT08
## 23231 0 23389 0 18639
## WT08_ATTRIBUTES WT09 WT09_ATTRIBUTES WT10 WT10_ATTRIBUTES
## 0 23346 0 23390 23390
## WT11 WT11_ATTRIBUTES WT12 WT12_ATTRIBUTES WT13
## 23382 0 23390 23390 19698
## WT13_ATTRIBUTES WT14 WT14_ATTRIBUTES WT15 WT15_ATTRIBUTES
## 0 22933 0 23370 0
## WT16 WT16_ATTRIBUTES WT17 WT17_ATTRIBUTES WT18
## 15121 0 23286 0 22181
## WT18_ATTRIBUTES WT19 WT19_ATTRIBUTES WT21 WT21_ATTRIBUTES
## 0 23262 0 23259 0
## WT22 WT22_ATTRIBUTES WV01 WV01_ATTRIBUTES WV03
## 23305 0 23390 23390 23322
## WV03_ATTRIBUTES WV07 WV07_ATTRIBUTES WV20 WV20_ATTRIBUTES
## 0 23390 23390 23390 23390
To simplify the data for each location, we can remove all of the columns except NAME, DATE, JULIAN_DATE, PRCP, TMAX, and TMIN. These give us precipitation and temperature data for each day of the year from 1/1/1950 to 9/15/2023.
Notes: TAVG (daily average temperature) was considered for analysis; however, it had far too few values to provide a useful dataset, so it was removed. SNOW was also strongly considered, but it likewise had too few rows of data for accurate predictions.
The revised dataset looks like this:
columnsToKeep <- c(2,6,7,27,40,43)
ashvlNCrevised <- ashvlNC[,columnsToKeep]
indiaINrevised <- indiaIN[,columnsToKeep]
str(ashvlNCrevised)

## 'data.frame': 23390 obs. of 6 variables:
## $ NAME : chr "ASHEVILLE REGIONAL AIRPORT, NC US" "ASHEVILLE REGIONAL AIRPORT, NC US" "ASHEVILLE REGIONAL AIRPORT, NC US" "ASHEVILLE REGIONAL AIRPORT, NC US" ...
## $ DATE : chr "1/1/1950" "1/2/1950" "1/3/1950" "1/4/1950" ...
## $ JULIAN_DATE: int 1 2 3 4 5 6 7 8 9 10 ...
## $ PRCP : num 0 0 0.01 0.02 0 0.28 0.01 0 0 0.22 ...
## $ TMAX : int 54 59 59 61 68 62 47 52 54 60 ...
## $ TMIN : int 25 38 48 56 59 47 32 23 23 49 ...
Now there are zero null values in either dataset:
## NAME DATE JULIAN_DATE PRCP TMAX TMIN
## 0 0 0 0 0 0
## NAME DATE JULIAN_DATE PRCP TMAX TMIN
## 0 0 0 0 0 0
The initial scope of the project contained six (6) locations, and those datasets had many null values. After simplifying the scope, there were no null values to worry about. The following, however, was the approach used to handle null values within the larger project.
We will not remove rows of data from the dataset, because the climatological estimate depends on the longevity of the record. By interpolating the missing data, we can treat a dataset with gaps as a continuous one, which allows us to analyze it in more ways.
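As a sketch of that interpolation approach (illustrative only; the simplified two-site dataset below turned out not to need it), base R’s approx() can linearly fill gaps in a numeric column over its row index:

# Illustrative linear interpolation of missing values in a numeric
# column; interior NAs are filled, leading/trailing NAs remain NA.
interpolateColumn <- function(x) {
  ok <- !is.na(x)
  approx(x = which(ok), y = x[ok], xout = seq_along(x))$y
}
# e.g., hypothetically: df$TMAX <- interpolateColumn(df$TMAX)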
To make the desired changes to the dataset, we run a for() loop to compute six new per-Julian-day averages: AVLprecipAvg, AVLtmaxAvg, AVLtminAvg, INDprecipAvg, INDtmaxAvg, and INDtminAvg.
Each average is calculated by the Julian day of the data and kept in a separate data frame, which is then used to calculate each variable’s daily delta from the norm.
library(dplyr)  # for filter()

AVLprecipAvg <- c()
AVLtmaxAvg <- c()
AVLtminAvg <- c()
INDprecipAvg <- c()
INDtmaxAvg <- c()
INDtminAvg <- c()

# For each Julian day (1-366), average every year's observation of
# precipitation, max temperature, and min temperature at each site
for(i in 1:366) {
  AVLfilteredVal <- filter(ashvlNCrevised, JULIAN_DATE == i)
  INDfilteredVal <- filter(indiaINrevised, JULIAN_DATE == i)
  AVLprecipAvg[i] <- mean(AVLfilteredVal$PRCP, na.rm = TRUE)
  AVLtmaxAvg[i] <- mean(AVLfilteredVal$TMAX, na.rm = TRUE)
  AVLtminAvg[i] <- mean(AVLfilteredVal$TMIN, na.rm = TRUE)
  INDprecipAvg[i] <- mean(INDfilteredVal$PRCP, na.rm = TRUE)
  INDtmaxAvg[i] <- mean(INDfilteredVal$TMAX, na.rm = TRUE)
  INDtminAvg[i] <- mean(INDfilteredVal$TMIN, na.rm = TRUE)
}
AVLavgDataFrame <- data.frame(AVLprecipAvg, AVLtmaxAvg, AVLtminAvg)
INDavgDataFrame <- data.frame(INDprecipAvg, INDtmaxAvg, INDtminAvg)

Using the averages procured above, we can calculate how far each observation departs from normal by subtracting the average from the actual value. We store these in the following variables:
# Empty frames to accumulate each site's dates and departures
AVLprecipDelta <- data.frame(matrix(ncol = 1, nrow = 0))
AVLtmaxDelta <- data.frame(matrix(ncol = 1, nrow = 0))
AVLtminDelta <- data.frame(matrix(ncol = 1, nrow = 0))
AVLdate <- data.frame(matrix(ncol = 1, nrow = 0))
INDprecipDelta <- data.frame(matrix(ncol = 1, nrow = 0))
INDtmaxDelta <- data.frame(matrix(ncol = 1, nrow = 0))
INDtminDelta <- data.frame(matrix(ncol = 1, nrow = 0))
INDdate <- data.frame(matrix(ncol = 1, nrow = 0))

# For each Julian day, subtract that day's long-term average from every
# year's observation; on the final pass, name the columns and assemble
# the departure data frames.
for(i in 1:366) {
  AVLfilter <- filter(ashvlNCrevised, JULIAN_DATE == i)
  INDfilter <- filter(indiaINrevised, JULIAN_DATE == i)
  AVLdate <- rbind(AVLdate, data.frame(AVLfilter$DATE))
  AVLprecipDelta <- rbind(AVLprecipDelta, data.frame(AVLfilter$PRCP - AVLavgDataFrame$AVLprecipAvg[i]))
  AVLtmaxDelta <- rbind(AVLtmaxDelta, data.frame(AVLfilter$TMAX - AVLavgDataFrame$AVLtmaxAvg[i]))
  AVLtminDelta <- rbind(AVLtminDelta, data.frame(AVLfilter$TMIN - AVLavgDataFrame$AVLtminAvg[i]))
  if(i == 366) {
    colnames(AVLdate) <- c("DataDate")
    colnames(AVLprecipDelta) <- c("PrecipDelta")
    colnames(AVLtmaxDelta) <- c("TmaxDelta")
    colnames(AVLtminDelta) <- c("TminDelta")
    AVLdeparture <- data.frame(AVLdate, AVLprecipDelta, AVLtmaxDelta, AVLtminDelta)
  }
  INDdate <- rbind(INDdate, data.frame(INDfilter$DATE))
  INDprecipDelta <- rbind(INDprecipDelta, data.frame(INDfilter$PRCP - INDavgDataFrame$INDprecipAvg[i]))
  INDtmaxDelta <- rbind(INDtmaxDelta, data.frame(INDfilter$TMAX - INDavgDataFrame$INDtmaxAvg[i]))
  INDtminDelta <- rbind(INDtminDelta, data.frame(INDfilter$TMIN - INDavgDataFrame$INDtminAvg[i]))
  if(i == 366) {
    colnames(INDdate) <- c("DataDate")
    colnames(INDprecipDelta) <- c("PrecipDelta")
    colnames(INDtmaxDelta) <- c("TmaxDelta")
    colnames(INDtminDelta) <- c("TminDelta")
    INDdeparture <- data.frame(INDdate, INDprecipDelta, INDtmaxDelta, INDtminDelta)
  }
}
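For reference, the same per-Julian-day departures can be computed without the per-day rbind() calls. A minimal sketch using dplyr’s group_by() and mutate(), assuming ashvlNCrevised as built above:

library(dplyr)

# Subtract each Julian day's long-term mean from every observation,
# preserving the original row order; equivalent in spirit to the loop above.
ashvlDepartureAlt <- ashvlNCrevised %>%
  group_by(JULIAN_DATE) %>%
  mutate(PrecipDelta = PRCP - mean(PRCP, na.rm = TRUE),
         TmaxDelta   = TMAX - mean(TMAX, na.rm = TRUE),
         TminDelta   = TMIN - mean(TMIN, na.rm = TRUE)) %>%
  ungroup()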
# Join the departures back onto each revised dataset by date, convert
# DATE to Date class, and sort chronologically
ashevilleFinalDataset <- inner_join(ashvlNCrevised, AVLdeparture, by = c("DATE" = "DataDate"))
indianapolisFinalDataset <- inner_join(indiaINrevised, INDdeparture, by = c("DATE" = "DataDate"))
ashevilleFinalDataset$DATE <- as.Date(ashevilleFinalDataset$DATE, "%m/%d/%Y")
indianapolisFinalDataset$DATE <- as.Date(indianapolisFinalDataset$DATE, "%m/%d/%Y")
ashevilleFinalDataset <- ashevilleFinalDataset[order(ashevilleFinalDataset$DATE),]
indianapolisFinalDataset <- indianapolisFinalDataset[order(indianapolisFinalDataset$DATE),]

# Shift all temperatures up by 273 so every value is positive (see note below)
ashevilleFinalDataset$TMAX <- ashevilleFinalDataset$TMAX + 273
ashevilleFinalDataset$TMIN <- ashevilleFinalDataset$TMIN + 273
indianapolisFinalDataset$TMAX <- indianapolisFinalDataset$TMAX + 273
indianapolisFinalDataset$TMIN <- indianapolisFinalDataset$TMIN + 273
head(ashevilleFinalDataset)

## NAME DATE JULIAN_DATE PRCP TMAX TMIN
## 1 ASHEVILLE REGIONAL AIRPORT, NC US 1950-01-01 1 0.00 327 298
## 2 ASHEVILLE REGIONAL AIRPORT, NC US 1950-01-02 2 0.00 332 311
## 3 ASHEVILLE REGIONAL AIRPORT, NC US 1950-01-03 3 0.01 332 321
## 4 ASHEVILLE REGIONAL AIRPORT, NC US 1950-01-04 4 0.02 334 329
## 5 ASHEVILLE REGIONAL AIRPORT, NC US 1950-01-05 5 0.00 341 332
## 6 ASHEVILLE REGIONAL AIRPORT, NC US 1950-01-06 6 0.28 335 320
## PrecipDelta TmaxDelta TminDelta
## 1 -0.07453125 3.453125 -5.406250
## 2 -0.12281250 9.578125 8.171875
## 3 -0.13625000 10.765625 18.062500
## 4 -0.10828125 13.390625 28.109375
## 5 -0.04546875 20.421875 31.937500
## 6 0.17437500 13.890625 19.125000
Note: 273 was added to each temperature so that all values are positive, allowing a Box-Cox transformation to be performed on the models. (Strictly speaking, adding 273 to a Fahrenheit value does not yield Kelvin; a true conversion is \(K = (F - 32) \times 5/9 + 273.15\). The shift is used here only to guarantee positive values.)
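A one-line sketch of the exact conversion, for reference only (the analysis itself only needs positive values):

# True Fahrenheit-to-Kelvin conversion, had physical Kelvin been required
f_to_kelvin <- function(f) (f - 32) * 5 / 9 + 273.15
f_to_kelvin(54)   # ~285.37 K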
The summary statistics for Asheville, NC (TMAX, TMIN, and PRCP)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 281.0 329.0 343.0 340.3 353.0 374.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 257.0 305.0 318.0 317.8 332.0 348.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.131 0.060 5.290
The summary statistics for Indianapolis, IN (TMAX, TMIN, and PRCP)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 262.0 319.0 339.0 335.6 354.0 378.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 246.0 303.0 317.0 316.5 333.0 354.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1139 0.0500 7.2000
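These summaries were presumably produced with base R’s summary() on each column, e.g.:

summary(ashevilleFinalDataset$TMAX)
summary(ashevilleFinalDataset$TMIN)
summary(ashevilleFinalDataset$PRCP)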
Although NCEI hosts the climatology data, the National Oceanic and Atmospheric Administration (NOAA) processes and produces much of its analysis. The following visualizations are the climatological averages for the cities of interest. Of note, these behave as you might expect: lower high and low temperatures in the winter and higher temperatures in the summer.
Also of note, higher latitudes have lower temperatures while lower latitudes have higher temperatures, and locations near the ocean tend to have a smaller range of daily and seasonal temperatures, while inland locations show larger daily and seasonal variation.
Additionally, because we computed each data point’s departure from normal, we can visualize those departures to see whether there are any trends we need to take into account when we model the data.
The following are the residuals by Julian Date:
For Asheville, North Carolina:
avlTmaxDModel <- lm(TmaxDelta ~ JULIAN_DATE, ashevilleFinalDataset)
plot(ashevilleFinalDataset$JULIAN_DATE, avlTmaxDModel$residuals, pch = 20, col = "blue")
abline(h = 0)

avlTminDModel <- lm(TminDelta ~ JULIAN_DATE, ashevilleFinalDataset)
plot(ashevilleFinalDataset$JULIAN_DATE, avlTminDModel$residuals, pch = 20, col = "blue")
abline(h = 0)

avlPrecipDModel <- lm(PrecipDelta ~ JULIAN_DATE, ashevilleFinalDataset)
plot(ashevilleFinalDataset$JULIAN_DATE, avlPrecipDModel$residuals, pch = 20, col = "blue")
abline(h = 0)

For Indianapolis, Indiana:
indTmaxDModel <- lm(TmaxDelta ~ JULIAN_DATE, indianapolisFinalDataset)
plot(indianapolisFinalDataset$JULIAN_DATE, indTmaxDModel$residuals, pch = 20, col = "blue")
abline(h = 0)

indTminDModel <- lm(TminDelta ~ JULIAN_DATE, indianapolisFinalDataset)
plot(indianapolisFinalDataset$JULIAN_DATE, indTminDModel$residuals, pch = 20, col = "blue")
abline(h = 0)

indPrecipDModel <- lm(PrecipDelta ~ JULIAN_DATE, indianapolisFinalDataset)
plot(indianapolisFinalDataset$JULIAN_DATE, indPrecipDModel$residuals, pch = 20, col = "blue")
abline(h = 0)

These are time series plots for each location:
For Asheville, North Carolina:
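(The Asheville plotting code was omitted in the original; it presumably mirrors the Indianapolis block below.)

# Presumed Asheville time-series plots, mirroring the Indianapolis code:
plot(ashevilleFinalDataset$DATE, ashevilleFinalDataset$TMAX, pch = 20, col = "blue")
abline(h = 0)
plot(ashevilleFinalDataset$DATE, ashevilleFinalDataset$TMIN, pch = 20, col = "blue")
abline(h = 0)
plot(ashevilleFinalDataset$DATE, ashevilleFinalDataset$PRCP, pch = 20, col = "blue")
abline(h = 0)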
For Indianapolis, Indiana:
plot(indianapolisFinalDataset$DATE, indianapolisFinalDataset$TMAX, pch = 20, col = "blue")
abline(h = 0)

plot(indianapolisFinalDataset$DATE, indianapolisFinalDataset$TMIN, pch = 20, col = "blue")
abline(h = 0)

plot(indianapolisFinalDataset$DATE, indianapolisFinalDataset$PRCP, pch = 20, col = "blue")
abline(h = 0)

Additionally, we can see the time series for each variable as follows, by Julian Date:
For Asheville, North Carolina:
plot(ashevilleFinalDataset$JULIAN_DATE, ashevilleFinalDataset$TMAX, pch = 20, col = "blue")
abline(h = 0)

plot(ashevilleFinalDataset$JULIAN_DATE, ashevilleFinalDataset$TMIN, pch = 20, col = "blue")
abline(h = 0)

plot(ashevilleFinalDataset$JULIAN_DATE, ashevilleFinalDataset$PRCP, pch = 20, col = "blue")
abline(h = 0)

For Indianapolis, Indiana:
plot(indianapolisFinalDataset$JULIAN_DATE, indianapolisFinalDataset$TMAX, pch = 20, col = "blue")
abline(h = 0)

plot(indianapolisFinalDataset$JULIAN_DATE, indianapolisFinalDataset$TMIN, pch = 20, col = "blue")
abline(h = 0)

plot(indianapolisFinalDataset$JULIAN_DATE, indianapolisFinalDataset$PRCP, pch = 20, col = "blue")
abline(h = 0)

The following R packages are used in the analysis:
ggplot2 - used for graphical visualizations
dplyr - used for data manipulation (filter, arrange, etc.)
MASS - used for calculating Box-Cox transformations
tidyr - used for general data-tidying functionality not available in base R
The primary model we will use is linear regression. This makes the most sense for this dataset, as temperatures and atmospheric phenomena are continuous, which makes it difficult to justify bagging or boosting. Those could be considered in future analyses.
The following is a visualization and analysis of the slopes of the trend lines by Julian date and by calendar date. The trend lines show how temperatures and precipitation amounts have moved through the time period.
# Vectors to hold each variable's long-term trend slope, fit separately
# for each Julian day (366 regressions per variable per site)
avlTMAXtrendSlope <- c()
avlTMINtrendSlope <- c()
avlPRCPtrendSlope <- c()
indTMAXtrendSlope <- c()
indTMINtrendSlope <- c()
indPRCPtrendSlope <- c()

for(i in 1:366) {
  AVLfilteredData <- filter(ashevilleFinalDataset, JULIAN_DATE == i)
  INDfilteredData <- filter(indianapolisFinalDataset, JULIAN_DATE == i)
  # coefficients[2,1] extracts the estimated slope on DATE
  avlTMAXtrendSlope[i] <- summary(lm(TMAX ~ DATE, AVLfilteredData))$coefficients[2,1]
  avlTMINtrendSlope[i] <- summary(lm(TMIN ~ DATE, AVLfilteredData))$coefficients[2,1]
  avlPRCPtrendSlope[i] <- summary(lm(PRCP ~ DATE, AVLfilteredData))$coefficients[2,1]
  indTMAXtrendSlope[i] <- summary(lm(TMAX ~ DATE, INDfilteredData))$coefficients[2,1]
  indTMINtrendSlope[i] <- summary(lm(TMIN ~ DATE, INDfilteredData))$coefficients[2,1]
  indPRCPtrendSlope[i] <- summary(lm(PRCP ~ DATE, INDfilteredData))$coefficients[2,1]
}
avlTMAXtrendSlope <- data.frame(avlTMAXtrendSlope)
avlTMINtrendSlope <- data.frame(avlTMINtrendSlope)
avlPRCPtrendSlope <- data.frame(avlPRCPtrendSlope)
indTMAXtrendSlope <- data.frame(indTMAXtrendSlope)
indTMINtrendSlope <- data.frame(indTMINtrendSlope)
indPRCPtrendSlope <- data.frame(indPRCPtrendSlope)
plot(1:366, avlTMAXtrendSlope[,1])
lines(predict(lm(avlTMAXtrendSlope ~ c(1:366), avlTMAXtrendSlope)), col = 'red')
abline(h = 0)

plot(ashevilleFinalDataset$DATE, ashevilleFinalDataset$TmaxDelta, col = 'blue')
lines(predict(lm(TmaxDelta ~ DATE, ashevilleFinalDataset)), col = 'red')
abline(h = 0)

plot(1:366, avlTMINtrendSlope[,1])
lines(predict(lm(avlTMINtrendSlope ~ c(1:366), avlTMINtrendSlope)), col = 'red')
abline(h = 0)

plot(ashevilleFinalDataset$DATE, ashevilleFinalDataset$TminDelta, col = 'blue')
lines(predict(lm(TminDelta ~ DATE, ashevilleFinalDataset)), col = 'red')
abline(h = 0)

plot(1:366, avlPRCPtrendSlope[,1])
lines(predict(lm(avlPRCPtrendSlope ~ c(1:366), avlPRCPtrendSlope)), col = 'red')
abline(h = 0)

plot(ashevilleFinalDataset$DATE, ashevilleFinalDataset$PrecipDelta, col = 'blue')
lines(predict(lm(PrecipDelta ~ DATE, ashevilleFinalDataset)), col = 'red')
abline(h = 0)

plot(1:366, indTMAXtrendSlope[,1])
lines(predict(lm(indTMAXtrendSlope ~ c(1:366), indTMAXtrendSlope)), col = 'red')
abline(h = 0)

plot(indianapolisFinalDataset$DATE, indianapolisFinalDataset$TmaxDelta, col = 'blue')
lines(predict(lm(TmaxDelta ~ DATE, indianapolisFinalDataset)), col = 'red')
abline(h = 0)

plot(1:366, indTMINtrendSlope[,1])
lines(predict(lm(indTMINtrendSlope ~ c(1:366), indTMINtrendSlope)), col = 'red')
abline(h = 0)

plot(indianapolisFinalDataset$DATE, indianapolisFinalDataset$TminDelta, col = 'blue')
lines(predict(lm(TminDelta ~ DATE, indianapolisFinalDataset)), col = 'red')
abline(h = 0)

plot(1:366, indPRCPtrendSlope[,1])
lines(predict(lm(indPRCPtrendSlope ~ c(1:366), indPRCPtrendSlope)), col = 'red')
abline(h = 0)

plot(indianapolisFinalDataset$DATE, indianapolisFinalDataset$PrecipDelta, col = 'blue')
lines(predict(lm(PrecipDelta ~ DATE, indianapolisFinalDataset)), col = 'red')
abline(h = 0)

Additionally, we can look at the Q-Q plots for each regression:
Note: Cook’s Distance was examined for each plot; no values near 1 were encountered, which suggests that outliers did not unduly influence the regressions performed.
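A sketch of that check, using the residual-delta models fit above; the maximum Cook’s distance for each model should come back well below 1:

# Largest Cook's distance per Asheville model; values near 1 would flag
# influential outliers (repeat for the Indianapolis models).
max(cooks.distance(avlTmaxDModel))
max(cooks.distance(avlTminDModel))
max(cooks.distance(avlPrecipDModel))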
When we look at the non-linearity of the data, we see the following. Note that there is no discernible trend in the precipitation, so no non-linear trend is computed:
# Asheville, NC
avlTmaxNLModel <- lm(I(-cos(TMAX*pi/180)/17) ~ c(1:length(ashevilleFinalDataset[,1])), ashevilleFinalDataset)
qplot(ashevilleFinalDataset$DATE, avlTmaxNLModel$residuals) + stat_smooth(method = "loess", span = 0.1, colour = I("red"), se = FALSE)

## `geom_smooth()` using formula = 'y ~ x'

avlTminNLModel <- lm((-cos(TMIN*pi/180)/17) ~ c(1:length(ashevilleFinalDataset[,1])), ashevilleFinalDataset)
qplot(ashevilleFinalDataset$DATE, avlTminNLModel$residuals) + stat_smooth(method = "loess", span = 0.1, colour = I("red"), se = FALSE)

## `geom_smooth()` using formula = 'y ~ x'

# Indianapolis, IN
indTmaxNLModel <- lm(I(-cos(TMAX*pi/180)/25) ~ c(1:length(indianapolisFinalDataset[,1])), indianapolisFinalDataset)
qplot(indianapolisFinalDataset$DATE, indTmaxNLModel$residuals) + stat_smooth(method = "loess", span = 0.1, colour = I("red"), se = FALSE)

## `geom_smooth()` using formula = 'y ~ x'

indTminNLModel <- lm((-cos(TMIN*pi/180)/23) ~ c(1:length(indianapolisFinalDataset[,1])), indianapolisFinalDataset)
qplot(indianapolisFinalDataset$DATE, indTminNLModel$residuals) + stat_smooth(method = "loess", span = 0.1, colour = I("red"), se = FALSE)

## `geom_smooth()` using formula = 'y ~ x'
And from the summaries of the models, we see the following:
##
## Call:
## lm(formula = I(-cos(TMAX * pi/180)/17) ~ c(1:length(ashevilleFinalDataset[,
## 1])), data = ashevilleFinalDataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.005663 -0.004973 -0.002723 0.003021 0.042150
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.316e-02 8.482e-05 -626.686 < 2e-16
## c(1:length(ashevilleFinalDataset[, 1])) -2.340e-08 6.281e-09 -3.726 0.000195
##
## (Intercept) ***
## c(1:length(ashevilleFinalDataset[, 1])) ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.006486 on 23389 degrees of freedom
## Multiple R-squared: 0.0005933, Adjusted R-squared: 0.0005505
## F-statistic: 13.88 on 1 and 23389 DF, p-value: 0.0001949
##
## Call:
## lm(formula = (-cos(TMIN * pi/180)/17) ~ c(1:length(ashevilleFinalDataset[,
## 1])), data = ashevilleFinalDataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.015920 -0.009768 -0.002033 0.008235 0.054935
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.026e-02 1.406e-04 -286.38 <2e-16
## c(1:length(ashevilleFinalDataset[, 1])) -1.553e-07 1.041e-08 -14.92 <2e-16
##
## (Intercept) ***
## c(1:length(ashevilleFinalDataset[, 1])) ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01075 on 23389 degrees of freedom
## Multiple R-squared: 0.009423, Adjusted R-squared: 0.009381
## F-statistic: 222.5 on 1 and 23389 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = I(-cos(TMAX * pi/180)/25) ~ c(1:length(indianapolisFinalDataset[,
## 1])), data = indianapolisFinalDataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.006180 -0.005393 -0.003036 0.004102 0.039767
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -3.381e-02 8.327e-05 -406.073
## c(1:length(indianapolisFinalDataset[, 1])) -3.010e-08 5.357e-09 -5.618
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## c(1:length(indianapolisFinalDataset[, 1])) 1.95e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.006831 on 26920 degrees of freedom
## Multiple R-squared: 0.001171, Adjusted R-squared: 0.001134
## F-statistic: 31.56 on 1 and 26920 DF, p-value: 1.953e-08
##
## Call:
## lm(formula = (-cos(TMIN * pi/180)/23) ~ c(1:length(indianapolisFinalDataset[,
## 1])), data = indianapolisFinalDataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.013605 -0.008697 -0.001941 0.006558 0.047833
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -2.889e-02 1.220e-04 -236.9
## c(1:length(indianapolisFinalDataset[, 1])) -7.848e-08 7.846e-09 -10.0
## Pr(>|t|)
## (Intercept) <2e-16 ***
## c(1:length(indianapolisFinalDataset[, 1])) <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01 on 26920 degrees of freedom
## Multiple R-squared: 0.003703, Adjusted R-squared: 0.003666
## F-statistic: 100.1 on 1 and 26920 DF, p-value: < 2.2e-16
We can also perform a Box-Cox analysis on the data to see whether any further transformations are needed.
The resulting Box-Cox plot implies that we should transform the data by squaring it. We do this as follows:
library(MASS)  # for boxcox()

avlTMAX_bc_by_datapoint <- lm(I(TMAX^2) ~ c(1:length(ashevilleFinalDataset[,1])), ashevilleFinalDataset)
avlTMIN_bc_by_datapoint <- lm(I(TMIN^2) ~ c(1:length(ashevilleFinalDataset[,1])), ashevilleFinalDataset)
indTMAX_bc_by_datapoint <- lm(I(TMAX^2) ~ c(1:length(indianapolisFinalDataset[,1])), indianapolisFinalDataset)
indTMIN_bc_by_datapoint <- lm(I(TMIN^2) ~ c(1:length(indianapolisFinalDataset[,1])), indianapolisFinalDataset)
boxcox(avlTMAX_bc_by_datapoint)

Looking at these Box-Cox plots, we see that a transformation is still needed.
avlTMAX_bc2_by_datapoint <- lm(I(TMAX^4) ~ c(1:length(ashevilleFinalDataset[,1])), ashevilleFinalDataset)
avlTMIN_bc2_by_datapoint <- lm(I(TMIN^4) ~ c(1:length(ashevilleFinalDataset[,1])), ashevilleFinalDataset)
indTMAX_bc2_by_datapoint <- lm(I(TMAX^4) ~ c(1:length(indianapolisFinalDataset[,1])), indianapolisFinalDataset)
indTMIN_bc2_by_datapoint <- lm(I(TMIN^4) ~ c(1:length(indianapolisFinalDataset[,1])), indianapolisFinalDataset)
boxcox(avlTMAX_bc2_by_datapoint)

This shows an interesting relationship with the dates: the temperature is increasing very slowly with time, following a linear relationship \(T^4 \propto C\), where \(C\) is the \(x\) value on our graph, the date.
This may be an interesting connection to an equation used in determining the steady-state temperature of the atmosphere: the Stefan-Boltzmann equation for blackbody radiation, \(M = \epsilon \sigma T^{4}\), where \(\epsilon\) is the emissivity of the object, \(\sigma\) is the Stefan-Boltzmann constant, and \(T\) is the temperature. The transformed quantity is then representative of the “radiant exitance” \(M\), the amount of radiant energy leaving the object, so the \(T^4\) transformation is consistent with the Stefan-Boltzmann law for the lowest layer of the atmosphere.
In other words, \(T^4 \propto C\), as shown earlier.
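To make that reasoning explicit: if the radiant exitance grows approximately linearly with the date \(C\), then

\[ M = \epsilon \sigma T^{4} \quad \text{and} \quad M \approx a + bC \quad \Longrightarrow \quad T^{4} = \frac{a + bC}{\epsilon \sigma} \propto C, \]

which is the linear relationship the second Box-Cox pass surfaced.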
Taking all these things into consideration, we see the slope of each of the trend lines for each variable:
Asheville (TMAX)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.400584e+02 1.487104e-01 2286.715650 0.00000000
## DATE 3.447632e-05 1.412314e-05 2.441122 0.01464907
Asheville (TMIN)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.161580e+02 1.464641e-01 2158.6042 0.000000e+00
## DATE 2.158357e-04 1.390981e-05 15.5168 4.967061e-54
Indianapolis (TMAX)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.351203e+02 1.596197e-01 2099.492467 0.000000e+00
## DATE 8.191076e-05 1.610141e-05 5.087178 3.658695e-07
Indianapolis (TMIN)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.154695e+02 1.453157e-01 2170.92565 0.00000e+00
## DATE 1.582551e-04 1.465852e-05 10.79612 4.08063e-27
Thus, to estimate a forecasted temperature for a future time, we multiply the number of days into the future by the slope (coefficient) of the trend line. The average temperature change over the coming years is then simply 365 * (coefficient) * (# years), as reproduced in the sketch below. We see the following.
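A minimal sketch of that arithmetic, using the Asheville TMAX slope from the table above (the slope is in degrees per day, so the +273 shift has no effect on it):

# Projected change = slope (degrees/day) * 365 days/year * number of years
yrs <- c(1, 5, 25, 50)
avlTMAXslope <- 3.447632e-05   # Asheville TMAX coefficient on DATE, from above
round(avlTMAXslope * 365 * yrs, 2)
## 0.01 0.06 0.31 0.63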
The 1-year temperature changes are as follows:
1 year Max Temp change for Asheville: 0.01
1 year Min Temp change for Asheville: 0.08
1 year Max Temp change for Indianapolis: 0.03
1 year Min Temp change for Indianapolis: 0.06
The 5-year temperature changes are as follows:
5 year Max Temp change for Asheville: 0.06
5 year Min Temp change for Asheville: 0.39
5 year Max Temp change for Indianapolis: 0.15
5 year Min Temp change for Indianapolis: 0.29
The 25-year temperature changes are as follows:
25 year Max Temp change for Asheville: 0.31
25 year Min Temp change for Asheville: 1.97
25 year Max Temp change for Indianapolis: 0.75
25 year Min Temp change for Indianapolis: 1.44
The 50-year temperature changes are as follows:
50 year Max Temp change for Asheville: 0.63
50 year Min Temp change for Asheville: 3.94
50 year Max Temp change for Indianapolis: 1.49
50 year Min Temp change for Indianapolis: 2.89
We see the following outcomes from this analysis:
We can see a trend of slightly increasing values over the years, projected into the future.
We see minimum temperatures increasing at a faster rate than maximum temperatures. This is interesting since it implies there could come a point in time with very little difference between max and min temperatures.
Inconclusive precipitation analysis
The purpose of this research problem was to investigate how R handles temperature regressions across time, with the intent of using those regressions to estimate maximum and minimum average temperatures at future times.
According to the EPA, low temperatures have been increasing more than high temperatures (https://www.epa.gov/climate-indicators/weather-climate); the world is not cooling off in the evenings as much as it used to.
Identified which variables to utilize in analysis (TMAX, TMIN, and PRCP)
Identified whether to interpolate or remove null values
Analyzed residuals from normal
Analyzed data
Transformed data for analysis
Plotted regressions
Determined Box-Cox relationships with the data
Results:
Temperatures increasing with the Min temperature increasing faster than Max temperature
Inconclusive precipitation analysis
Implications:
Low temperatures increasing faster than high temperatures imply the earth is not cooling off as quickly as it used to.
This also implies that daily temperature variance will decrease as time continues, which could create a feedback loop that causes the max and min temperatures to increase even faster.
Limitations:
Only two locations identified
Only three variables analyzed
Very long timeframe - heavier weight on earlier data; less weight on the more recent datapoints
Future work:
Narrower timeframe - Focus on 10yr, 20yr, and 30yr and see how the outlook compares with the 70yr data analyzed within this dataset
Additional metrics - Look at cloudiness and humidity
Look at semi-permanent High and Low pressure systems and how they have influenced the climate over decadal time spans
Look at other types of predictive methods (bagging and boosting) to see how they perform relative to linear regression
Perform this analysis for other locations and determine how quickly the temperature is changing in various climate regions
Look closer at how precipitation amounts vary by number of years they are averaged over to determine if there is a trend in the data