library(tidyverse)
library(kernlab)
library(neuralnet)
library(caret)
library(RCurl)
library(geosphere)
library(psych)
library(lubridate)
library(ggfortify)
library(glmnet)
Import the data sets from Google Drive (or locally). First, Fairbanks weather:
# It should be available here:
# "https://www.ncei.noaa.gov/access/services/data/v1?dataset=daily-summaries&stations=USW00026411&startDate=1970-01-01&endDate=2021-03-01&format=csv"
# but their site was down for some time. Luckily it came back up, but the outage forced me to modify my project last minute. As this is the more robust solution, I've included the CSV here:
# Online:
id <- "1TpnJfFSCWYm7jJG4H9Owag_mTwONBMhR" # google file ID
fair_weather <- read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id))
# Local (it's faster):
fair_weather <- read.csv("Faribanks_weather_air.csv")
To clean up the Fairbanks data we're going to pull just the records from the airport, as it's the station with the highest compliance and the longest data range, all the way back to 1923. In other words, its data is of the highest quality, and it doesn't miss days. I initially ran into trouble because NOAA was down for a week in the middle of my project, so I was unable to update my data to the latest dates.
fair_weather_air <- fair_weather %>% filter(STATION == "USW00026411")
Let's start reducing the number of variables, beginning with those that contain no data at all:
#summary(fair_weather_air)
#fair_weather_air <- fair_weather_air %>% #select(-c("DAPR","DASF","EVAP","MDPR","MDSF","MNPN","MXPN","MXPN","SN52"))
# collect the indices of columns that are entirely NA
removecols <- c()
for(i in 1:ncol(fair_weather_air)){
  numrows <- nrow(fair_weather_air)
  numNA <- sum(is.na(fair_weather_air[,i]))
  if(numrows == numNA){
    removecols <- append(removecols,i)
  }
}
i <- NULL
# drop the all-NA columns, if any were found
if(!is.null(removecols)){fair_weather_air <- fair_weather_air[,-removecols]}
fair_weather_air$DATE <- as.Date(fair_weather_air$DATE,"%m/%d/%Y")
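As an aside, dplyr (1.0+) can do the same all-NA column pruning in a single line with select(where()); an equivalent sketch of the loop above:
# drop any column that is entirely NA (same effect as the loop above)
fair_weather_air <- fair_weather_air %>% select(where(~ !all(is.na(.x))))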
Great, now we're down to far fewer variables. Let's take a look (not shown for brevity):
#summary(fair_weather_air)
Based on the data here (or the lack of it), the candidate variables were "PRCP", "SNWD", "SNOW", "ACMH", "ACSH", "TSUN", "TAVG", "TMAX", "TMIN", "WESD", and "AWND". The reason we can throw away so much is that much of it is irrelevant to ice breakage, such as wind direction ("WD") and weather type ("WT"), or contains nothing useful, like WESD – water equivalent of snow on the ground (all 0's) – and TAVG, which isn't present before 2016. The ones we selected are:
PRCP - Precipitation
SNWD - Snow depth
SNOW - Snowfall
TSUN - Total sunshine for the period (wanted, but not actually recorded here; we derive day length instead below)
TMAX - Maximum temperature
TMIN - Minimum temperature
AWND - Average wind speed
Note: we'd like pressure and relative humidity too, but they are not included in this data; their effects show up only indirectly, through precipitation.
fair_weather_air <- fair_weather_air %>% select(DATE,PRCP,SNWD,SNOW,AWND,TMAX,TMIN)
Our ice data starts in 1989, and winds aren't recorded here until 1983, so let's cut the data off early:
fair_weather_air <- fair_weather_air %>% filter(DATE >= "1989-01-01")
#summary(fair_weather_air)
Great, only 31 NA's to impute. For these let's use the median, not the mean, of wind speed. This is an OK compromise: the data is slightly right-skewed (shown below), and it's very few data points:
fair_weather_air <- fair_weather_air %>% mutate(AWND = ifelse(is.na(AWND),median(AWND,na.rm=TRUE),AWND))
#summary(fair_weather_air)
I'd like to add a column here based on another package: the amount of daylight on each day. This dataset should include sunshine time and a cloudiness factor, but it does not. Day length, at least, can be calculated with the geosphere package:
# the Latitude/Longitude of the station is 64.8039°, -147.8761°
# from "https://html.scirp.org/file/_6-1770622_1.htm"
fair_weather_air <- fair_weather_air %>% mutate(daylength = daylength(64.8039,DATE))
summary(fair_weather_air)
## DATE PRCP SNWD SNOW
## Min. :1989-01-01 Min. :0.00000 Min. : 0.000 Min. : 0.000
## 1st Qu.:1997-01-23 1st Qu.:0.00000 1st Qu.: 0.000 1st Qu.: 0.000
## Median :2005-02-15 Median :0.00000 Median : 2.000 Median : 0.000
## Mean :2005-02-15 Mean :0.03228 Mean : 7.685 Mean : 0.183
## 3rd Qu.:2013-03-09 3rd Qu.:0.01000 3rd Qu.:14.000 3rd Qu.: 0.000
## Max. :2021-04-01 Max. :2.27000 Max. :54.000 Max. :11.900
## AWND TMAX TMIN daylength
## Min. : 0.000 Min. :-45.00 Min. :-58.00 Min. : 3.705
## 1st Qu.: 2.010 1st Qu.: 15.00 1st Qu.: -6.00 1st Qu.: 7.316
## Median : 3.800 Median : 41.00 Median : 23.00 Median :12.444
## Mean : 4.166 Mean : 38.11 Mean : 17.95 Mean :12.516
## 3rd Qu.: 5.820 3rd Qu.: 64.00 3rd Qu.: 44.00 3rd Qu.:17.599
## Max. :22.370 Max. : 94.00 Max. : 70.00 Max. :21.811
And that's it – the data is ready!
Now we can finally check multicollinearity:
cor(fair_weather_air[,-1])
## PRCP SNWD SNOW AWND TMAX
## PRCP 1.00000000 -0.08768294 0.362440095 0.094184935 0.1018422
## SNWD -0.08768294 1.00000000 0.200834898 -0.177218024 -0.6339349
## SNOW 0.36244010 0.20083490 1.000000000 0.005168856 -0.1699666
## AWND 0.09418493 -0.17721802 0.005168856 1.000000000 0.4013000
## TMAX 0.10184215 -0.63393493 -0.169966605 0.401300000 1.0000000
## TMIN 0.18356940 -0.69266427 -0.120317952 0.405761686 0.9588552
## daylength 0.11725831 -0.54492580 -0.214221195 0.421069798 0.8765209
## TMIN daylength
## PRCP 0.1835694 0.1172583
## SNWD -0.6926643 -0.5449258
## SNOW -0.1203180 -0.2142212
## AWND 0.4057617 0.4210698
## TMAX 0.9588552 0.8765209
## TMIN 1.0000000 0.8464458
## daylength 0.8464458 1.0000000
It appears that TMIN and TMAX are, of course, highly correlated. We could compute an average temperature, as the TAVG variable we removed did, but we'll leave both in for now, since we haven't built a linear model yet to truly assess the collinearity.
It also looks like several variables are far from normal; if we want something closer to normality we'll need to transform them, and a min-max rescaling may be the better fit given how non-normal the data is.
But what about the ICE? Remember, we still have to BUILD our response variable from the data! So now let's add ice thickness to our dataset – this will be our response variable!
Ice data:
# Online
id <- "1RxhXPuZYu-iD4hxaoak0k8Tj3IUvQOKp" # google file ID
river_ice <- read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id))
# EPA-hosted data -- thank goodness it still works!
Break_date <- read.csv("https://www.epa.gov/sites/production/files/2016-08/river-ice_fig-1.csv")
# Online ice thickness data, from https://www.nenanaakiceclassic.com/ice.htm
# PAINSTAKINGLY MANUALLY ENTERED
# https://drive.google.com/file/d/1hs9bcJw0Ex_d20JIUwvlK12DMHWe6p4V/view?usp=sharing
# Online
id <- "1hs9bcJw0Ex_d20JIUwvlK12DMHWe6p4V" # google file ID
ice_depth <- read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id), fileEncoding="UTF-8-BOM")
# Local:
river_ice <- read.csv("river_ice_stages_landsat.csv", fileEncoding="UTF-8-BOM")
Break_date <- read.csv("icebreak.csv", fileEncoding="UTF-8-BOM")
ice_depth <- read.csv("ICEDepth.csv", fileEncoding="UTF-8-BOM")
#str(Break_date)
Clean the break-date data, make it specific to our lottery, and add the years missing from the EPA data: 2017, 2018, 2019, and 2020.
Break_date_clean <- tail(Break_date,length(Break_date)-9) %>%
  rename(Yukon.River = X) %>%
  rename(Tanana.River = X.1) %>%
  rename(Year = Figure.1..Ice.Breakup.Dates.for.Two.Alaskan.Rivers..1896.2016)
Break_date_clean <- Break_date_clean %>%
  mutate(Tanana.date = as.Date(as.numeric(Tanana.River),origin = paste0(Year,"-01-01"))) %>%
  select(Year,Tanana.date) %>%
  filter(!is.na(Tanana.date))
Break_date_clean$Year <- as.numeric(Break_date_clean$Year)
# add the most recent missing breakup dates:
# From: "https://www.nenanaakiceclassic.com/2021%20Nenana%20Ice%20Classic%20Brochure.pdf"
newBreakupData <- tibble(Year = c(2017,2018,2019,2020),Tanana.date = as.Date(c("2017-05-01","2018-05-01","2019-04-14","2020-04-27")))
Break_date_clean <- bind_rows(Break_date_clean,newBreakupData)
ice_depth$Clean.Date <- as.Date(ice_depth$Clean.Date)
Break_date_clean$zerodepth <- 0
Unfortunately the river ice data is a little tricky to deal with here. Since it feeds our response variable, we're looking for the Tok station's dates on the Tanana River: it is the closest to Fairbanks, where our meteorological data comes from. Let's convert this data into something we can match against our Fairbanks data. We have day-of-year and year; we first want to turn those into a date, so we'll mutate a column using the date-time tools as before.
river_ice <- river_ice %>% mutate(date = as.Date(river_ice$day_of_year,origin = paste0(river_ice$year,"-01-01")))
The river we're after is the Tanana, since we have Fairbanks weather data for it and it's the site of the lottery we're interested in.
Let's use linear interpolation (approx()) to fill in the gaps between ice-depth measurements:
ice_final_s0 <- data.frame(Date = seq(as.Date("1989-01-01"), as.Date("2020-12-31"), by="days"))
ice_final_s0$year <- as.numeric(format(as.Date(ice_final_s0$Date), "%Y"))
ice_final_s0 <- left_join(ice_final_s0,ice_depth,by = c("Date" = "Clean.Date"))
ice_final_s0 <- left_join(ice_final_s0,Break_date_clean,by = c("Date" = "Tanana.date")) %>% select(-Year)
ice_final_s0$merge <- ifelse(!is.na(ice_final_s0$depth.clean),ice_final_s0$depth.clean,
ifelse(!is.na(ice_final_s0$zerodepth),ice_final_s0$zerodepth,
NA))
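Incidentally, the nested ifelse() above is exactly what dplyr::coalesce() does; an equivalent one-liner, for what it's worth:
# take the first non-NA of the measured depth and the zero-depth breakup marker
ice_final_s0$merge <- dplyr::coalesce(ice_final_s0$depth.clean, ice_final_s0$zerodepth)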
Now let's interpolate!
# make a test df for now
# note: no ice cores were taken before the start of each year, so we can
# interpolate by calendar year here; once we integrate freeze-ups (which happen
# in November/December) we will have to group by winter instead
# we also still need to add the breakups to the graph
ice_final_s1 <- ice_final_s0
ice_final_s1$iceday <- unlist(by(ice_final_s0, ice_final_s0$year, function(df) approx(df$Date, df$merge, xout = df$Date)[2]))
ice_final_s1 <- ice_final_s1 %>% select(-depth.clean,-zerodepth)
Take a look at our plot. We still need the freeze-up dates, and to insert all our zeros. Note that breakup happens rapidly and almost all at once – that's really why it's a lottery.
plot.ice.depth <- ice_final_s1 %>% filter(year>2006)
ggplot(data=plot.ice.depth, aes(x=Date, y=iceday, group=year)) +
geom_line()+
geom_point()
## Warning: Removed 3711 row(s) containing missing values (geom_path).
## Warning: Removed 3711 rows containing missing values (geom_point).
Let's take a look at what we can do with the freeze-up data:
# we want freeze-up only; we don't care about lat/long beyond keeping Tok (the
# site closest to Fairbanks and on the same river), and we don't care about the
# satellites. Unfortunately the data only runs to 2016, so we'll have to impute
# values for later years based on the distribution.
freeze.up <- river_ice %>% select(year,day_of_year,study_area,season,river_ice_stage)
#filter for freeze
freeze.up <- freeze.up %>% filter(season == "freeze-up")
#add a date
freeze.up <- freeze.up %>% mutate(Date = as.Date(as.numeric(freeze.up$day_of_year),origin = paste0(freeze.up$year,"-01-01")))
#filter out open
freeze.up <- freeze.up %>% filter(river_ice_stage != "open")
#summarize by minimum date:
min.date.freeze <- freeze.up %>% group_by(year,study_area) %>% summarise(min = min(Date))
## `summarise()` regrouping output by 'year' (override with `.groups` argument)
min.date.freeze <- left_join(min.date.freeze,freeze.up,by = c("min" = "Date"))[,-c(4:7)]
head(min.date.freeze)
## # A tibble: 6 x 4
## year.x study_area.x min river_ice_stage
## <int> <chr> <date> <chr>
## 1 1972 Beaver 1972-11-05 mostly ice-covered
## 2 1973 Beaver 1973-10-31 mostly ice-covered
## 3 1973 Grayling 1973-10-20 mostly open
## 4 1973 Tok 1973-11-13 ice-covered
## 5 1974 Beaver 1974-10-28 narrowly open
## 6 1974 Grayling 1974-11-19 ice-covered
Great, now we have good freeze-up times. Let's look at a histogram across all sites:
plot.freeze.up <- min.date.freeze %>% mutate(estdate = yday(min.date.freeze$min))
ggplot(plot.freeze.up, aes(x=estdate,fill=river_ice_stage)) +
geom_histogram(color="black")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can see that things tend to freeze up around November 8th (day 310 of the year). Let's look at just Tok:
plot.freeze.up <- plot.freeze.up %>% filter(study_area.x == "Tok")
ggplot(plot.freeze.up, aes(x=estdate,fill=river_ice_stage)) +
geom_histogram(color="black")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We could simply continue here, but we're missing years of data, so let's use our best estimate and say that by day 330 – the maximum freeze-up date in all the river-ice data we have – the river will have ice on it. We can therefore use day 330 as the starting point (depth 0) in our ice estimations, with the remaining open-water days also set to 0. Now that we've derived this insight from the Tanana River data, we can feed it back into our river-ice estimator.
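As a quick sanity check on that cutoff, the maximum observed freeze-up day of year can be pulled straight from the summarised table (a one-liner; it should come out at or below 330):
# latest observed freeze-up day of year across all sites
max(yday(min.date.freeze$min))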
# our max freeze day is 330, remember -- let's add it to our data as depth 0
ice_final_s1 <- ice_final_s1 %>% mutate(merge = ifelse(ice_final_s1$Date == as.Date(as.numeric(330),origin = paste0(ice_final_s1$year,"-01-01")),0.0,ice_final_s1$merge))
Now we use a year offset by two months to redo our grouped approximations. The offset is just to be safe: since the ice always breaks before fall, each winter now sits entirely inside a single group.
ice_final_s1$year.offset <- as.numeric(format(as.Date(ice_final_s0$Date)+60, "%Y"))
ice_final_s1$year.offset <- ifelse(ice_final_s1$year.offset==max(ice_final_s1$year.offset),max(ice_final_s1$year.offset)-1,ice_final_s1$year.offset)
OK, now redo the grouping and approximation with the new year-offset groups and the new day-330 starting point.
ice_final_s2 <- ice_final_s1
ice_final_s2$iceday <- unlist(by(ice_final_s1, ice_final_s1$year.offset, function(df) approx(df$Date, df$merge, xout = df$Date)[2]))
Fill in the 0's:
ice_final_s2$iceday <- ifelse(is.na(ice_final_s2$iceday),0.0,ice_final_s2$iceday)
Great. Now let's revisit our plot, this time for three years:
plot.ice.depth <- ice_final_s2 %>% filter(year>2017)
ggplot(data=plot.ice.depth, aes(x=Date, y=iceday, group=year)) +
geom_line()+
geom_point()
Beautiful – now we have squeaky-clean ice-depth data for our response variable. Again, this is an approximation; what we REALLY care about is approximating this curve from the other variables, and then, in the end, how it relates to the last day of ice using our best available model. We really should fit a smoothing curve here, but that's what our models will be doing for us ;).
Let's clean up our calculated fields for an easy left join by date:
ice_final_s3 <- ice_final_s2 %>% select(Date,iceday)
Join our variables for our final data set!!! The data exploration and collection stages have been a real slugfest, and there could be a LOT of discussion about the data collected; as of this writing I still do not have the most up-to-date data in fair_weather due to the NOAA outages.
fair_ice_raw_all <- left_join(fair_weather_air,ice_final_s3,by = c("DATE" = "Date"))
# let's cut off at the guaranteed-freeze date for 2020: day 330, or Nov 25th
fair_ice_raw_filter <- fair_ice_raw_all %>% filter(DATE < "2020-11-26")
#summary(fair_ice_raw)
Let's look again at our panels:
pairs.panels(fair_ice_raw_filter[,-1])
Ice depth is highly correlated with snow and with temperature – especially the low temperatures! I know that through this process I've moved the goalposts: thanks to the ice-depth data I now have a continuous response rather than a Boolean one.
I will therefore look at a multiple regression first, aiming to predict the ice depth. Let's split our data randomly into training and testing sets, 80/20.
set.seed(123)
in_row <- sample(c(1:nrow(fair_ice_raw_filter)),size = nrow(fair_ice_raw_filter)*.2, replace = FALSE)
# drop the DATE column -- it is collinear with day length and not as good a predictor
fair_ice_train <- fair_ice_raw_filter[-c(in_row),-1]
fair_ice_test <- fair_ice_raw_filter[c(in_row),-1]
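Since caret is already loaded, the same holdout could also be drawn with createDataPartition(), which stratifies on the response; a sketch (not what was used for the results below):
# stratified 20% holdout on iceday, as an alternative to sample()
set.seed(123)
in_row_strat <- createDataPartition(fair_ice_raw_filter$iceday, p = 0.2, list = FALSE)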
Let's fit a simple linear model, with no adjustments:
lm_base = lm(iceday~., data=fair_ice_train)
summary(lm_base)
##
## Call:
## lm(formula = iceday ~ ., data = fair_ice_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.474 -5.503 -1.840 2.945 43.797
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.77004 0.45332 -12.728 <2e-16 ***
## PRCP 0.60463 1.27048 0.476 0.6342
## SNWD 1.01020 0.01724 58.594 <2e-16 ***
## SNOW -0.42372 0.18654 -2.271 0.0231 *
## AWND 0.09382 0.04467 2.100 0.0357 *
## TMAX 0.18482 0.01623 11.386 <2e-16 ***
## TMIN -0.42746 0.01653 -25.864 <2e-16 ***
## daylength 0.83374 0.04332 19.245 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.11 on 9314 degrees of freedom
## Multiple R-squared: 0.566, Adjusted R-squared: 0.5657
## F-statistic: 1735 on 7 and 9314 DF, p-value: < 2.2e-16
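Variance inflation factors would make the TMIN/TMAX collinearity concrete. Assuming the car package is installed (it isn't loaded above), a quick check would be:
# VIF > 10 is the usual rule of thumb for problematic collinearity
car::vif(lm_base)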
And our prediction with root mean squared error:
lm_base_pre <- predict(lm_base,fair_ice_test)
test <- data.frame(true = fair_ice_test$iceday, pre = lm_base_pre)
sqrt(mean((test$true - test$pre)^2))
## [1] 11.17688
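We'll be computing this quantity over and over below, so a tiny helper (a sketch; the original results were computed inline) keeps it in one place:
# root mean squared error between true and predicted values
rmse <- function(truth, pred){ sqrt(mean((truth - pred)^2)) }
# e.g. rmse(test$true, test$pre)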
Let's now do the weakest form of boosting we can: set negative predictions to 0, as there cannot be negative ice – or there can, but only if you like Kurt Vonnegut.
test$pre <- ifelse(test$pre <0,0,test$pre)
sqrt(mean((test$true - test$pre)^2))
## [1] 11.15571
Just a little better. This counts as boosting only in the loosest sense: we stacked a rule (clip negative predictions to 0) on top of the model.
Again, a predictive model with an off-the-charts p-value – not bad for a first try, with no data normalization. We see an adjusted R-squared of about 0.57, so we're explaining roughly 57% of the variation in ice depth with these variables. We'll return to multiple regression – and look at lasso and ridge after adjusting variables to try to make wind speed significant – but first let's try a few other machine learning models.
This is fine, but based on the histograms of the variables we'll likely want to rescale them in some way. First, can we transform our ice values toward normality? It appears not – the distribution stays very flat under both log and square-root transforms:
hist(log(fair_ice_raw_filter$iceday+1))
hist(sqrt(fair_ice_raw_filter$iceday))
Luckily temperature is already pretty normal. Let's look at snow, precipitation, and the others: same issue as ice depth, it seems – except average wind speed, which really likes a log scale. That makes sense, as winds tend to vary over orders of magnitude! We'll keep that transform – it will help us in the next steps.
hist(log(fair_ice_raw_filter$SNOW+1))
hist(log(fair_ice_raw_filter$PRCP+1))
hist(log(fair_ice_raw_filter$AWND+1))
# we have a winner for normalization
fair_ice_norm <- fair_ice_raw_filter
fair_ice_norm$AWND <- log(fair_ice_norm$AWND+1)
Before we do more, let's min-max scale the entire dataset. Min-max is a distribution-free way to put every variable on a common [0, 1] scale, useful here because our data is NOT normal and can't easily be transformed to be:
# textbook min-max normalization: rescales a vector to [0, 1]
normalize <- function(x){
  x <- as.numeric(x)
  return((x-min(x))/(max(x)- min(x)))
}
And its inverse, an unnormalize function for the response, so we can compare RMSE on the original scale:
# invert the min-max scaling using the iceday range from the raw data
unnormal_ice <- function(x){
  return(x*(max(fair_ice_raw_filter$iceday)-min(fair_ice_raw_filter$iceday))+min(fair_ice_raw_filter$iceday))
}
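A quick sanity check (a sketch) that the two functions really invert each other on the response:
# normalize then unnormalize should reproduce iceday up to floating point
all.equal(unnormal_ice(normalize(fair_ice_raw_filter$iceday)), fair_ice_raw_filter$iceday)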
Let's apply it, removing our date column (we'll add it back later):
fair_ice_norm <- as.data.frame((lapply(fair_ice_norm[,-1],normalize)))
And look at our corr plots again!
pairs.panels(fair_ice_norm)
Wind looks much better now with the log transform. The min-max scaling itself didn't change the shapes much, since we didn't really have outlier problems…
Next let's take a look at a PCA, which we can do now that the data is scaled.
colors_ice <- fair_ice_norm %>% mutate(isice = ifelse(fair_ice_norm$iceday == 0,1,0))
Ice.pca <- prcomp(fair_ice_norm)
summary(Ice.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 0.4839 0.2615 0.16071 0.10331 0.09614 0.06211 0.04134
## Proportion of Variance 0.6594 0.1927 0.07274 0.03006 0.02603 0.01086 0.00481
## Cumulative Proportion 0.6594 0.8521 0.92484 0.95490 0.98093 0.99179 0.99661
## PC8
## Standard deviation 0.03471
## Proportion of Variance 0.00339
## Cumulative Proportion 1.00000
Ice.pca$rotation
## PC1 PC2 PC3 PC4 PC5
## PRCP -0.01396858 0.003496728 -0.01785897 -0.001525217 -0.05738973
## SNWD 0.28039019 -0.266738665 -0.15984320 0.579908779 -0.68655259
## SNOW 0.02194143 0.009621602 -0.06817997 -0.021628355 -0.16854657
## AWND -0.22456750 -0.261935941 -0.91805953 -0.048794737 0.18713195
## TMAX -0.40773861 -0.144092846 0.07912659 -0.306152095 -0.41322425
## TMIN -0.43947684 -0.060894308 0.05358500 -0.379644024 -0.44006207
## daylength -0.59245712 -0.427630431 0.29447493 0.531934549 0.28934640
## iceday 0.39984763 -0.807958287 0.17550369 -0.374221478 0.11745381
## PC6 PC7 PC8
## PRCP -0.51022602 0.509162826 -0.690362532
## SNWD 0.10695269 0.070252899 0.025670522
## SNOW -0.79571823 -0.538717279 0.206198821
## AWND 0.02606341 0.006012066 -0.003310688
## TMAX 0.26432378 -0.469988748 -0.501483384
## TMIN -0.11101659 0.473196800 0.475664309
## daylength -0.10304169 -0.011682174 0.044514375
## iceday -0.03967865 0.024983737 0.022091535
autoplot(Ice.pca, data=colors_ice, colour = "isice", loadings = TRUE,
loadings.label = TRUE, loadings.label.size = 3)
## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
It looks like the principal components are telling us that snow depth and ice depth are related (which we knew), and that temperature, wind, and day length are all related – which makes sense, as longer days go with higher temperatures (summer, for example). The last two, precipitation and snowfall, aren't strongly related to anything here, but their data is sparse, as we saw. PC1 appears to contrast atmospheric weather with snow conditions, and PC2 appears to track the magnitude of thaw, as all variables decrease along it. Together they capture about 85% of the variation.
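A scree plot makes the same point visually; base R has this built in:
# variance explained by each principal component
screeplot(Ice.pca, type = "lines")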
Now that we have a baseline – an unoptimized multiple linear regression – we can build other models to compare against it.
Let's start from fair_ice_norm, the data with our min-max transform applied, and split it into training and test sets using the same randomizer (in_row) as before:
fair_ice_norm_train <- fair_ice_norm[-in_row,]
fair_ice_norm_test <- fair_ice_norm[in_row,]
Our first predictor, a basic neural network:
set.seed(123)
neural_basic_m <- neuralnet(iceday~.,data=fair_ice_norm_train)
Plot it:
plot(neural_basic_m, rep = "best")
That's a lot of error! Let's take a look at performance:
neural_basic_pre <- compute(neural_basic_m,fair_ice_norm_test)
and take a look at the results:
cor(neural_basic_pre$net.result,fair_ice_norm_test$iceday)
## [,1]
## [1,] 0.8025024
OK – not bad, but not great. For comparison, our simple, clip-boosted linear regression:
cor(test$pre,test$true)
## [1] 0.751322
It appears we made a model that does quite well – better than the unoptimized multiple regression, at the very least – though with a single hidden node it can be thought of as quite similar to one. Let's now kick it up a few notches and see if we can do MUCH better.
Let's try bagging: train on several smaller datasets and then recombine the predictions.
basic_neurons <- list()
for(i in 1:10){
  # take a random half of the training data
  # (note: fixed from the original, which sampled nrow(fair_ice_norm_test) rows
  # here -- the training set is what we mean to halve)
  set.seed(i*2)
  halfrow <- sample(c(1:nrow(fair_ice_norm_train)),size = nrow(fair_ice_norm_train)*.5, replace = FALSE)
  small_set <- fair_ice_norm_train[halfrow,]
  neural_tiny <- neuralnet(iceday~.,data=small_set)
  # add to a list
  basic_neurons[[paste0("Model",i)]] <- neural_tiny
}
i <- NULL
Next we predict with each model:
neural.bag.df <-data.frame(Id = c(1:nrow(fair_ice_norm_test)))
for(i in 1:10){
#unpack the list
neural_tiny_pre <- compute(basic_neurons[[i]],fair_ice_norm_test)
Neupre <- paste('Neupre', i, sep= '')
neural.bag.df[[Neupre]] <- unnormal_ice(neural_tiny_pre$net.result)
}
Then we row-sum and divide by 10 (i.e., average the ten predictions):
neural.bag.df <- neural.bag.df %>% rowwise() %>% mutate(finalpre = sum(c_across(Neupre1:Neupre10))/10)
Let's take a look at the RMSE of our bagged ensemble of 10 tiny networks:
sqrt(mean((test$true - neural.bag.df$finalpre)^2))
## [1] 13.84382
OK, that didn't work so well… not horrible, but not great either. Maybe neural networks aren't meant to be bagged, at least not this naively.
Next let's use a softplus (smooth ReLU) activation and a c(3,3) grid of hidden nodes – kept small to save some computation – and see if we can make the network even better! Then we'll compare it using RMSE, not simply cor().
Define the activation function:
softplus <- function(x) {log(1+exp(x))}
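One caveat worth noting: log(1+exp(x)) overflows for large x. A numerically stabler variant (a sketch; the original softplus is what the saved model below was trained with) is:
# rewrite as x + log1p(exp(-x)) for positive x to avoid overflow
softplus_stable <- function(x) {ifelse(x > 0, x + log1p(exp(-x)), log1p(exp(x)))}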
Now train the larger network (this takes a while):
set.seed(123)
neural_opt1_m <- neuralnet(iceday~.,data=fair_ice_norm_train, hidden = c(3,3), act.fct = softplus, stepmax=1e7)
IMPORTANT: save the model once, set the training chunk to eval=FALSE, and LOAD the saved model – this saves a lot of time in knitting:
# save the model to disk
saveRDS(neural_opt1_m, "./neural_opt1_m.rds")
# load the model local
#neural_opt1_m <- readRDS("./neural_opt1_m.rds")
# Online:
neural_opt1_m <- readRDS(url("https://docs.google.com/uc?id=1MB95WCCQexPWTq3CVgSh-zPjCWRy9vhv&export=download"))
plot the model:
plot(neural_opt1_m, rep = "best")
With this model we HALVED the error reported on the network plot!
Let's look at our prediction:
neural_opt1_pre <- compute(neural_opt1_m,fair_ice_norm_test)
and take a look at the results:
cor(neural_opt1_pre$net.result,fair_ice_norm_test$iceday)
## [,1]
## [1,] 0.8410116
Over 84% correlation! This is a great result. We could add more nodes, but first let's test it against our gold standard: the unoptimized, untransformed multiple regression. To do that we'll need to unnormalize the predictions and compare them to our test set; since we're using the same seed as before, we can simply:
test$pre_neural <- unnormal_ice(neural_opt1_pre$net.result)
sqrt(mean((test$true - test$pre_neural)^2))
## [1] 9.459324
# apply the same clip-negatives-to-zero boost:
test$pre_neural <- ifelse(test$pre_neural<0,0,test$pre_neural)
sqrt(mean((test$true - test$pre_neural)^2))
## [1] 9.437372
Again, this is incredible. Compare it to our zero-clipped punching-bag multiple regression:
sqrt(mean((test$true - test$pre)^2))
## [1] 11.15571
We took about 15% off the RMSE! OK, this is the model to beat at this point. Let's move on to our next model!
This one is a relatively straightforward application of the great ksvm() function plus some tuning; we can reuse our normalized dataset here too.
set.seed(123)
# NB: "kernal" is a typo for "kernel", so ksvm() silently falls back to its
# default Gaussian rbfdot kernel -- see the model summary below
SVM_basic_m <- ksvm(iceday ~ ., data = fair_ice_norm_train, kernal ="vanilladot")
Great, let's test it:
SVM_basic_m
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 1
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.373978976068998
##
## Number of Support Vectors : 4243
##
## Objective Function Value : -1839.932
## Training error : 0.205172
That's not too bad a training error. Let's look at predictive error:
SVM_basic_pre <- predict(SVM_basic_m,fair_ice_norm_test)
cor(fair_ice_norm_test$iceday,SVM_basic_pre)
## [,1]
## [1,] 0.880378
test$presvm1 <-unnormal_ice(SVM_basic_pre)
sqrt(mean((test$true - test$presvm1)^2))
## [1] 8.016093
And let's apply the clip-to-zero boost like before:
test$presvm1 <- ifelse(test$presvm1<0,0,test$presvm1)
sqrt(mean((test$true - test$presvm1)^2))
## [1] 8.00183
Even better accuracy.
Let's test it again, this time specifying the rbfdot kernel explicitly and raising the cost parameter:
set.seed(123)
SVM_rbf_m <- ksvm(iceday ~ ., data = fair_ice_norm_train, kernel = "rbfdot", C = 10)
SVM_rbf_m
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 10
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.373978976068998
##
## Number of Support Vectors : 4275
##
## Objective Function Value : -15762.34
## Training error : 0.179584
It's using the Gaussian kernel as before – recall the first fit's kernal= typo meant it defaulted to rbfdot too – so what we're really looking at is cost optimization alone; the kernel is not changing.
And the same procedure for looking at accuracy:
SVM_rbf_pre <- predict(SVM_rbf_m,fair_ice_norm_test)
cor(fair_ice_norm_test$iceday,SVM_rbf_pre)
## [,1]
## [1,] 0.8828462
test$presvm2 <-unnormal_ice(SVM_rbf_pre)
sqrt(mean((test$true - test$presvm2)^2))
## [1] 7.950857
# and boosted
test$presvm2 <- ifelse(test$presvm2<0,0,test$presvm2)
sqrt(mean((test$true - test$presvm2)^2))
## [1] 7.927145
Interesting – the higher cost helped a bit: the training error is lower, and the test RMSE drops from about 8.00 to 7.93.
Let's try to optimize this properly, looking at the error across a range of costs:
cost_values <- c(1, seq(from = 2 , to= 20, by = 2))
acc_svm_cost <- sapply(cost_values,function(x){
set.seed(123)
m <- ksvm(iceday ~ ., data = fair_ice_norm_train, kernel = "rbfdot", C = x)
pre <- predict(m,fair_ice_norm_test)
norm <- unnormal_ice(pre)
RMSE <- sqrt(mean((test$true - norm)^2))
return(RMSE)
}
)
# save the vector to disk
saveRDS(acc_svm_cost, "./acc_svm_cost.rds")
# load the vector local
#acc_svm_cost <- readRDS("./acc_svm_cost.rds")
# Online:
acc_svm_cost <-readRDS(url("https://docs.google.com/uc?id=1rywMkNY5rnvO64tiDnnYUPTSm2PvRJlv&export=download"))
And the plot:
plot(cost_values,acc_svm_cost,type = "b")
Interesting – there seems to be a sweet spot around C = 4 (and again near 10) where our model performs best. C = 4 will be our selected model!
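For a more systematic search, caret's train() can cross-validate over both C and sigma at once; a sketch (slow, and the grid values here are assumptions, not what was run for the results below):
# 5-fold CV over a small C/sigma grid using kernlab's radial SVM
svm_grid <- expand.grid(C = c(1, 4, 10), sigma = c(0.1, 0.37, 1))
svm_cv <- train(iceday ~ ., data = fair_ice_norm_train,
                method = "svmRadial", tuneGrid = svm_grid,
                trControl = trainControl(method = "cv", number = 5))
# svm_cv$bestTune would then give the chosen pair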
set.seed(123)
SVM_rbf2_m <- ksvm(iceday ~ ., data = fair_ice_norm_train, kernel = "rbfdot", C = 4)
SVM_rbf2_pre <- predict(SVM_rbf2_m,fair_ice_norm_test)
cor(fair_ice_norm_test$iceday,SVM_rbf2_pre)
## [,1]
## [1,] 0.8833217
test$presvm3 <-unnormal_ice(SVM_rbf2_pre)
sqrt(mean((test$true - test$presvm3)^2))
## [1] 7.92674
Our best model yet! I keep saying this, but in this case it's true! And again it can be clipped too:
test$presvm3 <- ifelse(test$presvm3<0,0,test$presvm3)
sqrt(mean((test$true - test$presvm3)^2))
## [1] 7.908099
First let's revisit our old punching bag, the unnormalized, unmodified linear regression:
summary(lm_base)
##
## Call:
## lm(formula = iceday ~ ., data = fair_ice_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.474 -5.503 -1.840 2.945 43.797
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.77004 0.45332 -12.728 <2e-16 ***
## PRCP 0.60463 1.27048 0.476 0.6342
## SNWD 1.01020 0.01724 58.594 <2e-16 ***
## SNOW -0.42372 0.18654 -2.271 0.0231 *
## AWND 0.09382 0.04467 2.100 0.0357 *
## TMAX 0.18482 0.01623 11.386 <2e-16 ***
## TMIN -0.42746 0.01653 -25.864 <2e-16 ***
## daylength 0.83374 0.04332 19.245 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.11 on 9314 degrees of freedom
## Multiple R-squared: 0.566, Adjusted R-squared: 0.5657
## F-statistic: 1735 on 7 and 9314 DF, p-value: < 2.2e-16
Next, simply reuse our normalized / log-transformed variables:
lm_normal <- lm(iceday~., data=fair_ice_norm_train)
summary(lm_normal)
##
## Call:
## lm(formula = iceday ~ ., data = fair_ice_norm_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.67822 -0.09434 -0.03128 0.05129 0.75396
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.239570 0.010294 23.272 <2e-16 ***
## PRCP 0.023082 0.049733 0.464 0.6426
## SNWD 0.947032 0.016178 58.540 <2e-16 ***
## SNOW -0.079530 0.038291 -2.077 0.0378 *
## AWND -0.009775 0.012231 -0.799 0.4242
## TMAX 0.441905 0.038950 11.346 <2e-16 ***
## TMIN -0.932861 0.036396 -25.631 <2e-16 ***
## daylength 0.266287 0.013472 19.766 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1915 on 9314 degrees of freedom
## Multiple R-squared: 0.5658, Adjusted R-squared: 0.5655
## F-statistic: 1734 on 7 and 9314 DF, p-value: < 2.2e-16
Hmm – the R-squared is essentially unchanged, and in this fit the log transform actually left wind speed non-significant. Since linear models, unlike the others, will need their RMSE and other stats computed again and again, let's make a quick function we can plug our linear models into:
get_stats <- function(model){
  pre <- predict(model,fair_ice_norm_test)
  cor <- cor(fair_ice_norm_test$iceday,pre)
  unnorm <- unnormal_ice(pre)
  RMSE <- sqrt(mean((test$true - unnorm)^2))
  # adjusted R-squared only exists for lm objects
  Rsq <- ifelse(class(model) == "lm",round(summary(model)$adj.r.squared,digits=5),"None")
  out <- paste("\n",deparse(substitute(model)), "Stats:\n","Cor:","\t",round(cor,digits=5),"\n","RMSE:","\t",round(RMSE,digits=5),"\n","adj.R.sqr:","\t",Rsq,"\n")
  return(cat(out))
}
And test it on our normalized lm and the other optimized models:
get_stats(lm_normal)
##
## lm_normal Stats:
## Cor: 0.74994
## RMSE: 11.17618
## adj.R.sqr: 0.5655
get_stats(SVM_rbf2_m)
##
## SVM_rbf2_m Stats:
## Cor: 0.88332
## RMSE: 7.92674
## adj.R.sqr: None
get_stats(neural_opt1_m)
##
## neural_opt1_m Stats:
## Cor: 0.84101
## RMSE: 9.45932
## adj.R.sqr: None
Great. Now that we have a starting point, let's move on to variable selection: drop one predictor at a time and see if anything improves:
for(i in 1:(ncol(fair_ice_norm_train)-1)){
  print(i)
  lm_normal_minus <- lm(iceday~., data=fair_ice_norm_train[,-i])
  get_stats(lm_normal_minus)
}
## [1] 1
##
## lm_normal_minus Stats:
## Cor: 0.74991
## RMSE: 11.17675
## adj.R.sqr: 0.56554
## [1] 2
##
## lm_normal_minus Stats:
## Cor: 0.61528
## RMSE: 13.32141
## adj.R.sqr: 0.4057
## [1] 3
##
## lm_normal_minus Stats:
## Cor: 0.7498
## RMSE: 11.17901
## adj.R.sqr: 0.56535
## [1] 4
##
## lm_normal_minus Stats:
## Cor: 0.74988
## RMSE: 11.17732
## adj.R.sqr: 0.56552
## [1] 5
##
## lm_normal_minus Stats:
## Cor: 0.74654
## RMSE: 11.24128
## adj.R.sqr: 0.55954
## [1] 6
##
## lm_normal_minus Stats:
## Cor: 0.73336
## RMSE: 11.48733
## adj.R.sqr: 0.53491
## [1] 7
##
## lm_normal_minus Stats:
## Cor: 0.73744
## RMSE: 11.41199
## adj.R.sqr: 0.54732
i <- NULL
Any better than the original? Not really – removing precipitation (the first column) moves the model only a tiny fraction, and compared to our other models this is hardly a success. Let's move on to lasso and ridge regression.
Luckily this process is not too difficult. In glmnet's parlance, alpha = 0 is pure ridge, alpha = 1 is pure lasso, and lambda sets the strength of the penalty.
#set up lambda levels
lambdalevels <- 10^seq(2, -3, by = -.1)
#use the package to find ideal lambda levels note: using only training set -- needs to be a matrix here
fair.lasso.0 = cv.glmnet(as.matrix(fair_ice_norm_train[,-8]),fair_ice_norm_train[,8], alpha=0, lambda = lambdalevels)
fair.lasso.5 = cv.glmnet(as.matrix(fair_ice_norm_train[,-8]),fair_ice_norm_train[,8], alpha=0.5, lambda = lambdalevels)
fair.lasso.1 = cv.glmnet(as.matrix(fair_ice_norm_train[,-8]),fair_ice_norm_train[,8], alpha=1, lambda = lambdalevels)
#pure ridge (alpha = 0)
bestlambda.lasso.0 <- fair.lasso.0$lambda.min
paste("Best lambda pure ridge:", bestlambda.lasso.0)
## [1] "Best lambda pure ridge: 0.001"
#half and half
bestlambda.lasso.5 <- fair.lasso.5$lambda.min
paste("Best lambda half lasso, half ridge:",bestlambda.lasso.5)
## [1] "Best lambda half lasso, half ridge: 0.001"
#pure lasso (alpha = 1)
bestlambda.lasso.1 <- fair.lasso.1$lambda.min
paste("Best lambda pure lasso:", bestlambda.lasso.1)
## [1] "Best lambda pure lasso: 0.001"
# predict with the pure ridge fit (alpha = 0)
pre.lasso.0 <- predict(fair.lasso.0$glmnet.fit, s=fair.lasso.0$lambda.min, newx=as.matrix(fair_ice_norm_test[,-8]))
paste("Cor Ridge:",cor(pre.lasso.0,test$true))
## [1] "Cor Ridge: 0.750056743592359"
rmse.lasso.0 <- sqrt(mean((test$true - unnormal_ice(pre.lasso.0))^2))
paste("RSME Ridge:",rmse.lasso.0)
## [1] "RSME Ridge: 11.1747910423885"
Ok better than before! We’ve increased the cor and kept a similar RSME.
# predict with the half-and-half (elastic net, alpha = 0.5) fit
pre.lasso.5 <- predict(fair.lasso.5$glmnet.fit, s=fair.lasso.5$lambda.min, newx=as.matrix(fair_ice_norm_test[,-8]))
paste("Cor half Ridge half Lasso:",cor(pre.lasso.5,test$true))
## [1] "Cor half Ridge half Lasso: 0.75003836877492"
rmse.lasso.5 <- sqrt(mean((test$true - unnormal_ice(pre.lasso.5))^2))
paste("RSME half Ridge half Lasso:",rmse.lasso.5)
## [1] "RSME half Ridge half Lasso: 11.1756384975234"
worse than the last one but ok.
# predict with the pure lasso (alpha = 1) fit
pre.lasso.1 <- predict(fair.lasso.1$glmnet.fit, s=fair.lasso.1$lambda.min, newx=as.matrix(fair_ice_norm_test[,-8]))
paste("Cor Lasso:",cor(pre.lasso.1,test$true))
## [1] "Cor Lasso: 0.749944733758354"
rmse.lasso.1 <- sqrt(mean((test$true - unnormal_ice(pre.lasso.1))^2))
paste("RSME Lasso:",rmse.lasso.1)
## [1] "RSME Lasso: 11.1780727444272"
and not as good as any of the others…
It appears that the ridge model – alpha 0 with a lambda of .001 “fair.lasso.0” worked the best here and will be the model to use going forward for multiple regression.
Now let's finally build a function that predicts what we actually WANT – the final day of ice on the river near Fairbanks – so we can claim our glorious $100K+ prize! Note: you can totally buy multiple tickets, but we don't want to spend too much money on guesses, so let's make a function that combines our models and gives us an estimate of when the ice will be ready to break, letting us spend our money wisely.
Luckily we've held 2021 out of all of this, so we can use this year as a true out-of-sample test.
First, let's turn our models into an ensemble function weighted by RMSE:
rmseVector <- c(rmse.lasso.0,sqrt(mean((test$true - test$pre_neural)^2)),sqrt(mean((test$true - test$presvm3)^2)))
ensemble <- function(norm.predf,ridge,neural,svm,rmseVector){
  # turn the RMSEs into weights: each model is weighted in proportion to 1/RMSE
  scales <- sapply(rmseVector,function(x){sum(rmseVector)/x})
  weights <- sapply(scales,function(x){x/sum(scales)})
  # make weighted predictions with each model (weight order: ridge, neural, svm)
  pre.ridge <- predict(ridge$glmnet.fit, s=ridge$lambda.min, newx=as.matrix(norm.predf[,-8]))*weights[1]
  SVM_pre <- predict(svm,norm.predf)*weights[3]
  neural_pre <- compute(neural,norm.predf)
  neural_pre <- neural_pre$net.result*weights[2]
  df <- data.frame(neural = neural_pre,svm = SVM_pre, ridge = pre.ridge)
  # sum the weighted predictions (the ridge column arrives named X1)
  sum <- df %>% rowwise() %>% mutate(m = sum(c(neural, svm, X1)))
  return(unnormal_ice(sum$m))
}
And test our ensemble function:
ensemble.pre <- ensemble(fair_ice_norm_test,fair.lasso.0,neural_opt1_m,SVM_rbf2_m,rmseVector)
sqrt(mean((test$true - ensemble.pre)^2))
## [1] 8.362569
Interesting – despite weighting by accuracy, the ensemble still couldn't beat the standalone optimized SVM on RMSE. Let's give it the same slight advantage again:
ensemble.pre <- ifelse(ensemble.pre<0,0,ensemble.pre)
sqrt(mean((test$true - ensemble.pre)^2))
## [1] 8.357897
Still not the best, but not bad – and likely more robust than any single model.
Finally – this is an optional step, but I think it'd be "valuable" – let's look at this year. The ice hasn't broken up yet, but it's close, and we have depth readings through April 1st, so we can measure the error of these methods on a brand-new set of depths.
Let's pull the ice data manually from the website https://www.nenanaakiceclassic.com/ice.htm
2021:
Jan 12 - 34 Inches
Feb 18 - 43 Inches
March 1 - 44.5 Inches
March 7 - 45.5 Inches
March 15 - 46.2 Inches
March 18 - 46 Inches
March 22 - 45.2 Inches
March 26 - 45.0 Inches
March 29 - 42.3 Inches
April 1 - 45.3 Inches
April 5 - 45.2 Inches
April 8 - 45 Inches
April 12 - 45.2 Inches
April 15 - 45 Inches
April 19 - 43.3 Inches
# table to join
Win2021 <- data.frame(Date = seq(as.Date("2020-11-26"), as.Date("2021-04-01"), by="days"),year = 2021)
# table to join
Ice2021 <- data.frame(Date = as.Date( c("2020-11-26","2021-01-12","2021-02-18","2021-03-01","2021-03-07","2021-03-15","2021-03-18","2021-03-22","2021-03-26","2021-03-29","2021-04-01")), Thick = c(0,34,43,44.5,45.5,46.2,46,45.2,45,42.3,45.2))
Win2021 <-left_join(Win2021,Ice2021,by = c("Date" = "Date"))
Win2021$iceday <- unlist(by(Win2021, Win2021$year, function(df) approx(df$Date, df$Thick, xout = df$Date)[2]))
Win2021 <-left_join(Win2021,fair_ice_raw_all,by = c("Date" = "DATE")) %>% rename(DATE = Date)%>% select(-c(Thick,iceday.y,year)) %>% rename(iceday = iceday.x)%>% relocate(iceday, .after = daylength)
#prepend the preceding fall 2020 days:
extra_days <- data.frame(Date = as.Date(seq(as.Date("2020-08-25"), as.Date("2020-11-25"), by="days")))
extra_days <- left_join(extra_days,fair_ice_raw_filter,by = c("Date" = "DATE"))
Win2021 <- bind_rows(extra_days,Win2021)
dates2021<- ifelse(is.na(Win2021$DATE),Win2021$Date,Win2021$DATE)
dates2021 <- as.Date(dates2021,origin = "1970-01-01")
Next apply the transforms and normalize the 2021 data against our original dataset's scale:
set_normalize <- function(x){
  # NB: this scales by the training-set iceday range, applied to every column
  x <- as.numeric(x)
  return((x-min(fair_ice_raw_filter$iceday))/(max(fair_ice_raw_filter$iceday)- min(fair_ice_raw_filter$iceday)))
}
# log transform wind, then normalize and drop the date columns
Win2021$AWND <- log(Win2021$AWND+1)
Win2021 <- as.data.frame((lapply(Win2021[,-1],set_normalize))) %>% select(-DATE)
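Caveat: as written, set_normalize() rescales every column by the range of the training-set iceday, which is right for the response but not for the predictors, and it may explain some of the odd out-of-sample behavior below. A per-column fix (a hypothetical sketch; Win2021_raw stands for the pre-normalization frame, with column names assumed to match fair_ice_raw_filter):
# scale each 2021 column by the same column's range in the training data
norm_by_train <- function(x, ref){
  (as.numeric(x) - min(ref, na.rm = TRUE)) / (max(ref, na.rm = TRUE) - min(ref, na.rm = TRUE))
}
# Win2021 <- as.data.frame(Map(norm_by_train, Win2021_raw, fair_ice_raw_filter[names(Win2021_raw)]))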
Great – now we have a new dataset to check whether our models hold up:
pre2021 <- data.frame(test = unnormal_ice(Win2021$iceday))
neural_2021_pre <- compute(neural_opt1_m,Win2021)
pre2021$Neural <- unnormal_ice(neural_2021_pre$net.result)
SVM_2021_pre <- predict(SVM_rbf2_m,Win2021)
pre2021$SVM <- unnormal_ice(SVM_2021_pre)
pre2021$ridge <- unnormal_ice(predict(fair.lasso.0$glmnet.fit, s=fair.lasso.0$lambda.min, newx=as.matrix(Win2021[,-8])))
pre2021$ensamb <- ensemble(Win2021,fair.lasso.0,neural_opt1_m,SVM_rbf2_m,rmseVector)
And lastly, plot the models against the interpolated 2021 depths (red):
#add back dates
pre2021$Date <- dates2021
ggplot(data=pre2021, aes(x=Date)) +
geom_line(aes(y = test), color = "red") +
geom_line(aes(y = Neural), color="blue") +
geom_line(aes(y = SVM), color="green") +
geom_line(aes(y = ridge), color="orange") +
geom_line(aes(y = ensamb), color="purple")
While our models might be well optimized to our training data, they are apparently not great at modeling a new year – except for the neural network (blue). Particularly bad is our SVM (green): it is predicting ice in September here, which suggests it over-fit the training set (and the per-column normalization caveat flagged above won't have helped). It may be worth adding a smoothing or minimum-pass filter here as well.
In conclusion, there are many good machine learning algorithms here – all of them distinguish summer from winter months – but some appear to overfit the training data, meaning they fare worse in a real-world situation. This can be improved by bagging the data, as we briefly did for neural networks, or by boosting via filters and ensemble functions. It appears we still have a lot to do to build a great lottery guesser! But that isn't to say we didn't do a good job here building robust functions to model a real-world dynamic system.
Thanks for looking over my project, and for a great last semester. I've learned a lot and come away with a far more nuanced view of data science – and, better still, more actionable tools to apply to difficult problems, like predicting the thickness of river ice.
-Nate