# linear modeling
library(MASS)
library(car)
library(leaps)
library(EnvStats)
# neural-net modeling
library(caret)
library(nnet)
library(NeuralNetTools)
# data input
library(readr)
library(readxl)
# data organization
library(tidyr)
library(dplyr)
# plotting
library(ggformula)
The following is a modeling question based on the Addiction solitaire game, available as an app (see the information for Google Play or for Apple).
Scores are produced for each game played. The working assumption is that the score is a fairly simple combination of moves, the number of cards moved (counting 1 per “undo” and 1 per “shuffle”), and time, the number of seconds the game takes to complete. While the scoring appears to be deterministic (any games with the exact same moves and time produce the same score), the formula for the computation is not obvious. We will look at the following question: how can we combine the number of moves (a count) and the time spent (in seconds) to directly compute the score?
Data is pre-processed and organized. Visual assessment is used to explore the proposed nature of the score computation. Finally, linear models, neural nets, and a simpler nonlinear method (linear model with transformed predictor terms) are considered for modeling / identifying the deterministic computation of scores.
Main takeaway: more complex methods (such as neural nets) are not magic! In this situation, using a linear combination of properly transformed predictors does a near-perfect job identifying the (deterministic / exact) relationship. So, in this case, clever (and informed) application of a simpler model works better!
Data can be downloaded from this Google sheet (first tab).
gameresults = read_excel("AddictionResults.xlsx")
First, identify the unique observed values, timesused and movesused, so that all pairings of time and moves can be examined.
timesused <- unique(gameresults$time)
timesused <- timesused[order(timesused)]
movesused <- unique(gameresults$moves)
movesused <- movesused[order(movesused)]
n_gamesused <- length(movesused)*length(timesused)
checkresults <- data.frame(time=rep(NA,n_gamesused),
moves=rep(NA,n_gamesused),
total_games=rep(NA,n_gamesused),
avg_score=rep(NA,n_gamesused),
sd_score=rep(NA,n_gamesused))
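As an aside, the grid of all (time, moves) pairings can also be built more compactly with tidyr::crossing; this is only a sketch (the name checkgrid is illustrative) and sets up the pairing columns, not the summary columns:
checkgrid <- crossing(time = timesused, moves = movesused)  # one row per (time, moves) pairing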
Next, compile the number of games, along with the mean and standard deviation of scores, for each pairing of time and moves. There will not be observed games for most of the combinations.
locresults <- 1
for (i in 1:length(timesused)) {
  for (j in 1:length(movesused)) {
    whichsame <- (gameresults$time == timesused[i]) & (gameresults$moves == movesused[j])
    if (sum(whichsame) > 0) {
      samegames <- gameresults[whichsame,]
      allscore <- samegames$score
      checkresults[locresults,] <- c(timesused[i], movesused[j],
                                     length(allscore),
                                     mean(allscore),
                                     ifelse(length(allscore)==1, 0,
                                            sd(allscore)))
      locresults <- locresults + 1
    } else {
      checkresults[locresults,c(1,2)] <- c(timesused[i], movesused[j])
      locresults <- locresults + 1
    }
  }
}
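For reference, here is a dplyr sketch that produces the same per-pairing summaries but keeps only the pairings that were actually observed (the object name checkresults_obs is illustrative):
checkresults_obs <- gameresults |>
  group_by(time, moves) |>
  summarise(total_games = n(),
            avg_score = mean(score),
            sd_score = ifelse(n() == 1, 0, sd(score)),  # 0 when only one game, matching the loop above
            .groups = "drop")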
Check that the number of compiled games equals the total number of games:
sum(checkresults$total_games, na.rm=T) == dim(gameresults)[1]
## [1] TRUE
Check that every (non-missing) row has scores with a standard deviation of 0 (to verify consistency in scoring):
table(checkresults$sd_score)
##
## 0
## 161
The total number of games is 185, and the total number of unique scoring situations is 161.
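These counts can be confirmed directly (a quick sketch):
nrow(gameresults)                    # total games played, 185 per the text above
sum(!is.na(checkresults$avg_score))  # unique (time, moves) scoring situations, 161 per the text above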
maxrepeats = max(checkresults$total_games, na.rm = T)
gf_histogram(~total_games, data=checkresults,
bins = maxrepeats,
fill = rainbow(maxrepeats),
color = "black")
table(checkresults$total_games)
##
## 1 2 3 4
## 140 19 1 1
Include unique games only:
uniqueresults <- checkresults[!is.na(checkresults$avg_score),]
Reorganize and rename: since the baseline score is 13,400 (for getting all the rows completed), define scoreadj to be (score \(-\) 13,400) and use that as the response to be predicted from time and moves.
cardresults <- uniqueresults |> # start with unique game results
select(time, moves, avg_score) |> # select important / useful variables
rename(score = avg_score) |> # consistent game results
mutate(scoreadj = score-13400) |> # define scoreadj to remove baseline effect
arrange(score)
str(cardresults)
## 'data.frame': 161 obs. of 4 variables:
## $ time : num 213 89 85 82 99 72 69 69 67 68 ...
## $ moves : num 34 44 45 41 36 44 46 43 44 43 ...
## $ score : num 24771 27163 27386 29160 29190 ...
## $ scoreadj: num 11371 13763 13986 15760 15790 ...
The vast majority of games have scores \(\le 65{,}000\) points; only 4.9% of games score above 65,000 (and these are influential, so care must be taken with modeling choices).
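That proportion can be confirmed with a one-line sketch:
mean(cardresults$score > 65000)  # should be about 0.049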
maxlim = ceiling(max(cardresults$score)/5000)+1
gf_histogram(~score, data=cardresults,
color = "black", breaks=(3:maxlim)*5000)
Relationship of score to time: A simple plot of score against time, marked / grouped by moves, shows monotonically decreasing behavior. Specifically, for any value of moves with 3 or more observations, a decreasing, concave-up pattern of score across time is evident. So, while longer times result in lower scores, the relationship is clearly nonlinear on the original scale.
gf_line(scoreadj ~ time, color=~factor(moves), data=arrange(cardresults, time), size=2)
Relationship of score to moves: A similarly simple plot of score against moves, marked / grouped by time, shows monotonically decreasing behavior. Specifically, for any value of time with 3 or more observations, a decreasing, concave-up pattern of score across moves is evident. So, while more moves result in lower scores, the relationship is clearly nonlinear on the original scale.
gf_line(scoreadj ~ moves, color=~factor(time), data=arrange(cardresults, moves), size=2)
We begin with an interaction model: \(score = \beta_0 + \beta_1 time + \beta_2 moves + \beta_3 time\cdot moves + error\)
However, as noted before, the “baseline” score is 13,400 for completing the game, so we are going to model the response scoreadj via an interaction model without an intercept: \(scoreadj = \beta_1 time + \beta_2 moves + \beta_3 time \cdot moves + error\)
An initial interaction regression model suggests that a response transformation is clearly needed:
lmfit_interact_init <- lm(scoreadj ~ time + moves + I(time*moves) -1,
data=cardresults)
BC_interact <- MASS::boxcox(lmfit_interact_init)
transform_interact <- round(BC_interact$x[which.max(BC_interact$y)],2)
As a checkpoint, we compare the fitted values to the observed (adjusted) scores:
cardresults <- cardresults |> mutate(yhat_interact_init = lmfit_interact_init$fitted.values)
gf_point(scoreadj ~ yhat_interact_init, data=cardresults) + geom_abline(slope=1, intercept=0)
MAE_interact_init = mean(abs(cardresults$scoreadj-cardresults$yhat_interact_init))
The MAE for the interaction model with untransformed (adjusted) score is 5518.81, and we see a clear curved deviation of predicted from observed values.
Next, we fit the interaction model with the transformed response (to the power of -1.8).
cardresults <- cardresults |> mutate(scoreadj_trans_int = scoreadj^(transform_interact))
lmfit_interact <- lm(scoreadj_trans_int ~ time + moves + I(time*moves) -1,
data=cardresults)
yhattrans_interact = lmfit_interact$fitted.values
cardresults <- cardresults |> mutate(yhat_interact = yhattrans_interact^(1/transform_interact))
gf_point(scoreadj ~ yhat_interact, data=cardresults) + geom_abline(slope=1, intercept=0)
MAE_interact = mean(abs(cardresults$scoreadj-cardresults$yhat_interact))
On the transformed scale, the regression model explains about 99.445% of the variability in the transformed response.
For comparison with the prior results, we back-transform the predicted response to the original scale, giving an MAE (measured in the original units) of 2459.86 for the interaction model with transformed (adjusted) score.
We next consider a full quadratic model: \(score = \beta_0 + \beta_1 time + \beta_2 moves + \beta_3 time\cdot moves + \beta_4 time^2 + \beta_5 moves^2 + error\)
However, as noted before, the “baseline” score is 13,400 for completing the game, so we are going to model the response scoreadj via a quadratic model without an intercept: \(scoreadj = \beta_1 time + \beta_2 moves + \beta_3 time \cdot moves + \beta_4 time^2 + \beta_5 moves^2 + error\)
An initial quadratic regression model suggests that a response transformation is clearly needed:
lmfit_quad_init <- lm(scoreadj ~ time + moves + I(time*moves)
+ I(time^2) + I(moves^2)-1,
data=cardresults)
BC_quad <- MASS::boxcox(lmfit_quad_init)
transform_quad <- round(BC_quad$x[which.max(BC_quad$y)],2)
As a checkpoint, we again compare the fitted values to the observed (adjusted) scores:
cardresults <- cardresults |> mutate(yhat_quad_init = lmfit_quad_init$fitted.values)
gf_point(scoreadj ~ yhat_quad_init, data=cardresults) + geom_abline(slope=1, intercept=0)
MAE_quad_init = mean(abs(cardresults$scoreadj-cardresults$yhat_quad_init))
The MAE for the quadratic model with untransformed (adjusted) score is 2803.02, and the predictions appear to hit an upper bound relative to the observed values.
Next, we fit the quadratic model with the transformed response (to the power of -1.43).
cardresults <- cardresults |> mutate(scoreadj_trans_quad = scoreadj^(transform_quad))
lmfit_quad <- lm(scoreadj_trans_quad ~ time + moves + I(time*moves)
+ I(time^2) + I(moves^2)-1,
data=cardresults)
yhattrans_quad = lmfit_quad$fitted.values
cardresults <- cardresults |> mutate(yhat_quad = yhattrans_quad^(1/transform_quad))
gf_point(scoreadj ~ yhat_quad, data=cardresults) + geom_abline(slope=1, intercept=0)
MAE_quad = mean(abs(cardresults$scoreadj-cardresults$yhat_quad))
On the transformed scale, the regression model explains about 99.965% of the variability in the transformed response.
For comparison with the prior results, we back-transform the predicted response to the original scale, giving an MAE (measured in the original units) of 462.57 for the quadratic model with transformed (adjusted) score.
This is looking much better, but there is still increasing variability of the scores across the predicted values.
A check for outliers shows that high-moves data are
identified as unusual:
plot(lmfit_quad, which=1, id.n=5)
#outlierTest(lmfit_quad)
adjOutliers <- rosnerTest(lmfit_quad$residuals, k=5)
The data identified as outlying is shown below:
cardresults |>
slice(adjOutliers$all.stats$Obs.Num[adjOutliers$all.stats$Outlier]) |>
select(time, moves, score, scoreadj)
## time moves score scoreadj
## 1 85 45 27386 13986
## 2 89 44 27163 13763
moves_breaks = max(cardresults$moves)-min(cardresults$moves)+1
Compare this to the histogram of all values of
moves:
gf_histogram(~moves, data=cardresults,
fill="skyblue", color="black",
bins=moves_breaks)
Additionally, there is very clear evidence that high-time data are (overwhelmingly) identified as influential:
plot(lmfit_quad, which=5, id.n=5)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
CooksD <- cooks.distance(lmfit_quad)
max(CooksD) # values > 0.5 are suspected influential; values > 1 are almost surely influential
## [1] 12.9099
The data identified as influential are shown below:
cardresults |>
filter(CooksD > 0.5) |>
select(time, moves, score, scoreadj)
## time moves score scoreadj
## 1 213 34 24771 11371
## 2 89 44 27163 13763
## 3 85 45 27386 13986
time_breaks = max(cardresults$time)-min(cardresults$time)+1
Compare this to the histogram of all values of time:
gf_histogram(~time, data=cardresults,
fill="pink3", color="black",
bins=time_breaks)
So what other options are available?
A “hot” topic right now is deep learning, which makes use of the modeling tool of neural nets. We consider a simple version of neural nets, with only one layer but multiple nodes.
We use a more complex model-selection process, applying the train function from the caret package. As described in the caret documentation of available models, this simple neural-net application can optionally adjust two tuning parameters: size (the number of nodes in the hidden layer) and decay (used for weight decay). Further documentation about the function's arguments can be found in the online help file for nnet.
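For readers unfamiliar with nnet itself, here is a minimal single fit outside of caret; the size and decay values here are arbitrary placeholders, not tuned choices:
direct_nnet <- nnet(scoreadj ~ time + moves, data = cardresults,
                    size = 5, decay = 0.1, linout = TRUE, trace = FALSE, maxit = 500)
mean(abs(cardresults$scoreadj - as.numeric(fitted(direct_nnet))))  # in-sample MAE for this one fit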
We begin by specifying the model-selection process (using 5-fold cross-validation to obtain honest predictions):
set.seed(42)
ctrl = trainControl(method = "cv", number = 5)
We then consider six possible options for fitting neural nets:
nnetfit_add_orig <- train(scoreadj ~ time + moves,
data = cardresults,
method = "nnet",
tuneGrid = expand.grid(size = c(1:10), decay = c(1:10)/10),
# preProc = c("center", "scale"),
linout =T, trace = FALSE,
metric="MAE", trControl = ctrl)
The minimum MAE is 945.7273425, using tuning parameters
size = 7 and decay = 0.1.
nnetfit_add_stand <- train(scoreadj ~ time + moves,
data = cardresults,
method = "nnet",
tuneGrid = expand.grid(size = c(1:10), decay = c(1:10)/10),
preProc = c("center", "scale"),
linout =T, trace = FALSE,
metric="MAE", trControl = ctrl)
The minimum MAE is 1123.5407581, using tuning parameters
size = 2 and decay = 0.5.
nnetfit_int_orig <- train(scoreadj ~ time + moves + I(time*moves),
data = cardresults,
method = "nnet",
tuneGrid = expand.grid(size = c(1:10), decay = c(1:10)/10),
# preProc = c("center", "scale"),
linout =T, trace = FALSE,
metric="MAE", trControl = ctrl)
The minimum MAE is 1583.7632872, using tuning parameters
size = 9 and decay = 0.5.
nnetfit_int_stand <- train(scoreadj ~ time + moves + I(time*moves),
data = cardresults,
method = "nnet",
tuneGrid = expand.grid(size = c(1:10), decay = c(1:10)/10),
preProc = c("center", "scale"),
linout =T, trace = FALSE,
metric="MAE", trControl = ctrl)
The minimum MAE is 1228.3833052, using tuning parameters
size = 1 and decay = 0.2.
nnetfit_quad_orig <- train(scoreadj ~ time + moves + I(time*moves) + I(time^2) + I(moves^2),
data = cardresults,
method = "nnet",
tuneGrid = expand.grid(size = c(1:10), decay = c(1:10)/10),
# preProc = c("center", "scale"),
linout =T, trace = FALSE,
metric="MAE", trControl = ctrl)
The minimum MAE is 2161.3607821, using tuning parameters
size = 6 and decay = 0.1.
nnetfit_quad_stand <- train(scoreadj ~ time + moves + I(time*moves) + I(time^2) + I(moves^2),
data = cardresults,
method = "nnet",
tuneGrid = expand.grid(size = c(1:10), decay = c(1:10)/10),
preProc = c("center", "scale"),
linout =T, trace = FALSE,
metric="MAE", trControl = ctrl)
The minimum MAE is 2150.0260011, using tuning parameters
size = 3 and decay = 0.3.
Combining the fits and the minimum metrics:
allmodels = list(nnetfit_add_orig,
nnetfit_add_stand,
nnetfit_int_orig,
nnetfit_int_stand,
nnetfit_quad_orig,
nnetfit_quad_stand )
allminMAE = c(min(nnetfit_add_orig$results$MAE),
min(nnetfit_add_stand$results$MAE),
min(nnetfit_int_orig$results$MAE),
min(nnetfit_int_stand$results$MAE),
min(nnetfit_quad_orig$results$MAE),
min(nnetfit_quad_stand$results$MAE) )
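A quick side-by-side view of the six cross-validated minimum MAEs (a sketch; the model labels are mine):
data.frame(model = c("add_orig", "add_stand", "int_orig", "int_stand",
                     "quad_orig", "quad_stand"),
           min_MAE = round(allminMAE, 1))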
We can identify that the best neural net is the additive model on the original-unit scale.
We store the best model, and look at the metric values across tuning parameter options:
nnetfit_selected <- allmodels[[which.min(allminMAE)]]
gf_line(MAE ~ decay, color=~factor(size), data=nnetfit_selected$results, size=2)
A visualization of the neural network is shown below:
par(mar = c(.1, .1, .1, .1))
plotnet(nnetfit_selected)
par(mar = c(2,2,2,2))
MAE_nnet = mean(abs(cardresults$scoreadj-nnetfit_selected$finalModel$fitted.values))
This model has a re-assessed MAE of 1025.93, which is relatively good (though not as good as the best linear model).
However, a plot of the observed (adjusted) score versus the predictions shows a wavy pattern, which means we have not yet identified the deterministic relationship, even with a much more complex mathematical model.
cardresults <- cardresults |> mutate(yhat_nnet = nnetfit_selected$finalModel$fitted.values)
gf_point(scoreadj ~ yhat_nnet, data=cardresults) + geom_abline(slope=1, intercept=0)
All of the models fit only within the observed range. However, problems arise if we move outside that range. We start by considering how the adjusted score (scoreadj) behaves across time, for different values of moves:
maxadjScore = max(cardresults$scoreadj)+1000
gf_line(scoreadj ~ time, color=~factor(moves), data=arrange(cardresults, time), size=2) +
ylim(0,maxadjScore) + xlim(0,300)
There is a clear curved pattern over time from about 15 to 100 seconds. But what happens when we move into higher times (which are possible to obtain)?
For example, we can readily visualize the pattern of
adjscore vs time for values of
moves (27, 33, or 40 card-moves) in which we have multiple
game results:
multiple_results <- cardresults |> filter(moves %in% c(27, 33, 40)) |> select(1:4)
table(multiple_results$moves)
##
## 27 33 40
## 9 11 9
gf_point(scoreadj ~ time, color=~factor(moves), data=arrange(multiple_results, time), size=2)+
ylim(0,maxadjScore) + xlim(0,300) +
scale_color_manual(values=c("green", "gray50", "blue"))
However, when we add in the fitted best model: \(\hat{scoreadj} = \left( \frac{1}{10^{12}} \left(\hat{\beta}_1 \, time + \hat{\beta}_2 \, moves + \hat{\beta}_3 \, time \cdot moves + \hat{\beta}_4 \, time^2 + \hat{\beta}_5 \, moves^2 \right)\right)^{\frac{1}{-1.56}}\)
with \(\hat{\beta}_1\) = -6517.9304, \(\hat{\beta}_2\) = 8127.3984, \(\hat{\beta}_3\) = 551.5504, \(\hat{\beta}_4\) = -20.5887, and \(\hat{\beta}_5\) = -312.9015, the fitted curves at fixed values of moves are:
at moves = 27: \(\hat{scoreadj} = \left( \frac{1}{10^{12}} \left( -8665.44 + 8373.93 \, time - 20.59 \, time^2 \right)\right)^{\frac{1}{-1.56}}\)
at moves = 33: \(\hat{scoreadj} = \left( \frac{1}{10^{12}} \left( -72545.59 + 11683.23 \, time - 20.59 \, time^2 \right)\right)^{\frac{1}{-1.56}}\)
at moves = 40: \(\hat{scoreadj} = \left( \frac{1}{10^{12}} \left( -175546.5 + 15544.09 \, time - 20.59 \, time^2 \right)\right)^{\frac{1}{-1.56}}\)
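The plots below overlay these curves using helper functions fun27, fun33, and fun40, whose defining chunk is not shown above. A minimal sketch that reproduces equivalent curves directly from lmfit_quad (rather than from hand-copied coefficients) would be:
make_fun <- function(m) {
  # returns a function of time giving the back-transformed prediction at a fixed moves value
  function(t) predict(lmfit_quad, newdata = data.frame(time = t, moves = m))^(1/transform_quad)
}
fun27 <- make_fun(27)
fun33 <- make_fun(33)
fun40 <- make_fun(40)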
This looks quite good over the original range, and even quite a ways beyond:
gf_point(scoreadj ~ time, color=~factor(moves), data=arrange(multiple_results, time), size=3, shape=16)+
ylim(0,maxadjScore) + xlim(0,300) +
scale_color_manual(values=c("green", "gray50", "blue")) +
stat_function(fun=fun27, col="green", lwd=1) +
stat_function(fun=fun33, col="gray50", lwd=1) +
stat_function(fun=fun40, col="blue", lwd=1)
However, because the fit is a quadratic function of time on the transformed scale, the back-transformed fitted pattern eventually curves back up:
gf_point(scoreadj ~ time, color=~factor(moves), data=arrange(multiple_results, time), size=3, shape=16)+
ylim(0,maxadjScore) + xlim(0,1000) +
scale_color_manual(values=c("green", "gray50", "blue")) +
stat_function(fun=fun27, col="green", lwd=1) +
stat_function(fun=fun33, col="gray50", lwd=1) +
stat_function(fun=fun40, col="blue", lwd=1)
This behavior does not align with (presumed) monotonic decrease in (adjusted) score as time increases.
So what else can we do?
So, it doesn’t seem to work to transform just the response. What if we were to consider transforming one (or both) of the predictor variables?
Predictor time: Consider the relationship of scoreadj vs time for fixed values of moves. Can we come up with a transformation of time such that the relationships look linear?
gf_line(scoreadj ~ time, color=~factor(moves), data=arrange(cardresults, time), size=1) + geom_point(size=2)
gf_line(scoreadj ~ log(time), color=~factor(moves), data=arrange(cardresults, time), size=1) + geom_point(size=2)
gf_line(scoreadj ~ 1/(time), color=~factor(moves), data=arrange(cardresults, time), size=1) + geom_point(size=2)
gf_line(scoreadj ~ 1/sqrt(time), color=~factor(moves), data=arrange(cardresults, time), size=1) + geom_point(size=2)
The inverse relationship looks very linear. Let's check the pattern of scoreadj vs 1/time for a value of moves (34 card-moves) for which we have multiple game results:
multiple_results <- cardresults |> filter(moves == 34) |> select(1:4)
table(multiple_results$moves)
##
## 34
## 9
gf_line(scoreadj ~ 1/(time), color=~factor(moves), data=arrange(multiple_results, time), size=1) + geom_point(size=2)
lmfit_invtime <- lm(scoreadj ~ I(1/(time)), data=multiple_results)
summary(lmfit_invtime)$r.squared
## [1] 1
This is suggesting an essentially perfect fit.
So, we will revise modeling to use the inverse-transform of
time.
Predictor moves: Consider the relationship of scoreadj vs moves for fixed values of time. Can we come up with a transformation of moves such that the relationships look linear?
gf_line(scoreadj ~ moves, color=~factor(time), data=arrange(cardresults, moves), size=1) + geom_point(size=2)
gf_line(scoreadj ~ log(moves), color=~factor(time), data=arrange(cardresults, moves), size=1) + geom_point(size=2)
gf_line(scoreadj ~ 1/(moves), color=~factor(time), data=arrange(cardresults, moves), size=1) + geom_point(size=2)
gf_line(scoreadj ~ 1/sqrt(moves), color=~factor(time), data=arrange(cardresults, moves), size=1) + geom_point(size=2)
This one is a little less clear. Let's check the pattern of scoreadj vs log(moves) for a value of time (44 seconds) for which we have multiple game results:
multiple_results <- cardresults |> filter(time == 44) |> select(1:4)
table(multiple_results$time)
##
## 44
## 6
lmfit_logmoves <- lm(scoreadj ~ I(log(moves)), data=multiple_results)
gf_line(scoreadj ~ log(moves), color=~factor(time), data=cardresults, size=1) +
geom_point(size=2) + geom_abline(slope=lmfit_logmoves$coef[2], intercept=lmfit_logmoves$coef[1])
summary(lmfit_logmoves)$r.squared
## [1] 0.9999459
predpower = -1/4
lmfit_powermoves <- lm(scoreadj ~ I((moves)^predpower), data=multiple_results)
gf_line(scoreadj ~ (moves^predpower), color=~factor(time), data=cardresults, size=1) +
geom_point(size=2) + geom_abline(slope=lmfit_powermoves$coef[2], intercept=lmfit_powermoves$coef[1])
summary(lmfit_powermoves)$r.squared
## [1] 0.9999337
Each is a very good, though not quite perfect, fit. We will try the inverse-quarter-root transform of moves, though we may need to revisit this choice.
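As a side-by-side check (a sketch run on the same time == 44 subset; the list name cand is mine), the R-squared values for several candidate transforms of moves can be compared directly:
cand <- list(log = function(m) log(m),
             inv = function(m) 1/m,
             inv_sqrt = function(m) 1/sqrt(m),
             inv_quarter = function(m) m^(-1/4))
sapply(cand, function(f) summary(lm(scoreadj ~ f(moves), data = multiple_results))$r.squared)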
Based on exploration, the following transformations of predictors were added:
cardresults <- cardresults |> mutate(time_inv_2 = 1/time^2,
time_inv = 1/time,
time_inv_sqrt = 1/(time^.5),
time_inv_quarter = 1/(time^.25),
time_log = log(time),
moves_inv_2 = 1/moves^2,
moves_inv = 1/moves,
moves_inv_sqrt = 1/(moves^.5),
moves_inv_quarter = 1/(moves^.25),
moves_log = log(moves)
)
We first try the interaction model without an intercept, using transformed predictors:
\(adjscore = \beta_1 \frac{1}{time} + \beta_2 \frac{1}{moves^{1/4}} + \beta_3 \frac{1}{(time)(moves^{1/4})} + error\)
lmfit_transx14_pred <- lm(scoreadj ~ time_inv + moves_inv_quarter + I(time_inv*moves_inv_quarter)-1,
data=cardresults)
summary(lmfit_transx14_pred)
##
## Call:
## lm(formula = scoreadj ~ time_inv + moves_inv_quarter + I(time_inv *
## moves_inv_quarter) - 1, data = cardresults)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4330.0 -424.0 173.8 570.4 1945.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## time_inv -772949.2 87257.0 -8.858 1.59e-15 ***
## moves_inv_quarter 16103.8 758.3 21.236 < 2e-16 ***
## I(time_inv * moves_inv_quarter) 4252241.9 187596.0 22.667 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 965.2 on 158 degrees of freedom
## Multiple R-squared: 0.9992, Adjusted R-squared: 0.9992
## F-statistic: 6.647e+04 on 3 and 158 DF, p-value: < 2.2e-16
#MASS::boxcox(lmfit_transx14_pred)
#plot(lmfit_transx14_pred)
cardresults <- cardresults |> mutate(yhat_xtrans14 = lmfit_transx14_pred$fitted.values)
gf_point(scoreadj ~ yhat_xtrans14, data=cardresults) + geom_abline(slope=1, intercept=0)
MAE_xtrans14 = mean(abs(cardresults$scoreadj-cardresults$yhat_xtrans14)); MAE_xtrans14
## [1] 704.3861
This helped quite a bit: the MAE (measured in the original units) for the interaction model with transformed predictors is 704.39, well below the earlier interaction fits.
But it seems like we should be able to get a better fit, especially since this should be a deterministic relationship.
We are going to consider a “best” model for each of various subset sizes. Based on the univariate explorations, the candidate pool consists of 10 transformed predictors (time_inv_2, time_inv, time_inv_sqrt, time_inv_quarter, time_log, moves_inv_2, moves_inv, moves_inv_sqrt, moves_inv_quarter, and moves_log) plus interaction terms; in the chunk below, the log terms and the interactions are commented out, leaving 8 candidate predictors. We will identify the best model by the Bayes Information Criterion (BIC).
modeltries <- regsubsets(scoreadj ~ time_inv_2 + time_inv + time_inv_sqrt + time_inv_quarter #+ time_log
+ moves_inv_2 + moves_inv + moves_inv_sqrt + moves_inv_quarter #+ moves_log
# + time_inv:moves_inv_2 + time_inv:moves_inv + time_inv:moves_inv_sqrt + time_inv:moves_inv_quarter + time_inv:moves_log
# + time_log:moves_inv_2 + time_log:moves_inv + time_log:moves_inv_sqrt + time_log:moves_inv_quarter + time_log:moves_log,
, data = cardresults, nvmax = 7)
summary(modeltries)
## Subset selection object
## Call: regsubsets.formula(scoreadj ~ time_inv_2 + time_inv + time_inv_sqrt +
## time_inv_quarter + moves_inv_2 + moves_inv + moves_inv_sqrt +
## moves_inv_quarter, data = cardresults, nvmax = 7)
## 8 Variables (and intercept)
## Forced in Forced out
## time_inv_2 FALSE FALSE
## time_inv FALSE FALSE
## time_inv_sqrt FALSE FALSE
## time_inv_quarter FALSE FALSE
## moves_inv_2 FALSE FALSE
## moves_inv FALSE FALSE
## moves_inv_sqrt FALSE FALSE
## moves_inv_quarter FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: exhaustive
## time_inv_2 time_inv time_inv_sqrt time_inv_quarter moves_inv_2
## 1 ( 1 ) " " "*" " " " " " "
## 2 ( 1 ) " " "*" " " " " " "
## 3 ( 1 ) " " "*" " " " " "*"
## 4 ( 1 ) " " "*" " " " " " "
## 5 ( 1 ) " " "*" " " " " "*"
## 6 ( 1 ) " " "*" " " "*" "*"
## 7 ( 1 ) "*" "*" " " "*" "*"
## moves_inv moves_inv_sqrt moves_inv_quarter
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " "*"
## 3 ( 1 ) "*" " " " "
## 4 ( 1 ) "*" "*" "*"
## 5 ( 1 ) "*" "*" "*"
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) "*" "*" "*"
summary(modeltries)$bic # big improvements at 2, 3, and 5 variables; minimized at 5
## [1] -529.3658 -1616.8150 -1932.3207 -2122.7185 -2625.2487 -2621.3802 -2616.3068
summary(modeltries)$rsq # essentially 1 (deterministic-looking) from 5 variables on
## [1] 0.9649545 0.9999604 0.9999946 0.9999984 0.9999999 0.9999999 0.9999999
summary(modeltries)$outmat[c(2,3,5),]
## time_inv_2 time_inv time_inv_sqrt time_inv_quarter moves_inv_2
## 2 ( 1 ) " " "*" " " " " " "
## 3 ( 1 ) " " "*" " " " " "*"
## 5 ( 1 ) " " "*" " " " " "*"
## moves_inv moves_inv_sqrt moves_inv_quarter
## 2 ( 1 ) " " " " "*"
## 3 ( 1 ) "*" " " " "
## 5 ( 1 ) "*" "*" "*"
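The coefficients of the BIC-minimizing subset can also be pulled directly from the regsubsets object (a quick sketch):
best_size <- which.min(summary(modeltries)$bic)
coef(modeltries, id = best_size)  # coefficients for the best subset of that size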
Based on this, we only need first-order terms with transforms, but not any interactions.
So, we will fit the following models (the first two are simplifications; the last adds interactions for comparison):
\(scoreadj = \beta_1 \frac{1}{time} + \beta_2 \frac{1}{moves^{1/4}} + error\)
\(scoreadj = \beta_1 \frac{1}{time} + \beta_2 \frac{1}{moves^2} + \beta_3 \frac{1}{moves} + error\)
\(scoreadj = \beta_1 \frac{1}{time} + \beta_2 \frac{1}{moves^2} + \beta_3 \frac{1}{moves} + \beta_4 \frac{1}{moves^{1/2}} + \beta_5 \frac{1}{moves^{1/4}} + error\)
\(scoreadj = \beta_1 \frac{1}{time^2} + \beta_2 \frac{1}{time} + \beta_3 \frac{1}{time^{1/2}} + \beta_4 \frac{1}{time^{1/4}} + \beta_5 \log(time) + \beta_6 \frac{1}{moves^2} + \beta_7 \frac{1}{moves} + \beta_8 \frac{1}{moves^{1/2}} + \beta_9 \frac{1}{moves^{1/4}} + \beta_{10} \log(moves) + error\)
\(scoreadj = \beta_1 \frac{1}{time} + \beta_2 \frac{1}{moves^2} + \beta_3 \frac{1}{moves} + \beta_4 \frac{1}{moves^{1/2}} + \beta_5 \frac{1}{moves^{1/4}} + \beta_6 \log(moves) + \beta_7 \frac{1}{moves^2 \cdot time} + \beta_8 \frac{1}{moves \cdot time} + \beta_9 \frac{1}{moves^{1/2} \cdot time} + \beta_{10} \frac{1}{moves^{1/4} \cdot time} + error\)
lmfit_2pred <- lm(scoreadj ~ time_inv + moves_inv_quarter-1,
data=cardresults)
lmfit_3pred <- lm(scoreadj ~ time_inv + moves_inv_2 + moves_inv-1,
data=cardresults)
lmfit_5pred <- lm(scoreadj ~ time_inv + moves_inv_2 + moves_inv + moves_inv_sqrt + moves_inv_quarter-1,
data=cardresults)
lmfit_10pred <- lm(scoreadj ~ time_inv_2 + time_inv + time_inv_sqrt + time_inv_quarter + time_log +
moves_inv_2 + moves_inv + moves_inv_sqrt + moves_inv_quarter + moves_log - 1,
data=cardresults)
lmfit_6predint <- lm(scoreadj ~ time_inv +
moves_inv_2 + moves_inv + moves_inv_sqrt + moves_inv_quarter + moves_log +
time_inv:moves_inv_2 + time_inv:moves_inv + time_inv:moves_inv_sqrt + time_inv:moves_inv_quarter - 1,
data=cardresults)
lmfit = lmfit_6predint
summary(lmfit)$r.squared
## [1] 1
MAE_2pred = mean(abs(cardresults$scoreadj-lmfit_2pred$fitted.values))
MAE_3pred = mean(abs(cardresults$scoreadj-lmfit_3pred$fitted.values))
MAE_5pred = mean(abs(cardresults$scoreadj-lmfit_5pred$fitted.values))
MAE_10pred = mean(abs(cardresults$scoreadj-lmfit_10pred$fitted.values))
MAE_6predint = mean(abs(cardresults$scoreadj-lmfit_6predint$fitted.values))
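To see these in-sample MAEs side by side (a sketch; the labels are mine):
round(c(pred2 = MAE_2pred, pred3 = MAE_3pred, pred5 = MAE_5pred,
        pred10 = MAE_10pred, pred6int = MAE_6predint), 2)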
Sidenote: why not just try to fit the data perfectly? While in this case we are assuming a deterministic model, in most situations we want to avoid the problem of overfitting.
Deterministically, it appears that the score is additively related to the inverse of time, along with moves raised to a power that somehow depends on the value of time.
However, the question remains - have we overfit the data?
Okay, so we’ll take a look at some brand-new games (not used to fit the model).
newresults = read_excel("AddictionNew.xlsx") |>
select(score, time, moves, scoreadj) |>
mutate(scoreadj_trans_quad = scoreadj^(transform_quad),
time_inv_2 = 1/time^2,
time_inv = 1/time,
time_inv_sqrt = 1/(time^.5),
time_inv_quarter = 1/(time^.25),
time_log = log(time),
moves_inv_2 = 1/moves^2,
moves_inv = 1/moves,
moves_inv_sqrt = 1/(moves^.5),
moves_inv_quarter = 1/(moves^.25),
moves_log = log(moves)
)
newresults[,1:4]
## # A tibble: 11 Ă— 4
## score time moves scoreadj
## <dbl> <dbl> <dbl> <dbl>
## 1 38440 53 35 25040
## 2 34457 56 42 21057
## 3 38803 52 35 25403
## 4 61403 29 23 48003
## 5 51291 36 28 37891
## 6 35991 70 31 22591
## 7 31435 71 40 18035
## 8 52393 32 32 38993
## 9 38588 49 38 25188
## 10 49714 35 32 36314
## 11 26447 90 46 13047
names(newresults)
## [1] "score" "time" "moves"
## [4] "scoreadj" "scoreadj_trans_quad" "time_inv_2"
## [7] "time_inv" "time_inv_sqrt" "time_inv_quarter"
## [10] "time_log" "moves_inv_2" "moves_inv"
## [13] "moves_inv_sqrt" "moves_inv_quarter" "moves_log"
We will use each of the following models to predict the (adjusted) score of the new data, based on the observed time and moves.
Quadratic model with transformed response, using time and moves: MAE = 462.57
predict_quad_trans <- predict(lmfit_quad, newdata = newresults)
predict_quad <- predict_quad_trans^(1/transform_quad)
gf_point(newresults$scoreadj ~ predict_quad, ylab = "adjusted score") +
geom_abline(slope=1, intercept=0,)
with honest MAE measure:
round(mean(abs(newresults$scoreadj-predict_quad)),2)
## [1] 543.83
(honest refers to the fact that we are predicting data that was not used to fit the model)
Neural net model, using time and moves: MAE = 1025.93
predict_nnet <- nnetfit_int_orig |> predict(newresults)  # predictions from the interaction neural net (nnetfit_int_orig)
gf_point(newresults$scoreadj ~ predict_nnet, ylab = "adjusted score") +
geom_abline(slope=1, intercept=0,)
with honest MAE measure:
round(mean(abs(newresults$scoreadj-predict_nnet)),2)
## [1] 1673.73
Two-predictor transformed model (lmfit_2pred):
predict_2pred <- predict(lmfit_2pred, newdata = newresults)
gf_point(newresults$scoreadj ~ predict_2pred, ylab = "adjusted score") +
geom_abline(slope=1, intercept=0,)
with honest MAE measure:
round(mean(abs(newresults$scoreadj-predict_2pred)),2)
## [1] 1653.08
Three-predictor transformed model (lmfit_3pred):
predict_3pred <- predict(lmfit_3pred, newdata = newresults)
gf_point(newresults$scoreadj ~ predict_3pred, ylab = "adjusted score") +
geom_abline(slope=1, intercept=0,)
with honest MAE measure:
round(mean(abs(newresults$scoreadj-predict_3pred)),2)
## [1] 583.39
Five-predictor transformed model (lmfit_5pred):
predict_5pred <- predict(lmfit_5pred, newdata = newresults)
gf_point(newresults$scoreadj ~ predict_5pred, ylab = "adjusted score") +
geom_abline(slope=1, intercept=0,)
with honest MAE measure:
round(mean(abs(newresults$scoreadj-predict_5pred)),2)
## [1] 19.82
Ten-predictor transformed model (lmfit_10pred):
predict_10pred <- predict(lmfit_10pred, newdata = newresults)
gf_point(newresults$scoreadj ~ predict_10pred, ylab = "adjusted score") +
geom_abline(slope=1, intercept=0,)
with honest MAE measure:
round(mean(abs(newresults$scoreadj-predict_10pred)),2)
## [1] 3.28
Transformed model with interactions (lmfit_6predint):
predict_6predint <- predict(lmfit_6predint, newdata = newresults)
gf_point(newresults$scoreadj ~ predict_6predint, ylab = "adjusted score") +
geom_abline(slope=1, intercept=0,)
with honest MAE measure:
round(mean(abs(newresults$scoreadj-predict_6predint)),2)
## [1] 1.23
We pick what is arguably the most complex model (though it doesn’t
directly incorporate a dependence of the power of
moves on time). We observe almost perfect
predictions of new data, with the lowest honest MAE of 1.23.
The coefficients for this model are:
coef_lmfit = round(coef(lmfit_6predint),0)
\(\hat{scoreadj} = -8317251 \cdot \frac{1}{time} + 46432001 \cdot \frac{1}{moves^2} - 23330790 \cdot \frac{1}{moves} + 16079286 \cdot \frac{1}{moves^{1/2}} - 6119955 \cdot \frac{1}{moves^{1/4}} + 121813 \cdot \log(moves) + \cdots\) (the four interaction coefficients are not displayed here).
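As a convenience, the selected model can be wrapped into a scoring function on the original scale; this is only a sketch (the helper name predict_score is mine), adding back the 13,400 baseline and building just the columns the model formula needs:
predict_score <- function(time, moves, fit = lmfit_6predint) {
  newdat <- data.frame(time_inv = 1/time,
                       moves_inv_2 = 1/moves^2,
                       moves_inv = 1/moves,
                       moves_inv_sqrt = 1/sqrt(moves),
                       moves_inv_quarter = moves^(-1/4),
                       moves_log = log(moves))
  13400 + as.numeric(predict(fit, newdata = newdat))  # interaction terms are formed by the formula
}
predict_score(time = 53, moves = 35)  # the first new game above, with observed score 38440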
No transform is suggested:
MASS::boxcox(lmfit)
A check for outliers shows that high-moves data are
identified as unusual:
plot(lmfit, which=1, id.n=5)
#outlierTest(lmfit_quad)
adjOutliers <- rosnerTest(lmfit$residuals, k=5)
The data identified as outlying is shown below:
cardresults |>
slice(adjOutliers$all.stats$Obs.Num[adjOutliers$all.stats$Outlier]) |>
select(time, moves, score, scoreadj)
## time moves score scoreadj
## 1 43 36 42334 28934
## 2 69 46 29828 16428
## 3 45 45 37844 24444
moves_breaks = max(cardresults$moves)-min(cardresults$moves)+1
Additionally, there is very clear evidence that high-time data are (overwhelmingly) identified as influential:
plot(lmfit, which=5, id.n=5)
CooksD <- cooks.distance(lmfit_quad) # note: Cook's distances here come from the earlier quadratic fit
max(CooksD) # values > 0.5 are suspected influential; values > 1 are almost surely influential
## [1] 12.9099
The data identified as influential are shown below:
cardresults |>
filter(CooksD > 0.5) |>
select(time, moves, score, scoreadj)
## time moves score scoreadj
## 1 213 34 24771 11371
## 2 89 44 27163 13763
## 3 85 45 27386 13986
time_breaks = max(cardresults$time)-min(cardresults$time)+1
What do you think? How would you select and/or further revise the model to be used?