Libraries used

# linear modeling
library(MASS)
library(car)
library(leaps)
library(EnvStats)

# neural-net modeling
library(caret)
library(nnet)
library(NeuralNetTools)

# data input
library(readr)
library(readxl)

# data organization
library(tidyr)
library(dplyr)

# plotting
library(ggformula)

Introduction

Description

The following is a modeling question based on the Addiction solitaire game, available as an app (see information for Google Play or for Apple).

A score is produced for each game played. The working assumption is that the score is a simple-ish combination of:

  • baseline of 13,400 (assigned for getting all the rows completed);
  • moves, the number of cards moved, including 1 per “undo” and 1 per “shuffle”;
  • time, the number of seconds the game takes to complete.

Question

While the scoring appears to be deterministic (since any games with the exact same moves and time result in the same score), the formula for the computation is not obvious. We will look at the following: how can we combine number of moves (count) and time spent (in seconds) to directly compute the score?

Process

Data is pre-processed and organized. Visual assessment is used to explore the proposed nature of the score computation. Finally, linear models, neural nets, and a simpler nonlinear method (linear model with transformed predictor terms) are considered for modeling / identifying the deterministic computation of scores.

Main takeaway: more complex methods (such as neural nets) are not magic! In this situation, using a linear combination of properly transformed predictors does a near-perfect job identifying the (deterministic / exact) relationship. So, in this case, clever (and informed) application of a simpler model works better!

Setup

Read in the data

Data can be downloaded from this Google sheet (first tab).

gameresults = read_excel("AddictionResults.xlsx")
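A quick structure check of the imported results (the columns time, moves, and score are used throughout):

# Confirm the imported columns and types
str(gameresults)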

Clean results

First, identify the unique observed values of time (timesused) and moves (movesused), and set up a frame to hold every pairing.

timesused <- unique(gameresults$time)
timesused <- timesused[order(timesused)]
movesused <- unique(gameresults$moves)
movesused <- movesused[order(movesused)]
n_gamesused <- length(movesused)*length(timesused)
checkresults <- data.frame(time=rep(NA,n_gamesused),
                           moves=rep(NA,n_gamesused),
                           total_games=rep(NA,n_gamesused),
                           avg_score=rep(NA,n_gamesused),
                           sd_score=rep(NA,n_gamesused))

Next, compile the number of games, along with the mean and standard deviation of scores, for each pairing of time and moves. There will not be observed games for most of the combinations.

locresults <- 1
for (i in 1:length(timesused)) {
  for (j in 1:length(movesused)) {
    whichsame <- (gameresults$time == timesused[i]) & (gameresults$moves == movesused[j])
    if (sum(whichsame) > 0) {
      samegames <- gameresults[whichsame,]
      allscore <- samegames$score
      checkresults[locresults,] <- c(timesused[i], movesused[j],
                                     length(allscore), 
                                     mean(allscore),
                                     ifelse(length(allscore)==1, 0, 
                                            sd(allscore)))
      locresults = locresults+1
    } else {
      checkresults[locresults,c(1,2)] <- c(timesused[i], movesused[j])
      locresults = locresults+1
    }
  }
}
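
For reference, the same per-pairing summary (restricted to observed combinations) could be built more compactly with dplyr; a minimal sketch, assuming the same gameresults columns:

# Sketch (not part of the original workflow): summarize observed time/moves pairings
observed_pairs <- gameresults |>
  group_by(time, moves) |>
  summarise(total_games = n(),
            avg_score = mean(score),
            sd_score = ifelse(n() == 1, 0, sd(score)),
            .groups = "drop")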

Check dimensions and accuracy of records

Check that the number of compiled games equals the total number of games.

sum(checkresults$total_games, na.rm=T) == dim(gameresults)[1]
## [1] TRUE

Check that every (non-missing) row has scores with a standard deviation of 0 (to verify consistency in scoring):

table(checkresults$sd_score)
## 
##   0 
## 161

Visualize number of occurrences of each unique game result

The total number of games is 185, and the total number of unique scoring situations is 161.

maxrepeats = max(checkresults$total_games, na.rm = T)
gf_histogram(~total_games, data=checkresults, 
             bins = maxrepeats,
             fill = rainbow(maxrepeats),
             color = "black")

table(checkresults$total_games)
## 
##   1   2   3   4 
## 140  19   1   1

Review and overview

Final data frame

Include unique games only:

uniqueresults <- checkresults[!is.na(checkresults$avg_score),]

Reorganize and rename: since baseline score is 13,400 (for getting all the rows completed), define scoreadj to be (score \(- 13,400\)) and use that as the response to be predicted from time and moves.

cardresults <- uniqueresults |>        # start with unique game results
  select(time, moves, avg_score) |>    # select important / useful variables
  rename(score = avg_score) |>         # consistent game results 
  mutate(scoreadj = score-13400) |>    # define scoreadj to remove baseline effect
  arrange(score)

str(cardresults)
## 'data.frame':    161 obs. of  4 variables:
##  $ time    : num  213 89 85 82 99 72 69 69 67 68 ...
##  $ moves   : num  34 44 45 41 36 44 46 43 44 43 ...
##  $ score   : num  24771 27163 27386 29160 29190 ...
##  $ scoreadj: num  11371 13763 13986 15760 15790 ...

The vast majority of games have scores \(\le 65{,}000\) points; only 4.9% of games score above 65,000 (these points are potentially influential, so care must be taken with modeling choices).

maxlim = ceiling(max(cardresults$score)/5000)+1
gf_histogram(~score, data=cardresults,
             color = "black", breaks=(3:maxlim)*5000)
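
As a quick check of that proportion among the unique game results:

# Proportion of unique game results with score above 65,000
mean(cardresults$score > 65000)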

EDA (Exploratory Data Analysis)

Relationship of score to time: A simple plot of score against time, marked / grouped by moves, shows monotonically decreasing behavior. Specifically, for any value of moves with 3 or more observations, a decreasing, concave-up pattern of score across time is evident. So, while longer times result in lower scores, the relationship is clearly nonlinear on the original scale.

gf_line(scoreadj ~ time, color=~factor(moves), data=arrange(cardresults, time), size=2)

Relationship of score to moves: A similar plot of score against moves, marked / grouped by time, shows monotonically decreasing behavior. Specifically, for any value of time with 3 or more observations, a decreasing, concave-up pattern of score across moves is evident. So, while more moves result in lower scores, the relationship is again clearly nonlinear on the original scale.

gf_line(scoreadj ~ moves, color=~factor(time), data=arrange(cardresults, moves), size=2)

Modeling Approaches

Linear Regression - Interaction

We begin with an interaction model: \(score = \beta_0 + \beta_1 time + \beta_2 moves + \beta_3 time\cdot moves + error\)

However, as noted before, the “baseline” score is 13,400 for completing the game, so we are going to model the response adjscore via an interaction model without an intercept: \(adjscore = \beta_1 time + \beta_2 moves + \beta_3 time\cdot moves + error\)

An initial interaction regression model suggests that a response transformation is clearly needed:

lmfit_interact_init <- lm(scoreadj ~ time + moves + I(time*moves) -1,
                          data=cardresults)
BC_interact <- MASS::boxcox(lmfit_interact_init)

transform_interact <- round(BC_interact$x[which.max(BC_interact$y)],2)

As a checkpoint,

cardresults <- cardresults |> mutate(yhat_interact_init = lmfit_interact_init$fitted.values)
gf_point(scoreadj ~ yhat_interact_init, data=cardresults) + geom_abline(slope=1, intercept=0)

MAE_interact_init = mean(abs(cardresults$scoreadj-cardresults$yhat_interact_init))

the MAE for the interaction model with untransformed (adjusted) score is 5518.81, and we see a clear curved deviation of predicted from observed.


Next, we fit the interaction model with the transformed response (to the power of -1.8).

cardresults <- cardresults |> mutate(scoreadj_trans_int = scoreadj^(transform_interact))
lmfit_interact <- lm(scoreadj_trans_int ~ time + moves + I(time*moves) -1,
                     data=cardresults)
yhattrans_interact = lmfit_interact$fitted.values
cardresults <- cardresults |> mutate(yhat_interact = yhattrans_interact^(1/transform_interact))
gf_point(scoreadj ~ yhat_interact, data=cardresults) + geom_abline(slope=1, intercept=0)

MAE_interact = mean(abs(cardresults$scoreadj-cardresults$yhat_interact))

On the transformed scale, the regression model explains about 99.445% of the variability in the transformed response.
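
For reference, the variability-explained figure quoted above can be read off the fitted model on the transformed scale:

# R-squared of the interaction fit on the transformed response
summary(lmfit_interact)$r.squared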

For comparison with the prior results, we backtransform the predicted response to the original scale, resulting in an MAE (measured on the original units) for the interaction model with transformed (adjusted) score as 2459.86.

Linear Regression - Quadratic

We next consider a full quadratic model: \(score = \beta_0 + \beta_1 time + \beta_2 moves + \beta_3 time\cdot moves + \beta_4 time^2 + \beta_5 moves^2 + error\)

However, as noted before, the “baseline” score is 13,400 for completing the game, so we are going to model the response adjscore via a quadratic model without an intercept: \(adjscore = \beta_1 time + \beta_2 moves + \beta_3 time\cdot moves + \beta_4 time^2 + \beta_5 moves^2 + error\)

An initial quadratic regression model suggests that a response transformation is clearly needed:

lmfit_quad_init <- lm(scoreadj ~ time + moves + I(time*moves) 
                                 + I(time^2) + I(moves^2)-1,
                      data=cardresults)
BC_quad <- MASS::boxcox(lmfit_quad_init)

transform_quad <- round(BC_quad$x[which.max(BC_quad$y)],2)

As a checkpoint,

cardresults <- cardresults |> mutate(yhat_quad_init = lmfit_quad_init$fitted.values)
gf_point(scoreadj ~ yhat_quad_init, data=cardresults) + geom_abline(slope=1, intercept=0)

MAE_quad_init = mean(abs(cardresults$scoreadj-cardresults$yhat_quad_init))

the MAE for the quadratic model with untransformed (adjusted) score is 2803.02, and we see an apparent upper bound on the predicted values relative to the observed.


Next, we fit the quadratic model with the transformed response (to the power of -1.43).

cardresults <- cardresults |> mutate(scoreadj_trans_quad = scoreadj^(transform_quad))
lmfit_quad <- lm(scoreadj_trans_quad ~ time + moves + I(time*moves) 
                                       + I(time^2) + I(moves^2)-1,
                 data=cardresults)
yhattrans_quad = lmfit_quad$fitted.values
cardresults <- cardresults |> mutate(yhat_quad = yhattrans_quad^(1/transform_quad))
gf_point(scoreadj ~ yhat_quad, data=cardresults) + geom_abline(slope=1, intercept=0)

MAE_quad = mean(abs(cardresults$scoreadj-cardresults$yhat_quad))

On the transformed scale, the regression model explains about 99.965% of the variability in the transformed response.

For comparison with the prior results, we backtransform the predicted response to the original scale, resulting in an MAE (measured on the original units) for the quadratic model with transformed (adjusted) score of 462.57.

This is looking much better, but there is still increasing variability of the scores across predicted values.

Linear Model Assessment

A check for outliers shows that high-moves data are identified as unusual:

plot(lmfit_quad, which=1, id.n=5)

#outlierTest(lmfit_quad)
adjOutliers <- rosnerTest(lmfit_quad$residuals, k=5)

The data identified as outlying are shown below:

cardresults |> 
  slice(adjOutliers$all.stats$Obs.Num[adjOutliers$all.stats$Outlier]) |> 
  select(time, moves, score, scoreadj)
##   time moves score scoreadj
## 1   85    45 27386    13986
## 2   89    44 27163    13763
moves_breaks = max(cardresults$moves)-min(cardresults$moves)+1

Compare this to the histogram of all values of moves:

gf_histogram(~moves, data=cardresults,
             fill="skyblue", color="black",
             bins=moves_breaks)

Additionally, there is very clear evidence that high-time data are (overwhelmingly) identified as influential:

plot(lmfit_quad, which=5, id.n=5)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

CooksD <- cooks.distance(lmfit_quad)
max(CooksD)  # values > 0.5 are suspect (possibly influential); values > 1 are almost surely influential
## [1] 12.9099

The data identified as influential are shown below:

cardresults |> 
  filter(CooksD > 0.5) |> 
  select(time, moves, score, scoreadj)
##   time moves score scoreadj
## 1  213    34 24771    11371
## 2   89    44 27163    13763
## 3   85    45 27386    13986
time_breaks = max(cardresults$time)-min(cardresults$time)+1

Compare this to the histogram of all values of time:

gf_histogram(~time, data=cardresults,
             fill="pink3", color="black",
             bins=time_breaks)


So what other options are available?

A “hot” topic right now is deep learning, which makes use of the modeling tool of neural nets. We consider a simple version of neural nets, with only one hidden layer but multiple nodes.


Neural Nets

We use a more complex model-selection process, applying the train function from the caret package. As per the documentation for available models in caret, the simple neural net application can optionally adjust two tuning parameters, size (the number of nodes in the hidden layer) and decay (the weight-decay regularization). Further documentation about the function’s applications and arguments can be found at the online help file for nnet.

We begin by specifying the model-selection process (using 5-fold cross-validation to obtain honest predictions):

set.seed(42)
ctrl = trainControl(method = "cv", number = 5)

We then consider six possible options for fitting neural nets:

  • additive-term model, using original-unit predictors
nnetfit_add_orig <- train(scoreadj ~ time + moves,
                        data = cardresults,
                        method = "nnet",
                        tuneGrid = expand.grid(size = c(1:10), decay = c(1:10)/10),
#                        preProc = c("center", "scale"),
                        linout =T, trace = FALSE,
                        metric="MAE", trControl = ctrl)

The minimum MAE is 945.7273425, using tuning parameters size = 7 and decay = 0.1.

  • additive-term model, using standardized predictors
nnetfit_add_stand <- train(scoreadj ~ time + moves,
                        data = cardresults,
                        method = "nnet",
                        tuneGrid = expand.grid(size = c(1:10), decay = c(1:10)/10),
                        preProc = c("center", "scale"),
                        linout =T, trace = FALSE,
                        metric="MAE", trControl = ctrl)

The minimum MAE is 1123.5407581, using tuning parameters size = 2 and decay = 0.5.

  • interaction model, using original-unit predictors
nnetfit_int_orig <- train(scoreadj ~ time + moves + I(time*moves),
                        data = cardresults,
                        method = "nnet",
                        tuneGrid = expand.grid(size = c(1:10), decay = c(1:10)/10),
#                        preProc = c("center", "scale"),
                        linout =T, trace = FALSE,
                        metric="MAE", trControl = ctrl)

The minimum MAE is 1583.7632872, using tuning parameters size = 9 and decay = 0.5.

  • interaction model, using standardized predictors
nnetfit_int_stand <- train(scoreadj ~ time + moves + I(time*moves),
                        data = cardresults,
                        method = "nnet",
                        tuneGrid = expand.grid(size = c(1:10), decay = c(1:10)/10),
                        preProc = c("center", "scale"),
                        linout =T, trace = FALSE,
                        metric="MAE", trControl = ctrl)

The minimum MAE is 1228.3833052, using tuning parameters size = 1 and decay = 0.2.

  • quadratic model, using original-unit predictors
nnetfit_quad_orig <- train(scoreadj ~ time + moves + I(time*moves) + I(time^2) + I(moves^2),
                        data = cardresults,
                        method = "nnet",
                        tuneGrid = expand.grid(size = c(1:10), decay = c(1:10)/10),
#                        preProc = c("center", "scale"),
                        linout =T, trace = FALSE,
                        metric="MAE", trControl = ctrl)

The minimum MAE is 2161.3607821, using tuning parameters size = 6 and decay = 0.1.

  • quadratic model, using standardized predictors
nnetfit_quad_stand <- train(scoreadj ~ time + moves + I(time*moves) + I(time^2) + I(moves^2),
                        data = cardresults,
                        method = "nnet",
                        tuneGrid = expand.grid(size = c(1:10), decay = c(1:10)/10),
                        preProc = c("center", "scale"),
                        linout =T, trace = FALSE,
                        metric="MAE", trControl = ctrl)

The minimum MAE is 2150.0260011, using tuning parameters size = 3 and decay = 0.3.
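
The minimum MAEs and tuning parameters quoted above can be pulled from the corresponding caret objects; for example, for the first fit:

# Selected tuning parameters and the corresponding cross-validated MAE (sketch)
nnetfit_add_orig$bestTune
min(nnetfit_add_orig$results$MAE)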


Neural Net Selection

Combining the fits and the minimum metrics:

allmodels = list(nnetfit_add_orig,
                 nnetfit_add_stand,
                 nnetfit_int_orig,
                 nnetfit_int_stand,
                 nnetfit_quad_orig,
                 nnetfit_quad_stand )
allminMAE = c(min(nnetfit_add_orig$results$MAE),
              min(nnetfit_add_stand$results$MAE),
              min(nnetfit_int_orig$results$MAE),
              min(nnetfit_int_stand$results$MAE),
              min(nnetfit_quad_orig$results$MAE),
              min(nnetfit_quad_stand$results$MAE) )

We can identify that the best neural net is the additive model on the original-unit scale.

We store the best model, and look at the metric values across tuning parameter options:

nnetfit_selected <- allmodels[[which.min(allminMAE)]]
gf_line(MAE ~ decay, color=~factor(size), data=nnetfit_selected$results, size=2)

A visualization of the neural network is shown below:

par(mar = c(.1, .1, .1, .1))
plotnet(nnetfit_selected)

par(mar = c(2,2,2,2))

Neural Net Assessment

MAE_nnet = mean(abs(cardresults$scoreadj-nnetfit_selected$finalModel$fitted.values))

This model has a re-assessed MAE of 1025.93, which is relatively quite good (though not as good as the best linear model).

However, a plot of the observed (adjusted) score versus the predictions shows a wavy pattern, which means we have not yet identified the deterministic relationship, even with a much more complex mathematical model.

cardresults <- cardresults |> mutate(yhat_nnet = nnetfit_selected$finalModel$fitted.values)
gf_point(scoreadj ~ yhat_nnet, data=cardresults) + geom_abline(slope=1, intercept=0)

Re-examining Applicability

Problems with Models to-date

All of the models are fit only over the observed range; problems arise if we move outside that range. We start by considering how the adjusted score (scoreadj) behaves across time, for different values of moves:

maxadjScore = max(cardresults$scoreadj)+1000
gf_line(scoreadj ~ time, color=~factor(moves), data=arrange(cardresults, time), size=2) +
  ylim(0,maxadjScore) + xlim(0,300)

There is a clear curved pattern over time from about 15 to 100 seconds. But what happens when we move into higher times (which are possible to obtain)?

For example, we can readily visualize the pattern of adjscore vs time for values of moves (27, 33, or 40 card-moves) in which we have multiple game results:

multiple_results <- cardresults |> filter(moves %in% c(27, 33, 40)) |> select(1:4)
table(multiple_results$moves)
## 
## 27 33 40 
##  9 11  9
gf_point(scoreadj ~ time, color=~factor(moves), data=arrange(multiple_results, time), size=2)+
  ylim(0,maxadjScore) + xlim(0,300) +
  scale_color_manual(values=c("green", "gray50", "blue"))


However, when we add in the fitted best model: \(\hat{scoreadj} = \left( \frac{1}{10^{12}} \cdot (\hat{\beta}_1 \cdot time + \hat{\beta}_2 \cdot moves + \hat{\beta}_3 \cdot time \cdot moves +\hat{\beta}_4 \cdot time^2 +\hat{\beta}_5 \cdot moves^2 )\right)^{\frac{1}{-1.56}}\)

with \(\hat{\beta}_1\) = -6517.9304, \(\hat{\beta}_2\) = 8127.3984, \(\hat{\beta}_3\) = 551.5504, \(\hat{\beta}_4\) = -20.5887, and \(\hat{\beta}_5\) = -312.9015,

at moves=27: \(\hat{scoreadj} = \left( \frac{1}{10^{12}} \cdot (-8665.4405 + 8373.9305\cdot time - 20.5887\cdot time^2 )\right)^{\frac{1}{-1.56}}\)

at moves=33: \(\hat{scoreadj} = \left( \frac{1}{10^{12}} \cdot (-72545.592 + 11683.233\cdot time - 20.5887\cdot time^2 )\right)^{\frac{1}{-1.56}}\)

at moves=40: \(\hat{scoreadj} = \left( \frac{1}{10^{12}} \cdot (-175546.47 + 15544.086\cdot time - 20.5887\cdot time^2 )\right)^{\frac{1}{-1.56}}\)
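
The curves added to the plots below use helper functions fun27, fun33, and fun40, which are defined in a chunk not shown here; a minimal sketch of how such helpers could be reconstructed from the fitted quadratic model (assuming lmfit_quad and transform_quad as defined above):

# Hypothetical reconstruction of the plotting helpers: for a fixed number of moves,
# predict the transformed response from lmfit_quad, then back-transform to the
# adjusted-score scale
make_pred_fun <- function(m, fit = lmfit_quad, power = transform_quad) {
  function(time) {
    yhat_trans <- predict(fit, newdata = data.frame(time = time, moves = m))
    yhat_trans^(1/power)
  }
}
fun27 <- make_pred_fun(27)
fun33 <- make_pred_fun(33)
fun40 <- make_pred_fun(40)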

This looks quite good over the original range, and even quite a ways beyond:

gf_point(scoreadj ~ time, color=~factor(moves), data=arrange(multiple_results, time), size=3, shape=16)+
  ylim(0,maxadjScore) + xlim(0,300) +
  scale_color_manual(values=c("green", "gray50", "blue")) + 
  stat_function(fun=fun27, col="green", lwd=1) + 
  stat_function(fun=fun33, col="gray50", lwd=1) + 
  stat_function(fun=fun40, col="blue", lwd=1)

However, the fitted curve is a back-transformed quadratic function of time, so it is not monotone; eventually the fitted pattern curves back up:

gf_point(scoreadj ~ time, color=~factor(moves), data=arrange(multiple_results, time), size=3, shape=16)+
  ylim(0,maxadjScore) + xlim(0,1000) +
  scale_color_manual(values=c("green", "gray50", "blue")) + 
  stat_function(fun=fun27, col="green", lwd=1) + 
  stat_function(fun=fun33, col="gray50", lwd=1) + 
  stat_function(fun=fun40, col="blue", lwd=1)

This behavior does not align with (presumed) monotonic decrease in (adjusted) score as time increases.

So what else can we do?

Review nonlinear behavior

So, it doesn’t seem to work to transform just the response. What if we were to consider transforming one (or both) of the predictor variables?

Predictor time: Consider the relationship of adjscore vs time for fixed values of moves. Can we come up with a transformation of time such that the relationships look linear?

gf_line(scoreadj ~ time, color=~factor(moves), data=arrange(cardresults, time), size=1) + geom_point(size=2)

gf_line(scoreadj ~ log(time), color=~factor(moves), data=arrange(cardresults, time), size=1) + geom_point(size=2)

gf_line(scoreadj ~ 1/(time), color=~factor(moves), data=arrange(cardresults, time), size=1) + geom_point(size=2)

gf_line(scoreadj ~ 1/sqrt(time), color=~factor(moves), data=arrange(cardresults, time), size=1) + geom_point(size=2)

The inverse relationship looks very linear. Let’s check the pattern of adjscore vs 1/(time) for a value of moves (34 card-moves) at which we have multiple game results:

multiple_results <- cardresults |> filter(moves == 34) |> select(1:4)
table(multiple_results$moves)
## 
## 34 
##  9
gf_line(scoreadj ~ 1/(time), color=~factor(moves), data=arrange(multiple_results, time), size=1) + geom_point(size=2)

lmfit_invtime <- lm(scoreadj ~ I(1/(time)), data=multiple_results)
summary(lmfit_invtime)$r.squared
## [1] 1

This is suggesting an essentially perfect fit.
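
To check that this pattern is not specific to moves = 34, the same inverse-time fit can be repeated within every moves group that has several games; a quick sketch:

# Within each moves value having 3+ games, fit scoreadj ~ 1/time and record R^2
by_moves <- split(cardresults, cardresults$moves)
by_moves <- by_moves[sapply(by_moves, nrow) >= 3]
sapply(by_moves, function(d) summary(lm(scoreadj ~ I(1/time), data = d))$r.squared)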

So, we will revise modeling to use the inverse-transform of time.

Predictor moves: Consider the relationship of adjscore vs moves for fixed values of time. Can we come up with a transformation of moves such that the relationships look linear?

gf_line(scoreadj ~ moves, color=~factor(time), data=arrange(cardresults, moves), size=1) + geom_point(size=2)

gf_line(scoreadj ~ log(moves), color=~factor(time), data=arrange(cardresults, moves), size=1) + geom_point(size=2)

gf_line(scoreadj ~ 1/(moves), color=~factor(time), data=arrange(cardresults, moves), size=1) + geom_point(size=2)

gf_line(scoreadj ~ 1/sqrt(moves), color=~factor(time), data=arrange(cardresults, moves), size=1) + geom_point(size=2)

This one is a little less clear. Let’s check the pattern of adjscore vs log(moves) for a value of time (44 seconds) at which we have multiple game results:

multiple_results <- cardresults |> filter(time == 44) |> select(1:4)
table(multiple_results$time)
## 
## 44 
##  6
lmfit_logmoves <- lm(scoreadj ~ I(log(moves)), data=multiple_results)
gf_line(scoreadj ~ log(moves), color=~factor(time), data=cardresults, size=1) + 
  geom_point(size=2)  + geom_abline(slope=lmfit_logmoves$coef[2], intercept=lmfit_logmoves$coef[1])

summary(lmfit_logmoves)$r.squared
## [1] 0.9999459
predpower = -1/4
lmfit_powermoves <- lm(scoreadj ~ I((moves)^predpower), data=multiple_results)
gf_line(scoreadj ~ (moves^predpower), color=~factor(time), data=cardresults, size=1) + 
  geom_point(size=2)  + geom_abline(slope=lmfit_powermoves$coef[2], intercept=lmfit_powermoves$coef[1])

summary(lmfit_powermoves)$r.squared
## [1] 0.9999337

Each is a very good, though not quite perfect, fit.

We will try the inverse-quarter-root-transform of moves, though we may need to revisit this choice.

Rebuilding Linear Models

Transformations of predictors

Based on exploration, the following transformations of predictors were added:

cardresults <- cardresults |> mutate(time_inv_2 = 1/time^2,
                                     time_inv = 1/time,
                                     time_inv_sqrt = 1/(time^.5),
                                     time_inv_quarter = 1/(time^.25),
                                     time_log = log(time),
                                     moves_inv_2 = 1/moves^2,
                                     moves_inv = 1/moves,
                                     moves_inv_sqrt = 1/(moves^.5),
                                     moves_inv_quarter = 1/(moves^.25),
                                     moves_log = log(moves)
                                     )

Interaction model

We first try the interaction model without an intercept, using transformed predictors:

\(adjscore = \beta_1 \frac{1}{time} + \beta_2 \frac{1}{moves^{1/4}} + \beta_3 \frac{1}{(time)(moves^{1/4})} + error\)

lmfit_transx14_pred <- lm(scoreadj ~ time_inv + moves_inv_quarter + I(time_inv*moves_inv_quarter)-1,
                          data=cardresults)
summary(lmfit_transx14_pred)
## 
## Call:
## lm(formula = scoreadj ~ time_inv + moves_inv_quarter + I(time_inv * 
##     moves_inv_quarter) - 1, data = cardresults)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4330.0  -424.0   173.8   570.4  1945.7 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## time_inv                        -772949.2    87257.0  -8.858 1.59e-15 ***
## moves_inv_quarter                 16103.8      758.3  21.236  < 2e-16 ***
## I(time_inv * moves_inv_quarter) 4252241.9   187596.0  22.667  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 965.2 on 158 degrees of freedom
## Multiple R-squared:  0.9992, Adjusted R-squared:  0.9992 
## F-statistic: 6.647e+04 on 3 and 158 DF,  p-value: < 2.2e-16
#MASS::boxcox(lmfit_transx14_pred)
#plot(lmfit_transx14_pred)
cardresults <- cardresults |> mutate(yhat_xtrans14 = lmfit_transx14_pred$fitted.values)
gf_point(scoreadj ~ yhat_xtrans14, data=cardresults) + geom_abline(slope=1, intercept=0)

MAE_xtrans14 = mean(abs(cardresults$scoreadj-cardresults$yhat_xtrans14)); MAE_xtrans14
## [1] 704.3861

This helped quite a bit: the interaction model with transformed predictors has an MAE (measured on the original units) of 704.39, a substantial improvement over the interaction models with untransformed predictors.

But it seems like we should be able to get a better fit, especially since this should be a deterministic relationship.

All-subsets model

We are going to consider a “best” model of each subset size. Based on univariate explorations, the candidate pool is the 10 transformed predictors (time_inv_2, time_inv, time_inv_sqrt, time_inv_quarter, time_log, moves_inv_2, moves_inv, moves_inv_sqrt, moves_inv_quarter, and moves_log) plus 10 interaction terms; in the run below, the log terms and the interaction terms are left commented out. We will identify the best model by the Bayesian Information Criterion (BIC).

modeltries <- regsubsets(scoreadj ~ time_inv_2 + time_inv + time_inv_sqrt + time_inv_quarter #+ time_log 
                            + moves_inv_2 + moves_inv + moves_inv_sqrt + moves_inv_quarter #+ moves_log
#                            + time_inv:moves_inv_2 + time_inv:moves_inv + time_inv:moves_inv_sqrt + time_inv:moves_inv_quarter + time_inv:moves_log
#                            + time_log:moves_inv_2 + time_log:moves_inv + time_log:moves_inv_sqrt + time_log:moves_inv_quarter + time_log:moves_log, 
                         , data = cardresults, nvmax = 7)
summary(modeltries)
## Subset selection object
## Call: regsubsets.formula(scoreadj ~ time_inv_2 + time_inv + time_inv_sqrt + 
##     time_inv_quarter + moves_inv_2 + moves_inv + moves_inv_sqrt + 
##     moves_inv_quarter, data = cardresults, nvmax = 7)
## 8 Variables  (and intercept)
##                   Forced in Forced out
## time_inv_2            FALSE      FALSE
## time_inv              FALSE      FALSE
## time_inv_sqrt         FALSE      FALSE
## time_inv_quarter      FALSE      FALSE
## moves_inv_2           FALSE      FALSE
## moves_inv             FALSE      FALSE
## moves_inv_sqrt        FALSE      FALSE
## moves_inv_quarter     FALSE      FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: exhaustive
##          time_inv_2 time_inv time_inv_sqrt time_inv_quarter moves_inv_2
## 1  ( 1 ) " "        "*"      " "           " "              " "        
## 2  ( 1 ) " "        "*"      " "           " "              " "        
## 3  ( 1 ) " "        "*"      " "           " "              "*"        
## 4  ( 1 ) " "        "*"      " "           " "              " "        
## 5  ( 1 ) " "        "*"      " "           " "              "*"        
## 6  ( 1 ) " "        "*"      " "           "*"              "*"        
## 7  ( 1 ) "*"        "*"      " "           "*"              "*"        
##          moves_inv moves_inv_sqrt moves_inv_quarter
## 1  ( 1 ) " "       " "            " "              
## 2  ( 1 ) " "       " "            "*"              
## 3  ( 1 ) "*"       " "            " "              
## 4  ( 1 ) "*"       "*"            "*"              
## 5  ( 1 ) "*"       "*"            "*"              
## 6  ( 1 ) "*"       "*"            "*"              
## 7  ( 1 ) "*"       "*"            "*"
summary(modeltries)$bic  # jumps at 2, 3, and 5; minimized at 5 variables
## [1]  -529.3658 -1616.8150 -1932.3207 -2122.7185 -2625.2487 -2621.3802 -2616.3068
summary(modeltries)$rsq  # essentially 1 (deterministic) from 5 variables on
## [1] 0.9649545 0.9999604 0.9999946 0.9999984 0.9999999 0.9999999 0.9999999
summary(modeltries)$outmat[c(2,3,5),]
##          time_inv_2 time_inv time_inv_sqrt time_inv_quarter moves_inv_2
## 2  ( 1 ) " "        "*"      " "           " "              " "        
## 3  ( 1 ) " "        "*"      " "           " "              "*"        
## 5  ( 1 ) " "        "*"      " "           " "              "*"        
##          moves_inv moves_inv_sqrt moves_inv_quarter
## 2  ( 1 ) " "       " "            "*"              
## 3  ( 1 ) "*"       " "            " "              
## 5  ( 1 ) "*"       "*"            "*"

Based on this, we only need first-order terms with transforms, but not any interactions.
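
For reference, the coefficients of the BIC-optimal subset can be extracted directly from the regsubsets object:

# Coefficients of the subset with the lowest BIC (the 5-variable model above)
coef(modeltries, id = which.min(summary(modeltries)$bic))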

So, we will fit the following five models (the first two are considered for slight simplification):

  • 2-predictor model: there is a “jump” in criteria values, so we’ll try fitting

\(adjscore = \beta_1 \frac{1}{time} + \beta_2 \frac{1}{moves^{1/4}} + error\)

  • 3-predictor model: there is another “jump” in criteria values, so we’ll try fitting

\(adjscore = \beta_1 \frac{1}{time} + \beta_2 \frac{1}{moves^2} + \beta_3 \frac{1}{moves} + error\)

  • 5-predictor model: this is the optimal model by BIC criterion:

\(adjscore = \beta_1 \frac{1}{time} + \beta_2 \frac{1}{moves^2} + \beta_3 \frac{1}{moves} + \beta_4 \frac{1}{moves^{1/2}} + \beta_5 \frac{1}{moves^{1/4}} + error\)

  • 10-predictor model: the full set of 10 transformed first-order terms (adding back the log transforms of both predictors):

\(adjscore = \beta_1 \frac{1}{time^2} + \beta_2 \frac{1}{time} + \beta_3 \frac{1}{time^{1/2}} + \beta_4 \frac{1}{time^{1/4}} + \beta_5 log(time) + \beta_6 \frac{1}{moves^2} + \beta_7 \frac{1}{moves} + \beta_8 \frac{1}{moves^{1/2}} + \beta_9 \frac{1}{moves^{1/4}} + \beta_{10} log(moves) + error\)

  • 6-predictor model with interactions: an alternative, based on data exploration:

\(adjscore = \beta_1 \frac{1}{time} + \beta_2 \frac{1}{moves^2} + \beta_3 \frac{1}{moves} + \beta_4 \frac{1}{moves^{1/2}} + \beta_5 \frac{1}{moves^{1/4}} + \beta_6 log(moves) + \beta_7\frac{1}{(moves^2)(time)} + \beta_8 \frac{1}{(moves)(time)} + \beta_9 \frac{1}{(moves^{1/2})(time)} + \beta_{10} \frac{1}{(moves^{1/4})(time)} + error\)

lmfit_2pred <- lm(scoreadj ~ time_inv + moves_inv_quarter-1,
                          data=cardresults)
lmfit_3pred <- lm(scoreadj ~ time_inv + moves_inv_2 + moves_inv-1,
                          data=cardresults)
lmfit_5pred <- lm(scoreadj ~ time_inv + moves_inv_2 + moves_inv + moves_inv_sqrt + moves_inv_quarter-1,
                          data=cardresults)
lmfit_10pred <- lm(scoreadj ~ time_inv_2 + time_inv + time_inv_sqrt + time_inv_quarter + time_log +
                  moves_inv_2 + moves_inv + moves_inv_sqrt + moves_inv_quarter + moves_log - 1,
                          data=cardresults)
lmfit_6predint <- lm(scoreadj ~ time_inv + 
                  moves_inv_2 + moves_inv + moves_inv_sqrt + moves_inv_quarter + moves_log + 
                  time_inv:moves_inv_2 + time_inv:moves_inv + time_inv:moves_inv_sqrt  + time_inv:moves_inv_quarter  - 1,
                          data=cardresults)

lmfit = lmfit_6predint
summary(lmfit)$r.squared
## [1] 1
MAE_2pred = mean(abs(cardresults$scoreadj-lmfit_2pred$fitted.values))
MAE_3pred = mean(abs(cardresults$scoreadj-lmfit_3pred$fitted.values))
MAE_5pred = mean(abs(cardresults$scoreadj-lmfit_5pred$fitted.values))
MAE_10pred = mean(abs(cardresults$scoreadj-lmfit_10pred$fitted.values))
MAE_6predint = mean(abs(cardresults$scoreadj-lmfit_6predint$fitted.values))
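
For a quick side-by-side of the in-sample MAEs (each value is also quoted with its model in the validation section below):

# Collect the in-sample MAEs of the rebuilt models for comparison
round(c(pred2 = MAE_2pred, pred3 = MAE_3pred, pred5 = MAE_5pred,
        pred10 = MAE_10pred, pred6int = MAE_6predint), 2)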

Sidenote: why not just try to fit the data perfectly? While in this case we are assuming a deterministic model, in most situations we want to avoid the problem of overfitting.

Deterministically, it appears that the score is additively related to the inverse of time, along with moves raised to a power that somehow depends on the value of time.

Validation and Selected Model

However, the question remains - have we overfit the data?

Model Validation

Okay, so we’ll take a look at some brand-new games (not used to fit the model).

newresults = read_excel("AddictionNew.xlsx") |>
  select(score, time, moves, scoreadj) |> 
  mutate(scoreadj_trans_quad = scoreadj^(transform_quad),
         time_inv_2 = 1/time^2,
         time_inv = 1/time,
         time_inv_sqrt = 1/(time^.5),
         time_inv_quarter = 1/(time^.25),
         time_log = log(time),
         moves_inv_2 = 1/moves^2,
         moves_inv = 1/moves,
         moves_inv_sqrt = 1/(moves^.5),
         moves_inv_quarter = 1/(moves^.25),
         moves_log = log(moves)
  )

newresults[,1:4]
## # A tibble: 11 Ă— 4
##    score  time moves scoreadj
##    <dbl> <dbl> <dbl>    <dbl>
##  1 38440    53    35    25040
##  2 34457    56    42    21057
##  3 38803    52    35    25403
##  4 61403    29    23    48003
##  5 51291    36    28    37891
##  6 35991    70    31    22591
##  7 31435    71    40    18035
##  8 52393    32    32    38993
##  9 38588    49    38    25188
## 10 49714    35    32    36314
## 11 26447    90    46    13047
names(newresults)
##  [1] "score"               "time"                "moves"              
##  [4] "scoreadj"            "scoreadj_trans_quad" "time_inv_2"         
##  [7] "time_inv"            "time_inv_sqrt"       "time_inv_quarter"   
## [10] "time_log"            "moves_inv_2"         "moves_inv"          
## [13] "moves_inv_sqrt"      "moves_inv_quarter"   "moves_log"

We will use each of the following models to predict (adjusted) score of the new data, based on observed time and moves.

  • linear model with the transformed response (to the power of -1.43), that is quadratic in the predictors time and moves: MAE = 462.57
predict_quad_trans <- predict(lmfit_quad, newdata = newresults) 
predict_quad <- predict_quad_trans^(1/transform_quad)
gf_point(newresults$scoreadj ~ predict_quad, ylab = "adjusted score") + 
  geom_abline(slope=1, intercept=0,)

with honest MAE measure:

round(mean(abs(newresults$scoreadj-predict_quad)),2)
## [1] 543.83

(honest refers to the fact that we are predicting data that was not used to fit the model)

  • neural network, with interaction model on original-unit predictors time and moves: MAE = 1025.93
predict_nnet <- nnetfit_int_orig |> predict(newresults) 
gf_point(newresults$scoreadj ~ predict_nnet, ylab = "adjusted score") + 
  geom_abline(slope=1, intercept=0,)

with honest MAE measure:

round(mean(abs(newresults$scoreadj-predict_nnet)),2)
## [1] 1673.73
  • 2-predictor model, with transformed predictors: MAE = 1601.59
predict_2pred <- predict(lmfit_2pred, newdata = newresults) 
gf_point(newresults$scoreadj ~ predict_2pred, ylab = "adjusted score") + 
  geom_abline(slope=1, intercept=0,)

with honest MAE measure:

round(mean(abs(newresults$scoreadj-predict_2pred)),2)
## [1] 1653.08
  • 3-predictor model, with transformed predictors: MAE = 468.76
predict_3pred <- predict(lmfit_3pred, newdata = newresults) 
gf_point(newresults$scoreadj ~ predict_3pred, ylab = "adjusted score") + 
  geom_abline(slope=1, intercept=0,)

with honest MAE measure:

round(mean(abs(newresults$scoreadj-predict_3pred)),2)
## [1] 583.39
  • 5-predictor model, with transformed predictors: MAE = 13.93
predict_5pred <- predict(lmfit_5pred, newdata = newresults) 
gf_point(newresults$scoreadj ~ predict_5pred, ylab = "adjusted score") + 
  geom_abline(slope=1, intercept=0,)

with honest MAE measure:

round(mean(abs(newresults$scoreadj-predict_5pred)),2)
## [1] 19.82
  • 10-predictor model, with transformed predictors: MAE = 2.02
predict_10pred <- predict(lmfit_10pred, newdata = newresults) 
gf_point(newresults$scoreadj ~ predict_10pred, ylab = "adjusted score") + 
  geom_abline(slope=1, intercept=0,)

with honest MAE measure:

round(mean(abs(newresults$scoreadj-predict_10pred)),2)
## [1] 3.28
  • 6-predictor model with interactions, with transformed predictors: MAE = 1
predict_6predint <- predict(lmfit_6predint, newdata = newresults) 
gf_point(newresults$scoreadj ~ predict_6predint, ylab = "adjusted score") + 
  geom_abline(slope=1, intercept=0,)

with honest MAE measure:

round(mean(abs(newresults$scoreadj-predict_6predint)),2)
## [1] 1.23
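
Collecting the honest MAEs computed above into one place (a small convenience, not in the original write-up):

# Side-by-side of the honest (new-data) MAEs for all validated models
round(c(quad_trans = mean(abs(newresults$scoreadj - predict_quad)),
        nnet = mean(abs(newresults$scoreadj - predict_nnet)),
        pred2 = mean(abs(newresults$scoreadj - predict_2pred)),
        pred3 = mean(abs(newresults$scoreadj - predict_3pred)),
        pred5 = mean(abs(newresults$scoreadj - predict_5pred)),
        pred10 = mean(abs(newresults$scoreadj - predict_10pred)),
        pred6int = mean(abs(newresults$scoreadj - predict_6predint))), 2)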

Final(?) Model

We pick what is arguably the most complex model (though it doesn’t directly incorporate a dependence of the power of moves on time). We observe almost perfect predictions of new data, with the lowest honest MAE of 1.23.

The coefficients for this model are:

coef_lmfit = round(coef(lmfit_6predint),0)

Estimated final model:

\(\hat{adjscore} = -8317251 \cdot\frac{1}{time} + 46432001 \cdot\frac{1}{moves^2} - 23330790 \cdot\frac{1}{moves} + 16079286 \cdot\frac{1}{moves^{1/2}} - 6119955 \cdot\frac{1}{moves^{1/4}} + 121813 \cdot log(moves) - 430659720 \cdot\frac{1}{(moves^2)(time)} + 195135560 \cdot\frac{1}{(moves)(time)} - 139545930 \cdot\frac{1}{(moves^{1/2})(time)} + 67328602 \cdot\frac{1}{(moves^{1/4})(time)}\)

Final Model Assessment

No response transformation is suggested:

MASS::boxcox(lmfit)

A check for outliers shows that high-moves data are identified as unusual:

plot(lmfit, which=1, id.n=5)

#outlierTest(lmfit_quad)
adjOutliers <- rosnerTest(lmfit$residuals, k=5)

The data identified as outlying are shown below:

cardresults |> 
  slice(adjOutliers$all.stats$Obs.Num[adjOutliers$all.stats$Outlier]) |> 
  select(time, moves, score, scoreadj)
##   time moves score scoreadj
## 1   43    36 42334    28934
## 2   69    46 29828    16428
## 3   45    45 37844    24444
moves_breaks = max(cardresults$moves)-min(cardresults$moves)+1

Additionally, there is very clear evidence that high-time data are (overwhelmingly) identified as influential:

plot(lmfit, which=5, id.n=5)

CooksD <- cooks.distance(lmfit_quad)
max(CooksD)  # values > 0.5 are suspect (possibly influential); values > 1 are almost surely influential
## [1] 12.9099

The data identified as influential are shown below:

cardresults |> 
  filter(CooksD > 0.5) |> 
  select(time, moves, score, scoreadj)
##   time moves score scoreadj
## 1  213    34 24771    11371
## 2   89    44 27163    13763
## 3   85    45 27386    13986
time_breaks = max(cardresults$time)-min(cardresults$time)+1

Last question

What do you think? How would you select and/or further revise the model to be used?