Like most MMO players, I wanted to optimize my DPS, which led me to check out which stats to prioritize as a PLD. What I stumbled upon most often were spreadsheets about the gain brought by each stat, which is not exactly the same as stat weights : those take into account DPS rotations and how they interact with the stats. The takeaway was that, in theory, crit is the stat to focus on as a PLD for that small DPS boost that is sometimes needed in tough Savage raids.
But I was still interested in stat weights. Since they are normally calculated by running many simulations to reveal which stats contribute most to simulated DPS, my attempt focused on collecting logs from boss fights to see whether I could estimate stat weights through a linear regression instead.
The data I got were from PLD DPS logs on Ramuh, which had the highest DPS values (since you can keep a high uptime on that boss once the mechanics are mastered).
Unfortunately, player skill seems to contribute a lot : the error is so high that my model explains only about a third of the variance in DPS (R-squared of 0.38, adjusted 0.33), the rest being unexplained noise.
Ideally, I would use a large number of DPS logs against training dummies to get a better model fit. Do not hesitate to send me some if you have them.
I collected the data from combat logs on FF Logs : https://www.fflogs.com/zone/rankings/33#metric=dps&class=Global&boss=-1&spec=Paladin&dpstype=pdps
I’ll put a quick description at the end of how I collected the data, for those who are interested.
I ended up with a clean data set like this :
clean_df <- read.csv("./travaux persos/clean_df.csv")
head(clean_df)
##                 name     hp  STR tenacity crit   DH deter skill_speed     DPS
## 1        Kyoko Terai 161047 4734      528 3863 1760  2282        1017 12142.2
## 2         Meekus Cat 161079 4737      528 3863 1820  2342         897 11758.8
## 3        Sleepy Chan 161047 4734      528 3863 1700  2342        1017 11841.1
## 4      Hina Gamiluna 161016 4734      528 3863 1760  2282        1017 12017.2
## 5   Seronin Ashekage 161079 4737      528 3863 1700  2282        1077 11795.1
## 6 Crystalyn Romanoff 161079 4737      528 3863 1760  2282        1017 12110.3
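To build the model, I do a tiny bit of pre-processing : I drop the name and hp columns, which are not predictors, and remove the outliers whose DPS is more than 2 standard deviations away from the mean :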
DPSsd <- sd(clean_df$DPS)
DPSmean <- mean(clean_df$DPS)
library(dplyr)
model_df <- clean_df %>% select(-c(name,hp)) %>% filter(DPS <= DPSmean + 2*DPSsd) %>%
filter(DPS >= DPSmean - 2*DPSsd)
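To make the 2-standard-deviation rule concrete, here is a minimal sketch on a made-up DPS vector (the numbers are invented for illustration and do not come from the logs). It applies the same rule as the filter() calls above, in base R :

```r
# Toy DPS values (made up for illustration), with one obvious outlier
toy_dps <- c(10000, 10200, 9900, 10100, 10050, 9950, 10150, 9850, 10000, 15000)

toy_mean <- mean(toy_dps)
toy_sd <- sd(toy_dps)

# Keep only the values within 2 standard deviations of the mean
keep <- toy_dps >= toy_mean - 2 * toy_sd & toy_dps <= toy_mean + 2 * toy_sd
toy_kept <- toy_dps[keep]

print(toy_kept)  # the 15000 entry is dropped, the other nine stay
```

Note that the mean and standard deviation are computed with the outlier still included, exactly as in the pre-processing above.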
We have the following DPS distribution :
library(ggplot2)
ggplot(data = model_df, aes(x = DPS)) + geom_histogram(fill = "lightblue")
I slice my data into 2, 80% for model training and 20% for testing.
I set an RNG seed so that you can try it at home and obtain the same results.
library(caret)
set.seed(69420)
inTrain <- createDataPartition(y = model_df$DPS, p = 0.8, list = F)
training <- model_df[inTrain, ]
testing <- model_df[-inTrain, ]
I build my linear model on the training data frame :
modlm <- train(DPS ~ ., data = training, method = "lm")
One way of checking the model’s accuracy is to compute the root-mean-square error (RMSE) on the testing set :
\[ RMSE = \sqrt{\frac{1}{n} ~ \sum_{i=1}^{n} (predicted~ value_i - true~ value_i)^2} \]
My RMSE is :
RMSE(predict(modlm, testing), testing$DPS)
## [1] 765.1144
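Divided by the summed DPS of the testing set, as computed here, it comes to about 0.35%. That ratio shrinks as the testing set grows, though; against the mean DPS of the 20 testing rows the error is 20 times larger, about 7% :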
RMSE(predict(modlm, testing), testing$DPS)/sum(testing$DPS)
## [1] 0.003519549
Here’s a quick plot of the actual values vs predicted values. If the model were perfect, all the dots would be on the red line :
qplot(predict(modlm, testing), testing$DPS) + geom_abline(intercept = 0, slope = 1, colour = "red")
And here’s the data frame of predicted values vs measured values :
confirm <- data.frame(predicted = predict(modlm, testing), real = testing$DPS)
head(confirm, 20)
##    predicted    real
## 2  11238.650 11758.8
## 10 11229.787 11723.3
## 15 10850.366 11624.7
## 16  9946.778 11598.7
## 19 10104.310 11311.3
## 21 11228.172 11389.9
## 41 11172.946 11108.1
## 42 11138.794 11363.6
## 44 10955.385 10655.6
## 46 11177.791 11415.1
## 48 10812.692 10745.7
## 49 10897.517 11014.3
## 66 10504.166 10469.2
## 67 10574.578 10516.7
## 68 10192.969 10238.5
## 79 10564.740  9431.9
## 80 10922.667  9401.0
## 92 11176.176  9745.3
## 95 11224.942 11898.3
## 97 10218.967  9979.9
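So we have a model that lands in the right ballpark yet is far from perfect : most predictions are within a few hundred DPS of the measured value, but several miss by more than 1,000 (rows 16, 19, 79, 80 and 92).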
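As a sanity check, the RMSE can be recomputed by hand from the 20 predicted/real pairs printed above, without caret. This is only a sketch : the numbers are copied from the table (so they carry its rounding) and the variable names are made up for the example. It also computes the error relative to the mean DPS of the testing rows :

```r
# Predicted and measured DPS, copied from the 20 rows printed above
predicted <- c(11238.650, 11229.787, 10850.366, 9946.778, 10104.310,
               11228.172, 11172.946, 11138.794, 10955.385, 11177.791,
               10812.692, 10897.517, 10504.166, 10574.578, 10192.969,
               10564.740, 10922.667, 11176.176, 11224.942, 10218.967)
real <- c(11758.8, 11723.3, 11624.7, 11598.7, 11311.3,
          11389.9, 11108.1, 11363.6, 10655.6, 11415.1,
          10745.7, 11014.3, 10469.2, 10516.7, 10238.5,
          9431.9, 9401.0, 9745.3, 11898.3, 9979.9)

# Root-mean-square error: square root of the mean squared difference
rmse <- sqrt(mean((predicted - real)^2))

# Error relative to the average DPS of the testing rows
relative_rmse <- rmse / mean(real)

print(rmse)           # about 765, in line with the caret output
print(relative_rmse)  # about 0.07, i.e. roughly 7% of the mean DPS
```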
Now let’s check what coefficient our model puts for each stat :
summary(modlm)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1773.19  -279.31   -46.64   398.31  1137.79
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12687.7786  2248.5696   5.643    3e-07 ***
## STR            -1.6149     0.7126  -2.266  0.02641 *
## tenacity       -0.5265     0.6070  -0.867  0.38861
## crit            1.4090     0.5111   2.757  0.00737 **
## DH              0.3380     0.4645   0.728  0.46924
## deter           0.1331     0.5438   0.245  0.80730
## skill_speed     0.1213     0.5880   0.206  0.83713
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 597.4 on 73 degrees of freedom
## Multiple R-squared: 0.3841, Adjusted R-squared: 0.3335
## F-statistic: 7.588 on 6 and 73 DF, p-value: 2.406e-06
Crit seems to be the top stat for PLD. However, the model gives a negative coefficient to STR.
That might be explained by the fact that, in this sample, players with high STR values performed poorly compared to players with lower STR :
qplot(STR, DPS, data = model_df)
Overall, the contribution of the other stats is impossible to tell because their coefficients are not statistically significant (p-values far above 5%).
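One way to see what “not statistically significant” means here is to turn each estimate into an approximate 95% confidence interval, estimate ± t × standard error, with 73 residual degrees of freedom. A minimal sketch, with the estimates and standard errors copied from the summary above (the data frame name coefs is just for the example) :

```r
# Estimates and standard errors copied from the summary() output above
coefs <- data.frame(
  stat     = c("STR", "tenacity", "crit", "DH", "deter", "skill_speed"),
  estimate = c(-1.6149, -0.5265, 1.4090, 0.3380, 0.1331, 0.1213),
  std_err  = c(0.7126, 0.6070, 0.5111, 0.4645, 0.5438, 0.5880)
)

# Two-sided 95% critical value of the t distribution with 73 degrees of freedom
t_crit <- qt(0.975, df = 73)

coefs$lower <- coefs$estimate - t_crit * coefs$std_err
coefs$upper <- coefs$estimate + t_crit * coefs$std_err

# An interval that contains 0 means the sign of the effect is not established
coefs$contains_zero <- coefs$lower < 0 & coefs$upper > 0

print(coefs)
```

Only STR and crit get intervals that exclude 0, and the crit interval still runs from roughly 0.4 to 2.4 DPS per point, so even the best-supported weight is imprecise.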
As said in the summary, the data I got is from Ramuh instead of dummies and therefore depends heavily on player skill.
Looking at the R-squared value, my model only explains about 38% of the variance in DPS, the rest being noise, and even that is inflated since I use 6 regressors. Going by the adjusted R-squared, it explains about 33%.
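The adjusted figure can be reproduced from the summary. With n = 80 training rows (73 residual degrees of freedom, plus 6 regressors, plus the intercept) and p = 6 regressors, adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1). A short sketch using the rounded values printed above :

```r
r2 <- 0.3841      # Multiple R-squared from the summary
p <- 6            # number of regressors
n <- 73 + p + 1   # residual degrees of freedom + regressors + intercept = 80

# Adjusted R-squared penalises the fit for the number of regressors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(adj_r2, 4))  # 0.3335, the Adjusted R-squared reported by summary()
```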
So far, without proper data, the only thing that might be confirmed from this is that crit is indeed the top stat for PLD.
I collected the data using the RSelenium package for R.
I make Selenium go to the main page :
link <- "https://www.fflogs.com/zone/rankings/33#metric=dps&class=Global&boss=-1&spec=Paladin&dpstype=pdps"
library(RSelenium)
eCaps <- list(chromeOptions = list(
args = c('--headless', '--disable-gpu', '--window-size=1280,800', '--no-sandbox')
))
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100", port = 4445L, browserName = "chrome",
extraCapabilities = eCaps)
remDr$open()
remDr$navigate(link)
From that, I get all the links in the page. Since all the links to player logs have the same pattern, I filter them using grepl (regular expressions) for the patterns “zone=33” (Eden) and “/character/id/” :
temp2 <- remDr$findElements(using = 'css selector', value = 'a')
# Collect the href of every <a> element (no hard-coded element count);
# getElementAttribute() returns a list, so unlist() flattens the result
hreflinks <- unlist(lapply(temp2, function(el) el$getElementAttribute("href")))
# Keep only the links to character logs for zone 33 (Eden)
sshref <- grepl(pattern = "zone=33", hreflinks)
sshref2 <- grepl(pattern = "/character/id/", hreflinks)
hreflinks2 <- hreflinks[sshref & sshref2]
All the links are in the “hreflinks2” vector. You can either loop this process over every page, or repeat it manually.
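To illustrate the filtering step on its own, here is a small sketch using made-up URLs (they only imitate the shape of the links and are not real log pages). Combining both patterns with & keeps only the links that match the two conditions :

```r
# Hypothetical links, shaped like the ones found on a rankings page
toy_links <- c(
  "https://example.com/character/id/111?zone=33",
  "https://example.com/character/id/222?zone=29",
  "https://example.com/zone/rankings/33",
  "https://example.com/character/id/333?zone=33"
)

# fixed = TRUE treats the patterns as plain text rather than regular expressions
is_eden <- grepl("zone=33", toy_links, fixed = TRUE)
is_character <- grepl("/character/id/", toy_links, fixed = TRUE)

player_links <- toy_links[is_eden & is_character]
print(player_links)  # keeps the first and last links only
```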
Now once we have those links, we extract the name, DPS, and stats with a while() loop.
I used a while() loop instead of for() because then I know, from the current “m” value, where to start again if the extraction fails or crashes (and it crashed a lot).
In fact, so many incomplete logs crashed the collection that it was easier to build the “hreflinks2” vector manually by copy-pasting valid links.
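A possible alternative to restarting by hand is to wrap each page in tryCatch(), so that an incomplete log yields a row of NA instead of stopping the loop. The sketch below does not use Selenium : scrape_one() is a hypothetical stand-in that fails on purpose for some inputs, just to show the pattern :

```r
# Hypothetical stand-in for the per-page scraping; it fails on "incomplete" pages
scrape_one <- function(url) {
  if (grepl("incomplete", url, fixed = TRUE)) {
    stop("element not found")
  }
  data.frame(url = url, crit = 3863, DPS = 11000)  # dummy values
}

toy_urls <- c("page-a", "page-incomplete", "page-b")

rows <- lapply(toy_urls, function(url) {
  tryCatch(
    scrape_one(url),
    # On error, report the page and return a placeholder row instead of crashing
    error = function(e) {
      message("skipping ", url, ": ", conditionMessage(e))
      data.frame(url = url, crit = NA_real_, DPS = NA_real_)
    }
  )
})

result <- do.call(rbind, rows)
print(result)

# Rows that failed can be dropped (or retried) afterwards
complete_rows <- result[complete.cases(result), ]
```

The loop actually used for this post, without this safety net, follows.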
library(stringr)
name <- character()
hp <- numeric()
dps <- numeric()
str <- numeric()
tenacity <- numeric()
crit <- numeric()
DH <- numeric()
determination <- numeric()
skill_speed <- numeric()
m <- 1
while (m <= length(hreflinks2)){
  print(paste0("getting data n°", m))
  remDr$navigate(hreflinks2[m])
  # get the HP
  hpxpath <- remDr$findElement(using = 'xpath', value = '//*[@id="stats-box-contents"]/div[1]/div[2]')
  hp <- c(hp, as.numeric(hpxpath$getElementText()[[1]]))
  # get the name (the part of the page title before the first "-")
  nom <- strsplit(remDr$getTitle()[[1]], "-")
  nom <- nom[[1]]
  nom <- nom[1]
  name <- c(name, nom)
  # get the DPS (strip the thousands separator before converting)
  deeps <- remDr$findElement(using = 'xpath', value = '//*[@id="boss-table-33"]/tbody/tr[1]/td[3]')
  dps <- c(dps, as.numeric(str_remove(deeps$getElementText()[[1]], ",")))
  # get each stat from its slot in the stats box
  Strxpath <- remDr$findElement(using = 'xpath', value = '//*[@id="stats-box-contents"]/div[2]/div[2]')
  str <- c(str, as.numeric(Strxpath$getElementText()[[1]]))
  Tenaxpath <- remDr$findElement(using = 'xpath', value = '//*[@id="stats-box-contents"]/div[3]/div[2]')
  tenacity <- c(tenacity, as.numeric(Tenaxpath$getElementText()[[1]]))
  Critxpath <- remDr$findElement(using = 'xpath', value = '//*[@id="stats-box-contents"]/div[5]/div[2]')
  crit <- c(crit, as.numeric(Critxpath$getElementText()[[1]]))
  DHxpath <- remDr$findElement(using = 'xpath', value = '//*[@id="stats-box-contents"]/div[6]/div[2]')
  DH <- c(DH, as.numeric(DHxpath$getElementText()[[1]]))
  Detxpath <- remDr$findElement(using = 'xpath', value = '//*[@id="stats-box-contents"]/div[7]/div[2]')
  determination <- c(determination, as.numeric(Detxpath$getElementText()[[1]]))
  Skspxpath <- remDr$findElement(using = 'xpath', value = '//*[@id="stats-box-contents"]/div[8]/div[2]')
  skill_speed <- c(skill_speed, as.numeric(Skspxpath$getElementText()[[1]]))
  # move on to the next link; if the loop crashes, m tells where to resume
  m <- m+1
}
full_data_frame <- data.frame(name = name, hp = hp, STR = str, tenacity = tenacity,
crit = crit, DH = DH, deter = determination,
skill_speed = skill_speed, DPS = dps)
This gets you the raw data frame. Sometimes characters that aren’t PLD show up in PLD logs, so I filtered these out by dropping any observation with an HP of 125000 or less.
In addition, rows 31 and 32 had some missing values so I did some manual cleaning :
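clean_df <- full_data_frame %>% filter(hp > 125000)
# Row 31: fill in the missing skill_speed (column 8)
clean_df[31, 8] <- 1017
# Row 32: fill in tenacity (rounded mean, ignoring NA), crit, DH, deter and skill_speed (columns 4 to 8)
clean_df[32, 4:8] <- c(round(mean(clean_df$tenacity, na.rm = TRUE), 0), 3863, 1520, 2282, 1257)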
And this is the clean data I started the work from.