In this first assignment, we’ll attempt to predict ratings with very little information. We’ll first look at just raw averages across all (training dataset) users. We’ll then account for “bias” by normalizing across users and across items.
Please code as much of your work as possible in R or Python. You may use standard functions (e.g. from base R and the tidyverse). Your project should be delivered in an R Markdown or a Jupyter notebook, then the notebook should be saved into a GitHub repository. You should include a link to your GitHub repository in your assignment submission link.
This system recommends Chess players to a group of gamblers who bet on Chess winnings. Break the ratings into separate training and test datasets. Using the training data, calculate the raw average (mean) rating for every user-item combination. Calculate the RMSE for raw average for both the training data and the test data. Using the training data, calculate the bias for each user and each item. From the raw average, and the appropriate user and item biases, calculate the baseline predictors for every user-item combination. Calculate the RMSE for the baseline predictors for both the training data and the test data. Summarize the results.
The data set is a combination of actual players from a prior project and synthetic ratings created for this task. Each chess player's rating is provided by a panel member, who scores and numerically rates the player based on skill, proficiency, and the number of points accumulated per game. The highest number of points per game is six (6) and the highest rating per game is five (5). The points and ratings do not depend on the number of chess games played.
chess <- read.csv("https://raw.githubusercontent.com/Emahayz/Data-612/master/ChessRating.csv", header=T, sep = ",")
chess
## PlayerName PlayerID PlayerState GamesPlayed
## 1 GARY HUA 12456 ON 1794
## 2 DAKSHESH DARURI 43521 MI 1553
## 3 ADITYA BAJAJ 13452 MI 1384
## 4 PATRICK H SCHILLING 12367 MI 1716
## 5 HANSHI ZUO 45671 MI 1655
## 6 HANSEN SONG 23146 OH 1686
## 7 GARY DEE SWATHELL 12437 MI 1649
## 8 EZEKIEL HOUGHTON 13426 MI 1641
## 9 STEFANO LEE 24312 ON 1411
## 10 ANVIT RAO 24683 MI 1365
## 11 CAMERON WILLIAM MC LEMAN 24135 MI 1712
## 12 KENNETH J TACK 45621 MI 1663
## 13 TORRANCE HENRY JR 23142 MI 1666
## 14 BRADLEY SHAW 24128 MI 1610
## 15 ZACHARY JAMES HOUGHTON 29024 MI 1220
## 16 MIKE NIKITIN 24671 MI 1604
## 17 RONALD GRZEGORCZYK 64234 MI 1629
## 18 DAVID SUNDEEN 32142 MI 1600
## 19 DIPANKAR ROY 23451 MI 1564
## 20 JASON ZHENG 24162 MI 1595
## 21 DINH DANG BUI 24132 ON 1563
## 22 EUGENE L MCCLURE 42312 MI 1555
## 23 ALAN BUI 43126 ON 1363
## 24 MICHAEL R ALDRICH 23141 MI 1229
## 25 LOREN SCHWIEBERT 22441 MI 1745
## 26 MAX ZHU 33233 ON 1579
## 27 GAURAV GIDWANI 31455 MI 1552
## 28 SOFIA ADINA 32445 MI 1507
## 29 CHIEDOZIE OKORIE 42351 MI 1602
## 30 GEORGE AVERY JONES 23542 ON 1522
## 31 RISHI SHETTY 11334 MI 1494
## 32 JOSHUA PHILIP MATHEWS 34122 ON 1441
## 33 JADE GE 33219 MI 1449
## 34 MICHAEL JEFFERY THOMAS 41923 MI 1399
## 35 JOSHUA DAVID LEE 24136 MI 1438
## 36 SIDDHARTH JHA 33561 MI 1355
## 37 AMIYATOSH PWNANANDAM 34513 MI 980
## 38 BRIAN LIU 22566 MI 1423
## 39 JOEL R HENDON 33524 MI 1436
## 40 FOREST ZHANG 22431 MI 1348
## 41 KYLE WILLIAM MURPHY 33524 MI 1403
## 42 JARED GE 33121 MI 1332
## 43 ROBERT GLEN VASEY 33241 MI 1283
## 44 JUSTIN D SCHILLING 24511 MI 1199
## 45 DEREK YAN 22341 MI 1242
## 46 JACOB ALEXANDER LAVALLEY 22413 MI 377
## 47 ERIC WRIGHT 55321 MI 1362
## 48 DANIEL KHAIN 22413 MI 1382
## 49 MICHAEL J MARTIN 22437 MI 1291
## 50 SHIVAM JHA 22413 MI 1056
## 51 TEJAS AYYAGARI 22567 MI 1011
## 52 ETHAN GUO 22515 MI 935
## 53 JOSE C YBARRA 22345 MI 1393
## 54 LARRY HODGE 22531 MI 1270
## 55 ALEX KONG 32451 MI 1186
## 56 MARISA RICCI 44626 MI 1153
## 57 MICHAEL LU 34216 MI 1092
## 58 VIRAJ MOHILE 23451 MI 917
## 59 SEAN M MC CORMICK 22458 MI 853
## 60 JULIA SHEN 34871 MI 967
## 61 JEZZEL FARKAS 43562 ON 955
## 62 ASHWIN BALAJI 22904 MI 1530
## 63 THOMAS JOSEPH HOSMER 24095 MI 1175
## 64 BEN LI 33294 MI 1163
## PanelMember PanelMemberID TotalNumberofPoints PlayerRating
## 1 John Morrison 2312 6.0 5
## 2 Kate Foo 1245 6.0 5
## 3 Andrew Sung 1125 6.0 5
## 4 Maria Dubel 1561 5.5 4
## 5 Henry Churk 3216 5.5 4
## 6 Solomon Dugary 1921 5.0 4
## 7 Andrew Sung 1125 5.0 4
## 8 Maria Dubel 1561 5.0 4
## 9 Kate Foo 1245 5.0 4
## 10 Smith McHenry 1425 5.0 4
## 11 Solomon Dugary 1921 4.5 3
## 12 Maria Dubel 1561 4.5 3
## 13 Andrew Sung 1125 4.5 3
## 14 Solomon Dugary 1921 4.5 3
## 15 Mathew King 2513 4.5 3
## 16 Anna Henry 2413 4.0 3
## 17 Nadine Young 2318 4.0 3
## 18 Henry Churk 3216 4.0 3
## 19 Solomon Dugary 1921 4.0 3
## 20 Maria Dubel 1561 4.0 3
## 21 Maria Dubel 1561 4.0 3
## 22 John Morrison 2312 4.0 4
## 23 Andrew Sung 1125 4.0 4
## 24 Maria Dubel 1561 4.0 3
## 25 Smith McHenry 1425 3.5 2
## 26 Solomon Dugary 1921 3.5 2
## 27 Henry Churk 3216 3.5 2
## 28 Maria Dubel 1561 3.5 2
## 29 Andrew Sung 1125 3.5 2
## 30 Solomon Dugary 1921 3.5 2
## 31 John Morrison 2312 3.5 2
## 32 Kate Foo 1245 3.5 2
## 33 Solomon Dugary 1921 3.5 2
## 34 Maria Dubel 1561 3.5 2
## 35 Andrew Sung 1125 3.5 2
## 36 Kate Foo 1245 3.5 2
## 37 John Morrison 2312 3.5 2
## 38 Kate Foo 1245 3.0 2
## 39 Andrew Sung 1125 3.0 2
## 40 Smith McHenry 1425 3.0 2
## 41 Solomon Dugary 1921 3.0 2
## 42 Kate Foo 1245 3.0 2
## 43 Smith McHenry 1425 3.0 2
## 44 John Morrison 2312 3.0 2
## 45 Kate Foo 1245 3.0 2
## 46 Solomon Dugary 1921 3.0 2
## 47 Henry Churk 3216 2.5 1
## 48 Maria Dubel 1561 2.5 1
## 49 John Morrison 2312 2.5 1
## 50 Henry Churk 3216 2.5 1
## 51 Solomon Dugary 1921 2.5 1
## 52 Henry Churk 3216 2.5 1
## 53 Maria Dubel 1561 2.0 1
## 54 Andrew Sung 1125 2.0 1
## 55 Maria Dubel 1561 2.0 1
## 56 Solomon Dugary 1921 2.0 1
## 57 Smith McHenry 1425 2.0 1
## 58 Henry Churk 3216 2.0 1
## 59 Kate Foo 1245 2.0 1
## 60 Smith McHenry 1425 1.5 1
## 61 Solomon Dugary 1921 1.5 1
## 62 Andrew Sung 1125 1.0 1
## 63 Maria Dubel 1561 1.0 1
## 64 John Morrison 2312 1.0 1
## 'data.frame': 64 obs. of 8 variables:
## $ PlayerName : Factor w/ 64 levels "ADITYA BAJAJ",..: 24 12 1 51 28 27 23 21 59 5 ...
## $ PlayerID : int 12456 43521 13452 12367 45671 23146 12437 13426 24312 24683 ...
## $ PlayerState : Factor w/ 3 levels "MI","OH","ON": 3 1 1 1 1 2 1 1 3 1 ...
## $ GamesPlayed : int 1794 1553 1384 1716 1655 1686 1649 1641 1411 1365 ...
## $ PanelMember : Factor w/ 10 levels "Andrew Sung",..: 4 5 1 6 3 10 1 6 5 9 ...
## $ PanelMemberID : int 2312 1245 1125 1561 3216 1921 1125 1561 1245 1425 ...
## $ TotalNumberofPoints: num 6 6 6 5.5 5.5 5 5 5 5 5 ...
## $ PlayerRating : int 5 5 5 4 4 4 4 4 4 4 ...
Use 80% for training and 20% for testing the model
library(dplyr)    # select() and %>%
library(caTools)  # sample.split()
chessnew <- dplyr::select(chess, PlayerName, GamesPlayed, PanelMember, PlayerRating)
# Randomly split the players' ratings into training and testing data sets
set.seed(234)
chesssplit <- sample.split(chessnew$PlayerRating, SplitRatio = 0.80)
Prepare training dataset
## PlayerName GamesPlayed PanelMember PlayerRating
## 1 GARY HUA 1794 John Morrison 5
## 2 DAKSHESH DARURI 1553 Kate Foo NA
## 3 ADITYA BAJAJ 1384 Andrew Sung 5
## 4 PATRICK H SCHILLING 1716 Maria Dubel 4
## 5 HANSHI ZUO 1655 Henry Churk 4
## 6 HANSEN SONG 1686 Solomon Dugary 4
Prepare testing set
## PlayerName GamesPlayed PanelMember PlayerRating
## 1 GARY HUA 1794 John Morrison NA
## 2 DAKSHESH DARURI 1553 Kate Foo 5
## 3 ADITYA BAJAJ 1384 Andrew Sung NA
## 4 PATRICK H SCHILLING 1716 Maria Dubel NA
## 5 HANSHI ZUO 1655 Henry Churk NA
## 6 HANSEN SONG 1686 Solomon Dugary NA
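The code that turned the split into the two masked frames above is not shown. A minimal sketch of one way to get that shape (every row kept in both frames, with the rating set to NA when it belongs to the other set) is given below. The toy data and the names `ratings`, `toy_train` and `toy_test` are made up for illustration, and the sketch uses a plain random draw with base R `sample()` rather than `sample.split()`, so it will not reproduce the exact split used here.

```r
# Toy stand-in for chessnew; the values are illustrative, not the real data.
ratings <- data.frame(
  PlayerName   = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"),
  PanelMember  = c("P1", "P2", "P1", "P3", "P2", "P3", "P1", "P2", "P3", "P1"),
  PlayerRating = c(5, 5, 4, 4, 3, 3, 2, 2, 1, 1),
  stringsAsFactors = FALSE
)

set.seed(234)
n <- nrow(ratings)

# Hypothetical split: draw 80% of the row indices for training.
train_idx <- sample(seq_len(n), size = round(0.8 * n))
is_train  <- seq_len(n) %in% train_idx

# Keep every row in both frames, but blank out the rating
# that belongs to the other set.
toy_train <- ratings
toy_train$PlayerRating[!is_train] <- NA

toy_test <- ratings
toy_test$PlayerRating[is_train] <- NA
```

Keeping both frames the same size makes it easy to line predictions up against either set later, because row i always refers to the same player and panel member.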
Convert data to numeric
chess_train<-as.data.frame(sapply(chess_train, as.numeric))
chess_test<-as.data.frame(sapply(chess_test, as.numeric))
Calculating Raw Average
RawAverage_train <- sum(chess_train$PlayerRating, na.rm = TRUE) / length(which(!is.na(chess_train$PlayerRating)))
RawAverage_train
## [1] 2.313725
train_error <- RawAverage_train-chess_train$PlayerRating
trainRMSE <- sqrt(mean((train_error^2), na.rm=TRUE))
round(trainRMSE,2)
## [1] 1.13
RawAverage_test <- sum(chess_test$PlayerRating, na.rm = TRUE) / length(which(!is.na(chess_test$PlayerRating)))
RawAverage_test
## [1] 2.384615
test_error <- RawAverage_test-chess_test$PlayerRating
testRMSE <- sqrt(mean((test_error^2), na.rm=TRUE))
round(testRMSE,2)
## [1] 1.27
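The raw-average step can also be written with a small helper. The sketch below is illustrative only: `rmse()` is a hypothetical helper and the two rating vectors are toy values. It differs from the code above in one respect that I want to flag: it scores the test ratings against the training mean, which is one common convention for a held-out evaluation, whereas the code above scores the test set against its own mean.

```r
# Hypothetical helper: root mean squared error, ignoring missing ratings.
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2, na.rm = TRUE))
}

# Toy ratings; NA marks a rating that belongs to the other set.
toy_train <- c(5, NA, 4, 3, NA, 2, 1, 1)
toy_test  <- c(NA, 4, NA, NA, 2, NA, NA, NA)

# Global mean of the training ratings only.
# mean(x, na.rm = TRUE) is equivalent to sum(x, na.rm = TRUE) / sum(!is.na(x)).
raw_avg <- mean(toy_train, na.rm = TRUE)

train_rmse <- rmse(raw_avg, toy_train)
test_rmse  <- rmse(raw_avg, toy_test)   # scored against the training mean
```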
PanelBias <- round(((rowMeans(chess_train, na.rm=TRUE))-RawAverage_train),3)
y<-cbind(chessnew,PanelBias)
y %>% kable(caption = "Reviewer Bias Calculations") %>% kable_styling("striped", full_width = TRUE)

| PlayerName | GamesPlayed | PanelMember | PlayerRating | PanelBias |
|---|---|---|---|---|
| GARY HUA | 1794 | John Morrison | 5 | 454.436 |
| DAKSHESH DARURI | 1553 | Kate Foo | 5 | 521.020 |
| ADITYA BAJAJ | 1384 | Andrew Sung | 5 | 345.436 |
| PATRICK H SCHILLING | 1716 | Maria Dubel | 4 | 441.936 |
| HANSHI ZUO | 1655 | Henry Churk | 4 | 420.186 |
| HANSEN SONG | 1686 | Solomon Dugary | 4 | 429.436 |
| GARY DEE SWATHELL | 1649 | Andrew Sung | 4 | 555.353 |
| EZEKIEL HOUGHTON | 1641 | Maria Dubel | 4 | 415.686 |
| STEFANO LEE | 1411 | Kate Foo | 4 | 489.353 |
| ANVIT RAO | 1365 | Smith McHenry | 4 | 343.436 |
| CAMERON WILLIAM MC LEMAN | 1712 | Solomon Dugary | 3 | 431.436 |
| KENNETH J TACK | 1663 | Maria Dubel | 3 | 425.686 |
| TORRANCE HENRY JR | 1666 | Andrew Sung | 3 | 430.686 |
| BRADLEY SHAW | 1610 | Solomon Dugary | 3 | 405.436 |
| ZACHARY JAMES HOUGHTON | 1220 | Mathew King | 3 | 321.186 |
| MIKE NIKITIN | 1604 | Anna Henry | 3 | 412.436 |
| RONALD GRZEGORCZYK | 1629 | Nadine Young | 3 | 421.186 |
| DAVID SUNDEEN | 1600 | Henry Churk | 3 | 536.686 |
| DIPANKAR ROY | 1564 | Solomon Dugary | 3 | 396.186 |
| JASON ZHENG | 1595 | Maria Dubel | 3 | 406.686 |
| DINH DANG BUI | 1563 | Maria Dubel | 3 | 526.020 |
| EUGENE L MCCLURE | 1555 | John Morrison | 4 | 393.436 |
| ALAN BUI | 1363 | Andrew Sung | 4 | 340.186 |
| MICHAEL R ALDRICH | 1229 | Maria Dubel | 3 | 319.436 |
| LOREN SCHWIEBERT | 1745 | Smith McHenry | 2 | 447.436 |
| MAX ZHU | 1579 | Solomon Dugary | 2 | 406.686 |
| GAURAV GIDWANI | 1552 | Henry Churk | 2 | 393.186 |
| SOFIA ADINA | 1507 | Maria Dubel | 2 | 390.936 |
| CHIEDOZIE OKORIE | 1602 | Andrew Sung | 2 | 535.686 |
| GEORGE AVERY JONES | 1522 | Solomon Dugary | 2 | 517.020 |
| RISHI SHETTY | 1494 | John Morrison | 2 | 385.686 |
| JOSHUA PHILIP MATHEWS | 1441 | Kate Foo | 2 | 368.936 |
| JADE GE | 1449 | Solomon Dugary | 2 | 370.436 |
| MICHAEL JEFFERY THOMAS | 1399 | Maria Dubel | 2 | 361.186 |
| JOSHUA DAVID LEE | 1438 | Andrew Sung | 2 | 366.936 |
| SIDDHARTH JHA | 1355 | Kate Foo | 2 | 352.436 |
| AMIYATOSH PWNANANDAM | 980 | John Morrison | 2 | 245.186 |
| BRIAN LIU | 1423 | Kate Foo | 2 | 357.436 |
| JOEL R HENDON | 1436 | Andrew Sung | 2 | 365.936 |
| FOREST ZHANG | 1348 | Smith McHenry | 2 | 342.936 |
| KYLE WILLIAM MURPHY | 1403 | Solomon Dugary | 2 | 361.686 |
| JARED GE | 1332 | Kate Foo | 2 | 453.686 |
| ROBERT GLEN VASEY | 1283 | Smith McHenry | 2 | 446.020 |
| JUSTIN D SCHILLING | 1199 | John Morrison | 2 | 308.686 |
| DEREK YAN | 1242 | Kate Foo | 2 | 313.686 |
| JACOB ALEXANDER LAVALLEY | 377 | Solomon Dugary | 2 | 102.186 |
| ERIC WRIGHT | 1362 | Henry Churk | 1 | 343.686 |
| DANIEL KHAIN | 1382 | Maria Dubel | 1 | 348.186 |
| MICHAEL J MARTIN | 1291 | John Morrison | 1 | 333.186 |
| SHIVAM JHA | 1056 | Henry Churk | 1 | 276.686 |
| TEJAS AYYAGARI | 1011 | Solomon Dugary | 1 | 268.186 |
| ETHAN GUO | 935 | Henry Churk | 1 | 237.186 |
| JOSE C YBARRA | 1393 | Maria Dubel | 1 | 356.436 |
| LARRY HODGE | 1270 | Andrew Sung | 1 | 435.353 |
| ALEX KONG | 1186 | Maria Dubel | 1 | 296.686 |
| MARISA RICCI | 1153 | Solomon Dugary | 1 | 299.686 |
| MICHAEL LU | 1092 | Smith McHenry | 1 | 285.186 |
| VIRAJ MOHILE | 917 | Henry Churk | 1 | 243.686 |
| SEAN M MC CORMICK | 853 | Kate Foo | 1 | 302.020 |
| JULIA SHEN | 967 | Smith McHenry | 1 | 335.686 |
| JEZZEL FARKAS | 955 | Solomon Dugary | 1 | 247.436 |
| ASHWIN BALAJI | 1530 | Andrew Sung | 1 | 382.186 |
| THOMAS JOSEPH HOSMER | 1175 | Maria Dubel | 1 | 308.436 |
| BEN LI | 1163 | John Morrison | 1 | 389.020 |
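The PanelBias values above run into the hundreds because `rowMeans()` averages every column of the numeric-converted frame, not only the ratings. For GARY HUA, for example, (24 + 1794 + 4 + 5) / 4 - 2.313725 = 454.436, where 24 and 4 are the factor codes of the player and panel member names. A bias in the usual sense would instead be the mean training rating given by each panel member minus the raw average. A minimal sketch of that calculation on toy data follows; the data frame `toy` and its values are made up for illustration.

```r
# Toy ratings; NA marks a rating held out for testing.
toy <- data.frame(
  PanelMember  = c("P1", "P1", "P2", "P2", "P3"),
  PlayerRating = c(5, 3, 2, NA, 1),
  stringsAsFactors = FALSE
)

# Global mean of the training ratings.
raw_avg <- mean(toy$PlayerRating, na.rm = TRUE)

# Mean training rating per panel member, minus the global mean.
panel_mean <- tapply(toy$PlayerRating, toy$PanelMember, mean, na.rm = TRUE)
panel_bias <- panel_mean - raw_avg
```

Computed this way, a bias stays on the rating scale: a positive value means the panel member tends to rate above the overall average, a negative value below it.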
PlayerBias <- round(((colMeans(chess_train, na.rm=TRUE))-RawAverage_train),3)
x<-cbind(chessnew,PlayerBias)
## Warning in data.frame(..., check.names = FALSE): row names were found from
## a short variable and have been discarded
x %>% kable(caption = "Calculation for Player Bias") %>% kable_styling("striped", full_width = TRUE)

| PlayerName | GamesPlayed | PanelMember | PlayerRating | PlayerBias |
|---|---|---|---|---|
| GARY HUA | 1794 | John Morrison | 5 | 30.186 |
| DAKSHESH DARURI | 1553 | Kate Foo | 5 | 1376.186 |
| ADITYA BAJAJ | 1384 | Andrew Sung | 5 | 3.327 |
| PATRICK H SCHILLING | 1716 | Maria Dubel | 4 | 0.000 |
| HANSHI ZUO | 1655 | Henry Churk | 4 | 30.186 |
| HANSEN SONG | 1686 | Solomon Dugary | 4 | 1376.186 |
| GARY DEE SWATHELL | 1649 | Andrew Sung | 4 | 3.327 |
| EZEKIEL HOUGHTON | 1641 | Maria Dubel | 4 | 0.000 |
| STEFANO LEE | 1411 | Kate Foo | 4 | 30.186 |
| ANVIT RAO | 1365 | Smith McHenry | 4 | 1376.186 |
| CAMERON WILLIAM MC LEMAN | 1712 | Solomon Dugary | 3 | 3.327 |
| KENNETH J TACK | 1663 | Maria Dubel | 3 | 0.000 |
| TORRANCE HENRY JR | 1666 | Andrew Sung | 3 | 30.186 |
| BRADLEY SHAW | 1610 | Solomon Dugary | 3 | 1376.186 |
| ZACHARY JAMES HOUGHTON | 1220 | Mathew King | 3 | 3.327 |
| MIKE NIKITIN | 1604 | Anna Henry | 3 | 0.000 |
| RONALD GRZEGORCZYK | 1629 | Nadine Young | 3 | 30.186 |
| DAVID SUNDEEN | 1600 | Henry Churk | 3 | 1376.186 |
| DIPANKAR ROY | 1564 | Solomon Dugary | 3 | 3.327 |
| JASON ZHENG | 1595 | Maria Dubel | 3 | 0.000 |
| DINH DANG BUI | 1563 | Maria Dubel | 3 | 30.186 |
| EUGENE L MCCLURE | 1555 | John Morrison | 4 | 1376.186 |
| ALAN BUI | 1363 | Andrew Sung | 4 | 3.327 |
| MICHAEL R ALDRICH | 1229 | Maria Dubel | 3 | 0.000 |
| LOREN SCHWIEBERT | 1745 | Smith McHenry | 2 | 30.186 |
| MAX ZHU | 1579 | Solomon Dugary | 2 | 1376.186 |
| GAURAV GIDWANI | 1552 | Henry Churk | 2 | 3.327 |
| SOFIA ADINA | 1507 | Maria Dubel | 2 | 0.000 |
| CHIEDOZIE OKORIE | 1602 | Andrew Sung | 2 | 30.186 |
| GEORGE AVERY JONES | 1522 | Solomon Dugary | 2 | 1376.186 |
| RISHI SHETTY | 1494 | John Morrison | 2 | 3.327 |
| JOSHUA PHILIP MATHEWS | 1441 | Kate Foo | 2 | 0.000 |
| JADE GE | 1449 | Solomon Dugary | 2 | 30.186 |
| MICHAEL JEFFERY THOMAS | 1399 | Maria Dubel | 2 | 1376.186 |
| JOSHUA DAVID LEE | 1438 | Andrew Sung | 2 | 3.327 |
| SIDDHARTH JHA | 1355 | Kate Foo | 2 | 0.000 |
| AMIYATOSH PWNANANDAM | 980 | John Morrison | 2 | 30.186 |
| BRIAN LIU | 1423 | Kate Foo | 2 | 1376.186 |
| JOEL R HENDON | 1436 | Andrew Sung | 2 | 3.327 |
| FOREST ZHANG | 1348 | Smith McHenry | 2 | 0.000 |
| KYLE WILLIAM MURPHY | 1403 | Solomon Dugary | 2 | 30.186 |
| JARED GE | 1332 | Kate Foo | 2 | 1376.186 |
| ROBERT GLEN VASEY | 1283 | Smith McHenry | 2 | 3.327 |
| JUSTIN D SCHILLING | 1199 | John Morrison | 2 | 0.000 |
| DEREK YAN | 1242 | Kate Foo | 2 | 30.186 |
| JACOB ALEXANDER LAVALLEY | 377 | Solomon Dugary | 2 | 1376.186 |
| ERIC WRIGHT | 1362 | Henry Churk | 1 | 3.327 |
| DANIEL KHAIN | 1382 | Maria Dubel | 1 | 0.000 |
| MICHAEL J MARTIN | 1291 | John Morrison | 1 | 30.186 |
| SHIVAM JHA | 1056 | Henry Churk | 1 | 1376.186 |
| TEJAS AYYAGARI | 1011 | Solomon Dugary | 1 | 3.327 |
| ETHAN GUO | 935 | Henry Churk | 1 | 0.000 |
| JOSE C YBARRA | 1393 | Maria Dubel | 1 | 30.186 |
| LARRY HODGE | 1270 | Andrew Sung | 1 | 1376.186 |
| ALEX KONG | 1186 | Maria Dubel | 1 | 3.327 |
| MARISA RICCI | 1153 | Solomon Dugary | 1 | 0.000 |
| MICHAEL LU | 1092 | Smith McHenry | 1 | 30.186 |
| VIRAJ MOHILE | 917 | Henry Churk | 1 | 1376.186 |
| SEAN M MC CORMICK | 853 | Kate Foo | 1 | 3.327 |
| JULIA SHEN | 967 | Smith McHenry | 1 | 0.000 |
| JEZZEL FARKAS | 955 | Solomon Dugary | 1 | 30.186 |
| ASHWIN BALAJI | 1530 | Andrew Sung | 1 | 1376.186 |
| THOMAS JOSEPH HOSMER | 1175 | Maria Dubel | 1 | 3.327 |
| BEN LI | 1163 | John Morrison | 1 | 0.000 |
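The PlayerBias column above cycles through the same four numbers because `colMeans()` returns one value per column (PlayerName code, GamesPlayed, PanelMember code, PlayerRating), and `cbind()` recycles those four values down the 64 rows, which is also what the warning refers to. A per-player bias would instead group the training ratings by player. The sketch below shows that on toy data. The fallback of 0 for a player with no training rating is my own assumption, not something taken from the analysis above.

```r
# Toy ratings; player B has no training rating.
toy <- data.frame(
  PlayerName   = c("A", "B", "C", "D"),
  PlayerRating = c(5, NA, 2, 1),
  stringsAsFactors = FALSE
)

raw_avg <- mean(toy$PlayerRating, na.rm = TRUE)

# Mean training rating per player, minus the global mean.
player_mean <- tapply(toy$PlayerRating, toy$PlayerName, mean, na.rm = TRUE)
player_bias <- player_mean - raw_avg

# Assumption: a player with no training rating gets a neutral bias of 0.
player_bias[is.na(player_bias)] <- 0
```

In this data set each player appears only once, so a player bias learned this way is just that player's own rating minus the raw average, and it is unavailable for players whose rating was held out.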
Training dataset
Player ratings are between 1 and 5
RMSE for Baseline Predictors (train)
## [1] 2.686275
Testing dataset
Player ratings are between 1 and 5
RMSE for Baseline Predictors (test)
## [1] 2.615385
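The code behind the two baseline figures is not shown. Both printed values equal 5 minus the corresponding raw average (5 - 2.313725 = 2.686275 and 5 - 2.384615 = 2.615385). For reference, a minimal sketch of the usual baseline predictor (raw average plus user bias plus item bias, clipped to the 1 to 5 rating range) is given below. The toy data and the helpers `rmse()` and `bias_of()` are hypothetical, and this is not necessarily the calculation that produced the numbers above.

```r
# Toy ratings: each row is one (player, panel member) pair.
toy <- data.frame(
  PlayerName  = c("A", "B", "C", "D"),
  PanelMember = c("P1", "P1", "P2", "P2"),
  train       = c(5, NA, 2, 1),
  test        = c(NA, 4, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: RMSE that skips missing ratings.
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2, na.rm = TRUE))
}

raw_avg <- mean(toy$train, na.rm = TRUE)

# Hypothetical helper: group mean of training ratings minus the raw average,
# with 0 used when a group has no training rating (an assumption).
bias_of <- function(key) {
  b <- tapply(toy$train, key, mean, na.rm = TRUE) - raw_avg
  b[is.na(b)] <- 0
  b
}
panel_bias  <- bias_of(toy$PanelMember)
player_bias <- bias_of(toy$PlayerName)

# Baseline = raw average + user (panel) bias + item (player) bias.
baseline <- raw_avg +
  as.numeric(panel_bias[toy$PanelMember]) +
  as.numeric(player_bias[toy$PlayerName])

# Clip predictions to the valid rating range.
baseline <- pmin(pmax(baseline, 1), 5)

train_rmse <- rmse(baseline, toy$train)
test_rmse  <- rmse(baseline, toy$test)
```

With only one rating per player, the additive biases can overshoot and the clipping does a lot of the work, so a data set with several ratings per player would give a more meaningful comparison.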
Dataset <- c("Training Set", "Testing Set", "Baseline Pred Train", "Baseline Pred Test")
RMSE <- c("1.13", "1.27", "2.69", "2.62")
Raw_Average <- c("2.31", "2.38", "N/A", "N/A")
PredictorPerformance <- data.frame(Dataset,RMSE,Raw_Average)
PredictorPerformance
## Dataset RMSE Raw_Average
## 1 Training Set 1.13 2.31
## 2 Testing Set 1.27 2.38
## 3 Baseline Pred Train 2.69 N/A
## 4 Baseline Pred Test 2.62 N/A
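As a small variation, the summary can hold the figures as numbers rather than strings, so that rounding happens in one place and missing entries are real NA values. This is only a sketch: `perf` and `perf_rounded` are hypothetical names, and the values are copied from the output printed earlier in this notebook.

```r
# Summary with numeric columns; NA marks "not applicable".
perf <- data.frame(
  Dataset     = c("Training Set", "Testing Set",
                  "Baseline Pred Train", "Baseline Pred Test"),
  RMSE        = c(1.13, 1.27, 2.686275, 2.615385),
  Raw_Average = c(2.313725, 2.384615, NA, NA),
  stringsAsFactors = FALSE
)

# Round only for display, keeping full precision in perf.
perf_rounded <- perf
perf_rounded$RMSE        <- round(perf$RMSE, 2)
perf_rounded$Raw_Average <- round(perf$Raw_Average, 2)
```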
The analysis shows that the baseline predictors did not improve on the raw average: the RMSE rose from 1.13 to roughly 2.7 on the training set and from 1.27 to roughly 2.6 on the test set. This is perhaps because the data is synthetic, with ratings that I made up. However, the exercise gave me the opportunity to understand how these types of recommenders could work.