Project_Three

Which are the most valued data science skills?

If you google this question you will receive all kinds of results saying things like Python, Sql, R, Machine Learning, Natural Language Processessing, are the most important skills to have to be a data scientist. Which is accurate, those are necessary skills but it leaves out the most important skill that a person needs. Which is decision making. In class we read the article from the New York Times Magazine called ‘What Google Learned From Its Quest to Build the Perfect Team’, by Charles Duhigg. This article was all about Google’s studies on what makes teams successful and what makes them unsuccessful. Ultimately, it all came down to decision making. How did the teams make their decisions? How did they implement their plan? How did the break up the division of labor? How did they get along? The key to everything wasn’t about the skills but about the decisions they made as a group and you can’t understand the group without understanding the smaller individual pieces. Nothing is understood by looking at the big picture. To gain a full understanding of anything you have to learn about each fundamental piece. In this case, each individual and how they work, what they prefer, what is their personality and how they make decisions. Then and only then can you can start to get an idea of how they fit within the group and what type of group atmosphere each individual thrives in.

library(RMySQL)

## Loading required package: DBI

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(readxl)
library(ini)
library(DT)
library(ggpubr)
library(ggplot2)

Database sign in and set download file path

The user_name variable sets the user name for the database sign in. While the password variable sets the password. The xlsx variable is the path to the directory where the xlsx files are stored. Leave the trailing /(backslash) off. If on a system other than linux then the slashes might have to be changed.

# user_name <- readline(prompt='Enter user name ')
# password <- readline(prompt='Enter password ')
# './decision making' <- readline(prompt='Enter directory where xlsx files are: ')
my_ini <- read.ini('/home/ray/config/Ultimate.ini')
db_con <- dbConnect(MySQL(),user=my_ini$MYSQL$user,password=my_ini$MYSQL$password,host='localhost')

Fix files

This section of code is to read in the xlsx files and fix the column names by removing any white spaces and punctuations. It then writes the xlsx file to a csv file for exportation into a mysql database.

custom_name_repair <- function(nms) gsub('[[:punct:][:space:]]+', "", nms)
game_sessions <- read_excel(paste0('./decision making','/Game Sessions.xlsx',collapse=''),.name_repair=custom_name_repair)
decisions <- read_excel(paste0('./decision making','/Decisions.xlsx',collapse=''),.name_repair=custom_name_repair)
worker_agents <- read_excel(paste0('./decision making','/Worker Agents.xlsx',collapse=''),.name_repair=custom_name_repair)
game_levels <- read_excel(paste0('./decision making','/Game Levels.xlsx',collapse = ''),.name_repair=custom_name_repair)
users_table <- read_excel(paste0('./decision making','/Users.xlsx',collapse = ''),.name_repair=custom_name_repair)
task <- read_excel(paste0('./decision making','/Tasks.xlsx',collapse = ''),.name_repair=custom_name_repair)
write.csv(task,paste0('./decision making','/Tasks.csv',collapse=''), row.names = FALSE)
write.csv(users_table,paste0('./decision making','/Users.csv',collapse = ''), row.names = FALSE)
write.csv(game_levels,paste0('./decision making','/Game Levels.csv',collapse = ''), row.names = FALSE)
write.csv(worker_agents,paste0('./decision making','/Worker Agents.csv',collapse = ''), row.names = FALSE)
write.csv(game_sessions,paste0('./decision making','/Game Sessions.csv',collapse=''), row.names = FALSE)
write.csv(decisions,paste0('./decision making','/Decisions.csv',collapse=''), row.names = FALSE)

Create database

Create the database if it does not exist and set it as the database to for use.

create_database_statement<- 'CREATE DATABASE IF NOT EXISTS Project_Three;'
create_database <- dbSendQuery(db_con, create_database_statement)
use_database <- dbSendQuery(db_con, 'USE Project_Three;')

Create tables

Using RMySQL package’s dbSendQuery to create the tables. Each query is the SQL statement for creating the table.

worker_agents <- 'CREATE TABLE IF NOT EXISTS WorkerAgents(
ID int NOT NULL PRIMARY KEY, 
HighQualityOutputProbability int, 
MaxProductivityNoofEffortUnitsperRound int, 
SvQSetting int);'
dbSendQuery(db_con, worker_agents)

## <MySQLResult:8,0,2>

user_table <- 'CREATE TABLE IF NOT EXISTS Users (
 ID int NOT NULL PRIMARY KEY,
 Gender varchar (225),
 Education varchar (225),
 Country varchar (225),
 Age int,
 AccountCreationTime int,
 PQ1 int, 
 PQ2 int,
 PQ3 int,
 PQ4 int,
 PQ5 int,
 PQ6 int,
 PQ7 int,
 PQ8 int,
 PQ9 int,
 PQ10 int,
 AQ1 int,
 AQ2 int,
 AQ3 int,
 AQ4 int,
 AQ5 int,
 AQ6 int,
 AQ7 int,
 AQ8 int,
 AQ9 int,
 AQ10 int,
 AQ11 int,
 AQ12 int,
 AQ13 int,
 AQ14 int,
 AQ15 int,
 AQ16 int,
 AQ17 int,
 AQ18 int,
 AQ19 int,
 AQ20 int
 );'
dbSendQuery(db_con, user_table)

## <MySQLResult:-574630872,0,3>

 tasks_table <- 'CREATE TABLE IF NOT EXISTS Tasks (
 ID int NOT NULL PRIMARY KEY,
 Value int,
 Difficulty int,
 EffortRequired int,
 Deadline int
 );'
dbSendQuery(db_con, tasks_table)

## <MySQLResult:1,0,4>

game_levels <- "CREATE TABLE IF NOT EXISTS `GameLevels` (
  Level int NOT NULL PRIMARY KEY, 
  SpeedvsQualityTradeoffSvQ int,
  NoofRounds int,
  TasksperRound int,
  AverageWorkerAgentProductivityOutputRate int);"
dbSendQuery(db_con,game_levels)

## <MySQLResult:0,0,5>

game_session_table <- 'CREATE TABLE IF NOT EXISTS `GameSessions` (
 ID int NOT NULL PRIMARY KEY,
 UserID int,
 Level int,
 PlayerScore float,
 PlayerScoreLossLowQuality float,
 PlayerScoreLossTardiness float,
 AIScore float,
 AIScoreLossLowQuality float,
 AIScoreLossTardiness float,
 UserStrategyIndex int,
 UserStrategyDescription varchar (225),
 FacialExpressionID int, 
 Happiness int, 
 Sadness int,
 Excitement int,
 Boredom int,
 Anger int, 
 Surprise int,
 StartTime datetime,
 EndTime datetime,
 FOREIGN KEY (UserID) REFERENCES Users(ID),
 FOREIGN KEY (Level) REFERENCES GameLevels(Level)
 );'
 dbSendQuery(db_con, game_session_table)

## <MySQLResult:8,0,6>

decisions_table <- 'CREATE TABLE IF NOT EXISTS Decisions (
 ID int NOT NULL PRIMARY KEY,
 SessionID int,
 Round int,
 WorkerAgentID int,
 WorkerAgentBacklogNoofTasks int,
 WorkerAgentBacklogNoofEffortUnits int,
 TheBacklogQueue int,
 WorkerAgentReputation int,
 FOREIGN KEY (ID) REFERENCES Users(ID),
 FOREIGN KEY (WorkerAgentID) REFERENCES WorkerAgents(ID)
 );'
dbSendQuery(db_con,decisions_table)

## <MySQLResult:0,0,7>

Populate database tables with csv data

Using the sqlite package’s dbWriteTable function to populate the database tables with its corresponding data.

Note:

MySql needs a minor configuration change. If you are using Mysql 8, in the my.cnf and my.cnf.d configuration files local-inline has to be added under the [MYSQL] in the file. In older versions of mysql just add it to the my.cnf file and add local-infile=True, under the [MYSQL].

dbWriteTable(db_con, value = paste0('./decision making','/Game Levels.csv',collapse=''), name = "GameLevels", append = TRUE, row.names = FALSE)

## [1] TRUE

dbWriteTable(db_con, value = paste0('./decision making','/Decisions.csv',collapse=''), name = "Decisions", append = TRUE, row.names = FALSE)

## [1] TRUE

dbWriteTable(db_con, value = paste0('./decision making','/Game Sessions.csv',collapse=''), name = "GameSessions", append = TRUE, row.names = FALSE)

## [1] TRUE

dbWriteTable(db_con, value = paste0('./decision making','/Tasks.csv',collapse=''), name = "Tasks", append = TRUE, row.names = FALSE)

## [1] TRUE

dbWriteTable(db_con, value = paste0('./decision making','/Users.csv',collapse=''), name = "Users", append = TRUE, row.names = FALSE)

## [1] TRUE

dbWriteTable(db_con, value = paste0('./decision making','/Worker Agents.csv',collapse=''), name = "WorkerAgents", append = TRUE, row.names = FALSE)

## [1] TRUE

Close all connections

Close all connections in order to close all result sets. Not doing so hinders the ability to do queries that pull data from the database. dbListConnections(MySQL()) checks if any connections are open, which should return an empty list.

lapply(dbListConnections( dbDriver( drv = "MySQL")), dbDisconnect)

## Warning: Closing open result sets

## [[1]]
## [1] TRUE

dbListConnections(MySQL())

## list()

Reopen connections

Since connections were closed to empty the results sets, a new connection must be established. The database to be used must also be reset.

db_con <- dbConnect(MySQL(),user=my_ini$MYSQL$user,password=my_ini$MYSQL$password,host='localhost')
use_database <- dbSendQuery(db_con, 'USE Project_Three;')

Database Extraction

A query statement and query for the data from the Game Sessions table. Which holds the data that will be analyzed.

get_users <- ("Select * From Users")
users_table <- dbGetQuery(db_con,get_users)
get_game_sessions <- ("Select * FROM GameSessions;")
sessions_table <- dbGetQuery(db_con,get_game_sessions)
datatable(users_table,extensions='Scroller',options=list(scrollY=500,scroller=TRUE))

datatable(sessions_table,extensions='Scroller',options=list(scrollY=500,scroller=TRUE))

Exploratory Analysis

In these two charts we have histograms of player scores and ai scores. We were expecting the ai scores to be closer to a normal distribution but it was not close to normal at all. The players had a much smoother distribution. Possibly because the learning algorithm for the ai was not as refined as it should have been.

p_v_ai <- as.data.frame(sessions_table) %>% 
  select(Level,PlayerScore,AIScore)%>%
  pivot_longer(!Level,names_to = 'Player_Type',values_to = 'Score')
ggplot(p_v_ai)+
  geom_density(aes(x=Score,fill=factor(Player_Type),alpha=.7,))

ggqqplot(sessions_table$PlayerScore)

From the qq plot above we have further evidence that the player scores are not normally distributed.

ggqqplot(sessions_table$PlayerScore)

We also have further evidence that the ai scores are not normally distributed.

Looking at the density plots and the qq plots we can pretty much see that the distributions are not normal. Just to make sure we did a Shapiro-Wilkes test.

norm_player <-shapiro.test(sessions_table$PlayerScore)
norm_ai <-shapiro.test(sessions_table$AIScore)
norm_player

## 
##  Shapiro-Wilk normality test
## 
## data:  sessions_table$PlayerScore
## W = 0.93093, p-value < 2.2e-16

norm_ai

## 
##  Shapiro-Wilk normality test
## 
## data:  sessions_table$AIScore
## W = 0.90676, p-value < 2.2e-16

Both distributions fail the Shapiro test.

SQL join of the User table and Game Session table on the UserID column.

user_game_ses_join <- 'SELECT Project_Three.Users.ID, Project_Three.GameSessions.UserID, Project_Three.Users.Gender, Project_Three.Users.Education, Project_Three.Users.Country, Project_Three.Users.Age, Project_Three.Users.PQ1, Project_Three.Users.PQ2, Project_Three.Users.PQ3, Project_Three.Users.PQ4, Project_Three.Users.PQ5, Project_Three.Users.PQ6, Project_Three.Users.PQ7, Project_Three.Users.PQ8, Project_Three.Users.PQ9, Project_Three.Users.PQ10, Project_Three.Users.AQ1, Project_Three.Users.AQ2, Project_Three.Users.AQ3, Project_Three.Users.AQ4, Project_Three.Users.AQ5, Project_Three.Users.AQ6, Project_Three.Users.AQ7, Project_Three.Users.AQ8, Project_Three.Users.AQ9, Project_Three.Users.AQ10, Project_Three.Users.AQ11, Project_Three.Users.AQ12,Project_Three.Users.AQ13, Project_Three.Users.AQ15, Project_Three.Users.AQ16, Project_Three.Users.AQ17, Project_Three.Users.AQ18, Project_Three.Users.AQ19, Project_Three.Users.AQ20
  FROM Project_Three.Users
  JOIN Project_Three.GameSessions ON Project_Three.Users.ID = Project_Three.GameSessions.UserID;'
user_game_ses_table <- dbGetQuery(db_con,user_game_ses_join)
datatable(user_game_ses_table,extensions='Scroller',options=list(scrollY=500,scroller=TRUE))

SQL join of Game Sessions and Game Levels on Level variable and Level variable

game_levs_ses_join <- 'SELECT Project_Three.GameSessions.UserID, Project_Three.GameSessions.Level, Project_Three.GameSessions.PlayerScore, Project_Three.GameSessions.AIScore, Project_Three.GameSessions.UserStrategyIndex, Project_Three.GameSessions.FacialExpressionID
  FROM Project_Three.GameSessions
  JOIN Project_Three.GameLevels ON Project_Three.GameSessions.Level = Project_Three.GameLevels.Level;'
game_levs_ses_table <- dbGetQuery(db_con,game_levs_ses_join)
datatable(game_levs_ses_table,extensions='Scroller',options=list(scrollY=500,scroller=TRUE))

Analysis

game_levs_ses_table <- game_levs_ses_table %>%
select(UserID, Level, PlayerScore, AIScore) %>%
arrange(desc(Level))

# for PlayerScore
game_levs_ses_table %>%
group_by(Level)%>%
summarise(
mean(PlayerScore)
)

## # A tibble: 6 × 2
##   Level `mean(PlayerScore)`
##   <int>               <dbl>
## 1     1               0.825
## 2     2               0.778
## 3     3               0.795
## 4     4               0.674
## 5     5               0.714
## 6     6               0.548

# for AIScore
game_levs_ses_table %>%
group_by(Level)%>%
summarise(
mean(AIScore)
)

## # A tibble: 6 × 2
##   Level `mean(AIScore)`
##   <int>           <dbl>
## 1     1           0.916
## 2     2           0.877
## 3     3           0.827
## 4     4           0.699
## 5     5           0.718
## 6     6           0.556

One observation that continues to be true for both Players and AI is that as level increases, mean decreases. We also note that AI is more likely to score higher per level. An explanations to this may include emotions, and how they impact decision-making skills.