SYNOPSIS

The goal is to study personality traits of leaders and their link to the performance. For almost 20 leaders data about their personalities, social demographics and performance are available.

Your task has three parts:

Preferred solution uses [r] or Python. All your codes should be able to handle new data in the same format. The report can be static (e.g. R Markdown/Jupyter) or dynamic (e.g. Shiny/Dash).

References

Obviously we cannot use our real data so for testing purposes we will use publically available data from the Big 5 personality test http://personality-testing.info/tests/BIG5.php, which can be downloaded at[2] [3] http://personality-testing.info/_rawdata/BIG5.zip.

Objective

The objective is to measure how parameters from personality test might be pertinent indicators to understand people performance.

Before training and predicting, let’s proceed to basic data exploration (size, quality, variable types etc…). We will especially look at:

We will then have a global understanding of how specific personality traits (or family of personality traits) might be good indicators for performance predictions.

After while, we will dive into more details with US subset and use some machine learning algorithms to predict the performance based on all the other data collected during the personality test. We are more likely to use classification trees or random forests algorithms, which are well adapted to this kind of problem.

Data Exploration and Cleaning

## Loading required package: downloader
## Loading required package: reshape2
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
## Loading required package: rpart
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin

Let’s load file directly from the website: For convenient reasons, the files are loaded to the related local folder, as well as the csv reading options (header, sep, skip lines etc…)

#--------------- Start with getting the data from web URL ----------------------
## Set local paths and directory
setwd("~/Desktop/Class R/PMI")
OriginalData <- c("./data/BIG5.zip") # Local path
FileURL <- "http://personality-testing.info/_rawdata/BIG5.zip" # File URL


## If the dataset does not exist, download the zipped file and unzip it
if (!file.exists(OriginalData)) {
        print("File  does not exist")
        download.file(FileURL, destfile = OriginalData, method = "libcurl")
        unzip(zipfile = OriginalData,exdir = "./data/")
        }else {print("Files already exist")
}
## [1] "Files already exist"
## Let's read the files
WebSet <- read.csv("./data/BIG5/data.csv", sep = "\t")
Header <- read.csv("./data/BIG5/codebook.txt", sep = "\t", skip=4, header=FALSE)

Let’s load the local ‘performance’ data set: For convenient reasons, the file was already loaded to the related local folder, as well as the csv reading options (header, sep etc…)

PerfCol <- read.csv("./data/performance.csv", sep = "\t")

Since we have loaded the ‘performance’ related data, let’s add it to the dataframe

colnames(Header) <- c("code","question")
WebSet['performance'] <- PerfCol ## Append the performance column to the web dataset
CompleteSet <- WebSet[c(58,1:57)] ## Re-arrange columns to get performance as as 1st column

We proceed now to basic data exploration (summary, head etc…). Results are hidden, given the large size of the dataset.

head(CompleteSet)
summary(CompleteSet)

Questions can be grouped into the Big Five personality traits: 1) E1-10 are related to ‘Extroversion’ skills 2) N1-10 are related to ‘Neuroticism’ skills 3) A1-10 are related to ‘Agreeableness’ skills 4) C1-10 are related to ‘Conscientiousness’ skills 5) O1-10 are related to ‘Openness’ skills

To have a better global impression of the impact of each trait on performance, we can calculate the average score for each trait and each participant. Let’s add five columns for each Big Five trait, and calculate the related average score.

## Let's creat a new data set with 6 columns: performance and the Big Five traits
SimplifiedSubset <- cbind(CompleteSet[1:8],rowMeans(CompleteSet[9:18]),rowMeans(CompleteSet[19:28]), rowMeans(CompleteSet[29:38]),rowMeans(CompleteSet[39:48]),rowMeans(CompleteSet[49:58]))
colnames(SimplifiedSubset) <- c(colnames(CompleteSet)[1:8],'Extroversion','Neuroticism','Agreeableness','Conscientousness','Openness')
 c(colnames(CompleteSet)[1:8],'Extroversion','Neuroticism','Agreeableness','Conscientousness','Openness')
##  [1] "performance"      "race"             "age"             
##  [4] "engnat"           "gender"           "hand"            
##  [7] "source"           "country"          "Extroversion"    
## [10] "Neuroticism"      "Agreeableness"    "Conscientousness"
## [13] "Openness"
dim(SimplifiedSubset)
## [1] 19719    13
MeltSubset <-melt(cbind(SimplifiedSubset[1],SimplifiedSubset[9:13]), id.vars="performance")
p <- ggplot(MeltSubset, aes(performance,value, col=variable)) + 
  geom_point(alpha = 0.5) + geom_smooth(method=lm)+ facet_grid(variable ~ .)
p

From these plots, correlation between each trait and performance is not obvious, let’s have a look to Pearson’s correlation coefficients.

corrExtroversion <- cor.test(x=SimplifiedSubset$Extroversion, y=SimplifiedSubset$performance, method = 'pearson')
corrNeuroticism <- cor.test(x=SimplifiedSubset$Neuroticism, y=SimplifiedSubset$performance, method = 'pearson')
corrAgreeableness <- cor.test(x=SimplifiedSubset$Agreeableness, y=SimplifiedSubset$performance, method = 'pearson')
corrConscientousness <- cor.test(x=SimplifiedSubset$Conscientousness, y=SimplifiedSubset$performance, method = 'pearson')
corrOpenness <- cor.test(x=SimplifiedSubset$Openness, y=SimplifiedSubset$performance, method = 'pearson')

corrRace <- cor.test(x = CompleteSet$race,y = CompleteSet$performance,method ='pearson')
corrAge <- cor.test(x = CompleteSet$age,y = CompleteSet$performance,method ='pearson')
corrEngnat <- cor.test(x = CompleteSet$engnat,y = CompleteSet$performance,method ='pearson')
corrGender <- cor.test(x = CompleteSet$gender,y = CompleteSet$performance,method ='pearson')
corrCountry <- cor.test(x = as.numeric(factor(CompleteSet$country)),y = CompleteSet$performance,method = 'pearson')

ExtroversionPearson <- cbind(corrExtroversion['parameter'],
                             corrExtroversion['p.value'],
                             corrExtroversion['estimate'])
NeuroticismPearson <- cbind(corrNeuroticism['parameter'],
                             corrNeuroticism['p.value'],
                             corrNeuroticism['estimate'])
AgreeablenessPearson <- cbind(corrAgreeableness['parameter'],
                             corrAgreeableness['p.value'],
                             corrAgreeableness['estimate'])
ConscientousnessPearson <- cbind(corrConscientousness['parameter'],
                             corrConscientousness['p.value'],
                             corrConscientousness['estimate'])
OpennessPearson <- cbind(corrOpenness['parameter'],
                             corrOpenness['p.value'],
                             corrOpenness['estimate'])
RacePearson <- cbind(corrRace['parameter'],
                             corrRace['p.value'],
                             corrRace['estimate'])
AgePearson <- cbind(corrAge['parameter'],
                             corrAge['p.value'],
                             corrAge['estimate'])
EngnatPearson <- cbind(corrEngnat['parameter'],
                             corrEngnat['p.value'],
                             corrEngnat['estimate'])
GenderPearson <- cbind(corrGender['parameter'],
                             corrGender['p.value'],
                             corrGender['estimate'])
CountryPearson <- cbind(corrCountry['parameter'],
                             corrCountry['p.value'],
                             corrCountry['estimate'])

PearsonScores <- rbind(ExtroversionPearson,NeuroticismPearson,AgreeablenessPearson,ConscientousnessPearson,OpennessPearson,
                    RacePearson, AgePearson,EngnatPearson,GenderPearson,CountryPearson)
rownames(PearsonScores) <- c('Extroversion','Neuroticism','Agreeableness','Conscientousness','Openness','Race','Age','Engnat','Gender','Country')
colnames(PearsonScores) <- c('degree freedom','P-Value','Corr Coeff')
PearsonScores
##                  degree freedom P-Value       Corr Coeff  
## Extroversion     19717          1.407008e-156 0.18816     
## Neuroticism      19717          0             0.3617212   
## Agreeableness    19717          0             0.3792565   
## Conscientousness 19717          2.666537e-111 0.1586382   
## Openness         19717          2.357957e-215 0.2203188   
## Race             19717          0.001704135   0.02234139  
## Age              19717          0.1559615     0.01010394  
## Engnat           19717          0.470249      -0.005142377
## Gender           19717          4.313767e-65  0.1208925   
## Country          19709          0.748905      0.002280006

Pearson’s correlations score are eloquent: the highest score is a week correlation of about 0.4. P-values are less than 0.01, showing that we can reject the null hypothesis (“Performance and big five traits are unrelated”). The good news is that gender, race or age are not likely correlated with performance. Neurotics and agreeableness have the ‘highest’ correlation score with performance. Let’s notice that for predictors such as Age or Country, we failed to reject the null hypothesis despite very low Pearson’s correlation coefficients.

So what?

Given the typology of variables (all represented by discrete values from 0 to 5), the number of predictors and the number of predictors, it is not surprising not to get fantastic regression results. Given the number of observations (a little less than 20’000), we might expect better results from machine learning algorithms.

Performance Prediction for US Subset

Let’s pass to the second part of the study and to to build predictive models for US with machine learning algorithms to see if we can detect hidden patterns, with higher predictive power than simple regression analysis.This time we can use each question result for each traits. We can also check that there is not too much NA values.

USSubset <- subset(CompleteSet, country == 'US')
dim(USSubset)
## [1] 8753   58
## Evaluate if non NA ratio is above the threshold (TRUE or FALSE)
USSubsetNaN <- sapply(colnames(USSubset), function(x) sum(is.na(USSubset[, x])))

The US Subset consists of 8753 variables for 58 predictors. Fortunately, there is no missing values at all for the whole subset. Now, we can have a look to predictors with very low variance (close to zero) and get rid of them, as they will not be useful to determine cost functions. Surprisingly

## We transform the country values into numeric values
USSubset[,'country']=as.factor(USSubset[,'country'])
## We are using the nearZeroVar function to identify predictors with variance NOT close to zero
GoodVar <- !nearZeroVar(USSubset, saveMetrics = TRUE)$nzv
## We filter the training set
USSubset2 <- USSubset[, GoodVar]
dim(USSubset2)

This step did eliminate only one predictor with variance close to zero, guess which one? Of course the ‘country’ column as we filtered the data for only US…. Finally, from this reduced set, let’s see which predictors are strongly correlated.

correlationUSSubset <- findCorrelation(cor(USSubset2[,-57]),cutoff = .6)
names(USSubset2)[correlationUSSubset]
##  [1] "E3" "E5" "E7" "E4" "N6" "E6" "N8" "N1" "A9" "O8"

We observe that many predictors are correlated: 4 over ten for Extroversion, 2 over 10 for Neuroticism and Agreeableness. PCA pre-processing will be therefore implemented for the training. Our final list of predictors is:

names(USSubset2)
##  [1] "performance" "race"        "age"         "engnat"      "gender"     
##  [6] "hand"        "source"      "E1"          "E2"          "E3"         
## [11] "E4"          "E5"          "E6"          "E7"          "E8"         
## [16] "E9"          "E10"         "N1"          "N2"          "N3"         
## [21] "N4"          "N5"          "N6"          "N7"          "N8"         
## [26] "N9"          "N10"         "A1"          "A2"          "A3"         
## [31] "A4"          "A5"          "A6"          "A7"          "A8"         
## [36] "A9"          "A10"         "C1"          "C2"          "C3"         
## [41] "C4"          "C5"          "C6"          "C7"          "C8"         
## [46] "C9"          "C10"         "O1"          "O2"          "O3"         
## [51] "O4"          "O5"          "O6"          "O7"          "O8"         
## [56] "O9"          "O10"

Model Specification and cross-validation

To keep our test sets accurate, we will use cross-validation (k-folds, with k = 5) with PCA pre-processing. We can now split our dataset into training and test sets: 80% for training and 20% for testing:

set.seed(1260) ## We set seed to make processes reproducible
trainIndex <- createDataPartition(USSubset2$performance, p = .8, list = FALSE, times = 1)

performanceTraining <-USSubset2[trainIndex,]
performanceTest <-USSubset2[-trainIndex,]

We will compare two machine learning algorithms: Tree (rpart) and Random Forest (rf)

set.seed(1260) ## We set seed to make processes reproducible
trainingControlSet <- trainControl(method = "cv", number = 5, verboseIter=FALSE , preProcOptions="pca", allowParallel=TRUE)

rpartModel<- train(performance ~ ., data = performanceTraining, method = "rpart", trControl= trainingControlSet)
rfModel<- train(performance ~ ., data = performanceTraining, method = "rf", trControl= trainingControlSet)

Let’s look how the models perform on the training dataset:

modelHeader <- c("Tree Rpart", "Random Forest")
AccuracyTable <- c(max(rpartModel$results$Rsquared),
                max(rfModel$results$Rsquared))
Results <- as.data.frame(cbind(modelHeader,AccuracyTable))
Results[order(Results$AccuracyTable,decreasing = TRUE),]
##     modelHeader     AccuracyTable
## 2 Random Forest 0.991885289342011
## 1    Tree Rpart 0.591992259630709

The most accurate is (not surprising) the random forest model with an accuracy of 99.2%. The decision tree model is from far less accurate with 59% of accuracy. We can now apply the random forest model on the test data set and verify that we our predictions are close to real values:

predictionRF <- round(predict(rfModel,performanceTest),0)
NumberVarTest <- length(predictionRF)
RealValue<-performanceTest$performance
Results <- as.data.frame(cbind(predictionRF,RealValue))
Results['Difference']=Results['predictionRF']-Results['RealValue']
ErrorTypeI<-subset(Results, Difference >0)
ErrorTypeII<-subset(Results, Difference <0)
Error_Table <- as.data.frame(c('Error Type I',nrow(ErrorTypeI)),c('Error Type II',nrow(ErrorTypeII)))
Error_Table
##               c("Error Type I", nrow(ErrorTypeI))
## Error Type II                        Error Type I
## 11                                              7
ErrorRate=1-(nrow(ErrorTypeI)+nrow(ErrorTypeII))/NumberVarTest
sprintf("%.3f %%", 100*ErrorRate)
## [1] "98.971 %"

Conclusion

We tried to better understand the correlation between performance and the Big Five personality traits. As values for each predictors are integer between 0 and 5, correlation was not strong enough to use regular regression models. Given the large numbers of variables and predictors, this was a perfect case-study for machine learning algorithms. Random Forest algorithm was most probably to offer the highest accuracy and with 99,2% (98,9% during testing), we might consider the model as accurate. On x testing samples, the model made 18 errors: 11 of Type I (predicted more performant) and 7 of Type II (predicted less performant than reality). This would be interesting to test deep learning CNN algorithms, to see if hidden patterns at different levels of details can beatt random forest results. T