In this post, I am going to apply Principal Component Analysis (PCA)
to a dataset of fictional character personalities (213,600 rows and
11 columns). PCA is a common technique for dimensionality reduction,
which you might want to do if you are, say, putting together a
classification model and have a dataset with a lot of variables.
The dataset consists of crowdsourced scores of personality traits for
800 fictional characters from books, movies, and TV shows such as
Star Trek, Game of Thrones, Pride and Prejudice, and The Lion King.
I will focus on a TV series I really love: Two and a Half Men. It's a
series full of jokes. My favourite character is Berta, who is a real
genius! I love watching her.
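As a minimal sketch of what PCA does, base R's `prcomp()` can be run on a small matrix of scores. The data below is simulated purely for illustration (it is not the personality dataset), and names such as `toy_scores` and `toy_pca` are hypothetical.

```r
# Simulated scores for illustration only: 50 made-up "characters" rated on
# 4 made-up traits, where trait_b is built to correlate with trait_a.
set.seed(42)
n <- 50
trait_a <- runif(n, min = -50, max = 50)
toy_scores <- data.frame(
  trait_a = trait_a,
  trait_b = 0.8 * trait_a + rnorm(n, sd = 10),
  trait_c = runif(n, min = -50, max = 50),
  trait_d = runif(n, min = -50, max = 50)
)

# Centre and scale each column, then compute the principal components.
toy_pca <- prcomp(toy_scores, center = TRUE, scale. = TRUE)

# Proportion of the total variance captured by each component.
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
print(round(var_explained, 3))

# Coordinates of the first rows in the reduced space (first two components).
head(toy_pca$x[, 1:2])
```

Because two of the simulated traits are correlated, the first component should capture more variance than any single trait would on its own, which is the property that makes PCA useful for reducing the number of variables.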
## Rows: 213600 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (7): character_code, fictional_work, character_name, gender, spectrum, s...
## dbl (3): mean, ratings, sd
## lgl (1): is_emoji
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
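The column specification above describes a tab-delimited file with 213,600 rows and 11 columns. As a rough sketch of reading such a file, the example below writes a two-row stand-in file (values copied from the output shown in this post) and reads it back. It uses base R's `read.delim()` so it runs without extra packages; the message above comes from readr, so the post itself presumably used a readr reader instead. The names `toy_path` and `toy_personalities` are hypothetical.

```r
# Hypothetical stand-in file: two rows copied from the output shown in this
# post, written to a temporary tab-separated file.
toy_path <- tempfile(fileext = ".tsv")
writeLines(c(
  paste("character_code", "fictional_work", "character_name", "gender",
        "spectrum", "spectrum_low", "spectrum_high", "is_emoji",
        "mean", "ratings", "sd", sep = "\t"),
  paste("A/4", "Alien", "Ash", "male", "BAP1", "playful", "serious",
        "FALSE", "41.4", "51", "10.9", sep = "\t"),
  paste("A/4", "Alien", "Ash", "male", "BAP2", "shy", "bold",
        "FALSE", "11.1", "63", "27.3", sep = "\t")
), toy_path)

# Read the tab-delimited file; the first line is used as the header.
toy_personalities <- read.delim(toy_path, stringsAsFactors = FALSE)

# Number of rows and columns in the toy data.
dim(toy_personalities)
```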
Data inspection
## [1] "character_code" "fictional_work" "character_name" "gender"
## [5] "spectrum" "spectrum_low" "spectrum_high" "is_emoji"
## [9] "mean" "ratings" "sd"
## [1] 213600 11
## spec_tbl_df [213,600 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ character_code: chr [1:213600] "A/4" "A/4" "A/4" "A/4" ...
## $ fictional_work: chr [1:213600] "Alien" "Alien" "Alien" "Alien" ...
## $ character_name: chr [1:213600] "Ash" "Ash" "Ash" "Ash" ...
## $ gender : chr [1:213600] "male" "male" "male" "male" ...
## $ spectrum : chr [1:213600] "BAP1" "BAP2" "BAP3" "BAP4" ...
## $ spectrum_low : chr [1:213600] "playful" "shy" "cheery" "masculine" ...
## $ spectrum_high : chr [1:213600] "serious" "bold" "sorrowful" "feminine" ...
## $ is_emoji : logi [1:213600] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ mean : num [1:213600] 41.4 11.1 22.4 -16.9 23.1 3.8 -30.4 -32.8 -24.8 28.9 ...
## $ ratings : num [1:213600] 51 63 78 71 72 60 76 74 72 70 ...
## $ sd : num [1:213600] 10.9 27.3 14 22.3 25.2 32.9 23.7 20.9 29.1 23.9 ...
## - attr(*, "spec")=
## .. cols(
## .. character_code = col_character(),
## .. fictional_work = col_character(),
## .. character_name = col_character(),
## .. gender = col_character(),
## .. spectrum = col_character(),
## .. spectrum_low = col_character(),
## .. spectrum_high = col_character(),
## .. is_emoji = col_logical(),
## .. mean = col_double(),
## .. ratings = col_double(),
## .. sd = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
## character_code fictional_work character_name gender
## Length:213600 Length:213600 Length:213600 Length:213600
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## spectrum spectrum_low spectrum_high is_emoji
## Length:213600 Length:213600 Length:213600 Mode :logical
## Class :character Class :character Class :character FALSE:188000
## Mode :character Mode :character Mode :character TRUE :25600
##
##
##
## mean ratings sd
## Min. :-49.0000 Min. : 2.0 Min. : 0.00
## 1st Qu.:-17.4000 1st Qu.: 41.0 1st Qu.:21.10
## Median : -0.6000 Median : 131.0 Median :25.50
## Mean : -0.3822 Mean : 206.2 Mean :24.76
## 3rd Qu.: 16.7000 3rd Qu.: 292.0 3rd Qu.:28.90
## Max. : 49.4000 Max. :2459.0 Max. :45.50
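The output above has the shape of what `names()`, `dim()`, `str()`, and `summary()` print. A small sketch of those four calls on a toy data frame follows; the rows are taken from the output shown above, and `toy_personalities` is a hypothetical name rather than the object used in the post.

```r
# Toy data frame with a few rows taken from the output shown in this post;
# `toy_personalities` is a hypothetical name, not the post's object.
toy_personalities <- data.frame(
  character_name = c("Ash", "Ash", "Ash", "Ash"),
  spectrum       = c("BAP1", "BAP2", "BAP3", "BAP4"),
  spectrum_low   = c("playful", "shy", "cheery", "masculine"),
  spectrum_high  = c("serious", "bold", "sorrowful", "feminine"),
  mean           = c(41.4, 11.1, 22.4, -16.9),
  stringsAsFactors = FALSE
)

names(toy_personalities)    # column names
dim(toy_personalities)      # number of rows and columns
str(toy_personalities)      # type and first values of each column
summary(toy_personalities)  # per-column summaries
```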
The main fields we're interested in are spectrum_low, spectrum_high,
and mean. The spectrum fields tell us which trait sits at each end of
the spectrum being considered. mean is a score (from -50 to +50), where
a score closer to -50 means the character is more like the spectrum_low
trait and a score closer to +50 means the character is more like the
spectrum_high trait. Let's see the results!
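To make the scoring rule concrete, here is a small sketch of a helper that returns the trait a score leans towards. The function `closer_trait()` is hypothetical (it is not part of the dataset or of any package), and the two example scores are the ones reported for Charlie Harper in this post.

```r
# Hypothetical helper: pick the trait a score leans towards.
# Negative scores lean to spectrum_low, positive scores to spectrum_high.
# A score of exactly 0 is arbitrarily assigned to spectrum_high here.
closer_trait <- function(spectrum_low, spectrum_high, score) {
  ifelse(score < 0, spectrum_low, spectrum_high)
}

# Two spectra and scores as reported for Charlie Harper in this post.
leaning <- closer_trait(
  spectrum_low  = c("playful", "shy"),
  spectrum_high = c("serious", "bold"),
  score         = c(-36.9, 44.1)
)
print(leaning)
```

The magnitude of the score matters as well as its sign: a score near 0 says the raters placed the character roughly in the middle of the spectrum.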
Let's look at an example character: Charlie Harper from the Warner TV
series Two and a Half Men.
| character_code | fictional_work | character_name | gender | spectrum | spectrum_low | spectrum_high | is_emoji | mean | ratings | sd |
|---|---|---|---|---|---|---|---|---|---|---|
| THM/1 | Two and Half Men | Charlie Harper | male | BAP1 | playful | serious | FALSE | -36.9 | 114 | 14.8 |
| THM/1 | Two and Half Men | Charlie Harper | male | BAP4 | masculine | feminine | FALSE | -35.6 | 128 | 16.6 |
| THM/1 | Two and Half Men | Charlie Harper | male | BAP5 | charming | awkward | FALSE | -29.1 | 119 | 23.8 |
| THM/1 | Two and Half Men | Charlie Harper | male | BAP6 | lewd | tasteful | FALSE | -16.8 | 115 | 30.6 |
| THM/1 | Two and Half Men | Charlie Harper | male | BAP3 | cheery | sorrowful | FALSE | -3.1 | 98 | 29.0 |
| THM/1 | Two and Half Men | Charlie Harper | male | BAP7 | intellectual | physical | FALSE | 29.8 | 109 | 26.0 |
| THM/1 | Two and Half Men | Charlie Harper | male | BAP8 | strict | lenient | FALSE | 35.3 | 115 | 21.3 |
| THM/1 | Two and Half Men | Charlie Harper | male | BAP2 | shy | bold | FALSE | 44.1 | 116 | 9.5 |
Wow! Excellent for a drunk, alcoholic sex addict!
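The Charlie Harper rows above are listed in ascending order of mean. A sketch of producing such a view, assuming nothing about the code actually used for this post (which is not shown and may well rely on dplyr), is to filter by character name and sort by score. The toy data frame below reuses a few values from this post, and its name is hypothetical.

```r
# Toy data with rows copied from this post (one for Ash, three for Charlie
# Harper); `toy_personalities` is a hypothetical name.
toy_personalities <- data.frame(
  character_name = c("Ash", "Charlie Harper", "Charlie Harper", "Charlie Harper"),
  spectrum_low   = c("playful", "shy", "playful", "cheery"),
  spectrum_high  = c("serious", "bold", "serious", "sorrowful"),
  mean           = c(41.4, 44.1, -36.9, -3.1),
  stringsAsFactors = FALSE
)

# Keep only one character's rows.
charlie <- subset(toy_personalities, character_name == "Charlie Harper")

# Sort from the most spectrum_low-like score to the most spectrum_high-like.
charlie <- charlie[order(charlie$mean), ]
print(charlie)
```

Sorting by mean puts the strongest spectrum_low traits at the top and the strongest spectrum_high traits at the bottom, which is how the table reads: very playful at one end, very bold at the other.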