Analysis of Personality Data

Introduction

This is an example of how to draw some charts and graphs of our personality test data using R and the tidyverse. Using this as a template, you should be able to carry out your own analysis. Copy and paste this code to get started!

The R Markdown file that generated this document is in the Discover Data Science v2 workspace, in the templates\personality_tests folder. The file is called Personality_Analysis_Guide.Rmd.

About the personality data

The “Big 5” test attempts to break personality down into 5 key traits:

Openness
Conscientiousness
Extroversion
Agreeableness
Neuroticism

You can remember them by the acronym “O.C.E.A.N.”

You can read more about the Big 5 test on Wikipedia.

The test we took also broke each of these five traits down into multiple sub-categories, such as anger, altruism, achievement, etc. The sub-categories of a trait are linked to each other and should be correlated. I have indicated which sub-category goes with which Big 5 trait by adding a suffix such as “.o” or “.c”. For example, trust.a indicates that the “trust” measurement is a sub-category of agreeableness, while activity.level.e is a sub-category of extroversion.
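
The suffix convention makes it easy to pick out all the sub-categories of one trait. Here is a minimal sketch, assuming a tiny made-up table named toy (the values are invented for illustration and are not the class data):

```r
library(dplyr)

# Hypothetical scores for four students (made-up values, not the class data)
toy <- tibble(
  agreeableness    = c(60, 72, 55, 80),
  trust.a          = c(10, 14,  9, 17),
  altruism.a       = c(12, 15, 10, 18),
  activity.level.e = c( 8, 11, 13,  7)
)

# Keep only the agreeableness sub-categories, identified by the ".a" suffix
agree.subs <- toy %>% select(ends_with(".a"))
colnames(agree.subs)

# Correlation between two sub-categories of the same trait
trust.altruism.cor <- cor(toy$trust.a, toy$altruism.a)
trust.altruism.cor
```

A correlation close to 1 means the two sub-categories rise and fall together. With the real data you would swap toy for big.five.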

Step 1: Load libraries and data

We’re going to use the tidyverse and lubridate libraries to help us out. They’re not loaded by default, so we will need to do that by using the library() function. We’re also going to run the DOD_Library.R script which has a few useful functions you might need.

# Load some libraries
library(tidyverse)
library(lubridate)

# Source the DOD_Library.R script
source("libraries/DOD_Library.R")

# Read in the raw data
big.five.raw <- read_csv("templates/personality_tests/2019_big5.csv")

# Tidy the raw data and save the result as a new object
big.five <- big.five.raw %>%
  # Tell R how to understand the timestamp column.
  mutate(timestamp = ymd_hms(timestamp)) %>% 
  # Reorder the class.day column so it will plot in the correct order
  mutate(class.day = fct_relevel(class.day, "Mon", "Tue", "Wed", "Thu", "Fri")) %>% 
  # Reorder the Likert scale for the self-reported assessment of the test's accuracy.
  mutate(self.reported.accuracy = fct_relevel(self.reported.accuracy, 
                                              "Very Inaccurate", 
                                              "Moderately Inaccurate", 
                                              "Neither Accurate Nor Inaccurate",
                                              "Moderately Accurate", "Very Accurate"))

# Get rid of the raw data
rm(big.five.raw)
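
To see what fct_relevel() does to the ordering, here is a small sketch on a made-up vector of class days (the values are illustrative and not read from the data file):

```r
library(forcats)

# Made-up class days in the order they might appear in a file
days <- factor(c("Wed", "Mon", "Fri", "Tue", "Thu", "Mon"))

# Without releveling, factor levels are sorted alphabetically
levels(days)

# fct_relevel() moves the named levels to the front, in the order given
days.ordered <- fct_relevel(days, "Mon", "Tue", "Wed", "Thu", "Fri")
levels(days.ordered)
```

Plots use the level order for their axes and legends, which is why the days are put in calendar order before plotting.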

Step 2: Look at the data

You should always look at the data before you start using it, to make sure it has imported correctly. You will also need to know the column names in order to draw plots. You can look at your data with the View() function.

# View the data
View(big.five)
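
View() opens a spreadsheet-style window, so it is mainly useful in an interactive session such as RStudio. As a possible alternative that prints to the console, here is a sketch using glimpse(), summary() and count() on a small made-up table (toy and its values are assumptions for illustration):

```r
library(dplyr)

# Hypothetical mini version of the data (made-up values)
toy <- tibble(
  class.day = c("Mon", "Tue", "Tue"),
  grade     = c("7th", "7th", "8th"),
  modesty.a = c(12L, 15L, 9L)
)

# One line per column, showing its type and first few values
glimpse(toy)

# Minimum, quartiles, mean and maximum of a numeric column
summary(toy$modesty.a)

# How many rows fall into each category
day.counts <- count(toy, class.day)
day.counts
```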

Step 3: Draw some graphs

Below are some examples of how to draw a few graphs. Use this code as a template for your own graphs. Copy, paste, and remix!

Remember: When writing the code for the graphs, you have to use the same spelling as the columns in the big.five table. Here’s a list of the columns:

colnames(big.five)

##  [1] "timestamp"              "class.day"              "grade"                 
##  [4] "school"                 "gender"                 "neuroticism"           
##  [7] "anxiety.n"              "anger.n"                "depression.n"          
## [10] "self.consciousness.n"   "immoderation.n"         "vulnerability.n"       
## [13] "extroversion"           "friendliness.e"         "gregariousness.e"      
## [16] "assertiveness.e"        "activity.level.e"       "excitement.seeking.e"  
## [19] "cheerfulness.e"         "openness"               "imagination.o"         
## [22] "artistic.interests.o"   "emotionality.o"         "adventuroursness.o"    
## [25] "intellect.o"            "liberalism.o"           "agreeableness"         
## [28] "trust.a"                "morality.a"             "altruism.a"            
## [31] "cooperation.a"          "modesty.a"              "sympathy.a"            
## [34] "conscientiousness"      "self.efficacy.c"        "orderliness.c"         
## [37] "dutifulness.c"          "achievement.striving.c" "self.discipline.c"     
## [40] "cautiousness.c"         "self.reported.accuracy" "surprising.results"

Histograms

For a histogram you only need to specify what data will go on the x-axis; the y-axis will always be the count of how many values fall within each range. Here’s a histogram of modesty.a:

big.five %>% 
  ggplot(aes (x = modesty.a)) +
    geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This histogram doesn’t look that good though, does it? Notice the gaps between the bars? Those are ranges where no one had a score. By default R splits the data into 30 bins, which often gives a poor binwidth (the size of the intervals the numbers are broken into). If you look closely at the histogram you can see it’s being broken into bins with a size less than 1. Since the scores are integers (there is no score of 15.5), there are empty bars between almost all the numbers. You sometimes need to adjust this by setting the binwidth parameter in the geom_histogram() function. If we set it to 1, the graph looks a lot better:

big.five %>% 
  ggplot(aes (x = modesty.a)) +
    geom_histogram(binwidth = 1)
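
To check the effect of binwidth without eyeballing the plot, you can count the empty bars. This is a sketch on made-up integer scores (toy is an assumption, not the class data); it uses layer_data() from ggplot2 to get the bins that were computed for each histogram:

```r
library(ggplot2)

# Made-up integer scores covering every whole number from 5 to 20
toy <- data.frame(
  modesty.a = rep(5:20, times = c(1, 2, 3, 4, 5, 6, 7, 8, 8, 7, 6, 5, 4, 3, 2, 1))
)

# bins = 30 is the default that geom_histogram() falls back on
p.default <- ggplot(toy, aes(x = modesty.a)) + geom_histogram(bins = 30)
p.unit    <- ggplot(toy, aes(x = modesty.a)) + geom_histogram(binwidth = 1)

# layer_data() returns one row per bar, including its count
empty.default <- sum(layer_data(p.default)$count == 0)
empty.unit    <- sum(layer_data(p.unit)$count == 0)

empty.default  # number of empty bars with 30 bins
empty.unit     # number of empty bars with binwidth = 1
```

With only 16 distinct scores spread over 30 bins, many bars have to be empty. A binwidth of 1 gives each whole number its own bar.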

Boxplots

Basic Boxplot

Boxplots are good when you want to compare a value across categories in your data. For instance, you could look at how the modesty.a scores vary between Day of Discovery classes. For a boxplot you need to have a value for both X and Y, but one of them must be a factor or a category, such as class.day, school, grade, or gender. To make a boxplot you use geom_boxplot().

big.five %>% 
  ggplot(aes(y = modesty.a, x = class.day)) +
  geom_boxplot()
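
The box in a boxplot runs from the lower quartile to the upper quartile, with a line at the median. If you want those numbers as a table, a sketch like the following could work. It uses a made-up table (toy is an assumption) and the dplyr functions group_by() and summarise():

```r
library(dplyr)

# Made-up scores for two class days (not the class data)
toy <- tibble(
  class.day = rep(c("Mon", "Tue"), each = 5),
  modesty.a = c(8, 10, 12, 14, 16, 11, 13, 15, 17, 19)
)

# One row per class day with the values a boxplot draws as the box
box.stats <- toy %>%
  group_by(class.day) %>%
  summarise(
    lower.quartile = quantile(modesty.a, 0.25),
    median         = median(modesty.a),
    upper.quartile = quantile(modesty.a, 0.75)
  )
box.stats
```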

Adding Color to Boxplots

You can color the boxplot by a variable by using the fill parameter inside the aes() function like so:

big.five %>% 
  ggplot(aes(y = modesty.a, x = class.day, fill = class.day)) +
  geom_boxplot()

In this case we set fill = class.day, which is the same thing we split the data by on the x-axis. We could set fill to any other categorical factor though, such as grade, like so:

big.five %>% 
  ggplot(aes(y = modesty.a, x = class.day, fill = grade)) +
  geom_boxplot()

Notice that Tuesday’s class has both 7th and 8th graders, and since we wanted it colored by grade, it automatically split Tuesday’s data by grade; very smart!

Scatterplots

Basic Scatterplot

Scatterplots are useful when you want to see the relationship between two numerical variables. They are good for addressing questions like, “Does someone’s score on sympathy.a predict their score on altruism.a?” If the variables are related, you would expect to see a trend in the points. With a scatterplot both X and Y need to be continuous variables like a score on a personality trait, not a category like grade or gender.

To make a scatterplot you use geom_point().

big.five %>% 
  ggplot(aes(x = sympathy.a, y = altruism.a)) +
    geom_point()

Fitting Curves to Scatterplots

You can also add a fitted line to the points to see what the trend might be. To do this you need to add geom_smooth() to your plot. By default, geom_smooth() will use a LOESS method for drawing the line, which produces a wiggly curve. This is almost never what you want here. Instead you usually want to fit a straight line: a linear model. To add a straight line you use geom_smooth(method = "lm").

big.five %>% 
  ggplot(aes(x = sympathy.a, y = altruism.a)) +
    geom_point() +
    geom_smooth(method = "lm")
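
The straight line that geom_smooth(method = "lm") draws comes from a linear model, and you can fit the same kind of model yourself with lm() to read off the slope. A minimal sketch on made-up scores (toy is an assumption; the numbers are chosen so that the relationship is exactly a straight line):

```r
# Made-up scores where altruism is exactly 2 + 0.5 * sympathy
toy <- data.frame(sympathy.a = c(4, 8, 12, 16, 20))
toy$altruism.a <- 2 + 0.5 * toy$sympathy.a

# Fit altruism as a straight-line function of sympathy
fit <- lm(altruism.a ~ sympathy.a, data = toy)

# The intercept and the slope of the fitted line
coef(fit)
```

The slope tells you how much altruism.a changes, on average, for each extra point of sympathy.a. Real data will not sit exactly on the line the way these invented values do.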

You can also color these points just like you colored the boxplots. To do that you set the color parameter in the aes() function like so:

big.five %>% 
  ggplot(aes(x = sympathy.a, y = altruism.a, color = gender)) +
    geom_point() +
    geom_smooth(method = "lm")

Notice that since we specified the data should be colored by gender, the scatterplot also fit separate lines for each gender.

There is something weird about this graph though!

Doesn’t it seem like it doesn’t have enough points? We have data for 84 students, but that doesn’t look like 84 points on the graph. What gives? Well, remember that the scores are all whole numbers, so it’s likely that many students have the same scores for these two traits. The points could be printing on top of each other, hiding how many points there are. Luckily data scientists have a way around this, and it’s called “jittering”. By using geom_jitter() in place of geom_point(), we tell R to “jitter” the points around a bit, moving them randomly so they don’t land on top of each other. (If you keep geom_point() as well, every point gets drawn twice: once in its true position and once jittered.) This is just for representation though, and it should always be made clear that this has been done. The lines drawn with geom_smooth() will not be affected by this.

big.five %>% 
  ggplot(aes(x = sympathy.a, y = altruism.a, color = grade)) +
    geom_jitter() +
    geom_smooth(method = "lm")
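
To check whether points really are landing on top of each other, you can count how many rows share each combination of scores. A sketch on made-up data (toy and the column name students are assumptions for illustration):

```r
library(dplyr)

# Made-up scores where several students share the same pair of values
toy <- tibble(
  sympathy.a = c(10, 10, 10, 12, 12, 15),
  altruism.a = c(14, 14, 14, 16, 16, 18)
)

# One row per distinct position on the plot, with the number of students there
overlaps <- toy %>% count(sympathy.a, altruism.a, name = "students")
overlaps

nrow(toy)       # number of students
nrow(overlaps)  # number of distinct positions a plain scatterplot would show
```

If the second number is much smaller than the first, a plain scatterplot is hiding points.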

Facets

One more really useful thing you can do with your plots is “facet” them. This means splitting the data and drawing one plot per group instead of a single plot. You can do this with any geom_ function. For instance, if I add facet_grid(~grade) to the chart above, I will draw two charts, one for each grade.

big.five %>% 
  ggplot(aes(x = sympathy.a, y = altruism.a, color = grade)) +
    # geom_jitter() is used instead of geom_point() so each point is drawn once
    geom_jitter() +
    geom_smooth(method = "lm") +
    facet_grid(~grade)

If you would prefer them stacked on top of each other instead of side by side, use facet_grid(grade~.).

big.five %>% 
  ggplot(aes(x = sympathy.a, y = altruism.a, color = grade)) +
    # geom_jitter() is used instead of geom_point() so each point is drawn once
    geom_jitter() +
    geom_smooth(method = "lm") +
    facet_grid(grade~.)
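
Faceting draws one panel for each distinct value of the faceting variable. The sketch below checks that on a made-up table (toy is an assumption), using layer_data() from ggplot2 to see which panel each point ended up in:

```r
library(ggplot2)

# Made-up scores for two grades (not the class data)
toy <- data.frame(
  sympathy.a = c(10, 12, 14, 11, 13, 15),
  altruism.a = c(12, 13, 16, 12, 15, 17),
  grade      = rep(c("7th", "8th"), each = 3)
)

p <- ggplot(toy, aes(x = sympathy.a, y = altruism.a)) +
  geom_point() +
  facet_grid(~grade)

# layer_data() records the panel each point was assigned to
panels.drawn <- length(unique(layer_data(p)$PANEL))

panels.drawn               # number of panels in the plot
length(unique(toy$grade))  # number of distinct grades
```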

Labels

Lastly, before you publish and share your graphs, you should always make sure they are properly labeled. It’s easy to do this in R using the xlab(), ylab() and ggtitle() functions. Just be sure to put the label text in quotes. For example:

big.five %>% 
  ggplot(aes(x = sympathy.a, y = altruism.a, color = gender)) +
    # geom_jitter() is used instead of geom_point() so each point is drawn once
    geom_jitter() +
    geom_smooth(method = "lm") +
    facet_grid(~class.day) +
    xlab("Sympathy") +
    ylab("Altruism") +
    ggtitle("Relationship Between Sympathy and Altruism by Class Day")

Mix and Match

You can, of course, mix and match these things in many, many ways to get all kinds of representations of your data.

Here are a few more examples of things you can do.

big.five %>% 
  ggplot(aes(x = school, y = friendliness.e, fill = school)) +
  geom_boxplot() +
  facet_grid(~gender) +
  labs(x = "School", y = "Friendliness", title = "Friendliness by School and Gender")

Using geom_violin() will make a “Violin Plot”, which is a mix between a boxplot and a smoothed histogram. Adding geom_jitter() will also overlay the points on it.

big.five %>% 
  ggplot(aes(x = school, y = friendliness.e, fill = school)) +
  geom_violin() +
  geom_jitter() +
  facet_grid(~gender) +
  labs(x = "School", y = "Friendliness", title = "Friendliness by School and Gender")

Using facet_grid(school~grade) will make two dimensions of facets, with one row per school and one column per grade.

big.five %>% 
  ggplot(aes(x = anxiety.n, fill = gender)) +
  geom_histogram(binwidth = 3) +
  facet_grid(school~grade) +
  labs(x = "Anxiety", y = "Count", title = "Anxiety by School, Grade and Gender")

geom_density() is like a histogram, but smoothed out. You can use position = "stack" to stack the groups on top of each other.

big.five %>% 
  ggplot(aes(x = artistic.interests.o, fill = class.day)) +
  geom_density(position = "stack") +
  facet_grid(gender~.)

## Warning: Groups with fewer than two data points have been dropped.

## Warning: Removed 1 rows containing missing values (position_stack).

big.five %>% 
  ggplot(aes(x = openness, fill = class.day)) +
  geom_density(position = "stack") +
  facet_grid(gender~grade) +
  scale_fill_viridis_d()

## Warning: Groups with fewer than two data points have been dropped.

## Warning: Removed 1 rows containing missing values (position_stack).

You can use geom_density(position = "fill") to make the groups fill the whole plot area. The share of the height each group takes up at a given score then reflects its relative proportion at that score.

big.five %>% 
  ggplot(aes(x = anxiety.n, fill = class.day)) +
  geom_density(position = "fill") +
  facet_grid(gender~grade) +
  scale_fill_viridis_d()

## Warning: Groups with fewer than two data points have been dropped.

## Warning: Removed 1 rows containing missing values (position_stack).