This activity introduces you to using R and R Markdown for data processing. Before starting, you should have installed R and RStudio, launched RStudio, and created your first R Markdown (RMD) file. Alternatively, you can follow the link on Moodle to work in RStudio Cloud.

We will be using the comic characters dataset collected by FiveThirtyEight for the article "Comic Books Are Still Made By Men, For Men And About Men". Read that article, and visit the GitHub page with the comic data associated with it.

Please type the activity below on your own computer, complete the visualization task at the end, and upload your Knit HTML (not the RMD!) to Moodle.

library(tidyverse)    # includes ggplot2, dplyr, tidyr, and others
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(stringr)      # string manipulation (already attached by tidyverse above)

# Load the csv
marvel_df <- read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/comic-characters/marvel-wikia-data.csv')
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   page_id = col_double(),
##   name = col_character(),
##   urlslug = col_character(),
##   ID = col_character(),
##   ALIGN = col_character(),
##   EYE = col_character(),
##   HAIR = col_character(),
##   SEX = col_character(),
##   GSM = col_character(),
##   ALIVE = col_character(),
##   APPEARANCES = col_double(),
##   `FIRST APPEARANCE` = col_character(),
##   Year = col_double()
## )
# Show the beginning of the CSV
head(marvel_df)
# Show the dimensions
dim(marvel_df)
## [1] 16376    13
# Show the column names
colnames(marvel_df)
##  [1] "page_id"          "name"             "urlslug"          "ID"              
##  [5] "ALIGN"            "EYE"              "HAIR"             "SEX"             
##  [9] "GSM"              "ALIVE"            "APPEARANCES"      "FIRST APPEARANCE"
## [13] "Year"
# Count the missing values (NA) in each column.
colSums(is.na(marvel_df))
##          page_id             name          urlslug               ID 
##                0                0                0             3770 
##            ALIGN              EYE             HAIR              SEX 
##             2812             9767             4264              854 
##              GSM            ALIVE      APPEARANCES FIRST APPEARANCE 
##            16286                3             1096              815 
##             Year 
##              815
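
The same per-column count of missing values can also be written in the pipe style used in the rest of this activity. The sketch below is one possible way to do it, using `summarise()` with `across()` from dplyr. To keep it runnable without the download, it uses `toy_df`, a tiny made-up tibble (hypothetical values, not rows from the real dataset); replacing `toy_df` with `marvel_df` should work the same way.

```r
library(tidyverse)

# Tiny made-up stand-in for marvel_df (hypothetical values, real column names)
toy_df <- tibble(
  name = c("Hero A", "Hero B", "Hero C"),
  EYE  = c("Blue Eyes", NA, NA),
  Year = c(1941, NA, 1963)
)

# One-row tibble holding the number of NA values in every column
na_counts <- toy_df %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

Unlike `colSums()`, which returns a named numeric vector, this version returns a tibble, so it can be piped into further dplyr steps.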

Basic manipulation of the data frame

# Get the first row of the data frame
marvel_df %>% 
  head(1)
# Also: marvel_df[1,]

# Get the first three values for the eye column
marvel_df %>%
  select(EYE) %>%
  head(3)
# Also: marvel_df$EYE[1:3]
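
The two versions above are not quite interchangeable: `select()` returns a one-column tibble, while `$` returns a plain vector. If you want a vector but prefer to stay in a pipe, `pull()` is the dplyr counterpart of `$`. A minimal sketch of the difference, using a made-up `toy_df` (hypothetical values) in place of `marvel_df`:

```r
library(tidyverse)

# Tiny made-up stand-in for marvel_df (hypothetical values)
toy_df <- tibble(
  name = c("Hero A", "Hero B", "Hero C", "Hero D"),
  EYE  = c("Blue Eyes", "Brown Eyes", NA, "Green Eyes")
)

# select() keeps the result as a tibble with a single column
eye_tbl <- toy_df %>% select(EYE) %>% head(3)

# pull() extracts the column as a plain vector, like toy_df$EYE[1:3]
eye_vec <- toy_df %>% pull(EYE) %>% head(3)

is.data.frame(eye_tbl)
identical(eye_vec, toy_df$EYE[1:3])
```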

# Get the first 3 values for the name, eye, and hair fields.
marvel_df %>%
  select(name, EYE, HAIR) %>%
  head(3)
# Show the names of the five earliest characters
marvel_df %>%
  arrange(Year) %>%
  head(5) %>%
  select(name, Year)
# Show the names of the five most popular characters (those with the most appearances).
# Note that the dataset is already sorted this way!
marvel_df %>%
  arrange(desc(APPEARANCES)) %>%
  head(5) %>%
  select(name)

Basic descriptive analysis

Create a bar chart for frequency of eye color

# Bit of code for rotating labels found by Googling "ggplot rotate axis labels"
ggplot(marvel_df, aes(x = EYE)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
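
If you want the numbers behind a bar chart like this one, `count()` from dplyr gives the frequencies as a table. This is a sketch rather than part of the required activity, and it uses a made-up `toy_df` (hypothetical values) so it runs on its own; with the real data you would pipe `marvel_df` into the same call.

```r
library(tidyverse)

# Tiny made-up stand-in for marvel_df (hypothetical values)
toy_df <- tibble(
  EYE = c("Blue Eyes", "Brown Eyes", "Blue Eyes", NA, "Blue Eyes", "Brown Eyes")
)

# One row per eye color, most frequent first; NA gets its own row
eye_counts <- toy_df %>%
  count(EYE, sort = TRUE)

eye_counts
```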

Create a density plot of appearances

ggplot(marvel_df, aes(x=APPEARANCES)) +
  geom_density() + 
  scale_x_log10()
## Warning: Removed 1096 rows containing non-finite values (stat_density).
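
The warning is expected: 1096 rows have no value for APPEARANCES, which matches the count reported by `colSums(is.na(marvel_df))` earlier. One way to make that choice explicit, and avoid the warning, is to filter those rows out before plotting. A minimal sketch, again on a made-up `toy_df` (hypothetical values) standing in for `marvel_df`:

```r
library(tidyverse)

# Tiny made-up stand-in for marvel_df (hypothetical values)
toy_df <- tibble(
  name        = c("Hero A", "Hero B", "Hero C", "Hero D", "Hero E"),
  APPEARANCES = c(1000, 12, NA, 150, 1)
)

# Drop rows with no appearance count before plotting
plot_df <- toy_df %>%
  filter(!is.na(APPEARANCES))

# Same density plot as above, now built from complete rows only
p <- ggplot(plot_df, aes(x = APPEARANCES)) +
  geom_density() +
  scale_x_log10()

p
```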

Multivariate Visualization

Complete this task: Visualize the relationship between two (or more) variables in the dataset. You need a single visualization that shows two (or more) different aspects of the dataset at once. Make sure your visualization is clear, with an understandable title and well-labeled axes.

The visualizations above don’t show you how to create bivariate visualizations. You’ll need to do a little bit of research to see how to do this using ggplot2. If you get stuck, ask on Slack!

Once you are done, please 1) upload your Knit HTML (not your RMD) to Moodle and 2) take a screenshot of your visualization and share it on our Slack channel.