This activity introduces you to using R and R Markdown for data processing. Before starting, you must have installed R and RStudio, launched RStudio, and created your first RMD. Alternatively, you can follow the link on Moodle to work in RStudio Cloud.
We will be using the comic characters dataset collected by FiveThirtyEight for the article "Comic Books Are Still Made By Men, For Men And About Men". Read that article, and visit the GitHub page with the comic data associated with it.
Please type the activity below on your own computer, complete the visualization task at the end, and upload your Knit HTML (not the RMD!) to Moodle.
library(tidyverse) # includes ggplot2, dplyr, tidyr, and others
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(stringr) # string manipulation (already attached by tidyverse above, so this line is optional)
# Load the csv
marvel_df <- read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/comic-characters/marvel-wikia-data.csv')
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## page_id = col_double(),
## name = col_character(),
## urlslug = col_character(),
## ID = col_character(),
## ALIGN = col_character(),
## EYE = col_character(),
## HAIR = col_character(),
## SEX = col_character(),
## GSM = col_character(),
## ALIVE = col_character(),
## APPEARANCES = col_double(),
## `FIRST APPEARANCE` = col_character(),
## Year = col_double()
## )
# Show the beginning of the CSV
head(marvel_df)
# Show the dimensions
dim(marvel_df)
## [1] 16376 13
# Show the column names
colnames(marvel_df)
## [1] "page_id" "name" "urlslug" "ID"
## [5] "ALIGN" "EYE" "HAIR" "SEX"
## [9] "GSM" "ALIVE" "APPEARANCES" "FIRST APPEARANCE"
## [13] "Year"
# Count how many missing values (NA) there are in each column.
colSums(is.na(marvel_df))
## page_id name urlslug ID
## 0 0 0 3770
## ALIGN EYE HAIR SEX
## 2812 9767 4264 854
## GSM ALIVE APPEARANCES FIRST APPEARANCE
## 16286 3 1096 815
## Year
## 815
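The same per-column count of missing values can also be written in the dplyr style used in the rest of this activity. The sketch below is a minimal illustration on a small made-up tibble (`toy_df` is hypothetical and is not the Marvel data); it assumes dplyr 1.0.0 or later, where `across()` is available.

```r
library(dplyr)

# Hypothetical toy data standing in for marvel_df (not the real dataset)
toy_df <- tibble(
  name = c("A", "B", "C", "D"),
  EYE  = c("Blue Eyes", NA, "Brown Eyes", NA),
  Year = c(1962, 1941, NA, 1963)
)

# Count the NA values in every column; the result is a one-row tibble
na_counts <- toy_df %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

Unlike `colSums(is.na(...))`, which returns a named numeric vector, this version returns a tibble, so it can be piped into further dplyr steps.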
# Get the first row of the data frame
marvel_df %>%
  head(1)
# Also: marvel_df[1,]
# Get the first three values for the eye column
marvel_df %>%
  select(EYE) %>%
  head(3)
# Also: marvel_df$EYE[1:3] (this returns a plain vector rather than a one-column tibble)
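The two ways of getting the first three eye values return different kinds of objects: `select()` keeps a tibble, while `$` extracts a vector. The sketch below shows the difference on a small made-up tibble (`toy_df` is hypothetical, not the Marvel data) and adds `pull()`, the dplyr counterpart of `$`.

```r
library(dplyr)

# Hypothetical toy data standing in for marvel_df (not the real dataset)
toy_df <- tibble(
  name = c("A", "B", "C", "D"),
  EYE  = c("Blue Eyes", "Brown Eyes", NA, "Green Eyes")
)

# select() keeps the data frame structure: a 3 x 1 tibble
eye_tbl <- toy_df %>%
  select(EYE) %>%
  head(3)

# pull() extracts the column as a plain character vector
eye_vec <- toy_df %>%
  pull(EYE) %>%
  head(3)

# Base R equivalent of the pull() version
eye_base <- toy_df$EYE[1:3]
```

Use the tibble form when you want to keep piping into other dplyr verbs, and the vector form when a function expects a plain vector.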
# Get the first 3 values for the name, eye, and hair fields.
marvel_df %>%
  select(name, EYE, HAIR) %>%
  head(3)
# Show the names of the five earliest characters
marvel_df %>%
  arrange(Year) %>%
  head(5) %>%
  select(name, Year)
# Show the names of the five most popular characters.
# Note that the dataset is already sorted in this way!
marvel_df %>%
  arrange(desc(APPEARANCES)) %>%
  head(5) %>%
  select(name)
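The sort-then-take-the-top pattern above can also be written with `slice_max()`, which is available in dplyr 1.0.0 and later. The sketch below compares the two approaches on a small made-up tibble (`toy_df` and its values are hypothetical, not the Marvel data).

```r
library(dplyr)

# Hypothetical toy data standing in for marvel_df (not the real dataset)
toy_df <- tibble(
  name        = c("A", "B", "C", "D"),
  APPEARANCES = c(10, 400, 35, 250)
)

# Pattern used in this activity: sort descending, then keep the first rows
top2_arrange <- toy_df %>%
  arrange(desc(APPEARANCES)) %>%
  head(2) %>%
  select(name)

# Alternative: slice_max() picks the rows with the largest values directly
top2_slice <- toy_df %>%
  slice_max(APPEARANCES, n = 2) %>%
  select(name)
```

One difference to keep in mind: `slice_max()` keeps ties by default (`with_ties = TRUE`), so it can return more than `n` rows, whereas `head()` always cuts off at exactly `n`.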
Create a bar chart for frequency of eye color
# Bit of code for rotating labels found by Googling "ggplot rotate axis labels"
ggplot(marvel_df, aes(x = EYE)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
Create a density plot of appearances
ggplot(marvel_df, aes(x=APPEARANCES)) +
  geom_density() +
  scale_x_log10()
## Warning: Removed 1096 rows containing non-finite values (stat_density).
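The 1096 rows in this warning match the 1096 missing values reported for APPEARANCES in the NA counts earlier, so the warning is ggplot2 dropping rows it cannot plot. If you would rather handle this explicitly, one option is to filter the missing values out before plotting. The sketch below illustrates the idea on a small made-up tibble (`toy_df` is hypothetical, not the Marvel data).

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for marvel_df (not the real dataset)
toy_df <- tibble(APPEARANCES = c(1, 5, 20, NA, 300, NA, 4000))

# How many rows would ggplot2 have to drop?
n_missing <- sum(is.na(toy_df$APPEARANCES))

# Remove the missing values up front so no rows are dropped while plotting
plot_df <- toy_df %>%
  filter(!is.na(APPEARANCES))

p <- ggplot(plot_df, aes(x = APPEARANCES)) +
  geom_density() +
  scale_x_log10()
```

Alternatively, `geom_density(na.rm = TRUE)` removes the missing values silently without changing the data frame.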
Complete this task: Visualize the relationship between two (or more) variables in the dataset. In other words, produce a single visualization that shows two (or more) different aspects of the dataset at once. Make sure your visualization is clear, with an understandable title and well-labeled axes.
The visualizations above don’t show you how to create bivariate visualizations. You’ll need to do a little bit of research to see how to do this using ggplot2. If you get stuck, ask on Slack!
Once you are done, please 1) upload your Knit HTML (not your RMD) to Moodle and 2) take a screenshot of your visualization and share it on our Slack channel.