The comics Dataset
Two publishers, Marvel and DC, have created a host of superheroes
that have made their way into popular culture. You’re probably familiar
with Batman and Spiderman, but what about Mor the Mighty? The comics
dataset has information on all comic characters that have been
introduced by DC and Marvel. If we type the name of the dataset at the
console, we get the first few rows and columns. Here we see that each
row, or case, is a different character and each column, or variable, is
a different observation made on that character. At the top it tells us
the dimensions of this dataset: over 23,000 cases and 11 variables.
Right under the variable names, it tells us that all three of these are
factors, R’s preferred way to represent categorical variables.
# Load comics dataset
comics <- read.csv("C:/Users/JuanFer Mosquera/Documents/datasets/comics.csv")
Working with factors
It’s clear that the alignment variable can be “good”
or “neutral”, but what other values are possible? If we
run levels on the align column, we learn that there are in fact four
possible alignments, including reformed criminal.
comics <- comics %>%
mutate(across(c(id, align, eye, gender, hair, publisher), as.factor))
# Working with factors
class(comics$align)
[1] "factor"
levels(comics$align)
[1] "Bad" "Good" "Neutral" "Reformed Criminals"
levels(comics$id)
[1] "No Dual" "Public" "Secret" "Unknown"
levels(comics$gender)
[1] "Female" "Male" "Other"
levels(comics$publisher)
[1] "dc" "marvel"
A common way to represent the number of cases that fall into each
combination of levels of two categorical variables, like these, is with
a contingency table. This is done with the table() command, which takes
as arguments the variables that you’re interested in.
tab <- table(comics$id, comics$align)
tab_a <- table(comics$gender, comics$align)
Dropping levels
The contingency table revealed that there are some levels that have
very low counts. To simplify the analysis, it often helps to drop such
levels.
In R, this requires two steps: first filtering out any rows with the
levels that have very low counts, then removing these levels from the
factor variable with droplevels().
This is because the droplevels() function would keep levels that have
just 1 or 2 counts; it only drops levels that don’t exist in a dataset.
tab
Bad Good Neutral Reformed Criminals
No Dual 474 647 390 0
Public 2172 2930 965 1
Secret 4493 2475 959 1
Unknown 7 0 2 0
tab_a
Bad Good Neutral Reformed Criminals
Female 1573 2490 836 1
Male 7561 4809 1799 2
Other 32 17 17 0
Let’s keep only the characters that are not reformed criminals:
# Remove align level (note: filter() also drops rows where align is NA)
comics_filtered <- comics %>%
filter(align != 'Reformed Criminals') %>%
droplevels()
comics_filtered
Bar chart
Let’s construct two side-by-side bar charts of the comics data. This
shows that there can often be two or more options for presenting the
same data. Passing the argument position = "dodge" to geom_bar() says
that you want a side-by-side (i.e. not stacked) bar chart.
# Create side-by-side bar chart of gender by alignment
ggplot(comics, aes(x = align, fill = gender)) + geom_bar(position = "dodge")

# Create side-by-side bar chart of alignment by gender
ggplot(comics, aes(x = gender, fill = align)) + geom_bar(position = "dodge") +
theme(axis.text.x = element_text(angle = 90))

# Create a stacked bar chart of id by align
ggplot(comics, aes(x = id, fill = align)) + geom_bar()

Let’s look carefully at how this is constructed: each colored bar
segment actually corresponds to a count in our table, with the x-axis
and the fill= color indicating the category that we’re looking at.
Several things pop out, like the fact that there are very few characters
whose identities are unknown, but there are many where we don’t have
data; that’s what the NAs mean.
The single largest bar segment corresponds to the most common
category: characters with secret identities that are also bad. We can
look across the identity types, though, and realize that bad is not
always the largest category. This indicates that there is indeed an
association between alignment and identity.
Counts vs. proportions
Sometimes raw counts of cases can be useful, but often it’s the
proportions that are more interesting. We can do our best to compute
these proportions in our head or we could do it explicitly.
From counts to proportions
Let’s return to our table of counts of cases by identity and
alignment. If we wanted to instead get a sense of the proportion of all
cases that fell into each category, we can take the original table of
counts, saved as tab_cnt, and provide it as input to the prop.table()
function. We see here that the single largest category is characters
that are bad and secret, at about 29% of characters. Also note that because
these are all proportions out of the whole dataset, the sum of all of
these proportions is 1.
# Simplify display format
options(scipen = 999, digits = 3)
tab_cnt <- table(comics$id, comics$align)
tab_cnt
Bad Good Neutral Reformed Criminals
No Dual 474 647 390 0
Public 2172 2930 965 1
Secret 4493 2475 959 1
Unknown 7 0 2 0
prop.table(tab_cnt)
Bad Good Neutral Reformed Criminals
No Dual 0.0305491 0.0416989 0.0251353 0.0000000
Public 0.1399845 0.1888373 0.0621939 0.0000644
Secret 0.2895721 0.1595128 0.0618072 0.0000644
Unknown 0.0004511 0.0000000 0.0001289 0.0000000
Conditional proportions
If we’re curious about systematic associations between variables, we
should look to conditional proportions. An example of a conditional
proportion is the proportion of public identity characters that
are good.
To build a table of these conditional proportions, add a 1 as the
second argument, specifying that you’d like to condition on the rows. We
see here that around 57% of all
secret characters are bad. Because we’re conditioning on identity, it’s
every row that now sums to one. To condition on the columns instead,
change that argument to 2. Now it’s the columns that sum to one and we
learn, for example, that the proportion of bad characters that are
secret is around 63%. As the number of cells in these tables gets large,
it becomes much easier to make sense of your data using graphics. The
bar chart is still a good choice, but we’re going to need to add some
options.
prop.table(tab_cnt, 1)
Bad Good Neutral Reformed Criminals
No Dual 0.313700 0.428193 0.258107 0.000000
Public 0.357943 0.482861 0.159031 0.000165
Secret 0.566726 0.312185 0.120964 0.000126
Unknown 0.777778 0.000000 0.222222 0.000000
prop.table(tab_cnt, 2)
Bad Good Neutral Reformed Criminals
No Dual 0.066331 0.106907 0.168394 0.000000
Public 0.303946 0.484137 0.416667 0.500000
Secret 0.628743 0.408956 0.414076 0.500000
Unknown 0.000980 0.000000 0.000864 0.000000
Conditional bar chart
The code below starts from the bar chart based on counts. We want to
condition on whatever is on the x-axis and stretch each bar so that it
adds up to a total proportion of 1, so we pass position = "fill" to
geom_bar().
ggplot(comics, aes(id, fill = align)) + geom_bar(position = "fill") +
ylab("Proportion")

ggplot(comics, aes(align, fill = id)) + geom_bar(position = "fill") +
ylab("Proportion")

tab <- table(comics$align, comics$gender)
options(scipen = 999, digits = 3)
prop.table(tab)
Female Male Other
Bad 0.0821968 0.3950985 0.0016722
Good 0.1301144 0.2512933 0.0008883
Neutral 0.0436850 0.0940064 0.0008883
Reformed Criminals 0.0000523 0.0001045 0.0000000
prop.table(tab, 2)
Female Male Other
Bad 0.321020 0.533554 0.484848
Good 0.508163 0.339355 0.257576
Neutral 0.170612 0.126949 0.257576
Reformed Criminals 0.000204 0.000141 0.000000
Plot of gender by align
ggplot(comics_filtered, aes(align, fill = gender)) +
geom_bar()

Plot proportion of gender, conditional on align
ggplot(comics_filtered, aes(align, fill = gender)) + geom_bar(position = "fill") +
ylab("proportion")

Distribution of one variable
Marginal distribution
To compute a table of counts for a single variable like id, just
pass that vector to the table() function as its sole argument. One
way to think of what we’ve done is to take the original two-way table
and then sum the cells across each level of align. Since we’ve summed
over the margins of the other variables, this is sometimes known as a
marginal distribution.
table(comics$id)
No Dual Public Secret Unknown
1788 6994 8698 9
tab_cnt <- table(comics$id, comics$align)
tab_cnt
Bad Good Neutral Reformed Criminals
No Dual 474 647 390 0
Public 2172 2930 965 1
Secret 4493 2475 959 1
Unknown 7 0 2 0
Simple bar chart
The syntax to create the simple bar chart is straightforward as well:
just remove the fill = align argument.
ggplot(comics, aes(id)) +
geom_bar()

Faceting
Another useful way to form the distribution of a single variable is
to condition on a particular value of another variable. We might be
interested, for example, in the distribution of id for
all neutral characters. We could either filter the
dataset and build a bar chart using only cases where alignment was
neutral, or we could use a technique called faceting. Faceting breaks
the data into subsets based on the levels of a categorical variable and
then constructs a plot for each.
Faceted bar charts
To implement this in ggplot2, we just need to add a faceting layer: the
facet_wrap() function, then a tilde (~), which can be read as “broken
down by”, and then our variable align. The result is three simple bar charts
side-by-side, the first one corresponding to the distribution of id
within all cases that have a bad alignment, and so on, for good and
neutral alignments. If this plot feels familiar, it should.
ggplot(comics_filtered, aes(id)) + geom_bar() +
facet_wrap(~ align)

Marginal bar chart
If you are interested in the distribution of alignment of
all superheroes, it makes sense to construct a bar chart for
just that single variable.
You can improve the interpretability of the plot, though, by
implementing some sensible ordering. Superheroes that are "Neutral"
show an alignment between "Good" and "Bad", so it makes sense to put
that bar in the middle.
# Reorder the levels; any level not listed (here "Reformed Criminals") becomes NA
comics$align <- factor(comics$align, levels = c("Bad", "Neutral", "Good"))
# Create the plot of align
ggplot(comics, aes(align)) + geom_bar()

Conditional bar chart
Now, if you want to break down the distribution of alignment based on
gender, you’re looking for conditional distributions.
You could make these by creating multiple filtered datasets (one for
each gender) or by faceting the plot of alignment based on gender. As a
point of comparison, refer back to the plot of the marginal
distribution of alignment in the previous section.
ggplot(comics_filtered, aes(align)) + geom_bar() +
facet_wrap(~ gender)

Improve pie chart
Pie charts are a very common way to represent the distribution of a
single categorical variable, but they can be more difficult to interpret
than bar charts. The bar chart below shows the distribution of alignment instead.
ggplot(comics, aes(align)) +
geom_bar(fill = "chartreuse") +
theme(axis.text.x = element_text(angle = 90))

END
---
title: "Exploratory data analysis"
author: Juan Fernando Mosquera
date: 2024-03-02
output: html_notebook
---

```{r}
# Load libraries
library(tidyr)
library(dplyr)
library(ggplot2)
```

## **The comics Dataset**

Two publishers, Marvel and DC, have created a host of superheroes that have made their way into popular culture. You're probably familiar with Batman and Spiderman, but what about Mor the Mighty? The comics dataset has information on all comic characters that have been introduced by DC and Marvel. If we type the name of the dataset at the console, we get the first few rows and columns. Here we see that each row, or case, is a different character and each column, or variable, is a different observation made on that character. At the top it tells us the dimensions of this dataset: over 23,000 cases and 11 variables. Right under the variable names, it tells us that all three of these are factors, R's preferred way to represent categorical variables.

```{r}
# Load comics dataset
comics <- read.csv("C:/Users/JuanFer Mosquera/Documents/datasets/comics.csv")
```

```{r}
head(comics, 20)
```

### **Working with factors**

It's clear that the alignment variable can be "**good**" or "**neutral**", but what other values are possible? If we run levels on the align column, we learn that there are in fact four possible alignments, including **reformed criminal**.

```{r}
comics <- comics %>%
  mutate(across(c(id, align, eye, gender, hair, publisher), as.factor))
```
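
The explicit conversion above is needed because `read.csv()` leaves text columns as character vectors unless told otherwise (this has been the default since R 4.0.0). As a minimal sketch, assuming a tiny inline CSV whose rows are made up for illustration and are not taken from the comics file:

```r
# Toy CSV supplied inline; these three rows are illustrative only
csv_text <- "name,align\nA,Good\nB,Bad\nC,Good"

# With stringsAsFactors = FALSE, text columns are read as character
toy_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(toy_chr$align)

# Converting explicitly gives a factor; levels are sorted alphabetically
toy_fct <- as.factor(toy_chr$align)
class(toy_fct)
levels(toy_fct)
```

The names `csv_text`, `toy_chr`, and `toy_fct` are hypothetical and exist only for this sketch.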

```{r}
# Working with factors 
class(comics$align)
levels(comics$align)
levels(comics$id)
levels(comics$gender)
levels(comics$publisher)
```

A common way to represent the number of cases that fall into each combination of levels of two categorical variables, like these, is with a **contingency table**. This is done with the `table()` command, which takes as arguments the variables that you're interested in.

```{r}
tab <- table(comics$id, comics$align)
tab_a <- table(comics$gender, comics$align)
```
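
To see how `table()` fills each cell, here is a small sketch on made-up vectors (the five cases below are hypothetical, not rows from the dataset):

```r
# Hypothetical toy vectors, one element per character
id    <- c("Secret", "Secret", "Public", "Public", "Secret")
align <- c("Bad",    "Good",   "Good",   "Good",   "Bad")

# Rows come from the first argument, columns from the second
toy_tab <- table(id, align)
toy_tab

# Each cell counts the cases with that combination of levels
toy_tab["Secret", "Bad"]   # two cases are both secret and bad
toy_tab["Public", "Bad"]   # no case is both public and bad

# The cells together account for every (non-missing) case
sum(toy_tab)
```

By default `table()` leaves out cases with missing values; passing `useNA = "ifany"` would add a separate `NA` row or column for them.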

### **Dropping levels**

The contingency table revealed that there are some levels that have very low counts. To simplify the analysis, it often helps to drop such levels.

In `R`, this requires two steps: first **filtering out any rows with the levels that have very low counts**, then **removing these levels from the factor variable** with `droplevels()`.

This is because the `droplevels()` function would keep levels that have just 1 or 2 counts; it only drops levels that don't exist in a dataset.
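
The two-step logic can be checked on a toy factor. This is only a sketch with invented values (the level name `"Reformed"` is a stand-in), using base R subsetting in place of `filter()`:

```r
# Toy factor with one rare level (illustrative data only)
align <- factor(c("Bad", "Good", "Good", "Bad", "Reformed"))

# The rare level still has one case, so droplevels() keeps it
levels(droplevels(align))

# Step 1: filter the rare case out; the level is now empty but still listed
kept <- align[align != "Reformed"]
levels(kept)

# Step 2: droplevels() removes the level that no longer occurs
levels(droplevels(kept))
```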

```{r}
tab
tab_a
```

Let's keep only the characters that are not reformed criminals:

```{r}
# Remove align level (note: filter() also drops rows where align is NA)
comics_filtered <- comics %>%
  filter(align != 'Reformed Criminals') %>%
  droplevels()

comics_filtered
```

### **Bar chart**

Let's construct two side-by-side bar charts of the `comics` data. This shows that there can often be two or more options for presenting the same data. Passing the argument `position = "dodge"` to `geom_bar()` says that you want a side-by-side (i.e. not stacked) bar chart.

```{r}
# Create side-by-side bar chart of gender by alignment
ggplot(comics, aes(x = align, fill = gender)) + geom_bar(position = "dodge")
```

```{r}
# Create side-by-side bar chart of alignment by gender
ggplot(comics, aes(x = gender, fill = align)) + geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 90))
```

```{r}
# Create a stacked bar chart of id by align
ggplot(comics, aes(x = id, fill = align)) + geom_bar()
```

Let's look carefully at how this is constructed: each colored bar segment actually corresponds to a count in our table, with the x-axis and the `fill=` color indicating the category that we're looking at. Several things pop out, like the fact that there are very few characters whose **identities are unknown**, but there are many where we don't have data; that's what the NAs mean.

The single largest bar segment corresponds to the most common category: characters with secret identities that are also bad. We can look across the identity types, though, and realize that bad is not always the largest category. This indicates that there is indeed an association between alignment and identity.

## **Counts vs. proportions**

Sometimes raw counts of cases can be useful, but often it's the proportions that are more interesting. We can do our best to compute these proportions in our head or we could do it explicitly.

### From counts to proportions

Let's return to our table of counts of cases by identity and alignment. If we wanted to instead get a sense of the proportion of all cases that fell into each category, we can take the original table of counts, saved as `tab_cnt`, and provide it as input to the `prop.table()` function. We see here that the single largest category is characters that are bad and secret, at about 29% of characters. Also note that because these are all proportions out of the whole dataset, the sum of all of these proportions is 1.

```{r}
# Simplify display format
options(scipen = 999, digits = 3)
tab_cnt <- table(comics$id, comics$align)
tab_cnt
```

```{r}
prop.table(tab_cnt)
```
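
As a sketch of what `prop.table()` computes, the block below rebuilds the id-by-align counts by hand from the rendered output of `tab_cnt` and divides by the grand total. The object names `cnt` and `props` are assumptions made for this example only.

```r
# Counts copied from the rendered output of tab_cnt (id in rows, align in columns)
cnt <- as.table(matrix(
  c( 474,  647, 390, 0,
    2172, 2930, 965, 1,
    4493, 2475, 959, 1,
       7,    0,   2, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    id    = c("No Dual", "Public", "Secret", "Unknown"),
    align = c("Bad", "Good", "Neutral", "Reformed Criminals")
  )
))

# With no margin argument, every cell is divided by the grand total
props <- prop.table(cnt)
all.equal(props, cnt / sum(cnt))

sum(props)               # all cells together sum to 1
props["Secret", "Bad"]   # share of all cases that are secret and bad
```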

### **Conditional proportions**

If we're curious about systematic associations between variables, we should look to conditional proportions. An example of a conditional proportion **is the proportion of public identity characters that are good**.

To build a table of these conditional proportions, add a 1 as the second argument, specifying that you'd like to condition on the rows. We see here that around 57% of all secret characters are bad. Because we're conditioning on identity, it's every row that now sums to one. To condition on the columns instead, change that argument to 2. Now it's the columns that sum to one and we learn, for example, that the proportion of bad characters that are secret is around 63%. As the number of cells in these tables gets large, it becomes much easier to make sense of your data using graphics. The bar chart is still a good choice, but we're going to need to add some options.

```{r}
prop.table(tab_cnt, 1)
```

```{r}
prop.table(tab_cnt, 2)
```
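
The row and column conditioning can be verified directly. The sketch below again rebuilds the counts by hand from the rendered `tab_cnt` output; `cnt`, `by_row`, and `by_col` are names invented for this example.

```r
# Counts copied from the rendered output of tab_cnt (id in rows, align in columns)
cnt <- as.table(matrix(
  c( 474,  647, 390, 0,
    2172, 2930, 965, 1,
    4493, 2475, 959, 1,
       7,    0,   2, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    id    = c("No Dual", "Public", "Secret", "Unknown"),
    align = c("Bad", "Good", "Neutral", "Reformed Criminals")
  )
))

by_row <- prop.table(cnt, 1)   # condition on id: each row sums to 1
by_col <- prop.table(cnt, 2)   # condition on align: each column sums to 1

rowSums(by_row)
colSums(by_col)

by_row["Secret", "Bad"]   # proportion of secret characters that are bad
by_col["Secret", "Bad"]   # proportion of bad characters that are secret
```

The same cell gives two different answers because the denominator changes: the row total for `"Secret"` in the first case, the column total for `"Bad"` in the second.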

### **Conditional bar chart**

The code below starts from the bar chart based on counts. We want to condition on whatever is on the x-axis and stretch each bar so that it adds up to a total proportion of 1, so we pass `position = "fill"` to `geom_bar()`.

```{r}
ggplot(comics, aes(id, fill = align)) + geom_bar(position = "fill") + 
  ylab("Proportion")
```

```{r}
ggplot(comics, aes(align, fill = id)) + geom_bar(position = "fill") +
  ylab("Proportion")
```

```{r}
tab <- table(comics$align, comics$gender)
options(scipen = 999, digits = 3)
prop.table(tab)
prop.table(tab, 2)
```

### **Plot of gender by align**

```{r}
ggplot(comics_filtered, aes(align, fill = gender)) + 
  geom_bar()
```

### **Plot proportion of gender, conditional on align**

```{r}
ggplot(comics_filtered, aes(align, fill = gender)) + geom_bar(position = "fill") +
  ylab("proportion")
```

## **Distribution of one variable**

### **Marginal distribution**

To compute a table of counts for a single variable like `id`, just pass that vector to the `table()` function as its sole argument. One way to think of what we've done is to take the original two-way table and then sum the cells across each level of `align`. Since we've summed over the margins of the other variables, this is sometimes known as a marginal distribution.

```{r}
table(comics$id)
```

```{r}
tab_cnt <- table(comics$id, comics$align)
tab_cnt
```
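
Summing a two-way table over one variable can be sketched with `margin.table()`. The toy vectors below are invented, and they include one missing alignment to illustrate a caveat: `table()` skips cases where either variable is `NA`, which is a likely reason the one-way counts of `id` in the rendered output are larger than the row sums of `tab_cnt`.

```r
# Toy data with one missing alignment (illustrative only)
id    <- c("Secret", "Secret", "Public", "Public", "Secret")
align <- c("Bad",    NA,       "Good",   "Good",   "Bad")

two_way <- table(id, align)

# Summing the two-way table over align gives the margin for id...
margin.table(two_way, 1)

# ...but the case with a missing align was never in the two-way table,
# so the one-way count of id can be larger
table(id)
```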

### **Simple bar chart**

The syntax to create the simple bar chart is straightforward as well: just remove the `fill = align` argument.

```{r}
ggplot(comics, aes(id)) + 
  geom_bar()
```

### **Faceting**

Another useful way to form the distribution of a single variable is to condition on a particular value of another variable. We might be interested, for example, in the distribution of **id** for all **neutral characters**. We could either filter the dataset and build a bar chart using only cases where alignment was neutral, or we could use a technique called faceting. Faceting breaks the data into subsets based on the levels of a categorical variable and then constructs a plot for each.

### **Faceted bar charts**

To implement this in `ggplot2`, we just need to add a faceting layer: the `facet_wrap()` function, then a tilde (\~), which can be read as "**broken down by**" and then our variable `align`. The result is three simple bar charts side-by-side, the first one corresponding to the distribution of id within all cases that have a bad alignment, and so on, for good and neutral alignments. If this plot feels familiar, it should.

```{r}
ggplot(comics_filtered, aes(id)) + geom_bar() + 
  facet_wrap(~ align)
```
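
One way to think about what faceting does to the data is to split one variable by the levels of another and count within each piece. The base R sketch below uses invented rows and is only an analogy for the counts each panel displays; it is not how ggplot2 is implemented.

```r
# Toy data (illustrative only)
toy <- data.frame(
  id    = c("Secret", "Public", "Secret", "Public", "Secret"),
  align = c("Bad",    "Bad",    "Good",   "Good",   "Good")
)

# One table of id counts per level of align, analogous to one panel each
counts_by_align <- lapply(split(toy$id, toy$align), table)

counts_by_align$Bad
counts_by_align$Good
```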

### **Marginal bar chart**

If you are interested in the distribution of alignment of *all* superheroes, it makes sense to construct a bar chart for just that single variable.

You can improve the interpretability of the plot, though, by implementing some sensible ordering. Superheroes that are `"Neutral"` show an alignment between `"Good"` and `"Bad"`, so it makes sense to put that bar in the middle.

```{r}
# Reorder the levels; any level not listed (here "Reformed Criminals") becomes NA
comics$align <- factor(comics$align, levels = c("Bad", "Neutral", "Good"))

# Create the plot of align
ggplot(comics, aes(align)) + geom_bar()
```
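
The effect of passing `levels` to `factor()` can be seen on a small made-up vector. This is a sketch with invented values, showing both the new ordering and what happens to a level that is left out:

```r
# Toy factor (illustrative only); default level order is alphabetical
align <- factor(c("Good", "Bad", "Neutral", "Good", "Reformed Criminals"))
levels(align)

# Supplying levels explicitly sets the order used on the plot axis
reordered <- factor(align, levels = c("Bad", "Neutral", "Good"))
levels(reordered)

# Values whose level is not listed are converted to NA
sum(is.na(reordered))
```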

### **Conditional bar chart**

Now, if you want to break down the distribution of alignment based on gender, you're looking for conditional distributions.

You could make these by creating multiple filtered datasets (one for each gender) or by faceting the plot of alignment based on gender. As a point of comparison, refer back to the plot of the marginal distribution of alignment in the previous section.

```{r}
ggplot(comics_filtered, aes(align)) + geom_bar() + 
  facet_wrap(~ gender)
```

### **Improve pie chart**

Pie charts are a very common way to represent the distribution of a single categorical variable, but they can be more difficult to interpret than bar charts. The bar chart below shows the distribution of alignment instead.

```{r}
ggplot(comics, aes(align)) +
  geom_bar(fill = "chartreuse") +
  theme(axis.text.x = element_text(angle = 90))
```

### **END**
