# Load libraries
library(tidyr)
library(dplyr)
library(ggplot2)

The comics Dataset

Two publishers, Marvel and DC, have created a host of superheroes that have made their way into popular culture. You’re probably familiar with Batman and Spiderman, but what about Mor the Mighty? The comics dataset has information on all comic characters that have been introduced by DC and Marvel. If we type the name of the dataset at the console, we get the first few rows and columns. Here we see that each row, or case, is a different character and each column, or variable, is a different observation made on that character. At the top it tells us the dimensions of this dataset: over 23,000 cases and 11 variables. Right under the variable names, it tells us that all three of these are factors, R’s preferred way to represent categorical variables.

# Load comics dataset
comics <- read.csv("C:/Users/JuanFer Mosquera/Documents/datasets/comics.csv")

Working with factors

It’s clear that the alignment variable can be “good” or “neutral”, but what other values are possible? If we run levels() on the align column, we learn that there are in fact four possible alignments, including “Reformed Criminals”.

comics <- comics %>%
  mutate(across(c(id, align, eye, gender, hair, publisher), as.factor))
# Working with factors 
class(comics$align)
[1] "factor"
levels(comics$align)
[1] "Bad"                "Good"               "Neutral"            "Reformed Criminals"
levels(comics$id)
[1] "No Dual" "Public"  "Secret"  "Unknown"
levels(comics$gender)
[1] "Female" "Male"   "Other" 
levels(comics$publisher)
[1] "dc"     "marvel"

A common way to represent the number of cases that fall into each combination of levels of two categorical variables, like these, is with a contingency table. This is done with the table() command, which takes as arguments the variables that you’re interested in.

tab <- table(comics$id, comics$align)
tab_a <- table(comics$gender, comics$align)

Dropping levels

The contingency table revealed that there are some levels that have very low counts. To simplify the analysis, it often helps to drop such levels.

In R, this requires two steps: first filtering out any rows with the levels that have very low counts, then removing these levels from the factor variable with droplevels().

This is because the droplevels() function would keep levels that have just 1 or 2 counts; it only drops levels that don’t exist in a dataset.

tab
         
           Bad Good Neutral Reformed Criminals
  No Dual  474  647     390                  0
  Public  2172 2930     965                  1
  Secret  4493 2475     959                  1
  Unknown    7    0       2                  0
tab_a
        
          Bad Good Neutral Reformed Criminals
  Female 1573 2490     836                  1
  Male   7561 4809    1799                  2
  Other    32   17      17                  0

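Let’s keep only the characters that are not reformed criminals. Note that filter() also drops rows where align is NA, because the comparison evaluates to NA for them: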

# Remove align level
comics_filtered <- comics %>%
  filter(align != 'Reformed Criminals') %>%
  droplevels()

comics_filtered

Bar chart

Let’s construct two side-by-side bar charts of the comics data. This shows that there can often be two or more options for presenting the same data. Passing the argument position = "dodge" to geom_bar() says that you want a side-by-side (i.e. not stacked) bar chart.

# Create side-by-side bar chart of gender by alignment
ggplot(comics, aes(x = align, fill = gender)) + geom_bar(position = "dodge")

# Create side-by-side bar chart of alignment by gender
ggplot(comics, aes(x = gender, fill = align)) + geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 90))

# Create a stacked bar chart of id, filled by align
ggplot(comics, aes(x = id, fill = align)) + geom_bar()

Let’s look carefully at how this is constructed: each colored bar segment actually corresponds to a count in our table, with the x-axis and the fill= color indicating the category that we’re looking at. Several things pop out, like the fact that there are very few characters whose identities are unknown, but there are many where we don’t have data; that’s what the NAs mean.

The single largest bar segment corresponds to the most common category: characters with secret identities that are also bad. We can look across the identity types, though, and realize that bad is not always the largest category. This indicates that there is indeed an association between alignment and identity.

Counts vs. proportions

Sometimes raw counts of cases can be useful, but often it’s the proportions that are more interesting. We can do our best to compute these proportions in our head or we could do it explicitly.

From counts to proportions

Let’s return to our table of counts of cases by identity and alignment. If we instead want a sense of the proportion of all cases that fall into each category, we can take the original table of counts, saved as tab_cnt, and provide it as input to the prop.table() function. We see here that the single largest category is characters that are bad and secret, at about 29% of characters. Also note that because these are all proportions out of the whole table (table() leaves out cases with a missing id or align), the sum of all of these proportions is 1.

# Simplify display format
options(scipen = 999, digits = 3)
tab_cnt <- table(comics$id, comics$align)
tab_cnt
         
           Bad Good Neutral Reformed Criminals
  No Dual  474  647     390                  0
  Public  2172 2930     965                  1
  Secret  4493 2475     959                  1
  Unknown    7    0       2                  0
prop.table(tab_cnt)
         
                Bad      Good   Neutral Reformed Criminals
  No Dual 0.0305491 0.0416989 0.0251353          0.0000000
  Public  0.1399845 0.1888373 0.0621939          0.0000644
  Secret  0.2895721 0.1595128 0.0618072          0.0000644
  Unknown 0.0004511 0.0000000 0.0001289          0.0000000

Conditional proportions

If we’re curious about systematic associations between variables, we should look to conditional proportions. An example of a conditional proportion is the proportion of public identity characters that are good.

To build a table of these conditional proportions, add a 1 as the second argument to prop.table(), specifying that you’d like to condition on the rows. We see here that around 57% of all secret characters are bad. Because we’re conditioning on identity, it’s every row that now sums to one. To condition on the columns instead, change that argument to 2. Now it’s the columns that sum to one and we learn, for example, that the proportion of bad characters that are secret is around 63%. As the number of cells in these tables gets large, it becomes much easier to make sense of your data using graphics. The bar chart is still a good choice, but we’re going to need to add some options.

prop.table(tab_cnt, 1)
         
               Bad     Good  Neutral Reformed Criminals
  No Dual 0.313700 0.428193 0.258107           0.000000
  Public  0.357943 0.482861 0.159031           0.000165
  Secret  0.566726 0.312185 0.120964           0.000126
  Unknown 0.777778 0.000000 0.222222           0.000000
prop.table(tab_cnt, 2)
         
               Bad     Good  Neutral Reformed Criminals
  No Dual 0.066331 0.106907 0.168394           0.000000
  Public  0.303946 0.484137 0.416667           0.500000
  Secret  0.628743 0.408956 0.414076           0.500000
  Unknown 0.000980 0.000000 0.000864           0.000000

Conditional bar chart

We start from the code for the bar chart based on counts. We want to condition on whatever is on the x axis and stretch those bars to each add up to a total proportion of 1, so we add the position = "fill" option to geom_bar().

ggplot(comics, aes(id, fill = align)) + geom_bar(position = "fill") + 
  ylab("Proportion")

ggplot(comics, aes(align, fill = id)) + geom_bar(position = "fill") +
  ylab("Proportion")

tab <- table(comics$align, comics$gender)
options(scipen = 999, digits = 3)
prop.table(tab)
                    
                        Female      Male     Other
  Bad                0.0821968 0.3950985 0.0016722
  Good               0.1301144 0.2512933 0.0008883
  Neutral            0.0436850 0.0940064 0.0008883
  Reformed Criminals 0.0000523 0.0001045 0.0000000
prop.table(tab, 2)
                    
                       Female     Male    Other
  Bad                0.321020 0.533554 0.484848
  Good               0.508163 0.339355 0.257576
  Neutral            0.170612 0.126949 0.257576
  Reformed Criminals 0.000204 0.000141 0.000000

Plot of gender by align

ggplot(comics_filtered, aes(align, fill = gender)) + 
  geom_bar()

Plot proportion of gender, conditional on align

ggplot(comics_filtered, aes(align, fill = gender)) + geom_bar(position = "fill") +
  ylab("proportion")

Distribution of one variable

Marginal distribution

To compute a table of counts for a single variable like id, just pass that vector to the table() function as its sole argument. One way to think of what we’ve done is to take the original two-way table and then sum the cells across each level of align. Since we’ve summed over the margins of the other variables, this is sometimes known as a marginal distribution.

table(comics$id)

No Dual  Public  Secret Unknown 
   1788    6994    8698       9 
tab_cnt <- table(comics$id, comics$align)
tab_cnt
         
           Bad Good Neutral Reformed Criminals
  No Dual  474  647     390                  0
  Public  2172 2930     965                  1
  Secret  4493 2475     959                  1
  Unknown    7    0       2                  0

Simple bar chart

The syntax to create the simple bar chart is straightforward as well: just remove the fill = align argument.

ggplot(comics, aes(id)) + 
  geom_bar()

Faceting

Another useful way to form the distribution of a single variable is to condition on a particular value of another variable. We might be interested, for example, in the distribution of id for all neutral characters. We could either filter the dataset and build a bar chart using only cases where alignment was neutral, or we could use a technique called faceting. Faceting breaks the data into subsets based on the levels of a categorical variable and then constructs a plot for each.

Faceted bar charts

To implement this in ggplot2, we just need to add a faceting layer: the facet_wrap() function, then a tilde (~), which can be read as “broken down by” and then our variable align. The result is three simple bar charts side-by-side, the first one corresponding to the distribution of id within all cases that have a bad alignment, and so on, for good and neutral alignments. If this plot feels familiar, it should.

ggplot(comics_filtered, aes(id)) + geom_bar() + 
  facet_wrap(~ align)

Marginal bar chart

If you are interested in the distribution of alignment of all superheroes, it makes sense to construct a bar chart for just that single variable.

You can improve the interpretability of the plot, though, by implementing some sensible ordering. Superheroes that are "Neutral" show an alignment between "Good" and "Bad", so it makes sense to put that bar in the middle.

# Levels not listed here ("Reformed Criminals") become NA
comics$align <- factor(comics$align, levels = c("Bad", "Neutral", "Good"))

# Create the plot of align
ggplot(comics, aes(align)) + geom_bar()

Conditional bar chart

Now, if you want to break down the distribution of alignment based on gender, you’re looking for conditional distributions.

You could make these by creating multiple filtered datasets (one for each gender) or by faceting the plot of alignment based on gender. As a point of comparison, the plot of the marginal distribution of alignment is shown in the previous section.

ggplot(comics_filtered, aes(align)) + geom_bar() + 
  facet_wrap(~ gender)

Improve pie chart

Pie charts are a very common way to represent the distribution of a single categorical variable, but they can be more difficult to interpret than bar charts. The same distribution is shown here as a bar chart instead.

ggplot(comics, aes(align)) +
  geom_bar(fill = "chartreuse") +
  theme(axis.text.x = element_text(angle = 90))

END

---
title: "Exploratory data analysis"
author: Juan Fernando Mosquera
date: 2024-03-02
output: html_notebook
---

```{r}
# Load libraries
library(tidyr)
library(dplyr)
library(ggplot2)
```

## **The comics Dataset**

Two publishers, Marvel and DC, have created a host of superheroes that have made their way into popular culture. You're probably familiar with Batman and Spiderman, but what about Mor the Mighty? The comics dataset has information on all comic characters that have been introduced by DC and Marvel. If we type the name of the dataset at the console, we get the first few rows and columns. Here we see that each row, or case, is a different character and each column, or variable, is a different observation made on that character. At the top it tells us the dimensions of this dataset: over 23,000 cases and 11 variables. Right under the variable names, it tells us that all three of these are factors, R's preferred way to represent categorical variables.

```{r}
# Load comics dataset
comics <- read.csv("C:/Users/JuanFer Mosquera/Documents/datasets/comics.csv")
```

```{r}
head(comics, 20)
```

### **Working with factors**

It's clear that the alignment variable can be "**good**" or "**neutral**", but what other values are possible? If we run `levels()` on the `align` column, we learn that there are in fact four possible alignments, including **reformed criminals**.

```{r}
comics <- comics %>%
  mutate(across(c(id, align, eye, gender, hair, publisher), as.factor))
```

```{r}
# Working with factors 
class(comics$align)
levels(comics$align)
levels(comics$id)
levels(comics$gender)
levels(comics$publisher)
```

A common way to represent the number of cases that fall into each combination of levels of two categorical variables, like these, is with a **contingency table**. This is done with the `table()` command, which takes as arguments the variables that you're interested in.

```{r}
tab <- table(comics$id, comics$align)
tab_a <- table(comics$gender, comics$align)
```
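
As a minimal sketch of what `table()` does, the block below cross-tabulates two factors in a small data frame. The `toy` data frame and its values are invented for illustration and are not part of the comics data.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  id    = factor(c("Public", "Secret", "Secret", "Public", "Secret")),
  align = factor(c("Good", "Bad", "Bad", "Bad", "Good"))
)

# Rows come from the first argument, columns from the second
toy_tab <- table(toy$id, toy$align)
toy_tab

# Each cell counts the cases with that combination of levels
toy_tab["Secret", "Bad"]
```

Indexing the table by level names, as in the last line, is a convenient way to pull out a single cell count.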

### **Dropping levels**

The contingency table revealed that there are some levels that have very low counts. To simplify the analysis, it often helps to drop such levels.

In `R`, this requires two steps: first **filtering out any rows with the levels that have very low counts**, then **removing these levels from the factor variable** with `droplevels()`.

This is because the `droplevels()` function would keep levels that have just 1 or 2 counts; it only drops levels that don't exist in a dataset.

```{r}
tab
tab_a
```

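Let's keep only the characters that are not reformed criminals. Note that `filter()` also drops rows where `align` is `NA`, because the comparison evaluates to `NA` for them: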

```{r}
# Remove align level
comics_filtered <- comics %>%
  filter(align != 'Reformed Criminals') %>%
  droplevels()

comics_filtered
```
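
The two-step logic can be checked on a small example. This is only a sketch on a hypothetical `toy` data frame (not the comics data), showing that `droplevels()` removes a level only once no rows use it.

```r
library(dplyr)

# Hypothetical toy factor with one rare level ("C")
toy <- data.frame(grp = factor(c("A", "A", "B", "B", "B", "C")))

# droplevels() alone changes nothing: "C" still occurs once
levels(droplevels(toy)$grp)

# Filtering removes the row but keeps "C" as an empty level
toy_no_c <- toy %>% filter(grp != "C")
levels(toy_no_c$grp)

# Only after filtering does droplevels() remove the unused level
toy_clean <- toy_no_c %>% droplevels()
levels(toy_clean$grp)
```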

### **Bar chart**

Let's construct two side-by-side bar charts of the `comics` data. This shows that there can often be two or more options for presenting the same data. Passing the argument `position = "dodge"` to `geom_bar()` says that you want a side-by-side (i.e. not stacked) bar chart.

```{r}
# Create side-by-side bar chart of gender by alignment
ggplot(comics, aes(x = align, fill = gender)) + geom_bar(position = "dodge")
```

```{r}
# Create side-by-side bar chart of alignment by gender
ggplot(comics, aes(x = gender, fill = align)) + geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 90))
```

```{r}
# Create a stacked bar chart of id, filled by align
ggplot(comics, aes(x = id, fill = align)) + geom_bar()
```

Let's look carefully at how this is constructed: each colored bar segment actually corresponds to a count in our table, with the x-axis and the `fill=` color indicating the category that we're looking at. Several things pop out, like the fact that there are very few characters whose **identities are unknown**, but there are many where we don't have data; that's what the NAs mean.

The single largest bar segment corresponds to the most common category: characters with secret identities that are also bad. We can look across the identity types, though, and realize that bad is not always the largest category. This indicates that there is indeed an association between alignment and identity.

## **Counts vs. proportions**

Sometimes raw counts of cases can be useful, but often it's the proportions that are more interesting. We can do our best to compute these proportions in our head or we could do it explicitly.

### From counts to proportions

Let's return to our table of counts of cases by identity and alignment. If we instead want a sense of the proportion of all cases that fall into each category, we can take the original table of counts, saved as `tab_cnt`, and provide it as input to the `prop.table()` function. We see here that the single largest category is characters that are bad and secret, at about 29% of characters. Also note that because these are all proportions out of the whole table (`table()` leaves out cases with a missing `id` or `align`), the sum of all of these proportions is 1.

```{r}
# Simplify display format
options(scipen = 999, digits = 3)
tab_cnt <- table(comics$id, comics$align)
tab_cnt
```

```{r}
prop.table(tab_cnt)
```
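
To see the arithmetic behind `prop.table()` with no second argument, here is a small sketch on a hypothetical 2 x 2 table of counts. The numbers are invented so that the grand total is 100.

```r
# Hypothetical 2 x 2 table of counts, invented for illustration
toy_cnt <- as.table(matrix(
  c(10, 30, 20, 40),
  nrow = 2,
  dimnames = list(id = c("Public", "Secret"), align = c("Bad", "Good"))
))

# With no margin argument, every cell is divided by the grand total
toy_prop <- prop.table(toy_cnt)
toy_prop

# The joint proportions therefore sum to 1
sum(toy_prop)
```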

### **Conditional proportions**

If we're curious about systematic associations between variables, we should look to conditional proportions. An example of a conditional proportion **is the proportion of public identity characters that are good**.

To build a table of these conditional proportions, add a 1 as the second argument to `prop.table()`, specifying that you'd like to condition on the rows. We see here that around 57% of all secret characters are bad. Because we're conditioning on identity, it's every row that now sums to one. To condition on the columns instead, change that argument to 2. Now it's the columns that sum to one and we learn, for example, that the proportion of bad characters that are secret is around 63%. As the number of cells in these tables gets large, it becomes much easier to make sense of your data using graphics. The bar chart is still a good choice, but we're going to need to add some options.

```{r}
prop.table(tab_cnt, 1)
```

```{r}
prop.table(tab_cnt, 2)
```
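
The difference between the two margins is easier to verify on a small table. The sketch below uses a hypothetical `toy_cnt` table with invented counts and checks which sums equal one in each case.

```r
# Hypothetical 2 x 2 table of counts, invented for illustration
toy_cnt <- as.table(matrix(
  c(10, 30, 20, 40),
  nrow = 2,
  dimnames = list(id = c("Public", "Secret"), align = c("Bad", "Good"))
))

# Margin 1: condition on rows, so each row sums to 1
by_row <- prop.table(toy_cnt, 1)
rowSums(by_row)

# Margin 2: condition on columns, so each column sums to 1
by_col <- prop.table(toy_cnt, 2)
colSums(by_col)

# Proportion of "Bad" cases that are "Secret": 30 / (10 + 30)
by_col["Secret", "Bad"]
```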

### **Conditional bar chart**

We start from the code for the bar chart based on counts. We want to condition on whatever is on the x axis and stretch those bars to each add up to a total proportion of 1, so we add the `position = "fill"` option to `geom_bar()`.

```{r}
ggplot(comics, aes(id, fill = align)) + geom_bar(position = "fill") + 
  ylab("Proportion")
```

```{r}
ggplot(comics, aes(align, fill = id)) + geom_bar(position = "fill") +
  ylab("Proportion")
```
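
One way to understand `position = "fill"` is to compute the conditional proportions by hand and plot them with `geom_col()`. This is a sketch of an equivalent approach on a hypothetical `toy` data frame, not the method used in the chunks above.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  id    = c("Public", "Public", "Public", "Secret", "Secret"),
  align = c("Good", "Good", "Bad", "Bad", "Bad")
)

# Count each combination, then divide by the total within each id
toy_props <- toy %>%
  count(id, align) %>%
  group_by(id) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# geom_col() plots the precomputed proportions; each bar stacks to 1
p <- ggplot(toy_props, aes(x = id, y = prop, fill = align)) +
  geom_col() +
  ylab("Proportion")
p
```

Computing the proportions explicitly is more verbose, but it leaves a data frame that can be inspected or labelled, which `position = "fill"` does not.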

```{r}
tab <- table(comics$align, comics$gender)
options(scipen = 999, digits = 3)
prop.table(tab)
prop.table(tab, 2)
```

### **Plot of gender by align**

```{r}
ggplot(comics_filtered, aes(align, fill = gender)) + 
  geom_bar()
```

### **Plot proportion of gender, conditional on align**

```{r}
ggplot(comics_filtered, aes(align, fill = gender)) + geom_bar(position = "fill") +
  ylab("proportion")
```

## **Distribution of one variable**

### **Marginal distribution**

To compute a table of counts for a single variable like `id`, just pass that vector to the `table()` function as its sole argument. One way to think of what we've done is to take the original two-way table and then sum the cells across each level of `align`. Since we've summed over the margins of the other variables, this is sometimes known as a marginal distribution.

```{r}
table(comics$id)
```

```{r}
tab_cnt <- table(comics$id, comics$align)
tab_cnt
```
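
The idea of summing across the other variable can be sketched directly with `rowSums()` or `margin.table()`. The `toy_cnt` table below is hypothetical and has no missing values, so its margins match a one-way count exactly.

```r
# Hypothetical 2 x 2 table of counts, invented for illustration
toy_cnt <- as.table(matrix(
  c(10, 30, 20, 40),
  nrow = 2,
  dimnames = list(id = c("Public", "Secret"), align = c("Bad", "Good"))
))

# Summing across the align columns leaves one count per id level
rowSums(toy_cnt)

# margin.table() computes the same marginal counts
margin.table(toy_cnt, 1)
```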

### **Simple bar chart**

The syntax to create the simple bar chart is straightforward as well: just remove the `fill = align` argument.

```{r}
ggplot(comics, aes(id)) + 
  geom_bar()
```

### **Faceting**

Another useful way to form the distribution of a single variable is to condition on a particular value of another variable. We might be interested, for example, in the distribution of **id** for all **neutral characters**. We could either filter the dataset and build a bar chart using only cases where alignment was neutral, or we could use a technique called faceting. Faceting breaks the data into subsets based on the levels of a categorical variable and then constructs a plot for each.

### **Faceted bar charts**

To implement this in `ggplot2`, we just need to add a faceting layer: the `facet_wrap()` function, then a tilde (\~), which can be read as "**broken down by**" and then our variable `align`. The result is three simple bar charts side-by-side, the first one corresponding to the distribution of id within all cases that have a bad alignment, and so on, for good and neutral alignments. If this plot feels familiar, it should.

```{r}
ggplot(comics_filtered, aes(id)) + geom_bar() + 
  facet_wrap(~ align)
```
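
For comparison, the filtering alternative described earlier can be sketched as follows. The `toy` data frame is hypothetical; the point is only that one filtered plot corresponds to a single panel of the faceted plot.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  id    = c("Public", "Secret", "Secret", "Public", "Secret", "Public"),
  align = c("Neutral", "Neutral", "Bad", "Good", "Neutral", "Bad")
)

# Option 1: filter to one level, then draw a single bar chart
neutral_only <- toy %>% filter(align == "Neutral")
p_neutral <- ggplot(neutral_only, aes(id)) + geom_bar()

# Option 2: facet, which draws one such chart per level of align
p_facets <- ggplot(toy, aes(id)) + geom_bar() + facet_wrap(~ align)
```

Faceting avoids repeating the filter step for every level and keeps the panels on shared axes by default.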

### **Marginal bar chart**

If you are interested in the distribution of alignment of *all* superheroes, it makes sense to construct a bar chart for just that single variable.

You can improve the interpretability of the plot, though, by implementing some sensible ordering. Superheroes that are `"Neutral"` show an alignment between `"Good"` and `"Bad"`, so it makes sense to put that bar in the middle.

```{r}
# Levels not listed here ("Reformed Criminals") become NA
comics$align <- factor(comics$align, levels = c("Bad", "Neutral", "Good"))

# Create the plot of align
ggplot(comics, aes(align)) + geom_bar()
```
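
A short sketch of how `factor()` reorders levels, using hypothetical vectors `x` and `y` rather than the comics data. It also shows that any value missing from the `levels` argument is turned into `NA`.

```r
# Hypothetical vector, invented for illustration only
x <- factor(c("Good", "Bad", "Neutral", "Good"))

# Levels are alphabetical by default
levels(x)

# Passing levels explicitly sets the order used on plot axes
x_ord <- factor(x, levels = c("Bad", "Neutral", "Good"))
levels(x_ord)

# Values that are not listed in levels become NA
y <- factor(c("Good", "Reformed"), levels = c("Bad", "Neutral", "Good"))
is.na(y)
```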

### **Conditional bar chart**

Now, if you want to break down the distribution of alignment based on gender, you're looking for conditional distributions.

You could make these by creating multiple filtered datasets (one for each gender) or by faceting the plot of alignment based on gender. As a point of comparison, the plot of the marginal distribution of alignment is shown in the previous section.

```{r}
ggplot(comics_filtered, aes(align)) + geom_bar() + 
  facet_wrap(~ gender)
```

### **Improve pie chart**

Pie charts are a very common way to represent the distribution of a single categorical variable, but they can be more difficult to interpret than bar charts. The same distribution is shown here as a bar chart instead.

```{r}
ggplot(comics, aes(align)) +
  geom_bar(fill = "chartreuse") +
  theme(axis.text.x = element_text(angle = 90))
```

### **END**
