Histograms

tidyverse

Let’s load the tidyverse in RStudio.

library(tidyverse)

Histograms and Bar Plots

What is the difference between a histogram and a bar plot? Let’s compare the two side-by-side.

A bar plot counts the cars in each category. Our data has a lot of SUVs and not many 2-seaters or minivans. We don’t know anything else about them right now, except how many of them there are.

A histogram, on the other hand, “counts” the cars based on their highway mileage. Most of the cars, be they SUVs or minivans (we don’t know), seem to have an efficiency of between 15 and 30 miles per gallon on a highway.

Unlike bar plots, histograms are a way of counting numerical data. Like boxplots or violin plots, they show us the spread of the data.
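The comparison above appears to describe the mpg dataset that ships with ggplot2; that is an assumption based on the mention of SUVs, 2-seaters and highway mileage. A minimal sketch of the two plots under that assumption, with an arbitrarily chosen binwidth:

```r
library(ggplot2)

# Bar plot: one bar per category of a categorical variable (vehicle class)
bar_plot <- ggplot(mpg, aes(class)) +
  geom_bar()

# Histogram: one bar per range ("bin") of a numerical variable (highway mpg)
# binwidth = 2 is an arbitrary choice for illustration
hist_plot <- ggplot(mpg, aes(hwy)) +
  geom_histogram(binwidth = 2)

bar_plot
hist_plot
```

The only real difference in the code is the type of variable passed to aes() and the geom that does the counting.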

The Brain Cancer Dataset

Let’s use the ISLR2 package again. This time, we’ll explore the BrainCancer dataset, which has information on the results of a study conducted on survival times for patients diagnosed with brain cancer. Read more about this dataset at https://cran.r-project.org/web//packages/ISLR2/ISLR2.pdf. Take some time to understand the variables.

library(ISLR2)
BrainCancer <- BrainCancer %>% as_tibble()
BrainCancer
## # A tibble: 88 x 8
##    sex    diagnosis  loc               ki   gtv stereo status  time
##    <fct>  <fct>      <fct>          <int> <dbl> <fct>   <int> <dbl>
##  1 Female Meningioma Infratentorial    90  6.11 SRS         0 57.6 
##  2 Male   HG glioma  Supratentorial    90 19.4  SRT         1  8.98
##  3 Female Meningioma Infratentorial    70  7.95 SRS         0 26.5 
##  4 Female LG glioma  Supratentorial    80  7.61 SRT         1 47.8 
##  5 Male   HG glioma  Supratentorial    90  5.06 SRT         1  6.3 
##  6 Female Meningioma Supratentorial    80  4.82 SRS         0 52.8 
##  7 Male   Meningioma Supratentorial    80  3.19 SRT         0 55.8 
##  8 Male   LG glioma  Supratentorial    80 12.4  SRT         0 42.1 
##  9 Female Meningioma Supratentorial    70 12.2  SRT         0 34.7 
## 10 Male   HG glioma  Supratentorial   100  2.53 SRT         0 11.5 
## # ... with 78 more rows

What are the categories recorded? Among others, we have the sex, the type of brain cancer (diagnosis), and the status, which tracks whether the patient was alive at the end of the study.

Bar plots

So how many male and female patients do we have?

ggplot(BrainCancer, aes(sex)) + 
  geom_bar()

We should know how to get this information as a table too.

BrainCancer %>% count(sex)
## # A tibble: 2 x 2
##   sex        n
##   <fct>  <int>
## 1 Female    45
## 2 Male      43

What types of cancers do we have information for?

ggplot(BrainCancer, aes(diagnosis)) + 
  geom_bar()

BrainCancer %>% count(diagnosis)
## # A tibble: 5 x 2
##   diagnosis      n
##   <fct>      <int>
## 1 Meningioma    42
## 2 LG glioma      9
## 3 HG glioma     22
## 4 Other         14
## 5 <NA>           1

We have one NA value. Let’s try and get rid of it.

BrainCancer %>% 
  filter(!is.na(diagnosis))
## # A tibble: 87 x 8
##    sex    diagnosis  loc               ki   gtv stereo status  time
##    <fct>  <fct>      <fct>          <int> <dbl> <fct>   <int> <dbl>
##  1 Female Meningioma Infratentorial    90  6.11 SRS         0 57.6 
##  2 Male   HG glioma  Supratentorial    90 19.4  SRT         1  8.98
##  3 Female Meningioma Infratentorial    70  7.95 SRS         0 26.5 
##  4 Female LG glioma  Supratentorial    80  7.61 SRT         1 47.8 
##  5 Male   HG glioma  Supratentorial    90  5.06 SRT         1  6.3 
##  6 Female Meningioma Supratentorial    80  4.82 SRS         0 52.8 
##  7 Male   Meningioma Supratentorial    80  3.19 SRT         0 55.8 
##  8 Male   LG glioma  Supratentorial    80 12.4  SRT         0 42.1 
##  9 Female Meningioma Supratentorial    70 12.2  SRT         0 34.7 
## 10 Male   HG glioma  Supratentorial   100  2.53 SRT         0 11.5 
## # ... with 77 more rows

This keeps every row whose diagnosis is not NA. The is.na() function returns TRUE for each value in a column that is NA and FALSE otherwise. The ! means “not”, so !is.na(diagnosis) is TRUE for exactly the rows we want to keep.

So now we have everything that is not an NA. We can see that this has 87 rows. The original had 88, so only one entry was missing a diagnosis.
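To see is.na() and ! on their own, here is a small sketch using a made-up vector (the values are for illustration only and are not taken from the dataset):

```r
# A small made-up vector with one missing value
diagnosis <- c("Meningioma", NA, "HG glioma")

is.na(diagnosis)    # TRUE where the value is missing
!is.na(diagnosis)   # "!" flips each TRUE/FALSE

# Keep only the non-missing values, which is what filter() does row by row
kept <- diagnosis[!is.na(diagnosis)]
kept
```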

We have to actually assign this to something, otherwise we’ll just be using the original BrainCancer. We could assign it back to BrainCancer and change the original data, but let’s create a new dataset.

BrainCancer1 <- BrainCancer %>% 
  filter(!is.na(diagnosis))
BrainCancer1
## # A tibble: 87 x 8
##    sex    diagnosis  loc               ki   gtv stereo status  time
##    <fct>  <fct>      <fct>          <int> <dbl> <fct>   <int> <dbl>
##  1 Female Meningioma Infratentorial    90  6.11 SRS         0 57.6 
##  2 Male   HG glioma  Supratentorial    90 19.4  SRT         1  8.98
##  3 Female Meningioma Infratentorial    70  7.95 SRS         0 26.5 
##  4 Female LG glioma  Supratentorial    80  7.61 SRT         1 47.8 
##  5 Male   HG glioma  Supratentorial    90  5.06 SRT         1  6.3 
##  6 Female Meningioma Supratentorial    80  4.82 SRS         0 52.8 
##  7 Male   Meningioma Supratentorial    80  3.19 SRT         0 55.8 
##  8 Male   LG glioma  Supratentorial    80 12.4  SRT         0 42.1 
##  9 Female Meningioma Supratentorial    70 12.2  SRT         0 34.7 
## 10 Male   HG glioma  Supratentorial   100  2.53 SRT         0 11.5 
## # ... with 77 more rows

And now let’s bring up the bar plot again using BrainCancer1 instead of BrainCancer.

ggplot(BrainCancer1, aes(diagnosis)) + 
  geom_bar()

Histograms

What are the numerical variables recorded? We have the size of the brain tumor (gtv). What are the largest, smallest and average sizes?

BrainCancer1 %>%  
  summarise(largest = max(gtv), 
            smallest = min(gtv), 
            average = mean(gtv))
## # A tibble: 1 x 3
##   largest smallest average
##     <dbl>    <dbl>   <dbl>
## 1    34.6     0.01    8.69

The summarise() function creates a new table with different summaries. We tell it to create a new column (which will be called “largest”) and what should go in that column (the maximum value it can find in gtv).

So the mean is around 8.7 cubic centimetres. Imagine something that big in your head!

But does that mean that most of them are around that size? Or that many are smaller and many are larger, and they just average out to 8.7?

To find out, we need to see the spread of the data.

ggplot(BrainCancer1, aes(gtv)) + 
  geom_histogram()

So it looks like most are between 0 and 16 cubic centimetres. More tumors are smaller than 8.7 cubic centimetres than larger, and a few enormous ones in the 20s and 30s are pulling the average up. Good to know.
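One way to check this reading of the histogram is to put the median next to the mean. This is a sketch, not part of the original analysis; the name size_summary is made up for illustration.

```r
library(dplyr)
library(ISLR2)

# Same filtering step as in the text, repeated so this runs on its own
BrainCancer1 <- as_tibble(BrainCancer) %>%
  filter(!is.na(diagnosis))

size_summary <- BrainCancer1 %>%
  summarise(average = mean(gtv),    # sensitive to a few very large tumors
            middle  = median(gtv))  # half the tumors are smaller than this
size_summary
```

If a handful of large tumors really are pulling the average up, the median should come out below the mean.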

More questions

So now we’re armed with some insights from the data. We know how many men and women participated in this study and we know what types of tumors they had. We also know what the sizes of their tumors are.

Could there be a difference in the sizes of tumors for men and women?

We would have to split the histogram into two - one histogram for men and one for women.

ggplot(BrainCancer1, aes(sex, gtv)) + 
  geom_histogram()

We get an error! This is a bummer. The geom_histogram() function just doesn’t understand what to do with a second variable. This makes sense, because by definition, it just finds the counts of one numerical variable. The count itself becomes the second variable.
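To see what “the count itself becomes the second variable” means, we can do the counting by hand. The sketch below uses cut_width() from ggplot2 to cut gtv into bins; the width of 5 and the boundary of 0 are arbitrary choices, and bin_counts is a made-up name.

```r
library(dplyr)
library(ggplot2)  # provides cut_width()
library(ISLR2)

# Same filtering step as in the text, repeated so this runs on its own
BrainCancer1 <- as_tibble(BrainCancer) %>%
  filter(!is.na(diagnosis))

# Cut gtv into bins 5 cubic centimetres wide, then count the rows in each bin
bin_counts <- BrainCancer1 %>%
  count(bin = cut_width(gtv, width = 5, boundary = 0))
bin_counts
```

Each row of this table corresponds to one bar of a histogram: the bin goes on the x-axis and n is the height.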

We can do it another way - by using facet_wrap().

ggplot(BrainCancer1, aes(gtv, fill = sex)) + 
  geom_histogram() + 
  facet_wrap(~sex)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

facet_wrap() splits one chart into multiple mini-charts on the basis of some category. Note that we put a tilde (~) before specifying sex.

So with this, we have two histograms, both of which are still just counting the tumor size, but with two different subsets of the data - one for males and one for females.
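The stat_bin() message printed above is ggplot2 pointing out that it fell back to its default of 30 bins. A sketch of the same faceted plot with an explicit binwidth follows; the value 2 is an arbitrary choice and worth experimenting with.

```r
library(dplyr)
library(ggplot2)
library(ISLR2)

# Same filtering step as in the text, repeated so this runs on its own
BrainCancer1 <- as_tibble(BrainCancer) %>%
  filter(!is.na(diagnosis))

# binwidth = 2 means each bar covers 2 cubic centimetres (arbitrary choice)
faceted <- ggplot(BrainCancer1, aes(gtv, fill = sex)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~sex)
faceted
```

Setting binwidth explicitly also silences the message.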

From this we can see that there’s not too much of a difference between the two sexes, but females seem to have slightly smaller brain tumors on average. Let’s also get the actual numbers.

BrainCancer1 %>% 
  group_by(sex) %>% 
  summarise(average = mean(gtv))
## # A tibble: 2 x 2
##   sex    average
##   <fct>    <dbl>
## 1 Female    7.64
## 2 Male      9.81

Whether this difference is medically meaningful or not, only a doctor could tell us.

Let’s ask another question - do women tend to have more of a certain type of tumor?

For this, we don’t need histograms any more, since we want to count categorical variables. So let’s go back to geom_bar() and use facet_wrap().

ggplot(BrainCancer1, aes(diagnosis)) + 
  geom_bar() + 
  facet_wrap(~sex)

By fiddling with the plot, we can represent this in other ways.

ggplot(BrainCancer1, aes(sex)) + 
  geom_bar() + 
  facet_wrap(~diagnosis)

More women get diagnosed with meningioma and more men get diagnosed with both HG and LG glioma.
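As with the earlier bar plots, the same information is available as a table. A minimal sketch, passing both columns to count() (diagnosis_by_sex is a made-up name):

```r
library(dplyr)
library(ISLR2)

# Same filtering step as in the text, repeated so this runs on its own
BrainCancer1 <- as_tibble(BrainCancer) %>%
  filter(!is.na(diagnosis))

# One row per combination of diagnosis and sex, with the number of patients
diagnosis_by_sex <- BrainCancer1 %>%
  count(diagnosis, sex)
diagnosis_by_sex
```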

Let’s combine these different plots into a single plot. This means that we’ll need to drop facet_wrap() and instead bring in that data through some other means. One way we can split it is on the basis of colour (or fill in ggplot2 terms).

ggplot(BrainCancer1, aes(diagnosis, fill = sex)) + 
  geom_bar()

So geom_bar() just counted how many meningioma cases there were, and then divided the male cases and female cases by colour.

Some further adjustments would make this more readable. The two colours stacked on top of each other are a little hard to read. There is a special argument for geom_bar() called “position”. It changes how the coloured bars are arranged.

ggplot(BrainCancer1, aes(diagnosis, fill = sex)) + 
  geom_bar(position = "dodge")

We can also do “dodge2” and “stack”. Try them out and see what they do!
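One more option, not mentioned above, is position = "fill", which stretches every bar to the same height so that the colours show proportions instead of counts. A sketch (share_plot is a made-up name):

```r
library(dplyr)
library(ggplot2)
library(ISLR2)

# Same filtering step as in the text, repeated so this runs on its own
BrainCancer1 <- as_tibble(BrainCancer) %>%
  filter(!is.na(diagnosis))

# position = "fill" rescales each bar to a total height of 1
share_plot <- ggplot(BrainCancer1, aes(diagnosis, fill = sex)) +
  geom_bar(position = "fill") +
  labs(y = "proportion")
share_plot
```

This makes it easier to compare the male/female split across diagnoses with very different numbers of patients, at the cost of hiding how many patients each bar represents.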

Homework

There are other variables we didn’t explore. The status column records whether the patient is alive or not. 0 means they are and 1 means they died during the study. How many are alive?

Does that depend on the diagnosis, the sex or the tumor size?

library(viridis)
# Since status is stored as an integer, we need to convert it to a factor first

BrainCancer1 <- BrainCancer1 %>% 
  mutate(status = as.factor(status))

ggplot(BrainCancer1, aes(status, fill = sex)) +
  geom_bar(position = "dodge") +
  labs(title = "Patient status by sex",
       subtitle = "More women lived and more men died",
       caption = "Data obtained from the ISLR2 package")

ggplot(BrainCancer1, aes(diagnosis, fill = status)) +
  geom_bar(position = "dodge") +
  labs(title = "Tumor types and survival rates",
       subtitle = "Meningioma patients had a high survival rate, but most HG glioma patients died",
       caption = "Data obtained from the ISLR2 package")

ggplot(BrainCancer1, aes(gtv)) +
  geom_histogram() +
  facet_wrap(~status) + 
  labs(title = "Patient tumor size by status",
       subtitle = "Smaller tumors have a higher chance of survival",
       caption = "Data obtained from the ISLR2 package")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Explore the other variables too!

Finally, do you remember that NA value we removed? Let’s pull it up again.

BrainCancer %>% 
  filter(is.na(diagnosis))
## # A tibble: 1 x 8
##   sex   diagnosis loc               ki   gtv stereo status  time
##   <fct> <fct>     <fct>          <int> <dbl> <fct>   <int> <dbl>
## 1 Male  <NA>      Supratentorial    90  6.38 SRT         0  50.8

We actually have everything here except the diagnosis. Based on your insights into the rest of the data and how they affect or depend on the diagnosis, can you guess what type of tumor this person had?