Data Science Lab

# install.packages("dslabs")  # these are data science labs
library("dslabs")
data(package="dslabs")
list.files(system.file("script", package = "dslabs"))
##  [1] "make-admissions.R"                   
##  [2] "make-brca.R"                         
##  [3] "make-brexit_polls.R"                 
##  [4] "make-death_prob.R"                   
##  [5] "make-divorce_margarine.R"            
##  [6] "make-gapminder-rdas.R"               
##  [7] "make-greenhouse_gases.R"             
##  [8] "make-historic_co2.R"                 
##  [9] "make-mnist_27.R"                     
## [10] "make-movielens.R"                    
## [11] "make-murders-rda.R"                  
## [12] "make-na_example-rda.R"               
## [13] "make-nyc_regents_scores.R"           
## [14] "make-olive.R"                        
## [15] "make-outlier_example.R"              
## [16] "make-polls_2008.R"                   
## [17] "make-polls_us_election_2016.R"       
## [18] "make-reported_heights-rda.R"         
## [19] "make-research_funding_rates.R"       
## [20] "make-stars.R"                        
## [21] "make-temp_carbon.R"                  
## [22] "make-tissue-gene-expression.R"       
## [23] "make-trump_tweets.R"                 
## [24] "make-weekly_us_contagious_diseases.R"
## [25] "save-gapminder-example-csv.R"
library(ggplot2)
library(RColorBrewer)
library(dplyr)
library(tidyverse)
library(plotly)

Gapminder Dataset

This dataset includes health and income outcomes for 184 countries from 1960 to 2016. It also includes two character vectors, oecd and opec, with the names of OECD and OPEC countries from 2016.

data("gapminder")
dim(gapminder)
## [1] 10545     9
new_gapminder <- na.omit(gapminder)
dim(new_gapminder)
## [1] 7139    9
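
na.omit() drops every row that has a missing value in any column, which is why the row count falls from 10545 to 7139. Before dropping rows, it can help to see which columns are responsible. The sketch below uses a small made-up data frame (the name toy and its values are illustrative, not taken from gapminder); the same calls should work on gapminder itself.

```r
# Toy data frame with made-up values (not taken from gapminder)
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  gdp = c(10, NA, 30, NA),
  fertility = c(2.1, 1.8, NA, 3.0)
)

# Number of missing values per column
na_per_column <- colSums(is.na(toy))
na_per_column

# na.omit() keeps only the rows with no missing value in any column
toy_complete <- na.omit(toy)
nrow(toy_complete)

# The same check on the real data would be:
# colSums(is.na(gapminder))
```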

Next, using bar charts of the gapminder data set, I explain how to include

  • Labels for the x- and y-axes and a title,
  • A theme for the graph,
  • Colors for a third variable, with a legend.

Bar chart on a categorical variable

The bar geom is used to produce 1d area plots for categorical x.

# Bar chart on a Categorical variable
g <- ggplot(new_gapminder, aes(year))
g + geom_bar() 

The heights of the bars commonly represent one of two things: either a count of cases in each group, or the values in a column of the data frame. By default, geom_bar uses stat="count". This makes the height of each bar equal to the number of cases in each group. If we want the heights of the bars to represent values in the data, we use stat="identity".

# Bar chart on a Categorical variable
g1 <- ggplot(new_gapminder, aes(year, life_expectancy))
g1 + geom_bar(stat="identity") 

Note that stat="identity" requires the \(y\) aesthetic to be mapped.
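
One thing to watch with stat="identity": when several rows share the same x value, their y values are stacked, so in g1 each bar is the sum of life expectancies over all countries in that year. If an average per year is what you want, one option is to summarise first. The sketch below uses made-up toy values rather than gapminder, and geom_col(), which is shorthand for geom_bar(stat="identity").

```r
library(ggplot2)

# Toy data with made-up values: two countries observed in two years
toy <- data.frame(
  year = c(2000, 2000, 2001, 2001),
  life_expectancy = c(70, 80, 71, 81)
)

# stat = "identity" stacks the rows that share an x value,
# so the bars here have heights 70 + 80 = 150 and 71 + 81 = 152
p_stacked <- ggplot(toy, aes(year, life_expectancy)) +
  geom_bar(stat = "identity")

# Averaging first gives one row per year
toy_mean <- aggregate(life_expectancy ~ year, data = toy, FUN = mean)

# geom_col() draws bars whose heights are the values in the data
p_mean <- ggplot(toy_mean, aes(year, life_expectancy)) +
  geom_col()
p_mean
```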

Labels for x- and y-axes and a title

Notice that the bar charts above use the default labels for the \(x\) and \(y\) axes, which are derived from the aesthetic mappings. I will show how to set custom axis labels with the bubble plot.

Change fill colors, width of the bar chart

The fill colors of a bar plot can be changed manually with the following functions:

  • scale_fill_manual() : to use custom colors
  • scale_fill_brewer() : to use color palettes from RColorBrewer package
  • scale_fill_grey() : to use grey color palettes
g2 <- ggplot(new_gapminder, aes(year))+ geom_bar(aes(fill=region), width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Bar Chart on Categorical Variable") 
g2

# use brewer color palettes
g2+scale_fill_brewer(palette="Dark2")

# Use grey scale
g2 + scale_fill_grey()
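
The examples above cover scale_fill_brewer() and scale_fill_grey(); scale_fill_manual() takes a vector of colors, ideally named by factor level. Keep in mind that the Dark2 palette provides at most 8 colors, so a fill variable with more levels than that (region in gapminder appears to have considerably more) triggers a warning and leaves some levels without a palette color. A variable with fewer levels, such as continent, is a safer match. The sketch below uses made-up toy groups; the names toy and group_colors are illustrative.

```r
library(ggplot2)

# Toy data with made-up groups (stand-ins for a factor such as continent)
toy <- data.frame(
  year = c(2000, 2000, 2000, 2001, 2001, 2001),
  group = c("A", "B", "C", "A", "A", "C")
)

# One named color per level; the names must match the levels of the fill variable
group_colors <- c(A = "steelblue", B = "darkorange", C = "forestgreen")

p_manual <- ggplot(toy, aes(factor(year))) +
  geom_bar(aes(fill = group), width = 0.5) +
  scale_fill_manual(values = group_colors)
p_manual
```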

Here labs() is used to add the title for the bar chart. Similarly you can add x-label and y-label inside labs() according to your need.
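
As a minimal sketch of that, labs() accepts x and y alongside title, and also the name of any mapped aesthetic (here fill) to set the legend title. The data and label texts below are made up for illustration.

```r
library(ggplot2)

# Toy data with made-up values
toy <- data.frame(
  year = c(2000, 2000, 2001),
  region = c("North", "South", "North")
)

p_labeled <- ggplot(toy, aes(factor(year))) +
  geom_bar(aes(fill = region), width = 0.5) +
  labs(
    title = "Observations per year",
    x = "Year",
    y = "Number of observations",
    fill = "Region"  # legend title for the fill aesthetic
  )
p_labeled
```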

Bar chart on a categorical variable would result in a frequency chart showing bars for each category. By adjusting width, you can adjust the thickness of the bars.

Bubble Plot

A bubble plot is a scatterplot where a third dimension is added: the value of an additional numeric variable is represented through the size of the dots.

You need 3 numerical variables as input: one is represented by the \(x\) axis, one by the \(y\) axis, and one by the dot size.

ds_theme_set()
# Libraries
library(hrbrthemes)
library(viridis)
library(gridExtra)
library(ggrepel)
gapminder_bubbleplot <- new_gapminder %>% mutate(gdp_per_capita = gdp/population) %>% filter(year == 2011)
# Show a bubbleplot
gapminder_bubbleplot %>%
  arrange(desc(population)) %>%
  mutate(text = paste("Country: ", country, "\nPopulation: ", population, "\nLife Expectancy: ", life_expectancy, "\nGdp per capita: ", gdp_per_capita, sep="")) %>%
  ggplot( aes(x=gdp_per_capita, y=life_expectancy, size = population, color = continent)) +
    geom_point(alpha=0.7) +
    scale_size(range = c(1, 19), name="Population") +
    scale_color_viridis(discrete=TRUE, guide="none") +
    theme(legend.position="none") 

The plot above shows life expectancy, GDP per capita, and population size for more than 100 countries.

bubble_plot  <- gapminder_bubbleplot %>%
  mutate(country = factor(country)) %>%
  mutate(text = paste("Country: ", country, "\nPopulation: ", population, "\nLife Expectancy: ", life_expectancy, "\nGdp per capita: ", gdp_per_capita, sep="")) %>%
  ggplot( aes(x=gdp_per_capita, y=life_expectancy, size = population, color = continent, text=text)) +
    geom_point(alpha=0.7) +
    scale_size(range = c(1.4, 19), name="Population") +
    scale_color_viridis(discrete=TRUE, guide="none") +
   theme_ipsum_rc(grid="Y") +
#theme(axis.text.y=element_blank())+
    theme(legend.position="bottom")
ggplotly(bubble_plot, tooltip="text")

In this plot we can hover over a bubble to see the country name and zoom in on a specific part of the graphic.

Draw some circles

We can draw circles with the symbols() command. Pass it values for the x-axis, y-axis, and circles, and it’ll spit out a bubble chart.

gapminder_filter <- filter(new_gapminder, country == "United States" | country == "India")
symbols(gapminder_filter$year, gapminder_filter$life_expectancy, circles=gapminder_filter$population)

Size the circles

The above command sizes the radius of the circles by population. We want to size them by area. We know that the area of a circle is \(\pi r^2\). In this case the area of the circle is the population, and we want to know \(r\). \(\therefore r = \sqrt{\text{Area of circle} / \pi}\).

Substitute population for the area of the circle, and translate to R, and we get this:

radius <- sqrt( gapminder_filter$population/ pi )
symbols(gapminder_filter$year, gapminder_filter$life_expectancy, circles=radius)
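
To see why this matters, here is a small arithmetic check with made-up populations (not gapminder values): a population four times larger should get a circle with four times the area, which means only twice the radius.

```r
# Made-up populations: the second is four times the first
population <- c(1e6, 4e6)

# Radius that makes the circle area equal to the population
radius <- sqrt(population / pi)

# Four times the area needs only twice the radius
radius[2] / radius[1]

# Recovering the area from the radius returns the populations
pi * radius^2
```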

By default, symbols() sizes the largest bubble to one inch, and then scales the rest accordingly. We can change that by using the inches argument. Whatever value you put will take the place of the one-inch default. While we’re at it, let’s add color and change the x- and y-axis labels.

symbols(gapminder_filter$year, gapminder_filter$life_expectancy, circles=radius, inches=0.35, fg="white", bg="red", xlab="year", ylab="life_expectancy")

Here we use fg to change border color, bg to change fill color.

Add labels

From the graph above, we cannot tell which region or country each circle represents. So let’s add labels. We do this with text(), whose arguments are x-coordinates, y-coordinates, and the actual text to print.

symbols(gapminder_filter$year, gapminder_filter$life_expectancy, circles=radius, inches=0.35, fg="white", bg="red", xlab="year", ylab="life_expectancy")
text(gapminder_filter$year, gapminder_filter$life_expectancy, gapminder_filter$region, cex=0.5)

The cex argument controls text size. It is 1 by default.
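
Since the filtered data contains two countries, labeling by country name may be more direct than labeling by region. The sketch below shows the same symbols() and text() pattern on a made-up toy data frame (fictional countries and values, not gapminder), with cex set to half the default size.

```r
# Toy data with made-up values for two fictional countries
toy <- data.frame(
  year = c(1990, 2010, 1990, 2010),
  life_expectancy = c(60, 66, 75, 79),
  population = c(8e8, 12e8, 2e8, 3e8),
  country = c("Country A", "Country A", "Country B", "Country B")
)

# Scale the radius so that circle area is proportional to population
radius <- sqrt(toy$population / pi)

symbols(toy$year, toy$life_expectancy, circles = radius, inches = 0.35,
        fg = "white", bg = "red", xlab = "year", ylab = "life_expectancy")

# cex = 0.5 draws the labels at half the default text size
text(toy$year, toy$life_expectancy, labels = toy$country, cex = 0.5)
```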