1. GETTING STARTED WITH DATA IN R

1.1 Errors, warnings, and messages

  • Errors: When the red text is a legitimate error, it will be prefaced with “Error in.” and try to explain what went wrong.

  • Warnings: When the red text is a warning, it will be prefaced with “Warning:” and try to explain why there’s a warning. Generally your code will still work, but with some caveats.

  • Messages: When the red text doesn’t start with either “Error” or “Warning”, it’s just a friendly message.

1.2 Package installation

Option 1:
>instal.packages("ggplot2")
>instal.packages("dplyr")
>instal.packages("nycflights13") >instal.packages("knitr")

Option 2:
Click o Packages, Instal, Name of Package. If multiple, separate with a comma.

1.3 Package loading

This is done/loaded everytime you open an r session or when you want use a package;

>library(dplyr)
>library(knitr)
>library(nycflights13)
>library(ggplot2)

Otherwise you will get the following error: Error: could not find function

1.4. Dataset Exploration

>flights - (part of nycflights13 package)

  1. Using the View() function built for use in RStudio.
  2. Using the glimpse() function, which is included in the dplyr package.
  3. Using the kable() function, which is included in the knitr package.
  4. Using the $ operator to view a single variable in a data frame.

2. DATA IMPORTING AND “TIDY” DATA

The packages we need;

library(dplyr)  
library(ggplot2)   
library(readr)     
library(tidyr)      
library(nycflights13)       
library(fivethirtyeight)        

Spreadsheet data is often saved in one of the following formats:

2.1.1 Using the console

First, let’s import a Comma Separated Values .csv file of data directly off the internet. The .csv file dem_score.csv accessible at https://moderndive.com/data/dem_score.csv contains ratings of the level of democracy in different countries spanning 1952 to 1992. Let’s use the read_csv() function from the readr package to read it off the web, import it into R, and save it in a data frame called dem_score

dem_score <- read_csv("https://moderndive.com/data/dem_score.csv")
dem_score
## # A tibble: 96 x 10
##    country   `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992`
##    <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 Albania       -9     -9     -9     -9     -9     -9     -9     -9      5
##  2 Argentina     -9     -1     -1     -9     -9     -9     -8      8      7
##  3 Armenia       -9     -7     -7     -7     -7     -7     -7     -7      7
##  4 Australia     10     10     10     10     10     10     10     10     10
##  5 Austria       10     10     10     10     10     10     10     10     10
##  6 Azerbaij~     -9     -7     -7     -7     -7     -7     -7     -7      1
##  7 Belarus       -9     -7     -7     -7     -7     -7     -7     -7      7
##  8 Belgium       10     10     10     10     10     10     10     10     10
##  9 Bhutan       -10    -10    -10    -10    -10    -10    -10    -10    -10
## 10 Bolivia       -4     -3     -3     -4     -7     -7      8      9      9
## # ... with 86 more rows
  • In this dem_score data frame, the minimum value of -10 corresponds to a highly autocratic nation whereas a value of 10 corresponds to a highly democratic nation.
  • Note that the read_csv() function included in the readr package is different than the read.csv() function that comes installed with R by default. The read_csv() function is in our opinion easier to use since it can more easily read data off the web and generally imports data at a much faster speed.

2.1.2 Using RStudio’s interface

  1. Go to Import Dataset in the work environment.
  2. Click on the data where its saved on the PC
  3. Import the dataset.

Note in the bottom right of the above image there exists a “Code Preview”: you can copy and paste this code to reload your data again later automatically instead of repeating the above manual point-and-click process.

2.2 Tidy Data

Let’s consider the drinks data frame included in the fivethirtyeight data that is from a survey of the average number of servings of beer, spirits, and wine consumed for 193 countries. This data was originally reported on the data journalism website FiveThirtyEight.com in Mona Chalabi’s article “Dear Mona Followup: Where Do People Drink The Most Beer, Wine And Spirits?”

drinks_smaller <- drinks %>%
  filter(country %in%  c("China", "Italy", "Saudi Arabia", "USA")) %>%
  select(-total_litres_of_pure_alcohol) %>%
  rename(beer = beer_servings, spirit = spirit_servings, wine = wine_servings)
drinks_smaller
## # A tibble: 4 x 4
##   country       beer spirit  wine
##   <chr>        <int>  <int> <int>
## 1 China           79    192     8
## 2 Italy           85     42   237
## 3 Saudi Arabia     0      5     0
## 4 USA            249    158    84

2.2.1 Converting to “tidy” data

We use the gather() function to convert wide/untidy data to tidy data:

drinks_smaller_tidy <- drinks_smaller %>% 
  gather(key = type, value = servings, -country)
drinks_smaller_tidy
## # A tibble: 12 x 3
##    country      type   servings
##    <chr>        <chr>     <int>
##  1 China        beer         79
##  2 Italy        beer         85
##  3 Saudi Arabia beer          0
##  4 USA          beer        249
##  5 China        spirit      192
##  6 Italy        spirit       42
##  7 Saudi Arabia spirit        5
##  8 USA          spirit      158
##  9 China        wine          8
## 10 Italy        wine        237
## 11 Saudi Arabia wine          0
## 12 USA          wine         84

We set the arguments to gather() as follows:

  1. key is the name of the column/variable in the new “tidy” frame that contains the column names of the original data frame that you want to tidy. Observe how we set key = type and in the resulting drinks_smaller_tidy the column type contains the three types of alcohol, beer, spirit, and wine.
  2. value is the name of the column/variable in the “tidy” frame that contains the rows and columns of values in the original data frame you want to tidy. Observe how we set value = servings and in the resulting drinks_smaller_tidy the column value contains the 4 × 3 = 12 numerical values.
  3. The third argument are the columns you either want to or don’t want to tidy. Observe how we set this to -country indicating that we don’t want to tidy the country variable in drinks_smaller and rather only beer, spirit, and wine.

Note the code below is very similar, but now the third argument species which columns we’d want to tidy c(beer, spirit, wine), instead of the columns we don’t want to tidy -country. Note the use of c() to create a vector of the columns in drinks_smaller that we’d like to tidy. If you run the code below, you’ll see that the resulting drinks_smaller_tidy is the same.

drinks_smaller_tidy <- drinks_smaller %>% 
  gather(key = type, value = servings, c(beer, spirit, wine))
drinks_smaller_tidy
## # A tibble: 12 x 3
##    country      type   servings
##    <chr>        <chr>     <int>
##  1 China        beer         79
##  2 Italy        beer         85
##  3 Saudi Arabia beer          0
##  4 USA          beer        249
##  5 China        spirit      192
##  6 Italy        spirit       42
##  7 Saudi Arabia spirit        5
##  8 USA          spirit      158
##  9 China        wine          8
## 10 Italy        wine        237
## 11 Saudi Arabia wine          0
## 12 USA          wine         84

With our drinks_smaller_tidy “tidy” format data frame, we can now produce a side-by-side AKA dodged barplot using geom_col() and not geom_bar(), since we would like to map the servings variable to the y-aesthetic of the bars.

ggplot(drinks_smaller_tidy, aes(x=country, y=servings, fill=type)) +
  geom_col(position = "dodge") +
  labs(x = "country", y = "servings")

Practise Questions:

  1. Tidy the airline_safety data frame in the fivethirtyeight package. Let’s ignore the incl_reg_subsidiaries and avail_seat_km_per_week variables for simplicity:
airline_safety
## # A tibble: 56 x 9
##    airline incl_reg_subsid~ avail_seat_km_p~ incidents_85_99
##    <chr>   <lgl>                       <dbl>           <int>
##  1 Aer Li~ FALSE                   320906734               2
##  2 Aerofl~ TRUE                   1197672318              76
##  3 Aeroli~ FALSE                   385803648               6
##  4 Aerome~ TRUE                    596871813               3
##  5 Air Ca~ FALSE                  1865253802               2
##  6 Air Fr~ FALSE                  3004002661              14
##  7 Air In~ TRUE                    869253552               2
##  8 Air Ne~ TRUE                    710174817               3
##  9 Alaska~ TRUE                    965346773               5
## 10 Alital~ FALSE                   698012498               7
## # ... with 46 more rows, and 5 more variables:
## #   fatal_accidents_85_99 <int>, fatalities_85_99 <int>,
## #   incidents_00_14 <int>, fatal_accidents_00_14 <int>,
## #   fatalities_00_14 <int>
airline_safety_smaller  <- airline_safety %>%
  select(-c(incl_reg_subsidiaries , avail_seat_km_per_week ))
airline_safety_smaller
## # A tibble: 56 x 7
##    airline incidents_85_99 fatal_accidents~ fatalities_85_99
##    <chr>             <int>            <int>            <int>
##  1 Aer Li~               2                0                0
##  2 Aerofl~              76               14              128
##  3 Aeroli~               6                0                0
##  4 Aerome~               3                1               64
##  5 Air Ca~               2                0                0
##  6 Air Fr~              14                4               79
##  7 Air In~               2                1              329
##  8 Air Ne~               3                0                0
##  9 Alaska~               5                0                0
## 10 Alital~               7                2               50
## # ... with 46 more rows, and 3 more variables: incidents_00_14 <int>,
## #   fatal_accidents_00_14 <int>, fatalities_00_14 <int>
airline_safety_tidy <- airline_safety_smaller %>% 
  gather(key = Incident_type, value = count, c(incidents_85_99, incidents_00_14)) %>%
  gather(key = fatal_accidents, value = count, c(fatal_accidents_85_99, fatal_accidents_00_14)) %>%
  gather(key = fatalities, value = count, c(fatalities_85_99, fatalities_00_14)) 
airline_safety_tidy
## # A tibble: 448 x 5
##    airline          Incident_type   fatal_accidents     fatalities    count
##    <chr>            <chr>           <chr>               <chr>         <int>
##  1 Aer Lingus       incidents_85_99 fatal_accidents_85~ fatalities_8~     0
##  2 Aeroflot         incidents_85_99 fatal_accidents_85~ fatalities_8~   128
##  3 Aerolineas Arge~ incidents_85_99 fatal_accidents_85~ fatalities_8~     0
##  4 Aeromexico       incidents_85_99 fatal_accidents_85~ fatalities_8~    64
##  5 Air Canada       incidents_85_99 fatal_accidents_85~ fatalities_8~     0
##  6 Air France       incidents_85_99 fatal_accidents_85~ fatalities_8~    79
##  7 Air India        incidents_85_99 fatal_accidents_85~ fatalities_8~   329
##  8 Air New Zealand  incidents_85_99 fatal_accidents_85~ fatalities_8~     0
##  9 Alaska Airlines  incidents_85_99 fatal_accidents_85~ fatalities_8~     0
## 10 Alitalia         incidents_85_99 fatal_accidents_85~ fatalities_8~    50
## # ... with 438 more rows

3. DATA VISUALIZATION

We need to load the above packages since they are for data visualizations.

3.1 Components of the Grammar of Graphics

In short, the grammar tells us that:

A statistical graphic is a mapping of data variables to aesthetic attributes of geometric objects.

Specifically, we can break a graphic into the following three essential components:

  1. data: the data set composed of variables that we map.
  2. geom: the geometric object in question. This refers to the type of object we can observe in plot. For example: points, lines, and bars.
  3. aes: aesthetic attributes of the geometric object. For example, x/y position, color, shape, and size. Each assigned aesthetic attribute can be mapped to a variable in our data set.

3.2 The Five Named Graphs

3.2.1 SCATTER PLOTS

Shows relationship between 2 numerical variables (bivariate plots).
* geom_point()

we will visualize the relationship between dep_delay and arr_delay variables in the flights data frame included in the nycflights13 package. This requires filtering down the data from all 336,776 flights that left NYC in 2013, to only the 714 Alaska Airlines (AS) flights that left NYC in 2013.

library(dplyr)  
library(ggplot2)  
library(nycflights13)

alaska_flights <- flights %>%   
     filter(carrier == "AS")

 ggplot(alaska_flights, aes(x = dep_delay, y = arr_delay)) + 
  geom_point() 
## Warning: Removed 5 rows containing missing values (geom_point).

  • We add a layer to the ggplot() function call using the + sign. The layer in question specifies the third component of the grammar: the geometric object. In this case the geometric object are points, set by specifying geom_point().
  • When adding layers to a plot, you are encouraged to start a new line after the + so that the code for each layer is on a new line. As we add more and more layers to plots, you’ll see this will greatly improve the legibility of your code.

3.2.1.1 Over-plotting

This corresponds to values being plotted on top of each other over and over again. We address this by;

Method 1: Changing the transparency
We use the alpha argument in geom_point(). By default, this value is set to 1. We can change this to any value between 0 and 1, where 0 sets the points to be 100% transparent and 1 sets the points to be 100% opaque.

ggplot(alaska_flights, aes(x = dep_delay, y = arr_delay)) + 
  geom_point(alpha =0.3 ) 
## Warning: Removed 5 rows containing missing values (geom_point).

Method 2: Jittering the points

  • This is shaking the points around a bit on the plot.

  • To create a jittered scatterplot, instead of using geom_point(), we use geom_jitter(). To specify how much jitter to add, we adjust the width and height arguments. This corresponds to how hard you’d like to shake the plot in units corresponding to those for both the horizontal and vertical variables (in this case minutes).

ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + 
  geom_jitter(width = 30, height = 30)
## Warning: Removed 5 rows containing missing values (geom_point).

3.2.2 LINE GRAPHS

  • Relationship between 2 numerical variables
  • Used when there is a sequential order to x-variable e.g. time (time series plots)
  • geom_line is the keyword

We can see that there is a variable called temp of hourly temperature recordings in Fahrenheit at weather stations near all three airports in New York City: Newark (origin code EWR), JFK, and La Guardia (LGA). Instead of considering hourly temperatures for all days in 2013 for all three airports however, for simplicity let’s only consider hourly temperatures at only Newark airport for the first 15 days in January.

early_january_weather <- weather %>% 
  filter(origin == "EWR" & month == 1 & day <= 15)
  
ggplot(early_january_weather, aes(x = time_hour, y=temp)) +
     geom_line()

3.2.3 HISTOGRAMS

  • Distribution of 1 numerical variable
  • geom_histogram
  • Facetted histograms show the distribution of 1 numerical variable split by values of another variable
ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(bins = 40, color = "white", fill = "pink")
## Warning: Removed 1 rows containing non-finite values (stat_bin).

ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(binwidth = 10, color = "red", fill = "purple")
## Warning: Removed 1 rows containing non-finite values (stat_bin).

  1. Color - separates the bins
  2. y - populates itself ie the counts of each bin
  3. The default number of bins is 30, unless specified otherwise
  4. fill - colour of the bin filling argument. Run colors() to see all 657 possible colors!

3.2.3.1 Faceting

  • Faceting is used when we’d like to split a particular visualization of variables by another variable. This will create multiple copies of the same type of plot with matching x and y axes, but whose content will differ.

  • In the below we are splitting the histogram by month
  • Note the use of the tilde ~ before month in facet_wrap(). The tilde is required and you’ll receive the error Error in as.quoted(facets) : object 'month' not found if you don’t include it before month here.

 ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(binwidth = 5, color = "white") +
  facet_wrap(~ month)
## Warning: Removed 1 rows containing non-finite values (stat_bin).

  • We can also specify the number of rows and columns in the grid by using the nrow and ncol arguments inside of facet_wrap(). For example, say we would like our faceted plot to have 4 rows instead of 3. Add the nrow = 4 argument to facet_wrap(~ month)
ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(binwidth = 5, color = "white") +
  facet_wrap(~ month, nrow = 4)
## Warning: Removed 1 rows containing non-finite values (stat_bin).

3.2.4 BOXPLOTS

  • Distribution of 1 numerical variable (y axis) split by 1 categorical variable on the x axis.
    geom_boxplot()

We want to split the box plots by month, hence we put this on the x axis.

ggplot(data = weather, mapping = aes(x = month, y = temp)) +
  geom_boxplot()
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

  • The figure does not provide the information for temperature by month as we expect, this is because x variable in this case is coded as numerical We can fix this by using the factor() function to convert to categorical variables “1”, “2”, “3”, ….. as below;
ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) +
  geom_boxplot()
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

Interpretting box plots

  • The “box” portions of this visualization represent the 1st quartile, the median AKA the 2nd quartile, and the 3rd quartile.
  • The height of each box, i.e. the value of the 3rd quartile minus the value of the 1st quartile, is the interquartile range. It is a measure of spread of the middle 50% of values, with longer boxes indicating more variability.
  • The “whisker” portions of these plots extend out from the bottoms and tops of the boxes and represent points less than the 25th percentile and greater than the 75th percentiles respectively. They’re set to extend out no more than \(1.5 \times IQR\) units away from either end of the boxes. We say “no more than” because the ends of the whiskers have to correspond to observed temperatures. The length of these whiskers show how the data outside the middle 50% of values vary, with longer whiskers indicating more variability.
  • The dots representing values falling outside the whiskers are called outliers. These can be thought of as anomalous values.

3.2.5 BAR PLOTS

  • To visualize distribution of one categorical variable
  • geom_bar() when counts are not pre-counted, geom_col() when counts are pre-counted

  • Let’s now go back to the flights data frame in the nycflights13 package and visualize the distribution of the categorical variable carrier. In the flights data frame, each row corresponds to a flight, hence not pre-counted this means we use geom_bar.
  • Note that there is only one variable in the aes() aesthetic mapping: the variable carrier

ggplot(data = flights, mapping = aes(x = carrier)) +
  geom_bar(fill="blue")

If the number of flights per carrier were precounted and the variable called number

flights_counted <- flights %>%
  group_by(carrier) %>%
  summarize(number=n())
flights_counted
## # A tibble: 16 x 2
##    carrier number
##    <chr>    <int>
##  1 9E       18460
##  2 AA       32729
##  3 AS         714
##  4 B6       54635
##  5 DL       48110
##  6 EV       54173
##  7 F9         685
##  8 FL        3260
##  9 HA         342
## 10 MQ       26397
## 11 OO          32
## 12 UA       58665
## 13 US       20536
## 14 VX        5162
## 15 WN       12275
## 16 YV         601

So now, lets plot the precounted categorical variable (number of flights per carrier);

 ggplot(flights_counted, aes(x=carrier, y=number)) +
 geom_col(fill="purple")

3.2.5.1 Two categorical variables

  • Stacked, side-by-side, and faceted barplots show the joint distribution of 2 categorical variables
  • Let’s examine the joint distribution of outgoing domestic flights from NYC by carrier and origin, or in other words the number of flights for each carrier and origin combination.
ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) +
  geom_bar()

  • The above is an example of a stacked barplot. While simple to make, in certain aspects it is not ideal. For example, it is difficult to compare the heights of the different colors between the bars, corresponding to comparing the number of flights from each origin airport between the carriers.

  • Note that fill is another aesthetic mapping much like x-position; thus it must be included within the parentheses of the aes() mapping.

  • Second, the fill aesthetic corresponds to the color used to fill the bars, while the color aesthetic corresponds to the color of the outline of the bars. Observe in Figure 3.30 that mapping origin to color and not fill yields grey bars with different colored outlines.

ggplot(data = flights, mapping = aes(x = carrier, color = origin)) +
  geom_bar()

  • Another alternative to stacked barplots are side-by-side barplots, also known as a dodged barplot. To create these, we are override the default barplot type, which is a stacked barplot, and specifying it to be a side-by-side barplot.
ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) +
  geom_bar(position = "dodge")

  • Lastly, another type of barplot is a faceted barplot. In this, we do as below;
ggplot(flights, aes(x=carrier)) +
  geom_bar(fill="black") +
  facet_wrap(~ origin, ncol=1)

  • Barplots are the preferred way of displaying the distribution of a categorical variable, or in other words the frequency with which the different categories called levels occur. They are easy to understand and make it easy to make comparisons across levels.

Additional resources

  1. R codes click here
    2.You can access this cheatsheet by going to the RStudio Menu Bar -> Help -> Cheatsheets -> “Data Visualization with ggplot2”:
  2. ModernDive Book here
  3. Grolemund, Garrett, and Hadley Wickham. 2016 R for Data Science book click here