Data @ UCI R Workshop

Hello! Thank you so much for coming to our R workshop! Today, we will be learning how to use the R package called Tidyverse. Tidyverse is a collection of R packages that are useful for handling Data Science tasks.

Introduction to R Markdown File

Let’s first start by learning about R and the R Markdown file! R is one of the popular programming languages used in Data Science. It is mainly used for statistical computing and graphics. An R Markdown file is used to create a document that combines plain text and R code. This is very helpful, especially if you want to present your work to your team or supervisors!

To write R code in an R Markdown file, you will need to create an R code chunk first. To do so, you can click on the code chunk button (the green square logo at the top) or use a keyboard shortcut:

MAC: Opt + Cmd + I
Windows: Ctrl + Alt + I

# This is an R code chunk!
print("Hello world!")

## [1] "Hello world!"

To run a code chunk, you can click on the button (looks like the play button) at the top right corner of the chunk or use the keyboard shortcut:

MAC: Cmd + Enter
Windows: Ctrl + Enter

One of the cool features of an R Markdown file is that you can convert it into an HTML file! To convert your file, click on the Knit button at the top. You can also choose to convert the file into other file types such as PDF or Word.

Loading Tidyverse

To use Tidyverse, we first need to install it using the install.packages() function. This function takes a package name as its argument. We want to install Tidyverse, so we pass "tidyverse", with the quotation marks, and run the whole function call in the Console.

Inside your console page, type: install.packages("tidyverse")

And… Voila! You have successfully installed Tidyverse! After installing the package, you need to load it before you can use it. To load the package, we use the library() function in our R file. This function takes the name of the package.

# Load a tidyverse package
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.4.0     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

When you successfully load Tidyverse, it will show a list of the packages that come with Tidyverse! We will be going over some of those packages in today’s workshop!
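If you share your R Markdown file with teammates, they may not have a package installed yet. Below is a minimal sketch of an install-only-if-missing check. The helper name install_if_missing is hypothetical (it is not part of the workshop), and it relies on base R's requireNamespace() to test whether a package is available.

```r
# Hypothetical helper: install a package only when it is not already available
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can be loaded now
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so nothing gets installed in this demonstration
ok <- install_if_missing("stats")
ok
```

Calling install_if_missing("tidyverse") before library(tidyverse) would follow the same pattern.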

Creating Objects and Lists with R

To create an object in R, we assign a value to a name. Suppose we want to assign the numeric value 2 to an object named a. To do so, we use the assignment operator <-.

# Create an object named a with value 2
a <- 2
a

## [1] 2

Now, let’s make an object called b with value 3, and an object called c that is the sum of a and b.

# Create an object named b with value 3
b <- 3
# Create an object named c with a sum of a and b
c <- a + b
c

## [1] 5

You can also make an object with non-numeric data type values!

# Create an object named d with character value "Data@UCI"
d <- "Data@UCI"
d

## [1] "Data@UCI"

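Purrrfect! Next, we will learn how to create a list. A list can hold multiple values. Strictly speaking, c() creates what R calls a vector (R also has a separate list() function), but we will keep calling it a list in this workshop. For example, we can create a list named list1 with values 1, 2, and 3 by using the c() function: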

# Create a list named list1 with values 1,2,3
list1 <- c(1,2,3)
list1

## [1] 1 2 3
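To make the vocabulary concrete, here is a small sketch comparing what c() builds (a vector) with what R's separate list() function builds. The object names v and l are made up for this illustration.

```r
v <- c(1, 2, 3)            # c() combines values into a vector (all one type)
l <- list(1, "two", TRUE)  # list() can hold values of different types

class(v)    # "numeric"
is.list(v)  # FALSE
is.list(l)  # TRUE

# Vectors support element-wise arithmetic
v * 2       # 2 4 6
```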

We can access a value from the list by using its index number. To access the value at index number n, we write: list1[n]. Unlike many other programming languages, R indexing starts at 1. So if you want to access the first value of the list, use index number 1.

# Access first value of the list
list1[1]

## [1] 1

Suppose we want to change the value of the first element. We can do this by using the = operator (the <- operator works here too):

# Change the first element of the list
list1[1] = 10
list1[1]

## [1] 10
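Indexing can do more than pick a single element. The sketch below uses a made-up vector x (not an object from the workshop) to show a range of positions, a negative index, and what happens when you ask for a position past the end.

```r
x <- c(10, 2, 3)

x[2:3]        # elements 2 through 3: 2 3
x[-1]         # everything except the first element: 2 3
x[length(x)]  # the last element: 3
x[5]          # past the end: NA (no error)
```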

When you want to add a value to a list, use the append() function. This function takes 2 arguments: 1) your list object and 2) the new value(s). Don’t forget to assign the appended list back to your current list in order to update it!

# Add a new element to a list
list1 <- append(list1, 100)
list1

## [1]  10   2   3 100

# Add multiple elements to a list
new_values <- c(99,77,55)
list1 <- append(list1, new_values)
list1

## [1]  10   2   3 100  99  77  55
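By default append() adds new values at the end, but its after argument lets you choose the position. A short sketch with made-up values (x and y are illustration names only):

```r
x <- c(10, 2, 3)

# Insert 99 right after the first element instead of at the end
y <- append(x, 99, after = 1)
y  # 10 99 2 3

# Appending at the end gives the same result as combining with c()
identical(append(x, 100), c(x, 100))  # TRUE
```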

Loading and Displaying Dataset

Next, we will be learning how to load a dataset. R and Tidyverse provide different functions for loading each type of dataset file. In today’s workshop, we will be using the read_csv() function since the dataset we are trying to load is a CSV file. Inside this function, we put the path to our dataset file in quotation marks. The file name and the file path may be different on your device, so please make sure you are using the right file name and file path!

# Load dataset
df <- read_csv("data/pollution.csv")

## New names:
## Rows: 3247 Columns: 10
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (3): country, code, continent dbl (7): ...1, year, death_percent, death_rate,
## clean_air_access, gdp, popul...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

# Show first 6 rows of dataset
head(df)

## # A tibble: 6 × 10
##    ...1 country     code   year death_percent death_rate clean_air_access   gdp
##   <dbl> <chr>       <chr> <dbl>         <dbl>      <dbl>            <dbl> <dbl>
## 1     1 Afghanistan AFG    2000          19.8       372.             8.8    NA 
## 2     2 Afghanistan AFG    2001          19.7       368.             9.51   NA 
## 3     3 Afghanistan AFG    2002          19.6       356.            10.4  1190.
## 4     4 Afghanistan AFG    2003          19.9       350.            11.5  1236.
## 5     5 Afghanistan AFG    2004          19.9       342.            12.4  1200.
## 6     6 Afghanistan AFG    2005          19.6       331.            13.5  1287.
## # … with 2 more variables: population <dbl>, continent <chr>
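The column specification message printed by read_csv() can be turned off, as the message itself suggests. Here is a minimal sketch that writes a tiny made-up CSV to a temporary file so it runs anywhere; the file contents and the name toy are assumptions for illustration, not the workshop data.

```r
library(readr)

# Write a tiny made-up CSV to a temporary file so the example is self-contained
path <- tempfile(fileext = ".csv")
writeLines(c("country,year,gdp",
             "A,2000,1.5",
             "B,2001,NA"), path)

# show_col_types = FALSE silences the column specification message
toy <- read_csv(path, show_col_types = FALSE)
toy
```

If read_csv() cannot find your own file, running file.exists() on the same path is a quick way to check whether the path is right.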

When loading a new dataset, it is helpful to look at all the features (variables/columns) of the dataset. To observe the features, we use the glimpse() function. This function shows the name, type, and first few values of each feature.

glimpse(df)

## Rows: 3,247
## Columns: 10
## $ ...1             <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ country          <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghani…
## $ code             <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG…
## $ year             <dbl> 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008,…
## $ death_percent    <dbl> 19.818674, 19.744592, 19.589144, 19.921189, 19.860979…
## $ death_rate       <dbl> 371.95134, 368.49025, 355.87085, 350.18875, 341.85811…
## $ clean_air_access <dbl> 8.80, 9.51, 10.39, 11.46, 12.43, 13.49, 14.81, 15.99,…
## $ gdp              <dbl> NA, NA, 1189.785, 1235.810, 1200.278, 1286.794, 1315.…
## $ population       <dbl> 20779957, 21606992, 22600774, 23680871, 24726689, 256…
## $ continent        <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia…

Let’s say you want to look at just one column. To do so, we use the $ symbol next to the dataset name. You specify which column you want by writing the column name after the dollar symbol: df$column_name. This returns all the values from that column as a vector (what we have been calling a list). For example, if you want the years from the dataset, select the year column like this:

# create a list of years from the dataset
years <- df$year
# Printing first 10 elements from the year column
years[1:10]

##  [1] 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009

Data Analysis with Tidyverse

Selecting Columns

Now you know how to get one column of data from a dataset. You can also select a column using a Tidyverse function called select(). We can do the same thing that we did last time by specifying the dataset name, and then the column name, inside the function:

# Select year column from df
years2 <- select(df, year)
years2

## # A tibble: 3,247 × 1
##     year
##    <dbl>
##  1  2000
##  2  2001
##  3  2002
##  4  2003
##  5  2004
##  6  2005
##  7  2006
##  8  2007
##  9  2008
## 10  2009
## # … with 3,237 more rows

Unlike last time, the select() function outputs a tibble instead of a list of values. A tibble is the Tidyverse’s version of a data frame: a table with rows and columns that prints in a compact, readable way.
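If you want the plain values rather than a one-column tibble, dplyr also offers pull(). The sketch below uses a small made-up tibble named toy (an assumption for illustration) to compare the two results.

```r
library(dplyr)

toy <- tibble(country = c("A", "B", "C"),
              year = c(2000, 2001, 2002))

year_tbl <- select(toy, year)  # still a table, with one column
year_vec <- pull(toy, year)    # just the values

is.data.frame(year_tbl)        # TRUE
is.numeric(year_vec)           # TRUE
identical(year_vec, toy$year)  # TRUE: same result as using $
```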

You can also specify multiple column names to get more than one column from a dataset:

# Selecting multiple columns from df
mul_cols <- select(df, 
                   country, continent, year)
mul_cols

## # A tibble: 3,247 × 3
##    country     continent  year
##    <chr>       <chr>     <dbl>
##  1 Afghanistan Asia       2000
##  2 Afghanistan Asia       2001
##  3 Afghanistan Asia       2002
##  4 Afghanistan Asia       2003
##  5 Afghanistan Asia       2004
##  6 Afghanistan Asia       2005
##  7 Afghanistan Asia       2006
##  8 Afghanistan Asia       2007
##  9 Afghanistan Asia       2008
## 10 Afghanistan Asia       2009
## # … with 3,237 more rows

If you don’t know the column names, you can select columns by using their column index numbers. Remember, R indexing starts at 1!

# Select columns based on column number
# In this example, we are selecting columns 4-6 and column 9
mul_col2 <- select(df, 4:6, 9)
mul_col2

## # A tibble: 3,247 × 4
##     year death_percent death_rate population
##    <dbl>         <dbl>      <dbl>      <dbl>
##  1  2000          19.8       372.   20779957
##  2  2001          19.7       368.   21606992
##  3  2002          19.6       356.   22600774
##  4  2003          19.9       350.   23680871
##  5  2004          19.9       342.   24726689
##  6  2005          19.6       331.   25654274
##  7  2006          19.1       320.   26433058
##  8  2007          18.7       307.   27100542
##  9  2008          18.3       293.   27722281
## 10  2009          17.9       278.   28394806
## # … with 3,237 more rows
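select() also accepts helper functions and negative selections. A small sketch, assuming a made-up tibble toy whose column names only imitate the workshop dataset:

```r
library(dplyr)

toy <- tibble(country = c("A", "B"),
              code = c("AAA", "BBB"),
              death_percent = c(1.2, 3.4),
              death_rate = c(10, 20))

# Keep every column whose name starts with "death"
death_cols <- select(toy, starts_with("death"))
names(death_cols)

# Drop a column by putting a minus sign in front of its name
no_code <- select(toy, -code)
names(no_code)
```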

Subsetting Rows

There are two Tidyverse functions for subsetting rows: slice() and filter().

slice() chooses rows based on row numbers:

# Choose rows 2-9 and row 3000
slice(df, 2:9, 3000)

## # A tibble: 9 × 10
##    ...1 country     code   year death_percent death_rate clean_air_access   gdp
##   <dbl> <chr>       <chr> <dbl>         <dbl>      <dbl>            <dbl> <dbl>
## 1     2 Afghanistan AFG    2001         19.7       368.              9.51   NA 
## 2     3 Afghanistan AFG    2002         19.6       356.             10.4  1190.
## 3     4 Afghanistan AFG    2003         19.9       350.             11.5  1236.
## 4     5 Afghanistan AFG    2004         19.9       342.             12.4  1200.
## 5     6 Afghanistan AFG    2005         19.6       331.             13.5  1287.
## 6     7 Afghanistan AFG    2006         19.1       320.             14.8  1316.
## 7     8 Afghanistan AFG    2007         18.7       307.             16.0  1461.
## 8     9 Afghanistan AFG    2008         18.3       293.             17.4  1484.
## 9  3000 Tuvalu      TUV    2007          4.01       50.8            30.5  3366.
## # … with 2 more variables: population <dbl>, continent <chr>

filter() chooses rows based on conditions. Suppose we want to get the data from year 2007 and later. We use this function with the condition year >= 2007:

# Choose rows with year = 2007 and up
filter(df, year >= 2007)

## # A tibble: 1,910 × 10
##     ...1 country     code   year death_percent death_rate clean_air_access   gdp
##    <dbl> <chr>       <chr> <dbl>         <dbl>      <dbl>            <dbl> <dbl>
##  1     8 Afghanistan AFG    2007          18.7       307.             16.0 1461.
##  2     9 Afghanistan AFG    2008          18.3       293.             17.4 1484.
##  3    10 Afghanistan AFG    2009          17.9       278.             18.8 1759.
##  4    11 Afghanistan AFG    2010          17.4       265.             20.7 1957.
##  5    12 Afghanistan AFG    2011          16.9       252.             22.3 1905.
##  6    13 Afghanistan AFG    2012          16.3       240.             24.1 2075.
##  7    14 Afghanistan AFG    2013          15.8       227.             26.2 2116.
##  8    15 Afghanistan AFG    2014          15.1       217.             28.0 2102.
##  9    16 Afghanistan AFG    2015          14.5       208.             30.1 2068.
## 10    17 Afghanistan AFG    2016          14.1       201.             32.4 2057.
## # … with 1,900 more rows, and 2 more variables: population <dbl>,
## #   continent <chr>
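filter() can take several conditions at once, and Tidyverse functions can be chained with the pipe operator %>%. The sketch below uses a made-up tibble toy; the names recent_asia and recent_asia_piped are illustration names only.

```r
library(dplyr)

toy <- tibble(country = c("A", "A", "B", "B"),
              continent = c("Asia", "Asia", "Europe", "Europe"),
              year = c(2006, 2008, 2006, 2008))

# Conditions separated by commas must all be TRUE for a row to be kept
recent_asia <- filter(toy, year >= 2007, continent == "Asia")
recent_asia

# The same idea written with the pipe, which passes the result on the left
# as the first argument of the function on the right
recent_asia_piped <- toy %>%
  filter(year >= 2007, continent == "Asia") %>%
  select(country, year)
recent_asia_piped
```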

ggplot2: Data Visualization with R

ggplot() is another popular function from Tidyverse (it comes from the ggplot2 package), and it is one of the most popular data visualization functions in R!

There are 3 steps to making a data visualization with ggplot:

Step 1: Choose a dataset

Step 2: Choose data for aesthetic

Step 3: Choose a type of plot using geometric layer

Visualizing Distributions

When you want to look closely at the values of a single variable (column), it is helpful to create a distribution plot of that variable. There are different types of distribution plots. Today, we will look at the histogram, boxplot, and barplot.

Distribution of Quantitative Variable

For a quantitative variable, we use a histogram and/or a boxplot to show its distribution. Remember the 3 steps of making a ggplot!

Step 1: Choose a dataset

ggplot(data = df)

Notice how R outputs a blank plot. In the next step, we will specify the x and y coordinates.

Step 2: Choose data for aesthetic

ggplot(data = df, aes( x = gdp ))

Now, the x-axis is added with the range of the gdp variable. Note that we are not specifying a variable for the y-coordinate because a distribution plot of a single variable usually does not need one.

Step 3: Choose a type of plot using geometric layer

ggplot( data = df, aes( x = gdp )) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 169 rows containing non-finite values (`stat_bin()`).

You got a histogram! Another reason why we didn’t need to specify the y-coordinate is that geom_histogram() computed the y-coordinate values (the counts) for you!

To make this plot a little more presentable, we will add colors, a title, axis names, and a theme!

ggplot( data = df, aes( x = gdp )) +
  # Add border line color and fill color of histogram
  geom_histogram( color = "black", fill = "orange" ) +
  # Add title and change axis names
  labs(title = "GDP Distribution",
       x = "GDP", y = "Count") + 
  # Change the theme to minimal theme
  theme_minimal()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 169 rows containing non-finite values (`stat_bin()`).

Your histogram can look very different when you change the number of bins. A bin is a range of values that are grouped together into one bar of the histogram, so the bin number is simply the number of bars in the graph. By default, the number of bins is 30. If we decrease the bin number, our histogram will look like this instead:

ggplot( data = df, aes( x = gdp )) +
  # Change the bin number
  geom_histogram( color = "black", fill = "orange",
                  bins = 10) +
  labs(title = "GDP Distribution (bins = 10)",
       x = "GDP", y = "Count") + 
  theme_minimal()

## Warning: Removed 169 rows containing non-finite values (`stat_bin()`).

With a higher bin number, it will look like this:

ggplot( data = df, aes( x = gdp )) +
  # Change the bin number
  geom_histogram( color = "black", fill = "orange",
                  bins = 50) +
  labs(title = "GDP Distribution (bins = 50)",
       x = "GDP", y = "Count") + 
  theme_minimal()

## Warning: Removed 169 rows containing non-finite values (`stat_bin()`).

People can interpret your histogram differently depending on the bin number, so be careful when choosing it!
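The message printed earlier suggests picking a value with binwidth. Instead of saying how many bars you want, binwidth says how wide each bar should be. A minimal sketch, assuming a made-up data frame toy with a single column named value (not the workshop data):

```r
library(ggplot2)

# Made-up values standing in for a quantitative column
toy <- data.frame(value = c(1, 3, 4, 8, 12, 15, 15, 21, 27, 30))

# binwidth = 5 makes every bar cover a range of 5 units
p <- ggplot(data = toy, aes(x = value)) +
  geom_histogram(binwidth = 5, color = "black", fill = "orange") +
  theme_minimal()
p
```

Setting binwidth can be easier to explain to an audience than a bin count, because it is expressed in the units of the variable.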

We can draw the same distribution with a boxplot. To make a boxplot, we simply replace geom_histogram() with geom_boxplot():

ggplot( data = df, aes( x = gdp )) +
  # Draw a boxplot instead of a histogram
  geom_boxplot(color = "black", fill = "orange") +
  labs(title = "GDP Distribution",
       x = "GDP") + 
  theme_minimal()

## Warning: Removed 169 rows containing non-finite values (`stat_boxplot()`).

A single boxplot does not have a meaningful y-coordinate, so you can ignore the numbers on the y-axis for now. Suppose your boss wants to see the GDP distribution for each continent. One way to do this is to draw one boxplot per continent and place them side by side. To do so, we use the continent variable for the x-coordinate and the gdp variable for the y-coordinate.

# Making side-by-side boxplot
ggplot( data = df,
        aes(x = continent, y = gdp ))+
  geom_boxplot(color = "black" ) +
  labs(title = "GDP Distribution",
       x = "Continent", y = "GDP") + 
  theme_minimal()

## Warning: Removed 169 rows containing non-finite values (`stat_boxplot()`).

Let’s remove the NA boxplot. The NA boxplot contains all the rows with a missing value in the continent column. The is.na() function checks whether each value is NA. It outputs logical values (TRUE or FALSE), where TRUE means the value is NA. Let’s look at the continent column and see if it contains any NA values:

# Check the first ten values of the continent column
is.na(df$continent)[1:10] # TRUE means NA

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

!is.na(df$continent)[1:10] # TRUE means it's NOT NA

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

# Count number of NA values in continent
sum(is.na(df$continent))

## [1] 51

There are 51 NA values in the continent column! To remove these NA values from the boxplot, we need to change the dataset so that it excludes rows with a missing continent:

# Subset the dataset to rows where continent is not NA
df3 <- filter(df, !is.na(continent))
# Create a new boxplot
ggplot( data = df3,
        aes(x = continent, y = gdp ))+
  geom_boxplot(color = "black" ) +
  labs(title = "GDP Distribution",
       x = "Continent", y = "GDP") + 
  theme_minimal()

## Warning: Removed 169 rows containing non-finite values (`stat_boxplot()`).
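The warning still appears after removing the rows with a missing continent, most likely because it refers to rows where gdp itself is missing. To see how many NA values each column has at once, one option is to combine is.na() with base R's colSums(). A sketch on a made-up data frame toy (an assumption for illustration):

```r
toy <- data.frame(continent = c("Asia", NA, "Europe"),
                  gdp = c(NA, NA, 1200))

# is.na() on a data frame gives a TRUE/FALSE matrix;
# colSums() counts the TRUE values in each column
na_counts <- colSums(is.na(toy))
na_counts
```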

Perfect! Let’s make this visualization even better by adding a color for each boxplot. Since each boxplot represents a different continent, we can assign a different color to each by adding fill = continent to the aesthetic function. We put this inside the aesthetic function instead of the geometric layer because the color depends on a variable (column): here, the boxplot color changes with continent.

# Add different color for each continent
ggplot( data = df3,
        aes(x = continent, y = gdp, 
            fill = continent ))+
  geom_boxplot(color = "black" ) +
  # Change legend title
  labs(title = "GDP Distribution",
       x = "Continent", y = "GDP", fill = "Continent") + 
  theme_minimal()

## Warning: Removed 169 rows containing non-finite values (`stat_boxplot()`).

Distribution of Categorical Variable

For a categorical variable, we use a barplot to show its distribution. Making a barplot is simple! All you need to do is take what we wrote for the histogram and swap in the appropriate column and a different geometric layer, geom_bar(). The continent column is a categorical variable, so let’s look at its distribution with a barplot! Since we know that continent contains NA values, let’s use the dataset without the NA data:

# Continent distribution using barplot
ggplot( data = df3, aes( x = continent )) +
  # Add border line color and fill color of bars
  geom_bar( color = "black", fill = "orange" ) +
  # Add title and change axis names
  labs(title = "Continent Distribution",
       x = "Continent", y = "Count") + 
  # Change the theme to minimal theme
  theme_minimal()

For a barplot, we do not need to specify a bin number because each unique value of the variable gets its own bar! In this case, there are 6 unique continents, so there are 6 bars.

Suppose your boss also wants to see how much access to clean air each continent has. Let’s start by looking at the clean_air_access distribution. Since it’s a quantitative variable, we will use a histogram to plot the distribution:

ggplot( data = df, aes( x = clean_air_access )) +
  # Add border line color and fill color of histogram
  geom_histogram( color = "black", fill = "skyblue" ) +
  # Add title and change axis names
  labs(title = "Clean Air Access Distribution",
       x = "Clean Air Access", y = "Count") + 
  # Change the theme to minimal theme
  theme_minimal()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let’s label any row with clean_air_access <= 50 as low clean air access, and anything above 50 as high clean air access. To do so, we will add a new column using the mutate() and case_when() functions.

df4 <- mutate(df, air_status = case_when(
            clean_air_access <= 50 ~ "LOW",
            clean_air_access > 50 ~ "HIGH"
          ))
df4 <- filter(df4, !is.na(continent))
select(df4, country, air_status)

## # A tibble: 3,196 × 2
##    country     air_status
##    <chr>       <chr>     
##  1 Afghanistan LOW       
##  2 Afghanistan LOW       
##  3 Afghanistan LOW       
##  4 Afghanistan LOW       
##  5 Afghanistan LOW       
##  6 Afghanistan LOW       
##  7 Afghanistan LOW       
##  8 Afghanistan LOW       
##  9 Afghanistan LOW       
## 10 Afghanistan LOW       
## # … with 3,186 more rows
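It can help to check how case_when() treats the boundary value and missing values before trusting the new column. A small sketch on a made-up vector x (not the workshop data):

```r
library(dplyr)

x <- c(10, 50, 75, NA)

status <- case_when(
  x <= 50 ~ "LOW",   # 50 itself counts as LOW
  x > 50  ~ "HIGH"
)
status  # "LOW" "LOW" "HIGH" NA
```

A missing value matches neither condition, so it stays NA rather than being labeled LOW or HIGH.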

The mutate() function adds a new column, and the case_when() function decides whether each row is labeled HIGH or LOW. With this new dataset, we can create the barplot your boss was asking for!

ggplot(data = df4, 
       aes(x = continent, fill = air_status)) +
  geom_bar(position = "stack", color = "black") +
  labs(title = "Continent Distribution with Clean Air Status", x = "Continent", y = "Count",
       fill = "Air Status") + 
  theme_minimal()

The position = "stack" argument inside geom_bar() controls how the colors are arranged in each bar. You can also use position = "dodge", which places the bars side by side and looks like this:

# Position = dodge
ggplot(data = df4, 
       aes(x = continent, fill = air_status)) +
  geom_bar(position = "dodge", color = "black") +
  labs(title = "Continent Distribution with Clean Air Status", x = "Continent", y = "Count",
       fill = "Air Status") + 
  theme_minimal()
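A third option is position = "fill", which stretches every bar to the same height so that you compare proportions instead of counts. A minimal sketch with a made-up data frame toy whose column names imitate df4:

```r
library(ggplot2)

toy <- data.frame(
  continent  = c("Asia", "Asia", "Asia", "Europe", "Europe"),
  air_status = c("LOW", "LOW", "HIGH", "HIGH", "HIGH")
)

# position = "fill" shows the share of each air_status within a continent
p <- ggplot(data = toy, aes(x = continent, fill = air_status)) +
  geom_bar(position = "fill", color = "black") +
  labs(x = "Continent", y = "Proportion", fill = "Air Status") +
  theme_minimal()
p
```

This can be useful when the continents have very different numbers of rows, since the raw counts would otherwise dominate the picture.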

Visualizing Relationship between 2 Quantitative Variables

One of the simplest ways to visualize the relationship between 2 quantitative variables is to use a scatterplot. To make a scatterplot, we use the geom_point() function for the geometric layer. Let’s look at the relationship between clean_air_access and death_rate. We will put clean_air_access for the x-coordinate and death_rate for the y-coordinate.

# Add a scatterplot
ggplot( data = df, 
        aes( x = clean_air_access, y = death_rate )) +
  geom_point()

The points look a little big, so let’s make them smaller by setting the size in the geometric layer. The default point size is 1.5, so let’s make it 0.5:

ggplot( data = df, 
        aes( x = clean_air_access, y = death_rate )) +
  geom_point( size = 0.5 )

Now, let’s add colors based on continent, change the labels, and apply the minimal theme:

# Perfecting the scatterplot!
ggplot( data = df, 
        aes( x = clean_air_access, y = death_rate,
             color = continent )) +
  geom_point( size = 0.5 ) +
  labs(title = "Clean Air Access vs Death Rate",
       x = "Clean Air Access", y = "Death Rate",
       color = "Continent") + 
  theme_minimal()

What if we connect all the points with lines? We can do this by adding another geometric layer, geom_line():

# Connecting the points with lines
ggplot( data = df, 
        aes( x = clean_air_access, y = death_rate,
             color = continent )) +
  geom_point( size = 0.5 ) +
  # Add geometric layer for lines
  geom_line() + 
  labs(title = "Clean Air Access vs Death Rate",
       x = "Clean Air Access", y = "Death Rate",
       color = "Continent") + 
  theme_minimal()

Extra Funzie Visualization: Time Series

Let’s say your boss asks you how the clean_air_access changed over time. Let’s put year for the x-coordinate and clean_air_access for the y-coordinate.

ggplot( data = df, 
        aes( x = year, y = clean_air_access,
             color = continent )) +
  geom_point( size = 0.5 ) +
  # Add geometric layer for lines
  geom_line() + 
  labs(title = "Change in Clean Air Access Over Time",
       x = "Year", y = "Clean Air Access",
       color = "Continent") + 
  theme_minimal()

Definitely don’t show this to your boss. How can we make this look pretty? We need to change the y-coordinate variable to the yearly average of clean_air_access, because right now there are multiple data points for each year. First, we need to group the data by continent and year and compute the average clean_air_access for each group:

# Make a new dataset that contains the yearly average of each continent
df_t <- filter(group_by(df, continent, year), !is.na(continent)) 
df5 <- summarize(df_t, avg_clean_air = mean(clean_air_access))

## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.

# Plot the yearly averages, with one colored line per continent
ggplot( data = df5, 
        aes( x = year, y = avg_clean_air,
             color = continent)) +
  geom_point() +
  # Add geometric layer for lines
  geom_line() + 
  labs(title = "Change in Clean Air Access Over Time",
       x = "Year", y = "Clean Air Access",
       color = "Continent") + 
  theme_minimal()
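Two optional arguments are worth knowing when averaging by group: na.rm = TRUE tells mean() to skip missing values, and .groups = "drop" (mentioned in the summarise() message above) returns an ungrouped result. The sketch below uses a made-up tibble toy and the pipe operator; it illustrates these arguments and is not a replacement for the workshop code.

```r
library(dplyr)

toy <- tibble(continent = c("Asia", "Asia", "Asia", "Europe"),
              year = c(2000, 2000, 2001, 2000),
              clean_air_access = c(10, NA, 30, 80))

avg <- toy %>%
  group_by(continent, year) %>%
  # na.rm = TRUE ignores the NA; without it the Asia 2000 average would be NA
  summarize(avg_clean_air = mean(clean_air_access, na.rm = TRUE),
            .groups = "drop")
avg
```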