Data Science: R Basics

  • Course Instructor: Rafael Irizarry

Abstract

This is the first course in the HarvardX Professional Certificate in Data Science, a series of courses that prepare you to do data analysis in R, from simple computations to machine learning.

The textbook for the Data Science course series is freely available online.

Learning Objectives

  • Learn to read, extract, and create datasets in R
  • Learn to perform a variety of operations on datasets using R
  • Learn to write your own functions/sub-routines in R

Course Overview

Section 1: R Basics, Functions, Data types
You will get started with R and learn about its functions and data types.

Section 2: Vectors, Sorting
You will learn to operate on vectors and use advanced functions such as sorting.

Section 3: Indexing, Data Manipulation, Plots
You will learn to wrangle and visualize data.

Section 4: Programming Basics
You will learn to use general programming features like ‘if-else’ and ‘for loop’ commands, and to write your own functions to perform various operations on datasets.

Section 1 Overview

Section 1 introduces you to R basics, functions, and data types.
In Section 1, you will learn to:

  • Appreciate the rationale for data analysis using R
  • Define objects and perform basic arithmetic and logical operations
  • Use pre-defined functions to perform operations on objects
  • Distinguish between various data types

The textbook for this section is available here

Assessment 1

  1. What is the sum of the first n positive integers? The formula for the sum of integers 1 through n is n(n+1)/2. Define n = 100 and then use R to compute the sum of 1 through 100 using the formula. What is the sum?
## [1] 5050
  2. Now use the same formula to compute the sum of the integers from 1 through 1,000.
## [1] 500500
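
The two results above can be reproduced with a short sketch like the following. The object names sum_100 and sum_1000 are illustrative only and are not part of the course material.

```r
# Sum of the integers 1 through n using the formula n(n+1)/2
n <- 100
sum_100 <- n * (n + 1) / 2
sum_100      # 5050

# Redefine n and reuse the same formula
n <- 1000
sum_1000 <- n * (n + 1) / 2
sum_1000     # 500500
```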
  3. Look at the result of typing the following code into R:

n <- 1000
x <- seq(1, n)
sum(x)

Based on the result, what do you think the functions seq and sum do?
A. sum creates a list of numbers and seq adds them up.

B. seq creates a list of numbers and sum adds them up.

C. seq computes the difference between two arguments and sum computes the sum of 1 through 1000.

D. sum always returns the same number.

  4. In math and programming, we say that we evaluate a function when we replace the argument with a given number. So if we type sqrt(4), we evaluate the sqrt function. In R, you can evaluate a function inside another function. The evaluations happen from the inside out. Use one line of code to compute the log, in base 10, of the square root of 100.
## [1] 1
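
A minimal sketch of inside-out evaluation: the inner call runs first and its result becomes the argument of the outer call. The names inner and nested are illustrative only.

```r
# Step by step: evaluate the inner function first
inner <- sqrt(100)          # 10
log10(inner)                # 1

# The same computation in one line, evaluated from the inside out
nested <- log10(sqrt(100))
nested                      # 1
```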
  5. Which of the following will always return the numeric value stored in x? You can try out examples and use the help system if you want.

A. log(10^x)

B. log10(x^10)

C. log(exp(x))

D. exp(log(x, base = 2))

Assessment 2

  1. Load the US murders dataset.

Use the function str to examine the structure of the murders object. We can see that this object is a data frame with 51 rows and five columns. Which of the following best describes the variables represented in this data frame?
A. The 51 states.

B. The murder rates for all 50 states and DC.

C. The state name, the abbreviation of the state name, the state's region, and the state's population and total number of murders for 2010.

D. str shows no relevant information.

  2. What are the column names used by the data frame for these five variables?
## [1] "state"      "abb"        "region"     "population" "total"
  3. Use the accessor $ to extract the state abbreviations and assign them to the object a. What is the class of this object?
## [1] "character"
  4. Now use the square brackets to extract the state abbreviations and assign them to the object b. Use the identical function to determine if a and b are the same.
## [1] TRUE
  5. We saw that the region column stores a factor. You can corroborate this by typing:

class(murders$region)

With one line of code, use the function levels and length to determine the number of regions defined by this dataset.

## [1] "factor"
## [1] 4
  6. The function table takes a vector and returns the frequency of each element. You can quickly see how many states are in each region by applying this function. Use this function in one line of code to create a table of states per region.
##     Northeast         South North Central          West 
##             9            17            12            13
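
The last two exercises can be sketched without the dslabs package by using state.region, a factor that ships with base R. This is a stand-in and not the murders data: it covers the 50 states only and omits DC, so its counts are not expected to match the table above exactly. The names n_regions and region_counts are illustrative.

```r
# state.region is a built-in factor with the region of each of the 50 states
class(state.region)

# Number of regions: count the factor levels in one line
n_regions <- length(levels(state.region))
n_regions

# States per region: table() counts how often each level occurs
region_counts <- table(state.region)
region_counts
```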

Section 2 Overview

In Section 2.1, you will:

  • Create numeric and character vectors.
  • Name the elements of a vector.
  • Generate numeric sequences.
  • Access specific elements or parts of a vector.
  • Coerce data into different data types as needed.

In Section 2.2, you will:

  • Sort vectors in ascending and descending order.
  • Extract the indices of the sorted elements from the original vector.
  • Find the maximum and minimum elements, as well as their indices, in a vector.
  • Rank the elements of a vector in increasing order.

In Section 2.3, you will:

  • Perform arithmetic between a vector and a single number.
  • Perform arithmetic between two vectors of the same length.

The textbook for this section is available here

Assessment 3

  1. Use the function c to create a vector with the average high temperatures in January for Beijing, Lagos, Paris, Rio de Janeiro, San Juan and Toronto, which are 35, 88, 42, 84, 81, and 30 degrees Fahrenheit. Call the object temp.
  2. Now create a vector with the city names and call the object city.
  3. Use the names function and the objects defined in the previous exercises to associate the temperature data with its corresponding city.
  4. Use the [ and : operators to access the temperature of the first three cities on the list.
## Beijing   Lagos   Paris 
##      35      88      42
  5. Use the [ operator to access the temperature of Paris and San Juan.
##    Paris San Juan 
##       42       81
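
The first five exercises can be put together in one short sketch. The temperatures and city names come from the exercise text; the names first_three and paris_san_juan are illustrative only.

```r
# Average January high temperatures (degrees Fahrenheit) and their cities
temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")

# Associate each temperature with its city
names(temp) <- city

# First three cities, using [ and :
first_three <- temp[1:3]
first_three

# Paris and San Juan, selected by name
paris_san_juan <- temp[c("Paris", "San Juan")]
paris_san_juan
```

Selecting by position, as in temp[c(3, 5)], would return the same two elements.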
  6. Use the : operator to create a sequence of numbers 12, 13, 14, ..., 73.
## [1] 62

  7. Create a vector containing all the positive odd numbers smaller than 100.

##  [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45
## [24] 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91
## [47] 93 95 97 99
  8. Create a vector of numbers that starts at 6, does not pass 55, and adds numbers in increments of 4/7: 6, 6+4/7, 6+8/7, etc. How many numbers does the list have? Hint: use seq and length.
## [1] 86
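
A minimal sketch of the three sequence exercises above. The object names a, odds, and steps are illustrative only.

```r
# Consecutive integers with the : operator
a <- 12:73
length(a)        # 62

# Positive odd numbers smaller than 100: start at 1, step by 2
odds <- seq(1, 99, 2)
length(odds)     # 50

# Start at 6, step by 4/7, and never pass 55
steps <- seq(6, 55, 4/7)
length(steps)    # 86
```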
  9. What is the class of the following object: a <- seq(1, 10, length.out = 100)?
## [1] "numeric"
  10. What is the class of the following object: a <- seq(1, 10)?
## [1] "integer"
  11. The result of class(a <- 1) is numeric, not integer. R defaults to numeric; to force an integer, you need to add the letter L. Confirm that the class of 1L is integer.
## [1] "integer"
  12. Define the following vector:

x <- c("1", "3", "5", "a")

and coerce it to get integers.

## [1] "1" "3" "5" "a"
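
The class and coercion exercises can be sketched as follows. Wrapping the coercion in suppressWarnings is an addition made here so the sketch runs quietly; without it, R warns that NAs were introduced by coercion. The name x_int is illustrative.

```r
# Numeric versus integer
class(1)                              # "numeric"
class(1L)                             # "integer"
class(seq(1, 10))                     # "integer"
class(seq(1, 10, length.out = 100))   # "numeric"

# Coercing a character vector to integers: "a" has no integer value and becomes NA
x <- c("1", "3", "5", "a")
x_int <- suppressWarnings(as.integer(x))
x_int
```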

Assessment 4

For these exercises we will use the US murders dataset. Make sure you load it prior to starting.

  1. Use the $ operator to access the population size data and store it as the object pop. Then use the sort function to redefine pop so that it is sorted. Finally, use the [ operator to report the smallest population size.
## [1] 563626
  2. Now instead of the smallest population size, find the index of the entry with the smallest population size. Hint: use order instead of sort.
## [1] 51
  3. We can actually perform the same operation as in the previous exercise using the function which.min. Write one line of code that does this.
## [1] 51
  4. Now we know how small the smallest state is and we know which row represents it. Which state is it? Define a variable states to be the state names from the murders data frame. Report the name of the state with the smallest population.
## [1] "Wyoming"
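
The difference between sort, order, and which.min can be seen on a small example. The data below are made up for illustration (three fictional states) and are not taken from the murders dataset.

```r
# Made-up toy data (NOT the murders dataset)
states <- c("Aland", "Borduria", "Cascadia")
pop <- c(500, 120, 900)

# sort returns the values in increasing order
sorted_pop <- sort(pop)
sorted_pop[1]             # 120, the smallest value

# order returns the indices that would sort the vector
ord <- order(pop)
ord[1]                    # 2, the position of the smallest value

# which.min returns that position directly
i_min <- which.min(pop)

# Use the index to look up the matching name
states[i_min]             # "Borduria"
```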
  5. You can create a data frame using the data.frame function. Here is a quick example:

temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature = temp)

Use the rank function to determine the population rank of each state from smallest population size to biggest. Save these ranks in an object called ranks, then create a data frame with the state name and its rank. Call the data frame my_df.

  6. Repeat the previous exercise, but this time order my_df so that the states are ordered from least populous to most populous. Hint: create an object ind that stores the indexes needed to order the population values. Then use the bracket operator [ to re-order each column in the data frame.
  7. The na_example vector represents a series of counts. You can quickly examine the object using:

data("na_example")
str(na_example)
 int [1:1000] 2 1 3 2 1 3 1 4 3 2 ...

However, when we compute the average with the function mean, we obtain an NA:

mean(na_example)
[1] NA

The is.na function returns a logical vector that tells us which entries are NA. Assign this logical vector to an object called ind and determine how many NAs na_example has.

##  int [1:1000] 2 1 3 2 1 3 1 4 3 2 ...
## [1] NA
## [1] 145
  8. Now compute the average again, but only for the entries that are not NA. Hint: remember the ! operator.
## [1] NA
## [1] 2.301754
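
The NA handling in the last two exercises can be sketched on a small vector. The counts below are made up for illustration and are not the na_example data; the names n_missing and mean_observed are illustrative.

```r
# Made-up toy counts with two missing values (NOT the na_example data)
counts <- c(2, NA, 3, 1, NA)

# Logical vector marking the NA entries; TRUE counts as 1 when summed
ind <- is.na(counts)
n_missing <- sum(ind)
n_missing                  # 2

# The mean of a vector containing NA is itself NA
mean(counts)

# Negate the index with ! to keep only the observed entries
mean_observed <- mean(counts[!ind])
mean_observed              # 2
```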

Section 3 Overview

Section 3 introduces you to the R commands and techniques that help you wrangle, analyze, and visualize data.

In Section 3.1, you will:

  • Subset a vector based on properties of another vector.
  • Use multiple logical operators to index vectors.
  • Extract the indices of vector elements satisfying one or more logical conditions.
  • Extract the indices of vector elements matching with another vector.
  • Determine which elements in one vector are present in another vector.

In Section 3.2, you will:

  • Wrangle data tables using the functions in the ‘dplyr’ package.
  • Modify a data table by adding or changing columns.
  • Subset rows in a data table.
  • Subset columns in a data table.
  • Perform a series of operations using the pipe operator.
  • Create data frames.

In Section 3.3, you will:

  • Plot data in scatter plots, box plots and histograms.

The textbook for this section is available here

Assessment 6

Start by loading the library and data.

  1. Compute the per 100,000 murder rate for each state and store it in an object called murder_rate. Then use logical operators to create a logical vector named low that tells us which entries of murder_rate are lower than 1.
  2. Now use the results from the previous exercise and the function which to determine the indices of murder_rate associated with values lower than 1.
##  [1] 12 13 16 20 24 30 35 38 42 45 46 51
  3. Use the results from the previous exercise to report the names of the states with murder rates lower than 1.
##  [1] "Hawaii"        "Idaho"         "Iowa"          "Maine"        
##  [5] "Minnesota"     "New Hampshire" "North Dakota"  "Oregon"       
##  [9] "South Dakota"  "Utah"          "Vermont"       "Wyoming"
  4. Now extend the code from exercises 2 and 3 to report the states in the Northeast with murder rates lower than 1. Hint: use the previously defined logical vector low and the logical operator &.
## [1] "Maine"         "New Hampshire" "Vermont"
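
The logical indexing used in the first four exercises can be sketched on a small table. The data frame toy and all of its values are made up for illustration; they are not the murders dataset.

```r
# Made-up toy table (NOT the murders dataset)
toy <- data.frame(
  state = c("Aland", "Borduria", "Cascadia", "Dorne"),
  region = c("Northeast", "West", "Northeast", "South"),
  population = c(1000000, 2000000, 500000, 4000000),
  total = c(5, 40, 3, 20),
  stringsAsFactors = FALSE
)

# Rate per 100,000 people
murder_rate <- toy$total / toy$population * 100000

# Logical vector: TRUE where the rate is below 1
low <- murder_rate < 1

# Indices of the TRUE entries
low_index <- which(low)
low_index

# Names of the states with a low rate
toy$state[low]

# Combine two conditions with &
low_northeast <- toy$state[low & toy$region == "Northeast"]
low_northeast
```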
  5. In a previous exercise we computed the murder rate for each state and the average of these numbers. How many states are below the average?
## [1] 27
  6. Use the match function to identify the states with abbreviations AK, MI, and IA. Hint: start by defining an index of the entries of murders$abb that match the three abbreviations, then use the [ operator to extract the states.
## [1] "Alaska"   "Michigan" "Iowa"
  7. Use the %in% operator to create a logical vector that answers the question: which of the following are actual abbreviations: MA, ME, MI, MO, MU?
## [1]  TRUE  TRUE  TRUE  TRUE FALSE
  8. Extend the code you used in exercise 7 to report the one entry that is not an actual abbreviation. Hint: use the ! operator, which turns FALSE into TRUE and vice versa, then which to obtain an index.
## [1] "MU"
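
The last three exercises can be sketched with state.abb and state.name, two vectors that ship with base R. They are used here as a stand-in for murders$abb and murders$state and cover the 50 states without DC. The names matched_states, is_abb, and not_abb are illustrative.

```r
# match returns the position of each abbreviation within state.abb
ind <- match(c("AK", "MI", "IA"), state.abb)
matched_states <- state.name[ind]
matched_states      # "Alaska" "Michigan" "Iowa"

# %in% returns one TRUE or FALSE per candidate
abbs <- c("MA", "ME", "MI", "MO", "MU")
is_abb <- abbs %in% state.abb
is_abb              # TRUE TRUE TRUE TRUE FALSE

# Negate with ! and use which to find the entry that is not an abbreviation
not_abb <- abbs[which(!is_abb)]
not_abb             # "MU"
```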

Assessment 7

Load the dplyr package and the murders dataset.

  1. You can add columns using the dplyr function mutate. This function is aware of the column names and inside the function you can call them unquoted. Like this:

murders <- mutate(murders, population_in_millions = population / 10^6)

Note that we can write population rather than murders$population. The function mutate knows we are grabbing columns from murders.
Use the function mutate to add a murders column named rate with the per 100,000 murder rate. Make sure you redefine murders as done in the example code above. Remember the murder rate is defined as the total divided by the population size times 100,000.

  2. mutate
    Note that if rank(x) gives you the ranks of x from lowest to highest, rank(-x) gives you the ranks from highest to lowest. Use the function mutate to add a column rank containing the rank, from highest to lowest murder rate. Make sure you redefine murders.
  3. select
    With dplyr we can use select to show only certain columns. For example with this code we would only show the states and population sizes:

select(murders, state, population)

Use select to show the state names and abbreviations in murders. Just show it, do not define a new object.

  4. filter
    The dplyr function filter is used to choose specific rows of the data frame to keep. Unlike select, which is for columns, filter is for rows. For example, you can show just the New York row like this:

filter(murders, state == "New York")

You can use other logical vectors to filter rows.
Use filter to show the top 5 states with the highest murder rates. After adding the murder rate and rank, do not change the murders dataset; just show the result. Note that you can filter based on the rank column.

  5. filter with !=
    We can remove rows using the != operator. For example to remove Florida we would do this:

no_florida <- filter(murders, state != "Florida")

Create a new data frame called no_south that removes states from the South region. How many states are in this category? You can use the function nrow for this.

## [1] 34
  6. filter with %in%
    We can also use %in% to filter with dplyr. For example, you can see the data from New York and Texas like this:

filter(murders, state %in% c("New York", "Texas"))

Create a new data frame called murders_nw with only the states from the Northeast and the West. How many states are in this category?

## [1] 22
  7. filtering by two conditions
    Suppose you want to live in the Northeast or West and want the murder rate to be less than 1. We want to see the data for the states satisfying these options. Note that you can use logical operators with filter:

filter(murders, population < 5000000 & region == "Northeast")

Add a murder rate column and a rank column as done before. Create a table, call it my_states, that satisfies both the conditions: it is in the Northeast or West and the murder rate is less than 1. Use select to show only the state name, the rate and the rank.

  8. Using the pipe %>%
    The pipe %>% can be used to perform operations sequentially without having to define intermediate objects. Suppose murders has been redefined to include rate and rank:

library(dplyr)
murders <- mutate(murders, rate = total / population * 100000, rank = rank(-rate))

In the solution to the previous exercise, we did the following:

Created a table

my_states <- filter(murders, region %in% c("Northeast", "West") & rate < 1)

Used select to show only the state name, the murder rate and the rank

select(my_states, state, rate, rank)

The pipe %>% permits us to perform both operations sequentially and without having to define an intermediate variable my_states.

For example we could have mutated and selected in the same line like this:

mutate(murders, rate = total / population * 100000, rank = rank(-rate)) %>% select(state, rate, rank)

Note that select no longer has a data frame as the first argument. The first argument is assumed to be the result of the operation conducted right before the %>%.

Repeat the previous exercise, but now instead of creating a new object, show the result and only include the state, rate, and rank columns. Use a pipe %>% to do this in just one line.

  9. mutate, filter and select

Now we will make murders the original table one gets when loading using data(murders). Use just one line to create a new data frame, called my_states, that has murder rate and rank columns, considers only states in the Northeast or West that have a murder rate lower than 1, and contains only the state, rate, and rank columns. The line should have four components separated by three %>%:
- The original dataset murders
- A call to mutate to add the murder rate and the rank.
- A call to filter to keep only the states from the Northeast or West and that have a murder rate below 1
- A call to select that keeps only the columns with the state name, the murder rate and the rank.

The line should look something like this: my_states <- murders %>% mutate something %>% filter something %>% select something. Make sure the columns in the final data frame are in the order: state, rate, rank.
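
The four-component pipeline can be sketched on a small table. This assumes the dplyr package is installed. The data frame toy and all of its values are made up for illustration and are not the murders dataset, and the pipeline is split across lines here only for readability.

```r
library(dplyr)

# Made-up toy table (NOT the murders dataset)
toy <- data.frame(
  state = c("Aland", "Borduria", "Cascadia", "Dorne"),
  region = c("Northeast", "West", "Northeast", "South"),
  population = c(1000000, 2000000, 500000, 4000000),
  total = c(5, 40, 3, 60),
  stringsAsFactors = FALSE
)

# Dataset, then mutate, then filter, then select, joined by three pipes
my_states <- toy %>%
  mutate(rate = total / population * 100000, rank = rank(-rate)) %>%
  filter(region %in% c("Northeast", "West") & rate < 1) %>%
  select(state, rate, rank)

my_states
```

Each step receives the result of the previous one as its first argument, so no intermediate object is needed.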

Assessment 8

  1. We made a plot of total murders versus population and noted a strong relationship. Not surprisingly, states with larger populations had more murders.

library(dslabs)
data(murders)
population_in_millions <- murders$population/10^6
total_gun_murders <- murders$total
plot(population_in_millions, total_gun_murders)

Keep in mind that many states have populations below 5 million and are bunched up. We may gain further insights from making this plot in the log scale. Transform the variables using the log10 transformation and then plot them.

  2. Create a histogram of the state populations.

  3. Generate boxplots of the state populations by region.
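
The three plots can be sketched with base R graphics on a small example. All values below are made up for illustration and are not the murders dataset. The calls to pdf(NULL) and dev.off() are additions that let the sketch run without opening a plot window; in an interactive session they can be dropped.

```r
# Made-up toy data (NOT the murders dataset)
population <- c(500000, 1000000, 2000000, 4000000, 8000000, 16000000)
total <- c(3, 5, 40, 60, 300, 900)
region <- factor(c("West", "West", "South", "South", "Northeast", "Northeast"))

pdf(NULL)  # null graphics device: nothing is displayed or written to disk

# Scatter plot on the log10 scale
log_pop <- log10(population)
log_total <- log10(total)
plot(log_pop, log_total)

# Histogram of the populations
hist(population)

# One box per region, using the formula interface
boxplot(population ~ region)

dev.off()
```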

Section 4 Overview

Section 4 introduces you to general programming features like ‘if-else’ and ‘for loop’ commands so that you can write your own functions to perform various operations on datasets.

In Section 4.1, you will:

  • Understand some of the programming capabilities of R.

In Section 4.2, you will:

  • Use basic conditional expressions to perform different operations.
  • Check if any or all elements of a logical vector are TRUE.

In Section 4.3, you will:

  • Define and call functions to perform various operations.
  • Pass arguments to functions, and return variables/objects from functions.

In Section 4.4, you will:

  • Use ‘for’ loops to perform repeated operations.
  • Identify built-in R functions that you could try for yourself.

The textbook for this section is available here

Assessment 9

  1. What will this conditional expression return?

x <- c(1, 2, -3, 4)

if (all(x > 0)) {
  print("All Positives")
} else {
  print("Not all positives")
}

A. All Positives

B. Not all positives

C. N/A

D. None of the above

  2. Which of the following expressions is always FALSE when at least one entry of a logical vector x is TRUE?

A. all(x)

B. any(x)

C. any(!x)

D. all(!x)

  3. The function nchar tells you how many characters long a character vector is.

Write a line of code that assigns to the object new_names the state abbreviation when the state name is longer than 8 characters.
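
One possible approach is a vectorized conditional with ifelse. The sketch below uses state.name and state.abb, which ship with base R, as a stand-in for murders$state and murders$abb (50 states, no DC). Keeping the full name when it is short is an assumption made here, since the exercise does not say what to store in that case.

```r
# For each state: use the abbreviation if the name has more than 8 characters,
# otherwise keep the full name (assumed behavior for short names)
new_names <- ifelse(nchar(state.name) > 8, state.abb, state.name)
head(new_names)
```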

  4. Create a function sum_n that for any given value, say n, computes the sum of the integers from 1 to n (inclusive). Use the function to determine the sum of integers from 1 to 5,000.
## [1] 12502500
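
A minimal sketch of one way to write sum_n. The body shown is an assumption; any implementation that adds the integers 1 through n would do.

```r
# Sum of the integers from 1 to n (inclusive)
sum_n <- function(n) {
  x <- 1:n
  sum(x)
}

sum_n(5000)   # 12502500
```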
  5. Create a function altman_plot that takes two arguments, x and y, and plots the difference against the sum.
  6. After running the code below, what is the value of x?

x <- 3
my_func <- function(y){
x <- 5
y+5
}

## [1] 3
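
The result can be checked with a sketch that also calls the function. The call my_func(1) and the name result are additions for illustration; the exercise code only defines the function.

```r
x <- 3

my_func <- function(y) {
  x <- 5    # creates a local x that exists only inside the function
  y + 5
}

result <- my_func(1)
result      # 6
x           # still 3: the assignment inside the function did not touch the global x
```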
  7. Write a function compute_s_n that for any given n computes the sum S_n = 1^2 + 2^2 + 3^2 + ... + n^2. Report the value of the sum when n = 10.
## [1] 385
  8. Define an empty numerical vector s_n of size 25 using s_n <- vector("numeric", 25) and store in it the results of S_1, S_2, ..., S_25 using a for-loop.
  9. If we do the math, we can show that S_n = 1^2 + 2^2 + 3^2 + ... + n^2 = n(n+1)(2n+1)/6. We have already computed the values of S_n from 1 to 25 using a for loop. If the formula is correct, then a plot of S_n versus n should look cubic.

  10. Confirm that s_n and n(n+1)(2n+1)/6 are the same using the identical command.
## [1] TRUE
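
The last four exercises can be put together in one sketch. The body of compute_s_n is one possible implementation, and the names m and same are illustrative only.

```r
# Sum of squares 1^2 + 2^2 + ... + n^2
compute_s_n <- function(n) {
  x <- 1:n
  sum(x^2)
}

compute_s_n(10)   # 385

# Fill a pre-allocated vector with S_1, ..., S_25 using a for loop
m <- 25
s_n <- vector("numeric", m)
for (i in 1:m) {
  s_n[i] <- compute_s_n(i)
}

# Compare the loop results with the closed-form formula
n <- 1:m
same <- identical(s_n, n * (n + 1) * (2 * n + 1) / 6)
same              # TRUE
```

Calling plot(n, s_n) after the loop draws the curve described in exercise 9.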