This is the first course in the HarvardX Professional Certificate in Data Science, a series of courses that prepare you to do data analysis in R, from simple computations to machine learning.
The textbook for the Data Science course series is freely available online.
Section 1: R Basics, Functions, Data types
You will get started with R, learn about its functions and data types.
Section 2: Vectors, Sorting
You will learn to operate on vectors and use advanced functions such as sorting.
Section 3: Indexing, Data Manipulation, Plots
You will learn to wrangle and visualize data.
Section 4: Programming Basics
You will learn to use general programming features such as ‘if-else’ and ‘for loop’ commands, and to write your own functions to perform various operations on datasets.
Section 1 introduces you to R basics, functions, and data types.
In Section 1, you will learn to:
The textbook for this section is available here
## [1] 210
## [1] 210
## [1] 325
## [1] 5050
## [1] 500500
n <- 1000
x <- seq(1, n)
sum(x)
Based on the result, what do you think the functions seq and sum do?
A. sum creates a list of numbers and seq adds them up.
B. seq creates a list of numbers and sum adds them up.
C. seq computes the difference between two arguments and sum computes the sum of 1 through 1000.
D. sum always returns the same number.
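As a minimal sketch of what the two functions in this question do, the block below uses a smaller, hypothetical n so the intermediate vector is easy to read. seq builds the vector of consecutive integers and sum adds its entries; the closed form n(n+1)/2 is included only as a cross-check.

```r
# Hypothetical small n chosen for illustration (the exercise uses 1000)
n <- 10
x <- seq(1, n)      # the integers 1, 2, ..., 10
total <- sum(x)     # adds up the entries of x
total

# Cross-check against the closed form n * (n + 1) / 2
total == n * (n + 1) / 2
```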
## [1] 4
## [1] 2
## [1] 1
A. log(10^x)
B. log10(x^10)
C. log(exp(x))
D. exp(log(x, base = 2))
Use the function str to examine the structure of the murders object. We can see that this object is a data frame with 51 rows and five columns. Which of the following best describes the variables represented in this data frame?
A. The 51 states.
B. The murder rates for all 50 states and DC.
C. The state name, the abbreviation of the state name, the state's region, and the state's population and total number of murders for 2010.
D. str shows no relevant information.
# Load package and data
library(dslabs)
data(murders)
# Use the function names to extract the variable names
names(murders)
## [1] "state" "abb" "region" "population" "total"
# To access the population variable from the murders dataset use this code:
p <- murders$population
# To determine the class of object `p` we use this code:
class(p)
## [1] "numeric"
# Use the accessor to extract state abbreviations and assign it to a
a <- murders$abb
# Determine the class of a
class(a)
## [1] "character"
# We extract the population like this:
p <- murders$population
# This is how we do the same with the square brackets:
o <- murders[["population"]]
# We can confirm these two are the same
identical(o, p)
## [1] TRUE
# Use square brackets to extract `abb` from `murders` and assign it to b
b <- murders[["abb"]]
# Check if `a` and `b` are identical
identical(a, b)
## [1] TRUE
class(murders$region)
With one line of code, use the function levels and length to determine the number of regions defined by this dataset.
## [1] "factor"
## [1] 4
## x
## a b c
## 2 3 1
##
## Northeast South North Central West
## 9 17 12 13
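The counting steps above can be sketched on a small toy vector. The vector x and the object names below are illustrative assumptions, not part of the murders dataset: table tallies how often each value occurs, and length(levels(...)) counts the distinct categories of a factor.

```r
# Hypothetical toy data, not taken from the murders dataset
x <- c("a", "a", "b", "b", "b", "c")

counts <- table(x)              # tally how often each value occurs
counts

f <- factor(x)                  # a factor stores categories as levels
n_levels <- length(levels(f))   # number of distinct categories
n_levels
```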
In Section 2.1, you will:
In Section 2.2, you will:
In Section 2.3, you will:
The textbook for this section is available here
# Here is an example creating a numeric vector named cost
cost <- c(50, 75, 90, 100, 150)
# Create a numeric vector to store the temperatures listed in the instructions into a vector named temp
# Make sure to follow the same order in the instructions
temp <- c(35, 88, 42, 84, 81, 30)
# here is an example of how to create a character vector
food <- c("pizza", "burgers", "salads", "cheese", "pasta")
# Create a character vector called city to store the city names
# Make sure to follow the same order as in the instructions
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
# Associate the cost values with its corresponding food item
cost <- c(50, 75, 90, 100, 150)
food <- c("pizza", "burgers", "salads", "cheese", "pasta")
names(cost) <- food
# You already wrote this code
temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
# Associate the temperature values with its corresponding city
names(temp) <- city
## salads cheese pasta
## 90 100 150
## Beijing Lagos Paris
## 35 88 42
## pizza pasta
## 50 150
# Define temp
temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
names(temp) <- city
# Access the temperatures of Paris and San Juan
temp[c(3,5)]
## Paris San Juan
## 42 81
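Because temp carries names, the same entries can also be selected by name rather than by position. The sketch below assumes the temp vector defined in this exercise; by_name and by_position are illustrative object names.

```r
temp <- c(35, 88, 42, 84, 81, 30)
names(temp) <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")

by_position <- temp[c(3, 5)]               # index by position
by_name <- temp[c("Paris", "San Juan")]    # index by name

# Both approaches return the same named vector
identical(by_name, by_position)
```

Indexing by name keeps working if the order of the cities changes, which positional indexing does not.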
# Create a vector m of integers that starts at 32 and ends at 99.
m <- 32:99
# Determine the length of object m.
length(m)
## [1] 68
# Create a vector x of integers that starts 12 and ends at 73.
x <- 12:73
# Determine the length of object x.
length(x)
## [1] 62
7. Create a vector containing all the positive odd numbers smaller than 100.
## [1] 7 14 21 28 35 42 49
# Create a vector containing all the positive odd numbers smaller than 100.
# The numbers should be in ascending order
seq(1,100,2)
## [1] 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45
## [24] 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91
## [47] 93 95 97 99
## [1] 7 14 21 28 35 42 49
# But note that the second argument does not need to be last number.
# It simply determines the maximum value permitted.
# so the following line of code produces the same vector as seq(7, 49, 7)
seq(7, 50, 7)
## [1] 7 14 21 28 35 42 49
# Create a sequence of numbers from 6 to 55, with 4/7 increments and determine its length
length(seq(6,55,4/7))
## [1] 86
# Store the sequence in the object a
a <- seq(1, 10, length.out = 100)
# Determine the class of a
class(a)
## [1] "numeric"
## [1] "integer"
# Check the class of 1, assigned to the object a
a <- 1
class(a)
# Confirm the class of 1L is integer
class(1L)
## [1] "integer"
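The class results in these exercises can be compared side by side. This is a small sketch using only base R; the object names cls_* are illustrative.

```r
# Whole-number endpoints with the default step give an integer vector
cls_seq_int <- class(seq(1, 10))

# Requesting 100 evenly spaced values forces fractional steps, so the result is numeric
cls_seq_num <- class(seq(1, 10, length.out = 100))

# A bare number is stored as a double; the L suffix requests an integer
cls_one <- class(1)
cls_one_L <- class(1L)

c(cls_seq_int, cls_seq_num, cls_one, cls_one_L)
```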
x <- c("1", "3", "5", "a")
and coerce it to get integers.
## [1] "1" "3" "5" "a"
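A minimal sketch of the coercion step, assuming as.integer is the intended coercion function: entries that look like integers are converted, and "a", which has no integer value, becomes NA. R also emits a coercion warning, which is silenced here only to keep the output clean.

```r
x <- c("1", "3", "5", "a")

# Convert what can be converted; "a" becomes NA
y <- suppressWarnings(as.integer(x))
y
class(y)
```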
For these exercises we will use the US murders dataset. Make sure you load it prior to starting.
# Access the `state` variable and store it in an object
states <- murders$state
# Sort the object alphabetically and redefine the object
states <- sort(states)
# Report the first alphabetical value
states[1]
## [1] "Alabama"
# Access population values from the dataset and store it in pop
pop <- murders$population
# Sort the object and save it in the same object
pop <- sort(pop)
# Report the smallest population size
pop[1]
## [1] 563626
# Access population from the dataset and store it in pop
pop <- murders$population
# Use the command order, to order pop and store in object o
o <- order(pop)
# Find the index number of the entry with the smallest population size
which.min(murders$population)
## [1] 51
## [1] 46
## [1] 51
# Define the variable i to be the index of the smallest state
i <- which.min(murders$population)
# Define variable states to hold the states
states <- murders$state
# Use the index you just defined to find the state with the smallest population
states[i]
## [1] "Wyoming"
temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature = temp)
Use the rank function to determine the population rank of each state from smallest population size to biggest. Save these ranks in an object called ranks, then create a data frame with the state name and its rank. Call the data frame my_df.
# Store temperatures in an object
temp <- c(35, 88, 42, 84, 81, 30)
# Store city names in an object
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
# Create data frame with city names and temperature
city_temps <- data.frame(name = city, temperature = temp)
# Define a variable states to be the state names
states <- murders$state
# Define a variable ranks to determine the population size ranks
ranks <- rank(murders$population)
# Create a data frame my_df with the state name and its rank
my_df <- data.frame(name=states, ranks)
# Define a variable states to be the state names from the murders data frame
states <- murders$state
# Define a variable ranks to determine the population size ranks
ranks <- rank(murders$population)
# Define a variable ind to store the indexes needed to order the population values
ind <- order(murders$population)
# Create a data frame my_df with the state name and its rank and ordered from least populous to most
my_df <- data.frame(states = states[ind], ranks = ranks[ind])
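The difference between sort, order, and rank used in these exercises is easier to see on a short vector. The values below are a hypothetical toy example, not taken from the murders dataset.

```r
# Hypothetical toy vector
x <- c(31, 4, 15, 92, 65)

sorted <- sort(x)     # the values in increasing order
idx <- order(x)       # the positions that would sort x
ranks <- rank(x)      # the rank of each entry, in the original order
smallest <- which.min(x)   # position of the smallest entry

sorted
idx
ranks

# Indexing x by order(x) reproduces sort(x)
identical(x[idx], sorted)
```

Here idx[1] equals which.min(x), which is why order and which.min report the same index for the smallest population.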
data("na_example")
str(na_example)
int [1:1000] 2 1 3 2 1 3 1 4 3 2 ...
However, when we compute the average with the function mean, we obtain an NA:
mean(na_example)
[1] NA
The is.na function returns a logical vector that tells us which entries are NA. Assign this logical vector to an object called ind and determine how many NAs na_example has.
## int [1:1000] 2 1 3 2 1 3 1 4 3 2 ...
## [1] NA
# Use is.na to create a logical index ind that tells which entries are NA
ind <- is.na(na_example)
# Determine how many NA ind has using the sum function
sum(ind)
## [1] 145
## [1] 1 3
## [1] NA
## [1] 2.301754
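The NA handling in this exercise can be sketched on a short vector. The vector x below is a hypothetical stand-in for na_example: mean returns NA as soon as any entry is missing, so the missing entries are dropped with a logical index (or, equivalently, with the na.rm argument).

```r
# Hypothetical toy vector standing in for na_example
x <- c(2, NA, 4, 1, NA)

ind <- is.na(x)            # TRUE where an entry is missing
n_missing <- sum(ind)      # TRUE counts as 1, so this counts the NAs

mean(x)                          # NA, because some entries are missing
mean_observed <- mean(x[!ind])   # average of the non-missing entries only
mean(x, na.rm = TRUE)            # same result using the built-in argument
```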
temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature = temp)
Remake the data frame using the code above, but add a line that converts the temperature from Fahrenheit to Celsius. The conversion is C = 5/9 × (F − 32).
# Assign city names to `city`
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
# Store temperature values in `temp`
temp <- c(35, 88, 42, 84, 81, 30)
# Convert temperature into Celsius and overwrite the original values of 'temp' with these Celsius values
temp <- 5/9*(temp-32)
# Create a data frame `city_temps`
city_temps <- data.frame(name=city,temperature=temp)
## [1] 1.634984
# Store the per 100,000 murder rate for each state in murder_rate
murder_rate <- murders$total/murders$population *100000
# Calculate the average murder rate in the US
mean(murder_rate)
## [1] 2.779125
Section 3 introduces you to the R commands and techniques that help you wrangle, analyze, and visualize data.
In Section 3.1, you will:
In Section 3.2, you will:
In Section 3.3, you will:
The textbook for this section is available here
Start by loading the library and data.
# Store the murder rate per 100,000 for each state, in `murder_rate`
murder_rate <- murders$total / murders$population * 100000
# Store the `murder_rate < 1` in `low`
low <- murder_rate < 1
# Store the murder rate per 100,000 for each state, in murder_rate
murder_rate <- murders$total/murders$population*100000
# Store the murder_rate < 1 in low
low <- murder_rate < 1
# Get the indices of entries that are below 1
which(low)
## [1] 12 13 16 20 24 30 35 38 42 45 46 51
# Store the murder rate per 100,000 for each state, in murder_rate
murder_rate <- murders$total/murders$population*100000
# Store the murder_rate < 1 in low
low <- murder_rate < 1
# Names of states with murder rates lower than 1
murders$state[low]
## [1] "Hawaii" "Idaho" "Iowa" "Maine"
## [5] "Minnesota" "New Hampshire" "North Dakota" "Oregon"
## [9] "South Dakota" "Utah" "Vermont" "Wyoming"
# Store the murder rate per 100,000 for each state, in `murder_rate`
murder_rate <- murders$total/murders$population*100000
# Store the `murder_rate < 1` in `low`
low <- murder_rate < 1
# Create a vector ind for states in the Northeast and with murder rates lower than 1
ind <- low & murders$region=='Northeast'
# Names of states in `ind`
murders$state[ind]
## [1] "Maine" "New Hampshire" "Vermont"
# Store the murder rate per 100,000 for each state, in murder_rate
murder_rate <- murders$total/murders$population*100000
# Compute average murder rate and store in avg using `mean`
avg <- mean(murder_rate)
# How many states have murder rates below avg ? Check using sum
sum(murder_rate<avg)
## [1] 27
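The logical-indexing pattern used throughout these exercises can be sketched on a few made-up values. The rate and region vectors below are hypothetical and only illustrate how comparisons, which, &, and sum combine.

```r
# Hypothetical toy values, not the real murder rates
rate <- c(0.5, 2.1, 0.8, 3.4)
region <- c("West", "South", "Northeast", "West")

low <- rate < 1                   # logical vector: TRUE where the rate is below 1
low_positions <- which(low)       # positions of the TRUE entries

ind <- low & region == "West"     # both conditions must hold
region[ind]

# Summing a logical vector counts its TRUE entries
n_below_avg <- sum(rate < mean(rate))
n_below_avg
```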
# Store the 3 abbreviations in abbs in a vector (remember that they are character vectors and need quotes)
abbs <- c('AK','MI','IA')
# Match the abbs to the murders$abb and store in ind
ind <- match(abbs , murders$abb)
# Print state names from ind
murders$state[ind]
## [1] "Alaska" "Michigan" "Iowa"
# Store the 5 abbreviations in `abbs`. (remember that they are character vectors)
abbs <- c('MA', 'ME', 'MI', 'MO', 'MU')
# Use the %in% command to check if the entries of abbs are abbreviations in the murders data frame
abbs %in% murders$abb
## [1] TRUE TRUE TRUE TRUE FALSE
# Store the 5 abbreviations in abbs. (remember that they are character vectors)
abbs <- c("MA", "ME", "MI", "MO", "MU")
# Use the `which` command and `!` operator to find out which abbreviations are not actually part of the dataset and store in ind
ind <- which(!abbs %in% murders$abb)
# What are the entries of abbs that are not actual abbreviations
abbs[ind]
## [1] "MU"
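The three lookup tools used above (match, %in%, and which combined with !) can be compared on a small lookup table. The vectors known and query are hypothetical and stand in for murders$abb and abbs.

```r
# Hypothetical lookup table and query
known <- c("AK", "IA", "MI", "MO")
query <- c("MI", "AK", "MU")

pos <- match(query, known)     # position of each query entry in known, NA if absent
found <- query %in% known      # TRUE/FALSE for each query entry
missing_codes <- query[which(!found)]   # the entries that were not found

pos
found
missing_codes
```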
Load the dplyr package and the murders dataset.
murders <- mutate(murders, population_in_millions = population / 10^6)
Note that we can write population rather than murders$population. The function mutate knows we are grabbing columns from murders.
Use the function mutate to add a murders column named rate with the per 100,000 murder rate. Make sure you redefine murders as done in the example code above. Remember the murder rate is defined as the total divided by the population size, times 100,000.
# Redefine murders so that it includes column named rate with the per 100,000 murder rates
murders <- mutate(murders, rate = total / population * 100000)
# Note that if you want ranks from highest to lowest you can take the negative and then compute the ranks
x <- c(88, 100, 83, 92, 94)
rank(-x)
## [1] 4 1 5 3 2
# Defining rate
rate <- murders$total/ murders$population * 100000
# Redefine murders to include a column named rank
# with the ranks of rate from highest to lowest
murders <- mutate(murders,rank=rank(-rate))
select(murders, state, population)
Use select to show the state names and abbreviations in murders. Just show it, do not define a new object.
filter(murders, state == "New York")
You can use other logical vectors to filter rows.
Use filter to show the top 5 states with the highest murder rates. After we add murder rate and rank, do not change the murders dataset, just show the result. Note that you can filter based on the rank column.
# Add the necessary columns
murders <- mutate(murders, rate = total/population * 100000, rank = rank(-rate))
# Filter to show the top 5 states with the highest murder rates
filter(murders, rank <= 5)
no_florida <- filter(murders, state != "Florida")
Create a new data frame called no_south that removes states from the South region. How many states are in this category? You can use the function nrow for this.
# Use filter to create a new data frame no_south
no_south <- filter(murders, region != 'South')
# Use nrow() to calculate the number of rows
nrow(no_south)
## [1] 34
filter(murders, state %in% c("New York", "Texas"))
Create a new data frame called murders_nw with only the states from the Northeast and the West. How many states are in this category?
# Create a new data frame called murders_nw with only the states from the northeast and the west
murders_nw <- filter(murders, region %in% c('Northeast','West'))
# Number of states (rows) in this category
nrow(murders_nw)
## [1] 22
filter(murders, population < 5000000 & region == "Northeast")
Add a murder rate column and a rank column as done before. Create a table, call it my_states, that satisfies both the conditions: it is in the Northeast or West and the murder rate is less than 1. Use select to show only the state name, the rate and the rank.
# add the rate column
murders <- mutate(murders, rate = total / population * 100000, rank = rank(-rate))
# Create a table, call it my_states, that satisfies both the conditions
my_states <- filter(murders, region %in% c('Northeast','West') & rate < 1)
# Use select to show only the state name, the murder rate and the rank
select(my_states,state,rate,rank)
library(dplyr)
murders <- mutate(murders, rate = total / population * 100000, rank = rank(-rate))
In the solution to the previous exercise we did the following:
Created a table:
my_states <- filter(murders, region %in% c("Northeast", "West") & rate < 1)
Used select to show only the state name, the murder rate, and the rank:
select(my_states, state, rate, rank)
The pipe %>% lets us perform both operations sequentially without having to define an intermediate variable my_states.
For example, we could have mutated and selected in the same line like this:
mutate(murders, rate = total / population * 100000, rank = rank(-rate)) %>% select(state, rate, rank)
Note that select no longer has a data frame as the first argument. The first argument is assumed to be the result of the operation conducted right before the %>%.
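The piping idea described above can be sketched on a tiny data frame. This assumes the dplyr package is installed; the data frame toy and its values are hypothetical and only mimic the columns of murders.

```r
library(dplyr)

# Hypothetical toy data frame mimicking the murders columns
toy <- data.frame(
  state = c("A", "B", "C"),
  total = c(10, 2, 30),
  population = c(1e6, 1e6, 2e6)
)

# Each step receives the result of the previous step as its first argument
result <- toy %>%
  mutate(rate = total / population * 100000, rank = rank(-rate)) %>%
  filter(rate < 1.2) %>%
  select(state, rate, rank)

result
```

No intermediate object is created between the steps; only the final table is stored in result.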
Repeat the previous exercise, but now instead of creating a new object, show the result and only include the state, rate, and rank columns. Use a pipe %>% to do this in just one line.
## Define the rate column
murders <- mutate(murders, rate = total / population * 100000, rank = rank(-rate))
# show the result and only include the state, rate, and rank columns, all in one line
filter(murders, region %in% c("Northeast", "West") & rate < 1) %>%
  select(state, rate, rank)
Now we will reset murders to the original table one gets when loading with data(murders). Use just one line to create a new data frame, called my_states, that has a murder rate column and a rank column, considers only states in the Northeast or West that have a murder rate lower than 1, and contains only the state, rate, and rank columns. The line should have four components separated by three %>%.
- The original dataset murders
- A call to mutate to add the murder rate and the rank.
- A call to filter to keep only the states from the Northeast or West and that have a murder rate below 1
- A call to select that keeps only the columns with the state name, the murder rate and the rank.
The line should look something like this: my_states <- murders %>% mutate something %>% filter something %>% select something. Make sure the columns in the final data frame are in the order: state, rate, rank.
library(dslabs)
data(murders)
population_in_millions <- murders$population/10^6
total_gun_murders <- murders$total
plot(population_in_millions, total_gun_murders)
Keep in mind that many states have populations below 5 million and are bunched up. We may gain further insights from making this plot in the log scale. Transform the variables using the log10 transformation and then plot them.
population_in_millions <- murders$population/10^6
total_gun_murders <- murders$total
plot(population_in_millions, total_gun_murders)
# Transform population using the log10 transformation and save to object log10_population
log10_population <- log10(murders$population)
# Transform total gun murders using log10 transformation and save to object log10_total_gun_murders
log10_total_gun_murders <- log10(total_gun_murders)
# Create a scatterplot with the log scale transformed population and murders
plot(log10_population,log10_total_gun_murders)
# Store the population in millions and save to population_in_millions
population_in_millions <- murders$population/10^6
# Create a histogram of this variable
hist(population_in_millions)
3. Generate boxplots of the state populations by region.
# Create a boxplot of state populations by region for the murders dataset
boxplot(population~region, data=murders)
Section 4 introduces you to general programming features such as ‘if-else’ and ‘for loop’ commands so that you can write your own functions to perform various operations on datasets.
In Section 4.1, you will:
In Section 4.2, you will:
In Section 4.3, you will:
In Section 4.4, you will:
The textbook for this section is available here
x <- c(1, 2, -3, 4)
if (all(x > 0)) {
  print("All Positives")
} else {
  print("Not all positives")
}
A. All Positives
B. Not All Positives
C. N/A
D. None of the above
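To see how the condition in this question is evaluated, the sketch below breaks it into pieces. all returns TRUE only if every entry of its argument is TRUE, while any returns TRUE if at least one entry is. The object msg is an illustrative name.

```r
x <- c(1, 2, -3, 4)

x > 0          # element-wise comparison: TRUE TRUE FALSE TRUE
all(x > 0)     # FALSE, because -3 is not positive
any(x > 0)     # TRUE, because at least one entry is positive

# The same branch logic as the if-else above, storing the message instead of printing it
msg <- if (all(x > 0)) "All Positives" else "Not all positives"
msg
```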
A. all(x)
B. any(x)
C. any(!x)
D. all(!x)
Write a line of code that assigns to the object new_names the state abbreviation when the state name is longer than 8 characters.
# Assign the state abbreviation when the state name is longer than 8 characters
new_names <- ifelse(nchar(murders$state)>8, murders$abb, murders$state)
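The same ifelse pattern can be checked on a few names without loading the dataset. The vectors state and abb below are a hypothetical three-row stand-in for murders$state and murders$abb.

```r
# Hypothetical stand-in for murders$state and murders$abb
state <- c("Texas", "Massachusetts", "Ohio")
abb <- c("TX", "MA", "OH")

# nchar counts characters; ifelse picks abb where the test is TRUE, state otherwise
new_names <- ifelse(nchar(state) > 8, abb, state)
new_names
```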
# Create function called `sum_n`
sum_n <- function(n){
sum(1:n)
}
# Use the function to determine the sum of integers from 1 to 5000
sum_n(5000)
## [1] 12502500
x <- 3
my_func <- function(y){
x <- 5
y+5
}
## [1] 3
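The result above follows from R's scoping rules: the assignment inside the function creates a local x and leaves the global x untouched. A minimal sketch that runs the function and inspects both values (result is an illustrative name):

```r
x <- 3
my_func <- function(y) {
  x <- 5      # local x, visible only inside the function
  y + 5
}

result <- my_func(x)   # the argument y receives the global x, which is 3
result
x                      # the global x is unchanged
```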
# Here is an example of function that adds numbers from 1 to n
example_func <- function(n){
x <- 1:n
sum(x)
}
# Here is the sum of the first 100 numbers
example_func(100)
## [1] 5050
# Write a function compute_s_n that takes an argument n and returns the sum 1 + 2^2 + ... + n^2
compute_s_n <- function(n){
x <- 1:n
sum(x^2)
}
# Report the value of the sum when n=10
compute_s_n(10)
## [1] 385
# Define a function and store it in `compute_s_n`
compute_s_n <- function(n){
x <- 1:n
sum(x^2)
}
# Create a vector for storing results
s_n <- vector("numeric", 25)
# write a for-loop to store the results in s_n
for(i in 1:25){
s_n[i] <- compute_s_n(i)
}
# Define the function
compute_s_n <- function(n){
x <- 1:n
sum(x^2)
}
# Define the vector of n
n <- 1:25
# Define the vector to store data
s_n <- vector("numeric", 25)
for(i in n){
s_n[i] <- compute_s_n(i)
}
# Create the plot
plot(n,s_n)
# Define the function
compute_s_n <- function(n){
x <- 1:n
sum(x^2)
}
# Define the vector of n
n <- 1:25
# Define the vector to store data
s_n <- vector("numeric", 25)
for(i in n){
s_n[i] <- compute_s_n(i)
}
# Check that s_n is identical to the formula given in the instructions.
identical(s_n,n*(n+1)*(2*n+1)/6)
## [1] TRUE
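As an alternative to the explicit for-loop, the same vector can be built with sapply, which applies a function to each element of a vector. This is a sketch of an equivalent approach, not the method the exercise asks for; all.equal is used for the comparison because it tolerates floating-point rounding.

```r
compute_s_n <- function(n) {
  x <- 1:n
  sum(x^2)
}

n <- 1:25

# Apply compute_s_n to every element of n, with no pre-allocated vector or loop
s_n <- sapply(n, compute_s_n)

# Compare with the closed-form expression n(n + 1)(2n + 1) / 6
all.equal(s_n, n * (n + 1) * (2 * n + 1) / 6)
```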