PH125.1x: Data Science: R Basics
- School: EDX, HarvardX
- Course Instructor: Rafael Irizarry
Abstract
In this first course of nine sin the HarvardX Data Science Professional Certificate, we learn the basic building blocks of R.
The demand for skilled data science practitioners in industry, academia, and government is rapidly growing. The Harvard Data Science Series prepares you with the necessary knowledge base and skills to tackle real world data analysis challenges. We cover concepts such as probability, inference, regression and machine learning and develop skill sets such as R programming, data wrangling with dplyr, data visualization with ggplot2, file organization with unix, version control with GitHub, and reproducible document preparation with RStudio. Throughout the series, we use motivating case studies, we ask specific questions and answer these through data analysis. Our assessments use code checking technology that will permit you to get hands-on practice during the courses.
Throughout the series, we will be using the R software environment for all our analysis. You will learn R, statistical concepts, and data analysis techniques simultaneously. In this course, we will introduce basic R syntax to get you going. However, rather than cover every R skill you need, we introduce just enough to get you going with the next courses in this series, which will provide more in depth coverage, building upon what you learn here. We believe that you can better retain R knowledge when you learn it to solve a specific problem.
Using a motivating case study, we ask specific questions related to crime in the United States and provide a relevant dataset. You will learn some basic R skills to permit us to answer these questions.
Learning Objective:
- how to read, extract, and create datasets in R
- how to perform a variety of operations and analyses on datasets using R
- how to write your own functions/sub-routines in R
Course Outline:
Section 1: R Basics, Functions, and Data Types You will get started with R and learn about R’s functions and data types.
Section 2: Vectors and Sorting You will learn to operate on vectors and advanced functions such as sorting.
Section 3: Indexing, Data Manipulation, and Plots You will learn to wrangle, analyze and visualize data.
Section 4: Programming Basics You will learn to use general programming features like ‘if-else’, and ‘for loop’ commands to write your own functions to perform various operations on datasets.
Section 1: R Basics, Functions, and Data Types
Section 1 introduces you to R Basics, Functions and Datatypes.
In Section 1, you will learn to:
- Appreciate the rationale for data analysis using R
- Define objects and perform basic arithmetic and logical operations
- Use pre-defined functions to perform operations on objects
- Distinguish between various data types
1.1 Motivation
1.2 R Basics
log(1)## [1] 0
exp(1)## [1] 2.718282
log(exp(1))## [1] 1
# EX 2: Variable names
# Load package and data
library(dslabs)
data(murders)
# Use the function names to extract the variable names
names(murders)## [1] "state" "abb" "region" "population" "total"
# EX 3: Examining Variables
# To access the population variable from the murders dataset use this code:
p <- murders$population
# To determine the class of object `p` we use this code:
class(p)## [1] "numeric"
# Use the accessor to extract state abbreviations and assign it to a
a <- murders$abb
# Determine the class of a
class(a)## [1] "character"
# EX 4: Multiple ways to access variables
# We extract the population like this:
p <- murders$population
# This is how we do the same with the square brackets:
o <- murders[["population"]]
# We can confirm these two are the same
identical(o, p)## [1] TRUE
# Use square brackets to extract `abb` from `murders` and assign it to b
b <- murders[["abb"]]
# Check if `a` and `b` are identical
identical(a,b)## [1] TRUE
1.3 Data Types
class(2)## [1] "numeric"
class("programming")## [1] "character"
class(ls)## [1] "function"
class(murders)## [1] "data.frame"
class(murders$state)## [1] "character"
class(murders$region)## [1] "factor"
# structure
str(murders)## 'data.frame': 51 obs. of 5 variables:
## $ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ abb : chr "AL" "AK" "AZ" "AR" ...
## $ region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
## $ population: num 4779736 710231 6392017 2915918 37253956 ...
## $ total : num 135 19 232 93 1257 ...
names(murders)## [1] "state" "abb" "region" "population" "total"
head(murders)## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
# EX 2: Variable names
# Load package and data
library(dslabs)
data(murders)
# Use the function names to extract the variable names
names(murders)## [1] "state" "abb" "region" "population" "total"
# EX 5: Factors
# We can see the class of the region variable using class
class(murders$region)## [1] "factor"
# Determine the number of regions included in this variable
length(levels(murders$region))## [1] 4
# EX 6: Tables
# Here is an example of what the table function does
x <- c("a", "a", "b", "b", "b", "c")
table(x)## x
## a b c
## 2 3 1
# Write one line of code to show the number of states per region
table(murders$region)##
## Northeast South North Central West
## 9 17 12 13
Section 2: Vectors, Sorting
Section 2 introduces you to vectors and functions such as sorting.
In Section 2.1, you will:
- Create numeric and character vectors.
- Name the columns of a vector.
- Generate numeric sequences.
- Access specific elements or parts of a vector.
- Coerce data into different data types as needed.
In Section 2.2, you will:
- Sort vectors in ascending and descending order.
- Extract the indices of the sorted elements from the original vector.
- Find the maximum and minimum elements, as well as their indices, in a vector.
- Rank the elements of a vector in increasing order.
In Section 2.3, you will:
- Perform arithmetic between a vector and a single number.
- Perform arithmetic between two vectors of same length.
2.1 Vectors
c stands for concatenate
codes <- c(italy=380, canada=124, egypt=818)
codes## italy canada egypt
## 380 124 818
use to access an element of a vector
codes[2]## canada
## 124
codes[1:2]## italy canada
## 380 124
codes["canada"]## canada
## 124
# codes["egypt","canada"]x <- 1:5
x## [1] 1 2 3 4 5
y <- as.character(x)
y## [1] "1" "2" "3" "4" "5"
z <- as.numeric(y)
z## [1] 1 2 3 4 5
x <- c("1", "b","3")
x## [1] "1" "b" "3"
y <- as.numeric(x)## Warning: NAs introduced by coercion
y## [1] 1 NA 3
# EX 1: Numeric Vectors
# Here is an example creating a numeric vector named cost
cost <- c(50, 75, 90, 100, 150)
# Create a numeric vector to store the temperatures listed in the instructions into a vector named temp
# Make sure to follow the same order in the instructions
temp <- c("Beijing", 35, "Lagos", 88, "Paris", 42, "Rio de Janeiro", 84, "San Juan", 81, "Toronto", 30)
temp## [1] "Beijing" "35" "Lagos" "88"
## [5] "Paris" "42" "Rio de Janeiro" "84"
## [9] "San Juan" "81" "Toronto" "30"
temp <- c(35, 88, 42, 84, 81, 30)# EX 2: Character vectors
# here is an example of how to create a character vector
food <- c("pizza", "burgers", "salads", "cheese", "pasta")
# Create a character vector called city to store the city names
# Make sure to follow the same order as in the instructions
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan","Toronto")
city## [1] "Beijing" "Lagos" "Paris" "Rio de Janeiro"
## [5] "San Juan" "Toronto"
# EX 3: Connecting Numeric and Character Vectors
# Associate the cost values with its corresponding food item
cost <- c(50, 75, 90, 100, 150)
food <- c("pizza", "burgers", "salads", "cheese", "pasta")
names(cost) <- food
# You already wrote this code
temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
# Associate the temperature values with its corresponding city
names(temp) <- city
temp## Beijing Lagos Paris Rio de Janeiro San Juan
## 35 88 42 84 81
## Toronto
## 30
# EX 4: Subsetting vectors
# cost of the last 3 items in our food list:
cost[3:5]## salads cheese pasta
## 90 100 150
# temperatures of the first three cities in the list:
temp[0:3]## Beijing Lagos Paris
## 35 88 42
temp[c(1,2,3)]## Beijing Lagos Paris
## 35 88 42
# EX 5: Subsetting vectors continued...
# Access the cost of pizza and pasta from our food list
cost[c(1,5)]## pizza pasta
## 50 150
# Define temp
temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
names(temp) <- city
# Access the temperatures of Paris and San Juan
temp[c(3,5)]## Paris San Juan
## 42 81
# EX 6: Sequences
# Create a vector m of integers that starts at 32 and ends at 99.
m <- 32:99
# Determine the length of object m.
length(m)## [1] 68
# Create a vector x of integers that starts 12 and ends at 73.
x <- 12:73
# Determine the length of object x.
length(x)## [1] 62
# EX 7: Sequences continued...
# Create a vector with the multiples of 7, smaller than 50.
seq(7, 49, 7) ## [1] 7 14 21 28 35 42 49
# Create a vector containing all the positive odd numbers smaller than 100.
# The numbers should be in ascending order
seq(1, 99, 2)## [1] 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45
## [24] 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91
## [47] 93 95 97 99
# EX 8: Sequences and length
# We can a vector with the multiples of 7, smaller than 50 like this
seq(7, 49, 7) ## [1] 7 14 21 28 35 42 49
# But note that the second argument does not need to be last number.
# It simply determines the maximum value permitted.
# so the following line of code produces the same vector as seq(7, 49, 7)
seq(7, 50, 7)## [1] 7 14 21 28 35 42 49
# Create a sequence of numbers from 6 to 55, with 4/7 increments and determine its length
length(seq(6, 55, 4/7))## [1] 86
# EX 9: Sequences of certain length
# Store the sequence in the object a
a <- seq(1, 10, length.out = 100)
# Determine the class of a
class(a)## [1] "numeric"
# EX 10: Integers
# Store the sequence in the object a
a <- seq(1, 10)
# Determine the class of a
class(a)## [1] "integer"
# EX 11: Integers and Numerics
# Check the class of 1, assigned to the object a
class(1)## [1] "numeric"
# Confirm the class of 1L is integer
class(1L)## [1] "integer"
# EX 12: Coercion
# Define the vector x
x <- c(1, 3, 5,"a")
# Note that the x is character vector
class(x)## [1] "character"
# Typecast the vector to get an integer vector
# You will get a warning but that is ok
x <- as.integer(x)## Warning: NAs introduced by coercion
2.2 Sorting
# how many murders
sort(murders$total)## [1] 2 4 5 5 7 8 11 12 12 16 19 21 22 27
## [15] 32 36 38 53 63 65 67 84 93 93 97 97 99 111
## [29] 116 118 120 135 142 207 219 232 246 250 286 293 310 321
## [43] 351 364 376 413 457 517 669 805 1257
x <- c(31,4,15,92,65)
sort(x)## [1] 4 15 31 65 92
index <- order(x)
index## [1] 2 3 1 5 4
# first we order the total murders and save it to index
index <- order(murders$total)
# then we use index to look up state ordered by murdercount from low to high
murders$abb[index]## [1] "VT" "ND" "NH" "WY" "HI" "SD" "ME" "ID" "MT" "RI" "AK" "IA" "UT" "WV"
## [15] "NE" "OR" "DE" "MN" "KS" "CO" "NM" "NV" "AR" "WA" "CT" "WI" "DC" "OK"
## [29] "KY" "MA" "MS" "AL" "IN" "SC" "TN" "AZ" "NJ" "VA" "NC" "MD" "OH" "MO"
## [43] "LA" "IL" "GA" "MI" "PA" "NY" "FL" "TX" "CA"
# which is the max murder number
max(murders$total)## [1] 1257
# use to look up state
i_max <- which.max(murders$total)
i_max## [1] 5
murders$state[i_max]## [1] "California"
# which is the min murder number
min(murders$total)## [1] 2
# use to look up state
i_min <- which.min(murders$total)
i_min## [1] 46
murders$state[i_min]## [1] "Vermont"
# ranking
rank(x)## [1] 3 1 2 5 4
# EX 1: sort
# Access the `state` variable and store it in an object
states <- murders$state
# Sort the object alphabetically and redefine the object
states <- sort(states)
# Report the first alphabetical value
states[1]## [1] "Alabama"
# Access population values from the dataset and store it in pop
pop <- murders$population
# Sort the object and save it in the same object
pop <- sort(pop)
# Report the smallest population size
pop[1]## [1] 563626
# EX 2: order
# Access population from the dataset and store it in pop
pop <- murders$population
# Use the command order, to order pop and store in object o
o <- order(pop)
# Find the index number of the entry with the smallest population size
o[1]## [1] 51
# EX 3: New Codes
# Find the smallest value for variable total
which.min(murders$total)## [1] 46
# Find the smallest value for population
which.min(murders$population)## [1] 51
# EX 4:Using the output of order
# Define the variable i to be the index of the smallest state
i <- which.min(murders$population)
# Define variable states to hold the states
states <- murders$state
# Use the index you just defined to find the state with the smallest population
states[i]## [1] "Wyoming"
# EX 5: Ranks# EX 5: Ranks
# Store temperatures in an object
temp <- c(35, 88, 42, 84, 81, 30)
# Store city names in an object
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
# Create data frame with city names and temperature
city_temps <- data.frame(name = city, temperature = temp)
# Define a variable states to be the state names
states <- murders$state
# Define a variable ranks to determine the population size ranks
ranks <- rank(murders$population)
# Create a data frame my_df with the state name and its rank
my_df <- data.frame(states = states, ranks = ranks)
my_df## states ranks
## 1 Alabama 29
## 2 Alaska 5
## 3 Arizona 36
## 4 Arkansas 20
## 5 California 51
## 6 Colorado 30
## 7 Connecticut 23
## 8 Delaware 7
## 9 District of Columbia 2
## 10 Florida 49
## 11 Georgia 44
## 12 Hawaii 12
## 13 Idaho 13
## 14 Illinois 47
## 15 Indiana 37
## 16 Iowa 22
## 17 Kansas 19
## 18 Kentucky 26
## 19 Louisiana 27
## 20 Maine 11
## 21 Maryland 33
## 22 Massachusetts 38
## 23 Michigan 43
## 24 Minnesota 31
## 25 Mississippi 21
## 26 Missouri 34
## 27 Montana 8
## 28 Nebraska 14
## 29 Nevada 17
## 30 New Hampshire 10
## 31 New Jersey 41
## 32 New Mexico 16
## 33 New York 48
## 34 North Carolina 42
## 35 North Dakota 4
## 36 Ohio 45
## 37 Oklahoma 24
## 38 Oregon 25
## 39 Pennsylvania 46
## 40 Rhode Island 9
## 41 South Carolina 28
## 42 South Dakota 6
## 43 Tennessee 35
## 44 Texas 50
## 45 Utah 18
## 46 Vermont 3
## 47 Virginia 40
## 48 Washington 39
## 49 West Virginia 15
## 50 Wisconsin 32
## 51 Wyoming 1
# EX 6: Data Frames, Ranks and Orders
# Define a variable states to be the state names from the murders data frame
states <- murders$state
# Define a variable ranks to determine the population size ranks
ranks <- rank(murders$population)
# Define a variable ind to store the indexes needed to order the population values
ind <- order(murders$population)
# Create a data frame my_df with the state name and its rank and ordered from least populous to most
my_df <- data.frame(states = states[ind], ranks = ranks[ind])
my_df## states ranks
## 1 Wyoming 1
## 2 District of Columbia 2
## 3 Vermont 3
## 4 North Dakota 4
## 5 Alaska 5
## 6 South Dakota 6
## 7 Delaware 7
## 8 Montana 8
## 9 Rhode Island 9
## 10 New Hampshire 10
## 11 Maine 11
## 12 Hawaii 12
## 13 Idaho 13
## 14 Nebraska 14
## 15 West Virginia 15
## 16 New Mexico 16
## 17 Nevada 17
## 18 Utah 18
## 19 Kansas 19
## 20 Arkansas 20
## 21 Mississippi 21
## 22 Iowa 22
## 23 Connecticut 23
## 24 Oklahoma 24
## 25 Oregon 25
## 26 Kentucky 26
## 27 Louisiana 27
## 28 South Carolina 28
## 29 Alabama 29
## 30 Colorado 30
## 31 Minnesota 31
## 32 Wisconsin 32
## 33 Maryland 33
## 34 Missouri 34
## 35 Tennessee 35
## 36 Arizona 36
## 37 Indiana 37
## 38 Massachusetts 38
## 39 Washington 39
## 40 Virginia 40
## 41 New Jersey 41
## 42 North Carolina 42
## 43 Michigan 43
## 44 Georgia 44
## 45 Ohio 45
## 46 Pennsylvania 46
## 47 Illinois 47
## 48 New York 48
## 49 Florida 49
## 50 Texas 50
## 51 California 51
# EX 7: NA
# Using new dataset
library(dslabs)
data(na_example)
# Checking the structure
str(na_example)## int [1:1000] 2 1 3 2 1 3 1 4 3 2 ...
# Find out the mean of the entire dataset
mean(na_example)## [1] NA
# Use is.na to create a logical index ind that tells which entries are NA
ind <- is.na(na_example)
# Determine how many NA ind has using the sum function
sum(ind)## [1] 145
# EX 8: Rmoving NAs
# Note what we can do with the ! operator
x <- c(1, 2, 3)
ind <- c(FALSE, TRUE, FALSE)
x[!ind]## [1] 1 3
# Create the ind vector
library(dslabs)
data(na_example)
ind <- is.na(na_example)
# We saw that this gives an NA
mean(na_example)## [1] NA
# Compute the average, for entries of na_example that are not NA
mean(na_example[!ind])## [1] 2.301754
2.3 Vector Arithmetic
# which state is the biggest:
murders$state[which.max(murders$population)]## [1] "California"
# How many people:
max(murders$population)## [1] 37253956
Example of elementwise operations on vectors
# heights in feet
heights <- c(69,62,66,70,70,73,67,73,67,70)
heights * 2.54## [1] 175.26 157.48 167.64 177.80 177.80 185.42 170.18 185.42 170.18 177.80
murder_rate <- murders$total/murders$population*100000
murder_rate## [1] 2.8244238 2.6751860 3.6295273 3.1893901 3.3741383 1.2924531
## [7] 2.7139722 4.2319369 16.4527532 3.3980688 3.7903226 0.5145920
## [13] 0.7655102 2.8369608 2.1900730 0.6893484 2.2081106 2.6732010
## [19] 7.7425810 0.8280881 5.0748655 1.8021791 4.1786225 0.9992600
## [25] 4.0440846 5.3598917 1.2128379 1.7521372 3.1104763 0.3798036
## [31] 2.7980319 3.2537239 2.6679599 2.9993237 0.5947151 2.6871225
## [37] 2.9589340 0.9396843 3.5977513 1.5200933 4.4753235 0.9825837
## [43] 3.4509357 3.2013603 0.7959810 0.3196211 3.1246001 1.3829942
## [49] 1.4571013 1.7056487 0.8871131
murders$state[order(murder_rate,decreasing=TRUE)]## [1] "District of Columbia" "Louisiana" "Missouri"
## [4] "Maryland" "South Carolina" "Delaware"
## [7] "Michigan" "Mississippi" "Georgia"
## [10] "Arizona" "Pennsylvania" "Tennessee"
## [13] "Florida" "California" "New Mexico"
## [16] "Texas" "Arkansas" "Virginia"
## [19] "Nevada" "North Carolina" "Oklahoma"
## [22] "Illinois" "Alabama" "New Jersey"
## [25] "Connecticut" "Ohio" "Alaska"
## [28] "Kentucky" "New York" "Kansas"
## [31] "Indiana" "Massachusetts" "Nebraska"
## [34] "Wisconsin" "Rhode Island" "West Virginia"
## [37] "Washington" "Colorado" "Montana"
## [40] "Minnesota" "South Dakota" "Oregon"
## [43] "Wyoming" "Maine" "Utah"
## [46] "Idaho" "Iowa" "North Dakota"
## [49] "Hawaii" "New Hampshire" "Vermont"
# EX 1: Vectorized operations
# Assign city names to `city`
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
# Store temperature values in `temp`
temp <- c(35, 88, 42, 84, 81, 30)
# Convert temperature into Celsius and overwrite the original values of 'temp' with these Celsius values
temp <- (temp-32) * 5/9
# Create a data frame `city_temps`
city_temps <- data.frame(name = city, temperature = temp)
city_temps## name temperature
## 1 Beijing 1.666667
## 2 Lagos 31.111111
## 3 Paris 5.555556
## 4 Rio de Janeiro 28.888889
## 5 San Juan 27.222222
## 6 Toronto -1.111111
# EX 2: Vectorized operations continued...
# Define an object `x` with the numbers 1 through 100
x <- seq(1, 100)
# Sum the equation
sum(1/x^2) ## [1] 1.634984
# EX 3:Vectorized operation continued...
# Load the data
library(dslabs)
data(murders)
# Store the per 100,000 murder rate for each state in murder_rate
murder_rate <- murders$total / murders$population * 100000
# Calculate the average murder rate in the US
sum(murder_rate) / length(murder_rate)## [1] 2.779125
mean(murder_rate)## [1] 2.779125
Section 3: Indexing, Data Wrangeling, Plots
Section 3 introduces to the R commands and techniques that help you wrangle, analyze, and visualize data.
In Section 3.1, you will:
- Subset a vector based on properties of another vector.
- Use multiple logical operators to index vectors.
- Extract the indices of vector elements satisfying one or more logical conditions.
- Extract the indices of vector elements matching with another vector.
- Determine which elements in one vector are present in another vector.
In Section 3.2, you will:
- Wrangle data tables using the functions in ‘dplyr’ package.
- Modify a data table by adding or changing columns.
- Subset rows in a data table.
- Subset columns in a data table.
- Perform a series of operations using the pipe operator.
- Create data frames.
In Section 3.3, you will:
- Plot data in scatter plots, box plots and histograms.
3.1 Indexing
murder_rate <- murders$total / murders$population * 100000
# murder rate in Italy is 0.71, find us states with similar or lower rates
index <- murder_rate < 0.71
index## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [34] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE TRUE FALSE FALSE FALSE FALSE FALSE
# which states
murders$state[index]## [1] "Hawaii" "Iowa" "New Hampshire" "North Dakota"
## [5] "Vermont"
# how many states
sum(index)## [1] 5
# we want to find a states with mountains (West) and safe (murder_rate <= 1)
west <- murders$region == "West"
safe <- murder_rate <= 1
index <- safe & west
murders$state[index]## [1] "Hawaii" "Idaho" "Oregon" "Utah" "Wyoming"
which
x <- c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)
which(x)## [1] 2 4 5
# Ex we want to look up the murderrate in Massachusetts
index <- which(murders$state =="Massachusetts")
index## [1] 22
# so to get the murder rate, we use the index
murder_rate[index]## [1] 1.802179
# Now we want to match severral states
index <- match(c("New York", "Florida", "Texas"), murders$state)
index## [1] 33 10 44
# To confirm we got it right
murder_state <- murders$state
murder_state[index]## [1] "New York" "Florida" "Texas"
# and the murder rate of these states
murder_rate[index]## [1] 2.667960 3.398069 3.201360
x <- c("a", "b", "c", "d", "e")
y <- c("a", "d", "f")
# so we can ask if y is in x
y %in% x## [1] TRUE TRUE FALSE
# check if three states are actually states
c("Boston", "Dakota", "Washington") %in% murders$state## [1] FALSE FALSE TRUE
# EX 1: Logical Vectors
# Store the murder rate per 100,000 for each state, in `murder_rate`
murder_rate <- murders$total / murders$population * 100000
#
# Store the `murder_rate < 1` in `low`
low <- murder_rate < 1# EX 2: which
# Store the murder rate per 100,000 for each state, in murder_rate
murder_rate <- murders$total/murders$population*100000
# Store the murder_rate < 1 in low
low <- murder_rate < 1
# Get the indices of entries that are below 1
which(low)## [1] 12 13 16 20 24 30 35 38 42 45 46 51
# EX 3: Ordering vectors
# Store the murder rate per 100,000 for each state, in murder_rate
murder_rate <- murders$total/murders$population*100000
# Store the murder_rate < 1 in low
low <- murder_rate < 1
# Names of states with murder rates lower than 1
murders$state[low]## [1] "Hawaii" "Idaho" "Iowa" "Maine"
## [5] "Minnesota" "New Hampshire" "North Dakota" "Oregon"
## [9] "South Dakota" "Utah" "Vermont" "Wyoming"
# EX 4: Filtering
# Store the murder rate per 100,000 for each state, in `murder_rate`
murder_rate <- murders$total/murders$population*100000
# Store the `murder_rate < 1` in `low`
low <- murder_rate < 1
# Create a vector ind for states in the Northeast and with murder rates lower than 1.
ind <- (murders$region == "Northeast") & (murder_rate < 1)
# Names of states in `ind`
murders$state[ind]## [1] "Maine" "New Hampshire" "Vermont"
# EX 5: Filtering continued
# Store the murder rate per 100,000 for each state, in murder_rate
murder_rate <- murders$total/murders$population*100000
# Compute average murder rate and store in avg using `mean`
avg <- mean(murder_rate)
# How many states have murder rates below avg ? Check using sum
sum(murder_rate < avg)## [1] 27
# EX 6: Match
# Store the 3 abbreviations in abbs in a vector (remember that they are character vectors and need quotes)
abbs <- c("AK", "MI", "IA")
# Match the abbs to the murders$abb and store in `ind`
ind <- match(abbs , murders$abb)
# Print state names from `ind`
murders$state[ind]## [1] "Alaska" "Michigan" "Iowa"
# EX 7: %in%
# Store the 5 abbreviations in `abbs`. (remember that they are character vectors)
abbs <- c("MA", "ME", "MI", "MO", "MU")
# Use the %in% command to check if the entries of abbs are abbreviations in the the murders data frame
abbs %in% murders$abb## [1] TRUE TRUE TRUE TRUE FALSE
# EX 8: Logical operator
# Store the 5 abbreviations in abbs. (remember that they are character vectors)
abbs <- c("MA", "ME", "MI", "MO", "MU")
# Use the `which` command and `!` operator to find out which abbreviation are not actually part of the dataset and store in ind
ind <- which(!abbs %in% murders$abb)
# What are the entries of abbs that are not actual abbreviations
abbs[ind]## [1] "MU"
3.2 Basic Data Wrangeling
library(dplyr)we want to add the murder ae into the table
murders <- mutate(murders, rate=total/population * 100000)
murders## state abb region population total rate
## 1 Alabama AL South 4779736 135 2.8244238
## 2 Alaska AK West 710231 19 2.6751860
## 3 Arizona AZ West 6392017 232 3.6295273
## 4 Arkansas AR South 2915918 93 3.1893901
## 5 California CA West 37253956 1257 3.3741383
## 6 Colorado CO West 5029196 65 1.2924531
## 7 Connecticut CT Northeast 3574097 97 2.7139722
## 8 Delaware DE South 897934 38 4.2319369
## 9 District of Columbia DC South 601723 99 16.4527532
## 10 Florida FL South 19687653 669 3.3980688
## 11 Georgia GA South 9920000 376 3.7903226
## 12 Hawaii HI West 1360301 7 0.5145920
## 13 Idaho ID West 1567582 12 0.7655102
## 14 Illinois IL North Central 12830632 364 2.8369608
## 15 Indiana IN North Central 6483802 142 2.1900730
## 16 Iowa IA North Central 3046355 21 0.6893484
## 17 Kansas KS North Central 2853118 63 2.2081106
## 18 Kentucky KY South 4339367 116 2.6732010
## 19 Louisiana LA South 4533372 351 7.7425810
## 20 Maine ME Northeast 1328361 11 0.8280881
## 21 Maryland MD South 5773552 293 5.0748655
## 22 Massachusetts MA Northeast 6547629 118 1.8021791
## 23 Michigan MI North Central 9883640 413 4.1786225
## 24 Minnesota MN North Central 5303925 53 0.9992600
## 25 Mississippi MS South 2967297 120 4.0440846
## 26 Missouri MO North Central 5988927 321 5.3598917
## 27 Montana MT West 989415 12 1.2128379
## 28 Nebraska NE North Central 1826341 32 1.7521372
## 29 Nevada NV West 2700551 84 3.1104763
## 30 New Hampshire NH Northeast 1316470 5 0.3798036
## 31 New Jersey NJ Northeast 8791894 246 2.7980319
## 32 New Mexico NM West 2059179 67 3.2537239
## 33 New York NY Northeast 19378102 517 2.6679599
## 34 North Carolina NC South 9535483 286 2.9993237
## 35 North Dakota ND North Central 672591 4 0.5947151
## 36 Ohio OH North Central 11536504 310 2.6871225
## 37 Oklahoma OK South 3751351 111 2.9589340
## 38 Oregon OR West 3831074 36 0.9396843
## 39 Pennsylvania PA Northeast 12702379 457 3.5977513
## 40 Rhode Island RI Northeast 1052567 16 1.5200933
## 41 South Carolina SC South 4625364 207 4.4753235
## 42 South Dakota SD North Central 814180 8 0.9825837
## 43 Tennessee TN South 6346105 219 3.4509357
## 44 Texas TX South 25145561 805 3.2013603
## 45 Utah UT West 2763885 22 0.7959810
## 46 Vermont VT Northeast 625741 2 0.3196211
## 47 Virginia VA South 8001024 250 3.1246001
## 48 Washington WA West 6724540 93 1.3829942
## 49 West Virginia WV South 1852994 27 1.4571013
## 50 Wisconsin WI North Central 5686986 97 1.7056487
## 51 Wyoming WY West 563626 5 0.8871131
filter(murders, rate <= 0.71)## state abb region population total rate
## 1 Hawaii HI West 1360301 7 0.5145920
## 2 Iowa IA North Central 3046355 21 0.6893484
## 3 New Hampshire NH Northeast 1316470 5 0.3798036
## 4 North Dakota ND North Central 672591 4 0.5947151
## 5 Vermont VT Northeast 625741 2 0.3196211
select specific columns
new_table <- select(murders,state,region,rate)
new_table## state region rate
## 1 Alabama South 2.8244238
## 2 Alaska West 2.6751860
## 3 Arizona West 3.6295273
## 4 Arkansas South 3.1893901
## 5 California West 3.3741383
## 6 Colorado West 1.2924531
## 7 Connecticut Northeast 2.7139722
## 8 Delaware South 4.2319369
## 9 District of Columbia South 16.4527532
## 10 Florida South 3.3980688
## 11 Georgia South 3.7903226
## 12 Hawaii West 0.5145920
## 13 Idaho West 0.7655102
## 14 Illinois North Central 2.8369608
## 15 Indiana North Central 2.1900730
## 16 Iowa North Central 0.6893484
## 17 Kansas North Central 2.2081106
## 18 Kentucky South 2.6732010
## 19 Louisiana South 7.7425810
## 20 Maine Northeast 0.8280881
## 21 Maryland South 5.0748655
## 22 Massachusetts Northeast 1.8021791
## 23 Michigan North Central 4.1786225
## 24 Minnesota North Central 0.9992600
## 25 Mississippi South 4.0440846
## 26 Missouri North Central 5.3598917
## 27 Montana West 1.2128379
## 28 Nebraska North Central 1.7521372
## 29 Nevada West 3.1104763
## 30 New Hampshire Northeast 0.3798036
## 31 New Jersey Northeast 2.7980319
## 32 New Mexico West 3.2537239
## 33 New York Northeast 2.6679599
## 34 North Carolina South 2.9993237
## 35 North Dakota North Central 0.5947151
## 36 Ohio North Central 2.6871225
## 37 Oklahoma South 2.9589340
## 38 Oregon West 0.9396843
## 39 Pennsylvania Northeast 3.5977513
## 40 Rhode Island Northeast 1.5200933
## 41 South Carolina South 4.4753235
## 42 South Dakota North Central 0.9825837
## 43 Tennessee South 3.4509357
## 44 Texas South 3.2013603
## 45 Utah West 0.7959810
## 46 Vermont Northeast 0.3196211
## 47 Virginia South 3.1246001
## 48 Washington West 1.3829942
## 49 West Virginia South 1.4571013
## 50 Wisconsin North Central 1.7056487
## 51 Wyoming West 0.8871131
filter(new_table, rate <= 0.71)## state region rate
## 1 Hawaii West 0.5145920
## 2 Iowa North Central 0.6893484
## 3 New Hampshire Northeast 0.3798036
## 4 North Dakota North Central 0.5947151
## 5 Vermont Northeast 0.3196211
using pipe to put it all together
murders %>% select(state,region,rate) %>% filter(rate <= 0.71)## state region rate
## 1 Hawaii West 0.5145920
## 2 Iowa North Central 0.6893484
## 3 New Hampshire Northeast 0.3798036
## 4 North Dakota North Central 0.5947151
## 5 Vermont Northeast 0.3196211
creating frames
grades <- data.frame(names=c("John", "juan","Jean","Yao"),
exam_1 = c(95, 80, 90, 85),
exam_2 = c(90, 85, 85, 90))
grades## names exam_1 exam_2
## 1 John 95 90
## 2 juan 80 85
## 3 Jean 90 85
## 4 Yao 85 90
class(grades$names)## [1] "factor"
grades <- data.frame(names=c("John", "juan","Jean","Yao"),
exam_1 = c(95, 80, 90, 85),
exam_2 = c(90, 85, 85, 90),
stringsAsFactors = FALSE)
class(grades$names)## [1] "character"
# EX 1: dplyr
# Loading data
library(dslabs)
data(murders)
# Loading dplyr
library(dplyr)
# Redefine murders so that it includes column named rate with the per 100,000 murder rates
murders <- mutate(murders, rate=total/population * 100000)# EX 2: mutate
# Note that if you want ranks from highest to lowest you can take the negative and then compute the ranks
x <- c(88, 100, 83, 92, 94)
rank(-x)## [1] 4 1 5 3 2
# Defining rate
rate <- murders$total/ murders$population * 100000
# Redefine murders to include a column named rank
# with the ranks of rate from highest to lowest
murders <- mutate(murders, rank(-rate))
murders## state abb region population total rate
## 1 Alabama AL South 4779736 135 2.8244238
## 2 Alaska AK West 710231 19 2.6751860
## 3 Arizona AZ West 6392017 232 3.6295273
## 4 Arkansas AR South 2915918 93 3.1893901
## 5 California CA West 37253956 1257 3.3741383
## 6 Colorado CO West 5029196 65 1.2924531
## 7 Connecticut CT Northeast 3574097 97 2.7139722
## 8 Delaware DE South 897934 38 4.2319369
## 9 District of Columbia DC South 601723 99 16.4527532
## 10 Florida FL South 19687653 669 3.3980688
## 11 Georgia GA South 9920000 376 3.7903226
## 12 Hawaii HI West 1360301 7 0.5145920
## 13 Idaho ID West 1567582 12 0.7655102
## 14 Illinois IL North Central 12830632 364 2.8369608
## 15 Indiana IN North Central 6483802 142 2.1900730
## 16 Iowa IA North Central 3046355 21 0.6893484
## 17 Kansas KS North Central 2853118 63 2.2081106
## 18 Kentucky KY South 4339367 116 2.6732010
## 19 Louisiana LA South 4533372 351 7.7425810
## 20 Maine ME Northeast 1328361 11 0.8280881
## 21 Maryland MD South 5773552 293 5.0748655
## 22 Massachusetts MA Northeast 6547629 118 1.8021791
## 23 Michigan MI North Central 9883640 413 4.1786225
## 24 Minnesota MN North Central 5303925 53 0.9992600
## 25 Mississippi MS South 2967297 120 4.0440846
## 26 Missouri MO North Central 5988927 321 5.3598917
## 27 Montana MT West 989415 12 1.2128379
## 28 Nebraska NE North Central 1826341 32 1.7521372
## 29 Nevada NV West 2700551 84 3.1104763
## 30 New Hampshire NH Northeast 1316470 5 0.3798036
## 31 New Jersey NJ Northeast 8791894 246 2.7980319
## 32 New Mexico NM West 2059179 67 3.2537239
## 33 New York NY Northeast 19378102 517 2.6679599
## 34 North Carolina NC South 9535483 286 2.9993237
## 35 North Dakota ND North Central 672591 4 0.5947151
## 36 Ohio OH North Central 11536504 310 2.6871225
## 37 Oklahoma OK South 3751351 111 2.9589340
## 38 Oregon OR West 3831074 36 0.9396843
## 39 Pennsylvania PA Northeast 12702379 457 3.5977513
## 40 Rhode Island RI Northeast 1052567 16 1.5200933
## 41 South Carolina SC South 4625364 207 4.4753235
## 42 South Dakota SD North Central 814180 8 0.9825837
## 43 Tennessee TN South 6346105 219 3.4509357
## 44 Texas TX South 25145561 805 3.2013603
## 45 Utah UT West 2763885 22 0.7959810
## 46 Vermont VT Northeast 625741 2 0.3196211
## 47 Virginia VA South 8001024 250 3.1246001
## 48 Washington WA West 6724540 93 1.3829942
## 49 West Virginia WV South 1852994 27 1.4571013
## 50 Wisconsin WI North Central 5686986 97 1.7056487
## 51 Wyoming WY West 563626 5 0.8871131
## rank(-rate)
## 1 23
## 2 27
## 3 10
## 4 17
## 5 14
## 6 38
## 7 25
## 8 6
## 9 1
## 10 13
## 11 9
## 12 49
## 13 46
## 14 22
## 15 31
## 16 47
## 17 30
## 18 28
## 19 2
## 20 44
## 21 4
## 22 32
## 23 7
## 24 40
## 25 8
## 26 3
## 27 39
## 28 33
## 29 19
## 30 50
## 31 24
## 32 15
## 33 29
## 34 20
## 35 48
## 36 26
## 37 21
## 38 42
## 39 11
## 40 35
## 41 5
## 42 41
## 43 12
## 44 16
## 45 45
## 46 51
## 47 18
## 48 37
## 49 36
## 50 34
## 51 43
# EX 3: select
# Load dplyr
library(dplyr)
# Use select to only show state names and abbreviations from murders
select(murders, state, abb)## state abb
## 1 Alabama AL
## 2 Alaska AK
## 3 Arizona AZ
## 4 Arkansas AR
## 5 California CA
## 6 Colorado CO
## 7 Connecticut CT
## 8 Delaware DE
## 9 District of Columbia DC
## 10 Florida FL
## 11 Georgia GA
## 12 Hawaii HI
## 13 Idaho ID
## 14 Illinois IL
## 15 Indiana IN
## 16 Iowa IA
## 17 Kansas KS
## 18 Kentucky KY
## 19 Louisiana LA
## 20 Maine ME
## 21 Maryland MD
## 22 Massachusetts MA
## 23 Michigan MI
## 24 Minnesota MN
## 25 Mississippi MS
## 26 Missouri MO
## 27 Montana MT
## 28 Nebraska NE
## 29 Nevada NV
## 30 New Hampshire NH
## 31 New Jersey NJ
## 32 New Mexico NM
## 33 New York NY
## 34 North Carolina NC
## 35 North Dakota ND
## 36 Ohio OH
## 37 Oklahoma OK
## 38 Oregon OR
## 39 Pennsylvania PA
## 40 Rhode Island RI
## 41 South Carolina SC
## 42 South Dakota SD
## 43 Tennessee TN
## 44 Texas TX
## 45 Utah UT
## 46 Vermont VT
## 47 Virginia VA
## 48 Washington WA
## 49 West Virginia WV
## 50 Wisconsin WI
## 51 Wyoming WY
# EX 4: filter
# Add the necessary columns
murders <- mutate(murders, rate = total/population * 100000, rank = rank(-rate))
# Filter to show the top 5 states with the highest murder rates
filter(murders, rank <= 5)## state abb region population total rate
## 1 District of Columbia DC South 601723 99 16.452753
## 2 Louisiana LA South 4533372 351 7.742581
## 3 Maryland MD South 5773552 293 5.074866
## 4 Missouri MO North Central 5988927 321 5.359892
## 5 South Carolina SC South 4625364 207 4.475323
## rank(-rate) rank
## 1 1 1
## 2 2 2
## 3 4 4
## 4 3 3
## 5 5 5
# EX 5: filter with !=
# Use filter to create a new data frame no_south
no_south <- filter(murders, region != "South")
# Use nrow() to calculate the number of rows
nrow(no_south)## [1] 34
# EX 6: filter with %in%
# Create a new data frame called murders_nw with only the states from the northeast and the west
murders_nw <- filter(murders, region %in% c("Northeast", "West"))
# Number of states (rows) in this category
nrow(murders_nw)## [1] 22
# EX 7: filtering by two conditions
# add the rate column
murders <- mutate(murders, rate = total / population * 100000, rank = rank(-rate))
# Create a table, call it `my_states`, that satisfies both the conditions
my_states <- filter(murders, region %in% c("Northeast", "West") & rate < 1)
# Use select to show only the state name, the murder rate and the rank
select(my_states, state, rate, rank)## state rate rank
## 1 Hawaii 0.5145920 49
## 2 Idaho 0.7655102 46
## 3 Maine 0.8280881 44
## 4 New Hampshire 0.3798036 50
## 5 Oregon 0.9396843 42
## 6 Utah 0.7959810 45
## 7 Vermont 0.3196211 51
## 8 Wyoming 0.8871131 43
# EX 8: Using the pipe %>%
## Define the rate and rank column
murders <- mutate(murders, rate = total / population * 100000, rank = rank(-rate))
# show the result and only include the state, rate, and rank columns, all in one line
filter(murders, region %in% c("Northeast", "West") & rate < 1) %>%
select(state, rate, rank)## state rate rank
## 1 Hawaii 0.5145920 49
## 2 Idaho 0.7655102 46
## 3 Maine 0.8280881 44
## 4 New Hampshire 0.3798036 50
## 5 Oregon 0.9396843 42
## 6 Utah 0.7959810 45
## 7 Vermont 0.3196211 51
## 8 Wyoming 0.8871131 43
# EX 9: mutate, filter and select
# Loading the libraries
library(dplyr)
data(murders)
# Create new data frame called my_states (with specifications in the instructions)
my_states <- murders %>%
mutate(rate = total / population * 100000, rank = rank(-rate)) %>%
filter(region %in% c("Northeast", "West") & rate < 1) %>%
select(state, rate, rank)
my_states## state rate rank
## 1 Hawaii 0.5145920 49
## 2 Idaho 0.7655102 46
## 3 Maine 0.8280881 44
## 4 New Hampshire 0.3798036 50
## 5 Oregon 0.9396843 42
## 6 Utah 0.7959810 45
## 7 Vermont 0.3196211 51
## 8 Wyoming 0.8871131 43
3.3 ## Basic Plots
more populus states have more murder
population_in_millions <- murders$population
total_gun_murders <- murders$total
plot(population_in_millions, total_gun_murders)To look at the distribution of the data we use histogram
class(murders$rate)## [1] "NULL"
murders <- mutate(murders, rate = total / population * 100000, rank = rank(-rate))
murders## state abb region population total rate rank
## 1 Alabama AL South 4779736 135 2.8244238 23
## 2 Alaska AK West 710231 19 2.6751860 27
## 3 Arizona AZ West 6392017 232 3.6295273 10
## 4 Arkansas AR South 2915918 93 3.1893901 17
## 5 California CA West 37253956 1257 3.3741383 14
## 6 Colorado CO West 5029196 65 1.2924531 38
## 7 Connecticut CT Northeast 3574097 97 2.7139722 25
## 8 Delaware DE South 897934 38 4.2319369 6
## 9 District of Columbia DC South 601723 99 16.4527532 1
## 10 Florida FL South 19687653 669 3.3980688 13
## 11 Georgia GA South 9920000 376 3.7903226 9
## 12 Hawaii HI West 1360301 7 0.5145920 49
## 13 Idaho ID West 1567582 12 0.7655102 46
## 14 Illinois IL North Central 12830632 364 2.8369608 22
## 15 Indiana IN North Central 6483802 142 2.1900730 31
## 16 Iowa IA North Central 3046355 21 0.6893484 47
## 17 Kansas KS North Central 2853118 63 2.2081106 30
## 18 Kentucky KY South 4339367 116 2.6732010 28
## 19 Louisiana LA South 4533372 351 7.7425810 2
## 20 Maine ME Northeast 1328361 11 0.8280881 44
## 21 Maryland MD South 5773552 293 5.0748655 4
## 22 Massachusetts MA Northeast 6547629 118 1.8021791 32
## 23 Michigan MI North Central 9883640 413 4.1786225 7
## 24 Minnesota MN North Central 5303925 53 0.9992600 40
## 25 Mississippi MS South 2967297 120 4.0440846 8
## 26 Missouri MO North Central 5988927 321 5.3598917 3
## 27 Montana MT West 989415 12 1.2128379 39
## 28 Nebraska NE North Central 1826341 32 1.7521372 33
## 29 Nevada NV West 2700551 84 3.1104763 19
## 30 New Hampshire NH Northeast 1316470 5 0.3798036 50
## 31 New Jersey NJ Northeast 8791894 246 2.7980319 24
## 32 New Mexico NM West 2059179 67 3.2537239 15
## 33 New York NY Northeast 19378102 517 2.6679599 29
## 34 North Carolina NC South 9535483 286 2.9993237 20
## 35 North Dakota ND North Central 672591 4 0.5947151 48
## 36 Ohio OH North Central 11536504 310 2.6871225 26
## 37 Oklahoma OK South 3751351 111 2.9589340 21
## 38 Oregon OR West 3831074 36 0.9396843 42
## 39 Pennsylvania PA Northeast 12702379 457 3.5977513 11
## 40 Rhode Island RI Northeast 1052567 16 1.5200933 35
## 41 South Carolina SC South 4625364 207 4.4753235 5
## 42 South Dakota SD North Central 814180 8 0.9825837 41
## 43 Tennessee TN South 6346105 219 3.4509357 12
## 44 Texas TX South 25145561 805 3.2013603 16
## 45 Utah UT West 2763885 22 0.7959810 45
## 46 Vermont VT Northeast 625741 2 0.3196211 51
## 47 Virginia VA South 8001024 250 3.1246001 18
## 48 Washington WA West 6724540 93 1.3829942 37
## 49 West Virginia WV South 1852994 27 1.4571013 36
## 50 Wisconsin WI North Central 5686986 97 1.7056487 34
## 51 Wyoming WY West 563626 5 0.8871131 43
hist(murders$rate)#One extreme value
murders$state[which.max(murders$rate)]## [1] "District of Columbia"
boxplots are good at comparing different groupings like regions
boxplot(rate~region, data= murders)# EX 1: Scatterplots
# Load the datasets and define some variables
library(dslabs)
data(murders)
population_in_millions <- murders$population/10^6
total_gun_murders <- murders$total
plot(population_in_millions, total_gun_murders)# Transform population using the log10 transformation and save to object log10_population
log10_population <- log10(murders$population)
# Transform total gun murders using log10 transformation and save to object log10_total_gun_murders
log10_total_gun_murders <- log10(total_gun_murders)
# Create a scatterplot with the log scale transformed population and murders
plot(log10_population, log10_total_gun_murders)# EX 2: Histograms
# Store the population in millions and save to population_in_millions
population_in_millions <- murders$population/10^6
# Create a histogram of this variable
hist(population_in_millions)# EX 3: Boxplots
# Create a boxplot of state populations by region for the murders dataset
boxplot(murders$population~murders$region)Section 4: Programming Basics
Section 4 introduces you to general programming features like ‘if-else’, and ‘for loop’ commands so that you can write your own functions to perform various operations on datasets.
In Section 4.1, you will:
- Understand some of the programming capabilities of R.
In Section 4.2, you will:
- Use basic conditional expressions to perform different operations.
- Check if any or all elements of a logical vector are TRUE.
In Section 4.3, you will:
- Define and call functions to perform various operations.
- Pass arguments to functions, and return variables/objects from functions.
In Section 4.4, you will:
- Use ‘for’ loop to perform repeated operations.
- Articulate in-built functions of R that you could try for yourself.
4.1 Introduction to Programming in R
4.2 Conditionals
library(dslabs)
data(murders)
murder_rate <- murders$total/murders$population*100000
ind <- which.min(murder_rate)
if(murder_rate[ind] < 0.5) {
print(murders$state[ind])
} else{
print("No state has murder rate that low")
}## [1] "Vermont"
ind <- which.min(murder_rate)
if(murder_rate[ind] < 0.25) {
print(murders$state[ind])
} else{
print("No state has murder rate that low")
}## [1] "No state has murder rate that low"
a <- c(0,1,2,-4,5)
result <- ifelse(a > 0, 1/a, NA)
result## [1] NA 1.0 0.5 NA 0.2
4.3 Functions
avg <- function(x) {
s <- sum(x)
n <- length(x)
s/n
}
x <- c(5,4,3,2)
avg(x)## [1] 3.5
4.4 For Loops
compute_s_n <- function(n){
x <- 1:n
sum(x)
}
compute_s_n(3) # 1+2+3## [1] 6
compute_s_n(100)## [1] 5050
we now want to repeat the process 25 times
m <- 25
# we create an empty vector
s_n <- vector(length = m)
for(n in 1:m) {
s_n[n] <- compute_s_n(n)
}
s_n## [1] 1 3 6 10 15 21 28 36 45 55 66 78 91 105 120 136 153
## [18] 171 190 210 231 253 276 300 325
n <- 1:m
plot(n, s_n)in stead of loops we use:
- apply
- sapply
- tapply
# EX 2: Conditionals
# Assign the state abbreviation when the state name is longer than 8 characters
new_names <- ifelse(nchar(murders$state)>8, murders$abb, murders$state)
new_names## [1] "Alabama" "Alaska" "Arizona" "Arkansas" "CA" "Colorado"
## [7] "CT" "Delaware" "DC" "Florida" "Georgia" "Hawaii"
## [13] "Idaho" "Illinois" "Indiana" "Iowa" "Kansas" "Kentucky"
## [19] "LA" "Maine" "Maryland" "MA" "Michigan" "MN"
## [25] "MS" "Missouri" "Montana" "Nebraska" "Nevada" "NH"
## [31] "NJ" "NM" "New York" "NC" "ND" "Ohio"
## [37] "Oklahoma" "Oregon" "PA" "RI" "SC" "SD"
## [43] "TN" "Texas" "Utah" "Vermont" "Virginia" "WA"
## [49] "WV" "WI" "Wyoming"
# EX 4: Defining functions
# Create function called `sum_n`
sum_n <- function(n){
x <- 1:n
sum(x)
}
# Determine the sum of integers from 1 to 5000
sum_n(5000)## [1] 12502500
# EX 5: Defining functions continued...
# Create `altman_plot`
altman_plot <- function(x, y){
plot(x + y, y - x)
}
x <- c(1,2,3,4,5)
y <- c(2,4,6,8,10)
altman_plot(x,y)# Run this code
x <- 3
my_func <- function(y){
x <- 5
y+5
}
# Print value of x
x## [1] 3
# EX 7: For loops
# Here is a function that adds numbers from 1 to n
example_func <- function(n){
x <- 1:n
sum(x)
}
# Here is the sum of the first 100 numbers
example_func(100)## [1] 5050
# Write the function with argument n, with the above mentioned specifications and store it in `compute_s_n`
compute_s_n <- function(n){
x <- 1:n
sum(x^2)
}
# Report the value of the sum when n=10
compute_s_n(10)## [1] 385
# EX 8: For loops continued...
# Define a function and store it in `compute_s_n`
compute_s_n <- function(n){
x <- 1:n
sum(x^2)
}
# Create a vector for storing results
s_n <- vector("numeric", 25)
# Assign values to `n` and `s_n`
for(i in 1:25){
s_n[i] <- compute_s_n(i)
}# EX 9: Checking our math
# Define the function
compute_s_n <- function(n){
x <- 1:n
sum(x^2)
}
# Define the vector of n
n <- 1:25
# Define the vector to store data
s_n <- vector("numeric", 25)
for(i in n){
s_n[i] <- compute_s_n(i)
}
# Create the plot
plot(n, s_n)# EX 10: Checking our math continued
# Define the function
compute_s_n <- function(n){
x <- 1:n
sum(x^2)
}
# Define the vector of n
n <- 1:25
# Define the vector to store data
s_n <- vector("numeric", 25)
for(i in n){
s_n[i] <- compute_s_n(i)
}
# Check that s_n is identical to the formula given in the instructions.
identical(s_n, n*(n+1)*(2*n+1)/6)## [1] TRUE