Introduction to R
Welcome to the R segment of DATA 5070. We will spend the next several weeks learning basic coding techniques, data management, and data modeling in R.
Why Learn R?
- First and foremost, R is free
- Unlike Python, R is specifically designed for conducting statistical analyses and managing data.
- The R project is open-source
- It is a great tool for creating dynamic reports (including this one).
Resources
Because R is open-source software, a large community has grown up around its functions and packages. I highly encourage you to look up syntax, or any particular errors that arise, using these websites:
- Stack Overflow
- R Bloggers
- Google :)
1. Getting Started
Saving your work
Before we get started
In R, the working directory is the place on your computer (or in the cloud) where R looks for input data and writes output files or graphics. Be sure to set a working directory at the beginning of each session by using the setwd() function, providing R with a file path that it will use to locate your project folder. Note that how you write the path will differ depending on your operating system.
# Check the current working directory
getwd()
# Mac/Unix
setwd("/Users/aayush/Desktop/TA_5070")
# Windows
setwd("C:/aayush/Desktop/TA_5070")
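As a hedged sketch of a more portable approach, file.path() builds the path with the correct separators on any operating system, and dir.exists() checks that the folder is there before switching to it. The folder location below simply reuses the example name from above and is an assumption about where your project lives.

```r
# Build the path from pieces; path.expand("~") gives your home directory.
# "Desktop/TA_5070" is only an example location, not a required one.
project_dir <- file.path(path.expand("~"), "Desktop", "TA_5070")

if (dir.exists(project_dir)) {
  setwd(project_dir)          # switch only if the folder really exists
} else {
  message("Folder not found: ", project_dir)
}

getwd()                       # confirm where R is looking now
```

Checking first avoids the error that setwd() raises when the folder does not exist.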
Loading Packages
The first time you use a package, you will need to install it from CRAN using the install.packages() command.
install.packages("dplyr")
Once you have installed the package, you will need to load it with library() in each new R session or script file where you want to use it.
library(dplyr)
library(tidyverse)
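Because install.packages() only needs to run once per machine, one common pattern is to install a package only when it is missing. The sketch below uses requireNamespace() for that check. The helper name install_if_missing is made up for this example, and the install step assumes an internet connection and a configured CRAN mirror.

```r
# Hypothetical helper: install a package only if it is not already available
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is available afterwards
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call never triggers an installation
install_if_missing("stats")

# For the packages used in this lesson you would call, for example:
# install_if_missing("dplyr")
# library(dplyr)
```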
Getting Help
R provides built-in help pages for functions and packages.
To get help with a specific function, or with a package that provides its own help topic, you can use:
help(packagename)
# or
?packagename
Basic Mathematical Operations
# Addition
1 + 1 + 10
## [1] 12
# Subtraction
10 - 5 - 1
## [1] 4
# Multiplication
3 * 2 * 4
## [1] 24
# Division
10 / 5
## [1] 2
# Exponent, raises the first number to the power of the second
2^4
## [1] 16
# Modulus, returns the remainder after the division
17%%4
## [1] 1
# Square root
sqrt(25)
## [1] 5
# round a number
round(3.14159)
## [1] 3
# rounding to 2 digits
round(3.14159, digits = 2)
## [1] 3.14
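Two details are worth a quick check alongside the operators above: integer division (%/%) is the companion of the modulus, and R follows the usual order of operations, which parentheses can override. A small sketch:

```r
# Integer division drops the remainder; the modulus returns it
17 %/% 4                      # 4
17 %% 4                       # 1
(17 %/% 4) * 4 + (17 %% 4)    # 17, the two pieces rebuild the original

# Multiplication happens before addition unless parentheses say otherwise
2 + 3 * 4                     # 14
(2 + 3) * 4                   # 20

# Exponentiation is applied before the minus sign
-2^2                          # -4
(-2)^2                        # 4
```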
Creating Objects / Basic Functions
What are known as objects in R are known as variables in many other programming languages. Depending on the context, object and variable can have drastically different meanings. However, in this lesson, the two words are used synonymously. For more information see: [https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects]
Basics
# Assign 2 to x
x <- 2
x
## [1] 2
# The arrow operator can also be used in reverse direction
3 -> z
z
## [1] 3
# Subtract z from k and save it as qv
k = 89
qv = k - z
qv
## [1] 86
# The arrow operator can be used to assign multiple variables at a time
a <- b <- 7
a
## [1] 7
b
## [1] 7
# Sometimes we can also use a more explicit assignment function, assign()
assign("j", 4)
j
## [1] 4
# Remove a variable
j = 8
remove(j)
You can force R to print the value by using parentheses or by typing the object name:
weight_kg <- 55 # doesn't print anything
(weight_kg <- 55) # but putting parentheses around the assignment prints the value of `weight_kg`
## [1] 55
weight_kg # and so does typing the name of the object
## [1] 55
Now that R has weight_kg in memory, we can do arithmetic with it. For instance, we may want to convert this weight into pounds (weight in pounds is 2.2 times the weight in kg):
weight_lb <- 2.2 * weight_kg
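One point that often surprises newcomers: an object stores the value computed at the time of assignment, so changing weight_kg later does not update weight_lb. A short sketch of this behaviour:

```r
weight_kg <- 55
weight_lb <- 2.2 * weight_kg
weight_lb                     # prints 121

# Changing weight_kg does not change weight_lb ...
weight_kg <- 100
weight_lb                     # still prints 121

# ... until the conversion is run again
weight_lb <- 2.2 * weight_kg
weight_lb                     # prints 220
```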
Vectors
A vector is the most common and basic data structure in R, and is pretty much the workhorse of R. A vector is composed of a series of values, which can be either numbers or characters. We can assign a series of values to a vector using the c() function. For example, we can create a vector of ages and assign it to a new object age_g:
(age_g <- c(21, 34, 39, 54, 55))
## [1] 21 34 39 54 55
There are many functions that allow you to inspect the content of a vector.
- length() tells you how many elements are in a particular vector
- class() indicates the class (the type of element) of an object
- str() provides an overview of the structure of an object and its elements. It is a useful function when working with large and complex objects:
length(age_g)
## [1] 5
class(age_g)
## [1] "numeric"
str(age_g)
## num [1:5] 21 34 39 54 55
A vector can also contain characters:
animals <- c("mouse", "rat", "dog", "frog")
class(animals)
## [1] "character"
Lastly, we will introduce a vector with logical values (the boolean data type).
has_tail <- c(TRUE, TRUE, TRUE, FALSE)
has_tail
## [1] TRUE TRUE TRUE FALSE
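Vectors become useful once you start pulling elements out of them. The sketch below reuses the three vectors defined above and shows indexing by position and by a logical condition. Summary functions such as mean() work on the whole vector at once.

```r
age_g    <- c(21, 34, 39, 54, 55)
animals  <- c("mouse", "rat", "dog", "frog")
has_tail <- c(TRUE, TRUE, TRUE, FALSE)

age_g[2]               # second element: 34
age_g[age_g > 35]      # elements greater than 35: 39 54 55
mean(age_g)            # average of all five values: 40.6

# A logical vector can pick elements out of another vector of the same length
animals[has_tail]      # "mouse" "rat" "dog"
```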
2. Data Management
Case Study: Titanic Dataset
Install the titanic data package using install.packages("titanic").
Importing Data
library(titanic)
data <- titanic_train
Attach your data: the attach() function lets you refer to the variables in a data frame by name, without prefixing them with the name of the data frame.
attach(data)
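attach() is convenient, but attached column names can be masked by other objects that share the same name. As an alternative sketch, the $ operator and with() reach a column without attaching anything. The small data frame toy is made up for illustration.

```r
# Made-up example data, not the Titanic dataset itself
toy <- data.frame(Age = c(22, 38, 26), Fare = c(7.25, 71.2833, 7.925))

toy$Age                 # one column, addressed through the data frame: 22 38 26
with(toy, max(Fare))    # evaluate an expression using toy's columns: 71.2833
```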
Inspecting Data
If you like to “look” at your data, RStudio can open a spreadsheet viewer, but it is often easier to look in the console.
head(data) # only prints the first 6 rows
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
The str() function examines the structure of a data object rather than providing a statistical summary.
str(data)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
We can also print just the column names using names(), the dimensions using dim(), or the number of rows or columns using nrow() or ncol()
names(data)
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked"
dim(data)
## [1] 891 12
nrow(data)
## [1] 891
ncol(data)
## [1] 12
Summary statistics
summary(data)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
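summary() reports the NA count separately, but individual functions such as mean() return NA when any value is missing unless you pass na.rm = TRUE. A sketch using the first six Age values printed by head(data) earlier:

```r
# First six Age values from the Titanic data, including one missing value
age_first6 <- c(22, 38, 26, 35, 35, NA)

mean(age_first6)                    # NA, because one value is missing
mean(age_first6, na.rm = TRUE)      # 31.2, computed from the five known values
median(age_first6, na.rm = TRUE)    # 35
```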
Missing Values
R handles missing values differently than some other programs, including Stata. Missing values will appear as NA (whereas Stata displays them as . and treats them as larger than any number). Note, though, that NA is not a string; it is a special constant. If you try to conduct logical tests with NA, such as comparing a value to NA with ==, the result is NA rather than TRUE or FALSE.
You have several options for dealing with NA values.
- na.omit() or na.exclude() will delete the rows that contain missing values from your dataset
- na.fail() returns the object only if no missing values are present, and signals an error otherwise
- is.na() allows you to logically test for NA values, for example when subsetting
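To make the behaviour of NA in logical tests concrete, the sketch below uses a small made-up vector. Comparing against NA with == returns NA for every element, which is why is.na() is the function to use.

```r
x <- c(1, NA, 3)      # made-up vector with one missing value

x == NA               # NA NA NA, never TRUE or FALSE
x > 2                 # FALSE NA TRUE
is.na(x)              # FALSE TRUE FALSE
x[!is.na(x)]          # keep only the known values: 1 3
```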
How many missing values do we have?
sum(is.na(data))
## [1] 177
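The total does not say which columns the missing values are in. One way to see that, sketched here on a made-up data frame rather than the Titanic data, is to combine is.na() with colSums():

```r
# Made-up example data with missing values in two columns
toy <- data.frame(
  Age  = c(22, NA, 26, NA),
  Fare = c(7.25, 71.2833, NA, 8.05)
)

sum(is.na(toy))        # total number of missing values: 3
colSums(is.na(toy))    # missing values per column: Age 2, Fare 1
```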
Extracting the position of each NA value:
which(is.na(data), arr.ind=TRUE)
## row col
## [1,] 6 6
## [2,] 18 6
## [3,] 20 6
## [4,] 27 6
## [5,] 29 6
## [6,] 30 6
## [7,] 32 6
## [8,] 33 6
## [9,] 37 6
## [10,] 43 6
## [11,] 46 6
## [12,] 47 6
## [13,] 48 6
## [14,] 49 6
## [15,] 56 6
## [16,] 65 6
## [17,] 66 6
## [18,] 77 6
## [19,] 78 6
## [20,] 83 6
## [21,] 88 6
## [22,] 96 6
## [23,] 102 6
## [24,] 108 6
## [25,] 110 6
## [26,] 122 6
## [27,] 127 6
## [28,] 129 6
## [29,] 141 6
## [30,] 155 6
## [31,] 159 6
## [32,] 160 6
## [33,] 167 6
## [34,] 169 6
## [35,] 177 6
## [36,] 181 6
## [37,] 182 6
## [38,] 186 6
## [39,] 187 6
## [40,] 197 6
## [41,] 199 6
## [42,] 202 6
## [43,] 215 6
## [44,] 224 6
## [45,] 230 6
## [46,] 236 6
## [47,] 241 6
## [48,] 242 6
## [49,] 251 6
## [50,] 257 6
## [51,] 261 6
## [52,] 265 6
## [53,] 271 6
## [54,] 275 6
## [55,] 278 6
## [56,] 285 6
## [57,] 296 6
## [58,] 299 6
## [59,] 301 6
## [60,] 302 6
## [61,] 304 6
## [62,] 305 6
## [63,] 307 6
## [64,] 325 6
## [65,] 331 6
## [66,] 335 6
## [67,] 336 6
## [68,] 348 6
## [69,] 352 6
## [70,] 355 6
## [71,] 359 6
## [72,] 360 6
## [73,] 365 6
## [74,] 368 6
## [75,] 369 6
## [76,] 376 6
## [77,] 385 6
## [78,] 389 6
## [79,] 410 6
## [80,] 411 6
## [81,] 412 6
## [82,] 414 6
## [83,] 416 6
## [84,] 421 6
## [85,] 426 6
## [86,] 429 6
## [87,] 432 6
## [88,] 445 6
## [89,] 452 6
## [90,] 455 6
## [91,] 458 6
## [92,] 460 6
## [93,] 465 6
## [94,] 467 6
## [95,] 469 6
## [96,] 471 6
## [97,] 476 6
## [98,] 482 6
## [99,] 486 6
## [100,] 491 6
## [101,] 496 6
## [102,] 498 6
## [103,] 503 6
## [104,] 508 6
## [105,] 512 6
## [106,] 518 6
## [107,] 523 6
## [108,] 525 6
## [109,] 528 6
## [110,] 532 6
## [111,] 534 6
## [112,] 539 6
## [113,] 548 6
## [114,] 553 6
## [115,] 558 6
## [116,] 561 6
## [117,] 564 6
## [118,] 565 6
## [119,] 569 6
## [120,] 574 6
## [121,] 579 6
## [122,] 585 6
## [123,] 590 6
## [124,] 594 6
## [125,] 597 6
## [126,] 599 6
## [127,] 602 6
## [128,] 603 6
## [129,] 612 6
## [130,] 613 6
## [131,] 614 6
## [132,] 630 6
## [133,] 634 6
## [134,] 640 6
## [135,] 644 6
## [136,] 649 6
## [137,] 651 6
## [138,] 654 6
## [139,] 657 6
## [140,] 668 6
## [141,] 670 6
## [142,] 675 6
## [143,] 681 6
## [144,] 693 6
## [145,] 698 6
## [146,] 710 6
## [147,] 712 6
## [148,] 719 6
## [149,] 728 6
## [150,] 733 6
## [151,] 739 6
## [152,] 740 6
## [153,] 741 6
## [154,] 761 6
## [155,] 767 6
## [156,] 769 6
## [157,] 774 6
## [158,] 777 6
## [159,] 779 6
## [160,] 784 6
## [161,] 791 6
## [162,] 793 6
## [163,] 794 6
## [164,] 816 6
## [165,] 826 6
## [166,] 827 6
## [167,] 829 6
## [168,] 833 6
## [169,] 838 6
## [170,] 840 6
## [171,] 847 6
## [172,] 850 6
## [173,] 860 6
## [174,] 864 6
## [175,] 869 6
## [176,] 879 6
## [177,] 889 6
Here we use the na.omit() function to remove all rows that contain NA values and save the result in a new data frame called data_NA.
data_NA <- na.omit(data)
Now when we run the is.na() function on data_NA, we see that no NA values remain.
sum(is.na(data_NA))
## [1] 0
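Keep in mind that na.omit() removes a row if any of its columns is missing. If only one variable matters for your analysis, a narrower filter keeps more data. The sketch below shows the difference on a made-up data frame:

```r
# Made-up example data: one row is missing Age, another is missing Fare
toy <- data.frame(
  Age  = c(22, NA, 26, 35),
  Fare = c(7.25, 71.2833, NA, 53.1)
)

nrow(na.omit(toy))                     # 2 rows survive, both columns complete

age_known <- toy[!is.na(toy$Age), ]    # drop rows only where Age is missing
nrow(age_known)                        # 3 rows survive
```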
Subsetting by rows
We use the filter() function from dplyr to subset the data frame by rows.
#This creates a dataset with passengers that are 22 years old.
age_22 <- filter(data, Age == 22); head(age_22)
## PassengerId Survived Pclass Name Sex Age
## 1 1 0 3 Braund, Mr. Owen Harris male 22
## 2 61 0 3 Sirayanian, Mr. Orsen male 22
## 3 81 0 3 Waelens, Mr. Achille male 22
## 4 113 0 3 Barton, Mr. David John male 22
## 5 142 1 3 Nysten, Miss. Anna Sofia female 22
## 6 152 1 1 Pears, Mrs. Thomas (Edith Wearne) female 22
## SibSp Parch Ticket Fare Cabin Embarked
## 1 1 0 A/5 21171 7.2500 S
## 2 0 0 2669 7.2292 C
## 3 0 0 345767 9.0000 S
## 4 0 0 324669 8.0500 S
## 5 0 0 347081 7.7500 S
## 6 1 0 113776 66.6000 C2 S
#This creates a dataset with passengers that did not survive.
df_surv <- filter(data, Survived == 0); head(df_surv)
## PassengerId Survived Pclass Name Sex Age SibSp
## 1 1 0 3 Braund, Mr. Owen Harris male 22 1
## 2 5 0 3 Allen, Mr. William Henry male 35 0
## 3 6 0 3 Moran, Mr. James male NA 0
## 4 7 0 1 McCarthy, Mr. Timothy J male 54 0
## 5 8 0 3 Palsson, Master. Gosta Leonard male 2 3
## 6 13 0 3 Saundercock, Mr. William Henry male 20 0
## Parch Ticket Fare Cabin Embarked
## 1 0 A/5 21171 7.2500 S
## 2 0 373450 8.0500 S
## 3 0 330877 8.4583 Q
## 4 0 17463 51.8625 E46 S
## 5 1 349909 21.0750 S
## 6 0 A/5. 2151 8.0500 S
#This creates a dataset with passengers that are male and survived.
male_surv <- filter(data, Sex == 'male' & Survived == 1)
head(male_surv)
## PassengerId Survived Pclass Name Sex Age SibSp Parch
## 1 18 1 2 Williams, Mr. Charles Eugene male NA 0 0
## 2 22 1 2 Beesley, Mr. Lawrence male 34 0 0
## 3 24 1 1 Sloper, Mr. William Thompson male 28 0 0
## 4 37 1 3 Mamee, Mr. Hanna male NA 0 0
## 5 56 1 1 Woolner, Mr. Hugh male NA 0 0
## 6 66 1 3 Moubarek, Master. Gerios male NA 1 1
## Ticket Fare Cabin Embarked
## 1 244373 13.0000 S
## 2 248698 13.0000 D56 S
## 3 113788 35.5000 A6 S
## 4 2677 7.2292 C
## 5 19947 35.5000 C52 S
## 6 2661 15.2458 C
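For comparison, the same kind of row subset can be written in base R with subset(), which needs no extra packages. This is an alternative, not what the code in this lesson uses, and the data frame toy is made up for the example. Like filter(), subset() drops rows where the condition evaluates to NA.

```r
# Made-up example data, loosely modelled on the Titanic columns
toy <- data.frame(
  Sex      = c("male", "female", "male", "male"),
  Survived = c(1, 1, 0, 1),
  Age      = c(34, 38, 22, NA)
)

# Male passengers who survived: keeps rows 1 and 4
male_surv_toy <- subset(toy, Sex == "male" & Survived == 1)
male_surv_toy

# Passengers older than 30: keeps rows 1 and 2, the row with NA age is dropped
subset(toy, Age > 30)
```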
Subsetting by columns
To subset by columns, we use the select() function.
#This creates a dataset with only the columns for Age and Fare
Sample.DF <- select(data, Age, Fare)
head(Sample.DF) # Notice there are only 2 columns now
## Age Fare
## 1 22 7.2500
## 2 38 71.2833
## 3 26 7.9250
## 4 35 53.1000
## 5 35 8.0500
## 6 NA 8.4583
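In base R, the equivalent column subset is written with square brackets and a vector of column names. The sketch below uses a made-up data frame and is offered as an alternative to select(), not a replacement for it.

```r
# Made-up example data with three columns
toy <- data.frame(
  Age  = c(22, 38),
  Fare = c(7.25, 71.2833),
  Sex  = c("male", "female")
)

# Keep only the Age and Fare columns (rows left blank means "all rows")
toy_small <- toy[, c("Age", "Fare")]
names(toy_small)      # "Age" "Fare"
```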
Sorting by Columns
For sorting by columns, we use the arrange() function.
# Sorting Sample.DF by Age
sorted.df <- arrange(Sample.DF, Age)
head(sorted.df)
## Age Fare
## 1 0.42 8.5167
## 2 0.67 14.5000
## 3 0.75 19.2583
## 4 0.75 19.2583
## 5 0.83 29.0000
## 6 0.83 18.7500
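The base R counterpart of this sort uses order(), sketched below on a made-up data frame. Setting decreasing = TRUE reverses the direction, and missing values are placed last by default.

```r
# Made-up example data with one missing age
toy <- data.frame(
  Age  = c(35, NA, 22, 0.42),
  Fare = c(8.05, 8.4583, 7.25, 8.5167)
)

# order() returns the row positions that put Age in ascending order
sorted_up <- toy[order(toy$Age), ]
sorted_up$Age         # 0.42 22 35 NA

# Descending order; NA still comes last
sorted_down <- toy[order(toy$Age, decreasing = TRUE), ]
sorted_down$Age       # 35 22 0.42 NA
```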
Create new variables
To create new variables, we use the mutate() function.
The most common uses for this are:
- calculating a new number from an existing number (or numbers),
- creating categorical variables from numeric ones
# We want to create a variable (young_pasngrs) that is TRUE when the passenger is younger than 30.
df.young <- mutate(data, young_pasngrs = Age < 30)
head(df.young)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked young_pasngrs
## 1 A/5 21171 7.2500 S TRUE
## 2 PC 17599 71.2833 C85 C FALSE
## 3 STON/O2. 3101282 7.9250 S TRUE
## 4 113803 53.1000 C123 S FALSE
## 5 373450 8.0500 S FALSE
## 6 330877 8.4583 Q NA
# We can see that the new data frame (df.young) has the new logical variable young_pasngrs
In the case above, Age < 30 created a logical (i.e., TRUE/FALSE) vector indicating whether each element of the column Age was less than 30. Passengers with a missing Age get NA.
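For the second use listed earlier, creating a categorical variable from a numeric one, here is a base R sketch using ifelse() on a made-up data frame. The labels "under 30" and "30 or older" are arbitrary choices for the example. The same ifelse() call could also be used inside mutate().

```r
# Made-up example data with one missing age
toy <- data.frame(Age = c(22, 38, NA))

# Label each passenger by age group; a missing Age stays NA
toy$age_group <- ifelse(toy$Age < 30, "under 30", "30 or older")
toy$age_group         # "under 30" "30 or older" NA

# Count passengers per group, showing the NA group as well
table(toy$age_group, useNA = "ifany")
```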
3. Visualizing Data
One of the most powerful features of R is its ability to produce a wide range of graphics to quickly and easily visualize data. Plots can be replicated, modified, and even made publication-ready with just a handful of commands.
Scatter Plot
plot(Age, Fare)
Histogram
hist(Pclass)
Line Chart
plot(Age, Fare, type = "l")
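The calls above use the default titles and axis labels. As a sketch of how a few extra arguments tidy up a plot, the example below uses a small made-up data frame; main, xlab, and ylab are standard arguments of plot(). For the line chart the rows are sorted by Age first, because type = "l" connects the points in the order they appear in the data.

```r
# Made-up example data, not the Titanic dataset itself
toy <- data.frame(
  Age  = c(35, 22, 54, 2, 27),
  Fare = c(53.1, 7.25, 51.8625, 21.075, 11.1333)
)

# Scatter plot with a title and axis labels
plot(toy$Age, toy$Fare,
     main = "Fare by Age", xlab = "Age (years)", ylab = "Fare")

# Sort by Age so the line runs left to right instead of zigzagging
sorted_toy <- toy[order(toy$Age), ]
plot(sorted_toy$Age, sorted_toy$Fare, type = "l",
     main = "Fare by Age", xlab = "Age (years)", ylab = "Fare")
```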