DATA 5070 | Session 2: Dataset Management

Aayush Sethi

2021-09-19

Introduction to R

Welcome to the R segment of DATA 5070. We will spend the next several weeks learning basic coding techniques, data management, and data modeling in R.

Why Learn R?

First and foremost, R is free.
Unlike Python, R is specifically designed for conducting statistical analyses and managing data.
The R project is open-source.
It is a great tool for creating dynamic reports (including this one).

Resources

Since R is open-source software, there is a large community built around its functions and packages. I highly encourage you to look up syntax or particular errors that may arise using these websites:

Stack Overflow
R Bloggers
Google :)

1. Getting Started

Saving your work

Before we get started

In R, the working directory is the place on your computer (or the cloud) where R looks for input data and writes output files or graphics. Be sure to set a working directory at the beginning of each session by calling setwd() with the file path of your project folder. Note that the form of the file path differs depending on your operating system.

# Check the current working directory
getwd()

# Mac/Unix
setwd("/Users/aayush/Desktop/TA_5070")

# Windows
setwd("C:/aayush/Desktop/TA_5070")
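As a minimal sketch of the pattern above, the snippet below switches the working directory and then restores it. It uses tempdir() as a stand-in project folder (an assumption made so the example runs on any machine), and the folder and file names passed to file.path() are hypothetical.

```r
# setwd() invisibly returns the previous directory, so we can restore it later
old_wd <- setwd(tempdir())

# Confirm the change took effect
getwd()

# file.path() joins path components with "/" on every operating system
# ("TA_5070" and "titanic.csv" are hypothetical names used for illustration)
example_path <- file.path("TA_5070", "titanic.csv")
example_path

# Return to the original working directory
setwd(old_wd)
```

Building paths with file.path() rather than typing separators by hand is one way to keep a script portable between Mac/Unix and Windows.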

Loading Packages

Before you use a package for the first time, you will need to install it from CRAN using the install.packages() command.

install.packages("dplyr")

Once you have installed the package, you will need to load it with library() in each R session or script file where you want to use it.

library(dplyr)
library(tidyverse)
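Because installation is needed only once but loading is needed every session, a script can check whether a package is already installed before installing it. The helper below is a hypothetical convenience function (ensure_installed is not part of R), sketched with the base function requireNamespace().

```r
# Hypothetical helper (not part of base R): install a package only if it is missing
ensure_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package is available after the attempt
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this call returns TRUE without installing anything
ensure_installed("stats")
```

In a real script you might call ensure_installed("dplyr") before library(dplyr).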

Getting Help

R provides built-in help for functions and packages.

To receive help with a specific function, you can use:

help(functionname)
# or
?functionname

# For an overview of an installed package
help(package = "packagename")

Basic Mathematical Operations

# Addition
1 + 1 + 10

## [1] 12

# Subtraction
10 - 5 - 1

## [1] 4

# Multiplication
3 * 2 * 4

## [1] 24

# Division
10 / 5

## [1] 2

# Exponent, raises the first number to the power of the second
2^4

## [1] 16

# Modulus, returns the remainder after the division
17%%4

## [1] 1

# Square root
sqrt(25)

## [1] 5

# round a number
round(3.14159)

## [1] 3

# rounding to 2 digits 
round(3.14159, digits = 2)

## [1] 3.14
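The operators above can be combined, and it helps to know how they interact. The short sketch below adds integer division (%/%, which is not used elsewhere in this handout) to show how it pairs with the modulus operator, and illustrates operator precedence with exponents.

```r
# Integer division keeps the whole-number part of the quotient
17 %/% 4                     # 4

# Modulus keeps the remainder
17 %% 4                      # 1

# The two results recombine into the original number
4 * (17 %/% 4) + (17 %% 4)   # 17

# Exponentiation binds tighter than the unary minus, so this is -(2^2)
-2^2                         # -4

# Parentheses make the intended order explicit
(-2)^2                       # 4
```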

Creating Objects / Basic Functions

What are known as objects in R are known as variables in many other programming languages. Depending on the context, object and variable can have drastically different meanings. However, in this lesson, the two words are used synonymously. For more information see: [https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects]

Basics

# Assign 2 to x
x <- 2
x

## [1] 2

# The arrow operator can also be used in reverse direction
3 -> z
z

## [1] 3

# Subtract z from k and save it as qv
k = 89
qv = k - z
qv

## [1] 86

# The arrow operator can be chained to assign the same value to multiple variables at once
a <- b <- 7
a

## [1] 7

b

## [1] 7

# Sometimes we can also use a more complex assignment statement, assign()
assign("j", 4)
j

## [1] 4

# Remove a variable 
j = 8 
remove(j)

You can force R to print the value by using parentheses or by typing the object name:

weight_kg <- 55    # doesn't print anything
(weight_kg <- 55)  # but putting parentheses around the call prints the value of `weight_kg`

## [1] 55

weight_kg          # and so does typing the name of the object

## [1] 55

Now that R has weight_kg in memory, we can do arithmetic with it. For instance, we may want to convert this weight into pounds (weight in pounds is 2.2 times the weight in kg):

weight_lb <- 2.2 * weight_kg
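One point worth seeing in action: an object stores the result of a calculation, not the formula that produced it. The sketch below reuses the names from the example above and shows that changing weight_kg afterwards does not update weight_lb.

```r
weight_kg <- 55
weight_lb <- 2.2 * weight_kg   # prints as 121

# Reassigning weight_kg does not recompute weight_lb
weight_kg <- 100
weight_lb                      # still prints as 121

# Rerun the calculation to refresh the stored result
weight_lb <- 2.2 * weight_kg
weight_lb                      # prints as 220
```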

Vectors

A vector is the most common and basic data type in R, and is pretty much the workhorse of R. A vector is composed of a series of values, which can be either numbers or characters. We can assign a series of values to a vector using the c() function. For example, we can create a vector of ages and assign it to a new object age_g:

(age_g <- c(21, 34, 39, 54, 55))

## [1] 21 34 39 54 55

There are many functions that allow you to inspect the content of a vector.

length() tells you how many elements are in a particular vector
class() function indicates the class (the type of element) of an object
str() provides an overview of the structure of an object and its elements. It is a useful function when working with large and complex objects:

length(age_g)

## [1] 5

class(age_g)

## [1] "numeric"

str(age_g)

##  num [1:5] 21 34 39 54 55

A vector can also contain characters:

animals <- c("mouse", "rat", "dog", "frog")
class(animals)

## [1] "character"

Lastly, we will introduce a vector with logical values (the boolean data type).

has_tail <- c(TRUE, TRUE, TRUE, FALSE)
has_tail

## [1]  TRUE  TRUE  TRUE FALSE
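The three vectors above can be combined to show a few common vector operations. This is a short sketch using only base R: element-wise arithmetic, subsetting with square brackets, and the type coercion that happens when values of different types are mixed.

```r
age_g    <- c(21, 34, 39, 54, 55)
animals  <- c("mouse", "rat", "dog", "frog")
has_tail <- c(TRUE, TRUE, TRUE, FALSE)

# Arithmetic is applied element by element
age_g + 1                # 22 35 40 55 56

# Square brackets pick elements by position (R counts from 1)
animals[2]               # "rat"

# A logical vector keeps the elements where it is TRUE
animals[has_tail]        # "mouse" "rat" "dog"

# Comparisons return a logical vector, which can be used for subsetting
age_g[age_g > 35]        # 39 54 55

# Mixing types coerces everything to a common type, here character
class(c(1, "a", TRUE))   # "character"
```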

2. Data Management

Case Study: Titanic Dataset

Install the titanic data package using install.packages("titanic").

Importing Data

library(titanic)
data <- titanic_train

Attach your data: the attach() function makes the variables in a data frame accessible by name, without having to refer to the data frame each time (for example, Age instead of data$Age).

attach(data)
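attach() is convenient, but attached column names can clash with other objects of the same name. As a hedged alternative, the sketch below shows two ways to reach columns without attaching, using a small made-up data frame (passengers is a stand-in, not the Titanic data), followed by detach() to undo an attach.

```r
# Small made-up data frame standing in for the Titanic data
passengers <- data.frame(
  Age  = c(22, 38, 26),
  Fare = c(7.25, 71.28, 7.92)
)

# Option 1: refer to a column explicitly with $
mean_age <- mean(passengers$Age)

# Option 2: with() evaluates an expression using the data frame's columns
mean_fare <- with(passengers, mean(Fare))

# If you do use attach(), detach() undoes it when you are finished
attach(passengers)
mean(Age)
detach(passengers)
```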

Inspecting Data

If you would like to “look” at your data, RStudio can open a spreadsheet viewer, but it is often easier to look in the console.

head(data)  # only prints the first 6 rows

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q

The str() function examines the structure of a data object rather than providing a statistical summary.

str(data)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

We can also print just the column names using names(), the dimensions using dim(), or the number of rows or columns using nrow() or ncol().

names(data)

##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"

dim(data)

## [1] 891  12

nrow(data)

## [1] 891

ncol(data)

## [1] 12

Summary statistics

summary(data)

##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
##

Missing Values

R handles missing values differently than some other programs, including Stata. In R, missing values appear as NA (whereas Stata displays a missing numeric value as . and treats it as larger than any number). Note, though, that NA is not a string; it is a special constant. If you compare a value with NA in a logical test such as x == NA, the result is NA rather than TRUE or FALSE, so use is.na() instead.

You have several options for dealing with NA values.

na.omit() or na.exclude() will delete every row that contains a missing value
na.fail() returns the object only if no missing values are present, and stops with an error otherwise
is.na() allows you to logically test for NA values, for example when subsetting
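The behaviour of NA in logical tests and summaries is easiest to see on a small vector. The sketch below uses a made-up vector called ages (an illustrative name, not part of the Titanic data) and only base R functions.

```r
ages <- c(22, 38, NA, 35)

# Comparing with NA gives NA, not TRUE or FALSE
ages == NA                 # NA NA NA NA

# is.na() is the reliable test
is.na(ages)                # FALSE FALSE  TRUE FALSE

# Many summary functions return NA unless told to drop missing values
mean(ages)                 # NA
mean(ages, na.rm = TRUE)   # 31.66667

# Subsetting with is.na() keeps only the observed values
ages[!is.na(ages)]         # 22 38 35
```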

How many missing values do we have?

sum(is.na(data))

## [1] 177

Extracting the position of each NA value:

which(is.na(data), arr.ind=TRUE)

##        row col
##   [1,]   6   6
##   [2,]  18   6
##   [3,]  20   6
##   [4,]  27   6
##   [5,]  29   6
##   [6,]  30   6
##   [7,]  32   6
##   [8,]  33   6
##   [9,]  37   6
##  [10,]  43   6
##  [11,]  46   6
##  [12,]  47   6
##  [13,]  48   6
##  [14,]  49   6
##  [15,]  56   6
##  [16,]  65   6
##  [17,]  66   6
##  [18,]  77   6
##  [19,]  78   6
##  [20,]  83   6
##  [21,]  88   6
##  [22,]  96   6
##  [23,] 102   6
##  [24,] 108   6
##  [25,] 110   6
##  [26,] 122   6
##  [27,] 127   6
##  [28,] 129   6
##  [29,] 141   6
##  [30,] 155   6
##  [31,] 159   6
##  [32,] 160   6
##  [33,] 167   6
##  [34,] 169   6
##  [35,] 177   6
##  [36,] 181   6
##  [37,] 182   6
##  [38,] 186   6
##  [39,] 187   6
##  [40,] 197   6
##  [41,] 199   6
##  [42,] 202   6
##  [43,] 215   6
##  [44,] 224   6
##  [45,] 230   6
##  [46,] 236   6
##  [47,] 241   6
##  [48,] 242   6
##  [49,] 251   6
##  [50,] 257   6
##  [51,] 261   6
##  [52,] 265   6
##  [53,] 271   6
##  [54,] 275   6
##  [55,] 278   6
##  [56,] 285   6
##  [57,] 296   6
##  [58,] 299   6
##  [59,] 301   6
##  [60,] 302   6
##  [61,] 304   6
##  [62,] 305   6
##  [63,] 307   6
##  [64,] 325   6
##  [65,] 331   6
##  [66,] 335   6
##  [67,] 336   6
##  [68,] 348   6
##  [69,] 352   6
##  [70,] 355   6
##  [71,] 359   6
##  [72,] 360   6
##  [73,] 365   6
##  [74,] 368   6
##  [75,] 369   6
##  [76,] 376   6
##  [77,] 385   6
##  [78,] 389   6
##  [79,] 410   6
##  [80,] 411   6
##  [81,] 412   6
##  [82,] 414   6
##  [83,] 416   6
##  [84,] 421   6
##  [85,] 426   6
##  [86,] 429   6
##  [87,] 432   6
##  [88,] 445   6
##  [89,] 452   6
##  [90,] 455   6
##  [91,] 458   6
##  [92,] 460   6
##  [93,] 465   6
##  [94,] 467   6
##  [95,] 469   6
##  [96,] 471   6
##  [97,] 476   6
##  [98,] 482   6
##  [99,] 486   6
## [100,] 491   6
## [101,] 496   6
## [102,] 498   6
## [103,] 503   6
## [104,] 508   6
## [105,] 512   6
## [106,] 518   6
## [107,] 523   6
## [108,] 525   6
## [109,] 528   6
## [110,] 532   6
## [111,] 534   6
## [112,] 539   6
## [113,] 548   6
## [114,] 553   6
## [115,] 558   6
## [116,] 561   6
## [117,] 564   6
## [118,] 565   6
## [119,] 569   6
## [120,] 574   6
## [121,] 579   6
## [122,] 585   6
## [123,] 590   6
## [124,] 594   6
## [125,] 597   6
## [126,] 599   6
## [127,] 602   6
## [128,] 603   6
## [129,] 612   6
## [130,] 613   6
## [131,] 614   6
## [132,] 630   6
## [133,] 634   6
## [134,] 640   6
## [135,] 644   6
## [136,] 649   6
## [137,] 651   6
## [138,] 654   6
## [139,] 657   6
## [140,] 668   6
## [141,] 670   6
## [142,] 675   6
## [143,] 681   6
## [144,] 693   6
## [145,] 698   6
## [146,] 710   6
## [147,] 712   6
## [148,] 719   6
## [149,] 728   6
## [150,] 733   6
## [151,] 739   6
## [152,] 740   6
## [153,] 741   6
## [154,] 761   6
## [155,] 767   6
## [156,] 769   6
## [157,] 774   6
## [158,] 777   6
## [159,] 779   6
## [160,] 784   6
## [161,] 791   6
## [162,] 793   6
## [163,] 794   6
## [164,] 816   6
## [165,] 826   6
## [166,] 827   6
## [167,] 829   6
## [168,] 833   6
## [169,] 838   6
## [170,] 840   6
## [171,] 847   6
## [172,] 850   6
## [173,] 860   6
## [174,] 864   6
## [175,] 869   6
## [176,] 879   6
## [177,] 889   6

Here we use the na.omit() function to remove all rows that contain NA values and save the result in a new data frame called data_NA.

data_NA <- na.omit(data)

Now when we run the is.na() function, we see that all NA values are removed from data_NA.

sum(is.na(data_NA))

## [1] 0
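In the Titanic data, all 177 missing values sit in column 6 (Age), so na.omit() drops 177 rows. When missing values are spread over several columns, it can help to count them per column before deleting anything. The sketch below does this on a made-up data frame (df and its columns are illustrative) using base R only.

```r
# Made-up data frame with missing values in two different columns
df <- data.frame(
  id   = 1:4,
  age  = c(22, NA, 26, NA),
  fare = c(7.25, 71.28, NA, 8.05)
)

# Count missing values column by column
colSums(is.na(df))      # id 0, age 2, fare 1

# complete.cases() flags rows with no missing values
complete.cases(df)      # TRUE FALSE FALSE FALSE

# na.omit() drops every row that has at least one NA
nrow(na.omit(df))       # 1
```

Note that three of the four rows are lost here even though only three individual values are missing, which is why row-wise deletion should be a deliberate choice.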

Subsetting by rows

We use the filter() function from the dplyr package to subset the data frame by rows.

#This creates a dataset with passengers that are 22 years old.
age_22 <- filter(data, Age == 22); head(age_22)

##   PassengerId Survived Pclass                              Name    Sex Age
## 1           1        0      3           Braund, Mr. Owen Harris   male  22
## 2          61        0      3             Sirayanian, Mr. Orsen   male  22
## 3          81        0      3              Waelens, Mr. Achille   male  22
## 4         113        0      3            Barton, Mr. David John   male  22
## 5         142        1      3          Nysten, Miss. Anna Sofia female  22
## 6         152        1      1 Pears, Mrs. Thomas (Edith Wearne) female  22
##   SibSp Parch    Ticket    Fare Cabin Embarked
## 1     1     0 A/5 21171  7.2500              S
## 2     0     0      2669  7.2292              C
## 3     0     0    345767  9.0000              S
## 4     0     0    324669  8.0500              S
## 5     0     0    347081  7.7500              S
## 6     1     0    113776 66.6000    C2        S

#This creates a dataset with passengers that did not survive. 
df_surv <- filter(data, Survived == 0); head(df_surv)

##   PassengerId Survived Pclass                           Name  Sex Age SibSp
## 1           1        0      3        Braund, Mr. Owen Harris male  22     1
## 2           5        0      3       Allen, Mr. William Henry male  35     0
## 3           6        0      3               Moran, Mr. James male  NA     0
## 4           7        0      1        McCarthy, Mr. Timothy J male  54     0
## 5           8        0      3 Palsson, Master. Gosta Leonard male   2     3
## 6          13        0      3 Saundercock, Mr. William Henry male  20     0
##   Parch    Ticket    Fare Cabin Embarked
## 1     0 A/5 21171  7.2500              S
## 2     0    373450  8.0500              S
## 3     0    330877  8.4583              Q
## 4     0     17463 51.8625   E46        S
## 5     1    349909 21.0750              S
## 6     0 A/5. 2151  8.0500              S

#This creates a dataset with passengers that are male and survived. 

male_surv <- filter(data, Sex == 'male' & Survived == 1)
head(male_surv)

##   PassengerId Survived Pclass                         Name  Sex Age SibSp Parch
## 1          18        1      2 Williams, Mr. Charles Eugene male  NA     0     0
## 2          22        1      2        Beesley, Mr. Lawrence male  34     0     0
## 3          24        1      1 Sloper, Mr. William Thompson male  28     0     0
## 4          37        1      3             Mamee, Mr. Hanna male  NA     0     0
## 5          56        1      1            Woolner, Mr. Hugh male  NA     0     0
## 6          66        1      3     Moubarek, Master. Gerios male  NA     1     1
##   Ticket    Fare Cabin Embarked
## 1 244373 13.0000              S
## 2 248698 13.0000   D56        S
## 3 113788 35.5000    A6        S
## 4   2677  7.2292              C
## 5  19947 35.5000   C52        S
## 6   2661 15.2458              C

# This creates a dataset with only passengers that are male and survived.
data_filtered <- filter(data, Sex == 'male' & Survived == 1)

head(data_filtered)

##   PassengerId Survived Pclass                         Name  Sex Age SibSp Parch
## 1          18        1      2 Williams, Mr. Charles Eugene male  NA     0     0
## 2          22        1      2        Beesley, Mr. Lawrence male  34     0     0
## 3          24        1      1 Sloper, Mr. William Thompson male  28     0     0
## 4          37        1      3             Mamee, Mr. Hanna male  NA     0     0
## 5          56        1      1            Woolner, Mr. Hugh male  NA     0     0
## 6          66        1      3     Moubarek, Master. Gerios male  NA     1     1
##   Ticket    Fare Cabin Embarked
## 1 244373 13.0000              S
## 2 248698 13.0000   D56        S
## 3 113788 35.5000    A6        S
## 4   2677  7.2292              C
## 5  19947 35.5000   C52        S
## 6   2661 15.2458              C
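A few details of filter() are worth a small sketch. The example below assumes dplyr is installed and uses a made-up data frame (passengers is a stand-in for the Titanic data). It shows that comma-separated conditions behave like &, gives a base R equivalent, and shows that rows where the condition is NA are dropped. If dplyr is not loaded, filter() refers to a different function in the stats package, so load dplyr first.

```r
library(dplyr)

# Made-up data frame standing in for the Titanic data
passengers <- data.frame(
  Sex      = c("male", "female", "male", "male"),
  Survived = c(1, 1, 0, 1),
  Age      = c(34, 22, NA, 28)
)

# Conditions separated by commas are combined with AND
male_surv <- filter(passengers, Sex == "male", Survived == 1)
male_surv

# The base R equivalent uses logical indexing on the rows
male_surv_base <- passengers[passengers$Sex == "male" & passengers$Survived == 1, ]

# Rows where the condition evaluates to NA are dropped by filter()
over_25 <- filter(passengers, Age > 25)
over_25   # keeps ages 34 and 28, drops the row with missing Age
```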

Subsetting by columns

To subset by columns, we use the select() function.

#This creates a dataset with only the columns for Age and Fare
Sample.DF <- select(data, Age, Fare)

head(Sample.DF)  # Notice there are only 2 columns now

##   Age    Fare
## 1  22  7.2500
## 2  38 71.2833
## 3  26  7.9250
## 4  35 53.1000
## 5  35  8.0500
## 6  NA  8.4583

Sorting by Columns

For sorting by columns, we use the arrange() function.

# sorting Sample.DF by Age
sorted.df <- arrange(Sample.DF, Age)

head(sorted.df)

##    Age    Fare
## 1 0.42  8.5167
## 2 0.67 14.5000
## 3 0.75 19.2583
## 4 0.75 19.2583
## 5 0.83 29.0000
## 6 0.83 18.7500
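arrange() sorts in ascending order by default. The sketch below, which assumes dplyr is installed and uses a made-up data frame (fares is an illustrative name), shows descending order with desc(), how rows with missing values are placed, and sorting by more than one column.

```r
library(dplyr)

# Made-up data frame with one missing Age
fares <- data.frame(
  Age  = c(35, NA, 22, 54),
  Fare = c(8.05, 8.46, 7.25, 51.86)
)

# Ascending order by Age; the row with missing Age goes last
asc <- arrange(fares, Age)
asc

# desc() reverses the order; the row with missing Age still goes last
dsc <- arrange(fares, desc(Age))
dsc

# Additional columns break ties: sort by Age first, then by Fare
arrange(fares, Age, Fare)
```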

Create new variables

To create new variables, we use the mutate() function.

The most common uses for this are:

calculating a new number from an existing number (or numbers),
creating categorical variables from numeric ones

# We want to create a variable (young_pasngrs) that is TRUE when the passenger is less than 30 years old.
df.young <- mutate(data, young_pasngrs = Age < 30)

head(df.young)

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked young_pasngrs
## 1        A/5 21171  7.2500              S          TRUE
## 2         PC 17599 71.2833   C85        C         FALSE
## 3 STON/O2. 3101282  7.9250              S          TRUE
## 4           113803 53.1000  C123        S         FALSE
## 5           373450  8.0500              S         FALSE
## 6           330877  8.4583              Q            NA

# We can see that the new data frame (df.young) has the new binary variable young_pasngrs

In the case above, Age < 30 created a logical (i.e., TRUE/FALSE) vector indicating whether each element of the column Age was less than 30. Where Age is missing, the result is NA, as in row 6.
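The two common uses of mutate() listed earlier can be sketched together. The example below assumes dplyr is installed and uses a made-up data frame; the column names fare_rounded and age_group and the age cut points are arbitrary choices for illustration, not part of the course data.

```r
library(dplyr)

# Made-up data frame standing in for the Titanic data
passengers <- data.frame(
  Age  = c(22, 38, NA, 4),
  Fare = c(7.25, 71.28, 8.46, 21.08)
)

passengers <- mutate(
  passengers,
  # A new number calculated from an existing number
  fare_rounded = round(Fare),
  # A categorical variable from a numeric one; the cut points are arbitrary
  age_group = cut(Age,
                  breaks = c(0, 18, 30, Inf),
                  labels = c("child", "young adult", "adult"))
)

passengers
```

cut() returns a factor, and a passenger with a missing Age gets a missing age_group as well.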

3. Visualizing Data

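One of the most powerful features of R is its ability to produce a wide range of graphics to quickly and easily visualize data. Plots can be replicated, modified, and even made publication-ready with just a handful of commands. The examples below refer to Age, Fare, and Pclass directly because the data frame was attached earlier with attach(data).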

Scatter Plot

plot(Age, Fare)

Histogram

hist(Pclass)

Line Chart

plot(Age, Fare, type = "l")
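The base plotting functions accept arguments for titles and axis labels, and a line chart connects points in row order, so sorting first usually gives a more readable line. The sketch below uses a made-up data frame (passengers is a stand-in for the Titanic data) and refers to columns with $ rather than relying on attach().

```r
# Made-up data standing in for the Titanic columns
passengers <- data.frame(
  Age  = c(35, 22, 54, 2, 27),
  Fare = c(8.05, 7.25, 51.86, 21.08, 11.13)
)

# Scatter plot with a title and axis labels
plot(passengers$Age, passengers$Fare,
     main = "Fare by age", xlab = "Age (years)", ylab = "Fare")

# A line chart connects points in row order, so sort by Age first
by_age <- passengers[order(passengers$Age), ]
plot(by_age$Age, by_age$Fare, type = "l",
     xlab = "Age (years)", ylab = "Fare")

# Histogram; breaks is a suggestion for the number of bins
hist(passengers$Age, breaks = 5,
     main = "Age distribution", xlab = "Age (years)")
```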