NHDS Introduction to R - Part 1

Illya Mowerman

11/29/2017

Welcome

The main objective of this three-part course is to give you a jumpstart in R.

Today we are going to:

RStudio Panes

What is an object?

Examples of types of objects

Vectors

Vectors can hold character or numeric values (among other types).

char_vector <- c('Hola' , 'me' , 'llamo' , 'Illya')

num_vector <- c(1, 4 , 6 , 7)

print(char_vector)
## [1] "Hola"  "me"    "llamo" "Illya"
print(num_vector)
## [1] 1 4 6 7
str(char_vector)
##  chr [1:4] "Hola" "me" "llamo" "Illya"
str(num_vector)
##  num [1:4] 1 4 6 7
summary(num_vector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.25    5.00    4.50    6.25    7.00
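
To make the character/numeric distinction concrete, the short sketch below checks each vector's type with class() and shows what happens when the two kinds are mixed. The name mixed_vector is made up for this illustration.

```r
# The two vectors from the slide above
char_vector <- c('Hola', 'me', 'llamo', 'Illya')
num_vector  <- c(1, 4, 6, 7)

# class() reports what kind of values a vector holds
class(char_vector)   # "character"
class(num_vector)    # "numeric"

# A vector holds a single type: mixing numbers and text turns everything into text
mixed_vector <- c(1, 'dos', 3)   # hypothetical example vector
class(mixed_vector)  # "character"

# length() counts the elements, and [ ] picks one out by position
length(num_vector)   # 4
num_vector[2]        # 4
```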

Exercises: Vectors

  1. Create a vector named v1 containing the numbers 1, 5, 7
  2. Create a vector named v2 with the number 2 in it
  3. Create a vector named v3 with the number 3 in it
  4. Sum v2 + v3

Dataframes

There are several data frames already in R. Let’s use the mtcars data. The mtcars data frame is to R what Fisher’s Iris data is to statistics.

The head function allows us to see the first six records.

You try it

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
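
As a small extension of the head example, the sketch below shows how to check the size of the data frame and how to ask for a different number of records. The object names first_three and last_three are made up for this illustration.

```r
# Dimensions of the data frame
nrow(mtcars)   # 32 rows
ncol(mtcars)   # 11 columns

# Ask head() for a different number of records with the n argument
first_three <- head(mtcars, n = 3)

# tail() works the same way, starting from the bottom of the data frame
last_three <- tail(mtcars, n = 3)

print(first_three)
print(last_three)
```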

Dataframes

The summary function gives us some basic statistics

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Exercise: sum, min, max, mean of a single variable

  1. Calculate the mean of mpg using the following syntax:
mean(mtcars$mpg)
## [1] 20.09062
  2. Calculate the min, max, and sum of mpg

Graphs

We can render graphs directly into the Plots pane, or save them to a .png file.

library(ggplot2)  # ggplot() comes from the ggplot2 package (part of the tidyverse)

ggplot(mtcars) +
  geom_point(aes(mpg, wt))
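
One possible way to write the same scatter plot to a .png file is ggsave() from ggplot2. This is a sketch, not the only approach. The names scatter_plot and png_file are made up, and a temporary file is used so the example runs anywhere; in practice you would supply your own file name.

```r
library(ggplot2)

# Build the scatter plot and keep it in an object so it can be passed to ggsave()
scatter_plot <- ggplot(mtcars) +
  geom_point(aes(mpg, wt))

# Assumption: a temporary file is used here; replace it with your own path,
# for example 'mpg_vs_wt.png', to save into the working directory
png_file <- tempfile(fileext = '.png')

# ggsave() writes the plot to disk; width and height are in inches by default
ggsave(filename = png_file, plot = scatter_plot, width = 6, height = 4)

# Confirm that the file was created
file.exists(png_file)
```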

Graphs

We can also store graphs in an object.

my_first_plot <- ggplot(mtcars) +
  geom_smooth(aes(mpg , wt))

Graphs

The reason for storing a graph in an object is to be able to use it at a later time.

my_first_plot
## `geom_smooth()` using method = 'loess'
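
As an illustration of using a stored graph at a later time, the sketch below adds layers on top of my_first_plot. The name my_second_plot and the title text are made up for this example.

```r
library(ggplot2)

# The stored plot from the previous slide
my_first_plot <- ggplot(mtcars) +
  geom_smooth(aes(mpg, wt))

# Reuse the stored plot: add the raw points and some labels on top of it
my_second_plot <- my_first_plot +
  geom_point(aes(mpg, wt)) +
  labs(title = 'Weight versus fuel efficiency', x = 'mpg', y = 'wt')

# Typing the object name (or calling print) draws it in the Plots pane
print(my_second_plot)
```

The original my_first_plot is left unchanged, so both versions remain available.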

Exercises: Graphs

  1. Together: Create a histogram of the variable mpg found in the mtcars data
  2. Create a histogram of the variable wt
  3. Create a scatter plot of mpg and wt using the plot function
  4. Create a boxplot of the variable disp using the boxplot function

Functions

Functions in R work like basic mathematical functions. Below is a function for calculating the area of a circle. The formula uses the radius r, while the code takes the diameter and halves it.

\(f(r) = \pi r^{2}\)

circ_area <- function(diameter){
  pi*(diameter/2)^2
}

circ_area(3)
## [1] 7.068583

Exercise: Function

Now you try creating the same function and calculate the area of a circle with diameter 5.

circ_area <- function(diameter){
  pi*(diameter/2)^2
}

circ_area(3)
## [1] 7.068583

Functions

Note that although you can write as much code as you like within a function, a function can only return one object.

We can get around that by returning a list.
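
A minimal sketch of that idea: the function below bundles three results into one list. The function name circ_summary and the element names are made up for this illustration.

```r
# Hypothetical helper: returns several results bundled in a single list
circ_summary <- function(diameter) {
  radius <- diameter / 2
  list(
    radius    = radius,
    area      = pi * radius^2,
    perimeter = 2 * pi * radius
  )
}

result <- circ_summary(3)

# Each piece is retrieved by name with $
result$area        # 7.068583, the same value as circ_area(3)
result$perimeter
```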

Lists

Think of lists as a box where you can put any and all objects.

Below is a list version of a singles-only Noah’s Ark: it holds one of each kind of object we have created so far.

# defining my list
noahs_singles_arc <- list()

# inserting objects
noahs_singles_arc$char_vector   <- char_vector

noahs_singles_arc$mtcars        <- mtcars
  
noahs_singles_arc$my_first_plot <- my_first_plot
  
noahs_singles_arc$circ_area     <- circ_area

Lists

Let’s take a quick look at the list

summary(noahs_singles_arc)
##               Length Class      Mode     
## char_vector    4     -none-     character
## mtcars        11     data.frame list     
## my_first_plot  9     gg         list     
## circ_area      1     -none-     function
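
Once objects are in a list, they can be taken back out by name or by position. The sketch below uses a small made-up list, toy_box, so that it runs on its own.

```r
# Hypothetical small list for illustration
toy_box <- list(
  greeting = c('Hola', 'me', 'llamo', 'Illya'),
  numbers  = c(1, 4, 6, 7)
)

# names() shows what is in the box
names(toy_box)          # "greeting" "numbers"

# Retrieve by name
toy_box$numbers
toy_box[['numbers']]

# Retrieve by position (double brackets return the object itself)
toy_box[[2]]

# Single brackets return a smaller list instead of the object
class(toy_box[2])       # "list"

# Objects pulled out of a list behave as usual
sum(toy_box$numbers)    # 18
```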

Exercise: list

  1. Create a list with v1 , v2 , and v3
  2. Retrieve v2 from your list by name
  3. Retrieve v2 from your list by position
  4. Add v2 and v3 directly from your list

Packages

Packages are collections of functions (and sometimes data) that members of the R community have created and made available to everyone.

Installing packages is easy (the repos argument is optional):

install.packages('tidyverse' , repos='http://cran.us.r-project.org')
## 
## The downloaded binary packages are in
##  /var/folders/8c/w4htphd93r7cq7tcyp65xr780000gn/T//RtmpchaG6y/downloaded_packages

Loading Packages is even easier

To be able to use the functions of a package, you must load it first

library(tidyverse)
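
A package only needs to be installed once, but it must be loaded in every new session. The sketch below shows one possible way to install a package only when it is missing, and how to call a single function without loading the whole package. The helper install_if_missing is made up for this illustration, and the stats package is used because it ships with R.

```r
# 'stats' ships with every R installation, so it is a safe package to demonstrate with
requireNamespace('stats', quietly = TRUE)   # TRUE when the package is installed

# Hypothetical helper (not part of any package): install only when missing
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  invisible(pkg)
}

install_if_missing('stats')   # already installed, so nothing is downloaded

# package::function calls one function without loading the whole package
stats::median(c(1, 4, 6, 7))  # 5
```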

Exercises: packages

  1. Install tidyverse
  2. Load the package tidyverse

Highly used packages for this class

Help Facility

To get help on a specific function, simply put a ? in front of the function name, and the help facility will display the documentation. It is extremely useful to scroll down to the examples.

?sum
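
A few related ways to reach the same documentation are sketched below. All of them are base R functions; sum is used only as an example topic.

```r
# help() is the long form of ?
help('sum')

# args() prints the arguments a function accepts
args(sum)

# example() runs the examples found at the bottom of the help page
example(sum)
```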

Let’s import some data

Go to the following website to download the data:
https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/

This link takes you directly to the data:
https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-HR-Employee-Attrition.xlsx

Once you’ve saved the data

  1. Install the package readxl (if already installed, no need to reinstall) using the following code: install.packages('readxl')
  2. Load the readxl package with the command library(readxl)
  3. Go to Import Dataset
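
The import can also be done in code instead of through the Import Dataset menu. The sketch below assumes the workbook was saved in the working directory under its original file name; both the path and the object name hr_data are assumptions to adjust to your own setup.

```r
# Assumption: the workbook was saved to the working directory under its original name
data_path <- 'WA_Fn-UseC_-HR-Employee-Attrition.xlsx'

if (file.exists(data_path)) {
  library(readxl)
  # read_excel() reads the first sheet by default and returns a data frame (tibble)
  hr_data <- read_excel(data_path)   # hr_data is a made-up object name
  str(hr_data)
} else {
  message('File not found. Check the path or your working directory with getwd().')
}
```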

Exercise: wrangling

  1. Check out the data with the summary and str functions
  2. Sort the data with the arrange function
  3. Select the first and last variables with native R
  4. Select the first and last variables with the select function
  5. Create a new variable minc_over_age (MonthlyIncome/Age) using native R
  6. Create a new variable minc_over_age (MonthlyIncome/Age) using the mutate function
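
As a warm-up for these exercises, the sketch below shows the same kinds of operations on the built-in mtcars data, side by side in native R and with dplyr (loaded as part of the tidyverse). The object names and the variable wt_per_hp are made up for this illustration.

```r
library(dplyr)

# Sort: native R uses order(), dplyr uses arrange()
sorted_base  <- mtcars[order(mtcars$mpg), ]
sorted_dplyr <- arrange(mtcars, mpg)

# Select columns: native R uses [ , ], dplyr uses select()
cols_base  <- mtcars[, c('mpg', 'wt')]
cols_dplyr <- select(mtcars, mpg, wt)

# New variable: native R assigns with $, dplyr uses mutate()
cars_base <- mtcars
cars_base$wt_per_hp <- cars_base$wt / cars_base$hp   # wt_per_hp is a made-up variable
cars_dplyr <- mutate(mtcars, wt_per_hp = wt / hp)

head(cars_dplyr)
```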

Next time…