About

R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.

This notebook is a tutorial on how to use R.

Basic Operations

We begin with a few basic operations.

Variable assignment

A variable stores a value or an object (e.g. a function) under a name so that it can be reused later.

x = 435
y = 78
vars = c(2,5,7,12,23,25,28,34,36,39,40,41,42) # This is a vector

vars[1] # Returns the first value in the vector vars

## [1] 2

vars[2] # Returns the second value in the vector vars

## [1] 5

vars[1:3] # Returns the first through third values in the vector vars

## [1] 2 5 7

vars # Prints the whole vector

##  [1]  2  5  7 12 23 25 28 34 36 39 40 41 42
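Because a variable can also hold a function, a minimal sketch of that case may help. The name add_tax and the 10% default rate are hypothetical, chosen only for illustration.

```r
# Hypothetical example: store a function in a variable, then call it
add_tax = function(amount, rate = 0.10) {
  amount * (1 + rate)
}

add_tax(100)        # uses the default rate
add_tax(100, 0.25)  # overrides the rate
class(add_tax)      # "function": a function is an object like any other
```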

Common Arithmetic Operations

Below are some simple arithmetic operations.

34*8

## [1] 272

139/3

## [1] 46.33333

7^3

## [1] 343
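Beyond multiplication, division, and exponentiation, base R also offers integer division and remainder operators, and arithmetic applies element-wise to vectors. A short sketch using the same numbers as above:

```r
139 %/% 3        # integer division: 46
139 %% 3         # remainder: 1
c(1, 2, 3) * 2   # element-wise on a vector: 2 4 6
```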

Basic Data Types

R works with numerous data types. Some of the most basic are numeric, integer, logical (Boolean: TRUE/FALSE), and character (string: "TEXT").

#Type: Character                   
#Example:"TRUE",'23.4'

v = "issa"                       
class(v)

## [1] "character"

#Type: Numeric                
#Example: 12.3,5

v = 46.87                  
class(v)

## [1] "numeric"

#Type: Logical    
#Example: TRUE,FALSE

v = FALSE
class(v)

## [1] "logical"

#Type: Factor
#Example: m f m f m

v = as.factor(c("b", "ba", "bad"))
class(v)

## [1] "factor"
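The integer type is listed above but has no example of its own. The following sketch, which is not part of the original notebook, shows how the L suffix creates an integer and how factor levels are derived from the values:

```r
v = 5L          # the L suffix requests an integer rather than a double
class(v)        # "integer"
is.numeric(v)   # TRUE: integers also count as numeric
class(5)        # "numeric": without the suffix, R stores a double

f = as.factor(c("m", "f", "m", "f", "m"))
levels(f)       # "f" "m": the levels are the sorted unique values
```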

Setting the Working Directory

Before starting to work with R, we need to set the working directory.
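The notebook does not show the command itself, so here is a minimal sketch using the base R functions getwd() and setwd(). The temporary directory is only a stand-in (an assumption); in practice you would pass the path of your own project folder, so that relative paths such as "data/il_income.csv" resolve correctly.

```r
getwd()                      # print the current working directory

# tempdir() is a placeholder for your own project folder
old_dir = setwd(tempdir())   # setwd() returns the previous directory invisibly
getwd()                      # now the temporary directory

setwd(old_dir)               # restore the original working directory
```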

Importing Data and Variable Assignment

il_income = read.csv(file = "data/il_income.csv")
top_il_income = read.csv(file = "data/top_il_income.csv")
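To try read.csv() without the data files, the sketch below reads from an inline string instead. The three rows are hypothetical and only imitate the column layout. Note that since R 4.0 read.csv() keeps text columns as character by default; passing stringsAsFactors = TRUE converts them to factors.

```r
# Hypothetical three-row sample, NOT the real Illinois data
csv_text = "rank,county,per_capita_income
1,CountyA,30000
2,CountyB,28000
3,CountyC,26000"

sample_income = read.csv(text = csv_text, stringsAsFactors = TRUE)
head(sample_income, 2)        # first two rows
class(sample_income$county)   # "factor" because of stringsAsFactors = TRUE
```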

Arithmetic Operations with Data

We can extract values from the dataset to perform calculations.

DuPage = top_il_income$per_capita_income[1]
Lake = top_il_income$per_capita_income[2]
(DuPage-Lake)*4

## [1] 1888

DuPage+Lake-Lake

## [1] 38931

(DuPage+Lake)/3

## [1] 25796.67

Basic Statistics

mean(il_income$per_capita_income)

## [1] 25164.14

median(il_income$per_capita_income)

## [1] 24808.5

quantile(il_income$per_capita_income)

##       0%      25%      50%      75%     100% 
## 14052.00 22666.00 24808.50 26899.75 38931.00

summary(il_income$population)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4135   14284   26610  126078   53319 5238216

summary(il_income$per_capita_income)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14052   22666   24809   25164   26900   38931

summary(il_income)

##       rank              county   per_capita_income   population     
##  Min.   :  1.00   Adams    : 1   Min.   :14052     Min.   :   4135  
##  1st Qu.: 26.25   Alexander: 1   1st Qu.:22666     1st Qu.:  14284  
##  Median : 51.50   Bond     : 1   Median :24809     Median :  26610  
##  Mean   : 51.50   Boone    : 1   Mean   :25164     Mean   : 126078  
##  3rd Qu.: 76.75   Brown    : 1   3rd Qu.:26900     3rd Qu.:  53319  
##  Max.   :102.00   Bureau   : 1   Max.   :38931     Max.   :5238216  
##                   (Other)  :96                                      
##      region     
##  Min.   :1.000  
##  1st Qu.:3.000  
##  Median :4.000  
##  Mean   :3.735  
##  3rd Qu.:5.000  
##  Max.   :5.000  
##

Vectors

Defining a Vector

A vector is a sequence of data elements of the same basic type.

# vector of numeric values
c(2, 4,5,7,20,23,26,45,48,59,70)

##  [1]  2  4  5  7 20 23 26 45 48 59 70

# vector of logical values.
c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)

## [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE

# vector of character strings.
c("2017", "2018", "2018", "2017", "2019","2018","2019")

## [1] "2017" "2018" "2018" "2017" "2019" "2018" "2019"
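Because every element of a vector must share one type, R silently converts mixed input to the most general type. A short sketch of this coercion (the variable name mixed is just for illustration):

```r
mixed = c(2, "2018", TRUE)
mixed                 # "2" "2018" "TRUE": everything became a character string
class(mixed)          # "character"

c(1.5, TRUE, FALSE)   # logicals become numbers: 1.5 1.0 0.0
```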

Lists

Defining a List

Lists, as opposed to vectors, can hold components of different types.

field_goals = c(5,2,12,13,0,9)  # vector of numeric values                   
fg_percentage = c("56%", "30%", "64%","72%","0%","57%")  # vector of character strings.          

practice = c(TRUE, FALSE, FALSE,TRUE,FALSE,FALSE) # vector of logical values.
player = list(field_goals,fg_percentage,practice) # list of vectors
player

## [[1]]
## [1]  5  2 12 13  0  9
## 
## [[2]]
## [1] "56%" "30%" "64%" "72%" "0%"  "57%"
## 
## [[3]]
## [1]  TRUE FALSE FALSE  TRUE FALSE FALSE

List Slicing

We can retrieve components of the list with the single square bracket [] operator. The result is itself a list, which is why the output below is still printed with [[1]] labels.

player[1]

## [[1]]
## [1]  5  2 12 13  0  9

player[2]

## [[1]]
## [1] "56%" "30%" "64%" "72%" "0%"  "57%"

player[3]

## [[1]]
## [1]  TRUE FALSE FALSE  TRUE FALSE FALSE

# first two components of the list
player[1:2]

## [[1]]
## [1]  5  2 12 13  0  9
## 
## [[2]]
## [1] "56%" "30%" "64%" "72%" "0%"  "57%"

Member Reference

Using the double square bracket [[]] operator we can reference a member of the list directly.

player[[2]] # Elements of the fg_percentage vector

## [1] "56%" "30%" "64%" "72%" "0%"  "57%"

Third element of the fg_percentage vector

player[[2]][3]

## [1] "64%"

First four elements of the practice vector

player[[3]][1:4]

## [1]  TRUE FALSE FALSE  TRUE
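The practical difference between the two operators is the type of the result. The sketch below rebuilds a shortened version of the player list so that it runs on its own:

```r
player = list(c(5, 2, 12, 13, 0, 9),
              c("56%", "30%", "64%", "72%", "0%", "57%"))

class(player[1])     # "list": single brackets return a sub-list
class(player[[1]])   # "numeric": double brackets return the member itself
sum(player[[1]])     # 41

# sum(player[1]) would raise an error, because a list is not a numeric vector
```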

Named List Members

It’s possible to assign names to list members and reference them by names instead of by numeric indexes.

player = list(field_goals = c(5,2,12,13,0,9), fg_percentage = c("56%", "30%", "64%","72%","0%","57%"), practice = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)) 

player

## $field_goals
## [1]  5  2 12 13  0  9
## 
## $fg_percentage
## [1] "56%" "30%" "64%" "72%" "0%"  "57%"
## 
## $practice
## [1]  TRUE FALSE FALSE  TRUE FALSE FALSE

player$field_goals

## [1]  5  2 12 13  0  9

player$fg_percentage

## [1] "56%" "30%" "64%" "72%" "0%"  "57%"

player$practice

## [1]  TRUE FALSE FALSE  TRUE FALSE FALSE
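Names also work inside double square brackets, and new members can be added by name. In this sketch the games member is hypothetical and not part of the original data:

```r
player = list(field_goals = c(5, 2, 12, 13, 0, 9),
              practice = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE))

names(player)             # "field_goals" "practice"
player[["field_goals"]]   # same result as player$field_goals

player$games = 6          # hypothetical new member, added by name
names(player)             # "field_goals" "practice" "games"
```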

Data Frames

Defining a Data Frame

When we need to store data in table form, we use data frames. A data frame is a list of vectors of equal length: the variables of a data set are the columns and the observations are the rows.

The str() function displays the internal structure of any R object, which lets us check that the data was read in as expected.

str(top_il_income)

## 'data.frame':    10 obs. of  5 variables:
##  $ rank             : int  2 3 32 44 67 16 4 8 5 90
##  $ county           : Factor w/ 10 levels "DuPage","Kane",..: 1 4 5 7 8 3 10 6 2 9
##  $ per_capita_income: int  38931 38459 33118 33059 31750 31110 30791 30728 30645 30594
##  $ population       : int  933736 703910 46045 33879 16387 123355 687263 266209 530847 7032
##  $ region           : int  2 2 4 5 4 2 2 5 2 4

Creating a Data Frame

Snapshot of NBA Players.

name = c("LeBron James", "Russell Westbrook", "Kevin Durant")
team = c("Cleveland Cavaliers","Oklahoma City Thunder", "Golden State Warriors")
team_changes = c(2, 0, 1)
friends_betrayed = c(0, 0, 1)
ring = c(TRUE, FALSE, TRUE)

Now, by combining these vectors of equal length, we can create a data frame object.

nba_df = data.frame(name,team,team_changes,friends_betrayed,ring)
nba_df

##                name                  team team_changes friends_betrayed
## 1      LeBron James   Cleveland Cavaliers            2                0
## 2 Russell Westbrook Oklahoma City Thunder            0                0
## 3      Kevin Durant Golden State Warriors            1                1
##    ring
## 1  TRUE
## 2 FALSE
## 3  TRUE
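Once a data frame exists, columns and rows can be extracted much like list members and vector elements. The sketch below rebuilds a reduced version of nba_df (three of the five columns) so that it runs on its own:

```r
nba_df = data.frame(
  name = c("LeBron James", "Russell Westbrook", "Kevin Durant"),
  team_changes = c(2, 0, 1),
  ring = c(TRUE, FALSE, TRUE)
)

nba_df$team_changes           # one column as a vector: 2 0 1
nba_df[2, ]                   # second row, all columns
nba_df[nba_df$ring, "name"]   # names of the players whose ring value is TRUE
nrow(nba_df)                  # 3 observations
ncol(nba_df)                  # 3 variables
```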

Exercises & Resources

Exercises

Datacamp - Learn Data Science from your browser
R-tutor - An R intro to stats that explains basic R concepts

Data Sources

“SELECTED ECONOMIC CHARACTERISTICS 2006-2010 American Community Survey 5-Year Estimates” - U.S. Census Bureau. Retrieved 2016-09-09.

Introduction to R

CME Group Foundation Business Analytics Lab

Rachel Hlavacek

July 12, 2017