About

R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.

This notebook is a tutorial on how to use R.

Basic Operations

First we will begin with a few basic operations.

Variable assignment

A variable stores a value or an object (e.g. a function) under a name so that it can be reused later.

x = 128
y = 16
vars = c(2,4,8,16,32) # This is a vector

vars[1] #This calls the first value in the vector vars

## [1] 2

vars[2] #This calls the second value in the vector vars

## [1] 4

vars[1:3] #This calls the first through third values in the vector vars

## [1] 2 4 8

vars #This calls the vector

## [1]  2  4  8 16 32
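Beyond single positions and ranges, vectors can also be indexed with negative numbers or logical conditions. The short sketch below re-creates vars so that it runs on its own; the names without_first and large_values are only illustrative.

```r
vars = c(2, 4, 8, 16, 32)      # same vector as above, re-created for this example

length(vars)                   # number of elements in the vector

without_first = vars[-1]       # a negative index drops that position
without_first

large_values = vars[vars > 4]  # a logical condition keeps only the matching elements
large_values
```

Note that R counts positions from 1, so vars[-1] removes the value 2.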

Common Arithmetic Operations

Below are some simple arithmetic operations.

12*6

## [1] 72

128/16

## [1] 8

9^2

## [1] 81
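The same operators apply element-wise to whole vectors, and R also provides operators for integer division and remainder. A minimal sketch, with vars re-created here and the result names chosen only for illustration:

```r
vars = c(2, 4, 8, 16, 32)

doubled = vars * 2     # every element is multiplied by 2
doubled

quotient = 17 %/% 5    # integer division
remainder = 17 %% 5    # remainder (modulo)
quotient
remainder
```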

Basic Data Types

R works with numerous data types. Some of the most basic types are numeric, integer, logical (Boolean: TRUE/FALSE) and character (string: "TEXT"). Factors, which store categorical values, are also shown below.

#Type: Character                   
#Example:"TRUE",'23.4'

v = "TRUE"                       
class(v)

## [1] "character"

#Type: Numeric                
#Example: 12.3,5

v = 23.5                  
class(v)

## [1] "numeric"

#Type: Logical    
#Example: TRUE,FALSE

v = TRUE
class(v)

## [1] "logical"

#Type: Factor
#Example: m f m f m

v = as.factor(c("m", "f", "m"))
class(v)

## [1] "factor"
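The integer type mentioned above does not appear in the examples. As a brief sketch: a number typed without a suffix is stored as numeric, the L suffix creates an integer, and the as.* functions convert between types. The variable names are illustrative.

```r
n = 5                 # plain numbers are stored as numeric (double)
class(n)

i = 5L                # the L suffix creates an integer
class(i)

s = "23.4"            # a character string that looks like a number
num = as.numeric(s)   # convert it to numeric so it can be used in arithmetic
class(num)
num + 1
```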

Setting the Working Directory

Before starting to work with R, we need to set the working directory.
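The working directory is the folder R uses to resolve relative paths such as data/il_income.csv. A minimal sketch using base R's getwd() and setwd(); the temporary directory only stands in for your own project folder, which is an assumption made so the example can run anywhere:

```r
old_dir = getwd()         # remember the current working directory

project_dir = tempdir()   # placeholder: replace with the path to your own project folder
setwd(project_dir)        # change the working directory
getwd()                   # confirm the change

setwd(old_dir)            # switch back when done
```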

Importing Data and Variable Assignment

il_income = read.csv(file = "data/il_income.csv")
top_il_income = read.csv(file = "data/top_il_income.csv")
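read.csv() returns a data frame whose column names come from the first line of the file. The sketch below writes a tiny CSV to a temporary file and reads it back, so it runs without the tutorial's data files. The file and the name demo_income are made up for this illustration; the two rows reuse the DuPage and Lake values that appear elsewhere in this notebook.

```r
csv_path = tempfile(fileext = ".csv")    # temporary file used only for this example
writeLines(c("county,per_capita_income",
             "DuPage,38931",
             "Lake,38459"),
           csv_path)

demo_income = read.csv(file = csv_path)  # the header line becomes the column names
head(demo_income)                        # show the first rows
nrow(demo_income)                        # number of rows read
```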

Arithmetic Operations with Data

We can extract values from the dataset to perform calculations.

DuPage = top_il_income$per_capita_income[1]
Lake = top_il_income$per_capita_income[2]
McHenry = top_il_income$per_capita_income[3]
Monroe = top_il_income$per_capita_income[4]
Piatt = top_il_income$per_capita_income[5]
Kendall = top_il_income$per_capita_income[6]
Will = top_il_income$per_capita_income[7]
McLean = top_il_income$per_capita_income[8]
Kane = top_il_income$per_capita_income[9]
Sangamor = top_il_income$per_capita_income[10]

DuPage+Lake+McHenry+Monroe+Piatt+Kendall+Will+McLean+Kane+Sangamor

## [1] 329185

(DuPage+Lake+McHenry+Monroe+Piatt+Kendall+Will+McLean+Kane+Sangamor)/10

## [1] 32918.5

(DuPage+Lake+McHenry)/3

## [1] 36836

(Sangamor+Kane+McLean)/3

## [1] 30655.67
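Naming every value works for ten counties but does not scale. The same results can be obtained by applying sum() and mean() to the whole column at once. In this sketch the vector per_capita_income is typed in from the per_capita_income values shown in the str() output in the Data Frames section, so the block runs without the CSV file:

```r
per_capita_income = c(38931, 38459, 33118, 33059, 31750,
                      31110, 30791, 30728, 30645, 30594)

sum(per_capita_income)          # total of all ten values
mean(per_capita_income)         # average of all ten values
mean(per_capita_income[1:3])    # average of the first three values
mean(per_capita_income[8:10])   # average of the last three values
```

With the data frame loaded, the equivalent call on the real column would be mean(top_il_income$per_capita_income).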

Basic Statistics

mean(top_il_income$population)

## [1] 334866.3

median(top_il_income$population)

## [1] 194782

quantile(top_il_income$population)

##       0%      25%      50%      75%     100% 
##   7032.0  36920.5 194782.0 648159.0 933736.0

summary(top_il_income)

##       rank           county  per_capita_income   population    
##  Min.   : 2.00   DuPage :1   Min.   :30594     Min.   :  7032  
##  1st Qu.: 4.25   Kane   :1   1st Qu.:30744     1st Qu.: 36921  
##  Median :12.00   Kendall:1   Median :31430     Median :194782  
##  Mean   :27.10   Lake   :1   Mean   :32919     Mean   :334866  
##  3rd Qu.:41.00   McHenry:1   3rd Qu.:33103     3rd Qu.:648159  
##  Max.   :90.00   McLean :1   Max.   :38931     Max.   :933736  
##                  (Other):4                                     
##      region   
##  Min.   :2.0  
##  1st Qu.:2.0  
##  Median :3.0  
##  Mean   :3.2  
##  3rd Qu.:4.0  
##  Max.   :5.0  
##

Vectors

Defining a Vector

A sequence of data elements of the same basic type is defined as a vector.

# vector of numeric values
c(4, 8, 15, 16, 23, 48)

## [1]  4  8 15 16 23 48

# vector of logical values.
c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)

## [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE

# vector of character strings.
c("A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-", "F")

##  [1] "A+" "A"  "A-" "B+" "B"  "B-" "C+" "C"  "C-" "D+" "D"  "D-" "F"
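Because all elements of a vector must share one type, R silently converts mixed input to the most general type present. A small sketch (the variable names are illustrative):

```r
mixed = c(1, "a", TRUE)          # a character value is present, so everything becomes character
class(mixed)
mixed

num_logical = c(1, TRUE, FALSE)  # logical values become 1 and 0 next to numbers
class(num_logical)
num_logical
```

This is the reason a list, introduced next, is needed when components of different types must be kept as they are.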

Lists

Defining a List

Lists, as opposed to vectors, can hold components of different types.

scores = c(100, 95, 92, 87, 85, 77, 66, 58)  # vector of numeric values                   
grades = c("A+", "A", "A-", "B+", "B", "C+", "D", "F")  # vector of character strings.          

office_hours = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE) # vector of logical values.
student = list(scores,grades,office_hours) # list of vectors
student

## [[1]]
## [1] 100  95  92  87  85  77  66  58
## 
## [[2]]
## [1] "A+" "A"  "A-" "B+" "B"  "C+" "D"  "F" 
## 
## [[3]]
## [1]  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE

List Slicing

We can retrieve components of the list with the single square bracket [] operator.

student[1]

## [[1]]
## [1] 100  95  92  87  85  77  66  58

# first two components of the list
student[1:2]

## [[1]]
## [1] 100  95  92  87  85  77  66  58
## 
## [[2]]
## [1] "A+" "A"  "A-" "B+" "B"  "C+" "D"  "F"

Member Reference

Using the double square bracket [[]] operator we can reference a member of the list directly.

student[[2]] # Components of the grades vector

## [1] "A+" "A"  "A-" "B+" "B"  "C+" "D"  "F"

Last and first elements of the grades vector

student[[2]][8]

## [1] "F"

student[[2]][1]

## [1] "A+"

First three elements of the grades vector

grades[1:3]

## [1] "A+" "A"  "A-"
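The two operators differ in the type of result they return: [] always returns a list, while [[]] returns the member itself. A short sketch with a smaller list (demo, sub_list and member are hypothetical names used only here):

```r
demo = list(c(100, 95, 92), c("A+", "A", "A-"))

sub_list = demo[1]    # single brackets: a list containing the first member
class(sub_list)

member = demo[[1]]    # double brackets: the numeric vector itself
class(member)

demo[[1]][2]          # second element of the first member
demo[[c(1, 2)]]       # a vector inside [[ ]] is applied recursively: the same element
```

The last line shows why a range such as 1:2 inside double brackets does not return two members: it is read as "member 1, then element 2".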

Named List Members

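It’s possible to give list members names and reference them by name instead of by numeric index. Assigning to a name that does not yet exist, as below, appends a new named member, which is why the printed list then contains the three original unnamed members followed by three named ones.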

student$scores = c(100, 95, 92, 87, 85, 77, 66, 58) 
student$grades = c("A+", "A", "A-", "B+", "B", "C+", "D", "F")
student$office_hours = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)

student

## [[1]]
## [1] 100  95  92  87  85  77  66  58
## 
## [[2]]
## [1] "A+" "A"  "A-" "B+" "B"  "C+" "D"  "F" 
## 
## [[3]]
## [1]  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE
## 
## $scores
## [1] 100  95  92  87  85  77  66  58
## 
## $grades
## [1] "A+" "A"  "A-" "B+" "B"  "C+" "D"  "F" 
## 
## $office_hours
## [1]  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE

student$scores

## [1] 100  95  92  87  85  77  66  58

student$grades

## [1] "A+" "A"  "A-" "B+" "B"  "C+" "D"  "F"

student$office_hours

## [1]  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE
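To name members without creating duplicates, names can be supplied when the list is built or attached afterwards with names(). A minimal sketch using shortened vectors (student_named and unnamed are illustrative names):

```r
scores = c(100, 95, 92)
grades = c("A+", "A", "A-")

student_named = list(scores = scores, grades = grades)  # names given at creation
names(student_named)
student_named$grades

unnamed = list(scores, grades)          # list built without names
names(unnamed) = c("scores", "grades")  # attach names afterwards
length(unnamed)                         # still two members, now accessible by name
```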

Data Frames

Defining a Data Frame

When we need to store data in table form, we use data frames. A data frame is a list of vectors of equal length: the variables of a data set are the columns and the observations are the rows.

The str() function helps us to display the internal structure of any R data structure or object to make sure that it’s correct.

str(top_il_income)

## 'data.frame':    10 obs. of  5 variables:
##  $ rank             : int  2 3 32 44 67 16 4 8 5 90
##  $ county           : Factor w/ 10 levels "DuPage","Kane",..: 1 4 5 7 8 3 10 6 2 9
##  $ per_capita_income: int  38931 38459 33118 33059 31750 31110 30791 30728 30645 30594
##  $ population       : int  933736 703910 46045 33879 16387 123355 687263 266209 530847 7032
##  $ region           : int  2 2 4 5 4 2 2 5 2 4

Creating a Data Frame

Chicago Sports Teams

name = c("Bears", "Cubs", "White Sox", "Blackhawks", "Bulls")
sport = c("Football", "Baseball", "Baseball", "Hockey", "Basketball")
last_full_season_win_percentage = c("18.8%", "64%", "48.1%", "61%", "50%")
conference_finish = c("15", "1", "11", "1", "8")
championship = c(FALSE, TRUE, FALSE, FALSE, FALSE)

Now, by combining the vectors of equal size, we can create a data frame object.

Chicago_Sports_Team_df = data.frame(name, sport, last_full_season_win_percentage, conference_finish, championship)
Chicago_Sports_Team_df

##         name      sport last_full_season_win_percentage conference_finish
## 1      Bears   Football                           18.8%                15
## 2       Cubs   Baseball                             64%                 1
## 3  White Sox   Baseball                           48.1%                11
## 4 Blackhawks     Hockey                             61%                 1
## 5      Bulls Basketball                             50%                 8
##   championship
## 1        FALSE
## 2         TRUE
## 3        FALSE
## 4        FALSE
## 5        FALSE
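Once a data frame exists, columns can be accessed with $ and rows can be filtered with a logical condition. The win percentages above are stored as text, so the sketch below also converts them to numbers. The data frame name teams and the shortened column name win_pct are choices made for this example only.

```r
teams = data.frame(
  name = c("Bears", "Cubs", "White Sox", "Blackhawks", "Bulls"),
  win_pct = c("18.8%", "64%", "48.1%", "61%", "50%"),
  championship = c(FALSE, TRUE, FALSE, FALSE, FALSE)
)

teams$win_pct = as.numeric(sub("%", "", teams$win_pct))  # drop the % sign, then convert
str(teams)

champions = teams[teams$championship, ]  # keep only rows where championship is TRUE
champions$name
mean(teams$win_pct)                      # average win percentage across the five teams
```

Storing percentages as numbers rather than text is what makes calculations such as mean() possible.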

Exercises & Resources

Exercises

Datacamp - Learn Data Science from your browser:
R-tutor - An R intro to stats that explains basic R concepts:

Data Sources

“SELECTED ECONOMIC CHARACTERISTICS 2006-2010 American Community Survey 5-Year Estimates” - U.S. Census Bureau. Retrieved 2016-09-09.

Introduction to R

CME Group Foundation Business Analytics Lab

Mark Gruhlke

Summer 2017