About

R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.

This notebook is a tutorial on how to use R.

Basic Operations

First we will begin with a few basic operations.

Variable assignment

A variable stores a value or an object (e.g. a function) under a name so that it can be reused later.

x = 967
y = 162
vars = c(122,4432,8634,16745,323422,1,2,3,4,5,6,323452,2234234,234) # This is a vector

vars[1] #This calls the first value in the vector vars

## [1] 122

vars[2] #This calls the second value in the vector vars

## [1] 4432

vars[1:7] #This calls the first through seventh values in the vector vars

## [1]    122   4432   8634  16745 323422      1      2

vars #This calls the vector

##  [1]     122    4432    8634   16745  323422       1       2       3
##  [9]       4       5       6  323452 2234234     234
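
As a small sketch of the indexing shown above, the snippet below uses a shorter, made-up vector (`v` is an assumed name, not part of the tutorial's data) to show that R indexes from 1 and that a range such as `2:4` selects a contiguous run of elements.

```r
# Illustrative vector; the values are made up for this sketch
v = c(10, 20, 30, 40, 50)

length(v)        # number of elements in the vector
v[1]             # first element (R indexes from 1, not 0)
v[2:4]           # second through fourth elements
v[c(1, 5)]       # first and fifth elements, selected with an index vector
```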

Common Arithmetic Operations

Below are some simple arithmetic operations.

1234829*6123

## [1] 7560857967

12834554/1612

## [1] 7961.882

9^8

## [1] 43046721

Basic Data Types

R works with numerous data types. Some of the most basic are numeric, integer, logical (Boolean: TRUE/FALSE), and character (string: "TEXT"). Factors, which store categorical values, are also shown below.

#Type: Character                   
#Example:"TRUE",'23.4'

v = "TRUE"                       
class(v)

## [1] "character"

#Type: Numeric                
#Example: 12.3,5

v = 105                  
class(v)

## [1] "numeric"

#Type: Logical    
#Example: TRUE,FALSE

v = FALSE
class(v)

## [1] "logical"

#Type: Factor
#Example: e v e v e

v = as.factor(c("e", "v", "e"))
class(v)

## [1] "factor"
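
The list of basic types above mentions integers, but the examples do not show one. As a brief sketch: a plain number such as `105` is stored as a double (class "numeric"), and appending `L` requests an integer. The variable name `n` is only illustrative.

```r
n = 105L                  # the L suffix creates an integer
class(n)                  # "integer"
is.numeric(n)             # TRUE: integers still count as numeric values
class(105)                # "numeric": without the suffix, R stores a double
class(as.integer(12.7))   # "integer": as.integer() drops the fractional part
```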

Setting the Working Directory

Before starting to work with R, we need to set the working directory.
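
The tutorial does not show the command for this step. A minimal sketch using base R's `getwd()` and `setwd()` follows. It switches to the session's temporary directory (`tempdir()`) only so that the example runs anywhere; in practice you would pass the path of your own project folder instead.

```r
old_wd = getwd()     # remember the current working directory
old_wd

setwd(tempdir())     # change the working directory (stand-in for your project folder)
getwd()              # confirm the change

setwd(old_wd)        # restore the original working directory
```

Relative paths such as "data/il_income.csv" are resolved against the working directory, which is why it needs to be set before importing data.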

Importing Data and Variable Assignment

il_income = read.csv(file = "data/il_income.csv")
top_il_income = read.csv(file = "data/top_il_income.csv")
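
To see what `read.csv()` does without the tutorial's data files, the sketch below writes a tiny CSV to a temporary file and reads it back. The file contents and the names `csv_path` and `toy` are made up for illustration; they are not the Illinois income data.

```r
# Write a small, made-up CSV file to a temporary location
csv_path = tempfile(fileext = ".csv")
writeLines(c("county,per_capita_income",
             "A,20000",
             "B,25000",
             "C,30000"), csv_path)

# read.csv() treats the first line as column names by default
toy = read.csv(file = csv_path)

head(toy)     # first rows of the data frame
nrow(toy)     # number of observations
names(toy)    # column names
```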

Arithmetic Operations with Data

We can extract values from the dataset to perform calculations.

DuPage = top_il_income$per_capita_income[1]
Lake = top_il_income$per_capita_income[2]
DuPage-Lake/(10^2)

## [1] 38546.41

DuPage+Lake*(40)

## [1] 1577291

(DuPage+Lake)/4

## [1] 19347.5
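
The three results above depend on operator precedence: `^` is evaluated first, then `*` and `/`, then `+` and `-`, unless parentheses say otherwise. The sketch below assumes the values 38931 and 38459 for the two counties; these are inferred from the printed results rather than read from the data file.

```r
# Assumed values, inferred from the printed results above
DuPage = 38931
Lake = 38459

DuPage - Lake / (10^2)    # division happens before subtraction: 38931 - 384.59
(DuPage - Lake) / (10^2)  # parentheses force the subtraction first: 472 / 100
DuPage + Lake * 40        # multiplication happens before addition
(DuPage + Lake) / 4       # the sum is computed first, then divided by 4
```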

Basic Statistics

mean(il_income$per_capita_income)

## [1] 25164.14

median(il_income$per_capita_income)

## [1] 24808.5

quantile(il_income$per_capita_income)

##       0%      25%      50%      75%     100% 
## 14052.00 22666.00 24808.50 26899.75 38931.00

summary(il_income)

##       rank              county   per_capita_income   population     
##  Min.   :  1.00   Adams    : 1   Min.   :14052     Min.   :   4135  
##  1st Qu.: 26.25   Alexander: 1   1st Qu.:22666     1st Qu.:  14284  
##  Median : 51.50   Bond     : 1   Median :24809     Median :  26610  
##  Mean   : 51.50   Boone    : 1   Mean   :25164     Mean   : 126078  
##  3rd Qu.: 76.75   Brown    : 1   3rd Qu.:26900     3rd Qu.:  53319  
##  Max.   :102.00   Bureau   : 1   Max.   :38931     Max.   :5238216  
##                   (Other)  :96                                      
##      region     
##  Min.   :1.000  
##  1st Qu.:3.000  
##  Median :4.000  
##  Mean   :3.735  
##  3rd Qu.:5.000  
##  Max.   :5.000  
##
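
To make these functions concrete, here is a small sketch on a four-element vector where the results can be checked by hand. The vector `x` is illustrative and unrelated to the income data.

```r
x = c(22, 33, 45, 58)

mean(x)       # (22 + 33 + 45 + 58) / 4 = 39.5
median(x)     # average of the two middle values: (33 + 45) / 2 = 39
quantile(x)   # minimum, quartiles, and maximum
summary(x)    # the same statistics plus the mean, in one call
```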

Vectors

Defining a Vector

A sequence of data elements of the same basic type is defined as a vector.

# vector of numeric values
c(22, 33, 45, 58)

## [1] 22 33 45 58

# vector of logical values.
c(FALSE, TRUE, FALSE)

## [1] FALSE  TRUE FALSE

# vector of character strings.
c("A+", "B-", "D-", "C+", "A")

## [1] "A+" "B-" "D-" "C+" "A"
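
Because every element of a vector must share one basic type, `c()` converts mixed inputs to a common type. A short sketch of that behavior (the names `mixed` and `nums` are illustrative):

```r
mixed = c(1, "a", TRUE)    # numbers and logicals are converted to character
mixed                      # "1" "a" "TRUE"
class(mixed)               # "character"

nums = c(1, TRUE, FALSE)   # logicals become numbers: TRUE is 1, FALSE is 0
nums                       # 1 1 0
class(nums)                # "numeric"
```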

Lists

Defining a List

Lists, as opposed to vectors, can hold components of different types.

scores = c(801, 275, 535)  # vector of numeric values                   
annoyance = c("F", "C", "A")  # vector of character strings.          

efficient = c(FALSE, FALSE, TRUE) # vector of logical values.
student = list(scores,annoyance,efficient) # list of vectors
student

## [[1]]
## [1] 801 275 535
## 
## [[2]]
## [1] "F" "C" "A"
## 
## [[3]]
## [1] FALSE FALSE  TRUE

List Slicing

We can retrieve a slice of the list with the single square bracket [] operator. The result is itself a list, even when it holds only one component.

student[2]

## [[1]]
## [1] "F" "C" "A"

student[3]

## [[1]]
## [1] FALSE FALSE  TRUE

student[1]

## [[1]]
## [1] 801 275 535

# first two components of the list
student[1:2]

## [[1]]
## [1] 801 275 535
## 
## [[2]]
## [1] "F" "C" "A"

Member Reference

Using the double square bracket [[]] operator we can reference a member of the list directly.

student[[1]] # Components of the Scores Vector

## [1] 801 275 535

First element of the Scores vector

student[[1]][1]

## [1] 801

First element of the efficient vector

student[[3]][1]

## [1] FALSE
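
The difference between the two operators is the type of the result: `[]` always returns a list, even for one component, while `[[]]` returns the component itself. A sketch, rebuilding the same `student` list so that the block runs on its own:

```r
student = list(c(801, 275, 535), c("F", "C", "A"), c(FALSE, FALSE, TRUE))

class(student[1])      # "list": a one-component list containing the scores
class(student[[1]])    # "numeric": the scores vector itself

student[[1]] * 2       # arithmetic works on the vector
# student[1] * 2       # would fail: arithmetic is not defined for a list
length(student[1:2])   # 2: slicing keeps the list structure
```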

Named List Members

It’s possible to assign names to list members and reference them by names instead of by numeric indexes.

student = list(scores = c(99, 32, 100), grades = c("A+", "F", "A+"), office_hours = c(FALSE, TRUE, FALSE)) 

student

## $scores
## [1]  99  32 100
## 
## $grades
## [1] "A+" "F"  "A+"
## 
## $office_hours
## [1] FALSE  TRUE FALSE

student$scores

## [1]  99  32 100

student$grades

## [1] "A+" "F"  "A+"

student$office_hours

## [1] FALSE  TRUE FALSE
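
Named members can also be reached with `[[]]` and a string, which is handy when the name is stored in a variable. A brief sketch using the same list (the variable `field` is illustrative):

```r
student = list(scores = c(99, 32, 100),
               grades = c("A+", "F", "A+"),
               office_hours = c(FALSE, TRUE, FALSE))

names(student)            # "scores" "grades" "office_hours"
student[["grades"]]       # same result as student$grades

field = "scores"          # the member name held in a variable
student[[field]]          # 99 32 100
mean(student$scores)      # 77
```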

Data Frames

Defining a Data Frame

When we need to store data in table form, we use data frames. A data frame is a list of vectors of equal length: each vector is a column (a variable of the data set) and each row is an observation.

The str() function displays the internal structure of an R object, which lets us check that the data were read in with the expected columns and types.

str(il_income)

## 'data.frame':    102 obs. of  5 variables:
##  $ rank             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ county           : Factor w/ 102 levels "Adams","Alexander",..: 16 22 49 99 45 60 101 64 86 10 ...
##  $ per_capita_income: int  30468 38931 38459 30791 30645 23937 24802 30728 23279 26087 ...
##  $ population       : int  5238216 933736 703910 687263 530847 307343 287078 266209 264052 208861 ...
##  $ region           : int  1 2 2 2 2 2 2 5 5 3 ...

Creating a Data Frame

Who is hungry?

name = c("Elio", "Michele", "Laura")
gender = c("Male","Female", "Female")
weight = c(187, 123, 119)
hometown = c("Chicago", "Glen Ellyn", "Saint Charles")
apples = c(2, 3, 4)
oranges = c(1, 3, 4)
hungry = c(FALSE, FALSE, TRUE)

Now, by combining the vectors of equal length, we can create a data frame object.

whos_hungry = data.frame(name,gender,weight,hometown,apples,oranges,hungry)
whos_hungry

##      name gender weight      hometown apples oranges hungry
## 1    Elio   Male    187       Chicago      2       1  FALSE
## 2 Michele Female    123    Glen Ellyn      3       3  FALSE
## 3   Laura Female    119 Saint Charles      4       4   TRUE
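
Once a data frame exists, columns and rows can be pulled out with `$` and `[row, column]` indexing. The sketch below rebuilds a reduced version of `whos_hungry` (four of the columns) so that it runs on its own; the derived column `fruit` is a hypothetical addition for illustration.

```r
whos_hungry = data.frame(name = c("Elio", "Michele", "Laura"),
                         apples = c(2, 3, 4),
                         oranges = c(1, 3, 4),
                         hungry = c(FALSE, FALSE, TRUE))

whos_hungry$apples                   # one column as a vector: 2 3 4
whos_hungry[2, ]                     # second row, all columns
whos_hungry[whos_hungry$hungry, ]    # only the rows where hungry is TRUE

# Add a derived column: total pieces of fruit per person
whos_hungry$fruit = whos_hungry$apples + whos_hungry$oranges
whos_hungry$fruit                    # 3 6 8
str(whos_hungry)                     # check the resulting structure
```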

Exercises & Resources

Exercises

Datacamp - Learn Data Science from your browser:
R-tutor - An R intro to stats that explains basic R concepts:

Data Sources

“SELECTED ECONOMIC CHARACTERISTICS 2006-2010 American Community Survey 5-Year Estimates” - U.S. Census Bureau. Retrieved 2016-09-09.

Introduction to R

CME Group Foundation Business Analytics Lab

Elio Vento

Summer 2017, July 12