About

R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.

This notebook is a tutorial on how to use R.

Setting the Working Directory

Before starting to work with R, we need to set the working directory to the source file location.
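The working directory can also be inspected and changed from code. The sketch below is only an illustration: it uses R's session temporary folder (tempdir()) as a stand-in target so that it runs anywhere, whereas in practice you would pass the folder that holds this notebook. In RStudio the same result is available from the menu via Session > Set Working Directory > To Source File Location.

```r
# Remember the current working directory so it can be restored afterwards.
old_dir <- getwd()

# Switch to another folder (tempdir() is used here purely for illustration).
setwd(tempdir())
getwd()  # Displays the new working directory.

# Return to the original working directory.
setwd(old_dir)
```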

ATTENTION:

1. SET UP THE DIRECTORY AS INSTRUCTED ABOVE BEFORE YOU PROCEED WITH THIS ASSIGNMENT TO AVOID ANY ERRORS.

2. THERE ARE 10 TASKS IN THIS ASSIGNMENT; EACH IS WORTH 1 POINT.

3. DO THE TASKS SEQUENTIALLY TO AVOID ERRORS WHEN YOU RUN THE CODE.

4. BEFORE YOU KNIT YOUR RMD FILE AND PUBLISH IT ON RPUBS, MAKE SURE TO RUN EACH CODE CHUNK BY CLICKING THE GREEN ARROW ON THE RIGHT OF THE CHUNK.

5. FULL POINTS WILL ONLY BE GIVEN FOR TASKS WHOSE CODE RUNS SUCCESSFULLY (ERROR-FREE).

Basic Operations

First we will begin with a few basic operations.

Variable assignment

We assign values to variables using the assignment operator ‘=’. Another, more general form of assignment is the ‘<-’ operator. A variable lets you store a value or an object (e.g. a function). Note: R is case sensitive, so X and x are treated as two different variable names.

x = 128   # Here x is the variable name and 128 is the value assigned for x
y = 16
z <- 5
vars = c(2,4,8,16,32) # This is a vector named vars which is created using the generic combine function 'c'

# TASK 1: Assign a value 19 to a variable named w  
w = 19

x # This calls variable x and displays its value.

## [1] 128

z # This calls variable z and displays its value.

## [1] 5

# TASK 2: Call variable w and display its value
w # This calls variable w and displays its value.

## [1] 19

vars[1] # This calls vector vars and displays its first element.

## [1] 2

vars[2] # This calls vector vars and displays its second element.

## [1] 4

vars[1:3] # This calls vector vars and displays its first through third elements.

## [1] 2 4 8

vars # This calls vector vars and displays all of its elements.

## [1]  2  4  8 16 32

# TASK 3: Call vector vars and display its third through fifth elements
vars[3:5]

## [1]  8 16 32
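As a further illustration of the points above (both assignment operators, case sensitivity, and vector indexing), here is a small sketch. The variable X and the use of negative indices are additions for this example and are not part of the graded tasks.

```r
x <- 128   # Assignment with the arrow operator
X = 256    # A different variable: R distinguishes X from x
x == X     # Compares the two values; they differ, so this is FALSE

# Negative indices drop elements instead of selecting them.
vars <- c(2, 4, 8, 16, 32)
vars[-1]       # Everything except the first element: 4 8 16 32
length(vars)   # Number of elements in the vector: 5
```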

Common Arithmetic Operations

Below are some simple arithmetic operations.

12*6 # This operation multiplies 12 and 6. Multiplication uses the symbol *

## [1] 72

128/16 # This is a division operation.

## [1] 8

9^2 # This is how to compute 9 to the power of 2.

## [1] 81

# TASK 4: Compute 3 to the power of 3
3^3

## [1] 27
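A few more arithmetic operators are sketched below, along with the fact that arithmetic applies element by element to vectors. The numbers are arbitrary and chosen only for illustration.

```r
17 %/% 5      # Integer division: 3
17 %% 5       # Remainder (modulo): 2
(2 + 3) * 4   # Parentheses control the order of evaluation: 20
2 + 3 * 4     # Multiplication binds tighter than addition: 14

# Arithmetic is vectorised: the operation is applied to every element.
vars <- c(2, 4, 8, 16, 32)
vars / 2      # 1 2 4 8 16
```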

Basic Data Types

R works with numerous data types. Some of the most basic types are: character (string, e.g. "TEXT"), numeric, logical (Boolean, TRUE/FALSE), and factor. These types are also called classes. In R there are therefore different classes for different data types; e.g. class character holds data of type character, class numeric holds data of type numeric, etc. A variable’s class depends on the type of data that the variable holds.

# The following assigns variable v a value of type: character.
# Examples of data values of type character: "TRUE", '23.4'
# Note: character data can be written with double or single quotes
v = "TRUE"  # This assigns value "TRUE" of type character to variable v   
class(v)    # This calls and displays the class which variable v belongs to

## [1] "character"

#The following is assigning variable v a data of type: numeric            
#Example: 12.3,5
v = 23.5                  
class(v)

## [1] "numeric"

#The following is assigning variable v a data of type: Logical    
#Example: TRUE,FALSE
v = TRUE    # Note: the difference between "TRUE" and TRUE (without quotes)
class(v)

## [1] "logical"

#The following is type: Factor (nominal, categorical)
#Example: m f m f m

v = as.factor(c("m", "f", "m"))
class(v)

## [1] "factor"
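The following sketch shows how a value's class can be tested and converted. The variable names num and f are introduced only for this example; as.numeric(), is.logical(), and levels() are standard base R functions.

```r
v <- "23.5"
class(v)               # "character"
num <- as.numeric(v)   # Convert the text to a number
class(num)             # "numeric"
num + 1                # 24.5

is.logical(TRUE)       # TRUE
is.logical("TRUE")     # FALSE: the quoted value is a character string

f <- as.factor(c("m", "f", "m"))
levels(f)              # The distinct categories: "f" "m"
```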

Functions

R functions are invoked (called) by name, followed by parentheses containing zero or more arguments.

# The following calls the function 'c' (seen earlier) to combine three numeric values into a vector 
c(1,2,3)

## [1] 1 2 3

# This calls the function mean() to calculate the mean of three values
mean(c(5,6,7))

## [1] 6

# This calls function sqrt() to calculate the square root of a number
sqrt(16)

## [1] 4

# TASK 5: Combine three numeric values (2,19,11) into a vector using function c() AND call the function mean() to calculate the mean of the three values
mean(c(2,19,11))

## [1] 10.66667
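Arguments can also be passed by name, and you can define functions of your own. The sketch below is illustrative only; the function name average_of_two is hypothetical and not part of the lab.

```r
# Arguments can be passed by name; na.rm = TRUE tells mean() to skip missing values.
mean(c(5, NA, 7), na.rm = TRUE)   # 6

# round() takes the value and, optionally, the number of digits to keep.
round(10.66667, digits = 2)       # 10.67

# A function called with zero arguments.
Sys.Date()                        # Today's date

# Defining a function; the name 'average_of_two' is hypothetical.
average_of_two <- function(a, b) {
  (a + b) / 2
}
average_of_two(4, 10)             # 7
```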

Importing Data and Variable Assignment

The following shows how to read a file called top_il_income.csv, which is of type csv (comma-separated values), and assign its contents to a variable called top_il_income. Note: You can give the variable another name, but it’s important to choose a meaningful one.

top_il_income = read.csv(file = "data/top_il_income.csv")

# TASK 6: Read a file called il_income.csv and assign a variable for that file called il_income

il_income = read.csv(file = "data/il_income.csv")
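If the data folder is not at hand, the read.csv() workflow can still be tried out with a throwaway file. The sketch below writes a tiny table to a temporary CSV and reads it back; the table contents and the names demo, demo_path, and demo_income are invented for this example and have nothing to do with the Census data.

```r
# Build a tiny illustrative table (values are invented for this example).
demo <- data.frame(county = c("A", "B", "C"),
                   per_capita_income = c(30000, 25000, 28000))

# Write it to a temporary CSV file, then read it back.
demo_path <- tempfile(fileext = ".csv")
write.csv(demo, file = demo_path, row.names = FALSE)
demo_income <- read.csv(file = demo_path)

head(demo_income)    # Shows the first rows of the data
names(demo_income)   # Column names: "county" "per_capita_income"
nrow(demo_income)    # Number of rows: 3
```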

Arithmetic Operations with Data

We can extract values from the dataset to perform calculations. Below we perform subtraction, addition, and division operations on the data kept in variable top_il_income. The $per_capita_income part names the field (column) in the data file, here per_capita_income, and the index [1] selects the first value in that column.

FirstIncome = top_il_income$per_capita_income[1] # This assigns variable FirstIncome the first value under per_capita_income field in file top_il_income
SecondIncome = top_il_income$per_capita_income[2]
FirstIncome-SecondIncome

## [1] 472

FirstIncome+SecondIncome

## [1] 77390

(FirstIncome+SecondIncome)/2

## [1] 38695

# TASK 7: Repeat the above arithmetic operations using instead the $per_capita_income field from file il_income
FirstIncome = il_income$per_capita_income[1]
SecondIncome = il_income$per_capita_income[2]
FirstIncome-SecondIncome

## [1] -8463

FirstIncome+SecondIncome

## [1] 69399

(FirstIncome+SecondIncome)/2

## [1] 34699.5
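The column-and-index pattern used above can be sketched on a small self-contained table. The data frame demo and its values are invented for illustration.

```r
# Invented values standing in for a data file.
demo <- data.frame(county = c("A", "B", "C"),
                   per_capita_income = c(30000, 25000, 28000))

demo$per_capita_income         # The whole column as a vector
demo$per_capita_income[1]      # First value in the column: 30000
demo[1, "per_capita_income"]   # Same value, addressed by row and column

# Difference between each value and the first one, computed in one step.
demo$per_capita_income - demo$per_capita_income[1]   # 0 -5000 -2000
```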

# Basic Statistics

mean(top_il_income$per_capita_income) # This computes the mean of all the data under field per_capita_income in file top_il_income

## [1] 32918.5

median(top_il_income$per_capita_income)

## [1] 31430

quantile(top_il_income$per_capita_income)

##       0%      25%      50%      75%     100% 
## 30594.00 30743.75 31430.00 33103.25 38931.00

summary(top_il_income)

##       rank           county  per_capita_income   population    
##  Min.   : 2.00   DuPage :1   Min.   :30594     Min.   :  7032  
##  1st Qu.: 4.25   Kane   :1   1st Qu.:30744     1st Qu.: 36920  
##  Median :12.00   Kendall:1   Median :31430     Median :194782  
##  Mean   :27.10   Lake   :1   Mean   :32918     Mean   :334866  
##  3rd Qu.:41.00   McHenry:1   3rd Qu.:33103     3rd Qu.:648159  
##  Max.   :90.00   McLean :1   Max.   :38931     Max.   :933736  
##                  (Other):4                                     
##      region   
##  Min.   :2.0  
##  1st Qu.:2.0  
##  Median :3.0  
##  Mean   :3.2  
##  3rd Qu.:4.0  
##  Max.   :5.0  
##

# TASK 8: Repeat the basic statistics (i.e. to compute the mean, median, and quantile) using instead the data under field per_capita_income in file il_income
mean(il_income$per_capita_income)

## [1] 25164.14

median(il_income$per_capita_income)

## [1] 24808.5

quantile(il_income$per_capita_income)

##       0%      25%      50%      75%     100% 
## 14052.00 22666.00 24808.50 26899.75 38931.00

summary(il_income)

##       rank              county   per_capita_income   population     
##  Min.   :  1.00   Adams    : 1   Min.   :14052     Min.   :   4135  
##  1st Qu.: 26.25   Alexander: 1   1st Qu.:22666     1st Qu.:  14284  
##  Median : 51.50   Bond     : 1   Median :24808     Median :  26610  
##  Mean   : 51.50   Boone    : 1   Mean   :25164     Mean   : 126078  
##  3rd Qu.: 76.75   Brown    : 1   3rd Qu.:26900     3rd Qu.:  53319  
##  Max.   :102.00   Bureau   : 1   Max.   :38931     Max.   :5238216  
##                   (Other)  :96                                      
##      region     
##  Min.   :1.000  
##  1st Qu.:3.000  
##  Median :4.000  
##  Mean   :3.735  
##  3rd Qu.:5.000  
##  Max.   :5.000  
##
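Other summary functions follow the same calling pattern. The sketch below uses an invented vector named incomes rather than the lab data, so the numbers carry no real-world meaning.

```r
incomes <- c(30000, 25000, 28000, 41000, 22000)  # Invented values

mean(incomes)     # 29200
median(incomes)   # 28000
min(incomes)      # Smallest value: 22000
max(incomes)      # Largest value: 41000
range(incomes)    # Smallest and largest together
sd(incomes)       # Standard deviation

# quantile() accepts the probabilities you want through the probs argument.
quantile(incomes, probs = c(0.1, 0.9))   # 10th and 90th percentiles
```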

Vectors

Defining a Vector

A sequence of data elements of the same basic type is defined as a vector.

c(2, 3, 5, 8) # This creates a vector of numeric values.

## [1] 2 3 5 8

c(TRUE, FALSE, TRUE) # This creates a vector of logical values.

## [1]  TRUE FALSE  TRUE

c("A", "B", "B-", "C", "D") # This creates a vector of character string values.

## [1] "A"  "B"  "B-" "C"  "D"
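Because a vector holds a single basic type, R converts mixed inputs to a common type. A brief sketch of that behaviour (the variable name mixed is chosen only for this example):

```r
mixed <- c(1, "a", TRUE)
mixed          # "1" "a" "TRUE": everything became a character string
class(mixed)   # "character"

c(1, TRUE, FALSE)   # Logical values become numbers: 1 1 0
```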

Lists

Defining a List

Lists, as opposed to vectors, can hold components of different types.

scores = c(80, 75, 55)  # This creates a vector variable called scores and assigns numeric values to it.
grades = c("B", "C", "D-")  # This creates a vector variable called grades and assigns character string values to it.
office_hours = c(TRUE, FALSE, FALSE) # This creates a vector variable called office_hours and assigns logical values to it.
student = list(scores,grades,office_hours) # This creates a list variable called student with the scores, grades and office_hours vectors as its components.
student # This calls and displays the content of the student list.

## [[1]]
## [1] 80 75 55
## 
## [[2]]
## [1] "B"  "C"  "D-"
## 
## [[3]]
## [1]  TRUE FALSE FALSE

List Slicing

We can retrieve components of the list with the single square bracket [] operator.

student[1]  # This calls and displays first component of the student list.

## [[1]]
## [1] 80 75 55

student[2]  # This calls and displays second component of the student list.

## [[1]]
## [1] "B"  "C"  "D-"

student[3]

## [[1]]
## [1]  TRUE FALSE FALSE

student[1:2] # This calls and displays first two components of the student list.

## [[1]]
## [1] 80 75 55
## 
## [[2]]
## [1] "B"  "C"  "D-"

Member Reference

Using the double square bracket [[]] operator we can reference a member of the list directly. The single bracket [] operator still returns a list (a sub-list containing the selected members), so it does not extract the member itself.

student[[1]] # This calls and displays the elements of the scores vector in student list.

## [1] 80 75 55

student[[1]][1] # This calls and displays first element of scores vector in student list.

## [1] 80

student[[2]][3] #This calls and displays third element of grades vector in student list.

## [1] "D-"

# TASK 9: Call and display the second element of office_hours vector in student list
student[[3]][2]

## [1] FALSE

student[[1]][1:3] # This calls and displays first three elements of scores vector in student list.

## [1] 80 75 55
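The difference between the two bracket operators shows up clearly when checking the class of what each returns. A minimal sketch, rebuilding a two-member version of the student list so that it runs on its own:

```r
scores <- c(80, 75, 55)
grades <- c("B", "C", "D-")
student <- list(scores, grades)

class(student[1])     # "list": single brackets return a smaller list
class(student[[1]])   # "numeric": double brackets return the member itself

mean(student[[1]])    # 70: works because the member is a numeric vector
```

This is why functions such as mean() are applied to student[[1]] rather than to student[1].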

Named List Members

It’s possible to assign names to list members and reference them by names instead of by numeric indexes.

student = list(myscores = scores, mygrades = grades , myoffice_hours = office_hours) 
student

## $myscores
## [1] 80 75 55
## 
## $mygrades
## [1] "B"  "C"  "D-"
## 
## $myoffice_hours
## [1]  TRUE FALSE FALSE

student$myscores

## [1] 80 75 55

student$mygrades

## [1] "B"  "C"  "D-"

student$myoffice_hours

## [1]  TRUE FALSE FALSE
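Named members can also be reached with double brackets and a quoted name, and new members can be added after the list exists. The sketch below is illustrative; the member name passed and the threshold of 60 are assumptions made up for this example.

```r
scores <- c(80, 75, 55)
grades <- c("B", "C", "D-")
student <- list(myscores = scores, mygrades = grades)

names(student)          # "myscores" "mygrades"
student[["myscores"]]   # Same result as student$myscores
student$myscores[2]     # 75

# Adding a new named member after the list has been created.
student$passed <- student$myscores >= 60
student$passed          # TRUE TRUE FALSE
```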

Matrices

All columns in a matrix must have the same data type and the same length. The following creates a numeric matrix called x_mat which consists of 5 rows and 4 columns filled with the sequential numbers 1 to 20.

x_mat = matrix(1:20, nrow=5, ncol=4)

x_mat[,4] # This retrieves and displays the 4th column of the matrix.

## [1] 16 17 18 19 20

x_mat[3,] # This retrieves and displays the 3rd row of the matrix.

## [1]  3  8 13 18

x_mat[2:4,1:3] # This retrieves and displays rows 2,3,4 of columns 1,2,3 of the matrix.

##      [,1] [,2] [,3]
## [1,]    2    7   12
## [2,]    3    8   13
## [3,]    4    9   14

# TASK 10: Retrieve and display rows 2,3,4 of columns 1,2,3,4 of the matrix
x_mat[2:4,1:4]

##      [,1] [,2] [,3] [,4]
## [1,]    2    7   12   17
## [2,]    3    8   13   18
## [3,]    4    9   14   19
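By default matrix() fills column by column, which is why the 4th column above holds 16 to 20. The sketch below, with y_mat as a name introduced only for this example, shows the dimensions, a single-cell lookup, and the byrow argument that fills row by row instead.

```r
x_mat <- matrix(1:20, nrow = 5, ncol = 4)
dim(x_mat)     # Number of rows and columns: 5 4
x_mat[2, 3]    # Row 2, column 3: 12

# byrow = TRUE fills the matrix row by row instead of column by column.
y_mat <- matrix(1:20, nrow = 5, ncol = 4, byrow = TRUE)
y_mat[1, ]     # First row: 1 2 3 4

t(x_mat)       # Transpose: rows become columns (4 rows, 5 columns)
```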

Data Frames

A data frame is more general than a matrix, in that different columns can have different data types (numeric, character, logical, factor). It is a powerful way to work with mixed data structures.

Defining a Data Frame

When we need to store data in table form, we use data frames, which are created by combining lists of vectors of equal length. The variables of a data set are the columns and the observations are the rows.

The str() function helps us to display the internal structure of any R data structure or object to make sure that it’s correct.

str(top_il_income)

## 'data.frame':    10 obs. of  5 variables:
##  $ rank             : int  2 3 32 44 67 16 4 8 5 90
##  $ county           : Factor w/ 10 levels "DuPage","Kane",..: 1 4 5 7 8 3 10 6 2 9
##  $ per_capita_income: int  38931 38459 33118 33059 31750 31110 30791 30728 30645 30594
##  $ population       : int  933736 703910 46045 33879 16387 123355 687263 266209 530847 7032
##  $ region           : int  2 2 4 5 4 2 2 5 2 4

Creating a Data Frame

Snapshot of the solar system.

name = c("Earth", "Mars", "Jupiter")
type = c("Terrestrial","Terrestrial", "Gas giant")
diameter = c(1, 0.532, 11.209)
rotation = c(1, 1.03, 0.41)
rings = c(FALSE, FALSE, TRUE)

Now, by combining the vectors of equal size, we can create a data frame object.

planets_df = data.frame(name,type,diameter,rotation,rings)
planets_df

##      name        type diameter rotation rings
## 1   Earth Terrestrial    1.000     1.00 FALSE
## 2    Mars Terrestrial    0.532     1.03 FALSE
## 3 Jupiter   Gas giant   11.209     0.41  TRUE
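Once a data frame exists, columns can be pulled out with $ and rows can be filtered with a logical condition. The sketch below rebuilds a reduced version of planets_df so it runs on its own; the cutoff of 2 for diameter is an arbitrary choice for illustration.

```r
name <- c("Earth", "Mars", "Jupiter")
type <- c("Terrestrial", "Terrestrial", "Gas giant")
diameter <- c(1, 0.532, 11.209)
rings <- c(FALSE, FALSE, TRUE)
planets_df <- data.frame(name, type, diameter, rings)

planets_df$diameter                           # One column as a vector
planets_df[planets_df$rings, ]                # Rows where rings is TRUE
planets_df[planets_df$diameter < 2, "name"]   # Names of planets with diameter below 2
nrow(planets_df)                              # Number of observations: 3
ncol(planets_df)                              # Number of variables: 4
```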

Suggested Exercises & Resources

Exercises

Datacamp - Learn Data Science from your browser: https://www.datacamp.com/courses/free-introduction-to-r
R-tutor - An R intro to stats that explains basic R concepts: http://www.r-tutor.com/r-introduction

Data Sources

Data samples used in this worksheet were downloaded from the U.S. Census Bureau American FactFinder site.
“SELECTED ECONOMIC CHARACTERISTICS 2006-2010 American Community Survey 5-Year Estimates” - U.S. Census Bureau. Retrieved 2016-09-09: https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml

Introduction to R (bsad_lab01)

CME Group Foundation Business Analytics Lab

Max van de Werken

January 31, 2018