About

R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.

This notebook is a tutorial on how to use R.

Setting the Working Directory

Before starting to work with R, we need to set the working directory to the source file location, so that relative paths such as data/top_il_income.csv resolve correctly.
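As a minimal sketch of what setting the working directory looks like in code, the block below uses the base R functions getwd() and setwd(). It switches to a temporary directory only so that it runs anywhere; in practice you would pass the folder that contains your own .Rmd file, which is a path you have to supply yourself.

```r
old_dir <- getwd()      # remember the current working directory
print(old_dir)          # show where R currently looks for files

setwd(tempdir())        # switch to a temporary folder (stand-in for your project folder)
print(getwd())          # confirm that the working directory changed

setwd(old_dir)          # switch back so the rest of the session is unaffected
```

In RStudio the same result can be reached from the menu: Session > Set Working Directory > To Source File Location.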

ATTENTION:

1. SET UP THE DIRECTORY AS INSTRUCTED ABOVE BEFORE YOU PROCEED WITH THIS ASSIGNMENT TO AVOID ANY ERRORS.

2. THERE ARE 10 TASKS IN THIS ASSIGNMENT, EACH IS WORTH 1 POINT.

3. DO THE TASKS SEQUENTIALLY TO AVOID ERRORS WHEN YOU RUN THE CODE.

4. BEFORE YOU KNIT YOUR RMD FILE AND PUBLISH IT ON RPUBS, MAKE SURE TO RUN EACH CODE CHUNK BY CLICKING THE GREEN ARROW ON THE RIGHT OF THE CHUNK.

5. FULL POINTS WILL ONLY BE GIVEN FOR TASKS WHOSE CODE RUNS SUCCESSFULLY (ERROR-FREE).

Basic Operations

First we will begin with a few basic operations.

Variable assignment

We assign values to variables using the assignment operator ‘=’. Another, more general form of assignment is the ‘<-’ operator. A variable allows you to store a value or an object (e.g. a function). Note: R is case sensitive, so X and x are treated as two different variable names.

x = 1000      # Here x is the variable name and 1000 is the value assigned for variable x.
number = 3    # Here number is the variable name and 3 is the value assigned for variable number.
zz <- 20      # Here zz is the variable name and 20 is the value assigned for zz.
vec = c(500,100,10,1) # This is a vector named vec which is created using the generic combine function 'c'
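The note on case sensitivity can be checked with a short sketch. The variable names X and total below are made up for illustration and are not used elsewhere in this lab.

```r
x = 1000        # lower-case x, assigned with '='
X <- 5          # upper-case X is a separate variable, assigned with '<-'

x               # displays 1000
X               # displays 5
x == X          # FALSE, because the two variables hold different values

total <- x + X  # variables can be used in later expressions
total           # displays 1005
```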

TASK 1: Assign a value 99999 to a variable named myvar

myvar = 99999

x     # This calls variable x and displays its value.

## [1] 1000

zz    # This calls variable zz and displays its value.

## [1] 20

TASK 2: Call variable myvar and display its value

myvar

## [1] 99999

vec         # This calls vector vec and displays all of its content.

## [1] 500 100  10   1

vec[2]      # This calls vector vec and displays its second element.

## [1] 100

vec[1:3]    # This calls vector vec and displays its first through third elements.

## [1] 500 100  10

TASK 3: Call vector vec and display its second through fourth elements

vec[2:4]

## [1] 100  10   1

Common Arithmetic Operations

The following are some simple arithmetic operations.

60+47      # This operation adds 60 and 47.

## [1] 107

10-7       # This operation subtracts 7 from 10.

## [1] 3

2*3        # This operation multiplies 2 and 3. Multiplication uses the * symbol.

## [1] 6

2000/2     # This is a division operation.

## [1] 1000

4^2        # This is how to compute 4 to the power of 2.

## [1] 16

TASK 4: Compute 2 to the power of 3

2^3

## [1] 8

Basic Data Types

R works with numerous data types. Some of the most basic types are: character (string-"TEXT"), numeric, logical (Boolean-TRUE/FALSE), and factor. These types are also called classes. Therefore, in R there are different classes for different data types; e.g. class character holds data of type character, class numeric holds data of type numeric, etc. A variable’s class depends on the type of the data that the variable holds.

# The following assigns variable v a value of type character; examples of character
# values: "TRUE", '23.4'
# Note: character data can be written with double or single quotes.
v = "TRUE"  # This assigns value "TRUE" of type character to variable v. 
class(v)    # This calls and displays the class which variable v belongs to.

## [1] "character"

# The following assigns variable w a value of type numeric; for example: 12.3, 5
w = 23.5    # This assigns value 23.5 of type numeric to variable w.        
class(w)    # This calls and displays the class which variable w belongs to.

## [1] "numeric"

#The following is assigning variable t a data of type: Logical; for example: TRUE, FALSE.
t = TRUE    # This assigns value TRUE of type logical to variable t. 
            # Note the difference between "TRUE" and TRUE (without quotes).
class(t)    # This calls and displays the class which variable t belongs to.

## [1] "logical"

#The following is type: Factor (nominal, categorical); for example: m f m f m
u = as.factor(c("m", "f", "m"))
class(u)

## [1] "factor"
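The idea that a variable's class follows the data it holds can be sketched by converting between types. The conversion functions as.numeric(), as.character() and levels() are part of base R; the variable names txt, num and sex are illustrative only.

```r
txt <- "23.4"              # character, because of the quotes
class(txt)                 # "character"

num <- as.numeric(txt)     # convert the text to a number
class(num)                 # "numeric"
num + 1                    # arithmetic now works: 24.4

sex <- as.factor(c("m", "f", "m"))
levels(sex)                # the distinct categories: "f" "m"
as.character(sex)          # back to plain text: "m" "f" "m"
```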

Functions

R functions are invoked (called) by name, followed by parentheses that contain zero or more arguments.

# The following calls the function 'c' (seen earlier) to combine three numeric values into a vector.
c(1,2,3)

## [1] 1 2 3

# This calls the function mean() to calculate the mean of three values.
mean(c(10, 50, 3))

## [1] 21

# This calls function sqrt() to calculate the square root of a number.
sqrt(9)

## [1] 3
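To illustrate "zero or more arguments", the sketch below calls base R functions with zero, one and two arguments, and then defines a small function of its own. The name average_of_two is hypothetical and only serves this example.

```r
Sys.Date()                      # zero arguments: returns today's date

round(3.14159)                  # one argument: rounds to a whole number, 3
round(3.14159, digits = 2)      # a second, named argument: 3.14

# A small user-defined function; 'average_of_two' is a hypothetical name.
average_of_two <- function(a, b) {
  (a + b) / 2
}
average_of_two(10, 20)          # 15
```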

TASK 5: Combine three numeric values (100,40,70) into a vector using function c() AND call the function mean() to calculate the mean of the three values.

mean(c(100,40,70))

## [1] 70

Importing Data and Variable Assignment

The following code shows how to read a file called top_il_income.csv, which is of type csv (comma-separated values), and assign its contents to a variable called topIncome. Note: You can give the variable another name, but it’s important to choose a meaningful one.

topIncome = read.csv(file = "data/top_il_income.csv")
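If the data folder is not at hand, the mechanics of read.csv() can still be tried out. The sketch below writes a tiny two-row CSV to a temporary file and reads it back. The two rows are reconstructed from the str(topIncome) and summary(topIncome) output shown elsewhere in this notebook, and the names csv_path and sample_income are made up for illustration.

```r
# Create a small CSV file in a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "rank,county,per_capita_income,population,region",
  "2,DuPage,38931,933736,2",
  "3,Lake,38459,703910,2"
), csv_path)

# Read the file into a data frame, in the same way as the lab's data files.
sample_income <- read.csv(file = csv_path)

head(sample_income)    # show the first rows
names(sample_income)   # show the column (field) names
nrow(sample_income)    # number of rows read: 2
```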

TASK 6: Read a file called il_income.csv and assign a variable for that file called illinoisIncome.

illinoisIncome = read.csv(file = "data/il_income.csv")

Arithmetic Operations with Data

We can extract values from the dataset to perform calculations. Below we perform subtraction, addition, and division operations on the data kept in variable topIncome. The $per_capita_income indicates the name of the field (column) in the data file, which is per_capita_income, and the index [1] indicates the first value in that column.

firstIncome = topIncome$per_capita_income[1] 
# The above code assigns variable firstIncome the first value under per_capita_income field in file topIncome.

secondIncome = topIncome$per_capita_income[2]
# The above code assigns variable secondIncome the second value under per_capita_income field in file topIncome.

firstIncome-secondIncome

## [1] 472

# The above code subtracts the value of secondIncome from the value of firstIncome.

firstIncome+secondIncome

## [1] 77390

(firstIncome+secondIncome)/2

## [1] 38695
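The same values can be reached in more than one way. The sketch below builds a two-row stand-in for topIncome (income_demo is a hypothetical name; the two incomes are the first two values listed by str(topIncome) in this notebook) and compares column-then-index access with [row, column] indexing.

```r
# Two-row stand-in for topIncome, used only for this illustration.
income_demo <- data.frame(
  county = c("DuPage", "Lake"),
  per_capita_income = c(38931, 38459)
)

income_demo$per_capita_income[1]        # column by name, then first value
income_demo[1, "per_capita_income"]     # same value via [row, column] indexing

# The average of the first two values, without intermediate variables.
mean(income_demo$per_capita_income[1:2])   # 38695, as computed above
```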

TASK 7: Repeat the above arithmetic operations using instead the $per_capita_income field from file illinoisIncome

firstIncome = illinoisIncome$per_capita_income[1]
secondIncome = illinoisIncome$per_capita_income[2]
firstIncome-secondIncome

## [1] -8463

firstIncome+secondIncome

## [1] 69399

(firstIncome+secondIncome)/2

## [1] 34699.5

Basic Statistics

The following shows how to compute some basic statistics on the data in file topIncome.

mean(topIncome$per_capita_income)

## [1] 32918.5

# The above code computes the mean of all the data under field per_capita_income in file topIncome
median(topIncome$per_capita_income)         # Computes the median

## [1] 31430

quantile(topIncome$per_capita_income)       # Computes the quartiles (the default quantiles)

##       0%      25%      50%      75%     100% 
## 30594.00 30743.75 31430.00 33103.25 38931.00

summary(topIncome)                          # Generate the summary

##       rank           county  per_capita_income   population    
##  Min.   : 2.00   DuPage :1   Min.   :30594     Min.   :  7032  
##  1st Qu.: 4.25   Kane   :1   1st Qu.:30744     1st Qu.: 36920  
##  Median :12.00   Kendall:1   Median :31430     Median :194782  
##  Mean   :27.10   Lake   :1   Mean   :32918     Mean   :334866  
##  3rd Qu.:41.00   McHenry:1   3rd Qu.:33103     3rd Qu.:648159  
##  Max.   :90.00   McLean :1   Max.   :38931     Max.   :933736  
##                  (Other):4                                     
##      region   
##  Min.   :2.0  
##  1st Qu.:2.0  
##  Median :3.0  
##  Mean   :3.2  
##  3rd Qu.:4.0  
##  Max.   :5.0  
##
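By default quantile() returns the quartiles, but its probs argument accepts any probabilities between 0 and 1. The sketch below uses the ten per_capita_income values listed by str(topIncome) in this notebook, stored in a vector with the illustrative name income, and adds a few other one-number summaries from base R.

```r
# The ten per_capita_income values of topIncome, as listed by str(topIncome).
income <- c(38931, 38459, 33118, 33059, 31750,
            31110, 30791, 30728, 30645, 30594)

quantile(income)                          # default: 0%, 25%, 50%, 75%, 100%
quantile(income, probs = c(0.10, 0.90))   # 10th and 90th percentiles instead

# Other common one-number summaries from base R.
min(income)     # smallest value: 30594
max(income)     # largest value: 38931
range(income)   # both at once
sd(income)      # standard deviation
```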

TASK 8: Repeat the above basic statistics (i.e. compute the mean, median, and quantile) using instead the data under field per_capita_income in file illinoisIncome.

mean(illinoisIncome$per_capita_income)

## [1] 25164.14

median(illinoisIncome$per_capita_income)

## [1] 24808.5

quantile(illinoisIncome$per_capita_income)

##       0%      25%      50%      75%     100% 
## 14052.00 22666.00 24808.50 26899.75 38931.00

summary(illinoisIncome)

##       rank              county   per_capita_income   population     
##  Min.   :  1.00   Adams    : 1   Min.   :14052     Min.   :   4135  
##  1st Qu.: 26.25   Alexander: 1   1st Qu.:22666     1st Qu.:  14284  
##  Median : 51.50   Bond     : 1   Median :24808     Median :  26610  
##  Mean   : 51.50   Boone    : 1   Mean   :25164     Mean   : 126078  
##  3rd Qu.: 76.75   Brown    : 1   3rd Qu.:26900     3rd Qu.:  53319  
##  Max.   :102.00   Bureau   : 1   Max.   :38931     Max.   :5238216  
##                   (Other)  :96                                      
##      region     
##  Min.   :1.000  
##  1st Qu.:3.000  
##  Median :4.000  
##  Mean   :3.735  
##  3rd Qu.:5.000  
##  Max.   :5.000  
##

Vectors

Defining a Vector

A sequence of data elements of the same basic type is defined as a vector.

c(2, 3, 5, 8)                  # This creates a vector of numeric values.

## [1] 2 3 5 8

c(TRUE, FALSE, TRUE)           # This creates a vector of logical values.

## [1]  TRUE FALSE  TRUE

c("A", "B", "B-", "C", "D")    # This creates a vector of character string values.

## [1] "A"  "B"  "B-" "C"  "D"
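Because all elements of a vector share one basic type, R converts mixed input to a single common type. The sketch below shows that behaviour; the names mixed and nums are illustrative only.

```r
mixed <- c(1, "a", TRUE)   # numeric, character and logical mixed together
mixed                      # "1" "a" "TRUE": everything became character
class(mixed)               # "character"

nums <- c(1, TRUE, FALSE)  # numeric and logical mixed
nums                       # 1 1 0: TRUE became 1 and FALSE became 0
class(nums)                # "numeric"

length(mixed)              # number of elements: 3
```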

Lists

Defining a List

Lists, as opposed to vectors, can hold components of different types.

scores = c(80, 75, 55)  
# The above code creates a vector variable called scores and assigns numeric values to it.                   
grades = c("B", "C", "D-")  
# The above code creates a vector variable called grades and assigns character string values to it.

office_hours = c(TRUE, FALSE, FALSE) 
# The above code creates a vector variable called office_hours and assigns logical values to it.

student = list(scores, grades, office_hours) 
# The above code creates a list variable called student which contains three vectors: scores, grades and office_hours.

student # This calls and displays the content of the student list.

## [[1]]
## [1] 80 75 55
## 
## [[2]]
## [1] "B"  "C"  "D-"
## 
## [[3]]
## [1]  TRUE FALSE FALSE

List Slicing

We can retrieve components (or the vectors) of the list with the single square bracket [] operator.

student[1]     # This calls and displays first vector of the student list which is scores.

## [[1]]
## [1] 80 75 55

student[2]     # This calls and displays second vector of the student list which is grades.

## [[1]]
## [1] "B"  "C"  "D-"

               # How to call and display the third vector of the student list?

student[2:3]   # This calls and displays the second to the third vectors of the student list which are .....

## [[1]]
## [1] "B"  "C"  "D-"
## 
## [[2]]
## [1]  TRUE FALSE FALSE

Member Reference

Using the double square bracket [[]] operator, we can reference a member of the list directly. A single bracket [] still returns a list (a sub-list containing the selected members), so it does not extract the member itself.

student[[1]] # This calls and displays the members of the scores vector in student list.

## [1] 80 75 55

student[[1]][3] # This calls and displays the third member of the first vector in student list which is the third member of scores.

## [1] 55

student[[2]][1] # This calls and displays the first member of the second vector in student list which is the first member of grades.

## [1] "B"

TASK 9: Call and display the second element of office_hours vector in student list.

student[[3]][2]

## [1] FALSE

student[[1]][1:3] # This calls and displays first three members of scores vector in student list.

## [1] 80 75 55
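The difference between [] and [[]] becomes visible when checking the class of each result. The sketch below rebuilds the unnamed student list so that it runs on its own.

```r
scores <- c(80, 75, 55)
grades <- c("B", "C", "D-")
office_hours <- c(TRUE, FALSE, FALSE)
student <- list(scores, grades, office_hours)

class(student[1])      # "list": a one-element list that contains scores
class(student[[1]])    # "numeric": the scores vector itself

mean(student[[1]])     # 70: works, because this is a numeric vector
length(student[1])     # 1: the sub-list has one component
length(student[[1]])   # 3: the vector has three members
```

Calling mean(student[1]) instead would not give the average, because a list is not a numeric vector.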

Named List Members

It’s possible to assign names to list members and reference them by names instead of by numeric indexes.

student = list(myscores = scores, mygrades = grades , myoffice_hours = office_hours) 
student

## $myscores
## [1] 80 75 55
## 
## $mygrades
## [1] "B"  "C"  "D-"
## 
## $myoffice_hours
## [1]  TRUE FALSE FALSE

student$myscores

## [1] 80 75 55

student$mygrades

## [1] "B"  "C"  "D-"

student$myoffice_hours

## [1]  TRUE FALSE FALSE
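Named members can also be reached with [[ ]] and a quoted name, and they can be updated in place. The sketch below works on a separate copy, student_copy (a hypothetical name), so that the student list used in this lab stays unchanged.

```r
student_copy <- list(myscores = c(80, 75, 55),
                     mygrades = c("B", "C", "D-"),
                     myoffice_hours = c(TRUE, FALSE, FALSE))

names(student_copy)               # the three member names
student_copy[["mygrades"]]        # same result as student_copy$mygrades
student_copy$myscores[2]          # second score: 75

student_copy$myscores[3] <- 60    # members can be updated by name as well
student_copy$myscores             # 80 75 60
```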

Matrices

All elements of a matrix must have the same data type, and all columns must have the same length. The following creates a numeric matrix called my_mat with 4 rows and 5 columns, filled column by column with the sequential numbers 1 to 20.

my_mat = matrix(1:20, nrow=4, ncol=5)

my_mat       # This retrieves and displays all content of the matrix.

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    5    9   13   17
## [2,]    2    6   10   14   18
## [3,]    3    7   11   15   19
## [4,]    4    8   12   16   20

my_mat[,4]       # This retrieves and displays the 4th column of the matrix.

## [1] 13 14 15 16

my_mat[3,]       # This retrieves and displays the 3rd row of the matrix.

## [1]  3  7 11 15 19

my_mat[2:4,1:3] # This retrieves and displays rows 2,3,4 of columns 1,2,3 of the matrix.

##      [,1] [,2] [,3]
## [1,]    2    6   10
## [2,]    3    7   11
## [3,]    4    8   12

TASK 10: Retrieve and display rows 3,4 of columns 1,2,3 of the matrix.

my_mat[3:4,1:3]

##      [,1] [,2] [,3]
## [1,]    3    7   11
## [2,]    4    8   12
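As the output above shows, matrix() fills column by column by default. The sketch below contrasts this with the byrow = TRUE argument of matrix(), using the illustrative names by_col and by_row.

```r
by_col <- matrix(1:20, nrow = 4, ncol = 5)                 # default: fill column by column
by_row <- matrix(1:20, nrow = 4, ncol = 5, byrow = TRUE)   # fill row by row instead

by_col[1, ]    # first row: 1 5 9 13 17
by_row[1, ]    # first row: 1 2 3 4 5

dim(by_row)    # number of rows and columns: 4 5
by_row[2, 3]   # single element in row 2, column 3: 8
```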

Data Frames

A data frame is more general than a matrix, in that different columns can have different data types (numeric, character, logical, factor). It is a powerful way to work with mixed data structures.

Defining a Data Frame

When we need to store data in table form, we use data frames, which are created by combining lists of vectors of equal length. The variables of a data set are the columns and the observations are the rows.

The str() function helps us to display the internal structure of any R data structure or object to make sure that it’s correct.

str(topIncome)

## 'data.frame':    10 obs. of  5 variables:
##  $ rank             : int  2 3 32 44 67 16 4 8 5 90
##  $ county           : Factor w/ 10 levels "DuPage","Kane",..: 1 4 5 7 8 3 10 6 2 9
##  $ per_capita_income: int  38931 38459 33118 33059 31750 31110 30791 30728 30645 30594
##  $ population       : int  933736 703910 46045 33879 16387 123355 687263 266209 530847 7032
##  $ region           : int  2 2 4 5 4 2 2 5 2 4

Creating a Data Frame

Snapshot of the solar system.

name = c("Earth", "Mars", "Jupiter")
type = c("Terrestrial","Terrestrial", "Gas giant")
diameter = c(1, 0.532, 11.209)
rotation = c(1, 1.03, 0.41)
rings = c(FALSE, FALSE, TRUE)

Now, by combining the vectors of equal size, we can create a data frame object.

planets_df = data.frame(name,type,diameter,rotation,rings)
planets_df

##      name        type diameter rotation rings
## 1   Earth Terrestrial    1.000     1.00 FALSE
## 2    Mars Terrestrial    0.532     1.03 FALSE
## 3 Jupiter   Gas giant   11.209     0.41  TRUE
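Once a data frame exists, its columns and rows can be selected much like those of a list or a matrix. The sketch below rebuilds planets_df so that it runs on its own and shows a few common selections. Note that the type shown by str() for the name and type columns depends on the R version: before R 4.0.0 data.frame() turned strings into factors by default, while later versions keep them as character.

```r
planets_df <- data.frame(
  name = c("Earth", "Mars", "Jupiter"),
  type = c("Terrestrial", "Terrestrial", "Gas giant"),
  diameter = c(1, 0.532, 11.209),
  rotation = c(1, 1.03, 0.41),
  rings = c(FALSE, FALSE, TRUE)
)

str(planets_df)                          # one line per column with its type
planets_df$diameter                      # one column as a vector
planets_df[planets_df$rings, ]           # only the rows where rings is TRUE (Jupiter)
planets_df[planets_df$diameter < 2, c("name", "diameter")]   # Earth and Mars
nrow(planets_df)                         # 3 observations
```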

Suggested Exercises & Resources

Exercises

Datacamp - Learn Data Science from your browser: https://www.datacamp.com/courses/free-introduction-to-r
R-tutor - An R intro to stats that explains basic R concepts: http://www.r-tutor.com/r-introduction

Data Sources

Data samples used in this worksheet were downloaded from the U.S. Census Bureau American FactFinder site.

“SELECTED ECONOMIC CHARACTERISTICS 2006-2010 American Community Survey 5-Year Estimates” - U.S. Census Bureau. Retrieved 2016-09-09: https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml

bsad_lab01

CME Group Foundation Business Analytics Lab

Chase Wright

01/24/2019