R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical techniques (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, and more) and graphical techniques, and is highly extensible.

This notebook is a tutorial on how to use R.

## Setting the Working Directory

Before starting to work with R, we need to set the working directory to the location of the source file, so that relative paths such as data/top_il_income.csv resolve correctly.
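As a minimal sketch of what this step can look like in code: getwd() reports the current working directory and setwd() changes it. The folder used here, tempdir(), is only a stand-in for the folder that holds your source file. In RStudio the same result can usually be reached from the Session menu (Set Working Directory, To Source File Location).

```r
# Show the current working directory
getwd()

# Change it; tempdir() is a placeholder for your own project folder.
# setwd() returns the previous directory, which we keep to restore later.
old_wd <- setwd(tempdir())
getwd()

# Go back to where we started
setwd(old_wd)
```

Restoring the old directory at the end is optional; it simply keeps the example from changing your session.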

# Basic Operations

We begin with a few basic operations.

## Variable assignment

We assign values to variables using the assignment operator ‘=’. Another, more general form of assignment is the ‘<-’ operator. A variable lets you store a value or an object (e.g. a function).

x = 128
y = 16
z <- 5
vars = c(2,4,8,16,32) # This is a vector created using the generic combine function 'c'
x # display value of variable x
## [1] 128
z # displays value of variable z
## [1] 5
vars[1] # returns the first value in the vector vars
## [1] 2
vars[2] # returns the second value in the vector vars
## [1] 4
vars[1:3] # returns the first through third values in the vector vars
## [1] 2 4 8
vars # displays the whole vector
## [1]  2  4  8 16 32
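One way to see why ‘<-’ is considered the more general form is to use both operators inside a function call. The sketch below is only an illustration; the names demo_val and demo_arg are hypothetical and do not appear elsewhere in this notebook.

```r
# '<-' assigns wherever it appears, including inside a function call.
total_arrow <- sum(demo_val <- c(1, 2, 3))
exists("demo_val")   # TRUE: the vector was also stored as a variable

# Inside a call, '=' only labels an argument; no variable is created.
total_equal <- sum(demo_arg = c(1, 2, 3))
exists("demo_arg")   # FALSE, assuming no object of that name already exists
```

At the top level of a script the two operators behave the same, which is why this tutorial uses them interchangeably.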

## Common Arithmetic Operations

Below are some simple arithmetic operations.

12*6
## [1] 72
128/16
## [1] 8
9^2
## [1] 81
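A few more operators are worth knowing alongside these. The short sketch below shows integer division, the remainder operator, and how parentheses change the order of evaluation; the numbers are arbitrary.

```r
int_div   <- 17 %/% 5      # integer division: 3
remainder <- 17 %% 5       # remainder (modulo): 2
no_parens <- 2 + 3 * 4     # multiplication is evaluated first: 14
parens    <- (2 + 3) * 4   # parentheses override that order: 20

c(int_div, remainder, no_parens, parens)
```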

## Basic Data Types

R works with numerous data types. Some of the most basic are numeric, integer, logical (Boolean: TRUE/FALSE), and character (string: "TEXT").

#Type: Character
#Example:"TRUE",'23.4'

v = "TRUE"
class(v)                           
## [1] "character"
#Type: Numeric
#Example: 12.3,5

v = 23.5
class(v)                   
## [1] "numeric"
#Type: Logical
#Example: TRUE,FALSE

v = TRUE
class(v)
## [1] "logical"
#Type: Factor (nominal, categorical)
#Example: m f m f m

v = as.factor(c("m", "f", "m"))
class(v)
## [1] "factor"
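The examples above do not cover the integer type or conversion between types. The following sketch fills that gap; the variable names are chosen only for illustration.

```r
n <- 5L                  # the L suffix creates an integer
class(n)                 # "integer"
class(5)                 # "numeric": plain numbers are stored as doubles
is.numeric(n)            # TRUE: integers also count as numeric

txt <- "23.4"
num <- as.numeric(txt)   # convert a character string to a number
num + 1                  # 24.4

sex <- factor(c("m", "f", "m"))
levels(sex)              # the distinct categories of the factor: "f" "m"
```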

## Functions

R functions are invoked by name, followed by parentheses containing zero or more arguments.

# The following applies the function 'c' (seen earlier) to combine three numeric values into a vector
c(1,2,3)
## [1] 1 2 3
# Example of the function mean() to calculate the mean of three values
mean(c(5,6,7))
## [1] 6
# Square root of a number
sqrt(99)
## [1] 9.949874
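Arguments can also be passed by name, which makes calls easier to read when a function takes several of them. A brief sketch using three base R functions:

```r
rounded  <- round(3.14159, digits = 2)        # 3.14
avg      <- mean(c(5, NA, 7), na.rm = TRUE)   # NA is dropped, so the mean is 6
sequence <- seq(from = 0, to = 10, by = 5)    # 0 5 10

rounded
avg
sequence
```

Typing ?mean (or ?round, ?seq) at the console opens the help page listing each function's arguments.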

## Importing Data and Variable Assignment

# Here we read a CSV (comma-separated values) file, a format that spreadsheet programs such as Excel can open and export
top_il_income = read.csv(file = "data/top_il_income.csv")
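If the data file is not at hand, the import step can still be tried out with a temporary file. The sketch below writes a small, made-up data frame to a CSV and reads it back with read.csv; the column names and values are hypothetical and unrelated to the Illinois income data.

```r
# Hypothetical data standing in for a real CSV file
demo_df <- data.frame(county = c("A", "B", "C"),
                      income = c(30000, 25000, 28000))

# Write it to a temporary file, then read it back the same way as above
demo_path <- tempfile(fileext = ".csv")
write.csv(demo_df, file = demo_path, row.names = FALSE)
demo_read <- read.csv(file = demo_path)

head(demo_read)      # first rows of the imported data
nrow(demo_read)      # number of observations: 3
names(demo_read)     # column names: "county" "income"
```

Checking head(), nrow(), and names() right after an import is a quick way to confirm the file was read as expected.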

## Arithmetic Operations with Data

We can extract values from the dataset to perform calculations.

DuPage = top_il_income$per_capita_income[1]
Lake = top_il_income$per_capita_income[2]
DuPage-Lake
## [1] 472
DuPage+Lake
## [1] 77390
(DuPage+Lake)/2
## [1] 38695
# Repeat the above arithmetic operations using McHenry and Sangamon counties instead
McHenry = top_il_income$per_capita_income[3]
Sangamon = top_il_income$per_capita_income[10]
McHenry - Sangamon
## [1] 2524
McHenry + Sangamon
## [1] 63712
(McHenry+Sangamon)/2
## [1] 31856

## Basic Statistics

mean(il_income$per_capita_income)
## [1] 25164.14
median(il_income$per_capita_income)
## [1] 24808.5
quantile(il_income$per_capita_income)
##       0%      25%      50%      75%     100%
## 14052.00 22666.00 24808.50 26899.75 38931.00
summary(il_income)
##       rank              county   per_capita_income   population
##  Min.   :  1.00   Adams    : 1   Min.   :14052     Min.   :   4135
##  1st Qu.: 26.25   Alexander: 1   1st Qu.:22666     1st Qu.:  14284
##  Median : 51.50   Bond     : 1   Median :24809     Median :  26610
##  Mean   : 51.50   Boone    : 1   Mean   :25164     Mean   : 126078
##  3rd Qu.: 76.75   Brown    : 1   3rd Qu.:26900     3rd Qu.:  53319
##  Max.   :102.00   Bureau   : 1   Max.   :38931     Max.   :5238216
##                   (Other)  :96
##      region
##  Min.   :1.000
##  1st Qu.:3.000
##  Median :4.000
##  Mean   :3.735
##  3rd Qu.:5.000
##  Max.   :5.000
## 
# Repeat the basic statistics here using the data from the file top_il_income instead
mean(top_il_income$per_capita_income)
## [1] 32918.5
median(top_il_income$per_capita_income)
## [1] 31430
quantile(top_il_income$per_capita_income)
##       0%      25%      50%      75%     100%
## 30594.00 30743.75 31430.00 33103.25 38931.00
summary(top_il_income)
##       rank           county  per_capita_income   population
##  Min.   : 2.00   DuPage :1   Min.   :30594     Min.   :  7032
##  1st Qu.: 4.25   Kane   :1   1st Qu.:30744     1st Qu.: 36921
##  Median :12.00   Kendall:1   Median :31430     Median :194782
##  Mean   :27.10   Lake   :1   Mean   :32919     Mean   :334866
##  3rd Qu.:41.00   McHenry:1   3rd Qu.:33103     3rd Qu.:648159
##  Max.   :90.00   McLean :1   Max.   :38931     Max.   :933736
##                  (Other):4
##      region
##  Min.   :2.0
##  1st Qu.:2.0
##  Median :3.0
##  Mean   :3.2
##  3rd Qu.:4.0
##  Max.   :5.0
## 
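The same functions can be checked by hand on a small vector, which makes it easier to see what each one returns. The values below are hypothetical and are not taken from the data files.

```r
incomes <- c(2, 4, 8, 16, 32)   # made-up values for illustration

avg <- mean(incomes)            # (2 + 4 + 8 + 16 + 32) / 5 = 12.4
mid <- median(incomes)          # middle value of the sorted data: 8
qs  <- quantile(incomes)        # 0%, 25%, 50%, 75% and 100% points
rng <- range(incomes)           # smallest and largest value: 2 32

avg
mid
qs
rng
```

With five sorted values, the default quantiles fall exactly on the data points, so the 50% quantile equals the median.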

# Vectors

## Defining a Vector

A sequence of data elements of the same basic type is defined as a vector.

# vector of numeric values
c(2, 3, 5, 8)
## [1] 2 3 5 8
# vector of logical values.
c(TRUE, FALSE, TRUE)
## [1]  TRUE FALSE  TRUE
# vector of character strings.
c("A", "B", "B-", "C", "D")
## [1] "A"  "B"  "B-" "C"  "D"
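Because all elements of a vector must share one type, R converts mixed input to a common type. The sketch below shows this coercion, along with element-wise arithmetic; the variable names are illustrative only.

```r
mixed <- c(1, "a", TRUE)
mixed                        # "1" "a" "TRUE": everything became character
class(mixed)                 # "character"

as_numbers <- c(1, TRUE, FALSE)
as_numbers                   # logical values become numbers: 1 1 0

doubled <- c(2, 3, 5, 8) * 2
doubled                      # arithmetic applies element by element: 4 6 10 16
length(doubled)              # 4
```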

# Lists

## Defining a List

Lists, as opposed to vectors, can hold components of different types.

scores = c(80, 75, 55)  # vector of numeric values
grades = c("B", "C", "D-")  # vector of character strings.

office_hours = c(TRUE, FALSE, FALSE) # vector of logical values.
student = list(scores,grades,office_hours) # list of vectors
student
## [[1]]
## [1] 80 75 55
##
## [[2]]
## [1] "B"  "C"  "D-"
##
## [[3]]
## [1]  TRUE FALSE FALSE

## List Slicing

We can retrieve components of the list with the single square bracket [] operator. The result is itself a list.

student[1]     
## [[1]]
## [1] 80 75 55
student[2]
## [[1]]
## [1] "B"  "C"  "D-"
student[3]
## [[1]]
## [1]  TRUE FALSE FALSE
# first two components of the list
student[1:2]
## [[1]]
## [1] 80 75 55
##
## [[2]]
## [1] "B"  "C"  "D-"

## Member Reference

Using the double square bracket [[]] operator, we can reference a member of the list directly. A single bracket [] returns a sub-list containing that member rather than the member itself.

student[[1]] # Components of the Scores Vector
## [1] 80 75 55
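The difference between the two operators shows up as soon as we check the class of the result or try to compute with it. In this sketch the list is rebuilt with the same values as above so that the block runs on its own.

```r
# Same list as above, repeated so the example is self-contained
student <- list(c(80, 75, 55), c("B", "C", "D-"), c(TRUE, FALSE, FALSE))

class(student[1])      # "list": a one-component list
class(student[[1]])    # "numeric": the vector itself

avg_score <- mean(student[[1]])   # works on the vector: 70
avg_score
# mean(student[1]) would return NA with a warning, because it is still a list
```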

First element of the Scores vector

student[[1]][1]
## [1] 80

First three elements of the Scores vector

student[[1]][1:3]
## [1] 80 75 55

## Named List Members

It’s possible to assign names to list members and reference them by name instead of by numeric index.

student = list(myscores = scores, mygrades = grades , myoffice_hours = office_hours)

student
## $myscores
## [1] 80 75 55
##
## $mygrades
## [1] "B"  "C"  "D-"
##
## $myoffice_hours
## [1]  TRUE FALSE FALSE
student$myscores
## [1] 80 75 55
student$mygrades
## [1] "B"  "C"  "D-"
student$myoffice_hours
## [1]  TRUE FALSE FALSE
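Named members can also be reached with [[ ]] and a quoted name, and new members can be added by name. A short sketch, rebuilding the list so it runs on its own; the member myname and its value are hypothetical additions.

```r
student <- list(myscores = c(80, 75, 55),
                mygrades = c("B", "C", "D-"),
                myoffice_hours = c(TRUE, FALSE, FALSE))

names(student)             # "myscores" "mygrades" "myoffice_hours"
student[["mygrades"]]      # same result as student$mygrades
student$myscores[2]        # second score: 75

# Add a new member by name (hypothetical, for illustration only)
student$myname <- "Alex"
length(student)            # the list now has 4 members
```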

# Matrices

All columns in a matrix must have the same data type and the same length.

Create a numeric matrix of 5 rows and 4 columns made of sequential numbers 1:20

x_mat = matrix(1:20, nrow=5, ncol=4)

Retrieve the 4th column of matrix

x_mat[,4]
## [1] 16 17 18 19 20

Retrieve the 3rd row of matrix

x_mat[3,]
## [1]  3  8 13 18

Retrieve rows 2,3,4 of columns 1,2,3

x_mat[2:4,1:3]
##      [,1] [,2] [,3]
## [1,]    2    7   12
## [2,]    3    8   13
## [3,]    4    9   14
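The outputs above reflect that matrix() fills its values column by column by default. The sketch below contrasts this with row-wise filling on a smaller matrix; by_col and by_row are illustrative names.

```r
by_col <- matrix(1:6, nrow = 2)                 # filled column by column (default)
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)   # filled row by row

by_col[1, ]    # first row: 1 3 5
by_row[1, ]    # first row: 1 2 3
dim(by_row)    # number of rows and columns: 2 3
t(by_row)      # transpose: 3 rows, 2 columns
```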

# Data Frames

A data frame is more general than a matrix, in that different columns can have different data types (numeric, character, logical, factor). It is a powerful way to work with mixed data structures.

## Defining a Data Frame

When we need to store data in table form, we use data frames, which are created by combining lists of vectors of equal length. The variables of a data set are the columns and the observations are the rows.

The str() function helps us to display the internal structure of any R data structure or object to make sure that it’s correct.

str(il_income)
## 'data.frame':    102 obs. of  5 variables:
##  $ rank             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ county           : Factor w/ 102 levels "Adams","Alexander",..: 16 22 49 99 45 60 101 64 86 10 ...
##  $ per_capita_income: int  30468 38931 38459 30791 30645 23937 24802 30728 23279 26087 ...
##  $ population       : int  5238216 933736 703910 687263 530847 307343 287078 266209 264052 208861 ...
##  $ region           : int  1 2 2 2 2 2 2 5 5 3 ...

## Creating a Data Frame

As an example, we build a small snapshot of the solar system, starting with one vector per variable.

name = c("Earth", "Mars", "Jupiter")
type = c("Terrestrial","Terrestrial", "Gas giant")
diameter = c(1, 0.532, 11.209)
rotation = c(1, 1.03, 0.41)
rings = c(FALSE, FALSE, TRUE)

Now, by combining the vectors of equal size, we can create a data frame object.

planets_df = data.frame(name,type,diameter,rotation,rings)
planets_df
##      name        type diameter rotation rings
## 1   Earth Terrestrial    1.000     1.00 FALSE
## 2    Mars Terrestrial    0.532     1.03 FALSE
## 3 Jupiter   Gas giant   11.209     0.41  TRUE
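Once the data frame exists, columns and rows can be selected much like list members and matrix cells. The sketch below rebuilds the same data frame so it runs on its own; the stringsAsFactors argument is set explicitly only so that text columns stay character in any R version.

```r
planets_df <- data.frame(
  name = c("Earth", "Mars", "Jupiter"),
  type = c("Terrestrial", "Terrestrial", "Gas giant"),
  diameter = c(1, 0.532, 11.209),
  rotation = c(1, 1.03, 0.41),
  rings = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

planets_df$diameter                     # one column, returned as a vector
planets_df[2, ]                         # second row (Mars)
ringed <- planets_df[planets_df$rings, "name"]
ringed                                  # planets with rings: "Jupiter"
small <- planets_df[planets_df$diameter < 2, ]
small                                   # rows for Earth and Mars
nrow(planets_df)                        # 3 observations
ncol(planets_df)                        # 5 variables
```

Filtering with a logical condition inside [ , ] keeps only the rows where the condition is TRUE.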

# Suggested Exercises & Resources

## Data Sources

Data samples used in this worksheet were downloaded from the U.S. Census Bureau American FactFinder site.