About

R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.

This notebook is a tutorial on how to use R.

Setting the Working Directory

Before starting to work with R, we need to set the working directory to the location of the source file, so that relative paths such as "data/il_income.csv" resolve correctly.
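The step above can also be done from the console with the base R functions getwd() and setwd(). The sketch below is only an illustration: tempdir() is a stand-in for the real folder, which is assumed to be the one that holds this notebook and its data/ subfolder. In RStudio the same result is available from the menu Session > Set Working Directory > To Source File Location.

```r
# Remember and show the current working directory
old_wd <- getwd()
print(old_wd)

# Change the working directory. tempdir() is only a placeholder here;
# replace it with the folder that contains this notebook and its data/ folder.
setwd(tempdir())
print(getwd())

# Restore the original working directory
setwd(old_wd)
```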

Basic Operations

First we will begin with a few basic operations.

Variable assignment

We assign values to variables using the assignment operator ‘=’. Another, more general form of assignment is the ‘<-’ operator. A variable stores a value or an object (e.g. a function) so that it can be reused later by name.

x = 128
y = 16
z <- 5
vars = c(2,4,8,16,32) # This is a vector created using the generic combine function 'c'

x # display value of variable x

## [1] 128

z # displays value of variable z

## [1] 5

x+y

## [1] 144

vars[1] # Returns the first value in the vector vars

## [1] 2

vars[2] # Returns the second value in the vector vars

## [1] 4

vars[1:3] # Returns the first through third values in the vector vars

## [1] 2 4 8

vars # Displays the whole vector

## [1]  2  4  8 16 32

Common Arithmetic Operations

Below are some simple arithmetic operations.

12*6

## [1] 72

128/16

## [1] 8

9^2

## [1] 81

Basic Data Types

R works with numerous data types. Some of the most basic are numeric, integer, logical (Boolean: TRUE/FALSE), character (string: "TEXT"), and factor (categorical).

#Type: Character                   
#Example:"TRUE",'23.4'

v = "TRUE"                       
class(v)

## [1] "character"

#Type: Numeric                
#Example: 12.3,5

v = 23.5                  
class(v)

## [1] "numeric"

#Type: Logical    
#Example: TRUE,FALSE

v = TRUE
class(v)

## [1] "logical"

#Type: Factor (nominal, categorical)
#Example: m f m f m

v = as.factor(c("m", "f", "m"))
class(v)

## [1] "factor"
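The integer type is mentioned above but not demonstrated. The following short sketch shows it next to the numeric type, together with a few base R functions for testing and converting types. The variable names n and m are chosen only for this illustration.

```r
# A number written with the L suffix is stored as an integer
n <- 5L
class(n)            # "integer"

# Without the suffix, the same value is stored as a double ("numeric")
m <- 5
class(m)            # "numeric"

# is.*() functions test a type; as.*() functions convert between types
is.numeric(n)       # TRUE: integers also count as numeric
as.numeric("23.4")  # 23.4
as.character(128)   # "128"
```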

Functions

An R function is invoked by its name, followed by parentheses that contain zero or more arguments.

# The following applies the function 'c' (seen earlier) to combine three numeric values into a vector 
c(1,2,3)

## [1] 1 2 3

# Example of the function mean() to calculate the mean of three values
mean(c(5,6,7))

## [1] 6

# Square root of a number
sqrt(99)

## [1] 9.949874
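Arguments can also be passed by name, and you can define your own functions. The sketch below uses the documented na.rm argument of mean() and the digits argument of round(); the function midpoint is a hypothetical helper written only for this illustration.

```r
# Arguments can be passed by name; na.rm = TRUE tells mean() to skip missing values
mean(c(5, 6, NA), na.rm = TRUE)   # 5.5

# round() takes the number of decimal places as its second argument
round(sqrt(99), digits = 2)       # 9.95

# A user-defined function ('midpoint' is a hypothetical name, not a base R function)
midpoint <- function(a, b) {
  (a + b) / 2
}
midpoint(38931, 38459)            # 38695
```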

Importing Data and Variable Assignment

# Here we are reading files of type csv (comma-separated values), a format commonly exported from Excel
il_income = read.csv(file = "data/il_income.csv")
top_il_income = read.csv(file = "data/top_il_income.csv")
il_income

##     rank      county per_capita_income population region
## 1      1        Cook             30468    5238216      1
## 2      2      DuPage             38931     933736      2
## 3      3        Lake             38459     703910      2
## 4      4        Will             30791     687263      2
## 5      5        Kane             30645     530847      2
## 6      6       Mason             23937     307343      2
## 7      7   Winnebago             24802     287078      2
## 8      8      McLean             30728     266209      5
## 9      9      Shelby             23279     264052      5
## 10    10   Champaign             26087     208861      3
## 11    11      Saline             21295     198712      4
## 12    12      Peoria             28414     186221      3
## 13    13      Massac             23190     173166      3
## 14    14 Rock Island             26257     146133      3
## 15    15    Tazewell             28953     134800      3
## 16    16     Kendall             31110     123355      2
## 17    17     LaSalle             25668     111333      3
## 18    18    Kankakee             24117     110879      2
## 19    19   McDonough             20592     107303      4
## 20    20     De Witt             27575     104352      4
## 21    21   Vermilion             21924      79282      3
## 22    22  Williamson             24096      67466      5
## 23    23       Adams             24247      67013      4
## 24    24     Jackson             20729      59362      5
## 25    25   Whiteside             24815      57079      2
## 26    26       Boone             25950      53585      2
## 27    27       Coles             22464      52521      4
## 28    28        Ogle             27337      51659      2
## 29    29        Knox             22273      51441      3
## 30    30      Grundy             29439      50541      2
## 31    31       Henry             26845      49489      3
## 32    32     McHenry             33118      46045      4
## 33    33  Stephenson             23686      45749      2
## 34    34    Franklin             20591      39485      5
## 35    35    Woodford             30300      39227      3
## 36    36   Jefferson             22849      38353      5
## 37    37       Macon             26259      38339      5
## 38    38     Clinton             28255      37786      5
## 39    39  Livingston             25831      36671      3
## 40    40      Fulton             22478      35699      3
## 41    41      Morgan             24822      34828      4
## 42    42         Lee             24943      34584      2
## 43    43   Effingham             26774      34371      4
## 44    44      Monroe             33059      33879      5
## 45    45   Christian             24016      33642      4
## 46    46      Bureau             26587      33587      3
## 47    47    Randolph             22771      32852      5
## 48    48    Marshall             26399      31333      3
## 49    49       Logan             21986      29494      4
## 50    50  Montgomery             20067      28898      4
## 51    51    Iroquois             25234      28672      3
## 52    52   St. Clair             26459      24548      5
## 53    53      Jersey             26154      22372      4
## 54    54  Jo Daviess             29477      22086      2
## 55    55     Fayette             21845      22043      5
## 56    56       Scott             24395      21775      4
## 57    57       Perry             19999      21543      5
## 58    58     Douglas             24330      19823      4
## 59    59    Crawford             25613      19414      5
## 60    60     Hancock             24418      18543      4
## 61    61       Edgar             25018      17664      4
## 62    62      Warren             22923      17527      3
## 63    63       Union             22430      17408      5
## 64    64        Bond             23232      16950      5
## 65    65    Lawrence             14208      16491      5
## 66    66       Wayne             23897      16423      5
## 67    67       Piatt             31750      16387      4
## 68    68      DeKalb             23903      16247      2
## 69    69    Richland             23996      16029      5
## 70    70        Pike             20925      15989      4
## 71    71       Clark             25061      15979      4
## 72    72      Mercer             26739      15858      3
## 73    73    Moultrie             23801      14931      4
## 74    74      Marion             22398      14766      5
## 75    75     Carroll             26918      14616      2
## 76    76       White             26388      14327      5
## 77    77  Washington             27996      14270      5
## 78    78        Ford             25495      13736      3
## 79    79     Madison             28093      13701      3
## 80    80        Clay             22160      13428      5
## 81    81      Greene             22483      13241      4
## 82    82        Cass             23423      12847      4
## 83    83     Johnson             19684      12762      5
## 84    84      Menard             29391      12444      4
## 85    85    Macoupin             25402      11982      3
## 86    86      Wabash             24493      11542      5
## 87    87  Cumberland             22631      10898      4
## 88    88      Jasper             25063       9607      5
## 89    89    Hamilton             23160       8200      5
## 90    90    Sangamon             30594       7032      4
## 91    91   Henderson             27132       6995      3
## 92    92       Brown             20518       6829      4
## 93    93   Alexander             14052       6780      5
## 94    94     Edwards             21896       6534      5
## 95    95       Stark             27104       5788      3
## 96    96     Pulaski             19575       5678      5
## 97    97      Putnam             28158       5644      3
## 98    98    Gallatin             22890       5265      5
## 99    99    Schuyler             23852       5092      4
## 100  100     Calhoun             26446       4899      4
## 101  101        Pope             21431       4226      5
## 102  102      Hardin             21901       4135      5

top_il_income

##    rank   county per_capita_income population region
## 1     2   DuPage             38931     933736      2
## 2     3     Lake             38459     703910      2
## 3    32  McHenry             33118      46045      4
## 4    44   Monroe             33059      33879      5
## 5    67    Piatt             31750      16387      4
## 6    16  Kendall             31110     123355      2
## 7     4     Will             30791     687263      2
## 8     8   McLean             30728     266209      5
## 9     5     Kane             30645     530847      2
## 10   90 Sangamon             30594       7032      4

Arithmetic Operations with Data

We can extract values from the dataset to perform calculations.

DuPage = top_il_income$per_capita_income[1]
Lake = top_il_income$per_capita_income[2]
DuPage-Lake

## [1] 472

DuPage+Lake

## [1] 77390

(DuPage+Lake)/2

## [1] 38695

# Repeat the above arithmetic operations using instead McHenry and Sangamon counties

McHenry = top_il_income$per_capita_income[3]    # McHenry is row 3 of top_il_income
Sangamon = top_il_income$per_capita_income[10]  # Sangamon is row 10 of top_il_income
McHenry-Sangamon

## [1] 2524

McHenry+Sangamon

## [1] 63712

(McHenry+Sangamon)/2

## [1] 31856
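Positional indexes such as [1] or [2] depend on the row order of the file. As a hedged alternative, a value can be selected by matching the county name. The small data frame top3 below is rebuilt by hand from three rows of top_il_income so that the sketch runs without the csv file; its name and the lowercase variable names are assumptions made for this example.

```r
# Three rows copied by hand from top_il_income, so the example is self-contained
top3 <- data.frame(
  county = c("DuPage", "Lake", "McHenry"),
  per_capita_income = c(38931, 38459, 33118)
)

# Select a value by matching the county name instead of relying on row position
dupage  <- top3$per_capita_income[top3$county == "DuPage"]
mchenry <- top3$per_capita_income[top3$county == "McHenry"]

dupage - mchenry        # 5813
(dupage + mchenry) / 2  # 36024.5
```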

Basic Statistics

mean(il_income$per_capita_income)

## [1] 25164.14

median(il_income$per_capita_income)

## [1] 24808.5

quantile(il_income$per_capita_income)

##       0%      25%      50%      75%     100% 
## 14052.00 22666.00 24808.50 26899.75 38931.00

summary(il_income)

##       rank              county   per_capita_income   population     
##  Min.   :  1.00   Adams    : 1   Min.   :14052     Min.   :   4135  
##  1st Qu.: 26.25   Alexander: 1   1st Qu.:22666     1st Qu.:  14284  
##  Median : 51.50   Bond     : 1   Median :24808     Median :  26610  
##  Mean   : 51.50   Boone    : 1   Mean   :25164     Mean   : 126078  
##  3rd Qu.: 76.75   Brown    : 1   3rd Qu.:26900     3rd Qu.:  53319  
##  Max.   :102.00   Bureau   : 1   Max.   :38931     Max.   :5238216  
##                   (Other)  :96                                      
##      region     
##  Min.   :1.000  
##  1st Qu.:3.000  
##  Median :4.000  
##  Mean   :3.735  
##  3rd Qu.:5.000  
##  Max.   :5.000  
##

top_il_income

##    rank   county per_capita_income population region
## 1     2   DuPage             38931     933736      2
## 2     3     Lake             38459     703910      2
## 3    32  McHenry             33118      46045      4
## 4    44   Monroe             33059      33879      5
## 5    67    Piatt             31750      16387      4
## 6    16  Kendall             31110     123355      2
## 7     4     Will             30791     687263      2
## 8     8   McLean             30728     266209      5
## 9     5     Kane             30645     530847      2
## 10   90 Sangamon             30594       7032      4

# Repeat the basic statistics here using instead the data from the file top_il_income

mean(top_il_income$per_capita_income)

## [1] 32918.5

median(top_il_income$per_capita_income)

## [1] 31430

quantile(top_il_income$per_capita_income)

##       0%      25%      50%      75%     100% 
## 30594.00 30743.75 31430.00 33103.25 38931.00

summary(top_il_income)

##       rank           county  per_capita_income   population    
##  Min.   : 2.00   DuPage :1   Min.   :30594     Min.   :  7032  
##  1st Qu.: 4.25   Kane   :1   1st Qu.:30744     1st Qu.: 36920  
##  Median :12.00   Kendall:1   Median :31430     Median :194782  
##  Mean   :27.10   Lake   :1   Mean   :32918     Mean   :334866  
##  3rd Qu.:41.00   McHenry:1   3rd Qu.:33103     3rd Qu.:648159  
##  Max.   :90.00   McLean :1   Max.   :38931     Max.   :933736  
##                  (Other):4                                     
##      region   
##  Min.   :2.0  
##  1st Qu.:2.0  
##  Median :3.0  
##  Mean   :3.2  
##  3rd Qu.:4.0  
##  Max.   :5.0  
##
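Beyond the mean, median, and quantiles, base R also offers measures of spread. The sketch below applies range(), sd(), and var() to the ten per-capita incomes printed in top_il_income; the vector top_income is typed in by hand here so that the example does not need the csv file.

```r
# Per-capita incomes of the ten counties listed in top_il_income
top_income <- c(38931, 38459, 33118, 33059, 31750,
                31110, 30791, 30728, 30645, 30594)

range(top_income)                   # smallest and largest value: 30594 38931
max(top_income) - min(top_income)   # spread between them: 8337
sd(top_income)                      # sample standard deviation
var(top_income)                     # sample variance (the square of sd)
```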

Vectors

Defining a Vector

A sequence of data elements of the same basic type is defined as a vector.

# vector of numeric values
c(2, 3, 5, 8)

## [1] 2 3 5 8

# vector of logical values.
c(TRUE, FALSE, TRUE)

## [1]  TRUE FALSE  TRUE

# vector of character strings.
c("A", "B", "B-", "C", "D")

## [1] "A"  "B"  "B-" "C"  "D"
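Because a vector holds a single type, mixing types inside c() converts every element to the most general type present. The short sketch below illustrates this coercion and also shows that arithmetic on vectors is applied element by element. The names mixed and v are used only for this example.

```r
# Mixing types in c() coerces everything to the most general type
mixed <- c(1, "a", TRUE)
class(mixed)        # "character"
mixed               # "1" "a" "TRUE"

# Arithmetic is applied element by element
v <- c(2, 3, 5, 8)
v * 2               # 4 6 10 16
v + c(1, 1, 1, 1)   # 3 4 6 9
length(v)           # 4
```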

Lists

Defining a List

Lists, as opposed to vectors, can hold components of different types.

scores = c(80, 75, 55)  # vector of numeric values                   
grades = c("B", "C", "D-")  # vector of character strings.          

office_hours = c(TRUE, FALSE, FALSE) # vector of logical values.
student = list(scores,grades,office_hours) # list of vectors
student

## [[1]]
## [1] 80 75 55
## 
## [[2]]
## [1] "B"  "C"  "D-"
## 
## [[3]]
## [1]  TRUE FALSE FALSE

List Slicing

We can retrieve components of the list with the single square bracket [] operator. The result is itself a list that contains the selected components.

student[1]

## [[1]]
## [1] 80 75 55

student[2]

## [[1]]
## [1] "B"  "C"  "D-"

student[3]

## [[1]]
## [1]  TRUE FALSE FALSE

# first two components of the list
student[1:2]

## [[1]]
## [1] 80 75 55
## 
## [[2]]
## [1] "B"  "C"  "D-"

Member Reference

Using the double square bracket [[]] operator we can reference a member of the list directly. A single bracket [] returns a sub-list that contains the member, not the member itself, so it cannot be used to work with the member's values directly.

student[[1]] # Components of the Scores Vector

## [1] 80 75 55

First element of the Scores vector

student[[1]][1]

## [1] 80

First three elements of the Scores vector

student[[1]][1:3]

## [1] 80 75 55
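The difference between the two bracket operators is easiest to see by checking the class of what each one returns. The sketch below rebuilds the unnamed student list so that it runs on its own.

```r
student <- list(c(80, 75, 55), c("B", "C", "D-"), c(TRUE, FALSE, FALSE))

class(student[1])     # "list": single brackets return a sub-list
class(student[[1]])   # "numeric": double brackets return the member itself

# Arithmetic works on the extracted vector, not on the one-element list
mean(student[[1]])    # 70
# mean(student[1]) would return NA with a warning, because its argument is a list
```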

Named List Members

It’s possible to assign names to list members and reference them by names instead of by numeric indexes.

student = list(myscores = scores, mygrades = grades , myoffice_hours = office_hours) 

student

## $myscores
## [1] 80 75 55
## 
## $mygrades
## [1] "B"  "C"  "D-"
## 
## $myoffice_hours
## [1]  TRUE FALSE FALSE

student$myscores

## [1] 80 75 55

student$mygrades

## [1] "B"  "C"  "D-"

student$myoffice_hours

## [1]  TRUE FALSE FALSE
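Named members can also be listed with names() and reached with the double bracket operator and a quoted name, which is useful when the name is stored in a variable. The sketch below rebuilds the named list; the variable member and the added field myname are hypothetical and serve only as an illustration.

```r
student <- list(myscores = c(80, 75, 55),
                mygrades = c("B", "C", "D-"),
                myoffice_hours = c(TRUE, FALSE, FALSE))

names(student)       # "myscores" "mygrades" "myoffice_hours"

# [["name"]] is equivalent to $name and also works with a name held in a variable
member <- "mygrades"
student[[member]]    # "B" "C" "D-"

# Add a new named member ('myname' is a hypothetical field)
student$myname <- "Alex"
length(student)      # 4
```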

Matrices

All elements of a matrix must have the same data type, and all columns must have the same length.

Create a numeric matrix of 5 rows and 4 columns made of sequential numbers 1:20

x_mat = matrix(1:20, nrow=5, ncol=4)

Retrieve the 4th column of matrix

x_mat[,4]

## [1] 16 17 18 19 20

Retrieve the 3rd row of matrix

x_mat[3,]

## [1]  3  8 13 18

Retrieve rows 2,3,4 of columns 1,2,3

x_mat[2:4,1:3]

##      [,1] [,2] [,3]
## [1,]    2    7   12
## [2,]    3    8   13
## [3,]    4    9   14
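The outputs above reflect that matrix() fills values column by column by default. The sketch below contrasts this with the byrow = TRUE argument and checks dimensions with dim() and t(). The name x_row is used only for this example.

```r
x_mat <- matrix(1:20, nrow = 5, ncol = 4)   # filled column by column (the default)
dim(x_mat)                                  # 5 4

# byrow = TRUE fills the matrix row by row instead
x_row <- matrix(1:20, nrow = 5, ncol = 4, byrow = TRUE)
x_row[1, ]                                  # 1 2 3 4
x_row[, 4]                                  # 4 8 12 16 20

# t() transposes a matrix, swapping rows and columns
dim(t(x_mat))                               # 4 5
```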

Data Frames

A data frame is more general than a matrix, in that different columns can have different data types (numeric, character, logical, factor). It is a powerful way to work with mixed data structures.

Defining a Data Frame

When we need to store data in table form, we use data frames, which are created by combining lists of vectors of equal length. The variables of a data set are the columns and the observations are the rows.

The str() function helps us to display the internal structure of any R data structure or object to make sure that it’s correct.

str(il_income)

## 'data.frame':    102 obs. of  5 variables:
##  $ rank             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ county           : Factor w/ 102 levels "Adams","Alexander",..: 16 22 49 99 45 60 101 64 86 10 ...
##  $ per_capita_income: int  30468 38931 38459 30791 30645 23937 24802 30728 23279 26087 ...
##  $ population       : int  5238216 933736 703910 687263 530847 307343 287078 266209 264052 208861 ...
##  $ region           : int  1 2 2 2 2 2 2 5 5 3 ...

Creating a Data Frame

Snapshot of the solar system.

name = c("Earth", "Mars", "Jupiter")
type = c("Terrestrial","Terrestrial", "Gas giant")
diameter = c(1, 0.532, 11.209)
rotation = c(1, 1.03, 0.41)
rings = c(FALSE, FALSE, TRUE)

Now, by combining the vectors of equal size, we can create a data frame object.

planets_df = data.frame(name,type,diameter,rotation,rings)
planets_df

##      name        type diameter rotation rings
## 1   Earth Terrestrial    1.000     1.00 FALSE
## 2    Mars Terrestrial    0.532     1.03 FALSE
## 3 Jupiter   Gas giant   11.209     0.41  TRUE
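Once a data frame exists, its columns and rows can be accessed with the same $ and bracket operators used for lists and matrices. The following sketch rebuilds planets_df with explicit column names so that it runs on its own, then selects a column, a row, and the rows that satisfy a condition.

```r
planets_df <- data.frame(
  name = c("Earth", "Mars", "Jupiter"),
  type = c("Terrestrial", "Terrestrial", "Gas giant"),
  diameter = c(1, 0.532, 11.209),
  rotation = c(1, 1.03, 0.41),
  rings = c(FALSE, FALSE, TRUE)
)

planets_df$diameter                     # one column as a vector
planets_df[2, ]                         # the second row (Mars)
planets_df[planets_df$rings, "name"]    # names of planets with rings
nrow(planets_df)                        # 3
ncol(planets_df)                        # 5
```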

Suggested Exercises & Resources

Exercises

Datacamp - Learn Data Science from your browser: https://www.datacamp.com/courses/free-introduction-to-r
R-tutor - An R intro to stats that explains basic R concepts: http://www.r-tutor.com/r-introduction

Data Sources

Data samples used in this worksheet were downloaded from the U.S. Census Bureau American FactFinder site.

“SELECTED ECONOMIC CHARACTERISTICS 2006-2010 American Community Survey 5-Year Estimates” - U.S. Census Bureau. Retrieved 2016-09-09: https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml

Introduction to R (bsad_lab01)

CME Group Foundation Business Analytics Lab

Tsai Ling(Ruby) Chiang

BSAD343 Spring 2019