R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.
This notebook is a tutorial on how to use R.
Before starting to work with R, we need to set the working directory to the location of the source file, so that relative paths such as "data/il_income.csv" resolve correctly.
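As a rough sketch of how the working directory can be inspected and changed from code, base R offers getwd() and setwd(). The temporary directory below is only a stand-in (an assumption for illustration); in practice you would pass the folder that contains this notebook.

```r
# Remember where we started so we can return afterwards
original_dir <- getwd()

# tempdir() is used here only as a placeholder for your project folder
setwd(tempdir())
getwd()  # shows the new working directory

# Restore the original working directory
setwd(original_dir)
```

Restoring the original directory at the end keeps the rest of the session unaffected.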
First we will begin with a few basic operations.
We assign values to variables using the assignment operator ‘=’. Another, more general form of assignment is the ‘<-’ operator. A variable stores a value or an object (e.g. a function).
x = 128
y = 16
z <- 5
vars = c(2,4,8,16,32) # This is a vector created using the generic combine function 'c'
x # display value of variable x
## [1] 128
z # displays value of variable z
## [1] 5
x+y
## [1] 144
vars[1] #This calls the first value in the vector vars
## [1] 2
vars[2] #This calls the second value in the vector vars
## [1] 4
vars[1:3] #This calls the first through third values in the vector vars
## [1] 2 4 8
vars #This calls the vector
## [1] 2 4 8 16 32
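Beyond single positions and ranges, vectors can also be indexed with negative numbers, with a vector of positions, or with a logical condition. The following is a small sketch that reuses the vars vector from above; the name last_value is introduced here only for illustration.

```r
vars <- c(2, 4, 8, 16, 32)

length(vars)      # number of elements in the vector
vars[-1]          # every element except the first
vars[c(1, 5)]     # the first and fifth elements
vars[vars > 4]    # only the elements greater than 4

# The last element, whatever the length of the vector
last_value <- vars[length(vars)]
last_value
```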
Below are some simple arithmetic operations.
12*6
## [1] 72
128/16
## [1] 8
9^2
## [1] 81
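The same operators also work element by element on whole vectors, and R has dedicated operators for integer division and remainders. A minimal sketch, with the names doubled and halved chosen here for illustration:

```r
17 %/% 5      # integer division
17 %% 5       # remainder (modulo)
(2 + 3) * 4   # parentheses control the order of evaluation

# Arithmetic on a vector is applied to every element
vars <- c(2, 4, 8, 16, 32)
doubled <- vars * 2
halved <- vars / 2
doubled
halved
sum(vars)     # total of all elements
```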
R works with numerous data types. Some of the most basic types are numeric, integer, logical (Boolean: TRUE/FALSE), and character (string: "TEXT").
#Type: Character
#Example:"TRUE",'23.4'
v = "TRUE"
class(v)
## [1] "character"
#Type: Numeric
#Example: 12.3,5
v = 23.5
class(v)
## [1] "numeric"
#Type: Logical
#Example: TRUE,FALSE
v = TRUE
class(v)
## [1] "logical"
#Type: Factor (nominal, categorical)
#Example: m f m f m
v = as.factor(c("m", "f", "m"))
class(v)
## [1] "factor"
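The integer type mentioned above is not covered by the examples, and it is often useful to convert between types. The sketch below shows one way to do both with base R; the variable names are illustrative only.

```r
# A trailing L marks an integer literal; a plain number is numeric (double)
n <- 5L
class(n)
class(5)

# Conversion between types
as.numeric("23.4")   # character to numeric
as.character(128)    # numeric to character
as.integer(23.9)     # truncates toward zero

# Checking a type
is.logical(TRUE)
is.numeric("23.4")

# Factor levels are the distinct categories, sorted alphabetically by default
v <- as.factor(c("m", "f", "m"))
levels(v)
```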
An R function is invoked by its name, followed by parentheses containing zero or more arguments.
# The following applies the function 'c' (seen earlier) to combine three numeric values into a vector
c(1,2,3)
## [1] 1 2 3
# Example of the function mean() to calculate the mean of three values
mean(c(5,6,7))
## [1] 6
# Square root of a number
sqrt(99)
## [1] 9.949874
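Arguments can be passed by position or by name. As a hedged illustration, the sketch below uses named arguments of a few base R functions; the vector values, including its missing entry, is made up for the example.

```r
# Named argument: round the square root to two decimal places
round(sqrt(99), digits = 2)

# A missing value (NA) propagates unless we ask mean() to remove it
values <- c(5, 6, NA, 7)
mean(values)
mean(values, na.rm = TRUE)

# seq() builds a sequence from its named arguments
seq(from = 0, to = 10, by = 5)
```

Typing ?mean or help(mean) in the console shows the arguments a function accepts.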
# Here we read two CSV (comma-separated values) files, a format commonly exported from Excel
il_income = read.csv(file = "data/il_income.csv")
top_il_income = read.csv(file = "data/top_il_income.csv")
il_income
## rank county per_capita_income population region
## 1 1 Cook 30468 5238216 1
## 2 2 DuPage 38931 933736 2
## 3 3 Lake 38459 703910 2
## 4 4 Will 30791 687263 2
## 5 5 Kane 30645 530847 2
## 6 6 Mason 23937 307343 2
## 7 7 Winnebago 24802 287078 2
## 8 8 McLean 30728 266209 5
## 9 9 Shelby 23279 264052 5
## 10 10 Champaign 26087 208861 3
## 11 11 Saline 21295 198712 4
## 12 12 Peoria 28414 186221 3
## 13 13 Massac 23190 173166 3
## 14 14 Rock Island 26257 146133 3
## 15 15 Tazewell 28953 134800 3
## 16 16 Kendall 31110 123355 2
## 17 17 LaSalle 25668 111333 3
## 18 18 Kankakee 24117 110879 2
## 19 19 McDonough 20592 107303 4
## 20 20 De Witt 27575 104352 4
## 21 21 Vermilion 21924 79282 3
## 22 22 Williamson 24096 67466 5
## 23 23 Adams 24247 67013 4
## 24 24 Jackson 20729 59362 5
## 25 25 Whiteside 24815 57079 2
## 26 26 Boone 25950 53585 2
## 27 27 Coles 22464 52521 4
## 28 28 Ogle 27337 51659 2
## 29 29 Knox 22273 51441 3
## 30 30 Grundy 29439 50541 2
## 31 31 Henry 26845 49489 3
## 32 32 McHenry 33118 46045 4
## 33 33 Stephenson 23686 45749 2
## 34 34 Franklin 20591 39485 5
## 35 35 Woodford 30300 39227 3
## 36 36 Jefferson 22849 38353 5
## 37 37 Macon 26259 38339 5
## 38 38 Clinton 28255 37786 5
## 39 39 Livingston 25831 36671 3
## 40 40 Fulton 22478 35699 3
## 41 41 Morgan 24822 34828 4
## 42 42 Lee 24943 34584 2
## 43 43 Effingham 26774 34371 4
## 44 44 Monroe 33059 33879 5
## 45 45 Christian 24016 33642 4
## 46 46 Bureau 26587 33587 3
## 47 47 Randolph 22771 32852 5
## 48 48 Marshall 26399 31333 3
## 49 49 Logan 21986 29494 4
## 50 50 Montgomery 20067 28898 4
## 51 51 Iroquois 25234 28672 3
## 52 52 St. Clair 26459 24548 5
## 53 53 Jersey 26154 22372 4
## 54 54 Jo Daviess 29477 22086 2
## 55 55 Fayette 21845 22043 5
## 56 56 Scott 24395 21775 4
## 57 57 Perry 19999 21543 5
## 58 58 Douglas 24330 19823 4
## 59 59 Crawford 25613 19414 5
## 60 60 Hancock 24418 18543 4
## 61 61 Edgar 25018 17664 4
## 62 62 Warren 22923 17527 3
## 63 63 Union 22430 17408 5
## 64 64 Bond 23232 16950 5
## 65 65 Lawrence 14208 16491 5
## 66 66 Wayne 23897 16423 5
## 67 67 Piatt 31750 16387 4
## 68 68 DeKalb 23903 16247 2
## 69 69 Richland 23996 16029 5
## 70 70 Pike 20925 15989 4
## 71 71 Clark 25061 15979 4
## 72 72 Mercer 26739 15858 3
## 73 73 Moultrie 23801 14931 4
## 74 74 Marion 22398 14766 5
## 75 75 Carroll 26918 14616 2
## 76 76 White 26388 14327 5
## 77 77 Washington 27996 14270 5
## 78 78 Ford 25495 13736 3
## 79 79 Madison 28093 13701 3
## 80 80 Clay 22160 13428 5
## 81 81 Greene 22483 13241 4
## 82 82 Cass 23423 12847 4
## 83 83 Johnson 19684 12762 5
## 84 84 Menard 29391 12444 4
## 85 85 Macoupin 25402 11982 3
## 86 86 Wabash 24493 11542 5
## 87 87 Cumberland 22631 10898 4
## 88 88 Jasper 25063 9607 5
## 89 89 Hamilton 23160 8200 5
## 90 90 Sangamon 30594 7032 4
## 91 91 Henderson 27132 6995 3
## 92 92 Brown 20518 6829 4
## 93 93 Alexander 14052 6780 5
## 94 94 Edwards 21896 6534 5
## 95 95 Stark 27104 5788 3
## 96 96 Pulaski 19575 5678 5
## 97 97 Putnam 28158 5644 3
## 98 98 Gallatin 22890 5265 5
## 99 99 Schuyler 23852 5092 4
## 100 100 Calhoun 26446 4899 4
## 101 101 Pope 21431 4226 5
## 102 102 Hardin 21901 4135 5
top_il_income
## rank county per_capita_income population region
## 1 2 DuPage 38931 933736 2
## 2 3 Lake 38459 703910 2
## 3 32 McHenry 33118 46045 4
## 4 44 Monroe 33059 33879 5
## 5 67 Piatt 31750 16387 4
## 6 16 Kendall 31110 123355 2
## 7 4 Will 30791 687263 2
## 8 8 McLean 30728 266209 5
## 9 5 Kane 30645 530847 2
## 10 90 Sangamon 30594 7032 4
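To experiment with read.csv() without the tutorial's data files, one option is to write a small data frame to a temporary file and read it back. This is only a sketch: sample_rows, csv_path and sample_back are hypothetical names, and the three rows are copied from the table above.

```r
# Three rows copied from top_il_income, used as a stand-in data set
sample_rows <- data.frame(
  rank = c(2, 3, 32),
  county = c("DuPage", "Lake", "McHenry"),
  per_capita_income = c(38931, 38459, 33118)
)

# Write to a temporary CSV file, then read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_rows, file = csv_path, row.names = FALSE)
sample_back <- read.csv(file = csv_path)

head(sample_back, n = 2)   # first two rows only
dim(sample_back)           # number of rows and columns
names(sample_back)         # column names
```

For large tables such as il_income, head() is usually more convenient than printing the whole data frame.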
We can extract values from the dataset to perform calculations.
DuPage = top_il_income$per_capita_income[1]
Lake = top_il_income$per_capita_income[2]
DuPage-Lake
## [1] 472
DuPage+Lake
## [1] 77390
(DuPage+Lake)/2
## [1] 38695
# Repeat the above arithmetic operations using instead McHenry and Sangamon counties
McHenry = top_il_income$per_capita_income[3] # McHenry is row 3 of top_il_income
Sangamon = top_il_income$per_capita_income[10] # Sangamon is row 10 of top_il_income
McHenry-Sangamon
## [1] 2524
McHenry+Sangamon
## [1] 63712
(McHenry+Sangamon)/2
## [1] 31856
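Selecting a value by row position is easy to get wrong when rows are reordered, so an alternative is to look the value up by county name. The sketch below assumes a small stand-in data frame, top_sample, with values copied from the top_il_income table; the variable names are illustrative.

```r
# Stand-in for top_il_income with four of its rows
top_sample <- data.frame(
  county = c("DuPage", "Lake", "McHenry", "Sangamon"),
  per_capita_income = c(38931, 38459, 33118, 30594),
  stringsAsFactors = FALSE
)

# A logical condition inside [] picks the matching row
mchenry_income <- top_sample$per_capita_income[top_sample$county == "McHenry"]
sangamon_income <- top_sample$per_capita_income[top_sample$county == "Sangamon"]

mchenry_income - sangamon_income
(mchenry_income + sangamon_income) / 2
```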
mean(il_income$per_capita_income)
## [1] 25164.14
median(il_income$per_capita_income)
## [1] 24808.5
quantile(il_income$per_capita_income)
## 0% 25% 50% 75% 100%
## 14052.00 22666.00 24808.50 26899.75 38931.00
summary(il_income)
## rank county per_capita_income population
## Min. : 1.00 Adams : 1 Min. :14052 Min. : 4135
## 1st Qu.: 26.25 Alexander: 1 1st Qu.:22666 1st Qu.: 14284
## Median : 51.50 Bond : 1 Median :24808 Median : 26610
## Mean : 51.50 Boone : 1 Mean :25164 Mean : 126078
## 3rd Qu.: 76.75 Brown : 1 3rd Qu.:26900 3rd Qu.: 53319
## Max. :102.00 Bureau : 1 Max. :38931 Max. :5238216
## (Other) :96
## region
## Min. :1.000
## 1st Qu.:3.000
## Median :4.000
## Mean :3.735
## 3rd Qu.:5.000
## Max. :5.000
##
(top_il_income)
## rank county per_capita_income population region
## 1 2 DuPage 38931 933736 2
## 2 3 Lake 38459 703910 2
## 3 32 McHenry 33118 46045 4
## 4 44 Monroe 33059 33879 5
## 5 67 Piatt 31750 16387 4
## 6 16 Kendall 31110 123355 2
## 7 4 Will 30791 687263 2
## 8 8 McLean 30728 266209 5
## 9 5 Kane 30645 530847 2
## 10 90 Sangamon 30594 7032 4
# Repeat the basic statistics here using instead the data from the file top_il_income
mean(top_il_income$per_capita_income)
## [1] 32918.5
median(top_il_income$per_capita_income)
## [1] 31430
quantile(top_il_income$per_capita_income)
## 0% 25% 50% 75% 100%
## 30594.00 30743.75 31430.00 33103.25 38931.00
summary(top_il_income)
## rank county per_capita_income population
## Min. : 2.00 DuPage :1 Min. :30594 Min. : 7032
## 1st Qu.: 4.25 Kane :1 1st Qu.:30744 1st Qu.: 36920
## Median :12.00 Kendall:1 Median :31430 Median :194782
## Mean :27.10 Lake :1 Mean :32918 Mean :334866
## 3rd Qu.:41.00 McHenry:1 3rd Qu.:33103 3rd Qu.:648159
## Max. :90.00 McLean :1 Max. :38931 Max. :933736
## (Other):4
## region
## Min. :2.0
## 1st Qu.:2.0
## Median :3.0
## Mean :3.2
## 3rd Qu.:4.0
## Max. :5.0
##
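Besides the mean, median and quartiles, base R also has functions that describe the spread of a variable. A minimal sketch, using the ten per capita income values from top_il_income typed in by hand so that it runs without the CSV file:

```r
# Per capita income values copied from the top_il_income table
top_incomes <- c(38931, 38459, 33118, 33059, 31750,
                 31110, 30791, 30728, 30645, 30594)

range(top_incomes)   # minimum and maximum
sd(top_incomes)      # standard deviation
IQR(top_incomes)     # interquartile range (75% quantile minus 25% quantile)

# quantile() accepts custom probabilities through the probs argument
quantile(top_incomes, probs = c(0.1, 0.9))
```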
A sequence of data elements of the same basic type is defined as a vector.
# vector of numeric values
c(2, 3, 5, 8)
## [1] 2 3 5 8
# vector of logical values.
c(TRUE, FALSE, TRUE)
## [1] TRUE FALSE TRUE
# vector of character strings.
c("A", "B", "B-", "C", "D")
## [1] "A" "B" "B-" "C" "D"
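Because every element of a vector must have the same basic type, R silently converts mixed inputs to a common type. The short sketch below illustrates this coercion; the variable names are illustrative.

```r
# Mixing numbers, strings and logicals yields a character vector
mixed_chr <- c(1, "a", TRUE)
mixed_chr
class(mixed_chr)   # "character"

# Mixing numbers and logicals yields a numeric vector (TRUE becomes 1, FALSE becomes 0)
mixed_num <- c(1, TRUE, FALSE)
mixed_num
class(mixed_num)   # "numeric"
```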
Lists, as opposed to vectors, can hold components of different types.
scores = c(80, 75, 55) # vector of numeric values
grades = c("B", "C", "D-") # vector of character strings.
office_hours = c(TRUE, FALSE, FALSE) # vector of logical values.
student = list(scores,grades,office_hours) # list of vectors
student
## [[1]]
## [1] 80 75 55
##
## [[2]]
## [1] "B" "C" "D-"
##
## [[3]]
## [1] TRUE FALSE FALSE
We can retrieve components of the list with the single square bracket [] operator.
student[1]
## [[1]]
## [1] 80 75 55
student[2]
## [[1]]
## [1] "B" "C" "D-"
student[3]
## [[1]]
## [1] TRUE FALSE FALSE
# first two components of the list
student[1:2]
## [[1]]
## [1] 80 75 55
##
## [[2]]
## [1] "B" "C" "D-"
Using the double square bracket [[]] operator we can reference a member of the list directly. A single bracket [] always returns a list (a sub-list containing the selected members), so it does not give direct access to the member itself.
student[[1]] # Components of the Scores Vector
## [1] 80 75 55
First element of the Scores vector
student[[1]][1]
## [1] 80
First three elements of the Scores vector
student[[1]][1:3]
## [1] 80 75 55
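The difference between the two bracket operators can be checked with class() and length(). The following sketch rebuilds the student list from above so that it is self-contained.

```r
scores <- c(80, 75, 55)
grades <- c("B", "C", "D-")
office_hours <- c(TRUE, FALSE, FALSE)
student <- list(scores, grades, office_hours)

class(student[1])      # "list": a sub-list holding one member
class(student[[1]])    # "numeric": the scores vector itself

length(student[1])     # 1 member in the sub-list
length(student[[1]])   # 3 values in the scores vector

# Functions that expect a vector need the [[ ]] form
mean(student[[1]])
```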
It’s possible to assign names to list members and reference them by names instead of by numeric indexes.
student = list(myscores = scores, mygrades = grades , myoffice_hours = office_hours)
student
## $myscores
## [1] 80 75 55
##
## $mygrades
## [1] "B" "C" "D-"
##
## $myoffice_hours
## [1] TRUE FALSE FALSE
student$myscores
## [1] 80 75 55
student$mygrades
## [1] "B" "C" "D-"
student$myoffice_hours
## [1] TRUE FALSE FALSE
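Names also work with the double bracket operator, and new members can be added by assigning to a new name. A brief sketch; the member year and its value are hypothetical and not part of the original example.

```r
student <- list(myscores = c(80, 75, 55), mygrades = c("B", "C", "D-"))

names(student)            # names of the list members
student[["myscores"]]     # same result as student$myscores
student$myscores[2]       # second element of the scores vector

# Add a new member by name (hypothetical field for illustration)
student$year <- 2
names(student)
```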
All columns in a matrix must have the same data type and the same length.
Create a numeric matrix of 5 rows and 4 columns made of sequential numbers 1:20
x_mat = matrix(1:20, nrow=5, ncol=4)
Retrieve the 4th column of matrix
x_mat[,4]
## [1] 16 17 18 19 20
Retrieve the 3rd row of matrix
x_mat[3,]
## [1] 3 8 13 18
Retrieve rows 2,3,4 of columns 1,2,3
x_mat[2:4,1:3]
## [,1] [,2] [,3]
## [1,] 2 7 12
## [2,] 3 8 13
## [3,] 4 9 14
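A few more matrix operations are sketched below as an illustration: checking dimensions, selecting a single cell, filling by row instead of by column, transposing, and summing columns. The name by_row is introduced here only for the example.

```r
x_mat <- matrix(1:20, nrow = 5, ncol = 4)

dim(x_mat)       # number of rows and columns
x_mat[3, 4]      # single cell: row 3, column 4

# By default values fill column by column; byrow = TRUE fills row by row
by_row <- matrix(1:20, nrow = 5, ncol = 4, byrow = TRUE)
by_row[1, ]      # first row

dim(t(x_mat))    # the transpose swaps rows and columns
colSums(x_mat)   # sum of each column
```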
A data frame is more general than a matrix, in that different columns can have different data types (numeric, character, logical, factor). It is a powerful way to work with mixed data structures.
When we need to store data in table form, we use data frames, which are created by combining lists of vectors of equal length. The variables of a data set are the columns and the observations are the rows.
The str() function helps us to display the internal structure of any R data structure or object to make sure that it’s correct.
str(il_income)
## 'data.frame': 102 obs. of 5 variables:
## $ rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ county : Factor w/ 102 levels "Adams","Alexander",..: 16 22 49 99 45 60 101 64 86 10 ...
## $ per_capita_income: int 30468 38931 38459 30791 30645 23937 24802 30728 23279 26087 ...
## $ population : int 5238216 933736 703910 687263 530847 307343 287078 266209 264052 208861 ...
## $ region : int 1 2 2 2 2 2 2 5 5 3 ...
Snapshot of the solar system.
name = c("Earth", "Mars", "Jupiter")
type = c("Terrestrial","Terrestrial", "Gas giant")
diameter = c(1, 0.532, 11.209)
rotation = c(1, 1.03, 0.41)
rings = c(FALSE, FALSE, TRUE)
Now, by combining the vectors of equal size, we can create a data frame object.
planets_df = data.frame(name,type,diameter,rotation,rings)
planets_df
## name type diameter rotation rings
## 1 Earth Terrestrial 1.000 1.00 FALSE
## 2 Mars Terrestrial 0.532 1.03 FALSE
## 3 Jupiter Gas giant 11.209 0.41 TRUE
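Once a data frame exists, columns can be accessed with $ and rows can be filtered with a logical condition. The sketch below rebuilds planets_df so that it runs on its own; stringsAsFactors = FALSE is set explicitly as an assumption, so that text columns stay character regardless of the R version, and ringed and large_names are illustrative names.

```r
planets_df <- data.frame(
  name = c("Earth", "Mars", "Jupiter"),
  type = c("Terrestrial", "Terrestrial", "Gas giant"),
  diameter = c(1, 0.532, 11.209),
  rotation = c(1, 1.03, 0.41),
  rings = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

planets_df$diameter                # one column as a vector
nrow(planets_df)                   # number of observations
ncol(planets_df)                   # number of variables

# Rows where the rings column is TRUE
ringed <- planets_df[planets_df$rings, ]
ringed

# Names of planets with a diameter larger than Earth's
large_names <- planets_df[planets_df$diameter > 1, "name"]
large_names

str(planets_df)                    # check the structure of the result
```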
Datacamp - Learn Data Science from your browser: https://www.datacamp.com/courses/free-introduction-to-r
R-tutor - An R intro to stats that explains basic R concepts: http://www.r-tutor.com/r-introduction
Data samples used in this worksheet were downloaded from the U.S. Census Bureau American FactFinder site.