R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, and more) and graphical techniques, and is highly extensible.
This notebook is a tutorial on how to use R.
First we will begin with a few basic operations.
A variable lets you store a value or an object (e.g. a function) under a name.
x = 128
y = 16
vars = c(2,4,8,16,32) # This is a vector
vars[1] # Returns the first value in the vector vars
## [1] 2
vars[2] # Returns the second value in the vector vars
## [1] 4
vars[1:3] # Returns the first through third values in the vector vars
## [1] 2 4 8
vars # Returns the whole vector
## [1] 2 4 8 16 32
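A few more indexing forms are worth knowing. The sketch below reuses the vars vector from above; the extra calls (length(), negative indices, logical conditions) are standard base R and are added here only as an illustration.

```r
vars = c(2, 4, 8, 16, 32)

length(vars)      # number of elements in the vector
vars[-1]          # a negative index drops that position (all but the first value)
vars[vars > 4]    # a logical condition keeps only the values that satisfy it
vars[c(1, 5)]     # a vector of positions selects several values at once
```

Note that R counts positions from 1, not from 0.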
Below are some simple arithmetic operations.
12*6
## [1] 72
128/16
## [1] 8
9^2
## [1] 81
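The same operators also work on variables and on whole vectors. As a minimal sketch, assuming the x, y and vars values defined earlier:

```r
x = 128
y = 16
vars = c(2, 4, 8, 16, 32)

x / y       # arithmetic on variables: 128 / 16
x %% 10     # remainder after division (modulo)
x %/% 10    # integer division
vars * 2    # an operation is applied to every element of a vector
sqrt(y)     # built-in functions are called the same way
```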
R works with numerous data types. Some of the most basic are numeric, integer, logical (Boolean: TRUE/FALSE) and character (string: "TEXT"). Factors, which store categorical data, are also shown below.
#Type: Character
#Example: "TRUE", '23.4'
v = "TRUE"
class(v)
## [1] "character"
#Type: Numeric
#Example: 12.3,5
v = 23.5
class(v)
## [1] "numeric"
#Type: Logical
#Example: TRUE,FALSE
v = TRUE
class(v)
## [1] "logical"
#Type: Factor
#Example: m f m f m
v = as.factor(c("m", "f", "m"))
class(v)
## [1] "factor"
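The integer type is mentioned above but not shown, and it is often useful to test or convert a type. The following sketch uses only base R functions; the variable name sex is a hypothetical name chosen for this illustration.

```r
v = 5L                  # the L suffix creates an integer instead of a numeric (double)
class(v)

is.numeric(23.5)        # is.* functions test for a type
as.numeric("23.4")      # as.* functions convert between types
as.character(TRUE)

sex = as.factor(c("m", "f", "m"))   # hypothetical variable name
levels(sex)             # the distinct categories a factor can take
```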
Before reading data into R, we need to set the working directory (for example with setwd()) so that relative paths such as data/il_income.csv point to the right files.
il_income = read.csv(file = "data/il_income.csv")
top_il_income = read.csv(file = "data/top_il_income.csv")
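To make the working-directory step concrete, here is a self-contained sketch. It does not use the Illinois files; instead it writes a small hypothetical file (demo_income.csv) to a temporary folder and reads it back. The path in the commented setwd() call is a placeholder, not a real location.

```r
getwd()                                   # print the current working directory
# setwd("path/to/your/project")           # placeholder path: change the working directory

# Hypothetical example file, written to a temporary folder
demo_path = file.path(tempdir(), "demo_income.csv")
demo = data.frame(county = c("DuPage", "Lake"),
                  per_capita_income = c(38931, 38459))
write.csv(demo, demo_path, row.names = FALSE)

file.exists(demo_path)                    # check that the file is where we expect it
demo_read = read.csv(file = demo_path)    # read it back with read.csv()
head(demo_read)                           # show the first rows
```

Checking file.exists() before read.csv() is a quick way to tell a wrong working directory apart from a problem in the file itself.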
We can extract values from the dataset to perform calculations.
DuPage = top_il_income$per_capita_income[1]
Lake = top_il_income$per_capita_income[2]
McHenry = top_il_income$per_capita_income[3]
Monroe = top_il_income$per_capita_income[4]
Piatt = top_il_income$per_capita_income[5]
Kendall = top_il_income$per_capita_income[6]
Will = top_il_income$per_capita_income[7]
McLean = top_il_income$per_capita_income[8]
Kane = top_il_income$per_capita_income[9]
Sangamor = top_il_income$per_capita_income[10]
DuPage+Lake+McHenry+Monroe+Piatt+Kendall+Will+McLean+Kane+Sangamor
## [1] 329185
(DuPage+Lake+McHenry+Monroe+Piatt+Kendall+Will+McLean+Kane+Sangamor)/10
## [1] 32918.5
(DuPage+Lake+McHenry)/3
## [1] 36836
(Sangamor+Kane+McLean)/3
## [1] 30655.67
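The sums and averages above can also be computed on the whole column at once, without creating one variable per county. This is a sketch that assumes the ten per_capita_income values shown in this notebook, typed in by hand so that it runs without the CSV file.

```r
# Values copied from top_il_income$per_capita_income
per_capita_income = c(38931, 38459, 33118, 33059, 31750,
                      31110, 30791, 30728, 30645, 30594)

sum(per_capita_income)           # total of all ten counties
mean(per_capita_income)          # average of all ten counties
mean(per_capita_income[1:3])     # average of the first three counties
mean(per_capita_income[8:10])    # average of the last three counties
```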
mean(top_il_income$population)
## [1] 334866.3
median(top_il_income$population)
## [1] 194782
quantile(top_il_income$population)
## 0% 25% 50% 75% 100%
## 7032.0 36920.5 194782.0 648159.0 933736.0
summary(top_il_income)
## rank county per_capita_income population
## Min. : 2.00 DuPage :1 Min. :30594 Min. : 7032
## 1st Qu.: 4.25 Kane :1 1st Qu.:30744 1st Qu.: 36921
## Median :12.00 Kendall:1 Median :31430 Median :194782
## Mean :27.10 Lake :1 Mean :32919 Mean :334866
## 3rd Qu.:41.00 McHenry:1 3rd Qu.:33103 3rd Qu.:648159
## Max. :90.00 McLean :1 Max. :38931 Max. :933736
## (Other):4
## region
## Min. :2.0
## 1st Qu.:2.0
## Median :3.0
## Mean :3.2
## 3rd Qu.:4.0
## Max. :5.0
##
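The summary functions can also be applied to a single column and asked for specific quantiles. The sketch below assumes the ten population values shown in this notebook, typed in by hand so that it is self-contained.

```r
# Values copied from top_il_income$population
population = c(933736, 703910, 46045, 33879, 16387,
               123355, 687263, 266209, 530847, 7032)

range(population)                             # smallest and largest value
quantile(population, probs = c(0.25, 0.75))   # only the first and third quartiles
IQR(population)                               # distance between those two quartiles
summary(population)                           # summary of a single column
```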
A sequence of data elements of the same basic type is defined as a vector.
# vector of numeric values
c(4, 8, 15, 16, 23, 48)
## [1] 4 8 15 16 23 48
# vector of logical values.
c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
## [1] FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE
# vector of character strings.
c("A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-", "F")
## [1] "A+" "A" "A-" "B+" "B" "B-" "C+" "C" "C-" "D+" "D" "D-" "F"
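Because all elements of a vector must share one type, R converts mixed values to a common type when they are combined. A short sketch of this behaviour (the variable name mixed is chosen for illustration):

```r
mixed = c(1, "a", TRUE)   # numeric, character and logical values combined
mixed                     # every element has been converted to a character string
class(mixed)

class(c(1, TRUE))         # numeric and logical combine to numeric (TRUE becomes 1)
```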
Lists, as opposed to vectors, can hold components of different types.
scores = c(100, 95, 92, 87, 85, 77, 66, 58) # vector of numeric values
grades = c("A+", "A", "A-", "B+", "B", "C+", "D", "F") # vector of character strings.
office_hours = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE) # vector of logical values.
student = list(scores,grades,office_hours) # list of vectors
student
## [[1]]
## [1] 100 95 92 87 85 77 66 58
##
## [[2]]
## [1] "A+" "A" "A-" "B+" "B" "C+" "D" "F"
##
## [[3]]
## [1] TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE
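To confirm that each component of a list keeps its own type, we can inspect the list. This sketch uses shortened versions of the three vectors above so that it runs on its own.

```r
scores = c(100, 95, 92)
grades = c("A+", "A", "A-")
office_hours = c(TRUE, FALSE, TRUE)
student = list(scores, grades, office_hours)

length(student)          # number of components in the list
sapply(student, class)   # the type of each component
str(student)             # compact overview of the whole list
```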
We can retrieve components of the list with the single square bracket [] operator.
student[1]
## [[1]]
## [1] 100 95 92 87 85 77 66 58
# first two components of the list
student[1:2]
## [[1]]
## [1] 100 95 92 87 85 77 66 58
##
## [[2]]
## [1] "A+" "A" "A-" "B+" "B" "C+" "D" "F"
Using the double square bracket [[]] operator we can reference a member of the list directly.
student[[2]] # Components of the grades vector
## [1] "A+" "A" "A-" "B+" "B" "C+" "D" "F"
Last (eighth) and first elements of the grades vector
student[[2]][8]
## [1] "F"
student[[2]][1]
## [1] "A+"
First three elements of the grades vector
grades[1:3]
## [1] "A+" "A" "A-"
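The difference between the two operators is easiest to see by checking what each one returns. The following sketch uses a shortened two-component version of the student list; it also shows that a vector inside double brackets is applied recursively rather than selecting several components.

```r
student = list(c(100, 95, 92), c("A+", "A", "A-"))   # shortened version of the list

class(student[1])     # single brackets return a (shorter) list
class(student[[1]])   # double brackets return the component itself
student[1:2]          # single brackets can select several components

# A vector inside double brackets indexes recursively:
# student[[1:2]] is the second element of the first component
student[[1:2]]
student[[1]][2]
```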
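It’s possible to give list members names and reference them by name instead of by numeric index. Assigning with the $ operator, as below, appends new named components to the list; the original unnamed components stay in place.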
student$scores = c(100, 95, 92, 87, 85, 77, 66, 58)
student$grades = c("A+", "A", "A-", "B+", "B", "C+", "D", "F")
student$office_hours = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
student
## [[1]]
## [1] 100 95 92 87 85 77 66 58
##
## [[2]]
## [1] "A+" "A" "A-" "B+" "B" "C+" "D" "F"
##
## [[3]]
## [1] TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE
##
## $scores
## [1] 100 95 92 87 85 77 66 58
##
## $grades
## [1] "A+" "A" "A-" "B+" "B" "C+" "D" "F"
##
## $office_hours
## [1] TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE
student$scores
## [1] 100 95 92 87 85 77 66 58
student$grades
## [1] "A+" "A" "A-" "B+" "B" "C+" "D" "F"
student$office_hours
## [1] TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE
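If the goal is to name the components without duplicating them, the names can be given when the list is created or attached afterwards with names(). This is a sketch with shortened vectors; student2 is a hypothetical name used only here.

```r
# Name the components when the list is created
student = list(scores = c(100, 95, 92),
               grades = c("A+", "A", "A-"))
names(student)

# Or name the components of an existing unnamed list
student2 = list(c(100, 95, 92), c("A+", "A", "A-"))
names(student2) = c("scores", "grades")
student2$grades
student2[["scores"]]     # names also work inside double brackets
```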
When we need to store data in table form, we use data frames, which combine vectors of equal length as columns. The variables of a data set are the columns and the observations are the rows.
The str() function displays the internal structure of any R object, which helps us check that the data were read in correctly.
str(top_il_income)
## 'data.frame': 10 obs. of 5 variables:
## $ rank : int 2 3 32 44 67 16 4 8 5 90
## $ county : Factor w/ 10 levels "DuPage","Kane",..: 1 4 5 7 8 3 10 6 2 9
## $ per_capita_income: int 38931 38459 33118 33059 31750 31110 30791 30728 30645 30594
## $ population : int 933736 703910 46045 33879 16387 123355 687263 266209 530847 7032
## $ region : int 2 2 4 5 4 2 2 5 2 4
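Besides str(), a few other base R functions help when exploring a data frame. The sketch below assumes a small data frame, top3, built by hand from the first three rows shown in this notebook; the name top3 is hypothetical.

```r
# First three rows of top_il_income, typed in by hand
top3 = data.frame(rank = c(2, 3, 32),
                  county = c("DuPage", "Lake", "McHenry"),
                  per_capita_income = c(38931, 38459, 33118),
                  stringsAsFactors = FALSE)

dim(top3)        # number of rows and columns
names(top3)      # column names
top3$county      # one column as a vector
top3[2, ]        # one row (observation)
top3[top3$per_capita_income > 38000, "county"]   # counties matching a condition
```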
Chicago Sports Teams
name = c("Bears", "Cubs", "White Sox", "Blackhawks", "Bulls")
sport = c("Football", "Baseball", "Baseball", "Hockey", "Basketball")
last_full_season_win_percentage = c("18.8%", "64%", "48.1%", "61%", "50%")
conference_finish = c("15", "1", "11", "1", "8")
championship = c(FALSE, TRUE, FALSE, FALSE, FALSE)
Now, by combining the vectors of equal size, we can create a data frame object.
Chicago_Sports_Team_df = data.frame(name, sport, last_full_season_win_percentage, conference_finish, championship)
Chicago_Sports_Team_df
## name sport last_full_season_win_percentage conference_finish
## 1 Bears Football 18.8% 15
## 2 Cubs Baseball 64% 1
## 3 White Sox Baseball 48.1% 11
## 4 Blackhawks Hockey 61% 1
## 5 Bulls Basketball 50% 8
## championship
## 1 FALSE
## 2 TRUE
## 3 FALSE
## 4 FALSE
## 5 FALSE
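In the data frame above, the win percentages and conference finishes are stored as character strings, so arithmetic on them is not possible until they are converted. One possible way to do this is sketched below; the names teams and win_pct are hypothetical, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
name = c("Bears", "Cubs", "White Sox", "Blackhawks", "Bulls")
last_full_season_win_percentage = c("18.8%", "64%", "48.1%", "61%", "50%")
conference_finish = c("15", "1", "11", "1", "8")
championship = c(FALSE, TRUE, FALSE, FALSE, FALSE)
teams = data.frame(name, last_full_season_win_percentage, conference_finish,
                   championship, stringsAsFactors = FALSE)

# Strip the percent sign, then convert the text to numbers
teams$win_pct = as.numeric(sub("%", "", teams$last_full_season_win_percentage))
teams$conference_finish = as.integer(teams$conference_finish)

mean(teams$win_pct)                        # arithmetic now works on the column
teams[teams$championship, "name"]          # teams that won a championship
teams[order(teams$conference_finish), ]    # rows sorted by conference finish
```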
Datacamp - Learn Data Science from your browser:
R-tutor - An R intro to stats that explains basic R concepts: