R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.
This notebook is a tutorial on how to use R.
We begin with a few basic operations.
A variable stores a value or an object (e.g. a function) under a name so that it can be reused later.
x = 435
y = 78
vars = c(2,5,7,12,23,25,28,34,36,39,40,41,42) # This is a vector
vars[1] # Returns the first value in the vector vars
## [1] 2
vars[2] # Returns the second value in the vector vars
## [1] 5
vars[1:3] # Returns the first through third values in the vector vars
## [1] 2 5 7
vars # Prints the whole vector
## [1] 2 5 7 12 23 25 28 34 36 39 40 41 42
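A few other indexing forms follow the same bracket pattern. The sketch below uses a shorter vector, `vars_demo`, a name introduced here only for illustration, so that it runs on its own.

```r
# Illustrative vector; vars_demo is not part of the tutorial data
vars_demo = c(2, 5, 7, 12, 23)

length(vars_demo)         # number of elements in the vector
vars_demo[-1]             # a negative index drops that position
vars_demo[c(1, 4)]        # a vector of positions selects several elements
vars_demo[vars_demo > 6]  # a logical condition keeps the matching elements
```

Note that R indexes from 1, not 0, so `vars_demo[1]` is the first element.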
Below are some simple arithmetic operations.
34*8
## [1] 272
139/3
## [1] 46.33333
7^3
## [1] 343
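Beyond multiplication, division and exponentiation, base R has a few more arithmetic operators that come up often. The following is a short sketch using arbitrary numbers chosen for illustration.

```r
17 %/% 5          # integer division: 3
17 %% 5           # remainder (modulo): 2
sqrt(144)         # square root: 12
c(1, 2, 3) * 10   # arithmetic is applied element by element to vectors
```

The last line is worth noticing: most arithmetic in R is vectorized, so no explicit loop is needed to scale every element.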
R works with numerous data types. Some of the most basic types are numeric, integer, logical (Boolean: TRUE/FALSE) and character (string: "TEXT"). Factors, shown last below, represent categorical data.
#Type: Character
#Example:"TRUE",'23.4'
v = "issa"
class(v)
## [1] "character"
#Type: Numeric
#Example: 12.3,5
v = 46.87
class(v)
## [1] "numeric"
#Type: Logical
#Example: TRUE,FALSE
v = FALSE
class(v)
## [1] "logical"
#Type: Factor
#Example: m f m f m
v = as.factor(c("b", "ba", "bad"))
class(v)
## [1] "factor"
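The integer type is mentioned above but not demonstrated, so here is a minimal sketch of it, along with a type conversion and a look at factor levels. The values are arbitrary and chosen only for illustration.

```r
v = 12L                  # the L suffix creates an integer
class(v)                 # "integer"
is.numeric(v)            # TRUE: integers also count as numeric

v = as.numeric("23.4")   # convert a character string to a number
class(v)                 # "numeric"

f = as.factor(c("m", "f", "m", "f", "m"))
levels(f)                # the distinct categories, sorted: "f" "m"
```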
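Before reading data files, we need to make sure the working directory is set so that relative paths such as data/il_income.csv resolve correctly. We can then load the two data sets used in this tutorial.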
il_income = read.csv(file = "data/il_income.csv")
top_il_income = read.csv(file = "data/top_il_income.csv")
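One way to check and set the working directory is sketched below. The directory `path/to/project` is a placeholder, not a path from this tutorial, so replace it with the folder that contains the `data` directory on your machine.

```r
getwd()                        # print the current working directory

# "path/to/project" is a placeholder path, assumed for illustration
project_dir = "path/to/project"
if (dir.exists(project_dir)) {
  setwd(project_dir)           # relative paths now resolve from here
}

file.exists("data/il_income.csv")  # TRUE if the CSV can be found from here
```

If `file.exists()` returns FALSE, `read.csv()` will fail with a "cannot open file" error, so this is a quick check to run first.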
We can extract values from the dataset to perform calculations.
DuPage = top_il_income$per_capita_income[1]
Lake = top_il_income$per_capita_income[2]
(DuPage-Lake)*4
## [1] 1888
DuPage+Lake-Lake
## [1] 38931
(DuPage+Lake)/3
## [1] 25796.67
mean(il_income$per_capita_income)
## [1] 25164.14
median(il_income$per_capita_income)
## [1] 24808.5
quantile(il_income$per_capita_income)
## 0% 25% 50% 75% 100%
## 14052.00 22666.00 24808.50 26899.75 38931.00
summary(il_income$population)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4135 14284 26610 126078 53319 5238216
summary(il_income$per_capita_income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14052 22666 24809 25164 26900 38931
summary(il_income)
## rank county per_capita_income population
## Min. : 1.00 Adams : 1 Min. :14052 Min. : 4135
## 1st Qu.: 26.25 Alexander: 1 1st Qu.:22666 1st Qu.: 14284
## Median : 51.50 Bond : 1 Median :24809 Median : 26610
## Mean : 51.50 Boone : 1 Mean :25164 Mean : 126078
## 3rd Qu.: 76.75 Brown : 1 3rd Qu.:26900 3rd Qu.: 53319
## Max. :102.00 Bureau : 1 Max. :38931 Max. :5238216
## (Other) :96
## region
## Min. :1.000
## 1st Qu.:3.000
## Median :4.000
## Mean :3.735
## 3rd Qu.:5.000
## Max. :5.000
##
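To see what these summary functions compute without needing the CSV files, the sketch below applies them to a small vector. The values in `incomes` are made up for illustration and are not taken from il_income.

```r
# Made-up values for illustration only
incomes = c(20000, 22000, 25000, 27000, 31000)

mean(incomes)      # arithmetic average: 25000
median(incomes)    # middle value of the sorted data: 25000
range(incomes)     # smallest and largest values: 20000 31000
sd(incomes)        # sample standard deviation
quantile(incomes, probs = c(0.25, 0.75))  # first and third quartiles
```

`summary()` reports the same minimum, quartiles, median, mean and maximum in a single call, which is why its output above matches the separate `mean()`, `median()` and `quantile()` results.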
A sequence of data elements of the same basic type is defined as a vector.
# vector of numeric values
c(2, 4,5,7,20,23,26,45,48,59,70)
## [1] 2 4 5 7 20 23 26 45 48 59 70
# vector of logical values.
c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
## [1] TRUE TRUE FALSE FALSE TRUE FALSE TRUE
# vector of character strings.
c("2017", "2018", "2018", "2017", "2019","2018","2019")
## [1] "2017" "2018" "2018" "2017" "2019" "2018" "2019"
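Because all elements of a vector must share one type, R silently converts mixed input to the most general type. A minimal sketch, with the name `mixed` introduced here for illustration:

```r
mixed = c(1, "a", TRUE)    # numeric, character and logical together
mixed                      # every element becomes a character string
class(mixed)               # "character"

class(c(1, TRUE, FALSE))   # logical and numeric combine to "numeric"
```

This coercion is a common source of surprises, for example when a single stray text value turns a whole numeric column into characters.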
Lists, as opposed to vectors, can hold components of different types.
field_goals = c(5,2,12,13,0,9) # vector of numeric values
fg_percentage = c("56%", "30%", "64%","72%","0%","57%") # vector of character strings.
practice = c(TRUE, FALSE, FALSE,TRUE,FALSE,FALSE) # vector of logical values.
player = list(field_goals,fg_percentage,practice) # list of vectors
player
## [[1]]
## [1] 5 2 12 13 0 9
##
## [[2]]
## [1] "56%" "30%" "64%" "72%" "0%" "57%"
##
## [[3]]
## [1] TRUE FALSE FALSE TRUE FALSE FALSE
We can retrieve components of the list with the single square bracket [] operator. The result is itself a list that contains the selected components.
player[1]
## [[1]]
## [1] 5 2 12 13 0 9
player[2]
## [[1]]
## [1] "56%" "30%" "64%" "72%" "0%" "57%"
player[3]
## [[1]]
## [1] TRUE FALSE FALSE TRUE FALSE FALSE
# first two components of the list
player[1:2]
## [[1]]
## [1] 5 2 12 13 0 9
##
## [[2]]
## [1] "56%" "30%" "64%" "72%" "0%" "57%"
Using the double square bracket [[]] operator we can reference a member of the list directly.
player[[2]] # The fg_percentage vector
## [1] "56%" "30%" "64%" "72%" "0%" "57%"
Third element of the fg_percentage vector
player[[2]][3]
## [1] "64%"
First four elements of the practice vector
player[[3]][1:4]
## [1] TRUE FALSE FALSE TRUE
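The difference between the two bracket operators is easiest to see by checking the class of what each returns. The sketch below uses a shortened list, `player_demo`, a name introduced here so the example runs on its own.

```r
# Shortened illustrative list; player_demo is not part of the tutorial data
player_demo = list(c(5, 2, 12), c("56%", "30%", "64%"))

class(player_demo[1])      # "list": single brackets return a sub-list
class(player_demo[[1]])    # "numeric": double brackets return the vector itself

sum(player_demo[[1]])      # works on the numeric vector: 19
# sum(player_demo[1])      # would raise an error, because sum() cannot add up a list
```

As a rule of thumb, use `[[ ]]` when you want to compute with a component and `[ ]` when you want a smaller list.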
It’s possible to assign names to list members and reference them by names instead of by numeric indexes.
player = list(field_goals = c(5,2,12,13,0,9), fg_percentage = c("56%", "30%", "64%","72%","0%","57%"), practice = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE))
player
## $field_goals
## [1] 5 2 12 13 0 9
##
## $fg_percentage
## [1] "56%" "30%" "64%" "72%" "0%" "57%"
##
## $practice
## [1] TRUE FALSE FALSE TRUE FALSE FALSE
player$field_goals
## [1] 5 2 12 13 0 9
player$fg_percentage
## [1] "56%" "30%" "64%" "72%" "0%" "57%"
player$practice
## [1] TRUE FALSE FALSE TRUE FALSE FALSE
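Named lists can also be inspected and extended by name. This is a small sketch with a shortened list; `player_demo` and the added `games` component are names assumed here for illustration.

```r
# Shortened illustrative list
player_demo = list(field_goals = c(5, 2, 12), practice = c(TRUE, FALSE, FALSE))

names(player_demo)              # "field_goals" "practice"
player_demo[["field_goals"]]    # same result as player_demo$field_goals
player_demo$games = 3           # assigning to a new name adds a component
length(player_demo)             # the list now has 3 components
```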
When we need to store data in table form, we use data frames, which are created by combining vectors of equal length. The variables of a data set are the columns and the observations are the rows.
The str() function displays the internal structure of any R object, which lets us check that the data was read in as expected.
str(top_il_income)
## 'data.frame': 10 obs. of 5 variables:
## $ rank : int 2 3 32 44 67 16 4 8 5 90
## $ county : Factor w/ 10 levels "DuPage","Kane",..: 1 4 5 7 8 3 10 6 2 9
## $ per_capita_income: int 38931 38459 33118 33059 31750 31110 30791 30728 30645 30594
## $ population : int 933736 703910 46045 33879 16387 123355 687263 266209 530847 7032
## $ region : int 2 2 4 5 4 2 2 5 2 4
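`str()` is not limited to data frames loaded from files. The sketch below applies it to a few small objects built inline; `scores` and its columns are made-up names for illustration.

```r
# Made-up data for illustration
scores = data.frame(id = 1:3, passed = c(TRUE, FALSE, TRUE))

str(scores)                  # one line per column: name, type, first values
str(c(1.5, 2.5))             # also works on plain vectors
str(list(a = 1, b = "x"))    # and on lists, showing each component's type
```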
As an example, here is a snapshot of NBA players, stored as separate vectors.
name = c("LeBron James", "Russell Westbrook", "Kevin Durant")
team = c("Cleveland Cavaliers","Oklahoma City Thunder", "Golden State Warriors")
team_changes = c(2, 0, 1)
friends_betrayed = c(0, 0, 1)
ring = c(TRUE, FALSE, TRUE)
Now, by combining the vectors of equal size, we can create a data frame object.
nba_df = data.frame(name,team,team_changes,friends_betrayed,ring)
nba_df
## name team team_changes friends_betrayed
## 1 LeBron James Cleveland Cavaliers 2 0
## 2 Russell Westbrook Oklahoma City Thunder 0 0
## 3 Kevin Durant Golden State Warriors 1 1
## ring
## 1 TRUE
## 2 FALSE
## 3 TRUE
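Once a data frame exists, its rows and columns can be selected much like vectors and lists. The sketch below rebuilds a smaller version of the table under the name `df`, chosen here for illustration, so that it runs on its own.

```r
# Smaller data frame rebuilt here so the sketch is self-contained
df = data.frame(
  name = c("LeBron James", "Russell Westbrook", "Kevin Durant"),
  team_changes = c(2, 0, 1),
  ring = c(TRUE, FALSE, TRUE)
)

nrow(df)                 # number of observations (rows): 3
ncol(df)                 # number of variables (columns): 3
df$name                  # one column, selected by name
df[2, ]                  # second row, all columns
df[df$ring, "name"]      # names of the players for whom ring is TRUE
mean(df$team_changes)    # column summaries work as with any vector: 1
```

In the `df[rows, columns]` form, leaving one side empty selects everything along that dimension.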
Datacamp - Learn Data Science from your browser:
R-tutor - An R intro to stats that explains basic R concepts: