R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.
This notebook is a tutorial on how to use R.
First we will begin with a few basic operations.
A variable stores a value or an object (e.g. a function) under a name, so that it can be reused later.
x = 2687
y = 190
vars = c(85,2,14,66,32,74,93,101) # This is a vector
vars[1] #This calls the first value in the vector vars
## [1] 85
vars[8] #This calls the eighth (last) value in the vector vars
## [1] 101
vars[1:3] #This calls the first through third values in the vector vars
## [1] 85 2 14
vars[4] #This calls the fourth value in the vector vars
## [1] 66
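Beyond single positions and ranges, a vector can also be indexed with a negative number (to drop an element) or with a logical condition (to keep only matching elements). The short sketch below reuses the vars vector from above; the names without_first and large are introduced here for illustration only.

```r
vars <- c(85, 2, 14, 66, 32, 74, 93, 101)

length(vars)                 # number of elements: 8
without_first <- vars[-1]    # drop the first element
large <- vars[vars > 70]     # keep only values greater than 70

without_first                # 2 14 66 32 74 93 101
large                        # 85 74 93 101
```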
Below shows some simple arithmetic operations.
12*230
## [1] 2760
450/15
## [1] 30
22^5
## [1] 5153632
14*22/4
## [1] 77
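Stored variables can be used in the same kind of arithmetic. A small sketch, reusing x and y from above (total, quotient and remainder are names made up for this example). It uses <-, the more conventional assignment operator in R, which behaves like = here.

```r
x <- 2687
y <- 190

total <- x + y          # addition: 2877
quotient <- x %/% y     # integer division: 14
remainder <- x %% y     # remainder (modulo): 27

total
quotient
remainder
```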
R works with numerous data types. Some of the most basic types are numeric, integer, logical (Boolean: TRUE/FALSE), character (string: "TEXT") and factor (categorical values).
#Type: Character
#Example:"TRUE",'23.4'
v = "FALSE"
class(v)
## [1] "character"
#Type: Numeric
#Example: 12.3,5,42
v = 48.5
class(v)
## [1] "numeric"
#Type: Logical
#Example: TRUE,FALSE
v = FALSE
class(v)
## [1] "logical"
#Type: Factor
#Example: m f m f m m
v = as.factor(c("m", "f", "m"))
class(v)
## [1] "factor"
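The class() calls above can be complemented with the is.*() and as.*() families of functions, which test and convert types. The sketch below also shows an integer, which the examples above do not cover: in R a whole number is stored as numeric unless it carries the L suffix. The variable names n, txt and num are illustrative only.

```r
n <- 5L                      # the L suffix makes this an integer
class(n)                     # "integer"
class(5)                     # "numeric" (a double), despite being a whole number

txt <- "23.4"                # a character string that looks like a number
is.numeric(txt)              # FALSE
num <- as.numeric(txt)       # convert the string to a number
num + 1                      # 24.4

as.logical("TRUE")           # TRUE
```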
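Before reading data into R, we need to set the working directory so that relative paths such as data/il_income.csv resolve correctly. The two CSV files are then read into data frames.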
il_income = read.csv(file = "data/il_income.csv")
top_il_income = read.csv(file = "data/top_il_income.csv")
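A minimal sketch of the working-directory step, assuming the CSV files sit in a data folder below the project directory. The path passed to setwd() is a placeholder, so that line is left commented out; csv_path is a name introduced for this example.

```r
getwd()                               # print the current working directory
# setwd("path/to/your/project")       # placeholder path: replace with your own folder

csv_path <- file.path("data", "il_income.csv")   # builds "data/il_income.csv"

# Only read the file if it can be found from the current working directory
if (file.exists(csv_path)) {
  il_income <- read.csv(file = csv_path)
} else {
  message("File not found: ", csv_path, " (check the working directory)")
}
```

Checking file.exists() first turns a wrong working directory into a readable message instead of an error from read.csv().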
We can extract values from the dataset to perform calculations.
Piatt = top_il_income$per_capita_income[2] # per-capita income in the second row of top_il_income
Piatt-Piatt*Piatt
## [1] -1479056222
Piatt+Piatt/Piatt
## [1] 38460
(Piatt+Piatt)/8
## [1] 9614.75
mean(il_income$per_capita_income)
## [1] 25164.14
median(il_income$per_capita_income)*mean(il_income$per_capita_income)
## [1] 624284499
quantile(il_income$per_capita_income)
## 0% 25% 50% 75% 100%
## 14052.00 22666.00 24808.50 26899.75 38931.00
mode(il_income$per_capita_income) # mode() reports the storage type, not the most frequent value
## [1] "numeric"
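Base R has no built-in function for the statistical mode (the most frequent value), so a small helper is one common workaround. The sketch below is one possible version: stat_mode is a hypothetical name and the input vector is made-up data, not taken from the income files.

```r
# Return the most frequent value of x (the first one, if there are ties)
stat_mode <- function(x) {
  ux <- unique(x)                       # distinct values, in order of appearance
  counts <- tabulate(match(x, ux))      # how often each distinct value occurs
  ux[which.max(counts)]
}

stat_mode(c(2, 3, 3, 5, 5, 5, 8))       # 5
```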
summary(il_income)
## rank county per_capita_income population
## Min. : 1.00 Adams : 1 Min. :14052 Min. : 4135
## 1st Qu.: 26.25 Alexander: 1 1st Qu.:22666 1st Qu.: 14284
## Median : 51.50 Bond : 1 Median :24809 Median : 26610
## Mean : 51.50 Boone : 1 Mean :25164 Mean : 126078
## 3rd Qu.: 76.75 Brown : 1 3rd Qu.:26900 3rd Qu.: 53319
## Max. :102.00 Bureau : 1 Max. :38931 Max. :5238216
## (Other) :96
## region
## Min. :1.000
## 1st Qu.:3.000
## Median :4.000
## Mean :3.735
## 3rd Qu.:5.000
## Max. :5.000
##
summary(top_il_income)
## rank county per_capita_income population
## Min. : 2.00 DuPage :1 Min. :30594 Min. : 7032
## 1st Qu.: 4.25 Kane :1 1st Qu.:30744 1st Qu.: 36921
## Median :12.00 Kendall:1 Median :31430 Median :194782
## Mean :27.10 Lake :1 Mean :32919 Mean :334866
## 3rd Qu.:41.00 McHenry:1 3rd Qu.:33103 3rd Qu.:648159
## Max. :90.00 McLean :1 Max. :38931 Max. :933736
## (Other):4
## region
## Min. :2.0
## 1st Qu.:2.0
## Median :3.0
## Mean :3.2
## 3rd Qu.:4.0
## Max. :5.0
##
A sequence of data elements of the same basic type is defined as a vector.
# vector of numeric values
c(2, 3, 5, 8, 6, 89, 14, 77, 90, 74, 24)
## [1] 2 3 5 8 6 89 14 77 90 74 24
# vector of logical values.
c(FALSE, TRUE, FALSE, FALSE)
## [1] FALSE TRUE FALSE FALSE
# vector of character strings.
c("A", "B", "B-", "C", "D", "F", "B+", "C-", "D+", "A-", "D-")
## [1] "A" "B" "B-" "C" "D" "F" "B+" "C-" "D+" "A-" "D-"
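Because a vector can only hold one basic type, R silently converts mixed inputs to the most general type involved. A short sketch (mixed and flags are names made up for the example):

```r
mixed <- c(1, "a", TRUE)     # number, string and logical together
class(mixed)                 # "character": everything became a string
mixed                        # "1" "a" "TRUE"

flags <- c(1, TRUE, FALSE)   # numbers and logicals together
class(flags)                 # "numeric": TRUE becomes 1, FALSE becomes 0
flags                        # 1 1 0
```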
Lists, as opposed to vectors, can hold components of different types.
scores = c(80, 75, 55, 100, 59, 23, 79, 83, 72) # vector of numeric values
grades = c("B", "C", "D-", "A", "F", "F", "C+", "B", "C-") # vector of character strings.
office_hours = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE) # vector of logical values.
student = list(scores,grades,office_hours) # list of vectors
student
## [[1]]
## [1] 80 75 55 100 59 23 79 83 72
##
## [[2]]
## [1] "B" "C" "D-" "A" "F" "F" "C+" "B" "C-"
##
## [[3]]
## [1] TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE
We can retrieve components of the list with the single square bracket [] operator, which returns a new list containing the selected components.
student[2]
## [[1]]
## [1] "B" "C" "D-" "A" "F" "F" "C+" "B" "C-"
student[3]
## [[1]]
## [1] TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE
student[1]
## [[1]]
## [1] 80 75 55 100 59 23 79 83 72
# first three components of the list
student[1:3]
## [[1]]
## [1] 80 75 55 100 59 23 79 83 72
##
## [[2]]
## [1] "B" "C" "D-" "A" "F" "F" "C+" "B" "C-"
##
## [[3]]
## [1] TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE
student[2:3]
## [[1]]
## [1] "B" "C" "D-" "A" "F" "F" "C+" "B" "C-"
##
## [[2]]
## [1] TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE
Using the double square bracket [[]] operator we can reference a member of the list directly.
student[[3]] # Components of the office_hours vector
## [1] TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE
Second element of the scores vector
student[[1]][2]
## [1] 75
First three elements of the scores vector
student[[1]][1:3]
## [1] 80 75 55
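The difference between the two operators is easiest to see by checking what each one returns. The sketch below uses a shortened, hypothetical version of the student list (small_student) so that it runs on its own.

```r
small_student <- list(c(80, 75, 55), c("B", "C", "D-"))

class(small_student[1])      # "list": single brackets return a smaller list
class(small_student[[1]])    # "numeric": double brackets return the vector itself

# Arithmetic works on the extracted vector, not on the one-element list
mean(small_student[[1]])     # 70
```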
It’s possible to assign names to list members and reference them by names instead of by numeric indexes.
student = list(outof100 = c(80, 75, 55, 100, 59, 23, 79, 83, 72), lettergrades = c("B", "C", "D-", "A", "F", "F", "C+", "B"), help = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE))
student
## $outof100
## [1] 80 75 55 100 59 23 79 83 72
##
## $lettergrades
## [1] "B" "C" "D-" "A" "F" "F" "C+" "B"
##
## $help
## [1] TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE
student$outof100
## [1] 80 75 55 100 59 23 79 83 72
student$lettergrades
## [1] "B" "C" "D-" "A" "F" "F" "C+" "B"
student$help
## [1] TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE
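Named members also work with the double square bracket operator, and names() lists the available names. A brief sketch on a shortened, made-up version of the list above (named_student is a hypothetical name):

```r
named_student <- list(outof100 = c(80, 75, 55),
                      lettergrades = c("B", "C", "D-"))

names(named_student)                 # "outof100" "lettergrades"
named_student[["outof100"]]          # same result as named_student$outof100
max(named_student$outof100)          # 80

# Assigning to a new name adds a member to the list
named_student$help <- c(TRUE, FALSE, FALSE)
length(named_student)                # 3
```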
When we need to store data in table form, we use data frames. A data frame is a list of vectors of equal length: the variables of a data set are the columns and the observations are the rows.
The str() function helps us to display the internal structure of any R data structure or object to make sure that it’s correct.
str(top_il_income)
## 'data.frame': 10 obs. of 5 variables:
## $ rank : int 2 3 32 44 67 16 4 8 5 90
## $ county : Factor w/ 10 levels "DuPage","Kane",..: 1 4 5 7 8 3 10 6 2 9
## $ per_capita_income: int 38931 38459 33118 33059 31750 31110 30791 30728 30645 30594
## $ population : int 933736 703910 46045 33879 16387 123355 687263 266209 530847 7032
## $ region : int 2 2 4 5 4 2 2 5 2 4
As an example, the vectors below hold a small snapshot of the solar system.
name = c("Pluto", "Earth", "Mars", "Jupiter", "Uranus")
type = c("Planetoid", "Terrestrial","Terrestrial", "Gas giant", "Gas giant")
diameter = c(.003, 1, 0.532, 11.209, 5.989)
rotation = c(.03, 1, 1.03, 0.41, 4)
rings = c(FALSE, FALSE, FALSE, TRUE, TRUE)
Now, by combining the vectors of equal size, we can create a data frame object.
planets_df = data.frame(name,type,diameter,rotation,rings)
planets_df
## name type diameter rotation rings
## 1 Pluto Planetoid 0.003 0.03 FALSE
## 2 Earth Terrestrial 1.000 1.00 FALSE
## 3 Mars Terrestrial 0.532 1.03 FALSE
## 4 Jupiter Gas giant 11.209 0.41 TRUE
## 5 Uranus Gas giant 5.989 4.00 TRUE
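Once the data frame exists, columns can be pulled out with $ and rows can be filtered with a logical condition inside [rows, columns]. The sketch below rebuilds planets_df from the same values so that it runs on its own. As an assumption of this example, stringsAsFactors = FALSE is set explicitly so that the text columns stay character vectors regardless of the R version; ringed and big are names made up here.

```r
planets_df <- data.frame(
  name = c("Pluto", "Earth", "Mars", "Jupiter", "Uranus"),
  type = c("Planetoid", "Terrestrial", "Terrestrial", "Gas giant", "Gas giant"),
  diameter = c(0.003, 1, 0.532, 11.209, 5.989),
  rotation = c(0.03, 1, 1.03, 0.41, 4),
  rings = c(FALSE, FALSE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

dim(planets_df)                                  # 5 rows, 5 columns
planets_df$diameter                              # one column as a vector
ringed <- planets_df[planets_df$rings, "name"]   # names of the rows where rings is TRUE
ringed                                           # "Jupiter" "Uranus"
big <- planets_df[planets_df$diameter > 1, ]     # rows with diameter above 1
nrow(big)                                        # 2
```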
Datacamp - Learn Data Science from your browser:
R-tutor - An R intro to stats that explains basic R concepts: