R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.
This notebook is a tutorial on how to use R.
First we will begin with a few basic operations.
A variable lets you store a value or an object (e.g. a function) under a name so that you can reuse it later.
x = 3000
y = 25
vars = c(3,5,12,16,20,55) # This is a vector
vars[3] # This returns the third value in the vector vars
## [1] 12
vars[5] # This returns the fifth value in the vector vars
## [1] 20
vars[2:5] # This returns the second through fifth values in the vector vars
## [1] 5 12 16 20
vars #This calls the vector
## [1] 3 5 12 16 20 55
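As a small sketch of the two ideas above (the names `square` and `nums` are made up for illustration), a variable can hold a function as well as a value, and R vectors are indexed starting from 1 rather than 0.

```r
# A variable can hold a function just as it holds a number
square = function(n) n^2
sixteen = square(4)          # 16

# R indexes vectors from 1, so nums[1] is the first value
nums = c(3, 5, 12, 16, 20, 55)
first = nums[1]              # 3
middle = nums[2:5]           # 5 12 16 20
without_first = nums[-1]     # a negative index drops that element: 5 12 16 20 55
```

Negative indexes are a convenient way to exclude elements without listing all the positions you want to keep.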
Below are some simple arithmetic operations.
126*16
## [1] 2016
3542/6
## [1] 590.3333
99^6
## [1] 941480149401
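Beyond multiplication, division and exponentiation, a few other built-in operators are worth knowing. The following is a short sketch; the variable names are only illustrative.

```r
quotient  = 17 %/% 5      # integer division: 3
remainder = 17 %% 5       # remainder (modulo): 2
root      = sqrt(144)     # square root: 12
grouped   = (2 + 3) * 4   # parentheses control the order of evaluation: 20
ungrouped = 2 + 3 * 4     # multiplication binds tighter than addition: 14
```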
R works with numerous data types. Some of the most basic types are numeric, integer, logical (Boolean: TRUE/FALSE) and character (string: "TEXT").
#Type: Numeric
#Example: 12.3, 5
v = 12
class(v)
## [1] "numeric"
#Type: Character
#Example: "12", '23.4'
v = "false"
class(v)
## [1] "character"
#Type: Logical
#Example: TRUE,FALSE
v = FALSE
class(v)
## [1] "logical"
#Type: Factor
#Example: m f m f m
v = factor(c("m", "f", "m")) # factor() turns the character vector into a factor
class(v)
## [1] "factor"
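The integer type is mentioned above but not shown. As a minimal sketch (variable names are illustrative), the `L` suffix creates an integer, and the `as.*` functions convert between types.

```r
n = 12                          # numbers are numeric (double) by default
i = 12L                         # the L suffix makes an integer
class_n = class(n)              # "numeric"
class_i = class(i)              # "integer"
int_is_numeric = is.numeric(i)  # TRUE: integers also count as numeric

from_text = as.numeric("23.4")  # character to numeric: 23.4
to_text   = as.character(12)    # numeric to character: "12"
```

Note that a number wrapped in quotes, such as "12", is a character string until it is converted.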
Before loading data, we need to set the working directory, because the relative file paths below are resolved against it.
il_income = read.csv("data/il_income.csv")
top_il_income = read.csv("data/top_il_income.csv")
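The notebook does not show how the working directory is set. One possible sketch uses `getwd()` and `setwd()`; here `tempdir()` stands in for your own project folder and `example.csv` is a made-up file, so treat both as assumptions rather than part of the tutorial's data.

```r
old_wd = getwd()                 # remember the current working directory
setwd(tempdir())                 # assumption: tempdir() stands in for your project folder

# Create a tiny hypothetical CSV so the relative path below resolves
dir.create("data", showWarnings = FALSE)
write.csv(data.frame(county = c("A", "B"), population = c(100, 200)),
          "data/example.csv", row.names = FALSE)

# The path is interpreted relative to the working directory
example = read.csv("data/example.csv")
n_rows = nrow(example)           # 2

setwd(old_wd)                    # restore the original working directory
```

If `read.csv()` reports that it cannot open a file, checking `getwd()` is usually the first thing to try.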
We can extract values from the dataset to perform calculations.
DuPage = top_il_income$per_capita_income[1] # row 1 of the dataset is DuPage County
Lake = top_il_income$per_capita_income[2] # row 2 of the dataset is Lake County
DuPage-Lake
## [1] 472
DuPage+Lake
## [1] 77390
(DuPage+Lake)/3
## [1] 25796.67
mean(il_income$population)
## [1] 126078.4
median(il_income$population)
## [1] 26610
quantile(il_income$population)
## 0% 25% 50% 75% 100%
## 4135.00 14284.25 26610.00 53319.00 5238216.00
summary(top_il_income)
##       rank           county   per_capita_income   population
##  Min.   : 2.00   DuPage :1   Min.   :30594       Min.   :  7032
##  1st Qu.: 4.25   Kane   :1   1st Qu.:30744       1st Qu.: 36920
##  Median :12.00   Kendall:1   Median :31430       Median :194782
##  Mean   :27.10   Lake   :1   Mean   :32918       Mean   :334866
##  3rd Qu.:41.00   McHenry:1   3rd Qu.:33103       3rd Qu.:648159
##  Max.   :90.00   McLean :1   Max.   :38931       Max.   :933736
##                  (Other):4
##      region
##  Min.   :2.0
##  1st Qu.:2.0
##  Median :3.0
##  Mean   :3.2
##  3rd Qu.:4.0
##  Max.   :5.0
##
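The mean population of il_income (126078.4) is far larger than its median (26610), which suggests a few very large counties pull the mean upward. The following sketch reproduces that effect on a small made-up vector; the values are illustrative and not taken from the dataset.

```r
# Hypothetical values with one large outlier
values = c(2, 4, 4, 10, 30)

avg = mean(values)        # 10: pulled up by the outlier 30
mid = median(values)      # 4: the middle value, unaffected by the outlier
qs  = quantile(values)    # 0%, 25%, 50%, 75%, 100%: 2 4 4 10 30
```

When the mean and median differ this much, the median is often the better description of a typical value.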
A sequence of data elements of the same basic type is defined as a vector.
# vector of numeric values
c(3, 6 , 12, 27)
## [1] 3 6 12 27
# vector of logical values.
c(TRUE, FALSE, TRUE)
## [1] TRUE FALSE TRUE
# vector of character strings.
c("A+", "A" , "B", "B-", "C" )
## [1] "A+" "A" "B" "B-" "C"
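Because every element of a vector must have the same basic type, R silently converts mixed inputs to a common type. A brief sketch, with illustrative names:

```r
# Mixing a number, a string and a logical gives a character vector
mixed = c(1, "a", TRUE)
mixed_class = class(mixed)        # "character"; mixed is "1" "a" "TRUE"

# Mixing numbers and logicals gives a numeric vector (TRUE is 1, FALSE is 0)
nums_and_logicals = c(1, TRUE, FALSE)   # 1 1 0
```

This coercion is a common source of surprises, for example when one stray text value turns a whole numeric column into characters.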
Lists, as opposed to vectors, can hold components of different types.
scores = c(60, 70, 30) # vector of numeric values
grades = c("D-", "C-", "F") # vector of character strings.
office_hours = c(FALSE, FALSE, FALSE) # vector of logical values.
student = list(scores,grades,office_hours) # list of vectors
student
## [[1]]
## [1] 60 70 30
##
## [[2]]
## [1] "D-" "C-" "F"
##
## [[3]]
## [1] FALSE FALSE FALSE
We can retrieve components of the list with the single square bracket [] operator. The result is itself a list containing the selected components.
student[3]
## [[1]]
## [1] FALSE FALSE FALSE
student[1]
## [[1]]
## [1] 60 70 30
student[2]
## [[1]]
## [1] "D-" "C-" "F"
# last two components of the list
student[2:3]
## [[1]]
## [1] "D-" "C-" "F"
##
## [[2]]
## [1] FALSE FALSE FALSE
Using the double square bracket [[]] operator, we can reference a member of the list directly.
student[[1]] # the scores vector, the first component of the list
## [1] 60 70 30
# second element of the scores vector
student[[1]][2]
## [1] 70
# last two elements of the scores vector
student[[1]][2:3]
## [1] 70 30
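To make the difference between the two operators concrete, here is a minimal sketch that rebuilds the same unnamed list and compares what `[ ]` and `[[ ]]` return.

```r
student = list(c(60, 70, 30), c("D-", "C-", "F"), c(FALSE, FALSE, FALSE))

as_list   = student[1]      # single brackets: a list of length 1
as_vector = student[[1]]    # double brackets: the numeric vector itself

class_single = class(as_list)     # "list"
class_double = class(as_vector)   # "numeric"

# Numeric functions need the vector, so use [[ ]] before computing
avg_score = mean(student[[1]])    # (60 + 70 + 30) / 3
```

Calling `mean(student[1])` instead returns NA with a warning, because a list is not a numeric vector.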
It’s possible to assign names to list members and reference them by names instead of by numeric indexes.
student = list(grades = c("D-", "C-", "F"), scores = c(60, 70, 30), office_hours = c(FALSE, FALSE, FALSE))
student
## $grades
## [1] "D-" "C-" "F"
##
## $scores
## [1] 60 70 30
##
## $office_hours
## [1] FALSE FALSE FALSE
student$grades
## [1] "D-" "C-" "F"
student$scores
## [1] 60 70 30
student$office_hours
## [1] FALSE FALSE FALSE
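Named components can also be listed, accessed with `[[ ]]` and a string, and added after the list is created. A short sketch follows; the `attendance` component is made up for illustration.

```r
student = list(grades = c("D-", "C-", "F"), scores = c(60, 70, 30))

before = names(student)             # "grades" "scores"
by_name = student[["scores"]]       # same result as student$scores

# Assigning to a new name adds a component to the list
student$attendance = c(TRUE, TRUE, FALSE)
after = names(student)              # "grades" "scores" "attendance"
```

The `[[ ]]` form is useful when the component name is stored in a variable, since `$` only accepts a literal name.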
When we need to store data in table form, we use data frames. A data frame is a list of vectors of equal length: the variables of a data set are the columns and the observations are the rows.
The str() function displays the internal structure of any R data structure or object, so we can check that it was created or loaded correctly.
str(top_il_income)
## 'data.frame': 10 obs. of 5 variables:
## $ rank : int 2 3 32 44 67 16 4 8 5 90
## $ county : Factor w/ 10 levels "DuPage","Kane",..: 1 4 5 7 8 3 10 6 2 9
## $ per_capita_income: int 38931 38459 33118 33059 31750 31110 30791 30728 30645 30594
## $ population : int 933736 703910 46045 33879 16387 123355 687263 266209 530847 7032
## $ region : int 2 2 4 5 4 2 2 5 2 4
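In the output above, county is stored as a Factor. Since R 4.0.0, `read.csv()` and `data.frame()` keep text columns as character by default, so on a recent R version you may see `chr` instead. The sketch below uses a small illustrative data frame (`counties` is a made-up name) to show how to request factors explicitly.

```r
counties = data.frame(
  county = c("DuPage", "Lake", "Kane"),
  region = c(2, 2, 2),
  stringsAsFactors = TRUE    # convert text columns to factors
)

str(counties)                          # prints the structure, county shown as a Factor
county_is_factor = is.factor(counties$county)
county_levels = levels(counties$county)   # levels are sorted: "DuPage" "Kane" "Lake"
```

The same `stringsAsFactors` argument can be passed to `read.csv()`.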
As an example, here is a small data set of color preferences, stored as four vectors of equal length.
name = c("Purple", "Gold", "Black")
type = c("Color","Metal", "Color")
preference = c(33, 20 , 47 )
color = c(TRUE, TRUE, TRUE)
Now, by combining the vectors of equal size, we can create a data frame object.
colors_df = data.frame(name,type,preference,color) # each vector becomes one column
colors_df
## name type preference color
## 1 Purple Color 33 TRUE
## 2 Gold Metal 20 TRUE
## 3 Black Color 47 TRUE
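Once a data frame exists, columns can be pulled out with `$`, rows can be filtered with a logical condition, and new columns can be added by assignment. The following is a minimal sketch on a reduced version of the data; `prefs`, `popular` and `share` are illustrative names.

```r
prefs = data.frame(
  name = c("Purple", "Gold", "Black"),
  preference = c(33, 20, 47)
)

n_rows = nrow(prefs)                        # 3 observations
n_cols = ncol(prefs)                        # 2 variables

# Keep only the rows where preference exceeds 30
popular = prefs[prefs$preference > 30, ]    # Purple and Black

# Add a column holding each preference as a share of the total
prefs$share = prefs$preference / sum(prefs$preference)
```

The comma inside the square brackets matters: the condition before it selects rows, and leaving the part after it empty keeps all columns.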
Datacamp - Learn Data Science from your browser:
R-tutor - An R intro to stats that explains basic R concepts: