Package: a collection of R functions, data, and compiled code
Library: the location where the packages are stored
If there is a particular functionality that you require, you can download the package from the appropriate site, and it will be stored in your library. To use the package, use the command library() to load the package in the current R session. Then call the appropriate package functions.
install.packages("package_name") – Install the package from the CRAN repository
library("package_name") – Load the package in the current R session
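The install-then-load workflow can be sketched as follows. This is a minimal sketch, not a required pattern: the package name "stats" is only a stand-in chosen because it ships with R, so the install branch is not expected to run. Replace it with the package you actually need.

```r
# Name of the package to use; "stats" ships with R and is only a stand-in.
pkg <- "stats"

# Install from CRAN only when the package is not already in the library.
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Load the package into the current session.
# character.only = TRUE tells library() that pkg holds the name as a string.
library(pkg, character.only = TRUE)

# Call a function from the loaded package.
median(c(1, 3, 10))
```

Installing is needed once per computer, while library() has to be called again in every new R session.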
First, R can function just like a calculator
2 + 2 # console contains: [1] 4
## [1] 4
2 - 2 # [1] 0
## [1] 0
2 * 2
## [1] 4
2 / 2
## [1] 1
3 %% 2
## [1] 1
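The last operator above, %%, returns the remainder of a division. A few related operators are sketched here for comparison; the numbers are arbitrary examples.

```r
7 %/% 2       # integer division: how many whole times 2 fits into 7
7 %% 2        # modulo: the remainder left over after that division
2 ^ 3         # exponentiation: 2 raised to the power of 3
(2 + 3) * 4   # parentheses control the order of operations
```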
R is open-source software whose packages are developed by many individuals around the world.
There are some coordinated efforts (e.g., tidyverse) but, in general, distributed development means that uniform conventions are often not followed concerning function names, arguments, and documentation.
This means that there are several ways to “code” in R and get to the same output.
Objects hold information: numbers, text, images… Each object has a name, and we can assign content to an object using <-. You can also use =, but the arrow is generally preferred.
Let’s create an object storing the number 2
object1 <- 2
#Now, let's create another object storing the number 3
object2 <- 3
See what happens when we sum them
object1 + object2
## [1] 5
A type of object that stores multiple pieces of information is called a vector.
c is the function that we use to combine multiple values into one object.
A function is a command that takes an object and performs an operation on it.
We can create an object called ob1 storing the following values: 1, 3, 4, 5, 5
ob1 <- c(1, 3, 4, 5, 5)
ob1 # To inspect an object, you can just type its name.
## [1] 1 3 4 5 5
I strongly encourage you to always inspect your objects, vectors, matrices… Checking on your objects helps catch mistakes early.
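To illustrate what it means for a function to take an object and perform an operation, here is a short sketch that applies a few base R functions to the vector ob1 defined above.

```r
ob1 <- c(1, 3, 4, 5, 5)

length(ob1)   # number of values stored in the vector
sum(ob1)      # total of all values
ob1 * 2       # arithmetic is applied to every element in turn
```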
Good names
- healthcare
- health_tidy
- crime_data
OK names
- data.1 (dot)
- mydata (very generic)
Bad names
- myfirstdata_Oct202021 (too long)
- my_very_first_ObjectInR (too complex)
Very bad names
- blaaarrgggg_10202021 (meaningless)
We can store other types of values in a vector, not just numbers.
Strings (or characters) are pieces of text information, which are written inside quotation marks " ". In Excel, this is the same as cells formatted as “Text”.
week_days <- c("monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday")
week_days
## [1] "monday" "tuesday" "wednesday" "thursday" "friday" "saturday"
## [7] "sunday"
greeting <- "hello"
greeting
## [1] "hello"
When interpreting this, it’s important to recognise that the quote marks here aren’t part of the string itself. They’re just something that we use to make sure that R knows to treat the characters that they enclose as a piece of text data, known as a character string.
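One way to see that the quote marks are not part of the string is to count its characters, or to print it with cat(). A small sketch using the greeting object from above:

```r
greeting <- "hello"

nchar(greeting)       # counts the characters in the string; the quote marks are not counted
print(greeting)       # print() shows quote marks to signal that this is text
cat(greeting, "\n")   # cat() writes the raw characters, without quote marks
```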
To find out the type of value stored in an object, we can use the command class.
class(week_days)
## [1] "character"
Create two objects, called “names” and “yob”:
What is the class of each vector?
names <- c("Victor", "Vicky", "Victoria", "Vinny", "Val")
yob <- c(1994, 1987, 1989, 1985, 1993)
class(names)
## [1] "character"
class(yob)
## [1] "numeric"
We now have two vectors containing multiple values. One vector is character and the other is numeric.
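A vector holds values of a single class. If numbers and text are mixed in one call to c(), R converts everything to character. The sketch below uses a hypothetical object called mixed to show this.

```r
# "mixed" is a made-up example combining a number, a string and a logical value.
mixed <- c(1994, "Victor", TRUE)
mixed
class(mixed)

# The number is now stored as text, so it has to be converted back
# with as.numeric() before doing arithmetic.
as.numeric(mixed[1]) + 1
```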
Make sure to leave a space after each comma. Do NOT smash all of your code together; make it easy to read!
Two or more vectors can be combined into a matrix.
In this case, we use either rbind or cbind to bind our data by row or column, respectively.
cbind(names, yob)
## names yob
## [1,] "Victor" "1994"
## [2,] "Vicky" "1987"
## [3,] "Victoria" "1989"
## [4,] "Vinny" "1985"
## [5,] "Val" "1993"
rbind(names, yob)
## [,1] [,2] [,3] [,4] [,5]
## names "Victor" "Vicky" "Victoria" "Vinny" "Val"
## yob "1994" "1987" "1989" "1985" "1993"
What is the difference between the two commands?
Let’s use cbind to create a matrix to ‘play’ with. We call it people.
people <- cbind(names, yob)
class(people)
## [1] "matrix" "array"
Matrices are made of rows and columns. You can check how many rows or columns a matrix has at any time using the following commands:
nrow(people) # number of rows
## [1] 5
ncol(people) # number of columns
## [1] 2
dim(people) # dimensions (rows and columns)
## [1] 5 2
Important: R always stores information in the row-column format
These commands should be the first ones you use when opening a dataset in R: check that it loaded correctly and that its size matches your expectations (e.g., if a dataset is at the state level, it should have ~50 observations).
As in Battleship, we can identify a piece of information in a matrix by its row-column position.
# Position: row 2, column 2
people[2, 2]
## yob
## "1987"
# Position: row 5, column 1
people[5, 1]
## names
## "Val"
What happens if you omit the column or the row number? E.g., people[2, ]?
people
## names yob
## [1,] "Victor" "1994"
## [2,] "Vicky" "1987"
## [3,] "Victoria" "1989"
## [4,] "Vinny" "1985"
## [5,] "Val" "1993"
# No column is specified
people[2, ]
## names yob
## "Vicky" "1987"
# No row is specified
people[, 2]
## [1] "1994" "1987" "1989" "1985" "1993"
We can use [] to subset an object
people[, 2]
## [1] "1994" "1987" "1989" "1985" "1993"
We can save the new subset in a new object
people_subset <- people[, 2] # takes the 2nd column and puts it into people_subset
people_subset
## [1] "1994" "1987" "1989" "1985" "1993"
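Square brackets accept more than a single position. The sketch below rebuilds the people matrix so that it runs on its own, then shows a few other common ways to subset it.

```r
names <- c("Victor", "Vicky", "Victoria", "Vinny", "Val")
yob <- c(1994, 1987, 1989, 1985, 1993)
people <- cbind(names, yob)

people[c(1, 3), ]    # rows 1 and 3, all columns
people[2:4, "yob"]   # rows 2 to 4 of the column named "yob"
people[-1, ]         # a negative index drops that row and keeps the rest
```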
We will rarely work with matrices. In most cases, we will use dataframes. A dataframe is a dataset in R. Many objects, including matrices, can be converted into a dataframe.
people_df <- as.data.frame(people)
class(people_df)
## [1] "data.frame"
class(people)
## [1] "matrix" "array"
people_df
## names yob
## 1 Victor 1994
## 2 Vicky 1987
## 3 Victoria 1989
## 4 Vinny 1985
## 5 Val 1993
Note that we can use the same commands as before to check the dimension of the dataframe
ncol(people_df)
## [1] 2
nrow(people_df)
## [1] 5
dim(people_df)
## [1] 5 2
Dataframes have special properties. For instance, they have column and row names
colnames(people_df)
## [1] "names" "yob"
rownames(people_df)
## [1] "1" "2" "3" "4" "5"
# Data Frame
Numbers <- c(1, 2, 3, 4, 5)
Alphabets <- c("A", "B", "C", "D", "E")
Boolean <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
Float <- c(1.1, 2.2, 3.3, 4.4, 5.5)
df <- data.frame(Numbers, Alphabets, Boolean, Float)
df
## Numbers Alphabets Boolean Float
## 1 1 A TRUE 1.1
## 2 2 B FALSE 2.2
## 3 3 C TRUE 3.3
## 4 4 D TRUE 4.4
## 5 5 E FALSE 5.5
# Analyzing a DataFrame
dim(df)
## [1] 5 4
ncol(df)
## [1] 4
nrow(df)
## [1] 5
str(df)
## 'data.frame': 5 obs. of 4 variables:
## $ Numbers : num 1 2 3 4 5
## $ Alphabets: chr "A" "B" "C" "D" ...
## $ Boolean : logi TRUE FALSE TRUE TRUE FALSE
## $ Float : num 1.1 2.2 3.3 4.4 5.5
names(df)
## [1] "Numbers" "Alphabets" "Boolean" "Float"
colnames(df)
## [1] "Numbers" "Alphabets" "Boolean" "Float"
rownames(df)
## [1] "1" "2" "3" "4" "5"
head(df, 2)
## Numbers Alphabets Boolean Float
## 1 1 A TRUE 1.1
## 2 2 B FALSE 2.2
tail(df, 2)
## Numbers Alphabets Boolean Float
## 4 4 D TRUE 4.4
## 5 5 E FALSE 5.5
Since columns have names, we can call each column using the symbol $
people_df$names
## [1] "Victor" "Vicky" "Victoria" "Vinny" "Val"
people_df$yob
## [1] "1994" "1987" "1989" "1985" "1993"
Important: You always need to call both the dataset and the column name
datasetName$ColumnName
We can perform any operation on columns. Let’s try with checking the class of those columns.
class(people_df$names)
## [1] "character"
class(people_df$yob)
## [1] "character"
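The yob column is character because people_df was converted from a matrix, and a matrix can hold only one class of value, so cbind turned the years into text. As a sketch of an alternative, building the dataframe directly from the two vectors keeps each column's original class. The name people_direct is made up for this illustration.

```r
names <- c("Victor", "Vicky", "Victoria", "Vinny", "Val")
yob <- c(1994, 1987, 1989, 1985, 1993)

# Build the dataframe straight from the vectors, skipping the matrix step.
# stringsAsFactors = FALSE keeps text as character in older versions of R too.
people_direct <- data.frame(names, yob, stringsAsFactors = FALSE)

class(people_direct$names)
class(people_direct$yob)
```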
Factors are a special class of vectors for categorical variables. Factors are composed of levels (a.k.a. categories). R uses factors to represent categorical variables that have a known set of possible values.
You can convert a vector (or a column or row) from one class to another, for example with as.numeric, as.character, or as.factor.
Note: When you recode a variable, it’s good practice to save it as a new one. That way, if you make a mistake, the original data still exists.
- It allows you to check your work
- You might need to go back to the original variable
- If you make a mistake, you don’t have to load your dataset again
- You can always clean your dataframe at the end (e.g., keep only relevant columns)
When as.numeric is applied to a factor, the new vector stores the number (position) of each level, not the content of the level.
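The behaviour of as.numeric on a factor is easiest to see with an example. The sketch below uses a hypothetical factor, yob_factor, built from the years of birth used earlier.

```r
# A factor whose levels happen to look like numbers.
yob_factor <- factor(c("1994", "1987", "1989", "1985", "1993"))

levels(yob_factor)       # the categories, sorted: "1985" "1987" "1989" "1993" "1994"
as.numeric(yob_factor)   # the position of each value among the levels: 5 2 3 1 4

# Convert to character first to recover the original numbers,
# and save the result as a new object.
yob_numeric <- as.numeric(as.character(yob_factor))
yob_numeric
```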
| Class | Description |
|---|---|
| character | It stores text information. |
| numeric | It stores numbers (continuous variables). |
| factor | It stores categorical variables. |
| levels | Not a class itself: the set of categories of a factor. |
You can manipulate columns in the same way you would with vectors (mostly).
For instance, we can create a new column called age where we calculate each individual's age in the year 2021.
people_df$age <- 2021 - as.numeric(people_df$yob)
people_df$age
## [1] 27 34 32 36 28
You can also decide to calculate the age in terms of months instead of years
people_df_agemonths <- people_df$age * 12
people_df_agemonths
## [1] 324 408 384 432 336
In sum, you can easily perform operations with your columns.
When learning about a new function, you generally want to retrieve three pieces of information:
- Description: what the function does
- Usage: how you are expected to write the function
- Arguments: what each part of the function does
All help pages also contain an “examples” section where you can see how the function is used in practice.
Even when you discover new functions from other sources, you should check out the help page to understand all possible options provided by the arguments.
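As a sketch of how to reach a help page and act on what it says, the commands below open the documentation for mean and then use one of the arguments documented there (na.rm). The input vector is an arbitrary example.

```r
?mean            # opens the help page for mean()
help("mean")     # the same request, written as a function call
args(sd)         # prints the arguments a function accepts
example(mean)    # runs the code from the "Examples" section of the help page

# The Arguments section of ?mean documents na.rm, which drops missing values.
mean(c(1, NA, 3))                 # NA, because one value is missing
mean(c(1, NA, 3), na.rm = TRUE)   # 2
```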
Let’s use some descriptive statistics functions to check out the variable age.
table(people_df$age) # Frequencies
mean(people_df$age) # Mean value
min(people_df$age) # Minimum value
max(people_df$age) # Maximum value
sd(people_df$age) # Standard deviation
median(people_df$age) # Median value
Now calculate the mean age for those called “Val”. Note that equal is represented by the symbol == when used for logical indexing.
mean(people_df$age[people_df$names == "Val"])
## [1] 28
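Logical indexing works in two steps, which the sketch below separates. It rebuilds a small version of people_df so that it runs on its own; the helper name is_val is made up for this illustration.

```r
people_df <- data.frame(
  names = c("Victor", "Vicky", "Victoria", "Vinny", "Val"),
  age = c(27, 34, 32, 36, 28)
)

# Step 1: the comparison returns one TRUE or FALSE per row.
is_val <- people_df$names == "Val"
is_val

# Step 2: placing that logical vector inside [] keeps only the TRUE positions.
people_df$age[is_val]

# Other comparisons work the same way, e.g. the mean age of those older than 30.
mean(people_df$age[people_df$age > 30])
```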
The terminology used in this class gives you the basics to talk about R concepts and elements (vectors, objects, functions…).
The ‘dollar sign’ syntax is so called because it uses $ to connect a dataframe name with a column name.
Dataframes are a very common way to work with data in R. Some functions do not work with tibbles (the tidyverse data frame format), so you’ll likely come back to dataframes at some point (e.g., in regression analysis classes).
The tidyverse is often more convenient for data wrangling and visualization.
# Factor
# install.packages("datasets") is not needed: the datasets package ships with R
library(datasets)
data("mtcars")
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
str(mtcars) # see the structure of the data
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
tab <- table(mtcars$cyl)
tab
##
## 4 6 8
## 11 7 14
prop.table(tab)
##
## 4 6 8
## 0.34375 0.21875 0.43750
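Two small follow-ups to the frequency table, offered as a sketch: expressing the proportions as percentages, and storing cyl as a factor so that its categories are explicit. The object names pct and cyl_factor are made up for this illustration.

```r
tab <- table(mtcars$cyl)

# Multiply the proportions by 100 to read them as percentages.
pct <- 100 * prop.table(tab)
pct

# Treating cyl as a factor makes its categories (levels) explicit.
cyl_factor <- factor(mtcars$cyl)
levels(cyl_factor)
```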
summary(mtcars[, c("cyl", "vs", "am", "gear", "carb")])
## cyl vs am gear
## Min. :4.000 Min. :0.0000 Min. :0.0000 Min. :3.000
## 1st Qu.:4.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000
## Median :6.000 Median :0.0000 Median :0.0000 Median :4.000
## Mean :6.188 Mean :0.4375 Mean :0.4062 Mean :3.688
## 3rd Qu.:8.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000
## Max. :8.000 Max. :1.0000 Max. :1.0000 Max. :5.000
## carb
## Min. :1.000
## 1st Qu.:2.000
## Median :2.000
## Mean :2.812
## 3rd Qu.:4.000
## Max. :8.000
# Numeric
mean(mtcars$mpg)
## [1] 20.09062
median(mtcars$mpg)
## [1] 19.2
min(mtcars$mpg)
## [1] 10.4
max(mtcars$mpg)
## [1] 33.9
range(mtcars$mpg)
## [1] 10.4 33.9
quantile(mtcars$mpg)
## 0% 25% 50% 75% 100%
## 10.400 15.425 19.200 22.800 33.900
summary(mtcars$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.43 19.20 20.09 22.80 33.90
# Two Numeric
data(mtcars)
View(mtcars) # opens the data in a spreadsheet-style viewer
cor(mtcars$mpg, mtcars$hp)
## [1] -0.7761684
cor(mtcars$disp, mtcars$hp)
## [1] 0.7909486
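When more than two numeric variables are of interest, cor can be given several columns at once and returns a correlation matrix. A minimal sketch, with the object names vars and cor_matrix chosen only for this illustration:

```r
# Select the numeric columns to compare.
vars <- mtcars[, c("mpg", "hp", "disp")]

# Each cell holds the correlation between the row variable and the column variable.
cor_matrix <- cor(vars)
round(cor_matrix, 2)
```

The matrix is symmetric, and each variable correlates perfectly with itself, so the diagonal is 1.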
# One Factor & One Numeric
library(dplyr) # provides %>%, group_by() and summarise()
mtcars %>% group_by(cyl) %>% summarise(avg = mean(mpg),
                                       median = median(mpg),
                                       std = sd(mpg))
## # A tibble: 3 x 4
## cyl avg median std
## <dbl> <dbl> <dbl> <dbl>
## 1 4 26.7 26 4.51
## 2 6 19.7 19.7 1.45
## 3 8 15.1 15.2 2.56
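The same kind of grouped summary can be sketched in base R, without the tidyverse. This is one possible alternative, not the only one; the object name mpg_by_cyl is made up for this illustration.

```r
# tapply() splits the first vector by the groups in the second
# and applies the function to each group.
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
mpg_by_cyl

# aggregate() does the same with a formula and returns a dataframe.
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
```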
In R, functions accept objects as inputs, manipulate the inputs in some way, and return some output. For example, the function mean(object) returns the mean of an object (assuming the object is a vector of numbers). The function c() is called the combine function; it combines several numbers (or words) into a new object.
## 'rain' contains actual rainfall data for Boulder, CO (2000-2011)
rain <- c(16, 18, 14, 22, 27, 17, 19, 17, 17, 22, 20, 22)
The object “rain” contains data, so we can calculate some descriptive statistics:
mean(rain) #returns the average rainfall from 2000-2011 in Boulder, CO
## [1] 19.25
sum(rain) #returns the total amount of rainfall during the study period
## [1] 231
length(rain) #returns the length of the vector, i.e. the number of years of data
## [1] 12
We can also calculate deviations from the mean for each year:
rain - mean(rain) #Deviations from the mean; negative values indicate below average rainfall.
## [1] -3.25 -1.25 -5.25 2.75 7.75 -2.25 -0.25 -2.25 -2.25 2.75 0.75 2.75
We can use the assignment operator to save these deviations from the mean as a new object:
rainDeviations <- rain - mean(rain)
rainDeviations^2 #Squared deviations from the mean
## [1] 10.5625 1.5625 27.5625 7.5625 60.0625 5.0625 0.0625 5.0625 5.0625
## [10] 7.5625 0.5625 7.5625
sqrt(rain) #Square root of rainfall values
## [1] 4.000000 4.242641 3.741657 4.690416 5.196152 4.123106 4.358899 4.123106
## [9] 4.123106 4.690416 4.472136 4.690416
Conceptually, the standard deviation is like the average deviation from the mean. However, the average deviation from the mean is always zero. Thus, we calculate the standard deviation as:
\[s = \sqrt{s^{2}} = \sqrt{\frac{SS}{N - 1}} = \sqrt{\frac{\sum (x_{i} - \bar{x})^{2}}{N - 1}}\] where \(s^2\) is the variance; SS is the sum of squared deviations from the mean; N is the number of observations; \(x_i\) is the \(i^{th}\) score in a group; and \(\bar{x}\) is the mean of the group.
The standard deviation is close to the Root Mean Square (RMS) of the deviations from the mean; the only difference is that the sum of squares is divided by N - 1 rather than N. The above formula can be broken down into a series of simple steps:
1. Calculate the deviations from the mean (see above R code).
2. Square the deviations from the mean and save the squared deviations as a new R object (use the “<-” assignment operator).
3. Add up these squared deviations and divide the sum by N - 1. Again, save the result as an object.
4. Finally, take the square root of the result from the prior step.
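The four steps can be sketched in R with the rain data and checked against the built-in var and sd functions. The object names squaredDeviations, rainVariance and rainSD are made up for this illustration, and the sum of squares is divided by N - 1, following the formula for s.

```r
rain <- c(16, 18, 14, 22, 27, 17, 19, 17, 17, 22, 20, 22)

# Step 1: deviations from the mean
rainDeviations <- rain - mean(rain)

# Step 2: squared deviations
squaredDeviations <- rainDeviations^2

# Step 3: sum of squared deviations divided by N - 1 (the sample variance)
rainVariance <- sum(squaredDeviations) / (length(rain) - 1)

# Step 4: square root of the variance
rainSD <- sqrt(rainVariance)
rainSD

# Check the hand calculation against R's built-in functions.
all.equal(rainVariance, var(rain))
all.equal(rainSD, sd(rain))
```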