Package - A collection of R functions, data, and compiled code

Library - The location where packages are stored

If there is a particular functionality that you require, you can download the package from the appropriate site, and it will be stored in your library. To use the package, use the command library() to load the package in the current R session. Then call the appropriate package functions.

install.packages("package_name") – Install the package from the CRAN repository

library("package_name") – Load the package in the current R session
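As a small illustration of the package/library distinction, the sketch below asks R where its library folders are and then loads a package. It uses stats, which ships with R, so nothing needs to be downloaded; the name lib_paths is only an illustrative choice.

```r
# Where R looks for installed packages (the "library" locations)
lib_paths <- .libPaths()
lib_paths

# 'stats' ships with every R installation, so install.packages() is not needed here
library(stats)

# Once a package is loaded, its functions can be called directly
median(c(1, 3, 5))
```

For a package that is not yet installed, install.packages("package_name") would come first, and only needs to be run once per computer.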

Intro to R - Objects, Vectors, etc.

First, R can function just like a calculator

2 + 2      # console contains: [1] 4
## [1] 4
2 - 2      # [1] 0
## [1] 0
2 * 2
## [1] 4
2 / 2
## [1] 1
3 %% 2
## [1] 1
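The %% operator above may be unfamiliar: it returns the remainder of a division. A few related operators are sketched below; the object names are illustrative only.

```r
remainder <- 7 %% 3     # remainder after dividing 7 by 3
whole     <- 7 %/% 3    # whole-number part of 7 divided by 3
power     <- 2 ^ 3      # exponentiation: 2 raised to the power of 3
grouped   <- (2 + 3) * 4  # parentheses control the order of operations

remainder
whole
power
grouped
```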

R is open-source software whose packages are developed by many individuals around the world.

There are some coordinated efforts (e.g., tidyverse) but, in general, distributed development means that uniform conventions are often not followed concerning function names, arguments, and documentation.

This means that there are several ways to “code” in R and get to the same output.

Objects

Objects hold information: numbers, text, images, and so on. Each object has a name, and we can assign content to an object using <-. You can also use =, but the arrow is generally preferred.

Let’s create an object storing the number 2

object1 <- 2

Now, let’s create another object storing the number 3

object2 <- 3

See what happens when we sum them

object1 + object2
## [1] 5

Vectors

A type of object that stores multiple pieces of information is called a vector.

c is the function that we use to combine multiple values into one object.

A function is a command that takes an object and performs an operation on it.

We can create an object called ob1 storing the following values: 1, 3, 4, 5, 5

ob1 <- c(1, 3, 4, 5, 5)
ob1        # To inspect an object, you can just type its name.
## [1] 1 3 4 5 5

I strongly encourage you to always inspect your objects, vectors, matrices… Checking on your objects helps catch mistakes early.
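Beyond typing an object's name, a few base R functions give a quick summary of what it holds. This is a minimal sketch using the ob1 vector from above:

```r
ob1 <- c(1, 3, 4, 5, 5)

length(ob1)   # how many values the vector holds
class(ob1)    # the type of values stored
str(ob1)      # a compact summary of the structure
ob1[2]        # the second element only
ob1 * 2       # an operation is applied to every element at once
```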

Object names

  • Objects can have almost any name, but you cannot use spaces in an object name (e.g., “Object A” -> “ObjectA”)
  • You can use a dot or an underscore to separate words (underscore is preferred)
  • R is case sensitive
  • Always give your objects a “good name”
    • Intuitive / meaningful
    • Concise / short
    • Easy to remember
    • Unique (e.g., do not give an object the same name as a variable or another object)

Good names
- healthcare
- health_tidy
- crime_data

OK names
- data.1 (dot)
- mydata (very generic)

Bad names
- myfirstdata_Oct202021 (too long)
- my_very_first_ObjectInR (too complex)

Very bad names
- blaaarrgggg_10202021 (meaningless)
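Case sensitivity is one of the naming rules above that is easy to trip over. The sketch below uses made-up values and assumes that no object called Crime_Data exists in your session:

```r
crime_data <- c(10, 12, 9)   # hypothetical values, for illustration only

exists("crime_data")   # TRUE: this name was just created
exists("Crime_Data")   # FALSE: R treats upper and lower case as different names
```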

Strings

We can store other types of values in a vector, not just numbers.

Strings (or characters) are pieces of text, written inside quotation marks " ". In Excel, the equivalent is a cell formatted as “Text”.

week_days <- c("monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday")

week_days
## [1] "monday"    "tuesday"   "wednesday" "thursday"  "friday"    "saturday" 
## [7] "sunday"
greeting <- "hello"

greeting
## [1] "hello"

When interpreting this, it’s important to recognise that the quote marks here aren’t part of the string itself. They’re just something that we use to make sure that R knows to treat the characters that they enclose as a piece of text data, known as a character string.
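One way to convince yourself that the quotes are not part of the string is to count its characters, or to print it with cat(), which writes the raw text. A minimal sketch:

```r
greeting <- "hello"

nchar(greeting)       # number of characters in the string; the quotes are not counted
print(greeting)       # print() displays the string with quotes
cat(greeting, "\n")   # cat() writes the text itself, without quotes
```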

Class

To find out the type of value stored in an object, we can use the command class.

class(week_days)
## [1] "character"

Example to identify class of vectors

Create two objects, called “names” and “yob”:

  • names should be a character vector containing the following names: Victor, Vicky, Victoria, Vinny, and Val
  • yob should be a numeric vector containing the following years of birth: 1994, 1987, 1989, 1985, 1993

What is the class of each vector?

Solution

names <- c("Victor", "Vicky", "Victoria", "Vinny", "Val")

yob <- c(1994, 1987, 1989, 1985, 1993)


class(names)
## [1] "character"
class(yob) 
## [1] "numeric"

We now have two vectors containing multiple values. One vector is character and the other is numeric.

Make sure to leave a space after each comma. Do NOT smash all of your code together; make it easy to read!

Matrices

Two or more vectors can be combined into a matrix.

In this case, we use either rbind or cbind to bind our data by row or column, respectively.

cbind(names, yob)
##      names      yob   
## [1,] "Victor"   "1994"
## [2,] "Vicky"    "1987"
## [3,] "Victoria" "1989"
## [4,] "Vinny"    "1985"
## [5,] "Val"      "1993"
rbind(names, yob)
##       [,1]     [,2]    [,3]       [,4]    [,5]  
## names "Victor" "Vicky" "Victoria" "Vinny" "Val" 
## yob   "1994"   "1987"  "1989"     "1985"  "1993"

What is the difference between the two commands?
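The sketch below compares the two results directly. It also shows a side effect visible in the output above: a matrix can hold only one type of value, so the numeric years are converted to text. The names by_col and by_row are illustrative only.

```r
names <- c("Victor", "Vicky", "Victoria", "Vinny", "Val")
yob <- c(1994, 1987, 1989, 1985, 1993)

by_col <- cbind(names, yob)   # each vector becomes a column
by_row <- rbind(names, yob)   # each vector becomes a row

dim(by_col)   # 5 rows, 2 columns
dim(by_row)   # 2 rows, 5 columns

class(yob)               # numeric before binding
class(by_col[, "yob"])   # character after binding with a character vector
```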

Let’s use cbind to create a matrix to ‘play’ with. We call it people.

people <- cbind(names, yob)

class(people)
## [1] "matrix" "array"

Matrix dimensions

Matrices are made of rows and columns. You can check how many rows or columns at any time using the following commands:

nrow(people) # number of rows
## [1] 5
ncol(people) # number of columns
## [1] 2
dim(people) # dimensions (rows and columns)
## [1] 5 2

Important: R always stores information in the row-column format

These commands should be the first ones you use when opening a dataset in R: check that it loaded correctly and that its size matches your expectations (e.g., if a dataset is at the state level, it should have ~50 observations).

Matrix positions

Like in battleship, we can identify a piece of information contained in a matrix by its row-column position

# Position: row 2, column 2
people[2, 2]
##    yob 
## "1987"
# Position: row 5, column 1
people[5, 1]
## names 
## "Val"

Numerical indexing

What happens if you omit the column or the row number? E.g., people[2, ]?

people
##      names      yob   
## [1,] "Victor"   "1994"
## [2,] "Vicky"    "1987"
## [3,] "Victoria" "1989"
## [4,] "Vinny"    "1985"
## [5,] "Val"      "1993"
# No column is specified
people[2, ]
##   names     yob 
## "Vicky"  "1987"
# No row is specified
people[, 2]
## [1] "1994" "1987" "1989" "1985" "1993"

We can use [] to subset an object

people[, 2]
## [1] "1994" "1987" "1989" "1985" "1993"

We can save the new subset in a new object

people_subset <- people[, 2]  # takes the 2nd column and puts it into people_subset
 
people_subset
## [1] "1994" "1987" "1989" "1985" "1993"
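The same bracket notation also accepts several positions at once, or a column name instead of a number. A short sketch, rebuilding the people matrix so that it runs on its own:

```r
names <- c("Victor", "Vicky", "Victoria", "Vinny", "Val")
yob <- c(1994, 1987, 1989, 1985, 1993)
people <- cbind(names, yob)

people[c(1, 3), ]   # rows 1 and 3, all columns
people[2:4, 1]      # rows 2 to 4 of column 1
people[, "yob"]     # a column selected by its name
```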

Dataframes

We will rarely work with matrices. In most cases, we will use dataframes. A dataframe is the standard way to hold a dataset in R. We can convert many objects, such as our matrix, into a dataframe.

people_df <- as.data.frame(people)

class(people_df) 
## [1] "data.frame"
class(people)
## [1] "matrix" "array"
people_df
##      names  yob
## 1   Victor 1994
## 2    Vicky 1987
## 3 Victoria 1989
## 4    Vinny 1985
## 5      Val 1993

Note that we can use the same commands as before to check the dimension of the dataframe

ncol(people_df)
## [1] 2
nrow(people_df)
## [1] 5
dim(people_df)
## [1] 5 2

Dataframes have special properties. For instance, they have column and row names

colnames(people_df)
## [1] "names" "yob"
rownames(people_df)
## [1] "1" "2" "3" "4" "5"
# Data Frame
Numbers <- c(1, 2, 3, 4, 5) 
Alphabets <- c("A", "B", "C", "D", "E") 
Boolean <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
Float <- c(1.1, 2.2, 3.3, 4.4, 5.5)
df <- data.frame(Numbers, Alphabets, Boolean, Float)
df
##   Numbers Alphabets Boolean Float
## 1       1         A    TRUE   1.1
## 2       2         B   FALSE   2.2
## 3       3         C    TRUE   3.3
## 4       4         D    TRUE   4.4
## 5       5         E   FALSE   5.5
# Analyzing a DataFrame

dim(df)
## [1] 5 4
ncol(df)
## [1] 4
nrow(df)
## [1] 5
str(df)
## 'data.frame':    5 obs. of  4 variables:
##  $ Numbers  : num  1 2 3 4 5
##  $ Alphabets: chr  "A" "B" "C" "D" ...
##  $ Boolean  : logi  TRUE FALSE TRUE TRUE FALSE
##  $ Float    : num  1.1 2.2 3.3 4.4 5.5
names(df)
## [1] "Numbers"   "Alphabets" "Boolean"   "Float"
colnames(df)
## [1] "Numbers"   "Alphabets" "Boolean"   "Float"
rownames(df)
## [1] "1" "2" "3" "4" "5"
head(df, 2)
##   Numbers Alphabets Boolean Float
## 1       1         A    TRUE   1.1
## 2       2         B   FALSE   2.2
tail(df, 2)
##   Numbers Alphabets Boolean Float
## 4       4         D    TRUE   4.4
## 5       5         E   FALSE   5.5

Columns

Since columns have names, we can call each column using the symbol $

people_df$names
## [1] "Victor"   "Vicky"    "Victoria" "Vinny"    "Val"
people_df$yob
## [1] "1994" "1987" "1989" "1985" "1993"

Important: You always need to call both the dataset and the column name

datasetName$ColumnName

We can perform any operation on columns. Let’s try with checking the class of those columns.

class(people_df$names)
## [1] "character"
class(people_df$yob)
## [1] "character"
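The yob column is character here because people_df was built from a matrix, which had already turned the years into text. As a sketch of an alternative, building the dataframe directly from the two vectors keeps each column's own class. The name people_direct is hypothetical, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
names <- c("Victor", "Vicky", "Victoria", "Vinny", "Val")
yob <- c(1994, 1987, 1989, 1985, 1993)

# Build the dataframe straight from the vectors instead of from a matrix
people_direct <- data.frame(names, yob, stringsAsFactors = FALSE)

class(people_direct$names)   # character
class(people_direct$yob)     # numeric: the years were never converted to text
```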

What is a factor?

A factor is a special class of vector for categorical variables. Factors are composed of levels (a.k.a. categories). R uses factors to represent categorical variables that have a known set of possible values.

  • factor: dog, cat, cat, dog, bird, dog
  • levels: dog, cat, bird
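The example above can be reproduced with factor(). One detail worth knowing: unless told otherwise, R sorts the levels alphabetically, so the dog, cat, bird order has to be requested with the levels argument. The object names below are illustrative.

```r
pet_values <- c("dog", "cat", "cat", "dog", "bird", "dog")

pets <- factor(pet_values)
levels(pets)   # default order is alphabetical: bird, cat, dog
table(pets)    # how many observations fall in each level

# Request a specific level order
pets_ordered <- factor(pet_values, levels = c("dog", "cat", "bird"))
levels(pets_ordered)
```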

Converting class

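You can convert a vector (or column or row) from one class to another with the following functions. If a value cannot be converted (e.g., text that is not a number converted to numeric), R returns NA with a warning.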

  • as.numeric
  • as.character
  • as.factor
  • as.matrix
  • as.data.frame

Note: When you recode a variable, it’s good practice to save it as a new one. That way if you make a mistake, the original data still exists.

  • It allows you to check your work

  • You might need to go back to the original variable

  • If you make a mistake, you don’t have to upload your dataset again

  • You can always clean your dataframe at the end (e.g., keep only relevant columns)
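As a sketch of this practice, the converted years below are stored in a new column while the original column is left untouched. The column name yob_num is a hypothetical choice, and the dataframe is rebuilt here so the example runs on its own.

```r
people_df <- data.frame(
  names = c("Victor", "Vicky", "Victoria", "Vinny", "Val"),
  yob = c("1994", "1987", "1989", "1985", "1993"),
  stringsAsFactors = FALSE
)

# Save the recoded variable under a new name
people_df$yob_num <- as.numeric(people_df$yob)

class(people_df$yob)       # the original column is still character
class(people_df$yob_num)   # the new column is numeric
```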

From factor to numeric

When as.numeric is applied to a factor, the new vector stores the numbers of the levels, not their content.

  • factor: dog, cat, cat, dog, bird, dog
  • levels: dog, cat, bird
  • level numbers: 1, 2, 3
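This behaviour matters most when a factor contains numbers stored as labels. The sketch below first reproduces the pet example, then shows a common workaround (converting to character first) on a small hypothetical factor of years.

```r
pets <- factor(c("dog", "cat", "cat", "dog", "bird", "dog"),
               levels = c("dog", "cat", "bird"))
as.numeric(pets)     # level numbers, not the labels
as.character(pets)   # the labels themselves

# A factor whose labels look like numbers (hypothetical example)
yob_factor <- factor(c("1994", "1987", "1989"))
as.numeric(yob_factor)                 # level numbers, not the years
as.numeric(as.character(yob_factor))   # the actual years
```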

Summary of Class types

Class      Description
character  Stores text information
numeric    Stores numbers (continuous variables)
factor     Stores categorical variables
levels     The categories of a factor (a property of the factor, not a class of its own)

Operations with columns

You can manipulate columns in the same way you would with vectors (mostly).

For instance, we can create a new column called age where we calculate the age of each individual in the current year.

people_df$age <- 2021 - as.numeric(people_df$yob)

people_df$age
## [1] 27 34 32 36 28

You can also decide to calculate the age in terms of months instead of years

people_df_agemonths <- people_df$age * 12
people_df_agemonths
## [1] 324 408 384 432 336

In sum, you can easily perform operations with your columns.

Functions

When learning about a new function, you generally want to retrieve three pieces of information:

  • Description: what the function does

  • Usage: how you are expected to write the function

  • Arguments: what each part of the function does

All help pages also contain an “examples” section where you can see how the function is used in practice.

Even when you discover new functions from other sources, you should check out the help page to understand all possible options provided by the arguments.
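For reference, these are the usual base R ways to reach a help page and its pieces, shown here for mean() and sd(). How the page is displayed depends on your setup (RStudio shows it in the Help pane).

```r
?mean           # opens the help page for mean()
help("sd")      # the same request written as a function call
args(sd)        # prints the arguments a function accepts
example(mean)   # runs the code from the "Examples" section of the help page
```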

Let’s use some descriptive statistics functions to check out the variable age.

table(people_df$age) # Frequencies

mean(people_df$age) # Mean value

min(people_df$age) # Minimum value

max(people_df$age) # Maximum value

sd(people_df$age) # Standard deviation

median(people_df$age) # Median value

Now calculate the mean age for those called “Val”. Note that “is equal to” is written == (two equals signs) when used for logical indexing.

mean(people_df$age[people_df$names == "Val"])
## [1] 28
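To see what the logical index is doing, it can help to run the comparison on its own. This sketch rebuilds a small version of people_df with the ages computed above; is_val is an illustrative name.

```r
people_df <- data.frame(
  names = c("Victor", "Vicky", "Victoria", "Vinny", "Val"),
  age = c(27, 34, 32, 36, 28),
  stringsAsFactors = FALSE
)

# The comparison returns one TRUE/FALSE per row
is_val <- people_df$names == "Val"
is_val

# Only the positions marked TRUE are kept
people_df$age[is_val]

# Other comparisons work the same way, e.g. the mean age of those older than 30
mean(people_df$age[people_df$age > 30])
```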

Recap

The terminology used in this class gives you the basics to talk about R concepts and elements (vectors, objects, functions…)

‘dollar sign’ syntax is so called because of the use of $ to connect a dataframe name with a column name.

Dataframes are a very common way to work with data in R. Some functions do not work with tibbles (the tidyverse data frame format), so you’ll likely come back to base dataframes at some point (e.g., in regression analysis classes).

The tidyverse is often more convenient for data wrangling and visualization.

Univariate Analysis

# Factor
# The datasets package ships with R, so install.packages() is not needed
library(datasets)
data("mtcars")
mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
str(mtcars) # see the structure of the data
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
tab <- table(mtcars$cyl)
tab
## 
##  4  6  8 
## 11  7 14
prop.table(tab)
## 
##       4       6       8 
## 0.34375 0.21875 0.43750
summary(mtcars[, c("cyl", "vs", "am", "gear", "carb")])
##       cyl              vs               am              gear      
##  Min.   :4.000   Min.   :0.0000   Min.   :0.0000   Min.   :3.000  
##  1st Qu.:4.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000  
##  Median :6.000   Median :0.0000   Median :0.0000   Median :4.000  
##  Mean   :6.188   Mean   :0.4375   Mean   :0.4062   Mean   :3.688  
##  3rd Qu.:8.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000  
##  Max.   :8.000   Max.   :1.0000   Max.   :1.0000   Max.   :5.000  
##       carb      
##  Min.   :1.000  
##  1st Qu.:2.000  
##  Median :2.000  
##  Mean   :2.812  
##  3rd Qu.:4.000  
##  Max.   :8.000
# Numeric

mean(mtcars$mpg)
## [1] 20.09062
median(mtcars$mpg)
## [1] 19.2
min(mtcars$mpg)
## [1] 10.4
max(mtcars$mpg)
## [1] 33.9
range(mtcars$mpg)
## [1] 10.4 33.9
quantile(mtcars$mpg)
##     0%    25%    50%    75%   100% 
## 10.400 15.425 19.200 22.800 33.900
summary(mtcars$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90

Bivariate Analysis

# Two Numeric
data(mtcars)
View(mtcars)

cor(mtcars$mpg, mtcars$hp)
## [1] -0.7761684
cor(mtcars$disp, mtcars$hp)
## [1] 0.7909486
# One Factor & One Numeric
library(dplyr)  # provides %>%, group_by() and summarise()

mtcars %>% group_by(cyl) %>% summarise(avg = mean(mpg),
                                       median = median(mpg),
                                       std = sd(mpg))
## # A tibble: 3 x 4
##     cyl   avg median   std
##   <dbl> <dbl>  <dbl> <dbl>
## 1     4  26.7   26    4.51
## 2     6  19.7   19.7  1.45
## 3     8  15.1   15.2  2.56
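If dplyr is not installed, a similar group summary can be obtained with base R. This is an alternative sketch using tapply() and aggregate(), not the approach used above; avg_by_cyl and sd_by_cyl are illustrative names.

```r
data(mtcars)

# tapply(values, groups, function): apply a function to mpg within each cyl group
avg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
sd_by_cyl  <- tapply(mtcars$mpg, mtcars$cyl, sd)
avg_by_cyl
sd_by_cyl

# aggregate() returns the result as a dataframe instead
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
```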

Rain Example

In R, functions accept objects as inputs, manipulate the inputs in some way, and return some output. For example, mean(object) returns the mean of an object (assuming the object is a vector of numbers). The function c() is called the combine function; it combines several numbers (or words) into a new vector.

# 'rain' contains actual rainfall data for Boulder, CO (2000-2011)
rain <- c(16, 18, 14, 22, 27, 17, 19, 17, 17, 22, 20, 22)

Now that the object “rain” contains data, we can calculate some descriptive statistics:

mean(rain) #returns the average rainfall from 2000-2011 in Boulder, CO
## [1] 19.25
sum(rain) #returns the total amount of rainfall during the study period
## [1] 231
length(rain) #returns the length of the vector, i.e. the number of years of data
## [1] 12

We can also calculate deviations from the mean for each year:

rain - mean(rain)  #Deviations from the mean; negative values indicate below average rainfall.
##  [1] -3.25 -1.25 -5.25  2.75  7.75 -2.25 -0.25 -2.25 -2.25  2.75  0.75  2.75

We can use the assignment operator to save these deviations from the mean as a new object:

rainDeviations <- rain - mean(rain)
rainDeviations^2  #Squared deviations from the mean
##  [1] 10.5625  1.5625 27.5625  7.5625 60.0625  5.0625  0.0625  5.0625  5.0625
## [10]  7.5625  0.5625  7.5625
sqrt(rain)  #Square root of rainfall values
##  [1] 4.000000 4.242641 3.741657 4.690416 5.196152 4.123106 4.358899 4.123106
##  [9] 4.123106 4.690416 4.472136 4.690416

Conceptually, the standard deviation is like the average deviation from the mean. However, the average deviation from the mean is always zero. Thus, we calculate the standard deviation as:

\[s = \sqrt{s^{2}} = \sqrt{\frac{SS}{N - 1}} = \sqrt{\frac{\sum (x_{i} - \bar{x})^{2}}{N - 1}}\] where \(s^2\) is the variance; SS is the sum of squared deviations from the mean; N is the number of observations; \(x_i\) is the \(i^{th}\) score in a group; and \(\bar{x}\) is the mean of the group.

The standard deviation is close to the Root Mean Square (RMS) of the deviations from the mean; the only difference is that the sum of squares is divided by N - 1 rather than N. The above formula can be broken down into a series of simple steps:
1. Calculate the deviations from the mean (see above R code).
2. Square the deviations from the mean and save the squared deviations as a new R object (use the “<-” assignment operator).
3. Sum these squared deviations and divide by N - 1. Again, save the result as an object.
4. Finally, take the square root of the result from the prior step.
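The steps can be sketched in R as follows, dividing by N - 1 as in the formula above and comparing the result with the built-in sd(). The object names are illustrative choices.

```r
rain <- c(16, 18, 14, 22, 27, 17, 19, 17, 17, 22, 20, 22)

# Step 1: deviations from the mean
rain_deviations <- rain - mean(rain)
mean(rain_deviations)   # zero, which is why the deviations are squared first

# Step 2: squared deviations
squared_deviations <- rain_deviations^2

# Step 3: sum of squares divided by N - 1 (the sample variance)
rain_variance <- sum(squared_deviations) / (length(rain) - 1)

# Step 4: square root of the variance
rain_sd <- sqrt(rain_variance)
rain_sd

# Check against the built-in functions
all.equal(rain_variance, var(rain))
all.equal(rain_sd, sd(rain))
```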