Mini-Course: Introduction to R for Data Analysis

Learning Objectives

The goal of this course is to lay a functional foundation in the use of R, requiring no prior background in R, and illustrated mainly using topics encountered in an elementary statistics course. R’s computational platform is geared toward working with multivariate data. As such, it extends easily to many tasks encountered in data analysis, statistical learning, and computational linear algebra. Some simple applications in these areas will also be presented.

PART I: Basics of R and Base R

1. Software

A. R

What is R?

R is a language and environment for statistical computing and graphics.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible.
One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed.

Source: https://www.r-project.org/about.html

Download R from CRAN: https://ftp.osuosl.org/pub/cran/

B. R Studio

What is R Studio?

Integrated Development Environment (IDE): console, syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging, and workspace management.
R Studio runs on top of R.

Download R Studio from Posit: https://posit.co/download/rstudio-desktop/

2. Creating a Script

An R script is a text file containing the commands that you wish to execute in the command line of R.

3. Commenting Code

A comment is non-executable code. Leaving comments in your code to describe what the code does or your thought process is a good practice.

Think of this as your past self leaving your future self notes!

### HELLO SELF!  THIS IS A COMMENT!

4. Variable Assignment

In R you can use either the equals sign = or arrow assignment when you want to store a variable (or object) for later use.

### VARIABLE ASSIGNMENT
a = 13
b <- 2

### We can use R as a calculator!
a+b # addition

## [1] 15

a-b # subtraction

## [1] 11

a/b # division

## [1] 6.5

a^b # exponentiation

## [1] 169

a%%b # modulo

## [1] 1

a>=b # inequalities (logical evaluation)

## [1] TRUE

5. Generating Random Data

Statistics is the study of randomness! Many statistics examples are based on simple games of chance.

A. Flip a coin

## BERNOULLI TRIAL
## One coin flip
sample(x=c("Heads", "Tails"), size=1)

## [1] "Tails"

Flip a coin multiple 5 times.

## FIVE INDEPENDENT COIN FLIPS
sample(x=c("Heads", "Tails"), size=5, replace=TRUE)

## [1] "Tails" "Heads" "Heads" "Heads" "Heads"

B. Roll dice

## D-6
## ONE ROLL
sample(1:6, size=1)

## [1] 4

Sum of two rolls (Example: Catan)

## TWO ROLLS
roll2<-sample(1:6, size=2, replace=TRUE)
roll2

## [1] 2 3

## SUM OF TWO ROLLS
sum(roll2)

## [1] 5

C. Pick a card

First we will build a deck:

## REP FUNCTION 
ranks<-rep(c("Ace", 2:10, "Jack", "Queen", "King"), 4)

## CONCATENATE
suits<-c(rep("Hearts", 13), 
         rep("Diamonds", 13), 
         rep("Clubs", 13), 
         rep("Spades", 13))

## CREATE A NEW DATA FRAME FOR THE DECK
deck<-data.frame(ranks, suits)
dim(deck)

## [1] 52  2

Pick a five card hand

## SET A SEED TO HAVE REPRODUCIBLE CODE
set.seed(314)

## RANDOMLY CHOOSE CARDS
deal<-sample(1:52, 5)

## SELECT ONLY THE CARDS FROM THE SAMPLE
## INDEXED BY ROWS AND THEN COLUMNS [R, C]
deck[deal,]

##    ranks    suits
## 14   Ace Diamonds
## 3      3   Hearts
## 20     7 Diamonds
## 35     9    Clubs
## 39  King    Clubs

D. Random Variables

There are several known distributions built into R:

norm: Normal distribution
t: Student t’s distribution
chisq: Chi-Squared distribution
binom: Binomial distribution
pois: Poisson distribution

If we proceed the shortened distribution name with an r we can simulate data from that distribution.

### RANDOM NORMAL DATA
x<-rnorm(n=1000)

### HISTOGRAM (in base)
hist(x)

6. Installing Packages and Calling Libraries

One of the strengths of R is its community of users that build open-source packages.

Useful packages/resources for data and examples:

MASS: Modern Applied Statistics in S
openintro: Package for “OpenIntro” (free) online textbooks for modern statistical methods
ISLR: Package for “Introduction to Statistical Learning with R”

You only need to install a package once. After that you can comment it out.

Note: You cannot leave an install.packages() code in an RMarkdown. It will fail to compile.

## INSTALL PACKAGE
#install.packages("ISLR")

## CALL THE LIBRARY TO ACCESS FUCTIONS
library(ISLR)

7. Help Files

When working with new packages or function you might want to read about the arguments and defaults.

## HELP FILES
?rnorm

## OR
help(rnorm)

8. Data in R

A. Data Built Into Base R

## WHAT DATA ARE BUILT INTO R?
library(help = "datasets")

Call the data

## Anscombe's Quartet
data("anscombe")

B. Data in Packages

library(ISLR)

## COLLEGE DATASET
data("College")

Look at the data:

## HEAD: first six rows
head(College)

##                              Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University     Yes 1660   1232    721        23        52
## Adelphi University               Yes 2186   1924    512        16        29
## Adrian College                   Yes 1428   1097    336        22        50
## Agnes Scott College              Yes  417    349    137        60        89
## Alaska Pacific University        Yes  193    146     55        16        44
## Albertson College                Yes  587    479    158        38        62
##                              F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University        2885         537     7440       3300   450
## Adelphi University                  2683        1227    12280       6450   750
## Adrian College                      1036          99    11250       3750   400
## Agnes Scott College                  510          63    12960       5450   450
## Alaska Pacific University            249         869     7560       4120   800
## Albertson College                    678          41    13500       3335   500
##                              Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University     2200  70       78      18.1          12   7041
## Adelphi University               1500  29       30      12.2          16  10527
## Adrian College                   1165  53       66      12.9          30   8735
## Agnes Scott College               875  92       97       7.7          37  19016
## Alaska Pacific University        1500  76       72      11.9           2  10922
## Albertson College                 675  67       73       9.4          11   9727
##                              Grad.Rate
## Abilene Christian University        60
## Adelphi University                  56
## Adrian College                      54
## Agnes Scott College                 59
## Alaska Pacific University           15
## Albertson College                   55

## TAIL: last six rows
tail(College)

##                                 Private  Apps Accept Enroll Top10perc Top25perc
## Worcester Polytechnic Institute     Yes  2768   2314    682        49        86
## Worcester State College              No  2197   1515    543         4        26
## Xavier University                   Yes  1959   1805    695        24        47
## Xavier University of Louisiana      Yes  2097   1915    695        34        61
## Yale University                     Yes 10705   2453   1317        95        99
## York College of Pennsylvania        Yes  2989   1855    691        28        63
##                                 F.Undergrad P.Undergrad Outstate Room.Board
## Worcester Polytechnic Institute        2802          86    15884       5370
## Worcester State College                3089        2029     6797       3900
## Xavier University                      2849        1107    11520       4960
## Xavier University of Louisiana         2793         166     6900       4200
## Yale University                        5217          83    19840       6510
## York College of Pennsylvania           2988        1726     4990       3560
##                                 Books Personal PhD Terminal S.F.Ratio
## Worcester Polytechnic Institute   530      730  92       94      15.2
## Worcester State College           500     1200  60       60      21.0
## Xavier University                 600     1250  73       75      13.3
## Xavier University of Louisiana    617      781  67       75      14.4
## Yale University                   630     2115  96       96       5.8
## York College of Pennsylvania      500     1250  75       75      18.1
##                                 perc.alumni Expend Grad.Rate
## Worcester Polytechnic Institute          34  10774        82
## Worcester State College                  14   4469        40
## Xavier University                        31   9189        83
## Xavier University of Louisiana           20   8323        49
## Yale University                          49  40386        99
## York College of Pennsylvania             28   4509        99

## STR: structure
str(College)

## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...

C. Importing Data

## IMPORTING IN BASE
## DATA FOR NATIONAL PARK TRAILS ON ALL TRAILS APP
allTrails <- read.csv("https://raw.githubusercontent.com/kitadasmalley/DATA151/main/Data/AllTrails%20data%20-%20nationalpark.csv", 
                  header=TRUE)

9. Basic Statistics Functions

Let’s separate into four groups using the anscombe data.

Group 1: Use x1 and y1
Group 2: Use x2 and y2
Group 3: Use x3 and y3
Group 4: Use x4 and y4

A. Summary

Five number summary (plus the mean)

## SUMMARY
summary(anscombe$x1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0     6.5     9.0     9.0    11.5    14.0

B. Center

## SAMPLE MEAN
mean(anscombe$x1)

## [1] 9

mean(anscombe$y1)

## [1] 7.500909

## MEDIAN
median(anscombe$x1)

## [1] 9

quantile(anscombe$x1, p=.5)

## 50% 
##   9

median(anscombe$y1)

## [1] 7.58

C. Spread

## RANGE
min(anscombe$x1)

## [1] 4

max(anscombe$x1)

## [1] 14

range(anscombe$x1)

## [1]  4 14

## SAMPLE STANDARD DEVIATION
sd(anscombe$x1)

## [1] 3.316625

sd(anscombe$y1)

## [1] 2.031568

## SAMPLE VARIANCE
var(anscombe$x1)

## [1] 11

var(anscombe$y1)

## [1] 4.127269

D. Correlation

## Correlation is a symmetric function
cor(anscombe$x1, anscombe$y1)

## [1] 0.8164205

cor(anscombe$y1, anscombe$x1)

## [1] 0.8164205

10. Base graphics

plot(anscombe$x1, anscombe$y1)

The moral of the story is, you should always plot your data.

Here is another fun example called the Dinosaur Dozen

https://blog.revolutionanalytics.com/2017/05/the-datasaurus-dozen.html

11. Writing Functions

For simple linear regression the slope estimate is given by

\[\hat{\beta_1}=r \times \frac{s_y}{s_x}\]

## CREATE A FUNCTION FOR SLOPE
slope<-function(x, y){
  sy<-sd(y)
  sx<-sd(x)
  r<-cor(x, y)
  
  thisSlope<-r*(sy/sx)
  
  return(thisSlope)
}

## CALL THE NEW FUNCTION
slope(anscombe$x1, anscombe$y1)

## [1] 0.5000909

12. Loops

Loops are very useful to run simulations and illustrate concepts such as convergence.

For instance, the Law of Large Numbers:

nsim <- 500
pHat<-rep(0, nsim)
for(i in 1:nsim){
  flips<-sample(c(0,1), size=i, replace=TRUE)
  pHat[i]<-mean(flips)
}

plot(1:nsim, pHat, pch=16)

PART II: Data Science Workflow and Tidyverse

1. Data Science Workflow

The data science workflow in R takes the following iterative steps:

Import
Tidy
Transform
Visualize
Model
Communicate

Source: R for Data Science https://r4ds.had.co.nz/introduction.html

2. Where to find data for projects

A. Kaggle

Kaggle is a website that hosts data analysis competitions and freely available datasets.

Source: https://www.kaggle.com/datasets

B. UCI Machine Learning Repository

The University of California Irvine (UCI) Machine Learning Repository maintains over 600 datasets for the machine learning community for projects and scholarly work.

Source: https://archive.ics.uci.edu/ml/index.php

C. Storytelling with Data List

Storytelling with Data by Cole Nussbaumer Knaflic is an innovative text and online community for data visualization.

She has also put together a list of public/freely available datasets: https://docs.google.com/document/d/1Ads4XsCjXmDrdGRgfmm_OgRdpFcl6Qhs6SOllNGyq7Y/edit

Source: https://www.storytellingwithdata.com/

3. R Markdown (with `knitr`)

Integrate text and reproducible code.

A. Headings

Headings can be used to create an outline structure to organize and navigate an R Markdown document.

B. Code Chunks

Code chunks within R Markdown are used to execute R code and print the respective output.

C. RPubs

You can publish your R Markdown documents like coding websites for free on RPubs.

4. Tidyverse

Learning Through Example

(AllTrails App)

AllTrails is a fitness and travel mobile app used in outdoor recreational activities. AllTrails is commonly used for outdoor activities such as hiking, mountain biking, climbing and snow sports. The service allows users to access a database of trail maps, which includes crowdsourced reviews and images.

Citations:

Data Source: These data come from Kaggle

https://www.kaggle.com/datasets/planejane/national-park-trails

Look at the data

What variables are available to work with?

str(allTrails)

## 'data.frame':    3313 obs. of  18 variables:
##  $ trail_id         : int  10020048 10236086 10267857 10236076 10236082 10241620 10236080 10236075 10236084 10327620 ...
##  $ name             : chr  "Harding Ice Field Trail" "Mount Healy Overlook Trail" "Exit Glacier Trail" "Horseshoe Lake Trail" ...
##  $ area_name        : chr  "Kenai Fjords National Park" "Denali National Park" "Kenai Fjords National Park" "Denali National Park" ...
##  $ city_name        : chr  "Seward" "Denali National Park" "Seward" "Denali National Park" ...
##  $ state_name       : chr  "Alaska" "Alaska" "Alaska" "Alaska" ...
##  $ country_name     : chr  "United States" "United States" "United States" "United States" ...
##  $ X_geoloc         : chr  "{'lat': 60.18852, 'lng': -149.63156}" "{'lat': 63.73049, 'lng': -148.91968}" "{'lat': 60.18879, 'lng': -149.631}" "{'lat': 63.73661, 'lng': -148.915}" ...
##  $ popularity       : num  24.9 18 17.8 16.3 12.6 ...
##  $ length           : num  15611 6920 2897 3380 29773 ...
##  $ elevation_gain   : num  1162 508 82 120 1125 ...
##  $ difficulty_rating: int  5 3 1 1 5 5 3 3 1 5 ...
##  $ route_type       : chr  "out and back" "out and back" "out and back" "loop" ...
##  $ visitor_usage    : int  3 1 3 2 1 1 1 1 1 1 ...
##  $ avg_rating       : num  5 4.5 4.5 4.5 4.5 4.5 4 4 4.5 4.5 ...
##  $ num_reviews      : int  423 260 224 237 110 43 39 27 21 5 ...
##  $ features         : chr  "['dogs-no', 'forest', 'river', 'views', 'waterfall', 'wild-flowers', 'wildlife']" "['dogs-no', 'forest', 'views', 'wild-flowers', 'wildlife']" "['dogs-no', 'partially-paved', 'views', 'wildlife']" "['dogs-no', 'forest', 'lake', 'kids', 'views', 'wild-flowers', 'wildlife']" ...
##  $ activities       : chr  "['birding', 'camping', 'hiking', 'nature-trips', 'trail-running']" "['birding', 'camping', 'hiking', 'nature-trips', 'walking']" "['hiking', 'walking']" "['birding', 'hiking', 'nature-trips', 'trail-running', 'walking']" ...
##  $ units            : chr  "i" "i" "i" "i" ...

For this activity you will use the following variables:

area_name: Name of National Park
state_name: Name of State where trail is located
length: Length of hike in meters
elevation_gain: Elevation gain in meters
route_type: Out and back, Loop, or Point to Point
avg_rating: Star rating (out of 5)

A. `dplyr`

i. Piping operator `%>%`

Takes the data frame from the left and passes into the function on the right.
Helps to make code readable, should be read aloud as “and then”

ii. Verbs

MUTATE: Calculate the Grade

Average grade can be calculated by taking the total elevation gain of the trail, dividing by the total distance, multiplied by 100 to equal a percent grade. The elevation and distance must be in the same units.

Citation:

https://venturacountytrails.org/TrailMaps/TrailSymbols.htm#:~:text=Average%20grade%20can%20be%20calculated,as%20the%20’rise%20over%20run.

TASK: Add a column to the data frame calculate the average grade for each trail.

## MUTATE
allTrails<-allTrails%>%
  mutate(grade=(elevation_gain/length)*100)

str(allTrails)

## 'data.frame':    3313 obs. of  19 variables:
##  $ trail_id         : int  10020048 10236086 10267857 10236076 10236082 10241620 10236080 10236075 10236084 10327620 ...
##  $ name             : chr  "Harding Ice Field Trail" "Mount Healy Overlook Trail" "Exit Glacier Trail" "Horseshoe Lake Trail" ...
##  $ area_name        : chr  "Kenai Fjords National Park" "Denali National Park" "Kenai Fjords National Park" "Denali National Park" ...
##  $ city_name        : chr  "Seward" "Denali National Park" "Seward" "Denali National Park" ...
##  $ state_name       : chr  "Alaska" "Alaska" "Alaska" "Alaska" ...
##  $ country_name     : chr  "United States" "United States" "United States" "United States" ...
##  $ X_geoloc         : chr  "{'lat': 60.18852, 'lng': -149.63156}" "{'lat': 63.73049, 'lng': -148.91968}" "{'lat': 60.18879, 'lng': -149.631}" "{'lat': 63.73661, 'lng': -148.915}" ...
##  $ popularity       : num  24.9 18 17.8 16.3 12.6 ...
##  $ length           : num  15611 6920 2897 3380 29773 ...
##  $ elevation_gain   : num  1162 508 82 120 1125 ...
##  $ difficulty_rating: int  5 3 1 1 5 5 3 3 1 5 ...
##  $ route_type       : chr  "out and back" "out and back" "out and back" "loop" ...
##  $ visitor_usage    : int  3 1 3 2 1 1 1 1 1 1 ...
##  $ avg_rating       : num  5 4.5 4.5 4.5 4.5 4.5 4 4 4.5 4.5 ...
##  $ num_reviews      : int  423 260 224 237 110 43 39 27 21 5 ...
##  $ features         : chr  "['dogs-no', 'forest', 'river', 'views', 'waterfall', 'wild-flowers', 'wildlife']" "['dogs-no', 'forest', 'views', 'wild-flowers', 'wildlife']" "['dogs-no', 'partially-paved', 'views', 'wildlife']" "['dogs-no', 'forest', 'lake', 'kids', 'views', 'wild-flowers', 'wildlife']" ...
##  $ activities       : chr  "['birding', 'camping', 'hiking', 'nature-trips', 'trail-running']" "['birding', 'camping', 'hiking', 'nature-trips', 'walking']" "['hiking', 'walking']" "['birding', 'hiking', 'nature-trips', 'trail-running', 'walking']" ...
##  $ units            : chr  "i" "i" "i" "i" ...
##  $ grade            : num  7.44 7.34 2.83 3.54 3.78 ...

MUTATE: Convert Units

The AllTrails website reports trail length and elevation gain in miles and feet, respectively.

See example here: https://www.alltrails.com/trail/us/washington/sentinell-peak-via-grey-wolf-deer-loop

1 meter = 0.000621371 miles
1 meter = 3.28084 feet

TASK: Convert the values for length and elevation to miles and feet, respectively.

allTrails<-allTrails%>%
  mutate(lengthMiles=length*0.000621371, 
         elevationFt=elevation_gain*3.28084)

str(allTrails)

## 'data.frame':    3313 obs. of  21 variables:
##  $ trail_id         : int  10020048 10236086 10267857 10236076 10236082 10241620 10236080 10236075 10236084 10327620 ...
##  $ name             : chr  "Harding Ice Field Trail" "Mount Healy Overlook Trail" "Exit Glacier Trail" "Horseshoe Lake Trail" ...
##  $ area_name        : chr  "Kenai Fjords National Park" "Denali National Park" "Kenai Fjords National Park" "Denali National Park" ...
##  $ city_name        : chr  "Seward" "Denali National Park" "Seward" "Denali National Park" ...
##  $ state_name       : chr  "Alaska" "Alaska" "Alaska" "Alaska" ...
##  $ country_name     : chr  "United States" "United States" "United States" "United States" ...
##  $ X_geoloc         : chr  "{'lat': 60.18852, 'lng': -149.63156}" "{'lat': 63.73049, 'lng': -148.91968}" "{'lat': 60.18879, 'lng': -149.631}" "{'lat': 63.73661, 'lng': -148.915}" ...
##  $ popularity       : num  24.9 18 17.8 16.3 12.6 ...
##  $ length           : num  15611 6920 2897 3380 29773 ...
##  $ elevation_gain   : num  1162 508 82 120 1125 ...
##  $ difficulty_rating: int  5 3 1 1 5 5 3 3 1 5 ...
##  $ route_type       : chr  "out and back" "out and back" "out and back" "loop" ...
##  $ visitor_usage    : int  3 1 3 2 1 1 1 1 1 1 ...
##  $ avg_rating       : num  5 4.5 4.5 4.5 4.5 4.5 4 4 4.5 4.5 ...
##  $ num_reviews      : int  423 260 224 237 110 43 39 27 21 5 ...
##  $ features         : chr  "['dogs-no', 'forest', 'river', 'views', 'waterfall', 'wild-flowers', 'wildlife']" "['dogs-no', 'forest', 'views', 'wild-flowers', 'wildlife']" "['dogs-no', 'partially-paved', 'views', 'wildlife']" "['dogs-no', 'forest', 'lake', 'kids', 'views', 'wild-flowers', 'wildlife']" ...
##  $ activities       : chr  "['birding', 'camping', 'hiking', 'nature-trips', 'trail-running']" "['birding', 'camping', 'hiking', 'nature-trips', 'walking']" "['hiking', 'walking']" "['birding', 'hiking', 'nature-trips', 'trail-running', 'walking']" ...
##  $ units            : chr  "i" "i" "i" "i" ...
##  $ grade            : num  7.44 7.34 2.83 3.54 3.78 ...
##  $ lengthMiles      : num  9.7 4.3 1.8 2.1 18.5 ...
##  $ elevationFt      : num  3812 1666 269 393 3690 ...

FILTER: Finding Family Friendly Hikes

Our family likes to hike together! We have three children (ages 2, 7, and 11) so we have some limitations.

TASK: Find the best hike that fits ALL of the following conditions:

Is in Oregon
Is is a loop
Is less than 3 miles
Is “easy” (has a grade less than 5)

Note: Best is defined by the highest average rating

allTrails%>%
  filter(state_name=="Oregon", 
         route_type=="loop", 
         lengthMiles<3, 
         grade<5)%>%
  arrange(desc(avg_rating))

##   trail_id                                 name                 area_name
## 1 10016688                      Sun Notch Trail Crater Lake National Park
## 2 10013161                   Godfrey Glen Trail Crater Lake National Park
## 3 10015976             Annie Creek Canyon Trail Crater Lake National Park
## 4 10012733 Castle Crest Wildflower Garden Trail Crater Lake National Park
## 5 10236071              Lady of the Woods Trail Crater Lake National Park
##     city_name state_name  country_name                             X_geoloc
## 1 Crater Lake     Oregon United States     {'lat': 42.9, 'lng': -122.09549}
## 2   Chiloquin     Oregon United States {'lat': 42.86676, 'lng': -122.14558}
## 3   Chiloquin     Oregon United States {'lat': 42.86542, 'lng': -122.16247}
## 4   Chiloquin     Oregon United States {'lat': 42.89664, 'lng': -122.13375}
## 5 Crater Lake     Oregon United States {'lat': 42.89698, 'lng': -122.13413}
##   popularity   length elevation_gain difficulty_rating route_type visitor_usage
## 1    13.9272 1287.472        38.7096                 1       loop             1
## 2     8.8741 1770.274        19.8120                 1       loop            NA
## 3     7.5299 3379.614        92.9640                 3       loop            NA
## 4     7.0665 1931.208        36.8808                 1       loop             1
## 5     4.5346 1126.538        33.8328                 1       loop             1
##   avg_rating num_reviews
## 1        4.5          98
## 2        4.0          23
## 3        4.0          40
## 4        4.0          24
## 5        3.0          19
##                                                             features
## 1     ['dogs-no', 'forest', 'lake', 'kids', 'views', 'wild-flowers']
## 2             ['dogs-no', 'forest', 'kids', 'views', 'wild-flowers']
## 3                               ['dogs-no', 'views', 'wild-flowers']
## 4 ['dogs-no', 'forest', 'kids', 'views', 'wild-flowers', 'wildlife']
## 5 ['dogs-leash', 'forest', 'kids', 'river', 'views', 'wild-flowers']
##                                                         activities units
## 1 ['birding', 'hiking', 'snowshoeing', 'trail-running', 'walking']     i
## 2                                            ['hiking', 'walking']     i
## 3                                                       ['hiking']     i
## 4                                            ['hiking', 'walking']     i
## 5                                            ['hiking', 'walking']     i
##      grade lengthMiles elevationFt
## 1 3.006636   0.7999978         127
## 2 1.119149   1.0999969          65
## 3 2.750728   2.0999941         305
## 4 1.909727   1.1999966         121
## 5 3.003254   0.6999980         111

GROUP_BY and SUMMARISE: Trails within National Parks (within States)

TASK: Find the number of trails and the average star rating within each National Park.

Which national park has the most trails?
Which national park has the highest rated trails?

# Which national park has the most trails?
nTrails<-allTrails%>%
  group_by(state_name, area_name)%>%
  count()%>%
  arrange(desc(n))

head(nTrails)

## # A tibble: 6 × 3
## # Groups:   state_name, area_name [6]
##   state_name area_name                               n
##   <chr>      <chr>                               <int>
## 1 California Yosemite National Park                242
## 2 Wyoming    Yellowstone National Park             209
## 3 Colorado   Rocky Mountain National Park          207
## 4 Virginia   Shenandoah National Park              187
## 5 Maine      Acadia National Park                  179
## 6 Tennessee  Great Smoky Mountains National Park   175

# Which national park has the highest rated trails?

ratedTrails<-allTrails%>%
  group_by(area_name)%>%
  summarise(avgRate=mean(avg_rating, na.rm=TRUE))%>%
  arrange(desc(avgRate))

head(ratedTrails)

## # A tibble: 6 × 2
##   area_name                                       avgRate
##   <chr>                                             <dbl>
## 1 Kenai Fjords National Park                         4.75
## 2 Haleakala National Park                            4.57
## 3 Dry Tortugas National Park                         4.5 
## 4 Fort Pickens National Park                         4.5 
## 5 Wolf Trap National Park for the Performing Arts    4.5 
## 6 Mount Rainier National Park                        4.43

JOIN: Join Regions

States are grouped by regions (included in R) with the following data frame:

stateRegion<-data.frame(state_name=state.name, 
                        region=state.region)

head(stateRegion)

##   state_name region
## 1    Alabama  South
## 2     Alaska   West
## 3    Arizona   West
## 4   Arkansas  South
## 5 California   West
## 6   Colorado   West

TASK: Add a column for region to the data frame

allTrailsReg<-allTrails%>%
  left_join(stateRegion)

## Joining with `by = join_by(state_name)`

str(allTrailsReg)

## 'data.frame':    3313 obs. of  22 variables:
##  $ trail_id         : int  10020048 10236086 10267857 10236076 10236082 10241620 10236080 10236075 10236084 10327620 ...
##  $ name             : chr  "Harding Ice Field Trail" "Mount Healy Overlook Trail" "Exit Glacier Trail" "Horseshoe Lake Trail" ...
##  $ area_name        : chr  "Kenai Fjords National Park" "Denali National Park" "Kenai Fjords National Park" "Denali National Park" ...
##  $ city_name        : chr  "Seward" "Denali National Park" "Seward" "Denali National Park" ...
##  $ state_name       : chr  "Alaska" "Alaska" "Alaska" "Alaska" ...
##  $ country_name     : chr  "United States" "United States" "United States" "United States" ...
##  $ X_geoloc         : chr  "{'lat': 60.18852, 'lng': -149.63156}" "{'lat': 63.73049, 'lng': -148.91968}" "{'lat': 60.18879, 'lng': -149.631}" "{'lat': 63.73661, 'lng': -148.915}" ...
##  $ popularity       : num  24.9 18 17.8 16.3 12.6 ...
##  $ length           : num  15611 6920 2897 3380 29773 ...
##  $ elevation_gain   : num  1162 508 82 120 1125 ...
##  $ difficulty_rating: int  5 3 1 1 5 5 3 3 1 5 ...
##  $ route_type       : chr  "out and back" "out and back" "out and back" "loop" ...
##  $ visitor_usage    : int  3 1 3 2 1 1 1 1 1 1 ...
##  $ avg_rating       : num  5 4.5 4.5 4.5 4.5 4.5 4 4 4.5 4.5 ...
##  $ num_reviews      : int  423 260 224 237 110 43 39 27 21 5 ...
##  $ features         : chr  "['dogs-no', 'forest', 'river', 'views', 'waterfall', 'wild-flowers', 'wildlife']" "['dogs-no', 'forest', 'views', 'wild-flowers', 'wildlife']" "['dogs-no', 'partially-paved', 'views', 'wildlife']" "['dogs-no', 'forest', 'lake', 'kids', 'views', 'wild-flowers', 'wildlife']" ...
##  $ activities       : chr  "['birding', 'camping', 'hiking', 'nature-trips', 'trail-running']" "['birding', 'camping', 'hiking', 'nature-trips', 'walking']" "['hiking', 'walking']" "['birding', 'hiking', 'nature-trips', 'trail-running', 'walking']" ...
##  $ units            : chr  "i" "i" "i" "i" ...
##  $ grade            : num  7.44 7.34 2.83 3.54 3.78 ...
##  $ lengthMiles      : num  9.7 4.3 1.8 2.1 18.5 ...
##  $ elevationFt      : num  3812 1666 269 393 3690 ...
##  $ region           : Factor w/ 4 levels "Northeast","South",..: 4 4 4 4 4 4 4 4 4 4 ...

B. `ggplot2`

BARS

ggplot(allTrailsReg, aes(x=region))+
  geom_bar()

## STACKED BAR AS DEFAULT
ggplot(allTrailsReg, aes(x=region, fill=route_type))+
  geom_bar()

## DODGE
ggplot(allTrailsReg, aes(x=region, fill=route_type))+
  geom_bar(position="dodge")

## FILL
ggplot(allTrailsReg, aes(x=region, fill=route_type))+
  geom_bar(position="fill")

HISTOGRAM

ggplot(allTrailsReg, aes(x=avg_rating ))+
  geom_histogram(bins=10)

DENSITY

ggplot(allTrailsReg, aes(x=grade ))+
  geom_density()

## Warning: Removed 2 rows containing non-finite values (`stat_density()`).

BOXPLOT

ggplot(allTrailsReg, aes(x=grade ))+
  geom_boxplot()

## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

Comparing across groups

## COLOR AS A GROUPING VARIABLE
ggplot(allTrailsReg, aes(y=grade , fill=region))+
  geom_boxplot()

## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

## FACET
ggplot(allTrailsReg, aes(y=grade , fill=region))+
  geom_boxplot()+
  facet_grid(.~region)

## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

BONUS: MAPS

Summarise state data

stateNP<-allTrailsReg%>%
  group_by(state_name)%>%
  summarise(stateTrails=n(), 
            avgPop=mean(popularity, na.rm=TRUE), 
            avgElev=mean(elevation_gain, na.rm=TRUE))

Map data in R:

#install.packages("usmap")
library(usmap)

states <- usmap::us_map()

head(states)

##         x        y order  hole piece group fips abbr    full
## 1 1091779 -1380695     1 FALSE     1  01.1   01   AL Alabama
## 2 1091268 -1376372     2 FALSE     1  01.1   01   AL Alabama
## 3 1091140 -1362998     3 FALSE     1  01.1   01   AL Alabama
## 4 1090940 -1343517     4 FALSE     1  01.1   01   AL Alabama
## 5 1090913 -1341006     5 FALSE     1  01.1   01   AL Alabama
## 6 1090796 -1334480     6 FALSE     1  01.1   01   AL Alabama

Oregon map:

## POINTS
oregon<-states%>%
  filter(full=="Oregon")

ggplot(oregon, aes(x, y))+
  geom_point()

## LINES (connect the dots)
ggplot(oregon, aes(x, y))+
  geom_line()

## PATH
ggplot(oregon, aes(x, y, group=group))+
  geom_path()

## POLYGON
ggplot(oregon, aes(x, y, group=group))+
  geom_polygon(fill="forestgreen")

Join map data

stateNP_Map<-states%>%
  mutate(state_name=full)%>%
  left_join(stateNP)

## Joining with `by = join_by(state_name)`

Make a map

stateNP_Map%>%
  ggplot(aes(x, y, group = group)) +
  geom_polygon(aes(fill = stateTrails),color="black")+
  theme_bw()+
  coord_equal()

Change the color palette:

#install.packages("viridis")
library(viridis)

## Loading required package: viridisLite

stateNP_Map%>%
  ggplot(aes(x, y, group = group)) +
  geom_polygon(aes(fill = stateTrails),color="black")+
  theme_bw()+
  coord_equal()+
  ggtitle("California has the MOST trails, but...")+
  scale_fill_viridis(option="viridis", direction = 1)

stateNP_Map%>%
  ggplot(aes(x, y, group = group)) +
  geom_polygon(aes(fill = avgPop),color="black")+
  theme_bw()+
  coord_equal()+
  ggtitle("..Oregon trails are the MOST popular")+
  scale_fill_viridis(option="viridis", direction = 1)

PART III: Statistical Learning and Matrices in R

Let’s shift gears and work with the College data set in the ISRL package.

str(College)

## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...

We want to predict the number of students who Enroll:

ggplot(College, aes(Apps, Enroll))+
  geom_point()

It looks like we might have an outlier

collegeWO<-College%>%
  filter(Apps<30000)

ggplot(collegeWO, aes(Apps, Enroll))+
  geom_point()

What would the linear model be?

## SIMPLE LINEAR MOD
mod<-lm(Enroll~Apps, data=College)
summary(mod)

## 
## Call:
## lm(formula = Enroll ~ Apps, data = College)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5427.7  -148.2   -81.6    45.0  3279.6 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.697e+02  2.246e+01   7.557 1.16e-13 ***
## Apps        2.033e-01  4.587e-03  44.323  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 494.5 on 775 degrees of freedom
## Multiple R-squared:  0.7171, Adjusted R-squared:  0.7167 
## F-statistic:  1965 on 1 and 775 DF,  p-value: < 2.2e-16

1. Linear Algebra in R

Lets try to verify the model summary by learning about matrix algebra in R.

A. Response Vector

First, create your response vector, \(Y\), and look at the dimensions. We will also want to keep track of the sample size.

## RESPONSE VECTOR
Y<-as.matrix(College$Enroll)
head(Y)

##      [,1]
## [1,]  721
## [2,]  512
## [3,]  336
## [4,]  137
## [5,]   55
## [6,]  158

## dimensions
dim(Y)

## [1] 777   1

n<-dim(Y)[1]

B. Design Matrix

X<-matrix(c(rep(1, n),
            College$Apps), 
          ncol=2, 
          byrow=FALSE)
head(X)

##      [,1] [,2]
## [1,]    1 1660
## [2,]    1 2186
## [3,]    1 1428
## [4,]    1  417
## [5,]    1  193
## [6,]    1  587

C. Beta Estimates

## Least squares estimates
## solve() = inverse
## t() = transpose
## %*% = matrix multiple

betaHat<-solve(t(X)%*%X)%*%t(X)%*%Y
betaHat

##            [,1]
## [1,] 169.712809
## [2,]   0.203309

Diagnostic Plots

plot(mod)