The goal of this course is to lay a functional foundation in the use of R, requiring no prior background in R, and illustrated mainly using topics encountered in an elementary statistics course. R’s computational platform is geared toward working with multivariate data. As such, it extends easily to many tasks encountered in data analysis, statistical learning, and computational linear algebra. Some simple applications in these areas will also be presented.
What is R?
Source: https://www.r-project.org/about.html
Download R from CRAN: https://ftp.osuosl.org/pub/cran/
What is R Studio?
Download R Studio from Posit: https://posit.co/download/rstudio-desktop/
An R script is a text file containing the commands that you wish to execute in the command line of R.
A comment is non-executable code. Leaving comments in your code to describe what the code does or your thought process is a good practice.
Think of this as your past self leaving your future self notes!
### HELLO SELF! THIS IS A COMMENT!
In R you can use either the equals sign =
or arrow
assignment when you want to store a variable (or object) for later
use.
### VARIABLE ASSIGNMENT
a = 13
b <- 2
### We can use R as a calculator!
a+b # addition
## [1] 15
a-b # subtraction
## [1] 11
a/b # division
## [1] 6.5
a^b # exponentiation
## [1] 169
a%%b # modulo
## [1] 1
a>=b # inequalities (logical evaluation)
## [1] TRUE
Statistics is the study of randomness! Many statistics examples are based on simple games of chance.
## BERNOULLI TRIAL
## One coin flip
sample(x=c("Heads", "Tails"), size=1)
## [1] "Tails"
Flip a coin multiple 5 times.
## FIVE INDEPENDENT COIN FLIPS
sample(x=c("Heads", "Tails"), size=5, replace=TRUE)
## [1] "Tails" "Heads" "Heads" "Heads" "Heads"
## D-6
## ONE ROLL
sample(1:6, size=1)
## [1] 4
Sum of two rolls (Example: Catan)
## TWO ROLLS
roll2<-sample(1:6, size=2, replace=TRUE)
roll2
## [1] 2 3
## SUM OF TWO ROLLS
sum(roll2)
## [1] 5
First we will build a deck:
## REP FUNCTION
ranks<-rep(c("Ace", 2:10, "Jack", "Queen", "King"), 4)
## CONCATENATE
suits<-c(rep("Hearts", 13),
rep("Diamonds", 13),
rep("Clubs", 13),
rep("Spades", 13))
## CREATE A NEW DATA FRAME FOR THE DECK
deck<-data.frame(ranks, suits)
dim(deck)
## [1] 52 2
Pick a five card hand
## SET A SEED TO HAVE REPRODUCIBLE CODE
set.seed(314)
## RANDOMLY CHOOSE CARDS
deal<-sample(1:52, 5)
## SELECT ONLY THE CARDS FROM THE SAMPLE
## INDEXED BY ROWS AND THEN COLUMNS [R, C]
deck[deal,]
## ranks suits
## 14 Ace Diamonds
## 3 3 Hearts
## 20 7 Diamonds
## 35 9 Clubs
## 39 King Clubs
There are several known distributions built into R:
norm
: Normal distributiont
: Student t’s distributionchisq
: Chi-Squared distributionbinom
: Binomial distributionpois
: Poisson distributionIf we proceed the shortened distribution name with an r
we can simulate data from that distribution.
### RANDOM NORMAL DATA
x<-rnorm(n=1000)
### HISTOGRAM (in base)
hist(x)
One of the strengths of R is its community of users that build open-source packages.
Useful packages/resources for data and examples:
MASS
: Modern Applied Statistics in Sopenintro
: Package for “OpenIntro” (free) online
textbooks for modern statistical methodsISLR
: Package for “Introduction to Statistical Learning
with R”You only need to install a package once. After that you can comment it out.
Note: You cannot leave an install.packages()
code in
an RMarkdown. It will fail to compile.
## INSTALL PACKAGE
#install.packages("ISLR")
## CALL THE LIBRARY TO ACCESS FUCTIONS
library(ISLR)
When working with new packages or function you might want to read about the arguments and defaults.
## HELP FILES
?rnorm
## OR
help(rnorm)
## WHAT DATA ARE BUILT INTO R?
library(help = "datasets")
Call the data
## Anscombe's Quartet
data("anscombe")
library(ISLR)
## COLLEGE DATASET
data("College")
Look at the data:
## HEAD: first six rows
head(College)
## Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University Yes 1660 1232 721 23 52
## Adelphi University Yes 2186 1924 512 16 29
## Adrian College Yes 1428 1097 336 22 50
## Agnes Scott College Yes 417 349 137 60 89
## Alaska Pacific University Yes 193 146 55 16 44
## Albertson College Yes 587 479 158 38 62
## F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University 2885 537 7440 3300 450
## Adelphi University 2683 1227 12280 6450 750
## Adrian College 1036 99 11250 3750 400
## Agnes Scott College 510 63 12960 5450 450
## Alaska Pacific University 249 869 7560 4120 800
## Albertson College 678 41 13500 3335 500
## Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University 2200 70 78 18.1 12 7041
## Adelphi University 1500 29 30 12.2 16 10527
## Adrian College 1165 53 66 12.9 30 8735
## Agnes Scott College 875 92 97 7.7 37 19016
## Alaska Pacific University 1500 76 72 11.9 2 10922
## Albertson College 675 67 73 9.4 11 9727
## Grad.Rate
## Abilene Christian University 60
## Adelphi University 56
## Adrian College 54
## Agnes Scott College 59
## Alaska Pacific University 15
## Albertson College 55
## TAIL: last six rows
tail(College)
## Private Apps Accept Enroll Top10perc Top25perc
## Worcester Polytechnic Institute Yes 2768 2314 682 49 86
## Worcester State College No 2197 1515 543 4 26
## Xavier University Yes 1959 1805 695 24 47
## Xavier University of Louisiana Yes 2097 1915 695 34 61
## Yale University Yes 10705 2453 1317 95 99
## York College of Pennsylvania Yes 2989 1855 691 28 63
## F.Undergrad P.Undergrad Outstate Room.Board
## Worcester Polytechnic Institute 2802 86 15884 5370
## Worcester State College 3089 2029 6797 3900
## Xavier University 2849 1107 11520 4960
## Xavier University of Louisiana 2793 166 6900 4200
## Yale University 5217 83 19840 6510
## York College of Pennsylvania 2988 1726 4990 3560
## Books Personal PhD Terminal S.F.Ratio
## Worcester Polytechnic Institute 530 730 92 94 15.2
## Worcester State College 500 1200 60 60 21.0
## Xavier University 600 1250 73 75 13.3
## Xavier University of Louisiana 617 781 67 75 14.4
## Yale University 630 2115 96 96 5.8
## York College of Pennsylvania 500 1250 75 75 18.1
## perc.alumni Expend Grad.Rate
## Worcester Polytechnic Institute 34 10774 82
## Worcester State College 14 4469 40
## Xavier University 31 9189 83
## Xavier University of Louisiana 20 8323 49
## Yale University 49 40386 99
## York College of Pennsylvania 28 4509 99
## STR: structure
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
## IMPORTING IN BASE
## DATA FOR NATIONAL PARK TRAILS ON ALL TRAILS APP
allTrails <- read.csv("https://raw.githubusercontent.com/kitadasmalley/DATA151/main/Data/AllTrails%20data%20-%20nationalpark.csv",
header=TRUE)
Let’s separate into four groups using the anscombe
data.
x1
and y1
x2
and y2
x3
and y3
x4
and y4
Five number summary (plus the mean)
## SUMMARY
summary(anscombe$x1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 6.5 9.0 9.0 11.5 14.0
## SAMPLE MEAN
mean(anscombe$x1)
## [1] 9
mean(anscombe$y1)
## [1] 7.500909
## MEDIAN
median(anscombe$x1)
## [1] 9
quantile(anscombe$x1, p=.5)
## 50%
## 9
median(anscombe$y1)
## [1] 7.58
## RANGE
min(anscombe$x1)
## [1] 4
max(anscombe$x1)
## [1] 14
range(anscombe$x1)
## [1] 4 14
## SAMPLE STANDARD DEVIATION
sd(anscombe$x1)
## [1] 3.316625
sd(anscombe$y1)
## [1] 2.031568
## SAMPLE VARIANCE
var(anscombe$x1)
## [1] 11
var(anscombe$y1)
## [1] 4.127269
## Correlation is a symmetric function
cor(anscombe$x1, anscombe$y1)
## [1] 0.8164205
cor(anscombe$y1, anscombe$x1)
## [1] 0.8164205
plot(anscombe$x1, anscombe$y1)
The moral of the story is, you should always plot your data.
Here is another fun example called the Dinosaur Dozen
https://blog.revolutionanalytics.com/2017/05/the-datasaurus-dozen.html
For simple linear regression the slope estimate is given by
\[\hat{\beta_1}=r \times \frac{s_y}{s_x}\]
## CREATE A FUNCTION FOR SLOPE
slope<-function(x, y){
sy<-sd(y)
sx<-sd(x)
r<-cor(x, y)
thisSlope<-r*(sy/sx)
return(thisSlope)
}
## CALL THE NEW FUNCTION
slope(anscombe$x1, anscombe$y1)
## [1] 0.5000909
Loops are very useful to run simulations and illustrate concepts such as convergence.
For instance, the Law of Large Numbers:
nsim <- 500
pHat<-rep(0, nsim)
for(i in 1:nsim){
flips<-sample(c(0,1), size=i, replace=TRUE)
pHat[i]<-mean(flips)
}
plot(1:nsim, pHat, pch=16)
The data science workflow in R takes the following iterative steps:
Source: R for Data Science https://r4ds.had.co.nz/introduction.html
Kaggle is a website that hosts data analysis competitions and freely available datasets.
Source: https://www.kaggle.com/datasets
The University of California Irvine (UCI) Machine Learning Repository maintains over 600 datasets for the machine learning community for projects and scholarly work.
Storytelling with Data by Cole Nussbaumer Knaflic is an innovative text and online community for data visualization.
She has also put together a list of public/freely available datasets: https://docs.google.com/document/d/1Ads4XsCjXmDrdGRgfmm_OgRdpFcl6Qhs6SOllNGyq7Y/edit
knitr
)Integrate text and reproducible code.
Headings can be used to create an outline structure to organize and navigate an R Markdown document.
Code chunks within R Markdown are used to execute R code and print the respective output.
You can publish your R Markdown documents like coding websites for free on RPubs.
(AllTrails App)
AllTrails is a fitness and travel mobile app used in outdoor recreational activities. AllTrails is commonly used for outdoor activities such as hiking, mountain biking, climbing and snow sports. The service allows users to access a database of trail maps, which includes crowdsourced reviews and images.
Citations:
Data Source: These data come from Kaggle
What variables are available to work with?
str(allTrails)
## 'data.frame': 3313 obs. of 18 variables:
## $ trail_id : int 10020048 10236086 10267857 10236076 10236082 10241620 10236080 10236075 10236084 10327620 ...
## $ name : chr "Harding Ice Field Trail" "Mount Healy Overlook Trail" "Exit Glacier Trail" "Horseshoe Lake Trail" ...
## $ area_name : chr "Kenai Fjords National Park" "Denali National Park" "Kenai Fjords National Park" "Denali National Park" ...
## $ city_name : chr "Seward" "Denali National Park" "Seward" "Denali National Park" ...
## $ state_name : chr "Alaska" "Alaska" "Alaska" "Alaska" ...
## $ country_name : chr "United States" "United States" "United States" "United States" ...
## $ X_geoloc : chr "{'lat': 60.18852, 'lng': -149.63156}" "{'lat': 63.73049, 'lng': -148.91968}" "{'lat': 60.18879, 'lng': -149.631}" "{'lat': 63.73661, 'lng': -148.915}" ...
## $ popularity : num 24.9 18 17.8 16.3 12.6 ...
## $ length : num 15611 6920 2897 3380 29773 ...
## $ elevation_gain : num 1162 508 82 120 1125 ...
## $ difficulty_rating: int 5 3 1 1 5 5 3 3 1 5 ...
## $ route_type : chr "out and back" "out and back" "out and back" "loop" ...
## $ visitor_usage : int 3 1 3 2 1 1 1 1 1 1 ...
## $ avg_rating : num 5 4.5 4.5 4.5 4.5 4.5 4 4 4.5 4.5 ...
## $ num_reviews : int 423 260 224 237 110 43 39 27 21 5 ...
## $ features : chr "['dogs-no', 'forest', 'river', 'views', 'waterfall', 'wild-flowers', 'wildlife']" "['dogs-no', 'forest', 'views', 'wild-flowers', 'wildlife']" "['dogs-no', 'partially-paved', 'views', 'wildlife']" "['dogs-no', 'forest', 'lake', 'kids', 'views', 'wild-flowers', 'wildlife']" ...
## $ activities : chr "['birding', 'camping', 'hiking', 'nature-trips', 'trail-running']" "['birding', 'camping', 'hiking', 'nature-trips', 'walking']" "['hiking', 'walking']" "['birding', 'hiking', 'nature-trips', 'trail-running', 'walking']" ...
## $ units : chr "i" "i" "i" "i" ...
For this activity you will use the following variables:
area_name
: Name of National Parkstate_name
: Name of State where trail is locatedlength
: Length of hike in meterselevation_gain
: Elevation gain in metersroute_type
: Out and back, Loop, or Point to Pointavg_rating
: Star rating (out of 5)dplyr
%>%
Average grade can be calculated by taking the total elevation gain of the trail, dividing by the total distance, multiplied by 100 to equal a percent grade. The elevation and distance must be in the same units.
Citation:
TASK: Add a column to the data frame calculate the average grade for each trail.
## MUTATE
allTrails<-allTrails%>%
mutate(grade=(elevation_gain/length)*100)
str(allTrails)
## 'data.frame': 3313 obs. of 19 variables:
## $ trail_id : int 10020048 10236086 10267857 10236076 10236082 10241620 10236080 10236075 10236084 10327620 ...
## $ name : chr "Harding Ice Field Trail" "Mount Healy Overlook Trail" "Exit Glacier Trail" "Horseshoe Lake Trail" ...
## $ area_name : chr "Kenai Fjords National Park" "Denali National Park" "Kenai Fjords National Park" "Denali National Park" ...
## $ city_name : chr "Seward" "Denali National Park" "Seward" "Denali National Park" ...
## $ state_name : chr "Alaska" "Alaska" "Alaska" "Alaska" ...
## $ country_name : chr "United States" "United States" "United States" "United States" ...
## $ X_geoloc : chr "{'lat': 60.18852, 'lng': -149.63156}" "{'lat': 63.73049, 'lng': -148.91968}" "{'lat': 60.18879, 'lng': -149.631}" "{'lat': 63.73661, 'lng': -148.915}" ...
## $ popularity : num 24.9 18 17.8 16.3 12.6 ...
## $ length : num 15611 6920 2897 3380 29773 ...
## $ elevation_gain : num 1162 508 82 120 1125 ...
## $ difficulty_rating: int 5 3 1 1 5 5 3 3 1 5 ...
## $ route_type : chr "out and back" "out and back" "out and back" "loop" ...
## $ visitor_usage : int 3 1 3 2 1 1 1 1 1 1 ...
## $ avg_rating : num 5 4.5 4.5 4.5 4.5 4.5 4 4 4.5 4.5 ...
## $ num_reviews : int 423 260 224 237 110 43 39 27 21 5 ...
## $ features : chr "['dogs-no', 'forest', 'river', 'views', 'waterfall', 'wild-flowers', 'wildlife']" "['dogs-no', 'forest', 'views', 'wild-flowers', 'wildlife']" "['dogs-no', 'partially-paved', 'views', 'wildlife']" "['dogs-no', 'forest', 'lake', 'kids', 'views', 'wild-flowers', 'wildlife']" ...
## $ activities : chr "['birding', 'camping', 'hiking', 'nature-trips', 'trail-running']" "['birding', 'camping', 'hiking', 'nature-trips', 'walking']" "['hiking', 'walking']" "['birding', 'hiking', 'nature-trips', 'trail-running', 'walking']" ...
## $ units : chr "i" "i" "i" "i" ...
## $ grade : num 7.44 7.34 2.83 3.54 3.78 ...
The AllTrails website reports trail length and elevation gain in miles and feet, respectively.
See example here: https://www.alltrails.com/trail/us/washington/sentinell-peak-via-grey-wolf-deer-loop
TASK: Convert the values for length and elevation to miles and feet, respectively.
allTrails<-allTrails%>%
mutate(lengthMiles=length*0.000621371,
elevationFt=elevation_gain*3.28084)
str(allTrails)
## 'data.frame': 3313 obs. of 21 variables:
## $ trail_id : int 10020048 10236086 10267857 10236076 10236082 10241620 10236080 10236075 10236084 10327620 ...
## $ name : chr "Harding Ice Field Trail" "Mount Healy Overlook Trail" "Exit Glacier Trail" "Horseshoe Lake Trail" ...
## $ area_name : chr "Kenai Fjords National Park" "Denali National Park" "Kenai Fjords National Park" "Denali National Park" ...
## $ city_name : chr "Seward" "Denali National Park" "Seward" "Denali National Park" ...
## $ state_name : chr "Alaska" "Alaska" "Alaska" "Alaska" ...
## $ country_name : chr "United States" "United States" "United States" "United States" ...
## $ X_geoloc : chr "{'lat': 60.18852, 'lng': -149.63156}" "{'lat': 63.73049, 'lng': -148.91968}" "{'lat': 60.18879, 'lng': -149.631}" "{'lat': 63.73661, 'lng': -148.915}" ...
## $ popularity : num 24.9 18 17.8 16.3 12.6 ...
## $ length : num 15611 6920 2897 3380 29773 ...
## $ elevation_gain : num 1162 508 82 120 1125 ...
## $ difficulty_rating: int 5 3 1 1 5 5 3 3 1 5 ...
## $ route_type : chr "out and back" "out and back" "out and back" "loop" ...
## $ visitor_usage : int 3 1 3 2 1 1 1 1 1 1 ...
## $ avg_rating : num 5 4.5 4.5 4.5 4.5 4.5 4 4 4.5 4.5 ...
## $ num_reviews : int 423 260 224 237 110 43 39 27 21 5 ...
## $ features : chr "['dogs-no', 'forest', 'river', 'views', 'waterfall', 'wild-flowers', 'wildlife']" "['dogs-no', 'forest', 'views', 'wild-flowers', 'wildlife']" "['dogs-no', 'partially-paved', 'views', 'wildlife']" "['dogs-no', 'forest', 'lake', 'kids', 'views', 'wild-flowers', 'wildlife']" ...
## $ activities : chr "['birding', 'camping', 'hiking', 'nature-trips', 'trail-running']" "['birding', 'camping', 'hiking', 'nature-trips', 'walking']" "['hiking', 'walking']" "['birding', 'hiking', 'nature-trips', 'trail-running', 'walking']" ...
## $ units : chr "i" "i" "i" "i" ...
## $ grade : num 7.44 7.34 2.83 3.54 3.78 ...
## $ lengthMiles : num 9.7 4.3 1.8 2.1 18.5 ...
## $ elevationFt : num 3812 1666 269 393 3690 ...
Our family likes to hike together! We have three children (ages 2, 7, and 11) so we have some limitations.
TASK: Find the best hike that fits ALL of the following conditions:
Note: Best is defined by the highest average rating
allTrails%>%
filter(state_name=="Oregon",
route_type=="loop",
lengthMiles<3,
grade<5)%>%
arrange(desc(avg_rating))
## trail_id name area_name
## 1 10016688 Sun Notch Trail Crater Lake National Park
## 2 10013161 Godfrey Glen Trail Crater Lake National Park
## 3 10015976 Annie Creek Canyon Trail Crater Lake National Park
## 4 10012733 Castle Crest Wildflower Garden Trail Crater Lake National Park
## 5 10236071 Lady of the Woods Trail Crater Lake National Park
## city_name state_name country_name X_geoloc
## 1 Crater Lake Oregon United States {'lat': 42.9, 'lng': -122.09549}
## 2 Chiloquin Oregon United States {'lat': 42.86676, 'lng': -122.14558}
## 3 Chiloquin Oregon United States {'lat': 42.86542, 'lng': -122.16247}
## 4 Chiloquin Oregon United States {'lat': 42.89664, 'lng': -122.13375}
## 5 Crater Lake Oregon United States {'lat': 42.89698, 'lng': -122.13413}
## popularity length elevation_gain difficulty_rating route_type visitor_usage
## 1 13.9272 1287.472 38.7096 1 loop 1
## 2 8.8741 1770.274 19.8120 1 loop NA
## 3 7.5299 3379.614 92.9640 3 loop NA
## 4 7.0665 1931.208 36.8808 1 loop 1
## 5 4.5346 1126.538 33.8328 1 loop 1
## avg_rating num_reviews
## 1 4.5 98
## 2 4.0 23
## 3 4.0 40
## 4 4.0 24
## 5 3.0 19
## features
## 1 ['dogs-no', 'forest', 'lake', 'kids', 'views', 'wild-flowers']
## 2 ['dogs-no', 'forest', 'kids', 'views', 'wild-flowers']
## 3 ['dogs-no', 'views', 'wild-flowers']
## 4 ['dogs-no', 'forest', 'kids', 'views', 'wild-flowers', 'wildlife']
## 5 ['dogs-leash', 'forest', 'kids', 'river', 'views', 'wild-flowers']
## activities units
## 1 ['birding', 'hiking', 'snowshoeing', 'trail-running', 'walking'] i
## 2 ['hiking', 'walking'] i
## 3 ['hiking'] i
## 4 ['hiking', 'walking'] i
## 5 ['hiking', 'walking'] i
## grade lengthMiles elevationFt
## 1 3.006636 0.7999978 127
## 2 1.119149 1.0999969 65
## 3 2.750728 2.0999941 305
## 4 1.909727 1.1999966 121
## 5 3.003254 0.6999980 111
TASK: Find the number of trails and the average star rating within each National Park.
# Which national park has the most trails?
nTrails<-allTrails%>%
group_by(state_name, area_name)%>%
count()%>%
arrange(desc(n))
head(nTrails)
## # A tibble: 6 × 3
## # Groups: state_name, area_name [6]
## state_name area_name n
## <chr> <chr> <int>
## 1 California Yosemite National Park 242
## 2 Wyoming Yellowstone National Park 209
## 3 Colorado Rocky Mountain National Park 207
## 4 Virginia Shenandoah National Park 187
## 5 Maine Acadia National Park 179
## 6 Tennessee Great Smoky Mountains National Park 175
# Which national park has the highest rated trails?
ratedTrails<-allTrails%>%
group_by(area_name)%>%
summarise(avgRate=mean(avg_rating, na.rm=TRUE))%>%
arrange(desc(avgRate))
head(ratedTrails)
## # A tibble: 6 × 2
## area_name avgRate
## <chr> <dbl>
## 1 Kenai Fjords National Park 4.75
## 2 Haleakala National Park 4.57
## 3 Dry Tortugas National Park 4.5
## 4 Fort Pickens National Park 4.5
## 5 Wolf Trap National Park for the Performing Arts 4.5
## 6 Mount Rainier National Park 4.43
States are grouped by regions (included in R) with the following data frame:
stateRegion<-data.frame(state_name=state.name,
region=state.region)
head(stateRegion)
## state_name region
## 1 Alabama South
## 2 Alaska West
## 3 Arizona West
## 4 Arkansas South
## 5 California West
## 6 Colorado West
TASK: Add a column for region to the data frame
allTrailsReg<-allTrails%>%
left_join(stateRegion)
## Joining with `by = join_by(state_name)`
str(allTrailsReg)
## 'data.frame': 3313 obs. of 22 variables:
## $ trail_id : int 10020048 10236086 10267857 10236076 10236082 10241620 10236080 10236075 10236084 10327620 ...
## $ name : chr "Harding Ice Field Trail" "Mount Healy Overlook Trail" "Exit Glacier Trail" "Horseshoe Lake Trail" ...
## $ area_name : chr "Kenai Fjords National Park" "Denali National Park" "Kenai Fjords National Park" "Denali National Park" ...
## $ city_name : chr "Seward" "Denali National Park" "Seward" "Denali National Park" ...
## $ state_name : chr "Alaska" "Alaska" "Alaska" "Alaska" ...
## $ country_name : chr "United States" "United States" "United States" "United States" ...
## $ X_geoloc : chr "{'lat': 60.18852, 'lng': -149.63156}" "{'lat': 63.73049, 'lng': -148.91968}" "{'lat': 60.18879, 'lng': -149.631}" "{'lat': 63.73661, 'lng': -148.915}" ...
## $ popularity : num 24.9 18 17.8 16.3 12.6 ...
## $ length : num 15611 6920 2897 3380 29773 ...
## $ elevation_gain : num 1162 508 82 120 1125 ...
## $ difficulty_rating: int 5 3 1 1 5 5 3 3 1 5 ...
## $ route_type : chr "out and back" "out and back" "out and back" "loop" ...
## $ visitor_usage : int 3 1 3 2 1 1 1 1 1 1 ...
## $ avg_rating : num 5 4.5 4.5 4.5 4.5 4.5 4 4 4.5 4.5 ...
## $ num_reviews : int 423 260 224 237 110 43 39 27 21 5 ...
## $ features : chr "['dogs-no', 'forest', 'river', 'views', 'waterfall', 'wild-flowers', 'wildlife']" "['dogs-no', 'forest', 'views', 'wild-flowers', 'wildlife']" "['dogs-no', 'partially-paved', 'views', 'wildlife']" "['dogs-no', 'forest', 'lake', 'kids', 'views', 'wild-flowers', 'wildlife']" ...
## $ activities : chr "['birding', 'camping', 'hiking', 'nature-trips', 'trail-running']" "['birding', 'camping', 'hiking', 'nature-trips', 'walking']" "['hiking', 'walking']" "['birding', 'hiking', 'nature-trips', 'trail-running', 'walking']" ...
## $ units : chr "i" "i" "i" "i" ...
## $ grade : num 7.44 7.34 2.83 3.54 3.78 ...
## $ lengthMiles : num 9.7 4.3 1.8 2.1 18.5 ...
## $ elevationFt : num 3812 1666 269 393 3690 ...
## $ region : Factor w/ 4 levels "Northeast","South",..: 4 4 4 4 4 4 4 4 4 4 ...
ggplot2
ggplot(allTrailsReg, aes(x=region))+
geom_bar()
## STACKED BAR AS DEFAULT
ggplot(allTrailsReg, aes(x=region, fill=route_type))+
geom_bar()
## DODGE
ggplot(allTrailsReg, aes(x=region, fill=route_type))+
geom_bar(position="dodge")
## FILL
ggplot(allTrailsReg, aes(x=region, fill=route_type))+
geom_bar(position="fill")
ggplot(allTrailsReg, aes(x=avg_rating ))+
geom_histogram(bins=10)
ggplot(allTrailsReg, aes(x=grade ))+
geom_density()
## Warning: Removed 2 rows containing non-finite values (`stat_density()`).
ggplot(allTrailsReg, aes(x=grade ))+
geom_boxplot()
## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).
Comparing across groups
## COLOR AS A GROUPING VARIABLE
ggplot(allTrailsReg, aes(y=grade , fill=region))+
geom_boxplot()
## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).
## FACET
ggplot(allTrailsReg, aes(y=grade , fill=region))+
geom_boxplot()+
facet_grid(.~region)
## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).
Summarise state data
stateNP<-allTrailsReg%>%
group_by(state_name)%>%
summarise(stateTrails=n(),
avgPop=mean(popularity, na.rm=TRUE),
avgElev=mean(elevation_gain, na.rm=TRUE))
Map data in R:
#install.packages("usmap")
library(usmap)
states <- usmap::us_map()
head(states)
## x y order hole piece group fips abbr full
## 1 1091779 -1380695 1 FALSE 1 01.1 01 AL Alabama
## 2 1091268 -1376372 2 FALSE 1 01.1 01 AL Alabama
## 3 1091140 -1362998 3 FALSE 1 01.1 01 AL Alabama
## 4 1090940 -1343517 4 FALSE 1 01.1 01 AL Alabama
## 5 1090913 -1341006 5 FALSE 1 01.1 01 AL Alabama
## 6 1090796 -1334480 6 FALSE 1 01.1 01 AL Alabama
Oregon map:
## POINTS
oregon<-states%>%
filter(full=="Oregon")
ggplot(oregon, aes(x, y))+
geom_point()
## LINES (connect the dots)
ggplot(oregon, aes(x, y))+
geom_line()
## PATH
ggplot(oregon, aes(x, y, group=group))+
geom_path()
## POLYGON
ggplot(oregon, aes(x, y, group=group))+
geom_polygon(fill="forestgreen")
Join map data
stateNP_Map<-states%>%
mutate(state_name=full)%>%
left_join(stateNP)
## Joining with `by = join_by(state_name)`
Make a map
stateNP_Map%>%
ggplot(aes(x, y, group = group)) +
geom_polygon(aes(fill = stateTrails),color="black")+
theme_bw()+
coord_equal()
Change the color palette:
#install.packages("viridis")
library(viridis)
## Loading required package: viridisLite
stateNP_Map%>%
ggplot(aes(x, y, group = group)) +
geom_polygon(aes(fill = stateTrails),color="black")+
theme_bw()+
coord_equal()+
ggtitle("California has the MOST trails, but...")+
scale_fill_viridis(option="viridis", direction = 1)
stateNP_Map%>%
ggplot(aes(x, y, group = group)) +
geom_polygon(aes(fill = avgPop),color="black")+
theme_bw()+
coord_equal()+
ggtitle("..Oregon trails are the MOST popular")+
scale_fill_viridis(option="viridis", direction = 1)
Let’s shift gears and work with the College
data set in
the ISRL
package.
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
We want to predict the number of students who
Enroll
:
ggplot(College, aes(Apps, Enroll))+
geom_point()
It looks like we might have an outlier
collegeWO<-College%>%
filter(Apps<30000)
ggplot(collegeWO, aes(Apps, Enroll))+
geom_point()
What would the linear model be?
## SIMPLE LINEAR MOD
mod<-lm(Enroll~Apps, data=College)
summary(mod)
##
## Call:
## lm(formula = Enroll ~ Apps, data = College)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5427.7 -148.2 -81.6 45.0 3279.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.697e+02 2.246e+01 7.557 1.16e-13 ***
## Apps 2.033e-01 4.587e-03 44.323 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 494.5 on 775 degrees of freedom
## Multiple R-squared: 0.7171, Adjusted R-squared: 0.7167
## F-statistic: 1965 on 1 and 775 DF, p-value: < 2.2e-16
Lets try to verify the model summary by learning about matrix algebra in R.
First, create your response vector, \(Y\), and look at the dimensions. We will also want to keep track of the sample size.
## RESPONSE VECTOR
Y<-as.matrix(College$Enroll)
head(Y)
## [,1]
## [1,] 721
## [2,] 512
## [3,] 336
## [4,] 137
## [5,] 55
## [6,] 158
## dimensions
dim(Y)
## [1] 777 1
n<-dim(Y)[1]
X<-matrix(c(rep(1, n),
College$Apps),
ncol=2,
byrow=FALSE)
head(X)
## [,1] [,2]
## [1,] 1 1660
## [2,] 1 2186
## [3,] 1 1428
## [4,] 1 417
## [5,] 1 193
## [6,] 1 587
## Least squares estimates
## solve() = inverse
## t() = transpose
## %*% = matrix multiple
betaHat<-solve(t(X)%*%X)%*%t(X)%*%Y
betaHat
## [,1]
## [1,] 169.712809
## [2,] 0.203309
plot(mod)