Students will demonstrate their ability to translate questions into code using tidyverse packages and verbs.
dplyr: filter(), mutate(), group_by(), summarise(), count(), etc.
ggplot2: ggplot(), aes(), geom_bar(), geom_col(), geom_histogram(), geom_boxplot()
library(tidyverse)
(AllTrails App)
AllTrails is a fitness and travel mobile app used in outdoor recreational activities. AllTrails is commonly used for outdoor activities such as hiking, mountain biking, climbing and snow sports. The service allows users to access a database of trail maps, which includes crowdsourced reviews and images.
Citations:
Data Source: These data come from Kaggle
allTrails <- read_csv("https://raw.githubusercontent.com/kitadasmalley/DATA151/main/Data/AllTrails%20data%20-%20nationalpark.csv")
What variables are available to work with?
str(allTrails)
## spec_tbl_df [3,313 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ trail_id : num [1:3313] 10020048 10236086 10267857 10236076 10236082 ...
## $ name : chr [1:3313] "Harding Ice Field Trail" "Mount Healy Overlook Trail" "Exit Glacier Trail" "Horseshoe Lake Trail" ...
## $ area_name : chr [1:3313] "Kenai Fjords National Park" "Denali National Park" "Kenai Fjords National Park" "Denali National Park" ...
## $ city_name : chr [1:3313] "Seward" "Denali National Park" "Seward" "Denali National Park" ...
## $ state_name : chr [1:3313] "Alaska" "Alaska" "Alaska" "Alaska" ...
## $ country_name : chr [1:3313] "United States" "United States" "United States" "United States" ...
## $ _geoloc : chr [1:3313] "{'lat': 60.18852, 'lng': -149.63156}" "{'lat': 63.73049, 'lng': -148.91968}" "{'lat': 60.18879, 'lng': -149.631}" "{'lat': 63.73661, 'lng': -148.915}" ...
## $ popularity : num [1:3313] 24.9 18 17.8 16.3 12.6 ...
## $ length : num [1:3313] 15611 6920 2897 3380 29773 ...
## $ elevation_gain : num [1:3313] 1162 508 82 120 1125 ...
## $ difficulty_rating: num [1:3313] 5 3 1 1 5 5 3 3 1 5 ...
## $ route_type : chr [1:3313] "out and back" "out and back" "out and back" "loop" ...
## $ visitor_usage : num [1:3313] 3 1 3 2 1 1 1 1 1 1 ...
## $ avg_rating : num [1:3313] 5 4.5 4.5 4.5 4.5 4.5 4 4 4.5 4.5 ...
## $ num_reviews : num [1:3313] 423 260 224 237 110 43 39 27 21 5 ...
## $ features : chr [1:3313] "['dogs-no', 'forest', 'river', 'views', 'waterfall', 'wild-flowers', 'wildlife']" "['dogs-no', 'forest', 'views', 'wild-flowers', 'wildlife']" "['dogs-no', 'partially-paved', 'views', 'wildlife']" "['dogs-no', 'forest', 'lake', 'kids', 'views', 'wild-flowers', 'wildlife']" ...
## $ activities : chr [1:3313] "['birding', 'camping', 'hiking', 'nature-trips', 'trail-running']" "['birding', 'camping', 'hiking', 'nature-trips', 'walking']" "['hiking', 'walking']" "['birding', 'hiking', 'nature-trips', 'trail-running', 'walking']" ...
## $ units : chr [1:3313] "i" "i" "i" "i" ...
## - attr(*, "spec")=
## .. cols(
## .. trail_id = col_double(),
## .. name = col_character(),
## .. area_name = col_character(),
## .. city_name = col_character(),
## .. state_name = col_character(),
## .. country_name = col_character(),
## .. `_geoloc` = col_character(),
## .. popularity = col_double(),
## .. length = col_double(),
## .. elevation_gain = col_double(),
## .. difficulty_rating = col_double(),
## .. route_type = col_character(),
## .. visitor_usage = col_double(),
## .. avg_rating = col_double(),
## .. num_reviews = col_double(),
## .. features = col_character(),
## .. activities = col_character(),
## .. units = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
For this assessment you will use the following variables:
area_name: Name of National Park
state_name: Name of State where trail is located
length: Length of hike in meters
elevation_gain: Elevation gain in meters
route_type: Out and back, Loop, or Point to Point
avg_rating: Star rating (out of 5)
(5 points)
Average grade is calculated by dividing the trail's total elevation gain by its total distance, then multiplying by 100 to give a percent grade. The elevation gain and distance must be in the same units.
Citation:
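As a quick check of the formula, the sketch below computes the grade for one trail using the rounded values of the first trail in the `str()` output above (about 1162 m of gain over 15611 m). The variable names are illustrative only and are not part of the assessment.

```r
# Illustrative values: rounded figures for the first trail in the str() output
gain_m   <- 1162    # elevation gain in meters
length_m <- 15611   # trail length in meters (same units as the gain)

# Average grade as a percent: rise over run, times 100
grade_pct <- gain_m / length_m * 100
round(grade_pct, 2)
```

With these rounded inputs the result is about 7.44 percent.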
TASK: Add a column to the data frame to calculate the average grade for each trail.
Please do NOT create a new data frame; instead, overwrite the existing data frame with the new column added. You will use it in subsequent parts.
## MUTATE
allTrails<-allTrails%>%
mutate(grade=(elevation_gain/length)*100)
str(allTrails)
## tibble [3,313 × 19] (S3: tbl_df/tbl/data.frame)
## $ trail_id : num [1:3313] 10020048 10236086 10267857 10236076 10236082 ...
## $ name : chr [1:3313] "Harding Ice Field Trail" "Mount Healy Overlook Trail" "Exit Glacier Trail" "Horseshoe Lake Trail" ...
## $ area_name : chr [1:3313] "Kenai Fjords National Park" "Denali National Park" "Kenai Fjords National Park" "Denali National Park" ...
## $ city_name : chr [1:3313] "Seward" "Denali National Park" "Seward" "Denali National Park" ...
## $ state_name : chr [1:3313] "Alaska" "Alaska" "Alaska" "Alaska" ...
## $ country_name : chr [1:3313] "United States" "United States" "United States" "United States" ...
## $ _geoloc : chr [1:3313] "{'lat': 60.18852, 'lng': -149.63156}" "{'lat': 63.73049, 'lng': -148.91968}" "{'lat': 60.18879, 'lng': -149.631}" "{'lat': 63.73661, 'lng': -148.915}" ...
## $ popularity : num [1:3313] 24.9 18 17.8 16.3 12.6 ...
## $ length : num [1:3313] 15611 6920 2897 3380 29773 ...
## $ elevation_gain : num [1:3313] 1162 508 82 120 1125 ...
## $ difficulty_rating: num [1:3313] 5 3 1 1 5 5 3 3 1 5 ...
## $ route_type : chr [1:3313] "out and back" "out and back" "out and back" "loop" ...
## $ visitor_usage : num [1:3313] 3 1 3 2 1 1 1 1 1 1 ...
## $ avg_rating : num [1:3313] 5 4.5 4.5 4.5 4.5 4.5 4 4 4.5 4.5 ...
## $ num_reviews : num [1:3313] 423 260 224 237 110 43 39 27 21 5 ...
## $ features : chr [1:3313] "['dogs-no', 'forest', 'river', 'views', 'waterfall', 'wild-flowers', 'wildlife']" "['dogs-no', 'forest', 'views', 'wild-flowers', 'wildlife']" "['dogs-no', 'partially-paved', 'views', 'wildlife']" "['dogs-no', 'forest', 'lake', 'kids', 'views', 'wild-flowers', 'wildlife']" ...
## $ activities : chr [1:3313] "['birding', 'camping', 'hiking', 'nature-trips', 'trail-running']" "['birding', 'camping', 'hiking', 'nature-trips', 'walking']" "['hiking', 'walking']" "['birding', 'hiking', 'nature-trips', 'trail-running', 'walking']" ...
## $ units : chr [1:3313] "i" "i" "i" "i" ...
## $ grade : num [1:3313] 7.44 7.34 2.83 3.54 3.78 ...
(3 points)
TASK: Make a histogram showing the distribution of trail grades. Don’t forget to title your graph.
# HISTOGRAM
ggplot(allTrails, aes(x=grade))+
geom_histogram(bins=10)+
ggtitle("Skewed Distribution of Trail Grades")
## Warning: Removed 2 rows containing non-finite values (stat_bin).
# DENSITY
ggplot(allTrails, aes(x=grade))+
geom_density()+
ggtitle("Skewed Distribution of Trail Grades")
## Warning: Removed 2 rows containing non-finite values (stat_density).
# BOXPLOT
ggplot(allTrails, aes(x=grade))+
geom_boxplot()+
ggtitle("Skewed Distribution of Trail Grades")
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
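The warnings above report two rows with non-finite `grade` values. One plausible cause (an assumption, not verified against the data) is a trail whose `length` is 0, since dividing by zero in R yields `Inf` or `NaN`. The sketch below uses a made-up tibble, `toy`, with hypothetical values to show how such rows could be located and set aside before plotting.

```r
library(dplyr)

# Hypothetical toy data: two rows have a length of 0
toy <- tibble(
  name           = c("A", "B", "C", "D"),
  length         = c(1000, 0, 2500, 0),
  elevation_gain = c(50, 10, 100, 0)
) %>%
  mutate(grade = elevation_gain / length * 100)

# Rows where grade is Inf, -Inf, NaN, or NA
bad_rows <- toy %>% filter(!is.finite(grade))

# Rows that are safe to plot
finite_rows <- toy %>% filter(is.finite(grade))
```

Inspecting `bad_rows` first makes it easier to decide whether dropping those trails is appropriate.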
(5 points)
TASK: Comment on the shape of this histogram
## INSERT HERE ##
# SHAPE
# MODALITY
# CENTER
# SPREAD
# OUTLIERS?
(5 points)
The AllTrails website reports trail length and elevation gain in miles and feet, respectively.
See example here: https://www.alltrails.com/trail/us/washington/sentinell-peak-via-grey-wolf-deer-loop
TASK: Convert the values for length and elevation to miles and feet, respectively.
Please do NOT create a new data frame; instead, overwrite the existing data frame with the new columns added. You will use them in subsequent parts.
allTrails<-allTrails%>%
mutate(lengthMiles=length*0.000621371,
elevationFt=elevation_gain*3.28084)
str(allTrails)
## tibble [3,313 × 21] (S3: tbl_df/tbl/data.frame)
## $ trail_id : num [1:3313] 10020048 10236086 10267857 10236076 10236082 ...
## $ name : chr [1:3313] "Harding Ice Field Trail" "Mount Healy Overlook Trail" "Exit Glacier Trail" "Horseshoe Lake Trail" ...
## $ area_name : chr [1:3313] "Kenai Fjords National Park" "Denali National Park" "Kenai Fjords National Park" "Denali National Park" ...
## $ city_name : chr [1:3313] "Seward" "Denali National Park" "Seward" "Denali National Park" ...
## $ state_name : chr [1:3313] "Alaska" "Alaska" "Alaska" "Alaska" ...
## $ country_name : chr [1:3313] "United States" "United States" "United States" "United States" ...
## $ _geoloc : chr [1:3313] "{'lat': 60.18852, 'lng': -149.63156}" "{'lat': 63.73049, 'lng': -148.91968}" "{'lat': 60.18879, 'lng': -149.631}" "{'lat': 63.73661, 'lng': -148.915}" ...
## $ popularity : num [1:3313] 24.9 18 17.8 16.3 12.6 ...
## $ length : num [1:3313] 15611 6920 2897 3380 29773 ...
## $ elevation_gain : num [1:3313] 1162 508 82 120 1125 ...
## $ difficulty_rating: num [1:3313] 5 3 1 1 5 5 3 3 1 5 ...
## $ route_type : chr [1:3313] "out and back" "out and back" "out and back" "loop" ...
## $ visitor_usage : num [1:3313] 3 1 3 2 1 1 1 1 1 1 ...
## $ avg_rating : num [1:3313] 5 4.5 4.5 4.5 4.5 4.5 4 4 4.5 4.5 ...
## $ num_reviews : num [1:3313] 423 260 224 237 110 43 39 27 21 5 ...
## $ features : chr [1:3313] "['dogs-no', 'forest', 'river', 'views', 'waterfall', 'wild-flowers', 'wildlife']" "['dogs-no', 'forest', 'views', 'wild-flowers', 'wildlife']" "['dogs-no', 'partially-paved', 'views', 'wildlife']" "['dogs-no', 'forest', 'lake', 'kids', 'views', 'wild-flowers', 'wildlife']" ...
## $ activities : chr [1:3313] "['birding', 'camping', 'hiking', 'nature-trips', 'trail-running']" "['birding', 'camping', 'hiking', 'nature-trips', 'walking']" "['hiking', 'walking']" "['birding', 'hiking', 'nature-trips', 'trail-running', 'walking']" ...
## $ units : chr [1:3313] "i" "i" "i" "i" ...
## $ grade : num [1:3313] 7.44 7.34 2.83 3.54 3.78 ...
## $ lengthMiles : num [1:3313] 9.7 4.3 1.8 2.1 18.5 ...
## $ elevationFt : num [1:3313] 3812 1666 269 393 3690 ...
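The conversion factors used above can be sanity-checked from the exact definitions of the mile (1609.344 m) and the foot (0.3048 m). The sketch below is a check only; its variable names are illustrative, and the trail values are the rounded first-row figures from the `str()` output.

```r
# Exact definitions: 1 mile = 1609.344 m, 1 foot = 0.3048 m
m_per_mile <- 1609.344
m_per_foot <- 0.3048

miles_per_m <- 1 / m_per_mile   # approximately 0.000621371
feet_per_m  <- 1 / m_per_foot   # approximately 3.28084

# Rounded first-trail values from the str() output
length_miles <- 15611 * miles_per_m
elevation_ft <- 1162 * feet_per_m
round(c(length_miles, elevation_ft), 1)
```

The results, roughly 9.7 miles and 3812 feet, agree with the first entries of `lengthMiles` and `elevationFt`.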
(4 points)
Create a side-by-side box plot to compare the distributions of elevation gain in feet across the three route types (loop, out and back, point to point). Please fill the boxes with color for each route type.
## SIDE-BY-SIDE BOXPLOT
ggplot(allTrails, aes(y=elevationFt , fill=route_type))+
geom_boxplot()
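An alternative sketch maps `route_type` to the x-axis as well as to `fill`, so each box gets its own labeled position. It uses a small made-up data frame (`toy`, hypothetical values) in place of `allTrails` so that it runs on its own; the title and axis labels are also illustrative choices.

```r
library(ggplot2)

# Hypothetical toy data standing in for allTrails
toy <- data.frame(
  route_type  = rep(c("loop", "out and back", "point to point"), each = 4),
  elevationFt = c(100, 250, 400, 900,
                  300, 800, 1500, 3000,
                  500, 1200, 2000, 4500)
)

# One box per route type, filled by route type, with a title and axis labels
p <- ggplot(toy, aes(x = route_type, y = elevationFt, fill = route_type)) +
  geom_boxplot() +
  labs(title = "Elevation Gain by Route Type",
       x = "Route type", y = "Elevation gain (ft)")
p
```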
(10 points)
Our family likes to hike together! We have three children (ages 2, 7, and 11) so we have some limitations.
TASK: Find the best hike that fits ALL of the following conditions:
Located in Oregon
Loop route
Less than 3 miles long
Average grade below 5%
Note: Best is defined by the highest average rating
allTrails%>%
filter(state_name=="Oregon",
route_type=="loop",
lengthMiles<3,
grade<5)%>%
arrange(desc(avg_rating))
## # A tibble: 5 × 21
## trail_id name area_…¹ city_…² state…³ count…⁴ _geol…⁵ popul…⁶ length eleva…⁷
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 10016688 Sun N… Crater… Crater… Oregon United… {'lat'… 13.9 1287. 38.7
## 2 10013161 Godfr… Crater… Chiloq… Oregon United… {'lat'… 8.87 1770. 19.8
## 3 10015976 Annie… Crater… Chiloq… Oregon United… {'lat'… 7.53 3380. 93.0
## 4 10012733 Castl… Crater… Chiloq… Oregon United… {'lat'… 7.07 1931. 36.9
## 5 10236071 Lady … Crater… Crater… Oregon United… {'lat'… 4.53 1127. 33.8
## # … with 11 more variables: difficulty_rating <dbl>, route_type <chr>,
## # visitor_usage <dbl>, avg_rating <dbl>, num_reviews <dbl>, features <chr>,
## # activities <chr>, units <chr>, grade <dbl>, lengthMiles <dbl>,
## # elevationFt <dbl>, and abbreviated variable names ¹area_name, ²city_name,
## # ³state_name, ⁴country_name, ⁵`_geoloc`, ⁶popularity, ⁷elevation_gain
Respond in a full sentence, stating which hike is the best family-friendly hike in an Oregon National Park:
## INSERT ANSWER HERE ##
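Because "best" is defined as the highest average rating, dplyr's `slice_max()` is one way to keep only the top-rated row(s) after filtering, rather than sorting and reading off the first row. The sketch below runs on a made-up tibble (`toy`, with hypothetical trail names and values), not on the real data.

```r
library(dplyr)

# Hypothetical toy data standing in for allTrails
toy <- tibble(
  name        = c("Trail A", "Trail B", "Trail C", "Trail D"),
  state_name  = c("Oregon", "Oregon", "Oregon", "Utah"),
  route_type  = c("loop", "loop", "out and back", "loop"),
  lengthMiles = c(0.8, 2.1, 1.0, 1.5),
  grade       = c(3.0, 1.1, 2.0, 2.5),
  avg_rating  = c(4.5, 4.0, 5.0, 5.0)
)

# Apply all conditions, then keep the highest-rated trail(s)
best <- toy %>%
  filter(state_name == "Oregon",
         route_type == "loop",
         lengthMiles < 3,
         grade < 5) %>%
  slice_max(avg_rating, n = 1)   # ties are kept by default
```

Note that `slice_max()` returns every tied row, so a tie for the top rating would still need a tiebreaker.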
(5 points)
TASK: Find the number of trails and the average star rating within each National Park.
Create a new data frame to accomplish this so that we can use this in the following step.
Hint: You can group by two variables. Please do this to keep the State in which each National Park is located.
# Which national park has the most trails?
nTrails<-allTrails%>%
group_by(area_name)%>%
count()%>%
arrange(desc(n))
head(nTrails)
## # A tibble: 6 × 2
## # Groups: area_name [6]
## area_name n
## <chr> <int>
## 1 Great Smoky Mountains National Park 293
## 2 Yosemite National Park 242
## 3 Yellowstone National Park 228
## 4 Rocky Mountain National Park 207
## 5 Shenandoah National Park 187
## 6 Acadia National Park 179
# Which national park has the highest rated trails?
ratedTrails<-allTrails%>%
group_by(area_name)%>%
summarise(avgRate=mean(avg_rating, na.rm=TRUE))%>%
arrange(desc(avgRate))
head(ratedTrails)
## # A tibble: 6 × 2
## area_name avgRate
## <chr> <dbl>
## 1 Kenai Fjords National Park 4.75
## 2 Haleakala National Park 4.57
## 3 Dry Tortugas National Park 4.5
## 4 Fort Pickens National Park 4.5
## 5 Wolf Trap National Park for the Performing Arts 4.5
## 6 Mount Rainier National Park 4.43
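The task and hint can also be addressed in a single pipeline: group by both state and park, then compute the trail count and the mean rating together in one `summarise()` call. The sketch below uses a made-up tibble (`toy`); the names `parkSummary`, `nTrails`, and `avgRate` are illustrative assumptions.

```r
library(dplyr)

# Hypothetical toy data standing in for allTrails
toy <- tibble(
  state_name = c("Oregon", "Oregon", "Utah", "Utah", "Utah"),
  area_name  = c("Park X", "Park X", "Park Y", "Park Y", "Park Z"),
  avg_rating = c(4.5, 3.5, 5.0, 4.0, 3.0)
)

# One row per state/park pair with its trail count and mean rating
parkSummary <- toy %>%
  group_by(state_name, area_name) %>%
  summarise(nTrails = n(),
            avgRate = mean(avg_rating, na.rm = TRUE),
            .groups = "drop") %>%
  arrange(desc(nTrails))
```

Keeping `state_name` in the grouping means a park that spans several states appears once per state.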
(5 points)
TASK: Find the State with the most National Parks.
Hint: You should use the data frame you created in the previous step.
nStateArea<-allTrails%>%
group_by(state_name, area_name)%>%
count()%>%
arrange(desc(n))
#View(nStateArea)
head(nStateArea)
## # A tibble: 6 × 3
## # Groups: state_name, area_name [6]
## state_name area_name n
## <chr> <chr> <int>
## 1 California Yosemite National Park 242
## 2 Wyoming Yellowstone National Park 209
## 3 Colorado Rocky Mountain National Park 207
## 4 Virginia Shenandoah National Park 187
## 5 Maine Acadia National Park 179
## 6 Tennessee Great Smoky Mountains National Park 175
nState<-nStateArea%>%
group_by(state_name)%>%
count()%>%
arrange(desc(n))
head(nState)
## # A tibble: 6 × 2
## # Groups: state_name [6]
## state_name n
## <chr> <int>
## 1 California 9
## 2 Utah 5
## 3 Alaska 4
## 4 Colorado 4
## 5 Florida 4
## 6 Arizona 3
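A more compact sketch of the same idea uses `distinct()` to keep one row per state/park pair and `count()` with `sort = TRUE` to tally parks per state. It runs on a made-up tibble (`toy`, hypothetical values). As in the approach above, a park that spans two states is counted once in each state.

```r
library(dplyr)

# Hypothetical toy data: "Park B" has trails in two states
toy <- tibble(
  state_name = c("California", "California", "California",
                 "Utah", "Utah", "Nevada"),
  area_name  = c("Park A", "Park A", "Park B",
                 "Park C", "Park C", "Park B")
)

nStateToy <- toy %>%
  distinct(state_name, area_name) %>%  # one row per state/park pair
  count(state_name, sort = TRUE)       # parks per state, largest first
```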
(8 points)
States are grouped by regions with the following data frame:
stateRegion<-data.frame(state_name=state.name,
region=state.region)
head(stateRegion)
## state_name region
## 1 Alabama South
## 2 Alaska West
## 3 Arizona West
## 4 Arkansas South
## 5 California West
## 6 Colorado West
(2 points)
TASK: Add a column for region to the data frame
Hint: Use left_join
allTrails<-allTrails%>%
left_join(stateRegion)
## Joining, by = "state_name"
str(allTrails)
## tibble [3,313 × 22] (S3: tbl_df/tbl/data.frame)
## $ trail_id : num [1:3313] 10020048 10236086 10267857 10236076 10236082 ...
## $ name : chr [1:3313] "Harding Ice Field Trail" "Mount Healy Overlook Trail" "Exit Glacier Trail" "Horseshoe Lake Trail" ...
## $ area_name : chr [1:3313] "Kenai Fjords National Park" "Denali National Park" "Kenai Fjords National Park" "Denali National Park" ...
## $ city_name : chr [1:3313] "Seward" "Denali National Park" "Seward" "Denali National Park" ...
## $ state_name : chr [1:3313] "Alaska" "Alaska" "Alaska" "Alaska" ...
## $ country_name : chr [1:3313] "United States" "United States" "United States" "United States" ...
## $ _geoloc : chr [1:3313] "{'lat': 60.18852, 'lng': -149.63156}" "{'lat': 63.73049, 'lng': -148.91968}" "{'lat': 60.18879, 'lng': -149.631}" "{'lat': 63.73661, 'lng': -148.915}" ...
## $ popularity : num [1:3313] 24.9 18 17.8 16.3 12.6 ...
## $ length : num [1:3313] 15611 6920 2897 3380 29773 ...
## $ elevation_gain : num [1:3313] 1162 508 82 120 1125 ...
## $ difficulty_rating: num [1:3313] 5 3 1 1 5 5 3 3 1 5 ...
## $ route_type : chr [1:3313] "out and back" "out and back" "out and back" "loop" ...
## $ visitor_usage : num [1:3313] 3 1 3 2 1 1 1 1 1 1 ...
## $ avg_rating : num [1:3313] 5 4.5 4.5 4.5 4.5 4.5 4 4 4.5 4.5 ...
## $ num_reviews : num [1:3313] 423 260 224 237 110 43 39 27 21 5 ...
## $ features : chr [1:3313] "['dogs-no', 'forest', 'river', 'views', 'waterfall', 'wild-flowers', 'wildlife']" "['dogs-no', 'forest', 'views', 'wild-flowers', 'wildlife']" "['dogs-no', 'partially-paved', 'views', 'wildlife']" "['dogs-no', 'forest', 'lake', 'kids', 'views', 'wild-flowers', 'wildlife']" ...
## $ activities : chr [1:3313] "['birding', 'camping', 'hiking', 'nature-trips', 'trail-running']" "['birding', 'camping', 'hiking', 'nature-trips', 'walking']" "['hiking', 'walking']" "['birding', 'hiking', 'nature-trips', 'trail-running', 'walking']" ...
## $ units : chr [1:3313] "i" "i" "i" "i" ...
## $ grade : num [1:3313] 7.44 7.34 2.83 3.54 3.78 ...
## $ lengthMiles : num [1:3313] 9.7 4.3 1.8 2.1 18.5 ...
## $ elevationFt : num [1:3313] 3812 1666 269 393 3690 ...
## $ region : Factor w/ 4 levels "Northeast","South",..: 4 4 4 4 4 4 4 4 4 4 ...
(3 points)
Find the number of National Parks in each region. Create a new data frame to accomplish this and use it in the next step.
nRegionArea<-allTrails%>%
group_by(region,area_name)%>%
count()%>%
arrange(desc(n))
#View(nRegionArea)
head(nRegionArea)
## # A tibble: 6 × 3
## # Groups: region, area_name [6]
## region area_name n
## <fct> <chr> <int>
## 1 South Great Smoky Mountains National Park 293
## 2 West Yosemite National Park 242
## 3 West Yellowstone National Park 228
## 4 West Rocky Mountain National Park 207
## 5 South Shenandoah National Park 187
## 6 Northeast Acadia National Park 179
nReg<-nRegionArea%>%
group_by(region)%>%
count()%>%
arrange(desc(n))
head(nReg)
## # A tibble: 5 × 2
## # Groups: region [5]
## region n
## <fct> <int>
## 1 West 35
## 2 South 15
## 3 North Central 8
## 4 Northeast 1
## 5 <NA> 1
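The `<NA>` row suggests that at least one `state_name` value in the trail data has no match in `state.name`; the output above does not show which value. The sketch below lists unmatched names with `anti_join()`. It uses a made-up tibble, `toyTrails`, in which "Not A State" is a placeholder for any unmatched value.

```r
library(dplyr)

# Same lookup table as in the assessment
stateRegion <- data.frame(state_name = state.name,
                          region     = state.region)

# Hypothetical trail data; "Not A State" stands in for any unmatched value
toyTrails <- tibble(
  name       = c("Trail A", "Trail B", "Trail C"),
  state_name = c("Oregon", "Alaska", "Not A State")
)

# Keep only the rows whose state_name has no match in stateRegion
unmatched <- toyTrails %>%
  anti_join(stateRegion, by = "state_name") %>%
  distinct(state_name)
```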
(3 points)
Create a bar graph for the number of National Parks per region.
# BAR GRAPH
ggplot(data = nReg, aes(x=region, y=n))+
geom_bar(stat="identity")
# BAR GRAPH (ORDERED)
ggplot(data = nReg, aes(x=reorder(region, -n), y=n))+
geom_bar(stat="identity")+
xlab("Region")
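`geom_col()`, listed among the ggplot2 verbs at the top, is shorthand for `geom_bar(stat = "identity")`. The sketch below rebuilds the ordered bar graph with it, using the counts copied from the `nReg` output (the `<NA>` row is left out). The data frame name `nRegToy`, the title, and the axis labels are illustrative choices.

```r
library(ggplot2)

# Counts copied from the nReg output above (the <NA> row is left out)
nRegToy <- data.frame(
  region = c("West", "South", "North Central", "Northeast"),
  n      = c(35, 15, 8, 1)
)

# Bars ordered from most to fewest parks
p <- ggplot(nRegToy, aes(x = reorder(region, -n), y = n)) +
  geom_col() +
  labs(title = "National Parks per Region",
       x = "Region", y = "Number of National Parks")
p
```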