Learning Objectives:

Students will demonstrate their ability to translate questions into code using tidyverse packages and verbs.

Step 0: Load the tidyverse

library(tidyverse)

Step 1: Load the Data

(AllTrails App)

AllTrails is a fitness and travel mobile app for outdoor recreation such as hiking, mountain biking, climbing, and snow sports. The service gives users access to a database of trail maps, which includes crowdsourced reviews and images.

Citations:

Data Source: These data come from Kaggle

allTrails <- read_csv("https://raw.githubusercontent.com/kitadasmalley/DATA151/main/Data/AllTrails%20data%20-%20nationalpark.csv")

Step 2: Look at the data

What variables are available to work with?

str(allTrails)
## spec_tbl_df [3,313 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ trail_id         : num [1:3313] 10020048 10236086 10267857 10236076 10236082 ...
##  $ name             : chr [1:3313] "Harding Ice Field Trail" "Mount Healy Overlook Trail" "Exit Glacier Trail" "Horseshoe Lake Trail" ...
##  $ area_name        : chr [1:3313] "Kenai Fjords National Park" "Denali National Park" "Kenai Fjords National Park" "Denali National Park" ...
##  $ city_name        : chr [1:3313] "Seward" "Denali National Park" "Seward" "Denali National Park" ...
##  $ state_name       : chr [1:3313] "Alaska" "Alaska" "Alaska" "Alaska" ...
##  $ country_name     : chr [1:3313] "United States" "United States" "United States" "United States" ...
##  $ _geoloc          : chr [1:3313] "{'lat': 60.18852, 'lng': -149.63156}" "{'lat': 63.73049, 'lng': -148.91968}" "{'lat': 60.18879, 'lng': -149.631}" "{'lat': 63.73661, 'lng': -148.915}" ...
##  $ popularity       : num [1:3313] 24.9 18 17.8 16.3 12.6 ...
##  $ length           : num [1:3313] 15611 6920 2897 3380 29773 ...
##  $ elevation_gain   : num [1:3313] 1162 508 82 120 1125 ...
##  $ difficulty_rating: num [1:3313] 5 3 1 1 5 5 3 3 1 5 ...
##  $ route_type       : chr [1:3313] "out and back" "out and back" "out and back" "loop" ...
##  $ visitor_usage    : num [1:3313] 3 1 3 2 1 1 1 1 1 1 ...
##  $ avg_rating       : num [1:3313] 5 4.5 4.5 4.5 4.5 4.5 4 4 4.5 4.5 ...
##  $ num_reviews      : num [1:3313] 423 260 224 237 110 43 39 27 21 5 ...
##  $ features         : chr [1:3313] "['dogs-no', 'forest', 'river', 'views', 'waterfall', 'wild-flowers', 'wildlife']" "['dogs-no', 'forest', 'views', 'wild-flowers', 'wildlife']" "['dogs-no', 'partially-paved', 'views', 'wildlife']" "['dogs-no', 'forest', 'lake', 'kids', 'views', 'wild-flowers', 'wildlife']" ...
##  $ activities       : chr [1:3313] "['birding', 'camping', 'hiking', 'nature-trips', 'trail-running']" "['birding', 'camping', 'hiking', 'nature-trips', 'walking']" "['hiking', 'walking']" "['birding', 'hiking', 'nature-trips', 'trail-running', 'walking']" ...
##  $ units            : chr [1:3313] "i" "i" "i" "i" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   trail_id = col_double(),
##   ..   name = col_character(),
##   ..   area_name = col_character(),
##   ..   city_name = col_character(),
##   ..   state_name = col_character(),
##   ..   country_name = col_character(),
##   ..   `_geoloc` = col_character(),
##   ..   popularity = col_double(),
##   ..   length = col_double(),
##   ..   elevation_gain = col_double(),
##   ..   difficulty_rating = col_double(),
##   ..   route_type = col_character(),
##   ..   visitor_usage = col_double(),
##   ..   avg_rating = col_double(),
##   ..   num_reviews = col_double(),
##   ..   features = col_character(),
##   ..   activities = col_character(),
##   ..   units = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

For this assessment you will use the following variables:

  • area_name: Name of National Park
  • state_name: Name of State where trail is located
  • length: Length of hike in meters
  • elevation_gain: Elevation gain in meters
  • route_type: Out and back, Loop, or Point to Point
  • avg_rating: Star rating (out of 5)

Step 3: Calculate the Grade

(5 points)

Average grade is calculated by dividing the total elevation gain of the trail by the total distance and multiplying by 100 to give a percent grade. The elevation gain and distance must be in the same units.
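As a quick check of the formula, the sketch below plugs in the rounded length and elevation gain that str() displays for the first trail. Both are in meters, so no conversion is needed. The variable names are illustrative only and are not columns in the data.

```r
# Rounded values for the first trail, as displayed by str(); both in meters
elevation_gain_m <- 1162
length_m <- 15611

# Percent grade = rise / run * 100
grade_pct <- elevation_gain_m / length_m * 100
round(grade_pct, 2)  # 7.44
```

Because str() shows rounded values, a hand calculation like this may differ slightly in the last digit from the grade computed on the full data.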

Citation:

TASK: Add a column to the data frame that contains the average grade for each trail.

Please do NOT create a new data frame. Instead, overwrite the existing data frame so that it includes the new column. You will be using this in subsequent parts.

## MUTATE
allTrails<-allTrails%>%
  mutate(grade=(elevation_gain/length)*100)

str(allTrails)
## tibble [3,313 × 19] (S3: tbl_df/tbl/data.frame)
##  $ trail_id         : num [1:3313] 10020048 10236086 10267857 10236076 10236082 ...
##  $ name             : chr [1:3313] "Harding Ice Field Trail" "Mount Healy Overlook Trail" "Exit Glacier Trail" "Horseshoe Lake Trail" ...
##  $ area_name        : chr [1:3313] "Kenai Fjords National Park" "Denali National Park" "Kenai Fjords National Park" "Denali National Park" ...
##  $ city_name        : chr [1:3313] "Seward" "Denali National Park" "Seward" "Denali National Park" ...
##  $ state_name       : chr [1:3313] "Alaska" "Alaska" "Alaska" "Alaska" ...
##  $ country_name     : chr [1:3313] "United States" "United States" "United States" "United States" ...
##  $ _geoloc          : chr [1:3313] "{'lat': 60.18852, 'lng': -149.63156}" "{'lat': 63.73049, 'lng': -148.91968}" "{'lat': 60.18879, 'lng': -149.631}" "{'lat': 63.73661, 'lng': -148.915}" ...
##  $ popularity       : num [1:3313] 24.9 18 17.8 16.3 12.6 ...
##  $ length           : num [1:3313] 15611 6920 2897 3380 29773 ...
##  $ elevation_gain   : num [1:3313] 1162 508 82 120 1125 ...
##  $ difficulty_rating: num [1:3313] 5 3 1 1 5 5 3 3 1 5 ...
##  $ route_type       : chr [1:3313] "out and back" "out and back" "out and back" "loop" ...
##  $ visitor_usage    : num [1:3313] 3 1 3 2 1 1 1 1 1 1 ...
##  $ avg_rating       : num [1:3313] 5 4.5 4.5 4.5 4.5 4.5 4 4 4.5 4.5 ...
##  $ num_reviews      : num [1:3313] 423 260 224 237 110 43 39 27 21 5 ...
##  $ features         : chr [1:3313] "['dogs-no', 'forest', 'river', 'views', 'waterfall', 'wild-flowers', 'wildlife']" "['dogs-no', 'forest', 'views', 'wild-flowers', 'wildlife']" "['dogs-no', 'partially-paved', 'views', 'wildlife']" "['dogs-no', 'forest', 'lake', 'kids', 'views', 'wild-flowers', 'wildlife']" ...
##  $ activities       : chr [1:3313] "['birding', 'camping', 'hiking', 'nature-trips', 'trail-running']" "['birding', 'camping', 'hiking', 'nature-trips', 'walking']" "['hiking', 'walking']" "['birding', 'hiking', 'nature-trips', 'trail-running', 'walking']" ...
##  $ units            : chr [1:3313] "i" "i" "i" "i" ...
##  $ grade            : num [1:3313] 7.44 7.34 2.83 3.54 3.78 ...

Step 4: Distribution of Grade (8 points)

A. Graphic - Histogram

(3 points)

TASK: Make a histogram showing the distribution of trail grades. Don’t forget to title your graph.

# HISTOGRAM
ggplot(allTrails, aes(x=grade))+
  geom_histogram(bins=10)+
  ggtitle("Skewed Distribution of Trail Grades")
## Warning: Removed 2 rows containing non-finite values (stat_bin).

# DENSITY
ggplot(allTrails, aes(x=grade))+
  geom_density()+
  ggtitle("Skewed Distribution of Trail Grades")
## Warning: Removed 2 rows containing non-finite values (stat_density).

# BOXPLOT
ggplot(allTrails, aes(x=grade))+
  geom_boxplot()+
  ggtitle("Skewed Distribution of Trail Grades")
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
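The warnings above report two rows with non-finite grade values. One plausible cause, which is an assumption and has not been checked against the data here, is a trail with a recorded length of 0, since dividing by zero in R gives Inf or NaN. A minimal sketch on made-up values shows the behavior and one way to drop such rows before plotting:

```r
library(dplyr)

# Made-up trails: the second has zero length, the third has zero length and zero gain
toy <- tibble(
  elevation_gain = c(100, 50, 0),
  length         = c(2000, 0, 0)
) %>%
  mutate(grade = (elevation_gain / length) * 100)

toy$grade  # 5, Inf, NaN

# Keep only rows with a finite grade before plotting or summarising
toy_finite <- toy %>%
  filter(is.finite(grade))

nrow(toy_finite)
```

Filtering this way only silences the warning; ggplot2 already drops the same rows on its own.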

B. Insight

(5 points)

TASK: Comment on the shape of this histogram

## INSERT HERE ##
# SHAPE
# MODALITY
# CENTER
# SPREAD
# OUTLIERS?
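Numeric summaries can back up a written description of center and spread. The sketch below uses made-up, right-skewed values, not the AllTrails data, to show that the mean is pulled above the median by a long right tail. The name toy_summary is hypothetical.

```r
library(dplyr)

# Made-up, right-skewed grades (percent); not taken from the AllTrails data
toy <- tibble(grade = c(1, 2, 2, 3, 3, 4, 5, 8, 20))

toy_summary <- toy %>%
  summarise(
    mean_grade   = mean(grade),    # sensitive to the long right tail
    median_grade = median(grade),  # robust measure of center
    iqr_grade    = IQR(grade)      # robust measure of spread
  )

toy_summary
```

On the real data, rows with non-finite grades would need to be removed first, for example with filter(is.finite(grade)).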

Step 5: Convert Units

(5 points)

The AllTrails website reports trail length and elevation gain in miles and feet, respectively.

See example here: https://www.alltrails.com/trail/us/washington/sentinell-peak-via-grey-wolf-deer-loop

  • 1 meter = 0.000621371 miles
  • 1 meter = 3.28084 feet
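To see the conversion factors in action, the sketch below applies them to the rounded length and elevation gain that str() displays for the first trail. The variable names are illustrative only.

```r
# Rounded values for the first trail, as displayed by str(); both in meters
length_m <- 15611
elevation_gain_m <- 1162

# Multiply by the conversion factors listed above
length_miles <- length_m * 0.000621371
elevation_ft <- elevation_gain_m * 3.28084

round(length_miles, 1)  # 9.7
round(elevation_ft)     # 3812
```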

TASK: Convert the values for length and elevation to miles and feet, respectively.

Please do NOT create a new data frame. Instead, overwrite the existing data frame so that it includes the new columns. You will be using this in subsequent parts.

allTrails<-allTrails%>%
  mutate(lengthMiles=length*0.000621371, 
         elevationFt=elevation_gain*3.28084)

str(allTrails)
## tibble [3,313 × 21] (S3: tbl_df/tbl/data.frame)
##  $ trail_id         : num [1:3313] 10020048 10236086 10267857 10236076 10236082 ...
##  $ name             : chr [1:3313] "Harding Ice Field Trail" "Mount Healy Overlook Trail" "Exit Glacier Trail" "Horseshoe Lake Trail" ...
##  $ area_name        : chr [1:3313] "Kenai Fjords National Park" "Denali National Park" "Kenai Fjords National Park" "Denali National Park" ...
##  $ city_name        : chr [1:3313] "Seward" "Denali National Park" "Seward" "Denali National Park" ...
##  $ state_name       : chr [1:3313] "Alaska" "Alaska" "Alaska" "Alaska" ...
##  $ country_name     : chr [1:3313] "United States" "United States" "United States" "United States" ...
##  $ _geoloc          : chr [1:3313] "{'lat': 60.18852, 'lng': -149.63156}" "{'lat': 63.73049, 'lng': -148.91968}" "{'lat': 60.18879, 'lng': -149.631}" "{'lat': 63.73661, 'lng': -148.915}" ...
##  $ popularity       : num [1:3313] 24.9 18 17.8 16.3 12.6 ...
##  $ length           : num [1:3313] 15611 6920 2897 3380 29773 ...
##  $ elevation_gain   : num [1:3313] 1162 508 82 120 1125 ...
##  $ difficulty_rating: num [1:3313] 5 3 1 1 5 5 3 3 1 5 ...
##  $ route_type       : chr [1:3313] "out and back" "out and back" "out and back" "loop" ...
##  $ visitor_usage    : num [1:3313] 3 1 3 2 1 1 1 1 1 1 ...
##  $ avg_rating       : num [1:3313] 5 4.5 4.5 4.5 4.5 4.5 4 4 4.5 4.5 ...
##  $ num_reviews      : num [1:3313] 423 260 224 237 110 43 39 27 21 5 ...
##  $ features         : chr [1:3313] "['dogs-no', 'forest', 'river', 'views', 'waterfall', 'wild-flowers', 'wildlife']" "['dogs-no', 'forest', 'views', 'wild-flowers', 'wildlife']" "['dogs-no', 'partially-paved', 'views', 'wildlife']" "['dogs-no', 'forest', 'lake', 'kids', 'views', 'wild-flowers', 'wildlife']" ...
##  $ activities       : chr [1:3313] "['birding', 'camping', 'hiking', 'nature-trips', 'trail-running']" "['birding', 'camping', 'hiking', 'nature-trips', 'walking']" "['hiking', 'walking']" "['birding', 'hiking', 'nature-trips', 'trail-running', 'walking']" ...
##  $ units            : chr [1:3313] "i" "i" "i" "i" ...
##  $ grade            : num [1:3313] 7.44 7.34 2.83 3.54 3.78 ...
##  $ lengthMiles      : num [1:3313] 9.7 4.3 1.8 2.1 18.5 ...
##  $ elevationFt      : num [1:3313] 3812 1666 269 393 3690 ...

Step 6: Compare Route Types

(4 points)

Create a side-by-side box plot to compare the distributions of elevation gain in feet across the three route types (loop, out and back, point to point). Please fill the boxes with color for each route type.

## SIDE-BY-SIDE BOXPLOT
# One box per route type on the x-axis, filled by route type
ggplot(allTrails, aes(x=route_type, y=elevationFt, fill=route_type))+
  geom_boxplot()

Step 7: Finding Family Friendly Hikes

(10 points)

Our family likes to hike together! We have three children (ages 2, 7, and 11) so we have some limitations.

TASK: Find the best hike that fits ALL of the following conditions:

  • Is in Oregon
  • Is a loop
  • Is less than 3 miles
  • Is “easy” (has a grade less than 5)

Note: Best is defined by the highest average rating

allTrails%>%
  filter(state_name=="Oregon", 
         route_type=="loop", 
         lengthMiles<3, 
         grade<5)%>%
  arrange(desc(avg_rating))
## # A tibble: 5 × 21
##   trail_id name   area_…¹ city_…² state…³ count…⁴ _geol…⁵ popul…⁶ length eleva…⁷
##      <dbl> <chr>  <chr>   <chr>   <chr>   <chr>   <chr>     <dbl>  <dbl>   <dbl>
## 1 10016688 Sun N… Crater… Crater… Oregon  United… {'lat'…   13.9   1287.    38.7
## 2 10013161 Godfr… Crater… Chiloq… Oregon  United… {'lat'…    8.87  1770.    19.8
## 3 10015976 Annie… Crater… Chiloq… Oregon  United… {'lat'…    7.53  3380.    93.0
## 4 10012733 Castl… Crater… Chiloq… Oregon  United… {'lat'…    7.07  1931.    36.9
## 5 10236071 Lady … Crater… Crater… Oregon  United… {'lat'…    4.53  1127.    33.8
## # … with 11 more variables: difficulty_rating <dbl>, route_type <chr>,
## #   visitor_usage <dbl>, avg_rating <dbl>, num_reviews <dbl>, features <chr>,
## #   activities <chr>, units <chr>, grade <dbl>, lengthMiles <dbl>,
## #   elevationFt <dbl>, and abbreviated variable names ¹​area_name, ²​city_name,
## #   ³​state_name, ⁴​country_name, ⁵​`_geoloc`, ⁶​popularity, ⁷​elevation_gain
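Sorting with arrange() leaves it to the reader to pick the top row. As an alternative sketch, slice_max() from dplyr (version 1.0.0 or later) returns only the highest-rated row or rows. The trails below are made up for illustration.

```r
library(dplyr)

# Made-up trails; names and numbers are illustrative only
toy <- tibble(
  name        = c("A Loop", "B Loop", "C Trail", "D Loop"),
  state_name  = c("Oregon", "Oregon", "Oregon", "Utah"),
  route_type  = c("loop", "loop", "out and back", "loop"),
  lengthMiles = c(1.2, 2.5, 1.0, 0.8),
  grade       = c(3.0, 4.1, 2.0, 1.5),
  avg_rating  = c(4.0, 4.5, 5.0, 5.0)
)

best <- toy %>%
  filter(state_name == "Oregon",
         route_type == "loop",
         lengthMiles < 3,
         grade < 5) %>%
  slice_max(avg_rating, n = 1)  # ties are kept by default

best$name
```

Because ties are kept by default, more than one row can come back when several trails share the top rating.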

Respond in a full sentence: which hike is the best family-friendly hike in an Oregon National Park?

## INSERT ANSWER HERE ##

Step 8: Trails within National Parks (within States)

(5 points)

TASK: Find the number of trails and the average star rating within each National Park.

  • Which national park has the most trails?
  • Which national park has the highest rated trails?

Create a new data frame to accomplish this so that we can use this in the following step.

Hint: You can group by two variables. Please do this to keep the State in which each National Park is located.

# Which national park has the most trails?
nTrails<-allTrails%>%
  group_by(area_name)%>%
  count()%>%
  arrange(desc(n))

head(nTrails)
## # A tibble: 6 × 2
## # Groups:   area_name [6]
##   area_name                               n
##   <chr>                               <int>
## 1 Great Smoky Mountains National Park   293
## 2 Yosemite National Park                242
## 3 Yellowstone National Park             228
## 4 Rocky Mountain National Park          207
## 5 Shenandoah National Park              187
## 6 Acadia National Park                  179

# Which national park has the highest rated trails?
ratedTrails<-allTrails%>%
  group_by(area_name)%>%
  summarise(avgRate=mean(avg_rating, na.rm=TRUE))%>%
  arrange(desc(avgRate))

head(ratedTrails)
## # A tibble: 6 × 2
##   area_name                                       avgRate
##   <chr>                                             <dbl>
## 1 Kenai Fjords National Park                         4.75
## 2 Haleakala National Park                            4.57
## 3 Dry Tortugas National Park                         4.5 
## 4 Fort Pickens National Park                         4.5 
## 5 Wolf Trap National Park for the Performing Arts    4.5 
## 6 Mount Rainier National Park                        4.43
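The hint about grouping by two variables can be sketched as a single summary that keeps the state alongside each park and computes both the trail count and the average rating. The data below are made up, and the names parkSummary and nTrails are hypothetical.

```r
library(dplyr)

# Made-up trails; park names are placeholders
toy <- tibble(
  state_name = c("Alaska", "Alaska", "Alaska", "Oregon", "Oregon"),
  area_name  = c("Park A", "Park A", "Park B", "Park C", "Park C"),
  avg_rating = c(5, 4, 4.5, 3, NA)
)

parkSummary <- toy %>%
  group_by(state_name, area_name) %>%
  summarise(nTrails = n(),                             # trails per park
            avgRate = mean(avg_rating, na.rm = TRUE),  # mean rating, ignoring NA
            .groups = "drop") %>%
  arrange(desc(nTrails))

parkSummary
```

With this grouping, a park whose trails are recorded under more than one state gets one row per state, so its counts are split across those rows.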

Step 9: Most National Parks

(5 points)

TASK: Find the State with the most National Parks.

Hint: You should use the data frame you created in the previous step.

nStateArea<-allTrails%>%
  group_by(state_name, area_name)%>%
  count()%>%
  arrange(desc(n))

#View(nStateArea)
head(nStateArea)
## # A tibble: 6 × 3
## # Groups:   state_name, area_name [6]
##   state_name area_name                               n
##   <chr>      <chr>                               <int>
## 1 California Yosemite National Park                242
## 2 Wyoming    Yellowstone National Park             209
## 3 Colorado   Rocky Mountain National Park          207
## 4 Virginia   Shenandoah National Park              187
## 5 Maine      Acadia National Park                  179
## 6 Tennessee  Great Smoky Mountains National Park   175

nState<-nStateArea%>%
  group_by(state_name)%>%
  count()%>%
  arrange(desc(n))

head(nState)
## # A tibble: 6 × 2
## # Groups:   state_name [6]
##   state_name     n
##   <chr>      <int>
## 1 California     9
## 2 Utah           5
## 3 Alaska         4
## 4 Colorado       4
## 5 Florida        4
## 6 Arizona        3
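The outputs in Steps 8 and 9 suggest that some parks have trails in more than one state: Great Smoky Mountains National Park has 293 trails overall but only 175 under Tennessee. Such a park is counted once for each state it appears in. The sketch below, on made-up data, shows an equivalent way to count parks per state with distinct() and count(). The name nStateToy is hypothetical.

```r
library(dplyr)

# Made-up trails; "Park X" has trails recorded under two states
toy <- tibble(
  state_name = c("Tennessee", "Tennessee", "North Carolina",
                 "Utah", "Utah", "Utah"),
  area_name  = c("Park X", "Park X", "Park X",
                 "Park Y", "Park Z", "Park Z")
)

nStateToy <- toy %>%
  distinct(state_name, area_name) %>%  # one row per state-park pair
  count(state_name, sort = TRUE)       # parks per state, largest first

nStateToy
```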

Step 10: Join Regions

(8 points)

States are grouped by regions with the following data frame:

stateRegion<-data.frame(state_name=state.name, 
                        region=state.region)

head(stateRegion)
##   state_name region
## 1    Alabama  South
## 2     Alaska   West
## 3    Arizona   West
## 4   Arkansas  South
## 5 California   West
## 6   Colorado   West

A: Join

(2 points)

TASK: Add a column for region to the data frame

  • You can do this with the data frame you generated in the previous step (9)
  • OR you can use the full complete data frame

Hint: Use left_join
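A left join keeps every trail, so any state_name that has no match in the lookup table ends up with NA for region. As a sketch of how to check for this, anti_join() lists the unmatched values. The trail rows below are made up, and "Not A State" is a hypothetical placeholder for whatever value fails to match.

```r
library(dplyr)

# Lookup table built from R's built-in state data, as in the text
stateRegion <- data.frame(state_name = state.name,
                          region = state.region)

# Made-up trail rows; "Not A State" is a hypothetical unmatched value
toy <- tibble(
  name       = c("Trail 1", "Trail 2", "Trail 3"),
  state_name = c("Oregon", "Alaska", "Not A State")
)

# State names in the trail data that are missing from the lookup table
unmatched <- toy %>%
  anti_join(stateRegion, by = "state_name") %>%
  distinct(state_name)

unmatched

# The same rows get NA for region after a left join
joined <- toy %>%
  left_join(stateRegion, by = "state_name")

joined$region
```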

# Name the key column explicitly instead of relying on the join message
allTrails<-allTrails%>%
  left_join(stateRegion, by="state_name")

str(allTrails)
## tibble [3,313 × 22] (S3: tbl_df/tbl/data.frame)
##  $ trail_id         : num [1:3313] 10020048 10236086 10267857 10236076 10236082 ...
##  $ name             : chr [1:3313] "Harding Ice Field Trail" "Mount Healy Overlook Trail" "Exit Glacier Trail" "Horseshoe Lake Trail" ...
##  $ area_name        : chr [1:3313] "Kenai Fjords National Park" "Denali National Park" "Kenai Fjords National Park" "Denali National Park" ...
##  $ city_name        : chr [1:3313] "Seward" "Denali National Park" "Seward" "Denali National Park" ...
##  $ state_name       : chr [1:3313] "Alaska" "Alaska" "Alaska" "Alaska" ...
##  $ country_name     : chr [1:3313] "United States" "United States" "United States" "United States" ...
##  $ _geoloc          : chr [1:3313] "{'lat': 60.18852, 'lng': -149.63156}" "{'lat': 63.73049, 'lng': -148.91968}" "{'lat': 60.18879, 'lng': -149.631}" "{'lat': 63.73661, 'lng': -148.915}" ...
##  $ popularity       : num [1:3313] 24.9 18 17.8 16.3 12.6 ...
##  $ length           : num [1:3313] 15611 6920 2897 3380 29773 ...
##  $ elevation_gain   : num [1:3313] 1162 508 82 120 1125 ...
##  $ difficulty_rating: num [1:3313] 5 3 1 1 5 5 3 3 1 5 ...
##  $ route_type       : chr [1:3313] "out and back" "out and back" "out and back" "loop" ...
##  $ visitor_usage    : num [1:3313] 3 1 3 2 1 1 1 1 1 1 ...
##  $ avg_rating       : num [1:3313] 5 4.5 4.5 4.5 4.5 4.5 4 4 4.5 4.5 ...
##  $ num_reviews      : num [1:3313] 423 260 224 237 110 43 39 27 21 5 ...
##  $ features         : chr [1:3313] "['dogs-no', 'forest', 'river', 'views', 'waterfall', 'wild-flowers', 'wildlife']" "['dogs-no', 'forest', 'views', 'wild-flowers', 'wildlife']" "['dogs-no', 'partially-paved', 'views', 'wildlife']" "['dogs-no', 'forest', 'lake', 'kids', 'views', 'wild-flowers', 'wildlife']" ...
##  $ activities       : chr [1:3313] "['birding', 'camping', 'hiking', 'nature-trips', 'trail-running']" "['birding', 'camping', 'hiking', 'nature-trips', 'walking']" "['hiking', 'walking']" "['birding', 'hiking', 'nature-trips', 'trail-running', 'walking']" ...
##  $ units            : chr [1:3313] "i" "i" "i" "i" ...
##  $ grade            : num [1:3313] 7.44 7.34 2.83 3.54 3.78 ...
##  $ lengthMiles      : num [1:3313] 9.7 4.3 1.8 2.1 18.5 ...
##  $ elevationFt      : num [1:3313] 3812 1666 269 393 3690 ...
##  $ region           : Factor w/ 4 levels "Northeast","South",..: 4 4 4 4 4 4 4 4 4 4 ...

B: National Parks by Region

(3 points)

Find the number of National Parks in each region. Create a new data frame to accomplish this and use it in the next step.

nRegionArea<-allTrails%>%
  group_by(region,area_name)%>%
  count()%>%
  arrange(desc(n))

#View(nRegionArea)
head(nRegionArea)
## # A tibble: 6 × 3
## # Groups:   region, area_name [6]
##   region    area_name                               n
##   <fct>     <chr>                               <int>
## 1 South     Great Smoky Mountains National Park   293
## 2 West      Yosemite National Park                242
## 3 West      Yellowstone National Park             228
## 4 West      Rocky Mountain National Park          207
## 5 South     Shenandoah National Park              187
## 6 Northeast Acadia National Park                  179

nReg<-nRegionArea%>%
  group_by(region)%>%
  count()%>%
  arrange(desc(n))

head(nReg)
## # A tibble: 5 × 2
## # Groups:   region [5]
##   region            n
##   <fct>         <int>
## 1 West             35
## 2 South            15
## 3 North Central     8
## 4 Northeast         1
## 5 <NA>              1

C: Bar Graph

(3 points)

Create a bar graph for the number of National Parks per region.

# BAR GRAPH
ggplot(data = nReg, aes(x=region, y=n))+
  geom_bar(stat="identity")

# BAR GRAPH (ORDERED)
ggplot(data = nReg, aes(x=reorder(region, -n), y=n))+
  geom_bar(stat="identity")+
  xlab("Region")