knitr::opts_chunk$set(echo = TRUE,
                      #warning = FALSE,
                      #message = FALSE,
                      fig.align = "center")
All homework assignments should be submitted as a PDF file. The easiest way is to knit to an HTML file and convert it to a PDF. There is a video in module 1 showing how to do that!
If a question asks for a graph, table, or calculation (like an average), make sure that it appears in your knitted document.
Your homework should be your own work. While you can use the internet for help, any major deviations from the methods seen in this course will be marked incorrect, even if they give the correct answer.
The code should be readable and commented. If I’m unsure what your code did, I can’t award partial credit!
Load the tidyverse and caret packages. Install the packages if needed.
# Load the packages below:
pacman::p_load(tidyverse, caret)
# Or 
library(tidyverse)
library(caret)
# either method is fine to use
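For the "install the packages if needed" step, a minimal sketch of one way to check what is missing before loading. The helper name `missing_packages` is hypothetical (it is not part of the course material), and the install call is left commented out so nothing is installed by accident:

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE/FALSE without attaching the package
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "caret"))
to_install

# If anything is listed above, install it once from the console:
# install.packages(to_install)
```

`pacman::p_load()` performs a similar check-then-install step on its own, which is why either loading method above works.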
If the packages loaded successfully, the code chunk below should run without errors:
# Making sure the randomness is the same
RNGversion('4.1.0'); set.seed(2870)
iris |> 
  # Shuffling the Species column
  mutate(
    species2 = sample(Species)
  ) |> 
  dplyr::select(species2, Species) |> 
  # Creating a table of species2 and Species
  table() |> 
  # Creating a confusion matrix (which we'll see later)
  confusionMatrix()
## Confusion Matrix and Statistics
## 
##             Species
## species2     setosa versicolor virginica
##   setosa         15         18        17
##   versicolor     16         16        18
##   virginica      19         16        15
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3067          
##                  95% CI : (0.2341, 0.3871)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 0.7810          
##                                           
##                   Kappa : -0.04           
##                                           
##  Mcnemar's Test P-Value : 0.9511          
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 0.3000            0.3200           0.3000
## Specificity                 0.6500            0.6600           0.6500
## Pos Pred Value              0.3000            0.3200           0.3000
## Neg Pred Value              0.6500            0.6600           0.6500
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.1000            0.1067           0.1000
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           0.4750            0.4900           0.4750
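As a rough check on the output above, the Accuracy and Kappa values can be reproduced by hand from the 3 x 3 table. This is only a sketch in base R: the counts are copied from the printed table, and the object names (`cm`, `accuracy`, `expected`, `kappa`) are illustrative, not part of caret.

```r
# Counts copied from the printed confusion matrix (rows = species2, cols = Species)
cm <- matrix(
  c(15, 16, 19,   # Species = setosa
    18, 16, 16,   # Species = versicolor
    17, 18, 15),  # Species = virginica
  nrow = 3,
  dimnames = list(
    species2 = c("setosa", "versicolor", "virginica"),
    Species  = c("setosa", "versicolor", "virginica")
  )
)

n <- sum(cm)

# Accuracy: proportion of observations on the diagonal
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: agreement beyond chance, rescaled
kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = kappa), 4)
```

Because `species2` is a random shuffle of `Species`, an accuracy close to 1/3 and a kappa near 0 are what one would expect.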
The bike criterium has 200 total participants. The ages of
six participants are 25, 34, 33, 42, 22, 29. Save these ages (in the
order specified) in a vector named age, then calculate the
average age.
# age vector
age <- c(25, 34, 33, 42, 22, 29)
# Average (mean) age
mean(age)
## [1] 30.83333
The brand of the bikes used by the same six cyclists from
question 1b are: Bianchi, Dare, Bianchi, Bianchi, Cannondale, Dare. Save
the brands in a vector called bike_brand and display the
counts of each brand using table().
bike_brand <- c('Bianchi', 'Dare', 'Bianchi', 'Bianchi', 'Cannondale', 'Dare')
table(bike_brand)
## bike_brand
##    Bianchi Cannondale       Dare 
##          3          1          2
The same six participants’ responses for the number of previous criteriums attended are 2, 5, 7, (left blank), 0, 3. Save the values in a vector named cyclist_exp, then calculate the median of the responses. It should return a number!
cyclist_exp <- c(2, 5, 7, NA, 0, 3)
# na.rm = TRUE drops the missing value before the median is computed
median(cyclist_exp,
       na.rm = TRUE)
## [1] 3
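To see why `na.rm = TRUE` is needed here, a short sketch comparing the default behavior with two ways of dropping the blank response. The object names `default_median`, `n_missing`, and `manual_median` are illustrative only.

```r
cyclist_exp <- c(2, 5, 7, NA, 0, 3)

# By default, a single NA makes the median NA
default_median <- median(cyclist_exp)
default_median

# Count how many responses were left blank
n_missing <- sum(is.na(cyclist_exp))
n_missing

# Dropping the NA by hand gives the same result as na.rm = TRUE
manual_median <- median(cyclist_exp[!is.na(cyclist_exp)])
manual_median
```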
The six participants are assigned ID numbers of 103, 203,
303, 403, 503, 603. Create an object named bib_num using a
shortcut (not c(103, 203, ...)). Have bib_num
appear in the knitted document.
bib_num <- (1:6)*100 + 3
bib_num
## [1] 103 203 303 403 503 603
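The arithmetic shortcut above is one option. As a sketch of an alternative, `seq()` can generate the same evenly spaced IDs; the names `bib_seq` and `bib_len` are illustrative only.

```r
# Start at 103 and step by 100 until 603
bib_seq <- seq(from = 103, to = 603, by = 100)

# Or give the number of values instead of the endpoint
bib_len <- seq(from = 103, by = 100, length.out = 6)

bib_seq
all.equal(bib_seq, (1:6) * 100 + 3)
```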
Create a data set named cyclists using the 4
vectors created previously in the order of bib_num,
bike_brand, age, cyclist_exp with
column names of ID, brand, age,
experience, respectively.
Display the data frame in the knitted document.
cyclists <- 
  data.frame(
    ID = bib_num,
    brand = bike_brand,
    age,
    experience = cyclist_exp
  )
cyclists
##    ID      brand age experience
## 1 103    Bianchi  25          2
## 2 203       Dare  34          5
## 3 303    Bianchi  33          7
## 4 403    Bianchi  42         NA
## 5 503 Cannondale  22          0
## 6 603       Dare  29          3
Repeat question 1f), but create the cyclists
data set without creating global objects for the individual columns
first (no individual vectors). When completed, the only object in your
global environment should be cyclists
# Keep the line below at the top of this code chunk:
rm(list = ls())
# Now create a data.frame named cyclists as described by this question:
cyclists <- 
  data.frame(
    ID = (1:6)*100 + 3,
    brand = c('Bianchi', 'Dare', 'Bianchi', 'Bianchi', 'Cannondale', 'Dare'),
    age = c(25, 34, 33, 42, 22, 29),
    experience = c(2, 5, 7, NA, 0, 3)
  )
cyclists
##    ID      brand age experience
## 1 103    Bianchi  25          2
## 2 203       Dare  34          5
## 3 303    Bianchi  33          7
## 4 403    Bianchi  42         NA
## 5 503 Cannondale  22          0
## 6 603       Dare  29          3
# Displaying the objects in the global environment
ls()
## [1] "cyclists"
Read in the strava_data.csv file and save it as strava.
Then skim() the data without loading
the skimr package.
strava <- read.csv("strava_data.csv")
skimr::skim(strava)
| Data summary | |
|---|---|
| Name | strava |
| Number of rows | 122 |
| Number of columns | 4 |
| Column type frequency: | |
| character | 1 |
| numeric | 3 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace | 
|---|---|---|---|---|---|---|---|
| date | 0 | 1 | 19 | 19 | 0 | 122 | 0 | 
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist | 
|---|---|---|---|---|---|---|---|---|---|---|
| time | 0 | 1 | 68.34 | 33.10 | 23.25 | 40.46 | 63.71 | 83.72 | 163.70 | ▇▇▃▂▁ | 
| distance | 0 | 1 | 11.93 | 6.04 | 5.07 | 6.88 | 11.30 | 15.48 | 28.79 | ▇▅▃▁▂ | 
| elevation_gain | 0 | 1 | 2.09 | 1.18 | 0.45 | 1.39 | 1.65 | 2.46 | 6.53 | ▇▃▂▁▁ | 
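The `skimr::skim()` call works without `library(skimr)` because `package::function` looks the function up in an installed package without attaching that package. A small sketch of the same idea, using the `tools` package only because it ships with R and is not attached by default (that choice is an assumption for illustration, not something the assignment uses):

```r
# Call one function from tools without library(tools)
ext <- tools::file_ext("strava_data.csv")
ext

# Check whether tools has been attached to the search path;
# this should be FALSE unless library(tools) was run earlier
"package:tools" %in% search()
```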
Start by changing the date column in
strava to a date-type column instead of a character using
ymd_hms() from the lubridate package (which
should be loaded when you loaded the tidyverse).
Note: You’ll need to change the date column
before you can create the month and day
columns!
After changing date to a date-type column,
create columns named month and day by using
the month() and day() functions, respectively,
on the newly changed date column. Include
label = TRUE inside month() to display the
three letter abbreviation for each month.
Then add a column named dist_km that is the
distance measured in kilometers instead of miles:
1 mile \(\approx\) 1.6 kilometers
Display the first 10 rows of the resulting data using
tibble(strava).
# changing the date column
strava$date <- ymd_hms(strava$date)
# Adding month and day
strava$month <- month(strava$date, label = TRUE)
strava$day <- day(strava$date)
# Adding the dist_km column
strava$dist_km <- strava$distance * 1.6
# Displaying the results
tibble(strava)
## # A tibble: 122 × 7
##    date                 time distance elevation_gain month   day dist_km
##    <dttm>              <dbl>    <dbl>          <dbl> <ord> <int>   <dbl>
##  1 2023-06-28 22:18:00  60.0     9.28          0.935 Jun      28    14.8
##  2 2023-06-30 17:53:13  54.7     8.33          1.50  Jun      30    13.3
##  3 2023-07-04 16:12:12  67.0    11.2           1.37  Jul       4    17.8
##  4 2023-07-14 18:29:18  77.9    15.5           5.16  Jul      14    24.8
##  5 2023-07-15 17:35:51  45.4     8.91          0.920 Jul      15    14.2
##  6 2023-07-17 16:51:24  59.4    11.5           1.64  Jul      17    18.4
##  7 2023-07-21 18:27:02 140.     25.5           4.94  Jul      21    40.7
##  8 2023-07-23 21:34:35  65.8    12.8           1.67  Jul      23    20.5
##  9 2023-07-24 22:07:51  82.1    10.7           1.45  Jul      24    17.0
## 10 2023-07-25 14:45:21 105.     18.8           2.08  Jul      25    30.0
## # ℹ 112 more rows
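The same four steps can also be written as a single `mutate()` call. The sketch below assumes dplyr and lubridate are installed and uses a made-up two-row data frame named `rides` (hypothetical, with values chosen only for illustration) so that it runs without the CSV file:

```r
library(dplyr)
library(lubridate)

# Two made-up rows in the same shape as strava (illustrative values only)
rides <- data.frame(
  date = c("2023-06-28 22:18:00", "2023-07-04 16:12:12"),
  distance = c(9.28, 11.2)
)

rides2 <- rides |>
  mutate(
    # mutate() evaluates in order, so date is converted before it is reused
    date = ymd_hms(date),
    month = month(date, label = TRUE),
    day = day(date),
    # 1 mile is approximately 1.6 kilometers
    dist_km = distance * 1.6
  )

rides2
```

Either style is fine here; the `$` assignments above modify `strava` in place, while `mutate()` returns a new data frame that has to be saved.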
Calculate the correlation between distance and
time using cor().
Then calculate the correlation between dist_km
and time.
Round both to 3 decimal places.
Note: You don’t need to save the values, just display the output in the knitted document.
# correlation between distance and time
round(cor(x = strava$distance, y = strava$time), 3)
## [1] 0.913
# correlation between dist_km and time
round(cor(x = strava$dist_km, y = strava$time), 3)
## [1] 0.913
What do you notice about the two correlations?
Answer Here
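One way to explore this question is to repeat the comparison on simulated data. The sketch below uses made-up values (`miles`, `minutes`, and `km` are hypothetical and unrelated to the Strava file) to check whether multiplying one variable by a positive constant changes the correlation:

```r
set.seed(1)

# Simulated rides: distance in miles and a noisy time in minutes
miles <- runif(20, min = 5, max = 30)
minutes <- 5 * miles + rnorm(20, sd = 10)

# Same distances expressed in kilometers
km <- miles * 1.6

cor_miles <- cor(miles, minutes)
cor_km <- cor(km, minutes)

# Compare the two correlations up to floating-point tolerance
all.equal(cor_miles, cor_km)
```

Correlation is computed from standardized values, so changing the units of a variable by a positive multiplier does not affect it.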
Create a set of side-by-side boxplots for distance with a
different box for each month. You can use the base function
boxplot() or you can use
ggplot().
# Using base R
boxplot(
  distance ~ month,
  data = strava
)
# Using ggplot()
ggplot(
  data = strava,
  mapping = aes(
    y = distance,
    x = month
  )
) + 
  geom_boxplot(fill = 'steelblue')
Which months have no activity?
Answer here
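Besides reading the boxplots, the empty months can be found by counting rows per month. Because `month(label = TRUE)` returns a factor with all twelve levels, `table()` reports months with no rides as 0. The sketch below uses a made-up factor named `ride_month` (hypothetical values) rather than the real data:

```r
# Made-up months for four rides; the levels cover the whole year
ride_month <- factor(
  c("Jun", "Jul", "Jul", "Sep"),
  levels = month.abb,
  ordered = TRUE
)

# table() keeps unused factor levels, so empty months show a count of 0
counts <- table(ride_month)
counts

# Names of the months with no rides
empty_months <- names(counts)[counts == 0]
empty_months
```

With the real data, the same idea would be `table(strava$month)`.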