knitr::opts_chunk$set(echo = TRUE,
                      #warning = FALSE,
                      #message = FALSE,
                      fig.align = "center")
All homework assignments should be submitted as a PDF file. The easiest way is to knit to an HTML file and convert it to a PDF. There is a video in module 1 about how you can do that!
If a question asks for a graph, table, or calculation (like an average), make sure that it appears in your knitted document.
Your homework should be your own work. While you can use the internet for help, any major deviations from the methods seen in this course will be marked incorrect, even if they give the correct answer.
The code should be readable and commented. If I’m unsure what your code did, I can’t award partial credit!
Load the tidyverse and caret packages. Install the packages if needed.
# Load the packages below:
pacman::p_load(tidyverse, caret)
# Or
library(tidyverse)
library(caret)
# either method is fine to use
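The "install the packages if needed" step can be sketched as follows. This is a minimal sketch rather than part of the assignment: `missing_packages()` is a hypothetical helper name, and the `install.packages()` call is left commented out so that nothing is installed by accident.

```r
# Hypothetical helper: return the names in `pkgs` that are not installed.
# requireNamespace() returns FALSE (instead of erroring) for a missing package.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "caret"))
to_install

# Uncomment to install anything that is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

If `to_install` is empty, both packages are already available and the loading chunk above should run as written.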
If successful, the code chunk below should run:
# Making sure the randomness is the same
RNGversion('4.1.0'); set.seed(2870)
iris |>
  # Shuffling the Species column
  mutate(
    species2 = sample(Species)
  ) |>
  dplyr::select(species2, Species) |>
  # Creating a table of species2 and Species
  table() |>
  # Creating a confusion matrix (which we'll see later)
  confusionMatrix()
## Confusion Matrix and Statistics
##
## Species
## species2 setosa versicolor virginica
## setosa 15 18 17
## versicolor 16 16 18
## virginica 19 16 15
##
## Overall Statistics
##
## Accuracy : 0.3067
## 95% CI : (0.2341, 0.3871)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 0.7810
##
## Kappa : -0.04
##
## Mcnemar's Test P-Value : 0.9511
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 0.3000 0.3200 0.3000
## Specificity 0.6500 0.6600 0.6500
## Pos Pred Value 0.3000 0.3200 0.3000
## Neg Pred Value 0.6500 0.6600 0.6500
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.1000 0.1067 0.1000
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 0.4750 0.4900 0.4750
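To make the output above less of a black box, here is a minimal base R sketch that recomputes a few of the reported statistics by hand. The counts are copied from the printed table; the object names (`cm`, `kappa_hat`, and so on) are illustrative and not part of the assignment.

```r
# Counts copied from the printed confusion matrix
# (rows = species2, columns = Species); matrix() fills column by column
cm <- matrix(
  c(15, 16, 19,   # Species = setosa
    18, 16, 16,   # Species = versicolor
    17, 18, 15),  # Species = virginica
  nrow = 3,
  dimnames = list(
    species2 = c("setosa", "versicolor", "virginica"),
    Species  = c("setosa", "versicolor", "virginica")
  )
)

n <- sum(cm)

# Accuracy: share of observations on the diagonal
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Kappa: accuracy corrected for chance agreement
kappa_hat <- (accuracy - expected) / (1 - expected)

# Sensitivity per class: diagonal count divided by the column total
sensitivity <- diag(cm) / colSums(cm)

round(accuracy, 4)     # 0.3067
round(kappa_hat, 2)    # -0.04
round(sensitivity, 4)  # 0.30 0.32 0.30
```

Because `species2` is just a shuffled copy of `Species`, an accuracy near 1/3 and a kappa near 0 are what chance agreement should look like.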
The bike criterium has 200 total participants. The ages of six participants are 25, 34, 33, 42, 22, 29. Save these ages (in the order specified) in a vector named age, then calculate the average age.
# age vector
age <- c(25, 34, 33, 42, 22, 29)
# Average (mean) age
mean(age)
## [1] 30.83333
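As a quick cross-check, the mean is the sum of the values divided by how many there are. The sketch below redefines `age` so it runs on its own; `manual_mean` is an illustrative name.

```r
age <- c(25, 34, 33, 42, 22, 29)

# Sum of the ages divided by the number of ages
manual_mean <- sum(age) / length(age)
manual_mean

# Should agree with the built-in function
all.equal(manual_mean, mean(age))
```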
The brands of the bikes used by the same six cyclists from question 1b are: Bianchi, Dare, Bianchi, Bianchi, Cannondale, Dare. Save the brands in a vector called bike_brand and display the results using table().
bike_brand <- c('Bianchi', 'Dare', 'Bianchi', 'Bianchi', 'Cannondale', 'Dare')
table(bike_brand)
## bike_brand
## Bianchi Cannondale Dare
## 3 1 2
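The question only asks for counts, but the same table can be turned into proportions with `prop.table()`. This is an optional sketch; `brand_counts` and `brand_share` are illustrative names.

```r
bike_brand <- c('Bianchi', 'Dare', 'Bianchi', 'Bianchi', 'Cannondale', 'Dare')

# Counts per brand (brands are listed alphabetically)
brand_counts <- table(bike_brand)

# Each count divided by the total number of cyclists
brand_share <- prop.table(brand_counts)
round(brand_share, 2)
```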
The same six participants’ responses to the number of previous criteriums attended are 2, 5, 7, left blank, 0, 3. Save the values in a vector named cyclist_exp, then calculate the median of the responses. It should return a number!
cyclist_exp <- c(2, 5, 7, NA, 0, 3)
# Need to use na.rm = TRUE to remove the missing values
median(cyclist_exp,
       na.rm = TRUE)
## [1] 3
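The reason for the "It should return a number!" hint is that a single `NA` makes the default median `NA` as well. A short sketch of what happens with and without removing the missing value:

```r
cyclist_exp <- c(2, 5, 7, NA, 0, 3)

# Without na.rm, the missing value propagates to the result
median(cyclist_exp)                # NA

# How many responses are missing?
sum(is.na(cyclist_exp))            # 1

# sort() drops NA by default, leaving the five observed values
sort(cyclist_exp)                  # 0 2 3 5 7

# The middle of those five values is the median
median(cyclist_exp, na.rm = TRUE)  # 3
```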
The six participants are assigned ID numbers of 103, 203, 303, 403, 503, 603. Create an object named bib_num using a shortcut (not c(103, 203, ...)). Have bib_num appear in the knitted document.
bib_num <- (1:6)*100 + 3
bib_num
## [1] 103 203 303 403 503 603
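There is more than one shortcut that produces this sequence. As an alternative sketch, `seq()` with a step of 100 gives the same values; `bib_alt` is an illustrative name used only for the comparison.

```r
# Shortcut used in the solution: scale 1..6 and shift by 3
bib_num <- (1:6) * 100 + 3

# Alternative: start at 103 and step by 100 until 603
bib_alt <- seq(from = 103, to = 603, by = 100)

# Both approaches give the same six IDs
all.equal(bib_num, bib_alt)
```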
Create a data set named cyclists using the 4 vectors created previously in the order of bib_num, bike_brand, age, cyclist_exp with column names of ID, bike, age, experience respectively. Display the data frame in the knitted document.
cyclists <-
  data.frame(
    ID = bib_num,
    bike = bike_brand,  # the question asks for this column to be named bike
    age,
    experience = cyclist_exp
  )
cyclists
## ID bike age experience
## 1 103 Bianchi 25 2
## 2 203 Dare 34 5
## 3 303 Bianchi 33 7
## 4 403 Bianchi 42 NA
## 5 503 Cannondale 22 0
## 6 603 Dare 29 3
Repeat question 1f), but create the cyclists data set without creating global objects for the individual columns first (no individual vectors). When completed, the only object in your global environment should be cyclists.
# Keep the line below at the top of this code chunk:
rm(list = ls())
# Now create a data.frame named cyclists as described by this question:
cyclists <-
  data.frame(
    ID = (1:6)*100 + 3,
    bike = c('Bianchi', 'Dare', 'Bianchi', 'Bianchi', 'Cannondale', 'Dare'),
    age = c(25, 34, 33, 42, 22, 29),
    experience = c(2, 5, 7, NA, 0, 3)
  )
cyclists
## ID bike age experience
## 1 103 Bianchi 25 2
## 2 203 Dare 34 5
## 3 303 Bianchi 33 7
## 4 403 Bianchi 42 NA
## 5 503 Cannondale 22 0
## 6 603 Dare 29 3
# Displaying the objects in the global environment
ls()
## [1] "cyclists"
Read in the strava_data.csv file and save it as strava. Then skim() the data without loading the skimr package.
strava <- read.csv("strava_data.csv")
skimr::skim(strava)
Name | strava |
Number of rows | 122 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
date | 0 | 1 | 19 | 19 | 0 | 122 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
time | 0 | 1 | 68.34 | 33.10 | 23.25 | 40.46 | 63.71 | 83.72 | 163.70 | ▇▇▃▂▁ |
distance | 0 | 1 | 11.93 | 6.04 | 5.07 | 6.88 | 11.30 | 15.48 | 28.79 | ▇▅▃▁▂ |
elevation_gain | 0 | 1 | 2.09 | 1.18 | 0.45 | 1.39 | 1.65 | 2.46 | 6.53 | ▇▃▂▁▁ |
Start by changing the date column in strava to a date-type column instead of a character using ymd_hms() from the lubridate package (which should be loaded when you imported the tidyverse).
Note: You’ll need to change the date column before you can create the month and day columns!
After changing date to a date-type column, create columns named month and day by using the month() and day() functions, respectively, on the newly changed date column. Include label = TRUE inside month() to display the three-letter abbreviation for each month.
Then add a column named dist_km that is the distance measured in kilometers instead of miles:
1 mile \(\approx\) 1.6 kilometers
Display the first 10 rows of the resulting data using tibble(strava).
# changing the date column
strava$date <- ymd_hms(strava$date)
# Adding month and day
strava$month <- month(strava$date, label = TRUE)
strava$day <- day(strava$date)
# Adding the dist_km column
strava$dist_km <- strava$distance * 1.6
# Displaying the results
tibble(strava)
## # A tibble: 122 × 7
## date time distance elevation_gain month day dist_km
## <dttm> <dbl> <dbl> <dbl> <ord> <int> <dbl>
## 1 2023-06-28 22:18:00 60.0 9.28 0.935 Jun 28 14.8
## 2 2023-06-30 17:53:13 54.7 8.33 1.50 Jun 30 13.3
## 3 2023-07-04 16:12:12 67.0 11.2 1.37 Jul 4 17.8
## 4 2023-07-14 18:29:18 77.9 15.5 5.16 Jul 14 24.8
## 5 2023-07-15 17:35:51 45.4 8.91 0.920 Jul 15 14.2
## 6 2023-07-17 16:51:24 59.4 11.5 1.64 Jul 17 18.4
## 7 2023-07-21 18:27:02 140. 25.5 4.94 Jul 21 40.7
## 8 2023-07-23 21:34:35 65.8 12.8 1.67 Jul 23 20.5
## 9 2023-07-24 22:07:51 82.1 10.7 1.45 Jul 24 17.0
## 10 2023-07-25 14:45:21 105. 18.8 2.08 Jul 25 30.0
## # ℹ 112 more rows
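The same four steps can also be written as a single `mutate()` call. The sketch below is an alternative, not the required approach: `strava_demo` is a hypothetical two-row stand-in built from values printed above (the real file is not read here), and it assumes the dplyr and lubridate packages are installed.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in: two rows copied from the displayed output above
strava_demo <- data.frame(
  date     = c("2023-06-28 22:18:00", "2023-07-04 16:12:12"),
  distance = c(9.28, 11.2)
)

strava_demo <- strava_demo |>
  mutate(
    date    = ymd_hms(date),               # character -> date-time
    month   = month(date, label = TRUE),   # abbreviated month as an ordered factor
    day     = day(date),                   # day of the month
    dist_km = distance * 1.6               # miles -> kilometers
  )

strava_demo
```

Inside `mutate()` the columns are created in order, so `month` and `day` already see the converted `date` column. That is the same ordering requirement as in the note above.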
distance and time
Calculate the correlation between distance and time using cor().
Then calculate the correlation between dist_km and time.
Round both to 3 decimal places.
Note: You don’t need to save the values, just display the output in the knitted document.
# correlation between distance and time
round(cor(x = strava$distance, y = strava$time), 3)
## [1] 0.913
# correlation between dist_km and time
round(cor(x = strava$dist_km, y = strava$time), 3)
## [1] 0.913
What do you notice about the two correlations?
Answer Here
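A small sketch can help with this question. Multiplying one variable by a positive constant (here 1.6) changes its units but not its correlation with another variable. The vectors below are the first five distance and time values as printed (rounded) in the tibble above; `ride_time`, `r_miles`, and `r_km` are illustrative names.

```r
# First five rows as displayed above (rounded values)
distance  <- c(9.28, 8.33, 11.2, 15.5, 8.91)
ride_time <- c(60.0, 54.7, 67.0, 77.9, 45.4)

# Same distances expressed in kilometers
dist_km <- distance * 1.6

r_miles <- cor(distance, ride_time)
r_km    <- cor(dist_km, ride_time)

# The two correlations agree: rescaling does not change the correlation
all.equal(r_miles, r_km)
```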
Create a set of side-by-side boxplots for distance with a different box for each month. You can use the base function boxplot() or you can use ggplot().
# Using base R
boxplot(
  distance ~ month,
  data = strava
)
# Using ggplot()
ggplot(
  data = strava,
  mapping = aes(
    y = distance,
    x = month
  )
) +
  geom_boxplot(fill = 'steelblue')
Which months have no activity?
Answer here
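Besides reading the plot, the months with no activity can be checked with a count table. `month(label = TRUE)` returns a factor with all twelve months as levels, so `table()` reports a zero for any month that never occurs. The sketch below uses a hypothetical toy factor, `month_demo`, not the real data; with the real data, `table(strava$month)` plays the same role.

```r
# Hypothetical toy data: a month factor that keeps all 12 levels
month_demo <- factor(
  c("Jun", "Jul", "Jul", "Aug"),
  levels  = month.abb,
  ordered = TRUE
)

# Counts per month, including months that never occur
month_counts <- table(month_demo)
month_counts

# Names of the months with a count of zero
empty_months <- names(month_counts)[month_counts == 0]
empty_months
```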