DS 2870: Homework 1

knitr::opts_chunk$set(echo = TRUE,
                      #warning = FALSE,
                      #message = FALSE,
                      fig.align = "center")

Homework Instructions

All homework assignments should be submitted as a PDF file. The easiest way is to knit to an HTML file and convert it to a PDF. There is a video in module 1 showing how to do that!
If a question asks for a graph, table, or calculation (like an average), make sure that it appears in your knitted document.
Your homework should be your own work. While you can use the internet for help, any major deviations from the methods seen in this course will be marked incorrect, even if they give the correct answer.
The code should be readable and commented. If I’m unsure what your code did, I can’t award partial credit!

Question 1: Cyclists

1a) Load Packages

Load the tidyverse and caret packages. Install the packages if needed.

# Load the packages below:
pacman::p_load(tidyverse, caret)

# Or 
library(tidyverse)
library(caret)

# either method is fine to use
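For the "install the packages if needed" step, one option is to check which packages are missing before loading them. The sketch below is only an illustration: `missing_pkgs()` is a hypothetical helper name, not something from the course materials, and it only reports what is missing rather than installing anything.

```r
# Packages this homework asks for
needed <- c("tidyverse", "caret")

# Hypothetical helper: returns the names of packages that are not installed.
# requireNamespace() returns TRUE/FALSE without attaching the package.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(needed)

# Report what still needs install.packages(); nothing is installed automatically
if (length(to_install) > 0) {
  message("Run install.packages() for: ", paste(to_install, collapse = ", "))
}
```

Once nothing is reported as missing, either loading method shown above should work.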

If the packages loaded successfully, the code chunk below should run without errors:

# Making sure the randomness is the same
RNGversion('4.1.0'); set.seed(2870)
iris |> 
  # Shuffling the Species column
  mutate(
    species2 = sample(Species)
  ) |> 
  dplyr::select(species2, Species) |> 
  # Creating a table of species2 and Species
  table() |> 
  # Creating a confusion matrix (which we'll see later)
  confusionMatrix()

## Confusion Matrix and Statistics
## 
##             Species
## species2     setosa versicolor virginica
##   setosa         15         18        17
##   versicolor     16         16        18
##   virginica      19         16        15
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3067          
##                  95% CI : (0.2341, 0.3871)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 0.7810          
##                                           
##                   Kappa : -0.04           
##                                           
##  Mcnemar's Test P-Value : 0.9511          
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 0.3000            0.3200           0.3000
## Specificity                 0.6500            0.6600           0.6500
## Pos Pred Value              0.3000            0.3200           0.3000
## Neg Pred Value              0.6500            0.6600           0.6500
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.1000            0.1067           0.1000
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           0.4750            0.4900           0.4750

1b) Cyclist Ages

The bike criterium has 200 total participants. The ages of six participants are 25, 34, 33, 42, 22, 29. Save these ages (in the order specified) in a vector named age, then calculate the average age.

# age vector
age <- c(25, 34, 33, 42, 22, 29)

# Average (mean) age
mean(age)

## [1] 30.83333
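As a check on what `mean()` does, the average can also be computed by hand as the sum of the ages divided by how many there are. The names `total`, `n`, and `manual_mean` are just illustrative.

```r
age <- c(25, 34, 33, 42, 22, 29)

total <- sum(age)      # 25 + 34 + 33 + 42 + 22 + 29 = 185
n <- length(age)       # 6 cyclists

# Mean = sum of the values divided by the number of values
manual_mean <- total / n

# Should agree with the built-in function
all.equal(manual_mean, mean(age))
```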

1c) Bike Brands

The brands of the bikes used by the same six cyclists from question 1b are Bianchi, Dare, Bianchi, Bianchi, Cannondale, Dare. Save the brands in a vector called bike_brand and display the counts using table().

bike_brand <- c('Bianchi', 'Dare', 'Bianchi', 'Bianchi', 'Cannondale', 'Dare')

table(bike_brand)

## bike_brand
##    Bianchi Cannondale       Dare 
##          3          1          2
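The question only asks for counts, but the same table can be turned into proportions if that is easier to read. This is an optional sketch using base R's `prop.table()`; the object names are illustrative.

```r
bike_brand <- c('Bianchi', 'Dare', 'Bianchi', 'Bianchi', 'Cannondale', 'Dare')

# Counts per brand
brand_counts <- table(bike_brand)

# Each count divided by the total number of cyclists
brand_props <- prop.table(brand_counts)

round(brand_props, 3)
```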

1d) Cyclist Experience

The same six participants reported the number of previous criteriums they attended as 2, 5, 7, left blank, 0, 3. Save the values in a vector named cyclist_exp, then calculate the median of the responses. It should return a number!

cyclist_exp <- c(2, 5, 7, NA, 0, 3)

# na.rm = TRUE drops the missing value before the median is calculated
median(cyclist_exp,
       na.rm = TRUE)

## [1] 3
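To see why `na.rm` matters here, compare the result with and without it. A short sketch:

```r
cyclist_exp <- c(2, 5, 7, NA, 0, 3)

# Without na.rm, a single missing value makes the result NA
median(cyclist_exp)

# Count how many responses are missing
sum(is.na(cyclist_exp))

# With the NA dropped, the sorted values are 0, 2, 3, 5, 7, so the median is 3
median(cyclist_exp, na.rm = TRUE)
```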

1e) Cyclist ID

The six participants are assigned ID numbers of 103, 203, 303, 403, 503, 603. Create an object named bib_num using a shortcut (not c(103, 203, ...)). Have bib_num appear in the knitted document.

bib_num <- (1:6)*100 + 3

bib_num

## [1] 103 203 303 403 503 603
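The arithmetic on `1:6` is one shortcut. Another that should give the same values is `seq()` with a step of 100. The names `bib_seq` and `bib_arith` below are only for comparison.

```r
# Shortcut 1: arithmetic on the sequence 1, 2, ..., 6
bib_arith <- (1:6) * 100 + 3

# Shortcut 2: start at 103 and step by 100 until 603
bib_seq <- seq(from = 103, to = 603, by = 100)

# Both approaches should produce the same six ID numbers
all.equal(bib_arith, bib_seq)
```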

1f) Cyclist Data Set

Create a data set named cyclists using the 4 vectors created previously in the order of bib_num, bike_brand, age, cyclist_exp with column names of ID, brand, age, experience respectively.

Display the data frame in the knitted document.

cyclists <- 
  data.frame(
    ID = bib_num,
    brand = bike_brand,
    age,
    experience = cyclist_exp
  )

cyclists

##    ID      brand age experience
## 1 103    Bianchi  25          2
## 2 203       Dare  34          5
## 3 303    Bianchi  33          7
## 4 403    Bianchi  42         NA
## 5 503 Cannondale  22          0
## 6 603       Dare  29          3

1g)

Repeat question 1f), but create the cyclists data set without first creating global objects for the individual columns (no individual vectors). When completed, the only object in your global environment should be cyclists.

# Keep the line below at the top of this code chunk:
rm(list = ls())

# Now create a data.frame named cyclists as described by this question:
cyclists <- 
  data.frame(
    ID = (1:6)*100 + 3,
    brand = c('Bianchi', 'Dare', 'Bianchi', 'Bianchi', 'Cannondale', 'Dare'),
    age = c(25, 34, 33, 42, 22, 29),
    experience = c(2, 5, 7, NA, 0, 3)
  )

cyclists

##    ID      brand age experience
## 1 103    Bianchi  25          2
## 2 203       Dare  34          5
## 3 303    Bianchi  33          7
## 4 403    Bianchi  42         NA
## 5 503 Cannondale  22          0
## 6 603       Dare  29          3

# Displaying the objects in the global environment
ls()

## [1] "cyclists"

Question 2: Strava Data Set

2a) Read in the Data

Read in the strava_data.csv file, saved as strava. skim() the data once it has been read in without loading the skimr package.

strava <- read.csv("strava_data.csv")

skimr::skim(strava)

Data summary
Name	strava
Number of rows	122
Number of columns	4
_______________________
Column type frequency:
character	1
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
date	0	1	19	19	0	122	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
time	1	68.34	33.10	23.25	40.46	63.71	83.72	163.70	▇▇▃▂▁
distance	1	11.93	6.04	5.07	6.88	11.30	15.48	28.79	▇▅▃▁▂
elevation_gain	1	2.09	1.18	0.45	1.39	1.65	2.46	6.53	▇▃▂▁▁

2b) Changing date and adding month, day, dist_km

Start by changing the date column in strava from a character column to a date-type column using ymd_hms() from the lubridate package (which should already be loaded with the tidyverse).

Note: You’ll need to change the date column before you can create the month and day column!

After changing date to a date-type column, create columns named month and day by using the month() and day() functions, respectively, on the newly changed date column. Include label = TRUE inside month() to display the three letter abbreviation for each month.

Then add a column named dist_km that is the distance measured in kilometers instead of miles:

1 mile \(\approx\) 1.6 kilometers

Display the first 10 rows of the resulting data using tibble(strava).

# changing the date column
strava$date <- ymd_hms(strava$date)

# Adding month and day
strava$month <- month(strava$date, label = TRUE)
strava$day <- day(strava$date)

# Adding the dist_km column
strava$dist_km <- strava$distance * 1.6

# Displaying the results
tibble(strava)

## # A tibble: 122 × 7
##    date                 time distance elevation_gain month   day dist_km
##    <dttm>              <dbl>    <dbl>          <dbl> <ord> <int>   <dbl>
##  1 2023-06-28 22:18:00  60.0     9.28          0.935 Jun      28    14.8
##  2 2023-06-30 17:53:13  54.7     8.33          1.50  Jun      30    13.3
##  3 2023-07-04 16:12:12  67.0    11.2           1.37  Jul       4    17.8
##  4 2023-07-14 18:29:18  77.9    15.5           5.16  Jul      14    24.8
##  5 2023-07-15 17:35:51  45.4     8.91          0.920 Jul      15    14.2
##  6 2023-07-17 16:51:24  59.4    11.5           1.64  Jul      17    18.4
##  7 2023-07-21 18:27:02 140.     25.5           4.94  Jul      21    40.7
##  8 2023-07-23 21:34:35  65.8    12.8           1.67  Jul      23    20.5
##  9 2023-07-24 22:07:51  82.1    10.7           1.45  Jul      24    17.0
## 10 2023-07-25 14:45:21 105.     18.8           2.08  Jul      25    30.0
## # ℹ 112 more rows
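The same four steps can also be written as one `mutate()` call. The sketch below is an alternative to the `$` assignments above, not a replacement for them. It runs on a three-row toy data frame (dates and distances copied from the printed output) so it does not need strava_data.csv, and `toy` and `toy2` are illustrative names.

```r
library(dplyr)
library(lubridate)

# Toy data: three rows copied from the printed output (values as displayed)
toy <- data.frame(
  date = c("2023-06-28 22:18:00", "2023-07-04 16:12:12", "2023-07-14 18:29:18"),
  distance = c(9.28, 11.2, 15.5)
)

toy2 <- toy |>
  mutate(
    # Convert the character column to a date-time first;
    # later lines in the same mutate() see the converted column
    date = ymd_hms(date),
    month = month(date, label = TRUE),
    day = day(date),
    # 1 mile is approximately 1.6 kilometers
    dist_km = distance * 1.6
  )

toy2
```

Because `mutate()` evaluates its arguments in order, `date` is already a date-time by the time `month()` and `day()` are applied, which mirrors the note about changing the date column first.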

2c) Correlation between distance and time

Calculate the correlation between distance and time using cor().

Then calculate the correlation between dist_km and time.

Round both to 3 decimal places.

Note: You don’t need to save the values, just display the output in the knitted document

# correlation between distance and time
round(cor(x = strava$distance, y = strava$time), 3)

## [1] 0.913

# correlation between dist_km and time
round(cor(x = strava$dist_km, y = strava$time), 3)

## [1] 0.913

What do you notice about the two correlations?

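The two correlations are identical (0.913). Converting miles to kilometers multiplies every distance by the same positive constant, and correlation does not change under that kind of rescaling.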
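The comparison can be reproduced on a small scale without the full file. The sketch below uses only the first five rows of time and distance as printed in the 2b output (rounded values), so the correlation it produces will not match 0.913. The point is only that the miles and kilometers versions agree with each other.

```r
# First five rows as printed in the 2b output (rounded, not the full data)
time <- c(60.0, 54.7, 67.0, 77.9, 45.4)
distance <- c(9.28, 8.33, 11.2, 15.5, 8.91)

# Same conversion used for dist_km
dist_km <- distance * 1.6

r_miles <- cor(distance, time)
r_km <- cor(dist_km, time)

# Multiplying one variable by a positive constant leaves the correlation unchanged
all.equal(r_miles, r_km)
```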

2d) Box plot of distance by month

Create a set of side-by-side boxplots for distance with a different box for each month. You can use the base function boxplot() or you can use ggplot().

# Using base R
boxplot(
  distance ~ month,
  data = strava
)

# Using ggplot()
ggplot(
  data = strava,
  mapping = aes(
    y = distance,
    x = month
  )
) + 
  geom_boxplot(fill = 'steelblue')

Which months have no activity?

Answer here
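Besides reading the box plot, months with no activity can be found by counting rows per month. This works when the month column is a factor that carries all twelve month levels, which is assumed here. The sketch uses a made-up toy factor, not the real strava data, so its result says nothing about the actual answer.

```r
# Toy month column (hypothetical values), with all twelve months as levels
toy_month <- factor(
  c("Jun", "Jul", "Jul", "Sep"),
  levels = month.abb,
  ordered = TRUE
)

# table() keeps levels that never occur and reports them with a count of 0
month_counts <- table(toy_month)
month_counts

# Names of the months whose count is zero
empty_months <- names(month_counts)[as.vector(month_counts) == 0]
empty_months
```

On the real data, the equivalent call would be `table(strava$month)`.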

