Part 1: Data Visualization

  1. (CW) Open an R Markdown file to use for today’s classwork [COMPLETED]

  2. (CW) Load the bike sharing data from last class

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
bike_sharing <- read_csv("~/Downloads/bikesharing.csv")
## Rows: 731 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): season, month, weekday, weather
## dbl  (7): year, temperature_F, casual, registered, count, humidity, windspeed
## lgl  (2): holiday, workingday
## date (2): date, date_noyear
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  3. (CW) Create a box plot of season vs count. Sort the boxes by count.
bike_sharing %>%
  ggplot(aes(reorder(season, count), count)) +
  geom_boxplot()
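
A note on how the sorting works: reorder() orders the levels of its first argument by a summary of the second, and the default summary is the mean, so the boxes above are sorted by mean count per season. The sketch below uses made-up numbers (the season and count vectors are toy values, not the bike sharing data) to show the default and a median-based alternative.

```r
# Toy data (hypothetical values, chosen so mean and median disagree)
season <- rep(c("winter", "fall", "summer"), each = 3)
count  <- c(100, 200, 3000,   # winter: mean 1100, median 200
            500, 600, 700,    # fall:   mean 600,  median 600
            900, 1000, 1100)  # summer: mean 1000, median 1000

# Default: levels sorted by mean(count) within each season
by_mean <- reorder(season, count)
levels(by_mean)

# Alternative: sort by the median, which matches the line drawn in a box plot
by_median <- reorder(season, count, FUN = median)
levels(by_median)
```

If the boxes should be ordered by the line drawn inside each box, passing FUN = median to reorder() inside aes() is one option.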

  4. OPTIONAL: Create a box plot of windspeed by season. Choose another categorical variable to fill.
bike_sharing %>%
  ggplot(aes(reorder(season, windspeed), windspeed, fill = month)) +
  geom_boxplot()

Part 2: T Tests

  1. (CW) Read the following csv into a data frame called ais in R.
# Example: t.test(count ~ workingday, data = bike_sharing)
ais <- read_csv("~/Downloads/ais.csv")
## Rows: 202 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): sex, sport
## dbl (7): rcc, wcc, hc, hg, ferr, ht, wt
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  2. (CW) Perform a t-test using the ais data, with sex as the independent variable and ht as the dependent variable.

Answer: Since the p-value (< 2.2e-16) is less than 0.05, I reject the null hypothesis and conclude that mean ht differs by sex.

t.test(ht ~ sex, data = ais)
## 
##  Welch Two Sample t-test
## 
## data:  ht by sex
## t = -9.6009, df = 199.24, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group f and group m is not equal to 0
## 95 percent confidence interval:
##  -13.153090  -8.670675
## sample estimates:
## mean in group f mean in group m 
##        174.5940        185.5059
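
The printed summary is convenient to read, but the individual numbers can also be pulled out of the object that t.test() returns. The sketch below is illustrative only: it uses R's built-in sleep dataset as a stand-in (the ais file is not bundled with R), and res is a hypothetical variable name.

```r
# Stand-in data: the built-in `sleep` dataset (extra hours of sleep by drug group)
res <- t.test(extra ~ group, data = sleep)

res$method     # which test was run (Welch by default, since var.equal = FALSE)
res$statistic  # t statistic
res$parameter  # degrees of freedom
res$p.value    # p-value as a plain number
res$conf.int   # confidence interval for the difference in means
res$estimate   # the two group means
```

The same fields are available for t.test(ht ~ sex, data = ais), which makes it easy to compare the p-value against 0.05 in code rather than by eye.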
  3. (CW) Perform a t-test using sex as the independent variable and wcc as the dependent variable.

Answer: Since the p-value (0.3699) is greater than 0.05, I fail to reject the null hypothesis; the data do not show a difference in mean wcc by sex.

t.test(wcc ~ sex, data = ais)
## 
##  Welch Two Sample t-test
## 
## data:  wcc by sex
## t = -0.8988, df = 198.28, p-value = 0.3699
## alternative hypothesis: true difference in means between group f and group m is not equal to 0
## 95 percent confidence interval:
##  -0.7268643  0.2717271
## sample estimates:
## mean in group f mean in group m 
##        6.994000        7.221569
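
The two conclusions can be cross-checked against the confidence intervals: for a two-sided test at alpha = 0.05, the null hypothesis of zero difference is rejected exactly when the 95% interval excludes 0. A minimal sketch using the intervals printed above (contains_zero is a hypothetical helper name):

```r
# 95% confidence intervals copied from the two t-test outputs
ci_ht  <- c(-13.153090, -8.670675)
ci_wcc <- c(-0.7268643, 0.2717271)

# Hypothetical helper: TRUE if the interval includes 0
contains_zero <- function(ci) ci[1] <= 0 && ci[2] >= 0

contains_zero(ci_ht)   # FALSE: interval excludes 0, consistent with rejecting H0
contains_zero(ci_wcc)  # TRUE: interval includes 0, consistent with failing to reject H0
```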

Part 3: Correlation

  1. (CW) Find the pair-wise correlation for all quantitative variables using cor() (you will need to use select() to remove sex and sport).
cor(select(ais, rcc, wcc, hc, hg, ferr, ht, wt))
##            rcc        wcc        hc        hg      ferr         ht        wt
## rcc  1.0000000 0.14706422 0.9249639 0.8887998 0.2508655 0.35885396 0.4037635
## wcc  0.1470642 1.00000000 0.1533326 0.1347199 0.1320729 0.07681056 0.1556625
## hc   0.9249639 0.15333265 1.0000000 0.9507567 0.2582395 0.37119150 0.4237113
## hg   0.8887998 0.13471992 0.9507567 1.0000000 0.3083911 0.35232222 0.4552628
## ferr 0.2508655 0.13207288 0.2582395 0.3083911 1.0000000 0.12325468 0.2737023
## ht   0.3588540 0.07681056 0.3711915 0.3523222 0.1232547 1.00000000 0.7809321
## wt   0.4037635 0.15566247 0.4237113 0.4552628 0.2737023 0.78093207 1.0000000
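
With seven variables the largest entry is easy to spot by eye, but the lookup can also be done in code. The sketch below assumes a hypothetical helper, top_pair(), and demonstrates it on a 3 x 3 excerpt of the matrix above (rcc, hc, hg) so that it runs without the ais file.

```r
# Hypothetical helper: names of the two variables with the largest
# absolute off-diagonal correlation
top_pair <- function(cor_mat) {
  m <- abs(cor_mat)
  m[upper.tri(m, diag = TRUE)] <- NA  # drop the diagonal and duplicate pairs
  idx <- which(m == max(m, na.rm = TRUE), arr.ind = TRUE)[1, ]
  c(rownames(cor_mat)[idx[["row"]]], colnames(cor_mat)[idx[["col"]]])
}

# Excerpt of the correlation matrix printed above
vars <- c("rcc", "hc", "hg")
cm <- matrix(c(1.0000000, 0.9249639, 0.8887998,
               0.9249639, 1.0000000, 0.9507567,
               0.8887998, 0.9507567, 1.0000000),
             nrow = 3, dimnames = list(vars, vars))

top_pair(cm)
```

On the full data, top_pair(cor(select(ais, rcc, wcc, hc, hg, ferr, ht, wt))) would be the equivalent call.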
  2. (CW) Plot the pair-wise scatterplots for all quantitative variables using pairs() (you will need to use select() to remove sex and sport).
pairs(select(ais, rcc, wcc, hc, hg, ferr, ht, wt))

  3. (CW) Based on the correlation matrix, which two variables have the highest correlation? Use cor.test() to find more details about the correlation of these two variables (what is the p-value? What is the confidence interval?)

Answer: In the correlation matrix, hg and hc have the highest correlation (r = 0.9508). The p-value from cor.test() is less than 2.2e-16, far below our alpha of 0.05, so the correlation is statistically significant. The 95% confidence interval for the correlation coefficient is 0.9354917 to 0.9624795. Together these indicate a very strong positive linear relationship between hg and hc.

cor.test(ais$hg, ais$hc)
## 
##  Pearson's product-moment correlation
## 
## data:  ais$hg and ais$hc
## t = 43.382, df = 200, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9354917 0.9624795
## sample estimates:
##       cor 
## 0.9507567
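
The statistic and interval in this output can be reproduced by hand from r and the degrees of freedom, which helps in reading what each line means. The sketch below assumes the standard Pearson formulas: t = r * sqrt(df) / sqrt(1 - r^2), and a confidence interval built with the Fisher z transform (the approach the cor.test() documentation describes for the Pearson method). The values of r and df are copied from the output above; the variable names are illustrative.

```r
r  <- 0.9507567  # sample correlation between hg and hc
df <- 200        # degrees of freedom, n - 2
n  <- df + 2

# t statistic for testing "true correlation is equal to 0"
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z transform
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 3)
round(ci, 4)
```

The printed "p-value < 2.2e-16" is a display floor (roughly .Machine$double.eps), so it is best reported as "less than 2.2e-16" rather than as an exact value.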
lm(count ~ temperature_F, data = bike_sharing)
## 
## Call:
## lm(formula = count ~ temperature_F, data = bike_sharing)
## 
## Coefficients:
##   (Intercept)  temperature_F  
##      -1663.15          89.96
summary(lm(count ~ temperature_F, data = bike_sharing))
## 
## Call:
## lm(formula = count ~ temperature_F, data = bike_sharing)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4616  -1135   -105   1046   3741 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1663.154    288.972  -5.755 1.27e-08 ***
## temperature_F    89.957      4.135  21.753  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1510 on 729 degrees of freedom
## Multiple R-squared:  0.3936, Adjusted R-squared:  0.3928 
## F-statistic: 473.2 on 1 and 729 DF,  p-value: < 2.2e-16
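
The coefficient table defines a fitted line, count = -1663.154 + 89.957 * temperature_F, so each additional degree Fahrenheit is associated with about 90 more rides per day. A minimal sketch of using those reported coefficients for a prediction (predict_count is a hypothetical helper, and 70 is an arbitrary example temperature):

```r
# Coefficients copied from the summary() output above
intercept <- -1663.154
slope     <- 89.957

# Hypothetical helper: predicted daily count at a given temperature (deg F)
predict_count <- function(temp_f) intercept + slope * temp_f

predict_count(70)

# In a one-predictor model, R-squared is the squared correlation between
# the two variables, so sqrt(R-squared) recovers |r|
sqrt(0.3936)
```

With the fitted model object available, predict(model, newdata = data.frame(temperature_F = 70)) would give the same prediction without retyping the coefficients.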