Load Packages & Dataset

Accessing data - Download the Bike-Sharing-Dataset.zip file above.

The folder contains two data files, hour.csv and day.csv. You will use day.csv for this assignment. The readme file describes the data, and you are encouraged to read it so that you can interpret the analytic results correctly.

There are 16 columns in the dataset. You will need to use columns: dteday, temp, and cnt. “cnt” is the outcome variable, or dependent variable.

Presenting data – Create an .rmd file in RStudio. Use a code chunk to report a summary of the data. 5 points

Preparing data - Extract the month names from the dteday column using the lubridate package and save them in a new column, month_name, which has a chr data type. 15 points

Hints:
- Convert the column dteday into a date type using the lubridate package.
- Use an appropriate lubridate function to extract the month from the dteday column and save it as month_name.
- Remember the default function will extract the month numbers and not the month name. You will find a function argument to extract month names/labels.
- Also, make sure to convert the column month_name to a character data type. If you want, you may convert it back to a factor. Do not keep it as an ordered factor data type.

Summarize Data

# Display the summary of selected columns
data %>% select(dteday, temp, cnt) %>% 
  summary()

##     dteday               temp              cnt      
##  Length:731         Min.   :0.05913   Min.   :  22  
##  Class :character   1st Qu.:0.33708   1st Qu.:3152  
##  Mode  :character   Median :0.49833   Median :4548  
##                     Mean   :0.49538   Mean   :4504  
##                     3rd Qu.:0.65542   3rd Qu.:5956  
##                     Max.   :0.86167   Max.   :8714

Extract Month Names

# Convert dteday to date format
data$dteday <- ymd(data$dteday)

# Extract month names
data$month_name <- month(data$dteday, label = TRUE, abbr = TRUE)

# Convert to character type
data$month_name <- as.character(data$month_name)

# Display updated data structure
str(data)

## 'data.frame':    731 obs. of  17 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : Date, format: "2011-01-01" "2011-01-02" ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 0 1 2 3 4 5 6 0 1 ...
##  $ workingday: int  0 0 1 1 1 1 1 0 0 1 ...
##  $ weathersit: int  2 2 1 1 1 1 2 2 1 1 ...
##  $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
##  $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
##  $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
##  $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
##  $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
##  $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...
##  $ month_name: chr  "Jan" "Jan" "Jan" "Jan" ...
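
The hint about not keeping month_name as an ordered factor matters because lm() codes ordered factors differently from character columns. The sketch below illustrates this with base R only, on a small made-up data frame (toy and month_ord are hypothetical names, not part of the assignment data). The call factor(..., ordered = TRUE) stands in for the ordered factor that month(label = TRUE) returns.

```r
# Hypothetical toy data: three months, two observations each
toy <- data.frame(
  cnt = c(10, 12, 20, 22, 30, 32),
  month_name = c("Jan", "Jan", "Feb", "Feb", "Mar", "Mar")
)

# Ordered factor, standing in for the class returned by month(label = TRUE)
toy$month_ord <- factor(toy$month_name,
                        levels = c("Jan", "Feb", "Mar"),
                        ordered = TRUE)

fit_ord <- lm(cnt ~ month_ord, data = toy)   # polynomial contrasts
fit_chr <- lm(cnt ~ month_name, data = toy)  # treatment (dummy) contrasts

names(coef(fit_ord))  # "(Intercept)" "month_ord.L" "month_ord.Q"
names(coef(fit_chr))  # "(Intercept)" "month_nameJan" "month_nameMar"
```

With the ordered factor, the coefficients are linear (.L) and quadratic (.Q) trend terms rather than one difference per month, which is much harder to interpret for this assignment.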

Model 1: Simple Linear Regression

# Model 1: cnt as dependent variable and month_name as independent variable
Model1 <- lm(cnt ~ month_name, data = data)

# Display the summary of Model 1
summary(Model1)

## 
## Call:
## lm(formula = cnt ~ month_name, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5177.2 -1095.2  -249.3  1290.0  4669.7 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4484.9      196.7  22.799  < 2e-16 ***
## month_nameAug   1179.5      275.9   4.275 2.17e-05 ***
## month_nameDec  -1081.1      275.9  -3.918 9.79e-05 ***
## month_nameFeb  -1829.6      281.8  -6.492 1.58e-10 ***
## month_nameJan  -2308.6      275.9  -8.366 3.09e-16 ***
## month_nameJul   1078.8      275.9   3.909 0.000101 ***
## month_nameJun   1287.5      278.2   4.628 4.38e-06 ***
## month_nameMar   -792.6      275.9  -2.873 0.004192 ** 
## month_nameMay    864.9      275.9   3.134 0.001793 ** 
## month_nameNov   -237.7      278.2  -0.854 0.393113    
## month_nameOct    714.3      275.9   2.589 0.009829 ** 
## month_nameSep   1281.6      278.2   4.607 4.83e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1524 on 719 degrees of freedom
## Multiple R-squared:  0.3906, Adjusted R-squared:  0.3813 
## F-statistic:  41.9 on 11 and 719 DF,  p-value: < 2.2e-16

The R-squared value is the proportion of the variance in cnt that is explained by month_name. Here it is 0.3906, so month_name explains about 39% of the variance in cnt. The adjusted R-squared value is a modified version of R-squared that penalizes the number of predictors in the model. Here it is 0.3813 (~ .38), slightly lower than R-squared because the model estimates 11 month coefficients.
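
As a rough check, the adjusted R-squared can be reproduced from the values printed by summary(Model1). The sketch below uses the standard adjustment formula; adj_r2 is a hypothetical helper name, and the inputs are the rounded figures from the output above (731 observations, 11 month coefficients).

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Rounded values taken from the summary(Model1) output
adj_model1 <- adj_r2(r2 = 0.3906, n = 731, p = 11)
round(adj_model1, 4)  # 0.3813
```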

The reference month is April. Apr is the only month without its own coefficient in the summary output, so the intercept is the estimated mean cnt for April, and each month coefficient is the difference in mean cnt between that month and April. The reason, per the source listed below, is that R orders factor levels alphabetically by default, and lm() treats the character column month_name as a factor. The first level is the reference level, and Apr comes first alphabetically among the month abbreviations.

The predicted value of cnt for April (the reference month) is therefore the intercept estimate from the Model1 output, 4484.9.

https://stackoverflow.com/questions/31638771/r-reorder-levels-of-a-factor-alphabetically-but-one
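
The reference-level behavior can be checked on a small made-up example. The data frame toy and the column month_f below are hypothetical and only illustrate the mechanism: default levels are alphabetical, the intercept equals the mean of the reference group, and relevel() is one way to pick a different baseline.

```r
# Hypothetical toy data: two observations per month
toy <- data.frame(
  cnt = c(40, 44, 20, 24, 60, 64),
  month_name = c("Apr", "Apr", "Jan", "Jan", "Jun", "Jun")
)

# Default factor levels are sorted alphabetically, so Apr comes first
levels(factor(toy$month_name))  # "Apr" "Jan" "Jun"

# With Apr as the reference, the intercept is the mean cnt of the Apr rows
fit_default <- lm(cnt ~ month_name, data = toy)
coef(fit_default)["(Intercept)"]  # 42

# Choosing Jan as the baseline instead
toy$month_f <- relevel(factor(toy$month_name), ref = "Jan")
fit_jan <- lm(cnt ~ month_f, data = toy)
names(coef(fit_jan))  # "(Intercept)" "month_fApr" "month_fJun"
```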

Predict Jan and Jun

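# Predicted cnt = intercept (April baseline) + month coefficient,
# using the rounded estimates from summary(Model1)
pred_january <- 4484.9 - 2308.6
pred_june <- 4484.9 + 1287.5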

cat("Predicted Jan:", pred_january, "\n")

## Predicted Jan: 2176.3

cat("Predicted Jun:", pred_june)

## Predicted Jun: 5772.4
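
The hand calculation above relies on coefficients typed in from the printed summary. A possible alternative is to let predict() do the arithmetic from the fitted model. The sketch below shows, on a made-up data frame (toy and fit are hypothetical names), that predict() with newdata gives the same result as intercept plus month coefficient.

```r
# Hypothetical toy data: two observations per month
toy <- data.frame(
  cnt = c(45, 47, 21, 23, 58, 60),
  month_name = c("Apr", "Apr", "Jan", "Jan", "Jun", "Jun")
)
fit <- lm(cnt ~ month_name, data = toy)

# Manual route: intercept (Apr baseline) + month coefficient
manual_jan <- unname(coef(fit)["(Intercept)"] + coef(fit)["month_nameJan"])
manual_jun <- unname(coef(fit)["(Intercept)"] + coef(fit)["month_nameJun"])

# predict() route: supply the months of interest as new data
pred <- predict(fit, newdata = data.frame(month_name = c("Jan", "Jun")))

all.equal(unname(pred), c(manual_jan, manual_jun))  # TRUE
```

On the assignment data, the analogous call would pass Model1 and the same newdata, which avoids rounding the coefficients by hand.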

Model 2: Multiple Linear Regression

# Model 2: cnt as dependent variable and temp and month_name as independent variables
Model2 <- lm(cnt ~ temp + month_name, data = data)

# Display the summary of Model 2
summary(Model2)

## 
## Call:
## lm(formula = cnt ~ temp + month_name, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4896.6 -1080.0  -228.4  1245.2  3372.9 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1554.39     390.76   3.978 7.66e-05 ***
## temp           6235.14     729.40   8.548  < 2e-16 ***
## month_nameAug  -308.08     315.42  -0.977   0.3290    
## month_nameDec  -170.96     283.80  -0.602   0.5471    
## month_nameFeb  -764.81     296.15  -2.582   0.0100 *  
## month_nameJan  -852.31     313.41  -2.719   0.0067 ** 
## month_nameJul  -701.18     335.50  -2.090   0.0370 *  
## month_nameJun   -47.47     307.78  -0.154   0.8775    
## month_nameMar  -297.20     269.38  -1.103   0.2703    
## month_nameMay    86.73     278.37   0.312   0.7555    
## month_nameNov   390.66     275.22   1.419   0.1562    
## month_nameOct   620.72     263.30   2.357   0.0187 *  
## month_nameSep   368.25     285.93   1.288   0.1982    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1453 on 718 degrees of freedom
## Multiple R-squared:  0.4469, Adjusted R-squared:  0.4377 
## F-statistic: 48.35 on 12 and 718 DF,  p-value: < 2.2e-16

The R-squared value for Model 2 is 0.4469, indicating that about 45% of the variance in cnt is explained by temp and month_name together. This is higher than the R-squared value for Model 1, which was 0.3906. R-squared never decreases when a predictor is added, so the fairer comparison is the adjusted R-squared, which rises from 0.3813 (~ .38) in Model 1 to 0.4377 (~ .44) in Model 2. This suggests that adding temp as an independent variable genuinely improves the model's ability to explain the variance in cnt, and the temp coefficient is also highly significant (p < 2e-16). This seems intuitive, as fewer people are likely to bike in very cold temperatures. There may still be lurking variables that are not included in the model, such as the other weather measures in the dataset (weathersit, hum, windspeed).
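
One way to back up the comparison is a partial F-test for the added predictor. The sketch below computes it from the rounded figures printed in the two summaries, so it is only approximate; the variable names are hypothetical. Because only one predictor (temp) was added, the F statistic should roughly equal the square of the t value for temp in Model 2.

```r
# Rounded values taken from the two summary() outputs
r2_model1 <- 0.3906   # cnt ~ month_name
r2_model2 <- 0.4469   # cnt ~ temp + month_name
df_resid2 <- 718      # residual degrees of freedom of Model 2
q <- 1                # number of predictors added (temp)

# Partial F statistic for the nested comparison
partial_f <- ((r2_model2 - r2_model1) / q) / ((1 - r2_model2) / df_resid2)
partial_f             # roughly 73

# Squared t value of temp in Model 2, for comparison
t_temp <- 8.548
t_temp^2              # also roughly 73

# Upper-tail p-value of the F statistic
p_value <- pf(partial_f, q, df_resid2, lower.tail = FALSE)
```

With the fitted models available, anova(Model1, Model2) would run the same comparison without relying on rounded inputs.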

Compare the coefficient estimates for the month_nameJan variable in Model1 and Model2. In regular text, explain why the coefficient estimates are different.

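The coefficients for month_nameJan differ between Model1 and Model2 (-2308.6 versus -852.31) because they answer different questions. In Model1, the coefficient is the difference in mean cnt between January and the reference month (April) without considering temperature, so it also absorbs the effect of January days tending to be colder than April days. In Model2, the coefficient is the difference in cnt between January and April while controlling for temperature, that is, comparing days with the same temp. Because temp has a large positive coefficient (6235.14), much of the January deficit seen in Model1 is attributed to temp in Model2.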
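
The same pattern can be reproduced on made-up data. In the sketch below (toy, wiggle, fit_month, and fit_both are hypothetical names), cnt is constructed to depend on temp only, and the Jan rows are simply colder than the Apr rows. The month coefficient is large when temp is left out and shrinks toward zero once temp is included.

```r
# Hypothetical toy data: Jan days are cold, Apr days are mild
toy <- data.frame(
  month_name = rep(c("Apr", "Jan"), each = 4),
  temp = c(0.45, 0.50, 0.55, 0.60, 0.15, 0.20, 0.25, 0.30)
)

# cnt depends on temp plus small fixed deviations, not on the month itself
wiggle <- c(30, -20, -30, 20, -10, 40, -40, 10)
toy$cnt <- 1500 + 6000 * toy$temp + wiggle

fit_month <- lm(cnt ~ month_name, data = toy)         # temp omitted
fit_both  <- lm(cnt ~ temp + month_name, data = toy)  # temp controlled for

# Without temp, Jan looks much worse than Apr (about -1800)
coef(fit_month)["month_nameJan"]

# With temp in the model, the Jan coefficient is far smaller in magnitude
coef(fit_both)["month_nameJan"]
```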

# Using coefficients from Model2 to calculate predicted cnt for January when temp = 0.25
pred_january_model2 <- coef(Model2)["(Intercept)"] +
  coef(Model2)["month_nameJan"] +
  coef(Model2)["temp"] * 0.25

cat("Predicted cnt Jan when temp = 0.25:", pred_january_model2, "\n")

## Predicted cnt Jan when temp = 0.25: 2260.863

Module 1 Assignment

Peter Thompson

2025-04-04