Accessing data - Download the Bike-Sharing-Dataset.zip file above.
The folder contains two data files, hour.csv and day.csv. You will use day.csv for this assignment. The readme file describes the data; you are encouraged to read it so that you can correctly interpret the analytic results.
There are 16 columns in the dataset. You will need to use columns: dteday, temp, and cnt. “cnt” is the outcome variable, or dependent variable.
Presenting data – Create an .rmd file in RStudio. Use a code chunk to report a summary of the data. 5 points
Preparing data - Extract the month names from the dteday column using lubridate package and save them in a new column month_name, which has a chr data type. 15 points
Hints:
- Convert the column dteday into a date type using the lubridate package.
- Use an appropriate lubridate function to extract the month from the dteday column and save it as month_name.
- Remember that the default function extracts month numbers, not month names. You will find a function argument to extract month names/labels.
- Also, make sure to convert the column month_name to a character data type. If you want, you may convert it back to a factor. Do not keep it as an ordered factor data type.
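The last hint matters for the regressions later on. As a minimal sketch on made-up toy data (the data frame toy and its columns month_chr, month_ord, and y are hypothetical and not part of the assignment), the following compares how lm() names its coefficients for a character month column and for an ordered factor:

```r
set.seed(1)

# Hypothetical toy data: three months with made-up outcome values
toy <- data.frame(
  month_chr = rep(c("Jan", "Feb", "Mar"), each = 4),
  y = c(rnorm(4, mean = 10), rnorm(4, mean = 12), rnorm(4, mean = 15))
)

# Same months stored as an ordered factor
toy$month_ord <- factor(toy$month_chr,
                        levels = c("Jan", "Feb", "Mar"),
                        ordered = TRUE)

fit_chr <- lm(y ~ month_chr, data = toy)  # character -> treatment (dummy) coding
fit_ord <- lm(y ~ month_ord, data = toy)  # ordered factor -> polynomial contrasts

names(coef(fit_chr))
# "(Intercept)" "month_chrJan" "month_chrMar"

names(coef(fit_ord))
# "(Intercept)" "month_ord.L" "month_ord.Q"
```

With the character column, each coefficient is a difference from a reference month (here Feb, the first level alphabetically). With the ordered factor, R's default contrasts produce linear and quadratic trend terms (.L, .Q), which are harder to read as month-by-month differences.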
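library(dplyr)      # %>% and select()
library(lubridate)  # ymd() and month(), used below
data <- read.csv("day.csv")  # assumes day.csv is in the working directory
# Display the summary of selected columns
data %>% select(dteday, temp, cnt) %>%
  summary()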
## dteday temp cnt
## Length:731 Min. :0.05913 Min. : 22
## Class :character 1st Qu.:0.33708 1st Qu.:3152
## Mode :character Median :0.49833 Median :4548
## Mean :0.49538 Mean :4504
## 3rd Qu.:0.65542 3rd Qu.:5956
## Max. :0.86167 Max. :8714
# Convert dteday to date format
data$dteday <- ymd(data$dteday)
# Extract month names
data$month_name <- month(data$dteday, label = TRUE, abbr = TRUE)
# Convert to character type
data$month_name <- as.character(data$month_name)
# Display updated data structure
str(data)
## 'data.frame': 731 obs. of 17 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : Date, format: "2011-01-01" "2011-01-02" ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 0 1 2 3 4 5 6 0 1 ...
## $ workingday: int 0 0 1 1 1 1 1 0 0 1 ...
## $ weathersit: int 2 2 1 1 1 1 2 2 1 1 ...
## $ temp : num 0.344 0.363 0.196 0.2 0.227 ...
## $ atemp : num 0.364 0.354 0.189 0.212 0.229 ...
## $ hum : num 0.806 0.696 0.437 0.59 0.437 ...
## $ windspeed : num 0.16 0.249 0.248 0.16 0.187 ...
## $ casual : int 331 131 120 108 82 88 148 68 54 41 ...
## $ registered: int 654 670 1229 1454 1518 1518 1362 891 768 1280 ...
## $ cnt : int 985 801 1349 1562 1600 1606 1510 959 822 1321 ...
## $ month_name: chr "Jan" "Jan" "Jan" "Jan" ...
# Model 1: cnt as dependent variable and month_name as independent variable
Model1 <- lm(cnt ~ month_name, data = data)
# Display the summary of Model 1
summary(Model1)
##
## Call:
## lm(formula = cnt ~ month_name, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5177.2 -1095.2 -249.3 1290.0 4669.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4484.9 196.7 22.799 < 2e-16 ***
## month_nameAug 1179.5 275.9 4.275 2.17e-05 ***
## month_nameDec -1081.1 275.9 -3.918 9.79e-05 ***
## month_nameFeb -1829.6 281.8 -6.492 1.58e-10 ***
## month_nameJan -2308.6 275.9 -8.366 3.09e-16 ***
## month_nameJul 1078.8 275.9 3.909 0.000101 ***
## month_nameJun 1287.5 278.2 4.628 4.38e-06 ***
## month_nameMar -792.6 275.9 -2.873 0.004192 **
## month_nameMay 864.9 275.9 3.134 0.001793 **
## month_nameNov -237.7 278.2 -0.854 0.393113
## month_nameOct 714.3 275.9 2.589 0.009829 **
## month_nameSep 1281.6 278.2 4.607 4.83e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1524 on 719 degrees of freedom
## Multiple R-squared: 0.3906, Adjusted R-squared: 0.3813
## F-statistic: 41.9 on 11 and 719 DF, p-value: < 2.2e-16
The R-squared value is the proportion of variance in cnt that is explained by month_name. Here the multiple R-squared is 0.3906, so month_name explains about 39% of the variance in cnt. The adjusted R-squared, which penalizes R-squared for the number of predictors in the model, is slightly lower at 0.3813 (about 38%).
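As a quick check of how the two values relate, the adjusted R-squared can be recomputed from the printed R-squared with the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1). The sketch below plugs in the rounded values from the Model1 output (n = 731 observations, p = 11 month dummy variables); the variable names are illustrative only.

```r
# Values read off the Model1 summary output (rounded as printed)
r2 <- 0.3906  # multiple R-squared
n  <- 731     # number of observations
p  <- 11      # number of predictors (month dummies), excluding the intercept

# Adjusted R-squared: shrinks R-squared according to the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)
# 0.3813
```

The result agrees with the adjusted R-squared in the summary, and n - p - 1 = 719 matches the residual degrees of freedom reported there.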
The reference month is April. The intercept is the predicted cnt for April, and each of the other month coefficients is the difference in predicted cnt between that month and April. The summary output shows this, as Apr is the only month without its own coefficient. From the source listed below: when R converts a character variable to a factor, it orders the levels alphabetically by default. The first level is the reference level, and the coefficients for the other levels are differences from that reference level. Apr comes first alphabetically among the month abbreviations, so it is the reference level.
The predicted value of cnt for April (the reference month) is therefore the intercept estimate from the Model1 output, 4484.9.
https://stackoverflow.com/questions/31638771/r-reorder-levels-of-a-factor-alphabetically-but-one
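The alphabetical ordering of the levels can be checked directly in base R. This is a small sketch using the built-in constant month.abb rather than the assignment data; the object names are illustrative, and relevel() is shown only as one way to pick a different baseline, not as something the assignment asks for.

```r
# Default factor levels for the English month abbreviations are sorted alphabetically
months_default <- factor(month.abb)
levels(months_default)[1]
# "Apr"

# relevel() moves a chosen level to the front, making it the reference level
months_jan_ref <- relevel(months_default, ref = "Jan")
levels(months_jan_ref)[1]
# "Jan"
```

If month_name were releveled this way before fitting, the intercept would be the January prediction and the other coefficients would be differences from January; the fitted values themselves would not change.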
pred_january <- 4484.9 - 2308.6
pred_june <- 4484.9 + 1287.5
cat("Predicted Jan:", pred_january, "\n")
## Predicted Jan: 2176.3
cat("Predicted Jun:", pred_june)
## Predicted Jun: 5772.4
# Model 2: cnt as dependent variable and temp and month_name as independent variables
Model2 <- lm(cnt ~ temp + month_name, data = data)
# Display the summary of Model 2
summary(Model2)
##
## Call:
## lm(formula = cnt ~ temp + month_name, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4896.6 -1080.0 -228.4 1245.2 3372.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1554.39 390.76 3.978 7.66e-05 ***
## temp 6235.14 729.40 8.548 < 2e-16 ***
## month_nameAug -308.08 315.42 -0.977 0.3290
## month_nameDec -170.96 283.80 -0.602 0.5471
## month_nameFeb -764.81 296.15 -2.582 0.0100 *
## month_nameJan -852.31 313.41 -2.719 0.0067 **
## month_nameJul -701.18 335.50 -2.090 0.0370 *
## month_nameJun -47.47 307.78 -0.154 0.8775
## month_nameMar -297.20 269.38 -1.103 0.2703
## month_nameMay 86.73 278.37 0.312 0.7555
## month_nameNov 390.66 275.22 1.419 0.1562
## month_nameOct 620.72 263.30 2.357 0.0187 *
## month_nameSep 368.25 285.93 1.288 0.1982
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1453 on 718 degrees of freedom
## Multiple R-squared: 0.4469, Adjusted R-squared: 0.4377
## F-statistic: 48.35 on 12 and 718 DF, p-value: < 2.2e-16
The R-squared value for Model 2 is 0.4469, indicating that about 45% of the variance in cnt is explained by temp and month_name together. This is higher than the R-squared value for Model 1, which was 0.3906 (about 39%). R-squared never decreases when a predictor is added, so the adjusted R-squared is the fairer comparison: it rises from 0.3813 in Model 1 to 0.4377 in Model 2, indicating that adding temp improves the model fit even after accounting for the extra predictor. This seems intuitive, as one can assume fewer people are likely to bike in very cold temperatures. There may still be lurking variables (such as weather conditions) that are not included in the model.
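One hedged way to gauge whether this gain in R-squared is more than noise is a partial F statistic for the single added predictor. The sketch below computes it from the rounded R-squared values printed in the two summaries, so it is only an approximation; the variable names are illustrative. For one added predictor, the statistic should roughly equal the square of that predictor's t value.

```r
# Rounded values read off the two summary outputs
r2_model1 <- 0.3906  # cnt ~ month_name
r2_model2 <- 0.4469  # cnt ~ temp + month_name
df_resid2 <- 718     # residual degrees of freedom of Model 2
added     <- 1       # number of predictors added (temp)

# Partial F: gain in explained variance per added predictor,
# relative to the unexplained variance per residual degree of freedom
f_partial <- ((r2_model2 - r2_model1) / added) / ((1 - r2_model2) / df_resid2)
f_partial        # roughly 73
sqrt(f_partial)  # roughly 8.55, close to the t value of 8.548 reported for temp
```

With the fitted objects available, anova(Model1, Model2) would give the same comparison without rounding.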
The coefficients for month_nameJan in Model1 and Model2 differ (-2308.6 versus -852.31) because they answer different questions. In Model1, the coefficient is the difference in cnt between January and the reference month (April) without accounting for temperature, so it absorbs the fact that January is colder. In Model2, the coefficient is the difference in cnt between January and April for days with the same temp, that is, while controlling for temperature.
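The same effect can be reproduced on simulated data. The sketch below is purely illustrative and does not use the bike data: the variables winter, temp, and y are made up, and y is generated to depend on temp only, so any apparent winter effect comes from winter days being colder.

```r
set.seed(42)
n <- 200

# Hypothetical indicator: 1 for a cold month, 0 for a mild month
winter <- rep(c(0, 1), each = n / 2)

# Cold-month days have lower (normalized) temperatures
temp <- ifelse(winter == 1, 0.25, 0.55) + rnorm(n, sd = 0.05)

# Outcome depends on temperature only, plus noise
y <- 1500 + 6000 * temp + rnorm(n, sd = 300)

# Without temp, the winter coefficient picks up the temperature gap
b_simple <- coef(lm(y ~ winter))["winter"]

# With temp in the model, the winter coefficient shrinks toward zero
b_full <- coef(lm(y ~ temp + winter))["winter"]

c(without_temp = unname(b_simple), with_temp = unname(b_full))
```

The first coefficient is large and negative, while the second is much closer to zero, mirroring how the January coefficient shrinks once temp enters Model2.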
# Using coefficients from Model2 to calculate predicted cnt for January when temp = 0.25
pred_january_model2 <- coef(Model2)["(Intercept)"] + coef(Model2)["month_nameJan"] + coef(Model2)["temp"] * 0.25
cat("Predicted cnt Jan when temp = 0.25:", pred_january_model2, "\n")
## Predicted cnt Jan when temp = 0.25: 2260.863
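The manual sum of coefficients above is equivalent to calling predict() with a one-row data frame. Since the bike data are not bundled here, the following sketch demonstrates the equivalence on a small simulated data set; toy, fit, and the numbers used to generate cnt are hypothetical.

```r
set.seed(7)

# Hypothetical toy data with the same column names as the assignment
toy <- data.frame(
  temp = runif(60, min = 0.1, max = 0.8),
  month_name = rep(c("Apr", "Jan", "Jun"), times = 20)
)
toy$cnt <- 1500 + 6000 * toy$temp -
  800 * (toy$month_name == "Jan") + rnorm(60, sd = 200)

fit <- lm(cnt ~ temp + month_name, data = toy)

# Manual prediction: intercept + January offset + temp slope * 0.25
manual <- coef(fit)["(Intercept)"] +
  coef(fit)["month_nameJan"] +
  coef(fit)["temp"] * 0.25

# Same prediction via predict()
auto <- predict(fit, newdata = data.frame(temp = 0.25, month_name = "Jan"))

all.equal(unname(manual), unname(auto))
# TRUE
```

Applied to the assignment, the analogous call would be predict(Model2, newdata = data.frame(temp = 0.25, month_name = "Jan")), which avoids retyping coefficient names.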