COURSE 10 - MODUL 1
REGRESSION ALGORITHM FOR TESTING AND PREDICTING DATA
Data
library(readr)## Warning: package 'readr' was built under R version 4.2.3
df <- read_delim("C:/Users/hp/Downloads/day.csv",
delim = ";", escape_double = FALSE, trim_ws = TRUE)## Rows: 731 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## dbl (16): instant, season, yr, mnth, holiday, weekday, workingday, weathers...
## date (1): dteday
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Convert 'dteday' column to date format
df$dteday <- as.Date(df$dteday, format="%Y-%m-%d")
# Extract month name as string
df$name_month <- format(df$dteday, "%B")
# Select only the required columns
df_selected <- df[, c("name_month", "temp", "cnt")]
# Save the resulting dataset to a new CSV file.
write.csv(df_selected, "day_processed.csv", row.names = FALSE)
# Show multiple rows of results
head(df_selected)print(df)## # A tibble: 731 × 18
## instant dteday season yr mnth holiday weekday working…¹ weath…² temp
## <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2011-01-01 1 0 1 0 6 0 2 0.344
## 2 2 2011-01-02 1 0 1 0 0 0 2 0.363
## 3 3 2011-01-03 1 0 1 0 1 1 1 0.196
## 4 4 2011-01-04 1 0 1 0 2 1 1 0.2
## 5 5 2011-01-05 1 0 1 0 3 1 1 0.227
## 6 6 2011-01-06 1 0 1 0 4 1 1 0.204
## 7 7 2011-01-07 1 0 1 0 5 1 2 0.197
## 8 8 2011-01-08 1 0 1 0 6 0 2 0.165
## 9 9 2011-01-09 1 0 1 0 0 0 1 0.138
## 10 10 2011-01-10 1 0 1 0 1 1 1 0.151
## # … with 721 more rows, 8 more variables: atemp <dbl>, hum <dbl>,
## # windspeed <dbl>, casual <dbl>, registered <dbl>, cnt <dbl>,
## # nama_bulan <dbl>, name_month <chr>, and abbreviated variable names
## # ¹workingday, ²weathersit
Model 1
A. Simple Linear Regression Model
Model1 <- lm(cnt ~ name_month, data = df)
summary(Model1)##
## Call:
## lm(formula = cnt ~ name_month, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5177.2 -1095.2 -249.3 1290.0 4669.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4484.9 196.7 22.799 < 2e-16 ***
## name_monthAugust 1179.5 275.9 4.275 2.17e-05 ***
## name_monthDecember -1081.1 275.9 -3.918 9.79e-05 ***
## name_monthFebruary -1829.6 281.8 -6.492 1.58e-10 ***
## name_monthJanuary -2308.6 275.9 -8.366 3.09e-16 ***
## name_monthJuly 1078.8 275.9 3.909 0.000101 ***
## name_monthJune 1287.5 278.2 4.628 4.38e-06 ***
## name_monthMarch -792.6 275.9 -2.873 0.004192 **
## name_monthMay 864.9 275.9 3.134 0.001793 **
## name_monthNovember -237.7 278.2 -0.854 0.393113
## name_monthOctober 714.3 275.9 2.589 0.009829 **
## name_monthSeptember 1281.6 278.2 4.607 4.83e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1524 on 719 degrees of freedom
## Multiple R-squared: 0.3906, Adjusted R-squared: 0.3813
## F-statistic: 41.9 on 11 and 719 DF, p-value: < 2.2e-16
B. Interpretasi R-Square
Based on the regression output, the R-squared (R²) value of 0.3906 indicates that this model is able to explain about 39.06% of the variation in the dependent variable, while the remaining 60.94% is influenced by other factors not included in the model. Meanwhile, the Adjusted R-squared of 0.3813 indicates that after adjusting the number of variables in the model, the model’s predictive ability decreased slightly, indicating that some variables may not contribute significantly. Overall, this model has moderate predictive ability, but can still be improved by considering additional variables or more appropriate modeling methods to improve prediction accuracy.
C. Moon Reference
Since April does not appear in the summary of the regression results, it is used as the reference month in the model. This means that the coefficients for the other months show different values compared to April as the reference month.
intercept <- coef(Model1)["(Intercept)"]
cat("Cnt prediction for April:", intercept,"\n")## Cnt prediction for April: 4484.9
Based on the results above, it was found that the predicted total number of rental bicycles including regular bicycles and registered ones for April was 4484,9 units.
D. January and June Month Predictions
Model1 <- lm(cnt ~ name_month, data = df)
coef(Model1) ## (Intercept) name_monthAugust name_monthDecember name_monthFebruary
## 4484.9000 1179.5194 -1081.0935 -1829.6018
## name_monthJanuary name_monthJuly name_monthJune name_monthMarch
## -2308.5613 1078.7774 1287.4667 -792.6419
## name_monthMay name_monthNovember name_monthOctober name_monthSeptember
## 864.8742 -237.7167 714.3258 1281.6167
Based on the regression results displayed, the regression model can be written in the form of an equation as follows:
\[Y = 4484.9000 + 1179.5194 X_{\text{August}} - 1081.0935 X_{\text{December}} - 1829.6018 X_{\text{February}} - 2308.5613 X_{\text{January}} + 1078.7774 X_{\text{July}} + 1287.4667 X_{\text{June}} - 792.6419 X_{\text{March}} + 864.8742 X_{\text{May}} - 237.7167 X_{\text{November}} + 714.3258 X_{\text{October}} + 1281.6167 X_{\text{September}}\]
Where:
\(Y\) is the predicted dependent variable.
\(X_{\text{month}}\) is a dummy variable for each month (1 if it is the month, 0 otherwise).
April is used as the reference month, so it does not appear in the equation. The interpretation of each coefficient is the change \(Y\) relative to April.
Based on the regression results, it can be concluded that there is a seasonal pattern in the number of bicycle rentals. January, February, March, November, and December show a decrease in the number of rentals compared to the reference month, which is likely due to weather factors or other conditions that are less supportive of cycling activities. In contrast, August, July, June, May, October, and September tend to experience an increase in the number of rentals, which may be influenced by more favorable weather conditions or increased outdoor activities. The largest decrease occurred in January with a difference of -2308.56 units, while the largest increase was recorded in August (+1179.52) and September (+1281.62). Thus, this model can be used to predict the number of bicycle rentals by month by considering seasonal differences, thus helping in the planning and management of bicycle rental services.
Model 2
A. Simple Linear Regression Model
Model2 <- lm(cnt ~ temp + name_month, data = df)
summary(Model2)##
## Call:
## lm(formula = cnt ~ temp + name_month, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4896.6 -1080.0 -228.4 1245.2 3372.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1554.39 390.76 3.978 7.66e-05 ***
## temp 6235.14 729.40 8.548 < 2e-16 ***
## name_monthAugust -308.08 315.42 -0.977 0.3290
## name_monthDecember -170.96 283.80 -0.602 0.5471
## name_monthFebruary -764.81 296.15 -2.582 0.0100 *
## name_monthJanuary -852.31 313.41 -2.719 0.0067 **
## name_monthJuly -701.18 335.50 -2.090 0.0370 *
## name_monthJune -47.47 307.78 -0.154 0.8775
## name_monthMarch -297.20 269.38 -1.103 0.2703
## name_monthMay 86.73 278.37 0.312 0.7555
## name_monthNovember 390.66 275.22 1.419 0.1562
## name_monthOctober 620.72 263.30 2.357 0.0187 *
## name_monthSeptember 368.25 285.93 1.288 0.1982
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1453 on 718 degrees of freedom
## Multiple R-squared: 0.4469, Adjusted R-squared: 0.4377
## F-statistic: 48.35 on 12 and 718 DF, p-value: < 2.2e-16
B. R-Square Difference
Based on the comparison of R-squared and Adjusted R-squared values, Model 2 has a better ability to explain data variations compared to Model 1. The R-squared value in Model 2 increased from 39.06% to 44.69%, and the Adjusted R-squared value also increased from 38.13% to 43.77%. This increase indicates that the addition of the temperature variable in Model 2 contributes to increasing the accuracy of the model. Therefore, Model 2 is more recommended because it provides better results in explaining the relationship between variables in the data.
C. Comparison of Coefficient Estimates
coef(Model1)["name_monthJanuary"]## name_monthJanuary
## -2308.561
coef(Model2)["name_monthJanuary"]## name_monthJanuary
## -852.3127
The difference in the coefficient estimates of name_of_monthJanuary between Model 1 and Model 2 shows that the addition of the temperature variable affects the regression results. In Model 1, the influence of January appears larger because there is no additional variable explaining the variation in the data. However, when temperature is entered into Model 2, some of the variation previously associated with the month is now explained by temperature, so that the coefficient value of January becomes smaller. This shows that temperature has a contribution in explaining changes in the data, so that Model 2 is more able to capture the factors that influence the response of the variable more accurately.
D. January Month Prediction When Temperature is 0.25
Prediction_januari_temp_025 <- 1554.39 + (-852.31) + (6235.14*0.25)
cat("Prediction cnt for January with temp 0.25 = ", Prediction_januari_temp_025, "\n")## Prediction cnt for January with temp 0.25 = 2260.865
So the total number of rental bicycles including regular and registered bicycles is 2260,865 units.