bike <- read.csv('D:/FALL 2023/STATISTICS/datasets/bike.csv')
library(pwr)
library(ggplot2)
###RESPONSE VARIABLE: RENTED.BIKE.COUNT
###EXPLANATORY VARIABLE: SEASONS
###NULL HYPOTHESIS:
The mean number of rented bikes is the same across all seasons. In other words, bike rentals do not differ between seasons. A one-way ANOVA test compares the mean of the response variable (Rented.Bike.Count) across the categories of the explanatory variable (Seasons).
###ANOVA
# Set response and explanatory
response <- bike$Rented.Bike.Count
explanatory <- bike$Seasons
# ANOVA test
fit <- aov(response ~ explanatory)
summary(fit)
##               Df    Sum Sq   Mean Sq F value Pr(>F)
## explanatory    3 7.657e+08 255236335   776.5 <2e-16 ***
## Residuals   8756 2.878e+09    328715
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
###Interpreting ANOVA Results:
The ANOVA result provides two key values: the F-statistic and the p-value.
F-Statistic: The F-statistic is the ratio of the variation between the group means to the variation within the groups. A large value indicates that the group means differ by more than random variation alone would explain.
P-Value: The p-value is the probability of observing an F-statistic at least as large as the one obtained if the null hypothesis were true.
Drawing Conclusions: If the p-value is less than the chosen significance level (e.g., 0.05), we reject the null hypothesis. In our model the p-value (< 2e-16) is far below 0.05, so the mean number of rented bikes differs significantly for at least one season. Had the p-value been greater than the significance level, we would have failed to reject the null hypothesis, meaning there was not enough evidence to conclude that bike rentals differ among seasons.
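The F value and p-value in the table can be reproduced from the mean squares. The following is a minimal sketch that copies the numbers from the ANOVA output above; the variable names are illustrative only.

```r
# Values copied from the ANOVA table above
ms_between <- 255236335   # Mean Sq for the Seasons term
ms_within  <- 328715      # Mean Sq for the residuals
df_between <- 3
df_within  <- 8756

# F is the ratio of the between-group to the within-group mean square
f_stat <- ms_between / ms_within

# The upper tail of the F distribution gives the p-value
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

round(f_stat, 1)   # agrees with the F value reported by summary(fit)
p_value < 0.05     # TRUE means the null hypothesis is rejected at the 5% level
```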
###SUMMARY
There is enough evidence to conclude that bike rentals differ significantly among seasons (p-value < 0.05). The number of rented bikes therefore does not remain consistent across seasons, and season has a statistically significant association with bike rentals. The ANOVA does not show which seasons differ from each other, so pairwise comparisons are needed to identify the specific seasons involved.
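One common way to run the pairwise comparisons is Tukey's Honest Significant Difference test, available in base R as `TukeyHSD()`. The sketch below uses simulated data with made-up season labels and means, because the real dataset is not bundled here; `sim`, `fit_sim`, and `tukey` are hypothetical names. On the real data the grouping variable would likely need to be converted with `factor(bike$Seasons)` first.

```r
# Simulated stand-in for the bike data (hypothetical values, not the real dataset)
set.seed(1)
sim <- data.frame(
  Seasons = factor(rep(c("Autumn", "Spring", "Summer", "Winter"), each = 50)),
  Rented.Bike.Count = c(rnorm(50, 800, 300), rnorm(50, 750, 300),
                        rnorm(50, 1000, 300), rnorm(50, 250, 300))
)

fit_sim <- aov(Rented.Bike.Count ~ Seasons, data = sim)

# Tukey HSD: one row per pair of seasons, with the mean difference,
# a 95% confidence interval, and a p-value adjusted for multiple comparisons
tukey <- TukeyHSD(fit_sim, "Seasons")
print(tukey)
```

Pairs whose interval excludes 0 differ significantly at the family-wise 5% level.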
##BUILDING LINEAR REGRESSION MODELS
# Build the linear regression model
model <- lm(Rented.Bike.Count ~ Temperature, data = bike)
###MODEL EVALUATION
Using diagnostic plots and hypothesis tests to evaluate the model.
##DIAGNOSTIC PLOT
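# Draw the four standard diagnostic plots in a 2 x 2 grid
# (the top-left panel is the Residuals vs. Fitted plot)
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))  # restore the single-panel layout for the plots that follow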
This plot is used to check for linearity and homoscedasticity. In this
plot, the x-axis represents the predicted or fitted values from the
regression model, and the y-axis represents the residuals (the
differences between the observed values and the predicted values).
If the points in the plot are randomly scattered around a horizontal line at 0, it suggests that the assumption of linearity is not violated. If the spread of the points remains relatively constant across all values of the fitted values, it indicates homoscedasticity (constant variance of residuals). If the spread widens or narrows as you move along the x-axis, it suggests heteroscedasticity, which can be problematic for the model.
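The visual check for constant variance can be complemented with a numeric one. The sketch below computes a studentized (Koenker) Breusch-Pagan statistic by hand on simulated data whose noise grows with the predictor; the data and the names `m`, `aux`, `bp_stat`, and `bp_p` are assumptions for illustration. For the real model, `model` and `bike$Temperature` would take the place of `m` and `x`.

```r
set.seed(2)
x <- runif(500, -10, 35)
# Noise whose spread grows with x, so constant variance fails by construction
y <- 300 + 30 * x + rnorm(500, sd = 50 + 10 * (x + 10))
m <- lm(y ~ x)

# Auxiliary regression: squared residuals on the predictor
aux <- lm(residuals(m)^2 ~ x)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression,
# compared with a chi-square distribution (1 df, one predictor)
bp_stat <- length(x) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p.value = bp_p)
```

A small p-value is evidence of heteroscedasticity, matching a funnel shape in the Residuals vs. Fitted plot.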
# Normal Q-Q Plot
qqnorm(residuals(model))
qqline(residuals(model))
This plot is used to assess the normality of the residuals. A normal Q-Q
plot compares the quantiles of the residuals to the quantiles of a
normal distribution.
If the points in the Q-Q plot closely follow a straight line, it suggests that the residuals are approximately normally distributed. Deviations from the line indicate departures from normality.
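A formal normality test can accompany the Q-Q plot. The sketch below applies the Shapiro-Wilk test to simulated, deliberately right-skewed residuals; the data are hypothetical. Because `shapiro.test()` accepts at most 5000 values and the bike model has 8760 residuals, a random subsample is tested.

```r
set.seed(3)
x <- runif(8760, -10, 35)
# Right-skewed (exponential) errors, so normality is violated by construction
y <- 300 + 30 * x + (rexp(8760, rate = 1 / 500) - 500)
m <- lm(y ~ x)

# shapiro.test() accepts at most 5000 values, so test a random subsample
res_sample <- sample(residuals(m), 5000)
sw <- shapiro.test(res_sample)
sw$p.value < 0.05   # TRUE indicates evidence against normality
```

With thousands of observations the test flags even very small departures from normality, so the Q-Q plot remains the better guide to whether a deviation matters in practice.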
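# Scale-Location Plot
# Standardized residuals are used, as in the built-in plot(model, which = 3)
sqrt_abs_residuals <- sqrt(abs(rstandard(model)))
plot(fitted(model), sqrt_abs_residuals,
     xlab = "Fitted values", ylab = "sqrt(|standardized residuals|)")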
The scale-location plot helps assess homoscedasticity.
If the points form a roughly horizontal band with approximately equal spread, the assumption of homoscedasticity is supported. A clear upward or downward trend suggests heteroscedasticity. Points that lie far above the rest have unusually large residuals and are worth inspecting as potential outliers; judging whether they are influential also requires their leverage.
# Residuals vs. Leverage Plot
library(car)
## Loading required package: carData
plot(model, which = 5)  # base-R Residuals vs. Leverage plot; avPlots() draws added-variable plots instead
This plot is used to identify influential data points and check for
outliers. It combines information on the residuals and the leverage of
data points. Points far to the right have high leverage, and points far
from the horizontal line at 0 have large residuals (potential outliers).
Points that have both high leverage and a large residual can strongly
influence the regression model.
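Leverage and influence can also be inspected numerically with the base-R functions `hatvalues()` and `cooks.distance()`. The sketch below uses simulated data with one planted unusual point; the cutoffs `2 * p / n` and `4 / n` are common rules of thumb rather than formal tests, and all variable names are illustrative.

```r
set.seed(4)
x <- c(runif(99, 0, 30), 80)                            # last point has an extreme x value
y <- c(300 + 30 * x[1:99] + rnorm(99, sd = 100), 500)   # and lies far below the trend
m <- lm(y ~ x)

n <- length(x)
p <- length(coef(m))        # number of estimated coefficients

lev  <- hatvalues(m)        # leverage of each observation
cook <- cooks.distance(m)   # influence of each observation on the fit

# Rule-of-thumb cutoffs (conventions, not strict tests)
high_leverage <- which(lev > 2 * p / n)
influential   <- which(cook > 4 / n)
```

For the bike model, replacing `m` with `model` gives the row numbers worth examining in the Residuals vs. Leverage plot.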
Interpreting these diagnostic plots is crucial for ensuring that the linear regression model assumptions are met. Deviations from these assumptions can affect the model’s validity and the reliability of its predictions. Depending on the issues identified in these plots, you may need to consider data transformations or address outliers to improve the model’s performance.
###hypothesis tests
# Summarize the regression model
summary(model)
##
## Call:
## lm(formula = Rented.Bike.Count ~ Temperature, data = bike)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1100.60  -336.57   -49.69   233.81  2525.19
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 329.9525     8.5411   38.63   <2e-16 ***
## Temperature  29.0811     0.4862   59.82   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 543.5 on 8758 degrees of freedom
## Multiple R-squared: 0.29, Adjusted R-squared: 0.29
## F-statistic: 3578 on 1 and 8758 DF, p-value: < 2.2e-16
###Interpretation:
The p-value for the “Temperature” coefficient (< 2e-16) is far below the usual significance level of 0.05, so there is a significant linear relationship between temperature and bike rentals.
The coefficient for “Temperature” represents the expected change in the number of rented bikes for a one-unit change in temperature. Here the estimate is 29.08, so each one-degree increase in temperature is associated with about 29 more rented bikes on average. The R-squared of 0.29 means that temperature alone explains about 29% of the variation in bike rentals.
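As a worked example, the fitted line can be evaluated by hand. The sketch below copies the estimates from the regression output; the temperature of 20 is an arbitrary illustrative value. With the fitted object available, `predict(model, newdata = data.frame(Temperature = 20))` and `confint(model)` would give the same quantities directly.

```r
# Estimates copied from summary(model) above
intercept <- 329.9525
slope     <- 29.0811
slope_se  <- 0.4862
df_resid  <- 8758

# Expected rentals at a temperature of 20 (arbitrary illustrative value)
pred_20 <- intercept + slope * 20

# 95% confidence interval for the slope: estimate +/- t quantile * standard error
t_crit   <- qt(0.975, df_resid)
slope_ci <- slope + c(-1, 1) * t_crit * slope_se

pred_20
round(slope_ci, 2)
```

The prediction works out to 329.9525 + 29.0811 × 20 = 911.5745 rented bikes.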
###BUILDING MODEL WITH INTERACTION TERMS
To improve the linear regression model, we can include another variable in addition to “Temperature.” We add “Humidity” as a second explanatory variable, since it may also influence bike rentals. We also include an interaction between “Temperature” and “Humidity” to test whether the effect of temperature on bike rentals depends on the humidity level.
model1 <- lm(Rented.Bike.Count ~ Temperature * Humidity, data = bike)
In an R formula, Temperature * Humidity expands to the two main effects plus their interaction (Temperature + Humidity + Temperature:Humidity). The interaction term lets the effect of temperature on bike rentals vary with the humidity level, which a model with main effects alone cannot represent.
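The behaviour of the `*` operator in a formula can be checked directly. The sketch below fits both the shorthand and the fully written-out formula to simulated data (hypothetical values, not the bike dataset) and compares the coefficients.

```r
set.seed(5)
sim <- data.frame(Temperature = runif(200, -10, 35),
                  Humidity    = runif(200, 20, 95))
sim$Rented.Bike.Count <- 650 + 48 * sim$Temperature - 5 * sim$Humidity -
  0.3 * sim$Temperature * sim$Humidity + rnorm(200, sd = 100)

# Shorthand: '*' stands for both main effects plus their interaction
m_star <- lm(Rented.Bike.Count ~ Temperature * Humidity, data = sim)

# The same model written out term by term
m_long <- lm(Rented.Bike.Count ~ Temperature + Humidity + Temperature:Humidity,
             data = sim)

all.equal(coef(m_star), coef(m_long))   # TRUE: both formulas fit the same model
```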
###Evaluating the model
###hypothesis tests
summary(model1)
##
## Call:
## lm(formula = Rented.Bike.Count ~ Temperature * Humidity, data = bike)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1448.44  -292.94   -72.59   190.05  2488.14
##
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)          649.54669   22.92577   28.33   <2e-16 ***
## Temperature           47.66897    1.41217   33.76   <2e-16 ***
## Humidity              -5.47394    0.41556  -13.17   <2e-16 ***
## Temperature:Humidity  -0.30465    0.02533  -12.03   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 506.3 on 8756 degrees of freedom
## Multiple R-squared: 0.3839, Adjusted R-squared: 0.3837
## F-statistic: 1819 on 3 and 8756 DF, p-value: < 2.2e-16
###interpretation
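All four coefficients are significant (p < 2e-16). Because the model contains an interaction, the main effects are conditional: the “Temperature” coefficient (47.67) is the effect of a one-degree increase when humidity is 0, and the “Humidity” coefficient (-5.47) is the effect of a one-unit increase in humidity when temperature is 0. The interaction coefficient (-0.30) is the change in the temperature slope for each additional unit of humidity.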
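The effect of temperature at a given humidity level follows from the two relevant coefficients. The sketch below copies the estimates from the output of `summary(model1)`; the helper `temp_slope()` and the three humidity levels are illustrative choices, not values taken from the data.

```r
# Estimates copied from summary(model1) above
b_temp <- 47.66897    # main effect of Temperature
b_int  <- -0.30465    # Temperature:Humidity interaction

# Slope of Temperature at a given humidity level
temp_slope <- function(humidity) b_temp + b_int * humidity

# Illustrative humidity levels (chosen arbitrarily)
temp_slope(c(30, 60, 90))
```

At a humidity of 30 the slope is 47.66897 - 0.30465 × 30 ≈ 38.5 bikes per degree, and it shrinks to about 20.3 at a humidity of 90.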
The interaction term here is significant and negative, which means the effect of temperature on bike rentals weakens as humidity rises; a positive coefficient would have meant the opposite. Adding humidity and the interaction raises the adjusted R-squared from 0.29 to 0.38 and lowers the residual standard error from 543.5 to 506.3, so the larger model fits the data better. The model’s assumptions should still be checked with the same diagnostic plots before its results are relied on.
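Whether the extra terms improve the model by more than chance would can be tested with a partial F-test via `anova()`. The sketch below uses simulated data (hypothetical values and names); with the real fits the call would be `anova(model, model1)`.

```r
set.seed(6)
sim <- data.frame(Temperature = runif(300, -10, 35),
                  Humidity    = runif(300, 20, 95))
sim$Rented.Bike.Count <- 650 + 48 * sim$Temperature - 5 * sim$Humidity -
  0.3 * sim$Temperature * sim$Humidity + rnorm(300, sd = 200)

m_small <- lm(Rented.Bike.Count ~ Temperature, data = sim)
m_big   <- lm(Rented.Bike.Count ~ Temperature * Humidity, data = sim)

# Partial F-test for the nested models: do the added terms
# reduce the residual sum of squares significantly?
cmp <- anova(m_small, m_big)
print(cmp)

# Adjusted R-squared penalizes the extra parameters
c(small = summary(m_small)$adj.r.squared,
  big   = summary(m_big)$adj.r.squared)
```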