Please submit your answers by 5:59 pm on Feb 11, 2019. Remember to show your work. In other words, always use echo=TRUE for the R code chunks that you provide. NOTE - All plots must show a proper title, axis labels, and any legends used. Points will be deducted otherwise.
We are going to work with the dataset bike_data.csv (provided in Files->Assignments->Assignment_3). This dataset has been downloaded from Kaggle, which is an online prediction contest website (see https://www.kaggle.com/c/bike-sharing-demand/data). The data is essentially a log of hourly bike rentals in a city over two years. The following is the codebook:
. datetime - hourly date + timestamp
. season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
. holiday - whether the day is considered a holiday
. workingday - whether the day is neither a weekend nor holiday
. weather - 1: Clear, Few clouds, Partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
. temp - temperature in Celsius
. atemp - “feels like” temperature in Celsius
. humidity - relative humidity
. windspeed - wind speed
. casual - number of non-registered user rentals initiated
. registered - number of registered user rentals initiated
. count - number of total rentals
First, we need to do some preprocessing. Specifically, we need to process the year variable and remove all observations with weather == 4 (these are outliers and need to be removed).
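The preprocessing step described above might look roughly like the sketch below. It is only an illustration under stated assumptions: the small data frame is a hypothetical stand-in for bike_data.csv (the real file would be loaded with read.csv()), the timestamps are assumed to start with a four-digit year, and storing year as a factor is a guess based on the `year2012` coefficient that appears in the later regression output.

```r
# Hypothetical toy rows standing in for bike_data.csv (values are made up).
bike_data <- data.frame(
  datetime = c("2011-01-01 00:00:00", "2011-06-15 13:00:00",
               "2012-01-09 18:00:00", "2012-12-19 23:00:00"),
  weather  = c(1, 2, 4, 3),
  temp     = c(9.84, 28.70, 8.20, 13.12),
  count    = c(16, 250, 164, 88)
)

# Assumption: the first four characters of datetime are the year.
# Store it as a factor so lm() treats 2011 as the baseline level.
bike_data$year <- factor(substr(bike_data$datetime, 1, 4))

# Remove the weather == 4 observations, as the assignment asks.
bike_data <- bike_data[bike_data$weather != 4, ]

# Quick check of what is left per year.
table(bike_data$year)
```

Base R subsetting is used here so the sketch has no package dependencies; the same filter could equally be written with dplyr.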
## Insert code below
bike_data_lm <- lm(count ~ temp, data=bike_data)
#summary(bike_data_lm)
coef(bike_data_lm)
## (Intercept) temp
## 6.008118 9.172048
confint(bike_data_lm)
## 2.5 % 97.5 %
## (Intercept) -2.695529 14.711765
## temp 8.770590 9.573505
Ans. Count is the total number of bike rentals in an hour and is a whole number. At a temperature of 0 degrees Celsius, the model predicts about 6 rentals (intercept, 6.01), although the 95% CI for the intercept (-2.70, 14.71) includes 0. For each 1 degree Celsius increase in temperature, about 9 more bikes are rented (coefficient of temp, 9.17; 95% CI 8.77 to 9.57). The direction of the association between temp and count is positive (increasing). The magnitude of the association is 9.17 rentals per degree Celsius.
Ans. At a temperature of 0 degrees Fahrenheit, the model predicts -157 rentals (intercept). A negative count is not physically meaningful, so the intercept should be read only as where the fitted line crosses 0 degrees Fahrenheit (about -17.8 degrees Celsius). For each 1 degree Fahrenheit increase in temperature, about 5 more bikes are rented. The direction of the association between count and temp_f is positive (increasing). The magnitude of the association is 5.10 rentals per degree Fahrenheit (coefficient of temp_f; 95% CI 4.87 to 5.32).
## Insert code below
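# Convert a temperature (or vector of temperatures) from Celsius to Fahrenheit
convert_celsius_to_fahr <- function(temperature) {
  fahr <- (temperature * 9 / 5) + 32
  return(fahr)
}
# Add the Fahrenheit temperature as a new column (mutate() comes from dplyr)
bike_data <- mutate(bike_data, temp_f = convert_celsius_to_fahr(temp))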
bike_data_lm_tempf <- lm(count ~ temp_f, data=bike_data)
coef(bike_data_lm_tempf)
## (Intercept) temp_f
## -157.050506 5.095582
confint(bike_data_lm_tempf)
## 2.5 % 97.5 %
## (Intercept) -172.62703 -141.473980
## temp_f 4.87255 5.318614
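The Fahrenheit coefficients can be cross-checked against the Celsius ones without refitting, because temp_f is a linear transformation of temp. The following is a small arithmetic sketch using only the coefficients printed above; the variable names are chosen here for illustration.

```r
# Coefficients reported above for count ~ temp (Celsius)
intercept_c <- 6.008118
slope_c     <- 9.172048

# temp_f = 1.8 * temp + 32, so temp = (temp_f - 32) / 1.8.
# Substituting into count = intercept_c + slope_c * temp gives:
slope_f     <- slope_c / 1.8               # rentals per degree Fahrenheit
intercept_f <- intercept_c - 32 * slope_f  # predicted count at 0 degrees F

round(c(intercept_f = intercept_f, slope_f = slope_f), 4)
```

Up to rounding of the printed inputs, these values agree with the fitted intercept (-157.05) and slope (5.0956) of count ~ temp_f, which shows that the two models describe the same line in different units.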
On the same dataset as Q1, perform the following multiple linear regression: count ~ temp + season + humidity + weather + year. Keep season and weather as categorical variables. Interpret your results through the following means:
## Insert code below
bike_data <- bike_data %>% mutate(season = as.factor(season))
bike_data <- bike_data %>% mutate(weather = as.factor(weather))
bike_data_mr_lm <- lm(count ~ temp + season + humidity + weather + year, data = bike_data)
summary(bike_data_mr_lm)
##
## Call:
## lm(formula = count ~ temp + season + humidity + weather + year,
## data = bike_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -332.13 -98.50 -24.59 71.56 669.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.49594 7.33550 13.427 < 2e-16 ***
## temp 10.43201 0.31043 33.606 < 2e-16 ***
## season2 4.71765 5.24017 0.900 0.36799
## season3 -29.10223 6.64854 -4.377 1.21e-05 ***
## season4 66.98554 4.39278 15.249 < 2e-16 ***
## humidity -2.73308 0.08594 -31.800 < 2e-16 ***
## weather2 11.34183 3.48729 3.252 0.00115 **
## weather3 -7.37635 5.78648 -1.275 0.20242
## year2012 75.87791 2.88950 26.260 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 149.6 on 10876 degrees of freedom
## Multiple R-squared: 0.3189, Adjusted R-squared: 0.3184
## F-statistic: 636.7 on 8 and 10876 DF, p-value: < 2.2e-16
coef(bike_data_mr_lm)
## (Intercept) temp season2 season3 season4 humidity
## 98.495944 10.432008 4.717650 -29.102227 66.985542 -2.733076
## weather2 weather3 year2012
## 11.341830 -7.376353 75.877913
confint(bike_data_mr_lm)
## 2.5 % 97.5 %
## (Intercept) 84.117030 112.874857
## temp 9.823518 11.040497
## season2 -5.554035 14.989334
## season3 -42.134585 -16.069869
## season4 58.374886 75.596197
## humidity -2.901543 -2.564609
## weather2 4.506115 18.177544
## weather3 -18.718903 3.966197
## year2012 70.213974 81.541852
How does each variable contribute to count in terms of increase or decrease?
Ans. We check the sign of each variable’s coefficient: positive means an increase and negative means a decrease in count, holding the other variables constant. The season and weather coefficients are relative to their baseline levels (season 1, spring; weather 1, clear).
Temp - For each 1 degree Celsius increase in temperature, the number of bike rentals increases by about 10 (10.43).
Season2 Summer - In the summer, compared to spring, the number of bike rentals is higher by about 5 (4.7).
Season3 Fall - In the fall, compared to spring, the number of bike rentals is lower by about 29 (-29.1).
Season4 Winter - In the winter, compared to spring, the number of bike rentals is higher by about 67 (67.0).
Humidity - For each one-unit increase in relative humidity, the number of bike rentals decreases by about 3 (-2.7).
Weather2 - When the weather is mist + cloudy, mist + broken clouds, mist + few clouds, or mist, compared to clear weather, the number of bike rentals is higher by about 11 (11.3).
Weather3 - When the weather is light snow, light rain + thunderstorm + scattered clouds, or light rain + scattered clouds, compared to clear weather, the number of bike rentals is lower by about 7 (-7.4).
Year2012 - In the year 2012, compared to 2011, the number of bike rentals is higher by about 76 (75.9).
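To see how these coefficients combine, the sketch below computes a fitted value by hand for one hypothetical hour. The coefficients are copied from the coef() output above; the chosen conditions (20 degrees Celsius, winter, humidity of 50, clear weather, 2012) are made up for illustration, and humidity is assumed to be recorded on a 0 to 100 scale.

```r
# Coefficients as printed by coef(bike_data_mr_lm) above
b <- c("(Intercept)" = 98.495944, temp = 10.432008, season2 = 4.717650,
       season3 = -29.102227, season4 = 66.985542, humidity = -2.733076,
       weather2 = 11.341830, weather3 = -7.376353, year2012 = 75.877913)

# Hypothetical hour: 20 degrees C, season 4 (winter), humidity 50,
# weather 1 (baseline, so both weather dummies are 0), year 2012.
x <- c("(Intercept)" = 1, temp = 20, season2 = 0, season3 = 0, season4 = 1,
       humidity = 50, weather2 = 0, weather3 = 0, year2012 = 1)

# Fitted value = sum of coefficient * predictor value
predicted_count <- sum(b * x[names(b)])
predicted_count
```

With the fitted model object available, predict() on a one-row data frame would give the same number; the manual version just makes each term's contribution visible.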
What can you say about the results and the quality of fit? Use a p-value threshold of < 0.001 to reject any null hypothesis.
Ans. Quality of fit: The RSE is high (lower is better) at 149.6 on 10876 degrees of freedom. The R^2 statistic is low (higher is better) at 0.32, so the model explains only about 32% of the variance in count. The data points are widely scattered around the fitted values, indicating a poor quality of fit.
Temp - The 95% CI is (9.82, 11.04) and the p-value is < 2e-16. The p-value is below the 0.001 threshold and the CI excludes 0, so we reject the null hypothesis.
Season2 (Summer) - The 95% CI is (-5.55, 14.99) and the p-value is 0.368. The p-value is above the threshold and the CI includes 0, so we do not reject the null hypothesis.
Season3 (Fall) - The 95% CI is (-42.13, -16.07) and the p-value is 1.21e-05. The p-value is below the threshold and the CI excludes 0, so we reject the null hypothesis.
Season4 (Winter) - The 95% CI is (58.37, 75.60) and the p-value is < 2e-16. The p-value is below the threshold and the CI excludes 0, so we reject the null hypothesis.
Humidity (Relative) - The 95% CI is (-2.90, -2.56) and the p-value is < 2e-16. The p-value is below the threshold and the CI excludes 0, so we reject the null hypothesis.
Weather2 (Mist+cloudy, mist+broken clouds, mist+few clouds, mist) - The 95% CI is (4.51, 18.18) and the p-value is 0.00115. Although the 95% CI excludes 0, the p-value is slightly above the 0.001 threshold, so we do not reject the null hypothesis at this threshold (it would be rejected at 0.01).
Weather3 (Light snow, light rain+thunderstorm+scattered clouds, light rain+scattered clouds) - The 95% CI is (-18.72, 3.97) and the p-value is 0.202. The p-value is above the threshold and the CI includes 0, so we do not reject the null hypothesis.
Year2012 - The 95% CI is (70.21, 81.54) and the p-value is < 2e-16. The p-value is below the threshold and the CI excludes 0, so we reject the null hypothesis.
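The per-coefficient decisions can also be tabulated mechanically. The sketch below is one possible way to do it, using the p-values printed in the summary() output above; entries shown there as "< 2e-16" are entered as 2e-16, which is an upper bound rather than the exact value.

```r
# p-values copied from summary(bike_data_mr_lm); "< 2e-16" entered as 2e-16
p_values <- c(temp = 2e-16, season2 = 0.36799, season3 = 1.21e-05,
              season4 = 2e-16, humidity = 2e-16, weather2 = 0.00115,
              weather3 = 0.20242, year2012 = 2e-16)

# Threshold given in the question
alpha <- 0.001

# Reject the null (coefficient = 0) only when p is strictly below alpha
decision <- ifelse(p_values < alpha, "reject H0", "fail to reject H0")
data.frame(p_value = p_values, decision = decision)
```

With the fitted model in memory, the same p-values could be pulled from the `Pr(>|t|)` column of `summary(bike_data_mr_lm)$coefficients` instead of being typed in.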
This question deals with an application of linear regression. Download the dataset titled “sales_advertising.csv” from Files -> Assignments -> Assignment_3. The dataset measures sales of a product as a function of advertising budgets for TV, radio, and newspaper media. The following is the data dictionary.
# ## Insert code below
# Note: To make the question more clear: we are comparing sales (increase or decrease) of A product (not TV, radio or newspaper) by the TV, radio or newspaper. We are looking to see which of these (TV, radio or newspaper) should be allocated the most money (budget) to sell THE product.
#Reading in the csv file
sales_advertising_data <- read.csv("~/Desktop/sales_advertising.csv", header = TRUE)
#Plot between sales and TV
plot(sales_advertising_data$TV, sales_advertising_data$Sales, ylab="Sales", xlab="TV", main ="Sales of product with TV")
#Plot between sales and newspaper
plot(sales_advertising_data$Newspaper, sales_advertising_data$Sales, ylab="Sales", xlab="Newspaper", main = "Sales of product with Newspaper")
#Plot between sales and radio
plot(sales_advertising_data$Radio, sales_advertising_data$Sales, ylab="Sales", xlab="Radio", main = "Sales of product with Radio")
What is the observed association between sales and TV? With a $0 TV advertising budget, the model predicts sales of about 7,033 units of the product (intercept 7.03). For every additional $1,000 spent on TV advertising, the number of units sold increases by about 48 (slope 0.0475).
What is the null hypothesis for this particular model? There is no association between the TV advertising budget and sales of the product, i.e. the TV coefficient is 0.
What can we say about the null hypothesis? (We are looking at the p-value and 95% confidence interval of TV.) Since the p-value is very low (< 2e-16, therefore significant) and the 95% CI of 42.2 to 52.8 units per $1,000 excludes 0, we reject the null hypothesis.
# Insert code
# code for (b) sales ~ TV
sales_tv_lm <- lm(Sales ~ TV, data=sales_advertising_data)
summary(sales_tv_lm)
##
## Call:
## lm(formula = Sales ~ TV, data = sales_advertising_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
coef(sales_tv_lm)
## (Intercept) TV
## 7.03259355 0.04753664
confint(sales_tv_lm)
## 2.5 % 97.5 %
## (Intercept) 6.12971927 7.93546783
## TV 0.04223072 0.05284256
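The t value and confidence interval reported for TV follow directly from the estimate, its standard error, and the residual degrees of freedom. The sketch below reproduces them by hand from the numbers printed in the summary; it is meant as a check on how confint() arrives at its interval for an lm fit, not as a replacement for it.

```r
# Values copied from summary(sales_tv_lm) above
estimate  <- 0.047537   # TV slope
std_error <- 0.002691   # standard error of the slope
df_resid  <- 198        # residual degrees of freedom

# t statistic for H0: slope = 0
t_stat <- estimate / std_error

# 95% CI: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error

round(t_stat, 2)
round(ci, 5)
```

The result matches the printed t value (17.67) and the confint() row for TV up to rounding of the inputs.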
What is the observed association between sales and Newspaper? With a $0 newspaper advertising budget, the model predicts sales of about 12,351 units of the product (intercept 12.35). For every additional $1,000 spent on newspaper advertising, the number of units sold increases by about 55 (slope 0.0547).
What is the null hypothesis for this particular model? There is no association between the newspaper advertising budget and sales of the product, i.e. the Newspaper coefficient is 0.
What can we say about the null hypothesis? (We are looking at the p-value and 95% confidence interval of Newspaper.) Since the p-value is low (0.00115, significant at the 0.05 level) and the 95% CI of 22.0 to 87.4 units per $1,000 excludes 0, we reject the null hypothesis. Note, however, that this model explains only about 5% of the variance in sales (R^2 = 0.052).
# Insert code
# code for (c) sales ~ newspaper
sales_news_lm <- lm(Sales ~ Newspaper, data=sales_advertising_data)
summary(sales_news_lm)
##
## Call:
## lm(formula = Sales ~ Newspaper, data = sales_advertising_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2272 -3.3873 -0.8392 3.5059 12.7751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.35141 0.62142 19.88 < 2e-16 ***
## Newspaper 0.05469 0.01658 3.30 0.00115 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.092 on 198 degrees of freedom
## Multiple R-squared: 0.05212, Adjusted R-squared: 0.04733
## F-statistic: 10.89 on 1 and 198 DF, p-value: 0.001148
coef(sales_news_lm)
## (Intercept) Newspaper
## 12.3514071 0.0546931
confint(sales_news_lm)
## 2.5 % 97.5 %
## (Intercept) 11.12595560 13.57685854
## Newspaper 0.02200549 0.08738071
# Insert code
# code for (d) sales ~ TV + Radio + Newspaper
sales_data_lm_mr <- lm(Sales ~ TV + Radio + Newspaper, data = sales_advertising_data)
summary(sales_data_lm_mr)
##
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper, data = sales_advertising_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## Radio 0.188530 0.008611 21.893 <2e-16 ***
## Newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
coef(sales_data_lm_mr)
## (Intercept) TV Radio Newspaper
## 2.938889369 0.045764645 0.188530017 -0.001037493
confint(sales_data_lm_mr)
## 2.5 % 97.5 %
## (Intercept) 2.32376228 3.55401646
## TV 0.04301371 0.04851558
## Radio 0.17154745 0.20551259
## Newspaper -0.01261595 0.01054097
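Association between sales and TV: Holding radio and newspaper budgets constant, for every additional $1,000 spent, the number of units sold increases by about 46 (0.0458). Due to the low p-value (< 2e-16) and the 95% CI of 43.0 to 48.5 units per $1,000, this association is significant.
Association between sales and radio: Holding TV and newspaper budgets constant, for every additional $1,000 spent, the number of units sold increases by about 189 (0.1885). Due to the low p-value (< 2e-16) and the 95% CI of 171.5 to 205.5 units per $1,000, this association is significant.
Association between sales and newspaper: The coefficient is close to zero (-0.001, about 1 fewer unit per $1,000). Due to the high p-value (0.86) and the 95% CI of -12.6 to 10.5 units per $1,000, which includes 0, this association is not significant once TV and radio are in the model.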
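The per-$1,000 figures quoted above are the raw coefficients multiplied by 1,000. The sketch below makes that conversion explicit; it assumes, as the answers do, that Sales is recorded in thousands of units and budgets in thousands of dollars, which the missing data dictionary would need to confirm.

```r
# Coefficients copied from coef(sales_data_lm_mr) above
coefs <- c(TV = 0.045764645, Radio = 0.188530017, Newspaper = -0.001037493)

# Assumption: Sales is in thousands of units and budgets in thousands of
# dollars, so coefficient * 1000 = additional units per extra $1,000.
units_per_1000 <- round(coefs * 1000)
units_per_1000

# Medium with the largest estimated increase in sales per extra $1,000
best_medium <- names(which.max(coefs))
best_medium
```

This ranks the media by estimated association only; it does not by itself establish that shifting budget to the top-ranked medium would cause the corresponding increase in sales.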