Assignment III

Please submit your answers by 5:59 pm on Feb 11, 2019. Remember to show your work. In other words, always use echo=TRUE for the R code chunks that you provide. NOTE - All plots must show a proper title, axis labels, and any legends used. Points will be deducted otherwise.

Question 1: Simple Linear Regression

We are going to work with the dataset bike_data.csv (provided in Files->Assignments->Assignment_3). This dataset has been downloaded from Kaggle, which is an online prediction contest website (see https://www.kaggle.com/c/bike-sharing-demand/data). The data is essentially the log of hourly bike rentals in a city over two years. The following is the codebook:

. datetime - hourly date + timestamp
. season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
. holiday - whether the day is considered a holiday
. workingday - whether the day is neither a weekend nor holiday
. weather - 1: Clear, Few clouds, Partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
. temp - temperature in Celsius
. atemp - “feels like” temperature in Celsius
. humidity - relative humidity
. windspeed - wind speed
. casual - number of non-registered user rentals initiated
. registered - number of registered user rentals initiated
. count - number of total rentals

First, we need to do some preprocessing. Specifically, we need to process the year variable and remove all observations with weather == 4 (these are outliers and need to be removed).
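The preprocessing code itself is not shown in this write-up. A minimal sketch of what it could look like follows, using base R on a small made-up data frame (`toy` and its values are hypothetical stand-ins, not rows from bike_data.csv). It assumes the year is taken from the first four characters of datetime and stored as a factor, which is consistent with the `year2012` coefficient that appears later.

```r
# Hypothetical stand-in for bike_data.csv; the values are invented for illustration.
toy <- data.frame(
  datetime = c("2011-01-01 00:00:00", "2011-06-15 08:00:00",
               "2012-01-09 18:00:00", "2012-07-04 12:00:00"),
  weather  = c(1, 2, 4, 3),
  count    = c(16, 120, 164, 300),
  stringsAsFactors = FALSE
)

# Derive the year from the timestamp string and keep it as a categorical variable.
toy$year <- factor(substr(toy$datetime, 1, 4))

# Remove the observations with weather == 4.
toy_clean <- toy[toy$weather != 4, ]

print(toy_clean)
```

On the real data, the same two steps would be applied to the data frame returned by read.csv().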

Perform the following simple linear regression: count ~ temperature. What are the coefficients and their 95% confidence intervals?
Ans. The intercept is 6.008, with 95% CI (-2.70, 14.71). The coefficient of temp is 9.17, with 95% CI (8.77, 9.57). Note: the intercept is the predicted count at temp = 0, and its 95% CI includes 0, so it carries little meaning on its own.

## Insert code below
bike_data_lm <- lm(count ~ temp, data=bike_data)
#summary(bike_data_lm)

coef(bike_data_lm)

## (Intercept)        temp 
##    6.008118    9.172048

confint(bike_data_lm)

##                 2.5 %    97.5 %
## (Intercept) -2.695529 14.711765
## temp         8.770590  9.573505

Interpret your results in terms of increase or decrease in temperature. Mention specifically what is the meaning of the intercept and the direction and magnitude of the association between temperature and count.

Ans. Count is the total number of bike rentals and is a whole number. The intercept (6.01) is the predicted count when the temperature is 0 degrees Celsius, i.e. about 6 bike rentals. For each 1 degree Celsius increase in temperature, the predicted count increases by about 9 rentals (coefficient of temp). The direction of the association between temp and count is positive (increasing). The magnitude of the association between temp and count is 9.17 rentals per degree Celsius.
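As a small illustration of this interpretation, the fitted line can be evaluated by hand from the coefficients reported above. The helper `predict_count` is a hypothetical name introduced here; with the fitted model object available, `predict(bike_data_lm, newdata = ...)` would serve the same purpose.

```r
# Coefficients copied from the coef(bike_data_lm) output above.
b0 <- 6.008118   # intercept: predicted count at 0 degrees Celsius
b1 <- 9.172048   # slope: change in predicted count per degree Celsius

# Hypothetical helper: predicted hourly count at a given temperature.
predict_count <- function(temp_c) b0 + b1 * temp_c

# Predictions at a few temperatures; each extra degree adds b1 rentals.
print(predict_count(c(0, 10, 20)))
print(predict_count(11) - predict_count(10))
```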

Using mutate, create a new variable temp_f which is the Fahrenheit representation of the values in the temp variable. Perform the regression count ~ temp_f and interpret the results. What conclusions can you draw from this experiment?

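Ans. If the temperature is 0 degrees Fahrenheit, the model predicts -157 bike rentals (coefficient of intercept); a negative count is not physically meaningful, so this intercept should not be read literally. When the temperature increases by 1 degree Fahrenheit, about 5 more bikes are rented. The direction of the association between count and temp_f is positive (increasing). The magnitude of the association between count and temp_f is 5.10 rentals per degree Fahrenheit (coefficient of temp_f). Conclusion: changing the units of the predictor rescales the slope (5.10 = 9.17 x 5/9) and shifts the intercept, but it does not change the underlying association or the fit of the model.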

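## Insert code below
library(dplyr)

# Convert a temperature from degrees Celsius to degrees Fahrenheit
convert_celsius_to_fahr <- function(temperature) {
  temperature * 9 / 5 + 32
}

# Inside mutate(), columns can be referenced directly by name
bike_data <- mutate(bike_data, temp_f = convert_celsius_to_fahr(temp))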

bike_data_lm_tempf <- lm(count ~ temp_f, data=bike_data)
coef(bike_data_lm_tempf)

## (Intercept)      temp_f 
## -157.050506    5.095582

confint(bike_data_lm_tempf)

##                  2.5 %      97.5 %
## (Intercept) -172.62703 -141.473980
## temp_f         4.87255    5.318614
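The two sets of coefficients can be checked against each other by hand. Because temp_f = temp * 9/5 + 32, the Fahrenheit slope should equal the Celsius slope times 5/9, and the Fahrenheit intercept should equal the Celsius intercept minus 32 times the Fahrenheit slope. A short sketch using only the numbers reported above (the variable names are introduced here for illustration):

```r
# Coefficients copied from the count ~ temp output.
b0_c <- 6.008118
b1_c <- 9.172048

# Implied coefficients for count ~ temp_f, derived from the unit conversion.
b1_f <- b1_c * 5 / 9
b0_f <- b0_c - 32 * b1_f

# Compare with the values reported by coef(bike_data_lm_tempf).
print(c(implied_slope = b1_f, reported_slope = 5.095582))
print(c(implied_intercept = b0_f, reported_intercept = -157.050506))
```

The implied and reported values agree up to rounding, which supports the conclusion that both regressions describe the same fitted line in different units.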

Question 2: Multiple Linear Regression - I

On the same dataset as Q1, perform the following multiple linear regression: count ~ temp + season + humidity + weather + year. Keep season and weather as categorical variables. Interpret your results through the following means:

what is the intercept and what does it mean? Ans. The intercept is 98.5. It is the predicted total number of bike rentals (about 98) when the temperature is 0 degrees Celsius, the humidity is 0, and the categorical variables are at their reference levels: season = 1 (spring), weather = 1 (clear, few clouds, partly cloudy), and year = 2011.

## Insert code below
bike_data <- bike_data %>% mutate(season = as.factor(season))
bike_data <- bike_data %>% mutate(weather = as.factor(weather))
bike_data_mr_lm <- lm(count ~ temp + season + humidity + weather + year, data = bike_data)

summary(bike_data_mr_lm)

## 
## Call:
## lm(formula = count ~ temp + season + humidity + weather + year, 
##     data = bike_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -332.13  -98.50  -24.59   71.56  669.69 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  98.49594    7.33550  13.427  < 2e-16 ***
## temp         10.43201    0.31043  33.606  < 2e-16 ***
## season2       4.71765    5.24017   0.900  0.36799    
## season3     -29.10223    6.64854  -4.377 1.21e-05 ***
## season4      66.98554    4.39278  15.249  < 2e-16 ***
## humidity     -2.73308    0.08594 -31.800  < 2e-16 ***
## weather2     11.34183    3.48729   3.252  0.00115 ** 
## weather3     -7.37635    5.78648  -1.275  0.20242    
## year2012     75.87791    2.88950  26.260  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 149.6 on 10876 degrees of freedom
## Multiple R-squared:  0.3189, Adjusted R-squared:  0.3184 
## F-statistic: 636.7 on 8 and 10876 DF,  p-value: < 2.2e-16

coef(bike_data_mr_lm)

## (Intercept)        temp     season2     season3     season4    humidity 
##   98.495944   10.432008    4.717650  -29.102227   66.985542   -2.733076 
##    weather2    weather3    year2012 
##   11.341830   -7.376353   75.877913

confint(bike_data_mr_lm)

##                  2.5 %     97.5 %
## (Intercept)  84.117030 112.874857
## temp          9.823518  11.040497
## season2      -5.554035  14.989334
## season3     -42.134585 -16.069869
## season4      58.374886  75.596197
## humidity     -2.901543  -2.564609
## weather2      4.506115  18.177544
## weather3    -18.718903   3.966197
## year2012     70.213974  81.541852
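The coefficient names season2, season3, and season4 (with no season1) come from R's default treatment coding of factors: the first level becomes the reference and is absorbed into the intercept. A minimal sketch on a toy factor, not the bike data, shows the dummy columns that lm() builds:

```r
# Toy factor with the same four levels as the season variable.
season <- factor(c(1, 2, 3, 4), levels = 1:4)

# Design matrix used internally by lm() for a formula containing this factor.
design <- model.matrix(~ season)
print(design)
```

The row for season 1 has zeros in every dummy column, so its predicted value is the intercept alone; each other season's coefficient is a difference from that baseline.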

how does each variable contribute to count in terms of increase or decrease?
Ans. We check the sign of each variable's coefficient: positive means an increase and negative means a decrease. Each effect holds the other variables fixed. Season effects are relative to spring (season 1) and weather effects are relative to weather 1.

Temp - If temperature increases by 1 degree Celsius, the number of bike rentals increases by about 10 (10.43).

Season2 (Summer) - In the summer, compared to spring, the number of bike rentals increases by about 5 (4.72).

Season3 (Fall) - In the fall, compared to spring, the number of bike rentals decreases by about 29 (-29.10).

Season4 (Winter) - In the winter, compared to spring, the number of bike rentals increases by about 67 (66.99).

Humidity - For each 1 unit increase in relative humidity, the number of bike rentals decreases by about 3 (-2.73).

Weather2 - When the weather is mist + cloudy, mist + broken clouds, mist + few clouds, or mist, compared to weather 1, the number of bike rentals increases by about 11 (11.34).

Weather3 - When the weather is light snow, light rain + thunderstorm + scattered clouds, or light rain + scattered clouds, compared to weather 1, the number of bike rentals decreases by about 7 (-7.38).

Year2012 - In the year 2012, compared to 2011, the number of bike rentals increases by about 76 (75.88).

what can you say about the results and the quality of fit? Use pvalue threshold of < 0.001 to reject any null hypothesis.
Ans. Quality of fit: The RSE for the data is very high (it should be low), 149.6 on 10876 degrees of freedom. The R^2 statistic is low at 0.32 (it should be high), so the model explains only about 32% of the variance in count. The data points are widely scattered around the fitted values, indicating a poor quality of fit.

For each coefficient, the null hypothesis is that the coefficient is 0 (no association with count, holding the other variables fixed). We reject it when the p-value is below 0.001.

Temp - The 95% CI is (9.82, 11.04) and the p-value is < 2e-16. The p-value is below 0.001 and the 95% CI excludes 0, so we reject the null hypothesis.

Season2 (Summer) - The 95% CI is (-5.55, 14.99) and the p-value is 0.368. The p-value is above 0.001 and the 95% CI includes 0, so we do not reject the null hypothesis.

Season3 (Fall) - The 95% CI is (-42.13, -16.07) and the p-value is 1.21e-05. The p-value is below 0.001 and the 95% CI excludes 0, so we reject the null hypothesis.

Season4 (Winter) - The 95% CI is (58.37, 75.60) and the p-value is < 2e-16. The p-value is below 0.001 and the 95% CI excludes 0, so we reject the null hypothesis.

Humidity (Relative) - The 95% CI is (-2.90, -2.56) and the p-value is < 2e-16. The p-value is below 0.001 and the 95% CI excludes 0, so we reject the null hypothesis.

Weather2 (Mist+cloudy, mist+broken clouds, mist+few clouds, mist) - The 95% CI is (4.51, 18.18) and the p-value is 0.00115. Although the 95% CI excludes 0, the p-value is slightly above the 0.001 threshold, so we do not reject the null hypothesis at this threshold.

Weather3 (Light snow, light rain+thunderstorm+scattered clouds, light rain+scattered clouds) - The 95% CI is (-18.72, 3.97) and the p-value is 0.202. The p-value is above 0.001 and the 95% CI includes 0, so we do not reject the null hypothesis.

Year2012 - The 95% CI is (70.21, 81.54) and the p-value is < 2e-16. The p-value is below 0.001 and the 95% CI excludes 0, so we reject the null hypothesis.
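The threshold comparison can also be done mechanically. The sketch below hard-codes the p-values printed in the summary output (entries shown there as "< 2e-16" are entered as 2e-16, which is an upper bound rather than the exact value) and flags which coefficients fall below 0.001. With the model object available, the same column could be read from `summary(bike_data_mr_lm)$coefficients[, "Pr(>|t|)"]`.

```r
# p-values copied from summary(bike_data_mr_lm); 2e-16 stands in for "< 2e-16".
p_values <- c(temp = 2e-16, season2 = 0.36799, season3 = 1.21e-05,
              season4 = 2e-16, humidity = 2e-16, weather2 = 0.00115,
              weather3 = 0.20242, year2012 = 2e-16)

alpha <- 0.001               # rejection threshold stated in the question
reject <- p_values < alpha   # TRUE where the null hypothesis is rejected

print(reject)
```

Note that weather2 (p = 0.00115) would be significant at the usual 0.05 level but does not pass the stricter 0.001 threshold.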

Question 3: Multiple Linear Regression - II

This question deals with an application of linear regression. Download the dataset titled “sales_advertising.csv” from Files -> Assignments -> Assignment_3. The dataset measures sales of a product as a function of advertising budgets for TV, radio, and newspaper media. The following is the data dictionary.

TV: advertising budget for TV (in thousands of dollars)
radio: advertising budget for radio (in thousands of dollars)
newspaper: advertising budget for newspaper (in thousands of dollars)
sales: sales of product (in thousands of units)

Plot the response (sales) against all three predictors in three separate plots. Write your code below. Do any of the plots show a linear trend?
Ans. The plot between sales and TV shows a strongly positive linear trend. The plot between sales and newspaper does not show a linear trend. The plot between sales and radio shows a weakly positive linear trend.

## Insert code below

# Note: to make the question clearer, we are comparing sales (increase or decrease) of a single product (not of TVs, radios, or newspapers) against the TV, radio, and newspaper advertising budgets. We are looking to see which of these media should be allocated the most budget to sell the product.

# Read in the csv file
sales_advertising_data <- read.csv("~/Desktop/sales_advertising.csv", header = TRUE)

# Plot between sales and TV
plot(sales_advertising_data$TV, sales_advertising_data$Sales, ylab = "Sales", xlab = "TV", main = "Sales of product with TV")

# Plot between sales and newspaper
plot(sales_advertising_data$Newspaper, sales_advertising_data$Sales, ylab = "Sales", xlab = "Newspaper", main = "Sales of product with Newspaper")

# Plot between sales and radio
plot(sales_advertising_data$Radio, sales_advertising_data$Sales, ylab = "Sales", xlab = "Radio", main = "Sales of product with Radio")

Perform a simple regression to model sales ~ TV. Write your code below. What is the observed association between sales and TV? What is the null hypothesis for this particular model? From the regression, what can we say about the null hypothesis? Use a p-value threshold of <0.05 to indicate significance.
Ans.

What is the observed association between sales and TV? With a TV advertising budget of $0, the model predicts sales of about 7,033 units of the product. For every additional $1,000 spent on TV advertising, the number of units sold increases by about 48 (47.5). The association is positive.

What is the null hypothesis for this particular model? There is no linear association between the TV advertising budget and sales, i.e. the coefficient of TV is 0, so an increase or decrease in the TV budget does not affect sales of the product.

What can we say about the null hypothesis? (We are looking at the p-value and 95% confidence interval of the TV coefficient.) Since the p-value (< 2e-16) is below 0.05 and the 95% CI, (42.2, 52.8) units per $1,000, does not include 0, we reject the null hypothesis.

# Insert code
# code for (b) sales ~ TV
sales_tv_lm <- lm(Sales ~ TV, data=sales_advertising_data)
summary(sales_tv_lm)

## 
## Call:
## lm(formula = Sales ~ TV, data = sales_advertising_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3860 -1.9545 -0.1913  2.0671  7.2124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.032594   0.457843   15.36   <2e-16 ***
## TV          0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16

coef(sales_tv_lm)

## (Intercept)          TV 
##  7.03259355  0.04753664

confint(sales_tv_lm)

##                  2.5 %     97.5 %
## (Intercept) 6.12971927 7.93546783
## TV          0.04223072 0.05284256

Perform a simple regression to model sales ~ newspaper. Write your code below. What is the observed association between sales and newspaper? What is the null hypothesis for this particular model? From the regression, what can we say about the null hypothesis? Use a p-value threshold of <0.05 to indicate significance.
Ans.

What is the observed association between sales and Newspaper? With a newspaper advertising budget of $0, the model predicts sales of about 12,351 units of the product. For every additional $1,000 spent on newspaper advertising, the number of units sold increases by about 55 (54.7). The association is positive.

What is the null hypothesis for this particular model? There is no linear association between the newspaper advertising budget and sales, i.e. the coefficient of Newspaper is 0, so an increase or decrease in the newspaper budget does not affect sales of the product.

What can we say about the null hypothesis? (We are looking at the p-value and 95% confidence interval of the Newspaper coefficient.) Since the p-value (0.00115) is below 0.05 and the 95% CI, (22.0, 87.4) units per $1,000, does not include 0, we reject the null hypothesis.

# Insert code
# code for (c) sales ~ newspaper
sales_news_lm <- lm(Sales ~ Newspaper, data=sales_advertising_data)
summary(sales_news_lm)

## 
## Call:
## lm(formula = Sales ~ Newspaper, data = sales_advertising_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2272  -3.3873  -0.8392   3.5059  12.7751 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.35141    0.62142   19.88  < 2e-16 ***
## Newspaper    0.05469    0.01658    3.30  0.00115 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.092 on 198 degrees of freedom
## Multiple R-squared:  0.05212,    Adjusted R-squared:  0.04733 
## F-statistic: 10.89 on 1 and 198 DF,  p-value: 0.001148

coef(sales_news_lm)

## (Intercept)   Newspaper 
##  12.3514071   0.0546931

confint(sales_news_lm)

##                   2.5 %      97.5 %
## (Intercept) 11.12595560 13.57685854
## Newspaper    0.02200549  0.08738071

Perform a multiple linear regression to model sales ~ TV + radio + newspaper.
Ans. When all budgets (TV, radio, and newspaper) are 0, the model predicts sales of about 2,939 units of the product.

# Insert code
# code for (d) sales ~ TV + Radio + Newspaper

sales_data_lm_mr <- lm(Sales ~ TV + Radio + Newspaper, data = sales_advertising_data)

summary(sales_data_lm_mr)

## 
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper, data = sales_advertising_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## Radio        0.188530   0.008611  21.893   <2e-16 ***
## Newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

coef(sales_data_lm_mr)

##  (Intercept)           TV        Radio    Newspaper 
##  2.938889369  0.045764645  0.188530017 -0.001037493

confint(sales_data_lm_mr)

##                   2.5 %     97.5 %
## (Intercept)  2.32376228 3.55401646
## TV           0.04301371 0.04851558
## Radio        0.17154745 0.20551259
## Newspaper   -0.01261595 0.01054097
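Since sales is recorded in thousands of units and budgets in thousands of dollars, multiplying a coefficient by 1000 gives units sold per additional $1,000 of budget. The sketch below applies that conversion to the confidence limits printed above and flags which intervals exclude 0; the object names are introduced here for illustration.

```r
# 95% confidence limits copied from confint(sales_data_lm_mr), slopes only.
ci <- rbind(TV        = c(0.04301371, 0.04851558),
            Radio     = c(0.17154745, 0.20551259),
            Newspaper = c(-0.01261595, 0.01054097))
colnames(ci) <- c("lower", "upper")

# Convert from thousands of units to units per additional $1,000 of budget.
ci_units <- ci * 1000

# An interval that excludes 0 corresponds to significance at the 5% level.
excludes_zero <- ci_units[, "lower"] > 0 | ci_units[, "upper"] < 0

print(round(ci_units, 1))
print(excludes_zero)
```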

What are the observed associations between sales and each of the media budgets? Mention which associations are significant. Use a p-value threshold of <0.05 to indicate significance.
Ans. When all the advertising budgets (TV, radio, and newspaper) are 0, the model predicts sales of about 2,939 units of the product. Each association below holds the other two budgets fixed.

Association between sales and TV: For every additional $1,000 spent, the number of units sold increases by about 46 (45.8). Since the p-value (< 2e-16) is below 0.05 and the 95% CI, (43.0, 48.5) units, excludes 0, this association is significant.

Association between sales and radio: For every additional $1,000 spent, the number of units sold increases by about 189 (188.5). Since the p-value (< 2e-16) is below 0.05 and the 95% CI, (171.5, 205.5) units, excludes 0, this association is significant.

Association between sales and newspaper: For every additional $1,000 spent, the number of units sold changes by about -1 (-1.04). Since the p-value (0.86) is above 0.05 and the 95% CI, (-12.6, 10.5) units, includes 0, this association is not significant.

Do you observe any difference in the associations in the multiple regression model vs. the simple regression model? Explain why or why not.
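Ans. Yes, there is a difference. The sales ~ newspaper association was significant in the simple regression (p-value 0.00115 and 95% CI (22.0, 87.4) units per $1,000) but not significant in the multiple regression (p-value 0.86 and 95% CI (-12.6, 10.5) units per $1,000). The difference between simple and multiple regression is the number of variables involved. In the simple regression, newspaper is the only predictor, so its coefficient absorbs any effect of omitted variables that move together with it. In the multiple regression, each coefficient measures the association with sales while holding the other budgets fixed, and once TV and radio are accounted for, newspaper adds essentially nothing. A plausible explanation, which could be checked with cor() on the budget columns, is that the newspaper budget is correlated with another budget that does drive sales. Significance therefore depends on the data and on which other variables are in the model.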
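To illustrate how this can happen, here is a small simulation on invented data (not the advertising dataset). It assumes, purely for illustration, that the newspaper budget is correlated with the radio budget and that only radio affects sales. The variable names mirror the assignment, but the numbers and effect sizes are made up.

```r
set.seed(1)
n <- 200

# Simulated budgets: newspaper moves together with radio by construction.
radio     <- rnorm(n)
newspaper <- 0.5 * radio + rnorm(n)

# Simulated sales depend on radio only; newspaper has no direct effect.
sales <- 2 * radio + rnorm(n)

simple_fit   <- lm(sales ~ newspaper)
multiple_fit <- lm(sales ~ radio + newspaper)

# Estimate, standard error, t value, and p-value for newspaper in each model.
simple_coef   <- summary(simple_fit)$coefficients["newspaper", ]
multiple_coef <- summary(multiple_fit)$coefficients["newspaper", ]
print(rbind(simple = simple_coef, multiple = multiple_coef))
```

In the simple model, newspaper picks up the effect of the omitted radio variable; once radio is included, the newspaper estimate shrinks toward 0. Whether the same mechanism explains the real data would need to be confirmed by inspecting the correlations among the actual budgets.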

February 5, 2019
