1. Download the homework1.Rmd file from the course website.

2. Open homework1.Rmd in RStudio.

3. Change the author field and the file name.

(a) Change the author: field on the Rmd document from Your Name Here to your own name.
(b) Rename this file to “hw1_YourNameHere.Rmd”, where YourNameHere is changed to your own name.

Just so you know, here are some useful RStudio hotkeys:

| Keystroke | Description |
|:--------------|:------------|
| <tab> | Autocompletes commands and filenames, and lists arguments for functions. |
| <up> | Cycles through previous commands in the console prompt. |
| <ctrl-up> | Lists the history of previous commands matching an unfinished one. |
| <ctrl-enter> | Runs the current line from the source window in the Console. Good for trying out ideas from a source file. |
| <ESC> | Aborts an unfinished command and gets out of the + prompt. |

Note: Shown above are the Windows/Linux keys. For Mac OS X, the <ctrl> key should be substituted with the <command> (⌘) key.

  1. Instead of sending code line-by-line with <ctrl-enter>, you can send entire code chunks, and even run all of the code chunks in your .Rmd file. Look under the Run menu of the Source panel.

  2. Run your code in the Console and Knit HTML frequently to check for errors.

  3. You may find it easier to solve a problem by interacting only with the Console at first, or by creating a separate .R source file that contains only R code and no Markdown.

### 2. Installing and loading packages

As with other programming languages you may be familiar with, R’s capabilities can be greatly extended by installing additional “packages” and “libraries”.

To install a package, use the install.packages() command. Of the packages below, ISLR corresponds to the textbook, available as a free download at https://static1.squarespace.com/static/5ff2adbe3fe4fe33db902812/t/6009dd9fa7bc363aa822d2c7/1611259312432/ISLR+Seventh+Printing.pdf; parts of this homework are tied to the book itself. You’ll want to run the following commands to get the necessary packages for today’s lab:

install.packages("ggplot2")
install.packages("MASS")
install.packages("ISLR")
install.packages("knitr")

You only need to install packages once. Once they’re installed, you can load them with the library() command. For this homework, you’ll want to run the following code:

library(ggplot2) # graphics library
library(MASS)    # contains data sets, including Boston
library(ISLR)    # contains code and data from the textbook
library(knitr)   # contains kable() function

options(scipen = 4)  # Suppresses scientific notation

### 3. Simple (Bivariate) Linear Regression with the Boston Housing data

This portion of the lab gets you to carry out the Lab in §3.6 of ISLR (Pages 109 - 118). You will want to have the textbook Lab open in front of you as you go through these exercises. The ISLR Lab provides much more context and explanation for what you’re doing.

Please run all of the code indicated in §3.6 of ISLR, even if I don’t explicitly ask you to do so in this document.

Note: You may want to use the View(Boston) command instead of fix(Boston).

(a) Use the dim() command to figure out the number of rows and columns in the Boston housing data.
library(MASS)                # contains the Boston dataset
data("Boston")               # load it into your session

# quick sanity checks
dim(Boston)       # rows & cols
## [1] 506  14
names(Boston)     # variable names
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
head(Boston, 3)   # first few rows
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
(b) Use the nrow() and ncol() commands to figure out the number of rows and columns in the Boston housing data.
nrow(Boston)
## [1] 506
ncol(Boston)
## [1] 14
(c) Use the names() command to see which variables exist in the data. Which of these variables is our response variable? What does this response variable refer to? How many input variables do we have?
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
(d) Use the lm() function to fit a linear regression of medv on lstat. Save the output of your linear regression in a variable called lm.fit.
lm.fit <- lm(medv ~ lstat, data = Boston)
(e) Use the summary() command on your lm.fit object to get a print-out of your regression results.
summary(lm.fit)
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
(f) Uncomment the line below to get a ‘nice’ printout of the coefficients table.
library(knitr)
kable(summary(lm.fit)$coefficients, digits = c(4, 5, 2, 4))
|            | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:-----------|---------:|-----------:|--------:|-----------:|
|(Intercept) |  34.5538 |    0.56263 |   61.42 |          0 |
|lstat       |  -0.9500 |    0.03873 |  -24.53 |          0 |
(g) Call names() on lm.fit to explore what values this linear model object contains.
names(lm.fit)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
(h) Use the coef() function to get the estimated coefficients. What is the estimated Intercept? What is the coefficient of lstat in the model? Interpret this coefficient.
coef(lm.fit)
## (Intercept)       lstat 
##  34.5538409  -0.9500494
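  • The estimated intercept is about 34.55, and the coefficient of lstat is about -0.95: each additional percentage point of lower-status population is associated with a decrease of roughly $950 in median home value.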
(i) Next, as in Lab 2, use a ggplot command that overlays a linear regression line on a scatterplot of medv vs. lstat. Make sure to produce more meaningful axis labels. Does the linear model appear to fit the data well? Explain.
library(ggplot2)

ggplot(Boston, aes(x = lstat, y = medv)) +
  geom_point(color = "steelblue", alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(
    title = "Linear Regression of medv on lstat (Boston Housing Data)",
    x = "Percent Lower-Status Population (lstat)",
    y = "Median Home Value (medv, in $1000s)"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
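  • The line captures the overall downward trend, but the scatter shows clear curvature: medv levels off at high lstat and sits above the line at both extremes, so a straight line is only a rough fit (Section 5 revisits this with polynomial terms).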

(j) Follow the ISLR examples for getting confidence intervals and prediction intervals for the regression data. Remind me to talk about prediction intervals in class.
# 95% confidence interval for the slope and intercept
confint(lm.fit)
##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505
# Example: predict median value for specific lstat values
new.data <- data.frame(lstat = c(5, 10, 15))
predict(lm.fit, newdata = new.data, interval = "confidence")
##        fit      lwr      upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
predict(lm.fit, newdata = new.data, interval = "prediction")
##        fit       lwr      upr
## 1 29.80359 17.565675 42.04151
## 2 25.05335 12.827626 37.27907
## 3 20.30310  8.077742 32.52846
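As a sanity check (not required by the assignment), the prediction interval at lstat = 10 can be reproduced by hand from the fitted model; the sketch below should match the second row of the output above (x0, se.pred, and t.crit are names introduced here for illustration).
x0 <- 10
x  <- Boston$lstat
n  <- length(x)
fit0    <- unname(coef(lm.fit)[1] + coef(lm.fit)[2] * x0)  # point prediction
s       <- summary(lm.fit)$sigma                           # residual standard error
se.pred <- s * sqrt(1 + 1/n + (x0 - mean(x))^2 / sum((x - mean(x))^2))
t.crit  <- qt(0.975, df = n - 2)
c(fit = fit0, lwr = fit0 - t.crit * se.pred, upr = fit0 + t.crit * se.pred)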

### 4. Multiple Linear Regression with the Boston Housing data

(a) Use the command ?Boston to figure out what the age variable means. What does age mean in the Boston Housing data?

?Boston  # opens the help page (or use help("Boston", package = "MASS"))
# Answer (put this in your Rmd text, not code):
# age = proportion of owner-occupied units built prior to 1940 (in percent).

(b) Following the example in part 3(i) above, use the qplot() command to construct a scatterplot of medv versus age. Make sure to specify meaningful x and y axis names. Overlay a linear regression line. Does a linear relationship appear to hold between the two variables?
library(ggplot2)

# qplot() passes method/se to every geom, which geom_point() ignores (and
# qplot() itself is deprecated in ggplot2 >= 3.4.0), so it is cleaner to add
# the regression layer explicitly:
qplot(
  x = age, y = medv, data = Boston,
  xlab = "Proportion of owner-occupied units built before 1940 (%)",
  ylab = "Median home value (medv, $1000s)",
  main = "Boston Housing: medv vs age with linear fit"
) +
  geom_smooth(method = "lm", se = TRUE)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'

# The fitted line slopes downward: medv tends to decrease as age increases, but the relationship is weak and noisy, so a straight line describes it only loosely.
(c) Use the lm() command to fit a linear regression of medv on lstat and age. Save your regression model in a variable called lm.fit.
lm.fit2 <- lm(medv ~ lstat + age, data = Boston)
summary(lm.fit2)
## 
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.981  -3.978  -1.283   1.968  23.158 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
## lstat       -1.03207    0.04819 -21.416  < 2e-16 ***
## age          0.03454    0.01223   2.826  0.00491 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared:  0.5513, Adjusted R-squared:  0.5495 
## F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16
(d) What is the coefficient of age in your model? Interpret this coefficient.
coef(lm.fit2)["age"]
##        age 
## 0.03454434
summary(lm.fit2)$coefficients["age", ]    
##    Estimate  Std. Error     t value    Pr(>|t|) 
## 0.034544339 0.012225467 2.825604893 0.004906776
confint(lm.fit2)["age", ]
##      2.5 %     97.5 % 
## 0.01052507 0.05856361
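  • The coefficient of age is about 0.0345: holding lstat fixed, each additional percentage point of pre-1940 housing is associated with roughly $35 more in median home value. The 95% confidence interval (0.011, 0.059) excludes zero, so the effect is statistically significant but small.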
(e) Use medv ~ . syntax to fit a model regressing medv on all the other variables. Use the summary() and kable() functions to produce a coefficients table in nice formatting.
lm.fit.full <- lm(medv ~ ., data = Boston)

summary(lm.fit.full)
## 
## Call:
## lm(formula = medv ~ ., data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##                Estimate  Std. Error t value Pr(>|t|)    
## (Intercept)  36.4594884   5.1034588   7.144 3.28e-12 ***
## crim         -0.1080114   0.0328650  -3.287 0.001087 ** 
## zn            0.0464205   0.0137275   3.382 0.000778 ***
## indus         0.0205586   0.0614957   0.334 0.738288    
## chas          2.6867338   0.8615798   3.118 0.001925 ** 
## nox         -17.7666112   3.8197437  -4.651 4.25e-06 ***
## rm            3.8098652   0.4179253   9.116  < 2e-16 ***
## age           0.0006922   0.0132098   0.052 0.958229    
## dis          -1.4755668   0.1994547  -7.398 6.01e-13 ***
## rad           0.3060495   0.0663464   4.613 5.07e-06 ***
## tax          -0.0123346   0.0037605  -3.280 0.001112 ** 
## ptratio      -0.9527472   0.1308268  -7.283 1.31e-12 ***
## black         0.0093117   0.0026860   3.467 0.000573 ***
## lstat        -0.5247584   0.0507153 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
library(knitr)
kable(summary(lm.fit.full)$coefficients, digits = c(4, 5, 2, 4))
|            | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:-----------|---------:|-----------:|--------:|-----------:|
|(Intercept) |  36.4595 |    5.10346 |    7.14 |     0.0000 |
|crim        |  -0.1080 |    0.03286 |   -3.29 |     0.0011 |
|zn          |   0.0464 |    0.01373 |    3.38 |     0.0008 |
|indus       |   0.0206 |    0.06150 |    0.33 |     0.7383 |
|chas        |   2.6867 |    0.86158 |    3.12 |     0.0019 |
|nox         | -17.7666 |    3.81974 |   -4.65 |     0.0000 |
|rm          |   3.8099 |    0.41793 |    9.12 |     0.0000 |
|age         |   0.0007 |    0.01321 |    0.05 |     0.9582 |
|dis         |  -1.4756 |    0.19945 |   -7.40 |     0.0000 |
|rad         |   0.3060 |    0.06635 |    4.61 |     0.0000 |
|tax         |  -0.0123 |    0.00376 |   -3.28 |     0.0011 |
|ptratio     |  -0.9527 |    0.13083 |   -7.28 |     0.0000 |
|black       |   0.0093 |    0.00269 |    3.47 |     0.0006 |
|lstat       |  -0.5248 |    0.05072 |  -10.35 |     0.0000 |
(f) Think about what the variables in the data set mean. Do the signs of all of the coefficient estimates make sense? Are there any that do not? For the ones that do not, are the coefficients statistically significant (do they have p-value < 0.05)?
summary(lm.fit.full)$coefficients[, 4]  
##  (Intercept)         crim           zn        indus         chas          nox 
## 3.283438e-12 1.086810e-03 7.781097e-04 7.382881e-01 1.925030e-03 4.245644e-06 
##           rm          age          dis          rad          tax      ptratio 
## 1.979441e-18 9.582293e-01 6.013491e-13 5.070529e-06 1.111637e-03 1.308835e-12 
##        black        lstat 
## 5.728592e-04 7.776912e-23
sig.vars <- summary(lm.fit.full)$coefficients[
  summary(lm.fit.full)$coefficients[, 4] < 0.05, ]
kable(sig.vars, digits = 4)
|            | Estimate | Std. Error |  t value | Pr(>\|t\|) |
|:-----------|---------:|-----------:|---------:|-----------:|
|(Intercept) |  36.4595 |     5.1035 |   7.1441 |     0.0000 |
|crim        |  -0.1080 |     0.0329 |  -3.2865 |     0.0011 |
|zn          |   0.0464 |     0.0137 |   3.3816 |     0.0008 |
|chas        |   2.6867 |     0.8616 |   3.1184 |     0.0019 |
|nox         | -17.7666 |     3.8197 |  -4.6513 |     0.0000 |
|rm          |   3.8099 |     0.4179 |   9.1161 |     0.0000 |
|dis         |  -1.4756 |     0.1995 |  -7.3980 |     0.0000 |
|rad         |   0.3060 |     0.0663 |   4.6129 |     0.0000 |
|tax         |  -0.0123 |     0.0038 |  -3.2800 |     0.0011 |
|ptratio     |  -0.9527 |     0.1308 |  -7.2825 |     0.0000 |
|black       |   0.0093 |     0.0027 |   3.4668 |     0.0006 |
|lstat       |  -0.5248 |     0.0507 | -10.3471 |     0.0000 |

### 5. Non-Linear transformations of the predictors

(a) Perform a regression of medv onto a quadratic polynomial of lstat by using the formula medv ~ lstat + I(lstat^2). Use the summary() function to display the estimated coefficients. Is the coefficient of the squared term statistically significant?
lm.fit.poly <- lm(medv ~ lstat + I(lstat^2), data = Boston)

summary(lm.fit.poly)
## 
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2834  -3.8313  -0.5295   2.3095  25.4148 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.862007   0.872084   49.15   <2e-16 ***
## lstat       -2.332821   0.123803  -18.84   <2e-16 ***
## I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared:  0.6407, Adjusted R-squared:  0.6393 
## F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16
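  • Yes: the I(lstat^2) term has t = 11.63 with p < 2e-16, so the squared term is highly significant. The R-squared also rises from 0.544 for the linear model to 0.641.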
(b) Try using the formula medv ~ lstat + lstat^2 instead. What happens?
lm.fit.poly2 <- lm(medv ~ lstat + lstat^2, data = Boston)
summary(lm.fit.poly2)
## 
## Call:
## lm(formula = medv ~ lstat + lstat^2, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
# Without I(), the ^ operator has its formula meaning: lstat^2 is lstat
# crossed with itself, which simplifies to just lstat. The squared term is
# silently dropped, so we are left with the simple regression of medv on lstat.
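One way to see this directly (a quick check, not part of the assignment) is to compare the columns of the two model matrices:
colnames(model.matrix(medv ~ lstat + lstat^2, data = Boston))    # just "(Intercept)" and "lstat"
colnames(model.matrix(medv ~ lstat + I(lstat^2), data = Boston)) # includes an "I(lstat^2)" column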
(c) Use the formula medv ~ poly(lstat, 2). Compare your results to part (a).
lm.fit.poly3 <- lm(medv ~ poly(lstat, 2), data = Boston)
summary(lm.fit.poly3)
## 
## Call:
## lm(formula = medv ~ poly(lstat, 2), data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2834  -3.8313  -0.5295   2.3095  25.4148 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       22.5328     0.2456   91.76   <2e-16 ***
## poly(lstat, 2)1 -152.4595     5.5237  -27.60   <2e-16 ***
## poly(lstat, 2)2   64.2272     5.5237   11.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared:  0.6407, Adjusted R-squared:  0.6393 
## F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16
# Both models confirm strong non-linearity between lstat and medv.
# The orthogonal polynomial (poly) approach is numerically more stable, but the raw form (I(lstat^2)) is easier to interpret.
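If you want coefficients on the interpretable raw scale while still using poly(), it accepts a raw = TRUE argument; a quick check that this reproduces part (a):
coef(lm(medv ~ poly(lstat, 2, raw = TRUE), data = Boston))  # matches the I(lstat^2) fit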

### 6. ggplot visualizations

ggplot’s stat_smooth command makes it easy to visualize simple regression models. This set of problems helps you get accustomed to specifying polynomial and step-function formulas for the purpose of visualization.

For this problem, please refer to the posted code dealing with oj.R, where I add polynomials and step functions.

(a) Use ggplot graphics to construct a scatterplot of medv vs lstat, overlaying a 2nd degree polynomial. Does this appear to be a good model of the data? Construct plots with higher degree polynomial fits. Do any of them appear to describe the data particularly well?
library(ggplot2)

ggplot(Boston, aes(x = lstat, y = medv)) +
  geom_point(color = "steelblue", alpha = 0.6) +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2), se = TRUE, color = "red", linewidth = 1) +
  labs(
    title = "Boston Housing: medv vs lstat with 2nd-Degree Polynomial Fit",
    x = "Percent Lower-Status Population (lstat)",
    y = "Median Home Value (medv, $1000s)"
  ) +
  theme_minimal()

# The 2nd-degree polynomial captures the smooth, non-linear decline in home values as lstat rises. It fits the curvature well without overfitting: housing values drop quickly at low lstat and level off at higher values. Higher-degree polynomials (compare the overlay sketch below) mostly add wiggle rather than improvement.
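To compare higher-degree fits visually, one option (a sketch; the degrees are chosen only for illustration) is to overlay several stat_smooth() layers:
ggplot(Boston, aes(x = lstat, y = medv)) +
  geom_point(color = "gray60", alpha = 0.5) +
  stat_smooth(aes(color = "degree 2"), method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
  stat_smooth(aes(color = "degree 5"), method = "lm", formula = y ~ poly(x, 5), se = FALSE) +
  stat_smooth(aes(color = "degree 10"), method = "lm", formula = y ~ poly(x, 10), se = FALSE) +
  labs(
    color = "Polynomial fit",
    x = "Percent Lower-Status Population (lstat)",
    y = "Median Home Value (medv, $1000s)"
  ) +
  theme_minimal()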
(c) Repeat part (a), this time using ptratio as the x-axis variable, and medv still as the y-axis variable.
ggplot(Boston, aes(x = ptratio, y = medv)) +
  geom_point(color = "steelblue", alpha = 0.6) +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2), se = TRUE, color = "red", linewidth = 1) +
  labs(
    title = "Boston Housing: medv vs ptratio with 2nd-Degree Polynomial Fit",
    x = "Pupil-Teacher Ratio (ptratio)",
    y = "Median Home Value (medv, $1000s)"
  ) +
  theme_minimal()

(d) Repeat part (b), this time with ptratio instead of lstat.
ggplot(Boston, aes(x = ptratio, y = medv)) +
  geom_point(color = "gray40", alpha = 0.6) +
  stat_smooth(
    method = "lm",
    formula = y ~ cut(x, breaks = c(10, 15, 20, 25)),
    se = FALSE,
    color = "darkred",
    linewidth = 1
  ) +
  labs(
    title = "Step Function Fit: medv vs ptratio",
    x = "Pupil-Teacher Ratio (ptratio)",
    y = "Median Home Value (medv, $1000s)"
  ) +
  theme_minimal()

This final part has the most code, but the goal is to relate the theory to the results.

library(ggplot2)
library(plyr)
library(ISLR)
library(MASS)
library(knitr)

cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

options(scipen = 4)

For this problem we’ll be working with two years of bikeshare data from the Capital Bikeshare system in Washington, DC. It is in the same folder as this assignment on the course website. The dataset contains daily bikeshare counts, along with daily measurements of environmental and seasonal information that may affect bikeshare usage.

Data pre-processing

Let’s start by loading the data.

bikes <- read.csv("bikes.csv", header = TRUE)  # adjust the path to wherever you saved bikes.csv
str(bikes)
## 'data.frame':    731 obs. of  16 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : chr  "1/1/2011" "1/2/2011" "1/3/2011" "1/4/2011" ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 0 1 2 3 4 5 6 0 1 ...
##  $ workingday: int  0 0 1 1 1 1 1 0 0 1 ...
##  $ weathersit: int  2 2 1 1 1 1 2 2 1 1 ...
##  $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
##  $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
##  $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
##  $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
##  $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
##  $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...
head(bikes)
##   instant   dteday season yr mnth holiday weekday workingday weathersit
## 1       1 1/1/2011      1  0    1       0       6          0          2
## 2       2 1/2/2011      1  0    1       0       0          0          2
## 3       3 1/3/2011      1  0    1       0       1          1          1
## 4       4 1/4/2011      1  0    1       0       2          1          1
## 5       5 1/5/2011      1  0    1       0       3          1          1
## 6       6 1/6/2011      1  0    1       0       4          1          1
##       temp    atemp      hum windspeed casual registered  cnt
## 1 0.344167 0.363625 0.805833 0.1604460    331        654  985
## 2 0.363478 0.353739 0.696087 0.2485390    131        670  801
## 3 0.196364 0.189405 0.437273 0.2483090    120       1229 1349
## 4 0.200000 0.212122 0.590435 0.1602960    108       1454 1562
## 5 0.226957 0.229270 0.436957 0.1869000     82       1518 1600
## 6 0.204348 0.233209 0.518261 0.0895652     88       1518 1606
bikes <- transform(bikes,
  temp = 47 * temp - 8,         # Convert normalized temp back to Celsius
  atemp = 66 * atemp - 16,      # Convert normalized feeling temp back to Celsius
  hum = 100 * hum,              # Convert humidity to percentage
  windspeed = 67 * windspeed    # Convert normalized windspeed to actual scale
)

library(plyr)
bikes <- transform(bikes,
  season = mapvalues(season, c(1, 2, 3, 4),
                     c("Winter", "Spring", "Summer", "Fall"))
)

str(bikes)
## 'data.frame':    731 obs. of  16 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : chr  "1/1/2011" "1/2/2011" "1/3/2011" "1/4/2011" ...
##  $ season    : chr  "Winter" "Winter" "Winter" "Winter" ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 0 1 2 3 4 5 6 0 1 ...
##  $ workingday: int  0 0 1 1 1 1 1 0 0 1 ...
##  $ weathersit: int  2 2 1 1 1 1 2 2 1 1 ...
##  $ temp      : num  8.18 9.08 1.23 1.4 2.67 ...
##  $ atemp     : num  7.999 7.347 -3.499 -2 -0.868 ...
##  $ hum       : num  80.6 69.6 43.7 59 43.7 ...
##  $ windspeed : num  10.7 16.7 16.6 10.7 12.5 ...
##  $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
##  $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
##  $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...
summary(bikes)
##     instant         dteday             season                yr        
##  Min.   :  1.0   Length:731         Length:731         Min.   :0.0000  
##  1st Qu.:183.5   Class :character   Class :character   1st Qu.:0.0000  
##  Median :366.0   Mode  :character   Mode  :character   Median :1.0000  
##  Mean   :366.0                                         Mean   :0.5007  
##  3rd Qu.:548.5                                         3rd Qu.:1.0000  
##  Max.   :731.0                                         Max.   :1.0000  
##       mnth          holiday           weekday        workingday   
##  Min.   : 1.00   Min.   :0.00000   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 4.00   1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.000  
##  Median : 7.00   Median :0.00000   Median :3.000   Median :1.000  
##  Mean   : 6.52   Mean   :0.02873   Mean   :2.997   Mean   :0.684  
##  3rd Qu.:10.00   3rd Qu.:0.00000   3rd Qu.:5.000   3rd Qu.:1.000  
##  Max.   :12.00   Max.   :1.00000   Max.   :6.000   Max.   :1.000  
##    weathersit         temp            atemp              hum       
##  Min.   :1.000   Min.   :-5.221   Min.   :-10.781   Min.   : 0.00  
##  1st Qu.:1.000   1st Qu.: 7.843   1st Qu.:  6.298   1st Qu.:52.00  
##  Median :1.000   Median :15.422   Median : 16.124   Median :62.67  
##  Mean   :1.395   Mean   :15.283   Mean   : 15.307   Mean   :62.79  
##  3rd Qu.:2.000   3rd Qu.:22.805   3rd Qu.: 24.168   3rd Qu.:73.02  
##  Max.   :3.000   Max.   :32.498   Max.   : 39.499   Max.   :97.25  
##    windspeed          casual         registered        cnt      
##  Min.   : 1.500   Min.   :   2.0   Min.   :  20   Min.   :  22  
##  1st Qu.: 9.042   1st Qu.: 315.5   1st Qu.:2497   1st Qu.:3152  
##  Median :12.125   Median : 713.0   Median :3662   Median :4548  
##  Mean   :12.763   Mean   : 848.2   Mean   :3656   Mean   :4504  
##  3rd Qu.:15.625   3rd Qu.:1096.0   3rd Qu.:4776   3rd Qu.:5956  
##  Max.   :34.000   Max.   :3410.0   Max.   :6946   Max.   :8714

Let’s look at some boxplots of how bikeshare ride count varies with season.

qplot(data = bikes, x = season, y = cnt, fill = I(cbPalette[3]), geom = "boxplot")

There’s something funny going on here. Instead of showing up in seasonal order, the seasons in the plot are showing up in alphabetical order. The following command reorders the seasons appropriately.

bikes <- transform(bikes, season = factor(season, levels = c("Winter", "Spring", "Summer", "Fall")))

Now let’s try that plot again.

qplot(data = bikes, x = season, y = cnt, fill = I(cbPalette[3]), geom = "boxplot")


Problem 1: Qualitative predictors

The season variable is an example of what’s called a qualitative or categorical predictor. In R, such variables are called factors. This problem gets you to fit a model with a qualitative predictor and to interpret the findings. (This is like how we handled race in class.)
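
To see exactly how R encodes this factor, you can inspect its levels and the dummy columns of the model matrix (a quick illustration, not required by the problem):

levels(bikes$season)                               # "Winter" "Spring" "Summer" "Fall"
head(model.matrix(cnt ~ season, data = bikes), 3)  # 4 levels -> intercept + 3 dummies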

(a) Fit a linear regression model with cnt as the response and season as the input. Use the summary() and kable() commands to produce a nice looking coefficients table.
lm.season <- lm(cnt ~ season, data = bikes)

library(knitr)
kable(summary(lm.season)$coefficients, digits = 3)
|             | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:------------|---------:|-----------:|--------:|-----------:|
|(Intercept)  | 2604.133 |    116.598 |  22.334 |          0 |
|seasonSpring | 2388.199 |    164.221 |  14.543 |          0 |
|seasonSummer | 3040.171 |    163.352 |  18.611 |          0 |
|seasonFall   | 2124.030 |    165.588 |  12.827 |          0 |
(b) How many total coefficients are there in the model?
  • There are 4 total coefficients in the model:

1 intercept (the baseline category: Winter)

3 dummy coefficients for Spring, Summer, and Fall

This is because season has 4 levels, and R automatically omits one (Winter) as the reference group.

(c) How many coefficients are estimated for the season variable?
  • There are 3 estimated coefficients for season:

seasonSpring

seasonSummer

seasonFall

Each represents the difference in the mean number of total rides (cnt) compared to Winter.

(d) Interpret the coefficients of season in the model.
  • The intercept represents the average number of total rides during Winter, which is approximately 2604.

  • The coefficient for Spring (2388.19) means that, on average, there are about 2,388 more rides per day in Spring compared to Winter.

  • The coefficient for Summer (3040.17) indicates that, on average, there are about 3,040 more rides per day in Summer compared to Winter — this is the largest seasonal increase.

  • The coefficient for Fall (2124.03) shows that, on average, there are about 2,124 more rides per day in Fall compared to Winter.

  • All coefficients are statistically significant (p < 0.001), suggesting that bike rentals increase substantially across all other seasons relative to Winter, with Summer having the highest average ridership.
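
A quick sanity check (not required): with a single factor as the only predictor, the regression simply reproduces the seasonal group means, so the coefficients can be verified directly.

# Winter mean = intercept; each other season's mean = intercept + its coefficient
with(bikes, tapply(cnt, season, mean))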

Hint: If you have not previously studied how to interpret qualitative variables in regressions, begin by reading through the relevant sections of the Suggested readings for the Week 1 lectures.


Problem 2: Multiple linear regression

In this problem we’ll practice fitting and interpreting the results of a multiple linear regression.

(a) Fit a regression model with cnt as the response and the following variables as inputs: temp, atemp, mnth, hum, windspeed. Use the summary() and kable() commands to produce a nice looking coefficients table.
lm.multi <- lm(cnt ~ temp + atemp + mnth + hum + windspeed, data = bikes)

summary(lm.multi)
## 
## Call:
## lm(formula = cnt ~ temp + atemp + mnth + hum + windspeed, data = bikes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4829.6  -992.7  -188.5  1089.4  3615.1 
## 
## Coefficients:
##             Estimate Std. Error t value     Pr(>|t|)    
## (Intercept) 5057.781    342.624  14.762      < 2e-16 ***
## temp          45.387     47.414   0.957       0.3388    
## atemp         72.014     38.138   1.888       0.0594 .  
## mnth          95.040     15.742   6.037 0.0000000025 ***
## hum          -35.262      3.801  -9.277      < 2e-16 ***
## windspeed    -59.159     10.601  -5.580 0.0000000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1389 on 725 degrees of freedom
## Multiple R-squared:  0.4895, Adjusted R-squared:  0.486 
## F-statistic:   139 on 5 and 725 DF,  p-value: < 2.2e-16
library(knitr)
kable(summary(lm.multi)$coefficients, digits = 3)
|            | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:-----------|---------:|-----------:|--------:|-----------:|
|(Intercept) | 5057.781 |    342.624 |  14.762 |      0.000 |
|temp        |   45.387 |     47.414 |   0.957 |      0.339 |
|atemp       |   72.014 |     38.138 |   1.888 |      0.059 |
|mnth        |   95.040 |     15.742 |   6.037 |      0.000 |
|hum         |  -35.262 |      3.801 |  -9.277 |      0.000 |
|windspeed   |  -59.159 |     10.601 |  -5.580 |      0.000 |
(b) Interpret the coefficients of mnth, windspeed and atemp in the model.
  • The coefficient for ‘mnth’ (95.04) indicates that, holding the other variables constant, each one-month increase in the month number is associated with about 95 more rides per day (treating mnth as numeric imposes a steady trend across the year).

  • The coefficient for ‘windspeed’ (-59.16) means that each one-unit increase in windspeed (converted back to its original scale above) is associated with about 59 fewer rides per day, holding all other variables constant.

  • The coefficient for ‘atemp’ (72.01) suggests that each 1°C increase in the apparent (feels-like) temperature is associated with roughly 72 more rides per day, controlling for other predictors.

(c) Which predictors are associated with increased ridership? Which predictors are associated with decreased ridership?
  • Predictors with positive coefficients (associated with increased ridership): ‘temp’, ‘atemp’, and ‘mnth’

  • Predictors with negative coefficients (associated with decreased ridership): ‘hum’ and ‘windspeed’

(d) Which predictors are statistically significant at the 0.05 level?
  • From the p-values in the output, the predictors ‘mnth’, ‘hum’, and ‘windspeed’ are statistically significant at the 0.05 level (p < 0.001).

  • Neither ‘temp’ nor ‘atemp’ is individually significant (p > 0.05); as Problem 3 shows, the two are nearly collinear, so the model cannot separate their effects.


Problem 3: Dealing with collinearity

As your recent econometrics class covered, collinear or highly correlated predictors can make interpreting regression coefficients problematic. In this problem you will try to diagnose and address collinearity issues in the data.

(a) A way to check whether variables are correlated is to use the pairs() function on the set of variables used in the multiple regression model estimated just above. This command gives scatterplots above the diagonal and correlations below the diagonal, so you can quickly check whether any of the predictor variables are highly correlated with one another.
bikes.vars <- bikes[, c("temp", "atemp", "mnth", "hum", "windspeed")]

pairs(bikes.vars, main = "Scatterplot Matrix of Predictor Variables")

round(cor(bikes.vars), 3)
##             temp  atemp   mnth    hum windspeed
## temp       1.000  0.992  0.220  0.127    -0.158
## atemp      0.992  1.000  0.227  0.140    -0.184
## mnth       0.220  0.227  1.000  0.222    -0.208
## hum        0.127  0.140  0.222  1.000    -0.248
## windspeed -0.158 -0.184 -0.208 -0.248     1.000
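
The default pairs() call draws scatterplots in every panel; to get correlations below the diagonal as the problem describes, you can pass a custom lower panel. Here is a sketch adapted from the examples in ?pairs (panel.cor is a helper defined here, not a built-in):

panel.cor <- function(x, y, ...) {
  old <- par(usr = c(0, 1, 0, 1))  # use a fixed 0-1 coordinate system in this panel
  on.exit(par(old))                # restore the graphics state afterwards
  text(0.5, 0.5, round(cor(x, y), 3), cex = 1.2)
}
pairs(bikes.vars, lower.panel = panel.cor,
      main = "Scatterplots (upper) and correlations (lower)")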

Hint: A complete example of how to use the pairs() command to construct such plots for the pickup data: first read in the data with pickup <- read.csv("pickup.csv"), then list the variables you want in the plots with truck.var.names <- c("year", "miles", "price"), then type pairs(pickup[, truck.var.names]) to see the pairwise scatterplots. To get a table of all the correlations, type round(cor(pickup[, truck.var.names]), 3). Now that this has been explained, you can do the same with the bikeshare data.

(b) Are any of the predictors highly correlated? Are you surprised that these predictors are highly correlated, or can you think of a reason why it makes sense that they should be correlated?

  • The variables temp and atemp are very highly correlated (correlation ≈ 0.99).

  • This indicates a strong linear relationship — as temperature increases, the “feels-like” temperature increases almost identically.

  • This is expected rather than surprising, because ‘atemp’ (apparent temperature) is derived from the actual temperature, adjusted for factors such as humidity and wind.

(c) Refit your regression model, but this time omit the temp variable. Display the coefficients table for this model.
lm.multi.noTemp <- lm(cnt ~ atemp + mnth + hum + windspeed, data = bikes)

summary(lm.multi.noTemp)
## 
## Call:
## lm(formula = cnt ~ atemp + mnth + hum + windspeed, data = bikes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4864.6  -993.4  -171.3  1102.3  4114.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5186.624    315.061  16.462  < 2e-16 ***
## atemp        108.210      4.970  21.774  < 2e-16 ***
## mnth          95.016     15.741   6.036 2.51e-09 ***
## hum          -35.446      3.796  -9.338  < 2e-16 ***
## windspeed    -57.397     10.440  -5.498 5.33e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1389 on 726 degrees of freedom
## Multiple R-squared:  0.4889, Adjusted R-squared:  0.486 
## F-statistic: 173.6 on 4 and 726 DF,  p-value: < 2.2e-16
library(knitr)
kable(summary(lm.multi.noTemp)$coefficients, digits = 3)
|            | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:-----------|---------:|-----------:|--------:|-----------:|
|(Intercept) | 5186.624 |    315.061 |  16.462 |          0 |
|atemp       |  108.210 |      4.970 |  21.774 |          0 |
|mnth        |   95.016 |     15.741 |   6.036 |          0 |
|hum         |  -35.446 |      3.796 |  -9.338 |          0 |
|windspeed   |  -57.397 |     10.440 |  -5.498 |          0 |
(d) What is the coefficient of atemp in this new model? Is it very different from the atemp coefficient estimated in part (b)? Is it statistically significant? Explain your findings.
  • The coefficient of ‘atemp’ in this new model is 108.210 and is statistically significant (p < 0.001).

  • Compared to the earlier model (where the ‘atemp’ coefficient was 72.014 with p ≈ 0.059), the coefficient increased and became statistically significant after removing ‘temp’.

  • This indicates that the strong collinearity between ‘temp’ and ‘atemp’ inflated the standard errors in the previous model, masking the effect of ‘atemp’.

  • Without ‘temp’, ‘atemp’ now clearly captures the positive relationship between warmer apparent temperatures and increased bike ridership.
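
If you have the car package installed (an extra dependency, not loaded above), variance inflation factors give a standard numeric check on the collinearity diagnosed here; a sketch:

# install.packages("car")  # once, if needed
library(car)
vif(lm.multi)  # temp and atemp should show very large VIFs (a common rule of thumb flags VIF > 10)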