Exercise 8 in section 2.4

This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US. The variables are
Private : Public/private indicator
Apps : Number of applications received
Accept : Number of applicants accepted
Enroll : Number of new students enrolled
Top10perc : New students from top 10 % of high school class
Top25perc : New students from top 25 % of high school class
F.Undergrad : Number of full-time undergraduates
P.Undergrad : Number of part-time undergraduates
Outstate : Out-of-state tuition
Room.Board : Room and board costs
Books : Estimated book costs
Personal : Estimated personal spending
PhD : Percent of faculty with Ph.D.’s
Terminal : Percent of faculty with terminal degree
S.F.Ratio : Student/faculty ratio
perc.alumni : Percent of alumni who donate
Expend : Instructional expenditure per student
Grad.Rate : Graduation rate
Before reading the data into R, it can be viewed in Excel or a text editor.

(a). Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data

urlpage = "https://www.statlearning.com/s/College.csv"
college = read.csv(url(urlpage))
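
If you prefer to work from a local copy of the file (the "set the working directory" point in the exercise), a minimal sketch is to download it first and then read it from disk; here a temporary file is used so no real path is needed:

# download a local copy and read it from disk
localfile = tempfile(fileext = ".csv")
download.file(urlpage, localfile)
college2 = read.csv(localfile)
identical(dim(college), dim(college2))   # TRUE: both reads give a table of the same size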

(b). Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:

# Remove the "#" below
rownames(college) <- college[, 1] # use university name as index
# need to install XQuartz in MAC and restart the computer
fix(college)

You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try

# Remove the "#" below
college = college[,-1]  # keep all rows but delete the 1st column 
fix(college)

Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.
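
Since fix() needs an X11 display (hence the XQuartz note above), here are a couple of alternatives (my own addition) that show the same thing without it:

head(college)    # first few rows, with the university names now used as row names
# View(college)  # spreadsheet-style viewer, e.g. in RStudio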

(c).i. Use the summary() function to produce a numerical summary of the variables in the data set

# Private is read in as character, so convert it to a factor before summarizing
college$Private = as.factor(college$Private)
summary(college)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00

(c).ii. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].

# pairs() draws a scatterplot matrix
pairs(college[,1:10])

(c).iii. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.

attach(college)
plot(Private,Outstate, 
     col = "blue",
     varwidth = T, 
     xlab = "Private", 
     ylab = "Outstate", 
     main = "Side-By-Side Boxplots of Private VS Outstate")

(c).iv. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10 % of their high school classes exceeds 50%.

Elite = rep("No", nrow(college))
Elite[college$Top10perc>50] = "Yes"
Elite = as.factor(Elite)
college = data.frame(college, Elite)
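
The same binning can also be written in one line with ifelse() (my own alternative; Elite2 is just a throwaway name used for the comparison):

Elite2 = factor(ifelse(college$Top10perc > 50, "Yes", "No"))
table(Elite, Elite2)   # the two constructions should agree exactly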

Use the summary() function to see how many elite universities there are.

summary(college$Elite)
##  No Yes 
## 699  78

Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.

attach(college)
plot(Elite, Outstate,
     col = 'yellow',
     varwidth = T,
     xlab = 'Elite',
     ylab = 'Outstate',
     main = "Side-By-Side Boxplots of Elite VS Outstate")

(c).v. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow=c(2,2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.

Divide the print window into four regions

# par(mfrow = c(2, 2)) splits the plotting window into a 2 x 2 grid, so four plots are shown at once
par(mfrow=c(2,2))

Build histograms

par(mfrow=c(2,2))
attach(college)

# Comparison chart1: New students from top 25 % of high school class
hist(Top25perc, col = "#BB8FCE" , breaks =30,
     ylab = "Number",
     main = "Top25perc Histogram with 30 bins")
hist(Top25perc, col = "#BB8FCE", breaks =60,
     ylab = "Number",
     main = "Top25perc Histogram with 60 bins")

# Comparison chart2: Student/faculty ratio
hist(S.F.Ratio, col = "#009E73",breaks = 10,
     ylab = "Count",
     main = "S.F.Ratio Histogram with 10 bins")
hist(S.F.Ratio, col = "#009E73", breaks =40,
     ylab = "Count",
     main = "S.F.Ratio Histogram with 40 bins")

(c).vi. Continue exploring the data, and provide a brief summary of what you discover.

# step1: build a multiple linear regression model
model = lm(Grad.Rate~.,data=college)
summary(model)
## 
## Call:
## lm(formula = Grad.Rate ~ ., data = college)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.991  -7.100  -0.300   7.174  54.034 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.8925541  4.8522723   6.985 6.24e-12 ***
## PrivateYes   3.4262050  1.7151733   1.998 0.046119 *  
## Apps         0.0012936  0.0004428   2.921 0.003588 ** 
## Accept      -0.0006909  0.0008638  -0.800 0.424030    
## Enroll       0.0021440  0.0023111   0.928 0.353840    
## Top10perc    0.0465274  0.0851268   0.547 0.584838    
## Top25perc    0.1374511  0.0564462   2.435 0.015118 *  
## F.Undergrad -0.0004648  0.0004026  -1.155 0.248635    
## P.Undergrad -0.0014809  0.0003907  -3.790 0.000162 ***
## Outstate     0.0010197  0.0002339   4.360 1.48e-05 ***
## Room.Board   0.0019067  0.0005926   3.217 0.001348 ** 
## Books       -0.0022140  0.0029189  -0.758 0.448388    
## Personal    -0.0016620  0.0007703  -2.158 0.031270 *  
## PhD          0.0882924  0.0571134   1.546 0.122543    
## Terminal    -0.0751566  0.0624063  -1.204 0.228845    
## S.F.Ratio    0.0746163  0.1595478   0.468 0.640153    
## perc.alumni  0.2796432  0.0492353   5.680 1.92e-08 ***
## Expend      -0.0004596  0.0001552  -2.961 0.003158 ** 
## EliteYes     0.4618984  2.5235781   0.183 0.854821    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.75 on 758 degrees of freedom
## Multiple R-squared:  0.4616, Adjusted R-squared:  0.4488 
## F-statistic:  36.1 on 18 and 758 DF,  p-value: < 2.2e-16
library(car)
vif(model)
##     Private        Apps      Accept      Enroll   Top10perc   Top25perc 
##    2.788191   14.012117   21.385793   21.999953   10.758529    5.962313 
## F.Undergrad P.Undergrad    Outstate  Room.Board       Books    Personal 
##   18.194541    1.688082    4.224016    2.015314    1.108072    1.297754 
##         PhD    Terminal   S.F.Ratio perc.alumni      Expend       Elite 
##    4.149118    4.027321    1.902894    1.775935    3.132748    2.747435
model1 = lm(Grad.Rate~. -Apps-Accept-Enroll-Top10perc-F.Undergrad,data=college)
summary(model1)
## 
## Call:
## lm(formula = Grad.Rate ~ . - Apps - Accept - Enroll - Top10perc - 
##     F.Undergrad, data = college)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.766  -7.884  -0.582   7.902  56.916 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.2135523  4.7938873   6.303 4.95e-10 ***
## PrivateYes   0.8074562  1.5941170   0.507 0.612636    
## Top25perc    0.2030182  0.0348216   5.830 8.17e-09 ***
## P.Undergrad -0.0011525  0.0003625  -3.179 0.001537 ** 
## Outstate     0.0010874  0.0002307   4.713 2.90e-06 ***
## Room.Board   0.0022963  0.0005903   3.890 0.000109 ***
## Books       -0.0016098  0.0029666  -0.543 0.587532    
## Personal    -0.0015472  0.0007761  -1.994 0.046548 *  
## PhD          0.1011227  0.0572586   1.766 0.077784 .  
## Terminal    -0.0908948  0.0629162  -1.445 0.148954    
## S.F.Ratio    0.1625722  0.1608193   1.011 0.312385    
## perc.alumni  0.2604421  0.0493332   5.279 1.69e-07 ***
## Expend      -0.0002720  0.0001496  -1.818 0.069409 .  
## EliteYes     2.2919269  2.1006727   1.091 0.275598    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.98 on 763 degrees of freedom
## Multiple R-squared:  0.4384, Adjusted R-squared:  0.4289 
## F-statistic: 45.82 on 13 and 763 DF,  p-value: < 2.2e-16
vif(model1)
##     Private   Top25perc P.Undergrad    Outstate  Room.Board       Books 
##    2.324528    2.189928    1.402653    3.967357    1.929870    1.104673 
##    Personal         PhD    Terminal   S.F.Ratio perc.alumni      Expend 
##    1.271330    4.024846    3.950690    1.865938    1.720837    2.808543 
##       Elite 
##    1.837379

Summary: I want to know which factors affect the graduation rate of colleges, so I build a multiple linear regression model. I then use the vif() function to check for multicollinearity and drop some problematic variables; the R-squared drops from 0.462 to 0.438.

  1. From the results, “Top25perc”, “Outstate”, “Room.Board”, “Personal” and “perc.alumni” (and also “P.Undergrad”) significantly affect the graduation rate.
  2. Each additional percentage point of new students coming from the top 25 % of their high school class is associated with an increase of about 0.2 percentage points in Grad.Rate. This is reasonable: students from the top 25 % of their high school classes tend to study hard, so they graduate from college more smoothly.
  3. The second factor I want to interpret is “perc.alumni”: a 1 percentage point increase in the share of alumni who donate is associated with roughly a 0.26 percentage point increase in the graduation rate. Much of the donated money is used to improve the teaching environment, which confirms the importance of a good learning environment to students.
  4. The final and most interesting predictor is the estimated personal spending: when the average personal spending per student increases by 100 dollars, the graduation rate is expected to drop by about 0.15 percentage points. Spending money on dinners, parties and various entertainment activities can lead to this result. But if those 100 dollars are spent in the right place, such as out-of-state tuition or room and board, the graduation rate is expected to rise by about 0.11 and 0.23 percentage points respectively (a quick numeric check follows this list). Well, it is pretty interesting. This tells us a truth: no pain, no gain!
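
As a quick arithmetic check of the interpretations above (my own addition, using the model1 object fitted earlier), multiplying each coefficient by the change in its predictor gives the predicted change in Grad.Rate, in percentage points:

# predicted change in Grad.Rate (percentage points) for a given change in one predictor,
# holding the other variables fixed; coefficients come from model1 above
b = coef(model1)
b["Top25perc"]   * 1     # +1 point in Top25perc      -> about +0.20
b["perc.alumni"] * 1     # +1 point in perc.alumni    -> about +0.26
b["Personal"]    * 100   # +$100 personal spending    -> about -0.15
b["Outstate"]    * 100   # +$100 out-of-state tuition -> about +0.11
b["Room.Board"]  * 100   # +$100 room and board       -> about +0.23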

Exercise 9 in section 2.4

This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data

Read the data “Auto” from your computer

# Method 1:
Auto = read.csv("/Users/zhangling/Desktop/7035 AI/Assignment/ass1/Auto.csv",header = T, na.strings = "?")
fix(Auto)
# Method 2:
Auto1 = read.table("/Users/zhangling/Desktop/7035 AI/Assignment/ass1/Auto.data", header = T, na.strings = "?")
fix(Auto1)

Use the na.omit() function to remove rows with missing values.

Auto = na.omit(Auto)
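
As a quick sanity check (my own addition), na.omit() records which rows it dropped in the "na.action" attribute, so we can confirm how many observations were removed:

dim(Auto)                          # rows and columns remaining after na.omit()
length(attr(Auto, "na.action"))    # number of rows dropped because of missing values (5 here)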

(a). Which of the predictors are quantitative, and which are qualitative?

# Method 1:
str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
# Method 2:
# library(dplyr)
# glimpse(Auto)

# show the variable definitions
?ISLR::Auto

Quantitative predictors: mpg, cylinders, displacement, horsepower, weight, acceleration, year.
Qualitative predictors: name, origin (1 = American, 2 = European, 3 = Japanese).

(b). What is the range of each quantitative predictor? You can answer this using the range() function.

rg = sapply(Auto[,1:7], range)
rownames(rg) = c("Min","Max")
rg
##      mpg cylinders displacement horsepower weight acceleration year
## Min  9.0         3           68         46   1613          8.0   70
## Max 46.6         8          455        230   5140         24.8   82

(c). What is the mean and standard deviation of each quantitative predictor?

sapply(Auto[,1:7],mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    23.445918     5.471939   194.411990   104.469388  2977.584184    15.541327 
##         year 
##    75.979592
sapply(Auto[,1:7], sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.805007     1.705783   104.644004    38.491160   849.402560     2.758864 
##         year 
##     3.683737

(d). Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

sub_Auto = Auto[-c(10:85),]
# Range
rg = sapply(sub_Auto[,1:7], range)
rownames(rg) = c("Min","Max")
rg
##      mpg cylinders displacement horsepower weight acceleration year
## Min 11.0         3           68         46   1649          8.5   70
## Max 46.6         8          455        230   4997         24.8   82
# Mean
sapply(sub_Auto[,1:7], mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    24.404430     5.373418   187.240506   100.721519  2935.971519    15.726899 
##         year 
##    77.145570
# Standard Deviation
sapply(sub_Auto[,1:7], sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.867283     1.654179    99.678367    35.708853   811.300208     2.693721 
##         year 
##     3.106217

(e). Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

# Step1
attach(Auto)
Auto$name = as.factor(Auto$name)
Auto$origin = as.factor(Auto$origin)
pairs(Auto[,1:9])

plot(as.factor(origin),weight,
     xlab = "origin", 
     ylab = "weight",
     varwidth = T,
     names = c("American","European","Japanese"),
     main = "weight vs origin" )  # an example

# Step2
library(ggplot2)

ggplot(Auto, aes(x = weight, y = acceleration, col = origin)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  facet_wrap(~ origin) + 
  theme(legend.position = "none") + 
  scale_x_continuous(labels = scales::comma_format()) + 
  labs(x = "Weight", 
       y = "Acceleration", 
       title = "Correlation between weight and acceleration, by origin")

Findings: (1) From step 1, all variables (except name) appear to differ across the three categories of origin, and many of the variable pairs show a clear positive or negative relationship. (2) From step 2, as vehicle weight increases, the time to accelerate from 0 to 60 mph decreases for American and Japanese cars, but the opposite holds for European cars (a by-origin correlation check is shown below).
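
To back up finding (2) with a number (my own addition; it assumes the Auto data frame and the factor version of origin created above), the weight/acceleration correlation can be computed within each origin group:

# correlation of weight and acceleration within each origin group (1/2/3 = American/European/Japanese)
library(dplyr)
Auto %>%
  group_by(origin) %>%
  summarise(cor_weight_accel = cor(weight, acceleration))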

(f). Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

# Step1: mpg vs origin
attach(Auto)
plot(as.factor(origin),mpg,
     xlab = "origin", 
     ylab = "mpg",
     varwidth = T,
     col = c('#5f9cfc', '#fa766d','#00ba44'),
     names = c("American","European","Japanese"),
     main = "Miles Per Gallon VS Origin" )  # an example

# Step2: Boosting Decision Tree
library(gbm)
set.seed(1)
attach(Auto)
boost.auto = gbm(mpg~cylinders+displacement+horsepower+weight+acceleration+year, data = Auto, distribution = "gaussian", n.trees = 5000, interaction.depth = 3)
summary(boost.auto)

##                       var   rel.inf
## weight             weight 29.272808
## displacement displacement 22.754345
## year                 year 17.336012
## horsepower     horsepower 16.095825
## acceleration acceleration  9.715508
## cylinders       cylinders  4.825502
par(mfrow=c(2,2))
plot(boost.auto,i="weight")

plot(boost.auto,i="displacement")

plot(boost.auto,i="year")

plot(boost.auto,i="horsepower")

  1. From step 1, miles per gallon clearly differs across the three origins, so “origin” is useful in predicting “mpg”.
  2. From step 2, I use a boosted decision tree and see that “weight”, “displacement”, “year” and “horsepower” are by far the most important variables, so these four are useful in predicting “mpg”. I then produce partial dependence plots for these four variables. In this case, miles per gallon increases with “year” and decreases with “horsepower” and “weight”. When engine displacement is less than about 250 cubic inches, miles per gallon increases with displacement, but mpg is almost flat once displacement exceeds about 300 cubic inches. (A simple correlation check follows.)
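
As a simple cross-check (my own addition, not part of the boosting output), the plain correlations between mpg and the quantitative predictors point at the same variables:

# correlation of mpg with the other quantitative variables (columns 1-7 of Auto are numeric)
round(cor(Auto[, 1:7])[1, -1], 2)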

Exercise 13 in section 3.7

In this exercise you will create some simulated data and will fit simple linear regression models to it. Make sure to use set.seed(1) prior to starting part (a) to ensure consistent results.

(a). Using the rnorm() function, create a vector, x, containing 100 observations drawn from a N(0,1) distribution. This represents a feature, X.

set.seed(1)
x = rnorm(100, mean = 0, sd = 1)

(b). Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a N(0, 0.25) distribution i.e. a normal distribution with mean zero and variance 0.25.

eps = rnorm(100, mean = 0, sd = 0.5) # var = sd^2

(c). Using x and eps, generate a vector y according to the model \[ Y= -1+0.5X+\epsilon\] What is the length of the vector y? What are the values of \(\beta_{0}\) and \(\beta_{1}\) in this linear model?

y = -1 + 0.5*x + eps
length(y)
## [1] 100

The length of vector y is: 100
\(\beta_{0}\) = -1 and \(\beta_{1}\) = 0.5.

(d). Create a scatterplot displaying the relationship between x and y. Comment on what you observe.

plot(x,y)

Findings: (1) The scatterplot shows an approximately linear, positive relationship between x and y. (2) The intercept is roughly -1 and the slope roughly 0.5, matching the function we used to generate the data.

(e). Fit a least squares linear model to predict y using x. Comment on the model obtained. How do \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\) compare to \(\beta_{0}\) and \(\beta_{1}\)?

lm.fit = lm(y~x)
lm.fit
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##     -1.0188       0.4995

\(\hat{\beta}_{0}\) = -1.0188 and \(\hat{\beta}_{1}\) = 0.4995. These estimates are almost the same as the true values \(\beta_{0}\) = -1 and \(\beta_{1}\) = 0.5.

(f).Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend() command to create an appropriate legend.

plot(x,y, pch = 20, main = "x vs y")
abline(a = -1, b = 0.5, col = 'red', lwd = 1)
abline(lm.fit, lwd = 1, col = 'blue',lty = 2 )
legend("bottom",
       legend=c("least squares","population"),
       col= c('blue','red'),
       lwd = 2,
       lty = c(2,1))

(g). Now fit a polynomial regression model that predicts y using x and x^2. Is there evidence that the quadratic term improves the model fit? Explain your answer.

lm.fit2 = lm(y ~ x + I(x^2))
summary(lm.fit2)
## 
## Call:
## lm(formula = y ~ x + I(x^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.98252 -0.31270 -0.06441  0.29014  1.13500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.97164    0.05883 -16.517  < 2e-16 ***
## x            0.50858    0.05399   9.420  2.4e-15 ***
## I(x^2)      -0.05946    0.04238  -1.403    0.164    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.479 on 97 degrees of freedom
## Multiple R-squared:  0.4779, Adjusted R-squared:  0.4672 
## F-statistic:  44.4 on 2 and 97 DF,  p-value: 2.038e-14

The p-value associated with the quadratic term is 0.164, so we cannot reject the null hypothesis that the coefficient of x^2 is zero. The conclusion is that there is no statistical evidence that the quadratic term improves the model fit.
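
The same conclusion can be reached with a nested-model F-test (my own addition, comparing lm.fit and lm.fit2 from above); with a single added term, its p-value matches the t-test on the quadratic coefficient:

# does adding I(x^2) significantly reduce the residual sum of squares?
anova(lm.fit, lm.fit2)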

(h). Repeat (a)-(f) after modifying the data generation process in such a way that there is less noise in the data. The model in part(c) should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term \(\epsilon\) in (b). Describe your results.

# less noise(sd of epsilon from 0.5 to 0.05)
set.seed(1)
x1 = rnorm(100, mean = 0, sd = 1)

eps1 = rnorm(100, sd = 0.05) 
y1 = -1 + 0.5*x1 + eps1
plot(x1,y1)

lm.fit3 = lm(y1~x1)
lm.fit3
## 
## Call:
## lm(formula = y1 ~ x1)
## 
## Coefficients:
## (Intercept)           x1  
##     -1.0019       0.4999
plot(x1,y1, pch = 20, main = "x1 vs y1")
abline(a = -1, b = 0.5, col = 'red', lwd = 3)  # population
abline(lm.fit3, lwd = 3, col = '#9cd9b9',lty = 2 )  # least squares 
legend("bottom",
       legend=c("least squares","population"),
       col= c('#9cd9b9','red'),
       lwd = 2,
       lty = c(2,1))

# Comparision
summary(lm.fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.93842 -0.30688 -0.06975  0.26970  1.17309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.01885    0.04849 -21.010  < 2e-16 ***
## x            0.49947    0.05386   9.273 4.58e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4814 on 98 degrees of freedom
## Multiple R-squared:  0.4674, Adjusted R-squared:  0.4619 
## F-statistic: 85.99 on 1 and 98 DF,  p-value: 4.583e-15
summary(lm.fit3)
## 
## Call:
## lm(formula = y1 ~ x1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.093842 -0.030688 -0.006975  0.026970  0.117309 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.001885   0.004849 -206.60   <2e-16 ***
## x1           0.499947   0.005386   92.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04814 on 98 degrees of freedom
## Multiple R-squared:  0.9888, Adjusted R-squared:  0.9886 
## F-statistic:  8615 on 1 and 98 DF,  p-value: < 2.2e-16

As I’d expect, the residual standard error decreases when the sd of eps is reduced from 0.5 to 0.05. After reducing the noise, the points are much more tightly concentrated around the line, the regression line explains more of the variation in the data, and the multiple R-squared increases (from 0.467 to 0.989).

(i). Repeat (a)-(f) after modifying the data generation process in such a way that there is more noise in the data. The model in part (c) should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term \(\epsilon\) in (b). Describe your results.

# more noise(sd of epsilon from 0.5 to 1)
set.seed(1)
x2 = rnorm(100, mean = 0, sd = 1)
eps2 = rnorm(100, sd = 1) 
y2 = -1 + 0.5*x2 + eps2

plot(x2,y2)

lm.fit4 = lm(y2~x2)
lm.fit4
## 
## Call:
## lm(formula = y2 ~ x2)
## 
## Coefficients:
## (Intercept)           x2  
##     -1.0377       0.4989
plot(x2,y2, pch = 20, main = "x2 vs y2")
abline(a = -1, b = 0.5, col = 'green', lwd = 3)  # population
abline(lm.fit4, lwd = 3, col = 'red',lty = 2 )  # least squares 
legend("bottom",
       legend=c("least squares","population"),
       col= c('red','green'),
       lwd = 2,
       lty = c(2,1))

# Comparision
summary(lm.fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.93842 -0.30688 -0.06975  0.26970  1.17309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.01885    0.04849 -21.010  < 2e-16 ***
## x            0.49947    0.05386   9.273 4.58e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4814 on 98 degrees of freedom
## Multiple R-squared:  0.4674, Adjusted R-squared:  0.4619 
## F-statistic: 85.99 on 1 and 98 DF,  p-value: 4.583e-15
summary(lm.fit4)
## 
## Call:
## lm(formula = y2 ~ x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8768 -0.6138 -0.1395  0.5394  2.3462 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.03769    0.09699 -10.699  < 2e-16 ***
## x2           0.49894    0.10773   4.632 1.12e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9628 on 98 degrees of freedom
## Multiple R-squared:  0.1796, Adjusted R-squared:  0.1712 
## F-statistic: 21.45 on 1 and 98 DF,  p-value: 1.117e-05

As I’d expect, the residual standard error increases when the sd of eps is raised from 0.5 to 1. After increasing the noise, the points are more spread out around the line, the regression line explains less of the variation in the data, and the multiple R-squared decreases (from 0.467 to 0.180).

(j). What are the confidence intervals for \(\beta_{0}\) and \(\beta_{1}\) based on the original data set, the noisier data set, and the less noisy data set? Comment on your results.

# original data set(sd = 0.5)
confint(lm.fit)
##                  2.5 %     97.5 %
## (Intercept) -1.1150804 -0.9226122
## x            0.3925794  0.6063602
# the noisier data set(sd = 1)
confint(lm.fit4)
##                  2.5 %     97.5 %
## (Intercept) -1.2301607 -0.8452245
## x2           0.2851588  0.7127204
# the less noisy data set(sd = 0.05)
confint(lm.fit3)
##                  2.5 %     97.5 %
## (Intercept) -1.0115080 -0.9922612
## x1           0.4892579  0.5106360
  1. None of the confidence intervals for \(\beta_{1}\) contain zero.
  2. As the relationship becomes noisier (the error variance increases), the 95 % confidence interval for each parameter becomes wider. In contrast, a stronger relationship with less noise means that the confidence interval for each parameter becomes narrower (the interval widths for the slope are computed below).
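
To make the comparison concrete (my own addition, using the three fitted models from above), the width of the 95 % confidence interval for the slope can be computed directly:

# width of the 95% CI for the slope under each noise level (wider = more uncertainty)
diff(confint(lm.fit)[2, ])    # original data,   sd(eps) = 0.5
diff(confint(lm.fit4)[2, ])   # noisier data,    sd(eps) = 1
diff(confint(lm.fit3)[2, ])   # less noisy data, sd(eps) = 0.05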

Exercise 15 in section 3.7

This problem involves the Boston data set, which we saw in the lab for this chapter. We will now try to predict per capita crime rate using the other variables in this data set. In other words, per capita crime rate is the response, and the other variables are the predictors.

library(MASS)
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

(a). For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.

Convert the chas variable into a factor.

Boston$chas = as.factor(Boston$chas)

# Loop + linear regression: fit crim ~ x separately for each of the other 13 variables
target = Boston$crim
R2 = c()
p_value = c()
intercept = c()
beta1 = c()
varname = c()
for (i in 2:ncol(Boston)){   # column 1 is crim itself, so start at column 2
  data = Boston[ ,i]
  m = lm(target~data, data = Boston)
  R2[i] = summary(m)$r.squared
  p_value[i] = summary(m)$coefficients[2,4]   # p-value of the slope
  intercept[i] = summary(m)$coefficients[1,1]
  beta1[i] = summary(m)$coefficients[2,1]     # univariate slope
  varname[i] = names(Boston)[i]
}
# note: indexing from 2 leaves element 1 of each vector as NA, which is why
# an <NA> row appears at the bottom of the table below

library(magrittr) # for the %>% pipe (dplyr, loaded next, also provides it)
library(dplyr)
crime_preds = data.frame(varname = varname,
                         R2 = round(R2,5),
                         p_value = round(p_value,10),
                         intercept = round(intercept,5),
                         slope = round(beta1,5)) %>%
  arrange(desc(R2))
crime_preds
##    varname      R2      p_value intercept    slope
## 1      rad 0.39126 0.0000000000  -2.28716  0.61791
## 2      tax 0.33961 0.0000000000  -8.52837  0.02974
## 3    lstat 0.20759 0.0000000000  -3.33054  0.54880
## 4      nox 0.17722 0.0000000000 -13.71988 31.24853
## 5    indus 0.16531 0.0000000000  -2.06374  0.50978
## 6     medv 0.15078 0.0000000000  11.79654 -0.36316
## 7    black 0.14827 0.0000000000  16.55353 -0.03628
## 8      dis 0.14415 0.0000000000   9.49926 -1.55090
## 9      age 0.12442 0.0000000000  -3.77791  0.10779
## 10 ptratio 0.08407 0.0000000000 -17.64693  1.15198
## 11      rm 0.04807 0.0000006347  20.48180 -2.68405
## 12      zn 0.04019 0.0000055065   4.45369 -0.07393
## 13    chas 0.00312 0.2094345015   3.74445 -1.89278
## 14    <NA>      NA           NA        NA       NA

Now fit a simple linear regression model for each predictor individually and summarize the results.

zn

lm.zn = lm(crim~zn,data = Boston)
summary(lm.zn)
## 
## Call:
## lm(formula = crim ~ zn, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.429 -4.222 -2.620  1.250 84.523 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.45369    0.41722  10.675  < 2e-16 ***
## zn          -0.07393    0.01609  -4.594 5.51e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.435 on 504 degrees of freedom
## Multiple R-squared:  0.04019,    Adjusted R-squared:  0.03828 
## F-statistic:  21.1 on 1 and 504 DF,  p-value: 5.506e-06

indus

lm.indus = lm(crim~indus,data = Boston)
summary(lm.indus)
## 
## Call:
## lm(formula = crim ~ indus, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.972  -2.698  -0.736   0.712  81.813 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.06374    0.66723  -3.093  0.00209 ** 
## indus        0.50978    0.05102   9.991  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.866 on 504 degrees of freedom
## Multiple R-squared:  0.1653, Adjusted R-squared:  0.1637 
## F-statistic: 99.82 on 1 and 504 DF,  p-value: < 2.2e-16

chas

lm.chas = lm(crim~chas,data = Boston)
summary(lm.chas)
## 
## Call:
## lm(formula = crim ~ chas, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.738 -3.661 -3.435  0.018 85.232 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.7444     0.3961   9.453   <2e-16 ***
## chas1        -1.8928     1.5061  -1.257    0.209    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.597 on 504 degrees of freedom
## Multiple R-squared:  0.003124,   Adjusted R-squared:  0.001146 
## F-statistic: 1.579 on 1 and 504 DF,  p-value: 0.2094

nox

lm.nox = lm(crim~nox,data = Boston)
summary(lm.nox)
## 
## Call:
## lm(formula = crim ~ nox, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.371  -2.738  -0.974   0.559  81.728 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -13.720      1.699  -8.073 5.08e-15 ***
## nox           31.249      2.999  10.419  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.81 on 504 degrees of freedom
## Multiple R-squared:  0.1772, Adjusted R-squared:  0.1756 
## F-statistic: 108.6 on 1 and 504 DF,  p-value: < 2.2e-16

rm

lm.rm = lm(crim~rm,data = Boston)
summary(lm.rm)
## 
## Call:
## lm(formula = crim ~ rm, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.604 -3.952 -2.654  0.989 87.197 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   20.482      3.365   6.088 2.27e-09 ***
## rm            -2.684      0.532  -5.045 6.35e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.401 on 504 degrees of freedom
## Multiple R-squared:  0.04807,    Adjusted R-squared:  0.04618 
## F-statistic: 25.45 on 1 and 504 DF,  p-value: 6.347e-07

age

lm.age = lm(crim~age,data = Boston)
summary(lm.age)
## 
## Call:
## lm(formula = crim ~ age, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.789 -4.257 -1.230  1.527 82.849 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.77791    0.94398  -4.002 7.22e-05 ***
## age          0.10779    0.01274   8.463 2.85e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.057 on 504 degrees of freedom
## Multiple R-squared:  0.1244, Adjusted R-squared:  0.1227 
## F-statistic: 71.62 on 1 and 504 DF,  p-value: 2.855e-16

dis

lm.dis = lm(crim~dis,data = Boston)
summary(lm.dis)
## 
## Call:
## lm(formula = crim ~ dis, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.708 -4.134 -1.527  1.516 81.674 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.4993     0.7304  13.006   <2e-16 ***
## dis          -1.5509     0.1683  -9.213   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.965 on 504 degrees of freedom
## Multiple R-squared:  0.1441, Adjusted R-squared:  0.1425 
## F-statistic: 84.89 on 1 and 504 DF,  p-value: < 2.2e-16

rad

lm.rad = lm(crim~rad,data = Boston)
summary(lm.rad)
## 
## Call:
## lm(formula = crim ~ rad, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.164  -1.381  -0.141   0.660  76.433 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.28716    0.44348  -5.157 3.61e-07 ***
## rad          0.61791    0.03433  17.998  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.718 on 504 degrees of freedom
## Multiple R-squared:  0.3913, Adjusted R-squared:   0.39 
## F-statistic: 323.9 on 1 and 504 DF,  p-value: < 2.2e-16

tax

lm.tax = lm(crim~tax,data = Boston)
summary(lm.tax)
## 
## Call:
## lm(formula = crim ~ tax, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.513  -2.738  -0.194   1.065  77.696 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.528369   0.815809  -10.45   <2e-16 ***
## tax          0.029742   0.001847   16.10   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.997 on 504 degrees of freedom
## Multiple R-squared:  0.3396, Adjusted R-squared:  0.3383 
## F-statistic: 259.2 on 1 and 504 DF,  p-value: < 2.2e-16

ptratio

lm.ptratio = lm(crim~ptratio,data = Boston)
summary(lm.ptratio)
## 
## Call:
## lm(formula = crim ~ ptratio, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.654 -3.985 -1.912  1.825 83.353 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.6469     3.1473  -5.607 3.40e-08 ***
## ptratio       1.1520     0.1694   6.801 2.94e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.24 on 504 degrees of freedom
## Multiple R-squared:  0.08407,    Adjusted R-squared:  0.08225 
## F-statistic: 46.26 on 1 and 504 DF,  p-value: 2.943e-11

black

lm.black = lm(crim~black,data = Boston)
summary(lm.black)
## 
## Call:
## lm(formula = crim ~ black, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.756  -2.299  -2.095  -1.296  86.822 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 16.553529   1.425903  11.609   <2e-16 ***
## black       -0.036280   0.003873  -9.367   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.946 on 504 degrees of freedom
## Multiple R-squared:  0.1483, Adjusted R-squared:  0.1466 
## F-statistic: 87.74 on 1 and 504 DF,  p-value: < 2.2e-16

lstat

lm.lstat = lm(crim~lstat,data = Boston)
summary(lm.lstat)
## 
## Call:
## lm(formula = crim ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.925  -2.822  -0.664   1.079  82.862 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.33054    0.69376  -4.801 2.09e-06 ***
## lstat        0.54880    0.04776  11.491  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.664 on 504 degrees of freedom
## Multiple R-squared:  0.2076, Adjusted R-squared:  0.206 
## F-statistic:   132 on 1 and 504 DF,  p-value: < 2.2e-16

medv

lm.medv = lm(crim~medv,data = Boston)
summary(lm.medv)
## 
## Call:
## lm(formula = crim ~ medv, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.071 -4.022 -2.343  1.298 80.957 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.79654    0.93419   12.63   <2e-16 ***
## medv        -0.36316    0.03839   -9.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.934 on 504 degrees of freedom
## Multiple R-squared:  0.1508, Adjusted R-squared:  0.1491 
## F-statistic: 89.49 on 1 and 504 DF,  p-value: < 2.2e-16

(b). Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis \(H_0\) : \(\beta_{j}\) = 0?

lm = lm(crim ~., data = Boston)
summary(lm)
## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.924 -2.120 -0.353  1.019 75.051 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.033228   7.234903   2.354 0.018949 *  
## zn            0.044855   0.018734   2.394 0.017025 *  
## indus        -0.063855   0.083407  -0.766 0.444294    
## chas1        -0.749134   1.180147  -0.635 0.525867    
## nox         -10.313535   5.275536  -1.955 0.051152 .  
## rm            0.430131   0.612830   0.702 0.483089    
## age           0.001452   0.017925   0.081 0.935488    
## dis          -0.987176   0.281817  -3.503 0.000502 ***
## rad           0.588209   0.088049   6.680 6.46e-11 ***
## tax          -0.003780   0.005156  -0.733 0.463793    
## ptratio      -0.271081   0.186450  -1.454 0.146611    
## black        -0.007538   0.003673  -2.052 0.040702 *  
## lstat         0.126211   0.075725   1.667 0.096208 .  
## medv         -0.198887   0.060516  -3.287 0.001087 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared:  0.454,  Adjusted R-squared:  0.4396 
## F-statistic: 31.47 on 13 and 492 DF,  p-value: < 2.2e-16

We can reject the null hypothesis \(H_0\) : \(\beta_{j}\) = 0 (at the 5 % level) for the predictors zn, dis, rad, black and medv. For all other predictors, there is insufficient evidence to reject the null hypothesis that their coefficient is zero.
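
The same list can be pulled programmatically (my own addition; it simply reads the p-value column of the coefficient table of the fitted model above):

# predictors whose coefficient has p < 0.05 in the multiple regression
coefs = summary(lm)$coefficients
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05 & rownames(coefs) != "(Intercept)"]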

(c). How do your results from (a) compare to your results from (b)? Create a plot displaying the univariate regression coefficients from (a) on the x-axis, and the multiple regression coefficients from (b) on the y-axis. That is, each predictor is displayed as a single point in the plot. Its coefficient in a simple linear regression model is shown on the x-axis, and its coefficient estimate in the multiple linear regression model is shown on the y-axis.

Get the univariate regression coefficients (the slopes) from (a). Note that crime_preds is sorted by R2 and its fourth column holds the intercepts, so instead take the slopes from the beta1 vector built in the loop above; dropping its unused first element leaves them in the order zn, indus, chas, ..., medv, which matches the multiple regression coefficients below.

x_a = beta1[-1]   # univariate slopes, in the same order as names(Boston)[2:14]

Get the regression coefficients from (b)

y_b = summary(lm)$coefficients[-1,1]
y_b
##            zn         indus         chas1           nox            rm 
##   0.044855215  -0.063854824  -0.749133611 -10.313534912   0.430130506 
##           age           dis           rad           tax       ptratio 
##   0.001451643  -0.987175726   0.588208591  -0.003780016  -0.271080558 
##         black         lstat          medv 
##  -0.007537505   0.126211376  -0.198886821

Create a plot displaying the univariate regression coefficients from (a) on the x-axis and the multiple regression coefficients from (b) on the y-axis

plot(x_a,y_b,
     main = "Simple VS Multiple Rgression Coefficients",
     sub = "predicting crim",
     xlab = "Simple Coefficient",
     ylab = "Multiple Coefficient",
     col = factor(y_b),
     pch = 16)

(d). Is there evidence of non-linear association between any of the predictors and the response? To answer this question, for each predictor X, fit a model of the form \[ Y= \beta_{0}+\beta_{1}X+\beta_{2}X^{2}+\beta_{3}X^{3}+\epsilon\]

Fit a polynomial regression model with quadratic and cubic terms for each predictor and summarize the results.

zn

lm.zn1 = lm(crim ~zn+I(zn^2)+I(zn^3), data = Boston)
summary(lm.zn1)
## 
## Call:
## lm(formula = crim ~ zn + I(zn^2) + I(zn^3), data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.821 -4.614 -1.294  0.473 84.130 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.846e+00  4.330e-01  11.192  < 2e-16 ***
## zn          -3.322e-01  1.098e-01  -3.025  0.00261 ** 
## I(zn^2)      6.483e-03  3.861e-03   1.679  0.09375 .  
## I(zn^3)     -3.776e-05  3.139e-05  -1.203  0.22954    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.372 on 502 degrees of freedom
## Multiple R-squared:  0.05824,    Adjusted R-squared:  0.05261 
## F-statistic: 10.35 on 3 and 502 DF,  p-value: 1.281e-06

indus

lm.indus1 = lm(crim ~indus+I(indus^2)+I(indus^3), data = Boston)
summary(lm.indus1)
## 
## Call:
## lm(formula = crim ~ indus + I(indus^2) + I(indus^3), data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.278 -2.514  0.054  0.764 79.713 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.6625683  1.5739833   2.327   0.0204 *  
## indus       -1.9652129  0.4819901  -4.077 5.30e-05 ***
## I(indus^2)   0.2519373  0.0393221   6.407 3.42e-10 ***
## I(indus^3)  -0.0069760  0.0009567  -7.292 1.20e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.423 on 502 degrees of freedom
## Multiple R-squared:  0.2597, Adjusted R-squared:  0.2552 
## F-statistic: 58.69 on 3 and 502 DF,  p-value: < 2.2e-16

chas

# skipped: chas was converted to a factor, so it cannot be squared or cubed (and, being binary, it cannot show a non-linear association anyway)

nox

lm.nox1 = lm(crim ~ nox + I(nox^2) + I(nox^3), data = Boston)
summary(lm.nox1)
## 
## Call:
## lm(formula = crim ~ nox + I(nox^2) + I(nox^3), data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.110 -2.068 -0.255  0.739 78.302 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   233.09      33.64   6.928 1.31e-11 ***
## nox         -1279.37     170.40  -7.508 2.76e-13 ***
## I(nox^2)     2248.54     279.90   8.033 6.81e-15 ***
## I(nox^3)    -1245.70     149.28  -8.345 6.96e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.234 on 502 degrees of freedom
## Multiple R-squared:  0.297,  Adjusted R-squared:  0.2928 
## F-statistic: 70.69 on 3 and 502 DF,  p-value: < 2.2e-16

rm

lm.rm1 = lm(crim ~ rm + I(rm^2) + I(rm^3), data = Boston)
summary(lm.rm1)
## 
## Call:
## lm(formula = crim ~ rm + I(rm^2) + I(rm^3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.485  -3.468  -2.221  -0.015  87.219 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 112.6246    64.5172   1.746   0.0815 .
## rm          -39.1501    31.3115  -1.250   0.2118  
## I(rm^2)       4.5509     5.0099   0.908   0.3641  
## I(rm^3)      -0.1745     0.2637  -0.662   0.5086  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.33 on 502 degrees of freedom
## Multiple R-squared:  0.06779,    Adjusted R-squared:  0.06222 
## F-statistic: 12.17 on 3 and 502 DF,  p-value: 1.067e-07

age

lm.age1 = lm(crim ~ age + I(age^2) + I(age^3), data = Boston)
summary(lm.age1)
## 
## Call:
## lm(formula = crim ~ age + I(age^2) + I(age^3), data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.762 -2.673 -0.516  0.019 82.842 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -2.549e+00  2.769e+00  -0.920  0.35780   
## age          2.737e-01  1.864e-01   1.468  0.14266   
## I(age^2)    -7.230e-03  3.637e-03  -1.988  0.04738 * 
## I(age^3)     5.745e-05  2.109e-05   2.724  0.00668 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.84 on 502 degrees of freedom
## Multiple R-squared:  0.1742, Adjusted R-squared:  0.1693 
## F-statistic: 35.31 on 3 and 502 DF,  p-value: < 2.2e-16

dis

lm.dis1 = lm(crim ~ dis + I(dis^2) + I(dis^3), data = Boston)
summary(lm.dis1)
## 
## Call:
## lm(formula = crim ~ dis + I(dis^2) + I(dis^3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.757  -2.588   0.031   1.267  76.378 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  30.0476     2.4459  12.285  < 2e-16 ***
## dis         -15.5543     1.7360  -8.960  < 2e-16 ***
## I(dis^2)      2.4521     0.3464   7.078 4.94e-12 ***
## I(dis^3)     -0.1186     0.0204  -5.814 1.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.331 on 502 degrees of freedom
## Multiple R-squared:  0.2778, Adjusted R-squared:  0.2735 
## F-statistic: 64.37 on 3 and 502 DF,  p-value: < 2.2e-16

rad

lm.rad1 = lm(crim ~ rad + I(rad^2) + I(rad^3), data = Boston)
summary(lm.rad1)
## 
## Call:
## lm(formula = crim ~ rad + I(rad^2) + I(rad^3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.381  -0.412  -0.269   0.179  76.217 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.605545   2.050108  -0.295    0.768
## rad          0.512736   1.043597   0.491    0.623
## I(rad^2)    -0.075177   0.148543  -0.506    0.613
## I(rad^3)     0.003209   0.004564   0.703    0.482
## 
## Residual standard error: 6.682 on 502 degrees of freedom
## Multiple R-squared:    0.4,  Adjusted R-squared:  0.3965 
## F-statistic: 111.6 on 3 and 502 DF,  p-value: < 2.2e-16

tax

lm.tax1 = lm(crim ~ tax + I(tax^2) + I(tax^3), data = Boston)
summary(lm.tax1)
## 
## Call:
## lm(formula = crim ~ tax + I(tax^2) + I(tax^3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.273  -1.389   0.046   0.536  76.950 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.918e+01  1.180e+01   1.626    0.105
## tax         -1.533e-01  9.568e-02  -1.602    0.110
## I(tax^2)     3.608e-04  2.425e-04   1.488    0.137
## I(tax^3)    -2.204e-07  1.889e-07  -1.167    0.244
## 
## Residual standard error: 6.854 on 502 degrees of freedom
## Multiple R-squared:  0.3689, Adjusted R-squared:  0.3651 
## F-statistic:  97.8 on 3 and 502 DF,  p-value: < 2.2e-16

ptratio

lm.ptratio1 = lm(crim ~ ptratio + I(ptratio^2) + I(ptratio^3), data = Boston)
summary(lm.ptratio1)
## 
## Call:
## lm(formula = crim ~ ptratio + I(ptratio^2) + I(ptratio^3), data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.833 -4.146 -1.655  1.408 82.697 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  477.18405  156.79498   3.043  0.00246 **
## ptratio      -82.36054   27.64394  -2.979  0.00303 **
## I(ptratio^2)   4.63535    1.60832   2.882  0.00412 **
## I(ptratio^3)  -0.08476    0.03090  -2.743  0.00630 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.122 on 502 degrees of freedom
## Multiple R-squared:  0.1138, Adjusted R-squared:  0.1085 
## F-statistic: 21.48 on 3 and 502 DF,  p-value: 4.171e-13

black

lm.black1 = lm(crim ~ black + I(black^2) + I(black^3), data = Boston)
summary(lm.black1)
## 
## Call:
## lm(formula = crim ~ black + I(black^2) + I(black^3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.096  -2.343  -2.128  -1.439  86.790 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.826e+01  2.305e+00   7.924  1.5e-14 ***
## black       -8.356e-02  5.633e-02  -1.483    0.139    
## I(black^2)   2.137e-04  2.984e-04   0.716    0.474    
## I(black^3)  -2.652e-07  4.364e-07  -0.608    0.544    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.955 on 502 degrees of freedom
## Multiple R-squared:  0.1498, Adjusted R-squared:  0.1448 
## F-statistic: 29.49 on 3 and 502 DF,  p-value: < 2.2e-16

lstat

lm.lstat1 = lm(crim ~ lstat + I(lstat^2) + I(lstat^3), data = Boston)
summary(lm.lstat1)
## 
## Call:
## lm(formula = crim ~ lstat + I(lstat^2) + I(lstat^3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.234  -2.151  -0.486   0.066  83.353 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  1.2009656  2.0286452   0.592   0.5541  
## lstat       -0.4490656  0.4648911  -0.966   0.3345  
## I(lstat^2)   0.0557794  0.0301156   1.852   0.0646 .
## I(lstat^3)  -0.0008574  0.0005652  -1.517   0.1299  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.629 on 502 degrees of freedom
## Multiple R-squared:  0.2179, Adjusted R-squared:  0.2133 
## F-statistic: 46.63 on 3 and 502 DF,  p-value: < 2.2e-16

medv

lm.medv1 = lm(crim ~ medv + I(medv^2) + I(medv^3), data = Boston)
summary(lm.medv1)
## 
## Call:
## lm(formula = crim ~ medv + I(medv^2) + I(medv^3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.427  -1.976  -0.437   0.439  73.655 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 53.1655381  3.3563105  15.840  < 2e-16 ***
## medv        -5.0948305  0.4338321 -11.744  < 2e-16 ***
## I(medv^2)    0.1554965  0.0171904   9.046  < 2e-16 ***
## I(medv^3)   -0.0014901  0.0002038  -7.312 1.05e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.569 on 502 degrees of freedom
## Multiple R-squared:  0.4202, Adjusted R-squared:  0.4167 
## F-statistic: 121.3 on 3 and 502 DF,  p-value: < 2.2e-16
  1. chas is a binary predictor, so a cubic relationship cannot arise in this case.
  2. Judging from the p-values of each model, the conclusion is: there is evidence of a cubic relationship for indus, nox, age, dis, ptratio and medv; zn shows only a linear association; and for rm, rad, tax, black and lstat none of the squared or cubed terms are significant at the 5 % level. (A compact p-value table collecting these results follows.)
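
As a compact cross-check (my own addition, not part of the exercise code), the same cubic fits can be run in one loop and only the quadratic- and cubic-term p-values collected; poly(x, 3, raw = TRUE) reproduces the x + I(x^2) + I(x^3) fits above, and chas is skipped because it is a factor:

# p-values of the squared and cubed terms for every numeric predictor of crim
num_preds = names(Boston)[sapply(Boston, is.numeric) & names(Boston) != "crim"]
nonlinear_p = t(sapply(num_preds, function(v) {
  fit = lm(crim ~ poly(Boston[[v]], 3, raw = TRUE), data = Boston)
  summary(fit)$coefficients[3:4, "Pr(>|t|)"]   # rows 3 and 4 are the quadratic and cubic terms
}))
colnames(nonlinear_p) = c("p_quadratic", "p_cubic")
round(nonlinear_p, 4)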