This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US. The variables are
Private : Public/private indicator
Apps : Number of applications received
Accept : Number of applicants accepted
Enroll : Number of new students enrolled
Top10perc : New students from top 10 % of high school class
Top25perc : New students from top 25 % of high school class
F.Undergrad : Number of full-time undergraduates
P.Undergrad : Number of part-time undergraduates
Outstate : Out-of-state tuition
Room.Board : Room and board costs
Books : Estimated book costs
Personal : Estimated personal spending
PhD : Percent of faculty with Ph.D.’s
Terminal : Percent of faculty with terminal degree
S.F.Ratio : Student/faculty ratio
perc.alumni : Percent of alumni who donate
Expend : Instructional expenditure per student
Grad.Rate : Graduation rate
Before reading the data into R, it can be viewed in Excel or a text editor.
(a). Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data
urlpage = "https://www.statlearning.com/s/College.csv"
college = read.csv(url(urlpage))
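A quick sanity check (assuming the download above succeeded) that the file loaded as expected:
# 777 rows (one per college) and 19 columns: the name column plus the 18 variables
dim(college)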
(b). Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:
rownames(college) <- college[, 1] # use university name as index
# on macOS, fix() requires XQuartz to be installed (and the computer restarted)
fix(college)
You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try
college = college[,-1] # keep all rows but delete the 1st column
fix(college)
Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.
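A small check that the university names are now stored as row names rather than as a data column:
head(rownames(college)) # first few university names
ncol(college) # 18 columns remain after the name column is removed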
(c).i. Use the summary() function to produce a numerical summary of the variables in the data set
college$Private = as.factor(college$Private)
summary(college)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
(c).ii. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
# pairs() function is used to draw matrix scatter plot
pairs(college[,1:10])
(c).iii. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
attach(college)
plot(Private,Outstate,
col = "blue",
varwidth = T,
xlab = "Private",
ylab = "Outstate",
main = "Side-By-Side Boxplots of Private VS Outstate")
(c).iv. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10 % of their high school classes exceeds 50%.
Elite = rep("No", nrow(college))
Elite[college$Top10perc>50] = "Yes"
Elite = as.factor(Elite)
college = data.frame(college, Elite)
Use the summary() function to see how many elite universities there are.
summary(college$Elite)
## No Yes
## 699 78
Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.
attach(college)
plot(Elite, Outstate,
col = 'yellow',
varwidth = T,
xlab = 'Elite',
ylab = 'Outstate',
main = "Side-By-Side Boxplots of Elite VS Outstate")
(c).v. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow=c(2,2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.
Divide the print window into four regions
# par() puts multiple plots on one page; par(mfrow = c(2, 2)) arranges them in 2 rows and 2 columns
par(mfrow=c(2,2))
Build histograms
par(mfrow=c(2,2))
attach(college)
# Comparison chart1: New students from top 25 % of high school class
hist(Top25perc, col = "#BB8FCE" , breaks =30,
ylab = "Number",
main = "Top25perc Histogram with 30 bins")
hist(Top25perc, col = "#BB8FCE", breaks =60,
ylab = "Number",
main = "Top25perc Histogram with 60 bins")
# Comparison chart2: Student/faculty ratio
hist(S.F.Ratio, col = "#009E73",breaks = 10,
ylab = "Count",
main = "S.F.Ratio Histogram with 10 bins")
hist(S.F.Ratio, col = "#009E73", breaks =40,
ylab = "Count",
main = "S.F.Ratio Histogram with 40 bins")
(c).vi. Continue exploring the data, and provide a brief summary of what you discover.
# step1: build a multiple linear regression model
model = lm(Grad.Rate~.,data=college)
summary(model)
##
## Call:
## lm(formula = Grad.Rate ~ ., data = college)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.991 -7.100 -0.300 7.174 54.034
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.8925541 4.8522723 6.985 6.24e-12 ***
## PrivateYes 3.4262050 1.7151733 1.998 0.046119 *
## Apps 0.0012936 0.0004428 2.921 0.003588 **
## Accept -0.0006909 0.0008638 -0.800 0.424030
## Enroll 0.0021440 0.0023111 0.928 0.353840
## Top10perc 0.0465274 0.0851268 0.547 0.584838
## Top25perc 0.1374511 0.0564462 2.435 0.015118 *
## F.Undergrad -0.0004648 0.0004026 -1.155 0.248635
## P.Undergrad -0.0014809 0.0003907 -3.790 0.000162 ***
## Outstate 0.0010197 0.0002339 4.360 1.48e-05 ***
## Room.Board 0.0019067 0.0005926 3.217 0.001348 **
## Books -0.0022140 0.0029189 -0.758 0.448388
## Personal -0.0016620 0.0007703 -2.158 0.031270 *
## PhD 0.0882924 0.0571134 1.546 0.122543
## Terminal -0.0751566 0.0624063 -1.204 0.228845
## S.F.Ratio 0.0746163 0.1595478 0.468 0.640153
## perc.alumni 0.2796432 0.0492353 5.680 1.92e-08 ***
## Expend -0.0004596 0.0001552 -2.961 0.003158 **
## EliteYes 0.4618984 2.5235781 0.183 0.854821
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.75 on 758 degrees of freedom
## Multiple R-squared: 0.4616, Adjusted R-squared: 0.4488
## F-statistic: 36.1 on 18 and 758 DF, p-value: < 2.2e-16
library(car)
vif(model)
## Private Apps Accept Enroll Top10perc Top25perc
## 2.788191 14.012117 21.385793 21.999953 10.758529 5.962313
## F.Undergrad P.Undergrad Outstate Room.Board Books Personal
## 18.194541 1.688082 4.224016 2.015314 1.108072 1.297754
## PhD Terminal S.F.Ratio perc.alumni Expend Elite
## 4.149118 4.027321 1.902894 1.775935 3.132748 2.747435
model1 = lm(Grad.Rate~. -Apps-Accept-Enroll-Top10perc-F.Undergrad,data=college)
summary(model1)
##
## Call:
## lm(formula = Grad.Rate ~ . - Apps - Accept - Enroll - Top10perc -
## F.Undergrad, data = college)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.766 -7.884 -0.582 7.902 56.916
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.2135523 4.7938873 6.303 4.95e-10 ***
## PrivateYes 0.8074562 1.5941170 0.507 0.612636
## Top25perc 0.2030182 0.0348216 5.830 8.17e-09 ***
## P.Undergrad -0.0011525 0.0003625 -3.179 0.001537 **
## Outstate 0.0010874 0.0002307 4.713 2.90e-06 ***
## Room.Board 0.0022963 0.0005903 3.890 0.000109 ***
## Books -0.0016098 0.0029666 -0.543 0.587532
## Personal -0.0015472 0.0007761 -1.994 0.046548 *
## PhD 0.1011227 0.0572586 1.766 0.077784 .
## Terminal -0.0908948 0.0629162 -1.445 0.148954
## S.F.Ratio 0.1625722 0.1608193 1.011 0.312385
## perc.alumni 0.2604421 0.0493332 5.279 1.69e-07 ***
## Expend -0.0002720 0.0001496 -1.818 0.069409 .
## EliteYes 2.2919269 2.1006727 1.091 0.275598
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.98 on 763 degrees of freedom
## Multiple R-squared: 0.4384, Adjusted R-squared: 0.4289
## F-statistic: 45.82 on 13 and 763 DF, p-value: < 2.2e-16
vif(model1)
## Private Top25perc P.Undergrad Outstate Room.Board Books
## 2.324528 2.189928 1.402653 3.967357 1.929870 1.104673
## Personal PhD Terminal S.F.Ratio perc.alumni Expend
## 1.271330 4.024846 3.950690 1.865938 1.720837 2.808543
## Elite
## 1.837379
Summary: I want to know which factors affect the graduation rate in colleges, so I build a multiple linear regression model. I then use the vif() function to check for multicollinearity and drop some problematic variables; the R-squared drops from 0.462 to 0.438.
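As a sketch of the rule of thumb applied here, predictors whose VIF exceeds 10 can be flagged directly (assuming the car package is loaded as above):
v = vif(model)
names(v)[v > 10] # predictors with VIF above the common threshold of 10
This returns exactly the five variables (Apps, Accept, Enroll, Top10perc, F.Undergrad) that are dropped in model1.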
This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data
Read the data “Auto” from your computer
# Method 1:
Auto = read.csv("/Users/zhangling/Desktop/7035 AI/Assignment/ass1/Auto.csv",header = T, na.strings = "?")
fix(Auto)
# Method 2:
Auto1 = read.table("/Users/zhangling/Desktop/7035 AI/Assignment/ass1/Auto.data", header = T, na.strings = "?")
fix(Auto1)
Use the na.omit() function to remove observations with missing values.
Auto = na.omit(Auto)
(a). Which of the predictors are quantitative, and which are qualitative?
# Method 1:
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
# Method 2:
# library(dplyr)
# glimpse(Auto)
# show the variable definitions
?ISLR::Auto
Quantitative predictors: mpg, cylinders, displacement, horsepower, weight, acceleration, year. Qualitative predictors: name, origin (1 = American, 2 = European, 3 = Japanese).
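Although origin is stored as a number, it only takes the coded values 1, 2 and 3, which is why it is treated as qualitative; a quick check:
table(Auto$origin) # counts of cars coded 1 = American, 2 = European, 3 = Japanese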
(b). What is the range of each quantitative predictor? You can answer this using the range() function.
rg = sapply(Auto[,1:7], range)
rownames(rg) = c("Min","Max")
rg
## mpg cylinders displacement horsepower weight acceleration year
## Min 9.0 3 68 46 1613 8.0 70
## Max 46.6 8 455 230 5140 24.8 82
(c). What is the mean and standard deviation of each quantitative predictor?
sapply(Auto[,1:7],mean)
## mpg cylinders displacement horsepower weight acceleration
## 23.445918 5.471939 194.411990 104.469388 2977.584184 15.541327
## year
## 75.979592
sapply(Auto[,1:7], sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.805007 1.705783 104.644004 38.491160 849.402560 2.758864
## year
## 3.683737
(d). Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
sub_Auto = Auto[-c(10:85),]
# Range
rg = sapply(sub_Auto[,1:7], range)
rownames(rg) = c("Min","Max")
rg
## mpg cylinders displacement horsepower weight acceleration year
## Min 11.0 3 68 46 1649 8.5 70
## Max 46.6 8 455 230 4997 24.8 82
# Mean
sapply(sub_Auto[,1:7], mean)
## mpg cylinders displacement horsepower weight acceleration
## 24.404430 5.373418 187.240506 100.721519 2935.971519 15.726899
## year
## 77.145570
# Standard Deviation
sapply(sub_Auto[,1:7], sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.867283 1.654179 99.678367 35.708853 811.300208 2.693721
## year
## 3.106217
(e). Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
# Step1
attach(Auto)
Auto$name = as.factor(Auto$name)
Auto$origin = as.factor(Auto$origin)
pairs(Auto[,1:9])
plot(as.factor(origin),weight,
xlab = "origin",
ylab = "weight",
varwidth = T,
names = c("American","European","Japanese"),
main = "weight vs origin" ) # an example
# Step2
library(ggplot2)
ggplot(Auto, aes(x = weight, y = acceleration, col = origin)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~ origin) +
theme(legend.position = "none") +
scale_x_continuous(labels = scales::comma_format()) +
labs(x = "Weight",
y = "Acceleration",
title = "Correlation between weight and acceleration, by origin")
Findings: (1) From step 1, all of the variables (except name) appear to differ across the categories of origin, and many of the variable pairs show a clear positive or negative relationship. (2) From step 2, as vehicle weight increases, the time to accelerate from 0 to 60 mph decreases for American and Japanese cars, but the opposite holds for European cars.
(f). Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
# Step1: mpg vs origin
attach(Auto)
plot(as.factor(origin),mpg,
xlab = "origin",
ylab = "mpg",
varwidth = T,
col = c('#5f9cfc', '#fa766d','#00ba44'),
names = c("American","European","Japanese"),
main = "Miles Per Gallon VS Origin" ) # an example
# Step2: Boosting Decision Tree
library(gbm)
set.seed(1)
attach(Auto)
boost.auto = gbm(mpg~cylinders+displacement+horsepower+weight+acceleration+year, data = Auto, distribution = "gaussian", n.trees = 5000, interaction.depth = 3)
summary(boost.auto)
## var rel.inf
## weight weight 29.272808
## displacement displacement 22.754345
## year year 17.336012
## horsepower horsepower 16.095825
## acceleration acceleration 9.715508
## cylinders cylinders 4.825502
par(mfrow=c(2,2))
plot(boost.auto,i="weight")
plot(boost.auto,i="displacement")
plot(boost.auto,i="year")
plot(boost.auto,i="horsepower")

Findings: the side-by-side boxplots suggest that mpg differs by origin, and the relative-influence table from the boosted model ranks weight, displacement, year and horsepower as the most influential predictors, so several of the other variables do appear useful for predicting mpg.
In this exercise you will create some simulated data and will fit simple linear regression models to it. Make sure to use set.seed(1) prior to starting part (a) to ensure consistent results.
(a). Using the rnorm() function, create a vector, x, containing 100 observations drawn from a N(0,1) distribution. This represents a feature, X.
set.seed(1)
x = rnorm(100, mean = 0, sd = 1)
(b). Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a N(0, 0.25) distribution i.e. a normal distribution with mean zero and variance 0.25.
eps = rnorm(100, mean = 0, sd = 0.5) # var = sd^2
(c). Using x and eps, generate a vector y according to the model \[ Y= -1+0.5X+\epsilon\] What is the length of the vector y? What are the values of \(\beta_{0}\) and \(\beta_{1}\) in this linear model?
y = -1 + 0.5*x + eps
length(y)
## [1] 100
The length of the vector y is 100. In this linear model, \(\beta_{0} = -1\) and \(\beta_{1} = 0.5\).
(d). Create a scatterplot displaying the relationship between x and y. Comment on what you observe.
plot(x,y)
Findings: (1) The relationship between x and y looks approximately linear and positive, with scatter around the line coming from the noise term. (2) The intercept is roughly -1 and the slope roughly 0.5, matching the model we built above.
(e). Fit a least squares linear model to predict y using x. Comment on the model obtained. How do \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\) compare to \(\beta_{0}\) and \(\beta_{1}\)?
lm.fit = lm(y~x)
lm.fit
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## -1.0188 0.4995
The estimates are \(\hat{\beta}_{0}\) = -1.0188 and \(\hat{\beta}_{1}\) = 0.4995, very close to the true values \(\beta_{0}\) = -1 and \(\beta_{1}\) = 0.5.
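A compact way to put the true and estimated coefficients side by side (a small sketch):
# True coefficients from (c) versus the least-squares estimates from (e)
rbind(true = c(-1, 0.5), estimated = coef(lm.fit))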
(f).Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend() command to create an appropriate legend.
plot(x,y, pch = 20, main = "x vs y")
abline(a = -1, b = 0.5, col = 'red', lwd = 1)
abline(lm.fit, lwd = 1, col = 'blue',lty = 2 )
legend("bottom",
legend=c("least squares","population"),
col= c('blue','red'),
lwd = 2,
lty = c(2,1))
(g). Now fit a polynomial regression model that predicts y using x and x^2. Is there evidence that the quadratic term improves the model fit? Explain your answer.
lm.fit2 = lm(y ~ x + I(x^2))
summary(lm.fit2)
##
## Call:
## lm(formula = y ~ x + I(x^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.98252 -0.31270 -0.06441 0.29014 1.13500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.97164 0.05883 -16.517 < 2e-16 ***
## x 0.50858 0.05399 9.420 2.4e-15 ***
## I(x^2) -0.05946 0.04238 -1.403 0.164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.479 on 97 degrees of freedom
## Multiple R-squared: 0.4779, Adjusted R-squared: 0.4672
## F-statistic: 44.4 on 2 and 97 DF, p-value: 2.038e-14
The p-value associated with the quadratic term is 0.164, so we cannot reject the null hypothesis that the coefficient on x^2 is zero. There is therefore no statistical evidence that the quadratic term improves the model fit.
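An equivalent check is an F-test comparing the two fitted models; since only one term is added, its p-value matches the t-test on I(x^2) above:
anova(lm.fit, lm.fit2) # compare the linear fit against the quadratic fit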
(h). Repeat (a)-(f) after modifying the data generation process in such a way that there is less noise in the data. The model in part(c) should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term \(\epsilon\) in (b). Describe your results.
# less noise(sd of epsilon from 0.5 to 0.05)
set.seed(1)
x1 = rnorm(100, mean = 0, sd = 1)
eps1 = rnorm(100, sd = 0.05)
y1 = -1 + 0.5*x1 + eps1
plot(x1,y1)
lm.fit3 = lm(y1~x1)
lm.fit3
##
## Call:
## lm(formula = y1 ~ x1)
##
## Coefficients:
## (Intercept) x1
## -1.0019 0.4999
plot(x1,y1, pch = 20, main = "x1 vs y1")
abline(a = -1, b = 0.5, col = 'red', lwd = 3) # population
abline(lm.fit3, lwd = 3, col = '#9cd9b9',lty = 2 ) # least squares
legend("bottom",
legend=c("least squares","population"),
col= c('#9cd9b9','red'),
lwd = 2,
lty = c(2,1))
# Comparison
summary(lm.fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.93842 -0.30688 -0.06975 0.26970 1.17309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.01885 0.04849 -21.010 < 2e-16 ***
## x 0.49947 0.05386 9.273 4.58e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4814 on 98 degrees of freedom
## Multiple R-squared: 0.4674, Adjusted R-squared: 0.4619
## F-statistic: 85.99 on 1 and 98 DF, p-value: 4.583e-15
summary(lm.fit3)
##
## Call:
## lm(formula = y1 ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.093842 -0.030688 -0.006975 0.026970 0.117309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.001885 0.004849 -206.60 <2e-16 ***
## x1 0.499947 0.005386 92.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04814 on 98 degrees of freedom
## Multiple R-squared: 0.9888, Adjusted R-squared: 0.9886
## F-statistic: 8615 on 1 and 98 DF, p-value: < 2.2e-16
As I'd expect, the residual standard error decreases when the sd of eps is reduced from 0.5 to 0.05. With less noise the points are concentrated much more tightly around the line, so the regression explains more of the variation and the multiple R-squared increases (from about 0.47 to 0.99).
(i). Repeat (a)-(f) after modifying the data generation process in such a way that there is more noise in the data. The model in part (c) should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term \(\epsilon\) in (b). Describe your results.
# more noise(sd of epsilon from 0.5 to 1)
set.seed(1)
x2 = rnorm(100, mean = 0, sd = 1)
eps2 = rnorm(100, sd = 1)
y2 = -1 + 0.5*x2 + eps2
plot(x2,y2)
lm.fit4 = lm(y2~x2)
lm.fit4
##
## Call:
## lm(formula = y2 ~ x2)
##
## Coefficients:
## (Intercept) x2
## -1.0377 0.4989
plot(x2,y2, pch = 20, main = "x2 vs y2")
abline(a = -1, b = 0.5, col = 'green', lwd = 3) # population
abline(lm.fit4, lwd = 3, col = 'red',lty = 2 ) # least squares
legend("bottom",
legend=c("least squares","population"),
col= c('red','green'),
lwd = 2,
lty = c(2,1))
# Comparison
summary(lm.fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.93842 -0.30688 -0.06975 0.26970 1.17309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.01885 0.04849 -21.010 < 2e-16 ***
## x 0.49947 0.05386 9.273 4.58e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4814 on 98 degrees of freedom
## Multiple R-squared: 0.4674, Adjusted R-squared: 0.4619
## F-statistic: 85.99 on 1 and 98 DF, p-value: 4.583e-15
summary(lm.fit4)
##
## Call:
## lm(formula = y2 ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8768 -0.6138 -0.1395 0.5394 2.3462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.03769 0.09699 -10.699 < 2e-16 ***
## x2 0.49894 0.10773 4.632 1.12e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9628 on 98 degrees of freedom
## Multiple R-squared: 0.1796, Adjusted R-squared: 0.1712
## F-statistic: 21.45 on 1 and 98 DF, p-value: 1.117e-05
As I'd expect, the residual standard error increases when the sd of eps is raised from 0.5 to 1. With more noise the points are more widely scattered around the line, so the regression explains less of the variation and the multiple R-squared decreases (from about 0.47 to 0.18).
(j). What are the confidence intervals for \(\beta_{0}\) and \(\beta_{1}\) based on the original data set, the noisier data set, and the less noisy data set? Comment on your results.
# original data set(sd = 0.5)
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) -1.1150804 -0.9226122
## x 0.3925794 0.6063602
# the noisier data set(sd = 1)
confint(lm.fit4)
## 2.5 % 97.5 %
## (Intercept) -1.2301607 -0.8452245
## x2 0.2851588 0.7127204
# the less noisy data set(sd = 0.05)
confint(lm.fit3)
## 2.5 % 97.5 %
## (Intercept) -1.0115080 -0.9922612
## x1 0.4892579 0.5106360

All three intervals contain the true values \(\beta_{0} = -1\) and \(\beta_{1} = 0.5\); they are narrowest for the least noisy data set and widest for the noisiest one, reflecting the smaller standard errors when the noise is reduced.
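To make the comparison concrete, the width of the slope interval can be computed for each fit; a minimal sketch using the models above:
# Width (upper minus lower) of the 95% interval for the slope under each noise level
sapply(list(less_noisy = lm.fit3, original = lm.fit, noisier = lm.fit4),
       function(m) diff(confint(m)[2, ]))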
This problem involves the Boston data set, which we saw in the lab for this chapter. We will now try to predict per capita crime rate using the other variables in this data set. In other words, per capita crime rate is the response, and the other variables are the predictors.
library(MASS)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
(a). For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.
Convert the chas variable into a factor.
Boston$chas = as.factor(Boston$chas)
# Loop + linear regression
target = Boston$crim
R2 = c()
p_value = c()
intercept = c()
beta1 = c()
varname = c()
for (i in 2:ncol(Boston)){
data = Boston[, i] # the i-th predictor
m = lm(target ~ data) # simple regression of crim on this predictor
R2[i] = summary(m)$r.squared
p_value[i] = summary(m)$coefficients[2,4]
intercept[i] = summary(m)$coefficients[1,1]
beta1[i] = summary(m)$coefficients[2,1]
varname[i] = names(Boston)[i]
}
library(magrittr) # provides the %>% pipe operator
library(dplyr)
crime_preds = data.frame(varname = varname,
R2 = round(R2,5),
p_value = round(p_value,10),
intercept = round(intercept,5),
slope = round(beta1,5)) %>%
arrange(desc(R2))
crime_preds
## varname R2 p_value intercept slope
## 1 rad 0.39126 0.0000000000 -2.28716 0.61791
## 2 tax 0.33961 0.0000000000 -8.52837 0.02974
## 3 lstat 0.20759 0.0000000000 -3.33054 0.54880
## 4 nox 0.17722 0.0000000000 -13.71988 31.24853
## 5 indus 0.16531 0.0000000000 -2.06374 0.50978
## 6 medv 0.15078 0.0000000000 11.79654 -0.36316
## 7 black 0.14827 0.0000000000 16.55353 -0.03628
## 8 dis 0.14415 0.0000000000 9.49926 -1.55090
## 9 age 0.12442 0.0000000000 -3.77791 0.10779
## 10 ptratio 0.08407 0.0000000000 -17.64693 1.15198
## 11 rm 0.04807 0.0000006347 20.48180 -2.68405
## 12 zn 0.04019 0.0000055065 4.45369 -0.07393
## 13 chas 0.00312 0.2094345015 3.74445 -1.89278
## 14 <NA> NA NA NA NA
From the table above, every predictor except chas (p-value = 0.209) shows a statistically significant association with crim; a quick way to list these predictors programmatically is sketched below. Each simple linear regression is then fitted and summarized individually.
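A minimal sketch that pulls this list from the crime_preds table built above (which() drops the NA row that the loop leaves in the first position):
# Predictors whose simple-regression slope is significant at the 5% level
crime_preds$varname[which(crime_preds$p_value < 0.05)]
This returns every predictor except chas.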
zn
lm.zn = lm(crim~zn,data = Boston)
summary(lm.zn)
##
## Call:
## lm(formula = crim ~ zn, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.429 -4.222 -2.620 1.250 84.523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.45369 0.41722 10.675 < 2e-16 ***
## zn -0.07393 0.01609 -4.594 5.51e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.435 on 504 degrees of freedom
## Multiple R-squared: 0.04019, Adjusted R-squared: 0.03828
## F-statistic: 21.1 on 1 and 504 DF, p-value: 5.506e-06
indus
lm.indus = lm(crim~indus,data = Boston)
summary(lm.indus)
##
## Call:
## lm(formula = crim ~ indus, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.972 -2.698 -0.736 0.712 81.813
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.06374 0.66723 -3.093 0.00209 **
## indus 0.50978 0.05102 9.991 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.866 on 504 degrees of freedom
## Multiple R-squared: 0.1653, Adjusted R-squared: 0.1637
## F-statistic: 99.82 on 1 and 504 DF, p-value: < 2.2e-16
chas
lm.chas = lm(crim~chas,data = Boston)
summary(lm.chas)
##
## Call:
## lm(formula = crim ~ chas, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.738 -3.661 -3.435 0.018 85.232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7444 0.3961 9.453 <2e-16 ***
## chas1 -1.8928 1.5061 -1.257 0.209
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.597 on 504 degrees of freedom
## Multiple R-squared: 0.003124, Adjusted R-squared: 0.001146
## F-statistic: 1.579 on 1 and 504 DF, p-value: 0.2094
nox
lm.nox = lm(crim~nox,data = Boston)
summary(lm.nox)
##
## Call:
## lm(formula = crim ~ nox, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.371 -2.738 -0.974 0.559 81.728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.720 1.699 -8.073 5.08e-15 ***
## nox 31.249 2.999 10.419 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.81 on 504 degrees of freedom
## Multiple R-squared: 0.1772, Adjusted R-squared: 0.1756
## F-statistic: 108.6 on 1 and 504 DF, p-value: < 2.2e-16
rm
lm.rm = lm(crim~rm,data = Boston)
summary(lm.rm)
##
## Call:
## lm(formula = crim ~ rm, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.604 -3.952 -2.654 0.989 87.197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.482 3.365 6.088 2.27e-09 ***
## rm -2.684 0.532 -5.045 6.35e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.401 on 504 degrees of freedom
## Multiple R-squared: 0.04807, Adjusted R-squared: 0.04618
## F-statistic: 25.45 on 1 and 504 DF, p-value: 6.347e-07
age
lm.age = lm(crim~age,data = Boston)
summary(lm.age)
##
## Call:
## lm(formula = crim ~ age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.789 -4.257 -1.230 1.527 82.849
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.77791 0.94398 -4.002 7.22e-05 ***
## age 0.10779 0.01274 8.463 2.85e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.057 on 504 degrees of freedom
## Multiple R-squared: 0.1244, Adjusted R-squared: 0.1227
## F-statistic: 71.62 on 1 and 504 DF, p-value: 2.855e-16
dis
lm.dis = lm(crim~dis,data = Boston)
summary(lm.dis)
##
## Call:
## lm(formula = crim ~ dis, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.708 -4.134 -1.527 1.516 81.674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.4993 0.7304 13.006 <2e-16 ***
## dis -1.5509 0.1683 -9.213 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.965 on 504 degrees of freedom
## Multiple R-squared: 0.1441, Adjusted R-squared: 0.1425
## F-statistic: 84.89 on 1 and 504 DF, p-value: < 2.2e-16
rad
lm.rad = lm(crim~rad,data = Boston)
summary(lm.rad)
##
## Call:
## lm(formula = crim ~ rad, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.164 -1.381 -0.141 0.660 76.433
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.28716 0.44348 -5.157 3.61e-07 ***
## rad 0.61791 0.03433 17.998 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.718 on 504 degrees of freedom
## Multiple R-squared: 0.3913, Adjusted R-squared: 0.39
## F-statistic: 323.9 on 1 and 504 DF, p-value: < 2.2e-16
tax
lm.tax = lm(crim~tax,data = Boston)
summary(lm.tax)
##
## Call:
## lm(formula = crim ~ tax, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.513 -2.738 -0.194 1.065 77.696
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.528369 0.815809 -10.45 <2e-16 ***
## tax 0.029742 0.001847 16.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.997 on 504 degrees of freedom
## Multiple R-squared: 0.3396, Adjusted R-squared: 0.3383
## F-statistic: 259.2 on 1 and 504 DF, p-value: < 2.2e-16
ptratio
lm.ptratio = lm(crim~ptratio,data = Boston)
summary(lm.ptratio)
##
## Call:
## lm(formula = crim ~ ptratio, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.654 -3.985 -1.912 1.825 83.353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.6469 3.1473 -5.607 3.40e-08 ***
## ptratio 1.1520 0.1694 6.801 2.94e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.24 on 504 degrees of freedom
## Multiple R-squared: 0.08407, Adjusted R-squared: 0.08225
## F-statistic: 46.26 on 1 and 504 DF, p-value: 2.943e-11
black
lm.black = lm(crim~black,data = Boston)
summary(lm.black)
##
## Call:
## lm(formula = crim ~ black, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.756 -2.299 -2.095 -1.296 86.822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.553529 1.425903 11.609 <2e-16 ***
## black -0.036280 0.003873 -9.367 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.946 on 504 degrees of freedom
## Multiple R-squared: 0.1483, Adjusted R-squared: 0.1466
## F-statistic: 87.74 on 1 and 504 DF, p-value: < 2.2e-16
lstat
lm.lstat = lm(crim~lstat,data = Boston)
summary(lm.lstat)
##
## Call:
## lm(formula = crim ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.925 -2.822 -0.664 1.079 82.862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.33054 0.69376 -4.801 2.09e-06 ***
## lstat 0.54880 0.04776 11.491 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.664 on 504 degrees of freedom
## Multiple R-squared: 0.2076, Adjusted R-squared: 0.206
## F-statistic: 132 on 1 and 504 DF, p-value: < 2.2e-16
medv
lm.medv = lm(crim~medv,data = Boston)
summary(lm.medv)
##
## Call:
## lm(formula = crim ~ medv, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.071 -4.022 -2.343 1.298 80.957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.79654 0.93419 12.63 <2e-16 ***
## medv -0.36316 0.03839 -9.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.934 on 504 degrees of freedom
## Multiple R-squared: 0.1508, Adjusted R-squared: 0.1491
## F-statistic: 89.49 on 1 and 504 DF, p-value: < 2.2e-16
(b). Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis \(H_0\) : \(\beta_{j}\) = 0?
lm = lm(crim ~., data = Boston)
summary(lm)
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.924 -2.120 -0.353 1.019 75.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.033228 7.234903 2.354 0.018949 *
## zn 0.044855 0.018734 2.394 0.017025 *
## indus -0.063855 0.083407 -0.766 0.444294
## chas1 -0.749134 1.180147 -0.635 0.525867
## nox -10.313535 5.275536 -1.955 0.051152 .
## rm 0.430131 0.612830 0.702 0.483089
## age 0.001452 0.017925 0.081 0.935488
## dis -0.987176 0.281817 -3.503 0.000502 ***
## rad 0.588209 0.088049 6.680 6.46e-11 ***
## tax -0.003780 0.005156 -0.733 0.463793
## ptratio -0.271081 0.186450 -1.454 0.146611
## black -0.007538 0.003673 -2.052 0.040702 *
## lstat 0.126211 0.075725 1.667 0.096208 .
## medv -0.198887 0.060516 -3.287 0.001087 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared: 0.454, Adjusted R-squared: 0.4396
## F-statistic: 31.47 on 13 and 492 DF, p-value: < 2.2e-16
We can reject the null hypothesis \(H_0\) : \(\beta_{j}\) = 0 (at the 5% level) for the predictors zn, dis, rad, black and medv. For all other predictors there is insufficient evidence to reject the null hypothesis that their coefficient is zero.
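The same list can be extracted programmatically from the fitted model; a minimal sketch:
# Predictors in the multiple regression with a p-value below 0.05 (intercept excluded)
coefs = summary(lm)$coefficients
setdiff(rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05], "(Intercept)")
This returns zn, dis, rad, black and medv.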
(c). How do your results from (a) compare to your results from (b)? Create a plot displaying the univariate regression coefficients from (a) on the x-axis, and the multiple regression coefficients from (b) on the y-axis. That is, each predictor is displayed as a single point in the plot. Its coefficient in a simple linear regression model is shown on the x-axis, and its coefficient estimate in the multiple linear regression model is shown on the y-axis.
Get the univariate regression slopes from (a). The slope column of crime_preds (not the intercept column) holds these coefficients; they are reordered below to follow the original predictor order so that they line up with the multiple regression coefficients from (b).
# Simple-regression slopes, in the same predictor order as the Boston columns
x_a = crime_preds$slope[match(names(Boston)[-1], crime_preds$varname)]
x_a
## [1] -0.07393 0.50978 -1.89278 31.24853 -2.68405 0.10779 -1.55090 0.61791
## [9] 0.02974 1.15198 -0.03628 0.54880 -0.36316
Get the regression coefficients from (b)
y_b = summary(lm)$coefficients[-1,1]
y_b
## zn indus chas1 nox rm
## 0.044855215 -0.063854824 -0.749133611 -10.313534912 0.430130506
## age dis rad tax ptratio
## 0.001451643 -0.987175726 0.588208591 -0.003780016 -0.271080558
## black lstat medv
## -0.007537505 0.126211376 -0.198886821
Create a plot displaying the univariate regression coefficients from (a) on the x-axis and the multiple regression coefficients from (b) on the y-axis
plot(x_a,y_b,
main = "Simple VS Multiple Rgression Coefficients",
sub = "predicting crim",
xlab = "Simple Coefficient",
ylab = "Multiple Coefficient",
col = factor(y_b),
pch = 16)
(d). Is there evidence of non-linear association between any of the predictors and the response? To answer this question, for each predictor X, fit a model of the form \[ Y= \beta_{0}+\beta_{1}X+\beta_{2}X^2+\beta_{3}X^3+\epsilon\]
Fit a cubic polynomial regression model for each predictor and summarize the results.
zn
lm.zn1 = lm(crim ~zn+I(zn^2)+I(zn^3), data = Boston)
summary(lm.zn1)
##
## Call:
## lm(formula = crim ~ zn + I(zn^2) + I(zn^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.821 -4.614 -1.294 0.473 84.130
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.846e+00 4.330e-01 11.192 < 2e-16 ***
## zn -3.322e-01 1.098e-01 -3.025 0.00261 **
## I(zn^2) 6.483e-03 3.861e-03 1.679 0.09375 .
## I(zn^3) -3.776e-05 3.139e-05 -1.203 0.22954
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.372 on 502 degrees of freedom
## Multiple R-squared: 0.05824, Adjusted R-squared: 0.05261
## F-statistic: 10.35 on 3 and 502 DF, p-value: 1.281e-06
indus
lm.indus1 = lm(crim ~indus+I(indus^2)+I(indus^3), data = Boston)
summary(lm.indus1)
##
## Call:
## lm(formula = crim ~ indus + I(indus^2) + I(indus^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.278 -2.514 0.054 0.764 79.713
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6625683 1.5739833 2.327 0.0204 *
## indus -1.9652129 0.4819901 -4.077 5.30e-05 ***
## I(indus^2) 0.2519373 0.0393221 6.407 3.42e-10 ***
## I(indus^3) -0.0069760 0.0009567 -7.292 1.20e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.423 on 502 degrees of freedom
## Multiple R-squared: 0.2597, Adjusted R-squared: 0.2552
## F-statistic: 58.69 on 3 and 502 DF, p-value: < 2.2e-16
chas
# Fitting this model for chas throws an error: chas is a factor, so it cannot be squared or cubed.
nox
lm.nox1 = lm(crim ~ nox + I(nox^2) + I(nox^3), data = Boston)
summary(lm.nox1)
##
## Call:
## lm(formula = crim ~ nox + I(nox^2) + I(nox^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.110 -2.068 -0.255 0.739 78.302
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 233.09 33.64 6.928 1.31e-11 ***
## nox -1279.37 170.40 -7.508 2.76e-13 ***
## I(nox^2) 2248.54 279.90 8.033 6.81e-15 ***
## I(nox^3) -1245.70 149.28 -8.345 6.96e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.234 on 502 degrees of freedom
## Multiple R-squared: 0.297, Adjusted R-squared: 0.2928
## F-statistic: 70.69 on 3 and 502 DF, p-value: < 2.2e-16
rm
lm.rm1 = lm(crim ~ rm + I(rm^2) + I(rm^3), data = Boston)
summary(lm.rm1)
##
## Call:
## lm(formula = crim ~ rm + I(rm^2) + I(rm^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.485 -3.468 -2.221 -0.015 87.219
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 112.6246 64.5172 1.746 0.0815 .
## rm -39.1501 31.3115 -1.250 0.2118
## I(rm^2) 4.5509 5.0099 0.908 0.3641
## I(rm^3) -0.1745 0.2637 -0.662 0.5086
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.33 on 502 degrees of freedom
## Multiple R-squared: 0.06779, Adjusted R-squared: 0.06222
## F-statistic: 12.17 on 3 and 502 DF, p-value: 1.067e-07
age
lm.age1 = lm(crim ~ age + I(age^2) + I(age^3), data = Boston)
summary(lm.age1)
##
## Call:
## lm(formula = crim ~ age + I(age^2) + I(age^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.762 -2.673 -0.516 0.019 82.842
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.549e+00 2.769e+00 -0.920 0.35780
## age 2.737e-01 1.864e-01 1.468 0.14266
## I(age^2) -7.230e-03 3.637e-03 -1.988 0.04738 *
## I(age^3) 5.745e-05 2.109e-05 2.724 0.00668 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.84 on 502 degrees of freedom
## Multiple R-squared: 0.1742, Adjusted R-squared: 0.1693
## F-statistic: 35.31 on 3 and 502 DF, p-value: < 2.2e-16
dis
lm.dis1 = lm(crim ~ dis + I(dis^2) + I(dis^3), data = Boston)
summary(lm.dis1)
##
## Call:
## lm(formula = crim ~ dis + I(dis^2) + I(dis^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.757 -2.588 0.031 1.267 76.378
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.0476 2.4459 12.285 < 2e-16 ***
## dis -15.5543 1.7360 -8.960 < 2e-16 ***
## I(dis^2) 2.4521 0.3464 7.078 4.94e-12 ***
## I(dis^3) -0.1186 0.0204 -5.814 1.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.331 on 502 degrees of freedom
## Multiple R-squared: 0.2778, Adjusted R-squared: 0.2735
## F-statistic: 64.37 on 3 and 502 DF, p-value: < 2.2e-16
rad
lm.rad1 = lm(crim ~ rad + I(rad^2) + I(rad^3), data = Boston)
summary(lm.rad1)
##
## Call:
## lm(formula = crim ~ rad + I(rad^2) + I(rad^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.381 -0.412 -0.269 0.179 76.217
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.605545 2.050108 -0.295 0.768
## rad 0.512736 1.043597 0.491 0.623
## I(rad^2) -0.075177 0.148543 -0.506 0.613
## I(rad^3) 0.003209 0.004564 0.703 0.482
##
## Residual standard error: 6.682 on 502 degrees of freedom
## Multiple R-squared: 0.4, Adjusted R-squared: 0.3965
## F-statistic: 111.6 on 3 and 502 DF, p-value: < 2.2e-16
tax
lm.tax1 = lm(crim ~ tax + I(tax^2) + I(tax^3), data = Boston)
summary(lm.tax1)
##
## Call:
## lm(formula = crim ~ tax + I(tax^2) + I(tax^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.273 -1.389 0.046 0.536 76.950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.918e+01 1.180e+01 1.626 0.105
## tax -1.533e-01 9.568e-02 -1.602 0.110
## I(tax^2) 3.608e-04 2.425e-04 1.488 0.137
## I(tax^3) -2.204e-07 1.889e-07 -1.167 0.244
##
## Residual standard error: 6.854 on 502 degrees of freedom
## Multiple R-squared: 0.3689, Adjusted R-squared: 0.3651
## F-statistic: 97.8 on 3 and 502 DF, p-value: < 2.2e-16
ptratio
lm.ptratio1 = lm(crim ~ ptratio + I(ptratio^2) + I(ptratio^3), data = Boston)
summary(lm.ptratio1)
##
## Call:
## lm(formula = crim ~ ptratio + I(ptratio^2) + I(ptratio^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.833 -4.146 -1.655 1.408 82.697
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 477.18405 156.79498 3.043 0.00246 **
## ptratio -82.36054 27.64394 -2.979 0.00303 **
## I(ptratio^2) 4.63535 1.60832 2.882 0.00412 **
## I(ptratio^3) -0.08476 0.03090 -2.743 0.00630 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.122 on 502 degrees of freedom
## Multiple R-squared: 0.1138, Adjusted R-squared: 0.1085
## F-statistic: 21.48 on 3 and 502 DF, p-value: 4.171e-13
black
lm.black1 = lm(crim ~ black + I(black^2) + I(black^3), data = Boston)
summary(lm.black1)
##
## Call:
## lm(formula = crim ~ black + I(black^2) + I(black^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.096 -2.343 -2.128 -1.439 86.790
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.826e+01 2.305e+00 7.924 1.5e-14 ***
## black -8.356e-02 5.633e-02 -1.483 0.139
## I(black^2) 2.137e-04 2.984e-04 0.716 0.474
## I(black^3) -2.652e-07 4.364e-07 -0.608 0.544
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.955 on 502 degrees of freedom
## Multiple R-squared: 0.1498, Adjusted R-squared: 0.1448
## F-statistic: 29.49 on 3 and 502 DF, p-value: < 2.2e-16
lstat
lm.lstat1 = lm(crim ~ lstat + I(lstat^2) + I(lstat^3), data = Boston)
summary(lm.lstat1)
##
## Call:
## lm(formula = crim ~ lstat + I(lstat^2) + I(lstat^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.234 -2.151 -0.486 0.066 83.353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.2009656 2.0286452 0.592 0.5541
## lstat -0.4490656 0.4648911 -0.966 0.3345
## I(lstat^2) 0.0557794 0.0301156 1.852 0.0646 .
## I(lstat^3) -0.0008574 0.0005652 -1.517 0.1299
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.629 on 502 degrees of freedom
## Multiple R-squared: 0.2179, Adjusted R-squared: 0.2133
## F-statistic: 46.63 on 3 and 502 DF, p-value: < 2.2e-16
medv
lm.medv1 = lm(crim ~ medv + I(medv^2) + I(medv^3), data = Boston)
summary(lm.medv1)
##
## Call:
## lm(formula = crim ~ medv + I(medv^2) + I(medv^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.427 -1.976 -0.437 0.439 73.655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.1655381 3.3563105 15.840 < 2e-16 ***
## medv -5.0948305 0.4338321 -11.744 < 2e-16 ***
## I(medv^2) 0.1554965 0.0171904 9.046 < 2e-16 ***
## I(medv^3) -0.0014901 0.0002038 -7.312 1.05e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.569 on 502 degrees of freedom
## Multiple R-squared: 0.4202, Adjusted R-squared: 0.4167
## F-statistic: 121.3 on 3 and 502 DF, p-value: < 2.2e-16

Based on the fits above, there is evidence of a non-linear association between crim and the predictors indus, nox, age, dis, ptratio and medv, whose squared and cubed terms are statistically significant. For zn, rm, rad, tax, black and lstat the higher-order terms are not significant at the 5% level, so there is little evidence of non-linearity for those predictors.
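A compact cross-check, sketched below for the Boston data as loaded above (chas is skipped because it is a factor), is to compare the linear and the cubic fit for each numeric predictor with an F-test; small p-values indicate evidence of non-linearity.
# F-test of the linear fit against the cubic fit for every numeric predictor
num_preds = setdiff(names(Boston), c("crim", "chas"))
nonlin_p = sapply(num_preds, function(v) {
  linear = lm(reformulate(v, response = "crim"), data = Boston)
  cubic = lm(reformulate(sprintf("poly(%s, 3)", v), response = "crim"), data = Boston)
  anova(linear, cubic)[2, "Pr(>F)"] # p-value for adding the quadratic and cubic terms
})
round(nonlin_p, 4)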