Sketch of Curves
Squared Bias : When the model flexibility increases, the bias becomes less and the model becomes more nonlinear.If we use a simple linear regression model for a complex real-life problem, it will potentially have a high bias as the error introduced by estimating a complex problem with a simple model will be high. A more flexible non-linear model will reduce the error and hence the bias.
Variance: A simple and less flexible model results in a small variance but as flexibility and complexity of the model increases and more and more points are fitted, the variance also increases. Once has to be careful not to overfit the data because in that case the test MSE will increase.
Training MSE: For a simple, less flexible model the training MSE is large as only a few points are fitted but it decreases as the complexiity of the model increases and more points are fitted.Similar shape to squared bias curve.
Test MSE : The test MSE curve has the shape of combined effect of the bias and variance curves. When the model is simple, the suared bias and test MSE are large. The test MSE reduces as the flexibility increases but increases again as the model becomes too flexible. The second half of the “U” shape is similar the the variance curve when the complexity if high.
Bayes(Irreductible Error): irreductible error is not affected by the model flexibility and complexity and hence maintains a consistent horizontal value. The accuracy of any model is limited by the irreductible error , so the lower this error, the more accurate the model.
Goal : Predict – a movie will be successful or not Predictors – cast, money spent, run time, release time of the year
Goal : Predict – a cricket team will win the match or not Predictors – past performances, home/away, coaching, team members, weather
Goal : inference? – life expectancy of a country Eating habits, population, disease, gender distribution, medical services available
Goal : Prediction – CO2 levels by 2050 Sustainability measures, use of renewable resources, enforced policies and laws etc, advancement in sustainable technologies, awareness programs
scores of a test taken by class – below average, average, above average.
Clustering of population in the US by political party support/affiliation – democratic, republican , neutral
Obs->Euclidean distance
1 -> 3 (sqrt((0-0)^2 + (0-3)^2 + (0-0)^2 )) - rest of values below are calculated in the same way
2 -> 2
3 -> 3.2
4 -> 2.2
5 -> 1.4
6 -> 1.7
b)Prediction with K=1 is Green as the 5th observation of sqrt(2) is the nearest neighbour in that case.
Prediction with K=3 is Red as the 3 nearest neighbours are observations 2 (Red), 5(Green) and 6(Red)
We would expect K to be small. When K is small, the decision boundary is flexible and non-linear. With higher K, the flexibility decreases and more points are considered leading to a decision boundary that is close to linear.
library (MASS)
dim(Boston)
## [1] 506 14
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
The Boston data frame has 506 rows of abservation of housing Values in suburbs of Boston and 14 columns containing variables such as crime rate, industry, zones etc.
pairs(Boston)
Although some plots are difficult to distingush, below are some of the apparent relationships between different predictors:
Crime appears to be correlated with - zn, indus, chas, age, dis, rad, tax and ptratio
Zn appears to be correlated with - indus, chas, nox and black
Indus appears to be correlated with - chas, nox, dis, ptratio
Nox appears to be correlated with - age, dis
rm appears to be correlated with - age, dis
age appears to be correlated with - dis, rad, black, lstat
dis appears to be correlated with - black, lstat
rad appears to be correlated with - ptratio
tax appears to be correlated with - ptratio
lstat appears to be correlated with - medv
par(mfrow = c(3,3))
plot(Boston$zn, Boston$crim, xlab="Residential Land Zone",ylab="Crime", main="1")
plot(Boston$indus, Boston$crim, xlab="Non-retail Business Acres/Town",ylab="Crime", main="2")
plot(Boston$chas, Boston$crim, xlab="Charles River Dummy",ylab="Crime", main="3")
plot(Boston$age, Boston$crim, xlab="Age of House",ylab="Crime", main="4")
plot(Boston$dis, Boston$crim, xlab="Distance to Emploument Center",ylab="Crime", main="5")
plot(Boston$rad, Boston$crim, xlab="Accessibility to Radial Highways",ylab="Crime", main="6")
plot(Boston$tax, Boston$crim, xlab="Property Tax",ylab="Crime", main="7")
plot(Boston$ptratio, Boston$crim, xlab="Pupil-Teacher Ratio",ylab="Crime", main="8")
There seems to be high crime rate for residential land zoned for less than 25,000 sq ft but the crime is nearly negligible for lots over 25,000 sq ft.
Crime rate is nearly zero for all non-retail business acres per town except for around 18 at which the crime rate is very high
3)There is a high crime rate for tracks that do not bound the river but a significantly lower crime rate for tracts that bound the river
Crime rate increases as the age of homes increases
Crime rate is higher with less distance to employment homes
Crime rate is higher with higher accessibility to radial highways
Almost same correlation as one between crime and access to radial highways. Higher tax rate lead to higher crime rates.
Again, almost same correlation as one between crime and access to radial highways. Higher pupil-teacher ratio leads to higher crime rates.
Making histograms to analyse the distributions
par(mfrow = c(1,3))
hist(Boston$crim, xlab="Crime" , breaks=50)
hist(Boston$tax, xlab="Tax", breaks=50)
hist(Boston$ptratio, xlab="Pupil-Teacher Ratio", breaks=50)
Crime Rates - majority of suburbs of Boston have crimes rates below 20 but a few suburbs have higher crime rates. The spread extends until around 90.
nrow(Boston[Boston$crim > 20, ])
## [1] 18
18 suburbs have crime rates above 20 that can be considered high
Tax Rates Many suburbs in Boston appear to have low tax rates but some (around 130 suburbs) have very high rates (around 660).
Pupil-Teacher Ratios More than half of the suburbs appear to have a pupil-teacher ratio of below 20.
nrow(Boston[Boston$ptratio > 20, ])
## [1] 201
Some 201 suburbs of Boston have ratios above 20.
dim(subset(Boston, chas == 1))
## [1] 35 14
35 suburbs in the data set bound Charles river
median(Boston$ptratio)
## [1] 19.05
19.05 is the median pupil-teacher ratio
t(subset(Boston, medv == min(Boston$medv)))
## 399 406
## crim 38.3518 67.9208
## zn 0.0000 0.0000
## indus 18.1000 18.1000
## chas 0.0000 0.0000
## nox 0.6930 0.6930
## rm 5.4530 5.6830
## age 100.0000 100.0000
## dis 1.4896 1.4254
## rad 24.0000 24.0000
## tax 666.0000 666.0000
## ptratio 20.2000 20.2000
## black 396.9000 384.9700
## lstat 30.5900 22.9800
## medv 5.0000 5.0000
399 and 406 have the lowest medv values of 5.
checking range of values for each predictor
range(Boston$crim)# above the middle quartile
## [1] 0.00632 88.97620
range(Boston$zn) # minimum
## [1] 0 100
range(Boston$indus) #third quartile
## [1] 0.46 27.74
range(Boston$chas)# minimum
## [1] 0 1
range(Boston$nox)# third quartile
## [1] 0.385 0.871
range(Boston$rm)# around middle quartile
## [1] 3.561 8.780
range(Boston$age)# maximum
## [1] 2.9 100.0
range(Boston$dis)# first quartile
## [1] 1.1296 12.1265
range(Boston$rad)# maximum
## [1] 1 24
range(Boston$tax)# third quartile
## [1] 187 711
range(Boston$ptratio)# fourth quartile
## [1] 12.6 22.0
range(Boston$black)# close to max
## [1] 0.32 396.90
range(Boston$lstat)# third quartile
## [1] 1.73 37.97
I would think these suburbs with lowest median value of owner occupied homes are not ideal places to live for reasons such as the crime rate is relatively high and the chrles river dummy is 0 which also corresponds to high crime rates.
dim(subset(Boston, rm > 7))
## [1] 64 14
dim(subset(Boston, rm > 8))
## [1] 13 14
64 suburbs average more than seven rooms per dwelling
13 suburbs average more than eight rooms per dwelling
getwd()
## [1] "D:/STANFORD/SDC/PhD Q1/WINTER 2020/STATS 216"
Adv=read.csv("Advertising.csv",header=T,na.strings ="?")
dim(Adv)
## [1] 200 5
par(mfrow = c(1,3))
#TV
plot(Adv$TV,Adv$sales, xlab="TV", ylab="Sales", col = "red")
model1 <- lm(sales~TV, data=Adv)
regline=abline(model1,col="navy blue", lwd=3)
#Radio
plot(Adv$radio,Adv$sales, xlab="Radio", ylab="Sales", col = "red")
model2 <- lm(sales~radio, data=Adv)
regline=abline(model2,col="navy blue", lwd=3)
#Newspaper
plot(Adv$newspaper,Adv$sales, xlab="Newspaper", ylab="Sales", col = "red")
model3 <- lm(sales~newspaper, data=Adv)
regline=abline(model3,col="navy blue", lwd=3)
getwd()
## [1] "D:/STANFORD/SDC/PhD Q1/WINTER 2020/STATS 216"
Adv=read.csv("Advertising.csv",header=T,na.strings ="?")
dim(Adv)
## [1] 200 5
plot(Adv$TV,Adv$sales, xlab="TV", ylab="Sales", pch=19, col="maroon")
# LINEAR REGRESSION
model1 <- lm(sales~TV, data=Adv)
regline=abline(model1,col="navy blue", lwd=3)
summary(model1)
##
## Call:
## lm(formula = sales ~ TV, data = Adv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
res <- signif(residuals(model1), 200)
pre <- predict(model1)
segments(Adv$TV, Adv$sales, Adv$TV, pre, col = "gray")
library(ISLR)
data(Auto)
# PLOT BW MPG AND HORSEPOWER
plot(Auto$horsepower,Auto$mpg, xlab="Horsepower", ylab="Miles per gallon")
# LINEAR REGRESSION
model1 <- lm(Auto$mpg~Auto$horsepower)
abline(model1,col="Orange", lwd=3)
summary(model1)
##
## Call:
## lm(formula = Auto$mpg ~ Auto$horsepower)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## Auto$horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
# POLYNOMINAL REGRESSION METHOD 1
#model2 <- lm(Auto$mpg~Auto$horsepower + I(Auto$horsepower^2))
#summary(model2)
# POLYNOMINAL REGRESSION METHOD 2
model21 <- lm(Auto$mpg~poly(Auto$horsepower,degree=2,raw=TRUE))
summary(model21)
##
## Call:
## lm(formula = Auto$mpg ~ poly(Auto$horsepower, degree = 2, raw = TRUE))
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 56.9000997 1.8004268 31.60
## poly(Auto$horsepower, degree = 2, raw = TRUE)1 -0.4661896 0.0311246 -14.98
## poly(Auto$horsepower, degree = 2, raw = TRUE)2 0.0012305 0.0001221 10.08
## Pr(>|t|)
## (Intercept) <2e-16 ***
## poly(Auto$horsepower, degree = 2, raw = TRUE)1 <2e-16 ***
## poly(Auto$horsepower, degree = 2, raw = TRUE)2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
lines(smooth.spline(Auto$horsepower,predict(model21)), col="blue",lwd=3)
# POLYNOMIAL REGRESSION DEGREE 5
model3 <- lm(Auto$mpg~poly(Auto$horsepower,degree=5,raw=TRUE))
summary(model3)
##
## Call:
## lm(formula = Auto$mpg ~ poly(Auto$horsepower, degree = 5, raw = TRUE))
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.4326 -2.5285 -0.2925 2.1750 15.9730
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -3.223e+01 2.857e+01 -1.128
## poly(Auto$horsepower, degree = 5, raw = TRUE)1 3.700e+00 1.303e+00 2.840
## poly(Auto$horsepower, degree = 5, raw = TRUE)2 -7.142e-02 2.253e-02 -3.170
## poly(Auto$horsepower, degree = 5, raw = TRUE)3 5.931e-04 1.850e-04 3.206
## poly(Auto$horsepower, degree = 5, raw = TRUE)4 -2.281e-06 7.243e-07 -3.150
## poly(Auto$horsepower, degree = 5, raw = TRUE)5 3.330e-09 1.085e-09 3.068
## Pr(>|t|)
## (Intercept) 0.26003
## poly(Auto$horsepower, degree = 5, raw = TRUE)1 0.00475 **
## poly(Auto$horsepower, degree = 5, raw = TRUE)2 0.00164 **
## poly(Auto$horsepower, degree = 5, raw = TRUE)3 0.00146 **
## poly(Auto$horsepower, degree = 5, raw = TRUE)4 0.00176 **
## poly(Auto$horsepower, degree = 5, raw = TRUE)5 0.00231 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.326 on 386 degrees of freedom
## Multiple R-squared: 0.6967, Adjusted R-squared: 0.6928
## F-statistic: 177.4 on 5 and 386 DF, p-value: < 2.2e-16
lines(smooth.spline(Auto$horsepower,predict(model3)), col="green",lwd=3)
anova(model1,model21)
## Analysis of Variance Table
##
## Model 1: Auto$mpg ~ Auto$horsepower
## Model 2: Auto$mpg ~ poly(Auto$horsepower, degree = 2, raw = TRUE)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 390 9385.9
## 2 389 7442.0 1 1943.9 101.61 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ADDING LEGEND
legend("topright",c("Linear","Degree 2","Degree 5"),
col=c("orange","blue","green"), lwd=3)