Altaf_ISLR HW1

Q1-2.4.2

Regression problem (CEO salary is a quantitative output)

Goal : Inference
n = 500 firms
p = profit, number of employees and industry

Classification problem (success or failure categories of output

Goal : Prediction
n = data on 20 similar products
p = price charged for the product, marketing budget, competition price, and ten other variables

regression problem (% change is quantitative output)

Goal : Prediction
n = weekly data for 2012 (52 weeks)
p = the % change in the US, British and German markets

Q2-2.4.3

Sketch of Curves

Squared Bias : When the model flexibility increases, the bias becomes less and the model becomes more nonlinear.If we use a simple linear regression model for a complex real-life problem, it will potentially have a high bias as the error introduced by estimating a complex problem with a simple model will be high. A more flexible non-linear model will reduce the error and hence the bias.

Variance: A simple and less flexible model results in a small variance but as flexibility and complexity of the model increases and more and more points are fitted, the variance also increases. Once has to be careful not to overfit the data because in that case the test MSE will increase.

Training MSE: For a simple, less flexible model the training MSE is large as only a few points are fitted but it decreases as the complexiity of the model increases and more points are fitted.Similar shape to squared bias curve.

Test MSE : The test MSE curve has the shape of combined effect of the bias and variance curves. When the model is simple, the suared bias and test MSE are large. The test MSE reduces as the flexibility increases but increases again as the model becomes too flexible. The second half of the “U” shape is similar the the variance curve when the complexity if high.

Bayes(Irreductible Error): irreductible error is not affected by the model flexibility and complexity and hence maintains a consistent horizontal value. The accuracy of any model is limited by the irreductible error , so the lower this error, the more accurate the model.

Q3-2.4.4

Goal : Predict – a citizen will vote or not Predictors – age, salary, past record, ethnicity

Goal : Predict – a movie will be successful or not Predictors – cast, money spent, run time, release time of the year

Goal : Predict – a cricket team will win the match or not Predictors – past performances, home/away, coaching, team members, weather

Goal : Inference - Employee’s salary Predictors – experience, degree, age, field, company working at

Goal : inference? – life expectancy of a country Eating habits, population, disease, gender distribution, medical services available

Goal : Prediction – CO2 levels by 2050 Sustainability measures, use of renewable resources, enforced policies and laws etc, advancement in sustainable technologies, awareness programs

blood-group types for blood drawn from patients - type A /B/Ab/O

scores of a test taken by class – below average, average, above average.

Clustering of population in the US by political party support/affiliation – democratic, republican , neutral

Q4-2.4.7

Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0 :

Obs->Euclidean distance

1 -> 3 (sqrt((0-0)^2 + (0-3)^2 + (0-0)^2 )) - rest of values below are calculated in the same way

2 -> 2

3 -> 3.2

4 -> 2.2

5 -> 1.4

6 -> 1.7

b)Prediction with K=1 is Green as the 5th observation of sqrt(2) is the nearest neighbour in that case.

Prediction with K=3 is Red as the 3 nearest neighbours are observations 2 (Red), 5(Green) and 6(Red)
We would expect K to be small. When K is small, the decision boundary is flexible and non-linear. With higher K, the flexibility decreases and more points are considered leading to a decision boundary that is close to linear.

Q5-2.4.10

library (MASS)

dim(Boston)

## [1] 506  14

names(Boston)

##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

The Boston data frame has 506 rows of abservation of housing Values in suburbs of Boston and 14 columns containing variables such as crime rate, industry, zones etc.

pairs(Boston)

Although some plots are difficult to distingush, below are some of the apparent relationships between different predictors:

Crime appears to be correlated with - zn, indus, chas, age, dis, rad, tax and ptratio

Zn appears to be correlated with - indus, chas, nox and black

Indus appears to be correlated with - chas, nox, dis, ptratio

Nox appears to be correlated with - age, dis

rm appears to be correlated with - age, dis

age appears to be correlated with - dis, rad, black, lstat

dis appears to be correlated with - black, lstat

rad appears to be correlated with - ptratio

tax appears to be correlated with - ptratio

lstat appears to be correlated with - medv

Plotting each correlated predictor (from b) against crime

par(mfrow = c(3,3))

plot(Boston$zn, Boston$crim, xlab="Residential Land Zone",ylab="Crime", main="1")

plot(Boston$indus, Boston$crim, xlab="Non-retail Business Acres/Town",ylab="Crime", main="2")

plot(Boston$chas, Boston$crim, xlab="Charles River Dummy",ylab="Crime", main="3")

plot(Boston$age, Boston$crim, xlab="Age of House",ylab="Crime", main="4")

plot(Boston$dis, Boston$crim, xlab="Distance to Emploument Center",ylab="Crime", main="5")

plot(Boston$rad, Boston$crim, xlab="Accessibility to Radial Highways",ylab="Crime", main="6")

plot(Boston$tax, Boston$crim, xlab="Property Tax",ylab="Crime", main="7")

plot(Boston$ptratio, Boston$crim, xlab="Pupil-Teacher Ratio",ylab="Crime", main="8")

There seems to be high crime rate for residential land zoned for less than 25,000 sq ft but the crime is nearly negligible for lots over 25,000 sq ft.
Crime rate is nearly zero for all non-retail business acres per town except for around 18 at which the crime rate is very high

3)There is a high crime rate for tracks that do not bound the river but a significantly lower crime rate for tracts that bound the river

Crime rate increases as the age of homes increases
Crime rate is higher with less distance to employment homes
Crime rate is higher with higher accessibility to radial highways
Almost same correlation as one between crime and access to radial highways. Higher tax rate lead to higher crime rates.
Again, almost same correlation as one between crime and access to radial highways. Higher pupil-teacher ratio leads to higher crime rates.

Making histograms to analyse the distributions

par(mfrow = c(1,3))

hist(Boston$crim, xlab="Crime" , breaks=50)

hist(Boston$tax, xlab="Tax", breaks=50)

hist(Boston$ptratio, xlab="Pupil-Teacher Ratio", breaks=50)

Crime Rates - majority of suburbs of Boston have crimes rates below 20 but a few suburbs have higher crime rates. The spread extends until around 90.

nrow(Boston[Boston$crim > 20, ])

## [1] 18

18 suburbs have crime rates above 20 that can be considered high

Tax Rates Many suburbs in Boston appear to have low tax rates but some (around 130 suburbs) have very high rates (around 660).

Pupil-Teacher Ratios More than half of the suburbs appear to have a pupil-teacher ratio of below 20.

nrow(Boston[Boston$ptratio > 20, ])

## [1] 201

Some 201 suburbs of Boston have ratios above 20.

dim(subset(Boston, chas == 1))

## [1] 35 14

35 suburbs in the data set bound Charles river

median(Boston$ptratio)

## [1] 19.05

19.05 is the median pupil-teacher ratio

t(subset(Boston, medv == min(Boston$medv)))

##              399      406
## crim     38.3518  67.9208
## zn        0.0000   0.0000
## indus    18.1000  18.1000
## chas      0.0000   0.0000
## nox       0.6930   0.6930
## rm        5.4530   5.6830
## age     100.0000 100.0000
## dis       1.4896   1.4254
## rad      24.0000  24.0000
## tax     666.0000 666.0000
## ptratio  20.2000  20.2000
## black   396.9000 384.9700
## lstat    30.5900  22.9800
## medv      5.0000   5.0000

399 and 406 have the lowest medv values of 5.

checking range of values for each predictor

range(Boston$crim)# above the middle quartile

## [1]  0.00632 88.97620

range(Boston$zn) # minimum

## [1]   0 100

range(Boston$indus) #third quartile

## [1]  0.46 27.74

range(Boston$chas)# minimum

## [1] 0 1

range(Boston$nox)# third quartile

## [1] 0.385 0.871

range(Boston$rm)# around middle quartile

## [1] 3.561 8.780

range(Boston$age)# maximum

## [1]   2.9 100.0

range(Boston$dis)# first quartile

## [1]  1.1296 12.1265

range(Boston$rad)# maximum

## [1]  1 24

range(Boston$tax)# third quartile

## [1] 187 711

range(Boston$ptratio)# fourth quartile

## [1] 12.6 22.0

range(Boston$black)# close to max

## [1]   0.32 396.90

range(Boston$lstat)# third quartile

## [1]  1.73 37.97

I would think these suburbs with lowest median value of owner occupied homes are not ideal places to live for reasons such as the crime rate is relatively high and the chrles river dummy is 0 which also corresponds to high crime rates.

dim(subset(Boston, rm > 7))

## [1] 64 14

dim(subset(Boston, rm > 8))

## [1] 13 14

64 suburbs average more than seven rooms per dwelling

13 suburbs average more than eight rooms per dwelling

Q6-Reproduce Fig 2.1

getwd()

## [1] "D:/STANFORD/SDC/PhD Q1/WINTER 2020/STATS 216"

Adv=read.csv("Advertising.csv",header=T,na.strings ="?")

dim(Adv)

## [1] 200   5

par(mfrow = c(1,3))

#TV

plot(Adv$TV,Adv$sales, xlab="TV", ylab="Sales", col = "red")

model1 <- lm(sales~TV, data=Adv)
regline=abline(model1,col="navy blue", lwd=3)

#Radio

plot(Adv$radio,Adv$sales, xlab="Radio", ylab="Sales", col = "red")

model2 <- lm(sales~radio, data=Adv)
regline=abline(model2,col="navy blue", lwd=3)

#Newspaper

plot(Adv$newspaper,Adv$sales, xlab="Newspaper", ylab="Sales", col = "red")


model3 <- lm(sales~newspaper, data=Adv)
regline=abline(model3,col="navy blue", lwd=3)

Q7-Reproduce Fig 3.1

getwd()

## [1] "D:/STANFORD/SDC/PhD Q1/WINTER 2020/STATS 216"

Adv=read.csv("Advertising.csv",header=T,na.strings ="?")

dim(Adv)

## [1] 200   5

plot(Adv$TV,Adv$sales, xlab="TV", ylab="Sales", pch=19, col="maroon")

# LINEAR REGRESSION

model1 <- lm(sales~TV, data=Adv)
regline=abline(model1,col="navy blue", lwd=3)
summary(model1)

## 
## Call:
## lm(formula = sales ~ TV, data = Adv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3860 -1.9545 -0.1913  2.0671  7.2124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.032594   0.457843   15.36   <2e-16 ***
## TV          0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16

res <- signif(residuals(model1), 200)
pre <- predict(model1)
segments(Adv$TV, Adv$sales, Adv$TV, pre, col = "gray")

Q8-Reproduce Fig 3.8

library(ISLR)
data(Auto)

# PLOT BW MPG AND HORSEPOWER

plot(Auto$horsepower,Auto$mpg, xlab="Horsepower", ylab="Miles per gallon")

# LINEAR REGRESSION

model1 <- lm(Auto$mpg~Auto$horsepower)
abline(model1,col="Orange", lwd=3)
summary(model1)

## 
## Call:
## lm(formula = Auto$mpg ~ Auto$horsepower)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     39.935861   0.717499   55.66   <2e-16 ***
## Auto$horsepower -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

# POLYNOMINAL REGRESSION METHOD 1

#model2 <- lm(Auto$mpg~Auto$horsepower + I(Auto$horsepower^2))
#summary(model2)

# POLYNOMINAL REGRESSION METHOD 2

model21 <- lm(Auto$mpg~poly(Auto$horsepower,degree=2,raw=TRUE))
summary(model21)

## 
## Call:
## lm(formula = Auto$mpg ~ poly(Auto$horsepower, degree = 2, raw = TRUE))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                                                  Estimate Std. Error t value
## (Intercept)                                    56.9000997  1.8004268   31.60
## poly(Auto$horsepower, degree = 2, raw = TRUE)1 -0.4661896  0.0311246  -14.98
## poly(Auto$horsepower, degree = 2, raw = TRUE)2  0.0012305  0.0001221   10.08
##                                                Pr(>|t|)    
## (Intercept)                                      <2e-16 ***
## poly(Auto$horsepower, degree = 2, raw = TRUE)1   <2e-16 ***
## poly(Auto$horsepower, degree = 2, raw = TRUE)2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

lines(smooth.spline(Auto$horsepower,predict(model21)), col="blue",lwd=3)

# POLYNOMIAL REGRESSION DEGREE 5

model3 <- lm(Auto$mpg~poly(Auto$horsepower,degree=5,raw=TRUE))
summary(model3)

## 
## Call:
## lm(formula = Auto$mpg ~ poly(Auto$horsepower, degree = 5, raw = TRUE))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.4326  -2.5285  -0.2925   2.1750  15.9730 
## 
## Coefficients:
##                                                  Estimate Std. Error t value
## (Intercept)                                    -3.223e+01  2.857e+01  -1.128
## poly(Auto$horsepower, degree = 5, raw = TRUE)1  3.700e+00  1.303e+00   2.840
## poly(Auto$horsepower, degree = 5, raw = TRUE)2 -7.142e-02  2.253e-02  -3.170
## poly(Auto$horsepower, degree = 5, raw = TRUE)3  5.931e-04  1.850e-04   3.206
## poly(Auto$horsepower, degree = 5, raw = TRUE)4 -2.281e-06  7.243e-07  -3.150
## poly(Auto$horsepower, degree = 5, raw = TRUE)5  3.330e-09  1.085e-09   3.068
##                                                Pr(>|t|)   
## (Intercept)                                     0.26003   
## poly(Auto$horsepower, degree = 5, raw = TRUE)1  0.00475 **
## poly(Auto$horsepower, degree = 5, raw = TRUE)2  0.00164 **
## poly(Auto$horsepower, degree = 5, raw = TRUE)3  0.00146 **
## poly(Auto$horsepower, degree = 5, raw = TRUE)4  0.00176 **
## poly(Auto$horsepower, degree = 5, raw = TRUE)5  0.00231 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.326 on 386 degrees of freedom
## Multiple R-squared:  0.6967, Adjusted R-squared:  0.6928 
## F-statistic: 177.4 on 5 and 386 DF,  p-value: < 2.2e-16

lines(smooth.spline(Auto$horsepower,predict(model3)), col="green",lwd=3)

anova(model1,model21)

## Analysis of Variance Table
## 
## Model 1: Auto$mpg ~ Auto$horsepower
## Model 2: Auto$mpg ~ poly(Auto$horsepower, degree = 2, raw = TRUE)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    390 9385.9                                  
## 2    389 7442.0  1    1943.9 101.61 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# ADDING LEGEND

legend("topright",c("Linear","Degree 2","Degree 5"), 
       col=c("orange","blue","green"), lwd=3)