Conceptual—————————————–>
Section 2.4
Q1)
Given a large n and a small number of predictors, a more flexible statistical learning method should perform better than an inflexible one. Normally, the primary concern with a more flexible method is that we will overfit the data (high variance). However, a larger n reduces that variance, because the fit depends less on any single observation, and with few predictors there are fewer ways to fit noise. We can therefore take advantage of the lower bias of the flexible method.
If the number of predictors is extremely large and the number of observations n is small, we run the risk of overfitting the data if we use a more flexible method. In this case, we would likely want a less flexible method (one that makes stronger assumptions about the form of f and so has fewer parameters to estimate).
If the relationship between predictors and response is highly non-linear, we will not want a model that makes linearity assumptions (these tend to be the less flexible methods), since it would be badly biased. In this case we want a more flexible method.
If the variance of the error terms is extremely high, a more flexible model is likely to fit the noise rather than the signal. In this case we’ll want to avoid overfitting by using a less flexible model.
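The trade-off in these answers can be checked with a small simulation. This is only a sketch under made-up settings, not part of the exercise: the true function sin(2x), the noise level, the sample sizes, the polynomial degrees, and the helper names `sim` and `test.mse` are all assumptions chosen for illustration. It compares a linear fit with a degree-8 polynomial fit, each trained on a small and on a large sample and scored on held-out data.

```r
# Illustrative simulation (all settings are assumptions, not from the exercise)
set.seed(1)

# Hypothetical non-linear truth with additive noise
f = function(x) sin(2 * x)
sim = function(n, sd = 0.5) {
  x = runif(n, -2, 2)
  data.frame(x = x, y = f(x) + rnorm(n, sd = sd))
}

# Fit a polynomial of the given degree on `train`, return the MSE on `test`
test.mse = function(train, test, degree) {
  fit = lm(y ~ poly(x, degree), data = train)
  mean((test$y - predict(fit, newdata = test))^2)
}

test  = sim(5000)   # held-out data used only for scoring
small = sim(15)     # small training set
large = sim(1000)   # large training set

results = data.frame(
  n      = c(15, 15, 1000, 1000),
  degree = c(1, 8, 1, 8),
  mse    = c(test.mse(small, test, 1), test.mse(small, test, 8),
             test.mse(large, test, 1), test.mse(large, test, 8))
)
print(results)
```

With the large training set the flexible fit should come out clearly ahead of the linear one. With only 15 points the degree-8 fit will often do much worse on new data, though the exact numbers depend on the seed.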
Q2)
Inference: We want to understand the nature of the relationship between the factors (which ones affect CEO salary, not necessarily predicting CEO salary).
n = 500 (the number of companies)
p = 3 (profit, # of employees, industry); CEO salary is the response, not a predictor
Prediction: We want to know whether (given the regressors) the product will be a success or failure.
n = 20 (the number of similar products)
p = 13 (price charged, marketing budget, competition price, + 10 others); success vs failure is the response, not a predictor
Prediction: We’re predicting the % change in the dollar.
n = 52 (number of weeks in a year)
p = 3 (% change in US market, % change in British market, % change in German market); % change in the dollar is the response, not a predictor
Q6)
Parametric approaches to statistical learning make a priori assumptions about the functional form of f. This makes the problem of estimating f easier, because it reduces to estimating a set of parameters. For example, we might assume that there is a linear relationship between our independent and dependent variables. This kind of approach is preferred if we need to make inferences about the relationships among the variables (the fitted function is more interpretable). The disadvantage is that the assumed form may be far from the true f, in which case the estimate will be poor.
Alternatively, non-parametric methods make no explicit assumptions about the functional form of f. This allows us to fit a wider range of possible shapes for f, possibly getting closer to the true form. But because the problem is no longer reduced to a few parameters, we need many more observations, and we run the risk of overfitting the data if n is small.
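To make the contrast concrete, here is a minimal sketch on simulated data. The linear generating process, the sample size, and the choice of loess as the non-parametric method are assumptions made for illustration; any smoother would serve the same purpose.

```r
# Illustrative comparison (simulated data; the linear truth is an assumption)
set.seed(2)
n = 200
x = runif(n, 0, 10)
y = 1 + 2 * x + rnorm(n)   # hypothetical linear generating process
d = data.frame(x = x, y = y)

# Parametric: assume f(x) = b0 + b1 * x, so fitting reduces to two numbers
fit.lm = lm(y ~ x, data = d)
print(coef(fit.lm))

# Non-parametric: loess makes no global assumption about the shape of f
fit.lo = loess(y ~ x, data = d)

# Both can predict, but only the parametric fit has interpretable coefficients
new = data.frame(x = c(2.5, 5, 7.5))
pred = cbind(lm    = predict(fit.lm, newdata = new),
             loess = predict(fit.lo, newdata = new))
print(pred)
```

When the assumed form is right, as it is here by construction, the two fits should give similar predictions, and the parametric one additionally reports a slope that can be read directly.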
Section 3.7
Q3)
salary = 50 + 20*GPA + 0.07*IQ + 35*GR + 0.01*(GPA*IQ) - 10*(GPA*GR)
#Part b from Q3, 3.7 below:
salary.est = function(GPA, GR, IQ) {
  return(50 + 20*GPA + 0.07*IQ + 35*GR +
         0.01*GPA*IQ - 10*GPA*GR)
}
# GR = 1: predicted salaries over a grid of GPAs with IQ fixed at 100
salary.test = sapply(seq(from=1.0, to=4.0, by=.5), FUN=function(i){
  salary.est(i, 1, 100)
})
female.salaries = salary.test
# GR = 0: same grid of GPAs, IQ fixed at 100
salary.test = sapply(seq(from=1.0, to=4.0, by=.5), FUN=function(i){
  salary.est(i, 0, 100)
})
male.salaries = salary.test
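plot(male.salaries, female.salaries,
     xlim=c(75, 150), ylim=c(75,150),
     xlab="Male Salaries", ylab="Female Salaries")
# Reference line y = x: points above it are GPAs where the female salary is higher
# (an intercept of 92 would put the line outside the plotted range)
abline(a = 0, b = 1, col="red")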
salary.est(4.0, 1, 100)
## [1] 136
salary.est(4.0, 0, 100)
## [1] 141
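With IQ fixed at 100, the difference between the two predicted salaries is linear in GPA: female minus male equals 35 - 10*GPA, which changes sign at GPA = 3.5. The sketch below checks this numerically. It restates the fitted model so that it runs on its own; the names `salary.fit`, `gap`, and `crossover` are illustrative.

```r
# Fitted model, restated so the block is self-contained
# (GR is the gender indicator: 1 = female, 0 = male)
salary.fit = function(GPA, GR, IQ) {
  50 + 20 * GPA + 0.07 * IQ + 35 * GR + 0.01 * GPA * IQ - 10 * GPA * GR
}

# Female minus male predicted salary at a given GPA, holding IQ fixed
gap = function(GPA, IQ = 100) salary.fit(GPA, 1, IQ) - salary.fit(GPA, 0, IQ)

gpa = seq(1, 4, by = 0.5)
print(data.frame(GPA = gpa, female.minus.male = gap(gpa)))

# GPA at which the two predicted salaries are equal
crossover = uniroot(gap, interval = c(1, 4), tol = 1e-9)$root
print(crossover)
```

Above the crossover the predicted male salary is higher, which matches the two values computed at GPA = 4.0 (136 vs 141).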
Section 4.7
Q8)
Applied—————————————–>
Section 2.4
Q9)
auto=read.table(file="/Users/benpeloquin/Desktop/Spring_2015/CME250/HW/auto.data", header=T, na.strings="?")
auto = na.omit(auto)
auto$cylinders = as.factor(auto$cylinders)
auto$origin = as.factor(auto$origin)
names(auto)
## [1] "mpg" "cylinders" "displacement" "horsepower"
## [5] "weight" "acceleration" "year" "origin"
## [9] "name"
#a) Quantitative vs Qualitative vars:
cols = colnames(auto)
quants = quals = c()
for (i in seq_along(cols)) {
  if (is.numeric(auto[, cols[i]])) {
    print(paste(cols[i], ": numeric"))
    quants = cbind(quants, cols[i])
  } else {
    # every non-numeric column is treated as qualitative
    print(paste(cols[i], ": factor"))
    quals = cbind(quals, cols[i])
  }
}
## [1] "mpg : numeric"
## [1] "cylinders : factor"
## [1] "displacement : numeric"
## [1] "horsepower : numeric"
## [1] "weight : numeric"
## [1] "acceleration : numeric"
## [1] "year : numeric"
## [1] "origin : factor"
## [1] "name : factor"
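The same split can be written without an explicit loop. The sketch below uses a tiny made-up data frame (`toy`, with invented values) in place of the Auto data so that it runs on its own; `is.quant`, `quant.cols`, and `qual.cols` are illustrative names.

```r
# Toy data frame standing in for the Auto data (values are made up)
toy = data.frame(
  mpg = c(18, 15, 36.1),
  cylinders = factor(c(8, 8, 4)),
  weight = c(3504, 3693, 1800),
  name = c("car a", "car b", "car c"),
  stringsAsFactors = TRUE
)

# TRUE for quantitative columns, FALSE for qualitative ones
is.quant = sapply(toy, is.numeric)
quant.cols = names(toy)[is.quant]
qual.cols = names(toy)[!is.quant]

print(quant.cols)
print(qual.cols)
```

This returns plain character vectors rather than the one-row matrices built with cbind, so it would need `length()` instead of `ncol()` in any later loop.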
#b) Range of quants:
for (i in 1:ncol(quants)) {
  r = range(auto[, quants[1, i]])
  print(paste(quants[1, i], "range: ", r[1], "-", r[2]))
}
## [1] "mpg range: 9 - 46.6"
## [1] "displacement range: 68 - 455"
## [1] "horsepower range: 46 - 230"
## [1] "weight range: 1613 - 5140"
## [1] "acceleration range: 8 - 24.8"
## [1] "year range: 70 - 82"
#c) Mean and sd of quants:
# x is the one-row matrix of column names built in part a)
m.sd.print = function(x, data) {
  for (i in 1:ncol(x)) {
    cat(paste(x[1, i], "-->\n",
              "mean:", mean(data[, x[1, i]]), "\n",
              "sd:", sd(data[, x[1, i]]), "\n"))
  }
}
m.sd.print(quants, auto)
## mpg -->
## mean: 23.4459183673469
## sd: 7.8050074865718
## displacement -->
## mean: 194.411989795918
## sd: 104.644003908905
## horsepower -->
## mean: 104.469387755102
## sd: 38.4911599328285
## weight -->
## mean: 2977.58418367347
## sd: 849.402560042949
## acceleration -->
## mean: 15.5413265306122
## sd: 2.75886411918808
## year -->
## mean: 75.9795918367347
## sd: 3.68373654357783
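For comparison, the range, mean, and sd can be collected into one table with sapply. This is a sketch on a made-up data frame (`toy`, invented values), not the Auto data; `summary.table` is an illustrative name.

```r
# Toy numeric data (made-up values, not the Auto data)
toy = data.frame(
  mpg = c(18, 15, 36.1, 27),
  weight = c(3504, 3693, 1800, 2130)
)

# One row per variable: min, max, mean, sd
summary.table = t(sapply(toy, function(col) {
  c(min = min(col), max = max(col), mean = mean(col), sd = sd(col))
}))
print(round(summary.table, 2))
```

sapply returns one column per variable, so the transpose puts variables in rows, which is easier to read when there are many of them.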
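#d) Mean and sd after dropping a block of rows.
# Note: this keeps rows 1-9 and 85 onward, so row 85 is retained. If rows 10
# through 85 are all meant to be removed, auto[-(10:85), ] would do that.
# The output below reflects the code as written.
auto.rm = auto[c(1:9, 85:nrow(auto)),]
m.sd.print(quants,auto.rm)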
## mpg -->
## mean: 24.3684542586751
## sd: 7.88089834213537
## displacement -->
## mean: 187.753943217666
## sd: 99.9394881404822
## horsepower -->
## mean: 100.955835962145
## sd: 35.8955667771312
## weight -->
## mean: 2939.64353312303
## sd: 812.649629297648
## acceleration -->
## mean: 15.7182965299685
## sd: 2.69381257754339
## year -->
## mean: 77.1324921135647
## sd: 3.11002631924956
names(auto)
## [1] "mpg" "cylinders" "displacement" "horsepower"
## [5] "weight" "acceleration" "year" "origin"
## [9] "name"
attach(auto)
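pairs(~ mpg + cylinders + weight + horsepower + year, col="blue",
      main="mpg, cylinders, weight, horsepower, year")
# cylinders is a factor; convert via character so the x-axis shows the actual
# cylinder counts rather than the factor level codes 1 to 5
plot(as.numeric(as.character(cylinders)), mpg,
     main ="MPG by Cylinders",
     xlab = "# of cylinders", ylab = "MPG",
     col = "blue", pch = 10)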
plot(mpg ~ horsepower, col = "blue", pch = 10,
     main = "MPG by Horsepower")
m1 = lm(mpg ~ horsepower)
# abline() accepts a fitted lm object and draws its regression line
abline(m1, col = "red")
plot(horsepower ~ weight, col = "blue", pch = 10,
     main = "Horsepower ~ Weight")
m2 = lm(horsepower ~ weight)
abline(m2, col = "red")
#f)
names(auto)
## [1] "mpg" "cylinders" "displacement" "horsepower"
## [5] "weight" "acceleration" "year" "origin"
## [9] "name"
m3 = lm(mpg ~ horsepower + cylinders +
weight + year + acceleration + displacement)
anova(m3)
## Analysis of Variance Table
##
## Response: mpg
##               Df  Sum Sq Mean Sq   F value Pr(>F)
## horsepower     1 14433.1 14433.1 1417.2300 <2e-16 ***
## cylinders      4  2349.2   587.3   57.6700 <2e-16 ***
## weight         1   893.3   893.3   87.7139 <2e-16 ***
## year           1  2242.9  2242.9  220.2365 <2e-16 ***
## acceleration   1     0.7     0.7    0.0686 0.7935
## displacement   1     9.5     9.5    0.9339 0.3345
## Residuals    382  3890.3    10.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m4 = lm(mpg ~ horsepower * cylinders * weight * year)
anova(m4)
## Analysis of Variance Table
##
## Response: mpg
##                                   Df  Sum Sq Mean Sq   F value    Pr(>F)
## horsepower                         1 14433.1 14433.1 2014.2068 < 2.2e-16 ***
## cylinders                          4  2349.2   587.3   81.9623 < 2.2e-16 ***
## weight                             1   893.3   893.3  124.6615 < 2.2e-16 ***
## year                               1  2242.9  2242.9  313.0063 < 2.2e-16 ***
## horsepower:cylinders               4   675.6   168.9   23.5694 < 2.2e-16 ***
## horsepower:weight                  1   132.9   132.9   18.5457  2.14e-05 ***
## cylinders:weight                   4    40.9    10.2    1.4256  0.224883
## horsepower:year                    1   264.3   264.3   36.8900  3.17e-09 ***
## cylinders:year                     3    29.1     9.7    1.3522  0.257235
## weight:year                        1     8.6     8.6    1.2052  0.273008
## horsepower:cylinders:weight        2    10.3     5.1    0.7161  0.489360
## horsepower:cylinders:year          2    79.2    39.6    5.5294  0.004312 **
## horsepower:weight:year             1     0.1     0.1    0.0076  0.930704
## cylinders:weight:year              2    45.2    22.6    3.1538  0.043867 *
## horsepower:cylinders:weight:year   2    27.6    13.8    1.9252  0.147347
## Residuals                        361  2586.8     7.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
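One caution when reading the ANOVA tables above: anova() applied to a single lm fit reports sequential (Type I) sums of squares, so each row measures what a term adds after the terms listed before it. The non-significant rows for acceleration and displacement therefore say these variables add little once horsepower, cylinders, weight, and year are already in the model, not that they are unrelated to mpg. The sketch below shows the order dependence. It uses R's built-in mtcars data purely as a stand-in, and the object names are illustrative.

```r
# Uses the built-in mtcars data purely as a stand-in for the Auto data
fit.a = lm(mpg ~ hp + wt, data = mtcars)
fit.b = lm(mpg ~ wt + hp, data = mtcars)

tab.a = anova(fit.a)
tab.b = anova(fit.b)

# Sum of squares credited to hp when it enters first vs. after wt
ss.hp.first = tab.a["hp", "Sum Sq"]
ss.hp.last = tab.b["hp", "Sum Sq"]
print(c(first = ss.hp.first, last = ss.hp.last))

# The residual sum of squares is identical: both formulas describe one model
print(c(tab.a["Residuals", "Sum Sq"], tab.b["Residuals", "Sum Sq"]))
```

Because hp and wt are correlated in mtcars, hp is credited with far more variation when it is entered first. The same reasoning applies to the ordering of terms in m3 and m4.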