Conceptual—————————————–>

Section 2.4

Q1)

  1. Given a large n and a small number of predictors, a more flexible statistical learning method should be better than an inflexible method. Normally, a primary concern with using a more flexible method is that we will overfit the data. However, a larger n reduces the variance of a flexible fit (making the model more generalizable), the added flexibility reduces bias, and the low number of predictors further decreases the chances of ending up with high variance.

  2. If the number of predictors is extremely large and the number of observations n is small, we are likely to run the risk of overfitting the data if we use a more flexible method. In this case we would likely want to use a less flexible method (one that makes stronger assumptions about the underlying form of f). A quick simulation after this list illustrates points 1 and 2.

  3. If the relationship between predictors and response is highly non-linear, we will want to use a model that does not make linearity assumptions (linear models tend to be the less flexible methods). In this case we may want to use a more flexible method.

  4. If the variance of the error terms is extremely high, we might be concerned about a more flexible model fitting that noise (overfitting). In this case we'll want to avoid overfitting by using a less flexible model.
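A minimal sketch of points 1 and 2 on simulated data (the helper sim.test.mse, the sin(x) truth, and the sample sizes are my own illustration, not part of the exercise): with a non-linear truth, a flexible smoothing spline clearly beats a linear fit once n is large, while with a small n its advantage shrinks and it can chase noise.

#Simulated comparison of an inflexible (linear) and a flexible (spline) fit
set.seed(1)
sim.test.mse = function(n.train) {
  x.train = runif(n.train, 0, 10)
  y.train = sin(x.train) + rnorm(n.train, sd = 0.5)
  x.test = runif(1000, 0, 10)
  y.test = sin(x.test) + rnorm(1000, sd = 0.5)
  fit.lin = lm(y.train ~ x.train)             # inflexible: assumes a linear f
  fit.flex = smooth.spline(x.train, y.train)  # flexible: smoothing spline
  mse.lin = mean((y.test - predict(fit.lin, data.frame(x.train = x.test)))^2)
  mse.flex = mean((y.test - predict(fit.flex, x.test)$y)^2)
  c(linear = mse.lin, spline = mse.flex)
}
sim.test.mse(20)    # small n: the flexible fit's edge shrinks and it may chase noise
sim.test.mse(2000)  # large n: the flexible fit clearly wins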

Q2)

  1. Regression: CEO salary is a continuous (quantitative) variable

Inference: We want to understand the nature of the relationship between the factors and CEO salary (which ones affect it), rather than predict CEO salary itself.

n = 500 (the number of companies)

p = 3 (profit, number of employees, industry); CEO salary is the response, not a predictor

  2. Classification: We're looking at a categorical response (success vs. failure)

Prediction: We want to know whether (given the regressors) the company will be a success or failure

n = 20 (the number of similar products)

p = 13 (price charged, marketing budget, competition price, + 10 others); success vs. failure is the response, not a predictor

  3. Regression: We're looking at a quantitative variable (% change)

Prediction: We're predicting the % change in the dollar

n = 52 (number of weeks in a year)

p = 3 (% change in the US market, % change in the British market, % change in the German market); the % change in the dollar is the response

Q6)

Parametric approaches to statistical learning make a priori assumptions about the generating distribution. This makes the problem of estimating f a bit easier because it reduces to estimating a set of parameters. For example, we might assume that there is a linear relationship between our independent and dependent variables. This kind of approach is preferred if we need to make inferences about the nature of the relationships among the independent variables (the function is more interpretable).

Alternatively, non-parametric methods make no explicit assumptions about the underlying form of f. This allows us to fit a wider range of possible functions for f, possibly getting us closer to the actual form. But we run the risk of overfitting the data if we have a small number of observations.
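A small illustration of this contrast on made-up data (the variables x and y and the choice of lm vs. loess are my own, not from the exercise): the parametric fit returns a couple of interpretable coefficients, while the non-parametric fit has no fixed functional form and only gives us predictions.

#Parametric (lm) vs non-parametric (loess) on simulated data
set.seed(2)
x = runif(100, 0, 10)
y = 2 + 3 * x + rnorm(100)
fit.param = lm(y ~ x)        # assumes f is linear; just two parameters to estimate
fit.nonparam = loess(y ~ x)  # no assumed form for f; fits locally
coef(fit.param)              # interpretable: intercept and slope (roughly 2 and 3 here)
predict(fit.nonparam, newdata = data.frame(x = 5))  # predictions only, no coefficients to report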

Section 3.7

Q3)

  1. We are left with the equation

salary = 50 + 20xGPA + 0.07xIQ + 35xGR + 0.01(GPAxIQ) - 10(GPAxGR), where GR = 1 for females and 0 for males

  Answer iii is correct in this case: for a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is high enough. This is driven by the negative interaction between GPA and gender: the female-minus-male difference in expected salary is 35 - 10xGPA, which turns negative once GPA exceeds 3.5, so females who have high GPAs suffer (the -10 coefficient) compared to males.
#Part b from Q3, 3.7 below:

#Expected salary from the model above; GR = 1 for female, 0 for male
salary.est = function(GPA, GR, IQ) {
  return(50 + 20*GPA + 0.07*IQ + 35*GR +
           0.01*GPA*IQ - 10*GPA*GR)
}

#Salaries across GPAs from 1.0 to 4.0 at IQ = 100, females (GR = 1)
salary.test = sapply(seq(from=1.0, to=4.0, by=.5), FUN=function(i){
  salary.est(i, 1, 100)
})
female.salaries = salary.test

#Same GPA grid for males (GR = 0)
salary.test = sapply(seq(from=1.0, to=4.0, by=.5), FUN=function(i){
  salary.est(i, 0, 100)
})
male.salaries = salary.test

plot(male.salaries, female.salaries,
     xlim=c(75, 150), ylim=c(75,150),
     xlab="Male Salaries", ylab="Female Salaries")
abline(a = 92, b = 1, col="red")

salary.est(4.0, 1, 100)
## [1] 136
salary.est(4.0, 0, 100)
## [1] 141
  3. False: We saw in the Advertising example that when we added a TVxRadio interaction we saw improved model fit. The coefficient for this interaction was quite small (0.0011) but came with a small p-value (p < 0.0001). We need to examine the p-value for the interaction, not the size of its coefficient, to assess whether it improves the model. In this case we might expect a strong relationship between these two predictors, which makes it especially important to assess their joint contribution; a small sketch of this point follows.
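A minimal sketch of that point on invented data (the variables tv, radio, sales and the coefficient values are made up for illustration, not the Advertising fit): an interaction coefficient can be tiny in magnitude and still be strongly significant.

#Simulated data where the interaction effect is small in magnitude but real
set.seed(3)
tv = runif(200, 0, 300)
radio = runif(200, 0, 50)
sales = 3 + 0.04*tv + 0.2*radio + 0.001*tv*radio + rnorm(200)
fit = lm(sales ~ tv * radio)
summary(fit)$coefficients["tv:radio", ]  # small estimate, very small p-value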

Section 4.7

Q8)

  1. Given how little we know about the data in this circumstance it's hard to say definitively. If the underlying decision boundary is linear, then we would prefer the less flexible logistic regression; the jump from a 20% training error to a 30% test error hints that the boundary may not be linear. KNN with K = 1 is a very flexible model, but with K = 1 the training error is 0% by construction, so an 18% average of training and test error implies a test error of roughly 36%, which is worse than the logistic regression's 30%. Some of the logistic model's 30% test error may also reflect irreducible error that should even out across more unseen data. On the numbers given, logistic regression looks like the safer choice for new observations, though with so little information about the nature of the classification problem (we're assuming it's binary) I'd hesitate to make a strong recommendation either way. A quick check of the arithmetic follows.
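A quick check of that arithmetic, using only the error rates given in the problem (the variable names are mine):

knn.train.err = 0.00                           # K = 1 classifies each training point by its own label
knn.avg.err = 0.18                             # given: average of training and test error for KNN
knn.test.err = 2*knn.avg.err - knn.train.err   # implied KNN test error: 0.36
logistic.test.err = 0.30
knn.test.err > logistic.test.err               # TRUE: logistic regression has the lower test error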

Applied—————————————–>

Section 2.4

Q9)

auto=read.table(file="/Users/benpeloquin/Desktop/Spring_2015/CME250/HW/auto.data", header=T, na.strings="?")
auto = na.omit(auto)
auto$cylinders = as.factor(auto$cylinders)
auto$origin = as.factor(auto$origin)
names(auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"  
## [5] "weight"       "acceleration" "year"         "origin"      
## [9] "name"
#a) Quantitative vs Qualitative vars:
cols = colnames(auto)
quants = quals = c()
for (i in 1:length(cols)) {
  if(is.numeric(auto[,cols[i]])) {
    print(paste(cols[i], ": numeric"))
    quants = cbind(quants, cols[i])
  } else {
    print(paste(cols[i], ": factor"))
    quals = cbind(quals, cols[i])
  }
}
## [1] "mpg : numeric"
## [1] "cylinders : factor"
## [1] "displacement : numeric"
## [1] "horsepower : numeric"
## [1] "weight : numeric"
## [1] "acceleration : numeric"
## [1] "year : numeric"
## [1] "origin : factor"
## [1] "name : factor"
#b) Range of quants:
for (i in 1:ncol(quants)) {
  if(is.numeric(auto[,quants[i]])) {
    print(paste(quants[i], "range: ",
                range(auto[,quants[i]])[1],
                "-",
                range(auto[,quants[i]])[2]))
  }
}
## [1] "mpg range:  9 - 46.6"
## [1] "displacement range:  68 - 455"
## [1] "horsepower range:  46 - 230"
## [1] "weight range:  1613 - 5140"
## [1] "acceleration range:  8 - 24.8"
## [1] "year range:  70 - 82"
m.sd.print = function(x, data) {
  for (i in 1:ncol(x)) {
    cat(paste(x[i], "-->\n",
                "mean:", mean(data[, x[1, i]]), "\n",
                "sd:", sd(data[, x[1, i]]), "\n"))
  }
}
m.sd.print(quants, auto)
## mpg -->
##  mean: 23.4459183673469 
##  sd: 7.8050074865718 
## displacement -->
##  mean: 194.411989795918 
##  sd: 104.644003908905 
## horsepower -->
##  mean: 104.469387755102 
##  sd: 38.4911599328285 
## weight -->
##  mean: 2977.58418367347 
##  sd: 849.402560042949 
## acceleration -->
##  mean: 15.5413265306122 
##  sd: 2.75886411918808 
## year -->
##  mean: 75.9795918367347 
##  sd: 3.68373654357783
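#d) Remove a block of observations and recompute the means and sds: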
auto.rm = auto[c(1:9, 85:nrow(auto)),]
m.sd.print(quants,auto.rm)
## mpg -->
##  mean: 24.3684542586751 
##  sd: 7.88089834213537 
## displacement -->
##  mean: 187.753943217666 
##  sd: 99.9394881404822 
## horsepower -->
##  mean: 100.955835962145 
##  sd: 35.8955667771312 
## weight -->
##  mean: 2939.64353312303 
##  sd: 812.649629297648 
## acceleration -->
##  mean: 15.7182965299685 
##  sd: 2.69381257754339 
## year -->
##  mean: 77.1324921135647 
##  sd: 3.11002631924956
names(auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"  
## [5] "weight"       "acceleration" "year"         "origin"      
## [9] "name"
attach(auto)
pairs(~ mpg + cylinders + weight + horsepower + year, col="blue",
      main="mpg, cylinders, weight, horsepower, year")

plot(as.numeric(cylinders), mpg,
     main ="MPG by Cylinders",
     xlab = "# of cylinders", ylab = "MPG",
     col = "blue", pch = 10)

plot(mpg ~ horsepower, col = "blue", pch = 10,
     main = "MPG by Horsepower")
m1 = lm(mpg ~ horsepower)
abline(m1, col = "red")

plot(horsepower ~ weight, col = "blue", pch = 10,
     main = "Horsepower ~ Weight")
m2 = lm(horsepower ~ weight)
abline(m2, col = "red")

#f)
names(auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"  
## [5] "weight"       "acceleration" "year"         "origin"      
## [9] "name"
m3 = lm(mpg ~ horsepower + cylinders +
          weight + year + acceleration + displacement)
anova(m3)
## Analysis of Variance Table
## 
## Response: mpg
##               Df  Sum Sq Mean Sq   F value Pr(>F)    
## horsepower     1 14433.1 14433.1 1417.2300 <2e-16 ***
## cylinders      4  2349.2   587.3   57.6700 <2e-16 ***
## weight         1   893.3   893.3   87.7139 <2e-16 ***
## year           1  2242.9  2242.9  220.2365 <2e-16 ***
## acceleration   1     0.7     0.7    0.0686 0.7935    
## displacement   1     9.5     9.5    0.9339 0.3345    
## Residuals    382  3890.3    10.2                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m4 = lm(mpg ~ horsepower * cylinders * weight * year)
anova(m4)
## Analysis of Variance Table
## 
## Response: mpg
##                                   Df  Sum Sq Mean Sq   F value    Pr(>F)
## horsepower                         1 14433.1 14433.1 2014.2068 < 2.2e-16
## cylinders                          4  2349.2   587.3   81.9623 < 2.2e-16
## weight                             1   893.3   893.3  124.6615 < 2.2e-16
## year                               1  2242.9  2242.9  313.0063 < 2.2e-16
## horsepower:cylinders               4   675.6   168.9   23.5694 < 2.2e-16
## horsepower:weight                  1   132.9   132.9   18.5457  2.14e-05
## cylinders:weight                   4    40.9    10.2    1.4256  0.224883
## horsepower:year                    1   264.3   264.3   36.8900  3.17e-09
## cylinders:year                     3    29.1     9.7    1.3522  0.257235
## weight:year                        1     8.6     8.6    1.2052  0.273008
## horsepower:cylinders:weight        2    10.3     5.1    0.7161  0.489360
## horsepower:cylinders:year          2    79.2    39.6    5.5294  0.004312
## horsepower:weight:year             1     0.1     0.1    0.0076  0.930704
## cylinders:weight:year              2    45.2    22.6    3.1538  0.043867
## horsepower:cylinders:weight:year   2    27.6    13.8    1.9252  0.147347
## Residuals                        361  2586.8     7.2                    
##                                     
## horsepower                       ***
## cylinders                        ***
## weight                           ***
## year                             ***
## horsepower:cylinders             ***
## horsepower:weight                ***
## cylinders:weight                    
## horsepower:year                  ***
## cylinders:year                      
## weight:year                         
## horsepower:cylinders:weight         
## horsepower:cylinders:year        ** 
## horsepower:weight:year              
## cylinders:weight:year            *  
## horsepower:cylinders:weight:year    
## Residuals                           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1