Lab 1: All exercises, Chapter 2

  1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
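  1. The sample size n is extremely large, and the number of predictors p is small. A flexible method is generally better: with so many observations it can capture the true relationship closely with little risk of overfitting.

  2. The number of predictors p is extremely large, and the number of observations n is small. A flexible method is generally worse: with few observations and many predictors it is likely to overfit, so an inflexible method is preferable.

  3. The relationship between the predictors and response is highly non-linear. A flexible method is generally better, since it can adapt to a non-linear relationship that an inflexible method would miss.

  4. The variance of the error terms, i.e. σ² = Var(ε), is extremely high. A flexible method is generally worse: it would fit the noise in the error terms and so increase variance, so an inflexible method is preferable.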

  1. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.
  1. We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary. Regression problem; the main interest is inference. n = 500, p = 3 (profit, number of employees, industry; CEO salary is the response).

  2. We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables. Classification problem; the main interest is prediction. n = 20, p = 13 (price, marketing budget, competition price, and ten other variables; success or failure is the response).

  3. We are interested in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market. Regression problem; the main interest is prediction. n = 52, p = 3 (% change in the US, British, and German markets; % change in the dollar is the response).

  1. We now revisit the bias-variance decomposition.
  1. Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.
[Figure: sketch of the five curves against flexibility.]

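  1. Explain why each of the five curves has the shape displayed in part (a).

Bayes (irreducible) error: a horizontal line, since it does not depend on the method; it is a lower bound on the expected test error.

Test error: U-shaped; it falls at first as bias decreases, then rises once the increase in variance outweighs the reduction in bias.

Training error: decreases steadily as flexibility increases, because a more flexible method fits the training data more closely.

Squared bias: decreases as flexibility increases, because a more flexible method can approximate the true relationship better.

Variance: increases as flexibility increases, because a more flexible fit changes more from one training set to another.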
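The behavior of squared bias and variance can be checked with a small simulation. The following is only a sketch under stated assumptions: the true function `f`, the noise level `sigma`, the test point `x0`, and the use of polynomial degree as the measure of flexibility are all choices made for illustration and are not part of the exercise.

```r
# Simulation sketch: estimate squared bias and variance of polynomial fits
# at a single test point x0, for increasing flexibility (polynomial degree).
set.seed(1)
f <- function(x) sin(2 * x)   # assumed "true" regression function
sigma <- 0.5                  # assumed noise standard deviation
x0 <- 1                       # test point at which predictions are compared
degrees <- c(1, 3, 9)         # from inflexible to flexible
n_sims <- 500                 # number of simulated training sets
n <- 50                       # observations per training set

results <- sapply(degrees, function(d) {
  # Prediction at x0 from a fresh training set, repeated n_sims times
  preds <- replicate(n_sims, {
    x <- runif(n, 0, 3)
    y <- f(x) + rnorm(n, sd = sigma)
    fit <- lm(y ~ poly(x, d))
    predict(fit, newdata = data.frame(x = x0))
  })
  c(bias2 = (mean(preds) - f(x0))^2, variance = var(preds))
})
colnames(results) <- paste0("degree_", degrees)
results
```

Under these assumptions the squared bias should shrink and the variance should grow as the degree increases, which matches the shapes described for those two curves.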
  1. You will now think of some real-life applications for statistical learning.
  1. Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer. a.1) Deciding whether to stock a car of a given color, based on the color and past sales.

a.2) Predicting how a person will behave based on their education and circle of friends.

a.3) Identifying whether someone will default on a loan based on the loan type and amount.

  1. Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer. b.1) Comparing products to estimate market prices.

b.2) Estimating the best value based on past events.

b.3) Determining the optimal number of students in a classroom so that everyone learns.

  1. Describe three real-life applications in which cluster analysis might be useful. c.1) Grouping people by their sexual preferences.

c.2) Determining whether or not people would buy a given product.

c.3) Grouping colors.

  1. What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred? A very flexible method fits the training data very closely, which can lead to overfitting and poor predictions on new data. A more flexible approach is preferable when the relationship is non-linear and the sample is large; a less flexible approach is preferable when the sample is small or when interpretability matters.

  2. Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages? A parametric approach makes assumptions about the functional form of f, whereas a non-parametric approach does not. The advantage of a parametric approach is that it can predict accurately when the data are good and the assumed function is a good one. Its disadvantage is that a suitable function has to be found first, and a poorly chosen one gives a poor fit.
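The contrast between the two approaches can be illustrated with a minimal sketch on simulated data. The data-generating function, the noise level, and the `span` value are assumptions chosen for illustration; `lm()` stands in for a parametric method and `loess()` for a non-parametric one.

```r
# Simulated data with a non-linear signal (assumed for illustration only)
set.seed(1)
d <- data.frame(x = seq(0, 3, length.out = 100))
d$y <- sin(2 * d$x) + rnorm(100, sd = 0.3)

# Parametric: assumes f is linear, so only two coefficients are estimated
fit_lm <- lm(y ~ x, data = d)

# Non-parametric: local regression, no global functional form is assumed
fit_loess <- loess(y ~ x, data = d, span = 0.3)

# Training mean squared error of each fit
mse <- c(
  linear = mean(residuals(fit_lm)^2),
  loess  = mean(residuals(fit_loess)^2)
)
mse
```

Here the linear form is a poor match for the simulated signal, so the non-parametric fit tracks the training data more closely. With a truly linear signal the parametric model would need far fewer observations to do well.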

  3. The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

Obs.  X1  X2  X3  Y
1      0   3   0  Red
2      2   0   0  Red
3      0   1   3  Red
4      0   1   2  Green
5     −1   0   1  Green
6      1   1   1  Red

Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.

(a) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.

library(plot3D)
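# Alternate the origin (the test point) with each observation so that
# lines3D() draws a segment from the test point to every observation
x1 <- c(0,0,0,2,0,0,0,0,0,-1,0,1)
x2 <- c(0,3,0,0,0,1,0,1,0,0,0,1)
x3 <- c(0,0,0,0,0,3,0,2,0,1,0,1)
lines3D(x1,x2,x3)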
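The plot shows the segments but not their lengths. A minimal sketch of the distance computation itself, using the six observations from the table (the names `train`, `test_point`, and `dists` are chosen here for illustration):

```r
# Training data from the table in the exercise
train <- data.frame(
  X1 = c(0, 2, 0, 0, -1, 1),
  X2 = c(3, 0, 1, 1, 0, 1),
  X3 = c(0, 0, 3, 2, 1, 1),
  Y  = c("Red", "Red", "Red", "Green", "Green", "Red")
)
test_point <- c(0, 0, 0)

# Subtract the test point from every row, square, sum, and take the root
X <- as.matrix(train[, c("X1", "X2", "X3")])
dists <- sqrt(rowSums(sweep(X, 2, test_point)^2))
round(dists, 3)
```

The distances for observations 1 to 6 are 3, 2, √10 ≈ 3.162, √5 ≈ 2.236, √2 ≈ 1.414, and √3 ≈ 1.732.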

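  1. What is our prediction with K = 1? Why? Green: the single nearest neighbor is observation 5 (distance √2 ≈ 1.41), which is Green.

  2. What is our prediction with K = 3? Why? Red: the three nearest neighbors are observations 5 (Green), 6 (Red) and 2 (Red), so Red wins the majority vote.

  3. If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for K to be large or small? Why? Small: a highly non-linear boundary calls for a flexible classifier, and smaller values of K give a more flexible fit. A large K averages over many neighbors and produces a smoother, closer-to-linear boundary.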
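The K = 1 and K = 3 predictions can be reproduced with a short majority-vote sketch. The helper `knn_vote()` is a hypothetical function written for this illustration, not a library routine.

```r
# Training data from the table; the test point is the origin
train <- data.frame(
  X1 = c(0, 2, 0, 0, -1, 1),
  X2 = c(3, 0, 1, 1, 0, 1),
  X3 = c(0, 0, 3, 2, 1, 1),
  Y  = c("Red", "Red", "Red", "Green", "Green", "Red")
)
# Distance to the origin is the root of the sum of squared coordinates
dists <- sqrt(rowSums(as.matrix(train[, c("X1", "X2", "X3")])^2))

# Hypothetical helper: majority vote among the k nearest observations
knn_vote <- function(k) {
  nearest <- order(dists)[seq_len(k)]
  votes <- table(train$Y[nearest])
  names(votes)[which.max(votes)]
}

knn_vote(1)  # nearest is observation 5
knn_vote(3)  # nearest are observations 5, 6 and 2
```

With these data `knn_vote(1)` returns "Green" and `knn_vote(3)` returns "Red".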

Applied

Q8. This exercise relates to the “College” data set, which can be found in the file “College.csv”. It contains a number of variables for 777 different universities and colleges in the US.

Use the read.csv() function to read the data into R. Call the loaded data “college”. Make sure that you have the directory set to the correct location for the data.

library(ISLR)
data(College)
college <- read.csv("College.csv")

Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later.

head(college[, 1:5])
##                              X Private Apps Accept Enroll
## 1 Abilene Christian University     Yes 1660   1232    721
## 2           Adelphi University     Yes 2186   1924    512
## 3               Adrian College     Yes 1428   1097    336
## 4          Agnes Scott College     Yes  417    349    137
## 5    Alaska Pacific University     Yes  193    146     55
## 6            Albertson College     Yes  587    479    158
rownames <- college[, 1]
college <- college[, -1]
head(college[, 1:5])
##   Private Apps Accept Enroll Top10perc
## 1     Yes 1660   1232    721        23
## 2     Yes 2186   1924    512        16
## 3     Yes 1428   1097    336        22
## 4     Yes  417    349    137        60
## 5     Yes  193    146     55        16
## 6     Yes  587    479    158        38
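An alternative to keeping the names in a separate vector is to store them as the row names of the data frame, so each name stays attached to its row. A minimal sketch on a hypothetical three-row data frame (toy values, not the College data):

```r
# Hypothetical toy data standing in for the first columns of College.csv
toy <- data.frame(
  X = c("College A", "College B", "College C"),
  Private = c("Yes", "No", "Yes"),
  Apps = c(1200, 3400, 560),
  stringsAsFactors = FALSE
)

rownames(toy) <- toy$X    # keep the names as row labels
toy <- toy[, -1]          # drop the name column from the data
toy["College B", "Apps"]  # look up a value by university name
```

This also avoids creating a variable called `rownames`, which has the same name as the base R function.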

Use the summary() function to produce a numerical summary of the variables in the data set.

summary(college)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00

Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data.

pairs(college[, 1:10])

Use the plot() function to produce side-by-side boxplots of “Outstate” versus “Private”.

plot(college$Private, college$Outstate, xlab = "Private University", ylab ="Out of State tuition in USD", main = "Outstate Tuition Plot")

Create a new qualitative variable, called “Elite”, by binning the “Top10perc” variable. Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of “Outstate” versus “Elite”.

Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college$Elite <- Elite
summary(college$Elite)
##  No Yes 
## 699  78
plot(college$Elite, college$Outstate, xlab = "Elite University", ylab ="Out of State tuition in USD", main = "Outstate Tuition Plot")
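The same binning can be written in one step with `ifelse()`. A minimal sketch on a hypothetical vector of percentages (the values are made up for illustration):

```r
# Hypothetical Top10perc values
top10 <- c(12, 55, 50, 80, 23)

# "Yes" when more than 50% of students come from the top 10% of their class
elite <- factor(ifelse(top10 > 50, "Yes", "No"), levels = c("No", "Yes"))
table(elite)
```

Note that a value of exactly 50 is labeled "No", because the comparison is strict.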

Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables.

par(mfrow = c(2,2))
hist(college$Books, col = 2, xlab = "Books", ylab = "Count")
hist(college$PhD, col = 3, xlab = "PhD", ylab = "Count")
hist(college$Grad.Rate, col = 4, xlab = "Grad Rate", ylab = "Count")
hist(college$perc.alumni, col = 6, xlab = "% alumni", ylab = "Count")
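The calls above all use the default binning. To vary the number of bins, `hist()` accepts a `breaks` argument; a single number is treated as a suggestion rather than an exact count. A minimal sketch on simulated values (a hypothetical stand-in for a quantitative column):

```r
# Simulated values standing in for a quantitative variable
set.seed(42)
x <- rnorm(200, mean = 65, sd = 15)

# Same data drawn with three different suggested numbers of bins
par(mfrow = c(1, 3))
for (b in c(5, 15, 40)) {
  hist(x, breaks = b, main = paste("breaks =", b), xlab = "x", ylab = "Count")
}
par(mfrow = c(1, 1))  # restore the default single-panel layout
```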

Continue exploring the data, and provide a brief summary of what you discover.

summary(college$PhD)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00   62.00   75.00   72.66   85.00  103.00

It is a little odd for a university to report that 103% of its faculty hold PhDs. Let us see how many universities have this value and what they are called.

weird.phd <- college[college$PhD == 103, ]
nrow(weird.phd)
## [1] 1
rownames[as.numeric(rownames(weird.phd))]
## [1] Texas A&M University at Galveston
## 777 Levels: Abilene Christian University ... York College of Pennsylvania
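The summary earlier also shows a maximum "Grad.Rate" of 118, so the same check is worth running on every percentage column. A minimal sketch on a hypothetical toy data frame (the rows and values are made up; only the column names match the College data):

```r
# Hypothetical toy data with two percentage columns
toy <- data.frame(
  PhD = c(75, 103, 88),
  Grad.Rate = c(118, 60, 95),
  row.names = c("College A", "College B", "College C")
)

# For each percentage column, list the rows whose value exceeds 100
pct_cols <- c("PhD", "Grad.Rate")
over_100 <- lapply(toy[pct_cols], function(col) rownames(toy)[col > 100])
over_100
```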

Q9. This exercise involves the “Auto” data set studied in the lab. Make sure the missing values have been removed from the data.

Which of the predictors are quantitative, and which are qualitative ?

auto <- read.csv("Auto.csv", na.strings = "?")
auto <- na.omit(auto)
str(auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:5] 33 127 331 337 355
##   .. ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...

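All variables except “name” are quantitative, although “origin” (and arguably “cylinders” and “year”) takes only a few distinct values and can be treated as qualitative. Note that the summaries below drop column 4 (“horsepower”) along with “name”, even though str() shows that “horsepower” is an integer variable.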
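One way to support this kind of classification is to check how each column is stored and how many distinct values it takes. A minimal sketch on a hypothetical toy data frame (made-up rows that only borrow three column names from the Auto data):

```r
# Hypothetical toy data
toy <- data.frame(
  mpg = c(18, 15, 30),
  origin = c(1L, 1L, 3L),
  name = c("car a", "car b", "car c"),
  stringsAsFactors = TRUE
)

is_num <- sapply(toy, is.numeric)
names(toy)[is_num]    # columns stored as numbers
names(toy)[!is_num]   # factor or character columns

# A numeric column with very few distinct values may really be a category code
sapply(toy, function(col) length(unique(col)))
```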

What is the range of each quantitative predictor ?

summary(auto[, -c(4, 9)])
##       mpg          cylinders      displacement       weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :5140  
##   acceleration        year           origin     
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000  
##  Median :15.50   Median :76.00   Median :1.000  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000

What is the mean and standard deviation of each quantitative predictor ?

sapply(auto[, -c(4, 9)], mean)
##          mpg    cylinders displacement       weight acceleration 
##    23.445918     5.471939   194.411990  2977.584184    15.541327 
##         year       origin 
##    75.979592     1.576531
sapply(auto[, -c(4, 9)], sd)
##          mpg    cylinders displacement       weight acceleration 
##    7.8050075    1.7057832  104.6440039  849.4025600    2.7588641 
##         year       origin 
##    3.6837365    0.8055182

Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains ?

subset <- auto[-c(10:85), -c(4,9)]
sapply(subset, range)
##       mpg cylinders displacement weight acceleration year origin
## [1,] 11.0         3           68   1649          8.5   70      1
## [2,] 46.6         8          455   4997         24.8   82      3
sapply(subset, mean)
##          mpg    cylinders displacement       weight acceleration 
##    24.404430     5.373418   187.240506  2935.971519    15.726899 
##         year       origin 
##    77.145570     1.601266
sapply(subset, sd)
##          mpg    cylinders displacement       weight acceleration 
##     7.867283     1.654179    99.678367   811.300208     2.693721 
##         year       origin 
##     3.106217     0.819910

Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

auto$cylinders <- as.factor(auto$cylinders)
auto$year <- as.factor(auto$year)
auto$origin <- as.factor(auto$origin)
pairs(auto)

Four-cylinder vehicles appear to get more miles per gallon than the others. Weight, displacement and horsepower appear to have an inverse relationship with mpg. There is a general increase in mpg over the years; it almost doubled in a decade. Japanese cars have higher mpg than US or European cars.

auto$horsepower <- as.numeric(auto$horsepower)
cor(auto$weight, auto$horsepower)
## [1] 0.8645377
cor(auto$weight, auto$displacement)
## [1] 0.9329944
cor(auto$displacement, auto$horsepower)
## [1] 0.897257

Q10. This exercise involves the “Boston” housing data set.

To begin, load in the “Boston” data set.

library(MASS)
Boston$chas <- as.factor(Boston$chas)
nrow(Boston)
## [1] 506
ncol(Boston)
## [1] 14

Make some pairwise scatterplots of the predictors in this data set.

par(mfrow = c(2, 2))
plot(Boston$nox, Boston$crim)
plot(Boston$rm, Boston$crim)
plot(Boston$age, Boston$crim)
plot(Boston$dis, Boston$crim)

Are any of the predictors associated with per capita crime rate ?

hist(Boston$crim, breaks = 50)

Most suburbs have a crime rate close to zero.

pairs(Boston[Boston$crim < 20, ])

There may be a relationship between crim and nox, rm, age, dis, lstat and medv.

Do any of the suburbs of Boston appear to have particularly high crime rates ? Tax rates ? Pupil-teacher ratios ?

hist(Boston$crim, breaks = 50)

nrow(Boston[Boston$crim > 20, ])
## [1] 18
hist(Boston$tax, breaks = 50)

nrow(Boston[Boston$tax == 666, ])
## [1] 132
hist(Boston$ptratio, breaks = 50)

nrow(Boston[Boston$ptratio > 20, ])
## [1] 201
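Counts like these can also be obtained without subsetting the data frame, since a logical vector sums to the number of TRUE values and its mean is the proportion. A minimal sketch on a hypothetical vector of ratios:

```r
# Hypothetical pupil-teacher ratios
ptratio <- c(15.3, 20.2, 21.0, 18.7, 20.2)

sum(ptratio > 20)    # how many values exceed 20
mean(ptratio > 20)   # what share of values exceed 20
```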

How many of the suburbs in this data set bound the Charles river ?

nrow(Boston[Boston$chas == 1, ])
## [1] 35

What is the median pupil-teacher ratio among the towns in this data set ?

median(Boston$ptratio)
## [1] 19.05

Which suburb of Boston has lowest median value of owner-occupied homes ? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors ?

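# which.min() gives the row position of the smallest medv;
# min() returns the value itself (5), which would select row 5 instead
row.names(Boston[which.min(Boston$medv), ])
## [1] "399"
range(Boston$tax)
## [1] 187 711
Boston[which.min(Boston$medv), ]$tax
## [1] 666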
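When selecting the row with the lowest value, note that `min()` returns the smallest value while `which.min()` returns its position; only the position is a valid row index for this purpose. A minimal sketch on a hypothetical four-row data frame (toy values, not the Boston data):

```r
# Hypothetical toy data
toy <- data.frame(medv = c(24, 7, 5, 31), tax = c(296, 242, 666, 222))

min(toy$medv)                # the smallest value
which.min(toy$medv)          # the row position of that value
toy[which.min(toy$medv), ]   # the row with the lowest medv

# toy[min(toy$medv), ] would ask for row 5, which does not exist here,
# so it would return a row of NA values rather than the intended row
```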

In this data set, how many of the suburbs average more than seven rooms per dwelling ? More than eight rooms per dwelling ?

nrow(Boston[Boston$rm > 7, ])
## [1] 64
nrow(Boston[Boston$rm > 8, ])
## [1] 13