The sample size n is extremely large, and the number of predictors p is small. A flexible method is better: with so many observations, a flexible fit can capture the signal without overfitting.
The number of predictors p is extremely large, and the number of observations n is small. An inflexible method is better: a flexible method would overfit the few observations available.
The relationship between the predictors and response is highly non-linear. A flexible method is better, since it can adapt to the non-linear shape of f.
The variance of the error terms, i.e. σ² = Var(ε), is extremely high. An inflexible method is better: a flexible method would end up fitting the noise in the errors.
We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary. Regression (inference) problem with n = 500 and p = 3 (CEO salary is the response).
We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables. Classification (prediction) problem with n = 20 and p = 13 (success/failure is the response).
We are interested in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market. Regression (prediction) problem with n = 52 and p = 3 (% change in the dollar is the response).
a.2) Classification: predicting how a person will behave based on their education and circle of friends.
a.3) Classification: identifying whether someone will default, based on the type of loan and its amount.
b.2) Regression: estimating the best value of a quantity based on past events.
b.3) Regression: determining the optimal number of students in a classroom so that everyone learns.
c.2) Clustering: grouping consumers by whether or not they would buy a given product.
c.3) Clustering: grouping items by color.
What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred? A very flexible approach follows the training data closely, which lowers bias but raises variance and the risk of overfitting; a less flexible approach is easier to interpret and less prone to overfitting, but may miss non-linear structure. A flexible approach is preferred when n is large or the true relationship is highly non-linear; a less flexible approach is preferred when n is small, the noise is high, or interpretability and inference are the goal.
Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages? A parametric approach assumes a functional form for f and reduces the problem to estimating a fixed set of parameters, whereas a non-parametric approach makes no such assumption. The advantage of the parametric approach is that estimating a few parameters is simpler and requires fewer observations; the disadvantage is that if the assumed form is far from the true f, the resulting estimates and predictions will be poor.
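To illustrate the trade-off concretely, here is a minimal sketch on simulated (entirely made-up) data comparing an inflexible parametric fit (lm) with a flexible non-parametric one (smooth.spline); because the true relationship is non-linear and n is reasonably large, the flexible fit should achieve the lower test MSE.
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.5)      # non-linear truth plus noise
dat <- data.frame(x, y)
train <- sample(n, n / 2)
fit.lm <- lm(y ~ x, data = dat[train, ])                       # inflexible, parametric
fit.ss <- smooth.spline(dat$x[train], dat$y[train], df = 20)   # flexible, non-parametric
mse <- function(pred, obs) mean((obs - pred)^2)
mse(predict(fit.lm, dat[-train, ]), dat$y[-train])     # test MSE of the linear fit
mse(predict(fit.ss, dat$x[-train])$y, dat$y[-train])   # test MSE of the smoothing spline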
The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

Obs.  X1  X2  X3  Y
1      0   3   0  Red
2      2   0   0  Red
3      0   1   3  Red
4      0   1   2  Green
5     −1   0   1  Green
6      1   1   1  Red

Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.

(a) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.
library(plot3D)
## Warning: package 'plot3D' was built under R version 3.3.3
# Interleave the test point (the origin) with each observation, so the connected
# line drawn by lines3D() passes back through the origin between observations,
# i.e. it draws a spoke from the test point to each observation.
x1 <- c(0,0,0,2,0,0,0,0,0,-1,0,1)
x2 <- c(0,3,0,0,0,1,0,1,0,0,0,1)
x3 <- c(0,0,0,0,0,3,0,2,0,1,0,1)
lines3D(x1, x2, x3)
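Part (a) asks for the distances themselves; a small sketch computing them directly (the observation coordinates are taken from the table above):
obs <- data.frame(X1 = c(0, 2, 0, 0, -1, 1),
                  X2 = c(3, 0, 1, 1, 0, 1),
                  X3 = c(0, 0, 3, 2, 1, 1),
                  Y  = c("Red", "Red", "Red", "Green", "Green", "Red"))
# Euclidean distance from each observation to the test point (0, 0, 0)
dist <- sqrt(obs$X1^2 + obs$X2^2 + obs$X3^2)
round(dist, 2)   # 3.00 2.00 3.16 2.24 1.41 1.73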
What is our prediction with K = 1? Why? Green, because the single nearest neighbor is observation 5 (distance √2 ≈ 1.41), which is Green.
What is our prediction with K = 3? Why? Red: the three nearest neighbors are observations 5 (√2, Green), 6 (√3, Red) and 2 (2, Red), so the majority vote is Red.
If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for K to be large or small? Why? Small: a small K gives a flexible, locally determined decision boundary that can follow a highly non-linear Bayes boundary, whereas a large K averages over many neighbors and produces a smoother, nearly linear boundary.
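These predictions can be double-checked with knn() from the class package (a sketch, reusing the obs data frame built in the distance computation above):
library(class)
knn(train = obs[, c("X1", "X2", "X3")], test = data.frame(X1 = 0, X2 = 0, X3 = 0),
    cl = factor(obs$Y), k = 1)   # expected: Green
knn(train = obs[, c("X1", "X2", "X3")], test = data.frame(X1 = 0, X2 = 0, X3 = 0),
    cl = factor(obs$Y), k = 3)   # expected: Red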
Q8. This exercise relates to the “College” data set, which can be found in the file “College.csv”. It contains a number of variables for 777 different universities and colleges in the US.
Use the read.csv() function to read the data into R. Call the loaded data “college”. Make sure that you have the directory set to the correct location for the data.
library(ISLR)
data(College)
college <- read.csv("College.csv")
Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later.
head(college[, 1:5])
## X Private Apps Accept Enroll
## 1 Abilene Christian University Yes 1660 1232 721
## 2 Adelphi University Yes 2186 1924 512
## 3 Adrian College Yes 1428 1097 336
## 4 Agnes Scott College Yes 417 349 137
## 5 Alaska Pacific University Yes 193 146 55
## 6 Albertson College Yes 587 479 158
rownames <- college[, 1]
college <- college[, -1]
head(college[, 1:5])
## Private Apps Accept Enroll Top10perc
## 1 Yes 1660 1232 721 23
## 2 Yes 2186 1924 512 16
## 3 Yes 1428 1097 336 22
## 4 Yes 417 349 137 60
## 5 Yes 193 146 55 16
## 6 Yes 587 479 158 38
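The textbook version of this step stores the names as row names instead of in a separate vector; an equivalent alternative (in place of the rownames <- ... / college[, -1] lines above) would be roughly:
rownames(college) <- college[, 1]   # use the first column as row names
college <- college[, -1]            # then drop it from the data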
Use the summary() function to produce a numerical summary of the variables in the data set.
summary(college)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data.
pairs(college[, 1:10])
Use the plot() function to produce side-by-side boxplots of “Outstate” versus “Private”.
plot(college$Private, college$Outstate, xlab = "Private University", ylab ="Out of State tuition in USD", main = "Outstate Tuition Plot")
Create a new qualitative variable, called “Elite”, by binning the “Top10perc” variable. Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of “Outstate” versus “Elite”.
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college$Elite <- Elite
summary(college$Elite)
## No Yes
## 699 78
plot(college$Elite, college$Outstate, xlab = "Elite University", ylab ="Out of State tuition in USD", main = "Outstate Tuition Plot")
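The same binning can be written more compactly with cut(), which is equivalent to the rep()/replacement approach above (a sketch):
college$Elite <- cut(college$Top10perc, breaks = c(-Inf, 50, Inf), labels = c("No", "Yes"))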
Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables.
par(mfrow = c(2, 2))
# differing numbers of bins via the breaks argument
hist(college$Books, breaks = 10, col = 2, xlab = "Books", ylab = "Count")
hist(college$PhD, breaks = 20, col = 3, xlab = "PhD", ylab = "Count")
hist(college$Grad.Rate, breaks = 30, col = 4, xlab = "Grad Rate", ylab = "Count")
hist(college$perc.alumni, breaks = 50, col = 6, xlab = "% alumni", ylab = "Count")
Continue exploring the data, and provide a brief summary of what you discover.
summary(college$PhD)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 62.00 75.00 72.66 85.00 103.00
It is a little odd to have universities reporting that 103% of faculty hold PhDs; let us see how many universities report this percentage and what their names are.
weird.phd <- college[college$PhD == 103, ]
nrow(weird.phd)
## [1] 1
rownames[as.numeric(rownames(weird.phd))]
## [1] Texas A&M University at Galveston
## 777 Levels: Abilene Christian University ... York College of Pennsylvania
Q9. This exercise involves the “Auto” data set studied in the lab. Make sure the missing values have been removed from the data.
Which of the predictors are quantitative, and which are qualitative?
auto <- read.csv("Auto.csv", na.strings = "?")
auto <- na.omit(auto)
str(auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:5] 33 127 331 337 355
## .. ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
All of the variables are quantitative except "name"; "origin" and "cylinders" are better treated as qualitative (they are converted to factors further below). Columns 4 ("horsepower") and 9 ("name") are excluded from the numerical summaries that follow.
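A quick check of the stored types (a sketch):
sapply(auto, class)   # numeric/integer columns are quantitative; "name" is a factor (or character), i.e. qualitative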
What is the range of each quantitative predictor?
summary(auto[, -c(4, 9)])
## mpg cylinders displacement weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :5140
## acceleration year origin
## Min. : 8.00 Min. :70.00 Min. :1.000
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000
## Median :15.50 Median :76.00 Median :1.000
## Mean :15.54 Mean :75.98 Mean :1.577
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :24.80 Max. :82.00 Max. :3.000
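summary() shows the minimum and maximum; the ranges can also be pulled out directly (a sketch):
sapply(auto[, -c(4, 9)], range)   # one column of (min, max) per quantitative predictor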
What is the mean and standard deviation of each quantitative predictor?
sapply(auto[, -c(4, 9)], mean)
## mpg cylinders displacement weight acceleration
## 23.445918 5.471939 194.411990 2977.584184 15.541327
## year origin
## 75.979592 1.576531
sapply(auto[, -c(4, 9)], sd)
## mpg cylinders displacement weight acceleration
## 7.8050075 1.7057832 104.6440039 849.4025600 2.7588641
## year origin
## 3.6837365 0.8055182
Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
subset <- auto[-c(10:85), -c(4,9)]
sapply(subset, range)
## mpg cylinders displacement weight acceleration year origin
## [1,] 11.0 3 68 1649 8.5 70 1
## [2,] 46.6 8 455 4997 24.8 82 3
sapply(subset, mean)
## mpg cylinders displacement weight acceleration
## 24.404430 5.373418 187.240506 2935.971519 15.726899
## year origin
## 77.145570 1.601266
sapply(subset, sd)
## mpg cylinders displacement weight acceleration
## 7.867283 1.654179 99.678367 811.300208 2.693721
## year origin
## 3.106217 0.819910
Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
auto$cylinders <- as.factor(auto$cylinders)
auto$year <- as.factor(auto$year)
auto$origin <- as.factor(auto$origin)
pairs(auto)
Four-cylinder vehicles appear to get more miles per gallon than the others. Weight, displacement and horsepower all appear to have an inverse relationship with mpg. There is a general increase in mpg over the years; it nearly doubled in a decade. Japanese cars have higher mpg than US or European cars.
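A few targeted plots supporting these observations (a sketch; cylinders, year and origin were converted to factors above, so plot() produces boxplots for them):
par(mfrow = c(2, 2))
plot(auto$cylinders, auto$mpg, xlab = "Cylinders", ylab = "mpg")
plot(auto$weight, auto$mpg, xlab = "Weight", ylab = "mpg")
plot(auto$year, auto$mpg, xlab = "Year", ylab = "mpg")
plot(auto$origin, auto$mpg, xlab = "Origin (1 = US, 2 = Europe, 3 = Japan)", ylab = "mpg")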
auto$horsepower <- as.numeric(auto$horsepower)  # already read as numeric (na.strings = "?"), so this conversion is harmless
cor(auto$weight, auto$horsepower)
## [1] 0.8645377
cor(auto$weight, auto$displacement)
## [1] 0.9329944
cor(auto$displacement, auto$horsepower)
## [1] 0.897257
Q10. This exercise involves the “Boston” housing data set.
To begin, load in the “Boston” data set.
library(MASS)
Boston$chas <- as.factor(Boston$chas)
nrow(Boston)
## [1] 506
ncol(Boston)
## [1] 14
Make some pairwise scatterplots of the predictors in this data set.
par(mfrow = c(2, 2))
plot(Boston$nox, Boston$crim)
plot(Boston$rm, Boston$crim)
plot(Boston$age, Boston$crim)
plot(Boston$dis, Boston$crim)
Are any of the predictors associated with per capita crime rate?
hist(Boston$crim, breaks = 50)
Most suburbs have a per capita crime rate close to zero.
pairs(Boston[Boston$crim < 20, ])
There may be a relationship between crim and nox, rm, age, dis, lstat and medv.
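These apparent relationships can be quantified with correlations against crim (a sketch; chas is skipped because it was converted to a factor above):
num <- sapply(Boston, is.numeric)
sort(cor(Boston[, num])[, "crim"], decreasing = TRUE)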
Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios?
hist(Boston$crim, breaks = 50)
nrow(Boston[Boston$crim > 20, ])
## [1] 18
hist(Boston$tax, breaks = 50)
nrow(Boston[Boston$tax == 666, ])
## [1] 132
hist(Boston$ptratio, breaks = 50)
nrow(Boston[Boston$ptratio > 20, ])
## [1] 201
How many of the suburbs in this data set bound the Charles river?
nrow(Boston[Boston$chas == 1, ])
## [1] 35
What is the median pupil-teacher ratio among the towns in this data set?
median(Boston$ptratio)
## [1] 19.05
Which suburb of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors?
row.names(Boston[which.min(Boston$medv), ])  # which.min() gives the row index of the lowest medv (min() would return the value 5 and index row 5 by mistake)
## [1] "399"
range(Boston$tax)
## [1] 187 711
Boston[which.min(Boston$medv), ]$tax   # this suburb's tax rate is near the top of the range
## [1] 666
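To compare this suburb's other predictors against the overall ranges (a sketch, output not shown):
num <- sapply(Boston, is.numeric)
rbind(suburb = unlist(Boston[which.min(Boston$medv), num]),
      min    = sapply(Boston[, num], min),
      max    = sapply(Boston[, num], max))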
In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling?
nrow(Boston[Boston$rm > 7, ])
## [1] 64
nrow(Boston[Boston$rm > 8, ])
## [1] 13
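The book's version of this question also asks for comment on the suburbs that average more than eight rooms per dwelling; a quick way to inspect them (a sketch):
summary(Boston[Boston$rm > 8, ])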