2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.
a. We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
Regression, inference, because we want to understand which factors affect CEO salary. n = 500 firms, p = 3 predictors (profit, number of employees, industry)
b. We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
Classification, since success and failure are two categories; prediction, because we want to know the outcome for the new product. n = 20 products, p = 13 variables
c. We are interested in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market.
Regression, prediction: we want to know how the dollar will change, not which factors drive it. n = 52 weeks, p = 3 markets
3. We now revisit the bias-variance decomposition.
PENDING
4. You will now think of some real-life applications for statistical learning.
a. Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
Classifying which students are at risk, which are doing well, and which qualify for extras. Predictors: grades, attendance, attendance at tutoring sessions. Inference, since we want to know in which areas they can improve.
Which group of people to sell a product to. Predictors: buyers' salary, shops/places they frequent. Prediction: who would buy the product.
Which artist people will go to see at a multi-act concert. Predictors: where they bought their ticket, whether they took up any promotion. Prediction: whom they will go to see.
b. Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
Money someone will spend on their vacation. Predictors: time since their last vacation, destination, companions. Prediction: amount of money to be spent.
Loan someone will apply for. Predictors: bank they use, money saved, what they plan to buy/invest in. Prediction: amount of money to be requested.
Profit of a company. Predictors: production cost, number of customers, advertising. Inference: how the factors affect profit, so that it can be increased.
c. Describe three real-life applications in which cluster analysis might be useful.
The make of vehicle someone owns
Where a person lives
The shopping center or supermarket where someone does their regular shopping
5. What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?
It depends on the shape the data take. The closer the relationship is to linear, the better a less flexible model will do. If the shape is not clear, a flexible model helps.
There are cases, such as classification, where we do not want to stay so close to the observations but rather draw a clear distinction; there a less flexible model would be used.
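A minimal sketch of the trade-off on simulated data. The linear truth, the noise level, the degree-10 polynomial, and the helper name mse are all assumptions chosen for illustration, not part of the exercise. Because the straight line is nested inside the polynomial, the flexible fit can never have a higher training error; the test error is what shows whether the extra flexibility paid off.

```r
# Simulated data; the linear truth and noise level are assumptions for illustration
set.seed(42)
n <- 40
x <- runif(n, -2, 2)
y <- 1 + 2 * x + rnorm(n, sd = 1)
x_test <- runif(1000, -2, 2)
y_test <- 1 + 2 * x_test + rnorm(1000, sd = 1)

# Hypothetical helper: mean squared error of a fitted model on given data
mse <- function(fit, x, y) mean((y - predict(fit, data.frame(x = x)))^2)

rigid    <- lm(y ~ x)            # less flexible: straight line
flexible <- lm(y ~ poly(x, 10))  # more flexible: degree-10 polynomial

# Training error can only go down as flexibility increases
train_err <- c(rigid = mse(rigid, x, y), flexible = mse(flexible, x, y))
# Test error shows whether the flexible fit is chasing noise
test_err <- c(rigid = mse(rigid, x_test, y_test),
              flexible = mse(flexible, x_test, y_test))
print(train_err)
print(test_err)
```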
6. Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?
In the parametric approach we fit the data to a form that is as simple as possible, and then use only that fit to make predictions. The non-parametric approach uses all of the data to produce predictions.
The advantage is the simplification it allows; the disadvantage is that we depend on the form we chose: if we fit a model that was not the appropriate one, the results will suffer.
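The contrast can be sketched on simulated data where the assumed form is wrong. Everything here is an assumption for illustration: the sine-shaped truth, the sample size, and the hand-written helper knn_predict (a plain k-nearest-neighbors average, not a library function). The linear model estimates only two coefficients, while the neighbor average keeps the whole training set and consults it for every prediction.

```r
# Simulated data with a nonlinear truth (an assumption made for illustration)
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

# Parametric: assume f is linear, so only two coefficients are estimated
fit_lm <- lm(y ~ x)

# Non-parametric: average the responses of the k nearest training points
# (hypothetical helper written for this sketch)
knn_predict <- function(x0, x, y, k = 10) {
  sapply(x0, function(p) mean(y[order(abs(x - p))[1:k]]))
}

# Compare both fits against the true function on a grid
grid <- seq(0, 10, length.out = 50)
mse_lm  <- mean((predict(fit_lm, data.frame(x = grid)) - sin(grid))^2)
mse_knn <- mean((knn_predict(grid, x, y) - sin(grid))^2)
print(c(linear = mse_lm, knn = mse_knn))
```

When the true relationship really is close to linear, the comparison would be expected to go the other way, which is the case for preferring the parametric fit.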
7. The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.
PENDIENTE
8. This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US.
a. Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data
b. Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later
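A minimal sketch for parts a and b. So that it runs on its own, it writes a three-row stand-in file to a temporary path; the file contents and the college names in it are hypothetical, and with the real data the path would point at College.csv instead. The first column has an empty header, which read.csv() names X, matching the summary output further down.

```r
# Hypothetical three-row stand-in for College.csv so the sketch is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  ",Private,Apps,Top10perc,Outstate",
  "College A,Yes,1660,23,7440",
  "College B,No,2186,60,12280",
  "College C,Yes,1428,52,11250"
), csv_path)

# a. Read the data; stringsAsFactors = TRUE keeps Private as a factor
college <- read.csv(csv_path, stringsAsFactors = TRUE)

# b. Use the first column as row names, then drop it from the data
rownames(college) <- college[, 1]
college <- college[, -1]
print(college)
```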
c. i. Use the summary() function to produce a numerical summary of the variables in the data set.
summary(college)
X Private
Abilene Christian University: 1 No :212
Adelphi University : 1 Yes:565
Adrian College : 1
Agnes Scott College : 1
Alaska Pacific University : 1
Albertson College : 1
(Other) :771
Apps Accept Enroll
Min. : 81 Min. : 72 Min. : 35
1st Qu.: 776 1st Qu.: 604 1st Qu.: 242
Median : 1558 Median : 1110 Median : 434
Mean : 3002 Mean : 2019 Mean : 780
3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902
Max. :48094 Max. :26330 Max. :6392
Top10perc Top25perc F.Undergrad
Min. : 1.00 Min. : 9.0 Min. : 139
1st Qu.:15.00 1st Qu.: 41.0 1st Qu.: 992
Median :23.00 Median : 54.0 Median : 1707
Mean :27.56 Mean : 55.8 Mean : 3700
3rd Qu.:35.00 3rd Qu.: 69.0 3rd Qu.: 4005
Max. :96.00 Max. :100.0 Max. :31643
P.Undergrad Outstate Room.Board
Min. : 1.0 Min. : 2340 Min. :1780
1st Qu.: 95.0 1st Qu.: 7320 1st Qu.:3597
Median : 353.0 Median : 9990 Median :4200
Mean : 855.3 Mean :10441 Mean :4358
3rd Qu.: 967.0 3rd Qu.:12925 3rd Qu.:5050
Max. :21836.0 Max. :21700 Max. :8124
Books Personal PhD
Min. : 96.0 Min. : 250 Min. : 8.00
1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
Median : 500.0 Median :1200 Median : 75.00
Mean : 549.4 Mean :1341 Mean : 72.66
3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
Max. :2340.0 Max. :6800 Max. :103.00
Terminal S.F.Ratio perc.alumni
Min. : 24.0 Min. : 2.50 Min. : 0.00
1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00
Median : 82.0 Median :13.60 Median :21.00
Mean : 79.7 Mean :14.09 Mean :22.74
3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00
Max. :100.0 Max. :39.80 Max. :64.00
Expend Grad.Rate
Min. : 3186 Min. : 10.00
1st Qu.: 6751 1st Qu.: 53.00
Median : 8377 Median : 65.00
Mean : 9660 Mean : 65.46
3rd Qu.:10830 3rd Qu.: 78.00
Max. :56233 Max. :118.00
c. ii. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
pairs(college[,1:10])
c. iii. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
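One way this could be done, sketched on a hypothetical six-row stand-in for the college data frame (the values are made up). When the x argument is a factor, plot() draws one boxplot per level, so Private must be a factor rather than a character column.

```r
# Hypothetical stand-in for the College data
college <- data.frame(
  Private  = factor(c("Yes", "No", "Yes", "No", "Yes", "No")),
  Outstate = c(12000, 6500, 15000, 7200, 11000, 8000)
)

# With a factor on the x-axis, plot() produces side-by-side boxplots
plot(college$Private, college$Outstate,
     xlab = "Private", ylab = "Outstate")
```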

c. iv. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.
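A possible sketch of the binning step, again on a hypothetical stand-in data frame with made-up values. A college is marked Elite only when Top10perc is strictly greater than 50, so a value of exactly 50 stays in the "No" group.

```r
# Hypothetical stand-in for the College data
college <- data.frame(
  Top10perc = c(23, 60, 52, 16, 50, 75),
  Outstate  = c(7440, 12280, 11250, 6500, 9000, 18000)
)

# Start with everyone as "No", then flip those above the 50% threshold
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
college$Elite <- as.factor(Elite)

# Count how many colleges fall in each group
print(summary(college$Elite))
```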

c. v. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables.
par(mfrow = c(2,2))
hist(college$Accept, breaks = 10)
hist(college$Enroll, breaks = 20)
hist(college$Personal, breaks = 30)
hist(college$PhD, breaks = 15)

9. This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
a. Which of the predictors are quantitative, and which are qualitative?
Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, year
Qualitative: origin, name
b. What is the range of each quantitative predictor? You can answer this using the range() function.
c. What is the mean and standard deviation of each quantitative predictor?
summary(auto[,1:7])
mpg cylinders displacement horsepower weight acceleration
Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613 Min. : 8.00
1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225 1st Qu.:13.78
Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804 Median :15.50
Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978 Mean :15.54
3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615 3rd Qu.:17.02
Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140 Max. :24.80
year
Min. :70.00
1st Qu.:73.00
Median :76.00
Mean :75.98
3rd Qu.:79.00
Max. :82.00
summary gives an overview of each column, showing the minimum and maximum (for the range) and the mean of each one.
sapply(auto[,1:7], range)
mpg cylinders displacement
[1,] 9.0 3 68
[2,] 46.6 8 455
horsepower weight acceleration year
[1,] 46 1613 8.0 70
[2,] 230 5140 24.8 82
sapply(auto[,1:7], mean)
mpg cylinders displacement
23.445918 5.471939 194.411990
horsepower weight acceleration
104.469388 2977.584184 15.541327
year
75.979592
sapply(auto[,1:7], sd)
mpg cylinders displacement
7.8050075 1.7057832 104.6440039
horsepower weight acceleration
38.4911599 849.4025600 2.7588641
year
3.6837365
sapply lets us apply sd (standard deviation) to each quantitative column. origin (column 8) is left out because it is a coded category, so its range, mean, and standard deviation are not meaningful.
d. Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
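A minimal sketch of the subsetting step. The data frame below is a simulated stand-in with only two made-up columns, since the real Auto data is not bundled with base R; it uses 392 rows, the size of Auto after the missing values are removed. A negative index drops rows, and 10 through 85 inclusive is 76 observations.

```r
# Simulated stand-in for Auto (hypothetical values, 392 rows)
set.seed(1)
auto <- data.frame(
  mpg    = rnorm(392, mean = 23, sd = 8),
  weight = rnorm(392, mean = 2978, sd = 849)
)

# Drop observations 10 through 85; a negative index removes rows
auto_sub <- auto[-(10:85), ]

# Same summaries as before, now on the remaining rows
print(sapply(auto_sub, range))
print(sapply(auto_sub, mean))
print(sapply(auto_sub, sd))
```

With the real data, the same three sapply() calls would be applied to the quantitative columns of the subset.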
f. Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

Yes. mpg shows a clear relationship with displacement, horsepower, and weight (it falls as each of them increases), and to a lesser extent with acceleration.
par(mfrow = c(2,2))
plot(auto$displacement, auto$mpg)
plot(auto$horsepower, auto$mpg)
plot(auto$weight, auto$mpg)
plot(auto$acceleration, auto$mpg)

10. This exercise involves the Boston housing data set.
a. To begin, load in the Boston data set. The Boston data set is part of the MASS library in R.
library(MASS)
nrow(Boston)
[1] 506
ncol(Boston)
[1] 14
colnames(Boston)
[1] "crim" "zn" "indus"
[4] "chas" "nox" "rm"
[7] "age" "dis" "rad"
[10] "tax" "ptratio" "black"
[13] "lstat" "medv"
?Boston
b. Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

The more industry there is (indus), the higher the nitrogen oxides concentration (nox)

The closer to the employment centers (lower dis), the higher the nitrogen oxides concentration (nox)
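The scatterplots behind these two findings could be produced along these lines. This is a sketch that assumes the MASS package is installed; the correlations are added only as a numerical check on the direction of each relationship.

```r
library(MASS)  # provides the Boston data set

# Scatterplots for the two relationships described in this part
par(mfrow = c(1, 2))
plot(Boston$indus, Boston$nox, xlab = "indus", ylab = "nox")
plot(Boston$dis, Boston$nox, xlab = "dis", ylab = "nox")

# Sign of each correlation: positive for indus, negative for dis
cor_indus_nox <- cor(Boston$indus, Boston$nox)
cor_dis_nox   <- cor(Boston$dis, Boston$nox)
print(c(indus = cor_indus_nox, dis = cor_dis_nox))
```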

c. Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
Crime is higher in places where homes are worth less (lower medv).

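Crime also looks higher where age is high. Note that age is the proportion of owner-occupied units built prior to 1940, not the age of residents, so values of 80 to 100 mean suburbs with mostly old housing, not criminals aged 80 to 100.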
plot(Boston$age, Boston$crim)
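As a rough numerical companion to the plots, the correlation of crim with every other column can be listed in one call. This is a sketch that assumes the MASS package is installed; correlation only captures linear association, so it complements the scatterplots rather than replacing them.

```r
library(MASS)  # provides the Boston data set

# Correlation of crim with each other column, sorted from most negative to most positive
crim_cor <- sort(cor(Boston)[, "crim"])
crim_cor <- crim_cor[names(crim_cor) != "crim"]  # drop crim's correlation with itself
print(round(crim_cor, 2))
```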
e. How many of the suburbs in this data set bound the Charles river?
sum(Boston$chas == 1)
[1] 35
