1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

a. The sample size n is extremely large, and the number of predictors p is small.

Better. With a very large n and few predictors, a flexible method has enough data to estimate a complex fit without overfitting, so it can reduce bias at little cost in variance.

b. The number of predictors p is extremely large, and the number of observations n is small.

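Worse. With many predictors and few observations, a flexible method tends to overfit the training data, so an inflexible method would generally be expected to perform better.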

c. The relationship between the predictors and response is highly non-linear.

Better. A flexible method can adapt more closely to the shape of the data.

d. The variance of the error terms, i.e. sigma^2 = Var(err), is extremely high.

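Worse. When the data are very noisy, a flexible method tends to follow the noise rather than the underlying relationship, which increases variance; an inflexible method is less affected.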
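
One way to check the intuition behind parts (a) through (d) is a small simulation. The sketch below is not part of the exercise: the true function, the sample sizes and the noise levels are arbitrary assumptions, a straight line stands in for the inflexible method, and a degree-8 polynomial stands in for the flexible one. The small-n case only loosely mirrors part (b), since p stays at 1.

```r
set.seed(1)

# Hypothetical non-linear "true" relationship, chosen only for illustration.
f <- function(x) sin(2 * x)

# Average test MSE of an inflexible fit (straight line) and a flexible fit
# (degree-8 polynomial) over `reps` simulated training sets.
compare_fits <- function(n, sigma, reps = 100) {
  one_run <- function() {
    train <- data.frame(x = runif(n, -3, 3))
    train$y <- f(train$x) + rnorm(n, sd = sigma)
    test <- data.frame(x = runif(1000, -3, 3))
    test$y <- f(test$x) + rnorm(1000, sd = sigma)

    inflexible <- lm(y ~ x, data = train)
    flexible <- lm(y ~ poly(x, 8), data = train)

    c(inflexible = mean((test$y - predict(inflexible, newdata = test))^2),
      flexible   = mean((test$y - predict(flexible, newdata = test))^2))
  }
  rowMeans(replicate(reps, one_run()))
}

large_n    <- compare_fits(n = 2000, sigma = 0.3)  # many observations, low noise
small_n    <- compare_fits(n = 12,   sigma = 0.3)  # very few observations
high_noise <- compare_fits(n = 50,   sigma = 3)    # very noisy response

rbind(large_n, small_n, high_noise)
```

Each row of the result gives the average test MSE of the two fits in one scenario, so the rows can be compared directly with the answers above.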

2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

a. We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

Regression, inference, because we want to understand which factors have an effect. n = 500 firms, p = 3 predictors (profit, number of employees, industry).

b. We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

Since success and failure are two categories, this is a classification problem, and the goal is prediction. n = 20 products, p = 13 variables.

c. We are interested in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market.

Regression, prediction: we want to see how the dollar will change, not which factors drive it. n = 52 weeks, p = 3 markets.

3. We now revisit the bias-variance decomposition.

PENDING

4. You will now think of some real-life applications for statistical learning.

a. Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
  • Classifying which students are at risk, which are doing well, and which can take on extra activities. Predictors: grades, attendance, attendance at tutoring sessions. Inference, since we want to know in which areas they can improve.

  • Which group of people to sell a product to. Predictors: buyers' salary, shops/places they frequent. Prediction: who would buy the product.

  • Which artist people will go to see at a multi-act concert. Predictors: where they bought their ticket, whether they took up a promotion. Prediction: whom they are going to see.

b. Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
  • Money someone will spend on their holiday. Predictors: time since their last holiday, destination, travel companions. Prediction: amount of money to be spent.

  • Loan someone will request. Predictors: bank they use, savings, what they plan to buy or invest in. Prediction: amount of money to be requested.

  • A company's profit. Predictors: production cost, number of customers, advertising. Inference: how the factors affect profit, in order to increase it.

c. Describe three real-life applications in which cluster analysis might be useful.
  • The make of vehicle someone owns

  • Where a person lives

  • The shopping centre or supermarket where someone does their regular shopping

5. What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

It depends on the shape of the data. The closer the relationship is to linear, the better a less flexible model will do. If the shape is unclear, a flexible model helps, at the cost of needing more data and of a higher risk of overfitting.

In some cases, such as classification, we do not want to follow the observations too closely but rather draw a clear distinction between classes; a less flexible model would be used there.

6. Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?

A parametric approach assumes a functional form, as simple as possible, fits it to the data, and then uses only that fitted form to make predictions. A non-parametric approach makes no such assumption and uses the data themselves to make predictions.

The advantage of the parametric approach is the simplification it allows. The disadvantage is that we depend on the form we chose: if we fit a model that was not appropriate, the results will be poor.
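
The contrast can be made concrete with a minimal sketch. The choices here are assumptions for illustration, not part of the exercise: the built-in `cars` data set, `lm()` as the parametric method and `loess()` as the non-parametric one.

```r
# Parametric: assume dist = b0 + b1 * speed. Once fitted, the model is fully
# described by two coefficients.
param_fit <- lm(dist ~ speed, data = cars)
coef(param_fit)

# Non-parametric: loess assumes no global functional form and estimates the
# curve locally from nearby training observations.
nonparam_fit <- loess(dist ~ speed, data = cars)

# Predictions from both fits at a few speeds inside the observed range.
new_speeds <- data.frame(speed = c(10, 15, 20))
param_pred <- predict(param_fit, newdata = new_speeds)
nonparam_pred <- predict(nonparam_fit, newdata = new_speeds)
cbind(new_speeds, param_pred, nonparam_pred)
```

If the straight-line assumption is wrong, the parametric predictions inherit that error, whereas the local fit can bend with the data.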

7. The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

PENDING

8. This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US.

a. Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.
b. Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later.
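
Parts (a) and (b) could be handled along the following lines. This is a sketch under assumptions: a tiny stand-in file with made-up rows is written to a temporary location so the snippet runs on its own, and `stringsAsFactors = TRUE` is assumed so that text columns such as Private become factors, as in the summary shown further down.

```r
# Stand-in for College.csv: three made-up rows, written to a temporary file so
# the example runs without the real data.
path <- tempfile(fileext = ".csv")
writeLines(c(",Private,Apps",
             "College A,Yes,1000",
             "College B,No,2500",
             "College C,Yes,400"), path)

# With the real file this would be read.csv("College.csv", ...).
college <- read.csv(path, stringsAsFactors = TRUE)

# The unnamed first column is read in as "X"; use it as row names, then drop
# it so R no longer treats the names as data.
rownames(college) <- college[, 1]
college <- college[, -1]

head(college)
```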
c. i. Use the summary() function to produce a numerical summary of the variables in the data set.
summary(college)
                            X       Private  
 Abilene Christian University:  1   No :212  
 Adelphi University          :  1   Yes:565  
 Adrian College              :  1            
 Agnes Scott College         :  1            
 Alaska Pacific University   :  1            
 Albertson College           :  1            
 (Other)                     :771            
      Apps           Accept          Enroll    
 Min.   :   81   Min.   :   72   Min.   :  35  
 1st Qu.:  776   1st Qu.:  604   1st Qu.: 242  
 Median : 1558   Median : 1110   Median : 434  
 Mean   : 3002   Mean   : 2019   Mean   : 780  
 3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902  
 Max.   :48094   Max.   :26330   Max.   :6392  
                                               
   Top10perc       Top25perc      F.Undergrad   
 Min.   : 1.00   Min.   :  9.0   Min.   :  139  
 1st Qu.:15.00   1st Qu.: 41.0   1st Qu.:  992  
 Median :23.00   Median : 54.0   Median : 1707  
 Mean   :27.56   Mean   : 55.8   Mean   : 3700  
 3rd Qu.:35.00   3rd Qu.: 69.0   3rd Qu.: 4005  
 Max.   :96.00   Max.   :100.0   Max.   :31643  
                                                
  P.Undergrad         Outstate       Room.Board  
 Min.   :    1.0   Min.   : 2340   Min.   :1780  
 1st Qu.:   95.0   1st Qu.: 7320   1st Qu.:3597  
 Median :  353.0   Median : 9990   Median :4200  
 Mean   :  855.3   Mean   :10441   Mean   :4358  
 3rd Qu.:  967.0   3rd Qu.:12925   3rd Qu.:5050  
 Max.   :21836.0   Max.   :21700   Max.   :8124  
                                                 
     Books           Personal         PhD        
 Min.   :  96.0   Min.   : 250   Min.   :  8.00  
 1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
 Median : 500.0   Median :1200   Median : 75.00  
 Mean   : 549.4   Mean   :1341   Mean   : 72.66  
 3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
 Max.   :2340.0   Max.   :6800   Max.   :103.00  
                                                 
    Terminal       S.F.Ratio      perc.alumni   
 Min.   : 24.0   Min.   : 2.50   Min.   : 0.00  
 1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00  
 Median : 82.0   Median :13.60   Median :21.00  
 Mean   : 79.7   Mean   :14.09   Mean   :22.74  
 3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00  
 Max.   :100.0   Max.   :39.80   Max.   :64.00  
                                                
     Expend        Grad.Rate     
 Min.   : 3186   Min.   : 10.00  
 1st Qu.: 6751   1st Qu.: 53.00  
 Median : 8377   Median : 65.00  
 Mean   : 9660   Mean   : 65.46  
 3rd Qu.:10830   3rd Qu.: 78.00  
 Max.   :56233   Max.   :118.00  
                                 
c. ii. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
pairs(college[,1:10])
c. iii. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
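
A possible approach, sketched on made-up values rather than the real College data: when the first argument of `plot()` is a factor, it draws one boxplot of the second argument per level. If Private was read in as text, it would need `as.factor()` first.

```r
# Made-up stand-in for the College data; only the two columns used here.
college <- data.frame(
  Private  = factor(c("Yes", "Yes", "No", "No", "Yes", "No")),
  Outstate = c(12000, 15500, 6500, 7200, 18000, 5900)
)

# With a factor on the x axis, plot() produces side-by-side boxplots.
plot(college$Private, college$Outstate,
     xlab = "Private", ylab = "Outstate (out-of-state tuition)")
```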

c. iv. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.
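
The binning can be sketched as follows, again on made-up Top10perc values rather than the real data. A value of exactly 50 does not exceed 50, so it falls in the "No" group.

```r
# Made-up stand-in with only the column being binned.
college <- data.frame(Top10perc = c(23, 52, 78, 50, 9, 61))

# Start with "No" everywhere, then flag rows where more than 50% of students
# come from the top 10% of their high school class.
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
college$Elite <- as.factor(Elite)

# Count of universities in each group.
summary(college$Elite)
```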

c. v. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables.
par(mfrow = c(2,2))
hist(college$Accept)
hist(college$Enroll)
hist(college$Personal)
hist(college$PhD)

9. This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.

a. Which of the predictors are quantitative, and which are qualitative?

Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, year

Qualitative: origin, name

b. What is the range of each quantitative predictor? You can answer this using the range() function.
c. What is the mean and standard deviation of each quantitative predictor?
summary(auto[,1:7])
      mpg          cylinders      displacement     horsepower        weight      acceleration  
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613   Min.   : 8.00  
 1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225   1st Qu.:13.78  
 Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804   Median :15.50  
 Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978   Mean   :15.54  
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615   3rd Qu.:17.02  
 Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140   Max.   :24.80  
      year      
 Min.   :70.00  
 1st Qu.:73.00  
 Median :76.00  
 Mean   :75.98  
 3rd Qu.:79.00  
 Max.   :82.00  

summary() summarises each column and shows the minimum and maximum (for the range) and the mean of each one.

sapply(auto[,1:8], range)
      mpg cylinders displacement
[1,]  9.0         3           68
[2,] 46.6         8          455
     horsepower weight acceleration year
[1,]         46   1613          8.0   70
[2,]        230   5140         24.8   82
     origin
[1,]      1
[2,]      3
sapply(auto[,1:8], mean)
         mpg    cylinders displacement 
   23.445918     5.471939   194.411990 
  horsepower       weight acceleration 
  104.469388  2977.584184    15.541327 
        year       origin 
   75.979592     1.576531 
sapply(auto[,1:8], sd)
         mpg    cylinders displacement 
   7.8050075    1.7057832  104.6440039 
  horsepower       weight acceleration 
  38.4911599  849.4025600    2.7588641 
        year       origin 
   3.6837365    0.8055182 

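sapply() lets us apply sd (standard deviation) to each column. Column 8 (origin) is a numerically coded qualitative variable, so its range, mean and standard deviation are not meaningful.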

d. Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
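
Part (d) could be approached with a negative row index. The sketch below uses a made-up stand-in data frame, not the real Auto data, so the printed numbers are illustrative only.

```r
# Made-up stand-in for the Auto data (100 rows, two quantitative columns).
auto <- data.frame(mpg = seq(10, 40, length.out = 100),
                   weight = seq(5000, 1800, length.out = 100))

# A negative index drops rows. The parentheses matter: -10:85 would mean the
# sequence from -10 to 85 rather than "minus rows 10 to 85".
auto_subset <- auto[-(10:85), ]

# Range, mean and standard deviation of each remaining column.
sapply(auto_subset, range)
sapply(auto_subset, mean)
sapply(auto_subset, sd)
```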
e. Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

f. Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

Yes: displacement, horsepower and weight each show a clear negative relationship with mpg, and acceleration shows a weaker one.

par(mfrow = c(2,2))
plot(auto$displacement, auto$mpg)
plot(auto$horsepower, auto$mpg)
plot(auto$weight, auto$mpg)
plot(auto$acceleration, auto$mpg)

10. This exercise involves the Boston housing data set.

a. To begin, load in the Boston data set. The Boston data set is part of the MASS library in R.
library(MASS)
nrow(Boston)
[1] 506
ncol(Boston)
[1] 14
colnames(Boston)
 [1] "crim"    "zn"      "indus"  
 [4] "chas"    "nox"     "rm"     
 [7] "age"     "dis"     "rad"    
[10] "tax"     "ptratio" "black"  
[13] "lstat"   "medv"   
?Boston
b. Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

The more industry there is (indus), the higher the nitrogen oxides concentration (nox).

The closer a suburb is to the employment centres (lower dis), the higher the nitrogen oxides concentration.

c. Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

Crime is higher in places where homes are worth less.

Crime also appears to be higher where age is high. Note that age is the proportion of owner-occupied units built prior to 1940, not the age of the residents, so this points to suburbs with older housing.

plot(Boston$age, Boston$crim)
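
As a numerical complement to the plots, the correlation of crim with every other column can be listed. This is only a sketch: correlation captures linear association, so it may understate relationships that are strongly non-linear.

```r
library(MASS)

# Correlation of per capita crime rate with every column of Boston.
crim_cor <- cor(Boston)[, "crim"]

# Drop the trivial self-correlation and sort from most positive to most negative.
sort(crim_cor[names(crim_cor) != "crim"], decreasing = TRUE)
```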
d. Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

Taxes do not seem to have an influence. One tax rate is the most common, and every level of crime occurs at that rate.

The pupil-teacher ratio does not seem to have an influence either.

The previous part showed that places with lower home values have more crime.
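
The question also asks about the range of each of these predictors. A minimal sketch of how that could be checked:

```r
library(MASS)

vars <- c("crim", "tax", "ptratio")

# Minimum (row 1) and maximum (row 2) of each of the three predictors.
r <- sapply(Boston[, vars], range)
r

# Quartiles help judge whether the maximum is far from the bulk of the data.
summary(Boston[, vars])
```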

e. How many of the suburbs in this data set bound the Charles river?
sum(Boston$chas == 1)
[1] 35
f. What is the median pupil-teacher ratio among the towns in this data set?
median(Boston$ptratio)
[1] 19.05
g. Which suburb of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
minmedian = min(Boston$medv)
Boston[Boston$medv == minmedian,]
sapply(Boston, mean)
        crim           zn        indus         chas          nox           rm          age 
  3.61352356  11.36363636  11.13677866   0.06916996   0.55469506   6.28463439  68.57490119 
         dis          rad          tax      ptratio        black        lstat         medv 
  3.79504269   9.54940711 408.23715415  18.45553360 356.67403162  12.65306324  22.53280632 

There are two suburbs with medv of 5. They do not look like good places to live: crime and pollution are higher than average.
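
Since the question asks for a comparison with the overall ranges rather than the means, one option is to stack the minimum and maximum of every column on top of the selected rows. The layout below is just one possible sketch.

```r
library(MASS)

# Suburbs (rows) with the lowest median home value.
lowest <- Boston[Boston$medv == min(Boston$medv), ]

# Overall minimum and maximum of every column, as a two-row data frame.
ranges <- as.data.frame(sapply(Boston, range))
rownames(ranges) <- c("min", "max")

# Ranges first, then the selected suburbs, for a side-by-side look.
comparison <- rbind(ranges, lowest)
comparison
```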

h. In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.
nrow(Boston[Boston$rm>8,])
[1] 13

64 suburbs average more than 7 rooms per dwelling; 13 average more than 8.
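
Both counts can be obtained the same way, and a summary of the suburbs above eight rooms gives material for the requested comment. A short sketch:

```r
library(MASS)

# Number of suburbs averaging more than seven and more than eight rooms.
more_than_7 <- sum(Boston$rm > 7)
more_than_8 <- sum(Boston$rm > 8)
c(more_than_7 = more_than_7, more_than_8 = more_than_8)

# Summary of the suburbs with more than eight rooms, to compare against
# summary(Boston) for the full data set.
summary(Boston[Boston$rm > 8, ])
```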

