Part 1: Problem 1. Download the dataset
library(tidyverse)  # provides read_csv(), glimpse(), %>%, select() and ggplot()
admisiones <- read_csv("admis1.csv")
Parsed with column specification:
cols(
`Serial No.` = col_integer(),
`GRE Score` = col_integer(),
`TOEFL Score` = col_integer(),
`University Rating` = col_integer(),
SOP = col_double(),
LOR = col_double(),
CGPA = col_double(),
Research = col_integer(),
`Chance of Admit` = col_double()
)
glimpse(admisiones)
Observations: 500
Variables: 9
$ `Serial No.` <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, ...
$ `GRE Score` <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323, 325, 327, 328, 307, 311, 314, 317, 319, 3...
$ `TOEFL Score` <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108, 106, 111, 112, 109, 104, 105, 107, 106, 1...
$ `University Rating` <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 2, 1, 2, 2...
$ SOP <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5, 3.5, 4.0, 4.0, 4.0, 3.5, 3.5, 4.0, 4.0, 4...
$ LOR <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0, 4.0, 4.5, 4.5, 3.0, 2.0, 2.5, 3.0, 3.0, 3...
$ CGPA <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.00, 8.60, 8.40, 9.00, 9.10, 8.00, 8.20, 8...
$ Research <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1...
$ `Chance of Admit` <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50, 0.45, 0.52, 0.84, 0.78, 0.62, 0.61, 0...
Problem 2. Split your dataset into 70% for train and 30% for test, following the lab done in class.
train <- sample(seq_len(nrow(admisiones)), size = nrow(admisiones) * 0.7)
train_data <- admisiones[train,]
glimpse(train_data)
Observations: 350
Variables: 9
$ `Serial No.` <int> 407, 415, 14, 254, 171, 34, 85, 487, 367, 120, 464, 90, 320, 111, 174, 479, 88, 128, 211,...
$ `GRE Score` <int> 322, 321, 307, 335, 312, 340, 340, 319, 320, 327, 304, 316, 327, 305, 323, 318, 317, 319,...
$ `TOEFL Score` <int> 103, 110, 109, 115, 101, 114, 115, 102, 104, 104, 107, 109, 113, 108, 113, 103, 107, 112,...
$ `University Rating` <int> 4, 4, 3, 4, 2, 5, 5, 3, 3, 5, 3, 4, 4, 5, 4, 3, 2, 3, 4, 3, 3, 5, 3, 3, 2, 2, 3, 3, 2, 1,...
$ SOP <dbl> 3.0, 3.5, 4.0, 4.5, 2.5, 4.0, 4.5, 2.5, 3.5, 3.0, 3.5, 4.5, 3.5, 3.0, 4.0, 4.0, 3.5, 2.5,...
$ LOR <dbl> 2.5, 4.0, 3.0, 4.5, 3.5, 4.0, 4.5, 2.5, 4.5, 3.5, 3.0, 3.5, 3.0, 3.0, 4.5, 4.5, 3.0, 2.0,...
$ CGPA <dbl> 8.02, 8.35, 8.00, 9.68, 8.04, 9.60, 9.45, 8.37, 8.34, 8.84, 7.86, 8.76, 8.69, 8.48, 9.23,...
$ Research <int> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0,...
$ `Chance of Admit` <dbl> 0.61, 0.72, 0.62, 0.93, 0.68, 0.90, 0.94, 0.68, 0.74, 0.71, 0.57, 0.74, 0.80, 0.61, 0.89,...
We have selected 70% of the observations: 350/500 = 0.7.
Next we build the test set, which should contain 150 observations (30% of 500).
# Test indices: the rows not drawn for train, shuffled, so the two sets cannot overlap
prueba <- sample(setdiff(seq_len(nrow(admisiones)), train))
prueba_data <- admisiones[prueba,]
glimpse(prueba_data)
Observations: 150
Variables: 9
$ `Serial No.` <int> 384, 261, 357, 348, 337, 47, 346, 307, 345, 414, 17, 350, 215, 151, 158, 280, 460, 7, 133...
$ `GRE Score` <int> 300, 327, 327, 299, 319, 329, 316, 323, 295, 317, 317, 313, 331, 334, 309, 304, 329, 321,...
$ `TOEFL Score` <int> 100, 108, 109, 94, 110, 114, 98, 110, 96, 101, 107, 101, 117, 114, 104, 102, 113, 109, 10...
$ `University Rating` <int> 3, 5, 3, 1, 3, 5, 1, 3, 2, 3, 3, 3, 4, 4, 2, 2, 4, 3, 5, 2, 4, 1, 4, 3, 3, 3, 2, 5, 5, 2,...
$ SOP <dbl> 3.0, 5.0, 3.5, 1.0, 3.0, 4.0, 1.5, 4.0, 1.5, 3.0, 4.0, 2.5, 4.5, 4.0, 2.0, 3.0, 4.0, 3.0,...
$ LOR <dbl> 3.5, 3.5, 4.0, 1.0, 2.5, 5.0, 2.0, 3.5, 2.0, 2.0, 3.0, 3.0, 5.0, 4.0, 2.5, 4.0, 3.5, 4.0,...
$ CGPA <dbl> 8.26, 9.13, 8.77, 7.34, 8.79, 9.30, 7.43, 9.10, 7.34, 7.94, 8.70, 8.04, 9.42, 9.43, 8.26,...
$ Research <int> 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0,...
$ `Chance of Admit` <dbl> 0.62, 0.87, 0.79, 0.42, 0.72, 0.86, 0.49, 0.79, 0.47, 0.49, 0.66, 0.62, 0.94, 0.93, 0.65,...
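A 70/30 split is only useful for evaluation if no row lands in both sets. Drawing two independent samples from the same index range can repeat rows, so a common approach is to draw the train indices once and take everything else as test. The following is a minimal sketch of that idea in base R; the data frame `datos`, the seed, and the names `idx_train`, `idx_test`, `train_demo` and `test_demo` are assumptions made for illustration and are not part of the original report.

```r
# Hypothetical stand-in for `admisiones`: 500 rows with a single id column.
datos <- data.frame(id = seq_len(500))

set.seed(42)  # assumed seed, only so the split can be reproduced

n <- nrow(datos)
idx_train <- sample(seq_len(n), size = round(0.7 * n))  # 350 row indices
idx_test  <- setdiff(seq_len(n), idx_train)             # the other 150 rows

train_demo <- datos[idx_train, , drop = FALSE]
test_demo  <- datos[idx_test, , drop = FALSE]

# No row appears in both sets, and together they cover the whole dataset.
length(intersect(idx_train, idx_test))
nrow(train_demo) + nrow(test_demo) == n
```

Fixing a seed is optional, but without one every run produces a different split and therefore different fitted coefficients.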
Problem 3. State which variables this dataset contains and what each one means.
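Variables: Serial No. (integer), GRE Score (integer), TOEFL Score (integer), University Rating (integer), SOP (double), LOR (double), CGPA (double), Research (integer), Chance of Admit (double).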
Problems 4, 5 and 6. Make a separate plot for each of the variables GRE.Score, TOEFL.Score, SOP, LOR, CGPA and University.Rating against Chance.of.Admit.
State which type of regression you would use to build a model that predicts Chance.of.Admit. Do this for each independent variable, that is, build a two-variable model with one response variable and one explanatory variable.
Use the smooth method to show in each plot the model that best fits the data. Do this for each plot separately.
From this point on, all our regressions are simple linear regressions:
$$y = b_0 + b_1 x + u$$
where x is the explanatory variable, y is the variable to predict (in our case the chance of admission), and u is the irreducible error.
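For a simple regression the least-squares estimates have a closed form: the slope is the sample covariance of x and y divided by the sample variance of x, and the intercept is the mean of y minus the slope times the mean of x. The sketch below checks this against `lm()` on simulated data. The data, the seed, and the names `b0_hat`, `b1_hat` and `fit_demo` are assumptions for illustration only and do not come from the admissions dataset.

```r
set.seed(1)  # assumed seed for the simulated data

# Simulated data (hypothetical): y = 0.2 + 0.5 * x + u, with u ~ Normal(0, 0.1)
x <- runif(200, min = 0, max = 1)
u <- rnorm(200, mean = 0, sd = 0.1)
y <- 0.2 + 0.5 * x + u

# Closed-form least-squares estimates for simple regression
b1_hat <- cov(x, y) / var(x)
b0_hat <- mean(y) - b1_hat * mean(x)

# lm() should return the same two numbers
fit_demo <- lm(y ~ x)
c(b0 = b0_hat, b1 = b1_hat)
coef(fit_demo)
```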
Example
fit <- lm(`Chance of Admit` ~ `GRE Score`, data = train_data)
ggplot(fit$model, aes(x = `GRE Score`, y = `Chance of Admit`)) + geom_point() + geom_smooth() + geom_smooth(method = "lm", col = "red")
Since geom_smooth(method = "lm") fits the linear model itself in every simple case, from here on we let it do so automatically instead of building the model explicitly.
gre_vs_admin <- train_data %>% select(`GRE Score`, `Chance of Admit`)
ggplot(gre_vs_admin, aes(x = `GRE Score`, y = `Chance of Admit`)) + geom_point() + geom_smooth() + geom_smooth(method = "lm", col = "red")
TOEFL Score vs. Chance of Admit
toefl_vs_admin <- train_data %>% select(`TOEFL Score`, `Chance of Admit`)
ggplot(toefl_vs_admin, aes(x = `TOEFL Score`, y = `Chance of Admit`)) + geom_point() + geom_smooth() + geom_smooth(method = "lm", col = "red")
SOP_vs_admin <- train_data %>% select(SOP, `Chance of Admit`)
ggplot(SOP_vs_admin, aes(x = SOP, y = `Chance of Admit`)) + geom_point() + geom_smooth() + geom_smooth(method = "lm", col = "red")
LOR_vs_admin <- train_data %>% select(LOR, `Chance of Admit`)
ggplot(LOR_vs_admin, aes(x = LOR, y = `Chance of Admit`)) + geom_point() + geom_smooth() + geom_smooth(method = "lm", col = "red")
CGPA_vs_admin <- train_data %>% select(CGPA, `Chance of Admit`)
ggplot(CGPA_vs_admin, aes(x = CGPA, y = `Chance of Admit`)) + geom_point() + geom_smooth() + geom_smooth(method = "lm", col = "red")
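Each plot above overlays two curves: the default `geom_smooth()`, which uses a loess smoother when there are fewer than 1,000 observations, and the red straight line from `method = "lm"`. One way to back up the visual judgment that a straight line is adequate is to compare the residual error of both fits numerically. The sketch below does this in base R on simulated data. The data frame `demo`, the seed, and the helper `rmse` are hypothetical names introduced here, and the numbers it prints say nothing about the admissions data.

```r
set.seed(2)  # assumed seed for the simulated data

# Simulated stand-in (hypothetical) for one predictor and the response
x <- runif(300, min = 290, max = 340)
y <- -2.4 + 0.01 * x + rnorm(300, sd = 0.05)
demo <- data.frame(x = x, y = y)

fit_lm    <- lm(y ~ x, data = demo)     # counterpart of the red line in the plots
fit_loess <- loess(y ~ x, data = demo)  # local smoother, like the default curve

# Root mean squared residual of each fit on the same data
rmse <- function(fit) sqrt(mean(residuals(fit)^2))
c(lm = rmse(fit_lm), loess = rmse(fit_loess))
```

If the two values are close, the flexible curve adds little over the straight line, which supports choosing simple linear regression for that predictor.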
Part 2. Building the multivariable model
2.1 Build a regression model (linear/nonlinear) with all the previous variables, and state the significance of each variable.
fit <- lm(`Chance of Admit` ~ `GRE Score` + `TOEFL Score` + `University Rating` + SOP + LOR + CGPA, data = train_data)
summary(fit)
Call:
lm(formula = `Chance of Admit` ~ `GRE Score` + `TOEFL Score` +
    `University Rating` + SOP + LOR + CGPA, data = train_data)

Residuals:
      Min        1Q    Median        3Q       Max
-0.279929 -0.021514  0.007078  0.033639  0.124415

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)         -1.5436158  0.1200514 -12.858  < 2e-16 ***
`GRE Score`          0.0028725  0.0005949   4.828 2.08e-06 ***
`TOEFL Score`        0.0025355  0.0010793   2.349   0.0194 *
`University Rating`  0.0054717  0.0046746   1.171   0.2426
SOP                 -0.0004857  0.0056017  -0.087   0.9310
LOR                  0.0216602  0.0051715   4.188 3.58e-05 ***
CGPA                 0.1153722  0.0122375   9.428  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.06167 on 343 degrees of freedom
Multiple R-squared: 0.8148, Adjusted R-squared: 0.8116
F-statistic: 251.5 on 6 and 343 DF, p-value: < 2.2e-16
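The significance of each variable can also be read programmatically from the coefficient table instead of from the printed stars. The sketch below shows one way to do that with `coef(summary(...))` on simulated data. The variables `x1` and `x2`, the seed, and the 5% threshold are assumptions chosen for illustration.

```r
set.seed(3)  # assumed seed for the simulated data

# Simulated stand-in (hypothetical): x1 drives y, x2 is pure noise
n  <- 350
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.8 * x1 + rnorm(n, sd = 0.5)

fit_demo <- lm(y ~ x1 + x2)

# Coefficient table as a matrix; the p-values sit in the "Pr(>|t|)" column
tab <- coef(summary(fit_demo))
p   <- tab[, "Pr(>|t|)"]

# Flag each term against an assumed 5% significance level
data.frame(term = names(p), p_value = unname(p), significant = unname(p < 0.05))
```

Read the same way, the summary above shows GRE Score, LOR and CGPA significant at the 0.001 level, TOEFL Score significant at the 0.05 level (p = 0.0194), and University Rating (p = 0.2426) and SOP (p = 0.9310) not significant at conventional levels.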