Reliability
library(readr)
## Warning: package 'readr' was built under R version 3.5.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.3
PART I
1. Download the admis1.csv dataset from GES.
admis1 <- read_csv("admis1.csv")
## Parsed with column specification:
## cols(
## `Serial No.` = col_double(),
## `GRE Score` = col_double(),
## `TOEFL Score` = col_double(),
## `University Rating` = col_double(),
## SOP = col_double(),
## LOR = col_double(),
## CGPA = col_double(),
## Research = col_double(),
## `Chance of Admit` = col_double()
## )
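2. Split your dataset into 70% for training and 30% for testing.
n_train <- floor(0.7 * nrow(admis1))
train <- admis1[1:n_train, ]
# Start the test set one row after the last training row so the two sets do not overlap
test <- admis1[(n_train + 1):nrow(admis1), ]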
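The split above takes the first 70% of rows in file order, which is only safe if the rows are not sorted in some meaningful way. A minimal sketch of a random split follows, assuming a toy data frame (`toy`, with hypothetical columns `id` and `score`) in place of admis1; the seed value is arbitrary.

```r
# Toy stand-in for admis1 (hypothetical data; the real file is not needed here)
set.seed(42)                                   # fixed seed so the split is reproducible
toy <- data.frame(id = 1:500, score = rnorm(500))

n_train   <- floor(0.7 * nrow(toy))            # number of training rows
train_idx <- sample(seq_len(nrow(toy)), n_train)  # random row indices, no repeats

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]                 # every row not chosen for training

# The two subsets should not share any row
shared <- intersect(toy_train$id, toy_test$id)
length(shared)
```

Negative indexing with `-train_idx` guarantees that the test set is exactly the complement of the training set.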
3. List the variables in this dataset and explain what each one means.
colnames(admis1)
## [1] "Serial No." "GRE Score" "TOEFL Score"
## [4] "University Rating" "SOP" "LOR"
## [7] "CGPA" "Research" "Chance of Admit"
GRE Score = Graduate Record Examination score, covering verbal reasoning, quantitative reasoning, and analytical writing.
TOEFL Score = Standardized test of English-language proficiency.
University Rating = Rating of the university.
SOP = Statement of Purpose (personal essay).
LOR = Letter of Recommendation, a document that gives an overall picture of the candidate's suitability for admission to the university.
CGPA = Cumulative Grade Point Average.
Research = Indicator of research experience (0 = no, 1 = yes).
Chance of Admit = Probability of being admitted.
4. Make a separate plot of each of the variables GRE.Score, TOEFL.Score, SOP, LOR, CGPA, and University.Rating against Chance.of.Admit.
train %>%
ggplot(aes(x=`GRE Score`, y=`Chance of Admit`)) +
geom_point()

train %>%
ggplot(aes(x=`TOEFL Score`, y=`Chance of Admit`)) +
geom_point()

train %>%
ggplot(aes(x=`University Rating`, y=`Chance of Admit`)) +
geom_point()

train %>%
ggplot(aes(x=SOP, y=`Chance of Admit`)) +
geom_point()

train %>%
ggplot(aes(x=LOR, y=`Chance of Admit`)) +
geom_point()

train %>%
ggplot(aes(x=CGPA, y=`Chance of Admit`)) +
geom_point()

5. State which type of regression you would use to build a model that predicts Chance.of.Admit. Do this for each independent variable, that is, build a two-variable model with one response variable and one explanatory variable.
lm.fit=lm(`Chance of Admit`~`GRE Score`,data=train)
summary(lm.fit)
##
## Call:
## lm(formula = `Chance of Admit` ~ `GRE Score`, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33492 -0.04646 0.00554 0.06125 0.18135
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.384815 0.128764 -18.52 <2e-16 ***
## `GRE Score` 0.009813 0.000406 24.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08583 on 348 degrees of freedom
## Multiple R-squared: 0.6267, Adjusted R-squared: 0.6256
## F-statistic: 584.1 on 1 and 348 DF, p-value: < 2.2e-16
lm.fit2=lm(`Chance of Admit`~`TOEFL Score`,data=train)
summary(lm.fit2)
##
## Call:
## lm(formula = `Chance of Admit` ~ `TOEFL Score`, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.30856 -0.05342 0.01292 0.05792 0.21201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2533359 0.0828279 -15.13 <2e-16 ***
## `TOEFL Score` 0.0183809 0.0007683 23.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08637 on 348 degrees of freedom
## Multiple R-squared: 0.6219, Adjusted R-squared: 0.6208
## F-statistic: 572.4 on 1 and 348 DF, p-value: < 2.2e-16
lm.fit3=lm(`Chance of Admit`~`University Rating`,data=train)
summary(lm.fit3)
##
## Call:
## lm(formula = `Chance of Admit` ~ `University Rating`, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.38016 -0.04310 0.01690 0.05617 0.27397
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.451909 0.015355 29.43 <2e-16 ***
## `University Rating` 0.087063 0.004594 18.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09854 on 348 degrees of freedom
## Multiple R-squared: 0.5079, Adjusted R-squared: 0.5065
## F-statistic: 359.2 on 1 and 348 DF, p-value: < 2.2e-16
lm.fit4=lm(`Chance of Admit`~`SOP`,data=train)
summary(lm.fit4)
##
## Call:
## lm(formula = `Chance of Admit` ~ SOP, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.48995 -0.05072 0.01646 0.07159 0.22569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.405850 0.020297 20.00 <2e-16 ***
## SOP 0.092821 0.005666 16.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1055 on 348 degrees of freedom
## Multiple R-squared: 0.4354, Adjusted R-squared: 0.4338
## F-statistic: 268.4 on 1 and 348 DF, p-value: < 2.2e-16
lm.fit5=lm(`Chance of Admit`~`LOR`,data=train)
summary(lm.fit5)
##
## Call:
## lm(formula = `Chance of Admit` ~ LOR, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34858 -0.05980 -0.00102 0.07325 0.24142
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.361519 0.022009 16.43 <2e-16 ***
## LOR 0.104875 0.006141 17.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1036 on 348 degrees of freedom
## Multiple R-squared: 0.4559, Adjusted R-squared: 0.4544
## F-statistic: 291.6 on 1 and 348 DF, p-value: < 2.2e-16
lm.fit6=lm(`Chance of Admit`~`CGPA`,data=train)
summary(lm.fit6)
##
## Call:
## lm(formula = `Chance of Admit` ~ CGPA, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.27321 -0.02959 0.01006 0.04333 0.18134
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.062047 0.054768 -19.39 <2e-16 ***
## CGPA 0.207588 0.006346 32.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06959 on 348 degrees of freedom
## Multiple R-squared: 0.7546, Adjusted R-squared: 0.7539
## F-statistic: 1070 on 1 and 348 DF, p-value: < 2.2e-16
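Across the six simple linear models above, CGPA explains the most variance (R-squared 0.7546), followed by GRE Score (0.6267), TOEFL Score (0.6219), University Rating (0.5079), LOR (0.4559), and SOP (0.4354). The same comparison can be produced in one pass rather than by reading six summaries. The sketch below uses simulated data with hypothetical column names (`x1`, `x2`, `y`), not the admissions data.

```r
# Simulated stand-in for the training data (hypothetical values and column names)
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 0.5 + 0.8 * sim$x1 + 0.1 * sim$x2 + rnorm(n, sd = 0.5)

predictors <- c("x1", "x2")

# Fit y ~ predictor once per predictor and keep only the R-squared
r2 <- sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "y"), data = sim)
  summary(fit)$r.squared
})

# Strongest single predictor first
sort(r2, decreasing = TRUE)
```

With the real data, column names that contain spaces (such as GRE Score) would need to be wrapped in backticks inside the strings passed to `reformulate()`.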
6. Use the smooth method to show on each plot the model that best fits the data. Do this for each plot separately.
train %>%
ggplot(aes(x=`GRE Score`, y=`Chance of Admit`)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

train %>%
ggplot(aes(x=`TOEFL Score`, y=`Chance of Admit`)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

train %>%
ggplot(aes(x=`University Rating`, y=`Chance of Admit`)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 3
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
## at 3
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 1
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal
## condition number 0

train %>%
ggplot(aes(x=SOP, y=`Chance of Admit`)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

train %>%
ggplot(aes(x=LOR, y=`Chance of Admit`)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

train %>%
ggplot(aes(x=CGPA, y=`Chance of Admit`)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
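The plots above use the default loess smoother, which produced numerical warnings for University Rating, most likely because that predictor takes only a handful of distinct values. Since step 5 fits straight lines, one option is to ask `geom_smooth()` for a linear fit instead. The following is a sketch on simulated data; the columns `rating` and `chance` are hypothetical stand-ins.

```r
library(ggplot2)

# Simulated stand-in with a discrete predictor taking values 1 to 5 (hypothetical data)
set.seed(2)
sim <- data.frame(rating = sample(1:5, 200, replace = TRUE))
sim$chance <- 0.45 + 0.09 * sim$rating + rnorm(200, sd = 0.1)

p <- ggplot(sim, aes(x = rating, y = chance)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)  # least-squares line with a confidence band

p
```

Passing `formula = y ~ x` explicitly also suppresses the message about which formula was chosen.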

PART II
1. Build a regression model (linear/nonlinear) with all of the previous variables, and state the significance of each variable.
fit_modelo <- lm(data = train, train$`Chance of Admit` ~ poly(train$`GRE Score`,2)+ poly(train$`TOEFL Score`,2) + train$CGPA + train$SOP + train$LOR + train$`University Rating`)
summary(fit_modelo)
##
## Call:
## lm(formula = train$`Chance of Admit` ~ poly(train$`GRE Score`,
## 2) + poly(train$`TOEFL Score`, 2) + train$CGPA + train$SOP +
## train$LOR + train$`University Rating`, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.27463 -0.02366 0.01092 0.03667 0.16192
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.353193 0.105829 -3.337 0.000939 ***
## poly(train$`GRE Score`, 2)1 0.424508 0.133796 3.173 0.001647 **
## poly(train$`GRE Score`, 2)2 -0.021811 0.089546 -0.244 0.807710
## poly(train$`TOEFL Score`, 2)1 0.357443 0.131190 2.725 0.006770 **
## poly(train$`TOEFL Score`, 2)2 0.006077 0.088303 0.069 0.945172
## train$CGPA 0.112982 0.013236 8.536 4.66e-16 ***
## train$SOP -0.003592 0.005804 -0.619 0.536364
## train$LOR 0.024859 0.005917 4.201 3.40e-05 ***
## train$`University Rating` 0.010140 0.005016 2.022 0.043993 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06402 on 341 degrees of freedom
## Multiple R-squared: 0.7965, Adjusted R-squared: 0.7917
## F-statistic: 166.8 on 8 and 341 DF, p-value: < 2.2e-16
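Reading the coefficient table above at the 5% level: CGPA, LOR, the first-order GRE Score and TOEFL Score terms, and University Rating (p = 0.044) are significant, while both second-order polynomial terms (p = 0.81 and p = 0.95) and SOP (p = 0.54) are not. The same reading can be extracted programmatically. The sketch below uses simulated data with hypothetical columns `a`, `b`, and `y`, where `b` has no true effect.

```r
# Simulated data (hypothetical): y depends on a but not on b
set.seed(3)
n <- 300
sim <- data.frame(a = rnorm(n), b = rnorm(n))
sim$y <- 1 + 0.6 * sim$a + rnorm(n, sd = 0.5)

fit <- lm(y ~ a + b, data = sim)

# coef(summary(fit)) is the matrix printed under "Coefficients:"
coefs <- as.data.frame(coef(summary(fit)))
coefs$significant <- coefs[["Pr(>|t|)"]] < 0.05  # flag terms significant at the 5% level
coefs
```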
2. Produce the model you consider best for predicting Chance.of.Admit.
fit_modelo2 <- lm(data = train, train$`Chance of Admit` ~ poly(train$`GRE Score`,2) + train$CGPA + train$`TOEFL Score`)
y_hat_train <- predict(fit_modelo2, train)
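# Note: this is the sum of squared errors on the training set, not the mean; divide by nrow(train) to obtain the MSE
MSE6_train <- sum((train$`Chance of Admit` - y_hat_train)^2)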
MSE6_train
## [1] 1.527538
summary(fit_modelo2)
##
## Call:
## lm(formula = train$`Chance of Admit` ~ poly(train$`GRE Score`,
## 2) + train$CGPA + train$`TOEFL Score`, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.28772 -0.02609 0.01026 0.03970 0.14303
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.907401 0.124169 -7.308 1.9e-12 ***
## poly(train$`GRE Score`, 2)1 0.428082 0.133874 3.198 0.00151 **
## poly(train$`GRE Score`, 2)2 -0.010563 0.067359 -0.157 0.87548
## train$CGPA 0.145619 0.012068 12.067 < 2e-16 ***
## train$`TOEFL Score` 0.003520 0.001184 2.973 0.00316 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06654 on 345 degrees of freedom
## Multiple R-squared: 0.7775, Adjusted R-squared: 0.775
## F-statistic: 301.5 on 4 and 345 DF, p-value: < 2.2e-16
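The error above is computed on the training rows only, and the held-out test set is never used. Note also that a formula written with `train$` prefixes ties the model to the training data frame, so `predict()` would not pick up the columns of a different `newdata`. A minimal sketch of a test-set evaluation follows, assuming simulated data with hypothetical columns `gre`, `cgpa`, and `chance`; the coefficients used to simulate are arbitrary.

```r
# Simulated stand-in for the admissions data (hypothetical values)
set.seed(4)
n <- 500
sim <- data.frame(gre = rnorm(n, 316, 11), cgpa = rnorm(n, 8.6, 0.6))
sim$chance <- -1 + 0.002 * sim$gre + 0.13 * sim$cgpa + rnorm(n, sd = 0.06)

n_train   <- floor(0.7 * n)
sim_train <- sim[seq_len(n_train), ]
sim_test  <- sim[(n_train + 1):n, ]

# Bare column names in the formula let predict() look them up in newdata
fit <- lm(chance ~ poly(gre, 2) + cgpa, data = sim_train)

y_hat_test <- predict(fit, newdata = sim_test)
mse_test   <- mean((sim_test$chance - y_hat_test)^2)  # mean, not sum, of squared errors
mse_test
```

Comparing this value with the training MSE gives a rough indication of whether the chosen model overfits.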