Reliability

library(readr)
## Warning: package 'readr' was built under R version 3.5.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.3

PART I

1. Download the admis1.csv dataset from the GES.

admis1 <- read_csv("admis1.csv")
## Parsed with column specification:
## cols(
##   `Serial No.` = col_double(),
##   `GRE Score` = col_double(),
##   `TOEFL Score` = col_double(),
##   `University Rating` = col_double(),
##   SOP = col_double(),
##   LOR = col_double(),
##   CGPA = col_double(),
##   Research = col_double(),
##   `Chance of Admit` = col_double()
## )
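
As a quick sanity check (not part of the original prompt), the parsed columns can be inspected with dplyr; a minimal sketch, assuming the data frame admis1 created above:

glimpse(admis1)   # column types and the first few values of each variable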

2. Split your dataset into 70% for training and 30% for testing.

train <- admis1[1:floor(0.7 * nrow(admis1)), ]
test  <- admis1[(floor(0.7 * nrow(admis1)) + 1):nrow(admis1), ]
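
The split above is sequential, so if the rows of admis1 are ordered in any way the training and test sets may not be comparable. A minimal sketch of a random 70/30 split instead (my addition; the rest of the document keeps the sequential split), assuming the same admis1 data frame:

set.seed(123)                                   # hypothetical seed, only for reproducibility
idx     <- sample(nrow(admis1), floor(0.7 * nrow(admis1)))
train_r <- admis1[idx, ]                        # random 70%
test_r  <- admis1[-idx, ]                       # remaining 30%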

3. State which variables this dataset contains and what each one means.

colnames(admis1)
## [1] "Serial No."        "GRE Score"         "TOEFL Score"      
## [4] "University Rating" "SOP"               "LOR"              
## [7] "CGPA"              "Research"          "Chance of Admit"

GRE Score = Standardized test of verbal reasoning, quantitative reasoning, and analytical writing.

TOEFL Score = Standardized test of English-language proficiency.

University Rating = Rating of the university.

SOP = Statement of Purpose (the applicant's essay).

LOR = Letter of Recommendation, a document that gives a complete picture of the candidate's suitability for admission to the university.

CGPA = Cumulative Grade Point Average.

Research = Indicator of research experience (0 = no, 1 = yes).

Chance of Admit = The probability of being admitted.

4. Make a separate plot for each of the variables GRE Score, TOEFL Score, SOP, LOR, CGPA, and University Rating against Chance of Admit.

train %>%
  ggplot(aes(x=`GRE Score`, y=`Chance of Admit`)) +
  geom_point()

train %>%
  ggplot(aes(x=`TOEFL Score`, y=`Chance of Admit`)) +
  geom_point()

train %>%
  ggplot(aes(x=`University Rating`, y=`Chance of Admit`)) +
  geom_point()

train %>%
  ggplot(aes(x=SOP, y=`Chance of Admit`)) +
  geom_point()

train %>%
  ggplot(aes(x=LOR, y=`Chance of Admit`)) +
  geom_point()

train %>%
  ggplot(aes(x=CGPA, y=`Chance of Admit`)) +
  geom_point() 
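
As an alternative to the six separate scatterplots above (not required by the prompt), the predictors can be reshaped to long format and drawn in a single faceted figure. A sketch, assuming a tidyr version that provides pivot_longer():

library(tidyr)

train %>%
  pivot_longer(c(`GRE Score`, `TOEFL Score`, `University Rating`, SOP, LOR, CGPA),
               names_to = "predictor", values_to = "value") %>%
  ggplot(aes(x = value, y = `Chance of Admit`)) +
  geom_point() +
  facet_wrap(~ predictor, scales = "free_x")    # one panel per predictor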

5. Say what type of regression you would use to build a model that predicts Chance of Admit; do this with each independent variable, i.e., build two-variable models with one explained and one explanatory variable.

Since each pair relates Chance of Admit to a single numeric predictor, a simple linear regression is fitted for each variable below.

lm.fit=lm(`Chance of Admit`~`GRE Score`,data=train)
summary(lm.fit)
## 
## Call:
## lm(formula = `Chance of Admit` ~ `GRE Score`, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33492 -0.04646  0.00554  0.06125  0.18135 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.384815   0.128764  -18.52   <2e-16 ***
## `GRE Score`  0.009813   0.000406   24.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08583 on 348 degrees of freedom
## Multiple R-squared:  0.6267, Adjusted R-squared:  0.6256 
## F-statistic: 584.1 on 1 and 348 DF,  p-value: < 2.2e-16
lm.fit2=lm(`Chance of Admit`~`TOEFL Score`,data=train)
summary(lm.fit2)
## 
## Call:
## lm(formula = `Chance of Admit` ~ `TOEFL Score`, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30856 -0.05342  0.01292  0.05792  0.21201 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.2533359  0.0828279  -15.13   <2e-16 ***
## `TOEFL Score`  0.0183809  0.0007683   23.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08637 on 348 degrees of freedom
## Multiple R-squared:  0.6219, Adjusted R-squared:  0.6208 
## F-statistic: 572.4 on 1 and 348 DF,  p-value: < 2.2e-16
lm.fit3=lm(`Chance of Admit`~`University Rating`,data=train)
summary(lm.fit3)
## 
## Call:
## lm(formula = `Chance of Admit` ~ `University Rating`, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38016 -0.04310  0.01690  0.05617  0.27397 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.451909   0.015355   29.43   <2e-16 ***
## `University Rating` 0.087063   0.004594   18.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09854 on 348 degrees of freedom
## Multiple R-squared:  0.5079, Adjusted R-squared:  0.5065 
## F-statistic: 359.2 on 1 and 348 DF,  p-value: < 2.2e-16
lm.fit4=lm(`Chance of Admit`~`SOP`,data=train)
summary(lm.fit4)
## 
## Call:
## lm(formula = `Chance of Admit` ~ SOP, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48995 -0.05072  0.01646  0.07159  0.22569 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.405850   0.020297   20.00   <2e-16 ***
## SOP         0.092821   0.005666   16.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1055 on 348 degrees of freedom
## Multiple R-squared:  0.4354, Adjusted R-squared:  0.4338 
## F-statistic: 268.4 on 1 and 348 DF,  p-value: < 2.2e-16
lm.fit5=lm(`Chance of Admit`~`LOR`,data=train)
summary(lm.fit5)
## 
## Call:
## lm(formula = `Chance of Admit` ~ LOR, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34858 -0.05980 -0.00102  0.07325  0.24142 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.361519   0.022009   16.43   <2e-16 ***
## LOR         0.104875   0.006141   17.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1036 on 348 degrees of freedom
## Multiple R-squared:  0.4559, Adjusted R-squared:  0.4544 
## F-statistic: 291.6 on 1 and 348 DF,  p-value: < 2.2e-16
lm.fit6=lm(`Chance of Admit`~`CGPA`,data=train)
summary(lm.fit6)
## 
## Call:
## lm(formula = `Chance of Admit` ~ CGPA, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27321 -0.02959  0.01006  0.04333  0.18134 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.062047   0.054768  -19.39   <2e-16 ***
## CGPA         0.207588   0.006346   32.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06959 on 348 degrees of freedom
## Multiple R-squared:  0.7546, Adjusted R-squared:  0.7539 
## F-statistic:  1070 on 1 and 348 DF,  p-value: < 2.2e-16
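
To compare the six simple models at a glance, the R-squared values reported above can be collected programmatically; CGPA gives the best single-predictor fit. A sketch (my addition), reusing the objects lm.fit through lm.fit6:

models <- list(`GRE Score` = lm.fit, `TOEFL Score` = lm.fit2,
               `University Rating` = lm.fit3, SOP = lm.fit4,
               LOR = lm.fit5, CGPA = lm.fit6)

# Extract R-squared from each fit and sort from best to worst
sort(sapply(models, function(m) summary(m)$r.squared), decreasing = TRUE)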

6. Use the smooth method to show in each plot the model that best fits the data; do this for each plot separately.

train %>%
  ggplot(aes(x=`GRE Score`, y=`Chance of Admit`)) +
  geom_point() +
  geom_smooth() 
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

train %>%
  ggplot(aes(x=`TOEFL Score`, y=`Chance of Admit`)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

train %>%
  ggplot(aes(x=`University Rating`, y=`Chance of Admit`)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 3
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
## at 3
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 1
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal
## condition number 0
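
The warnings above appear because University Rating takes only five discrete values, which is too few for the default loess smoother. A linear smoother avoids them; a sketch, using the same training data:

train %>%
  ggplot(aes(x = `University Rating`, y = `Chance of Admit`)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)   # linear fit instead of loess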

train %>%
  ggplot(aes(x=SOP, y=`Chance of Admit`)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

train %>%
  ggplot(aes(x=LOR, y=`Chance of Admit`)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

train %>%
  ggplot(aes(x=CGPA, y=`Chance of Admit`)) +
  geom_point() +
  geom_smooth() 
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

PART II

1. Build a regression model (linear/non-linear) with all of the previous variables, and state the significance of each variable.

fit_modelo <- lm(data = train, train$`Chance of Admit` ~ poly(train$`GRE Score`,2)+ poly(train$`TOEFL Score`,2) + train$CGPA + train$SOP + train$LOR + train$`University Rating`)
summary(fit_modelo)
## 
## Call:
## lm(formula = train$`Chance of Admit` ~ poly(train$`GRE Score`, 
##     2) + poly(train$`TOEFL Score`, 2) + train$CGPA + train$SOP + 
##     train$LOR + train$`University Rating`, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27463 -0.02366  0.01092  0.03667  0.16192 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -0.353193   0.105829  -3.337 0.000939 ***
## poly(train$`GRE Score`, 2)1    0.424508   0.133796   3.173 0.001647 ** 
## poly(train$`GRE Score`, 2)2   -0.021811   0.089546  -0.244 0.807710    
## poly(train$`TOEFL Score`, 2)1  0.357443   0.131190   2.725 0.006770 ** 
## poly(train$`TOEFL Score`, 2)2  0.006077   0.088303   0.069 0.945172    
## train$CGPA                     0.112982   0.013236   8.536 4.66e-16 ***
## train$SOP                     -0.003592   0.005804  -0.619 0.536364    
## train$LOR                      0.024859   0.005917   4.201 3.40e-05 ***
## train$`University Rating`      0.010140   0.005016   2.022 0.043993 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06402 on 341 degrees of freedom
## Multiple R-squared:  0.7965, Adjusted R-squared:  0.7917 
## F-statistic: 166.8 on 8 and 341 DF,  p-value: < 2.2e-16
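
In the summary above, the linear GRE and TOEFL terms, CGPA, LOR, and University Rating are significant at the 5% level, while SOP and the quadratic terms are not. Whether the non-significant terms can be dropped can be checked with a nested-model F-test; a sketch (my addition), fitting a reduced model on the same training data:

fit_reduced <- lm(`Chance of Admit` ~ `GRE Score` + `TOEFL Score` + CGPA +
                    LOR + `University Rating`, data = train)

# F-test of the reduced model against the full polynomial model above
anova(fit_reduced, fit_modelo)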

2. Produce the model you consider best for predicting Chance of Admit.

fit_modelo2 <- lm(data = train, train$`Chance of Admit` ~ poly(train$`GRE Score`,2) + train$CGPA + train$`TOEFL Score`)
y_hat_train <- predict(fit_modelo2, train)
# Sum of squared errors on the training data (dividing by nrow(train) would give the MSE)
SSE_train <- sum((train$`Chance of Admit` - y_hat_train)^2)
SSE_train
## [1] 1.527538
summary(fit_modelo2)
## 
## Call:
## lm(formula = train$`Chance of Admit` ~ poly(train$`GRE Score`, 
##     2) + train$CGPA + train$`TOEFL Score`, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28772 -0.02609  0.01026  0.03970  0.14303 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -0.907401   0.124169  -7.308  1.9e-12 ***
## poly(train$`GRE Score`, 2)1  0.428082   0.133874   3.198  0.00151 ** 
## poly(train$`GRE Score`, 2)2 -0.010563   0.067359  -0.157  0.87548    
## train$CGPA                   0.145619   0.012068  12.067  < 2e-16 ***
## train$`TOEFL Score`          0.003520   0.001184   2.973  0.00316 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06654 on 345 degrees of freedom
## Multiple R-squared:  0.7775, Adjusted R-squared:  0.775 
## F-statistic: 301.5 on 4 and 345 DF,  p-value: < 2.2e-16
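
The formulas above reference columns as train$..., which ties the fitted objects to the training data frame and keeps predict() from scoring new data. A minimal sketch of the same specification written with bare column names, plus an evaluation on the held-out test set (my addition; the original reports only the training error):

fit_final <- lm(`Chance of Admit` ~ poly(`GRE Score`, 2) + CGPA + `TOEFL Score`,
                data = train)

# With bare column names in the formula, predict() can score the test set
y_hat_test <- predict(fit_final, newdata = test)
MSE_test   <- mean((test$`Chance of Admit` - y_hat_test)^2)
MSE_test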