library(ggplot2)
setwd("C:\\Users\\Fernando\\Documents\\STUFF\\GALILEO\\2-Econometria en R\\Laboratorios")
data = read.csv("1.1 Admission_Predict_Ver1.1.csv")
head(data)
##   Serial.No. GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research
## 1          1       337         118                 4 4.5 4.5 9.65        1
## 2          2       324         107                 4 4.0 4.5 8.87        1
## 3          3       316         104                 3 3.0 3.5 8.00        1
## 4          4       322         110                 3 3.5 2.5 8.67        1
## 5          5       314         103                 2 2.0 3.0 8.21        0
## 6          6       330         115                 5 4.5 3.0 9.34        1
##   Chance.of.Admit
## 1            0.92
## 2            0.76
## 3            0.72
## 4            0.80
## 5            0.65
## 6            0.90
Statistical analysis of the variables using summary()
summary(data)
##    Serial.No.      GRE.Score      TOEFL.Score    University.Rating
##  Min.   :  1.0   Min.   :290.0   Min.   : 92.0   Min.   :1.000
##  1st Qu.:125.8   1st Qu.:308.0   1st Qu.:103.0   1st Qu.:2.000
##  Median :250.5   Median :317.0   Median :107.0   Median :3.000
##  Mean   :250.5   Mean   :316.5   Mean   :107.2   Mean   :3.114
##  3rd Qu.:375.2   3rd Qu.:325.0   3rd Qu.:112.0   3rd Qu.:4.000
##  Max.   :500.0   Max.   :340.0   Max.   :120.0   Max.   :5.000
##       SOP             LOR             CGPA          Research
##  Min.   :1.000   Min.   :1.000   Min.   :6.800   Min.   :0.00
##  1st Qu.:2.500   1st Qu.:3.000   1st Qu.:8.127   1st Qu.:0.00
##  Median :3.500   Median :3.500   Median :8.560   Median :1.00
##  Mean   :3.374   Mean   :3.484   Mean   :8.576   Mean   :0.56
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:9.040   3rd Qu.:1.00
##  Max.   :5.000   Max.   :5.000   Max.   :9.920   Max.   :1.00
##  Chance.of.Admit
##  Min.   :0.3400
##  1st Qu.:0.6300
##  Median :0.7200
##  Mean   :0.7217
##  3rd Qu.:0.8200
##  Max.   :0.9700
Plot a histogram or density plot for each of the numeric variables: GRE.Score, TOEFL.Score, CGPA, and Chance.of.Admit.
hist(data$GRE.Score, col=scales::alpha('green',.5))
hist(data$TOEFL.Score, col = scales::alpha('red',.5))
hist(data$Chance.of.Admit, col = scales::alpha('skyblue',.7))
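The task also lists CGPA, which the three calls above do not cover. The following is a minimal sketch of one way to draw all four histograms with a density curve on top. To keep the block runnable on its own, `admissions` is a randomly generated stand-in with the same column names (an assumption, not the real CSV), and `plot_hist_density` is a hypothetical helper name.

```r
# Stand-in data with the same column names as the CSV (synthetic, for illustration only)
set.seed(1)
admissions <- data.frame(
  GRE.Score       = round(rnorm(200, mean = 316, sd = 11)),
  TOEFL.Score     = round(rnorm(200, mean = 107, sd = 6)),
  CGPA            = round(rnorm(200, mean = 8.6, sd = 0.6), 2),
  Chance.of.Admit = round(runif(200, min = 0.34, max = 0.97), 2)
)

# Hypothetical helper: histogram on the density scale with a kernel density curve overlaid
plot_hist_density <- function(df, column, fill) {
  x <- df[[column]]
  hist(x, freq = FALSE, col = fill,
       main = paste("Histogram of", column), xlab = column)
  lines(density(x), lwd = 2)
  invisible(x)
}

numeric_vars <- c("GRE.Score", "TOEFL.Score", "CGPA", "Chance.of.Admit")
fills <- c("lightgreen", "salmon", "khaki", "skyblue")

old_par <- par(mfrow = c(2, 2))  # arrange the four panels in a 2 x 2 grid
for (i in seq_along(numeric_vars)) {
  plot_hist_density(admissions, numeric_vars[i], fills[i])
}
par(old_par)                     # restore the previous graphics settings
```

With the real data, passing `data` instead of `admissions` should give the same four panels for the actual scores.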
Plot the correlation between the variables above.
pairs(data[c('TOEFL.Score', 'Chance.of.Admit', 'GRE.Score')])
cor(data[c('TOEFL.Score', 'Chance.of.Admit', 'GRE.Score')])
##                 TOEFL.Score Chance.of.Admit GRE.Score
## TOEFL.Score       1.0000000       0.7922276 0.8272004
## Chance.of.Admit   0.7922276       1.0000000 0.8103506
## GRE.Score         0.8272004       0.8103506 1.0000000
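Comment on the three variables observed: all pairwise correlations are strong and positive (between 0.79 and 0.83). GRE.Score and TOEFL.Score are the most strongly correlated pair (0.83), and GRE.Score is slightly more correlated with Chance.of.Admit (0.81) than TOEFL.Score is (0.79).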
Draw a scatter plot of every numeric variable against Chance.of.Admit.
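# Scatter plot with a loess smoother for each predictor against Chance.of.Admit;
# invisible() suppresses the list of NULLs that lapply() would otherwise print
invisible(lapply(c('TOEFL.Score', 'GRE.Score'), function(v)
  scatter.smooth(data[[v]], data$Chance.of.Admit, xlab = v, ylab = 'Chance.of.Admit')))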
Simple linear regression model with each numeric variable: Chance.of.Admit ~ GRE.Score
lm1 = lm(Chance.of.Admit ~ GRE.Score, data=data)
summary(lm1)
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score, data = data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.33784 -0.04479  0.00417  0.05449  0.18568
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.4828147  0.1038994  -23.90   <2e-16 ***
## GRE.Score    0.0101259  0.0003281   30.86   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08278 on 498 degrees of freedom
## Multiple R-squared: 0.6567, Adjusted R-squared: 0.656
## F-statistic: 952.5 on 1 and 498 DF, p-value: < 2.2e-16
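# plot() on the model frame (a two-column data frame) draws Chance.of.Admit on the x-axis and GRE.Score on the y-axis
plot(lm1$model)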
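To make the fitted equation concrete, the sketch below plugs the coefficients reported by summary(lm1) into Chance.of.Admit = intercept + slope * GRE.Score. The helper name `predict_chance` and the example scores are illustrative assumptions. With the fitted object available, `predict(lm1, newdata = data.frame(GRE.Score = 320))` should give the same value up to rounding of the printed coefficients.

```r
# Coefficients copied from the summary(lm1) output
intercept <- -2.4828147
slope     <- 0.0101259

# Hypothetical helper: fitted Chance.of.Admit for a given GRE score
predict_chance <- function(gre) intercept + slope * gre

predict_chance(320)          # about 0.757
predict_chance(c(300, 340))  # about 0.555 and 0.960

# Each additional GRE point adds the slope to the prediction,
# so 10 more points raise the predicted chance by about 0.10
predict_chance(330) - predict_chance(320)
```

The negative intercept has no direct interpretation here: a GRE score of 0 lies far outside the observed range of 290 to 340, so the line should only be read within that range.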
Model: Chance.of.Admit ~ TOEFL.Score
lm2 = lm(Chance.of.Admit ~ TOEFL.Score, data=data)
summary(lm2)
##
## Call:
## lm(formula = Chance.of.Admit ~ TOEFL.Score, data = data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.31337 -0.04990  0.01310  0.05633  0.20725
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2489882  0.0681317  -18.33   <2e-16 ***
## TOEFL.Score  0.0183850  0.0006346   28.97   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08621 on 498 degrees of freedom
## Multiple R-squared: 0.6276, Adjusted R-squared: 0.6269
## F-statistic: 839.4 on 1 and 498 DF, p-value: < 2.2e-16
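# plot() on the model frame (a two-column data frame) draws Chance.of.Admit on the x-axis and TOEFL.Score on the y-axis
plot(lm2$model)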
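Plotting the model frame shows the raw points but not the fitted line or the residuals. A possible alternative, sketched below, draws the predictor on the x-axis with the regression line on top, next to a residuals-versus-fitted panel. The data frame `admissions` is synthetic (generated from a line close to the lm2 coefficients plus noise) and only stands in for the real file so that the block runs on its own.

```r
# Synthetic stand-in data: NOT the real admissions file
set.seed(2)
toefl  <- round(runif(150, min = 92, max = 120))
chance <- pmin(pmax(-1.25 + 0.0184 * toefl + rnorm(150, sd = 0.08), 0), 1)
admissions <- data.frame(TOEFL.Score = toefl, Chance.of.Admit = chance)

fit <- lm(Chance.of.Admit ~ TOEFL.Score, data = admissions)

old_par <- par(mfrow = c(1, 2))

# Left panel: data with the fitted regression line
plot(Chance.of.Admit ~ TOEFL.Score, data = admissions,
     main = "Data and fitted line")
abline(fit, col = "red", lwd = 2)

# Right panel: residuals against fitted values; a patternless band
# around zero is consistent with the linear specification
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

par(old_par)
```

On the real objects, the same calls with `lm2` and `data` in place of `fit` and `admissions` should produce the corresponding panels; `plot(lm2)` would give R's standard set of diagnostic plots.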
Both models predict Chance.of.Admit well, since each predictor has a similar correlation with the response (about 0.8). If I had to choose one model, it would be Chance.of.Admit ~ GRE.Score, because it fits the data better: its R-squared is higher (0.6567 vs. 0.6276) and its residual standard error is lower (0.08278 vs. 0.08621).
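The link between the correlations and the fit of each model can be checked directly: in a simple linear regression with an intercept, R-squared equals the squared correlation between predictor and response. The sketch below verifies this first with the numbers reported above and then on a small synthetic example; `x`, `y`, and `fit` are illustrative names, not objects from the analysis.

```r
# Correlations with Chance.of.Admit, copied from the cor() output
r_gre   <- 0.8103506
r_toefl <- 0.7922276

# Squared correlations reproduce the Multiple R-squared of lm1 and lm2
round(c(GRE.Score = r_gre^2, TOEFL.Score = r_toefl^2), 4)  # 0.6567 and 0.6276

# The same identity on synthetic data
set.seed(3)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
all.equal(summary(fit)$r.squared, cor(x, y)^2)  # TRUE
```

Because both models have a single predictor and the same response, ranking them by R-squared is equivalent to ranking the predictors by their absolute correlation with Chance.of.Admit.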