Cristina Sierra Alcala (c.sierra10) - Oscar Andrés Mosquera Fonseca (oa.mosquera)
#install.packages("readxl") # Carga de archivos desde Excel
#install.packages("ggplot2") # Gráficas
library(readxl)
Salarios <- read_excel("Salarios.xlsx")
View(Salarios)
library(ggplot2)
ggplot(data = Salarios, aes(x = exper, y = salario)) +
geom_point(colour = "red4") +
ggtitle("Diagrama de dispersión") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
cor.test(x = Salarios$exper, y = Salarios$salario, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: Salarios$exper and Salarios$salario
## t = 4.2434, df = 44, p-value = 0.0001116
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2946908 0.7170172
## sample estimates:
## cor
## 0.5388879
El gráfico y el test de correlación muestran una relación lineal, de intensidad considerable (r = 0.5388879) y significativa (p-value = 0.0001116). Tiene sentido intentar generar un modelo de regresión lineal que permita predecir el salario en función del número de años de experiencia.
mod1 <- lm(salario ~ exper, data = Salarios)
summary(mod1)
##
## Call:
## lm(formula = salario ~ exper, data = Salarios)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4197 -2781 -2368 4685 6420
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13584.0 1051.5 12.919 < 2e-16 ***
## exper 491.5 115.8 4.243 0.000112 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4018 on 44 degrees of freedom
## Multiple R-squared: 0.2904, Adjusted R-squared: 0.2743
## F-statistic: 18.01 on 1 and 44 DF, p-value: 0.0001116
plot(x=Salarios$exper, y =Salarios$salario, xlab ="Años experiencia (#)", ylab="Salario ($)", main="Salarios vs Años experiencia", pch=20, col="red")
abline(mod1, lwd=2, col ="blue")
Como \((p<0.05)\) \((0.000112<0.05)\) existe una relación significativa entre el salario y el número de años de experiencia.
Como el p-valor es pequeño \((p<0.05)\) \((0.0001116<0.05)\) el modelo en conjunto es significativo.
Se calculan los residuos para cada observación y se representan (scatterplot). Si las observaciones siguen la línea del modelo, los residuos se deben distribuir aleatoriamente entorno al valor 0.
Salarios$prediccion <- mod1$fitted.values
Salarios$residuos <- mod1$residuals
ggplot(data = Salarios, aes(x = prediccion, y = residuos)) +
geom_point(aes(color = residuos)) +
scale_color_gradient2(low = "blue3", mid = "grey", high = "red") +
geom_segment(aes(xend = prediccion, yend = 0), alpha = 0.2) +
geom_smooth(se = FALSE, color = "firebrick") +
labs(title = "Distribución de los residuos", x = "predicción modelo",
y = "residuo") +
geom_hline(yintercept = 0) +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5), legend.position = "none")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Se evidencia que la variabilidad de los residuos es constante a lo largo del eje X y se aprecia un posible dato atípico
mod1c<-lm(salario~exper+I(exper^2),data=Salarios)
summary(mod1c)
##
## Call:
## lm(formula = salario ~ exper + I(exper^2), data = Salarios)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4682 -3080 -1860 4335 6907
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12394.57 1581.01 7.840 8.01e-10 ***
## exper 905.67 427.17 2.120 0.0398 *
## I(exper^2) -23.26 23.09 -1.007 0.3194
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4017 on 43 degrees of freedom
## Multiple R-squared: 0.3068, Adjusted R-squared: 0.2745
## F-statistic: 9.514 on 2 and 43 DF, p-value: 0.0003793
plot(Salarios$exper,Salarios$salario,xlab='Años experiencia (#)',ylab='Salario ($)',main='Ajuste cuadrático')
curve(mod1c$coefficient[1]+mod1c$coefficient[2]*x+mod1c$coefficient[3]*x^2,add=T,col="red")
Como \((p>0.05)\) \((0.3194>0.05)\) no existe una relación significativa entre el salario y el número de años de experiencia elevados al cuadro \(EXP^{2}\).
Aunque el p-valor es pequeño \((p<0.05)\) \((0.0003793<0.05)\) y el modelo en conjunto es significativo, por la significancia obtenida de los parametros de los modelos, se evidencia que existe una relación lineal entre el salario y el número de años de experiencia.
Salarios$cargoadm[Salarios$cargoadm == "si"] <- "1"
Salarios$cargoadm[Salarios$cargoadm == "no"] <- "0"
Salarios$cargoadm <- factor(Salarios$cargoadm)
mod2 <- lm(salario ~ exper + educ + cargoadm, data = Salarios)
summary(mod2)
##
## Call:
## lm(formula = salario ~ exper + educ + cargoadm, data = Salarios)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1884.64 -653.61 22.29 844.84 1716.42
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8035.52 386.70 20.780 < 2e-16 ***
## exper 546.19 30.52 17.896 < 2e-16 ***
## educpost 2996.21 411.76 7.277 6.73e-09 ***
## educuniv 3144.09 361.98 8.686 7.73e-11 ***
## cargoadm1 6883.59 313.93 21.927 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1027 on 41 degrees of freedom
## Multiple R-squared: 0.9568, Adjusted R-squared: 0.9525
## F-statistic: 226.8 on 4 and 41 DF, p-value: < 2.2e-16
Salarios$prediccion <- mod2$fitted.values
Salarios$residuos <- mod2$residuals
ggplot(data = Salarios, aes(x = prediccion, y = residuos)) +
geom_point(aes(color = residuos)) +
scale_color_gradient2(low = "blue3", mid = "grey", high = "red") +
geom_hline(yintercept = 0) +
geom_segment(aes(xend = prediccion, yend = 0), alpha = 0.2) +
labs(title = "Distribución de los residuos", x = "predicción modelo",
y = "residuo") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5), legend.position = "none")
mod3 <- lm(salario ~ exper*educ + cargoadm, data = Salarios)
summary(mod3)
##
## Call:
## lm(formula = salario ~ exper * educ + cargoadm, data = Salarios)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2013.08 -634.64 -16.72 615.67 2014.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7256.23 549.51 13.205 5.66e-16 ***
## exper 632.29 53.19 11.888 1.54e-14 ***
## educpost 3946.39 686.71 5.747 1.16e-06 ***
## educuniv 4172.46 674.99 6.182 2.90e-07 ***
## cargoadm1 7102.52 333.45 21.300 < 2e-16 ***
## exper:educpost -141.28 89.28 -1.582 0.1216
## exper:educuniv -125.50 69.87 -1.796 0.0802 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1005 on 39 degrees of freedom
## Multiple R-squared: 0.9606, Adjusted R-squared: 0.9546
## F-statistic: 158.6 on 6 and 39 DF, p-value: < 2.2e-16
Salarios$prediccion <- mod3$fitted.values
Salarios$residuos <- mod3$residuals
ggplot(data = Salarios, aes(x = prediccion, y = residuos)) +
geom_point(aes(color = residuos)) +
scale_color_gradient2(low = "blue3", mid = "grey", high = "red") +
geom_hline(yintercept = 0) +
geom_segment(aes(xend = prediccion, yend = 0), alpha = 0.2) +
labs(title = "Distribución de los residuos", x = "predicción modelo",
y = "residuo") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5), legend.position = "none")
mod4 <- lm(salario ~ exper*educ + cargoadm + exper:cargoadm, data = Salarios)
summary(mod4)
##
## Call:
## lm(formula = salario ~ exper * educ + cargoadm + exper:cargoadm,
## data = Salarios)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2278.14 -545.01 -78.07 517.42 1839.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7466.44 550.49 13.563 3.84e-16 ***
## exper 614.40 52.98 11.597 4.76e-14 ***
## educpost 4132.12 679.26 6.083 4.38e-07 ***
## educuniv 4190.34 659.12 6.357 1.84e-07 ***
## cargoadm1 6353.88 546.38 11.629 4.38e-14 ***
## exper:educpost -208.66 95.70 -2.180 0.0355 *
## exper:educuniv -147.30 69.40 -2.123 0.0404 *
## exper:cargoadm1 118.33 69.35 1.706 0.0961 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 981.5 on 38 degrees of freedom
## Multiple R-squared: 0.9634, Adjusted R-squared: 0.9567
## F-statistic: 143 on 7 and 38 DF, p-value: < 2.2e-16
Salarios$prediccion <- mod4$fitted.values
Salarios$residuos <- mod4$residuals
ggplot(data = Salarios, aes(x = prediccion, y = residuos)) +
geom_point(aes(color = residuos)) +
scale_color_gradient2(low = "blue3", mid = "grey", high = "red") +
geom_hline(yintercept = 0) +
geom_segment(aes(xend = prediccion, yend = 0), alpha = 0.2) +
labs(title = "Distribución de los residuos", x = "predicción modelo",
y = "residuo") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5), legend.position = "none")
mod5 <- lm(salario ~ exper*educ*cargoadm, data = Salarios)
summary(mod5)
##
## Call:
## lm(formula = salario ~ exper * educ * cargoadm, data = Salarios)
##
## Residuals:
## Min 1Q Median 3Q Max
## -918.20 -41.23 14.18 64.78 222.64
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9481.390 128.461 73.808 < 2e-16 ***
## exper 496.117 11.286 43.958 < 2e-16 ***
## educpost 1708.064 201.569 8.474 6.76e-10 ***
## educuniv 1327.046 161.178 8.233 1.32e-09 ***
## cargoadm1 3935.732 216.976 18.139 < 2e-16 ***
## exper:educpost 6.247 51.906 0.120 0.905
## exper:educuniv 6.493 15.071 0.431 0.669
## exper:cargoadm1 8.566 34.064 0.251 0.803
## educpost:cargoadm1 3158.650 294.904 10.711 1.95e-12 ***
## educuniv:cargoadm1 5062.446 277.738 18.227 < 2e-16 ***
## exper:educpost:cargoadm1 -18.416 62.762 -0.293 0.771
## exper:educuniv:cargoadm1 -22.017 38.307 -0.575 0.569
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 184.1 on 34 degrees of freedom
## Multiple R-squared: 0.9988, Adjusted R-squared: 0.9985
## F-statistic: 2683 on 11 and 34 DF, p-value: < 2.2e-16
Salarios$prediccion <- mod5$fitted.values
Salarios$residuos <- mod5$residuals
ggplot(data = Salarios, aes(x = prediccion, y = residuos)) +
geom_point(aes(color = residuos)) +
scale_color_gradient2(low = "blue3", mid = "grey", high = "red") +
geom_hline(yintercept = 0) +
geom_segment(aes(xend = prediccion, yend = 0), alpha = 0.2) +
labs(title = "Distribución de los residuos", x = "predicción modelo",
y = "residuo") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5), legend.position = "none")
Al calcular los residuos para cada observación, las observaciones siguen la línea del modelo, y los residuos se distribuyen aleatoriamente entorno al valor 0, con un valor atipico.
library(car)
## Loading required package: carData
outlierTest(mod5)
## rstudent unadjusted p-value Bonferonni p
## 33 -14.48176 7.4198e-16 3.4131e-14
summary(influence.measures(mod5))
## Potentially influential observations of
## lm(formula = salario ~ exper * educ * cargoadm, data = Salarios) :
##
## dfb.1_ dfb.expr dfb.edcp dfb.edcn dfb.crg1 dfb.expr:dcp dfb.expr:dcn
## 1 0.00 0.00 0.00 0.00 -0.44 0.00 0.00
## 2 0.00 0.00 -0.35 0.00 0.00 0.32 0.00
## 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## 5 0.00 0.00 0.31 0.00 0.00 -0.29 0.00
## 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## 8 0.30 -0.25 -0.19 -0.24 -0.18 0.05 0.19
## 16 0.00 0.00 -0.02 0.00 0.00 0.10 0.00
## 19 0.00 0.00 0.09 0.00 0.00 -0.23 0.00
## 25 0.00 0.00 0.00 0.00 -0.02 0.00 0.00
## 27 0.00 0.00 0.00 0.00 0.13 0.00 0.00
## 33 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## 41 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## 46 0.23 -0.37 -0.15 -0.18 -0.14 0.08 0.28
## dfb.ex:1 dfb.edcp:1 dfb.edcn:1 dfb.expr:dcp:1 dfb.expr:dcn:1 dffit
## 1 0.43 0.32 0.34 -0.23 -0.38 -0.55
## 2 0.00 0.24 0.00 -0.26 0.00 -0.47
## 3 0.00 -0.17 0.00 0.08 0.00 -0.40
## 5 0.00 -0.21 0.00 0.24 0.00 0.42
## 6 0.00 0.00 0.27 0.00 -0.16 0.52
## 8 0.08 0.13 0.14 -0.04 -0.07 0.30
## 16 0.00 0.01 0.00 -0.08 0.00 0.16
## 19 0.00 -0.06 0.00 0.19 0.00 -0.29
## 25 0.08 0.02 0.02 -0.05 -0.07 0.14
## 27 -0.29 -0.09 -0.10 0.16 0.26 -0.40
## 33 0.00 0.00 -0.77 0.00 -0.59 -6.15_*
## 41 0.00 -0.02 0.00 0.02 0.00 0.10
## 46 0.12 0.10 0.11 -0.07 -0.11 -0.42
## cov.r cook.d hat
## 1 3.74_* 0.03 0.64
## 2 2.09_* 0.02 0.39
## 3 2.09_* 0.01 0.38
## 5 2.14_* 0.01 0.39
## 6 2.15_* 0.02 0.42
## 8 2.08_* 0.01 0.35
## 16 2.16_* 0.00 0.35
## 19 3.87_* 0.01 0.64
## 25 2.17_* 0.00 0.35
## 27 2.78_* 0.01 0.51
## 33 0.00 0.44 0.15
## 41 2.79_* 0.00 0.49
## 46 2.61_* 0.02 0.49
influencePlot(mod5)
## StudRes Hat CookD
## 1 -0.4097731 0.6402439 0.025527235
## 19 -0.2155878 0.6363636 0.006973633
## 33 -14.4817568 0.1529720 0.442126543
## 42 1.5568443 0.4676573 0.170305943
Se confirma que los residuos sí se distribuyen de forma normal a excepción de un dato extremo (observación 33).
Las funciones influence.measures() e influencePlot() detectan la observación 33 como atípica pero no significativamente influyente. Sí detectan como influyente la observación 42.
Para evaluar hasta qué punto condiciona el modelo, se recalcula la recta de mínimos cuadrados excluyendo la observación 33.
Salarios_2<-Salarios[-33,]
mod6<- lm(salario ~ exper*educ*cargoadm, data = Salarios_2)
summary(mod6)
##
## Call:
## lm(formula = salario ~ exper * educ * cargoadm, data = Salarios_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -145.804 -39.946 -3.265 42.245 113.081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9481.390 48.079 197.204 <2e-16 ***
## exper 496.117 4.224 117.451 <2e-16 ***
## educpost 1708.064 75.441 22.641 <2e-16 ***
## educuniv 1327.046 60.324 21.999 <2e-16 ***
## cargoadm1 3935.732 81.208 48.465 <2e-16 ***
## exper:educpost 6.247 19.427 0.322 0.750
## exper:educuniv 6.493 5.641 1.151 0.258
## exper:cargoadm1 8.566 12.749 0.672 0.506
## educpost:cargoadm1 3158.650 110.374 28.618 <2e-16 ***
## educuniv:cargoadm1 5142.990 104.098 49.405 <2e-16 ***
## exper:educpost:cargoadm1 -18.416 23.490 -0.784 0.439
## exper:educuniv:cargoadm1 -13.489 14.349 -0.940 0.354
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68.89 on 33 degrees of freedom
## Multiple R-squared: 0.9998, Adjusted R-squared: 0.9998
## F-statistic: 1.834e+04 on 11 and 33 DF, p-value: < 2.2e-16
Salarios_2$prediccion <- mod6$fitted.values
Salarios_2$residuos <- mod6$residuals
ggplot(data = Salarios_2, aes(x = prediccion, y = residuos)) +
geom_point(aes(color = residuos)) +
scale_color_gradient2(low = "blue3", mid = "grey", high = "red") +
geom_hline(yintercept = 0) +
geom_segment(aes(xend = prediccion, yend = 0), alpha = 0.2) +
labs(title = "Distribución de los residuos", x = "predicción modelo",
y = "residuo") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5), legend.position = "none")
Salarios_2$educ <- ifelse(test = Salarios_2$educ == "post", yes = 1, no = 0)
rbind(head(Salarios_2, 3), tail(Salarios_2, 3))
## # A tibble: 6 x 6
## salario exper educ cargoadm prediccion residuos
## <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
## 1 13876 1 0 1 13922. -45.8
## 2 11608 1 1 0 11692. -83.8
## 3 18701 1 1 1 18776. -75.4
## 4 17483 16 0 0 17419. 63.7
## 5 19207 17 0 0 19353. -146.
## 6 19346 20 0 0 19404. -57.7
mod7<- lm(salario ~ exper*educ*cargoadm, data = Salarios_2)
summary(mod7)
##
## Call:
## lm(formula = salario ~ exper * educ * cargoadm, data = Salarios_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3871.5 -659.3 13.5 543.6 4270.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10425.30 705.33 14.781 < 2e-16 ***
## exper 478.77 67.31 7.113 2.01e-08 ***
## educ 764.15 1583.45 0.483 0.632233
## cargoadm1 4617.65 1149.12 4.018 0.000276 ***
## exper:educ 23.59 467.27 0.050 0.960000
## exper:cargoadm1 300.30 129.29 2.323 0.025795 *
## educ:cargoadm1 2476.73 2154.82 1.149 0.257771
## exper:educ:cargoadm1 -310.15 498.18 -0.623 0.537379
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1680 on 37 degrees of freedom
## Multiple R-squared: 0.891, Adjusted R-squared: 0.8703
## F-statistic: 43.2 on 7 and 37 DF, p-value: 6.602e-16
Como el p-valor de la interaciión \(educpost:cargoadm1\) es muy pequeño \((<2e-16)\) …
confint(mod6)
## 2.5 % 97.5 %
## (Intercept) 9383.572436 9579.20785
## exper 487.522658 504.71042
## educpost 1554.577599 1861.55121
## educuniv 1204.314808 1449.77635
## cargoadm1 3770.513573 4100.95005
## exper:educpost -33.276842 45.77103
## exper:educuniv -4.982347 17.96909
## exper:cargoadm1 -17.371993 34.50476
## educpost:cargoadm1 2934.092592 3383.20788
## educuniv:cargoadm1 4931.201413 5354.77932
## exper:educpost:cargoadm1 -66.207195 29.37452
## exper:educuniv:cargoadm1 -42.682825 15.70484