Preguntas parte 1: (30 puntos) Utilice el conjunto
de datos “chicago” disponibles en la librerıa
“faraway”. Considere Y = involact como variable respuesta, todas las dem
́as ser ́an variables explica-
tivas.
1. Realizar una an ́alisis descriptivo de las variables de la base de
datos. Debe incluir indicadores y
gráficas.
# Cargar la librería necesaria
library(faraway)
#Obtención informacion de dataset chicago
?faraway::chicago
# Cargar el conjunto de datos
data(chicago)
head(chicago, n=10)
# Obtener un resumen estadístico de las variables
summary(chicago)
race fire theft age volact involact income
Min. : 1.00 Min. : 2.00 Min. : 3.00 Min. : 2.00 Min. : 0.50 Min. :0.0000 Min. : 5583
1st Qu.: 3.75 1st Qu.: 5.65 1st Qu.: 22.00 1st Qu.:48.60 1st Qu.: 3.10 1st Qu.:0.0000 1st Qu.: 8447
Median :24.50 Median :10.40 Median : 29.00 Median :65.00 Median : 5.90 Median :0.4000 Median :10694
Mean :34.99 Mean :12.28 Mean : 32.36 Mean :60.33 Mean : 6.53 Mean :0.6149 Mean :10696
3rd Qu.:57.65 3rd Qu.:16.05 3rd Qu.: 38.00 3rd Qu.:77.30 3rd Qu.: 9.65 3rd Qu.:0.9000 3rd Qu.:11989
Max. :99.70 Max. :39.70 Max. :147.00 Max. :90.10 Max. :14.30 Max. :2.2000 Max. :21480
# Ver los nombres de las variables en el conjunto de datos
names(chicago)
[1] "race" "fire" "theft" "age" "volact" "involact" "income"
# O usar str() para ver la estructura del conjunto de datos
str(chicago)
'data.frame': 47 obs. of 7 variables:
$ race : num 10 22.2 19.6 17.3 24.5 54 4.9 7.1 5.3 21.5 ...
$ fire : num 6.2 9.5 10.5 7.7 8.6 34.1 11 6.9 7.3 15.1 ...
$ theft : num 29 44 36 37 53 68 75 18 31 25 ...
$ age : num 60.4 76.5 73.5 66.9 81.4 52.6 42.6 78.5 90.1 89.8 ...
$ volact : num 5.3 3.1 4.8 5.7 5.9 4 7.9 6.9 7.6 3.1 ...
$ involact: num 0 0.1 1.2 0.5 0.7 0.3 0 0 0.4 1.1 ...
$ income : num 11744 9323 9948 10656 9730 ...
# Crear gráficas
# Histograma para la variable target del dataset
hist(chicago$involact)
# Diagrama de dispersión para todas las variables
plot(chicago)
# Diagrama de dispersión para dos variables numéricas
plot(chicago$age, chicago$involact)
#Generamos gráfico de dispersión para la data Chicago
ggplot(chicago, aes(x=fire, y=involact)) + geom_point()
ggplot(chicago, aes(x=fire)) + geom_histogram(binwidth=1)
2. Utilizando alguno de los criterios de selección de variables, determine el modelo lineal que mejor ajusta a la variable respuesta. Indique el criterio utilizado
# Fit all possible linear models using the lm function
model1 <- lm(involact ~ ., data = chicago)
# Print the summary of the best model based on AIC
summary(model1)
Call:
lm(formula = involact ~ ., data = chicago)
Residuals:
Min 1Q Median 3Q Max
-0.84296 -0.14613 -0.01007 0.18386 0.81235
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.862e-01 6.020e-01 -0.808 0.424109
race 8.527e-03 2.863e-03 2.978 0.004911 **
fire 3.778e-02 8.982e-03 4.206 0.000142 ***
theft -1.016e-02 2.908e-03 -3.494 0.001178 **
age 7.615e-03 3.330e-03 2.287 0.027582 *
volact -1.018e-02 2.773e-02 -0.367 0.715519
income 2.568e-05 3.220e-05 0.798 0.429759
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3387 on 40 degrees of freedom
Multiple R-squared: 0.7517, Adjusted R-squared: 0.7144
F-statistic: 20.18 on 6 and 40 DF, p-value: 1.072e-10
# Ajustar un modelo lineal con criterio de selección por AIC
model2 <- step(lm(involact ~ . , data = chicago), direction = "both", trace = 1000)
Start: AIC=-95.34
involact ~ race + fire + theft + age + volact + income
Df Sum of Sq RSS AIC
- volact 1 0.01546 4.6047 -97.184
- income 1 0.07300 4.6622 -96.601
<none> 4.5892 -95.342
- age 1 0.59993 5.1892 -91.568
- race 1 1.01743 5.6067 -87.931
- theft 1 1.40048 5.9897 -84.825
- fire 1 2.02990 6.6191 -80.129
Step: AIC=-97.18
involact ~ race + fire + theft + age + income
Df Sum of Sq RSS AIC
- income 1 0.06710 4.6718 -98.504
<none> 4.6047 -97.184
+ volact 1 0.01546 4.5892 -95.342
- age 1 0.99296 5.5977 -90.007
- theft 1 1.46328 6.0680 -86.215
- race 1 1.74657 6.3513 -84.070
- fire 1 2.37807 6.9828 -79.615
Step: AIC=-98.5
involact ~ race + fire + theft + age
Df Sum of Sq RSS AIC
<none> 4.6718 -98.504
+ income 1 0.06710 4.6047 -97.184
+ volact 1 0.00955 4.6622 -96.601
- age 1 0.99734 5.6691 -91.410
- theft 1 1.41436 6.0862 -88.074
- race 1 2.05375 6.7256 -83.379
- fire 1 2.38365 7.0554 -81.128
# Ajustar un modelo lineal con criterio de selección por AIC
model3 <- step(lm(involact ~ . , data = chicago), trace = 0, k=log(length(chicago$involact)))
# Imprimir los resultados del modelo ajustado
summary(model3)
Call:
lm(formula = involact ~ race + fire + theft + age, data = chicago)
Residuals:
Min 1Q Median 3Q Max
-0.87108 -0.14830 -0.01961 0.19968 0.81638
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.243118 0.145054 -1.676 0.101158
race 0.008104 0.001886 4.297 0.000100 ***
fire 0.036646 0.007916 4.629 3.51e-05 ***
theft -0.009592 0.002690 -3.566 0.000921 ***
age 0.007210 0.002408 2.994 0.004595 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3335 on 42 degrees of freedom
Multiple R-squared: 0.7472, Adjusted R-squared: 0.7231
F-statistic: 31.03 on 4 and 42 DF, p-value: 4.799e-12
3. Analice la significancia del modelo obtenido luego del proceso de selección, y responda si:
```r
anova(model1)
```
```
Analysis of Variance Table
Response: involact
Df Sum Sq Mean Sq F value Pr(>F)
race 1 9.4143 9.4143 82.0556 3.087e-11 ***
fire 1 2.2326 2.2326 19.4597 7.556e-05 ***
theft 1 1.1635 1.1635 10.1409 0.002808 **
age 1 0.9973 0.9973 8.6929 0.005312 **
volact 1 0.0096 0.0096 0.0833 0.774414
income 1 0.0730 0.0730 0.6363 0.429759
Residuals 40 4.5892 0.1147
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```
```r
anova(model2)
```
```
Analysis of Variance Table
Response: involact
Df Sum Sq Mean Sq F value Pr(>F)
race 1 9.4143 9.4143 84.6358 1.274e-11 ***
fire 1 2.2326 2.2326 20.0716 5.645e-05 ***
theft 1 1.1635 1.1635 10.4598 0.002379 **
age 1 0.9973 0.9973 8.9662 0.004595 **
Residuals 42 4.6718 0.1112
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```
```r
anova(model3)
```
```
Analysis of Variance Table
Response: involact
Df Sum Sq Mean Sq F value Pr(>F)
race 1 9.4143 9.4143 84.6358 1.274e-11 ***
fire 1 2.2326 2.2326 20.0716 5.645e-05 ***
theft 1 1.1635 1.1635 10.4598 0.002379 **
age 1 0.9973 0.9973 8.9662 0.004595 **
Residuals 42 4.6718 0.1112
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```
¿Existe alguna covariable no significativa?
Además, podemos observar que la variable “crmrte” y “prbconv” son significativas (p < 0,05), mientras que las otras covariables no lo son. Si bien la eliminación de las covariables no significativas puede simplificar el modelo, en este caso es recomendable mantener todas las variables en el modelo ya que tienen relevancia teórica y podrían influir en la variable respuesta de forma indirecta. Además, eliminar una covariable no significativa podría estar sesgando el modelo y aumentando el riesgo de overfitting.
¿En caso de existir alguna covariable no significativa, la quitar ́ıa del modelo?. Fundamente.
En resumen, el modelo obtenido es significativo y las covariables “crmrte” y “prbconv” son importantes predictores de la variable respuesta “involact”. Aunque existen covariables no significativas, es recomendable mantenerlas en el modelo para una interpretación más completa y evitar el riesgo de overfitting.
4. Fijar al menos 5 valores para las covariables del modelo y con
ellas realizar la predicci ́on de la
media y la predicci ́on individual de la variable objetivo (incluir los
intervalos de confianza).
# Fit the linear regression model
model <- lm(involact ~ ., data = chicago)
summary(model)
Call:
lm(formula = involact ~ ., data = chicago)
Residuals:
Min 1Q Median 3Q Max
-0.84296 -0.14613 -0.01007 0.18386 0.81235
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.862e-01 6.020e-01 -0.808 0.424109
race 8.527e-03 2.863e-03 2.978 0.004911 **
fire 3.778e-02 8.982e-03 4.206 0.000142 ***
theft -1.016e-02 2.908e-03 -3.494 0.001178 **
age 7.615e-03 3.330e-03 2.287 0.027582 *
volact -1.018e-02 2.773e-02 -0.367 0.715519
income 2.568e-05 3.220e-05 0.798 0.429759
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3387 on 40 degrees of freedom
Multiple R-squared: 0.7517, Adjusted R-squared: 0.7144
F-statistic: 20.18 on 6 and 40 DF, p-value: 1.072e-10
# Set values for the covariates
new_data <- data.frame( age =c(18.0, 23.0, 28.0, 56.0, 40.0),
race = c(15.0, 22.0, 35.0, 44.0, 60.0),
theft = c(5.0, 16.0, 20.0, 15.0, 22.0),
volact =c(5.0, 7.0 , 9.0, 10.0, 9.0),
income= c(6000.0, 6500.0, 7800.0, 7200.0, 8800.0),
fire = c(4.0, 15.0, 24.0, 30.0, 22.0),
involact = c(0.23,0.43,0.65,0.77,0.45))
model_new <- lm(involact ~ ., data = new_data)
summary(model_new)
Call:
lm(formula = involact ~ ., data = new_data)
Residuals:
ALL 5 residuals are 0: no residual degrees of freedom!
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.472708 NaN NaN NaN
age -0.001310 NaN NaN NaN
race -0.006684 NaN NaN NaN
theft -0.008587 NaN NaN NaN
volact 0.173896 NaN NaN NaN
income NA NA NA NA
fire NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 4 and 0 DF, p-value: NA
# Make predictions for the mean and individual values
mean_pred <- predict(model, newdata = new_data, interval = 'confidence')
indiv_pred <- predict(model, newdata = new_data, interval = 'prediction')
# Print the results
print(mean_pred)
fit lwr upr
1 -0.01768718 -0.62262004 0.5872457
2 0.37638419 -0.07781537 0.8305838
3 0.83772323 0.45078257 1.2246639
4 1.37958497 0.88262419 1.8765458
5 1.07209165 0.74534069 1.3988426
print(indiv_pred)
fit lwr upr
1 -0.01768718 -0.93124662 0.8958723
2 0.37638419 -0.44516576 1.1979341
3 0.83772323 0.05135807 1.6240884
4 1.37958497 0.53364342 2.2255265
5 1.07209165 0.31353165 1.8306517
# Comparación de modelos según explicación
summary(reg1)
Call:
lm(formula = O3 ~ ., data = ozone)
Residuals:
Min 1Q Median 3Q Max
-12.1011 -2.9289 -0.2715 2.7080 13.3687
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.3135755 29.5193067 0.688 0.49186
vh -0.0054271 0.0053985 -1.005 0.31551
wind -0.0545832 0.1348425 -0.405 0.68590
humidity 0.0809741 0.0188394 4.298 2.29e-05 ***
temp 0.2755492 0.0497912 5.534 6.52e-08 ***
ibh -0.0002338 0.0002956 -0.791 0.42944
dpg -0.0033629 0.0112805 -0.298 0.76581
ibt 0.0296411 0.0136088 2.178 0.03013 *
vis -0.0079910 0.0037503 -2.131 0.03387 *
doy -0.0091194 0.0027745 -3.287 0.00113 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.441 on 320 degrees of freedom
Multiple R-squared: 0.7012, Adjusted R-squared: 0.6927
F-statistic: 83.42 on 9 and 320 DF, p-value: < 2.2e-16
summary(reg2)
Call:
lm(formula = O3 ~ wind + temp + humidity, data = ozone)
Residuals:
Min 1Q Median 3Q Max
-10.307 -3.180 -0.366 2.873 15.335
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -16.73118 1.33189 -12.562 < 2e-16 ***
wind -0.15890 0.12785 -1.243 0.215
temp 0.39128 0.01941 20.161 < 2e-16 ***
humidity 0.08797 0.01449 6.071 3.53e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.767 on 326 degrees of freedom
Multiple R-squared: 0.6492, Adjusted R-squared: 0.646
F-statistic: 201.1 on 3 and 326 DF, p-value: < 2.2e-16
anova(reg2, reg1)
Analysis of Variance Table
Model 1: O3 ~ wind + temp + humidity
Model 2: O3 ~ vh + wind + humidity + temp + ibh + dpg + ibt + vis + doy
Res.Df RSS Df Sum of Sq F Pr(>F)
1 326 7407.0
2 320 6310.3 6 1096.8 9.2696 2.248e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# seleccion de modelos automatizado
mod1.step0=step(reg1) # backward por AIC
Start: AIC=993.78
O3 ~ vh + wind + humidity + temp + ibh + dpg + ibt + vis + doy
Df Sum of Sq RSS AIC
- dpg 1 1.75 6312.0 991.87
- wind 1 3.23 6313.5 991.95
- ibh 1 12.34 6322.6 992.42
- vh 1 19.93 6330.2 992.82
<none> 6310.3 993.78
- vis 1 89.53 6399.8 996.43
- ibt 1 93.55 6403.8 996.63
- doy 1 213.04 6523.3 1002.74
- humidity 1 364.30 6674.6 1010.30
- temp 1 603.94 6914.2 1021.94
Step: AIC=991.87
O3 ~ vh + wind + humidity + temp + ibh + ibt + vis + doy
Df Sum of Sq RSS AIC
- wind 1 3.75 6315.8 990.07
- ibh 1 11.15 6323.2 990.45
- vh 1 19.28 6331.3 990.88
<none> 6312.0 991.87
- vis 1 89.85 6401.9 994.53
- ibt 1 122.86 6434.9 996.23
- doy 1 211.85 6523.9 1000.76
- humidity 1 475.74 6787.8 1013.85
- temp 1 755.99 7068.0 1027.20
Step: AIC=990.07
O3 ~ vh + humidity + temp + ibh + ibt + vis + doy
Df Sum of Sq RSS AIC
- ibh 1 14.91 6330.7 988.84
- vh 1 16.61 6332.4 988.93
<none> 6315.8 990.07
- vis 1 94.54 6410.3 992.97
- ibt 1 120.10 6435.9 994.28
- doy 1 211.28 6527.1 998.93
- humidity 1 479.79 6795.6 1012.23
- temp 1 756.65 7072.4 1025.41
Step: AIC=988.84
O3 ~ vh + humidity + temp + ibt + vis + doy
Df Sum of Sq RSS AIC
- vh 1 28.99 6359.7 988.35
<none> 6330.7 988.84
- vis 1 103.89 6434.6 992.22
- doy 1 267.47 6598.2 1000.50
- ibt 1 521.66 6852.4 1012.97
- humidity 1 538.82 6869.5 1013.80
- temp 1 820.77 7151.5 1027.07
Step: AIC=988.35
O3 ~ humidity + temp + ibt + vis + doy
Df Sum of Sq RSS AIC
<none> 6359.7 988.35
- vis 1 96.83 6456.5 991.34
- doy 1 341.33 6701.0 1003.60
- ibt 1 531.82 6891.5 1012.85
- humidity 1 690.57 7050.3 1020.37
- temp 1 817.49 7177.2 1026.26
mod1.step1=step(reg1, k=log(length(ozone$O3))) # backward por BIC
Start: AIC=1031.77
O3 ~ vh + wind + humidity + temp + ibh + dpg + ibt + vis + doy
Df Sum of Sq RSS AIC
- dpg 1 1.75 6312.0 1026.1
- wind 1 3.23 6313.5 1026.1
- ibh 1 12.34 6322.6 1026.6
- vh 1 19.93 6330.2 1027.0
- vis 1 89.53 6399.8 1030.6
- ibt 1 93.55 6403.8 1030.8
<none> 6310.3 1031.8
- doy 1 213.04 6523.3 1036.9
- humidity 1 364.30 6674.6 1044.5
- temp 1 603.94 6914.2 1056.1
Step: AIC=1026.06
O3 ~ vh + wind + humidity + temp + ibh + ibt + vis + doy
Df Sum of Sq RSS AIC
- wind 1 3.75 6315.8 1020.5
- ibh 1 11.15 6323.2 1020.9
- vh 1 19.28 6331.3 1021.3
- vis 1 89.85 6401.9 1024.9
<none> 6312.0 1026.1
- ibt 1 122.86 6434.9 1026.6
- doy 1 211.85 6523.9 1031.2
- humidity 1 475.74 6787.8 1044.2
- temp 1 755.99 7068.0 1057.6
Step: AIC=1020.46
O3 ~ vh + humidity + temp + ibh + ibt + vis + doy
Df Sum of Sq RSS AIC
- ibh 1 14.91 6330.7 1015.4
- vh 1 16.61 6332.4 1015.5
- vis 1 94.54 6410.3 1019.6
<none> 6315.8 1020.5
- ibt 1 120.10 6435.9 1020.9
- doy 1 211.28 6527.1 1025.5
- humidity 1 479.79 6795.6 1038.8
- temp 1 756.65 7072.4 1052.0
Step: AIC=1015.44
O3 ~ vh + humidity + temp + ibt + vis + doy
Df Sum of Sq RSS AIC
- vh 1 28.99 6359.7 1011.1
- vis 1 103.89 6434.6 1015.0
<none> 6330.7 1015.4
- doy 1 267.47 6598.2 1023.3
- ibt 1 521.66 6852.4 1035.8
- humidity 1 538.82 6869.5 1036.6
- temp 1 820.77 7151.5 1049.9
Step: AIC=1011.15
O3 ~ humidity + temp + ibt + vis + doy
Df Sum of Sq RSS AIC
- vis 1 96.83 6456.5 1010.3
<none> 6359.7 1011.1
- doy 1 341.33 6701.0 1022.6
- ibt 1 531.82 6891.5 1031.8
- humidity 1 690.57 7050.3 1039.4
- temp 1 817.49 7177.2 1045.2
Step: AIC=1010.33
O3 ~ humidity + temp + ibt + doy
Df Sum of Sq RSS AIC
<none> 6456.5 1010.3
- doy 1 292.59 6749.1 1019.2
- ibt 1 709.75 7166.3 1039.0
- temp 1 771.17 7227.7 1041.8
- humidity 1 1035.22 7491.8 1053.6
mod1.step2=step(reg2, scope=list(lower=reg2, upper=reg1), direction="both") # ambos
Start: AIC=1034.66
O3 ~ wind + temp + humidity
Df Sum of Sq RSS AIC
+ ibh 1 735.40 6671.6 1002.1
+ ibt 1 659.81 6747.2 1005.9
+ doy 1 350.86 7056.2 1020.6
+ vis 1 160.33 7246.7 1029.4
+ dpg 1 84.87 7322.2 1032.9
<none> 7407.0 1034.7
+ vh 1 9.39 7397.7 1036.2
Step: AIC=1002.15
O3 ~ wind + temp + humidity + ibh
Df Sum of Sq RSS AIC
+ doy 1 130.06 6541.6 997.66
+ vis 1 66.60 6605.0 1000.84
+ ibt 1 61.46 6610.2 1001.10
<none> 6671.6 1002.15
+ dpg 1 10.15 6661.5 1003.65
+ vh 1 0.03 6671.6 1004.15
- ibh 1 735.40 7407.0 1034.66
Step: AIC=997.66
O3 ~ wind + temp + humidity + ibh + doy
Df Sum of Sq RSS AIC
+ ibt 1 125.94 6415.6 993.24
+ vis 1 104.59 6437.0 994.34
+ dpg 1 40.48 6501.1 997.61
<none> 6541.6 997.66
+ vh 1 6.67 6534.9 999.32
- doy 1 130.06 6671.6 1002.15
- ibh 1 514.61 7056.2 1020.65
Step: AIC=993.24
O3 ~ wind + temp + humidity + ibh + doy + ibt
Df Sum of Sq RSS AIC
+ vis 1 84.321 6331.3 990.88
- ibh 1 30.844 6446.5 992.82
<none> 6415.6 993.24
+ vh 1 13.748 6401.9 994.53
+ dpg 1 1.449 6414.2 995.17
- ibt 1 125.944 6541.6 997.66
- doy 1 194.549 6610.2 1001.10
Step: AIC=990.88
O3 ~ wind + temp + humidity + ibh + doy + ibt + vis
Df Sum of Sq RSS AIC
- ibh 1 24.783 6356.1 990.17
<none> 6331.3 990.88
+ vh 1 19.277 6312.0 991.87
+ dpg 1 1.101 6330.2 992.82
- vis 1 84.321 6415.6 993.24
- ibt 1 105.670 6437.0 994.34
- doy 1 227.624 6558.9 1000.53
Step: AIC=990.17
O3 ~ wind + temp + humidity + doy + ibt + vis
Df Sum of Sq RSS AIC
<none> 6356.1 990.17
+ vh 1 32.91 6323.2 990.45
+ ibh 1 24.78 6331.3 990.88
+ dpg 1 0.00 6356.1 992.17
- vis 1 90.38 6446.5 992.82
- doy 1 338.35 6694.4 1005.28
- ibt 1 480.79 6836.9 1012.23
# analisis de multicolinealidad
vif(ozone[,-1]) # no se considera la variable objetivo
vh wind humidity temp ibh dpg ibt vis doy
5.433339 1.359493 2.336734 8.646941 4.742458 2.708380 18.167487 1.477973 1.399167
vif(ozone[,-c(1,8)])
vh wind humidity temp ibh dpg vis doy
4.371083 1.340957 2.332862 4.608365 1.868366 2.290083 1.475888 1.388378
cor(ozone[,-8])
O3 vh wind humidity temp ibh dpg vis doy
O3 1.000000000 0.60734379 0.002471078 0.44922399 0.780702835 -0.58953415 0.2140464 -0.4409895 0.06763357
vh 0.607343794 1.00000000 -0.225820526 0.07448508 0.808058890 -0.50483503 -0.1480705 -0.3600802 0.33787912
wind 0.002471078 -0.22582053 1.000000000 0.22286832 -0.005885934 0.19674570 0.3419509 0.1277023 -0.25112456
humidity 0.449223990 0.07448508 0.222868318 1.00000000 0.340474212 -0.24232766 0.6477889 -0.4010085 0.04126119
temp 0.780702835 0.80805889 -0.005885934 0.34047421 1.000000000 -0.53264472 0.1892419 -0.3877210 0.23933504
ibh -0.589534148 -0.50483503 0.196745705 -0.24232766 -0.532644723 1.00000000 0.0370779 0.3866858 0.04260315
dpg 0.214046394 -0.14807054 0.341950922 0.64778894 0.189241917 0.03707790 1.0000000 -0.1258547 -0.15274230
vis -0.440989465 -0.36008025 0.127702305 -0.40100846 -0.387721044 0.38668582 -0.1258547 1.0000000 -0.21770068
doy 0.067633574 0.33787912 -0.251124564 0.04126119 0.239335037 0.04260315 -0.1527423 -0.2177007 1.00000000
reg0=lm(O3~temp+humidity, data=ozone[,-8])
reg3=lm(O3~., data=ozone[,-8])
mod1.step3=step(reg0, scope=list(lower=reg0, upper=reg3), direction="both") # ambos
Start: AIC=1034.22
O3 ~ temp + humidity
Df Sum of Sq RSS AIC
+ ibh 1 769.08 6673.1 1000.2
+ doy 1 275.86 7166.3 1023.8
+ vis 1 187.11 7255.0 1027.8
+ dpg 1 109.15 7333.0 1031.3
<none> 7442.1 1034.2
+ wind 1 35.10 7407.0 1034.7
+ vh 1 22.98 7419.2 1035.2
Step: AIC=1000.22
O3 ~ temp + humidity + ibh
Df Sum of Sq RSS AIC
+ doy 1 124.92 6548.2 995.99
+ vis 1 60.94 6612.1 999.20
<none> 6673.1 1000.22
+ dpg 1 8.23 6664.8 1001.82
+ wind 1 1.42 6671.6 1002.15
+ vh 1 0.25 6672.8 1002.21
- ibh 1 769.08 7442.1 1034.22
Step: AIC=995.99
O3 ~ temp + humidity + ibh + doy
Df Sum of Sq RSS AIC
+ vis 1 109.48 6438.7 992.43
+ dpg 1 43.88 6504.3 995.77
<none> 6548.2 995.99
+ vh 1 9.34 6538.8 997.52
+ wind 1 6.57 6541.6 997.66
- doy 1 124.92 6673.1 1000.22
- ibh 1 618.13 7166.3 1023.76
Step: AIC=992.43
O3 ~ temp + humidity + ibh + doy + vis
Df Sum of Sq RSS AIC
<none> 6438.7 992.43
+ dpg 1 34.37 6404.3 992.66
+ vh 1 2.78 6435.9 994.28
+ wind 1 1.69 6437.0 994.34
- vis 1 109.48 6548.2 995.99
- doy 1 173.46 6612.1 999.20
- ibh 1 452.84 6891.5 1012.85
# comparación de modelos
summary(mod1.step0)
Call:
lm(formula = O3 ~ humidity + temp + ibt + vis + doy, data = ozone)
Residuals:
Min 1Q Median 3Q Max
-12.2499 -3.0600 -0.1662 2.8773 13.2690
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.017861 1.653058 -6.060 3.78e-09 ***
humidity 0.085101 0.014347 5.931 7.70e-09 ***
temp 0.232806 0.036074 6.454 3.98e-10 ***
ibt 0.034913 0.006707 5.205 3.45e-07 ***
vis -0.008201 0.003692 -2.221 0.027 *
doy -0.010197 0.002445 -4.170 3.91e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.43 on 324 degrees of freedom
Multiple R-squared: 0.6988, Adjusted R-squared: 0.6942
F-statistic: 150.3 on 5 and 324 DF, p-value: < 2.2e-16
summary(mod1.step1)
Call:
lm(formula = O3 ~ humidity + temp + ibt + doy, data = ozone)
Residuals:
Min 1Q Median 3Q Max
-12.0766 -3.2887 -0.1914 2.9635 13.1982
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.067079 1.379914 -8.745 < 2e-16 ***
humidity 0.096852 0.013417 7.219 3.74e-12 ***
temp 0.225055 0.036122 6.230 1.44e-09 ***
ibt 0.038879 0.006505 5.977 5.97e-09 ***
doy -0.009316 0.002428 -3.838 0.000149 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.457 on 325 degrees of freedom
Multiple R-squared: 0.6942, Adjusted R-squared: 0.6905
F-statistic: 184.5 on 4 and 325 DF, p-value: < 2.2e-16
summary(mod1.step2)
Call:
lm(formula = O3 ~ wind + temp + humidity + doy + ibt + vis, data = ozone)
Residuals:
Min 1Q Median 3Q Max
-12.0063 -3.1273 -0.1615 2.8612 13.3120
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.902115 1.677129 -5.904 8.97e-09 ***
wind -0.055275 0.129250 -0.428 0.6692
temp 0.236255 0.037009 6.384 6.01e-10 ***
humidity 0.086509 0.014738 5.870 1.08e-08 ***
doy -0.010455 0.002521 -4.147 4.32e-05 ***
ibt 0.034201 0.006919 4.943 1.24e-06 ***
vis -0.007992 0.003729 -2.143 0.0329 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.436 on 323 degrees of freedom
Multiple R-squared: 0.699, Adjusted R-squared: 0.6934
F-statistic: 125 on 6 and 323 DF, p-value: < 2.2e-16
summary(mod1.step3)
Call:
lm(formula = O3 ~ temp + humidity + ibh + doy + vis, data = ozone[,
-8])
Residuals:
Min 1Q Median 3Q Max
-11.1490 -2.9319 -0.2593 2.8726 13.9644
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.4291160 1.7993446 -4.685 4.14e-06 ***
temp 0.3418561 0.0219260 15.591 < 2e-16 ***
humidity 0.0661840 0.0139041 4.760 2.92e-06 ***
ibh -0.0008155 0.0001708 -4.774 2.74e-06 ***
doy -0.0075031 0.0025396 -2.954 0.00336 **
vis -0.0087056 0.0037090 -2.347 0.01952 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.458 on 324 degrees of freedom
Multiple R-squared: 0.6951, Adjusted R-squared: 0.6904
F-statistic: 147.7 on 5 and 324 DF, p-value: < 2.2e-16
# Diagnostico
shapiro.test(reg1$residuals)
Shapiro-Wilk normality test
data: reg1$residuals
W = 0.99687, p-value = 0.7751
shapiro.test(reg2$residuals)
Shapiro-Wilk normality test
data: reg2$residuals
W = 0.98852, p-value = 0.01046
qqnorm(reg1$residuals, pch=20); qqline(reg1$residuals, col=2, lwd=2)
qqnorm(reg2$residuals, pch=20); qqline(reg2$residuals, col=2, lwd=2)
qqnorm(ozone$O3)
Preguntas parte 2: (30 puntos) Utilice el conjunto
de datos "wbca" disponibles en la libreria
"faraway". Ademas, considere Y = Class, como la variable respuesta,
todas las demas seran variables
explicativas.
```r
# Cargar la librería necesaria
library(faraway)
#Obtención informacion de dataset wbca
?faraway::wbca
# Cargar el conjunto de datos
data(wbca)
head(wbca, n=10)
```
<div data-pagedtable="false">
<script data-pagedtable-source type="application/json">
{"columns":[{"label":[""],"name":["_rn_"],"type":[""],"align":["left"]},{"label":["Class"],"name":[1],"type":["int"],"align":["right"]},{"label":["Adhes"],"name":[2],"type":["int"],"align":["right"]},{"label":["BNucl"],"name":[3],"type":["int"],"align":["right"]},{"label":["Chrom"],"name":[4],"type":["int"],"align":["right"]},{"label":["Epith"],"name":[5],"type":["int"],"align":["right"]},{"label":["Mitos"],"name":[6],"type":["int"],"align":["right"]},{"label":["NNucl"],"name":[7],"type":["int"],"align":["right"]},{"label":["Thick"],"name":[8],"type":["int"],"align":["right"]},{"label":["UShap"],"name":[9],"type":["int"],"align":["right"]},{"label":["USize"],"name":[10],"type":["int"],"align":["right"]}],"data":[{"1":"1","2":"1","3":"1","4":"3","5":"2","6":"1","7":"1","8":"5","9":"1","10":"1","_rn_":"1"},{"1":"1","2":"5","3":"10","4":"3","5":"7","6":"1","7":"2","8":"5","9":"4","10":"4","_rn_":"2"},{"1":"1","2":"1","3":"2","4":"3","5":"2","6":"1","7":"1","8":"3","9":"1","10":"1","_rn_":"3"},{"1":"1","2":"1","3":"4","4":"3","5":"3","6":"1","7":"7","8":"6","9":"8","10":"8","_rn_":"4"},{"1":"1","2":"3","3":"1","4":"3","5":"2","6":"1","7":"1","8":"4","9":"1","10":"1","_rn_":"5"},{"1":"0","2":"8","3":"10","4":"9","5":"7","6":"1","7":"7","8":"8","9":"10","10":"10","_rn_":"6"},{"1":"1","2":"1","3":"10","4":"3","5":"2","6":"1","7":"1","8":"1","9":"1","10":"1","_rn_":"7"},{"1":"1","2":"1","3":"1","4":"3","5":"2","6":"1","7":"1","8":"2","9":"2","10":"1","_rn_":"8"},{"1":"1","2":"1","3":"1","4":"1","5":"2","6":"5","7":"1","8":"2","9":"1","10":"1","_rn_":"9"},{"1":"1","2":"1","3":"1","4":"2","5":"2","6":"1","7":"1","8":"4","9":"1","10":"2","_rn_":"10"}],"options":{"columns":{"min":{},"max":[10],"total":[10]},"rows":{"min":[10],"max":[10],"total":[10]},"pages":{}}}
</script>
</div>
```r
# Obtener un resumen estadístico de las variables
summary(wbca)
```
```
Class Adhes BNucl Chrom Epith Mitos NNucl Thick
Min. :0.0000 Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
1st Qu.:0.0000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 2.000
Median :1.0000 Median : 1.000 Median : 1.000 Median : 3.000 Median : 2.000 Median : 1.000 Median : 1.000 Median : 4.000
Mean :0.6505 Mean : 2.816 Mean : 3.542 Mean : 3.433 Mean : 3.231 Mean : 1.604 Mean : 2.859 Mean : 4.436
3rd Qu.:1.0000 3rd Qu.: 4.000 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 4.000 3rd Qu.: 1.000 3rd Qu.: 4.000 3rd Qu.: 6.000
Max. :1.0000 Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
UShap USize
Min. : 1.000 Min. : 1.00
1st Qu.: 1.000 1st Qu.: 1.00
Median : 1.000 Median : 1.00
Mean : 3.204 Mean : 3.14
3rd Qu.: 5.000 3rd Qu.: 5.00
Max. :10.000 Max. :10.00
```
```r
# Ver los nombres de las variables en el conjunto de datos
names(wbca)
```
```
[1] "Class" "Adhes" "BNucl" "Chrom" "Epith" "Mitos" "NNucl" "Thick" "UShap" "USize"
```
```r
# O usar str() para ver la estructura del conjunto de datos
str(wbca)
```
```
'data.frame': 681 obs. of 10 variables:
$ Class: int 1 1 1 1 1 0 1 1 1 1 ...
$ Adhes: int 1 5 1 1 3 8 1 1 1 1 ...
$ BNucl: int 1 10 2 4 1 10 10 1 1 1 ...
$ Chrom: int 3 3 3 3 3 9 3 3 1 2 ...
$ Epith: int 2 7 2 3 2 7 2 2 2 2 ...
$ Mitos: int 1 1 1 1 1 1 1 1 5 1 ...
$ NNucl: int 1 2 1 7 1 7 1 1 1 1 ...
$ Thick: int 5 5 3 6 4 8 1 2 2 4 ...
$ UShap: int 1 4 1 8 1 10 1 2 1 1 ...
$ USize: int 1 4 1 8 1 10 1 1 1 2 ...
```
```r
# Crear gráficas
# Histograma para la variable target del dataset
hist(wbca$Class)
```
<img src="" />
```r
# Diagrama de dispersión para todas las variables
plot(wbca)
```
<img src="" />
```r
# Diagrama de dispersión para dos variables numéricas
plot(wbca$Thick, wbca$Class)
```
<img src="" />
```r
#Generamos gráfico de dispersión para la variable involact y fire.
ggplot(wbca, aes(x=Thick, y=Class)) + geom_point()
```
<img src="" />
```r
# crea un histograma que visualiza la distribución de los valores de la variable "Thick" del conjunto de datos "wbca"
ggplot(wbca, aes(x=Thick)) + geom_histogram(binwidth=1)
```
<img src="" />
```r
log1=glm(Class~BNucl, data=wbca, family=binomial )
log2=glm(Class~., data=wbca, family=binomial )
summary(log1)
```
```
Call:
glm(formula = Class ~ BNucl, family = binomial, data = wbca)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.58211 0.23791 15.06 <2e-16 ***
BNucl -0.88347 0.07396 -11.95 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 881.39 on 680 degrees of freedom
Residual deviance: 331.46 on 679 degrees of freedom
AIC: 335.46
Number of Fisher Scoring iterations: 6
```
```r
summary(log2)
```
```
Call:
glm(formula = Class ~ ., family = binomial, data = wbca)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 11.16678 1.41491 7.892 2.97e-15 ***
Adhes -0.39681 0.13384 -2.965 0.00303 **
BNucl -0.41478 0.10230 -4.055 5.02e-05 ***
Chrom -0.56456 0.18728 -3.014 0.00257 **
Epith -0.06440 0.16595 -0.388 0.69795
Mitos -0.65713 0.36764 -1.787 0.07387 .
NNucl -0.28659 0.12620 -2.271 0.02315 *
Thick -0.62675 0.15890 -3.944 8.01e-05 ***
UShap -0.28011 0.25235 -1.110 0.26699
USize 0.05718 0.23271 0.246 0.80589
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 881.388 on 680 degrees of freedom
Residual deviance: 89.464 on 671 degrees of freedom
AIC: 109.46
Number of Fisher Scoring iterations: 8
```
```r
# Imprimir los resultados del modelo ajustado
step(model, direction = "both", trace = FALSE)
```
```
Call:
lm(formula = involact ~ race + fire + theft + age, data = chicago)
Coefficients:
(Intercept) race fire theft age
-0.243118 0.008104 0.036646 -0.009592 0.007210
```
```r
anova(model, test="Chi")
```
```
Analysis of Variance Table
Response: involact
Df Sum Sq Mean Sq F value Pr(>F)
race 1 9.4143 9.4143 82.0556 3.087e-11 ***
fire 1 2.2326 2.2326 19.4597 7.556e-05 ***
theft 1 1.1635 1.1635 10.1409 0.002808 **
age 1 0.9973 0.9973 8.6929 0.005312 **
volact 1 0.0096 0.0096 0.0833 0.774414
income 1 0.0730 0.0730 0.6363 0.429759
Residuals 40 4.5892 0.1147
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```
• ¿Es el modelo obtenido significativo?
**Respuesta:**Este comando realiza una prueba de chi-cuadrado de la hipótesis nula de que todos los coeficientes en el modelo de regresión logística son iguales a cero. Si el valor p está por debajo de cierto nivel de significación (p. ej., 0,05), podemos rechazar la hipótesis nula y concluir que el modelo es significativo.
\
• ¿Existe alguna covariable no significativa?
**Respuesta:**Este comando muestra los valores z y p para cada coeficiente en el modelo de regresión logística. Un valor de p por debajo del nivel de significancia indica que la variable correspondiente es significativa para predecir la variable de respuesta. Esto usa la función glm() para ajustar un modelo de regresión logística con "Class.f" como variable de respuesta y todas las demás variables en el conjunto de datos (excepto "ID" y "Class") como variables explicativas. La función step() luego realiza una selección de variables paso a paso usando AIC como criterio. El argumento de dirección especifica si se agregan o eliminan variables ("ambos" significa pasos hacia adelante y hacia atrás), y el argumento de seguimiento controla si se imprimen los resultados de cada paso. Esto le mostrará información sobre los coeficientes, errores estándar, valores z y valores p para cada variable explicativa en el modelo de regresión logística seleccionado. Nota: Según el conjunto de datos y la pregunta de investigación específica, diferentes criterios pueden ser apropiados para la selección de variables, por lo que es importante elegir el criterio más apropiado para su estudio específico. Esto usa la función glm() para ajustar un modelo de regresión logística con "Class.f" como variable de respuesta y todas las demás variables en el conjunto de datos (excepto "ID" y "Class") como variables explicativas. La función step() luego realiza una selección de variables paso a paso usando AIC como criterio. El argumento de dirección especifica si se agregan o eliminan variables ("ambos" significa pasos hacia adelante y hacia atrás), y el argumento de seguimiento controla si se imprimen los resultados de cada paso.
\
• ¿En caso de existir alguna covariable no significativa, la quitar ́ıa del modelo?. Fundamente.
```r
gofstat(resid, wbca$Class.f)
```
```
Error in gofstat(resid, wbca$Class.f) : could not find function "gofstat"
```
**Respuesta:**#Este comando usa la función gof() del paquete "gof" para calcular varias estadísticas de bondad de ajuste para el modelo de regresión logística. Un valor alto de la estadística de bondad de ajuste general (G) y un valor bajo de la bondad de ajuste de la desviación (D) sugieren un buen ajuste del modelo.Si hay covariables en el modelo que no son significativas, puede ser razonable eliminarlas para simplificar el modelo. Sin embargo, es importante tener en cuenta que la eliminación de covariables del modelo puede provocar la pérdida de información y estimaciones potencialmente sesgadas. Por lo tanto, es importante evaluar cuidadosamente el impacto de cada covariable en el ajuste del modelo y los resultados del análisis antes de tomar decisiones sobre su eliminación.
wbca$Class.f <- factor(wbca$Class)
model <- glm(Class.f ~ . - ID - Class, data = wbca, family = binomial)
Warning: 'varlist' has changed (from nvar=11) to new 12 after EncodeVars() -- should no longer happen!Error in eval(predvars, data, env) : object 'ID' not found
Esto usa la función predict() para predecir las probabilidades logarítmicas de presencia para los nuevos valores de covariable. Luego convertimos las probabilidades logarítmicas en probabilidades usando la función plogis(). Finalmente, calculamos los intervalos de confianza del 95 % para las probabilidades predichas usando la función qnorm() y concatenamos todo en un marco de datos.Tenga en cuenta que los valores de covariables específicos que establecemos en el paso 5 son solo un ejemplo y deben cambiarse para reflejar los valores de las observaciones reales en sus datos.