1. Cargar la base de datos y mostrar la estructura de las variables.
datos = read.table("Advertising.csv", sep=",", header = TRUE)
dim(datos)
## [1] 200 5
summary(datos)
## X TV Radio Newspaper
## Min. : 1.00 Min. : 0.70 Min. : 0.000 Min. : 0.30
## 1st Qu.: 50.75 1st Qu.: 74.38 1st Qu.: 9.975 1st Qu.: 12.75
## Median :100.50 Median :149.75 Median :22.900 Median : 25.75
## Mean :100.50 Mean :147.04 Mean :23.264 Mean : 30.55
## 3rd Qu.:150.25 3rd Qu.:218.82 3rd Qu.:36.525 3rd Qu.: 45.10
## Max. :200.00 Max. :296.40 Max. :49.600 Max. :114.00
## Sales
## Min. : 1.60
## 1st Qu.:10.38
## Median :12.90
## Mean :14.02
## 3rd Qu.:17.40
## Max. :27.00
2. Convertir a lo mucho el 5% de valores de las variables Radio y TV a datos perdidos (NA). Guardar la nueva base de datos en un nueva data frame con el nombre publicidad.
publicidad = datos
for (x in sample(0:199, 10)) {
publicidad$TV[x] = NA
publicidad$Radio[x] = NA
}
summary(publicidad)
## X TV Radio Newspaper
## Min. : 1.00 Min. : 0.70 Min. : 0.000 Min. : 0.30
## 1st Qu.: 50.75 1st Qu.: 75.35 1st Qu.: 9.925 1st Qu.: 12.75
## Median :100.50 Median :150.65 Median :22.400 Median : 25.75
## Mean :100.50 Mean :149.03 Mean :23.075 Mean : 30.55
## 3rd Qu.:150.25 3rd Qu.:220.18 3rd Qu.:36.175 3rd Qu.: 45.10
## Max. :200.00 Max. :296.40 Max. :49.600 Max. :114.00
## NA's :10 NA's :10
## Sales
## Min. : 1.60
## 1st Qu.:10.38
## Median :12.90
## Mean :14.02
## 3rd Qu.:17.40
## Max. :27.00
##
3. Mostrar la proporción de datos perdidos por variable y por registro. Interpretar estos valores.
apply(publicidad, 2, function(x){sum(is.na(x))})
## X TV Radio Newspaper Sales
## 0 10 10 0 0
mice::md.pattern(publicidad, rotate.names=TRUE)
## X Newspaper Sales TV Radio
## 190 1 1 1 1 1 0
## 10 1 1 1 0 0 2
## 0 0 0 10 10 20
4. Analizar y visualizar el patrón de datos perdidos
sleep_aggr <- VIM::aggr(publicidad, col = mice::mdc(1:2), numbers = TRUE,
sortVars = TRUE, labels = names(publicidad),
cex.axis= 0.7, gap = 3,
ylab = c("Proporción de Pérdida",
"Patrón de Pérdida"))
##
## Variables sorted by number of missings:
## Variable Count
## TV 0.05
## Radio 0.05
## X 0.00
## Newspaper 0.00
## Sales 0.00
5. Realizar imputación simple usando la media y guardar la información en una nueva columna del dataset publicidad: imp_mean.
library(Hmisc)
imp_mean = publicidad
imp_mean$TV <- with(imp_mean, impute(TV, mean))
imp_mean$Radio <- with(imp_mean, impute(Radio, mean))
head(imp_mean)
## X TV Radio Newspaper Sales
## 1 1 230.1 37.8 69.2 22.1
## 2 2 44.5 39.3 45.1 10.4
## 3 3 17.2 45.9 69.3 9.3
## 4 4 151.5 41.3 58.5 18.5
## 5 5 180.8 10.8 58.4 12.9
## 6 6 8.7 48.9 75.0 7.2
6. Realizar imputación por vecinos más cercanos empleando una cantidad adecuada de vecinos y guardar la información en publicidad (imp_knn).
library(DMwR2)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
imp_knn <- knnImputation(publicidad)
summary(imp_knn)
## X TV Radio Newspaper
## Min. : 1.00 Min. : 0.7 Min. : 0.00 Min. : 0.30
## 1st Qu.: 50.75 1st Qu.: 76.1 1st Qu.:10.07 1st Qu.: 12.75
## Median :100.50 Median :149.8 Median :22.00 Median : 25.75
## Mean :100.50 Mean :148.7 Mean :22.91 Mean : 30.55
## 3rd Qu.:150.25 3rd Qu.:218.8 3rd Qu.:35.65 3rd Qu.: 45.10
## Max. :200.00 Max. :296.4 Max. :49.60 Max. :114.00
## Sales
## Min. : 1.60
## 1st Qu.:10.38
## Median :12.90
## Mean :14.02
## 3rd Qu.:17.40
## Max. :27.00
head(imp_knn)
## X TV Radio Newspaper Sales
## 1 1 230.1 37.8 69.2 22.1
## 2 2 44.5 39.3 45.1 10.4
## 3 3 17.2 45.9 69.3 9.3
## 4 4 151.5 41.3 58.5 18.5
## 5 5 180.8 10.8 58.4 12.9
## 6 6 8.7 48.9 75.0 7.2
7. Comparar los datos imputados por la media y por vecinos más cercanos. a. ¿Qué tan diferentes son? Graficar para visualizar las observaciones imputadas por ambos métodos.
{
hist_plot <- function(df, title)
{
df[complete.cases(df), ] %>% gather() %>% head()
ggplot(gather(df[complete.cases(df), ]), aes(value)) +
geom_histogram(bins = 10) +
facet_wrap(~key, scales = 'free_x') +
xlab(title) +
ylab("Cantidad")
}
}
hist_plot(publicidad[c(2:3)], 'Datos con NA')
hist_plot(imp_mean[c(2:3)], 'Imputación Mean')
## Don't know how to automatically pick scale for object of type impute. Defaulting to continuous.
hist_plot(imp_knn[c(2:3)], 'Imputación KNN')
8. Se desean predecir las ventas en base a los valores invertidos en publicidad.Realizar una regresión lineal con los datos imputados (es su criterio elegir publicidad imp_mean o publicidad imp_knn).
#scatter.smooth(x=imp_mean$Sales, y=imp_mean$TV, main="Sales ~ TV") # scatterplot
#scatter.smooth(x=imp_knn$Sales, y=imp_knn$TV, main="Sales ~ TV") # scatterplot
model_mean <- lm(Sales ~ TV + Radio + Newspaper, data = imp_mean)
summary(model_mean)
##
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper, data = imp_mean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8486 -0.8903 0.2898 1.2169 9.1724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.8550540 0.3928585 7.267 8.5e-12 ***
## TV 0.0454396 0.0017600 25.818 < 2e-16 ***
## Radio 0.1910462 0.0110291 17.322 < 2e-16 ***
## Newspaper -0.0004232 0.0073133 -0.058 0.954
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.073 on 196 degrees of freedom
## Multiple R-squared: 0.8445, Adjusted R-squared: 0.8421
## F-statistic: 354.8 on 3 and 196 DF, p-value: < 2.2e-16
model_knn <- lm(Sales ~ TV + Radio + Newspaper, data = imp_knn)
summary(model_knn)
##
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper, data = imp_knn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9453 -0.9271 0.2252 1.1672 2.9623
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.832977 0.308808 9.174 <2e-16 ***
## TV 0.045893 0.001392 32.980 <2e-16 ***
## Radio 0.194609 0.008758 22.222 <2e-16 ***
## Newspaper -0.003022 0.005864 -0.515 0.607
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.659 on 196 degrees of freedom
## Multiple R-squared: 0.9004, Adjusted R-squared: 0.8989
## F-statistic: 590.5 on 3 and 196 DF, p-value: < 2.2e-16
9. Realizar imputación múltiple con el paquete mice y visualizar los datos imputados para cada variable y en cada muestra generada por la imputación. Comentar al respecto.
imp1 <- mice(publicidad, m = 5, seed = 2)
##
## iter imp variable
## 1 1 TV Radio
## 1 2 TV Radio
## 1 3 TV Radio
## 1 4 TV Radio
## 1 5 TV Radio
## 2 1 TV Radio
## 2 2 TV Radio
## 2 3 TV Radio
## 2 4 TV Radio
## 2 5 TV Radio
## 3 1 TV Radio
## 3 2 TV Radio
## 3 3 TV Radio
## 3 4 TV Radio
## 3 5 TV Radio
## 4 1 TV Radio
## 4 2 TV Radio
## 4 3 TV Radio
## 4 4 TV Radio
## 4 5 TV Radio
## 5 1 TV Radio
## 5 2 TV Radio
## 5 3 TV Radio
## 5 4 TV Radio
## 5 5 TV Radio
imp1
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## X TV Radio Newspaper Sales
## "" "pmm" "pmm" "" ""
## PredictorMatrix:
## X TV Radio Newspaper Sales
## X 0 1 1 1 1
## TV 1 0 1 1 1
## Radio 1 1 0 1 1
## Newspaper 1 1 1 0 1
## Sales 1 1 1 1 0
imp1$method
## X TV Radio Newspaper Sales
## "" "pmm" "pmm" "" ""
imp1$imp$TV
## 1 2 3 4 5
## 48 261.3 292.9 230.1 216.4 281.4
## 67 75.1 43.1 13.1 110.7 70.6
## 68 225.8 66.1 290.7 199.1 199.8
## 79 19.4 43.0 74.7 36.9 25.0
## 98 220.3 240.1 198.9 209.6 170.2
## 136 66.1 85.7 74.7 43.0 177.0
## 140 243.2 287.6 289.7 250.9 198.9
## 164 121.0 217.7 156.6 239.8 204.1
## 183 76.4 44.5 94.2 74.7 93.9
## 196 78.2 85.7 66.9 13.1 44.5
library(lattice)
xyplot(imp1, TV ~ Radio | .imp, pch = 20, cex = 1.4)
xyplot(imp1, TV ~ Radio, pch = 18)
xyplot(imp1, Sales ~ TV + Radio, pch = 18)
densityplot(imp1)
stripplot(imp1, pch = 20)
10. Elaborar una regresión lineal, similar a la del item 8, usando lo obtenido por imputación múltiple. Comparar los resultados con la regresión lineal anterior
ajuste_imp <- with(imp1, lm( Sales ~ TV + Radio + Newspaper))
summary(ajuste_imp)
## # A tibble: 20 x 6
## term estimate std.error statistic p.value nobs
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 (Intercept) 2.97 0.309 9.63 3.22e-18 200
## 2 TV 0.0455 0.00138 32.9 1.17e-81 200
## 3 Radio 0.189 0.00855 22.1 3.33e-55 200
## 4 Newspaper -0.00103 0.00586 -0.175 8.61e- 1 200
## 5 (Intercept) 2.97 0.307 9.69 2.22e-18 200
## 6 TV 0.0457 0.00136 33.5 4.63e-83 200
## 7 Radio 0.190 0.00868 21.9 1.30e-54 200
## 8 Newspaper -0.00247 0.00588 -0.421 6.74e- 1 200
## 9 (Intercept) 2.92 0.312 9.35 2.08e-17 200
## 10 TV 0.0449 0.00139 32.4 1.30e-80 200
## 11 Radio 0.194 0.00876 22.2 2.24e-55 200
## 12 Newspaper -0.00169 0.00593 -0.285 7.76e- 1 200
## 13 (Intercept) 2.87 0.311 9.23 4.59e-17 200
## 14 TV 0.0459 0.00138 33.2 1.75e-82 200
## 15 Radio 0.188 0.00856 22.0 9.50e-55 200
## 16 Newspaper 0.000331 0.00586 0.0566 9.55e- 1 200
## 17 (Intercept) 3.00 0.305 9.85 7.76e-19 200
## 18 TV 0.0453 0.00138 32.9 9.34e-82 200
## 19 Radio 0.190 0.00857 22.2 2.16e-55 200
## 20 Newspaper -0.00197 0.00584 -0.337 7.37e- 1 200
ajuste_comb <- pool(ajuste_imp)
summary(ajuste_comb)
## term estimate std.error statistic df p.value
## 1 (Intercept) 2.946340749 0.314426941 9.3705099 176.2301 0.0000000
## 2 TV 0.045462452 0.001436013 31.6587970 140.4300 0.0000000
## 3 Radio 0.190403048 0.009039745 21.0628785 130.0351 0.0000000
## 4 Newspaper -0.001365394 0.005992489 -0.2278508 173.9788 0.8200298
pool.r.squared(ajuste_imp)
## est lo 95 hi 95 fmi
## R^2 0.899181 0.8686968 0.9228968 NaN