1. Cargar la base de datos y mostrar la estructura de las variables.

datos = read.table("Advertising.csv", sep=",", header = TRUE)
dim(datos)

## [1] 200   5

summary(datos)

##        X                TV             Radio          Newspaper     
##  Min.   :  1.00   Min.   :  0.70   Min.   : 0.000   Min.   :  0.30  
##  1st Qu.: 50.75   1st Qu.: 74.38   1st Qu.: 9.975   1st Qu.: 12.75  
##  Median :100.50   Median :149.75   Median :22.900   Median : 25.75  
##  Mean   :100.50   Mean   :147.04   Mean   :23.264   Mean   : 30.55  
##  3rd Qu.:150.25   3rd Qu.:218.82   3rd Qu.:36.525   3rd Qu.: 45.10  
##  Max.   :200.00   Max.   :296.40   Max.   :49.600   Max.   :114.00  
##      Sales      
##  Min.   : 1.60  
##  1st Qu.:10.38  
##  Median :12.90  
##  Mean   :14.02  
##  3rd Qu.:17.40  
##  Max.   :27.00

2. Convertir a lo mucho el 5% de valores de las variables Radio y TV a datos perdidos (NA). Guardar la nueva base de datos en un nueva data frame con el nombre publicidad.

publicidad = datos
for (x in sample(0:199, 10)) {
  publicidad$TV[x] = NA
  publicidad$Radio[x] = NA
}
summary(publicidad)

##        X                TV             Radio          Newspaper     
##  Min.   :  1.00   Min.   :  0.70   Min.   : 0.000   Min.   :  0.30  
##  1st Qu.: 50.75   1st Qu.: 75.35   1st Qu.: 9.925   1st Qu.: 12.75  
##  Median :100.50   Median :150.65   Median :22.400   Median : 25.75  
##  Mean   :100.50   Mean   :149.03   Mean   :23.075   Mean   : 30.55  
##  3rd Qu.:150.25   3rd Qu.:220.18   3rd Qu.:36.175   3rd Qu.: 45.10  
##  Max.   :200.00   Max.   :296.40   Max.   :49.600   Max.   :114.00  
##                   NA's   :10       NA's   :10                       
##      Sales      
##  Min.   : 1.60  
##  1st Qu.:10.38  
##  Median :12.90  
##  Mean   :14.02  
##  3rd Qu.:17.40  
##  Max.   :27.00  
##

3. Mostrar la proporción de datos perdidos por variable y por registro. Interpretar estos valores.

apply(publicidad, 2, function(x){sum(is.na(x))})

##         X        TV     Radio Newspaper     Sales 
##         0        10        10         0         0

mice::md.pattern(publicidad, rotate.names=TRUE)

##     X Newspaper Sales TV Radio   
## 190 1         1     1  1     1  0
## 10  1         1     1  0     0  2
##     0         0     0 10    10 20

4. Analizar y visualizar el patrón de datos perdidos

sleep_aggr <- VIM::aggr(publicidad, col = mice::mdc(1:2), numbers = TRUE,
sortVars = TRUE, labels = names(publicidad),
cex.axis= 0.7, gap = 3,
ylab = c("Proporción de Pérdida",
"Patrón de Pérdida"))

## 
##  Variables sorted by number of missings: 
##   Variable Count
##         TV  0.05
##      Radio  0.05
##          X  0.00
##  Newspaper  0.00
##      Sales  0.00

Imputación de Datos

5. Realizar imputación simple usando la media y guardar la información en una nueva columna del dataset publicidad: imp_mean.

library(Hmisc)
imp_mean = publicidad
imp_mean$TV <- with(imp_mean, impute(TV, mean))
imp_mean$Radio <- with(imp_mean, impute(Radio, mean))
head(imp_mean)

##   X    TV Radio Newspaper Sales
## 1 1 230.1  37.8      69.2  22.1
## 2 2  44.5  39.3      45.1  10.4
## 3 3  17.2  45.9      69.3   9.3
## 4 4 151.5  41.3      58.5  18.5
## 5 5 180.8  10.8      58.4  12.9
## 6 6   8.7  48.9      75.0   7.2

6. Realizar imputación por vecinos más cercanos empleando una cantidad adecuada de vecinos y guardar la información en publicidad (imp_knn).

library(DMwR2)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

imp_knn <- knnImputation(publicidad)
summary(imp_knn)

##        X                TV            Radio         Newspaper     
##  Min.   :  1.00   Min.   :  0.7   Min.   : 0.00   Min.   :  0.30  
##  1st Qu.: 50.75   1st Qu.: 76.1   1st Qu.:10.07   1st Qu.: 12.75  
##  Median :100.50   Median :149.8   Median :22.00   Median : 25.75  
##  Mean   :100.50   Mean   :148.7   Mean   :22.91   Mean   : 30.55  
##  3rd Qu.:150.25   3rd Qu.:218.8   3rd Qu.:35.65   3rd Qu.: 45.10  
##  Max.   :200.00   Max.   :296.4   Max.   :49.60   Max.   :114.00  
##      Sales      
##  Min.   : 1.60  
##  1st Qu.:10.38  
##  Median :12.90  
##  Mean   :14.02  
##  3rd Qu.:17.40  
##  Max.   :27.00

head(imp_knn)

##   X    TV Radio Newspaper Sales
## 1 1 230.1  37.8      69.2  22.1
## 2 2  44.5  39.3      45.1  10.4
## 3 3  17.2  45.9      69.3   9.3
## 4 4 151.5  41.3      58.5  18.5
## 5 5 180.8  10.8      58.4  12.9
## 6 6   8.7  48.9      75.0   7.2

7. Comparar los datos imputados por la media y por vecinos más cercanos. a. ¿Qué tan diferentes son? Graficar para visualizar las observaciones imputadas por ambos métodos.

{ 
  hist_plot <- function(df, title)
  {
    df[complete.cases(df), ] %>% gather() %>% head()
    ggplot(gather(df[complete.cases(df), ]), aes(value)) + 
      geom_histogram(bins = 10) + 
      facet_wrap(~key, scales = 'free_x') +
    xlab(title) + 
    ylab("Cantidad")
  }
}

hist_plot(publicidad[c(2:3)], 'Datos con NA')

hist_plot(imp_mean[c(2:3)], 'Imputación Mean')

## Don't know how to automatically pick scale for object of type impute. Defaulting to continuous.

hist_plot(imp_knn[c(2:3)], 'Imputación KNN')

8. Se desean predecir las ventas en base a los valores invertidos en publicidad.Realizar una regresión lineal con los datos imputados (es su criterio elegir publicidad imp_mean o publicidad imp_knn).

#scatter.smooth(x=imp_mean$Sales, y=imp_mean$TV, main="Sales ~ TV")  # scatterplot

#scatter.smooth(x=imp_knn$Sales, y=imp_knn$TV, main="Sales ~ TV")  # scatterplot

model_mean <- lm(Sales ~ TV + Radio + Newspaper, data = imp_mean)
summary(model_mean)

## 
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper, data = imp_mean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8486 -0.8903  0.2898  1.2169  9.1724 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.8550540  0.3928585   7.267  8.5e-12 ***
## TV           0.0454396  0.0017600  25.818  < 2e-16 ***
## Radio        0.1910462  0.0110291  17.322  < 2e-16 ***
## Newspaper   -0.0004232  0.0073133  -0.058    0.954    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.073 on 196 degrees of freedom
## Multiple R-squared:  0.8445, Adjusted R-squared:  0.8421 
## F-statistic: 354.8 on 3 and 196 DF,  p-value: < 2.2e-16

model_knn <- lm(Sales ~ TV + Radio + Newspaper, data = imp_knn)
summary(model_knn)

## 
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper, data = imp_knn)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9453 -0.9271  0.2252  1.1672  2.9623 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.832977   0.308808   9.174   <2e-16 ***
## TV           0.045893   0.001392  32.980   <2e-16 ***
## Radio        0.194609   0.008758  22.222   <2e-16 ***
## Newspaper   -0.003022   0.005864  -0.515    0.607    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.659 on 196 degrees of freedom
## Multiple R-squared:  0.9004, Adjusted R-squared:  0.8989 
## F-statistic: 590.5 on 3 and 196 DF,  p-value: < 2.2e-16

9. Realizar imputación múltiple con el paquete mice y visualizar los datos imputados para cada variable y en cada muestra generada por la imputación. Comentar al respecto.

imp1 <- mice(publicidad, m = 5, seed = 2)

## 
##  iter imp variable
##   1   1  TV  Radio
##   1   2  TV  Radio
##   1   3  TV  Radio
##   1   4  TV  Radio
##   1   5  TV  Radio
##   2   1  TV  Radio
##   2   2  TV  Radio
##   2   3  TV  Radio
##   2   4  TV  Radio
##   2   5  TV  Radio
##   3   1  TV  Radio
##   3   2  TV  Radio
##   3   3  TV  Radio
##   3   4  TV  Radio
##   3   5  TV  Radio
##   4   1  TV  Radio
##   4   2  TV  Radio
##   4   3  TV  Radio
##   4   4  TV  Radio
##   4   5  TV  Radio
##   5   1  TV  Radio
##   5   2  TV  Radio
##   5   3  TV  Radio
##   5   4  TV  Radio
##   5   5  TV  Radio

imp1

## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##         X        TV     Radio Newspaper     Sales 
##        ""     "pmm"     "pmm"        ""        "" 
## PredictorMatrix:
##           X TV Radio Newspaper Sales
## X         0  1     1         1     1
## TV        1  0     1         1     1
## Radio     1  1     0         1     1
## Newspaper 1  1     1         0     1
## Sales     1  1     1         1     0

imp1$method

##         X        TV     Radio Newspaper     Sales 
##        ""     "pmm"     "pmm"        ""        ""

imp1$imp$TV

##         1     2     3     4     5
## 48  261.3 292.9 230.1 216.4 281.4
## 67   75.1  43.1  13.1 110.7  70.6
## 68  225.8  66.1 290.7 199.1 199.8
## 79   19.4  43.0  74.7  36.9  25.0
## 98  220.3 240.1 198.9 209.6 170.2
## 136  66.1  85.7  74.7  43.0 177.0
## 140 243.2 287.6 289.7 250.9 198.9
## 164 121.0 217.7 156.6 239.8 204.1
## 183  76.4  44.5  94.2  74.7  93.9
## 196  78.2  85.7  66.9  13.1  44.5

library(lattice)
xyplot(imp1, TV ~ Radio | .imp, pch = 20, cex = 1.4)

xyplot(imp1, TV ~ Radio, pch = 18)

xyplot(imp1, Sales ~ TV + Radio, pch = 18)

densityplot(imp1)

stripplot(imp1, pch = 20)

10. Elaborar una regresión lineal, similar a la del item 8, usando lo obtenido por imputación múltiple. Comparar los resultados con la regresión lineal anterior

ajuste_imp <- with(imp1, lm( Sales ~ TV + Radio + Newspaper))
summary(ajuste_imp)

## # A tibble: 20 x 6
##    term         estimate std.error statistic  p.value  nobs
##    <chr>           <dbl>     <dbl>     <dbl>    <dbl> <int>
##  1 (Intercept)  2.97       0.309      9.63   3.22e-18   200
##  2 TV           0.0455     0.00138   32.9    1.17e-81   200
##  3 Radio        0.189      0.00855   22.1    3.33e-55   200
##  4 Newspaper   -0.00103    0.00586   -0.175  8.61e- 1   200
##  5 (Intercept)  2.97       0.307      9.69   2.22e-18   200
##  6 TV           0.0457     0.00136   33.5    4.63e-83   200
##  7 Radio        0.190      0.00868   21.9    1.30e-54   200
##  8 Newspaper   -0.00247    0.00588   -0.421  6.74e- 1   200
##  9 (Intercept)  2.92       0.312      9.35   2.08e-17   200
## 10 TV           0.0449     0.00139   32.4    1.30e-80   200
## 11 Radio        0.194      0.00876   22.2    2.24e-55   200
## 12 Newspaper   -0.00169    0.00593   -0.285  7.76e- 1   200
## 13 (Intercept)  2.87       0.311      9.23   4.59e-17   200
## 14 TV           0.0459     0.00138   33.2    1.75e-82   200
## 15 Radio        0.188      0.00856   22.0    9.50e-55   200
## 16 Newspaper    0.000331   0.00586    0.0566 9.55e- 1   200
## 17 (Intercept)  3.00       0.305      9.85   7.76e-19   200
## 18 TV           0.0453     0.00138   32.9    9.34e-82   200
## 19 Radio        0.190      0.00857   22.2    2.16e-55   200
## 20 Newspaper   -0.00197    0.00584   -0.337  7.37e- 1   200

ajuste_comb <- pool(ajuste_imp)
summary(ajuste_comb)

##          term     estimate   std.error  statistic       df   p.value
## 1 (Intercept)  2.946340749 0.314426941  9.3705099 176.2301 0.0000000
## 2          TV  0.045462452 0.001436013 31.6587970 140.4300 0.0000000
## 3       Radio  0.190403048 0.009039745 21.0628785 130.0351 0.0000000
## 4   Newspaper -0.001365394 0.005992489 -0.2278508 173.9788 0.8200298

pool.r.squared(ajuste_imp)

##          est     lo 95     hi 95 fmi
## R^2 0.899181 0.8686968 0.9228968 NaN

Imputación de Datos

Oscar Edmit Quispe Poccohuanca

28/3/2021

Imputación de Datos