Importar liberías

library(Amelia)

## Cargando paquete requerido: Rcpp

## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.3, built: 2024-11-07)
## ## Copyright (C) 2005-2025 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(magrittr)

## 
## Adjuntando el paquete: 'magrittr'
## 
## The following object is masked from 'package:purrr':
## 
##     set_names
## 
## The following object is masked from 'package:tidyr':
## 
##     extract

library(scales)

## 
## Adjuntando el paquete: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

library(questionr)
library(ggplot2)
library(dplyr)
library(ggcorrplot)
library(moments)
library(psych)

## 
## Adjuntando el paquete: 'psych'
## 
## The following object is masked from 'package:questionr':
## 
##     describe
## 
## The following objects are masked from 'package:scales':
## 
##     alpha, rescale
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

library(reshape2)

## 
## Adjuntando el paquete: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths

library(gridExtra)

## 
## Adjuntando el paquete: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

Base de datos utilizada

https://archive.ics.uci.edu/dataset/2/adult

Estos datos fueron extraídos de la base de datos del Census Bureau del gobierno de los Estados Unidos por Ronny Kohavi y Barry Becker, el objetivo es determinar si el salario de un individuo sobrepasa los $50K dólares anuales.

Explicación de variables:

age: Edad del invidividuo
workclass: Tipo del trabajo del invidividuo (8 categorías)
fnlwgt: Número de personas con características similares
education: Nivel educativo más alto alcanzado por el individuo (16 categorías)
education-num: Nivel educativo más alto alcanzado por el individuo en orden númerico
marital-status: Estado civil (7 categorías)
occupation: Trabajo del individuo (14 categorías)
relationship: Tipo de relación con otros individuos (6 categorías)
race: Etnia (6 categorías)
sex: Género
capital-gain: Ganancias del capital del individuo (entero)
capital-loss: Perdidas del capital del individuo (entero)
hours-per-week: Horas de trabajo (entero)
native-country: País de origen
income: Variable objetivo (>50k / <=50k)

Lectura de Datos

df = read.csv("C:/Rstudio/adult.csv", sep = ',')
head(df)

##   age         workclass fnlwgt  education education.num      marital.status
## 1  39         State-gov  77516  Bachelors            13       Never-married
## 2  50  Self-emp-not-inc  83311  Bachelors            13  Married-civ-spouse
## 3  38           Private 215646    HS-grad             9            Divorced
## 4  53           Private 234721       11th             7  Married-civ-spouse
## 5  28           Private 338409  Bachelors            13  Married-civ-spouse
## 6  37           Private 284582    Masters            14  Married-civ-spouse
##           occupation   relationship   race     sex capital.gain capital.loss
## 1       Adm-clerical  Not-in-family  White    Male         2174            0
## 2    Exec-managerial        Husband  White    Male            0            0
## 3  Handlers-cleaners  Not-in-family  White    Male            0            0
## 4  Handlers-cleaners        Husband  Black    Male            0            0
## 5     Prof-specialty           Wife  Black  Female            0            0
## 6    Exec-managerial           Wife  White  Female            0            0
##   hours.per.week native.country income
## 1             40  United-States  <=50K
## 2             13  United-States  <=50K
## 3             40  United-States  <=50K
## 4             40  United-States  <=50K
## 5             40           Cuba  <=50K
## 6             40  United-States  <=50K

Se observa el las primeras filas del DataFrame, con la descripción del tipo de variable respectiva en cada columna. El Dataframe cuenta con 6 variables númericas, 2 variables categóricas dicotómicas (incluyendo variable objetivo) y 7 categóricas nominales

sapply(df, class)

##            age      workclass         fnlwgt      education  education.num 
##      "integer"    "character"      "integer"    "character"      "integer" 
## marital.status     occupation   relationship           race            sex 
##    "character"    "character"    "character"    "character"    "character" 
##   capital.gain   capital.loss hours.per.week native.country         income 
##      "integer"      "integer"      "integer"    "character"    "character"

Los datos faltantes están recategorizados como el signo de interrogación, por tanto, si queremos visualizarlos usando la función missmap de la librería Amelia es necesario convertirlos a NA

df[df == ' ?'] <- NA

Posteriormente hacemos revisión de datos faltantes en las variables.

sapply(df, function(x) sum(is.na(x)))

##            age      workclass         fnlwgt      education  education.num 
##              0           1836              0              0              0 
## marital.status     occupation   relationship           race            sex 
##              0           1843              0              0              0 
##   capital.gain   capital.loss hours.per.week native.country         income 
##              0              0              0            583              0

missmap(df, col = c("red", "blue"), legend = TRUE)

Observamos que hay NA’s en las variables Workclasss, Occupation y Native-Country todas categóricas

Porcentaje de NA por variable

na_pct <- sapply(df, function(x) mean(is.na(x)) * 100)
na_pct

##            age      workclass         fnlwgt      education  education.num 
##       0.000000       5.638647       0.000000       0.000000       0.000000 
## marital.status     occupation   relationship           race            sex 
##       0.000000       5.660146       0.000000       0.000000       0.000000 
##   capital.gain   capital.loss hours.per.week native.country         income 
##       0.000000       0.000000       0.000000       1.790486       0.000000

Dado que el porcentaje de datos faltantes en las variables categóricas es relativamente bajo (No mayor al 6%) se decide imputar usando la moda para preservar la tendencia.

df$workclass[is.na(df$workclass)] <- names(which.max(table(df$workclass)))
df$occupation[is.na(df$occupation)] <- names(which.max(table(df$occupation)))
df$native.country[is.na(df$native.country)] <- names(which.max(table(df$native.country)))

Revisamos porcentaje de NA luego de la imputación

na_pct <- sapply(df, function(x) mean(is.na(x)) * 100)
na_pct

##            age      workclass         fnlwgt      education  education.num 
##              0              0              0              0              0 
## marital.status     occupation   relationship           race            sex 
##              0              0              0              0              0 
##   capital.gain   capital.loss hours.per.week native.country         income 
##              0              0              0              0              0

Realizamos el respectivo summary del Dataframe.

summary(df)

##       age         workclass             fnlwgt         education        
##  Min.   :17.00   Length:32561       Min.   :  12285   Length:32561      
##  1st Qu.:28.00   Class :character   1st Qu.: 117827   Class :character  
##  Median :37.00   Mode  :character   Median : 178356   Mode  :character  
##  Mean   :38.58                      Mean   : 189778                     
##  3rd Qu.:48.00                      3rd Qu.: 237051                     
##  Max.   :90.00                      Max.   :1484705                     
##  education.num   marital.status      occupation        relationship      
##  Min.   : 1.00   Length:32561       Length:32561       Length:32561      
##  1st Qu.: 9.00   Class :character   Class :character   Class :character  
##  Median :10.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :10.08                                                           
##  3rd Qu.:12.00                                                           
##  Max.   :16.00                                                           
##      race               sex             capital.gain    capital.loss   
##  Length:32561       Length:32561       Min.   :    0   Min.   :   0.0  
##  Class :character   Class :character   1st Qu.:    0   1st Qu.:   0.0  
##  Mode  :character   Mode  :character   Median :    0   Median :   0.0  
##                                        Mean   : 1078   Mean   :  87.3  
##                                        3rd Qu.:    0   3rd Qu.:   0.0  
##                                        Max.   :99999   Max.   :4356.0  
##  hours.per.week  native.country        income         
##  Min.   : 1.00   Length:32561       Length:32561      
##  1st Qu.:40.00   Class :character   Class :character  
##  Median :40.00   Mode  :character   Mode  :character  
##  Mean   :40.44                                        
##  3rd Qu.:45.00                                        
##  Max.   :99.00

Revisión de categorías únicas y número de ocurrencias en variables categóricas

Workclass

table(df$workclass)

## 
##       Federal-gov         Local-gov      Never-worked           Private 
##               960              2093                 7             24532 
##      Self-emp-inc  Self-emp-not-inc         State-gov       Without-pay 
##              1116              2541              1298                14

Marital

table(df$marital)

## 
##               Divorced      Married-AF-spouse     Married-civ-spouse 
##                   4443                     23                  14976 
##  Married-spouse-absent          Never-married              Separated 
##                    418                  10683                   1025 
##                Widowed 
##                    993

Education

table(df$education)

## 
##          10th          11th          12th       1st-4th       5th-6th 
##           933          1175           433           168           333 
##       7th-8th           9th    Assoc-acdm     Assoc-voc     Bachelors 
##           646           514          1067          1382          5355 
##     Doctorate       HS-grad       Masters     Preschool   Prof-school 
##           413         10501          1723            51           576 
##  Some-college 
##          7291

Relationship

table(df$relationship)

## 
##         Husband   Not-in-family  Other-relative       Own-child       Unmarried 
##           13193            8305             981            5068            3446 
##            Wife 
##            1568

Race

table(df$race)

## 
##  Amer-Indian-Eskimo  Asian-Pac-Islander               Black               Other 
##                 311                1039                3124                 271 
##               White 
##               27816

Análisis univariado

Variables binarias

A continuación, se genera un doble gráfico para cada variable binaria, permitiendo visualizarlo en forma de Piechart y tabla de frecuencia.

Variable Sex

Pie chart - Sex

Se observa una ligera mayoría masculina.

Tabla de frequencia

table_default <- questionr::freq(df$sex, cum = TRUE, sort = "dec", total = TRUE)
knitr::kable(table_default)

	n	%	val%	%cum	val%cum
Male	21790	66.9	66.9	66.9	66.9
Female	10771	33.1	33.1	100.0	100.0
Total	32561	100.0	100.0	100.0	100.0

Variable Income

Pie chart - Income

La mayoría percibe ingresos ≤ 50K; menor proporción con ingresos > 50K.

Tabla de frequencia

table_default <- questionr::freq(df$income, cum = TRUE, sort = "dec", total = TRUE)
knitr::kable(table_default)

	n	%	val%	%cum	val%cum
<=50K	24720	75.9	75.9	75.9	75.9
>50K	7841	24.1	24.1	100.0	100.0
Total	32561	100.0	100.0	100.0	100.0

Variables categóricas nominales

A continuación graficaremos las distribuciones de las variables categóricas por medio de diagramas de barras.

Workclass

Tabla_workclass <- df %>%
  filter(!is.na(workclass)) %>% 
  dplyr::group_by(workclass) %>%                                  
  dplyr::summarise(Total = n()) %>%                                
  dplyr::mutate(Porcentaje = round(Total/sum(Total)*100, 1)) %>%   
  dplyr::arrange(workclass)

Tabla_workclass

## # A tibble: 8 × 3
##   workclass           Total Porcentaje
##   <chr>               <int>      <dbl>
## 1 " Federal-gov"        960        2.9
## 2 " Local-gov"         2093        6.4
## 3 " Never-worked"         7        0  
## 4 " Private"          24532       75.3
## 5 " Self-emp-inc"      1116        3.4
## 6 " Self-emp-not-inc"  2541        7.8
## 7 " State-gov"         1298        4  
## 8 " Without-pay"         14        0

G1 <- ggplot(Tabla_workclass, aes(x=workclass, y=Total) ) + 
  geom_bar(width = 0.7, stat="identity", fill="grey") +  
  coord_cartesian(ylim = c(0, 27500)) +
  labs(x="Tipo de trabajo", y="Frecuencias \n (Porcentajes)") +
  geom_text(aes(label=paste0(Total, " (", Porcentaje, "%)")),
            vjust=-0.9, color="black", size=4) +  
  theme_bw(base_size = 16) +
  theme(
    axis.text.x = element_text(angle = 30, vjust = 0.5, hjust = 0.5, size = 12)  
  ) +
  facet_wrap(~"Distribución de Tipo de trabajo")

G1

predominio del sector privado, seguido por Self-employed y Government

Education

Tabla_education <- df %>%
  filter(!is.na(education)) %>% 
  dplyr::group_by(education) %>%                                  
  dplyr::summarise(Total = n()) %>%                                
  dplyr::mutate(Porcentaje = round(Total/sum(Total)*100, 1)) %>%   
  dplyr::arrange(education)

Tabla_education

## # A tibble: 16 × 3
##    education       Total Porcentaje
##    <chr>           <int>      <dbl>
##  1 " 10th"           933        2.9
##  2 " 11th"          1175        3.6
##  3 " 12th"           433        1.3
##  4 " 1st-4th"        168        0.5
##  5 " 5th-6th"        333        1  
##  6 " 7th-8th"        646        2  
##  7 " 9th"            514        1.6
##  8 " Assoc-acdm"    1067        3.3
##  9 " Assoc-voc"     1382        4.2
## 10 " Bachelors"     5355       16.4
## 11 " Doctorate"      413        1.3
## 12 " HS-grad"      10501       32.3
## 13 " Masters"       1723        5.3
## 14 " Preschool"       51        0.2
## 15 " Prof-school"    576        1.8
## 16 " Some-college"  7291       22.4

G2 <- ggplot(Tabla_education, aes(x=education, y=Total) ) + 
  geom_bar(width = 0.7, stat="identity", fill="cadetblue3") +  
  coord_cartesian(ylim = c(0, 15000)) +
  labs(x="Nivel de educación", y="Frecuencias \n (Porcentajes)") +
  geom_text(aes(label=paste0(Total, " (", Porcentaje, "%)")),
            vjust=-0.9, color="black", size=3) +  
  theme_bw(base_size = 16) +
  theme(
    axis.text.x = element_text(angle = 30, vjust = 0.5, hjust = 0.5, size = 12)  
  ) +
  facet_wrap(~"Distribución de Nivel de educación")

G2

Mayoría con educación secundaria o equivalente; minoría con títulos de posgrado.

Marital

Tabla_marital <- df %>%
  filter(!is.na(`marital.status`)) %>% 
  dplyr::group_by(`marital.status`) %>%                                  
  dplyr::summarise(Total = n()) %>%                                
  dplyr::mutate(Porcentaje = round(Total/sum(Total)*100, 1)) %>%   
  dplyr::arrange(`marital.status`)

Tabla_marital

## # A tibble: 7 × 3
##   marital.status           Total Porcentaje
##   <chr>                    <int>      <dbl>
## 1 " Divorced"               4443       13.6
## 2 " Married-AF-spouse"        23        0.1
## 3 " Married-civ-spouse"    14976       46  
## 4 " Married-spouse-absent"   418        1.3
## 5 " Never-married"         10683       32.8
## 6 " Separated"              1025        3.1
## 7 " Widowed"                 993        3

G3 <- ggplot(Tabla_marital, aes(x=`marital.status`, y=Total) ) + 
  geom_bar(width = 0.7, stat="identity", fill="lemonchiffon1") +  
  coord_cartesian(ylim = c(0, 17500)) +
  labs(x="Estado civil", y="Frecuencias \n (Porcentajes)") +
  geom_text(aes(label=paste0(Total, " (", Porcentaje, "%)")),
            vjust=-0.9, color="black", size=3) +  
  theme_bw(base_size = 16) +
  theme(
    axis.text.x = element_text(angle = 30, vjust = 0.5, hjust = 0.5, size = 12)  
  ) +
  facet_wrap(~"Distribución de estado civil")

G3

Alto porcentaje de casados, seguido por solteros y divorciados.

Occupation

Tabla_occupation <- df %>%
  filter(!is.na(occupation)) %>% 
  dplyr::group_by(occupation) %>%                                  
  dplyr::summarise(Total = n()) %>%                                
  dplyr::mutate(Porcentaje = round(Total/sum(Total)*100, 1)) %>%   
  dplyr::arrange(occupation)

Tabla_occupation

## # A tibble: 14 × 3
##    occupation           Total Porcentaje
##    <chr>                <int>      <dbl>
##  1 " Adm-clerical"       3770       11.6
##  2 " Armed-Forces"          9        0  
##  3 " Craft-repair"       4099       12.6
##  4 " Exec-managerial"    4066       12.5
##  5 " Farming-fishing"     994        3.1
##  6 " Handlers-cleaners"  1370        4.2
##  7 " Machine-op-inspct"  2002        6.1
##  8 " Other-service"      3295       10.1
##  9 " Priv-house-serv"     149        0.5
## 10 " Prof-specialty"     5983       18.4
## 11 " Protective-serv"     649        2  
## 12 " Sales"              3650       11.2
## 13 " Tech-support"        928        2.9
## 14 " Transport-moving"   1597        4.9

G4 <- ggplot(Tabla_occupation, aes(x=occupation, y=Total) ) + 
  geom_bar(width = 0.7, stat="identity", fill="darkolivegreen2") +  
  coord_cartesian(ylim = c(0, 25000)) +
  labs(x="Ocupación", y="Frecuencias \n (Porcentajes)") +
  geom_text(aes(label=paste0(Total, " (", Porcentaje, "%)")),
            vjust=-0.9, color="black", size=4) +  
  theme_bw(base_size = 16) +
  theme(
    axis.text.x = element_text(angle = 30, vjust = 0.5, hjust = 0.5, size = 12)  
  ) +
  facet_wrap(~"Distribución de la ocupación")

G4

Alta diversidad de profesiones; destacan craft-repair, exec-managerial y prof-specialty.

Relationship

Tabla_relationship <- df %>%
  filter(!is.na(relationship)) %>% 
  dplyr::group_by(relationship) %>%                                  
  dplyr::summarise(Total = n()) %>%                                
  dplyr::mutate(Porcentaje = round(Total/sum(Total)*100, 1)) %>%   
  dplyr::arrange(relationship)

Tabla_relationship

## # A tibble: 6 × 3
##   relationship      Total Porcentaje
##   <chr>             <int>      <dbl>
## 1 " Husband"        13193       40.5
## 2 " Not-in-family"   8305       25.5
## 3 " Other-relative"   981        3  
## 4 " Own-child"       5068       15.6
## 5 " Unmarried"       3446       10.6
## 6 " Wife"            1568        4.8

G5 <- ggplot(Tabla_relationship, aes(x=relationship, y=Total) ) + 
  geom_bar(width = 0.7, stat="identity", fill="coral") +  
  coord_cartesian(ylim = c(0, 15000)) +
  labs(x="Relación", y="Frecuencias \n (Porcentajes)") +
  geom_text(aes(label=paste0(Total, " (", Porcentaje, "%)")),
            vjust=-0.9, color="black", size=5) +  
  theme_bw(base_size = 16) +
  theme(
    axis.text.x = element_text(angle = 30, vjust = 0.5, hjust = 0.5, size = 12)  
  ) +
  facet_wrap(~"Distribución de relación del individuo")

G5

Variables numéricas discretas

A continuación realizaremos las gráficas correspondientes a las variables númericas, graficaremos boxplots con el fin de identificar outliers y evaluaremos la asimetría y curtósis de

Age

ggp1 <- df %>% 
  ggplot(aes(x = "", y = age)) +
  geom_boxplot(color = "black", fill = "lightsalmon", alpha = 0.5) +
  theme(legend.position = "none", plot.title = element_text(size = 11)) +
  ggtitle("Distribución de las Edades") +
  coord_flip()

ggp2 <- df %>% 
  ggplot(aes(x = age)) +
  geom_histogram(aes(y = ..density..), 
                 binwidth = 5,
                 color = "white",
                 fill = "lightsalmon",
                 alpha = 0.7) +
  geom_density(color = "darkred", size = 0.8, alpha = 0.2, fill = "darkred") +
  labs(title = "Distribución de Edades con Curva de Densidad") +
  theme_minimal()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

grid.arrange(ggp1, ggp2, ncol = 2)

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Distribución con mayor concentración entre 25 y 45 años; ligera cola hacia edades avanzadas.

Curtosis y asimetría - Age

kur = kurtosis(df[["age"]])
skew = skew(df[["age"]])
  
print(paste("Kurtosis:", round(kur, 2)))

## [1] "Kurtosis: 2.83"

print(paste("Skew:", round(skew, 2)))

## [1] "Skew: 0.56"

Curtosis = 2.83

Una curtosis de 2.83 indica una distribución ligeramente platicúrtica, un poco mas plana que la distribución normal
Skew = 0.56

El valor skew de 0.56 indica asimetría hacia la derecha, la mayoría de las edades se encuentran en valores bajos, pero hay ciertos valores altos que forman una cola hacia la derecha

Capital-gain

df %>% 
  ggplot(aes(x = "", y = capital.gain)) +
  geom_boxplot(color = "black", fill = "lightsalmon", alpha = 0.5) +
  theme(legend.position = "none", plot.title = element_text(size = 11)) +
  ggtitle("Distribución de Capital gain") +
  coord_flip()

La mayoría de valores en cero; datos muy dispersos y con pocos registros con ganancias de capital altas, lo que indica fuerte asimetría.

Asimetría - Capital Gain

skew = skew(df[["capital.gain"]])    
print(paste("Skew:", round(skew, 2)))

## [1] "Skew: 11.95"

Skew = 11.95

El valor skew de 11.56 indica alta asimetría hacia la derecha, muchos de los valores de capital gain se encuentran en valores bajos, y unos pocos valores extremadamente altos que arrastran la cola hacia la derecha

Capital-loss

df %>% 
  ggplot(aes(x = "", y = capital.loss)) +
  geom_boxplot(color = "black", fill = "lightpink", alpha = 0.5) +
  theme(legend.position = "none", plot.title = element_text(size = 11)) +
  ggtitle("Distribución de Capital loss") +
  coord_flip()

Similar a Capital-gain, la mayoría sin pérdidas; pocos casos con valores significativos.

Asimetría - Capital Loss

skew = skew(df[["capital.loss"]])     
print(paste("Skew:", round(skew, 2)))

## [1] "Skew: 4.59"

Skew = 4.59

El valor skew de 4.59 indica también alta asimetría hacia la derecha, aunque no tan marcado como en la variable "Capital gain" muchos de los valores de capital loss se encuentran en valores bajos (o cero incluso), y unos pocos valores extremadamente altos que arrastran la cola hacia la derecha

Hours per week

ggp1 <- df %>% 
  ggplot(aes(x = "", y = hours.per.week)) +
  geom_boxplot(color = "black", fill = "palegreen", alpha = 0.5) +
  theme(legend.position = "none", plot.title = element_text(size = 11)) +
  ggtitle("Distribución de las Edades") +
  coord_flip()

ggp2 <- df %>% 
  ggplot(aes(x = hours.per.week)) +
  geom_histogram(aes(y = ..density..), 
                 binwidth = 5,
                 color = "white",
                 fill = "palegreen",
                 alpha = 0.7) +
  geom_density(color = "darkgreen", size = 0.8, alpha = 0.2, fill = "darkgreen") +
  labs(title = "Distribución de Edades con Curva de Densidad") +
  theme_minimal()

grid.arrange(ggp1, ggp2, ncol = 2)

Pico marcado en 40 horas, lo que sugiere predominio de empleo a tiempo completo.

Curtosis y Asimetría - Hours per Week

kur = kurtosis(df[["hours.per.week"]])
skew = skew(df[["hours.per.week"]])
  
print(paste("Kurtosis:", round(kur, 2)))

## [1] "Kurtosis: 5.92"

print(paste("Skew:", round(skew, 2)))

## [1] "Skew: 0.23"

Curtosis = 5.92

Esto refleja una alta concentración de valores alrededor del centro, pero también una mayor probabilidad de observar valores extremos.
Skew = 0.23

El valor skew de 0.23 indica una muy ligera cola hacia la derecha.

Análisis bivariado

Sex - Workclass

Tabla_sex_occ <- df %>%
  filter(!is.na(sex), !is.na(occupation)) %>%         
  group_by(sex, occupation) %>%                       
  summarise(Total = n(), .groups = "drop") %>%
  mutate(Porcentaje = round(Total / sum(Total) * 100, 1))

ggplot(Tabla_sex_occ, aes(x = occupation, y = Total, fill = sex)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  geom_text(aes(label = Total),
            position = position_dodge(width = 0.9),
            vjust = -0.3, size = 3) +
  labs(x = "Ocupación", y = "Frecuencia", fill = "Sexo") +
  theme_bw(base_size = 14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

En el sector privado predominan hombres. En trabajos del gobierno y sin clasificación clara hay mayor equilibrio de género.

Workclass - hours-per-week

ggplot(
  df %>% filter(!is.na(workclass), !is.na(hours.per.week)),
  aes(x = workclass, y = hours.per.week, fill = workclass)
) +
  geom_boxplot() +
  labs(title = "Diagrama de horas de trabajo semanal de acuerdo a la ocupación",
       x = "Trabajo", y = "Horas por semana") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Sector privado con mayor concentración en 40 horas/semana, autónomos tienden a trabajar más de 40 horas, mientras que otros sectores presentan jornadas parciales (< 40 horas).

Income - Education

Tabla_income_edu <- df %>%
  filter(!is.na(income), !is.na(education)) %>%         
  group_by(income, education) %>%                       
  summarise(Total = n(), .groups = "drop") %>%
  mutate(Porcentaje = round(Total / sum(Total) * 100, 1))

ggplot(Tabla_income_edu, aes(x = education, y = Total, fill = income)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  geom_text(aes(label = Total),
            position = position_dodge(width = 0.9),
            vjust = -0.3, size = 3) +
  labs(x = "Educación", y = "Frecuencia", fill = "Income") +
  scale_fill_manual(values = c("orange", "seagreen2")) +
  theme_bw(base_size = 14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Ingresos > 50K son más frecuentes en personas con educación universitaria o superior.

Sex - Income

Tabla_income_sex <- df %>%
  filter(!is.na(sex), !is.na(income)) %>%         
  group_by(sex, income) %>%                       
  summarise(Total = n(), .groups = "drop") %>%
  mutate(Porcentaje = round(Total / sum(Total) * 100, 1))

ggplot(Tabla_income_sex, aes(x = income, y = Total, fill = sex)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  geom_text(aes(label = Total),
            position = position_dodge(width = 0.9),
            vjust = -0.3, size = 3) +
  labs(x = "Income", y = "Frecuencia", fill = "Income") +
  theme_bw(base_size = 14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Hombres con mayor proporción de ingresos > 50K. Mujeres presentan más registros en el rango ≤ 50K, lo que sugiere posible brecha salarial o de acceso a empleos mejor remunerados.

Workclass - Income

Tabla_income_workclass <- df %>%
  filter(!is.na(workclass), !is.na(income)) %>%         
  group_by(workclass, income) %>%                       
  summarise(Total = n(), .groups = "drop") %>%
  mutate(Porcentaje = round(Total / sum(Total) * 100, 1))

ggplot(Tabla_income_workclass, aes(x = workclass, y = Total, fill = income)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  geom_text(aes(label = Total),
            position = position_dodge(width = 0.9),
            vjust = -0.3, size = 3) +
  labs(x = "Ocupación", y = "Frecuencia", fill = "Income") +
  scale_fill_manual(values = c("steelblue3", "tan2")) +
  theme_bw(base_size = 14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Sector privado concentra la mayor cantidad de personas en ambos niveles de ingresos, pero también es donde hay mayor desigualdad. Autónomos y gobierno tienen porcentajes más equilibrados de > 50K.

Analisis Multivariado

Utilizaremos una matriz de correlación para identificar relaciones entre las variables numéricas

numeric_df <- df[sapply(df, is.numeric)]

cor_matrix <- cor(numeric_df, use = "complete.obs")

cor_df <- melt(cor_matrix)

ggplot(melt(cor_matrix), aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white", linewidth = 0.7) +
  geom_text(aes(label = round(value, 2)), 
            color = "black", size = 3.5) +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                      midpoint = 0, limit = c(-1, 1)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_blank()) +
  labs(title = "Matriz de Correlación",
       x = "",
       y = "",
       fill = "Correlación") +
  coord_fixed()

No se evidencian correlaciones lineales fuertes entre las variables numéricas del dataset. Aunque se evidencian un par con un ligero nivel de correlación.

‘capital.gain’ y ‘education.num’ = 0.12

Muy leve relación positiva a mayor nivel educativo, ligeramente más ganancia de capital.
‘capital.gain’ y ‘education.num’ = 0.15

Trabajar más horas suele estar débilmente asociado con mayor educación.

Tratamiento de Outliers en variables numéricas

Utilizaremos el método de filtro de Hampel, dado que las distribuciones están sesgadas y cuentan con alto número de outliers y picos. Luego, usaremos capping en las variables Age y Hours per Week, con el fin de tratar los outliers

hampel_filter <- function(x, k = 3) {
  median_x <- median(x, na.rm = TRUE)
  mad_x <- mad(x, constant = 1, na.rm = TRUE) 
  
  lower_bound <- median_x - k * mad_x
  upper_bound <- median_x + k * mad_x
  
  x[x < lower_bound] <- lower_bound
  x[x > upper_bound] <- upper_bound
  
  return(x)
}

df <- df %>%
  mutate(
    age = hampel_filter(age, k = 3),
    
    hours.per.week = hampel_filter(hours.per.week, k = 2.5)
  )

summary(df[c("age", "hours.per.week")])

##       age        hours.per.week 
##  Min.   :17.00   Min.   :32.50  
##  1st Qu.:28.00   1st Qu.:40.00  
##  Median :37.00   Median :40.00  
##  Mean   :38.41   Mean   :40.42  
##  3rd Qu.:48.00   3rd Qu.:45.00  
##  Max.   :67.00   Max.   :47.50

Graficas de outliers luego de la imputación en variable Age luego de la imputación

ggp1 <- df %>% 
  ggplot(aes(x = "", y = age)) +
  geom_boxplot(color = "black", fill = "lightsalmon", alpha = 0.5) +
  theme(legend.position = "none", plot.title = element_text(size = 11)) +
  ggtitle("Distribución de las Edades") +
  coord_flip()

ggp2 <- df %>% 
  ggplot(aes(x = age)) +
  geom_histogram(aes(y = ..density..), 
                 binwidth = 5,
                 color = "white",
                 fill = "lightsalmon",
                 alpha = 0.7) +
  geom_density(color = "darkred", size = 0.8, alpha = 0.2, fill = "darkred") +
  labs(title = "Distribución de Edades con Curva de Densidad") +
  theme_minimal()

grid.arrange(ggp1, ggp2, ncol = 2)

Graficas de outliers luego de la imputación en variable Hours per Week luego de la imputación

ggp1 <- df %>% 
  ggplot(aes(x = "", y = hours.per.week)) +
  geom_boxplot(color = "black", fill = "palegreen", alpha = 0.5) +
  theme(legend.position = "none", plot.title = element_text(size = 11)) +
  ggtitle("Distribución de las Edades") +
  coord_flip()

ggp2 <- df %>% 
  ggplot(aes(x = hours.per.week)) +
  geom_histogram(aes(y = ..density..), 
                 binwidth = 5,
                 color = "white",
                 fill = "palegreen",
                 alpha = 0.7) +
  geom_density(color = "darkgreen", size = 0.8, alpha = 0.2, fill = "darkgreen") +
  labs(title = "Distribución de Hours per Week con Curva de Densidad") +
  theme_minimal()

grid.arrange(ggp1, ggp2, ncol = 2)

Para las variables capital.gain y capital.loss, el 75% de los valores es igual a 0, como se puede observar en el summary en el Q3, por lo cual se optó por no realizar imputación, dado que la alta concentración de ceros es una característica propia de la distribución de estas variables y no el resultado de datos faltantes. Además, la imputación podría distorsionar su comportamiento real, ya que los valores distintos de cero corresponden a casos legítimos pero poco frecuentes.

summary(df$capital.gain)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0    1078       0   99999

summary(df$capital.loss)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0    87.3     0.0  4356.0

Hallazgos clave

Brecha de género: hombres predominan en ingresos altos y sectores técnicos.
Educación como factor clave: correlación positiva entre nivel educativo y probabilidad de ingresos > 50K.
Sector laboral y horas trabajadas: los autónomos tienden a trabajar más horas, pero no necesariamente todos alcanzan mayores ingresos.
Concentración de ingresos bajos: gran parte de la población gana ≤ 50K, independientemente del sector laboral.
Poca correlación entra variables numéricas

EDA - Adult

Cristian Linero, David Ibañez

2025-08-14

Importar liberías

Base de datos utilizada

Explicación de variables:

Lectura de Datos

Revisión de categorías únicas y número de ocurrencias en variables categóricas

Workclass

Marital

Education

Relationship

Race

Análisis univariado

Variables binarias

Variable Sex

Pie chart - Sex

Tabla de frequencia

Variable Income

Pie chart - Income

Tabla de frequencia

Variables categóricas nominales

Workclass

Education

Marital

Occupation

Relationship

Variables numéricas discretas

Age

Curtosis y asimetría - Age

Capital-gain

Asimetría - Capital Gain

Capital-loss

Asimetría - Capital Loss

Hours per week

Curtosis y Asimetría - Hours per Week

Análisis bivariado

Sex - Workclass

Workclass - hours-per-week

Income - Education

Sex - Income

Workclass - Income

Analisis Multivariado

Tratamiento de Outliers en variables numéricas

Hallazgos clave