Week 5 - Data Dive

Documentation

Section 1: Unclear Columns

The following columns from the Census Income data set were not clear before I read the documentation:

capital_gain - The documentation does not explain this term, so I had to look it up online. A capital gain is simply the profit made on an investment. https://www.investopedia.com/terms/c/capitalgain.asp
capital_loss - The same applies to this column: I didn’t know exactly what it meant. After some research online, I found that a capital loss is the loss made on an investment. https://www.investopedia.com/terms/c/capitalloss.asp
fnlwgt - After searching through a number of online sources, I found one that describes it as the final weight from the Current Population Survey (CPS), which is arrived at using three sets of controls. https://www.kaggle.com/datasets/uciml/adult-census-income/data

Section 2: Columns Still Unclear After Documentation

While I now understand that fnlwgt stands for final weight, it was still unclear how that feature affects my data set. After some more research, I found that it is the weight the census assigns to each entry, indicating how many people in the population that entry represents. Initially, I thought it was some sort of unique ID for each entry. With this new information, I will be able to investigate whether I can use the fnlwgt feature to correct for sampling bias if it arises. https://www.kaggle.com/code/yashhvyass/adult-census-income-logistic-reg-explained-86-2
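To make the role of fnlwgt more concrete, here is a minimal sketch in base R of how a weighted share can differ from a raw share. The data frame `toy` and all of its values are invented for illustration (they are not rows from the Census Income data), and the `income` labels are an assumption about how that column is coded.

```r
# Toy data, invented for illustration only (not taken from the real data set).
toy <- data.frame(
  income = c(">50K", "<=50K", "<=50K", ">50K"),
  fnlwgt = c(100, 300, 500, 100)
)

# Indicator: 1 if the entry reports income above 50K, otherwise 0.
high_income <- as.numeric(toy$income == ">50K")

# Unweighted share: every row counts equally.
unweighted_share <- mean(high_income)

# Weighted share: each row counts in proportion to its fnlwgt.
weighted_share <- weighted.mean(high_income, w = toy$fnlwgt)

print(c(unweighted = unweighted_share, weighted = weighted_share))
```

If the two numbers differ noticeably on the real data, that would suggest the raw sample over- or under-represents some groups, which is the situation where fnlwgt might help.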
Also, some entries in the work_class column contain a “?” symbol. I suspected these were placeholders for missing values, but I wasn’t quite sure. After some research online, I found that this was correct. The documentation also indicates that the work_class column has missing values, so it checked out.
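One possible way to handle these placeholders is to convert them into real NA values, so that R's missing-value tools can be used on them. The sketch below uses a small made-up column; the name `toy` and the category values are hypothetical, and the leading spaces are an assumption about how the raw values are stored.

```r
# Toy column, invented for illustration; values are assumed to carry
# a leading space, as raw census-style text files sometimes do.
toy <- data.frame(
  workclass = c(" Private", " ?", " State-gov", " ?", " Private"),
  stringsAsFactors = FALSE
)

# Strip surrounding whitespace so " ?" and "?" are treated the same.
toy$workclass <- trimws(toy$workclass)

# Replace the "?" placeholder with a real missing value.
toy$workclass[toy$workclass == "?"] <- NA

# Count missing entries per column.
print(colSums(is.na(toy)))
```

read.csv also has an na.strings argument that may serve the same purpose at load time, depending on how the file is formatted.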

Section 3: Visualization

I was a little concerned about the missing values in the work_class feature, since it is a critical feature of the Census Income data set. The visualization below shows what percentage of my entries have a missing value for this feature.

#first we load the library tidyverse
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

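# Load the data
df_census <- read.csv("./censusincome.csv")

# ggplot2 is already attached by library(tidyverse), so no separate call is needed

# Count entries whose workclass is the "?" placeholder.
# trimws() handles the leading space in the raw values; na.rm guards against NA.
missing_workclass_num <- sum(trimws(df_census$workclass) == "?", na.rm = TRUE)

total_entries <- nrow(df_census)

# Convert the count into percentages for plotting
missing_percentage <- (missing_workclass_num / total_entries) * 100
non_missing_percentage <- 100 - missing_percentage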

#bar plot for visualization
plot_data <- data.frame(
  Category = c("Missing", "Non-missing"),
  Percentage = c(missing_percentage, non_missing_percentage)
)

ggplot(plot_data, aes(x=Category, y=Percentage, fill=Category)) +
  geom_col() +
  scale_fill_manual(values = c("Missing" = "red", "Non-missing" = "green")) +
  labs(
    title = "Percentage of Missing Values for Work Class",
    y = "Percentage (%)",
    x = "Category"
  ) +
  theme_minimal()

From the plot above, we see that entries with a missing work class make up only a small percentage of the data compared to the non-missing entries. This implies that if we choose to filter out the entries with a missing work class, we won’t be removing a significant number of entries from our analysis.
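The filtering step described above could look something like the following sketch. It uses a small invented data frame rather than the real file; `toy`, `toy_complete`, and the `age` values are hypothetical and only serve to show the mechanics.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  workclass = c(" Private", " ?", " Private", " Local-gov", " Private",
                " ?", " State-gov", " Private", " Private", " Private"),
  age = c(39, 50, 38, 53, 28, 37, 49, 52, 31, 42),
  stringsAsFactors = FALSE
)

# Flag rows whose workclass is the "?" placeholder.
is_missing <- trimws(toy$workclass) == "?"

# Keep only the rows with a known workclass.
toy_complete <- toy[!is_missing, ]

# Report how much data the filter removes.
rows_removed <- nrow(toy) - nrow(toy_complete)
pct_removed <- 100 * rows_removed / nrow(toy)
cat("Rows removed:", rows_removed, "of", nrow(toy),
    sprintf("(%.1f%%)\n", pct_removed))

# Compare another variable across the two groups as a rough check that
# the dropped rows do not look very different from the kept ones.
print(tapply(toy$age, is_missing, mean))
```

Comparing a variable such as age between the dropped and kept rows is one rough way to check whether removing them might skew the analysis, even when the number of removed rows is small.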
