Documentation

Section 1: Unclear Columns

The following columns from the Census Income data set were not clear before I read the documentation:

  1. capital_gain - I didn’t know what capital gain means. The documentation does not explain it, so I had to look up the meaning online. It is the profit made when an investment is sold for more than its purchase price.

    https://www.investopedia.com/terms/c/capitalgain.asp

  2. capital_loss - The same applies to this column. I didn’t know exactly what it means. After some research online, I found that it is the loss incurred when an investment is sold for less than its purchase price. https://www.investopedia.com/terms/c/capitalloss.asp

  3. fnlwgt - After searching through a number of online sources, I found one that identifies it as the final weight from the Current Population Survey (CPS), which is arrived at using 3 sets of controls. https://www.kaggle.com/datasets/uciml/adult-census-income/data

Section 2: Columns Still Unclear After Reading the Documentation

  • While I now understand that fnlwgt stands for final weight, it was still unclear how that feature affects my data set. After some more research, I found out that it is the weight the census assigns to every entry to indicate how much of the population that entry represents. Initially, I thought it was some sort of unique id for every entry. With this new information, I will be able to investigate how I can use the fnlwgt feature to correct for sampling bias if it arises. https://www.kaggle.com/code/yashhvyass/adult-census-income-logistic-reg-explained-86-2

  • Also, for the workclass column, some of the entries contained the “?” symbol. I suspected these were placeholders for missing values, but I wasn’t quite sure. Further research online confirmed this, and the documentation also states that the workclass column has missing values, so it checked out.
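
To make the role of fnlwgt more concrete, here is a minimal sketch that compares an unweighted share with a share weighted by fnlwgt. The data frame toy_census and its values are made up for illustration and are not taken from the Census Income data. The sketch also assumes that fnlwgt can be read as the number of people each entry represents.

```r
# Toy data, invented for illustration only (hypothetical values).
toy_census <- data.frame(
  income = c(">50K", "<=50K", "<=50K", ">50K"),
  fnlwgt = c(100, 300, 500, 100)
)

# 1 if the entry is in the high-income group, 0 otherwise.
high_income <- as.numeric(toy_census$income == ">50K")

# Unweighted share: every entry counts exactly once.
unweighted_share <- mean(high_income)

# Weighted share: every entry counts in proportion to its weight.
weighted_share <- weighted.mean(high_income, w = toy_census$fnlwgt)

unweighted_share  # 0.5
weighted_share    # 0.2
```

In this toy case the two high-income entries carry small weights, so their weighted share is lower than their raw share. A gap like this in the real data would be a sign that the sample over- or under-represents some groups.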

Section 3: Visualization

I was a little concerned about the missing values in the workclass feature, as it is a critical feature of the Census Income data set. The plot below shows what percentage of my entries have missing values for that feature.

#first we load the library tidyverse
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#Loading data
df_census <- read.csv("./censusincome.csv")
#making the visualization plot (ggplot2 is already attached by tidyverse)


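#count the "?" placeholders; trimws() handles values stored with a leading space
missing_workclass_num <- sum(trimws(df_census$workclass) == "?", na.rm = TRUE)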

total_entries <- nrow(df_census)

missing_percentage <- (missing_workclass_num/total_entries) * 100
non_missing_percentage <- 100 - missing_percentage

#bar plot for visualization
plot_data <- data.frame(
  Category = c("Missing", "Non-missing"),
  Percentage = c(missing_percentage, non_missing_percentage)
)

ggplot(plot_data, aes(x=Category, y=Percentage, fill=Category)) +
  geom_col() +
  scale_fill_manual(values = c("Missing" = "red", "Non-missing" = "green")) +
  labs(
    title="Percentage of Missing Values for Work Class",
    y = "Percentage (%)",
    x = "Category"
  ) +
  theme_minimal()

  • From the plot above, we see that the percentage of missing values for the work class is small compared with the non-missing percentage. This implies that if we choose to filter out the entries that have missing values for the work class, we won’t be removing a large share of the entries from our analysis.
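
As a rough sketch of what that filtering step could look like, the example below recodes the “?” placeholder to NA and then drops those rows. It uses base R and a small made-up data frame (toy_census and its values are hypothetical), and it assumes missing values are stored as “?”, possibly with surrounding spaces.

```r
# Toy data, invented for illustration only (hypothetical values).
toy_census <- data.frame(
  workclass = c(" Private", " ?", " State-gov", " Private", " ?"),
  age = c(39, 50, 38, 53, 28)
)

# Strip surrounding spaces, then turn the "?" placeholder into a real NA.
workclass_clean <- trimws(toy_census$workclass)
workclass_clean[workclass_clean == "?"] <- NA
toy_census$workclass <- workclass_clean

# Keep only the rows where workclass is known.
census_complete <- toy_census[!is.na(toy_census$workclass), ]

# Number of rows removed by the filter.
rows_removed <- nrow(toy_census) - nrow(census_complete)
rows_removed  # 2
```

Recoding to NA first, rather than filtering on the “?” string directly, means that functions such as is.na() recognize the gaps in any later step.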

Conclusion

This week’s data dive helped us understand our data set better, and will definitely improve our future analysis. We now appreciate the need to understand our data through both its documentation and extra research.