The following columns from the Census Income data set were not clear before I read the documentation:
capital_gain - I didn’t know what a capital gain was. The documentation does not explain the term, so I had to look it up online. It simply means the gain made on an investment.
capital_loss - The same applies to this column. I didn’t know exactly what it meant. After some research online, I found that it means the loss made on an investment. https://www.investopedia.com/terms/c/capitalloss.asp
fnlwgt - After searching through a number of online sources, I found one that describes it as the final weight from the Current Population Survey (CPS), which is arrived at using 3 sets of controls. https://www.kaggle.com/datasets/uciml/adult-census-income/data
While I now understood that fnlwgt stands for final weight, it was still unclear how that feature affects my data set. After some more research, I found out that it is the weight the census assigns to each entry to indicate how much of the population that entry represents. Initially, I thought it was some sort of unique ID for every entry. With this new information, however, I will be able to investigate how I can use the fnlwgt feature to correct for sampling bias if it arises. https://www.kaggle.com/code/yashhvyass/adult-census-income-logistic-reg-explained-86-2
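To make the idea of a representation weight concrete, here is a minimal sketch of how fnlwgt could change a simple summary. The toy values are made up, and the `income` column name and its ">50K" label are assumptions for illustration, not something taken from the documentation.

```r
# Toy data standing in for the Census Income data set (all values are made up).
toy <- data.frame(
  income = c(">50K", "<=50K", "<=50K", ">50K"),  # assumed column name and labels
  fnlwgt = c(100, 300, 500, 100)
)

# Indicator: 1 if the entry earns more than 50K, 0 otherwise.
high_income <- as.numeric(toy$income == ">50K")

# Unweighted share: every row counts equally.
unweighted_share <- mean(high_income)

# Weighted share: each row counts in proportion to its fnlwgt.
weighted_share <- weighted.mean(high_income, w = toy$fnlwgt)

print(c(unweighted = unweighted_share, weighted = weighted_share))
```

In this toy example the two shares differ because the high-income rows carry small weights. Whether weighting matters for the real data is something I would still need to check.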
Also, for the work_class column, some of the entries contained “?” symbols. I thought they were placeholders for missing values, but I wasn’t quite sure. Again, I did some research online and found that this was correct. The documentation also indicates that the work_class column has missing values, so it checked out.
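One possible way to handle these placeholders is to convert them to NA while reading the file. The sketch below uses a few made-up rows passed through the `text` argument instead of the real file, and assumes the raw values have a leading space before the “?”.

```r
# A few made-up rows in the same shape as the raw file,
# including a leading space before each value.
raw_text <- "age,workclass\n39, State-gov\n54, ?\n25, Private\n"

toy <- read.csv(
  text = raw_text,
  strip.white = TRUE,          # drop leading/trailing spaces from each field
  na.strings = c("?", "NA")    # then treat "?" (and "NA") as missing
)

# Number of missing work class entries
print(sum(is.na(toy$workclass)))
```

With the placeholders stored as NA, functions such as `is.na()` can be used directly instead of comparing against the “?” string.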
I was a little concerned about the missing values in the work_class feature, as it is a critical feature of the Census Income data set. The plot below shows what percentage of my entries actually have a missing value for the work class feature.
#first we load the library tidyverse
library(tidyverse)
# Load the data
df_census <- read.csv("./censusincome.csv")
# Compute the percentage of entries with a missing work class
# (ggplot2 is already attached by library(tidyverse) above)
# Note: the comparison string " ?" includes a leading space
missing_workclass_num <- sum(df_census$workclass == " ?", na.rm = TRUE)
total_entries <- nrow(df_census)
missing_percentage <- (missing_workclass_num / total_entries) * 100
non_missing_percentage <- 100 - missing_percentage
# Bar plot for the visualization
plot_data <- data.frame(
  Category = c("Missing", "Non-missing"),
  Percentage = c(missing_percentage, non_missing_percentage)
)
ggplot(plot_data, aes(x = Category, y = Percentage, fill = Category)) +
  geom_col() +
  scale_fill_manual(values = c("Missing" = "red", "Non-missing" = "green")) +
  labs(
    title = "Percentage of Missing Values for Work Class",
    y = "Percentage (%)",
    x = "Category"
  ) +
  theme_minimal()
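The same check could be extended to every column at once, in case other features also use the “?” placeholder. This is only a sketch on made-up data: the `occupation` and `age` columns and all values here are hypothetical.

```r
# Made-up data frame; column names other than workclass are hypothetical.
toy <- data.frame(
  workclass  = c(" Private", " ?", " State-gov", " ?"),
  occupation = c(" Sales", " ?", " Tech-support", " Sales"),
  age        = c(39, 54, 25, 41)
)

# Percentage of "?" entries in each column, after trimming surrounding spaces
missing_pct <- sapply(
  toy,
  function(col) mean(trimws(as.character(col)) == "?") * 100
)

print(missing_pct)
```

Trimming with `trimws()` means the check works whether or not the values carry a leading space.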
This week’s data dive helped me understand my data set better, and it should improve my future analysis. I now appreciate the need to understand my data through both the documentation and extra research.