Stage 1: Data Proposal and Probability Analysis Report

Classification Dataset: Adult Census Income

Author

Selhan Çil & Sude Arslan

Published

May 10, 2026

# Packages used in this report
library(tidyverse)
library(knitr)
library(scales)

theme_set(theme_minimal(base_size = 12))

1 Dataset Description and Source

The dataset used for the classification part of this project is the Adult Census Income dataset, also known as the “Census Income” dataset. It was originally created from U.S. Census Bureau data and is available through the UCI Machine Learning Repository and Kaggle.

Source:

Each observation represents an individual. The target variable is income, which records whether an individual’s annual income is <=50K or >50K. Because the outcome has two possible classes, this is a binary classification dataset. The predictors include demographic and labor market variables such as age, education, occupation, work class, marital status, sex, capital gain, capital loss, and weekly working hours.

This dataset is relevant to an economic question because it can be used to study the relationship between human capital, labor market position, work intensity, and income inequality. In labor economics, education, occupation, age, and working hours are important predictors of earnings. A classification model based on this dataset can help identify which socioeconomic characteristics are associated with belonging to a higher-income group.

The dataset satisfies the assignment requirements: after cleaning, it contains 30,162 observations and 12 selected variables (14 columns once the two derived target columns are included), well above the minimums of 500 observations and 5 variables.

2 Economic Question

To what extent do education, occupation, age, and weekly working hours predict whether an individual belongs to the high-income group earning more than $50,000 per year?

3 Data Import and Cleaning

# The CSV file is stored in the same folder as this Quarto document.
# Keeping a relative path makes the project portable across computers.
adult_path <- "adult.csv"

# Import the Adult Census Income dataset.
adult_raw <- read_csv(adult_path, show_col_types = FALSE)

# Clean the data:
# 1. Convert "?" values to missing values.
# 2. Rename variables to snake_case for easier use in R.
# 3. Keep economically relevant variables for the classification question.
# 4. Create the binary target high_income (1 if income is >50K, 0 otherwise)
#    and its log(1 + x) transform, log_high_income.
# 5. Convert categorical variables to factors and numeric variables to numeric type.
# 6. Remove observations with missing values in the selected variables.
adult_clean <- adult_raw |>
  mutate(across(where(is.character), ~ na_if(.x, "?"))) |>
  rename(
    education_num = education.num,
    marital_status = marital.status,
    capital_gain = capital.gain,
    capital_loss = capital.loss,
    hours_per_week = hours.per.week,
    native_country = native.country
  ) |>
  select(
    income, age, education, education_num, occupation, hours_per_week,
    sex, marital_status, workclass, capital_gain, capital_loss, native_country
  ) |>
  mutate(
    income = factor(income, levels = c("<=50K", ">50K")),
    high_income = if_else(income == ">50K", 1L, 0L),
    log_high_income = log1p(high_income),
    across(
      c(education, occupation, sex, marital_status, workclass, native_country),
      as.factor
    ),
    across(
      c(age, education_num, hours_per_week, capital_gain, capital_loss),
      as.numeric
    )
  ) |>
  drop_na()

# Check whether the cleaned dataset satisfies the project requirements.
dataset_size <- tibble(
  observations = nrow(adult_clean),
  variables = ncol(adult_clean)
)

kable(dataset_size, caption = "Cleaned Dataset Size")

Cleaned Dataset Size
observations  variables
30162         14

After cleaning, the data are tidy: each row is one individual and each column is one variable. Missing values coded as ? were converted to NA, and observations with missing values in the selected variables were removed.
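The effect of the missing-value steps can be illustrated on a small made-up table. The sketch below is only an illustration: the tibble `toy` and its values are invented and are not rows from adult.csv. It shows how `na_if()` followed by `drop_na()` removes rows in which a selected variable was coded as `?`.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (invented for illustration, not taken from adult.csv).
toy <- tibble(
  age = c(39, 50, 38),
  workclass = c("Private", "?", "Private"),
  income = c("<=50K", ">50K", ">50K")
)

toy_clean <- toy |>
  # Recode the "?" placeholder as NA in every character column.
  mutate(across(where(is.character), ~ na_if(.x, "?"))) |>
  # Drop any row that still contains a missing value.
  drop_na()

nrow(toy)        # 3 rows before cleaning
nrow(toy_clean)  # 2 rows after cleaning
```

The same two calls appear in the cleaning pipeline above. The toy version only makes the row loss visible.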

class_balance <- adult_clean |>
  count(income, name = "observations") |>
  mutate(share = observations / sum(observations))

kable(
  class_balance,
  digits = 3,
  caption = "Target Variable Class Balance"
)

Target Variable Class Balance
income  observations  share
<=50K   22654         0.751
>50K    7508          0.249

4 Summary Statistics for the Target Variable

For probability analysis, the target variable is converted into the numeric binary variable high_income, where:

  • 0 means the individual earns <=50K
  • 1 means the individual earns >50K

income_summary <- adult_clean |>
  summarise(
    mean = mean(high_income),
    median = median(high_income),
    standard_deviation = sd(high_income),
    minimum = min(high_income),
    q1 = quantile(high_income, 0.25),
    q3 = quantile(high_income, 0.75),
    maximum = max(high_income)
  )

kable(
  income_summary,
  digits = 3,
  caption = "Summary Statistics for the Binary High-Income Target"
)

Summary Statistics for the Binary High-Income Target
mean   median  standard_deviation  minimum  q1  q3  maximum
0.249  0       0.432               0        0   0   1

Because high_income is a binary variable, its mean is the sample proportion of individuals earning more than $50,000 per year (0.249) and can be interpreted as the estimated probability of belonging to the high-income group. The median and quartiles also reflect the class distribution: because 75.1% of individuals are in the <=50K group, the median and both quartiles equal 0.
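The link between the mean, the class share, and the standard deviation can be checked without the CSV file. The following is a minimal sketch that rebuilds the 0/1 target from the two class counts reported in the class balance table. The object names (`n_low`, `n_high`, `p_hat`, and so on) are chosen here for illustration.

```r
# Class counts copied from the "Target Variable Class Balance" table.
n_low  <- 22654  # individuals with income <=50K
n_high <- 7508   # individuals with income >50K

# Rebuild the 0/1 target from the counts.
high_income <- rep(c(0L, 1L), times = c(n_low, n_high))

p_hat        <- mean(high_income)           # sample share of >50K
sd_sample    <- sd(high_income)             # sample standard deviation (n - 1 denominator)
sd_bernoulli <- sqrt(p_hat * (1 - p_hat))   # Bernoulli standard deviation evaluated at p_hat

round(c(p_hat = p_hat, sd_sample = sd_sample, sd_bernoulli = sd_bernoulli), 3)
```

With a sample this large, the sample standard deviation and the value sqrt(p(1 - p)) agree to three decimals, which is why the table reports 0.432.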

5 Histogram of the Original Target Variable

ggplot(adult_clean, aes(x = high_income)) +
  geom_histogram(
    binwidth = 0.5,
    boundary = -0.25,
    fill = "#2f6f73",
    color = "white"
  ) +
  scale_x_continuous(
    breaks = c(0, 1),
    labels = c("0 = <=50K", "1 = >50K")
  ) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Distribution of the High-Income Target",
    x = "High-income classification target",
    y = "Number of individuals"
  )

The original target variable is not normally distributed because it is binary: the histogram has probability mass only at 0 and 1. About 75.1% of observations are in the <=50K group and 24.9% are in the >50K group, a ratio of roughly 3 to 1. The target is therefore imbalanced but still usable for a classification analysis.

6 Histogram of the Log-Transformed Target Variable

ggplot(adult_clean, aes(x = log_high_income)) +
  geom_histogram(
    binwidth = 0.25,
    boundary = -0.125,
    fill = "#b46a3c",
    color = "white"
  ) +
  scale_x_continuous(
    breaks = c(0, log(2)),
    labels = c("log(1 + 0)", "log(1 + 1)")
  ) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Distribution of the Log-Transformed High-Income Target",
    x = "log(1 + high_income)",
    y = "Number of individuals"
  )

The log transformation does not make the target variable normally distributed. Since high_income can only take the values 0 and 1, log(1 + high_income) can only take the values 0 and log(2), which is approximately 0.693. The transformation rescales the value 1 and leaves 0 unchanged, but it does not change the fact that the outcome is binary.
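This point can be verified directly. The short sketch below applies `log1p()` to the two possible values of the target. The vector `high_income_values` is introduced here only for illustration.

```r
# The binary target can only take these two values.
high_income_values <- c(0, 1)

# log1p(x) computes log(1 + x).
log_values <- log1p(high_income_values)

log_values                  # 0 and log(2)
length(unique(log_values))  # still exactly two distinct values
```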

7 Proposed Theoretical Distribution

The most appropriate theoretical distribution for the target variable is a Bernoulli distribution because each observation has exactly two possible outcomes:

  • success: the individual earns >50K
  • failure: the individual earns <=50K

Among the distributions listed in the assignment prompt, the normal, log-normal, and exponential distributions are not appropriate for this target variable because they describe continuous outcomes. A normally distributed variable can take any real value, a log-normal distribution is used for positive, right-skewed continuous data, and an exponential distribution is commonly used for waiting-time or duration data. In contrast, this classification target takes only the values 0 and 1, so the Bernoulli distribution provides the clearest probability foundation.
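As a rough illustration of this argument, the sketch below evaluates a Bernoulli probability mass function at the estimated success probability and contrasts it with a normal distribution that has the same mean and standard deviation. It assumes the class counts reported earlier (7508 of 30162), and the object names are chosen here for illustration.

```r
# Estimated success probability from the class balance table.
p_hat <- 7508 / 30162

# A Bernoulli(p) variable is a binomial variable with a single trial (size = 1).
bernoulli_pmf <- dbinom(c(0, 1), size = 1, prob = p_hat)
names(bernoulli_pmf) <- c("P(high_income = 0)", "P(high_income = 1)")
round(bernoulli_pmf, 3)

# A normal distribution with the same mean and standard deviation would
# place some probability on negative values, which the target cannot take.
normal_below_zero <- pnorm(0, mean = p_hat, sd = sqrt(p_hat * (1 - p_hat)))
normal_below_zero
```

The Bernoulli probabilities reproduce the observed class shares, while the matched normal distribution assigns a substantial probability to values below 0, which is impossible for a 0/1 outcome.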

8 Conclusion

The Adult Census Income dataset is suitable for the classification part of the project because it contains a binary income target, many economically meaningful predictor variables, and far more than the minimum required number of observations. The main economic question focuses on whether education, occupation, age, and weekly working hours can predict membership in the high-income group. The probability analysis shows that the target variable is binary and imbalanced, so it should be modeled using a Bernoulli distribution rather than a normal, log-normal, or exponential distribution.