Research Question: Can we predict a person is married or not based on their personal income?
For Project 3, I chose the census dataset from openintro.org. The dataset contains 500 observations/cases and 8 variables. The ones I will use for the project are personal income and marital status.The other 6 variables in the dataset are year, state, census year, total family income, age, sex, race and general race.
##Loading in the dataset and appropriate libraries.
data <- read.csv("census.csv")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(cowplot)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ lubridate 1.9.4 ✔ tibble 3.3.0
## ✔ purrr 1.1.0 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ lubridate::stamp() masks cowplot::stamp()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
I will be using predictive data analysis because I want to see if marital status is relevant to total personal income. To clean the data, I will check the structure and head of the dataset and group by the variables I am using. In my 3rd chunk I will focus on making a table that includes my chosen variables for a more focused analysis for when I code my regression model.
##Check head and structure of data
head(data)
## census_year state_fips_code total_family_income age sex race_general
## 1 2000 Florida 14550 44 Male Two major races
## 2 2000 Florida 22800 20 Female White
## 3 2000 Florida 0 20 Male Black
## 4 2000 Florida 23000 6 Female White
## 5 2000 Florida 48000 55 Male White
## 6 2000 Florida 74000 43 Female White
## marital_status total_personal_income
## 1 Married/spouse present 0
## 2 Never married/single 13000
## 3 Never married/single 20000
## 4 Never married/single NA
## 5 Married/spouse present 36000
## 6 Married/spouse present 27000
names(data)
## [1] "census_year" "state_fips_code" "total_family_income"
## [4] "age" "sex" "race_general"
## [7] "marital_status" "total_personal_income"
str(data)
## 'data.frame': 500 obs. of 8 variables:
## $ census_year : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
## $ state_fips_code : chr "Florida" "Florida" "Florida" "Florida" ...
## $ total_family_income : int 14550 22800 0 23000 48000 74000 23000 74000 60000 14600 ...
## $ age : int 44 20 20 6 55 43 60 47 54 58 ...
## $ sex : chr "Male" "Female" "Male" "Female" ...
## $ race_general : chr "Two major races" "White" "Black" "White" ...
## $ marital_status : chr "Married/spouse present" "Never married/single" "Never married/single" "Never married/single" ...
## $ total_personal_income: int 0 13000 20000 NA 36000 27000 11800 48000 40000 14600 ...
data |>
group_by(total_personal_income, marital_status)
## # A tibble: 500 × 8
## # Groups: total_personal_income, marital_status [281]
## census_year state_fips_code total_family_income age sex race_general
## <int> <chr> <int> <int> <chr> <chr>
## 1 2000 Florida 14550 44 Male Two major races
## 2 2000 Florida 22800 20 Female White
## 3 2000 Florida 0 20 Male Black
## 4 2000 Florida 23000 6 Female White
## 5 2000 Florida 48000 55 Male White
## 6 2000 Florida 74000 43 Female White
## 7 2000 Florida 23000 60 Female White
## 8 2000 Florida 74000 47 Female White
## 9 2000 Florida 60000 54 Female Black
## 10 2000 Florida 14600 58 Female White
## # ℹ 490 more rows
## # ℹ 2 more variables: marital_status <chr>, total_personal_income <int>
##Make a table that includes the 2 variables I'm focusing on for the linear model
lmdata <- table(data$total_personal_income, data$marital_status)
Firstly, I had to convert marital status using the factor function because it is a categorical variable with 3+ categories. Then, I created the linear regression model using my two chosen variables, marital status and total personal income. To call the model coefficient and intercept, I use the summary function on my new linear regression model.
Based on the p-value 0.953 being closer to 1 than 0 we can predict there is little to no statistical significance that personal income has to do with marital status. Since we’re using log odd probabilities for the model, based on the coefficients for total personal income all being extremely close to 0, we again can interpret that marital status and personal income have little to do with one another.
data$marital_status <- factor(data$marital_status)
lgmodel <- glm(marital_status ~ total_personal_income, data = data, family = binomial)
summary(lgmodel)
##
## Call:
## glm(formula = marital_status ~ total_personal_income, family = binomial,
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.238e+00 2.014e-01 11.114 <2e-16 ***
## total_personal_income -2.155e-07 3.626e-06 -0.059 0.953
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 249.55 on 391 degrees of freedom
## Residual deviance: 249.55 on 390 degrees of freedom
## (108 observations deleted due to missingness)
## AIC: 253.55
##
## Number of Fisher Scoring iterations: 5
##Diagnostics
predicted.classes <- predict(lgmodel, newdata = data)
matrixx <- table(
Predicted = factor(predicted.classes, levels = c("Married/spouse present", "Never married/single", "Widowed")),
Actual = factor(data$marital_status, levels = c("Married/spouse present", "Never married/single", "Widowed")))
matrixx <- na.omit(data)
roclg <- multiclass.roc(response = data$marital_status, predictor = predicted.classes)
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls > cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls > cases
## Setting direction: controls > cases
auc_val <- auc(roclg); auc_val
## Multi-class area under the curve: 0.6148
The AUC value area under the curve is 0.6148. This tells me that the model is weak at predicting marital status based off the one predictor I used being total personal income. While it is better than 50%, randomly guessing, the model is pretty weak at distinguishing someone’s personal income from their marital status from the cencus.
If you randomly pick one person from the census, the model has about a 61% chance of ranking someone based off their marital status and personal income.
##Conclusion
From the regression analysis I can conclude that we cannot determine someone’s total personal income is higher or lower based on their marital status. Therefore, the answer to my project question would be no, we cannot determine that someone’s marital status has anything to do with their individual income. The fit of the model is weak only at 61%.
In the future, or if I were to do this project over again, I would add in total family income as a second predictor for a stronger model. I also could have been more successful if I had changed the levels within the category marital status to numeric values like 1, 2, and 3 to get a numeric result for ROC. I would rather have done this model and made all the variables I used numeric.
References:
logistic.regression Rmd
Dataset: https://www.openintro.org/data/index.php?data=census
Week 10 zoom recordings on BB DATA 101