Introduction

Research Question: Can we predict a person is married or not based on their personal income?

For Project 3, I chose the census dataset from openintro.org. The dataset contains 500 observations/cases and 8 variables. The ones I will use for the project are personal income and marital status.The other 6 variables in the dataset are year, state, census year, total family income, age, sex, race and general race.

##Loading in the dataset and appropriate libraries.

data <- read.csv("census.csv")

library(dplyr) 
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(cowplot)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ✔ purrr     1.1.0     ✔ tidyr     1.3.1
## ✔ readr     2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ lubridate::stamp() masks cowplot::stamp()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

Data Analysis

I will be using predictive data analysis because I want to see if marital status is relevant to total personal income. To clean the data, I will check the structure and head of the dataset and group by the variables I am using. In my 3rd chunk I will focus on making a table that includes my chosen variables for a more focused analysis for when I code my regression model.

##Check head and structure of data

head(data)
##   census_year state_fips_code total_family_income age    sex    race_general
## 1        2000         Florida               14550  44   Male Two major races
## 2        2000         Florida               22800  20 Female           White
## 3        2000         Florida                   0  20   Male           Black
## 4        2000         Florida               23000   6 Female           White
## 5        2000         Florida               48000  55   Male           White
## 6        2000         Florida               74000  43 Female           White
##           marital_status total_personal_income
## 1 Married/spouse present                     0
## 2   Never married/single                 13000
## 3   Never married/single                 20000
## 4   Never married/single                    NA
## 5 Married/spouse present                 36000
## 6 Married/spouse present                 27000
names(data)
## [1] "census_year"           "state_fips_code"       "total_family_income"  
## [4] "age"                   "sex"                   "race_general"         
## [7] "marital_status"        "total_personal_income"
str(data)
## 'data.frame':    500 obs. of  8 variables:
##  $ census_year          : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ state_fips_code      : chr  "Florida" "Florida" "Florida" "Florida" ...
##  $ total_family_income  : int  14550 22800 0 23000 48000 74000 23000 74000 60000 14600 ...
##  $ age                  : int  44 20 20 6 55 43 60 47 54 58 ...
##  $ sex                  : chr  "Male" "Female" "Male" "Female" ...
##  $ race_general         : chr  "Two major races" "White" "Black" "White" ...
##  $ marital_status       : chr  "Married/spouse present" "Never married/single" "Never married/single" "Never married/single" ...
##  $ total_personal_income: int  0 13000 20000 NA 36000 27000 11800 48000 40000 14600 ...
data |>
  group_by(total_personal_income, marital_status)
## # A tibble: 500 × 8
## # Groups:   total_personal_income, marital_status [281]
##    census_year state_fips_code total_family_income   age sex    race_general   
##          <int> <chr>                         <int> <int> <chr>  <chr>          
##  1        2000 Florida                       14550    44 Male   Two major races
##  2        2000 Florida                       22800    20 Female White          
##  3        2000 Florida                           0    20 Male   Black          
##  4        2000 Florida                       23000     6 Female White          
##  5        2000 Florida                       48000    55 Male   White          
##  6        2000 Florida                       74000    43 Female White          
##  7        2000 Florida                       23000    60 Female White          
##  8        2000 Florida                       74000    47 Female White          
##  9        2000 Florida                       60000    54 Female Black          
## 10        2000 Florida                       14600    58 Female White          
## # ℹ 490 more rows
## # ℹ 2 more variables: marital_status <chr>, total_personal_income <int>
##Make a table that includes the 2 variables I'm focusing on for the linear model

lmdata <- table(data$total_personal_income, data$marital_status)

Regression Analysis

Firstly, I had to convert marital status using the factor function because it is a categorical variable with 3+ categories. Then, I created the linear regression model using my two chosen variables, marital status and total personal income. To call the model coefficient and intercept, I use the summary function on my new linear regression model.

Based on the p-value 0.953 being closer to 1 than 0 we can predict there is little to no statistical significance that personal income has to do with marital status. Since we’re using log odd probabilities for the model, based on the coefficients for total personal income all being extremely close to 0, we again can interpret that marital status and personal income have little to do with one another.

data$marital_status <- factor(data$marital_status)


lgmodel <- glm(marital_status ~ total_personal_income, data = data, family = binomial)
summary(lgmodel)
## 
## Call:
## glm(formula = marital_status ~ total_personal_income, family = binomial, 
##     data = data)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            2.238e+00  2.014e-01  11.114   <2e-16 ***
## total_personal_income -2.155e-07  3.626e-06  -0.059    0.953    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 249.55  on 391  degrees of freedom
## Residual deviance: 249.55  on 390  degrees of freedom
##   (108 observations deleted due to missingness)
## AIC: 253.55
## 
## Number of Fisher Scoring iterations: 5

##Diagnostics

predicted.classes <- predict(lgmodel, newdata = data)

matrixx <- table(
  Predicted = factor(predicted.classes, levels = c("Married/spouse present", "Never married/single", "Widowed")),
  Actual = factor(data$marital_status, levels = c("Married/spouse present", "Never married/single", "Widowed")))
matrixx <- na.omit(data)



roclg <- multiclass.roc(response = data$marital_status, predictor = predicted.classes)
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls > cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls > cases
## Setting direction: controls > cases
auc_val <- auc(roclg); auc_val
## Multi-class area under the curve: 0.6148

The AUC value area under the curve is 0.6148. This tells me that the model is weak at predicting marital status based off the one predictor I used being total personal income. While it is better than 50%, randomly guessing, the model is pretty weak at distinguishing someone’s personal income from their marital status from the cencus.

If you randomly pick one person from the census, the model has about a 61% chance of ranking someone based off their marital status and personal income.

##Conclusion

From the regression analysis I can conclude that we cannot determine someone’s total personal income is higher or lower based on their marital status. Therefore, the answer to my project question would be no, we cannot determine that someone’s marital status has anything to do with their individual income. The fit of the model is weak only at 61%.

In the future, or if I were to do this project over again, I would add in total family income as a second predictor for a stronger model. I also could have been more successful if I had changed the levels within the category marital status to numeric values like 1, 2, and 3 to get a numeric result for ROC. I would rather have done this model and made all the variables I used numeric.

References:

logistic.regression Rmd

Dataset: https://www.openintro.org/data/index.php?data=census

Week 10 zoom recordings on BB DATA 101