Week 11 Data Dive

adult_dataa <- read.csv("C:/Users/RAKESH REDDY/OneDrive/Desktop/adult_dataa.csv")

head(adult_dataa)

##   Age  workclass fnlwgt     education education.num      marital.status
## 1  25    Private 226802          11th             7       Never-married
## 2  38    Private  89814       HS-grad             9  Married-civ-spouse
## 3  28  Local-gov 336951    Assoc-acdm            12  Married-civ-spouse
## 4  44    Private 160323  Some-college            10  Married-civ-spouse
## 5  18          ? 103497  Some-college            10       Never-married
## 6  34    Private 198693          10th             6       Never-married
##           occupation   relationship   race     sex capital.gain capital.loss
## 1  Machine-op-inspct      Own-child  Black    Male            0            0
## 2    Farming-fishing        Husband  White    Male            0            0
## 3    Protective-serv        Husband  White    Male            0            0
## 4  Machine-op-inspct        Husband  Black    Male         7688            0
## 5                  ?      Own-child  White  Female            0            0
## 6      Other-service  Not-in-family  White    Male            0            0
##   hours.per.week native.country income income_binary
## 1             40  United-States  <=50K             0
## 2             50  United-States  <=50K             0
## 3             40  United-States   >50K             1
## 4             40  United-States   >50K             1
## 5             30  United-States  <=50K             0
## 6             30  United-States  <=50K             0

Loading required libraries

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ✔ readr     2.1.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Building a Generalized Linear Model

linear_model <- glm(income_binary ~ Age + education.num + hours.per.week, data = adult_dataa, family = binomial)

summary(linear_model)

## 
## Call:
## glm(formula = income_binary ~ Age + education.num + hours.per.week, 
##     family = binomial, data = adult_dataa)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -8.395773   0.154887  -54.21   <2e-16 ***
## Age             0.044578   0.001616   27.58   <2e-16 ***
## education.num   0.344672   0.009149   37.67   <2e-16 ***
## hours.per.week  0.041474   0.001751   23.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 17801  on 16280  degrees of freedom
## Residual deviance: 14481  on 16277  degrees of freedom
## AIC: 14489
## 
## Number of Fisher Scoring iterations: 5

Diagnosing the Model

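# QQ plot of the residuals stored in the model object
# (for a glm, linear_model$residuals holds the working residuals)
qqnorm(linear_model$residuals)
qqline(linear_model$residuals)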

A few points on the QQ plot deviate from the reference line at both ends, which suggests that some observations have unusually large residuals. These observations may be outliers, so we need to identify and examine them to determine whether they are valid data points or errors. Note that residuals from a logistic regression with a binary response are not expected to be normally distributed, so the QQ plot is only a rough guide here.
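
One way to identify such observations is to flag rows with large deviance residuals and inspect their Cook's distances. The sketch below is illustrative only: it runs on simulated data (`sim_data`, `sim_model`, and the generating coefficients are hypothetical stand-ins, not the adult dataset), and the cutoff of 2 is a common rule of thumb rather than a fixed standard.

```r
# Simulated stand-in data: column names mirror the report, values are made up.
set.seed(42)
n <- 500
sim_data <- data.frame(
  Age            = sample(18:70, n, replace = TRUE),
  education.num  = sample(1:16, n, replace = TRUE),
  hours.per.week = sample(10:80, n, replace = TRUE)
)

# Hypothetical coefficients, used only to generate a binary outcome.
eta <- -8.4 + 0.045 * sim_data$Age + 0.345 * sim_data$education.num +
  0.041 * sim_data$hours.per.week
sim_data$income_binary <- rbinom(n, size = 1, prob = plogis(eta))

sim_model <- glm(income_binary ~ Age + education.num + hours.per.week,
                 data = sim_data, family = binomial)

# Deviance residuals and Cook's distance for every observation.
dev_res <- residuals(sim_model, type = "deviance")
cooks   <- cooks.distance(sim_model)

# Flag rows whose absolute deviance residual exceeds 2 (assumed rule of thumb).
flagged <- which(abs(dev_res) > 2)
outlier_table <- data.frame(
  row     = flagged,
  dev_res = dev_res[flagged],
  cooks   = cooks[flagged]
)

# Largest residuals first, so the most extreme rows are inspected first.
outlier_table <- outlier_table[order(-abs(outlier_table$dev_res)), ]
head(outlier_table)
```

Applied to the report's model, the same calls would take `linear_model` in place of `sim_model`; the flagged rows can then be looked up in `adult_dataa` and checked by hand.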

# Residuals vs. fitted values plot
plot(linear_model, which = 1)

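The divergence of points from the line in the middle section of the plot indicates that the spread of the residuals changes with the fitted values in this range. In an ordinary linear model this pattern is called heteroscedasticity and would violate the assumption of constant residual variance (homoscedasticity). For a binomial GLM, however, the variance of the response depends on the fitted probability by construction, so a changing spread is expected and is not by itself evidence of a poor fit. The trend of the residuals around zero is the more informative feature of this plot.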
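
Because raw residuals from a binary outcome fall into two bands, it can be easier to read the trend after averaging residuals within bins of the fitted probability. The following is a minimal sketch of that idea on simulated data; `x`, `y`, `sim_model`, and the choice of ten bins are assumptions made for illustration, not part of the original analysis.

```r
# Simulated logistic data: a single hypothetical predictor x.
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 0.8 * x))
sim_model <- glm(y ~ x, family = binomial)

# Fitted probabilities and response residuals (observed minus fitted).
fitted_p <- fitted(sim_model)
raw_res  <- y - fitted_p

# Ten bins with roughly equal counts, based on deciles of the fitted values.
breaks <- quantile(fitted_p, probs = seq(0, 1, by = 0.1))
bin    <- cut(fitted_p, breaks = breaks, include.lowest = TRUE)

# Average fitted value and average residual within each bin.
binned <- data.frame(
  mean_fitted = as.numeric(tapply(fitted_p, bin, mean)),
  mean_resid  = as.numeric(tapply(raw_res, bin, mean)),
  n           = as.integer(table(bin))
)

# If the model is adequate, bin averages should scatter around zero.
plot(binned$mean_fitted, binned$mean_resid,
     xlab = "Mean fitted probability", ylab = "Mean residual")
abline(h = 0, lty = 2)
```

Bin averages that drift systematically above or below zero over some range of fitted values would point to a misspecified relationship in that range.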

Interpretation of coefficients

coef(linear_model)

##    (Intercept)            Age  education.num hours.per.week 
##    -8.39577268     0.04457756     0.34467190     0.04147397

Age: The coefficient for ‘Age’ is approximately 0.045; that is, for each one-year increase in age, the estimated log-odds of having an income greater than 50K increase by about 0.045, holding education.num and hours.per.week constant. In other words, older individuals have a higher estimated probability of earning more than 50K. This is a positive association between age and income.

education.num: The coefficient for ‘education.num’ is approximately 0.345; that is, for each one-unit increase in the ‘education.num’ variable, the estimated log-odds of having an income greater than 50K increase by approximately 0.345, holding the other predictors constant. This implies that individuals with higher education levels have a higher estimated probability of having a high income.

hours.per.week: The coefficient for ‘hours.per.week’ is approximately 0.041; that is, for each additional hour worked per week, the estimated log-odds of having an income greater than 50K increase by about 0.041, holding the other predictors constant. This suggests that working longer hours per week is associated with a higher estimated probability of having a high income.
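
Log-odds are hard to read directly, so it can help to exponentiate the coefficients into odds ratios and to convert one linear predictor into a probability. The sketch below copies the coefficients from the `coef(linear_model)` output above; the example person (`profile`) is hypothetical and chosen only for illustration.

```r
# Coefficients copied from the coef(linear_model) output in this report.
coefs <- c(
  "(Intercept)"    = -8.39577268,
  "Age"            =  0.04457756,
  "education.num"  =  0.34467190,
  "hours.per.week" =  0.04147397
)

# Odds ratio for a one-unit increase in each predictor (intercept dropped).
odds_ratios <- exp(coefs[-1])
print(round(odds_ratios, 3))

# Hypothetical person: 40 years old, education.num = 13, 45 hours per week.
# The leading 1 multiplies the intercept.
profile <- c(1, 40, 13, 45)

# Linear predictor (log-odds), then the inverse logit to get a probability.
log_odds <- sum(coefs * profile)
prob     <- plogis(log_odds)
print(round(prob, 3))
```

An odds ratio above 1 means the odds of earning more than 50K rise with the predictor. With the fitted model object available, `exp(coef(linear_model))` and `predict(linear_model, newdata = ..., type = "response")` would give the same quantities without copying numbers by hand.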
