This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this:
adult_dataa <- read.csv("C:/Users/RAKESH REDDY/OneDrive/Desktop/adult_dataa.csv")
head(adult_dataa)
##   Age workclass fnlwgt    education education.num     marital.status
## 1  25   Private 226802         11th             7      Never-married
## 2  38   Private  89814      HS-grad             9 Married-civ-spouse
## 3  28 Local-gov 336951   Assoc-acdm            12 Married-civ-spouse
## 4  44   Private 160323 Some-college            10 Married-civ-spouse
## 5  18         ? 103497 Some-college            10      Never-married
## 6  34   Private 198693         10th             6      Never-married
##          occupation  relationship  race    sex capital.gain capital.loss
## 1 Machine-op-inspct     Own-child Black   Male            0            0
## 2   Farming-fishing       Husband White   Male            0            0
## 3   Protective-serv       Husband White   Male            0            0
## 4 Machine-op-inspct       Husband Black   Male         7688            0
## 5                 ?     Own-child White Female            0            0
## 6     Other-service Not-in-family White   Male            0            0
##   hours.per.week native.country income income_binary
## 1             40  United-States  <=50K             0
## 2             50  United-States  <=50K             0
## 3             40  United-States   >50K             1
## 4             40  United-States   >50K             1
## 5             30  United-States  <=50K             0
## 6             30  United-States  <=50K             0
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
linear_model <- glm(income_binary ~ Age + education.num + hours.per.week, data = adult_dataa, family = binomial)
summary(linear_model)
##
## Call:
## glm(formula = income_binary ~ Age + education.num + hours.per.week,
##     family = binomial, data = adult_dataa)
##
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)    -8.395773   0.154887  -54.21   <2e-16 ***
## Age             0.044578   0.001616   27.58   <2e-16 ***
## education.num   0.344672   0.009149   37.67   <2e-16 ***
## hours.per.week  0.041474   0.001751   23.69   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 17801  on 16280  degrees of freedom
## Residual deviance: 14481  on 16277  degrees of freedom
## AIC: 14489
##
## Number of Fisher Scoring iterations: 5
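# QQ plot of the deviance residuals
# (linear_model$residuals holds the working residuals of the glm fit)
dev_res <- residuals(linear_model, type = "deviance")
qqnorm(dev_res)
qqline(dev_res)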
A few points at both ends of the QQ plot deviate from the reference line, which suggests that the data contain outliers: observations with extreme values that produce unusually large residuals. Because residuals from a logistic regression are not expected to be exactly normal, this plot is only a rough screen. We therefore need to identify and examine these observations to determine whether they are valid data points or errors.
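One way to follow up on the outliers mentioned above is to flag observations with large standardized deviance residuals or large Cook's distance. The sketch below is only illustrative: it fits the same formula to simulated stand-in data (not the adult dataset), and the cutoffs `2` and `4 / n` are common rules of thumb assumed here, not values taken from this analysis.

```r
# Simulated stand-in data (NOT the adult dataset); column names mirror the report.
set.seed(1)
n <- 500
sim <- data.frame(
  Age            = round(runif(n, 18, 70)),
  education.num  = sample(1:16, n, replace = TRUE),
  hours.per.week = round(runif(n, 10, 70))
)
# Assumed "true" coefficients, chosen only to give a plausible outcome rate
eta <- -8.4 + 0.045 * sim$Age + 0.345 * sim$education.num +
  0.041 * sim$hours.per.week
sim$income_binary <- rbinom(n, 1, plogis(eta))

fit <- glm(income_binary ~ Age + education.num + hours.per.week,
           data = sim, family = binomial)

# Standardized deviance residuals and Cook's distance per observation
std_res <- rstandard(fit)
cooks   <- cooks.distance(fit)

# Flag rows exceeding either rule-of-thumb cutoff (assumed thresholds)
flagged <- which(abs(std_res) > 2 | cooks > 4 / n)
head(sim[flagged, ], 10)
```

With the real data, the same calls could be applied to `linear_model` and `adult_dataa`, and the flagged rows inspected by hand before deciding whether to keep them.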
# Residual plots
plot(linear_model, which = 1)
The points diverge from the reference line in the middle section of the plot, which suggests heteroscedasticity in that range of fitted values. Heteroscedasticity means that the variance of the residuals changes with the fitted values; here, the spread of the residuals is not constant across the middle range, which would violate an assumption of homoscedasticity. Note, however, that in a binomial model the variance of the response depends on the fitted probability by construction, so this pattern should be interpreted with caution.
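Since raw residual plots are hard to read for a binary outcome, a binned residual check can complement the plot above: observations are grouped by fitted probability, and the average residual in each group should sit near zero if the model fits well. The following is a minimal sketch on simulated stand-in data (not the adult dataset); the choice of 10 bins and the names `sim`, `fit`, and `binned` are assumptions made for illustration.

```r
# Simulated stand-in data (NOT the adult dataset); column names mirror the report.
set.seed(2)
n <- 1000
sim <- data.frame(
  Age            = round(runif(n, 18, 70)),
  education.num  = sample(1:16, n, replace = TRUE),
  hours.per.week = round(runif(n, 10, 70))
)
eta <- -8.4 + 0.045 * sim$Age + 0.345 * sim$education.num +
  0.041 * sim$hours.per.week
sim$income_binary <- rbinom(n, 1, plogis(eta))

fit <- glm(income_binary ~ Age + education.num + hours.per.week,
           data = sim, family = binomial)

p_hat   <- fitted(fit)                        # fitted probabilities
raw_res <- residuals(fit, type = "response")  # observed minus fitted

# Ten equal-sized bins ordered by fitted probability
bins <- cut(rank(p_hat, ties.method = "first"), breaks = 10, labels = FALSE)

binned <- data.frame(
  mean_fitted = as.numeric(tapply(p_hat, bins, mean)),
  mean_resid  = as.numeric(tapply(raw_res, bins, mean)),
  n           = as.numeric(table(bins))
)

plot(binned$mean_fitted, binned$mean_resid,
     xlab = "Mean fitted probability", ylab = "Mean residual")
abline(h = 0, lty = 2)
```

Bins whose mean residual sits far from the dashed line would point to ranges of fitted values where the model systematically over- or under-predicts.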
coef(linear_model)
##    (Intercept)            Age  education.num hours.per.week
##    -8.39577268     0.04457756     0.34467190     0.04147397
Age: The coefficient for ‘Age’ is approximately 0.045; that is, for each one-year increase in age, the estimated log-odds of having an income greater than 50K increase by about 0.045, holding the other predictors constant. In other words, older individuals have a higher estimated probability of a high income, so the relationship between age and income is positive.
education.num: The coefficient for ‘education.num’ is approximately 0.345; that is, for each one-unit increase in ‘education.num’, the estimated log-odds of having an income greater than 50K increase by about 0.345, holding the other predictors constant. This implies that individuals with higher education levels have a higher estimated probability of a high income.
hours.per.week: The coefficient for ‘hours.per.week’ is approximately 0.041; that is, for each additional hour worked per week, the estimated log-odds of having an income greater than 50K increase by about 0.041, holding the other predictors constant. This suggests that working longer hours per week is associated with a higher probability of a high income.
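Log-odds are easier to read once converted to odds ratios or probabilities. The sketch below exponentiates the coefficients reported above and computes a predicted probability for one hypothetical individual; the values 40, 10, and 40 are chosen purely for illustration and do not come from the data.

```r
# Coefficients copied from the coef(linear_model) output above
b <- c("(Intercept)"  = -8.39577268,
       Age            = 0.04457756,
       education.num  = 0.34467190,
       hours.per.week = 0.04147397)

# Odds ratios: multiplicative change in the odds per one-unit increase
odds_ratios <- exp(b[-1])
round(odds_ratios, 3)

# Hypothetical individual (illustrative values only): the leading 1 multiplies
# the intercept, followed by age, education.num, and hours.per.week
x <- c(1, Age = 40, education.num = 10, hours.per.week = 40)

# Inverse logit of the linear predictor gives the predicted probability
p <- plogis(sum(b * x))
p
```

The odds ratios come out at roughly 1.05 for Age, 1.41 for education.num, and 1.04 for hours.per.week, so one extra unit of education.num multiplies the odds of a high income by about 1.41, while the hypothetical individual has a predicted probability of about 0.18.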