adult_dataa <- read.csv("C:/Users/RAKESH REDDY/OneDrive/Desktop/adult_dataa.csv")
head(adult_dataa)
## Age workclass fnlwgt education education.num marital.status
## 1 25 Private 226802 11th 7 Never-married
## 2 38 Private 89814 HS-grad 9 Married-civ-spouse
## 3 28 Local-gov 336951 Assoc-acdm 12 Married-civ-spouse
## 4 44 Private 160323 Some-college 10 Married-civ-spouse
## 5 18 ? 103497 Some-college 10 Never-married
## 6 34 Private 198693 10th 6 Never-married
## occupation relationship race sex capital.gain capital.loss
## 1 Machine-op-inspct Own-child Black Male 0 0
## 2 Farming-fishing Husband White Male 0 0
## 3 Protective-serv Husband White Male 0 0
## 4 Machine-op-inspct Husband Black Male 7688 0
## 5 ? Own-child White Female 0 0
## 6 Other-service Not-in-family White Male 0 0
## hours.per.week native.country income income_binary
## 1 40 United-States <=50K 0
## 2 50 United-States <=50K 0
## 3 40 United-States >50K 1
## 4 40 United-States >50K 1
## 5 30 United-States <=50K 0
## 6 30 United-States <=50K 0
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
binary_variable <- "income_binary"
summary(adult_dataa)
## Age workclass fnlwgt education
## Min. :17.00 Length:16281 Min. : 13492 Length:16281
## 1st Qu.:28.00 Class :character 1st Qu.: 116736 Class :character
## Median :37.00 Mode :character Median : 177831 Mode :character
## Mean :38.77 Mean : 189436
## 3rd Qu.:48.00 3rd Qu.: 238384
## Max. :90.00 Max. :1490400
## education.num marital.status occupation relationship
## Min. : 1.00 Length:16281 Length:16281 Length:16281
## 1st Qu.: 9.00 Class :character Class :character Class :character
## Median :10.00 Mode :character Mode :character Mode :character
## Mean :10.07
## 3rd Qu.:12.00
## Max. :16.00
## race sex capital.gain capital.loss
## Length:16281 Length:16281 Min. : 0 Min. : 0.0
## Class :character Class :character 1st Qu.: 0 1st Qu.: 0.0
## Mode :character Mode :character Median : 0 Median : 0.0
## Mean : 1082 Mean : 87.9
## 3rd Qu.: 0 3rd Qu.: 0.0
## Max. :99999 Max. :3770.0
## hours.per.week native.country income income_binary
## Min. : 1.00 Length:16281 Length:16281 Min. :0.0000
## 1st Qu.:40.00 Class :character Class :character 1st Qu.:0.0000
## Median :40.00 Mode :character Mode :character Median :0.0000
## Mean :40.39 Mean :0.2362
## 3rd Qu.:45.00 3rd Qu.:0.0000
## Max. :99.00 Max. :1.0000
logistic_model <- glm(income_binary ~ Age + education.num + hours.per.week, data = adult_dataa, family = binomial)
coef(logistic_model)
## (Intercept) Age education.num hours.per.week
## -8.39577268 0.04457756 0.34467190 0.04147397
summary(logistic_model)
##
## Call:
## glm(formula = income_binary ~ Age + education.num + hours.per.week,
## family = binomial, data = adult_dataa)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.395773 0.154887 -54.21 <2e-16 ***
## Age 0.044578 0.001616 27.58 <2e-16 ***
## education.num 0.344672 0.009149 37.67 <2e-16 ***
## hours.per.week 0.041474 0.001751 23.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 17801 on 16280 degrees of freedom
## Residual deviance: 14481 on 16277 degrees of freedom
## AIC: 14489
##
## Number of Fisher Scoring iterations: 5
Intercept: The intercept (about -8.40) is the estimated log-odds of having an income greater than 50K when Age, education.num, and hours.per.week are all 0. No individual in the data has these values (the minimum Age is 17), so it is a baseline for the model rather than a quantity to interpret on its own.
Age: The coefficient for ‘Age’ is approximately 0.045, i.e., for each one-year increase in age, the estimated log-odds of having an income greater than 50K increase by about 0.045, holding ‘education.num’ and ‘hours.per.week’ constant. In other words, older individuals have a higher estimated probability of an income above 50K. This is a positive association between age and income.
education.num: The coefficient for ‘education.num’ is approximately 0.345, i.e., for each one-unit increase in ‘education.num’, the estimated log-odds of having an income greater than 50K increase by approximately 0.345, holding the other predictors constant. This implies that individuals with higher education levels have a higher estimated probability of a high income.
hours.per.week: The coefficient for ‘hours.per.week’ is approximately 0.041, i.e., for each additional hour worked per week, the estimated log-odds of having an income greater than 50K increase by about 0.041, holding the other predictors constant. Working longer hours per week is therefore associated with a higher estimated probability of a high income.
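Log-odds are hard to read directly, so the coefficients above can be turned into odds ratios and, for a given individual, into a predicted probability. The following is a minimal sketch that only reuses the coefficient values printed by coef(logistic_model); the individual described by `person` is hypothetical and chosen purely for illustration.

```r
# Coefficients copied from the coef(logistic_model) output above.
coefs <- c(
  "(Intercept)"  = -8.39577268,
  Age            =  0.04457756,
  education.num  =  0.34467190,
  hours.per.week =  0.04147397
)

# Odds ratios: multiplicative change in the odds of income > 50K
# per one-unit increase in each predictor (intercept dropped).
odds_ratios <- exp(coefs[-1])
print(round(odds_ratios, 3))

# Hypothetical individual (illustrative values, not taken from the data).
person <- c("(Intercept)" = 1, Age = 40, education.num = 13, hours.per.week = 40)

# Linear predictor (log-odds), then the inverse logit to get a probability.
log_odds <- sum(coefs * person[names(coefs)])
prob_high_income <- plogis(log_odds)
print(prob_high_income)
```

With the fitted model object available, the same probability could instead be obtained from predict(logistic_model, newdata = ..., type = "response"), which avoids copying coefficients by hand.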
conf_int <- confint(logistic_model, "Age")
## Waiting for profiling to be done...
conf_int
## 2.5 % 97.5 %
## 0.04141851 0.04775428
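This means that for ‘Age,’ we are 95% confident that the true effect of one additional year of age on the log-odds of having an income greater than 50K falls between approximately 0.0414 and 0.0478, holding the other predictors constant. The interval is computed by profile likelihood (hence the "Waiting for profiling to be done..." message) and does not include 0, which is consistent with the very small p-value for ‘Age’ in the model summary.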
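As a rough cross-check, the profile-likelihood interval can be compared with a simple Wald interval (estimate plus or minus a normal quantile times the standard error), and both can be moved to the odds-ratio scale by exponentiating. The sketch below only uses the estimate, standard error, and interval printed above; the variable names are illustrative.

```r
# Values copied from the summary(logistic_model) and confint() output above.
estimate   <- 0.044578
std_error  <- 0.001616
profile_ci <- c(lower = 0.04141851, upper = 0.04775428)

# Wald interval: estimate +/- z * SE on the log-odds scale.
z <- qnorm(0.975)
wald_ci <- c(lower = estimate - z * std_error,
             upper = estimate + z * std_error)

# Exponentiate to express the interval as an odds ratio per year of age.
or_ci <- exp(profile_ci)

print(rbind(profile = profile_ci, wald = wald_ci))
print(round(or_ci, 4))
```

With a sample of this size the two intervals are expected to be very close; larger differences between them tend to appear in small samples or when an estimate is near a boundary.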
model <- lm(Age ~ hours.per.week,
            data = filter(adult_dataa, income_binary == 1))
rsquared <- summary(model)$r.squared
adult_dataa |>
  filter(income_binary == 1) |>
  ggplot(mapping = aes(x = Age,
                       y = hours.per.week)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'red', linetype = 'dashed',
              se = FALSE) +
  geom_smooth(color = 'purple', se = FALSE) +
  labs(title = "Age vs Hours per week") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The code above fits a linear regression of ‘Age’ on ‘hours.per.week’ for individuals with an income greater than 50K (income_binary == 1). It then stores the R-squared value in rsquared to measure how well the model explains the variance in ‘Age.’ The value is computed but not printed.
The subsequent code creates a scatter plot of the same subset, with ‘Age’ on the x-axis and ‘hours.per.week’ on the y-axis. geom_point() is called without a color argument, so the points are drawn in the default black. The plot overlays two trendlines: a red dashed line and a purple line. The red dashed line is a least-squares fit of ‘hours.per.week’ on ‘Age,’ which is the reverse direction of the lm() call, although in a simple linear regression the R-squared is the same in both directions. The purple line comes from geom_smooth() with no method specified; as the message in the output shows, it defaults to a GAM (method = 'gam'), so it traces a flexible, possibly non-linear trend.
The red line is straight by construction and has a positive intercept. The direction of the relationship is given by the sign of its slope rather than its intercept; the plot is read here as showing a positive linear relationship between the variables.
The plot helps visualize the relationship between ‘Age’ and ‘hours.per.week’ and assess whether there is a linear association between these variables: the closer the purple line follows the red one, the more adequate a linear form is.
The R-squared value indicates how well the linear model fits the data. A higher R-squared means that ‘hours.per.week’ explains a larger proportion of the variance in ‘Age’; a value close to 1 suggests a strong linear relationship between the two variables.
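The lm() call above uses ‘Age’ as the response, while the plot draws ‘hours.per.week’ against ‘Age.’ For a regression with a single predictor this does not change the R-squared, because it equals the squared correlation between the two variables. The sketch below illustrates this on simulated toy data (the data frame `toy` and its columns are assumptions made for the example, not the adult data).

```r
# Simulated toy data; NOT the adult data set.
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- 2 + 0.5 * toy$x + rnorm(200)

# R-squared with y as the response, then with x as the response.
r2_y_on_x <- summary(lm(y ~ x, data = toy))$r.squared
r2_x_on_y <- summary(lm(x ~ y, data = toy))$r.squared

# Both equal the squared Pearson correlation.
r2_from_cor <- cor(toy$x, toy$y)^2

print(c(y_on_x = r2_y_on_x, x_on_y = r2_x_on_y, cor_squared = r2_from_cor))
```

The slopes and intercepts of the two fits do differ, so the red line in the plot is not the same line as the one stored in model, even though the R-squared is shared.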
model <- lm(capital.gain ~ hours.per.week,
            data = filter(adult_dataa, income_binary == 1))
rsquared <- summary(model)$r.squared
adult_dataa |>
  filter(income_binary == 1) |>
  ggplot(mapping = aes(x = capital.gain,
                       y = hours.per.week)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'red', linetype = 'dashed',
              se = FALSE) +
  geom_smooth(color = 'purple', se = FALSE) +
  labs(title = "Capital gain vs Hours per week") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The code analyzes the relationship between ‘capital.gain’ and ‘hours.per.week’ for individuals with an income greater than 50K (income_binary == 1). It does this by fitting a linear regression model, displaying a scatter plot of the data points, and overlaying two fitted lines: one in dashed red (a linear fit) and one in purple (a GAM smooth, as the output message shows, which can follow a non-linear relationship). The R-squared value is calculated, though not printed, to assess how well ‘hours.per.week’ explains the variation in ‘capital.gain.’
The plot allows you to visually assess the relationship between ‘capital.gain’ and ‘hours.per.week’ and determine whether a linear regression model is a suitable representation of this relationship. The R-squared value gives the proportion of variance in ‘capital.gain’ that is explained by ‘hours.per.week.’ If the R-squared is high, it indicates that ‘hours.per.week’ is a good linear predictor of ‘capital.gain.’
The two lines coincide almost perfectly, which indicates that the relationship is close to linear. The line has a positive slope and a positive intercept. Therefore, the relationship between the variables is positive.
Since the relationships in both plots are almost linear, there is no need for a transformation.
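The visual comparison of the red and purple lines can be complemented by a numerical check. One possible approach, sketched below on simulated toy data (the data frame `toy` is an assumption for the example, not the adult data), is to compare a straight-line fit with a slightly more flexible quadratic fit using an F-test; this is a swapped-in, simpler stand-in for the GAM smooth drawn in the plots.

```r
# Simulated toy data with a truly linear relationship; NOT the adult data set.
set.seed(42)
toy <- data.frame(x = runif(300, min = 0, max = 10))
toy$y <- 1 + 0.8 * toy$x + rnorm(300)

# Straight-line fit versus a quadratic fit (nested models).
fit_linear <- lm(y ~ x, data = toy)
fit_quad   <- lm(y ~ poly(x, 2), data = toy)

# F-test: does the extra curvature term significantly reduce residual error?
cmp <- anova(fit_linear, fit_quad)
p_value <- cmp[["Pr(>F)"]][2]

print(cmp)
print(p_value)
```

A large p-value would be consistent with the conclusion that no transformation is needed, while a small one would suggest curvature worth investigating. The same pattern could be applied to the filtered adult data by replacing `toy`, `x`, and `y` with the relevant data frame and columns.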