Week 10 Data Dive

Loading data

adult_dataa <- read.csv("C:/Users/RAKESH REDDY/OneDrive/Desktop/adult_dataa.csv")

head(adult_dataa)

##   Age  workclass fnlwgt     education education.num      marital.status
## 1  25    Private 226802          11th             7       Never-married
## 2  38    Private  89814       HS-grad             9  Married-civ-spouse
## 3  28  Local-gov 336951    Assoc-acdm            12  Married-civ-spouse
## 4  44    Private 160323  Some-college            10  Married-civ-spouse
## 5  18          ? 103497  Some-college            10       Never-married
## 6  34    Private 198693          10th             6       Never-married
##           occupation   relationship   race     sex capital.gain capital.loss
## 1  Machine-op-inspct      Own-child  Black    Male            0            0
## 2    Farming-fishing        Husband  White    Male            0            0
## 3    Protective-serv        Husband  White    Male            0            0
## 4  Machine-op-inspct        Husband  Black    Male         7688            0
## 5                  ?      Own-child  White  Female            0            0
## 6      Other-service  Not-in-family  White    Male            0            0
##   hours.per.week native.country income income_binary
## 1             40  United-States  <=50K             0
## 2             50  United-States  <=50K             0
## 3             40  United-States   >50K             1
## 4             40  United-States   >50K             1
## 5             30  United-States  <=50K             0
## 6             30  United-States  <=50K             0

Loading required libraries

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ✔ readr     2.1.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Choosing Binary Column

binary_variable <- "income_binary"

summary(adult_dataa)

##       Age         workclass             fnlwgt         education        
##  Min.   :17.00   Length:16281       Min.   :  13492   Length:16281      
##  1st Qu.:28.00   Class :character   1st Qu.: 116736   Class :character  
##  Median :37.00   Mode  :character   Median : 177831   Mode  :character  
##  Mean   :38.77                      Mean   : 189436                     
##  3rd Qu.:48.00                      3rd Qu.: 238384                     
##  Max.   :90.00                      Max.   :1490400                     
##  education.num   marital.status      occupation        relationship      
##  Min.   : 1.00   Length:16281       Length:16281       Length:16281      
##  1st Qu.: 9.00   Class :character   Class :character   Class :character  
##  Median :10.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :10.07                                                           
##  3rd Qu.:12.00                                                           
##  Max.   :16.00                                                           
##      race               sex             capital.gain    capital.loss   
##  Length:16281       Length:16281       Min.   :    0   Min.   :   0.0  
##  Class :character   Class :character   1st Qu.:    0   1st Qu.:   0.0  
##  Mode  :character   Mode  :character   Median :    0   Median :   0.0  
##                                        Mean   : 1082   Mean   :  87.9  
##                                        3rd Qu.:    0   3rd Qu.:   0.0  
##                                        Max.   :99999   Max.   :3770.0  
##  hours.per.week  native.country        income          income_binary   
##  Min.   : 1.00   Length:16281       Length:16281       Min.   :0.0000  
##  1st Qu.:40.00   Class :character   Class :character   1st Qu.:0.0000  
##  Median :40.00   Mode  :character   Mode  :character   Median :0.0000  
##  Mean   :40.39                                         Mean   :0.2362  
##  3rd Qu.:45.00                                         3rd Qu.:0.0000  
##  Max.   :99.00                                         Max.   :1.0000

Logistic Regression Model

logistic_model <- glm(income_binary ~ Age + education.num + hours.per.week, data = adult_dataa, family = binomial)

coef(logistic_model)

##    (Intercept)            Age  education.num hours.per.week 
##    -8.39577268     0.04457756     0.34467190     0.04147397

summary(logistic_model)

## 
## Call:
## glm(formula = income_binary ~ Age + education.num + hours.per.week, 
##     family = binomial, data = adult_dataa)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -8.395773   0.154887  -54.21   <2e-16 ***
## Age             0.044578   0.001616   27.58   <2e-16 ***
## education.num   0.344672   0.009149   37.67   <2e-16 ***
## hours.per.week  0.041474   0.001751   23.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 17801  on 16280  degrees of freedom
## Residual deviance: 14481  on 16277  degrees of freedom
## AIC: 14489
## 
## Number of Fisher Scoring iterations: 5

Interpretation of coefficients

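Intercept: The estimate of about -8.40 is the log-odds of having an income greater than 50K when all predictors are 0 (Age = 0, education.num = 0, hours.per.week = 0). No such individual exists in the data (the minimum Age is 17), so the intercept anchors the model rather than describing a realistic case.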

Age: The coefficient for ‘Age’ is approximately 0.045, i.e., for each one-year increase in age, the estimated log-odds of having an income greater than 50K increase by about 0.045, holding ‘education.num’ and ‘hours.per.week’ constant. On the odds scale this is a factor of exp(0.045) ≈ 1.046, or roughly a 4.6% increase in the odds per additional year. Older individuals therefore have a higher estimated probability of a high income; the association between age and income is positive.

education.num: The coefficient for ‘education.num’ is approximately 0.345, i.e., for each one-unit increase in the ‘education.num’ variable, the estimated log-odds of having an income greater than 50K increase by approximately 0.345, holding the other predictors constant. This corresponds to an odds ratio of exp(0.345) ≈ 1.41, or about a 41% increase in the odds per unit, so individuals with higher education levels have a higher estimated probability of a high income.

hours.per.week: The coefficient for ‘hours.per.week’ is approximately 0.041, i.e., for each additional hour worked per week, the estimated log-odds of having an income greater than 50K increase by about 0.041, holding the other predictors constant. This corresponds to an odds ratio of exp(0.041) ≈ 1.042, or about a 4.2% increase in the odds per extra hour, so working longer hours per week is associated with a higher probability of having a high income.
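Log-odds are hard to read directly, so it can help to exponentiate the coefficients into odds ratios and to convert a linear predictor into a probability. The sketch below hard-codes the estimates printed by coef(logistic_model) above. The example person (Age 40, education.num 10, hours.per.week 40) is a made-up illustration, not a row from the data.

```r
# Coefficients copied from the coef(logistic_model) output above.
coefs <- c("(Intercept)"    = -8.39577268,
           "Age"            =  0.04457756,
           "education.num"  =  0.34467190,
           "hours.per.week" =  0.04147397)

# Odds ratios: multiplicative change in the odds of income > 50K
# for a one-unit increase in each predictor, others held fixed.
odds_ratios <- exp(coefs[-1])
print(round(odds_ratios, 3))

# Hypothetical person (illustrative values only, not taken from the data).
person <- c("(Intercept)" = 1, "Age" = 40,
            "education.num" = 10, "hours.per.week" = 40)

# Linear predictor on the log-odds scale, then the inverse logit.
log_odds <- sum(coefs * person[names(coefs)])
prob <- plogis(log_odds)
print(round(prob, 3))
```

With these inputs the estimated probability comes out at roughly 0.18, which shows how the large negative intercept is offset by realistic predictor values.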

Confidence Interval

conf_int <- confint(logistic_model, "Age")

## Waiting for profiling to be done...

conf_int

##      2.5 %     97.5 % 
## 0.04141851 0.04775428

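For ‘Age,’ we are 95% confident that the true effect of one additional year of age on the log-odds of having an income greater than 50K lies between approximately 0.0414 and 0.0478, holding the other predictors constant. The interval excludes 0, which agrees with the very small p-value for ‘Age’ in the model summary. The ‘Waiting for profiling to be done...’ message indicates that confint() computed a profile-likelihood interval.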
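As a cross-check, a Wald interval can be rebuilt from the estimate and standard error in the summary table, and the reported interval can be exponentiated to the odds-ratio scale. The sketch below hard-codes the printed values for ‘Age.’ The Wald interval is a normal approximation, not what confint() computed here, so small differences from the reported interval are expected.

```r
# Estimate and standard error for Age, copied from summary(logistic_model).
est <- 0.044578
se  <- 0.001616

# Wald 95% interval on the log-odds scale (normal approximation).
z <- qnorm(0.975)
wald_ci <- c(lower = est - z * se, upper = est + z * se)
print(round(wald_ci, 5))

# Profile-likelihood interval reported by confint() above.
profile_ci <- c(lower = 0.04141851, upper = 0.04775428)

# Exponentiate to get the interval for the odds ratio per extra year of age.
or_ci <- exp(profile_ci)
print(round(or_ci, 3))
```

With a sample this large the two intervals nearly coincide. On the odds scale, the interval runs from about 1.042 to about 1.049 per additional year of age.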

Transformation for Explanatory Variables

model <- lm(Age ~ hours.per.week,
            filter(adult_dataa, income_binary == 1))

rsquared <- summary(model)$r.squared

adult_dataa |> 
  filter(income_binary == 1 ) |>
  ggplot(mapping = aes(x = Age, 
                       y = hours.per.week)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'red', linetype = 'dashed', 
              se = FALSE) +
  geom_smooth(color ='purple', se = FALSE) +
  labs(title = "Age vs Hours per week ")+
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

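The above code fits a linear regression of ‘Age’ on ‘hours.per.week’ for individuals with an income greater than 50K (income_binary == 1) and stores the model's R-squared in rsquared, although the value is never printed. Note that the plot puts ‘Age’ on the x-axis and ‘hours.per.week’ on the y-axis, so its red line is a fit of ‘hours.per.week’ on ‘Age,’ the reverse of the stored model.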

The subsequent code creates a scatter plot of the same subset, with the points in ggplot2's default black, and overlays two trend lines. The red dashed line is the least-squares straight line drawn by geom_smooth(method = 'lm'). The purple line is geom_smooth()'s default smoother, which the console message shows to be a GAM (method = 'gam'). It is free to bend, so comparing it with the red line shows where the relationship departs from linearity.

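The red line is straight by construction, so it cannot by itself show that the relationship is linear; that has to be judged from how closely the purple smoother follows it. The fitted line has a positive intercept, and the relationship it describes appears to be positive, although it is the slope rather than the intercept that determines the direction.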

The plot helps visualize the relationship between ‘Age’ and ‘hours.per.week’ and assess whether there is a linear association between these variables.

The R-squared value will give you an idea of how well the linear model fits the data. A higher R-squared indicates that ‘hours.per.week’ explains a larger proportion of the variance in ‘Age.’ If the R-squared is close to 1, it suggests a strong linear relationship between the two variables.
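The model above regresses ‘Age’ on ‘hours.per.week’ while the plot puts ‘Age’ on the x-axis. In a one-predictor linear regression the R-squared equals the squared correlation, so it does not depend on which variable is treated as the response. The sketch below checks this on simulated data; x and y are invented stand-ins, not columns of adult_dataa.

```r
# Simulated stand-in data (hypothetical; not the adult dataset).
set.seed(42)
x <- rnorm(200, mean = 40, sd = 10)
y <- 30 + 0.2 * x + rnorm(200, sd = 8)

# Fit the regression in both directions.
r2_y_on_x <- summary(lm(y ~ x))$r.squared
r2_x_on_y <- summary(lm(x ~ y))$r.squared

# Both equal the squared Pearson correlation.
r2_cor <- cor(x, y)^2
print(c(r2_y_on_x, r2_x_on_y, r2_cor))
```

The slopes of the two fits do differ, so only the R-squared, not the fitted line, carries over between the stored model and the plotted one.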

model <- lm(capital.gain ~ hours.per.week,
            filter(adult_dataa, income_binary == 1))

rsquared <- summary(model)$r.squared

adult_dataa |> 
  filter(income_binary == 1 ) |>
  ggplot(mapping = aes(x = capital.gain, 
                       y = hours.per.week)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'red', linetype = 'dashed', 
              se = FALSE) +
  geom_smooth(color ='purple', se = FALSE) +
  labs(title = "Capital gain vs Hours per week ")+
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

The code analyzes the relationship between ‘capital.gain’ and ‘hours.per.week’ for individuals with an income greater than 50K (income_binary == 1). It does this by fitting a linear regression model, displaying a scatter plot of the data points, and overlaying two trend lines: one in red (a straight-line fit) and one in purple (a smoother that can follow a non-linear relationship). The R-squared value is calculated to assess how well ‘hours.per.week’ explains the variation in ‘capital.gain,’ but it is stored in rsquared and never printed. Note that the stored model has ‘capital.gain’ as the response, whereas the plot puts ‘capital.gain’ on the x-axis, so the red line is a fit of ‘hours.per.week’ on ‘capital.gain.’

The plot allows you to visually assess the relationship between ‘capital.gain’ and ‘hours.per.week’ and determine whether a linear regression model is a suitable representation of this relationship. The R-squared value provides information about the proportion of variance in ‘capital.gain’ that is explained by ‘hours.per.week.’ If the R-squared is high, it indicates that ‘hours.per.week’ is a good predictor of ‘capital.gain.’

The two lines coincide almost perfectly, which indicates that the relationship is close to linear over the observed range. The fitted line has a positive slope and a positive intercept, so ‘hours.per.week’ tends to increase with ‘capital.gain.’

Since both relationships are almost linear, there is no need to transform the explanatory variables.
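The visual comparison of the two trend lines can be supplemented with a numeric check. One common option is to add a quadratic term and test whether it improves on the straight-line fit. It is shown here as a sketch on simulated data rather than on adult_dataa; the variable names and the choice of a quadratic term are assumptions made for illustration.

```r
# Simulated data with a truly linear relationship (hypothetical stand-in).
set.seed(1)
x <- runif(300, min = 20, max = 70)
y <- 35 + 0.15 * x + rnorm(300, sd = 6)

# Straight-line fit versus a fit that also allows curvature.
fit_linear    <- lm(y ~ x)
fit_quadratic <- lm(y ~ x + I(x^2))

# F-test for the extra quadratic term; a large p-value gives no evidence
# that the curved model fits better, i.e. no sign a transformation is needed.
comparison <- anova(fit_linear, fit_quadratic)
print(comparison)
p_value <- comparison$`Pr(>F)`[2]
```

Applied to the real data, the same comparison would use the plotted variables in place of x and y, restricted to the income_binary == 1 subset.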