Final R PROJECT

RESEARCH QUESTIONS

How do age, relationship, sex, marital status and education of a person influence their income range? What improvement in their lifestyle would increase their income range?

LOADING LIBRARIES

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ✔ readr     2.1.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

IMPORTING DATASET

adult_income_data <- read.csv("C:/Users/RAKESH REDDY/OneDrive/Desktop/adult_dataa.csv")

head(adult_income_data)

##   Age  workclass fnlwgt     education education.num      marital.status
## 1  25    Private 226802          11th             7       Never-married
## 2  38    Private  89814       HS-grad             9  Married-civ-spouse
## 3  28  Local-gov 336951    Assoc-acdm            12  Married-civ-spouse
## 4  44    Private 160323  Some-college            10  Married-civ-spouse
## 5  18          ? 103497  Some-college            10       Never-married
## 6  34    Private 198693          10th             6       Never-married
##           occupation   relationship   race     sex capital.gain capital.loss
## 1  Machine-op-inspct      Own-child  Black    Male            0            0
## 2    Farming-fishing        Husband  White    Male            0            0
## 3    Protective-serv        Husband  White    Male            0            0
## 4  Machine-op-inspct        Husband  Black    Male         7688            0
## 5                  ?      Own-child  White  Female            0            0
## 6      Other-service  Not-in-family  White    Male            0            0
##   hours.per.week native.country income income_binary
## 1             40  United-States  <=50K             0
## 2             50  United-States  <=50K             0
## 3             40  United-States   >50K             1
## 4             40  United-States   >50K             1
## 5             30  United-States  <=50K             0
## 6             30  United-States  <=50K             0

EDA

# Calculate percentages
percentage_data <- adult_income_data %>%
  group_by(income) %>%
  summarise(count = n()) %>%
  mutate(percentage = count / sum(count) * 100)

# Create a pie chart with percentages
ggplot(percentage_data, aes(x = "", y = percentage, fill = income)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar(theta = "y") +
  labs(title = "Distribution of Income", fill = "Income") +
  scale_fill_manual(values = c("lightgreen", "blue"), name = "Income", labels = c("<=50K", ">50K")) +
  theme_void() +  # Remove unnecessary elements
  geom_text(aes(label = paste0(round(percentage), "%")), position = position_stack(vjust = 0.5))

From the pie chart, we can see that the distribution of income in the dataset is relatively unequal. The majority of people (76%) earn less than or equal to $50,000 per year, while the remaining 24% earn more than $50,000 per year. It is evident from the above plot that, 76% of the people earn less than 50k. SO, here we analyse a couple of factors which might influence the income range of the person.

Education vs Income

ggplot(adult_income_data, aes(x = education, fill = income)) +
  geom_bar(position = "stack", color = "black") +
  labs(title = "Income Distribution by Education Level", x = "Education Level", y = "Count") +
  scale_fill_manual(values = c("#66c2a5", "#fc8d62"), name = "Income", labels = c("<=50K", ">50K")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for better readability

We can see that the majority of people with a high school diploma or less earn less than or equal to $50,000 per year. However, the distribution of income becomes more even as education level increases. For example, about 40% of people with a bachelor’s degree earn more than $50,000 per year. And, about 70% of people with a professional degree earn more than $50,000 per year.

This chart provides clear evidence that education level is a strong predictor of income. People with higher levels of education are more likely to earn higher incomes.

The gap between the income distribution of people with a high school diploma or less and people with a bachelor’s degree is significant. This suggests that there is a financial benefit to pursuing higher education. The gap between the income distribution of people with a bachelor’s degree and people with a professional degree is also significant. This suggests that there may be an even greater financial benefit to pursuing graduate education.

Income vs Marital status

ggplot(adult_income_data, aes(x = marital.status, fill = income)) +
  geom_bar(position = "stack", color = "black") +
  labs(title = "Income Distribution by Marital Status", x = "Marital Status", y = "Count") +
  scale_fill_manual(values = c("lightgreen", "skyblue"), name = "Income", labels = c("<=50K", ">50K")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for better readability

We can see that married couples are the most likely to earn more than $50,000 per year. About 50% of married couples earn more than $50,000 per year. This is significantly higher than the percentage of people in any other marital status group who earn more than $50,000 per year.Divorced people are the least likely to earn more than $50,000 per year. Only about 10% of divorced people earn more than $50,000 per year.

Marriage: Married couples are more likely to have two incomes, which can boost their overall income. Additionally, married couples may be more likely to have children, which can motivate them to earn more money.

Divorce: Divorce can lead to a decrease in income for both spouses. This is because they may have to split their assets and incomes. Additionally, divorced people may have to pay child support or alimony, which can further reduce their disposable income.

Income vs relationship

ggplot(adult_income_data, aes(x = relationship, fill = income)) +
  geom_bar(position = "stack", color = "black") +
  labs(title = "Relationship vs Income", x = "Relationship", y = "Count") +
  scale_fill_manual(values = c("orange", "purple"), name = "Income", labels = c("<=50K", ">50K")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for better readability

Husband-wife families are more likely to earn more than $50,000 per year than other types of families. Additionally, people who are not in a family are less likely to earn more than $50,000 per year than people who are in a family.

Possible explanations for the differences in income distribution by relationship status:

Income: People in families are more likely to have two incomes, which can boost their overall income. Expenses: People in families may have more expenses, such as childcare and housing. However, they may also be able to take advantage of economies of scale, which can reduce their overall expenses. Savings: People in families may be more likely to save money than people who are not in families. This is because they may have more financial support from their spouse or other family members.

Income vs Sex

ggplot(adult_income_data, aes(x = sex, fill = income)) +
  geom_bar(position = "stack", color = "black") +
  labs(title = "Sex vs Income", x = "Sex", y = "Count") +
  scale_fill_manual(values = c("pink", "lightblue"), name = "Income", labels = c("<=50K", ">50K")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for better readability

We can observe that the majority of males (75%) earn less than or equal to $50,000 per year, while a slightly greater proportion of females (85%) fall into this income category. The remaining 25% of males and 15% of females earn more than $50,000 per year.

This chart suggests that there is a modest income gap between men and women, with men slightly more likely to earn higher incomes.

Discrimination: Women may face discrimination in the workplace, leading to lower wages and fewer opportunities for advancement. Caregiving Responsibilities: Women are more likely to take on primary caregiving responsibilities for children and elderly family members, which can limit their ability to work full-time or pursue higher-paying careers.

Income vs Age

ggplot(adult_income_data, aes(x = Age, fill = income)) +
  geom_bar(position = "stack", color = "black") +
  labs(title = "Age vs Income", x = "Age", y = "Count") +
  scale_fill_manual(values = c("lightgreen", "skyblue"), name = "Income", labels = c("<=50K", ">50K")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for better readability

We can see that the youngest people (those under the age of 25) are the most likely to earn less than or equal to $50,000 per year. About 90% of people under the age of 25 earn less than or equal to $50,000 per year. Income distribution becomes more even as people get older. For example, about 70% of people between the ages of 25 and 34 earn less than or equal to $50,000 per year. And, about 50% of people between the ages of 45 and 54 earn less than or equal to $50,000 per year. People over the age of 55 are the most likely to earn more than $50,000 per year. About 50% of people over the age of 55 earn more than $50,000 per year.

Hypothesis Testing: Age vs Income

# Set up hypothesis parameters
alpha <- 0.05 
power <- 0.80 
min_effect_size <- 3

Neyman-Pearson Test for hypothesis:

As we have enough data to perform a Neyman-Pearson hypothesis test. The Adult dataset contains 48,842 observations, which is more than the sample size required to achieve a power level of 0.80 at an alpha level of 0.05 for a two-tailed test with a minimum effect size of 3.

t_test_result <- t.test(adult_income_data$Age ~ adult_income_data$income, alternative = "two.sided" , alpha = 0.05)
t_test_result

## 
##  Welch Two Sample t-test
## 
## data:  adult_income_data$Age by adult_income_data$income
## t = -34.006, df = 8497.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group  <=50K and group  >50K is not equal to 0
## 95 percent confidence interval:
##  -7.698401 -6.859245
## sample estimates:
## mean in group  <=50K  mean in group  >50K 
##             37.04801             44.32683

The t-test results indicate a p-value less than 0.05, which means we reject the null hypothesis i.e., the average age of individuals with income greater than $50K is statistically significantly different from the average age of individuals with income less than or equal to $50K.

Fisher’s Style Test for hypothesis:

fisher_test_result <- var.test(adult_income_data$Age ~ adult_income_data$income)
fisher_test_result

## 
##  F test to compare two variances
## 
## data:  adult_income_data$Age by adult_income_data$income
## F = 1.798, num df = 12434, denom df = 3845, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  1.707729 1.891691
## sample estimates:
## ratio of variances 
##           1.797987

The Fisher’s style test results also indicate a p-value less than 0.05, leading us to reject the null hypothesis. This reinforces the finding that there is a significant difference in ages between income groups.

# Visualization:
ggplot(adult_income_data, aes(x = income, y = Age, fill = income)) +
  geom_boxplot() +
  labs(x = "Income", y = "Age", title = "Age vs Income")

The boxplot shows clear differences in age distributions between income levels. Individuals with income greater than $50K tend to be older on average compared to those with income less than or equal to $50K.

Insights:

The hypothesis testing and visualization confirm that there is a significant age difference between income groups. This suggests that age may play a role in determining income levels, with older individuals having higher incomes, on average. Age appears to be a factor influencing income levels.

Logistic Regression Model

# Choosing Binary Column
binary_variable <- "income_binary"

summary(adult_income_data)

##       Age         workclass             fnlwgt         education        
##  Min.   :17.00   Length:16281       Min.   :  13492   Length:16281      
##  1st Qu.:28.00   Class :character   1st Qu.: 116736   Class :character  
##  Median :37.00   Mode  :character   Median : 177831   Mode  :character  
##  Mean   :38.77                      Mean   : 189436                     
##  3rd Qu.:48.00                      3rd Qu.: 238384                     
##  Max.   :90.00                      Max.   :1490400                     
##  education.num   marital.status      occupation        relationship      
##  Min.   : 1.00   Length:16281       Length:16281       Length:16281      
##  1st Qu.: 9.00   Class :character   Class :character   Class :character  
##  Median :10.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :10.07                                                           
##  3rd Qu.:12.00                                                           
##  Max.   :16.00                                                           
##      race               sex             capital.gain    capital.loss   
##  Length:16281       Length:16281       Min.   :    0   Min.   :   0.0  
##  Class :character   Class :character   1st Qu.:    0   1st Qu.:   0.0  
##  Mode  :character   Mode  :character   Median :    0   Median :   0.0  
##                                        Mean   : 1082   Mean   :  87.9  
##                                        3rd Qu.:    0   3rd Qu.:   0.0  
##                                        Max.   :99999   Max.   :3770.0  
##  hours.per.week  native.country        income          income_binary   
##  Min.   : 1.00   Length:16281       Length:16281       Min.   :0.0000  
##  1st Qu.:40.00   Class :character   Class :character   1st Qu.:0.0000  
##  Median :40.00   Mode  :character   Mode  :character   Median :0.0000  
##  Mean   :40.39                                         Mean   :0.2362  
##  3rd Qu.:45.00                                         3rd Qu.:0.0000  
##  Max.   :99.00                                         Max.   :1.0000

logistic_model <- glm(income_binary ~ Age + education.num + hours.per.week, data = adult_income_data, family = binomial)

coef(logistic_model)

##    (Intercept)            Age  education.num hours.per.week 
##    -8.39577268     0.04457756     0.34467190     0.04147397

summary(logistic_model)

## 
## Call:
## glm(formula = income_binary ~ Age + education.num + hours.per.week, 
##     family = binomial, data = adult_income_data)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -8.395773   0.154887  -54.21   <2e-16 ***
## Age             0.044578   0.001616   27.58   <2e-16 ***
## education.num   0.344672   0.009149   37.67   <2e-16 ***
## hours.per.week  0.041474   0.001751   23.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 17801  on 16280  degrees of freedom
## Residual deviance: 14481  on 16277  degrees of freedom
## AIC: 14489
## 
## Number of Fisher Scoring iterations: 5

Interpretation of coefficients

Intercept: Represents the log-odds of the outcome when all predictors are 0.

Age: The coefficient for ‘Age’ is approximately 0.045 i.e., for each one-year increase in age, the estimated log-odds of having an income greater than 50K increase by about 0.045. In other words, as individuals get older, their probability of having a higher income also increases. This is a positive relationship between age and income.

education.num: The coefficient for ‘education.num’ is approximately 0.345 i.e., for each one-unit increase in the ‘education.num’ variable, the estimated log-odds of having an income greater than 50K increase by approximately 0.345. This implies that individuals with higher education levels have a higher probability of having a high income.

hours.per.week: The coefficient for ‘hours.per.week’ is approximately 0.041 i.e., for each additional hour worked per week, the estimated log-odds of having an income greater than 50K increase by about 0.041. This suggests that working longer hours per week is associated with a higher probability of having a high income.

# Confidence Interval
conf_int <- confint(logistic_model, "Age")

## Waiting for profiling to be done...

conf_int

##      2.5 %     97.5 % 
## 0.04141851 0.04775428

This means that for ‘Age,’ we are 95% confident that the true effect of one additional year of age on the log-odds of having an income greater than 50K falls between approximately 0.0414 and 0.0478.

# Transformation for Explanatory Variables:
model <- lm(Age ~ hours.per.week,
            filter(adult_income_data, income_binary == 1))

rsquared <- summary(model)$r.squared

adult_income_data |> 
  filter(income_binary == 1 ) |>
  ggplot(mapping = aes(x = Age, 
                       y = hours.per.week)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'red', linetype = 'dashed', 
              se = FALSE) +
  geom_smooth(color ='purple', se = FALSE) +
  labs(title = "Age vs Hours per week ")+
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

The above code is a linear regression model to understand the relationship between the ‘Age’ of individuals and the number of ‘hours.per.week’ they work. It then calculates the R-squared value to measure how well the model explains the variance in ‘Age.’

The subsequent code creates a scatter plot with individual data points in blue. It overlays two trendlines: a red dashed line and a purple line. The red dashed line represents the linear relationship between ‘Age’ and ‘hours.per.week,’ as estimated by the linear regression model.

The red line is linear with positive intercept. This shows that the relationship in the model is positively linear between the variables.

The plot helps visualize the relationship between ‘Age’ and ‘hours.per.week’ and assess whether there is a linear association between these variables.

The R-squared value will give you an idea of how well the linear model fits the data. A higher R-squared indicates that ‘hours.per.week’ explains a larger proportion of the variance in ‘Age.’ If the R-squared is close to 1, it suggests a strong linear relationship between the two variables.

LINEAR MODEL

To plot a linear we consider income as a binary numeric variable rather than a categorical variable.

model_sex <- lm(income_binary ~ sex, data = adult_income_data)
summary(model_sex)

## 
## Call:
## lm(formula = income_binary ~ sex, data = adult_income_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2998 -0.2998 -0.1088 -0.1088  0.8912 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.108836   0.005638   19.30   <2e-16 ***
## sex Male    0.190980   0.006904   27.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4151 on 16279 degrees of freedom
## Multiple R-squared:  0.0449, Adjusted R-squared:  0.04484 
## F-statistic: 765.3 on 1 and 16279 DF,  p-value: < 2.2e-16

Males, compared to females, have about a 19% higher chance of having income >50K.The odds of having income >50K are approximately 21% higher for males compared to females. The model indicates that being male is associated with a higher chance of having income >50K, and this difference is statistically significant.

This can be because Traditional gender roles related to caregiving and family responsibilities may impact career progression and, subsequently, income, Despite legal advancements, gender-based discrimination may still exist in the workplace, affecting career advancement and income and etc.

To improve that we can Provide education and training programs that address unconscious biases and promote awareness about gender equality issues in the workplace. Offer flexible work arrangements to accommodate diverse family responsibilities. This helps in promoting work-life balance and equal opportunities for career advancement.

model_edu <- lm(income_binary ~ education, data = adult_income_data)
summary(model_edu)

## 
## Call:
## lm(formula = income_binary ~ education, data = adult_income_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.75194 -0.18846 -0.15673 -0.05024  0.97468 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.054825   0.018564   2.953  0.00315 ** 
## education 11th         -0.004589   0.024317  -0.189  0.85032    
## education 12th          0.012140   0.032345   0.375  0.70743    
## education 1st-4th      -0.029508   0.048310  -0.611  0.54134    
## education 5th-6th       0.007675   0.035178   0.218  0.82729    
## education 7th-8th       0.016373   0.029210   0.561  0.57512    
## education 9th           0.003027   0.031528   0.096  0.92352    
## education Assoc-acdm    0.222329   0.025277   8.796  < 2e-16 ***
## education Assoc-voc     0.182289   0.024001   7.595 3.25e-14 ***
## education Bachelors     0.354164   0.020087  17.632  < 2e-16 ***
## education Doctorate     0.635783   0.034826  18.256  < 2e-16 ***
## education HS-grad       0.101905   0.019349   5.267 1.41e-07 ***
## education Masters       0.480507   0.022647  21.217  < 2e-16 ***
## education Preschool    -0.023575   0.072495  -0.325  0.74504    
## education Prof-school   0.697113   0.030882  22.573  < 2e-16 ***
## education Some-college  0.133634   0.019709   6.780 1.24e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3964 on 16265 degrees of freedom
## Multiple R-squared:  0.1299, Adjusted R-squared:  0.1291 
## F-statistic: 161.8 on 15 and 16265 DF,  p-value: < 2.2e-16

When someone has no formal education (reference category), the estimated chance of having income >50K is about 5.5%. For example, having an ‘Assoc-acdm’ education level increases the chance by about 22%.The ’***’ symbols mean that many education levels significantly impact the chance of having income >50K compared to having no formal education. In summary, the model suggests that education levels significantly influence the likelihood of having income >50K. Each education level has a specific impact on this likelihood, and the overall model is statistically significant in predicting income based on education.

The reason for this could be Higher education levels often correspond to acquiring specialized skills and knowledge. Individuals with advanced education may be more qualified for jobs that offer higher salaries.Certain professions and occupations require specific levels of education. Jobs that demand higher qualifications may offer better compensation.

This can be improved by promoting equal access to quality education at all levels. This includes addressing disparities in educational resources and opportunities from early childhood to higher education. Encourage and support the inclusion of individuals from all backgrounds in science, technology, engineering, and mathematics (STEM) education and careers, which often have high earning potentials.

ANOVA test

anova(model_edu)

## Analysis of Variance Table
## 
## Response: income_binary
##              Df  Sum Sq Mean Sq F value    Pr(>F)    
## education    15  381.45 25.4299  161.82 < 2.2e-16 ***
## Residuals 16265 2556.02  0.1571                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value associated with ‘education’ is very low, essentially zero (‘< 2.2e-16’). This means that the effect of ‘education’ on ‘income_binary’ is highly significant. The F value (161.82) is a way to compare how well ‘education’ explains income differences compared to what we’d expect by chance. A higher F value suggests that ‘education’ is doing a good job explaining income categories.On average, the impact of ‘education’ on income categories is about 25.4299.

anova(model_sex)

## Analysis of Variance Table
## 
## Response: income_binary
##              Df  Sum Sq Mean Sq F value    Pr(>F)    
## sex           1  131.89 131.887  765.26 < 2.2e-16 ***
## Residuals 16279 2805.59   0.172                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The super tiny number (‘< 2.2e-16’) tells us that the link between gender and income categories is very likely not a random thing—it’s a real connection. The big F value (765.26) says that considering whether someone is male or female really helps explain why people fall into different income categories. This result tells us there’s a story to uncover. Why do income categories seem to be connected to gender? Are there specific jobs, societal norms, or other things at play?

Conclusion

In conclusion, Education emerges as a pivotal determinant of income, with higher educational attainment generally corresponding to increased earning potential. The insights gained from this analysis hold practical implications for policymakers, businesses, and social researchers. Understanding the factors influencing income can inform targeted interventions and policies aimed at reducing disparities.