# Loading required libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(RColorBrewer)
library(pwr)

# Load the dataset
adult <- read.csv("C:/Users/RAKESH REDDY/OneDrive/Desktop/adult_income_data.csv")
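
The group labels printed in the test output later in this report (for example "group  <=50K.") suggest that the text fields in the CSV carry a leading space. A minimal sketch of one way to strip that whitespace at load time, using the `strip.white` argument of `read.csv` on a small inline sample. The two rows are made-up illustration data, not rows from the actual file:

```r
# Hypothetical two-row sample mimicking fields that start with a space.
csv_text <- "age,income\n39, <=50K.\n52, >50K."

# strip.white = TRUE removes leading/trailing whitespace from unquoted fields.
sample_df <- read.csv(text = csv_text, strip.white = TRUE)

print(unique(as.character(sample_df$income)))
```

The same argument could be passed when reading the real file, so that group labels compare cleanly against "<=50K." and ">50K.".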

Hypothesis 1: Age vs Income:

Null Hypothesis 1: The average age of individuals with income greater than $50K is the same as the average age of individuals with income less than or equal to $50K.

Setting up hypothesis parameters:

Alpha Level: 0.05. We choose an alpha level of 0.05, a common choice in hypothesis testing, to control the Type I error rate: the probability of rejecting the null hypothesis when it is true. A lower alpha level makes the test more stringent and less likely to produce false positives.

Power Level: 0.8. We set a power level of 0.8, the probability of correctly rejecting the null hypothesis when it is false (i.e., avoiding a Type II error). A higher power level indicates a greater ability to detect an effect if it exists.

Minimum Effect Size: 0.3. pwr.t.test takes a standardized effect size (Cohen's d), so 0.3 means that the smallest difference in average age between income groups we consider practically significant is 0.3 pooled standard deviations. This is not the same as a raw difference of 3 years; a difference in years must be divided by the pooled standard deviation of age to obtain d.

# Set up hypothesis parameters
alpha <- 0.05 
power <- 0.80 
min_effect_size <- 0.3

required_sample_size <- pwr.t.test(
  d = min_effect_size,
  sig.level = alpha,
  power = power,
  type = "two.sample"  # Specify a two-sample t-test
)

# Print the required sample size
print(required_sample_size)
## 
##      Two-sample t test power calculation 
## 
##               n = 175.3847
##               d = 0.3
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
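
The `d` argument is a standardized effect size, so a raw difference in years has to be converted before it can be used. The sketch below shows that conversion and the resulting sample size, using base R's `power.t.test` rather than `pwr.t.test`. The standard deviation `sd_age` is a hypothetical placeholder, not a value estimated from the data:

```r
# Convert a raw mean difference into Cohen's d, then solve for n per group.
# NOTE: sd_age = 13 is a hypothetical placeholder, not taken from the dataset.
raw_diff_years <- 3
sd_age <- 13
d <- raw_diff_years / sd_age

# With sd = 1, the delta argument of power.t.test() plays the role of Cohen's d.
res <- power.t.test(delta = d, sd = 1, sig.level = 0.05, power = 0.80,
                    type = "two.sample", alternative = "two.sided")
n_per_group <- ceiling(res$n)

print(d)
print(n_per_group)
```

Under this assumed standard deviation, a 3-year difference corresponds to a d below 0.3, so it would need more observations per group than the calculation above reports.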

Neyman-Pearson Test for hypothesis 1:

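Yes, we have enough data to perform a Neyman-Pearson hypothesis test. The full Adult dataset contains 48,842 observations, and the file analyzed here contains 16,281 of them (12,435 with income <=50K and 3,846 with income >50K, as implied by the degrees of freedom in the F test output below). Both groups are far larger than the roughly 176 observations per group required to achieve a power level of 0.80 at an alpha level of 0.05 for a two-tailed test with a minimum effect size of d = 0.3.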
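
The sample-size check can be sketched directly. The group sizes below are reconstructed from the degrees of freedom in the F test output (each df plus one) rather than read from the file. With the data loaded, `table(adult$income)` would presumably give the same counts:

```r
# Group sizes implied by the F test output (each df + 1).
group_n <- c("<=50K" = 12434 + 1, ">50K" = 3845 + 1)

# Required n per group as reported by the power calculation, rounded up.
required_n <- ceiling(175.3847)

enough <- group_n >= required_n
print(sum(group_n))
print(enough)
```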

# t.test() has no alpha argument; alpha = 0.05 corresponds to conf.level = 0.95
t_test_result <- t.test(adult$age ~ adult$income, alternative = "two.sided", conf.level = 0.95)
t_test_result
## 
##  Welch Two Sample t-test
## 
## data:  adult$age by adult$income
## t = -34.006, df = 8497.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group  <=50K. and group  >50K. is not equal to 0
## 95 percent confidence interval:
##  -7.698401 -6.859245
## sample estimates:
## mean in group  <=50K.  mean in group  >50K. 
##              37.04801              44.32683

The t-test gives a p-value below 0.05 (p < 2.2e-16), so we reject the null hypothesis: the average age of individuals with income greater than $50K (about 44.3 years) is statistically significantly different from the average age of individuals with income less than or equal to $50K (about 37.0 years). The 95 percent confidence interval places the difference between roughly 6.9 and 7.7 years.
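
A minimal sketch of the Neyman-Pearson decision rule, showing how the alpha level enters the test through `conf.level` and how the reject/retain decision is read off the p-value. The data are simulated stand-ins, and the means and standard deviations used to generate them are hypothetical, not estimates from the Adult file:

```r
set.seed(1)
# Simulated stand-in data; the generating parameters are hypothetical.
sim <- data.frame(
  age = c(rnorm(500, mean = 37, sd = 14), rnorm(500, mean = 44, sd = 10)),
  income = rep(c("<=50K", ">50K"), each = 500)
)

# alpha enters through conf.level = 1 - alpha; t.test() has no alpha argument.
alpha <- 0.05
res <- t.test(age ~ income, data = sim, alternative = "two.sided",
              conf.level = 1 - alpha)

# Neyman-Pearson decision: reject the null when p < alpha.
reject_null <- res$p.value < alpha
print(res$conf.int)
print(reject_null)
```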

Fisher’s Style Test for hypothesis 1:

fisher_test_result <- var.test(adult$age ~ adult$income)
fisher_test_result
## 
##  F test to compare two variances
## 
## data:  adult$age by adult$income
## F = 1.798, num df = 12434, denom df = 3845, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  1.707729 1.891691
## sample estimates:
## ratio of variances 
##           1.797987

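The F test has a p-value below 0.05, so we reject the hypothesis of equal variances: the variance of age in the <=50K group is about 1.8 times that in the >50K group. Because var.test compares variances rather than means, this result supports the use of the Welch (unequal-variance) t-test but does not by itself confirm a difference in average age. Read in Fisher's style, the t-test p-value (< 2.2e-16) is itself strong evidence against the null hypothesis, which agrees with the Neyman-Pearson decision that there is a significant difference in ages between income groups.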
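
Since `var.test` examines variances and `t.test` examines means, the two can give different answers on the same data. A small sketch with constructed (not real) samples that share a mean but differ in spread:

```r
# Constructed samples: identical means, standard deviations differing by 3x.
z <- qnorm(ppoints(200))  # symmetric around 0
x <- 40 + 1 * z
y <- 40 + 3 * z

var_p  <- var.test(x, y)$p.value  # tests equality of variances
mean_p <- t.test(x, y)$p.value    # Welch test of equality of means

print(c(var_test = var_p, t_test = mean_p))
```

Here the variance test rejects while the test of means does not, which is why a small `var.test` p-value should not be read as evidence about the group means.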

Visualization:

ggplot(adult, aes(x = income, y = age, fill = income)) +
  geom_boxplot() +
  labs(x = "Income", y = "Age", title = "Age vs Income")

The boxplot shows clear differences in age distributions between income levels. Individuals with income greater than $50K tend to be older on average compared to those with income less than or equal to $50K.

Insights:

The hypothesis test and the visualization both indicate a significant age difference between income groups. This suggests that age may play a role in income levels, with older individuals having higher incomes on average.

Hypothesis 2: Capital Gain vs Sex:

Null Hypothesis 2: The average capital gain is the same for both genders (sexes).

Setting up hypothesis parameters:

Alpha Level: 0.05. We choose an alpha level of 0.05, a common choice in hypothesis testing, to control the Type I error rate: the probability of rejecting the null hypothesis when it is true. A lower alpha level makes the test more stringent and less likely to produce false positives.

Power Level: 0.8. We set a power level of 0.8, the probability of correctly rejecting the null hypothesis when it is false (i.e., avoiding a Type II error). A higher power level indicates a greater ability to detect an effect if it exists.

Minimum Effect Size: 0.1. pwr.t.test takes a standardized effect size (Cohen's d), so 0.1 means that the smallest difference in average capital gain between genders we consider practically significant is 0.1 pooled standard deviations. This does not correspond directly to $1000; a dollar difference must be divided by the pooled standard deviation of capital gain to obtain d.

# Set up hypothesis parameters
alpha <- 0.05 
power <- 0.80 
min_effect_size <- 0.1

required_sample_size <- pwr.t.test(
  d = min_effect_size,
  sig.level = alpha,
  power = power,
  type = "two.sample"  # Specify a two-sample t-test
)

# Print the required sample size
print(required_sample_size)
## 
##      Two-sample t test power calculation 
## 
##               n = 1570.733
##               d = 0.1
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Neyman-Pearson Test for hypothesis 2:

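Yes, we have enough data to perform a Neyman-Pearson hypothesis test. The degrees of freedom in the F test output below imply 5,421 observations in the Female group and 10,860 in the Male group, both above the roughly 1,571 observations per group required.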
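
As a rough check, the power actually available at d = 0.1 can be estimated from the smaller group alone. This sketch uses base R's `power.t.test` and treats both groups as if they had the smaller group's size, which understates the real power. The group size is reconstructed from the F test degrees of freedom, not read from the file:

```r
# Smaller group size implied by the F test output (Female: df + 1).
n_small <- 5420 + 1

# Power at d = 0.1 if both groups had only n_small observations.
achieved <- power.t.test(n = n_small, delta = 0.1, sd = 1, sig.level = 0.05,
                         type = "two.sample", alternative = "two.sided")
print(achieved$power)
```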

t_test_result_2 <- t.test(adult$capitalgain ~ adult$sex, alternative = "two.sided")
t_test_result_2
## 
##  Welch Two Sample t-test
## 
## data:  adult$capitalgain by adult$sex
## t = -6.5272, df = 15311, p-value = 6.91e-11
## alternative hypothesis: true difference in means between group  Female and group  Male is not equal to 0
## 95 percent confidence interval:
##  -929.2812 -500.0546
## sample estimates:
## mean in group  Female   mean in group  Male 
##              605.1965             1319.8644

The t-test gives a p-value below 0.05 (p = 6.91e-11), so we reject the null hypothesis: there is a statistically significant difference in average capital gain between genders (about 605 for the Female group versus about 1,320 for the Male group).

Fisher’s Style Test for hypothesis 2:

fisher_test_result_2 <- var.test(adult$capitalgain ~ adult$sex)
fisher_test_result_2
## 
##  F test to compare two variances
## 
## data:  adult$capitalgain by adult$sex
## F = 0.41082, num df = 5420, denom df = 10859, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.3923790 0.4302783
## sample estimates:
## ratio of variances 
##           0.410818

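The F test has a p-value below 0.05, so we reject the hypothesis of equal variances: the variance of capital gain in the Female group is about 0.41 times that in the Male group. Because var.test compares variances rather than means, this result supports the use of the Welch t-test but does not by itself confirm a difference in average capital gain. Read in Fisher's style, the t-test p-value of 6.91e-11 is itself strong evidence against the null hypothesis, which agrees with the Neyman-Pearson decision that there is a significant difference in capital gain between genders.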

Visualization for hypothesis 2:

ggplot(adult, aes(x = sex, y = capitalgain, fill = sex)) +
  geom_bar(stat = "summary", fun = "mean") +  # `fun` replaces the deprecated `fun.y`
  scale_fill_brewer(palette = "Set1") +
  labs(x = "Gender (Sex)", y = "Capital Gain", title = "Capital Gain vs Gender (Sex)")

The bar chart shows the average capital gain for each gender (sex). The average for the Male group (about 1,320) is roughly twice the average for the Female group (about 605).
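
Each bar's height is a per-group mean, which can also be computed directly without plotting. A minimal sketch on a made-up miniature data frame, where the values are illustrative only and unrelated to the real data:

```r
# Made-up miniature data; values are illustrative only.
toy <- data.frame(
  sex = c("Female", "Female", "Male", "Male", "Male"),
  capitalgain = c(0, 1000, 0, 0, 3000)
)

# Mean per group: the quantity each bar represents.
group_means <- tapply(toy$capitalgain, toy$sex, mean)
print(group_means)
```

Applied to the real data, the equivalent call would presumably be `tapply(adult$capitalgain, adult$sex, mean)`, which should match the group means reported by the t-test.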

Insights:

Hypothesis testing and visualization both indicate a significant difference in average capital gain between genders. Gender appears to be associated with the amount of capital gain individuals achieve.

Conclusion:

In conclusion, for Hypothesis 1, we found a significant difference in ages between income groups, indicating that age plays a role in income levels. For Hypothesis 2, we found a significant difference in capital gain based on gender, suggesting that gender can influence financial outcomes.
