Data Dive hypothesis testing

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(effsize)
library(pwrss)

## 
## Attaching package: 'pwrss'

## The following object is masked from 'package:stats':
## 
##     power.t.test

data("midwest")
head(midwest,3)

## # A tibble: 3 × 28
##     PID county   state  area poptotal popdensity popwhite popblack popamerindian
##   <int> <chr>    <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
## 1   561 ADAMS    IL    0.052    66090      1271.    63917     1702            98
## 2   562 ALEXAND… IL    0.014    10626       759      7054     3496            19
## 3   563 BOND     IL    0.022    14991       681.    14477      429            35
## # ℹ 19 more variables: popasian <int>, popother <int>, percwhite <dbl>,
## #   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## #   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## #   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## #   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## #   percelderlypoverty <dbl>, inmetro <int>, category <chr>

The midwest dataset is built into the ggplot2 package in R. It contains demographic and socio-economic variables for counties in the Midwest region of the USA. The dataset can be used to answer questions like:

Is there a relationship between poverty rate, education, and urbanization?
How does population density relate to urbanization and poverty?
Is there a relationship between racial diversity and urbanization?

Devise at least two different null hypotheses based on two different aspects (e.g., columns) of your data. For each hypothesis: Come up with an alpha level, power level, and minimum effect size, and explain why you chose each value.

Alpha (α = 0.05) is the significance level: the probability of making a Type I error (a false positive), that is, rejecting H0 when it is actually true. Taking α = 0.05 means accepting a 5% chance of a Type I error. A stricter level such as 0.01 would lower the false-positive rate, at the cost of lower power for the same sample size.

Power (1 - β) = 0.8 is the probability of correctly rejecting H0 when it is false, that is, the test’s ability to detect an effect when one exists. A power of 0.8 means an 80% chance of detecting a true effect.

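Minimum effect size (δ ≈ 0.2): the smallest meaningful difference we want to detect. Read as a standardized difference (Cohen's d), 0.2 is conventionally regarded as a small effect.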
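As a rough check on how these three values interact, the sketch below asks base R how many observations per group a two-sample t-test would need to reach the chosen power. It assumes that δ is a standardized difference (so the standard deviation is set to 1) and that the test is two-sided. The function is called as stats::power.t.test because loading pwrss masks that name.

```r
# Sample size per group for a two-sample, two-sided t-test.
# Assumption: delta is a standardized difference, so sd is set to 1.
alpha <- 0.05
power <- 0.80
delta <- 0.2

res <- stats::power.t.test(delta = delta, sd = 1,
                           sig.level = alpha, power = power,
                           type = "two.sample",
                           alternative = "two.sided")

# Round up, because a group cannot contain a fraction of a county
n_per_group <- ceiling(res$n)
n_per_group
```

Comparing n_per_group with the actual group sizes, for example from table(midwest$inmetro), gives a sense of whether the data can detect an effect this small.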

Null Hypothesis 1: Education Level vs poverty

Null Hypothesis (H0): There is no difference between the mean percentage of adults with a college degree (percollege) and the mean percentage of the population living below the poverty line (percbelowpoverty). Note that the t-test below uses the adult poverty rate (percadultpoverty) as the poverty measure.

Alternative Hypothesis (H1): There is a difference between the mean percentage of adults with a college degree and the mean percentage of the population living below the poverty line.

alpha <- 0.05
test_result <- t.test(midwest$percollege, midwest$percadultpoverty)
p_value <- test_result$p.value

if (p_value < alpha) {
  cat("Reject the null hypothesis: There is a significant difference between education level and poverty.\n")
} else {
  cat("Fail to reject the null hypothesis: There is no significant difference between education level and poverty.\n")
}

## Reject the null hypothesis: There is a significant difference between education level and poverty.

test_result

## 
##  Welch Two Sample t-test
## 
## data:  midwest$percollege and midwest$percadultpoverty
## t = 19.022, df = 838.24, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  6.595113 8.112763
## sample estimates:
## mean of x mean of y 
##  18.27274  10.91880

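These results show that the mean percentage of college-educated adults (18.27%) is higher than the mean percentage of adults living below the poverty line (10.92%) across Midwest counties, with a 95% confidence interval for the difference of 6.60 to 8.11 percentage points. Because the two variables measure different things, this difference in means does not by itself show that education and poverty are associated; a correlation or regression would be needed for that. One possible explanation worth investigating is that having a college degree improves the chance of getting a job, which in turn could lower the percentage of adults living below the poverty line.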
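To look at association directly, one option is a correlation test on the same two columns. The following is a minimal sketch, not part of the original analysis; it assumes a Pearson correlation is an acceptable summary of how the two percentages move together across counties.

```r
library(ggplot2)  # provides the midwest data

# Pearson correlation between college attainment and adult poverty.
# Unlike the t-test, this asks whether the two percentages
# vary together from county to county.
cor_result <- cor.test(midwest$percollege, midwest$percadultpoverty)

cor_result$estimate  # correlation coefficient r
cor_result$p.value   # p-value for H0: r = 0
```

A negative estimate would be consistent with the idea that counties with more college graduates tend to have lower adult poverty, although it would not establish causation.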

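Fisher’s test:

# Fisher's exact test needs categorical inputs, so each percentage is split at its median (an assumed cut-off)
contingency_table1 <- table(high_college = midwest$percollege > median(midwest$percollege),
                            high_poverty = midwest$percadultpoverty > median(midwest$percadultpoverty))
fisher_result1 <- fisher.test(contingency_table1)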

Null Hypothesis 2: urbanization vs pop density

H0: There is no difference in mean population density (popdensity) between urban and rural Midwest counties.

H1: There is a difference in mean population density between urban and rural Midwest counties.

# Population density for metro (urban) and non-metro (rural) counties
urban_pop_density <- midwest$popdensity[midwest$inmetro == 1]
rural_pop_density <- midwest$popdensity[midwest$inmetro == 0]

alpha <- 0.05
test_result2 <- t.test(urban_pop_density, rural_pop_density)
p_value <- test_result2$p.value

if (p_value < alpha) {
  cat("Reject null hypothesis 2: p-value", p_value, "is less than alpha (0.05)\n")
} else {
  cat("Fail to reject null hypothesis 2: p-value", p_value, "is not less than alpha (0.05)\n")
}

## Reject null hypothesis 2: p-value 2.419739e-09 is less than alpha (0.05)

test_result2

## 
##  Welch Two Sample t-test
## 
## data:  urban_pop_density and rural_pop_density
## t = 6.3521, df = 149.47, p-value = 2.42e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  4308.964 8200.255
## sample estimates:
## mean of x mean of y 
##  7205.461   950.852

-> The extremely small p-value (2.42e-09) indicates strong evidence against the null hypothesis.
-> The point estimate of the mean urban population density (mean of x) is 7,205.461. The point estimate of the mean rural population density (mean of y) is 950.852.
-> These results suggest that urban counties in the midwest dataset have a significantly higher population density than rural counties.
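A small p-value says the difference is unlikely to be zero, but not how large it is relative to the spread of the data. The sketch below computes Cohen's d with a pooled standard deviation so the result can be compared with the minimum effect size of 0.2. The helper cohens_d is hypothetical and written here for illustration; it is not taken from any package.

```r
library(ggplot2)  # provides the midwest data

urban <- midwest$popdensity[midwest$inmetro == 1]
rural <- midwest$popdensity[midwest$inmetro == 0]

# Cohen's d: difference in means divided by the pooled standard deviation
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

d <- cohens_d(urban, rural)
d
abs(d) >= 0.2  # TRUE if the observed effect meets the chosen minimum
```

The pooled version assumes roughly similar variances in the two groups, which may not hold for a skewed variable like population density, so the value is best read as an approximate guide.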

Fisher’s test:

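# Fisher's exact test needs categorical inputs, so popdensity is split at its median (an assumed cut-off)
contingency_table <- table(high_density = midwest$popdensity > median(midwest$popdensity), inmetro = midwest$inmetro)
fisher_result <- fisher.test(contingency_table)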

3. Build two visualizations that best illustrate the results from the two pairs of hypothesis tests, one for each null hypothesis.

For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.

1: Education Level vs poverty

p <- ggplot(data = midwest, aes(x = percollege, y = percbelowpoverty))
p + geom_point(aes(color = state)) +
  ggtitle("College Education Vs Total Poverty") +
  xlab("Percent College Educated") +
  ylab("Percentage of Total poverty")

2: urbanization vs pop density

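# Group means computed from the data; they match the t-test's sample estimates
mean_x <- mean(midwest$popdensity[midwest$inmetro == 1])  # urban (metro) counties
mean_y <- mean(midwest$popdensity[midwest$inmetro == 0])  # rural (non-metro) counties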


data <- data.frame(
  Group = c("Urban", "Rural"),
  Mean = c(mean_x, mean_y)
)

# bar chart of the two group means
ggplot(data, aes(x = Group, y = Mean, fill = Group)) +
  geom_bar(stat = "identity", width = 0.5) +
  labs(
    title = "Population Density - (Urban vs. Rural)",
    x = "Urbanization",
    y = "Mean Population Density"
  ) +
  theme_minimal() +
  scale_fill_manual(values = c("Urban" = "pink", "Rural" = "yellow"))
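A bar chart of two means hides how the counties are spread within each group. As a complementary view, the sketch below draws a box plot of the county-level densities. The log scale on the y axis is an assumption made for readability, on the expectation that population density is positive and strongly right-skewed; the name plot_data is introduced here for illustration.

```r
library(ggplot2)  # provides the midwest data and plotting functions

# One row per county, labelled by metro status
plot_data <- data.frame(
  Group = ifelse(midwest$inmetro == 1, "Urban", "Rural"),
  popdensity = midwest$popdensity
)

p_box <- ggplot(plot_data, aes(x = Group, y = popdensity, fill = Group)) +
  geom_boxplot(width = 0.5) +
  scale_y_log10() +  # assumed: log scale to handle the skew
  labs(
    title = "Population Density - (Urban vs. Rural)",
    x = "Urbanization",
    y = "Population Density (log scale)"
  ) +
  theme_minimal() +
  scale_fill_manual(values = c("Urban" = "pink", "Rural" = "yellow"))

p_box
```

If the two boxes overlap little, that supports the t-test result visually; heavy overlap would suggest the difference in means is driven by a few very dense counties.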

parimala

2023-10-09
