library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(effsize)
library(pwrss)
##
## Attaching package: 'pwrss'
## The following object is masked from 'package:stats':
##
## power.t.test
data("midwest")
head(midwest,3)
## # A tibble: 3 × 28
## PID county state area poptotal popdensity popwhite popblack popamerindian
## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
## 1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
## 2 562 ALEXAND… IL 0.014 10626 759 7054 3496 19
## 3 563 BOND IL 0.022 14991 681. 14477 429 35
## # ℹ 19 more variables: popasian <int>, popother <int>, percwhite <dbl>,
## # percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## # popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## # poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## # percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## # percelderlypoverty <dbl>, inmetro <int>, category <chr>
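The midwest dataset is built into the ggplot2 package in R. It contains demographic and socio-economic variables for counties in the Midwest region of the USA. The dataset can be used to answer questions such as the two tested below: how education level compares with poverty, and how population density differs between urban and rural counties.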
Alpha (α = 0.05) is the significance level: the probability of making a Type I error (false positive), that is, rejecting H0 when it is actually true. Setting α = 0.05 means accepting a 5% chance of a Type I error. A stricter level such as 0.01 lowers that risk, at the cost of reduced power for the same sample size.
Power (1 - β = 0.8) is the probability of correctly rejecting H0 when it is false, that is, the test's ability to detect an effect when one exists. A power of 0.8 means an 80% chance of detecting a true effect.
Minimum effect size (δ ≈ 0.2): the smallest meaningful difference we want to detect.
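The three design parameters above together determine how many observations each group needs. The following is a minimal sketch, assuming δ is a standardized effect size (Cohen's d, so `sd = 1`) and that the comparison is a two-sided, two-sample t-test. It calls `stats::power.t.test` explicitly because pwrss masks that name when attached; the variable names `target_power`, `power_plan` and `n_per_group` are illustrative only.

```r
# Sketch: required sample size per group for the design parameters above.
# Assumption: delta is in standard-deviation units (Cohen's d), hence sd = 1.
alpha <- 0.05
target_power <- 0.80
delta <- 0.2

# stats:: prefix avoids the power.t.test masked by pwrss
power_plan <- stats::power.t.test(
  delta = delta, sd = 1,
  sig.level = alpha, power = target_power,
  type = "two.sample", alternative = "two.sided"
)

# Round up: a fractional observation cannot be collected
n_per_group <- ceiling(power_plan$n)
n_per_group
```

Comparing `n_per_group` with the number of counties in each group gives a rough indication of whether the tests below are adequately powered for an effect this small.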
Null Hypothesis (H0): There is no difference between the mean percentage of adults with a college degree (percollege) and the mean percentage of adults living below the poverty line (percadultpoverty).
Alternative Hypothesis (H1): There is a difference between the mean percentage of adults with a college degree and the mean percentage of adults living below the poverty line.
alpha <- 0.05
test_result <- t.test(midwest$percollege, midwest$percadultpoverty)
p_value <- test_result$p.value
if (p_value < alpha) {
  cat("Reject the null hypothesis: There is a significant difference between education level and poverty.\n")
} else {
  cat("Fail to reject the null hypothesis: There is no significant difference between education level and poverty.\n")
}
## Reject the null hypothesis: There is a significant difference between education level and poverty.
test_result
##
## Welch Two Sample t-test
##
## data: midwest$percollege and midwest$percadultpoverty
## t = 19.022, df = 838.24, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 6.595113 8.112763
## sample estimates:
## mean of x mean of y
## 18.27274 10.91880
These results show that, across Midwest counties, the mean percentage of adults with a college degree (18.27%) is significantly higher than the mean percentage of adults living below the poverty line (10.92%); the 95% confidence interval for the difference is 6.60 to 8.11 percentage points. One possible interpretation is that a college degree improves the chance of getting a job, which in turn could lower the percentage of adults living below the poverty line. A difference in means alone does not establish that the two variables are associated, so this remains a question for further investigation.
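Whether education level and adult poverty move together county by county is a separate question from whether their means differ. A minimal sketch of one way to check it is a Pearson correlation test, assuming a linear association is a reasonable first summary; the object name `assoc` is illustrative and this step is not part of the original analysis.

```r
library(ggplot2)  # provides the midwest dataset
data("midwest")

# Pearson correlation between college attainment and adult poverty,
# computed across counties (one observation per county).
assoc <- cor.test(midwest$percollege, midwest$percadultpoverty,
                  method = "pearson")

assoc$estimate   # correlation coefficient r
assoc$conf.int   # 95% confidence interval for r
assoc$p.value    # test of H0: r = 0
```

A negative coefficient would be consistent with the interpretation that counties with more college graduates have less adult poverty. It would still not show causation.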
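Fisher's test:
contingency_table1 <- table(midwest$popdensity, midwest$inmetro)
# Fisher's exact test is left commented out: popdensity is continuous, so this
# table has one row per distinct density value and the test is not practical on it.
#fisher_result1 <- fisher.test(contingency_table1)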
H0: There is no difference in mean population density (popdensity) between urban (metro) and rural (non-metro) Midwest counties.
H1: There is a difference in mean population density between urban and rural Midwest counties.
# Split population density by metro status (inmetro is an integer: 1 = urban, 0 = rural)
urban_pop_density <- midwest$popdensity[midwest$inmetro == 1]
rural_pop_density <- midwest$popdensity[midwest$inmetro == 0]
alpha <- 0.05
test_result2 <- t.test(urban_pop_density, rural_pop_density)
p_value <- test_result2$p.value
if (p_value < alpha) {
  cat("Reject the null hypothesis 1 as p-value",p_value," is lesser than alpha(at 0.05)")
} else {
  cat("Fail to reject the null hypothesis 1 as pvalue", p_value, "is greater than alpha(0.05)")
}
## Reject the null hypothesis 1 as p-value 2.419739e-09 is lesser than alpha(at 0.05)
test_result2
##
## Welch Two Sample t-test
##
## data: urban_pop_density and rural_pop_density
## t = 6.3521, df = 149.47, p-value = 2.42e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4308.964 8200.255
## sample estimates:
## mean of x mean of y
## 7205.461 950.852
->The extremely small p-value (2.42e-09) indicates strong evidence against the null hypothesis.
->The point estimate of the mean urban population density (mean of x) is 7,205.461. The point estimate of the mean rural population density (mean of y) is 950.852.
->These results suggest that urban counties in the midwest dataset have a significantly higher mean population density than rural counties.
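A small p-value says the difference is unlikely to be zero, but not how large it is relative to the spread of the data. The sketch below computes Cohen's d by hand with a pooled standard deviation and compares it with the minimum effect size of 0.2 stated earlier. It assumes the pooled-SD form of d is acceptable even though the Welch test above does not assume equal variances; the names `urban`, `rural`, `pooled_sd` and `cohens_d` are illustrative.

```r
library(ggplot2)  # provides the midwest dataset
data("midwest")

urban <- midwest$popdensity[midwest$inmetro == 1]
rural <- midwest$popdensity[midwest$inmetro == 0]

n1 <- length(urban)
n2 <- length(rural)

# Pooled standard deviation across the two groups
pooled_sd <- sqrt(((n1 - 1) * var(urban) + (n2 - 1) * var(rural)) /
                    (n1 + n2 - 2))

# Cohen's d: standardized difference between the group means
cohens_d <- (mean(urban) - mean(rural)) / pooled_sd
cohens_d

# TRUE if the observed effect exceeds the minimum effect size of interest
cohens_d > 0.2
```

The effsize package loaded at the top also provides a `cohen.d()` function that could be used in place of the manual calculation.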
Fisher’s test:
contingency_table <- table(midwest$popdensity, midwest$inmetro)
#Fisher's exact test
#fisher_result <- fisher.test(contingency_table)
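Fisher's exact test is designed for small tables of categorical counts, so it cannot be applied meaningfully to a table keyed on raw population density. One way to make the test workable is to bin density into two categories first. The sketch below assumes a split at the median density, which is an arbitrary choice made purely for illustration; `dens_group`, `metro`, `tab` and `fisher_2x2` are hypothetical names, not part of the original analysis.

```r
library(ggplot2)  # provides the midwest dataset
data("midwest")

# Assumption: counties above the median density count as "high" density
dens_group <- ifelse(midwest$popdensity > median(midwest$popdensity),
                     "high", "low")
metro <- ifelse(midwest$inmetro == 1, "urban", "rural")

# 2 x 2 contingency table: density category by metro status
tab <- table(density = dens_group, metro = metro)
tab

# Fisher's exact test of independence on the 2 x 2 table
fisher_2x2 <- fisher.test(tab)
fisher_2x2$p.value
fisher_2x2$estimate  # conditional maximum-likelihood odds ratio
```

Binning discards information, so this complements the t-test above and does not replace it.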
For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.
1: Education level vs. poverty
# Scatter plot of total poverty against college attainment, colored by state
p <- ggplot(data = midwest, aes(y = percbelowpoverty, x = percollege))
p + geom_point(aes(color = state)) +
  ggtitle("College Education Vs Total Poverty") +
  xlab("Percent College Educated") +
  ylab("Percentage of Total poverty")
2: Urbanization vs. population density
# Group means copied from the t-test output above (mean of x, mean of y)
mean_x <- 7205.461
mean_y <- 950.852
data <- data.frame(
  Group = c("Urban", "Rural"),
  Mean = c(mean_x, mean_y)
)
# Bar chart of the two group means
ggplot(data, aes(x = Group, y = Mean, fill = Group)) +
  geom_bar(stat = "identity", width = 0.5) +
  labs(
    title = "Population Density - (Urban vs. Rural)",
    x = "Urbanization",
    y = "Mean Population Density"
  ) +
  theme_minimal() +
  scale_fill_manual(values = c("Urban" = "pink", "Rural" = "yellow"))
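A bar chart of two means hides the spread within each group. As a complement, the sketch below draws a box plot from the county-level data. It assumes a log10 y-axis is appropriate because a few very dense counties would otherwise compress the rest of the distribution. The `Group` column and the `box_plot` object are introduced here for illustration.

```r
library(ggplot2)  # provides the midwest dataset
data("midwest")

# Label each county by metro status (inmetro: 1 = urban, 0 = rural)
midwest$Group <- ifelse(midwest$inmetro == 1, "Urban", "Rural")

# Box plot of county-level population density on a log10 scale
box_plot <- ggplot(midwest, aes(x = Group, y = popdensity, fill = Group)) +
  geom_boxplot(width = 0.5) +
  scale_y_log10() +
  labs(
    title = "Population Density by County (Urban vs. Rural)",
    x = "Urbanization",
    y = "Population Density (log10 scale)"
  ) +
  theme_minimal() +
  scale_fill_manual(values = c("Urban" = "pink", "Rural" = "yellow"))

box_plot
```

This view shows how much the two distributions overlap and whether the difference in means is driven by a handful of extreme counties.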