1. Which country had the highest number of new leprosy cases in 2021? Why is it not a fair comparison to look at the raw number of cases when comparing prevalence of the disease across different countries?
leprosy <- read.csv("/Users/cindyfan/Desktop/SDS 313/homework2/Homework2_leprosy.csv")
max(leprosy$Cases, na.rm = TRUE)
## [1] 75394
leprosy[which.max(leprosy$Cases), ]
##    Country Code       Region Population      GDP LandArea Cases
## 76   India  IND Asia/Pacific 1399179585 3176.295  1147955 75394

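India had the highest number of new leprosy cases in 2021 (75,394). Comparing raw case counts is not fair because countries differ greatly in population: a very large country can report the most cases even when a smaller share of its people are affected, so cases should be expressed relative to population.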

2. Create a new variable in the dataset that provides the leprosy cases per 100K people in each country. Graph this new variable’s distribution and provide the relevant summary statistics inline within a short paragraph describing the distribution.
leprosy$density = (leprosy$Cases/leprosy$Population) * 1e+05
library(ggplot2)
dp_ggplot <- ggplot(leprosy)
dp_ggplot + geom_histogram(aes(x = density), binwidth = 3, col = "black",
    fill = "aquamarine") + labs(title = "Distribution of Leprosy Cases per 100k People Across All Countries",
    x = "Leprosy cases per 100k people", y = "Frequency")

fivenum(leprosy$density)
## [1]  0.000000000  0.002424845  0.242289314  1.278899578 30.450669915
min(leprosy$density, na.rm = TRUE)
## [1] 0
median(leprosy$density, na.rm = TRUE)
## [1] 0.2422893
max(leprosy$density, na.rm = TRUE)
## [1] 30.45067

The five-number summary for leprosy cases per 100k people (0, 0.0024, 0.2423, 1.2789, 30.4507) shows a strongly right-skewed distribution: the median is 0.2423 and the third quartile is only 1.2789, yet the maximum reaches 30.4507. Values range from 0 to 30.4507.

3. We want to compare cases per 100K across the different regions in this dataset. In a single plot output, create a graph that shows the distribution of cases per 100K split by region. Output a nicely formatted table that provides the region name, number of countries in that region, and the median cases per 100K for each region. Include a short paragraph summarizing differences in leprosy prevalence across regions.
library(ggplot2)
dp_ggplot + geom_histogram(aes(x = density), col = "black", fill = "red",
    binwidth = 10) + labs(title = "Frequency of New Leprosy Cases per 100k by Region",
    x = "New leprosy cases per 100k people", y = "Frequency") +
    facet_grid(~Region)

TotalCases = as.data.frame(table(leprosy$Region))
# data = leprosy lets the formula refer to columns by name
TotalCases$Median = round(aggregate(density ~ Region, data = leprosy,
    FUN = median)[, 2], 2)
library(kableExtra)
kable_styling(kbl(TotalCases, col.names = c("Regions", "Number of Countries",
    "Median Cases per 100k")))
Regions         Number of Countries   Median Cases per 100k
Africa                           45                    1.07
Americas                         34                    0.18
Asia/Pacific                     33                    0.30
Europe                           51                    0.00
Middle East                      20                    0.07

Differences in leprosy prevalence across regions: Leprosy is relatively more prevalent in Africa and Asia/Pacific, whose median cases per 100k (Africa: 1.07; Asia/Pacific: 0.30) are higher than the overall median of 0.24. It is relatively less prevalent in the Americas, the Middle East, and Europe, whose medians (Americas: 0.18; Middle East: 0.07; Europe: 0.00) fall below the overall median. Ranked by median from most to least prevalent, the regions are: Africa > Asia/Pacific > Americas > Middle East > Europe.
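
The table code above attaches the medians to the counts by row position, which only lines up if both results list the same regions in the same order. A minimal sketch of a safer pattern joins the two summaries on the region name instead. The `toy` data frame and its values are hypothetical stand-ins, not the leprosy data, and `region_summary` is an illustrative name.

```r
# Hypothetical toy data standing in for the leprosy data frame
toy <- data.frame(
  Region  = c("A", "A", "A", "B", "B", "C"),
  density = c(1.0, 3.0, NA, 0.2, 0.4, NA)
)

# Count every country per region, including those with a missing rate
counts <- as.data.frame(table(toy$Region), stringsAsFactors = FALSE)
names(counts) <- c("Region", "Countries")

# Median rate per region; the formula interface drops rows with NA by default
medians <- aggregate(density ~ Region, data = toy, FUN = median)
names(medians)[2] <- "Median"

# Join on the region name so rows cannot be misaligned;
# all.x = TRUE keeps regions that have no non-missing rate
region_summary <- merge(counts, medians, by = "Region", all.x = TRUE)
region_summary$Median <- round(region_summary$Median, 2)
print(region_summary)
```

With this approach a region whose rates are all missing appears with an `NA` median rather than silently shifting the other rows.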

4. Investigate the relationship between cases per 100K and one of the other variables in the dataset (other than region) by making the appropriate bivariate graph and providing the relevant summary statistic inline within a short paragraph describing the relationship.
library(ggplot2)
dp_ggplot + geom_point(aes(x = density, y = GDP)) + labs(title = "Relationship Between Cases per 100k and GDP",
    x = "Cases per 100k", y = "GDP") + theme_classic()

The correlation coefficient between cases per 100K and GDP is -0.055, indicating a very weak, negative linear relationship between the two variables.
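
The report does not show how the coefficient was computed. One plausible way, sketched here as an assumption, is a Pearson correlation restricted to rows where both variables are present. The `toy` values are hypothetical and are not taken from the leprosy data.

```r
# Hypothetical toy values, not taken from the leprosy data
toy <- data.frame(
  density = c(0.1, 0.5, 2.0, NA, 8.0),
  GDP     = c(40000, 12000, 3000, 5000, 1500)
)

# Pearson correlation using only rows where both variables are non-missing
r <- cor(toy$density, toy$GDP, use = "complete.obs")
print(round(r, 2))
```

Applied to the real data, the same call would be `cor(leprosy$density, leprosy$GDP, use = "complete.obs")`. Because the coefficient measures only linear association, a value near zero on heavily skewed data does not rule out a nonlinear pattern in the scatterplot.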