Homework2

1. Which country had the highest number of new leprosy cases in 2021? Why is it not a fair comparison to look at the raw number of cases when comparing prevalence of the disease across different countries?

leprosy <- read.csv("/Users/cindyfan/Desktop/SDS 313/homework2/Homework2_leprosy.csv")
max(leprosy$Cases, na.rm = TRUE)

## [1] 75394

na.omit(leprosy[(leprosy$Cases) == 75394, ])

##    Country Code       Region Population      GDP LandArea Cases
## 76   India  IND Asia/Pacific 1399179585 3176.295  1147955 75394

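India had the highest number of new leprosy cases in 2021 (75,394). Comparing raw case counts is not fair because countries differ greatly in population: a large country can report many cases while still having a low rate per person, so prevalence should be compared as cases per 100K people.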
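As a quick worked check of this point, the sketch below converts India's raw count into a rate, using only the two figures printed in the output above. The variable names are illustrative and not part of the original analysis.

```r
# Figures copied from the India row printed above
india_cases <- 75394
india_population <- 1399179585

# Cases per 100K people: (cases / population) * 100,000
india_rate <- india_cases / india_population * 1e5
round(india_rate, 2)
```

India's rate works out to roughly 5.39 cases per 100K, which is well below the maximum rate of about 30.45 reported later in this analysis, so the country with the most cases is not the country with the highest prevalence.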

2. Create a new variable in the dataset that provides the leprosy cases per 100K people in each country. Graph this new variable’s distribution and provide the relevant summary statistics inline within a short paragraph describing the distribution.

library(ggplot2)
# New variable: leprosy cases per 100K people
leprosy$density <- (leprosy$Cases / leprosy$Population) * 1e+05
dp_ggplot <- ggplot(leprosy)
dp_ggplot + geom_histogram(aes(x = density), binwidth = 3, col = "black",
    fill = "aquamarine") + labs(title = "Distribution of Leprosy Cases per 100k People Across All Countries",
    x = "Leprosy cases per 100k people", y = "Frequency")

fivenum(leprosy$density)

## [1]  0.000000000  0.002424845  0.242289314  1.278899578 30.450669915

min(leprosy$density, na.rm = TRUE)

## [1] 0

median(leprosy$density, na.rm = TRUE)

## [1] 0.2422893

max(leprosy$density, na.rm = TRUE)

## [1] 30.45067

The distribution of leprosy cases per 100K people is strongly right-skewed. Its five-number summary is 0, 0.002, 0.242, 1.279, 30.451: the median is only 0.24 cases per 100K and the upper hinge is about 1.28, while the rate ranges from 0 up to 30.45.
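The skew claim can be backed with a simple comparison of mean and median. The sketch below is one possible way to do that; `describe_distribution` is a hypothetical helper and `toy_rates` is made-up data for illustration, not the homework dataset.

```r
# Hypothetical helper: five-number summary plus a rough right-skew check
describe_distribution <- function(x) {
  x <- x[!is.na(x)]  # drop missing rates before summarising
  five <- fivenum(x)
  names(five) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
  list(
    five_number = five,
    mean = mean(x),
    # A mean above the median is a rough sign of right skew
    mean_above_median = mean(x) > median(x)
  )
}

# Made-up rates for illustration only (not the homework data)
toy_rates <- c(0, 0, 1, 2, 30, NA)
result <- describe_distribution(toy_rates)
result$five_number
result$mean_above_median
```

Applied to the real data, the same call would be `describe_distribution(leprosy$density)`, assuming the `density` column created above.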

3. We want to compare cases per 100K across the different regions in this dataset. In a single plot output, create a graph that shows the distribution of cases per 100K split by region. Output a nicely formatted table that provides the region name, number of countries in that region, and the median cases per 100K for each region. Include a short paragraph summarizing differences in leprosy prevalence across regions.

library(ggplot2)
dp_ggplot + geom_histogram(aes(x = density), col = "black", fill = "red",
    alpha = 1, binwidth = 10, position = "identity") + labs(title = "Frequency of New Leprosy Cases per 100k by Region",
    x = "New Leprosy Cases per 100k people", y = "Frequency") +
    facet_grid(~Region) + theme(legend.position = "bottom")

# Number of countries per region, then the regional median of cases per 100k
TotalCases = as.data.frame(table(leprosy$Region))
TotalCases$Median = round(aggregate(density ~ Region, data = leprosy,
    FUN = median)[, 2], 2)
library(kableExtra)
kable_styling(kbl(TotalCases, col.names = c("Regions", "Number of Countries",
    "Median Cases per 100k")))

Regions	Number of Countries	Median Cases per 100k
Africa	45	1.07
Americas	34	0.18
Asia/Pacific	33	0.30
Europe	51	0.00
Middle East	20	0.07

Differences in leprosy prevalence across regions: Leprosy is relatively more prevalent in Africa and Asia/Pacific, whose median cases per 100k (Africa: 1.07; Asia/Pacific: 0.30) are higher than the overall median of 0.24. It is relatively less prevalent in Europe, the Middle East, and the Americas, whose medians (Europe: 0.00; Middle East: 0.07; Americas: 0.18) are lower than the overall median. Ranked from most to least prevalent by median, the regions are: Africa > Asia/Pacific > Americas > Middle East > Europe.
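The table above takes its counts from `table()` and its medians from `aggregate()`, which relies on both returning regions in the same order. A minimal alternative sketch computes both columns from a single split so they cannot fall out of alignment. `summarise_by_region` and `toy` are hypothetical names with made-up values, and note one difference: this version counts only countries with a non-missing rate.

```r
# Hypothetical helper: count and median per region from one split
summarise_by_region <- function(df) {
  rates <- split(df$density, df$Region)  # one numeric vector per region
  data.frame(
    Region = names(rates),
    # Counts only countries whose rate is not missing
    Countries = vapply(rates, function(x) sum(!is.na(x)), integer(1)),
    Median = round(vapply(rates, median, numeric(1), na.rm = TRUE), 2),
    row.names = NULL
  )
}

# Made-up data for illustration only (not the homework data)
toy <- data.frame(
  Region = c("A", "A", "B", "B", "B"),
  density = c(1, 3, 0, 2, NA)
)
region_table <- summarise_by_region(toy)
region_table
```

The resulting data frame could be passed to `kbl()` in the same way as `TotalCases`.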

4. Investigate the relationship between cases per 100K and one of the other variables in the dataset (other than region) by making the appropriate bivariate graph and providing the relevant summary statistic inline within a short paragraph describing the relationship.

library(ggplot2)
dp_ggplot + geom_point(aes(x = density, y = GDP)) + labs(title = "relationship between cases per 100k and GDP",
    x = "Cases per 100k", y = "GDP") + theme_classic()

The correlation coefficient between cases per 100K and GDP is -0.055, indicating a very weak negative linear relationship: GDP explains almost none of the variation in cases per 100K.
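The report states the coefficient inline without showing the call behind it. One way such a value could be computed is sketched below, assuming incomplete rows are dropped; `rate_gdp_correlation` is a hypothetical wrapper around base R's `cor()`, and the toy vectors are made up for illustration. Because the rate is heavily right-skewed, a rank-based (Spearman) coefficient is shown alongside the Pearson one as an optional comparison.

```r
# Hypothetical wrapper: correlation using only rows where both values exist
rate_gdp_correlation <- function(rate, gdp, method = "pearson") {
  cor(rate, gdp, use = "complete.obs", method = method)
}

# Made-up values for illustration only (not the homework data)
toy_rate <- c(1, 2, 3, 4, NA)
toy_gdp <- c(8, 6, 4, 2, 1)

pearson_r <- rate_gdp_correlation(toy_rate, toy_gdp)
spearman_r <- rate_gdp_correlation(toy_rate, toy_gdp, method = "spearman")
c(pearson = pearson_r, spearman = spearman_r)
```

On the real data the call would be `rate_gdp_correlation(leprosy$density, leprosy$GDP)`, assuming the column names shown in the printed output above.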

5. Write a brief conclusion to your analysis summarizing what you found. Include a hyperlink to the website for the International Leprosy Association for readers wanting more information about this disease.

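The number of leprosy cases per 100k in a country is negatively but only very weakly correlated with its GDP (r = -0.055). Countries with better economic conditions therefore tend to have slightly lower leprosy prevalence, which might be explained by better medical technology and better access to medicines, although GDP alone is a poor predictor. Prevalence differs more clearly by region, with the highest median in Africa (1.07 per 100k) and the lowest in Europe (0.00).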

Click on the following hyperlink if you want to know more about leprosy: International Leprosy Association’s website


Zhou Fan - SDS 313 UT Austin

2023-09-16
