The goal of this assignment was to practice cleaning and manipulation of datasets for downstream analysis work. For the assignment, an untidy CSV was created and loaded into R. The dataset consist of ‘Tuberculosis(TB) incidence rates’ per 100,000 and corresponding ‘Human Development Index (HDI) rates’ for Poland, Germany, Hungary, Portugal, and Switzerland for years 2007, 2008 and 2009.The author of the blackboard post suggested the analysis could examine ‘the relationship between HDL and the incidence of TB’.
Below the untidy CSV was loaded into R via the read.csv command and placed into a dataframe. There were some data quality issues which needed to be addressed before any analysis could be conducted. Some of these issues included: transformation of the dataset to long format so the desired analysis could be conducted, deleting extraneous rows, renaming variables and converting some of the variables to numeric format.
TB<- read.csv("https://raw.githubusercontent.com/goygoyummm/Data607_R/main/20230302_CUNY_DATA_607_Project_2_TB.csv")
DF <- data.table(TB)
DF
The following changes were made to tidy the data.
1. Variables/column names were renamed
2. An extraneous row was deleted
3. All commas were replaced with decimals
4. Transformation to long format
5. Deleting extraneous columns from dataframe
6. Converting some variables to numeric format
#renaming columns
colnames(DF)[1] ="Country"
colnames(DF)[2] ="2007"
colnames(DF)[3] ="2008"
colnames(DF)[4] ="2009"
colnames(DF)[5] ="HDI_2007"
colnames(DF)[6] ="HDI_2008"
colnames(DF)[7] ="HDI_2009"
#deleting row
DF<- DF %>% filter(!row_number() %in% c(1))
#replacing all commas with decimals
dat <- DF %>%
map_df(str_replace, pattern = ",", replacement = ".")
#transform to long format
dat1 = dat %>%
gather("2007", "2008", "2009", key =
Year, value = Incidence_per_100000 ) %>%
gather("HDI_2007", "HDI_2008", "HDI_2009", key =
HDI_Year, value = HDI) %>%
filter(substring(Year, 4) == substring(HDI_Year, 8))
#deleting the extra year column
dat1 = subset(dat1, select = -c(HDI_Year) )
#converting to number
dat1[, c(2:4)] <- sapply(dat1[, c(2:4)], as.numeric)
I plotted TB incidence versus HDI ignoring year and country. The plot suggests a indirect correlation between HDI and TB incidence.
#Plotting all values regardless of year and country
ggplot(data=dat1, aes(x=Incidence_per_100000, y=HDI, group=1)) +
geom_line()+
geom_point() +
geom_smooth(method="lm", se=TRUE, fullrange=FALSE, level=0.95)
## `geom_smooth()` using formula = 'y ~ x'
TB incidence versus HDI was then plotted for each country individually ignoring year. The plots suggest for each country, with the exception of Switzerland, higher HDI rates are correlated with lower TB incidence. It is worth nothing that although Switzerland experienced an increase in TB incidence while also experiencing an increase in HDI, Switzerland’s incidence of TB remained low in comparision to the other countries included in the analysis.
ggplot(data=dat1, aes(x=Incidence_per_100000,y=HDI)) +
geom_line()+
geom_point()+
facet_wrap(~Country) +
ggtitle("TB Incidence vs HDI") +
theme(plot.title = element_text(face="bold", size=20, hjust=0, color="#555555")) +
theme(axis.text.x = element_text(angle=90))
Tidyr and dplyr are very convenient tools for manipulating data, as well as performing analysis. Although there appears to be a indirect correlation between ‘TB Incidence’ and ‘HDI’ for select countries between 2007 and 2009, more analysis is needed.