For this assignment, I am using a malaria dataset derived from CSV files downloaded from Kaggle. The dataset provides estimated numbers of malaria cases and deaths by country and year.
Dataset Source: https://raw.githubusercontent.com/meiqing39/DATA-607/refs/heads/main/Malaria_estimated_numbers.csv
I loaded the raw CSV file from GitHub into a data frame named malaria and attached the tidyverse.
malaria <- read.csv("https://raw.githubusercontent.com/meiqing39/DATA-607/refs/heads/main/Malaria_estimated_numbers.csv")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
All code used here comes from the book R for Data Science (2e).
#Code Base
#Checking the structure of the data frame and the first few values in each column
glimpse(malaria)
## Rows: 856
## Columns: 11
## $ Country <chr> "Afghanistan", "Algeria", "Angola", "Argentina", …
## $ Year <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2…
## $ No..of.cases <chr> "630308[495000-801000]", "0", "4615605[3106000-66…
## $ No..of.deaths <chr> "298[110-510]", "0", "13316[9970-16600]", "0", "0…
## $ No..of.cases_median <int> 630308, 0, 4615605, 0, 0, 0, 32924, 7, 4111699, 1…
## $ No..of.cases_min <int> 495000, NA, 3106000, NA, NA, NA, 30000, NA, 27740…
## $ No..of.cases_max <int> 801000, NA, 6661000, NA, NA, NA, 36000, NA, 65520…
## $ No..of.deaths_median <int> 298, 0, 13316, 0, 0, 0, 76, 0, 7328, 0, 2, 7, 30,…
## $ No..of.deaths_min <int> 110, NA, 9970, NA, NA, NA, 3, NA, 5740, NA, 0, 0,…
## $ No..of.deaths_max <int> 510, NA, 16600, NA, NA, NA, 130, NA, 8920, NA, 4,…
## $ WHO.Region <chr> "Eastern Mediterranean", "Africa", "Africa", "Ame…
#Select, rename, and drop missing values; assign the result to a new name for use in the scatter plot
I used dplyr's select() and rename(), then tidyr's drop_na(), to clean the data. After inspecting the data, I saw that some values are missing and that the column names are awkward to use because they contain many periods. I selected a few columns and renamed them, then removed the rows with NA in max_cases or max_deaths.
malaria_clean <- malaria |>
  select(Country, Year, No..of.cases_max, No..of.deaths_max) |>
  rename(
    country = Country,
    year = Year,
    max_cases = No..of.cases_max,
    max_deaths = No..of.deaths_max
  ) |>
  drop_na(max_cases, max_deaths)
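To make the effect of the cleaning step above easier to see, here is a minimal sketch that applies the same select(), rename(), and drop_na() calls to a three-row data frame. The rows are the first three shown in the glimpse() output (Afghanistan, Algeria, Angola), with the other columns left out. The names toy and toy_clean are only for illustration and are not part of the assignment code.

```r
library(dplyr)
library(tidyr)

# First three rows from the glimpse() output, limited to the four columns
# used in the cleaning step. Algeria has NA for both maximum values.
toy <- data.frame(
  Country = c("Afghanistan", "Algeria", "Angola"),
  Year = c(2017L, 2017L, 2017L),
  No..of.cases_max = c(801000L, NA, 6661000L),
  No..of.deaths_max = c(510L, NA, 16600L)
)

toy_clean <- toy |>
  # Keep only the columns needed for the plot
  select(Country, Year, No..of.cases_max, No..of.deaths_max) |>
  # rename() takes new_name = old_name
  rename(
    country = Country,
    year = Year,
    max_cases = No..of.cases_max,
    max_deaths = No..of.deaths_max
  ) |>
  # Drop rows where either maximum is NA
  drop_na(max_cases, max_deaths)

print(toy_clean)
```

The Algeria row is dropped because both of its maximum values are NA, so two rows remain with the shorter column names.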
#Max cases vs. max deaths scatterplot
I used ggplot2 (part of the tidyverse) to explore the relationship between the maximum estimated number of malaria cases and the maximum estimated number of malaria deaths. Each point is one country in one year.
ggplot(malaria_clean, aes(x = max_cases, y = max_deaths)) +
  geom_point()
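As a possible refinement of the scatterplot above, the sketch below adds readable axis labels with labs(). It uses a small stand-in data frame so that it runs without downloading the file. The three points are the first three non-missing max_cases and max_deaths pairs visible in the glimpse() output. The label text and the name toy_clean are my own choices, not part of the assignment code.

```r
library(ggplot2)

# Stand-in for malaria_clean: the first three complete (max_cases, max_deaths)
# pairs visible in the glimpse() output.
toy_clean <- data.frame(
  max_cases = c(801000, 6661000, 36000),
  max_deaths = c(510, 16600, 130)
)

# Same mapping and geom as the original plot, with axis labels added
p <- ggplot(toy_clean, aes(x = max_cases, y = max_deaths)) +
  geom_point() +
  labs(
    x = "Maximum estimated malaria cases",
    y = "Maximum estimated malaria deaths"
  )

p
```

Replacing toy_clean with malaria_clean should give the same plot as before, only with clearer axis titles.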
#Conclusion
In this assignment, I loaded a dataset from GitHub, inspected its structure, cleaned it using select(), rename(), and drop_na(), and created a scatterplot to explore the relationships in the data. There is a clear positive relationship between maximum malaria cases and maximum malaria deaths. To extend this, I could analyze trends across years, compare different countries, and check the values against the WHO source for accuracy. I could also add more variables and other visualizations to present more information and deepen the analysis.
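One way to start on the trends across years is to total the values per year with group_by() and summarise(). This is a rough sketch, not a finished analysis. The data frame below uses made-up countries and numbers purely for illustration, with the same column names as malaria_clean. Summing upper-bound estimates is only a crude summary.

```r
library(dplyr)

# Hypothetical values for illustration only; NOT taken from the malaria dataset.
toy_clean <- data.frame(
  country = c("A", "B", "A", "B"),
  year = c(2016L, 2016L, 2017L, 2017L),
  max_cases = c(100L, 200L, 150L, 250L),
  max_deaths = c(10L, 20L, 12L, 18L)
)

yearly <- toy_clean |>
  # One group per year
  group_by(year) |>
  # Total the upper-bound estimates within each year
  summarise(
    total_max_cases = sum(max_cases),
    total_max_deaths = sum(max_deaths),
    .groups = "drop"
  ) |>
  arrange(year)

print(yearly)
```

The result has one row per year, which could then be passed to ggplot() with geom_line() to look at the trend over time.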