Assignment1_malaria

Approach

For this assignment, I am using a malaria dataset built from CSV files downloaded from Kaggle. The dataset provides the estimated number of malaria cases and deaths for countries across the world, by year and WHO region.

Dataset Source: https://raw.githubusercontent.com/meiqing39/DATA-607/refs/heads/main/Malaria_estimated_numbers.csv

I loaded the raw CSV file from GitHub into a data frame named malaria, and then loaded the tidyverse.

malaria <- read.csv("https://raw.githubusercontent.com/meiqing39/DATA-607/refs/heads/main/Malaria_estimated_numbers.csv")

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

All of the code used here comes from the book R for Data Science (2e).

Code Base

I used glimpse() to check that the data loaded as a data frame and to preview the first few values of each column.

glimpse(malaria)

## Rows: 856
## Columns: 11
## $ Country              <chr> "Afghanistan", "Algeria", "Angola", "Argentina", …
## $ Year                 <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2…
## $ No..of.cases         <chr> "630308[495000-801000]", "0", "4615605[3106000-66…
## $ No..of.deaths        <chr> "298[110-510]", "0", "13316[9970-16600]", "0", "0…
## $ No..of.cases_median  <int> 630308, 0, 4615605, 0, 0, 0, 32924, 7, 4111699, 1…
## $ No..of.cases_min     <int> 495000, NA, 3106000, NA, NA, NA, 30000, NA, 27740…
## $ No..of.cases_max     <int> 801000, NA, 6661000, NA, NA, NA, 36000, NA, 65520…
## $ No..of.deaths_median <int> 298, 0, 13316, 0, 0, 0, 76, 0, 7328, 0, 2, 7, 30,…
## $ No..of.deaths_min    <int> 110, NA, 9970, NA, NA, NA, 3, NA, 5740, NA, 0, 0,…
## $ No..of.deaths_max    <int> 510, NA, 16600, NA, NA, NA, 130, NA, 8920, NA, 4,…
## $ WHO.Region           <chr> "Eastern Mediterranean", "Africa", "Africa", "Ame…

Select, rename, and drop missing values

After inspecting the data, I saw that some values are missing and that the column names are awkward to use because they contain many periods. I used dplyr's select() and rename() to keep four columns and give them shorter names, then tidyr's drop_na() to remove the rows where max_cases or max_deaths is NA. I stored the result under a new name, malaria_clean, so it can be used to create the scatterplot.

malaria_clean <- malaria |>
  select(Country, Year, No..of.cases_max, No..of.deaths_max) |>
  rename(
    country = Country,
    year = Year,
    max_cases = No..of.cases_max,
    max_deaths = No..of.deaths_max
  ) |>
  drop_na(max_cases, max_deaths)
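One way to confirm where the missing values are, and how many rows drop_na() removes, is sketched below. It runs on a small made-up data frame (toy is hypothetical, not the real data) that reuses the same column names, so the counts it produces say nothing about the actual malaria dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data that reuses the column names of the malaria data frame
toy <- data.frame(
  Country = c("A", "B", "C"),
  Year = c(2017L, 2017L, 2017L),
  No..of.cases_max = c(801000L, NA, 36000L),
  No..of.deaths_max = c(510L, NA, 130L)
)

# Count the missing values in every column
na_counts <- toy |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Apply the same select/rename/drop_na steps, then compare row counts
toy_clean <- toy |>
  select(Country, Year, No..of.cases_max, No..of.deaths_max) |>
  rename(
    country = Country,
    year = Year,
    max_cases = No..of.cases_max,
    max_deaths = No..of.deaths_max
  ) |>
  drop_na(max_cases, max_deaths)

rows_dropped <- nrow(toy) - nrow(toy_clean)
```

Replacing toy with malaria should give the same kind of summary for the real data, assuming the column names match the glimpse() output above.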

Max cases vs. max deaths scatterplot

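I used ggplot2, which is part of the tidyverse, to explore the relationship between the upper-bound estimate of malaria cases (max_cases) and the upper-bound estimate of malaria deaths (max_deaths). Each point is one row of malaria_clean, that is, one country in one year.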

ggplot(malaria_clean, aes(x = max_cases, y = max_deaths)) +
  geom_point()
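A scatterplot shows the direction of a relationship, but a correlation coefficient can put a rough number on it. The sketch below uses base R's cor() on three (max_cases, max_deaths) pairs copied from the glimpse() output above. The name toy_clean is hypothetical, and three points are far too few to draw a conclusion, so this only illustrates the call.

```r
# Three (max_cases, max_deaths) pairs taken from the glimpse() output;
# toy_clean is a hypothetical stand-in for malaria_clean
toy_clean <- data.frame(
  max_cases = c(801000, 6661000, 36000),
  max_deaths = c(510, 16600, 130)
)

# Pearson correlation measures linear association on the raw scale
r_pearson <- cor(toy_clean$max_cases, toy_clean$max_deaths)

# Spearman correlation uses ranks, so a few very large values dominate less
r_spearman <- cor(toy_clean$max_cases, toy_clean$max_deaths, method = "spearman")
```

Running the same two calls on malaria_clean would be one way to check how strong the pattern in the plot is.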

Conclusion

In this assignment, I loaded the dataset from GitHub, inspected its structure, cleaned it using select(), rename(), and drop_na(), and created a scatterplot to explore the relationships in the data. The scatterplot shows a clear positive relationship between the maximum estimated malaria cases and the maximum estimated malaria deaths. To extend this work, I could analyze trends across years, compare different countries, and check the dataset against the WHO source to verify the accuracy of the values. I could also add more variables and other visualizations to present more information and deepen the analysis.
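As a starting point for the trend-across-years idea, the sketch below totals the upper-bound estimates by year with group_by() and summarise(). The rows are made up (toy_clean and its values are hypothetical), and adding upper bounds across countries gives only a rough total, not a proper upper bound for the world.

```r
library(dplyr)

# Hypothetical rows shaped like malaria_clean
toy_clean <- data.frame(
  country = c("A", "B", "A", "B"),
  year = c(2016L, 2016L, 2017L, 2017L),
  max_cases = c(100, 200, 150, 250),
  max_deaths = c(10, 20, 12, 18)
)

# One row per year with the summed upper-bound estimates
yearly <- toy_clean |>
  group_by(year) |>
  summarise(
    total_max_cases = sum(max_cases),
    total_max_deaths = sum(max_deaths),
    .groups = "drop"
  ) |>
  arrange(year)
```

The yearly table could then be passed to ggplot() with geom_line() to draw the trend.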

Assignment1_malaria_cases

Mei Qi Ng

2026-02-01
