Introduction:

In this analysis, I explore the variability of Water, Sanitation, and Hygiene (WASH) data by grouping it in three ways, each defined by a pair of columns, and summarizing one variable within each group. The three groupings are region and service type, year and residence type, and type and coverage. My goal is to understand why certain groups are less common than others and to draw conclusions about the implications of these findings.

Grouping 1: Region and Service Type:

Using the Region and Service Type categorical variables, I compute the average coverage for each combination of region and service type and tag the group(s) with the lowest value as the lowest-probability group. The aim is to understand the factors contributing to disparities in coverage within specific regions and service types.
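As a minimal sketch of this step, the example below computes a mean per combination and picks the lowest one. The `toy` data frame and its values are hypothetical and exist only for illustration; `slice_min()` is used here as one possible way to keep every group tied for the minimum.

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration only
toy <- data.frame(
  Region       = c("A", "A", "A", "B", "B", "B"),
  Service.Type = c("Water", "Water", "Sanitation", "Water", "Sanitation", "Sanitation"),
  Coverage     = c(40, 60, 10, 80, 20, 30)
)

# Mean coverage for each Region x Service.Type combination
toy_summary <- toy %>%
  group_by(Region, Service.Type) %>%
  summarise(Average_Coverage = mean(Coverage), .groups = "drop")

# slice_min() keeps every row tied for the minimum (with_ties = TRUE by default)
toy_lowest <- toy_summary %>%
  slice_min(Average_Coverage, n = 1)

print(toy_lowest)
```

Because ties are kept, the result may contain more than one row, so any later tagging step should be written to handle several lowest groups.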

Grouping 2: Year and Residence Type:

I explore how population totals change across Year and Residence Type categories. By analyzing and visualizing the data, I aim to understand why some combinations of year and residence type have lower population counts than others. This helps generate testable hypotheses to explain these differences.
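When searching for the single lowest combination of year and residence type, it matters whether the summary is still grouped. The sketch below uses a hypothetical `toy` data frame to show the difference: with `.groups = "drop_last"` (the default behavior of `summarise()` after grouping by two columns), the result stays grouped by `Year`, so `min()` is evaluated within each year rather than over the whole table.

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration only
toy <- data.frame(
  Year           = c(2010, 2010, 2011, 2011),
  Residence.Type = c("rural", "urban", "rural", "urban"),
  Population     = c(5, 9, 7, 3)
)

# Result stays grouped by Year, so the filter keeps the lowest row of EACH year
still_grouped <- toy %>%
  group_by(Year, Residence.Type) %>%
  summarise(Total_Population = sum(Population), .groups = "drop_last") %>%
  filter(Total_Population == min(Total_Population))

# Result is ungrouped, so the filter keeps only the overall lowest row
ungrouped <- toy %>%
  group_by(Year, Residence.Type) %>%
  summarise(Total_Population = sum(Population), .groups = "drop") %>%
  filter(Total_Population == min(Total_Population))

nrow(still_grouped)  # one row per year
nrow(ungrouped)      # the single overall minimum
```

Which behavior is wanted depends on the question being asked; for one overall lowest group, the ungrouped version is the appropriate one.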

Grouping 3: Type and Coverage:

By grouping the data on Type and Coverage and using statistical summaries and visualizations, I explore what factors influence the count of each combination and why some groups are less common.

# Read the CSV file
data <- read.csv("C:\\Users\\am790\\Downloads\\washdash-download (1).csv")
# View summary of the data
summary(data)
##      Type              Region          Residence.Type     Service.Type      
##  Length:3367        Length:3367        Length:3367        Length:3367       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##       Year         Coverage         Population        Service.level     
##  Min.   :2010   Min.   :  0.000   Min.   :0.000e+00   Length:3367       
##  1st Qu.:2013   1st Qu.:  2.486   1st Qu.:4.366e+06   Class :character  
##  Median :2016   Median : 12.110   Median :3.306e+07   Mode  :character  
##  Mean   :2016   Mean   : 22.447   Mean   :1.497e+08                     
##  3rd Qu.:2019   3rd Qu.: 34.190   3rd Qu.:1.755e+08                     
##  Max.   :2022   Max.   :100.000   Max.   :2.173e+09
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Grouping 1: Region and Service Type
grouped_data_1 <- data %>%
  group_by(Region, Service.Type) %>%
  summarise(Average_Coverage = mean(Coverage), .groups = "drop")

# Identify lowest probability group
lowest_prob_group_1 <- grouped_data_1 %>%
  filter(Average_Coverage == min(Average_Coverage))

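# Add special tag to original data. Region and Service.Type are matched as a
# pair, so the tag stays correct even if several groups tie for the minimum.
lowest_keys_1 <- paste(lowest_prob_group_1$Region,
                       lowest_prob_group_1$Service.Type, sep = "|")
data$Special_Tag <- ifelse(paste(data$Region, data$Service.Type, sep = "|") %in% lowest_keys_1,
                           "Lowest Probability", "")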
# Load required library
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
# Create a bar plot using the 'grouped_data_1' and 'lowest_prob_group_1' data.
ggplot(grouped_data_1, aes(x = Region, y = Average_Coverage, fill = Service.Type)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(data = lowest_prob_group_1, aes(label = "Lowest Probability"),
            hjust = 1.2, color = "red", size = 4) + # Add text for lowest probability group
  labs(title = "Average Coverage by Region and Service Type",
       x = "Region", y = "Average Coverage", fill = "Service Type") +
  theme_minimal() +
  theme(legend.position = "top") +
  coord_flip()
Figure: Average Coverage by Region and Service Type (horizontal grouped bar plot)
# Grouping 2: Group data by Year and Residence Type.
grouped_data_2 <- data %>%
  group_by(Year, Residence.Type) %>%
  # .groups = "drop" ungroups the result so the minimum below is taken across all groups
  summarise(Total_Population = sum(Population), .groups = "drop")
# Identify lowest probability group
lowest_prob_group_2 <- grouped_data_2 %>%
  filter(Total_Population == min(Total_Population))

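# Add special tag to original data (note: this overwrites the Grouping 1 tag).
# Year and Residence.Type are matched as a pair so multiple lowest rows are handled.
lowest_keys_2 <- paste(lowest_prob_group_2$Year,
                       lowest_prob_group_2$Residence.Type, sep = "|")
data$Special_Tag <- ifelse(paste(data$Year, data$Residence.Type, sep = "|") %in% lowest_keys_2,
                           "Lowest Probability", "")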


# Create a grouped bar plot.
ggplot(grouped_data_2, aes(x = Residence.Type, y = Total_Population, fill = factor(Year))) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_y_continuous(labels = scales::comma) + # Format y-axis labels with thousands separators
  labs(title = "Total Population by Year and Residence Type",
       x = "Residence Type", y = "Total Population", fill = "Year") +
  theme_minimal() +
  theme(legend.position = "top")
Figure: Total Population by Year and Residence Type (grouped bar plot)
# Grouping 3: Type and Coverage
grouped_data_3 <- data %>%
  group_by(Type, Coverage) %>%
  # .groups = "drop" ungroups the result so the minimum below is taken across all groups
  summarise(Count = n(), .groups = "drop")
# Identify lowest probability group
lowest_prob_group_3 <- grouped_data_3 %>%
  filter(Count == min(Count))

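# Add special tag to original data (note: this overwrites the earlier tags).
# Type and Coverage are matched as a pair; two separate %in% tests would also
# tag rows whose Type and Coverage come from different lowest groups.
lowest_keys_3 <- paste(lowest_prob_group_3$Type,
                       lowest_prob_group_3$Coverage, sep = "|")
data$Special_Tag <- ifelse(paste(data$Type, data$Coverage, sep = "|") %in% lowest_keys_3,
                           "Lowest Probability", "")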


# Create a stacked bar plot
ggplot(grouped_data_3, aes(x = Type, y = Count, fill = Coverage)) +
  geom_bar(stat = "identity") +
  labs(title = "Count of Occurrences by Type and Coverage",
       x = "Type", y = "Count", fill = "Coverage") +
  theme_minimal() +
  theme(legend.position = "top")
Figure: Count of Occurrences by Type and Coverage (stacked bar plot)
Grouping 1: Region and Service Type: This visualization shows the average coverage of services categorized by region and service type.

Title: The plot is titled Average Coverage by Region and Service Type, which states what the visualization depicts.
X-axis (Region): Represents the regions in the dataset. Because of coord_flip(), the regions are drawn along the vertical axis.
Y-axis (Average Coverage): Indicates the mean coverage for each combination of region and service type. Because of coord_flip(), it is drawn along the horizontal axis.
Fill (Service Type): The fill color distinguishes service types, with one bar per service type within each region.
Geoms: The bars show the average coverage, and the dodge position places the bars for different service types side by side within the same region. A red text label marks the group(s) with the lowest average coverage, as identified by the analysis.
coord_flip(): This function swaps the x and y axes so that the bars run horizontally, which makes long category labels such as region names easier to read.

This visualization helps to understand how coverage varies across regions and service types, with a specific focus on identifying the combination of region and service type with the lowest average coverage.
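The lowest group is defined by a pair of values, so a row should be tagged only when both columns match the same summary row. The sketch below illustrates one way to do this with a join; the `toy` data and its values are hypothetical, and the join-based approach is an alternative shown for illustration rather than the method used in the analysis.

```r
library(dplyr)

# Hypothetical toy data; (A, Sanitation) and (B, Water) tie for the lowest value
toy <- data.frame(
  Region       = c("A", "A", "B", "B"),
  Service.Type = c("Water", "Sanitation", "Water", "Sanitation"),
  Coverage     = c(50, 10, 10, 90)
)

# Keep every combination tied for the lowest mean and attach the tag to it
lowest <- toy %>%
  group_by(Region, Service.Type) %>%
  summarise(Average_Coverage = mean(Coverage), .groups = "drop") %>%
  slice_min(Average_Coverage, n = 1) %>%
  select(Region, Service.Type) %>%
  mutate(Special_Tag = "Lowest Probability")

# Joining on both columns tags only the exact pairs; rows without a match
# get NA from the join, which coalesce() replaces with an empty string
toy_tagged <- toy %>%
  left_join(lowest, by = c("Region", "Service.Type")) %>%
  mutate(Special_Tag = coalesce(Special_Tag, ""))

print(toy_tagged)
```

Testing each column separately with `%in%` would also tag (A, Water) and (B, Sanitation) in this example, because each of their values appears somewhere among the lowest groups.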

Grouping 2: Year and Residence Type: This visualization shows the total population categorized by year and residence type.

Title: The plot is titled Total Population by Year and Residence Type, which states what the visualization depicts.
X-axis (Residence Type): Represents the residence type categories recorded in the dataset (for example, urban and rural).
Y-axis (Total Population): Indicates the population summed over all rows for each combination of year and residence type.
Fill (Year): The fill color distinguishes years, with one bar per year within each residence type.
Geoms: The bars are grouped by residence type, and within each group, bars for different years are displayed side by side (dodged). This allows for easy comparison of population totals across years within each residence type.
scale_y_continuous(): This function formats the y-axis labels with comma separators for better readability of large numbers.

Overall, this visualization helps to understand how the total population varies across years and residence types, supporting comparisons of population trends over time and across residence types.
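One hypothesis worth checking (an assumption here, not something the data has confirmed) is that a group's total depends partly on how many rows it contains, since the sum adds Population across every row in the group. The sketch below, on a hypothetical `toy` data frame, reports the row count and the mean alongside the total so the two effects can be told apart.

```r
library(dplyr)

# Hypothetical toy data; (2010, rural) has two rows, the other groups have one
toy <- data.frame(
  Year           = c(2010, 2010, 2010, 2011, 2011),
  Residence.Type = c("rural", "rural", "urban", "rural", "urban"),
  Population     = c(100, 100, 300, 110, 320)
)

# Report the number of rows and the mean next to the total for each group
toy_check <- toy %>%
  group_by(Year, Residence.Type) %>%
  summarise(Rows = n(),
            Total_Population = sum(Population),
            Mean_Population = mean(Population),
            .groups = "drop")

print(toy_check)
```

If groups with low totals also have few rows, the difference may reflect how the records are distributed rather than a real difference in population.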

Grouping 3: Type and Coverage: This visualization represents the count of occurrences by type and coverage through a stacked bar plot. Here's an explanation of its components:

Title: The plot is titled Count of Occurrences by Type and Coverage, which states what the visualization depicts.
X-axis (Type): Represents the values of the Type column in the dataset.
Y-axis (Count): Indicates the number of rows for each type.
Fill (Coverage): Coverage is a numeric column (0 to 100 in the summary above), so ggplot2 maps it to a continuous color gradient rather than to discrete categories.
Geoms: Each bar represents one type, and within each bar the segments are stacked on top of each other, one per distinct coverage value. Because most coverage values occur only a few times, many segments are very thin.

This visualization helps to understand the distribution of occurrences across types and coverage values, giving a view of how often each coverage level appears within each type.
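Since Coverage is numeric, one option for obtaining actual coverage categories is to bin it before counting. The sketch below uses `cut()` with hypothetical break points of 0, 25, 50, 75, and 100; the `toy` data, the `Coverage_Band` column name, and the choice of bands are assumptions made for illustration only.

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration only
toy <- data.frame(
  Type     = c("x", "x", "x", "y", "y"),
  Coverage = c(0, 12.5, 30, 75, 100)
)

# Bin Coverage into four bands, then count rows per Type and band.
# include.lowest = TRUE puts a coverage of exactly 0 into the first band.
toy_binned <- toy %>%
  mutate(Coverage_Band = cut(Coverage,
                             breaks = c(0, 25, 50, 75, 100),
                             include.lowest = TRUE)) %>%
  count(Type, Coverage_Band, name = "Count")

print(toy_binned)
```

With a discrete column such as `Coverage_Band` mapped to fill, a stacked bar plot would show one segment per band, which makes less common combinations easier to identify.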