Introduction:
The WHO reported that in 2020, millions of people in the European Region lacked access to basic drinking water (15 million), basic sanitation (29 million), and hygiene services (4 million) (WHO, 2022). Most attention has gone to households, with little or none given to unserved or homeless persons. Surface water was still used by 2.5 million people in western and central Asia and in southern and eastern Europe in 2020. Although open defecation and surface-water drinking have declined since 2000, almost 99.8% of the rural population is reported to still practice open defecation. One of the main reasons reported for this gap is social inequality (WHO, 2022).
The data extracted from the WHO comprise 8 columns covering SDG Goal 6, European regions, residence type (rural/urban/total), service type (drinking water/sanitation/hygiene), year (2010-2022), coverage, population, and service level (safely managed/basic/limited/at least basic/unimproved/surface water). The data make it possible to identify gaps in service type and service level within a specific region, and to analyze which regions may need more effort as their populations grow.

Three Novel Questions for Further Investigation:
Question_1: How does the distribution of basic drinking water, sanitation, and hygiene services differ among various European regions?
Question_2: What trends can be observed in the coverage of essential services from 2010 to 2022, and how are these trends associated with changes in population?
Question_3: Have there been significant disparities in service levels between rural and urban areas within the European Region, and how have these disparities evolved over time?
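
As a starting point for the third question, the rural-urban disparity could be expressed as a per-year coverage gap. The sketch below is illustrative only: the data frame `toy`, its values, and the column name `urban_minus_rural` are hypothetical, and it assumes that each region and year has exactly one rural and one urban row for a given service type and level.

```r
library(dplyr)

# Hypothetical rows mimicking the dataset's columns (values are illustrative only)
toy <- data.frame(
  Region = "Region A",
  Year = rep(c(2010, 2022), each = 2),
  `Residence Type` = rep(c("rural", "urban"), times = 2),
  Coverage = c(60, 90, 75, 95),
  check.names = FALSE
)

# Urban minus rural coverage for each region and year
gap_by_year <- toy %>%
  group_by(Region, Year) %>%
  summarize(
    urban_minus_rural = Coverage[`Residence Type` == "urban"] -
      Coverage[`Residence Type` == "rural"],
    .groups = "drop"
  )

print(gap_by_year)
```

A gap that shrinks across years would suggest that the disparity is narrowing. On the real data, the rows would first need to be filtered to a single service type and service level.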

Research Purpose:
The primary purpose of this project is to compare the service levels that different regions receive over time, to assess whether coverage keeps pace with population change, and to identify regions that may need attention in the coming years. A further aim is to predict which regions are moving toward a more sustainable lifestyle in the coming years.

Visualization:
Line Chart: The line chart shows trends in the coverage of essential services and in population over 2010-2022. The solid line is the mean coverage of essential services and the dashed line is the mean population. Both lines trend upward overall, which points to a possible association between service provision and population growth: as the population increases, coverage of essential services increases as well, which could indicate positive developments in public health infrastructure. Because both series share one y-axis and population values are several orders of magnitude larger than coverage (a percentage), the coverage trend is hard to read from this chart alone.

Bar Plot:
The bar plot displays the distribution of population across regions. Because the plot is drawn from the unaggregated data, each bar stacks the Population values of every row for that region (all years, service types, and service levels), so the bars support a relative comparison between regions rather than a head count.

Correlation Analysis: The Pearson correlation coefficient and its significance are calculated with the cor.test() function. This analysis quantifies the relationship between coverage of essential services and population. The coefficient indicates the strength and direction of the linear relationship, while the p-value assesses whether the correlation is statistically significant. Here the result is r = 0.615 (95% confidence interval 0.594 to 0.636, p < 2.2e-16), a moderately strong positive linear association.
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# Load the data
df <- read_csv("/Users/mohammedhossain/Desktop/washdash-download (1).csv")
## Rows: 3367 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Type, Region, Residence Type, Service Type, Service level
## dbl (3): Year, Coverage, Population
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(df)
##      Type              Region          Residence Type     Service Type      
##  Length:3367        Length:3367        Length:3367        Length:3367       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##       Year         Coverage         Population        Service level     
##  Min.   :2010   Min.   :  0.000   Min.   :0.000e+00   Length:3367       
##  1st Qu.:2013   1st Qu.:  2.486   1st Qu.:4.366e+06   Class :character  
##  Median :2016   Median : 12.110   Median :3.306e+07   Mode  :character  
##  Mean   :2016   Mean   : 22.447   Mean   :1.497e+08                     
##  3rd Qu.:2019   3rd Qu.: 34.190   3rd Qu.:1.755e+08                     
##  Max.   :2022   Max.   :100.000   Max.   :2.173e+09
options(scipen = 999)

# Aggregate coverage and population by year, naming and rounding the
# means in one step, and store Year as an integer
aggregated_data <- df %>%
  group_by(Year) %>%
  summarize(Mean_Coverage = round(mean(Coverage), digits = 2),
            Mean_Population = round(mean(Population), digits = 2)) %>%
  mutate(Year = as.integer(Year))

# Plotting trends in coverage and population over time
ggplot(aggregated_data, aes(x = Year)) +
  geom_line(aes(y = Mean_Coverage, color = "Coverage"), linetype = "solid", linewidth = 1.5) +
  geom_line(aes(y = Mean_Population, color = "Population"), linetype = "dashed", linewidth = 1.5) +
  labs(title = "Trends in Coverage of Essential Services and Population (2010-2022)",
       x = "Year", y = "Mean Value") +
  scale_color_manual(values = c("Coverage" = "blue", "Population" = "red")) +
  theme_minimal() +
  theme(legend.key.size = unit(1.5, "lines"))
[Figure: Trends in Coverage of Essential Services and Population (2010-2022)]
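
The line chart code above draws a percentage (coverage) and a head count (population) on one y-axis. One possible way to make both trends readable is to rescale population onto the coverage range and label a secondary axis. The following is a minimal sketch under stated assumptions: `toy` is a hypothetical stand-in for `aggregated_data` with illustrative values, and `scale_factor` is a name introduced here.

```r
library(ggplot2)

# Hypothetical yearly means standing in for aggregated_data (illustrative values only)
toy <- data.frame(
  Year = 2010:2014,
  Mean_Coverage = c(20, 21, 22, 23, 24),
  Mean_Population = c(1.40e8, 1.45e8, 1.50e8, 1.55e8, 1.60e8)
)

# Factor that maps the population range onto the coverage range
scale_factor <- max(toy$Mean_Population) / max(toy$Mean_Coverage)

p <- ggplot(toy, aes(x = Year)) +
  geom_line(aes(y = Mean_Coverage, color = "Coverage"), linewidth = 1.5) +
  # Population is divided by scale_factor so it shares the primary axis
  geom_line(aes(y = Mean_Population / scale_factor, color = "Population"),
            linetype = "dashed", linewidth = 1.5) +
  scale_y_continuous(
    name = "Mean coverage (%)",
    # The secondary axis undoes the rescaling so its labels show population
    sec.axis = sec_axis(~ . * scale_factor, name = "Mean population")
  ) +
  scale_color_manual(values = c("Coverage" = "blue", "Population" = "red")) +
  theme_minimal()

print(p)
```

Dual axes can suggest a tighter relationship than the data support, so two separate panels may be a safer alternative.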
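# Bar plot
# Note: df has many rows per region (years, service types, service levels),
# so geom_col() stacks all of them and each bar shows their summed Population.
ggplot(df, aes(x = Region, y = Population, fill = Region)) +
  geom_col() +
  labs(title = "Population by Region", x = "Region", y = "Population") +
  theme_minimal()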
[Figure: Population by Region]
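
The bar plot code above stacks every row for a region. If one bar per region is meant to show a single population figure, the data could first be filtered so that each region contributes exactly one row. The sketch below is a hedged illustration: `toy`, its values, and the particular filter values ("total", "Drinking water", "At least basic") are hypothetical, and the real dataset's category labels should be checked before filtering.

```r
library(dplyr)
library(ggplot2)

# Hypothetical rows mimicking the dataset's columns (illustrative values only)
toy <- data.frame(
  Region = c("A", "A", "B", "B"),
  Year = c(2021, 2022, 2021, 2022),
  `Residence Type` = "total",
  `Service Type` = "Drinking water",
  `Service level` = "At least basic",
  Population = c(100, 110, 200, 190),
  check.names = FALSE
)

# Keep the latest year and one category combination: one row per region
one_per_region <- toy %>%
  filter(Year == max(Year),
         `Residence Type` == "total",
         `Service Type` == "Drinking water",
         `Service level` == "At least basic")

p <- ggplot(one_per_region, aes(x = Region, y = Population, fill = Region)) +
  geom_col() +
  labs(title = "Population by Region (latest year)",
       x = "Region", y = "Population") +
  theme_minimal()

print(p)
```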
library(ppcor)  # loaded here, although cor.test() itself comes from base R's stats package
# Calculate Pearson correlation coefficient and its significance
cor_result <- cor.test(df$Coverage, df$Population, method = "pearson")
cor_result
## 
##  Pearson's product-moment correlation
## 
## data:  df$Coverage and df$Population
## t = 45.286, df = 3365, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5939276 0.6359223
## sample estimates:
##       cor 
## 0.6153614
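
To report the result in prose, the components of the object returned by cor.test() can be read directly. This is a minimal sketch using hypothetical vectors in place of `df$Coverage` and `df$Population`; the variable names below are introduced here for illustration.

```r
# Hypothetical vectors standing in for df$Coverage and df$Population
coverage <- c(5, 12, 20, 35, 50, 64, 80, 95)
population <- c(1.0e6, 2.5e6, 3.0e6, 6.0e6, 7.5e6, 9.0e6, 1.3e7, 1.4e7)

res <- cor.test(coverage, population, method = "pearson")

r <- unname(res$estimate)  # correlation coefficient
p_value <- res$p.value     # two-sided p-value
ci <- res$conf.int         # 95% confidence interval for r
r_squared <- r^2           # share of variance accounted for by a linear fit

cat(sprintf("r = %.3f, r^2 = %.3f, p = %.4g, 95%% CI [%.3f, %.3f]\n",
            r, r_squared, p_value, ci[1], ci[2]))
```

Applied to the output above, r = 0.615 gives r^2 of about 0.38, so a linear relationship with population accounts for roughly 38% of the variance in coverage.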