Intro to Stats in R - Final Project
Mohammed Ahsan Hossain
April 21 2024

Introduction:
As I start looking into the data about essential services, I'm eager to uncover what it can tell us. This data holds a lot of details about how essential services are provided across different places and times, and to different groups of people. By digging into it and making some pictures, I hope to see how things are spread out, how they change over time, and if there are any connections between different aspects.
I want to start by just looking at the different types of services and how common each one is. Then, I'll check if certain types of services are more common in some places compared to others. After that, I'll see if there's a pattern in how service coverage changes over the years. And I'm curious if there's a relationship between how many people live somewhere and how much service coverage they have.
I'm also interested in whether coverage differs between residence types, such as rural versus urban areas, and in how service levels change over time. I'll then measure the correlation between year and coverage, compute the average coverage for each service level, and compare coverage across regions.
After all that, I'll try to predict future coverage from what we've seen so far, using a random forest and a linear regression model. Finally, I'll check whether there's a connection between how many people live somewhere and which kinds of services are available. This journey through the data isn't just about numbers and graphs. It's about findings that can inform decisions on how to provide essential services to everyone, no matter where they live. So, let's dive in and see what stories the data has to tell!



# Load required libraries
library(ggplot2)
library(dplyr)
library(tidyr)
library(randomForest)
# Read the CSV file
data <- read.csv("C:\\Users\\am790\\Downloads\\washdash-download (1).csv")

'Service.Type' Variation Across 'Region's:
As I began exploring the dataset, I first looked at the distribution of 'Type'. A bar plot makes it easy to see how common each type is and to spot notable patterns or outliers, which gives a foundation for understanding the composition of the dataset.
I then used a stacked bar plot to show how service types vary across regions. Each bar counts the records for one region, split by service type, so it shows whether certain services are concentrated in specific areas. Seeing these regional differences helps me assess whether resources are distributed equitably, and it can inform targeted interventions or resource allocation where services are lacking.

# 'Service.Type' variation across 'Region's
ggplot(data, aes(x = Region, fill = factor(Service.Type))) +
  geom_bar(position = "stack") +
  labs(title = "Service.Type Variation Across Regions")
[Figure: Service.Type Variation Across Regions (stacked bar plot)]

'Coverage' Change Over Years for Different 'Service.Type's:
I plotted the trend of coverage over the years for each service type. Because a year can contain many records per service type, the plot shows the mean coverage per year, which lets me track how coverage has evolved and whether there are any noticeable trends. Monitoring these changes is essential for judging whether coverage is improving, stagnating, or declining for each service type, and for evaluating the effectiveness of interventions and policies.

# 'Coverage' change over years for different 'Service.Type's
# Average coverage within each year so every service type gets a single line
ggplot(data, aes(x = Year, y = Coverage, color = Service.Type)) +
  stat_summary(fun = mean, geom = "line") +
  labs(title = "Coverage Trend Over Years by Service.Type", y = "Mean Coverage")
[Figure: Coverage Trend Over Years by Service.Type (line plot)]

Relationship Between 'Population' and 'Coverage':
I created a scatter plot to explore the relationship between population size and coverage. It shows whether coverage scales with population or whether there are discrepancies, for example large populations with low coverage. Any pattern or correlation between the two can inform resource allocation and service planning so that services keep pace with growing populations.

# Relationship between 'Population' and 'Coverage'
ggplot(data, aes(x = Population, y = Coverage)) +
  geom_point() +
  labs(title = "Relationship Between Population and Coverage")
[Figure: Relationship Between Population and Coverage (scatter plot)]

Difference in 'Coverage' Between 'Residence.Type's:
The box plot compares coverage levels between residence types (e.g., rural vs. urban). Understanding these differences is critical for addressing equity issues and ensuring fair access to services across population groups. The plot helps identify disparities in coverage between residence types, which may stem from differences in infrastructure, service delivery models, or socioeconomic factors.

# Difference in 'Coverage' between 'Residence.Type's
ggplot(data, aes(x = Residence.Type, y = Coverage)) +
  geom_boxplot() +
  labs(title = "Coverage Difference Between Residence.Types")
[Figure: Coverage Difference Between Residence.Types (box plot)]
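
A box plot shows a difference visually; a formal test can check whether the difference is larger than chance alone would produce. The following is a minimal sketch, assuming exactly two residence types and using a simulated data frame (`toy` and all of its values are invented for illustration and are not taken from the project's CSV). It applies a Welch two-sample t-test from base R.

```r
# Simulated stand-in data; NOT the project's dataset
set.seed(1)
toy <- data.frame(
  Residence.Type = rep(c("rural", "urban"), each = 50),
  Coverage = c(rnorm(50, mean = 40, sd = 15), rnorm(50, mean = 60, sd = 15))
)
# Keep the simulated percentages inside the 0-100 range
toy$Coverage <- pmin(pmax(toy$Coverage, 0), 100)

# Welch two-sample t-test comparing mean coverage between the two groups
test_result <- t.test(Coverage ~ Residence.Type, data = toy)
print(test_result)
```

With more than two residence types, the formula interface of `t.test()` would not apply directly, and the data would first need to be filtered to the two groups being compared.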

Is there any correlation between 'Year' and 'Coverage'?
To explore the relationship between year and coverage, I calculated the correlation coefficient between these two variables. This statistic quantifies the strength and direction of their linear relationship: a positive value means coverage tends to increase with each passing year, while a negative value indicates the opposite. The result, about 0.015, is very close to zero, so across the pooled data there is essentially no linear relationship between year and coverage.

# Calculate the correlation between 'Year' and 'Coverage'
correlation <- cor(data$Year, data$Coverage)
print(correlation)
## [1] 0.01474723
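
A single coefficient does not say how uncertain the estimate is. As a minimal sketch, assuming simulated stand-in data (the `toy` data frame and its values are invented for illustration), `cor.test()` from base R returns the same Pearson coefficient together with a p-value and a confidence interval.

```r
# Simulated stand-in data; NOT the project's dataset
set.seed(42)
toy <- data.frame(Year = rep(2000:2019, each = 5))
# Coverage is drawn independently of Year by construction
toy$Coverage <- runif(nrow(toy), min = 0, max = 100)

# Pearson correlation test: estimate, p-value, and 95% confidence interval
result <- cor.test(toy$Year, toy$Coverage)
print(result$estimate)  # the same value cor() would return
print(result$p.value)
print(result$conf.int)
```

A confidence interval that includes zero would be consistent with the absence of a linear relationship.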

Calculate the average 'Coverage' for each 'Service.level':
I computed the average coverage for each service level by aggregating the data and taking the mean coverage value. This shows the typical coverage associated with each service level category and helps identify disparities between them. In the output below, 'At least basic' has the highest average coverage (about 63.9) and 'Surface water' the lowest (about 4.0). These averages are a starting point for assessing the overall performance of services and identifying areas for improvement.

# Calculate the average 'Coverage' for each 'Service.level'
average_coverage <- aggregate(Coverage ~ Service.level, data = data, FUN = mean)
print(average_coverage)
##             Service.level  Coverage
## 1          At least basic 63.866782
## 2           Basic service 31.272921
## 3         Limited service  9.133438
## 4 No handwashing facility 15.761967
## 5         Open defecation  8.282577
## 6  Safely managed service 59.621061
## 7           Surface water  4.042165
## 8              Unimproved  9.866627
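
A mean on its own hides how much the values vary and how many records stand behind it. The sketch below, which assumes a tiny invented data frame (`toy` is hypothetical, not the project's data), extends the same `aggregate()` call to report the mean, standard deviation, and record count per service level.

```r
# Tiny invented example; NOT the project's dataset
toy <- data.frame(
  Service.level = c("Basic service", "Basic service",
                    "Unimproved", "Unimproved", "Unimproved"),
  Coverage = c(30, 34, 8, 10, 12)
)

# Return several statistics per group from one aggregate() call
summary_stats <- aggregate(
  Coverage ~ Service.level, data = toy,
  FUN = function(x) c(mean = mean(x), sd = sd(x), n = length(x))
)
# aggregate() stores the statistics as a matrix column; flatten it
summary_stats <- do.call(data.frame, summary_stats)
print(summary_stats)
```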

Plot boxplot to visualize 'Coverage' distribution by 'Region':
To explore the variation in coverage across different regions, I created a box plot that visualizes the distribution of coverage levels for each region. This visualization allows me to compare the spread and central tendency of coverage across regions, helping me identify any regional disparities or patterns. Understanding how coverage varies by region is crucial for ensuring equitable access to services and identifying regions that may require additional support or resources.

# Plot boxplot to visualize 'Coverage' distribution by 'Region'
ggplot(data, aes(x = Region, y = Coverage)) +
  geom_boxplot() +
  labs(title = "Variation of Coverage by Region")
[Figure: Variation of Coverage by Region (box plot)]

Predict future 'Coverage' based on historical data:
To address this question, I first split the data at random into a training set (80%) and a test set (20%) so that the model is evaluated on records it has not seen. I then trained a random forest with 'Coverage' as the target variable and all other available variables as predictors, made predictions on the test data, and calculated three performance metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared (R2), computed here as the squared correlation between predicted and observed coverage. In the run shown, the MAE was about 3.72, the RMSE about 5.31, and R2 about 0.97. Because the split is random and no seed is set, the exact values change slightly from run to run. A random split measures how well the model fills in held-out records from the same years, which is a more forgiving task than forecasting years it has never seen. Accurate predictions of coverage can still inform decision-making and resource allocation for service planning and delivery.

# Predict 'Coverage' from historical data
# Split data into train (80%) and test (20%) sets
train_indices <- sample(1:nrow(data), 0.8 * nrow(data))
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]

# Train a random forest model to predict 'Coverage'
model_rf <- randomForest(Coverage ~ ., data = train_data)

# Predict 'Coverage' for test data
predictions <- predict(model_rf, newdata = test_data)
# Mean Absolute Error (MAE)
mae <- mean(abs(predictions - test_data$Coverage))
print(paste("Mean Absolute Error (MAE):", mae))
## [1] "Mean Absolute Error (MAE): 3.72472397968293"
# Root Mean Squared Error (RMSE)
rmse <- sqrt(mean((predictions - test_data$Coverage)^2))
print(paste("Root Mean Squared Error (RMSE):", rmse))
## [1] "Root Mean Squared Error (RMSE): 5.30633569472508"
# R-squared (R2)
rsquared <- cor(predictions, test_data$Coverage)^2
print(paste("R-squared (R2):", rsquared))
## [1] "R-squared (R2): 0.973378619642059"
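
The evaluation steps above can be made repeatable by fixing the random seed. The following is a minimal sketch under stated assumptions: the data are simulated, a simple linear model stands in for the random forest to keep the example light, and `regression_metrics` is a hypothetical helper name. It also computes R-squared as 1 - SSE/SST, which, unlike the squared correlation, penalizes predictions that are systematically too high or too low.

```r
# Simulated stand-in data; NOT the project's dataset
set.seed(123)  # fixing the seed makes the random split repeatable
toy <- data.frame(Year = sample(2000:2020, 200, replace = TRUE))
toy$Coverage <- 20 + 2 * (toy$Year - 2000) + rnorm(200, sd = 5)

# 80/20 split into training and test rows
train_indices <- sample(seq_len(nrow(toy)), size = floor(0.8 * nrow(toy)))
train_data <- toy[train_indices, ]
test_data <- toy[-train_indices, ]

# A linear model stands in for the random forest in this sketch
fit <- lm(Coverage ~ Year, data = train_data)
predictions <- predict(fit, newdata = test_data)

# Hypothetical helper collecting the three metrics used above
regression_metrics <- function(predicted, actual) {
  errors <- actual - predicted
  c(
    MAE = mean(abs(errors)),
    RMSE = sqrt(mean(errors^2)),
    R2 = 1 - sum(errors^2) / sum((actual - mean(actual))^2)
  )
}

metrics <- regression_metrics(predictions, test_data$Coverage)
print(metrics)
```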

Predict future 'Coverage' with linear regression:
In this step, I explored another approach to predicting future coverage by fitting a linear regression model to the historical data, with 'Year', 'Population', and 'Service.Type' as predictors of coverage. Once the model was fitted, I made predictions for future years (2023, 2024, and 2025) using example values for population and a service type that occurs in the data. Invented labels such as "Type1" cannot be used here, because the model can only predict for service types it has seen. This analysis offers an alternative method for forecasting coverage and shows how the chosen predictors relate to it.

# Predict future 'Coverage' with a linear regression model
model <- lm(Coverage ~ Year + Population + Service.Type, data = data)

# Example inputs for future years (illustrative population values)
future_years <- c(2023, 2024, 2025)
future_population <- c(5000000, 5500000, 6000000)
# Use a service type that actually occurs in the data
future_service_type <- rep(as.character(data$Service.Type[1]), 3)

future_data <- data.frame(Year = future_years, Population = future_population, Service.Type = future_service_type)

# Predicted coverage for the example inputs
predictions_lm <- predict(model, newdata = future_data)
print(predictions_lm)
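
A point forecast from a linear model is easier to judge with an interval around it. The following is a minimal sketch on simulated data (the `toy` data frame and its coefficients are invented for illustration); it uses the `interval = "prediction"` option of `predict()` for `lm` objects to attach 95% prediction intervals to the forecasts.

```r
# Simulated stand-in data; NOT the project's dataset
set.seed(7)
toy <- data.frame(Year = 2000:2020)
toy$Coverage <- 30 + 1.5 * (toy$Year - 2000) + rnorm(nrow(toy), sd = 3)

fit <- lm(Coverage ~ Year, data = toy)

# Forecast three future years with 95% prediction intervals
future <- data.frame(Year = c(2023, 2024, 2025))
forecast <- predict(fit, newdata = future, interval = "prediction", level = 0.95)
print(cbind(future, forecast))  # columns: Year, fit, lwr, upr
```

The intervals widen as the forecast year moves further from the observed years, which is a reminder that extrapolated values are less certain.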

Top regions with the highest average 'Coverage':
To identify regions with the highest average coverage, I calculated the mean coverage for each region using aggregate statistics. By ranking regions based on their average coverage values, I can pinpoint areas where service delivery has been particularly effective or where coverage levels are notably high. Understanding which regions perform well in terms of coverage can help identify successful strategies or interventions that could be replicated in other areas to improve overall service delivery and outcomes.

# Calculate average coverage for each region
average_coverage <- aggregate(Coverage ~ Region, data = data, FUN = mean)

# Identify top regions with highest average coverage
top_regions <- average_coverage[order(-average_coverage$Coverage), ]
# Show the five regions with the highest average coverage
print(head(top_regions, 5))

'Service.level' variation between different 'Type's:
To explore how service levels vary across different types, I compared the distribution of service level within each 'Type'. Service level is a categorical variable, so a box plot, which summarizes numeric values, is not meaningful for it. Instead, I treated it as a factor and drew a bar chart showing the share of each service level within each type. This makes disparities in service levels between types visible and supports tailoring interventions to the needs of each type.

# 'Service.level' variation between different 'Type's
data$Service.level <- factor(data$Service.level)

# Service.level is categorical, so show its share within each Type
ggplot(data, aes(x = Type, fill = Service.level)) +
  geom_bar(position = "fill") +
  labs(title = "Service Level by Type", y = "Proportion of records")
[Figure: Service Level by Type]

Correlation between 'Population' and the distribution of 'Service.Type':
To look for a relationship between population and the distribution of service types, I cross-tabulated the two variables and computed the correlation matrix of the resulting table. Each column of the table holds the record counts of one service type across the distinct population values, so the matrix shows how similarly pairs of service types are distributed over population values. It does not give a single correlation between population and service type. The heatmap makes it easy to spot pairs of service types whose counts rise and fall together, which can inform resource allocation and service planning for different population groups.

# Correlate the Service.Type columns of the Population x Service.Type table
correlation <- cor(table(data$Population, data$Service.Type))

# Convert correlation matrix to tidy format (both dimensions are service types)
correlation_tidy <- as.data.frame(as.table(correlation))
names(correlation_tidy) <- c("Service_Type_1", "Service_Type_2", "Correlation")

# Create a heatmap of the correlation matrix
ggplot(data = correlation_tidy, aes(x = Service_Type_1, y = Service_Type_2, fill = Correlation)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(title = "Correlation between Population and Service Type Distribution",
       x = "Service Type", y = "Service Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_fixed()
[Figure: Correlation between Population and Service Type Distribution (heatmap)]
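
Because population is numeric and service type is categorical, another way to ask whether the two are related is to compare the population distributions of the service types directly. The following is a minimal sketch, assuming simulated data (the `toy` data frame, its service type labels, and its values are invented for illustration). It swaps in a Kruskal-Wallis rank sum test from base R, which does not require the populations to be normally distributed.

```r
# Simulated stand-in data; labels and values are invented, NOT from the dataset
set.seed(99)
toy <- data.frame(
  Service.Type = rep(c("Drinking water", "Sanitation", "Hygiene"), each = 30),
  Population = round(rlnorm(90, meanlog = 13, sdlog = 1))
)

# Kruskal-Wallis test: do the population distributions differ by service type?
kw <- kruskal.test(Population ~ Service.Type, data = toy)
print(kw)
```

A small p-value would suggest that at least one service type tends to appear with larger or smaller populations than the others.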

Trend of 'Service.Type' over the years:
This analysis examines the trend of service types over time by plotting the number of records of each service type for each year. By visualizing how the distribution of service types changes over the years, I can identify shifts or patterns in service provision. Understanding these trends helps anticipate changes in service demand and allocate resources to meet evolving needs.

# Trend of 'Service.Type' over the years
# Count records per Year and Service.Type, then draw one line per service type
year_counts <- count(data, Year, Service.Type)
ggplot(year_counts, aes(x = Year, y = n, color = Service.Type)) +
  geom_line() +
  labs(title = "Trend of Service.Type Over Years", y = "Number of records")
[Figure: Trend of Service.Type Over Years]

'Service.Type' distribution difference between rural and urban areas:
By creating a grouped bar chart of service types by residence type (rural or urban), I can compare the distribution of service types between different types of areas. This analysis helps identify disparities in service provision between rural and urban areas. Understanding these differences can inform targeted interventions or policies aimed at improving equity in service access.

# 'Service.Type' distribution difference between rural and urban areas
# Grouped bar chart of Service.Type by Residence.Type (beside = TRUE)
table_data <- table(data$Service.Type, data$Residence.Type)
barplot(table_data, beside = TRUE, legend = TRUE, main = "Service Type Distribution by Residence Type")
[Figure: Service Type Distribution by Residence Type (grouped bar chart)]

Difference in 'Coverage' between different 'Service.Type's:
This analysis investigates the variation in coverage levels across different service types by creating a box plot of coverage for each service type. By comparing coverage distributions between service types, I can identify which types have higher or lower coverage levels. Understanding these differences can provide insights into areas where service improvements may be needed and inform targeted interventions to enhance coverage and service quality for specific service types.

# Boxplot of Coverage by Service.Type
boxplot(Coverage ~ Service.Type, data = data, main = "Coverage by Service Type")
[Figure: Coverage by Service Type (box plot)]

Distribution of 'Service.Type' by 'Region':
By creating a grouped bar chart of service types by region with base R, I can see how the service types are distributed across geographic regions. It complements the stacked ggplot2 chart at the start of the report by placing the service types side by side within each region, which makes it easier to see where certain types of services are more prevalent or scarce. Understanding these regional differences can inform resource allocation, service planning, and policy development to ensure equitable access to services across all regions.

# Distribution of 'Service.Type' by 'Region'
table_data <- table(data$Service.Type, data$Region)
barplot(table_data, beside = TRUE, legend = TRUE, main = "Service Type Distribution by Region")
[Figure: Service Type Distribution by Region (grouped bar chart)]

Trend of Coverage Over Time by Region:
This trend analysis explores how coverage has changed over time within each region by plotting mean coverage per year, color-coded by region. Averaging within each year gives every region a single line, so I can see whether coverage has been increasing, decreasing, or remaining stable. These temporal patterns can provide insights into the effectiveness of past interventions and help identify regions that may require additional support or resources.

# Trend Analysis
# Average coverage within each year so every region gets a single line
ggplot(data, aes(x = Year, y = Coverage, color = Region)) +
  stat_summary(fun = mean, geom = "line") +
  labs(title = "Trend of Coverage Over Time by Region", y = "Mean Coverage")
[Figure: Trend of Coverage Over Time by Region (line plot)]

Predictive Modeling:
In this analysis, a random forest regression model is trained to predict coverage from region, year, population, and service type. Random forest is a machine learning algorithm that can capture nonlinear relationships and interactions among predictor variables.
The model is fitted to the historical data, with coverage as the target variable, so it learns patterns linking the predictors to coverage levels. Region and service type are converted to factors first so that the model treats them as categories. Once trained, the model can be used to make predictions for future years. The future data below uses illustrative inputs: the first region and service type found in the data and the median population. These should be replaced with the region of interest, the projected years, and estimated population and service type information.
Predictions are then generated by applying the trained model to the future data. They estimate coverage for the specified region and years given the inputs, and can be examined to see where or when coverage may be higher or lower. Tree-based models cannot extrapolate a trend beyond the years seen in training, so these values are best read as rough estimates. Model performance metrics such as mean absolute error (MAE), root mean squared error (RMSE), and R-squared (R2) can be used to evaluate the accuracy of the model, and further investigation may be needed to refine it.

# Treat the categorical predictors as factors
data$Region <- factor(data$Region)
data$Service.Type <- factor(data$Service.Type)

# Fit a random forest model
model_rf <- randomForest(Coverage ~ Region + Year + Population + Service.Type, data = data)

# Make predictions for future years
# Illustrative inputs drawn from the data; replace with the values of interest
future_years <- c(2023, 2024, 2025)
future_population <- rep(median(data$Population, na.rm = TRUE), 3)
future_region <- factor(rep(levels(data$Region)[1], 3), levels = levels(data$Region))
future_service_type <- factor(rep(levels(data$Service.Type)[1], 3), levels = levels(data$Service.Type))

future_data <- data.frame(Region = future_region, Year = future_years, Population = future_population, Service.Type = future_service_type)

predictions_rf <- predict(model_rf, newdata = future_data)

# Create a data frame for visualization
predicted_data_rf <- data.frame(Year = future_years, Predicted_Coverage = predictions_rf)

# Visualize the predicted coverage values by year using random forest
ggplot(predicted_data_rf, aes(x = as.factor(Year), y = Predicted_Coverage)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Predicted Coverage for Future Years (Random Forest)",
       x = "Year", y = "Predicted Coverage",
       caption = "Source: Dataset") +
  theme_minimal()
[Figure: Predicted Coverage for Future Years (Random Forest) (bar chart)]

Conclusion:
As I conclude my exploration of the data on essential services, I've gained valuable insights into how these services are distributed, how they change over time, and the factors that influence their availability. Here are some key takeaways:
Distribution of Essential Services: The data revealed the distribution of different types of essential services, shedding light on which services are more prevalent in certain areas.
Variation Across Regions: There were noticeable variations in the types of services provided across different regions, indicating potential disparities in access to essential services.
Trends Over Time: Analysis of coverage trends over the years showed how service levels have evolved. The overall correlation between year and coverage was close to zero (about 0.015), so the pooled data show no clear linear trend.
Relationship with Population: Exploring the relationship between population and coverage helped show how service availability relates to population size, indicating areas where service expansion may be needed.
Disparities Between Residence Types: Differences in coverage between rural and urban areas underscored the importance of addressing disparities in service provision to ensure equitable access for all residents.
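Predictive Modeling: A random forest evaluated on a random 80/20 split predicted held-out coverage with an R-squared of about 0.97 and a mean absolute error of about 3.7. I also used random forest and linear regression models to sketch forecasts for 2023 to 2025, offering a glimpse into potential future trends to support proactive planning.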
Correlation Analysis: Correlation analysis between various factors such as population and service types offered valuable insights into potential relationships that could inform policy decisions.
In conclusion, this analysis underscores the significance of data-driven insights in understanding and addressing the challenges related to essential service provision. By leveraging the findings from this exploration, policymakers, stakeholders, and service providers can make informed decisions to enhance the accessibility and quality of essential services for communities across different regions. However, further investigation may be warranted to delve deeper into certain trends and correlations identified in the data, ensuring comprehensive and targeted interventions to meet the diverse needs of populations.