R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this:

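# Load the data set (adjust the path to wherever the CSV file is stored)
df <- read.csv("C:\\Users\\USER\\Downloads\\world_health_data.csv")
# View() opens an interactive viewer, so only call it in an interactive session
if (interactive()) View(df)

library(ggplot2)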
# ===================================================
# 1) HISTOGRAM – Health Expenditure
# ===================================================


ggplot(df, aes(x = health_exp)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  labs(
    title = "Histogram of Health Expenditure",
    x = "Health Expenditure (% of GDP)",
    y = "Frequency"
  ) +
  theme_minimal()
## Warning: Removed 1483 rows containing non-finite outside the scale range
## (`stat_bin()`).

# ===================================================

# 1) Histogram — Health Expenditure

# The histogram shows that most observations have health expenditure between roughly 3 and 8 percent of GDP. Values above 15 percent are rare and form a sparse right tail. Each observation is one country in one year, and the 1483 rows with missing health_exp are excluded from the plot.

# ===================================================
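
# The concentration claim above can be checked numerically. The sketch below is a minimal illustration on a made-up vector (toy_exp is hypothetical, not the real health_exp column): it counts the values ggplot2 would drop and computes the share of the remaining values between 3 and 8.

```r
# Hypothetical stand-in for df$health_exp (values invented for illustration)
toy_exp <- c(2.5, 3.1, 4.0, 4.8, 5.5, 6.2, 7.9, 9.4, 16.3, NA, NA)

# Rows that ggplot2 would drop with a "non-finite" warning
n_missing <- sum(!is.finite(toy_exp))

# Share of the remaining values that fall in the 3-8 band
observed <- toy_exp[is.finite(toy_exp)]
share_3_to_8 <- mean(observed >= 3 & observed <= 8)

n_missing
share_3_to_8
```

# On the real data, replacing toy_exp with df$health_exp should reproduce the 1483 dropped rows reported in the warning.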



# ===================================================
# 2) BOXPLOT – Life Expectancy
# ===================================================

ggplot(df, aes(y = life_expect)) +
  geom_boxplot(fill = "orange") +
  labs(
    title = "Boxplot of Life Expectancy",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()
## Warning: Removed 460 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

# ===================================================

# 2) Boxplot – Life Expectancy

# The boxplot indicates a median life expectancy of around 72 years. Several low-end outliers are present, meaning that some country-year records fall well below the bulk of the distribution.

# ===================================================
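
# The median and the outliers read off the boxplot can also be computed directly. The sketch below uses an invented vector (toy_life is hypothetical) and the conventional 1.5 × IQR rule, which is also the default whisker rule in geom_boxplot().

```r
# Hypothetical stand-in for df$life_expect (values invented for illustration)
toy_life <- c(48, 52, 66, 68, 70, 71, 72, 73, 74, 75, 77, 79, 82, NA)

# Drop missing values, as the plot does
observed <- toy_life[!is.na(toy_life)]
med <- median(observed)

# Quartiles and the 1.5 * IQR fences
q <- quantile(observed, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]
fence_low  <- q[1] - 1.5 * iqr
fence_high <- q[2] + 1.5 * iqr

# Points outside the fences are the ones drawn as individual dots
outliers <- observed[observed < fence_low | observed > fence_high]

med
outliers
```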




# ===================================================
# 3) SCATTERPLOT – Health Expenditure vs Life Expectancy
# ===================================================

ggplot(df, aes(x = health_exp, y = life_expect)) +
  geom_point(alpha = 0.6, color = "darkblue") +
  labs(
    title = "Scatterplot: Health Expenditure vs Life Expectancy",
    x = "Health Expenditure (% of GDP)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()
## Warning: Removed 1568 rows containing missing values or values outside the scale range
## (`geom_point()`).

# ===================================================

# 3) Scatterplot — Health Expenditure vs Life Expectancy

# The scatterplot shows a weak positive correlation between health expenditure and life expectancy (r ≈ 0.35, see the pairplot in section 8). Countries that spend a larger share of GDP on health tend to have higher life expectancy, but the points are widely scattered, so spending alone accounts for little of the variation.

# ===================================================
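
# The strength of the relationship can be quantified with a correlation coefficient. This is a minimal sketch on invented data (the toy data frame is hypothetical); it uses only rows where both variables are present, which mirrors what the scatterplot draws.

```r
# Hypothetical data with the same column names as the report
toy <- data.frame(
  health_exp  = c(3.2, 4.1, NA, 5.0, 6.3, 7.8, 9.1, 10.4),
  life_expect = c(61, 70, 72, 66, NA, 74, 71, 80)
)

# Pearson correlation on complete pairs only
r <- cor(toy$health_exp, toy$life_expect, use = "complete.obs")

# Number of rows that actually enter the calculation
n_used <- sum(complete.cases(toy))

r
n_used
```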





# ===================================================
# 4) BAR CHART – Count of Records per Year
# ===================================================

ggplot(df, aes(x = factor(year))) +
  geom_bar(fill = "purple") +
  labs(
    title = "Bar Chart: Number of Records per Year",
    x = "Year",
    y = "Count"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# ===================================================

# 4) Bar Chart — Number of Records per Year

# The bar chart demonstrates that the dataset contains a similar number of records for each year, showing a consistent distribution of data points across time without major fluctuations.

# ===================================================
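
# The same check can be done without a plot. The sketch below uses an invented year vector (toy_year is hypothetical). Note that geom_bar() and table() both count rows, so equal counts per year say nothing about how many values are missing in other columns.

```r
# Hypothetical stand-in for df$year: three records per year
toy_year <- c(2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002)

# Records per year, the same numbers geom_bar() would draw
counts <- table(toy_year)

# If the smallest and largest counts coincide, every year has the same number of rows
balanced <- diff(range(counts)) == 0

counts
balanced
```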





# ===================================================
# 5) Correlation Heatmap
# ===================================================


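library(reshape2)  # provides melt()
## Warning: package 'reshape2' was built under R version 4.5.2
# Keep only the numeric columns
num_df <- df[, sapply(df, is.numeric)]

# Pairwise correlations, using every row where both variables are present
corr_matrix <- cor(num_df, use = "pairwise.complete.obs")

# Reshape the matrix to long format (Var1, Var2, value) for geom_tile()
corr_melt <- melt(corr_matrix)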

ggplot(corr_melt, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "red", high = "blue", mid = "white",
                       midpoint = 0, limit = c(-1,1)) +
  labs(title = "Correlation Heatmap",
       x = "Variables",
       y = "Variables",
       fill = "Correlation") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title.x = element_blank(),
    axis.title.y = element_blank()
  )

# ===================================================

# Correlation Heatmap Interpretation

# The heatmap shows the correlation coefficients between the numeric health and demographic variables, that is, the strength and direction of each pairwise linear relationship. Strong positive or negative correlations point to associations that may be relevant to health outcomes.

# Color coding (set by scale_fill_gradient2(low = "red", high = "blue")):

# Red → strong negative correlation

# Blue → strong positive correlation

# White → little or no correlation

# Strongest Relationships
# Strong Negative Correlations (dark red)

# These variables are strongly and inversely related to life expectancy:

# Infant mortality

# Under-5 mortality

# Neonatal mortality

# Maternal mortality

# As these mortality indicators increase, life expectancy sharply decreases.

# This makes sense medically: high child and maternal mortality often reflect poor healthcare systems, lack of access to care, poverty, and weak infrastructure.

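# Positive Correlations

# Health expenditure has a positive but modest correlation with life expectancy (0.346 in the pairplot in section 8).
# → Countries that spend more on healthcare tend to have longer average lifespans, although the association is far weaker than for the mortality indicators.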

# Moderate or Weak Correlations

# Variables like HIV prevalence and tuberculosis incidence show negative correlations, but not as strong as child/maternal mortality.

# The variable year has a mild positive correlation with life expectancy, reflecting global improvements over time.

# Interpretation

# Life expectancy is most strongly associated with the mortality indicators, especially child and maternal mortality.

# Higher healthcare spending is linked with longer life expectancy, but only modestly.

# Infectious diseases matter, but their correlations with life expectancy are weaker than those of the mortality indicators.

# The heatmap provides a clear overview of how health systems and demographic conditions shape population longevity.


# ===================================================
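
# Reading exact values off a heatmap is hard, so it can help to rank the correlations with life expectancy numerically. This is a minimal sketch on invented data (the toy data frame and its values are hypothetical); on the real data the same indexing would apply to corr_matrix.

```r
# Hypothetical data using column names that appear in the report
toy <- data.frame(
  life_expect        = c(55, 60, 66, 71, 75, 80),
  infant_mortality   = c(80, 65, 40, 25, 12, 4),
  maternal_mortality = c(600, 450, 300, 120, 60, 10),
  health_exp         = c(4.0, 3.5, 6.0, 5.5, 8.0, 7.5)
)

toy_corr <- cor(toy, use = "pairwise.complete.obs")

# Take the life_expect column and drop its correlation with itself
with_life <- toy_corr[, "life_expect"]
with_life <- with_life[names(with_life) != "life_expect"]

# Rank by absolute strength, keeping the sign for direction
ranked <- with_life[order(abs(with_life), decreasing = TRUE)]
ranked
```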








# ===================================================
# 6) Line Chart (Trend — Life Expectancy over Time)
# ===================================================


ggplot(df, aes(x = year, y = life_expect, group = country)) +
  geom_line(alpha = 0.2, color = "steelblue") +
  labs(title = "Life Expectancy Over Time",
       x = "Year",
       y = "Life Expectancy") +
  theme_minimal()
## Warning: Removed 445 rows containing missing values or values outside the scale range
## (`geom_line()`).

# ===================================================

# Line Chart Interpretation

# The line chart illustrates how life expectancy changes over time across countries. Most trajectories show a gradual improvement over the years.


# Each line represents one country’s life expectancy from about 1999 to 2023. The transparency helps visualize overall trends without clutter.

# Key Observations

# Most countries show a clear upward trend over time.
# → This indicates global improvement in healthcare access, vaccination, nutrition, and disease prevention.

# There is large variation between countries.

# Some countries stay consistently above 80 years → high-income nations.

# Others remain close to 50–60 years → typically lower-income countries or those affected by conflict and epidemics.

# A slight decline around 2019–2021 is visible for several countries.
# → Most likely due to the global impact of COVID-19.

# Even low-life-expectancy countries show growth over time.
# → Suggesting gradual development and healthcare improvement.

# Interpretation

# The global trajectory is positive: life expectancy has steadily increased over the past two decades.

# However, disparities between countries remain significant.

# Temporary declines linked to global crises (pandemic) can be clearly identified.

# The plot emphasizes both progress and persistent inequality.


# ===================================================
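
# With one line per country, the overall trend and any dip are easier to confirm from yearly averages. The sketch below is a minimal illustration on invented data (countries "A" and "B" and all values are hypothetical).

```r
# Hypothetical panel: two countries, four years each
toy <- data.frame(
  country     = rep(c("A", "B"), each = 4),
  year        = rep(c(2018, 2019, 2020, 2021), times = 2),
  life_expect = c(80.0, 80.4, 79.6, 79.9, 58.0, 58.6, 58.1, 58.4)
)

# Mean life expectancy per year across countries
yearly <- aggregate(life_expect ~ year, data = toy, FUN = mean)

# Year-over-year change; a negative value marks a decline
yearly$change <- c(NA, diff(yearly$life_expect))
yearly
```

# An unweighted mean treats every country equally. A population-weighted mean would need a population column, which this sketch does not assume.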










# ===================================================
# 7) Violin Plot — Life Expectancy
# ===================================================


ggplot(df, aes(x = "", y = life_expect)) +
  geom_violin(fill = "lightgreen", alpha = 0.6) +
  labs(title = "Violin Plot of Life Expectancy",
       x = "",
       y = "Life Expectancy") +
  theme_minimal()
## Warning: Removed 460 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

# ===================================================

# Violin Plot Interpretation

# This plot shows how life expectancy is distributed across all country-year records in the dataset. The width of the violin represents the density of observations: wide areas contain many records, while narrow areas contain few.

# Key Observations

# Life expectancy ranges roughly from 40 to 85 years.

# The highest density is between 70 and 75 years, meaning the most common values lie in this range.

# The lower tail (roughly 40–60 years) is narrow → only a small share of records have very low life expectancy.

# The upper tail (80+ years) is also narrow → only highly developed countries reach these levels.

# The distribution is left-skewed: the lower tail is much longer than the upper one, showing that a few countries still struggle with very low life expectancy.

# Interpretation

# Globally, life expectancy is relatively high in most countries.

# A minority of countries with weaker healthcare systems or poorer socioeconomic conditions pull the lower end of the distribution downward.

# The plot reflects long-term global improvements in health, despite persistent inequality.


# ===================================================
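
# The shape described above can be summarized with a few numbers. This is a minimal sketch on an invented vector (toy_life is hypothetical): a mean below the median, and a wider gap below the median than above it, are both simple signs of a long lower tail.

```r
# Hypothetical stand-in for df$life_expect (values invented for illustration)
toy_life <- c(45, 52, 58, 64, 68, 70, 71, 72, 73, 74, 75, 76, 78, 81, 84)

# 5th, 25th, 50th, 75th and 95th percentiles
qs <- quantile(toy_life, c(0.05, 0.25, 0.5, 0.75, 0.95), names = FALSE)

# A mean below the median suggests left skew
left_skewed <- mean(toy_life) < median(toy_life)

# Distance from the median to each tail
lower_spread <- qs[3] - qs[1]
upper_spread <- qs[5] - qs[3]

left_skewed
c(lower = lower_spread, upper = upper_spread)
```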








# ===================================================
# 8) Pairplot (GGpairs)
# ===================================================




library(GGally)
## Warning: package 'GGally' was built under R version 4.5.2
pair_df <- df[, c("health_exp", "life_expect", 
                  "infant_mortality", "maternal_mortality")]

GGally::ggpairs(pair_df)
## Warning: Removed 1483 rows containing non-finite outside the scale range
## (`stat_density()`).
## Warning: Removed 1568 rows containing missing values
## Warning: Removed 1483 rows containing missing values
## Warning: Removed 1927 rows containing missing values
## Warning: Removed 1568 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 460 rows containing non-finite outside the scale range
## (`stat_density()`).
## Warning: Removed 888 rows containing missing values
## Warning: Removed 1757 rows containing missing values
## Warning: Removed 1483 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 888 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 794 rows containing non-finite outside the scale range
## (`stat_density()`).
## Warning: Removed 1778 rows containing missing values
## Warning: Removed 1927 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1757 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1778 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1757 rows containing non-finite outside the scale range
## (`stat_density()`).

# ===================================================

# Pairplot Interpretation

# The pairplot visualizes relationships across multiple variables, making it easy to identify linear patterns, clusters, or outliers between health expenditure, life expectancy, and mortality indicators.


# This plot is a scatterplot matrix (or pairs plot) that shows the relationships among four variables in a dataset: health_exp, life_expect, infant_mortality, and maternal_mortality. Each row and column represents a different variable. Details are as follows:

# Diagonal cells (top-left → bottom-right):
# These show the distribution of each variable.
# With the ggpairs() defaults, every continuous variable, including health_exp and life_expect, is drawn as a kernel density curve.

# Lower-triangle cells:
# These are scatterplots, showing how two variables vary with respect to each other.
# For instance, the life_expect vs infant_mortality cell indicates that as infant mortality increases, life expectancy decreases.

# Upper-triangle cells ("Corr" numbers):
# These cells display the correlation coefficient between the two variables:
# *** → highly statistically significant (p < 0.001).
# For example, life_expect and infant_mortality show -0.922***, meaning there is a strong negative correlation between them.

# Overall observations from the plot:

# There is a positive relationship between health expenditure (health_exp) and life expectancy (life_expect) (0.346).

# Infant and maternal mortality are strongly negatively correlated with life expectancy (-0.922 and -0.857).

# infant_mortality and maternal_mortality have a strong positive correlation (0.888).

# In summary, this plot visually shows the distributions and pairwise correlations of the four variables in the dataset.

# ===================================================
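
# The many "Removed N rows" warnings differ from cell to cell because each panel uses only the rows where its own variables are present. The sketch below, on invented data (the toy data frame is hypothetical), counts the usable rows for every pair of variables.

```r
# Hypothetical data with the same four column names and scattered NAs
toy <- data.frame(
  health_exp         = c(4.1, NA, 6.0, 5.2, NA, 7.7),
  life_expect        = c(61, 66, NA, 72, 75, 80),
  infant_mortality   = c(70, 52, 38, NA, 15, 5),
  maternal_mortality = c(500, NA, 210, 90, NA, 12)
)

# 1 where a value is present, 0 where it is missing
present <- (!is.na(as.matrix(toy))) + 0

# Entry [i, j] is the number of rows where both variable i and variable j are present
pair_n <- crossprod(present)

# Rows each panel would drop: total rows minus usable rows
removed <- nrow(toy) - pair_n

pair_n
removed
```

# The diagonal of removed corresponds to the density panels and the off-diagonal entries to the scatterplot and correlation panels.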