# Load libraries
library(tidyverse)
library(ggplot2)
library(dplyr)
library(readr)
library(corrplot)
library(caret)
library(janitor)
library(knitr)
library(car)

___________________________________________________

Level 1: Understanding the Data (Basic Exploration)

___________________________________________________

Question 1.1: What is the structure of the dataset (number of rows, columns, and data types)?

# Load dataset

data <- read.csv("Life Expectancy Data.csv")

# View first few rows

head(data)
##       Country Year     Status Life.expectancy Adult.Mortality infant.deaths
## 1 Afghanistan 2015 Developing            65.0             263            62
## 2 Afghanistan 2014 Developing            59.9             271            64
## 3 Afghanistan 2013 Developing            59.9             268            66
## 4 Afghanistan 2012 Developing            59.5             272            69
## 5 Afghanistan 2011 Developing            59.2             275            71
## 6 Afghanistan 2010 Developing            58.8             279            74
##   Alcohol percentage.expenditure Hepatitis.B Measles  BMI under.five.deaths
## 1    0.01              71.279624          65    1154 19.1                83
## 2    0.01              73.523582          62     492 18.6                86
## 3    0.01              73.219243          64     430 18.1                89
## 4    0.01              78.184215          67    2787 17.6                93
## 5    0.01               7.097109          68    3013 17.2                97
## 6    0.01              79.679367          66    1989 16.7               102
##   Polio Total.expenditure Diphtheria HIV.AIDS       GDP Population
## 1     6              8.16         65      0.1 584.25921   33736494
## 2    58              8.18         62      0.1 612.69651     327582
## 3    62              8.13         64      0.1 631.74498   31731688
## 4    67              8.52         67      0.1 669.95900    3696958
## 5    68              7.87         68      0.1  63.53723    2978599
## 6    66              9.20         66      0.1 553.32894    2883167
##   thinness..1.19.years thinness.5.9.years Income.composition.of.resources
## 1                 17.2               17.3                           0.479
## 2                 17.5               17.5                           0.476
## 3                 17.7               17.7                           0.470
## 4                 17.9               18.0                           0.463
## 5                 18.2               18.2                           0.454
## 6                 18.4               18.4                           0.448
##   Schooling
## 1      10.1
## 2      10.0
## 3       9.9
## 4       9.8
## 5       9.5
## 6       9.2
# Clean column Names
data <- clean_names(data)
colnames(data)
##  [1] "country"                         "year"                           
##  [3] "status"                          "life_expectancy"                
##  [5] "adult_mortality"                 "infant_deaths"                  
##  [7] "alcohol"                         "percentage_expenditure"         
##  [9] "hepatitis_b"                     "measles"                        
## [11] "bmi"                             "under_five_deaths"              
## [13] "polio"                           "total_expenditure"              
## [15] "diphtheria"                      "hiv_aids"                       
## [17] "gdp"                             "population"                     
## [19] "thinness_1_19_years"             "thinness_5_9_years"             
## [21] "income_composition_of_resources" "schooling"
# Check number of rows and columns

dim(data)
## [1] 2938   22

=> Interpretation: The dim() output shows that the dataset has 2,938 rows and 22 columns. Each row is one observation of a country in a given year, and the columns hold various health and socio-economic indicators.

# Check structure of dataset

str(data)
## 'data.frame':    2938 obs. of  22 variables:
##  $ country                        : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ year                           : int  2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
##  $ status                         : chr  "Developing" "Developing" "Developing" "Developing" ...
##  $ life_expectancy                : num  65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
##  $ adult_mortality                : int  263 271 268 272 275 279 281 287 295 295 ...
##  $ infant_deaths                  : int  62 64 66 69 71 74 77 80 82 84 ...
##  $ alcohol                        : num  0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
##  $ percentage_expenditure         : num  71.3 73.5 73.2 78.2 7.1 ...
##  $ hepatitis_b                    : int  65 62 64 67 68 66 63 64 63 64 ...
##  $ measles                        : int  1154 492 430 2787 3013 1989 2861 1599 1141 1990 ...
##  $ bmi                            : num  19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
##  $ under_five_deaths              : int  83 86 89 93 97 102 106 110 113 116 ...
##  $ polio                          : int  6 58 62 67 68 66 63 64 63 58 ...
##  $ total_expenditure              : num  8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
##  $ diphtheria                     : int  65 62 64 67 68 66 63 64 63 58 ...
##  $ hiv_aids                       : num  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
##  $ gdp                            : num  584.3 612.7 631.7 670 63.5 ...
##  $ population                     : num  33736494 327582 31731688 3696958 2978599 ...
##  $ thinness_1_19_years            : num  17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
##  $ thinness_5_9_years             : num  17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
##  $ income_composition_of_resources: num  0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
##  $ schooling                      : num  10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...

=> Interpretation: The str() function displays the structure of the dataset, including variable names and their data types. Here, country and status are character columns, and the remaining 20 columns are numeric (integer or double). Understanding the structure of the dataset is important for selecting appropriate statistical and machine learning methods.
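To get a quick tally of column types rather than reading the str() output line by line, something like the following could be used. This is a minimal sketch on a made-up data frame (`toy` and `col_classes` are illustrative names, not part of the original analysis).

```r
# Made-up data frame standing in for the real dataset
toy <- data.frame(
  country = c("A", "B"),
  year = c(2014L, 2015L),
  life_expectancy = c(65.0, 59.9),
  stringsAsFactors = FALSE
)

# First class of each column, returned as a named character vector
col_classes <- vapply(toy, function(col) class(col)[1], character(1))

# Number of columns per type
table(col_classes)
```

Applied to the real data frame, the same two lines would summarise how many columns are character, integer, or numeric.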

# Summary statistics of the dataset

summary(data)
##    country               year         status          life_expectancy
##  Length:2938        Min.   :2000   Length:2938        Min.   :36.30  
##  Class :character   1st Qu.:2004   Class :character   1st Qu.:63.10  
##  Mode  :character   Median :2008   Mode  :character   Median :72.10  
##                     Mean   :2008                      Mean   :69.22  
##                     3rd Qu.:2012                      3rd Qu.:75.70  
##                     Max.   :2015                      Max.   :89.00  
##                                                       NA's   :10     
##  adult_mortality infant_deaths       alcohol        percentage_expenditure
##  Min.   :  1.0   Min.   :   0.0   Min.   : 0.0100   Min.   :    0.000     
##  1st Qu.: 74.0   1st Qu.:   0.0   1st Qu.: 0.8775   1st Qu.:    4.685     
##  Median :144.0   Median :   3.0   Median : 3.7550   Median :   64.913     
##  Mean   :164.8   Mean   :  30.3   Mean   : 4.6029   Mean   :  738.251     
##  3rd Qu.:228.0   3rd Qu.:  22.0   3rd Qu.: 7.7025   3rd Qu.:  441.534     
##  Max.   :723.0   Max.   :1800.0   Max.   :17.8700   Max.   :19479.912     
##  NA's   :10                       NA's   :194                             
##   hepatitis_b       measles              bmi        under_five_deaths
##  Min.   : 1.00   Min.   :     0.0   Min.   : 1.00   Min.   :   0.00  
##  1st Qu.:77.00   1st Qu.:     0.0   1st Qu.:19.30   1st Qu.:   0.00  
##  Median :92.00   Median :    17.0   Median :43.50   Median :   4.00  
##  Mean   :80.94   Mean   :  2419.6   Mean   :38.32   Mean   :  42.04  
##  3rd Qu.:97.00   3rd Qu.:   360.2   3rd Qu.:56.20   3rd Qu.:  28.00  
##  Max.   :99.00   Max.   :212183.0   Max.   :87.30   Max.   :2500.00  
##  NA's   :553                        NA's   :34                       
##      polio       total_expenditure   diphtheria       hiv_aids     
##  Min.   : 3.00   Min.   : 0.370    Min.   : 2.00   Min.   : 0.100  
##  1st Qu.:78.00   1st Qu.: 4.260    1st Qu.:78.00   1st Qu.: 0.100  
##  Median :93.00   Median : 5.755    Median :93.00   Median : 0.100  
##  Mean   :82.55   Mean   : 5.938    Mean   :82.32   Mean   : 1.742  
##  3rd Qu.:97.00   3rd Qu.: 7.492    3rd Qu.:97.00   3rd Qu.: 0.800  
##  Max.   :99.00   Max.   :17.600    Max.   :99.00   Max.   :50.600  
##  NA's   :19      NA's   :226       NA's   :19                      
##       gdp              population        thinness_1_19_years thinness_5_9_years
##  Min.   :1.681e+00   Min.   :3.400e+01   Min.   : 0.10       Min.   : 0.10     
##  1st Qu.:4.639e+02   1st Qu.:1.958e+05   1st Qu.: 1.60       1st Qu.: 1.50     
##  Median :1.767e+03   Median :1.387e+06   Median : 3.30       Median : 3.30     
##  Mean   :7.483e+03   Mean   :1.275e+07   Mean   : 4.84       Mean   : 4.87     
##  3rd Qu.:5.911e+03   3rd Qu.:7.420e+06   3rd Qu.: 7.20       3rd Qu.: 7.20     
##  Max.   :1.192e+05   Max.   :1.294e+09   Max.   :27.70       Max.   :28.60     
##  NA's   :448         NA's   :652         NA's   :34          NA's   :34        
##  income_composition_of_resources   schooling    
##  Min.   :0.0000                  Min.   : 0.00  
##  1st Qu.:0.4930                  1st Qu.:10.10  
##  Median :0.6770                  Median :12.30  
##  Mean   :0.6276                  Mean   :11.99  
##  3rd Qu.:0.7790                  3rd Qu.:14.30  
##  Max.   :0.9480                  Max.   :20.70  
##  NA's   :167                     NA's   :163

=> Interpretation: The summary() function provides descriptive statistics (minimum, quartiles, median, mean, and maximum) for each numeric variable, along with the number of NA's where values are missing. These statistics describe the distribution and central tendency of each variable, which is useful for further analysis.

Question 1.2: Are there any missing values in the dataset?

# Check the number of missing values in each column
colSums(is.na(data))
##                         country                            year 
##                               0                               0 
##                          status                 life_expectancy 
##                               0                              10 
##                 adult_mortality                   infant_deaths 
##                              10                               0 
##                         alcohol          percentage_expenditure 
##                             194                               0 
##                     hepatitis_b                         measles 
##                             553                               0 
##                             bmi               under_five_deaths 
##                              34                               0 
##                           polio               total_expenditure 
##                              19                             226 
##                      diphtheria                        hiv_aids 
##                              19                               0 
##                             gdp                      population 
##                             448                             652 
##             thinness_1_19_years              thinness_5_9_years 
##                              34                              34 
## income_composition_of_resources                       schooling 
##                             167                             163

=> Interpretation: The output shows the number of missing values present in each variable of the dataset. Identifying missing values is important because they can affect statistical analysis and machine learning model performance. Variables with missing values will be handled in later stages using appropriate methods such as imputation or removal.
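One common way to handle such gaps is median imputation. The sketch below uses a small made-up data frame and a hypothetical helper `impute_median()`; it illustrates one option and is not necessarily the method applied in later stages.

```r
# Made-up data standing in for the real dataset
toy <- data.frame(
  alcohol = c(0.5, NA, 3.2, 7.1),
  gdp     = c(500, 1200, NA, NA)
)

# Hypothetical helper: replace NAs in a numeric vector with that vector's median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Apply the helper to every column and rebuild the data frame
toy_imputed <- as.data.frame(lapply(toy, impute_median))

# No missing values should remain
colSums(is.na(toy_imputed))
```

Median imputation keeps every row but shrinks the variance of the imputed column, so for variables with many gaps (such as population) removal or a model-based approach may be worth comparing.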

# Check the total number of missing values in the entire dataset

sum(is.na(data))
## [1] 2563

=> Interpretation: The dataset contains 2,563 missing values in total. With 2,938 rows and 22 columns (64,636 cells), that is roughly 4% of all values, so careful data cleaning is needed before performing analysis or building machine learning models.
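Raw counts are easier to judge as percentages. The following sketch, on a made-up data frame (`toy`, `overall_pct`, and `col_pct` are illustrative names), shows how the share of missing values could be computed overall and per column.

```r
# Made-up example data frame (not the real dataset)
toy <- data.frame(a = c(1, NA, 3, 4), b = c(NA, NA, 2, 5))

# is.na() returns a logical matrix; its mean is the share of missing cells
overall_pct <- mean(is.na(toy)) * 100

# Share of missing values in each column
col_pct <- colMeans(is.na(toy)) * 100

overall_pct
col_pct
```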

# Missing data

missing_data <- colSums(is.na(data))

missing_df <- data.frame(
  Variable = names(missing_data),
  Missing_Count = missing_data
)

# Visualize Missing Values

ggplot(missing_df, aes(x = reorder(Variable, Missing_Count), y = Missing_Count, fill = Missing_Count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(
    title = "Missing Values by Variable",
    subtitle = "Higher bars indicate variables with more missing data",
    x = "Variables",
    y = "Number of Missing Values"
  ) +
  scale_fill_gradient(low = "#56B1F7", high = "#132B43") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 11),
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 12),
    legend.position = "none"
  )

=> Interpretation: This plot shows which variables have the most missing data and where cleaning effort should be focused. population (652) and hepatitis_b (553) have the most missing values, while life_expectancy and adult_mortality have the fewest among the affected variables (10 each). Eight variables, including country, year, and status, have no missing values.

Question 1.3: What is the average life expectancy across the dataset?

# Calculate average life expectancy

mean(data$life_expectancy, na.rm = TRUE)
## [1] 69.22493

=> Interpretation: The calculated value represents the average life expectancy across all countries and years in the dataset. This provides a general understanding of global health conditions and serves as a baseline for further statistical analysis.

# Calculate median life expectancy

median(data$life_expectancy, na.rm = TRUE)
## [1] 72.1

=> Interpretation: The median is the middle value when observations are arranged in order. It is less affected by extreme values than the mean. Here the median (72.1) is higher than the mean (69.22), which suggests that a tail of low values pulls the mean down.
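The difference in robustness can be seen on a small example. The values below are made up for illustration; adding a single very low observation moves the mean much more than the median.

```r
# Made-up life expectancy values
values <- c(70, 72, 73, 74, 75)

# Same values plus one extremely low observation
with_extreme <- c(values, 36)

mean(values)          # 72.8
median(values)        # 73
mean(with_extreme)    # about 66.67
median(with_extreme)  # 72.5
```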

# Calculate standard deviation

sd(data$life_expectancy, na.rm = TRUE)
## [1] 9.523867

=> Interpretation: The standard deviation measures the variability of life expectancy across countries and years. A higher standard deviation indicates greater variation in life expectancy values.
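As a worked check of what sd() computes, the sketch below reproduces the sample standard deviation by hand on made-up values (the vector `x` is illustrative and not taken from the dataset).

```r
# Made-up values with mean 70
x <- c(60, 65, 70, 75, 80)

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by (n - 1)
manual_sd <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))

manual_sd
sd(x)
```

Both lines should print the same value, since sd() uses the n - 1 denominator.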

ggplot(data, aes(x = life_expectancy)) +
  geom_histogram(bins = 30, fill = "#2C7FB8", color = "white", alpha = 0.9) +
  geom_density(aes(y = after_stat(count)), color = "#D95F0E", size = 1) +
  labs(
    title = "Distribution of Life Expectancy",
    subtitle = "Most countries fall between 65–80 years",
    x = "Life Expectancy",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 11),
    axis.title = element_text(size = 12)
  )

=> Interpretation: The histogram shows that most life expectancy values fall between 65 and 80 years, indicating that the majority of countries have moderate to high life expectancy. A smaller number of countries have lower values below 50 years, suggesting variation in health conditions across regions.

Question 1.4: Are there any outliers in the dataset?

# Create a boxplot for life expectancy

ggplot(data, aes(x = "", y = life_expectancy)) +
  geom_boxplot(fill = "#2C7FB8", alpha = 0.7) +
  labs(
    title = "Boxplot of Life Expectancy",
    x = "",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

=> Interpretation: The boxplot shows that most life expectancy values are concentrated between approximately 63 and 76 years, with a median around 72 years. A few lower values below 45 years appear as outliers, indicating countries with significantly lower life expectancy compared to the majority.

# Detect outlier using the IQR method

# Calculate quartiles
Q1 <- quantile(data$life_expectancy, 0.25, na.rm = TRUE)
Q3 <- quantile(data$life_expectancy, 0.75, na.rm = TRUE)

# Calculate IQR

IQR_value <- Q3 - Q1

# Define outlier limits

lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

# Find outliers

# which() drops NA comparisons, so missing values are not counted as outliers
outliers <- data$life_expectancy[
  which(data$life_expectancy < lower_bound |
        data$life_expectancy > upper_bound)
]

length(outliers)
## [1] 10

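=> Interpretation: The IQR method flags observations below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. With Q1 = 63.1 and Q3 = 75.7, the limits are approximately 44.2 and 94.6 years. Since the maximum observed value is 89, every flagged observation lies at the low end. These outliers may represent extreme health conditions or data irregularities.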
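The same procedure can be wrapped in a small reusable function. This is a sketch with a hypothetical helper name (`iqr_outliers()`) and made-up values. It uses which() so that positions where the comparison is NA are dropped rather than returned as NA entries.

```r
# Hypothetical helper: return the non-missing values outside the 1.5 * IQR limits
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  x[which(x < q[1] - fence | x > q[2] + fence)]
}

# Made-up values: one low outlier (40) and one missing value
toy <- c(40, 68, 70, 71, 72, 73, 75, NA)

iqr_outliers(toy)
length(iqr_outliers(toy))
```

Indexing directly with the logical condition instead of which() would also return an NA for the missing entry, inflating the count by one in this example.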

____________________________________

Level 2: Data Extraction & Filtering

____________________________________

Question 2.1: Which are the top 10 countries with the highest life expectancy?

# Calculate average life expectancy by country

avg_life_exp_country <- data %>%
  group_by(country) %>%
  summarise(
    Avg_Life_Expectancy = mean(life_expectancy, na.rm = TRUE)
  )
print(avg_life_exp_country)
## # A tibble: 193 × 2
##    country             Avg_Life_Expectancy
##    <chr>                             <dbl>
##  1 Afghanistan                        58.2
##  2 Albania                            75.2
##  3 Algeria                            73.6
##  4 Angola                             49.0
##  5 Antigua and Barbuda                75.1
##  6 Argentina                          75.2
##  7 Armenia                            73.4
##  8 Australia                          81.8
##  9 Austria                            81.5
## 10 Azerbaijan                         70.7
## # ℹ 183 more rows
# Get top 10 countries with highest life expectancy

top_10_countries <- avg_life_exp_country %>%
  arrange(desc(Avg_Life_Expectancy)) %>%
  slice_head(n = 10)

top_10_countries
## # A tibble: 10 × 2
##    country     Avg_Life_Expectancy
##    <chr>                     <dbl>
##  1 Japan                      82.5
##  2 Sweden                     82.5
##  3 Iceland                    82.4
##  4 Switzerland                82.3
##  5 France                     82.2
##  6 Italy                      82.2
##  7 Spain                      82.1
##  8 Australia                  81.8
##  9 Norway                     81.8
## 10 Canada                     81.7
ggplot(top_10_countries,
       aes(x = reorder(country, Avg_Life_Expectancy),
           y = Avg_Life_Expectancy,
           fill = Avg_Life_Expectancy)) +
  geom_bar(stat = "identity", width = 0.7) +
  coord_flip() +
  geom_text(aes(label = round(Avg_Life_Expectancy, 1)),
            hjust = -0.1, size = 3.5) +
  labs(
    title = "Top 10 Countries with Highest Life Expectancy",
    subtitle = "Countries ranked by average life expectancy",
    x = "Country",
    y = "Average Life Expectancy (years)"
  ) +
  scale_fill_gradient(low = "#56B1F7", high = "#08306B") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 11),
    axis.title = element_text(size = 12),
    legend.position = "none"
  ) +
  ylim(0, max(top_10_countries$Avg_Life_Expectancy) + 2)

=> Interpretation: The results show the top 10 countries with the highest average life expectancy across the dataset. These countries consistently demonstrate better health outcomes, likely due to stronger healthcare systems and improved living conditions.

Question 2.2: Which countries have life expectancy below 50 years?

# Filter records where life expectancy is below 50

low_life_exp <- data %>%
  filter(life_expectancy < 50)

head(low_life_exp)
##   country year     status life_expectancy adult_mortality infant_deaths alcohol
## 1  Angola 2010 Developing            49.6             365            78    7.80
## 2  Angola 2009 Developing            49.1             369            81    7.01
## 3  Angola 2008 Developing            48.7             371            84    7.07
## 4  Angola 2007 Developing            48.2             375            87    6.35
## 5  Angola 2006 Developing            47.7             381            90    5.84
## 6  Angola 2005 Developing            47.4             382            92    5.04
##   percentage_expenditure hepatitis_b measles  bmi under_five_deaths polio
## 1              191.65374          77    1190  2.4               121    81
## 2              212.92293          61    2807 19.8               127    63
## 3              249.91020          69     265 19.3               133    65
## 4              184.82134          73    1014 18.8               138    75
## 5               25.08689          NA     765 18.2               143    36
## 6               98.19145          NA     258 17.7               148    39
##   total_expenditure diphtheria hiv_aids       gdp population
## 1              3.39         77      2.5 3529.5348   23369131
## 2              4.37          6      2.5 3347.8448   22549547
## 3              3.84         69      2.6 3868.5789    2175942
## 4              3.38         73      2.6 2878.8371    2997687
## 5              4.54         34      2.5  262.4151    2262399
## 6              4.10         38      2.6 1443.9919   19552542
##   thinness_1_19_years thinness_5_9_years income_composition_of_resources
## 1                 9.1                9.0                           0.488
## 2                 9.3                9.2                           0.480
## 3                 9.5                9.4                           0.468
## 4                 9.6                9.6                           0.454
## 5                 9.8                9.7                           0.439
## 6                 1.0                9.9                           0.426
##   schooling
## 1       9.0
## 2       8.5
## 3       8.1
## 4       7.7
## 5       7.2
## 6       6.8
# Get unique country names

unique_low_countries <- low_life_exp %>%
  distinct(country)

unique_low_countries
##                        country
## 1                       Angola
## 2                     Botswana
## 3                Côte d'Ivoire
## 4     Central African Republic
## 5                         Chad
## 6                      Eritrea
## 7                        Haiti
## 8                      Lesotho
## 9                       Malawi
## 10                        Mali
## 11                  Mozambique
## 12                     Nigeria
## 13                      Rwanda
## 14                Sierra Leone
## 15                 South Sudan
## 16                   Swaziland
## 17                      Uganda
## 18 United Republic of Tanzania
## 19                      Zambia
## 20                    Zimbabwe
# Count number of countries

nrow(unique_low_countries)
## [1] 20
ggplot(low_life_exp,
       aes(x = reorder(country, life_expectancy, FUN = length),
           fill = after_stat(count))) +
  geom_bar() +
  coord_flip() +
  scale_fill_gradient(low = "#A6CEE3", high = "#1F78B4") +
  labs(
    title = "Countries with Life Expectancy Below 50 Years",
    x = "Country",
    y = "Number of Records"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

=> Interpretation: The bar chart shows the countries that have recorded life expectancy values below 50 years across multiple years. Countries such as Sierra Leone, the Central African Republic, and Angola appear more frequently, indicating persistent low life expectancy during the observed period. This pattern suggests ongoing health and socioeconomic challenges in these regions compared to other countries in the dataset.
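The frequencies behind such a chart can also be tabulated directly. The sketch below uses a made-up data frame with placeholder country names ("A", "B", "C") to show how records below the threshold could be counted per country.

```r
# Made-up data: country names and values are placeholders
toy <- data.frame(
  country = c("A", "A", "A", "B", "B", "C"),
  life_expectancy = c(45, 47, 52, 49, 55, 70),
  stringsAsFactors = FALSE
)

# Keep only records below 50 years (which() also drops any NA comparisons)
low <- toy[which(toy$life_expectancy < 50), ]

# Number of low records per country, most frequent first
records_per_country <- sort(table(low$country), decreasing = TRUE)

records_per_country
```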

Question 2.3: What is the average life expectancy by country status (Developed vs Developing)?

# Calculate average life expectancy by country status

life_exp_by_status <- data %>%
  group_by(status) %>%
  summarise(
    Avg_Life_Expectancy = mean(life_expectancy, na.rm = TRUE)
  )

life_exp_by_status
## # A tibble: 2 × 2
##   status     Avg_Life_Expectancy
##   <chr>                    <dbl>
## 1 Developed                 79.2
## 2 Developing                67.1
ggplot(life_exp_by_status,
       aes(x = status,
           y = Avg_Life_Expectancy,
           fill = status)) +
  geom_bar(stat = "identity", width = 0.6) +
  geom_text(aes(label = round(Avg_Life_Expectancy, 1)),
            vjust = -0.5, size = 4) +
  labs(
    title = "Average Life Expectancy by Country Status",
    subtitle = "Developed countries show higher life expectancy",
    x = "Country Status",
    y = "Average Life Expectancy (years)"
  ) +
  scale_fill_manual(values = c("Developed" = "#1B9E77",
                               "Developing" = "#D95F02")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 11),
    axis.title = element_text(size = 12),
    legend.position = "none"
  ) +
  ylim(0, max(life_exp_by_status$Avg_Life_Expectancy) + 5)

=> Interpretation: Developed countries average 79.2 years of life expectancy versus 67.1 years for developing countries, a gap of about 12 years. This indicates better overall health outcomes in developed nations, likely due to improved healthcare systems, living conditions, and socioeconomic stability.

Question 2.4: How has global average life expectancy changed over time?

# Calculate average life expectancy per year
life_exp_trend <- data %>%
  group_by(year) %>%
  summarise(Avg_Life_Expectancy = mean(life_expectancy, na.rm = TRUE))

ggplot(life_exp_trend, aes(x = year, y = Avg_Life_Expectancy)) +
  geom_line(color = "#1B9E77", size = 1.2) +
  geom_point(color = "#D95F02", size = 2.5) +
  labs(
    title = "Global Average Life Expectancy Over Time",
    subtitle = "Steady improvement observed from 2000 to 2015",
    x = "Year",
    y = "Average Life Expectancy (years)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 11)
  )

=> Interpretation: The line plot reveals a consistent upward trend in global average life expectancy from approximately 66.8 years in 2000 to 75.5 years in 2015. This represents an improvement of nearly 9 years over the 15-year period, suggesting sustained global progress in healthcare, disease control, and living standards. The rate of improvement appears to accelerate after 2004, which may reflect the increased global focus on public health initiatives such as the UN Millennium Development Goals adopted in 2000, whose effects began materializing in later years. The relatively flat period between 2002 and 2004 may indicate temporary stagnation possibly linked to disease outbreaks or economic disruptions during that period.
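Statements about the rate of improvement can be checked numerically by computing year-over-year changes. The sketch below uses made-up yearly averages (not the values from the dataset), and `trend`, `change`, and `total_change` are illustrative names.

```r
# Made-up yearly averages, for illustration only
trend <- data.frame(
  year = 2000:2004,
  avg  = c(66.0, 66.5, 66.5, 67.5, 68.0)
)

# Change from the previous year; the first year has no predecessor
trend$change <- c(NA, diff(trend$avg))

# Total change between the first and last year
total_change <- trend$avg[nrow(trend)] - trend$avg[1]

trend
total_change
```

A year with a change near zero corresponds to a flat stretch in the line plot, and larger values correspond to faster improvement.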

_____________________________

Level 3: Correlation Analysis

_____________________________

Question 3.1: What variables are strongly correlated with life expectancy?

=> Correlation is defined only for numeric data, so the text columns (country and status) are excluded from the correlation analysis.

# Select only numeric columns
numeric_data <- data %>% 
  select(where(is.numeric))
print(head(numeric_data))
##   year life_expectancy adult_mortality infant_deaths alcohol
## 1 2015            65.0             263            62    0.01
## 2 2014            59.9             271            64    0.01
## 3 2013            59.9             268            66    0.01
## 4 2012            59.5             272            69    0.01
## 5 2011            59.2             275            71    0.01
## 6 2010            58.8             279            74    0.01
##   percentage_expenditure hepatitis_b measles  bmi under_five_deaths polio
## 1              71.279624          65    1154 19.1                83     6
## 2              73.523582          62     492 18.6                86    58
## 3              73.219243          64     430 18.1                89    62
## 4              78.184215          67    2787 17.6                93    67
## 5               7.097109          68    3013 17.2                97    68
## 6              79.679367          66    1989 16.7               102    66
##   total_expenditure diphtheria hiv_aids       gdp population
## 1              8.16         65      0.1 584.25921   33736494
## 2              8.18         62      0.1 612.69651     327582
## 3              8.13         64      0.1 631.74498   31731688
## 4              8.52         67      0.1 669.95900    3696958
## 5              7.87         68      0.1  63.53723    2978599
## 6              9.20         66      0.1 553.32894    2883167
##   thinness_1_19_years thinness_5_9_years income_composition_of_resources
## 1                17.2               17.3                           0.479
## 2                17.5               17.5                           0.476
## 3                17.7               17.7                           0.470
## 4                17.9               18.0                           0.463
## 5                18.2               18.2                           0.454
## 6                18.4               18.4                           0.448
##   schooling
## 1      10.1
## 2      10.0
## 3       9.9
## 4       9.8
## 5       9.5
## 6       9.2
# Compute correlation matrix

cor_matrix <- cor(numeric_data, use = "complete.obs")

print(head(cor_matrix))
##                                year life_expectancy adult_mortality
## year                    1.000000000      0.05077103     -0.03709178
## life_expectancy         0.050771035      1.00000000     -0.70252306
## adult_mortality        -0.037091782     -0.70252306      1.00000000
## infant_deaths           0.008029128     -0.16907380      0.04245024
## alcohol                -0.113364764      0.40271832     -0.17553509
## percentage_expenditure  0.069553468      0.40963082     -0.23760989
##                        infant_deaths    alcohol percentage_expenditure
## year                     0.008029128 -0.1133648             0.06955347
## life_expectancy         -0.169073804  0.4027183             0.40963082
## adult_mortality          0.042450237 -0.1755351            -0.23760989
## infant_deaths            1.000000000 -0.1062169            -0.09076463
## alcohol                 -0.106216917  1.0000000             0.41704736
## percentage_expenditure  -0.090764632  0.4170474             1.00000000
##                        hepatitis_b      measles          bmi under_five_deaths
## year                    0.11489709 -0.053822046  0.005739061        0.01047859
## life_expectancy         0.19993528 -0.068881222  0.542041588       -0.19226530
## adult_mortality        -0.10522544 -0.003966685 -0.351542478        0.06036503
## infant_deaths          -0.23176894  0.532679832 -0.234425154        0.99690562
## alcohol                 0.10988939 -0.050110235  0.353396205       -0.10108216
## percentage_expenditure  0.01676017 -0.063070789  0.242738243       -0.09215806
##                             polio total_expenditure  diphtheria     hiv_aids
## year                   -0.0166988        0.05949278  0.02964059 -0.123404990
## life_expectancy         0.3272944        0.17471764  0.34133123 -0.592236293
## adult_mortality        -0.1998530       -0.08522653 -0.19142876  0.550690745
## infant_deaths          -0.1569288       -0.14695112 -0.16187100  0.007711547
## alcohol                 0.2403145        0.21488509  0.24295143 -0.027112636
## percentage_expenditure  0.1286261        0.18387236  0.13481324 -0.095084991
##                                gdp  population thinness_1_19_years
## year                    0.09642148  0.01256689          0.01975661
## life_expectancy         0.44132181 -0.02230498         -0.45783819
## adult_mortality        -0.25503473 -0.01501184          0.27223004
## infant_deaths          -0.09809202  0.67175831          0.46341526
## alcohol                 0.44343279 -0.02888023         -0.40375499
## percentage_expenditure  0.95929886 -0.01679214         -0.25503460
##                        thinness_5_9_years income_composition_of_resources
## year                           0.01412242                       0.1228918
## life_expectancy               -0.45750829                       0.7210826
## adult_mortality                0.28672288                      -0.4422033
## infant_deaths                  0.46190792                      -0.1347539
## alcohol                       -0.38620819                       0.5610743
## percentage_expenditure        -0.25563544                       0.4021697
##                          schooling
## year                    0.08873179
## life_expectancy         0.72763003
## adult_mortality        -0.42117052
## infant_deaths          -0.21437190
## alcohol                 0.61697481
## percentage_expenditure  0.42208845

=> Interpretation: The matrix gives the pairwise correlation between every pair of numeric columns. A value near +1 indicates a strong positive linear relationship, a value near -1 a strong negative linear relationship, and a value near 0 little or no linear relationship between the two columns.
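As a small illustration of how these values behave, the sketch below computes correlations on made-up vectors (not taken from the life expectancy data). It assumes the default Pearson method of `cor()`, which only captures linear association.

```r
# Toy vectors (illustrative only, not from the dataset)
x <- c(1, 2, 3, 4, 5)

y_pos  <- 2 * x + 1     # perfectly increasing linear relationship
y_neg  <- -3 * x + 10   # perfectly decreasing linear relationship
y_quad <- (x - 3)^2     # clear pattern, but curved and symmetric around x = 3

cor(x, y_pos)   # 1
cor(x, y_neg)   # -1
cor(x, y_quad)  # 0: the curved relationship is invisible to Pearson correlation
```

The last case is why a correlation near 0 should be read as "no linear relationship" rather than "no relationship at all".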

# Correlation with life expectancy

cor_life_exp <- cor_matrix["life_expectancy",]

# Sort values

sort(cor_life_exp, decreasing = TRUE)
##                 life_expectancy                       schooling 
##                      1.00000000                      0.72763003 
## income_composition_of_resources                             bmi 
##                      0.72108259                      0.54204159 
##                             gdp          percentage_expenditure 
##                      0.44132181                      0.40963082 
##                         alcohol                      diphtheria 
##                      0.40271832                      0.34133123 
##                           polio                     hepatitis_b 
##                      0.32729440                      0.19993528 
##               total_expenditure                            year 
##                      0.17471764                      0.05077103 
##                      population                         measles 
##                     -0.02230498                     -0.06888122 
##                   infant_deaths               under_five_deaths 
##                     -0.16907380                     -0.19226530 
##              thinness_5_9_years             thinness_1_19_years 
##                     -0.45750829                     -0.45783819 
##                        hiv_aids                 adult_mortality 
##                     -0.59223629                     -0.70252306

=> Interpretation: Schooling (0.73), income composition of resources (0.72), and BMI (0.54) have the strongest positive correlations with life expectancy, while adult mortality (-0.70) and HIV/AIDS prevalence (-0.59) have the strongest negative correlations. Better socioeconomic conditions are therefore associated with higher life expectancy, and a higher disease burden with lower life expectancy.

# Filter strong correlations

strong_correlations <- cor_life_exp[abs(cor_life_exp)>0.5]

head(strong_correlations)
##                 life_expectancy                 adult_mortality 
##                       1.0000000                      -0.7025231 
##                             bmi                        hiv_aids 
##                       0.5420416                      -0.5922363 
## income_composition_of_resources                       schooling 
##                       0.7210826                       0.7276300
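The filtered vector above still contains life_expectancy itself (correlation 1 with itself) and is not ordered by strength. A minimal sketch of one way to drop the self-correlation and rank the rest by absolute value follows. The vector `cor_life_exp_demo` is a hypothetical stand-in holding a few values copied from the output above, so the block runs on its own.

```r
# Hypothetical stand-in for cor_life_exp, using values printed above
cor_life_exp_demo <- c(
  life_expectancy = 1.00000000,
  schooling       = 0.72763003,
  gdp             = 0.44132181,
  bmi             = 0.54204159,
  hiv_aids        = -0.59223629,
  adult_mortality = -0.70252306
)

# Drop the outcome's correlation with itself
predictors <- cor_life_exp_demo[names(cor_life_exp_demo) != "life_expectancy"]

# Keep |r| > 0.5 and order from strongest to weakest
strong <- predictors[abs(predictors) > 0.5]
strong_ranked <- strong[order(abs(strong), decreasing = TRUE)]

strong_ranked
# Order: schooling, adult_mortality, hiv_aids, bmi
```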

Question 3.2: Can we visualize the correlation using a heatmap?

# Visualize correlation matrix

corrplot(cor_matrix,
         method = "color",
         type = "upper",
         order = "hclust",
         tl.col = "black",
         tl.cex = 0.7,
         number.cex = 0.5)

=> Interpretation: The heatmap illustrates the strength and direction of the relationships between variables in the dataset. Strong positive correlations are observed between life expectancy and schooling, income composition of resources, and BMI, with a moderate one for GDP, indicating that better education, economic conditions, and nutrition are associated with higher life expectancy. Clusters of related variables are also visible, such as infant deaths and under-five deaths, which are almost perfectly correlated (about 0.997) and therefore measure very similar health outcomes. Overall, the heatmap highlights the socioeconomic and health-related factors most closely associated with life expectancy and helps identify candidate variables for further analysis and modeling.

____________________________

Level 4: Regression Analysis

____________________________

Question 4.1: Can we predict life expectancy using multiple linear regression?

# Clean the data: replace missing numeric values with the column median

data_clean <- data %>%
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))
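
To make the imputation step concrete, here is a minimal base-R sketch of the same idea on a made-up data frame. The data frame `toy` and the helper `impute_median` are hypothetical names used only for illustration.

```r
# Made-up data with missing values in two numeric columns
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  gdp     = c(100, NA, 300, 1000),
  bmi     = c(20, 25, NA, 30)
)

# Hypothetical helper: replace NA with the median of the observed values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Apply the helper to numeric columns only; character columns are untouched
num_cols <- sapply(toy, is.numeric)
toy[num_cols] <- lapply(toy[num_cols], impute_median)

toy
# gdp[2] becomes 300 (median of 100, 300, 1000); bmi[3] becomes 25
```

Median imputation keeps every row, but it pulls imputed values toward the center of each column, so it can understate variability when many values are missing.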

# Create a temporary model with the initial variables (chosen from the correlation analysis)

model_temp <- lm(
  life_expectancy ~ schooling + income_composition_of_resources +
  adult_mortality + hiv_aids + bmi + gdp +
  thinness_1_19_years + thinness_5_9_years +
  infant_deaths + under_five_deaths,
  data = data_clean
)

# check Multicollinearity (VIF)

vif(model_temp)
##                       schooling income_composition_of_resources 
##                        3.025591                        2.947149 
##                 adult_mortality                        hiv_aids 
##                        1.697988                        1.402620 
##                             bmi                             gdp 
##                        1.681875                        1.291818 
##             thinness_1_19_years              thinness_5_9_years 
##                        8.688600                        8.816314 
##                   infant_deaths               under_five_deaths 
##                      161.611729                      161.245536

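=> Interpretation: The VIF results show severe multicollinearity between infant_deaths and under_five_deaths (VIF of about 161 each) and elevated multicollinearity between the two thinness indicators (VIF of about 8.7 to 8.8). One variable from each pair is removed to reduce redundancy and improve model reliability: the final model keeps under_five_deaths and thinness_1_19_years and drops infant_deaths and thinness_5_9_years. The final model also omits income_composition_of_resources.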
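
For intuition on what a VIF measures, the sketch below computes it by hand on simulated data using the standard definition VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing predictor j on the other predictors. The variables `x1`, `x2`, `x3` and the helper `vif_manual` are hypothetical, and this is not how `car::vif()` is implemented internally.

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1 (redundant predictor)
x3 <- rnorm(n)                 # unrelated to x1 and x2

# Hypothetical helper: regress one predictor on the others, then 1 / (1 - R^2)
vif_manual <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vif_x1 <- vif_manual(x1, data.frame(x2, x3))
vif_x3 <- vif_manual(x3, data.frame(x1, x2))

c(x1 = vif_x1, x3 = vif_x3)
# x1 gets a very large VIF because x2 almost duplicates it; x3 stays close to 1
```

The same mechanism explains the values of about 161 above: infant deaths and under-five deaths are almost perfectly correlated, so each is nearly fully predicted by the other.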

# creating the final model

model_final <- lm(
  life_expectancy ~ schooling + adult_mortality + hiv_aids + bmi + gdp + thinness_1_19_years +
    under_five_deaths,
  data = data_clean
)

vif(model_final)
##           schooling     adult_mortality            hiv_aids                 bmi 
##            1.687853            1.674425            1.392140            1.643741 
##                 gdp thinness_1_19_years   under_five_deaths 
##            1.256637            1.785263            1.285690

=> Interpretation: All VIF values in the final model are below 2, well under the common threshold of 5, so multicollinearity is not a concern. Removing the redundant variables gives more stable coefficient estimates for the regression model of life expectancy.

Question 4.2: Which variables significantly affect life expectancy?

summary(model_final)
## 
## Call:
## lm(formula = life_expectancy ~ schooling + adult_mortality + 
##     hiv_aids + bmi + gdp + thinness_1_19_years + under_five_deaths, 
##     data = data_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.7171  -2.2571   0.0881   2.6777  19.9985 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          5.779e+01  5.031e-01 114.856  < 2e-16 ***
## schooling            1.157e+00  3.318e-02  34.883  < 2e-16 ***
## adult_mortality     -2.276e-02  8.696e-04 -26.175  < 2e-16 ***
## hiv_aids            -4.914e-01  1.938e-02 -25.364  < 2e-16 ***
## bmi                  5.887e-02  5.363e-03  10.978  < 2e-16 ***
## gdp                  6.679e-05  7.030e-06   9.500  < 2e-16 ***
## thinness_1_19_years -9.089e-02  2.534e-02  -3.588 0.000339 ***
## under_five_deaths   -2.608e-03  5.893e-04  -4.426 9.96e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.519 on 2930 degrees of freedom
## Multiple R-squared:  0.7747, Adjusted R-squared:  0.7742 
## F-statistic:  1439 on 7 and 2930 DF,  p-value: < 2.2e-16

=> Interpretation: The regression model is statistically significant overall (F = 1439 on 7 and 2930 degrees of freedom, p < 2.2e-16), confirming that the predictors jointly explain life expectancy better than an intercept-only model. The model explains approximately 77.5% of the variation in life expectancy (R² = 0.7747), indicating strong explanatory power. Every selected variable is individually significant at the 0.001 level, suggesting that both socioeconomic and health-related factors are associated with life expectancy.
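
The adjusted R² and the F-statistic in the summary can be recovered from the reported R², the number of predictors, and the residual degrees of freedom. The sketch below is only a consistency check using the rounded values printed above and the standard textbook formulas.

```r
# Values copied from the summary output above
r2       <- 0.7747   # Multiple R-squared
p        <- 7        # number of predictors
df_resid <- 2930     # residual degrees of freedom
n        <- df_resid + p + 1   # 2938 observations

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic: explained variation per predictor over
# unexplained variation per residual degree of freedom
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

round(adj_r2, 4)  # 0.7742
round(f_stat)     # 1439
```

Both values match the summary, and the small gap between R² and adjusted R² reflects that seven predictors are few relative to 2938 observations.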

Question 4.3: Is the model valid? (Model Diagnostics)

par(mfrow = c(2,2))
plot(model_final)

=> Interpretation: The diagnostic plots indicate that the regression model assumptions are reasonably satisfied. The residuals are mostly randomly distributed, suggesting that the linearity assumption holds with minor deviations. The Q-Q plot shows approximate normality of residuals, with slight deviations at the tails indicating the presence of outliers. The scale-location plot suggests mild heteroscedasticity, as the variance of residuals is not perfectly constant across fitted values. The residuals vs leverage plot shows a few influential observations, but none appear to significantly affect the model. Overall, the model is stable and suitable for interpretation despite minor deviations from ideal assumptions.
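
The visual reading of the residuals vs leverage plot can be backed by a numeric check. The sketch below uses base R's `cooks.distance()` on the built-in `mtcars` data as a stand-in model, and the 4/n cutoff is a common rule of thumb rather than a fixed standard. Applying it to `model_final` would follow the same pattern.

```r
# Stand-in model on a built-in dataset (not the life expectancy model)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance: how much the fitted values change if one row is removed
cooks_d <- cooks.distance(fit)

# Rule-of-thumb cutoff (an assumption, other cutoffs are also used)
threshold <- 4 / nrow(mtcars)

# Observations flagged as potentially influential, largest first
influential <- sort(cooks_d[cooks_d > threshold], decreasing = TRUE)
influential
```

For the mild heteroscedasticity noted above, `car::ncvTest()` from the car package loaded at the top of the analysis offers a formal score test for non-constant error variance.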

Question 4.4: How well does the model perform on unseen data?

# Train-test split (80/20)
# Note: no seed is set, so the split and the RMSE below vary between runs

train_index <- sample(seq_len(nrow(data_clean)), floor(0.8 * nrow(data_clean)))

train_data <- data_clean[train_index,] 
test_data <- data_clean[-train_index,]

# Train model on training data

model_train <- lm(
  life_expectancy ~ schooling + adult_mortality + hiv_aids + bmi + gdp + thinness_1_19_years +
    under_five_deaths,
  data = train_data
)

# Make predictions

predictions <- predict(model_train, test_data)

# Calculate RMSE

rmse <- sqrt(mean((test_data$life_expectancy - predictions)^2))
rmse
## [1] 4.351144

=> Interpretation: The RMSE of approximately 4.35 means that the model's predictions typically deviate from the actual life expectancy values by about 4.35 years on the test set. This is close to the residual standard error of the full-data model (4.519), so there is little sign of overfitting. Because the train-test split is random and no seed is set, the exact value changes slightly from run to run. Considering the overall range of life expectancy in the dataset, this level of error is relatively low, suggesting that the model has good predictive performance.
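
Since the split above is random, fixing a seed makes the reported error repeatable. The following is a minimal sketch on the built-in `mtcars` data. The seed value and the helpers `rmse_fn` and `mae_fn` are arbitrary illustrative choices, and the mean absolute error is added only as a complementary measure that is less sensitive to large misses.

```r
set.seed(123)  # arbitrary seed so the split is the same on every run

n         <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train     <- mtcars[train_idx, ]
test      <- mtcars[-train_idx, ]

# Fit on the training rows only, then predict the held-out rows
fit  <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit, newdata = test)

# Hypothetical helpers for two common error measures
rmse_fn <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae_fn  <- function(actual, predicted) mean(abs(actual - predicted))

c(rmse = rmse_fn(test$mpg, pred), mae = mae_fn(test$mpg, pred))
```

RMSE is never smaller than MAE, and a large gap between the two points to a few badly predicted observations.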

# Create predicted vs actual dataframe
pred_df <- data.frame(
  Actual = test_data$life_expectancy,
  Predicted = predictions
)

ggplot(pred_df, aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.4, color = "#2C7FB8") +
  geom_abline(slope = 1, intercept = 0, 
              color = "#D95F02", linewidth = 1, linetype = "dashed") +
  labs(
    title = "Predicted vs Actual Life Expectancy",
    subtitle = "Points closer to the dashed line indicate better predictions",
    x = "Actual Life Expectancy (years)",
    y = "Predicted Life Expectancy (years)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 11)
  )

=> Interpretation: The predicted vs actual plot shows that most points cluster closely around the perfect prediction line, confirming that the model performs well across the majority of observations. Some deviation is observed at lower life expectancy values (below 55 years), suggesting the model struggles somewhat with extreme cases, which is consistent with the RMSE of approximately 4.35 years.
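
A single random split yields one RMSE value, which depends on which rows land in the test set. One way to see that variability is k-fold cross-validation, sketched here in base R on the built-in `mtcars` data. The choice of five folds and the seed are assumptions for illustration.

```r
set.seed(2024)  # arbitrary seed for a repeatable fold assignment

k     <- 5
n     <- nrow(mtcars)
folds <- sample(rep(seq_len(k), length.out = n))  # random fold label per row

# For each fold: train on the other folds, evaluate on the held-out fold
fold_rmse <- sapply(seq_len(k), function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$mpg - pred)^2))
})

fold_rmse        # one RMSE per fold
mean(fold_rmse)  # cross-validated estimate of out-of-sample error
```

Reporting the mean together with the range of the fold values gives a more honest picture of out-of-sample error than any single split.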

____________

Conclusion

____________

=> This study analyzed global life expectancy using a range of public health and socioeconomic indicators. Exploratory data analysis revealed that most countries have a life expectancy between 65 and 80 years, with a few countries exhibiting significantly lower values, indicating disparities in global health conditions.

Correlation analysis showed that schooling, income composition of resources, BMI, and GDP are positively associated with life expectancy, while adult mortality and HIV/AIDS prevalence have strong negative relationships and under-five deaths a weak negative one (about -0.19). These findings highlight the importance of education, economic development, and healthcare in improving population health.

A multiple linear regression model was developed to quantify these relationships. The model demonstrated strong explanatory power (R² ≈ 0.77), indicating that approximately 77% of the variation in life expectancy is explained by the selected variables. All predictors were statistically significant, and multicollinearity was addressed to ensure model reliability.

Model diagnostics indicated that the assumptions of linear regression were reasonably satisfied, with only minor deviations. Additionally, predictive evaluation on a held-out test set gave an RMSE of about 4.35, meaning that the model estimates life expectancy with a typical error of roughly 4.35 years. The exact value varies slightly between runs because the train-test split is random.

Overall, the results suggest that both socioeconomic factors and healthcare conditions are closely linked to life expectancy. Because the analysis is based on correlations and regression on observational data, it shows association rather than causation, but it indicates that improving education, reducing disease burden, and enhancing economic stability may contribute to increased life expectancy across countries.