# Load libraries
library(tidyverse)
library(ggplot2)
library(dplyr)
library(readr)
library(corrplot)
library(caret)
library(janitor)
library(knitr)
library(car)
# Load dataset
data <- read.csv("Life Expectancy Data.csv")
# View first few rows
head(data)
## Country Year Status Life.expectancy Adult.Mortality infant.deaths
## 1 Afghanistan 2015 Developing 65.0 263 62
## 2 Afghanistan 2014 Developing 59.9 271 64
## 3 Afghanistan 2013 Developing 59.9 268 66
## 4 Afghanistan 2012 Developing 59.5 272 69
## 5 Afghanistan 2011 Developing 59.2 275 71
## 6 Afghanistan 2010 Developing 58.8 279 74
## Alcohol percentage.expenditure Hepatitis.B Measles BMI under.five.deaths
## 1 0.01 71.279624 65 1154 19.1 83
## 2 0.01 73.523582 62 492 18.6 86
## 3 0.01 73.219243 64 430 18.1 89
## 4 0.01 78.184215 67 2787 17.6 93
## 5 0.01 7.097109 68 3013 17.2 97
## 6 0.01 79.679367 66 1989 16.7 102
## Polio Total.expenditure Diphtheria HIV.AIDS GDP Population
## 1 6 8.16 65 0.1 584.25921 33736494
## 2 58 8.18 62 0.1 612.69651 327582
## 3 62 8.13 64 0.1 631.74498 31731688
## 4 67 8.52 67 0.1 669.95900 3696958
## 5 68 7.87 68 0.1 63.53723 2978599
## 6 66 9.20 66 0.1 553.32894 2883167
## thinness..1.19.years thinness.5.9.years Income.composition.of.resources
## 1 17.2 17.3 0.479
## 2 17.5 17.5 0.476
## 3 17.7 17.7 0.470
## 4 17.9 18.0 0.463
## 5 18.2 18.2 0.454
## 6 18.4 18.4 0.448
## Schooling
## 1 10.1
## 2 10.0
## 3 9.9
## 4 9.8
## 5 9.5
## 6 9.2
# Clean column Names
data <- clean_names(data)
colnames(data)
## [1] "country" "year"
## [3] "status" "life_expectancy"
## [5] "adult_mortality" "infant_deaths"
## [7] "alcohol" "percentage_expenditure"
## [9] "hepatitis_b" "measles"
## [11] "bmi" "under_five_deaths"
## [13] "polio" "total_expenditure"
## [15] "diphtheria" "hiv_aids"
## [17] "gdp" "population"
## [19] "thinness_1_19_years" "thinness_5_9_years"
## [21] "income_composition_of_resources" "schooling"
# Check number of rows and columns
dim(data)
## [1] 2938 22
=> Interpretation: The dataset contains 2,938 country-year observations and 22 variables covering health and socio-economic indicators. The dim() function returns the number of rows followed by the number of columns.
# Check structure of dataset
str(data)
## 'data.frame': 2938 obs. of 22 variables:
## $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ year : int 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
## $ status : chr "Developing" "Developing" "Developing" "Developing" ...
## $ life_expectancy : num 65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
## $ adult_mortality : int 263 271 268 272 275 279 281 287 295 295 ...
## $ infant_deaths : int 62 64 66 69 71 74 77 80 82 84 ...
## $ alcohol : num 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
## $ percentage_expenditure : num 71.3 73.5 73.2 78.2 7.1 ...
## $ hepatitis_b : int 65 62 64 67 68 66 63 64 63 64 ...
## $ measles : int 1154 492 430 2787 3013 1989 2861 1599 1141 1990 ...
## $ bmi : num 19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
## $ under_five_deaths : int 83 86 89 93 97 102 106 110 113 116 ...
## $ polio : int 6 58 62 67 68 66 63 64 63 58 ...
## $ total_expenditure : num 8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
## $ diphtheria : int 65 62 64 67 68 66 63 64 63 58 ...
## $ hiv_aids : num 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
## $ gdp : num 584.3 612.7 631.7 670 63.5 ...
## $ population : num 33736494 327582 31731688 3696958 2978599 ...
## $ thinness_1_19_years : num 17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
## $ thinness_5_9_years : num 17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
## $ income_composition_of_resources: num 0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
## $ schooling : num 10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...
=> Interpretation: The str() function displays the structure of the dataset, including variable names and data types. Here, country and status are character variables, while the remaining 20 variables are numeric or integer. Knowing these types is important for selecting appropriate statistical and machine learning methods.
# Summary statistics of the dataset
summary(data)
## country year status life_expectancy
## Length:2938 Min. :2000 Length:2938 Min. :36.30
## Class :character 1st Qu.:2004 Class :character 1st Qu.:63.10
## Mode :character Median :2008 Mode :character Median :72.10
## Mean :2008 Mean :69.22
## 3rd Qu.:2012 3rd Qu.:75.70
## Max. :2015 Max. :89.00
## NA's :10
## adult_mortality infant_deaths alcohol percentage_expenditure
## Min. : 1.0 Min. : 0.0 Min. : 0.0100 Min. : 0.000
## 1st Qu.: 74.0 1st Qu.: 0.0 1st Qu.: 0.8775 1st Qu.: 4.685
## Median :144.0 Median : 3.0 Median : 3.7550 Median : 64.913
## Mean :164.8 Mean : 30.3 Mean : 4.6029 Mean : 738.251
## 3rd Qu.:228.0 3rd Qu.: 22.0 3rd Qu.: 7.7025 3rd Qu.: 441.534
## Max. :723.0 Max. :1800.0 Max. :17.8700 Max. :19479.912
## NA's :10 NA's :194
## hepatitis_b measles bmi under_five_deaths
## Min. : 1.00 Min. : 0.0 Min. : 1.00 Min. : 0.00
## 1st Qu.:77.00 1st Qu.: 0.0 1st Qu.:19.30 1st Qu.: 0.00
## Median :92.00 Median : 17.0 Median :43.50 Median : 4.00
## Mean :80.94 Mean : 2419.6 Mean :38.32 Mean : 42.04
## 3rd Qu.:97.00 3rd Qu.: 360.2 3rd Qu.:56.20 3rd Qu.: 28.00
## Max. :99.00 Max. :212183.0 Max. :87.30 Max. :2500.00
## NA's :553 NA's :34
## polio total_expenditure diphtheria hiv_aids
## Min. : 3.00 Min. : 0.370 Min. : 2.00 Min. : 0.100
## 1st Qu.:78.00 1st Qu.: 4.260 1st Qu.:78.00 1st Qu.: 0.100
## Median :93.00 Median : 5.755 Median :93.00 Median : 0.100
## Mean :82.55 Mean : 5.938 Mean :82.32 Mean : 1.742
## 3rd Qu.:97.00 3rd Qu.: 7.492 3rd Qu.:97.00 3rd Qu.: 0.800
## Max. :99.00 Max. :17.600 Max. :99.00 Max. :50.600
## NA's :19 NA's :226 NA's :19
## gdp population thinness_1_19_years thinness_5_9_years
## Min. :1.681e+00 Min. :3.400e+01 Min. : 0.10 Min. : 0.10
## 1st Qu.:4.639e+02 1st Qu.:1.958e+05 1st Qu.: 1.60 1st Qu.: 1.50
## Median :1.767e+03 Median :1.387e+06 Median : 3.30 Median : 3.30
## Mean :7.483e+03 Mean :1.275e+07 Mean : 4.84 Mean : 4.87
## 3rd Qu.:5.911e+03 3rd Qu.:7.420e+06 3rd Qu.: 7.20 3rd Qu.: 7.20
## Max. :1.192e+05 Max. :1.294e+09 Max. :27.70 Max. :28.60
## NA's :448 NA's :652 NA's :34 NA's :34
## income_composition_of_resources schooling
## Min. :0.0000 Min. : 0.00
## 1st Qu.:0.4930 1st Qu.:10.10
## Median :0.6770 Median :12.30
## Mean :0.6276 Mean :11.99
## 3rd Qu.:0.7790 3rd Qu.:14.30
## Max. :0.9480 Max. :20.70
## NA's :167 NA's :163
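=> Interpretation: The summary() function provides descriptive statistics for each variable, along with the count of missing values (NA's). Life expectancy ranges from 36.3 to 89.0 years, with a mean of 69.22 and a median of 72.10. Several variables are strongly right-skewed: measles, for example, has a median of 17 but a mean of about 2,420. These statistics describe the distribution and central tendency of the data and guide the later cleaning steps.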
# Check number of missing values in each column
colSums(is.na(data))
## country year
## 0 0
## status life_expectancy
## 0 10
## adult_mortality infant_deaths
## 10 0
## alcohol percentage_expenditure
## 194 0
## hepatitis_b measles
## 553 0
## bmi under_five_deaths
## 34 0
## polio total_expenditure
## 19 226
## diphtheria hiv_aids
## 19 0
## gdp population
## 448 652
## thinness_1_19_years thinness_5_9_years
## 34 34
## income_composition_of_resources schooling
## 167 163
=> Interpretation: The output shows the number of missing values present in each variable of the dataset. Identifying missing values is important because they can affect statistical analysis and machine learning model performance. Variables with missing values will be handled in later stages using appropriate methods such as imputation or removal.
# Check total number of missing values in the entire dataset
sum(is.na(data))
## [1] 2563
=> Interpretation: This value is the total number of missing values across the entire dataset. With 2,938 rows and 22 columns (64,636 cells), 2,563 missing values amount to roughly 4% of all cells, so careful data cleaning is needed before performing analysis or building machine learning models.
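Raw counts of missing values are easier to compare once they are expressed as a share of the rows. The sketch below illustrates that calculation on a small made-up data frame (`toy` and its columns are hypothetical and stand in for the real dataset); it is one possible way to do this, not part of the original analysis.

```r
# Toy data frame standing in for the real dataset (values are made up)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 3, 4),
  c = c(1, 2, 3, 4)
)

# Share of missing values per column, in percent, largest first
missing_pct <- colMeans(is.na(toy)) * 100
missing_pct <- sort(missing_pct, decreasing = TRUE)
print(round(missing_pct, 1))

# Overall share of missing cells, in percent
overall_pct <- mean(is.na(toy)) * 100
print(overall_pct)
```

Applied to the real data, the same two lines (`colMeans(is.na(data))` and `mean(is.na(data))`) would give the per-variable and overall missing shares.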
# Missing data
missing_data <- colSums(is.na(data))
missing_df <- data.frame(
Variable = names(missing_data),
Missing_Count = missing_data
)
# Visualize Missing Values
ggplot(missing_df, aes(x = reorder(Variable, Missing_Count), y = Missing_Count, fill = Missing_Count)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(
title = "Missing Values by Variable",
subtitle = "Higher bars indicate variables with more missing data",
x = "Variables",
y = "Number of Missing Values"
) +
scale_fill_gradient(low = "#56B1F7", high = "#132B43") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 11),
axis.text = element_text(size = 10),
axis.title = element_text(size = 12),
legend.position = "none"
)
=> Interpretation: This plot shows which variables have the most missing data and where cleaning effort should be focused. population (652) and hepatitis_b (553) have the highest number of missing values. Among the variables with any missing data, adult_mortality and life_expectancy have the fewest (10 each), and eight variables have none.
# Calculate average life expectancy
mean(data$life_expectancy, na.rm = TRUE)
## [1] 69.22493
=> Interpretation: The calculated value represents the average life expectancy across all countries and years in the dataset. This provides a general understanding of global health conditions and serves as a baseline for further statistical analysis.
# Calculate median life expectancy
median(data$life_expectancy, na.rm = TRUE)
## [1] 72.1
=> Interpretation: The median is the middle value when observations are arranged in order, and it is less affected by extreme values than the mean. Here the median (72.1) is higher than the mean (69.22), which indicates that a tail of low values pulls the mean down.
# Calculate standard deviation
sd(data$life_expectancy, na.rm = TRUE)
## [1] 9.523867
=> Interpretation: The standard deviation measures the variability of life expectancy across countries and years. A value of about 9.5 years around a mean of 69.2 indicates substantial variation between observations.
ggplot(data, aes(x = life_expectancy)) +
geom_histogram(bins = 30, fill = "#2C7FB8", color = "white", alpha = 0.9) +
geom_density(aes(y = after_stat(count)), color = "#D95F0E", linewidth = 1) +
labs(
title = "Distribution of Life Expectancy",
subtitle = "Most countries fall between 65–80 years",
x = "Life Expectancy",
y = "Frequency"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 11),
axis.title = element_text(size = 12)
)
=> Interpretation: The histogram shows that most life expectancy values fall between 65 and 80 years, indicating that the majority of countries have moderate to high life expectancy. A smaller number of countries have lower values below 50 years, suggesting variation in health conditions across regions.
# Create a boxplot for life expectancy
ggplot(data, aes(x = "", y = life_expectancy)) +
geom_boxplot(fill = "#2C7FB8", alpha = 0.7) +
labs(
title = "Boxplot of Life Expectancy",
x = "",
y = "Life Expectancy (years)"
) +
theme_minimal() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank())
=> Interpretation: The boxplot shows that most life expectancy values are concentrated between approximately 63 and 76 years, with a median around 72 years. A few lower values below 45 years appear as outliers, indicating countries with significantly lower life expectancy compared to the majority.
# Detect outliers using the IQR method
# Calculate quartiles
Q1 <- quantile(data$life_expectancy, 0.25, na.rm = TRUE)
Q3 <- quantile(data$life_expectancy, 0.75, na.rm = TRUE)
# Calculate IQR
IQR_value <- Q3 - Q1
# Define outlier limits
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
# Find outliers
outliers <- data$life_expectancy[
data$life_expectancy < lower_bound |
data$life_expectancy > upper_bound
]
length(outliers)
## [1] 20
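=> Interpretation: The IQR method flags observations that fall more than 1.5 × IQR below the first quartile or above the third quartile. These values are considered outliers and may represent extreme health conditions or data irregularities. Note that the count of 20 is inflated: life_expectancy has 10 missing values, and logical indexing with an NA condition returns NA, so those 10 entries are counted as well, which leaves 10 actual outliers.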
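When the variable contains missing values, subsetting with a logical condition keeps an NA entry for every NA in the condition. The following sketch, on a made-up vector `x` (hypothetical values, not taken from the dataset), contrasts the naive subsetting with a version that wraps the condition in `which()` so that only genuine outliers are kept.

```r
# Toy vector with two missing values and one low outlier (values are made up)
x <- c(60, 62, 65, 70, 72, 75, 30, NA, NA)

# IQR fences, ignoring missing values
q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[[2]] - q[[1]]
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

# The condition is NA wherever x is NA
is_outlier <- x < lower | x > upper

naive <- x[is_outlier]          # NA conditions come back as NA elements
safe  <- x[which(is_outlier)]   # which() keeps only the TRUE positions

length(naive)
length(safe)
```

The same idea would carry over to the real column by replacing `x` with `data$life_expectancy`.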
# Calculate average life expectancy by country
avg_life_exp_country <- data %>%
group_by(country) %>%
summarise(
Avg_Life_Expectancy = mean(life_expectancy, na.rm = TRUE)
)
print(avg_life_exp_country)
## # A tibble: 193 × 2
## country Avg_Life_Expectancy
## <chr> <dbl>
## 1 Afghanistan 58.2
## 2 Albania 75.2
## 3 Algeria 73.6
## 4 Angola 49.0
## 5 Antigua and Barbuda 75.1
## 6 Argentina 75.2
## 7 Armenia 73.4
## 8 Australia 81.8
## 9 Austria 81.5
## 10 Azerbaijan 70.7
## # ℹ 183 more rows
# Get top 10 countries with highest life expectancy
top_10_countries <- avg_life_exp_country %>%
arrange(desc(Avg_Life_Expectancy)) %>%
slice_head(n = 10)
top_10_countries
## # A tibble: 10 × 2
## country Avg_Life_Expectancy
## <chr> <dbl>
## 1 Japan 82.5
## 2 Sweden 82.5
## 3 Iceland 82.4
## 4 Switzerland 82.3
## 5 France 82.2
## 6 Italy 82.2
## 7 Spain 82.1
## 8 Australia 81.8
## 9 Norway 81.8
## 10 Canada 81.7
ggplot(top_10_countries,
aes(x = reorder(country, Avg_Life_Expectancy),
y = Avg_Life_Expectancy,
fill = Avg_Life_Expectancy)) +
geom_bar(stat = "identity", width = 0.7) +
coord_flip() +
geom_text(aes(label = round(Avg_Life_Expectancy, 1)),
hjust = -0.1, size = 3.5) +
labs(
title = "Top 10 Countries with Highest Life Expectancy",
subtitle = "Countries ranked by average life expectancy",
x = "Country",
y = "Average Life Expectancy (years)"
) +
scale_fill_gradient(low = "#56B1F7", high = "#08306B") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 11),
axis.title = element_text(size = 12),
legend.position = "none"
) +
ylim(0, max(top_10_countries$Avg_Life_Expectancy) + 2)
=> Interpretation: The results show the top 10 countries with the highest average life expectancy across the dataset. These countries consistently demonstrate better health outcomes, likely due to stronger healthcare systems and improved living conditions.
# Filter records where life expectancy is below 50
low_life_exp <- data %>%
filter(life_expectancy < 50)
head(low_life_exp)
## country year status life_expectancy adult_mortality infant_deaths alcohol
## 1 Angola 2010 Developing 49.6 365 78 7.80
## 2 Angola 2009 Developing 49.1 369 81 7.01
## 3 Angola 2008 Developing 48.7 371 84 7.07
## 4 Angola 2007 Developing 48.2 375 87 6.35
## 5 Angola 2006 Developing 47.7 381 90 5.84
## 6 Angola 2005 Developing 47.4 382 92 5.04
## percentage_expenditure hepatitis_b measles bmi under_five_deaths polio
## 1 191.65374 77 1190 2.4 121 81
## 2 212.92293 61 2807 19.8 127 63
## 3 249.91020 69 265 19.3 133 65
## 4 184.82134 73 1014 18.8 138 75
## 5 25.08689 NA 765 18.2 143 36
## 6 98.19145 NA 258 17.7 148 39
## total_expenditure diphtheria hiv_aids gdp population
## 1 3.39 77 2.5 3529.5348 23369131
## 2 4.37 6 2.5 3347.8448 22549547
## 3 3.84 69 2.6 3868.5789 2175942
## 4 3.38 73 2.6 2878.8371 2997687
## 5 4.54 34 2.5 262.4151 2262399
## 6 4.10 38 2.6 1443.9919 19552542
## thinness_1_19_years thinness_5_9_years income_composition_of_resources
## 1 9.1 9.0 0.488
## 2 9.3 9.2 0.480
## 3 9.5 9.4 0.468
## 4 9.6 9.6 0.454
## 5 9.8 9.7 0.439
## 6 1.0 9.9 0.426
## schooling
## 1 9.0
## 2 8.5
## 3 8.1
## 4 7.7
## 5 7.2
## 6 6.8
# Get unique country names
unique_low_countries <- low_life_exp %>%
distinct(country)
unique_low_countries
## country
## 1 Angola
## 2 Botswana
## 3 Côte d'Ivoire
## 4 Central African Republic
## 5 Chad
## 6 Eritrea
## 7 Haiti
## 8 Lesotho
## 9 Malawi
## 10 Mali
## 11 Mozambique
## 12 Nigeria
## 13 Rwanda
## 14 Sierra Leone
## 15 South Sudan
## 16 Swaziland
## 17 Uganda
## 18 United Republic of Tanzania
## 19 Zambia
## 20 Zimbabwe
# Count number of countries
nrow(unique_low_countries)
## [1] 20
ggplot(low_life_exp,
aes(x = reorder(country, life_expectancy),
fill = after_stat(count))) +
geom_bar() +
coord_flip() +
scale_fill_gradient(low = "#A6CEE3", high = "#1F78B4") +
labs(
title = "Countries with Life Expectancy Below 50 Years",
x = "Country",
y = "Number of Records"
) +
theme_minimal() +
theme(legend.position = "none")
=> Interpretation: The bar chart shows the countries that have recorded life expectancy values below 50 years across multiple years. Countries such as Sierra Leone, the Central African Republic, and Angola appear more frequently, indicating persistent low life expectancy during the observed period. This pattern suggests ongoing health and socioeconomic challenges in these regions compared to other countries in the dataset.
# Calculate average life expectancy by country status
life_exp_by_status <- data %>%
group_by(status) %>%
summarise(
Avg_Life_Expectancy = mean(life_expectancy, na.rm = TRUE)
)
life_exp_by_status
## # A tibble: 2 × 2
## status Avg_Life_Expectancy
## <chr> <dbl>
## 1 Developed 79.2
## 2 Developing 67.1
ggplot(life_exp_by_status,
aes(x = status,
y = Avg_Life_Expectancy,
fill = status)) +
geom_bar(stat = "identity", width = 0.6) +
geom_text(aes(label = round(Avg_Life_Expectancy, 1)),
vjust = -0.5, size = 4) +
labs(
title = "Average Life Expectancy by Country Status",
subtitle = "Developed countries show higher life expectancy",
x = "Country Status",
y = "Average Life Expectancy (years)"
) +
scale_fill_manual(values = c("Developed" = "#1B9E77",
"Developing" = "#D95F02")) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 11),
axis.title = element_text(size = 12),
legend.position = "none"
) +
ylim(0, max(life_exp_by_status$Avg_Life_Expectancy) + 5)
=> Interpretation: The bar chart shows that developed countries have a higher average life expectancy (79.2 years) than developing countries (67.1 years), a gap of about 12 years. This indicates better overall health outcomes in developed nations, likely due to improved healthcare systems, living conditions, and socioeconomic stability.
# Calculate average life expectancy per year
life_exp_trend <- data %>%
group_by(year) %>%
summarise(Avg_Life_Expectancy = mean(life_expectancy, na.rm = TRUE))
ggplot(life_exp_trend, aes(x = year, y = Avg_Life_Expectancy)) +
geom_line(color = "#1B9E77", linewidth = 1.2) +
geom_point(color = "#D95F02", size = 2.5) +
labs(
title = "Global Average Life Expectancy Over Time",
subtitle = "Steady improvement observed from 2000 to 2015",
x = "Year",
y = "Average Life Expectancy (years)"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 11)
)
=> Interpretation: The line plot reveals a consistent upward trend in global average life expectancy from approximately 66.8 years in 2000 to 75.5 years in 2015. This represents an improvement of nearly 9 years over the 15-year period, suggesting sustained global progress in healthcare, disease control, and living standards. The rate of improvement appears to accelerate after 2004, which may reflect the increased global focus on public health initiatives such as the UN Millennium Development Goals adopted in 2000, whose effects began materializing in later years. The relatively flat period between 2002 and 2004 may indicate temporary stagnation possibly linked to disease outbreaks or economic disruptions during that period.
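# Compute the correlation matrix for the numeric variables
# (assumption: the original definition of cor_matrix is not shown;
# pairwise complete observations are used here to handle missing values)
cor_matrix <- cor(select(data, where(is.numeric)),
use = "pairwise.complete.obs")
# Visualize correlation matrix
corrplot(cor_matrix,
method = "color",
type = "upper",
order = "hclust",
tl.col = "black",
tl.cex = 0.7,
number.cex = 0.5)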
=> Interpretation: The heatmap illustrates the strength and direction of the relationships between variables in the dataset. Strong positive correlations are observed between life expectancy and variables such as schooling, income composition of resources, BMI, and GDP, indicating that better education, economic conditions, and nutrition are associated with higher life expectancy. Additionally, clusters of related variables can be observed, such as infant deaths and under-five deaths, which are highly positively correlated, indicating that they measure similar health outcomes. Overall, the heatmap highlights key socioeconomic and health-related factors associated with life expectancy and helps identify the most important variables for further analysis and modeling.
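A heatmap is easier to act on when the correlations with the target are also listed numerically. The sketch below ranks predictors by the absolute value of their correlation with the response on simulated data; the data frame `toy`, its column relationships, and the choice of `use = "pairwise.complete.obs"` are assumptions made for illustration only.

```r
set.seed(1)
n <- 200

# Simulated data: schooling drives life expectancy, which in turn is
# negatively related to adult mortality; `noise` is unrelated
toy <- data.frame(schooling = rnorm(n, 12, 3))
toy$life_expectancy <- 50 + 1.5 * toy$schooling + rnorm(n, 0, 3)
toy$adult_mortality <- 400 - 3 * toy$life_expectancy + rnorm(n, 0, 20)
toy$noise <- rnorm(n)
toy$schooling[1:10] <- NA   # a few missing values, as in the real data

# Pairwise complete observations keep as many rows as possible per pair
cor_toy <- cor(toy, use = "pairwise.complete.obs")

# Correlations with the target, strongest first (by absolute value)
target_cor <- cor_toy[, "life_expectancy"]
target_cor <- target_cor[names(target_cor) != "life_expectancy"]
ranked <- target_cor[order(abs(target_cor), decreasing = TRUE)]
print(round(ranked, 2))
```

Ranking by absolute value keeps strong negative predictors, such as mortality measures, near the top instead of at the bottom of the list.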
# Clean the data: replace missing numeric values with each column's median
data_clean <- data %>%
mutate(across(where(is.numeric),
~ ifelse(is.na(.), median(., na.rm = TRUE), .)))
# Create a temporary model with the initial variables (based on correlation)
model_temp <- lm(
life_expectancy ~ schooling + income_composition_of_resources +
adult_mortality + hiv_aids + bmi + gdp +
thinness_1_19_years + thinness_5_9_years +
infant_deaths + under_five_deaths,
data = data_clean
)
# check Multicollinearity (VIF)
vif(model_temp)
## schooling income_composition_of_resources
## 3.025591 2.947149
## adult_mortality hiv_aids
## 1.697988 1.402620
## bmi gdp
## 1.681875 1.291818
## thinness_1_19_years thinness_5_9_years
## 8.688600 8.816314
## infant_deaths under_five_deaths
## 161.611729 161.245536
=> Interpretation: The VIF results indicate high multicollinearity among certain variables, particularly infant deaths and under-five deaths, as well as thinness indicators. One variable from each highly correlated pair will be removed to reduce redundancy and improve model reliability.
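For a numeric predictor, the variance inflation factor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. A VIF of about 161 therefore corresponds to an R² of roughly 0.994. The sketch below computes this by hand in base R on simulated data; `manual_vif` is a hypothetical helper written for this illustration and is not the `car` implementation.

```r
set.seed(42)
n <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1
x3 <- rnorm(n)                  # unrelated to x1 and x2
y  <- 1 + x1 + x3 + rnorm(n)
toy <- data.frame(y, x1, x2, x3)

# Hypothetical helper: VIF of one predictor given the other predictors
manual_vif <- function(predictor, others, data) {
  f <- reformulate(others, response = predictor)
  r2 <- summary(lm(f, data = data))$r.squared
  1 / (1 - r2)
}

preds <- c("x1", "x2", "x3")
vifs <- sapply(preds, function(p) manual_vif(p, setdiff(preds, p), toy))
print(round(vifs, 2))
```

In this toy setup the two near-duplicate predictors receive very large VIFs while the unrelated one stays close to 1, which mirrors the pattern seen for infant_deaths and under_five_deaths.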
# Create the final model (drops infant_deaths, thinness_5_9_years, and income_composition_of_resources)
model_final <- lm(
life_expectancy ~ schooling + adult_mortality + hiv_aids + bmi + gdp + thinness_1_19_years +
under_five_deaths,
data = data_clean
)
vif(model_final)
## schooling adult_mortality hiv_aids bmi
## 1.687853 1.674425 1.392140 1.643741
## gdp thinness_1_19_years under_five_deaths
## 1.256637 1.785263 1.285690
=> Interpretation: The VIF values for all selected variables are below 2, well under the common threshold of 5, indicating that multicollinearity is not a concern in the final model. Removing the redundant variables yields more stable coefficient estimates for predicting life expectancy.
summary(model_final)
##
## Call:
## lm(formula = life_expectancy ~ schooling + adult_mortality +
## hiv_aids + bmi + gdp + thinness_1_19_years + under_five_deaths,
## data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.7171 -2.2571 0.0881 2.6777 19.9985
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.779e+01 5.031e-01 114.856 < 2e-16 ***
## schooling 1.157e+00 3.318e-02 34.883 < 2e-16 ***
## adult_mortality -2.276e-02 8.696e-04 -26.175 < 2e-16 ***
## hiv_aids -4.914e-01 1.938e-02 -25.364 < 2e-16 ***
## bmi 5.887e-02 5.363e-03 10.978 < 2e-16 ***
## gdp 6.679e-05 7.030e-06 9.500 < 2e-16 ***
## thinness_1_19_years -9.089e-02 2.534e-02 -3.588 0.000339 ***
## under_five_deaths -2.608e-03 5.893e-04 -4.426 9.96e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.519 on 2930 degrees of freedom
## Multiple R-squared: 0.7747, Adjusted R-squared: 0.7742
## F-statistic: 1439 on 7 and 2930 DF, p-value: < 2.2e-16
=> Interpretation: The regression model is statistically significant, as indicated by a very high F-statistic and a p-value less than 0.001, confirming that the predictors collectively explain life expectancy. The model explains approximately 77.5% of the variation in life expectancy (R² = 0.7747), indicating strong explanatory power. Additionally, all selected variables are statistically significant, suggesting that both socioeconomic and health-related factors play an important role in determining life expectancy.
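Because the predictors are measured on very different scales, the raw coefficients are easier to read when multiplied by a meaningful change in each predictor. The sketch below does this arithmetic with coefficients copied from the summary output above; the step sizes in `delta` are hypothetical choices made only for illustration, and each result is the associated change in predicted life expectancy with the other predictors held fixed.

```r
# Coefficients copied from the summary(model_final) output
coefs <- c(
  schooling = 1.157,
  adult_mortality = -2.276e-02,
  hiv_aids = -4.914e-01,
  gdp = 6.679e-05
)

# Hypothetical step sizes for each predictor (illustration only)
delta <- c(schooling = 1, adult_mortality = 100, hiv_aids = 1, gdp = 1000)

# Associated change in predicted life expectancy, in years
effect_years <- coefs * delta[names(coefs)]
print(round(effect_years, 3))
```

Read this way, the small gdp coefficient is not negligible: it only looks small because gdp is recorded in single currency units.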
par(mfrow = c(2,2))
plot(model_final)
=> Interpretation: The diagnostic plots indicate that the regression model assumptions are reasonably satisfied. The residuals are mostly randomly distributed, suggesting that the linearity assumption holds with minor deviations. The Q-Q plot shows approximate normality of residuals, with slight deviations at the tails indicating the presence of outliers. The scale-location plot suggests mild heteroscedasticity, as the variance of residuals is not perfectly constant across fitted values. The residuals vs leverage plot shows a few influential observations, but none appear to significantly affect the model. Overall, the model is stable and suitable for interpretation despite minor deviations from ideal assumptions.
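Visual inspection of the residuals vs leverage plot can be complemented by listing the observations with a large Cook's distance. The sketch below uses base R's `cooks.distance()` on simulated data with one planted influential point; the 4 / n cutoff is a common rule of thumb rather than a formal test, and all object names are hypothetical.

```r
set.seed(7)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

# Plant one high-leverage point far from the trend (made-up values)
x[n] <- 6
y[n] <- -20

fit <- lm(y ~ x)
cd <- cooks.distance(fit)

threshold <- 4 / n   # rule-of-thumb cutoff, not a strict test
flagged <- which(cd > threshold)
print(flagged)
```

For the real model, `cooks.distance(model_final)` could be screened the same way to see which country-year rows drive the fit.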
# Train-test split
train_index <- sample(seq_len(nrow(data_clean)), floor(0.8 * nrow(data_clean)))
train_data <- data_clean[train_index,]
test_data <- data_clean[-train_index,]
# Train model on training data
model_train <- lm(
life_expectancy ~ schooling + adult_mortality + hiv_aids + bmi + gdp + thinness_1_19_years + under_five_deaths,
data = train_data
)
# Make predictions
predictions <- predict(model_train, test_data)
# Calculate RMSE
rmse <- sqrt(mean((test_data$life_expectancy - predictions)^2))
rmse
## [1] 4.351144
=> Interpretation: The RMSE of approximately 4.35 means that the model’s predictions typically deviate from the actual life expectancy values by about 4.4 years. This is less than half the standard deviation of life expectancy in the dataset (about 9.5 years), suggesting reasonably good predictive performance. Because the train-test split is random and no seed is set, the exact value changes slightly from run to run.
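Two refinements can make a hold-out evaluation like this easier to trust: fixing the random seed so the split is repeatable, and computing imputation values from the training rows only so that no information from the test set leaks into the model. The sketch below shows both on simulated data; the seed value, the data frame `toy`, and the `rmse` helper are assumptions for illustration and do not reproduce the figures reported in this analysis.

```r
set.seed(123)   # any fixed seed makes the split repeatable

# Simulated data with some missing predictor values
n <- 500
toy <- data.frame(x = rnorm(n, 10, 2))
toy$y <- 40 + 2.5 * toy$x + rnorm(n, 0, 3)
toy$x[sample(n, 40)] <- NA

# 80/20 train-test split
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Impute with the TRAINING median only, so the test set stays unseen
x_median <- median(train$x, na.rm = TRUE)
train$x[is.na(train$x)] <- x_median
test$x[is.na(test$x)]   <- x_median

# Fit on the training rows and score on the held-out rows
fit <- lm(y ~ x, data = train)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
test_rmse <- rmse(test$y, predict(fit, newdata = test))
print(test_rmse)
```

Repeating the split with several seeds, or using cross-validation, would show how much the RMSE varies between runs.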
# Create predicted vs actual dataframe
pred_df <- data.frame(
Actual = test_data$life_expectancy,
Predicted = predictions
)
ggplot(pred_df, aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.4, color = "#2C7FB8") +
geom_abline(slope = 1, intercept = 0,
color = "#D95F02", linewidth = 1, linetype = "dashed") +
labs(
title = "Predicted vs Actual Life Expectancy",
subtitle = "Points closer to the dashed line indicate better predictions",
x = "Actual Life Expectancy (years)",
y = "Predicted Life Expectancy (years)"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 11)
)
=> Interpretation: The predicted vs actual plot shows that most points cluster closely around the perfect prediction line, confirming that the model performs well across the majority of observations. Some deviation is observed at lower life expectancy values (below 55 years), suggesting the model slightly struggles with extreme cases, which is consistent with the RMSE of approximately 4.35 years.
=> Conclusion: This study analyzed global life expectancy using a range of public health and socioeconomic indicators. Exploratory data analysis revealed that most countries have life expectancy between 65 and 80 years, with a few countries exhibiting significantly lower values, indicating disparities in global health conditions.
Correlation analysis showed that variables such as schooling, GDP, and BMI are positively associated with life expectancy, while adult mortality, HIV/AIDS prevalence, and under-five deaths have strong negative relationships. These findings highlight the importance of education, economic development, and healthcare in improving population health.
A multiple linear regression model was developed to quantify these relationships. The model demonstrated strong explanatory power (R² ≈ 0.77), indicating that approximately 77% of the variation in life expectancy is explained by the selected variables. All predictors were statistically significant, and multicollinearity was addressed to ensure model reliability.
Model diagnostics indicated that the assumptions of linear regression were reasonably satisfied, with only minor deviations. Additionally, predictive evaluation on the held-out test set (RMSE ≈ 4.35) showed that the model can estimate life expectancy with a typical error of about 4.4 years, indicating good predictive performance.
Overall, the results suggest that both socioeconomic factors and healthcare conditions are closely associated with life expectancy. Improving education, reducing disease burden, and enhancing economic stability may therefore contribute to increased life expectancy across countries, although a regression on observational data cannot establish causation on its own.