Introduction

CO2 is the fourth most abundant gas in the earth’s atmosphere. At room temperature, carbon dioxide (CO2) is a colorless, odorless, non-flammable gas, at other temperatures and pressures, carbon dioxide can be a liquid or a solid. CO2 is a byproduct of normal cell function when it is breathed out of the body. CO2 is also produced when fossil fuels are burned or decaying vegetation. Exposure to CO2 can produce a variety of health effects. These may include headaches, dizziness, restlessness, a tingling or pins or needles feeling, difficulty breathing, sweating, tiredness, increased heart rate, elevated blood pressure, coma, asphyxia, and convulsions Reference.

For LBB (learning by building) - Regression Model, i’m using data from Kaggle - CO2 Emission to predict human life expectancy using several variable co2 emission and population from many countries around the world.

List of library

library(manipulate) #createinteractive plot
library(dplyr) #for data wrangling
library(performance) #to compare model performance
library(GGally) #to create gg correlation
library(MLmetrics) #to measure regression model performance
library(ggplot2) #to create visualization
library(lmtest) #to create Linear Model
library(car) #to check Multicolinearity
library(rmdformats)

Check and Import Data

#read data
co2 <- read.csv("CO2Emission.csv")
#to check NA value
colSums(is.na(co2))
##        Country           Code   CO2Emissions   YearlyChange      Percapita 
##              0              0              0              0              0 
##     Population LifeExpectancy 
##              0              0
#inspect dataframe
glimpse(co2)
## Rows: 208
## Columns: 7
## $ Country        <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Anguill~
## $ Code           <chr> "AFG", "ALB", "DZA", "AGO", "AIA", "ATG", "ARG", "ARM",~
## $ CO2Emissions   <dbl> 9900004, 5208319, 156220560, 30566933, 30262, 438763, 2~
## $ YearlyChange   <dbl> 7.13, 4.45, 0.17, 3.13, 1.52, 1.51, 0.16, 3.06, 1.51, -~
## $ Percapita      <dbl> 0.28, 1.80, 3.85, 1.06, 2.10, 4.64, 4.61, 1.57, 2.74, 1~
## $ Population     <int> 35383032, 2886438, 40551392, 28842489, 14429, 94527, 43~
## $ LifeExpectancy <dbl> 63.763, 78.194, 76.298, 59.925, 81.441, 76.617, 76.221,~

Data columns description:

Country: Country

Code: Country Code

CO2Emissions: CO2 Emission by country (tons, 2016)

YearlyChange: Yearly CO2 Emission Change in Percentage

Percapita: CO2 Emission Per Capita

Population: Population of the Country (2016)

LifeExpectancy: Life Expectancy of the Country (2016)

Based on performance data glimpse above, we want to know what predictor variable affecting life expectancy in LifeExpectancy column result.

Changing several data type and information:

#To delete Country and Country Code information, since we do not use those columns in Model.
co2 <- co2 %>% select(-c(Country, Code))

Exploratory

Box Plot

We want to find if there is any outlier from our predictor variable using box plot as below:

Life Expectancy (Target)

boxplot(co2$LifeExpectancy)

CO2 Emission

boxplot(co2$CO2Emissions)

Yearly Change

boxplot(co2$YearlyChange)

Per Capita

boxplot(co2$Percapita)

Population

boxplot(co2$Population)
## Warning in x[floor(d)] + x[ceiling(d)]: NAs produced by integer overflow

Based on box plot, we noted there are several outlier in all predictor variables, thus we will check outlier and delete it from table if necessary.

rmarkdown::paged_table(co2 %>% arrange(desc(CO2Emissions)))

From box plot CO2Emissions and table information above, we decide to take out 5 outlier from CO2Emissions. Noted when we take out 5 CO2Emissionsoutlier, we also take out 3 outlier from Population variable.

#to remove outlier from DF
co2_clean <- co2 %>% filter(CO2Emissions<1000000000)
head(co2_clean %>% arrange(desc(CO2Emissions)),5) #outlier has been deleted.

Correlation

#to check correlation between predictor variable
ggcorr(co2_clean, label = TRUE, label_size = 3, hjust = 0.9, layout.exp = 2)

Based on graphic above, Percapita and CO2Emissions have slight correlation to target variable LifeExpectancy. This could be inline with actual condition, where the higher co2 percapita will trigger CO2Emissions produced, which later will affects human health condition and lifeexpectancy.

Create Model

  1. Model based on correlation
#Create Linear model using predictor which has positive correlation 
co2_mod <- lm(LifeExpectancy ~ Percapita + CO2Emissions, co2_clean)
co2_mod
## 
## Call:
## lm(formula = LifeExpectancy ~ Percapita + CO2Emissions, data = co2_clean)
## 
## Coefficients:
##  (Intercept)     Percapita  CO2Emissions  
##    6.945e+01     5.900e-01     6.881e-09
  1. Model using all variable
#create linear model using all predictor
co2_all <- lm(LifeExpectancy ~ ., co2_clean)
co2_all
## 
## Call:
## lm(formula = LifeExpectancy ~ ., data = co2_clean)
## 
## Coefficients:
##  (Intercept)  CO2Emissions  YearlyChange     Percapita    Population  
##    7.049e+01     1.427e-08    -1.023e-01     4.940e-01    -4.142e-08
  1. Model using step wise (backward)

For the third model, we will use step wise (backward) to let computer decide the best predictors variable based on the lowest AIC model.

step(object = co2_all, direction = "backward", trace = F)
## 
## Call:
## lm(formula = LifeExpectancy ~ CO2Emissions + Percapita + Population, 
##     data = co2_clean)
## 
## Coefficients:
##  (Intercept)  CO2Emissions     Percapita    Population  
##    7.017e+01     1.483e-08     5.007e-01    -4.247e-08
#create liner model based on step wise.
co2_back <- lm(formula = LifeExpectancy ~ CO2Emissions + Percapita + Population, 
    data = co2_clean)

Summary Model

co2_mod

#check model result
summary(co2_mod)
## 
## Call:
## lm(formula = LifeExpectancy ~ Percapita + CO2Emissions, data = co2_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.9281  -4.4365   0.7909   5.1859  11.3946 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.945e+01  6.251e-01 111.101  < 2e-16 ***
## Percapita    5.900e-01  8.791e-02   6.712 1.94e-10 ***
## CO2Emissions 6.881e-09  3.715e-09   1.852   0.0655 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.796 on 200 degrees of freedom
## Multiple R-squared:  0.2368, Adjusted R-squared:  0.2291 
## F-statistic: 31.02 on 2 and 200 DF,  p-value: 1.847e-12

co2_all

#check model result
summary(co2_all)
## 
## Call:
## lm(formula = LifeExpectancy ~ ., data = co2_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.509  -4.089   1.252   5.263  11.580 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.049e+01  8.269e-01  85.241  < 2e-16 ***
## CO2Emissions  1.427e-08  4.923e-09   2.899  0.00416 ** 
## YearlyChange -1.023e-01  1.503e-01  -0.680  0.49706    
## Percapita     4.940e-01  9.445e-02   5.230 4.29e-07 ***
## Population   -4.142e-08  1.705e-08  -2.430  0.01601 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.717 on 198 degrees of freedom
## Multiple R-squared:  0.2618, Adjusted R-squared:  0.2469 
## F-statistic: 17.56 on 4 and 198 DF,  p-value: 2.393e-12

co2_back

#check model result
summary(co2_back)
## 
## Call:
## lm(formula = LifeExpectancy ~ CO2Emissions + Percapita + Population, 
##     data = co2_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.4498  -4.1053   0.9113   5.4314  11.1282 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.017e+01  6.808e-01 103.067  < 2e-16 ***
## CO2Emissions  1.483e-08  4.849e-09   3.058  0.00254 ** 
## Percapita     5.007e-01  9.382e-02   5.337 2.56e-07 ***
## Population   -4.247e-08  1.695e-08  -2.505  0.01305 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.708 on 199 degrees of freedom
## Multiple R-squared:  0.2601, Adjusted R-squared:  0.2489 
## F-statistic: 23.32 on 3 and 199 DF,  p-value: 5.629e-13

Based on all Summary models, Percapita has 3 asterisk “***“, mean it is the most significant correlation with target variable LifeExpectancy in 3 models.

R-squared:

co2_mod : 0.2368 #2 Predictor variable

co2_all : 0.2618 #4 Predictor variable

co2_back: 0.2601 #3 Predictor variable

Adj. R-Square

co2_mod : 0.2291

co2_all : 0.2469

co2_back: 0.2489

From summary function, we can observe that the more predictor variable used, R-squared is increasing. Thus, we can assume model co2_all is better than the other 2 models, but when we observe from adjusted R-squared, model co2_back is better than the others. This is due to in model co2_back, all predictor variable has correlation to the target variable, which may give the model a better prediction. While in co2_all, 1 predictor YearlyChange has no correlation.

Compare model performance

#compare model performance based on AID and RMSE
compare_performance(co2_mod, co2_all, co2_back, metrics = c("AIC", "RMSE"))

Based on models performance comparison, model co2_back have the lowest AIC and model co2_all have the lowest RMSE. We use RMSE metrics since it was the most common metric to evaluate regression linear.

Model Prediction

co2_mod

predict(co2_mod, co2[1,], interval = "confidence", level = 0.95)
##        fit      lwr      upr
## 1 69.67992 68.48568 70.87415

If we want to predict LifeExpectancy with confidence interval 95%, we can presume LifeExpectancy value for the first row to be around 68.48 (lwr) up to 70.87 (upr)

co2_all

predict(co2_all, co2[1,], interval = "confidence", level = 0.95)
##        fit      lwr     upr
## 1 68.57079 66.85548 70.2861

If we want to predict LifeExpectancy with confidence interval 95%, we can presume LifeExpectancy value for the first row to be around 66.85 (lwr) up to 70.28 (upr)

co2_back

predict(co2_back, co2[1,], interval = "confidence", level = 0.95)
##        fit      lwr      upr
## 1 68.95176 67.64093 70.26258

If we want to predict LifeExpectancy with confidence interval 95%, we can presume LifeExpectancy value for the first row to be around 67.64 (lwr) up to 70.26 (upr)

Evaluating Model

Model co2_mod

#evaluating RMSE
sqrt(mean((co2_mod$residuals^2)))
## [1] 6.745433

Model co2_all

#evaluating RMSE
sqrt(mean((co2_all$residuals^2)))
## [1] 6.633796

Model co2_back

#evaluating RMSE
sqrt(mean((co2_back$residuals^2)))
## [1] 6.641546

Based on RMSE above, co2_all did slightly better job in summarizing LifeExpectancy, as it has the lowest RMSE value between 3 models.

Model Assumption

co2_mod

Normality

hist(co2_mod$residuals, breaks = 15)

#to check normality assumption
shapiro.test(co2_mod$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  co2_mod$residuals
## W = 0.96467, p-value = 5.736e-05

P-value > 0.05, mean Residuals are normally distributed

Residual Plot

#to check residual pattern
plot(co2_mod$fitted.values, co2_mod$residuals)
abline(h = 0, col = "red")

The residual plot has no random pattern, but it tend to gather on the left side of the plot.

BP test

bptest(co2_mod)
## 
##  studentized Breusch-Pagan test
## 
## data:  co2_mod
## BP = 4.5555, df = 2, p-value = 0.1025

P-value > 0.05, mean residual has a constant variance (Homoscedaticity) and the assumption is passed

Multicolinearity

#to check correlation between predictor variable
vif(co2_mod)
##    Percapita CO2Emissions 
##     1.104982     1.104982

There is no value above 10, thus no significant correlation between predictor variable noted in this model.

co2_all

Normality

#create histogram to check normality
hist(co2_all$residuals, breaks = 15)

#to check normality assumption
shapiro.test(co2_all$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  co2_all$residuals
## W = 0.9585, p-value = 1.18e-05

P-value > 0.05, mean Residuals are normally distributed

Residual Plot

#to check residual pattern
plot(co2_all$fitted.values, co2_all$residuals)
abline(h = 0, col = "red")

The residual plot has no random pattern, but it tend to gather on the left side of the plot.

BP test

bptest(co2_all)
## 
##  studentized Breusch-Pagan test
## 
## data:  co2_all
## BP = 8.3993, df = 4, p-value = 0.078

P-value > 0.05, mean residual has a constant variance (Homoscedaticity) and the assumption is passed

Multicolinearity

#to check correlation between predictor variable
vif(co2_all)
## CO2Emissions YearlyChange    Percapita   Population 
##     1.985704     1.070210     1.305775     1.785793

There is no value above 10, thus no significant correlation between predictor variable noted in this model.

co2_back

Normality

#create histogram to check normality
hist(co2_back$residuals, breaks = 15)

#to check normality assumption
shapiro.test(co2_back$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  co2_back$residuals
## W = 0.96091, p-value = 2.159e-05

P-value > 0.05, mean Residuals are not normally distributed

Residual Plot

#to check residual pattern
plot(co2_back$fitted.values, co2_back$residuals)
abline(h = 0, col = "red")

The residual plot has no random pattern, but it tend to gather on the left side of the plot.

BP test

bptest(co2_back)
## 
##  studentized Breusch-Pagan test
## 
## data:  co2_back
## BP = 3.4443, df = 3, p-value = 0.3281

P-value > 0.05, mean residual has a constant variance (Homoscedaticity) and the assumption is passed

Multicolinearity

#to check correlation between predictor variable
vif(co2_back)
## CO2Emissions    Percapita   Population 
##     1.931598     1.291746     1.771161

There is no value above 10, thus no significant correlation between predictor variable noted in this model.

Conclusion

Based on model creation and evaluation, all model may not be a good model to describe our target variable LifeExpectancy. But between all models, co2_all and co2_back has closest result, indicate predictor variable CO2Emissions, Percapita and Population are the best variable to predict LifeExpectancy. From real world perspective, those variable may affecting people lifespan, as increase of population lead to increasing of transportation use, industrial and manufacturing process, fossil fuels consumption and deforestation, which will lead to produce more of CO2Emissions. Although there are many significant variable affects human lifeexpectancy in actual, government may consider to reduce CO2Emissions by making regulation such as increasing public transport facility, so people will be more interest to use public transportation rather than use their own vehicle. By reducing the use of individual transportation, production for individual vehicle may also decrease, which lead to decrease of manufacturing and industrial activity.