CO2 is the fourth most abundant gas in the earth's atmosphere. At room temperature, carbon dioxide (CO2) is a colorless, odorless, non-flammable gas; at other temperatures and pressures it can be a liquid or a solid. CO2 is a byproduct of normal cell function and is breathed out of the body. It is also produced when fossil fuels are burned or vegetation decays. Exposure to elevated CO2 can produce a variety of health effects, including headaches, dizziness, restlessness, a tingling or pins-and-needles feeling, difficulty breathing, sweating, tiredness, increased heart rate, elevated blood pressure, coma, asphyxia, and convulsions (Reference).
For this LBB (learning by building) Regression Model, I am using the Kaggle CO2 Emission dataset to predict life expectancy from several variables, including CO2 emissions and population, for many countries around the world.
List of libraries:
library(manipulate)  #to create interactive plots
library(dplyr)       #for data wrangling
library(performance) #to compare model performance
library(GGally)      #to create the correlation plot (ggcorr)
library(MLmetrics)   #to measure regression model performance
library(ggplot2)     #to create visualizations
library(lmtest)      #for the Breusch-Pagan test (bptest)
library(car)         #to check multicollinearity (vif)
library(rmdformats)  #R Markdown output format
#read data
co2 <- read.csv("CO2Emission.csv")

#to check NA values
colSums(is.na(co2))

##        Country           Code   CO2Emissions   YearlyChange      Percapita 
##              0              0              0              0              0 
##     Population LifeExpectancy 
##              0              0
#inspect dataframe
glimpse(co2)

## Rows: 208
## Columns: 7
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Anguill~
## $ Code <chr> "AFG", "ALB", "DZA", "AGO", "AIA", "ATG", "ARG", "ARM",~
## $ CO2Emissions <dbl> 9900004, 5208319, 156220560, 30566933, 30262, 438763, 2~
## $ YearlyChange <dbl> 7.13, 4.45, 0.17, 3.13, 1.52, 1.51, 0.16, 3.06, 1.51, -~
## $ Percapita <dbl> 0.28, 1.80, 3.85, 1.06, 2.10, 4.64, 4.61, 1.57, 2.74, 1~
## $ Population <int> 35383032, 2886438, 40551392, 28842489, 14429, 94527, 43~
## $ LifeExpectancy <dbl> 63.763, 78.194, 76.298, 59.925, 81.441, 76.617, 76.221,~
Data columns description:
Country: Country
Code: Country Code
CO2Emissions: CO2 Emission by country (tons, 2016)
YearlyChange: Yearly CO2 Emission Change in Percentage
Percapita: CO2 Emission Per Capita
Population: Population of the Country (2016)
LifeExpectancy: Life Expectancy of the Country (2016)
Based on the glimpse of the data above, we want to know which predictor variables affect the life expectancy recorded in the LifeExpectancy column.
Next, we prepare the data:
#drop Country and Code, since we do not use those columns in the model
co2 <- co2 %>% select(-c(Country, Code))

We also want to check whether there are any outliers in our variables, using the box plots below:
boxplot(co2$LifeExpectancy)
boxplot(co2$CO2Emissions)
boxplot(co2$YearlyChange)
boxplot(co2$Percapita)
boxplot(co2$Population)

## Warning in x[floor(d)] + x[ceiling(d)]: NAs produced by integer overflow
Based on the box plots, we note several outliers in all variables, so we will inspect them and remove them from the data if necessary.
rmarkdown::paged_table(co2 %>% arrange(desc(CO2Emissions)))

From the CO2Emissions box plot and the table above, we decided to remove the 5 largest CO2Emissions outliers. Note that removing these 5 rows also removes 3 outliers from the Population variable.
#to remove outliers from the data frame
co2_clean <- co2 %>% filter(CO2Emissions < 1000000000)

#outliers have been deleted
head(co2_clean %>% arrange(desc(CO2Emissions)), 5)
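As a quick sanity check (a minimal sketch, assuming the filter above is the only row removal), we can count how many rows were dropped:

#number of rows removed by the outlier filter
nrow(co2) - nrow(co2_clean)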
#to check correlation between predictor variables
ggcorr(co2_clean, label = TRUE, label_size = 3, hjust = 0.9, layout.exp = 2)

Based on the plot above:
Percapita and CO2Emissions have a slight positive correlation with the target variable LifeExpectancy. This could be in line with real-world conditions, where higher per-capita CO2 goes together with higher total CO2 emissions, which in turn can affect human health and life expectancy.
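To put a number on that "slight correlation", a minimal sketch using base R's cor() on the cleaned data frame (column names as above) would be:

#Pearson correlation of each predictor with the target variable
cor(co2_clean[, c("CO2Emissions", "YearlyChange", "Percapita", "Population")],
    co2_clean$LifeExpectancy)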
#create a linear model using the predictors that have a positive correlation
co2_mod <- lm(LifeExpectancy ~ Percapita + CO2Emissions, co2_clean)
co2_mod

## 
## Call:
## lm(formula = LifeExpectancy ~ Percapita + CO2Emissions, data = co2_clean)
##
## Coefficients:
## (Intercept) Percapita CO2Emissions
## 6.945e+01 5.900e-01 6.881e-09
#create a linear model using all predictors
co2_all <- lm(LifeExpectancy ~ ., co2_clean)
co2_all

## 
## Call:
## lm(formula = LifeExpectancy ~ ., data = co2_clean)
##
## Coefficients:
## (Intercept) CO2Emissions YearlyChange Percapita Population
## 7.049e+01 1.427e-08 -1.023e-01 4.940e-01 -4.142e-08
For the third model, we use backward stepwise selection to let the computer choose the combination of predictor variables that gives the lowest AIC.
step(object = co2_all, direction = "backward", trace = F)

## 
## Call:
## lm(formula = LifeExpectancy ~ CO2Emissions + Percapita + Population,
## data = co2_clean)
##
## Coefficients:
## (Intercept) CO2Emissions Percapita Population
## 7.017e+01 1.483e-08 5.007e-01 -4.247e-08
#create a linear model based on the stepwise result
co2_back <- lm(formula = LifeExpectancy ~ CO2Emissions + Percapita + Population,
    data = co2_clean)

#check model result
summary(co2_mod)

## 
## Call:
## lm(formula = LifeExpectancy ~ Percapita + CO2Emissions, data = co2_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.9281 -4.4365 0.7909 5.1859 11.3946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.945e+01 6.251e-01 111.101 < 2e-16 ***
## Percapita 5.900e-01 8.791e-02 6.712 1.94e-10 ***
## CO2Emissions 6.881e-09 3.715e-09 1.852 0.0655 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.796 on 200 degrees of freedom
## Multiple R-squared: 0.2368, Adjusted R-squared: 0.2291
## F-statistic: 31.02 on 2 and 200 DF, p-value: 1.847e-12
#check model result
summary(co2_all)

## 
## Call:
## lm(formula = LifeExpectancy ~ ., data = co2_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.509 -4.089 1.252 5.263 11.580
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.049e+01 8.269e-01 85.241 < 2e-16 ***
## CO2Emissions 1.427e-08 4.923e-09 2.899 0.00416 **
## YearlyChange -1.023e-01 1.503e-01 -0.680 0.49706
## Percapita 4.940e-01 9.445e-02 5.230 4.29e-07 ***
## Population -4.142e-08 1.705e-08 -2.430 0.01601 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.717 on 198 degrees of freedom
## Multiple R-squared: 0.2618, Adjusted R-squared: 0.2469
## F-statistic: 17.56 on 4 and 198 DF, p-value: 2.393e-12
#check model result
summary(co2_back)

## 
## Call:
## lm(formula = LifeExpectancy ~ CO2Emissions + Percapita + Population,
## data = co2_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.4498 -4.1053 0.9113 5.4314 11.1282
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.017e+01 6.808e-01 103.067 < 2e-16 ***
## CO2Emissions 1.483e-08 4.849e-09 3.058 0.00254 **
## Percapita 5.007e-01 9.382e-02 5.337 2.56e-07 ***
## Population -4.247e-08 1.695e-08 -2.505 0.01305 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.708 on 199 degrees of freedom
## Multiple R-squared: 0.2601, Adjusted R-squared: 0.2489
## F-statistic: 23.32 on 3 and 199 DF, p-value: 5.629e-13
Based on all the model summaries, Percapita has three asterisks ("***") in all three models, meaning it has the most significant relationship with the target variable LifeExpectancy.
R-squared:
co2_mod : 0.2368 #2 predictor variables
co2_all : 0.2618 #4 predictor variables
co2_back: 0.2601 #3 predictor variables
Adjusted R-squared:
co2_mod : 0.2291
co2_all : 0.2469
co2_back: 0.2489
From the summary() output we can observe that R-squared increases as more predictor variables are used. On that basis we might assume co2_all is the best of the three models, but when we look at adjusted R-squared, co2_back comes out better. This is because in co2_back every predictor variable has a meaningful relationship with the target variable, which may give the model better predictions, while in co2_all one predictor, YearlyChange, has no correlation with the target.
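The R-squared and adjusted R-squared values quoted above can also be collected programmatically from the three fitted models; a minimal sketch:

#collect R-squared and adjusted R-squared from each model summary
sapply(list(co2_mod = co2_mod, co2_all = co2_all, co2_back = co2_back),
       function(m) c(r.squared     = summary(m)$r.squared,
                     adj.r.squared = summary(m)$adj.r.squared))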
Compare model performance
#compare model performance based on AIC and RMSE
compare_performance(co2_mod, co2_all, co2_back, metrics = c("AIC", "RMSE"))

Based on the model performance comparison, co2_back has the lowest AIC and co2_all has the lowest RMSE. We use the RMSE metric since it is the most common metric for evaluating linear regression.
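Since the compare_performance() table is not reproduced here, the same two metrics can be cross-checked manually; a minimal sketch using base R's AIC() and the residuals of each model:

#manual cross-check of AIC and RMSE for the three models
data.frame(
  model = c("co2_mod", "co2_all", "co2_back"),
  AIC   = c(AIC(co2_mod), AIC(co2_all), AIC(co2_back)),
  RMSE  = c(sqrt(mean(co2_mod$residuals^2)),
            sqrt(mean(co2_all$residuals^2)),
            sqrt(mean(co2_back$residuals^2)))
)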
predict(co2_mod, co2[1,], interval = "confidence", level = 0.95)

##        fit      lwr      upr
## 1 69.67992 68.48568 70.87415

If we predict LifeExpectancy with a 95% confidence interval, we can expect the LifeExpectancy value for the first row to fall between roughly 68.48 (lwr) and 70.87 (upr).
predict(co2_all, co2[1,], interval = "confidence", level = 0.95)

##        fit      lwr     upr
## 1 68.57079 66.85548 70.2861

If we predict LifeExpectancy with a 95% confidence interval, we can expect the LifeExpectancy value for the first row to fall between roughly 66.85 (lwr) and 70.28 (upr).
predict(co2_back, co2[1,], interval = "confidence", level = 0.95)

##        fit      lwr      upr
## 1 68.95176 67.64093 70.26258

If we predict LifeExpectancy with a 95% confidence interval, we can expect the LifeExpectancy value for the first row to fall between roughly 67.64 (lwr) and 70.26 (upr).
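Note that interval = "confidence" gives an interval for the mean response. If the goal is to bound the LifeExpectancy of an individual observation, a prediction interval (which is wider) may be more appropriate; a minimal sketch for the same first row:

#prediction interval for an individual observation (wider than the confidence interval)
predict(co2_back, co2[1,], interval = "prediction", level = 0.95)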
Model co2_mod
#evaluating RMSE
sqrt(mean((co2_mod$residuals^2)))

## [1] 6.745433
Model co2_all
#evaluating RMSE
sqrt(mean((co2_all$residuals^2)))

## [1] 6.633796
Model co2_back
#evaluating RMSE
sqrt(mean((co2_back$residuals^2)))

## [1] 6.641546
Based on the RMSE values above, co2_all does a slightly better job of modeling LifeExpectancy, as it has the lowest RMSE of the three models.
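Since MLmetrics is already loaded, the same RMSE can also be computed with its RMSE() helper, which takes the predicted and actual values; a minimal sketch for co2_all:

#RMSE via MLmetrics: RMSE(y_pred, y_true)
RMSE(y_pred = co2_all$fitted.values, y_true = co2_clean$LifeExpectancy)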
Normality
#create histogram to check normality
hist(co2_mod$residuals, breaks = 15)

#to check normality assumption
shapiro.test(co2_mod$residuals)

## 
## Shapiro-Wilk normality test
##
## data: co2_mod$residuals
## W = 0.96467, p-value = 5.736e-05
P-value < 0.05, meaning the residuals are not normally distributed, so the normality assumption is not met.
Residual Plot
#to check residual pattern
plot(co2_mod$fitted.values, co2_mod$residuals)
abline(h = 0, col = "red")

The residuals do not show a random scatter; they tend to cluster toward the left side of the plot.
BP test
bptest(co2_mod)

## 
## studentized Breusch-Pagan test
##
## data: co2_mod
## BP = 4.5555, df = 2, p-value = 0.1025
P-value > 0.05, meaning the residuals have constant variance (homoscedasticity), so this assumption is met.
Multicollinearity
#to check multicollinearity between predictor variables
vif(co2_mod)

##    Percapita CO2Emissions 
##     1.104982     1.104982
No VIF value is above 10, so there is no significant multicollinearity between the predictor variables in this model.
Normality
#create histogram to check normality
hist(co2_all$residuals, breaks = 15)

#to check normality assumption
shapiro.test(co2_all$residuals)

## 
## Shapiro-Wilk normality test
##
## data: co2_all$residuals
## W = 0.9585, p-value = 1.18e-05
P-value < 0.05, meaning the residuals are not normally distributed, so the normality assumption is not met.
Residual Plot
#to check residual pattern
plot(co2_all$fitted.values, co2_all$residuals)
abline(h = 0, col = "red")

The residuals do not show a random scatter; they tend to cluster toward the left side of the plot.
BP test
bptest(co2_all)

## 
## studentized Breusch-Pagan test
##
## data: co2_all
## BP = 8.3993, df = 4, p-value = 0.078
P-value > 0.05, meaning the residuals have constant variance (homoscedasticity), so this assumption is met.
Multicollinearity
#to check multicollinearity between predictor variables
vif(co2_all)

## CO2Emissions YearlyChange    Percapita   Population 
##     1.985704     1.070210     1.305775     1.785793
No VIF value is above 10, so there is no significant multicollinearity between the predictor variables in this model.
Normality
#create histogram to check normality
hist(co2_back$residuals, breaks = 15)

#to check normality assumption
shapiro.test(co2_back$residuals)

## 
## Shapiro-Wilk normality test
##
## data: co2_back$residuals
## W = 0.96091, p-value = 2.159e-05
P-value < 0.05, meaning the residuals are not normally distributed, so the normality assumption is not met.
Residual Plot
#to check residual pattern
plot(co2_back$fitted.values, co2_back$residuals)
abline(h = 0, col = "red")

The residuals do not show a random scatter; they tend to cluster toward the left side of the plot.
BP test
bptest(co2_back)

## 
## studentized Breusch-Pagan test
##
## data: co2_back
## BP = 3.4443, df = 3, p-value = 0.3281
P-value > 0.05, meaning the residuals have constant variance (homoscedasticity), so this assumption is met.
Multicollinearity
#to check multicollinearity between predictor variables
vif(co2_back)

## CO2Emissions    Percapita   Population 
##     1.931598     1.291746     1.771161
No VIF value is above 10, so there is no significant multicollinearity between the predictor variables in this model.
Based on the model building and evaluation above, none of the models is a particularly good model for describing the target variable LifeExpectancy. Among them, however, co2_all and co2_back give the closest results, indicating that CO2Emissions, Percapita, and Population are the most useful predictors of LifeExpectancy in this dataset.

From a real-world perspective, these variables may well affect people's lifespans: population growth leads to more transportation, more industrial and manufacturing activity, more fossil fuel consumption, and more deforestation, all of which produce more CO2 emissions. Although many other factors significantly affect human life expectancy in practice, governments could consider reducing CO2 emissions through regulation, for example by improving public transport so that people are more inclined to use it instead of their own vehicles. Reducing individual transportation would also reduce vehicle production, which in turn would reduce manufacturing and industrial activity.