ECOM2000 Data Analysis Project 2023

INITIAL SET-UP

Premilinary: Data load-in and preparation

# LOAD IN DATA (You need to change the file path)
dat2 = readxl::read_excel("//Users//mwilandhlovu//Desktop//ECONS2000//World_data.xlsx", sheet = "Data")
dat2

# CONVERT DATA FROM LONG TO WIDE FORM
datw = spread(dat2, "Series Name", "2020")
datw

# RENAME VARIABLES
datw = rename(datw, GHG = "Total greenhouse gas emissions (kt of CO2 equivalent)",
                    GDPpc = "GDP per capita (current US$)",
                    Pop = "Population, total",
                    EDU = "Government expenditure on education, total (% of GDP)",
                    PopDen = "Population density (people per sq. km of land area)",
                    UrbPop = "Urban population (% of total population)") 
 
# DROP MISSING OBSERVATIONS
datw[datw==".."]=NA
datw=na.omit(datw)

# CHANGE DATA TYPE FROM CHARACTER INTO NUMERIC
class(datw$GHG)="double"
class(datw$GDPpc)="double"
class(datw$Pop)="double"
class(datw$EDU)="double"
class(datw$PopDen)="double"
class(datw$UrbPop)="double"

# DISPLAY 1ST 6 ROWS
head(datw)

# DIMENSION OF DATA
c(nrow(datw),ncol(datw))

## [1] 150   7

# CREATE GHG PER CAPITA
datw$GHGpc = datw$GHG/datw$Pop*10^6

Question 1- Scatter Plot of GHGpc against GDP per capita

datw$GHGpc<-datw$GHG*1000000

datw$GHGpc<-datw$GHGpc/datw$Pop

plot(datw$GHGpc,datw$GDPpc, main = "Scatter Plot of GHG per captia against 
     GDP per capita",
     xlab = "GDP per Capita", ylab = "GHG per captia (KG)",
     pch = 1, frame = FALSE)

Question 2 - Consturct 90% CI of GHGpc

#Sample mean of GHGpc
x1=datw$GHGpc
xbar=sum(x1/length(x1))
xbar

## [1] 6130.624

#Sample variance of GHGpc 
xvar= sum((x1-xbar)^2)/(length(x1)-1)
xvar

## [1] 41592842

#Sample standard deviation of GHGpc 
sdx=sqrt(xvar)
sdx

## [1] 6449.251

#Critial values 90% CI of GHGpc 
qt(0.05,length(x1-1))

## [1] -1.655076

qt(0.95,length(x1-1))

## [1] 1.655076

#Construction of 90% CI of GHGpc
xbar+qt(0.05,length(x1)-1)*sqrt(xvar^2/length(x1))

## [1] -5614809

xbar+qt(0.95,length(x1)-1)*sqrt(xvar^2/length(x1))

## [1] 5627070

Interpretation of the CI

The confidence interval means we are 90% sure that the population mean will fall between the values of -5614809 and 5627070. We this assume is based on this particular sample set.However if a different sample set were to be used with different sample means, variance and standard deviations then the estimated values would change.

Question 3 - Estimate multiple linear regression model

GDPpc<-datw$GDPpc
Pop<-datw$Pop
EDU<-datw$EDU
PopDen<-datw$PopDen
Urbpop<-datw$UrbPop
GDPpc2<-((GDPpc)^2) #GDP squared 

mrl<-lm(GHGpc~GDPpc+Pop+EDU+PopDen+Urbpop+GDPpc2,data = datw)
summary(mrl)

## 
## Call:
## lm(formula = GHGpc ~ GDPpc + Pop + EDU + PopDen + Urbpop + GDPpc2, 
##     data = datw)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8908.5 -2637.8  -909.4   944.1 28124.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.422e+02  1.562e+03   0.603  0.54726    
## GDPpc        2.943e-01  6.354e-02   4.631  8.1e-06 ***
## Pop          6.694e-07  2.608e-06   0.257  0.79778    
## EDU         -3.147e+02  2.269e+02  -1.387  0.16747    
## PopDen      -1.388e-01  6.448e-01  -0.215  0.82992    
## Urbpop       6.444e+01  2.423e+01   2.659  0.00873 ** 
## GDPpc2      -2.367e-06  7.123e-07  -3.323  0.00113 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5243 on 143 degrees of freedom
## Multiple R-squared:  0.3657, Adjusted R-squared:  0.3391 
## F-statistic: 13.74 on 6 and 143 DF,  p-value: 2.724e-12

Estimated sample regression equation \[ Yi=b0+b1GDPpc+b2Pop+b3EDU+b4Popden+b5Urbpop+b6GDP^2+ei \] Estimated sample regression equation for GHG emissions per capita (in kg) \[ GHGpc = 942.2+0.2943GDPpc+0.0000006694Pop-314.7395EDU-0.1387601Popden+64.4396Urbpop-0.00000367GDPpc^2+ei \]

Question 4 - Report/Interpret R-squared and Std error of regression

R squared equation: \[ R^2=SSE/SST \] In multiple linear regression r squared must be adjusted to be properly interpreted. The adjusted r squared accounts for the multiple added variables that would affect to the r squared value. In this model the adjusted R squared is 0.3391 suggests 33% of the variation in GHG per captia is explained by the explanatory variables (DP per capita, GDP per capita squared, government expenditure on education, population density, and the share of population living in urban areas are the explanatory Values).

This also means that there is still 67% of the variation in GHGpc that is not accounted for. The magnitude of the data compared to the R squared also tell to the goodness of fit.

The standard error of regression measure the average amount the observed data deviates from the predicted values. In this model the standard error is 5243, meaning on average the actual data is 5243 unit away from the predicted value. Generally the higher to number is, the less precise the model’s predicts values are. Hence suggesting that this model is not a good indication of the data.

When put together, the adjusted r squared and the standard error suggest that the model is not a good fit for the data. This also suggests that there may be other important variables that need to be included to make the model more accurate.

Question 5 - Interpret estimated coefficients EDU and PoPDen

Each slope coefficient variable represents how much dependent variable changes when associated variable increases by 1 unit holding other variables constant.

summary(mrl)$coefficients[4,1]#the slope of EDU pulled from the regression table

## [1] -314.7395

summary(mrl)$coefficients[5,1]#the slope of Popden pulled from the regression

## [1] -0.1387601

table

## function (..., exclude = if (useNA == "no") c(NA, NaN), useNA = c("no", 
##     "ifany", "always"), dnn = list.names(...), deparse.level = 1) 
## {
##     list.names <- function(...) {
##         l <- as.list(substitute(list(...)))[-1L]
##         if (length(l) == 1L && is.list(..1) && !is.null(nm <- names(..1))) 
##             return(nm)
##         nm <- names(l)
##         fixup <- if (is.null(nm)) 
##             seq_along(l)
##         else nm == ""
##         dep <- vapply(l[fixup], function(x) switch(deparse.level + 
##             1, "", if (is.symbol(x)) as.character(x) else "", 
##             deparse(x, nlines = 1)[1L]), "")
##         if (is.null(nm)) 
##             dep
##         else {
##             nm[fixup] <- dep
##             nm
##         }
##     }
##     miss.use <- missing(useNA)
##     miss.exc <- missing(exclude)
##     useNA <- if (miss.use && !miss.exc && !match(NA, exclude, 
##         nomatch = 0L)) 
##         "ifany"
##     else match.arg(useNA)
##     doNA <- useNA != "no"
##     if (!miss.use && !miss.exc && doNA && match(NA, exclude, 
##         nomatch = 0L)) 
##         warning("'exclude' containing NA and 'useNA' != \"no\"' are a bit contradicting")
##     args <- list(...)
##     if (length(args) == 1L && is.list(args[[1L]])) {
##         args <- args[[1L]]
##         if (length(dnn) != length(args)) 
##             dnn <- paste(dnn[1L], seq_along(args), sep = ".")
##     }
##     if (!length(args)) 
##         stop("nothing to tabulate")
##     bin <- 0L
##     lens <- NULL
##     dims <- integer()
##     pd <- 1L
##     dn <- NULL
##     for (a in args) {
##         if (is.null(lens)) 
##             lens <- length(a)
##         else if (length(a) != lens) 
##             stop("all arguments must have the same length")
##         fact.a <- is.factor(a)
##         if (doNA) 
##             aNA <- anyNA(a)
##         if (!fact.a) {
##             a0 <- a
##             op <- options(warn = 2)
##             on.exit(options(op))
##             a <- factor(a, exclude = exclude)
##             options(op)
##         }
##         add.na <- doNA
##         if (add.na) {
##             ifany <- (useNA == "ifany")
##             anNAc <- anyNA(a)
##             add.na <- if (!ifany || anNAc) {
##                 ll <- levels(a)
##                 if (add.ll <- !anyNA(ll)) {
##                   ll <- c(ll, NA)
##                   TRUE
##                 }
##                 else if (!ifany && !anNAc) 
##                   FALSE
##                 else TRUE
##             }
##             else FALSE
##         }
##         if (add.na) 
##             a <- factor(a, levels = ll, exclude = NULL)
##         else ll <- levels(a)
##         a <- as.integer(a)
##         if (fact.a && !miss.exc) {
##             ll <- ll[keep <- which(match(ll, exclude, nomatch = 0L) == 
##                 0L)]
##             a <- match(a, keep)
##         }
##         else if (!fact.a && add.na) {
##             if (ifany && !aNA && add.ll) {
##                 ll <- ll[!is.na(ll)]
##                 is.na(a) <- match(a0, c(exclude, NA), nomatch = 0L) > 
##                   0L
##             }
##             else {
##                 is.na(a) <- match(a0, exclude, nomatch = 0L) > 
##                   0L
##             }
##         }
##         nl <- length(ll)
##         dims <- c(dims, nl)
##         if (prod(dims) > .Machine$integer.max) 
##             stop("attempt to make a table with >= 2^31 elements")
##         dn <- c(dn, list(ll))
##         bin <- bin + pd * (a - 1L)
##         pd <- pd * nl
##     }
##     names(dn) <- dnn
##     bin <- bin[!is.na(bin)]
##     if (length(bin)) 
##         bin <- bin + 1L
##     y <- array(tabulate(bin, pd), dims, dimnames = dn)
##     class(y) <- "table"
##     y
## }
## <bytecode: 0x7fb877ab2fd8>
## <environment: namespace:base>

\[ \displaystyle \frac{\partial GHGpc}{\partial EDU} = -314.7395 \] The slope coefficient for EDU is -314.7395. This indicates that, holding all other variables constant, a one-unit increase in education expenditure (EDU) is associated with a decrease in GHGpc (greenhouse gas emissions per capita) by 314.7395 units. The coefficient is significantly larger than zero meaning the predicted values are most likely not very accurate.

\[ \displaystyle \frac{\partial GHGpc}{\partial PopDen} = -0.1387601 \] The slope coefficient for PopDen is -0.1387601. This means that, holding all other variables constant, a one-unit increase in population density (PopDen) is associated with a decrease in GHGpc by approximately 0.1387601 units.

Question 6 - Test significance of EDU coefficient

Step 1: state the null hypothesis

The null is the true population coefficient for EDU is less than or equal to 0. \[ H_0: \beta_1 EDU <0 \] The alternative Hypothesis is that coefficient for EDU is greater 0 \[ H_A : \beta_1 EDU >0 \] Step 2: Find the critical values

qt(.05,length(datw$EDU)-2)

## [1] -1.655215

qt(.95,length(datw$EDU)-2)

## [1] 1.655215

Step 3: T test statistics

\[ t. test = b1-0/se(b_1) \]

seEDU=summary(mrl)$coefficients[2,4]
EDU_estimate <-summary(mrl)$coefficients[1,4]

sum(EDU_estimate-0)/(seEDU*(EDU_estimate))

## [1] 123469.6

Step 4:

Question 7 - Construct 99% CI of UrbPop coefficient

# critical values of 99% CI for Urbpop coefficient 
qt(0.005,150-2)

## [1] -2.609456

qt(0.995,150-2)

## [1] 2.609456

#(se) is the standard error of Urbpop 
se1<-summary(mrl)$coefficients[6,2]
se1

## [1] 24.23331

#b1 is the estimate of Urbpop (slope)
b1<-summary(mrl)$coefficients[6,1]
b1

## [1] 64.4396

# manual calculation

#P(-2.609228<t<2.609228)=0.99

#P(b1-1.976*(se)) < B1 < b1+1.976*(se))=0.99

#P(64.4396-2.609228*24.23331 <B1< 64.4396+2.609228*24.23331)=0.99

#Lower bound of 95% CI 
b1+qt(0.005,150-2)*se1

## [1] 1.203826

#upper bound of 95% CI 
b1+qt(0.995,150-2)*se1

## [1] 127.6754

#P(1.203826 < B1 < 127.6754)=0.99

\[P(1.203826 < B1 < 127.6754)=0.99 \] Based on the data and statistical analysis, we can say with 99% confidence that the true value of the parameter falls within the range from 1.203826 to 127.6754. This interval provides a range of values for that we believe captures the true effect or relationship represented by this parameter in the population. We cannot say with certainty that is exactly any specific value within this interval, but we are reasonably confident that it falls within this range.

confint(mrl,level = 0.99)

##                     0.5 %        99.5 %
## (Intercept) -3.134830e+03  5.019151e+03
## GDPpc        1.283811e-01  4.601495e-01
## Pop         -6.138157e-06  7.476868e-06
## EDU         -9.069668e+02  2.774878e+02
## PopDen      -1.822128e+00  1.544608e+00
## Urbpop       1.174972e+00  1.277042e+02
## GDPpc2      -4.226187e-06 -5.070697e-07

Question 8 - Plot predicted CO2 over the range of GDP

# Create a sequence of GDP values from 0 to 120,000 in $1,000 increments
GDP_values <- seq(0, 120000, by = 1000)

# Get the sample mean values of PopDen and UrbPop from your dataset
PopDen<-datw$PopDen
Urbpop<-datw$UrbPop
PopDen_mean <- mean(datw$PopDen)
UrbPop_mean <- mean(datw$UrbPop)

# Define the coefficients of regression model
b0 <- 9.422e+02
beta_GDP <- 2.943e-01
beta_Pop <- 6.694e-07
beta_EDU <- -3.147e+02
beta_PopDen <- -1.388e-01
beta_Urbpop <- 6.444e+01
beta_GDP2 <- -2.367e-06

#Initialize an empty vector to store predicted GHGpc values
predicted_GHGpc <- vector()

# Loop through GDP values and calculate predicted GHGpc
for (gdp in GDPpc) {
  predicted <- b0+ beta_GDP * gdp + beta_Pop * mean(datw$Pop) + 
               beta_EDU * mean(datw$EDU) + beta_PopDen * PopDen_mean + 
               beta_Urbpop * UrbPop_mean + beta_GDP2 * (gdp^2)
  predicted_GHGpc <- c(predicted_GHGpc, predicted)
}

# Create a data frame with GDP and predicted GHGpc values
predicted_data <- data.frame(GDP = GDPpc, Predicted_GHGpc = predicted_GHGpc)
predicted_data

# Plot the predicted values
library(ggplot2)

ggplot(predicted_data, aes(x = GDP, y = Predicted_GHGpc)) +
  geom_line() +
  labs(x = "GDP", y = "Predicted GHGpc") +
  ggtitle("Predicted GHGpc vs. GDP") +
  theme_minimal()

The shape of the relationship in the plot will likely exhibit a curvilinear or U-shaped pattern due to the inclusion of the quadratic term for GDPpc (GDPpc2 GDPpc2) in the regression model. Initially, as GDP per capita increases, GHG emissions per capita may rise, but at some point, further increases in GDP per capita might lead to a slower rate of increase in GHG emissions per capita or even a decrease. This curvature is a result of the quadratic term capturing potential diminishing returns or saturation effects associated with economic development and environmental impact.

Question 9 - Calculate the turning point

\[ GHGpc=β0+β1×GDPpc+β2×Pop+β3×EDU+β4×PopDen+β5×Urbpop+β6×GDPpc^2+ϵ \]

Question 10 - A joint hypothesis test

Formulate the null and alternative hypotheses

Step 1: find the null hypotethsis Null Hypothesis (HO): The true population coefficient for EDU is equal or greater than zero.

Alternative Hypothesis (H1) The true population coefficient fo EDU is less than zero.

\[ H_0:\beta EDU = 0 \] \[ H_1:\beta EDU < 0 \]

Step 2: find the critcal value