Final Project

Part 1 - Introduction:

We aim to investigate which economic variables are most strongly associated with government revenue. Macroeconomics offers a wide range of theories about which variables affect which others, and the interpretations often conflict. Average tax burden is one example: left-wing economists say it increases revenue, while conservatives argue that it drags down productivity, so overall taxable income falls and revenue falls with it. This project investigates the question as a matter of determining policy at the country level. The answer could affect employment, tax policy, and redistribution efforts.

Part 2 - Data:

The data were linked from Google Analytics to the OECD Factbook:

http://www.oecd-ilibrary.org/economics/oecd-factbook_18147364

The data points are 34 individual countries, collected via the Factbook and downloaded into Excel. The Excel files were then merged into one global file and loaded into R. This is an observational study of data reported for 2014. The outcome variable is government revenue as a percentage of GDP, and the input variables are Average Worker Productivity, Average Tax Burden on a given worker, the Gini Inequality Coefficient, and the Unemployment Rate.
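The merge step is described but not shown. A minimal sketch of how it could look, assuming each downloaded table has one row per country and a shared `Country` key column (an assumption; the actual file layout is not given). The three small tables and their values are made up for illustration.

```r
# Hypothetical per-variable tables standing in for the separate Excel downloads.
# Country codes and values are invented for illustration only.
tax  <- data.frame(Country = c("AAA", "BBB", "CCC"), Avg_tax = c(30.1, 42.5, 25.0))
gini <- data.frame(Country = c("AAA", "BBB", "CCC"), Gini = c(0.31, 0.27, 0.40))
rev  <- data.frame(Country = c("AAA", "CCC"), Rev = c(39.0, 33.5))  # BBB is missing

# Merge all tables on the shared key. all = TRUE keeps countries that are
# missing from any one table and fills the gap with NA.
tables <- list(tax, gini, rev)
merged <- Reduce(function(x, y) merge(x, y, by = "Country", all = TRUE), tables)
print(merged)
```

Keeping unmatched countries as NA rather than dropping them would be consistent with the NA counts that appear in the summary output later in the report.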

Please note that we cannot establish a causal link because no experiment was done; e.g., we haven’t adjusted the variables to see how Revenue changes for a given country. The scope of inference is countries around the world. The inference would not apply to countries whose income depends mostly on something other than their domestic economies, such as trade-heavy economies.

Part 3 - Exploratory data analysis:

First, let’s look at the histograms:

Independent Variables:

par(mfrow = c(2, 2))
hist(EconVars$Unemp,xlab="Unemployment",main = "")
hist(EconVars$Avg_tax,xlab="Average Taxes",main = "")
hist((EconVars$Gini),xlab="Gini Coef.",main = "")
hist(EconVars$Prod,xlab="Productivity",main = "")

summary(EconVars[2:6])

##     Avg_tax           Prod            Rev             Gini       
##  Min.   : 7.00   Min.   :20.23   Min.   :24.50   Min.   :0.2490  
##  1st Qu.:29.84   1st Qu.:35.33   1st Qu.:37.70   1st Qu.:0.2745  
##  Median :37.58   Median :50.51   Median :39.70   Median :0.3050  
##  Mean   :35.20   Mean   :49.23   Mean   :42.01   Mean   :0.3146  
##  3rd Qu.:42.61   3rd Qu.:63.26   3rd Qu.:47.40   3rd Qu.:0.3372  
##  Max.   :55.55   Max.   :92.50   Max.   :58.40   Max.   :0.5030  
##                  NA's   :1       NA's   :1                       
##      Unemp       
##  Min.   : 0.386  
##  1st Qu.:22.643  
##  Median :33.448  
##  Mean   :32.876  
##  3rd Qu.:44.214  
##  Max.   :63.899  
##  NA's   :1

Dependent Variable:

hist(EconVars$Rev,xlab="Revenue",main = "")
summary(EconVars$Rev)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   24.50   37.70   39.70   42.01   47.40   58.40       1

Most of the variables are well behaved, some with skew, but Productivity is the messiest: it is almost uniformly distributed.

Let’s plot a correlation matrix for all these variables to see what’s going on preliminarily:

library(lattice)
cormat<-cor(EconVars[2:6],use = "pairwise.complete.obs")
levelplot(cormat,pretty=TRUE)

Overall, most variables are at least slightly correlated with one another. The exception is Gini, which is anticorrelated with most of the others.
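The `use = "pairwise.complete.obs"` option matters here because several columns contain missing values, while the regression later drops incomplete rows entirely. The sketch below shows the difference on synthetic data (the data frame `toy` and its columns are hypothetical, not the OECD values).

```r
# Synthetic data for illustration only; not the OECD values.
set.seed(1)
toy <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))
toy$a[2] <- NA   # one missing value in a
toy$c[7] <- NA   # one missing value in c

# Pairwise: each entry uses every row where BOTH variables of that pair are observed.
pairwise <- cor(toy, use = "pairwise.complete.obs")

# Listwise: any row with a missing value is dropped before anything is computed.
listwise <- cor(toy, use = "complete.obs")

# The a-b entries are computed from different rows: pairwise keeps row 7
# for that pair, listwise does not.
print(round(pairwise, 3))
print(round(listwise, 3))
```

With pairwise deletion, different cells of the matrix can rest on different subsets of countries, so the matrix is best read as a preliminary view rather than as an exact match to the data the model uses.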

Part 4 - Inference:

H0: There is no significant (multiple) linear association between Revenue and these economic variables.

Ha: There is a statistically significant association.

We will create a linear model of Government Revenue as a function of the other 4 variables. The assumptions of this model are:

the residuals of the model are nearly normal,
the variability of the residuals is nearly constant,
the residuals are independent, and
each variable is linearly related to the outcome.

Let’s begin:

model<-lm(Rev~Prod+Avg_tax+Gini+Unemp,data = EconVars)

summary(model)

## 
## Call:
## lm(formula = Rev ~ Prod + Avg_tax + Gini + Unemp, data = EconVars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.429 -2.292 -1.497  1.710  9.196 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  42.30364    8.09358   5.227 1.66e-05 ***
## Prod          0.07356    0.04753   1.548  0.13335    
## Avg_tax       0.53348    0.09452   5.644 5.43e-06 ***
## Gini        -56.88682   17.72120  -3.210  0.00341 ** 
## Unemp        -0.16814    0.05332  -3.154  0.00393 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.044 on 27 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7701, Adjusted R-squared:  0.7361 
## F-statistic: 22.61 on 4 and 27 DF,  p-value: 2.737e-08

So the equation for the model, including the intercept, is:

Revenue = 42.30 + 0.07 × Prod + 0.53 × Avg_tax − 56.89 × Gini − 0.17 × Unemp
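As a worked example of how the fitted equation is used, the sketch below computes a fitted Revenue for a hypothetical country whose predictors all sit at their sample medians from the summary table in Part 3. The coefficients are copied from the `summary(model)` output; the "median country" is an illustration, not an actual observation.

```r
# Coefficients copied from the summary(model) output.
coefs <- c(Intercept = 42.30364, Prod = 0.07356, Avg_tax = 0.53348,
           Gini = -56.88682, Unemp = -0.16814)

# Hypothetical country: each predictor set to its sample median
# (values from the summary table in Part 3).
x <- c(Intercept = 1, Prod = 50.51, Avg_tax = 37.58, Gini = 0.3050, Unemp = 33.448)

# The fitted value is the sum of coefficient * predictor over all terms.
pred <- sum(coefs * x[names(coefs)])
print(round(pred, 2))
```

The result is roughly 43, close to the sample mean of Revenue. With `model` in memory, `predict(model, newdata = ...)` would give fitted values without copying coefficients by hand.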

The model’s R-squared is 0.77, meaning the model accounts for 77% of the variance in Revenue.

All variables except Productivity are significantly associated (t-test) with Revenue in this model. If we remove Productivity from the model, we get an R-squared of 74%, so including it doesn’t seem to matter much in terms of explanatory value. Let’s examine this further with an analysis of variance.

anova<-aov(Rev~Avg_tax+Gini+Unemp+Prod,data = EconVars)
summ<-summary(anova)
summ

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Avg_tax      1 1000.9  1000.9  61.189 2.07e-08 ***
## Gini         1  249.5   249.5  15.254 0.000568 ***
## Unemp        1  189.9   189.9  11.610 0.002071 ** 
## Prod         1   39.2    39.2   2.395 0.133355    
## Residuals   27  441.6    16.4                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2 observations deleted due to missingness
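The Prod row of this table tests whether adding Productivity after the other three terms improves the fit, which is the with-and-without comparison described earlier. As a minimal sketch, the same comparison can be run directly on two nested models with `anova()`. The data below are simulated purely for illustration (the names `sim`, `full`, and `reduced` are not from the original analysis), so the numbers will not match the report.

```r
# Synthetic data for illustration only. Variable names mirror the report,
# but the values are simulated, not the OECD data.
set.seed(42)
n <- 32
sim <- data.frame(Avg_tax = runif(n, 10, 55), Gini = runif(n, 0.25, 0.50),
                  Unemp = runif(n, 1, 60), Prod = runif(n, 20, 90))
sim$Rev <- 40 + 0.5 * sim$Avg_tax - 50 * sim$Gini - 0.15 * sim$Unemp +
  rnorm(n, sd = 4)

full    <- lm(Rev ~ Prod + Avg_tax + Gini + Unemp, data = sim)
reduced <- lm(Rev ~ Avg_tax + Gini + Unemp, data = sim)

# Partial F-test: does adding Prod reduce the residual sum of squares
# by more than chance alone would suggest?
cmp <- anova(reduced, full)
print(cmp)

# When a single term is dropped, the partial F statistic equals the
# squared t statistic of that term in the full model.
t_prod <- summary(full)$coefficients["Prod", "t value"]
f_prod <- cmp$F[2]
```

The same identity is visible in the report’s own output: the t value for Prod is 1.548, and its square is about 2.40, matching the F value of 2.395 in the Prod row.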

pie(x=c(1000.9,249.5,189.9,39.2,441.6),labels = c("Tax","Gini","Unemp","Prod","Residuals"),main = "Sum sq Variance Percentages")

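We see that taxes account for about half of the total variance, and productivity accounts for very little. Note that aov reports sequential sums of squares, so each term’s share depends on the order in which the terms were entered. A little over a fifth of the variance remains unexplained.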
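The shares shown in the pie chart can be computed directly from the sums of squares in the table. The sketch below copies those five values from the aov output and also recovers the model’s R-squared from them; the variable names are illustrative.

```r
# Sums of squares copied from the aov table.
ss <- c(Avg_tax = 1000.9, Gini = 249.5, Unemp = 189.9, Prod = 39.2,
        Residuals = 441.6)

# Share of the total sum of squares attributed to each term, in percent.
share <- round(100 * ss / sum(ss), 1)
print(share)

# R-squared is one minus the residual share of the total.
r_squared <- 1 - ss[["Residuals"]] / sum(ss)
print(round(r_squared, 4))
```

The residual share comes to about 23%, and one minus that share reproduces the Multiple R-squared of 0.7701 reported by `summary(model)`.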

Now let’s check our assumptions:

hist(model$residuals, main="Residuals")
summary(model$residuals)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -5.429  -2.292  -1.497   0.000   1.710   9.196

qqnorm(model$residuals)
qqline(model$residuals)

This condition appears to be met, based on the residuals being roughly symmetric and centered on zero, but there may be non-normal behavior in the quantiles above 1.

plot(x=model$fitted.values,y=abs(model$residuals),xlab="Fitted",ylab="Absolute residuals")

We don’t see deviations from constant variance here.

We do not have the order in which these observations were collected, so we cannot check the residuals for independence against collection order.

moddat<-anova$model
par(mfrow = c(2, 2))
plot((moddat$Avg_tax),model$residuals,main="",xlab="Tax",ylab="Residuals")
plot((moddat$Prod),model$residuals,main="",xlab="Prod",ylab="Residuals")
plot((moddat$Gini),model$residuals,main="",xlab="Gini",ylab="Residuals")
plot((moddat$Unemp),model$residuals,main="",xlab="Unemp",ylab="Residuals")

There are no apparent correlations between the residuals and any of the predictors, which is consistent with the model being linear in each of these variables.

Note that the normality condition matters most when we take the ‘p-value’ approach and judge the model by its p-values. However, given that we have 30+ countries and are taking the “Rsq” approach, the fact that we are probably too sure of the p-values is not the end of the world. The t-tests still help limit how often we commit Type I errors.

Part 5 - Conclusion:

Higher government revenue in a country appears to be associated, at least in part, with a higher average tax burden, lower unemployment, and lower income inequality. Productivity is not significantly associated with Revenue and explains very little of the variance.

This simple model appears to support a Keynesian view of economics: that taxes are important for government revenue, that income inequality is detrimental, and that everyone should be employed.

For future research, we would want to see how these observations hold up over time and whether new variables become significant. We would also like to add trade to the model to account for the unexplained variance.