Objective: Create a dataset with 1000 data points, 3 independent continuous variables (predictors) and one continuous dependent variable (outcome). There should be a linear relationship (including some noise) between the predictors and the outcome. At least 2 predictors should have a limited range (e.g. between -10 and 10).
Preparing Workspace
rm(list = ls())
cat("\f")
Preparing Packages
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(MASS)
Defining Linear Relationship
#making data set reproduceable
set.seed(7)
#defining sample size
n <- 1000
#defining three independent continuous variables w/ limited range
x.1 <- runif(n, min = 10, max = 20) #predictor for the natural resource revenues of a country
x.2 <- runif(n, min = 15, max = 30) #predictor for natural resource dependence of a country
x.3 <- runif(n, min = 8, max = 16) #predictor for the natural resource market share of a country
#defining linear relationship
beta.0 <- 4 #intercept, when all independent variables are equal to 0, y = 4
beta.1 <- 3 #for every one unit increase in x.1, y increases 3 units
beta.2 <- .8 #for every one unit increase in x.2, y increases .8 units
beta.3 <- 1.8 #for every one unit increase in x.3, y increases 1.8 units
#incorporating noise
epsilon <- rnorm(n, mean = 0, sd = 5)
#defining multiple linear regression equation
y <- beta.0 + beta.1 * x.1 + beta.2 * x.2 + beta.3 * x.3 + epsilon #the outcome variable (y) is the military expenditures of a given country
Background: For this assignment, I chose to create a hypothetical data set containing fictitious country-level data related to a state’s natural resource endowment and military expenditures. The data set contains three independent, continuous variables (predictors): 1) Natural resource revenues (x.1), 2) Natural resource dependence (x.2), 3) Natural resource market share (x.3). The outcome variable (y) is the military expenditures of a given country.
Building Dataset
my.dataset <- data.frame(x.1, x.2, x.3, y)
colnames(my.dataset) <- c("resource.wealth", "resource.dependence", "resource.marketshare", "military.expenditures")
summary(my.dataset)
## resource.wealth resource.dependence resource.marketshare military.expenditures
## Min. :10.01 Min. :15.04 Min. : 8.008 Min. : 55.88
## 1st Qu.:12.61 1st Qu.:18.68 1st Qu.: 9.792 1st Qu.: 80.16
## Median :14.96 Median :22.21 Median :11.858 Median : 88.74
## Mean :15.06 Mean :22.38 Mean :11.922 Mean : 88.63
## 3rd Qu.:17.63 3rd Qu.:26.16 3rd Qu.:13.935 3rd Qu.: 96.69
## Max. :19.99 Max. :29.98 Max. :15.993 Max. :120.19
Objective: Run 3 distinct simple linear regressions, regressing the outcome against each one of the independent variables
Model 1: Natural Resource Wealth
plot(x = my.dataset$resource.wealth, y = my.dataset$military.expenditures,
main = "Resource Wealth vs. Military Expenditures",
xlab = "Resource Wealth (GDP per capita)",
ylab = "Military Expenditures (MM USD)"
)
abline(lm(my.dataset$military.expenditures ~ my.dataset$resource.wealth), col="red")
mylm.1 <- lm(my.dataset$military.expenditures ~ my.dataset$resource.wealth)
summary(mylm.1)
##
## Call:
## lm(formula = my.dataset$military.expenditures ~ my.dataset$resource.wealth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.1540 -4.9548 -0.3493 4.9448 22.6141
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.35611 1.26019 34.40 <2e-16 ***
## my.dataset$resource.wealth 3.00583 0.08219 36.57 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.452 on 998 degrees of freedom
## Multiple R-squared: 0.5727, Adjusted R-squared: 0.5723
## F-statistic: 1338 on 1 and 998 DF, p-value: < 2.2e-16
Interpretation: The first linear regression model estimates the relationship between a country’s natural resource wealth (predictor) and military expenditures (outcome).
The intercept parameter of indicates that the predicted military expenditures of a country are 43.36MM USD when natural resource revenues are 0 (% of GDP per capita) - the result is statistically significant at the .001 level.
The coefficient suggests a 1% GDP per capita increase in natural resource revenues will lead to a 3MM USD increase in military expenditures. The result is statistically significant at the .001 level.
The p-value is statistically significant at the .001 level. In a simplified example, let’s say we are policymakers attempting to determine whether there is a link between a country’s natural resource wealth and military expenditures. Our null hypothesis claims a country’s natural resource wealth has no effect on its military expenditures, while our alternative hypothesis takes the opposite position - natural resource wealth does have an effect on military expenditures. If we are testing this claim at a significance level of .05 (alpha), our p-value of <2.2e-16 indicates we have strong evidence supporting the alternative hypothesis - natural resource wealth does not have “no effect” on a country’s military expenditures.
The high F-statistic indicates the model fits the data well - there is a highly significant relationship between resource wealth and military expenditures.
The adjusted R-squared value of .57 indicates there is a moderate difference between the observed and fitted values - natural resource wealth accounts for a large proportion of variability in military expenditures, however, there is still a large amount of unexplained variability.
par(mfrow=c(2,3))
plot(mylm.1, c(1:6))
Interpretation:
Residuals vs. Fitted: The residuals appear to be equally, and randomly spread around the horizontal line. In the context of the model, the plot appears to show constant variance, thus upholding the assumption of homoscedasticity. There does appear to slight deviations from the “cloud” around -20 and 10, which may be the result of the noise incorporated into the linear relationship. Points 210, 395, and 453 warrant further investigation.
The Durbin-Watson test statistic is close to 2 and the p-value is greater than the .05 level of significance which indicates we can reject the null hypothesis - there is insufficient evidence to suggest there autocorrelation in the model.
dwtest(my.dataset$military.expenditures ~ my.dataset$resource.wealth) #high p-value indicates no evidence of autocorrelation (insufficient evidence to reject null)
##
## Durbin-Watson test
##
## data: my.dataset$military.expenditures ~ my.dataset$resource.wealth
## DW = 2.0247, p-value = 0.6528
## alternative hypothesis: true autocorrelation is greater than 0Q-Q Residuals: The model suggests that the residuals are demonstrating non-normal characteristics as the observations slightly deviate from the line, and curve off in the extremities. To further test whether the assumption of normality of residuals holds, I performed a Shapiro-Wilk normality test in conjunction with the Q-Q plot. The result suggests we should not reject the null hypothesis - we do not have sufficient evidence to suggest the residuals are not normally distributed. Given the p-value is close to the level of significance (.05) and the test statistic is is close to 1, the result suggest the residuals are showing borderline normality.
shapiro.test(residuals (mylm.1))
##
## Shapiro-Wilk normality test
##
## data: residuals(mylm.1)
## W = 0.99701, p-value = 0.05773Scale-Location: The residuals again appear to uphold the assumption of homoscedasticity given the spread between the fitted and standardized residuals remains consistent as the fitted values increase.
Residuals vs. Leverage: The data contains three influential outliers (208, 487, and 597), which may be exerting a degree of leverage on the model.
Model 2: Natural Resource Dependence
plot(x = my.dataset$resource.dependence, y = my.dataset$military.expenditures,
main = "Resource Dependence vs. Military Expenditures",
xlab = "Resource Dependence (% of GDP)",
ylab = "Military Expenditures (MM USD)"
)
abline(lm(my.dataset$military.expenditures ~ my.dataset$resource.dependence), col="red")
mylm.2 <- lm(my.dataset$military.expenditures ~ my.dataset$resource.dependence)
summary(mylm.2)
##
## Call:
## lm(formula = my.dataset$military.expenditures ~ my.dataset$resource.dependence)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.6105 -8.0119 0.1986 7.8615 26.2196
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.0730 1.7967 39.558 <2e-16 ***
## my.dataset$resource.dependence 0.7846 0.0788 9.957 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.87 on 998 degrees of freedom
## Multiple R-squared: 0.09036, Adjusted R-squared: 0.08945
## F-statistic: 99.14 on 1 and 998 DF, p-value: < 2.2e-16
Interpretation: The second linear regression model estimates the relationship between a country’s natural resource dependence and military expenditures.
The intercept parameter of indicates that the predicted military expenditures of a country is 71.07 MM USD when natural resource dependence is 0 units (% of GDP) - the result is statistically significant at the .001 level.
The coefficient suggests a 1% GDP increase in natural resource dependence will lead to a .78 M USD increase in military expenditures. The result is statistically significant at the .001 level.
The p-value is statistically significant at the .001 level. Using a similar hypothesis test to the one used for the first regression (the null states natural resource dependence has no effect on military expenditures), at a significance level of .05 (alpha), our p-value of <2.2e-16 indicates we have strong evidence supporting the alternative hypothesis - natural resource dependence does not have “no effect” on a country’s military expenditures.
The high F-statistic indicates the model fits the data well - there is a highly significant relationship between resource dependence and military expenditures.
The adjusted R-squared value of .09 indicates the model has very little explanatory power (lowest result of the three regressions). The result does raise questions around fit considering a significant portion of variability remains unexplained.
par(mfrow=c(2,3))
plot(mylm.2, c(1:6))
Interpretation:
Residuals vs. Fitted: The residuals appear to be equally spread around the horizontal line. There appears to be a few outliers, however, the spread remains consistent throughout the plot. The Durbin-Watson tests yields a high p-value (greater than alpha of .05) which indicates there’s no evidence of autocorrelation in the model.
dwtest(my.dataset$military.expenditures ~ my.dataset$resource.dependence) #high p-value indicates no evidence of autocorrelation (insufficient evidence to reject null)
##
## Durbin-Watson test
##
## data: my.dataset$military.expenditures ~ my.dataset$resource.dependence
## DW = 2.1327, p-value = 0.9821
## alternative hypothesis: true autocorrelation is greater than 0Q-Q Residuals: The model suggests that the residuals are demonstrating non-normal characteristics as the observations deviate from the line, and curve off in the extremities. The Shapiro-Wilk test suggests this is the case - we have strong evidence that the residuals are not normally distributed. The result raises questions around the reliability of our confidence intervals and p-values.
shapiro.test(residuals (mylm.2))
##
## Shapiro-Wilk normality test
##
## data: residuals(mylm.2)
## W = 0.99306, p-value = 0.0001283Scale-Location: The standardized residuals again appear to uphold the assumption of homoscedasticity given the spread between the fitted and standardized residuals remains consistent as the fitted values increase.
Residuals vs. Leverage: The data contains three outliers (208, 487, and 681), however they have limited leverage (all are outside of Cook’s distance).
Model 3: Natural Resource Market Share
plot(x = my.dataset$resource.marketshare, y = my.dataset$military.expenditures,
main = "Resource Market Share vs. Military Expenditures",
xlab = "Resource Market Share (production volume as a percentage of global production)",
ylab = "Military Expenditures (MM USD)"
)
abline(lm(my.dataset$military.expenditures ~ my.dataset$resource.marketshare), col="red")
mylm.3 <- lm(my.dataset$military.expenditures ~ my.dataset$resource.marketshare)
summary(mylm.3)
##
## Call:
## lm(formula = my.dataset$military.expenditures ~ my.dataset$resource.marketshare)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.734 -6.967 -0.191 7.238 27.426
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.3756 1.6459 37.90 <2e-16 ***
## my.dataset$resource.marketshare 2.2024 0.1354 16.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.13 on 998 degrees of freedom
## Multiple R-squared: 0.2095, Adjusted R-squared: 0.2087
## F-statistic: 264.5 on 1 and 998 DF, p-value: < 2.2e-16
Interpretation: The third linear regression model estimates the relationship between a country’s natural resource market share and military expenditures.
The intercept parameter indicates that the predicted military expenditures of a country is 62.38MM USD when a country’s natural resource market share is 0 (total volume as a % of global production) - the result is statistically significant at the .001 level.
The coefficient suggests a one-unit increase (total volume as a % of global production) in natural resource market share will lead to a 2.2 unit increase in military expenditures. The result is statistically significant at the .001 level.
The p-value is statistically significant at the .001 level, which would lead us to reject the null hypothesis - at a significance level of .05 (alpha), our p-value of <2.2e-16 indicates we have strong evidence supporting the alternative hypothesis - natural resource market share does not have “no effect” on a country’s military expenditures.
The high F-statistic indicates the model fits the data well - there is a highly significant relationship between resource market share and military expenditures.
The adjusted R-squared value of .21 indicates the model’s explanatory power falls in between the first two regressions - natural resource market share explains a smaller proportion of variability in the model than natural resource wealth, and a greater amount of variability than natural resource dependence. A large amount of variability remains unexplained which raises questions around fit.
par(mfrow=c(2,3))
plot(mylm.3, c(1:6))
Interpretation:
Residuals vs. Fitted: Similar to the previous two models, the residuals of the third model appear to be equally spread around the horizontal line. There appears to be a more notable outliers (208, 597, and 630) however the spread remains relatively consistent as the fitted values increase. The Durbin-Watson tests yields a p-value higher than our chosen alpha (.05) which indicates there’s no evidence of autocorrelation in the model.
dwtest(my.dataset$military.expenditures ~ my.dataset$resource.marketshare) #high p-value indicates no evidence of autocorrelation (insufficient evidence to reject null)
##
## Durbin-Watson test
##
## data: my.dataset$military.expenditures ~ my.dataset$resource.marketshare
## DW = 2.0881, p-value = 0.9184
## alternative hypothesis: true autocorrelation is greater than 0Q-Q Residuals: The model suggests that the residuals are demonstrating non-normal characteristics as the observations deviate from the line, and curve off in the extremities. The deviation is less severe than the second regression model. The Shapiro-Wilk test yields a similar outcome to the second regression - we have sufficient evidence in support of the alternative hypothesis. Similar to the first regression, the result is borderline and not as strong of an indication of violation as the second regression.
shapiro.test(residuals (mylm.3))
##
## Shapiro-Wilk normality test
##
## data: residuals(mylm.3)
## W = 0.9964, p-value = 0.02118Scale-Location: The fitted values appear to uphold the assumption of homoscedasticity - the spread between the fitted and standardized residuals remains consistent as the fitted values increase.
Residuals vs. Leverage: The data contains three outliers (306, 503, and 630). Point 503 is exerting the highest degree of leverage on the model and warrants further investigation.
The linear regression models appear to uphold three of the four Gauss Markov Assumptions (linearity, constant variability, and independent observations). Each regression departs from the assumption of normality of residuals (as indicated in the corresponding Q-Q plots and Shapiro-Wilk tests). Given the large sample size (n=1,000), the impact is less severe, however, the result does raise questions around the reliability of our confidence intervals. To further test the explanatory power of the models, a transformation of the data is recommended. Log transformations of the predictors or outcome variable (or both simultaneously) is one approach which may be employed to reduce skewness.