The gross domestic product (GDP) is a widely used measure of a country’s (or state’s) economy. It is defined as the total market value of all goods and services produced within a country (or state) in a specified period of time. The most common computation of GDP includes five items: consumption, gross investment, government spending, exports, and imports (which negatively impact the total). The Census Bureau reports the GDP for each state in the United States quarterly. The government also reports annual personal income totals (seasonally adjusted, in $millions) by state, along with each state’s population. Let’s examine how personal income is related to GDP at the state level.
names(x)
## [1] "State" "Personal.Income" "GDP" "Population"
Checking the Normal Population Assumption
The distribution of Personal Income is highly skewed to the right.
par(mfrow = c(1, 2))
hist(x$Personal.Income, main = 'Personal Income')
qqnorm(x$Personal.Income)
qqline(x$Personal.Income)
Of the transformations tried below, Log10 works best for Personal Income.
Square Root transformation
qqnorm(sqrt(x$Personal.Income), main = expression(paste("QQ plot of ", sqrt(Personal.Income))))
qqline(sqrt(x$Personal.Income))
Log10 transformation
qqnorm(log10(x$Personal.Income), main = "QQ plot of Log10(Personal.Income)")
qqline(log10(x$Personal.Income))
-1/sqrt transformation
qqnorm(-1/sqrt(x$Personal.Income), main = expression(paste("QQ plot of ", -1/sqrt(Personal.Income))))
qqline(-1/sqrt(x$Personal.Income))
-1/y transformation
qqnorm(-1/(x$Personal.Income), main = "QQ plot of -1/Personal.Income")
qqline(-1/(x$Personal.Income))
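The QQ plots above are compared by eye. As a rough numeric companion, one could compute the correlation between the sample quantiles and the theoretical normal quantiles for each rung of the ladder; a value closer to 1 means a straighter QQ plot. The sketch below is only an illustration: it uses a simulated right-skewed variable (`income`) as a stand-in for `x$Personal.Income`, and the names `ladder` and `qq_cor` are hypothetical, not part of the original analysis.

```r
# Simulated right-skewed stand-in for x$Personal.Income (NOT the real data)
set.seed(1)
income <- rlnorm(51, meanlog = 11.5, sdlog = 1)

# Candidate re-expressions, in ladder-of-powers order
ladder <- list(
  raw       = function(y) y,
  sqrt      = function(y) sqrt(y),
  log10     = function(y) log10(y),
  neg_rsqrt = function(y) -1 / sqrt(y),
  neg_recip = function(y) -1 / y
)

# Correlation between sample quantiles and theoretical normal quantiles;
# qqnorm(plot.it = FALSE) returns the plotted coordinates without drawing
qq_cor <- sapply(ladder, function(f) {
  q <- qqnorm(f(income), plot.it = FALSE)
  cor(q$x, q$y)
})

round(qq_cor, 3)           # one value per transformation
names(which.max(qq_cor))   # transformation with the straightest QQ plot
```

With the real data, replacing `income` by `x$Personal.Income` would give a number to set beside each QQ plot. It is a summary of the plot, not a substitute for looking at it.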
The scatterplot shows an apparent concave-downward pattern. Thus we need a transformation of GDP, which we choose by moving along the Ladder of Powers.
plot(x$GDP, log10(x$Personal.Income), xlab = 'GDP', ylab = 'Log10(Personal.Income)')
The best transformation for GDP is Log10, since it is the only transformation that yields a straight scatter.
Square Root transformation for GDP
plot(sqrt(x$GDP), log10(x$Personal.Income), xlab = expression(sqrt(GDP)), ylab = 'Log10(Personal Income)')
Log10 transformation for GDP
plot(log10(x$GDP), log10(x$Personal.Income), xlab = 'Log10(GDP)', ylab = 'Log10(Personal Income)')
-1/sqrt transformation for GDP
plot(-1/sqrt(x$GDP), log10(x$Personal.Income), xlab = expression(-1/sqrt(GDP)), ylab = 'Log10(Personal Income)')
-1/y transformation for GDP
plot(-1/(x$GDP), log10(x$Personal.Income), xlab = '-1/GDP', ylab = 'Log10(Personal Income)')
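The straightness of each scatterplot above can also be summarized with a correlation coefficient between the transformed GDP and Log10(Personal Income). The following is a minimal sketch under stated assumptions: the data are simulated so that the relationship is linear on the log-log scale, and `gdp`, `income`, `ladder`, and `r` are illustrative names rather than objects from the original analysis.

```r
# Simulated stand-ins for x$GDP and x$Personal.Income (NOT the real data);
# income is generated to be linear in log10(gdp) plus small noise
set.seed(2)
gdp    <- rlnorm(51, meanlog = 12, sdlog = 1)
income <- 10^(0.1 + 0.97 * log10(gdp) + rnorm(51, sd = 0.05))

# Same rungs of the ladder of powers that are plotted above
ladder <- list(
  raw       = function(v) v,
  sqrt      = function(v) sqrt(v),
  log10     = function(v) log10(v),
  neg_rsqrt = function(v) -1 / sqrt(v),
  neg_recip = function(v) -1 / v
)

# Pearson correlation of each re-expressed GDP with log10(income);
# the straightest scatter has the correlation closest to 1
r <- sapply(ladder, function(f) cor(f(gdp), log10(income)))
round(r, 3)
```

A high correlation alone does not prove linearity, so this is best read together with the scatterplots.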
Visualizing the Regression Line
m <- lm(log10(Personal.Income) ~ log10(GDP), data = x)
plot(log10(x$GDP), log10(x$Personal.Income), xlab = 'Log10(GDP)', ylab = 'Log10(Personal Income)')
abline(m, col = "red") #visualizing the Regression line
Residual vs. Fitted values. Is the equal-variance assumption satisfied in the fitted model?
Yes, it is satisfied, since the residuals are spread roughly evenly across the fitted values.
plot(m$fitted.values, m$residuals, xlab = 'Fitted values', ylab = 'Residuals' )
abline(a = 0, b = 0, col = 'blue')
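The residual plot is judged visually. One informal numeric check, sketched here on simulated data, is to regress the absolute residuals on the fitted values: a slope near zero is consistent with constant spread, while a clearly nonzero slope suggests the spread changes with the fitted value. This is an assumption-laden illustration and not a procedure used in the original analysis; `sim`, `fit`, and `spread_fit` are hypothetical names.

```r
# Simulated data with constant error variance on the log10 scale (NOT the real data)
set.seed(3)
sim <- data.frame(GDP = rlnorm(51, meanlog = 12, sdlog = 1))
sim$Personal.Income <- 10^(0.1 + 0.97 * log10(sim$GDP) + rnorm(51, sd = 0.05))

fit <- lm(log10(Personal.Income) ~ log10(GDP), data = sim)

# Regress |residual| on fitted value; the slope row summarizes any trend in spread
spread_fit <- lm(abs(resid(fit)) ~ fitted(fit))
slope_row  <- summary(spread_fit)$coefficients[2, ]
round(slope_row, 4)   # estimate, std. error, t value, p-value
```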
However, two points stand out in the residual plot. They belong to the District of Columbia, with a residual of -0.32, and Delaware, with a residual of -0.159. Both appear to be outliers.
ord <- order(m$residuals)
x[ord[1], ]
## State Personal.Income GDP Population
## 9 District of Columbia 31779 69470 550521
x[ord[2], ]
## State Personal.Income GDP Population
## 8 Delaware 32359 49001 843524
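Ordering the raw residuals, as above, finds the most extreme points. To judge how unusual they are relative to the model's residual spread, one option is to look at standardized residuals from `rstandard()`, with |value| > 2 as a common rule-of-thumb cutoff. The sketch below uses simulated data in which rows 8 and 9 are deliberately shifted downward to mimic the two low points; the data, the shifts, and the cutoff are all assumptions for illustration.

```r
# Simulated data (NOT the real data); rows 8 and 9 are planted low outliers
set.seed(4)
sim <- data.frame(State = paste("State", 1:51),
                  GDP   = rlnorm(51, meanlog = 12, sdlog = 1))
log_income <- 0.1 + 0.97 * log10(sim$GDP) + rnorm(51, sd = 0.03)
log_income[c(8, 9)] <- log_income[c(8, 9)] - c(0.16, 0.32)
sim$Personal.Income <- 10^log_income

fit <- lm(log10(Personal.Income) ~ log10(GDP), data = sim)

# Standardized residuals put every residual on a common, unit-free scale
std_res <- rstandard(fit)
ord     <- order(resid(fit))   # most negative residual first

# Three most negative residuals, raw and standardized
data.frame(State        = sim$State[ord[1:3]],
           residual     = round(resid(fit)[ord[1:3]], 3),
           standardized = round(std_res[ord[1:3]], 2))

# Rows exceeding the rule-of-thumb cutoff
flagged <- sim$State[abs(std_res) > 2]
flagged
```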
To see if the two outliers are influential or not, we fit a new regression model without those points and compare it to our first model.
Removing the outliers
x_new <- x[-c(ord[1], ord[2]),]
Are those outliers influential?
As we can see in the scatterplot, the two regression models are very similar, and therefore the outliers are not influential.
new_m <- lm(log10(Personal.Income) ~ log10(GDP), data = x_new)
plot(log10(x_new$GDP), log10(x_new$Personal.Income), main = 'New Regression Plot', xlab = 'Log10(GDP)', ylab = 'Log10(Personal Income)')
abline(new_m, col = 'red'); abline(m, col = 'green')
par(mfrow = c(1,2))
plot(log10(x_new$GDP), log10(x_new$Personal.Income), main = 'New Regression Plot', xlab = 'Log10(GDP)', ylab = 'Log10(Personal Income)')
abline(new_m, col = 'red'); abline(m, col = 'green')
plot(new_m$fitted.values, new_m$residuals, main = 'New Residual Plot', xlab = 'Fitted values', ylab = 'Residuals')
abline(a = 0, b = 0, col = 'orange' )
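The comparison of the two fitted lines can be backed up with numbers: the coefficients of both models side by side, plus Cook's distance from the full model, which measures how much each observation shifts the fit. The sketch below is an illustration on simulated data with two planted low points in rows 8 and 9; the names `full`, `trimmed`, `coef_cmp`, and `cd` are hypothetical and the data are not the real state figures.

```r
# Simulated data (NOT the real data); rows 8 and 9 are planted low outliers
set.seed(5)
sim <- data.frame(GDP = rlnorm(51, meanlog = 12, sdlog = 1))
log_income <- 0.1 + 0.97 * log10(sim$GDP) + rnorm(51, sd = 0.03)
log_income[c(8, 9)] <- log_income[c(8, 9)] - c(0.16, 0.32)
sim$Personal.Income <- 10^log_income

# Fit with and without the two suspect rows
full    <- lm(log10(Personal.Income) ~ log10(GDP), data = sim)
trimmed <- lm(log10(Personal.Income) ~ log10(GDP), data = sim[-c(8, 9), ])

# Intercept and slope of each model, side by side
coef_cmp <- cbind(full = coef(full), trimmed = coef(trimmed))
round(coef_cmp, 4)

# Cook's distance for every row of the full model; largest values first
cd <- cooks.distance(full)
round(head(sort(cd, decreasing = TRUE), 3), 3)
```

If the two coefficient columns are close and no Cook's distance is large, that supports the conclusion that the flagged points are outliers without being influential.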
Since the two models are similar, the first model will be used for predictions. For example, what is the personal income for a state with a GDP = $300,000?
First, we check whether a GDP = $300,000 is within our data range. It is, so the prediction is an interpolation rather than an extrapolation.
range(x$GDP)
## [1] 19713 1457090
Second, we go ahead and predict the personal income of a state with a GDP = $300,000. The predicted Personal Income for that state is 288544.1 and its 95% prediction interval is (218840.9, 380448.5).
pred_data <- data.frame(GDP = 300000)
result <- predict(m, newdata = pred_data, interval = 'prediction')
10^result
## fit lwr upr
## 1 288544.1 218840.9 380448.5
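Because the model is fitted on the Log10 scale, `predict()` returns Log10 values, and raising 10 to those values returns the fit and interval limits to the original units. Equivalently, the fitted line corresponds to a power law, Personal.Income = 10^b0 * GDP^b1. The sketch below checks that equivalence on simulated data; the data and the names `sim`, `fit`, `new_gdp`, and `by_hand` are assumptions for illustration, so its numbers will not match the output above.

```r
# Simulated data (NOT the real data)
set.seed(6)
sim <- data.frame(GDP = rlnorm(51, meanlog = 12, sdlog = 1))
sim$Personal.Income <- 10^(0.1 + 0.97 * log10(sim$GDP) + rnorm(51, sd = 0.05))

fit <- lm(log10(Personal.Income) ~ log10(GDP), data = sim)

# Check that the new GDP value lies inside the observed range
new_gdp  <- 300000
in_range <- new_gdp >= min(sim$GDP) && new_gdp <= max(sim$GDP)
in_range

# Prediction and 95% prediction interval on the log10 scale
log_pred <- predict(fit, newdata = data.frame(GDP = new_gdp),
                    interval = "prediction", level = 0.95)

# Back-transform to the original units
pred <- 10^log_pred
pred

# Same point prediction from the power-law form of the model
b       <- unname(coef(fit))
by_hand <- 10^b[1] * new_gdp^b[2]
all.equal(unname(pred[1, "fit"]), by_hand)
```

One caveat on interpretation: back-transforming a Log10-scale fit estimates a typical (median-like) value on the original scale rather than the mean, and the back-transformed interval is not symmetric around the fit.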