Transformations on Linear Regression Analysis

GDP by State

The gross domestic product (GDP) per capita is a widely used measure of a country’s (or state’s) economy. It is defined as the total market value of all goods and services produced within a country (or state) in a specified period of time. The most common computation of GDP includes five items: consumption, gross investment, government spending, exports, and imports (which negatively impact the total). The Census Bureau reports the GDP for each state in the United States quarterly. The government also reports annual personal income totals (seasonally adjusted, in $ millions) by state and each state’s population. Let’s examine how personal income is related to GDP at the state level.

names(x)

## [1] "State"           "Personal.Income" "GDP"             "Population"

Checking the Normal Population Assumption

The distribution of Personal Income is highly skewed to the right.

par(mfrow = c(1, 2))
hist(x$Personal.Income, main = 'Personal Income')
qqnorm(x$Personal.Income)
qqline(x$Personal.Income)

Trying transformations from the Ladder of Powers

Of the transformations tried below, Log10 is the best one for Personal Income.

Square Root transformation

qqnorm(sqrt(x$Personal.Income), main = expression(paste("QQ plot of ", sqrt(Personal.Income))))
qqline(sqrt(x$Personal.Income))

Log10 transformation

qqnorm(log10(x$Personal.Income), main = "QQ plot of Log10(Personal.Income)")
qqline(log10(x$Personal.Income))

-1/sqrt transformation

qqnorm(-1/sqrt(x$Personal.Income), main = expression(paste("QQ plot of ", -1/sqrt(Personal.Income))))
qqline(-1/sqrt(x$Personal.Income))

-1/y transformation

qqnorm(-1/(x$Personal.Income), main = "QQ plot of -1/Personal.Income")
qqline(-1/(x$Personal.Income))
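The visual comparison of the QQ plots can be backed up with a number. The sketch below computes, for each rung of the ladder, the correlation between the data and the theoretical normal quantiles (values closer to 1 mean a straighter QQ plot). It is only an illustration: `income` is simulated right-skewed data standing in for x$Personal.Income, and the names `ladder` and `qq_cor` are hypothetical.

```r
# ASSUMPTION: simulated right-skewed values, not the real state data.
set.seed(1)
income <- rlnorm(51, meanlog = 11.5, sdlog = 1)

# Candidate re-expressions from the Ladder of Powers.
ladder <- list(
  raw          = function(y) y,
  sqrt         = function(y) sqrt(y),
  log10        = function(y) log10(y),
  neg_inv_sqrt = function(y) -1 / sqrt(y),
  neg_inv      = function(y) -1 / y
)

# Correlation between the data and the normal quantiles used by qqnorm():
# the closer to 1, the closer the QQ plot is to a straight line.
qq_cor <- sapply(ladder, function(f) {
  q <- qqnorm(f(income), plot.it = FALSE)
  cor(q$x, q$y)
})
round(qq_cor, 3)
```

With the real data, replacing `income` by x$Personal.Income should give a ranking that can be compared against the four QQ plots above.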

Checking the Linearity Assumption of Personal Income vs. GDP

The scatterplot shows an apparent concave-downward pattern. Thus we need a transformation on GDP, again chosen from the Ladder of Powers.

plot(x$GDP, log10(x$Personal.Income), xlab = 'GDP', ylab = 'Log10(Personal.Income)')

The best transformation for GDP is Log10, since it is the only transformation that yields a straight scatter.

Square Root transformation for GDP

plot(sqrt(x$GDP), log10(x$Personal.Income), xlab = expression(sqrt(GDP)), ylab = 'Log10(Personal Income)')

Log10 transformation for GDP

plot(log10(x$GDP), log10(x$Personal.Income), xlab = 'Log10(GDP)', ylab = 'Log10(Personal Income)')

-1/sqrt transformation for GDP

plot(-1/sqrt(x$GDP), log10(x$Personal.Income), xlab = expression(-1/sqrt(GDP)), ylab = 'Log10(Personal Income)')

-1/y transformation for GDP

plot(-1/(x$GDP), log10(x$Personal.Income), xlab = '-1/GDP', ylab = 'Log10(Personal Income)')
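As a rough numeric companion to the four scatterplots, we can compare the linear correlation between each re-expression of GDP and Log10(Personal Income). This is a sketch on simulated data (the vectors `gdp` and `income` and the list `x_candidates` are hypothetical), and a high correlation alone does not prove linearity, so the plots remain the main evidence.

```r
# ASSUMPTION: simulated data in which income follows a power law in GDP.
set.seed(2)
gdp    <- rlnorm(51, meanlog = 12, sdlog = 1)
income <- 10^(0.1 + 0.97 * log10(gdp) + rnorm(51, sd = 0.05))

# The same ladder of re-expressions, applied to the predictor.
x_candidates <- list(
  raw          = gdp,
  sqrt         = sqrt(gdp),
  log10        = log10(gdp),
  neg_inv_sqrt = -1 / sqrt(gdp),
  neg_inv      = -1 / gdp
)

# Pearson correlation of each candidate with log10(income).
r <- sapply(x_candidates, function(g) cor(g, log10(income)))
round(r, 3)
```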

Fitting a Linear Regression Model

Visualizing the Regression Line

m <- lm(log10(Personal.Income) ~ log10(GDP), data = x)

plot(log10(x$GDP), log10(x$Personal.Income), xlab = 'Log10(GDP)', ylab = 'Log10(Personal Income)')
abline(m, col = "red") #visualizing the Regression line

Residual vs. Fitted values. Is the equal-variance assumption satisfied in the fitted model?

Yes, it is satisfied, since the residuals are spread roughly evenly across the fitted values.

plot(m$fitted.values, m$residuals,  xlab = 'Fitted values', ylab = 'Residuals' )
abline(a = 0, b = 0, col = 'blue')
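One informal way to supplement the residual plot is to test whether the size of the residuals trends with the fitted values. The sketch below uses a Spearman rank correlation between the absolute residuals and the fitted values. It runs on simulated data (`log_gdp`, `log_income`, and `fit` are hypothetical stand-ins for the model m), and it only detects a monotone change in spread, so it is a rough check rather than a formal test of equal variance.

```r
# ASSUMPTION: simulated data with constant error variance on the log scale.
set.seed(3)
log_gdp    <- rnorm(51, mean = 5, sd = 0.5)
log_income <- 0.1 + 0.97 * log_gdp + rnorm(51, sd = 0.05)
fit <- lm(log_income ~ log_gdp)

# If the spread grows or shrinks with the fitted values, |residuals|
# will be rank-correlated with them; a small p-value hints at that.
spread_check <- cor.test(abs(residuals(fit)), fitted(fit), method = "spearman")
spread_check$estimate
spread_check$p.value
```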

However, two points stand out in the residual plot. They belong to the District of Columbia, with a residual of -0.32, and Delaware, with a residual of -0.159. They appear to be outliers.

ord <- order(m$residuals)
x[ord[1], ]

##                  State Personal.Income   GDP Population
## 9 District of Columbia           31779 69470     550521

x[ord[2], ]

##      State Personal.Income   GDP Population
## 8 Delaware           32359 49001     843524

To see if the two outliers are influential or not, we fit a new regression model without those points and compare it to our first model.

Removing outliers

x_new <- x[-c(ord[1], ord[2]),]

Are those outliers influential?

As we can see in the scatterplot, the two regression models are very similar, and therefore the outliers are not influential.

new_m <- lm(log10(Personal.Income) ~ log10(GDP), data = x_new)

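# Both axes are on the log10 scale, so label them accordingly
plot(log10(x_new$GDP), log10(x_new$Personal.Income), main = 'New Regression Plot', xlab = 'Log10(GDP)', ylab = 'Log10(Personal Income)')
abline(new_m, col = 'red'); abline(m, col = 'green') # red: without outliers; green: original model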

par(mfrow = c(1, 2))
plot(log10(x_new$GDP), log10(x_new$Personal.Income), main = 'New Regression Plot', xlab = 'Log10(GDP)', ylab = 'Log10(Personal Income)')
abline(new_m, col = 'red'); abline(m, col = 'green')
plot(new_m$fitted.values, new_m$residuals, main = 'New Residual Plot', xlab = 'Fitted values', ylab = 'Residuals')
abline(a = 0, b = 0, col = 'orange')
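Besides overlaying the two lines, influence can be judged by putting the coefficients of the two fits side by side and by looking at Cook's distance. The following is a minimal sketch on simulated data, assuming two points are deliberately pushed below the line to mimic the two outliers; `sim`, `fit_all`, and `fit_trim` are hypothetical names, not objects from the analysis above.

```r
# ASSUMPTION: simulated stand-in for the state data.
set.seed(4)
sim <- data.frame(GDP = rlnorm(51, meanlog = 12, sdlog = 1))
sim$Personal.Income <- 10^(0.1 + 0.97 * log10(sim$GDP) + rnorm(51, sd = 0.03))

# Push rows 1 and 2 well below the line to mimic two low outliers.
sim$Personal.Income[1:2] <- sim$Personal.Income[1:2] * 10^c(-0.3, -0.15)

fit_all  <- lm(log10(Personal.Income) ~ log10(GDP), data = sim)
fit_trim <- lm(log10(Personal.Income) ~ log10(GDP), data = sim[-(1:2), ])

# Similar intercepts and slopes suggest the removed points are not influential.
rbind(all = coef(fit_all), trimmed = coef(fit_trim))

# The three largest Cook's distances in the full fit.
head(sort(cooks.distance(fit_all), decreasing = TRUE), 3)
```

On the real data, the same comparison would use coef(m), coef(new_m), and cooks.distance(m).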

Since the two models are similar, the first model will be used for predictions. For example, what is the predicted personal income for a state with a GDP = 300,000?

First, we check whether a GDP = $300,000 is within our data range. It is.

range(x$GDP)

## [1]   19713 1457090

Second, we go ahead and predict the personal income of a state with a GDP = $300,000. The predicted Personal Income for that state is 288544.1 and its 95% prediction interval is (218840.9, 380448.5).

pred_data <- data.frame(GDP = 300000)
result <- predict(m, newdata = pred_data, interval = 'prediction')
10^result

##        fit      lwr      upr
## 1 288544.1 218840.9 380448.5
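The reason for the 10^result step can be written out. Since the model was fitted on the log10 scale, predict() returns the fit and the interval limits in log10 units, and raising 10 to those values brings them back to the original units. In the notation below, b_0 and b_1 denote the intercept and slope of m; these symbols are introduced here for illustration only.

```latex
\log_{10}\bigl(\widehat{\text{Income}}\bigr) = b_0 + b_1 \log_{10}(\text{GDP})
\quad\Longrightarrow\quad
\widehat{\text{Income}} = 10^{\,b_0}\,\text{GDP}^{\,b_1}

\bigl(\text{lwr},\ \text{upr}\bigr)_{\log_{10}}
\quad\longmapsto\quad
\bigl(10^{\text{lwr}},\ 10^{\text{upr}}\bigr)
```

Because 10^y is an increasing function, the interval endpoints keep their order and their 95% coverage after back-transformation, although the interval is no longer symmetric around the fitted value.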

Christian Zuna Largo

5/24/2020
