I. Correlation and Linear Models

When two numeric variables are related, we say they are correlated. If the correlation is reasonably strong, we can create a linear model. The modeling process is usually referred to as linear regression.

Like the hypothesis testing procedures, linear regression has a verification process based on the model assumptions.

Initializing RStudio

The data set we will use primarily is Data3350 which was produced in 2015 during an undergraduate research project about personality and humor. The VarsData3350 PDF file has descriptions of each variable in the Data3350 file. Both are available for download in D2L. Be sure to put the Data3350 in your R folder in Documents, and make sure your working directory is set the same way (Session menu). The code block below uses the library function to ensure that the Mosaic package is loaded and will import the data frames used in this module: Data3350 and the U.S. wages data frames Med, Perc and PercB.

library(mosaic)
library(readxl)
Data3350 = read_excel("Data3350.xlsx")
Med = read_excel("Med.xlsx")
Perc = read_excel("Perc.xlsx")
PerB = read_excel("PercB.xlsx")

II. Assumptions for Linear Regression

Recall the assumptions for \(t\)-tests and ANOVA: normality, independence and homogeneity of the variances (e.g. “same variances”). Similar ideas are present for regression, plus a 4th assumption of a linear connection between the variables.

  1. Linearity. A linear relationship between the numeric variables exists.
  2. Normality. The errors or residuals are normally distributed.
  3. Homoscedasticity. Big word that simply means “same variance,” and the idea is that the variance in the residuals will be the same regardless of position along the \(x\)-axis.
  4. Independence. The observations are independent of one another.

The creators of SAS and JMP, two powerfuls statistical software tools, have provided a nicely illustrated overview of the assumptions for linear regression.

Residuals

A residual is the vertical distance between an actual data point and it’s \(y\)-value as predicted by the model. Suppose there is a data point \((4, 10)\), but when we plug that \(x\)-value into the linear model, we get a predicted value of \[\hat y = 12\].

The residual is defined as the vertical distance between the predicted \(y\)-value, \(\hat y\), and the actual \(y\)-value from the data point, \(y\). Residuals are referred to with the letter \(e\) because one can think of them as the “errors” in the model: \[\text{Residual} = \text{Predicted Value} - \text{Observed Value}\] or \[e = \hat y - y\]

III. Verification Process for Linear Regression Modeling

We will use a two part verification process for linear regression: check linearity with an xyplot (scatter plot), and check the normality and equal variances assumptions with a qqplot.

Example: Optimism vs. Anxiety

Let’s check the correlation of two variables in the Data3350 data frame. We can use the independent variable Opt, a measure of optimism, to predict levels of the dependent variable Anx, a measure of anxiety. Either variable could be the independent variable if a bivariate correlation were the only question, but a multiple regression prediction model for Anxiety is the ultimate goal of this unit. Using multiple predictors (independent variables) to predict the dependent variable is just as easy in RStudio as analyzing single-predictor models. The output from R is actually anticipating a multiple-predictor model, so it helps to think in terms of multiple predictors as we go through our example.

1. Verifiying Linearity Assumption

The xyplot function is included with the Mosaic package. Also remember to import the data frame Data3350. (See initialization block at top of document.) The statistical model \[\text{Anx} \sim \text{Opt}\] indicates that Anxiety is the dependent variable.

#library(mosaic)
xyplot(Anx ~ Opt, data = Data3350 , type = c("p","r"),
       main = "Optimism vs. Anxiety",
       xlab = "Optimism",
       ylab = "Anxiety")

In the code block above. the type parameter specifies the type of plotting we’re asking RStudio to do, and we’re asking for both points (“p”) and the regression line (“r”). A vector is created (c is R’s concatenate function) with two components, one for points and one for the regression line. If we ask only for type “p”, we get a standard scatter plot with no line of best fit superimposed.

The pattern in the scatter plot provides evidence of a linear pattern between the variables Opt and Anx. As the Optimism score increases, generally the Anxiety scores decrease.

2. Creating a Model

For our example, let’s create a linear model for Anxiety vs. Optimism.

lm(Anx ~ Opt, data = Data3350)

Call:
lm(formula = Anx ~ Opt, data = Data3350)

Coefficients:
(Intercept)          Opt  
     59.596       -1.124  

We only see the output for the coefficients of the line of best fit: the \(y\)-intercept and the slope. Yet, lots of information from the model has been generated by RStudio and is lurking in the background. Let’s create an object (or variable) called mod and ask RStudio to store the results there. Then we can access additional model information whenever we like.

mod = lm(Anx ~ Opt, data = Data3350)

Let’s request a model summary.

summary(mod)

Call:
lm(formula = Anx ~ Opt, data = Data3350)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.8569  -5.5833  -0.3629   5.5158  20.6461 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  59.5964     2.9234  20.386  < 2e-16 ***
Opt          -1.1243     0.1452  -7.741 1.68e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.915 on 142 degrees of freedom
  (21 observations deleted due to missingness)
Multiple R-squared:  0.2968,    Adjusted R-squared:  0.2918 
F-statistic: 59.93 on 1 and 142 DF,  p-value: 1.68e-12

We will provide more details about the output later. For now, just realize that the function lm creates an object of class “linear model.” That object has many different outputs including details about the line of best fit, the residuals (as can be seen in the five number summary) and much, much more. That’s why we saved it as mod, short for “model.” We plan to call it later for diagnostic and analytic tasks.

3. Verifying Normality Assumption

Let’s consider what we’ve learned. We checked the linearity assumption by inspecting a scatter plot which showed a cloud of paired data points with a downward sloping trend (negative correlation). We next created a linear model and saved it as the RStudio object mod. Now, we wish to evaluate the normality assumption which says that the residuals should be normally distributed. In previous sections, we used histograms to investigate the shape of the data.

RStudio can plot a histogram of the residuals, too, but more typically we’ll use the qq-plot instead. Here’s the histogram.

histogram (~ resid (mod))

We can analyze the residuals histogram like any other: there is evidence the data were drawn from a bell-shaped distribution with skew to the right. The qq-plot does have advantages. Let’s create a qq-plot and then discuss what we’re seeing.

qqmath( ~ resid(mod))

The qq in qq-plot refers to quantiles, and quantile is just another word for percentile. A qq-plot compares the percentiles of one distribution to the percentiles of another. If the distributions have the same shape, the qq-plot will be a straight line. Our function qqmath is a normal qq-plot which compares the residuals in mod to the bell curve. If the residuals meet the model assumptions, we should see a straight line.

QQ-Plots

For details about what we’re
hoping not to find in a qq-plot,
see the QQ Plots module S11.

With real world data, we rarely see a perfectly straight line. What we hope to avoid is seeing something S-shaped or C-shaped where the last few points on the chart at either end jump far out of line.

Quick Example: \(\text{SE}\sim\text{Opt}\)

Using the Data3350 data frame, we can use a qq-plot to analyze the fit of the linear model for self-esteem (SE) with optimism (Opt).

mod1 = lm(SE ~ Opt, data = Data3350)
qqmath( ~ resid(mod1), type = c("p","r"))

The sharp drop-off at the bottom-left indicates that there is a more extreme value in the residuals than we would expect if the residuals were normally distributed. The top two points are also further from the 45-degree line than we would prefer.

Analyzing QQ-Plots

The University of Virginia has a nice article about understanding qq-plots which goes into more detail. Just beware they use slightly different R commands to achieve what we’re doing. (Remember, we always load the Mosaic package and use it’s simplified commands. The UVA article doesn’t use Mosaic.) I have linked to an image from the UVA article that shows an example of a qqplot showing a pattern we definitely don’t want to see. You should definitely click the link to understand the problematic shape we’re hoping to avoid.

Back to our original qq-plot: everything seems fine. While the tails of the qq-plot wiggle a bit, none of the plotted comparisons jump off the line very far. Our data appear to meet the normality assumptions.

4. Analysis Statements for Linear Regression

There are three analysis statements for regression models, two that describe the strength of the relationship, and one that describes how the relationship works. Let’s look at the model summary again:

summary(mod)

Call:
lm(formula = Anx ~ Opt, data = Data3350)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.8569  -5.5833  -0.3629   5.5158  20.6461 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  59.5964     2.9234  20.386  < 2e-16 ***
Opt          -1.1243     0.1452  -7.741 1.68e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.915 on 142 degrees of freedom
  (21 observations deleted due to missingness)
Multiple R-squared:  0.2968,    Adjusted R-squared:  0.2918 
F-statistic: 59.93 on 1 and 142 DF,  p-value: 1.68e-12

In the second row from the bottom of the output, consider “Multiple R-squared: 0.2968.” We only have one predictor in our model, so the correlation coefficient \(r\) is the square root of the R-squared value: \[r = \sqrt{0.2968}\approx .5448\]

4.a) Analyzing the Correlation Coefficient \(r\)

The correlation coefficient \(r\) indicates the strength and direction of the bivariate association. The chart below relates to quantitative research involving human subjects in social sciences such as psychology, economics, and educational psychology. Regression methods are widely used across many disparate branches of research. No single chart like the one below could ever be general enough for every situation. A chemistry professor might have students check the correlation of measured outputs from a Boyle’s Law lab and decide that any \(r<0.98\) is too weak, that there must have been a measurement error.

\[\begin{array}{ll} \textbf{Correlation}&&\textbf{Strength}\\ \hline \hspace{1cm}|r|<.25&&\text{Little or None}\\ .25\leq|r|<.50&&\text{Weak}\\ .50\leq|r|<.75&&\text{Moderate}\\ .75\leq|r|&&\text{Strong} \end{array}\]

The analysis statement for the correlation must identify the strength and direction of the relationship. For this example, we find a moderate, negative correlation between Optimism and Anxiety.

4.b) Analyzing the Coefficient of Determination \(R^2\)

The output “Multiple R-squared: 0.2968” indicates the coefficient of determination. Because \(R^2=0.2968\), we know that 29.7% of the changes in Anxiety scores (the dependent variable) is accounted for by how the Optimism scores (the independent variable) are changing.

Why do we write \(R^2\) when it’s the square of the correlation coefficient which is always written in lower case? It’s important to understand the multivariate context. The coefficient of determination \(R^2\) is primarily used in multiple regression models to indicate the total variance accounted for by the model, not just one predictor. Correlations, on the other hand, only apply to bivariate scenarios. This concept will once more make better sense in the multiple predictor setting of the follow-on analysis.

For this example, we find that 30% of the variance in Anxiety is accounted for by Optimism. Note that \(R^2\) refers to the total variance accounted for by the model, so we will see this statistic again in the next section when we use multiple predictors.

Example 2: U.S. Median Wage Growth

Consider the scatter plot below showing median U.S. wages since 1980. All wage values are represented in current USD (e.g. corrected for inflation by the BLS). All data for this module were downloaded in September, 2002, from the Bureau of Labor Statistics using either Table 2 or Table 5. The data charts produced by the BLS are not data frames, so the wage data was coerced into single-column format with column headers and other formatting added so that it would work well in R. Period 0 is the 1st Quarter of 1980, the first quarter of Ronald Reagan’s first term.

U.S. Median Wage Growth

We can create scatter plots in Microsoft Excel by selecting the two columns of data and clicking on “Charts: Scatter” from the Insert menu. By right-clicking on a data point, we can “Add Trendline” and select options to include equation and \(R^2\) on the chart.

The equation displayed is the line of best fit \[ y = 4.09x + 260.57\] which can be used to estimate the median wage for any time period since 1981 Q1. Historically, the median weekly wage has increased by \(\$4.09\) each quarter (or $16.36 each year). The coefficient of determination \(R^2=.993\) shows that \(99.3\%\) of the variance in median wages is accounted for by the time period.

The pattern is clearly linear with an odd uptick in the final 1-2 periods which represent the first two quarters of 2020. What’s going on? Why would median income surge massively during the COVID lockdown?

IV. Outliers Affect Correlation and Predictions Based on Regression Models.

The two outliers at the far right correspond to the first two quarters of 2020 when the pandemic started. Think about which U.S. workers lost their jobs due to the COVID lockdowns. Most of them were members of the working poor who make much less than the median income: folks like waitresses, non-essential retail workers, and movie theater attendants. The total number of workers decreased.

Think about taking a data set of sample size 20. The median wage would be the average of the 10th and 11th individuals when the sample is ordered from least to greatest wage. Suppose the five individuals with the lowest wages were removed, the entire bottom quarter of the data set. The median of the new 15-person data set would be the 8th person when ordered from least to greatest, but in reality this is the the 13th person from the original data set. The median shifts upward even though 25% of this sample lost their jobs.

To display the chart below, we must first import the file Med.xlsx into RStudio (File menu: “Import Dataset”). The number of workers decreases sharply at the onset of recessions, and the job loss was particularly nasty in second quarter of 2020. Note the large discontinuity at the beginning of the recession of 2008-2009. The work force contracted by almost 5 million wage earners in the 1st quarter of 2009. The last data point on the right is the second quarter of 2020 when more than 11 million U.S. wage earners lost their jobs due to the COVID shutdown. In the previous quarter, more than 2 million lost their jobs.

xyplot(Earners ~ Period, data = Med, 
       xlab = "Quarters since 1981 Q1", 
       ylab = "Number of Wage Earners (in thousands)")

Apparently, for the COVID lockdown, the vast majority of those 13 million wage earners were at the lower end of the economic spectrum. Had they been evenly distributed throughout the economic spectrum, the median would not have jumped upward. These data show economic devastation was immense especially for those earning below median wages. To correct for these obvious outliers, we will compare only the pre-COVID quarters. If Med.xlsx has been imported into RStudio, the following code will subset it to only include quarters through the end of 2019.

newMed = subset( Med , Period <= 155 )

Let’s recreate the model in RStudio without the COVID outliers. We’re not ignoring them, we’re instead understanding that the outliers create a very rosy picture when something horrible is happening. In R, we use the xyplot function to create scatter plots. The “type” command asks R to show both the points (“p”) and regression line (“r”).

# library(mosaic)
xyplot( Median ~ Period, data = newMed , 
        type = c("p","r") ,
        xlab = "Quarters since 1981 Q1" , 
        ylab = "U.S. Median Wage")

The R commands \(\fbox{lm}\) and \(\fbox{cor}\) generate a linear model and the correlation respectively.

lm( Median ~ Period, data = newMed)

Call:
lm(formula = Median ~ Period, data = newMed)

Coefficients:
(Intercept)       Period  
    259.155        4.041  
cor(Median ~ Period, data = Med)
[1] 0.9963508
cor(Median ~ Period, data = newMed)
[1] 0.9975952

The correlation increased by 0.0011 and the rate of wage growth changed from \(\$4.09\) to \(\$4.04\). A nickel may not seem like much. However, consider the relative error of the slope of the regression line equates to more than \(1\%\) per quarter propagated over 150-plus quarters.

Example 2: Analysis Statements

Let’s return to the modified data set, build and save the model, and conduct our analysis without the outliers present.

mod = lm( Median ~ Period, data = newMed)
summary(mod)

Call:
lm(formula = Median ~ Period, data = newMed)

Residuals:
    Min      1Q  Median      3Q     Max 
-30.915  -9.027   1.179   8.671  33.461 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 259.1551     2.0411     127   <2e-16 ***
Period        4.0412     0.0227     178   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.64 on 153 degrees of freedom
Multiple R-squared:  0.9952,    Adjusted R-squared:  0.9952 
F-statistic: 3.17e+04 on 1 and 153 DF,  p-value: < 2.2e-16

The line of best fit is \[\text{Med} = 4.05\ *\ \text{Period} + 259.16 \] We find that \(R^2 = .995\) which means that \(r = \sqrt{R^2}=0.997\) (Beginning analysts should always use “Multiple R-squared” value, not adjusted.) Thus, the correlation between median wage and quarter since 1980 is very strong. \(R^2\) is the coefficient of determination and indicates that the Period accounts for \(99.5\%\) of the variance in median wages. Lastly, we can see that the slope of the line of best is \(\$4.04\) which means that U.S. wage growth over the past 40 years has averaged about \(4.04\) per week per quarter (in current U.S. dollars), or about \(\$16.32\) per week per year.

Example 3: Median Wages during Trump Administration

Trump campaigned in 2016 on tax cuts, tariffs on foreign imports, exiting global trade treaties (TPP) and renegotiating NAFTA. Trump claimed his approach would help the wage earners in the country by reversing the globalization that shipped middle class jobs overseas. To suggest that experts disagreed would be a marked understatement. The U.S. political climate for three decades has been increasingly globalist. Trump’s economic nationalism was pilloried by pundits, Republican party stalwarts and his political opponents alike.

Let’s investigate what transpired as detailed breifly in the YouTube video lecture, but this time work with the much more advanced tools in RStudio.

Using subset function in R to create data frame for Trump years only

We first need to subset our Med data set for just the years 2017 through 2109, or 12 quarters. These will be Periods 144 to 155.

TrumpMed = subset( Med , Period >=  144 & Period <= 155)

Scatter Plot: Trump Era Median Wage

xyplot(Median ~ Period, data = TrumpMed , 
        type = c("p","r") ,
        xlab = "All Quarters 2017 to 2019" ,
        ylab = "Trump Era Median Wage") 

Regression model for Trump era median wage

We use the lm function in R to create the linear model, and we store the results in the variable modT (Trump model) so we can access it later. The summary statistics match the YouTube lecture video which were created using the TI graphing calculator.

modT = lm(Median ~ Period, data = TrumpMed)
summary(modT)

Call:
lm(formula = Median ~ Period, data = TrumpMed)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.139  -3.509   1.417   3.787  12.417 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -44.288     90.176  -0.491    0.634    
Period         6.185      0.603  10.257 1.26e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.211 on 10 degrees of freedom
Multiple R-squared:  0.9132,    Adjusted R-squared:  0.9045 
F-statistic: 105.2 on 1 and 10 DF,  p-value: 1.259e-06

QQ-Plot for Trump era median wage model

There is certainly some deviation in the qq-plot, but the overall pattern is roughly along the 45-degree line.

qqmath( ~ resid(modT), type = c("p","r"))

Example 3b: Median Wages during Obama Administration

We can subset the data frame for the whole Obama era or just the last three years.

ObamaMedTotal = subset( Med , Period >= 112 & Period <= 143 )
ObamaMedL3 = subset( Med , Period >= 132 & Period <= 143 )

Scatter plots for wage growth during Obama era

xyplot(Median ~ Period, data = ObamaMedTotal , 
        type = c("p","r") ,
        xlab = "All Quarters 2009 to 2016" ,
        ylab = "Obama Era Median Wage") 

A marked increase can be seen in the final few quarters above. Let’s display the last 3 years only.

xyplot(Median ~ Period, data = ObamaMedL3 , 
        type = c("p","r") ,
        xlab = "All Quarters 2014 to 2016" ,
        ylab = "Obama Era Median Wage") 

Model summaries for Obama era

modB = lm(Median ~ Period, data = ObamaMedTotal)
modBL3 = lm(Median ~ Period, data = ObamaMedL3)
summary(modB)

Call:
lm(formula = Median ~ Period, data = ObamaMedTotal)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.193  -5.309   1.039   5.343  15.086 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 384.6521    18.6409   20.64   <2e-16 ***
Period        3.0515     0.1458   20.93   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.616 on 30 degrees of freedom
Multiple R-squared:  0.9359,    Adjusted R-squared:  0.9337 
F-statistic: 437.9 on 1 and 30 DF,  p-value: < 2.2e-16
summary(modBL3)

Call:
lm(formula = Median ~ Period, data = ObamaMedL3)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.6608  -5.4554  -0.7168   5.4222   9.6503 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 212.9615    86.2571   2.469   0.0332 *  
Period        4.3112     0.6271   6.875 4.33e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.499 on 10 degrees of freedom
Multiple R-squared:  0.8254,    Adjusted R-squared:  0.8079 
F-statistic: 47.26 on 1 and 10 DF,  p-value: 4.327e-05

Note that the median wage growth for the entire 8 years of the Obama administration is lower than the historic trend of \(\$4.05\), but in the final three years was \(\$5.08\) as discussed in the YouTube lecture video.

V. Testing Sigificance of Slope: Randomized Confidence Intervals

The hypothesis we wish to test is if the slope of the model from the Trump era is significantly different (higher) than the historic trend.

\[H_0 : \beta = 4.04\\H_a : \beta > 4.04\]

One reason we save the models is that it makes grabbing them later quite easy. Let’s generate the 95% confidence intervals for the slope of the regression line in each model.

Confidence interval: Trump era

confint.lm(modT)
                  2.5 %    97.5 %
(Intercept) -245.213481 156.63772
Period         4.841689   7.52894

Because the 95% confidence interval \((4.84,7.52)\) does not include \(\$4.04\), we reject the null and have evidence for a significantly higher Trump era median wage growth at the \(\alpha = 0.025\) level. With confidence intervals, we don’t have the luxury of a one-tailed hypothesis evaluation. When testing a two-tailed hypothesis with a confidence interval, then \[\alpha = 1 - \text{confidence level}\] For one-tailed tests, we have to match the area in the tails, so

\[\alpha = \frac{1 - \text{confidence level}}{2}\]

To produce the exact equivalent of testing at the \(\alpha = 0.05\) level, we would use the following 90% confidence interval estimate.

confint.lm(modT, level = 0.90)
                    5 %       95 %
(Intercept) -207.729148 119.153390
Period         5.092353   7.278276

Now we have exactly 5% of the area under the curve to the left which matches the rejection region for a one-tailed \(t\)-test.

Confidence intervals: Obama era

We can test the hypothesis that the Obama era median wage growth was different that historic value at the 0.05 level. \[H_0 : \beta = 4.04\\H_a : \beta \neq 4.04\]

confint.lm(modB)
                 2.5 %    97.5 %
(Intercept) 346.582336 422.72192
Period        2.753696   3.34931

The historic trend for median wage growth of $4.04 is not included in the interval which suggests (based on the endpoints of the confidence interval above) that we reject the null hypothesis that the Obama era median wage growth is $4.05. When all 32 quarters of the Obama era are included, evidence suggests that the overall growth of the U.S. median wage was significantly lower than the historic trend (at the 0.025 level of significance).

Confidence interval: Last 3 years of Obama era

Note that a model restricted to the final three years of the Obama era dramatically shifts the analysis in favor of Obama. While the effect is certainly a cherry-picking of the best Obama years, we were analyzing the argument that the Trump economy rode on the coattails of the Obama economy in 2016. The argument is plausible on its face, so let’s take a look at the data.

We’re testing the null hypothesis that the last three years of the Obama era growth in median wages is similar to the Trump era growth of $6.90.

\[H_0 : \beta = 6.19\\H_a : \beta \neq 6.19\]

We need the 95% confidence interval for the last three years of Obama.

confint.lm(modBL3, level = 0.95)
                2.5 %     97.5 %
(Intercept) 20.768690 405.154386
Period       2.913863   5.708515

The upper endpoint of the confidence is close to but not greater than $6.19. Thus, we reject the null. We have evidence the median wage growth was significantly different in the pre-COVID Trump era than during the last three years of Obama. Remember that removing the median wage outliers from the COVID recession actually makes Trump’s growth rate lower.

QQ-Plots for Obama era median wage growth models

As with the Trump era, both Obama era qq-plots line up approximately with the 45-degree line, but they do have some odd variations.

qqmath( ~ resid(modB), type = c("p","r"))

qqmath( ~ resid(modBL3), type = c("p","r"))

VI. Robustness of estimates of regression statistics

Due to the Gauss-Markov theorem, we know that the Ordinary Least Squares model is the best linear unbiased estimator (BlUE) for the model provided the relationship between the variables is actually linear. Better, we know that the regression statistics are generally robust even if the normality assumption is violated, so long as evidence of linearity is present.

We do have a problem, however, because the linear regression \(t\)-tests and \(t\)-intervals are not robust. Their intervals and \(p\)-values can be skewed by a lack of normality. When the qq-plot shows the residuals may not be normally distributed, we should use randomization techniques to generate p-values and confidence intervals instead traditional statistical procedures.

Mosaic’s Randomization Options

  1. Shuffle. Permutes the values in the sample data.
  2. Sample. Draws a sub-sample from the sample data
    without replacement.
  3. Resample. Draws a sub-sample from the sample data
    with replacement.

For examples of permutation tests and bootstrapping, see
the Randomization Tutorial.

VII. Randomization approach for Trump vs. Obama

We have two randomization options:

  1. Permutation Method
  2. Bootstrapping Method

The permutation method simply shuffles the \(y\)-values, pairing them with random \(x\)-values. The random association of \(y\)-values will generate, on average, a correlation and slope of zero. This allows us to test the null hypothesis \[H_0 : \rho = 0\] but not null hypotheses like \[H_0 : \rho = 5.08\]

As usual, the Rossman-Chance applet is both a great visualization and easily implements the randomization.

Rossman Chance Regression Shuffle App

Randomized Confidence Intervals

See the Statistical Estimation (module 6) for details
on using bootstrapping to create randomized confidence
intervals.

However, we knew that the correlation of median wage during the Trump era was significantly different than zero. Given the historic trend of $4.05, the correlation would be remarkable only if it were zero.

Bootstrapping

The bootstrapping method takes samples with replacement from the sample data to generate an estimate of the sampling distribution. Let’s create one of the resamples.

coef(lm(Median ~ Period, data=resample(TrumpMed, size = 10)))
(Intercept)      Period 
-112.674672    6.641921 

The resample function has produced 10 data points drawn with replacement from the TrumpMed data frame. Note that the same data point may be drawn more than once. The linear model created by using those resampled data points has a slope of 6.64.

For bootstrapping, we are going to allow the resample function to choose its own size parameter. If we resample a data set with 155 observations, the resample functions will make 155 draws with replacment each time it runs.

bootstrap = do(500) * coef(lm(Median ~ Period, data=resample(TrumpMed)))
densityplot(~Period, data=bootstrap)

There are a couple of methods to generate a confidence interval based on the bootstrap distribution. The best method - most often appropriate in most common situations - requires only that the bootstrap distribution be symmetric. The density plot above suggests the symmetry assumption is reasonable.

To create the confidence interval, we just find the relevant percentiles from within the bootstrap distribution. In our bootstrap distribution we have two values, “Intercept” and “Period.” The “Period” column contains the slope estimates.

qdata(~Period, p=c(0.025, 0.975), data=bootstrap)
    2.5%    97.5% 
5.026753 7.544513 

Thus, the 95% confidence interval using randomization is \[(5.03, 7.54)\] which is a bit different than the \(t\)-interval of \[(4.84, 7.53)\] calculated above.

You can rerun the bootstrapping resampling several times to see how the confidence interval changes. It remains somewhat close to the \(t\)-interval values, but does vary a bit. In any case, we still appear to reject the null hypothesis that the Trump era was no different than the Obama era, at least with regard to median wage growth.

Caveats and Takeaways

In political discussions, folks often choose teams to support and get frustrated when their team seems diminished. A word or two of caution.

  1. Agenda-driven geeks can lie with statistics. Politics is rife with deception. Media reports are increasingly agenda-driven. Question the motivation and factual integrity of every voice. Rely on good science.
  2. Multiple ways of assessing the economy are viable. If you felt the economic discussion was slanted right, remember there are lots of measures of economic health. Perhaps other measures tilt left. Wages are not the only way to earn money (investing, commisions, real estate, etc.). Median wages also don’t measure the lowest end of the economic spectrum. What about the 10th percentile or 25th percentile? Studying an issue like this is multifaceted and may require lots of research. Keep an open mind and dig through the most scientific results you can find.

Key Takeaways

  1. Savvy geeks detect others’ attempted lies with statistics. One major benefit of this class should be learning to apply your analytic skills to real world issues where it’s incredibly important not to naively fall for deceptions like cherry-picking cases that support one’s position. The counter-intuitive COVID outliers in the median wage data are a perfect example that would have sharply increased the median wage growth during the Trump era. People disagree about politics, but I hope we can agree that learning the art and science of applied statistics makes us more knowledgeable citizens and better potential employees and entrepreneurs.
  2. The median wage discussion contained great data sets. The major reason the economic data was chosen was for pedagogy. Several of the qq-plots looked quite wobbly which motivated a discussion about violations of the normality assumption. The media report statistics that were generated by analyses exactly like the ones shown above. We don’t get to check for crazy COVID outliers. We don’t get to scan the qq-plots to assess model fit. We don’t get to verify any of the assumptions or, in fact, check for bias on the part of the scientist who did the analysis. The median wage data sets allowed for an important discussion, and hopefully an interesting one as well.
  3. Regression model statistics are robust with respect violations of the normality assumption for residuals, but linear regression \(t\)-tests and \(t\)-intervals are not. We always use scatter plots to verify that a linear relationship is a reasonable. With qq-plots, we are trying to decide how reliable the resulting model will be. If the qq-plot detects issues, use a randomization method for tests of significant correlation or confidence interval estimates.
  4. Bootstrapping. We presented two randomization methods to tests of significant correlation. The permutation method is visually demonstrated by the Rossman-Chance applet. The bootstrap method allowed us to construct a confidence interval estimate of the slope parameter, yet the only the assumption was that bootstrap distribution was symmetric. When the qq-plot indicates potential violations of the normality assumption, use bootstrapping instead of \(t\)-procedures.

VIII. Exercises

  1. Use the Data3350 data frame to build and evaluate a linear model for narcissism (dependent variabele) vs. thrill-seeking (independent variable) using the Thrill and Narc variables. Be sure to check the linearity and normality assumptions and analyze all regression statistics. Construct a confidence interval for the slope of the regression line using an appropriate method.

  2. Use the Data3350 data frame to build and evaluate a linear model for neuroticism (dependent variabele) vs. optimism (independent variable) using the Neuro and Opt variables. Be sure to check the linearity and normality assumptions and analyze all regression statistics. Construct a confidence interval for the slope of the regression line using an appropriate method.

  3. Using the built-in Mosaic data set Dimes, test the significance of the correlation between mass and year. Would it be appropriate to build a linear model in this case? Why or why not?

  4. Use the Perc data frame to create linear models for the 25th percentile wage earners from 2000 Q1 through 2019 Q4. Be sure to look carefully at all diagnostic plots. Create models for the Trump era and the last 3 years of the Obmama era and conduct hypothesis tests that the Trump era growth was significantly greater than the historic trend as well as greater than the Obama era.

  5. Use the Perc data frame to create linear models for the 75th and 90th percentile wage earners from 2000 Q1 through 2019 Q4. Be sure to look carefully at all diagnostic plots. Create models for the Trump era and the last 3 years of the Obmama era and conduct hypothesis tests that the Trump era growth was significantly greater than the historic trend as well as greater than the Obama era.

  6. The PercB data is similar to Perc in that it includes the 10th, 25th, 50th, 75th and 90th percentile wages, but PerB includes Black wage earners only. Use the PercB data frame to create a linear model for the median wage growth (50th percentile) from 2000 Q1 through 2019 Q4. Be sure to look carefully at all diagnostic plots. Create models for the Trump era and the last 3 years of the Obmama era and conduct hypothesis tests that the Trump era growth was significantly greater than the historic trend as well as greater than the Obama era.

---
title: "Basics of Correlation and Regression"
subtitle: UNG MATH 3350 (online)
author: Robb Sinn
date: September 2020
output: html_notebook
---

# <span style="color: blue;">I. Correlation and Linear Models</span>

When two numeric variables are related, we say they are correlated. If the correlation is reasonably strong, we can create a linear model. The modeling process is usually referred to as linear regression.

Like the hypothesis testing procedures, linear regression has a verification process based on the model assumptions.

<div style="float:right; margin: 8px; border:2px black solid; padding: 0px 10px 5px">
### <span style="color: red;">Initializing RStudio</span>
The data set we will use primarily is **Data3350** which was produced in 2015 during an undergraduate research project about personality and humor. The **VarsData3350** PDF file has descriptions of each variable in the Data3350 file. Both are available for download in D2L. Be sure to put the Data3350 in your R folder in Documents, and make sure your working directory is set the same way (Session menu). The code block below uses the **library** function to ensure that the **Mosaic** package is loaded and will import the data frames used in this module: **Data3350** and the U.S. wages data frames **Med**, **Perc** and **PercB**.

```{r}
library(mosaic)
library(readxl)
Data3350 = read_excel("Data3350.xlsx")
Med = read_excel("Med.xlsx")
Perc = read_excel("Perc.xlsx")
PerB = read_excel("PercB.xlsx")
```
</div>

# <span style="color: blue;">II. Assumptions for Linear Regression</span>

Recall the assumptions for $t$-tests and ANOVA: normality, independence and homogeneity of the variances (e.g. "same variances"). Similar ideas are present for regression, plus a 4th assumption of a linear connection between the variables. 

1. **Linearity.** A *linear* relationship between the numeric variables exists.
2. **Normality.** The errors or *residuals* are normally distributed.
3. **Homoscedasticity.** Big word that simply means "same variance," and the idea is that the variance in the residuals will be the same regardless of position along the $x$-axis.
4. **Independence.** The observations are independent of one another.

The creators of SAS and JMP, two powerfuls statistical software tools, have provided a <a href = https://www.jmp.com/en_us/statistics-knowledge-portal/what-is-regression/simple-linear-regression-assumptions.html>nicely illustrated overview of the assumptions for linear regression</a>. 

## Residuals

A residual is the vertical distance between an actual data point and it's $y$-value as predicted by the model. Suppose there is a data point $(4, 10)$, but when we plug that $x$-value into the linear model, we get a predicted value of $$\hat y = 12$$.

The residual is defined as the vertical distance between the predicted $y$-value, $\hat y$, and the actual $y$-value from the data point, $y$. Residuals are referred to with the letter $e$ because one can think of them as the "errors" in the model:
$$\text{Residual} = \text{Predicted Value} - \text{Observed Value}$$
or
$$e = \hat y - y$$

# <span style="color: blue;">III. Verification Process for Linear Regression Modeling</span>

We will use a two part verification process for linear regression: check linearity with an **xyplot** (scatter plot), and check the normality and equal variances assumptions with a **qqplot**.

## Example: Optimism vs. Anxiety

Let's check the correlation of two variables in the Data3350 data frame. We can use the independent variable **Opt**, a measure of optimism, to predict levels of the dependent variable **Anx**, a measure of anxiety. Either variable could be the independent variable if a bivariate correlation were the only question, but a multiple regression prediction model for **Anxiety** is the ultimate goal of this unit. Using multiple predictors (independent variables) to predict the dependent variable is just as easy in RStudio as analyzing single-predictor models. The output from R is actually anticipating a multiple-predictor model, so it helps to think in terms of multiple predictors as we go through our example.

### 1. Verifiying Linearity Assumption

The **xyplot** function is included with the Mosaic package. Also remember to import the data frame Data3350. (See initialization block at top of document.) The statistical model $$\text{Anx} \sim \text{Opt}$$ indicates that Anxiety is the dependent variable.

```{r}
xyplot(Anx ~ Opt, data = Data3350 , type = c("p","r"),
       main = "Optimism vs. Anxiety",
       xlab = "Optimism",
       ylab = "Anxiety")
```

In the code block above. the **type** parameter specifies the type of plotting we're asking RStudio to do, and we're asking for both points ("p") and the regression line ("r"). A vector is created (**c** is R's **concatenate** function) with two components, one for points and one for the regression line. If we ask only for type "p", we get a standard scatter plot with no line of best fit superimposed.

The pattern in the scatter plot provides evidence of a linear pattern between the variables **Opt** and **Anx**. As the **Optimism** score increases, generally the **Anxiety** scores decrease.

### 2. Creating a Model

For our example, let's create a linear model for **Anxiety vs. Optimism**.

```{r}
lm(Anx ~ Opt, data = Data3350)
```

We only see the output for the coefficients of the line of best fit: the $y$-intercept and the slope. Yet, lots of information from the model has been generated by RStudio and is lurking in the background. Let's create an object (or variable) called **mod** and ask RStudio to store the results there. Then we can access additional model information whenever we like. 

```{r}
mod = lm(Anx ~ Opt, data = Data3350)
```

Let's request a model summary.

```{r}
summary(mod)
```

We will provide more details about the output later. For now, just realize that the function **lm** creates an object of class "linear model." That object has many different outputs including details about the line of best fit, the residuals (as can be seen in the five number summary) and much, much more. That's why we saved it as **mod**, short for "model." We plan to call it later for diagnostic and analytic tasks.

### 3. Verifying Normality Assumption

Let's consider what we've learned. We checked the linearity assumption by inspecting a scatter plot which showed a cloud of paired data points with a downward sloping trend (negative correlation). We next created a linear model and saved it as the RStudio object **mod**. Now, we wish to evaluate the normality assumption which says that the residuals should be normally distributed. In previous sections, we used histograms to investigate the shape of the data.

RStudio can plot a histogram of the residuals, too, but more typically we'll use the *qq-plot* instead. Here's the histogram.

```{r}
histogram (~ resid (mod))
```

We can analyze the residuals histogram like any other: there is evidence the data were drawn from a bell-shaped distribution with skew to the right. The **qq-plot** does have advantages. Let's create a qq-plot and then discuss what we're seeing.

```{r}
qqmath( ~ resid(mod))
```

The *qq* in qq-plot refers to quantiles, and quantile is just another word for percentile. A qq-plot compares the percentiles of one distribution to the percentiles of another. If the distributions have the same shape, the qq-plot will be a straight line. Our function **qqmath** is a **normal** qq-plot which compares the residuals in **mod** to the bell curve. <span style="color: red;">If the residuals meet the model assumptions, we should see a straight line.</span>

<div style="float:right; margin: 8px; border:2px black solid; padding: 10px 15px 10px">
#### QQ-Plots
For details about what we're</br>
hoping *not* to find in a qq-plot,</br> see the <a href = https://rpubs.com/robbsinn/s11>QQ Plots module S11</a>.
</div>

With real world data, we rarely see a perfectly straight line. What we hope to avoid is seeing something S-shaped or C-shaped where the last few points on the chart at either end jump far out of line. 

## <span style="color: red;">Quick Example: $\text{SE}\sim\text{Opt}$</span>

Using the Data3350 data frame, we can use a qq-plot to analyze the fit of the linear model for self-esteem (SE) with optimism (Opt).

```{r}
mod1 = lm(SE ~ Opt, data = Data3350)
qqmath( ~ resid(mod1), type = c("p","r"))
```

The sharp drop-off at the bottom-left indicates that there is a more extreme value in the residuals than we would expect if the residuals were normally
distributed. The top two points are also further from the 45-degree line than we would prefer.

<div style="float:right; margin: 8px; border:2px black solid; padding: 10px 15px 10px">
### Analyzing QQ-Plots
The University of Virginia has a nice article about <a href = https://data.library.virginia.edu/understanding-q-q-plots/>understanding qq-plots</a> which goes into more detail. Just beware they use slightly different R commands to achieve what we're doing. (Remember, we always load the Mosaic package and use it's simplified commands. The UVA article doesn't use Mosaic.) I have linked to an image from the UVA article that shows an <a href = https://data.library.virginia.edu/files/heavy_tails.jpeg>example of a qqplot showing a pattern we definitely don't want to see</a>. You should definitely click the link to understand the problematic shape we're hoping to avoid.
</div>

Back to our original qq-plot: everything seems fine. While the tails of the qq-plot wiggle a bit, none of the plotted comparisons jump off the line very far. Our data appear to meet the normality assumptions.

### 4. Analysis Statements for Linear Regression

There are three analysis statements for regression models, two that describe the strength of the relationship, and one that describes how the relationship works. Let's look at the model summary again:

```{r}
summary(mod)
```

In the second row from the bottom of the output, consider "Multiple R-squared: 0.2968." We only have one predictor in our model, so the correlation coefficient $r$ is the square root of the R-squared value:
$$r = \sqrt{0.2968}\approx .5448$$

#### 4.a) Analyzing the Correlation Coefficient $r$
The correlation coefficient $r$ indicates the strength and direction of the bivariate association. The chart below relates to quantitative research involving human subjects in social sciences such as psychology, economics, and educational psychology. Regression methods are widely used across many disparate branches of research. No single chart like the one below could ever be general enough for every situation. A chemistry professor might have students check the correlation of measured outputs from a Boyle's Law lab and decide that any $r<0.98$ is too weak, that there must have been a measurement error.

$$\begin{array}{ll}
\textbf{Correlation}&&\textbf{Strength}\\ \hline
\hspace{1cm}|r|<.25&&\text{Little or None}\\
.25\leq|r|<.50&&\text{Weak}\\
.50\leq|r|<.75&&\text{Moderate}\\
.75\leq|r|&&\text{Strong}
\end{array}$$

The analysis statement for the correlation must identify the strength and direction of the relationship. For this example, we find a moderate, negative correlation between Optimism and Anxiety.

#### 4.b) Analyzing the Coefficient of Determination $R^2$
The output "Multiple R-squared: 0.2968" indicates the coefficient of determination. Because $R^2=0.2968$, we know that 29.7% of the changes in Anxiety scores (the dependent variable) is accounted for by how the Optimism scores (the independent variable) are changing. 

Why do we write $R^2$ when it's the square of the correlation coefficient which is always written in lower case? It's important to understand the multivariate context. The coefficient of determination $R^2$ is primarily used in multiple regression models to indicate the total variance accounted for by the model, not just one predictor. Correlations, on the other hand, only apply to bivariate scenarios. This concept will once more make better sense in the multiple predictor setting of the follow-on analysis.

For this example, we find that 30% of the variance in Anxiety is accounted for by Optimism. Note that $R^2$ refers to the total variance accounted for by the model, so we will see this statistic again in the next section when we use multiple predictors.

## Example 2: U.S. Median Wage Growth

Consider the scatter plot below showing median U.S. wages since 1980. All wage values are represented in current USD (e.g. corrected for inflation by the BLS). All data for this module were downloaded in September, 2002, from the <a href = https://www.bls.gov/>Bureau of Labor Statistics</a> using either <a href = https://www.bls.gov/webapps/legacy/cpswktab2.htm>Table 2</a> or <a href = https://www.bls.gov/webapps/legacy/cpswktab5.htm>Table 5</a>. The data charts produced by the BLS are not data frames, so the wage data was coerced into single-column format with column headers and other formatting added so that it would work well in R. Period 0 is the 1st Quarter of 1980, the first quarter of Ronald Reagan's first term.

![U.S. Median Wage Growth](images\MedWages.PNG)

We can create scatter plots in Microsoft Excel by selecting the two columns of data and clicking on "Charts: Scatter" from the Insert menu. By right-clicking on a data point, we can "Add Trendline" and select options to include equation and $R^2$ on the chart.

The equation displayed is the line of best fit $$ y = 4.09x + 260.57$$ which can be used to estimate the median wage for any time period since 1981 Q1. Historically, the median weekly wage has increased by $\$4.09$ each quarter (or \$16.36 each year). The coefficient of determination $R^2=.993$ shows that $99.3\%$ of the variance in median wages is accounted for by the time period.

The pattern is clearly linear with an odd uptick in the final 1-2 periods which represent the first two quarters of 2020. What's going on? Why would median income surge massively during the COVID lockdown?

# <span style="color: blue;">IV. Outliers Affect Correlation and Predictions Based on Regression Models.</span>

The two outliers at the far right correspond to the first two quarters of 2020 when the pandemic started. Think about which U.S. workers lost their jobs due to the COVID lockdowns. Most of them were members of the working poor who make much less than the median income: folks like waitresses, non-essential retail workers, and movie theater attendants. The total number of workers decreased. 

Think about taking a data set of sample size 20. The median wage would be the average of the 10th and 11th individuals when the sample is ordered from least to greatest wage. Suppose the five individuals with the lowest wages were removed, the entire bottom quarter of the data set. The median of the new 15-person data set would be the 8th person when ordered from least to greatest, but in reality this is the the 13th person from the original data set. The median shifts upward even though 25% of this sample lost their jobs.

To display the chart below, we must first import the file *Med.xlsx* into RStudio (File menu: "Import Dataset"). The number of workers decreases sharply at the onset of recessions, and the job loss was particularly nasty in second quarter of 2020. Note the large discontinuity at the beginning of the recession of 2008-2009. The work force contracted by almost 5 million wage earners in the 1st quarter of 2009. The last data point on the right is the second quarter of 2020 when more than 11 million U.S. wage earners lost their jobs due to the COVID shutdown. In the previous quarter, more than 2 million lost their jobs.

```{r}
xyplot(Earners ~ Period, data = Med, 
       xlab = "Quarters since 1981 Q1", 
       ylab = "Number of Wage Earners (in thousands)")
```

Apparently, for the COVID lockdown, the vast majority of those 13 million wage earners were at the lower end of the economic spectrum. Had they been evenly distributed throughout the economic spectrum, the median would not have jumped upward. These data show economic devastation was immense especially for those earning below median wages. To correct for these obvious outliers, we will compare only the pre-COVID quarters. If **Med.xlsx** has been imported into RStudio, the following code will subset it to only include quarters through the end of 2019. 

```{r}
newMed = subset( Med , Period <= 155 )
```

Let's recreate the model in RStudio without the COVID outliers. We're not ignoring them, we're instead understanding that the outliers create a very rosy picture when something horrible is happening. In R, we use the **xyplot** function to create scatter plots. The "type" command asks R to show both the *points* ("p") and *regression line* ("r").

```{r}
xyplot( Median ~ Period, data = newMed , 
        type = c("p","r") ,
        xlab = "Quarters since 1981 Q1" , 
        ylab = "U.S. Median Wage")
```

The R commands $\fbox{lm}$ and $\fbox{cor}$ generate a linear model and the correlation respectively.

```{r}
lm( Median ~ Period, data = newMed)
cor(Median ~ Period, data = Med)
cor(Median ~ Period, data = newMed)
```

The correlation increased by 0.0011 and the rate of wage growth changed from $\$4.09$ to $\$4.04$. A nickel may not seem like much. However, consider the relative error of the slope of the regression line equates to more than $1\%$ per quarter propagated over 150-plus quarters.

### Example 2: Analysis Statements

Let's return to the modified data set, build and save the model, and conduct our analysis without the outliers present.

```{r}
mod = lm( Median ~ Period, data = newMed)
summary(mod)
```

The line of best fit is $$\text{Med} = 4.05\ *\ \text{Period} + 259.16 $$
We find that $R^2 = .995$ which means that $r = \sqrt{R^2}=0.997$ (Beginning analysts should always use "Multiple R-squared" value, not adjusted.) Thus, the correlation between median wage and quarter since 1980 is very strong. $R^2$ is the coefficient of determination and indicates that the Period accounts for $99.5\%$ of the variance in median wages. Lastly, we can see that the slope of the line of best is $\$4.04$ which means that U.S. wage growth over the past 40 years has averaged about $4.04$ per week per quarter (in current U.S. dollars), or about $\$16.32$ per week per year.

## Example 3: Median Wages during Trump Administration

Trump campaigned in 2016 on tax cuts, tariffs on foreign imports, exiting global trade treaties (TPP) and renegotiating NAFTA. Trump claimed his approach would help the wage earners in the country by reversing the globalization that shipped middle class jobs overseas. To suggest that experts disagreed would be a marked understatement. The U.S. political climate for three decades has been increasingly globalist. Trump's economic nationalism was <a href = https://www.washingtonpost.com/opinions/a-president-trump-could-destroy-the-world-economy/2016/10/05/f70019c0-84df-11e6-92c2-14b64f3d453f_story.html>pilloried by pundits</a>, Republican party stalwarts and his political opponents alike.

Let's investigate what transpired as detailed breifly in the YouTube video lecture, but this time work with the much more advanced tools in RStudio. 

### Using *subset* function in R to create data frame for Trump years only

We first need to subset our *Med* data set for just the years 2017 through 2109, or 12 quarters. These will be Periods 144 to 155.

```{r}
TrumpMed = subset( Med , Period >=  144 & Period <= 155)
```

### Scatter Plot: Trump Era Median Wage

```{r}
xyplot(Median ~ Period, data = TrumpMed , 
        type = c("p","r") ,
        xlab = "All Quarters 2017 to 2019" ,
        ylab = "Trump Era Median Wage") 
```

### Regression model for Trump era median wage

We use the **lm** function in R to create the linear model, and we store the results in the variable *modT* (Trump model) so we can access it later. The summary statistics match the YouTube lecture video which were created using the TI graphing calculator.

```{r}
modT = lm(Median ~ Period, data = TrumpMed)
summary(modT)
```

### QQ-Plot for Trump era median wage model

There is certainly some deviation in the qq-plot, but the overall pattern is roughly along the 45-degree line.

```{r}
qqmath( ~ resid(modT), type = c("p","r"))
```

## Example 3b: Median Wages during Obama Administration

We can subset the data frame for the whole Obama era or just the last three years.

```{r}
ObamaMedTotal = subset( Med , Period >= 112 & Period <= 143 )
ObamaMedL3 = subset( Med , Period >= 132 & Period <= 143 )
```

### Scatter plots for wage growth during Obama era

```{r}
xyplot(Median ~ Period, data = ObamaMedTotal , 
        type = c("p","r") ,
        xlab = "All Quarters 2009 to 2016" ,
        ylab = "Obama Era Median Wage") 
```

A marked increase can be seen in the final few quarters above. Let's display the last 3 years only.

```{r}
xyplot(Median ~ Period, data = ObamaMedL3 , 
        type = c("p","r") ,
        xlab = "All Quarters 2014 to 2016" ,
        ylab = "Obama Era Median Wage") 
```

### Model summaries for Obama era

```{r}
modB = lm(Median ~ Period, data = ObamaMedTotal)
modBL3 = lm(Median ~ Period, data = ObamaMedL3)
summary(modB)
summary(modBL3)
```
Note that the median wage growth for the entire 8 years of the Obama administration is lower than the historic trend of $\$4.05$, but in the final three years was $\$5.08$ as discussed in the YouTube lecture video. 

# <span style="color: blue;">V. Testing Sigificance of Slope: Randomized Confidence Intervals</span>

The hypothesis we wish to test is if the slope of the model from the Trump era is significantly different (higher) than the historic trend.

$$H_0 : \beta = 4.04\\H_a : \beta > 4.04$$

One reason we save the models is that it makes grabbing them later quite easy. Let's generate the 95% confidence intervals for the slope of the regression line in each model.

### Confidence interval: Trump era

```{r}
confint.lm(modT)
```
Because the 95% confidence interval $(4.84,7.52)$ does not include $\$4.04$, we reject the null and have evidence for a significantly higher Trump era median wage growth at the $\alpha = 0.025$ level. With confidence intervals, we don't have the luxury of a one-tailed hypothesis evaluation. When testing a two-tailed hypothesis with a confidence interval, then
$$\alpha = 1 - \text{confidence level}$$
For one-tailed tests, we have to match the area in the tails, so

$$\alpha = \frac{1 - \text{confidence level}}{2}$$

To produce the exact equivalent of testing at the $\alpha = 0.05$ level, we would use the following 90% confidence interval estimate.

```{r}
confint.lm(modT, level = 0.90)
```

Now we have exactly 5% of the area under the curve to the left which matches the rejection region for a one-tailed $t$-test.


### Confidence intervals: Obama era

We can test the hypothesis that the Obama era median wage growth was different that historic value at the 0.05 level.
$$H_0 : \beta = 4.04\\H_a : \beta \neq 4.04$$

```{r}
confint.lm(modB)
```

The historic trend for median wage growth of $4.04 is not included in the interval which suggests (based on the endpoints of the confidence interval above) that we reject the null hypothesis that the Obama era median wage growth is \$4.05. When all 32 quarters of the Obama era are included, evidence suggests that the overall growth of the U.S. median wage was significantly lower than the historic trend (at the 0.025 level of significance).

### Confidence interval: Last 3 years of Obama era

Note that a model restricted to the final three years of the Obama era dramatically shifts the analysis in favor of Obama. While the effect is certainly a cherry-picking of the best Obama years, we were analyzing the argument that the Trump economy rode on the coattails of the Obama economy in 2016. The argument is plausible on its face, so let's take a look at the data.

We're testing the null hypothesis that the last three years of the Obama era growth in median wages is similar to the Trump era growth of $6.90.

$$H_0 : \beta = 6.19\\H_a : \beta \neq 6.19$$

We need the 95% confidence interval for the last three years of Obama.

```{r}
confint.lm(modBL3, level = 0.95)
```

The upper endpoint of the confidence is close to but not greater than $6.19. Thus, we reject the null. We have evidence the median wage growth was significantly different in the pre-COVID Trump era than during the last three years of Obama. Remember that removing the median wage outliers from the COVID recession actually makes Trump's growth rate lower.

## QQ-Plots for Obama era median wage growth models

As with the Trump era, both Obama era qq-plots line up approximately with the 45-degree line, but they do have some odd variations.

```{r}
qqmath( ~ resid(modB), type = c("p","r"))
```

```{r}
qqmath( ~ resid(modBL3), type = c("p","r"))
```

# <span style="color: blue;">VI. Robustness of estimates of regression statistics</span>

Due to the <a href = https://towardsdatascience.com/ols-linear-regression-gauss-markov-blue-and-understanding-the-math-453d7cc630a5>Gauss-Markov theorem</a>, we know that the Ordinary Least Squares model is the *best linear unbiased estimator* (BlUE) for the model provided the relationship between the variables is **actually linear**. Better, we know that the regression statistics are generally robust even if the normality assumption is violated, so long as evidence of linearity is present.

We do have a problem, however, because the linear regression $t$-tests and $t$-intervals are not robust. Their intervals and $p$-values can be skewed by a lack of normality. When the qq-plot shows the residuals may not be normally distributed, we should use randomization techniques to generate p-values and confidence intervals instead traditional statistical procedures.

<div style="float:right; margin: 8px; border:2px black solid; padding: 10px 15px 10px">
### <span style="color: red;">Mosaic's Randomization Options</span>
1. **Shuffle.** Permutes the values in the sample data.
2. **Sample.** Draws a sub-sample from the sample data</br>*without replacement*.
3. **Resample.** Draws a sub-sample from the sample data</br>*with replacement*.</br>

For examples of permutation tests and bootstrapping, see</br>
the <a href = https://rpubs.com/robbsinn/s11>Randomization Tutorial</a>.
</div>

# <span style="color: blue;">VII. Randomization approach for Trump vs. Obama

We have two randomization options:

1. Permutation Method
2. Bootstrapping Method

The **permutation** method simply shuffles the $y$-values, pairing them with random $x$-values. The random association of $y$-values will generate, on average, a correlation and slope of zero. This allows us to test the null hypothesis $$H_0 : \rho = 0$$ but not null hypotheses like $$H_0 : \rho = 5.08$$

As usual, the Rossman-Chance applet is both a great visualization and easily implements the randomization.

![Rossman Chance Regression Shuffle App](images\Reg2.PNG)

<div style="float:right; margin: 8px; border:2px black solid; padding: 10px 15px 10px">
### Randomized Confidence Intervals
See the <a href = https://rpubs.com/robbsinn/s6>Statistical Estimation</a> (module 6) for details</br>
on using bootstrapping to create randomized confidence</br>
intervals.
</div>

However, we knew that the correlation of median wage during the Trump era was significantly different than zero. Given the historic trend of $4.05, the correlation would be remarkable only if it were zero.

### Bootstrapping

The bootstrapping method takes samples with replacement from the sample data to generate an estimate of the sampling distribution. Let's create one of the resamples.

```{r}
coef(lm(Median ~ Period, data=resample(TrumpMed, size = 10)))
```

The resample function has produced 10 data points drawn with replacement from the TrumpMed data frame. Note that the same data point may be drawn more than once. The linear model created by using those resampled data points has a slope of 6.64.

For bootstrapping, we are going to allow the **resample** function to choose its own **size** parameter. If we resample a data set with 155 observations, the **resample** functions will make 155 draws with replacment each time it runs.

```{r}
bootstrap = do(500) * coef(lm(Median ~ Period, data=resample(TrumpMed)))
densityplot(~Period, data=bootstrap)
```

There are a couple of methods to generate a confidence interval based on the bootstrap distribution. The best method - most often appropriate in most common situations - requires only that the bootstrap distribution be symmetric. The density plot above suggests the symmetry assumption is reasonable.

To create the confidence interval, we just find the relevant percentiles from within the bootstrap distribution. In our bootstrap distribution we have two values, "Intercept" and "Period." The "Period" column contains the slope estimates.

```{r}
qdata(~Period, p=c(0.025, 0.975), data=bootstrap)
```

Thus, the 95% confidence interval using randomization is $$(5.03, 7.54)$$
which is a bit different than the $t$-interval of $$(4.84, 7.53)$$ calculated above.


You can rerun the bootstrapping resampling several times to see how the confidence interval changes. It remains somewhat close to the $t$-interval values, but does vary a bit. In any case, we still appear to reject the null hypothesis that the Trump era was no different than the Obama era, at least with regard to median wage growth.

# Caveats and Takeaways

In political discussions, folks often choose teams to support and get frustrated when their team seems diminished. A word or two of caution.

1.  *Agenda-driven geeks can lie with statistics.* Politics is rife with deception. Media reports are increasingly agenda-driven. Question the motivation and factual integrity of every voice. Rely on good science.
2.  *Multiple ways of assessing the economy are viable.* If you felt the economic discussion was slanted right, remember there are lots of measures of economic health. Perhaps other measures tilt left. Wages are not the only way to earn money (investing, commisions, real estate, etc.). Median wages also don't measure the lowest end of the economic spectrum. What about the 10th percentile or 25th percentile? Studying an issue like this is multifaceted and may require lots of research. Keep an open mind and dig through the most scientific results you can find.

## Key Takeaways

1. *Savvy geeks detect others' attempted lies with statistics.* One major benefit of this class should be learning to apply your analytic skills to real world issues where it's incredibly important not to naively fall for deceptions like cherry-picking cases that support one's position. The counter-intuitive COVID outliers in the median wage data are a perfect example that would have sharply increased the median wage growth during the Trump era. People disagree about politics, but I hope we can agree that learning the art and science of applied statistics makes us more knowledgeable citizens and better potential employees and entrepreneurs.
2. *The median wage discussion contained great data sets.* The major reason the economic data was chosen was for pedagogy. Several of the qq-plots looked quite wobbly which motivated a discussion about violations of the normality assumption. The media report statistics that were generated by analyses exactly like the ones shown above. We don't get to check for crazy COVID outliers. We don't get to scan the qq-plots to assess model fit. We don't get to verify any of the assumptions or, in fact, check for bias on the part of the scientist who did the analysis. The median wage data sets allowed for an important discussion, and hopefully an interesting one as well.
3. *Regression model statistics are robust with respect violations of the normality assumption* for residuals, but linear regression $t$-tests and $t$-intervals are not. We always use scatter plots to verify that a linear relationship is a reasonable. With qq-plots, we are trying to decide how reliable the resulting model will be. If the qq-plot detects issues, use a randomization method for tests of significant correlation or confidence interval estimates.
4. *Bootstrapping.* We presented two randomization methods to tests of significant correlation. The permutation method is visually demonstrated by the Rossman-Chance applet. The bootstrap method allowed us to construct a confidence interval estimate of the slope parameter, yet the only the assumption was that bootstrap distribution was symmetric. When the qq-plot indicates potential violations of the normality assumption, use bootstrapping instead of $t$-procedures.

# <span style="color: blue;">VIII. Exercises</span>

1. Use the Data3350 data frame to build and evaluate a linear model for narcissism (dependent variabele) vs. thrill-seeking (independent variable) using the **Thrill** and **Narc** variables. Be sure to check the linearity and normality assumptions and analyze all regression statistics. Construct a confidence interval for the slope of the regression line using an appropriate method.

2. Use the Data3350 data frame to build and evaluate a linear model for neuroticism (dependent variabele) vs. optimism (independent variable) using the **Neuro** and **Opt** variables. Be sure to check the linearity and normality assumptions and analyze all regression statistics. Construct a confidence interval for the slope of the regression line using an appropriate method.

3. Using the built-in Mosaic data set **Dimes**, test the significance of the correlation between **mass** and **year**. Would it be appropriate to build a linear model in this case? Why or why not?

4. Use the **Perc** data frame to create linear models for the 25th percentile wage earners from 2000 Q1 through  2019 Q4. Be sure to look carefully at all diagnostic plots. Create models for the Trump era and the last 3 years of the Obmama era and conduct hypothesis tests that the Trump era growth was significantly greater than the historic trend as well as greater than the Obama era.

5. Use the **Perc** data frame to create linear models for the 75th and 90th percentile wage earners from 2000 Q1 through  2019 Q4. Be sure to look carefully at all diagnostic plots. Create models for the Trump era and the last 3 years of the Obmama era and conduct hypothesis tests that the Trump era growth was significantly greater than the historic trend as well as greater than the Obama era.

6. The **PercB** data is similar to **Perc** in that it includes the 10th, 25th, 50th, 75th and 90th percentile wages, but **PerB** includes Black wage earners only. Use the **PercB** data frame to create a linear model for the median wage growth (50th percentile) from 2000 Q1 through 2019 Q4. Be sure to look carefully at all diagnostic plots. Create models for the Trump era and the last 3 years of the Obmama era and conduct hypothesis tests that the Trump era growth was significantly greater than the historic trend as well as greater than the Obama era.
