Homework 2

Simple Linear Regression

Claire Battaglia https://rpubs.com/clairebattaglia (DACSS 603 Introduction to Quantitative Analysis)
March 10, 2022

Question 1

United Nations (Data file: UN11) The data in the file UN11 contains several variables, including ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries. The data were collected from the United Nations (2011). We will study the dependence of fertility on ppgdp.

  1. Identify the predictor and the response.
  2. Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?
  3. Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph? If you use a different base of logarithms, the shape of the graph won’t change, but the values on the axes will change.

Answer

  1. The predictor, or explanatory, variable is gross national product per person (ppgdp) and the response variable is fertility.
  2. No. The relationship between the two variables appears to be nonlinear (closer to exponential decay than to a straight line), so a simple linear regression model is not a good fit.
  3. Yes. A simple linear regression model is plausible when we log-transform the data.

Solution

Let’s start by inspecting the data. I’ll load the data set and create a new dataframe with just the variables I’m interested in. I’ll then take a quick look to make sure that there are no missing values.

Show code
# load data
data("UN11")

# create new dataframe
un11 <- UN11 %>%
  select(c(fertility, ppgdp))

# get summary
summary(un11)
   fertility         ppgdp         
 Min.   :1.134   Min.   :   114.8  
 1st Qu.:1.754   1st Qu.:  1283.0  
 Median :2.262   Median :  4684.5  
 Mean   :2.761   Mean   : 13012.0  
 3rd Qu.:3.545   3rd Qu.: 15520.5  
 Max.   :6.925   Max.   :105095.4  

Now I’ll create a scatterplot.

Show code
# create scatterplot
ggplot(un11, aes(x = ppgdp, y = fertility)) +
  geom_point() +
  labs(title = "Fertility v. Gross National Product Per Person", x = "Gross National Product Per Person", y = "Fertility") +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()

We can see that a linear model is not a good fit. This makes intuitive sense: a straight line would eventually predict negative fertility, growing in absolute value beyond a certain value of ppgdp, which isn’t logical. Instead, fertility varies quite widely among countries with a ppgdp below roughly 12,500 or so, while for countries above that threshold fertility varies much less and eventually seems to settle around 2.

Because the relationship isn’t linear, a log transformation might be useful.

Show code
ggplot(un11, aes(x = log(ppgdp), y = log(fertility))) +
  geom_point() +
  labs(title = "Log Transformation of Fertility v. Gross National Product Per Person", x = "Log of Gross National Product Per Person", y = "Log of Fertility") +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()

A linear model of the log transformation appears to be a much better fit.
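
Although the question only asks whether the model is plausible, here is a quick sketch (not required by the assignment) of what fitting the log-log model would look like, using the un11 dataframe created above:

# fit the simple linear regression on the log-log scale (sketch only)
fitLog <- lm(log(fertility) ~ log(ppgdp), data = un11)

# inspect the fit
summary(fitLog)

The slope of a log-log fit is interpreted as an elasticity: the approximate percent change in fertility associated with a one percent change in ppgdp.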

Question 2

Annual income, in dollars, is an explanatory variable in a regression analysis. For a British version of the report on the analysis, all responses are converted to British pounds sterling (1 pound equals about 1.33 dollars, as of 2016).

  1. How, if at all, does the slope of the prediction equation change?
  2. How, if at all, does the correlation change?

Answer

  1. The slope does change. Converting the response variable from dollars to pounds at the 2016 exchange rate (1 GBP = 1.33 USD, so 1 USD ≈ .75 GBP) divides the slope by 1.33: \[new\;slope=\frac{old\;slope}{1.33}\] It’s worth noting that if we were to convert the units of the explanatory variable instead of the response variable, the new slope would be found by the inverse operation (multiplying the old slope by 1.33).
  2. The correlation does not change.

Solution

While the slope (\(b\)) of a prediction equation and the correlation (\(r\)) between the two variables are linked conceptually and mathematically, they are distinct from one another and convey different information.

The slope is a measure of the direction of the association between the two variables. For every unit increase in \(x\), \(y\) will either increase or decrease (if the two variables are actually associated with one another). The sign of the slope will tell us which and the value itself will tell us by how much.

The slope cannot tell us the strength of that association, however. This is because the slope is dependent on the specific units of the variables. If the data are converted to a different unit of measurement, the slope will change, while logically we know that the strength of the association between the two variables hasn’t changed.

As a standardized form of the slope, the correlation is not dependent on the specific units of the variables and is thus able to tell us the strength of the association between the two variables.
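
In symbols, the least-squares slope and the correlation are linked by \[r=b\,\frac{s_x}{s_y}\] where \(s_x\) and \(s_y\) are the sample standard deviations of \(x\) and \(y\). Converting \(y\) to different units rescales \(b\) and \(s_y\) by the same factor, so \(r\) is unchanged.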

Let’s look at a quick example (all data are entirely made up and completely meaningless).

Show code
x <- 1:20 # set values for x
a <- 1 # set y intercept
b <- 2 # set slope
yUSD <- a + b*x # prediction equation with y in USD

# create dataframe
dataQ2 <- data.frame(x, yUSD)

# convert y from USD to GBP
dataQ2 <- dataQ2 %>%
  mutate(yGBP = yUSD/1.33)

# view
dataQ2
    x yUSD      yGBP
1   1    3  2.255639
2   2    5  3.759398
3   3    7  5.263158
4   4    9  6.766917
5   5   11  8.270677
6   6   13  9.774436
7   7   15 11.278195
8   8   17 12.781955
9   9   19 14.285714
10 10   21 15.789474
11 11   23 17.293233
12 12   25 18.796992
13 13   27 20.300752
14 14   29 21.804511
15 15   31 23.308271
16 16   33 24.812030
17 17   35 26.315789
18 18   37 27.819549
19 19   39 29.323308
20 20   41 30.827068

I now have 20 values for \(x\) in USD, 20 corresponding \(y\) values in USD, and a second set of \(y\) values that are the original \(y\) values converted into GBP at the given exchange rate. I can now regress each set of \(y\) values onto \(x\) and compare the slopes and correlation coefficients.

Show code
# regress yUSD onto x
fitUSD <- lm(yUSD ~ x, data = dataQ2)

# regress yGBP onto x
fitGBP <- lm(yGBP ~ x, data = dataQ2)

# show both regression outputs
stargazer(fitUSD, fitGBP,
          type = "text",
          header = FALSE,
          single.row = TRUE,
          no.space = TRUE,
          column.sep.width = "3pt",
          font.size = "small")

=================================================================================================================================
                                                                      Dependent variable:                                        
                              ---------------------------------------------------------------------------------------------------
                                                    yUSD                                              yGBP                       
                                                     (1)                                               (2)                       
---------------------------------------------------------------------------------------------------------------------------------
x                                             2.000*** (0.000)                                  1.504*** (0.000)                 
Constant                                      1.000*** (0.000)                                  0.752*** (0.000)                 
---------------------------------------------------------------------------------------------------------------------------------
Observations                                         20                                                20                        
R2                                                  1.000                                             1.000                      
Adjusted R2                                         1.000                                             1.000                      
Residual Std. Error (df = 18)                       0.000                                             0.000                      
F Statistic (df = 1; 18)      79,494,261,481,456,398,357,598,065,131,520.000*** 71,025,695,591,639,092,023,323,881,635,840.000***
=================================================================================================================================
Note:                                                                                                 *p<0.1; **p<0.05; ***p<0.01

When \(y\sim\sf{x}\) and both variables are in USD, \(b=2\) and \(r=1\).

This means that when \(x\) increases by a single unit, \(y\) increases by 2 units.

When I convert \(y\) from USD into GBP and once again run the regression \(y\sim\sf{x}\), \(b=1.504\) and \(r=1\).

This means that when \(x\) increases by a single unit, \(y\) increases by 1.504 units.

\[1.504=\frac{2}{1.33}\]

Because I created the data myself, I know that the relationship between \(x\) and \(y\) is the same whether \(y\) is in USD or in GBP, yet the slopes are different. This demonstrates that the slope depends on the specific units of measurement of the variables while the correlation coefficient does not. The two are used in concert to understand the relationship between two variables.
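
To verify both claims directly, we can compute the two correlations (each should equal 1 for these made-up data) and refit the model with the explanatory variable converted instead of the response. This is a quick sketch; xGBP is a new column added here purely for illustration.

# the correlation is identical regardless of the units of y
cor(dataQ2$x, dataQ2$yUSD)
cor(dataQ2$x, dataQ2$yGBP)

# converting the explanatory variable instead multiplies the slope by 1.33
dataQ2 <- dataQ2 %>%
  mutate(xGBP = x/1.33)

coef(lm(yUSD ~ xGBP, data = dataQ2))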

Question 3

Water runoff in the Sierras (Data file: water) Can Southern California’s water supply in future years be predicted from past data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more efficiently. The data file contains 43 years’ worth of precipitation measurements taken at six sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM. Draw the scatterplot matrix for these data and summarize the information available from these plots.

Answer

Year does not appear to be correlated with any variable.

APMAM, APSAB, and APSLAKE appear to be moderately positively correlated with one another.

OPBPC, OPRC, and OPSLAKE appear to be strongly positively correlated with one another.

There appears to be a moderate positive correlation between BSAAM and APMAM, APSAB, and APSLAKE, and a strong positive correlation between BSAAM and OPBPC, OPRC, and OPSLAKE.

Solution

As always, I’ll start by inspecting the data.

Show code
# load data
data("water")

# create object
water <- water

# preview
head(water)
  Year APMAM APSAB APSLAKE OPBPC  OPRC OPSLAKE  BSAAM
1 1948  9.13  3.58    3.91  4.10  7.43    6.47  54235
2 1949  5.28  4.82    5.20  7.55 11.11   10.26  67567
3 1950  4.20  3.77    3.67  9.52 12.20   11.35  66161
4 1951  4.60  4.46    3.93 11.14 15.15   11.13  68094
5 1952  7.15  4.99    4.88 16.34 20.05   22.81 107080
6 1953  9.70  5.65    4.91  8.88  8.15    7.41  67594

The predictor variables are the precipitation measurements taken at each of the six sites: APMAM, APSAB, APSLAKE, OPBPC, OPRC, OPSLAKE, measured in inches.

The response variable is the stream runoff at the site BSAAM, measured in acre-feet.

Show code
# create scatterplot matrix
pairs(water)

Yikes. Let’s first orient ourselves so that we can make sense of this.

The \(x\)-axis of any given plot is the variable named in that column. In the above matrix, each plot in the first column has the \(x\)-axis year.

The \(y\)-axis of any given plot is the variable named in that row. In the above matrix, each plot in the first row has the \(y\)-axis year.

By looking at the first column and row, then, we can see the variable year plotted against every other variable, on the \(x\) and \(y\)-axis, respectively.

With that in mind, let’s assess.

Year does not appear to be correlated with any variable, which is to be expected. While collecting these data over time could reveal larger trends (e.g. climate change), the year itself isn’t predictive of runoff.

APMAM, APSAB, and APSLAKE appear to be moderately positively correlated with one another.

OPBPC, OPRC, and OPSLAKE appear to be strongly positively correlated with one another.

BSAAM appears to be moderately positively correlated with APMAM, APSAB, and APSLAKE.

BSAAM appears to be strongly positively correlated with OPBPC, OPRC, and OPSLAKE.

Another way to visualize the data is with the ggcorr function from the GGally package. Instead of showing us a scatterplot matrix, it calculates the correlation between each pair of variables and then displays the strength of each association as color-coded categories. This is an easy way to confirm what we gleaned from the scatterplot matrix.

Show code
ggcorr(water)

We can see that no variables are negatively correlated with one another (there are no blue cells) and that year is either uncorrelated or very weakly positively correlated with every variable (white to pale pink). APMAM, APSAB, and APSLAKE are moderately positively correlated with one another (light red to red), and OPBPC, OPRC, and OPSLAKE are strongly positively correlated with one another (red). Finally, BSAAM is strongly positively correlated with OPBPC, OPRC, and OPSLAKE (red) and slightly less so with APMAM, APSAB, and APSLAKE (light red).
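
If we wanted exact values rather than color bins, we could also print the correlation matrix directly. This is a quick sketch and isn't needed to answer the question.

# numeric correlation matrix, rounded for readability
round(cor(water), 2)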

Question 4

Professor ratings (Data file: Rateprof) In the website and online forum RateMyProfessors.com, students rate and comment on their instructors. Launched in 1999, the site includes millions of ratings on thousands of instructors. The data file includes the summaries of the ratings of 364 instructors at a large campus in the Midwest (Bleske-Rechek and Fritsch, 2011). Each instructor included in the data had at least 10 ratings over a several year period. Students provided ratings of 1–5 on quality, helpfulness, clarity, easiness of instructor’s courses, and raterInterest in the subject matter covered in the instructor’s courses. The data file provides the averages of these five ratings. Use R to reproduce the scatterplot matrix in Figure 1.13 in the ALR book (page 20). Provide a brief description of the relationships between the five ratings. (The variables don’t have to be in the same order)

Solution

As always, I’ll start by inspecting the data.

Show code
# load data
data("Rateprof")

# create object
rateProf <- Rateprof

# preview
head(rateProf)
  gender numYears numRaters numCourses pepper discipline
1   male        7        11          5     no        Hum
2   male        6        11          5     no        Hum
3   male       10        43          2     no        Hum
4   male       11        24          5     no        Hum
5   male       11        19          7     no        Hum
6   male       10        15          9     no        Hum
               dept  quality helpfulness  clarity easiness
1           English 4.636364    4.636364 4.636364 4.818182
2 Religious Studies 4.318182    4.545455 4.090909 4.363636
3               Art 4.790698    4.720930 4.860465 4.604651
4           English 4.250000    4.458333 4.041667 2.791667
5           Spanish 4.684211    4.684211 4.684211 4.473684
6           Spanish 4.233333    4.266667 4.200000 4.533333
  raterInterest sdQuality sdHelpfulness sdClarity sdEasiness
1      3.545455 0.5518564     0.6741999 0.5045250  0.4045199
2      4.000000 0.9020179     0.9341987 0.9438798  0.5045250
3      3.432432 0.4529343     0.6663898 0.4129681  0.5407021
4      3.181818 0.9325048     0.9315329 0.9990938  0.5882300
5      4.214286 0.6500112     0.8200699 0.5823927  0.6117753
6      3.916667 0.8632717     1.0327956 0.7745967  0.6399405
  sdRaterInterest
1       1.1281521
2       1.0744356
3       1.2369438
4       1.3322506
5       0.9749613
6       0.6685579

Now I’ll create a basic scatterplot matrix.

Show code
# create scatterplot matrix
pairs(~ quality + helpfulness + clarity + easiness + raterInterest, data = rateProf)

Discussion

Quality appears to have:

Helpfulness appears to have:

Clarity appears to have:

Easiness appears to have:

RaterInterest appears to have a weak positive correlation with every other variable.
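
To put rough numbers behind these visual impressions, we could compute the correlation matrix of the five ratings. This is a sketch, not part of the original assignment.

# correlations among the five average ratings
round(cor(rateProf[, c("quality", "helpfulness", "clarity", "easiness", "raterInterest")]), 2)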

Beyond a simple statement of the association between each of these variables, however, it’s impossible to say what substantive, or contextual, significance these data have. In an attempt to understand the data more clearly, I created an account with Rate My Professors and took a look around. Some things I noticed are:

Ultimately, I don’t think these data can tell us much about a given professor. The real potential of these data is in understanding how people rate their professors, which is not the same thing as how the professors actually are.

It could be interesting to subset the data in different ways: by the gender of the professor (do we rate professors of different gender differently?), the gender of the respondent (do students of different gender rate professors differently?), the discipline (do we rate humanities professors differently than STEM professors?), the age/year of the respondent (do undergraduate freshmen tend to rate differently than graduate students?), etc. As the questions are currently designed, however, they don’t consistently capture valid information about the professor or the underlying factors that influence how people might rate their professors.
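
For instance, a first cut at the professor-gender question could look like the sketch below. It uses only the gender column already in this file; the data say nothing about the raters themselves, so the other questions would require different data.

# compare average quality ratings by professor gender (illustration only)
rateProf %>%
  group_by(gender) %>%
  summarize(meanQuality = mean(quality), n = n())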

Question 5

For the student.survey data file in the smss package, conduct regression analyses relating

  1. y = political ideology and x = religiosity,
  2. y = high school GPA and x = hours of TV watching.

(You can use ?student.survey in the R console, after loading the package, to see what each variable means.)

  1. Use graphical ways to portray the individual variables and their relationship.
  2. Interpret descriptive statistics for summarizing the individual variables and their relationship.
  3. Summarize and interpret results of inferential analyses.

Solution

As always, I’ll start by inspecting the data.

Show code
# load data
data("student.survey")

# create dataframe
studS <- student.survey %>%
  select(c("pi", "re", "hi", "tv"))

# get summary
summary(studS)
                     pi                re           hi       
 very liberal         : 8   never       :15   Min.   :2.000  
 liberal              :24   occasionally:29   1st Qu.:3.000  
 slightly liberal     : 6   most weeks  : 7   Median :3.350  
 moderate             :10   every week  : 9   Mean   :3.308  
 slightly conservative: 6                     3rd Qu.:3.625  
 conservative         : 4                     Max.   :4.000  
 very conservative    : 2                                    
       tv        
 Min.   : 0.000  
 1st Qu.: 3.000  
 Median : 6.000  
 Mean   : 7.267  
 3rd Qu.:10.000  
 Max.   :37.000  
                 

While I am using the summary function here primarily to determine which types of variables we’re working with and to make sure there are no missing values, it actually also returns some of the descriptive statistics I’m interested in. I’ll look at those more closely in each section.

Political Ideology and Religiosity

Political ideology and religiosity are both categorical variables: respondents select their response from a specified set of options rather than reporting a numeric value. Because there is an underlying order to each set of response options, both are ordinal.

This means that I’ll primarily be looking at the frequency and relative frequency of each response option. Given that the distance between response options is not quantifiable and almost certainly unequal, measures like the mean and standard deviation are not appropriate here.

I’ll start by looking at political ideology. A simple bar chart is a great way to visualize frequency of each response.

Show code
# create bar chart
ggplot(studS, aes(x = pi)) +
  geom_bar(fill = "#830042") +
  labs(title = "Political Ideology of STA 6126 Students", x = "Political ideology", y = NULL) +
  coord_flip() +
  theme_minimal()

We can clearly see that the mode is “liberal,” as well as the relative frequency of all of the response options (although not the exact values).

We could also view the frequency distribution in a simple table. In the table below, the first value in each cell is the number of students who responded with that option and the second value is the proportion. Looking at the first cell, we can see that 8 students, or 13.3%, identify as “very liberal.”

Show code
# create freq table
CrossTable(studS$pi,  max.width = 3)

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  60 

 
                      |          very liberal |               liberal |      slightly liberal | 
                      |-----------------------|-----------------------|-----------------------|
                      |                     8 |                    24 |                     6 | 
                      |                 0.133 |                 0.400 |                 0.100 | 
                      |-----------------------|-----------------------|-----------------------|

                      |              moderate | slightly conservative |          conservative | 
                      |-----------------------|-----------------------|-----------------------|
                      |                    10 |                     6 |                     4 | 
                      |                 0.167 |                 0.100 |                 0.067 | 
                      |-----------------------|-----------------------|-----------------------|

                      |     very conservative | 
                      |-----------------------|
                      |                     2 | 
                      |                 0.033 | 
                      |-----------------------|



 

If we were willing to sacrifice some precision to simplify the data we could collapse the categories into “conservative,” “moderate,” and “liberal.”

Show code
# collapse categories
piCollapse <- studS %>%
  mutate(pi = fct_collapse(pi,
                         liberal = c("very liberal", "liberal", "slightly liberal"),
                         conservative = c("very conservative", "conservative", "slightly conservative"),
                         moderate = c("moderate")))

# create bar plot
ggplot(piCollapse, aes(x = pi)) +
  geom_bar(fill = "#830042") +
  labs(title = "Political Ideology of STA 6126 Students", x = "Political ideology", y = NULL) +
  theme_minimal() +
  coord_flip()

Another way to approach these data is by looking at the students’ distance from the ideological center. That is, how are the students distributed around the center of the ideological spectrum?

Show code
# collapse categories
piDistance <- studS %>%
  mutate(pi = fct_collapse(pi,
                           "furthest left or right of center" = c("very liberal", "very conservative"),
                           "a bit further left or right of center" = c("liberal", "conservative"),
                           "slightly left or right of center" = c("slightly liberal", "slightly conservative"),
                           "center (neither left nor right)" = c("moderate")))

# create bar plot
ggplot(piDistance, aes(x = pi)) +
  geom_bar(fill = "#830042") +
  labs(title = "Distance from Ideological Center", x = "Distance from center", y = NULL) +
  theme_minimal() +
  coord_flip()

It appears that the greatest number of students consider themselves to be more than just slightly left or right of center but not very far left or right of center.

I’m sure this is an entire area of study in itself so I’d want to explore the literature before standing by this interpretation but for now I’ll say we could potentially look at the data this way.

Next I’ll look at religiosity. Again, a simple bar chart is a great place to start.

Show code
# create bar chart
ggplot(studS, aes(x = re)) +
  geom_bar(fill = "#830042") +
  labs(title = "Religiosity of STA 6126 Students", x = "How often do you attend religious services?", y = NULL) +
  coord_flip() +
  theme_minimal()

Once again this allows us to easily identify the mode and relative frequency of each response option. “Occasionally” is clearly the most common response; “most weeks” the least common.

We can again look at the frequency distribution in a simple table.

Show code
# create freq table
CrossTable(studS$re,  max.width = 4)

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  60 

 
             |        never | occasionally |   most weeks |   every week | 
             |--------------|--------------|--------------|--------------|
             |           15 |           29 |            7 |            9 | 
             |        0.250 |        0.483 |        0.117 |        0.150 | 
             |--------------|--------------|--------------|--------------|



 

In order to understand the relationship between political ideology and religiosity, we can construct a contingency table and/or a scatterplot.

Show code
# create table pi and re
tablePiRe <- table(studS$pi, studS$re)

# view
tablePiRe
                       
                        never occasionally most weeks every week
  very liberal              3            5          0          0
  liberal                   8           14          1          1
  slightly liberal          2            1          1          2
  moderate                  1            8          1          0
  slightly conservative     1            1          2          2
  conservative              0            0          2          2
  very conservative         0            0          0          2
Show code
# create scatterplot
ggplot(studS, aes(x = re, y = pi)) +
  geom_point() +
  geom_jitter() +
  labs(title = "Political Ideology v. Religiosity", x = "Religiosity", y = "Political ideology") +
  theme_minimal()

The two do seem to be associated with one another: less religious students tend to identify as more liberal, and more religious students tend to identify as more conservative.

Because both variables are ordinal, we can also calculate gamma (\(\gamma\)) to understand the association between them. Gamma compares concordant pairs of observations (one observation is higher on both variables) with discordant pairs (one observation is higher on one variable but lower on the other): \(\gamma=\frac{C-D}{C+D}\), where \(C\) and \(D\) are the numbers of concordant and discordant pairs.
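
To make the definition concrete, here is a minimal by-hand sketch (not part of the original solution) that counts the concordant and discordant pairs in tablePiRe directly; it should agree with the packaged GoodmanKruskalGamma() call below.

# brute-force count of concordant (C) and discordant (D) pairs
tab <- tablePiRe
C <- 0
D <- 0
for (i in seq_len(nrow(tab))) {
  for (j in seq_len(ncol(tab))) {
    # cells below and to the right are concordant with cell (i, j)
    if (i < nrow(tab) && j < ncol(tab))
      C <- C + tab[i, j] * sum(tab[(i + 1):nrow(tab), (j + 1):ncol(tab)])
    # cells below and to the left are discordant with cell (i, j)
    if (i < nrow(tab) && j > 1)
      D <- D + tab[i, j] * sum(tab[(i + 1):nrow(tab), 1:(j - 1)])
  }
}
(C - D) / (C + D)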

Show code
# calculate gamma
GoodmanKruskalGamma(tablePiRe,
                    conf.level = 0.95)
    gamma    lwr.ci    upr.ci 
0.5747711 0.3646211 0.7849211 

This output gives us the 95% confidence interval for gamma. Here, \(\gamma=.575\). The positive sign indicates a positive association, and the confidence interval tells us the true gamma most likely lies between .365 and .785; that is, there is a moderate to fairly strong association between the two variables.

Because the political ideology categories are ordered from 1 = very liberal to 7 = very conservative (and religiosity from never to every week), the positive gamma should be read as:

Religiosity is positively associated with conservatism. As religiosity increases, conservatism increases.

We could also say that religiosity is negatively associated with liberalism but that is not what’s being reflected in our gamma calculation.

There are a few ways to approach understanding association between ordinal variables. I’ve chosen to look at gamma because the sample size seems too small to justify using \(\chi^2\). Another option is to treat the data numerically and then run a simple linear regression. Given the potentially extreme variability in distance between the categories (for religiosity in particular), I’m not sure we could justify doing that here.

That being said, here’s what it would look like.

Show code
# convert categories to numeric values
pireNum <- studS %>%
  select(c("pi", "re")) %>%
  sapply(unclass)

# convert to dataframe
pireNum <- data.frame(pireNum)

# regress ideology onto religiosity
fitPiRe <- lm(pi ~ re, data = pireNum)

# get summary
summary(fitPiRe)

Call:
lm(formula = pi ~ re, data = pireNum)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.81243 -0.87160  0.09882  1.12840  3.09882 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.9308     0.4252   2.189   0.0327 *  
re            0.9704     0.1792   5.416 1.22e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.345 on 58 degrees of freedom
Multiple R-squared:  0.3359,    Adjusted R-squared:  0.3244 
F-statistic: 29.34 on 1 and 58 DF,  p-value: 1.221e-06

Here again we see a positive association, which confirms our conclusion above. The \(P\)-value (1.22e-06) is well below the .01 criterion for statistical significance, indicating that we can be confident that the slope (\(b\)) is not zero.

Put another way, we can reject the null hypothesis that the two variables are independent of one another.

High School GPA and TV Watching

Both high school GPA and hours of TV watched per week are quantitative variables. This means that in addition to looking at the frequency and relative frequency of each value, we’ll also be able to look at the mean and standard deviation of each data set.

I’ll first look at high school GPA. Because this is a small data set, a stem-and-leaf plot might be a useful way to take a quick look.

Show code
# create stem and leaf plot
stem(studS$hi, scale = 3)

  The decimal point is 1 digit(s) to the left of the |

  20 | 0
  21 | 0
  22 | 0
  23 | 0
  24 | 
  25 | 
  26 | 
  27 | 0
  28 | 0
  29 | 
  30 | 00000000000000
  31 | 0
  32 | 000
  33 | 000000
  34 | 00000
  35 | 0000000
  36 | 000
  37 | 0000
  38 | 000000
  39 | 
  40 | 00000

We can see that the lowest GPA earned is 2.0 (1 student) and that the highest is 4.0 (5 students), making the range 2.

3.0 appears to be the most frequently earned grade (the mode) and the largest proportion of grades are clustered between 3.0 and 3.8 so I expect both the mean and median to fall somewhere between those values.

Beyond the mode, range, and some sense of the distribution of the grades, there are a couple of other descriptive statistics we can look at.

Show code
# get summary stats
select(studS, hi) %>%
  describe.by()
   vars  n mean   sd median trimmed  mad min max range  skew kurtosis
X1    1 60 3.31 0.46   3.35    3.35 0.52   2   4     2 -0.76     0.57
     se
X1 0.06

The median is 3.35, the mean is 3.31, and the standard deviation is .46. We can see that the data are slightly skewed because the median and mean are different values. The mean is pulled down by those few very low GPAs.

Now I’ll look at the average number of hours of TV watched per week. We can use a simple histogram to take a look at the frequency distribution.

Show code
# create hist
ggplot(studS, aes(x = tv)) +
  geom_histogram(fill = "#830042") +
  labs(title = "TV-Watching Habits of STA 6126 Students", x = "Average hours watched per week", y = NULL) +
  theme_minimal()

This gives us a sense of the frequency distribution and we can again get some summary statistics.

Show code
# get summary stats
select(studS, tv) %>%
  describe.by()
   vars  n mean   sd median trimmed  mad min max range skew kurtosis
X1    1 60 7.27 6.72      6    6.24 5.93   0  37    37 2.14     6.06
     se
X1 0.87

We can see from the histogram that most of the students watch less than 15 hours per week but that a few watch more than 25 hours per week. The median is 6 and the mean is 7.27, reflecting the skew created by those few very high values. The range (37) is surprisingly large and it’s hard to even imagine watching 37 hours of TV per week. It’s conceivable that there are two types of TV-watching: active watching (you sit down to watch something) and passive watching (it’s on in the background while you do something else). The large range makes me wonder if some respondents were reporting only active watching while perhaps others were reporting both active and passive watching.

Before looking at these two variables together, it’s worth noting that we’re comparing each student’s GPA in high school with their TV-watching habits in graduate school. It’s tempting to conceptualize TV-watching habits as the explanatory variable and GPA as the outcome variable (something like “watching more/less TV is associated with a higher/lower GPA”) and subsequently explore causal inference. Given the temporal order in this case (GPA precedes TV-watching) and the long gap between the two (a minimum of 4 years), however, it seems highly unlikely that any association between these two variables could have any substantive meaning.

That being said, a scatterplot is a great way to visualize association.

Show code
# create scatterplot
ggplot(studS, aes(x = tv, y = hi)) +
  geom_point() +
  geom_jitter() +
  labs(title = "High School GPA v. Hours of TV Watched per Week in Graduate School", x = "Hours of TV watched per week", y = "High School GPA") +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()

There appears to be a weak negative correlation between high school GPA and hours of TV watched per week.

Show code
# cor test
cor.test(studS$tv, studS$hi)

    Pearson's product-moment correlation

data:  studS$tv and studS$hi
t = -2.1144, df = 58, p-value = 0.03879
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.48826914 -0.01457694
sample estimates:
       cor 
-0.2675115 

A correlation coefficient (\(r\)) of -.268 confirms that there is a weak negative correlation. Those students with higher GPAs in high school tend to watch less TV in graduate school.

Show code
# fit model
fitHiTv <- lm(hi ~ tv, data = studS)

# get summary
summary(fitHiTv)

Call:
lm(formula = hi ~ tv, data = studS)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2583 -0.2456  0.0417  0.3368  0.7051 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.441353   0.085345  40.323   <2e-16 ***
tv          -0.018305   0.008658  -2.114   0.0388 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4467 on 58 degrees of freedom
Multiple R-squared:  0.07156,   Adjusted R-squared:  0.05555 
F-statistic: 4.471 on 1 and 58 DF,  p-value: 0.03879

Our regression analysis indicates that for each additional hour of TV watched per week in graduate school, predicted high school GPA is about .018 points lower.
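
To see what that slope implies on the original scale, here is a quick sketch; the TV-hours values (0 and 10) are chosen purely for illustration.

# predicted high school GPA for a student who watches 0 vs. 10 hours of TV per week
predict(fitHiTv, newdata = data.frame(tv = c(0, 10)))

The difference between the two predicted GPAs is roughly 10 × .018 ≈ .18 points.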

Again, though, given the above discussion of these two variables, I don’t believe there is any substantive or contextual meaning to the observed association.

Question 6

For a class of 100 students, the teacher takes the 10 students who perform poorest on the midterm exam and enrolls them in a special tutoring program. The overall class mean is 70 on both the midterm and final, but the mean for the specially tutored students increases from 50 to 60. Use the concept of regression toward the mean to explain why this is not sufficient evidence to imply that the tutoring program was successful.

Solution

Regression towards the mean is a statistical phenomenon in which extreme values (those furthest away from the mean) tend to be less extreme (closer to the mean) when measured again.

It occurs whenever two measurements are imperfectly correlated, that is, when some element of chance contributes to each outcome. Selecting a group specifically because of its extreme scores on the first measurement, as was done here, makes the phenomenon especially likely to appear.

In the context of this quasi-experiment, each student’s test scores vary around some true level of knowledge, which we don’t know. The difference between the student’s true level of knowledge and the student’s exam score is the error.

This error can occur in either direction and be caused by many things. That is, a student could score higher or lower than her true level of knowledge because of any number of conditions—conditions that occur to some degree by chance. Because those conditions occur to some degree by chance, it’s improbable that the same students will experience the same conditions during both the first observation (the midterm exam) and the second observation (the final exam). If those students do not experience those same conditions, we would expect to see the students who scored higher than their true level of knowledge score a little lower and the students who scored lower than their true level of knowledge score a little higher. This will occur regardless of treatment.

It’s important to note that regression towards the mean does not imply that every student who scored extremely high or low did so purely by chance and that their scores will be different when observed a second time. We’re looking at the mean score of the ten lowest-scorers. If even just a few of those students scored lower than their true level of knowledge due to some chance condition—perhaps one student was sick on exam day, one missed the bus and had to run to school, and a third’s parents had just announced their divorce—that wasn’t present at the time of the second observation (the final exam), those students would score higher, which would increase the mean of the entire group of lowest-scorers. Thus we cannot say whether there was any change in the students’ true level of knowledge.

It would be helpful to also look at the mean score of the ten highest-scorers on the midterm. Regression towards the mean tells us that we would likely see the mean score of those ten students decrease from the midterm to the final.
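
A quick simulation makes this concrete (all numbers invented, in the same spirit as the made-up data in Question 2): with no tutoring at all, the mean of the ten lowest midterm scorers should rise on the final and the mean of the ten highest scorers should fall, simply because the chance errors are drawn anew.

set.seed(603) # arbitrary seed for reproducibility

true    <- rnorm(100, mean = 70, sd = 8) # each student's unobserved true level of knowledge
midterm <- true + rnorm(100, sd = 6)     # exam score = true knowledge + chance error
final   <- true + rnorm(100, sd = 6)     # a new, independent error on the final

bottom10 <- order(midterm)[1:10]                    # ten poorest midterm performers
top10    <- order(midterm, decreasing = TRUE)[1:10] # ten best midterm performers

c(mean(midterm[bottom10]), mean(final[bottom10])) # bottom group's mean should move up toward 70
c(mean(midterm[top10]), mean(final[top10]))       # top group's mean should move down toward 70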

From a research design perspective, this quasi-experiment is problematic in other ways that prevent us from drawing any causal inference about the effectiveness of the treatment.

  1. The sample is small.
  2. There is no control group. There is a group of students who did not receive the treatment but because the treatment and no-treatment groups were not randomly assigned, we know that they differed from one another in at least one way (their midterm exam score). Whether they differed in other ways was not explored so at this time we have no way of knowing whether they differed in systematic ways that could also explain their scores.
  3. It’s likely that one observation (the midterm exam) is insufficient for determining a student’s true level of knowledge. In order to overcome the regression challenge, we would need more observations of the same students over time. This would help us understand how much of each student’s score can be attributed to some true level of knowledge and how much is due to some chance condition.

We can finally conclude that the phenomenon of regression towards the mean could easily explain the increase in the mean score of the ten lowest-scorers. In addition, there are numerous other challenges to the validity of this quasi-experiment. As such, we cannot claim that the treatment was effective. Indeed, we can’t even claim that there was any actual change in the students’ level of knowledge.