DACSS 603 Homework 2

Regression Background, Simple Linear Regression, & Multiple Regression

Kristina Becvar
2022-03-08

Question 1

United Nations

United Nations (Data file: UN11) The data in the file UN11 contains several variables, including ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries. The data were collected from the United Nations (2011). We will study the dependence of fertility on ppgdp.

Answer 1

First, I load in the appropriate data. Then I look at the head() info to preview the dataset, and use the function “?UN11” to get details on this data.

The defining measure of the variables I am examining are:

ppgdp, representing “per capita gross domestic product in US dollars”, and fertility, representing “number of children per woman”.

Next, I create the scatterplot and add a theoretical regression line using the smooth() function and applying the ‘lm’ method in ggplot2. Using “smooth()” in this way allows me to see patterns in the graph representing a potential linear regression line.

Show code
#First, I loaded the data, used the head() function to preview the dataset, and used the "?UN11" function to get additional details on the data represented in the dataset. 

data("UN11")

#Then, I created a dataframe with the relevant variables and plotted them using ggplot2

df1 <- UN11 %>% 
  select(c(ppgdp, fertility))

gg1a <- ggplot(df1, aes(x=ppgdp, y=fertility)) +
  geom_point() +
  geom_smooth(method=lm,se=FALSE,fullrange=TRUE,color="goldenrod") +
   labs(title= "Fertility and GDP",
        x= "GNP (Per Person in US Dollars)", 
        y = "Fertility (Births Per Woman)") +
   theme_classic()

gg1a

Finally, I add the log() functions to analyze the linear model and again use “smooth()” to look at patterns in the graph representing a regression line on a linear model.

Show code
gg1b <- ggplot(df1, aes(x=log(ppgdp), y=log(fertility))) +
  geom_point() +
  geom_smooth(method=lm,se=TRUE,fullrange=TRUE,color="goldenrod") +
   labs(title= "Log: Fertility and GDP",
        x= "GNP (Per Person in US Dollars)",
        y = "Fertility (Births Per Woman)") +
   theme_classic()

gg1b

Additionally, the summary() call of the linear model “lm()” function allows me to confirm that the p-value is statistically significant to a high level of confidence (< 0.001)

Show code
fit1 <- lm(fertility ~ ppgdp, data = df1)
summary(fit1)

Call:
lm(formula = fertility ~ ppgdp, data = df1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9006 -0.8801 -0.3547  0.6749  3.7585 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.178e+00  1.048e-01  30.331  < 2e-16 ***
ppgdp       -3.201e-05  4.655e-06  -6.877  7.9e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.206 on 197 degrees of freedom
Multiple R-squared:  0.1936,    Adjusted R-squared:  0.1895 
F-statistic: 47.29 on 1 and 197 DF,  p-value: 7.903e-11

Question 2

Annual income, in dollars, is an explanatory variable in a regression analysis. For a British version of the report on the analysis, all responses are converted to British pounds sterling (1 pound equals about 1.33 dollars, as of 2016).

Answer 2

Question 3

Water Runoff in the Sierras

Can Southern California’s water supply in future years be predicted from past data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more efficiently. The data file contains 43 years’ worth of precipitation measurements taken at six sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM.

Answer 3

First, I load in the appropriate data. Then I look at the head() info to preview the dataset, and use the function “?water” to get details on this “California water” data. I can now create a scatterplot matrix for the data using the plot() function. This allows me to make an initial look into the existence of linear relationships in the data.

Show code
#First I load the data

data("water")

#Next, I create the dataframe object from the data

df3 <- water

#Then, I make an initial scatterplot matrix

pairs(df3)

Using the plot() function allows me to identify visually the positive linear correlations between these variables. Those appear to me at first glance to be, leading with the strongest visual representation:

Still clear but weaker positive linear correlations:

There are no apparent positive (or negative!) linear correlations between the ‘Year’ variable and any other variable.

This is helpful in visualizing relationships, but I have better opportunities to visualize this scatterplot matrix in R. I also would like to look at the statistical correlation evaluations of these relationships.

One opportunity is a package called “GGally” that builds upon ggplot2 to look at more than just the general scatterplot matrix. The primary difference between the GGally “ggpairs()” function and the pairs function of base R is that the diagonal consists of the densities of the variables and the upper panels consist of the Pearson correlation coefficients between the variables using the argument “pearson”.

Show code
ggpairs(df3, method = c("everything", "pearson"))

Using this representation, I can see both the visual correlations and the statistical correlations of the pairs. This allows me to compare my predictions to the correlation scores. Highest significant correlations are:

The results confirm that although my visual evaluation using the initial scatterplot matrix was not necessarily 100% accurate, it served as a reliable estimate of positive linear statistical correlation.

Finally, I want to use a third option for visualizing these correlations that is even more simple and graphically pleasing, the “ggcorr()” function in GGally. This method cleanly demonstrates the strong correlations we have evaluated in a simple way.

Show code
ggcorr(df3, method = c("everything", "pearson")) 

Question 4

Professor Ratings

In the website and online forum RateMyProfessors.com, students rate and comment on their instructors. Launched in 1999, the site includes millions of ratings on thousands of instructors. The data file includes the summaries of the ratings of 364 instructors at a large campus in the Midwest (Bleske-Rechek and Fritsch, 2011). Each instructor included in the data had at least 10 ratings over a several year period. Students provided ratings of 1–5 on quality, helpfulness, clarity, easiness of instructor’s courses, and raterInterest in the subject matter covered in the instructor’s courses. The data file provides the averages of these five ratings. Use R to reproduce the scatterplot matrix in Figure 1.13 in the ALR book (page 20). Provide a brief description of the relationships between the five ratings. (The variables don’t have to be in the same order)

Answer 4

First, I load in the appropriate data. Then I look at the head() info to preview the dataset, and use the function “?Rateprof” to get details on this dataset. I create a dataframe of the variables of interest from the Rateprof dataset.

Show code
#First I load the data

data("Rateprof")

#Then create a data frame object of the relevant variables

df4 <- Rateprof %>% 
  select(c(quality, helpfulness, clarity, easiness, raterInterest))

#I preview the data to understand what type of data is represented

head(df4)
   quality helpfulness  clarity easiness raterInterest
1 4.636364    4.636364 4.636364 4.818182      3.545455
2 4.318182    4.545455 4.090909 4.363636      4.000000
3 4.790698    4.720930 4.860465 4.604651      3.432432
4 4.250000    4.458333 4.041667 2.791667      3.181818
5 4.684211    4.684211 4.684211 4.473684      4.214286
6 4.233333    4.266667 4.200000 4.533333      3.916667

Using the pairs() function, I am now able to create a scatterplot matrix that matches the one in the book (ALR, Problem 1.6).

Show code
pairs(df4)

The relationships illustrated by the correlations represented in this scatterplot matrix indicate a strong, positive linear relationship between quality and helpfulness. There is also a strong, positive linear relationship between quality and clarity, but for one outlier. These tell me that as the rankings given to a professor on quality, helpfulness, and clarity each rise, they can expect the other variables in that group to rise as well. HOwever, I cannot guess or speculate as to the cause of these correlations.

In comparison, there is a moderate positive linear relationship between positive ratings of easiness with each variable except for raterInterest. The variable raterInterest has a horizontal correlation with all of the other 4 variables, indicating either no correlation or a very weak correlation between them.

Question 5

For the student.survey data file in the smss package, conduct regression analyses relating (i) y = political ideology and x = religiosity, (ii) y = high school GPA and x = hours of TV watching.

(You can use ?student.survey in the R console, after loading the package, to see what each variable means.)

Answer 5

After installing the “smss” package and loading the relevant library, I again inspect the data and package info.

Show code
#First I load the data

data("student.survey")

#Then I create a data frame with the data set and look at the content

df5 <- student.survey

head(student.survey)
  subj ge ag  hi  co   dh   dr tv sp ne ah    ve pa           pi
1    1  m 32 2.2 3.5    0  5.0  3  5  0  0 FALSE  r conservative
2    2  f 23 2.1 3.5 1200  0.3 15  7  5  6 FALSE  d      liberal
3    3  f 27 3.3 3.0 1300  1.5  0  4  3  0 FALSE  d      liberal
4    4  f 35 3.5 3.2 1500  8.0  5  5  6  3 FALSE  i     moderate
5    5  m 23 3.1 3.5 1600 10.0  6  6  3  0 FALSE  i very liberal
6    6  m 39 3.5 3.5  350  3.0  4  5  7  0 FALSE  d      liberal
            re    ab    aa    ld
1   most weeks FALSE FALSE FALSE
2 occasionally FALSE FALSE    NA
3   most weeks FALSE FALSE    NA
4 occasionally FALSE FALSE FALSE
5        never FALSE FALSE FALSE
6 occasionally FALSE FALSE    NA
Show code
#To plot y = political ideology and x = religiosity, the relative variable names are "pi" and "re".

plot1 <- plot(pi ~ re, data = student.survey)

Show code
#To plot y = high school GPA and x = hours of TV watching, the relative variable names are "hi" and "tv".

plot2 <- plot(hi ~ tv, data = student.survey)

These initial representations really don’t tell me much about the relationships between the variables.

Now I need to do some data cleaning, and convert the categorical variables for political ideology and religiosity into numeric variables. I will also rename the columns for the sake of clarity.

For political ideology (“pi”), “as.integer” represents the categorical variables with values of 1 to 7, starting with very liberal (1) and increasing in assigned value to very conservative (7). For religiousity, “as.integer” represents the categorical variables with values of 1 to 4, starting with never (attending religious services) (1) to every week (4). For religiosity (“re”) this was how often you attend religious services, “never” = 1, “occasionally” = 2, “most weeks” = 3, “every week” = 4

Show code
df5 <- student.survey %>% 
  select(c(hi, tv, pi, re))

df5$pi <- as.integer(as.factor(df5$pi))
df5$re <- as.integer(as.factor(df5$re))

Then I create a sub-frame for the particular pair of variables I want to compare. First, religiosity and political ideology; and then create a scatter plot of the correlation between religiosity and political ideology.

Show code
df5a <- df5 %>%
  rename(Political.Ideology = pi,
         Religiosity = re) %>% 
  select(Political.Ideology, Religiosity)
Show code
gg5a <- ggplot(df5a, aes(x=Religiosity, y=Political.Ideology)) +
  geom_point() +
  geom_smooth(method=lm,se=TRUE,fullrange=TRUE,color="goldenrod") +
   labs(title= "Religiosity & Political Ideology",
        x= "Religiosity", 
        y = "Political Ideology") +
   theme_classic()

gg5a

Again, I create a sub-frame for the specific pair of variables I want to compare. Now I look at high school GPA and hours of TV watched per week and create a scatter plot of the correlation between high school GPA and hours of TV watched per week.

Show code
df5b <- student.survey %>%
  rename(High.School.GPA = hi,
         Hours.TV = tv) %>% 
  select(High.School.GPA, Hours.TV)
Show code
gg5b <- ggplot(df5b, aes(x=Hours.TV, y=High.School.GPA)) +
  geom_point() +
  geom_smooth(method=lm,se=TRUE,fullrange=TRUE,color="goldenrod") +
   labs(title= "High School GPA & Average Hours of TV Watched Per Week",
        x= "Average Number of Hours of TV Watched per Week", 
        y = "High School GPA") +
   theme_classic()

gg5b

Show code
summary(df5)
       hi              tv               pi              re       
 Min.   :2.000   Min.   : 0.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:3.000   1st Qu.: 3.000   1st Qu.:2.000   1st Qu.:1.750  
 Median :3.350   Median : 6.000   Median :2.000   Median :2.000  
 Mean   :3.308   Mean   : 7.267   Mean   :3.033   Mean   :2.167  
 3rd Qu.:3.625   3rd Qu.:10.000   3rd Qu.:4.000   3rd Qu.:3.000  
 Max.   :4.000   Max.   :37.000   Max.   :7.000   Max.   :4.000  

This basic correlation matrix gives an overview of the correlations for religiosity and political ideology, rounded to 2 decimals. This indicates a moderate positive relationship between the variables.

Show code
round(cor(df5a),
  digits = 2 
)
                   Political.Ideology Religiosity
Political.Ideology               1.00        0.58
Religiosity                      0.58        1.00

Using the “lm()” formula, I can confirm that the p-value for this relationship is ~1.22.

Show code
fit5a <- lm(pi ~ re, data = df5)

summary(fit5a)

Call:
lm(formula = pi ~ re, data = df5)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.81243 -0.87160  0.09882  1.12840  3.09882 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.9308     0.4252   2.189   0.0327 *  
re            0.9704     0.1792   5.416 1.22e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.345 on 58 degrees of freedom
Multiple R-squared:  0.3359,    Adjusted R-squared:  0.3244 
F-statistic: 29.34 on 1 and 58 DF,  p-value: 1.221e-06

Again, using a correlation matrix gives an overview of the correlations for high school GPA and hours of TV watched, rounded to 2 decimals. This indicates a weak negative correlation between the variables.

Show code
round(cor(df5b),
  digits = 2 
)
                High.School.GPA Hours.TV
High.School.GPA            1.00    -0.27
Hours.TV                  -0.27     1.00

Using the “lm()” formula, I can confirm that the p-value for this relationship is ~0.04

Show code
fit5b <- lm(tv ~ hi, data = df5)

summary(fit5b)

Call:
lm(formula = tv ~ hi, data = df5)

Residuals:
   Min     1Q Median     3Q    Max 
-8.600 -3.790 -1.167  2.408 27.746 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   20.200      6.175   3.271   0.0018 **
hi            -3.909      1.849  -2.114   0.0388 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.528 on 58 degrees of freedom
Multiple R-squared:  0.07156,   Adjusted R-squared:  0.05555 
F-statistic: 4.471 on 1 and 58 DF,  p-value: 0.03879

The basis analysis of these pairs of data from the “student.survey” data set tells me that if the null hypothesis are:

then I can use the inferential analyses to conclude that I have sufficient evidence at the 95% confidence level to reject both H1 and H2 null hypothesis. However, if I choose to use a 99% confidence level, I would be able to to reject H1, but not to reject H2.

Question 6

For a class of 100 students, the teacher takes the 10 students who perform poorest on the midterm exam and enrolls them in a special tutoring program. The overall class mean is 70 on both the midterm and final, but the mean for the specially tutored students increases from 50 to 60. Use the concept of regression toward the mean to explain why this is not sufficient evidence to imply that the tutoring program was successful. (Here’s a useful hint video: https://www.youtube.com/watch?v=1tSqSMOyNFE)

Answer

No, we do not have enough information to say that this result is the direct effect of the special tutoring program. The increase in scores among the tutored subset may be due to other factors, such as the students having a better day on the day of the second test, among other potential facts.

Moreover, there is the statistical phenomenon at play of regression toward the mean which can help explain how extreme values correct toward the mean in repeated samples. There is a potential for the results to be statistically significant, but without additional information, the explanation that randomization and regression toward the mean remains primary.

This example could potentially be strengthened by more information, such as multiple test results over the semester rather than just two tests or a density plot of the entire class scores. It could also be strengthened by the presence of a control group.