Regression Background, Simple Linear Regression, & Multiple Regression
United Nations (Data file: UN11) The data in the file UN11 contains several variables, including ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries. The data were collected from the United Nations (2011). We will study the dependence of fertility on ppgdp.
1.1.1. Identify the predictor and the response.
1.1.2 Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?
1.1.3 Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph? If you use a different base of logarithms, the shape of the graph won’t change, but the values on the axes will change.
First, I load in the appropriate data. Then I look at the head() output to preview the dataset and use ?UN11 to get details on the data.
The defining measures of the variables I am examining are:
the predictor, ppgdp, representing "per capita gross domestic product in US dollars", and the response, fertility, representing "number of children per woman".
Next, I create the scatterplot and add a fitted regression line using the geom_smooth() function with the 'lm' method in ggplot2. Using geom_smooth() in this way lets me see whether a straight regression line plausibly summarizes the pattern in the graph.
#First, I load the required packages and the data, use the head() function to preview the dataset, and use "?UN11" to open the help page with additional details on the data represented in the dataset.
library(alr4)      # provides the UN11 data
library(tidyverse) # provides %>%, select(), and ggplot2
data("UN11")
head(UN11)
#Then, I created a dataframe with the relevant variables and plotted them using ggplot2
df1 <- UN11 %>%
select(c(ppgdp, fertility))
gg1a <- ggplot(df1, aes(x=ppgdp, y=fertility)) +
geom_point() +
geom_smooth(method=lm,se=FALSE,fullrange=TRUE,color="goldenrod") +
labs(title= "Fertility and GDP",
x= "GNP (Per Person in US Dollars)",
y = "Fertility (Births Per Woman)") +
theme_classic()
gg1a
Finally, I apply the log() function to both variables and again use geom_smooth() to check whether a straight regression line summarizes the pattern on the log scale.
gg1b <- ggplot(df1, aes(x=log(ppgdp), y=log(fertility))) +
geom_point() +
geom_smooth(method=lm,se=TRUE,fullrange=TRUE,color="goldenrod") +
labs(title= "Log: Fertility and GDP",
x= "GNP (Per Person in US Dollars)",
y = "Fertility (Births Per Woman)") +
theme_classic()
gg1b
Additionally, calling summary() on the fitted lm() model lets me confirm that the slope for ppgdp is statistically significant at a high level of confidence (p < 0.001).
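A minimal sketch of that fit (the object name m1 is my choice; the formula matches the Call line in the output below):
m1 <- lm(fertility ~ ppgdp, data = df1)
summary(m1)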
Call:
lm(formula = fertility ~ ppgdp, data = df1)
Residuals:
Min 1Q Median 3Q Max
-1.9006 -0.8801 -0.3547 0.6749 3.7585
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.178e+00 1.048e-01 30.331 < 2e-16 ***
ppgdp -3.201e-05 4.655e-06 -6.877 7.9e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.206 on 197 degrees of freedom
Multiple R-squared: 0.1936, Adjusted R-squared: 0.1895
F-statistic: 47.29 on 1 and 197 DF, p-value: 7.903e-11
Annual income, in dollars, is an explanatory variable in a regression analysis. For a British version of the report on the analysis, all responses are converted to British pounds sterling (1 pound equals about 1.33 dollars, as of 2016).
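No analysis accompanies this problem here, but the effect of the unit change follows a standard rule: since pounds = dollars / 1.33, the slope measured per pound is 1.33 times the slope measured per dollar, while the correlation is unchanged by any linear rescaling. A minimal simulated sketch (all names and numbers are my own illustration):
# Rescaling the predictor from dollars to pounds multiplies the slope by 1.33
# and leaves the correlation untouched. Simulated data for illustration only.
set.seed(1)
income_dollars <- runif(50, 20000, 120000)
outcome <- 2 + 0.00005 * income_dollars + rnorm(50)
income_pounds <- income_dollars / 1.33
coef(lm(outcome ~ income_dollars))[2] * 1.33   # dollar slope, rescaled
coef(lm(outcome ~ income_pounds))[2]           # pound slope: identical
cor(outcome, income_dollars)                   # equal to the line below
cor(outcome, income_pounds)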
Can Southern California’s water supply in future years be predicted from past data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more efficiently. The data file contains 43 years’ worth of precipitation measurements taken at six sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM.
First, I load in the appropriate data. Then I look at the head() output to preview the dataset and use ?water to get details on this "California water" data. I can now create a scatterplot matrix for the data using the plot() function, which gives an initial look at possible linear relationships in the data.
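A sketch of those steps (assuming the alr4 package, which provides the water data, is already loaded):
data("water")
head(water)
plot(water)  # scatterplot matrix of Year, the six precipitation sites, and BSAAM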
Using the plot() function allows me to visually identify the positive linear correlations between these variables. At first glance, leading with the strongest visual relationships, those appear to be:
Still clear but weaker positive linear correlations:
There are no apparent positive (or negative!) linear correlations between the ‘Year’ variable and any other variable.
This is helpful for visualizing relationships, but R offers better ways to present this scatterplot matrix, and I would also like the statistical correlation values for these relationships.
One option is a package called "GGally", which builds upon ggplot2 to show more than the basic scatterplot matrix. The primary difference between GGally's ggpairs() function and base R's pairs() function is that the diagonal consists of the densities of the variables and the upper panels consist of the Pearson correlation coefficients between the variables (the "pearson" method).
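A sketch of that call (the explicit upper argument mirrors the "pearson" method noted above, which ggpairs() also uses by default):
library(GGally)
ggpairs(water, upper = list(continuous = wrap("cor", method = "pearson")))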
Using this representation, I can see both the visual correlations and the statistical correlations of the pairs, which allows me to compare my predictions to the correlation scores. The highest significant correlations are:
The results confirm that although my visual evaluation using the initial scatterplot matrix was not 100% accurate, it served as a reliable estimate of positive linear statistical correlation.
Finally, I want to use a third option for visualizing these correlations that is even simpler and more graphically pleasing: the ggcorr() function in GGally. This method cleanly displays, in a simple way, the strong correlations we have evaluated.
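A sketch of that call (the label argument, my choice, prints the coefficients on the tiles):
ggcorr(water, label = TRUE)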
In the website and online forum RateMyProfessors.com, students rate and comment on their instructors. Launched in 1999, the site includes millions of ratings on thousands of instructors. The data file includes the summaries of the ratings of 364 instructors at a large campus in the Midwest (Bleske-Rechek and Fritsch, 2011). Each instructor included in the data had at least 10 ratings over a several year period. Students provided ratings of 1–5 on quality, helpfulness, clarity, easiness of instructor’s courses, and raterInterest in the subject matter covered in the instructor’s courses. The data file provides the averages of these five ratings. Use R to reproduce the scatterplot matrix in Figure 1.13 in the ALR book (page 20). Provide a brief description of the relationships between the five ratings. (The variables don’t have to be in the same order)
First, I load in the appropriate data. Then I look at the head() output to preview the dataset and use ?Rateprof to get details on this dataset. I create a dataframe of the variables of interest from the Rateprof dataset.
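A sketch of that step (the dataframe name df4 matches the pairs() call below; the column order matches the printed output):
data("Rateprof")
df4 <- Rateprof %>%
select(quality, helpfulness, clarity, easiness, raterInterest)
head(df4)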
quality helpfulness clarity easiness raterInterest
1 4.636364 4.636364 4.636364 4.818182 3.545455
2 4.318182 4.545455 4.090909 4.363636 4.000000
3 4.790698 4.720930 4.860465 4.604651 3.432432
4 4.250000 4.458333 4.041667 2.791667 3.181818
5 4.684211 4.684211 4.684211 4.473684 4.214286
6 4.233333 4.266667 4.200000 4.533333 3.916667
Using the pairs() function, I am now able to create a scatterplot matrix that matches the one in the book (ALR, Problem 1.6).
pairs(df4)
The relationships represented in this scatterplot matrix indicate a strong, positive linear relationship between quality and helpfulness. There is also a strong, positive linear relationship between quality and clarity, aside from one outlier. These tell me that as an instructor's ratings on quality, helpfulness, and clarity rise, the other variables in that group tend to rise as well. However, I cannot guess or speculate as to the cause of these correlations.
In comparison, easiness has a moderate positive linear relationship with each variable except raterInterest. The scatterplots of raterInterest against the other four variables are roughly horizontal, indicating either no correlation or a very weak correlation between them.
For the student.survey data file in the smss package, conduct regression analyses relating (i) y = political ideology and x = religiosity, (ii) y = high school GPA and x = hours of TV watching.
(You can use ?student.survey in the R console, after loading the package, to see what each variable means.)
After installing the “smss” package and loading the relevant library, I again inspect the data and package info.
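A sketch of those steps:
library(smss)
data("student.survey")
head(student.survey)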
subj ge ag hi co dh dr tv sp ne ah ve pa pi
1 1 m 32 2.2 3.5 0 5.0 3 5 0 0 FALSE r conservative
2 2 f 23 2.1 3.5 1200 0.3 15 7 5 6 FALSE d liberal
3 3 f 27 3.3 3.0 1300 1.5 0 4 3 0 FALSE d liberal
4 4 f 35 3.5 3.2 1500 8.0 5 5 6 3 FALSE i moderate
5 5 m 23 3.1 3.5 1600 10.0 6 6 3 0 FALSE i very liberal
6 6 m 39 3.5 3.5 350 3.0 4 5 7 0 FALSE d liberal
re ab aa ld
1 most weeks FALSE FALSE FALSE
2 occasionally FALSE FALSE NA
3 most weeks FALSE FALSE NA
4 occasionally FALSE FALSE FALSE
5 never FALSE FALSE FALSE
6 occasionally FALSE FALSE NA
#To plot y = political ideology and x = religiosity, the relevant variable names are "pi" and "re".
plot(pi ~ re, data = student.survey)
#To plot y = high school GPA and x = hours of TV watching, the relevant variable names are "hi" and "tv".
plot(hi ~ tv, data = student.survey)
These initial representations really don’t tell me much about the relationships between the variables.
Now I need to do some data cleaning and convert the categorical variables for political ideology and religiosity into numeric variables. I will also rename the columns in the sub-frames below for the sake of clarity.
For political ideology ("pi"), as.integer() assigns the factor levels the values 1 through 7, from very liberal (1) up to very conservative (7). For religiosity ("re"), which records how often the respondent attends religious services, the assigned values are never = 1, occasionally = 2, most weeks = 3, and every week = 4. (as.integer() follows the factor's level order, so this coding depends on the levels being stored in that order.)
df5 <- student.survey %>%
select(c(hi, tv, pi, re))
df5$pi <- as.integer(as.factor(df5$pi))
df5$re <- as.integer(as.factor(df5$re))
Then I create a sub-frame for the particular pair of variables I want to compare, starting with religiosity and political ideology (sketched below), and create a scatter plot of their relationship.
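The sub-frame code did not survive in this writeup; a minimal sketch, with column names reconstructed from the plot and correlation output below:
df5a <- df5 %>%
select(pi, re) %>%
rename(Political.Ideology = pi, Religiosity = re)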
gg5a <- ggplot(df5a, aes(x=Religiosity, y=Political.Ideology)) +
geom_point() +
geom_smooth(method=lm,se=TRUE,fullrange=TRUE,color="goldenrod") +
labs(title= "Religiosity & Political Ideology",
x= "Religiosity",
y = "Political Ideology") +
theme_classic()
gg5a
Again, I create a sub-frame for the specific pair of variables I want to compare, now high school GPA and hours of TV watched per week (sketched below), and create a scatter plot of their relationship.
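As above, a minimal sketch of the missing sub-frame code, with column names reconstructed from the output below:
df5b <- df5 %>%
select(hi, tv) %>%
rename(High.School.GPA = hi, Hours.TV = tv)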
gg5b <- ggplot(df5b, aes(x=Hours.TV, y=High.School.GPA)) +
geom_point() +
geom_smooth(method=lm,se=TRUE,fullrange=TRUE,color="goldenrod") +
labs(title= "High School GPA & Average Hours of TV Watched Per Week",
x= "Average Number of Hours of TV Watched per Week",
y = "High School GPA") +
theme_classic()
gg5b
summary(df5)
hi tv pi re
Min. :2.000 Min. : 0.000 Min. :1.000 Min. :1.000
1st Qu.:3.000 1st Qu.: 3.000 1st Qu.:2.000 1st Qu.:1.750
Median :3.350 Median : 6.000 Median :2.000 Median :2.000
Mean :3.308 Mean : 7.267 Mean :3.033 Mean :2.167
3rd Qu.:3.625 3rd Qu.:10.000 3rd Qu.:4.000 3rd Qu.:3.000
Max. :4.000 Max. :37.000 Max. :7.000 Max. :4.000
This basic correlation matrix gives an overview of the correlation between religiosity and political ideology, rounded to 2 decimals. It indicates a moderate positive relationship between the variables.
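A sketch of the call that produces it (using the renamed sub-frame):
round(cor(df5a), 2)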
Political.Ideology Religiosity
Political.Ideology 1.00 0.58
Religiosity 0.58 1.00
Using the lm() function, I can confirm that the p-value for this relationship is ~1.22e-06, far below the 0.001 threshold.
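A minimal sketch of that fit (the object name m5a is my choice; the formula matches the Call line below):
m5a <- lm(pi ~ re, data = df5)
summary(m5a)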
Call:
lm(formula = pi ~ re, data = df5)
Residuals:
Min 1Q Median 3Q Max
-2.81243 -0.87160 0.09882 1.12840 3.09882
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.9308 0.4252 2.189 0.0327 *
re 0.9704 0.1792 5.416 1.22e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.345 on 58 degrees of freedom
Multiple R-squared: 0.3359, Adjusted R-squared: 0.3244
F-statistic: 29.34 on 1 and 58 DF, p-value: 1.221e-06
Again, a correlation matrix gives an overview of the correlation between high school GPA and hours of TV watched, rounded to 2 decimals. It indicates a weak negative correlation between the variables.
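A sketch of the call that produces it:
round(cor(df5b), 2)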
High.School.GPA Hours.TV
High.School.GPA 1.00 -0.27
Hours.TV -0.27 1.00
Using the lm() function, I can confirm that the p-value for this relationship is ~0.04. (Note that the model below regresses tv on hi, the reverse of the stated y and x; in simple regression, however, the slope's t-statistic and p-value are identical in either direction, so the significance conclusion is unchanged.)
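A minimal sketch of that fit (the object name m5b is my choice; the formula matches the Call line below):
m5b <- lm(tv ~ hi, data = df5)
summary(m5b)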
Call:
lm(formula = tv ~ hi, data = df5)
Residuals:
Min 1Q Median 3Q Max
-8.600 -3.790 -1.167 2.408 27.746
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.200 6.175 3.271 0.0018 **
hi -3.909 1.849 -2.114 0.0388 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.528 on 58 degrees of freedom
Multiple R-squared: 0.07156, Adjusted R-squared: 0.05555
F-statistic: 4.471 on 1 and 58 DF, p-value: 0.03879
The basic analysis of these pairs of data from the "student.survey" data set tells me that if the null hypotheses are:
H1: there is no association between religiosity and political ideology, and
H2: there is no association between hours of TV watched and high school GPA,
then I can use the inferential analyses to conclude that I have sufficient evidence at the 95% confidence level to reject both null hypotheses, H1 and H2. However, if I choose to use a 99% confidence level, I would be able to reject H1 (p ≈ 1.22e-06) but not H2 (p ≈ 0.039).
For a class of 100 students, the teacher takes the 10 students who perform poorest on the midterm exam and enrolls them in a special tutoring program. The overall class mean is 70 on both the midterm and final, but the mean for the specially tutored students increases from 50 to 60. Use the concept of regression toward the mean to explain why this is not sufficient evidence to imply that the tutoring program was successful. (Here’s a useful hint video: https://www.youtube.com/watch?v=1tSqSMOyNFE)
No, we do not have enough information to say that this result is a direct effect of the special tutoring program. The increase in scores among the tutored subset may be due to other factors, such as the students simply having a better day on the day of the second test.
Moreover, the statistical phenomenon of regression toward the mean is at play: students who score at an extreme on one test tend to score closer to the mean on a retest, because part of any extreme score is chance. The tutoring effect could be real, but without additional information, random variation combined with regression toward the mean remains the primary explanation.
This evidence could be strengthened by more information, such as multiple test results over the semester rather than just two tests, or a density plot of the full class's scores. It could also be strengthened by the presence of a control group of similarly low-scoring students who received no tutoring.
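A minimal simulation sketch of the phenomenon (all numbers and names are my own): each student has a fixed true ability, each exam score is that ability plus independent noise, and no tutoring occurs at all; yet the 10 poorest midterm performers still improve on the final.
set.seed(42)
n <- 100
ability <- rnorm(n, mean = 70, sd = 8)  # each student's fixed "true" level
midterm <- ability + rnorm(n, sd = 8)   # observed score = ability + luck
final   <- ability + rnorm(n, sd = 8)   # second test, fresh independent luck
bottom10 <- order(midterm)[1:10]        # the 10 poorest midterm performers
mean(midterm[bottom10])  # well below the class mean
mean(final[bottom10])    # noticeably higher, with no tutoring at all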