DACSS 603 Homework 2: Linear Regression
(Problem 1.1 in ALR)
United Nations (Data file: UN11
) The data in the file UN11
contains several variables, including ppgdp
, the gross national product per person in U.S. dollars, and fertility
, the birth rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries. The data were collected from the United Nations (2011). We will study the dependence of fertility
on ppgdp
.
1.1.1. Identify the predictor and the response.
The predictor is ppgdp
and the response is fertility.
So, in the scatterplot we will see that fertility is on the y-axis while ppgdp is on the x-axis.
1.1.2. Draw the scatterplot of fertility
on the vertical axis versus ppgdp
on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?
ggplot(data = UN11, aes(x = ppgdp, y = fertility)) +
geom_point(color = "#4FB06D") +
labs(y = "Fertility (birth rate per 1000 females)", x = "PPGDP ($)") +
theme(panel.background = element_rect(fill = "#F6F9FC"),
axis.text.x = element_text(family = "Avenir", color = "#192733", size=8),
axis.text.y = element_text(family = "Avenir", color = "#192733", size=8),
axis.title.y = element_text(family = "Avenir", color = "#192733", size=11),
axis.title.x = element_text(family = "Avenir", color = "#192733", size=11))
We see that fertility decreases as ppgdp increases to a certain point, and then fertility appears to flatten out (around 1-2 births per female). Based on this graph it seems like a straight-line mean function is not the best summary of this graph.
1.1.3 Draw the scatterplot of log(fertility
) versus log(ppgdp
) using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph? If you use a different base of logarithms, the shape of the graph won’t change, but the values on the axes will change.
ggplot(data = UN11, aes(x = log(ppgdp), y = log(fertility))) +
geom_point(color = "#4FB06D") +
labs(y = "Fertility (birth rate per 1000 females)", x = "PPGDP ($)") +
theme(panel.background = element_rect(fill = "#F6F9FC"),
axis.text.x = element_text(family = "Avenir", color = "#192733", size=8),
axis.text.y = element_text(family = "Avenir", color = "#192733", size=8),
axis.title.y = element_text(family = "Avenir", color = "#192733", size=11),
axis.title.x = element_text(family = "Avenir", color = "#192733", size=11))
A simple linear regression does seem like a more plausible model for this graph. The mean function appears to be linear, and it is believable that there could be consistent varience across the points (although this is not completely certain). As we would expect, we see that there are a few outliers with either very high or very low fertility for their ppgdp.
(Problem 9.47 in SMSS)
Annual income, in dollars, is an explanatory variable in a regression analysis. For a British version of the report on the analysis, all responses are converted to British pounds sterling (1 pound equals about 1.33 dollars, as of 2016).
(a) How, if at all, does the slope of the prediction equation change?
Assuming that the data itself is the same and the study was not re-done with based on income in Britain (just conversion), the new slope will be equal to the old slope divided by 1.33 (each British pound represents a smaller interval).
(b) How, if at all, does the correlation change?
The correlation will not change (i.e. the relationship is the same regardless of the unit of income used to represent it).
(Problem 1.5 in ALR)
Water runoff in the Sierras (Data file: water
) Can Southern California’s water supply in future years be predicted from past data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more efficiently. The data file contains 43 years’ worth of precipitation measurements taken at six sites in the Sierra Nevada mountains (labeled APMAM
, APSAB
, APSLAKE
, OPBPC
, OPRC
, and OPSLAKE
) and stream runoff volume at a site near Bishop, California, labeled BSAAM
. Draw the scatterplot matrix for these data and summarize the information available from these plots.
# use pairs() function
pairs(~ Year + APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + BSAAM,
data = water,
col = "#0068B1",
pch = 20)
Summary of relationship between variables based on this matrix:
Year
does not appear to be related to the the other variables (runoff or water levels).OPBPC
, OPRC
, and OPSLAKE
sites seem to be correlated with each other. The plots that include two of these variables seem to show a dependence that is stronger than any dependence they have with the APSLAKE, APSAB, and APMAM sites.APSLAKE
, APSAB
, and APMAM
sites also seem to be correlated with each other.BSAAM
(the stream runoff) is more correlated with the OPBPC
, OPRC
, and OPSLAKE
group. Perhaps this stream runoff drains to the OPBPC
, OPRC
, and OPSLAKE
sites more than the others.APSLAKE
, APSAB
, and APMAM
sites that we are not accounting for here since they are all closely related to each other but not as much to BSAAM
.(Problem 1.6 in ALR - slightly modified)
Professor ratings (Data file: Rateprof
) In the website and online forum RateMyProfessors.com, students rate and comment on their instructors. Launched in 1999, the site includes millions of ratings on thousands of instructors. The data file includes the summaries of the ratings of 364 instructors at a large campus in the Midwest (Bleske-Rechek and Fritsch, 2011). Each instructor included in the data had at least 10 ratings over a several year period. Students provided ratings of 1–5 on quality, helpfulness, clarity, easiness of instructor’s courses, and raterInterest in the subject matter covered in the instructor’s courses. The data file provides the averages of these five ratings. Use R to reproduce the scatterplot matrix in Figure 1.13 in the ALR book (page 20). Provide a brief description of the relationships between the five ratings. (The variables don’t have to be in the same order)
# find the names of the variables
# names(Rateprof)
#use pairs() to graph matrix
pairs(~ quality + clarity + helpfulness + easiness + raterInterest,
data = Rateprof,
xlim = c(1,5),
ylim = c(1,5),
xaxs = "i",
yaxs = "i",
col = "#FF5C35",
pch = 20)
Summary of relationship between variables based on this matrix:
quality
, clarity
, and helpfulness
variables. This indicates that professors who rate highly in one of these categories may also rate highly in the other two.easiness
and raterInterest
and the rest of the variables.(Problem 9.34 in SMSS)
For the student.survey
data file in the smss package, conduct regression analyses relating (i) y = political ideology and x = religiosity, (ii) y = high school GPA and x = hours of TV watching.
(You can use ?student.survey
in the R console, after loading the package, to see what each variable means.)
# downloading the data
data("student.survey")
# taking a look at each variable.
# ?student.survey
#select the variables we need.
student.survey <- student.survey %>%
select(c(pi, re, hi, tv))
str(student.survey)
'data.frame': 60 obs. of 4 variables:
$ pi: Ord.factor w/ 7 levels "very liberal"<..: 6 2 2 4 1 2 2 2 1 3 ...
$ re: Ord.factor w/ 4 levels "never"<"occasionally"<..: 3 2 3 2 1 2 2 2 2 1 ...
$ hi: num 2.2 2.1 3.3 3.5 3.1 3.5 3.6 3 3 4 ...
$ tv: num 3 15 0 5 6 4 5 5 7 1 ...
head(student.survey)
pi re hi tv
1 conservative most weeks 2.2 3
2 liberal occasionally 2.1 15
3 liberal most weeks 3.3 0
4 moderate occasionally 3.5 5
5 very liberal never 3.1 6
6 liberal occasionally 3.5 4
# graph to portray individual variables and their relationship
plot(pi ~ re, data = student.survey)
Using the datatable()
and descriptive graph above, we see that for political ideology we see that the scale consists of very conservative, conservative, slightly conservative, moderate, slightly liberal, liberal, very liberal. The religiosity variable is how often the respondent attends religious services, and the options are: never, occasionally, most weeks, and every week.
# graph to portray individual variables and their relationship
plot(hi ~ tv, data = student.survey)
High School GPA is on a 4 point scale (from 1 to 4) and TV is hours of watching TV.
Next, in order to summarize these variables using descriptive statistics, I am going to assign numerical values to the political ideology and religiosity values as follows:
very conservative = 7, conservative = 6, slightly conservative = 5, moderate = 4, slightly liberal = 3, liberal = 2, very liberal = 1
never = 1, occasionally = 2, most weeks = 3, and every week = 4
# re-assign values as integers and recreate table so I can double check
student.survey <- student.survey %>%
add_column(pi_num = (as.integer(as.factor(student.survey$pi))), .after = 1)
student.survey <- student.survey %>%
add_column(re_num = (as.integer(as.factor(student.survey$re))), .after = 3)
head(student.survey)
pi pi_num re re_num hi tv
1 conservative 6 most weeks 3 2.2 3
2 liberal 2 occasionally 2 2.1 15
3 liberal 2 most weeks 3 3.3 0
4 moderate 4 occasionally 2 3.5 5
5 very liberal 1 never 1 3.1 6
6 liberal 2 occasionally 2 3.5 4
#select only numerical values
student.survey <- student.survey %>%
select(c(pi_num, re_num, hi, tv))
#summarize data
summary(student.survey)
pi_num re_num hi tv
Min. :1.000 Min. :1.000 Min. :2.000 Min. : 0.000
1st Qu.:2.000 1st Qu.:1.750 1st Qu.:3.000 1st Qu.: 3.000
Median :2.000 Median :2.000 Median :3.350 Median : 6.000
Mean :3.033 Mean :2.167 Mean :3.308 Mean : 7.267
3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:3.625 3rd Qu.:10.000
Max. :7.000 Max. :4.000 Max. :4.000 Max. :37.000
# use stat.desc() to summarize data
stat.desc(student.survey)
pi_num re_num hi tv
nbr.val 60.0000000 60.0000000 60.00000000 60.0000000
nbr.null 0.0000000 0.0000000 0.00000000 5.0000000
nbr.na 0.0000000 0.0000000 0.00000000 0.0000000
min 1.0000000 1.0000000 2.00000000 0.0000000
max 7.0000000 4.0000000 4.00000000 37.0000000
range 6.0000000 3.0000000 2.00000000 37.0000000
sum 182.0000000 130.0000000 198.50000000 436.0000000
median 2.0000000 2.0000000 3.35000000 6.0000000
mean 3.0333333 2.1666667 3.30833333 7.2666667
SE.mean 0.2112201 0.1261482 0.05934157 0.8672043
CI.mean.0.95 0.4226505 0.2524220 0.11874221 1.7352718
var 2.6768362 0.9548023 0.21128531 45.1225989
std.dev 1.6361040 0.9771398 0.45965782 6.7173357
coef.var 0.5393749 0.4509876 0.13893939 0.9244040
Now we have a lot of information:
stat.desc()
means we also know the standard deviation is 1.64 and the 95% confidence interval of the mean is 3.03 ± .24.# plot the relationship between religiosity and political ideology - scatterplot
ggplot(data = student.survey, aes(x = re_num, y = pi_num)) +
geom_point(color = "#5C62D6") +
labs(x = "Religiosity (frequency of attending services)", y = "Political Ideology") +
theme(panel.background = element_rect(fill = "#F6F9FC"),
axis.text.x = element_text(family = "Avenir", color = "#192733", size=8),
axis.text.y = element_text(family = "Avenir", color = "#192733", size=8),
axis.title.y = element_text(family = "Avenir", color = "#192733", size=11),
axis.title.x = element_text(family = "Avenir", color = "#192733", size=11))
Based on this it looks like there is weak evidence of a correlation between religiosity (as measured by attendence) and political ideology from this graph. Slightly conservative, moderate, slightly liberal and liberal respondents vary from never, occasionally, most weeks, and every week service attendance. However, we do see that on either end (very liberal and conservative/very conservative) there is some relationship between ideology and religiosity.
# plot the relationship between HS GPA and hours of TV watched - scatterplot
ggplot(data = student.survey, aes(y = hi, x = tv)) +
geom_point(color = "#ED2D40") +
labs(x = "TV watched per week (Avg Hrs.)", y = "High School GPA") +
theme(panel.background = element_rect(fill = "#F6F9FC"),
axis.text.x = element_text(family = "Avenir", color = "#192733", size=8),
axis.text.y = element_text(family = "Avenir", color = "#192733", size=8),
axis.title.y = element_text(family = "Avenir", color = "#192733", size=11),
axis.title.x = element_text(family = "Avenir", color = "#192733", size=11))
This is a repeat of the summary graph above (plot()
) but we do glean some interesting information. In general, most students are watching less than 15 hours of TV on average per week. Most respondents to this survey had a GPA of 3.0 or greater. It seems like there is not a strong relationship between the average number of hours of TV watched per week and the student’s GPA.
Finally, I used cor.test()
to test the correlation between each of these pairs. HO is that the correlation coefficient = 0, while HA is that the correlation coefficient =/= 0.
# correlation test between political ideology and religiosity using cor.test()
cor.test(student.survey$pi_num, student.survey$re_num)
Pearson's product-moment correlation
data: student.survey$pi_num and student.survey$re_num
t = 5.4163, df = 58, p-value = 1.221e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3818345 0.7265650
sample estimates:
cor
0.5795661
For political ideology and religiosity (how often you attend religious services) there is a p-value of less than the 0.05 significance level. Since the p-value is smaller than 0.05, we can reject the null hypothesis (that there is no relationship). So, we can assume that these two do have a significant relationship to each other. The estimated correlation/correlation coefficient is 0.560 so the correlation shows a modest positive correlation.
# correlation test between GPA and TV using cor.test()
cor.test(student.survey$hi, student.survey$tv)
Pearson's product-moment correlation
data: student.survey$hi and student.survey$tv
t = -2.1144, df = 58, p-value = 0.03879
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.48826914 -0.01457694
sample estimates:
cor
-0.2675115
For the relationship between high school GPA and the average hours of TV watched per week there is a p-value of 0.039 which is less than the 0.05 significance level. Therefore, we can assume that these two do have a significant relationship to each other. Since the p-value is .039 we can reject the null hypothesis. However, the relationship is not as significant as the one we looked at above. Additionally, the estimated correlation is -0.268, so there is a small negative correlation.
(Problem 9.50 in SMSS)
For a class of 100 students, the teacher takes the 10 students who perform poorest on the midterm exam and enrolls them in a special tutoring program. The overall class mean is 70 on both the midterm and final, but the mean for the specially tutored students increases from 50 to 60. Use the concept of regression toward the mean to explain why this is not sufficient evidence to imply that the tutoring program was successful. (Here’s a useful hint video: https://www.youtube.com/watch?v=1tSqSMOyNFE)
We don’t have enough evidence (enough data) to imply that the tutoring program was responsible for the increased scores. Regression towards the mean is the idea that if one sample of a random variable is extreme, the next sampling of the same random variable is likely to be closer to its mean and extreme events are likely to be followed by more typical ones. In the example in the question, the group that was selected to receive tutoring was the “extreme” group (performing the lowest on the midterm exam). On the next exam (the final), the average for the class was 70 and the 10 students in the extreme group’s average regressed toward the mean. Therefore, we do not know if the increase was due to the tutoring.