HW2_DACSS603

DACSS 603 Homework 2: Linear Regression

Eliza Geeslin
3/4/2022

Question 1

(Problem 1.1 in ALR)

United Nations (Data file: UN11) The data in the file UN11 contains several variables, including ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries. The data were collected from the United Nations (2011). We will study the dependence of fertility on ppgdp.

1.1.1. Identify the predictor and the response.

The predictor is ppgdp and the response is fertility. So, in the scatterplot we will see that fertility is on the y-axis while ppgdp is on the x-axis.

1.1.2. Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?

ggplot(data = UN11, aes(x = ppgdp, y = fertility)) +
  geom_point(color = "#4FB06D") +
  labs(y = "Fertility (birth rate per 1000 females)", x = "PPGDP ($)") +
  theme(panel.background = element_rect(fill = "#F6F9FC"), 
        axis.text.x = element_text(family = "Avenir", color = "#192733", size=8),
        axis.text.y = element_text(family = "Avenir", color = "#192733", size=8),
        axis.title.y = element_text(family = "Avenir", color = "#192733", size=11),
        axis.title.x = element_text(family = "Avenir", color = "#192733", size=11))

We see that fertility decreases as ppgdp increases up to a certain point, and then fertility appears to flatten out (at around 1-2 births per female). Based on this graph, a straight-line mean function does not seem to be a plausible summary; the relationship is clearly curved.

1.1.3 Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph? If you use a different base of logarithms, the shape of the graph won’t change, but the values on the axes will change.

ggplot(data = UN11, aes(x = log(ppgdp), y = log(fertility))) +
  geom_point(color = "#4FB06D") +
  labs(y = "Fertility (birth rate per 1000 females)", x = "PPGDP ($)") +
  theme(panel.background = element_rect(fill = "#F6F9FC"), 
        axis.text.x = element_text(family = "Avenir", color = "#192733", size=8),
        axis.text.y = element_text(family = "Avenir", color = "#192733", size=8),
        axis.title.y = element_text(family = "Avenir", color = "#192733", size=11),
        axis.title.x = element_text(family = "Avenir", color = "#192733", size=11))

A simple linear regression model does seem more plausible as a summary of this graph. The mean function appears roughly linear, and it is believable that the variance is roughly constant across values of log(ppgdp) (although this is not completely certain). As we would expect, there are a few outliers: localities with either very high or very low fertility for their ppgdp.
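
As a quick supplementary check (not required by the prompt), we can fit the simple linear regression on the log-log scale and overlay the fitted line; this is a minimal sketch, assuming UN11 is loaded as above and omitting the custom theme for brevity.

# fit the simple linear regression of log(fertility) on log(ppgdp)
m_loglog <- lm(log(fertility) ~ log(ppgdp), data = UN11)
summary(m_loglog)

# overlay the fitted line on the log-log scatterplot
ggplot(data = UN11, aes(x = log(ppgdp), y = log(fertility))) +
  geom_point(color = "#4FB06D") +
  geom_smooth(method = "lm", se = FALSE, color = "#192733") +
  labs(y = "log(Fertility)", x = "log(PPGDP)")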

Question 2

(Problem 9.47 in SMSS)

Annual income, in dollars, is an explanatory variable in a regression analysis. For a British version of the report on the analysis, all responses are converted to British pounds sterling (1 pound equals about 1.33 dollars, as of 2016).

(a) How, if at all, does the slope of the prediction equation change?

Assuming that the data themselves are unchanged and the incomes are simply converted to pounds (the study was not redone with British respondents), the new slope will equal the old slope multiplied by 1.33: one British pound represents a larger income interval than one dollar (1 pound ≈ 1.33 dollars), so the predicted change in y per one-pound increase in income is 1.33 times the predicted change per one-dollar increase.

(b) How, if at all, does the correlation change?

The correlation will not change; correlation is unit-free, so the strength of the linear relationship is the same regardless of the currency used to measure income.
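
A small example with simulated data (hypothetical numbers, not the SMSS data) illustrates both answers: dividing the predictor by 1.33 multiplies the slope by 1.33 and leaves the correlation unchanged.

# simulated sketch of the dollars-to-pounds conversion
set.seed(603)
income_dollars <- runif(100, 20000, 120000)               # hypothetical incomes in dollars
y <- 10 + 0.0004 * income_dollars + rnorm(100, sd = 5)    # hypothetical response
income_pounds <- income_dollars / 1.33                    # convert dollars to pounds

coef(lm(y ~ income_dollars))["income_dollars"]  # slope per dollar
coef(lm(y ~ income_pounds))["income_pounds"]    # slope per pound = 1.33 x slope per dollar

cor(y, income_dollars)  # correlation is identical in both currencies
cor(y, income_pounds)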

Question 3

(Problem 1.5 in ALR)

Water runoff in the Sierras (Data file: water) Can Southern California’s water supply in future years be predicted from past data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more efficiently. The data file contains 43 years’ worth of precipitation measurements taken at six sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM. Draw the scatterplot matrix for these data and summarize the information available from these plots.

# use pairs() function

pairs(~ Year + APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + BSAAM, 
      data = water,
      col = "#0068B1",
      pch = 20)

Summary of relationship between variables based on this matrix:
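
As a supplement to the visual summary (not required by the prompt), the pairwise correlations quantify what the scatterplot matrix shows; this assumes water is loaded as in the pairs() call above.

# pairwise correlations among the six precipitation sites and runoff (BSAAM)
round(cor(water[, c("APMAM", "APSAB", "APSLAKE",
                    "OPBPC", "OPRC", "OPSLAKE", "BSAAM")]), 2)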

Question 4

(Problem 1.6 in ALR - slightly modified)

Professor ratings (Data file: Rateprof) In the website and online forum RateMyProfessors.com, students rate and comment on their instructors. Launched in 1999, the site includes millions of ratings on thousands of instructors. The data file includes the summaries of the ratings of 364 instructors at a large campus in the Midwest (Bleske-Rechek and Fritsch, 2011). Each instructor included in the data had at least 10 ratings over a several year period. Students provided ratings of 1–5 on quality, helpfulness, clarity, easiness of instructor’s courses, and raterInterest in the subject matter covered in the instructor’s courses. The data file provides the averages of these five ratings. Use R to reproduce the scatterplot matrix in Figure 1.13 in the ALR book (page 20). Provide a brief description of the relationships between the five ratings. (The variables don’t have to be in the same order)

# find the names of the variables
# names(Rateprof)

#use pairs() to graph matrix
pairs(~ quality + clarity + helpfulness + easiness + raterInterest, 
      data = Rateprof,
      xlim = c(1,5),
      ylim = c(1,5),
      xaxs = "i", 
      yaxs = "i",
      col = "#FF5C35",
      pch = 20)

Summary of relationship between variables based on this matrix:
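
As with the water data, a correlation matrix can supplement the visual summary (an addition, assuming Rateprof is loaded as above).

# numeric correlations among the five average ratings
round(cor(Rateprof[, c("quality", "clarity", "helpfulness",
                       "easiness", "raterInterest")]), 2)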

Question 5

(Problem 9.34 in SMSS)

For the student.survey data file in the smss package, conduct regression analyses relating (i) y = political ideology and x = religiosity, (ii) y = high school GPA and x = hours of TV watching.

(You can use ?student.survey in the R console, after loading the package, to see what each variable means.)

  1. Use graphical ways to portray the individual variables and their relationship.
  2. Interpret descriptive statistics for summarizing the individual variables and their relationship.
  3. Summarize and interpret results of inferential analyses.
# load the data from the smss package
data("student.survey")

# taking a look at each variable.
# ?student.survey

#select the variables we need. 
student.survey <- student.survey %>%
  select(c(pi, re, hi, tv))

str(student.survey)
'data.frame':   60 obs. of  4 variables:
 $ pi: Ord.factor w/ 7 levels "very liberal"<..: 6 2 2 4 1 2 2 2 1 3 ...
 $ re: Ord.factor w/ 4 levels "never"<"occasionally"<..: 3 2 3 2 1 2 2 2 2 1 ...
 $ hi: num  2.2 2.1 3.3 3.5 3.1 3.5 3.6 3 3 4 ...
 $ tv: num  3 15 0 5 6 4 5 5 7 1 ...
head(student.survey)
            pi           re  hi tv
1 conservative   most weeks 2.2  3
2      liberal occasionally 2.1 15
3      liberal   most weeks 3.3  0
4     moderate occasionally 3.5  5
5 very liberal        never 3.1  6
6      liberal occasionally 3.5  4
# graph to portray individual variables and their relationship

plot(pi ~ re, data = student.survey)

Using the str() and head() output and the plot above, we see that political ideology is measured on a seven-point scale: very conservative, conservative, slightly conservative, moderate, slightly liberal, liberal, and very liberal. The religiosity variable records how often the respondent attends religious services; the options are never, occasionally, most weeks, and every week.

# graph to portray individual variables and their relationship

plot(hi ~ tv, data = student.survey)

High school GPA (hi) is measured on a 4-point scale, and tv records the average number of hours of TV watched per week.

Next, in order to summarize these variables with descriptive statistics, I am going to assign numerical codes to the political ideology and religiosity categories as follows:

very conservative = 7, conservative = 6, slightly conservative = 5, moderate = 4, slightly liberal = 3, liberal = 2, very liberal = 1

never = 1, occasionally = 2, most weeks = 3, and every week = 4

# re-assign values as integers and recreate table so I can double check
student.survey <- student.survey %>%
  add_column(pi_num = (as.integer(as.factor(student.survey$pi))), .after = 1)

student.survey <- student.survey %>% 
  add_column(re_num = (as.integer(as.factor(student.survey$re))), .after = 3)
             
head(student.survey)
            pi pi_num           re re_num  hi tv
1 conservative      6   most weeks      3 2.2  3
2      liberal      2 occasionally      2 2.1 15
3      liberal      2   most weeks      3 3.3  0
4     moderate      4 occasionally      2 3.5  5
5 very liberal      1        never      1 3.1  6
6      liberal      2 occasionally      2 3.5  4
#select only numerical values

student.survey <- student.survey %>%
  select(c(pi_num, re_num, hi, tv))

#summarize data
summary(student.survey)
     pi_num          re_num            hi              tv        
 Min.   :1.000   Min.   :1.000   Min.   :2.000   Min.   : 0.000  
 1st Qu.:2.000   1st Qu.:1.750   1st Qu.:3.000   1st Qu.: 3.000  
 Median :2.000   Median :2.000   Median :3.350   Median : 6.000  
 Mean   :3.033   Mean   :2.167   Mean   :3.308   Mean   : 7.267  
 3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:3.625   3rd Qu.:10.000  
 Max.   :7.000   Max.   :4.000   Max.   :4.000   Max.   :37.000  
# use stat.desc() to summarize data
stat.desc(student.survey)
                  pi_num      re_num           hi          tv
nbr.val       60.0000000  60.0000000  60.00000000  60.0000000
nbr.null       0.0000000   0.0000000   0.00000000   5.0000000
nbr.na         0.0000000   0.0000000   0.00000000   0.0000000
min            1.0000000   1.0000000   2.00000000   0.0000000
max            7.0000000   4.0000000   4.00000000  37.0000000
range          6.0000000   3.0000000   2.00000000  37.0000000
sum          182.0000000 130.0000000 198.50000000 436.0000000
median         2.0000000   2.0000000   3.35000000   6.0000000
mean           3.0333333   2.1666667   3.30833333   7.2666667
SE.mean        0.2112201   0.1261482   0.05934157   0.8672043
CI.mean.0.95   0.4226505   0.2524220   0.11874221   1.7352718
var            2.6768362   0.9548023   0.21128531  45.1225989
std.dev        1.6361040   0.9771398   0.45965782   6.7173357
coef.var       0.5393749   0.4509876   0.13893939   0.9244040

Now we have a lot of information:

  1. political ideology - the median ideology was a 2, which is liberal, while the mean was ~3, which is slightly liberal. From stat.desc() we also know that the standard deviation is 1.64 and the 95% confidence interval of the mean is 3.03 ± 0.42.
  2. religiosity - the median is 2 and the mean is ~2.17, which corresponds to occasionally attending religious services. The standard deviation is 0.98 and the 95% confidence interval of the mean is 2.17 ± 0.25.
  3. high school GPA - the median GPA was a 3.35 while the mean was 3.31; the maximum GPA was a 4.0 and the minimum was a 2.0.
  4. hours watching TV - the median hours watched was 6 hours and the mean was 7.27 hours. The minimum hours watched was 0 while the maximum was 37 hours.
# plot the relationship between religiosity and political ideology - scatterplot
ggplot(data = student.survey, aes(x = re_num, y = pi_num)) +
  geom_point(color = "#5C62D6") +
  labs(x = "Religiosity (frequency of attending services)", y = "Political Ideology") +
  theme(panel.background = element_rect(fill = "#F6F9FC"), 
        axis.text.x = element_text(family = "Avenir", color = "#192733", size=8),
        axis.text.y = element_text(family = "Avenir", color = "#192733", size=8),
        axis.title.y = element_text(family = "Avenir", color = "#192733", size=11),
        axis.title.x = element_text(family = "Avenir", color = "#192733", size=11))

Based on this graph, there is only weak visual evidence of a correlation between religiosity (as measured by attendance) and political ideology: respondents in the middle of the scale (slightly conservative, moderate, slightly liberal, and liberal) report everything from never to every-week attendance. However, at the extremes (very liberal and conservative/very conservative) there does appear to be some relationship between ideology and religiosity.

# plot the relationship between HS GPA and hours of TV watched - scatterplot
ggplot(data = student.survey, aes(y = hi, x = tv)) +
  geom_point(color = "#ED2D40") +
  labs(x = "TV watched per week (Avg Hrs.)", y = "High School GPA") +
  theme(panel.background = element_rect(fill = "#F6F9FC"), 
        axis.text.x = element_text(family = "Avenir", color = "#192733", size=8),
        axis.text.y = element_text(family = "Avenir", color = "#192733", size=8),
        axis.title.y = element_text(family = "Avenir", color = "#192733", size=11),
        axis.title.x = element_text(family = "Avenir", color = "#192733", size=11))

This repeats the base-R plot() above, but we can still glean some useful information: in general, most students watch fewer than 15 hours of TV per week on average, and most respondents had a GPA of 3.0 or higher. There does not appear to be a strong relationship between the average number of hours of TV watched per week and high school GPA.

Finally, I used cor.test() to test the correlation between each pair of variables. The null hypothesis H0 is that the true correlation coefficient equals 0, while the alternative HA is that it does not equal 0.

# correlation test between political ideology and religiosity using cor.test()

cor.test(student.survey$pi_num, student.survey$re_num)

    Pearson's product-moment correlation

data:  student.survey$pi_num and student.survey$re_num
t = 5.4163, df = 58, p-value = 1.221e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3818345 0.7265650
sample estimates:
      cor 
0.5795661 

For political ideology and religiosity (how often the respondent attends religious services), the p-value (about 1.2e-06) is well below the 0.05 significance level. Since the p-value is smaller than 0.05, we reject the null hypothesis that there is no linear relationship, and we conclude that the two variables are significantly related. The estimated correlation coefficient is 0.58, which indicates a moderate positive correlation.

# correlation test between GPA and TV using cor.test()

cor.test(student.survey$hi, student.survey$tv)

    Pearson's product-moment correlation

data:  student.survey$hi and student.survey$tv
t = -2.1144, df = 58, p-value = 0.03879
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.48826914 -0.01457694
sample estimates:
       cor 
-0.2675115 

For the relationship between high school GPA and the average hours of TV watched per week, the p-value is 0.039, which is below the 0.05 significance level, so we reject the null hypothesis and conclude that the two variables are significantly related. The evidence here is weaker than for the previous pair (the p-value is much larger), and the estimated correlation is -0.27, a weak negative correlation.
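
Since the problem asks for regression analyses, a sketch of the corresponding simple linear regressions is given below (an addition, using the numeric recodes created earlier; the t-test on each slope is equivalent to the cor.test() results above).

# simple linear regressions corresponding to the two correlation tests
fit_ideology <- lm(pi_num ~ re_num, data = student.survey)
summary(fit_ideology)  # slope t-statistic and p-value match the cor.test above

fit_gpa <- lm(hi ~ tv, data = student.survey)
summary(fit_gpa)       # slope t-statistic and p-value match the cor.test above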

Question 6

(Problem 9.50 in SMSS)

For a class of 100 students, the teacher takes the 10 students who perform poorest on the midterm exam and enrolls them in a special tutoring program. The overall class mean is 70 on both the midterm and final, but the mean for the specially tutored students increases from 50 to 60. Use the concept of regression toward the mean to explain why this is not sufficient evidence to imply that the tutoring program was successful. (Here’s a useful hint video: https://www.youtube.com/watch?v=1tSqSMOyNFE)

We do not have enough evidence to conclude that the tutoring program was responsible for the increased scores. Regression toward the mean is the idea that if one measurement of a random variable is extreme, the next measurement of the same variable is likely to be closer to its mean; extreme results tend to be followed by more typical ones. In this example, the group selected for tutoring was the “extreme” group (the 10 lowest scores on the midterm). On the final, the class average was again 70, and the average of the 10 tutored students rose from 50 to 60, i.e., it regressed toward the class mean. Some of that improvement would be expected even without any tutoring, so the increase alone is not sufficient evidence that the program was successful.
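
A small simulation with hypothetical numbers (not from the problem) illustrates the point: even with no tutoring at all, the 10 lowest midterm scorers tend to score closer to the class mean on the final.

# simulate 100 students: true ability plus independent exam-to-exam noise, no tutoring
set.seed(603)
ability <- rnorm(100, mean = 70, sd = 8)
midterm <- ability + rnorm(100, sd = 8)
final   <- ability + rnorm(100, sd = 8)

lowest10 <- order(midterm)[1:10]  # the 10 weakest midterm scores
mean(midterm[lowest10])           # well below the class mean of ~70
mean(final[lowest10])             # closer to the class mean, with no intervention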