DACSS 603 Homework 2

Homework 2 for DACSS 603

Molly Hackbarth
03-06-2022

Question 1 (Problem 1.1 in ALR)

United Nations (Data file: UN11) The data in the file UN11 contains several variables, including ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries. The data were collected from the United Nations (2011). We will study the dependence of fertility on ppgdp.

1.1.1. Identify the predictor and the response.

1.1.2 Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?

1.1.3 Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph? If you use a different base of logarithms, the shape of the graph won’t change, but the values on the axes will change.

The first thing I did was load the data, which comes from the alr4 package. Once it was loaded, I used select(c()) to keep only the two columns the question asks about: ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth rate per 1000 females, both from the year 2009.

For 1.1.1, the predictor variable is ppgdp and the response variable is fertility.

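# Packages used throughout: alr4 (data sets), dplyr (select, %>%), ggplot2 (plots)
library(alr4)
library(dplyr)
library(ggplot2)

data("UN11")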

un11 <- UN11 %>% 
  select(c(ppgdp, fertility))

p1 <- ggplot(un11, aes(x=ppgdp, y=fertility)) +
  geom_point() +
   labs(title= "Fertility Compared to the Gross National Product in US Dollars",
        x= "GNP (per person in US Dollars)", 
        y = "Fertility (per 1000 females)") +
   theme_minimal()

p1

1.1.2 Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?

In this graph we can see that as GNP per person increases, the birth rate per 1000 females decreases. A straight-line mean function does not seem plausible as a summary of this graph. Fertility drops steeply at low values of ppgdp and then levels off, with many of the points sitting around the 2 mark on the fertility axis across a wide range of ppgdp. The points therefore follow a curve rather than a straight line from the top of the y-axis to the far end of the x-axis.

1.1.3 Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph? If you use a different base of logarithms, the shape of the graph won’t change, but the values on the axes will change.

For question 1.1.3 we apply the log() function to both fertility and ppgdp; by default log() computes the natural logarithm.

p2 <- ggplot(un11, aes(x = log(ppgdp), y = log(fertility))) +
  geom_point() +
  labs(title = "Natural Log of Fertility Compared to Natural Log of Gross National Product in US Dollars",
       x = "log(GNP per person in US Dollars)",
       y = "log(Fertility per 1000 females)") +
  theme_minimal()

p2

With log(ppgdp) on the x-axis and log(fertility) on the y-axis, a simple linear regression model seems plausible: there is a clear downward trend in fertility as GNP per person rises, and a straight downward-sloping line would pass through most of the points, with few outliers.

To check this, I redrew the graph with a fitted line added by geom_smooth() using the linear model method.

p3 <- ggplot(un11, aes(x = log(ppgdp), y = log(fertility))) +
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(title = "Natural Log of Fertility Compared to Natural Log of Gross National Product, with Linear Fit",
       x = "log(GNP per person in US Dollars)",
       y = "log(Fertility per 1000 females)") +
  theme_minimal()
p3

We can also look at a numerical summary using lm(), which stands for linear model, together with summary(). The general form is lm(formula = response ~ predictor, data = data); here the formula fertility ~ ppgdp predicts fertility from ppgdp. Note that this fit uses the untransformed variables, not the logs plotted above.

fit <- lm(fertility ~ ppgdp, data = un11)
summary(fit)

Call:
lm(formula = fertility ~ ppgdp, data = un11)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9006 -0.8801 -0.3547  0.6749  3.7585 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.178e+00  1.048e-01  30.331  < 2e-16 ***
ppgdp       -3.201e-05  4.655e-06  -6.877  7.9e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.206 on 197 degrees of freedom
Multiple R-squared:  0.1936,    Adjusted R-squared:  0.1895 
F-statistic: 47.29 on 1 and 197 DF,  p-value: 7.903e-11

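Here the p-value is 7.903e-11. The three stars mean it falls between 0 and .001, so the slope is statistically significant. The R-squared of 0.1936 is low, however, which is consistent with a straight line being a poor summary of the untransformed data.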
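The summary above is for the untransformed variables, while the plots suggested the log-log relationship is the one a straight line describes well. A minimal sketch of fitting that model follows; it assumes the alr4 package is installed, and fit_log is a name introduced here only for illustration.

```r
# Sketch: simple linear regression on the natural-log scale.
# Assumes the alr4 package (which ships the UN11 data) is installed.
library(alr4)
data("UN11")

# fit_log is a hypothetical name used only in this example
fit_log <- lm(log(fertility) ~ log(ppgdp), data = UN11)

# Same kind of summary as before, now for log(fertility) ~ log(ppgdp)
summary(fit_log)
```

Comparing the R-squared of this fit with the 0.1936 above is one way to check whether the log transformation improves the straight-line summary.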

Question 2 (Problem 9.47 in SMSS)

Annual income, in dollars, is an explanatory variable in a regression analysis. For a British version of the report on the analysis, all responses are converted to British pounds sterling (1 pound equals about 1.33 dollars, as of 2016).

(a) How, if at all, does the slope of the prediction equation change?

(b) How, if at all, does the correlation change?

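Because income is the explanatory variable, converting it from dollars to pounds divides every x value by 1.33 (x in pounds = x in dollars / 1.33). The predicted change in y per unit of x therefore grows by the same factor, so new slope = old slope × 1.33. (If income were the response variable instead, every y value would be divided by 1.33 and the new slope would be the old slope divided by 1.33.)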

The correlation does not change. Correlation is a unit-free measure of how closely the data follow a straight line, so rescaling a variable by a positive constant, as a currency conversion does, leaves it the same.
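The effect of a currency conversion on the slope and the correlation can be checked with a small simulation. This is only a sketch: the incomes and the second variable are made up, and all object names are hypothetical. It covers both possible roles for income, as the explanatory variable and as the response.

```r
# Sketch with simulated (hypothetical) data; not from the textbook.
set.seed(1)
n <- 200
income_usd <- runif(n, 20000, 120000)         # made-up incomes in dollars
other <- 10 + 0.0002 * income_usd + rnorm(n)  # made-up second variable
income_gbp <- income_usd / 1.33               # 1 pound is about 1.33 dollars

# Case 1: income is the explanatory variable (x)
b_x_usd <- unname(coef(lm(other ~ income_usd))[2])
b_x_gbp <- unname(coef(lm(other ~ income_gbp))[2])
ratio_x <- b_x_gbp / b_x_usd   # slope is multiplied by 1.33

# Case 2: income is the response variable (y)
b_y_usd <- unname(coef(lm(income_usd ~ other))[2])
b_y_gbp <- unname(coef(lm(income_gbp ~ other))[2])
ratio_y <- b_y_gbp / b_y_usd   # slope is divided by 1.33

# The correlation is the same in either currency
r_usd <- cor(income_usd, other)
r_gbp <- cor(income_gbp, other)

c(ratio_x = ratio_x, ratio_y = ratio_y, r_usd = r_usd, r_gbp = r_gbp)
```

The two ratios depend only on the conversion factor, not on the simulated values, which is why the result carries over to any data set.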

Question 3 (Problem 1.5 in ALR)

Water runoff in the Sierras (Data file: water) Can Southern California’s water supply in future years be predicted from past data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more efficiently. The data file contains 43 years’ worth of precipitation measurements taken at six sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM. Draw the scatterplot matrix for these data and summarize the information available from these plots.

The first thing I did was pull in the data.

data(water)

waters <- water %>% 
  select(c(APMAM, APSAB, APSLAKE, OPBPC, OPRC, OPSLAKE))

To draw a scatterplot matrix we use the pairs() function, which produces a matrix of scatterplots. We first look at the relationships among just the six sites in the Sierra Nevada mountains, leaving out BSAAM and Year:

pairs(waters)

Let's take an example in the scatterplot matrix: second row, first column. The x-axis is precipitation at APMAM and the y-axis is precipitation at APSAB. Each dot is the amount of precipitation in one of the forty-three years. Here we can see from the visualization that there is a positive linear relationship between precipitation at APMAM and APSAB.

We can see a linear relationship among precipitation at APMAM, APSAB, and APSLAKE across the years. We can also see a linear relationship among precipitation at OPBPC, OPRC, and OPSLAKE across the years.

Now let's look at another example in the scatterplot matrix: second row, fifth column. The x-axis is precipitation at OPRC and the y-axis is precipitation at APSAB. It appears a linear regression line would be close to horizontal, which means there is either no correlation or a weak correlation. The same is true between precipitation at all the AP locations and all the OP locations.

pairs(water)

Let's take an example in the scatterplot matrix: eighth row, seventh column. The x-axis is precipitation at OPSLAKE and the y-axis is stream runoff volume at BSAAM. Each dot represents one of the forty-three years in which these measurements were taken. Here we can see from the visualization that there is a positive linear relationship between precipitation at OPSLAKE and stream runoff volume at BSAAM. There also appears to be a positive linear relationship between precipitation at each of OPBPC and OPRC and the stream runoff volume at BSAAM.

Now let's look at another example in the scatterplot matrix: eighth row, second column. The x-axis is precipitation at APMAM and the y-axis is stream runoff volume at BSAAM. It appears there is either no correlation or a weak correlation between precipitation at APMAM and stream runoff volume at BSAAM. The same is true for precipitation at APSAB and at APSLAKE versus stream runoff volume at BSAAM.

Now let's look at an example in the scatterplot matrix: fifth row, first column. The x-axis is the year and the y-axis is the precipitation at OPBPC. It appears that the regression line would be approximately horizontal, which would signify either no correlation or a very weak correlation between the year and precipitation at OPBPC. The same is true for the other five locations. For BSAAM (eighth row, first column) the regression line also appears approximately horizontal, which would signify either no correlation or a very weak correlation between the year and stream runoff volume at BSAAM.
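The visual impressions from the scatterplot matrix can be backed up with a correlation matrix. This is a sketch that assumes the alr4 package is installed; water_cor is a name introduced here for illustration.

```r
# Sketch: numeric companion to pairs(water).
# Assumes the alr4 package (which ships the water data) is installed.
library(alr4)
data(water)

# Pairwise Pearson correlations of all eight columns, rounded for reading
water_cor <- round(cor(water), 2)
water_cor

# Correlation of every column with runoff at BSAAM, largest first
sort(water_cor[, "BSAAM"], decreasing = TRUE)
```

Each cell of water_cor corresponds to one panel of the scatterplot matrix, so the rows and columns can be read the same way as the plots.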

Question 4 (Problem 1.6 in ALR - slightly modified)

Professor ratings (Data file: Rateprof) In the website and online forum RateMyProfessors.com, students rate and comment on their instructors. Launched in 1999, the site includes millions of ratings on thousands of instructors. The data file includes the summaries of the ratings of 364 instructors at a large campus in the Midwest (Bleske-Rechek and Fritsch, 2011). Each instructor included in the data had at least 10 ratings over a several year period. Students provided ratings of 1–5 on quality, helpfulness, clarity, easiness of instructor’s courses, and raterInterest in the subject matter covered in the instructor’s courses. The data file provides the averages of these five ratings. Use R to reproduce the scatterplot matrix in Figure 1.13 in the ALR book (page 20). Provide a brief description of the relationships between the five ratings. (The variables don’t have to be in the same order)

The first thing I did was pull in the data. Then I selected the columns that were asked for (quality, helpfulness, clarity, easiness of instructor’s courses, and raterInterest). After this I used the pairs() function in order to create a scatterplot matrix.

rf <- Rateprof %>% 
  select(c(quality, helpfulness, clarity, easiness, raterInterest))

pairs(rf)

This scatterplot matrix matches the one in the book.

Let's take an example in the scatterplot matrix: second row, first column. The x-axis is the quality rating out of five and the y-axis is the helpfulness rating out of five. Each dot represents one instructor's average ratings. Here we can see from the visualization that there is a positive linear relationship between the quality rating and the helpfulness rating. There also appears to be a positive linear relationship among the quality, helpfulness, and clarity ratings.

Now let’s take an example in the scatterplot matrix: fifth row, fourth column. The x-axis is the easiness rating out of five and the y-axis is the raterInterest rating out of five. It appears that the regression line would be approximately horizontal, which would signify either no correlation or a very weak correlation between the easiness rating and the raterInterest rating. The same is true for raterInterest and clarity, raterInterest and helpfulness, and raterInterest and quality.

Now let’s take an example in the scatterplot matrix: second row, fourth column. The x-axis is the easiness rating out of five and the y-axis is the helpfulness rating out of five. It appears that the regression line would have a positive slope. However, there is quite a bit of variation around the regression line, because many of the individual points are far from it. This would signify a weak positive correlation between the easiness rating and the helpfulness rating. The same is true for easiness and clarity, and for easiness and quality. As mentioned above, for easiness and raterInterest there is either no correlation or a very weak correlation.

Question 5 (Problem 9.34 in SMSS)

For the student.survey data file in the smss package, conduct regression analyses relating

(i) y = political ideology and x = religiosity,

(ii) y = high school GPA and x = hours of TV watching.

(You can use ?student.survey in the R console, after loading the package, to see what each variable means.)

(a) Use graphical ways to portray the individual variables and their relationship.

(b) Interpret descriptive statistics for summarizing the individual variables and their relationship.

(c) Summarize and interpret results of inferential analyses.

(a) Use graphical ways to portray the individual variables and their relationship.

First I pulled the data. Then I plotted them using the plot() function.

library(smss)
data("student.survey")

# plot pi and re
plot(pi ~ re, data = student.survey)
# plot hi and tv
plot(hi ~ tv, data = student.survey)

(b) Interpret descriptive statistics for summarizing the individual variables and their relationship.

In order to summarize the variables I also need to convert pi and re from categories into numeric codes.

For PI this was what is your political ideology, “very liberal” = 1, “liberal” = 2, “slightly liberal” = 3, “moderate” = 4, “slightly conservative” = 5, “conservative” = 6, “very conservative” = 7

For RE this was how often you attend religious services, “never” = 1, “occasionally” = 2, “most weeks” = 3, “every week” = 4

library(pastecs)  # provides stat.desc()

ss <- student.survey %>% 
  select(c(hi, tv, pi, re))

# Convert the categories to their integer codes
ss$pi <- as.integer(as.factor(ss$pi))
ss$re <- as.integer(as.factor(ss$re))

stat.desc(ss)
                       hi          tv          pi          re
nbr.val       60.00000000  60.0000000  60.0000000  60.0000000
nbr.null       0.00000000   5.0000000   0.0000000   0.0000000
nbr.na         0.00000000   0.0000000   0.0000000   0.0000000
min            2.00000000   0.0000000   1.0000000   1.0000000
max            4.00000000  37.0000000   7.0000000   4.0000000
range          2.00000000  37.0000000   6.0000000   3.0000000
sum          198.50000000 436.0000000 182.0000000 130.0000000
median         3.35000000   6.0000000   2.0000000   2.0000000
mean           3.30833333   7.2666667   3.0333333   2.1666667
SE.mean        0.05934157   0.8672043   0.2112201   0.1261482
CI.mean.0.95   0.11874221   1.7352718   0.4226505   0.2524220
var            0.21128531  45.1225989   2.6768362   0.9548023
std.dev        0.45965782   6.7173357   1.6361040   0.9771398
coef.var       0.13893939   0.9244040   0.5393749   0.4509876

For the high school GPA (4 point scale) we see the minimum GPA was 2 (C) while the maximum was 4 (A). We can also see the median GPA was 3.35 while the mean was a 3.31. The standard deviation is .46 and the 95% confidence interval of the mean value is 3.31 ± .12 (3.19, 3.43).

For the average hours of TV watched per week we see the minimum hours watched was 0 while the maximum hours watched was 37. We can also see that the median hours watched was 6, with the mean hours watched being 7.27. The standard deviation is 6.72 and the 95% confidence interval of the mean is 7.27 ± 1.74.

For political ideology the median was 2, which is liberal, while the mean was 3.03, which is slightly liberal. The standard deviation is 1.64 and the 95% confidence interval of the mean is 3.03 ± .42.

For religiosity (how often you attend religious services) the median is 2 while the mean is 2.17, both of which correspond most closely to occasionally attending religious services. The standard deviation is .98 and the 95% confidence interval of the mean is 2.17 ± .25.
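The ± values quoted above come from the CI.mean.0.95 row of the stat.desc() output, which is the half-width of a t-based confidence interval. As a sketch, that row can be reproduced from the SE.mean row and the sample size of 60; the object names are hypothetical and the standard errors are copied from the table.

```r
# Sketch: rebuild the CI.mean.0.95 row from SE.mean and n.
n <- 60
se_mean <- c(hi = 0.05934157, tv = 0.8672043, pi = 0.2112201, re = 0.1261482)

# Two-sided 95% critical value from the t-distribution with n - 1 df
t_crit <- qt(0.975, df = n - 1)

# Half-width of each interval: critical value times standard error
half_width <- t_crit * se_mean
round(half_width, 2)

# Full interval for one variable, e.g. political ideology (mean 3.0333333)
3.0333333 + c(-1, 1) * half_width["pi"]
```

The rounded half-widths should agree with the CI.mean.0.95 row of the table.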

ssht <- ss %>% 
  select(c("hi", "tv"))

# y = high school GPA, x = hours of TV, matching the problem statement
p5 <- ggplot(ssht, aes(x = tv, y = hi)) +
  geom_point() +
  labs(title = "High School GPA compared to Average Hours of TV Watched Per Week",
       x = "Average Number of Hours of TV Watched per Week",
       y = "High School GPA (4 Point Scale)") +
  theme_minimal()

p5

For high school GPA and hours of TV watched, most students watched fewer than 15 hours of TV per week on average, and most students who took this survey had a GPA of 3.0 or better. From this graph, only a weak relationship between the average number of hours of TV watched and a student’s GPA is visible.

p4 <- ggplot(student.survey, aes(x=re, y=pi)) +
  geom_point() +
   labs(title= "Religiosity compared to Political ideology",
        x= "Religiosity (How often you attend religious services)", 
        y = "Political ideology") +
   theme_minimal()

p4

From this graph there is little visible evidence of a relationship between religiosity and political ideology. Slightly conservative, slightly liberal, and liberal respondents appear at every level of religiosity: never, occasionally, most weeks, and every week. Moderate respondents generally range from never to most weeks.

However, there is a weak pattern at the extremes: the very liberal and the very conservative are opposite in how often they attend religious services. Very liberal respondents attend never or occasionally, while very conservative respondents attend every week.

(c) Summarize and interpret results of inferential analyses.

For the inferential analysis I used cor.test(), Pearson’s correlation test, which tests the association between paired samples using the t-distribution. I ran it twice: once for high school GPA and hours of TV, and once for political ideology and religiosity. The null hypothesis is that the correlation coefficient is 0, and the alternative hypothesis is that the correlation is not equal to 0.

For high school GPA (4 point scale) and the average hours of TV watched per week, the p-value is .039, which is less than the .05 significance level, so the two variables have a statistically significant relationship. The estimated correlation coefficient, a number between -1 and 1 that measures the strength of the linear relationship between the two variables, is -0.268. This is a weak negative correlation.

Since the p-value is .039, we can reject the null hypothesis at the .05 level. Note that if the significance level were .01 instead of .05, we would fail to reject the null hypothesis that there is no correlation between high school GPA and average hours of TV watched per week.

cor.test(ss$hi, ss$tv)

    Pearson's product-moment correlation

data:  ss$hi and ss$tv
t = -2.1144, df = 58, p-value = 0.03879
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.48826914 -0.01457694
sample estimates:
       cor 
-0.2675115 

For political ideology and religiosity (how often you attend religious services), the p-value is 1.221e-06, which is less than the .05 significance level, so the two variables have a statistically significant relationship. The estimated correlation coefficient is 0.580, which is a moderate positive correlation.

Since the p-value is 1.221e-06, we can reject the null hypothesis.

cor.test(ss$pi, ss$re)

    Pearson's product-moment correlation

data:  ss$pi and ss$re
t = 5.4163, df = 58, p-value = 1.221e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3818345 0.7265650
sample estimates:
      cor 
0.5795661 
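The problem asks for regression analyses, while the tests above are correlation tests. In simple linear regression the t-test on the slope gives the same t statistic and p-value as Pearson’s correlation test, so the conclusions carry over, but lm() also reports the fitted slope and intercept. The sketch below assumes the smss package is installed and uses the same integer coding of pi and re as above; ss_num, fit_pi, and fit_gpa are names introduced here for illustration.

```r
# Sketch: the two regressions requested in the problem statement.
# Assumes the smss package (which ships student.survey) is installed.
library(smss)
data("student.survey")

# Keep the four variables and code the categories as integers
ss_num <- student.survey[, c("hi", "tv", "pi", "re")]
ss_num$pi <- as.integer(ss_num$pi)
ss_num$re <- as.integer(ss_num$re)

# (i) y = political ideology, x = religiosity
fit_pi <- lm(pi ~ re, data = ss_num)

# (ii) y = high school GPA, x = hours of TV watching
fit_gpa <- lm(hi ~ tv, data = ss_num)

# Coefficient tables: estimate, standard error, t value, p-value
summary(fit_pi)$coefficients
summary(fit_gpa)$coefficients
```

The sign of each slope matches the sign of the corresponding correlation: positive for ideology on religiosity and negative for GPA on TV hours.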

Question 6 (Problem 9.50 in SMSS)

For a class of 100 students, the teacher takes the 10 students who perform poorest on the midterm exam and enrolls them in a special tutoring program. The overall class mean is 70 on both the midterm and final, but the mean for the specially tutored students increases from 50 to 60. Use the concept of regression toward the mean to explain why this is not sufficient evidence to imply that the tutoring program was successful. (Here’s a useful hint video: https://www.youtube.com/watch?v=1tSqSMOyNFE)

The improvement is not sufficient evidence that the tutoring program was successful, because the tutored students were selected for having extreme scores and there is no comparison group. Regression toward the mean is the idea that if one sample of a random variable is extreme, the next sampling of the same random variable is likely to be closer to its mean. As a result, extreme events are likely to be followed by more typical ones. In this example the “extreme” group (the 10 students who performed the poorest on the midterm exam) “regresses” toward the mean: the class mean was 70 on both exams, and the group’s mean rose from 50 on the midterm to 60 on the final, 10 points closer to the class mean. Because this is what regression toward the mean predicts even without tutoring, we do not know whether tutoring had any effect.

In the Israeli pilot training example in the video, the pilots were trained with negative feedback and improved. However, we do not know that the negative feedback caused the improvement. They may have improved with positive feedback or with no feedback at all.

Additionally, there is always an element of randomness in the final exam. For example, students who were tutored may have been feeling happier that day or may have been told exactly what to study, while higher-scoring students may have been feeling off or studied the wrong things.
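Regression toward the mean can be illustrated with a simulation in which no tutoring happens at all. This is only a sketch under made-up assumptions: scores are normal with mean 70 and standard deviation 10, and the midterm-final correlation rho is set to 0.5. The function simulate_class and all parameter values are hypothetical.

```r
# Sketch: bottom-10 students improve on average with no intervention.
set.seed(603)

simulate_class <- function(n = 100, rho = 0.5, mu = 70, sigma = 10) {
  # Shared ability plus independent exam-day noise gives correlation rho
  ability <- rnorm(n)
  midterm <- mu + sigma * (sqrt(rho) * ability + sqrt(1 - rho) * rnorm(n))
  final   <- mu + sigma * (sqrt(rho) * ability + sqrt(1 - rho) * rnorm(n))

  # The 10 students with the lowest midterm scores
  lowest <- order(midterm)[1:10]
  c(midterm = mean(midterm[lowest]), final = mean(final[lowest]))
}

# Repeat for many simulated classes and average the two group means
sims <- replicate(2000, simulate_class())
avg <- rowMeans(sims)
avg
```

On average the lowest-scoring group’s final mean is higher than its midterm mean even though nothing was done for those students, which is why a comparison group is needed to evaluate the tutoring.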