Simple Linear Regression

library(distill)
library(dplyr)
library(tidyverse)
library(knitr)
library(alr4)
library(ggplot2)
library(smss)

Question 1

(Problem 1.1 in ALR)

United Nations (Data file: UN11) The data in the file UN11 contains several variables, including ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries. The data were collected from the United Nations (2011). We will study the dependence of fertility on ppgdp.

data("UN11") # load the UN11 data
UN11 <- UN11 %>%
  select(c(fertility,ppgdp)) # select the two variables to use 
dim(UN11)

[1] 199   2

kable(head(UN11), format = "markdown", digits = 10, caption = "**Dependence of Fertility on ppgdp**")

Table 1: **Dependence of Fertility on ppgdp**
	fertility	ppgdp
Afghanistan	5.968	499.0
Albania	1.525	3677.2
Algeria	2.142	4473.0
Angola	5.135	4321.9
Anguilla	2.000	13750.1
Argentina	2.172	9162.1

1.1.1. Identify the predictor and the response.

Predictor is ppgdp, response is fertility.

1.1.2 Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?

See Figure 1.1 below. From this graph, a straight-line mean function DOES NOT SEEM PLAUSIBLE. The relationship between ppgdp and fertility is negative but strongly curved: fertility drops steeply as ppgdp increases among low-ppgdp localities and then levels off, so a single straight line would not summarize the points well.

ggplot(UN11, aes(x = ppgdp, y = fertility)) +
    geom_point(color=2) + 
    labs(x="ppgdp-Gross National Product Per Person in U.S. dollars", y="fertility-birth rate per 1000 females", title = "FIGURE 1.1 UN ppgdp vs fertility data in 2009")

1.1.3 Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph? If you use a different base of logarithms, the shape of the graph won’t change, but the values on the axes will change.

See Figure 1.2 below. Yes, a simple linear regression model looks much more plausible on this graph. Taking natural logs of both variables produces a roughly linear, negative relationship between log(ppgdp) and log(fertility).

ggplot(UN11, aes(x = log(ppgdp), y = log(fertility))) +
    geom_point(color=2) + 
    geom_smooth(method = "lm") +
    # both axes show natural logs, so the labels say so
    labs(x="log(ppgdp)-log of Gross National Product Per Person in U.S. dollars", y="log(fertility)-log of birth rate per 1000 females", title = "FIGURE 1.2 UN log(ppgdp) and log(fertility) data in 2009")
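The remark in the problem that a different base of logarithms does not change the shape of the graph can be checked with a short sketch. The data here are simulated (the variables `x` and `y` are hypothetical stand-ins, not UN11), so only the comparison between bases matters:

```r
# Simulated positive, right-skewed data; NOT the UN11 values.
set.seed(1)
x <- rlnorm(100, meanlog = 8, sdlog = 1.5)
y <- exp(2 - 0.2 * log(x) + rnorm(100, sd = 0.3))

# Fit the same log-log model with natural logs and with base-10 logs.
fit_ln    <- lm(log(y) ~ log(x))
fit_log10 <- lm(log10(y) ~ log10(x))

slope_ln    <- unname(coef(fit_ln)[2])
slope_log10 <- unname(coef(fit_log10)[2])
cor_ln      <- cor(log(x), log(y))
cor_log10   <- cor(log10(x), log10(y))

# Both axes are rescaled by the same constant 1 / log(10), so the slope
# and the correlation agree; only the intercept is rescaled.
all.equal(slope_ln, slope_log10)
all.equal(cor_ln, cor_log10)
```

Because log10(v) = log(v) / log(10), switching base multiplies both axes by the same constant, which changes the tick values but not the pattern of the points.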

Question 2

(Problem 9.47 in SMSS)

Annual income, in dollars, is an explanatory variable in a regression analysis. For a British version of the report on the analysis, all responses are converted to British pounds sterling (1 pound equals about 1.33 dollars, as of 2016).

(a) How, if at all, does the slope of the prediction equation change?

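The slope will change when the explanatory variable is converted to British pounds. Since 1 pound equals about 1.33 dollars, income in pounds is income in dollars divided by 1.33, so a one-pound increase in income is the same as a 1.33-dollar increase. The new slope of the prediction equation is therefore 1.33 times the original slope (in US dollars), that is, LARGER in magnitude.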

(b) How, if at all, does the correlation change?

No, changing the units of the explanatory variable from US dollars to British pounds will not change the correlation. Correlation is a standardized measure and DOES NOT depend on the variables' units.
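A minimal sketch of both answers, using simulated data (the incomes, the response `y`, and its coefficients are invented for illustration and are not from SMSS):

```r
# Hypothetical data: annual income in dollars and an arbitrary response.
set.seed(1)
income_usd <- runif(200, min = 20000, max = 120000)
y <- 5 + 0.0002 * income_usd + rnorm(200)

# Convert the explanatory variable to pounds (1 pound = 1.33 dollars).
income_gbp <- income_usd / 1.33

slope_usd <- coef(lm(y ~ income_usd))[["income_usd"]]
slope_gbp <- coef(lm(y ~ income_gbp))[["income_gbp"]]

slope_ratio <- slope_gbp / slope_usd   # 1.33, up to floating-point error
cor_usd <- cor(y, income_usd)
cor_gbp <- cor(y, income_gbp)          # identical to cor_usd

slope_ratio
c(cor_usd, cor_gbp)
```

Dividing x by a constant multiplies the slope by that constant and leaves the correlation untouched, whatever the response happens to be.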

Question 3

(Problem 1.5 in ALR)

Water runoff in the Sierras (Data file: water) Can Southern California’s water supply in future years be predicted from past data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more efficiently. The data file contains 43 years’ worth of precipitation measurements taken at six sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM. Draw the scatterplot matrix for these data and summarize the information available from these plots.

data("water") # load the water data
dim(water)

[1] 43  8

kable(head(water), format = "markdown", digits = 10, caption = "**Water runoff in the Sierras**")

Table 2: **Water runoff in the Sierras**
Year	APMAM	APSAB	APSLAKE	OPBPC	OPRC	OPSLAKE	BSAAM
1948	9.13	3.58	3.91	4.10	7.43	6.47	54235
1949	5.28	4.82	5.20	7.55	11.11	10.26	67567
1950	4.20	3.77	3.67	9.52	12.20	11.35	66161
1951	4.60	4.46	3.93	11.14	15.15	11.13	68094
1952	7.15	4.99	4.88	16.34	20.05	22.81	107080
1953	9.70	5.65	4.91	8.88	8.15	7.41	67594

pairs(water,col = 2, main = "Water Runoff in Sierras Scatterplot Matrix")

water1 <-lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE, data = water)
summary(water1)


Call:
lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + 
    OPSLAKE, data = water)

Residuals:
   Min     1Q Median     3Q    Max 
-12690  -4936  -1424   4173  18542 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15944.67    4099.80   3.889 0.000416 ***
APMAM         -12.77     708.89  -0.018 0.985725    
APSAB        -664.41    1522.89  -0.436 0.665237    
APSLAKE      2270.68    1341.29   1.693 0.099112 .  
OPBPC          69.70     461.69   0.151 0.880839    
OPRC         1916.45     641.36   2.988 0.005031 ** 
OPSLAKE      2211.58     752.69   2.938 0.005729 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7557 on 36 degrees of freedom
Multiple R-squared:  0.9248,    Adjusted R-squared:  0.9123 
F-statistic: 73.82 on 6 and 36 DF,  p-value: < 2.2e-16

ANALYSIS:

On residuals, although there is a big gap between the min and max (-12690 and 18542), which may indicate outliers, the distribution looks roughly symmetric because the 1Q and 3Q (-4936 and 4173) are similar in magnitude.

On coefficients, sites OPRC and OPSLAKE have statistically significant values of Pr(>|t|) < 0.05, indicating that precipitation measurements at these two locations are significant predictors of BSAAM stream runoff volume, given the other sites in the model.

Multiple R-squared 0.9248 and adjusted R-squared 0.9123 are relatively close, suggesting the model is not over-fitted. An adjusted R-squared of 0.9123 means the model explains about 91% of the variation in runoff, which indicates a good fit.

Lastly, with an overall F-test p-value < 2.2e-16, we can conclude that the model as a whole is statistically significant. This model can be used to predict runoff so engineers, planners, and policy makers could do their jobs more efficiently.

Question 4

(Problem 1.6 in ALR - slightly modified)

Professor ratings (Data file: Rateprof) In the website and online forum RateMyProfessors.com, students rate and comment on their instructors. Launched in 1999, the site includes millions of ratings on thousands of instructors. The data file includes the summaries of the ratings of 364 instructors at a large campus in the Midwest (Bleske-Rechek and Fritsch, 2011). Each instructor included in the data had at least 10 ratings over a several year period. Students provided ratings of 1–5 on quality, helpfulness, clarity, easiness of instructor’s courses, and raterInterest in the subject matter covered in the instructor’s courses. The data file provides the averages of these five ratings. Use R to reproduce the scatterplot matrix in Figure 1.13 in the ALR book (page 20).

data(Rateprof)
Rateprof <- Rateprof %>%
  select(c(quality,helpfulness,clarity,easiness,raterInterest)) # select the five variables to use
dim(Rateprof)

[1] 366   5

kable(head(Rateprof), format = "markdown", digits = 10, col.names = c('Quality','Helpfulness','Clarity', 'Easiness', 'Rater Interest'), caption = "**Professor Ratings**")

Table 3: **Professor Ratings**
Quality	Helpfulness	Clarity	Easiness	Rater Interest
4.636364	4.636364	4.636364	4.818182	3.545455
4.318182	4.545455	4.090909	4.363636	4.000000
4.790698	4.720930	4.860465	4.604651	3.432432
4.250000	4.458333	4.041667	2.791667	3.181818
4.684211	4.684211	4.684211	4.473684	4.214286
4.233333	4.266667	4.200000	4.533333	3.916667

pairs(Rateprof,col = 2, main = "Professor Ratings Scatterplot Matrix")

Provide a brief description of the relationships between the five ratings. (The variables don’t have to be in the same order)

Based on this scatterplot matrix, we can observe a very strong positive linear correlation between: quality and helpfulness, quality and clarity, helpfulness and clarity.

There is a moderate positive linear correlation between easiness and each of quality, clarity, and helpfulness.

There is a moderate positive linear correlation between raterInterest and each of clarity, helpfulness, and quality.

Easiness and raterInterest have a weak positive linear association.

# cor() uses only the first element of `method`, so name Pearson explicitly
RP <- cor(Rateprof, use = "all.obs", method = "pearson")
kable(RP, format = "markdown", digits = 10, col.names = c('Quality','Helpfulness','Clarity', 'Easiness', 'Rater Interest'), caption = "**Correlation Matrix**")

Table 4: **Correlation Matrix**
	Quality	Helpfulness	Clarity	Easiness	Rater Interest
quality	1.0000000	0.9810314	0.9759608	0.5651154	0.4706688
helpfulness	0.9810314	1.0000000	0.9208070	0.5635184	0.4630321
clarity	0.9759608	0.9208070	1.0000000	0.5358884	0.4611408
easiness	0.5651154	0.5635184	0.5358884	1.0000000	0.2052237
raterInterest	0.4706688	0.4630321	0.4611408	0.2052237	1.0000000

Question 5

(Problem 9.34 in SMSS)

For the student.survey data file in the smss package, conduct regression analyses relating (i) y = political ideology and x = religiosity, (ii) y = high school GPA and x = hours of TV watching.

(You can use ?student.survey in the R console, after loading the package, to see what each variable means.)

data("student.survey") # load the student.survey data
student.survey <- student.survey %>%
  select(c(re,pi,hi,tv)) # select the four variables to use 
dim(student.survey)

[1] 60  4

kable(head(student.survey), format = "markdown", digits = 2, col.names = c('Religiosity','Political Ideology','HS GPA', 'Hrs of watching TV'), caption = "**Student Survey Data**")

Table 5: **Student Survey Data**
Religiosity	Political Ideology	HS GPA	Hrs of watching TV
most weeks	conservative	2.2	3
occasionally	liberal	2.1	15
most weeks	liberal	3.3	0
occasionally	moderate	3.5	5
never	very liberal	3.1	6
occasionally	liberal	3.5	4

A. Graph of Individual Variables and their Relationship

(i) y = political ideology and x = religiosity

ggplot(student.survey, aes(x=re,ymin = 0, ymax = 30, fill=pi)) +
  geom_bar() +
  # bar height is a count of students; political ideology is shown by fill
  labs(x="Religiosity", y="Number of Students", fill="Political Ideology",
  title = "FIGURE 2. Political Ideology and Religiosity") +
  facet_wrap(vars(re,pi),strip.position = "left") +
  theme(axis.text.x = element_text(size = 8, angle = 90))

# count every ideology/religiosity combination and show all rows with row numbers
student.survey %>%
  count(pi, re, sort = TRUE) %>%
  kable(row.names = TRUE, format = "markdown", digits = 10, col.names = c('Political Ideology','Religiosity','Number of Students'), caption = "**Political Ideology and Religiosity Matrix Count**")

Table 6: **Political Ideology and Religiosity Matrix Count**
	Political Ideology	Religiosity	Number of Students
1	liberal	occasionally	14
2	liberal	never	8
3	moderate	occasionally	8
4	very liberal	occasionally	5
5	very liberal	never	3
6	slightly liberal	never	2
7	slightly liberal	every week	2
8	slightly conservative	most weeks	2
9	slightly conservative	every week	2
10	conservative	most weeks	2
11	conservative	every week	2
12	very conservative	every week	2
13	liberal	most weeks	1
14	liberal	every week	1
15	slightly liberal	occasionally	1
16	slightly liberal	most weeks	1
17	moderate	never	1
18	moderate	most weeks	1
19	slightly conservative	never	1
20	slightly conservative	occasionally	1

(ii) y = high school GPA and x = hours of TV watching.

ggplot(student.survey, aes(x = tv, y = hi)) +
    geom_point(color=2) + 
    geom_smooth(method = "lm") +
    labs(x="Hrs of TV", y="HS GPA", title = "FIGURE 3. HS GPA and Hrs of TV")

B. Interpret Descriptive Statistics of Variables and their Relationship.

summary(student.survey)

            re                         pi           hi       
 never       :15   very liberal         : 8   Min.   :2.000  
 occasionally:29   liberal              :24   1st Qu.:3.000  
 most weeks  : 7   slightly liberal     : 6   Median :3.350  
 every week  : 9   moderate             :10   Mean   :3.308  
                   slightly conservative: 6   3rd Qu.:3.625  
                   conservative         : 4   Max.   :4.000  
                   very conservative    : 2                  
       tv        
 Min.   : 0.000  
 1st Qu.: 3.000  
 Median : 6.000  
 Mean   : 7.267  
 3rd Qu.:10.000  
 Max.   :37.000

INTERPRETATION:

Religiosity: the mode is occasionally (count: 29). Political Ideology: the mode is liberal (count: 24). Consistent with this, the visualization of re vs pi (see Figure 2) shows that the most common combination in this sample is occasional religiosity with liberal ideology (14 students, Table 6).

HS GPA: the mean GPA was 3.308, very close to the median of 3.35. The min GPA was 2.0 and the max was 4.0. Hrs of Watching TV: on average, students watched TV for 7.267 hrs, while the median was 6 hrs, so half of the students watched 6 hrs or less. The min was 0 hrs and the max was 37 hrs; a mean above the median indicates a right-skewed distribution. Based on Figure 3, there appears to be a weak negative association between HS GPA and Hrs of TV watched.

C. Summarize and Interpret results of Inferential Analyses

(i) LOGISTIC REGRESSION: y = political ideology and x = religiosity

SS.glm.fit <- glm(re ~ pi, data = student.survey, family = binomial)
summary(SS.glm.fit)


Call:
glm(formula = re ~ pi, family = binomial, data = student.survey)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1460  -0.3500   0.5314   0.9005   0.9695  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)    5.8337   489.4505   0.012    0.990
pi.L          16.2199  1753.3896   0.009    0.993
pi.Q           8.1491  1526.1299   0.005    0.996
pi.C          -0.2996  1398.7211   0.000    1.000
pi^4          -4.6817  1304.7376  -0.004    0.997
pi^5          -5.0032   915.6782  -0.005    0.996
pi^6          -3.3188   401.1467  -0.008    0.993

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 67.480  on 59  degrees of freedom
Residual deviance: 60.684  on 53  degrees of freedom
AIC: 74.684

Number of Fisher Scoring iterations: 16

INTERPRETATION

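None of the p-values is below 0.05, so this fit gives no evidence of an association between religiosity and political ideology. That is weaker than concluding that there is no association, and the fit itself has three problems. First, the formula re ~ pi makes religiosity the response, while the question asks for y = political ideology. Second, with family = binomial and a four-level factor response, R models only the first level ("never") against all other levels combined. Third, the very large standard errors (roughly 400 to 1750) and the 16 Fisher scoring iterations point to quasi-complete separation: in Table 6 no conservative or very conservative student reports "never", so the Wald tests are unreliable. A regression of political ideology on religiosity that uses numeric scores for the ordered categories would address the question more directly.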
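One way to fit y = political ideology on x = religiosity is to assign integer scores to the ordered categories and use lm. The sketch below is an alternative to the glm above, not a reproduction of it; it rebuilds the 60 students from the cell counts in Table 6 so that it runs without the smss package. The equally spaced scores (1 to 7 for ideology, 1 to 4 for religiosity) and the names `counts`, `students`, and `fit_scores` are assumptions made for illustration.

```r
# Cell counts copied from Table 6 (political ideology x religiosity).
counts <- data.frame(
  pi = c("liberal", "liberal", "moderate", "very liberal", "very liberal",
         "slightly liberal", "slightly liberal", "slightly conservative",
         "slightly conservative", "conservative", "conservative",
         "very conservative", "liberal", "liberal", "slightly liberal",
         "slightly liberal", "moderate", "moderate",
         "slightly conservative", "slightly conservative"),
  re = c("occasionally", "never", "occasionally", "occasionally", "never",
         "never", "every week", "most weeks", "every week", "most weeks",
         "every week", "every week", "most weeks", "every week",
         "occasionally", "most weeks", "never", "most weeks",
         "never", "occasionally"),
  n  = c(14, 8, 8, 5, 3, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1)
)

pi_levels <- c("very liberal", "liberal", "slightly liberal", "moderate",
               "slightly conservative", "conservative", "very conservative")
re_levels <- c("never", "occasionally", "most weeks", "every week")

# Expand to one row per student, then attach the assumed integer scores.
students <- counts[rep(seq_len(nrow(counts)), counts$n), ]
students$pi_score <- match(students$pi, pi_levels)   # 1 = very liberal ... 7
students$re_score <- match(students$re, re_levels)   # 1 = never ... 4

fit_scores <- lm(pi_score ~ re_score, data = students)
slope_scores <- unname(coef(fit_scores)["re_score"])
cor_scores <- cor(students$pi_score, students$re_score)

summary(fit_scores)
```

With these scores the slope works out to about 0.97 and the correlation to about 0.58, a positive association: more frequent religious attendance goes with more conservative ideology. With the original data frame, the same model should be lm(as.numeric(pi) ~ as.numeric(re), data = student.survey), since both columns are ordered factors.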

(ii) LINEAR REGRESSION: y = high school GPA and x = hours of TV watching

student.surveyii <-lm(hi ~ tv, data = student.survey)
summary(student.surveyii)


Call:
lm(formula = hi ~ tv, data = student.survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2583 -0.2456  0.0417  0.3368  0.7051 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.441353   0.085345  40.323   <2e-16 ***
tv          -0.018305   0.008658  -2.114   0.0388 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4467 on 58 degrees of freedom
Multiple R-squared:  0.07156,   Adjusted R-squared:  0.05555 
F-statistic: 4.471 on 1 and 58 DF,  p-value: 0.03879

student.survey1 <- student.survey %>%
  select(c(hi,tv))
cor(student.survey1)

           hi         tv
hi  1.0000000 -0.2675115
tv -0.2675115  1.0000000

INTERPRETATION

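Based on the p-value of 0.0388 for the tv slope, we reject the null hypothesis of a zero slope at the 0.05 level. Therefore, we can conclude that there is a statistically significant association between HS GPA and hours of watching TV. The estimated slope is -0.018: each additional hour of TV is associated with a predicted HS GPA about 0.018 points lower. However, with a very low R-squared (0.07156), hours of TV explain only about 7% of the variation in HS GPA, so the model has little predictive power.

Based on the correlation of -0.2675, we can conclude that there is a weak negative association between HS GPA and hrs of watching TV.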
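The regression output and the correlation matrix tell the same story, because in simple linear regression R-squared is the square of the correlation and the F-statistic is the square of the slope's t value. A quick check using only the numbers printed above:

```r
# Values copied from the cor() and summary(lm) output above.
r       <- -0.2675115   # correlation between hi and tv
t_value <- -2.114       # t value of the tv slope (rounded in the output)

r_squared <- r^2         # compare with Multiple R-squared: 0.07156
f_from_t  <- t_value^2   # compare with F-statistic: 4.471 (differs slightly
                         # because t_value is rounded to three decimals)

round(r_squared, 5)
round(f_from_t, 3)
```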

Question 6

(Problem 9.50 in SMSS)

For a class of 100 students, the teacher takes the 10 students who perform poorest on the midterm exam and enrolls them in a special tutoring program. The overall class mean is 70 on both the midterm and final, but the mean for the specially tutored students increases from 50 to 60. Use the concept of regression toward the mean to explain why this is not sufficient evidence to imply that the tutoring program was successful.

EXPLANATION:

The 10 lowest-scoring students on the midterm (mean = 50) were selected BECAUSE their scores were extreme. An exam score reflects both ability and chance, so the students at the bottom are, on average, the ones who were also unlucky. Because midterm and final scores are not perfectly correlated, their final scores are expected to fall closer to the class mean of 70, with or without the tutoring program. The increase from 50 to 60 comes AFTER the tutoring program, but NOT necessarily BECAUSE OF it, so the effect of the program on the test scores is INCONCLUSIVE. The same rationale applies to the highest-scoring students on the midterm: chances are that they will score closer to the mean of 70 on the final exam. In both groups, scores are expected to REGRESS TOWARD THE MEAN of 70 on the final exam.

To determine whether the tutoring program is effective in improving test scores, the study should assign low-scoring students RANDOMLY to a tutored group and an untutored CONTROL group, and then compare the final exam scores of the two groups. Without such a comparison, regression toward the mean remains a sufficient explanation for the increase in the test scores of the lowest-performing group.
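Regression toward the mean can be illustrated with a small simulation in which nobody is tutored. The setup is an assumption made for illustration: each score is a student's ability (mean 70, sd 8) plus independent exam-day noise (sd 8), and the function `simulate_class` is hypothetical.

```r
set.seed(42)

# One simulated class: pick the k lowest midterm scorers and report their
# mean midterm and mean final score. No tutoring effect is simulated.
simulate_class <- function(n = 100, k = 10) {
  ability <- rnorm(n, mean = 70, sd = 8)
  midterm <- ability + rnorm(n, sd = 8)   # ability plus chance
  final   <- ability + rnorm(n, sd = 8)   # same ability, fresh chance
  lowest  <- order(midterm)[seq_len(k)]
  c(midterm = mean(midterm[lowest]), final = mean(final[lowest]))
}

# Average over many classes to smooth out simulation noise.
sims <- replicate(2000, simulate_class())
avg  <- rowMeans(sims)
avg
```

Under these assumptions the bottom 10 students score well below 70 on the midterm and noticeably closer to 70 on the final, even though no intervention took place. An observed gain in the tutored group has to be compared against this baseline before it can be credited to the program.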