DACSS 603 Homework 2
(Problem 1.1 in ALR)
United Nations (Data file: UN11) The data in the file UN11 contains several variables, including ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries. The data were collected from the United Nations (2011). We will study the dependence of fertility on ppgdp.
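library(alr4) # data sets from ALR: UN11, water, Rateprof
library(dplyr) # %>% and select()
library(knitr) # kable()
library(ggplot2) # ggplot()
data("UN11") # load the UN11 data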
UN11 <- UN11 %>%
select(c(fertility,ppgdp)) # select the two variables to use
dim(UN11)
[1] 199 2
kable(head(UN11), format = "markdown", digits = 10, caption = "**Dependence of Fertility on ppgdp**")
| | fertility | ppgdp |
---|---|---|
Afghanistan | 5.968 | 499.0 |
Albania | 1.525 | 3677.2 |
Algeria | 2.142 | 4473.0 |
Angola | 5.135 | 4321.9 |
Anguilla | 2.000 | 13750.1 |
Argentina | 2.172 | 9162.1 |
1.1.1. Identify the predictor and the response.
Predictor is ppgdp, response is fertility.
1.1.2 Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?
See Figure 1.1 below. From this graph, a straight-line mean function DOES NOT SEEM PLAUSIBLE as a summary. Fertility falls steeply as ppgdp rises among the low-income localities and then levels off, so the association between ppgdp and fertility is negative but clearly nonlinear.
ggplot(UN11, aes(x = ppgdp, y = fertility)) +
geom_point(color=2) +
labs(x="ppgdp-Gross National Product Per Person in U.S. dollars", y="fertility-birth rate per 1000 females", title = "FIGURE 1.1 UN ppgdp vs fertility data in 2009")
1.1.3 Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph? If you use a different base of logarithms, the shape of the graph won’t change, but the values on the axes will change.
See Figure 1.2 below. Yes, a simple linear regression model looks more plausible on this graph. Using the log function produced a graph with a more linear relationship between ppgdp and fertility.
ggplot(UN11, aes(x = log(ppgdp), y = log(fertility))) +
geom_point(color=2) +
geom_smooth(method = "lm") +
labs(x="log(ppgdp), ppgdp = Gross National Product Per Person in U.S. dollars", y="log(fertility), fertility = birth rate per 1000 females", title = "FIGURE 1.2 UN log(ppgdp) and log(fertility) data in 2009")
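The remark about the base of the logarithm can be checked numerically. The sketch below assumes the UN11 data come from the alr4 package (the object names fit_ln and fit_log10 are introduced here for illustration only). It fits the simple linear regression on the natural-log scale and again on the base-10 scale: the slope should be identical, and only the intercept is rescaled by 1/log(10).

```r
library(alr4)  # assumed source of the UN11 data
data("UN11")

# Simple linear regression on the natural-log scale
fit_ln <- lm(log(fertility) ~ log(ppgdp), data = UN11)

# Same model with base-10 logarithms on both axes
fit_log10 <- lm(log10(fertility) ~ log10(ppgdp), data = UN11)

coef(fit_ln)     # intercept and slope, natural logs
coef(fit_log10)  # same slope; intercept divided by log(10)
```

Changing the base multiplies both axes by the same constant, which is why the shape of the scatterplot, and the slope, stay the same.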
(Problem 9.47 in SMSS)
Annual income, in dollars, is an explanatory variable in a regression analysis. For a British version of the report on the analysis, all responses are converted to British pounds sterling (1 pound equals about 1.33 dollars, as of 2016).
(a) How, if at all, does the slope of the prediction equation change?
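The slope will change when income is converted to British pounds. Because 1 pound equals about 1.33 dollars, each value of the explanatory variable is divided by 1.33, so a one-pound increase represents a larger change in income than a one-dollar increase. The new slope of the prediction equation will therefore be 1.33 times the original slope (in US dollars), that is, LARGER in magnitude.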
(b) How, if at all, does the correlation change?
The correlation does not change. Converting the explanatory variable from US dollars to British pounds is a linear rescaling, and correlation DOES NOT depend on the variables' units.
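Both answers can be illustrated with a small simulation. The data below are entirely hypothetical (simulated incomes and a made-up response), so only the ratio of the slopes and the equality of the correlations matter, not the particular numbers.

```r
set.seed(1)

# Hypothetical data: annual income in dollars and an arbitrary response
income_usd <- runif(200, min = 20000, max = 120000)
y <- 5 + 0.0002 * income_usd + rnorm(200)

# Convert the explanatory variable to pounds (1 pound = 1.33 dollars)
income_gbp <- income_usd / 1.33

slope_usd <- coef(lm(y ~ income_usd))[["income_usd"]]
slope_gbp <- coef(lm(y ~ income_gbp))[["income_gbp"]]

slope_gbp / slope_usd                      # ratio of the two slopes
cor(income_usd, y) - cor(income_gbp, y)    # difference between the correlations
```

The slope ratio equals the conversion factor, while the difference between the two correlations is zero up to floating-point error.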
(Problem 1.5 in ALR)
Water runoff in the Sierras (Data file: water) Can Southern California’s water supply in future years be predicted from past data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more efficiently. The data file contains 43 years’ worth of precipitation measurements taken at six sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM. Draw the scatterplot matrix for these data and summarize the information available from these plots.
[1] 43 8
Year | APMAM | APSAB | APSLAKE | OPBPC | OPRC | OPSLAKE | BSAAM |
---|---|---|---|---|---|---|---|
1948 | 9.13 | 3.58 | 3.91 | 4.10 | 7.43 | 6.47 | 54235 |
1949 | 5.28 | 4.82 | 5.20 | 7.55 | 11.11 | 10.26 | 67567 |
1950 | 4.20 | 3.77 | 3.67 | 9.52 | 12.20 | 11.35 | 66161 |
1951 | 4.60 | 4.46 | 3.93 | 11.14 | 15.15 | 11.13 | 68094 |
1952 | 7.15 | 4.99 | 4.88 | 16.34 | 20.05 | 22.81 | 107080 |
1953 | 9.70 | 5.65 | 4.91 | 8.88 | 8.15 | 7.41 | 67594 |
pairs(water,col = 2, main = "Water Runoff in Sierras Scatterplot Matrix")
Call:
lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC +
OPSLAKE, data = water)
Residuals:
Min 1Q Median 3Q Max
-12690 -4936 -1424 4173 18542
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15944.67 4099.80 3.889 0.000416 ***
APMAM -12.77 708.89 -0.018 0.985725
APSAB -664.41 1522.89 -0.436 0.665237
APSLAKE 2270.68 1341.29 1.693 0.099112 .
OPBPC 69.70 461.69 0.151 0.880839
OPRC 1916.45 641.36 2.988 0.005031 **
OPSLAKE 2211.58 752.69 2.938 0.005729 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7557 on 36 degrees of freedom
Multiple R-squared: 0.9248, Adjusted R-squared: 0.9123
F-statistic: 73.82 on 6 and 36 DF, p-value: < 2.2e-16
ANALYSIS:
On residuals, although the minimum and maximum (-12690 and 18542) are far apart and may point to outliers, the residuals look roughly balanced around zero because the first and third quartiles (-4936 and 4173) are similar in magnitude.
On coefficients, sites OPRC and OPSLAKE have Pr(>|t|) < 0.05, indicating that the precipitation measurements at these two sites are statistically significant predictors of BSAAM stream runoff volume when the other sites are held fixed.
Multiple R-squared (0.9248) and adjusted R-squared (0.9123) are close, suggesting that the model is not over-fitted. An adjusted R-squared of 0.9123 indicates a good fit for the model.
Lastly, with a p-value < 2.2e-16 for the F-statistic (73.82 on 6 and 36 DF), we can conclude that the model as a whole is statistically significant. This suggests the model could be used to predict runoff so that engineers, planners, and policy makers could do their jobs more efficiently.
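The regression output above is shown without the code that produced it. A minimal sketch that should reproduce it, assuming the water data come from the alr4 package (the object name fit_water is introduced here for illustration):

```r
library(alr4)  # assumed source of the water data
data("water")

# Regress runoff on precipitation at all six sites, as in the Call shown above
fit_water <- lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE,
                data = water)

summary(fit_water)  # coefficients, R-squared and overall F-test
```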
(Problem 1.6 in ALR - slightly modified)
Professor ratings (Data file: Rateprof) In the website and online forum RateMyProfessors.com, students rate and comment on their instructors. Launched in 1999, the site includes millions of ratings on thousands of instructors. The data file includes the summaries of the ratings of 364 instructors at a large campus in the Midwest (Bleske-Rechek and Fritsch, 2011). Each instructor included in the data had at least 10 ratings over a several year period. Students provided ratings of 1–5 on quality, helpfulness, clarity, easiness of instructor’s courses, and raterInterest in the subject matter covered in the instructor’s courses. The data file provides the averages of these five ratings. Use R to reproduce the scatterplot matrix in Figure 1.13 in the ALR book (page 20).
data(Rateprof)
Rateprof <- Rateprof %>%
select(c(quality,helpfulness,clarity,easiness,raterInterest)) # select the five variables to use
dim(Rateprof)
[1] 366 5
kable(head(Rateprof), format = "markdown", digits = 10, col.names = c('Quality','Helpfulness','Clarity', 'Easiness', 'Rater Interest'), caption = "**Professor Ratings**")
Quality | Helpfulness | Clarity | Easiness | Rater Interest |
---|---|---|---|---|
4.636364 | 4.636364 | 4.636364 | 4.818182 | 3.545455 |
4.318182 | 4.545455 | 4.090909 | 4.363636 | 4.000000 |
4.790698 | 4.720930 | 4.860465 | 4.604651 | 3.432432 |
4.250000 | 4.458333 | 4.041667 | 2.791667 | 3.181818 |
4.684211 | 4.684211 | 4.684211 | 4.473684 | 4.214286 |
4.233333 | 4.266667 | 4.200000 | 4.533333 | 3.916667 |
pairs(Rateprof,col = 2, main = "Professor Ratings Scatterplot Matrix")
Provide a brief description of the relationships between the five ratings. (The variables don’t have to be in the same order)
Based on this scatterplot matrix, we can observe a very strong linear positive correlation between: quality and helpfulness, quality and clarity, helpfulness and clarity.
There is a moderate positive linear correlation between easiness and each of quality, clarity, and helpfulness.
There is a moderate positive linear correlation between raterInterest and each of clarity, helpfulness, and quality.
Easiness and raterInterest have a weak positive linear association.
RP <- cor(Rateprof, use = "all.obs", method = "pearson") # Pearson correlation matrix
kable(RP, format = "markdown", digits = 10, col.names = c('Quality','Helpfulness','Clarity', 'Easiness', 'Rater Interest'), caption = "**Correlation Matrix**")
| | Quality | Helpfulness | Clarity | Easiness | Rater Interest |
---|---|---|---|---|---|
quality | 1.0000000 | 0.9810314 | 0.9759608 | 0.5651154 | 0.4706688 |
helpfulness | 0.9810314 | 1.0000000 | 0.9208070 | 0.5635184 | 0.4630321 |
clarity | 0.9759608 | 0.9208070 | 1.0000000 | 0.5358884 | 0.4611408 |
easiness | 0.5651154 | 0.5635184 | 0.5358884 | 1.0000000 | 0.2052237 |
raterInterest | 0.4706688 | 0.4630321 | 0.4611408 | 0.2052237 | 1.0000000 |
(Problem 9.34 in SMSS)
For the student.survey data file in the smss package, conduct regression analyses relating (i) y = political ideology and x = religiosity, (ii) y = high school GPA and x = hours of TV watching.
(You can use ?student.survey in the R console, after loading the package, to see what each variable means.)
library(smss) # provides the student.survey data
data("student.survey") # load the student.survey data
student.survey <- student.survey %>%
select(c(re,pi,hi,tv)) # select the four variables to use
dim(student.survey)
[1] 60 4
kable(head(student.survey), format = "markdown", digits = 2, col.names = c('Religiosity','Political Ideology','HS GPA', 'Hrs of watching TV'), caption = "**Student Survey Data**")
Religiosity | Political Ideology | HS GPA | Hrs of watching TV |
---|---|---|---|
most weeks | conservative | 2.2 | 3 |
occasionally | liberal | 2.1 | 15 |
most weeks | liberal | 3.3 | 0 |
occasionally | moderate | 3.5 | 5 |
never | very liberal | 3.1 | 6 |
occasionally | liberal | 3.5 | 4 |
ggplot(student.survey, aes(x=re,ymin = 0, ymax = 30, fill=pi)) +
geom_bar() +
labs(x="Religiosity", y="Number of Students", fill="Political Ideology",
title = "FIGURE 2. Political Ideology and Religiosity") +
facet_wrap(vars(re,pi),strip.position = "left") +
theme(axis.text.x = element_text(size = 8, angle = 90))
student.survey %>%
count(pi,re, sort = TRUE) %>%
kable(format = "markdown", row.names = TRUE, digits = 10, col.names = c('Political Ideology','Religiosity','Number of Students'), caption = "**Political Ideology and Religiosity Matrix Count**")
| | Political Ideology | Religiosity | Number of Students |
---|---|---|---|
1 | liberal | occasionally | 14 |
2 | liberal | never | 8 |
3 | moderate | occasionally | 8 |
4 | very liberal | occasionally | 5 |
5 | very liberal | never | 3 |
6 | slightly liberal | never | 2 |
7 | slightly liberal | every week | 2 |
8 | slightly conservative | most weeks | 2 |
9 | slightly conservative | every week | 2 |
10 | conservative | most weeks | 2 |
11 | conservative | every week | 2 |
12 | very conservative | every week | 2 |
13 | liberal | most weeks | 1 |
14 | liberal | every week | 1 |
15 | slightly liberal | occasionally | 1 |
16 | slightly liberal | most weeks | 1 |
17 | moderate | never | 1 |
18 | moderate | most weeks | 1 |
19 | slightly conservative | never | 1 |
20 | slightly conservative | occasionally | 1 |
ggplot(student.survey, aes(x = tv, y = hi)) +
geom_point(color=2) +
geom_smooth(method = "lm") +
labs(x="Hrs of TV", y="HS GPA", title = "FIGURE 3. HS GPA and Hrs of TV")
summary(student.survey)
re pi hi
never :15 very liberal : 8 Min. :2.000
occasionally:29 liberal :24 1st Qu.:3.000
most weeks : 7 slightly liberal : 6 Median :3.350
every week : 9 moderate :10 Mean :3.308
slightly conservative: 6 3rd Qu.:3.625
conservative : 4 Max. :4.000
very conservative : 2
tv
Min. : 0.000
1st Qu.: 3.000
Median : 6.000
Mean : 7.267
3rd Qu.:10.000
Max. :37.000
INTERPRETATION:
Religiosity: the mode is "occasionally" (count: 29). Political ideology: the mode is "liberal" (count: 24). Consistent with this, the visualization of re vs pi (see Figure 2) and the count table show that the most common combination in this sample is occasional religiosity with liberal ideology.
HS GPA: the mean GPA was 3.3, very close to the median of 3.35. The minimum GPA was 2.0 and the maximum was 4.0. Hours of watching TV: the mean was 7.267 hours and the median was 6 hours, so half of the students watch 6 hours or fewer. The minimum was 0 hours and the maximum was 37 hours. Based on Figure 3, there is a negative association between HS GPA and hours of TV watched.
Call:
glm(formula = re ~ pi, family = binomial, data = student.survey)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1460 -0.3500 0.5314 0.9005 0.9695
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.8337 489.4505 0.012 0.990
pi.L 16.2199 1753.3896 0.009 0.993
pi.Q 8.1491 1526.1299 0.005 0.996
pi.C -0.2996 1398.7211 0.000 1.000
pi^4 -4.6817 1304.7376 -0.004 0.997
pi^5 -5.0032 915.6782 -0.005 0.996
pi^6 -3.3188 401.1467 -0.008 0.993
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 67.480 on 59 degrees of freedom
Residual deviance: 60.684 on 53 degrees of freedom
AIC: 74.684
Number of Fisher Scoring iterations: 16
INTERPRETATION
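This fit cannot answer the question as posed. The model reverses the roles asked for in the problem (religiosity is the response and political ideology is the predictor), and because re has four levels, the binomial family compares only "never" against all other levels combined. The very large standard errors (roughly 400 to 1750) and the 16 Fisher scoring iterations suggest separation, so the p-values are unreliable. Large p-values here mean only that we fail to reject the null hypothesis; they DO NOT show that there is no association between religiosity and political ideology.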
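One common alternative, offered here as an assumption rather than as the approach used above, is to treat the ordered categories as integer scores (political ideology 1 to 7, religiosity 1 to 4) and fit a linear regression with political ideology as the response, as the problem asks. The data frame scores and the object fit_pi are hypothetical names introduced for this sketch.

```r
library(smss)  # provides the student.survey data
data("student.survey")

# Convert the ordered factors to integer scores (an assumption: equal spacing)
scores <- data.frame(
  ideology    = as.numeric(student.survey$pi),  # 1 = very liberal ... 7 = very conservative
  religiosity = as.numeric(student.survey$re)   # 1 = never ... 4 = every week
)

# y = political ideology, x = religiosity
fit_pi <- lm(ideology ~ religiosity, data = scores)
summary(fit_pi)
```

Equal spacing between categories is a strong assumption, so the slope is best read as a rough summary of the trend rather than a precise effect.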
Call:
lm(formula = hi ~ tv, data = student.survey)
Residuals:
Min 1Q Median 3Q Max
-1.2583 -0.2456 0.0417 0.3368 0.7051
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.441353 0.085345 40.323 <2e-16 ***
tv -0.018305 0.008658 -2.114 0.0388 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4467 on 58 degrees of freedom
Multiple R-squared: 0.07156, Adjusted R-squared: 0.05555
F-statistic: 4.471 on 1 and 58 DF, p-value: 0.03879
hi tv
hi 1.0000000 -0.2675115
tv -0.2675115 1.0000000
INTERPRETATION
Based on the p-value of 0.0388, we reject the null hypothesis at the 0.05 level and conclude that there is a statistically significant association between HS GPA and hours of watching TV. However, the R-squared is very low (0.07156): hours of TV explain only about 7% of the variation in HS GPA, so the model has little explanatory power.
The correlation of -0.2675 indicates a weak negative association between HS GPA and hours of watching TV.
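The regression and correlation output for part (ii) appear without their code. A minimal sketch that should reproduce them, assuming the smss package is installed (fit_gpa is a name introduced here for illustration):

```r
library(smss)  # provides the student.survey data
data("student.survey")

# y = high school GPA (hi), x = hours of TV watching (tv)
fit_gpa <- lm(hi ~ tv, data = student.survey)
summary(fit_gpa)

# Pearson correlation matrix for the two variables
cor(student.survey[, c("hi", "tv")])
```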
(Problem 9.50 in SMSS)
For a class of 100 students, the teacher takes the 10 students who perform poorest on the midterm exam and enrolls them in a special tutoring program. The overall class mean is 70 on both the midterm and final, but the mean for the specially tutored students increases from 50 to 60. Use the concept of regression toward the mean to explain why this is not sufficient evidence to imply that the tutoring program was successful.
EXPLANATION:
The 10 lowest-scoring students (midterm mean = 50) were selected because their scores were extreme, so their scores are expected to regress upward AFTER the tutoring program, NOT necessarily BECAUSE OF it. A very low midterm score reflects partly ability and partly bad luck, and the bad luck is unlikely to repeat. It is therefore highly probable that these students would have increased their final exam scores (mean = 60) with or without the tutoring program. The effect of the tutoring program on the test scores is INCONCLUSIVE. The same rationale applies to the highest-scoring students on the midterm: they would be expected to score closer to the class mean of 70 on the final exam. For both groups, scores tend to REGRESS TOWARDS THE MEAN of 70 on the final exam.
To determine whether the tutoring program is effective in improving test scores, the study should assign low-scoring students RANDOMLY to tutoring or to a CONTROL group that receives no tutoring, and then compare the two groups' final exam scores. Otherwise, regression to the mean remains a plausible explanation for the increase in test scores of the lowest-performing group.
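Regression toward the mean can be demonstrated with a simulation in which there is no tutoring at all. The sketch below rests on assumptions that are not in the problem: each score is a stable ability plus independent exam-day noise, both normal with a standard deviation of 8, which gives a midterm-final correlation of about 0.5. The function simulate_class and all parameter values are hypothetical.

```r
set.seed(603)

# Simulate one class with NO tutoring effect: score = ability + exam-day luck
simulate_class <- function(n = 100, k = 10) {
  ability <- rnorm(n, mean = 70, sd = 8)
  midterm <- ability + rnorm(n, sd = 8)
  final   <- ability + rnorm(n, sd = 8)
  lowest  <- order(midterm)[seq_len(k)]  # the k poorest midterm performers
  c(midterm = mean(midterm[lowest]), final = mean(final[lowest]))
}

# Average over many simulated classes to smooth out sampling noise
sims <- replicate(2000, simulate_class())
avg  <- rowMeans(sims)
gain <- avg[["final"]] - avg[["midterm"]]

avg   # mean midterm and final scores of the bottom-10 group
gain  # average improvement with no intervention at all
```

Under these assumptions the bottom-10 group improves on the final even though nothing was done for them, which is why the observed rise from 50 to 60 is not by itself evidence that tutoring worked.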