Homework # 2 questions and answers for DACSS 603: Introduction to Quantitative Analysis
(Problem 1.1 in ALR) United Nations (Data file: UN11) The data in the file UN11 contains several variables, including ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries. The data were collected from the United Nations (2011). We will study the dependence of fertility on ppgdp.
1.1.1. Identify the predictor and the response.
1.1.2 Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?
1.1.3 Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph? If you use a different base of logarithms, the shape of the graph won’t change, but the values on the axes will change.
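# Load packages and dataset
library(alr4)   # provides the UN11, water, and Rateprof data sets
library(dplyr)  # provides %>% and select()
data("UN11")
# Select variables of focus
UN11 <- UN11 %>%
  select(ppgdp, fertility)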
# Preview data
head(UN11)
              ppgdp fertility
Afghanistan   499.0     5.968
Albania      3677.2     1.525
Algeria      4473.0     2.142
Angola       4321.9     5.135
Anguilla    13750.1     2.000
Argentina    9162.1     2.172
1.1.1
The predictor variable is ppgdp (gross national product per person, in US dollars) and the response variable is fertility (birth rate per 1000 females).
1.1.2
# Create scatterplot
# fertility on vertical axis, ppgdp on horizontal axis
plot(x = UN11$ppgdp, y = UN11$fertility, xlab = 'ppgdp', ylab = 'fertility', main = 'Scatterplot for Question 1.1.2')
The graph shows a steep negative relationship between ppgdp and fertility at low income levels (up to about $10,000 ppgdp); beyond that point, fertility changes very little as ppgdp increases. Because the points follow a sharply curved, L-shaped pattern rather than a line, a straight-line mean function does not seem plausible as a summary of this graph.
1.1.3
# Create scatterplot
# log(fertility) on vertical axis, log(ppgdp) on horizontal axis
plot(x = log(UN11$ppgdp), y = log(UN11$fertility), xlab = 'log(ppgdp)', ylab = 'log(fertility)', main = 'Scatterplot for Question 1.1.3')
A simple linear regression model seems plausible as a summary of this graph. On the log-log scale, the relationship is negative and roughly linear across the whole range of the data, unlike the graph in 1.1.2, which drops sharply at first and then plateaus over most of the plot.
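The remark in the problem statement that a different base of logarithms changes the axis values but not the shape can be checked numerically. The sketch below is illustrative only: it uses just the six rows shown in the head(UN11) preview (the names un_preview, fit_ln, and fit_log10 are made up here), so the fitted numbers say nothing reliable about the full 199-locality data set.

```r
# Six rows copied from the head(UN11) preview above; NOT the full data set.
un_preview <- data.frame(
  ppgdp     = c(499.0, 3677.2, 4473.0, 4321.9, 13750.1, 9162.1),
  fertility = c(5.968, 1.525, 2.142, 5.135, 2.000, 2.172),
  row.names = c("Afghanistan", "Albania", "Algeria",
                "Angola", "Anguilla", "Argentina")
)

# Simple linear regression on the log-log scale, with two different bases
fit_ln    <- lm(log(fertility)   ~ log(ppgdp),   data = un_preview)
fit_log10 <- lm(log10(fertility) ~ log10(ppgdp), data = un_preview)

# Changing the base rescales both axes by the same constant, so the
# slope is unchanged; only the intercept is rescaled.
slope_ln    <- unname(coef(fit_ln)[2])
slope_log10 <- unname(coef(fit_log10)[2])
print(c(natural_log = slope_ln, log10 = slope_log10))
```

With the alr4 package loaded, the same two lm() calls could be run with data = UN11 to fit the model to all localities.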
(Problem 9.47 in SMSS) Annual income, in dollars, is an explanatory variable in a regression analysis. For a British version of the report on the analysis, all responses are converted to British pounds sterling (1 pound equals about 1.33 dollars, as of 2016).
How, if at all, does the slope of the prediction equation change?
How, if at all, does the correlation change?
a.
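The slope of the prediction equation changes: it is the original slope multiplied by 1.33. Income is the explanatory variable, and an income of x dollars equals x / 1.33 pounds, so a one-pound increase in income corresponds to a 1.33-dollar increase. Substituting x_dollars = 1.33 × x_pounds into ŷ = a + b × x_dollars gives ŷ = a + (1.33 × b) × x_pounds; the intercept is unchanged.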
b.
The correlation does not change, because it is a standardized, unit-free version of the slope and is therefore not affected by the units of measurement.
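The invariance of the correlation can be illustrated with a small simulation. Everything in this sketch is hypothetical (the incomes, the response y, and all variable names); only the 1.33 exchange rate comes from the problem statement.

```r
set.seed(947)  # arbitrary seed so the sketch is reproducible

# Hypothetical data: annual income in dollars and some response y
income_usd <- runif(50, min = 20000, max = 120000)
y          <- 10 + 0.0002 * income_usd + rnorm(50)

# Convert the explanatory variable to pounds (1 pound = 1.33 dollars)
income_gbp <- income_usd / 1.33

# Correlation computed in each unit
r_usd <- cor(income_usd, y)
r_gbp <- cor(income_gbp, y)
print(c(dollars = r_usd, pounds = r_gbp))

# r = b * s_x / s_y: the correlation is the slope rescaled by the two
# standard deviations, so the unit of x cancels out.
b_gbp      <- unname(coef(lm(y ~ income_gbp))[2])
r_from_fit <- b_gbp * sd(income_gbp) / sd(y)
print(r_from_fit)
```

The two printed correlations agree, and the value rebuilt from the fitted slope matches them, which is the sense in which the correlation "standardizes" the slope.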
(Problem 1.5 in ALR) Water runoff in the Sierras (Data file: water) Can Southern California’s water supply in future years be predicted from past data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more efficiently. The data file contains 43 years’ worth of precipitation measurements taken at six sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM. Draw the scatterplot matrix for these data and summarize the information available from these plots.
  Year APMAM APSAB APSLAKE OPBPC  OPRC OPSLAKE  BSAAM
1 1948  9.13  3.58    3.91  4.10  7.43    6.47  54235
2 1949  5.28  4.82    5.20  7.55 11.11   10.26  67567
3 1950  4.20  3.77    3.67  9.52 12.20   11.35  66161
4 1951  4.60  4.46    3.93 11.14 15.15   11.13  68094
5 1952  7.15  4.99    4.88 16.34 20.05   22.81 107080
6 1953  9.70  5.65    4.91  8.88  8.15    7.41  67594
# create scatterplot matrix
pairs(water, main = 'Scatterplot Matrix for Question 3')
Looking at this scatterplot matrix, precipitation at the sites whose names begin with ‘A’ (APMAM, APSAB, APSLAKE) appears to be positively and roughly linearly correlated with one another (although the strength is hard to judge from the plot), and the same holds among the sites whose names begin with ‘O’ (OPBPC, OPRC, OPSLAKE). The Year variable does not appear to be related to any of the other variables. Stream runoff (BSAAM) appears to be positively related to precipitation at the ‘O’ sites but shows no notable relationship with precipitation at the ‘A’ sites.
(Problem 1.6 in ALR - slightly modified) Professor ratings (Data file: Rateprof) In the website and online forum RateMyProfessors.com, students rate and comment on their instructors. Launched in 1999, the site includes millions of ratings on thousands of instructors. The data file includes the summaries of the ratings of 364 instructors at a large campus in the Midwest (Bleske-Rechek and Fritsch, 2011). Each instructor included in the data had at least 10 ratings over a several year period. Students provided ratings of 1–5 on quality, helpfulness, clarity, easiness of instructor’s courses, and raterInterest in the subject matter covered in the instructor’s courses. The data file provides the averages of these five ratings. Use R to reproduce the scatterplot matrix in Figure 1.13 in the ALR book (page 20). Provide a brief description of the relationships between the five ratings. (The variables don’t have to be in the same order)
# load dataset, select variables, preview dataset
data(Rateprof)
Rateprof <- Rateprof %>%
  select(quality, clarity, helpfulness, easiness, raterInterest)
head(Rateprof)
   quality  clarity helpfulness easiness raterInterest
1 4.636364 4.636364    4.636364 4.818182      3.545455
2 4.318182 4.090909    4.545455 4.363636      4.000000
3 4.790698 4.860465    4.720930 4.604651      3.432432
4 4.250000 4.041667    4.458333 2.791667      3.181818
5 4.684211 4.684211    4.684211 4.473684      4.214286
6 4.233333 4.200000    4.266667 4.533333      3.916667
# create scatterplot matrix
pairs(Rateprof, main = 'Plot for Question 4')
Referring to the scatterplot matrix of the average professor ratings, quality, clarity, and helpfulness each have a strong positive correlation with one another. Easiness has a much weaker positive correlation with helpfulness, clarity, and quality. Rater interest does not appear to be strongly correlated with any of the other variables. There are a few notable outliers, for example the point on the clarity/quality plot that rates higher for clarity and lower for quality than the overall trend. The strong correlations may reflect a real relationship between these qualities among professors at the selected university (for example, professors perceived as more helpful also tend to be rated as clearer), or they may mean that students associate these qualities with one another and therefore rate them similarly.
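(Problem 9.34 in SMSS) For the student.survey data file in the smss package, conduct regression analyses relating: (i) y = political ideology and x = religiosity, and (ii) y = high school GPA and x = hours of TV watching.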
(You can use ?student.survey in the R console, after loading the package, to see what each variable means.)
Use graphical ways to portray the individual variables and their relationship.
Interpret descriptive statistics for summarizing the individual variables and their relationship.
Summarize and interpret results of inferential analyses.
# load packages and dataset, preview dataset
library(smss)     # provides student.survey
library(ggplot2)  # used for the plots in part a.
data(student.survey)
head(student.survey)
subj ge ag hi co dh dr tv sp ne ah ve pa pi
1 1 m 32 2.2 3.5 0 5.0 3 5 0 0 FALSE r conservative
2 2 f 23 2.1 3.5 1200 0.3 15 7 5 6 FALSE d liberal
3 3 f 27 3.3 3.0 1300 1.5 0 4 3 0 FALSE d liberal
4 4 f 35 3.5 3.2 1500 8.0 5 5 6 3 FALSE i moderate
5 5 m 23 3.1 3.5 1600 10.0 6 6 3 0 FALSE i very liberal
6 6 m 39 3.5 3.5 350 3.0 4 5 7 0 FALSE d liberal
re ab aa ld
1 most weeks FALSE FALSE FALSE
2 occasionally FALSE FALSE NA
3 most weeks FALSE FALSE NA
4 occasionally FALSE FALSE FALSE
5 never FALSE FALSE FALSE
6 occasionally FALSE FALSE NA
a.
# graph: x=religiosity, bar fill=political ideology (bar height is the count)
repa <- student.survey %>%
  select(re, pi)
ggplot(data = repa) +
  geom_bar(mapping = aes(x = re, fill = pi)) +
  labs(title = "Relationship between Religiosity and Political Ideology", x = "Religiosity (how often you attend services)", y = "Count", fill = "Political Ideology (pi)")
# graph: x=hours of watching tv, y=high school gpa
tvhi <- student.survey %>%
  select(tv, hi)
ggplot(data = tvhi) +
  geom_point(mapping = aes(x = tv, y = hi)) +
  labs(title = "Relationship between Hours Watching TV and GPA", x = "Average Hours of TV watched per Week", y = "High School GPA")
b.
# descriptive statistics: religiosity and political ideology
summary(repa)
            re                          pi
 never       :15   very liberal         : 8
 occasionally:29   liberal              :24
 most weeks  : 7   slightly liberal     : 6
 every week  : 9   moderate             :10
                   slightly conservative: 6
                   conservative         : 4
                   very conservative    : 2
Both religiosity and political ideology are concentrated in the lower categories of their scales (a right skew on the ordered scale): most respondents attend services “never” or “occasionally” (44 of 60), and “liberal” (24) and “moderate” (10) are the most common political ideologies.
# descriptive statistics: hours of tv watched and GPA
summary(tvhi)
       tv               hi
 Min.   : 0.000   Min.   :2.000
 1st Qu.: 3.000   1st Qu.:3.000
 Median : 6.000   Median :3.350
 Mean   : 7.267   Mean   :3.308
 3rd Qu.:10.000   3rd Qu.:3.625
 Max.   :37.000   Max.   :4.000
The average hours of TV watched per week has a wide range, and the large distance between the 3rd quartile (10) and the maximum (37) suggests that there is at least one outlier (the scatterplot in part a. shows several). Additionally, the median (6.0) being less than the mean (7.267) suggests a right skew in the data. The summary for high school GPA suggests a roughly symmetric distribution, as the mean and median are similar. However, the mean (3.308) is slightly below the median (3.350), and the 1st quartile is 3.0, so at least three quarters of respondents have a GPA of 3.0 or higher on a scale that runs from 2.0 to 4.0; this indicates a mild left skew.
c.
# inferential analysis: religiosity and political ideology
lmrepa <- lm(data = student.survey, formula = as.numeric(pi) ~ as.numeric(re))
summary(lmrepa)
Call:
lm(formula = as.numeric(pi) ~ as.numeric(re), data = student.survey)
Residuals:
Min 1Q Median 3Q Max
-2.81243 -0.87160 0.09882 1.12840 3.09882
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.9308 0.4252 2.189 0.0327 *
as.numeric(re) 0.9704 0.1792 5.416 1.22e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.345 on 58 degrees of freedom
Multiple R-squared: 0.3359, Adjusted R-squared: 0.3244
F-statistic: 29.34 on 1 and 58 DF, p-value: 1.221e-06
# looking at Pearson's correlation
cor.test(as.numeric(student.survey$re), as.numeric(student.survey$pi))
Pearson's product-moment correlation
data: as.numeric(student.survey$re) and as.numeric(student.survey$pi)
t = 5.4163, df = 58, p-value = 1.221e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3818345 0.7265650
sample estimates:
cor
0.5795661
Because political ideology and religiosity are categorical (ordered factor) variables, we use the as.numeric() function in the linear model to convert them to numeric codes. At a significance level of .01, there is a statistically significant association between religiosity and political ideology (p-value = 1.22e-06 < .01). The correlation is moderate and positive (r ≈ 0.58), suggesting that as the frequency of attending religious services increases, political ideology tends to lean more conservative.
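To make the effect of as.numeric() on a factor concrete, the sketch below builds a small made-up factor with the same four religiosity categories, in the level order shown in the summary(repa) output (the object names and the four example responses are hypothetical). Note that the resulting codes treat adjacent categories as equally spaced, which is an assumption of the analysis rather than something the data guarantee.

```r
# Level order taken from the summary(repa) output above
re_levels <- c("never", "occasionally", "most weeks", "every week")

# Four made-up responses, one per category
re_example <- factor(
  c("most weeks", "never", "every week", "occasionally"),
  levels = re_levels
)

# as.numeric() on a factor returns the position of each value's level
# (1 for the first level, 2 for the second, and so on)
codes <- as.numeric(re_example)
print(data.frame(re = re_example, code = codes))
```

Under this coding, the fitted slope is read as the change in the ideology code associated with moving up one attendance category.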
# inferential analysis: hours of tv and high school gpa
lmtvhi <- lm(data = student.survey, formula = hi ~ tv)
summary(lmtvhi)
Call:
lm(formula = hi ~ tv, data = student.survey)
Residuals:
Min 1Q Median 3Q Max
-1.2583 -0.2456 0.0417 0.3368 0.7051
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.441353 0.085345 40.323 <2e-16 ***
tv -0.018305 0.008658 -2.114 0.0388 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4467 on 58 degrees of freedom
Multiple R-squared: 0.07156, Adjusted R-squared: 0.05555
F-statistic: 4.471 on 1 and 58 DF, p-value: 0.03879
# looking at Pearson's correlation
cor.test(student.survey$tv, student.survey$hi)
Pearson's product-moment correlation
data: student.survey$tv and student.survey$hi
t = -2.1144, df = 58, p-value = 0.03879
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.48826914 -0.01457694
sample estimates:
cor
-0.2675115
With a slope of -.018, there is a negative association between hours of TV watched per week and high school GPA: each additional hour of TV per week is associated with a GPA that is lower by about .018 points on average. The relationship is statistically significant at a significance level of .05 (p-value = .0388), but not at .01. However, the R-squared value is close to 0 (about .072), which means that hours of TV explain only about 7% of the variation in GPA, so the regression model has little predictive power. This is not surprising after looking at the scatterplot of hours of TV watched and GPA, since there does not appear to be a clear linear trend in the data.
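In simple linear regression, the R-squared value is the square of the Pearson correlation, and the t statistic for the slope equals the t statistic of the correlation test. The short check below reproduces the reported figures from the correlation alone; the numbers are copied from the cor.test() output, and the variable names are made up for this sketch.

```r
# Values copied from the cor.test() output above
r  <- -0.2675115   # sample correlation between tv and hi
df <- 58           # degrees of freedom (n - 2, with n = 60)

# R-squared of the simple linear regression is the squared correlation
r_squared <- r^2

# t statistic and two-sided p-value for H0: true correlation = 0
t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df)

print(round(c(r_squared = r_squared, t = t_stat, p = p_value), 4))
```

This is why summary(lmtvhi) and cor.test() report the same t value and p-value for these two variables.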
(Problem 9.50 in SMSS) For a class of 100 students, the teacher takes the 10 students who perform poorest on the midterm exam and enrolls them in a special tutoring program. The overall class mean is 70 on both the midterm and final, but the mean for the specially tutored students increases from 50 to 60. Use the concept of regression toward the mean to explain why this is not sufficient evidence to imply that the tutoring program was successful. (Here’s a useful hint video: https://www.youtube.com/watch?v=1tSqSMOyNFE)
Applying the concept of regression toward the mean to this example, the 10 lowest midterm scores are extreme values, and part of what made them extreme is likely chance (for example, an unusually bad exam day). On the next measurement (in this case the final exam), we would therefore expect those 10 students’ scores to be closer to the class mean, which remained 70 and is well above the tutored students’ midterm mean of 50, even without any intervention. Thus, the increase from 50 to 60 alone does not let us conclude that the tutoring program caused the improvement in the 10 students’ test scores.
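A small simulation can illustrate the argument. This is a sketch under stated assumptions, not a model of the actual class: each student is given a stable "ability" plus independent exam-day noise on each test, and the standard deviations of 8, the seed, and all names are hypothetical choices. No student in the simulation receives any tutoring.

```r
set.seed(603)  # arbitrary seed so the sketch is reproducible

# Simulate one class and return the bottom group's mean on both exams
simulate_class <- function(n = 100, n_low = 10) {
  ability <- rnorm(n, mean = 70, sd = 8)   # stable skill (hypothetical)
  midterm <- ability + rnorm(n, sd = 8)    # skill plus exam-day luck
  final   <- ability + rnorm(n, sd = 8)    # same skill, fresh luck
  low <- order(midterm)[seq_len(n_low)]    # the n_low poorest midterm scores
  c(midterm_low = mean(midterm[low]), final_low = mean(final[low]))
}

# Average over many simulated classes to smooth out sampling noise
sims <- replicate(2000, simulate_class())
avg  <- rowMeans(sims)
gain <- unname(avg["final_low"] - avg["midterm_low"])

print(avg)
print(gain)
```

The students selected for their low midterm scores score higher on the final on average, purely because their bad luck on the midterm is not repeated. A fair evaluation of the tutoring program would need a comparison group of similar low scorers who were not tutored.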