Using the UN11 data from the alr4 package to answer the following questions about the relationship between ppgdp and fertility.
The explanatory, or predictor, variable would be the ppgdp as we suspect that may be the cause of differing fertility levels between the countries. The response variable is fertility
fertility on the vertical axis versus ppgdp on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?At the current scale of the graph (below), it does not appear that a straight line would make sense for the graph. Many data points are clustered vertically near y-axis, though you could potentially squint and see a straight line if the plot was stretched wider in that area, with the density of data points as the clusters of observations bend towards parallel with x-axis towards the end. I think that’s what we’re doing next!
log(fertility) versus log(ppgdp) using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph? If you use a different base of logarithms, the shape of the graph won’t change, but the values on the axes will change.After using the logs of the variables, a straight line definitely makes sense for modeling this relationship.The graph is in the Question 3 portion of the R code below.
## Question 2
#Scatter plot of fertility per person
data("UN11")
plot(y = UN11$fertility, x = UN11$ppgdp)
Annual income, in dollars, is an explanatory variable in a regression analysis. For a British version of the report on the analysis, all responses are converted to British pounds sterling (1 pound equals about 1.33 dollars, as of 2016).
-A) How, if at all, does the slope of the prediction equation change?
The slope would decrease from its value for US dollars given the exchange rate. Changing the units for the explanatory variable also changes the measurement unit output of our y variable, since each pound is 1.33 of our initial dollar scale, the slope would need to go down slightly to cover that exchange rate. Dividing the old currency by new value gives us 0.7518797, which we could multiply the old slope by to get the new value.
-B)How, if at all, does the correlation change? The correlation would not change at all, as it does not depend on the units if measurement and describes the strength of the relationship. Since it is a standardized version of that relationship, this change shouldn’t impact the value.
Can Southern California’s water supply in future years be predicted from past data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more efficiently… Draw the scatterplot matrix for these data and summarize the information available from these plots.
I made several versions of the scatter plot matrix below, one of which to play around with the psych package and become more familiar with its uses.
The first thing I noticed from the scatter plots is that the year variable didn’t seem to be super correlated or important to the overall question we were trying to answer with this data. I made another matrix of plots dropping that variable to focus on the measurement points and runoff. Because the runoff was on a much different scale, I used the log() function on the whole table of observations when I made the plots using the psych package. I also noticed that one year must have had flooding with lots of rain or something as there is a dot in the top right-ish portion of just about every graph combination
What I’m seeing is that the precipitation observations at the “A” locations seem to be quite correlated with each other. The “O” observation locations also look to be correlated with each other, but not the “A” locations. The stream runoff BSAAM looks to only be strongly correlated with the “A” location observations, but not “O.” It seems like there should be some way to use this information to make predictions that would be helpful. Perhaps grouping “O” observations helps, or some other tactics to transform it? If I have time I’m going to come back and try that out.
#Look at the data, get a feel for what's going on
#drop the years variable for the matrix making, don't seem to need that for the analysis after looking at pairs
water2 <- water %>%
select(c(2:8))
glimpse(water2)
Rows: 43
Columns: 7
$ APMAM <dbl> 9.13, 5.28, 4.20, 4.60, 7.15, 9.70, 5.02, 6.70, 10.5…
$ APSAB <dbl> 3.58, 4.82, 3.77, 4.46, 4.99, 5.65, 1.45, 7.44, 5.85…
$ APSLAKE <dbl> 3.91, 5.20, 3.67, 3.93, 4.88, 4.91, 1.77, 6.51, 3.38…
$ OPBPC <dbl> 4.10, 7.55, 9.52, 11.14, 16.34, 8.88, 13.57, 9.28, 2…
$ OPRC <dbl> 7.43, 11.11, 12.20, 15.15, 20.05, 8.15, 12.45, 9.65,…
$ OPSLAKE <dbl> 6.47, 10.26, 11.35, 11.13, 22.81, 7.41, 13.32, 9.80,…
$ BSAAM <int> 54235, 67567, 66161, 68094, 107080, 67594, 65356, 67…
#scatter plot matrix base R
pairs(water2)
#pairs(water) This just made it messy and too small to look at the other plots better.
#Scatter plot matrix via pairs.panels from psych package, with Pearson correlations and lm fits for each combo set
pairs.panels(log(water2), lm = TRUE)
#Let's do something silly just to see a couple of model options if i have time to finish and come back to this
#try log of data for BSAAM scale problem?
#water_mod <- lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE, data = water2)
#summary(water_mod)
Use R to reproduce the scatter plot matrix in Figure 1.13 in the ALR book (page 20). Provide a brief description of the relationships between the five ratings. (The variables don’t have to be in the same order)
I just want to say first that RateMyProfessors.com is the perfect encapsulation of 1999 and early-2000’s “internet is magic” Utopian thinking. Wisdom of crowds will solve all problems, no consideration for things like racism and sexism in ratings accounted for, all the usual hallmarks are there. There’s no way the use of this tool in actual practice has been a net positive for students, instructors, or society.
Anyway, I think this scatter plot matrix actually gets at some of those weaknesses as a couple of the most correlated ratings features look especially prone (clarity and helpfulness to be specific) to various biases having an impact more so than class content. The things that look to be most correlated are quality, helpfulness, and clarity. Interest and easiness of the course do not look to be as correlated to the other categories or each other, though easiness is slightly stronger.
Rows: 366
Columns: 17
$ gender <fct> male, male, male, male, male, male, male, ma…
$ numYears <int> 7, 6, 10, 11, 11, 10, 7, 11, 11, 7, 11, 6, 4…
$ numRaters <int> 11, 11, 43, 24, 19, 15, 17, 16, 12, 18, 11, …
$ numCourses <int> 5, 5, 2, 5, 7, 9, 3, 3, 4, 4, 4, 4, 5, 2, 8,…
$ pepper <fct> no, no, no, no, no, no, no, no, no, no, no, …
$ discipline <fct> Hum, Hum, Hum, Hum, Hum, Hum, Hum, Hum, Hum,…
$ dept <fct> English, Religious Studies, Art, English, Sp…
$ quality <dbl> 4.636364, 4.318182, 4.790698, 4.250000, 4.68…
$ helpfulness <dbl> 4.636364, 4.545455, 4.720930, 4.458333, 4.68…
$ clarity <dbl> 4.636364, 4.090909, 4.860465, 4.041667, 4.68…
$ easiness <dbl> 4.818182, 4.363636, 4.604651, 2.791667, 4.47…
$ raterInterest <dbl> 3.545455, 4.000000, 3.432432, 3.181818, 4.21…
$ sdQuality <dbl> 0.5518564, 0.9020179, 0.4529343, 0.9325048, …
$ sdHelpfulness <dbl> 0.6741999, 0.9341987, 0.6663898, 0.9315329, …
$ sdClarity <dbl> 0.5045250, 0.9438798, 0.4129681, 0.9990938, …
$ sdEasiness <dbl> 0.4045199, 0.5045250, 0.5407021, 0.5882300, …
$ sdRaterInterest <dbl> 1.1281521, 1.0744356, 1.2369438, 1.3322506, …
#describeData(Rateprof)
##Graph Data
prof_plot_data <- Rateprof %>%
select(c(quality, helpfulness, clarity, easiness, raterInterest))
pairs(prof_plot_data)
pairs.panels(prof_plot_data)
For the student.survey data file in the smss package, conduct regression analyses relating: -(i) y = political ideology and x = religiosity -(ii) y = high school GPA and x = hours of TV watching.
-(a) Use graphical ways to portray the individual variables and their relationship.
For relationship (i), between self-reported ideology along a seven-level scale and the four-level scale of regliosity as measured through self-reported attendance frequency of religious services I did a trio of graphs switching up which types of variable the ideology question data went into the plot with so I could use violins, box plots, and then the stacked scatter plot like ALR’s fish age vs size example. I’m familiar with the assumptions that go in to treating the ideology scale as continuous with dummy variables, even if if can get a little less literal its interpretation on the equation side, but less so for the 4 levels relgiosity question. I’m just assuming I can treat it similarly– that respondents take the categories seriously, understand them, and that the differences between them do signal actual changes – for modeling, but I will just change it up for the graphing. Usually this would go in to a difference in means, or I would want some other question types to verify the scale against before just flipping both dummies continuous.
There looks to be an increase in likelihood that someone identifies as more conservative the more times they say they attend religious ceremonies. However, looking at the box and violin plots of the data, the spread of ideologies is much more narrow for those saying never–and probably the most confident thing from this data and model is that never attenders are likely more liberal than the other categories – than for the other options, and the most frequent attendance category has the most spread range even with the leaning conservative mean. I would imagine if this data were disaggregated into further groupings and we looked at difference in means we would find the source of some of that messiness, even if we’d need a larger sample to be sure. (I’m cheating because I sort of kinda think I know what’s going on in that most frequent category that stops the association from being even stronger from some of my personal work, work. I think.)
For relationship (ii) I started with the scatter plot as these data were more conventionally continuous numeric variables to look at. Right away it seems like there is an extreme observation in terms of hours of TV watched (37 hours? What! In grad school?) significantly detached from the rest of the data. I will look at subsetting that point out of the model to see what changes in the formula and summary statistics.
-(b) Interpret descriptive statistics for summarizing the individual variables and their relationship.
I Just combined this answer with the section for C, below.
(c) Summarize and interpret results of inferential analyses.
For model (i), political ideology and religiosity definitely are associated with each other. The model term is significant, but the r2 is still at a relatively low level so it’s associated but there are other variables we should explore. We can’t determine causality because we don’t know the timing of when or if someone changed their ideology or religious attendance. Generally we could say that if we had a guess about someone’s, or a group of someones’ if we kept only religiosity and other data ordinal, ideology(ies) informed by other data or context and then we found out their religiosity level, we’d probably want to move our guess in a more conservative direction roughly around a point up on the scale for each level of the religiosity scale of their response. But, given the model describes a smaller percent of the variation overall, I wouldn’t exactly want to bet money on that or make decisions about which students would get added to a door knocking list for a campaign based on that alone. Again, the main information I would take from this is what we can see in the never religious attenders and their leftward lean, versus certainty on a progression of more and more conservative if you were able to dial up someone’s religiosity response on a continuous scale somehow.
For model (ii) The p values is significant at the .05 level so you could technically say they’re associated, if you wanted to be one of those people. However the estimated effect of each hour of self-reported TV watching decrease in high school GPA is very small, and the r2 value is also very small. Looking at the graph of the model and confidence interval, you can see how much of the data is throwing large errors. Even when tossing out the outlier on TV viewing hours these numbers shift only a minimal amount. There is also the issue of timing which stops us from making any causal inferences. Students are reporting their current TV viewing habits but we are looking at their high school GPA’s. We could perhaps infer things about time management or current habits might line up with previous ones in high school, and that is what is perhaps having this very small impact on GPA. However, we’re still collecting data on an X variable that occurred well after the Y variable that we are measuring. Based on the sum of the information available, I don’t think we can say too much of anything meaningful about the relationship between these two variables.
Rows: 60
Columns: 18
$ subj <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ ge <fct> m, f, f, f, m, m, m, f, m, m, m, f, m, m, f, f, f, m, m…
$ ag <int> 32, 23, 27, 35, 23, 39, 24, 31, 34, 28, 23, 27, 36, 28,…
$ hi <dbl> 2.2, 2.1, 3.3, 3.5, 3.1, 3.5, 3.6, 3.0, 3.0, 4.0, 2.3, …
$ co <dbl> 3.5, 3.5, 3.0, 3.2, 3.5, 3.5, 3.7, 3.0, 3.0, 3.1, 2.6, …
$ dh <int> 0, 1200, 1300, 1500, 1600, 350, 0, 5000, 5000, 900, 253…
$ dr <dbl> 5.0, 0.3, 1.5, 8.0, 10.0, 3.0, 0.2, 1.5, 2.0, 2.0, 1.5,…
$ tv <dbl> 3, 15, 0, 5, 6, 4, 5, 5, 7, 1, 10, 14, 6, 3, 4, 7, 6, 5…
$ sp <int> 5, 7, 4, 5, 6, 5, 12, 3, 5, 1, 15, 3, 15, 10, 3, 6, 7, …
$ ne <int> 0, 5, 3, 6, 3, 7, 4, 3, 3, 2, 1, 7, 12, 1, 1, 1, 3, 6, …
$ ah <int> 0, 6, 0, 3, 0, 0, 2, 1, 0, 1, 1, 0, 5, 2, 0, 0, 10, 10,…
$ ve <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ pa <fct> r, d, d, i, i, d, i, i, i, i, r, d, d, i, d, i, i, d, i…
$ pi <ord> conservative, liberal, liberal, moderate, very liberal,…
$ re <ord> most weeks, occasionally, most weeks, occasionally, nev…
$ ab <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ aa <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ ld <lgl> FALSE, NA, NA, FALSE, FALSE, NA, FALSE, FALSE, NA, FALS…
#assign to new object, need to mutate some variables back to dummy numeric
student_survey <- student.survey %>%
select(re, pi, tv, hi) %>%
mutate(pi_n = dplyr::recode(pi,
"very liberal" = 1,
"liberal" = 2,
"slightly liberal" = 3,
"moderate" = 4,
"slightly conservative" = 5,
"conservative" = 6,
"very conservative" = 7
)) %>%
mutate(re_n = dplyr::recode(re,
"never" = 0,
"occasionally" = 1,
"most weeks" = 2,
"every week" = 3
))
glimpse(student_survey)
Rows: 60
Columns: 6
$ re <ord> most weeks, occasionally, most weeks, occasionally, nev…
$ pi <ord> conservative, liberal, liberal, moderate, very liberal,…
$ tv <dbl> 3, 15, 0, 5, 6, 4, 5, 5, 7, 1, 10, 14, 6, 3, 4, 7, 6, 5…
$ hi <dbl> 2.2, 2.1, 3.3, 3.5, 3.1, 3.5, 3.6, 3.0, 3.0, 4.0, 2.3, …
$ pi_n <dbl> 6, 2, 2, 4, 1, 2, 2, 2, 1, 3, 5, 2, 1, 4, 1, 2, 3, 2, 2…
$ re_n <dbl> 2, 1, 2, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 3, 3, 0, 0…
#See summary statistics for variables from psych package
#describe(student_survey)
## For recoding:
#PI = (1 = very liberal, 2 = liberal, 3 = slightly liberal, 4 = moderate, 5 = slightly conservative, 6 = conservative, 7 = very conservative)
#RE = Religous service attendance (religiosity) (0 = never, 1 = occasionally, 2 = most weeks, 3 = every week)
##Create the two models
#(i)
pi_re_model <- lm(pi_n ~ re_n, data = student_survey)
#(ii)
hi_tv_model <- lm(hi ~ tv, data = student_survey)
hi_tv_model_2 <- update(hi_tv_model, subset = -37)
## A) Graph the variables
# (i) Let's try looking at a violin plot for kicks first, will use re before dummy translation to use ggplot
violins_for_students <- ggplot(student_survey, aes(x=re, y=pi_n)) +
geom_violin()
violins_for_students
#and a box plot in the same way, why not, let's see if the assumption of higher reported church attendance and ideology become more conservative holds
plot(student_survey$re, student_survey$pi_n)
#alright now scatter with the vertical groupings, should mirror box, violins
plot(student_survey$re_n, student_survey$pi_n)
#graph the model!
ggplot(data = student_survey, aes(x = re_n, y = pi_n)) +
geom_point() +
geom_smooth(method = 'lm')
# (ii)
#start with a scatter plot here, scaling TV down towards the GPA reporting scale
plot(x = (student_survey$tv), y = student_survey$hi)
hist(student_survey$tv)
hist(student_survey$hi)
#graph the model!
ggplot(data = student_survey, aes(x = tv, y = hi)) +
geom_point() +
geom_smooth(method = 'lm')
## B) Summary Statistics
#(i)
summary(pi_re_model)
Call:
lm(formula = pi_n ~ re_n, data = student_survey)
Residuals:
Min 1Q Median 3Q Max
-2.81243 -0.87160 0.09882 1.12840 3.09882
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.9012 0.2717 6.997 2.97e-09 ***
re_n 0.9704 0.1792 5.416 1.22e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.345 on 58 degrees of freedom
Multiple R-squared: 0.3359, Adjusted R-squared: 0.3244
F-statistic: 29.34 on 1 and 58 DF, p-value: 1.221e-06
confint(pi_re_model)
2.5 % 97.5 %
(Intercept) 1.3572764 2.445090
re_n 0.6117725 1.329056
#(ii)
summary(hi_tv_model)
Call:
lm(formula = hi ~ tv, data = student_survey)
Residuals:
Min 1Q Median 3Q Max
-1.2583 -0.2456 0.0417 0.3368 0.7051
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.441353 0.085345 40.323 <2e-16 ***
tv -0.018305 0.008658 -2.114 0.0388 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4467 on 58 degrees of freedom
Multiple R-squared: 0.07156, Adjusted R-squared: 0.05555
F-statistic: 4.471 on 1 and 58 DF, p-value: 0.03879
confint(hi_tv_model)
2.5 % 97.5 %
(Intercept) 3.2705166 3.6121890592
tv -0.0356356 -0.0009752657
summary(hi_tv_model_2)
Call:
lm(formula = hi ~ tv, data = student_survey, subset = -37)
Residuals:
Min 1Q Median 3Q Max
-1.24754 -0.24190 0.05246 0.33359 0.71471
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.436267 0.085070 40.393 <2e-16 ***
tv -0.018873 0.008632 -2.186 0.0329 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4447 on 57 degrees of freedom
Multiple R-squared: 0.07737, Adjusted R-squared: 0.06119
F-statistic: 4.78 on 1 and 57 DF, p-value: 0.03291
For a class of 100 students, the teacher takes the 10 students who perform poorest on the midterm exam and enrolls them in a special tutoring program. The overall class mean is 70 on both the midterm and final, but the mean for the specially tutored students increases from 50 to 60. Use the concept of regression toward the mean to explain why this is not sufficient evidence to imply that the tutoring program was successful
Since we don’t have a control group selected from the students we intended to tutor or matched that criteria, we can’t be sure that the improvement is due to the tutoring program or random chance drawing their scores closer to the overall average score. Luck, sleep before the different days of the tests, and other things can move the mean around for reasons outside the program that we haven’t accounted for.
The overall class mean stayed the same even as the tutored students mean increased, so we can see this happened in the other direction as well. Some of the students would have needed to do more poorly due to random factors, or other factors we are not structuring our data gathering to see, for this to happen.