Second homework for DACSS 603.
United Nations (Data file: UN11) The data in the file UN11 contains several variables, including ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries. The data were collected from the United Nations (2011). We will study the dependence of fertility on ppgdp.
1.1.1 Identify the predictor and the response.
The response is fertility and the predictor is ppgdp, because the problem asks about the dependence of fertility on ppgdp.
1.1.2 Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?
No. Fertility falls steeply as ppgdp increases from very low values and then levels off, so the points follow a curve resembling exponential decay rather than a straight line.
1.1.3 Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph? If you use a different base of logarithms, the shape of the graph won’t change, but the values on the axes will change.
Yes. After taking natural logarithms of both variables, a simple linear regression model is a plausible summary of the graph.
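The two scatterplots for 1.1.2 and 1.1.3 could be drawn along the following lines. This is a minimal sketch that assumes the `alr4` package is installed and that `UN11` has columns named `fertility` and `ppgdp`, as the problem statement describes. The fitted line and the name `fit_log` are additions for illustration and are not part of the original answer.

```r
library(alr4)  # provides the UN11 data frame

# 1.1.2: fertility against ppgdp on the original scale
plot(fertility ~ ppgdp, data = UN11,
     xlab = "ppgdp", ylab = "fertility")

# 1.1.3: the same variables after taking natural logarithms
plot(log(fertility) ~ log(ppgdp), data = UN11,
     xlab = "log(ppgdp)", ylab = "log(fertility)")

# Illustrative addition: overlay the least-squares line on the log-log plot
fit_log <- lm(log(fertility) ~ log(ppgdp), data = UN11)
abline(fit_log)
coef(fit_log)
```

Comparing how closely the points follow the line in the second plot with the curvature in the first is one way to judge whether a straight-line mean function is reasonable.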
Annual income, in dollars, is an explanatory variable in a regression analysis. For a British version of the report on the analysis, all responses are converted to British pounds sterling (1 pound equals about 1.33 dollars, as of 2016).
some_pred <- c(7, 9, 11, 12, 13, 14)
some_inc_us <- c(45000, 58000, 65400, 80000, 94600, 103580)
some_inc_uk <- some_inc_us * (1 / 1.33)
hw2 <- data.frame(some_pred, some_inc_us, some_inc_uk)
hw2_table <- knitr::kable(hw2)
hw2a <- summary(lm(some_pred ~ some_inc_us, data = hw2))
hw2b <- summary(lm(some_pred ~ some_inc_uk, data = hw2))
First, I made a hypothetical dataset containing a response variable (some_pred), hypothetical salaries in dollars (some_inc_us), and the same salaries converted to pounds (some_inc_uk).
| some_pred | some_inc_us | some_inc_uk |
|---|---|---|
| 7 | 45000 | 33834.59 |
| 9 | 58000 | 43609.02 |
| 11 | 65400 | 49172.93 |
| 12 | 80000 | 60150.38 |
| 13 | 94600 | 71127.82 |
| 14 | 103580 | 77879.70 |
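The slope (coefficient) when using dollars as the predictor, 1.1334084 × 10^{-4}, is different from the slope when using pounds as the predictor, 1.5074332 × 10^{-4}.
R-squared value, using dollar salaries to predict some_pred: 0.9465294
R-squared value, using UK pound salaries to predict some_pred: 0.9465294
The two R-squared values are identical, so the strength of the linear relationship does not depend on the currency unit.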
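A short derivation may help explain the pattern in these numbers. The notation is introduced here and does not come from the exercise: $x$ is income in dollars, $x' = x/c$ is income in pounds with $c = 1.33$, $y$ is the response, $S_{xx}$, $S_{xy}$, $S_{yy}$ are the usual centered sums of squares and cross-products, $\hat\beta_1$ is the least-squares slope, and $r$ is the sample correlation.

```latex
\begin{align*}
S_{x'x'} &= \sum_i \left(\frac{x_i - \bar{x}}{c}\right)^2 = \frac{S_{xx}}{c^2},
&
S_{x'y} &= \sum_i \frac{(x_i - \bar{x})(y_i - \bar{y})}{c} = \frac{S_{xy}}{c}, \\[4pt]
\hat\beta_1' &= \frac{S_{x'y}}{S_{x'x'}} = \frac{S_{xy}/c}{S_{xx}/c^2} = c\,\hat\beta_1,
&
r' &= \frac{S_{x'y}}{\sqrt{S_{x'x'}\,S_{yy}}} = \frac{S_{xy}/c}{\sqrt{S_{xx}\,S_{yy}}/c} = r .
\end{align*}
```

With $c = 1.33$, the dollar slope of $1.1334084 \times 10^{-4}$ becomes $1.33 \times 1.1334084 \times 10^{-4} \approx 1.5074 \times 10^{-4}$, which agrees with the pound slope reported above, while $r$ and therefore $r^2$ are unchanged.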
Water runoff in the Sierras (ALR Data file: water) Can Southern California’s water supply in future years be predicted from past data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more efficiently. The data file contains 43 years’ worth of precipitation measurements taken at six sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM. Draw the scatterplot matrix for these data and summarize the information available from these plots.
The three AP sites (APMAM, APSAB, and APSLAKE) appear to be strongly linearly related to one another, and the same is true of the three OP sites (OPBPC, OPRC, and OPSLAKE).
By contrast, the linear relationship between an AP site and an OP site is much weaker.
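The scatterplot matrix described above could be produced roughly as follows. This sketch assumes the `alr4` package is installed and that the `water` data frame uses the column names listed in the problem statement. The correlation matrix and the names `site_cols` and `water_cor` are additions for illustration, meant as a numeric companion to the visual impression.

```r
library(alr4)  # provides the water data frame

# Columns named in the problem statement: six precipitation sites plus runoff
site_cols <- c("APMAM", "APSAB", "APSLAKE", "OPBPC", "OPRC", "OPSLAKE", "BSAAM")

# Scatterplot matrix of all pairs of variables
pairs(water[, site_cols])

# Illustrative addition: pairwise correlations, rounded for readability
water_cor <- round(cor(water[, site_cols]), 2)
water_cor
```

Reading the matrix in blocks (AP with AP, OP with OP, AP with OP) gives a quick check on which groups of sites move together.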
Professor ratings (ALR Data file: Rateprof) In the website and online forum RateMyProfessors.com, students rate and comment on their instructors. Launched in 1999, the site includes millions of ratings on thousands of instructors. The data file includes the summaries of the ratings of 364 instructors at a large campus in the Midwest (Bleske-Rechek and Fritsch, 2011). Each instructor included in the data had at least 10 ratings over a several year period. Students provided ratings of 1–5 on quality, helpfulness, clarity, easiness of instructor’s courses, and raterInterest in the subject matter covered in the instructor’s courses. The data file provides the averages of these five ratings. Use R to reproduce the scatterplot matrix in Figure 1.13 in the ALR book (page 20). Provide a brief description of the relationships between the five ratings. (The variables don’t have to be in the same order)
library(alr4)
library(ggplot2)
library(dplyr)
hw4 <- Rateprof
hw4_plot_var <- select(hw4, quality, clarity, helpfulness, easiness, raterInterest)
# pairs() draws the scatterplot matrix and returns NULL, so the result is not assigned
pairs(hw4_plot_var)
The quality, clarity, and helpfulness ratings seem to be highly, positively correlated with each other and demonstrate clear linear relationships with each other.
The easiness ratings also seem to be positively correlated with helpfulness, clarity, and quality. However, the linear relationship is not as strong as the relationships that quality, clarity, and helpfulness share with each other.
Rater interest
For the student.survey SMSS data file in the smss package, conduct regression analyses relating (i) y = political ideology and x = religiosity, (ii) y = high school GPA and x = hours of TV watching.
(You can use ?student.survey in the R console, after loading the package, to see what each variable means.)
library(smss)
library(ggplot2)
## To convert the responses to numerical values, based on scaling
data(student.survey)
hw5 <- student.survey
re_mod <- unclass(hw5$re)
pi_mod <- unclass(hw5$pi)
re_scale <- c(1:4)
pi_scale <- c(1:7)
re_levels <- levels(hw5$re)
pi_levels <- levels(hw5$pi)
# Tables pairing each numeric code with the response label it stands for
re_levels_df <- data.frame(re_scale, re_levels)
pi_levels_df <- data.frame(pi_scale, pi_levels)
# Model fitting
i5 <- lm(pi_mod ~ re_mod, hw5)
i5_summ <- summary(i5)
i5_plot <- ggplot(hw5, aes(re_mod, pi_mod)) + geom_point()
i5_resumm <- summary(re_mod)
i5_pisumm <- summary(pi_mod)
a5 <- data.frame(as.factor(i5_pisumm), as.factor(i5_resumm))
names(a5) <- c('PI', 'RE')
# Correlation Test
i5_cortest <- cor.test(pi_mod, re_mod)
# Information for TV Watching vs High School GPA
ii5 <- lm(hi ~ tv, hw5)
ii5_plot <- ggplot(hw5, aes(tv, hi)) + geom_point()
aa5 <- data.frame(as.factor(summary(hw5$tv)), as.factor(summary(hw5$hi)) )
names(aa5) <- c('TV', 'HI')
ii5_cortest <- cor.test(hw5$hi, hw5$tv)
Outline of the response choices for Religious Services and Political Ideology
| re_scale | re_levels |
|---|---|
| 1 | never |
| 2 | occasionally |
| 3 | most weeks |
| 4 | every week |
| pi_scale | pi_levels |
|---|---|
| 1 | very liberal |
| 2 | liberal |
| 3 | slightly liberal |
| 4 | moderate |
| 5 | slightly conservative |
| 6 | conservative |
| 7 | very conservative |
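The numeric codes in these tables follow the order of the factor levels. The sketch below illustrates that mapping on a small made-up factor. The vector `re_example` is hypothetical and is not the survey data, and `as.integer()` is used here as a plain alternative to `unclass()` for extracting the codes.

```r
# Hypothetical responses, using the same four labels as the re scale above
re_example <- factor(
  c("occasionally", "never", "every week", "most weeks"),
  levels = c("never", "occasionally", "most weeks", "every week")
)

# Each response is replaced by the position of its label in levels()
codes <- as.integer(re_example)

# Side-by-side view of labels and codes
data.frame(response = re_example, code = codes)
```

Because the codes depend only on the order of `levels()`, treating them as equally spaced numbers in a regression is a modeling assumption rather than a property of the data.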
a. Ideology vs Religious Services Responses Plot
b. Summary Statistics for the Religious Services and Political Ideology responses
| Statistic | PI | RE |
|---|---|---|
| Min. | 1 | 1 |
| 1st Qu. | 2 | 1.75 |
| Median | 2 | 2 |
| Mean | 3.03 | 2.17 |
| 3rd Qu. | 4 | 3 |
| Max. | 7 | 4 |
The median responses are that students attend religious services occasionally and that their political ideology is liberal. The mean, however, suggests that students still attend religious services occasionally but that their political ideology is slightly liberal.
c. Inferential Analyses
Correlation Test
Test Statistic : 5.4162562
P Value : 1.22113 × 10^{-6}
The correlation between the Political Ideology and Religious Services responses is statistically significant. The test statistic is positive, so more frequent attendance at religious services is associated with a more conservative ideology.
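The exercise asks for a regression analysis, while the result reported here is a correlation test. In simple linear regression the two give the same t statistic for the slope, which the sketch below illustrates. The data are simulated and hypothetical (60 made-up responses on a 1 to 4 and a 1 to 7 scale), so the numbers it prints are not the survey results.

```r
set.seed(1)

# Hypothetical ordinal codes, NOT the student.survey data
x <- sample(1:4, 60, replace = TRUE)                        # predictor on a 1-4 scale
y <- pmin(7, pmax(1, round(1 + x + rnorm(60, sd = 1.5))))   # response clipped to 1-7

# Pearson correlation test
ct <- cor.test(y, x)

# Simple linear regression of y on x
fit <- summary(lm(y ~ x))

# t statistic from the correlation test and from the regression slope
t_cor <- unname(ct$statistic)
t_slope <- fit$coefficients["x", "t value"]

all.equal(t_cor, t_slope)
```

Under this equivalence, a significant correlation test and a significant slope in the corresponding `lm()` fit are the same finding stated two ways.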
High School GPA vs TV Watching Analysis
a. Plot outlining the relationship between GPA and the amount of TV watched
b. Summary Statistics for High School GPA and Amount of TV Hours Watched
| Statistic | TV | HI |
|---|---|---|
| Min. | 0 | 2 |
| 1st Qu. | 3 | 3 |
| Median | 6 | 3.35 |
| Mean | 7.27 | 3.31 |
| 3rd Qu. | 10 | 3.625 |
| Max. | 37 | 4 |
The lowest amount of TV watched is 0 hours, while the highest is 37 hours. On average, this group watches 7.3 hours of TV, with a median of 6 hours.
The lowest GPA reported by the respondents is 2, and the highest is 4. On average, the respondents have a GPA of 3.31, with a median of 3.35.
c. Inferential Analyses
Correlation Test
Test Statistic : -2.1143654
P Value : 0.0387935
The correlation test shows that the correlation between hours of TV watched and GPA is statistically significant at the 0.05 level. Furthermore, the amount of TV watched has a negative relationship with GPA, consistent with the cliché that students' grades suffer as they spend more time watching TV.
For a class of 100 students, the teacher takes the 10 students who perform poorest on the midterm exam and enrolls them in a special tutoring program. The overall class mean is 70 on both the midterm and final, but the mean for the specially tutored students increases from 50 to 60. Use the concept of regression toward the mean to explain why this is not sufficient evidence to imply that the tutoring program was successful. (Here’s a useful hint video: https://www.youtube.com/watch?v=1tSqSMOyNFE)
I constructed a fake dataset with two columns of test scores: in the first, the 10 students score 50, and in the second, they score 60 on the next test. I did not adjust the other students' scores so that the class averages 70. A two-sided t-test on this dataset gives a test statistic of -1 and a p-value of 0.319, so the test by itself does not provide enough evidence that the improvement is statistically significant.
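Furthermore, regression toward the mean says that extreme results tend to be followed by results closer to the mean. The 10 students were selected because they had the lowest midterm scores, and part of what makes a score that low is bad luck that is unlikely to repeat. Their mean would therefore be expected to rise toward the class mean of 70 on the final even without any tutoring, so the increase from 50 to 60 cannot be attributed to the program.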
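The argument can be illustrated with a small simulation in which no tutoring takes place at all. This is a sketch under stated assumptions: each score is a stable ability plus independent exam-day noise, and the standard deviations of 8 are hypothetical values chosen for illustration, not estimates from any real class. All names (`simulate_class`, `ability`, `gain`, and so on) are made up for this example.

```r
set.seed(603)

n_students <- 100   # class size from the problem
n_classes  <- 2000  # number of simulated classes to average over

simulate_class <- function() {
  # Assumed model: score = stable ability + independent exam-day noise
  ability <- rnorm(n_students, mean = 70, sd = 8)
  midterm <- ability + rnorm(n_students, sd = 8)
  final   <- ability + rnorm(n_students, sd = 8)

  # The 10 students with the lowest midterm scores; nothing is done to them
  bottom10 <- order(midterm)[1:10]

  c(midterm     = mean(midterm[bottom10]),
    final       = mean(final[bottom10]),
    class_mid   = mean(midterm),
    class_final = mean(final))
}

# One column per simulated class, one row per summary statistic
sims <- replicate(n_classes, simulate_class())
avg  <- rowMeans(sims)
round(avg, 1)

# Average change for the bottom-10 group, with no intervention at all
gain <- unname(avg["final"] - avg["midterm"])
gain
```

In this setup the class means stay near 70 on both exams while the bottom-10 group improves on average, which is the pattern described in the problem. A convincing evaluation of the tutoring program would need a comparison group of similarly low scorers who were not tutored.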