Question 1:

(Problem 1.1 in ALR) United Nations (Data file: UN11) The data in the file UN11 contains several variables, including ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries. The data were collected from the United Nations (2011). We will study the dependence of fertility on ppgdp.

1.1.1 Identify the predictor and the response.
1.1.2 Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?
1.1.3 Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph? If you use a different base of logarithms, the shape of the graph won’t change, but the values on the axes will change.

Answer to Question 1:

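# Load packages: alr4 provides the UN11 data, dplyr provides %>% and select()
library(alr4)
library(dplyr)

# Load dataset
data("UN11")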

# Select variables of focus
UN11 <- UN11 %>%
  select(c(ppgdp, fertility))  

# Preview data
head(UN11)

              ppgdp fertility
Afghanistan   499.0     5.968
Albania      3677.2     1.525
Algeria      4473.0     2.142
Angola       4321.9     5.135
Anguilla    13750.1     2.000
Argentina    9162.1     2.172

1.1.1

The predictor variable is ppgdp (gross national product per person, in US dollars) and the response variable is fertility (birth rate per 1000 females).

1.1.2

# Create scatterplot
# fertility on vertical axis, ppgdp on horizontal axis
plot(x = UN11$ppgdp, y = UN11$fertility, xlab = 'ppgdp', ylab = 'fertility', main = 'Scatterplot for Question 1.1.2')

The graph shows a steep negative relationship between ppgdp and fertility at low income levels (up to roughly $10,000 per person); beyond that point, fertility changes very little as ppgdp increases. Because the pattern is strongly curved, a straight-line mean function does not seem plausible as a summary of this graph.

1.1.3

# Create scatterplot
# log(fertility) on vertical axis, log(ppgdp) on horizontal axis
plot(x = log(UN11$ppgdp), y = log(UN11$fertility), xlab = 'log(ppgdp)', ylab = 'log(fertility)', main = 'Scatterplot for Question 1.1.3')

The simple linear regression model seems plausible as a summary of this graph. On the log scale, the relationship between the two variables is negative and roughly linear across the whole range of the data, unlike the graph in 1.1.2, which drops sharply at first and then flattens out over most of the plot.
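As a rough visual check of this claim, the sketch below overlays a least-squares line on the log-log scatterplot. It assumes the alr4 package is installed; the name fit_log is only illustrative and is not part of the original answer.

```r
# Sketch: overlay a fitted straight line on the log-log scatterplot
library(alr4)
data(UN11)

# Simple linear regression of log(fertility) on log(ppgdp)
fit_log <- lm(log(fertility) ~ log(ppgdp), data = UN11)

plot(log(fertility) ~ log(ppgdp), data = UN11,
     xlab = 'log(ppgdp)', ylab = 'log(fertility)',
     main = 'Log-log scatterplot with fitted line')
abline(fit_log)  # add the least-squares line to the existing plot

coef(fit_log)    # intercept and slope on the log scale
```

If the points scatter evenly around the line over the full range of log(ppgdp), that supports using the simple linear regression model on this scale.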

Question 2:

(Problem 9.47 in SMSS) Annual income, in dollars, is an explanatory variable in a regression analysis. For a British version of the report on the analysis, all responses are converted to British pounds sterling (1 pound equals about 1.33 dollars, as of 2016).

How, if at all, does the slope of the prediction equation change?
How, if at all, does the correlation change?

Answer to Question 2:

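The slope of the prediction equation would change. Income is the explanatory variable, and an income in pounds equals the income in dollars divided by 1.33. A one-pound increase in income is therefore the same as a 1.33-dollar increase, so the new slope is the original slope multiplied by 1.33. (If income were instead the response variable, the slope would be divided by 1.33.)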

The correlation does not change. Correlation is a standardized measure of association, so it is not affected by a change in the units of either variable.
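A small simulation can illustrate both points. The data below are made up purely for illustration (the variable names, sample size, and coefficients are all assumptions, not values from the problem); only the 1.33 conversion rate comes from the question.

```r
# Sketch with simulated data: effect of converting the predictor from dollars to pounds
set.seed(1)
income_usd <- runif(200, min = 20000, max = 120000)        # hypothetical incomes in dollars
y <- 10 + 0.0005 * income_usd + rnorm(200, sd = 5)         # hypothetical response
income_gbp <- income_usd / 1.33                            # same incomes in pounds

slope_usd <- coef(lm(y ~ income_usd))[["income_usd"]]
slope_gbp <- coef(lm(y ~ income_gbp))[["income_gbp"]]

slope_gbp / slope_usd      # ratio of slopes: 1.33, up to floating-point error
cor(income_usd, y)         # correlation with income in dollars
cor(income_gbp, y)         # identical correlation with income in pounds
```

The slope ratio depends only on the conversion rate, not on the simulated values, which is why the same conclusion holds for any data set.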

Question 3:

(Problem 1.5 in ALR) Water runoff in the Sierras (Data file: water) Can Southern California’s water supply in future years be predicted from past data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more efficiently. The data file contains 43 years’ worth of precipitation measurements taken at six sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM. Draw the scatterplot matrix for these data and summarize the information available from these plots.

Answer to Question 3:

# load and preview dataset 
data(water)
head(water)

  Year APMAM APSAB APSLAKE OPBPC  OPRC OPSLAKE  BSAAM
1 1948  9.13  3.58    3.91  4.10  7.43    6.47  54235
2 1949  5.28  4.82    5.20  7.55 11.11   10.26  67567
3 1950  4.20  3.77    3.67  9.52 12.20   11.35  66161
4 1951  4.60  4.46    3.93 11.14 15.15   11.13  68094
5 1952  7.15  4.99    4.88 16.34 20.05   22.81 107080
6 1953  9.70  5.65    4.91  8.88  8.15    7.41  67594

# create scatterplot matrix
pairs(water, main = 'Scatterplot Matrix for Question 3')

In this scatterplot matrix, precipitation at the three sites whose names begin with ‘A’ (APMAM, APSAB, APSLAKE) appears to be positively and roughly linearly related from site to site, and the same is true among the three sites whose names begin with ‘O’ (OPBPC, OPRC, OPSLAKE), although the strength of these relationships is hard to judge from the plots alone. Year does not appear to be related to any of the other variables. Stream runoff (BSAAM) shows a clear positive relationship with the ‘O’ sites but no notable relationship with the ‘A’ sites.
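To put rough numbers on the strength of these relationships, one option is a correlation matrix. This is a sketch that assumes the alr4 package is installed; dropping Year and the name water_cor are illustrative choices, not part of the original answer.

```r
# Sketch: pairwise Pearson correlations for the precipitation sites and runoff
library(alr4)
data(water)

# Drop Year, which the scatterplot matrix suggests is unrelated to the rest
water_cor <- cor(water[, names(water) != "Year"])
round(water_cor, 2)
```

Reading the BSAAM row of this matrix gives a direct comparison of how strongly runoff is associated with each site.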

Question 4:

(Problem 1.6 in ALR - slightly modified) Professor ratings (Data file: Rateprof) In the website and online forum RateMyProfessors.com, students rate and comment on their instructors. Launched in 1999, the site includes millions of ratings on thousands of instructors. The data file includes the summaries of the ratings of 364 instructors at a large campus in the Midwest (Bleske-Rechek and Fritsch, 2011). Each instructor included in the data had at least 10 ratings over a several year period. Students provided ratings of 1–5 on quality, helpfulness, clarity, easiness of instructor’s courses, and raterInterest in the subject matter covered in the instructor’s courses. The data file provides the averages of these five ratings. Use R to reproduce the scatterplot matrix in Figure 1.13 in the ALR book (page 20). Provide a brief description of the relationships between the five ratings. (The variables don’t have to be in the same order)

Answer to Question 4:

# load dataset, select variables, preview dataset
data(Rateprof)

Rateprof <- Rateprof %>%
  select(c(quality, clarity, helpfulness, easiness, raterInterest))  

head(Rateprof)

   quality  clarity helpfulness easiness raterInterest
1 4.636364 4.636364    4.636364 4.818182      3.545455
2 4.318182 4.090909    4.545455 4.363636      4.000000
3 4.790698 4.860465    4.720930 4.604651      3.432432
4 4.250000 4.041667    4.458333 2.791667      3.181818
5 4.684211 4.684211    4.684211 4.473684      4.214286
6 4.233333 4.200000    4.266667 4.533333      3.916667

# create scatterplot matrix
pairs(Rateprof, main = 'Plot for Question 4')

In the scatterplot matrix of the average ratings for quality, clarity, helpfulness, easiness, and rater interest, the variables quality, clarity, and helpfulness each appear to have a strong positive correlation with the other two. Easiness appears to have a much weaker positive correlation with helpfulness, clarity, and quality. Rater interest does not appear to be strongly correlated with any of the other variables. There are a few notable outliers, for example the point in the clarity/quality plot that is rated higher for clarity and lower for quality than the trend of the other points. The strong correlations may reflect a real relationship between these qualities among professors at the selected university (professors perceived as more helpful also tend to be rated as clearer), or they may mean that students associate these qualities with one another and therefore give similar ratings for helpfulness and clarity.
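The visual impressions above can be checked against the pairwise correlations. This sketch assumes the alr4 package is installed; the names ratings and rating_cor are illustrative only.

```r
# Sketch: pairwise Pearson correlations among the five average ratings
library(alr4)
data(Rateprof)

ratings <- Rateprof[, c("quality", "clarity", "helpfulness",
                        "easiness", "raterInterest")]

# use = "complete.obs" guards against missing values, in case any are present
rating_cor <- round(cor(ratings, use = "complete.obs"), 2)
rating_cor
```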

Question 5:

(Problem 9.34 in SMSS) For the student.survey data file in the smss package, conduct regression analyses relating:

1. y = political ideology and x = religiosity,
2. y = high school GPA and x = hours of TV watching.

(You can use ?student.survey in the R console, after loading the package, to see what each variable means.)

Use graphical ways to portray the individual variables and their relationship.
Interpret descriptive statistics for summarizing the individual variables and their relationship.
Summarize and interpret results of inferential analyses.

Answer to Question 5:

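# load packages: smss provides student.survey, ggplot2 provides ggplot()
library(smss)
library(ggplot2)

# load and preview data
data(student.survey)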

head(student.survey)

  subj ge ag  hi  co   dh   dr tv sp ne ah    ve pa           pi
1    1  m 32 2.2 3.5    0  5.0  3  5  0  0 FALSE  r conservative
2    2  f 23 2.1 3.5 1200  0.3 15  7  5  6 FALSE  d      liberal
3    3  f 27 3.3 3.0 1300  1.5  0  4  3  0 FALSE  d      liberal
4    4  f 35 3.5 3.2 1500  8.0  5  5  6  3 FALSE  i     moderate
5    5  m 23 3.1 3.5 1600 10.0  6  6  3  0 FALSE  i very liberal
6    6  m 39 3.5 3.5  350  3.0  4  5  7  0 FALSE  d      liberal
            re    ab    aa    ld
1   most weeks FALSE FALSE FALSE
2 occasionally FALSE FALSE    NA
3   most weeks FALSE FALSE    NA
4 occasionally FALSE FALSE FALSE
5        never FALSE FALSE FALSE
6 occasionally FALSE FALSE    NA

# graph: x=religiosity, y=political ideology
repa <- student.survey %>%
  select(re, pi)

ggplot(data = repa) +
  geom_bar(mapping = aes(x = re, fill = pi)) +
  labs(title = "Relationship between Religiosity and Political Ideology", x = "Religiosity (how often you attend services)", y = "Count", fill = "Political Ideology (pi)")

# graph: x=hours of watching tv, y=high school gpa
tvhi <- student.survey %>%
  select(tv, hi)

ggplot(data = tvhi) +
  geom_point(mapping = aes(x = tv, y = hi)) +
  labs(title = "Relationship between Hours Watching TV and GPA", x = "Average Hours of TV watched per Week", y = "High School GPA")

# descriptive statistics: religiosity and political ideology
summary(repa)

            re                         pi    
 never       :15   very liberal         : 8  
 occasionally:29   liberal              :24  
 most weeks  : 7   slightly liberal     : 6  
 every week  : 9   moderate             :10  
                   slightly conservative: 6  
                   conservative         : 4  
                   very conservative    : 2

Both the religiosity and political ideology variables are skewed right: the counts are concentrated in the lower categories of each scale. Most respondents attend services “never” or “occasionally” (44 of 60), and the most common political ideologies are “liberal” (24) and “moderate” (10).

# descriptive statistics: hours of tv watched and GPA
summary(tvhi)

       tv               hi       
 Min.   : 0.000   Min.   :2.000  
 1st Qu.: 3.000   1st Qu.:3.000  
 Median : 6.000   Median :3.350  
 Mean   : 7.267   Mean   :3.308  
 3rd Qu.:10.000   3rd Qu.:3.625  
 Max.   :37.000   Max.   :4.000

The variable average hours of tv watched has a wide range, and the large distance between the 3rd quartile (10) and the maximum (37) suggests that there is at least one outlier; the scatterplot above shows several. Additionally, the median (6.0) being less than the mean (7.267) suggests a right skew in the data. The summary for high school gpa suggests a roughly symmetric distribution at first glance, as the mean (3.308) and median (3.350) are similar. However, the gpa is mildly skewed left: the mean is slightly below the median, and the 1st quartile is 3.0, so at least three quarters of respondents have a gpa of 3.0 or higher even though 3.0 is the midpoint of the observed range (2.0 to 4.0).
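The shapes described above can also be inspected graphically for each variable on its own. The following is a minimal sketch using base R graphics, assuming the smss package is installed; the plot titles are illustrative.

```r
# Sketch: one plot per variable to show the individual distributions
library(smss)
data(student.survey)

op <- par(mfrow = c(2, 2))  # 2 x 2 grid of plots

# Quantitative variables: histograms
hist(student.survey$tv, main = "Hours of TV per week", xlab = "tv")
hist(student.survey$hi, main = "High school GPA", xlab = "hi")

# Ordered categorical variables: bar charts of the counts
barplot(table(student.survey$re), main = "Religiosity", las = 2)
barplot(table(student.survey$pi), main = "Political ideology", las = 2)

par(op)  # restore the previous graphics settings
```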

# inferential analysis: religiosity and political ideology
lmrepa <- lm(data = student.survey, formula = as.numeric(pi) ~ as.numeric(re))
summary(lmrepa)


Call:
lm(formula = as.numeric(pi) ~ as.numeric(re), data = student.survey)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.81243 -0.87160  0.09882  1.12840  3.09882 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.9308     0.4252   2.189   0.0327 *  
as.numeric(re)   0.9704     0.1792   5.416 1.22e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.345 on 58 degrees of freedom
Multiple R-squared:  0.3359,    Adjusted R-squared:  0.3244 
F-statistic: 29.34 on 1 and 58 DF,  p-value: 1.221e-06

# looking at Pearson's correlation
cor.test(as.numeric(student.survey$re), as.numeric(student.survey$pi))


    Pearson's product-moment correlation

data:  as.numeric(student.survey$re) and as.numeric(student.survey$pi)
t = 5.4163, df = 58, p-value = 1.221e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3818345 0.7265650
sample estimates:
      cor 
0.5795661

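Because the political ideology and religiosity variables are categorical (ordered factors), we use the as.numeric() function inside the linear model to convert their levels to integer codes (1 = “never” through 4 = “every week” for religiosity, and 1 = “very liberal” through 7 = “very conservative” for political ideology). This treats the categories as equally spaced, which is an assumption. At a significance level of .01, there is a statistically significant association between religiosity and political ideology (p-value = 1.22e-06 < .01). The estimated slope of 0.97 means that each one-category increase in service attendance is associated with a shift of about one category toward the conservative end of the ideology scale. The correlation (0.58) is moderate and positive, suggesting that as the frequency of attending religious services increases, political ideology becomes more conservative leaning.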
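The conversion that as.numeric() performs on a factor can be seen with a small made-up example. The values in re_example are hypothetical; only the level order is taken from the summary(repa) output above.

```r
# Sketch: as.numeric() on a factor returns the position of each value's level
levels_re <- c("never", "occasionally", "most weeks", "every week")

# Hypothetical responses, given in no particular order
re_example <- factor(c("most weeks", "never", "every week", "occasionally"),
                     levels = levels_re)

codes <- as.numeric(re_example)
print(codes)  # codes follow the level order, not alphabetical order: 3 1 4 2
```

Because the codes depend entirely on the order of the levels, it is worth checking levels() on each factor before interpreting the sign of a slope fitted this way.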

# inferential analysis: hours of tv and high school gpa
lmtvhi <- lm(data = student.survey, formula = hi ~ tv)
summary(lmtvhi)


Call:
lm(formula = hi ~ tv, data = student.survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2583 -0.2456  0.0417  0.3368  0.7051 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.441353   0.085345  40.323   <2e-16 ***
tv          -0.018305   0.008658  -2.114   0.0388 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4467 on 58 degrees of freedom
Multiple R-squared:  0.07156,   Adjusted R-squared:  0.05555 
F-statistic: 4.471 on 1 and 58 DF,  p-value: 0.03879

# looking at Pearson's correlation
cor.test(student.survey$tv, student.survey$hi)


    Pearson's product-moment correlation

data:  student.survey$tv and student.survey$hi
t = -2.1144, df = 58, p-value = 0.03879
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.48826914 -0.01457694
sample estimates:
       cor 
-0.2675115

With a slope of -0.018, there is a negative association between hours of tv watched per week and high school GPA: each additional hour of tv per week is associated with a GPA that is about 0.018 points lower on average. The relationship is statistically significant at a significance level of .05 (p-value = 0.039), though not at .01. However, the R-squared value is only 0.072, so hours of tv explains about 7% of the variation in GPA and the regression model does not predict GPA well. This is not surprising after looking at the scatterplot of hours of tv watched and GPA, which shows wide scatter and no clear linear trend.
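In simple linear regression, R-squared equals the square of the Pearson correlation, which links the lm() and cor.test() outputs above. The short check below uses the correlation estimates copied from those outputs; the variable names are illustrative.

```r
# Sketch: R-squared in simple linear regression is the squared correlation
r_re_pi <- 0.5795661    # cor.test() estimate for religiosity and political ideology
r_tv_hi <- -0.2675115   # cor.test() estimate for hours of tv and high school GPA

round(r_re_pi^2, 4)     # 0.3359, matching Multiple R-squared for lmrepa
round(r_tv_hi^2, 5)     # 0.07156, matching Multiple R-squared for lmtvhi
```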

Question 6:

(Problem 9.50 in SMSS) For a class of 100 students, the teacher takes the 10 students who perform poorest on the midterm exam and enrolls them in a special tutoring program. The overall class mean is 70 on both the midterm and final, but the mean for the specially tutored students increases from 50 to 60. Use the concept of regression toward the mean to explain why this is not sufficient evidence to imply that the tutoring program was successful. (Here’s a useful hint video: https://www.youtube.com/watch?v=1tSqSMOyNFE)

Answer to Question 6:

Regression toward the mean says that observations that are extreme on one measurement tend, on average, to be closer to the mean on a second measurement that is imperfectly correlated with the first. The 10 students were selected precisely because they had the lowest midterm scores, and those scores partly reflect chance (a bad day on that exam) as well as ability. On the final exam, we would therefore expect these students’ scores to move closer to the class mean (which stayed at 70) even without any tutoring. An increase from 50 to 60 still leaves the group well below 70 and is consistent with this effect, so it is not sufficient evidence that the tutoring program caused the improvement.
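A simulation with no tutoring effect at all can show how large this purely statistical improvement might be. Everything in the sketch below is an assumption made for illustration: scores are modeled as a stable ability plus independent exam-day noise, and the standard deviations, seed, and number of simulated classes are arbitrary choices.

```r
# Sketch: regression toward the mean with no tutoring effect (simulated data)
set.seed(603)
n_classes  <- 1000   # number of simulated classes
n_students <- 100    # students per class, as in the problem

gain <- replicate(n_classes, {
  ability <- rnorm(n_students, mean = 70, sd = 8)     # stable component
  midterm <- ability + rnorm(n_students, sd = 8)      # exam-day noise
  final   <- ability + rnorm(n_students, sd = 8)      # new, independent noise
  lowest  <- order(midterm)[1:10]                     # 10 poorest midterm scores
  mean(final[lowest]) - mean(midterm[lowest])         # their average change
})

mean(gain)       # average improvement of the bottom 10, with no intervention
mean(gain > 0)   # share of simulated classes in which the bottom 10 improved
```

Under these assumptions the lowest-scoring group improves on average even though nothing was done for them, which is why a comparison group of untutored low scorers would be needed to evaluate the program.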

DACSS 603: Homework 2