Regression Background, Simple Linear Regression, & Multiple Regression
United Nations (Data file: UN11) The data in the file UN11 contains several variables, including ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries. The data were collected from the United Nations (2011). We will study the dependence of fertility on ppgdp.
1.1.1. Identify the predictor and the response.
1.1.2 Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?
1.1.3 Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph? If you use a different base of logarithms, the shape of the graph won’t change, but the values on the axes will change.
First, I load in the appropriate data. Then I look at the head() output to preview the dataset and use ?UN11 to get details on the data.
The defining measures of the variables I am examining are:
the predictor, ppgdp, representing "per capita gross domestic product in US dollars", and the response, fertility, representing "number of children per woman".
Next, I create the scatterplot and add a fitted regression line using the geom_smooth() function with the 'lm' method in ggplot2. Using geom_smooth() in this way lets me see whether a straight regression line plausibly summarizes the pattern in the graph.
#First, I load the required packages and the data, use the head() function to preview the dataset, and use "?UN11" to open the help page with additional details on the data represented in the dataset.
library(alr4)      # provides the UN11 data
library(tidyverse) # provides %>%, select(), and ggplot2
data("UN11")
head(UN11)
#Then, I created a dataframe with the relevant variables and plotted them using ggplot2
df1 <- UN11 %>%
select(c(ppgdp, fertility))
gg1a <- ggplot(df1, aes(x=ppgdp, y=fertility)) +
geom_point() +
geom_smooth(method=lm,se=FALSE,fullrange=TRUE,color="goldenrod") +
labs(title= "Fertility and GDP",
x= "GNP (Per Person in US Dollars)",
y = "Fertility (Births Per Woman)") +
theme_classic()
gg1a
Finally, I apply the log() function to both variables and again use geom_smooth() to check whether a straight regression line summarizes the pattern on the log scale.
gg1b <- ggplot(df1, aes(x=log(ppgdp), y=log(fertility))) +
geom_point() +
geom_smooth(method=lm,se=TRUE,fullrange=TRUE,color="goldenrod") +
labs(title= "Log: Fertility and GDP",
x= "GNP (Per Person in US Dollars)",
y = "Fertility (Births Per Woman)") +
theme_classic()
gg1b
Additionally, calling summary() on the fitted lm() model lets me confirm that the slope for ppgdp is statistically significant at a high level of confidence (p < 0.001).
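A minimal sketch of that fit (the object name m1 is my choice; the formula matches the Call line in the output below):
m1 <- lm(fertility ~ ppgdp, data = df1)
summary(m1)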
Call:
lm(formula = fertility ~ ppgdp, data = df1)
Residuals:
Min 1Q Median 3Q Max
-1.9006 -0.8801 -0.3547 0.6749 3.7585
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.178e+00 1.048e-01 30.331 < 2e-16 ***
ppgdp -3.201e-05 4.655e-06 -6.877 7.9e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.206 on 197 degrees of freedom
Multiple R-squared: 0.1936, Adjusted R-squared: 0.1895
F-statistic: 47.29 on 1 and 197 DF, p-value: 7.903e-11
Annual income, in dollars, is an explanatory variable in a regression analysis. For a British version of the report on the analysis, all responses are converted to British pounds sterling (1 pound equals about 1.33 dollars, as of 2016).
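No analysis accompanies this problem here, but the effect of the unit change follows a standard rule: since pounds = dollars / 1.33, the slope measured per pound is 1.33 times the slope measured per dollar, while the correlation is unchanged by any linear rescaling. A minimal simulated sketch (all names and numbers are my own illustration):
# Rescaling the predictor from dollars to pounds multiplies the slope by 1.33
# and leaves the correlation untouched. Simulated data for illustration only.
set.seed(1)
income_dollars <- runif(50, 20000, 120000)
outcome <- 2 + 0.00005 * income_dollars + rnorm(50)
income_pounds <- income_dollars / 1.33
coef(lm(outcome ~ income_dollars))[2] * 1.33   # dollar slope, rescaled
coef(lm(outcome ~ income_pounds))[2]           # pound slope: identical
cor(outcome, income_dollars)                   # equal to the line below
cor(outcome, income_pounds)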
Can Southern California’s water supply in future years be predicted from past data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more efficiently. The data file contains 43 years’ worth of precipitation measurements taken at six sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM.
First, I load in the appropriate data. Then I look at the head() output to preview the dataset and use ?water to get details on this "California water" data. I can now create a scatterplot matrix for the data using the plot() function, which gives an initial look at possible linear relationships in the data.
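A sketch of those steps (assuming the alr4 package, which provides the water data, is already loaded):
data("water")
head(water)
plot(water)  # scatterplot matrix of Year, the six precipitation sites, and BSAAM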
Using the plot() function allows me to visually identify the positive linear correlations between these variables. At first glance, leading with the strongest visual relationships, those appear to be:
Still clear but weaker positive linear correlations:
There are no apparent positive (or negative!) linear correlations between the ‘Year’ variable and any other variable.
This is helpful for visualizing relationships, but R offers better ways to present this scatterplot matrix, and I would also like the statistical correlation values for these relationships.
One option is a package called "GGally", which builds upon ggplot2 to show more than the basic scatterplot matrix. The primary difference between GGally's ggpairs() function and base R's pairs() function is that the diagonal consists of the densities of the variables and the upper panels consist of the Pearson correlation coefficients between the variables (the "pearson" method).
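A sketch of that call (the explicit upper argument mirrors the "pearson" method noted above, which ggpairs() also uses by default):
library(GGally)
ggpairs(water, upper = list(continuous = wrap("cor", method = "pearson")))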
Using this representation, I can see both the visual correlations and the statistical correlations of the pairs, which allows me to compare my predictions to the correlation scores. The highest significant correlations are:
The results confirm that although my visual evaluation using the initial scatterplot matrix was not 100% accurate, it served as a reliable estimate of positive linear statistical correlation.
Finally, I want to use a third option for visualizing these correlations that is even simpler and more graphically pleasing: the ggcorr() function in GGally. This method cleanly displays, in a simple way, the strong correlations we have evaluated.
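A sketch of that call (the label argument, my choice, prints the coefficients on the tiles):
ggcorr(water, label = TRUE)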
In the website and online forum RateMyProfessors.com, students rate and comment on their instructors. Launched in 1999, the site includes millions of ratings on thousands of instructors. The data file includes the summaries of the ratings of 364 instructors at a large campus in the Midwest (Bleske-Rechek and Fritsch, 2011). Each instructor included in the data had at least 10 ratings over a several year period. Students provided ratings of 1–5 on quality, helpfulness, clarity, easiness of instructor’s courses, and raterInterest in the subject matter covered in the instructor’s courses. The data file provides the averages of these five ratings. Use R to reproduce the scatterplot matrix in Figure 1.13 in the ALR book (page 20). Provide a brief description of the relationships between the five ratings. (The variables don’t have to be in the same order)
First, I load in the appropriate data. Then I look at the head() output to preview the dataset and use ?Rateprof to get details on this dataset. I create a dataframe of the variables of interest from the Rateprof dataset.
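A sketch of that step (the dataframe name df4 matches the pairs() call below; the column order matches the printed output):
data("Rateprof")
df4 <- Rateprof %>%
select(quality, helpfulness, clarity, easiness, raterInterest)
head(df4)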
quality helpfulness clarity easiness raterInterest
1 4.636364 4.636364 4.636364 4.818182 3.545455
2 4.318182 4.545455 4.090909 4.363636 4.000000
3 4.790698 4.720930 4.860465 4.604651 3.432432
4 4.250000 4.458333 4.041667 2.791667 3.181818
5 4.684211 4.684211 4.684211 4.473684 4.214286
6 4.233333 4.266667 4.200000 4.533333 3.916667
Using the pairs() function, I am now able to create a scatterplot matrix that matches the one in the book (ALR, Problem 1.6).
pairs(df4)
The relationships represented in this scatterplot matrix indicate a strong, positive linear relationship between quality and helpfulness. There is also a strong, positive linear relationship between quality and clarity, aside from one outlier. These tell me that as an instructor's ratings on quality, helpfulness, and clarity rise, the other variables in that group tend to rise as well. However, I cannot guess or speculate as to the cause of these correlations.
In comparison, easiness has a moderate positive linear relationship with each variable except raterInterest. The scatterplots of raterInterest against the other four variables are roughly horizontal, indicating either no correlation or a very weak correlation between them.
For the student.survey data file in the smss package, conduct regression analyses relating (i) y = political ideology and x = religiosity, (ii) y = high school GPA and x = hours of TV watching.
(You can use ?student.survey in the R console, after loading the package, to see what each variable means.)
After installing the “smss” package and loading the relevant library, I again inspect the data and package info.
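A sketch of those steps:
library(smss)
data("student.survey")
head(student.survey)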
subj ge ag hi co dh dr tv sp ne ah ve pa pi
1 1 m 32 2.2 3.5 0 5.0 3 5 0 0 FALSE r conservative
2 2 f 23 2.1 3.5 1200 0.3 15 7 5 6 FALSE d liberal
3 3 f 27 3.3 3.0 1300 1.5 0 4 3 0 FALSE d liberal
4 4 f 35 3.5 3.2 1500 8.0 5 5 6 3 FALSE i moderate
5 5 m 23 3.1 3.5 1600 10.0 6 6 3 0 FALSE i very liberal
6 6 m 39 3.5 3.5 350 3.0 4 5 7 0 FALSE d liberal
re ab aa ld
1 most weeks FALSE FALSE FALSE
2 occasionally FALSE FALSE NA
3 most weeks FALSE FALSE NA
4 occasionally FALSE FALSE FALSE
5 never FALSE FALSE FALSE
6 occasionally FALSE FALSE NA
#To plot y = political ideology and x = religiosity, the relevant variable names are "pi" and "re".
plot(pi ~ re, data = student.survey)
#To plot y = high school GPA and x = hours of TV watching, the relevant variable names are "hi" and "tv".
plot(hi ~ tv, data = student.survey)
These initial representations really don’t tell me much about the relationships between the variables.
Now I need to do some data cleaning and convert the categorical variables for political ideology and religiosity into numeric variables. I will also rename the columns in the sub-frames below for the sake of clarity.
For political ideology ("pi"), as.integer() assigns the factor levels the values 1 through 7, from very liberal (1) up to very conservative (7). For religiosity ("re"), which records how often the respondent attends religious services, the assigned values are never = 1, occasionally = 2, most weeks = 3, and every week = 4. (as.integer() follows the factor's level order, so this coding depends on the levels being stored in that order.)
df5 <- student.survey %>%
select(c(hi, tv, pi, re))
df5$pi <- as.integer(as.factor(df5$pi))
df5$re <- as.integer(as.factor(df5$re))
Then I create a sub-frame for the particular pair of variables I want to compare, starting with religiosity and political ideology (sketched below), and create a scatter plot of their relationship.
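The sub-frame code did not survive in this writeup; a minimal sketch, with column names reconstructed from the plot and correlation output below:
df5a <- df5 %>%
select(pi, re) %>%
rename(Political.Ideology = pi, Religiosity = re)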
gg5a <- ggplot(df5a, aes(x=Religiosity, y=Political.Ideology)) +
geom_point() +
geom_smooth(method=lm,se=TRUE,fullrange=TRUE,color="goldenrod") +
labs(title= "Religiosity & Political Ideology",
x= "Religiosity",
y = "Political Ideology") +
theme_classic()
gg5a
Again, I create a sub-frame for the specific pair of variables I want to compare, now high school GPA and hours of TV watched per week (sketched below), and create a scatter plot of their relationship.
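As above, a minimal sketch of the missing sub-frame code, with column names reconstructed from the output below:
df5b <- df5 %>%
select(hi, tv) %>%
rename(High.School.GPA = hi, Hours.TV = tv)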
gg5b <- ggplot(df5b, aes(x=Hours.TV, y=High.School.GPA)) +
geom_point() +
geom_smooth(method=lm,se=TRUE,fullrange=TRUE,color="goldenrod") +
labs(title= "High School GPA & Average Hours of TV Watched Per Week",
x= "Average Number of Hours of TV Watched per Week",
y = "High School GPA") +
theme_classic()
gg5b
summary(df5)
hi tv pi re
Min. :2.000 Min. : 0.000 Min. :1.000 Min. :1.000
1st Qu.:3.000 1st Qu.: 3.000 1st Qu.:2.000 1st Qu.:1.750
Median :3.350 Median : 6.000 Median :2.000 Median :2.000
Mean :3.308 Mean : 7.267 Mean :3.033 Mean :2.167
3rd Qu.:3.625 3rd Qu.:10.000 3rd Qu.:4.000 3rd Qu.:3.000
Max. :4.000 Max. :37.000 Max. :7.000 Max. :4.000
This basic correlation matrix gives an overview of the correlation between religiosity and political ideology, rounded to 2 decimals. It indicates a moderate positive relationship between the variables.
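A sketch of the call that produces it (using the renamed sub-frame):
round(cor(df5a), 2)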
Political.Ideology Religiosity
Political.Ideology 1.00 0.58
Religiosity 0.58 1.00
Using the lm() function, I can confirm that the p-value for this relationship is ~1.22e-06, far below the 0.001 threshold.
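A minimal sketch of that fit (the object name m5a is my choice; the formula matches the Call line below):
m5a <- lm(pi ~ re, data = df5)
summary(m5a)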
Call:
lm(formula = pi ~ re, data = df5)
Residuals:
Min 1Q Median 3Q Max
-2.81243 -0.87160 0.09882 1.12840 3.09882
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.9308 0.4252 2.189 0.0327 *
re 0.9704 0.1792 5.416 1.22e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.345 on 58 degrees of freedom
Multiple R-squared: 0.3359, Adjusted R-squared: 0.3244
F-statistic: 29.34 on 1 and 58 DF, p-value: 1.221e-06
Again, a correlation matrix gives an overview of the correlation between high school GPA and hours of TV watched, rounded to 2 decimals. It indicates a weak negative correlation between the variables.
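A sketch of the call that produces it:
round(cor(df5b), 2)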
High.School.GPA Hours.TV
High.School.GPA 1.00 -0.27
Hours.TV -0.27 1.00
Using the lm() function, I can confirm that the p-value for this relationship is ~0.04. (Note that the model below regresses tv on hi, the reverse of the stated y and x; in simple regression, however, the slope's t-statistic and p-value are identical in either direction, so the significance conclusion is unchanged.)
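A minimal sketch of that fit (the object name m5b is my choice; the formula matches the Call line below):
m5b <- lm(tv ~ hi, data = df5)
summary(m5b)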
Call:
lm(formula = tv ~ hi, data = df5)
Residuals:
Min 1Q Median 3Q Max
-8.600 -3.790 -1.167 2.408 27.746
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.200 6.175 3.271 0.0018 **
hi -3.909 1.849 -2.114 0.0388 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.528 on 58 degrees of freedom
Multiple R-squared: 0.07156, Adjusted R-squared: 0.05555
F-statistic: 4.471 on 1 and 58 DF, p-value: 0.03879
The basic analysis of these pairs of data from the "student.survey" data set tells me that if the null hypotheses are:
H1: there is no association between religiosity and political ideology, and
H2: there is no association between hours of TV watched and high school GPA,
then I can use the inferential analyses to conclude that I have sufficient evidence at the 95% confidence level to reject both null hypotheses, H1 and H2. However, if I choose to use a 99% confidence level, I would be able to reject H1 (p ≈ 1.22e-06) but not H2 (p ≈ 0.039).
For a class of 100 students, the teacher takes the 10 students who perform poorest on the midterm exam and enrolls them in a special tutoring program. The overall class mean is 70 on both the midterm and final, but the mean for the specially tutored students increases from 50 to 60. Use the concept of regression toward the mean to explain why this is not sufficient evidence to imply that the tutoring program was successful. (Here’s a useful hint video: https://www.youtube.com/watch?v=1tSqSMOyNFE)
No, we do not have enough information to say that this result is a direct effect of the special tutoring program. The increase in scores among the tutored subset may be due to other factors, such as the students simply having a better day on the day of the second test.
Moreover, the statistical phenomenon of regression toward the mean is at play: students who score at an extreme on one test tend to score closer to the mean on a retest, because part of any extreme score is chance. The tutoring effect could be real, but without additional information, random variation combined with regression toward the mean remains the primary explanation.
This evidence could be strengthened by more information, such as multiple test results over the semester rather than just two tests, or a density plot of the full class's scores. It could also be strengthened by the presence of a control group of similarly low-scoring students who received no tutoring.
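A minimal simulation sketch of the phenomenon (all numbers and names are my own): each student has a fixed true ability, each exam score is that ability plus independent noise, and no tutoring occurs at all; yet the 10 poorest midterm performers still improve on the final.
set.seed(42)
n <- 100
ability <- rnorm(n, mean = 70, sd = 8)  # each student's fixed "true" level
midterm <- ability + rnorm(n, sd = 8)   # observed score = ability + luck
final   <- ability + rnorm(n, sd = 8)   # second test, fresh independent luck
bottom10 <- order(midterm)[1:10]        # the 10 poorest midterm performers
mean(midterm[bottom10])  # well below the class mean
mean(final[bottom10])    # noticeably higher, with no tutoring at all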