Math 247: Explorations 10.4-10.5

Exploration 10.4: Predicting Brain Density from Number of Facebook Friends

In Section 10.3, we looked at how least squares linear regression can be used to describe a linear relationship between two quantitative variables.

In that section, we mentioned that the sign on the slope of the regression equation and the sign on the correlation coefficient are always the same.

For example, when there is a positive association in the data, both the correlation coefficient and the slope of the regression equation are positive.
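
The shared sign follows from the identity that the least squares slope equals the correlation coefficient multiplied by the ratio of the standard deviations, and standard deviations are never negative. A minimal sketch in base R, assuming made-up numbers (the vectors `x` and `y` are hypothetical and are not the study data):

```r
# Hypothetical toy data (not from the Facebook study)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 2.9, 3.2, 4.8, 5.1, 5.9)

r <- cor(x, y)                      # sample correlation coefficient
b <- unname(coef(lm(y ~ x))["x"])   # least squares slope

# The slope is r rescaled by sd(y) / sd(x), a positive number,
# so the slope and the correlation always share a sign.
b_from_r  <- r * sd(y) / sd(x)
same_sign <- sign(b) == sign(r)
```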

In Section 10.2, we saw how to use the sample correlation coefficient in a simulation-based test about a null hypothesis of no association. In Section 10.4, we will see how we can do the same type of inference but now with the population slope as the parameter of interest.

STEP 1: Ask a research question.

How often did you check your Facebook account today? How many Facebook friends do you have? Does everyone have the same number? Might the number of Facebook friends that a person has be associated in some way with the person’s brain structure? Kanai, Bahrami, Roylance, and Rees (2011, “Online social network size is reflected in human brain structure,” Proceedings of the Royal Society B: Biological Sciences) examined the relationship between number of (self-reported) Facebook friends and “grey matter density” in different regions of the brain involved with social interaction, memory, and emotional responses. In particular, Kanai and colleagues wanted to explore whether brain density tends to increase as a person’s number of Facebook friends increases.

Write the null and alternative hypotheses for this study in words (use the term association).

STEP 2: Design a study and collect data.

Kanai and colleagues performed MRI scans on student volunteers at University College London. The results from each brain area were quite similar. You will examine results for the left middle temporal gyrus (which has been linked in other studies to facial recognition) for a follow-up study of 40 students. These results are stored in the dataset Facebook in the ISIwithR package, loaded with library(ISIwithR). Note that number of friends is given in units of 100 friends, so 0.30 = 30 friends, 1.09 = 109 friends, and so on. The brain density measurement is in “arbitrary units.”

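library(ISIwithR) # provides the Facebook dataset
library(mosaic)   # xyplot(), makeFun(), do(), shuffle(), prop() used below (skip if already attached)
data(Facebook)
head(Facebook, n = 2)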

Identify the observational units and sample size in this study.

Describe the explanatory variable and the response variable (as set up in the research conjecture) and classify each as quantitative or categorical.

Is this an observational study or a randomized experiment? Justify your answer.

STEP 3: Explore the data.

Let’s use R to create a scatterplot where brain density is the response variable (vertical axis) and number of Facebook friends is the explanatory variable (horizontal axis). (Keep in mind that the number of Facebook friends is reported in terms of hundreds of friends.)

xyplot(density ~ friends,
       data = Facebook,
       main="Scatterplot of the data",
       xlab="number of facebook friends (in 100s)")

Describe the direction, form, and strength of association between the variables as revealed in the scatterplot. Are there any unusual observations?

Let’s use R to determine the least squares regression line for predicting brain density based on number of Facebook friends.

model<-lm(density ~ friends, data = Facebook) # fitted equation of regression line
coef(model)

## (Intercept)     friends 
##  -0.7404404   0.2008952

The estimated (fitted) equation of the regression line is:

\[\widehat{\text{brain density}} = -0.74 + 0.20 \times (\text{number of FB friends, in 100s})\]

Interpret the slope in context: The slope of the regression line predicting brain density based on number of friends is __________, meaning that for every additional _________ increase in number of friends, the predicted brain density increases by _________ units.

Note that number of friends is given in units of 100 friends, so 0.30 = 30 friends, 1.09 = 109 friends, and so on. The brain density measurement is in “arbitrary units.”

Let’s investigate what the slope means in the context of number of friends and brain density. Use the output from the R chunk below to predict the brain density for a person with 300 Facebook friends.

# store the fitted regression line as a function named `prediction_equation`
prediction_equation <- makeFun(lm(density ~ friends, data = Facebook))
prediction_equation(friends=3) # predicted density for a person with 300 FB friends

##          1 
## -0.1377549
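
The prediction above is plain arithmetic on the fitted coefficients, which can be checked by hand. A small sketch in base R, assuming only the two coefficients copied from the coef(model) output (the variable names are illustrative):

```r
# Coefficients copied from the coef(model) output shown earlier
b0 <- -0.7404404   # intercept
b1 <-  0.2008952   # slope, per 100 friends

# Predicted density at 300 friends (friends = 3, in units of 100)
pred_300 <- b0 + b1 * 3

# Predicted density at 400 friends; the change from 300 to 400 is the slope
pred_400     <- b0 + b1 * 4
diff_per_100 <- pred_400 - pred_300
```

Comparing `pred_300` and `pred_400` shows that moving one unit along the horizontal axis (100 more friends) changes the predicted density by exactly the slope.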

STEP 4: Draw inferences.

In Section 10.2, we performed a simulation-based test for a correlation coefficient. Doing the simulation-based test for the slope of the regression line is extremely similar. The only difference is that instead of the correlation coefficient, we use the slope as our statistic.

You should have found a positive association between number of Facebook friends and brain density in the sample. The question, however, is this: if there were no association between number of Facebook friends and brain density in the population, how likely is it that we would get a slope as large (as far above zero) as we did in a sample of 40 students? Let’s apply the 3S strategy.

7a. Statistic: What is the value of the slope in the sample?

7b. Simulate: To simulate, you can use the same general approach we used for correlation in Section 10.2. Explain how you would conduct the simulation by hand. Assume you have 40 slips of paper with the sample numbers of Facebook friends on them and 40 slips of paper with the brain densities written on them.

7c. Strength of evidence: Explain how you will calculate the strength of evidence in support of the conjecture that people with more Facebook friends tend to have higher brain densities.

Now let’s complete a test of significance.

Let’s model the null hypothesis by breaking the association between the data pairs. Let’s do 1,000 shuffles and describe the behavior of the 1,000 regression lines across the different shuffles (see FAQ 10.4.1 for some discussion about this).

set.seed(546)
Fb_slope.null <- do(1000) * coef(lm(shuffle(density) ~ friends, data = Facebook))
dotPlot(~friends, 
        data = Fb_slope.null,
        pch=20,
        cex=1.2,
        n = 40, 
        main="simulated null distribution of the sample slopes",
        xlab="sample slope b1",
        groups = (friends >= 0.2009))

Let’s use R to find the simulation-based p-value using the sample/estimated slope \(b=0.2009\).

p.value<-prop(~(friends >= 0.2009), 
              data = Fb_slope.null)
cat("simulation-based p-value using distribution of sample slopes",
    p.value)

## simulation-based p-value using distribution of sample slopes 0.01
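
For readers who want to see what the shuffling does step by step, the same idea can be sketched in base R without the mosaic helpers. This is only an illustration under stated assumptions: the data are simulated stand-ins (the vectors `friends` and `density` below are hypothetical, not the Facebook dataset), and `sample()` plays the role of `shuffle()`.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical data standing in for the 40 (friends, density) pairs
n <- 40
friends <- runif(n, min = 0.3, max = 8)
density <- -0.7 + 0.2 * friends + rnorm(n)

obs_slope <- unname(coef(lm(density ~ friends))["friends"])

# One shuffle: re-pair the densities with the friend counts at random,
# refit the line, and keep the slope.
one_shuffled_slope <- function() {
  unname(coef(lm(sample(density) ~ friends))["friends"])
}

null_slopes <- replicate(1000, one_shuffled_slope())

# One-sided p-value: proportion of shuffled slopes at least as large as observed
p_value <- mean(null_slopes >= obs_slope)
```

Because shuffling breaks any link between the two variables, the shuffled slopes center near zero, which is what the null hypothesis of no association predicts.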

Explain what your p-value measures in the context of the study. Can we conclude that there is strong evidence of a genuine positive association between numbers of Facebook friends and brain density in the population?

Let’s now use the correlation coefficient and find the p-value corresponding to the observed correlation coefficient, as we did in Section 10.2. First, let’s calculate the correlation coefficient between density and friends, \(r\).

cor(density ~ friends,data = Facebook)

## [1] 0.3655156

The sample correlation coefficient, i.e., our statistic, is \(r=0.365\).

Now let’s simulate a world in which the population correlation coefficient (the parameter of interest) is \(\rho=0\) and calculate the p-value based on our statistic, \(r=0.365\).

set.seed(126)
Fb_cor.null <- do(1000) * cor(shuffle(density) ~ friends,
                              data = Facebook)
dotPlot(~cor, 
        data = Fb_cor.null,
        pch=20,
        cex=1.2,
        n = 40, 
        main="simulated null distribution of the sample correlation coefficients",
        xlab="sample correlation r", 
        groups = (cor >= 0.365))

p.value<-prop(~(cor >= 0.365), data = Fb_cor.null)
cat("simulation-based p-value using distribution of sample correlation coefficients",
    p.value)

## simulation-based p-value using distribution of sample correlation coefficients 0.009

How does this p-value compare with the p-value corresponding to the slope?

Key Idea: For a given data set, the test for slope is equivalent to the test for correlation coefficient.
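
One way to see why the two tests agree: shuffling the response does not change either standard deviation, so every shuffled slope is just the shuffled correlation multiplied by the same positive constant. The sketch below, in base R on simulated stand-in data (the vectors `x` and `y` are hypothetical), applies the same 1,000 shuffles to both statistics.

```r
set.seed(2)  # arbitrary seed so the sketch is reproducible

# Hypothetical data standing in for friends (x) and density (y)
n <- 40
x <- runif(n, min = 0.3, max = 8)
y <- -0.7 + 0.2 * x + rnorm(n)

obs_r <- cor(x, y)
obs_b <- obs_r * sd(y) / sd(x)   # equals the least squares slope

# Apply the same 1,000 shuffles to both statistics
shuffled_r <- replicate(1000, cor(x, sample(y)))
shuffled_b <- shuffled_r * sd(y) / sd(x)  # shuffling leaves sd(x) and sd(y) unchanged

p_cor   <- mean(shuffled_r >= obs_r)
p_slope <- mean(shuffled_b >= obs_b)
```

With identical shuffles the two p-values coincide. The small difference between 0.01 and 0.009 in the output above comes from using different random shuffles (different seeds) for the two simulations.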

STEP 5: Formulate conclusions.

Remember that the sample used here was not randomly selected, but rather a group of 40 volunteer university students.

Describe a population in which you would be comfortable drawing inferences. Explain your reasoning.

Can we conclude that acquiring more Facebook friends would lead to an increase in a person’s brain density? Explain your answer.

What about the opposite direction: Can we conclude that having a larger brain density causes a person to acquire more Facebook friends? Explain.

STEP 6: Look back and ahead.

Explain at least one thing you would do differently if you were these researchers and were doing the study again and why you’d do it.

Exploration 10.5: Predicting Brain Density from Number of Facebook Friends (continued)

In Exploration 10.4 we explored the potential positive relationship between the number of Facebook friends someone has and their brain density based on data from 40 volunteer university students.

The null hypothesis is that there is no association between brain density and number of Facebook friends in the population. In Exploration 10.4, we rejected this null hypothesis because the observed slope (and correlation coefficient) were too far in the tail of the (respective) null distribution to be attributed to random chance. We will further explore this analysis in this exploration using a theory-based approach.

Theory-Based Approach

When using the theory-based approach for regression you can write your hypotheses in terms of population parameters.

The relevant population parameters are:

the population slope (indicated by the Greek letter, \(\beta\)) and
the population correlation (indicated by the Greek letter \(\rho\)).

See FAQ 10.5.1 for more discussion about parameters.

By now, you’ve seen many times that when the null distribution of statistics takes a familiar, mound-shaped curve, we can often use theory-based methods to predict the null distribution of related standardized statistics—as long as certain validity conditions are true. This is no different for regression (and correlation!).

The theory-based approach computes a standardized statistic (t-statistic) using one of the following equations:

\[t=\frac{r}{\sqrt{\frac{1-r^2}{n-2}}}=\frac{b-0}{SE(b)}\]

More information is provided in Appendix A, but notice that the t-statistic can be computed based on either the correlation coefficient or the slope, and it yields the same value—further underscoring that tests for the correlation coefficient and for the slope are essentially identical.
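
The equality of the two forms can be sketched as follows, assuming the standard simple-regression identities \(b = r\,s_y/s_x\) and \(SE(b)=\frac{s_y}{s_x}\sqrt{\frac{1-r^2}{n-2}}\), where \(s_x\) and \(s_y\) denote the sample standard deviations of the explanatory and response variables (symbols introduced here for illustration only):

\[t=\frac{b-0}{SE(b)}=\frac{r\,\frac{s_y}{s_x}}{\frac{s_y}{s_x}\sqrt{\frac{1-r^2}{n-2}}}=\frac{r}{\sqrt{\frac{1-r^2}{n-2}}}\]

The ratio \(s_y/s_x\) cancels, which is why the units of the two variables do not affect the standardized statistic.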

Theory-Based Approach using correlation coefficient

\[H_0: \rho=0\]

\[H_A: \rho>0\]

Let’s use R to find the correlation coefficient for the 40 students.

cat("sample correlation coefficient r = ",cor(density ~ friends, data = Facebook))

## sample correlation coefficient r =  0.3655156

Use the correlation coefficient to find the t-statistic using the following equation:

\[t=\frac{r}{\sqrt{\frac{1-r^2}{n-2}}}\]

Include a one-sentence interpretation of this standardized statistic.

r<- # your value here
n<- # your value here
t=r/(sqrt((1-r^2)/(n-2)))
cat("standardized statistic t=",t)

An “automated” way to obtain the t-statistic for the correlation coefficient in R is as follows:

stat(cor.test(density ~ friends, data = Facebook))

##        t 
## 2.420689

Let’s use the t-statistic you calculated to approximate the p-value for this test. How does it compare to the simulation-based p-value you found in Exploration 10.4?

#option 1
cat("theory-based right-sided p-value using correlation is", pval(cor.test(density ~ friends, data = Facebook, alternative="greater")))

## theory-based right-sided p-value using correlation is 0.01018894

#option 2
pval(cor.test(density ~ friends, data = Facebook))/2 # divided by 2 to obtain a one-sided p-value

##    p.value 
## 0.01018894
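
Both options rest on the same calculation: the area to the right of the t-statistic under a t distribution with \(n-2\) degrees of freedom. A short base R sketch, assuming only the t-statistic and sample size reported above (the variable names are illustrative):

```r
t_stat <- 2.420689   # t-statistic reported by cor.test() above
n      <- 40         # number of students
df     <- n - 2      # degrees of freedom

# Right-tail area under the t distribution with 38 degrees of freedom
p_right <- pt(t_stat, df = df, lower.tail = FALSE)

# The t distribution is symmetric, so the two-sided p-value is twice as large
p_two_sided <- 2 * p_right
```

This is why dividing the default two-sided p-value by 2 (option 2) matches the result of requesting alternative="greater" (option 1) when the statistic is positive.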

Theory-Based Approach using slope of regression line

Let’s translate the research question into null and alternative hypotheses using the population slope, \(\beta\) as a parameter of interest.

\(H_0\): there is no linear relationship between the number of Facebook friends and brain density,

\[H_0: \beta = 0\]

\(H_a\): there is a positive linear relationship between the number of Facebook friends and brain density, i.e., as the number of friends increases, brain density tends to increase as well

\[H_a: \beta > 0\]

The R code below calculates the least squares regression coefficients. In addition, the output contains the test statistics for the estimated intercept and slope in the column t value, as well as the corresponding two-sided p-values in the column Pr(>|t|). Notice that R-squared is also included in the output.

summary(lm(density ~ friends, data = Facebook))

## 
## Call:
## lm(formula = density ~ friends, data = Facebook)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5050 -0.8057 -0.0401  0.6734  1.6753 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.74044    0.33948  -2.181   0.0354 *
## friends      0.20090    0.08299   2.421   0.0204 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9421 on 38 degrees of freedom
## Multiple R-squared:  0.1336, Adjusted R-squared:  0.1108 
## F-statistic:  5.86 on 1 and 38 DF,  p-value: 0.02038

\(R^2=0.1336\). Interpret what this value means in the context of this problem.
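
Two numbers in the summary output tie back to quantities computed earlier: with a single explanatory variable, R-squared is the square of the correlation coefficient, and the F-statistic is the square of the slope's t-statistic. A quick check in base R, using only values copied from the output above (variable names are illustrative):

```r
r       <- 0.3655156   # sample correlation from cor() above
t_slope <- 2.420689    # t-statistic from cor.test() above (the slope's t value, 2.421)

r_squared <- r^2         # about 0.1336, the Multiple R-squared
f_stat    <- t_slope^2   # about 5.86, the F-statistic on 1 and 38 DF
```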

We can use R shortcuts to extract the t-statistics. Report the t-statistic for the slope (coefficient of friends), i.e. \(t=\frac{b}{SE(b)}\) and interpret what it measures. Be specific.

summary(lm(density ~ friends, data = Facebook))$coefficients[, 3] # t-statistics

## (Intercept)     friends 
##   -2.181121    2.420689

We can use R shortcuts to extract the p-values. Keep in mind that we are performing a one-sided test, and the output provides the two-sided p-value.

#p-values
summary(lm(density ~ friends, data = Facebook))$coefficients[,4]

## (Intercept)     friends 
##  0.03543210  0.02037788

The theory-based right-sided p-value using the slope is 0.0204/2 = 0.0102.

How does this p-value compare to:

p-value obtained with the simulation-based test for a correlation coefficient?

p-value obtained with the simulation-based test for the slope of the regression line?

p-value obtained with the theory-based test for a correlation coefficient?

What does this comparison tell you (what conclusion can we make about these methods)?

Validity Conditions for Linear Regression

Four validity conditions must hold for the theory-based approach to regression to yield a trustworthy p-value.

Validity conditions for a theory-based test for a regression slope, LINE

Linearity: The general pattern of the points in the scatterplot should follow a Linear trend; the pattern should not show curved or other nonlinear patterns.
Independence of observations.
Normality: The distribution of the response variable at each value of \(x\) is Normal. In other words, there should be approximately the same distribution of points above the regression line as below the regression line (symmetry about the regression line).
Equal Variance: The variability (spread) of the points around the regression line should be similar regardless of the value of the explanatory variable; it should not change as you slide along the x-axis.

Checking validity conditions

If we were to subtract from each observation the mean response at its value of x and pool these differences together, then we should have one big distribution with mean 0, standard deviation equal to \(\sigma\), and a Normal shape. However, we don’t know the population regression coefficients; we only have their least squares estimates. If we subtract the fitted values from each response value, these differences are simply the residuals from the least squares regression. So to check the technical conditions we will use residual plots.

The Linearity condition is considered met if a plot of the residuals vs. the explanatory variable (fitted values in R) does not show any patterns such as curvature.
Another condition is that the observations are Independent. We will usually appeal to the data being collected randomly or randomization being used. The main thing is to make sure there is not something like time dependence in the data.
The Normality condition is considered met if a histogram and Normal Probability Plot of the residuals look normal.
The Equal variance condition is considered met if a plot of the residuals vs. the explanatory variable shows similar variability in residuals across the values of x.
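
The residual-based checks in the list above can be sketched in base R as follows. This is an illustration on simulated stand-in data (the vectors `x` and `y` are hypothetical, not the Facebook dataset), using base graphics rather than the lattice plots used elsewhere in this exploration.

```r
set.seed(3)  # arbitrary seed so the sketch is reproducible

# Hypothetical data standing in for friends (x) and density (y)
n <- 40
x <- runif(n, min = 0.3, max = 8)
y <- -0.7 + 0.2 * x + rnorm(n)

fit         <- lm(y ~ x)
res         <- resid(fit)    # observed response minus fitted value
fitted_vals <- fitted(fit)   # predicted response for each x

# Residuals vs. fitted values: look for curvature (linearity)
# and for changing spread such as a funnel (equal variance)
plot(fitted_vals, res, xlab = "fitted values", ylab = "residuals")
abline(h = 0, lty = 2)

# Histogram and Normal QQ plot of the residuals (normality)
hist(res, main = "Histogram of residuals", xlab = "residuals")
qqnorm(res)
qqline(res)
```

Least squares residuals always sum to zero when the model includes an intercept, so the plots are judged on their shape and spread, not on their center.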

The Linearity condition is assessed with the scatterplot, i.e., “does a linear relationship between x and y seem reasonable?”

xyplot(density ~ friends,
       data = Facebook,
       main="Scatterplot of the data",
       xlab="number of facebook friends (in 100s)")

Based on the scatterplot above, is the Linearity condition met in this case? Namely, is the general pattern of the scatterplot linear?

Independence condition is assessed based on the information about how the data was collected. If we have a random sample the independence condition is considered to be satisfied.

Is it reasonable to assume Independence of the observations in this case?

Normality of the residuals condition is assessed with the help of a Normal QQ plot or a histogram of residuals. A normal quantile plot (QQ plot) is a scatterplot of the ordered residuals against values we’d expect to see from a “perfect” normal sample of the same size. These can check for normally distributed errors.

Ideal pattern: most dots are close to the dotted line, with perhaps a few deviations.

plot(lm(density ~ friends, data = Facebook),
     which=2,
     main="Normal QQ plot for assessing normality condition")

histogram(lm(density ~ friends, data = Facebook)$residuals,
          width = 0.45,
          main = "Histogram of the residuals",
          xlab="residuals")

Based on the QQ plot and the histogram of residuals shown above, is the Normality condition met?

The Equal variance condition is considered met if a plot of the residuals vs. the explanatory variable shows similar variability in residuals across the values of x. In other words, is the vertical spread of the points around the regression line about the same all along the x-axis?

xyplot(density ~ friends,
       data = Facebook,
       main="Scatterplot of the data with regression line",
       xlab="number of facebook friends (in 100s)",
       type = c("p", "r"))# adding a regression line

Equal variance of the residuals (homogeneity condition) can be checked with the help of the scatterplot of the residuals against the fitted values produced in R. Positive residuals represent points above the line, negative residuals represent points below the line.

Ideal pattern: random variation above and below zero (grey dotted line), with no pattern. What we don’t want to see is a “funnel” pattern in which it is clear that the variances are not constant.

plot(lm(density ~ friends, data = Facebook),
     which=1,
     main="Residuals vs Fitted plot for assessing equality of variances")

Does the equal variance condition seem to be met? Namely, is the variability of the points around the line similar regardless of the value of the explanatory variable?

Let’s use R to obtain the \(95\%\) confidence interval for the slope.

confint(lm(density ~ friends, data = Facebook))

##                   2.5 %      97.5 %
## (Intercept) -1.42767560 -0.05320521
## friends      0.03288886  0.36890148
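
The interval for the slope has the usual form, estimate plus or minus a t multiplier times the standard error. The sketch below reproduces it in base R from the rounded values in the summary() output, so the endpoints agree with confint() only up to rounding (variable names are illustrative):

```r
b  <- 0.20090   # estimated slope from summary()
se <- 0.08299   # its standard error from summary()
df <- 38        # residual degrees of freedom (n - 2)

t_star <- qt(0.975, df = df)   # multiplier for 95% confidence

ci <- c(lower = b - t_star * se, upper = b + t_star * se)
```

Because the whole interval lies above zero, it is consistent with the small p-value found for the test of \(\beta = 0\).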

Record and interpret this interval. (Hint: Make sure that your interpretation refers to how to interpret the slope coefficient in the population and remember that the units of the explanatory variable are 100s of Facebook friends.)

We are 95% confident that a one _____ increase in number of Facebook friends is associated with an average _________ of ______ to ______ units in brain density in the population of people represented by this sample.