Lecture Notes

My notes on the lectures/videos for selected units. These are just my notes, and although I have checked things to my satisfaction, I cannot guarantee total accuracy.

As of today (9/1/14), I have more or less complete notes on units 4 through 7, and partial notes on unit 1. My goal is to try to fill in units 2 through 3, and the rest of unit 1, in the next few weeks.

Unit 1: Introduction to Data

Video 1-1-1: Data Basics

Observations, variables, and data matrices

The data shown here is similar but not identical to that in the video.

google <- read.csv("google_data/google-government-removal-requests.csv")
google <- google[complete.cases(google),]
colnames(google) <- c("period", "country", "terr.code", "all_num_req", "all_compl", "all_tbr", "cr_req", "cr_compl", "cr_tbr", "other_req", "other_compl", "other_tbr")
gfields <- c("country", "cr_req", "cr_compl", "other_req", "other_compl")
head(google[,gfields])
##           country cr_req cr_compl other_req other_compl
## 81        Germany     72      100        46          91
## 82         Mexico      1        0         1           0
## 83    Netherlands      1      100         3         100
## 86         Brazil    146       97       117          50
## 94 United Kingdom     18      100        20          80
## 96      Argentina     21      100         1         100

The table above is a data matrix. Each row is an observation or case. Each column represents a variable (also sometimes known as a feature).

Types of variables

There are two types of variables: numerical and categorical. Numerical (quantitative) variables take on numerical values. It is sensible to add, subtract, or take averages of these values. Categorical (qualitative) variables take on a limited number of distinct categories. Categories can be identified with numbers; for example, gender might be coded as 0 for males and 1 for females. But it does not make sense to perform arithmetic operations on these categories.

Numerical variables can be continuous or discrete. Continuous numerical variables are usually measured and can take on any numerical value; an example is height. We tend to round a measure like height when we record it, but this does not change the fact that it is a continuous variable. Discrete numerical variables are generally counted, and can take on only whole non-negative numbers. It can take some thought to decide whether a numerical variable is continuous or discrete; rounding can make a continuous variable like height look discrete.

Categorical variables can be ordinal, which means there is some kind of natural order to the categories; for example, “very low”, “low”, “middle”, “high” and “very high”. If there is no natural order, then the variable is just a “regular” categorical variable.

Reviewing the Google transparency data shown above:

  • country is a categorical variable
  • cr_req is the number of content removal requests made to Google. It is discrete numerical.
  • cr_compl is the percentage of content removal requests with which Google complied. It is a continuous numerical variable, because it can take on any value between 0 and 100, even though the values may be shown rounded to whole numbers.
  • ud_req (not shown in my data): the number of user data requests, again discrete numerical.
  • ud_comply (not shown here): like cr_compl, numerical continuous
  • hemisphere (not shown here): categorical
  • hdi (human development index): categorical (ordinal)

Relationships between variables

The following shows other_req plotted against other_compl (comparable, I think, to ud_req and ud_comply in video).

plot(google$other_req, google$other_compl)

(Plot: scatterplot of other_compl vs. other_req.)

There is an outlier, the United States, which makes many more requests for content removal than any other country.

When two variables show some connection with one another, they are called associated or dependent variables. The association can be described as positive (when one variable goes up, so does the other) or negative (when one variable goes up, the other goes down).

The plot above does not do as good a job as the one in the video at showing a connection between the variables, but a slight positive association can be detected.

If two variables are not associated, they are independent.

Video 1-1-2: Observational Studies and Experiments

Observational study

In an observational study, researchers collect data in a way that does not directly interfere with how the data arise; they merely “observe.” In an observational study, we can only establish an association, or correlation, between two variables (if one exists). We cannot identify causation.

If the data in an observational study are from the past, it is a retrospective study. If the data are observed as they arise, it is a prospective study.

Experiment

In an experiment, researchers randomly assign subjects to various treatments, and can establish causal connections between explanatory and response variables.

An observational study might look at two groups of people, those who work out regularly and those who don’t, and measure average energy levels. An experiment might randomly assign subjects to two groups: those who will be directed to work out regularly, and those who will be directed not to. It should be noted that an experiment cannot always, by itself, establish causal connections. In this example, there may be other confounding variables that have not been controlled, such as whether a subject previously worked out regularly.

Another example in the video discusses an observational study of the breakfast habits of girls. A news headline concludes that “breakfast cereal keeps girls slim,” thus attempting to draw a causal connection from an observational study, which is invalid. The study showed that girls who ate one of several breakfast cereals tended not to be overweight, but there could be many confounding variables. Was it just cereal? Were these girls more likely to eat breakfast because of other factors, which could be the real causes?

Confounding variable

Extraneous variables that affect both the explanatory and response variables, and that make it seem like there is a relationship between them, are called confounding variables.

Correlation does not imply causation.

Video 1-1-3: Sampling and Sources of Bias

Census vs. sample

A census considers the data for an entire population. This can be very expensive and difficult. Also, populations rarely stand still. By the time you collect data from an entire population, the population itself has changed.

Exploratory analysis is like sampling soup while it cooks to determine whether it needs more or less salt. If you can generalize that the entire soup pot is too salty, you have made an inference. But you won’t drink all of the soup to do this. Instead you want a representative sample, which might be a taste after stirring the pot thoroughly.

Sources of bias

  • Convenience sample: the cases in the study are not randomly selected from the population, but instead are easily accessed by the researcher. Example: Asking family members about a food preference.
  • Non-response: If only a non-random fraction of the randomly-selected people in a study respond, the sample is no longer representative of the population. Example: immigrants (legal or not) may be fearful about responding to any government survey.
  • Voluntary response: occurs when the sample consists only of people who volunteer to respond, because they have motivations to do so, which can bias the results.

An example in the video is the presidential election of 1936. A magazine, The Literary Digest, polled 10 million Americans, a huge number. About 2.4 million responded, and the results showed that FDR’s challenger would win by a huge margin. Instead, FDR won by a huge margin. The mistake was that the people polled were the magazine’s own readers, as well as car and telephone owners. At the time, this group had an average income far above the national average. So in spite of the huge number of responses, the results were heavily biased.

You can drink a gallon of the soup from a large soup pot, but if it is all from the top of the pot, you still may not know how your soup tastes!

Also: who makes soup with green beans without cutting off the ends of the green beans first?!?

Sampling methods

  • Simple random sample: select cases entirely at random from the population at large (avoid The Literary Digest’s mistake of choosing a non-random sample)
  • Stratified sample: Group the population into strata, in which all of the subjects/cases within a stratum are like each other, and the strata differ from each other. Then choose a random sample from within each stratum. Example: polling a sample of members of each NBA team (each team may have considerably different average salaries). Another example: divide population into men and women.
  • Cluster sample: Group the population into clusters, in which the clusters are all similar to each other, and there is diversity within each cluster. Then take a random sample of observations from within a randomly selected subset of the clusters. Example: polling a sample of people from ten of twenty remote villages. This can help to reduce sampling costs without impairing effectiveness.

Video 1-1-4: Experimental Design

Principles of experimental design

  • Control: One group is “controlled” in that it receives no treatment, so this can be compared to the treatments
  • Randomize: Subjects for each group are assigned randomly
  • Replicate: Sufficient cases, or replications of the study, ensure accuracy
  • Block: If there are known external variables, such as age, gender, or health conditions, first group the subjects into blocks based on those variables, and then randomly assign subjects within each block to the treatment groups

Blocking vs. explanatory variables

  • Explanatory variables (also known as factors) are conditions we can impose on the experimental units. E.g., various treatments, “drink 12 oz. water every morning” or “use energy gel”
  • Blocking variables are characteristics that the experimental units come with, that we would like to control for. E.g., gender. Not many subjects would participate if you said, “You will be converted to a man, and you will be converted to a woman.”
  • Blocking is like stratifying, except used in experimental settings when randomly assigning, rather than when sampling

Experimental design terminology

  • a placebo is a fake treatment often used in a control group in an experiment
  • blinding is when the subjects in an experiment do not know whether they are in a treatment or a control group
  • double blinding is when the researchers who interact with study subjects also do not know who is in which group

Video 1-1-5: Random Sample Assignment

Random sampling

This refers to randomly selecting subjects from within a population for a study. If the sample is selected randomly, it is likely to be representative of the population. Therefore, the study’s results are generalizable to the population.

Random assignment

Random assignment occurs only in an experiment. It refers to the assignment of subjects to the treatment and control groups. If we are careful to assign subjects randomly to each group, we may be able to draw causal conclusions.

                     Random assignment                No random assignment
Random sampling      causal and generalizable         not causal, but generalizable       (generalizability)
No random sampling   causal, but not generalizable    neither causal nor generalizable    (no generalizability)
                     (causation)                      (association)

  • causal and generalizable: the ideal experiment, but often difficult or even impossible to do
  • causal, but not generalizable: true of many experiments; useful, but cannot be generalized to entire population
  • not causal but generalizable: this is an observational study
  • neither causal nor generalizable: a bad observational study; otherwise known as a Fox News infographic

Video 1-2-1: Visualizing Numerical Data

Scatterplots for paired data

library(xlsx)
## Warning: package 'xlsx' was built under R version 3.1.3
## Loading required package: rJava
## Loading required package: xlsxjars
## Warning: package 'xlsxjars' was built under R version 3.1.2
life_exp <- read.xlsx2("gapminder/life-expectancy.xlsx",sheetIndex=1,header=T)
gdp <- read.xlsx2("gapminder/gdp-per-capita.xlsx",sheetIndex=1,header=T)

#colnames(life_exp)
#colnames(income)

I gave up on trying to get the data for this and do a plot; something was very, very slow about loading the data. The video shows a plot showing an association between income per person (explanatory variable, on the x axis) and life expectancy (response variable, on the y axis). By convention, the explanatory variable goes on the x axis and the response on the y axis.

It should be kept in mind that labelling two variables as explanatory and response does not mean there is necessarily a causal connection between the two. Since the data in this case (from gapminder.com) is from an observational study, we cannot draw any causal conclusions.

The plot shows that there is a relationship that looks like a curve, rising sharply upward from x=0, and flattening out as it moves right along the x axis.

Evaluating the relationship

When examining relationship between two numerical variables, look at:

  • direction: as you move right on the x axis, do the y values increase or decrease
  • shape: is there a linear (straight-line) or non-linear relationship
  • strength: is it strong (points line up fairly closely to a single line that could be drawn) or weak (points are widely scattered)
  • outliers: are there individual observations or groups of observations that are less correlated than the rest of the points

The plot in the video shows three outliers with high income: Luxembourg, Macao (in China), and Qatar. Another outlier is Nepal, which has low income per person, but very high life expectancy.

One naive way to deal with outliers in data analysis is to immediately exclude them. But this can be a bad idea, because they may say something important.

Other visualizations for describing distributions of numerical variables

Histogram

data = c(rep(45,8), rep(50,12),rep(55,11),rep(60,18),rep(65,27),rep(70,52),rep(75,50),rep(80,28))
hist(data)

(Plot: histogram of the data above.)

(The above is a rough approximation of histogram in video.)

In a histogram, data are “binned” into intervals, and the heights of the bars represent the number of cases in each bin.

  • provides a view of data density.
  • especially useful for describing shape of distribution

The life expectancy histogram tends to be left-skewed. Most countries have average life expectancies in the range of 65-80, but a few have much lower values. The income histogram tends to be right-skewed. Many countries have average income of a few thousand dollars; the wealthy nations are few.

Skewness in histogram

  • left-skewed: long tail to the left, peak to the right
  • symmetric: no skew, tails of about equal size on both left and right, peak in the middle
  • right-skewed: long tail to the right, peak to the left

Modality in histogram

  • unimodal: a single prominent peak in the bars
  • bimodal: two prominent peaks
  • multimodal: more than two prominent peaks
  • uniform: no prominent peaks

The most common distribution worked with in this course is the normal distribution, which is unimodal and symmetric. NOTE: As we will see later, the fact that a distribution is unimodal and symmetric does not mean it is necessarily a normal distribution. There is in fact a specific formula that describes the curve of a normal distribution, but it is outside the scope of this course.

Histogram and bin width

If the bins are too wide (a small number of bins), then the shape of the histogram can be lost. If the bins are too narrow (very large number of bins), then it can look like there are lots of peaks, though they won’t be prominent. There is no “right” number of bins, but the histogram should give a good overall view of the distribution of the data.
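
For example, re-plotting the rough data from above with bins that are too wide and then too narrow (hist treats the breaks value as a suggestion, so the exact bin counts may differ):

# same data as in the histogram above
par(mfrow=c(1,2))
hist(data, breaks=3)    # too few bins: the shape of the distribution is mostly lost
hist(data, breaks=50)   # too many bins: mostly empty bins and isolated spikes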

Dot plot

data = c(48,50,rep(52,5),rep(54,3),56,58,rep(60,7),rep(62,9),rep(64,2),rep(66,12),rep(68,14),rep(70,11),rep(72,23),rep(74,27),rep(76,16),rep(78,9),rep(80,3),82)
stripchart(data,method="stack",pch=21,col="blue")

(Plot: dot plot (strip chart) of the data above.)

(The above is a rough approximation of the plot in the video.)

  • Useful when individual values are of interest
  • Can get very busy as the sample size increases

Box plot

boxplot(data,horizontal=T)

(Plot: horizontal box plot of the data above.)

(Plot above is rough approximation of video.)

  • useful for highlighting outliers, median (dark middle line), and the interquartile range IQR (the box)
  • outliers are shown as circles
  • the “whiskers” extend out from the box to the most extreme observations within 1.5 * IQR of the quartiles (points outside of that range are shown as outliers)

A boxplot gives an idea of skew. In a horizontal boxplot, a long left whisker indicates left skew, for example. If the median is in the middle of the box, and the whiskers are of roughly equal length, there is little skew.

Intensity map

This is a map in which regions are shaded different colors to represent different values (generally, darker means a higher number)

Visualizations review

  • scatterplot
  • histogram
  • dot plot
  • box plot
  • intensity map

Video 1-2-2: Measures of Center

Previously we examined shape of a distribution in terms of:

  • skew
    • left
    • right
    • symmetric
  • modality
    • unimodal
    • bimodal
    • multimodal
    • uniform

Center

The center of a distribution is another key feature of the distribution. It is commonly considered to be one of:

  • mean: arithmetic average
  • median: midpoint of the distribution (50th percentile)
  • mode: the most frequent observation

Sample statistics are point estimates of the population parameters. (So the mean of the entire population is a parameter; the mean of a sample is a point estimate.) These estimates may not be perfect, but if the sample is good (representative of the population), they are usually good guesses.

The sample mean is usually denoted as \(\bar{x}\), and the population mean is denoted as \(\mu\).

Example

Student scores: 76,69,88,93,95,54,87,88,27

The mean is \(\frac{76+69+88+93+95+54+87+88+27}{9}\approx75.22\).

The mode is 88 because that value occurs most often.

The median is defined as the midpoint of the data. If the scores are arranged in ascending order, this is 87, the fifth value of nine.

If the data set contains an even number of values, there is no single midpoint, so the two values closest to the midpoint are added together and divided by two.
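
A quick check of these in R (base R has no built-in function for the statistical mode, so table is used to find the most frequent value):

scores = c(76, 69, 88, 93, 95, 54, 87, 88, 27)
mean(scores)                     # about 75.22
median(scores)                   # 87
names(which.max(table(scores)))  # "88", the mode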

Skew, mean and median

If a unimodal distribution is left-skewed, then the mean will be less than the median. The median in a unimodal distribution will always be near the peak. But in a left-skewed distribution, the average will be pulled to the left by the long tail on the left. Similarly, if a unimodal distribution is right-skewed, the average will be pulled to the right, so it will be greater than the median.

If a unimodal distribution has little skew, then the mean and median will be approximately equal.

The same basic effect is seen in a boxplot. It shows the median as a heavy line. If the left whisker (in a horizontal boxplot) is longer than the right whisker, it is safe to assume that the mean is to the left of (less than) the median.

To put this intuitively, if you have a sample of ten people, and you take their heights, and one of the ten is an NBA player who is over seven feet tall, then the median of the sample might be five and a half feet, but the average will be increased significantly by the one very tall person.

To recap:

  • left-skewed: mean < median
  • right-skewed: mean > median
  • symmetric: mean \(\approx\) median
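
A quick simulated illustration of the recap above, using made-up data:

set.seed(2)
left_skewed = 100 - rexp(1000, rate = 0.2)   # long tail to the left of a peak near 100
c(mean = mean(left_skewed), median = median(left_skewed))  # the mean comes out below the median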

Video 1-2-3: Measures of Spread

xlim = c(-4,4)
ylim=c(0,0.5)
data <- seq(-4,4,length=100)
plot(data, dnorm(data,mean=0,sd=1),type="l",lty=1,xlim=xlim,ylim=ylim)
par(new=T)
plot(data, dnorm(data,mean=0,sd=2), type="l",lty=2,col="blue",xlim=xlim,ylim=ylim)

(Plot: two normal curves, sd = 1 (solid) and sd = 2 (dashed).)

In the skinny solid curve, the data are clustered closer to the center, while in the dashed curve, the data are spread more widely.

Measures of spread:

  • range: The difference between maximum and minimum values; not very reliable because it is based on only two data points, and they are the most extreme points. Sort of like picking the coldest and hottest days of the year to represent the year’s weather.
  • variance
  • standard deviation
  • interquartile range

Variance

Roughly the average squared deviation from the mean. The sample variance is denoted as \(s^2\) and the population variance as \(\sigma^2\).

The formula for sample variance:

\[s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}\]

A deviation from the mean is the distance of any one value in a data set from the mean for that set. So we square each deviation, sum those squares, and then divide the sum by \(n-1\), where \(n\) is the number of values. (The reason for dividing by \(n-1\) is discussed later.)
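
To make the formula concrete, here is the sample variance of a small made-up data set, computed both directly from the formula and with R's var function:

x = c(10, 20, 30, 40, 50)
n = length(x)
sum((x - mean(x))^2) / (n - 1)   # 250, following the formula
var(x)                           # also 250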

life_exp2 <- life_exp[,c("X2011")]
life_exp2 <- as.numeric(levels(life_exp2))[life_exp2]
life_exp2 = life_exp2[complete.cases(life_exp2)]
# some sample records
head(life_exp2)
## [1] 60.08 76.98 70.75 51.09 75.61 75.95
# number of data points
length(life_exp2)
## [1] 202
# mean of data set
mean(life_exp2)
## [1] 70.24
# variance of data set
var(life_exp2)
## [1] 85.51

The data set has 202 values. The mean of these values is 70.24. The variance is 85.51 years squared. (These differ a bit from the video, may have been from a different year. I use R’s built-in functions to get the mean and variance.)

The unit “years squared” is not very helpful, so we can take the square root of the variance (or use R’s sd function) to get the standard deviation.

sd(life_exp2)
## [1] 9.247

The standard deviation is 9.247 years.

Why do we square the differences?

  • get rid of negatives so that negatives and positives don’t cancel each other out when added together
  • increase larger deviations more than smaller ones so that they are weighed more heavily

My personal comment: I don’t claim to have a deep understanding of this, but I believe that there is a rigorous mathematical underpinning behind the standard deviation. It wasn’t chosen arbitrarily just to keep negatives from cancelling out positives, and it’s not an arbitrary method of making larger deviations weigh more. But the mathematical theory is beyond the scope of the course. Take a glance at the formula for a normal distribution (to which the idea of a standard deviation is closely linked) to get just a glimpse of the complexity.

Standard deviation

Roughly the average deviation around the mean, and has the same units as the data.

For a sample, this is denoted as \(s\). For a population, it is denoted \(\sigma\).

\[s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}}\]

Variability vs. diversity

A set of five cars, in which each is a different color, has high diversity. But consider these two sets, each of which gives a color and mileage for each car in it:

Set 1: blue (10), red (20), yellow (30), green (40), purple (50)

Set 2: blue (10), blue (10), blue (10), purple (50), purple (50)

The second set has lower diversity–it has only two different kinds of cars–but higher variability. The standard deviation of the first set is 15.81, but the standard deviation of the second set is 21.91.

sd(c(10,20,30,40,50))
## [1] 15.81
sd(c(10,10,10,50,50))
## [1] 21.91

Interquartile range

The range of the middle 50% of the data: the distance between the first quartile (25th percentile) and the third quartile (75th percentile).

\[IQR = Q3 - Q1\]

data = c(48,50,rep(52,5),rep(54,3),56,58,rep(60,7),rep(62,9),rep(64,2),rep(66,12),rep(68,14),rep(70,11),rep(72,23),rep(74,27),rep(76,16),rep(78,9),rep(80,3),82)
boxplot(data,horizontal=T)

(Plot: horizontal box plot of the data above.)

The first quartile is roughly 65, and the third quartile is roughly 77. The IQR is therefore \(77-65=12\). This value, 12, isn’t useful by itself, but is useful in the context of the overall distribution. The IQR is usually more useful than the range, because it does not rely on extreme observations. Instead, it provides a measure of the variability for the bulk of the data around the center.

Video 1-2-4: Robust Statistics

We define robust statistics as measures on which extreme observations have little effect.

Example: data set is {1,2,3,4,5,6}. The mean and the median are both 3.5. But if we change one of the values so that the set is {1,2,3,4,5,1000}, the median remains unchanged at 3.5, but the mean balloons to 169. The median is considered a robust statistic.
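
Checking this in R:

x = c(1, 2, 3, 4, 5, 6)
c(mean(x), median(x))   # both 3.5
x[6] = 1000
c(mean(x), median(x))   # the mean jumps to about 169, the median stays 3.5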

         robust   non-robust
center   median   mean
spread   IQR      SD, range

Robust statistics are most valuable with skewed distributions, or those with extreme observations. Non-robust statistics are most useful with symmetric distributions.

Video 1-2-5: Transforming Data

Transformations

  • A transformation is a rescaling of the data using a function
  • When data are very strongly skewed, we sometimes transform them so they are easier to model

Log transformation

The natural log transformation is often applied when much of the data cluster near zero (relative to the larger values in the set) and all observations are positive

gdp2 <- gdp[,c("X2011")]
gdp2 <- as.numeric(levels(gdp2))[gdp2]
gdp2 = gdp2[complete.cases(gdp2)]
par(mfrow=c(1,2))
hist(gdp2)
hist(log(gdp2))

(Plot: side-by-side histograms of gdp2 and log(gdp2).)

(Example above actually shows GDP per capita rather than income, but suffices to demonstrate the transformation. The histogram on the right has had natural log applied to the data.)

Transformations can also be applied to one or both variables in a scatterplot to make the relationship between the variables more linear, and hence easier to model with simple methods

life_exp2 <- life_exp[,c("X2011")]
life_exp2 <- as.numeric(levels(life_exp2))[life_exp2]

gdp2 <- gdp[,c("X2011")]
gdp2 <- as.numeric(levels(gdp2))[gdp2]
data = data.frame(life_exp=life_exp2, gdp=gdp2,stringsAsFactors=F)
data = data[complete.cases(data),]

par(mfrow=c(1,2))
plot(data$gdp, data$life_exp)
plot(log(data$gdp), data$life_exp)

The above code should show side-by-side scatterplots (GDP per capita as the explanatory variable on the x axis, life expectancy as the response variable on the y axis) in which the one on the left, with the original data, shows a curved relationship, and the one on the right, with the natural log applied to the explanatory variable, shows a much more linear relationship. Right now the data don't fit together (differing numbers of rows in the two sets); one possible fix, merging the two sets by country, is sketched below.
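
Here is a rough sketch of that fix. It assumes the country name sits in the first column of each spreadsheet, which I have not verified, so the column renaming is an assumption:

# assume the first column of each sheet holds the country name (unverified)
colnames(life_exp)[1] <- "country"
colnames(gdp)[1] <- "country"
both <- merge(life_exp[, c("country", "X2011")], gdp[, c("country", "X2011")],
              by = "country", suffixes = c(".life", ".gdp"))
both$life <- as.numeric(as.character(both$X2011.life))
both$gdp <- as.numeric(as.character(both$X2011.gdp))
both <- both[complete.cases(both), ]
par(mfrow = c(1, 2))
plot(both$gdp, both$life)        # curved relationship
plot(log(both$gdp), both$life)   # roughly linear after the log transform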

Other transformations

  • square root: may be applied in addition to a natural log
  • inverse: each data point \(x\) is converted to \(1/x\)

The video says that later in the course, we’ll get into “a little bit more detail” on choosing a transformation. I don’t recall anything like this; I think transformations are merely touched on here, and they’re not something we use later. Important, but outside the scope of the course. “A transformation of the data can be useful, even though it complicates the interpretation of it.”

Goals of transformations

  • to see the data structure differently
  • to reduce the skew to assist in modeling
  • to straighten a nonlinear relationship in a scatterplot

Video 1-2-6: Exploring Categorical Variables

Frequency table and bar plot

Some data from a survey on how difficult Americans find it to save money:

             Counts   Frequencies
Very         231      46%
Somewhat     196      39%
Not very     58       12%
Not at all   14       3%
Not sure     1        ~0%

# expand the counts into one row per respondent so that table() re-tallies them
save_data = data.frame(response=c(rep("Very", 231),rep("Somewhat",196),rep("Not very",58),rep("Not at all",14),"Not sure"))
barplot(table(save_data))

(Plot: bar plot of the survey responses.)

Video 1-3: Introduction to Inference

Unit 4: Inference for Numerical Variables

Video 4-1-1: Hypothesis testing for paired data

Example of students taking a test that measures both reading and writing skills. The two scores cannot be considered independent, because any one student's abilities in reading and writing are probably linked to some degree.

When two observations have a special correspondence (they are not independent), they are said to be paired.

To analyze paired data, it is often useful to look at the difference in outcomes of each paired observation. For example, diff = read - write. It is important that we always subtract in the same order for each case (either read - write, or write - read).

  • The parameter of interest is the average difference between reading and writing scores, \(\mu_{diff}\), among the population (all high school students).
  • The point estimate is the average difference between reading and writing scores, \(\bar{x}_{diff}\), among the sampled high school students.

You can plot a histogram of the observed differences, but it won’t indicate whether the differences are statistically significant; i.e., it won’t allow you to reject a null hypothesis that states that there is no difference.
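
A minimal sketch of the mechanics, using made-up paired scores rather than the course's data set (the numbers are invented purely for illustration):

set.seed(1)
scores = data.frame(read = round(rnorm(200, mean = 52, sd = 10)))
scores$write = round(scores$read + rnorm(200, mean = 0.5, sd = 9))  # linked to reading, as paired data are
scores$diff = scores$read - scores$write   # always subtract in the same order for every student
hist(scores$diff)
mean(scores$diff)                          # the point estimate, x_bar_diff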

Hypotheses for paired means

\(H_0: \mu_{diff} = 0\)
\(H_A: \mu_{diff} \ne 0\)

Now we have just a single variable, the differences between the paired data, and we proceed as before. All of the conditions are the same as before.

  • Independence: Random sampling, < 10% population
  • Sample size/skew: \(n \ge 30\), larger if population very skewed

Test statistic is

\[Z = \frac{\bar{x}_{diff}-\mu_{diff}}{SE_{\bar{x}_{diff}}}\]

\(\bar{x}_{diff} \sim N(\text{mean}=0, SE=0.628)\)

x_bar_diff = -0.545
s_diff = 8.887 # sample std. dev. of the differences
n_diff = 200
se_diff = s_diff / sqrt(n_diff) # 0.628
nullval = 0
Z = abs(x_bar_diff - nullval) /se_diff
Z
## [1] 0.8673
p_value = (1 - pnorm(Z)) * 2 # twosided test
p_value
## [1] 0.3858

With such a high p-value, we do not reject the null hypothesis.

Video 4-1-2: Confidence intervals for paired data

Like the hypothesis test, the confidence interval for two paired values in a sample is just like a single sample, except the sample statistic is the difference between the two values. Confidence interval is still point estimate \(\pm\) margin of error, or \(\bar{x}_{diff}\pm z^*SE_{\bar{x}_{diff}}\).

x_bar_diff = -0.545
s_diff = 8.887
n_diff = 200
se_diff = s_diff / sqrt(n_diff)
se_diff
## [1] 0.6284
conf_level = 0.95
alpha = 1 - conf_level
z_star = qnorm(conf_level + (alpha/2))
z_star
## [1] 1.96
ci = c(x_bar_diff - z_star * se_diff, x_bar_diff + z_star * se_diff)
ci
## [1] -1.7767  0.6867

We are 95% confident that the average reading score is between 1.78 points lower and 0.69 points higher than the average writing score.

Video 4-1-3: Comparing independent means

Example is work hours and educational attainment from GSS (2010).

Educational attainment is ordinal: less than high school, high school, junior college, bachelor’s, graduate. Hours worked is numerical.

Best way to visualize a categorical versus numerical variable is side-by-side box plots.

Example then shifts to just “college” or “no college”. The parameter of interest is average difference in number of hours worked among all Americans with a college degree, and those without a degree. \(\mu_{coll} - \mu_{no coll}\)

The point estimate would be average difference in hours worked per week for sampled Americans, \(\bar{x}_{coll} - \bar{x}_{no coll}\).

Confidence interval

As always, the formula is point estimate \(\pm\) standard error. In this case, it is \((\bar{x}_1 - \bar{x}_2) \pm z^* SE_{\bar{x}_1 - \bar{x}_2}\).

The standard error of the difference of two independent means is:

\[SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}\]

Conditions

  1. Independence:
  • Within groups, sampled observations must be independent
    • random sampling/assignment
    • if sampling without replacement, <10% pop.
  • between groups: two groups must be independent of each other (non-paired)
  2. Sample size/skew: each sample must be at least 30, larger if population is very skewed

conf_level = 0.95
alpha = 1 - conf_level
x_bar_1 = 41.8
n_1 = 505
s_1 = 15.14
x_bar_2 = 39.4
n_2 = 667
s_2 = 15.12

x_bar_diff = x_bar_1 - x_bar_2

se_diff = sqrt((s_1^2/n_1) + (s_2^2/n_2))
se_diff
## [1] 0.8926
z_star = qnorm(conf_level + (alpha/2))
z_star
## [1] 1.96
ci = c(x_bar_diff - z_star * se_diff, x_bar_diff + z_star * se_diff)
ci
## [1] 0.6506 4.1494

Hypothesis test

(Carrying over example and calculations from previous.)

\(H_0: \mu_1 - \mu_2 = 0\)
\(H_A: \mu_1 - \mu_2 \ne 0\)

\((\bar{x}_{coll} - \bar{x}_{no\ coll}) \sim N(\text{mean} = 0, SE = 0.89)\)

nullval = 0
Z = abs(x_bar_diff-nullval) / se_diff
Z
## [1] 2.689
p_value = (1 - pnorm(Z)) * 2 # twosided test
p_value
## [1] 0.007168

Reject the null hypothesis, there is a difference in hours worked. We already knew this from the 95% confidence interval, in which 0 (no difference) did not appear.

Interpretation: Given that there is no difference between the number of hours worked between those with and without a college degree, there is a 0.7% chance that random samples of 505 college and 667 non-college graduates will show an average difference of hours worked of 2.4 hours or more.

A note about the standard deviation

This is not explicitly covered in the course, but was discussed with the instructor in a forum. Think of the two samples as coming from distinct populations. Therefore, it is desirable to have standard deviations for both populations to calculate the standard error. If they are not available, fall back to the standard deviations for the samples. But don’t think of the samples as coming from a single population; there is no such thing as a single population standard deviation for both.

What if you have two samples of different sizes from a single population? In that case, it would make no sense to conduct a hypothesis test of their difference. The point of comparing two unpaired samples is to determine whether there is a difference in their respective population parameters. If they come from the same population, we already know the answer to that is no.

Video 4-2: Bootstrapping

Example: Twenty 1+ bedroom apartments were randomly selected on raleigh.craigslist.org. The median is a better measure of the center of the distribution because the rents are right-skewed.

There is no Central Limit Theorem for medians, so we cannot use methods learned so far to construct confidence interval. An alternative approach to constructing confidence intervals is bootstrapping. The technique involves taking many samples, with replacement, from the original sample, and then building a distribution based on those samples.

Example: There is a sample of 20 apartments. So we take many samples, each of size 20, using replacement, from this sample, and we calculate a sample statistic (mean, median, proportion, etc.) for each of those samples.

I don’t know how to construct a dotplot with 500 elements and get them all to show as in the video, so I show a histogram instead.

set.seed(3)
rents <- c(775,625,733,929,895,749,1020,1349,599,1143,1209,1495,879,975,1076,1282,665,705,799,500)
num_samples = 500
rent_medians <- rep(0, num_samples)
for(i in 1:num_samples) {
  samp <- sample(rents, length(rents), replace=TRUE)
  rent_medians[i] <- median(samp)
}
hist(rent_medians,breaks=10)

(Plot: histogram of the 500 bootstrap medians.)

Confidence interval: percentile method

We can use 95% as a confidence level, and thus determine the samples that comprise 95% of all samples that are “in the middle”. Or conversely, find 2.5% of the samples on the left, and 2.5% on the right.

The key point here is that you have a dotplot, and so if you need to determine which points fall outside of 95%, multiply the total number of samples by 0.95, and that is the number of cases within the 95% confidence interval. Subtract this number from the total number of samples, and you have the number of cases outside of the 95% confidence interval. These will be half on one side, and half on the other.

Example with 90% confidence level: 500 samples * 0.9 = 450. 500 - 450 = 50, so there would be 50/2=25 samples on either side of the confidence interval.

conf_level = 0.9
num_in_interval = conf_level * num_samples
num_in_tail = (num_samples - num_in_interval) / 2
medians_sorted = rent_medians[order(rent_medians)]
left_tail_val = medians_sorted[num_in_tail]
right_tail_val = medians_sorted[num_samples - num_in_tail + 1]
c(left_tail_val, right_tail_val)
## [1]  749 1020

The numbers above do not exactly match the video, they are from my own bootstrap simulation.

Standard error method

First determine the mean of the bootstrap distribution, \(\bar{x}_{boot}\), and calculate the confidence interval as that mean plus/minus the appropriate critical value (1.645 for a 90% confidence level) times the bootstrap standard error, \(SE_{boot}\). This is the standard deviation of the observations in the bootstrap distribution.

se_boot = sd(rent_medians) # std. error is just the standard deviation because we use the entire bootstrap population
rent_mean = mean(rent_medians)
conf_level = 0.9
alpha = 1 - conf_level
z_star = qnorm(conf_level + (alpha/2))
ci = c(rent_mean - z_star * se_boot, rent_mean + z_star * se_boot)
ci
## [1]  742.4 1014.1

Note that the numbers above are NOT the numbers in the video, they are based on my own bootstrap. They are however pretty close to the video’s numbers.

You can see that the numbers from the two estimates (percentile method and standard error method) do not match each other exactly, but they are close.

Bootstrapping limitations

  • The conditions are not as rigid as those for CLT-based methods
  • But if bootstrap distribution is very skewed or very sparse, the bootstrap interval might be unreliable
  • A representative sample is required for generalizability. If the sample is biased, the estimates resulting from this sample will also be biased.

Bootstrap vs. sampling distribution

  • Sampling distribution created using sampling (with replacement) from population. Note that this is rare, we almost never have access to entire population
  • Bootstrap distribution created using sampling (with replacement) from the sample.
  • Both are distributions of sample statistics

Video 4-3-1: t distribution

The t distribution is useful when the sample size is small (less than 30) and the population standard deviation is unknown.

What purpose does a large sample serve?

As long as observations are independent and the population distribution is not extremely skewed, a large sample would ensure that…

  • the sampling distribution of the mean is nearly normal
  • the estimate of the standard error is reliable: \(\frac{s}{\sqrt{n}}\)

Review: normality of sampling distributions

  • If the population distribution is normal, the sampling distribution of the mean is nearly normal for any sample size
  • Helpful special case, but difficult to verify normality in small data sets

For example, if sample size is 10, the sample might show considerable skew, but the sample is too small to be sure. If sample size is 1000, the histogram gives us much more confidence about what the population distribution looks like.

So it is important to be careful with normality condition for small samples; don’t just examine the sample, but also think about where the data come from. Would I expect this distribution to be symmetric, and am I confident that outliers are rare?

t distribution

When n is small and \(\sigma\) is unknown, use the t distribution to address the uncertainty of the standard error estimate. The t distribution is unimodal and symmetric, but its tails are somewhat thicker than the normal distribution's. The exact shape of the distribution depends on the degrees of freedom, and as the degrees of freedom increase, the distribution looks more and more like the normal distribution. The plot below shows the normal distribution as a solid line.

normdata <- seq(-4, 4, length=100)
hnormdata <- dnorm(normdata)
plot(normdata, hnormdata, type="l", lty=1, xlab="x value", ylab="Density", main="Normal Distribution",xlim=c(-4,4),ylim=c(0,0.45))
abline(h=0)
tdata <- dt(normdata, df=2)
par(new=TRUE)
plot(normdata, tdata, type="l", lty=2,xlim=c(-4,4),ylim=c(0,0.45),xlab="",ylab="")
tdata <- dt(normdata, df=6)
par(new=TRUE)
plot(normdata, tdata, type="l", lty=2,xlim=c(-4,4),ylim=c(0,0.45),xlab="",ylab="")
tdata <- dt(normdata, df=20)
par(new=TRUE)
plot(normdata, tdata, type="l", lty=2,xlim=c(-4,4),ylim=c(0,0.45),xlab="",ylab="")

(Plot: normal curve (solid) and t curves with df = 2, 6, 20 (dashed).)

The thicker tails mean that observations are more likely to fall beyond 2 standard deviations from the mean. This is helpful for mitigating the effect of a less reliable estimate for the standard error of the sampling distribution. In a nutshell, a confidence interval based on the t distribution will be wider than one based on the normal distribution, and it is harder to reject a null hypothesis.

  • The t distribution is always centered at 0, like the normal
  • It has only one parameter, degrees of freedom, which determines the thickness of the tails

Remember, the normal distribution has two parameters, mean and standard deviation.
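
One way to see the effect of the degrees of freedom parameter is to compare critical values for a 95% confidence level (values rounded):

qnorm(0.975)                        # 1.96 for the normal distribution
qt(0.975, df = c(5, 10, 30, 100))   # roughly 2.57, 2.23, 2.04, 1.98: approaches 1.96 as df grows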

t statistic

  • For inference on a mean where
  • \(\sigma\) is unknown
  • \(n < 30\)
  • calculated the same way: \(T = \frac{obs - null}{SE}\)
  • p-value: same definition as before

We can see below that, when degrees of freedom is 50, the t distribution is very similar to the normal distribution. When degrees of freedom is 10, there is a much larger difference.

# P(|Z| > 2)
(1 - pnorm(2)) * 2
## [1] 0.0455
# P(|t[df=50]| > 2)
(1 - pt(2, df=50)) * 2
## [1] 0.05095
# P(|t[df=10]| > 2)
(1 - pt(2, df=10)) * 2
## [1] 0.07339

So in the above results, if the significance level is 5%, we can reject the null hypothesis only for the scenario using the normal distribution. The t distribution makes it less likely to reject the null hypothesis, and the smaller the degrees of freedom, the harder it gets.

Origins of the t distribution

It’s actually the Student’s t distribution. “Student” was a pseudonym for William Gosset of the Guinness brewing company. He worked with small samples/batches, and thus developed the t distribution. He published under the pseudonym to avoid exposing trade secrets.

Video 4-3-2: Inference for a small sample mean

The example here is a study on playing a video game while eating, which might lead to forgetting what you ate, and this can contribute to increased snacking.

biscuit intake    \(\bar{x}\)   \(s\)    \(n\)
solitaire         52.1g         45.1g    22
no distraction    27.1g         26.4g    22

conf_level = 0.95
alpha = 1 - conf_level
x_bar_solitaire = 52.1 # biscuit intake
s_solitaire = 45.1
n_solitaire = 22
se_solitaire = s_solitaire / sqrt(n_solitaire)

x_bar_no_dist = 27.1 # biscuit intake
s_no_dist = 26.4
n_no_dist = 22
se_no_dist = s_no_dist / sqrt(n_no_dist)

df = n_solitaire - 1
t_star = qt(conf_level + (alpha/2), df=df)
t_star
## [1] 2.08
ci_solitaire = c(x_bar_solitaire - t_star * se_solitaire, x_bar_solitaire + t_star * se_solitaire)
ci_solitaire
## [1] 32.1 72.1
ci_no_dist = c(x_bar_no_dist - t_star * se_no_dist, x_bar_no_dist + t_star * se_no_dist)
ci_no_dist
## [1] 15.39 38.81

Hypothesis test

Suppose the suggested serving size of these biscuits is 30g. Do these data provide convincing evidence that the amount of snacks consumed by distracted eaters post-lunch is different than the suggested serving size?

\(H_0: \mu = 30\)
\(H_A: \mu \ne 30\)

nullval_solitaire = 30
T = (x_bar_solitaire - nullval_solitaire) / se_solitaire
T
## [1] 2.298
p_value = (1 - pt(T, df=n_solitaire)) * 2 # twosided test; NOTE: df should really be n_solitaire - 1 = 21, which gives a very similar p-value (about 0.032)
p_value
## [1] 0.03141

So reject the null hypothesis; the average amount of snacks consumed by distracted eaters differs from the suggested serving size.

Video 4-3-3: Inference for comparing two small sample means

Nutshell: This is the same as inference for comparing means from two large samples, except you use the t distribution

Same study as last video, but this time we consider means both for those who played solitaire and those who had no distractions.

Confidence interval: point estimate \(\pm\) margin of error or \((\bar{x}_1 - \bar{x}_2) \pm t^*_{df}SE_{(\bar{x}_1 - \bar{x}_2)}\).

Hypothesis test: \(T_{df} = \frac{obs - null}{SE} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{SE_{\bar{x}_1 - \bar{x}_2}}\), where \(SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}\).

Degrees of freedom for t statistic for inference on difference of two means:

\[df = min(n_1 - 1, n_2 - 1)\]

(Note that the above is a conservative estimate of the actual number, which is complex to derive.)

For hypothesis test:

\(H_0: \mu_{wd} - \mu_{wod} = 0\)
\(H_A: \mu_{wd} - \mu_{wod} \ne 0\)

Where “wd” means “with distraction” and “wod” means “without distraction”.

x_bar_diff = x_bar_solitaire - x_bar_no_dist
nullval = 0
se_diff = sqrt(s_solitaire^2/n_solitaire + s_no_dist^2/n_no_dist)
se_diff
## [1] 11.14
df = min(n_solitaire - 1, n_no_dist - 1)
T = abs(x_bar_diff - nullval)/se_diff
T
## [1] 2.244
p_value = (1 - pt(T, df=df)) * 2
p_value
## [1] 0.03575

Reject the null hypothesis, there is a difference between the two groups.

Confidence interval for above:

conf_level = 0.95
alpha = 1 - conf_level
t_star = qt(conf_level + (alpha/2), df=df)
ci = c(x_bar_diff - t_star * se_diff, x_bar_diff + t_star * se_diff)
ci
## [1]  1.83 48.17

Video 4-4-1: Comparing more than two means

Survey that asked respondents to choose correct definitions of 10 words, and asked them to identify their social class.

# load data
load(url("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/gss.Rdata"))
print(levels(gss$class))
## [1] "LOWER"   "MIDDLE"  "UPPER"   "WORKING"
hist(gss$wordsum)

(Plot: histogram of gss$wordsum.)

plot(gss$class)

(Plot: bar plot of gss$class.)

Exploratory analysis

boxplot(gss$wordsum ~ gss$class)

(Plot: side-by-side box plots of wordsum by class.)

gss.means = as.vector(by(gss$wordsum, gss$class, mean))
gss.counts = as.vector(by(gss$wordsum, gss$class, length))
gss.sds = as.vector(by(gss$wordsum, gss$class, sd))
gss2 <- data.frame(class=levels(gss$class), n=gss.counts, mean=gss.means, sd=gss.sds)
gss2
##     class   n  mean    sd
## 1   LOWER  41 5.073 2.240
## 2  MIDDLE 331 6.761 1.886
## 3   UPPER  16 6.188 2.344
## 4 WORKING 407 5.749 1.865

          n     mean   sd
lower     41    5.07   2.24
working   407   5.75   1.87
middle    331   6.76   1.89
upper     16    6.19   2.34
total     795   6.14   1.98

We need to compare more than two statistics, so methods learned so far won’t do the job.

  • To compare means of 2 groups we use a Z or T statistic (depending on sample size)
  • To compare means of 3+ groups we use a new test called analysis of variance (ANOVA) and a new statistic called F

\(H_0:\) The mean outcome is the same across all categories: \(\mu_1 = \mu_2 = \dots = \mu_k\).
\(H_A:\) The mean outcome is different for at least two of the categories

where \(\mu_i\) is the mean of the outcome for observations in category \(i\), and \(k\) is the number of groups.

ANOVA does not tell us which of the means are different from each other, or whether it is 2, or 3, or more means that differ, only that there are at least two means that differ.

The test statistic is F, which is:

\[F = \frac{\text{variability between groups}}{\text{variability within groups}}\]

  • Large test statistics lead to small p-values
  • If the p-value is small enough, \(H_0\) is rejected and we conclude the data provide evidence of a difference in the population means

fdata = rf(5000, df1=10, df2=10)
hist(fdata, breaks=100,main="F dist. (df=10)")

(Plot: histogram of 5000 simulated F values, df = 10 and 10.)

  • In order to be able to reject \(H_0\), we need a small p-value, which requires a large F statistic
  • In order to obtain a large F statistic, the variability between the sample means needs to be greater than the variability within the groups

Video 4-4-2: ANOVA

Variability partitioning

The data is again the GSS data on class and vocabulary score.

\(H_0:\) The mean outcome is the same across all categories: \(\mu_1 = \mu_2 = \dots = \mu_k\).
\(H_A:\) The mean outcome is different for at least one pair of means

Total variability in vocabulary scores:

  1. Variability attributed to class. The variability between the groups.
  2. Variability attributed to other factors. The variability within the groups.

ANOVA output

                    Df    Sum Sq    Mean Sq   F value   Pr(>F)
Group (class)       3     236.56    78.855    21.735    <0.0001
Error (Residuals)   791   2869.80   3.628
Total               794   3106.36

(Note that R’s ANOVA output does not show the Total line!)

Sum of squares total (SST)

  • This is 3106.36 in the table. It measures the total variability in the response variable (vocabulary scores)
  • Calculated very similarly to variance, except not scaled by the sample size

\[SST = \sum\limits^n_{i=1}(y_i-\bar{y})^2\]

Where \(y_i\) is the value of the response variable for each observation and \(\bar{y}\) is the grand mean of the response variable.

To compute SST:

sst <- function(y) {
  y.mean = mean(y)
  return(sum(sapply(y, function(y.i){(y.i - y.mean)^2})))
}
sst(gss$wordsum)
## [1] 3106

Sum of squares groups (SSG)

  • This is 236.56 in the table above. It measures the variability between the groups.
  • It is explained variability: the deviation of the group mean from the overall mean, weighted by sample size

\[SSG = \sum\limits^k_{j=1} n_j(\bar{y}_j - \bar{y})^2\]

where \(k\) is the number of groups, \(n_j\) is the number of observations in group \(j\), \(\bar{y}_j\) is the mean of the response variable for group \(j\), and \(\bar{y}\) is the grand mean of the response variable.

To compute SSG:

# this can no doubt be optimized--get rid of loop
ssg <- function(data, y, j) {
  total = 0
  mean.grand = mean(y)           # mean for all values
  for(grp in unique(j)) {
    grp.data = y[j==grp]        # y values for the current group
    n.grp = length(grp.data)
    mean.group = mean(grp.data) # mean for current group
    total = total + n.grp * (mean.group - mean.grand)^2
  }
  return(total)
}
ssg(gss, gss$wordsum, gss$class)
## [1] 236.6
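
In response to the comment in the function above, here is a loop-free version using tapply; it should return the same value for the same inputs:

ssg2 <- function(y, j) {
  grand.mean = mean(y)
  group.means = tapply(y, j, mean)     # mean of y within each group
  group.sizes = tapply(y, j, length)   # number of observations in each group
  sum(group.sizes * (group.means - grand.mean)^2)
}
ssg2(gss$wordsum, gss$class)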

Sum of squares error (SSE)

  • This is 2869.8 in the table above. Measures variability within groups
  • Unexplained variability: unexplained by the group variable, due to other reasons

\[SSE = SST - SSG\]

Now we need a way to get from these measures of total variability to average variability (the “Mean Sq” column in the table above). We do this by scaling each sum of squares by a measure that incorporates sample size and number of groups: the degrees of freedom.

Degrees of freedom

Three degrees of freedom associated with ANOVA:

  • total: \(df_T = n - 1\), e.g., 795-1=794
  • group: \(df_G = k -1\), e.g., 4-1=3
  • error: \(df_E = df_T - df_G\), e.g., 794-3=791

Mean squares

Average variability between and within groups, calculated as the total variability (sum of squares) scaled by the associated degree of freedom

  • group: \(MSG = SSG/df_G\)
  • error: \(MSE = SSE/df_E\)

Here is the table again:

                    Df    Sum Sq    Mean Sq   F value   Pr(>F)
Group (class)       3     236.56    78.855    21.735    <0.0001
Error (Residuals)   791   2869.80   3.628
Total               794   3106.36

So MSG is 236.56/3=78.855, and MSE is 2869.8/791=3.628.

F statistic

Ratio of the between group and within group variability:

\[F = \frac{MSG}{MSE}\]

Here, this is 78.855/3.628=21.735.
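
As a sanity check, the mean squares and the F statistic can be recomputed from the sst and ssg functions defined above (this assumes gss has no missing values in wordsum or class, which appears to be the case given the group counts):

sst_val = sst(gss$wordsum)
ssg_val = ssg(gss, gss$wordsum, gss$class)
sse_val = sst_val - ssg_val
df_g = nlevels(gss$class) - 1             # 3
df_e = (length(gss$wordsum) - 1) - df_g   # 791
c(MSG = ssg_val / df_g, MSE = sse_val / df_e, F = (ssg_val / df_g) / (sse_val / df_e))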

p-value

Now at last we can calculate the p-value, based on the F statistic.

  • the p-value is the probability of at least as large a ratio between the “between” and “within” group variabilities if in fact the means of all groups are equal
  • area under the F curve, with degrees of freedom \(df_G\) and \(df_E\)

f_stat = 21.735
df1 = 4-1
df2 = 795 - df1 - 1
p_value = pf(f_stat, df1=df1, df2=df2, lower.tail=FALSE)
p_value
## [1] 1.56e-13

Reject the null hypothesis; there is a difference between at least one pair of groups.

Here is the same thing, using R’s anova function. It summarizes a single model (or it can compare two or more models).

gss.lm <- lm(gss$wordsum ~ gss$class)
gss.anova = anova(gss.lm)
gss.anova
## Analysis of Variance Table
## 
## Response: gss$wordsum
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## gss$class   3    237    78.9    21.7 1.6e-13 ***
## Residuals 791   2870     3.6                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(gss.anova)
##        Df          Sum Sq        Mean Sq         F value         Pr(>F) 
##  Min.   :  3   Min.   : 237   Min.   : 3.63   Min.   :21.7   Min.   :0  
##  1st Qu.:200   1st Qu.: 895   1st Qu.:22.43   1st Qu.:21.7   1st Qu.:0  
##  Median :397   Median :1553   Median :41.24   Median :21.7   Median :0  
##  Mean   :397   Mean   :1553   Mean   :41.24   Mean   :21.7   Mean   :0  
##  3rd Qu.:594   3rd Qu.:2212   3rd Qu.:60.05   3rd Qu.:21.7   3rd Qu.:0  
##  Max.   :791   Max.   :2870   Max.   :78.85   Max.   :21.7   Max.   :0  
##                                               NA's   :1      NA's   :1

This is the output from the course’s inference function.

source("http://bit.ly/dasi_inference")
# This will do ANOVA analysis because the explanatory variable is categorical and has more than 2 levels
# alternative is "greater" because an F test is always one-sided
inference(y=gss$wordsum, x=gss$class, type="ht", est="mean", method="theoretical", alternative="greater", eda_plot=F, inf_plot=F)
## Response variable: numerical, Explanatory variable: categorical
## ANOVA
## Summary statistics:
## n_LOWER = 41, mean_LOWER = 5.073, sd_LOWER = 2.24
## n_MIDDLE = 331, mean_MIDDLE = 6.761, sd_MIDDLE = 1.886
## n_UPPER = 16, mean_UPPER = 6.188, sd_UPPER = 2.344
## n_WORKING = 407, mean_WORKING = 5.749, sd_WORKING = 1.865
## H_0: All means are equal.
## H_A: At least one mean is different.
## Analysis of Variance Table
## 
## Response: y
## Warning: option "show.signif.stars" is invalid: assuming TRUE
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## x           3    237    78.9    21.7 1.6e-13 ***
## Residuals 791   2870     3.6                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Pairwise tests: t tests with pooled SD 
##          LOWER MIDDLE  UPPER
## MIDDLE  0.0000     NA     NA
## UPPER   0.0475 0.2396     NA
## WORKING 0.0306 0.0000 0.3671

Now, here is the output of R’s aov function. It fits a single model.

gss.aov = aov(gss.lm)
summary(gss.aov)
##              Df Sum Sq Mean Sq F value  Pr(>F)    
## gss$class     3    237    78.9    21.7 1.6e-13 ***
## Residuals   791   2870     3.6                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
gss.aov
## Call:
##    aov(formula = gss.lm)
## 
## Terms:
##                 gss$class Residuals
## Sum of Squares      236.6    2869.8
## Deg. of Freedom         3       791
## 
## Residual standard error: 1.905
## Estimated effects may be unbalanced

Video 4-4-3: Conditions for ANOVA

Independence

Sample observations must be independent of each other

  • random sample/assignment
  • each \(n_j\) less than 10% of population
  • carefully consider whether the groups might be dependent on each other (e.g., paired data)
  • always important, sometimes difficult to check

Approximately normal

  • distribution of response variable within each group should be approximately normal
  • especially important when sample sizes are small
  • this means do a normal prob. plot for each group

Constant variance

  • variability should be constant across groups: homoscedastic groups
  • especially important when sample sizes differ between groups
  • look at side-by-side boxplots, sample sizes and standard deviations

Video 4-4-4: Multiple comparisons

So far we’ve determined that at least two groups differ from each other, but not which ones. Now we conduct multiple tests to determine which groups differ.

Which means differ

Conduct two-sample t tests for each possible pair of groups. But because there are many tests, this means an increase in the rate of Type 1 errors.
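
To get a rough sense of why the Type 1 error rate goes up, here is the chance of at least one Type 1 error across the six pairwise tests at \(\alpha = 0.05\), treating the tests as if they were independent (they are not exactly, so this is only an approximation):

k = 4                   # number of groups
K = k * (k - 1) / 2     # 6 pairwise comparisons
1 - (1 - 0.05)^K        # about 0.26, far above the intended 0.05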

Modified significance level

  • testing many pairs of groups is called multiple comparisons
  • the Bonferroni correction suggests that a more stringent significance level is appropriate for such tests
  • adjust \(\alpha\) by number of comparisons

\[\alpha^* = \alpha/K\]

where \(\alpha\) is starting significance level, and \(K = \frac{k(k-1)}{2}\)

Example

alpha = 0.05
k = 4
K = (k*(k-1))/2
alpha.mod = alpha/K
alpha.mod
## [1] 0.008333

Pairwise comparisons

  • For multiple comparisons after ANOVA, since the assumption of equal variability across groups must have been satisfied, we re-think the standard error and degrees of freedom
  • use a consistent measure of standard error and consistent degrees of freedom for all tests
  • compare p-values for each test to modified significance level

Standard error for pairwise comparison:

\[SE = \sqrt{\frac{MSE}{n_1} + \frac{MSE}{n_2}}\]

Recall that MSE was 3.628 (shown rounded to 3.6) in the table above.

Degrees of freedom for pairwise comparisons:

\[df = df_E\]

Recall that \(df_E\) was 791 in the table above.

Example comparing lower and middle class:

df_E = 791
k = 4
K = (4*(4-1))/2
mean.lower=5.07
mean.middle=6.76
mean.diff = mean.middle - mean.lower
n.lower=41
n.middle=331
MSE = 3.628
SE.pairwise = sqrt(MSE/n.lower + MSE/n.middle)
nullval = 0
T = abs(mean.diff - nullval)/SE.pairwise
T
## [1] 5.359
p_value = pt(T, df=df_E, lower.tail=FALSE) * 2 # twosided test
p_value
## [1] 1.098e-07
p_value < alpha.mod # TRUE means reject null hypothesis
## [1] TRUE

Similar comparisons could now be made between the other groups. I want to code a function that would take in a data frame and do all of this automatically.
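
R has a built-in pairwise.t.test function that does essentially this (my own aside, not from the video). It uses a pooled standard deviation across all groups, like the MSE-based approach above; with p.adjust.method="none" the unadjusted p-values can be compared to the Bonferroni-modified alpha, or p.adjust.method="bonferroni" adjusts the p-values so they can be compared to 0.05 directly. This assumes the gss data frame from earlier is still loaded.

# should reproduce the "Pairwise tests: t tests with pooled SD" matrix above
pairwise.t.test(gss$wordsum, gss$class, pool.sd=TRUE, p.adjust.method="none")
# or let R apply the Bonferroni correction to the p-values themselves
pairwise.t.test(gss$wordsum, gss$class, pool.sd=TRUE, p.adjust.method="bonferroni")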

Unit 5: Inference for Categorical Variables

Video 5-1-1: Sampling variability and CLT for proportions

Focus shifts from numeric to categorical variables.

This video has a lousy example on proportions of smokers by country, worldwide. I don’t like it because there are obvious cultural differences that can result in very large differences in proportions between countries, so any estimated proportion based on the sample proportions of the various countries is invalid, as far as I’m concerned.

But the idea is that you take the mean of \(\hat{p}\), which apparently means adding all of the \(\hat{p}\) values for the various countries together and dividing by the number of countries (population size differences unaccounted for!); this should approximate the population parameter \(p\).

CLT for proportions

The distribution of sample proportions is nearly normal, centered at the population proportion, and with a standard error inversely proportional to the square root of the sample size.

\[\hat{p} \sim N \left(\text{mean} = p,\ SE=\sqrt{\frac{p(1-p)}{n}}\right)\]

Conditions

  1. Independence
  • simple random sample/assignment
  • if sampling without replacement, < 10% population
  2. Sample size/skew: success/failure condition
  • np and n(1-p) \(\ge 10\)

If \(p\) is unknown for success/failure, use \(\hat{p}\).

Example: 90% of all plant species are angiosperms. If you randomly sample 200 plants, what is probability that at least 95% will be angiosperms?

p = 0.9
nullval = 0.95
n = 200
SE = sqrt((p * (1-p))/n)
SE
## [1] 0.02121
Z = abs(p - nullval)/SE
Z
## [1] 2.357
p_value = 1 - pnorm(Z)
p_value
## [1] 0.009211

This can also be done with binomial distribution, although answer will be slightly different.

sum(dbinom(190:200, size=200, prob=0.9))
## [1] 0.008071

NOTE: I was really confused by this for a while. Later on, we’re told that, in a hypothesis test based on a proportion, we use the null value to calculate the standard error, but that doesn’t happen above. Why not? Because we have the population parameter, p, rather than a statistic, \(\hat{p}\). I think!

What if sample size condition is not met?

  • center of sampling distribution will still be around population parameter
  • spread of sampling distribution can still be approximated using the same formula for the standard error
  • however the shape of the distribution will depend on whether true population parameter is closer to 0 or 1. If closer to 0, right-skewed, if closer to 1, left-skewed.

Video 5-1-2: Confidence interval for a proportion

In a survey, 99 of 670 respondents showed bad intuition about experiment design, and 571 showed good intuition.

The parameter of interest is \(p\). The point estimate is \(\hat{p}\), the sample proportion. This is \(571/670\approx0.85\).

Confidence interval is:

\[\hat{p} \pm z^*SE_{\hat{p}}\]

The standard error for proportion:

\[SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]

Note that, for a confidence interval, we use \(\hat{p}\) to calculate standard error.

n = 670
success = 571
conf_level = 0.95
alpha = 1 - conf_level
p_hat = success / n
SE = sqrt((p_hat*(1-p_hat))/n)
Z_star = qnorm(conf_level + alpha/2)

ci = c(p_hat - Z_star * SE, p_hat + Z_star * SE)
ci
## [1] 0.8254 0.8791

Sample size for a margin of error

In the example above, the margin of error was about 2.7%. To determine sample size needed for a confidence interval at a given margin of error, e.g., 1%:

\[n = (z^*/ME)^2 \times (p(1-p))\]

p_hat = 0.85
ME = 0.01
conf_level = 0.95
alpha = 1 - conf_level
Z_star = qnorm(conf_level + alpha/2)
n = (Z_star/ME)^2 * (p_hat * (1 - p_hat))
ceiling(n)
## [1] 4898

The answer given here is 4898 on the dot. Video shows 4894.4. This is presumably due to rounding differences (the Z value for 95% is actually a bit lower than 1.96). Don’t know how to deal with this on the exam. Use 1.96 hard-coded I guess.

Video 5-1-3: Hypothesis test for a proportion

\[SE = \sqrt{\frac{p(1-p)}{n}}\]

\[Z = \frac{\hat{p} - p}{SE}\]

Note that “p” here is the null value, because the null hypothesis assumes this is the population parameter.

A poll found that 60% of 1983 randomly sampled American adults believe in evolution. Does this provide convincing evidence that a majority believe?

\(H_0: p = 0.5\)
\(H_A: p > 0.5\)

Conditions

  1. independence: random sample, < 10% population
  2. sample size/skew: \(1983 * 0.5 \ge 10\), this is for both success and fail (note that we use the value from null hypothesis, not observed proportion from sample)
conf_level = 0.95
alpha = 1 - conf_level
nullval = 0.5
p_hat = 0.6
n = 1983
SE = sqrt((nullval * (1-nullval))/n)
Z_score = abs(p_hat - nullval) / SE
Z_score
## [1] 8.906
p_value = 1 - pnorm(Z_score)
p_value
## [1] 0

Reject the null hypothesis.

Video 5-2-1: Estimating the difference between two proportions

“Calculating a confidence interval between the two population proportions that are unknown using data from our samples.” Two categorical variables, each of which can be “success” or “failure.”

Gallup poll on handgun ban

succ. n \(\hat{p}\)
US 257 1025 0.25
Coursera 59 83 0.71

\(H_0: p_{Coursera} - p_{US} = 0\)
\(H_A: p_{Coursera} - p_{US} \ne 0\)

Conditions:

  1. Independence
  • within groups: sampled observations must be independent within each group
    • simple random sample/assignment
    • < 10% population
  • between groups:
    • groups must be independent of each other (non-paired)
  2. Sample size/skew: success-fail condition; we have at least 10 successes and failures for both US pop. and Courserians

For a confidence interval, we use the two observed proportions.

conf_level = 0.95
alpha = 1 - conf_level
success_coursera = 59
n_coursera = 83
success_us = 257
n_us = 1025
p_hat_coursera = success_coursera / n_coursera
p_hat_us = success_us / n_us
p_hat_coursera
## [1] 0.7108
p_hat_us
## [1] 0.2507
point_est = p_hat_coursera - p_hat_us
point_est
## [1] 0.4601
z_star = qnorm(conf_level + alpha/2)
z_star
## [1] 1.96
SE = sqrt((p_hat_coursera * (1 - p_hat_coursera))/n_coursera + (p_hat_us * (1 - p_hat_us))/n_us)
SE
## [1] 0.05157
ci = c(point_est - z_star * SE, point_est + z_star * SE)
ci
## [1] 0.3590 0.5612

Video 5-2-2: Hypothesis test for comparing two proportions

In the hypothesis test, the null hypothesis assumes that the two population proportions are equal. This means that we use a pooled proportion to check the success-failure condition, and to calculate standard error.

Male Female
Yes 34 61
No 52 61
Not sure 4 0
Total 90 122
\(\hat{p}\) 0.38 0.5

\(H_0: p_{male} - p_{female} = 0\)
\(H_A: p_{male} - p_{female} \ne 0\)

p_success_m = 34
p_success_f = 61
n_m = 90
n_f = 122
p_hat_m = p_success_m / n_m
p_hat_f = p_success_f / n_f
p_null = 0
p_pool = (p_success_m + p_success_f)/(n_m + n_f)
SE = sqrt( (p_pool * (1-p_pool))/n_m + (p_pool * (1-p_pool))/n_f )
SE
## [1] 0.0691
point_est = p_hat_m - p_hat_f
Z = abs(point_est - p_null) / SE
Z
## [1] 1.769
p_value = 2 * (1 - pnorm(Z))
p_value
## [1] 0.07694

Do not reject null hypothesis.

Video 5-3-1: Small sample proportion

Use simulation when the success-fail condition is not met for a proportion.

source("http://bit.ly/dasi_inference")
paul = factor(c(rep("yes", 8), rep("no", 0)), levels=c("yes", "no"))
paul
## [1] yes yes yes yes yes yes yes yes
## Levels: yes no
inference(paul, est="proportion", type="ht", method="simulation", success="yes", null=0.5, alternative="greater")
## Single proportion -- success: yes 
## Summary statistics:
## p_hat = 1 ;  n = 8 
## H0: p = 0.5 
## HA: p > 0.5


## p-value =  0.0038

Video 5-3-2: Example

“Know the back of your hand”

back = factor(c(rep("correct", 11), rep("incorrect",1)))
inference(back, est="proportion", type="ht", method="simulation", success="correct", null=0.5, alternative="greater",nsim=100)
## Single proportion -- success: correct 
## Summary statistics:
## p_hat = 0.9167 ;  n = 12 
## H0: p = 0.5 
## HA: p > 0.5


## p-value =  0

Video 5-3-3: Comparing two small sample proportions

Extension of example from 3-2.

back palm total
Correct 11 7 18
Incorrect 1 5 6
Total 12 12 24
\(\hat{p}\) 0.9167 0.5833 0.75

\(H_0: p_{back} - p_{palm} = 0\)
\(H_A: p_{back} - p_{palm} \ne 0\)

Conditions:

  1. Independence
  • within groups: can assume guess of one subject is independent from another.
  • between groups: no, the same people are guessing for both the back and palm of the hand, so the groups are paired. Therefore the condition is not met, but proceed for illustrative purposes.
  2. Sample size/skew: the pooled proportion is 0.75 (as shown in table). \(12 \times 0.75 = 9\) and \(12 \times 0.25 = 3\). Sample size condition not met, use simulation.

Simulation scheme

  1. 24 index cards, where each represents a subject
  2. Mark 18 as correct and 6 as wrong (see total column)
  3. Shuffle cards, split into two groups of 12, for back and palm
  4. Calculate the difference in proportions for “correct” and “incorrect” in both “decks” and record result
  5. Repeat many times
hand = factor(c(rep("correct", 7), rep("wrong", 5), c(rep("correct", 11), rep("wrong", 1))))

grp = c(rep("palm",12),rep("back",12))

inference(hand, grp, est = "proportion", type = "ht", null = 0, alternative = "twosided", order = c("back","palm"), success = "correct", method = "simulation", seed = 842, eda_plot = FALSE)
## Response variable: categorical, Explanatory variable: categorical
## Difference between two proportions -- success: correct
## Summary statistics:
##          x
## y         back palm Sum
##   correct   11    7  18
##   wrong      1    5   6
##   Sum       12   12  24
## Observed difference between proportions (back-palm) = 0.3333
## H0: p_back - p_palm = 0 
## HA: p_back - p_palm != 0


## p-value =  0.1566

There is not enough evidence to reject the null hypothesis. We do not conclude that there is a difference between the ability to recognize back versus palm.

Video 5-4-1: Chi-square GOF test

GOF is “goodness of fit”. Use it when you have one categorical variable with two or more levels. It tells you whether the proportions of the levels match the proportions in the population. So you need to know the proportions in the population for those levels.

Example: Racial distribution in jury selection

\(H_0:\) People selected for jury duty are a simple random sample from the population of potential jurors. The observed counts of jurors from the various ethnicities follow the same distribution as the ethnicities in the population.

\(H_A:\) People selected for jury duty are not a simple random sample from the population of potential jurors. The observed counts of jurors from various ethnicities do not follow the same distribution in the population.

The table shows the ethnicity distribution of the population available for jury duty, based on census data, along with the expected counts for a sample of 2500.

ethnicity white black nat. amer. asian & PI other total
% in pop. 80.29 12.06 0.79 2.92 3.94 100%
expected # 2007 302 20 73 98 2500

Distribution of 2500 people selected for jury duty.

ethnicity white black nat. amer. asian & PI other
observed # 1920 347 19 84 130
% 76.8 13.88 0.76 3.36 5.2

Conditions for chi-square test

  1. Independence. 1) random sample/assignment, 2) if sampling without replacement, sample < 10% population, 3) each case contributes to only one cell in table
  2. Sample size/skew. Each particular scenario (cell in table) must have at least 5 expected cases.

Chi-square statistic

Known as \(X^2\):

\[X^2 = \sum\limits_{i=1}^k \frac{(O-E)^2}{E}\]

Where O is the observed count, E is the expected count, and k is the number of cells. In this example, k is 5 (white, black, nat. amer., asian & PI, other). We square the differences so that:

  • all standardized differences are positive
  • highly unusual differences between observed and expected counts stand out even more

The chi-square distribution has one parameter, degrees of freedom. For the chi-square GOF test, \(df = k-1\) where k is the number of cells. So our example has df=4.

The following shows chi-square distributions for 2 degrees of freedom (green), 4 (red) and 9 (blue).

x <- rchisq(100, 5)
hist(x, prob=TRUE, ylim=c(0,0.6), xlim=c(-0.1, 20))
curve( dchisq(x, df=2), col='green', add=TRUE)
curve( dchisq(x, df=4), col='red', add=TRUE )
curve( dchisq(x, df=9), col='blue', add=TRUE)

\(H_0: p_{observed}-p_{expected} = 0\)
\(H_A: p_{observed}-p_{expected} \ne 0\)

k = 5
df = k - 1
expected = c(2007, 302, 20, 73, 98)
observed = c(1920, 347, 19, 84, 130)
x_squared = sum((observed-expected)^2/expected)
x_squared
## [1] 22.63
p_value = pchisq(x_squared, df, lower.tail=FALSE)
p_value
## [1] 0.0001499

Reject the null hypothesis.
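
As an aside (not from the video), R’s built-in chisq.test runs the same GOF test if the hypothesized population proportions are passed in via p. It uses unrounded expected counts, so the result differs slightly from the hand calculation above.

# observed was defined above; p gives the hypothesized population proportions
chisq.test(x=observed, p=c(0.8029, 0.1206, 0.0079, 0.0292, 0.0394))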

Video 5-4-2: The chi-square independence test

This applies when there are two categorical variables, at least one of which has more than two levels. It tells you whether there is a relationship between these two variables.

Example: Obesity and marital status

\(H_0:\) Weight and relationship status are independent. Obesity rates do not vary by relationship status. \(H_A:\) Weight and relationship status are dependent. Obesity rates do vary by relationship status.

dating cohabiting married total
obese 81 103 147 331
not obese 359 326 277 962
total 440 429 424 1293

The first variable is obesity, which has only two levels, obese or not obese. The second variable is status, which has three levels, dating, cohabiting and married. The question: Does there appear to be a relationship between weight and marital status?

\(H_0:\) Weight and relationship status are independent.
\(H_A:\) Weight and relationship status are not independent.

This hypothesis test is an independence test, because we’re evaluating the relationship between two categorical variables.

Chi-square statistic determined same way as before:

\[X^2 = \sum\limits_{i=1}^k \frac{(O-E)^2}{E}\]

Degrees of freedom: \(df = (R-1)\times(C-1)\), where R is the number of rows and C is the number of columns. Here, there are two rows and three columns, so df is (2-1)(3-1)=2.

Conditions are same as GOF test.

obese = c(81,103,147)
not_obese = c(359,326,277)
obesity_rate = sum(obese)/(sum(obese) + sum(not_obese))
obesity_rate
## [1] 0.256

Now we can add expected numbers of obese people to each level.

dating cohabiting married total
obese 81 103 147 331
not obese 359 326 277 962
total 440 429 424 1293
exp. obese 113 110 108
exp. not obese 327 319 316

Expected counts in two-way tables:

\[\text{expected count} = \frac{(\text{row total})\times(\text{column total})}{\text{table total}}\]

So for example, the row total for obese is 331, the column total for dating is 440, and the table total is 1293. So the expected count for obese people who are dating is (331*440)/1293, which when rounded is 113.

expected_obese = c(113,110,108)
observed_obese = c(81,103,147)
expected_not = c(327,319,316)
observed_not = c(359,326,277)
r = 2
c = 3
x_squared = sum((observed_obese-expected_obese)^2/expected_obese) + sum((observed_not - expected_not)^2/expected_not)
x_squared
## [1] 31.69
df = (r-1)*(c-1)
p_value = pchisq(x_squared, df, lower.tail=FALSE)
p_value
## [1] 1.315e-07

p-value is very small, so reject the null hypothesis.
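
As an aside (not from the video), chisq.test can also run the independence test directly on the table of counts. No continuity correction is applied for tables larger than 2x2, and the expected counts are not rounded, so the result differs slightly from the hand calculation above.

# build the 2x3 table of observed counts and test for independence
obesity_table = matrix(c(81, 359, 103, 326, 147, 277), nrow=2,
                       dimnames=list(c("obese", "not obese"),
                                     c("dating", "cohabiting", "married")))
chisq.test(obesity_table)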

Unit 6: Introduction to Linear Regression

Video 6-1-1: Correlation

In regression, there is an explanatory variable (on the x axis) and a response variable (on the y axis). The relationship must be linear in order for linear regression to be useful. The relationship can be positive or negative, weak or strong.

One measure of the relationship between two variables is correlation. It describes the strength of the linear relationship between the variables. It is denoted as R.

Properties of the correlation coefficient

  • Magnitude (absolute value) measures strength of linear relationship
  • Sign of the correlation coefficient indicates direction of association
  • The correlation coefficient is always between -1 (perfect negative correlation) and 1 (perfect positive correlation).
  • R=0 indicates no linear relationship at all.
  • Correlation coefficient is unitless, and is not affected by changes in center or scale of either variable (such as unit conversions)
  • Correlation of X with Y is the same as Y with X
  • The correlation coefficient is sensitive to outliers.

With perfect positive association, the slope of the regression line is positive. With perfect negative association, the slope is negative. When there is no association, the slope is horizontal. Note that the correlation coefficient is not exactly the same as the slope, but its sign and the sign of the regression line slope will always be the same.

Video 6-2-1: Residuals

Residuals are leftovers from the model fit. Data = Fit + Residuals. A residual is the difference between the observed and the predicted \(y\) for a given \(x\). Specifically, it is the observed \(y\) minus the predicted \(y\).

\[\text{residual} = e_i = y_i - \hat{y}_i\]

(I have seen other texts that refer to error as the difference between a response variable and the “true” regression line, the one based on a population, and the difference between a response variable and the estimated regression line as a residual. In that case, the error is represented by \(\epsilon\) and the residual by \(e\).)

If a residual is negative, then the actual data point is below the regression line. The model overestimates at that particular \(x\).

If the residual is positive, then the data point is above the regression line. The model underestimates at that particular \(x\).

Video 6-2-2: Least squares line

For the slope of the regression line, we minimize the sum of the squared residuals. Squaring the values avoids the problem of positive and negative values cancelling each other out, and it also makes residual values even more costly as they become larger.

(Unfortunately, this does not explain why squared values are valid. It explains two problems with using the original values, or absolute values. But there could be many methods of making residual values more costly as they become larger; for example, \(\exp(y)\), or cubing the value. I’m sure there must be a mathematical basis for squaring the values.)

The least squares line is \(\hat{y} = \beta_0 + \beta_1x\), where \(x\) is the explanatory variable, \(\hat{y}\) is the predicted response variable, \(\beta_1\) is the slope of the line, and \(\beta_0\) is the y-intercept of the line.

Notation:

parameter point estimate
intercept \(\beta_0\) \(b_0\)
slope \(\beta_1\) \(b_1\)

(I have a textbook on regression analysis that refers to the estimates as \(\hat{\beta_0}\) and \(\hat{\beta_1}\) rather than \(b_0\) and \(b_1\).)

Calculating the R value

(Video does not show this)

\[R = \frac{1}{n-1} \sum\limits_{i=1}^{n} \frac{(y_i-\bar{y})}{s_y} \times \frac{(x_i-\bar{x})}{s_x}\]

# make up some data for x and y
x <- c(10,12,14,10,13,16,20)
y <- c(9,10,12,9,14,14,16)
x.mean = mean(x)
y.mean = mean(y)
x.sd = sd(x)
y.sd = sd(y)
n <- length(x)
correl = (1/(n-1)) * sum(((y - y.mean)/y.sd) * ((x - x.mean)/x.sd))
correl
## [1] 0.915

Or use the cor function in R:

cor(x, y)
## [1] 0.915

Calculating the slope of the regression line

The slope of the regression line, \(b_1\), is:

\[b_1 = \frac{s_y}{s_x}R\]

# x, y, x.sd and y.sd defined or calculated above
R = cor(x, y)
b_1 = (y.sd/x.sd) * R
b_1
## [1] 0.7132

Calculating the y-intercept of the regression line

The regression line must always pass through \((\bar{x}, \bar{y})\), so \(\hat{y} - \bar{y} = b_1(x - \bar{x})\). We can combine this fact with the slope to calculate the y-intercept, by setting \(x=0\) (where \(\hat{y} = b_0\)):

\[b_0 - \bar{y} = b_1(0 - \bar{x}) = -b_1\bar{x}\] \[b_0 = -b_1\bar{x} + \bar{y}\]

# x.mean and y.mean calculated above
b_0 = -b_1 * x.mean + y.mean
b_0
## [1] 2.321

Output from R for calculating regression line will look something like:

Estimate Std. Error t value Pr(>|t|)
(Intercept) 64.78 6.80 9.52 0.00
hsgrad -0.62 0.08 -7.86 0.00

The first line is for the y-intercept, second is for the slope. “hsgrad” here is the explanatory variable. So the slope is -0.62 and the y-intercept is 64.78.

In R, using made-up data defined above:

data.lm = lm(y ~ x)
summary(data.lm)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      1      2      3      4      5      6      7 
## -0.453 -0.879 -0.306 -0.453  2.408  0.268 -0.585 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    2.321      1.964    1.18   0.2904   
## x              0.713      0.141    5.07   0.0039 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.22 on 5 degrees of freedom
## Multiple R-squared:  0.837,  Adjusted R-squared:  0.805 
## F-statistic: 25.7 on 1 and 5 DF,  p-value: 0.00386

Video 6-2-3: Prediction and extrapolation

Predict. Don’t extrapolate.

Prediction means using the linear model to predict the value for the response variable for a given value of the explanatory variable. You can safely calculate a value of the response variable \(y\) for any value of the explanatory variable \(x\) that is between the minimum and maximum observed values for \(x\). Going outside this range is extrapolation. Don’t.

Video 6-2-4: Conditions for linear regression

Linearity

There should be a straight-line relationship between the explanatory and response variables. Eyeball this using a scatterplot or residuals plot.

TODO: How to plot a residuals plot.
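
Here is one way to do it (my own sketch, not from the video), using the made-up x/y data and the data.lm model fitted above:

plot(data.lm$residuals ~ x)  # residuals vs. the explanatory variable
abline(h=0, lty=2)           # look for random scatter around this line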

Nearly normal residuals

  • Residuals should be nearly normal, centered at 0
  • May not be satisfied if there are unusual observations that don’t follow the trend of the rest of the data
  • Check using a histogram or normal probability plot of residuals

TODO: How to do either on a set of data’s residuals
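
Again a sketch of my own, using data.lm from above (the same functions show up in the unit 7 diagnostics later):

hist(data.lm$residuals)    # histogram of the residuals
qqnorm(data.lm$residuals)  # normal probability plot
qqline(data.lm$residuals)  # reference line for comparison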

Constant variability

The variability of the residuals should not increase or shrink as you move along x. It should not form a “cone.” Constant variability is called homoscedasticity.

http://bitly.com/slr_diag

Video 6-2-5: R squared

R-squared is just what it sounds like: square the R value. This value indicates the percentage of the variation in the response variable that is explained by the model. It is always between 0 and 1.

Video 6-2-6: Regression and explanatory categorical variables

Now we consider a case where the response variable is numerical but the explanatory variable is categorical.

Example: Poverty vs. region. The explanatory variable is region, where 0=east and 1=west.

\(\hat{poverty} = 11.17 + 0.38\text{(region:west)}\)

For eastern states, we plug in 0 for x. This means that the predicted poverty rate for all eastern states is 11.17.

For western states, the predicted poverty rate is 11.17 + 0.38 = 11.55.

When the explanatory variable is categorical, we always code one, and only one, level of that variable to be zero. This is the “reference level”.

Example with more than two levels:

Estimate Std. Err. t value Pr(>|t|)
(Intercept) 9.50 0.87 10.94 0.00
region4:midwest 0.03 1.15 0.02 0.98
region4:west 1.79 1.13 1.59 0.12
region4:south 4.16 1.07 3.87 0.00

The top row is the reference level (northeast, name not shown).

\[\hat{\text{% in poverty}} = 9.50 + 0.03\text{(region4:midwest)} + 1.79\text{(region4:west)} + 4.16\text{(region4:south)}\]

So here, the predicted poverty rate for midwest states would be 9.53. It is 11.29 for western states, and 13.66 for southern states.

Video 6-3-1: Outliers in regression

Leverage: a point that “falls away horizontally” from the cloud. This means it is some distance away horizontally from the cloud, although it might appear more or less where the regression line predicts.

Influence: The amount of change an outlier can create in the regression line. So a point can have high leverage but low influence. Influential points are usually high-leverage points.

Influential points do not necessarily reduce \(R^2\). It is possible that the data set without the outlier has no linear relationship, in which case both \(R\) and \(R^2\) will be very low. A highly-influential outlier might actually cause \(R^2\) to bump up slightly in this case. Example is a cloud of points with very low \(R\), and one highly influential outlier. This causes R suddenly to become quite high, and so \(R^2\) is also much higher.

I consider this question to be bogus. The instructor concedes “we would never want to fit a linear model” to such data. It’s not fit for linear regression. So we should never get to the point of checking \(R^2\) in this case. This is sort of like asking, “Is it always a good idea to put smoke detectors in a house?” and the answer is, “No, not if the house has been abandoned for 25 years, and there would be no negative impact if it burned.” Correct and irrelevant in my opinion.

Video 6-4-1: Inference for linear regression

Hypothesis testing for significance of predictor

Example is a study on IQ among twins, raised together or apart.

Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.2076 9.2999 0.99 0.3316
bioIQ 0.9014 0.0963 9.36 0.00

The linear model is \(\hat{\text{fosterIQ}} = 9.2076 + 0.9014\text{(bioIQ)}\). \(R^2\) is 0.78, although NOTE that you cannot see this from the above table (the summary output from R’s lm function includes \(R^2\), though not \(R\) itself).

\(H_0: \beta_1 = 0\)
\(H_A: \beta_1 \ne 0\)

What the above means is that, if the slope of the line is zero, there is no linear relationship between the variables. (I’ve done some thinking about this and remain bemused. If \(R\) is zero, then there is no linear relationship between the variables. But if \(R\) is 0.000001, then there is a relationship, no matter how weak, and the slope of the line could be either very small, like 0.000001, or it could be enormous, a billion.)

Linear regression always uses a t statistic for inference:

\[T = \frac{\text{point estimate}-\text{null value}}{SE_{b1}}\]

b_1 = 0.9014
point_est = b_1
nullval = 0
n = 27
df = n - 2 # always n-2 for simple linear regression, because we have two variables, intercept and slope
SE = 0.0963 # from table above, SE for b_1
T_score = abs(point_est - nullval)/SE
T_score
## [1] 9.36
p_value = 2 * (1 - pt(T_score, df=df)) # two-sided, matching H_A
p_value
## [1] 1.196e-09

With such a high t-score, and more to the point, such a low p-value, we can have high confidence that the null hypothesis should be rejected, i.e., \(\beta_1\) is not zero.

(A question occurs to me: what about the intercept? The p-value for it is much larger. I guess the conclusion here is that the estimated intercept may not be precisely correct. But that’s a different kettle of fish. The slope is special in the sense that, if it is precisely zero, there is no correlation at all between the variables. Otherwise, there is at least some correlation. But if there is a small difference between our estimated intercept and the true intercept, it does not decide whether there is some, or no, correlation. The video does make the point later that “inference on the intercept is rarely done.”)

Confidence interval for slope

Confidence interval is point estimate \(\pm\) margin of error:

\[b_1 \pm t^*_{df} \cdot SE_{b1}\]

b_1 = 0.9014
point_est = b_1
n = 27
df = n - 2
conf_level = 0.95
alpha = 1 - conf_level
SE = 0.0963
t_star = qt(conf_level + (alpha/2), df=df)
ci = c(point_est - t_star * SE, point_est + t_star * SE)
ci
## [1] 0.7031 1.0997

We are 95% confident that for each additional point on the biological twins’ IQs, the foster twins’ IQs are expected on average to be higher by 0.7 to 1.1 points.

Recap: Inference for regression

  • Null value for hypothesis test is often 0, since we want to check for any relationship between the variables
  • Regression output gives \(b_1, SE_{b1}\) and two-tailed p-value for the t-test for the slope where the null value is 0.
  • Inference on the intercept is rarely done.

Caution

  • Always be aware of the data: is it a random sample, a non-random sample, a population?
  • Statistical inference and the p-value are meaningless if you have population data.
  • If the sample is non-random, the results will be unreliable
  • Ultimate goal is to have observations that are independent.

Video 6-4-2: Variability partitioning

So far, inference has used a t-test to evaluate the strength of the evidence for a hypothesis test for the slope of the relationship between \(x\) and \(y\).

Alternatively, we can consider the variability in \(y\) explained by \(x\), compared to the unexplained variability.

Partitioning the variability in \(y\) to explained and unexplained variability requires ANOVA. Here is an example of ANOVA output:

Df SumSq MeanSq F val Pr(>F)
bioIQ 1 5231.13 5231.13 87.56 0.0000
Residuals 25 1493.53 59.74
Total 26 6724.66

Review of the ANOVA table

\(SS_{Tot}\) (sum of squares total) is 6724.66 above. The unexplained variability is \(SS_{Res}\) (sum of squares of residuals) or 1493.53. So the explained variability is \(SS_{Reg} = SS_{Tot}-SS_{Res} = 5231.13\).

The total degrees of freedom is \(df_{Tot} = 27 - 1 = 26\). The regression degrees of freedom is \(df_{Reg} = 1\) because there is only one predictor (in multiple linear regression this will be higher). The residuals degrees of freedom is \(df_{Res} = df_{Tot} - df_{Reg} = 26 - 1 = 25\).

The mean squares for the regression is \(MS_{Reg} = \frac{SS_{Reg}}{df_{Reg}} = \frac{5231.13}{1} = 5231.13\). The mean squares of the residuals is \(MS_{Res} = \frac{SS_{Res}}{df_{Res}} = \frac{1493.53}{25} = 59.74\).

Finally, the F statistic, which is the ratio of explained to unexplained variability, is \(F_{(1,25)} = \frac{MS_{Reg}}{MS_{Res}} = \frac{5231.13}{59.74} = 87.56\).
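
The same arithmetic in R, as a check of my own (values taken from the table):

SS_tot = 6724.66
SS_res = 1493.53
SS_reg = SS_tot - SS_res
df_reg = 1
df_res = 25
MS_reg = SS_reg / df_reg
MS_res = SS_res / df_res
F_stat = MS_reg / MS_res
round(F_stat, 2)
## [1] 87.56
pf(F_stat, df_reg, df_res, lower.tail=FALSE)  # approx. 1.2e-09, the Pr(>F) shown as 0.0000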

Hypothesis test

\(H_0: \beta_1 = 0\)
\(H_A: \beta_1 \ne 0\)

The p-value in the table is very small, so we reject the null hypothesis. The data provide convincing evidence that the slope is significantly different from 0.

Revisiting \(R^2\)

Two ways to calculate

  • Using square of the correlation coefficient
  • From the ANOVA table, as shown previously

In the example, \(R^2 = 0.882^2 \approx 0.78\). Using the ANOVA table, it is \(\frac{SS_{Reg}}{SS_{Tot}} = \frac{5231.13}{6724.66} \approx 0.78\).

So in the ANOVA table, it is Sum Sq, first row, divided by Sum Sq, total row.

Unit 7: Multiple Linear Regression

Video 7-Intro

This unit will use, for one example, data with predictive variables for birthweight.

The response variable will always be numerical, but the explanatory variables can be numerical or categorical.


Topics:

  • multiple predictors
  • inference for MLR
  • model selection (which variables to include)
  • model diagnostics

Video 7-1-1: Multiple predictors

First example is relationship between weight of books, and volume and cover type as the explanatory variables.

library(DAAG) # contains some data to be used as example
## Loading required package: lattice
## 
## Attaching package: 'DAAG'
## 
## The following object is masked from 'package:openintro':
## 
##     possum
data(allbacks)
# fit model
book_mlr = lm(weight ~ volume + cover, data = allbacks)
summary(book_mlr)
## 
## Call:
## lm(formula = weight ~ volume + cover, data = allbacks)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -110.1  -32.3  -16.1   28.9  210.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  197.9628    59.1927    3.34  0.00584 ** 
## volume         0.7180     0.0615   11.67  6.6e-08 ***
## coverpb     -184.0473    40.4942   -4.55  0.00067 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78.2 on 12 degrees of freedom
## Multiple R-squared:  0.927,  Adjusted R-squared:  0.915 
## F-statistic: 76.7 on 2 and 12 DF,  p-value: 1.45e-07

Here, the intercept is 197.96, and the explanatory variables are volume and cover. We see cover:pb, for paperback. Since this is listed, we know it is the “non-reference” value, and the one not shown must be the “reference” value, which means its value would be 0. So in this case, a paperback starts out 184 grams lighter than a hardcover, all else being equal. If there were three cover types, two of them would be shown.

The \(R^2\) value is 0.9275, so 92.75% of the variability in the response variable is explained by volume and cover type.

Based on the output above, the model is:

\(\hat{\text{weight}} = 197.96 + 0.72\text{volume} - 184.05\text{cover:pb}\)

Because hardcover is the reference value, for a hardcover book, we plug in a 0 for cover, and the weight would be:

\(\hat{weight} = 197.96 + 0.72\text{volume} - 184.05 \times 0 = 197.96 + 0.72\text{volume}\)

For a paperback:

\(\hat{weight} = 197.96 + 0.72\text{volume} - 184.05 \times 1 = 13.91 + 0.72\text{volume}\)

pb = subset(allbacks, allbacks$cover=="pb")
hb = subset(allbacks, allbacks$cover=="hb")
pb.lm = lm(pb$weight ~ pb$volume)
hb.lm = lm(hb$weight ~ hb$volume)
plot(hb$volume, hb$weight,pch="*", ylab="weight", xlab="volume (cm3)", xlim=c(0,1500), ylim=c(0,1000))
par(new=T)
plot(pb$volume, pb$weight,pch=17, xlab="", ylab="", xlim=c(0,1500), ylim=c(0,1000),bg=4)
abline(pb.lm)
abline(hb.lm)
legend("topleft", inset=0.05, title="Cover Type", c("pb", "hb"))


par(new=F)

Interpreting the regression parameters

The slope estimate for volume is 0.72, so that, for each additional cubic centimeter in volume, all else being held constant, the weight is predicted to increase by 0.72 grams.

The slope for the cover variable is -184.05, so that, all else being held constant, when a book has a cover type of paperback, there is an expected decrease of 184.05 grams in weight.

The intercept is 197.96, which would mean that a book with no volume would weigh 197.96 grams. Since this is clearly nonsense, the intercept has no relevance by itself.

Prediction

We can use this linear regression model to predict book weight. For example, a paperback with a volume of 600 \(cm^3\) would be \(197.96 + 0.72(600) - 184.05(1) = 445.91\) grams.
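
The same prediction can be made with predict() on the fitted model (my own aside, not from the video); the answer differs slightly from the hand calculation because the coefficients above were rounded.

predict(book_mlr, newdata=data.frame(volume=600, cover="pb"))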

Interaction variables

The model assumes hardcover and paperback books have the same slope for the relationship between their volume and weight. This may be true in some cases, but in others it may not. In that case we would include an interaction variable, which is beyond the scope of this course.

Video 7-1-2: Adjusted \(R^2\)

states = read.csv("http://bit.ly/dasi_states")
pov_slr = lm(poverty ~ female_house, data=states)
summary(pov_slr)
## 
## Call:
## lm(formula = poverty ~ female_house, data = states)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.754 -1.825 -0.037  1.556  6.328 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.309      1.897    1.74    0.087 .  
## female_house    0.691      0.160    4.32  7.5e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.66 on 49 degrees of freedom
## Multiple R-squared:  0.276,  Adjusted R-squared:  0.261 
## F-statistic: 18.7 on 1 and 49 DF,  p-value: 7.53e-05
cor(states$female_house, states$poverty)
## [1] 0.5254
plot(poverty ~ female_house, data=states)
abline(pov_slr)


\(R\) is 0.525, \(R^2\) is 0.276. The p-value shows that female householder is a significant predictor of living in poverty.

Now the output from ANOVA.

anova.pov_slr = anova(pov_slr)
anova.pov_slr
## Analysis of Variance Table
## 
## Response: poverty
##              Df Sum Sq Mean Sq F value  Pr(>F)    
## female_house  1    133   132.6    18.7 7.5e-05 ***
## Residuals    49    348     7.1                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The proportion of explained variability is \(R^2 = \frac{\text{explained variability}}{\text{total variability}} = \frac{132.57}{480.25} = 0.276\). This matches the value from the lm output.

Now add another variable to move to multiple linear regression.

pov_mlr = lm(poverty ~ female_house + white, data=states)
summary(pov_mlr)
## 
## Call:
## lm(formula = poverty ~ female_house + white, data = states)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.525 -1.853 -0.038  1.377  6.269 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.5789     5.7849   -0.45  0.65774    
## female_house   0.8869     0.2419    3.67  0.00061 ***
## white          0.0442     0.0410    1.08  0.28676    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.66 on 48 degrees of freedom
## Multiple R-squared:  0.293,  Adjusted R-squared:  0.264 
## F-statistic: 9.95 on 2 and 48 DF,  p-value: 0.000242
anova(pov_mlr)
## Analysis of Variance Table
## 
## Response: poverty
##              Df Sum Sq Mean Sq F value  Pr(>F)    
## female_house  1    133   132.6   18.74 7.6e-05 ***
## white         1      8     8.2    1.16    0.29    
## Residuals    48    339     7.1                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Output similar to previous, but with new variable, white, as a predictor.

The total variability has not changed, because this is the inherent variability that is in the response variable, regardless of the number of explanatory variables. But what has changed is how the variability is partitioned. Some is now attributed to female householder, and some to race=white.

To calculate \(R^2\) here, we use both explanatory variables: \(R^2 = \frac{132.57 + 8.21}{480.25} = 0.293\). This has increased slightly from about 28% with the simple linear regression. The number changes because the explained variability is now attributed to two variables instead of one.

Adjusted \(R^2\):

\[R^2_{adj} = 1 - \left(\frac{SSE}{SST} \times \frac{n-1}{n-k-1}\right)\]

where \(k\) is the number of predictors. This introduces a penalty to \(R^2\) based on the number of predictors in the model, and the magnitude of this penalty is based on the proportion of \(k\) to \(n\). The larger \(n\) is, the more \(k\) can increase without significantly penalizing.

Remember n is 51 (50 states plus DC), though the table above shows 50 (n-1). So adjusted \(R^2\) is:

\(1 - (\frac{339.47}{480.25} \times \frac{51-1}{51-2-1}) = 0.26\)
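
Checking that calculation in R (my own check); the result should match the Adjusted R-squared in the summary output above.

SST = 480.25   # total variability: 132.57 + 8.21 + 339.47
SSE = 339.47   # unexplained (residual) variability
n = 51
k = 2
1 - (SSE/SST) * ((n-1)/(n-k-1))  # approx. 0.264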

\(R^2\) adjusted \(R^2\)
Model 1 (poverty vs. female_house) 0.28 0.26
Model 2 (poverty vs. female_house + white) 0.29 0.26

So we see that \(R^2\) increased, but adjusted \(R^2\) essentially did not (it went from 0.261 to 0.264 in the lm output above).

  • When any variable is added to the model, \(R^2\) increases
  • But if the added variable doesn’t really provide any new information, or is completely unrelated, adjusted \(R^2\) does not increase.

Properties of adjusted \(R^2\)

  • \(k\) is at least 1, so the penalty factor \(\frac{n-1}{n-k-1}\) is greater than 1 and adjusted \(R^2\) is smaller than \(R^2\)
  • adjusted \(R^2\) applies a penalty for the number of predictors in the model
  • we choose models with higher adjusted \(R^2\) over others

Video 7-1-3: Collinearity and Parsimony

Collinearity

  • Two predictor variables are said to be collinear when they are correlated with each other
  • Remember: predictors are also called independent variables, so they should be independent of each other
  • Inclusion of collinear predictors (also called multicollinearity) complicates model estimation

In the example data, there is a very strong negative relationship between the variable white and the variable female_householder, so they are not independent of each other. So if female_householder is one of our variables, we should not add white. Think of the two together as a sort of echo chamber.
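
A quick check of that claim (my own, not from the video), using the states data frame loaded earlier:

cor(states$female_house, states$white)  # the video describes this correlation as strongly negative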

Parsimony

This leads to the subject of parsimony.

  • Avoid adding predictors associated with each other because often they bring nothing new to the table
  • Prefer the simplest best model, the parsimonious model (Occam’s Razor: among competing hypotheses, the one with fewest assumptions is best)
  • Addition of collinear variables can result in biased estimates of the regression parameters
  • While it’s impossible to prevent collinearity from arising in observational data, experiments are usually designed to control for correlated predictors

Video 7-2-1: Inference for MLR

Data for cognitive study:

cognitive = read.csv("http://bit.ly/dasi_cognitive")
cog_full = lm(kid_score ~ mom_hs + mom_iq + mom_work + mom_age, data=cognitive)
summary(cog_full)
## 
## Call:
## lm(formula = kid_score ~ mom_hs + mom_iq + mom_work + mom_age, 
##     data = cognitive)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54.05 -12.92   1.99  11.56  49.27 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.5924     9.2191    2.13    0.034 *  
## mom_hsyes     5.0948     2.3145    2.20    0.028 *  
## mom_iq        0.5615     0.0606    9.26   <2e-16 ***
## mom_workyes   2.5372     2.3507    1.08    0.281    
## mom_age       0.2180     0.3307    0.66    0.510    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.1 on 429 degrees of freedom
## Multiple R-squared:  0.217,  Adjusted R-squared:  0.21 
## F-statistic: 29.7 on 4 and 429 DF,  p-value: <2e-16

Inference for the model as a whole

\(H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0\)
\(H_A:\) At least one \(\beta_i\) is different than 0

The test statistic is an F-statistic, 29.74 on 4 (num. predictors) and 429 (n-k-1) degrees of freedom (there were 434 cases). We also have the p-value, so we don’t need to do any calculations. Since the p-value is less than 0.05, we reject the null hypothesis.

  • The F test yielding a significant result does not mean the model fits the data well. It means at least one of the \(\beta\) values is non-zero.
  • If the F test does not yield a significant result (i.e., p-value > 0.05), it doesn’t mean individual variables are not good predictors of y, only that the combination of them doesn’t yield a good model.

Hypothesis testing for slopes

Example: Is whether or not the mother went to high school a significant predictor of the cognitive test scores of children, given all other variables in the model?

\(H_0: \beta_1=0\), when all other variables are included in the model
\(H_A: \beta_1\ne0\), when all other variables are included in the model

where \(\beta_1\) is estimate for mom_hs:yes. The p-value for this variable is 0.028, so we can reject the null hypothesis. The variable is a good predictor of \(y\).

We can walk through how this p-value was obtained.

\[T = \frac{\text{point estimate}- \text{null value}}{SE}\]

Our point estimate is \(b_1\) and \(SE\) is \(SE_{b1}\). So T is \(\frac{b_1 - 0}{SE_{b1}} = \frac{5.095}{2.315}=2.201\) where \(df = n - k - 1\).

A note about \(df = n - k - 1\). Recall that in single linear regression, \(df = n - 2\). This is because \(k=1\), since there is only one predictor. So this is the same concept.

If T is 2.201, then the p-value is 0.028.

n = 434
k = 4
b_1 = 5.095
se_b1 = 2.315
nullval = 0
T = (b_1 - nullval)/se_b1
df = n - k - 1
p_value = pt(T, df=df, lower.tail=FALSE) * 2
p_value
## [1] 0.02828

Confidence interval for slopes

As always, it is point estimate \(\pm\) margin of error. Here, the point estimate is the slope estimate, so for mom_work:yes, \(b_3 \pm t^*_{df} \cdot SE_{b_3}\).

conf_level = 0.95
alpha = 1 - conf_level
b_3 = 2.54
se_b3=2.35
#df comes from previous snippet
t_star_df = abs(qt(1 - (alpha/2), df=df))
ci = c(b_3 - t_star_df * se_b3, b_3 + t_star_df * se_b3)
ci
## [1] -2.079  7.159

We are 95% confident that the cognitive IQ of children whose mothers worked score 2.09 points lower to 7.17 points higher than those whose mothers did not work.

Video 7-3-1: Model selection

Stepwise model selection

One method is backwards elimination: start with full model (all predictors), drop one at a time until parsimonious model is reached

Another is forward selection: start with empty model and add one predictor at a time until parsimonious model is reached

There are many criteria for model selection, such as:

  • p-value
  • adjusted \(R^2\)
  • AIC (Akaike information criterion)
  • BIC (Bayesian information criterion)
  • DIC (deviance information criterion)
  • Bayes factor
  • Mallows’ \(C_p\)

All but first two beyond scope of course.

Backwards elimination using adjusted \(R^2\)

  • Start with full model
  • Drop one variable at a time and record adjusted \(R^2\) of each smaller model
  • Pick the model with highest increase in adjusted \(R^2\)
  • Repeat until none of the models yield an increase in adjusted \(R^2\)

We saw that adjusted \(R^2\) for full model is 0.2098. Now remove mom_hs:

kid_score ~ mom_iq + mom_work + mom_age

This yields adjusted \(R^2\) of 0.2027. Removing it did not improve the model, it made it worse. So put it back in, and try removing another. End result:

step variables included removed adjusted \(R^2\)
full kid_score ~ mom_hs + mom_iq + mom_work + mom_age 0.2098
step 1 kid_score ~ mom_iq + mom_work + mom_age [-mom_hs] 0.2027
kid_score ~ mom_hs + mom_work + mom_age [-mom_iq] 0.0541
kid_score ~ mom_hs + mom_iq + mom_age [-mom_work] 0.2095
kid_score ~ mom_hs + mom_iq + mom_work [-mom_age] 0.2109
step 2 kid_score ~ mom_iq + mom_work [-mom_hs] 0.2024
kid_score ~ mom_hs + mom_work [-mom_iq] 0.0546
kid_score ~ mom_hs + mom_iq [-mom_work] 0.2105

In step 1, we found that removing mom_age improved the model. The question in step 2 is, with mom_age removed, can we remove any other predictors? The answer in this case is no, removing them one at a time never improves the model, so the final model is kid_score ~ mom_hs + mom_iq + mom_work. But if we had been able to improve the model by removing another predictor in step 2, then we would go on to step 3, and so on.
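
Not from the video, but the adjusted \(R^2\) values in the table can be pulled straight out of each candidate model, e.g. for the full model and for the model with mom_hs dropped:

summary(lm(kid_score ~ mom_hs + mom_iq + mom_work + mom_age, data=cognitive))$adj.r.squared  # 0.2098 in the table
summary(lm(kid_score ~ mom_iq + mom_work + mom_age, data=cognitive))$adj.r.squared           # 0.2027 in the table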

Backward elimination with p-value

  • Start with full model
  • Drop variable with highest p-value and refit a smaller model
  • Repeat until all variables left in the model are significant

So the idea here is to look at the full model, and find the highest p-value. Is it above the significance level? If so, remove it, and then refit the model. Is the new highest p-value also above the significance level? Then repeat. Keep doing this until you no longer have a p-value above the significance level, or you only have one predictor left.

Note that if a predictor is a categorical variable, you have to look at all of the levels for that variable. If any of them have a p-value below the significance level, then you keep that variable, which means keeping all of the levels.

Adjusted \(R^2\) vs. p-value

In the example with the cognitive IQ data, we end with a final model of kid_score ~ mom_hs + mom_iq. This is different from the result of the adjusted \(R^2\) method. This is because p-values depend on a significance level, which is somewhat arbitrary.

We use the p-value approach if we are interested in which predictors are statistically significant. We use adjusted \(R^2\) if we want more reliable predictions.

The p-value method is used more commonly, since it requires fitting fewer models (in the more commonly used backwards-selection process)

Forward selection using adjusted \(R^2\)

  • Start with single predictor regressions of response vs. each explanatory variable
  • Pick the model with the highest adjusted \(R^2\)
  • Add the remaining variables one at a time to the existing model, and pick the model with the highest adjusted \(R^2\); repeat until no addition increases adjusted \(R^2\)

So if you have 10 predictors, you start by fitting 10 single-predictor models, choose the best one, then fit 9 more models, each of which has the first predictor plus one of the remaining predictors. Keep repeating until you’ve fitted all possible predictors. Of course, if you’ve fitted \(n\) predictors, and fitting each one of the remaining predictors yields a worse model, you stop with those \(n\) predictors.

Forward selection using p-values

  • Start with single predictor regressions of response vs. each explanatory variable
  • Pick the variable with the lowest significant p-value
  • Add the remaining variables one at a time to the existing model and pick the variable with the lowest significant p-value
  • Repeat until none of the remaining predictors have a significant p-value

I found this a little confusing so here is my example. Say you have five predictors. Create a single-predictor model for each. Pick the one with the lowest p-value. If this is not significant, then stop here–you can’t create a significant model. Otherwise, now create four models, each of which has this first predictor and one of the others. Now pick the model that has the lowest p-value. Is it significant? If so, continue, otherwise stop with the single-predictor model. Now create three models, each with the first two predictors and one of the remaining three. Keep repeating this until either all of the predictors are added, and the p-value is still significant, or else you have to stop at a previous step because adding another predictor did not result in a significant p-value.

Expert opinion

Sometimes variables are included, or eliminated, based on expert opinion. Or, if you are studying a certain variable, you will leave it in the model regardless.

Final model

Our final model, using adjusted \(R^2\):

cog_final = lm(kid_score ~ mom_hs + mom_iq + mom_work, data=cognitive)
summary(cog_final)
## 
## Call:
## lm(formula = kid_score ~ mom_hs + mom_iq + mom_work, data = cognitive)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -53.76 -13.02   2.09  11.55  48.65 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  24.1794     6.0432    4.00  7.4e-05 ***
## mom_hsyes     5.3823     2.2716    2.37    0.018 *  
## mom_iq        0.5628     0.0606    9.29  < 2e-16 ***
## mom_workyes   2.5664     2.3487    1.09    0.275    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.1 on 430 degrees of freedom
## Multiple R-squared:  0.216,  Adjusted R-squared:  0.211 
## F-statistic: 39.6 on 3 and 430 DF,  p-value: <2e-16

Note that mom_work does not have a significant p-value, but that is because we used the adjusted \(R^2\) method. It gives the model higher predictive power, even though it is not statistically significant. I think this means that it gives it just a teensy bit of higher predictive power, but using adjusted \(R^2\), more is always better, no matter how tiny the addition.

Video 7-4-1: Diagnostics for MLR

Conditions for valid model:

  • linear relationships between \(x\) and \(y\)
  • nearly normal residuals
  • constant variability of residuals
  • independence of residuals

Linear relationships between (numerical) x and y

If an explanatory variable is categorical, it doesn’t make sense to ask for a linear relationship between it and a numerical variable.

  • Each (numerical) explanatory variable linearly related to the response variable
  • Check using residuals plots (\(e\) vs. \(x\))
  • look for random scatter around zero
  • use instead of scatterplot because it allows for considering other variables that are not in the model, and not just the bivariate relationship between a given \(x\) and \(y\)
cog_final = lm(kid_score ~ mom_hs + mom_iq + mom_work, data=cognitive)
# note: mom_iq is only numerical variable
# residuals are saved in residuals field in model
plot(cog_final$residuals ~ cognitive$mom_iq)

The plot shows values scattered randomly around zero, so the condition is satisfied.

Nearly normal residuals with mean 0

  • Some residuals will be positive, some negative
  • On residuals plot, look for random scatter of residuals around 0
  • This translates to a nearly normal distribution of residuals centered at 0
  • Check using histogram or normal probability plot
par(mfrow=c(1,2))
hist(cog_final$residuals)
qqnorm(cog_final$residuals)
qqline(cog_final$residuals)


Histogram shows slight skew but acceptable. Condition met.

Constant variability of residuals

  • Residuals should be equally variable for low and high values of predicted response variable
  • Check using residuals plots of residuals vs. predicted (\(e\) vs. \(\hat{y}\)) because it allows for considering the entire model (all expl. vars.) at once
  • Residuals randomly scattered in band with constant width…no fan
  • Also good to view abs. values of residuals vs. predicted to identify unusual observations easily
par(mfrow=c(1,2))
plot(cog_final$residuals ~ cog_final$fitted)
plot(abs(cog_final$residuals) ~ cog_final$fitted)


Independent residuals

  • Independent residuals means independent observations
  • If time series structure is suspected, check using residuals vs. order of data collection
  • If not, think about how data are sampled

To do a quick check for time series:

plot(cog_final$residuals)


If there were a time series structure, we would probably see the residuals trend up and down in order of collection.

Logistic Regression

Some of my own notes.

The formula

The book states but does not explain that the model

\[\log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n\]

is equivalent to

\[p_i = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n}}{1+e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n}}\]

We know that if \(\log(p) = x\) then \(p = e^x\). But let’s start with:

\[\log(\frac{p}{1-p}) = x\]

Therefore:

\[\frac{p}{1-p} = e^x\]

We can manipulate the left side:

\[\frac{p}{1-p} = -1 \cdot \frac{p}{p-1} = -1 \cdot \frac{1+p-1}{p-1} = -1 \cdot \left(\frac{p-1}{p-1} + \frac{1}{p-1} \right)\]

\[= -1 \cdot \left(1 + \frac{1}{p-1} \right) = -1 + \frac{1}{1-p}\]

So:

\[ \frac{p}{1-p} = -1 + \frac{1}{1-p}\]

And therefore:

\[-1 + \frac{1}{1-p} = e^x\]

Now simplify:

\[ \frac{1}{1-p} = 1 + e^x\]

\[1-p = \frac{1}{1 + e^x}\]

\[p = 1 - \frac{1}{1 + e^x}\]

\[p = \frac{1 + e^x}{1 + e^x} - \frac{1}{1 + e^x} = \frac{1 + e^x -1}{1 + e^x} = \frac{e^x}{1 + e^x}\]

Therefore, substituting \(x = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n\):

\[p_i = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n}}\]
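
To see the formula in action (a sketch of my own, not from the book or videos), we can fit a logistic regression with R’s glm on the built-in mtcars data (am as the 0/1 response, wt as the predictor) and check that the fitted probabilities equal \(e^x/(1+e^x)\), where \(x\) is the linear predictor \(\beta_0 + \beta_1 x_1\):

fit = glm(am ~ wt, data=mtcars, family=binomial)
eta = coef(fit)[1] + coef(fit)[2] * mtcars$wt  # b_0 + b_1 * x
p_manual = exp(eta) / (1 + exp(eta))           # e^x / (1 + e^x)
p_glm = predict(fit, type="response")          # fitted probabilities from glm
all.equal(as.numeric(p_manual), as.numeric(p_glm))
## [1] TRUE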