My notes on the lectures/videos for selected units. These are just my notes, and although I have checked things to my satisfaction, I cannot guarantee total accuracy.
As of today (9/1/14), I have more or less complete notes on units 4 through 7, and partial notes on unit 1. My goal is to try to fill in units 2 through 3, and the rest of unit 1, in the next few weeks.
The data shown here is similar but not identical to that in the video.
google <- read.csv("google_data/google-government-removal-requests.csv")
google <- google[complete.cases(google),]
colnames(google) <- c("period", "country", "terr.code", "all_num_req", "all_compl", "all_tbr", "cr_req", "cr_compl", "cr_tbr", "other_req", "other_compl", "other_tbr")
gfields <- c("country", "cr_req", "cr_compl", "other_req", "other_compl")
head(google[,gfields])
## country cr_req cr_compl other_req other_compl
## 81 Germany 72 100 46 91
## 82 Mexico 1 0 1 0
## 83 Netherlands 1 100 3 100
## 86 Brazil 146 97 117 50
## 94 United Kingdom 18 100 20 80
## 96 Argentina 21 100 1 100
The table above is a data matrix. Each row is an observation or case. Each column represents a variable (also sometimes known as a feature).
There are two types of variables: numerical and categorical. Numerical (quantitative) variables take on numerical values. It is sensible to add, subtract or take averages of these values. Categorical (qualitative) variables take on a limited number of distinct categories. Categories can be identified with numbers. For example, gender might be coded as 0 for males and 1 for females. But it does not make sense to perform arithmetic operations on these categories.
Numerical variables can be continuous or discrete. Continuous numerical variables are usually measured; an example is height. They can take on any numerical value. We tend to round a measure like height when we record it, but this does not change the fact that it is a continuous variable. Discrete numerical variables are generally counted, and can take on only whole non-negative numbers. It can take some thought to decide whether a numerical variable is continuous or discrete; height is continuous, for example, but recorded values may be rounded, which can make it look discrete.
Categorical variables can be ordinal, which means there is some kind of natural order to the categories; for example, “very low”, “low”, “middle”, “high” and “very high”. If there is no natural order, then the variable is just a “regular” categorical variable.
Reviewing the Google transparency data shown above:
The following shows other_req plotted against other_compl (comparable, I think, to ud_req and ud_comply in video).
plot(google$other_req, google$other_compl)
There is an outlier, the United States, which makes many more requests for content removal than any other country.
When two variables show some connection with one another, they are called associated or dependent variables. The association can be described as positive (when one variable goes up, so does the other) or negative (when one variable goes up, the other goes down).
The plot above does not show the connection between the variables as clearly as the one in the video, but a slight positive association can be detected.
If two variables are not associated, they are independent.
In an observational study, researchers collect data in a way that does not directly interfere with how the data arise; they merely “observe.” In an observational study, we can only establish an association, or correlation, between two variables (if one exists). We cannot identify causation.
If the data in an observational study are from the past, it is a retrospective study. If data are observed as they arise, it is a prospective study.
In an experiment, researchers randomly assign subjects to various treatments, and can establish causal connections between explanatory and response variables.
An observational study might look at two groups of people, those who work out regularly and those who don’t, and measure average energy levels. An experiment might assign subjects to two groups: those who will be directed to work out regularly, and those who will be directed not to. It should be noted that an experiment cannot always establish causal connections. In this example, there may be other confounding variables that have not been controlled, such as whether a subject previously worked out regularly.
Another example in the video discusses an observational study of the breakfast habits of girls. A news headline concludes that “breakfast cereal keeps girls slim,” thus attempting to draw a causal connection from an observational study, which is invalid. The study showed that girls who ate one of several breakfast cereals tended not to be overweight, but there could be many confounding variables. Was it just cereal? Were these girls more likely to eat breakfast because of other factors, which could be the real causes?
Extraneous variables that affect both the explanatory and response variables, and that make it seem like there is a relationship between them, are called confounding variables.
Correlation does not imply causation.
A census considers the data for an entire population. This can be very expensive and difficult. Also, populations rarely stand still. By the time you collect data from an entire population, the population itself has changed.
Exploratory analysis is like sampling soup while it cooks to determine whether it needs more or less salt. If you can generalize that the entire soup pot is too salty, you have made an inference. But you won’t drink all of the soup to do this. Instead you want a representative sample, which might be a taste after stirring the pot thoroughly.
An example in the video is the presidential election of 1936. A magazine, The Literary Digest, polled 10 million Americans, a huge number. About 2.4 million responded, and the results showed that FDR’s challenger would win by a huge margin. Instead, FDR won by a huge margin. The mistake was that the people polled were the magazine’s own readers, as well as car and telephone owners. At the time, this group had an average income far above the national average. So in spite of the huge number of responses, the results were heavily biased.
You can drink a gallon of the soup from a large soup pot, but if it is all from the top of the pot, you still may not know how your soup tastes!
Also: who makes soup with green beans without cutting off the ends of the green beans first?!?
This refers to randomly selecting subjects from within a population for a study. If the sample is selected randomly, it is likely to be representative of the population. Therefore, the study’s results are generalizable to the population.
Random assignment occurs only in an experiment. It refers to the assignment of subjects to the treatment and control groups. If we are careful to assign subjects randomly to each group, we may be able to draw causal conclusions.
|  | Random assignment | No random assignment |  |
|---|---|---|---|
| Random sampling | causal and generalizable | not causal, but generalizable | Generalizability |
| No random sampling | causal, but not generalizable | neither causal nor generalizable | No generalizability |
|  | Causation | Association |  |
library(xlsx)
## Warning: package 'xlsx' was built under R version 3.1.3
## Loading required package: rJava
## Loading required package: xlsxjars
## Warning: package 'xlsxjars' was built under R version 3.1.2
life_exp <- read.xlsx2("gapminder/life-expectancy.xlsx",sheetIndex=1,header=T)
gdp <- read.xlsx2("gapminder/gdp-per-capita.xlsx",sheetIndex=1,header=T)
#colnames(life_exp)
#colnames(income)
I gave up on trying to get the data for this and do a plot; something was very, very slow about loading the data. The video shows a plot of an association between income per person (explanatory variable, on the x axis) and life expectancy (response variable, on the y axis). By convention, the explanatory variable is on the x axis and the response on the y axis.
It should be kept in mind that labelling two variables as explanatory and response does not mean there is necessarily a causal connection between the two. Since the data in this case (from gapminder.com) is from an observational study, we cannot draw any causal conclusions.
The plot shows that there is a relationship that looks like a curve, rising sharply upward from x=0, and flattening out as it moves right along the x axis.
When examining the relationship between two numerical variables, look at:
The plot in the video shows three outliers with high income: Luxembourg, Macao (in China), and Qatar. Another outlier is Nepal, which has low income per person, but very high life expectancy.
One naive way to deal with outliers in data analysis is to immediately exclude them. But this can be a bad idea, because they may say something important.
data = c(rep(45,8), rep(50,12),rep(55,11),rep(60,18),rep(65,27),rep(70,52),rep(75,50),rep(80,28))
hist(data)
(The above is a rough approximation of histogram in video.)
In a histogram, data are “binned” into intervals and the heights of the bars represent the number of cases in each bin.
The life expectancy histogram is left-skewed: most countries have average life expectancies in the range of 65-80, but a few have much lower values, forming a long left tail. The income histogram is right-skewed: many countries have average incomes of only a few thousand dollars, the wealthy nations are few, and those form a long right tail.
The most common distribution worked with in this course is the normal distribution, which is unimodal and symmetric. NOTE: As we will see later, the fact that a distribution is unimodal and symmetric does not mean it is necessarily a normal distribution. There is in fact a specific formula that describes the curve of a normal distribution, but it is outside the scope of this course.
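For reference (not shown in the video), the normal density with mean \(\mu\) and standard deviation \(\sigma\) is:
\[f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]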
If the bins are too wide (a small number of bins), then the shape of the histogram can be lost. If the bins are too narrow (very large number of bins), then it can look like there are lots of peaks, though they won’t be prominent. There is no “right” number of bins, but the histogram should give a good overall view of the distribution of the data.
data = c(48,50,rep(52,5),rep(54,3),56,58,rep(60,7),rep(62,9),rep(64,2),rep(66,12),rep(68,14),rep(70,11),rep(72,23),rep(74,27),rep(76,16),rep(78,9),rep(80,3),82)
stripchart(data,method="stack",pch=21,col="blue")
(The above is a rough approximation of the plot in the video.)
boxplot(data,horizontal=T)
(Plot above is rough approximation of video.)
A boxplot gives an idea of skew. In a horizontal boxplot, a long left whisker indicates left skew, for example. If the median is in the middle of the box and the whiskers are roughly equal in length, there is little skew.
This is a map in which regions are shaded different colors to represent different values (generally, darker means a higher number).
Previously we examined the shape of a distribution in terms of:
The center of a distribution is another key feature of the distribution. It is commonly considered to be one of:
Sample statistics are point estimates of the population parameters. (So the mean of the entire population is a parameter; the mean of a sample is a point estimate.) These estimates may not be perfect, but if the sample is good (representative of the population), they are usually good guesses.
The sample mean is usually denoted as \(\bar{x}\), and the population mean is denoted as \(\mu\).
Student scores: 76,69,88,93,95,54,87,88,27
The mean is \(\frac{76+69+88+93+95+54+87+88+27}{9}\approx 75.22\).
The mode is 88 because that value occurs most often.
The median is defined as the midpoint of the data. If the scores are arranged in ascending order, this is 87, the fifth value of nine.
If the data set contains an even number of values, there is no single midpoint, so the two values closest to the midpoint are added together and divided by two.
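A quick check in R, using the scores above for the odd case and made-up values for the even case:
median(c(76,69,88,93,95,54,87,88,27)) # odd count: the middle value, 87
median(c(27, 54, 69, 76)) # even count: (54 + 69) / 2 = 61.5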
If a unimodal distribution is left-skewed, then the mean will be less than the median. The median in a unimodal distribution will always be near the peak. But in a left-skewed distribution, the average will be pulled to the left by the outliers on the left. Similarly, if a unimodal distribution is right-skewed, the average will be pulled to the right, so it will be greater than the median.
If a unimodal distribution has little skew, then the mean and median will be approximately equal.
The same basic effect is seen in a boxplot. It shows the median as a heavy line. If the left whisker (in a horizontal boxplot) is longer than the right whisker, it is safe to assume that the mean is to the left of (less than) the median.
To put this intuitively, if you have a sample of ten people, and you take their heights, and one of the ten is an NBA player who is over seven feet tall, then the median of the sample might be five and a half feet, but the average will be increased significantly by the one very tall person.
To recap:
xlim = c(-4,4)
ylim=c(0,0.5)
data <- seq(-4,4,length=100)
plot(data, dnorm(data,mean=0,sd=1),type="l",lty=1,xlim=xlim,ylim=ylim)
par(new=T)
plot(data, dnorm(data,mean=0,sd=2), type="l",lty=2,col="blue",xlim=xlim,ylim=ylim)
In the skinny solid curve, the data are clustered closer to the center, while in the dashed curve, the data are spread more widely.
Measures of spread:
Roughly the average squared deviation from the mean. The sample variance is denoted as \(s^2\) and the population variance as \(\sigma^2\).
The formula for sample variance:
\[s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}\]
A deviation from the mean is the distance of any one value in a data set from the mean for that set. So we square each deviation, sum those squares, and then divide the sum by \(n-1\), where \(n\) is the number of values. (The reason for dividing by \(n-1\) is discussed later.)
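As a sanity check, here is a minimal sketch of the formula applied to a made-up sample, compared against R's built-in var():
x <- c(4, 7, 10, 13) # made-up sample
x_bar <- mean(x)
n <- length(x)
sum((x - x_bar)^2) / (n - 1) # sample variance by the formula: 15
var(x) # same result from R's built-in function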
life_exp2 <- life_exp[,c("X2011")]
life_exp2 <- as.numeric(levels(life_exp2))[life_exp2]
life_exp2 = life_exp2[complete.cases(life_exp2)]
# some sample records
head(life_exp2)
## [1] 60.08 76.98 70.75 51.09 75.61 75.95
# number of data points
length(life_exp2)
## [1] 202
# mean of data set
mean(life_exp2)
## [1] 70.24
# variance of data set
var(life_exp2)
## [1] 85.51
The data set has 202 values. The mean of these values is 70.24. The variance is 85.51 years squared. (These differ a bit from the video, perhaps because the data are from a different year. I use R’s built-in functions to get the mean and variance.)
The unit “years squared” is not very helpful, so we can take the square root of the variance (or use R’s sd function) to get the standard deviation.
sd(life_exp2)
## [1] 9.247
The standard deviation is 9.247 years.
My personal comment: I don’t claim to have a deep understanding of this, but I believe that there is a rigorous mathematical underpinning behind the standard deviation. It wasn’t chosen arbitrarily just to keep negatives from cancelling out positives, and it’s not an arbitrary method of making larger deviations weigh more. But the mathematical theory is beyond the scope of the course. Take a glance at the formula for a normal distribution (to which the idea of a standard deviation is closely linked) to get just a glimpse of the complexity.
Roughly the average deviation around the mean, and has the same units as the data.
For a sample, this is denoted as \(s\). For a population, it is denoted \(\sigma\).
\[s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}}\]
A set of five cars, in which each is a different color, has high diversity. But consider these two sets, each of which gives a color and mileage for each car in it:
Set 1: blue (10), red (20), yellow (30), green (40), purple (50)
Set 2: blue (10), blue (10), blue (10), purple (50), purple (50)
The second set has lower diversity–it has only two different kinds of cars–but higher variability. The standard deviation of the first set is 15.81, but the standard deviation of the second set is 21.91.
sd(c(10,20,30,40,50))
## [1] 15.81
sd(c(10,10,10,50,50))
## [1] 21.91
The range of the middle 50% of the data, distance between the first quartile (25th percentile) and the third quartile (75th percentile).
\[IQR = Q3 - Q1\]
data = c(48,50,rep(52,5),rep(54,3),56,58,rep(60,7),rep(62,9),rep(64,2),rep(66,12),rep(68,14),rep(70,11),rep(72,23),rep(74,27),rep(76,16),rep(78,9),rep(80,3),82)
boxplot(data,horizontal=T)
The first quartile is roughly 65, and the third quartile is roughly 77. The IQR is therefore \(77-65=12\). This value, 12, isn’t useful by itself, but is useful in the context of the overall distribution. The IQR is usually more useful than the range, because it does not rely on extreme observations. Instead, it provides a measure of the variability for the bulk of the data around the center.
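R can compute the quartiles and IQR directly from the data vector defined above (quantile() supports several estimation methods, so the values may differ slightly from the eyeballed 65 and 77):
quantile(data, c(0.25, 0.75)) # first and third quartiles
IQR(data) # Q3 - Q1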
We define robust statistics as measures on which extreme observations have little effect.
Example: data set is {1,2,3,4,5,6}. The mean and the median are both 3.5. But if we change one of the values so that the set is {1,2,3,4,5,1000}, the median remains unchanged at 3.5, but the mean balloons to 169. The median is considered a robust statistic.
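A quick demonstration in R:
x <- c(1, 2, 3, 4, 5, 6)
c(mean(x), median(x)) # both 3.5
x[6] <- 1000 # replace one value with an extreme observation
c(mean(x), median(x)) # mean jumps to about 169; median stays 3.5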
|  | robust | non-robust |
|---|---|---|
| center | median | mean |
| spread | IQR | SD, range |
Robust statistics are most valuable with skewed distributions, or those with extreme observations. Non-robust statistics are most useful with symmetric distributions.
The natural log transformation is often applied when much of the data cluster near zero (relative to the larger values in the set) and all observations are positive.
gdp2 <- gdp[,c("X2011")]
gdp2 <- as.numeric(levels(gdp2))[gdp2]
gdp2 = gdp2[complete.cases(gdp2)]
par(mfrow=c(1,2))
hist(gdp2)
hist(log(gdp2))
(Example above actually shows GDP per capita rather than income, but suffices to demonstrate the transformation. The histogram on the right has had natural log applied to the data.)
Transformations can also be applied to one or both variables in a scatterplot to make the relationship between the variables more linear, and hence easier to model with simple methods
life_exp2 <- life_exp[,c("X2011")]
life_exp2 <- as.numeric(levels(life_exp2))[life_exp2]
gdp2 <- gdp[,c("X2011")]
gdp2 <- as.numeric(levels(gdp2))[gdp2]
data = data.frame(life_exp=life_exp2, gdp=gdp2,stringsAsFactors=F)
data = data[complete.cases(data), ]
par(mfrow=c(1,2))
plot(data$gdp, data$life_exp)
plot(log(data$gdp), data$life_exp)
The above code should show side-by-side scatterplots (GDP per capita as the explanatory variable on the x axis, life expectancy as the response variable on the y axis) in which the one on the left, with the original data, shows a curved relationship, and the one on the right, with the natural log applied to the explanatory variable, shows a much more linear relationship. Right now the data don’t fit together (differing numbers of rows in the two sets); I may fix this later so the plots work.
The video says that later in the course, we’ll get into “a little bit more detail” on choosing a transformation. I don’t recall anything like this; I think transformations are merely touched on here, and not something we use later. Important, but outside the scope of the course. “A transformation of the data can be useful, even though it complicates the interpretation of it.”
Some data from a survey on how difficult Americans find it to save money:
|  | Counts | Frequencies |
|---|---|---|
| Very | 231 | 46% |
| Somewhat | 196 | 39% |
| Not very | 58 | 12% |
| Not at all | 14 | 3% |
| Not sure | 1 | ~0% |
save_data = data.frame(counts=c(231,196,58,14,1),cat=c("Very", "Somewhat", "Not very", "Not at all", "Not sure"))
save_data = data.frame(counts=c(rep("Very", 231),rep("Somewhat",196),rep("Not very",58),rep("Not at all",14),"Not sure"))
barplot(table(save_data))
Example of students taking a test that measures both reading and writing skills. The two scores cannot be considered independent, because any one student’s abilities in reading and writing are probably linked to some degree.
When two observations have a special correspondence (they are not independent), they are said to be paired.
To analyze paired data, it is often useful to look at the difference in outcomes for each paired observation; for example, diff = read - write. It is important that we always subtract in the same order for each case (either read - write, or write - read).
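A tiny sketch with made-up scores for five students:
read <- c(57, 68, 44, 63, 47) # hypothetical reading scores
write <- c(52, 71, 50, 65, 46) # hypothetical writing scores
diff <- read - write # always subtract in the same order
diff
mean(diff)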
You can plot a histogram of the observed differences, but it won’t indicate whether the differences are statistically significant; i.e., it won’t allow you to reject a null hypothesis that states that there is no difference.
\(H_0: \mu_{diff} = 0\)
\(H_A: \mu_{diff} \ne 0\)
Now we have just a single variable, the differences between the paired data, and we proceed as before. All of the conditions are the same as before.
Test statistic is
\[Z = \frac{\bar{x}_{diff}-\mu_{diff}}{SE_{\bar{x}_{diff}}}\]
\(\bar{x}_{diff} \sim N(\text{mean}=0, SE=0.628)\)
x_bar_diff = -0.545
s_diff = 8.887 # sample std. dev. of the differences
n_diff = 200
se_diff = s_diff / sqrt(n_diff) # 0.628
nullval = 0
Z = abs(x_bar_diff - nullval) /se_diff
Z
## [1] 0.8673
p_value = (1 - pnorm(Z)) * 2 # twosided test
p_value
## [1] 0.3858
With such a high p-value, we do not reject the null hypothesis.
Like the hypothesis test, the confidence interval for two paired values in a sample is just like a single sample, except the sample statistic is the difference between the two values. Confidence interval is still point estimate \(\pm\) margin of error, or \(\bar{x}_{diff}\pm z^*SE_{\bar{x}_{diff}}\).
x_bar_diff = -0.545
s_diff = 8.887
n_diff = 200
se_diff = s_diff / sqrt(n_diff)
se_diff
## [1] 0.6284
conf_level = 0.95
alpha = 1 - conf_level
z_star = qnorm(conf_level + (alpha/2))
z_star
## [1] 1.96
ci = c(x_bar_diff - z_star * se_diff, x_bar_diff + z_star * se_diff)
ci
## [1] -1.7767 0.6867
We are 95% confident that the average reading score is between 1.78 points lower and 0.69 points higher than the average writing score.
Example is work hours and educational attainment from GSS (2010).
Educational attainment is ordinal: less than high school, high school, junior college, bachelor’s, graduate. Hours worked is numerical.
Best way to visualize a categorical versus numerical variable is side-by-side box plots.
Example then shifts to just “college” or “no college”. The parameter of interest is average difference in number of hours worked among all Americans with a college degree, and those without a degree. \(\mu_{coll} - \mu_{no coll}\)
The point estimate would be average difference in hours worked per week for sampled Americans, \(\bar{x}_{coll} - \bar{x}_{no coll}\).
As always, the formula is point estimate \(\pm\) standard error. In this case, it is \((\bar{x}_1 - \bar{x}_2) \pm z^* SE_{\bar{x}_1 - \bar{x}_2}\).
The standard error of the difference of two independent means is:
\[SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}\]
Conditions
conf_level = 0.95
alpha = 1 - conf_level
x_bar_1 = 41.8
n_1 = 505
s_1 = 15.14
x_bar_2 = 39.4
n_2 = 667
s_2 = 15.12
x_bar_diff = x_bar_1 - x_bar_2
se_diff = sqrt((s_1^2/n_1) + (s_2^2/n_2))
se_diff
## [1] 0.8926
z_star = qnorm(conf_level + (alpha/2))
z_star
## [1] 1.96
ci = c(x_bar_diff - z_star * se_diff, x_bar_diff + z_star * se_diff)
ci
## [1] 0.6506 4.1494
(Carrying over example and calculations from previous.)
\(H_0: \mu_1 - \mu_2 = 0\)
\(H_A: \mu_1 - \mu_2 \ne 0\)
\((\bar{x}_{coll} - \bar{x}_{no\ coll}) \sim N(\text{mean} = 0, SE = 0.89)\)
nullval = 0
Z = abs(x_bar_diff-nullval) / se_diff
Z
## [1] 2.689
p_value = (1 - pnorm(Z)) * 2 # twosided test
p_value
## [1] 0.007168
Reject the null hypothesis, there is a difference in hours worked. We already knew this from the 95% confidence interval, in which 0 (no difference) did not appear.
Interpretation: Given that there is no difference between the number of hours worked between those with and without a college degree, there is a 0.7% chance that random samples of 505 college and 667 non-college graduates will show an average difference of hours worked of 2.4 hours or more.
This is not explicitly covered in the course, but was discussed with the instructor in a forum. Think of the two samples as coming from distinct populations. Therefore, it is desirable to have standard deviations for both populations to calculate the standard error. If they are not available, fall back to the standard deviations for the samples. But don’t think of the samples as coming from a single population; there is no such thing as a single population standard deviation for both.
What if you have two samples of different sizes from a single population? In that case, it would make no sense to conduct a hypothesis test of their difference. The point of comparing two unpaired samples is to determine whether there is a difference in their respective population parameters. If they come from the same population, we already know the answer to that is no.
Example: Twenty 1+ bedroom apartments were randomly selected on raleigh.craigslist.org. The median is a better measure of center here because the rents are right-skewed.
There is no Central Limit Theorem for medians, so we cannot use methods learned so far to construct confidence interval. An alternative approach to constructing confidence intervals is bootstrapping. The technique involves taking many samples, with replacement, from the original sample, and then building a distribution based on those samples.
Example: There is a sample of 20 apartments. So we take many samples, each of size 20, using replacement, from this sample, and we calculate a sample statistic (mean, median, proportion, etc.) for each of those samples.
I don’t know how to construct a dotplot with 500 elements and get them all to show as in the video, so I show a histogram instead.
set.seed(3)
rents <- c(775,625,733,929,895,749,1020,1349,599,1143,1209,1495,879,975,1076,1282,665,705,799,500)
num_samples = 500
rent_medians <- rep(0, num_samples)
for(i in 1:num_samples) {
samp <- sample(rents, length(rents), replace=TRUE)
rent_medians[i] <- median(samp)
}
hist(rent_medians,breaks=10)
We can use 95% as a confidence level, and thus determine the samples that comprise 95% of all samples that are “in the middle”. Or conversely, find 2.5% of the samples on the left, and 2.5% on the right.
The key point here is that you have a dotplot, so if you need to determine which points fall outside the 95% interval, multiply the total number of samples by 0.95; that is the number of cases within the 95% confidence interval. Subtract this number from the total number of samples, and you have the number of cases outside the 95% confidence interval. These will be half on one side, and half on the other.
Example with 90% confidence level: 500 samples * 0.9 = 450. 500 - 450 = 50, so there would be 50/2=25 samples on either side of the confidence interval.
conf_level = 0.9
num_in_interval = conf_level * num_samples
num_in_tail = (num_samples - num_in_interval) / 2
medians_sorted = rent_medians[order(rent_medians)]
left_tail_val = medians_sorted[num_in_tail]
right_tail_val = medians_sorted[num_samples - num_in_tail + 1]
c(left_tail_val, right_tail_val)
## [1] 749 1020
The numbers above do not exactly match the video, they are from my own bootstrap simulation.
First determine the mean of the bootstrap distribution, \(\bar{x}_{boot}\), and calculate the confidence interval as that mean plus/minus \(z^*\) (1.645 for the 90% level used below) times the bootstrap standard error, \(SE_{boot}\), which is the standard deviation of the observations in the bootstrap distribution.
se_boot = sd(rent_medians) # the bootstrap SE is the std. dev. of the bootstrap distribution of medians
rent_mean = mean(rent_medians)
conf_level = 0.9
alpha = 1 - conf_level
z_star = qnorm(conf_level + (alpha/2))
ci = c(rent_mean - z_star * se_boot, rent_mean + z_star * se_boot)
ci
## [1] 742.4 1014.1
Note that the numbers above are NOT the numbers in the video, they are based on my own bootstrap. They are however pretty close to the video’s numbers.
You can see that the numbers from the two estimates (percentile method and standard error method) do not match each other exactly, but they are close.
The t distribution is useful when sample size is less than 30.
As long as observations are independent and the population distribution is not extremely skewed, a large sample would ensure that…
For example, if sample size is 10, the sample might show considerable skew, but the sample is too small to be sure. If sample size is 1000, the histogram gives us much more confidence about what the population distribution looks like.
So it is important to be careful with normality condition for small samples; don’t just examine the sample, but also think about where the data come from. Would I expect this distribution to be symmetric, and am I confident that outliers are rare?
When n is small and \(\sigma\) is unknown, use the t distribution to address the uncertainty of the standard error estimate. The t distribution is unimodal and symmetric, but its tails are somewhat thicker. The exact shape of the distribution depends on the degrees of freedom; as the degrees of freedom increase, the distribution looks more and more like the normal distribution. The plot below shows the normal distribution as a solid line.
normdata <- seq(-4, 4, length=100)
hnormdata <- dnorm(normdata)
plot(normdata, hnormdata, type="l", lty=1, xlab="x value", ylab="Density", main="Normal Distribution",xlim=c(-4,4),ylim=c(0,0.45))
abline(h=0)
tdata <- dt(normdata, df=2)
par(new=TRUE)
plot(normdata, tdata, type="l", lty=2,xlim=c(-4,4),ylim=c(0,0.45),xlab="",ylab="")
tdata <- dt(normdata, df=6)
par(new=TRUE)
plot(normdata, tdata, type="l", lty=2,xlim=c(-4,4),ylim=c(0,0.45),xlab="",ylab="")
tdata <- dt(normdata, df=20)
par(new=TRUE)
plot(normdata, tdata, type="l", lty=2,xlim=c(-4,4),ylim=c(0,0.45),xlab="",ylab="")
The thicker tails mean that observations are more likely to fall beyond 2 standard deviations from the mean. This is helpful for mitigating the effect of a less reliable estimate for the standard error of the sampling distribution. In a nutshell, for the same confidence level, a confidence interval based on the t distribution will be wider than one based on the normal distribution.
Remember, the normal distribution has two parameters, mean and standard deviation.
We can see below that, when degrees of freedom is 50, the t distribution is very similar to the normal distribution. When degrees of freedom is 10, there is a much larger difference.
# P(|Z| > 2)
(1 - pnorm(2)) * 2
## [1] 0.0455
# P(|t[df=50]| > 2)
(1 - pt(2, df=50)) * 2
## [1] 0.05095
# P(|t[df=10]| > 2)
(1 - pt(2, df=10)) * 2
## [1] 0.07339
So in the above results, if the significance level is 5%, we can reject the null hypothesis only for the scenario using the normal distribution. The t distribution makes it less likely to reject the null hypothesis, and the smaller the degrees of freedom, the harder it gets.
It’s actually the Student’s t distribution. “Student” was a pseudonym used by William Gosset at the Guinness brewing company. He worked with small samples/batches, and thus developed the t distribution. He published under the pseudonym to avoid exposing trade secrets.
The example here is a study on playing video game while eating, which might lead to forgetting what you ate, and this can contribute to increased snacking.
| biscuit intake | \(\bar{x}\) | \(s\) | \(n\) |
|---|---|---|---|
| solitaire | 52.1g | 45.1g | 22 |
| no distraction | 27.1g | 26.4g | 22 |
conf_level = 0.95
alpha = 1 - conf_level
x_bar_solitaire = 52.1 # biscuit intake
s_solitaire = 45.1
n_solitaire = 22
se_solitaire = s_solitaire / sqrt(n_solitaire)
x_bar_no_dist = 27.1 # biscuit intake
s_no_dist = 26.4
n_no_dist = 22
se_no_dist = s_no_dist / sqrt(n_no_dist)
df = n_solitaire - 1
t_star = qt(conf_level + (alpha/2), df=df)
t_star
## [1] 2.08
ci_solitaire = c(x_bar_solitaire - t_star * se_solitaire, x_bar_solitaire + t_star * se_solitaire)
ci_solitaire
## [1] 32.1 72.1
ci_no_dist = c(x_bar_no_dist - t_star * se_no_dist, x_bar_no_dist + t_star * se_no_dist)
ci_no_dist
## [1] 15.39 38.81
Suppose the suggested serving size of these biscuits is 30g. Do these data provide convincing evidence that the amount of snacks consumed by distracted eaters post-lunch is different than the suggested serving size?
\(H_0: \mu = 30\)
\(H_A: \mu \ne 30\)
nullval_solitaire = 30
T = (x_bar_solitaire - nullval_solitaire) / se_solitaire
T
## [1] 2.298
p_value = (1 - pt(T, df=df)) * 2 # twosided test; df = n - 1 = 21
p_value
## [1] 0.03194
So reject the null hypothesis; the average amount consumed by distracted eaters differs from the suggested serving size.
Nutshell: This is the same as inference for comparing means from two large samples, except you use the t distribution
Same study as last video, but this time we consider means both for those who played solitaire and those who had no distractions.
Confidence interval: point estimate \(\pm\) margin of error or \((\bar{x}_1 - \bar{x}_2) \pm t^*_{df}SE_{(\bar{x}_1 - \bar{x}_2)}\).
Hypothesis test: \(T_{df} = \frac{obs - null}{SE} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{SE_{\bar{x}_1 - \bar{x}_2}}\), where \(SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}\).
Degrees of freedom for t statistic for inference on difference of two means:
\[df = min(n_1 - 1, n_2 - 1)\]
(Note that the above is a conservative estimate of the actual number, which is complex to derive.)
For hypothesis test:
\(H_0: \mu_{wd} - \mu_{wod} = 0\)
\(H_A: \mu_{wd} - \mu_{wod} \ne 0\)
Where “wd” means “with distraction” and “wod” means “without distraction”.
x_bar_diff = x_bar_solitaire - x_bar_no_dist
nullval = 0
se_diff = sqrt(s_solitaire^2/n_solitaire + s_no_dist^2/n_no_dist)
se_diff
## [1] 11.14
df = min(n_solitaire - 1, n_no_dist - 1)
Z = abs(x_bar_diff - nullval)/se_diff
Z
## [1] 2.244
p_value = (1 - pt(Z, df=df)) * 2
p_value
## [1] 0.03575
Reject the null hypothesis, there is a difference between the two groups.
Confidence interval for above:
conf_level = 0.95
alpha = 1 - conf_level
t_star = qt(conf_level + (alpha/2), df=df)
ci = c(x_bar_diff - t_star * se_diff, x_bar_diff + t_star * se_diff)
ci
## [1] 1.83 48.17
Survey that asked respondents to choose correct definitions of 10 words, and asked them to identify their social class.
# load data
load(url("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/gss.Rdata"))
print(levels(gss$class))
## [1] "LOWER" "MIDDLE" "UPPER" "WORKING"
hist(gss$wordsum)
plot(gss$class)
boxplot(gss$wordsum ~ gss$class)
gss.means = as.vector(by(gss$wordsum, gss$class, mean))
gss.counts = as.vector(by(gss$wordsum, gss$class, length))
gss.sds = as.vector(by(gss$wordsum, gss$class, sd))
gss2 <- data.frame(class=levels(gss$class), n=gss.counts, mean=gss.means, sd=gss.sds)
gss2
## class n mean sd
## 1 LOWER 41 5.073 2.240
## 2 MIDDLE 331 6.761 1.886
## 3 UPPER 16 6.188 2.344
## 4 WORKING 407 5.749 1.865
|  | n | mean | sd |
|---|---|---|---|
| lower | 41 | 5.07 | 2.24 |
| working | 407 | 5.75 | 1.87 |
| middle | 331 | 6.76 | 1.89 |
| upper | 16 | 6.19 | 2.34 |
| total | 795 | 6.14 | 1.98 |
We need to compare more than two statistics, so methods learned so far won’t do the job.
\(H_0:\) The mean outcome is the same across all categories: \(\mu_1 = \mu_2 = \dots = \mu_k\).
\(H_A:\) The mean outcome is different for at least two of the categories
where \(\mu_i\) is the mean of the outcome for observations in category \(i\), and \(k\) is the number of groups.
ANOVA does not tell us which of the means are different from each other, or whether it is 2, or 3, or more means that differ, only that there are at least two means that differ.
The test statistic is F, which is:
\[F = \frac{\text{variability between groups}}{\text{variability within groups}}\]
fdata = rf(5000, df1=10, df2=10)
hist(fdata, breaks=100,main="F dist. (df=10)")
The data is again the GSS data on class and vocabulary score.
\(H_0:\) The mean outcome is the same across all categories: \(\mu_1 = \mu_2 = \dots = \mu_k\).
\(H_A:\) The mean outcome is different for at least one pair of means
Total variability in vocabulary scores:
|  |  | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Group | class | 3 | 236.56 | 78.855 | 21.735 | <0.0001 |
| Error | Residuals | 791 | 2869.80 | 3.628 |  |  |
| Total |  | 794 | 3106.36 |  |  |  |
(Note that R’s ANOVA output does not show the Total line!)
\[SST = \sum\limits^n_{i=1}(y_i-\bar{y})^2\]
Where \(y_i\) is the value of the response variable for each observation and \(\bar{y}\) is the grand mean of the response variable.
To compute SST:
sst <- function(y) {
y.mean = mean(y)
return(sum(sapply(y, function(y.i){(y.i - y.mean)^2})))
}
sst(gss$wordsum)
## [1] 3106
\[SSG = \sum\limits^k_{j=1} n_j(\bar{y}_j - \bar{y})^2\]
where \(k\) is the number of groups, \(n_j\) is the number of observations in group \(j\), \(\bar{y}_j\) is the mean of the response variable for group \(j\), and \(\bar{y}\) is the grand mean of the response variable.
To compute SSG:
# this can no doubt be optimized--get rid of loop
ssg <- function(data, y, j) {
total = 0
mean.grand = mean(y) # mean for all values
for(grp in unique(j)) {
grp.data = y[j==grp] # vector of the y column of data
n.grp = length(grp.data)
mean.group = mean(grp.data) # mean for current group
total = total + n.grp * (mean.group - mean.grand)^2
}
return(total)
}
ssg(gss, gss$wordsum, gss$class)
## [1] 236.6
\[SSE = SST - SSG\]
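Using the two functions defined above, SSE should match the Residuals sum of squares in the ANOVA table (about 2869.8):
sse = sst(gss$wordsum) - ssg(gss, gss$wordsum, gss$class)
sse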
Now we need a way to get from these measures of total variability to average variability (the “Mean Sq” column in the table above). We do this by scaling each sum of squares by a measure that incorporates the sample size and the number of groups: the degrees of freedom.
Three degrees of freedom associated with ANOVA:
Average variability between and within groups, calculated as the total variability (sum of squares) scaled by the associated degree of freedom
Here is the table again:
|  |  | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Group | class | 3 | 236.56 | 78.855 | 21.735 | <0.0001 |
| Error | Residuals | 791 | 2869.80 | 3.628 |  |  |
| Total |  | 794 | 3106.36 |  |  |  |
So MSG is 236.56/3=78.855, and MSE is 2869.8/791=3.628.
Ratio of the between group and within group variability:
\[F = \frac{MSG}{MSE}\]
Here, this is 78.855/3.628=21.735.
Now at last we can calculate the p-value, based on the F statistic.
F = 21.735
df1 = 4-1
df2 = 795 - df1 - 1
p_value = pf(21.735, df1=df1, df2=df2, lower.tail=FALSE)
p_value
## [1] 1.56e-13
Reject the null hypothesis; there is a difference between at least one pair of groups.
Here is the same thing, using R’s anova function. It summarizes a single model (or it can compare two or more models).
gss.lm <- lm(gss$wordsum ~ gss$class)
gss.anova = anova(gss.lm)
gss.anova
## Analysis of Variance Table
##
## Response: gss$wordsum
## Df Sum Sq Mean Sq F value Pr(>F)
## gss$class 3 237 78.9 21.7 1.6e-13 ***
## Residuals 791 2870 3.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(gss.anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## Min. : 3 Min. : 237 Min. : 3.63 Min. :21.7 Min. :0
## 1st Qu.:200 1st Qu.: 895 1st Qu.:22.43 1st Qu.:21.7 1st Qu.:0
## Median :397 Median :1553 Median :41.24 Median :21.7 Median :0
## Mean :397 Mean :1553 Mean :41.24 Mean :21.7 Mean :0
## 3rd Qu.:594 3rd Qu.:2212 3rd Qu.:60.05 3rd Qu.:21.7 3rd Qu.:0
## Max. :791 Max. :2870 Max. :78.85 Max. :21.7 Max. :0
## NA's :1 NA's :1
This is the output from the course’s inference function.
source("http://bit.ly/dasi_inference")
# This will do ANOVA analysis because the explanatory variable is categorical and has more than 2 levels
# alternative is "greater" because an F test is always one-sided
inference(y=gss$wordsum, x=gss$class, type="ht", est="mean", method="theoretical", alternative="greater", eda_plot=F, inf_plot=F)
## Response variable: numerical, Explanatory variable: categorical
## ANOVA
## Summary statistics:
## n_LOWER = 41, mean_LOWER = 5.073, sd_LOWER = 2.24
## n_MIDDLE = 331, mean_MIDDLE = 6.761, sd_MIDDLE = 1.886
## n_UPPER = 16, mean_UPPER = 6.188, sd_UPPER = 2.344
## n_WORKING = 407, mean_WORKING = 5.749, sd_WORKING = 1.865
## H_0: All means are equal.
## H_A: At least one mean is different.
## Analysis of Variance Table
##
## Response: y
## Warning: option "show.signif.stars" is invalid: assuming TRUE
## Df Sum Sq Mean Sq F value Pr(>F)
## x 3 237 78.9 21.7 1.6e-13 ***
## Residuals 791 2870 3.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Pairwise tests: t tests with pooled SD
## LOWER MIDDLE UPPER
## MIDDLE 0.0000 NA NA
## UPPER 0.0475 0.2396 NA
## WORKING 0.0306 0.0000 0.3671
Now, here is the output of R’s aov function. It fits a single model.
gss.aov = aov(gss.lm)
summary(gss.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## gss$class 3 237 78.9 21.7 1.6e-13 ***
## Residuals 791 2870 3.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
gss.aov
## Call:
## aov(formula = gss.lm)
##
## Terms:
## gss$class Residuals
## Sum of Squares 236.6 2869.8
## Deg. of Freedom 3 791
##
## Residual standard error: 1.905
## Estimated effects may be unbalanced
Sample observations must be independent of each other
So far we’ve determined that at least two groups differ from each other, but not which. Now we conduct multiple tests to determine which groups differ.
### Which means differ
Conduct two-sample t tests for each possible pair of groups. But because there are many tests, the rate of Type 1 errors increases. To compensate, use a modified significance level (the Bonferroni correction):
\[\alpha^* = \alpha/K\]
where \(\alpha\) is the starting significance level and \(K = \frac{k(k-1)}{2}\) is the number of pairwise comparisons (\(k\) is the number of groups)
Example
alpha = 0.05
k = 4
K = (k*(k-1))/2
alpha.mod = alpha/K
alpha.mod
## [1] 0.008333
Standard error for pairwise comparison:
\[SE = \sqrt{\frac{MSE}{n_1} + \frac{MSE}{n_2}}\]
Recall that MSE was 3.628 in the table above.
Degrees of freedom for pairwise comparisons:
\[df = df_E\]
Recall that \(df_E\) was 791 in the table above.
Example comparing lower and middle class:
df_E = 791
k = 4
K = (4*(4-1))/2
mean.lower=5.07
mean.middle=6.76
mean.diff = mean.middle - mean.lower
n.lower=41
n.middle=331
MSE = 3.628
SE.pairwise = sqrt(MSE/n.lower + MSE/n.middle)
nullval = 0
T = abs(mean.diff - nullval)/SE.pairwise
T
## [1] 5.359
p_value = pt(T, df=df_E, lower.tail=FALSE) * 2 # twosided test
p_value
## [1] 1.098e-07
p_value < alpha.mod # TRUE means reject null hypothesis
## [1] TRUE
Similar comparisons could now be made between the other groups. I want to code a function that would take in a data frame and do all of this automatically.
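Base R appears to provide something close to this already: pairwise.t.test() runs pooled-SD pairwise t tests (like the inference output above) and adjusts the p-values for multiple comparisons, which amounts to the same thing as comparing unadjusted p-values against the modified significance level:
# Bonferroni-adjusted pairwise comparisons of wordsum across class
pairwise.t.test(gss$wordsum, gss$class, p.adjust.method = "bonferroni")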
Focus shifts from numeric to categorical variables.
This video has a lousy example on proportions of smokers by country, worldwide. I don’t like it because there are obvious cultural differences that can result in very large differences in proportions between countries, so any estimated proportion based on the sample proportions of the various countries is invalid, as far as I’m concerned.
But the idea is that you take the mean of \(\hat{p}\), which apparently means add all of the \(\hat{p}\) values for the various countries together, and then divide by the number of countries (population size differences unaccounted for!) and this should approximate the population \(p\) parameter.
The distribution of sample proportions is nearly normal, centered at the population proportion, and with a standard error inversely proportional to the sample size.
\[\hat{p} \sim N \left(\text{mean} = p,\ SE=\sqrt{\frac{p(1-p)}{n}}\right)\]
Conditions
If \(p\) is unknown, use \(\hat{p}\) when checking the success/failure condition.
Example: 90% of all plant species are angiosperms. If you randomly sample 200 plants, what is probability that at least 95% will be angiosperms?
p = 0.9
nullval = 0.95
n = 200
SE = sqrt((p * (1-p))/n)
SE
## [1] 0.02121
Z = abs(p - nullval)/SE
Z
## [1] 2.357
p_value = 1 - pnorm(Z)
p_value
## [1] 0.009211
This can also be done with binomial distribution, although answer will be slightly different.
sum(dbinom(190:200, size=200, prob=0.9))
## [1] 0.008071
NOTE: I was really confused by this for a while. Later on, we’re told that, in a hypothesis test based on a proportion, we use the null value to calculate the standard error, but that doesn’t happen above. Why not? Because we have the population parameter, p, rather than a statistic, \(\hat{p}\). I think!
In a survey, 99 of 670 respondents showed bad intuition about experiment design, and 571 showed good intuition.
The parameter of interest is \(p\). The point estimate is \(\hat{p}\), the sample proportion. This is \(571/670\approx0.85\).
Confidence interval is:
\[\hat{p} \pm z^*SE_{\hat{p}}\]
The standard error for proportion:
\[SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]
Note that, for a confidence interval, we use \(\hat{p}\) to calculate standard error.
n = 670
success = 571
conf_level = 0.95
alpha = 1 - conf_level
p_hat = success / n
SE = sqrt((p_hat*(1-p_hat))/n)
Z_star = qnorm(conf_level + alpha/2)
ci = c(p_hat - Z_star * SE, p_hat + Z_star * SE)
ci
## [1] 0.8254 0.8791
In the example above, the margin of error was about 2.7%. To determine sample size needed for a confidence interval at a given margin of error, e.g., 1%:
\[n = (z^*/ME)^2 \times (p(1-p))\]
p_hat = 0.85
ME = 0.01
conf_level = 0.95
alpha = 1 - conf_level
Z_star = qnorm(conf_level + alpha/2)
n = (Z_star/ME)^2 * (p_hat * (1 - p_hat))
ceiling(n)
## [1] 4898
The answer given here is 4898 on the dot. The video shows 4894.4. This is presumably due to rounding somewhere along the way. I don’t know how to deal with this on the exam; use 1.96 hard-coded, I guess.
\[SE = \sqrt{\frac{p(1-p)}{n}}\]
\[Z = \frac{\hat{p} - p}{SE}\]
Note that “p” here is the null value, because the null hypothesis assumes this is the population parameter.
A poll found that 60% of 1,983 randomly sampled American adults believe in evolution. Does this provide convincing evidence that a majority believe?
\(H_0: p = 0.5\)
\(H_A: p > 0.5\)
Conditions
conf_level = 0.95
alpha = 1 - conf_level
nullval = 0.5
p_hat = 0.6
n = 1983
SE = sqrt((nullval * (1-nullval))/n)
Z_score = abs(p_hat - nullval) / SE
Z_score
## [1] 8.906
p_value = 1 - pnorm(Z_score)
p_value
## [1] 0
Reject the null hypothesis.
“Calculating a confidence interval between the two population proportions that are unknown using data from our samples.” Two categorical variables, each of which can be “success” or “failure.”
Gallup poll on handgun ban
|  | succ. | n | \(\hat{p}\) |
|---|---|---|---|
| US | 257 | 1025 | 0.25 |
| Coursera | 59 | 83 | 0.71 |
\(H_0: p_{Coursera} - p_{US} = 0\)
\(H_A: p_{Coursera} - p_{US} \ne 0\)
Conditions:
For a confidence interval, we use the two observed proportions.
conf_level = 0.95
alpha = 1 - conf_level
success_coursera = 59
n_coursera = 83
success_us = 257
n_us = 1025
p_hat_coursera = success_coursera / n_coursera
p_hat_us = success_us / n_us
p_hat_coursera
## [1] 0.7108
p_hat_us
## [1] 0.2507
point_est = p_hat_coursera - p_hat_us
point_est
## [1] 0.4601
z_star = qnorm(conf_level + alpha/2)
z_star
## [1] 1.96
SE = sqrt((p_hat_coursera * (1 - p_hat_coursera))/n_coursera + (p_hat_us * (1 - p_hat_us))/n_us)
SE
## [1] 0.05157
ci = c(point_est - z_star * SE, point_est + z_star * SE)
ci
## [1] 0.3590 0.5612
In the hypothesis test, the null hypothesis assumes that the two sample proportions are equal. This means that we use a pooled proportion to check the success-failure condition, and to calculate standard error.
|  | Male | Female |
|---|---|---|
| Yes | 34 | 61 |
| No | 52 | 61 |
| Not sure | 4 | 0 |
| Total | 90 | 122 |
| \(\hat{p}\) | 0.38 | 0.5 |
\(H_0: p_{male} - p_{female} = 0\)
\(H_A: p_{male} - p_{female} \ne 0\)
p_success_m = 34
p_success_f = 61
n_m = 90
n_f = 122
p_hat_m = p_success_m / n_m
p_hat_f = p_success_f / n_f
p_null = 0
p_pool = (p_success_m + p_success_f)/(n_m + n_f)
SE = sqrt( (p_pool * (1-p_pool))/n_m + (p_pool * (1-p_pool))/n_f )
SE
## [1] 0.0691
point_est = p_hat_m - p_hat_f
Z = abs(point_est - p_null) / SE
Z
## [1] 1.769
p_value = 2 * (1 - pnorm(Z))
p_value
## [1] 0.07694
Do not reject null hypothesis.
Use simulation when the success-failure condition is not met for a proportion.
source("http://bit.ly/dasi_inference")
paul = factor(c(rep("yes", 8), rep("no", 0)), levels=c("yes", "no"))
paul
## [1] yes yes yes yes yes yes yes yes
## Levels: yes no
inference(paul, est="proportion", type="ht", method="simulation", success="yes", null=0.5, alternative="greater")
## Single proportion -- success: yes
## Summary statistics:
## p_hat = 1 ; n = 8
## H0: p = 0.5
## HA: p > 0.5
## p-value = 0.0038
“Know the back of your hand”
back = factor(c(rep("correct", 11), rep("incorrect",1)))
inference(back, est="proportion", type="ht", method="simulation", success="correct", null=0.5, alternative="greater",nsim=100)
## Single proportion -- success: correct
## Summary statistics:
## p_hat = 0.9167 ; n = 12
## H0: p = 0.5
## HA: p > 0.5
## p-value = 0
Extension of example from 3-2.
|  | back | palm | total |
|---|---|---|---|
| Correct | 11 | 7 | 18 |
| Incorrect | 1 | 5 | 6 |
| Total | 12 | 12 | 24 |
| \(\hat{p}\) | 0.9167 | 0.5833 | 0.75 |
\(H_0: p_{back} - p_{palm} = 0\)
\(H_A: p_{back} - p_{palm} \ne 0\)
Conditions:
Simulation scheme
hand = factor(c(rep("correct", 7), rep("wrong", 5), c(rep("correct", 11), rep("wrong", 1))))
grp = c(rep("palm",12),rep("back",12))
inference(hand, grp, est = "proportion", type = "ht", null = 0, alternative = "twosided", order = c("back","palm"), success = "correct", method = "simulation", seed = 842, eda_plot = FALSE)
## Response variable: categorical, Explanatory variable: categorical
## Difference between two proportions -- success: correct
## Summary statistics:
## x
## y back palm Sum
## correct 11 7 18
## wrong 1 5 6
## Sum 12 12 24
## Observed difference between proportions (back-palm) = 0.3333
## H0: p_back - p_palm = 0
## HA: p_back - p_palm != 0
## p-value = 0.1566
There is not enough evidence to reject the null hypothesis. We do not conclude that there is a difference between the ability to recognize back versus palm.
GOF is “goodness of fit”. Use it when you have one categorical variable with two or more levels. It tells you whether the proportions of the levels match the proportions in the population. So you need to know the proportions in the population for those levels.
Example: Racial distribution in jury selection
\(H_0:\) People selected for jury duty are a simple random sample from the population of potential jurors. The observed counts of jurors from various race/ethnicities follow the same ethnicity pattern distribution in the population.
\(H_A:\) People selected for jury duty are not a simple random sample from the population of potential jurors. The observed counts of jurors from various ethnicities do not follow the same distribution in the population.
The table shows availability for jury duty by race, based on census data.
| ethnicity | white | black | nat. amer. | asian & PI | other | total |
|---|---|---|---|---|---|---|
| % in pop. | 80.29 | 12.06 | 0.79 | 2.92 | 3.94 | 100% |
| expected # | 2007 | 302 | 20 | 73 | 98 | 2500 |
Distribution of 2500 people selected for jury duty.
| ethnicity | white | black | nat. amer. | asian & PI | other |
|---|---|---|---|---|---|
| observed # | 1920 | 347 | 19 | 84 | 130 |
| % | 76.8 | 13.88 | 0.76 | 3.36 | 5.2 |
Conditions for chi-square test
Known as \(X^2\):
\[X^2 = \sum\limits_{i=1}^k \frac{(O-E)^2}{E}\]
Where O is the observed count, E is the expected count, and k is the number of cells. In this example, k is 5 (white, black, nat. amer., asian & PI, other). We square the differences \((O-E)\) to:
The chi-square distribution has one parameter, degrees of freedom. For the chi-square GOF test, \(df = k-1\) where k is the number of cells. So our example has df=4.
The following shows chi-square distributions for 2 degrees of freedom (green), 4 (red) and 9 (blue).
x <- rchisq(100, 5)
hist(x, prob=TRUE, ylim=c(0,0.6), xlim=c(-0.1, 20))
curve( dchisq(x, df=2), col='green', add=TRUE)
curve( dchisq(x, df=4), col='red', add=TRUE )
curve( dchisq(x, df=9), col='blue', add=TRUE)
\(H_0: p_{observed}-p_{expected} = 0\)
\(H_A: p_{observed}-p_{expected} \ne 0\)
k = 5
df = k - 1
expected = c(2007, 302, 20, 73, 98)
observed = c(1920, 347, 19, 84, 130)
x_squared = sum((observed-expected)^2/expected)
x_squared
## [1] 22.63
p_value = pchisq(x_squared, df, lower.tail=FALSE)
p_value
## [1] 0.0001499
Reject the null hypothesis.
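As a check, R’s built-in chisq.test() can run the same goodness-of-fit test. A sketch; its statistic differs slightly from the hand calculation above because it uses unrounded expected counts (e.g., 2007.25 rather than 2007):
observed = c(1920, 347, 19, 84, 130)
pop_props = c(0.8029, 0.1206, 0.0079, 0.0292, 0.0394) # census proportions; must sum to 1
chisq.test(x = observed, p = pop_props)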
This applies when there are two categorical variables, at least one of which has more than two levels. It tells you whether there is a relationship between these two variables.
Example: Obesity and marital status
\(H_0:\) Weight and relationship status are independent. Obesity rates do not vary by relationship status. \(H_A:\) Weight and relationship status are dependent. Obesity rates do vary by relationship status.
|  | dating | cohabiting | married | total |
|---|---|---|---|---|
| obese | 81 | 103 | 147 | 331 |
| not obese | 359 | 326 | 277 | 962 |
| total | 440 | 429 | 424 | 1293 |
The first variable is obesity, which has only two levels, obese or not obese. The second variable is status, which has three levels, dating, cohabiting and married. The question: Does there appear to be a relationship between weight and marital status?
\(H_0:\) Weight and relationship status are independent.
\(H_A:\) Weight and relationship status are not independent.
This hypothesis test is an independence test, because we’re evaluating the relationship between two categorical variables.
Chi-square statistic determined same way as before:
\[X^2 = \sum\limits_{i=1}^k \frac{(O-E)^2}{E}\]
Degrees of freedom: \(df = (R-1)\times(C-1)\), where R is the number of rows and C is the number of columns. Here, there are two rows and three columns, so df is (2-1)(3-1)=2.
Conditions are same as GOF test.
obese = c(81,103,147)
not_obese = c(359,326,277)
obesity_rate = sum(obese)/(sum(obese) + sum(not_obese))
obesity_rate
## [1] 0.256
Now we can add expected numbers of obese people to each level.
|  | dating | cohabiting | married | total |
|---|---|---|---|---|
| obese | 81 | 103 | 147 | 331 |
| not obese | 359 | 326 | 277 | 962 |
| total | 440 | 429 | 424 | 1293 |
| exp. obese | 113 | 110 | 108 | |
| exp. not obese | 327 | 319 | 316 |
Expected counts in two-way tables:
\[\text{expected count} = \frac{(\text{row total})\times(\text{column total})}{\text{table total}}\]
So for example, the row total for obese is 331, the column total for dating is 440, and the table total is 1293. So the expected count for obese in a relationship is (331*440)/1293, which when rounded is 113.
expected_obese = c(113,110,108)
observed_obese = c(81,103,147)
expected_not = c(327,319,316)
observed_not = c(359,326,277)
r = 2
c = 3
x_squared = sum((observed_obese-expected_obese)^2/expected_obese) + sum((observed_not - expected_not)^2/expected_not)
x_squared
## [1] 31.69
df = (r-1)*(c-1)
p_value = pchisq(x_squared, df, lower.tail=FALSE)
p_value
## [1] 1.315e-07
p-value is very small, so reject the null hypothesis.
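The same test can be run with R’s built-in chisq.test() on the two-way table of counts. A sketch; the result differs slightly from the hand calculation because R uses unrounded expected counts (no continuity correction is applied, since that only affects 2x2 tables):
obesity_table = matrix(c(81, 103, 147,
                         359, 326, 277),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(c("obese", "not obese"),
                                       c("dating", "cohabiting", "married")))
chisq.test(obesity_table)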
In regression, there is an explanatory variable (on the x axis) and a response variable (on the y axis). The relationship must be linear in order for linear regression to be useful. The relationship can be positive or negative, weak or strong.
One measure of the relationship between two variables is correlation. It describes the strength of the linear relationship between the variables. It is denoted as R.
Properties of the correlation coefficient
With perfect positive association, the slope of the regression line is positive. With perfect negative association, the slope is negative. When there is no association, the slope is zero (a horizontal line). Note that the correlation coefficient is not exactly the same as the slope, but its sign and the sign of the regression line’s slope will always be the same.
Residuals are the leftovers from the model fit: Data = Fit + Residuals. A residual is the difference between the observed and the predicted \(y\) for a given \(x\); specifically, observed minus predicted.
\[\text{residual} = e_i = y_i - \hat{y}_i\]
(I have seen other texts that refer to the error as the difference between a response value and the “true” regression line, the one based on the population, and the difference between a response value and the estimated regression line as a residual. In that case, the error is represented by \(\epsilon\) and the residual by \(e\).)
If a residual is negative, then the actual data point is below the regression line. The model overestimates at that particular \(x\).
If the residual is positive, then the data point is above the regression line. The model underestimates at that particular \(x\).
For the slope of the regression line, we minimize the sum of the squared residuals. Squaring the values avoids the problem of positive and negative values cancelling each other out, and it also makes residual values even more costly as they become larger.
(Unfortunately, this does not explain why squared values are valid. It explains two problems with using the original values, or absolute values. But there could be many methods of making residual values more costly as they become larger; for example, \(\exp(y)\), or cubing the value. I’m sure there must be a mathematical basis for squaring the values.)
The least squares line is \(\hat{y} = \beta_0 + \beta_1x\), where \(x\) is the explanatory variable, \(\hat{y}\) is the predicted response variable, \(\beta_1\) is the slope of the line, and \(\beta_0\) is the y-intercept of the line.
Notation:
| parameter | point estimate | |
|---|---|---|
| intercept | \(\beta_0\) | \(b_0\) |
| slope | \(\beta_1\) | \(b_1\) |
(I have a textbook on regression analysis that refers to the estimates as \(\hat{\beta_0}\) and \(\hat{\beta_1}\) rather than \(b_0\) and \(b_1\).)
(Video does not show this)
\[R = \frac{1}{n-1} \sum\limits_{i=1}^{n} \frac{(y_i-\bar{y})}{s_y} \times \frac{(x_i-\bar{x})}{s_x}\]
# make up some data for x and y
x <- c(10,12,14,10,13,16,20)
y <- c(9,10,12,9,14,14,16)
x.mean = mean(x)
y.mean = mean(y)
x.sd = sd(x)
y.sd = sd(y)
n <- length(x)
correl = (1/(n-1)) * sum(((y - y.mean)/y.sd) * ((x - x.mean)/x.sd))
correl
## [1] 0.915
Or use the cor function in R:
cor(x, y)
## [1] 0.915
The slope of the regression line, \(b_1\), is:
\[b_1 = \frac{s_y}{s_x}R\]
# x, y, x.sd and y.sd defined or calculated above
R = cor(x, y)
b_1 = (y.sd/x.sd) * R
b_1
## [1] 0.7132
The regression line always passes through \((\bar{x}, \bar{y})\). We can combine this fact with the slope to calculate the y-intercept, the value of the line at \(x=0\). The line through \((\bar{x}, \bar{y})\) with slope \(b_1\) satisfies \(y - \bar{y} = b_1(x - \bar{x})\), and at \(x = 0\) we have \(y = b_0\):
\[b_0 - \bar{y} = b_1(0 - \bar{x}) = -b_1\bar{x}\] \[b_0 = \bar{y} - b_1\bar{x}\]
# x.mean and y.mean calculated above
b_0 = -b_1 * x.mean + y.mean
b_0
## [1] 2.321
Output from R for calculating regression line will look something like:
| | Estimate | Std. Error | t value | Pr(>abs(t)) |
|---|---|---|---|---|
| (Intercept) | 64.78 | 6.80 | 9.52 | 0.00 |
| hsgrad | -0.62 | 0.08 | -7.86 | 0.00 |
The first line is for the y-intercept, second is for the slope. “hsgrad” here is the explanatory variable. So the slope is -0.62 and the y-intercept is 64.78.
In R, using made-up data defined above:
data.lm = lm(y ~ x)
summary(data.lm)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## 1 2 3 4 5 6 7
## -0.453 -0.879 -0.306 -0.453 2.408 0.268 -0.585
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.321 1.964 1.18 0.2904
## x 0.713 0.141 5.07 0.0039 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.22 on 5 degrees of freedom
## Multiple R-squared: 0.837, Adjusted R-squared: 0.805
## F-statistic: 25.7 on 1 and 5 DF, p-value: 0.00386
Predict. Don’t extrapolate.
Prediction means using the linear model to predict the value for the response variable for a given value of the explanatory variable. You can safely calculate a value of the response variable \(y\) for any value of the explanatory variable \(x\) that is between the minimum and maximum observed values for \(x\). Going outside this range is extrapolation. Don’t.
There should be a straight-line relationship between the explanatory and response variables. Eyeball this using a scatterplot or residuals plot.
TODO: How to plot a residuals plot.
TODO: How to do either on a set of data’s residuals
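A possible sketch of both, using the made-up x, y, and data.lm from above:
# scatterplot of the raw data with the fitted line
plot(x, y)
abline(data.lm)
# residuals plot: residuals vs. fitted values, with a reference line at zero
plot(data.lm$fitted.values, data.lm$residuals)
abline(h=0, lty=2)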
The variability of the residuals should not increase or shrink as you move along x. It should not form a “cone.” Constant variability is called homoscedasticity.
R-squared is just what it sounds like: square the R value. This value indicates the percentage of the variation in the response variable that is explained by the model. It is always between 0 and 1.
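A quick check with the made-up data from above: for simple linear regression, the squared correlation matches the Multiple R-squared reported by summary.
cor(x, y)^2 # about 0.837
summary(data.lm)$r.squared # same value as Multiple R-squared in the summary output above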
Now we consider a case where the response variable is numerical but the explanatory variable is categorical.
Example: Poverty vs. region. The explanatory variable is region, where 0=east and 1=west.
\(\hat{\text{poverty}} = 11.17 + 0.38\text{(region:west)}\)
For eastern states, we plug in 0 for x. This means that the predicted poverty rate for all eastern states is 11.17.
For western states, the predicted poverty rate is 11.17 + 0.38 = 11.55.
When the explanatory variable is categorical, we always code one, and only one, level of that variable to be zero. This is the “reference level”.
Example with more than two levels:
| | Estimate | Std. Err. | t value | Pr(>abs(t)) |
|---|---|---|---|---|
| (Intercept) | 9.50 | 0.87 | 10.94 | 0.00 |
| region4:midwest | 0.03 | 1.15 | 0.02 | 0.98 |
| region4:west | 1.79 | 1.13 | 1.59 | 0.12 |
| region4:south | 4.16 | 1.07 | 3.87 | 0.00 |
The top row is the reference level (northeast, name not shown).
\[\hat{\text{% in poverty}} = 9.50 + 0.03\text{(region4:midwest)} + 1.79\text{(region4:west)} + 4.16\text{(region4:south)}\]
So here, the predicted poverty rate for midwest states would be 9.53. It is 11.29 for western states, and 13.66 for southern states.
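The same arithmetic in one line, using the estimates from the table (the region names are just labels I added):
# the reference level (northeast) gets 0; each other region adds its slope estimate
9.50 + c(northeast=0, midwest=0.03, west=1.79, south=4.16)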
Leverage: a point that “falls away horizontally” from the cloud. This means it is some distance away horizontally from the cloud, although it might appear more or less where the regression line predicts.
Influence: The amount of change an outlier can create in the regression line. So a point can have high leverage but low influence. Influential points are usually high-leverage points.
Influential points do not necessarily reduce \(R^2\). It is possible that the data set without the outlier has no linear relationship, in which case both \(R\) and \(R^2\) will be very low. A highly-influential outlier might actually cause \(R^2\) to bump up slightly in this case. Example is a cloud of points with very low \(R\), and one highly influential outlier. This causes R suddenly to become quite high, and so \(R^2\) is also much higher.
I consider this question to be bogus. The instructor concedes “we would never want to fit a linear model” to such data. It’s not fit for linear regression. So we should never get to the point of checking \(R^2\) in this case. This is sort of like asking, “Is it always a good idea to put smoke detectors in a house?” and the answer is, “No, not if the house has been abandoned for 25 years, and there would be no negative impact if it burned.” Correct and irrelevant in my opinion.
Example is a study on IQ among twins, raised together or apart.
| | Estimate | Std. Error | t value | Pr(>abs(t)) |
|---|---|---|---|---|
| (Intercept) | 9.2076 | 9.2999 | 0.99 | 0.3316 |
| bioIQ | 0.9014 | 0.0963 | 9.36 | 0.00 |
The linear model is \(\hat{\text{fosterIQ}} = 9.2076 + 0.9014\text{(bioIQ)}\). \(R^2\) is 0.78, although NOTE that you cannot see this from the above table (the full summary of the output from R's lm function does report \(R^2\), as “Multiple R-squared”).
\(H_0: \beta_1 = 0\)
\(H_A: \beta_1 \ne 0\)
What the above means is that, if the slope of the line is zero, there is no linear relationship between the variables. (I’ve done some thinking about this and remain bemused. If \(R\) is zero, then there is no linear relationship between the variables. But if \(R\) is 0.000001, then there is a relationship, no matter how weak, and the slope of the line could be either very small, like 0.000001, or it could be enormous, a billion.)
Linear regression always uses a t statistic for inference:
\[T = \frac{\text{point estimate}-\text{null value}}{SE_{b1}}\]
b_1 = 0.9014
point_est = b_1
nullval = 0
n = 27
df = n - 2 # always n-2 for simple linear regression, because we estimate two parameters: the intercept and the slope
SE = 0.0963 # from table above, SE for b_1
T_score = abs(point_est - nullval)/SE
T_score
## [1] 9.36
p_value = pt(T_score, df=df, lower.tail=FALSE) * 2 # two-sided alternative, so double the tail area
p_value
## [1] 1.196e-09
With such a high t-score, and more to the point, such a low p-value, we can have high confidence that the null hypothesis should be rejected, i.e., \(\beta_1\) is not zero.
(A question occurs to me: what about the intercept? The p-value for it is much larger. I guess the conclusion here is that the estimated intercept may not be precisely correct. But that’s a different kettle of fish. The slope is special in the sense that, if it is precisely zero, there is no correlation at all between the variables. Otherwise, there is at least some correlation. But if there is a small difference between our estimated intercept and the true intercept, it does not decide whether there is some, or no, correlation. The video does make the point later that “inference on the intercept is rarely done.”)
Confidence interval is point estimate \(\pm\) margin of error:
\[b_1 \pm t^*_{df} \cdot SE_{b1}\]
b_1 = 0.9014
point_est = b_1
n = 27
df = n - 2
conf_level = 0.95
alpha = 1 - conf_level
SE = 0.0963
t_star = qt(conf_level + (alpha/2), df=df)
ci = c(point_est - t_star * SE, point_est + t_star * SE)
ci
## [1] 0.7031 1.0997
We are 95% confident that for each additional point on the biological twins’ IQs, the foster twins’ IQs are expected on average to be higher by 0.7 to 1.1 points.
So far, inference has used a t-test to evaluate the strength of the evidence for a hypothesis test for the slope of the relationship between \(x\) and \(y\).
Alternatively, we can consider the variability in \(y\) explained by \(x\), compared to the unexplained variability.
Partitioning the variability in \(y\) into explained and unexplained variability uses ANOVA. Here is an example of ANOVA output:
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| bioIQ | 1 | 5231.13 | 5231.13 | 87.56 | 0.0000 |
| Residuals | 25 | 1493.53 | 59.74 | | |
| Total | 26 | 6724.66 | | | |
\(SS_{Tot}\) (sum of squares total) is 6724.66 above. The unexplained variability is \(SS_{Res}\) (sum of squares of residuals) or 1493.53. So the explained variability is \(SS_{Reg} = SS_{Tot}-SS_{Res} = 5231.13\).
The total degrees of freedom is \(df_{Tot} = 27 - 1 = 26\). The regression degrees of freedom is \(df_{Reg} = 1\) because there is only one predictor (in multiple linear regression this will be higher). The residuals degrees of freedom is \(df_{Res} = df_{Tot} - df_{Reg} = 26 - 1 = 25\).
The mean squares for the regression is \(MS_{Reg} = \frac{SS_{Reg}}{df_{Reg}} = \frac{5231.13}{1} = 5231.13\). The mean squares of the residuals is \(MS_{Res} = \frac{SS_{Res}}{df_{Res}} = \frac{1493.53}{25} = 59.74\).
Finally, the F statistic, which is the ratio of explained to unexplained variability, is \(F_{(1,25)} = \frac{MS_{Reg}}{MS_{Res}} = \frac{5231.13}{59.74} = 87.56\).
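The same partitioning as a quick R sketch, using the values from the table above:
ss_tot = 6724.66 # total sum of squares
ss_res = 1493.53 # unexplained (residual) sum of squares
ss_reg = ss_tot - ss_res # explained sum of squares, 5231.13
ms_reg = ss_reg / 1 # df_reg = 1
ms_res = ss_res / 25 # df_res = 25, about 59.74
f_stat = ms_reg / ms_res # about 87.56
pf(f_stat, 1, 25, lower.tail=FALSE) # p-value, essentially 0
ss_reg / ss_tot # R-squared, about 0.78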
\(H_0: \beta_1 = 0\)
\(H_A: \beta_1 \ne 0\)
The p-value in the table is very small, so we reject the null hypothesis. The data provide convincing evidence that the slope is significantly different from 0.
Two ways to calculate \(R^2\):
In the example, \(R^2 = 0.882^2 \approx 0.78\). Using the ANOVA table, it is \(\frac{SS_{Reg}}{SS_{Tot}} = \frac{5231.13}{6724.66} \approx 0.78\).
So in the ANOVA table, it is Sum Sq, first row, divided by Sum Sq, total row.
This unit will use, for one example, data with predictive variables for birthweight.
The response variable will always be numerical, but the explanatory variables can be numerical or categorical.
Topics:
First example is relationship between weight of books, and volume and cover type as the explanatory variables.
library(DAAG) # contains some data to be used as example
## Loading required package: lattice
##
## Attaching package: 'DAAG'
##
## The following object is masked from 'package:openintro':
##
## possum
data(allbacks)
# fit model
book_mlr = lm(weight ~ volume + cover, data = allbacks)
summary(book_mlr)
##
## Call:
## lm(formula = weight ~ volume + cover, data = allbacks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.1 -32.3 -16.1 28.9 210.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 197.9628 59.1927 3.34 0.00584 **
## volume 0.7180 0.0615 11.67 6.6e-08 ***
## coverpb -184.0473 40.4942 -4.55 0.00067 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 78.2 on 12 degrees of freedom
## Multiple R-squared: 0.927, Adjusted R-squared: 0.915
## F-statistic: 76.7 on 2 and 12 DF, p-value: 1.45e-07
Here, the intercept is 197.96, and the explanatory variables are volume and cover. We see cover:pb, for paperback. Since this level is listed, it is the “non-reference” level; the level not shown (hardcover) is the “reference” level, which is coded as 0. So in this case, holding volume constant, a paperback is predicted to weigh about 184 grams less than a hardcover. If there were three cover types, two of them would be shown.
The \(R^2\) value is 0.9275, so 92.75% of the variability in the response variable is explained by volume and cover type.
Based on the output above, the model is:
\(\hat{\text{weight}} = 197.96 + 0.72\text{volume} - 184.05\text{cover:pb}\)
Because hardcover is the reference value, for a hardcover book, we plug in a 0 for cover, and the weight would be:
\(\hat{\text{weight}} = 197.96 + 0.72\text{volume} - 184.05 \times 0 = 197.96 + 0.72\text{volume}\)
For a paperback:
\(\hat{\text{weight}} = 197.96 + 0.72\text{volume} - 184.05 \times 1 = 13.91 + 0.72\text{volume}\)
pb = subset(allbacks, allbacks$cover=="pb")
hb = subset(allbacks, allbacks$cover=="hb")
pb.lm = lm(pb$weight ~ pb$volume)
hb.lm = lm(hb$weight ~ hb$volume)
plot(hb$volume, hb$weight, pch="*", ylab="weight", xlab="volume (cm3)", xlim=c(0,1500), ylim=c(0,1000))
points(pb$volume, pb$weight, pch=17) # add the paperback points to the same plot
abline(pb.lm)
abline(hb.lm)
legend("topleft", inset=0.05, title="Cover Type", legend=c("pb", "hb"), pch=c(17, 42)) # 42 is the "*" symbol
The slope estimate for volume is 0.72, so that, for each additional cubic centimeter in volume, all else being held constant, the weight is predicted to increase by 0.72 grams.
The slope for the cover variable is -184.05, so that, all else being held constant, when a book has a cover type of paperback, there is an expected decrease of 184.05 grams in weight.
The intercept is 197.96, which would mean that a book with no volume would weigh 197.96 grams. Since this is clearly nonsense, the intercept has no relevance by itself.
We can use this linear regression model to predict book weight. For example, a paperback with a volume of 600 \(cm^3\) would be \(197.96 + 0.72(600) - 184.05(1) = 445.91\) grams.
The model assumes hardcover and paperback books have the same slope for the relationship between their volume and weight. This may be true in some cases, but in others it may not. In that case we would include an interaction term, which is beyond the scope of this course (a sketch of what that would look like in R follows).
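Not covered in the course, but for reference, R expresses an interaction with * (or :) in the model formula; a sketch:
# weight ~ volume * cover expands to volume + cover + volume:cover,
# allowing paperback and hardcover books to have different volume slopes
book_int = lm(weight ~ volume * cover, data=allbacks)
summary(book_int)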
states = read.csv("http://bit.ly/dasi_states")
pov_slr = lm(poverty ~ female_house, data=states)
summary(pov_slr)
##
## Call:
## lm(formula = poverty ~ female_house, data = states)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.754 -1.825 -0.037 1.556 6.328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.309 1.897 1.74 0.087 .
## female_house 0.691 0.160 4.32 7.5e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.66 on 49 degrees of freedom
## Multiple R-squared: 0.276, Adjusted R-squared: 0.261
## F-statistic: 18.7 on 1 and 49 DF, p-value: 7.53e-05
cor(states$female_house, states$poverty)
## [1] 0.5254
plot(poverty ~ female_house, data=states) # response (poverty) on the y axis, so the line from pov_slr matches
abline(pov_slr)
\(R\) is 0.525 and \(R^2\) is 0.276. The p-value shows that the percentage of female householders is a significant predictor of the poverty rate.
Now the output from ANOVA.
anova.pov_slr = anova(pov_slr)
anova.pov_slr
## Analysis of Variance Table
##
## Response: poverty
## Df Sum Sq Mean Sq F value Pr(>F)
## female_house 1 133 132.6 18.7 7.5e-05 ***
## Residuals 49 348 7.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The proportion of explained variability is \(R^2 = \frac{\text{explained variability}}{\text{total variability}} = \frac{132.57}{480.25} = 0.276\). This matches the value from the lm output.
Now add another variable to move to multiple linear regression.
pov_mlr = lm(poverty ~ female_house + white, data=states)
summary(pov_mlr)
##
## Call:
## lm(formula = poverty ~ female_house + white, data = states)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.525 -1.853 -0.038 1.377 6.269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.5789 5.7849 -0.45 0.65774
## female_house 0.8869 0.2419 3.67 0.00061 ***
## white 0.0442 0.0410 1.08 0.28676
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.66 on 48 degrees of freedom
## Multiple R-squared: 0.293, Adjusted R-squared: 0.264
## F-statistic: 9.95 on 2 and 48 DF, p-value: 0.000242
anova(pov_mlr)
## Analysis of Variance Table
##
## Response: poverty
## Df Sum Sq Mean Sq F value Pr(>F)
## female_house 1 133 132.6 18.74 7.6e-05 ***
## white 1 8 8.2 1.16 0.29
## Residuals 48 339 7.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Output similar to previous, but with new variable, white, as a predictor.
The total variability has not changed, because this is the inherent variability that is in the response variable, regardless of the number of explanatory variables. But what has changed is how the variability is partitioned. Some is now attributed to female householder, and some to race=white.
To calculate \(R^2\) here, we use both explanatory variables: \(R^2 = \frac{132.57 + 8.21}{480.25} = 0.293\). This is a slight increase from 0.276 with the simple linear regression. The value changes because some of the variability is now attributed to the second predictor.
Adjusted \(R^2\):
\[R^2_{adj} = 1 - \left(\frac{SSE}{SST} \times \frac{n-1}{n-k-1}\right)\]
where \(k\) is the number of predictors. This introduces a penalty to \(R^2\) based on the number of predictors in the model, and the magnitude of the penalty depends on the ratio of \(k\) to \(n\). The larger \(n\) is, the more \(k\) can increase without a significant penalty.
Remember that \(n\) is 51 (50 states plus DC), though the ANOVA table above shows 50 total degrees of freedom (\(n-1\)). With \(k=2\) predictors, the adjusted \(R^2\) is:
\(1 - \left(\frac{339.47}{480.25} \times \frac{51-1}{51-2-1}\right) = 0.26\)
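A quick check against the lm output above, using the sums of squares from the two ANOVA tables:
sse = 339.47 # residual sum of squares, two-predictor model
sst = 132.57 + 8.21 + 339.47 # total sum of squares, 480.25
n = 51
k = 2
(132.57 + 8.21) / sst # R-squared, about 0.293
1 - (sse/sst) * (n-1)/(n-k-1) # adjusted R-squared, about 0.264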
| | \(R^2\) | adjusted \(R^2\) |
|---|---|---|
| Model 1 (poverty vs. female_house) | 0.28 | 0.26 |
| Model 2 (poverty vs. female_house + white) | 0.29 | 0.26 |
So we see that \(R^2\) increased when white was added, but adjusted \(R^2\) barely changed: the penalty for the extra predictor roughly cancels the gain. (With a still weaker added predictor, adjusted \(R^2\) would actually decrease, even though \(R^2\) never decreases when predictors are added.)
In the example data, there is a very strong negative relationship between the variables white and female_house, so they are not independent of each other; they are collinear. So if female_house is one of our variables, we should not also add white. Think of the two together as a sort of echo chamber.
This leads to the subject of parsimony.
Data for cognitive study:
cognitive = read.csv("http://bit.ly/dasi_cognitive")
cog_full = lm(kid_score ~ mom_hs + mom_iq + mom_work + mom_age, data=cognitive)
summary(cog_full)
##
## Call:
## lm(formula = kid_score ~ mom_hs + mom_iq + mom_work + mom_age,
## data = cognitive)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.05 -12.92 1.99 11.56 49.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.5924 9.2191 2.13 0.034 *
## mom_hsyes 5.0948 2.3145 2.20 0.028 *
## mom_iq 0.5615 0.0606 9.26 <2e-16 ***
## mom_workyes 2.5372 2.3507 1.08 0.281
## mom_age 0.2180 0.3307 0.66 0.510
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.1 on 429 degrees of freedom
## Multiple R-squared: 0.217, Adjusted R-squared: 0.21
## F-statistic: 29.7 on 4 and 429 DF, p-value: <2e-16
\(H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0\)
\(H_A:\) At least one \(\beta_i\) is different than 0
The test statistic is an F statistic: 29.74 on 4 (the number of predictors, \(k\)) and 429 (\(n-k-1\); there were 434 cases) degrees of freedom. We also have the p-value, so we don't need to do any calculations. Since the p-value is less than 0.05, we reject the null hypothesis.
Example: Is whether or not the mother went to high school a significant predictor of the cognitive test scores of children, given all other variables in the model?
\(H_0: \beta_1=0\), when all other variables are included in the model
\(H_A: \beta_1\ne0\), when all other variables are included in the model
where \(\beta_1\) is the coefficient for mom_hs:yes. The p-value for this variable is 0.028, so we can reject the null hypothesis: the variable is a significant predictor of \(y\), given the other variables in the model.
We can walk through how this p-value was obtained.
\[T = \frac{\text{point estimate} - \text{null value}}{SE}\]
Our point estimate is \(b_1\) and \(SE\) is \(SE_{b1}\). So T is \(\frac{b_1 - 0}{SE_{b1}} = \frac{5.095}{2.315}=2.201\) where \(df = n - k - 1\).
A note about \(df = n - k - 1\). Recall that in single linear regression, \(df = n - 2\). This is because \(k=1\), since there is only one predictor. So this is the same concept.
If T is 2.201, then the p-value is 0.028.
n = 434
k = 4
b_1 = 5.095
se_b1 = 2.315
nullval = 0
T = (b_1 - nullval)/se_b1
df = n - k - 1
p_value = pt(T, df=df, lower.tail=FALSE) * 2
p_value
## [1] 0.02828
As always, it is point estimate \(\pm\) margin of error. Here, the point estimate is the slope estimate, so for mom_work:yes, \(b_3 \pm t^*_{df} \cdot SE_{b_3}\).
conf_level = 0.95
alpha = 1 - conf_level
b_3 = 2.54
se_b3=2.35
#df comes from previous snippet
t_star_df = qt(1 - (alpha/2), df=df) # the 0.975 quantile of the t distribution with df = 429
ci = c(b_3 - t_star_df * se_b3, b_3 + t_star_df * se_b3)
ci
## [1] -2.079 7.159
We are 95% confident that, all else held constant, children whose mothers worked are expected, on average, to score between 2.08 points lower and 7.16 points higher on the cognitive test than children whose mothers did not work.
One method is backwards elimination: start with full model (all predictors), drop one at a time until parsimonious model is reached
Another is forward selection: start with empty model and add one predictor at a time until parsimonious model is reached
There are many criteria for model selection, such as:
* p-value
* adjusted \(R^2\)
* AIC (Akaike information criterion)
* BIC (Bayesian information criterion)
* DIC (deviance information criterion)
* Bayes factor
* Mallows' \(C_p\)
All but first two beyond scope of course.
We saw that adjusted \(R^2\) for full model is 0.2098. Now remove mom_hs:
kid_score ~ mom_iq + mom_work + mom_age
This yields an adjusted \(R^2\) of 0.2027. Removing mom_hs did not improve the model; it made it worse. So we put it back in and try removing another. End result:
| step | variables included | removed | adjusted \(R^2\) |
|---|---|---|---|
| full | kid_score ~ mom_hs + mom_iq + mom_work + mom_age | | 0.2098 |
| step 1 | kid_score ~ mom_iq + mom_work + mom_age | [-mom_hs] | 0.2027 |
| | kid_score ~ mom_hs + mom_work + mom_age | [-mom_iq] | 0.0541 |
| | kid_score ~ mom_hs + mom_iq + mom_age | [-mom_work] | 0.2095 |
| | kid_score ~ mom_hs + mom_iq + mom_work | [-mom_age] | 0.2109 |
| step 2 | kid_score ~ mom_iq + mom_work | [-mom_hs] | 0.2024 |
| | kid_score ~ mom_hs + mom_work | [-mom_iq] | 0.0546 |
| | kid_score ~ mom_hs + mom_iq | [-mom_work] | 0.2105 |
In step 1, we found that removing mom_age improved the model. The question in step 2 is, with mom_age removed, can we remove any other predictors? The answer in this case is no, removing them one at a time never improves the model, so the final model is kid_score ~ mom_hs + mom_iq + mom_work. But if we had been able to improve the model by removing another predictor in step 2, then we would go on to step 3, and so on.
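A sketch of step 1 in R: refit the full model with each predictor dropped in turn and compare adjusted \(R^2\) (uses the cognitive data loaded above):
full = lm(kid_score ~ mom_hs + mom_iq + mom_work + mom_age, data=cognitive)
summary(full)$adj.r.squared # 0.2098 for the full model
summary(lm(kid_score ~ mom_iq + mom_work + mom_age, data=cognitive))$adj.r.squared # drop mom_hs
summary(lm(kid_score ~ mom_hs + mom_work + mom_age, data=cognitive))$adj.r.squared # drop mom_iq
summary(lm(kid_score ~ mom_hs + mom_iq + mom_age, data=cognitive))$adj.r.squared # drop mom_work
summary(lm(kid_score ~ mom_hs + mom_iq + mom_work, data=cognitive))$adj.r.squared # drop mom_age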
So the idea here is to look at the full model, and find the highest p-value. Is it above the significance level? If so, remove it, and then refit the model. Is the new highest p-value also above the significance level? Then repeat. Keep doing this until you no longer have a p-value above the significance level, or you only have one predictor left.
Note that if a predictor is a categorical variable, you have to look at all of the levels for that variable. If any of them have a p-value below the significance level, then you keep that variable, which means keeping all of the levels.
In the example with the cognitive IQ data, the p-value method ends with a final model of kid_score ~ mom_hs + mom_iq. This is different from the adjusted \(R^2\) method, because p-values depend on a somewhat arbitrary significance level.
We use the p-value approach if we are interested in which predictors are statistically significant. We use adjusted \(R^2\) if we want more reliable predictions.
The p-value method is used more commonly, since it requires fitting fewer models (at least with the more widely used backwards-elimination process).
So if you have 10 predictors, you start by fitting 10 single-predictor models, choose the best one, then fit 9 more models, each of which has the first predictor plus one of the remaining predictors. Keep repeating until you’ve fitted all possible predictors. Of course, if you’ve fitted \(n\) predictors, and fitting each one of the remaining predictors yields a worse model, you stop with those \(n\) predictors.
I found this a little confusing so here is my example. Say you have five predictors. Create a single-predictor model for each. Pick the one with the lowest p-value. If this is not significant, then stop here–you can’t create a significant model. Otherwise, now create four models, each of which has this first predictor and one of the others. Now pick the model that has the lowest p-value. Is it significant? If so, continue, otherwise stop with the single-predictor model. Now create three models, each with the first two predictors and one of the remaining three. Keep repeating this until either all of the predictors are added, and the p-value is still significant, or else you have to stop at a previous step because adding another predictor did not result in a significant p-value.
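To make the first step of this concrete in R (the candidates vector and the loop are my own scaffolding, not from the course):
candidates = c("mom_hs", "mom_iq", "mom_work", "mom_age")
# p-value of each single-predictor model; pick the smallest, if it is significant
pvals = sapply(candidates, function(v) {
  m = lm(as.formula(paste("kid_score ~", v)), data=cognitive)
  summary(m)$coefficients[2, 4] # row 2 is the predictor, column 4 is Pr(>|t|)
})
pvals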
Sometimes variables are included, or eliminated, based on expert opinion. Or, if you are studying a certain variable, you will leave it in the model regardless.
Our final model, using adjusted \(R^2\):
cog_final = lm(kid_score ~ mom_hs + mom_iq + mom_work, data=cognitive)
summary(cog_final)
##
## Call:
## lm(formula = kid_score ~ mom_hs + mom_iq + mom_work, data = cognitive)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.76 -13.02 2.09 11.55 48.65
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.1794 6.0432 4.00 7.4e-05 ***
## mom_hsyes 5.3823 2.2716 2.37 0.018 *
## mom_iq 0.5628 0.0606 9.29 < 2e-16 ***
## mom_workyes 2.5664 2.3487 1.09 0.275
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.1 on 430 degrees of freedom
## Multiple R-squared: 0.216, Adjusted R-squared: 0.211
## F-statistic: 39.6 on 3 and 430 DF, p-value: <2e-16
Note that mom_work does not have a significant p-value, but that is because we used the adjusted \(R^2\) method. It gives the model higher predictive power, even though it is not statistically significant. I think this means that it gives it just a teensy bit of higher predictive power, but using adjusted \(R^2\), more is always better, no matter how tiny the addition.
Conditions for a valid model (each is checked below):
* linear relationship between each (numerical) explanatory variable and the response
* nearly normal residuals
* constant variability of residuals
* independent residuals
If an explanatory variable is categorical, it doesn’t make sense to ask for a linear relationship between it and a numerical variable.
cog_final = lm(kid_score ~ mom_hs + mom_iq + mom_work, data=cognitive)
# note: mom_iq is only numerical variable
# residuals are saved in residuals field in model
plot(cog_final$residuals ~ cognitive$mom_iq)
The plot shows the residuals scattered randomly around zero, so the condition is satisfied.
par(mfrow=c(1,2))
hist(cog_final$residuals)
qqnorm(cog_final$residuals)
qqline(cog_final$residuals)
Histogram shows slight skew but acceptable. Condition met.
par(mfrow=c(1,2))
plot(cog_final$residuals ~ cog_final$fitted) # residuals vs. fitted values
plot(abs(cog_final$residuals) ~ cog_final$fitted) # absolute residuals vs. fitted values
The residuals should form a band of roughly constant width around zero, with no fan shape; if so, the constant variability condition is met.
To do a quick check for time series:
plot(cog_final$residuals)
If the data had a time series structure, we would likely see a pattern in this plot (for example, long runs of residuals above or below zero) rather than random scatter.
Some of my own notes.
The book states, but does not explain, that if
\[\log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n\]
then
\[p_i = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n}}{1+e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n}}\]
We know that if \(\log(p) = x\) then \(p = e^x\). But let’s start with:
\[\log(\frac{p}{1-p}) = x\]
Therefore:
\[\frac{p}{1-p} = e^x\]
We can manipulate the left side:
\[\frac{p}{1-p} = -1 \cdot \frac{p}{p-1} = -1 \cdot \frac{1+p-1}{p-1} = -1 \cdot \left(\frac{p-1}{p-1} + \frac{1}{p-1} \right)\]
\[= -1 \cdot \left(1 + \frac{1}{p-1} \right) = -1 + \frac{1}{1-p}\]
So:
\[ \frac{p}{1-p} = -1 + \frac{1}{1-p}\]
And therefore:
\[-1 + \frac{1}{1-p} = e^x\]
Now simplify:
\[ \frac{1}{1-p} = 1 + e^x\]
\[1-p = \frac{1}{1 + e^x}\]
\[p = 1 - \frac{1}{1 + e^x}\]
\[p = \frac{1 + e^x}{1 + e^x} - \frac{1}{1 + e^x} = \frac{1 + e^x -1}{1 + e^x} = \frac{e^x}{1 + e^x}\]
Therefore, substituting \(x = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n\):
\[p_i = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n}}\]
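As a quick numerical check, R's qlogis and plogis functions are exactly these two transformations:
# qlogis(p) = log(p/(1-p)); plogis(x) = exp(x)/(1+exp(x))
eta <- 0.7 # any made-up value of the linear predictor
p <- exp(eta) / (1 + exp(eta))
all.equal(p, plogis(eta)) # TRUE
all.equal(qlogis(p), eta) # TRUE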