R’s capabilities are simply amazing.
RStudio is a graphical user interface for R that bundles a set of integrated tools designed to help you be more productive with R, including a console, a syntax-highlighting script editor, and panes for viewing plots, packages, files, and help.
Note: Once R and RStudio are installed, it is not necessary to start R separately, because RStudio will start it for you.
R can be used as a calculator:
5+8
## [1] 13
13-pi
## [1] 9.858407
5/8
## [1] 0.625
We can assign values to variables in R:
x=4
y=16+x
x
## [1] 4
y
## [1] 20
Use the concatenate function, c(), to assign a vector of values to a variable:
values=c(6,7,9,100)
values
## [1] 6 7 9 100
We can perform calculations on objects:
newvalues=3*values
newvalues
## [1] 18 21 27 300
Week 2.
There are 4 types of measurement scales:
1. Nominal/Categorical
2. Ordinal
3. Interval
4. Ratio
These are simply ways to categorize different types of variables.
Nominal/Categorical scales:
- Used to label variables without any quantitative value.
- “Nominal” scales could also be called “labels.”
- Categories are mutually exclusive.
- Categories have no intrinsic ranking.
Examples: gender, zip code, eye color.
Ordinal scales:
- Allow for rank order (1st, 2nd, 3rd, etc.) by which data can be sorted.
- Do not convey the relative degree of difference between categories.
Examples: ‘sick’ vs. ‘healthy’ when measuring health; ‘guilty’ vs. ‘not guilty’ when making judgments in court; ‘wrong/false’ vs. ‘right/true’ when measuring truth value; a spectrum of values such as ‘completely agree’, ‘mostly agree’, ‘mostly disagree’, ‘completely disagree’ when measuring opinion.
Interval scales:
- Provide information about order.
- Possess equal intervals between adjacent values.
Examples: temperature (in degrees Fahrenheit or Celsius), time of day.
Ratio scales:
- Have an absolute zero (a point where none of the quality being measured exists).
- Tell us the exact value between units, so ratios of values are meaningful.
- Quantitative variables.
Examples: salary, height, weight.
Whether you know it or not, you are a fortune teller of sorts. Every day, all day, you are constantly predicting what will happen:
You choose clothes based on what you think the weather will be (or, be honest, what you have that’s clean regardless of the weather).
You choose what table to sit at in the cafeteria based on where you think your friends will sit.
You choose and choose and choose, and every choice is a prediction of how likely you think an event or series of events is to happen. We can actually measure how likely it is for something to happen, and that measurement is called probability.
Fig. 5.1
Flip a coin. What is your sample space?
Let’s start with dice. Take a die (make sure it’s fair, not weighted or “funny” in any way). What are the odds (the probability) of rolling a 3 if you roll the die one time? Hopefully you figured out that it is one in six because there are six sides to the die, and one of those sides has three dots.
Let’s try it. Take the die and roll it 100 times, recording your results below. Calculate the percentage of each result (for example, if you rolled a 2 seventeen times, that would be 17/100, or 17 percent).
Fig. 2.2.1
We would expect the percentage for each number to hover around 16 or 17 percent, which corresponds to 1/6, or about 0.167. This is probability in a nutshell.
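If you would rather not roll a physical die 100 times, here is a minimal R sketch that simulates the experiment (the seed value is arbitrary):
set.seed(1)  # arbitrary seed so the simulation is reproducible
rolls = sample(1:6, 100, replace=TRUE)  # 100 rolls of a fair six-sided die
table(rolls)/100  # relative frequency of each face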
Now guess what the percentage would be if you added up the percentages for only the odd-numbered sides. When you add up those rolls, does the percentage come close to your guess?
Probability is the ratio of the number of ways an event can occur to the total number of possible outcomes.
In the case of our die, there are six possible outcomes, and each number appears on exactly one side, so the probability of rolling a 3 on one roll is 1/6. If no side had three dots, the probability of rolling a 3 would be zero, because zero of the six outcomes show a 3, giving us the ratio 0/6 = 0. If every side had three dots, the probability of rolling a 3 would be 6/6, or 1. So probability is expressed as a number somewhere between 0 (not gonna happen) and 1 (definitely going to happen), with ratios closer to 1 being more likely.
Let’s put it in formula form:
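In symbols, restating the definition above:
\(P(\text{event}) = \dfrac{\text{number of ways the event can occur}}{\text{total number of possible outcomes}}\)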
Using the formula: 1. What is the probability of getting heads on a coin toss?
So far, we’ve just looked at things that could occur. What about looking at things that have actually happened?
We call that relative frequency, and it has its own formula:
Fig. 2.2.3
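Written out, relative frequency is:
\(\text{relative frequency} = \dfrac{\text{number of times the event actually occurred}}{\text{total number of trials}}\)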
Think back to your die experiment above. The first formula gives you your expectation (1/6). The formula for relative frequency gives you the actual outcome. What was the relative frequency of rolling 5s in your die-rolling experiment?
The more times we roll the die, the closer we will get to the outcome we expected (1/6). We call that the Law of Large Numbers — even if you don’t get it to come out like you expect with a few tries, the more you do it, the closer you will come to the expectation.
Flip a penny 10 times and record how many times it lands on heads and how many times it lands on tails.
Heads: ____________
Tails: ____________
Now flip it 100 times and record your results.
Heads: ____________
Tails: ____________
Did you get closer to a ½ ratio the second time? That’s the Law of Large Numbers at work. Remember, the Law of Large Numbers tells us that the more times you repeat an experiment, the closer the relative frequency will come to the probability.
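You can also let R do the flipping; here is a short sketch of the penny experiment (the seed is arbitrary):
set.seed(2)  # arbitrary seed
flips10 = sample(c("H","T"), 10, replace=TRUE)
flips100 = sample(c("H","T"), 100, replace=TRUE)
mean(flips10=="H")   # proportion of heads in 10 flips
mean(flips100=="H")  # usually closer to 0.5, as the Law of Large Numbers suggests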
We have one more thing to learn before we move on. Let’s figure out how to calculate the possible outcomes of an event. We’ve figured out the probability of simple events occurring, but what happens when the possible outcomes are harder to figure out? When you are rolling one die or flipping one coin, it’s simple to figure out possible outcomes, but it gets more complicated when you add in more dice or more coins.
Imagine that your parents pay you an allowance of 50¢ a week. Let’s say you love nickels. What is the probability that your 50¢ will contain a nickel? We need to figure out all the possible ways to give someone 50¢. Let’s say that your parents don’t ever use pennies or 50-cent pieces. Besides those, there are three possible coins to use: nickels, dimes and quarters. If your parents pay you in quarters, it’s simple, right? Two quarters make 50¢. But what other possible combinations make 50¢? And how likely are you to get that nickel you want? Let’s set it up in the table below. First, across the top we’ll list the possible coins. Next, we’ll start listing possible combinations by listing the greatest possible number of that type of coin and then decreasing that by one on the next line. Once we’ve gotten to zero of that coin, let’s move to the highest level possible of the next highest coin. Sound confusing? It’s not once you get going. Let’s try it:
First, quarters. We list two quarters, and that makes 50¢ by itself, so the columns for dimes and nickels are zero. There are no other possible combinations with two quarters, so the next line lists one quarter. Remember, we are going from largest to smallest, so first we’ll try one quarter and the greatest number of dimes possible, which is two. There are two other possibilities with one quarter, so we’ll list those on the next lines and then the next lower number of dimes, which is one. That means we’ll need three nickels because we’ve always got to add up to fifty cents.
Fig. 5.5
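If you want to check your table, here is a small R sketch that enumerates every nickel/dime/quarter combination worth 50¢ (the variable names are made up for this illustration):
combos = expand.grid(quarters=0:2, dimes=0:5, nickels=0:10)
combos = combos[combos$quarters*25 + combos$dimes*10 + combos$nickels*5 == 50, ]
combos  # every way to make 50 cents with these three coins
mean(combos$nickels > 0)  # share of combinations containing at least one nickel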
Statistical inference is inference about a population from a random sample drawn from it or, more generally, about a random process from its observed behavior during a finite period of time. It includes:
point estimation
interval estimation
hypothesis testing (or statistical significance testing)
Inferential statistics and prediction
Example: Suppose you are interested in the distribution of ages for employees working in a certain office. The following data is available: 36, 25, 38, 46, 55, 68, 72, 55, 36, 38, 67, 45, 22, 48, 91, 46, 52, 61, 58, 55. We use R to calculate the mean of the data set.
3.2.2.1 We create an object “age” that will contain the data as follows:
age<-c(36, 25, 38, 46, 55, 68, 72, 55, 36, 38, 67, 45, 22, 48, 91, 46, 52, 61, 58, 55)
3.2.2.2 Now we calculate the mean age of the employees:
mean(age)
## [1] 50.7
The mean age of the employees is: 50.7 years.
Boxplots can be created for individual variables or for variables by group. The format is ‘boxplot(x, data=y)’, where ‘x’ is a formula and ‘data=’ denotes the data frame providing the data.
To illustrate this command, we will use the ‘Dimes’ dataset in the ‘mosaicData’ package:
We click on the ‘Dimes’ dataset to access a description of the data frame.
It’s clear that the data frame contains 2 variables: ‘mass’ and ‘year’. We are interested in the ‘mass’ of dimes.
Example: We can access the ‘mass’ column by using the “$” sign as follows:
library(mosaicData)
Dimes$mass
## [1] 2.259 2.247 2.254 2.298 2.287 2.254 2.268 2.214 2.268 2.274 2.271
## [12] 2.268 2.298 2.292 2.274 2.234 2.238 2.252 2.249 2.234 2.275 2.230
## [23] 2.236 2.233 2.255 2.277 2.256 2.282 2.235 2.235
Example: We are interested in the median weight and plot a box plot as follows:
boxplot(Dimes$mass)
We can also use the ‘summary’ function as follows:
summary(Dimes$mass)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.214 2.236 2.256 2.258 2.274 2.298
It is clear that the minimum weight is 2.214 grams and the maximum weight is 2.298 grams. The median weight is 2.256 grams, etc.
In descriptive statistics, the interquartile range (IQR), also called the midspread or middle 50%, or technically H-spread, is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles, IQR = \(Q_3\)-\(Q_1\).
In this case the IQR is:
\(Q_3\)-\(Q_1\) = \(2.274-2.236=0.038\) grams.
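R also has a built-in ‘IQR’ function; a quick check on the same data:
IQR(Dimes$mass)  # interquartile range of the dime masses, Q3 minus Q1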
Please download Kaltura if you need software to record your screen. Here are the steps:
Kaltura is a video platform service. You can download it for free from Canvas.
Log into Canvas
Go to Math 3315
Click on “My media”:
Figure 2.1
A histogram is a visual representation of the distribution of a dataset. The shape of a histogram allows you to easily see where most of the data is situated. In particular, you can see where the middle of distribution is located, how closely the data lie around the middle, and where possible outliers are to be found. As shown in the figures below, a histogram consists of an x-axis, a y-axis and bars of different heights. The x-axis is divided into intervals (called “bins”), and on each bin a vertical bar is constructed whose height represents the number of data values within that bin. Note that histograms (unlike bar charts) don’t have gaps between the bars (if it looks like there’s a gap, that’s because that particular bin has no data in it).
Example: Suppose you are interested in the distribution of ages for employees working in a certain office. The following data is available: 36, 25, 38, 46, 55, 68, 72, 55, 36, 38, 67, 45, 22, 48, 91, 46, 52, 61, 58, 55. We use R to construct a histogram to represent the distribution of the data.
age<-c(36, 25, 38, 46, 55, 68, 72, 55, 36, 38, 67, 45, 22, 48, 91, 46, 52, 61, 58, 55)
hist(age)
The output appears under the ‘Plots’ tab.
The ‘hist’ command has many options that enable the user to change the display. For example, the user can control the number of bins using the ‘breaks’ option, set the title of the histogram using the ‘main’ option, and set the x- and y-axis labels using the ‘xlab’ and ‘ylab’ options.
Example: The following command creates a histogram with 7 nonempty bins, with title “Age of Employees” and x label “Employee ages”:
hist(age,breaks=7,main="Age of Employees",xlab="Employee ages")
Whenever we hear the term ‘population,’ the first thing that comes to mind is a large group of people. In the same way, in statistics a population denotes a large group of elements having at least one feature in common. The term is often contrasted with the sample, which is a part of the population selected to represent the entire group.
The population is the entire collection of persons, units, or objects under study that share the properties of interest. The sample, by contrast, is a finite subset of the population, chosen by a systematic process, used to learn about the characteristics of the parent group.
The comparison chart summarizes the differences:
The primary objective of the sample is to make statistical inferences about the population. The greater the size of the sample, the higher is the level of accuracy of generalization.
The sample mean \(\bar{x}\) is an estimator for the population mean, \(\mu\).
The sample proportion \(\hat{p}\) is an estimator for the population proportion \(p\).
The U.S. Census Bureau publishes annual price figures for new mobile homes in Manufactured Housing Statistics. The figures are obtained from sampling, not from a census. A simple random sample of 36 new mobile homes yielded the prices, in thousands of dollars, shown in the Table below:
Prices (in thousands of dollars):

| | | | | | |
|---|---|---|---|---|---|
| 67.8 | 67.1 | 49.9 | 56.0 | 68.4 | 73.4 |
| 56.5 | 76.7 | 59.2 | 63.7 | 71.2 | 76.8 |
| 56.9 | 57.7 | 59.1 | 60.6 | 63.9 | 66.7 |
| 64.3 | 74.5 | 62.2 | 61.7 | 64.0 | 57.9 |
| 55.6 | 55.5 | 55.9 | 70.4 | 72.9 | 49.3 |
| 51.3 | 63.8 | 62.6 | 72.9 | 53.7 | 77.9 |
I created a .csv file available on Canvas under “Presentation 2” with the data for “Mobile Home Prices.”
Download the file to your desktop and save as a .csv file.
Import the data and save in the object MHPrice. Make sure to click on “Import Dataset” in the Environment window of RStudio:
MHPrice=read.csv("/Users/dekock/Desktop/Mhomeprices.csv", header=TRUE)
Use the data to estimate the population mean price, \(\mu\), of all new mobile homes.
We find the sample mean, \(\bar{x}\):
mean(MHPrice$Price)
## [1] 63.27778
Based on the sample data, we estimate the mean price, \(\mu\), of all new mobile homes to be approximately \(\$63.28\) thousand, that is, \(\$63,280\).
To emphasize, an estimate of this kind is called a point estimate for \(\mu\) because it consists of a single number, or point.
Calories per \(\frac{1}{2}\) cup serving for 16 popular chocolate ice creams are shown below. Find a point estimate for the proportion of values that are greater than 190.
270, 170, 160, 160, 199, 160, 150, 180, 150, 140, 160, 290, 191, 170, 110, 170
We see that the following values are greater than 190:
270, 199, 290, and 191.
Since we have a total of 16 data points, the sample proportion of values greater than 190 is \(\widehat{p}=\dfrac{4}{16}=0.25\). We use \(\widehat{p}\) to estimate the proportion \(p\) of chocolate ice creams with more than 190 calories per serving.
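As a quick check in R (the vector name ‘calories’ is just for this illustration):
calories = c(270, 170, 160, 160, 199, 160, 150, 180, 150, 140, 160, 290, 191, 170, 110, 170)
mean(calories > 190)  # proportion of servings with more than 190 calories, 0.25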
Boxplots can be created for individual variables or for variables by group. The format is ‘boxplot(x, data=y)’, where ‘x’ is a formula and ‘data=’ denotes the data frame providing the data.
To illustrate this command, we will use the ‘Dimes’ dataset in the ‘mosaicData’ package:
We click on the ‘Dimes’ dataset to access a description of the data frame:
It’s clear that the data frame contains 2 variables: ‘mass’ and ‘year’. We are interested in the ‘mass’ of dimes.
Example: We can access the ‘mass’ column by using the “$” sign as follows:
Dimes$mass
## [1] 2.259 2.247 2.254 2.298 2.287 2.254 2.268 2.214 2.268 2.274 2.271
## [12] 2.268 2.298 2.292 2.274 2.234 2.238 2.252 2.249 2.234 2.275 2.230
## [23] 2.236 2.233 2.255 2.277 2.256 2.282 2.235 2.235
Example: We are interested in the median weight and plot a box plot as follows:
boxplot(Dimes$mass)
Example: We can also find the max, min, median, and quartiles using the ‘summary’ command:
summary(Dimes$mass)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.214 2.236 2.256 2.258 2.274 2.298
3.4
Bar graphs are produced with the ‘barplot(height)’ function, where height is a vector or matrix. If you want to do a bar graph for a single categorical variable, you must first create a frequency table for the values. To show this, we use the ‘HELPrct’ dataset, which you may view under the ‘mosaicData’ package:
We are interested in the ‘substance’ variable:
Example: Creating a bar plot. First we create a table of frequencies.
freq=table(HELPrct$substance)
freq
##
## alcohol cocaine heroin
## 177 152 124
Then, we can create the bar plot:
barplot(freq)
Again, there are settings we can change to make this look nicer. The ‘col’ option changes the bar colors. The ‘main’ option changes the title of the plot. The ‘xlab’ and ‘ylab’ options change the x-axis and y-axis titles. The ‘las’ option changes the orientation of the x- and y-axis labels.
Example: Changing settings in a bar plot.
barplot(freq, col="deepskyblue", main="Frequencies", xlab="Substance", ylab="Count", las=1)
box()
For a horizontal bar graph, we add the horizontal=TRUE option. The ‘las’ option changes the x- and y-axis label orientations. The ‘cex.axis’ option changes the size of the x-axis labels and ‘cex.names’ changes the size of the y-axis (category) labels. Another new option is ‘xlim’, which extends the x-axis limits.
Example: Creating horizontal bar plots and changing more options.
barplot(freq, horiz=TRUE, col=rainbow(3), main="Frequencies", xlab="Count", ylab="Substance", xlim=c(0,200), cex.axis=.8, cex.names=0.6, las=1)
box()
An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.
The box plot is a useful graphical display for describing the behavior of the data in the middle as well as at the ends of the distribution. The box plot uses the median and the lower and upper quartiles (defined as the 25th and 75th percentiles). If the lower quartile is Q1 and the upper quartile is Q3, then the difference (Q3 - Q1) is called the interquartile range, or IQR.
A box plot is constructed by drawing a box between the upper and lower quartiles with a solid line drawn across the box to locate the median. The following quantities (called fences) are needed for identifying extreme values in the tails of the distribution:
lower inner fence: Q1 - 1.5*IQR
upper inner fence: Q3 + 1.5*IQR
lower outer fence: Q1 - 3*IQR
upper outer fence: Q3 + 3*IQR
A point beyond an inner fence on either side is considered a mild outlier. A point beyond an outer fence is considered an extreme outlier.
Consider the following dataset:
x=c(30, 171, 184, 201, 212, 250, 265, 270, 272, 289, 305, 306, 322, 322, 336, 346, 351, 370, 390, 404, 409, 411, 436, 437, 439, 441, 444, 448, 451, 453, 470, 480, 482, 487, 494, 495, 499, 503, 514, 521, 522, 527, 548, 550, 559, 560, 570, 572, 574, 578, 585, 592, 592, 607, 616, 618, 621, 629, 637, 638, 640, 656, 668, 707, 709, 719, 737, 739, 752, 758, 766, 792, 792, 794, 802, 818, 830, 832, 843, 858, 860, 869, 918, 925, 953, 991, 1000, 1005, 1068, 1441)
x
## [1] 30 171 184 201 212 250 265 270 272 289 305 306 322 322
## [15] 336 346 351 370 390 404 409 411 436 437 439 441 444 448
## [29] 451 453 470 480 482 487 494 495 499 503 514 521 522 527
## [43] 548 550 559 560 570 572 574 578 585 592 592 607 616 618
## [57] 621 629 637 638 640 656 668 707 709 719 737 739 752 758
## [71] 766 792 792 794 802 818 830 832 843 858 860 869 918 925
## [85] 953 991 1000 1005 1068 1441
The summary statistics for the dataset can be given by:
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30.0 436.2 559.5 576.1 738.5 1441.0
The boxplot:
boxplot(x)
From an examination of the fence points and the data, one point (1441) exceeds the upper inner fence and stands out as an outlier.
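A short sketch of the fence calculations for this dataset, using the quartiles reported by summary():
q1 = quantile(x, 0.25)
q3 = quantile(x, 0.75)
iqr = q3 - q1
q3 + 1.5*iqr  # upper inner fence, roughly 1192, so 1441 lies beyond it
q1 - 1.5*iqr  # lower inner fence, below the minimum of 30, so no low outliers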
The standard error (SE) of a statistic (usually an estimate of a parameter) is the standard deviation of its sampling distribution or an estimate of that standard deviation. If the parameter or the statistic is the mean, it is called the standard error of the mean (SEM).
The sampling distribution of a population mean is generated by repeated sampling and recording of the means obtained.
This forms a distribution of different means, and this distribution has its own mean and variance.
Mathematically, the variance of the sampling distribution obtained is equal to the variance of the population divided by the sample size. This is because as the sample size increases, sample means cluster more closely around the population mean.
Therefore, the relationship between the standard error and the standard deviation is such that, for a given sample size, the standard error equals the standard deviation divided by the square root of the sample size. In other words, the standard error of the mean is a measure of the dispersion of sample means around the population mean.
The standard error of the mean (SEM) can be expressed as:
\(\sigma_{\bar{x}}=\dfrac{\sigma}{\sqrt{n}}\),
where \(\sigma\) is the standard deviation of the population and \(n\) is the size of the sample.
Let’s assume you collected the following data on basketball player height at your school:
heights = c(75,70,69,68,68,72,72,73,73, 74,74,73,75)
heights
## [1] 75 70 69 68 68 72 72 73 73 74 74 73 75
The first step to find the standard error is to find the sample mean:
mean(heights)
## [1] 72
Find the standard deviation:
sd(heights)
## [1] 2.483277
Now, we can find the standard error:
SE = \(\dfrac{2.48}{\sqrt{13}} \approx 0.69\) inches.
To compute the standard error in R, install the “plotrix” package and use its “std.error” function:
library(plotrix)
std.error(heights)
## [1] 0.6887372
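The same value can be computed without any extra packages, directly from the definition:
sd(heights)/sqrt(length(heights))  # standard deviation divided by the square root of n, about 0.689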
As we know, a sample mean, \(\bar{x}\) is usually not equal to the population mean \(\mu\); generally, there is sampling error. Therefore, we should accompany any point estimate of \(\mu\) with information that indicates the accuracy of that estimate. This information is called a confidence-interval estimate for \(\mu\), which we introduce in the next example.
Consider again the problem of estimating the (population) mean price, \(\mu\), of all new mobile homes by using the sample data in the Table above. Let’s assume that the population standard deviation \(\sigma\), of all such prices is \(\$7200\).
Identify the distribution of the variable \(\bar{x}\), that is, the sampling distribution of the sample mean for samples of size 36.
Use part (a) to show that 95% of all samples of 36 new mobile homes have the property that the interval from \(\bar{x} − 2.4\) to \(\bar{x} + 2.4\) contains \(\mu\).
Use part (b) and the sample data in the Table above to find a \(95\%\) confidence interval for \(\mu\), that is, an interval of numbers that we can be \(95\%\) confident contains \(\mu\).
hist(MHPrice$Price)
Because \(n = 36\), \(\sigma = 7.2\), and prices of new mobile homes are normally distributed, it follows that
\(\mu_{\bar{x}} = \mu\)
\(\sigma_{\bar{x}}={\frac{\sigma}{\sqrt{n}}}=\frac{7.2}{\sqrt{36}}=1.2\)
\(\bar{x}\) is normally distributed.
In other words, for samples of size 36, the variable \(\bar{x}\) is normally distributed with mean \(\mu\) and standard deviation 1.2.
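Putting parts (b) and (c) together, a quick check in R using the sample mean found earlier:
xbar = mean(MHPrice$Price)  # 63.27778
c(xbar - 2.4, xbar + 2.4)   # 95% confidence interval, roughly (60.88, 65.68) thousand dollars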
Empirical rule
The formula used to compute confidence intervals is:
\(\bar{x} \pm z_{\frac{\alpha}{2}}\cdot \dfrac{\sigma}{\sqrt{n}}\)
where
\(\bar{x}\) is the sample mean
\(z_{\frac{\alpha}{2}}\) is the confidence coefficient (the critical value for the chosen confidence level)
\({\sigma}\) is the population standard deviation
\(n\) is the sample size.
The student can use the following table to determine \(z_{\frac{\alpha}{2}}\).
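For reference, the most commonly used values are:

| Confidence level | \(z_{\frac{\alpha}{2}}\) |
|---|---|
| 90% | 1.645 |
| 95% | 1.96 |
| 99% | 2.576 |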
Fourteen users attempted to add a channel on their cable TV to a list of favorites. After the task they rated the difficulty on the 7 point Single Ease Question. Compute the 95% confidence interval. The responses are shown below
2, 6, 4, 1, 7, 3, 6, 1, 7, 1, 6, 5, 1, 1.
Step 1:
Compute the mean:
responses=c(2, 6, 4, 1, 7, 3, 6, 1, 7, 1, 6, 5, 1, 1)
responses
## [1] 2 6 4 1 7 3 6 1 7 1 6 5 1 1
mean(responses)
## [1] 3.642857
Mean = 3.64
Step 2:
Compute the standard deviation:
sd(responses)
## [1] 2.468483
The standard deviation is: 2.47
Step 3:
Compute the standard error:
library(plotrix)
std.error(responses)
## [1] 0.6597297
The standard error is 0.66.
Step 4:
Compute a rough margin of error by multiplying the standard error by 2 (an approximation that Steps 5 and 6 refine with the exact multiplier):
\(0.66 \cdot 2 \approx 1.3\)
Step 5:
Use the table to find the value of \(z_{\frac{\alpha}{2}}\).
Since we are finding the \(95\)% confidence interval, we will use \(z_{\frac{\alpha}{2}}=1.96\).
Step 6:
Plug into the formula:
\(\bar{x} \pm z_{\frac{\alpha}{2}}\cdot \dfrac{\sigma}{\sqrt{n}}\)
to find the confidence interval:
\(3.64 \pm 1.96 \cdot 0.66\), which results in:
\(3.64 + 1.96 \cdot 0.66 = 4.93\) and
\(3.64 - 1.96 \cdot 0.66 = 2.35\).
Therefore, the confidence interval is:
\((2.35, 4.93)\).
From several hundred tasks, the average score of the SEQ is around a 5.2. This confidence interval tells us that we can be fairly confident that this task is harder than average, because the upper boundary of the confidence interval (4.93) is still below the historical average of 5.2.
The student can also use the z.test command in the “TeachingDemos” package. Make sure to install the “TeachingDemos” package first.
Use the z.test command if you want to:
Compute the test of hypothesis and confidence interval on the mean of a population when the standard deviation of the population is known.
Make sure to call the dataset and also provide the standard deviation:
z.test(dataset, sd=value of standard deviation)
library(TeachingDemos)
z.test(responses, sd=2.47, conf.level=0.95)
##
## One Sample z-test
##
## data: responses
## z = 5.5183, n = 14.00000, Std. Dev. = 2.47000, Std. Dev. of the
## sample mean = 0.66014, p-value = 3.422e-08
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 2.349016 4.936698
## sample estimates:
## mean of responses
## 3.642857
Now that, for purely pedagogical reasons, we have the unrealistic situation (of a known population variance) behind us, let’s turn our attention to the realistic situation in which both the population mean and population variance are unknown.
Here is a summary of the formulas:
Summary of confidence interval formulae.
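In brief: when the population standard deviation \(\sigma\) is known, the interval is \(\bar{x} \pm z_{\frac{\alpha}{2}}\cdot \dfrac{\sigma}{\sqrt{n}}\); when \(\sigma\) is unknown and estimated by the sample standard deviation \(s\), the interval is \(\bar{x} \pm t_{\frac{\alpha}{2},\,n-1}\cdot \dfrac{s}{\sqrt{n}}\).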
Over the three-day period from April 1 to April 3, 2015, a national poll surveyed 1500 American households to gauge their levels of discretionary spending. The question asked was how much the respondent spent the day before; not counting the purchase of a home, motor vehicle, or normal household bills. For these sampled households, the average amount spent was
\(\bar{x}= \$ 95\) with a sample standard deviation of \(s = \$185\).
How close will the sample average come to the population mean?
We have:
Sample average=population mean+random error
The Normal Approximation tells us that the distribution of these random errors over all possible samples follows the normal curve with a standard deviation of
\(\dfrac{\sigma}{\sqrt{n}}\).
Notice how the formula for the standard deviation of the average depends on the true population standard deviation \(\sigma\). When the population standard deviation is unknown, like in this example, we can still get a good approximation by plugging in the sample standard deviation \(s\). We call the resulting estimate the Standard Error of the Mean (SEM).
Standard Error of the Mean (SEM) = estimated standard deviation of the sample average =
\(\dfrac{standard\ deviation\ of\ the\ sample}{\sqrt{n}}\)= \(\dfrac{s}{\sqrt{n}}\)
In the example, we have \(s = \$185\) so the Standard Error of the Mean =
\(\dfrac{\$185}{\sqrt{1500}}\)
= \(\$4.78\)
Recap: the estimated daily amount of discretionary spending amongst American households at the beginning of April, 2015 was \(\$95\) with a standard error of \(\$4.78\).
The Normal Approximation tells us, for example, that for 95% of all large samples, the sample average will be within two SEM of the true population average.
Thus, a 95% confidence interval for the true daily discretionary spending would be \(\$95 ± 1.96(\$4.78)\) or \(\$95 ± \$9.56\).
Of course, other levels of confidence are possible. When the sample size is large, \(s\) will be a good estimate of \(\sigma\) and you can use multiplier numbers from the normal curve. When the sample size is smaller (say n < 30), \(s\) will be fairly different from \(\sigma\) for some samples, and that means we need a bigger multiplier number to account for that.
We will then use the \(t\)-distribution.
The t.test command in R is a useful one for finding confidence intervals for the mean when the data are normally distributed with unknown variance. We illustrate the use of this command for the lizard tail length data.
lizard = c(6.2, 6.6, 7.1, 7.4, 7.6, 7.9, 8, 8.3, 8.4, 8.5, 8.6,
8.8, 8.8, 9.1, 9.2, 9.4, 9.4, 9.7, 9.9, 10.2, 10.4, 10.8,
11.3, 11.9)
We can see that \(n=24\):
length(lizard)
## [1] 24
Since the population standard deviation \(\sigma\) is unknown and \(n<30\), we will use the t-distribution to find a confidence interval for the population mean \(\mu\).
t.test(lizard)
##
## One Sample t-test
##
## data: lizard
## t = 30.477, df = 23, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 8.292017 9.499649
## sample estimates:
## mean of x
## 8.895833
It follows that the 95% confidence interval is (8.292017, 9.499649).
The t.test command can also be used to find confidence intervals with levels of confidence different from 95%. We do so by specifying the desired level of confidence using the conf.level option.
t.test(lizard, conf.level = 0.9)
##
## One Sample t-test
##
## data: lizard
## t = 30.477, df = 23, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## 8.395575 9.396092
## sample estimates:
## mean of x
## 8.895833
A hypothesis is an educated guess about something in the world around you. It should be testable, either by experiment or observation. For example:
A new medicine you think might work.
A way of teaching you think might be better.
A possible location of new species.
A fairer way to administer standardized tests.
Hypothesis testing in statistics is a way for you to test the results of a survey or experiment to see if you have meaningful results. You’re basically testing whether your results are valid by figuring out the odds that your results have happened by chance. If your results may have happened by chance, the experiment won’t be repeatable and so has little use.
Determine the null hypothesis and the alternative hypothesis.
Collect and summarize the data into a test statistic.
Use the test statistic to determine the p-value.
The result is statistically significant if the p-value is less than or equal to the level of significance.
If you trace back the history of science, the null hypothesis is always the accepted fact. Simple examples of null hypotheses that are generally accepted as being true are:
DNA is shaped like a double helix.
There are 8 planets in the solar system (excluding Pluto).
Taking Vioxx can increase your risk of heart problems (a drug now taken off the market).
A researcher thinks that if knee surgery patients go to physical therapy twice a week (instead of 3 times), their recovery period will be longer. The average recovery time for knee surgery patients is 8.2 weeks.
The hypothesis statement in this question is that the researcher believes the average recovery time is more than 8.2 weeks. It can be written in mathematical terms as:
\(H_1: \mu > 8.2\)
Next, you’ll need to state the null hypothesis. That’s what will happen if the researcher is wrong. In the above example, if the researcher is wrong then the recovery time is less than or equal to 8.2 weeks. In math, that’s:
\(H_0: \mu \leq 8.2\)
A principal at a certain school claims that the students in his school are of above-average intelligence. A random sample of thirty students’ IQ scores has a mean score of 112.5. Is there sufficient evidence to support the principal’s claim? The mean population IQ is 100 with a standard deviation of 15.
The accepted fact is that the population mean is 100, so:
\(H_0: \mu=100\)
The claim is that the students have above average IQ scores, so:
\(H_1: \mu > 100\)
The fact that we are looking for scores “greater than” a certain point means that this is a one-tailed test.
If you aren’t given an alpha level, use 5% (0.05).
For a one-tailed test, an alpha level of .05 corresponds to a critical z-value of 1.645.
z-score formula: \(z= \dfrac{\bar{x}-\mu_0}{\frac{\sigma}{\sqrt{n}}}\)
For this set of data:
\(z= \dfrac{112.5-100}{15/\sqrt{30}}=4.56\).
Since the test statistic exceeds the critical value (4.56 > 1.645), you can reject the null hypothesis, which stated that the mean IQ score is 100.
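A quick check of this calculation in R, using the values from the example:
xbar = 112.5; mu0 = 100; sigma = 15; n = 30
z = (xbar - mu0)/(sigma/sqrt(n))
z                # about 4.56
z > qnorm(0.95)  # TRUE, so reject the null hypothesis at the 5% level (one-tailed)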
An outbreak of Salmonella-related illness was attributed to ice cream produced at a certain factory. Scientists measured the level of Salmonella in 9 randomly sampled batches of ice cream. The levels (in MPN/g) were:
0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418
Is there evidence that the mean level of Salmonella in the ice cream is greater than 0.3 MPN/g?
Let \(\mu\) be the mean level of Salmonella in all batches of ice cream. Here the hypothesis of interest can be expressed as:
\(H_0: \mu = 0.3\)
\(H_a: \mu > 0.3\)
Create a vector x that contain the values:
x = c(0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418)
Perform a t.test:
t.test(x, alternative="greater", mu=0.3)
##
## One Sample t-test
##
## data: x
## t = 2.2051, df = 8, p-value = 0.02927
## alternative hypothesis: true mean is greater than 0.3
## 95 percent confidence interval:
## 0.3245133 Inf
## sample estimates:
## mean of x
## 0.4564444
From the output we see that the p-value = 0.029. Hence, there is moderately strong evidence that the mean Salmonella level in the ice cream is above 0.3 MPN/g.
The two-sample t-test is used to determine if two population means are equal. A common application is to test if a new process or treatment is superior to a current process or treatment.
There are several variations on this test:
1. For paired samples, the difference \(X_i - Y_i\) is usually calculated.
2. For unpaired samples, the sample sizes for the two samples may or may not be equal. The formulas for paired data are somewhat simpler than the formulas for unpaired data.
3. In some applications, you may want to adopt a new process or treatment only if it exceeds the current treatment by some threshold. In this case, we can state the null hypothesis in the form that the difference between the two population means equals some constant, \(\mu_1-\mu_2=d_0\), where the constant is the desired threshold.
The two-sample t-test for unpaired data is defined as:
\(H_0: \mu_1=\mu_2\)
\(H_a: \mu_1 \neq \mu_2\)
Test Statistic:
\(T=\dfrac{\bar{Y_1}-\bar{Y_2}}{{\sqrt{\frac{s_1^2}{N_1}+{\frac{s_2^2}{N_2}}}}}\)
where \(N_1\) and \(N_2\) are the sample sizes, \(\bar{Y_1}\) and \(\bar{Y_2}\) are the sample means, and \(s_1^2\) and \(s_2^2\) are the sample variances.
If equal variances are assumed, then the formula reduces to:
\(T=\dfrac{\bar{Y_1}-\bar{Y_2}}{s_p{\sqrt{\frac{1}{N_1}+{\frac{1}{N_2}}}}}\)
where \(s_p^2= \dfrac{(N_1-1)s_1^2 + (N_2-1)s_2^2}{N_1+N_2 -2}\)
Reject \(H_0\) that the two means are equal if
\(|T| > t_{1-\frac{\alpha}{2}, \nu}\),
where \(t_{1-\frac{\alpha}{2},\,\nu}\) is the critical value of the t-distribution with \(\nu\) degrees of freedom.
The R function t.test() can be used to perform both one and two sample t-tests on vectors of data.
The function contains a variety of options and can be called as follows:
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
Here x is a numeric vector of data values and y is an optional numeric vector of data values. If y is excluded, the function performs a one-sample t-test on the data contained in x; if it is included, it performs a two-sample t-test using both x and y.
The option mu provides a number indicating the true value of the mean (or difference in means if you are performing a two sample test) under the null hypothesis. The option alternative is a character string specifying the alternative hypothesis, and must be one of the following: “two.sided” (which is the default), “greater” or “less” depending on whether the alternative hypothesis is that the mean is different than, greater than or less than mu, respectively.
6 subjects were given a drug (treatment group) and an additional 6 subjects a placebo (control group).
Their reaction time to a stimulus was measured (in ms).
We want to perform a two-sample t-test comparing the means of the treatment and control groups. Let \(\mu_1\) be the mean of the population taking the medicine and \(\mu_2\) the mean of the untreated population. The question is whether the drug slows the reaction time, so the hypotheses of interest can be expressed as:
\(H_0: \mu_2 - \mu_1=0\)
\(H_a: \mu_2 - \mu_1<0\)
That is, the mean reaction time of the control group is smaller than that of the treatment group. Here we will include the data for the control group in x and the data for the treatment group in y.
We will also need to include the options alternative=“less”, mu=0.
Finally, we need to decide whether or not the standard deviations are the same in both groups.
Below is the relevant R-code when assuming equal standard deviation:
Control = c(91, 87, 99, 77, 88, 91)
Treat = c(101, 110, 103, 93, 99, 104)
t.test(Control,Treat,alternative="less", var.equal=TRUE)
##
## Two Sample t-test
##
## data: Control and Treat
## t = -3.4456, df = 10, p-value = 0.003136
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -6.082744
## sample estimates:
## mean of x mean of y
## 88.83333 101.66667
Below is the relevant R-code when not assuming equal standard deviation:
t.test(Control,Treat,alternative="less")
##
## Welch Two Sample t-test
##
## data: Control and Treat
## t = -3.4456, df = 9.4797, p-value = 0.003391
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -6.044949
## sample estimates:
## mean of x mean of y
## 88.83333 101.66667
Here the pooled t-test and the Welch t-test give roughly the same results (p-value = 0.00313 and 0.00339, respectively). So we reject the null hypothesis, which means that we can conclude (at the 5% significance level) that the mean reaction time for the Control group is less than that for the Treatment group; in other words, the drug appears to slow reaction time.
prop.test can be used for testing the null that the proportions (probabilities of success) in several groups are the same, or that they equal certain given values.
prop.test(x, n, p = NULL, alternative = c("two.sided", "less", "greater"), conf.level = 0.95, correct = TRUE)
x: a vector of counts of successes, a one-dimensional table with two entries, or a two-dimensional table (or matrix) with 2 columns, giving the counts of successes and failures, respectively.
n: a vector of counts of trials; ignored if x is a matrix or a table.
p: a vector of probabilities of success. The length of p must be the same as the number of groups specified by x, and its elements must be greater than 0 and less than 1.
alternative: a character string specifying the alternative hypothesis; must be one of “two.sided” (default), “greater” or “less”. You can specify just the initial letter. Only used for testing the null that a single proportion equals a given value, or that two proportions are equal; ignored otherwise.
conf.level: confidence level of the returned confidence interval. Must be a single number between 0 and 1. Only used when testing the null that a single proportion equals a given value, or that two proportions are equal; ignored otherwise.
correct: a logical indicating whether Yates’ continuity correction should be applied where possible.
Let’s look at an example to illustrate the basic R tests for data proportions. The following example is based on real research, published by Robert Rutledge, MD, and his colleagues in the Annals of Surgery (1993).
In a hospital in North Carolina, the doctors registered the patients who were involved in a car accident and whether they used seat belts. The following matrix represents the number of survivors and deceased patients in each group:
survivors <- matrix(c(1781,1443,135,47), ncol=2)
colnames(survivors) <- c('survived','died')
rownames(survivors) <- c('no seat belt','seat belt')
survivors
## survived died
## no seat belt 1781 135
## seat belt 1443 47
To know whether seat belts made a difference in the chances of surviving, you can carry out a proportion test. This test tells you whether the observed difference between the two proportions can reasonably be attributed to chance: a low p-value indicates that the proportions most likely differ. To test this in R, you can use the prop.test() function on the preceding matrix:
result.prop <- prop.test(survivors)
The prop.test() function then gives you the following output:
result.prop
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: survivors
## X-squared = 24.333, df = 1, p-value = 8.105e-07
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.05400606 -0.02382527
## sample estimates:
## prop 1 prop 2
## 0.9295407 0.9684564
This test report is almost identical to the one from t.test() and contains essentially the same information. At the bottom, R prints the proportion of people who survived in each group. The p-value tells you how likely a difference at least this large would be if the two proportions were actually equal.
So, you see that the chance of dying in a hospital after a crash is lower if you’re wearing a seat belt at the time of the crash. R also reports the confidence interval of the difference between the proportions.
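The two survival proportions can also be computed directly from the table:
survivors[, "survived"]/rowSums(survivors)  # survival rate without and with a seat belt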
Analysis of variance (ANOVA) is a collection of statistical models and their associated estimation procedures (such as the “variation” among and between groups) used to analyze the differences among group means in a sample.
We are often interested in determining whether the means from more than two populations or groups are equal or not. To test whether the difference in means is statistically significant we can perform analysis of variance (ANOVA) using the R function aov().
The first step in our analysis is to graphically compare the means of the variable of interest across groups. It is possible to create side-by-side boxplots of measurements organized in groups using the function plot(). Simply type plot(response ~ factor, data=data_name) where response is the name of the response variable and factor the variable that separates the data into groups. Both variables should be contained in a data frame called data_name.
A drug company tested three formulations of a pain relief medicine for migraine headache sufferers. For the experiment 27 volunteers were selected and 9 were randomly assigned to one of three drug formulations. The subjects were instructed to take the drug during their next migraine headache episode and to report their pain on a scale of 1 to 10 (10 being most pain).
Drug A: 4 5 4 3 2 4 3 4 4
Drug B: 6 8 4 5 4 6 5 8 6
Drug C: 6 7 6 6 7 5 6 5 5
To make side-by-side boxplots of the variable pain grouped by the variable drug we must first read in the data into the appropriate format.
pain=c(4,5,4,3, 2, 4, 3, 4,4, 6, 8, 4,5,4, 6, 5, 8, 6,6, 7, 6,6,7, 5,6,5,5)
The “rep” command replicates a value a specified number of times.
drug=c(rep("A",9), rep("B",9), rep("C",9))
migraine=data.frame(pain,drug)
migraine
## pain drug
## 1 4 A
## 2 5 A
## 3 4 A
## 4 3 A
## 5 2 A
## 6 4 A
## 7 3 A
## 8 4 A
## 9 4 A
## 10 6 B
## 11 8 B
## 12 4 B
## 13 5 B
## 14 4 B
## 15 6 B
## 16 5 B
## 17 8 B
## 18 6 B
## 19 6 C
## 20 7 C
## 21 6 C
## 22 6 C
## 23 7 C
## 24 5 C
## 25 6 C
## 26 5 C
## 27 5 C
Note that the command rep(“A”,9) constructs a list of nine A’s in a row. The variable drug is therefore a vector of length 27 consisting of nine A’s followed by nine B’s followed by nine C’s. If we print the data frame migraine, we can see the format the data should be in, in order to make side-by-side boxplots and perform ANOVA. We can now make the boxplots by typing:
plot(pain~drug, data=migraine)
From the boxplots it appears that the median pain for drug A is lower than that for drugs B and C.
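Numerical group summaries to accompany the boxplots can be obtained with, for example:
tapply(migraine$pain, migraine$drug, mean)  # mean pain score for each drug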
Next, the R function aov() can be used for fitting ANOVA models. The general form is aov(response ~ factor, data=data_name) where response represents the response variable and factor the variable that separates the data into groups. Both variables should be contained in the data frame called data_name. Once the ANOVA model is fit, one can look at the results using the summary() function. This produces the standard ANOVA table.
results=aov(formula = pain~drug, data=migraine)
results
## Call:
## aov(formula = pain ~ drug, data = migraine)
##
## Terms:
## drug Residuals
## Sum of Squares 28.22222 28.44444
## Deg. of Freedom 2 24
##
## Residual standard error: 1.088662
## Estimated effects may be unbalanced
summary(results)
## Df Sum Sq Mean Sq F value Pr(>F)
## drug 2 28.22 14.111 11.91 0.000256 ***
## Residuals 24 28.44 1.185
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Studying the output of the ANOVA table above we see that the F-statistic is 11.91 with a p-value equal to 0.000256. We clearly reject the null hypothesis of equal means for all three drug groups (since \(p<\alpha\)).
Also, recall that a p-value is also a probability, but it comes from a different source than \(\alpha\). Every test statistic has a corresponding probability or p-value. This value is the probability that the observed statistic occurred by chance alone. To emphasize: We reject the null hypothesis if \(p <\alpha\).
Two-way ANOVA. The two-way analysis of variance is an extension of the one-way analysis of variance. There are two independent variables (hence the name two-way).
Assumptions:
- The populations from which the samples were obtained must be normally or approximately normally distributed.
- The samples must be independent.
- The variances of the populations must be equal.
- The groups must have the same sample size.
Hypotheses: There are three sets of hypotheses with the two-way ANOVA. The null hypotheses for each of the sets are given below.
1. The population means of the first factor are equal. This is like the one-way ANOVA for the row factor.
2. The population means of the second factor are equal. This is like the one-way ANOVA for the column factor.
3. There is no interaction between the two factors. This is similar to performing a test for independence with contingency tables.
Download the dataset: dataset_ANOVA_TwoWayInteraction.csv
Import the dataset into R.
dataTwoWayInteraction = read.csv('dataset_ANOVA_TwoWayInteraction.csv')
Display the data:
dataTwoWayInteraction
## Treatment Gender StressReduction
## 1 medical F 1
## 2 medical F 1
## 3 medical F 1
## 4 medical F 1
## 5 medical F 2
## 6 medical F 2
## 7 medical F 3
## 8 medical F 3
## 9 medical F 3
## 10 medical F 3
## 11 medical M 1
## 12 medical M 1
## 13 medical M 2
## 14 medical M 2
## 15 medical M 2
## 16 medical M 2
## 17 medical M 2
## 18 medical M 2
## 19 medical M 3
## 20 medical M 3
## 21 mental F 3
## 22 mental F 3
## 23 mental F 4
## 24 mental F 4
## 25 mental F 4
## 26 mental F 4
## 27 mental F 4
## 28 mental F 4
## 29 mental F 5
## 30 mental F 5
## 31 mental M 2
## 32 mental M 2
## 33 mental M 2
## 34 mental M 2
## 35 mental M 3
## 36 mental M 3
## 37 mental M 4
## 38 mental M 4
## 39 mental M 4
## 40 mental M 4
## 41 physical F 1
## 42 physical F 1
## 43 physical F 1
## 44 physical F 1
## 45 physical F 2
## 46 physical F 2
## 47 physical F 3
## 48 physical F 3
## 49 physical F 3
## 50 physical F 3
## 51 physical M 3
## 52 physical M 3
## 53 physical M 4
## 54 physical M 4
## 55 physical M 4
## 56 physical M 4
## 57 physical M 4
## 58 physical M 4
## 59 physical M 5
## 60 physical M 5
This dataset contains a hypothetical sample of 60 participants who are divided into three stress reduction treatment groups (mental, physical, and medical) and two gender groups (male and female). The stress reduction values are measured on a scale that ranges from 1 to 5. This dataset can be conceptualized as a comparison between three stress treatment programs, one using mental methods, one using physical training, and one using medication, across genders. The values represent how effective the treatment programs were at reducing participants’ stress levels, with higher numbers indicating higher effectiveness.
Let’s run a general omnibus test to assess the main effects and interactions present in the dataset.
Use anova(object) to test the omnibus hypothesis:
anova(lm(StressReduction ~ Treatment * Gender, dataTwoWayInteraction))
## Analysis of Variance Table
##
## Response: StressReduction
## Df Sum Sq Mean Sq F value Pr(>F)
## Treatment 2 23.333 11.6667 17.5 1.384e-06 ***
## Gender 1 1.667 1.6667 2.5 0.1197
## Treatment:Gender 2 23.333 11.6667 17.5 1.384e-06 ***
## Residuals 54 36.000 0.6667
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The lm command stands for linear model (we will talk more about lm next week).
The significant omnibus interaction suggests that we should ignore the main effects and instead investigate the simple main effects for our independent variables. To do so, we need to divide our dataset along each level of our treatment variable. We can create subsets of our dataset using the subset(data, condition) function, where data is the original dataset and condition contains the parameters defining the subset.
Use subset(data, condition) to divide the original dataset:
Medical subset:
dataMedical <- subset(dataTwoWayInteraction, Treatment == 'medical')
dataMedical
## Treatment Gender StressReduction
## 1 medical F 1
## 2 medical F 1
## 3 medical F 1
## 4 medical F 1
## 5 medical F 2
## 6 medical F 2
## 7 medical F 3
## 8 medical F 3
## 9 medical F 3
## 10 medical F 3
## 11 medical M 1
## 12 medical M 1
## 13 medical M 2
## 14 medical M 2
## 15 medical M 2
## 16 medical M 2
## 17 medical M 2
## 18 medical M 2
## 19 medical M 3
## 20 medical M 3
Mental subset:
dataMental <- subset(dataTwoWayInteraction, Treatment == 'mental')
dataMental
## Treatment Gender StressReduction
## 21 mental F 3
## 22 mental F 3
## 23 mental F 4
## 24 mental F 4
## 25 mental F 4
## 26 mental F 4
## 27 mental F 4
## 28 mental F 4
## 29 mental F 5
## 30 mental F 5
## 31 mental M 2
## 32 mental M 2
## 33 mental M 2
## 34 mental M 2
## 35 mental M 3
## 36 mental M 3
## 37 mental M 4
## 38 mental M 4
## 39 mental M 4
## 40 mental M 4
Physical subset:
dataPhysical <- subset(dataTwoWayInteraction, Treatment == 'physical')
dataPhysical
## Treatment Gender StressReduction
## 41 physical F 1
## 42 physical F 1
## 43 physical F 1
## 44 physical F 1
## 45 physical F 2
## 46 physical F 2
## 47 physical F 3
## 48 physical F 3
## 49 physical F 3
## 50 physical F 3
## 51 physical M 3
## 52 physical M 3
## 53 physical M 4
## 54 physical M 4
## 55 physical M 4
## 56 physical M 4
## 57 physical M 4
## 58 physical M 4
## 59 physical M 5
## 60 physical M 5
With datasets representing each of our treatment groups, we can now run an ANOVA for each that investigates the impact of gender. You may notice that this is effectively running three one-way ANOVAs with gender being the independent variable. Therefore, we should control for Type I error by dividing our typical .05 alpha level by three (.017).
Run ANOVA on the treatment subsets to investigate the impacts of gender within each:
medical:
anova(lm(StressReduction ~ Gender, dataMedical))
## Analysis of Variance Table
##
## Response: StressReduction
## Df Sum Sq Mean Sq F value Pr(>F)
## Gender 1 0 0.00000 0 1
## Residuals 18 12 0.66667
mental:
anova(lm(StressReduction ~ Gender, dataMental))
## Analysis of Variance Table
##
## Response: StressReduction
## Df Sum Sq Mean Sq F value Pr(>F)
## Gender 1 5 5.0000 7.5 0.0135 *
## Residuals 18 12 0.6667
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
physical:
anova(lm(StressReduction ~ Gender, dataPhysical))
## Analysis of Variance Table
##
## Response: StressReduction
## Df Sum Sq Mean Sq F value Pr(>F)
## Gender 1 20 20.0000 30 3.345e-05 ***
## Residuals 18 12 0.6667
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
At an \(\alpha\) level of .017, the gender effect within the mental (p = .014) and physical (p < .001) groups was statistically significant. In the mental condition, the means are 3 for males and 4 for females. In the physical condition, the means are 4 for males and 2 for females. These results suggest that the mental treatment is more effective in reducing stress for females than males, while the physical treatment is more effective for males than females. Further, there is insufficient statistical support for a gender difference in the medical treatment.
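A quick way to see these group means directly from the data:
aggregate(StressReduction ~ Treatment + Gender, dataTwoWayInteraction, mean)  # mean stress reduction by treatment and gender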
The data for this example is the ages of male and female actors who won the Oscar for their work in a leading role. These Oscar winners are from twelve consecutive years.
Female: 26, 25, 33, 35, 35, 28, 30, 29, 61, 32, 33, 45
Male: 46, 40, 36, 47, 29, 43, 37, 38, 45, 50, 48, 60
Create side-by-side boxplots to compare the median age by gender for Oscar winners over 12 consecutive years:
F <- c(26, 25, 33, 35, 35, 28, 30, 29, 61, 32, 33, 45)
M <- c(46, 40, 36, 47, 29, 43, 37, 38, 45, 50, 48, 60)
genders <- c("Female", "Male")
genders
## [1] "Female" "Male"
The boxplot:
boxplot(F, M, names=genders, horizontal = TRUE,
main = "Ages of Oscar Winning Actors", xlab = "Age")
It is clear that the median age of male Oscar winners is higher than the median age of female Oscar winners.
There are many ways to create a scatterplot in R. We will use the “easyGgplot2” package:
library(devtools)
install_github("kassambara/easyGgplot2")
Then we load the package:
library(easyGgplot2)
## Loading required package: ggplot2
Create a dataframe “df” that contains all the rows and the following columns from the “mtcars”-dataset:
df <- mtcars[, c("mpg", "cyl", "wt", "qsec", "vs")]
Take a look at the first few rows:
head(df)
## mpg cyl wt qsec vs
## Mazda RX4 21.0 6 2.620 16.46 0
## Mazda RX4 Wag 21.0 6 2.875 17.02 0
## Datsun 710 22.8 4 2.320 18.61 1
## Hornet 4 Drive 21.4 6 3.215 19.44 1
## Hornet Sportabout 18.7 8 3.440 17.02 0
## Valiant 18.1 6 3.460 20.22 1
Create a basic scatter plot of mpg against wt: we use the built-in “mtcars” data to see whether there is an association between the weight of a car and its miles per gallon.
ggplot2.scatterplot(data=df, xName='wt',yName='mpg')
Add linear regression line:
ggplot2.scatterplot(data=df, xName='wt',yName='mpg',
addRegLine=TRUE, regLineColor="blue")
We set color/shape by another variable (cyl):
library(devtools)
library(ggplot2)
library(easyGgplot2)
ggplot2.scatterplot(data=df, xName='wt',yName='mpg',
groupName="cyl")
It is clear that there is a negative correlation between the weight of a car and the miles per gallon. The heavier the car, the lower the miles per gallon.
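The correlation coefficient confirms the direction and strength of this relationship:
cor(df$wt, df$mpg)  # about -0.87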
The general purpose of multiple regression is to learn more about the relationship between several independent or predictor variables and a dependent or criterion variable. For example, a real estate agent might record for each listing the size of the house (in square feet), the number of bedrooms, the average income in the respective neighborhood according to census data, and a subjective rating of appeal of the house. Once this information has been compiled for various houses it would be interesting to see whether and how these measures relate to the price for which a house is sold. For example, you might learn that the number of bedrooms is a better predictor of the price for which a house sells in a particular neighborhood than how “pretty” the house is (subjective rating). You may also detect “outliers,” that is, houses that should really sell for more, given their location and characteristics.
When you choose to analyse your data using multiple regression, part of the process involves checking to make sure that the data you want to analyse can actually be analysed using multiple regression. You need to do this because it is only appropriate to use multiple regression if your data “passes” eight assumptions that are required for multiple regression to give you a valid result.
Assumption #1:
Your dependent variable should be measured on a continuous scale (i.e., it is either an interval or ratio variable). Examples of variables that meet this criterion include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so forth.
Assumption #2:
You have two or more independent variables, which can be either continuous (i.e., an interval or ratio variable) or categorical (i.e., an ordinal or nominal variable).
Assumption #3:
You should have independence of observations.
Assumption #4:
There needs to be a linear relationship between
the dependent variable and each of your independent variables, and
the dependent variable and the independent variables collectively. Whilst there are a number of ways to check for these linear relationships, we suggest creating scatterplots.
Assumption #5:
Your data needs to show homoscedasticity, which is where the variances along the line of best fit remain similar as you move along the line.
Assumption #6:
Your data must not show multicollinearity, which occurs when you have two or more independent variables that are highly correlated with each other. This leads to problems with understanding which independent variable contributes to the variance explained in the dependent variable, as well as technical issues in calculating a multiple regression model.
Assumption #7:
There should be no significant outliers, high leverage points or highly influential points. Outliers, leverage and influential points are different terms used to represent observations in your data set that are in some way unusual when you wish to perform a multiple regression analysis. These different classifications of unusual points reflect the different impact they have on the regression line.
Assumption #8:
Finally, you need to check that the residuals (errors) are approximately normally distributed (we explain these terms in our enhanced multiple regression guide). Two common methods to check this assumption include using:
a histogram (with a superimposed normal curve) and a Normal P-P Plot; or
a Normal Q-Q Plot of the studentized residuals.
The function ‘lm’ can be used to perform multiple linear regression in R, and much of the syntax is the same as that used for fitting simple linear regression models. To perform multiple linear regression with p explanatory variables, use the command:
lm(response ~ explanatory_1 + explanatory_2 + … + explanatory_p)
Here the terms ‘response’ and ‘explanatory_i’ in the function should be replaced by the names of the response and explanatory variables, respectively, used in the analysis.
Consider the data set “mtcars” available in the R environment. It gives a comparison between different car models in terms of mileage per gallon (mpg), cylinder displacement(“disp”), horse power(“hp”), weight of the car(“wt”) and some more parameters.
The goal of the model is to establish the relationship between “mpg” as a response variable with “disp”,“hp” and “wt” as predictor variables. We create a subset of these variables from the mtcars data set for this purpose.
input <- mtcars[,c("mpg","disp","hp","wt")]
print(head(input))
## mpg disp hp wt
## Mazda RX4 21.0 160 110 2.620
## Mazda RX4 Wag 21.0 160 110 2.875
## Datsun 710 22.8 108 93 2.320
## Hornet 4 Drive 21.4 258 110 3.215
## Hornet Sportabout 18.7 360 175 3.440
## Valiant 18.1 225 105 3.460
Create the relationship model:
input <- mtcars[,c("mpg","disp","hp","wt")]
# Create the relationship model.
model <- lm(mpg~disp+hp+wt, data = input)
# Show the model.
print(model)
##
## Call:
## lm(formula = mpg ~ disp + hp + wt, data = input)
##
## Coefficients:
## (Intercept) disp hp wt
## 37.105505 -0.000937 -0.031157 -3.800891
So the fitted regression equation is:
\(\widehat{mpg} = a + b_1\cdot disp + b_2\cdot hp + b_3\cdot wt\)
which, with the estimated coefficients, becomes:
\(\widehat{mpg} = 37.105505 - 0.000937\cdot disp - 0.031157\cdot hp - 3.800891\cdot wt\)
We can use the regression equation created above to predict the mileage when a new set of values for displacement, horse power and weight is provided.
For a car with disp = 221, hp = 102 and wt = 2.91, the predicted mileage is:
\(\widehat{mpg} = 37.105505 + 221 \cdot (-0.000937)+102 \cdot (-0.031157)+2.91\cdot(-3.800891) \approx 22.66.\)
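The same prediction can be obtained with the ‘predict’ function (the data frame name ‘newcar’ is just for this illustration):
newcar = data.frame(disp=221, hp=102, wt=2.91)
predict(model, newdata=newcar)  # should match the hand calculation above, about 22.66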