Week 1:

1.1 Introduction to R and RStudio

1.1.1 Download R and RStudio

What is R?

  • R is a free software environment originally designed for statistical computing and graphics. It has grown considerably in capability and usability over the years. Now it is an excellent “jackknife” for anything that you might want to do with data.
  • R is open-source and open-development, which means that thousands of users contribute improvements and enhancements, without changing the basic functionality.


Why use R?

  • It’s free!
  • Statistics plus data management/representation/analysis
  • Great graphics
  • Ever-expanding capabilities (just about any statistics under the sun)
  • Can easily import data from most other major programs (including SPSS, SAS, Stata, Excel)
  • Beautifully-formatted output (R notebooks, R bookdown)
  • Not too hard to learn (especially after this workshop!)
  • Comprehensive help resources (internal help; r-bloggers.com; Google search)
  • Excellent video tutorials: MarinStatsLectures



Capabilities of R:

What is RStudio?


RStudio is a graphical user interface for R which includes a set of integrated tools designed to help you be more productive with R. It includes:

  • Console, which displays executed R commands
  • An editor (with syntax highlighting). Code may be executed directly from the editor.
  • History viewer (record of past commands)
  • Environment viewer (displays all variables in your current workspace)
  • Package installer (adds new capabilities to R)
  • Plot viewer / exporter
  • Help window


1.3 Downloading and installing R and RStudio (Windows)



1.4 Exercises:

  1. Install R (either from USB or from the Internet)
  2. Install RStudio (either from USB or from the Internet)

Note: Once R and RStudio are installed, it is not necessary to start R separately; RStudio will start it for you.


2. Probability: Basic Concepts and Random Variables

Measurement Scales:

There are four types of measurement scales:

  1. Nominal/Categorical

  2. Ordinal

  3. Interval

  4. Ratio

These are simply ways to categorize different types of variables.

1. Nominal/Categorical:

  • Used to label variables without any quantitative value.
  • “Nominal” scales could also be called “labels.”
  • Categories are mutually exclusive.
  • Categories have no intrinsic ranking.

Examples:

  • Gender
  • Zip code
  • Eye color

2. Ordinal:

  • Allows values to be rank-ordered (1st, 2nd, 3rd, etc.) so that data can be sorted.
  • Does not convey the relative degree of difference between ranks.

Examples:

  • ‘sick’ vs. ‘healthy’ when measuring health,
  • ‘guilty’ vs. ‘not guilty’ when making judgments in courts,
  • ‘wrong/false’ vs. ‘right/true’ when measuring truth value,
  • a spectrum of values, such as ‘completely agree’, ‘mostly agree’, ‘mostly disagree’, ‘completely disagree’ when measuring opinion.

3. Interval:

  • Provides information about order,
  • Possesses equal intervals between adjacent values (but no true zero).

Examples:

  • Temperature (in degrees Celsius or Fahrenheit),
  • Time of day.

4. Ratio:

  • Has an absolute zero (a point where none of the quantity being measured exists),
  • Allows meaningful ratios between values (e.g., a 40 kg weight is twice a 20 kg weight),
  • Is a quantitative variable.

Examples:

  • Salary
  • Height
  • Weight

Whether you know it or not, you are a fortune teller of sorts. Every day, all day, you are constantly predicting what will happen:

  1. You choose clothes based on what you think the weather will be (or, be honest, what you have that’s clean regardless of the weather).

  2. You choose what table to sit at in the cafeteria based on where you think your friends will sit.

  3. You choose and choose and choose, and every choice is a prediction of how likely you think an event or series of events is to happen. We can actually measure how likely it is for something to happen, and that measurement is called probability.

Fig. 5.1

Probability and relative frequency

Example 1:

Flip a coin. What is your sample space?

Example 2:

Let’s start with dice. Take a die (make sure it’s fair, not weighted or “funny” in any way). What is the probability of rolling a 3 if you roll the die one time? Hopefully you figured out that it is one in six, because there are six sides to the die, and one of those sides has three dots.

Let’s try it. Take the die and roll it 100 times, recording your results below. Calculate the percentage of each result (for example, if you rolled a 2 a total of 17 times, that would be 17/100, or 17 percent).

Fig. 5.2

We would expect the percentage for each number to hover around 16 or 17 percent, since 1/6 ≈ 0.167. This is probability in a nutshell.

Now guess what the percentage would be if you added up the percentages of the rolls of only the odd-numbered sides. When you add up those rolls, does the percentage come close to your guess?

Probability is the number of ways an event can occur divided by the total number of possible outcomes.

In the case of our die, there are six possible outcomes, and each number appears on exactly one side, so the probability of any particular number on a single roll is 1/6. If there were no dots on any of the sides, the probability of rolling a 3 would be zero because no side would show three dots, giving us the ratio 0/6 = 0. If every side had three dots, the probability of rolling a 3 would be 1 because it would be 6/6, or 1. So, probability is expressed as a number somewhere between 0 (not gonna happen) and 1 (definitely going to happen), with values closer to 1 being more likely.

Let’s put it in formula form (Fig. 5.3): \(P(\text{event}) = \dfrac{\text{number of desired outcomes}}{\text{number of possible outcomes}}\)

Using the formula:

  1. What is the probability of getting heads on a coin toss?

  2. If you go to school Monday through Friday and you know the cafeteria is going to serve pizza two days that week, what is the probability that pizza will be served on any given day? It would be 2/5, because there are two desired outcomes (pizza!) and five possible outcomes (days of the school week).

So far, we’ve just looked at things that could occur. What about looking at things that have actually happened?

We call that relative frequency, and it has its own formula:

Fig. 5.4: \(\text{relative frequency} = \dfrac{\text{number of times the event actually occurred}}{\text{total number of trials}}\)

Think back to your die experiment above. The first formula gives you your expectation (1/6). The formula for relative frequency gives you the actual outcome. What was the relative frequency of rolling 5s in your die-rolling experiment?

The law of large numbers

The more times we roll the die, the closer we will get to the outcome we expected (1/6). We call that the Law of Large Numbers — even if you don’t get it to come out like you expect with a few tries, the more you do it, the closer you will come to the expectation.

Flip a penny 10 times and record how many times it lands on heads and how many times it lands on tails.

Heads: ____________

Tails: ____________

Now flip it 100 times and record your results.

Heads: ____________

Tails: ____________

Did you get closer to a ½ ratio the second time? That’s the Law of Large Numbers at work. Remember, the Law of Large Numbers tells us that the more times you repeat an experiment, the closer the relative frequency will come to the probability.
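We can also let R do the repetition for us. Here is a minimal sketch (the object name rolls is ours) that simulates many die rolls and shows the relative frequency of a 3 creeping toward the probability 1/6:

set.seed(1)                      # makes the simulation reproducible
rolls <- sample(1:6, size=10000, replace=TRUE)
mean(rolls[1:10] == 3)           # relative frequency of a 3 after 10 rolls
mean(rolls[1:100] == 3)          # after 100 rolls
mean(rolls == 3)                 # after 10,000 rolls (close to 1/6 ≈ 0.167)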

Calculating possible outcomes

We have one more thing to learn before we move on. Let’s figure out how to calculate the possible outcomes of an event. We’ve figured out the probability of simple events occurring, but what happens when the possible outcomes are harder to figure out? When you are rolling one die or flipping one coin, it’s simple to figure out possible outcomes, but it gets more complicated when you add in more dice or more coins.

Imagine that your parents pay you an allowance of 50¢ a week. Let’s say you love nickels. What is the probability that your 50¢ will contain a nickel? We need to figure out all the possible ways to give someone 50¢. Let’s say that your parents don’t ever use pennies or 50-cent pieces. Besides those, there are three possible coins to use: nickels, dimes and quarters. If your parents pay you in quarters, it’s simple, right? Two quarters make 50¢. But what other possible combinations make 50¢? And how likely are you to get that nickel you want? Let’s set it up in the table below. First, across the top we’ll list the possible coins. Next, we’ll start listing possible combinations by listing the greatest possible number of that type of coin and then decreasing that by one on the next line. Once we’ve gotten to zero of that coin, let’s move to the highest level possible of the next highest coin. Sound confusing? It’s not once you get going. Let’s try it:

First, quarters. We list two quarters, and that makes 50¢ by itself, so the columns for dimes and nickels are zero. There are no other possible combinations with two quarters, so the next line lists one quarter. Remember, we are going from largest to smallest, so first we’ll try one quarter and the greatest number of dimes possible, which is two. There are two other possibilities with one quarter, so we’ll list those on the next lines and then the next lower number of dimes, which is one. That means we’ll need three nickels because we’ve always got to add up to fifty cents.

Fig. 5.5

Now we’re out of possibilities that use quarters, so we move on to dimes. The most dimes we could have is five, so we start with that. We decrease that number by one on each line. For every dime we take away, we have to add two nickels, so notice that the nickels increase by two each line. We end up with 10 possible combinations of coins.

So, how many of the 10 possibilities contain nickels? Did you count eight? You’re right! So let’s put this in our probability formula:

8 = desired (or likely) outcomes

10 = possible outcomes

So 8/10.

Does that seem like good odds to you? Not bad!

What are the odds if you want a quarter? How many possibilities contained a quarter? Did you find four? So the odds of getting a quarter are: 4 = likely outcomes

10 = possible outcomes

which gives 4/10.

Are you more likely to get a quarter or a nickel? All other things being equal (meaning your parents have a wide variety of coins and aren’t out of dimes or something), you are far more likely to get a nickel than a quarter.
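If you would like to let R check this enumeration, here is a minimal sketch (the object name combos is ours):

# list every combination of quarters, dimes, and nickels, then keep those worth 50 cents
combos <- expand.grid(quarters=0:2, dimes=0:5, nickels=0:10)
combos <- combos[25*combos$quarters + 10*combos$dimes + 5*combos$nickels == 50, ]
nrow(combos)                 # 10 possible combinations
mean(combos$nickels > 0)     # 8/10 of them contain at least one nickel
mean(combos$quarters > 0)    # 4/10 of them contain at least one quarter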

In summary, the law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.

The LLN is important because it “guarantees” stable long-term results for the averages of some random events. For example, while a casino may lose money in a single spin of the roulette wheel, its earnings will tend towards a predictable percentage over a large number of spins. Any winning streak by a player will eventually be overcome by the parameters of the game. It is important to remember that the LLN only applies (as the name indicates) when a large number of observations is considered.

Assessment:

  1. Suppose you go to the store and there are six kinds of cereal. Your mom tells you to pick one. When she comes back, you ask her to guess which one you chose. Assuming she had no idea beforehand what your choice would be, how likely is it that she will guess the correct one? _____________________________________________________________________________________________

  2. Let’s say you are playing a coin-tossing game with a friend. You toss the same coin 500 times, and 400 times it comes up heads. Would this be normal? What could explain it? ____________________________________________________________

  3. You are bored one day, so you start rolling a die. You roll it 10 times and you get a 6 eight of the times. You think this is strange, so you keep rolling. You roll 100 more times and only get eight more 6s, leaving a total of 16. What is the rule that accounts for this scenario? ___________________________________________________________________________

  4. What is the relative frequency of the final total of the 6’s rolled in Question 3? _____________________________________________________________________________________________

  5. Someone hands you a standard deck of 52 cards. There are four queens. What is the likelihood that you will draw a queen from the deck the first try? _____________________________________________________________________________________________

  6. Let’s say you earn $20 doing chores (lots of them), and you’re being paid in cash. Create a table of possible outcomes that shows the possible combinations of $10’s, $5’s and $1’s that would total your $20. (Look back at the table we did before to see how to set this problem up.)

  7. You and a friend (who has no understanding of probability) are at the mall on a hot Tuesday afternoon in the middle of the summer. She offers to buy you a snow cone if you can guess the types of dollar bills she has in her wallet in fewer than three guesses. She tells you she has $20 and none of the bills are $1’s or a $20. If you don’t guess correctly by the third try, you have to give her a quarter. Are your odds of guessing correctly greater than ½?

Week 2: Statistical Inference.

2.1 How to calculate the mean in R:

Example: Suppose you are interested in the distribution of ages for employees working in a certain office. The following data is available: 36, 25, 38, 46, 55, 68, 72, 55, 36, 38, 67, 45, 22, 48, 91, 46, 52, 61, 58, 55. We use R to calculate the mean of the data set.

2.1.1 We create an object “age” that will contain the data as follows:

age<-c(36, 25, 38, 46, 55, 68, 72, 55, 36, 38, 67, 45, 22, 48, 91, 46, 52, 61, 58, 55)

2.1.2 Now we calculate the mean age of the employees:

mean(age)
## [1] 50.7

The mean age of the employees is: 50.7 years.

The output appears in the Console.

2.2 Statistical inference:

2.2.1 Definition:

Statistical inference is inference about a population from a random sample drawn from it or, more generally, about a random process from its observed behavior during a finite period of time. It includes:

  1. point estimation

  2. interval estimation

  3. hypothesis testing (or statistical significance testing)

  4. Inferential statistics and prediction

2.2.2 Point estimation:

In statistics, point estimation involves the use of sample data to calculate a single value (known as a statistic) which is to serve as a “best guess” for an unknown (fixed or random) population parameter.

More formally, it is the application of a point estimator to the data.

The purpose of an inferential statistic is to make a statement about the population from which a sample was taken. Assume, for example, that you are interested in the mean level of anxiety suffered by undergraduate students. You could measure the anxiety level of every undergraduate student from every department at every university and thereby learn the true value, but this is impractical. So, instead, you take a sample of undergraduates, measure only their anxiety levels, and then make an inference (i.e., a best guess) about the entire population.

The sample mean, \(\bar{x}\), is an unbiased estimator of the population mean, \(\mu\).

Example 2.2.3:

The U.S. Census Bureau publishes annual price figures for new mobile homes in Manufactured Housing Statistics. The figures are obtained from sampling, not from a census. A simple random sample of 36 new mobile homes yielded the prices, in thousands of dollars, shown in the Table below:

Sample of mobile home prices (in thousands of dollars):

67.8  67.1  49.9  56.0  68.4  73.4
56.5  76.7  59.2  63.7  71.2  76.8
56.9  57.7  59.1  60.6  63.9  66.7
64.3  74.5  62.2  61.7  64.0  57.9
55.6  55.5  55.9  70.4  72.9  49.3
51.3  63.8  62.6  72.9  53.7  77.9

I created a .csv file available on Canvas under “Project 2” with the data for “Mobile Home Prices.”

Import the data and save it in the object MHPrice, either with read.csv (as below) or by clicking “Import Dataset” in the Environment window of RStudio:

MHPrice=read.csv("/Users/dekock/Desktop/Mhomeprices.csv", header=TRUE)

Use the data to estimate the population mean price, \(\mu\), of all new mobile homes.

We find the sample mean, \(\bar{x}\):

mean(MHPrice$Price)
## [1] 63.27778

Solution:

Based on the sample data, we estimate the mean price, \(\mu\), of all new mobile homes to be approximately \(\$63.28\) thousand, that is, \(\$63,280\).

To emphasize, an estimate of this kind is called a point estimate for \(\mu\) because it consists of a single number, or point.

Example 2.2.4:

Calories per \(\frac{1}{2}\)-cup serving for 16 popular chocolate ice creams are shown below. Find a point estimate for the proportion with more than 190 calories per serving.

270, 170, 160, 160, 199, 160, 150, 180, 150, 140, 160, 290, 191, 170, 110, 170

Solution:

We see that the following values are greater than 190:

270, 199, 290, 191.

We have a total of 16 data points, so the sample proportion of values greater than 190 is \(\widehat{p}=\dfrac{4}{16}=0.25\). We use \(\widehat{p}\) to estimate the proportion \(p\) of chocolate ice creams with more than 190 calories per serving.
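As a quick check in R, here is a minimal sketch (the object name cal is ours):

cal <- c(270, 170, 160, 160, 199, 160, 150, 180, 150, 140, 160, 290, 191, 170, 110, 170)
mean(cal > 190)    # sample proportion of servings with more than 190 calories: 0.25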

2.3 Confidence-Interval Estimate:

As we know, a sample mean, \(\bar{x}\) is usually not equal to the population mean \(\mu\); generally, there is sampling error. Therefore, we should accompany any point estimate of \(\mu\) with information that indicates the accuracy of that estimate. This information is called a confidence-interval estimate for \(\mu\), which we introduce in the next example.

Example 2.3.1:

Consider again the problem of estimating the (population) mean price, \(\mu\), of all new mobile homes by using the sample data in the Table above. Let’s assume that the population standard deviation \(\sigma\), of all such prices is \(\$7200\).

  1. Identify the distribution of the variable \(\bar{x}\), that is, the sampling distribution of the sample mean for samples of size 36.

  2. Use part (a) to show that 95% of all samples of 36 new mobile homes have the property that the interval from \(\bar{x} − 2.4\) to \(\bar{x} + 2.4\) contains \(\mu\).

  3. Use part (b) and the sample data in the Table above to find a \(95\%\) confidence interval for \(\mu\), that is, an interval of numbers that we can be \(95\%\) confident contains \(\mu\).

Solution:
  1. A histogram of the price data in the Table above suggests that the prices of new mobile homes are approximately normally distributed.
hist(MHPrice$Price)

Because \(n = 36\), \(\sigma = 7.2\), and prices of new mobile homes are normally distributed, it follows that

  1. \(\mu_{\bar{x}} = \mu\)

  2. \(\sigma_{\bar{x}}={\frac{\sigma}{\sqrt{n}}}=\frac{7.2}{\sqrt{36}}=1.2\)

  3. \(\bar{x}\) is normally distributed.

In other words, for samples of size 36, the variable \(\bar{x}\) is normally distributed with mean \(\mu\) and standard deviation 1.2.

  2. The “95” part of the 68-95-99.7 rule states that, for a normally distributed variable, \(95\%\) of all possible observations lie within two standard deviations to either side of the mean. Applying this rule to the variable \(\bar{x}\) and referring to part (a), we see that \(95\%\) of all samples of 36 new mobile homes have mean prices within \(2 · 1.2 = 2.4\) of \(\mu\). Equivalently, \(95\%\) of all samples of 36 new mobile homes have the property that the interval from \(\bar{x} − 2.4\) to \(\bar{x} + 2.4\) contains \(\mu\).
Empirical rule

  3. Because we are taking a simple random sample, each possible sample of size 36 is equally likely to be the one obtained. From part (b), \(95\%\) of all such samples have the property that the interval from \(\bar{x} − 2.4\) to \(\bar{x} + 2.4\) contains \(\mu\). Hence, chances are \(95\%\) that the sample we obtain has that property. Consequently, we can be \(95\%\) confident that the sample of 36 new mobile homes whose prices are shown in the Table above has the property that the interval from \(\bar{x} − 2.4\) to \(\bar{x} + 2.4\) contains \(\mu\). For that sample, \(\bar{x} = 63.28\), so \(\bar{x} − 2.4 = 63.28 − 2.4 = 60.88\) and \(\bar{x} + 2.4 = 63.28 + 2.4 = 65.68\). Interpretation: We can be \(95\%\) confident that the mean price, \(\mu\), of all new mobile homes is somewhere between \(\$60,880\) and \(\$65,680\).
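We can reproduce this interval in R. A minimal sketch, assuming MHPrice was imported as above:

xbar <- mean(MHPrice$Price)                          # 63.27778
sigma <- 7.2                                         # known population standard deviation (thousands of dollars)
n <- 36
c(xbar - 2*sigma/sqrt(n), xbar + 2*sigma/sqrt(n))    # approximately 60.88 and 65.68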

Week 3:

Description of Data: Statistics and Graphical Description


A histogram is a visual representation of the distribution of a dataset. The shape of a histogram allows you to easily see where most of the data is situated. In particular, you can see where the middle of the distribution is located, how closely the data lie around the middle, and where possible outliers are to be found. As shown in the figures below, a histogram consists of an x-axis, a y-axis and bars of different heights. The x-axis is divided into intervals (called “bins”), and on each bin a vertical bar is constructed whose height represents the number of data values within that bin. Note that histograms (unlike bar charts) don’t have gaps between the bars (if it looks like there’s a gap, that’s because that particular bin has no data in it).


Example: Suppose you are interested in the distribution of ages for employees working in a certain office. The following data is available: 36, 25, 38, 46, 55, 68, 72, 55, 36, 38, 67, 45, 22, 48, 91, 46, 52, 61, 58, 55. We use R to construct a histogram to represent the distribution of the data.

age<-c(36, 25, 38, 46, 55, 68, 72, 55, 36, 38, 67, 45, 22, 48, 91, 46, 52, 61, 58, 55)
hist(age)

The ‘hist’ command has many options that enable the user to change the display. For example, the user can control the number of bins using the ‘breaks’ option, set the title of the histogram using the ‘main’ option, and set the x- and y-axis labels using the ‘xlab’ and ‘ylab’ options.


Example: The following command creates a histogram with 7 nonempty bins, with title “Age of Employees” and x label “Employee ages”:

hist(age,breaks=7,main="Age of Employees",xlab="Employee ages")

For other available options, type ‘help("hist")’ in the console and read the information in the ‘Help’ window.

The command ‘xyplot’ can be used to plot one variable against another. The command uses the ‘lattice’ package, so before using it you must load the package.


Example: Load a new package called ‘lattice’.

library(lattice)

If you get an error message, it probably means you haven’t installed ‘lattice’. In this case, go back to “R_RStudioWindows” and follow the instructions found in the section ‘Packages window’.

To demonstrate ‘xyplot’ we will be using data from the ‘mosaicData’ package, so you must load this package as well.


Example: Load the package called ‘mosaicData’.

library(mosaicData)

You will find ‘mosaicData’ listed in the ‘Packages’ window.


Click on the blue word “mosaicData” to see the different data sets:


Click on “Births78” to see a description of that particular dataset:

Example: Create an xyplot to display the number of births in the United States for each day in 1978.

require(mosaic)
## Loading required package: mosaic
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: ggformula
## Loading required package: ggplot2
## 
## New to ggformula?  Try the tutorials: 
##  learnr::run_tutorial("introduction", package = "ggformula")
##  learnr::run_tutorial("refining", package = "ggformula")
## Loading required package: Matrix
## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.
## 
## Note: If you use the Matrix package, be sure to load it BEFORE loading mosaic.
## 
## Attaching package: 'mosaic'
## The following object is masked from 'package:Matrix':
## 
##     mean
## The following object is masked from 'package:ggplot2':
## 
##     stat
## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally
## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cor.test, cov, fivenum, IQR, median,
##     prop.test, quantile, sd, t.test, var
## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum
require(mosaicData)
xyplot(births~Births78$dayofyear, data=Births78)
## Warning in order(as.numeric(x)): NAs introduced by coercion
## Warning in diff(as.numeric(x[ord])): NAs introduced by coercion
## Warning in (function (x, y, type = "p", groups = NULL, pch = if
## (is.null(groups)) plot.symbol$pch else superpose.symbol$pch, : NAs
## introduced by coercion


The ‘xyplot’ command also has many options that enable the user to change the display. The options ‘main’, ‘xlab’, and ‘ylab’ give the title, x-label, and y-label (note these are exactly the same as for the ‘hist’ command).


Example: The following command creates an xy plot with title “Births per Day”, x-label “Day of Year”, and y-label “Births”:

require(mosaic)
require(mosaicData)
xyplot(Births78$births~Births78$dayofyear, data=Births78, xlab="Day of Year", ylab="Births", main="Births per Day")
## Warning in order(as.numeric(x)): NAs introduced by coercion
## Warning in diff(as.numeric(x[ord])): NAs introduced by coercion
## Warning in (function (x, y, type = "p", groups = NULL, pch = if
## (is.null(groups)) plot.symbol$pch else superpose.symbol$pch, : NAs
## introduced by coercion


From the ‘Plots’ window, you can navigate to previous plots, using the back arrow, and also export plots in various formats after interactively resizing them.


  1. For the data in “Births78”, plot the number of births as a function of day of the week. Can you explain your results?
  2. Extract the data for weekends only, and plot as a function of the day of the year. Can you explain your results?


Box and whiskers plots


Boxplots can be created for individual variables or for variables by group. The format is ‘boxplot(x, data=y)’, where ‘x’ is a formula and ‘data=’ denotes the data frame providing the data.

To illustrate this command, we will use the ‘Dimes’ dataset in the ‘mosaicData’ package:


We click on the ‘Dimes’ dataset to access a description of the data frame:


It’s clear that the data frame contains 2 variables: ‘mass’ and ‘year’. We are interested in the ‘mass’ of dimes.


Example: We can access the ‘mass’ column by using the “$” sign as follows:

Dimes$mass
##  [1] 2.259 2.247 2.254 2.298 2.287 2.254 2.268 2.214 2.268 2.274 2.271
## [12] 2.268 2.298 2.292 2.274 2.234 2.238 2.252 2.249 2.234 2.275 2.230
## [23] 2.236 2.233 2.255 2.277 2.256 2.282 2.235 2.235


Example: We are interested in the median mass, so we plot a box plot as follows:

boxplot(Dimes$mass)


Example: We can also find the max, min, median, and quartiles using the ‘summary’ command:

summary(Dimes$mass)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.214   2.236   2.256   2.258   2.274   2.298


Exercise: This exercise shows how to compare dime masses before 2000 to dime masses after 2000 by creating a pair of box plots side by side. (A possible sketch follows the list.)

  1. Create an xy plot of ‘Dimes’ mass as a function of year. What difference do you notice between the earlier data and the later data?
  2. Create a vector ‘split.at.2000’ that records, for each dime, whether its ‘year’ is before 2000 (i.e., test ‘Dimes$year < 2000’).
  3. Change the values in split.at.2000 from “TRUE” to “Before 2000”.
  4. Change all values in split.at.2000 from “FALSE” to “2000 and After”.
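One possible way to put these steps together (a sketch, not the only valid answer):

split.at.2000 <- ifelse(Dimes$year < 2000, "Before 2000", "2000 and After")
boxplot(Dimes$mass ~ split.at.2000, ylab="Mass (g)")   # side-by-side box plots of the two groups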

Bar graphs are produced with the ‘barplot(height)’ function, where height is a vector or matrix. If you want to do a bar graph for a single categorical variable, you must first create a frequency table for the values. To show this, we use the ‘HELPrct’ dataset, which you may view under the ‘mosaicData’ package:



We are interested in the ‘substance’ variable:



Example: Creating a bar plot. First we create a table of frequencies.

freq=table(HELPrct$substance)
freq
## 
## alcohol cocaine  heroin 
##     177     152     124

Then, we can create the bar plot:

barplot(freq)



Again, there are settings we can change to make this look nicer. The option ‘col’ changes the bar colors. The ‘main’ option changes the title of the plot. The ‘xlab’ and ‘ylab’ options change the x-axis and y-axis titles. The ‘las’ option changes the orientation of the x- and y-axis labels.


Example: Changing settings in a bar plot.

barplot(freq, col="deepskyblue", main="Frequencies", xlab="Substance", ylab="Count", las=1)
box()

For a horizontal bar graph, we add the ‘horiz=TRUE’ option. The option ‘las’ changes the x- and y-axis label orientations. The option ‘cex.axis’ changes the size of the x-axis labels and ‘cex.names’ changes the size of the bar (category) labels. Another new option is ‘xlim’, which extends the x-axis limits.


Example: Creating horizontal bar plots and changing more options.

barplot(freq, horiz=TRUE, col=rainbow(3), main="Frequencies", xlab="Count", ylab="Substance", xlim=c(0,200), cex.axis=.8, cex.names=0.6, las=1)
box()

See “R_RStudioReferences” for a PDF of available colors. You can also type ‘help(par)’ for more information on graphical parameters. For example, you can search the ‘par’ help page for ‘las’ and it will tell you the settings for label orientation.


Exercise:

  1. Investigate whether there is any significant age difference between abusers of different substances. (Think about what kind of plots would give you this information; one possible sketch follows.)
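One possible starting point (a sketch, not the only approach) is side-by-side box plots of age by substance, using the ‘HELPrct’ data from the ‘mosaicData’ package:

boxplot(age ~ substance, data=HELPrct, xlab="Substance", ylab="Age (years)")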


You can also create bar graphs using multiple variables for each category.


Example: Creating a bar plot with multiple variables for each category.

Using the data set ‘AirPassengers’ we need to divide the information by month. For our example we’ll just do Jan, Feb, and Mar.

Jan=AirPassengers[seq(1,144,by=12)]
Feb=AirPassengers[seq(2,144,by=12)]
Mar=AirPassengers[seq(3,144,by=12)]

Next, we’ll create a data frame using our months.

AirPassengersByMonth=data.frame(Jan,Feb,Mar)

Create a sequence ‘Years’ and make these the row names for your data frame.

Years=seq(1949,1960)
row.names(AirPassengersByMonth)=Years

Now, we want to plot the first 5 years of data in a bar plot. Using the ‘barplot’ command we plot a transposed matrix of the data.

barplot(t(data.matrix(AirPassengersByMonth[1:5,])), beside=TRUE, ylim=c(0,300), col=gray.colors(3), main="Passengers by Month")
legend("topleft", legend=colnames(AirPassengersByMonth), fill=c(gray.colors(3)), title="Months")

Confidence Intervals:

Previously, we calculated the confidence interval for the population mean \(\mu\) if the population standard deviation \(\sigma\) is known. Now, we will take a look at the procedure of calculating the confidence interval for the population mean \(\mu\) if the population standard deviation \(\sigma\) is not known.

Here is a summary of the formulas:

Summary of confidence interval formulae.

Example 3.1

Over the three-day period from April 1 to April 3, 2015, a national poll surveyed 1500 American households to gauge their levels of discretionary spending. The question asked was how much the respondent spent the day before, not counting the purchase of a home, motor vehicle, or normal household bills. For these sampled households, the average amount spent was

\(\bar{x}= \$ 95\) with a sample standard deviation of \(s = \$185\).

How close will the sample average come to the population mean?
We have:

Sample average=population mean+random error

The Normal Approximation tells us that the distribution of these random errors over all possible samples follows the normal curve with a standard deviation of
\(\dfrac{\sigma}{\sqrt{n}}\).

Notice how the formula for the standard deviation of the average depends on the true population standard deviation \(\sigma\). When the population standard deviation is unknown, like in this example, we can still get a good approximation by plugging in the sample standard deviation \(s\). We call the resulting estimate the Standard Error of the Mean (SEM).

Standard Error of the Mean (SEM) = estimated standard deviation of the sample average =

\(\dfrac{standard\ deviation\ of\ the\ sample}{\sqrt{n}}\)= \(\dfrac{s}{\sqrt{n}}\)

In the example, we have \(s = \$185\) so the Standard Error of the Mean =

\(\dfrac{\$185}{\sqrt{1500}}\)

= \(\$4.78\)

Recap: the estimated daily amount of discretionary spending amongst American households at the beginning of April, 2015 was \(\$95\) with a standard error of \(\$4.78\).

The Normal Approximation tells us, for example, that for 95% of all large samples, the sample average will be within two SEM of the true population average.

Thus, an approximate 95% confidence interval for the true mean daily discretionary spending would be \(\$95 ± 2(\$4.78)\), or \(\$95 ± \$9.56\).

Of course, other levels of confidence are possible. When the sample size is large, \(s\) will be a good estimate of \(\sigma\) and you can use multiplier numbers from the normal curve. When the sample size is smaller (say n < 30), then \(s\) will be fairly different from \(\sigma\) for some samples, and that means that we need a bigger multiplier number to account for that.

We will then use the \(t\)-distribution.
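Here is a minimal R sketch of this calculation using the summary statistics above (the object names are ours):

xbar <- 95; s <- 185; n <- 1500
sem <- s/sqrt(n)                          # standard error of the mean, about 4.78
xbar + c(-1, 1)*2*sem                     # large-sample 95% interval using the normal multiplier
xbar + c(-1, 1)*qt(0.975, df=n-1)*sem     # t-based interval; nearly identical for a sample this large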


Week 4:

Hypothesis testing:

In reviewing hypothesis tests, we start first with the general idea. Then, we keep returning to the basic procedures of hypothesis testing, each time adding a little more detail.

The general idea of hypothesis testing involves:

  1. Making an initial assumption.

  2. Collecting evidence (data).

  3. Based on the available evidence (data), deciding whether to reject or not reject the initial assumption.

Every hypothesis test — regardless of the population parameter involved — requires the above three steps.

Example 4.1:

Is normal body temperature really 98.6 degrees F? Consider the population of many, many adults. A researcher hypothesized that the average adult body temperature is lower than the often-advertised 98.6 degrees F. That is, the researcher wants an answer to the question: “Is the average adult body temperature 98.6 degrees? Or is it lower?” To answer his research question, the researcher starts by assuming that the average adult body temperature is 98.6 degrees F.

Then, the researcher goes out and tries to find evidence that refutes his initial assumption. In doing so, he selects a random sample of 130 adults. The average body temperature of the 130 sampled adults is 98.25 degrees.

Then, the researcher uses the data he collected to make a decision about his initial assumption. It is either likely or unlikely that the researcher would collect the evidence he did given his initial assumption that the average adult body temperature is 98.6 degrees:

If it is likely, then the researcher does not reject his initial assumption that the average adult body temperature is 98.6 degrees. There is not enough evidence to do otherwise. If it is unlikely, then: either the researcher’s initial assumption is correct and he experienced a very unusual event; or the researcher’s initial assumption is incorrect. In statistics, we generally don’t make claims that require us to believe that a very unusual event happened. That is, in the practice of statistics, if the evidence (data) we collected is unlikely in light of the initial assumption, then we reject our initial assumption.

Example 4.2:

One place where you can consistently see the general idea of hypothesis testing in action is in criminal trials held in the United States. Our criminal justice system assumes “the defendant is innocent until proven guilty.” That is, our initial assumption is that the defendant is innocent.

In the practice of statistics, we make our initial assumption when we state our two competing hypotheses – the null hypothesis \((H_0)\) and the alternative hypothesis \((H_A)\). Here, our hypotheses are:

\(H_0\): Defendant is not guilty

\(H_A\): Defendant is guilty

In statistics, we always assume the null hypothesis is true. That is, the null hypothesis is always our initial assumption.

The prosecution team then collects evidence — such as finger prints, blood spots, hair samples, carpet fibers, shoe prints, ransom notes, and handwriting samples — with the hopes of finding “sufficient evidence” to make the assumption of innocence refutable.

In statistics, the data are the evidence.

The jury then makes a decision based on the available evidence:

If the jury finds sufficient evidence — beyond a reasonable doubt — to make the assumption of innocence refutable, the jury rejects the null hypothesis and deems the defendant guilty. We behave as if the defendant is guilty. If there is insufficient evidence, then the jury does not reject the null hypothesis. We behave as if the defendant is not guilty. In statistics, we always make one of two decisions. We either “reject the null hypothesis” or we “fail to reject the null hypothesis.”

Example 4.3:

Hypothesis Testing

Critical values:

One of the main purposes of statistics is to test hypotheses. For example, you might run an experiment and find that a certain drug is effective at treating headaches. But if you can’t repeat that experiment, no one will take your results seriously. A good example of this was the cold fusion discovery, which petered into obscurity because no one was able to duplicate the results.

A hypothesis is an educated guess about something in the world around you. It should be testable, either by experiment or observation. For example: A new medicine you think might work. A way of teaching you think might be better. A possible location of new species. A fairer way to administer standardized tests. It can really be anything at all as long as you can put it to the test.

What is a Hypothesis Statement? If you are going to propose a hypothesis, it’s customary to write a statement. Your statement will look like this: “If I…(do this to an independent variable)….then (this will happen to the dependent variable).” For example:

If I (decrease the amount of water given to herbs) then (the herbs will increase in size).

If I (give patients counseling in addition to medication) then (their overall depression scale will decrease).

If I (give exams at noon instead of 7) then (student test scores will improve).

If I (look in this certain location) then (I am more likely to find new species).

A good hypothesis statement should:

Include an “if” and “then” statement (according to the University of California).

Include both the independent and dependent variables.

Be testable by experiment, survey or other scientifically sound technique.

Be based on information in prior research (either yours or someone else’s).

Have design criteria (for engineering or programming projects).

What is Hypothesis Testing?

Hypothesis testing in statistics is a way for you to test the results of a survey or experiment to see if you have meaningful results. You’re basically testing whether your results are valid by figuring out the odds that your results have happened by chance. If your results may have happened by chance, the experiment won’t be repeatable and so has little use.

Hypothesis testing can be one of the most confusing aspects for students, mostly because before you can even perform a test, you have to know what your null hypothesis is. Often, those tricky word problems that you are faced with can be difficult to decipher. But it’s easier than you think; all you need to do is:

  1. Figure out and state your null hypothesis,

  2. Choose what kind of test you need to perform,

  3. Either support or reject the null hypothesis.

What is the Null Hypothesis?

If you trace back the history of science, the null hypothesis is always the accepted fact. Simple examples of null hypotheses that are generally accepted as being true are:

DNA is shaped like a double helix.

There are 8 planets in the solar system (excluding Pluto).

Taking Vioxx can increase your risk of heart problems (a drug now taken off the market).

How do I State the Null Hypothesis? You won’t be required to actually perform a real experiment or survey in elementary statistics (or even disprove a fact like “Pluto is a planet”!), so you’ll be given word problems from real-life situations. You’ll need to figure out what your hypothesis is from the problem. This can be a little trickier than just figuring out what the accepted fact is. With word problems, you are looking to find a fact that is nullifiable (i.e. something you can reject).

Hypothesis Testing Examples

1: Basic Example

A researcher thinks that if knee surgery patients go to physical therapy twice a week (instead of 3 times), their recovery period will be longer. The average recovery time for knee surgery patients is 8.2 weeks.

The hypothesis statement in this question is that the researcher believes the average recovery time is more than 8.2 weeks. It can be written in mathematical terms as:

\(H_1\): \(\mu > 8.2\)

Next, you’ll need to state the null hypothesis (See: How to state the null hypothesis). That’s what will happen if the researcher is wrong. In the above example, if the researcher is wrong then the recovery time is less than or equal to 8.2 weeks. In math, that’s:

\(H_0\): \(\mu ≤ 8.2\)

Rejecting the null hypothesis

Ten or so years ago, we believed that there were 9 planets in the solar system. Pluto was demoted as a planet in 2006. The null hypothesis of “Pluto is a planet” was replaced by “Pluto is not a planet.”

Of course, rejecting the null hypothesis isn’t always that easy — the hard part is usually figuring out what your null hypothesis is in the first place.

Hypothesis Testing Examples (One Sample Z Test)

The one sample z test isn’t used very often (because we rarely know the actual population standard deviation). However, it’s a good idea to understand how it works as it’s one of the simplest tests you can perform in hypothesis testing. In English class you got to learn the basics (like grammar and spelling) before you could write a story; think of one sample z tests as the foundation for understanding more complex hypothesis testing. This page contains two hypothesis testing examples for one sample z-tests.

One Sample Hypothesis Testing Examples: #2

A principal at a certain school claims that the students in his school are above average intelligence. A random sample of thirty students’ IQ scores has a mean of 112.5. Is there sufficient evidence to support the principal’s claim? The mean population IQ is 100 with a standard deviation of 15.

Step 1: State the Null hypothesis. The accepted fact is that the population mean is 100, so:

\(H_0\): \(\mu=100\).

Step 2: State the Alternate Hypothesis. The claim is that the students have above average IQ scores, so:

\(H_1\): \(\mu >100\). The fact that we are looking for scores “greater than” a certain point means that this is a one-tailed test.

Step 3: Draw a picture to help you visualize the problem.

Step 4: State the alpha level. If you aren’t given an \(\alpha\) level, use 5% (0.05).

Step 5: Find the rejection region area (given by your \(\alpha\) level above) from the z-table. An area of .05 is equal to a z-score of 1.645.

Step 6: Find the test statistic using the z-score formula. For this set of data:

\(z= \dfrac{\bar{x}-\mu}{\dfrac{\sigma}{\sqrt{n}}}= \dfrac{(112.5-100)}{\frac{15}{\sqrt{30}}}=4.56\).

Step 7: If the test statistic from Step 6 is greater than the critical value from Step 5, reject the null hypothesis; if it is less, you cannot reject the null hypothesis. In this case, it is greater (4.56 > 1.645), so you can reject the null.
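The same computation in R, as a minimal sketch (the object names are ours):

xbar <- 112.5; mu0 <- 100; sigma <- 15; n <- 30
z <- (xbar - mu0)/(sigma/sqrt(n))    # about 4.56
z > qnorm(0.95)                      # TRUE, so reject the null hypothesis at alpha = 0.05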

Example 4.4:

Blood glucose levels for obese patients have a mean of \(\mu=100\) with a standard deviation of \(\sigma=15\). A researcher thinks that a diet high in raw cornstarch will have a positive or negative effect on blood glucose levels. A sample of 30 patients who have tried the raw cornstarch diet have a mean glucose level of 140. Test the hypothesis that the raw cornstarch had an effect.

Step 1: State the null hypothesis: \(H_0:\mu=100\)

Step 2: State the alternate hypothesis: \(H_1: \mu\neq 100\)

Step 3: State your \(\alpha\)-level. We’ll use 0.05 for this example. As this is a two-tailed test, split the alpha into two:

0.05/2=0.025

Step 4: Find the z-score associated with your \(\alpha\)-level. You’re looking for the area in one tail only. The z-score for an area of 0.975 (1 − 0.025 = 0.975) is 1.96. As this is a two-tailed test, you would also be considering the left tail (z = −1.96).

Step 5: Find the test statistic: \(z= \dfrac{\bar{x}-\mu}{\dfrac{\sigma}{\sqrt{n}}}= \dfrac{140-100}{\frac{15}{\sqrt{30}}} \approx 14.6\).

Step 6: Because 14.6 is far greater than the critical value 1.96, we reject the null hypothesis and conclude that the raw cornstarch diet had an effect on blood glucose levels.
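The same test in R, as a minimal sketch using the numbers from the example (the object names are ours):

xbar <- 140; mu0 <- 100; sigma <- 15; n <- 30
z <- (xbar - mu0)/(sigma/sqrt(n))    # about 14.6, far beyond the critical values -1.96 and 1.96
2*pnorm(-abs(z))                     # two-tailed p-value, essentially 0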


Week 5:

Hypothesis testing of the difference between two population means:

This is a two sample z-test which is used to determine if two population means are equal or unequal. There are three possibilities for formulating hypotheses.

Hypothesis testing

Sampling from normally distributed populations with population variances known:
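For reference (this is the standard form of the statistic; it is not shown in the figures below): when both populations are normal and the population variances are known, the test statistic is

\(z = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}\)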

Example 5.1:

Serum uric acid levels: Is there a difference between the mean serum uric acid levels of individuals with Down’s syndrome and normal individuals?

Example 5.1

Example 5.1 Continued

P-values:

All hypothesis tests ultimately use a p-value to weigh the strength of the evidence, or in other words, what the data are telling you about the population. The p-value is a number between 0 and 1 and is interpreted in the following way:

A small p-value (typically ≤0.05) indicates strong evidence against the null hypothesis, so you reject it. A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject it. A p-value very close to the cutoff (0.05) is considered to be marginal and could go either way.

Example 5.2 T-test (Independent samples) using R:

The independent t-test is used to test whether there is a statistically significant difference between two means. Use of an independent t-test requires several assumptions to be satisfied; they are listed below:

  1. The variables are continuous and independent.

  2. The variables are normally distributed.

  3. The variances in each group are equal.

When these assumptions are satisfied the results of the t test are valid. Otherwise they are invalid and you need to use a non-parametric test.

Summary of Hypothesis testing through T-test

Example 5.2

We use the built-in ‘mtcars’ dataset to compare the gas mileage (‘mpg’) of cars with automatic transmission (am == 0) and cars with manual transmission (am == 1). First, we identify the automatic cars:

A=mtcars$am==0
A
##  [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [12]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE
## [23]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

We extract the mpg for automatic cars and save the data to the vector: mpg.auto

mpg.auto=mtcars[A,]$mpg
mpg.auto
##  [1] 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7
## [15] 21.5 15.5 15.2 13.3 19.2

Similarly (or by negating A), we can find the gas mileage for cars with manual transmission.

B=mtcars$am==1
B
##  [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE
## [23] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
mpg.manual=mtcars[B,]$mpg
mpg.manual
##  [1] 21.0 21.0 22.8 32.4 30.4 33.9 27.3 26.0 30.4 15.8 19.7 15.0 21.4

We can now apply the t.test function to compute the difference in means of the two sample data.

t.test(mpg.auto, mpg.manual)
## 
##  Welch Two Sample t-test
## 
## data:  mpg.auto and mpg.manual
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean of x mean of y 
##  17.14737  24.39231

Answer: In mtcars, the mean mileage of cars with automatic transmission is 17.147 mpg and that of cars with manual transmission is 24.392 mpg. Since the p-value (0.0014) is well below 0.05, we reject the null hypothesis of equal means. The 95% confidence interval for the difference in mean gas mileage (manual minus automatic) is between 3.21 and 11.28 mpg.
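As an optional check of the normality assumption mentioned earlier (a sketch, not part of the original example):

qqnorm(mpg.auto); qqline(mpg.auto)   # points near the line suggest approximate normality
shapiro.test(mpg.auto)               # Shapiro-Wilk test for the automatic group
shapiro.test(mpg.manual)             # and for the manual group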

Week 6:

Correlation and Association.

We look at terms that are often used to describe a relationship between two numeric variables.

6.1 Scatterplots.

There are many ways to create a scatterplot in R. We will use the “easyGgplot2” package, which is installed from GitHub using the “devtools” package:

library(devtools)
install_github("kassambara/easyGgplot2")

Then we load the package:

library(easyGgplot2)

Create a dataframe “df” that contains all the rows and the following columns from the “mtcars”-dataset:

df <- mtcars[, c("mpg", "cyl", "wt", "qsec", "vs")]

Take a look at the first few rows:

head(df)
##                    mpg cyl    wt  qsec vs
## Mazda RX4         21.0   6 2.620 16.46  0
## Mazda RX4 Wag     21.0   6 2.875 17.02  0
## Datsun 710        22.8   4 2.320 18.61  1
## Hornet 4 Drive    21.4   6 3.215 19.44  1
## Hornet Sportabout 18.7   8 3.440 17.02  0
## Valiant           18.1   6 3.460 20.22  1

Create a basic scatter plot of mpg according to wt (car weight):

Example 6.1.1

We use the built-in dataset “mtcars” to create a scatterplot to see if there is an association between the weight (“wt”) and the miles per gallon (“mpg”) of the cars; afterwards we also distinguish the points by the number of cylinders (“cyl”).

ggplot2.scatterplot(data=df, xName='wt',yName='mpg')

Add linear regression line:

ggplot2.scatterplot(data=df, xName='wt',yName='mpg', 
                    addRegLine=TRUE, regLineColor="blue")

We set color/shape by another variable (cyl):

library(devtools)
library(ggplot2)
library(easyGgplot2)
ggplot2.scatterplot(data=df, xName='wt',yName='mpg',
                    groupName="cyl")

Week 7: Regression Analysis

Multiple regression:

The general purpose of multiple regression is to learn more about the relationship between several independent or predictor variables and a dependent or criterion variable. For example, a real estate agent might record for each listing the size of the house (in square feet), the number of bedrooms, the average income in the respective neighborhood according to census data, and a subjective rating of appeal of the house. Once this information has been compiled for various houses it would be interesting to see whether and how these measures relate to the price for which a house is sold. For example, you might learn that the number of bedrooms is a better predictor of the price for which a house sells in a particular neighborhood than how “pretty” the house is (subjective rating). You may also detect “outliers,” that is, houses that should really sell for more, given their location and characteristics.

When you choose to analyse your data using multiple regression, part of the process involves checking to make sure that the data you want to analyse can actually be analysed using multiple regression. You need to do this because it is only appropriate to use multiple regression if your data “passes” eight assumptions that are required for multiple regression to give you a valid result.

Assumption #1:

Your dependent variable should be measured on a continuous scale (i.e., it is either an interval or ratio variable). Examples of variables that meet this criterion include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so forth.

Assumption #2:

You have two or more independent variables, which can be either continuous (i.e., an interval or ratio variable) or categorical (i.e., an ordinal or nominal variable).

Assumption #3:

You should have independence of observations.

Assumption #4:

There needs to be a linear relationship between

  1. the dependent variable and each of your independent variables, and

  2. the dependent variable and the independent variables collectively. Whilst there are a number of ways to check for these linear relationships, we suggest creating scatterplots.

Assumption #5:

Your data needs to show homoscedasticity, which is where the variances along the line of best fit remain similar as you move along the line.

Assumption #6:

Your data must not show multicollinearity, which occurs when you have two or more independent variables that are highly correlated with each other. This leads to problems with understanding which independent variable contributes to the variance explained in the dependent variable, as well as technical issues in calculating a multiple regression model.

Assumption #7:

There should be no significant outliers, high leverage points or highly influential points. Outliers, leverage and influential points are different terms used to represent observations in your data set that are in some way unusual when you wish to perform a multiple regression analysis. These different classifications of unusual points reflect the different impact they have on the regression line.

Assumption #8:

Finally, you need to check that the residuals (errors) are approximately normally distributed (we explain these terms in our enhanced multiple regression guide). Two common methods to check this assumption include using:

  1. a histogram (with a superimposed normal curve) and a Normal P-P Plot; or

  2. a Normal Q-Q Plot of the studentized residuals.

The function “lm” can be used to perform multiple linear regression in R, and much of the syntax is the same as that used for fitting simple linear regression models. To perform multiple linear regression with p explanatory variables use the command:

lm(response ~ explanatory_1 + explanatory_2 + … + explanatory_p)

Here the terms “response” and “explanatory_i” in the function should be replaced by the names of the response and explanatory variables, respectively, used in the analysis.

Consider the data set “mtcars” available in the R environment. It gives a comparison between different car models in terms of mileage per gallon (“mpg”), cylinder displacement (“disp”), horsepower (“hp”), weight of the car (“wt”), and some more parameters.

The goal of the model is to establish the relationship between “mpg” as a response variable with “disp”,“hp” and “wt” as predictor variables. We create a subset of these variables from the mtcars data set for this purpose.

Example 7.1:

input <- mtcars[,c("mpg","disp","hp","wt")]
print(head(input))
##                    mpg disp  hp    wt
## Mazda RX4         21.0  160 110 2.620
## Mazda RX4 Wag     21.0  160 110 2.875
## Datsun 710        22.8  108  93 2.320
## Hornet 4 Drive    21.4  258 110 3.215
## Hornet Sportabout 18.7  360 175 3.440
## Valiant           18.1  225 105 3.460

Create the relationship model:

input <- mtcars[,c("mpg","disp","hp","wt")]

# Create the relationship model.
model <- lm(mpg~disp+hp+wt, data = input)

# Show the model.
print(model)
## 
## Call:
## lm(formula = mpg ~ disp + hp + wt, data = input)
## 
## Coefficients:
## (Intercept)         disp           hp           wt  
##   37.105505    -0.000937    -0.031157    -3.800891

So the equation is:

\(Y = a + b_1\cdot disp + b_2\cdot hp + b_3\cdot wt\)

which, with the estimated coefficients, becomes:

\(Y = 37.105505 + (-0.000937)\cdot disp + (-0.031157)\cdot hp + (-3.800891)\cdot wt\)

We can use the regression equation created above to predict the mileage when a new set of values for displacement, horse power and weight is provided.

For a car with disp = 221, hp = 102 and wt = 2.91, the predicted mileage is:

\(Y = 37.105505 + 221 \cdot (-0.000937)+102 \cdot (-0.031157)+2.91\cdot(-3.800891) \approx 22.66.\)
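The same prediction can be obtained with the predict() function (a sketch; ‘model’ is the object fitted above and ‘newcar’ is a name we chose):

newcar <- data.frame(disp=221, hp=102, wt=2.91)
predict(model, newdata=newcar)    # about 22.66 mpg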

Complete Project 5 (Posted on CANVAS)

Week 8:

Analysis of Variance

8.1 One-way ANOVA with R:

We are often interested in determining whether the means from more than two populations or groups are equal or not. To test whether the difference in means is statistically significant we can perform analysis of variance (ANOVA) using the R function aov().

The first step in our analysis is to graphically compare the means of the variable of interest across groups. It is possible to create side-by-side boxplots of measurements organized in groups using the function plot(). Simply type plot(response ~ factor, data=data_name) where response is the name of the response variable and factor the variable that separates the data into groups. Both variables should be contained in a data frame called data_name.

Example 8.1:

A drug company tested three formulations of a pain relief medicine for migraine headache sufferers. For the experiment 27 volunteers were selected and 9 were randomly assigned to one of three drug formulations. The subjects were instructed to take the drug during their next migraine headache episode and to report their pain on a scale of 1 to 10 (10 being most pain).

Drug A: 4 5 4 3 2 4 3 4 4
Drug B: 6 8 4 5 4 6 5 8 6
Drug C: 6 7 6 6 7 5 6 5 5

To make side-by-side boxplots of the variable pain grouped by the variable drug we must first read in the data into the appropriate format.

pain=c(4,5,4,3, 2, 4, 3, 4,4, 6, 8, 4,5,4, 6, 5, 8, 6,6, 7, 6,6,7, 5,6,5,5)
drug=c(rep("A",9), rep("B",9), rep("C",9))
migraine=data.frame(pain,drug)
migraine
##    pain drug
## 1     4    A
## 2     5    A
## 3     4    A
## 4     3    A
## 5     2    A
## 6     4    A
## 7     3    A
## 8     4    A
## 9     4    A
## 10    6    B
## 11    8    B
## 12    4    B
## 13    5    B
## 14    4    B
## 15    6    B
## 16    5    B
## 17    8    B
## 18    6    B
## 19    6    C
## 20    7    C
## 21    6    C
## 22    6    C
## 23    7    C
## 24    5    C
## 25    6    C
## 26    5    C
## 27    5    C

Note that the command rep("A",9) constructs a list of nine A's in a row. The variable drug is therefore a list of length 27 consisting of nine A's followed by nine B's followed by nine C's. If we print the data frame migraine, we can see the format the data should be in to make side-by-side boxplots and perform ANOVA. We can now make the boxplots by typing:

plot(pain~drug, data=migraine)

From the boxplots it appears that the median pain for drug A is lower than that for drugs B and C.

Next, the R function aov() can be used for fitting ANOVA models. The general form is aov(response ~ factor, data=data_name) where response represents the response variable and factor the variable that separates the data into groups. Both variables should be contained in the data frame called data_name. Once the ANOVA model is fit, one can look at the results using the summary() function. This produces the standard ANOVA table.

results=aov(formula = pain~drug, data=migraine)
results
## Call:
##    aov(formula = pain ~ drug, data = migraine)
## 
## Terms:
##                     drug Residuals
## Sum of Squares  28.22222  28.44444
## Deg. of Freedom        2        24
## 
## Residual standard error: 1.088662
## Estimated effects may be unbalanced
summary(results)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## drug         2  28.22  14.111   11.91 0.000256 ***
## Residuals   24  28.44   1.185                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Studying the output of the ANOVA table above we see that the F-statistic is 11.91 with a p-value equal to 0.0003. We clearly reject the null hypothesis of equal means for all three drug groups.
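Having rejected the hypothesis of equal means, a natural optional follow-up (not covered in the text above) is a pairwise comparison of the drug groups, for example with Tukey’s honest significant differences:

TukeyHSD(results)    # pairwise differences in mean pain between drugs A, B, and C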

Also, recall that a p-value is also a probability, but it comes from a different source than alpha. Every test statistic has a corresponding probability or p-value. This value is the probability that the observed statistic occurred by chance alone. We reject the null hypothesis if \(p < \alpha\).

Non-parametric Methods

Factor Analysis and Structural Equation Modeling

Basic Psychometrics