Complete all Questions and submit your final document as a PDF on Sakai.

The Data

The Austin Animal Shelter in Austin, Texas, is the largest no-kill shelter in the United States. The organization cares for dogs, cats, and other animals in need, and each year this work results in thousands of adoptions. Today, we will be working with a random sample of 10,000 animals adopted from the Shelter in 2016 or 2017. Our task is to determine the average number of days an animal tends to spend in the Shelter prior to being adopted. We will build confidence intervals and perform hypothesis tests to take the results from our sample and try to generalize to the population.

  1. What is the population of interest?

EDA

As we know, the first step in any data analysis is EDA. We need to know what kind of data we are working with, and what questions we might be able to explore using this data. To start, we need to load in the data. This data set is on Sakai under Lab 7. It is a cleaned version of a data set made available through the Austin Open Data initiative. If you want to remind yourself of the steps needed to read a .csv file into R, look back at Lab 2, which provides detailed steps on how to load data into your workspace. Before you proceed, check that you see AdoptedSamp in your workspace and that you have put the necessary code into a chunk in your RMarkdown document.

The data set we will be working with contains information on 16 different variables. The meaning of each variable is as follows:

  • animal_type: the type of animal, i.e., dog, cat, etc.
  • breed: the specific breed of animal; example: Cocker Spaniel.
  • color: the coat/feather color of the animal.
  • intake_type: the reason that the animal was taken in to the shelter. Levels: Owner Surrender, Public Assist, Stray, Wildlife.
  • intake_condition: the condition of the animal upon its arrival at the shelter.
  • age_upon_intake_.days.: the age of the animal in days upon its arrival at the shelter.
  • age_upon_intake_.years.: the age of the animal in years (rounded) upon its arrival at the shelter.
  • intake_month: the month of the year the animal arrived at the shelter.
  • intake_year: the year the animal arrived at the shelter.
  • time_in_shelter_days: the number of days the animal remained in the shelter prior to being adopted.
  • outcome_year: the year the animal was adopted from the shelter.
  • outcome_subtype: if the adoption was not to a family home, where was the animal adopted? Options: Blank (adopted to a family), Foster (sent to a foster home) or Offsite (adopted by an organization or business).
  • outcome_type: for all animals in our subset, the outcome of being in the shelter is "Adopted".
  • sex_upon_outcome: the sex of the animal, including information about neutering/spaying.
  • age_upon_outcome_.days.: the age of the animal in days upon its departure from the shelter.
  • age_upon_outcome_.years.: the age of the animal in years upon its departure from the shelter.

We are interested in the variable time_in_shelter_days, which expresses the number of days an animal is in the Shelter prior to being adopted.

  2. Make two plots to explore the distribution of time_in_shelter_days. Label your axes. Based on the plots, describe the distribution.
  3. If the distribution is not symmetric, can we still use the confidence interval and hypothesis test we have worked with in class? In other words, even with a skewed distribution, do we still meet the necessary conditions? Justify your response.
  4. Use the summary command to obtain a summary of the number of days an animal remains in the Shelter.
  5. Based on your plots, do you have any concerns about using the sample mean and standard deviation as the measures of center and spread? Explain.
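
As a reminder of the kind of code these questions call for, here is a minimal sketch. It uses simulated right-skewed data (the vector days is made up for illustration and is not the shelter data), so the plots and numbers will not match what you see for AdoptedSamp.

```r
# Simulated stand-in for time_in_shelter_days (NOT the real shelter data):
# an exponential distribution is right-skewed, like many waiting times.
set.seed(1)
days <- rexp(1000, rate = 1 / 20)

# Two views of the same distribution, with labeled axes.
hist(days,
     main = "Simulated days in shelter",
     xlab = "Days in shelter",
     ylab = "Number of animals")
boxplot(days,
        horizontal = TRUE,
        main = "Simulated days in shelter",
        xlab = "Days in shelter")

# Five-number summary plus the mean.
day_summary <- summary(days)
print(day_summary)
```

With skewed data like this, comparing the mean and the median in the summary output is a quick numerical check on what the plots show.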

The population standard deviation is provided to us as 44.45; this value comes from a national study.

  6. Find the sample mean for the number of days an animal stays in the Shelter. Remember that the mean function will be useful for this.

We have discussed the fact that when data is strongly skewed, we tend to use the median and the IQR as measures of center and spread. Why, then, are we looking at the sample mean and standard deviation?

The reasoning behind this relates to the Central Limit Theorem. The CLT is an extremely powerful tool that allows us to make confidence intervals and perform hypothesis tests by describing the sampling distribution of the sample mean using a normal distribution. There is no such corresponding theorem for the sampling distribution of the median. Because of this, the sample mean is often used for inference even when the sample is skewed.

Does this indicate that the mean is always used? No! There are randomization tests that allow us to use sample statistics like the median in order to perform inference. These tests are also extremely powerful and easy to perform using computing; we'll get to that later in class. For now, however, we will stick with the sample mean.

There is a trade-off that we make with this choice. Confidence intervals and hypothesis tests are related to sampling distributions. Sampling distributions describe how much we believe a statistic (like a sample mean) will vary from sample to sample. The standard deviation of this distribution (the standard error) describes this sample-to-sample variation. This means that when we use sampling distributions to perform inference, the margin of error ignores all sources of variation except sample-to-sample variability. There is no measure of the fact that there is skewness within the individual samples, and hence that the sample means may not be resistant (robust) measures of center. This is the price we pay in order to leverage the Central Limit Theorem.
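
The ideas in the last few paragraphs can be illustrated with a short simulation. This is only a sketch under stated assumptions: the data are drawn from an exponential distribution with mean 20 (chosen because it is right-skewed, not because it describes the shelter data), and the sample size of 100 and the 2000 repetitions are arbitrary choices.

```r
# Assumed population: exponential with mean 20, which also has SD 20.
set.seed(42)
sigma <- 20
n     <- 100

# Draw many samples of size n and record the mean of each one.
sample_means <- replicate(2000, mean(rexp(n, rate = 1 / 20)))

# Individual values are strongly skewed ...
hist(rexp(2000, rate = 1 / 20),
     main = "Individual values", xlab = "Value")
# ... but the sample means are roughly bell-shaped.
hist(sample_means,
     main = "Sample means (n = 100)", xlab = "Sample mean")

# The spread of the sample means should be close to sigma / sqrt(n).
simulated_se   <- sd(sample_means)
theoretical_se <- sigma / sqrt(n)
c(simulated = simulated_se, theoretical = theoretical_se)
```

The second histogram is the sampling distribution of the sample mean, and its standard deviation is the standard error described above.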

Confidence Interval

Suppose we have been asked to say how many days on average an animal remains in the Shelter in Austin before being adopted.

  7. We have already computed the sample mean. Why would it not be reasonable to just report this point estimate (i.e., sample mean) in response to this question?
  8. Make and interpret a 95% confidence interval for the number of days an animal remains in the Shelter before being adopted.
  9. The standard deviation of our sample data is huge. Why is the confidence interval fairly narrow?
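
For reference, the arithmetic behind a z-based confidence interval for a mean can be sketched as follows. The data here are simulated and the value of sigma is an assumption made for the simulation; for the lab you would use the shelter sample and the population standard deviation given earlier.

```r
# Simulated stand-in for the sample (NOT the real shelter data).
set.seed(7)
x     <- rexp(10000, rate = 1 / 20)
sigma <- 20                     # assumed known population SD for the simulation

n      <- length(x)
x_bar  <- mean(x)
se     <- sigma / sqrt(n)       # standard error of the sample mean
z_star <- qnorm(0.975)          # critical value for 95% confidence

# Point estimate plus or minus the margin of error.
ci <- c(lower = x_bar - z_star * se,
        upper = x_bar + z_star * se)
ci
```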

Hypothesis Test

Suppose an individual who works at the Shelter claims that, on average, animals spend less than a month (30 days) in the Shelter prior to being adopted. In order to check this claim, we could either build a confidence interval or conduct a hypothesis test.

  10. Based on the confidence interval you have built, how would you reply to the claim that on average, animals spend less than 30 days in the Shelter prior to being adopted?
  11. There is a relationship between hypothesis tests and confidence intervals. In order to test the claim that on average, animals spend less than a month (30 days) in the Shelter prior to being adopted, what significance level do we need to choose to conduct a hypothesis test that corresponds to our confidence interval?
  12. Run a hypothesis test to determine whether or not the average length of time an animal remains in the Shelter prior to being adopted is less than 30 days. Use the significance level you specified in Question 11. Write down all 6 steps, interpret the p-value, and state your conclusion. Note: In Step 4, just state the distribution; you do not need to draw a picture. To produce the necessary mathematical notation for the 6 steps, copy and paste the following into the white space in your Markdown.

    Step 1: $H_0:$, $H_a:$

    Step 2: $\bar{x} =$, $SE=$

    Step 3:

    Step 4: The sampling distribution of the test statistic under the null is a ? distribution. In other words, if the null were true, we would expect that $\bar{X}$ would follow a ? distribution.

    Step 5: The p-value is ?. This means that ...

    Step 6: Using a significance level of ?, we ...

    As part of a hypothesis test, we need to compute a p-value. We know how to do this using our z-tables, but we can also do it in R. Suppose we have computed a z-score for some sample mean, and found that it is -2.4. In order to compute the probability of seeing a z-score of -2.4 or lower, we use the code:

    pnorm(-2.4)

    Note that, just like the tables, this computes the probability of being less than or equal to the z-score of -2.4.

  13. Is the null value $\mu = 30$ in the confidence interval we built in Question 8? Do we expect it to be? Explain.
  14. What is the probability of making a Type 1 Error in the hypothesis test you ran in Question 12?
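
The pnorm call described in Question 12 returns a lower-tail probability. As a small sketch, here is how the same function gives the other tail areas that come up in hypothesis tests, using the z-score of -2.4 from that example. The variable names are chosen for illustration only.

```r
z <- -2.4

lower_tail <- pnorm(z)                      # P(Z <= -2.4), same as the z-table
upper_tail <- pnorm(z, lower.tail = FALSE)  # P(Z >= -2.4)
two_sided  <- 2 * pnorm(-abs(z))            # P(|Z| >= 2.4)

round(c(lower = lower_tail, upper = upper_tail, two_sided = two_sided), 4)
```

The lower-tail value is about 0.0082, which matches the z-table entry for -2.4. Which of the three is the p-value depends on the direction of the alternative hypothesis.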

While we can use both hypothesis tests and confidence intervals to evaluate claims, confidence intervals provide more information than a hypothesis test. Hypothesis tests allow you to evaluate a claim and make a statement such as "There is strong evidence the mean is less than 7." Confidence intervals still provide this information, but they also provide you a range of values for where the population parameter might lie. For instance, with a confidence interval, we can make claims such as "We are 95% confident that the mean is between 4 and 5." Hypothesis tests are widely used in practice, but you should ALWAYS prefer reporting a confidence interval to reporting a p-value.

The z.test function

Now that we have worked with the data, and practiced performing inference the "long way", let's use R to speed up the process. The confidence intervals and hypothesis tests we have been conducting so far rely on the standard normal distribution. As the standard normal is often represented as a Z, we call such procedures Z-procedures. To perform these procedures in R, we use a function called z.test.

This z.test function is not one that R already knows (like hist or mean). Instead, we have to teach R this function by loading in a package, a downloadable set of functions written for R. To load the function we need, copy the following two lines of code, paste them into a chunk, and hit play.

install.packages("BSDA")
library(BSDA)

You have now installed the package called BSDA and used the library command to tell R to get ready to use the functions stored in the package. The function z.test that we need is included in that package. Now that you have installed the package, you need to comment out the install.packages line. In other words, you need to put a # symbol in front of install.packages. Your chunk should now look like this:

#install.packages("BSDA")
library(BSDA)

To get started, let's calculate a 98% confidence interval for the average number of days an animal stays at the Shelter before being adopted. To do this using R, we would use

z.test(x = AdoptedSamp$time_in_shelter_days, alternative = "two.sided", mu = 30, sigma.x = sd(AdoptedSamp$time_in_shelter_days), conf.level = .98 )

Let's go through the arguments of this function.

  • The first argument is x = AdoptedSamp$time_in_shelter_days, which is the variable that we are interested in working with.
  • The second argument, alternative, tells R what our alternative hypothesis is: the population mean is less than ("less"), greater than ("greater"), or not equal to ("two.sided") the value specified in the null hypothesis. If we are only building a confidence interval, we should set alternative = "two.sided".
  • Next we decide on the mu we want to use as our null value. In other words, what value of the population mean are we testing? NOTE: If we are just building a confidence interval, this argument is irrelevant and you may feel free to leave it blank. The computer will fill in a default value of 0, but you will only look at the confidence interval produced in the output.
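  • Next we supply a standard deviation through the argument sigma.x. A z-procedure formally calls for the population standard deviation; in the call above we plug in the sample standard deviation of our data as an estimate. R will convert this value into the SE as a part of the function.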
  • The conf.level is the confidence level used to define our confidence interval.

You will notice that we do not specify a significance level when we run this function. This means that once the function is run, you have to look at the p-value supplied as a result of the hypothesis test and make a decision by comparing it to your chosen significance level.
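
To see what the arguments described above control, it can help to compute the same kinds of quantities by hand in base R. The helper below, z_by_hand, is a hypothetical function written for this illustration; it is not part of BSDA and is not how z.test is implemented. It always returns a two-sided interval, and the data are simulated rather than taken from AdoptedSamp.

```r
# Hypothetical helper (not part of BSDA): computes by hand the main numbers
# that a one-sample z procedure reports.
z_by_hand <- function(x, mu = 0, sigma.x, alternative = "two.sided",
                      conf.level = 0.95) {
  n  <- length(x)
  se <- sigma.x / sqrt(n)                 # the SD is converted to a standard error
  z  <- (mean(x) - mu) / se               # test statistic under the null value mu
  p  <- switch(alternative,
               less      = pnorm(z),
               greater   = pnorm(z, lower.tail = FALSE),
               two.sided = 2 * pnorm(-abs(z)))
  z_star <- qnorm(1 - (1 - conf.level) / 2)
  list(statistic = z,
       p.value   = p,
       conf.int  = mean(x) + c(-1, 1) * z_star * se)  # two-sided interval only
}

set.seed(3)
x   <- rexp(400, rate = 1 / 20)           # simulated data, not the shelter sample
res <- z_by_hand(x, mu = 30, sigma.x = sd(x), alternative = "less",
                 conf.level = 0.98)
res$p.value
res$conf.int
```

Note that mu and alternative only affect the test statistic and p-value, while conf.level only affects the interval. No significance level appears anywhere, which is why the decision is left to you.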

  15. Interpret the 98% confidence interval.

Considering Sample Size

We have discussed the impact of sample size on sample means. According to the law of large numbers, as the sample size increases, the sample mean approaches the population mean. Since sample means are critical to hypothesis testing, this suggests that sample size should also play a role in hypothesis tests.

To explore this, let's try working with a smaller sample. Run the lines of code below to create a new sample AdoptSmall of 500 animals.

set.seed(234)
AdoptSmall <- AdoptedSamp[sample(1:10000, 500, replace=FALSE),]

Now, let's try building a 95% confidence interval using this smaller sample.

  16. Using the z.test function and the sample AdoptSmall, create a 95% confidence interval for the average number of days an animal is in the Shelter prior to being adopted.
  17. How is this interval different from the interval you built in Question 8? Why do you think the change occurred? Give two specific reasons.

What we have just seen is extremely important: sample size affects statistical significance, and hence the results of hypothesis tests and confidence intervals can change for samples of different sizes. In general, large samples exhibit very little sampling variability. In other words, results are expected to be very consistent from sample to sample (the standard error is small), so even small differences may register as statistically significant. With large samples, it is therefore generally easier to reject the null hypothesis when you see a sample mean that is different from your null value.

On the other hand, small random samples tend to vary greatly from sample to sample, so the standard error is large. Unless an effect is very large, it can be hard for a significance test to register it as statistically significant. With small samples, it is therefore generally harder to reject the null hypothesis even when you see a sample mean that is different from your null value.
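
The effect of sample size on the standard error can be made concrete with a few lines of arithmetic. This is a sketch only: the standard deviation of 20 is an assumed value picked for illustration, and the two sample sizes are the ones used in this lab.

```r
sigma  <- 20                    # assumed population SD for the illustration
z_star <- qnorm(0.975)          # critical value for 95% confidence

n      <- c(small = 500, large = 10000)
se     <- sigma / sqrt(n)       # standard error for each sample size
margin <- z_star * se           # margin of error for each sample size

round(rbind(se = se, margin = margin), 3)

# Ratio of the two margins of error: sqrt(10000 / 500)
margin[["small"]] / margin[["large"]]
```

Because the standard error shrinks with the square root of n, the ratio does not depend on the value chosen for sigma.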

Just Dogs

Let's go back to our original large sample AdoptedSamp, but let's narrow our focus. Suppose we are interested in the time it takes dogs to be adopted. There are more than just dogs in our AdoptedSamp, so the first step is to create a subset of our data that only contains information on the adoption of dogs. We can do that using the subset command:

Dogs<-subset(AdoptedSamp, AdoptedSamp$animal_type == "Dog")
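
If you want to convince yourself of what subset does, here is a minimal sketch on a toy data frame. The rows in toy are made up for illustration and are not from the shelter data.

```r
# Toy data frame with made-up rows (NOT the shelter data).
toy <- data.frame(
  animal_type          = c("Dog", "Cat", "Dog", "Bird", "Cat"),
  time_in_shelter_days = c(12, 40, 5, 3, 21)
)

# Keep only the rows where animal_type is "Dog".
toy_dogs <- subset(toy, animal_type == "Dog")

nrow(toy_dogs)                   # how many rows survived the filter
table(toy_dogs$animal_type)      # confirm that only dogs remain
```

Running the same two checks on Dogs is a quick way to confirm the subset worked before you build the interval.
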
  18. Using the z.test function, create a 95% confidence interval for the average number of days a dog is in the Shelter prior to being adopted. Interpret the interval within the context of the data.
  19. Consider the two 95% confidence intervals you have created (Questions 8 and 18). Based on these intervals, do you think there might be a relationship between whether or not an animal is a dog and the number of days it takes for the animal to be adopted? Explain your reasoning.

This lab was written by Nicole Dalzell at Wake Forest University, using data provided through the Austin Open Data initiative. This lab is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license.