Complete all Questions, and submit final documents in PDF form on Canvas.

The Data

The Austin Animal Shelter in Austin, Texas, is the largest no-kill shelter in the United States. The organization cares for dogs, cats, and other animals in need, and each year, this work results in thousands of adoptions. Today, we will be working with a random sample of 10,000 animals adopted from the Shelter in 2016 or 2017. Our task is to determine the average number of days an animal tends to spend in the Shelter prior to being adopted. We will be using the techniques of building confidence intervals and performing hypothesis tests to take the results from our sample and try to generalize to the population.

  1. What is the population of interest?

EDA

As we know, the first step in any data analysis is EDA. We need to know what kind of data we are working with, and what questions we might be able to explore using this data. To start, we need to load in the data. This data set is on Canvas under Lab 6, so go there to download the data. The data is a cleaned version of a data set available here. If you want to remind yourself of the steps needed to read a .csv file into R, look back at Lab 2. There are detailed steps provided on how to load the data into your workspace. Check to be sure you see AdoptedSamp in your workspace and that you have put the necessary code into a chunk in your RMarkdown document before you proceed.

The data set we will be working with contains information on 16 different variables. The meaning of each variable is as follows:

  • animal_type: the type of animal, i.e., dog, cat, etc.
  • breed: the specific breed of animal; example: Cocker Spaniel.
  • color: the coat/feather color of the animal.
  • intake_type: the reason that the animal was taken in to the shelter. Levels: Owner Surrender, Public Assist, Stray, Wildlife.
  • intake_condition: the condition of the animal upon its arrival at the shelter.
  • age_upon_intake_.days.: the age of the animal in days upon its arrival at the shelter.
  • age_upon_intake_.years.: the age of the animal in years (rounded) upon its arrival at the shelter.
  • intake_month: the month of the year the animal arrived at the shelter.
  • intake_year: the year the animal arrived at the shelter.
  • time_in_shelter_days: the number of days the animal remained in the shelter prior to being adopted.
  • outcome_year: the year the animal was adopted from the shelter.
  • outcome_subtype: if the adoption was not to a family home, where was the animal adopted? Options: Blank (adopted to a family), Foster (sent to a foster home) or Offsite (adopted by an organization or business).
  • outcome_type: for all animals in our subset, the outcome of being in the shelter is "Adopted".
  • sex_upon_outcome: the sex of the animal, including information about neutering/spaying.
  • age_upon_outcome_.days.: the age of the animal in days upon its departure from the shelter.
  • age_upon_outcome_.years.: the age of the animal in years upon its departure from the shelter.

We are interested in the variable time_in_shelter_days, which expresses the number of days an animal is in the Shelter prior to being adopted.

The population standard deviation is provided to us as 44.45. This is something we have from a national study of the number of days animals stay in a shelter. For these data, we are interested in finding our sample mean.

  2. Find the sample mean for the number of days an animal stays in the Shelter. Remember that the function mean will be useful for this.
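As a quick reminder of how mean behaves, here is a minimal sketch on a made-up vector. The names toy_days and toy_with_na and their values are invented for illustration and are not part of AdoptedSamp.

```r
# Hypothetical toy data: days in shelter for six made-up animals.
toy_days <- c(3, 10, 45, 7, 120, 21)

# mean() adds up the values and divides by how many there are.
toy_mean <- mean(toy_days)
toy_mean

# If a variable contains a missing value (NA), mean() returns NA
# unless na.rm = TRUE is supplied.
toy_with_na <- c(toy_days, NA)
mean(toy_with_na)                 # NA
mean(toy_with_na, na.rm = TRUE)   # same as toy_mean
```

For the lab itself, the same function is applied to a column of the data frame using the $ notation, as in AdoptedSamp$time_in_shelter_days.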

Confidence Interval

We have been asked to determine how many days on average an animal remains in the Shelter in Austin before being adopted.

  3. We have already computed the sample mean. Why would it not be reasonable to just report this point estimate (i.e., sample mean) in response to this question?
  4. Make and interpret a 95% confidence interval for the number of days an animal remains in the Shelter before being adopted.
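The mechanics of a Z-based confidence interval can be sketched in R as follows. The sample mean xbar and the sample size n below are made-up values chosen only to show the arithmetic; they are not the lab's results. Only sigma = 44.45 comes from the lab.

```r
# Hypothetical summary values, NOT the results for AdoptedSamp.
xbar  <- 25      # made-up sample mean (days)
n     <- 400     # made-up sample size
sigma <- 44.45   # population standard deviation given in the lab

se     <- sigma / sqrt(n)    # standard error of the sample mean
z_star <- qnorm(0.975)       # critical value for 95% confidence (about 1.96)

lower <- xbar - z_star * se  # point estimate minus margin of error
upper <- xbar + z_star * se  # point estimate plus margin of error
c(lower = lower, upper = upper)
```

For a different confidence level, only the value passed to qnorm changes; for example, 99% confidence would use qnorm(0.995).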

Hypothesis Test

An individual who works at the Shelter claims that, on average, animals spend less than a month (30 days) in the Shelter prior to being adopted. In order to check this claim, we could either build a confidence interval or conduct a hypothesis test.

  5. Based on the confidence interval you have built, how would you reply to the claim that on average, animals spend less than 30 days in the Shelter prior to being adopted?
  6. Compute the effect size for these data. Does the result suggest no effect, a small effect, a moderate effect, a large effect, or a very large effect? Explain.
  7. Run a hypothesis test to determine whether or not the average length of time an animal remains in the Shelter prior to being adopted is less than 30 days. Specify your significance level, write down all 6 steps, interpret the p-value, and state your conclusion. Note: In Step 4, just state the distribution; you do not need to draw a picture. To produce the necessary mathematical notation for the 6 steps, copy and paste the following into the white space in your Markdown.

    Step 1: $H_0:$, $H_a:$

    Step 2: $\bar{x} =$, $SE= \frac{\sigma}{\sqrt{n}} =$

    Step 3:

    Step 4: The sampling distribution of the test statistic under the null is a ? distribution. In other words, if the null were true, we would expect that $\bar{X}$ would follow a ? distribution with mean ? and standard error ?.

    Step 5: The p-value is ?. This means that ...

    Step 6: Using a significance level of ?, we ...

    As part of a hypothesis test, we need to compute a p-value. We know how to do this using our z-tables, but we can also do it in R. Suppose we have computed a z-score for some sample mean, and found that it is -2.4. In order to compute the probability of seeing a z-score of -2.4 or lower, we use the code:

    pnorm(-2.4, mean = 0, sd = 1)

    Note that, just like the tables or the applet, this computes the probability of being less than or equal to the z-score of -2.4.
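The chain from sample mean to z-score to p-value can be sketched as below. The values of xbar and n are made up for illustration and are not the lab's results. The effect size line assumes that the effect size used in this course is the standardized difference between the sample mean and the null value; check your course notes for the exact definition and cutoffs.

```r
# Hypothetical summary values, NOT the results for AdoptedSamp.
xbar  <- 25      # made-up sample mean (days)
n     <- 400     # made-up sample size
mu0   <- 30      # null value from the claim being tested
sigma <- 44.45   # population standard deviation given in the lab

se <- sigma / sqrt(n)      # standard error of the sample mean
z  <- (xbar - mu0) / se    # how many standard errors xbar is from mu0

# Lower-tail probability, appropriate for a "less than" alternative.
p_lower <- pnorm(z, mean = 0, sd = 1)

# For a "greater than" alternative, ask for the upper tail instead.
p_upper <- pnorm(z, lower.tail = FALSE)

# Assumed definition of effect size: standardized difference from the null.
d <- (xbar - mu0) / sigma

c(z = z, p_lower = p_lower, p_upper = p_upper, effect_size = d)
```

Note that the z-score divides by the standard error, which shrinks as n grows, while the effect size divides by sigma and does not depend on n.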

So, here we can look at our two results - the hypothesis test and the confidence interval - and see that both can help us answer our question about the average length of time an animal spends in a shelter prior to being adopted.

  8. What is the probability of making a Type 1 Error in our hypothesis test?

While we can use both hypothesis tests and confidence intervals to evaluate claims, confidence intervals provide more information than a hypothesis test. Hypothesis tests allow you to evaluate a claim and make a statement such as "There is strong evidence the mean is less than 7." Confidence intervals still provide this information, but they also provide you a range of values for where the population parameter might lie. For instance, with a confidence interval, we can make claims such as "We are 95% confident that the mean is between 4 and 5." Hypothesis tests are widely used in practice, but you should ALWAYS prefer reporting a confidence interval to reporting a p-value.

The z.test function

Now that we have worked with the data, and practiced performing inference the "long way", let's use R to speed up the process. The confidence intervals and hypothesis tests we have been conducting so far rely on the standard normal distribution. As the standard normal is often represented as a Z, we call such procedures Z-procedures. To perform these procedures in R, we use a function called z.test.

This z.test function is not one that RStudio already knows (like hist or mean). Instead, we have to teach R this function by loading in a package, a downloadable set of functions written for R. To load the function we need, copy the following two lines of code, paste them into a chunk, and hit play.

install.packages("BSDA")
library(BSDA)

You have now installed the package called BSDA and used the library function to tell R to get ready to use the functions stored in the package. The function z.test that we need is included in that package. Now that you have installed the package, you need to comment out that line. In other words, you need to put a # symbol in front of install.packages. Your chunk should now look like this:

#install.packages("BSDA")
library(BSDA)

To get started, let's calculate a 98% confidence interval for the average number of days an animal stays at the Shelter before being adopted. To do this using R, we would use

z.test(x = AdoptedSamp$time_in_shelter_days, alternative = "two.sided", mu = 30, sigma.x = 44.45, conf.level = 0.98)

Let's go through the arguments of this function.

  • The first argument is x = AdoptedSamp$time_in_shelter_days, which is the variable that we are interested in working with.
  • The second argument, alternative, tells R what our alternative hypothesis is: that the population mean is less than ("less"), greater than ("greater"), or not equal to ("two.sided") the value specified in the null hypothesis. If we are only building a confidence interval, we should set alternative = "two.sided".
  • Next we decide on the mu we want to use as our null value. In other words, what value of the population mean are we testing? NOTE: If we are just building a confidence interval, this argument is irrelevant and you may leave it out. The computer will fill in a default value of 0, but you will only look at the confidence interval produced in the output.
  • Next we supply the population standard deviation of our data, which is denoted in the function as sigma.x. R will convert this value into the SE as a part of the function.
  • The conf.level is the confidence level used to define our confidence interval.

You will notice that we do not specify a significance level when we run this function. This means that once the function is run, you have to look at the p-value supplied as a result of the hypothesis test and make a decision.
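To connect the arguments above to the calculations done the "long way", here is a hedged sketch that computes the interval and a two-sided p-value by hand on simulated data, and then, only if the BSDA package is available, compares the interval with the one reported by z.test. The vector x, the seed, and the simulation settings are invented for illustration; simulated normal values can be negative, so this is not realistic shelter data.

```r
set.seed(1)
sigma <- 44.45                             # population SD given in the lab
x     <- rnorm(200, mean = 28, sd = sigma) # simulated toy data, NOT AdoptedSamp
mu0   <- 30                                # null value
conf  <- 0.98                              # confidence level

n      <- length(x)
se     <- sigma / sqrt(n)                  # sigma.x is converted to a standard error
z_crit <- qnorm(1 - (1 - conf) / 2)        # critical value for the chosen level
ci     <- mean(x) + c(-1, 1) * z_crit * se # two-sided confidence interval

z     <- (mean(x) - mu0) / se              # test statistic
p_two <- 2 * pnorm(-abs(z))                # two-sided p-value

# Optional cross-check against z.test, run only if BSDA is installed.
if (requireNamespace("BSDA", quietly = TRUE)) {
  res <- BSDA::z.test(x, alternative = "two.sided", mu = mu0,
                      sigma.x = sigma, conf.level = conf)
  print(all.equal(as.numeric(res$conf.int), ci))
  print(all.equal(as.numeric(res$p.value), p_two))
}
```

The decision step is still yours: compare the reported p-value with the significance level you chose before running the test.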

  9. Interpret the 98% confidence interval.

Considering Sample Size

We have discussed the impact of sample size on sample means. According to the law of large numbers, as the sample size increases, the sample mean approaches the population mean. Since sample means are critical to hypothesis testing, this suggests that sample size should also play a role in hypothesis tests.

To explore this, let's try working with a smaller sample. Run the lines of code below to create a new sample AdoptSmall of 500 animals.

set.seed(234)
AdoptSmall <- AdoptedSamp[sample(1:10000, 500, replace=FALSE),]

Now, let's try building a 95% confidence interval using this smaller sample.

  10. Using the z.test function and the sample AdoptSmall, create a 95% confidence interval for the average number of days an animal is in the Shelter prior to being adopted.
  11. How is this interval different from the interval you built in Question 4? Why do you think the change occurred?

What we have just seen is extremely important for statistical significance: sample size affects statistical significance, and hence the results of hypothesis tests and confidence intervals can change for samples of different sizes. In general, large samples tend to exhibit very little sampling variability. In other words, results are expected to be very consistent from sample to sample (the standard error is small). This means that even small changes may register as statistically significant. This means that with large samples, it is generally easier to reject the null hypothesis when you see a sample mean that is different from your null.

On the other hand, small random samples tend to vary greatly from sample to sample. This means that there is a large standard error expected. This means that unless an effect is very large, it can be hard for significance tests to register an effect as actually being statistically significant. This means that with small samples, it is generally harder to reject the null hypothesis even when you see a sample mean that is different from your null.
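The role of sample size can be seen directly in the standard error formula. The sketch below uses only the population standard deviation given in the lab and the two sample sizes used in this section; the variable names are chosen for illustration.

```r
sigma <- 44.45            # population standard deviation given in the lab
n     <- c(500, 10000)    # the small and large sample sizes used in this lab

se  <- sigma / sqrt(n)    # standard error shrinks as n grows
moe <- qnorm(0.975) * se  # margin of error for a 95% confidence interval

data.frame(n = n, standard_error = se, margin_of_error = moe)
```

Because the standard error depends on the square root of n, a sample 20 times larger gives a standard error that is smaller by a factor of about 4.5, not 20.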

Just Dogs

Let's go back to our original large sample AdoptedSamp, but let's narrow our focus. Suppose we are interested in the time it takes dogs to be adopted. There are more than just dogs in our AdoptedSamp, so the first step is to create a subset of our data that only contains information on the adoption of dogs. We can do that using the subset command:

Dogs <- subset(AdoptedSamp, animal_type == "Dog")

From another national survey, we know the standard deviation of the number of days it takes dogs to be adopted is 43.26.
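Before working with a subset, it can help to confirm that it contains only the rows you intended. Here is a minimal sketch on a made-up data frame; toy, toy_dogs, and their values are invented for illustration and are not part of AdoptedSamp.

```r
# Hypothetical toy data frame with the same two column names as the lab data.
toy <- data.frame(
  animal_type          = c("Dog", "Cat", "Dog", "Bird", "Cat"),
  time_in_shelter_days = c(12, 40, 5, 60, 9)
)

# Keep only the rows where animal_type is exactly "Dog".
toy_dogs <- subset(toy, animal_type == "Dog")

nrow(toy_dogs)                        # how many rows were kept
table(toy_dogs$animal_type)           # only "Dog" should appear
mean(toy_dogs$time_in_shelter_days)   # mean for the subset only
```

The same checks (nrow and table) can be run on Dogs to confirm the subset worked before it is passed to z.test.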

  12. Using the z.test function, create a 95% confidence interval for the average number of days a dog is in the Shelter prior to being adopted. Interpret the interval within the context of the data.
  13. Consider the two 95% confidence intervals you have created (Questions 4 and 12). Based on these intervals, do you think there might be a relationship between whether or not an animal is a dog and the number of days it takes for the animal to be adopted? Explain your reasoning.

This lab was written by Nicole Dalzell at Wake Forest University, using data provided here through the Austin Open Data initiative. This lab is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license.