Question 2

For the remaining exercises in this set, we will use one of R’s built-in data sets, called the “ChickWeight” data set. According to the documentation for R, the ChickWeight data set contains information on the weight of chicks in grams up to 21 days after hatching. Use the summary(ChickWeight) command to reveal basic information about the ChickWeight data set. You will find that ChickWeight contains four different variables. Name the four variables. Use the dim(ChickWeight) command to show the dimensions of the ChickWeight data set. The second number in the output, 4, is the number of columns in the data set, in other words the number of variables. What is the first number? Report it and describe briefly what you think it signifies.

summary(ChickWeight)
##      weight           Time           Chick     Diet   
##  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
##  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
##  Median :103.0   Median :10.00   20     : 12   3:120  
##  Mean   :121.8   Mean   :10.72   10     : 12   4:118  
##  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
##  Max.   :373.0   Max.   :21.00   19     : 12          
##                                  (Other):506
dim(ChickWeight)
## [1] 578   4

Observations

The four variables in this data set are weight, Time, Chick, and Diet. The dim() function reports the number of rows (observations) followed by the number of columns (variables). The first number, 578, is the number of observations: each row records the weight of one chick at one time point. The data set therefore contains 578 observations of four variables.
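As a cross-check on the answer above, the same facts can be read off with names(), nrow(), and ncol(). This is a small sketch rather than part of the assigned commands; the names n_obs and n_vars are chosen here for illustration.

```r
# Column (variable) names of the built-in ChickWeight data set
names(ChickWeight)

# Rows are observations; columns are variables
n_obs  <- nrow(ChickWeight)   # first number reported by dim()
n_vars <- ncol(ChickWeight)   # second number reported by dim()
c(observations = n_obs, variables = n_vars)
```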

Question 3

When a data set contains more than one variable, R offers another subsetting operator, $, to access each variable individually. For the exercises below, we are interested only in the contents of one of the variables in the data set, called weight. We can access the weight variable by itself, using the $, with this expression: ChickWeight$weight. Run the following commands, say what the command does, report the output, and briefly explain each piece of output:

summary(ChickWeight$weight)

head(ChickWeight$weight)

mean(ChickWeight$weight)

myChkWts <- ChickWeight$weight

quantile(myChkWts,0.50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    35.0    63.0   103.0   121.8   163.8   373.0
## [1] 42 51 59 64 76 93
## [1] 121.8183
## 50% 
## 103

Observations

The summary() function applied to the weight variable returns the minimum, first quartile, median, mean, third quartile, and maximum. What summary() returns depends on the class of its argument. Because weight is numeric, it returns these numeric statistics computed over all 578 observations.

The head() function returns the first six values of the vector (42, 51, 59, 64, 76, 93). You can change how many values are returned by supplying a second argument, n.
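A brief sketch of the second argument to head(), using a temporary variable w that is introduced here only for illustration:

```r
w <- ChickWeight$weight

# With no second argument, head() returns the first six values
first_six <- head(w)
first_six            # 42 51 59 64 76 93, as in the output above

# The n argument controls how many values are returned
first_three <- head(w, n = 3)
first_three
```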

The mean() function returns the average weight of the chicks, 121.8183 grams. It is the sum of all the observations divided by the number of observations.
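The definition of the mean can be checked by hand. In this sketch the names w and manual_mean are illustrative, not part of the exercise:

```r
w <- ChickWeight$weight

# Sum of all observations divided by the number of observations
manual_mean <- sum(w) / length(w)
manual_mean

# Should agree with the built-in function up to floating-point tolerance
all.equal(manual_mean, mean(w))
```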

The assignment myChkWts <- ChickWeight$weight stores a copy of the weight vector in a new variable. That variable is then passed to the quantile() function, which returns the value below which a given proportion of the observations fall. In this example, the 0.50 quantile is 103, which is the median. The summary() function reports the same value.
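To confirm that the 50% quantile and the median are the same quantity, the two can be compared directly. This sketch assumes quantile()'s default interpolation method; the name q50 is illustrative.

```r
myChkWts <- ChickWeight$weight

# The 0.50 quantile, returned as a named vector ("50%")
q50 <- quantile(myChkWts, 0.50)
q50

# The median of the same data
median(myChkWts)

# unname() drops the "50%" label so only the values are compared
all.equal(unname(q50), median(myChkWts))
```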

Question 4

In the second to last command of the previous exercise, you created a copy of the weight data from the ChickWeight data set and put it in a new vector called myChkWts. You can continue to use this myChkWts variable for the rest of the exercises below. Create a histogram for that variable. Then write code that will display the 2.5% and 97.5% quantiles of the distribution for that variable. Write an interpretation of the variable, including descriptions of the mean, median, shape of the distribution, and the 2.5% and 97.5% quantiles. Make sure to clearly describe what the 2.5% and 97.5% quantiles signify.

Observations

The histogram shows a right-skewed distribution of the weight data. Both the mean and the median (about 121.8 and 103, respectively) lie to the right of the peak of the distribution, and the mean is larger than the median, which is typical of right skew.

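The 2.5% quantile is the weight below which 2.5% of the observations fall, and the 97.5% quantile is the weight below which 97.5% of the observations fall. Together they bracket the central 95% of the data, leaving 2.5% in each tail. Because the distribution is right-skewed, the observations are concentrated at the low end, so the 2.5% quantile lies much closer to the median than the 97.5% quantile does.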
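The code for this exercise is not shown above, so the following is one possible sketch of it. The plot title, axis label, line styling, and the names raw_q and central_share are assumptions made here.

```r
myChkWts <- ChickWeight$weight

# Histogram of the raw weights
hist(myChkWts, main = "Chick weights", xlab = "Weight (g)")

# 2.5% and 97.5% quantiles of the raw data
raw_q <- quantile(myChkWts, probs = c(0.025, 0.975))
raw_q

# Optional: mark the two quantiles on the histogram
abline(v = raw_q, col = "red", lty = 2)

# Share of observations between the two quantiles (close to 0.95)
central_share <- mean(myChkWts >= raw_q[1] & myChkWts <= raw_q[2])
central_share
```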

Question 5

Write R code that will construct a sampling distribution of means from the weight data (as noted above, if you did exercise 3 you can use myChkWts instead of ChickWeight$weight to save yourself some typing). Make sure that the sampling distribution contains at least 1,000 means. Store the sampling distribution in a new variable that you can keep using. Use a sample size of n = 11 (sampling with replacement). Show a histogram of this distribution of sample means. Then, write and run R commands that will display the 2.5% and 97.5% quantiles of the sampling distribution on the histogram with a vertical line.

Observations
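No code is shown for this exercise, so here is a minimal sketch of one way to do it with replicate() and sample(). The seed, the variable names samplingDist11 and q11, and the plot styling are assumptions; the exact quantile values will vary with the seed.

```r
set.seed(1)  # assumed seed, only so the run is repeatable

myChkWts <- ChickWeight$weight

# 1,000 sample means, each from n = 11 weights drawn with replacement
samplingDist11 <- replicate(
  1000,
  mean(sample(myChkWts, size = 11, replace = TRUE))
)

# Histogram of the sampling distribution of means
hist(samplingDist11,
     main = "Sampling distribution of means (n = 11)",
     xlab = "Sample mean weight (g)")

# 2.5% and 97.5% quantiles, drawn as vertical lines on the histogram
q11 <- quantile(samplingDist11, probs = c(0.025, 0.975))
abline(v = q11, col = "red", lwd = 2)
q11
```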

Question 6

If you did Exercise 4, you calculated some quantiles for a distribution of raw data. If you did Exercise 5, you calculated some quantiles for a sampling distribution of means. Briefly describe, from a conceptual perspective and in your own words, what the difference is between a distribution of raw data and a distribution of sampling means. Finally, comment on why the 2.5% and 97.5% quantiles are so different between the raw data distribution and the sampling distribution of means.

Observations

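A distribution of raw data describes individual observations: each value is the weight of one chick at one weighing, so the distribution spans the full range of the data, from 35 to 373 grams. A sampling distribution of means describes sample means: each value is the average of 11 weights drawn at random, with replacement, from the raw data. The sampling distribution is centered near the mean of the raw data (about 121.8), but it is much narrower. The 2.5% and 97.5% quantiles differ so much because averaging pulls extreme values toward the center. A single chick can easily be very light or very heavy, but a sample of 11 has an extreme mean only if most of its members are extreme in the same direction, which is rare. Across the 1,000 samples, such outlying means are therefore uncommon, and the quantiles of the sampling distribution sit much closer to the mean.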
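The contrast can be made concrete by putting the two sets of quantiles side by side and comparing the spread of the sample means with the usual standard-error approximation, sd / sqrt(n). This is a sketch under an assumed seed, with illustrative variable names.

```r
set.seed(1)  # assumed seed, only so the run is repeatable

myChkWts <- ChickWeight$weight

# Sampling distribution of means for n = 11 (sampling with replacement)
means11 <- replicate(1000, mean(sample(myChkWts, size = 11, replace = TRUE)))

# Quantiles of the raw data versus quantiles of the sample means
probs <- c(0.025, 0.975)
rbind(raw_data     = quantile(myChkWts, probs),
      sample_means = quantile(means11, probs))

# Spread of the sample means versus the standard-error approximation
sd_means <- sd(means11)
se_approx <- sd(myChkWts) / sqrt(11)
c(sd_of_means = sd_means, sd_over_sqrt_n = se_approx)
```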

Question 7

Redo Exercise 5, but this time use a sample size of n = 100 (instead of the original sample size of n = 11 used in Exercise 5). Explain why the 2.5% and 97.5% quantiles are different from the results you got for Exercise 5. As a hint, think about what makes a sample “better.”

Observations

With n = 100, the 2.5% and 97.5% quantiles lie closer to the mean than they did with n = 11. Each sample of 100 draws on far more of the 578 observations, so it is likely to include values from both the low and the high end of the distribution, and these balance out in the mean. A larger sample is “better” in the sense that its mean is a more precise estimate of the mean of the full data set. The sample means therefore vary less from sample to sample, and the sampling distribution is narrower.
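A sketch of the n = 100 version, with an n = 11 run included so the two quantile ranges can be compared directly. The seed, the helper name sample_means, and the plot styling are assumptions; exact values depend on the seed.

```r
set.seed(1)  # assumed seed, only so the run is repeatable

myChkWts <- ChickWeight$weight

# Hypothetical helper: 1,000 sample means for a given sample size
sample_means <- function(n, reps = 1000) {
  replicate(reps, mean(sample(myChkWts, size = n, replace = TRUE)))
}

samplingDist11  <- sample_means(11)
samplingDist100 <- sample_means(100)

# Histogram for n = 100 with its 2.5% and 97.5% quantiles marked
hist(samplingDist100,
     main = "Sampling distribution of means (n = 100)",
     xlab = "Sample mean weight (g)")
q100 <- quantile(samplingDist100, probs = c(0.025, 0.975))
abline(v = q100, col = "red", lwd = 2)

# Width of the central 95% range for each sample size
q11 <- quantile(samplingDist11, probs = c(0.025, 0.975))
widths <- c(n11 = unname(diff(q11)), n100 = unname(diff(q100)))
widths
```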