# load packages
library(tidyverse) # loads the core tidyverse packages
L01 Introduction
Data Science 1 with R (STAT 301-1)
Loading Packages & Datasets
Tasks
Complete the following tasks. For many of these you’ll need to simply indicate that you have completed the task.
Task 1
—Completed
Task 2
—Completed
Task 3
—Completed
Task 4
—Completed
Task 5 (optional)
x
Exercises
catdog <- read_delim("data/catsvdogs.txt", delim = "|")
Exercise 1
Suppose a random variable \(X\) has finite variance. Then, as we take larger random samples (i.e., as \(n\) increases), we have that \[\bar{X} \sim N\left(\mu_{\bar{X}}=\mu_X, \sigma^2_{\bar{X}} = \frac{\sigma_X^2}{n}\right)\] This is an informal statement of which important statistical theorem?
Solution
This is a statement of the Central Limit Theorem.
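A small simulation can make the statement concrete. The sketch below is only an illustration under assumptions that are not part of the exercise: the population is taken to be Exponential with rate 1 (so \(\mu_X = 1\) and \(\sigma_X^2 = 1\)), and the sample size, number of replications, and seed are arbitrary choices.

```r
# Simulate many sample means from a skewed population and compare their
# mean and variance with the values suggested by the theorem.
set.seed(301)   # arbitrary seed, only for reproducibility
n    <- 50      # size of each random sample (assumed value)
reps <- 10000   # number of simulated samples (assumed value)

# Each replicate draws n values from Exponential(rate = 1) and averages them
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# Mean of the sample means should be close to mu_X = 1
c(simulated = mean(sample_means), theory = 1)

# Variance of the sample means should be close to sigma_X^2 / n = 1 / 50
c(simulated = var(sample_means), theory = 1 / n)
```

Even though the exponential distribution is strongly right-skewed, the simulated sample means should centre near \(\mu_X\) with variance near \(\sigma_X^2 / n\), which is what the informal statement describes.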
Exercise 2
Solution
Undergraduate
Exercise 3
Read the codebook for the catsvdogs.txt dataset and load the dataset using the readr::read_delim() function. The readr:: prefix tells you that read_delim() comes from the readr package, which is part of the tidyverse.
catdog <- read_delim("data/catsvdogs.txt", delim = "|")
What was the percentage of dog owners for Illinois in 2012?
catdog %>%
  filter(location == "Illinois") %>%
  select(percent_dog_owners)
# A tibble: 1 × 1
  percent_dog_owners
               <dbl>
1               32.4
Solution
The percentage of dog owners in Illinois in 2012 was 32.4%.
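The read-then-filter pattern used in this exercise can be tried without the course file. The sketch below is a minimal illustration: the two-row pipe-delimited text, the location names, and the percentages are all made up and merely stand in for catsvdogs.txt. It also uses pull() instead of select() to get a plain number rather than a one-cell tibble, which is a design choice and not part of the original solution.

```r
library(readr)
library(dplyr)

# Hypothetical pipe-delimited data standing in for catsvdogs.txt
toy_text <- "location|percent_dog_owners\nStateA|30.5\nStateB|41.2\n"

# I() tells readr to treat the string as literal data rather than a file path
toy <- read_delim(I(toy_text), delim = "|", show_col_types = FALSE)

# filter() keeps the matching row; pull() returns the column as a plain vector
state_b_pct <- toy %>%
  filter(location == "StateB") %>%
  pull(percent_dog_owners)

state_b_pct
```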
Exercise 4
Apply the skim() function from the skimr package to the catdog dataset. What does the skim() function return?
#| label: data-skim
library(skimr)
skim(catdog)
| Name | catdog |
|---|---|
| Number of rows | 49 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| location | 0 | 1 | 4 | 20 | 0 | 49 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| num_households | 0 | 1 | 2403.90 | 2514.05 | 221.0 | 765.0 | 1759.0 | 2632.0 | 12974.0 | ▇▂▁▁▁ |
| percent_households_with_pets | 0 | 1 | 56.86 | 6.93 | 21.9 | 53.6 | 56.8 | 61.3 | 70.8 | ▁▁▁▇▃ |
| num_pet_households | 0 | 1 | 1342.59 | 1358.25 | 63.0 | 475.0 | 957.0 | 1611.0 | 6865.0 | ▇▃▁▁▁ |
| percent_dog_owners | 0 | 1 | 36.97 | 6.67 | 13.1 | 32.9 | 36.6 | 42.5 | 47.9 | ▁▁▇▇▇ |
| dog_owning_households | 0 | 1 | 876.37 | 891.83 | 38.0 | 273.0 | 638.0 | 1069.0 | 4260.0 | ▇▃▁▁▁ |
| mean_num_dogs_per_households | 0 | 1 | 1.59 | 0.20 | 1.1 | 1.4 | 1.6 | 1.7 | 2.1 | ▂▇▇▃▁ |
| dog_population | 0 | 1 | 1414.16 | 1464.66 | 42.0 | 410.0 | 1097.0 | 1798.0 | 7163.0 | ▇▃▁▁▁ |
| percent_cat_owners | 0 | 1 | 31.64 | 5.68 | 11.6 | 29.0 | 31.3 | 33.8 | 49.5 | ▁▁▇▂▁ |
| cat_owning_households | 0 | 1 | 728.06 | 717.29 | 33.0 | 247.0 | 501.0 | 876.0 | 3687.0 | ▇▃▁▁▁ |
| mean_num_cats | 0 | 1 | 2.04 | 0.19 | 1.7 | 1.9 | 2.0 | 2.2 | 2.6 | ▂▇▆▂▁ |
| cat_population | 0 | 1 | 1492.80 | 1459.86 | 63.0 | 514.0 | 1185.0 | 1844.0 | 7118.0 | ▇▃▁▁▁ |
Solution
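The skim() function returns a summary of the catdog dataset: the number of rows (49) and columns (12), the count of columns by type (1 character, 11 numeric), and, for every variable, the number of missing values and the completion rate. It also reports type-specific statistics: string lengths and the number of unique values for the character variable, and the mean, standard deviation, quantiles, and a small histogram for each numeric variable.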
Exercise 5
Calculate the mean of percent_dog_owners. Do you think this is a reasonable estimate for the percent of US dog owners? Why or why not?
catdog %>%
  summarize(
    avg_pct_dog_owner_unweighted = mean(percent_dog_owners),
    avg_pct_dog_owner_weighted = weighted.mean(percent_dog_owners, num_households)
  )
# A tibble: 1 × 2
  avg_pct_dog_owner_unweighted avg_pct_dog_owner_weighted
                         <dbl>                      <dbl>
1                         37.0                       36.5
Solution
The unweighted mean is 37.0% and the mean weighted by number of households is 36.5%. The median of percent_dog_owners across locations is 36.6%, close to the mean, so the mean is not being pulled by a single outlier. As recently as 2021, the national figure was reported to be 38.4%; this slight uptick makes sense given the increased desire for dogs during the COVID-19 pandemic. However, the unweighted mean treats every location equally even though locations differ in size, so on its own it is not a reasonable estimate of the percent of U.S. dog owners; it looks accurate here only because it happens to be close to the weighted mean. Thus, the estimate that should be used is 36.5%, provided by the weighted mean.
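To see why weighting by num_households can matter, consider a deliberately extreme toy case. The numbers below are made up for illustration and do not come from catsvdogs.txt: one large location with few dog owners and one small location with many.

```r
# Hypothetical two-location example (made-up numbers)
pct_dog_owners <- c(10, 50)     # percent of households owning a dog
num_households <- c(900, 100)   # number of households in each location

# Unweighted: every location counts equally, (10 + 50) / 2
unweighted <- mean(pct_dog_owners)

# Weighted: every household counts equally, (10 * 900 + 50 * 100) / 1000
weighted <- weighted.mean(pct_dog_owners, num_households)

c(unweighted = unweighted, weighted = weighted)
# unweighted = 30, weighted = 14
```

Here the overall share of dog-owning households is 14%, and only the weighted mean recovers it. In the catdog data the two means are close (37.0 versus 36.5), which suggests that dog ownership rates are not strongly related to location size.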