L01 Introduction

Data Science 1 with R (STAT 301-1)

Author

YOUR NAME

Loading Packages & Datasets

# load packages
library(tidyverse) # loads the core tidyverse packages

Tasks

Complete the following tasks. For many of these you’ll need to simply indicate that you have completed the task.

Task 1

—Completed


Task 2

—Completed


Task 3

—Completed


Task 4

—Completed


Task 5 (optional)

x


Exercises

catdog <- read_delim('catsvdogs.txt')

Exercise 1

Suppose a random variable \(X\) has finite variance. Then, as we take larger random samples (i.e., as \(n\) increases), the sample mean \(\bar{X}\) is approximately distributed as \[\bar{X} \sim N\left(\mu_{\bar{X}}=\mu_X, \sigma^2_{\bar{X}} = \frac{\sigma_X^2}{n}\right)\] This is an informal statement of which important statistical theorem?

Solution

This is a statement of the Central Limit Theorem.
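The statement in Exercise 1 can be checked informally by simulation. The sketch below is only an illustration under assumed choices that are not part of the exercise: \(X\) is taken to be Exponential with rate 1 (so \(\mu_X = 1\) and \(\sigma_X^2 = 1\)), and the sample size `n`, the number of repetitions `reps`, and the seed are arbitrary.

```r
# Simulation sketch of the Central Limit Theorem (base R only).
# All settings below are illustrative assumptions, not part of the exercise.
set.seed(301)   # arbitrary seed, for reproducibility

n    <- 50      # size of each random sample
reps <- 10000   # number of simulated samples

# X ~ Exponential(rate = 1), so mu_X = 1 and sigma_X^2 = 1.
# Each replicate draws one sample of size n and records its mean.
xbar <- replicate(reps, mean(rexp(n, rate = 1)))

mean_xbar <- mean(xbar)  # should be close to mu_X = 1
var_xbar  <- var(xbar)   # should be close to sigma_X^2 / n = 1 / 50 = 0.02

mean_xbar
var_xbar
```

Increasing `n` should shrink `var_xbar` roughly in proportion to `1 / n`, which is the \(\sigma_X^2 / n\) term in the statement.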


Exercise 2

Solution

Undergraduate

Exercise 3

Read the codebook for the catsvdogs.txt dataset and load the dataset using the readr::read_delim() function. The readr:: prefix tells you that the read_delim() function is from the readr package, which is part of the tidyverse.


catdog <- read_delim("data/catsvdogs.txt", delim = "|")


What was the percentage of dog owners for Illinois in 2012?


catdog %>%
  filter(location == "Illinois") %>%
  select(percent_dog_owners)
# A tibble: 1 × 1
  percent_dog_owners
               <dbl>
1               32.4

Solution

The percentage of dog owners for Illinois in 2012 was 32.4%.

Exercise 4

Apply the skim() function from the skimr package to the catdog dataset. What does the skim() function return?

#| label: data-skim

library(skimr)
skim(catdog)
Data summary
Name catdog
Number of rows 49
Number of columns 12
_______________________
Column type frequency:
character 1
numeric 11
________________________
Group variables None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
location               0              1    4   20      0        49           0

Variable type: numeric

skim_variable                 n_missing  complete_rate     mean       sd     p0    p25     p50     p75     p100  hist
num_households                        0              1  2403.90  2514.05  221.0  765.0  1759.0  2632.0  12974.0  ▇▂▁▁▁
percent_households_with_pets          0              1    56.86     6.93   21.9   53.6    56.8    61.3     70.8  ▁▁▁▇▃
num_pet_households                    0              1  1342.59  1358.25   63.0  475.0   957.0  1611.0   6865.0  ▇▃▁▁▁
percent_dog_owners                    0              1    36.97     6.67   13.1   32.9    36.6    42.5     47.9  ▁▁▇▇▇
dog_owning_households                 0              1   876.37   891.83   38.0  273.0   638.0  1069.0   4260.0  ▇▃▁▁▁
mean_num_dogs_per_households          0              1     1.59     0.20    1.1    1.4     1.6     1.7      2.1  ▂▇▇▃▁
dog_population                        0              1  1414.16  1464.66   42.0  410.0  1097.0  1798.0   7163.0  ▇▃▁▁▁
percent_cat_owners                    0              1    31.64     5.68   11.6   29.0    31.3    33.8     49.5  ▁▁▇▂▁
cat_owning_households                 0              1   728.06   717.29   33.0  247.0   501.0   876.0   3687.0  ▇▃▁▁▁
mean_num_cats                         0              1     2.04     0.19    1.7    1.9     2.0     2.2      2.6  ▂▇▆▂▁
cat_population                        0              1  1492.80  1459.86   63.0  514.0  1185.0  1844.0   7118.0  ▇▃▁▁▁

Solution

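The skim() function returns a summary of the catdog dataset: the number of rows and columns, the count of columns by type, and, for each variable, the number of missing values, the completeness rate, and type-specific statistics. For numeric variables these are the mean, standard deviation, quantiles (p0 through p100), and a small histogram; for character variables they include the minimum and maximum string length and the number of unique values.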

Exercise 5

Calculate the mean of percent_dog_owners. Do you think this is a reasonable estimate for the percent of US dog owners? Why or why not?

catdog %>%
  summarize(
    avg_pct_dog_owner_unweighted = mean(percent_dog_owners),
    avg_pct_dog_owner_weighted = weighted.mean(percent_dog_owners, num_households)
  )
# A tibble: 1 × 2
  avg_pct_dog_owner_unweighted avg_pct_dog_owner_weighted
                         <dbl>                      <dbl>
1                         37.0                       36.5

Solution

The unweighted mean is 37.0% and the weighted mean is 36.5%. The median of percent_dog_owners is 36.6% (the p50 value in the skim() output), which is close to the mean, so the mean is not being pulled by a single outlier. As recently as 2021, the national figure was reported to be 38.4%; this slight uptick makes sense given the increased demand for dogs during the COVID-19 pandemic. However, the unweighted mean treats every state equally even though states differ in size, so on its own it is not a reliable national estimate. It looks reasonable here only because the weighted mean, which weights each state by its number of households, is very close to it. Thus, the estimate that should be used is 36.5%, provided by the weighted mean.
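To see why the two means in Exercise 5 can differ, recall that a weighted mean is \[\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i}\] where, in this exercise, \(x_i\) is a state's percentage of dog-owning households and \(w_i\) is its number of households. The sketch below uses a small made-up data frame (the names `toy`, `households`, and `pct_dogs` and all values are hypothetical, not taken from catsvdogs.txt) to show that weighted.mean() matches the formula computed by hand.

```r
# Toy data: hypothetical values for illustration only,
# NOT taken from catsvdogs.txt.
toy <- data.frame(
  state      = c("A", "B", "C"),
  households = c(100, 300, 600),  # weights w_i
  pct_dogs   = c(50, 40, 30)      # values x_i (percent)
)

# Unweighted mean: every state counts equally.
unweighted <- mean(toy$pct_dogs)                                      # 40

# Weighted mean by hand: sum(w * x) / sum(w).
by_hand <- sum(toy$pct_dogs * toy$households) / sum(toy$households)  # 35

# Same quantity using base R's weighted.mean().
weighted <- weighted.mean(toy$pct_dogs, toy$households)              # 35

c(unweighted = unweighted, by_hand = by_hand, weighted = weighted)
```

In this toy example the largest state has the lowest percentage, so the weighted mean falls below the unweighted one. In the catdog data the two are close, which suggests that state size and dog ownership are not strongly related there.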