Load library

This high school exercise does not require any packages, but for convenience I recommend loading the tidyverse. A few other packages are also used in this answer, and I will explain each one when it first appears.

pacman::p_load("tidyverse")

Overview of the data “sleep”

You do not have to search the web for information about data sets that ship with base R. The following code opens the help page for sleep, which describes the data.

?sleep

Import data

Since the data ships with base R, strictly speaking you do not have to import the data set from your local computer. datasets is the name of an R package, and the :: operator lets you access a data set or function from a package without loading that package first. Here, I call the datasets package and access the sleep data.

sleep <- datasets::sleep

# check
head(sleep)

##   extra group ID
## 1   0.7     1  1
## 2  -1.6     1  2
## 3  -0.2     1  3
## 4  -1.2     1  4
## 5  -0.1     1  5
## 6   3.4     1  6
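
To make the :: idea a little more concrete, here is a minimal sketch that calls functions from the utils and stats packages by name, without any library() call. The variable names first_rows and med_extra are my own and only for illustration.

```r
# Sketch only: reach into packages with :: instead of calling library() first.
# `first_rows` and `med_extra` are hypothetical names used for illustration.
first_rows <- utils::head(datasets::sleep, 3)        # head() from utils, sleep from datasets
med_extra  <- stats::median(datasets::sleep$extra)   # median() from stats

print(first_rows)
print(med_extra)
```

The package::name form is also handy when two loaded packages export a function with the same name, because it states explicitly which one you mean.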

Data exploration

Here is the code required in your assignment, but its output is usually not very informative. My favorite functions are skim and glimpse. skim (from the skimr package) shows the data type, complete rate, quantiles, and some other information all at once. glimpse (from the dplyr package) is very handy for checking rows and columns quickly.

# check data structure
str(sleep)

## 'data.frame':	20 obs. of  3 variables:
##  $ extra: num  0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
##  $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ID   : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...

# check class of "sleep"
class(sleep)

## [1] "data.frame"

# my recommendation
skimr::skim(sleep)

Data summary
Name	sleep
Number of rows	20
Number of columns	3
_______________________
Column type frequency:
factor	2
numeric	1
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
group	0	1	FALSE	2	1: 10, 2: 10
ID	0	1	FALSE	10	1: 2, 2: 2, 3: 2, 4: 2

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
extra	0	1	1.54	2.02	-1.6	-0.03	0.95	3.4	5.5	▃▇▃▃▃

# OR
dplyr::glimpse(sleep)

## Rows: 20
## Columns: 3
## $ extra <dbl> 0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0, 1.9, 0.8, ~
## $ group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
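
If skimr or dplyr is not installed, a rough base R stand-in is possible. The sketch below is only my own assumption about which checks are useful (dimensions, column classes, missing values, factor levels), and the names col_classes, n_missing, and n_levels are made up for illustration.

```r
# Sketch only: a base R approximation of what skim() and glimpse() report.
sleep <- datasets::sleep

data_dim    <- dim(sleep)                         # rows and columns
col_classes <- sapply(sleep, class)               # class of each column
n_missing   <- colSums(is.na(sleep))              # missing values per column
is_fct      <- sapply(sleep, is.factor)           # which columns are factors
n_levels    <- sapply(sleep[is_fct], nlevels)     # number of levels per factor

print(data_dim)
print(col_classes)
print(n_missing)
print(n_levels)
```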

Summary of the data set

Please check the code required for this exercise. The code I wrote below is useful when a categorical variable has many levels. For instance, if there are 100 groups, it is very tedious to define 100 variables before exploring the data set. Instead, you can use the group_by function from the dplyr package and show a summary for every group with just three lines of code.

# define vars
sleep1 <- sleep[sleep$group == 1,]
sleep2 <- sleep[sleep$group == 2,]

# summary
summary(sleep1)

##      extra        group        ID   
##  Min.   :-1.600   1:10   1      :1  
##  1st Qu.:-0.175   2: 0   2      :1  
##  Median : 0.350          3      :1  
##  Mean   : 0.750          4      :1  
##  3rd Qu.: 1.700          5      :1  
##  Max.   : 3.700          6      :1  
##                          (Other):4

summary(sleep2)

##      extra        group        ID   
##  Min.   :-0.100   1: 0   1      :1  
##  1st Qu.: 0.875   2:10   2      :1  
##  Median : 1.750          3      :1  
##  Mean   : 2.330          4      :1  
##  3rd Qu.: 4.150          5      :1  
##  Max.   : 5.500          6      :1  
##                          (Other):4

# OR
quantile(sleep1$extra)

##     0%    25%    50%    75%   100% 
## -1.600 -0.175  0.350  1.700  3.700

quantile(sleep2$extra)

##     0%    25%    50%    75%   100% 
## -0.100  0.875  1.750  4.150  5.500

# My recommendation
sleep %>% 
  group_by(group) %>% 
  summarise(skimr::skim(extra))

## # A tibble: 2 x 13
##   group skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd
##   <fct> <chr>     <chr>             <int>         <dbl>        <dbl>      <dbl>
## 1 1     numeric   data                  0             1         0.75       1.79
## 2 2     numeric   data                  0             1         2.33       2.00
## # ... with 6 more variables: numeric.p0 <dbl>, numeric.p25 <dbl>,
## #   numeric.p50 <dbl>, numeric.p75 <dbl>, numeric.p100 <dbl>,
## #   numeric.hist <chr>
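
The same "no variable per group" idea also works in base R. This is a minimal sketch, not the method required by the exercise: split() cuts extra into one vector per level of group, and sapply() applies a function to each piece. The names by_group, group_quantiles, and group_means are my own.

```r
# Sketch only: per-group summaries without defining sleep1, sleep2, ... by hand.
sleep <- datasets::sleep

by_group <- split(sleep$extra, sleep$group)        # list with one numeric vector per group

group_quantiles <- t(sapply(by_group, quantile))   # one row per group, one column per quantile
group_means     <- sapply(by_group, mean)          # named vector of group means

print(group_quantiles)
print(group_means)
```

Because split() works from the levels of group, the code stays the same length whether there are 2 groups or 100.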

Sample Answer for exerciseE

Rikiya Honda

01/20/2021
