Load library

The high school exercise does not require any packages, but for convenience I recommend loading the tidyverse. A few other packages also appear in this answer, and I explain each one where it is first used.

pacman::p_load("tidyverse")

Overview of the data “sleep”

You do not need to search online for information about data sets that ship with base R. The following command opens the help page for sleep, which describes the data and its variables.

?sleep
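
If you are curious which other data sets are bundled alongside sleep, the following is a minimal sketch that assumes only base R. The variable name bundled is my own choice for illustration and is not part of the exercise.

```r
# List every data set bundled with the datasets package
bundled <- data(package = "datasets")$results[, "Item"]

# Show the first few names
head(bundled)

# Confirm that sleep is among them
"sleep" %in% bundled
```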

Import data

Since the data ships with base R, strictly speaking you do not have to import the data set from your local computer. datasets is the name of an R package, and the :: operator lets you access an object from a package without loading that package first. Here, I call on the datasets package and access the sleep data.

sleep <- datasets::sleep

# check
head(sleep)
##   extra group ID
## 1   0.7     1  1
## 2  -1.6     1  2
## 3  -0.2     1  3
## 4  -1.2     1  4
## 5  -0.1     1  5
## 6   3.4     1  6
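
As a small illustration of the two access routes, here is a sketch that compares the :: operator with the data() function. The name sleep_ns is hypothetical and used only for this comparison.

```r
# Route 1: reach into the datasets namespace with ::, no library() call needed
sleep_ns <- datasets::sleep

# Route 2: data() copies the data set into the current workspace as "sleep"
data("sleep", package = "datasets")

# Both routes should give the same data frame
identical(sleep_ns, sleep)
```

Either route is fine for this exercise. I prefer :: because it makes explicit which package the object comes from.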

Data exploration

The code required in your assignment comes first, but its output is usually not very informative. My favorite functions are skim and glimpse. skim (from the skimr package) shows the type of each variable, the complete rate, quantiles, and some other information all at once. glimpse (from the dplyr package) is very handy for checking rows and columns quickly.

# check data structure
str(sleep)
## 'data.frame':	20 obs. of  3 variables:
##  $ extra: num  0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
##  $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ID   : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
# check class of "sleep"
class(sleep)
## [1] "data.frame"
# my recommendation
skimr::skim(sleep)
Data summary
Name                 sleep
Number of rows       20
Number of columns    3
_______________________
Column type frequency:
  factor             2
  numeric            1
________________________
Group variables      None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
group                  0              1  FALSE           2  1: 10, 2: 10
ID                     0              1  FALSE          10  1: 2, 2: 2, 3: 2, 4: 2

Variable type: numeric

skim_variable  n_missing  complete_rate  mean    sd    p0    p25   p50  p75  p100  hist
extra                  0              1  1.54  2.02  -1.6  -0.03  0.95  3.4   5.5  ▃▇▃▃▃
# OR
dplyr::glimpse(sleep)
## Rows: 20
## Columns: 3
## $ extra <dbl> 0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0, 1.9, 0.8, ~
## $ group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
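
If skimr or dplyr is not installed, much of the same information is available from base R. The following is a rough sketch under that assumption, not a replacement for the functions above.

```r
sleep <- datasets::sleep

dim(sleep)            # number of rows, then number of columns
sapply(sleep, class)  # class of each column
table(sleep$group)    # number of observations in each group
anyNA(sleep)          # TRUE if any value is missing
```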

Summary of the data set

Please check the code required for this exercise first. The code I recommend at the end is useful when a categorical variable has many levels. For instance, if there were 100 groups, it would be very tedious to define 100 subsets before exploring the data. Instead, you can use the group_by function from the dplyr package and show a summary for every group with just three lines of code.

# define vars
sleep1 <- sleep[sleep$group == 1,]
sleep2 <- sleep[sleep$group == 2,]

# summary
summary(sleep1)
##      extra        group        ID   
##  Min.   :-1.600   1:10   1      :1  
##  1st Qu.:-0.175   2: 0   2      :1  
##  Median : 0.350          3      :1  
##  Mean   : 0.750          4      :1  
##  3rd Qu.: 1.700          5      :1  
##  Max.   : 3.700          6      :1  
##                          (Other):4
summary(sleep2)
##      extra        group        ID   
##  Min.   :-0.100   1: 0   1      :1  
##  1st Qu.: 0.875   2:10   2      :1  
##  Median : 1.750          3      :1  
##  Mean   : 2.330          4      :1  
##  3rd Qu.: 4.150          5      :1  
##  Max.   : 5.500          6      :1  
##                          (Other):4
# OR
quantile(sleep1$extra)
##     0%    25%    50%    75%   100% 
## -1.600 -0.175  0.350  1.700  3.700
quantile(sleep2$extra)
##     0%    25%    50%    75%   100% 
## -0.100  0.875  1.750  4.150  5.500
# My recommendation
sleep %>% 
  group_by(group) %>% 
  summarise(skimr::skim(extra))
## # A tibble: 2 x 13
##   group skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd
##   <fct> <chr>     <chr>             <int>         <dbl>        <dbl>      <dbl>
## 1 1     numeric   data                  0             1         0.75       1.79
## 2 2     numeric   data                  0             1         2.33       2.00
## # ... with 6 more variables: numeric.p0 <dbl>, numeric.p25 <dbl>,
## #   numeric.p50 <dbl>, numeric.p75 <dbl>, numeric.p100 <dbl>,
## #   numeric.hist <chr>
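
The same idea of summarising every group in one step can also be sketched in base R with split and sapply, in case you prefer not to depend on dplyr and skimr. The names extra_by_group, group_quantiles, group_means, and group_sds are my own and purely illustrative.

```r
sleep <- datasets::sleep

# Split the extra column into one numeric vector per level of group
extra_by_group <- split(sleep$extra, sleep$group)

# Apply the same summary functions to every group at once
group_quantiles <- sapply(extra_by_group, quantile)
group_means     <- sapply(extra_by_group, mean)
group_sds       <- sapply(extra_by_group, sd)

# One column per group: quantiles first, then mean and sd
group_quantiles
round(rbind(mean = group_means, sd = group_sds), 2)
```

This scales to any number of groups, because split creates the subsets for you instead of requiring one manually defined variable per group.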