The high school exercise does not require any packages, but for convenience I recommend using the tidyverse. A few other packages are also used in this answer, and I will explain each one when I use it.
You do not have to search for the data sets that ship with base R. The following code is enough to show basic information about the data stored there.
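A minimal sketch of such a listing, assuming the intended call is data() restricted to the datasets package (the object name available is only illustrative):

```r
# List the data sets bundled with the datasets package.
# data() returns an object whose $results matrix has one row per data set.
available <- data(package = "datasets")

# Show the name and title of the first few entries.
head(available$results[, c("Item", "Title")])

# Check that the sleep data used in this exercise is among them.
"sleep" %in% available$results[, "Item"]
```

Calling data() with no arguments in an interactive session opens the same listing for every attached package.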
Since the data ships with base R, you do not have to import a data set from your local computer. datasets is the name of an R package, and the :: operator lets you reach a data set or function inside a package without attaching the package first. Here, I refer to the package named datasets and access the sleep data.
## extra group ID
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## 4 -1.2 1 4
## 5 -0.1 1 5
## 6 3.4 1 6
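The six rows above look like the result of head() applied to the sleep data. A short sketch, assuming that is the call in question (first_rows is an illustrative name):

```r
# Access sleep through the package namespace; no library() call is needed.
first_rows <- head(datasets::sleep)

# head() returns the first six rows by default.
first_rows
```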
Here is the code required in your assignment, although its output is usually not very informative. My favorite functions are skim and glimpse. skim, from the skimr package, reports the data types, completeness rate, quantiles, and some other information all at once. glimpse is very handy for checking rows and columns quickly.
## function (..., list = character(), package = NULL, lib.loc = NULL, verbose = getOption("verbose"),
## envir = .GlobalEnv, overwrite = TRUE)
## [1] "data.frame"
| | |
|---|---|
| Name | sleep |
| Number of rows | 20 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| factor | 2 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| group | 0 | 1 | FALSE | 2 | 1: 10, 2: 10 |
| ID | 0 | 1 | FALSE | 10 | 1: 2, 2: 2, 3: 2, 4: 2 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| extra | 0 | 1 | 1.54 | 2.02 | -1.6 | -0.03 | 0.95 | 3.4 | 5.5 | ▃▇▃▃▃ |
## Rows: 20
## Columns: 3
## $ extra <dbl> 0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0, 1.9, 0.8, ~
## $ group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
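The class, skim, and glimpse outputs above can be reproduced along these lines. This is a sketch that assumes skimr and dplyr are installed; the requireNamespace() guards are my addition so the script still runs when they are not.

```r
sleep <- datasets::sleep

# class() reports the kind of object; for sleep this is "data.frame".
sleep_class <- class(sleep)
sleep_class

# skim() gives a per-column overview: missing values, complete rate,
# mean, sd, quantiles. Runs only if skimr is installed.
if (requireNamespace("skimr", quietly = TRUE)) {
  print(skimr::skim(sleep))
}

# glimpse() prints the dimensions and one line per column.
# Runs only if dplyr is installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(sleep)
}
```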
Please check the code required for this exercise. The code I wrote below is useful when you have multiple categorical variables or many groups. For instance, if the number of groups is 100, it is very tedious to define 100 variables before exploring the data. Instead, you can use the group_by function from the dplyr package and show a summary of the data with just three lines of code.
# define vars
sleep1 <- sleep[sleep$group == 1,]
sleep2 <- sleep[sleep$group == 2,]
# summary
summary(sleep1)
## extra group ID
## Min. :-1.600 1:10 1 :1
## 1st Qu.:-0.175 2: 0 2 :1
## Median : 0.350 3 :1
## Mean : 0.750 4 :1
## 3rd Qu.: 1.700 5 :1
## Max. : 3.700 6 :1
## (Other):4
## extra group ID
## Min. :-0.100 1: 0 1 :1
## 1st Qu.: 0.875 2:10 2 :1
## Median : 1.750 3 :1
## Mean : 2.330 4 :1
## 3rd Qu.: 4.150 5 :1
## Max. : 5.500 6 :1
## (Other):4
## 0% 25% 50% 75% 100%
## -1.600 -0.175 0.350 1.700 3.700
## 0% 25% 50% 75% 100%
## -0.100 0.875 1.750 4.150 5.500
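The second summary and the two quantile tables above appear without their calls. A sketch of base R code that gives the same numbers, assuming quantile() was applied to the extra column of each subset (q1 and q2 are illustrative names):

```r
sleep <- datasets::sleep

# Split the data by group, as in the assignment.
sleep1 <- sleep[sleep$group == 1, ]
sleep2 <- sleep[sleep$group == 2, ]

# Summary of the second group (the first is shown above).
summary(sleep2)

# Default quantiles (0%, 25%, 50%, 75%, 100%) of extra in each group.
q1 <- quantile(sleep1$extra)
q2 <- quantile(sleep2$extra)
q1
q2
```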
## # A tibble: 2 x 13
## group skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd
## <fct> <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 1 numeric data 0 1 0.75 1.79
## 2 2 numeric data 0 1 2.33 2.00
## # ... with 6 more variables: numeric.p0 <dbl>, numeric.p25 <dbl>,
## # numeric.p50 <dbl>, numeric.p75 <dbl>, numeric.p100 <dbl>,
## # numeric.hist <chr>
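The grouped approach can be sketched with dplyr alone. This is not necessarily the exact code behind the tibble above, which also uses skimr; it assumes dplyr is installed, and the column names n, mean_extra, sd_extra, and median_extra are my own.

```r
library(dplyr)

# One row per group, with no need to define sleep1, sleep2, ... by hand.
sleep_summary <- datasets::sleep %>%
  group_by(group) %>%
  summarise(
    n            = n(),            # observations per group
    mean_extra   = mean(extra),    # mean extra hours of sleep
    sd_extra     = sd(extra),      # standard deviation
    median_extra = median(extra)   # 50% quantile
  )

sleep_summary
```

The same three steps (data, group_by, summarise) work unchanged whether the grouping variable has 2 levels or 100.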