R offers a large number of packages for data analysis. However, we often get attached to a particular tool and stop exploring new ones, even though more packages appear every year. In this post, I would like to introduce a few that can be a more powerful alternative to base R functions. All of them are fairly recent: their documentation is dated 2020.
The dataset I am going to use to test these packages can be found here. It is an educational dataset that contains students' socio-demographic characteristics and their scores in several subjects. I will explore it with the help of three packages: skimr, summarytools and gtsummary.
Besides, all the packages will be assessed on a scale from 0 (not suitable) to 3 (very suitable) according to the following criteria:
The first package I want to introduce is skimr. It is perfect for taking a first glance at your dataset: you get a quick overview of the variables, their types and the overall number of observations. The documentation can be found here.
According to a reviewer of the package, the main advantage of skimr is:
Let us see the package at work.
We begin by passing the whole dataset to the main function of the package, skim.
| | |
|---|---|
| Name | df |
| Number of rows | 1000 |
| Number of columns | 8 |
| Column type frequency: | |
| factor | 5 |
| numeric | 3 |
| Group variables | None |

Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| gender | 0 | 1 | FALSE | 2 | fem: 518, mal: 482 |
| race/ethnicity | 0 | 1 | FALSE | 5 | gro: 319, gro: 262, gro: 190, gro: 140 |
| parental level of education | 0 | 1 | FALSE | 6 | som: 226, ass: 222, hig: 196, som: 179 |
| lunch | 0 | 1 | FALSE | 2 | sta: 645, fre: 355 |
| test preparation course | 0 | 1 | FALSE | 2 | non: 642, com: 358 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| math score | 0 | 1 | 66.09 | 15.16 | 0 | 57.00 | 66 | 77 | 100 | ▁▁▅▇▃ |
| reading score | 0 | 1 | 69.17 | 14.60 | 17 | 59.00 | 70 | 79 | 100 | ▁▂▆▇▃ |
| writing score | 0 | 1 | 68.05 | 15.20 | 10 | 57.75 | 69 | 79 | 100 | ▁▂▅▇▃ |
As you can see, skimr provides a detailed summary of our data. From it we can get:
Note: numeric and factor variables are reported separately, which makes the table easier to read.
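The summary above comes from a single call to skim. A minimal sketch, assuming a small stand-in data frame (the values below are made up and only mimic some column names of the dataset used in this post):

```r
library(skimr)

# Hypothetical stand-in for the dataset: same style of column names, invented values
df <- data.frame(
  gender = factor(c("female", "male", "female", "male", "female", "male")),
  lunch = factor(c("standard", "standard", "free/reduced",
                   "standard", "free/reduced", "standard")),
  `math score` = c(72, 69, 90, 47, 76, 71),
  `reading score` = c(72, 90, 95, 57, 78, 83),
  check.names = FALSE  # keep the spaces in the column names
)

# One call summarizes every column, grouped by variable type
overview <- skim(df)
print(overview)
```

The returned object holds one row per variable, so it can be filtered or inspected like any other data frame.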
Next, we are going to explore some basic features that are helpful for summarizing the data with our research question in mind.
In skimr you can select the columns you want summarized by listing their names after the name of the dataset. The example below summarizes the socio-demographic variables, which can be used as controls in our hypothetical research.
| | |
|---|---|
| Name | df |
| Number of rows | 1000 |
| Number of columns | 8 |
| Column type frequency: | |
| factor | 3 |
| Group variables | None |

Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| gender | 0 | 1 | FALSE | 2 | fem: 518, mal: 482 |
| race/ethnicity | 0 | 1 | FALSE | 5 | gro: 319, gro: 262, gro: 190, gro: 140 |
| parental level of education | 0 | 1 | FALSE | 6 | som: 226, ass: 222, hig: 196, som: 179 |
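Selecting columns, as in the table above, only requires naming them after the data argument. A minimal sketch on a made-up stand-in data frame (the column names mimic the dataset, the values are hypothetical):

```r
library(skimr)

# Hypothetical stand-in data: invented values
df <- data.frame(
  gender = factor(c("female", "male", "female", "male")),
  lunch = factor(c("standard", "standard", "free/reduced", "standard")),
  `math score` = c(72, 69, 90, 47),
  check.names = FALSE
)

# Only the listed columns are summarized; `math score` is left out
controls <- skim(df, gender, lunch)
print(controls)
```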
skimr works nicely in combination with dplyr functions, so you can aggregate the data and look at the summary just by adding one line to your dplyr code. The most useful feature here is that skimr can provide summary statistics for each level of a factor that has been grouped with dplyr::group_by.
For example, if we want to know how math scores differ between students of different genders, we can start with the statistics presented below. What we get is the summary statistics for math scores at each level of our grouping variable (gender).
| | |
|---|---|
| Name | Piped data |
| Number of rows | 1000 |
| Number of columns | 2 |
| Column type frequency: | |
| numeric | 1 |
| Group variables | gender |

Variable type: numeric
| skim_variable | gender | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| math score | female | 0 | 1 | 63.63 | 15.49 | 0 | 54 | 65 | 74 | 100 | ▁▁▆▇▂ |
| math score | male | 0 | 1 | 68.73 | 14.36 | 27 | 59 | 69 | 79 | 100 | ▁▃▇▇▃ |
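A grouped summary like the one above can be sketched as a short dplyr pipeline that ends in skim. The data frame here is a made-up stand-in, so the numbers will not match the table:

```r
library(dplyr)
library(skimr)

# Hypothetical stand-in data: invented values
df <- data.frame(
  gender = factor(c("female", "male", "female", "male", "female", "male")),
  `math score` = c(72, 69, 90, 47, 76, 71),
  check.names = FALSE
)

by_gender <- df %>%
  select(gender, `math score`) %>%  # keep the grouping variable and the outcome
  group_by(gender) %>%              # skim() respects dplyr groups
  skim()

print(by_gender)
```

Each numeric variable now gets one row per group, with the grouping variable as an extra column.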
Finally, let me score skimr on each criterion to highlight its pros, cons and usability.
skimr provides an informative summary of the data and lets you focus on the variables of interest, all with a single function, skim. An additional plus goes to the count of NAs for each variable, since it is crucial to detect them at the first step of analysis. However, it does not work with statistical tests.

skimr does not produce good-looking output. The output carries technical details about the tibble and shows the raw variable names (just look under skim_variable in the last table). Besides, it is not intuitive for people who are not deep into statistics.

Overall, skimr has every chance of becoming your favorite package for the very first step of analysis. Although you will probably never show these tables to anyone else, this simple tool will keep you from going mad when you start working with a large dataset.
Now let us imagine that we have had a look at our dataset, sorted out the types of our variables and got rid of missing values. At this point, I want to introduce a package called summarytools. It is helpful when you want to learn more about your variables: their distributions, the number of observations per category, and some graphs. The documentation can be found here.
As the reviewer of the package puts it, the advantages of the package can be described as follows:
dfSummary(), all core functions support sampling weight

Besides, it contains a set of functions for different types of summary statistics. Let's load the library and have a look!
As always, we start by summarizing the whole dataset. The output is a shiny.tag. To make the graphs look attractive, specify method = "render".
| Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing |
|---|---|---|---|---|
| gender [factor] | 1. female; 2. male | | | 0 (0%) |
| race/ethnicity [factor] | 1. group A; 2. group B; 3. group C; 4. group D; 5. group E | | | 0 (0%) |
| parental level of education [factor] | 1. associate's degree; 2. bachelor's degree; 3. high school; 4. master's degree; 5. some college; 6. some high school | | | 0 (0%) |
| lunch [factor] | 1. free/reduced; 2. standard | | | 0 (0%) |
| test preparation course [factor] | 1. completed; 2. none | | | 0 (0%) |
| math score [numeric] | Mean (sd): 66.1 (15.2); min < med < max: 0 < 66 < 100; IQR (CV): 20 (0.2) | 81 distinct values | | 0 (0%) |
| reading score [numeric] | Mean (sd): 69.2 (14.6); min < med < max: 17 < 70 < 100; IQR (CV): 20 (0.2) | 72 distinct values | | 0 (0%) |
| writing score [numeric] | Mean (sd): 68.1 (15.2); min < med < max: 10 < 69 < 100; IQR (CV): 21.2 (0.2) | 77 distinct values | | 0 (0%) |

Generated by summarytools 0.9.6 (R version 4.0.2)
2020-12-12
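A whole-dataset summary of this kind can be sketched in two steps: build the summary with dfSummary, then print it with method = "render". The data frame below is a made-up stand-in, and the rendered HTML is meant for an R Markdown document rather than the console:

```r
library(summarytools)

# Hypothetical stand-in data: invented values
df <- data.frame(
  gender = factor(c("female", "male", "female", "male", "female", "male")),
  lunch = factor(c("standard", "standard", "free/reduced",
                   "standard", "free/reduced", "standard")),
  `math score` = c(72, 69, 90, 47, 76, 71),
  check.names = FALSE
)

# One row per variable: values, frequencies, graph and missing counts
overview <- dfSummary(df)

# method = "render" returns HTML that knitr embeds in the document
print(overview, method = "render")
```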
As you can see, the variables in the table appear in the same order as in the dataset. We get information about dimensions, duplicates, variable types and missing values. However, the most outstanding features are:
With summarytools it is easy to get detailed summary statistics for your numeric variables. The function is called descr; it ignores all non-numeric variables and shows statistics for the numeric ones. Let us try!
Descriptive Statistics
df
N: 1000

| | math score | reading score | writing score |
|---|---|---|---|
| Mean | 66.09 | 69.17 | 68.05 |
| Std.Dev | 15.16 | 14.60 | 15.20 |
| Min | 0.00 | 17.00 | 10.00 |
| Q1 | 57.00 | 59.00 | 57.50 |
| Median | 66.00 | 70.00 | 69.00 |
| Q3 | 77.00 | 79.00 | 79.00 |
| Max | 100.00 | 100.00 | 100.00 |
| MAD | 14.83 | 14.83 | 16.31 |
| IQR | 20.00 | 20.00 | 21.25 |
| CV | 0.23 | 0.21 | 0.22 |
| Skewness | -0.28 | -0.26 | -0.29 |
| SE.Skewness | 0.08 | 0.08 | 0.08 |
| Kurtosis | 0.26 | -0.08 | -0.05 |
| N.Valid | 1000.00 | 1000.00 | 1000.00 |
| Pct.Valid | 100.00 | 100.00 | 100.00 |

The most important information for us is in the table: besides the mean, median and SD, the summary reports skewness and kurtosis, which tell us more about the shape of the distribution.
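As a minimal sketch of the call behind such a table, descr can be applied to the whole data frame, or limited to a few statistics through its stats argument. The data frame is a made-up stand-in, so the numbers differ from those reported here:

```r
library(summarytools)

# Hypothetical stand-in data: invented values
df <- data.frame(
  gender = factor(c("female", "male", "female", "male", "female", "male")),
  `math score` = c(72, 69, 90, 47, 76, 71),
  `reading score` = c(72, 90, 95, 57, 78, 83),
  check.names = FALSE
)

# Full set of statistics; the factor column is ignored
stats <- descr(df)
print(stats)

# Only the statistics that describe location, spread and shape
shape <- descr(df, stats = c("mean", "sd", "skewness", "kurtosis"))
print(shape)
```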
summarytools can also produce cross-tabulations of two categorical variables. It displays the chi-square statistic and, for 2 x 2 tables, the odds ratio and risk ratio.
library(summarytools)
library(magrittr) # provides the %$% exposition pipe and %>%

df %$%
  ctable(`test preparation course`, gender,
         chisq = TRUE, OR = TRUE, RR = TRUE, # add chi-square test, odds ratio, risk ratio
         headings = FALSE) %>%
  print(method = "render")

| test preparation course | gender: female | gender: male | Total |
|---|---|---|---|
| completed | 184 (51.4%) | 174 (48.6%) | 358 (100.0%) |
| none | 334 (52.0%) | 308 (48.0%) | 642 (100.0%) |
| Total | 518 (51.8%) | 482 (48.2%) | 1000 (100.0%) |

Χ2 = .0155, df = 1, p = .9008; O.R. (95% C.I.) = 0.98 (0.75 - 1.26); R.R. (95% C.I.) = 0.99 (0.87 - 1.12)

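To see where the odds ratio and risk ratio in the cross-table come from, they can be recomputed by hand from the four cell counts. This is a base R sketch and not how summarytools computes them internally; the chi-square call assumes the default continuity correction of chisq.test:

```r
# Cell counts copied from the cross-table above
counts <- matrix(
  c(184, 174,
    334, 308),
  nrow = 2, byrow = TRUE,
  dimnames = list(course = c("completed", "none"),
                  gender = c("female", "male"))
)

# Odds ratio: odds of "female" among completers divided by the odds among non-completers
odds_ratio <- (counts["completed", "female"] / counts["completed", "male"]) /
  (counts["none", "female"] / counts["none", "male"])

# Risk ratio: share of females among completers divided by the share among non-completers
risk_ratio <- (counts["completed", "female"] / sum(counts["completed", ])) /
  (counts["none", "female"] / sum(counts["none", ]))

# Chi-square test of independence (Yates continuity correction for 2 x 2 tables)
chi <- chisq.test(counts)

round(c(OR = odds_ratio, RR = risk_ratio), 2)
chi
```

Both ratios are close to 1 and the test is far from significant, which agrees with the reading that taking the course is unrelated to gender in this dataset.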
Finally, let me score summarytools on each criterion to highlight its pros, cons and usability.
summarytools provides an informative summary of the data with a single function and lets you dive into the variables of interest. It detects NAs and duplicates. Two further functions give detailed statistics for categorical and numeric variables, and you can get the results of statistical tests as well as beautiful tables. The only disadvantage is that the package does not work with regression tables.

summarytools produces good-looking output for the summary of the whole dataset, which you can use when publishing your paper. A problem can occur if your dataset contains a huge number of variables, since the table will be too long. Cross-tables and statistics for numeric variables do not look as attractive, but they are quite readable.

By and large, summarytools is a very useful package when you need to explore your variables quickly, give your analysis a solid structure and get good-looking tables for publishing.
You have done a great job getting to know your data and variables. The next (and probably the last) step is modeling. For this, you will probably want to use the gtsummary package. It is very new: the publication date is 29.09.20. As the creator of the package claims:
The documentation can be found here. Let's explore!
We start our exploration with a basic function applied to the whole dataset.
| Characteristic | N = 1,000¹ |
|---|---|
| gender | |
| female | 518 (52%) |
| male | 482 (48%) |
| race/ethnicity | |
| group A | 89 (8.9%) |
| group B | 190 (19%) |
| group C | 319 (32%) |
| group D | 262 (26%) |
| group E | 140 (14%) |
| parental level of education | |
| associate's degree | 222 (22%) |
| bachelor's degree | 118 (12%) |
| high school | 196 (20%) |
| master's degree | 59 (5.9%) |
| some college | 226 (23%) |
| some high school | 179 (18%) |
| lunch | |
| free/reduced | 355 (36%) |
| standard | 645 (64%) |
| test preparation course | |
| completed | 358 (36%) |
| none | 642 (64%) |
| math score | 66 (57, 77) |
| reading score | 70 (59, 79) |
| writing score | 69 (58, 79) |

¹ Statistics presented: n (%); Median (IQR)

So far, the summary is good-looking but not very informative (compared with the two previous packages). It includes the number of observations at each level of the factor variables and the median and IQR of the numeric variables.
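The basic table comes from calling tbl_summary on the data frame. The sketch below uses simulated stand-in data (hypothetical values) and also shows one way to swap the default median (IQR) for mean (SD) through the statistic argument:

```r
library(gtsummary)

# Simulated stand-in data: hypothetical values, not the dataset from this post
set.seed(42)
n <- 60
df <- data.frame(
  gender = factor(sample(c("female", "male"), n, replace = TRUE)),
  `math score` = round(rnorm(n, mean = 66, sd = 15)),
  check.names = FALSE
)

# Default summary: n (%) for factors, median (IQR) for numeric variables
table1 <- tbl_summary(df)

# Same table, but continuous variables are reported as mean (SD)
table1_mean <- tbl_summary(
  df,
  statistic = list(all_continuous() ~ "{mean} ({sd})")
)
```

Printing either object (for example, typing table1_mean in an R Markdown chunk) renders the formatted table.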
As we remember, the power of gtsummary is in its customization, so let us customize the table. There is a variety of ways to do it; here we will use the commented example code provided by Daniel Sjoberg. It splits the table by group, adds p-values from statistical tests and modifies the look of the output.
library(gtsummary)

table2 <-
  tbl_summary(
    df,
    by = gender, # split table by group
    missing = "no" # don't list missing data separately
  ) %>%
  add_n() %>% # add column with total number of non-missing observations
  add_p() %>% # test for a difference between groups
  modify_header(label = "**Variable**") %>% # update the column header
  bold_labels() # print variable labels in bold
table2

| Variable | N | female, N = 518¹ | male, N = 482¹ | p-value² |
|---|---|---|---|---|
| race/ethnicity | 1,000 | | | 0.060 |
| group A | | 36 (6.9%) | 53 (11%) | |
| group B | | 104 (20%) | 86 (18%) | |
| group C | | 180 (35%) | 139 (29%) | |
| group D | | 129 (25%) | 133 (28%) | |
| group E | | 69 (13%) | 71 (15%) | |
| parental level of education | 1,000 | | | 0.6 |
| associate's degree | | 116 (22%) | 106 (22%) | |
| bachelor's degree | | 63 (12%) | 55 (11%) | |
| high school | | 94 (18%) | 102 (21%) | |
| master's degree | | 36 (6.9%) | 23 (4.8%) | |
| some college | | 118 (23%) | 108 (22%) | |
| some high school | | 91 (18%) | 88 (18%) | |
| lunch | 1,000 | | | 0.5 |
| free/reduced | | 189 (36%) | 166 (34%) | |
| standard | | 329 (64%) | 316 (66%) | |
| test preparation course | 1,000 | | | >0.9 |
| completed | | 184 (36%) | 174 (36%) | |
| none | | 334 (64%) | 308 (64%) | |
| math score | 1,000 | 65 (54, 74) | 69 (59, 79) | <0.001 |
| reading score | 1,000 | 73 (63, 83) | 66 (56, 75) | <0.001 |
| writing score | 1,000 | 74 (64, 82) | 64 (53, 74) | <0.001 |

¹ Statistics presented: n (%); Median (IQR)
² Statistical tests performed: chi-square test of independence; Wilcoxon rank-sum test

Much better! On top of everything shown before, we now have the results of statistical tests comparing the genders on every other variable. Besides, the variable labels are now bold, which makes the table more readable.
Moving on, I believe we all want to see how gtsummary works with regression tables! Although a linear regression model would be the logical choice for this dataset, I want to show you the result for a logistic regression (to impress you even more!).
Let's imagine that we want to describe the relationship between taking the additional test preparation course and the socio-demographic characteristics of students.
# with a factor outcome, glm models the odds of the second level (here "none")
mod1 <- glm(`test preparation course` ~ gender + `race/ethnicity` + `parental level of education`,
            data = df, family = binomial)
# exponentiate = TRUE reports odds ratios instead of log-odds coefficients
t1 <- tbl_regression(mod1, exponentiate = TRUE)
t1

| Characteristic | OR¹ | 95% CI¹ | p-value |
|---|---|---|---|
| gender | | | |
| female | — | — | |
| male | 0.96 | 0.74, 1.25 | 0.8 |
| `race/ethnicity`group B | 0.91 | 0.53, 1.55 | 0.7 |
| `race/ethnicity`group C | 0.88 | 0.53, 1.44 | 0.6 |
| `race/ethnicity`group D | 1.14 | 0.68, 1.90 | 0.6 |
| `race/ethnicity`group E | 0.68 | 0.39, 1.19 | 0.2 |
| `parental level of education`bachelor's degree | 0.90 | 0.57, 1.44 | 0.7 |
| `parental level of education`high school | 1.44 | 0.95, 2.18 | 0.087 |
| `parental level of education`master's degree | 1.09 | 0.60, 2.02 | 0.8 |
| `parental level of education`some college | 1.11 | 0.75, 1.63 | 0.6 |
| `parental level of education`some high school | 0.74 | 0.49, 1.11 | 0.2 |

¹ OR = Odds Ratio, CI = Confidence Interval

As we can see, gtsummary gives a very informative summary with everything needed to analyze a logistic regression:
- a reference level for the categorical predictor (female for gender)
- a convenient presentation of the levels of categorical variables
- p-values
- confidence intervals
- odds ratios

Besides, it works perfectly with side-by-side regression models, so you should definitely read the documentation to learn more about this cool package!
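For the side-by-side case, gtsummary provides tbl_merge, which places several tables next to each other under separate spanning headers. A minimal sketch on simulated data; the column names prep, gender and lunch and the two models are hypothetical and chosen only for illustration:

```r
library(gtsummary)

# Simulated stand-in data: hypothetical names and values
set.seed(1)
n <- 200
df <- data.frame(
  prep = factor(sample(c("completed", "none"), n, replace = TRUE)),
  gender = factor(sample(c("female", "male"), n, replace = TRUE)),
  lunch = factor(sample(c("free/reduced", "standard"), n, replace = TRUE))
)

# Two nested logistic regressions for the same outcome
mod_a <- glm(prep ~ gender, data = df, family = binomial)
mod_b <- glm(prep ~ gender + lunch, data = df, family = binomial)

# One regression table per model, reported as odds ratios
tbl_a <- tbl_regression(mod_a, exponentiate = TRUE)
tbl_b <- tbl_regression(mod_b, exponentiate = TRUE)

# Merge the tables side by side, one spanning header per model
side_by_side <- tbl_merge(
  tbls = list(tbl_a, tbl_b),
  tab_spanner = c("**Model A**", "**Model B**")
)
```

Printing side_by_side in an R Markdown chunk renders both models in one table, which makes it easy to see how the estimates change when a predictor is added.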
Finally, let me score gtsummary on each criterion to highlight its pros, cons and usability.
gtsummary provides an informative summary of the data and of the relationships between variables. However, it does not tell you much about the variables themselves, so you will have to do something more to explore them. For summary statistics of your dataset it is therefore not the best choice and simply costs you time. The regression tables it produces, on the other hand, are probably the best thing in your coding life: with one line of code you get everything you need to analyze your regression model. In this case, gtsummary saves you a lot of time.

gtsummary is probably one of the best packages for publishing your results, since you can customize your table according to the requirements you have.

To conclude, although gtsummary lacks functionality for descriptive statistics, it is easy to use for analyzing regression models and customizing tables.
Overall, this post can serve as a guide for doing data analysis in R. It gives you a clear structure for getting everything you need quickly and with minimal effort. Let's summarize:
- skimr. Based on the scientific literature, choose the variables for your analysis. You can explore the number of observations, the number of levels for factors, NAs per variable, mean scores and SD. Convert your variables to an appropriate type and get rid of NAs. Aggregate your data by your outcome variable with the dplyr package and have a look at the mean values.
- summarytools. Focus on the frequencies of categorical variables to detect groups that are highly underrepresented and deal with them. Explore the histograms to have a look at the distributions. If you want more advanced visualizations of the variables and the relationships between them, you can use the ggplot2 package. Besides, learn more about the distribution of numeric variables by analyzing skewness and kurtosis, and about the relationship between categorical variables with cross-tabs. If you want to publish some of your descriptive tables, do it with this package.
- gtsummary. It won't take a lot of time, but it will provide you with cool customized tables containing all the necessary information, ready for publishing. Do not forget to check the assumptions :)

All in all, although base R functions can provide you with all the necessary information, I hope you will use some of the presented packages to spend less time and effort and make your outputs more attractive.