Intro

R offers a large and growing number of packages for data analysis, and new ones appear every year. Still, it is easy to get stuck with a familiar tool and stop exploring alternatives. In this post, I would like to introduce a few packages that can serve as more powerful alternatives to base R functions. All of them are fairly recent: the documentation I refer to is dated 2020.

The dataset on which I am going to test these packages can be found here. It is an educational dataset that contains students' socio-demographic characteristics and their scores in several subjects. The data will be explored with the help of the packages skimr, summarytools, and gtsummary.
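The outputs in this post show column names with spaces and text columns stored as factors. A minimal sketch of how such a data frame could be loaded is given here. The helper name load_scores and the tiny stand-in file are assumptions made for illustration; they are not part of the original analysis.

```r
# Hypothetical helper: read a CSV so that column names keep their spaces
# (check.names = FALSE) and text columns become factors.
load_scores <- function(path) {
  read.csv(path, check.names = FALSE, stringsAsFactors = TRUE)
}

# Tiny stand-in file so the sketch runs on its own.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  '"gender","lunch","math score"',
  '"female","standard",72',
  '"male","free/reduced",47',
  '"female","standard",90'
), tmp)

demo <- load_scores(tmp)
str(demo)
```

With the real file, the call would be df <- load_scores("path/to/your/file.csv"), where the path is whatever location the downloaded dataset was saved to.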

In addition, each package will be scored on a scale from 0 (not suitable) to 3 (very suitable) on the following criteria:

  • Research requirements
  • Publishing standards
  • Personal preferences

The first step of your analysis: Skimr

The first package I want to introduce is skimr. It is perfect for taking a first glance at your dataset: you get a quick overview of the variables, their types, and the overall number of observations. The documentation can be found here.

# install.packages("skimr")
library(skimr)

According to a reviewer of the package, the main advantages of skimr are:

  • reporting a larger set of statistics than base R functions
  • handling different data types
  • defining missing, complete, n, and sd

Let us see the package in action.

Summarizing the whole dataset

We begin by passing the whole dataset to the package's main function, skim.

skim(df)
Data summary
Name df
Number of rows 1000
Number of columns 8
_______________________
Column type frequency:
factor 5
numeric 3
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
gender 0 1 FALSE 2 fem: 518, mal: 482
race/ethnicity 0 1 FALSE 5 gro: 319, gro: 262, gro: 190, gro: 140
parental level of education 0 1 FALSE 6 som: 226, ass: 222, hig: 196, som: 179
lunch 0 1 FALSE 2 sta: 645, fre: 355
test preparation course 0 1 FALSE 2 non: 642, com: 358

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
math score 0 1 66.09 15.16 0 57.00 66 77 100 ▁▁▅▇▃
reading score 0 1 69.17 14.60 17 59.00 70 79 100 ▁▂▆▇▃
writing score 0 1 68.05 15.20 10 57.75 69 79 100 ▁▂▅▇▃

As you can see, skimr provides a detailed summary of our data. From it we can read:

  • the number of rows and columns
  • the column types and the number of variables per type
  • the number of missing values for each variable
  • for factors: whether they are ordered, the number of unique values (= number of levels), and the counts for the most frequent levels
  • for numerics: the mean, the standard deviation, the quantiles from the lowest (p0) to the highest (p100) in 25% steps, and a small histogram

Note: numeric and factor variables are reported separately, which makes the table easier to read.
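The set of statistics is not fixed. The sketch below adds two extra columns for numeric variables with skim_with() and sfl(), then pulls out only the numeric part of the result with yank(). The built-in iris data stands in for df, and the names my_skim and overview are made up for this example.

```r
library(skimr)

# Custom skimmer: keeps the default numeric statistics and appends
# the interquartile range and the median absolute deviation.
my_skim <- skim_with(numeric = sfl(iqr = IQR, mad = mad))

skimmed <- my_skim(iris)                    # iris is a stand-in for df
numeric_part <- yank(skimmed, "numeric")    # only the numeric variables

# Convert to a plain tibble and keep a few columns of interest.
overview <- tibble::as_tibble(numeric_part)[
  , c("skim_variable", "n_missing", "mean", "sd", "iqr", "mad")
]
overview
```

Because the result is an ordinary data frame, it can be filtered or joined like any other table, for example to list only the variables with missing values.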

Specific research needs

Next, we are going to explore some basic features that help tailor the summary to our research question.

  • specific columns to summarize

In skimr you can select the columns you want summarized by listing their names after the name of the dataset. The example below summarizes the socio-demographic variables, which could be used as controls in our hypothetical research.

skim(df, gender, `race/ethnicity`,`parental level of education`)
Data summary
Name df
Number of rows 1000
Number of columns 8
_______________________
Column type frequency:
factor 3
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
gender 0 1 FALSE 2 fem: 518, mal: 482
race/ethnicity 0 1 FALSE 5 gro: 319, gro: 262, gro: 190, gro: 140
parental level of education 0 1 FALSE 6 som: 226, ass: 222, hig: 196, som: 179
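Column selection in skim() is not limited to explicit names. As a hedged sketch, the tidyselect helpers used in dplyr::select() can be passed as well. The example uses the built-in iris data rather than df, so the column names differ from those above.

```r
library(skimr)

# Select columns by a name pattern instead of listing them one by one.
sepal_only <- skim(iris, tidyselect::starts_with("Sepal"))
width_only <- skim(iris, tidyselect::ends_with("Width"))

sepal_only
width_only
```

This is handy when a dataset has many variables that share a prefix or suffix, such as the three "... score" columns in our data.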

  • dealing with grouped data

skimr works nicely in combination with dplyr functions, so you can aggregate the data and look at the summary by adding just one line to your dplyr code. The most useful feature here is that skimr can provide summary statistics for each level of a factor that has been used for grouping with dplyr::group_by.

For example, if we want to know how math scores differ between students of different genders, we can start with the statistics presented below. What we get is the summary statistics on math scores for each level of our grouping variable (gender).

library(dplyr)  # provides the %>% pipe

df %>% 
  dplyr::select(gender, `math score`) %>% 
  dplyr::group_by(gender) %>% 
  skim()
Data summary
Name Piped data
Number of rows 1000
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables gender

Variable type: numeric

skim_variable gender n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
math score female 0 1 63.63 15.49 0 54 65 74 100 ▁▁▆▇▂
math score male 0 1 68.73 14.36 27 59 69 79 100 ▁▃▇▇▃
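To see that the grouped summary is nothing magical, the sketch below compares the group means reported by skim() with those computed by hand with dplyr::summarise(). It uses the built-in iris data, with Species playing the role of gender; the object names are illustrative.

```r
library(dplyr)
library(skimr)

# Grouped skim: one row per level of the grouping variable.
grouped <- iris %>%
  group_by(Species) %>%
  skim(Sepal.Length)

# The same means and standard deviations computed manually.
by_hand <- iris %>%
  group_by(Species) %>%
  summarise(mean = mean(Sepal.Length), sd = sd(Sepal.Length))

# Compare the two sets of group means.
all.equal(sort(grouped$numeric.mean), sort(by_hand$mean))
```

The skim() version needs one line instead of one expression per statistic, which is where the time saving comes from.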

Conclusion

Finally, let me score skimr on each criterion to highlight its pros, cons, and usability.

  • Research requirements: 1/3. skimr provides an informative summary of the data and lets you focus on the variables of interest, all with a single function, skim. An additional plus is that it counts NAs for each variable, since it is crucial to detect them in the first step of the analysis. However, it does not run statistical tests.
  • Publishing standards: 0/3. As you can see, skimr does not produce publication-ready output. It contains technical details about the underlying tibble and internal column names (just look at skim_variable in the last table). Besides, the output is not intuitive for readers without a statistics background.
  • Personal preferences: 3/3. I really enjoy this package for getting a summary wherever I need one, since it works perfectly with dplyr. It is convenient and time-saving. Moreover, it is great for a quick view of a dataset with a huge number of variables, which is often the case. It is also helpful for quickly detecting variables with the wrong type.

Overall, skimr has every chance of becoming your favorite package for the very first step of analysis. Although you will probably never show these tables to anyone else, this simple tool will help you keep your sanity when you start working with a large dataset.

The second step of analysis: summarytools

Let us imagine that we have had a look at our dataset, fixed the variable types, and got rid of missing values. At this point, I want to introduce a package called summarytools. It is helpful when you want to learn more about your variables: their distributions and the number of observations per category, along with some graphs. The documentation can be found here.

As the reviewer of the package puts it, the advantages of the package can be described as follows:

  • it is flexible in terms of output formats and contents
  • it is pipe-friendly
  • it is multilingual: built-in translations exist for French, Portuguese, Spanish, Russian and Turkish
  • it is weights-ready: except for dfSummary(), all core functions support sampling weights

Besides, it contains a set of functions for different types of summary statistics. Let us load the library and have a look!

#install.packages("summarytools")
library(summarytools)

Summarizing the whole dataset

As always, we start by summarizing the whole dataset. The output is a shiny.tag. To make your graphs attractive, specify method = "render".

print(dfSummary(df), method = "render", varnumbers = FALSE, valid.col = FALSE)

Data Frame Summary

df

Dimensions: 1000 x 8
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
gender [factor] 1. female 2. male
518(51.8%)
482(48.2%)
0 (0%)
race/ethnicity [factor] 1. group A 2. group B 3. group C 4. group D 5. group E
89(8.9%)
190(19.0%)
319(31.9%)
262(26.2%)
140(14.0%)
0 (0%)
parental level of education [factor] 1. associate's degree 2. bachelor's degree 3. high school 4. master's degree 5. some college 6. some high school
222(22.2%)
118(11.8%)
196(19.6%)
59(5.9%)
226(22.6%)
179(17.9%)
0 (0%)
lunch [factor] 1. free/reduced 2. standard
355(35.5%)
645(64.5%)
0 (0%)
test preparation course [factor] 1. completed 2. none
358(35.8%)
642(64.2%)
0 (0%)
math score [numeric] Mean (sd) : 66.1 (15.2) min < med < max: 0 < 66 < 100 IQR (CV) : 20 (0.2) 81 distinct values 0 (0%)
reading score [numeric] Mean (sd) : 69.2 (14.6) min < med < max: 17 < 70 < 100 IQR (CV) : 20 (0.2) 72 distinct values 0 (0%)
writing score [numeric] Mean (sd) : 68.1 (15.2) min < med < max: 10 < 69 < 100 IQR (CV) : 21.2 (0.2) 77 distinct values 0 (0%)

Generated by summarytools 0.9.6 (R version 4.0.2)
2020-12-12

As you can see, the variables in the table appear in the same order as in the dataset. We have information about dimensions, duplicates, variable types, and missing values. However, the most outstanding features are:

  • for factors: convenient representation of the names of all levels and the number of observations per level
  • for numerics: mean, standard deviation, and minimum-median-maximum values
  • for all: good-looking graphs
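The per-level counts that dfSummary() shows for factors can also be requested for a single variable with freq(). The sketch below is a minimal illustration on the built-in iris data, not on df; the argument values are one possible choice.

```r
library(summarytools)

# Frequency table for one categorical variable.
# report.nas = FALSE drops the NA row and the "% Total" columns,
# cumul = FALSE hides cumulative percentages.
level_counts <- freq(iris$Species,
                     report.nas = FALSE,
                     cumul = FALSE,
                     headings = FALSE)
level_counts
```

A table like this makes it easy to spot levels with very few observations before any modeling is done.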

More specific research needs

  • Diving into numeric variables

With summarytools it is easy to get detailed summary statistics for your numeric variables. The function is called descr; it ignores all non-numeric variables and shows statistics for the numeric ones. Let us try!

descr(df)
## Descriptive Statistics  
## df  
## N: 1000  
## 
##                     math score   reading score   writing score
## ----------------- ------------ --------------- ---------------
##              Mean        66.09           69.17           68.05
##           Std.Dev        15.16           14.60           15.20
##               Min         0.00           17.00           10.00
##                Q1        57.00           59.00           57.50
##            Median        66.00           70.00           69.00
##                Q3        77.00           79.00           79.00
##               Max       100.00          100.00          100.00
##               MAD        14.83           14.83           16.31
##               IQR        20.00           20.00           21.25
##                CV         0.23            0.21            0.22
##          Skewness        -0.28           -0.26           -0.29
##       SE.Skewness         0.08            0.08            0.08
##          Kurtosis         0.26           -0.08           -0.05
##           N.Valid      1000.00         1000.00         1000.00
##         Pct.Valid       100.00          100.00          100.00

The table presents what matters most to us: besides the mean, median, and SD, the summary reports skewness and kurtosis, which help us learn more about the shape of the distribution.
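When the full list of statistics is more than you need, descr() accepts a stats argument. The sketch below keeps a handful of statistics and transposes the table so that each variable gets its own row. It runs on the built-in iris data; the selection of statistics is only an example.

```r
library(summarytools)

# Keep only selected statistics; transpose = TRUE prints variables as rows.
compact <- descr(iris,
                 stats = c("mean", "sd", "min", "med", "max",
                           "skewness", "kurtosis"),
                 transpose = TRUE,
                 headings = FALSE)
compact
```

Non-numeric columns (here Species) are skipped, just as the factor columns of df were skipped above.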

  • Diving into categorical variables

summarytools can produce cross-tabulations of two categorical variables. It also displays the chi-square statistic, and for 2 x 2 tables we can get the odds ratio and the risk ratio as well.

library(magrittr)  # provides the %$% and %>% operators

df %$% 
  ctable(`test preparation course`, gender,
         chisq = TRUE, OR = TRUE, RR = TRUE,
         headings = FALSE) %>%
  print(method = "render")
gender
test preparation course female male Total
completed 184 ( 51.4% ) 174 ( 48.6% ) 358 ( 100.0% )
none 334 ( 52.0% ) 308 ( 48.0% ) 642 ( 100.0% )
Total 518 ( 51.8% ) 482 ( 48.2% ) 1000 ( 100.0% )
 Χ2 = .0155   df = 1   p = .9008
O.R. (95% C.I.) = 0.98  (0.75 - 1.26)
R.R. (95% C.I.) = 0.99  (0.87 - 1.12)

Generated by summarytools 0.9.6 (R version 4.0.2)
2020-12-12
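The statistics under the cross-table can be reproduced from the four cell counts with base R, which helps to understand what they mean. The sketch below re-enters the counts from the table by hand and assumes that the chi-square test uses the continuity correction, which is the default of chisq.test() for 2 x 2 tables.

```r
# Cell counts copied from the cross-table above.
counts <- matrix(
  c(184, 174,
    334, 308),
  nrow = 2, byrow = TRUE,
  dimnames = list(course = c("completed", "none"),
                  gender = c("female", "male"))
)

# Chi-square test of independence (Yates correction is on by default for 2 x 2).
test <- chisq.test(counts)

# Odds ratio: odds of being female among "completed" vs. among "none".
odds_ratio <- (counts[1, 1] * counts[2, 2]) / (counts[1, 2] * counts[2, 1])

# Risk ratio: share of females among "completed" vs. among "none".
risk_ratio <- (counts[1, 1] / sum(counts[1, ])) /
              (counts[2, 1] / sum(counts[2, ]))

round(c(chi_sq = unname(test$statistic),
        p      = test$p.value,
        OR     = odds_ratio,
        RR     = risk_ratio), 4)
```

The rounded values agree with the chi-square statistic, p-value, odds ratio, and risk ratio printed by ctable.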

Conclusion

Finally, let me score summarytools on each criterion to highlight its pros, cons, and usability.

  • Research requirements: 2/3. summarytools provides an informative summary of the data with a single function and lets you dive into the variables of interest. It detects NAs and duplicates. Two further functions give detailed statistics for categorical and numeric variables, including the results of statistical tests, and produce nice tables. The only disadvantage is that the package does not produce regression tables.
  • Publishing standards: 1.5/3. summarytools gives you good-looking output for the summary of the whole dataset, which you can use when publishing your paper. A problem can occur if your dataset contains a huge number of variables, since the table will be too long. In addition, the cross-tables and the statistics for numeric variables do not look attractive (though they are quite readable).
  • Personal preferences: 3/3. I really like this package, since its functions save time and give you a clear structure for doing descriptive statistics.

By and large, summarytools is a very useful package when you need to explore your variables quickly, give your analysis a solid structure, and get good-looking tables for publishing.

The third step of analysis: gtsummary

You have done a great job getting to know your data and variables. The next (and probably the last) step is modeling. For this, you will probably want to use the gtsummary package. It is very new: the publication date is 29.09.20. As the creator of the package claims:

  • it automatically detects continuous, categorical, and dichotomous variables in the data set
  • works with common regression models
  • has a lot of opportunities for customization

The documentation can be found here. Let's explore!

#install.packages("gtsummary")
library(gtsummary)

Summarizing the whole dataset

We start our exploration with a basic function applied to the whole dataset.

tbl_summary(df)
Characteristic N = 1,000¹
gender
female 518 (52%)
male 482 (48%)
race/ethnicity
group A 89 (8.9%)
group B 190 (19%)
group C 319 (32%)
group D 262 (26%)
group E 140 (14%)
parental level of education
associate's degree 222 (22%)
bachelor's degree 118 (12%)
high school 196 (20%)
master's degree 59 (5.9%)
some college 226 (23%)
some high school 179 (18%)
lunch
free/reduced 355 (36%)
standard 645 (64%)
test preparation course
completed 358 (36%)
none 642 (64%)
math score 66 (57, 77)
reading score 70 (59, 79)
writing score 69 (58, 79)

¹ Statistics presented: n (%); Median (IQR)

So far, the summary is good-looking but not very informative (compared to the two previous packages). It includes the number of observations for each level of the factor variables and the median and IQR for the numeric variables.
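If you prefer means and standard deviations to medians and IQRs, the statistic argument of tbl_summary() controls what is reported. The sketch below is one possible setting, shown on the built-in iris data rather than on df.

```r
library(gtsummary)

# Report mean (SD) instead of the default median (IQR) for continuous
# variables, rounded to one decimal place.
tbl_mean_sd <- tbl_summary(
  iris,
  statistic = list(all_continuous() ~ "{mean} ({sd})"),
  digits = all_continuous() ~ 1
)
tbl_mean_sd
```

The text inside the quotes is a template, so other statistics such as {min} and {max} can be combined in the same way.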

As we remember, the power of gtsummary lies in its customization, so let us customize the table. There is a variety of ways to do so; here we will use the commented example code provided by Daniel Sjoberg. He suggests splitting the table by a grouping variable, adding p-values from statistical tests, and modifying the way the output looks.

table2 <- 
  tbl_summary(
    df,
    by = gender, # split table by group
    missing = "no" # don't list missing data separately
  ) %>%
  add_n() %>% # add column with total number of non-missing observations
  add_p() %>% # test for a difference between groups
  modify_header(label = "**Variable**") %>% # update the column header
  bold_labels() 

table2
Variable N female, N = 518¹ male, N = 482¹ p-value²
race/ethnicity 1,000 0.060
group A 36 (6.9%) 53 (11%)
group B 104 (20%) 86 (18%)
group C 180 (35%) 139 (29%)
group D 129 (25%) 133 (28%)
group E 69 (13%) 71 (15%)
parental level of education 1,000 0.6
associate's degree 116 (22%) 106 (22%)
bachelor's degree 63 (12%) 55 (11%)
high school 94 (18%) 102 (21%)
master's degree 36 (6.9%) 23 (4.8%)
some college 118 (23%) 108 (22%)
some high school 91 (18%) 88 (18%)
lunch 1,000 0.5
free/reduced 189 (36%) 166 (34%)
standard 329 (64%) 316 (66%)
test preparation course 1,000 >0.9
completed 184 (36%) 174 (36%)
none 334 (64%) 308 (64%)
math score 1,000 65 (54, 74) 69 (59, 79) <0.001
reading score 1,000 73 (63, 83) 66 (56, 75) <0.001
writing score 1,000 74 (64, 82) 64 (53, 74) <0.001

¹ Statistics presented: n (%); Median (IQR)

² Statistical tests performed: chi-square test of independence; Wilcoxon rank-sum test

Much better! In addition to everything shown before, we now have the results of statistical tests comparing the other variables across gender. Besides, the variable labels are now bold, which makes the table more readable.
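By default add_p() picks the tests itself (here the chi-square and Wilcoxon rank-sum tests). If your research design calls for a different test, it can be requested explicitly. The sketch below asks for a t-test on continuous variables and uses the trial dataset that ships with gtsummary, since it has a two-level grouping variable; the object names are illustrative.

```r
library(gtsummary)

# Small subset of the example data bundled with gtsummary.
two_arms <- trial[, c("trt", "age", "marker")]

# Mean (SD) per treatment arm.
tbl_t <- tbl_summary(
  two_arms,
  by = trt,
  missing = "no",
  statistic = all_continuous() ~ "{mean} ({sd})"
)

# Compare the arms with a t-test instead of the default rank-sum test.
tbl_t <- add_p(tbl_t, test = all_continuous() ~ "t.test")
tbl_t
```

The footnote of the resulting table names the test that was actually used, so the choice stays visible to the reader.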

Specific research needs

Moving on, I believe we all want to see how gtsummary works with regression tables! Although a linear regression model would be the logical choice for this dataset, I want to show you the result for a logistic regression (to impress you even more!).

Let's imagine that we want to describe the relationship between taking an additional test preparation course and the socio-demographic characteristics of students.

mod1 <- glm(`test preparation course` ~ gender + `race/ethnicity` + `parental level of education`, df, family = binomial)

t1 <- tbl_regression(mod1, exponentiate = TRUE)
t1
Characteristic OR¹ 95% CI¹ p-value
gender
female
male 0.96 0.74, 1.25 0.8
`race/ethnicity`group B 0.91 0.53, 1.55 0.7
`race/ethnicity`group C 0.88 0.53, 1.44 0.6
`race/ethnicity`group D 1.14 0.68, 1.90 0.6
`race/ethnicity`group E 0.68 0.39, 1.19 0.2
`parental level of education`bachelor's degree 0.90 0.57, 1.44 0.7
`parental level of education`high school 1.44 0.95, 2.18 0.087
`parental level of education`master's degree 1.09 0.60, 2.02 0.8
`parental level of education`some college 1.11 0.75, 1.63 0.6
`parental level of education`some high school 0.74 0.49, 1.11 0.2

¹ OR = Odds Ratio, CI = Confidence Interval

As we can see, gtsummary gives a very informative summary with all the parts needed to analyze a logistic regression:

  • a reference level for categorical predictors (here, female for gender)
  • a convenient presentation of the levels of categorical variables
  • p-values
  • confidence intervals
  • odds ratios
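The rows for race/ethnicity and parental level of education appear with backticks because those column names contain spaces or slashes. One possible workaround, sketched below on made-up toy data, is to convert the names to syntactic ones before modeling and to restore readable labels through the label argument of tbl_regression(). The data frame toy and its values are invented for illustration. Note also that glm() with a factor outcome models the probability of the second level, so the level order is set explicitly here.

```r
library(gtsummary)

set.seed(1)
# Toy data with the same kind of non-syntactic column names as df.
toy <- data.frame(
  `test preparation course` = factor(
    sample(c("none", "completed"), 200, replace = TRUE),
    levels = c("none", "completed")   # "completed" becomes the modeled event
  ),
  gender = factor(sample(c("female", "male"), 200, replace = TRUE)),
  `race/ethnicity` = factor(
    sample(paste("group", LETTERS[1:3]), 200, replace = TRUE)
  ),
  check.names = FALSE
)

# Make the names syntactic: spaces and slashes become dots.
names(toy) <- make.names(names(toy))

mod <- glm(test.preparation.course ~ gender + race.ethnicity,
           data = toy, family = binomial)

# Readable labels are supplied separately from the column names.
tbl_clean <- tbl_regression(
  mod,
  exponentiate = TRUE,
  label = list(gender ~ "Gender", race.ethnicity ~ "Race/ethnicity")
)
tbl_clean
```

With syntactic names, each categorical predictor gets its own header row and reference level in the table, as gender does above.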

Besides, it works perfectly with side-by-side regression models, so you should definitely read the documentation to learn more about this cool package!
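A minimal sketch of such a side-by-side table is given here, using tbl_merge() and the trial dataset bundled with gtsummary. The two models and the spanner titles are arbitrary choices for the example.

```r
library(gtsummary)

# Two nested logistic regression models on the bundled example data.
m1 <- glm(response ~ age, data = trial, family = binomial)
m2 <- glm(response ~ age + grade, data = trial, family = binomial)

t_small <- tbl_regression(m1, exponentiate = TRUE)
t_large <- tbl_regression(m2, exponentiate = TRUE)

# Place the two regression tables next to each other under their own headers.
side_by_side <- tbl_merge(
  tbls = list(t_small, t_large),
  tab_spanner = c("**Age only**", "**Age + grade**")
)
side_by_side
```

Rows are matched by variable, so a predictor that is present in only one model simply has empty cells in the other.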

Conclusion

Finally, let me score gtsummary on each criterion to highlight its pros, cons, and usability.

  • Research requirements: 2.5/3. gtsummary provides an informative summary of the data and of the relationships between variables. However, it does not tell much about the variables themselves, so you will have to do something more to explore them. Therefore, this package is not the best one for summary statistics of your dataset; there it only costs you time. The regression tables it provides, however, are probably the best thing in your coding life. With one line of code you get everything you need to analyze your regression model, so in this case gtsummary saves you a lot of time.
  • Publishing standards: 3/3. gtsummary is probably one of the best packages for publishing your results, since you can customize your table according to the requirements you have.
  • Personal preferences: 3/3. If I could, I would put 10000 here. So far, this is my favorite package for working with regression tables. It is convenient and easy to read. Besides, it looks attractive and excels in functionality.

To conclude, although gtsummary lacks functionality for descriptive statistics, it makes analyzing regression models and customizing tables easy.

Final thoughts

Overall, this post can serve as a guide for doing data analysis in R. It gives you a structure for getting everything you want quickly and with minimal effort. Let's summarize:

  • First, have a look at your data with skimr. Based on the scientific literature, choose the variables for your analysis. You can explore the number of observations, the number of levels for factors, NAs per variable, means, and SDs. Convert your variables to the appropriate types and get rid of NAs. Aggregate your data by your outcome variable with the dplyr package and have a look at the mean values.
  • Second, learn more about your variables with summarytools. Focus on the frequencies of categorical variables to detect groups that are highly underrepresented and deal with them. Explore the histograms to have a look at the distributions. If you want more advanced visualizations of the variables and the relationships between them, you can use the ggplot2 package. Besides, learn more about the distribution of numeric variables by analyzing skewness and kurtosis, and about the relationships between categorical variables with cross-tabs. If you want to publish some of your descriptive tables, do it with this package.
  • Finally, build regression models and analyze them with gtsummary. It won't take a lot of time, but it will provide you with nicely customized tables that contain all the necessary information and are ready for publishing. Do not forget to check assumptions :)

All in all, although base R functions can provide you with all the necessary information, I hope you will use some of the presented packages to spend less time and effort and to make your outputs more attractive.