In this short paper I evaluate several R packages for producing descriptive statistics and model summaries. The packages are evaluated against the following criteria:
research requirements
publishing standards
my personal preferences.
To show how the different functions perform, I use simulated data provided by the Princeton University Library (source: http://dss.princeton.edu/training/students.xls). It contains records on several characteristics of made-up students.
The packages I am going to cover are base, summarytools, and psych.
It is rare for a data analysis report not to include a couple of functions from base R. It is convenient to use, as there is no need to load any packages first - base is attached by default whenever an R session starts - and it has enough capacity to provide decent descriptive statistics.
Base R is among the first things to reach for after loading a dataset, to inspect its structure. The functions for this are str() and dim().
Here, I load the simulated students data and call str() to explore its structure, first converting some columns to factors for the purpose of display.
library(readxl)
students <- read_excel("students.xls")
students$Gender <- as.factor(students$Gender)
students$Country <- as.factor(students$Country)
students$`Student Status` <- as.factor(students$`Student Status`)
students$Major <- as.factor(students$Major)
str(students)
## tibble [30 x 14] (S3: tbl_df/tbl/data.frame)
## $ ID : num [1:30] 1 2 3 4 5 6 7 8 9 10 ...
## $ Last Name : chr [1:30] "DOE01" "DOE02" "DOE01" "DOE02" ...
## $ First Name : chr [1:30] "JANE01" "JANE02" "JOE01" "JOE02" ...
## $ City : chr [1:30] "Los Angeles" "Sedona" "Elmira" "Lackawana" ...
## $ State : chr [1:30] "California" "Arizona" "New York" "New York" ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 2 1 1 1 ...
## $ Student Status : Factor w/ 2 levels "Graduate","Undergraduate": 1 2 1 1 1 1 1 2 2 1 ...
## $ Major : Factor w/ 3 levels "Econ","Math",..: 3 2 2 1 1 1 3 3 2 2 ...
## $ Country : Factor w/ 11 levels "Argentina","Bulgaria",..: 10 10 10 10 10 6 10 10 3 10 ...
## $ Age : num [1:30] 30 19 26 33 37 25 39 21 18 33 ...
## $ SAT : num [1:30] 2263 2006 2221 1716 1701 ...
## $ Average score (grade) : num [1:30] 67 63 78.1 77.8 65 ...
## $ Height (in) : num [1:30] 61 64 73 68 71 67 70 62 62 66 ...
## $ Newspaper readership (times/wk): num [1:30] 5 7 6 3 6 5 5 5 6 5 ...
The function shows the number of rows (observations) and columns (variables) in the loaded dataframe, along with the names and data types of the variables. It also sheds some light on the structure of the variables themselves. For instance, the ‘Height (in)’ variable is numeric and holds 30 values, while the ‘Gender’ variable is a factor with two levels, i.e. the two possible labels it can take.
Other functions that let you glimpse the contents of the data include head() and tail(), which show the first or last rows of the dataset, in a user-specified quantity.
Here are the first 8 rows, the head, of my data:
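dim() is mentioned above but not shown, so here is a minimal sketch of it next to nrow(), ncol() and a per-column class check. The small `toy` data frame is a hypothetical stand-in for the students data; its values are only for illustration.

```r
# Hypothetical stand-in for the students data (illustrative values only)
toy <- data.frame(
  Gender = factor(c("Female", "Male", "Male", "Female")),
  Age    = c(30, 19, 26, 33),
  SAT    = c(2263, 2006, 2221, 1716)
)

dims      <- dim(toy)            # c(number of rows, number of columns)
n_obs     <- nrow(toy)           # number of observations
n_var     <- ncol(toy)           # number of variables
col_types <- sapply(toy, class)  # one class name per column

print(dims)
print(col_types)
```

Compared with str(), this gives the same facts as plain vectors, which is handy when the numbers are needed later in the code rather than just on screen.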
head(students, 8)
Then, it is possible to create tables and cross-tables with the table() function to find out the number of observations in each class of one or more variables.
table(students$Gender)
##
## Female Male
## 15 15
The table above shows the distribution of observations by the ‘Gender’ variable, while the table below displays the distribution of records by two variables at once, ‘Gender’ and ‘Student Status’.
table(students$Gender, students$`Student Status`)
##
## Graduate Undergraduate
## Female 5 10
## Male 10 5
Cross-tables can also be built with the xtabs() function:
xtabs(~ students$Gender + students$`Student Status`)
## students$`Student Status`
## students$Gender Graduate Undergraduate
## Female 5 10
## Male 10 5
Besides, prop.table() turns such a table into shares of observations. The second argument sets the margin: with 1, as below, shares are computed within each row, so every row sums to 1.
prop.table(table(students$Gender, students$`Student Status`), 1)
##
## Graduate Undergraduate
## Female 0.3333333 0.6666667
## Male 0.6666667 0.3333333
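The margin argument is easy to mix up, so here is a small sketch comparing row shares, column shares and overall shares. The two vectors are made-up data, not the students file.

```r
# Made-up data for illustration
gender <- factor(c("Female", "Female", "Female", "Male", "Male", "Male"))
status <- factor(c("Graduate", "Undergraduate", "Undergraduate",
                   "Graduate", "Graduate", "Undergraduate"))

counts <- table(gender, status)

row_shares <- prop.table(counts, margin = 1)  # each row sums to 1
col_shares <- prop.table(counts, margin = 2)  # each column sums to 1
all_shares <- prop.table(counts)              # the whole table sums to 1

print(round(row_shares, 2))
print(addmargins(counts))                     # counts with row and column totals
```

Row shares answer "what part of each gender is graduate?", while column shares answer "what part of the graduates is female?", so the choice of margin depends on the question being asked.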
Next comes the need to summarize central tendency measures for all variables of the dataframe, explore the relationships between pairs of variables and, perhaps, visualize them. Base R can accomplish that very well.
A call to summary() gives the central tendency measures and quartiles for each numeric variable in the data. However, the function is much less helpful for categorical variables: for factors it prints counts per level, for character columns only the length, class and storage mode, and it never reports the mode value itself.
For the purpose of display I run this function on the numeric variables only, to avoid enormous, useless output.
summary(students[, 10:14])
## Age SAT Average score (grade) Height (in)
## Min. :18.0 Min. :1338 Min. :63.00 Min. :59.00
## 1st Qu.:19.0 1st Qu.:1658 1st Qu.:72.00 1st Qu.:63.00
## Median :23.0 Median :1817 Median :79.75 Median :66.50
## Mean :25.2 Mean :1849 Mean :80.40 Mean :66.43
## 3rd Qu.:30.0 3rd Qu.:2032 3rd Qu.:88.00 3rd Qu.:70.75
## Max. :39.0 Max. :2309 Max. :95.88 Max. :75.00
## Newspaper readership (times/wk)
## Min. :3.000
## 1st Qu.:4.000
## Median :5.000
## Mean :4.867
## 3rd Qu.:6.000
## Max. :7.000
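Since summary() does not report the most frequent category, a small helper can fill that gap. The sketch below is my own illustration: `stat_mode` is a hypothetical helper name, and `toy` is made-up data. It also selects numeric columns by type instead of by position, which is less fragile than `10:14`.

```r
# Made-up data for illustration
toy <- data.frame(
  Major = factor(c("Econ", "Math", "Math", "Politics", "Math")),
  Age   = c(30, 19, 26, 33, 37),
  SAT   = c(2263, 2006, 2221, 1716, 1701)
)

# Hypothetical helper: most frequent value(s) of a categorical variable.
# Ties are all returned.
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}

major_mode <- stat_mode(toy$Major)

# Pick numeric columns by type rather than by column index
numeric_cols <- sapply(toy, is.numeric)
num_summary  <- summary(toy[, numeric_cols])

print(major_mode)
print(num_summary)
```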
Nevertheless, there is an option to get such a summary by groups (classes) of any categorical variable. For instance, here is the summary by the categories of the ‘Gender’ variable:
by(students[,4:14], students$Gender, summary)
## students$Gender: Female
## City State Gender Student Status
## Length:15 Length:15 Female:15 Graduate : 5
## Class :character Class :character Male : 0 Undergraduate:10
## Mode :character Mode :character
##
##
##
##
## Major Country Age SAT
## Econ :3 US :10 Min. :18.0 Min. :1338
## Math :8 Canada : 1 1st Qu.:18.5 1st Qu.:1685
## Politics:4 China : 1 Median :20.0 Median :1821
## Holland : 1 Mean :23.2 Mean :1872
## Mexico : 1 3rd Qu.:27.5 3rd Qu.:2144
## Venezuela: 1 Max. :38.0 Max. :2309
## (Other) : 0
## Average score (grade) Height (in) Newspaper readership (times/wk)
## Min. :63.00 Min. :59.0 Min. :3.0
## 1st Qu.:69.00 1st Qu.:61.5 1st Qu.:5.0
## Median :79.00 Median :63.0 Median :5.0
## Mean :78.78 Mean :63.4 Mean :5.2
## 3rd Qu.:88.00 3rd Qu.:65.5 3rd Qu.:6.0
## Max. :95.42 Max. :68.0 Max. :7.0
##
## ------------------------------------------------------------
## students$Gender: Male
## City State Gender Student Status
## Length:15 Length:15 Female: 0 Graduate :10
## Class :character Class :character Male :15 Undergraduate: 5
## Mode :character Mode :character
##
##
##
##
## Major Country Age SAT
## Econ :7 US :10 Min. :18.0 Min. :1434
## Math :2 Argentina: 1 1st Qu.:20.5 1st Qu.:1669
## Politics:6 Bulgaria : 1 Median :28.0 Median :1787
## Israel : 1 Mean :27.2 Mean :1826
## Russia : 1 3rd Qu.:31.5 3rd Qu.:1921
## Sweden : 1 Max. :39.0 Max. :2279
## (Other) : 0
## Average score (grade) Height (in) Newspaper readership (times/wk)
## Min. :65.00 Min. :63.00 Min. :3.000
## 1st Qu.:77.96 1st Qu.:67.00 1st Qu.:3.500
## Median :81.53 Median :71.00 Median :4.000
## Mean :82.02 Mean :69.47 Mean :4.533
## 3rd Qu.:88.00 3rd Qu.:72.50 3rd Qu.:5.500
## Max. :95.88 Max. :75.00 Max. :7.000
##
The output is still quite long, though.
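When only one statistic per group is needed, aggregate() from base R gives a much shorter table than by() with summary(). This is a sketch on made-up data; with the real students data, column names containing spaces would need backticks inside the formula.

```r
# Made-up data for illustration
toy <- data.frame(
  Gender = factor(c("Female", "Female", "Male", "Male")),
  Age    = c(20, 30, 25, 35),
  SAT    = c(1800, 2000, 1700, 1900)
)

# One row per gender, one column per summarized variable
group_means <- aggregate(cbind(Age, SAT) ~ Gender, data = toy, FUN = mean)

print(group_means)
```

Swapping `FUN = mean` for `median` or `sd` gives the other statistics in the same compact layout.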
It is also possible to get the minimum, maximum, mean, median, etc. of a numeric variable separately, as I do with students’ age below:
max(students$Age)
## [1] 39
mean(students$Age)
## [1] 25.2
median(students$Age)
## [1] 23
min(students$Age)
## [1] 18
Or to plot a distribution of its values
hist(students$SAT, col = "white")
Next, we can get the correlations between all pairs of numeric variables present in the data:
cor(students[, 10:14])
## Age SAT Average score (grade)
## Age 1.00000000 -0.12600068 0.04848456
## SAT -0.12600068 1.00000000 -0.04584287
## Average score (grade) 0.04848456 -0.04584287 1.00000000
## Height (in) 0.06615254 0.07927135 -0.01198504
## Newspaper readership (times/wk) 0.05413956 0.05443260 -0.24074412
## Height (in) Newspaper readership (times/wk)
## Age 0.06615254 0.05413956
## SAT 0.07927135 0.05443260
## Average score (grade) -0.01198504 -0.24074412
## Height (in) 1.00000000 -0.07097090
## Newspaper readership (times/wk) -0.07097090 1.00000000
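The full-precision matrix is hard to read, so a rounded version plus a significance test for a single pair may be more useful in a report. The sketch below uses made-up numbers; cor.test() is the base R (stats) function for testing one correlation.

```r
# Made-up data for illustration
toy <- data.frame(
  Age    = c(30, 19, 26, 33, 37, 25),
  SAT    = c(2263, 2006, 2221, 1716, 1701, 1950),
  Height = c(61, 64, 73, 68, 71, 67)
)

cors <- round(cor(toy), 2)                  # correlation matrix, two decimals

test_age_sat <- cor.test(toy$Age, toy$SAT)  # Pearson test for one pair

print(cors)
print(test_age_sat$estimate)                # the correlation coefficient
print(test_age_sat$p.value)                 # its p-value
```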
And plot any two of them against each other:
plot(students$`Average score (grade)`, students$SAT)
Additionally, base R covers statistical testing, which is very useful for describing relationships between variables.
An example of this functionality is a two-sample t-test on students’ SAT scores by gender:
t.test(students$SAT ~ students$Gender)
##
## Welch Two Sample t-test
##
## data: students$SAT by students$Gender
## t = 0.4496, df = 26.756, p-value = 0.6566
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -163.3048 254.9048
## sample estimates:
## mean in group Female mean in group Male
## 1871.8 1826.0
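The printed test output is text, but the object returned by t.test() holds every number separately, which helps when results have to go into a table. A sketch on made-up scores, using the `data =` argument so the formula stays short:

```r
# Made-up data for illustration
toy <- data.frame(
  Gender = factor(rep(c("Female", "Male"), each = 4)),
  SAT    = c(1800, 1900, 2000, 2100, 1700, 1800, 1900, 2000)
)

res <- t.test(SAT ~ Gender, data = toy)  # Welch test by default

t_value  <- unname(res$statistic)        # t statistic
p_value  <- res$p.value                  # p-value
ci       <- as.numeric(res$conf.int)     # 95% CI for the difference in means
grp_mean <- res$estimate                 # group means

print(c(t = t_value, p = p_value))
print(ci)
print(grp_mean)
```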
Finally, the same summary() function offers a handy way to summarize the main aspects of a linear model’s output:
linmodel <- lm(`Average score (grade)`~ SAT + `Newspaper readership (times/wk)` + Age + Country + Gender,
data = students)
summary(linmodel)
##
## Call:
## lm(formula = `Average score (grade)` ~ SAT + `Newspaper readership (times/wk)` +
## Age + Country + Gender, data = students)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.9886 -1.5573 0.0000 0.7011 13.6944
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 115.392684 25.545229 4.517 0.000409 ***
## SAT -0.015947 0.008869 -1.798 0.092303 .
## `Newspaper readership (times/wk)` -0.318138 1.841426 -0.173 0.865144
## Age -0.043862 0.328360 -0.134 0.895511
## CountryBulgaria -15.582715 15.008859 -1.038 0.315608
## CountryCanada 7.217909 16.482967 0.438 0.667698
## CountryChina -7.493113 17.132265 -0.437 0.668076
## CountryHolland -14.779794 16.350153 -0.904 0.380308
## CountryIsrael -23.444998 15.351244 -1.527 0.147511
## CountryMexico 18.512428 14.265629 1.298 0.213995
## CountryRussia -25.997579 16.778551 -1.549 0.142111
## CountrySweden -2.905334 14.969570 -0.194 0.848715
## CountryUS -7.967423 11.689834 -0.682 0.505900
## CountryVenezuela 14.900585 14.872628 1.002 0.332291
## GenderMale 8.221231 4.684862 1.755 0.099682 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.58 on 15 degrees of freedom
## Multiple R-squared: 0.5352, Adjusted R-squared: 0.1014
## F-statistic: 1.234 on 14 and 15 DF, p-value: 0.3449
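The coefficient block of this output can also be pulled out as a regular data frame, which is a first step towards a table that can be exported. The model below is a simplified sketch on made-up data, not the model fitted above.

```r
# Made-up data for illustration
toy <- data.frame(
  grade = c(67, 63, 78, 77, 65, 82, 90, 71),
  SAT   = c(2263, 2006, 2221, 1716, 1701, 1786, 1577, 1842),
  Age   = c(30, 19, 26, 33, 37, 25, 39, 21)
)

model <- lm(grade ~ SAT + Age, data = toy)
fit   <- summary(model)

# Coefficient matrix -> data frame, rounded for reporting
coef_table <- round(as.data.frame(coef(fit)), 3)

r2     <- fit$r.squared      # multiple R-squared
adj_r2 <- fit$adj.r.squared  # adjusted R-squared

print(coef_table)
print(c(R2 = r2, adjusted = adj_r2))
```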
To sum up, base R allows a researcher to produce decent summary statistics, some simple and fast visualizations, statistical tests, and linear models. Compared to the other packages explored below, and to the Python programming language, where the same actions require importing additional modules at each step, that is a great deal of functionality. It might not give you the prettiest or the cleanest output, but it does its job quite well, considering that it is not a separately installed package and is available as soon as an R session starts. Its greatest advantage, in my view, is that it ships with R itself, so you avoid the errors caused by incompatible package and environment versions - a problem which often prevents you from going on with your work.
However, when it comes to publishing a researcher’s work, presenting the outputs of data analysis done with base may become tricky. Tables with numbers, the dataframe’s structure and statistical testing outputs are not saved as a picture or PDF. They are displayed inside the R editor and in HTML reports knitted with RMarkdown, but even there they are pieces of text rather than pictures. You can either screenshot them, which is usually not acceptable under the format requirements of a scientific publication, or re-type the results into a Word table template or something similar. The situation is easier with plots, as they can be exported as images or PDF documents, but the plot’s decoration, so to say (borders, fonts, rendering of colors, etc.), may again not suit the journal or book one is publishing in. For instance, some series only accept graphics made with the lattice package.
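One partial workaround that stays within base R is to write tables to a CSV file, which can then be opened in a spreadsheet or pasted into a Word table, and to send plots to a file device. The sketch below uses made-up data and a temporary folder; the file names are arbitrary.

```r
# Made-up data for illustration
toy <- data.frame(
  Gender = factor(c("Female", "Female", "Male", "Male", "Male")),
  Status = factor(c("Graduate", "Undergraduate", "Graduate",
                    "Graduate", "Undergraduate")),
  SAT    = c(2263, 2006, 2221, 1716, 1701)
)

out_dir <- tempdir()  # replace with a real folder when using this for a report

# 1. A cross-table saved as CSV (rows = gender, columns = status)
counts   <- table(toy$Gender, toy$Status)
csv_path <- file.path(out_dir, "gender_by_status.csv")
write.csv(as.data.frame.matrix(counts), csv_path)

# 2. A histogram saved as PDF
pdf_path <- file.path(out_dir, "sat_hist.pdf")
pdf(pdf_path, width = 5, height = 4)
hist(toy$SAT, col = "white", main = "SAT", xlab = "SAT score")
invisible(dev.off())  # close the device so the file is written

print(csv_path)
print(pdf_path)
```

This does not solve the formatting question, since the CSV is still raw numbers, but it removes the need to re-type them.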
The next package to be explored is summarytools. I chose it to follow right after base in order to compare the two. Although it is narrower in functionality - no statistical testing or advanced visualization - it is quite capable of descriptive statistics and (!) produces much nicer-looking outputs. If I am not mistaken, its functions have built-in table formatting, which allows a researcher to get clean descriptive tables without any effort.
Let’s turn to the examples. Here is how the descriptive statistics of the numeric variables look.
library(summarytools)
print(descr(students[, 10:14],
headings = FALSE,
stats = "common",
style = "rmarkdown"), method = 'render', table.classes = 'st-small')
| | Age | Average score (grade) | Height (in) | Newspaper readership (times/wk) | SAT |
|---|---|---|---|---|---|
| Mean | 25.20 | 80.40 | 66.43 | 4.87 | 1848.90 |
| Std.Dev | 6.87 | 10.11 | 4.66 | 1.28 | 275.11 |
| Min | 18.00 | 63.00 | 59.00 | 3.00 | 1338.00 |
| Median | 23.00 | 79.75 | 66.50 | 5.00 | 1817.00 |
| Max | 39.00 | 95.88 | 75.00 | 7.00 | 2309.00 |
| N.Valid | 30 | 30 | 30 | 30 | 30 |
| Pct.Valid | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
Generated by summarytools 0.9.6 (R version 4.0.3)
2020-12-06
And this is how the descriptive summary of all (!) variables looks. Yes, this package supports summaries of categorical variables and even produces small visuals.
print(dfSummary(students, plain.ascii = FALSE, style = "grid",
graph.magnif = 0.75, valid.col = FALSE, tmp.img.dir = "/tmp"), method = 'render')
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing |
|---|---|---|---|---|---|
| 1 | ID [numeric] | Mean (sd) : 15.5 (8.8); min < med < max: 1 < 15.5 < 30; IQR (CV) : 14.5 (0.6) | 30 distinct values | | 0 (0%) |
| 2 | Last Name [character] | 1. DOE01 2. DOE02 3. DOE03 4. DOE04 5. DOE05 6. DOE06 7. DOE07 8. DOE08 9. DOE09 10. DOE10 [ 5 others ] | | | 0 (0%) |
| 3 | First Name [character] | 1. JANE01 2. JANE02 3. JANE03 4. JANE04 5. JANE05 6. JANE06 7. JANE07 8. JANE08 9. JANE09 10. JANE10 [ 20 others ] | | | 0 (0%) |
| 4 | City [character] | 1. New York 2. Acme 3. Amsterdam 4. Beijing 5. Buenos Aires 6. Caracas 7. Cimax 8. Defiance 9. Drunkard Creek 10. Elmira [ 19 others ] | | | 0 (0%) |
| 5 | State [character] | 1. New York 2. Argentina 3. Arizona 4. Bulgaria 5. California 6. Canada 7. China 8. Holland 9. Israel 10. Kansas [ 16 others ] | | | 0 (0%) |
| 6 | Gender [factor] | 1. Female 2. Male | | | 0 (0%) |
| 7 | Student Status [factor] | 1. Graduate 2. Undergraduate | | | 0 (0%) |
| 8 | Major [factor] | 1. Econ 2. Math 3. Politics | | | 0 (0%) |
| 9 | Country [factor] | 1. Argentina 2. Bulgaria 3. Canada 4. China 5. Holland 6. Israel 7. Mexico 8. Russia 9. Sweden 10. US 11. Venezuela | | | 0 (0%) |
| 10 | Age [numeric] | Mean (sd) : 25.2 (6.9); min < med < max: 18 < 23 < 39; IQR (CV) : 11 (0.3) | 13 distinct values | | 0 (0%) |
| 11 | SAT [numeric] | Mean (sd) : 1848.9 (275.1); min < med < max: 1338 < 1817 < 2309; IQR (CV) : 374.8 (0.1) | 30 distinct values | | 0 (0%) |
| 12 | Average score (grade) [numeric] | Mean (sd) : 80.4 (10.1); min < med < max: 63 < 79.7 < 95.9; IQR (CV) : 16 (0.1) | 28 distinct values | | 0 (0%) |
| 13 | Height (in) [numeric] | Mean (sd) : 66.4 (4.7); min < med < max: 59 < 66.5 < 75; IQR (CV) : 7.8 (0.1) | 16 distinct values | | 0 (0%) |
| 14 | Newspaper readership (times/wk) [numeric] | Mean (sd) : 4.9 (1.3); min < med < max: 3 < 5 < 7; IQR (CV) : 2 (0.3) | | | 0 (0%) |
There are also functions for frequency tables and cross-tables, of course. And this time we get value counts, their percentage shares and cumulative percentages in one output table, all nicely formatted.
print(freq(students$Major,
report.nas = FALSE,
style = "rmarkdown"), method = 'render')
| Major | Freq | % | % Cum. |
|---|---|---|---|
| Econ | 10 | 33.33 | 33.33 |
| Math | 10 | 33.33 | 66.67 |
| Politics | 10 | 33.33 | 100.00 |
| Total | 30 | 100.00 | 100.00 |
A cross-table, i.e. a frequency table for two variables:
print(ctable(x = students$Major,
y = students$`Student Status`, style = "rmarkdown"), method = 'render')
| Major / Student Status | Graduate | Undergraduate | Total |
|---|---|---|---|
| Econ | 4 (40.0%) | 6 (60.0%) | 10 (100.0%) |
| Math | 3 (30.0%) | 7 (70.0%) | 10 (100.0%) |
| Politics | 8 (80.0%) | 2 (20.0%) | 10 (100.0%) |
| Total | 15 (50.0%) | 15 (50.0%) | 30 (100.0%) |
And there is also an option to produce a cross-table for three variables at once! Or, more precisely, a set of two-variable cross-tables split by the groups (labels) of a third variable. Here is how it is done on my data to display the frequencies of students’ status by students’ major, separately for each gender group.
print(stby(list(x = students$Major,
y = students$`Student Status`),
INDICES = students$Gender, # for each gender
FUN = ctable,
style = "rmarkdown"), method = 'render')
Gender: Female

| Major / Student Status | Graduate | Undergraduate | Total |
|---|---|---|---|
| Econ | 0 (0.0%) | 3 (100.0%) | 3 (100.0%) |
| Math | 2 (25.0%) | 6 (75.0%) | 8 (100.0%) |
| Politics | 3 (75.0%) | 1 (25.0%) | 4 (100.0%) |
| Total | 5 (33.3%) | 10 (66.7%) | 15 (100.0%) |

Gender: Male

| Major / Student Status | Graduate | Undergraduate | Total |
|---|---|---|---|
| Econ | 4 (57.1%) | 3 (42.9%) | 7 (100.0%) |
| Math | 1 (50.0%) | 1 (50.0%) | 2 (100.0%) |
| Politics | 5 (83.3%) | 1 (16.7%) | 6 (100.0%) |
| Total | 10 (66.7%) | 5 (33.3%) | 15 (100.0%) |
And, finally, a summary of the descriptive statistics of the numeric variables by a grouping variable, students’ gender in this case:
print(stby(data = students[, 10:14],
INDICES = students$Gender,
FUN = descr, # descriptive statistics
stats = "common"), method = 'render')
Gender: Female

| | Age | Average score (grade) | Height (in) | Newspaper readership (times/wk) | SAT |
|---|---|---|---|---|---|
| Mean | 23.20 | 78.78 | 63.40 | 5.20 | 1871.80 |
| Std.Dev | 6.58 | 10.72 | 3.11 | 1.21 | 307.59 |
| Min | 18.00 | 63.00 | 59.00 | 3.00 | 1338.00 |
| Median | 20.00 | 79.00 | 63.00 | 5.00 | 1821.00 |
| Max | 38.00 | 95.42 | 68.00 | 7.00 | 2309.00 |
| N.Valid | 15 | 15 | 15 | 15 | 15 |
| Pct.Valid | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |

Gender: Male

| | Age | Average score (grade) | Height (in) | Newspaper readership (times/wk) | SAT |
|---|---|---|---|---|---|
| Mean | 27.20 | 82.02 | 69.47 | 4.53 | 1826.00 |
| Std.Dev | 6.77 | 9.55 | 3.94 | 1.30 | 247.08 |
| Min | 18.00 | 65.00 | 63.00 | 3.00 | 1434.00 |
| Median | 28.00 | 81.53 | 71.00 | 4.00 | 1787.00 |
| Max | 39.00 | 95.88 | 75.00 | 7.00 | 2279.00 |
| N.Valid | 15 | 15 | 15 | 15 | 15 |
| Pct.Valid | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
So, as you can see, this package provides better options for publication-ready outputs. The problem with exporting tables still remains, I guess, but the result is more convincing in a screenshot, right?
As I have already said, the package is only capable of summarizing the counts and descriptive statistics of a dataframe, so no model summaries or statistical tests are available to a researcher using summarytools. Yet a decent part of exploratory data analysis can be done with it.
Last but not least, the psych package. It was initially created for data analysis in psychology research, yet it has some nice functions which can be applied universally.
Let me begin with the headTail() function, which works like head() and tail() from base R but displays the first and last rows of a dataframe at once.
library(psych)
headTail(students, 8)
The function that summarizes the descriptive statistics of the dataframe variables is called describe(). It has the same flaws as base’s summary() when it comes to categorical variables.
I will show the example output anyway. It produces a more compact table than base did.
describe(students)
The package also supports descriptive summaries by a grouping variable. Here is how it is done with students’ average grade and their student status:
describeBy(students$`Average score (grade)`, students$`Student Status`, mat = TRUE)
Psych is also capable of producing a scatterplot matrix: pairs.panels() draws pairwise scatterplots below the diagonal, a histogram of each variable on the diagonal, and the Pearson correlations above the diagonal.
pairs.panels(students)
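The next feature that deserves attention is the outlier() function, which computes the squared Mahalanobis distance of each observation from the centre of the data and draws a Q-Q plot of these distances; unusually large values point to potential outliers: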
outlier(students[, 10:14])
## 1 2 3 4 5 6 7 8
## 6.880623 5.732858 4.393128 4.198304 6.731186 1.391109 7.292029 1.714440
## 9 10 11 12 13 14 15 16
## 4.512375 3.055604 3.194081 7.260425 2.165526 7.273375 4.369994 3.816711
## 17 18 19 20 21 22 23 24
## 7.191014 9.034259 4.748880 7.838962 5.586692 2.788470 2.346411 2.908424
## 25 26 27 28 29 30
## 1.950053 5.684288 3.136373 5.727889 6.629236 5.447283
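To make these numbers less of a black box, here is a base R sketch of squared Mahalanobis distances using stats::mahalanobis(). It is my own illustration on made-up data, under the assumption that the distances are taken from the column means with the sample covariance matrix; the 97.5% chi-square cutoff is one common rule of thumb, not something psych prescribes.

```r
# Made-up data for illustration
x <- as.matrix(data.frame(
  Age    = c(30, 19, 26, 33, 37, 25, 39, 21),
  SAT    = c(2263, 2006, 2221, 1716, 1701, 1786, 1577, 1842),
  Height = c(61, 64, 73, 68, 71, 67, 70, 62)
))

# Squared Mahalanobis distance of each row from the column means
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Reference quantiles for a Q-Q comparison: chi-square with df = number of variables
theoretical <- qchisq(ppoints(nrow(x)), df = ncol(x))

# Assumed rule of thumb: flag rows beyond the 97.5% chi-square quantile
cutoff  <- qchisq(0.975, df = ncol(x))
flagged <- which(d2 > cutoff)

print(round(d2, 2))
print(flagged)
```

Plotting `sort(d2)` against `theoretical` gives the kind of Q-Q plot that outlier() draws.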
One more piece is a function displaying the means with 95% confidence intervals for each variable in the data. Again, only the numeric ones are included.
error.bars(students[, 10:14])
Normally, there would be violin charts for each variable. But it does not work properly with simulated data somehow.
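For reference, the interval behind such a plot can be computed by hand. The sketch below shows the standard t-based 95% confidence interval for a mean on made-up ages; whether error.bars() uses exactly this formula by default is an assumption on my side.

```r
# Made-up data for illustration
age <- c(30, 19, 26, 33, 37, 25, 39, 21)

n    <- length(age)
m    <- mean(age)
se   <- sd(age) / sqrt(n)           # standard error of the mean
half <- qt(0.975, df = n - 1) * se  # half-width of the 95% interval

ci <- c(lower = m - half, upper = m + half)

# Cross-check against the interval reported by a one-sample t-test
ci_check <- as.numeric(t.test(age)$conf.int)

print(ci)
print(ci_check)
```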
Finally, a function producing both a correlation matrix and a correlation plot at once. The latter takes noticeably more work in base R.
corPlot(lowerCor(students[, 10:14]))
## Age SAT As(g) Hg(n) Nr(/)
## Age 1.00
## SAT -0.13 1.00
## Average score (grade) 0.05 -0.05 1.00
## Height (in) 0.07 0.08 -0.01 1.00
## Newspaper readership (times/wk) 0.05 0.05 -0.24 -0.07 1.00
So, this package definitely has some things to show. It provides more functionality than summarytools and produces neater tables than base. It is useful for data descriptives, for sure, and is trusted by the psychological research community for its usability and extended functionality in psychometric applications.
The package also includes functions (fa2latex(), cor2latex(), irt2latex(), and df2latex()) that format outputs as APA-style LaTeX tables, which can then be conveniently placed right in an article’s text.
Personally, I do descriptive statistics quite rarely. My research is related to machine learning, so even for modeling it is hard to get by with common summaries; other, more advanced functions are needed. I also use Python more than R in my own projects, so it is hard to speak about preferences in terms of R packages. And when I need to produce an overview of my data for a text that is to be published, I go straight to LaTeX features.
Nevertheless, when I use R and need descriptive statistics, I enjoy the starter functions from base, which give an impression of the structure of the data, and the describing functions from psych. To avoid the not-so-appealing output of model summaries from base, I use the tab_model() function from sjPlot. It gives a very nice layout, worthy of publication I would say. Here is how it looks with the linear model I created several steps earlier:
library(sjPlot)
tab_model(linmodel)
| | Average score (grade) | | |
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 115.39 | 60.94 – 169.84 | <0.001 |
| SAT | -0.02 | -0.03 – 0.00 | 0.092 |
| Newspaper readership (times/wk) | -0.32 | -4.24 – 3.61 | 0.865 |
| Age | -0.04 | -0.74 – 0.66 | 0.896 |
| Country [Bulgaria] | -15.58 | -47.57 – 16.41 | 0.316 |
| Country [Canada] | 7.22 | -27.91 – 42.35 | 0.668 |
| Country [China] | -7.49 | -44.01 – 29.02 | 0.668 |
| Country [Holland] | -14.78 | -49.63 – 20.07 | 0.380 |
| Country [Israel] | -23.44 | -56.17 – 9.28 | 0.148 |
| Country [Mexico] | 18.51 | -11.89 – 48.92 | 0.214 |
| Country [Russia] | -26.00 | -61.76 – 9.77 | 0.142 |
| Country [Sweden] | -2.91 | -34.81 – 29.00 | 0.849 |
| Country [US] | -7.97 | -32.88 – 16.95 | 0.506 |
| Country [Venezuela] | 14.90 | -16.80 – 46.60 | 0.332 |
| Gender [Male] | 8.22 | -1.76 – 18.21 | 0.100 |
| Observations | 30 | | |
| R2 / R2 adjusted | 0.535 / 0.101 | | |
If I were to advise an R beginner on the choice of packages for descriptive statistics, I would recommend considering the purpose of the analysis. If it is an interim data analysis or a student’s homework which is not going to be published, it is okay and even advisable to go with base R due to its versatility, simplicity and ease of use. However, if the results are to be prepared for publication anywhere, even a beginner should think of some pretty-looking wrapper for their outputs. If one does not want to dig into documentation, the summarytools package is a nice way to go. It is quite intuitive.
https://cran.r-project.org/web/packages/summarytools/vignettes/Introduction.html
Comtois, Dominic. “summarytools: Tools to quickly and neatly summarize data.” R package version 0.8.7 (2018). Available from: https://CRAN.R-project.org/package=summarytools
Revelle, William. “An overview of the psych package.” Dep Psychol Northwest Univ 3 (2011): 1-25.
Revelle, William. “psych: Procedures for psychological, psychometric, and personality research.” Northwestern University, Evanston, Illinois 165 (2014): 1-10.