Doing Descriptive Statistics in R

Introduction

In this short piece I evaluate several R packages for producing descriptive statistics of data and models. The packages are evaluated against the following criteria:

research requirements
publishing standards
my personal preferences.

To show how the different functions perform, I use simulated data provided by the Princeton University Library (source: http://dss.princeton.edu/training/students.xls). It contains records on several characteristics of made-up students.

The packages I cover are base, summarytools, and psych.

Base R

It is rare for a data analysis report not to include at least a couple of functions from base R. It is convenient to use because there is no need to load any packages first: base is attached by default whenever an R session starts, and it is capable enough to provide decent descriptive statistics.

Base R is among the first things to reach for after loading a dataset, in order to show its structure. The functions for this are str() and dim().

Here, I load the simulated students data and call str() to explore its structure, first converting some columns to factors for the purpose of display.

library(readxl)
students <- read_excel("students.xls")
students$Gender <- as.factor(students$Gender)
students$Country <- as.factor(students$Country)
students$`Student Status` <- as.factor(students$`Student Status`)
students$Major <- as.factor(students$Major)

str(students)

## tibble [30 x 14] (S3: tbl_df/tbl/data.frame)
##  $ ID                             : num [1:30] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Last Name                      : chr [1:30] "DOE01" "DOE02" "DOE01" "DOE02" ...
##  $ First Name                     : chr [1:30] "JANE01" "JANE02" "JOE01" "JOE02" ...
##  $ City                           : chr [1:30] "Los Angeles" "Sedona" "Elmira" "Lackawana" ...
##  $ State                          : chr [1:30] "California" "Arizona" "New York" "New York" ...
##  $ Gender                         : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 2 1 1 1 ...
##  $ Student Status                 : Factor w/ 2 levels "Graduate","Undergraduate": 1 2 1 1 1 1 1 2 2 1 ...
##  $ Major                          : Factor w/ 3 levels "Econ","Math",..: 3 2 2 1 1 1 3 3 2 2 ...
##  $ Country                        : Factor w/ 11 levels "Argentina","Bulgaria",..: 10 10 10 10 10 6 10 10 3 10 ...
##  $ Age                            : num [1:30] 30 19 26 33 37 25 39 21 18 33 ...
##  $ SAT                            : num [1:30] 2263 2006 2221 1716 1701 ...
##  $ Average score (grade)          : num [1:30] 67 63 78.1 77.8 65 ...
##  $ Height (in)                    : num [1:30] 61 64 73 68 71 67 70 62 62 66 ...
##  $ Newspaper readership (times/wk): num [1:30] 5 7 6 3 6 5 5 5 6 5 ...

The function shows the number of rows (observations) and columns (variables) in the loaded data frame, along with the names and data types of the variables. It also sheds some light on the contents of the variables themselves. For instance, the ‘Height (in)’ variable holds thirty numeric values, while the ‘Gender’ variable is a factor with two levels, that is, two possible labels.

Other functions that give a glimpse of the contents of the data are head() and tail(), which show the first or last rows of a dataset, in a quantity specified by the user.

Here are the first 8 rows, the head, of my data:

head(students, 8)
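tail() works the same way from the other end of the data. Below is a minimal sketch on a small made-up data frame; `demo` is a hypothetical stand-in, not the students file itself.

```r
# Hypothetical stand-in for the students data: 10 rows, 2 columns
demo <- data.frame(ID = 1:10,
                   Age = c(30, 19, 26, 33, 37, 25, 39, 21, 18, 33))

first_rows <- head(demo, 3)  # first 3 rows
last_rows  <- tail(demo, 3)  # last 3 rows

print(first_rows)
print(last_rows)
```

With the real data the calls would be head(students, 8) and tail(students, 8).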

Then, the table() function creates tables and cross-tables showing the number of observations in each class of one or more variables.

table(students$Gender)

## 
## Female   Male 
##     15     15

The table above shows distribution of observations by the ‘Gender’ variable, while the table below displays the distribution of records by two variables at the same time, ‘Gender’ and ‘Student Status’.

table(students$Gender, students$`Student Status`)

##         
##          Graduate Undergraduate
##   Female        5            10
##   Male         10             5

Cross-tables can also be produced with the xtabs() function:

xtabs(~ students$Gender + students$`Student Status`)

##                students$`Student Status`
## students$Gender Graduate Undergraduate
##          Female        5            10
##          Male         10             5

Besides, prop.table() turns such a table into shares of observations by one or two variables. With the second argument set to 1, as below, the shares are computed within each row.

prop.table(table(students$Gender, students$`Student Status`), 1)

##         
##           Graduate Undergraduate
##   Female 0.3333333     0.6666667
##   Male   0.6666667     0.3333333
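The second argument of prop.table() decides which totals the shares refer to. The sketch below rebuilds the Gender by Student Status counts shown above as a plain matrix (the object name `counts` is mine) and compares the three options.

```r
# Counts copied from the cross-table above, rebuilt by hand as a matrix
counts <- matrix(c(5, 10, 10, 5), nrow = 2,
                 dimnames = list(Gender = c("Female", "Male"),
                                 Status = c("Graduate", "Undergraduate")))

row_shares  <- prop.table(counts, margin = 1)  # each row sums to 1
col_shares  <- prop.table(counts, margin = 2)  # each column sums to 1
cell_shares <- prop.table(counts)              # all cells together sum to 1

print(round(row_shares, 3))
print(round(col_shares, 3))
print(round(cell_shares, 3))
```

Row shares answer "what part of female students are graduates", while column shares answer "what part of graduates are female".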

Next comes the need to summarize measures of central tendency for all variables in the data frame, to explore the relationships between pairs of variables and, perhaps, to visualize them. Base R handles all of that well.

A call to summary() gives the minimum, maximum, mean, median, and first and third quartiles for each numeric variable in the data. However, the function is of little use for categorical variables: for factors it lists counts per level without singling out the mode, and for character columns it reports only the length, class, and storage type of the vector.

For the purpose of display I run this function on the numeric variables only, to avoid a long and useless output.

summary(students[, 10:14])

##       Age            SAT       Average score (grade)  Height (in)   
##  Min.   :18.0   Min.   :1338   Min.   :63.00         Min.   :59.00  
##  1st Qu.:19.0   1st Qu.:1658   1st Qu.:72.00         1st Qu.:63.00  
##  Median :23.0   Median :1817   Median :79.75         Median :66.50  
##  Mean   :25.2   Mean   :1849   Mean   :80.40         Mean   :66.43  
##  3rd Qu.:30.0   3rd Qu.:2032   3rd Qu.:88.00         3rd Qu.:70.75  
##  Max.   :39.0   Max.   :2309   Max.   :95.88         Max.   :75.00  
##  Newspaper readership (times/wk)
##  Min.   :3.000                  
##  1st Qu.:4.000                  
##  Median :5.000                  
##  Mean   :4.867                  
##  3rd Qu.:6.000                  
##  Max.   :7.000
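As for the mode of a categorical variable, which summary() does not report, it can be recovered from table(). This is a minimal sketch on a made-up factor; `major` is a hypothetical stand-in for a column such as Major, and on ties the code returns only the first of the tied levels.

```r
# Hypothetical categorical variable standing in for a column such as Major
major <- factor(c("Econ", "Math", "Math", "Politics", "Math", "Econ"))

counts <- table(major)                           # frequency of each level
modal_level <- names(counts)[which.max(counts)]  # first level with the highest count

print(sort(counts, decreasing = TRUE))
print(modal_level)
```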

Nevertheless, there is an option to get such a summary by groups (classes) of any categorical variable. For instance, here is the summary by the categories of the ‘Gender’ variable:

by(students[,4:14], students$Gender, summary)

## students$Gender: Female
##      City              State              Gender         Student Status
##  Length:15          Length:15          Female:15   Graduate     : 5    
##  Class :character   Class :character   Male  : 0   Undergraduate:10    
##  Mode  :character   Mode  :character                                   
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##       Major        Country        Age            SAT      
##  Econ    :3   US       :10   Min.   :18.0   Min.   :1338  
##  Math    :8   Canada   : 1   1st Qu.:18.5   1st Qu.:1685  
##  Politics:4   China    : 1   Median :20.0   Median :1821  
##               Holland  : 1   Mean   :23.2   Mean   :1872  
##               Mexico   : 1   3rd Qu.:27.5   3rd Qu.:2144  
##               Venezuela: 1   Max.   :38.0   Max.   :2309  
##               (Other)  : 0                                
##  Average score (grade)  Height (in)   Newspaper readership (times/wk)
##  Min.   :63.00         Min.   :59.0   Min.   :3.0                    
##  1st Qu.:69.00         1st Qu.:61.5   1st Qu.:5.0                    
##  Median :79.00         Median :63.0   Median :5.0                    
##  Mean   :78.78         Mean   :63.4   Mean   :5.2                    
##  3rd Qu.:88.00         3rd Qu.:65.5   3rd Qu.:6.0                    
##  Max.   :95.42         Max.   :68.0   Max.   :7.0                    
##                                                                      
## ------------------------------------------------------------ 
## students$Gender: Male
##      City              State              Gender         Student Status
##  Length:15          Length:15          Female: 0   Graduate     :10    
##  Class :character   Class :character   Male  :15   Undergraduate: 5    
##  Mode  :character   Mode  :character                                   
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##       Major        Country        Age            SAT      
##  Econ    :7   US       :10   Min.   :18.0   Min.   :1434  
##  Math    :2   Argentina: 1   1st Qu.:20.5   1st Qu.:1669  
##  Politics:6   Bulgaria : 1   Median :28.0   Median :1787  
##               Israel   : 1   Mean   :27.2   Mean   :1826  
##               Russia   : 1   3rd Qu.:31.5   3rd Qu.:1921  
##               Sweden   : 1   Max.   :39.0   Max.   :2279  
##               (Other)  : 0                                
##  Average score (grade)  Height (in)    Newspaper readership (times/wk)
##  Min.   :65.00         Min.   :63.00   Min.   :3.000                  
##  1st Qu.:77.96         1st Qu.:67.00   1st Qu.:3.500                  
##  Median :81.53         Median :71.00   Median :4.000                  
##  Mean   :82.02         Mean   :69.47   Mean   :4.533                  
##  3rd Qu.:88.00         3rd Qu.:72.50   3rd Qu.:5.500                  
##  Max.   :95.88         Max.   :75.00   Max.   :7.000                  
##

The output is still quite long, though.

It is also possible to get the minimum, maximum, mean, median, and other values of a numeric variable separately, as I do with students’ age below:
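When only one statistic per group is needed, aggregate() from base R gives a much shorter table than by(). Here is a minimal sketch on a tiny made-up data frame; `demo` and its values are hypothetical, not taken from the students file.

```r
# Hypothetical miniature of the students data
demo <- data.frame(
  Gender = factor(c("Female", "Female", "Male", "Male")),
  Age    = c(20, 30, 25, 35),
  SAT    = c(1800, 2000, 1700, 1900)
)

# One row per gender, one column per averaged variable
group_means <- aggregate(cbind(Age, SAT) ~ Gender, data = demo, FUN = mean)

print(group_means)
```

Replacing FUN = mean with median or sd gives the other statistics in the same layout.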

max(students$Age)

## [1] 39

mean(students$Age)

## [1] 25.2

median(students$Age)

## [1] 23

min(students$Age)

## [1] 18

Or to plot the distribution of a variable’s values, here the SAT scores:

hist(students$SAT, col = "white")

Next, we can get correlations of all pairs of numeric variables present in the data

cor(students[, 10:14])

##                                         Age         SAT Average score (grade)
## Age                              1.00000000 -0.12600068            0.04848456
## SAT                             -0.12600068  1.00000000           -0.04584287
## Average score (grade)            0.04848456 -0.04584287            1.00000000
## Height (in)                      0.06615254  0.07927135           -0.01198504
## Newspaper readership (times/wk)  0.05413956  0.05443260           -0.24074412
##                                 Height (in) Newspaper readership (times/wk)
## Age                              0.06615254                      0.05413956
## SAT                              0.07927135                      0.05443260
## Average score (grade)           -0.01198504                     -0.24074412
## Height (in)                      1.00000000                     -0.07097090
## Newspaper readership (times/wk) -0.07097090                      1.00000000

And plot any two of them against each other:

plot(students$`Average score (grade)`, students$SAT)

Additionally, base R covers statistical testing which is very useful in describing relationships between variables.

An example of this functionality is a two-sample t-test on students’ SAT scores by gender:

t.test(students$SAT ~ students$Gender)

## 
##  Welch Two Sample t-test
## 
## data:  students$SAT by students$Gender
## t = 0.4496, df = 26.756, p-value = 0.6566
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -163.3048  254.9048
## sample estimates:
## mean in group Female   mean in group Male 
##               1871.8               1826.0

Finally, there is a handy option to summarize the main aspects of a linear model’s output using the same summary() function:

linmodel <- lm(`Average score (grade)`~ SAT + `Newspaper readership (times/wk)` + Age + Country + Gender,
               data = students)
summary(linmodel)

## 
## Call:
## lm(formula = `Average score (grade)` ~ SAT + `Newspaper readership (times/wk)` + 
##     Age + Country + Gender, data = students)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.9886  -1.5573   0.0000   0.7011  13.6944 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       115.392684  25.545229   4.517 0.000409 ***
## SAT                                -0.015947   0.008869  -1.798 0.092303 .  
## `Newspaper readership (times/wk)`  -0.318138   1.841426  -0.173 0.865144    
## Age                                -0.043862   0.328360  -0.134 0.895511    
## CountryBulgaria                   -15.582715  15.008859  -1.038 0.315608    
## CountryCanada                       7.217909  16.482967   0.438 0.667698    
## CountryChina                       -7.493113  17.132265  -0.437 0.668076    
## CountryHolland                    -14.779794  16.350153  -0.904 0.380308    
## CountryIsrael                     -23.444998  15.351244  -1.527 0.147511    
## CountryMexico                      18.512428  14.265629   1.298 0.213995    
## CountryRussia                     -25.997579  16.778551  -1.549 0.142111    
## CountrySweden                      -2.905334  14.969570  -0.194 0.848715    
## CountryUS                          -7.967423  11.689834  -0.682 0.505900    
## CountryVenezuela                   14.900585  14.872628   1.002 0.332291    
## GenderMale                          8.221231   4.684862   1.755 0.099682 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.58 on 15 degrees of freedom
## Multiple R-squared:  0.5352, Adjusted R-squared:  0.1014 
## F-statistic: 1.234 on 14 and 15 DF,  p-value: 0.3449
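The numbers in such a printout can also be pulled out as ordinary objects, which helps when they have to go into a table elsewhere. The sketch below uses the built-in mtcars data as a stand-in for the students data, so the model and variable names are illustrative only.

```r
# Built-in mtcars data used as a stand-in for the students data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Estimates, standard errors, t values and p values as a data frame
coef_table <- as.data.frame(coef(summary(fit)))

# 95% confidence intervals for the same coefficients
ci <- confint(fit, level = 0.95)

print(round(coef_table, 3))
print(round(ci, 3))
```

The same two calls should work on the linmodel object fitted above.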

To sum up, base R allows a researcher to produce decent summary statistics, some simple and fast visualizations, statistical tests, and linear models. Compared with the other packages explored below, and with Python, where the same actions require importing additional modules at nearly every step, that is a great deal of functionality. It might not give the prettiest or cleanest output, but it does its job quite well, considering that it is not a separate package and is available as soon as an R session starts. Its greatest advantage, in my view, is that there will never be errors caused by a mismatch between package and environment versions, a problem that can otherwise stop your work entirely.

However, when it comes to publishing a researcher’s work, the outputs of data analysis done with base R can be tricky to handle. Tables of numbers, a data frame’s structure, and statistical test outputs cannot be directly saved as a picture or PDF. They are displayed in the R console and in HTML reports knitted with RMarkdown, but even in the latter they are pieces of text rather than pictures. You can either screenshot them, which format requirements for scientific publications usually do not accept, or re-type the results into a Word table template or something similar. The situation is easier with plots, as they can be exported as images or PDF documents, but the plot’s decoration, so to say (borders, fonts, rendering of colors, etc.), may again not suit the journal or book one is publishing in. For instance, some series only accept graphics made with the lattice package.
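One possible workaround for the re-typing step is to write a table to a CSV file and open it in a spreadsheet or word processor, while plots go through a graphics device. This is only a sketch: the file names, the temporary directory, and the made-up values are my assumptions, not part of the original analysis.

```r
# Hypothetical output locations in the session's temporary directory
table_file <- file.path(tempdir(), "gender_counts.csv")
plot_file  <- file.path(tempdir(), "sat_hist.pdf")

# Made-up values standing in for the students data
gender <- c("Female", "Male", "Male", "Female", "Male")
sat    <- c(2263, 2006, 2221, 1716, 1701)

# A frequency table converts to a data frame, which can be written to disk
write.csv(as.data.frame(table(Gender = gender)), table_file, row.names = FALSE)

# Plots are saved by opening a device, drawing, and closing the device
pdf(plot_file, width = 5, height = 4)
hist(sat, col = "white", main = "SAT scores", xlab = "SAT")
invisible(dev.off())
```

Swapping pdf() for png() produces an image file instead of a PDF.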

Summarytools package

The next package to be explored is summarytools. I chose it to follow base R directly so that the two can be compared. Although it is narrower in functionality, with no statistical testing or advanced visualization, it is quite capable of descriptive statistics and (!) produces much nicer-looking output. Its functions have built-in table formatting, which lets a researcher get clean descriptive tables with very little effort.

Let’s turn to the examples. Here is how the descriptive statistics of the numeric variables look:

library(summarytools)
print(descr(students[, 10:14],
      headings = FALSE,
      stats = "common",
      style = "rmarkdown"), method = 'render', table.classes = 'st-small')

	Age	Average score (grade)	Height (in)	Newspaper readership (times/wk)	SAT
Mean	25.20	80.40	66.43	4.87	1848.90
Std.Dev	6.87	10.11	4.66	1.28	275.11
Min	18.00	63.00	59.00	3.00	1338.00
Median	23.00	79.75	66.50	5.00	1817.00
Max	39.00	95.88	75.00	7.00	2309.00
N.Valid	30	30	30	30	30
Pct.Valid	100.00	100.00	100.00	100.00	100.00

Generated by summarytools 0.9.6 (R version 4.0.3)
2020-12-06

And this is how the descriptive summary of all (!) variables looks. Yes, this package supports summaries of categorical variables and even produces small graphs.

print(dfSummary(students, plain.ascii = FALSE, style = "grid", 
          graph.magnif = 0.75, valid.col = FALSE, tmp.img.dir = "/tmp"), method = 'render')

Data Frame Summary

students

Dimensions: 30 x 14
Duplicates: 0

Variable	Stats / Values	Freqs (% of Valid)	Graph	Missing
ID [numeric]	Mean (sd) : 15.5 (8.8)	30 distinct values	[graph]	0 (0%)
	min < med < max: 1 < 15.5 < 30
	IQR (CV) : 14.5 (0.6)
Last Name [character]	1. DOE01	2 (6.7%)	[graph]	0 (0%)
	2. DOE02	2 (6.7%)
	3. DOE03	2 (6.7%)
	4. DOE04	2 (6.7%)
	5. DOE05	2 (6.7%)
	6. DOE06	2 (6.7%)
	7. DOE07	2 (6.7%)
	8. DOE08	2 (6.7%)
	9. DOE09	2 (6.7%)
	10. DOE10	2 (6.7%)
	[ 5 others ]	10 (33.3%)
First Name [character]	1. JANE01	1 (3.3%)	[graph]	0 (0%)
	2. JANE02	1 (3.3%)
	3. JANE03	1 (3.3%)
	4. JANE04	1 (3.3%)
	5. JANE05	1 (3.3%)
	6. JANE06	1 (3.3%)
	7. JANE07	1 (3.3%)
	8. JANE08	1 (3.3%)
	9. JANE09	1 (3.3%)
	10. JANE10	1 (3.3%)
	[ 20 others ]	20 (66.7%)
City [character]	1. New York	2 (6.7%)	[graph]	0 (0%)
	2. Acme	1 (3.3%)
	3. Amsterdam	1 (3.3%)
	4. Beijing	1 (3.3%)
	5. Buenos Aires	1 (3.3%)
	6. Caracas	1 (3.3%)
	7. Cimax	1 (3.3%)
	8. Defiance	1 (3.3%)
	9. Drunkard Creek	1 (3.3%)
	10. Elmira	1 (3.3%)
	[ 19 others ]	19 (63.3%)
State [character]	1. New York	5 (16.7%)	[graph]	0 (0%)
	2. Argentina	1 (3.3%)
	3. Arizona	1 (3.3%)
	4. Bulgaria	1 (3.3%)
	5. California	1 (3.3%)
	6. Canada	1 (3.3%)
	7. China	1 (3.3%)
	8. Holland	1 (3.3%)
	9. Israel	1 (3.3%)
	10. Kansas	1 (3.3%)
	[ 16 others ]	16 (53.3%)
Gender [factor]	1. Female	15 (50.0%)	[graph]	0 (0%)
	2. Male	15 (50.0%)
Student Status [factor]	1. Graduate	15 (50.0%)	[graph]	0 (0%)
	2. Undergraduate	15 (50.0%)
Major [factor]	1. Econ	10 (33.3%)	[graph]	0 (0%)
	2. Math	10 (33.3%)
	3. Politics	10 (33.3%)
Country [factor]	1. Argentina	1 (3.3%)	[graph]	0 (0%)
	2. Bulgaria	1 (3.3%)
	3. Canada	1 (3.3%)
	4. China	1 (3.3%)
	5. Holland	1 (3.3%)
	6. Israel	1 (3.3%)
	7. Mexico	1 (3.3%)
	8. Russia	1 (3.3%)
	9. Sweden	1 (3.3%)
	10. US	20 (66.7%)
	11. Venezuela	1 (3.3%)
Age [numeric]	Mean (sd) : 25.2 (6.9)	13 distinct values	[graph]	0 (0%)
	min < med < max: 18 < 23 < 39
	IQR (CV) : 11 (0.3)
SAT [numeric]	Mean (sd) : 1848.9 (275.1)	30 distinct values	[graph]	0 (0%)
	min < med < max: 1338 < 1817 < 2309
	IQR (CV) : 374.8 (0.1)
Average score (grade) [numeric]	Mean (sd) : 80.4 (10.1)	28 distinct values	[graph]	0 (0%)
	min < med < max: 63 < 79.7 < 95.9
	IQR (CV) : 16 (0.1)
Height (in) [numeric]	Mean (sd) : 66.4 (4.7)	16 distinct values	[graph]	0 (0%)
	min < med < max: 59 < 66.5 < 75
	IQR (CV) : 7.8 (0.1)
Newspaper readership (times/wk) [numeric]	Mean (sd) : 4.9 (1.3)	3 : 6 (20.0%)	[graph]	0 (0%)
	min < med < max: 3 < 5 < 7	4 : 5 (16.7%)
	IQR (CV) : 2 (0.3)	5 : 9 (30.0%)
		6 : 7 (23.3%)
		7 : 3 (10.0%)

There are also functions for frequency tables and cross-tables, of course. This time we get the counts of values, their percentage shares, and the cumulative percentages in one output table, all nicely formatted.

print(freq(students$Major,
     report.nas = FALSE,
     style = "rmarkdown"), method = 'render')

Frequencies

students$Major

Type: Factor

Major	Freq	%	% Cum.
Econ	10	33.33	33.33
Math	10	33.33	66.67
Politics	10	33.33	100.00
Total	30	100.00	100.00

A cross table, a frequency table for two variables:

print(ctable(x = students$Major,
       y = students$`Student Status`, style = "rmarkdown"), method = 'render')

Cross-Tabulation, Row Proportions

Major * `Student Status`

Data Frame: students

	`Student Status`
Major	Graduate	Undergraduate	Total
Econ	4 (40.0%)	6 (60.0%)	10 (100.0%)
Math	3 (30.0%)	7 (70.0%)	10 (100.0%)
Politics	8 (80.0%)	2 (20.0%)	10 (100.0%)
Total	15 (50.0%)	15 (50.0%)	30 (100.0%)

There is also an option to produce a cross-table for three variables at once! More precisely, it is a set of cross-tables of two variables, one for each group (label) of a third variable. Here is how it is done on my data to display the frequencies of student status by major, separately for each gender group.

print(stby(list(x = students$Major, 
          y = students$`Student Status`), 
     INDICES = students$Gender, # for each gender
     FUN = ctable,
     style = "rmarkdown"), method = 'render')

Cross-Tabulation, Row Proportions

Major * students$`Student Status`

Data Frame: students
Group: Gender = Female

	students$`Student Status`
Major	Graduate	Undergraduate	Total
Econ	0 (0.0%)	3 (100.0%)	3 (100.0%)
Math	2 (25.0%)	6 (75.0%)	8 (100.0%)
Politics	3 (75.0%)	1 (25.0%)	4 (100.0%)
Total	5 (33.3%)	10 (66.7%)	15 (100.0%)

Group: Gender = Male

	students$`Student Status`
Major	Graduate	Undergraduate	Total
Econ	4 (57.1%)	3 (42.9%)	7 (100.0%)
Math	1 (50.0%)	1 (50.0%)	2 (100.0%)
Politics	5 (83.3%)	1 (16.7%)	6 (100.0%)
Total	10 (66.7%)	5 (33.3%)	15 (100.0%)

And, finally, a summary of descriptive statistics for the numeric variables by a grouping variable, students’ gender in this case:

print(stby(data = students[, 10:14],
     INDICES = students$Gender, 
     FUN = descr, # descriptive statistics
     stats = "common"), method = 'render')

Descriptive Statistics

students

Group: Gender = Female
N: 15

	Age	Average score (grade)	Height (in)	Newspaper readership (times/wk)	SAT
Mean	23.20	78.78	63.40	5.20	1871.80
Std.Dev	6.58	10.72	3.11	1.21	307.59
Min	18.00	63.00	59.00	3.00	1338.00
Median	20.00	79.00	63.00	5.00	1821.00
Max	38.00	95.42	68.00	7.00	2309.00
N.Valid	15	15	15	15	15
Pct.Valid	100.00	100.00	100.00	100.00	100.00

Group: Gender = Male
N: 15

	Age	Average score (grade)	Height (in)	Newspaper readership (times/wk)	SAT
Mean	27.20	82.02	69.47	4.53	1826.00
Std.Dev	6.77	9.55	3.94	1.30	247.08
Min	18.00	65.00	63.00	3.00	1434.00
Median	28.00	81.53	71.00	4.00	1787.00
Max	39.00	95.88	75.00	7.00	2279.00
N.Valid	15	15	15	15	15
Pct.Valid	100.00	100.00	100.00	100.00	100.00

So, as you have hopefully been convinced, this package provides better options for publication-ready output. The problem of exporting tables still remains, I guess, but the result is more convincing in a screenshot, right?

As I have already said, the package is only capable of summarizing the counts and descriptive statistics of a data frame, so no model summaries or statistical tests are available to a researcher using summarytools. Yet a decent part of exploratory data analysis can be done with it.

Psych package

Last but not least, the psych package. It was initially created for data analysis in Psychology research, yet has some nice functions which can be universally applied.

Let me begin with the headTail() function, which works analogously to head() and tail() from base R but displays the first and last rows of a data frame at the same time.

library(psych)
headTail(students, 8)

The function that summarizes descriptive statistics of the data frame’s variables is called describe(). It has the same flaws as base R’s summary() when it comes to categorical variables.

I will show the example call anyway. It produces a more compact table than base R did.

describe(students)

The package also supports summarizing descriptive statistics by a grouping variable. Here is how it is done with students’ average grade and their student status:

describeBy(students$`Average score (grade)`, students$`Student Status`, mat = TRUE)

psych can also produce a scatterplot matrix with pairs.panels(): bivariate scatter plots below the diagonal, a histogram of each variable on the diagonal, and Pearson correlations between pairs of variables above the diagonal.

pairs.panels(students)

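The next feature that deserves attention is the outlier() function, which computes the squared Mahalanobis distance of each observation from the center of the data and draws a Q-Q plot of these distances against chi-square quantiles. Large values point to potential outliers.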

outlier(students[, 10:14])

##        1        2        3        4        5        6        7        8 
## 6.880623 5.732858 4.393128 4.198304 6.731186 1.391109 7.292029 1.714440 
##        9       10       11       12       13       14       15       16 
## 4.512375 3.055604 3.194081 7.260425 2.165526 7.273375 4.369994 3.816711 
##       17       18       19       20       21       22       23       24 
## 7.191014 9.034259 4.748880 7.838962 5.586692 2.788470 2.346411 2.908424 
##       25       26       27       28       29       30 
## 1.950053 5.684288 3.136373 5.727889 6.629236 5.447283
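The values printed above sum to 145, that is (30 - 1) x 5, which is what squared Mahalanobis distances computed with the sample covariance of five variables add up to. The same kind of distance is available in base R through mahalanobis(). The sketch below runs on made-up normal data, and the 0.975 chi-square cutoff is my own choice for illustration, not something outlier() applies.

```r
set.seed(1)  # reproducible made-up data: 30 rows, 3 numeric columns
x <- cbind(a = rnorm(30), b = rnorm(30), c = rnorm(30))

# Squared Mahalanobis distance of each row from the column means
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Under multivariate normality d2 is roughly chi-square with ncol(x) df,
# so rows above a high quantile are candidate outliers
cutoff  <- qchisq(0.975, df = ncol(x))
flagged <- which(d2 > cutoff)

print(round(d2, 2))
print(flagged)
```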

One more piece is a function that plots the mean of each variable in the data with its 95% confidence interval. Again, only the numeric variables are included.

error.bars(students[, 10:14])

Normally, there would be violin-like shapes for each variable, but somehow this does not work properly with the simulated data.

Finally, lowerCor() and corPlot() combine to produce both a correlation matrix and a correlation plot at the same time. Base R has no dedicated function for the latter.
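For reference, a 95% confidence interval for a mean can be computed by hand in base R. The sketch assumes the usual t-based interval; I have not checked that error.bars() uses exactly this formula, and the values in `x` are made up.

```r
# Made-up sample standing in for one numeric column
x <- c(30, 19, 26, 33, 37, 25, 39, 21, 18, 33)

n  <- length(x)
m  <- mean(x)
se <- sd(x) / sqrt(n)                     # standard error of the mean
half_width <- qt(0.975, df = n - 1) * se  # t-based 95% margin of error

ci <- c(lower = m - half_width, upper = m + half_width)
print(round(ci, 2))
```

The result matches the interval reported by t.test(x) for the same vector.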

corPlot(lowerCor(students[, 10:14]))

##                                 Age   SAT   As(g) Hg(n) Nr(/)
## Age                              1.00                        
## SAT                             -0.13  1.00                  
## Average score (grade)            0.05 -0.05  1.00            
## Height (in)                      0.07  0.08 -0.01  1.00      
## Newspaper readership (times/wk)  0.05  0.05 -0.24 -0.07  1.00

So, this package definitely has some things to show. It provides more functionality than summarytools and produces neater tables than base R. It is certainly useful for descriptive statistics, and it is trusted by the psychological research community for its usability and its extended functionality in psychometric applications.

The package also includes functions (fa2latex(), cor2latex(), irt2latex(), and df2latex()) that format outputs as APA-style LaTeX tables, which can then be conveniently placed right in an article’s text.

Personal preferences

Personally, I do descriptive statistics quite rarely. My research is related to machine learning, so even for modeling the common summaries rarely suffice and other, more advanced functions are needed. I also use Python more than R in my own projects, so it is hard to speak about preferences in terms of R packages. And when I need an overview of my data for a text that is to be published, I go straight to LaTeX features.

Nevertheless, when I use R and need descriptive statistics, I enjoy the starter functions from base R, which give an impression of the structure of the data, and the describing functions from psych. To avoid the not-so-appealing output of model summaries from base R, I use the tab_model() function from sjPlot. It gives a very nice layout, worthy of publication I would say. Here is how it looks with the linear model I created earlier:

library(sjPlot)
tab_model(linmodel)

	Average score (grade)
Predictors	Estimates	CI	p
(Intercept)	115.39	60.94 – 169.84	<0.001
SAT	-0.02	-0.03 – 0.00	0.092
Newspaper readership (times/wk)	-0.32	-4.24 – 3.61	0.865
Age	-0.04	-0.74 – 0.66	0.896
Country [Bulgaria]	-15.58	-47.57 – 16.41	0.316
Country [Canada]	7.22	-27.91 – 42.35	0.668
Country [China]	-7.49	-44.01 – 29.02	0.668
Country [Holland]	-14.78	-49.63 – 20.07	0.380
Country [Israel]	-23.44	-56.17 – 9.28	0.148
Country [Mexico]	18.51	-11.89 – 48.92	0.214
Country [Russia]	-26.00	-61.76 – 9.77	0.142
Country [Sweden]	-2.91	-34.81 – 29.00	0.849
Country [US]	-7.97	-32.88 – 16.95	0.506
Country [Venezuela]	14.90	-16.80 – 46.60	0.332
Gender [Male]	8.22	-1.76 – 18.21	0.100
Observations	30
R² / R² adjusted	0.535 / 0.101

Advising (oh my..)

If I were to advise an R beginner on which packages to use for descriptive statistics, I would recommend considering the purpose of the analysis. If it is an interim data analysis or a student’s homework that is not going to be published, it is fine and even advisable to go with base R because of its versatility, simplicity, and ease of use. However, if the results are to be prepared for publication anywhere, even a beginner should think of some pretty-looking wrapper for their outputs. For those who do not want to dig into documentation, the summarytools package is a nice way to go. It is quite intuitive.

References:

https://www.princeton.edu/
https://cran.r-project.org/web/packages/summarytools/vignettes/Introduction.html
https://tanjakec.github.io/blog/introduction-to-r/
Comtois, Dominic. “summarytools: Tools to quickly and neatly summarize data.” R package version 0.8.7 (2018). Available from: https://CRAN.R-project.org/package=summarytools
Revelle, William. “An overview of the psych package.” Dep Psychol Northwest Univ 3 (2011): 1-25.
Revelle, William. “psych: Procedures for psychological, psychometric, and personality research.” Northwestern University, Evanston, Illinois 165 (2014): 1-10.

Anna Moroz