library(dplyr)

Project Description

Today I am going to compare different tools and packages in R for presenting statistical descriptions of some data. I will evaluate in in terms of ease of use and how well does it present necessary for me data.

Introduction to the data

For this project I have decided to let you know a bit of my thesis paper. This is a study based on the data from Laboratory of Sociology in Education and Science in HSE. The purpose of it is to discover how parents with different socio-economic statuses choose schools for their children. The main idea is that in a field, where similar opportunities for enrollment had been created - every parent can choose up to three schools and the main factor of enrollment is prescription of school to the house, where child lives, - reproduction of social status remains the same - children from families with higher SES are enrolled in better schools and get better education, when children from the contrasting group.

So, I use data from the survey of parents of children from first two forms of primary school, where they have been answering questions about their experience in school choice.

Summary statisctics

First, In exploring the data I want to know which variables it contains, types of these variables and presence of missing values.

Default summary() is very simple as it gives a nice amount of information for some overview - which categories variables have or some of its descriptive statistics (mean, medium, min/max, NA, etc.). I do really like this function because of its simplicity: with minimal efforts it gives useful information for further analysis information.

Next, function from the "psych" package - describeBy(). It gives more detailed descriptive statistics, which is especially meaningful for numeric variables, and in contrast to thу summary() function its look is more pleasant and tidy.

There is also some kind of similar function describe() from a package "Hmisc". It shows total number, missing and distinct values, frequency and proportions for categorical values, which sometimes can be very usefull for the first glance on data. For numeric variables it shows relatively the same info as describeBy(), but in a more extensive way (5,10,25,50,75,90,95th percentiles, 5 lowest and 5 highest scores).

As for my personal taste I use basic summary() function most of the time. It helps me a lot in general understanding of the data that I have, giving all necessary for the first time information. For more details I use str() function from utils to get overview on the types of my variables, number of levels in factors and so on.

summary(mydata)

##                             schtype                     schtype2   
##  General Secondary School       :1379   0-No advanced status:1379  
##  Lyceum/ gymnasium              : 436   1-Advanced school   : 895  
##  School with some specialization: 459                              
##                                                                    
##                                                                    
##                                                                    
##                                                                    
##      momedu                       momedu2     fathedu    
##  Min.   :1.000   0-NO higher education: 629   1+2 : 277  
##  1st Qu.:3.000   1-Higher education   :1506   3   : 547  
##  Median :4.000   NA's                 : 139   4   :1229  
##  Mean   :3.557                                NA's: 221  
##  3rd Qu.:4.000                                           
##  Max.   :4.000                                           
##  NA's   :139                                             
##                               q41              q42      
##  1 shelf and less (<40 books)   :586   Don't know: 128  
##  2-3 shelves (60-120 books)     :588   No        :  19  
##  4-5 shelves (120-200 books)    :322   Yes       :1965  
##  6 shelves and more (>200 books):607   NA's      : 162  
##  NA's                           :171                    
##                                                         
##                                                         
##                q22         q24            ISEI      
##  1-Parents       : 424   No  :1349   Min.   :23.00  
##  2-Elder children: 349   Yes : 819   1st Qu.:43.00  
##  3-Others        : 107   NA's: 106   Median :52.00  
##  4-Nobody        :1291               Mean   :51.22  
##  NA's            : 103               3rd Qu.:59.00  
##                                      Max.   :85.00  
##                                      NA's   :1008

library(psych)
psych::describeBy(mydata)

library(Hmisc)
Hmisc::describe(mydata)

## mydata 
## 
##  10  Variables      2274  Observations
## ---------------------------------------------------------------------------
## schtype 
##        n  missing distinct 
##     2274        0        3 
##                                                                           
## Value             General Secondary School               Lyceum/ gymnasium
## Frequency                             1379                             436
## Proportion                           0.606                           0.192
##                                           
## Value      School with some specialization
## Frequency                              459
## Proportion                           0.202
## ---------------------------------------------------------------------------
## schtype2 
##        n  missing distinct 
##     2274        0        2 
##                                                     
## Value      0-No advanced status    1-Advanced school
## Frequency                  1379                  895
## Proportion                0.606                0.394
## ---------------------------------------------------------------------------
## momedu : Образование матери - четыре варианта  Format:F11.1 
##        n  missing distinct     Info     Mean      Gmd 
##     2135      139        4    0.641    3.557   0.6885 
##                                   
## Value          1     2     3     4
## Frequency    116    84   429  1506
## Proportion 0.054 0.039 0.201 0.705
## ---------------------------------------------------------------------------
## momedu2 
##        n  missing distinct 
##     2135      139        2 
##                                                       
## Value      0-NO higher education    1-Higher education
## Frequency                    629                  1506
## Proportion                 0.295                 0.705
## ---------------------------------------------------------------------------
## fathedu 
##        n  missing distinct 
##     2053      221        3 
##                             
## Value        1+2     3     4
## Frequency    277   547  1229
## Proportion 0.135 0.266 0.599
## ---------------------------------------------------------------------------
## q41 
##        n  missing distinct 
##     2103      171        4 
##                                                                           
## Value         1 shelf and less (<40 books)      2-3 shelves (60-120 books)
## Frequency                              586                             588
## Proportion                           0.279                           0.280
##                                                                           
## Value          4-5 shelves (120-200 books) 6 shelves and more (>200 books)
## Frequency                              322                             607
## Proportion                           0.153                           0.289
## ---------------------------------------------------------------------------
## q42 
##        n  missing distinct 
##     2112      162        3 
##                                            
## Value      Don't know         No        Yes
## Frequency         128         19       1965
## Proportion      0.061      0.009      0.930
## ---------------------------------------------------------------------------
## q22 
##        n  missing distinct 
##     2171      103        4 
##                                                              
## Value             1-Parents 2-Elder children         3-Others
## Frequency               424              349              107
## Proportion            0.195            0.161            0.049
##                            
## Value              4-Nobody
## Frequency              1291
## Proportion            0.595
## ---------------------------------------------------------------------------
## q24 
##        n  missing distinct 
##     2168      106        2 
##                       
## Value         No   Yes
## Frequency   1349   819
## Proportion 0.622 0.378
## ---------------------------------------------------------------------------
## ISEI : ISEI отца - его социально-профессиональный статус  Format:F8.2 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1266     1008       98        1    51.22    11.97       33       37 
##      .25      .50      .75      .90      .95 
##       43       52       59       65       66 
## 
## lowest : 23.0 23.5 25.0 27.0 27.5, highest: 74.0 75.0 76.0 78.0 85.0
## ---------------------------------------------------------------------------

str(mydata)

## tibble [2,274 x 10] (S3: tbl_df/tbl/data.frame)
##  $ schtype : Factor w/ 3 levels "General Secondary School",..: 1 1 1 1 1 1 2 3 1 1 ...
##  $ schtype2: Factor w/ 2 levels "0-No advanced status",..: 1 1 1 1 1 1 2 2 1 1 ...
##  $ momedu  : 'haven_labelled' num [1:2274] 4 4 4 4 NA 4 4 4 3 4 ...
##   ..- attr(*, "label")= chr "Образование матери - четыре варианта"
##   ..- attr(*, "format.spss")= chr "F11.1"
##   ..- attr(*, "display_width")= int 0
##   ..- attr(*, "labels")= Named num [1:4] 1 2 3 4
##   .. ..- attr(*, "names")= chr [1:4] "Среднее (школа)" "Начальное профессиональное (училище/ПТУ)" "Среднее профессиональное (колледж/техникум)" "Высшее (в т. ч. незаконченное высшее)"
##  $ momedu2 : Factor w/ 2 levels "0-NO higher education",..: 2 2 2 2 NA 2 2 2 1 2 ...
##  $ fathedu : Factor w/ 3 levels "1+2","3","4": 2 3 2 3 NA 1 3 3 2 2 ...
##  $ q41     : Factor w/ 4 levels "1 shelf and less (<40 books)",..: 2 3 2 4 2 1 2 1 2 1 ...
##  $ q42     : Factor w/ 3 levels "Don't know","No",..: NA 3 3 3 3 3 3 3 3 3 ...
##  $ q22     : Factor w/ 4 levels "1-Parents","2-Elder children",..: 2 NA NA NA NA NA NA NA NA NA ...
##  $ q24     : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 NA 1 ...
##  $ ISEI    : num [1:2274] 55.5 NA NA 61 NA 47 62.5 57.5 NA NA ...
##   ..- attr(*, "label")= chr "ISEI отца - его социально-профессиональный статус"
##   ..- attr(*, "format.spss")= chr "F8.2"
##   ..- attr(*, "display_width")= int 0

Visualization of tests

For data description I most often use the following tests and vizualizations:

chi-square for relation between two categorical variables, and show the result via barplots (if it is significant);
t-test and anova for relation between categorical and numerical variables. To present such results I call boxplots.

That is literally my startitng pack.

chi-square, contigency tables and barplots

Below there is my “ready to go” tool - combination of sjtab() contigency table and sjp.xtab() plot with printed test output, both from "sjPlot" package. It shows necessary statistic from chi-square test, so I do not need to add it like in case with ggplot() function. Also it is really nice looking for me.

library(sjPlot)
set_theme(base = theme_classic())
mydata %>% select( "momedu2", "q24") %>%
  sjtab(fun = "xtab", var.labels=c("Mother's education", "Considering other options"),
        show.row.prc=T, show.col.prc=T, show.summary=T, show.exp=T, show.legend=T, encoding = "UTF-8")

Mother’s education	Considering other options		Total
Mother’s education	No	Yes	Total
0-NO higher education	416 374 68.8 % 32.2 %	189 231 31.2 % 23.7 %	605 605 100 % 29 %
1-Higher education	874 916 59 % 67.8 %	607 565 41 % 76.3 %	1481 1481 100 % 71 %
Total	1290 1290 61.8 % 100 %	796 796 38.2 % 100 %	2086 2086 100 % 100 %
χ²=16.879 · df=1 · φ=0.091 · p=0.000

observed values
expected values
% within Mother’s education
% within Considering other options

sjp.xtab(mydata$q24, mydata$momedu2,  
         margin = "row", 
         bar.pos = "stack",  
         axis.titles = "Considering other options", 
         legend.title = "Mother's education", 
         show.summary = TRUE, 
         coord.flip = TRUE)

To dig deeper in chi-square and look at the residuals I print the results this way sometimes:

t_momedu_q24 <- with(mydata, table(momedu2, q24))
chi_momedu_q24 <- chisq.test(t_momedu_q24)
chi_momedu_q24

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  t_momedu_q24
## X-squared = 16.879, df = 1, p-value = 3.984e-05

chi_momedu_q24$stdres %>% round(digits = 2)

##                        q24
## momedu2                    No   Yes
##   0-NO higher education  4.16 -4.16
##   1-Higher education    -4.16  4.16

It shows everything in a simple and clear way.

In addition to show residuals I have also used a more complex formattable() function from the self-titled package, which is more aimed to present some information in a pretty way. However it was kind of… elaborately for me, I do not want to repeat it, so I will just show you a piece of some other part of analysis, where I compared how important parents with different education evaluate different school caracteristis. Table below shows their relation with closeness to the place of living (X2=23, sig.<0.05). With red and green colors significant residuals are highlighted accordingly if it shows negative or positive relations.

t_momedu_q2710 <- with(ds, table(momedu2, q27.10))
chi_momedu_q2710 <- chisq.test(t_momedu_q2710)
chi_momedu_q2710

## 
##  Pearson's Chi-squared test
## 
## data:  t_momedu_q2710
## X-squared = 23.042, df = 3, p-value = 3.957e-05

chi_momedu_q2710$stdres %>% round(digits = 2)

##                        q27.10
## momedu2                     0     1     2     3
##   0-NO higher education -1.56  4.09 -0.80 -3.51
##   1-Higher education     1.56 -4.09  0.80  3.51

library(formattable)
# closeness
Mother_ed <- c("Has higher education", "Has NO higher education")
ignore <- c("1.56", "-1.56") %>% as.numeric()
imp1 <- c("-4.09 ", "4.09 ") %>% as.numeric()
imp2 <- c("0.80", "-0.80") %>% as.numeric()
imp3 <- c("3.51", "-3.51") %>% as.numeric()
res27.10 <- data.frame(Mother_ed, ignore, imp1, imp2, imp3)

res27.1_10 <- formattable(res27.10, list(
  ignore = formatter("span", style = x ~ ifelse(x >= 1.8, "color:green", ifelse(x <= -1.8, "color:red", "color:grey"))),
  imp1 = formatter("span", style = x ~ ifelse(x >= 1.8, "color:green", ifelse(x <= -1.8, "color:red", "color:grey"))),
  imp2 = formatter("span", style = x ~ ifelse(x >= 1.8, "color:green", ifelse(x <= -1.8, "color:red", "color:grey"))),
  imp3 = formatter("span", style = x ~ ifelse(x >= 1.8, "color:green", ifelse(x <= -1.8, "color:red", "color:grey")))))
res27.1_10

Mother_ed	ignore	imp1	imp2	imp3
Has higher education	1.56	-4.09	0.8	3.51
Has NO higher education	-1.56	4.09	-0.8	-3.51

t-test, anova and box-plots

Although chi-square is good it is not enough to analyse the data, so I need t.test and anova. Below there is an example for searching if level of ISEI differs in two groups of parents - who were choosing schools among the alternative options and who were not. To present these result I most ofthen use ggplot() function. However, it is not perfect, as all test results I have to add to the plot manually in subheading.

t.test(mydata$ISEI ~ mydata$q24)

## 
##  Welch Two Sample t-test
## 
## data:  mydata$ISEI by mydata$q24
## t = -3.899, df = 1019.4, p-value = 0.0001029
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.581530 -1.183411
## sample estimates:
##  mean in group No mean in group Yes 
##          50.23529          52.61776

mydata %>% 
   filter(!is.na(q24)) %>%
  ggplot() + 
  geom_boxplot(aes(x = q24, y = ISEI), 
               colour = "blue") + 
  labs(y="ISEI", 
       x="Considering other option", 
       title="Difference in average ISEI of families who have been considering among\ndifferent options of schools and not",
       subtitle="t.test = -3.9, sig. < 0.05, difference in means = 2.4") +
  theme_bw()

As for the Analysis of Variation I may simple use aov() function from "stats" package and then explore its summary and ran TukeyHSD test to understand test’s significance (example below).

library(stats)
stats::aov(mydata$ISEI ~ mydata$q22)

## Call:
##    stats::aov(formula = mydata$ISEI ~ mydata$q22)
## 
## Terms:
##                 mydata$q22 Residuals
## Sum of Squares     2522.22 132822.93
## Deg. of Freedom          3      1233
## 
## Residual standard error: 10.37899
## Estimated effects may be unbalanced
## 1037 observations deleted due to missingness

a.22.isei <- aov(ds$ISEI ~ ds$q22) # значимые различия между группами 1-2 и 1-3 : в группе 1 isei меньше, чем в 2 и 3.
summary(a.22.isei) # F-value = 7.8, sig. < 0.05

##               Df Sum Sq Mean Sq F value   Pr(>F)    
## ds$q22         3   2522   840.7   7.805 3.66e-05 ***
## Residuals   1233 132823   107.7                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1037 observations deleted due to missingness

TukeyHSD(a.22.isei) # parents-elder_child (dif. = 2.73, sig.= 0.04), parents-noone (dif. = 3.61, sig.= 0.00003)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = ds$ISEI ~ ds$q22)
## 
## $`ds$q22`
##                                  diff         lwr      upr     p adj
## 2-Elder children-1-Parents  2.7257457  0.09392832 5.357563 0.0390433
## 3-Others-1-Parents          0.6803454 -3.14757585 4.508267 0.9682228
## 4-Nobody-1-Parents          3.6054728  1.58248032 5.628465 0.0000296
## 3-Others-2-Elder children  -2.0454002 -5.95315167 1.862351 0.5333888
## 4-Nobody-2-Elder children   0.8797271 -1.29053520 3.049989 0.7242696
## 4-Nobody-3-Others           2.9251273 -0.60142298 6.451678 0.1429906

mydata %>% 
   filter(!is.na(q22)) %>%
  ggplot() + 
  geom_boxplot(aes(x = q22, y = ISEI), 
               colour = "blue") + 
  labs(y="ISEI", 
       x="If anyone else have been studying in this school", 
       title="Difference in average ISEI of families who enrolled their children in school\nwhere they have already had experience",
       subtitle="F-value = 3.6; significant difference between groups parents-elder_child (dif. = 2.73, sig.= 0.04),\nparents-noone (dif. = 3.61, sig.= 0.00003)") +
  theme_bw()

However, as we remember - some packages count everything for us and print resuts on the graph, called by a simple line. It is the story about "sjPlot" package, which with one line shows me the graph with all printed results. It is great for some fast analysis, but I prefer deep glance on the problem.

sjPlot::sjp.aov1(mydata$ISEI, mydata$q22)

Doing Descriptive Statistics in R

Anna Podolskaya

05/12/2020