library(dplyr)

Project Description

Today I am going to compare different tools and packages in R for presenting statistical descriptions of some data. I will evaluate in in terms of ease of use and how well does it present necessary for me data.

Introduction to the data

For this project I have decided to let you know a bit of my thesis paper. This is a study based on the data from Laboratory of Sociology in Education and Science in HSE. The purpose of it is to discover how parents with different socio-economic statuses choose schools for their children. The main idea is that in a field, where similar opportunities for enrollment had been created - every parent can choose up to three schools and the main factor of enrollment is prescription of school to the house, where child lives, - reproduction of social status remains the same - children from families with higher SES are enrolled in better schools and get better education, when children from the contrasting group.

So, I use data from the survey of parents of children from first two forms of primary school, where they have been answering questions about their experience in school choice.

Summary statisctics

First, In exploring the data I want to know which variables it contains, types of these variables and presence of missing values.

Default summary() is very simple as it gives a nice amount of information for some overview - which categories variables have or some of its descriptive statistics (mean, medium, min/max, NA, etc.). I do really like this function because of its simplicity: with minimal efforts it gives useful information for further analysis information.

Next, function from the "psych" package - describeBy(). It gives more detailed descriptive statistics, which is especially meaningful for numeric variables, and in contrast to thу summary() function its look is more pleasant and tidy.

There is also some kind of similar function describe() from a package "Hmisc". It shows total number, missing and distinct values, frequency and proportions for categorical values, which sometimes can be very usefull for the first glance on data. For numeric variables it shows relatively the same info as describeBy(), but in a more extensive way (5,10,25,50,75,90,95th percentiles, 5 lowest and 5 highest scores).

As for my personal taste I use basic summary() function most of the time. It helps me a lot in general understanding of the data that I have, giving all necessary for the first time information. For more details I use str() function from utils to get overview on the types of my variables, number of levels in factors and so on.

summary(mydata)
##                             schtype                     schtype2   
##  General Secondary School       :1379   0-No advanced status:1379  
##  Lyceum/ gymnasium              : 436   1-Advanced school   : 895  
##  School with some specialization: 459                              
##                                                                    
##                                                                    
##                                                                    
##                                                                    
##      momedu                       momedu2     fathedu    
##  Min.   :1.000   0-NO higher education: 629   1+2 : 277  
##  1st Qu.:3.000   1-Higher education   :1506   3   : 547  
##  Median :4.000   NA's                 : 139   4   :1229  
##  Mean   :3.557                                NA's: 221  
##  3rd Qu.:4.000                                           
##  Max.   :4.000                                           
##  NA's   :139                                             
##                               q41              q42      
##  1 shelf and less (<40 books)   :586   Don't know: 128  
##  2-3 shelves (60-120 books)     :588   No        :  19  
##  4-5 shelves (120-200 books)    :322   Yes       :1965  
##  6 shelves and more (>200 books):607   NA's      : 162  
##  NA's                           :171                    
##                                                         
##                                                         
##                q22         q24            ISEI      
##  1-Parents       : 424   No  :1349   Min.   :23.00  
##  2-Elder children: 349   Yes : 819   1st Qu.:43.00  
##  3-Others        : 107   NA's: 106   Median :52.00  
##  4-Nobody        :1291               Mean   :51.22  
##  NA's            : 103               3rd Qu.:59.00  
##                                      Max.   :85.00  
##                                      NA's   :1008
library(psych)
psych::describeBy(mydata)
library(Hmisc)
Hmisc::describe(mydata)
## mydata 
## 
##  10  Variables      2274  Observations
## ---------------------------------------------------------------------------
## schtype 
##        n  missing distinct 
##     2274        0        3 
##                                                                           
## Value             General Secondary School               Lyceum/ gymnasium
## Frequency                             1379                             436
## Proportion                           0.606                           0.192
##                                           
## Value      School with some specialization
## Frequency                              459
## Proportion                           0.202
## ---------------------------------------------------------------------------
## schtype2 
##        n  missing distinct 
##     2274        0        2 
##                                                     
## Value      0-No advanced status    1-Advanced school
## Frequency                  1379                  895
## Proportion                0.606                0.394
## ---------------------------------------------------------------------------
## momedu : Образование матери - четыре варианта  Format:F11.1 
##        n  missing distinct     Info     Mean      Gmd 
##     2135      139        4    0.641    3.557   0.6885 
##                                   
## Value          1     2     3     4
## Frequency    116    84   429  1506
## Proportion 0.054 0.039 0.201 0.705
## ---------------------------------------------------------------------------
## momedu2 
##        n  missing distinct 
##     2135      139        2 
##                                                       
## Value      0-NO higher education    1-Higher education
## Frequency                    629                  1506
## Proportion                 0.295                 0.705
## ---------------------------------------------------------------------------
## fathedu 
##        n  missing distinct 
##     2053      221        3 
##                             
## Value        1+2     3     4
## Frequency    277   547  1229
## Proportion 0.135 0.266 0.599
## ---------------------------------------------------------------------------
## q41 
##        n  missing distinct 
##     2103      171        4 
##                                                                           
## Value         1 shelf and less (<40 books)      2-3 shelves (60-120 books)
## Frequency                              586                             588
## Proportion                           0.279                           0.280
##                                                                           
## Value          4-5 shelves (120-200 books) 6 shelves and more (>200 books)
## Frequency                              322                             607
## Proportion                           0.153                           0.289
## ---------------------------------------------------------------------------
## q42 
##        n  missing distinct 
##     2112      162        3 
##                                            
## Value      Don't know         No        Yes
## Frequency         128         19       1965
## Proportion      0.061      0.009      0.930
## ---------------------------------------------------------------------------
## q22 
##        n  missing distinct 
##     2171      103        4 
##                                                              
## Value             1-Parents 2-Elder children         3-Others
## Frequency               424              349              107
## Proportion            0.195            0.161            0.049
##                            
## Value              4-Nobody
## Frequency              1291
## Proportion            0.595
## ---------------------------------------------------------------------------
## q24 
##        n  missing distinct 
##     2168      106        2 
##                       
## Value         No   Yes
## Frequency   1349   819
## Proportion 0.622 0.378
## ---------------------------------------------------------------------------
## ISEI : ISEI отца - его социально-профессиональный статус  Format:F8.2 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1266     1008       98        1    51.22    11.97       33       37 
##      .25      .50      .75      .90      .95 
##       43       52       59       65       66 
## 
## lowest : 23.0 23.5 25.0 27.0 27.5, highest: 74.0 75.0 76.0 78.0 85.0
## ---------------------------------------------------------------------------
str(mydata)
## tibble [2,274 x 10] (S3: tbl_df/tbl/data.frame)
##  $ schtype : Factor w/ 3 levels "General Secondary School",..: 1 1 1 1 1 1 2 3 1 1 ...
##  $ schtype2: Factor w/ 2 levels "0-No advanced status",..: 1 1 1 1 1 1 2 2 1 1 ...
##  $ momedu  : 'haven_labelled' num [1:2274] 4 4 4 4 NA 4 4 4 3 4 ...
##   ..- attr(*, "label")= chr "Образование матери - четыре варианта"
##   ..- attr(*, "format.spss")= chr "F11.1"
##   ..- attr(*, "display_width")= int 0
##   ..- attr(*, "labels")= Named num [1:4] 1 2 3 4
##   .. ..- attr(*, "names")= chr [1:4] "Среднее (школа)" "Начальное профессиональное (училище/ПТУ)" "Среднее профессиональное (колледж/техникум)" "Высшее (в т. ч. незаконченное высшее)"
##  $ momedu2 : Factor w/ 2 levels "0-NO higher education",..: 2 2 2 2 NA 2 2 2 1 2 ...
##  $ fathedu : Factor w/ 3 levels "1+2","3","4": 2 3 2 3 NA 1 3 3 2 2 ...
##  $ q41     : Factor w/ 4 levels "1 shelf and less (<40 books)",..: 2 3 2 4 2 1 2 1 2 1 ...
##  $ q42     : Factor w/ 3 levels "Don't know","No",..: NA 3 3 3 3 3 3 3 3 3 ...
##  $ q22     : Factor w/ 4 levels "1-Parents","2-Elder children",..: 2 NA NA NA NA NA NA NA NA NA ...
##  $ q24     : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 NA 1 ...
##  $ ISEI    : num [1:2274] 55.5 NA NA 61 NA 47 62.5 57.5 NA NA ...
##   ..- attr(*, "label")= chr "ISEI отца - его социально-профессиональный статус"
##   ..- attr(*, "format.spss")= chr "F8.2"
##   ..- attr(*, "display_width")= int 0

Visualization of tests

For data description I most often use the following tests and vizualizations:

  • chi-square for relation between two categorical variables, and show the result via barplots (if it is significant);
  • t-test and anova for relation between categorical and numerical variables. To present such results I call boxplots.

That is literally my startitng pack.

chi-square, contigency tables and barplots

Below there is my “ready to go” tool - combination of sjtab() contigency table and sjp.xtab() plot with printed test output, both from "sjPlot" package. It shows necessary statistic from chi-square test, so I do not need to add it like in case with ggplot() function. Also it is really nice looking for me.

library(sjPlot)
set_theme(base = theme_classic())
mydata %>% select( "momedu2", "q24") %>%
  sjtab(fun = "xtab", var.labels=c("Mother's education", "Considering other options"),
        show.row.prc=T, show.col.prc=T, show.summary=T, show.exp=T, show.legend=T, encoding = "UTF-8")
Mother’s education Considering other
options
Total
No Yes
0-NO higher
education
416
374
68.8 %
32.2 %
189
231
31.2 %
23.7 %
605
605
100 %
29 %
1-Higher education 874
916
59 %
67.8 %
607
565
41 %
76.3 %
1481
1481
100 %
71 %
Total 1290
1290
61.8 %
100 %
796
796
38.2 %
100 %
2086
2086
100 %
100 %
χ2=16.879 · df=1 · φ=0.091 · p=0.000

observed values
expected values
% within Mother’s education
% within Considering other options

sjp.xtab(mydata$q24, mydata$momedu2,  
         margin = "row", 
         bar.pos = "stack",  
         axis.titles = "Considering other options", 
         legend.title = "Mother's education", 
         show.summary = TRUE, 
         coord.flip = TRUE)

To dig deeper in chi-square and look at the residuals I print the results this way sometimes:

t_momedu_q24 <- with(mydata, table(momedu2, q24))
chi_momedu_q24 <- chisq.test(t_momedu_q24)
chi_momedu_q24
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  t_momedu_q24
## X-squared = 16.879, df = 1, p-value = 3.984e-05
chi_momedu_q24$stdres %>% round(digits = 2)
##                        q24
## momedu2                    No   Yes
##   0-NO higher education  4.16 -4.16
##   1-Higher education    -4.16  4.16

It shows everything in a simple and clear way.

In addition to show residuals I have also used a more complex formattable() function from the self-titled package, which is more aimed to present some information in a pretty way. However it was kind of… elaborately for me, I do not want to repeat it, so I will just show you a piece of some other part of analysis, where I compared how important parents with different education evaluate different school caracteristis. Table below shows their relation with closeness to the place of living (X2=23, sig.<0.05). With red and green colors significant residuals are highlighted accordingly if it shows negative or positive relations.

t_momedu_q2710 <- with(ds, table(momedu2, q27.10))
chi_momedu_q2710 <- chisq.test(t_momedu_q2710)
chi_momedu_q2710
## 
##  Pearson's Chi-squared test
## 
## data:  t_momedu_q2710
## X-squared = 23.042, df = 3, p-value = 3.957e-05
chi_momedu_q2710$stdres %>% round(digits = 2)
##                        q27.10
## momedu2                     0     1     2     3
##   0-NO higher education -1.56  4.09 -0.80 -3.51
##   1-Higher education     1.56 -4.09  0.80  3.51
library(formattable)
# closeness
Mother_ed <- c("Has higher education", "Has NO higher education")
ignore <- c("1.56", "-1.56") %>% as.numeric()
imp1 <- c("-4.09 ", "4.09 ") %>% as.numeric()
imp2 <- c("0.80", "-0.80") %>% as.numeric()
imp3 <- c("3.51", "-3.51") %>% as.numeric()
res27.10 <- data.frame(Mother_ed, ignore, imp1, imp2, imp3)

res27.1_10 <- formattable(res27.10, list(
  ignore = formatter("span", style = x ~ ifelse(x >= 1.8, "color:green", ifelse(x <= -1.8, "color:red", "color:grey"))),
  imp1 = formatter("span", style = x ~ ifelse(x >= 1.8, "color:green", ifelse(x <= -1.8, "color:red", "color:grey"))),
  imp2 = formatter("span", style = x ~ ifelse(x >= 1.8, "color:green", ifelse(x <= -1.8, "color:red", "color:grey"))),
  imp3 = formatter("span", style = x ~ ifelse(x >= 1.8, "color:green", ifelse(x <= -1.8, "color:red", "color:grey")))))
res27.1_10
Mother_ed ignore imp1 imp2 imp3
Has higher education 1.56 -4.09 0.8 3.51
Has NO higher education -1.56 4.09 -0.8 -3.51

t-test, anova and box-plots

Although chi-square is good it is not enough to analyse the data, so I need t.test and anova. Below there is an example for searching if level of ISEI differs in two groups of parents - who were choosing schools among the alternative options and who were not. To present these result I most ofthen use ggplot() function. However, it is not perfect, as all test results I have to add to the plot manually in subheading.

t.test(mydata$ISEI ~ mydata$q24)
## 
##  Welch Two Sample t-test
## 
## data:  mydata$ISEI by mydata$q24
## t = -3.899, df = 1019.4, p-value = 0.0001029
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.581530 -1.183411
## sample estimates:
##  mean in group No mean in group Yes 
##          50.23529          52.61776
mydata %>% 
   filter(!is.na(q24)) %>%
  ggplot() + 
  geom_boxplot(aes(x = q24, y = ISEI), 
               colour = "blue") + 
  labs(y="ISEI", 
       x="Considering other option", 
       title="Difference in average ISEI of families who have been considering among\ndifferent options of schools and not",
       subtitle="t.test = -3.9, sig. < 0.05, difference in means = 2.4") +
  theme_bw()

As for the Analysis of Variation I may simple use aov() function from "stats" package and then explore its summary and ran TukeyHSD test to understand test’s significance (example below).

library(stats)
stats::aov(mydata$ISEI ~ mydata$q22)
## Call:
##    stats::aov(formula = mydata$ISEI ~ mydata$q22)
## 
## Terms:
##                 mydata$q22 Residuals
## Sum of Squares     2522.22 132822.93
## Deg. of Freedom          3      1233
## 
## Residual standard error: 10.37899
## Estimated effects may be unbalanced
## 1037 observations deleted due to missingness
a.22.isei <- aov(ds$ISEI ~ ds$q22) # значимые различия между группами 1-2 и 1-3 : в группе 1 isei меньше, чем в 2 и 3.
summary(a.22.isei) # F-value = 7.8, sig. < 0.05
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## ds$q22         3   2522   840.7   7.805 3.66e-05 ***
## Residuals   1233 132823   107.7                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1037 observations deleted due to missingness
TukeyHSD(a.22.isei) # parents-elder_child (dif. = 2.73, sig.= 0.04), parents-noone (dif. = 3.61, sig.= 0.00003)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = ds$ISEI ~ ds$q22)
## 
## $`ds$q22`
##                                  diff         lwr      upr     p adj
## 2-Elder children-1-Parents  2.7257457  0.09392832 5.357563 0.0390433
## 3-Others-1-Parents          0.6803454 -3.14757585 4.508267 0.9682228
## 4-Nobody-1-Parents          3.6054728  1.58248032 5.628465 0.0000296
## 3-Others-2-Elder children  -2.0454002 -5.95315167 1.862351 0.5333888
## 4-Nobody-2-Elder children   0.8797271 -1.29053520 3.049989 0.7242696
## 4-Nobody-3-Others           2.9251273 -0.60142298 6.451678 0.1429906
mydata %>% 
   filter(!is.na(q22)) %>%
  ggplot() + 
  geom_boxplot(aes(x = q22, y = ISEI), 
               colour = "blue") + 
  labs(y="ISEI", 
       x="If anyone else have been studying in this school", 
       title="Difference in average ISEI of families who enrolled their children in school\nwhere they have already had experience",
       subtitle="F-value = 3.6; significant difference between groups parents-elder_child (dif. = 2.73, sig.= 0.04),\nparents-noone (dif. = 3.61, sig.= 0.00003)") +
  theme_bw()

However, as we remember - some packages count everything for us and print resuts on the graph, called by a simple line. It is the story about "sjPlot" package, which with one line shows me the graph with all printed results. It is great for some fast analysis, but I prefer deep glance on the problem.

sjPlot::sjp.aov1(mydata$ISEI, mydata$q22)