library(dplyr)
Today I am going to compare different tools and packages in R for presenting statistical descriptions of some data. I will evaluate in in terms of ease of use and how well does it present necessary for me data.
For this project I have decided to let you know a bit of my thesis paper. This is a study based on the data from Laboratory of Sociology in Education and Science in HSE. The purpose of it is to discover how parents with different socio-economic statuses choose schools for their children. The main idea is that in a field, where similar opportunities for enrollment had been created - every parent can choose up to three schools and the main factor of enrollment is prescription of school to the house, where child lives, - reproduction of social status remains the same - children from families with higher SES are enrolled in better schools and get better education, when children from the contrasting group.
So, I use data from the survey of parents of children from first two forms of primary school, where they have been answering questions about their experience in school choice.
First, In exploring the data I want to know which variables it contains, types of these variables and presence of missing values.
Default summary() is very simple as it gives a nice amount of information for some overview - which categories variables have or some of its descriptive statistics (mean, medium, min/max, NA, etc.). I do really like this function because of its simplicity: with minimal efforts it gives useful information for further analysis information.
Next, function from the "psych" package - describeBy(). It gives more detailed descriptive statistics, which is especially meaningful for numeric variables, and in contrast to thу summary() function its look is more pleasant and tidy.
There is also some kind of similar function describe() from a package "Hmisc". It shows total number, missing and distinct values, frequency and proportions for categorical values, which sometimes can be very usefull for the first glance on data. For numeric variables it shows relatively the same info as describeBy(), but in a more extensive way (5,10,25,50,75,90,95th percentiles, 5 lowest and 5 highest scores).
As for my personal taste I use basic summary() function most of the time. It helps me a lot in general understanding of the data that I have, giving all necessary for the first time information. For more details I use str() function from utils to get overview on the types of my variables, number of levels in factors and so on.
summary(mydata)
## schtype schtype2
## General Secondary School :1379 0-No advanced status:1379
## Lyceum/ gymnasium : 436 1-Advanced school : 895
## School with some specialization: 459
##
##
##
##
## momedu momedu2 fathedu
## Min. :1.000 0-NO higher education: 629 1+2 : 277
## 1st Qu.:3.000 1-Higher education :1506 3 : 547
## Median :4.000 NA's : 139 4 :1229
## Mean :3.557 NA's: 221
## 3rd Qu.:4.000
## Max. :4.000
## NA's :139
## q41 q42
## 1 shelf and less (<40 books) :586 Don't know: 128
## 2-3 shelves (60-120 books) :588 No : 19
## 4-5 shelves (120-200 books) :322 Yes :1965
## 6 shelves and more (>200 books):607 NA's : 162
## NA's :171
##
##
## q22 q24 ISEI
## 1-Parents : 424 No :1349 Min. :23.00
## 2-Elder children: 349 Yes : 819 1st Qu.:43.00
## 3-Others : 107 NA's: 106 Median :52.00
## 4-Nobody :1291 Mean :51.22
## NA's : 103 3rd Qu.:59.00
## Max. :85.00
## NA's :1008
library(psych)
psych::describeBy(mydata)
library(Hmisc)
Hmisc::describe(mydata)
## mydata
##
## 10 Variables 2274 Observations
## ---------------------------------------------------------------------------
## schtype
## n missing distinct
## 2274 0 3
##
## Value General Secondary School Lyceum/ gymnasium
## Frequency 1379 436
## Proportion 0.606 0.192
##
## Value School with some specialization
## Frequency 459
## Proportion 0.202
## ---------------------------------------------------------------------------
## schtype2
## n missing distinct
## 2274 0 2
##
## Value 0-No advanced status 1-Advanced school
## Frequency 1379 895
## Proportion 0.606 0.394
## ---------------------------------------------------------------------------
## momedu : Образование матери - четыре варианта Format:F11.1
## n missing distinct Info Mean Gmd
## 2135 139 4 0.641 3.557 0.6885
##
## Value 1 2 3 4
## Frequency 116 84 429 1506
## Proportion 0.054 0.039 0.201 0.705
## ---------------------------------------------------------------------------
## momedu2
## n missing distinct
## 2135 139 2
##
## Value 0-NO higher education 1-Higher education
## Frequency 629 1506
## Proportion 0.295 0.705
## ---------------------------------------------------------------------------
## fathedu
## n missing distinct
## 2053 221 3
##
## Value 1+2 3 4
## Frequency 277 547 1229
## Proportion 0.135 0.266 0.599
## ---------------------------------------------------------------------------
## q41
## n missing distinct
## 2103 171 4
##
## Value 1 shelf and less (<40 books) 2-3 shelves (60-120 books)
## Frequency 586 588
## Proportion 0.279 0.280
##
## Value 4-5 shelves (120-200 books) 6 shelves and more (>200 books)
## Frequency 322 607
## Proportion 0.153 0.289
## ---------------------------------------------------------------------------
## q42
## n missing distinct
## 2112 162 3
##
## Value Don't know No Yes
## Frequency 128 19 1965
## Proportion 0.061 0.009 0.930
## ---------------------------------------------------------------------------
## q22
## n missing distinct
## 2171 103 4
##
## Value 1-Parents 2-Elder children 3-Others
## Frequency 424 349 107
## Proportion 0.195 0.161 0.049
##
## Value 4-Nobody
## Frequency 1291
## Proportion 0.595
## ---------------------------------------------------------------------------
## q24
## n missing distinct
## 2168 106 2
##
## Value No Yes
## Frequency 1349 819
## Proportion 0.622 0.378
## ---------------------------------------------------------------------------
## ISEI : ISEI отца - его социально-профессиональный статус Format:F8.2
## n missing distinct Info Mean Gmd .05 .10
## 1266 1008 98 1 51.22 11.97 33 37
## .25 .50 .75 .90 .95
## 43 52 59 65 66
##
## lowest : 23.0 23.5 25.0 27.0 27.5, highest: 74.0 75.0 76.0 78.0 85.0
## ---------------------------------------------------------------------------
str(mydata)
## tibble [2,274 x 10] (S3: tbl_df/tbl/data.frame)
## $ schtype : Factor w/ 3 levels "General Secondary School",..: 1 1 1 1 1 1 2 3 1 1 ...
## $ schtype2: Factor w/ 2 levels "0-No advanced status",..: 1 1 1 1 1 1 2 2 1 1 ...
## $ momedu : 'haven_labelled' num [1:2274] 4 4 4 4 NA 4 4 4 3 4 ...
## ..- attr(*, "label")= chr "Образование матери - четыре варианта"
## ..- attr(*, "format.spss")= chr "F11.1"
## ..- attr(*, "display_width")= int 0
## ..- attr(*, "labels")= Named num [1:4] 1 2 3 4
## .. ..- attr(*, "names")= chr [1:4] "Среднее (школа)" "Начальное профессиональное (училище/ПТУ)" "Среднее профессиональное (колледж/техникум)" "Высшее (в т. ч. незаконченное высшее)"
## $ momedu2 : Factor w/ 2 levels "0-NO higher education",..: 2 2 2 2 NA 2 2 2 1 2 ...
## $ fathedu : Factor w/ 3 levels "1+2","3","4": 2 3 2 3 NA 1 3 3 2 2 ...
## $ q41 : Factor w/ 4 levels "1 shelf and less (<40 books)",..: 2 3 2 4 2 1 2 1 2 1 ...
## $ q42 : Factor w/ 3 levels "Don't know","No",..: NA 3 3 3 3 3 3 3 3 3 ...
## $ q22 : Factor w/ 4 levels "1-Parents","2-Elder children",..: 2 NA NA NA NA NA NA NA NA NA ...
## $ q24 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 NA 1 ...
## $ ISEI : num [1:2274] 55.5 NA NA 61 NA 47 62.5 57.5 NA NA ...
## ..- attr(*, "label")= chr "ISEI отца - его социально-профессиональный статус"
## ..- attr(*, "format.spss")= chr "F8.2"
## ..- attr(*, "display_width")= int 0
For data description I most often use the following tests and vizualizations:
That is literally my startitng pack.
Below there is my “ready to go” tool - combination of sjtab() contigency table and sjp.xtab() plot with printed test output, both from "sjPlot" package. It shows necessary statistic from chi-square test, so I do not need to add it like in case with ggplot() function. Also it is really nice looking for me.
library(sjPlot)
set_theme(base = theme_classic())
mydata %>% select( "momedu2", "q24") %>%
sjtab(fun = "xtab", var.labels=c("Mother's education", "Considering other options"),
show.row.prc=T, show.col.prc=T, show.summary=T, show.exp=T, show.legend=T, encoding = "UTF-8")
| Mother’s education |
Considering other options |
Total | |
|---|---|---|---|
| No | Yes | ||
|
0-NO higher education |
416 374 68.8 % 32.2 % |
189 231 31.2 % 23.7 % |
605 605 100 % 29 % |
| 1-Higher education |
874 916 59 % 67.8 % |
607 565 41 % 76.3 % |
1481 1481 100 % 71 % |
| Total |
1290 1290 61.8 % 100 % |
796 796 38.2 % 100 % |
2086 2086 100 % 100 % |
χ2=16.879 · df=1 · φ=0.091 · p=0.000 |
observed values
expected values
% within Mother’s education
% within Considering other options
sjp.xtab(mydata$q24, mydata$momedu2,
margin = "row",
bar.pos = "stack",
axis.titles = "Considering other options",
legend.title = "Mother's education",
show.summary = TRUE,
coord.flip = TRUE)
To dig deeper in chi-square and look at the residuals I print the results this way sometimes:
t_momedu_q24 <- with(mydata, table(momedu2, q24))
chi_momedu_q24 <- chisq.test(t_momedu_q24)
chi_momedu_q24
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: t_momedu_q24
## X-squared = 16.879, df = 1, p-value = 3.984e-05
chi_momedu_q24$stdres %>% round(digits = 2)
## q24
## momedu2 No Yes
## 0-NO higher education 4.16 -4.16
## 1-Higher education -4.16 4.16
It shows everything in a simple and clear way.
In addition to show residuals I have also used a more complex formattable() function from the self-titled package, which is more aimed to present some information in a pretty way. However it was kind of… elaborately for me, I do not want to repeat it, so I will just show you a piece of some other part of analysis, where I compared how important parents with different education evaluate different school caracteristis. Table below shows their relation with closeness to the place of living (X2=23, sig.<0.05). With red and green colors significant residuals are highlighted accordingly if it shows negative or positive relations.
t_momedu_q2710 <- with(ds, table(momedu2, q27.10))
chi_momedu_q2710 <- chisq.test(t_momedu_q2710)
chi_momedu_q2710
##
## Pearson's Chi-squared test
##
## data: t_momedu_q2710
## X-squared = 23.042, df = 3, p-value = 3.957e-05
chi_momedu_q2710$stdres %>% round(digits = 2)
## q27.10
## momedu2 0 1 2 3
## 0-NO higher education -1.56 4.09 -0.80 -3.51
## 1-Higher education 1.56 -4.09 0.80 3.51
library(formattable)
# closeness
Mother_ed <- c("Has higher education", "Has NO higher education")
ignore <- c("1.56", "-1.56") %>% as.numeric()
imp1 <- c("-4.09 ", "4.09 ") %>% as.numeric()
imp2 <- c("0.80", "-0.80") %>% as.numeric()
imp3 <- c("3.51", "-3.51") %>% as.numeric()
res27.10 <- data.frame(Mother_ed, ignore, imp1, imp2, imp3)
res27.1_10 <- formattable(res27.10, list(
ignore = formatter("span", style = x ~ ifelse(x >= 1.8, "color:green", ifelse(x <= -1.8, "color:red", "color:grey"))),
imp1 = formatter("span", style = x ~ ifelse(x >= 1.8, "color:green", ifelse(x <= -1.8, "color:red", "color:grey"))),
imp2 = formatter("span", style = x ~ ifelse(x >= 1.8, "color:green", ifelse(x <= -1.8, "color:red", "color:grey"))),
imp3 = formatter("span", style = x ~ ifelse(x >= 1.8, "color:green", ifelse(x <= -1.8, "color:red", "color:grey")))))
res27.1_10
| Mother_ed | ignore | imp1 | imp2 | imp3 |
|---|---|---|---|---|
| Has higher education | 1.56 | -4.09 | 0.8 | 3.51 |
| Has NO higher education | -1.56 | 4.09 | -0.8 | -3.51 |
Although chi-square is good it is not enough to analyse the data, so I need t.test and anova. Below there is an example for searching if level of ISEI differs in two groups of parents - who were choosing schools among the alternative options and who were not. To present these result I most ofthen use ggplot() function. However, it is not perfect, as all test results I have to add to the plot manually in subheading.
t.test(mydata$ISEI ~ mydata$q24)
##
## Welch Two Sample t-test
##
## data: mydata$ISEI by mydata$q24
## t = -3.899, df = 1019.4, p-value = 0.0001029
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.581530 -1.183411
## sample estimates:
## mean in group No mean in group Yes
## 50.23529 52.61776
mydata %>%
filter(!is.na(q24)) %>%
ggplot() +
geom_boxplot(aes(x = q24, y = ISEI),
colour = "blue") +
labs(y="ISEI",
x="Considering other option",
title="Difference in average ISEI of families who have been considering among\ndifferent options of schools and not",
subtitle="t.test = -3.9, sig. < 0.05, difference in means = 2.4") +
theme_bw()
As for the Analysis of Variation I may simple use aov() function from "stats" package and then explore its summary and ran TukeyHSD test to understand test’s significance (example below).
library(stats)
stats::aov(mydata$ISEI ~ mydata$q22)
## Call:
## stats::aov(formula = mydata$ISEI ~ mydata$q22)
##
## Terms:
## mydata$q22 Residuals
## Sum of Squares 2522.22 132822.93
## Deg. of Freedom 3 1233
##
## Residual standard error: 10.37899
## Estimated effects may be unbalanced
## 1037 observations deleted due to missingness
a.22.isei <- aov(ds$ISEI ~ ds$q22) # значимые различия между группами 1-2 и 1-3 : в группе 1 isei меньше, чем в 2 и 3.
summary(a.22.isei) # F-value = 7.8, sig. < 0.05
## Df Sum Sq Mean Sq F value Pr(>F)
## ds$q22 3 2522 840.7 7.805 3.66e-05 ***
## Residuals 1233 132823 107.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1037 observations deleted due to missingness
TukeyHSD(a.22.isei) # parents-elder_child (dif. = 2.73, sig.= 0.04), parents-noone (dif. = 3.61, sig.= 0.00003)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = ds$ISEI ~ ds$q22)
##
## $`ds$q22`
## diff lwr upr p adj
## 2-Elder children-1-Parents 2.7257457 0.09392832 5.357563 0.0390433
## 3-Others-1-Parents 0.6803454 -3.14757585 4.508267 0.9682228
## 4-Nobody-1-Parents 3.6054728 1.58248032 5.628465 0.0000296
## 3-Others-2-Elder children -2.0454002 -5.95315167 1.862351 0.5333888
## 4-Nobody-2-Elder children 0.8797271 -1.29053520 3.049989 0.7242696
## 4-Nobody-3-Others 2.9251273 -0.60142298 6.451678 0.1429906
mydata %>%
filter(!is.na(q22)) %>%
ggplot() +
geom_boxplot(aes(x = q22, y = ISEI),
colour = "blue") +
labs(y="ISEI",
x="If anyone else have been studying in this school",
title="Difference in average ISEI of families who enrolled their children in school\nwhere they have already had experience",
subtitle="F-value = 3.6; significant difference between groups parents-elder_child (dif. = 2.73, sig.= 0.04),\nparents-noone (dif. = 3.61, sig.= 0.00003)") +
theme_bw()
However, as we remember - some packages count everything for us and print resuts on the graph, called by a simple line. It is the story about "sjPlot" package, which with one line shows me the graph with all printed results. It is great for some fast analysis, but I prefer deep glance on the problem.
sjPlot::sjp.aov1(mydata$ISEI, mydata$q22)