class: center, middle, inverse, title-slide .title[ # An Introduction to
R
and RStudio for Educational Researchers ] .subtitle[ ##
Descriptive and Inferential Statistics: Descriptive Statistics ] .author[ ### Jorge Sinval ] .date[ ### 2025-11-18 ] --- class: inverse, center, middle <style> .orange { color: #EB811B; } .kbd { display: inline-block; padding: .2em .5em; font-size: 0.75em; line-height: 1.75; color: #555; vertical-align: middle; background-color: #fcfcfc; border: solid 1px #ccc; border-bottom-color: #bbb; border-radius: 3px; box-shadow: inset 0 -1px 0 #bbb } </style>
# 2. Descriptive Statistics and Graphics

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=800px></html>

---

# 2.1 Descriptive Statistics

> …the area of statistics whose objective is to summarize the information in the variables under study and to present it in a way that is **easy to grasp and to visualize**...

Usually done with one or more of the following families of "statistics":

--

- Measures of central tendency

--

- Measures of non-central tendency

--

- Measures of dispersion

--

- Measures of shape

--

- Measures of association

--

- etc.

---

# 2.1 Descriptive Statistics

## Statistics, Statistic, Parameters, Estimator

--

**Statistics**: The area of mathematics whose objective is to collect, organize, and analyze information, and to model and infer results for populations from the study of samples drawn from those populations…

--

**Parameters**: Population quantities that characterize the variables in populations. Greek letters `\((\mu, \sigma, \rho, \theta)\)`

--

**Statistic**: Quantity calculated in a sample that characterizes the behavior of the variables under study. Roman letters `\((\bar x, s, r, \tilde x)\)`

--

**Estimator**: expression for calculating a statistic (e.g., for the sample mean, `\(\bar X = \frac{1}{n} \sum\limits_{i=1}^n X_i\)`; the same expression is used to estimate the population mean, `\(\hat \mu = \frac{1}{n} \sum\limits_{i=1}^n X_i\)`).
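---

# 2.1 Descriptive Statistics

## Example: parameters vs. statistics

The distinction above can be sketched in R. This is only an illustration: the population is assumed to be normal, and the values `mu = 100` and `sigma = 15` are hypothetical, chosen for the example. The parameters are fixed, while the statistics depend on the sample that happens to be drawn.

```r
# Hypothetical population parameters (illustrative values, not from the slides)
mu    <- 100
sigma <- 15

set.seed(123)                               # makes the simulated sample reproducible
x <- rnorm(n = 50, mean = mu, sd = sigma)   # one sample of n = 50

# Estimator of the mean written out: (1 / n) * sum of the X_i
n     <- length(x)
x_bar <- sum(x) / n

# Same value as the built-in function
all.equal(x_bar, mean(x))

# Sample statistics estimate, but rarely equal, the parameters
c(x_bar = x_bar, s = sd(x))
```

Changing the seed yields a different sample and therefore different values of `x_bar` and `s`, while `mu` and `sigma` stay the same.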
---

# 2.2 Central tendency

> Indicate the central value (central tendency) of the data

--

**Mean**: the center of gravity of the distribution: `\(\bar X = \frac{1}{n} \sum\limits_{i=1}^n X_i\)`

--

**Median**: the central point of the distribution. All values are sorted in ascending order, and the central point is determined: `\(\tilde X= \begin{cases} \frac{X_{\frac{n}{2}}+X_{\frac{n+2}{2}}}{2}, & \text{if } \textit{n} \text{ even} \\ X_{\frac{n+1}{2}}, & \text{if } \textit{n} \text{ odd}\end{cases}\)`

--

**Mode**: most frequent value

---

# 2.2 Central tendency

.pull-left[
**Mean**:

Advantages:

- Familiar concept
- It always exists
- It's a unique value

Disadvantages:

- It is strongly influenced by outliers
]

.pull-right[
<center>
<img src="assets/img/mean_funny.jpg" width = 100%>
</center>
]

---

# 2.2 Central tendency

**Median**:

Advantages:

- Not influenced by outliers

Disadvantages:

- Does not consider the magnitude of the values, only their "order"

---

# 2.3 Non-central tendency

**Quartiles**: values below which `\(\frac{Q}{4}\times 100\%\)` of the sample elements are found `\((Q = 1, 2, 3)\)`

--

`\(Q_1=P_{25}\)`: value below which `\(25\%\)` of the sample elements are found

--

`\(Q_2=P_{50}=\tilde X\)`: value below which `\(50\%\)` of the sample elements are found (the median)

--

`\(Q_3=P_{75}\)`: value below which `\(75\%\)` of the sample elements are found

---

# 2.4 Dispersion

> Measures of central tendency do not provide information about the dispersion (variation) of the data. Generalizations may be incorrect when the dispersion of the data is not considered.

**Sample Variance**: Inertia of the distribution `\(S'^2 = \frac{1}{n-1} \sum\limits_{i=1}^n (X_i- \bar X)^2\)`

--

**Range**: `\(R=Max-Min\)`

--

**Interquartile range**: `\(IQR = Q_3 - Q_1\)`

---

# 2.4 Dispersion

<br> <br> <br>

The **standard deviation** is the most used measure of dispersion <sup>🤔</sup>:

`$$S'=\sqrt{S'^2}=\sqrt{\frac{1}{n-1} \sum\limits_{i=1}^n (X_i- \bar X)^2}$$`

--

…because it's in the same units as the variable.
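The measures from sections 2.2 to 2.4 can be computed in base R. The sketch below uses a small, made-up vector (`scores`, hypothetical values with one deliberate outlier) to show how the outlier affects the mean but not the median. Note that `quantile()` and `IQR()` interpolate between observations by default, so their results can differ slightly from a hand calculation based on the definitions above.

```r
# Hypothetical test scores; the last value is a deliberate outlier
scores <- c(10, 12, 12, 13, 14, 15, 16, 40)

mean(scores)      # pulled upwards by the outlier
median(scores)    # depends only on the order of the values

# Base R has no function for the statistical mode; one simple approach
# (returns every value tied for the highest frequency)
freq <- table(scores)
as.numeric(names(freq)[freq == max(freq)])

quantile(scores, probs = c(.25, .50, .75))  # Q1, Q2, Q3 (default method, type = 7)

var(scores)           # sample variance, denominator n - 1
sd(scores)            # standard deviation, same units as the scores
diff(range(scores))   # range: Max - Min
IQR(scores)           # interquartile range: Q3 - Q1
```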
.footnote[🤔 The variance is in squared units.]

---

# 2.5 Shape

## Skewness

> Distribution of the data relative to the central point (mean). Measure of lack of symmetry.

`\(g_1=\frac{n^2M_3}{(n-1)(n-2)S'^3}~~~~~\text{where}~~~~~M_k=\frac{\sum \limits_{i=1}^n(X_i-\bar X)^k}{n}\)`

--

.center[  ]

---

# 2.5 Shape

## Skewness

``` r
# Simulate 10 million draws from a standard normal distribution
variable <- rnorm(n = 10000000, mean = 0, sd = 1)
*PerformanceAnalytics::skewness(variable, method = "sample")
```

--

```
*## [1] 0.0005335455
```

---

# 2.5 Shape

## Kurtosis

> Distribution of the data relative to the central point (mean). Measure of the relative peakedness of a distribution.

`\(g_2=\frac{n^2(n+1)M_4}{(n-1)(n-2)(n-3)S'^4}-3\frac{(n-1)^2}{(n-2)(n-3)}~~~~~\text{where}~~~~~M_k=\frac{\sum \limits_{i=1}^n(X_i-\bar X)^k}{n}\)`

--

.center[  ]

---

# 2.5 Shape

## Kurtosis

``` r
# Simulate 10 million draws from a standard normal distribution
variable <- rnorm(n = 10000000, mean = 0, sd = 1)
*PerformanceAnalytics::kurtosis(variable, method = "sample_excess")
```

--

```
*## [1] 0.0001213784
```

---

# 2.5 Shape

## Sensitivity

<br> <br> <br>

>An item is said to have high "sensitivity"<sup>🤔</sup> when it can discriminate between individuals who are structurally different. Skewness and kurtosis indicate the sensitivity of an item.

.footnote[ 🤔 Do not confuse psychometric "sensitivity" with diagnostic "sensitivity" (in biomedicine, sensitivity is the probability of detecting the disease when it is present; specificity is the probability of not detecting the disease when it is not present).]

---

# 2.5 Shape

## Sensitivity

.pull-left[
An item with high sensitivity has `\(sk(g_1) \approx 0\)` and `\(ku(g_2) \approx 0\)` (the normal distribution has `\(g_1=0\)` and `\(g_2=0\)`).

Values of `\(|g_1|\)` and `\(|g_2|\)` above 3 and 7, respectively, reveal severe sensitivity problems (Marôco, 2021)<sup>📖</sup>.

Ideal situation (`\(sk \approx 0\)` and `\(ku \approx 0\)`):
]

.pull-right[  ]

.footnote[ 📖 Marôco, J. (2021). _Análise de equações estruturais: Fundamentos teóricos, software & aplicações_ (3rd ed.). ReportNumber.
]

---

# Summary

The type of (non-)central tendency and dispersion statistics to use depends on the measurement scale:

<br>

| Measurement scale | Measures of central tendency | Measures of non-central tendency | Measures of dispersion |
|----------------|----------------|----------------|----------------|
| Qualitative:<br>Nominal | `\(Mode\)` | - | - |
| Qualitative:<br>Ordinal | `\(Mode\)`<br> `\(\tilde X\)` | `\(P_k,~Q\)` | `\(IQR\)` |
| Quantitative:<br>Interval<br>Ratio | `\(\bar X\)`<br> `\(\tilde X\)`<br> `\(Mode\)` | `\(P_k,~Q\)` | `\(R\)` <br> `\(IQR\)`<br> `\(S',~SEM,~CV\)` |

---

# 2.6 Measures of association

**_What is it for?_** Study the association between two variables...

--

Correlation means association, not causality. If one variable causes the other, the two variables will be correlated; the converse may not be true!

--

The appropriate measure of association (magnitude and direction) depends on the measurement scale of the variables.

---

# 2.6 Measures of association

>measure the strength of the association between variables

.pull-left[
**Covariance**: common variation of two quantitative variables<sup>⚠️</sup>.
]

--

.pull-right[
`\(\operatorname{Cov}({X_1},{X_2})=\frac{\sum\limits_{i=1}^{n}\left( X_{1i}-\bar{X}_1 \right) \left( X_{2i}-\bar{X}_2 \right)}{n-1}\)`
]

--

**Pearson correlation coefficient** `\((-1 \leq R \leq 1)\)`: measures the direction and strength of the .orange[linear association] between two quantitative variables...

--

`\(R_{X_1X_2} = \frac{S_{X_1X_2}}{S_{X_1}S_{X_2}} =\frac{\sum\limits_{i=1}^n (X_{1i}-\bar X_1)(X_{2i}-\bar X_2)}{\sqrt{\sum\limits_{i=1}^n (X_{1i}-\bar X_1)^2}~\sqrt{\sum\limits_{i=1}^n (X_{2i}-\bar X_2)^2}}\)`

--

**Correlation means association.**<sup>⚠️⚠️</sup>

.footnote[⚠️ `\(-\infty < \operatorname{Cov}(X_1, X_2) < +\infty\)` <br> ⚠️⚠️ Does not imply causality.]
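The relation between the two formulas can be checked numerically. The sketch below uses made-up paired data (`study_hours` and `exam_score` are hypothetical names and values) and assumes nothing beyond base R: dividing the covariance by the product of the two standard deviations reproduces the value returned by `cor()`.

```r
# Hypothetical paired observations (illustrative values only)
study_hours <- c(2, 4, 5, 7, 8, 10)
exam_score  <- c(52, 58, 61, 70, 74, 79)

# Covariance: depends on the units of both variables, so it is unbounded
cov_xy <- cov(study_hours, exam_score)

# Pearson correlation: covariance divided by the two standard deviations
r_manual <- cov_xy / (sd(study_hours) * sd(exam_score))

# Agrees with the built-in function
all.equal(r_manual, cor(study_hours, exam_score, method = "pearson"))
```

Because the standard deviations carry the units, the resulting coefficient is unit-free and bounded between -1 and 1.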
.plot-callout[
<img src="data:image/png;base64,#slides4of9_files/figure-html/large-plot-callout-1.png" width="100%" height="88%" />
]

---

# Pearson correlation

>**Pearson correlation coefficient** `\((-1 \leq R \leq 1)\)`: measures the direction and strength of the .orange[linear association] between two quantitative variables...

--

For two **quantitative** variables `\(X_1\)` and `\(X_2\)` that are .orange[linearly related]:

`\(R_{X_1X_2} = \frac{S_{X_1X_2}}{S_{X_1}S_{X_2}} =\frac{\sum\limits_{i=1}^n (X_{1i}-\bar X_1)(X_{2i}-\bar X_2)}{\sqrt{\sum\limits_{i=1}^n (X_{1i}-\bar X_1)^2}~\sqrt{\sum\limits_{i=1}^n (X_{2i}-\bar X_2)^2}}\)`

--

.orange[Associations exist in many forms, not necessarily linear…]<sup>⚠️</sup>

--

It should not be estimated without first checking the linearity of the bivariate distribution and/or the existence of outliers<sup>🤔</sup>.

--

.footnote[🤔 Outliers are observations that fall <br> outside the trend of the other observations. <br> ⚠️ Examples of non-linear relations ➡️ ]

.plot-callout[
<img src="data:image/png;base64,#slides4of9_files/figure-html/large-plot-callout3-1.png" width="100%" height="88%" />
]

---

# Pearson correlation

<center>
<img src="data:image/png;base64,#slides4of9_files/figure-html/large-plot-callout3-1.png" width="100%" height="88%" />
</center>

---

# Pearson correlation

Cohen (2016)<sup>📖</sup> suggests some reference values for the magnitude of the correlation `\(r\)` (i.e., its effect size).

| Magnitude `\(r\)` | Positive direction | Negative direction |
|----------------|----------------|----------------|
| Small | [.1; .3[ | ]-.3; -.1] |
| Moderate | [.3; .5[ | ]-.5; -.3] |
| Large | [.5; 1.0[ | ]-1.0; -.5] |

If `\(r = 0\)`, the correlation is considered _null_; if `\(r=1\)` or `\(r=-1\)`, the correlation is _perfect positive_ or _perfect negative_ (respectively).

--

.footnote[ 📖 Cohen, J. (2016). A power primer. In <br> A. E.
Kazdin (Ed.), _Methodological issues and <br> strategies in clinical research_ (pp. 279–284). <br> American Psychological Association. <br> https://doi.org/10.1037/14805-018 ]

--

.plot-callout[
<img src="data:image/png;base64,#slides4of9_files/figure-html/large-plot2-callout-1.png" width="100%" height="88%" />
]

---

# Pearson correlation

<img src="data:image/png;base64,#slides4of9_files/figure-html/large-plot2-full-output-1.png" width="80%" height="88%" />

Anscombe, F. J. (1973). Graphs in statistical analysis. _The American Statistician, 27_(1), 17–21. https://doi.org/10.1080/00031305.1973.10478966

---

# Pearson correlation

| Statistic | Value |
|----------------|----------------|
| `\(M_x\)` | `\(9\)` |
| `\(S'^2_x\)` | `\(11\)` |
| `\(M_y\)` | `\(7.5\)` |
| `\(S'^2_y\)` | `\(4.12\)` |
| `\(r\)` | `\(.816\)` |
| Regression line | `\(Y=3+0.5X\)` |

.footnote[ All the graphs share the same <br> statistics (shown in the table). Only one fulfills the assumption of a <br>linear bivariate relationship. ➡️]

--

.plot-callout[
<img src="data:image/png;base64,#slides4of9_files/figure-html/large-plot_Anscombe-callout-1.png" width="100%" height="88%" />
]

---

# Spearman correlation

**Spearman correlation coefficient**

Two variables `\(X_1\)` and `\(X_2\)` measured on an ordinal or higher measurement scale (interval or ratio).

Alternative to the Pearson correlation coefficient when the assumptions of normality and/or linearity are not met (i.e., a non-parametric measure of monotonic association computed on ranks).
`\(R_S =\frac{\sum\limits_{i=1}^n (r_{1i}-\bar r_1)(r_{2i}-\bar r_2)}{\sqrt{\sum\limits_{i=1}^n (r_{1i}-\bar r_1)^2}~\sqrt{\sum\limits_{i=1}^n (r_{2i}-\bar r_2)^2}}\)` with `\(-1\leq R_S\leq1\)`

Where:

- `\(r_{1i}\)` are the ranks of `\(X_{1i}\)` and `\(\bar r_1\)` is the mean of `\(r_{1i}\)`
- `\(r_{2i}\)` are the ranks of `\(X_{2i}\)` and `\(\bar r_2\)` is the mean of `\(r_{2i}\)`

---

# Cramér's V correlation

Two variables `\(X_1\)` and `\(X_2\)` measured on a qualitative scale (nominal or ordinal), where the data are *counts* arranged in a contingency table.

.pull-left[
<table class="table table-striped" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="1"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Variable A</div></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="4"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Variable B</div></th> <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th> </tr> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> 1 </th> <th style="text-align:left;"> 2 </th> <th style="text-align:left;"> ... </th> <th style="text-align:left;"> C </th> <th style="text-align:left;"> </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;"> 1 </td> <td style="text-align:left;"> \(O_{11}\) </td> <td style="text-align:left;"> \(O_{12}\) </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> \(O_{1C}\) </td> <td style="text-align:left;"> \(n_{1.}\) </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> 2 </td> <td style="text-align:left;"> \(O_{21}\) </td> <td style="text-align:left;"> \(O_{22}\) </td> <td style="text-align:left;"> ...
</td> <td style="text-align:left;"> \(O_{2C}\) </td> <td style="text-align:left;"> \(n_{2.}\) </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> R </td> <td style="text-align:left;"> \(O_{R1}\) </td> <td style="text-align:left;"> \(O_{R2}\) </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> \(O_{RC}\) </td> <td style="text-align:left;"> \(n_{R.}\) </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> </td> <td style="text-align:left;"> \(n_{.1}\) </td> <td style="text-align:left;"> \(n_{.2}\) </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> \(n_{.C}\) </td> <td style="text-align:left;"> \(N\) </td> </tr> </tbody> </table>
]

.pull-right[
<br>
`\(V=\sqrt{\frac{X^2}{N[\min(R,C)-1]}}\)` with `\(0 \leq V \leq 1\)`<sup>💡</sup>
<br>
`\(X^2= \sum\limits_{i=1}^R\sum\limits_{j=1}^C \frac{(O_{ij}-E_{ij})^2}{E_{ij}}\)`
<br>
`\(E_{ij}=\frac{n_{i.}\times n_{.j}}{N}\)`
]

.footnote[ 💡 When `\(X_1\)` and `\(X_2\)` are two dichotomous nominal variables (i.e.
`\(2 \times 2\)` tables) Cramér's V is equal to the coefficient `\(\phi\)` ] --- # Summary <br> <br> <br> | Measurement Scale of the two variables | Correlation Coefficient | |----------------|-----------------------------------------------------------------------| | Nominal (both dichotomous)| `\(\phi\)` | | Nominal | `\(Cramér's~V\)` | | Ordinal | `\(Spearman\)` | | Quantitative (with linear relationship between both) | `\(Pearson\)` | --- class: center, bottom, inverse # More info -- Slides created with the <svg viewBox="0 0 581 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#384CB7;"> [ comment ] <path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"></path></svg> package [`xaringan`](https://github.com/yihui/xaringan). 
-- <svg viewBox="0 0 512 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;"> <g label="icon" id="layer6" groupmode="layer"> <path id="path2" d="M 132.62426,316.69067 C 119.2805,301.94483 112.56962,274.5073 112.56962,234.39862 v -54.79191 c 0,-37.32217 -5.81677,-63.58084 -17.532347,-78.83466 -11.6757,-15.293118 -31.159702,-22.922596 -58.353466,-22.922596 -5.958581,0 -11.409226,0.22492 -16.45319,0.5917 -5.04455,0.427121 -9.742846,1.037046 -14.1564111,1.83092 V 95.057199 H 16.671281 c 12.325533,0 20.908335,3.82414 25.667559,11.532201 4.77973,7.74964 7.139712,25.48587 7.139712,53.14663 v 68.01321 c 0,42.12298 13.016861,74.19672 39.233939,96.16314 19.627549,16.47424 46.636229,27.23363 81.030059,32.40064 v -20.17708 c -16.3928,-4.27176 -29.04346,-10.51565 -37.11829,-19.44413 z m 246.75144,0 c 13.34377,-14.74584 20.05466,-42.18337 20.05466,-82.29205 v -54.79191 c 0,-37.32217 5.81673,-63.58084 17.53235,-78.83466 11.67568,-15.293118 31.15971,-22.922596 58.35348,-22.922596 5.95858,0 11.40922,0.22492 16.45315,0.5917 5.04457,0.427121 9.74287,1.037046 14.15645,1.83092 v 14.785125 h -10.59712 c -12.32549,0 -20.90826,3.82414 -25.66752,11.532201 -4.77974,7.74964 -7.13972,25.48587 -7.13972,53.14663 v 68.01321 c 0,42.12298 -13.01688,74.19672 -39.23394,96.16314 -19.6275,16.47424 -46.63622,27.23363 -81.03006,32.40064 v -20.17708 c 16.39279,-4.27176 29.04347,-10.51565 37.11827,-19.44413 z M 303.95857,87.165762 c 8.42049,-6.691524 25.52576,-10.536158 51.23486,-11.492333 V 63.999997 H 156.80716 v 11.673432 c 26.1755,0.956175 43.38268,4.800809 51.68248,11.492333 8.31852,6.73139 12.40691,20.033568 12.40691,39.904818 V 384.6851 c 0,20.80641 -4.08839,34.5146 -12.40691,41.02332 -8.2998,6.56905 -25.50698,10.10729 -51.68248,10.65744 V 448 h 197.71597 l 0.67087,-11.63414 c -25.50471,-0.54955 -42.56835,-4.35266 -51.07201,-11.40918 -8.4182,-6.95638 -12.73153,-20.44184 -12.73153,-40.27158 V 127.07058 c 0,-19.87125 
4.16983,-33.173428 12.56922,-39.904818 z" style="stroke-width:0.0753388"></path> </g></svg> + <svg viewBox="0 0 581 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#384CB7;"> [ comment ] <path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"></path></svg> = <svg viewBox="0 0 512 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;position:relative;display:inline-block;top:.1em;fill:red;"> [ comment ] <path d="M462.3 62.6C407.5 15.9 326 24.3 275.7 76.2L256 96.5l-19.7-20.3C186.1 24.3 104.5 15.9 49.7 62.6c-62.8 53.6-66.1 149.8-9.9 207.9l193.5 199.8c12.5 12.9 32.8 12.9 45.3 0l193.5-199.8c56.3-58.1 53-154.3-9.8-207.9z"></path></svg> -- <svg viewBox="0 0 581 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#384CB7;"> [ comment ] <path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"></path></svg> has infinite possibilities. 
-- Practice is the best strategy for learning. -- . -- _In God we trust, all others bring data_ -- Edwards Deming -- . -- . -- . -- THE END --- class: center, bottom, inverse 