Demystifying RStudio

Ronald Wesonga, Ph.D.


Motivation of RStudio

  • The development of statistical theory is argued to be a significant reason for the emergence of computing machines. Historically, Wilhelm Schickard (1592-1635) developed a calculating machine able to perform multiplication as well as division (Koetsier, 2001).
  • This was followed by Blaise Pascal (1623-1662), who developed a calculating machine able to add and subtract, mainly meant for accounting purposes (Loevinger, 1996).
  • Noticeably, around the same time, John Graunt (1620-1674) developed the mortality tables in 1662 (Greenwood, 1938), while William Petty (1623-1687) ventured into quantitative statistics (McCormick, 2009).
  • Meanwhile, Gottfried Achenwall (1719-1772) is popularly known as the father of statistics (John, 1883).
  • The development of both statistics and computing machines was not just a coincidence, but rather proof of the complementarity of the two disciplines (Gentle et al., 2012).
  • Ashurst (1995) noted that the 1880 Census, with about 50 million people, took over 7 years to tabulate, while the 1890 Census, with over 62 million people, took less than a year. This was because of Hermann Hollerith’s tabulating machine, which was developed for that census (Watnik, 2011).
  • Interestingly, in 1908, Gosset performed his now well-known Monte Carlo-type simulation which led to his discovery of the t-distribution using height and finger length measurements of criminals as approximately normal deviates. Writing as “Student” (Student, 1908), Gosset noted, “Before I had succeeded in solving my problem analytically, I had endeavoured to do so empirically.”
  • It is easy to note that the computer has revolutionized simulation and made the replication of Gosset’s experiment little more than an exercise, though, at the time, such a project was quite time-consuming since it was done by hand.

R language as a derivative of the S language

  • R evolved from the S language, which was unquestionably developed at Bell Laboratories. The S language, itself built on Fortran libraries (the Statistical Computing Subroutines), is said to have given birth to R.
  • Chambers (1977) documents the achievements and breakthroughs realized during his ten years at Bell Labs.
  • While the Statistics Research Department (SRD) was investigating Exploratory Data Analysis (Tukey, 1993), the Computer Science Research Department was developing the UNIX operating system and the C compiler (Ritchie and Thompson, 1978).
  • At the time, the Statistics Research Department had access to the C language, through the portable C compiler, and used it to develop a graphics facility.
  • Becker and Chambers (1977) wrote a subroutine library, GR-Z, to produce flexible, portable, device-independent graphics, whose parameters controlled graphical layout and rendering. The GR-Z system provided for graphic input as well as output and became a cornerstone for graphics in S.
  • It should be noted that S, like R:
    • is a collection of functions residing in separate single overlays;
    • is based on a formal grammar with as few restrictions as possible on subscripting and object naming;
    • facilitates portability, supporting two character sets (BCD and ASCII) and two modes of system use (batch and time-sharing);
    • is enabled with detailed online documentation;
    • was promoted with the X Window System developed by Scheifler and Gettys (1986).

The RStudio Environment

  • Is it simply a graphical user interface?
  • Is it an integrated development environment?

GUI:

  • A graphical user interface is a form of user interface that allows users to interact with electronic devices through graphical icons and audio indicators, instead of text-based user interfaces, typed command labels, or text navigation (Lawrence and Verzani, 2018).
  • Leeuw (2010) pointed out that the GUI revolution reached statistical software more or less with the arrival of the Macintosh. Around 1985, after the long dominance of the big three statistical packages (BMDP, SPSS, and SAS), a program with a truly innovative GUI, DataDesk (Velleman, 1997), was released.
  • R’s capability of linking to other languages (Ihaka and Gentleman, 1996) permitted the release of the R Commander (Fox, 2005), which provides a GUI for R.

IDE

  • An IDE, or integrated development environment, can be defined as software that combines, in one place, all the tools needed for developing a project.
  • On a more basic level, IDEs provide interfaces for users to write code, organize text groups, and automate programming redundancies. Rather than being a bare-bones code editor, an IDE combines the functionality of multiple programming processes into one (Allaire, 2012; Racine, 2012; Studio, 2018).

Why IDE?

  • The process of writing, creating, and testing projects requires that developers employ a variety of tools. Text editors, code libraries, bug tracking software, compilers, and test platforms are among the most common development tools.
  • An integrated development environment combines several of these development-related technologies into a single framework. When all utilities are represented on the same workbench, developers don’t have to spend hours learning how to operate them separately.
  • Most IDE capabilities, such as intelligent code completion and automatic code generation, are designed to save time by eliminating the need to write out complete character sequences. The integrated toolset aims to make project development easier while also detecting and reducing code errors and typos.
  • Some other popular IDE features assist developers in streamlining their workflow and problem-solving. IDEs parse code as it is written, identifying problems in real time. Most IDEs also include syntax highlighting, which employs visual cues to distinguish the grammar in the text editor.

Common Features of IDE

  • Text editor: Virtually every IDE will have a text editor designed to write and manipulate source code. Some tools may have visual components to drag and drop front-end components, but most have a simple interface highlighting language-specific syntax.
  • Debugger: Debugging tools assist users in identifying and remedying errors within source code. They often simulate real-world scenarios to test functionality and performance. Programmers and software engineers can usually test the various code segments and identify errors before the application is released.
  • Compiler: Compilers are components that translate programming language into a form machines can process, such as binary code. The machine code is analyzed to ensure its accuracy. The compiler then parses and optimizes the code to maximize performance.
  • Code completion: Code completion features assist programmers by intelligently identifying and inserting common code components. These features save developers time writing code and reduce the likelihood of typos and bugs.
  • Programming language support: IDEs are typically specific to a single programming language, though several also offer multi-language support. As such, the first step is to figure out which languages you will be coding in and narrow your prospective IDE list down accordingly. Examples include Ruby, Python, and Java IDE tools.
  • Integrations and plugins: With the name integrated development environment, it is no surprise that integrations need to be considered when looking at IDEs. Your IDE is your development portal, so incorporating all your other development tools will improve development workflows and productivity. Poor integrations can cause numerous issues and lead to many headaches.

The RMarkdown: Power of Dynamic Data Analytics

  • Replicability is one of the main components of science.
  • Statistics supports replicability through RStudio.
    • Data
    • Code
    • Inform

Reproducibility

Why:

  • Standard to judge scientific claims. It opens claims to scrutiny, allowing us to keep what works and discard what doesn’t. It requires the complete and open exchange of data, procedures, and materials. Scientific conclusions that are not replicable should be abandoned or modified.
  • Reproducibility enhances replicability. If other researchers are able to clearly understand how a finding was originally made, then they will be able to conduct comparable research in meaningful attempts to replicate the original findings.
  • Limits duplication & encourages cumulative knowledge development.
  • New knowledge and dependable analyses. Dynamic reproducible documents in particular can make changing things much easier. Changes made to one part of a research project have a way of cascading through the other parts. For example, if you change your data imputation or matching methods, you may need to rerun those models, update your main statistical analyses, and recreate the tables and graphs used to present the results. A dynamic document regenerates all of these automatically, making it easier for yourself as well as others to reproduce your research.

What is RMarkdown

  • The RMarkdown
  • The Source Editor
  • The Output Formats
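
A dynamic RMarkdown document weaves the data, the code, and the narrative into one source file that is re-rendered whenever anything changes. A minimal sketch follows; the file name report.Rmd and the chunk label are hypothetical, and the Galton data are assumed to come from the HistData package:

    ---
    title: "Galton's heights: a reproducible note"
    output: html_document
    ---

    Galton (1886) related the heights of adult children to their mid-parent heights.

    ```{r galton-fit}
    library(HistData)        # assumed source of the Galton data
    data(Galton)
    fit <- lm(child ~ parent, data = Galton)
    plot(child ~ parent, data = Galton)
    abline(fit)
    coef(fit)                # regression toward the mean: slope less than 1
    ```

The whole document, with all results refreshed, is rebuilt with rmarkdown::render("report.Rmd").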

Case studies

Galton’s data on the heights of parents and their children

Galton (1886) presented these data in a table, showing a cross-tabulation of 928 adult children born to 205 fathers and mothers, by their height and their mid-parent’s height. He visually smoothed the bivariate frequency distribution and showed that the contours formed concentric and similar ellipses, thus setting the stage for correlation, regression and the bivariate normal distribution.

Plot with regression line, data ellipses, and lowess smooth:

    library(HistData)   # the Galton data are assumed to come from the HistData package
    data(Galton)

    with(Galton, {
      # sunflower plot of child height against mid-parent height
      sunflowerplot(parent, child, xlim=c(62,74), ylim=c(62,74))
      # least-squares regression line
      reg <- lm(child ~ parent)
      abline(reg)
      # lowess smooth
      lines(lowess(parent, child), col="blue", lwd=2)
      # concentric data ellipses (requires the car package)
      if (require(car)) {
        dataEllipse(parent, child, xlim=c(62,74), ylim=c(62,74), plot.points=FALSE)
      }
    })

Halley’s Life Table

  • Heywood (1985), Seal (1980): In 1693 the famous English astronomer Edmond Halley studied the birth and death records of the city of Breslau, which had been transmitted to the Royal Society by Caspar Neumann.
  • He produced a life table showing the number of people surviving to each age from a cohort born in the same year.
  • He also used his table to compute the price of life annuities; a sketch of that calculation follows the plotting code below. The key quantity is the conditional probability of surviving one more year at age \(k\), \[\frac{P_{k+1}}{P_k}~~~\text{where}~~~P_{k+1}=P_k-D_k,\] with \(P_k\) the number surviving to age \(k\) and \(D_k\) the number dying during year \(k\).
  • This method had the great advantage of not requiring a general census but only knowledge of the number of births and deaths and of the age at which people died during a few years.
    library(HistData)   # the HalleyLifeTable data are assumed to come from the HistData package
    data(HalleyLifeTable)

    # plot survival vs. age
    par(mfrow=c(1,3))
    plot(number ~ age, data=HalleyLifeTable, type="h", ylab="Number surviving")
    # a population pyramid is the transpose of this plot
    plot(age ~ number, data=HalleyLifeTable, type="l", xlab="Number surviving")
    with(HalleyLifeTable, segments(0, age, number, age, lwd=2))
    # conditional probability of surviving one more year
    plot(ratio ~ age, data=HalleyLifeTable, ylab="Probability survive one more year")
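
The annuity pricing mentioned above can be sketched directly from the table. This is a minimal sketch, not Halley's own procedure: the annuity_value() helper is hypothetical, the column names follow the HistData data set, and the 6% interest rate is only illustrative:

    library(HistData)
    data(HalleyLifeTable)

    # expected present value of a life annuity of 1 per year, paid at the end of
    # each year survived, for a person currently aged x (hypothetical helper)
    annuity_value <- function(x, i = 0.06, table = HalleyLifeTable) {
      v  <- 1 / (1 + i)                      # annual discount factor
      Px <- table$number[table$age == x]     # survivors at the starting age
      k  <- which(table$age > x)             # all later ages in the table
      sum(v^(table$age[k] - x) * table$number[k] / Px)
    }

    annuity_value(30)   # e.g., an annuity purchased at age 30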

Macdonell’s Data on Height and Finger Length of Criminals, Gosset (1908)

  • In the second issue of Biometrika, Macdonell (1902) published an extensive paper, On Criminal Anthropometry and the Identification of Criminals, in which he included numerous tables of physical characteristics of about 3000 non-habitual male criminals serving their sentences in England and Wales.
  • His Table III (p. 216) recorded a bivariate frequency distribution of height by finger length.
  • His main purpose was to show that Scotland Yard could have indexed their material more efficiently, and find a given profile more quickly.
  • Gosset (1908) used these data in two classic papers in 1908, in which he derived various characteristics of the sampling distributions of the mean, standard deviation and Pearson’s r.
  • He said, “Before I had succeeded in solving my problem analytically, I had endeavoured to do so empirically.” Among his experiments, he randomly shuffled the 3000 observations from Macdonell’s table, and then grouped them into samples of size \(4, 8,\cdots\) calculating the sample means, standard deviations and correlations for each sample. (Hanley et al., 2008)
  • The original data form a bivariate frequency table of height by finger length (Macdonell’s Table III).
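
A minimal sketch of how the data can be examined in R, assuming the Macdonell frequency table (columns finger, height, frequency) and the expanded MacdonellDF data frame from the HistData package:

    library(HistData)
    data(Macdonell)      # frequency form: finger, height, frequency
    head(Macdonell)      # peek at the bivariate frequency table

    data(MacdonellDF)    # the same data expanded to individual records
    str(MacdonellDF)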

Naive contour plots and Galton smoothed 2-D of height and finger

  • Galton smoothed the 2-D frequencies before drawing the contour plot.
  • Working with a mathematician, he derived the iso-density contours of a bivariate Gaussian distribution, which are concentric ellipses.
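
A naive contour plot of the raw frequencies can be sketched as follows, assuming the frequency form of the Macdonell data in the HistData package (columns finger, height, frequency):

    library(HistData)
    data(Macdonell)

    # reshape the frequency column into a finger-by-height matrix
    freq <- xtabs(frequency ~ finger + height, data = Macdonell)

    # naive (unsmoothed) contour plot of the raw bivariate frequencies
    contour(x = as.numeric(rownames(freq)),
            y = as.numeric(colnames(freq)),
            z = unclass(freq),
            xlab = "Finger length", ylab = "Height")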

Bivariate kernel density estimate of height and finger
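
A bivariate kernel density estimate can be sketched with kde2d() from the MASS package, applied to the individual-level data (assuming the MacdonellDF data frame with columns finger and height):

    library(HistData)
    library(MASS)        # provides kde2d()
    data(MacdonellDF)

    # bivariate kernel density estimate of finger length and height
    dens <- kde2d(MacdonellDF$finger, MacdonellDF$height, n = 50)
    contour(dens, xlab = "Finger length", ylab = "Height")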

Sampling distributions of the sample sd \(s\) and \(z=(\bar{y}-\mu)/s\)

  • Note that Gosset used a divisor of n (not n-1) to get the sd.
  • He also used Sheppard’s correction for the ‘binning’ or grouping.
  • With concatenated height measurements
  • 750 samples of size n=4 (as Gosset did)
## [1] 5.4196250 0.2131819
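
A minimal sketch of the resampling experiment described above (the output line presumably shows the mean and standard deviation of the pooled heights, in feet). The sketch assumes the individual-level heights in HistData::MacdonellDF and omits Sheppard's grouping correction:

    library(HistData)
    data(MacdonellDF)

    set.seed(1908)
    heights <- MacdonellDF$height
    mu <- mean(heights)                # treat the 3000 heights as the population

    n <- 4                             # sample size, as Gosset used
    B <- 750                           # number of samples
    samples <- replicate(B, sample(heights, n))

    ybar <- colMeans(samples)
    # sample sd with divisor n (not n-1), as Gosset did
    s <- apply(samples, 2, function(y) sqrt(mean((y - mean(y))^2)))
    z <- (ybar - mu) / s               # Gosset's z statistic

    par(mfrow=c(1,2))
    hist(s, main="Sampling distribution of s", xlab="s")
    hist(z, main="Sampling distribution of z", xlab="z")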

Conclusion

Is the future of statistical programming now?

How do we embrace the RStudio Integrated Development Environment?

Is RStudio really an enabler of Data Analytics?

If S is to R, then R will be to - ?

What if Gosset had R, where would statistics be?

References

Allaire, J., 2012. RStudio: Integrated development environment for R. Boston, MA 770, 165–171.
Ashurst, D., 1995. St. Mary’s Church, Worsbrough, South Yorkshire: A review of the accuracy of a parish register. Local Population Studies 55, 46–57.
Becker, R.A., Chambers, J.M., 1977. GR-Z: A system of graphical subroutines for data analysis, in: Proc. 10th Interface Symp. on Statistics and Computing. pp. 409–415.
Chambers, A., 1977. The reader in the book: Notes from work in progress. Signal 23, 64.
Fox, J., 2005. The R Commander: A basic-statistics graphical user interface to R. Journal of Statistical Software 14, 1–42.
Galton, F., 1886. I. Family likeness in stature. Proceedings of the Royal Society of London 40, 42–73.
Gentle, J.E., Härdle, W.K., Mori, Y., 2012. How computational statistics became the backbone of modern data science, in: Handbook of Computational Statistics. Springer, pp. 3–16.
Gosset, W., 1908. The probable error of a mean. Biometrika 6, 1–25.
Greenwood, M., 1938. The first life table. Notes and Records of the Royal Society of London 1, 70–72.
Grier, D.A., 1991. Statistics and the introduction of digital computers. Chance 4, 30–36.
Hanley, J.A., Julien, M., Moodie, E.E.M., 2008. Student’s z, t, and s: What if gosset had r? The American Statistician 62, 64–69.
Heywood, G., 1985. Edmond halley: Astronomer and actuary. Journal of the Institute of Actuaries 112, 279–301.
Ihaka, R., Gentleman, R., 1996. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics 5, 299–314.
John, V., 1883. The term “statistics”. Journal of the Statistical Society of London 46, 656–679.
Koetsier, T., 2001. On the prehistory of programmable machines: Musical automata, looms, calculators. Mechanism and Machine theory 36, 589–603.
Lawrence, M.F., Verzani, J., 2018. Programming graphical user interfaces in R. Chapman & Hall/CRC.
Leeuw, J.D., 2010. Statistical software-overview.
Loevinger, L., 1996. The invention and future of the computer. Interdisciplinary Science Reviews 21, 221–234.
Macdonell, W., 1902. On criminal anthropometry and the identification of criminals. Biometrika 1, 177–227.
McCormick, T., 2009. William petty: And the ambitions of political arithmetic. Oxford University Press.
Racine, J.S., 2012. RStudio: A platform-independent IDE for R and Sweave.
Ritchie, D.M., Thompson, K., 1978. The UNIX time-sharing system. Bell System Technical Journal 57, 1905–1929.
Scheifler, R.W., Gettys, J., 1986. The X Window System. ACM Transactions on Graphics (TOG) 5, 79–109.
Seal, H., 1980. Early uses of graunt’s life table. Journal of the Institute of Actuaries 107, 507–511.
Student, 1908. The probable error of a mean. Biometrika 6, 1–25.
Studio, R., 2018. Integrated development environment. Boston, MA: RStudio, Inc.
Team, R.C., others, 2013. R: A language and environment for statistical computing.
Tukey, J.W., 1972. Data analysis, computation and mathematics. Quarterly of Applied Mathematics 30, 51–65.
Tukey, J.W., 1993. Exploratory data analysis: Past, present and future. PRINCETON UNIV NJ DEPT OF STATISTICS.
Valero-Mora, P.M., Ledesma, R., 2012. Graphical user interfaces for R. Journal of Statistical Software 49, 1–8.
Velleman, P.F., 1997. Learning data analysis with DataDesk student version 5.0. Addison-Wesley Longman Publishing Co., Inc.
Watnik, M., 2011. Early computational statistics. Journal of Computational and Graphical Statistics 20, 811–817.