---
title: "Utilization of Statistical Programming Software Among Researchers: Patterns, Determinants, and Impact on Research Quality"
subtitle: "Chapter 2: Literature Review"
output: 
  word_document:
    reference_docx: null 
fontsize: 12pt
linestretch: 2.0
---

# Chapter 2: Literature Review

## 2.1 Introduction
The landscape of scientific inquiry has been fundamentally transformed by the integration of computational tools and statistical software. Historically, data analysis was a manual, labor-intensive process, but the emergence of specialized software has allowed researchers to handle complex datasets with greater speed, accuracy, and repeatability (Samal, 2013). According to Masuadi et al. (2021), these advancements have not only simplified complex calculations but have also enhanced the visual representation of findings through sophisticated charts and graphs, making empirical research more accessible to the global scientific community.

However, the proliferation of various statistical tools has created a paradoxical challenge: the abundance of choice often leads to confusion and uncertainty regarding which tool is most appropriate for a specific study design. Research suggests that while software usage is ubiquitous, the choice of a particular package is rarely neutral; it is often influenced by a researcher’s background, institutional mandates, and the specific requirements of the chosen methodology (Dembe et al., 2011). This chapter reviews the current state of knowledge regarding software utilization, exploring the determinants that drive preferences and the impact of these choices on research quality.

## 2.2 Review of Knowledge Among Researchers
A significant theme in the literature is the disparity between the high frequency of software use and the low level of formal training among researchers. Momcheva and Tollerud (2015) conducted an informal survey of the astronomical community and found that while all participants utilized software and 90% wrote their own code, only 8% reported receiving substantial formal training in software development. This suggests that the "knowledge" possessed by many researchers is largely self-taught, which may have implications for code readability, reusability, and the overall robustness of published methods.

Furthermore, Samal (2013) warns of a dangerous misconception: that a large software package can substitute for a deep understanding of statistical theory. The widespread availability of graphical user interface (GUI) tools can lead to the inappropriate application of statistical tests if the researcher lacks the conceptual knowledge to interpret the software's output correctly. Thus, knowledge among researchers must be viewed as a two-fold construct: the technical ability to operate the software and the theoretical ability to conduct valid numerical reasoning.

## 2.3 Review of Preference Among Researchers
Researchers' preferences regarding statistical software are often split between user-friendly interfaces and the flexibility of high-level programming. Orhani (2024) notes that SPSS (Statistical Package for the Social Sciences) remains the preferred tool for many social science researchers due to its intuitive interface and standardized results. Conversely, programming languages like R and Python have gained significant popularity among researchers who require greater flexibility, automation, and the ability to integrate with other complex systems (Raichal, 2024, as cited in Orhani, 2024).

In the field of ecology, the preference for R has increased dramatically since 2008. Gao et al. (2025) analyzed over 125,000 articles and found that R utilization grew from 10.3% in 2008 to nearly 67% in 2023. This preference is driven by the extensive library of specialized packages, such as 'lme4' for mixed-effects models, which allow ecologists to solve specific scientific problems that standard commercial software may not easily accommodate. These findings highlight that preference is often a function of the specialized needs of the researcher's field.

## 2.4 Review of Accessibility of Software
The accessibility of statistical tools is often defined by the cost of licensing versus the availability of open-source alternatives. Historically, commercial packages like SAS and SPSS dominated the market, but their high costs often limited access to well-funded institutions (Dembe et al., 2011). The rise of open-access publishing and open-source programming has democratized data analysis, allowing researchers worldwide to access powerful tools like R and Python without the burden of expensive licensing fees (Masuadi et al., 2021).

However, accessibility is not merely a matter of financial cost; it also involves "functional accessibility," that is, how steep the learning curve is. While Excel is widely accessible and familiar to most users, it is often limited in its ability to perform complex multivariate analyses or handle voluminous datasets (Orhani, 2024). Samal (2013) notes that for junior investigators, the barrier to entry for more advanced software like SAS or Stata remains high due to the requirement for specific computational skills, even if the software itself is technically available within their institution.

## 2.5 Review of Institutional Preferences
Institutional mandates and departmental traditions play a critical role in shaping software utilization patterns. Dembe et al. (2011) point out that in the United States, institutions often standardize their curricula around specific software like Stata or SAS to ensure a consistent training environment for research trainees. Because many health services researchers are trained on these specific platforms during their doctoral programs, they tend to carry these institutional preferences into their professional careers, creating a cycle of software "path dependency."

Furthermore, the choice of institutional software is often driven by the perceived reliability and documentation standards of commercial products. Institutions may prefer SAS because it is considered a "standard" in regulated industries like pharmaceuticals, whereas academic institutions focused on "open science" may push for the adoption of R to enhance transparency and code sharing (Gao et al., 2025). The literature suggests that the institutional environment provides the infrastructure (licenses, support staff, and peer networks) that largely determines which tools a researcher can effectively employ.

## 2.6 Continental and Geographical Preferences
Geography is a prominent determinant of the choice of statistical software. Dembe et al. (2011) reported a marked "US versus non-US" divide in software usage. Their study revealed that Stata was used in 49.5% of US-authored articles but in only 14.8% of articles authored by researchers from outside the US. This suggests that while Stata has a strong foothold in North American health services research, it has not achieved the same level of global penetration as more traditional tools like SAS.
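The size of this divide can be made concrete with simple arithmetic on the two percentages reported above. The short Python sketch below is illustrative only: the variable names are hypothetical, and it assumes nothing beyond the two cited figures.

```python
# Stata usage shares as quoted above from Dembe et al. (2011).
us_share = 49.5       # % of US-authored articles using Stata
non_us_share = 14.8   # % of non-US-authored articles using Stata

# Absolute gap (percentage points) and relative ratio between the two groups.
gap_points = us_share - non_us_share
ratio = us_share / non_us_share

print(f"Gap: {gap_points:.1f} percentage points")
print(f"Ratio: {ratio:.2f}x")
```

On these figures, Stata appeared roughly 3.3 times as often in US-authored articles, a gap of about 35 percentage points.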

Similarly, Masuadi et al. (2021) observed that SPSS maintains a dominant position in various international contexts, particularly in health sciences journals in countries such as Saudi Arabia and Pakistan. This geographical preference may be attributed to vendors' differing marketing strategies across parts of the world, as well as the regional availability of training workshops. These patterns indicate that the global scientific community is not a monolith, but rather a collection of regional ecosystems with distinct software "cultures."

## 2.7 Review of Statistical Methods Used
The selection of statistical software is often closely linked to the specific study design and methods employed by the researcher. Masuadi et al. (2021) found that SPSS was most commonly associated with observational (61.1%) and experimental (65.3%) study designs. In contrast, researchers conducting systematic reviews and meta-analyses most often used Review Manager (43.7%) or Stata (38.3%), tools well suited to pooling effect sizes and creating forest plots.
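To illustrate what "pooling effect sizes" involves, the following Python sketch computes a fixed-effect, inverse-variance pooled estimate. It is a minimal sketch under stated assumptions: the three studies and their values are hypothetical, invented for illustration, and the code is not a reproduction of how Review Manager or Stata implement meta-analysis.

```python
import math

# Hypothetical (effect size, standard error) pairs for three studies.
# These numbers are invented for illustration and come from no cited paper.
studies = [(0.30, 0.10), (0.50, 0.20), (0.10, 0.20)]

# Inverse-variance weights: more precise studies receive more weight.
weights = [1.0 / se**2 for _, se in studies]

# Fixed-effect pooled estimate and its standard error.
pooled = sum(w * es for w, (es, _) in zip(weights, studies)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))

# Approximate 95% confidence interval (normal approximation).
ci_low = pooled - 1.96 * pooled_se
ci_high = pooled + 1.96 * pooled_se

print(f"Pooled effect: {pooled:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```

A forest plot displays each study's estimate and interval alongside this pooled value. A random-effects model, which adds a between-study variance term to each weight, is a common alternative when studies are heterogeneous.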

Advancements in statistical methods also drive software adoption. For example, the increasing complexity of ecological data has necessitated the use of mixed-effects models, leading to the widespread adoption of the R package 'lme4' (Gao et al., 2025). As researchers move away from simple descriptive statistics toward more sophisticated inferential tests and predictive modeling, they are pushed to transition from basic tools like Excel toward advanced programming environments that can accommodate non-standard methods of data analysis (Orhani, 2024; Samal, 2013).

## 2.8 Research Gap
While the existing literature effectively documents the trends and frequencies of software usage, there is a notable research gap regarding the direct link between software choice and the *quality* of research outcomes. Most studies, such as Masuadi et al. (2021) and Dembe et al. (2011), are bibliometric or descriptive in nature. There is a lack of empirical research investigating whether the "underlying estimation methods" of different software packages actually lead to significantly different conclusions in peer-reviewed literature. Furthermore, few studies explore how the lack of formal training (as identified by Momcheva & Tollerud, 2015) impacts the rate of statistical errors in published research, creating a need for studies that combine software usage patterns with an audit of statistical accuracy.
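One way such differences can arise is through default conventions that vary between tools. The example is offered here as an illustration, not as a finding from the cited studies. The Python sketch below uses a hypothetical sample to contrast the two common variance denominators: R's `var()` divides by n - 1, whereas NumPy's `numpy.var` divides by n unless `ddof=1` is passed.

```python
import statistics

# Hypothetical sample, invented purely for illustration.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Sample variance divides by n - 1 (the convention of R's var()).
sample_var = statistics.variance(data)

# Population variance divides by n (the default of numpy.var, ddof=0).
population_var = statistics.pvariance(data)

print(f"n - 1 denominator: {sample_var:.3f}")
print(f"n denominator:     {population_var:.3f}")
```

With only eight observations the two results differ noticeably, and the discrepancy shrinks as the sample grows. An audit of statistical accuracy would need to record which convention each analysis used.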

## 2.9 Research Theory
This study is grounded in the **Technology Acceptance Model (TAM)**, which posits that "Perceived Usefulness" (PU) and "Perceived Ease of Use" (PEOU) are the primary determinants of technology adoption. In the context of researchers, PEOU explains the continued dominance of GUI-based software like SPSS (Orhani, 2024), while PU explains the rapid growth of R among ecologists who require specialized analytical power (Gao et al., 2025). Additionally, the study incorporates **Diffusion of Innovations Theory**, which helps explain how software "innovators" (those with substantial training) influence "late adopters" (the 49% with "little" training identified by Momcheva & Tollerud, 2015) through institutional and peer networks.

## 2.10 Conclusions and Remarks
In conclusion, the utilization of statistical software is a complex phenomenon driven by a nexus of personal knowledge, field-specific preferences, and institutional or geographical constraints. The literature indicates a shift toward open-source platforms like R and Python, documented most clearly in ecology, yet this transition is hampered by a significant gap in formal training and a lingering reliance on user-friendly but less flexible tools like SPSS and Excel.

Ultimately, the choice of software is not a mere technicality but a methodological decision that affects the repeatability and transparency of research. As scientific inquiry enters the era of "Big Data," the ability not only to use statistical software but also to understand and script the underlying analyses will become a defining characteristic of high-quality research. This review emphasizes that, moving forward, the research community must prioritize formal computational training to ensure that the tools of the digital age are used accurately and to their full potential.

# References
Dembe, A. E., Partridge, J. S., & Geist, L. C. (2011). Statistical software applications used in health services research: Analysis of published studies in the U.S. *BMC Health Services Research, 11*, 252. https://doi.org/10.1186/1472-6963-11-252

Gao, M., Ye, Y., Zheng, Y., & Lai, J. (2025). A comprehensive analysis of R’s application in ecological research from 2008 to 2023. *Journal of Plant Ecology, 18*(rtaf010). https://doi.org/10.1093/jpe/rtaf010

Masuadi, E., Mohamud, M., Almutairi, M., Alsunaidi, A., Alswayed, A. K., & Aldhafeeri, O. F. (2021). Trends in the usage of statistical software and their associated study designs in health sciences research: A bibliometric analysis. *Cureus, 13*(1), e12639. https://doi.org/10.7759/cureus.12639

Momcheva, I., & Tollerud, E. (2015). Software use in astronomy: An informal survey. *arXiv preprint arXiv:1507.03989*.

Orhani, S. (2024). Comparative analysis of statistical results generated by Python, R, SPSS, and Excel. *International Journal of Progressive Research in Engineering Management and Science (IJPREMS), 5*(10), 265-277. https://doi.org/10.58257/IJPREMS44131

Samal, J. (2013). An introduction to computer aided health science research. *Scholars Journal of Applied Medical Sciences (SJAMS), 1*(4), 265-268.