Abstract
In Bio 240 we use the R programming language and environment for the graphical visualization and statistical analysis of biological data. A computer is required (Windows or Mac, but not Chrome OS, which does not run R). You will learn how to write programs in base R that deal with data (store, read, edit, manipulate, graph, explore, analyze, model). Bio 240 is based on R and requires you to learn to use R in the analysis of data.
Bio 240 Learning Goal: Data-driven ability to use graphical visualization and statistical reasoning in Biology, by custom computer programming in base R.
R is global in reach, globally crowd-sourced, multilingual, revised every 6 months, and remains free and open source (https://www.r-project.org). R was written by statisticians for statisticians and can be counted on to be statistically correct. All commercial statistical applications (e.g., SPSS, SAS, JMP, Stata, MATLAB, Mathematica, GraphPad), and especially spreadsheets like Excel, are simply not competitive with R in depth, breadth, innovation and cutting-edge features. With practice, R is not difficult to use.
This self-contained HTML file was specified by, and rendered from, a .Rmd file, an R Markdown file, which is a plain text file containing R code and prose (the .Rmd source file is downloadable via the Code button at the top right of this page). R Markdown files are easy to read and relatively easy to write, given a few days of practice. Reproducible Research is this melding of writing, data, code, graphs and results within the same document. It’s the wave of the future in Science because reproducibility of results, given the prose, computer program and data, is paramount. Due to time constraints, Reproducible Research is not part of Bio 240, but there will be optional exposure to it.
Study Sections 1-12 for a traditional
Syllabus.
Study Sections 13 & 14 for R
coding, relevant to some graphical and statistical methods we cover in
Bio 240.
Study Section 15 on word clouds, on ugly code versus clean, well-structured, readable code, on career development advice, and on recovering R Session information.
Click on a Section just below to go to it, or simply
scroll down this document.
P R E A M B L E
Bio 240 Biostatistics is a required course for Biology Majors in the BS
degree program at Lehman College of the City University of New York.
Nationwide, we recognize that skills in the statistical analysis and graphical visualization of biological data are essential. Biostatistics is required for biology majors at leading colleges and universities across the USA.
Students -- if you've not had a statistics course or a course in computer
programming, then get over it!
In Bio 240 if you show up, show up on time, pay attention, take notes, work
hard and don't cheat, then you'll acquire deep skills in Biostatistics and in
R programming. If you have had a course in statistics and/or in programming
then you'll quickly learn R. Of course, all statistical and programming
environments have a learning curve.
Bio 240 provides the opportunity for you to exit the course functioning in R coding for Biostatistics at the Biology graduate school level, and I have case studies to prove it.
Last rendered: Thursday August 24, 2023 at
19:25:37
__________________
R
data
graphs
statistics
writing about the results
email: You already have my College email
address. In Bio 240 we make frequent use of email for file transfers, Q
& A, Tests and general communication.
Zoom: Twenty minutes before each class, I
email the Zoom meeting invitation with URL link.
Zoom
sessions occupy the 4 hour class time for the section of Bio 240 that
enjoys your formal registration.
Twitter: https://twitter.com/LC_biostat 2,100+ Tweets
dedicated to R, Data, Graphs, Biostatistics, Bio 240, Science, Career
Development, and related topics. In the transition from Twitter to X,
glitches remain.
RPubs: RPubs is a free website at https://rpubs.com and it’s
handy for publishing R Markdown documents from within
the RStudio IDE. RPubs enables the public to view and to comment on
documents but not to download. However, the foo.Rmd source code of a document may be downloaded if the author allowed it in the document’s YAML metadata. Go to https://rpubs.com/Bio240_Biostat to see my documents published there, so far. View the Syllabus at https://rpubs.com/Bio240_Biostat/Syllabus
At https://rpubs.com ,
see the most recently published documents, worldwide. Many of these
(HTML, PDF, Word, slides, etc.) are student homework assignments! Entire
courses are posted by faculty worldwide, in many languages. Indeed, I may
go that route for Bio 240, as my skills improve in HTML5, CSS3, SVG and
JavaScript.
Blackboard: Blackboard has very limited
use in Bio 240. The Syllabus is posted there and I may use it to send
Announcements. The 3 Test grades will be posted there but realize that I
only use Blackboard to post your raw Test grades, not
to tabulate averages or anything else.
phone: currently not available; please use email.
__________________
We are not meeting in person for Bio 240, this term. Because I am Zoom cam shy, this photo is of yours truly 9 years ago, with a very alive lobster on the Bay of Fundy off Grand Manan Island in the Province of New Brunswick, Canada.
This lobster’s job was to impress tourists, not to be eaten.
It
was returned to the ocean when we departed.
The Bay of Fundy has the highest tides in the world – 45 vertical feet where photo was taken. https://en.wikipedia.org/wiki/Bay_of_Fundy
Obviously I regret not having the crew remove the rubber bands before my wife took the photo, but I might have lost a finger as even its smaller, abdominal chelae were biting fiercely in revolt from being plucked from the sea.
Bay of Fundy, 2014
Question: Is Bio 240 a course in
statistics or in computer
programming?
Answer: It’s both, because increasingly, the
scientist, teacher, physician, investigator — must be able to do
both.
__________________
The Undergraduate Bulletin is your contract
with the College:
http://lehman.smartcatalogiq.com/2019-2021/Undergraduate-Bulletin
; it has the rules on degrees, majors, minors, and more.
College policy on Academic Integrity:
http://lehman.smartcatalogiq.com/en/2019-2021/Undergraduate-Bulletin/Academic-Services-and-Policies/Academic-Integrity
View: http://lehman.smartcatalogiq.com/2019-2021/Undergraduate-Bulletin/Academic-Services-and-Policies/Undergraduate-Grading-Systems-and-Related-Policies/Grade-Appeals
for Grade Appeals.
General Grade Appeals
Students not satisfied with a grade received in a course should first consult the instructor involved. It is the instructor's sole judgement that determines the grade recorded in the Office of the Registrar. The instructor's first judgement is always taken. Second and later judgements resulting from personal appeals and hardship claims are rarely honored. Occasional errors do occur, and these are always corrected promptly when properly certified by the instructor to the Office of Academic Standards.
Department of Biological Sciences: Grade Appeals. When a student considers a grade unjust, the student should first confer with the instructor. Students are required to initiate grade appeals before the end of the eighth week of the semester following the entry of a permanent grade. If a student is not satisfied with the outcome, the student may appeal in writing to the department Chair.
1. The Chair will appoint a review committee of at least two department members who, with the Chair, will review the appeal. The committee will examine all materials relevant to the appeal, submitted by both the instructor and the student, and will prepare a written report of its findings, either sustaining the original grade or recommending a change.
2. The Chair will notify the student, the instructor, and the Office of Academic Standards of the committee's decision. If the committee recommends a grade change, the Chair will initiate a Grade correction.
3. The decision of the review committee will be binding on both parties.
Summer 2023 Depart. of Biological Sciences, Lehman College CUNY
General Exemptions. All students are required to review the course schedule and record important dates for exams, quizzes, and labs on the Syllabus. For a death in the family or emergency travel, accommodations that are feasible will be provided ONLY with dated, written notification of the event.
Religious Exemptions. Accommodations must be requested from the Division of Student Affairs (Shuster Hall, Room 204) or the course instructor. To allow a timely resolution, it is strongly recommended that requests be submitted within the first week of the semester.
Long-term Health-Related Exemptions. If you encounter a health-related issue during the semester that may cause you to miss more than two classes in a row, contact your general Academic Advisor to discuss your options for withdrawing from the course or remaining in the course, and be prepared to provide official medical documentation.
Special-Needs Exemptions. Accommodations must be requested through the Office of Student Disability Services (Shuster Hall, Room 238) and approved by the instructor within the first two weeks of the semester.
Note 1: The Student Health Center at Lehman College provides health-care services to all Lehman students free of charge or for a small fee. The Counseling Center at Lehman is available to assist students with emotional, developmental, and psychological concerns. In New York City, some health-care services are also available to all residents (go to site for: NYC CARE).
Note 2: Students are responsible for checking their Lehman email, where important information is sent by the instructor and the College. We will not be able to accommodate you if you missed an email from the College with information that affects your class attendance or performance. If you have difficulty accessing your Lehman email account, follow the instructions online. For help, contact IT Services for Students as soon as possible or email them at help.desk@lehman.cuny.edu
__________________
A computer (Win or Mac, but not Chrome OS because it will not run R) is REQUIRED for this synchronous, online section of Bio 240.
If we were meeting in Davis 225, there would be Windows desktop terminals available, but most students still prefer their less restrictive personal laptop, which also lets you take R and all the software and data with you after the semester is over. In Davis 225 you would need your Lehman email account for wireless access, needed for downloading R and for downloading R packages across the semester.
We will download to each machine the free and open-source, R language and system for statistical analysis and graphics (https://www.r-project.org). I will demonstrate.
R was created by statisticians for statisticians; therefore, R is statistically correct and can be trusted to be so. However, R is now used by, and contributed to by, a much wider range of folks spanning all scholarly disciplines, including even literature, audio, art, video and Wall Street.
R is becoming the statistical and graphical, computer language of choice for scientists worldwide, with Python, JavaScript and Julia (MIT) as worthy, free and open source competitors.
There are R packages that enable code written in native C, C++, Python, Julia or other languages to run within R. An R “package” may contain any number of new functions, data sets, demos, documentation, links and other resources.
Because R was created for data, graphs and statistical computing, it is not a general-purpose programming language. Indeed, some use R just for making graphs, and this is a good place to start learning R, as we shall do in Bio 240.
Explore the R website, https://www.r-project.org to learn about the immense, R environment. Check out the Task Views, The R Journal (https://journal.r-project.org/ peer reviewed, open access), free R Manuals, FAQs, Help/Search engines, annual Conferences, related projects (R Blog, R-Forge) and affiliated organizations such as Bioconductor https://www.bioconductor.org/ for bioinformatics.
Notice the rich, global and multilingual nature of the R environment.
You may follow R on Twitter, https://twitter.com/_r_foundation
During class I will demonstrate on a Windows OS laptop, the installation of R and the installation of contributed packages from among the 19,000+ packages that are vetted and available in the latest release, R version 4.2.2 (2022-10-31 ucrt).
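As a preview of that demonstration, the install-and-load workflow is just a couple of function calls (the package psych is used here only as an example; any CRAN package name works the same way):
install.packages("psych")     # one-time download from CRAN; needs an internet connection
library(psych)                # load the installed package into the current R session
library(help = "psych")       # index of the functions, data sets and vignettes it provides
# ?describe                   # HTML help page for a single function from the package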
A (real) computer is required – Win or Mac, but not Chrome OS because it
can’t run R.
The College has a laptop loaner program. To apply,
visit: https://lehman.edu/coronavirus/student.php
and
click: Student Device Loan Application.
R logo
For Spring 2023 my sections of Bio 240 remain 100% online synchronous, hosted by Zoom meetings during the regularly scheduled class period that enjoys your formal registration.
Twenty minutes before each class, I email the Zoom URL link.
Bio 240 makes limited use of Blackboard (Syllabus, posting of Test grades). I will also email files to you, regularly.
Distance learning material includes my lectures and demonstrations, polls, handouts, data files, graphs, R Editor files, PDF files of readings, URL links, videos, pop quizzes and practice Tests.
We meet online in Zoom during the regularly scheduled class times for Bio 240 (Friday 2pm - 5:50pm; Saturday 10am - 1:50pm).
I begin each class exactly on time.
Top students show up, show up on time, pay attention, take notes, work hard and don’t cheat – thus moving their career development forward.
In class, you will participate in
active learning by collaborative coding in R
The key to learning is to practice over and over on the computer, dealing with data, graphs and statistical methods in R, combined with a willingness to engage R by trial and error and in real time, typing code at your computer along with me during class as we write and debug code together.
A computer program is merely a text document that one has typed! All computer programs, whether small or large, reside in a constant state of revision.
You must grapple with the subject matter and work hard during class and outside of class.
There are no term papers and no class presentations but you must spend considerable time working outside of class practicing data analysis and graphics by calling functions in R and constructing R Editor files (programs) for the practice problems. Otherwise, you will do poorly.
Your extended attention is required in Bio 240 just as it’s required in biological research and just as it’s required in your professional career development.
“The process of preparing programs for a digital computer is especially attractive not only because it can be economically and scientifically rewarding, but also because it can be an aesthetic experience much like composing poetry or music”
— Donald E. Knuth. 1973. The Art of Computer Programming, 2nd ed, Vol 1, Fundamental Algorithms. Professor Emeritus, Computer Science Department, Stanford University, CA
born 1938
In Bio 240, I try to convince you that writing programs in R for dealing with biological data, “…can be an aesthetic experience much like composing poetry or music.”
__________________
By the above formula, the average of the 3 formal Tests is 100% of the final grade.
Although not graded, data assignments, practice Tests, Zoom polls and other activities reflect material to keep you moving forward in the course.
The number of Tests, even the formula itself may change, reflecting internal as well as external factors that may be beyond our control during the pandemic.
There is no extra credit work in Bio 240, beyond the +2 to +10 points of extra credit questions that are at the end of each Test, usually. Therefore, please do not ask for extra credit work.
Tests are form fill PDF files that are downloaded to your machine, engaged, saved and then emailed back to me.
You must be able to handle form fill PDF files else you must drop Bio 240. This will be addressed on the first day of class.
While many apps, even browsers, can partially deal with form fill PDF files, the best one is Acrobat Reader DC which is free and by Adobe, the company that invented PDF: https://get.adobe.com/reader/
Acrobat Reader DC works on all operating systems; it’s what I use. Download it.
Tests are strictly timed at 4 hours and are not cumulative. 😀
Final average     Letter grade    Details
>= 90             A-, A           90 <= A- < 93;  93 <= A < 100
>= 80 & < 90      B-, B, B+       80 <= B- < 83;  83 <= B < 85;  85 <= B+ < 90
>= 70 & < 80      C-, C, C+       70 <= C- < 73;  73 <= C < 75;  75 <= C+ < 80
>= 60 & < 70      D, D+           60 <= D < 65;   65 <= D+ < 70
< 60              F               F < 60
How to succeed in Bio 240 Biostatistics
❶ Show up
❷ Show up on time
❸ Pay attention
❹ Take notes
❺ Work hard
❻ Ask questions
❼ Be honest, don't cheat
❽ Practice, practice, practice
__________________
Verzani, John (2014) Using R for Introductory Statistics, 2nd ed. CRC Press, NY. 502 pages. $27 at Amazon. Dr. Verzani is a math Professor at College of Staten Island, CUNY.
Murrell, Paul (2019) R Graphics, 3rd ed. CRC
Press, FL. 423 pages. $82 at Amazon. Paul Murrell is appreciated as the
leading expert on R graphics and author of R packages, including
grid which enables the packages lattice,
ggplot2 and other advanced graphic environments. This
comprehensive book is not for the fainthearted. However, his early
chapters on base R graphics are fundamental, understandable and
precisely demonstrated.
de Vries, A. and J. Meys (2015) R for Dummies, 2nd ed. John Wiley, NJ. paper, 418 pages. $19 at Amazon.
Navarro, Danielle (2015) Learning Statistics with R, Version 0.6. Creative Commons BY-SA license for free use. 613 pages. PDF download at: https://learningstatisticswithr.com
books on Data, Graphs, Statistics, R programming
Chang, Winston (2019) R Graphics Cookbook Practical Recipes for Visualizing Data, 2nd ed. O’Reilly, CA. paper, 425 pages. $48 at Amazon. Read online for free at: https://r-graphics.org
Long, J.D. and P. Teetor (2019) R Cookbook Proven Recipes for Data Analysis, Statistics, and Graphics, 2nd ed. O’Reilly, CA. paper, 579 pages. $41 at Amazon. Read online for free at: https://rc2e.com
Wickham, H. and G. Grolemund (2017) R for Data Science Import, Tidy, Transform, Visualize, and Model Data. O’Reilly, CA. paper, 492 pages. $40 at Amazon. 2nd ed. in progress. Read online for free at: https://r4ds.had.co.nz
Wickham, H. (2019) Advanced R, 2nd ed. Chapman & Hall/CRC, NY. paper, 604 pages. $42 at Amazon. Read online for free, at: https://adv-r.hadley.nz/
Gillespie, C. and R. Lovelace (2016) Efficient R Programming a practical guide to smarter programming. O’Reilly, CA. paper, 222 pages. $24 at Amazon. Read online for free, at: http://csgillespie.github.io/efficientR/
Grolemund, G. (2014) Hands-On Programming with R Write your own functions and simulations, O’Reilly, CA. paper, 230 pages. $26 at Amazon. Read online for free, at: https://rstudio-education.github.io/hopr/
books on Reproducible Research and Interactive Apps
Xie, Y. (2015) Dynamic Documents with R and knitr, 2nd ed. Chapman & Hall/CRC, NY. paper, 294 pages. $40 at Amazon. Read online for free, at: https://yihui.org/knitr/
Wickham, H. and Jenny Bryan (2023) R Packages Organize, test, document, and share your code, 2nd ed. O’Reilly, CA. paper. $60 at Amazon. Read online for free at: https://r-pkgs.org/
Wickham H. (2021) Mastering shiny: build Interactive Apps, Reports & Dashboards, O’Reilly, CA. paper. 372 pages. $38 at Amazon. Read online for free at: https://mastering-shiny.org
Xie, Y. (2016) bookdown Authoring books and Technical Documents with R Markdown. Chapman & Hall/CRC, NY. paper, 138 pages. $23 at Amazon. Read online for free, at: https://bookdown.org/yihui/bookdown/
Xie, Y., A. Thomas and A.P. Hill (2017) blogdown Creating Websites with R Markdown. Chapman & Hall/CRC, NY. paper, 172 pages. $27 at Amazon. Read online for free at: https://bookdown.org/yihui/blogdown/
Xie, Y., J.J. Allaire and G. Grolemund (2018) R Markdown The Definitive Guide. Chapman & Hall/CRC, NY. paper, 338 pages. $34 at Amazon. Read online for free, at: https://bookdown.org/yihui/rmarkdown/
__________________
Available as PDF files by my dropbox link:
https://www.dropbox.com/sh/qr1uowvfi8m0oae/AAC9oGH0KPmaYmU4y02MtKUZa?dl=0
Scan these 4 reading assignments
Dalgaard, Peter (2008) Introductory Statistics with R, 2nd ed. — Ch 1. Basics, Ch 2. The R Environment, Ch 4. Descriptive statistics and graphics
Murrell, Paul (2011) R Graphics. 2nd ed. — Ch 1. An Introduction
to R Graphics, Ch 2. Simple Usage of Traditional Graphics, Ch 3.
Customizing Traditional Graphics.
This is not from his 3rd edition
but content is nearly the same for these 3 chapters on base R
graphics.
Baldi & Moore (2009) The Practice of Statistics in the Life
Sciences. — Ch 1. Picturing Distributions with Graphs,
Ch 2.
Describing Distributions with Numbers
Sokal & Rohlf (1995) Biometry. 3rd ed. — Ch 2. Data in Biology, Ch 4. Descriptive Statistics
Contributed by the R community, the R website (https://www.r-project.org/) has links for free books, pamphlets and information sheets.
__________________
Advice on learning to code to benefit your scientific career
Please watch this video at https://youtu.be/0L4CQUK--dA
In June 2021, the journal Nature presented a 60 minute Webcast on “Advice on learning to code to benefit your scientific career.” It had 3 speakers, Q&A, and R was emphasized along with Python. Much of what they discussed resonates with our learning path in Bio 240 Biostatistics.
Question: Do you know what FOSS stands for? I am proud to be a FOSS person.
__________________
Dr. Stephen Linn Chew. How to Get the Most out of Studying
https://www.samford.edu/arts-and-sciences/directory/Chew-Stephen-Linn – his website
Video 1. Beliefs That Make You Fail…Or Succeed
https://youtu.be/RH95h36NChI 6:54 minutes
Video 2. What Students Should Understand About How People
Learn
https://youtu.be/9O7y7XEC66M 7:15 minutes
Video 3. Cognitive Principles For Optimizing
Learning
https://youtu.be/1xeHh5DnCIw 5:45 minutes
Video 4. Putting The Principles For Optimizing Learning
Into Practice
https://youtu.be/E9GrOxhYZdQ 9:17 minutes
Video 5. I Blew The Exam, Now What?
https://youtu.be/-QVRiMkdRsU 7:29 minutes
Download the Teaching Guide by Dr. Chew. 15 pages
https://www.samford.edu/departments/files/Academic_Success_Center/How-to-Study-Teaching_Resources.pdf
__________________
Dr. Pardis Sabeti. What is Statistics?
What is Statistics? https://www.learner.org/series/against-all-odds-inside-statistics/what-is-statistics/
6 minutes
Designing experiments https://www.learner.org/series/against-all-odds-inside-statistics/designing-experiments/ 11 minutes. In Bio 240 we shall analyze data from observational studies as well as from experimental studies
Summary https://www.learner.org/series/against-all-odds-inside-statistics/summary/
6 minutes
Stem plots https://www.learner.org/series/against-all-odds-inside-statistics/stemplots-2/ 11 minutes
Histograms https://www.learner.org/series/against-all-odds-inside-statistics/histograms/ 9 minutes
Boxplots https://www.learner.org/series/against-all-odds-inside-statistics/boxplots/ 9 minutes
Scatterplots https://www.learner.org/series/against-all-odds-inside-statistics/scatterplots/ 8 minutes
Measures of center https://www.learner.org/series/against-all-odds-inside-statistics/measures-of-center/ 8 minutes
Standard deviation https://www.learner.org/series/against-all-odds-inside-statistics/standard-deviation/ 9 minutes
Confidence intervals https://www.learner.org/series/against-all-odds-inside-statistics/confidence-intervals/ 10 minutes
Normal calculations https://www.learner.org/series/against-all-odds-inside-statistics/normal-calculations/ 12 minutes
Normal curves https://www.learner.org/series/against-all-odds-inside-statistics/normal-curves/ 12 minutes
Checking assumption of normality https://www.learner.org/series/against-all-odds-inside-statistics/checking-assumption-of-normality/ 10 minutes
Sampling distributions https://www.learner.org/series/against-all-odds-inside-statistics/sampling-distributions/ 12 minutes
Random variables https://www.learner.org/series/against-all-odds-inside-statistics/random-variables/
11 minutes
Featuring the NASA Challenger disaster of 1986 caused by O-ring
failure
Tests of significance https://www.learner.org/series/against-all-odds-inside-statistics/tests-of-significance/ 16 minutes
Small sample inference for one mean https://www.learner.org/series/against-all-odds-inside-statistics/small-sample-inference-for-one-mean/ 12 minutes
Comparing two means https://www.learner.org/series/against-all-odds-inside-statistics/comparing-two-means/ 11 minutes Featuring research by a CUNY Professor at Hunter College
One-way Anova https://www.learner.org/series/against-all-odds-inside-statistics/one-way-anova/ 12 minutes
Correlation https://www.learner.org/series/against-all-odds-inside-statistics/correlation/ 10 minutes
The question of causation https://www.learner.org/series/against-all-odds-inside-statistics/the-question-of-causation/ 14 min Smoking and lung cancer
Fitting lines to data https://www.learner.org/series/against-all-odds-inside-statistics/fitting-lines-to-data/ 10 minutes
Inference for regression https://www.learner.org/series/against-all-odds-inside-statistics/inference-for-regression/ 13 minutes History of DDT and bird eggs
Introduction to probability https://www.learner.org/series/against-all-odds-inside-statistics/introduction-to-probability/ 11 minutes
Probability models https://www.learner.org/series/against-all-odds-inside-statistics/probability-models/ 10 minutes
Binomial distributions https://www.learner.org/series/against-all-odds-inside-statistics/binomial-distributions/ 11 minutes
Inference for proportions https://www.learner.org/series/against-all-odds-inside-statistics/inference-for-proportions/ 10 min
Inference for two-way tables https://www.learner.org/series/against-all-odds-inside-statistics/inference-for-two-way-tables/ 10 minutes Dr. Sabeti’s lab work
Glossary https://www.learner.org/series/against-all-odds-inside-statistics/glossary/
Note: The viewing order for Dr. Sabeti’s short, statistics videos is not that important. Each is introductory, quasi-independent and each uses real-world examples. There are 36 units, some with PDF transcript downloads. Units 34-36 are interactive applications. Interestingly, there is no coverage of effect size. As time allows, you may profit from engaging these 36 units that are optional for Bio 240.
She is a physician scientist, computational geneticist, prominent
Ebola, Lassa fever and Covid-19 researcher, Professor at Harvard
University and an Iranian American. She earned the PhD from Oxford
University and MD from Harvard University.
What is Statistics is just one of the Against All Odds: Inside Statistics series of videos, all of which are short, excellent and with transcripts available as .pdf downloads.
https://www.sabetilab.org/pardissabeti/ – Her lab’s website
https://www.hsph.harvard.edu/pardis-sabeti/ – Professor Sabeti, Harvard T.H. Chan School of Public Health
https://oeb.harvard.edu/people/pardis-sabeti – Professor Sabeti, Harvard Depart. Organismic & Evolutionary Biology
https://en.wikipedia.org/wiki/Pardis_Sabeti – Wikipedia
https://www.hup.harvard.edu/catalog.php?isbn=9780674260474&content=toc
– her 2021 book:
OUTBREAK CULTURE – The Ebola Crisis and the next
epidemic
Free download: https://get.adobe.com/reader/ but I suggest not selecting any of the other ‘free’ offers in the dialog box.
__________________
Videos by the American Statistical Association
https://www.amstat.org/ (ASA Student membership is
only $25 per year)
Why You Need to Study Statistics. 3 minutes https://youtu.be/wV0Ks7aS7YI
This is Statistics: Genevera Allen. 1:41 minutes https://youtu.be/xURkTKtDq_M
This is Statistics: Chandra Erdman. 1:40 minutes https://youtu.be/uss7MXcpHaQ
This is Statistics: Roger Peng. 1:50 minutes https://youtu.be/WMDAR2bZEp0
Statisticians Making a Difference. 3:08 minutes https://youtu.be/_EnoTvnx2gQ
LinkedIn Data Scientist Talks Statistics. 1:51 minutes https://youtu.be/jf2YYTL-wYE
Dr. Sudipta Roy – Why statistics matter for everyone. 0:30 minutes https://youtu.be/UKgYNIHtmN4
Statisticians in Other Fields. 3:08 minutes https://youtu.be/NqngbsS0iKY
World Statistics Day: Statistics All Around Us. by U.S. Census Bureau. 5:39 minutes https://youtu.be/piSCkkSvoMo
Some speakers in the above American Statistical Association (ASA) videos.
__________________
Dr. Talithia Williams
https://www.hmc.edu/mathematics/people/faculty/talithia-williams/ – Professor Williams, Harvey Mudd College, CA
https://www.talithiawilliams.com/ – Professor Williams web page
https://www.talithiawilliams.com/book – Her latest book: Power in Numbers, The Rebel Women of Mathematics
https://www.thegreatcourses.com/talithiawilliams – Learning Statistics: Concepts and Applications in R , 24 videos, $40
Own Your Body’s Data – 17 minute TED talk, February 2014
__________________
Dr. Hans Rosling (1948 - 2017)
He invented “bubble graphs” which are 2D scatterplots that display 5 variables. He was Professor of International Health at Karolinska Institute, Sweden.
https://en.wikipedia.org/wiki/Hans_Rosling – Wikipedia biography
HANS ROSLING, 200 Countries, 200 Years, 4 Minutes. 4:47 minutes https://youtu.be/jbkSRLYSojo
Florence Nightingale: Joy of Stats. 3:42 minutes https://youtu.be/yhX0OR1_Vfc
Averages: Joy of Stats. 3:17 minutes https://youtu.be/hUGUWr-TjR8
Crime Spotting: Joy of Stats. 4:58 minutes https://youtu.be/en2ix9f8ceM
Correlation: Joy of Stats. 3:38 minutes https://youtu.be/6RzDMEW5omc
Automatic translation: Joy of Stats. 4:09 minutes https://youtu.be/AEac-jP5Eho
__________________
Video from The BBC and The Open University
When can you trust statistics? The modern world is littered with
statistical noise. Here’s how to find the signal. 4:50 minutes
https://youtu.be/aPh3E8IoHzk
This video does not deal with Statistics from a scientific point of view. Rather, it examines how we as individuals in everyday life may view and interpret Statistics as reported in the media across all topics, e.g., socio-economic, political. They promote: be Calm and be wary of your Emotions, get Context, be Curious, and consider the Source of the Data and of the Statistics. This advice also makes good sense for researchers at the bench, dealing with their own data and statistics.
__________________
foo.R R Editor files and foo.pdf graph files. Create well-structured, readable R Editor files (R programs) with comments, spacing, indentation, and the wise naming of variables and objects. Store data in R Editor files as vectors and data frames.
__________________
R
data
graphs
statistics
writing about the results
Data-driven ability to use statistical reasoning in Biology by computer programming in R.
Store, edit and manipulate data in standard formats and visualize in publication quality graphs using R. Read and write, standard .csv files of row-by-column data.
Perform statistical analysis (traditional, computationally intensive, Bayesian) and express results as tables, graphs and writing appropriate for scientific journals.
Ability to call R functions, deal with function arguments and interact with the objects returned by functions. Create, debug and run well-structured, readable R Editor files (.R files, i.e., R programs). Optional: Create new R functions as convenience wrappers for extant functions in R; create new R functions for new tasks.
Write critical, statistical reviews of scientific journal papers relative to tables of statistics, graphs and their legends.
Optional: Entry level ability in reproducible research using R Markdown to make small, HTML documents that combine R code, data, results, graphs and writing. RStudio makes this process tractable, albeit with a learning curve. I will demonstrate in class and provide templates. This is our only use for RStudio. For everything else in Bio 240, we use R (R Console, R Editor, R Graphics Window) because I want you close to the language itself, not to a commercial IDE that promotes a dialect of R.
After each Biostatistics class
ask yourself these 3 questions
▶ What is the most important thing you learned?
▶ Name one thing you did not understand.
▶ What question remains in your mind?🦁
Academic Integrity Statement : Lehman College
School of Natural and Social Sciences. I will email it for you to
consider and optionally, to return by email to me. It is a form fill PDF
file.
Rather fail with honor than succeed by fraud — Sophocles (b. 497, d. 406 BC)
In Bio 240 you are on your personal honor to take
Tests without cheating. To cheat is to give or to receive answers or
help. The use of generative A.I. is cheating. Cheating is immoral, it is
a sin. It can get you expelled from the College. Don’t do it, it’s not
worth the consequences.
Cheating destroys career development. For instance, cheating vitiates
letters of recommendation. Faculty may refuse to write a letter or may
write a negative letter.
Remember, a recommendation letter for graduate school or professional
school is evaluated several ways, including as a character reference. By
definition, a cheater is a person of low character and graduate schools
do not want this kind of person in their program, whether MS, PhD or MD.
Student Disability Statement : “Lehman
College commits to providing access to programs and curricula to all
students. Students with disabilities who need classroom accommodations
should register with Office of Student Disability
Services, Shuster Hall room 238, 718-960-8441.”
__________________
Fall 2023 Undergraduate Academic Calendar — Lehman College, City University of New York, Bronx NY 10468
Friday August 25 --- Fall 2023 classes begin
September 1 thru 14 --- Withdraw (grade of WD) period
Monday September 4 --- College Closed, no classes; Labor Day
September 15 thru 17 --- no classes; a religious holiday
September 24 thru 25 --- no classes; a religious holiday
Monday October 9 --- College Closed; Columbus Day
Tuesday October 10 --- All Classes follow a Monday schedule
Wednesday November 22 --- no classes
November 23 thru 26 --- no classes; Thanksgiving Holiday
Monday December 11 --- Last day of Classes & "W" grade deadline
Tuesday December 12 --- Reading Day
Wednesday December 13 --- Reading Day
Thursday Dec. 14 thru Wed. Dec. 20 --- Final Examinations at Lehman College
Friday pm section: 2 pm - 5:50 pm
Friday Aug. 25 --- Class 1 --- Get started in base R: coding, data, graphs, statistics I
Friday Sept. 1 --- Class 2 --- " " " " " " " " II
Friday Sept. 8 --- Class 3 --- " " " " " " " " III
Friday Sept. 22 --- Class 4 --- " " " " " " " " IV
Friday Sept. 29 --- Class 5 --- TEST 1
___________________________________________
Friday Oct. 6 --- Class 6 --- ANOVA: concepts, history, calculations, Anova Table, F-
distribution, significance test, assumptions, writing
Friday Oct. 13 --- Class 7 --- ANOVA: effect size (R-sq, Cohen D, Variance Components),
Confidence Intervals by t-distribution & bootstrap,
statistical power analysis: traditional & bootstrap
Friday Oct. 20 --- Class 8 --- Continued,...
Friday Oct. 27 --- Class 9 --- Multiple Comparisons following a significant Anova,
planned vs. unplanned comparisons, Tukey HSD,
pairwise t-tests, Dunnett's Test,
Holm-Bonferroni method, graphical visualization
Friday Nov. 3 --- Class 10 --- Kruskal-Wallis & other nonparametric methods, Bayes
Factor Anova, Permutation tests, Monte Carlo data
simulation, graphical visualization of one-way data,
writing about statistical & graphical results
Friday Nov. 10 --- Class 11 --- Continued,...
Friday Nov. 17 --- Class 12 --- TEST 2
___________________________________________
Friday Dec. 1 --- Class 13 --- Correlation, graphical visualization, related topics I
Friday Dec. 8 --- Class 14 --- " " " " " II
Friday Dec. 15 --- Class 15 --- TEST 3. The Final Exam for FRIDAY Section
Note: Days for Test 1 and Test 2 are subject to change. Test topics are subject to change.
__________________
Bio 240 is a required course for the Biology BS degree but not for the Biology BA.
But, changes are pending for the Biology BS to enable course substitutions for Bio 240.
We meet the first class for the full period, so be prepared to work, to listen and to code collaboratively with me, in class.
Use your laptop or desktop computer for Zoom meetings. Phones, iPads and Chromebooks handle Zoom but cannot run R.
Bio 240 is based on R and requires you to use R, for data analysis and graphics.
R runs well on Linux (e.g., Ubuntu), but the look and feel is different, so don’t use a Linux machine unless you’re somewhat comfortable at the command line of the Linux terminal and you’re willing and able to translate my projected Windows R coding during class into a third-party text editor, say Notepad++, Ubuntu’s native text editor, or the Linux version of RStudio.
I routinely program R in Ubuntu using Notepad++ as the text editor (instead of R Editor, which doesn’t exist on the Linux side) and enjoy the speedup relative to running R on a Windows machine, other factors being equal.
If you have experience in any computer programming language you will immediately take to R, find it easy, and enjoy the readable code.
If you have no programming experience then get over it, and get started in this class with R, as you advance your career development.
Whether one has had a statistics course before or not seems to make little difference in Bio 240. Likewise for Calculus. In Bio 240 we use basic algebraic and symbolic thinking while letting R do the heavy lifting.
Disturbingly, in recent semesters some students did not know how to move files off and onto their hard drives using Lehman email and USB ports (e.g., a USB flash drive). Others did not know the name or location of their working directory, or what a working directory is.
Many refused to, or could not, use their computer the way a scientist uses a computer (for instance, all apps out of RAM except the one being used, R in our case) and suffered OS freezes or glacial slowness as a result. Also, avoid having multiple instances of R open or it will eventually crash. Avoid docking R windows, else confusion may ruin your R Session. An R Session is the period of time from when you open R to when you close R.
Many had trouble creating, naming and saving computer files properly and this is a serious issue during a Test, and in real life.
Please note that Windows machines must be configured so the user can know and “see” the full name of all files; otherwise one has difficulty reading files into R and trouble properly naming files. The default Windows setting “hides” file extensions like .csv, .R and .pdf. Mac OS assumes humans are smart enough to handle seeing the complete file name, so it wisely displays the complete file name. The file name includes the extension; e.g., foo.csv and foo.R are complete file names while foo is not.
If your Windows machine does not show you the complete name of any and all files, then do the following. Go to Settings, then Update & Security, then For developers, then File Explorer, and check the setting that shows file extensions. Now you will see the full name of every file and you can write R code to name, read and write files successfully and properly.
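A few base R calls make the working directory and the complete file names visible from within R; the path in the commented setwd() line is a hypothetical example that you would edit for your own machine:
getwd()                              # print the current working directory
# setwd("C:/Users/yourname/Bio240")  # change it (hypothetical path; edit for your machine)
list.files()                         # complete file names, with extensions, in the working directory
file.exists("foo.csv")               # TRUE if R can see that exact file name there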
Obviously, all college students own laptop or desktop computers. Should you be in the market for one I can provide advice based on my 50+ years using computers. Be advised that Bio 240 does not require an expensive laptop computer but it cannot be Chrome OS because Chrome can’t run R.
Everything in Bio 240 runs on my old $169 ASUS laptop (11.6”, Win 10, 2-core Centrino CPU, 4 GB RAM). Although I was responsible for bringing the first Apple computers to this campus in 1979, Macs are over-priced and more difficult to use and maintain than Windows machines at the present time; nevertheless, if you know how to use one the way a scientist uses a computer, then Macs are fast and reliable, albeit too expensive.
As you know, the College makes some laptops available to students during the pandemic – see your emails from the College on how to request one. Relative to Bio 240, avoid iPads and Chrome machines because they don’t run R, though they are Zoom compatible. I realize that a Chrome machine or iPad would be handy for courses other than Bio 240.
Newfoundland, Canada
The top of an iceberg looks large but it’s small compared to the bottom of the iceberg.
What’s more important, the top or the bottom?
I learned this firsthand in the 1970s in Newfoundland Canada. I recklessly scrambled over the edge of massive coastal boulders to inspect and touch a jostling iceberg fragment the size of a small house, that calved off the gigantic parental iceberg further out in the bay.
Icebergs look white. But up close, I was astonished that the ice was crystal clear and mysterious, as I peered deeply within and pondered where it came from, how old it was, where it was going and how long it would last.
In Bio 240, of course there is required material reflected in Tests. And, I will expose you to much more — to the optional bottom of the iceberg for those pushing career development farther and faster than merely earning the grade of A or B.
Periodically across the term I’ll demonstrate a particular statistical method or a different type of graph, or delve into some R coding strategy, while announcing that it’s bottom of the iceberg, meaning optional. Some extra credit questions on Tests might reflect the bottom of the iceberg.
Distribution of grades
The graph below, created in R, is a density histogram with kernel density estimate, of the final average grades for Spring 2020. Notice the peaked central tendency (positive kurtosis) and the heavy left-hand tail (negative skewness), relative to a normal distribution.
Don’t worry about understanding the graph and the descriptive statistics right now, because we will develop these things in detail, in class.
The two graphs below, visualize the distribution of final grades in Bio 240 for the Friday and Saturday sections, Fall 2022. The final average is the mean of the 3 Tests. I added 10 points to the final average of each student, as indicated on the graphs.
Notice that by the last day of class, n=10 had dropped from the Friday section with the “W” grade; and n=13 dropped from the Saturday section, all due to very low test grades. As a result, 85% earned A or B in the Friday section while 76% earned A or B in the Saturday section.
Interestingly, the median final grade (50th percentile) was nearly identical between the 2 sections: 90.88 for Friday, 90.97 for Saturday. The left graph is a boxplot and on the right, a kernel density graph. These graph types (histogram, boxplot, kernel density) are extremely common in scientific journal papers and you will become expert in coding for them in R.
__________________
A proven way to begin learning R is to engage the simplest type of numeric data: the univariate. In other words, single samples of numeric data, whether real data or simulated data. By engage I mean: how to type and store it in R code, how to read it in from an external file, how to manipulate it, how to graph it, how to analyze it statistically.
Indeed, this is exactly how I got started teaching myself R, years ago. After reading about R in the 2004 first edition of our textbook by Prof. Peter Dalgaard, I reasoned that if I could store small numeric samples in an R Editor file and get descriptive statistics for these samples, as well as graphs like histograms, then I could probably move forward to handling and analyzing more complicated data sets in R. I did, and you can too.
It’s not rocket science, but it does require patience, copious practice, attention to detail, and the ability to engage R coding in trial-and-error fashion, plus observing our rules and conventions for building and debugging clean, readable, well-structured and wisely commented code, as opposed to dreaded ugly code that’s hard to read, change, print and debug.
In class, we mostly use R to graph and analyze real data, and data that’s largely biological.
But there’s enduring value in learning, early on, how to create and use fake data. The word fake might be misleading; here, it’s shorthand for Monte Carlo data simulation from theoretical probability distributions. Early in this Section we use simulated data from the normal probability distribution. Then we move into one-way data (merely sets of single samples) that is simulated as well as real.
As engaging and meaningful as it is in its own right, writing code is not doing statistics. One of the big advantages of working with R is that you can do quite a bit of statistics with just a handful of functions and the simplest syntax. R is a tool that helps you keep moving forward. If you want to see something, then plot it. If the data are in the wrong format, then mutate it.
John Chambers, 2014 Keynote Address, useR!
The line of R code below, y1 <- rnorm(60, 50, 9), draws a random sample of n=60 from the normal distribution with mean 50 and standard deviation 9.
Note the assignment operator <- that we use in R, not the equals character =. But we do use the equals sign to name arguments in function calls.
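For example (the object name y2 is arbitrary), the same kind of call can be written with named arguments:
y2 <- rnorm(n = 60, mean = 50, sd = 9)   # <- assigns the result to y2; = names the arguments
# y2 = rnorm(60, 50, 9)                  # works, but by convention we reserve = for arguments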
Early on as you’re learning R, it pays to get comfortable generating fake data: random draws from, say, the normal distribution rnorm(), which is bell-shaped; the uniform distribution runif(), which is flat-topped; and the Weibull distribution rweibull(), which is truncated on the left and skewed to the right.
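As a quick sketch of those three shapes (the parameter values below are arbitrary), draw a large fake sample from each and compare the descriptive summaries:
set.seed(1)                                       # so the fake samples are repeatable
y.norm <- rnorm(1e4, 50, 9)                       # bell-shaped
y.unif <- runif(1e4, min = 0, max = 100)          # flat-topped between 0 and 100
y.weib <- rweibull(1e4, shape = 1.5, scale = 10)  # bounded at zero, skewed to the right
summary(y.norm); summary(y.unif); summary(y.weib) # compare min, quartiles, mean, max
# hist(y.weib)                                    # or simply look at each histogram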
I cannot be more serious. Generating and using fake data is not just a technique, it’s a way of life for the analyst!
It enables one to get numbers for exploring and learning a function in R, to test, verify and compare competing statistical procedures, and to address questions about our observed data, such as: what might happen if I repeated my data sampling design and statistical analysis sequence many times (via a similar sampling design in Monte Carlo)?
set.seed(123) # seed random number generator, for repeatability, as desired
# get 60 random numbers from normal distribution of mean 50, SD 9; assign to y1
y1 <- rnorm(60, 50, 9) # >?rnorm for html documentation
y1 # print the sample at R Console
## [1] 44.95572 47.92840 64.02837 50.63458 51.16359 65.43558 54.14825 38.61445
## [9] 43.81832 45.98904 61.01674 53.23832 53.60694 50.99614 44.99743 66.08222
## [17] 54.48065 32.30045 56.31220 45.74488 40.38959 48.03823 40.76596 43.43998
## [25] 44.37465 34.81976 57.54008 51.38036 39.75677 61.28433 53.83818 47.34436
## [33] 58.05613 57.90320 57.39423 56.19776 54.98526 49.44279 47.24634 46.57576
## [41] 43.74764 48.12874 38.61143 69.52060 60.87166 39.89202 46.37404 45.80010
## [49] 57.01969 49.24968 52.27987 49.74308 49.61417 62.31742 47.96806 63.64824
## [57] 36.06122 55.26152 51.11469 51.94347
This is not the place for a formal treatment of the Normal Distribution, the most important probability distribution in all Statistics, although a brief introduction follows.
The Normal Distribution is defined by 2 parameters: its mean, \(\mu\), and Standard Deviation, \(\sigma\). It’s a continuous distribution, symmetrical about its central tendency. The so-called standard normal distribution N(0, 1) has mean=0 and SD=1. Our fake data y1 is a random sample from N(50, 9), using standard notation.
\[\begin{align*} The\ PDF\ of\ the\ normal\ distribution:\ \ \ f(x) &= \frac{1}{\sqrt{2\pi\sigma^2}} e^{ - \frac{(x - \mu)^2} {2\sigma^2}} \\[1ex] \end{align*}\]
The Normal Distribution is a continuous distribution extending asymptotically from negative infinity to positive infinity. Its shape is symmetrical on either side of its mean. The mean, mode and median are the same number. Some call it the “bell-shaped” curve. The formula for the PDF (Probability Density Function) of the normal distribution is presented above; elsewhere, we develop various graphical visualizations and assessments of normality, using our custom R code as well as by calling functions from R packages.
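For a first look, ahead of that fuller treatment, base R can draw this PDF directly with dnorm() inside curve() (the axis labels and plotting range are my choices):
# PDF of N(50, 9), drawn over roughly 4 SD on either side of the mean
curve(dnorm(x, mean = 50, sd = 9), from = 50 - 4*9, to = 50 + 4*9,
      xlab = "x", ylab = "density", las = 1)
abline(v = 50, lty = 2)   # dashed vertical line at the mean = median = mode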
The Normal Distribution is the most important distribution in all
Statistics. There are important reasons why. First, many
continuous measurement variables are normally distributed, such as human
height. Second, the Central Limit Theorem holds
that as sample size increases, the distribution of the sample means
becomes normally distributed thus allowing traditional, 95% Confidence
Intervals for population means to be calculated based on a single
observed mean, using its Standard Error (SD/sqrt(n)) and the
t-distribution. Third, many traditional, parametric statistical
procedures (e.g., t-test, Anova, OLS linear regression, Pearson
correlation) have assumptions, and one of these is that the observed
data are normally distributed.
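A small Monte Carlo sketch of the Central Limit Theorem and of the traditional confidence interval, using the y1 sample from above (the number of replicates is arbitrary):
set.seed(123)
many.means <- replicate(1e4, mean(rnorm(60, 50, 9)))  # 10,000 sample means, each from n=60
hist(many.means)                   # approximately normal, centered near 50 (Central Limit Theorem)
sd(many.means)                     # close to the theoretical Standard Error: 9/sqrt(60) = 1.16

# traditional 95% CI for the population mean, from the single observed sample y1
mean(y1) + c(-1, 1) * qt(0.975, df = 59) * sd(y1)/sqrt(60)
t.test(y1)$conf.int                # the same interval, from a base R function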
Mathematically and causally in the physical world, what makes a
response variable normally distributed?
Think about human height – what causes it?
Obviously, there are environmental variables such as socio-economics,
nutrition (its timing, quality, quantity) and there are genetic
variables (many genes plus gender).
In brief, many causal factors, each with small additive effects and each factor independent of the others, generate a normal distribution. In math, the expansion of the binomial distribution generates the normal distribution by the same process. Galton’s “bean box” from the late 19th Century demonstrates this mechanically, by machine, but we will use a GIF-animated simulation in R, using the “animation” package.
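A rough sketch of that idea in R (the number and size of the effects below are arbitrary): each simulated “height” is a baseline plus the sum of many small, independent effects, and rbinom() shows the binomial-to-normal convergence the same way.
set.seed(7)
# 1,000 fake heights: each is a baseline of 100 plus the sum of 200 tiny independent effects
height <- 100 + replicate(1000, sum(runif(200, -0.5, 0.5)))
hist(height)                        # approximately bell-shaped
shapiro.test(height)$p.value        # p-value from the Shapiro-Wilk test of normality

# counts of successes from many Bernoulli trials also pile up in a normal-looking shape
hist(rbinom(1000, size = 1000, prob = 0.5))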
All statistics textbooks deliver solid treatments of the normal
distribution, as well as say, Wikipedia, https://en.wikipedia.org/wiki/Normal_distribution
Recovering a standard set of descriptive statistics is
always an early task in data analysis, and there are many different ways
to do it. The describe() function in the psych
package is handy. CRAN package psych is by Professor
William Revelle, Department of Psychology, Northwestern University. His
package has many functions. At the R Console,
> ?describe brings up HTML documentation, in a format
standardized across all funs in all CRAN packages. As an aside, we often
abbreviate fun for function, arg for a
function argument and foo for a dummy name say,
foo.R as a generic R Editor file.
From its documentation, the describe() fun has these
args, but here, we need only one arg, the first arg, which is the name
of the univariate sample (numeric vector in R). In all R funs the most
important args are listed first; args often have default values. When
learning how to use a fun that’s new to us, we always begin with a
minimal call to that function and explicitly add args to the fun call,
as needed. Here, we need only one arg: y.
describe(x, na.rm=TRUE, interp=FALSE, skew=TRUE, ranges=TRUE, trim=.1, type=3,
         check=TRUE, fast=NULL, quant=NULL, IQR=FALSE, omit=FALSE, data=NULL)
Round-off may be controlled if the object returned by
describe() is printed using print() which is
an extremely important and flexible, function in base R.
library(psych) # load package from hard drive, into memory at R Console
## Warning: package 'psych' was built under R version 4.2.3
describe(y1) # a fun in package 'psych'
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 60 50.59 8.19 50.19 50.5 7.73 32.3 69.52 37.22 0.08 -0.46 1.06
print(describe(y1), digits=3) # print() is an important function in base R
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 60 50.591 8.193 50.189 50.505 7.728 32.3 69.521 37.22 0.082 -0.462
## se
## X1 1.058
Alternatively, most R users write their own short functions for many things, including descriptive statistics. One of mine is new.stats(), as defined in K funs UNIVARIATE.R, an R Editor file of function definitions. The norm.test argument in the call to new.stats() turns on a batch of goodness-of-fit tests of normality, to inform the curious analyst.
new.stats(y1, data.name="sim data") # from: K funs UNIVARIATE.R
## sim data
## N 60.000
## Mean 50.591
## SD 8.193
## CV% 16.195
## Min 32.300
## Max 69.521
## skew 0.082
## kurtosis -0.462
## SEM 1.058
## LB.95%CI.mean 48.474
## UB.95%CI.mean 52.707
## Median 50.189
## 25th 45.558
## 75th 56.226
## IQR 10.668
new.stats(y1, norm.test=TRUE) # add normality tests
## $statistics
## y1
## N 60.000
## Mean 50.591
## SD 8.193
## CV% 16.195
## Min 32.300
## Max 69.521
## skew 0.082
## kurtosis -0.462
## SEM 1.058
## LB.95%CI.mean 48.474
## UB.95%CI.mean 52.707
## Median 50.189
## 25th 45.558
## 75th 56.226
## IQR 10.668
##
## $goodness.of.fit.tests.for.normality
## p-value
## Shapiro-Wilk test 0.9800224
## Robust Jarque-Berra test 0.8098829
## Anderson-Darling test 0.9791136
## Cramer-von Mises test 0.9742690
## Shapiro-Francia test 0.9930478
## Pearson test 0.9301044
## D'Agostino test 0.7708575
## Anscombe-Glynn test 0.6995321
## Bonett-Seier test 0.8212358
## Kolmogorov-Smirnov test 0.9949754
## Lillifors Kolmogorov-Smirnov test 0.9583959
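new.stats() is defined in K funs UNIVARIATE.R, which I distribute in class, so its code is not listed here. As a minimal sketch of how this kind of home-made wrapper can be built from base R functions alone (an illustration, not new.stats() itself):
my.stats <- function(y) {                # y: a numeric vector; missing values removed
  y <- y[!is.na(y)]
  out <- c(N      = length(y),
           Mean   = mean(y),
           SD     = sd(y),
           SEM    = sd(y)/sqrt(length(y)),
           Median = median(y),
           Min    = min(y),
           Max    = max(y),
           IQR    = IQR(y))
  round(out, 3)                          # named numeric vector returned
}
my.stats(y1)                             # compare with the new.stats(y1) output above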
Let’s use a histogram to visualize the above sample y1 of n=60.
The histogram is a graph that’s well known and understood by all educated folks across all scholarly disciplines, worldwide. It’s taught in grade school.
Histograms are as instantly recognized as say, the words DNA
and Homo sapiens. No matter the scientific journal or how
advanced the topic of the research, histograms are used to visualize the
distribution of the data.
I prove this point during class by showing you current issues of
leading journals, including Science, Science Advances,
PNAS, NEJM and Nature.
In statistics, what do we mean by a distribution? It’s the range of the data from the minimum to the maximum value; it’s the central tendency or peaks; it’s the dispersion of the data about the central tendency; it’s the data grouped into intervals spanning the range of the data. Histograms display the distribution of single samples of data.
For me, a histogram helps me see the “shape” of the data.
In R coding, comments begin with the hash character # and are ignored by R (code to the right of # is ignored). Comments are for the human. You will quickly learn to appreciate comments as you wisely lavish them on your code.
hist(y1) # frequency histogram of y1; minimal call to hist() which has many args
Distribution of y1 by a minimal call to hist( ), ‘draft quality’
The histogram above graphs the tabular “frequency distribution”, which is calculated by a short function I wrote, freq.table(), which extracts vectors from the hist() object. The function definition for freq.table() is listed below; then the function is called and it returns a frequency distribution table for y1.
freq.table <- function(y=rnorm(1e2, 50, 10), roundoff=4, ... ){
# -------------------------------------------------------------------
# ARGUMENTS frequency distribution TABLE for a numeric vector
#
# y numeric vector for frequency distribution table
# missing values OK
#
# roundoff default: 4 decimal places
#
# ... args to hist() that's only used for extraction,
# e.g., breaks=22, breaks="sturges", "scott", "fd"
#
# Function RETURNS a data frame by EXTRACTION from hist() object
#
# Note: for the bins, it's > ... <= 'left open, right closed'
# function may be called devoid of arguments, as a DEMO
#
# AUTHOR: Dwight Kincaid, PhD dwight.kincaid@lehman.cuny.edu
# -------------------------------------------------------------------
stopifnot(is.numeric(y), length(y) > 1) # argument checking
y <- y[!is.na(y)] # remove NA, missing values
a <- hist(y, plot=FALSE, ...) # to extract from hist() object
width <- a$mids[2] - a$mids[1] # Width of hist bins
L.level <- round(a$mids - (width/2), roundoff) # vector: Lower bounds
U.level <- round(a$mids + (width/2), roundoff) # vector: Upper bounds
my.level <- paste(L.level, ",", U.level, sep="") # character vector
N <- length(y) # N of response variable
# frequency distribution table by extraction from hist() object 'a'
dfr <- data.frame(interval = 1:length(a$mids),
class = my.level,
midpt = a$mids,
FREQ = a$counts,
rel.freq = round(a$counts/N, roundoff),
cum.f = cumsum(a$counts),
cum.rel.f = round(cumsum(a$counts/N), roundoff),
density = round(a$density, roundoff))
return(dfr) # data frame object returned
} # end function definition
Continuing with y1 of n=60, let’s get the frequency distribution table with the breaks=4 argument in the call to freq.table(), which can pass arguments to hist() by its final ... argument, often called dot-dot-dot.
# 'breaks' arg passed to hist() which may not deliver 'exactly' the breaks specified
freq.table(y1, breaks=4)
## interval class midpt FREQ rel.freq cum.f cum.rel.f density
## 1 1 30,40 35 7 0.1167 7 0.1167 0.0117
## 2 2 40,50 45 23 0.3833 30 0.5000 0.0383
## 3 3 50,60 55 21 0.3500 51 0.8500 0.0350
## 4 4 60,70 65 9 0.1500 60 1.0000 0.0150
Large sample sizes are not a problem for this function.
freq.table(rnorm(6e7), roundoff=8) # random sample of 60 million from N(0,1)
## interval class midpt FREQ rel.freq cum.f cum.rel.f density
## 1 1 -6,-5.5 -5.75 1 0.00000002 1 0.00000002 0.00000003
## 2 2 -5.5,-5 -5.25 16 0.00000027 17 0.00000028 0.00000053
## 3 3 -5,-4.5 -4.75 162 0.00000270 179 0.00000298 0.00000540
## 4 4 -4.5,-4 -4.25 1684 0.00002807 1863 0.00003105 0.00005613
## 5 5 -4,-3.5 -3.75 11914 0.00019857 13777 0.00022962 0.00039713
## 6 6 -3.5,-3 -3.25 66875 0.00111458 80652 0.00134420 0.00222917
## 7 7 -3,-2.5 -2.75 291844 0.00486407 372496 0.00620827 0.00972813
## 8 8 -2.5,-2 -2.25 992429 0.01654048 1364925 0.02274875 0.03308097
## 9 9 -2,-1.5 -1.75 2643205 0.04405342 4008130 0.06680217 0.08810683
## 10 10 -1.5,-1 -1.25 5510111 0.09183518 9518241 0.15863735 0.18367037
## 11 11 -1,-0.5 -0.75 8996129 0.14993548 18514370 0.30857283 0.29987097
## 12 12 -0.5,0 -0.25 11484601 0.19141002 29998971 0.49998285 0.38282003
## 13 13 0,0.5 0.25 11482921 0.19138202 41481892 0.69136487 0.38276403
## 14 14 0.5,1 0.75 8994566 0.14990943 50476458 0.84127430 0.29981887
## 15 15 1,1.5 1.25 5512059 0.09186765 55988517 0.93314195 0.18373530
## 16 16 1.5,2 1.75 2644362 0.04407270 58632879 0.97721465 0.08814540
## 17 17 2,2.5 2.25 993882 0.01656470 59626761 0.99377935 0.03312940
## 18 18 2.5,3 2.75 292424 0.00487373 59919185 0.99865308 0.00974747
## 19 19 3,3.5 3.25 67017 0.00111695 59986202 0.99977003 0.00223390
## 20 20 3.5,4 3.75 11907 0.00019845 59998109 0.99996848 0.00039690
## 21 21 4,4.5 4.25 1685 0.00002808 59999794 0.99999657 0.00005617
## 22 22 4.5,5 4.75 188 0.00000313 59999982 0.99999970 0.00000627
## 23 23 5,5.5 5.25 18 0.00000030 60000000 1.00000000 0.00000060
How hard would it be to annotate the above histogram with the sample
mean and standard deviation and the p-value from the Shapiro-Wilk test of
normality; to turn off the graph title, change the x-axis label, increase
the font size of the labels, rotate the y-axis tick labels, color the bars, and
get rid of the black ink (sensu Edward Tufte) around the
histogram bars? We simply use more arguments in our call to
hist(). Function arguments are separated by commas. The
mtext() and text() functions are handy for
annotating graphs. Notice how I add a space after each
comma in function calls; this is wise coding practice
as it eases code editing and enhances readability; R does not
need the space but the human does.
Functions and their arguments are small, reusable tools in R.
\({\Large\ To \ use\ R\ is\ to\ call\ functions\ }\) — John Chambers
m <- round(mean(y1), 2) # get mean of y1 and round off to 2 decimals
s <- round(sd(y1), 3) # get SD of y1 and round off
p <- round(shapiro.test(y1)$p, 3) # extract p-value, Shapiro-Wilk test of normality
hist(y1,
col="lightblue", # color of histogram, default "lightgray"
main=NULL, # string for title of histogram; NULL omits it
cex.lab=1.4, cex.axis=1.3, las=1, # typography
border="white", # color of lines around histogram bars
xlab="Random sample from N(50, 9)") # x-axis label
clr <- "darkred" # color of text
mtext(paste("N =", length(y1)), adj=.95, line= 0, cex=.91, col=clr)
mtext(paste("Mean =", m), adj=.95, line=-1, cex=.91, col=clr)
mtext(paste("SD =", s), adj=.95, line=-2, cex=.91, col=clr)
mtext(paste("Shapiro-Wilk test\np =", p), adj=.03, line=-1, cex=.91, col=clr)
Annotated histogram
A graph destined for a Poster or Presentation in class, at a scientific meeting or at an interview, often benefits from a graph title and from a small, modest, personalized time stamp to denote credit/blame. Below is a call to
hist() showing how to add the time stamp by a call to the mtext() function, using Arlo as the investigator name. While a personalized time stamp and title are not appropriate for a publication quality graph, they are often desirable for a presentation quality graph.
Can you find Arlo’s personalized time stamp in the graph window below?
hist(y1,
col="coral",
border="white",
main="Preliminary Experiment #1",
xlab="Random sample from N(50, 9)",
cex.lab=1.4, cex.axis=1.3, cex.main=2, col.main="black", las=1) # typography
# add time stamp, personalized
mtext(paste("Arlo: ", date()), side=1, adj=1, line=4, cex=.9, col="gray")
Presentation quality histogram with personalized time stamp
Question: For graphs, what’s the difference between draft quality, presentation quality and publication quality? The answer concerns the structure of the graph itself, as well as the intended audience. In Bio 240 we use R to make all three types of graphs.
Instead of a histogram, one may visualize the distribution of
y1 by a kernel density estimate calculated by the function density(). Below, we plot it using the workhorse graphing function in base R, plot().
dens <- density(y1) # get x,y coordinates of kernel density curve, fit to data y
plot(dens, xlab="Random sample from N(50, 10)", main="", col="red")
Kernel density estimate of the distribution of y1
If desired, the kernel density estimate may be filled with a color using the
polygon() function. The kernel density is a far more rigorous visualization of the distribution of a sample of data than the histogram. At the R Console, enter ?density to view documentation for this function including its arguments, how it’s calculated and bibliographic references.
dens <- density(y1) # get x,y coordinates of kernel density curve fit to data
plot(dens, xlab="Random sample from N(50, 10)", main="")
polygon(dens, col="peru") # at R Console try > colors() to list the names of all 657 colors
Kernel density estimate, color filled
We could overlay the density histogram of y1 with the normal distribution PDF using the observed mean and standard deviation as its parameters. This is done to show the apparent normality of the observed data or to emphasize the lack of normality.
There are several ways to do this in R; I present one below.
Here, we expect our sample to be reasonably close to the normal PDF because the observed n=60 is a random sample from the normal distribution. But we must remember that at relatively small sample sizes, random sampling often generates quite unexpected-looking distributions, not just in simulated data but in research data as well.
pts <- 250 # number of x,y points for the normal PDF curve
g <- hist(y1, plot=FALSE) # for plotting args
x <- seq(from=min(g$breaks), to=max(g$breaks), length=pts) # x-coord of normal PDF
y <- dnorm(x, mean(y1), sd(y1)) # y-coord " " "
mx <- max(c(y, g$density)) # get y-max for plotting
hist(y1, freq=FALSE, ylim=c(0, mx), main=NULL) # minimal, density histogram
lines(x, y, lwd=1.5, col="blue") # add the normal PDF
Density histogram with normal fit
We often use EDA to help understand the data, to reveal something unforeseen, to communicate to others and to stimulate research hypothesis generation.
EDA, Exploratory Data Analysis, was developed, if not outright created, by John Tukey and associates in the 1970s. The influence of EDA on the entire field of statistics remains profound. Analysts rely on EDA to learn about their data.
In data analysis, we often use this repeating loop :
\({\quad\quad \large\it {Research} \ {Hypotheses}\ \longleftrightarrow \ \Large \bf {E}\ {D}\ {A}\ \longleftrightarrow \ \large\it {Statistical} \ {Analysis/Modeling}}\)
My function EDA() in K funs
UNIVARIATE.R makes the graphic window below, featuring graphs
and statistical analyses, all to be taken as exploratory.
#source("C:\\Users\\dwigh\\Desktop\\K funs UNIVARIATE.R") # 'source' the file
EDA(y1, data.name="sim data, N(50, 9)")
## Warning: package 'DescTools' was built under R version 4.2.3
Assessing normality is an interesting and diverse topic that benefits from statistical wisdom gained by experience in data analysis. This is neither the time nor the place for a deep dive into it. Formerly, only traditional statistical hypothesis testing was used to answer the question, “are my data normally distributed, as assumed by Anova?” Today, tests like the Shapiro-Wilk test of normality are still commonly reported.
Normal Quantile-Quantile plots have long been used to assess normality of observed data.
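Both are quick to produce in base R; a minimal sketch, assuming y1 is still in the workspace:
shapiro.test(y1)        # Shapiro-Wilk test of normality; inspect the p-value
qqnorm(y1, main="")     # normal quantile-quantile plot of y1
qqline(y1, col="red")   # reference line through the first and third quartiles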
Increasingly, many place emphasis on the graphical visualization of an observed distribution versus simulated data of the same sample size, sampled from the normal distribution. I recently invented one such approach as seen below.
Below, see a visual assessment of the distribution of sample
y1 relative to the normal distribution, using a novel
function defined in my R Editor file: K funs
UNIVARIATE.R
# Kincaid function using Monte Carlo data simulation to assess normality
Normal.kernel.band(y1, NS=500)
We simulate a one-way data set of 4 groups with total N=281. Two groups are random samples from the normal distribution (bell-shaped) and the other two groups are random samples from the uniform distribution (flat-topped).
A one-way data set is where a datum is classified by one criterion – its group membership. And, there can be any number of groups and any sample size per group.
We develop graphical visualizations and statistical analyses:
traditional, computationally intensive (permutation & bootstrap) and
Bayesian (Bayes Factor Anova). Note that if the call to
set.seed() in the code below were omitted, a different
fake data set would be generated every time this
.Rmd code is rendered in RStudio.
Another advantage of kernel density estimates over histograms is the ability to visualize more than one distribution on the same graph, as we do below using simulated data (fake data) from the normal and uniform distributions. Plotting these 4 groups as histograms in the same graph would be an overlapping mess.
set.seed(876) # set seed for random number generator
x1 <- rnorm(50, 32, 5) # sample from normal distribution
x2 <- rnorm(51, 34, 7) #
x3 <- runif(100, 10, 45) # sample from uniform distribution
x4 <- runif(80, 19, 30) #
x1 <- round(x1, 2); x2 <- round(x2, 2); x3 <- round(x3, 2); x4 <- round(x4, 2)
# organize our simulated, one-way data set into a DATA FRAME ----------------
response <- c(x1, x2, x3, x4) # numeric, response variable
# get grouping variable
n <- c(length(x1), length(x2), length(x3), length(x4)) # vector of sample sizes
group <- c(rep("x1", n[1]), rep("x2", n[2]), rep("x3", n[3]), rep("x4", n[4]))
group <- factor(group, levels=c("x1", "x2", "x3", "x4")) # desired presentation order
sim.df <- data.frame(response, group) # make data frame
str(sim.df) # view structure of the object -- CHECKING
## 'data.frame': 281 obs. of 2 variables:
## $ response: num 32.9 35.9 26.7 30.7 32.6 ...
## $ group : Factor w/ 4 levels "x1","x2","x3",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(sim.df) # get simple stats -- CHECKING
## response group
## Min. :10.21 x1: 50
## 1st Qu.:23.31 x2: 51
## Median :28.82 x3:100
## Mean :28.88 x4: 80
## 3rd Qu.:34.12
## Max. :54.23
head(sim.df) # first 6 rows -- CHECKING
## response group
## 1 32.85 x1
## 2 35.91 x1
## 3 26.68 x1
## 4 30.68 x1
## 5 32.57 x1
## 6 41.70 x1
tail(sim.df) # last 6 rows -- CHECKING
## response group
## 276 26.55 x4
## 277 22.42 x4
## 278 24.63 x4
## 279 23.94 x4
## 280 22.41 x4
## 281 27.67 x4
# some descriptive stats by some basic programming in R -----
n <- tapply(response, group, length) # vector of sample sizes
Mean <- tapply(response, group, mean) # vector of means
SD <- tapply(response, group, sd) # SD
Min <- tapply(response, group, min) # minimum
Max <- tapply(response, group, max) # maximum
Median <- tapply(response, group, median) # median
iqr <- tapply(response, group, IQR) # IQR
CV <- 100 * SD/Mean # Coefficient of Variation %
SEM <- round(SD/sqrt(n), 4) # standard error of the mean
Q1 <- tapply(response, group, quantile, probs=.25) # 25th percentile
Q3 <- tapply(response, group, quantile, probs=.75) # 75th "
# stats per group ----- printed 2 different ways: cbind() vs. rbind() ----
print(cbind(n, Mean, SD, CV, SEM, Min, Max, Median, iqr, Q1, Q3), digits=4)
## n Mean SD CV SEM Min Max Median iqr Q1 Q3
## x1 50 32.62 4.253 13.04 0.6015 23.14 44.17 32.67 4.153 30.12 34.28
## x2 51 33.82 5.577 16.49 0.7809 20.60 54.23 33.49 6.415 30.05 36.47
## x3 100 27.93 10.414 37.29 1.0414 10.21 44.59 27.82 17.775 19.05 36.83
## x4 80 24.58 3.125 12.71 0.3494 19.08 29.74 24.50 4.718 22.02 26.73
print(rbind(n, Mean, SD, CV, SEM, Min, Max, Median, iqr, Q1, Q3))
## x1 x2 x3 x4
## n 50.000000 51.000000 100.00000 80.000000
## Mean 32.621600 33.822353 27.92920 24.582375
## SD 4.253431 5.576637 10.41364 3.124856
## CV 13.038695 16.488022 37.28585 12.711776
## SEM 0.601500 0.780900 1.04140 0.349400
## Min 23.140000 20.600000 10.21000 19.080000
## Max 44.170000 54.230000 44.59000 29.740000
## Median 32.670000 33.490000 27.82500 24.505000
## iqr 4.152500 6.415000 17.77500 4.717500
## Q1 30.125000 30.050000 19.05500 22.015000
## Q3 34.277500 36.465000 36.83000 26.732500
# get 512 x,y points for kernel density estimates ---------
d1 <- density(x1)
d2 <- density(x2)
d3 <- density(x3)
d4 <- density(x4)
xlm <- range(c(d1$x, d2$x, d3$x, d4$x)) # get min, max for plotting
ylm <- range(c(d1$y, d2$y, d3$y, d4$y))
plot(d1, type="n", main="", # get graph going
xlim=xlm, ylim=ylm, # global x,y plotting ranges
cex.lab=1.35, font.lab=2, cex.axis=1.2, # typography
xlab="Simulated data from the Normal and Uniform distribution")
lines(d1, col="black", lwd=3) # add kernel density estimates ---------------
lines(d2, col="red", lwd=3)
lines(d3, col="blue", lty=2, lwd=3)
lines(d4, col="green", lwd=3)
# legend with stats ---- by trial-and-error coding with mtext() -------------
cl <- c("black", "red", "blue", "green")
lne <- c(-2.5, -3.5, -4.5, -5.5)
ad <- c(.75, .81, .91, .98)
mtext(c("group", "N", "Mean", "SD"), line=-1, adj=ad, cex=.9)
mtext(c("x1", "x2", "x3", "x4"), line=lne, adj=.75, font=2 )
mtext(cl, line=lne, adj=.65, col=cl, cex=1.1)
mtext(c(n[1], round(Mean[1],2), round(SD[1],2)), line=-2.5,
adj=c(.82, .9, .98), cex=.9)
mtext(c(n[2], round(Mean[2],2), round(SD[2],2)), line=-3.5,
adj=c(.82, .9, .98), cex=.9)
mtext(c(n[3], round(Mean[3],2), round(SD[3],2)), line=-4.5,
adj=c(.82, .9, .98), cex=.9)
mtext(c(n[4], round(Mean[4],2), round(SD[4],2)), line=-5.5,
adj=c(.82, .9, .98), cex=.9)
Multiple kernel density estimates
The above graph of 4 kernel density estimates was done using base R graphics. Below, we use the
ggplot2 package by Hadley Wickham. It uses Paul Murrell’s grid package, not base R.
The ggplot2 visualization environment is vast and
powerful, but it’s beyond our time constraints and not my objective in
Bio 240. However I will demonstrate ggplot2 graphics across
the course, as time allows.
In my opinion, it’s best to learn base R graphics first, before
learning other visualization environments such as lattice
and ggplot2. The ggplot2 environment reflects
software that is nearly a programming language in and of itself. The base R
graphics environment often meets our scientific and aesthetic graphing
needs and does so within a simple and consistent coding pattern that
allows a high level of control.
library(ggplot2) # load the package
my.colors <- c(x1="black", x2="red", x3="blue", x4="green")
# 4 kernel density lines, NO FILL
p1 <- ggplot(sim.df, aes(x=response, color=group)) + geom_density(size=1.2) +
scale_colour_manual(values=my.colors)
print(p1)
# 4 kernel densities, FILLED
p2 <- ggplot(sim.df, aes(x=response, fill=group)) + geom_density(alpha=.2) +
scale_fill_manual(values=my.colors)
print(p2)
# 4 kernel densities, FILLED, x & y labels improved, xlim()
p3 <- ggplot(sim.df, aes(x=response, fill=group)) + geom_density(alpha=.2) +
xlab("Simulated data") + ylab("Density") + xlim(0, 60) +
scale_fill_manual(values=my.colors) +
#scale_fill_grey() +
#scale_fill_viridis_d() +
#scale_fill_brewer() +
#scale_fill_hue(l=45) +
#scale_fill_brewer(palette="Oranges") +
#scale_fill_brewer(palette="Greys") +
# scale_fill_brewer(palette="Blues") +
# scale_fill_brewer(palette="Set1") +
theme(axis.text=element_text(face="plain", colour="black", size=14),
axis.title=element_text(face="bold", size=20))
print(p3)
boxplot(response ~ group,
boxwex=.6, # width of boxes, default .8
col=terrain.colors(4), # color of box fill, default "gray"
ylim=c(8, 55), # specify min,max of y-axis to allow for sample sizes
xlab="Group", ylab="Response", # x,y labels
cex.lab=1.4, cex.axis=1.3, las=1) # typography
#mtext(paste("n =", n), adj=c(.125, .375, .625, .875)) # sample sizes atop graph
text(1:4, 8, paste("n =", n), cex=1.2) # sample sizes below boxplots
__________________
anova(lm(response ~ group, data=sim.df)) # traditional, one-way Anova
## Analysis of Variance Table
##
## Response: response
## Df Sum Sq Mean Sq F value Pr(>F)
## group 3 3513.7 1171.24 23.259 1.857e-13 ***
## Residuals 277 13948.8 50.36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aov.p <- anova(lm(response ~ group, data=sim.df))[1, 5] # extract p-value
cat("one-way Anova, full resolution p-value =", aov.p, "\n")
## one-way Anova, full resolution p-value = 1.856901e-13
TukeyHSD(aov(response ~ group), ordered=TRUE) # Tukey test, Multiple Comparisons
## Tukey multiple comparisons of means
## 95% family-wise confidence level
## factor levels have been ordered
##
## Fit: aov(formula = response ~ group)
##
## $group
## diff lwr upr p adj
## x3-x4 3.346825 0.5955828 6.098067 0.0099250
## x1-x4 8.039225 4.7326434 11.345807 0.0000000
## x2-x4 9.239978 5.9534061 12.526550 0.0000000
## x1-x3 4.692400 1.5155392 7.869261 0.0009511
## x2-x3 5.893153 2.7371242 9.049182 0.0000136
## x2-x1 1.200753 -2.4495437 4.851050 0.8302898
TukeyHSD(aov(response ~ group), ordered=TRUE)[1] # Tukey test, p-values, not rounded
## $group
## diff lwr upr p adj
## x3-x4 3.346825 0.5955828 6.098067 9.925005e-03
## x1-x4 8.039225 4.7326434 11.345807 7.641266e-09
## x2-x4 9.239978 5.9534061 12.526550 2.296408e-11
## x1-x3 4.692400 1.5155392 7.869261 9.510613e-04
## x2-x3 5.893153 2.7371242 9.049182 1.364760e-05
## x2-x1 1.200753 -2.4495437 4.851050 8.302898e-01
# default call is Holm-Bonferroni method, all pairwise, p-value adjustment
pairwise.t.test(response, group, p.adjust.method="holm")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: response and group
##
## x1 x2 x3
## x2 0.3959 - -
## x3 0.0005 9.2e-06 -
## x4 6.4e-09 2.3e-11 0.0037
##
## P value adjustment method: holm
# "planned" all pairwise comparisons, no adjustment of p-valyes
pairwise.t.test(response, group, p.adjust.method="none")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: response and group
##
## x1 x2 x3
## x2 0.39593 - -
## x3 0.00017 2.3e-06 -
## x4 1.3e-09 3.8e-12 0.00185
##
## P value adjustment method: none
kruskal.test(response, group)
##
## Kruskal-Wallis rank sum test
##
## data: response and group
## Kruskal-Wallis chi-squared = 72.594, df = 3, p-value = 1.188e-15
cat("Kruskal-Wallis test, full resolution p-value =",
kruskal.test(response, group)$p.value, "\n")
## Kruskal-Wallis test, full resolution p-value = 1.18767e-15
library(gplots)
plotmeans(response ~ group,
connect=FALSE, # connect means? default: TRUE
n.label=FALSE, # label points with sample sizes? default: TRUE
main="Means with 95% CI", # title
xlab="Group", ylab="Response", # x,y labels
# mean.labels=TRUE, digits=1, col="darkgray", # print means
cex.lab=1.4, cex.axis=1.3, cex.main=1.7, font.lab=2, # typography
col.main="blue", font.main=1) # typography
# 'source' the R Editor file of new function definitions
source("C:\\Users\\dwigh\\Desktop\\K funs ONE WAY.R")
Kincaid.perm.oneway(response, group, NS=5e5, describeF=FALSE)
##
## ----------------------------------------------------------------------------
##
## Kincaid.perm.oneway(y, x, NS, describeF, returnF) --> for one-way data
##
## The Call: Kincaid.perm.oneway(response, group, NS = 5e+05, describeF = FALSE)
##
## Test statistic: observed F from one-way Anova on observed data.
## Null hypothesis same as in classical one-way Anova. Ho: means are equal
##
## Start LOW on NS (e.g., 1e3), note time required then go HIGH.
## Thu Dec 15 21:36:54 2022
## ----------------------------------------------------------------------------
##
## TEST STATISTIC:
##
## observed F = 23.25887 in classical one-way Anova
##
## RESULTS: minutes elapsed = 4.9848 for NS = 5e+05 shuffles
##
## NGE = 0 --> number of null F-stats >= observed F-statistic
## NGE is size of right tail of null F distribution
##
## p = 1.999996e-06
## achieved after NS = 5e+05 permutations of y,
## holding constant, the grouping variable, x
##
## Thu Dec 15 21:41:53 2022
## ----------------------------------------------------------------------------
##
Continuing with this fake, simulated one-way data of 4 groups, we are always interested in confidence intervals of the means. Below find the traditional, 95% CI for the 4 means calculated using normal theory (t-distribution).
CImeans.normaltheory(response, group, prob=.95) # from: K funs ONE WAY.R
##
##
## Means with 95% confidence intervals
## using t-distribution
## LB UB Mean N
## x1 31.41279 33.83041 32.62160 50
## x2 32.25390 35.39081 33.82235 51
## x3 25.86291 29.99549 27.92920 100
## x4 23.88697 25.27778 24.58237 80
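For any one group, the same normal-theory interval can be reproduced by hand with the t quantile. A minimal sketch for group x1 (a quick illustration, not the code inside CImeans.normaltheory()); the bounds should match the x1 row above:
y.x1 <- response[group == "x1"]        # data for group x1
n1 <- length(y.x1)                     # sample size
se <- sd(y.x1) / sqrt(n1)              # standard error of the mean
tcrit <- qt(.975, df=n1 - 1)           # t quantile for a 95% CI
mean(y.x1) + c(-1, 1) * tcrit * se     # lower and upper bounds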
How would the bootstrap confidence intervals of these 4 means compare to the traditional confidence intervals? Notice that the confidence intervals are astonishingly similar – the bootstrap just works! My function below uses the bootstrap percentile method to calculate the CI from the bootstrap distribution of the means.
bootCImeanOneWay(response, group, NS=2e5) # from: K funs ONE WAY.R
##
##
## = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
##
## OUTPUT from: bootCImeanOneWay() by Prof. Dwight Kincaid, PhD
##
## The Call: bootCImeanOneWay(response, group, NS = 2e+05)
##
## time stamp: Thu Dec 15 21:42:07 2022
##
## Note: the vector index is bootstrapped
##
## BOOTSTRAP underway, patience...
##
## Start LOW on NS, note time required, then go HIGH, as time is
## proportional. All the while, realize that the CI results from
## multiple runs can be averaged
##
## minutes elapsed = 0.9155
##
## ---------------
## 95% confidence intervals of group means
##
## x1 x2 x3 x4
## 2.5% 31.46245 32.33837 25.89000 23.89742
## 97.5% 33.81523 35.40476 29.96661 25.26944
## Mean 32.62160 33.82235 27.92920 24.58237
## N 50.00000 51.00000 100.00000 80.00000
## Min 23.14000 20.60000 10.21000 19.08000
## Max 44.17000 54.23000 44.59000 29.74000
## NS 200000.00000 200000.00000 200000.00000 200000.00000
##
## ---------------
## 99% confidence intervals of group means
##
## x1 x2 x3 x4
## 0.5% 31.10783 31.87571 25.25682 23.68519
## 99.5% 34.19861 35.95000 30.60965 25.48390
## Mean 32.62160 33.82235 27.92920 24.58237
## N 50.00000 51.00000 100.00000 80.00000
## Min 23.14000 20.60000 10.21000 19.08000
## Max 44.17000 54.23000 44.59000 29.74000
## NS 200000.00000 200000.00000 200000.00000 200000.00000
## ---------------
##
## time stamp: Thu Dec 15 21:43:02 2022
##
## = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
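To make the percentile method concrete, here is a minimal sketch of a bootstrap CI for the mean of a single group (a quick illustration of the idea, not the code inside bootCImeanOneWay()):
y.x1 <- response[group == "x1"]                                # data for group x1
NS <- 1e4                                                      # number of bootstrap resamples
boot.means <- replicate(NS, mean(sample(y.x1, replace=TRUE)))  # resample, get mean
quantile(boot.means, probs=c(.025, .975))                      # 95% percentile CI for the mean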
Effect Size – Cohen’s d is a widely used, standardized effect size, as are R-squared and Variance Components. One may wish to report these “statistics” with their 95% confidence intervals, but I am not aware of formulaic methods for this that have much support among statisticians. Bootstrap to the rescue.
bootCohenD.oneway(response, group, NS=2e5)
##
## -----------------------------------------------------------------
## OUTPUT from bootCohenD.oneway( ) by Dwight Kincaid, PhD
##
## The Call:
## bootCohenD.oneway(response, group, NS = 2e+05)
##
##
## Observed, Cohen's D: 0.4983125 in one-way Anova
## STATS for bootstrap distribution of Cohen's D
## n: 2e+05 min, max: 0.311712 , 0.7842091
## mean: 0.5085103 median: 0.5068693
## SD: 0.04766198 skew: 0.2157265
##
## NS: 2e+05 bootstrap samples minutes: 4.464
##
## Nonparametric, Percentile Bootstrap Confidence Intervals
## for Cohen's D in one-way Anova
##
## 90% CI 95% CI 99% CI
## lower-bound 0.4330335 0.419604 0.3937449
## upper-bound 0.5896419 0.606727 0.6418604
##
## Time stamp: Thu Dec 15 21:47:30 2022
## -----------------------------------------------------------------
bootRsq.oneway(response, group, NS=2e5)
boot.vc.oneway(response, group, data.name="Fake data", NS=1e5)
##
## -----------------------------------------------------------------
## BOOTSTRAP Variance Components in one-way Anova
##
## time stamp: Thu Dec 15 21:47:52 2022
## The Call:
## boot.vc.oneway(response, group, data.name = "Fake data", NS = 1e+05)
##
## Data set: Fake data
##
## NS = 1e+05 bootstrap resamples from observed data frame
## Note: the data frame 'row index' was resampled
##
## Using the 'percentile method'
## the 95% confidence intervals are --
##
## Explained VC% observed: 24.61714
## 2.5% 97.5%
## 18.77828 32.88067
##
## STATS for bootstrap distribution of EXPLAINED VC%
## min, max: 12.64455 , 42.30946
## mean: 25.48481 median: 25.35333
## SD: 3.601187 skew: 0.1918715
##
## Error VC% observed: 75.38286
## 2.5% 97.5%
## 67.11933 81.22172
##
## CPU minutes: 2.948 for bootstrap resampling
##
## Output from: Kincaid.boot.vc.oneway() by Dwight Kincaid, PhD
## -----------------------------------------------------------------
Statistical power analysis – There are 2 ways: 1) traditional power analysis using normal theory after Cohen (1988) and 2) bootstrap power analysis. A power curve is the best way to visualize the tabular output of any type of power analysis. Bootstrap power analysis uses no formulas and makes no distributional assumptions. Traditional power analysis operates within the assumptions of the parametric method in question (one-way Anova, here); bootstrap power analysis does not.
The pwr package has a function for traditional power analysis after Cohen (1988) by solving the power equations (non-central F-distribution) and assuming equal sample size per group, which our data does not reflect.
My wrapper function calls function pwr.anova.test() from
this package to get power across a range of sample sizes and for alpha
of .05, .01 and .001, and plot power curves with the table of results
sent to the R Console.
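For orientation, a single direct call to pwr.anova.test() looks like this; the values are illustrative only (k = number of groups, n = sample size per group, f = Cohen’s effect size):
library(pwr)                                         # load the package
pwr.anova.test(k=4, n=10, f=0.5, sig.level=.05)      # solve for power
pwr.anova.test(k=4, f=0.5, sig.level=.05, power=.8)  # solve for n per group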
powercurve.Cohen.oneway(response, group, low.n=4, high.n=24, by.n=2,
data.name="sim fake data, 4 groups, total n=281")
## total.N N.per.group pwr.at 05 pwr.at 01 pwr.at 001 Cohen's D
## [1,] 16 4 0.2669 0.0900 0.0150 0.498
## [2,] 24 6 0.4338 0.1936 0.0468 0.498
## [3,] 32 8 0.5838 0.3183 0.1020 0.498
## [4,] 40 10 0.7064 0.4480 0.1794 0.498
## [5,] 48 12 0.8000 0.5697 0.2735 0.498
## [6,] 56 14 0.8678 0.6758 0.3765 0.498
## [7,] 64 16 0.9148 0.7630 0.4806 0.498
## [8,] 72 18 0.9463 0.8314 0.5792 0.498
## [9,] 80 20 0.9669 0.8829 0.6679 0.498
## [10,] 88 22 0.9799 0.9205 0.7442 0.498
## [11,] 96 24 0.9880 0.9470 0.8074 0.498
#powercurve.unbalanced.oneway(response, group, low.n=20, high.n=92, by.n=4,
# data.name="sim fake data, 4 groups, total n=281")
To do bootstrap power analysis I resample the row
indices of the data frame and for each sample the p-value from one-way
Anova is recovered. The arguments in the function call
(low.n, high.n, by.n) specify the
range of sample sizes for which power is calculated for alpha of .05,
.01 and .001, plotted as power curves with the table of results sent to
the R Console.
NS is the number of bootstrap samples of the data frame for each sample size. Here, statistical power is simply the proportion of the NS number of Anova F-tests that were significant at .05, .01 and .001 level, at each sample size.
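The core of that idea, for a single total sample size, can be sketched in a few lines of base R (a quick illustration, not the code inside boot.powercurve.oneway()):
NS <- 1e3                                              # bootstrap resamples; start LOW
total.n <- 40                                          # one total sample size to assess
pvals <- replicate(NS, {
  i <- sample(nrow(sim.df), total.n, replace=TRUE)     # resample row indices
  anova(lm(response ~ group, data=sim.df[i, ]))[1, 5]  # one-way Anova p-value
})
mean(pvals < .05)                                      # estimated power at alpha = .05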
boot.powercurve.oneway(response, group, NS=2e4, low.n=20, high.n=92, by.n=4,
data.name="sim fake data, 4 groups, total n=281")
##
##
## Output from Kincaid function: boot.powercurve.oneway()
##
## DATA: sim fake data, 4 groups, total n=281
##
## observed total N: 281
## Bootstrap resamples
## for each, total sample size: 20000
## Minutes needed: 8.93
## total.N pwr.at .05 pwr.at .01 pwr.at .001
## [1,] 20 0.292 0.124 0.042
## [2,] 24 0.372 0.158 0.050
## [3,] 28 0.455 0.200 0.062
## [4,] 32 0.547 0.255 0.076
## [5,] 36 0.627 0.316 0.097
## [6,] 40 0.712 0.396 0.130
## [7,] 44 0.778 0.466 0.162
## [8,] 48 0.838 0.545 0.210
## [9,] 52 0.881 0.622 0.265
## [10,] 56 0.918 0.689 0.320
## [11,] 60 0.944 0.752 0.382
## [12,] 64 0.965 0.807 0.449
## [13,] 68 0.976 0.848 0.505
## [14,] 72 0.985 0.884 0.577
## [15,] 76 0.991 0.917 0.631
## [16,] 80 0.995 0.939 0.691
## [17,] 84 0.997 0.958 0.752
## [18,] 88 0.998 0.969 0.789
## [19,] 92 0.999 0.978 0.833
Bayes Factor Anova – The contributed R package,
BayesFactor has functions for Bayes Factor Anova,
correlation, linear regression and more. My function
BayesFactorAnova.oneway() is merely a wrapper around
function anovaBF() from BayesFactor. Here,
Bayes Factor Anova returned about 39 billion times more support for
heterogeneity (H1) among the 4 means than for homogeneity (Ho) among the
means (the ±0.02% printed beside the Bayes factor is the proportional error
of the estimate, not a credible interval).
It’s helpful that Bayesian statisticians have written procedures that operate sort of within the conceptual framework of familiar, statistical models, including Anova; see Gelman, Andrew. (2005) Analysis of Variance — why it is more important than ever. Annals of Statistics, 33, pp. 1-53.
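Because my function is just a wrapper, the underlying direct call to anovaBF() looks like the sketch below (output omitted here):
library(BayesFactor)                     # load the package
anovaBF(response ~ group, data=sim.df)   # Bayes factor: group model vs. intercept only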
BayesFactorAnova(response, group)
##
## -----------------------------------------------------------------
## BAYES FACTOR one-way ANOVA
## using library(BayesFactor)
## anovaBF() is for fixed treatment or random effects Anova
##
## A Bayes factor is a multiplicative factor of support for H1 vs. Ho
## In other words, support for means being different vs. being equal.
##
## Documentation: > help(package=BayesFactor) >?anovaBF
## -----------------------------------------------------------------
##
## You specified an Anova model that's 'fixed effects'
##
## Bayes factor analysis
## --------------
## [1] x : 39308364304 ±0.02%
##
## Against denominator:
## Intercept only
## ---
## Bayes factor type: BFlinearModel, JZS
Under development
R and R Markdown enable interactive graphs. The plotly package makes ggplot2 graphs interactive. Hover over the points, click atop the graph for menu options, and click on the legend to de/select continents. Circle size is proportional to population size in 2007, for Life Expectancy vs. GDP per person in US dollars (first graph) and Life Expectancy vs. Population Size (second graph).
The data are from https://www.Gapminder.org, founded in 2005 in Stockholm by Dr. Hans Rosling (Professor of International Health, Karolinska Institute) and his group.
He became world famous through his TED talks and videos, some of which
we’ll view in Bio 240.
Professor Rosling’s last book was
Factfulness: https://www.gapminder.org/factfulness-book/
Unfortunately, this type of interactive graph is beyond the time frame
of Bio 240.
library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.3
library(gapminder)
## Warning: package 'gapminder' was built under R version 4.2.3
p1 <- gapminder %>%
filter(year==2007) %>%
ggplot( aes(gdpPercap, lifeExp, size = pop, color=continent)) +
geom_point() +
scale_x_log10() +
theme_bw()
ggplotly(p1)
Put cursor inside graph to interact & atop graph for menu
p2 <- gapminder %>%
filter(year==2007) %>%
ggplot( aes(pop, lifeExp, size = gdpPercap, color=continent)) +
geom_point() +
scale_x_log10() +
theme_bw()
ggplotly(p2)
Put cursor inside graph to interact & atop graph for menu
Under development
Under development
This semester you will be required to write many R programs
(i.e., R Editor files, e.g., foo.R).
R Editor files are just plain text files of R code that we submit
top-down to the R Console for interactive execution.
Each of these R Editor files this semester is motivated by a different
biological data set for which graphical visualization and
statistical analysis are desired.
In Bio 240, statistical
analysis refers to the exploratory and to the formal; to the
parametric and nonparametric; the traditional and the computationally
intensive (permutation, bootstrap, Monte Carlo simulation); and to the
Bayesian.
Graphical visualization obviously refers to
graphs. R leads the world in the quantity, quality and diversity of its
graphical programming environments.
The main focus of graphics
in Bio 240 is on what we call base R graphics. But, I will
introduce graph environments (e.g., hexbin, lattice, ggplot2) based
on
the famous grid package by Paul Murrell or those based on
JavaScript (e.g., plotly).
Initially, all this can be
overwhelming if you’ve had no computer programming or statistical
experience. However, it’s my job to make it easy by helping you get a
foothold in R, including by starting small.
In our R
coding workflow, it’s best to start small and
fail faster. The sooner we accept and learn from our
failures (R code that throws errors and/or does the
wrong thing) the quicker we can turn them into
successes (reliable code). This holds true for R
programmers who are beginners as well as for those who are experts.
Every class period, you and I will write, debug and run code in R Editor files. We practice top-down coding and top-down debugging, leading to reliable, clean and readable code.
You will learn to appreciate the adage: To lavish your code with comments is to code wisely.
Your R Editor files must conform to my coding conventions. After Bio 240 you may write all the illiterate, ugly code you wish – code that, if given to others or even when viewed by you 6 months later, will likely appear as gibberish.
My mandatory coding conventions are in the treasured spirit of literate programming.
In Bio 240 for R, these conventions include: valid & proper file names, comments, line spacing, line width of 85 characters max, indentation, content of first line of file, content of last line of file, valid names of objects, wise names of objects.
The ability to write, to debug and to run computer programs in R (R Editor files) is a major emphasis in Bio 240 and is required on each of the 3 Tests. It’s an accomplishment you will take with you beyond this course, and into your immediate and future, career development.
Using R and creating useful, R Editor files requires patience, attention to detail, much practice and comfort with trial-and-error. It’s that way with any programming language.
This past summer, as I struggled with programming in R, HTML, CSS, JavaScript, LaTeX and R Markdown, it was only by patience, practice and trial-and-error that anything was accomplished.
Sure, I am supposed to be an expert in R, but every day I write R code that throws errors at the R Console, I make poor coding decisions and my programming workflow is unwise. But, with patience and trial-and-error, I make progress in coding and getting reliable output.
Professors say writing is revision. The same holds for typing computer programs. If it takes an hour to write a good sentence in English, so what? It often takes me an hour. If it takes an hour to write several lines of proper R code, so what? In life, the good and the proper, are not easy and may not happen quickly.
Regardless of what a poem means, it must look good on the page. The same holds for computer programs. Writing a computer program is as creative, artistic and fulfilling as any human activity.
We seek R Editor files (programs: small or large) that are clear, clean, readable, printable, well-tested, bug-free, appropriately commented and a joy to view and to run. Those who work hard in Bio 240, will create R Editor files that make them proud.
As the semester proceeds in Bio 240, you may find yourself crafting R programs that will advance your career as you communicate these programs in your research lab, for employment opportunities or in your applications to professional and graduate schools.
R Editor files are the intellectual capital you craft, accumulate, use in Bio 240 and take with you, beyond Lehman College.
“We wanted users to be able to begin in an interactive environment, where they did not consciously think of programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming when the language and system aspects would become more important.” — John Chambers
Given below, please inspect the very small x,y data set on a 65
year old female patient being treated for Osteoporosis with Alendronate,
a nitrogen-containing bisphosphonate that binds to bone
surfaces and promotes an increase in bone density.
Our task is to
save this medical data in an R Editor file, graph it, explore it and
perform statistical analysis.
A graph incorporating
statistical analysis would inform her Physician on whether the treatment was
effective across the 7.7 years and whether it should be continued.
Obviously, this is important to the patient and to the Doctor’s
Hippocratic Oath.
In general, for very small sample sizes
graphical visualization remains paramount. Often, we simply graph
small data and let it speak for itself; this has always been acceptable
in Science and in peer reviewed journals.
Also, for small sample
sizes we proceed with statistical analysis using caution, realizing it
may have little, if any, inferential value beyond the data at hand.
“A 65 year old woman with low bone density in 2002 was treated
with alendronate through the year 2009. Bone density was measured
irregularly over this period.” Source: p. 544-545 in
Rosner, Bernard. 2016. Fundamentals of Biostatistics, 8th
ed. Cengage, Boston. 927 pages. This textbook is used at Harvard’s
medical school and related graduate programs.
Time is months since baseline. At each visit, bone density (g/cm2) was measured by DEXA scan, for lumbar spine and femoral neck (hip). Time is our x variable, paired with bone density as y.
| Visit | Time (months) | Lumbar Spine g/cm2 | Femoral Neck (hip), g/cm2 |
|---|---|---|---|
| 1 | 0 | 0.797 | 0.643 |
| 2 | 8 | 0.806 | 0.638 |
| 3 | 18 | 0.817 | 0.648 |
| 4 | 48 | 0.825 | 0.674 |
| 5 | 64 | 0.837 | 0.640 |
| 6 | 66 | 0.841 | 0.676 |
| 7 | 79 | 0.886 | 0.651 |
| 8 | 92 | 0.881 | 0.680 |
Here I demonstrate best practices on how to establish R
Editor files to store, graph and analyze data in Bio 240.
R Editor files are also called computer programs or scripts.
The content (lines of code) of an R Editor file is submitted in top-down
fashion, to the R Console, for interactive execution. Later, we will
source entire R Editor files constructed to run
non-interactively; but, we first must learn interactive
submission to the R Console, of single lines of code and of
small code blocks, debugging and saving along the way.
This R
Editor file created below, benefits from our mandatory, coding
conventions in Bio 240.
Our coding conventions
include:
1. valid and proper file names
2. comments
3. line spacing
4. line width: 85 characters max
5. indentation
6. content of first line of file
7. content of last line of file
8. valid names for objects
9. wise names for objects
We begin by writing the first line and the last line of the R Editor file. We save it with the student’s name as part of the filename, as required in Bio 240 for all R Editor files including for homework and on Tests.
Let’s name it, bone density Arlo.R realizing that all valid and proper, R Editor files have .R as the file extension. Aardvark.R would be a valid name (would not throw an error in R) but would be an unwise file name.
Before launching into writing lines of R code, it’s wise to take this first step of creating the file, with an appropriate name and made up of only 2 lines: a first line of comments and a final line of comments.
This is what our R Editor file presents, as first established and saved. It does nothing, it returns nothing.
# bone density Arlo.R September 2, 2021 Arlo Atz, BS
# end. bone density Arlo.R
Inspect the first line of the above file.
It’s solely, a comment because it begins with the
hashtag # for which R interprets everything to the right of
it (on that line) as a comment and hence ignored. Comments are
essential; comments help humans; R ignores them. More about
comments later, as they have multiple uses.
What is the purpose of the first line of our R Editor file above?
Inspect the last line of the above file. It’s also solely, a comment.
What is the purpose of the last line of the file?
# end. foo Arlo.R is at the bottom. We are now ready to add R code between these 2 lines, typing it in top-down fashion, debugging top-down, and saving along the way.
Before storing the data as numeric vectors it’s wise to enter a few lines of comments explaining what is going on. It might seem obvious “what’s going on” but months later, the layout of the data and its source, may not be obvious.
Use the c() function to store the time variable as
numeric vector time.
Store lumbar spine bone density as spine and femoral neck
(hip) bone density as hip. I could have named these 3
numeric vectors Yonkers, Bronx and Australia which would not
throw an error in R, but it would be unwise
naming.
We seek short, informative names for variables and objects. I prefer simple names of lowercase.
The numeric vector is the fundamental data structure in R.
The c() function is the most commonly used function
in R.
The c() function combines its
arguments into a vector. Here, the args are data with each datum
separated by a comma. Vectors may be numeric or character. Vectors may
be combined to make more complicated data structures: factor, matrix,
data frame, list, array.
We call the cbind() function just to print a 3
column data matrix at R Console to allow a check for accuracy against
the original data. Wrong data entry generates wrong answers.
The data are now stored in objects that allow graphing and statistical analysis to proceed. Everything in R is an object whether data, functions, output, graphs or whatever. Think of R objects as containers.
To use R is to call functions (John Chambers).
# bone density Arlo.R August 26, 2021 Arlo Atz, BS
# Bone density was measured over 92 months in a 65 year old woman with low bone
# density, treated with Alendronate, 2002 through 2009. Bone mineral density
# (g/cm^2) was measured for lumbar spine and femoral neck (hip) (Rosner, B. 2016).
time <- c(0, 8, 18, 48, 64, 66, 79, 92) # months from baseline
spine <- c(.797, .806, .817, .825, .837, .841, .886, .881) # spine bone density
hip <- c(.643, .638, .648, .674, .64, .676, .651, .68) # hip " "
# view data at R Console to check carefully, versus original
cbind(time, spine, hip)
# end. bone density Arlo.R
The code chunk above returns to the R Console the text
output from the call to cbind(), which binds our 3 numeric
vectors by columns into a matrix object and prints it
at the R Console, as given below.
We must always check that the data as saved in any computer program, exactly reflects our original data. Proofread it against the original. In Bio 240 and in any environment – errors in data entry produce wrong results. Obviously, this also holds true for Tests in Bio 240 where zero credit is given for wrong answers caused by student errors in data entry.
If desired, > time would print content of the
time vector, > spine the spine vector and
> hip the hip vector.
The 2 dimensional index that’s printed at R Console, [row, column] is not part of the data matrix; it’s put there to inform the viewer.
[1, ] indicates row 1, all columns. A
missing integer in the row slot means all rows while a missing
integer in the column slot means all columns.
Consider [4, 2]: it returns 0.825, the 4th row, 2nd
column. Vectors are single indexed, [ ], while matrices and
data frames are double indexed, row by column,
[ , ].
In Bio 240 we use indexing for dealing with vectors, matrices and data frames. We also use indexing to extract information from objects returned from function calls. Indexing in R is just High School, algebraic thinking.
Output from cbind(time, spine, hip) :
## time spine hip
## [1,] 0 0.797 0.643
## [2,] 8 0.806 0.638
## [3,] 18 0.817 0.648
## [4,] 48 0.825 0.674
## [5,] 64 0.837 0.640
## [6,] 66 0.841 0.676
## [7,] 79 0.886 0.651
## [8,] 92 0.881 0.680
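To practice indexing on this matrix, try a few lines like these (a sketch, assuming the three vectors above are in the workspace):
m <- cbind(time, spine, hip)   # save the matrix in an object
m[4, 2]                        # row 4, column 2 --> 0.825
m[1, ]                         # row 1, all columns
m[ , 3]                        # all rows, column 3 (the hip data)
spine[4]                       # vectors take a single index --> 0.825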
My practice is to teach you R coding by demonstrating the natural
workflow and evolution of short programs. (This is much better than only
showing you finished, polished R programs.)
Ideally, we do this
at the same time, as the Professor and the students type the same code,
in what I call collaborative coding.
Collaborative
coding is easy in the physical classroom yet more difficult in the
online classroom, but still possible.
Computer programs develop
by stages, by trial-and-error, by attention to detail, by coding
experiments and by top-down debugging along the way.
This
process results in clean, readable R code that executes without error
and delivers mature graphs and trustworthy statistical
results.
It’s utterly foolish to try to cut to the desired
graphical and statistical end product, right from the programming
get-go. Let this sentence burn in.
Writing programs in R is similar to writing essays in English.
Both involve symbolic thinking, outlines, rough drafts, multiple
revisions, documentation and attention to detail.
The plot() function is the workhorse graphing function
in base R. It makes a variety of different graphs and it takes many
arguments. Here, we plot the x,y data in a minimal call:
plot(x, y), where the x arg is the name of our
x numeric vector and y the name of the corresponding y
numeric vector. At the R Console, >?plot brings up
extensive HTML documentation for this extremely important graphing
function.
# ------------------------
# preliminary scatterplots -------------------------------
# ------------------------
plot(time, spine) # 'spine' scatterplot: bone density by time
plot(time, hip) # 'hip' scatterplot: " " " "
The first line of code above, delivers a scatterplot for spine bone density (y) by time (x) – left graph below.
The second line of code, does the same for hip bone density by time – right graph. Examine these two graphs.
But, we want both sets of x,y points on the same graph and we want point-and-line graphs not scatterplots. Yet, we are encouraged and informed by these rough graphs. We’re making progress in visualizing the data.
Notice the scale of the y-axis on both scatterplots above. Clearly, spine bone density is much greater than hip bone density in this patient with Osteoporosis.
For both sets of points to fit in the same graph we need global
min, max for bone density. A call to the range() function
delivers. This function takes a numeric vector as its arg and returns
the minimum and maximum of that vector. We use
c(spine, hip) to create a numeric vector for the combined,
spine and hip density data. In R, we may call functions within
functions.
The new line of code below, delivers min, max (0.638 to 0.886 g/cm2) for the data set.
# both lines must fit in the same graph, so get global min,max for y
range(c(spine, hip)) # min, max: .638 to .886
## [1] 0.638 0.886
Now our task is to make a rough, minimal, draft quality, point-and-line graph for spine and for hip bone density. There are always different ways to accomplish the same thing in any programming language, just as there are different ways to say “I love you” in any spoken language, and experts disagree on the best way to do it.
In the first line of code below, we call
plot() for the spine x,y data and then use the
points() function in the second line to
add the points for the hip. This gives us full control
over the graph.
The arg type="b" specifies that
both points and lines are called for. (The default for
the type arg is type="p" for points only;
type="l" delivers only the line, "c" draws the lines of
"b" alone, with gaps where the points would be, "s" makes stair steps,
"h" draws vertical lines, and "n" makes no points or
lines.)
In the call to plot() we change the default scale
for the y-axis by using the argument ylim=c(.6, .9) or by
the argument ylim=range(c(spine, hip)); the latter is
preferable, in case the data vectors are changed upstream in the
program. The default for the x-axis scale looks fine but we could change
it with the xlim=() arg, if desired.
In the second line of code, the call to
points() adds the x,y points for hip density by time. The
pch=16 arg specifies black, solid circles. If this arg is
omitted, as we did in the call to plot(), then the default
value of pch=1 is used, which returns an open circle for the
points. (pch can take integers from 0 to 25, and beyond;
see >?points for documentation.)
# minimal graph as proof of concept --------------------------------
plot(time, spine, type="b", ylim=range(c(spine, hip))) # spine data
points(time, hip, type="b", pch=16) # add hip data
Draft quality is what we have in the graph just above. It needs enhancement to make it publication quality or presentation quality. But, the hard work is over. The enhancements are relatively easy, given patience and some trial-and-error.
Axis labels need to be made proper, including with units of measurement and with larger font size. We might wish to rotate the y-axis tick labels. We often increase the font size of tick labels. Larger font size helps when graphs are reduced in print and helps when viewed on a poster or projected on a screen.
Publication quality refers to a graph intended for publication in a peer-reviewed scientific journal. Presentation quality refers to a graph intended for a professional, poster presentation or live seminar.
Publication quality graphs have no title and no personalized time stamp on the graph. Presentation quality graphs may enjoy informative titles and personalized time stamps to assign credit/blame for the graph and data in question.
We comment-in or comment-out args and lines of code to switch easily between publication and presentation quality graphs. More on that later. In short, mastering comment-in and comment-out is essential in coding across all programming languages.
Draft quality is self-explanatory. It’s where we
begin with a graph. Draft quality is easy, happens quickly and is often
sufficient for the bench and for viewing among your research
collaborators.
Having established above, by minimal calls to both
plot() and to points() that a pretty decent
graph is made, we now expand these two calls by using more
arguments.
The args in the code chunk below, are partly explained by comments in
the code itself. This is wise programming as it saves one the trouble of
>?plot to access the long, HTML documentation for the
plot() function.
Remember: arguments are separated by commas, except for the last arg in the call.
Some of the args below are self-explanatory, others have suggestive names, while others are oblique.
The best way to learn the arguments for a function is to use them, change them and see what happens in the output.
A nice thing is that most of the functions in base R
graphics, e.g., plot, points,
lines, abline, boxplot,
stripchart, hist, barplot,
text, mtext, segments,
arrows, title, legend,
pie, symbols, dotchart,
grid, polygon, coplot,
sunflowerplot – share many of the same arguments.
The main arg is a string for graph title, if
desired; and it has typographic args, cex.main,
col.main and font.main for font size, font
color and font type (regular, bold, bold italic), respectively.
The xlab and ylab args allow custom
labels for x and y axis and they have typographic args,
cex.lab, cex.axis, font.lab, and
more.
The type arg specifies whether points alone are
plotted (type="p"), points connected by lines
(type="b"), lines connecting the points but with the points not plotted
(type="l"), nothing at all (type="n"), or
(type="h") a vertical bar from the bottom up to each point’s y
coordinate.
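A quick way to internalize the type arg is to redraw the same x,y data with different values; for example (assuming time and spine are in the workspace):
plot(time, spine, type="l")   # line only
plot(time, spine, type="h")   # vertical lines from the x-axis up to each point
plot(time, spine, type="b")   # both points and connecting lines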
Change the values for the col, cex and
las args and see what happens.
# ----------------------------------------------------
# enhanced, point and line graph: 2 sets of x,y points --------------
# with each line labeled within the graph
# ----------------------------------------------------
plot(time, spine,
col="black", # color of points; default, col="black"
cex=1, # size of symbol plotted; default, cex=1
ylim=range(c(spine, hip)), # adjust y-axis range
xlim=c(-5, 100), # " x-axis "
las=0, # flip y-axis tick labels, las=1; default: 0
type="b", # default: type="p" points only; "b" both; "l" line only; "n", "h"
main="Bone density with Alendronate treatment\n65 year old female patient",
xlab="Time from Baseline, months",
ylab="Bone Density, g/cm^2",
font.lab=2, cex.lab=1.4, cex.axis=1.2, # label & axis typography
cex.main=1.8, col.main="darkred" # title typography
)
After the call to plot, the points function
is called for the hip x,y data. Here, we add the pch arg
which takes integers from 0 to 25 and more. When an arg is not
explicitly used in a function call then the default value for that arg
(specified in that function’s formal definition) is used, and the
default for pch is pch=1 which specifies that the x,y
coordinates are open circles; pch=16 returns a solid circle
and this clearly distinguishes the spine from the hip data in the
graph.
# add 2nd set of points with connecting line --------
points(time, hip, pch=16, type="b")
The text() and mtext() functions allow us to
put text within the graph (text) and in the margins of the
graph window (mtext), although mtext may also be used
to draw text within the Cartesian coordinate space. It pays to learn how
to use these 2 functions.
For text() the first 2 args are x,y coordinates where
you want the text to be drawn; the next arg is the string you want
drawn. The args font, cex and col
control font type (default: font=1 for
regular, 2 for bold, 3 for italic,
4 for bold italic), font size (default:
cex=1, cex=2 is twice as large, etc.) and
font color (default: col="black").
Here we use the mtext() function to place a personalized
time stamp along the lower right edge of the graph window, using reduced
font size and gray color. This gives credit/blame to the investigator!
For mtext() the first arg is the string you want to draw,
the side arg is either 1, 2, 3 or 4 for lower side, left
side, top side, right side of graph window, respectively. The
adj arg specifies the left-right position of the text,
where adj=0 is left justified and adj=1 right
justified. The line arg has default of line=0
which is at the graph edge; line=4 is four lines below the
bottom edge of the graph. The cex arg controls font size,
as explained above, while the col arg controls text
color.
# label each line ------------------------------------------
text(20, .85, "Lumbar Spine", font=1, cex=1.2, col="darkred")
text(20, .69, "Femoral Neck (hip)", font=1, cex=1.2, col="darkred")
# add personalized time stamp; good for presentation quality graph ---------
mtext(paste("Arlo: ", date()), side=1, adj=1, line=4, cex=.8, col="gray")
The entire, R code chunk is below. It returns the enhanced graph
– a graph one would be proud to display and to submit.
# ----------------------------------------------------
# enhanced, point and line graph: 2 sets of x,y points --------------
# with each line labeled within the graph
# ----------------------------------------------------
plot(time, spine,
col="black", # color of points; default, col="black"
cex=1, # size of symbol plotted; default, cex=1
ylim=range(c(spine, hip)), # adjust y-axis range
xlim=c(-5, 100), # " x-axis "
las=0, # flip y-axis tick labels, las=1; default: 0
type="b", # default: type="p" points only; "b" both; "l" line only; "n", "h"
main="Bone density with Alendronate treatment\n65 year old female patient",
xlab="Time from Baseline, months",
ylab="Bone Density, g/cm^2",
font.lab=2, cex.lab=1.4, cex.axis=1.2, # label & axis typography
cex.main=1.8, col.main="darkred" # title typography
)
# add 2nd set of points with connecting line --------
points(time, hip, pch=16, type="b")
# label each line ------------------------------------------
text(20, .85, "Lumbar Spine", font=1, cex=1.2, col="darkred")
text(20, .69, "Femoral Neck (hip)", font=1, cex=1.2, col="darkred")
# add personalized time stamp; good for presentation quality graph ---------
mtext(paste("Arlo: ", date()), side=1, adj=1, line=4, cex=.8, col="gray")
Presentation quality graph with title, the 2 lines labeled
within graph and personalized time stamp.
Question: How do we modify the above code to make
a Publication quality graph – one without a title and without the
personalized time stamp?
Answer: We comment out the
main arg in plot() and we comment
out the call to mtext() for the personalized time
stamp.
In your interactive R coding workflow you must master comment out (adding the hashtag to the left of code so that R ignores it) and comment in (removing the hashtag), which activates the code.
plot(time, spine,
col="black", # color of points; default, col="black"
cex=1, # size of symbol plotted; default, cex=1
ylim=range(c(spine, hip)), # adjust y-axis range
xlim=c(-5, 100), # " x-axis "
las=0, # flip y-axis tick labels, las=1; default: 0
type="b", # default: type="p" points only; "b" both; "l" line only; "n", "h"
#main="Bone density with Alendronate treatment\n65 year old female patient",
xlab="Time from Baseline, months",
ylab="Bone Density, g/cm^2",
font.lab=2, cex.lab=1.4, cex.axis=1.2, # label & axis typography
cex.main=1.8, col.main="darkred" # title typography
)
# add 2nd set of points with connecting line --------
points(time, hip, pch=16, type="b")
# label each line ------------------------------------------
text(20, .85, "Lumbar Spine", font=1, cex=1.2, col="darkred")
text(20, .69, "Temporal Neck (hip)", font=1, cex=1.2, col="darkred")
# add personalized time stamp; good for presentation quality graph ---------
#mtext(paste("Arlo: ", date()), side=1, adj=1, line=4, cex=.8, col="gray")
Publication quality graph
# ------------------------------
# Enhanced GRAPH, using legend() -- use of legend() is OPTIONAL in Bio 240 ---
# BUT, adding labels within graphs and in graph margins, is required
# ------------------------------
cx1 <- 1.6; cx2 <- 1.2; cx3 <- 1.8 # vars for typography convenience
c1 <- "darkred"
plot(time, spine,
type="b",
bty="l", # type of 'box'; default: "o", "l", "7", "c", "u", "]"
ylim=c(.62, .9), # y-axis range
xlim=c(-5, 100), # x-axis range
cex=1.4, cex.lab=cx1, cex.axis=cx2, font.lab=1, # typography
cex.main=cx3, col.main=c1, col.lab=c1, # typography
xlab="Time from Baseline, months",
ylab="", # turn-off y-axis label if doing it by mtext()
#ylab="Bone Density, g/cm^2",
#ylab=expression("Bone Density, g/cm" ^2),
main="Bone density with Alendronate treatment\n65 year old female patient"
#main="65 year old female patient", cex.main=cx3
)
# use mtext() for y-axis label to deal with crowding issue
mtext(expression("Bone Density, g/cm" ^ 2), side=2,
line=2.3, cex=cx1, col=c1, font=1)
points(time, hip, pch=16, cex=1.4, type="b") # add hip x,y data
mtext(paste("Arlo: ", date()), side=1, adj=1, line=4, cex=.8, col="gray")
# ------------------------
# legend() takes many args --- Optional in Bio 240 -------
# ------------------------
legend(x=55, y=.79, # x,y for top-left corner, by trial-and-error
legend=c("Lumbar Spine", "Temporal Neck (hip)"),
pch=c(1, 16), cex=1.2, text.font=3, pt.cex=1.5,
x.intersp=1.3, y.intersp=1.2, title.col="darkred",
title=" Alendronate treatment ")
Presentation quality graph using the
legend() function.
Rather than using the legend() function, nice
labels/legends are easily made using points() and
text() and with greater flexibility, in my opinion.
# -----------------------
# ALTERNATIVE to legend() -- nice labels by points() and text() --
# -----------------------
plot(time, spine, cex=1.2, ylim=c(.6, .9), type="b", cex.lab=1.3,
xlab="Time from Baseline, months", ylab="Bone Density, g/cm^2")
points(time, hip, cex=1.2, pch=16, type="b" )
points(60, .78, pch=1, cex=1.5) # put label
text(62, .78, "Lumbar Spine", pos=4, font=3) # within graph
points(60, .75, pch=16, cex=1.5) # label
text(62, .75, "Temporal Neck (hip)", pos=4, font=3)
The simple graph has brought more information to the data analyst’s
mind than any other device.
John Tukey (1915-2000) Princeton
University
Often, we wish to put statistical results within a graph, whether it’s draft quality, presentation quality or publication quality.
plot(time, spine, cex=1.2, ylim=c(.6, .9), type="b", cex.lab=1.3,
xlab="Time from Baseline, months", ylab="Bone Density, g/cm^2")
points(time, hip, cex=1.2, pch=16, type="b" )
points(62, .78, pch=1, cex=1.5) # put label
text(64, .78, "Lumbar Spine", pos=4, font=3) # within graph
points(62, .75, pch=16, cex=1.5) # label
text(64, .75, "Temporal Neck (hip)", pos=4, font=3)
# add OLS linear regression fit with slope & p-value ----------
spine.reg <- lm(spine ~ time) # perform OLS linear regression, spine by time
hip.reg <- lm(hip ~ time) # " " " " , hip by time
abline(spine.reg, col="gray", lwd=2, lty=2) # add linear regression fits
abline(hip.reg, col="gray", lwd=2, lty=2)
#summary(spine.reg) # slope .0008802, r-sq .86 # inspect the regression object
#summary(hip.reg) # slope .0003133, r-sq .38 # "
# add stats within graph --------------------------------------------------
# add r-sq% with correlation p-value -------------------------
#text(13, .85, "r-sq = 86%\np = 0.00046(1)", cex=1)
#text(13, .68, "r-sq = 38%\np = 0.052(1)", cex=1)
# add linear regression slope, with p-value ------------------
#text(13, .85, "slope = .0008802 \np = 0.00046(1)", cex=1)
#text(13, .68, "slope = .0003133 \np = 0.052(1)", cex=1)
# add r-sq%, linear regression slope, with p-value -----------
text(13, .86, "r-sq = 86%\nslope = 0.0008802\np = 0.00046(1)", cex=1)
text(13, .69, "r-sq = 38%\nslope = 0.0003133\np = 0.052(1)", cex=1)
mtext(paste("Bone density increased significantly with time\nfor Lumbar Spine",
"but not for Hip"), side=3, line=1, cex=1.5, col="darkred")
mtext(paste("Arlo: ", date()), side=1, adj=1, line=4, cex=.8, col="gray")
So far for this data set we have stored and used vectors. To proceed, we now pour the data into a row-by-column data frame object.
# ---------------------
# construct data frame of 16 rows by 3 columns: months, dens, bone ----------
# ---------------------
dens <- c(spine, hip) # combine response vectors into one numeric vector
bone <- c(rep("spine", 8), rep("hip", 8)) # create matching, grouping variable
# change character vector 'bone' to a factor
# specify desired 'order' of factor levels
bone <- factor(bone, levels=c("spine", "hip"))
months <- c(time, time) # the 'months' column: numeric vector
dfr <- data.frame(months, dens, bone) # attach the 3 vectors as columns
str(dfr) # checking
## 'data.frame': 16 obs. of 3 variables:
## $ months: num 0 8 18 48 64 66 79 92 0 8 ...
## $ dens : num 0.797 0.806 0.817 0.825 0.837 0.841 0.886 0.881 0.643 0.638 ...
## $ bone : Factor w/ 2 levels "spine","hip": 1 1 1 1 1 1 1 1 2 2 ...
summary(dfr) # checking
## months dens bone
## Min. : 0.00 Min. :0.6380 spine:8
## 1st Qu.:15.50 1st Qu.:0.6502 hip :8
## Median :56.00 Median :0.7385
## Mean :46.88 Mean :0.7462
## 3rd Qu.:69.25 3rd Qu.:0.8280
## Max. :92.00 Max. :0.8860
dfr # print data frame
## months dens bone
## 1 0 0.797 spine
## 2 8 0.806 spine
## 3 18 0.817 spine
## 4 48 0.825 spine
## 5 64 0.837 spine
## 6 66 0.841 spine
## 7 79 0.886 spine
## 8 92 0.881 spine
## 9 0 0.643 hip
## 10 8 0.638 hip
## 11 18 0.648 hip
## 12 48 0.674 hip
## 13 64 0.640 hip
## 14 66 0.676 hip
## 15 79 0.651 hip
## 16 92 0.680 hip
We seek a batch of statistics for each of the two groups – the spine density x,y data set and the corresponding hip density data set. Having all the data organized in a data frame is always helpful and usually necessary.
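As a quick sketch of what we are after (the fully commented version appears in the R Editor file below), a statistic per group can be pulled straight from the data frame with tapply() wrapped in with():
# sketch: one statistic per group (spine, hip), computed from the data frame 'dfr'
with(dfr, tapply(dens, bone, mean))    # mean bone density for spine and for hip
with(dfr, tapply(dens, bone, sd))      # standard deviation per group
with(dfr, tapply(dens, bone, length))  # sample size per group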
As broken down and developed step-by-step in the Sections above, the complete R Editor file is available below in a scrolled window.
Read and study the code. Either type it in yourself (preferable), copy/paste it, or otherwise “acquire” this R Editor file. Own it by putting your name within the file name. Submit its code blocks, top-down and interactively at the R Console.
Experiment by changing the arguments in the graph functions and
practice flipping between presentation quality and publication quality
graphs, using comment in and comment out. Remember:
>?fun.name
This R Editor file reflects a typical workflow we often follow in R code development. It’s well-commented throughout and retains the beginning, minimal calls and the intermediate code, not just the code for say, a final, high quality graph.
It also makes modest use of text output from R Console, pasted in as lines, commented out. This is a convenience, to inform the investigator while coding. In other words, to make it easy for the coder to see salient results that may provide guidance and ideas.
You may also notice this file contains more R code than explicitly explained above. And, it may have minor stylistic differences from the above, code block by code block breakdown.
(A substantial task would be to transform the contents of this R Editor file, into a new function.)
# bone density Arlo.R January, 2022 Arlo Atz, BS
# Bone density was measured over 92 months in a 65 year old woman with low bone
# density, treated with Alendronate, 2002 through 2009. Bone mineral density
# (g/cm^2) was measured for lumbar spine and femoral neck(hip) (Rosner, B. 2016).
time <- c(0, 8, 18, 48, 64, 66, 79, 92) # months from baseline
spine <- c(.797, .806, .817, .825, .837, .841, .886, .881) # spine bone density
hip <- c(.643, .638, .648, .674, .64, .676, .651, .68) # hip " "
# view data at R Console to check carefully, versus original
cbind(time, spine, hip)
# time spine hip
#[1,] 0 0.797 0.643
#[2,] 8 0.806 0.638
#[3,] 18 0.817 0.648
#[4,] 48 0.825 0.674
#[5,] 64 0.837 0.640
#[6,] 66 0.841 0.676
#[7,] 79 0.886 0.651
#[8,] 92 0.881 0.680
# ------------------------
# preliminary scatterplots -------------------------------
# ------------------------
plot(time, spine) # 'spine' scatterplot: bone density by time
plot(time, hip) # 'hip' scatterplot: " " " "
# both lines must fit in the same graph, so get global min,max for y
range(c(spine, hip)) # min, max: .638 to .886
# minimal graph as proof of concept ----------------------------
plot(time, spine, type="b", ylim=range(c(spine, hip))) # spine data
points(time, hip, type="b", pch=16) # add hip data
# ----------------------------------------------------
# Enhanced, point and line graph: 2 sets of x,y points --------------
# with each line labeled within the graph
# ----------------------------------------------------
plot(time, spine,
col="black", # color of points; default, col="black"
cex=1, # size of symbol plotted; default, cex=1
ylim=range(c(spine, hip)), # adjust y-axis range
xlim=c(-5, 100), # " x-axis "
las=0, # flip y-axis tick labels, las=1; default: las=0
type="b", # default: type="p" points only; "b" both; "l" line only; "n", "h"
main="Bone density with Alendronate treatment\n65 year old female patient",
xlab="Time from Baseline, months",
ylab="Bone Density, g/cm^2",
font.lab=2, cex.lab=1.4, cex.axis=1.2, # label & axis typography
cex.main=1.8, col.main="darkred" # title typography
)
# add 2nd set of points with connecting line --------
points(time, hip, pch=16, type="b")
# label each line ------------------------------------------
text(20, .85, "Lumbar Spine", font=1, cex=1.2, col="darkred")
text(20, .69, "Temporal Neck (hip)", font=1, cex=1.2, col="darkred")
# add personalized time stamp; good for presentation quality graph ---------
mtext(paste("Arlo: ", date()), side=1, adj=1, line=4, cex=.8, col="gray")
# ------------------------------
# Enhanced GRAPH, using legend() -- use of legend() is OPTIONAL in Bio 240 ---
# BUT, adding labels within graphs and in graph margins, is required
# ------------------------------
cx1 <- 1.6; cx2 <- 1.2; cx3 <- 1.8 # vars for typography convenience
c1 <- "darkred"
plot(time, spine,
type="b",
bty="l", # type of 'box'; default: "o", "l", "7", "c", "u", "]"
ylim=c(.62, .9), # y-axis range
xlim=c(-5, 100), # x-axis range
cex=1.4, cex.lab=cx1, cex.axis=cx2, font.lab=1, # typography
cex.main=cx3, col.main=c1, col.lab=c1, # typography
xlab="Time from Baseline, months",
ylab="", # turn-off y-axis label if doing it by mtext()
#ylab="Bone Density, g/cm^2",
#ylab=expression("Bone Density, g/cm" ^2),
main="Bone density with Alendronate treatment\n65 year old female patient"
#main="65 year old female patient", cex.main=cx3
)
# use mtext() for y-axis label to deal with crowding issue
mtext(expression("Bone Density, g/cm" ^ 2), side=2,
line=2.3, cex=cx1, col=c1, font=1)
points(time, hip, pch=16, cex=1.4, type="b") # add hip x,y data
mtext(paste("Arlo: ", date()), side=1, adj=1, line=4, cex=.8, col="gray")
# ------------------------
# legend() takes many args --- Optional in Bio 240 -------
# ------------------------
legend(x=55, y=.79, # x,y for top-left corner, by trial-and-error
legend=c("Lumbar Spine", "Temporal Neck (hip)"),
pch=c(1, 16), cex=1.2, text.font=3, pt.cex=1.5,
x.intersp=1.3, y.intersp=1.2, title.col="darkred",
title=" Alendronate treatment ")
# -----------------------
# ALTERNATIVE to legend() -- nice labels by points() and text() --
# -----------------------
plot(time, spine, cex=1.2, ylim=c(.6, .9), type="b", cex.lab=1.3,
xlab="Time from Baseline, months", ylab="Bone Density, g/cm^2")
points(time, hip, cex=1.2, pch=16, type="b" )
points(62, .78, pch=1, cex=1.5) # put label
text(64, .78, "Lumbar Spine", pos=4, font=3) # within graph
points(62, .75, pch=16, cex=1.5) # label
text(64, .75, "Temporal Neck (hip)", pos=4, font=3)
# add OLS linear regression fit with slope & p-value ----------
spine.reg <- lm(spine ~ time) # perform OLS linear regression, spine by time
hip.reg <- lm(hip ~ time) # " " " " , hip by time
abline(spine.reg, col="gray", lwd=2, lty=2) # add linear regression fits
abline(hip.reg, col="gray", lwd=2, lty=2)
summary(spine.reg) # slope .0008802, r-sq .86 # inspect the regression object
summary(hip.reg) # slope .0003133, r-sq .38 # "
# add stats within graph --------------------------------------------------
# add r-sq% with correlation p-value -------------------------
#text(13, .85, "r-sq = 86%\np = 0.00046(1)", cex=1)
#text(13, .68, "r-sq = 38%\np = 0.052(1)", cex=1)
# add linear regression slope, with p-value ------------------
#text(13, .85, "slope = .0008802 \np = 0.00046(1)", cex=1)
#text(13, .68, "slope = .0003133 \np = 0.052(1)", cex=1)
# add r-sq%, linear regression slope, with p-value -----------
text(13, .86, "r-sq = 86%\nslope = 0.0008802\np = 0.00046(1)", cex=1)
text(13, .69, "r-sq = 38%\nslope = 0.0003133\np = 0.052(1)", cex=1)
mtext(paste("Bone density increased significantly with time\nfor Lumbar Spine",
"but not for Hip"), side=3, line=1, cex=1.5, col="darkred")
mtext(paste("Arlo: ", date()), side=1, adj=1, line=4, cex=.8, col="gray")
# ---------------------
# construct data frame of 16 rows by 3 columns: months, dens, bone ----------
# ---------------------
dens <- c(spine, hip) # combine response vectors into one numeric vector
bone <- c(rep("spine", 8), rep("hip", 8)) # create matching, grouping variable
# change character vector 'bone' to a factor
# specify desired 'order' of factor levels
bone <- factor(bone, levels=c("spine", "hip"))
months <- c(time, time) # the 'months' column: numeric vector
dfr <- data.frame(months, dens, bone) # attach the 3 vectors as columns
str(dfr) # checking
summary(dfr) # checking
dfr # print data frame
# months dens bone
#1 0 0.797 spine
#2 8 0.806 spine
#3 18 0.817 spine
#4 48 0.825 spine
#5 64 0.837 spine
#6 66 0.841 spine
#7 79 0.886 spine
#8 92 0.881 spine
#9 0 0.643 hip
#10 8 0.638 hip
#11 18 0.648 hip
#12 48 0.674 hip
#13 64 0.640 hip
#14 66 0.676 hip
#15 79 0.651 hip
#16 92 0.680 hip
# -----------------------------------------
# ------------------------
# some statistics by group, by some R programming ------------
# ------------------------
# >?tapply # for documentation on this handy function
N <- tapply(dens, bone, length) # vector of sample sizes per group
Mean <- tapply(dens, bone, mean) # means
Median <- tapply(dens, bone, median) # medians
SD <- tapply(dens, bone, sd) # standard deviations
Min <- tapply(dens, bone, min) # minimum
Max <- tapply(dens, bone, max) # maximum
spine.r <- cor(time, spine) # Pearson r, spine density vs. time
hip.r <- cor(time, hip) # Pearson r, hip density vs. time
p.spine <- cor.test(time, spine, alt="greater")$p.value # p-value of r, one-sided
p.hip <- cor.test(time, hip, alt="greater")$p.value
LB.CI.spine <- cor.test(time, spine)$conf.int[1] # LB 95% CI of r
LB.CI.hip <- cor.test(time, hip)$conf.int[1]
UB.CI.spine <- cor.test(time, spine)$conf.int[2] # UB 95% CI of r
UB.CI.hip <- cor.test(time, hip)$conf.int[2]
# statistical power analysis by package 'pwr' ------------
# 'power of the test' for signif. of Pearson r at alpha .01, one-sided
library(pwr) # statistical power analysis by normal theory after Cohen
# statistical power of cor.test at alpha .01, one-sided ---------
pwr.spine <- pwr.r.test(n=8, r=cor(time, spine), sig.level=.01,
alt="greater", power=NULL)$power
pwr.hip <- pwr.r.test(n=8, r=cor(time, hip), sig.level=.01,
alt="greater", power=NULL)$power
# ----------------------------------------------
# pour stats by group (spine, hip) into a matrix to view -------
# the NA is there just as a spacer
# ----------------------------------------------
mat <- rbind(N, Mean, Median, SD, Min, Max, NA,
Pearson.r=c(spine.r, hip.r),
r.sq.percent=c(spine.r, hip.r)^2* 100,
r.p.value1sided=c(p.spine, p.hip),
r.LB.95.CI=c(LB.CI.spine, LB.CI.hip),
r.UB.95.CI=c(UB.CI.spine, UB.CI.hip),
r.power.01.one.sided=c(pwr.spine, pwr.hip) )
colnames(mat) <- c("Spine bone density", " Hip density")
print(format(mat, scientific=F, digits=3), quote=F)
# Spine bone density Hip density
#N 8.000000 8.000000
#Mean 0.836250 0.656250
#Median 0.831000 0.649500
#SD 0.032631 0.017474
#Min 0.797000 0.638000
#Max 0.886000 0.680000
# NA NA
#Pearson.r 0.927236 0.616283
#r.sq.percent 85.976720 37.980533
#r.p.value1sided 0.000456 0.051858
#r.LB.95.CI 0.642130 -0.156242
#r.UB.95.CI 0.987003 0.920990
#r.power.01.one.sided 0.922672 0.247601
# ----------------------------------
# two-way ANOVA to compare the means ------------------------
# ----------------------------------
anova(lm(dens ~ bone + factor(months), data=dfr)) # two-way ANOVA, no interaction
#Response: dens
# Df Sum Sq Mean Sq F value Pr(>F)
#bone 1 0.129600 0.129600 314.8907 4.451e-07
#factor(months) 7 0.006710 0.000959 2.3291 0.1436
#Residuals 7 0.002881 0.000412
# comparison of mean, spine density vs. hip: diff= -.18, p=4.45e-7
# comparisons of mean density at month 0, against the other months
# mean density, only different from time zero,
# at 79 months, p=.048 & 92 months, p=.021
summary(lm(dens ~ bone + factor(months), data=dfr))
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.81000 0.01521 53.235 2.16e-10
#bonehip -0.18000 0.01014 -17.745 4.45e-07 <- highly significant
#factor(months)8 0.00200 0.02029 0.099 0.9242
#factor(months)18 0.01250 0.02029 0.616 0.5573
#factor(months)48 0.02950 0.02029 1.454 0.1892
#factor(months)64 0.01850 0.02029 0.912 0.3921
#factor(months)66 0.03850 0.02029 1.898 0.0995
#factor(months)79 0.04850 0.02029 2.391 0.0481 <- barely significant
#factor(months)92 0.06050 0.02029 2.982 0.0205 <- significant
# effect on two-way Anova, of removal of the last data point: month 92 -------
# ----------------------------------------
# dfr[c(-8, -16), ] will delete row 8 and 16, and return all 3 columns
# r= .89745, p= .003058(1) << spine >>
cor.test(~ dens + months, data=dfr[c(-8, -16), ], subset=bone=="spine", alt="greater")
# r= .4587, p= .1503(1) << hip >>
cor.test(~ dens + months, data=dfr[c(-8, -16), ], subset=bone=="hip", alt="greater")
anova(lm(dens ~ bone + factor(months), data=dfr[c(-8, -16), ])) # two-way ANOVA no interaction
#Response: dens
# Df Sum Sq Mean Sq F value Pr(>F)
#bone 1 0.109651 0.109651 250.2507 4.047e-06 ***
#factor(months) 6 0.004029 0.000671 1.5324 0.3086
#Residuals 6 0.002629 0.000438
summary(lm(dens ~ bone + factor(months), data=dfr[c(-8, -16), ]))
# comparisons of mean density at time 0, against times: 8,18,48,64,66,79
# mean density at 0 months not different from other months
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.80850 0.01582 51.095 3.77e-09
#bonehip -0.17700 0.01119 -15.819 4.05e-06 <- highly significant
#factor(months)8 0.00200 0.02093 0.096 0.9270
#factor(months)18 0.01250 0.02093 0.597 0.5722
#factor(months)48 0.02950 0.02093 1.409 0.2084
#factor(months)64 0.01850 0.02093 0.884 0.4108
#factor(months)66 0.03850 0.02093 1.839 0.1155
#factor(months)79 0.04850 0.02093 2.317 0.0597 <- not significant
# --------------------------------------------
# is spine density correlated with hip density for this -------------------
# 65 yr. old female patient, across the 92 month period?
# --------------------------------------------
plot(spine, hip)
lines(lowess(spine, hip), col="red", type="l")
cor.test(spine, hip) # r= .4796 p= .2291(2)
# ------------------------------------------------------------------------
# VISUALIZATION of Pearson r with 95% CI, for
# spine and hip, bone density by time data
#
# NOTICE that by cor.test() the traditional 95% confidence intervals of
# Pearson r, overlap considerably. This suggests the two correlations are
# not significantly different (a formal permutation test follows below).
#
# AND, due to small sample size, both 95% CIs are very wide.
# ------------------------------------------------------------------------
# a draft quality plot to visualize the 2 Pearson r values
# with their traditional lower and upper bound of 95% CIs
# two-sided cor.test() just for 95% CI of Pearson r -- conventional way
# the 2 Pearson 95% CIs overlap!
LB.CI.spine <- cor.test(time, spine)$conf.int[1] # spine 95% CI=.642, .987
UB.CI.spine <- cor.test(time, spine)$conf.int[2]
LB.CI.hip <- cor.test(time, hip)$conf.int[1] # hip 95% CI= -.156, .921
UB.CI.hip <- cor.test(time, hip)$conf.int[2]
spine.r <- cor(time, spine)
hip.r <- cor(time, hip)
# prepare vectors for stripchart() graph
# of the 2 r values w/95% CIs -------------------
y <- c(cor(spine, time), cor(hip, time))
group <- c("Spine", "Hip")
group <- factor(group, levels=c("Spine", "Hip"))
stripchart(y ~ group, vertical=TRUE,
xlim=c(.2, 2.8), ylim=c(-.15, 1), pch=16, cex=2,
xlab="Bone Mineral Density", ylab="Pearson correlation",
cex.lab=1.3, cex.axis=1.3, cex.main=1.5,
main="Pearson r with traditional, 95% Confidence intervals")
lines(x=c(1,1), y=c(LB.CI.spine, UB.CI.spine), lwd=3, col="lightblue")
lines(x=c(2,2), y=c(LB.CI.hip, UB.CI.hip), lwd=3, col="lightblue")
points(c(1,2), c(spine.r, hip.r), pch=16, cex=2) # intentional overplot
abline(h=c(0,1), lty=2) # add horizontal dotted lines at r=1, r=0 ---------
mtext("SPINE bone density\nby time, n=8", line=-3.9, adj=.03, cex=1, col="red")
mtext("HIP bone density\nby time, n=8", line=-10, adj=.95, cex=1, col="red")
text(1.76, c(hip.r, UB.CI.hip, LB.CI.hip),
c(round(hip.r,3), round(UB.CI.hip,3), round(LB.CI.hip,3)), cex=1.2)
text(1.24, c(spine.r, UB.CI.spine, LB.CI.spine),
c(round(spine.r,3), round(UB.CI.spine,3), round(LB.CI.spine,3)), cex=1.2)
text(.6, .22, paste("Visualization of the\n2 correlations with 95% CIs",
"\nby Fisher's Z transform\nand t-distribution"), col="gray", srt=45)
text(1.5, .22, paste("Note the wide width of each CI\n",
"and how much the\n2 Confidence Intervals\noverlap"), col="gray", srt=45)
text(2.5, -.01, "zero correlation", col="gray", cex=.8)
text(2.5, .99, "perfect, positive correlation", col="gray", cex=.8)
# ------------------------------------------------------------------
# PERMUTATION test for difference between the 2 Pearson CORRELATIONS
# for SPINE vs. HIP bone density by time
# ------------------------------------------------------------------
# Various formal tests for difference between two Pearson r values are
# found and argued about in textbooks. But, this small sample is ideal
# for a permutation test which if done properly, is unassailable.
# .3109528, our test stat, delta Pearson's, spine.r minus hip.r
diff <- cor(spine, time) - cor(hip, time)
# get standard normal deviates independently for each sample, spine & hip
densZ <- c((spine - mean(spine))/sd(spine), (hip - mean(hip))/sd(hip))
densZ
diff <- cor(time, densZ[1:8]) - cor(time, densZ[9:16])
# the 2 r values and 'diff' are the same but now data as Z scores,
# is on the same scale for both spine and hip
# the replicate() function is the loop iterator ----------------
# vectors for replicate() --
# 'time' is n=8, 0 to 92 months
# 'densZ' is n=16; n=8 spine, n=8 hip density
#
# we index densZ[1:8] for spine
# index densZ[9:16] for hip
# number of permutations (shuffles) of global 'densZ' -- start low, go high
NS <- 1e5
timer <- proc.time()[[3]] # initialize timer for 'replicate' loop; grabs current time
null.diff <- replicate(NS,
{densZ <- sample(densZ); cor(time, densZ[1:8]) - cor(time, densZ[9:16])})
timer <- proc.time()[[3]] - timer # total seconds elapsed by replicate() loop
timer # seconds required
# query the 'null.diff' vector for how many are >= to test stat, 'diff' -----
NGE1 <- length(null.diff[null.diff >= diff]) # one-sided NGE, number greater >= to diff
p1 <- (NGE1 + 1)/(NS + 1) # one-sided p-value for difference
# between the two, r values
NGE2 <- length(null.diff[abs(null.diff) >= abs(diff)]) # two-sided NGE
p2 <- (NGE2 + 1)/(NS + 1) # two-sided # two-sided p-value
main.string <- paste("Distribution of Pearson correlation 'difference'",
"(spine r - hip r)\n",
"under global permutation of bone density Z-scores")
# visualize the null distribution of difference
# between the two correlations, under permutation ----------------------
hist(null.diff,
main=main.string,
col="lightblue", xlim=range(null.diff),
border="white",
xlab="Null difference between correlations",
cex.lab=1.3)
abline(v=diff, col="red", lwd=3)
mtext(paste("Test Statistic: spine r - hip r = ", round(diff,5)),
line=-.5, adj=.01)
mtext(paste("NS =", NS, "permutations"), line=-1.7,
adj=.05, cex=1, font=2, col="blue")
mtext("Pearson correlation for\nbone density by months",
line=-5.3, adj=.05, col="blue")
mtext("spine: r = 0.927", line=-7, adj=.1, col="blue")
mtext("hip: r = 0.616", line=-8, adj=.132, col="blue")
mtext(paste("p =", signif(p1,4), "(1), NGE =", NGE1), line=-1, adj=.95)
mtext(paste("p =", signif(p2,4), "(2), NGE =", NGE2), line=-2, adj=.95)
mtext("p = (NGE + 1) / (NS + 1)", line=-3.5, adj=.95)
mtext("< Permutation Test >", line=-6.5, adj=1, font=2, col="blue")
mtext("null hypothesis:\nspine r = hip r", line=-9, adj=1, col="blue")
mtext("alternate hypotheses:\nspine r = hip r\nspine r > hip r",
line=-12.5, adj=1, col="blue")
mtext("For NS --\nstart low, go high
then check for\nstability of p-value
in repeated runs",
line=-20, adj=1, cex=.8, col="black")
mtext("Red line -- observed test statistic: 0.31095", side=1, line=4,
adj=0, cex=.9, col="red")
mtext(paste("Arlo: ", date()), side=1, adj=1, line=4, cex=.8, col="gray")
mtext(paste("Elapsed: ", signif(timer,4), "sec."), side=1, adj=1, line=2)
# NOTE: In any permutation test, we're concerned about stability of p-values in repeated
# runs at the same NS. We seek stable results, recognizing that the combinatorial outcome
# space may be vast. But here, it's rather low at 16!/(8!*8!) = 12,870 possible, unique
# global shuffles of bone density.
# By R code: > factorial(16)/(factorial(8)*factorial(8))
# run of NS 1e5 p=.2838(1) ; p=.5733(2) ; 3.85 seconds elapsed
# run of NS 1e5 p=.2866(1) ; p=.5727(2) ; 3.96 seconds elapsed
# run of NS 1e5 p=.2880(1) ; p=.5762(2) ; 3.80 seconds elapsed
# run of NS 1e5 p=.2873(1) ; p=.5740(2) ; 3.85 seconds elapsed
# run of NS 1e5 p=.2887(1) ; p=.5763(2) ; 3.56 seconds elapsed
# --------
# box plot ------------------
# --------
boxplot(dens ~ bone, data=dfr,
xlab="65 year old female patient",
main="Alendronate treatment", cex.main=2,
ylab="Bone Density, g/cm^2",
cex.lab=1.45, cex.axis=1.5, col.lab="darkred",
names=c("Spine", "Hip"),
col=c("gray", "lightblue"),
ylim=c(.6, .9))
text(c(1,2), .6, "n = 8", cex=1.4)
# -------------------------------
# lowess fit through the points; locally weighted,
# nonparametric nonlinear fit using lowess() ---------
# -------------------------------
plot(time, spine, ylim=c(.6, .9), type="b", cex.lab=1.2,
xlab="Time from Baseline, months", ylab="Bone Density, g/cm^2")
points(time, hip, pch=16, type="b" )
points(62, .78, pch=1, cex=1.5)
text(64, .78, "Lumbar Spine", pos=4, font=3)
points(62, .75, pch=16, cex=1.5)
text(64, .75, "Temporal Neck (hip)", pos=4, font=3)
lines(lowess(time, spine), lwd=2, col="red")
lines(lowess(time, hip), lwd=2, col="red")
# ------------------------
# Bayes Factor Correlation -------
# ------------------------
library(BayesFactor)
correlationBF(time, spine) # 10.14651 +- 0%
correlationBF(time, hip) # 1.4675 +- 0%
# ------------------------------
# --------- SUMMARY ------------
# ------------------------------
# Bone density was significantly and strongly positively correlated with time,
# for << lumbar spine >> but not for the femoral neck << hip >>
#
spine.reg # slope, b = .0008802 b*12 months = +0.0106 g/cm^2 per year
hip.reg # slope, b = .0003133 b*12 months = +0.0038 " " "
# check Rosner on metrics for bone rate of increase
# 2.8 times, slope for spine v. hip, rate of bone density increase
# spine density increased 2.8 times as fast as hip density, per unit of time
# In OLS linear regression, the 60 month (5 year) rate of bone density increase
# was 0.0xx g/cm^2 for lumbar spine (P=0.000xxx) but for the femoral neck (hip), bone
# density increase was not significantly different from zero (p=0.10).
# Need the 12 month rate, vs. Rosner comments
# end. bone density Arlo.R
We must be able to write about our statistical results and graphical visualizations, and write in a style suitable for peer-review at a scientific journal as well as write in a style for the more general public.
For many, writing about statistical results is the hardest part.
It’s fun to save and manipulate our data. It’s always exciting to see the discoveries and the revelations from our graphs and from our statistical analyses. It’s what we live for as scientists.
All these tasks are skills, skills learned by practice and by reading scientific papers and technical books, to see how our peers do it.
Expect to spend a great deal of time writing a paragraph (or page) about a data set. It’s not easy. For me, it’s never easy but it’s always fulfilling.
You may need to write two versions – a version for the general, college educated public and a version for scientific peer-review, the former rather informal and the latter, highly technical. Inspection of these two contrasting styles informs us on how best to communicate our results.
I start by writing a good, solid statistical sentence. If I can do this, I can write a paragraph. If I can write a good paragraph I can write a page about my statistical results. Start small – a single sentence. Along the way, we profit by having colleagues read and edit our writing!
In an above Section, I wrote – “Also for small sample sizes, we proceed with statistical analysis using caution, realizing it may have little if any, inferential value beyond the data at hand.”
Here, the data at hand are important to this patient and to her Physician!
We realize the 65 year old female patient's Physician is seriously motivated to understand as much as possible from the patient's x,y data for bone density by time, for both spine and hip, given Alendronate treatment across 92 months.
She wants to pursue formal statistical analysis coupled with graphical visualization and she has some ability in R programming.
Her questions include:
She treats many elderly women for Osteoporosis. In addition to Alendronate, other drugs are prescribed depending on the situation.
She reasons that writing custom R code for the specific case of this 65 year old woman was worth the effort, and she is rightfully proud of her code. However, writing and curating custom R code developed for each of her many Osteoporosis patients is simply not practical, and such a practice would be error prone.
She knows that given the correct, reliable, clean and readable code in her R Editor file created for the data of this one patient, she could generalize that code into a new R function.
This new function could be called for the x,y data from any Osteoporosis patient, given any drug (not just Alendronate) and any dose across any time course.
Such a function, say boneDensity(), would have arguments for the input of x,y data and for different bone types, plus arguments controlling whether the graphs are saved to a multipage PDF file and the text output to a text file, or simply displayed on the screen with prompts to scroll.
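A minimal sketch of what the skeleton of such a function might look like is below. The name boneDensity(), the argument names and the defaults are all hypothetical, illustrative only, not code from this course:
# hypothetical skeleton only; the function name, args and defaults are illustrative
boneDensity <- function(time, density, bone.type="lumbar spine",
                        drug="Alendronate", patient.id=NULL,
                        save.pdf=FALSE, pdf.file="boneDensity.pdf") {
  if (save.pdf) pdf(pdf.file)              # open a PDF device if requested
  plot(time, density, type="b",
       xlab="Time from Baseline, months",
       ylab="Bone Density, g/cm^2",
       main=paste(patient.id, bone.type, drug))
  fit <- lm(density ~ time)                # OLS linear regression, density by time
  abline(fit, col="gray", lwd=2, lty=2)    # add the fitted line
  if (save.pdf) dev.off()                  # close the PDF device
  invisible(list(fit=fit, summary=summary(fit)))  # return the results invisibly
}
# example call with the spine data from above (hypothetical)
# res <- boneDensity(time, spine, bone.type="lumbar spine", patient.id="Patient A")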
She understands that converting her multiple pages of custom R code into a function will take time and considerable effort, but it would support her Osteoporosis medical practice and be of great value in guiding her decisions on Osteoporosis treatment.
She also knows that calling a function is always much less error prone, easier and faster than interactively executing a very long, custom code block that must be manually edited (e.g., name changes for variables) to deal with every different data set.
In this document I do not engage this function creation, as much as it appeals to me.
The creation of this new function would be appropriate for a Biology BS Honors project. If fully developed, it would be appropriate for a Master’s Degree thesis in Biology.
In my opinion, a person taking on this task should contact local Physicians who treat Osteoporosis, present their function with output, request anonymous data from the medical practice, and give the function to them for their use, as appropriate.
I am sure that Osteoporosis Physicians would have great suggestions for the improvement and enhancement of this new function, whether dealing with data, statistical analysis, graphics or other metrics. Students, such is the stuff upon which careers are built.
R Coding conventions
R Coding behavior
Coding takes patience and the cultivation of trial-and-error behavior. But that’s not all.
Start small when learning a programming environment like R. I think it’s best to start with small data sets that are easy to visualize and hopefully, relatively easy to understand, such as biomedical data. Also, start small in terms of your graphical and statistical aspirations for these small data sets. This approach builds confidence as you learn.
Even when we’re accomplished in R, it still pays to start small. On a new graphing project for instance, we code in step-wise fashion from small, minimal calls, to the final, desired graphs while debugging, saving and laying in comments along the way.
Start small, fail fast remains a wise, workflow adage across all computer languages.
A minimal call to a function uses the minimum number of arguments, below which it simply will not execute. We do this as proof of concept. Unless we can get a function to return the minimal, we can’t get it to return the complicated.
Remember, some functions can be called with zero args, such as date(). Most functions require one or two arguments to run, e.g., str(y), hist(y), plot(x, y), boxplot(y ~ x).
It is common for a call to plot() to have 15 or so
arguments. But, if we enter all those args before any execution along
the way, then expect to see bewildering, noninformative error messages
that compound and cascade down your long function call, making it
impossible to find the error(s).
Errors in coding are most often punctuation errors, then spelling errors, then mental thinking errors. And, coding errors such as these can be hard to see, just like it’s often hard to see typos in our English prose.
Whether we’re talking about a function that’s new to us or one we
know well (e.g., plot), being able to get it to “work” by a
minimal call is the place to begin. Then, to achieve
our final objectives, we add/change args and perhaps call other
functions such as text() mtext()
points() lines() legend() —
functions that add material to the graph window.
In our workflow, as we move forward from a minimal call, we save the minimal call in our program and proceed coding in top-down fashion, sending the call with each new arg added or changed. This is especially important if we are unsure about function syntax or unsure about our choices of values typed-in for args or if we don’t yet know what the best graph for our data should be. This is how interactive coding sessions proceed.
We submit the function call to the R Console with each change. If it throws an error, we know the locus of the error – the last line or so of code entered, not 10 lines upstream. This is top-down de-bugging. It works. It may seem slow and tedious but it saves time in the long run. It works, it works, it works – learn it.
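As a compact illustration of this write-execute loop (one possible session, reusing the bone density vectors from the Sections above), each call below would be submitted and checked before the next arg is added or changed:
# each call is sent to the R Console and checked before the next arg is added
plot(time, spine)                                        # 1. minimal call: does it run?
plot(time, spine, type="b")                              # 2. add the connecting line
plot(time, spine, type="b", ylim=range(c(spine, hip)))   # 3. widen the y-axis
plot(time, spine, type="b", ylim=range(c(spine, hip)),   # 4. add the axis labels
     xlab="Time from Baseline, months", ylab="Bone Density, g/cm^2")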
Coding in R requires attention to detail, as you must follow the syntax rules of the language for spelling and for punctuation characters such as the comma, colon :, double colon ::, parentheses ( ), braces { }, brackets [ ] and backslash \
Students often fail to realize that practice is essential in using R for data analysis and graphics. You must practice running a program on many different data sets.
To become self-sufficient in R, one must read and study code. It’s not enough just to be able to run an R program that you wrote or that another person wrote.
You need to understand the code in your/our R programs. This requires us to read carefully and to study the code deeply, if our aim is self-sufficiency.
In the late 1970s a Math Professor taught me an important lesson – There is no substitute for sitting down in a quiet place with a pencil and a paper copy of your computer program.
Way back then, we were coding in Applesoft BASIC (me) on revolutionary Apple II microcomputers (5.25 inch floppy drive, 48 KB RAM with BASIC in ROM, and at a $2,500 price point) and in C (him) on a primitive mainframe.
Now, I use R (mostly) and increasingly HTML5, CSS3, JavaScript, SVG and LaTeX. Interestingly, R has contributed packages that allow native code from Python, C, C++, Julia and other languages to be executed within R, and this remains appealing for the future growth of Bio 240 and derivative courses.
The 19,000+ contributed R packages enable wonderful things, apart from just "graphs" and "statistics."
I find word clouds useful for display of key words and vocabulary. Creating a word cloud is not “statistics.” It’s an artistic exercise requiring considerable trial-and-error in function calls and in the use of arguments, to get a pleasing image. Word clouds are not part of Bio 240 but my R code below will work for your own word clouds. Try it!
Question: Does this word cloud omit important keywords or concepts in this document? Does it have irrelevant words? If so, can you see how to change the code? You can copy/paste any of my R code you see here, into your own R Editor file and proceed.
library(wordcloud2) # load package for word clouds calculated in R by C++ & rendered
# to HTML & JavaScropt. We don't need to know C++ or HTML for
# this -- that's the job of function wordcloud2()
word <- c("R", "data", "graphs", "Biostatistics", "Fall 2023", "coding", "histogram",
"scatterplot", "box plot", "Zoom", "twitter", "email", "kernel density",
"covid-19", "CDC", "Lehman College", "Bio 240", "time stamp", "the Bronx",
"moving average", "for loop", "data frame", "literate programming",
"reproducible research", "ugly code", "buggy code", "plot()", "mtext()",
"text()", "str()", "lines()", "function", "arguments", "comments", "lowess fit",
"HTML", "JavaScript", "head()", "vector", "mean", "median", "standard deviation",
"skew", "kurtosis", "source()", "tail()", "Shapiro-Wilk test", "abline()",
"summary()", "cases", "file path", "explicit indexing", "word cloud", "RStudio",
".csv file", "descriptive statistics", "distribution", "Donald Knuth",
"Peter Dalgaard", "Paul Murrell", "distance learning", "density()", "polygon()",
"rnorm()", "wordcloud2()", "data table", "RPubs", "Blackboard", "points()",
"effect size", "Anova", "correlation", "statistical power", "Bayes Factor",
"Tukey's test", "t-test", "multiple comparisons", "Dunnett's test",
"assumptions", "bootstrap", "permutation tests", "confidence interval")
freq <- c(19,6,6,8,4,6,2,2,2, rep(1,73)) # arbitrary 'frequency' for each word
d <- data.frame(word, freq, row.names=word) # n=82 words
wordcloud2(d, size=1.1)
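For instance, one small illustrative tweak in answer to the question above: add a word (here "p-value", an arbitrary choice) with its own frequency, rebuild the data frame and re-draw:
# illustrative tweak: add one word, give it a weight, then re-draw the cloud
word <- c(word, "p-value")
freq <- c(freq, 3)
d <- data.frame(word, freq, row.names=word)
wordcloud2(d, size=1.1)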
“Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to humans what we want the computer to do.”
— Donald E. Knuth. 1984. Literate
Programming. The Computer Journal 27(2): 97-111.
Download free
PDF of this paper from the journal: https://academic.oup.com/comjnl/article/27/2/97/343244
Dr. Knuth is Professor Emeritus, Computer Science Department, Stanford
University.
Relative to the importance of writing well structured, clean and readable code in R, Dr. Andrew Gelman of Columbia University (NYC) wrote a blog post to Statistical Modeling, Causal Inference, and Social Science on 8 July 2020. He is renowned as Professor of Statistics and lead author of the acclaimed new book (Andrew Gelman, Jennifer Hill, Aki Vehtari. 2020. Regression and Other Stories. Cambridge University Press, 550 pages). Convincingly, he maintains in this blog post that ugly code is buggy code.
\[\mathbb {\huge UGLY\quad code\quad is\quad BUGGY\quad code}\]
As illustrated by an example of his own sometimes buggy R code, he stresses that we want “…to avoid tripping up and even publishing erroneous results.”
He continues, “Anyway, the point of this post is just to demonstrate how, with sloppy code, you really have to keep your hands on the wheel at all times. Even such simple fitting is hard to do if you’re not continually checking results against your expectations.”
Finally, “To return to the title of this post: Ugly code is buggy code not just because it’s easier to introduce bugs and harder to find them if your code is poorly structured. It’s also that, for code to be truly non-buggy, it doesn’t just have to be free of bugs. You also have to know it’s free of bugs, so that you can use it confidently.” (bold font added by Kincaid; view original story with interesting comments by searching “buggy code” at https://statmodeling.stat.columbia.edu, or see my 8/9/20 tweet @LC_biostat).
Consider my crude example below of why it’s so important to write clean, well-structured R code by making use of spacing, indentation and comments.
Let’s go back to section 14 and bring that code down here for the histogram annotated with descriptive statistics, p-value from the Shapiro-Wilk test of normality and a personalized time stamp.
y1 <- rnorm(60, 50, 9) # 60 random numbers from normal distribution: mean 50, SD 9
m <- round(mean(y1), 2) # get mean of y and round off to 2 decimals
s <- round(sd(y1), 3) # get SD of y and round off
p <- round(shapiro.test(y1)$p, 3) # p-value from Shapiro-Wilk test of normality
hist(y1,
col="peru", # color of histogram, default is "white"
main=NULL, # string for a title of histogram; NULL omits it
cex.lab=1.4, cex.axis=1.3, las=1, # typography of axes
border="white", # color of lines around histogram bars
xlab="Random sample from N(50, 9)") # x-axis label
clr <- "darkred" # color of text
mtext(paste("N =", length(y1)), adj=1, line=0, cex=1, col=clr)
mtext(paste("Mean =", m), adj=1, line=-1, cex=1, col=clr)
mtext(paste("SD =", s), adj=1, line=-2, cex=1, col=clr)
mtext(paste(" Shapiro-Wilk p =", p), adj=0, line=0, cex=1, col=clr)
mtext(paste("Arlo: ", date()), side=1, adj=1, line=4, cex=.9,col="gray") # time stamp
mtext("clean, well structured code", side=3, line=2, col="black", font=2.2, cex=2.1)
13 lines of readable CODE to make the graph
The above code to make the graph is relatively easy to read, to
understand, to de-bug, to change and to run. It resides on 13 lines.
Now I turn this clean and readable graph code into Ugly code by putting it all on one line. Why not? It's debugged and works great. Why not condense it and make a shorter R program?
Below the Ugly code returns exactly the same graph as the clean code of 13 lines.
But, heaven forbid this condensed code should contain or acquire a bug – good luck finding the bug. And, since the structure of the function calls was destroyed by putting it all on one line, good luck trying to edit the code to make changes in the graph such as in colors, labels, font type and size, and so forth. Good luck trying to understand the code by looking at it sprawled across one line.
Line spacing, spreading out a long function call into several lines with indentation, along with liberal use of comments, generates code one can come back to in a month or in 6 months, read it and understand what you had in mind to start with; otherwise the code appears as gibberish.
The Ugly code of one long line (scrolled below) makes exactly the same graph as the well structured, 13 lines of code above.
hist(y1,col="peru",main=NULL,cex.lab=1.4,cex.axis=1.3,las=1,border="white",xlab="Random sample from N(50,9)");clr <- "darkred";mtext(paste("N =",length(y1)),adj=1,line=0,cex=1,col=clr);mtext(paste("Mean =",m),adj=1,line=-1,cex=1,col=clr);mtext(paste("SD =",s),adj=1,line=-2,cex=1,col=clr);mtext(paste(" Shapiro-Wilk p =", p),adj=0,line=0,cex=1,col=clr);mtext(paste("Arlo: ",date()),side=1,adj=1,line=4,cex=.9,col="gray"); mtext("U G L Y C O D E", side=3, line=2, col="black", font=2, cex=2.5)
1 long line of unreadable CODE
Writing code is like writing a poem to your
mother-in-law; you are never satisfied; you never stop
revising it. All software resides under constant
revision. This applies equally to your many relatively short (1 or 2 page) .R files that you will craft this semester and to my large .R files that define new functions useful for univariate data, one-way data, and x,y data (K funs UNIVARIATE.R, K funs ONE WAY.R, K funs CORRELATION.R). Even the code for base R itself remains under constant revision.
“If debugging is the process of removing software bugs, then
programming must be the process of putting them in.”
—
E. Dijkstra (1930-2002) The originator of the structured programming movement, and more.
“Programmers often find that they spend more time debugging a
program than actually writing it. Good debugging skills are invaluable,…
Though debugging is an art rather than a science, it involves some
fundamental principles.”
— Norman Matloff
(2011) The Art of R Programming, A Tour of Statistical Software
In Bio 240 we practice interactive, top-down coding
and top-down debugging, using the write-execute loop
between the R Editor and the R Console. More advanced debugging
strategies are beyond our time constraints in this course. Indeed, there
are books on debugging and R packages devoted to it.
Across all biological and biomedical subject areas, we seek statistical wisdom in our research design, data acquisition, data wrangling, graphing, statistical analysis and in our ability to write about it all, in forms suitable for papers in peer-reviewed scientific journals and for formal presentations (poster sessions, live audience, interview).
The central path to statistical wisdom is to internalize, master and integrate a variety of contrasting, 1) statistical methods (traditional, nonparametric, computationally intensive, Bayesian) and 2) graphical visualizations, applied across many different data sets, of both small and large sample sizes.
We learn how to analyze data by analyzing data. Yes it’s largely, an acquired skill.
Statistical wisdom also involves wise naming for data variables, for R program files and for .csv data files.
As we study and learn, along our path to statistical wisdom, we soon identify 3 big things: practice, practice, practice.
We enjoy a wonderful, empowering reality as we learn basic statistics writ large.
Early in any basic statistics course, we must master topics including, descriptive statistics (moments, quantiles), confidence intervals, effect sizes (e.g., Cohen’s d), statistical power analysis, and basic graphs (line graph, scatter plot, histogram, box plot, means with error bars, bar graphs). These are always used in scientific papers reflecting the latest cutting-edge research as published across the top, peer reviewed journals.
This stuff never goes out of date. Learning it is never wasted. It enhances your undergraduate Biology program at Lehman College. It helps make you competitive in professional career development.
The ability to handle data and do basic statistics and graphics by custom programming in R, enables research experiences, enhances postgraduate applications and keeps you on the fast track to success!
Hopefully, Bio 240. Biostatistics will help you move forward toward these goals.
When you click on this HTML file, it opens in your web browser and R need not be present or running. This file will work on computers with any operating system (Win, macOS, Chrome OS, Linux distros).
The source code is a plain text file, an R Markdown file (.Rmd) that is bound to HTML and JavaScript by the powerful knitr package and by pandoc (a Haskell library for document conversion to HTML, ePub, PDF, OpenDocument, Word & PowerPoint, and more). While pandoc is not part of R, it is downloaded by and made available automatically by RStudio, which is the best environment for crafting reproducible research. Python programmers simply insert native Python code instead of R code in the code blocks.
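For example (an illustration only, not a required step in Bio 240, and the file name is hypothetical), an .Rmd file can be rendered to a self-contained HTML file from the R Console with the rmarkdown package, which drives knitr and then pandoc, just as RStudio's Knit button does:
# render a hypothetical R Markdown file to HTML; rmarkdown calls knitr, then pandoc
library(rmarkdown)
render("my_analysis.Rmd", output_format="html_document")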
As the semester proceeds you might find this entire, Syllabus HTML file handy for review of some topics covered in Bio 240.
Many Biology majors aspire to, and identify as pre-med. Some think computer programming is not relevant to the biomedical career path. Consider the statement in the picture below, from one of the world’s most prestigious medical facilities, The Cleveland Clinic (Cleveland, Ohio).
Physicians who code could be the future of healthcare,…
The journal \({\mathit Nature}\) is published in England; it covers all areas of science, has an extremely high rejection rate on submitted papers, and compares favorably to our journal, \({\mathit Science}\), in terms of being arguably the most prestigious scientific journal in the world. Neither journal is open-access, but both make some papers in each issue available to all.
\({\mathit Nature}\) has what it calls “collections” of papers, many of which are open-access. The collection, “Statistics for Biologists” is absolutely excellent. Find it from my August 26, 2020 twitter post, https://twitter.com/LC_biostat or visit https://www.nature.com/collections/qghhqm
\({\mathit Science}\) publishes \({\mathit Science\ Advances}\) which is completely open-access, multidisciplinary and is a highly recommended read for all of us. \({\mathit Science\ Advances}\) has a lively twitter site, https://twitter.com/ScienceAdvances and at https://advances.sciencemag.org each issue can be viewed free. You should regularly read \({\mathit Science\ Advances}\) to keep up with the scientific world as enjoyment and as part of your career development strategy. Remember, what courses require is only the top of the iceberg.
As a public service during the pandemic, \({\mathit Science}\) has a free site,
https://www.sciencemag.org/collections/coronavirus?IntCmp=coronavirussiderail-128
for “Coronavirus: Research, Commentary, and News” which provides
open-access to many of their recent papers on covid-19.
Professional Career Development. In class I will discuss this topic which is optional material you will not be tested on.
It will involve case studies, advice from my experience, my personal stories, tips, ideas and challenges.
A solid career in science, medicine or science education does not happen automatically.
One reason I talk about career development in my classes is because when I was an undergraduate, no professors bothered to tell me these things. They either did not care, did not know, or both.
Succeed in your Career Development !
❶ Always show up & show up on time
❷ Pay attention & take notes
❸ Work hard & ask questions
❹ Be honest, don't cheat
❺ Practice, practice, practice
❻ Maintain a high GPA
❼ Join professional scientific societies & attend their meetings
❽ You must have a research experience to be competitive
❾ You should read books & journal papers, far beyond what's required
To enhance the
reproducibility of our R
code in this document, we call the sessionInfo() function
in base R. At the R Console enter >?sessionInfo for its
documentation which will come up in your browser as an HTML file. This
function returns a great deal of useful information about your current,
R Session.
sI <- sessionInfo()
print(sI, locale=TRUE)
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] wordcloud2_0.2.1 gapminder_1.0.0 plotly_4.10.2
## [4] BayesFactor_0.9.12-4.4 Matrix_1.6-0 coda_0.19-4
## [7] pwr_1.3-0 gplots_3.1.3 DescTools_0.99.49
## [10] nortest_1.0-4 moments_0.14.1 psych_2.3.6
## [13] ggwordcloud_0.5.0 ggplot2_3.4.2
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.6 tidyr_1.3.0 sass_0.4.7 viridisLite_0.4.2
## [5] jsonlite_1.8.7 gtools_3.9.4 bslib_0.5.0 expm_0.999-7
## [9] gld_2.6.6 lmom_2.9 highr_0.10 cellranger_1.1.0
## [13] yaml_2.3.7 pillar_1.9.0 lattice_0.21-8 glue_1.6.2
## [17] digest_0.6.33 colorspace_2.1-0 htmltools_0.5.5 pkgconfig_2.0.3
## [21] purrr_1.0.1 mvtnorm_1.2-2 scales_1.2.1 rootSolve_1.8.2.3
## [25] MatrixModels_0.5-2 tibble_3.2.1 proxy_0.4-27 generics_0.1.3
## [29] farver_2.1.1 ellipsis_0.3.2 cachem_1.0.8 withr_2.5.0
## [33] pbapply_1.7-2 lazyeval_0.2.2 cli_3.6.0 mnormt_2.1.1
## [37] magrittr_2.0.3 readxl_1.4.3 evaluate_0.21 fansi_1.0.4
## [41] nlme_3.1-162 MASS_7.3-60 class_7.3-20 tools_4.2.2
## [45] data.table_1.14.8 lifecycle_1.0.3 stringr_1.5.0 Exact_3.2
## [49] munsell_0.5.0 compiler_4.2.2 jquerylib_0.1.4 e1071_1.7-13
## [53] caTools_1.18.2 rlang_1.1.1 grid_4.2.2 rstudioapi_0.15.0
## [57] htmlwidgets_1.6.2 crosstalk_1.2.0 bitops_1.0-7 labeling_0.4.2
## [61] rmarkdown_2.23 boot_1.3-28.1 gtable_0.3.3 R6_2.5.1
## [65] knitr_1.43 dplyr_1.1.2 fastmap_1.1.1 utf8_1.2.3
## [69] KernSmooth_2.23-20 stringi_1.7.12 parallel_4.2.2 Rcpp_1.0.11
## [73] vctrs_0.6.3 png_0.1-8 tidyselect_1.2.0 xfun_0.39
# str(sI) # comment-in for the STRUCTURE of the sessionInfo() object
Question: Call the above R code using locale=FALSE and see what happens. Call the following functions, each with zero arguments: R.Version(), getRversion(), and R.home(). Then inspect R.version, which is not a function call but an object holding a list of version information; compare it to the output from the function R.Version() and inspect str(R.Version()). Remember that the invaluable and extremely commonly called str function returns the structure of whatever object is named as its argument. Every class this semester we will call the str function, for a variety of reasons.
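A quick sketch of those calls at the R Console:
# zero-argument calls that describe your current R installation
R.Version()       # a list of platform and version details
getRversion()     # the version as a comparable object
R.home()          # path to the R installation directory
R.version         # an object (not a function call) holding the same information
str(R.Version())  # structure of the list returned by R.Version()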
__________________
https://r-project.org
Click to Explore it
Click on
CRAN where you install R
Check out The R
Journal
Examine the Task Views
Scan the
thousands of contributed packages by A-Z name or by
submission date
John Tukey (1915-2000)
Finding the question is often more important than finding the answer.
— John Tukey, Princeton University and Bell Labs (1915 - 2000)
END of Bio 240 Syllabus, Fall 2023
R
data
graphs
statistics
writing about the results
Dwight Kincaid, PhD
Professor dwight.kincaid@lehman.cuny.edu
time stamp: Thursday August 24, 2023 at
19:25:48
last rendered: Thursday August 24, 2023 at
19:25:48