Slides for Week 1 are available here.
R by itself is just the ‘beating heart’ of R programming, but it has no particular user interface. RStudio is an Integrated Development Environment (IDE). It serves as a text editor, data manager, and package library to help you read and write R code. In RStudio you can create R scripts, R Markdown documents, and Shiny apps, run Python scripts, and much more.
In RStudio, create a new project from the “File” menu and select a preferred file location. A project keeps all the files associated with it organized together, with its own working directory, workspace, history, and source documents.
Download the R Markdown file and the dataset titled birthweight_smoking.csv from the Moodle site to the same location as your project.
This is an R Markdown document. You can use R Markdown to format all your submissions professionally. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents that embed R code and/or output such as figures, tables and statistical results. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button, a document is generated that includes both the written content and the output of any R code chunks embedded in the document. An R code chunk opens with a line of three backticks followed by {r} and closes with a line of three backticks.
For more examples of R code chunks, see lesson 3 of the R Markdown tutorial: https://rmarkdown.rstudio.com/lesson-3.html. Then, open the R Markdown file or the R script.
An R script is a series of commands that you can execute at one time, which can save you a lot of time. A script is just a plain text file with R commands in it. It is similar to a code chunk in an R Markdown document. You will see all your output in the Console, located in the bottom panel of RStudio.
For assignments and quizzes, you may use either format, as long as you submit a PDF file to Gradescope.
If you use an R Markdown (.Rmd) document, knit it to PDF: click the downward-pointing triangle next to Knit and choose Knit to PDF. Note that coding errors will prevent you from running the code and knitting the document. If you use an R script (.R) instead, copy your code and results into a Word document, complete the assignment there, and then convert it to a PDF for submission.
Press Command + Enter on a Mac, or Control + Enter on a PC, to run an R script one line or several lines at a time. (Put your cursor on the line of code you want to run!) To run all the lines in a code chunk, press the green triangle button on the chunk, or select the lines and run them.
RStudio has a large number of useful keyboard shortcuts. You can list them with a keyboard shortcut of their own: Alt + Shift + K on Windows or Linux, and Option + Shift + K on a Mac.
The RStudio team has developed a number of “cheatsheets” for working with both R and RStudio.
If you do not have the latest version of R, you can update it by running the following code (a small window will pop up, asking you for permission to download). Note that the installr package works on Windows only, and that you should first UNCOMMENT the code by removing the hashtag # before running it.
# install.packages("installr")
# library(installr)
# updateR()
# https://www.r-statistics.com/2013/03/updating-r-from-r-on-windows-using-the-installr-package/
To update RStudio, open RStudio and go to the Help menu in the top menu bar (not the Help tab in the lower right quadrant). In the Help menu, select Check for Updates. It will tell you whether you are using the latest version of RStudio, or direct you to the website to download the latest version.
Values can be stored in variables. Variable names are made up of letters or words. Storing is done using an arrow (<-). For example, age <- 87 assigns the value 87 to a variable named age.
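As a minimal sketch of assignment in practice, the lines below store 87 in age and then reuse it. The second variable name, age_next_year, is made up for illustration and does not come from the course materials.

```r
# Assign the value 87 to a variable named age
age <- 87

# Typing the variable name on its own prints the stored value
age

# A stored value can be reused in later calculations
# (age_next_year is a hypothetical name used only in this example)
age_next_year <- age + 1
age_next_year
```

Nothing is printed by the assignment itself; the value only appears in the Console when the variable name is run on its own line.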
<- is the left assignment operator in R: it tells R to assign the value written to the right of the arrow to the variable named on the left. Assignment operators are used to assign values to variables.
## [1] 50
## [1] "this is my first line of R code"
## [1] 46
## [1] 48
## [1] 200
## [1] 6171.5
‘smoking_data’ is the variable where the data will be stored. If the parameter “header=” is “TRUE”, then the first row of the file is treated as the column (variable) names. These data were reported in Almond, D., Chay, K. Y., & Lee, D. S. (2005). The costs of low birth weight. The Quarterly Journal of Economics, 120(3), 1031-1083, and are made available via Stock, J. H., & Watson, M. W. (2020). Introduction to Econometrics. 4th ed. Pearson. NY: New York.
RStudio Cloud/Desktop: If you saved the CSV dataset in your project folder, you can load the data easily.
If your data is not stored in the project folder, you will need to give the full path to the file; remember to use the forward slash “/”. Remove the # to run the following code.
smoking_data <- read.csv("birthweight_smoking.csv",header=TRUE, sep=",")
# smoking_data is the name we assign for the .csv dataset,
# that's how R calls this dataset from now on!
Now, in the Global Environment on the right panel, you should see ‘smoking_data’ stored.
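To see what the header argument does, here is a small self-contained sketch. It writes a three-row CSV to a temporary file (the values are taken from the first rows of the dataset) and reads it back both ways. The object names csv_path, with_header and without_header are illustrative assumptions, not names used in the course files.

```r
# Write a tiny CSV to a temporary file so the example runs anywhere
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("id,birthweight,smoker",
    "1,4253,1",
    "2,3459,0",
    "3,2920,1"),
  csv_path
)

# header = TRUE: the first line of the file supplies the column names
with_header <- read.csv(csv_path, header = TRUE, sep = ",")
names(with_header)
nrow(with_header)

# header = FALSE: the first line is read as data,
# and R names the columns V1, V2, V3 by default
without_header <- read.csv(csv_path, header = FALSE, sep = ",")
names(without_header)
nrow(without_header)
```

With header = FALSE the file appears to have one extra row, which is a quick way to spot a header that was read as data by mistake.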
A data frame is a rectangular collection of values, usually organized so that observations appear in rows (unique entities, such as student id) and variables appear in the columns (such as height, GPA).
Examine the dimensions of your dataset. The result is two numbers: (1) the number of rows and (2) the number of columns.
## [1] 3000 13
What are the variables included in the dataset?
## [1] "id" "birthweight" "nprevist" "alcohol" "smoker"
## [6] "unmarried" "educ" "age" "drinks" "tripre1"
## [11] "tripre2" "tripre3" "tripre0"
head() shows the first six rows of a dataset.
## id birthweight nprevist alcohol smoker unmarried educ age drinks tripre1
## 1 1 4253 12 0 1 1 12 27 0 1
## 2 2 3459 5 0 0 0 16 24 0 0
## 3 3 2920 12 0 1 0 11 23 0 1
## 4 4 2600 13 0 0 0 17 28 0 1
## 5 5 3742 9 0 0 0 13 27 0 1
## 6 6 3420 11 0 0 0 16 33 0 1
## tripre2 tripre3 tripre0
## 1 0 0 0
## 2 1 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
Examine the data structure of the variables in the data frame (factor, numeric, integer, etc.).
## 'data.frame': 3000 obs. of 13 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ birthweight: int 4253 3459 2920 2600 3742 3420 2325 4536 2850 2948 ...
## $ nprevist : int 12 5 12 13 9 11 12 10 13 10 ...
## $ alcohol : int 0 0 0 0 0 0 0 0 0 0 ...
## $ smoker : int 1 0 1 0 0 0 1 0 0 0 ...
## $ unmarried : int 1 0 0 0 0 0 0 0 0 0 ...
## $ educ : int 12 16 11 17 13 16 14 13 17 14 ...
## $ age : int 27 24 23 28 27 33 24 38 29 28 ...
## $ drinks : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tripre1 : int 1 0 1 1 1 1 1 1 1 1 ...
## $ tripre2 : int 0 1 0 0 0 0 0 0 0 0 ...
## $ tripre3 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tripre0 : int 0 0 0 0 0 0 0 0 0 0 ...
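The inspection steps above (dimensions, variable names, first rows, structure) correspond to the base R functions dim(), names(), head() and str(). The sketch below applies them to a small data frame rebuilt from the six rows printed by head(); the name first_rows and the choice of five columns are assumptions made to keep the example short. On the full dataset you would pass smoking_data instead.

```r
# Rebuild a few columns of the first six rows shown above
first_rows <- data.frame(
  id          = 1:6,
  birthweight = c(4253L, 3459L, 2920L, 2600L, 3742L, 3420L),
  smoker      = c(1L, 0L, 1L, 0L, 0L, 0L),
  educ        = c(12L, 16L, 11L, 17L, 13L, 16L),
  age         = c(27L, 24L, 23L, 28L, 27L, 33L)
)

# Number of rows, then number of columns
dim(first_rows)

# Variable (column) names
names(first_rows)

# First rows; the second argument sets how many rows to show
head(first_rows, 3)

# Type of each variable, with a preview of its values
str(first_rows)
```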
Examine the summary statistics of the variables in the dataset. What can you learn from them? Important: Check against the “Documentation for Birthweight_Smoking.pdf” to understand the meaning of each variable and the labeling system.
## id birthweight nprevist alcohol
## Min. : 1.0 Min. : 425 Min. : 0.00 Min. :0.00000
## 1st Qu.: 750.8 1st Qu.:3062 1st Qu.: 9.00 1st Qu.:0.00000
## Median :1500.5 Median :3420 Median :12.00 Median :0.00000
## Mean :1500.5 Mean :3383 Mean :10.99 Mean :0.01933
## 3rd Qu.:2250.2 3rd Qu.:3750 3rd Qu.:13.00 3rd Qu.:0.00000
## Max. :3000.0 Max. :5755 Max. :35.00 Max. :1.00000
## smoker unmarried educ age
## Min. :0.000 Min. :0.0000 Min. : 0.00 Min. :14.00
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:12.00 1st Qu.:23.00
## Median :0.000 Median :0.0000 Median :12.00 Median :27.00
## Mean :0.194 Mean :0.2267 Mean :12.91 Mean :26.89
## 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:14.00 3rd Qu.:31.00
## Max. :1.000 Max. :1.0000 Max. :17.00 Max. :44.00
## drinks tripre1 tripre2 tripre3
## Min. : 0.00000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.: 0.00000 1st Qu.:1.000 1st Qu.:0.000 1st Qu.:0.000
## Median : 0.00000 Median :1.000 Median :0.000 Median :0.000
## Mean : 0.05833 Mean :0.804 Mean :0.153 Mean :0.033
## 3rd Qu.: 0.00000 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :21.00000 Max. :1.000 Max. :1.000 Max. :1.000
## tripre0
## Min. :0.00
## 1st Qu.:0.00
## Median :0.00
## Mean :0.01
## 3rd Qu.:0.00
## Max. :1.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 425 3062 3420 3383 3750 5755
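The two outputs above have the shape produced by summary() on a whole data frame and on a single column. A minimal sketch, using only the first ten values of birthweight and smoker from the str() output (so the numbers will not match the full-sample results); small_df and bw_summary are illustrative names:

```r
# First ten observations, as listed in the str() output
small_df <- data.frame(
  birthweight = c(4253, 3459, 2920, 2600, 3742, 3420, 2325, 4536, 2850, 2948),
  smoker      = c(1, 0, 1, 0, 0, 0, 1, 0, 0, 0)
)

# Summary of every variable in the data frame
summary(small_df)

# Summary of one variable, selected with $
bw_summary <- summary(small_df$birthweight)
bw_summary

# Individual entries can be pulled out by name
bw_summary[["Median"]]

# For a 0/1 variable, the mean is the share of ones
mean(small_df$smoker)
```

The last line is why the mean of smoker in the full summary can be read directly as the proportion of smokers in the sample.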
For numeric variables, we can summarize data with the center and spread.
Note: Some exercises below are inspired by Dalpiaz, D. (2019). Applied statistics with R. University of Illinois. Urbana-Champaign, IL. Available at: https://daviddalpiaz.github.io/appliedstats.
| Measure | R | Result |
|---|---|---|
| Mean | mean(smoking_data$educ) | 12.907 |
| Median | median(smoking_data$educ) | 12 |

| Measure | R | Result |
|---|---|---|
| Variance | var(smoking_data$educ) | 4.6945825 |
| Standard Deviation | sd(smoking_data$educ) | 2.1666985 |
| IQR | IQR(smoking_data$educ) | 2 |
| Minimum | min(smoking_data$educ) | 0 |
| Maximum | max(smoking_data$educ) | 17 |
| Range | range(smoking_data$educ) | 0, 17 |
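The functions in the tables can be tried on a short vector first. The sketch below uses the first ten educ values from the str() output, so the results differ from the full-sample figures in the tables; the vector name educ is just a stand-in for smoking_data$educ.

```r
# First ten values of educ, as listed in the str() output
educ <- c(12, 16, 11, 17, 13, 16, 14, 13, 17, 14)

# Center
mean(educ)
median(educ)

# Spread
var(educ)     # variance
sd(educ)      # standard deviation, the square root of the variance
IQR(educ)     # interquartile range: third quartile minus first quartile
min(educ)
max(educ)
range(educ)   # returns the minimum and the maximum together
```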
For categorical variables, counts and percentages can be used for summarizing the group.
We can produce a frequency table to find out the number of observations in each group, and create a percentage table to summarize the percentages.
##
## 0 1
## 2320 680
##
## 0 1
## 0.7733333 0.2266667
# percentage table: number of unmarried or married mothers divided by
# nrow (row number) = the number of people (observations) in the dataset
In the dataset, 22.7% of the mothers are unmarried (unmarried = 1), while 77.3% are married (unmarried = 0).
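A frequency table and a percentage table of this kind can be sketched with table() and a division by the number of observations. The example uses only the first ten values of unmarried from the str() output, and assumes, as the variable name suggests, that 1 means unmarried; counts and shares are illustrative names.

```r
# First ten values of unmarried, as listed in the str() output
unmarried <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0)

# Frequency table: number of observations in each group
counts <- table(unmarried)
counts

# Percentage table: counts divided by the number of observations
shares <- counts / length(unmarried)
shares

# prop.table() gives the same proportions in one step
prop.table(counts)

# Expressed as percentages, rounded to one decimal place
round(100 * shares, 1)
```

For a column inside a data frame, nrow(smoking_data) plays the role that length(unmarried) plays here.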
R comes with a number of built-in functions and datasets, but one of the main strengths of R as an open-source project is its package system. Packages add additional functions and data. Frequently, if you want to do something in R and it is not available by default, there is a good chance that a package will fulfill your needs.
To install a package, use the install.packages() function. Think of this as a special power-up you can add to your character in a video game, or as buying a recipe book from the store, bringing it home, and putting it on your shelf (i.e. into your library).
The first time you generate PDF output, you will need to install LaTeX. LaTeX is a typesetting system; it includes features designed for the production of technical and scientific documentation. LaTeX is the de facto standard for the communication and publication of scientific documents.
For R Markdown users who have not installed LaTeX before, you should install TinyTeX. TinyTeX is a lightweight, portable, cross-platform, and easy-to-maintain LaTeX distribution. The R companion package tinytex can help you automatically install missing LaTeX packages when compiling LaTeX or R Markdown documents to PDF. If for some reason TinyTeX does not work on your Mac computer then you can try to install MacTeX instead. You can download the latest version of MacTeX. Read more here.
You can install TinyTeX from within RStudio using the following code (uncomment the code first by removing #), OR copy and paste the code into the Console window at the bottom left of RStudio (remove #).
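The usual two-step TinyTeX setup with the tinytex package looks like the sketch below. The install lines are left commented out, in keeping with the rest of this handout, because they download software; has_tinytex is an illustrative name for a quick check that is safe to run at any time.

```r
# Check whether the tinytex R package is already installed (TRUE or FALSE)
has_tinytex <- requireNamespace("tinytex", quietly = TRUE)
has_tinytex

# One-time setup: remove the leading # from the two lines below to run them.
# The first installs the R package; the second downloads and installs
# the TinyTeX LaTeX distribution itself.
# install.packages("tinytex")
# tinytex::install_tinytex()
```

Both steps need an internet connection and only have to be done once per computer.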
Once you close R, all the packages are closed and put back on the imaginary shelf. The next time you open R, you do not have to install the package again, but you do have to load any packages you intend to use by invoking library().
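The difference between installing once and loading in every session can be sketched as follows. The helper install_if_missing() is hypothetical (it is not part of R or of the course files), and the tools package is used only because it ships with R, so nothing is downloaded when the example runs.

```r
# Hypothetical helper: install a package only if it is not on the "shelf" yet
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # one-time step; needs an internet connection
  }
  invisible(pkg)
}

# "tools" ships with R, so this call does not install anything
install_if_missing("tools")

# Loading has to be repeated in every new R session
library(tools)

# Functions from the loaded package can now be called directly
file_ext("birthweight_smoking.csv")
```

Putting the library() calls at the top of a script or R Markdown document makes it clear which packages the rest of the code depends on.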
Q1: Import the dataset named “birthweight_smoking.csv” and give your dataset a name.
Q2: How many observations and variables does the dataset have?
Q3: Use the summary() function to find the summary statistics of age. Report your findings.
Q4: Find and report the median and range of drinks.
Q5: Use the table() function to report the summary statistics of smoker.
Statastic – you did it!
Step 1: Double-check that you have answered all the questions thoroughly, and ALWAYS check for accuracy!
Step 2: If you use an R Markdown (.Rmd) document, knit it to PDF: click the downward-pointing triangle next to Knit and choose Knit to PDF. If you use an R script (.R), copy your code and results into a Word document, complete the assignment there, and then convert it to a PDF.
Step 3: Submit your assignment to Gradescope https://www.gradescope.com/. For grading purposes, you must match each question to the corresponding pages of your submission.