SPP608 Statistical Methods in R Lab 1

Lecture

Slides for Week 1 are available here

RStudio

R by itself is just the ‘beating heart’ of R programming; it has no particular user interface. RStudio is an Integrated Development Environment (IDE). It serves as a text editor, data manager, and package library to help you read and write R code. In RStudio you can create R scripts, R Markdown documents, and Shiny apps, run Python scripts, and much more.

Create a project, an R script and set up the working directory

In RStudio, create a new project from the “File” menu and select a preferred file location. A project keeps all the files associated with a piece of work organized together, and each project has its own working directory, workspace, history, and source documents.
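
To confirm which folder R is currently working in, the following minimal sketch may help. The folder name and file list it prints will differ on your machine.

```r
# Print the current working directory; inside an RStudio project this
# should be the project folder.
wd <- getwd()
print(wd)

# List the files R can see in that folder, e.g. to confirm that the
# dataset was saved in the right place.
files_here <- list.files(wd)
print(files_here)
```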

Download the R Markdown file and the dataset titled birthweight_smoking.csv from the Moodle site into your project folder.

R Markdown and R Script

This is an R Markdown document. You can use R Markdown to format all your submissions professionally. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents that embed R code and/or outputs such as figures, tables and statistical results. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document will be generated that includes both the content and the output of any embedded R code chunks within the document. A code chunk starts with a line of three backticks followed by {r} and ends with a line of three backticks; the R code goes in between.

For more examples of R code chunks, see R Markdown lesson 3 (https://rmarkdown.rstudio.com/lesson-3.html). Then open the R Markdown file or the R script.

An R script is a series of commands that you can execute all at once, which can save you a lot of time. A script is just a plain text file with R commands in it. It is similar to a code chunk in an R Markdown document. You will see all your outputs in the console, located in the bottom panel of RStudio.

Assignment Submissions

For assignments and quizzes, you may use either format, as long as you submit a PDF file to Gradescope. If you use an R Markdown (.Rmd) document, knit it to PDF: click the downward-pointing triangle next to Knit and choose Knit to PDF. Note that coding errors will prevent the code from running and the document from knitting. If you use an R script (.R), transfer your code, results, and written work into a Word document, then convert it to a PDF for submission.

Running R code

Press COMMAND + ENTER on a Mac or CONTROL + ENTER on a PC to run an R script one line, or several selected lines, at a time. (Put your cursor on the line of code you want to run!) To run all the lines in a code chunk, press the green triangle button at the top right of the chunk, or select the lines and run them.

RStudio has a large number of useful keyboard shortcuts. You can display them with a keyboard shortcut:

On Windows: Alt + Shift + K
On Mac: Option + Shift + K

The RStudio team has developed a number of “cheatsheets” for working with both R and RStudio.

Update R and RStudio

If you do not have the latest version of R, you can update it by running the following code (a small window will pop up, asking you for permission to download). The installr package used here works on Windows only; on a Mac, download the latest installer from CRAN instead. Note that you should first UNCOMMENT the code by removing the hashtag # before running it.

# install.packages("installr")
# library(installr)
# updateR()

# https://www.r-statistics.com/2013/03/updating-r-from-r-on-windows-using-the-installr-package/
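
Before updating, it can help to check which version of R you are running. A minimal sketch using base R only:

```r
# Show the version of R that is currently running as a text string.
current_version <- R.version.string
print(current_version)

# The same information as a version object that can be compared,
# e.g. getRversion() >= "4.0.0".
getRversion()
```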

To update RStudio, open RStudio and go to the Help menu in the top menu bar (not the Help tab in the lower right quadrant). In the Help menu, select Check for Updates. It will tell you whether you are using the latest version of RStudio, or direct you to the website to download the latest version.

Variable Assignment

Values can be stored in variables. A variable name can be a single letter or a longer, descriptive name. Storing is done with the assignment arrow (<-). For example, age <- 87 assigns the value 87 to a variable named age.

<- is the left assignment operator in R: it tells R to evaluate the code on the right of the arrow and store the result under the name on the left.

# simple calculation
20 / 2 * 5

## [1] 50

# store a variable as "a"
a <- "this is my first line of R code"

a

## [1] "this is my first line of R code"

# store a variable as "age"
age <- 46

age

## [1] 46

age + 2

## [1] 48

b <- 2*100

b

## [1] 200

c <- 12343/2 

c

## [1] 6171.5
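
Two further points about assignment that are easy to trip over are sketched below with made-up values: assigning to an existing name overwrites it, and names are case sensitive.

```r
# Assign, then overwrite: the name keeps only the most recent value.
age <- 46
age <- age + 1
age
## [1] 47

# Names are case sensitive, so Age and age are different variables.
Age <- 30
age == Age
## [1] FALSE

# = also assigns at the top level, but <- is the usual convention in R.
height = 170
height
## [1] 170
```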

Step 1: Import a .csv Dataset

‘smoking_data’ is the variable where the data will be stored. If the parameter “header=” is “TRUE”, then the first row of the file will be treated as the column (variable) names rather than as data. These data were reported in Almond, D., Chay, K. Y., & Lee, D. S. (2005). The costs of low birth weight. The Quarterly Journal of Economics, 120(3), 1031-1083, and are made available via Stock, J. H., & Watson, M. W. (2020). Introduction to Econometrics. 4th ed. New York, NY: Pearson.

RStudio Cloud/Desktop: If you saved the csv dataset in your project folder, you can load the data using the file name alone.

If your data are not stored in the project folder, you will need to supply the full path to the file. Remember to use forward slashes (/) in the path, even on Windows.

smoking_data <- read.csv("birthweight_smoking.csv", header = TRUE, sep = ",")

# smoking_data is the name we assign for the .csv dataset, 
# that's how R calls this dataset from now on!

In the Environment pane on the right, you should now see ‘smoking_data’ stored.
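
To see what the header argument does, the sketch below writes a two-row toy file (not the course dataset) to a temporary folder and reads it back both ways. The file name demo_data.csv and the object names are made up for illustration.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
demo_path <- file.path(tempdir(), "demo_data.csv")
writeLines(c("id,birthweight,smoker",
             "1,4253,1",
             "2,3459,0"), demo_path)

# Check that the file exists before reading it; a wrong path is a
# common cause of "cannot open file" errors.
file.exists(demo_path)
## [1] TRUE

# header = TRUE: the first row supplies the column names.
with_header <- read.csv(demo_path, header = TRUE)
colnames(with_header)
## [1] "id"          "birthweight" "smoker"
dim(with_header)
## [1] 2 3

# header = FALSE: the first row is read as data, and R names the
# columns V1, V2, V3 itself.
no_header <- read.csv(demo_path, header = FALSE)
colnames(no_header)
## [1] "V1" "V2" "V3"
dim(no_header)
## [1] 3 3
```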

Step 2. Exploratory Data Analysis

What is a data frame?

A data frame is a rectangular collection of values, usually organized so that observations appear in rows (unique entities, such as student id) and variables appear in the columns (such as height, GPA).
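
As a small illustration, a data frame can also be built by hand. The student values below are hypothetical and only serve to show the rows-and-columns layout.

```r
# Each argument becomes one column (variable); each position in the
# vectors becomes one row (observation). Values are made up.
students <- data.frame(
  student_id = c(101, 102, 103),
  height_cm  = c(165, 172, 158),
  gpa        = c(3.4, 3.9, 3.1)
)

# Print the whole data frame.
students

# Pull out a single column with the $ operator.
students$gpa
## [1] 3.4 3.9 3.1
```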

dim()

Examine the dimensions of your dataset. dim() returns two numbers: (1) the number of rows and (2) the number of columns.

dim(smoking_data)

## [1] 3000   13
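
If you only need one of the two numbers, nrow() and ncol() return them separately. The sketch below uses mtcars, a dataset that ships with R, so it runs without the course file.

```r
# dim() returns rows first, then columns.
dim(mtcars)
## [1] 32 11

# nrow() and ncol() return each number on its own.
n_rows <- nrow(mtcars)
n_cols <- ncol(mtcars)
n_rows
## [1] 32
n_cols
## [1] 11

# Equivalent: index into the result of dim().
dim(mtcars)[1]
## [1] 32
```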

colnames()

What are the variables included in the dataset?

colnames(smoking_data)

##  [1] "id"          "birthweight" "nprevist"    "alcohol"     "smoker"     
##  [6] "unmarried"   "educ"        "age"         "drinks"      "tripre1"    
## [11] "tripre2"     "tripre3"     "tripre0"

head()

head() shows the first six rows of a dataset.

head(smoking_data)

##   id birthweight nprevist alcohol smoker unmarried educ age drinks tripre1
## 1  1        4253       12       0      1         1   12  27      0       1
## 2  2        3459        5       0      0         0   16  24      0       0
## 3  3        2920       12       0      1         0   11  23      0       1
## 4  4        2600       13       0      0         0   17  28      0       1
## 5  5        3742        9       0      0         0   13  27      0       1
## 6  6        3420       11       0      0         0   16  33      0       1
##   tripre2 tripre3 tripre0
## 1       0       0       0
## 2       1       0       0
## 3       0       0       0
## 4       0       0       0
## 5       0       0       0
## 6       0       0       0
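
head() also accepts the number of rows to show, and tail() does the same from the bottom of the dataset. A short sketch, again using the built-in mtcars data rather than the course file:

```r
# First three rows instead of the default six.
first_three <- head(mtcars, 3)
first_three

# Last three rows.
last_three <- tail(mtcars, 3)
last_three

# head() works on a single column (a vector) as well.
head(mtcars$mpg, 5)
## [1] 21.0 21.0 22.8 21.4 18.7
```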

str()

Examine the data structure of the variables in the data frame (factor, numeric, integer, etc.).

str(smoking_data)

## 'data.frame':    3000 obs. of  13 variables:
##  $ id         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ birthweight: int  4253 3459 2920 2600 3742 3420 2325 4536 2850 2948 ...
##  $ nprevist   : int  12 5 12 13 9 11 12 10 13 10 ...
##  $ alcohol    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ smoker     : int  1 0 1 0 0 0 1 0 0 0 ...
##  $ unmarried  : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ educ       : int  12 16 11 17 13 16 14 13 17 14 ...
##  $ age        : int  27 24 23 28 27 33 24 38 29 28 ...
##  $ drinks     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tripre1    : int  1 0 1 1 1 1 1 1 1 1 ...
##  $ tripre2    : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ tripre3    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tripre0    : int  0 0 0 0 0 0 0 0 0 0 ...

# Want to find out the current data structure of each variable? Use str().
# Next week, we will learn how to transform some of these data structures!
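
To check the type of a single variable, or of every column at once, class() can be used. The sketch below uses the built-in iris dataset because it contains both numeric and factor columns.

```r
# Type of one column.
class(iris$Sepal.Length)
## [1] "numeric"
class(iris$Species)
## [1] "factor"

# Type of every column: sapply() applies class() to each column in turn
# and returns a named character vector.
col_types <- sapply(iris, class)
col_types
```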

summary()

Examine the summary statistics of the variables in the dataset. What can you learn from them? Important: Check against the “Documentation for Birthweight_Smoking.pdf” to understand the meaning of each variable and the labeling system.

# the entire dataset
summary(smoking_data)

##        id          birthweight      nprevist        alcohol       
##  Min.   :   1.0   Min.   : 425   Min.   : 0.00   Min.   :0.00000  
##  1st Qu.: 750.8   1st Qu.:3062   1st Qu.: 9.00   1st Qu.:0.00000  
##  Median :1500.5   Median :3420   Median :12.00   Median :0.00000  
##  Mean   :1500.5   Mean   :3383   Mean   :10.99   Mean   :0.01933  
##  3rd Qu.:2250.2   3rd Qu.:3750   3rd Qu.:13.00   3rd Qu.:0.00000  
##  Max.   :3000.0   Max.   :5755   Max.   :35.00   Max.   :1.00000  
##      smoker        unmarried           educ            age       
##  Min.   :0.000   Min.   :0.0000   Min.   : 0.00   Min.   :14.00  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:12.00   1st Qu.:23.00  
##  Median :0.000   Median :0.0000   Median :12.00   Median :27.00  
##  Mean   :0.194   Mean   :0.2267   Mean   :12.91   Mean   :26.89  
##  3rd Qu.:0.000   3rd Qu.:0.0000   3rd Qu.:14.00   3rd Qu.:31.00  
##  Max.   :1.000   Max.   :1.0000   Max.   :17.00   Max.   :44.00  
##      drinks            tripre1         tripre2         tripre3     
##  Min.   : 0.00000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 0.00000   1st Qu.:1.000   1st Qu.:0.000   1st Qu.:0.000  
##  Median : 0.00000   Median :1.000   Median :0.000   Median :0.000  
##  Mean   : 0.05833   Mean   :0.804   Mean   :0.153   Mean   :0.033  
##  3rd Qu.: 0.00000   3rd Qu.:1.000   3rd Qu.:0.000   3rd Qu.:0.000  
##  Max.   :21.00000   Max.   :1.000   Max.   :1.000   Max.   :1.000  
##     tripre0    
##  Min.   :0.00  
##  1st Qu.:0.00  
##  Median :0.00  
##  Mean   :0.01  
##  3rd Qu.:0.00  
##  Max.   :1.00

# a specific variable
summary(smoking_data$birthweight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     425    3062    3420    3383    3750    5755

For numeric variables, we can summarize data with the center and spread.

Note: Some exercises below are inspired by Dalpiaz, D. (2019). Applied statistics with R. University of Illinois. Urbana-Champaign, IL. Available at: https://daviddalpiaz.github.io/appliedstats

Central Tendency

Measure	`R`	Result
Mean	`mean(smoking_data$educ)`	12.907
Median	`median(smoking_data$educ)`	12

Spread

Measure	`R`	Result
Variance	`var(smoking_data$educ)`	4.6945825
Standard Deviation	`sd(smoking_data$educ)`	2.1666985
IQR	`IQR(smoking_data$educ)`	2
Minimum	`min(smoking_data$educ)`	0
Maximum	`max(smoking_data$educ)`	17
Range	`range(smoking_data$educ)`	0, 17
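
The measures of spread are related to one another, which the sketch below checks on a short made-up vector (not taken from the dataset). Note that range() returns the minimum and maximum, not their difference.

```r
# Five hypothetical values.
x <- c(10, 12, 12, 14, 16)

# The standard deviation is the square root of the variance.
var(x)
## [1] 5.2
all.equal(sd(x), sqrt(var(x)))
## [1] TRUE

# The IQR is the 75th percentile minus the 25th percentile.
IQR(x)
## [1] 2
unname(quantile(x, 0.75) - quantile(x, 0.25))
## [1] 2

# range() gives min and max; diff() turns them into a single width.
range(x)
## [1] 10 16
diff(range(x))
## [1] 6
```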

Categorical

For categorical variables, counts and percentages can be used for summarizing the group.

We can produce a frequency table to find out the number of observations in each group, and create a percentage table to summarize the percentages.

table(smoking_data$unmarried)

## 
##    0    1 
## 2320  680

table(smoking_data$unmarried) / nrow(smoking_data)

## 
##         0         1 
## 0.7733333 0.2266667

# percentage table: number of unmarried or married mothers divided by 
# nrow (row number) = the number of people (observations) in the dataset

Because unmarried is coded 1 for unmarried mothers, 22.7% of the mothers in the dataset are unmarried, while 77.3% are married.
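
Base R also provides prop.table(), which converts a frequency table to proportions without dividing by nrow() by hand. A minimal sketch on a made-up 0/1 vector (hypothetical values, not the course data):

```r
# Eight hypothetical observations: 1 = unmarried, 0 = married.
unmarried <- c(0, 0, 1, 0, 1, 0, 0, 0)

# Frequency table: how many observations fall in each group.
counts <- table(unmarried)
counts

# Proportions: each count divided by the total.
props <- prop.table(counts)
props

# Percentages rounded to one decimal place (75 for group 0, 25 for group 1).
pct <- round(100 * props, 1)
pct
```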

Installing Packages

R comes with a number of built-in functions and datasets, but one of the main strengths of R as an open-source project is its package system. Packages add additional functions and data. Frequently if you want to do something in R, and it is not available by default, there is a good chance that there is a package that will fulfill your needs.

To install a package, use the install.packages() function. Think of this as picking up a power-up for your character in a video game, or as buying a recipe book from the store, bringing it home, and putting it on your shelf (i.e., into your library).
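
A minimal sketch of the install-then-load pattern follows. The package name ggplot2 is only an example chosen for illustration, not one this lab requires, and the two lines are commented out because installing needs an internet connection.

```r
# install.packages("ggplot2")  # run once per computer
# library(ggplot2)             # run in every new R session that uses it

# Check whether a package is already installed before installing it
# again. "stats" ships with R, so this returns TRUE without
# downloading anything.
is_installed <- requireNamespace("stats", quietly = TRUE)
is_installed
## [1] TRUE
```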

Knit a PDF in R Markdown

If you generate PDF output for the first time, you will need to install LaTeX. LaTeX is a typesetting system; it includes features designed for the production of technical and scientific documentation. LaTeX is the de facto standard for the communication and publication of scientific documents.

For R Markdown users who have not installed LaTeX before, you should install TinyTeX. TinyTeX is a lightweight, portable, cross-platform, and easy-to-maintain LaTeX distribution. The R companion package tinytex can help you automatically install missing LaTeX packages when compiling LaTeX or R Markdown documents to PDF. If for some reason TinyTeX does not work on your Mac, you can try installing the latest version of MacTeX instead.

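You can install TinyTeX from within RStudio using the following code. Either uncomment the code first by removing the #, or copy and paste it (without the #) into the Console window at the bottom left of RStudio. If the tinytex package itself is not installed yet, install it first with install.packages("tinytex").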

# tinytex::install_tinytex()

Once you close R, all the packages are closed and put back on the imaginary shelf. The next time you open R, you do not have to install the package again, but you do have to load any packages you intend to use by invoking library().

Lab 1 Practice

Q1: Import the dataset named “birthweight_smoking.csv” and give your dataset a name.

Q2: How many observations and variables does the dataset have?

Q3: Use the summary() function to find the summary statistics of age. Report your findings.

Q4: Find out and report the median and range of drinks.

Q5: Use the table() function to summarize smoker, and report your findings.

Submit your Assignment (Ungraded)

Statastic – you did it!

Step 1: Double check if you answered all the questions thoroughly and check for accuracy ALWAYS!

Step 2: If you use an R Markdown (.Rmd) document, knit it to PDF: click the downward-pointing triangle next to Knit and choose Knit to PDF. If you use an R script (.R), transfer your code, results, and written work into a Word document, then convert it to a PDF.

Step 3: Submit your assignment to Gradescope https://www.gradescope.com/. For grading purposes, you must match each question to the corresponding pages of your submission.