Lecture Resources

Slides for Week 2 are available here.

In this lab, you will learn to:

✅ Import, explore, and visualize a dataset in R.
✅ Understand data structures and variable types.
✅ Apply basic data cleaning techniques.

🔖 Additional Readings:


Setup and Load the R libraries

Install Packages

Right now, the functionality available to you is that of “base R.” This means that you have not yet installed and loaded any extra software packages written by independent developers. In order to create the graphs and perform many interesting analyses, you’ll need to install some additional R packages. To install the necessary packages for this class, run the following code in the RStudio console window.

You only need to install these once. If they are already installed, you can skip this step.

# Here, the chunk option eval = FALSE tells R to skip this code chunk when knitting.
# This prevents R from reinstalling these packages every time we knit this document.

# Install required packages
pkgs <- c(
  # Data Manipulation & Wrangling
  "dplyr", "tidyr", "data.table", "readr", "readxl", "janitor", "stringr", "lubridate",
  # General Statistics & Modeling
  "MASS", "car", "broom", "effects", "lmtest", "sandwich",
  # Data Visualization
  "ggplot2", "ggthemes", "ggridges", "GGally", "ggbeeswarm", "ggpubr", "patchwork", "viridis",
  # Correlation & Regression Visualization
  "ggcorrplot", "corrplot", "visreg",
  # Advanced Data Visualization
  "plotly", "leaflet", "superheat", "waterfalls", "factoextra",
  # Report & Output Formatting
  "stargazer", "kableExtra", "flextable", "xtable", "gt",
  # Teaching & Educational Tools
  "swirl", "learnr",
  # Also loaded with library() later in this lab
  "psych", "tidyverse"
)

# Install all packages
install.packages(pkgs)
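
If you are unsure which packages are already installed, a minimal sketch like the following checks first and installs only what is missing. The helper name find_missing() and the object names are made up for this illustration; installed.packages(), setdiff(), and install.packages() are base R functions. The interactive() check is an assumption about the behavior you want: it allows the installation from the RStudio console but not while knitting.

```r
# Hypothetical helper: return the packages in `pkgs` that are not yet installed
find_missing <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Packages that this lab loads with library()
lab2_pkgs <- c("ggplot2", "psych", "tidyverse")
to_install <- find_missing(lab2_pkgs)

# Install only from an interactive session (e.g., the RStudio console),
# so that knitting the document never triggers an installation
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```

If to_install is empty, everything needed is already installed and nothing happens.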

Once you’ve installed a package, you still have to load it with library() in every new R session before you can use it. Now, load the libraries that we will use for Lab 2.

library(ggplot2) 
# a powerful package for creating visualizations in R.
library(psych) 
# descriptive statistics, factor analysis, and data exploration.
library(tidyverse) 
# a collection of packages including dplyr, tidyr, readr, etc. for data science in R.

Step 1: Import a Dataset

smoking_data is the name of the object in which the smoking data will be stored. If the argument header = TRUE (the default for read.csv()), the first row of the file is treated as the column (variable) names rather than as data. These data were reported in Almond, D., Chay, K. Y., & Lee, D. S. (2005). The costs of low birth weight. The Quarterly Journal of Economics, 120(3), 1031-1083, and are made available via Stock, J. H., & Watson, M. W. (2020). Introduction to Econometrics (4th ed.). New York, NY: Pearson.

RStudio Desktop: If you saved the .csv dataset in your project folder, you can load it using just the file name.

If you suspect you accidentally changed or corrupted the dataset, you can always re-import the .csv file. Re-importing lets you start anew with the original data.

smoking_data <- read.csv("birthweight_smoking.csv")

Now, in the Global Environment panel (located on the right side of your RStudio interface), you should see "smoking_data" listed. Click on it to open and explore the dataset in a spreadsheet-like viewer. 🚀

It has 3,000 observations (i.e., 3,000 rows) and 12 variables (i.e., 12 columns).
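
To see what the header argument does, here is a small self-contained sketch. It writes a three-row stand-in file to a temporary location (the values and the object names toy and toy_no_header are made up for illustration; this is not the lab dataset) and reads it back both ways.

```r
# Write a tiny stand-in CSV file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("birthweight,smoker,age",
             "4253,1,27",
             "3459,0,24",
             "2920,1,23"), tmp)

# header = TRUE (the default): the first row supplies the column names
toy <- read.csv(tmp, header = TRUE)
dim(toy)     # number of rows, then number of columns
names(toy)   # column names taken from the first row of the file

# header = FALSE: the first row is read as data and columns are named V1, V2, ...
toy_no_header <- read.csv(tmp, header = FALSE)
dim(toy_no_header)
names(toy_no_header)
```

Running dim() and names() right after an import is a quick way to confirm that the file was read the way you expected.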

If your dataset is not stored in the project folder, you will need to supply the full path when importing your data. Remember to use the forward slash “/” in the path. To run the following code, replace the path with your own file path (including the .csv file name) and remove the #.

An example:

# smoking_data2 <- read.csv("G:/My Drive/SPP608-2023/Lab2/birthweight_smoking.csv") 
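
A mistyped path is a common cause of import errors. As a hedged sketch, file.path() can build the path with forward slashes and file.exists() can check it before importing. The folder, file name, and object names below are placeholders created for this illustration, not the lab files.

```r
# Placeholder folder and file name; replace these with your own location
data_dir  <- tempdir()
data_file <- file.path(data_dir, "example_data.csv")

# Create a tiny stand-in file so that this sketch runs on its own
writeLines(c("birthweight,smoker", "4253,1", "3459,0"), data_file)

# Check that the file exists before trying to import it
if (file.exists(data_file)) {
  example_data <- read.csv(data_file)
} else {
  stop("File not found: ", data_file)
}

nrow(example_data)
```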

What if our data were saved as an Excel sheet (.xlsx)? We first install a package called “readxl” with install.packages("readxl") and then run the following code:

Tip: To run the code, remove the # first. Install packages only once, then put the # back in front of the install.packages() line so that it does not re-run when knitting.

# install.packages("readxl")
# library(readxl)
# smoking_data3 <- read_excel("birthweight_smoking.xlsx") 

Alternatively, you can open it through the files pane (in the lower right panel) by clicking the .xlsx file.


Step 2: Exploratory Data Analysis (Recap)

summary()

The summary() function provides a quick overview of the dataset, summarizing key statistics for each variable. This is a crucial step in exploratory data analysis (EDA) because it helps you identify patterns, potential issues, and characteristics of the data before performing further analysis.

summary(smoking_data)
##     nprevist        alcohol           tripre1         tripre2     
##  Min.   : 0.00   Min.   :0.00000   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 9.00   1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.000  
##  Median :12.00   Median :0.00000   Median :1.000   Median :0.000  
##  Mean   :10.99   Mean   :0.01933   Mean   :0.804   Mean   :0.153  
##  3rd Qu.:13.00   3rd Qu.:0.00000   3rd Qu.:1.000   3rd Qu.:0.000  
##  Max.   :35.00   Max.   :1.00000   Max.   :1.000   Max.   :1.000  
##     tripre3         tripre0      birthweight       smoker        unmarried     
##  Min.   :0.000   Min.   :0.00   Min.   : 425   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.00   1st Qu.:3062   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :0.000   Median :0.00   Median :3420   Median :0.000   Median :0.0000  
##  Mean   :0.033   Mean   :0.01   Mean   :3383   Mean   :0.194   Mean   :0.2267  
##  3rd Qu.:0.000   3rd Qu.:0.00   3rd Qu.:3750   3rd Qu.:0.000   3rd Qu.:0.0000  
##  Max.   :1.000   Max.   :1.00   Max.   :5755   Max.   :1.000   Max.   :1.0000  
##       educ            age            drinks        
##  Min.   : 0.00   Min.   :14.00   Min.   : 0.00000  
##  1st Qu.:12.00   1st Qu.:23.00   1st Qu.: 0.00000  
##  Median :12.00   Median :27.00   Median : 0.00000  
##  Mean   :12.91   Mean   :26.89   Mean   : 0.05833  
##  3rd Qu.:14.00   3rd Qu.:31.00   3rd Qu.: 0.00000  
##  Max.   :17.00   Max.   :44.00   Max.   :21.00000

Next Steps After summary()

  1. Identify missing values (NA) – Consider imputation or data cleaning.
  2. Detect potential outliers – Use boxplots or histograms to visualize distributions.
  3. Check for inconsistencies – Compare against documentation to ensure variable labels are correct.
  4. Decide transformations – Some categorical variables may need recoding as factors.
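
The first and fourth steps above can be sketched in base R as follows. The small data frame toy and its values are made up for illustration (the missing values were inserted on purpose); is.na(), colSums(), and table() are base R functions.

```r
# A made-up data frame with a few missing values inserted on purpose
toy <- data.frame(
  birthweight = c(4253, 3459, NA, 2600),
  smoker      = c(1, 0, 1, NA),
  age         = c(27, 24, 23, 28)
)

# Step 1: count missing values (NA) in each column
na_counts <- colSums(is.na(toy))
na_counts

# Step 4: list the distinct values of a coded variable, including NA,
# to decide whether it should be recoded as a factor
table(toy$smoker, useNA = "ifany")
```

A variable that only takes the values 0 and 1 is usually an indicator and a good candidate for conversion to a factor.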

Examine a variable using $

To take a closer look at a specific variable, we use $ to pick that variable out of smoking_data. For instance, to find the data structure and summary information of the variable called smoker, we write data$variable, here smoking_data$smoker, as in the following example.

str()

Examine the data structure of the variables in the data frame (factor, numeric, integer, etc.).

str(smoking_data)
## 'data.frame':    3000 obs. of  12 variables:
##  $ nprevist   : int  12 5 12 13 9 11 12 10 13 10 ...
##  $ alcohol    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tripre1    : int  1 0 1 1 1 1 1 1 1 1 ...
##  $ tripre2    : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ tripre3    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tripre0    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ birthweight: int  4253 3459 2920 2600 3742 3420 2325 4536 2850 2948 ...
##  $ smoker     : int  1 0 1 0 0 0 1 0 0 0 ...
##  $ unmarried  : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ educ       : int  12 16 11 17 13 16 14 13 17 14 ...
##  $ age        : int  27 24 23 28 27 33 24 38 29 28 ...
##  $ drinks     : int  0 0 0 0 0 0 0 0 0 0 ...
# What is the data structure of `data$smoker`?
str(smoking_data$smoker)
##  int [1:3000] 1 0 1 0 0 0 1 0 0 0 ...
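
The same $ notation works with other functions too. A minimal sketch on a made-up five-row data frame (toy is a hypothetical stand-in, not the lab dataset):

```r
# A made-up stand-in for the lab data
toy <- data.frame(
  smoker      = c(1L, 0L, 1L, 0L, 0L),
  birthweight = c(4253L, 3459L, 2920L, 2600L, 3742L)
)

class(toy$smoker)        # storage type of a single column
summary(toy$smoker)      # summary statistics for that column only
table(toy$smoker)        # how often each value occurs
mean(toy$birthweight)    # any function that takes a vector works with $
```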

Step 3: Data Cleaning and Variable Transformation

Before proceeding to data analysis, it is essential to first transform certain variables (e.g., categorical variables) to ensure that the data is in the right format and condition to yield meaningful, accurate, and interpretable results.

Handling Categorical Variables

Categorical variables represent distinct groups or categories and need to be converted into a format that R can process appropriately for analysis. Because R reads numeric codes such as 0 and 1 as integer data by default, we must convert categorical variables from integer format to factor format.

Converting categorical variables into factors allows us to:

âś… Categorize and analyze data based on unique categories.
âś… Perform frequency tabulations to summarize categorical data.
âś… Apply appropriate statistical models (e.g., logistic regression, chi-square tests).

For comparison, while categorical variables focus on group frequencies, numerical variables are assessed using measures like mean, median, variance, and range to describe their distribution.

Factor Variable & Assigning Labels

To ensure data integrity and facilitate analysis without altering the original dataset, it is best practice to create new transformed variables while preserving the original values. For example, we have a numeric variable named smoker where:

  • 0 represents non-smokers

  • 1 represents smokers

We can create two new variables:

smoker1 is a factor variable converted from smoker, enabling categorical analysis. smoker2 is a labeled factor variable, with descriptive category names for clarity.

smoking_data$smoker1 <- factor(smoking_data$smoker)   

# Now, check whether the data structure and summary for `smoker1` have changed
str(smoking_data$smoker1)  
##  Factor w/ 2 levels "0","1": 2 1 2 1 1 1 2 1 1 1 ...
summary(smoking_data$smoker1)  
##    0    1 
## 2418  582

Assigning Labels to Factor Variables

To improve interpretability, we assign meaningful labels to the factor levels so that instead of numeric values, we work with readable category names:

smoking_data$smoker2 <- factor(smoking_data$smoker, 
                              levels = c(0,1),
                              labels = c("no", "yes"))
summary(smoking_data$smoker2)
##   no  yes 
## 2418  582
str(smoking_data$smoker2)
##  Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 2 1 1 1 ...

Important: Once a factor variable is labeled, always refer to the labels ("no", "yes") rather than the original numerical values (0, 1).
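
To see why the labels matter, here is a small sketch using a made-up vector of five codes (not the full dataset). Comparing a labeled factor with the old numeric code matches nothing, because the factor now only knows the labels.

```r
# Made-up codes: 1 = smoker, 0 = non-smoker
smoker  <- c(1, 0, 1, 0, 0)
smoker2 <- factor(smoker, levels = c(0, 1), labels = c("no", "yes"))

# Comparing against the label counts the smokers
n_by_label <- sum(smoker2 == "yes")

# Comparing against the old code finds no match: no level is called "1"
n_by_code <- sum(smoker2 == 1)

# The internal integer codes of a factor start at 1, so they are not the original 0/1 values
internal_codes <- as.integer(smoker2)

n_by_label      # 2
n_by_code       # 0
internal_codes  # 2 1 2 1 1
```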

R Operators

R has several operators to perform tasks including arithmetic, logical and bitwise operations. They are very helpful when you want to specify conditions for variable transformation and manipulation of the data frame.

Operator    Meaning
<           less than
<=          less than or equal to
>           greater than
>=          greater than or equal to
==          equal to
!=          not equal to
!x          not x
x | y       x OR y
x & y       x AND y
isTRUE(x)   test if x is TRUE
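
A short sketch of how these operators behave on vectors. The two vectors are made-up values for five hypothetical mothers. Each comparison returns one TRUE or FALSE per element.

```r
# Made-up values for five hypothetical mothers
age       <- c(16, 24, 33, 17, 40)
unmarried <- c(1, 0, 0, 1, 1)

is_adult         <- age >= 18                     # FALSE TRUE TRUE FALSE TRUE
adult_and_married <- age >= 18 & unmarried == 0   # both conditions must hold
minor_or_unmarried <- age < 18 | unmarried == 1   # at least one condition holds
not_adult        <- !is_adult                     # flips TRUE and FALSE

# sum() counts the TRUE values
sum(is_adult)            # 3
sum(adult_and_married)   # 2
```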

Count frequency

You might want to count how many values in a column satisfy a certain condition. For example, suppose we want to know how many mothers have a value of exactly 1 in the drinks column, and how many mothers are smokers. A logical condition returns TRUE or FALSE for each row, and sum() counts the TRUE values:

sum(smoking_data$drinks == 1)
## [1] 30
sum(smoking_data$smoker2 == "yes")
## [1] 582

📌 Key Considerations:

  • smoking_data$drinks refers to the drinks column in the dataset.

  • The first line counts the number of rows (i.e., mothers in the dataset) where drinks equals exactly 1. Mothers with larger values of drinks are not counted; to count every mother who reported any drinking, use the condition drinks > 0 instead.

  • The second line counts the number of rows (i.e., mothers in the dataset) where smoker2 equals "yes" (indicating a smoker).

  • If working with a transformed factor variable, make sure to use labels instead of numerical values.
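
Counts are often easier to interpret next to proportions. As a hedged sketch, mean() of a logical condition gives the share of TRUE values, and table() with prop.table() tabulates every category at once. The ten codes below are a small stand-in vector, not the full dataset.

```r
# A small stand-in vector of ten smoker codes (1 = smoker, 0 = non-smoker)
smoker2 <- factor(c(1, 0, 1, 0, 0, 0, 1, 0, 0, 0),
                  levels = c(0, 1), labels = c("no", "yes"))

n_yes     <- sum(smoker2 == "yes")    # count of smokers: 3
share_yes <- mean(smoker2 == "yes")   # proportion of smokers: 0.3

counts <- table(smoker2)       # frequency of each category
shares <- prop.table(counts)   # the same table expressed as proportions

counts
shares
```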

filter() a dataset

Often, we need to filter a dataset to see only the observations that match specific conditions. Suppose we want to filter for mothers who were 18 years old or older. We can use the filter() function from the dplyr package, which is loaded as part of the tidyverse:

# Store the filtered data in a new data frame "adults"
adults <- filter(smoking_data, age >= 18)

# check the dimension
# also check the dataset in the global environment
# inspect the top six rows
dim(adults) 
## [1] 2908   14
head(adults)
##   nprevist alcohol tripre1 tripre2 tripre3 tripre0 birthweight smoker unmarried
## 1       12       0       1       0       0       0        4253      1         1
## 2        5       0       0       1       0       0        3459      0         0
## 3       12       0       1       0       0       0        2920      1         0
## 4       13       0       1       0       0       0        2600      0         0
## 5        9       0       1       0       0       0        3742      0         0
## 6       11       0       1       0       0       0        3420      0         0
##   educ age drinks smoker1 smoker2
## 1   12  27      0       1     yes
## 2   16  24      0       0      no
## 3   11  23      0       1     yes
## 4   17  28      0       0      no
## 5   13  27      0       0      no
## 6   16  33      0       0      no

Try it yourself!

📌 Filtering with Multiple Conditions

To filter for adult mothers who are also married, we can add another condition:

# filter with two conditions
adults_married <- filter(smoking_data, age >= 18 & unmarried == 0)
dim(adults_married) 
## [1] 2313   14
head(adults_married)
##   nprevist alcohol tripre1 tripre2 tripre3 tripre0 birthweight smoker unmarried
## 1        5       0       0       1       0       0        3459      0         0
## 2       12       0       1       0       0       0        2920      1         0
## 3       13       0       1       0       0       0        2600      0         0
## 4        9       0       1       0       0       0        3742      0         0
## 5       11       0       1       0       0       0        3420      0         0
## 6       12       0       1       0       0       0        2325      1         0
##   educ age drinks smoker1 smoker2
## 1   16  24      0       0      no
## 2   11  23      0       1     yes
## 3   17  28      0       0      no
## 4   13  27      0       0      no
## 5   16  33      0       0      no
## 6   14  24      0       1     yes

💡 Tip: To avoid overwriting your dataset, always store the filtered output in a new variable such as data_new rather than modifying smoking_data directly.
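
For comparison, the same kind of filtering can be sketched in base R with subset(), which needs no extra package. This is an alternative to filter(), not the approach used in this lab, and the data frame toy is made up for illustration.

```r
# Made-up data for five hypothetical mothers
toy <- data.frame(
  age         = c(16, 24, 33, 17, 40),
  unmarried   = c(1, 0, 0, 1, 1),
  birthweight = c(3100, 3459, 3420, 2800, 3600)
)

# One condition: mothers aged 18 or older
adults_toy <- subset(toy, age >= 18)

# Two conditions joined with &: aged 18 or older AND married
adults_married_toy <- subset(toy, age >= 18 & unmarried == 0)

# The original data frame is unchanged because the results were stored in new objects
nrow(toy)                  # 5
nrow(adults_toy)           # 3
nrow(adults_married_toy)   # 2
```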


Step 4: Data Visualization of Numerical/Continuous/Discrete Variables

Numerical variables take numeric values: continuous variables can take any value within a given range, while discrete variables take countable values. Visualizing numerical variables helps us understand the distribution, central tendency, variability, and presence of any patterns or outliers in the data. Here are some common methods for summarizing and visualizing numerical variables:

describe() from psych library

Besides the mean, median, min and max, describe() also reports the standard deviation (sd), skewness, kurtosis and, when IQR = TRUE is specified, the inter-quartile range. The IQR is the difference between Q3 and Q1, i.e., between the 75th and 25th percentiles of the data.

describe(smoking_data$birthweight, IQR=TRUE) 
##    vars    n    mean     sd median trimmed    mad min  max range  skew kurtosis
## X1    1 3000 3382.93 592.16   3420 3412.04 520.39 425 5755  5330 -0.83     2.54
##       se IQR
## X1 10.81 688
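
To make the IQR definition concrete, the following base R sketch computes it by hand for the first ten birthweight values shown in the str() output above. Ten rows are far too few to describe the full dataset; they only illustrate the arithmetic. The 1.5 × IQR rule used at the end is the common boxplot convention for flagging possible outliers, added here as an assumption about how you might screen the data.

```r
# First ten birthweight values from the str() output (illustration only)
bw <- c(4253, 3459, 2920, 2600, 3742, 3420, 2325, 4536, 2850, 2948)

# Q1 and Q3 are the 25th and 75th percentiles
q <- quantile(bw, probs = c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])

# The built-in IQR() function gives the same result
iqr_builtin <- IQR(bw)

# Common boxplot convention: flag values more than 1.5 * IQR beyond Q1 or Q3
lower_fence <- unname(q[1]) - 1.5 * iqr_manual
upper_fence <- unname(q[2]) + 1.5 * iqr_manual
flagged <- bw[bw < lower_fence | bw > upper_fence]

iqr_manual
iqr_builtin
flagged
```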

ggplot(): Grammar of Graphics!

ggplot2 is an R package specifically designed for data visualization, adhering to the principles of The Grammar of Graphics. You supply your data and tell ggplot2 how to map variables to aesthetics and which graphical primitives to use; it then handles the drawing details, which makes it easier to produce high-quality, visually appealing graphics.

“geoms” is short for geometric objects. Geometric objects are the actual shapes or visual elements that are plotted on a graph to represent data. Each geom function in ggplot2 corresponds to a specific type of graphical representation. The choice of geom function depends on the nature of the data and the type of visualization you want to create. Some of the most commonly used geoms include:

Type               Function                      Description
Point              geom_point()                  Adds points to the plot, useful for scatter plots.
Line               geom_line()                   Connects data points with lines, ideal for time series or trend lines.
Bar                geom_bar(), geom_col()        Produces bar charts, which can display values for different categories or counts of categorical variables.
Histogram          geom_histogram()              Creates histograms to visualize the distribution of a single continuous variable by dividing it into bins and counting the number of observations in each bin.
Regression         geom_smooth()                 Adds a smoothed conditional mean to the plot, often used to visualize trends and patterns over a continuous variable.
Boxplot            geom_boxplot()                Generates box plots to show the distribution of a continuous variable, highlighting the median, quartiles, and outliers.
Text/Label         geom_text(), geom_label()     Adds text or labels to the plot to annotate specific points or provide additional information.
Vert./Horiz. Line  geom_vline(), geom_hline()    Adds vertical or horizontal lines to a plot.
Count              geom_count()                  Sizes points according to the number of observations at identical locations in the plot.
Density            geom_density()                Creates a smooth density estimate of a continuous variable to visualize the distribution of the data.

Creating Histograms to Examine Distributions

Basic Histogram Using hist()

A histogram is a simple way to visualize the distribution of a numerical variable. It groups the data into bins and displays the frequency of observations within each bin.

We will create a histogram for our dependent variable, birthweight, to understand its distribution:

# base r approach
hist(smoking_data$birthweight)

# with labels
hist(smoking_data$birthweight, 
     main = "Histogram of Birthweight",
     xlab = "Birthweight (grams)",
     col = "lightblue", 
     border = "black")

# basic ggplot2 approach

ggplot(smoking_data, aes(birthweight)) +
  geom_bar(colour="black", fill="bisque") +
  scale_x_binned()

# Plot the histogram with blue bars and white borders

ggplot(smoking_data, aes(x = birthweight)) +
  geom_histogram(fill = "cornflowerblue", 
                 color = "white") + 
  labs(title = "Distribution of Birthweight",
       x = "Birthweight (grams)",
       y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
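
The message above appears because no bin setting was given. A hedged sketch of how to set one explicitly with the binwidth argument of geom_histogram(): the data here are simulated with rnorm() so that the example runs on its own, and the width of 250 grams is an arbitrary choice for illustration, not a recommended value.

```r
library(ggplot2)

# Simulated stand-in data (not the lab dataset)
set.seed(608)
toy <- data.frame(birthweight = rnorm(500, mean = 3400, sd = 600))

# binwidth sets the width of each bar in the units of the x variable (grams here)
p_hist <- ggplot(toy, aes(x = birthweight)) +
  geom_histogram(binwidth = 250,
                 fill = "cornflowerblue",
                 color = "white") +
  labs(title = "Distribution of Simulated Birthweight",
       x = "Birthweight (grams)",
       y = "Count")

p_hist
```

Try a few different widths: very narrow bins look noisy, while very wide bins can hide the shape of the distribution.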

Choose your color from: R color cheatsheet.

💡 What to Look For?

  • Is the distribution normal, skewed, or multimodal?

  • Are there any outliers (extremely high or low values)?

  • What is the range and spread of values?

Histogram + Density Plot

A density plot provides a smoothed estimate of the distribution using a kernel density function. Overlaying a density plot on a histogram allows us to better understand the shape of the distribution.

# simplified version of density plot
density_bw <- density(smoking_data$birthweight)
plot(density_bw)

# Histogram with a density plot
ggplot(smoking_data, aes(x = birthweight)) +
  geom_histogram(aes(y = after_stat(density)),  # after_stat() replaces the deprecated ..density..
                 fill = "white", 
                 color = "black", 
                 bins = 30) +
  geom_density(alpha = 0.4, fill = "coral") +
  labs(title = "Histogram and Density Plot of Birthweight",
       x = "Birthweight (grams)",
       y = "Density")

# save the last ggplot graphic! 
# After running the code, the PDF file will be stored in your project folder
ggsave("histogram_birthweight.pdf", width = 6, height = 4)
# note: ggsave won't work for basic R plots, only for ggplot() objects
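
By default ggsave() saves the most recent ggplot. A minimal sketch of a more explicit alternative is to store the plot in an object and pass it through the plot argument. The object names, the toy values, and the temporary output location are chosen for illustration; in your own work you would save to your project folder instead.

```r
library(ggplot2)

# A small stand-in data frame (illustration only)
toy <- data.frame(
  birthweight = c(4253, 3459, 2920, 2600, 3742, 3420, 2325, 4536, 2850, 2948)
)

# Store the plot in an object instead of only printing it
p <- ggplot(toy, aes(x = birthweight)) +
  geom_histogram(bins = 5, fill = "white", color = "black") +
  labs(x = "Birthweight (grams)", y = "Count")

# Save that specific plot; a temporary folder is used here so the sketch is self-contained
out_file <- file.path(tempdir(), "histogram_toy.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)

file.exists(out_file)
```

Naming the plot makes it clear which graphic is being saved, even if other plots were drawn in between.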

💡 Key Insights from Density Plots:

  • The peak(s) indicate where most data points are concentrated.

  • A long tail suggests skewness (positive or negative).

  • Multiple peaks may indicate a bimodal or multimodal distribution.

Dot Chart

Another alternative to the histogram is the dot chart. Again, the quantitative variable is divided into bins, but rather than summary bars, each observation is represented by a dot. By default, the width of a dot corresponds to the bin width, and dots are stacked, with each dot representing one observation. This works best when the number of observations is small (say, less than 150).

# create a smaller dataset for a dot chart

minidata <- filter(smoking_data, drinks > 0) # n = 58

ggplot(minidata, aes(x = birthweight)) +
  geom_dotplot() + 
  labs(title = "Infant Birthweight of Drinking Mothers",
       y = "Proportion",
       x = "Birthweight")
## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.

Scatterplot

A scatterplot is a graphical representation of the relationship between two numerical variables. In this case, we examine the association between the number of prenatal visits (nprevist) and infant birthweight (birthweight).

library(ggthemes)
ggplot(smoking_data, aes(x = nprevist, y = birthweight)) +
  geom_point(color = "salmon3", alpha = 0.5) +
  # linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4.0)
  geom_smooth(method = "lm", color = "black", se = FALSE, linewidth = 0.5) +
  labs(title = "Scatterplot of Prenatal Visits and Birthweight",
       subtitle = "Lab 2",
       x = "Number of Prenatal Visits",
       y = "Birthweight (grams)",
       caption = "(based on data in Pennsylvania in 1989)") +
  theme_gdocs(base_size = 10, base_family = "sans")

  # Other themes: theme_grey, theme_bw, theme_light, theme_minimal, theme_classic, etc.
  # customize your theme
  # https://www.r-bloggers.com/custom-themes-in-ggplot2/ 
  # https://ggplot2.tidyverse.org/reference/geom_histogram.html 

💡 Key Observations:

  • The scatterplot suggests a positive linear relationship: as the number of prenatal visits increases, birthweight also tends to increase.

  • The black regression line provides a clearer visualization of this trend.

  • Potential Outliers: Some points may deviate significantly from the trend, requiring further investigation.
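
A visual impression can be backed up with numbers. As a hedged sketch, cor() gives the correlation coefficient and lm() gives the slope of the fitted line that geom_smooth(method = "lm") draws. Only the first ten rows shown in the str() output are used here so that the example runs on its own; the result does not represent the full dataset.

```r
# First ten rows from the str() output (illustration only)
toy <- data.frame(
  nprevist    = c(12, 5, 12, 13, 9, 11, 12, 10, 13, 10),
  birthweight = c(4253, 3459, 2920, 2600, 3742, 3420, 2325, 4536, 2850, 2948)
)

# Correlation coefficient: direction and strength of the linear association
r <- cor(toy$nprevist, toy$birthweight)

# Simple linear regression: birthweight as a function of prenatal visits
fit   <- lm(birthweight ~ nprevist, data = toy)
slope <- coef(fit)[["nprevist"]]

r       # between -1 and 1
slope   # change in grams associated with one additional prenatal visit
```

The sign of the slope always matches the sign of the correlation; with so few rows, either could differ from what the full scatterplot shows.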


Lab 2 Assignment

Q1: Import the dataset named “birthweight_smoking.csv”, name your dataset.

Q2: From the summary() output, identify which variables in the dataset are numerical and which are categorical. (Tip: A categorical variable takes on two or more values that represent categories or labels without inherent numerical meaning. A numerical variable takes countable values or any value within a given range.)

Q3: Change the data type of another categorical variable (name a new variable), and assign the value labels. Then use summary() to discuss the summary results.

Q4: In your response, specify the condition(s) for at least one variable to create a new data frame. In the code chunk below, use filter() to generate a new data frame (specify a new name) using the conditions and operator. Lastly, use the summary() function to inspect the new dataset.

Q5: Describe the distribution of the variable `birthweight`, including the mean, median, IQR, and shape. Discuss possible outliers using the measures of skewness (symmetry) and kurtosis (mass in the tails).

Q6: Use ggplot() to create a histogram with a density plot for another numerical variable other than birthweight. What can you observe from the graph in terms of the distribution?

Q7: Plot a new scatterplot using ggplot() for another numerical variable (x) and birthweight, and interpret the relationship. You can add themes, labels, colors that make your graph look neat and professional.


Submit your Assignment

Statastic – Well done!

Step 1: Double check if you answered all the questions thoroughly and check for accuracy ALWAYS!

Step 2: If you use an R Markdown (.Rmd) document, knit it: click the downward-pointing triangle next to Knit and choose Knit to PDF. If you use an R script (.R), copy your code and results into a Word document, complete the assignment there, and then convert it to a PDF.

Step 3: Submit your assignment to Gradescope https://www.gradescope.com/.