Lecture Resources
Slides for Week 2 are available here.
In this lab, you will learn to:
✅ Import, explore, and visualize a dataset in R.
✅ Understand data structures and variable types.
✅ Apply basic data cleaning techniques.
🔖 Additional Readings:
Right now, the functionality available to you is that of “base R.” This means that you have not yet installed and loaded any extra software packages written by independent developers. In order to create the graphs and perform many interesting analyses, you’ll need to install some additional R packages. To install the necessary packages for this class, run the following code in the RStudio console window.
You only need to install these once. If they are already installed, you can skip this step.
# This chunk is set to eval = FALSE, so R skips it when knitting.
# That prevents R from reinstalling these packages every time we knit this document.
# Install required packages
pkgs <- c(
# Data Manipulation & Wrangling
"dplyr", "tidyr", "data.table", "readr", "readxl", "janitor", "stringr", "lubridate",
# General Statistics & Modeling
"MASS", "car", "broom", "effects", "lmtest", "sandwich",
# Data Visualization
"ggplot2", "ggthemes", "ggridges", "GGally", "ggbeeswarm", "ggpubr", "patchwork", "viridis",
# Correlation & Regression Visualization
"ggcorrplot", "corrplot", "visreg",
# Advanced Data Visualization
"plotly", "leaflet", "superheat", "waterfalls", "factoextra",
# Report & Output Formatting
"stargazer", "kableExtra", "flextable", "xtable", "gt",
# Teaching & Educational Tools
"swirl", "learnr",
# Loaded with library() later in this lab
"psych", "tidyverse"
)
# Install all packages
install.packages(pkgs)
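If you would rather not reinstall packages that are already present, one option is to check what is installed first. The sketch below is only an illustration: the helper name `find_missing()` is hypothetical and not part of the lab's code, and the install call is left commented out so nothing is downloaded by accident.

```r
# Hypothetical helper: return the packages in `pkgs` that are not yet installed
find_missing <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Base packages ship with R, so none of them should be reported as missing
missing_base <- find_missing(c("stats", "utils"))
print(missing_base)

# With the lab's `pkgs` vector you could then run:
# to_install <- find_missing(pkgs)
# if (length(to_install) > 0) install.packages(to_install)
```

This keeps the "install once" advice automatic: only packages that are actually absent would be passed to install.packages().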
Once you’ve installed a package, you still have to load it with library() in every R session in which you use it. Now, load the libraries that we will use for Lab 2.
library(ggplot2)
# a powerful package for creating visualizations in R.
library(psych)
# descriptive statistics, factor analysis, and data exploration.
library(tidyverse)
# a collection of packages including dplyr, tidyr, readr, etc. for data science in R.
smoking_data is the name of the object in which the smoking data will be stored. If the argument header = TRUE (the default for read.csv()), the first row of the file is treated as the column (variable) names. These data were reported in Almond, D., Chay, K. Y., & Lee, D. S. (2005). The costs of low birth weight. The Quarterly Journal of Economics, 120(3), 1031-1083, and are made available via Stock, J. H., & Watson, M. W. (2020). Introduction to Econometrics. 4th ed. New York, NY: Pearson.
RStudio Desktop: If you saved the .csv dataset in your project folder, you can load it by file name alone.
If you suspect you accidentally changed or corrupted the dataset, you can always re-import the .csv file. Re-importing lets you start anew with the original data.
smoking_data <- read.csv("birthweight_smoking.csv")
Now, in the Global Environment panel (located on the right side of your RStudio interface), you should see smoking_data listed. Click on it to open and explore the dataset in a spreadsheet-like viewer. 🚀
It has 3,000 observations (i.e., 3,000 rows) and 12 variables (i.e., 12 columns).
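To see what the header argument does, and how to confirm the dimensions from code rather than from the viewer, here is a minimal sketch. It writes a tiny temporary CSV with made-up values (not the lab data) and reads it back both ways; the object names are assumptions chosen for illustration.

```r
# Toy file for illustration only; the values are made up, not from the lab data
tmp <- tempfile(fileext = ".csv")
writeLines(c("birthweight,smoker", "3400,0", "2900,1"), tmp)

with_header <- read.csv(tmp)                     # header = TRUE is the default
without_header <- read.csv(tmp, header = FALSE)  # first row is read as data

dim(with_header)       # 2 rows, 2 columns
names(with_header)     # "birthweight" "smoker"
dim(without_header)    # 3 rows, 2 columns
names(without_header)  # default names "V1" "V2"
```

Running dim(smoking_data) and names(smoking_data) on the real dataset is a quick way to check that the import picked up the column names as intended.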
If your dataset is not stored in the project folder, you will need to supply the full path when importing your data. Remember to use the forward slash "/". Replace the path below with your own file path (including the .csv file name) and remove the # to run the code.
An example:
# smoking_data2 <- read.csv("G:/My Drive/SPP608-2023/Lab2/birthweight_smoking.csv")
What if our data were saved as an Excel sheet (.xlsx)? We first install a package called "readxl" with install.packages("readxl") and then run the following code:
Tip: To run the code, remove the # first. Install packages only once, then put the # back in front of the install.packages() line so that it is not re-run when knitting.
# install.packages("readxl")
# library(readxl)
# smoking_data3 <- read_excel("birthweight_smoking.xlsx")
Alternatively, you can open it through the files pane (in the lower right panel) by clicking the .xlsx file.
The summary() function provides a quick overview of the dataset, summarizing key statistics for each variable. This is a crucial step in exploratory data analysis (EDA) because it helps you identify patterns, potential issues, and characteristics of the data before performing further analysis.
summary(smoking_data)
## nprevist alcohol tripre1 tripre2
## Min. : 0.00 Min. :0.00000 Min. :0.000 Min. :0.000
## 1st Qu.: 9.00 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.000
## Median :12.00 Median :0.00000 Median :1.000 Median :0.000
## Mean :10.99 Mean :0.01933 Mean :0.804 Mean :0.153
## 3rd Qu.:13.00 3rd Qu.:0.00000 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :35.00 Max. :1.00000 Max. :1.000 Max. :1.000
## tripre3 tripre0 birthweight smoker unmarried
## Min. :0.000 Min. :0.00 Min. : 425 Min. :0.000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:3062 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.000 Median :0.00 Median :3420 Median :0.000 Median :0.0000
## Mean :0.033 Mean :0.01 Mean :3383 Mean :0.194 Mean :0.2267
## 3rd Qu.:0.000 3rd Qu.:0.00 3rd Qu.:3750 3rd Qu.:0.000 3rd Qu.:0.0000
## Max. :1.000 Max. :1.00 Max. :5755 Max. :1.000 Max. :1.0000
## educ age drinks
## Min. : 0.00 Min. :14.00 Min. : 0.00000
## 1st Qu.:12.00 1st Qu.:23.00 1st Qu.: 0.00000
## Median :12.00 Median :27.00 Median : 0.00000
## Mean :12.91 Mean :26.89 Mean : 0.05833
## 3rd Qu.:14.00 3rd Qu.:31.00 3rd Qu.: 0.00000
## Max. :17.00 Max. :44.00 Max. :21.00000
Next Steps After summary()
Missing values (NA): Consider imputation or data cleaning.
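When summary() reports missing values, it lists them in an extra "NA's" row for the affected variable. A direct count is often easier to read. The following is a sketch on a made-up data frame (the name `toy` and its values are assumptions, not the lab data):

```r
# Toy data frame with made-up values, including some missing entries
toy <- data.frame(
  age  = c(27, NA, 23, 28),
  educ = c(12, 16, NA, NA)
)

na_per_column <- colSums(is.na(toy))          # number of NA values in each column
rows_with_na  <- sum(!complete.cases(toy))    # rows with at least one NA

na_per_column   # age: 1, educ: 2
rows_with_na    # 3
```

Replacing `toy` with smoking_data applies the same check to the lab dataset.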
To take a closer look at a specific variable, we use $ to indicate that we are looking for a variable of interest within smoking_data. For instance, to find the data structure and summary information of the variable called smoker, we write it in the form data$variable, as in the following example.
Examine the data structure of the variables in the data frame (factor, numeric, integer, etc.).
str(smoking_data)
## 'data.frame': 3000 obs. of 12 variables:
## $ nprevist : int 12 5 12 13 9 11 12 10 13 10 ...
## $ alcohol : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tripre1 : int 1 0 1 1 1 1 1 1 1 1 ...
## $ tripre2 : int 0 1 0 0 0 0 0 0 0 0 ...
## $ tripre3 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tripre0 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ birthweight: int 4253 3459 2920 2600 3742 3420 2325 4536 2850 2948 ...
## $ smoker : int 1 0 1 0 0 0 1 0 0 0 ...
## $ unmarried : int 1 0 0 0 0 0 0 0 0 0 ...
## $ educ : int 12 16 11 17 13 16 14 13 17 14 ...
## $ age : int 27 24 23 28 27 33 24 38 29 28 ...
## $ drinks : int 0 0 0 0 0 0 0 0 0 0 ...
# What is the data structure of `data$smoker`?
str(smoking_data$smoker)
## int [1:3000] 1 0 1 0 0 0 1 0 0 0 ...
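As a complement to str(), the sketch below shows two small base R idioms on a made-up data frame (`toy` is a hypothetical name): listing the class of every column at once, and the two equivalent ways of pulling out a single column.

```r
# Toy data frame with made-up values
toy <- data.frame(
  smoker      = c(1L, 0L, 1L),
  birthweight = c(4253, 3459, 2920)
)

column_classes <- sapply(toy, class)  # class of every column at once
column_classes

via_dollar  <- toy$smoker             # `$` extracts one column as a vector
via_bracket <- toy[["smoker"]]        # `[[ ]]` is an equivalent way to extract it
identical(via_dollar, via_bracket)    # TRUE
```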
Before proceeding to data analysis, it is essential to first transform certain variables (e.g., categorical variables) to ensure that the data is in the right format and condition to yield meaningful, accurate, and interpretable results.
Categorical variables represent distinct groups or categories and need to be stored in a format that R recognizes as categorical. Because R reads numeric codes such as 0 and 1 as integer data by default, we must convert categorical variables from integer format to factor format.
Converting categorical variables into factors allows us to:
✅ Categorize and analyze data based on unique categories.
✅ Perform frequency tabulations to summarize categorical data.
✅ Apply appropriate statistical models (e.g., logistic regression, chi-square tests).
For comparison, while categorical variables focus on group frequencies, numerical variables are assessed using measures like mean, median, variance, and range to describe their distribution.
To ensure data integrity and facilitate analysis without altering the original dataset, it is best practice to create new transformed variables while preserving the original values.
For example, we have a numeric variable named smoker where:
0 represents non-smokers
1 represents smokers
We can create two new variables:
smoker1 is a factor variable converted from smoker, enabling categorical analysis.
smoker2 is a labeled factor variable, with descriptive category names for clarity.
smoking_data$smoker1 <- factor(smoking_data$smoker)
# Now, check whether the data structure and summary for `smoker1` have changed
str(smoking_data$smoker1)
## Factor w/ 2 levels "0","1": 2 1 2 1 1 1 2 1 1 1 ...
summary(smoking_data$smoker1)
## 0 1
## 2418 582
To improve interpretability, we assign meaningful labels to the factor levels so that instead of numeric values, we work with readable category names:
smoking_data$smoker2 <- factor(smoking_data$smoker,
levels = c(0,1),
labels = c("no", "yes"))
summary(smoking_data$smoker2)
## no yes
## 2418 582
str(smoking_data$smoker2)
## Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 2 1 1 1 ...
Important: Once a factor variable is labeled, always refer to the labels ("no", "yes") rather than the original numerical values (0, 1).
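One way to confirm that the labels were attached to the right codes is to cross-tabulate the original variable against the new factor. This is a sketch on a short made-up vector rather than the lab data; it also shows why the old numeric code no longer matches once labels are in place.

```r
# Made-up codes for illustration: 1 = smoker, 0 = non-smoker
smoker  <- c(1, 0, 1, 0, 0)
smoker2 <- factor(smoker, levels = c(0, 1), labels = c("no", "yes"))

# Every 0 should fall under "no" and every 1 under "yes"
check <- table(original = smoker, labeled = smoker2)
check

# Comparing against the old numeric code matches nothing after labeling
n_code  <- sum(smoker2 == 1)      # 0
n_label <- sum(smoker2 == "yes")  # 2
```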
R has several operators to perform tasks including arithmetic, logical and bitwise operations. They are very helpful when you want to specify conditions for variable transformation and manipulation of the data frame.
Operator | Meaning |
---|---|
< | less than |
<= | less than or equal to |
> | greater than |
>= | greater than or equal to |
== | equal to |
!= | not equal to |
!x | NOT x |
x \| y | x OR y |
x & y | x AND y |
isTRUE(x) | test if x is TRUE |
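The operators in the table work element by element on vectors, which is what makes them useful for conditions on whole columns. A small sketch with made-up values (the vectors below are not taken from the lab data):

```r
# Made-up values for illustration
age       <- c(16, 18, 25, 40)
unmarried <- c(1, 1, 0, 0)

is_adult        <- age >= 18                    # FALSE  TRUE  TRUE  TRUE
adult_married   <- age >= 18 & unmarried == 0   # FALSE FALSE  TRUE  TRUE
minor_or_single <- age < 18 | unmarried == 1    #  TRUE  TRUE FALSE FALSE
not_adult       <- !is_adult                    #  TRUE FALSE FALSE FALSE

# isTRUE() expects a single value, so collapse the vector first with all() or any()
all_positive <- isTRUE(all(age > 0))            # TRUE
```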
You might want to count how many values in a column satisfy a certain condition. For example, suppose we want to know how many mothers have a drinks value of exactly 1, and how many mothers are smokers. In the following code chunk, smoking_data$drinks refers to the drinks column within the smoking_data dataset.
sum(smoking_data$drinks == 1)
## [1] 30
sum(smoking_data$smoker2 =="yes")
## [1] 582
📌 Key Considerations:
smoking_data$drinks refers to the drinks column in the dataset.
The first line counts the number of rows (i.e., mothers in the dataset) where drinks equals 1. It does not count every mother who drank; use drinks > 0 for that.
The second line counts the number of rows (i.e., mothers in the dataset) where smoker2 equals "yes" (indicating a smoker).
If working with a transformed factor variable, make sure to use labels instead of numerical values.
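Counting works because R treats TRUE as 1 and FALSE as 0, so sum() of a condition gives a count and mean() gives a proportion. The sketch below uses a made-up vector, including one missing value to show the role of na.rm:

```r
# Made-up values for illustration, with one missing entry
drinks <- c(0, 0, 1, 3, 0, 1, NA)

n_exactly_one <- sum(drinks == 1, na.rm = TRUE)   # 2
n_any_drinks  <- sum(drinks > 0, na.rm = TRUE)    # 3
share_any     <- mean(drinks > 0, na.rm = TRUE)   # 0.5 (3 of the 6 non-missing values)
```

Without na.rm = TRUE, a single NA in the column would make each of these results NA.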
filter() a dataset
Often, we need to filter a dataset to see only the observations that match specific conditions. Suppose we want to filter for mothers who were 18 years old or older. We can use the filter() function from the dplyr package, which is loaded as part of the tidyverse:
# Store the filtered data in a new data frame "adults"
adults <- filter(smoking_data, age >= 18)
# check the dimension
# also check the dataset in the global environment
# inspect the top six rows
dim(adults)
## [1] 2908 14
head(adults)
## nprevist alcohol tripre1 tripre2 tripre3 tripre0 birthweight smoker unmarried
## 1 12 0 1 0 0 0 4253 1 1
## 2 5 0 0 1 0 0 3459 0 0
## 3 12 0 1 0 0 0 2920 1 0
## 4 13 0 1 0 0 0 2600 0 0
## 5 9 0 1 0 0 0 3742 0 0
## 6 11 0 1 0 0 0 3420 0 0
## educ age drinks smoker1 smoker2
## 1 12 27 0 1 yes
## 2 16 24 0 0 no
## 3 11 23 0 1 yes
## 4 17 28 0 0 no
## 5 13 27 0 0 no
## 6 16 33 0 0 no
Try it yourself!
📌 Filtering with Multiple Conditions
To filter for adult mothers who are also married, we can add another condition:
# filter with two conditions
adults_married <- filter(smoking_data, age >= 18 & unmarried == 0)
dim(adults_married)
## [1] 2313 14
head(adults_married)
## nprevist alcohol tripre1 tripre2 tripre3 tripre0 birthweight smoker unmarried
## 1 5 0 0 1 0 0 3459 0 0
## 2 12 0 1 0 0 0 2920 1 0
## 3 13 0 1 0 0 0 2600 0 0
## 4 9 0 1 0 0 0 3742 0 0
## 5 11 0 1 0 0 0 3420 0 0
## 6 12 0 1 0 0 0 2325 1 0
## educ age drinks smoker1 smoker2
## 1 16 24 0 0 no
## 2 11 23 0 1 yes
## 3 17 28 0 0 no
## 4 13 27 0 0 no
## 5 16 33 0 0 no
## 6 14 24 0 1 yes
💡 Tip: To avoid overwriting your dataset, always store the filtered output in a new object such as data_new rather than modifying smoking_data directly.
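For comparison, the same kind of row selection can be written in base R without loading dplyr. This is an alternative sketch, not the approach the lab uses, and it runs on a made-up data frame (`toy` and the result names are assumptions):

```r
# Made-up values for illustration
toy <- data.frame(
  age       = c(16, 24, 33, 17),
  unmarried = c(1, 0, 0, 0)
)

# subset() evaluates the condition inside the data frame, much like filter()
adults_married_base <- subset(toy, age >= 18 & unmarried == 0)

# Logical indexing does the same thing with explicit `$` references
adults_married_idx <- toy[toy$age >= 18 & toy$unmarried == 0, ]

nrow(toy)                  # 4: the original is untouched
nrow(adults_married_base)  # 2
```

Either way, the result is stored under a new name, so the original data frame stays intact.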
Data visualization of numerical variables involves representing numerical data that can take on an infinite number of values within a given range. Visualizing numerical variables helps to understand the distribution, central tendency, variability, and presence of any patterns or outliers in the data. Here are some common methods for visualizing continuous variables:
describe() from the psych library
Besides the mean, median, min, and max, we can also find the standard deviation (sd), skewness, kurtosis, and interquartile range (IQR). The IQR is the difference between Q3 and Q1, i.e., between the 75th and 25th percentiles of the data.
describe(smoking_data$birthweight, IQR=TRUE)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 3000 3382.93 592.16 3420 3412.04 520.39 425 5755 5330 -0.83 2.54
## se IQR
## X1 10.81 688
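The IQR and range in this output can be checked by hand against the summary() quartiles reported earlier for birthweight (1st Qu. 3062, 3rd Qu. 3750, Min. 425, Max. 5755). The second half of the sketch uses a short made-up vector to show that base R's IQR() is the same Q3 minus Q1 calculation.

```r
# Figures copied from the summary() and describe() output for birthweight
q1 <- 3062
q3 <- 3750
iqr_by_hand   <- q3 - q1      # 688, matching the IQR column
range_by_hand <- 5755 - 425   # 5330, matching the range column

# The same relationship on a made-up vector
x <- c(2, 4, 4, 5, 7, 9, 10)
quartiles <- quantile(x, c(0.25, 0.75))
iqr_from_quartiles <- unname(quartiles[2] - quartiles[1])
iqr_builtin <- IQR(x)
```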
ggplot(): Grammar of Graphics!
ggplot2 is an R package specifically designed for data visualization, adhering to the principles of The Grammar of Graphics. By supplying your data and telling ggplot2 how to map variables to aesthetics and which graphical primitives to use, you can greatly enhance the quality and visual appeal of your graphics.
“geoms” is short for geometric objects. Geometric objects are the actual shapes or visual elements that are plotted on a graph to represent data. Each geom function in ggplot2 corresponds to a specific type of graphical representation. The choice of geom function depends on the nature of the data and the type of visualization you want to create. Some of the most commonly used geoms include:
Type | Function | Description |
---|---|---|
Point | geom_point() | Adds points to the plot, useful for scatter plots. |
Line | geom_line() | Connects data points with lines, ideal for time series or trend lines. |
Bar | geom_bar(), geom_col() | Produces bar charts, which can display values for different categories or counts of categorical variables. |
Histogram | geom_histogram() | Creates histograms to visualize the distribution of a single continuous variable by dividing it into bins and counting the number of observations in each bin. |
Regression | geom_smooth() | Adds a smoothed conditional mean to the plot, often used to visualize trends and patterns over a continuous variable. |
Boxplot | geom_boxplot() | Generates box plots to show the distribution of a continuous variable, highlighting the median, quartiles, and outliers. |
Text/Label | geom_text() or geom_label() | Allows adding text or labels to the plot to annotate specific points or provide additional information. |
Vert./Horiz. Line | geom_vline(), geom_hline() | Adds vertical or horizontal lines to a plot. |
Count | geom_count() | Sizes points according to the number of observations at identical locations in the plot. |
Density | geom_density() | Creates a smooth density estimate of a continuous variable to visualize the distribution of the data. |
hist()
A histogram is a simple way to visualize the distribution of a numerical variable. It groups the data into bins and displays the frequency of observations within each bin.
We will create a histogram for our dependent variable, birthweight, to understand its distribution:
# base r approach
hist(smoking_data$birthweight)
# with labels
hist(smoking_data$birthweight,
main = "Histogram of Birthweight",
xlab = "Birthweight (grams)",
col = "lightblue",
border = "black")
# basic ggplot2 approach
ggplot(smoking_data, aes(birthweight)) +
geom_bar(colour="black", fill="bisque") +
scale_x_binned()
# Plot the histogram with blue bars and white borders
ggplot(smoking_data, aes(x = birthweight)) +
geom_histogram(fill = "cornflowerblue",
color = "white") +
labs(title = "Distribution of Birthweight",
x = "Birthweight (grams)",
y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
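The message above asks for a deliberate bin width. One common rule of thumb is the Freedman-Diaconis rule, which sets the width to 2 × IQR / n^(1/3). The lab does not prescribe this rule; the sketch below applies it to the n and IQR reported by describe() as a starting point for experimentation.

```r
# Figures taken from the describe() output for birthweight
n   <- 3000
iqr <- 688

# Freedman-Diaconis rule of thumb for histogram bin width
fd_width <- 2 * iqr / n^(1/3)
round(fd_width)   # about 95 (grams)

# A rounded value could then be passed to the histogram layer, for example:
# geom_histogram(binwidth = 100, fill = "cornflowerblue", color = "white")
```

Narrower bins show more detail but more noise, while wider bins smooth the shape, so it is worth trying a few values around the suggested width.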
Choose your color from: R color cheatsheet.
💡 What to Look For?
Is the distribution normal, skewed, or multimodal?
Are there any outliers (extremely high or low values)?
What is the range and spread of values?
A density plot provides a smoothed estimate of the distribution using a kernel density function. Overlaying a density plot on a histogram allows us to better understand the shape of the distribution.
# simplified version of density plot
density_bw <- density(smoking_data$birthweight)
plot(density_bw)
# Histogram with a density plot
ggplot(smoking_data, aes(x = birthweight)) +
geom_histogram(aes(y = after_stat(density)), # after_stat() replaces the deprecated ..density.. notation
fill = "white",
color = "black",
bins = 30) +
geom_density(alpha = 0.4, fill = "coral") +
labs(title = "Histogram and Density Plot of Birthweight",
x = "Birthweight (grams)",
y = "Density")
# save the last ggplot graphic!
# After running the code, the PDF file will be stored in your project folder
ggsave("histogram_birthweight.pdf", width = 6, height = 4)
# note: ggsave() won't work for base R plots, only for ggplot objects
💡 Key Insights from Density Plots:
The peak(s) indicate where most data points are concentrated.
A long tail suggests skewness (positive or negative).
Multiple peaks may indicate a bimodal or multimodal distribution.
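The object returned by density() stores the curve itself, so the location of the peak can be read off numerically as well as visually. The sketch below uses simulated values (an assumption made for self-containment, not the lab data) and also shows the adjust argument, which scales the bandwidth and so controls how smooth the curve is.

```r
# Simulated values for illustration only; not the lab dataset
set.seed(1)
x <- rnorm(500, mean = 3400, sd = 600)

d <- density(x)                        # default bandwidth
peak_location <- d$x[which.max(d$y)]   # x value where the estimated density is highest
peak_location

# adjust multiplies the default bandwidth: larger values give a smoother curve
d_smooth <- density(x, adjust = 2)
c(default = d$bw, smoother = d_smooth$bw)
```

A very small bandwidth can create spurious extra peaks, so an apparent multimodal shape is worth rechecking at a larger adjust value.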
Another alternative to the histogram is the dot chart. Again, the quantitative variable is divided into bins, but rather than summary bars, each observation is represented by a dot. By default, the width of a dot corresponds to the bin width, and dots are stacked, with each dot representing one observation. This works best when the number of observations is small (say, fewer than 150).
# create a smaller dataset for a dot chart
minidata <- filter(smoking_data, drinks >0) # n = 58
ggplot(minidata, aes(x = birthweight)) +
geom_dotplot() +
labs(title = "Infant Birthweight of Drinking Mothers",
y = "Proportion",
x = "Birthweight")
## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.
A scatterplot is a graphical representation of the relationship between two numerical variables. In this case, we examine the association between the number of prenatal visits (nprevist) and infant birthweight (birthweight).
library(ggthemes)
ggplot(smoking_data, aes(x = nprevist, y = birthweight)) +
geom_point(color = "salmon3", alpha = 0.5) +
geom_smooth(method = "lm", color = "black", se = FALSE, linewidth = 0.5) + # linewidth replaces size for lines in ggplot2 >= 3.4.0
labs(title = "Scatterplot of Prenatal Visits and Birthweight",
x = "Number of Prenatal Visits",
y = "Birthweight (grams)",
subtitle = "Lab 2",
caption = "(based on data in Pennsylvania in 1989)") +
theme_gdocs(base_size = 10, base_family = "sans")
# Other themes: theme_grey, theme_bw, theme_light, theme_minimal, theme_classic, etc.
# customize your theme
# https://www.r-bloggers.com/custom-themes-in-ggplot2/
# https://ggplot2.tidyverse.org/reference/geom_histogram.html
💡 Key Observations:
The scatterplot suggests a positive linear relationship: as the number of prenatal visits increases, birthweight also tends to increase.
The black regression line provides a clearer visualization of this trend.
Potential Outliers: Some points may deviate significantly from the trend, requiring further investigation.
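A visual impression of a positive trend can be backed up with numbers: the correlation coefficient and the slope of the fitted line that geom_smooth(method = "lm") draws. The sketch below uses simulated data, so the variable names and values are assumptions rather than results from the lab dataset.

```r
# Simulated data for illustration only; not the lab dataset
set.seed(42)
visits <- rpois(200, lambda = 11)
weight <- 3000 + 50 * visits + rnorm(200, sd = 300)

r <- cor(visits, weight)           # strength and direction of the linear association
fit <- lm(weight ~ visits)         # the straight line that method = "lm" would draw
slope <- coef(fit)[["visits"]]     # expected change in weight per additional visit

r
slope
```

On the lab data, the equivalent calls would be cor(smoking_data$nprevist, smoking_data$birthweight) and lm(birthweight ~ nprevist, data = smoking_data).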
Q1: Import the dataset named “birthweight_smoking.csv” and give your dataset a name.
Q2: From the summary() output, distinguish which variables in the dataset are numerical and which are categorical. (Tip: A categorical variable takes on two or more values that represent categories or labels without inherent numerical meaning. A numerical variable has countable or infinite values within a given range.)
Q3: Change the data type of another categorical variable (store it as a new variable) and assign the value labels. Then use summary() and discuss the summary results.
Q4: In your response, specify the condition(s) for at least one variable to create a new data frame. In the code chunk below, use filter() to generate a new data frame (specify a new name) using the conditions and operators. Lastly, use the summary() function to inspect the new dataset.
Q5: Describe the distribution of the variable `birthweight`, including the mean, median, IQR, and shape. Discuss possible outliers using the measures of skewness (symmetry) and kurtosis (mass in the tails).
Q6: Use ggplot() to create a histogram with a density plot for another numerical variable other than birthweight. What can you observe from the graph in terms of the distribution?
Q7: Plot a new scatterplot using ggplot() for another numerical variable (x) and birthweight, and interpret the relationship. You can add themes, labels, and colors that make your graph look neat and professional.
Statastic – Well done!
Step 1: Double check if you answered all the questions thoroughly and check for accuracy ALWAYS!
Step 2: If you use an R Markdown (.Rmd) document, knit it: click the downward-pointing triangle next to Knit and choose PDF. If you use an R script (.R), transfer your code, results, and written answers to a Word document, then convert it to a PDF.
Step 3: Submit your assignment to Gradescope https://www.gradescope.com/.