Summary Statistics and Data Visualization
Lab Overview
- Variable types
- Summary statistics
- Data visualization using ggplot2
- Correlation Analysis
1. Getting Started
Download Lab 2’s materials from Moodle:
- Save the provided R script in the code folder of your BRM-Labs project folder.
Open the provided Lab 2 R script.
Package installation
- In this lab we will install new packages that will help us visualize and summarize data (skimr), look at relationships between variables (GGally, corrr), and customize our plots (ggthemes). If you have already installed these packages, you don’t need to repeat this step.
# Install packages
install.packages("GGally")
install.packages("ggthemes")
install.packages("skimr")
install.packages("corrr")
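If you are not sure which of these packages are already installed, a minimal sketch like the following installs only the missing ones. The object names pkgs and missing_pkgs are our own choices and are not part of the provided lab script.

```r
# Packages needed for this lab
pkgs <- c("GGally", "ggthemes", "skimr", "corrr")

# Keep only the packages that are not yet installed on this machine
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

# Install the missing packages, if there are any
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```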
- Set up your R environment
# Clean work environment
rm(list = ls()) # USE with CAUTION: this will delete everything in your environment
# Load packages
library(tidyverse)
library(stargazer)
library(ggthemes)
library(GGally)
library(skimr)
library(corrr)
# Load data, convert to tibble format, discard original
load("data/ceosal2.RData")
tb.ceosal2 <- as_tibble(data)
rm(data)
2. Data types in R
R has six atomic types in total, but for everyday data analysis we mostly care about four:
| Type | Description | Example | Notes / Conversion |
|---|---|---|---|
| logical | Boolean values, used for conditions | TRUE, FALSE | - |
| integer | Whole numbers, explicitly specified with L | 1L, 42L | as.integer(x) to convert |
| numeric | Real numbers, double-precision by default | 3.14, 1 | Double-precision = 64-bit, ~15–16 digits of accuracy; as.numeric(x) or as.double(x) to convert |
| character | Text strings | "hello", "R" | as.character(x) to convert |
Notes:
- In R, numeric is effectively double precision, so “double” is not a separate type in practice.
- More information about R data types can be found in the Resources section at the end of this lab.
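As a quick illustration of the table above, the following base R sketch checks how a few values are stored and converts between types. The values are arbitrary examples and are not taken from the lab data.

```r
# Inspect how R stores a few values
typeof(TRUE)     # "logical"
typeof(42L)      # "integer"
typeof(3.14)     # "double"
class(3.14)      # "numeric"
typeof("hello")  # "character"

# Integers and doubles both count as numeric
is.numeric(42L)  # TRUE
is.numeric(3.14) # TRUE

# Convert between types
as.integer(3.9)    # 3 (the decimal part is dropped, not rounded)
as.numeric("2.5")  # 2.5
as.character(42L)  # "42"
```

Note that typeof() reports "double" while class() reports "numeric" for the same value, which is why the two terms are often used interchangeably.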
Below, we provide examples of the types of vectors we will be using in this course.
2.1 Vectors
To create a vector in R we use the c()
(combine) function and then separate the elements of the vector with a
comma.
In this course, we will generally use the following vector types:
# Vector types
a <- c(1, 2, 5.3, -2,4) # numeric vector
b <- c("one","two","three") # character vector
c <- c(TRUE,TRUE,TRUE,FALSE) # logical vector
2.2 Factors
Factors are used for nominal or ordinal variables. Converting character variables into factors is useful for data visualization - graphs and tables will present the different category labels in the specified order.
# Create factor variables
tb.ceosal2 <- tb.ceosal2 %>% mutate(
college = factor( x = college
, levels = c(0,1)
, labels = c("No College Degree", "College Degree"))
, grad = factor( x = grad
, levels = c(0,1)
, labels = c("No Grad. Degree", "Grad. Degree")))
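To see what factor() does with levels and labels, here is a small sketch on a hypothetical 0/1 vector (degree_raw is made up for illustration and is not a column of the CEO data).

```r
# Hypothetical 0/1 indicator, coded like the college variable
degree_raw <- c(1, 0, 1, 1, 0)

# levels gives the original codes in the desired order,
# labels gives the text shown in tables and graphs
degree <- factor(x = degree_raw
                 , levels = c(0, 1)
                 , labels = c("No College Degree", "College Degree"))

levels(degree)      # "No College Degree" "College Degree"
table(degree)       # counts follow the level order: 2 and 3
as.integer(degree)  # 2 1 2 2 1 (position of each value's level)
```

Because the level order is stored with the variable, plots and tables list "No College Degree" first, regardless of which value appears first in the data.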
3. Data Inspection and Summary Statistics
The provided data set is already ready for analysis and no cleaning or pre-processing of the data will be necessary.
In Lab 1, we already learned the main features of the dataset such as the number of CEOs in the sample and some summary statistics.
We will now continue that process by making ourselves familiar with the different variables in the data set and their distributions.
Below we present a few functions that can help us in this process.
3.1 Glimpse
As an alternative to head() or View(), we
can take a glimpse() of our data set. This will tell us the
number of columns (variables) and rows (observations) in the data, and
present a list of all variables, their type, and the first few values in
each variable.
# Take a glimpse of the data
glimpse(tb.ceosal2)
Rows: 177
Columns: 15
$ salary <int> 1161, 600, 379, 651, 497, 1067, 945, 1261, 503, 1094, 601, 35…
$ age <int> 49, 43, 51, 55, 44, 64, 59, 63, 47, 64, 54, 66, 72, 51, 63, 4…
$ college <fct> College Degree, College Degree, College Degree, College Degre…
$ grad <fct> Grad. Degree, Grad. Degree, Grad. Degree, No Grad. Degree, Gr…
$ comten <int> 9, 10, 9, 22, 8, 7, 35, 32, 4, 39, 26, 39, 37, 25, 21, 7, 38,…
$ ceoten <int> 2, 10, 3, 22, 6, 7, 10, 8, 4, 5, 7, 8, 37, 1, 11, 7, 4, 12, 2…
$ sales <dbl> 6200, 283, 169, 1100, 351, 19000, 536, 4800, 610, 2900, 1200,…
$ profits <int> 966, 48, 40, -54, 28, 614, 24, 191, 7, 230, 34, 8, 35, 234, 9…
$ mktval <dbl> 23200, 1100, 1100, 1000, 387, 3900, 623, 2100, 454, 3900, 533…
$ lsalary <dbl> 7.057037, 6.396930, 5.937536, 6.478509, 6.208590, 6.972606, 6…
$ lsales <dbl> 8.732305, 5.645447, 5.129899, 7.003066, 5.860786, 9.852194, 6…
$ lmktval <dbl> 10.051908, 7.003066, 7.003066, 6.907755, 5.958425, 8.268732, …
$ comtensq <int> 81, 100, 81, 484, 64, 49, 1225, 1024, 16, 1521, 676, 1521, 13…
$ ceotensq <int> 4, 100, 9, 484, 36, 49, 100, 64, 16, 25, 49, 64, 1369, 1, 121…
$ profmarg <dbl> 15.580646, 16.961130, 23.668638, -4.909091, 7.977208, 3.23157…
Note that most variables in the data are either integers or doubles,
which means that they are numeric (see the note on numeric types at the end of this lab). The variables college and
grad have already been converted to factors.
3.2 Skim
The package skimr provides some useful functions for
exploring our data sets. An overview of the full data set along with
summary statistics and histograms can be obtained using the
skim() function.
# Data set overview
skim(tb.ceosal2)
| Name | tb.ceosal2 |
| Number of rows | 177 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| factor | 2 |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| college | 0 | 1 | FALSE | 2 | Col: 172, No : 5 |
| grad | 0 | 1 | FALSE | 2 | Gra: 94, No : 83 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| salary | 0 | 1 | 865.86 | 587.59 | 100.00 | 471.00 | 707.00 | 1119.00 | 5299.00 | ▇▂▁▁▁ |
| age | 0 | 1 | 56.43 | 8.42 | 33.00 | 52.00 | 57.00 | 62.00 | 86.00 | ▁▅▇▂▁ |
| comten | 0 | 1 | 22.50 | 12.29 | 2.00 | 12.00 | 23.00 | 33.00 | 58.00 | ▇▆▇▅▁ |
| ceoten | 0 | 1 | 7.95 | 7.15 | 0.00 | 3.00 | 6.00 | 11.00 | 37.00 | ▇▃▁▁▁ |
| sales | 0 | 1 | 3529.46 | 6088.65 | 29.00 | 561.00 | 1400.00 | 3500.00 | 51300.00 | ▇▁▁▁▁ |
| profits | 0 | 1 | 207.83 | 404.45 | -463.00 | 34.00 | 63.00 | 208.00 | 2700.00 | ▇▂▁▁▁ |
| mktval | 0 | 1 | 3600.32 | 6442.28 | 387.00 | 644.00 | 1200.00 | 3500.00 | 45400.00 | ▇▁▁▁▁ |
| lsalary | 0 | 1 | 6.58 | 0.61 | 4.61 | 6.15 | 6.56 | 7.02 | 8.58 | ▁▅▇▅▁ |
| lsales | 0 | 1 | 7.23 | 1.43 | 3.37 | 6.33 | 7.24 | 8.16 | 10.85 | ▁▅▇▆▂ |
| lmktval | 0 | 1 | 7.40 | 1.13 | 5.96 | 6.47 | 7.09 | 8.16 | 10.72 | ▇▅▃▁▁ |
| comtensq | 0 | 1 | 656.68 | 577.12 | 4.00 | 144.00 | 529.00 | 1089.00 | 3364.00 | ▇▅▁▁▁ |
| ceotensq | 0 | 1 | 114.12 | 212.57 | 0.00 | 9.00 | 36.00 | 121.00 | 1369.00 | ▇▁▁▁▁ |
| profmarg | 0 | 1 | 6.42 | 17.86 | -203.08 | 4.23 | 6.83 | 10.95 | 47.46 | ▁▁▁▁▇ |
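skim() can also be restricted to selected columns, and it respects grouping created with group_by(). The sketch below uses a hypothetical toy tibble (tb.toy, with made-up values) rather than the CEO data, so the numbers it prints are for illustration only.

```r
library(dplyr)
library(skimr)

# Hypothetical toy data (not the CEO data set)
tb.toy <- tibble(
  grad   = factor(c("Grad. Degree", "No Grad. Degree", "Grad. Degree",
                    "No Grad. Degree", "Grad. Degree", "No Grad. Degree")),
  salary = c(900, 500, 1200, 700, 1000, 600),
  age    = c(49, 43, 51, 55, 44, 64)
)

# Skim only the salary column
skim_selected <- skim(tb.toy, salary)
skim_selected

# Skim salary separately for each level of grad
skim_grouped <- tb.toy %>%
  group_by(grad) %>%
  skim(salary)
skim_grouped
```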
3.3 Stargazer
A quick and easy way to produce a summary statistics table is to use
the stargazer function. Note that stargazer
will only summarize the numeric variables in the data. Any qualitative
variables need to be inspected separately using a different method.
Note: stargazer will not work with tibbles so we must convert the data into a data.frame inside the stargazer function for it to run.
# Quick summary statistics table
stargazer(data.frame(tb.ceosal2), type = "text")
=================================================
Statistic N Mean St. Dev. Min Max
-------------------------------------------------
salary 177 865.864 587.589 100 5,299
age 177 56.429 8.422 33 86
comten 177 22.503 12.295 2 58
ceoten 177 7.955 7.151 0 37
sales 177 3,529.463 6,088.654 29 51,300
profits 177 207.831 404.454 -463 2,700
mktval 177 3,600.316 6,442.276 387 45,400
lsalary 177 6.583 0.606 4.605 8.575
lsales 177 7.231 1.432 3.367 10.845
lmktval 177 7.399 1.133 5.958 10.723
comtensq 177 656.684 577.123 4 3,364
ceotensq 177 114.124 212.566 0 1,369
profmarg 177 6.420 17.861 -203.077 47.458
-------------------------------------------------
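Since stargazer skips the qualitative variables, one simple way to inspect them is a frequency table. The sketch below assumes the dplyr functions count() and mutate() and uses a hypothetical toy tibble (tb.toy) in place of the CEO data.

```r
library(dplyr)

# Hypothetical toy data with one factor variable
tb.toy <- tibble(
  college = factor(c(1, 1, 0, 1, 1)
                   , levels = c(0, 1)
                   , labels = c("No College Degree", "College Degree"))
)

# Count observations per category and add the share of each category
college_counts <- tb.toy %>%
  count(college) %>%
  mutate(share = n / sum(n))

college_counts
```

Applied to the lab data, the same pattern would be tb.ceosal2 %>% count(college), and likewise for grad.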
4. Variable Distributions
We will now use graphs to visualize the distribution of the variables
relevant to our analysis, using the ggplot2 package.
At this stage, it is particularly useful to plot histograms and boxplots
for numeric variables. These visualizations help us assess the shape of
the distribution—whether it is symmetric or skewed—identify potential
outliers, determine typical values, and evaluate variability. For
categorical variables, barplots are the appropriate choice. All
graphical elements are highly customizable, and ggplot2
offers extensive aesthetic options for colors, labels, themes, and
layout. Code copilots and AI tools like ChatGPT can be especially
helpful when exploring or fine-tuning these customizations.
4.1 Histograms
We will be focusing on the relationship between CEO salaries and firm profits. We will thus start by plotting the distributions of these two variables.
# Histogram of CEO salary
ggplot(data = tb.ceosal2, aes(x = salary)) + geom_histogram()
# Alternatively:
tb.ceosal2 %>% ggplot(aes(x = salary)) + geom_histogram()
# Histogram of firm profits
ggplot(data = tb.ceosal2, aes(x = profits)) + geom_histogram()
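By default geom_histogram() uses 30 bins and prints a message suggesting that you pick a better value. The sketch below sets the bins explicitly. It runs on simulated right-skewed data (tb.sim is hypothetical, generated with rlnorm()) so that it is self-contained; the chosen bin settings are illustrative, not recommendations for the CEO data.

```r
library(ggplot2)

# Simulated, right-skewed "salary" values (NOT the real CEO salaries)
set.seed(42)
tb.sim <- data.frame(salary = rlnorm(177, meanlog = 6.58, sdlog = 0.61))

# Option 1: choose the number of bins
plot.hist.bins <- ggplot(data = tb.sim, aes(x = salary)) +
  geom_histogram(bins = 20)
plot.hist.bins

# Option 2: choose the width of each bin, in the units of the variable
plot.hist.width <- ggplot(data = tb.sim, aes(x = salary)) +
  geom_histogram(binwidth = 250)
plot.hist.width
```

Trying a few different values is worthwhile, because the apparent shape of a skewed distribution can change noticeably with the bin width.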
4.2 Boxplots
Box and whisker plots are an alternative (or complementary) way to visualize the distribution of numeric variables. The “box” spans the first and third quartiles, and the line in the middle of the box gives us the median value. Outliers are highlighted as dots separated from the box and the whiskers.
# Boxplot of CEO salary (horizontal)
ggplot(data = tb.ceosal2, aes(x = salary)) + geom_boxplot() + scale_y_discrete( )
# Boxplot of firm profits (vertical)
ggplot(data = tb.ceosal2, aes(y = profits)) + geom_boxplot() + scale_x_discrete( )
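The numbers behind a boxplot can be computed directly. The base R sketch below uses the first ten salary values shown in the glimpse() output plus the sample maximum (5299) as an illustrative subset, and applies the 1.5 × IQR rule that geom_boxplot() uses by default to flag outliers. The object names are our own.

```r
# Illustrative subset: first ten salaries from glimpse() plus the maximum
salary_subset <- c(1161, 600, 379, 651, 497, 1067, 945, 1261, 503, 1094, 5299)

# Quartiles: the edges of the box and the line inside it
q1  <- unname(quantile(salary_subset, 0.25))  # 551.5
med <- median(salary_subset)                  # 945
q3  <- unname(quantile(salary_subset, 0.75))  # 1127.5

# Interquartile range and the fences beyond which points count as outliers
iqr_salary  <- q3 - q1                        # 576
lower_fence <- q1 - 1.5 * iqr_salary          # -312.5
upper_fence <- q3 + 1.5 * iqr_salary          # 1991.5

# Values outside the fences are drawn as separate dots
outliers <- salary_subset[salary_subset < lower_fence |
                          salary_subset > upper_fence]
outliers                                      # 5299
```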
4.3 Barplots
Categorical data can be visualized using barplots. Note that the plot will be ordered according to the pre-specified factor ordering.
# Barplot of grad
ggplot(data = tb.ceosal2, aes(x = grad)) + geom_bar()
5. Relationships Among Variables
Next in our analysis, we will investigate the relationships among the variables of interest.
5.1 Scatter plots
A scatterplot allows us to visualize the relationship between two
numeric variables. Scatterplots can be easily created with
ggplot2 by using geom_point(). As before, all
plot elements can be customized.
To add a regression line to the scatterplot, we use the layer
stat_smooth(method = lm) (the option
method = lm indicates we want a linear regression line,
the most common choice). The gray region is the 95% confidence interval
(the default level) for the regression line.
# Create scatterplot and store it as a new object
plot.scatter <- ggplot(data = tb.ceosal2
, aes(x = profits, y = salary)) +
geom_point(size = 1) +
theme_few() +
labs( x = "1990 profits, millions"
, y = "1990 CEO compensation, $1000s"
, title = "Relationship between CEO salaries and firm profits") +
theme(title = element_text(size = 8))
# Print scatterplot
plot.scatter
# Add regression line to scatterplot
plot.scatter +
# add regression line
stat_smooth(method = lm, colour="black")
The scatterplot suggests a positive (when one variable increases, the other increases as well) but weak linear relationship between the two variables (the points do not fall neatly on the straight line). In the scatterplot we can also more clearly detect outlier points.
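The line drawn by stat_smooth(method = lm) is an ordinary least squares fit of the y variable on the x variable, so the same line can be obtained with lm(). The sketch below uses only the first ten salary and profits values shown in the glimpse() output, so its coefficients will differ from those for the full sample; the object names are our own.

```r
# First ten observations, copied from the glimpse() output
tb.first10 <- data.frame(
  salary  = c(1161, 600, 379, 651, 497, 1067, 945, 1261, 503, 1094),
  profits = c(966, 48, 40, -54, 28, 614, 24, 191, 7, 230)
)

# Fit the regression line: salary = intercept + slope * profits
fit <- lm(salary ~ profits, data = tb.first10)
coef(fit)

# Points on the line, with the confidence interval shown as the gray band
predict(fit
        , newdata = data.frame(profits = c(0, 500))
        , interval = "confidence")
```

A positive slope corresponds to the upward-sloping line in the plot.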
We can use color, point shape, or point size to add additional layers of information to scatterplots.
# Create scatterplot
ggplot(data = tb.ceosal2
, aes(x = profits, y = salary, color = grad)) +
geom_point(size = 1) +
# use custom theme and color scheme
theme_classic() + scale_color_fivethirtyeight() +
theme(title = element_text(size = 12)) +
labs( x = "1990 profits, millions"
, y = "1990 CEO compensation, $1000s"
, title = "Relationship between CEO salaries and firm profits"
, color = "")
5.2 Categorical Data
We can use color in barplots to analyze relationships between two categorical variables.
# Barplot of college by grad - Counts
ggplot(data = tb.ceosal2
, aes(x = college, fill = grad)) +
geom_bar() +
# add custom theme and color scheme
theme_fivethirtyeight() + scale_fill_tableau()
# Barplot of college by grad - Percent
ggplot(data = tb.ceosal2
, aes(x = college, fill = grad)) +
geom_bar(position="fill") +
# add custom theme and color scheme
theme_fivethirtyeight() + scale_fill_tableau()
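The two barplots correspond to a table of counts and a table of within-group shares. A minimal base R sketch, using hypothetical toy vectors instead of the CEO data:

```r
# Hypothetical toy data (not the CEO data set)
college <- c("College Degree", "College Degree", "College Degree",
             "College Degree", "No College Degree", "No College Degree")
grad    <- c("Grad. Degree", "Grad. Degree", "No Grad. Degree",
             "Grad. Degree", "No Grad. Degree", "No Grad. Degree")

# Counts: what geom_bar() stacks by default
counts <- table(college, grad)
counts

# Shares within each college group: what position = "fill" displays
shares <- prop.table(counts, margin = 1)
shares
```

Each row of shares sums to 1, which is why every bar in the percent plot has the same height.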
We can use side-by-side boxplots with or without color to analyze relationships between categorical and numerical variables.
# Boxplot of salary by grad
ggplot(data = tb.ceosal2
, aes(x = grad, y = salary, fill = grad)) +
geom_boxplot() +
# add custom theme and color scheme
theme_fivethirtyeight() + scale_fill_tableau()
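A grouped summary table is a useful numeric companion to side-by-side boxplots. The sketch below assumes dplyr's group_by() and summarise() and runs on a hypothetical toy tibble (tb.toy, made-up values).

```r
library(dplyr)

# Hypothetical toy data (not the CEO data set)
tb.toy <- tibble(
  grad = factor(c("Grad. Degree", "No Grad. Degree", "Grad. Degree",
                  "No Grad. Degree", "Grad. Degree", "No Grad. Degree")
                , levels = c("No Grad. Degree", "Grad. Degree")),
  salary = c(900, 500, 1200, 700, 1000, 600)
)

# Median and interquartile range of salary within each group
salary_by_grad <- tb.toy %>%
  group_by(grad) %>%
  summarise(n = n()
            , median_salary = median(salary)
            , iqr_salary = IQR(salary))

salary_by_grad
```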
5.3 Correlation Tables and Plots
Correlation tables give the correlation coefficient for each pair of numeric variables in the data.
# Create correlation matrix
cor_matrix <- tb.ceosal2 %>%
select(salary, age, comten, ceoten, sales, profits, mktval) %>%
correlate(diagonal = 1)
# Format table for a professional presentation
cor_matrix %>%
rearrange() %>% # rearrange by correlations
shave() %>% # Shave off the upper triangle for a clean result
fashion(decimals = 3) # Clean presentation
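By default correlate() reports Pearson correlation coefficients, the same quantity returned by base R's cor(). The sketch below computes one coefficient both with cor() and directly from the formula r = sum((x - mean(x)) * (y - mean(y))) / sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2)). It uses only the first ten salary and profits values from the glimpse() output, so the result will differ from the full-sample value in the table.

```r
# First ten observations, copied from the glimpse() output
salary  <- c(1161, 600, 379, 651, 497, 1067, 945, 1261, 503, 1094)
profits <- c(966, 48, 40, -54, 28, 614, 24, 191, 7, 230)

# Built-in Pearson correlation
r_builtin <- cor(salary, profits)

# The same coefficient from the formula
dev_salary  <- salary - mean(salary)
dev_profits <- profits - mean(profits)
r_manual <- sum(dev_salary * dev_profits) /
  sqrt(sum(dev_salary^2) * sum(dev_profits^2))

# Both approaches agree
all.equal(r_builtin, r_manual)
```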
Correlations can also be presented visually. This can be done in different ways (e.g. heatmap, barplot, correlation plot), depending on the data and the researcher’s preferences.
A correlation plot can be created easily using the function
rplot().
# Correlation plot
cor_matrix %>% rplot()
5.4 GGpairs
The function ggpairs() of the package
GGally is a good shortcut for visualizing the relationship
between several variables all at once. However, you should be careful
when using this function: select only a few variables (otherwise the
output becomes unreadable) and, if your dataset contains too many
observations, make sure to select only a sample of those observations
for plotting.
# Visualize distributions, scatterplots, and correlations
ggpairs(tb.ceosal2 %>% select(salary, grad, sales, profits))
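For a large data set, one way to follow the advice above is to draw a random sample of rows before calling ggpairs(). The sketch below assumes dplyr's slice_sample() and uses the diamonds data set that ships with ggplot2 (about 54,000 rows) as a stand-in for a large data set; the sample size of 300 is an arbitrary choice.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set
library(GGally)

# Make the random sample reproducible
set.seed(123)

# Keep a random sample of rows and only a few variables
tb.sample <- diamonds %>%
  slice_sample(n = 300) %>%
  select(carat, cut, price)

# Visualize distributions, scatterplots, and correlations for the sample
plot.pairs <- ggpairs(tb.sample)
plot.pairs
```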
6. Recommended Assignment
- Complete DataCamp’s third chapter of the Introduction to the Tidyverse course: Grouping and Summarizing
Resources
Data types: https://swcarpentry.github.io/r-novice-inflammation/13-supp-data-structures.html
Introductory Econometrics, Examples 2.3 to 2.5 (CEO salary and return on equity; Wage and education; Voting outcomes and campaign expenditures)
More on skimr
Note on numeric types: “There are multiple classes that are grouped together as “numeric” classes, the 2 most common of which are double (for double precision floating point numbers) and integer. R will automatically convert between the numeric classes when needed, so for the most part it does not matter to the casual user whether the number 3 is currently stored as an integer or as a double. Most math is done using double precision, so that is often the default storage.” (Source: https://stackoverflow.com/questions/23660094/whats-the-difference-between-integer-class-and-numeric-class-in-r)