BRM | Lab #2 | Fall 2025

Summary Statistics and Data Visualization

Lab Overview

  • Variable types
  • Summary statistics
  • Data visualization using ggplot2
  • Correlation Analysis




1. Getting Started

  1. Download Lab 2’s materials from Moodle:

    • Save the provided R script in the code folder of your BRM-Labs project folder.
  2. Open the provided Lab 2 R script.

  3. Package installation

    • In this lab we will install new packages that will help us visualize and summarize data (skimr), look at relationships between variables (GGally, corrr), and customize our plots (ggthemes). If you have already installed these packages, you don’t need to repeat this step.
# Install packages
install.packages("GGally")
install.packages("ggthemes")
install.packages("skimr")
install.packages("corrr")
  4. Set up your R environment
# Clean work environment 
rm(list = ls()) # USE with CAUTION: this will delete everything in your environment
# Load packages
library(tidyverse)
library(stargazer)
library(ggthemes)
library(GGally)
library(skimr)
library(corrr)
# Load data, convert to tibble format, discard original
load("data/ceosal2.RData")
tb.ceosal2 <- as_tibble(data)
rm(data)
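
Since packages only need to be installed once, step 3 can also be written so that it installs only what is missing. The following is a minimal sketch, not part of the provided lab script: the helper name find_missing_packages is hypothetical, and the install call is guarded with interactive() so that a non-interactive run never triggers a download.

```r
# Hypothetical helper (not part of the lab script): return the packages
# from `pkgs` that are not yet installed on this machine.
find_missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

lab_packages <- c("GGally", "ggthemes", "skimr", "corrr")
to_install   <- find_missing_packages(lab_packages)

# Install only what is missing; nothing happens if all packages are present.
# The interactive() guard is an assumption of this sketch: it keeps scripted
# (non-interactive) runs from downloading anything.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

If to_install is empty, all four packages are already available and the library() calls above should work as they are.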




2. Data types in R

R has six atomic types in total, but for everyday data analysis we mostly care about four:

Type      | Description                                | Example      | Notes / Conversion
----------|--------------------------------------------|--------------|-------------------
logical   | Boolean values, used for conditions        | TRUE, FALSE  | -
integer   | Whole numbers, explicitly specified with L | 1L, 42L      | as.integer(x) to convert
numeric   | Real numbers, double-precision by default  | 3.14, 1      | Double-precision = 64-bit, ~15–16 digits of accuracy; as.numeric(x) or as.double(x) to convert
character | Text strings                               | "hello", "R" | as.character(x) to convert

Notes:

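  • In R, numeric and double refer to the same storage type: as.numeric() is identical to as.double(), and typeof(3.14) returns "double". Integers also count as numeric in the broader sense (is.numeric(1L) is TRUE).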
  • More information about R data types can be found here.
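
To see these types in practice, we can ask R how it stores a few values. This is a small illustrative sketch using only base R functions (typeof(), class(), and the as.*() conversions from the table); the values are arbitrary examples.

```r
# typeof() reports how R stores a value internally
typeof(TRUE)     # "logical"
typeof(42L)      # "integer"
typeof(3.14)     # "double"
typeof(1)        # "double": a plain 1 is not an integer unless written as 1L
typeof("hello")  # "character"

# class() reports doubles under the label "numeric"
class(3.14)      # "numeric"
class(42L)       # "integer"

# Conversions listed in the table
as.integer("42")     # 42L
as.numeric("3.14")   # 3.14
as.character(TRUE)   # "TRUE"

# as.numeric() and as.double() give the same result
identical(as.numeric("3.14"), as.double("3.14"))  # TRUE
```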

Below, we provide examples of the types of vectors we will be using in this course.


2.1 Vectors

To create a vector in R, we use the c() (combine) function and separate the elements of the vector with commas.

In this course, we will generally use the following vector types:

# Vector types
a <- c(1, 2, 5.3, -2, 4)        # numeric vector
b <- c("one","two","three")     # character vector
c <- c(TRUE,TRUE,TRUE,FALSE)    # logical vector
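
A vector can hold only one type. As a short sketch (the object names below are made up for illustration), this is what happens when we mix types inside c(): R silently converts all elements to the most flexible type present.

```r
# Illustrative vectors (names are hypothetical, not used elsewhere in the lab)
num_vec <- c(1, 2, 5.3)
lgl_vec <- c(TRUE, TRUE, FALSE)

class(num_vec)   # "numeric"
length(num_vec)  # 3

# Mixing numbers, text, and logicals: everything becomes character
mixed_chr <- c(1, "two", TRUE)
class(mixed_chr)  # "character"

# Mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0
mixed_num <- c(1, TRUE, FALSE)
class(mixed_num)  # "numeric"

# This is why summing a logical vector counts the TRUE values
sum(lgl_vec)      # 2
```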




2.2 Factors

Factors are used for nominal or ordinal variables. Converting character variables, or numeric codes such as the 0/1 indicators below, into factors is useful for data visualization: graphs and tables will present the category labels in the specified order.

# Create factor variables
tb.ceosal2 <- tb.ceosal2 %>% mutate(
    college = factor( x = college
                    , levels = c(0,1)
                    , labels = c("No College Degree", "College Degree"))
  , grad = factor(    x = grad
                    , levels = c(0,1)
                    , labels = c("No Grad. Degree", "Grad. Degree")))
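
To make the role of levels and labels concrete, here is a minimal sketch on a small made-up vector of 0/1 codes (degree_codes is hypothetical data, not taken from ceosal2). Each value in levels is matched, in order, with the corresponding entry in labels.

```r
# Hypothetical 0/1 codes, mimicking how college is stored in the raw data
degree_codes <- c(1, 0, 1, 1, 0)

degree <- factor(x      = degree_codes,
                 levels = c(0, 1),
                 labels = c("No College Degree", "College Degree"))

# The labels become the factor levels, in the order we specified
levels(degree)      # "No College Degree" "College Degree"

# Tables (and plots) follow that same order
table(degree)

# Internally, a factor stores integer positions into levels(degree)
as.integer(degree)  # 2 1 2 2 1
```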




3. Data Inspection and Summary Statistics

The provided data set is ready for analysis, so no cleaning or pre-processing is necessary.

In Lab 1, we learned the main features of the data set, such as the number of CEOs in the sample and some summary statistics.

We will now continue that process by making ourselves familiar with the different variables in the data set and their distributions.

Below we present a few functions that can help us in this process.

3.1 Glimpse

As an alternative to head() or View(), we can take a glimpse() of our data set. This will tell us the number of columns (variables) and rows (observations) in the data, and present a list of all variables, their type, and the first few values in each variable.

# Take a glimpse of the data
glimpse(tb.ceosal2)
Rows: 177
Columns: 15
$ salary   <int> 1161, 600, 379, 651, 497, 1067, 945, 1261, 503, 1094, 601, 35…
$ age      <int> 49, 43, 51, 55, 44, 64, 59, 63, 47, 64, 54, 66, 72, 51, 63, 4…
$ college  <fct> College Degree, College Degree, College Degree, College Degre…
$ grad     <fct> Grad. Degree, Grad. Degree, Grad. Degree, No Grad. Degree, Gr…
$ comten   <int> 9, 10, 9, 22, 8, 7, 35, 32, 4, 39, 26, 39, 37, 25, 21, 7, 38,…
$ ceoten   <int> 2, 10, 3, 22, 6, 7, 10, 8, 4, 5, 7, 8, 37, 1, 11, 7, 4, 12, 2…
$ sales    <dbl> 6200, 283, 169, 1100, 351, 19000, 536, 4800, 610, 2900, 1200,…
$ profits  <int> 966, 48, 40, -54, 28, 614, 24, 191, 7, 230, 34, 8, 35, 234, 9…
$ mktval   <dbl> 23200, 1100, 1100, 1000, 387, 3900, 623, 2100, 454, 3900, 533…
$ lsalary  <dbl> 7.057037, 6.396930, 5.937536, 6.478509, 6.208590, 6.972606, 6…
$ lsales   <dbl> 8.732305, 5.645447, 5.129899, 7.003066, 5.860786, 9.852194, 6…
$ lmktval  <dbl> 10.051908, 7.003066, 7.003066, 6.907755, 5.958425, 8.268732, …
$ comtensq <int> 81, 100, 81, 484, 64, 49, 1225, 1024, 16, 1521, 676, 1521, 13…
$ ceotensq <int> 4, 100, 9, 484, 36, 49, 100, 64, 16, 25, 49, 64, 1369, 1, 121…
$ profmarg <dbl> 15.580646, 16.961130, 23.668638, -4.909091, 7.977208, 3.23157…

Note that most variables in the data are either integers or doubles, which means that they are numeric [1]. The variables college and grad have already been converted to factors.
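
If we only want the column types, without the preview of values, base R can tabulate them. The sketch below uses a tiny made-up data frame (toy is hypothetical and only mimics three ceosal2 columns); the same calls should work on tb.ceosal2.

```r
# Hypothetical three-column data frame that mimics the structure of ceosal2
toy <- data.frame(
  salary = c(1161L, 600L, 379L),   # integer
  sales  = c(6200, 283, 169),      # double
  grad   = factor(c("Grad. Degree", "Grad. Degree", "No Grad. Degree"))
)

# Class of each column
col_classes <- sapply(toy, class)
col_classes

# How many columns of each class?
table(col_classes)

# Which columns count as numeric (integer or double)?
sapply(toy, is.numeric)
```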





3.2 Skim

The package skimr provides some useful functions for exploring our data sets. An overview of the full data set along with summary statistics and histograms can be obtained using the skim() function.

# Data set overview
skim(tb.ceosal2)
Data summary
Name tb.ceosal2
Number of rows 177
Number of columns 15
_______________________
Column type frequency:
factor 2
numeric 13
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
college 0 1 FALSE 2 Col: 172, No : 5
grad 0 1 FALSE 2 Gra: 94, No : 83

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
salary 0 1 865.86 587.59 100.00 471.00 707.00 1119.00 5299.00 ▇▂▁▁▁
age 0 1 56.43 8.42 33.00 52.00 57.00 62.00 86.00 ▁▅▇▂▁
comten 0 1 22.50 12.29 2.00 12.00 23.00 33.00 58.00 ▇▆▇▅▁
ceoten 0 1 7.95 7.15 0.00 3.00 6.00 11.00 37.00 ▇▃▁▁▁
sales 0 1 3529.46 6088.65 29.00 561.00 1400.00 3500.00 51300.00 ▇▁▁▁▁
profits 0 1 207.83 404.45 -463.00 34.00 63.00 208.00 2700.00 ▇▂▁▁▁
mktval 0 1 3600.32 6442.28 387.00 644.00 1200.00 3500.00 45400.00 ▇▁▁▁▁
lsalary 0 1 6.58 0.61 4.61 6.15 6.56 7.02 8.58 ▁▅▇▅▁
lsales 0 1 7.23 1.43 3.37 6.33 7.24 8.16 10.85 ▁▅▇▆▂
lmktval 0 1 7.40 1.13 5.96 6.47 7.09 8.16 10.72 ▇▅▃▁▁
comtensq 0 1 656.68 577.12 4.00 144.00 529.00 1089.00 3364.00 ▇▅▁▁▁
ceotensq 0 1 114.12 212.57 0.00 9.00 36.00 121.00 1369.00 ▇▁▁▁▁
profmarg 0 1 6.42 17.86 -203.08 4.23 6.83 10.95 47.46 ▁▁▁▁▇
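
The columns p0, p25, p50, p75, and p100 are the minimum, first quartile, median, third quartile, and maximum. As a rough sketch, the same quantities can be computed by hand with base R; toy_salary is made-up data, and we assume here that skim()'s percentiles follow R's default quantile() definition.

```r
# Hypothetical salaries (in $1000s) with one very large value
toy_salary <- c(300, 450, 600, 750, 900, 1200, 5000)

mean(toy_salary)   # pulled upwards by the 5000 observation
sd(toy_salary)

# p0, p25, p50, p75, p100
toy_quantiles <- quantile(toy_salary, probs = c(0, 0.25, 0.5, 0.75, 1))
toy_quantiles

# A mean well above the median is a sign of right skew
mean(toy_salary) > median(toy_salary)
```

In the skim output above, salary, sales, and mktval show the same pattern: the mean is well above p50, which matches their right-skewed mini histograms.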





3.3 Stargazer

A quick and easy way to produce a summary statistics table is to use the stargazer function. Note that stargazer will only summarize the numeric variables in the data. Any qualitative variables need to be inspected separately using a different method.

Note: stargazer will not work with tibbles so we must convert the data into a data.frame inside the stargazer function for it to run.

# Quick summary statistics table
stargazer(data.frame(tb.ceosal2), type = "text")

=================================================
Statistic  N    Mean    St. Dev.    Min     Max  
-------------------------------------------------
salary    177  865.864   587.589    100    5,299 
age       177  56.429     8.422      33      86  
comten    177  22.503    12.295      2       58  
ceoten    177   7.955     7.151      0       37  
sales     177 3,529.463 6,088.654    29    51,300
profits   177  207.831   404.454    -463   2,700 
mktval    177 3,600.316 6,442.276   387    45,400
lsalary   177   6.583     0.606    4.605   8.575 
lsales    177   7.231     1.432    3.367   10.845
lmktval   177   7.399     1.133    5.958   10.723
comtensq  177  656.684   577.123     4     3,364 
ceotensq  177  114.124   212.566     0     1,369 
profmarg  177   6.420    17.861   -203.077 47.458
-------------------------------------------------
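
Because stargazer skips the factor variables, one simple way to inspect them separately is a frequency table. This is a minimal base R sketch on made-up data (toy_grad is hypothetical); for the lab data we would pass tb.ceosal2$grad or tb.ceosal2$college instead.

```r
# Hypothetical factor with the same labels as grad
toy_grad <- factor(
  c("Grad. Degree", "No Grad. Degree", "Grad. Degree", "Grad. Degree"),
  levels = c("No Grad. Degree", "Grad. Degree")
)

# Absolute frequencies
grad_counts <- table(toy_grad)
grad_counts

# Relative frequencies, shown as percentages
grad_shares <- prop.table(grad_counts)
round(100 * grad_shares, 1)
```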





4. Variable Distributions

We will now use graphs to visualize the distribution of the variables relevant to our analysis, using the ggplot2 package. At this stage, it is particularly useful to plot histograms and boxplots for numeric variables. These visualizations help us assess the shape of the distribution—whether it is symmetric or skewed—identify potential outliers, determine typical values, and evaluate variability. For categorical variables, barplots are the appropriate choice. All graphical elements are highly customizable, and ggplot2 offers extensive aesthetic options for colors, labels, themes, and layout. Code copilots and AI tools like ChatGPT can be especially helpful when exploring or fine-tuning these customizations.

4.1 Histograms

We will be focusing on the relationship between CEO salaries and firm profits. We will thus start by plotting the distributions of these two variables.

# Histogram of CEO salary
ggplot(data = tb.ceosal2, aes(x = salary)) + geom_histogram()

# Alternatively:
tb.ceosal2 %>% ggplot(aes(x = salary)) + geom_histogram()

# Histogram of firm profits
ggplot(data = tb.ceosal2, aes(x = profits)) + geom_histogram()
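
When called without arguments, geom_histogram() uses 30 bins and prints a message suggesting that we pick a better value. The sketch below shows the two usual ways of doing so, bins and binwidth, on simulated data (toy and its salary column are made up, and the values 20 and 250 are arbitrary choices for illustration).

```r
library(ggplot2)

# Simulated right-skewed "salaries" (hypothetical data, not ceosal2)
set.seed(2025)
toy <- data.frame(salary = rlnorm(200, meanlog = 6.5, sdlog = 0.6))

# Option 1: fix the number of bins
hist_bins <- ggplot(data = toy, aes(x = salary)) +
  geom_histogram(bins = 20)

# Option 2: fix the width of each bin (in the units of the variable)
hist_width <- ggplot(data = toy, aes(x = salary)) +
  geom_histogram(binwidth = 250)

# Print the plots
hist_bins
hist_width
```

Fewer, wider bins give a smoother picture of the overall shape, while more, narrower bins reveal detail at the cost of a noisier plot.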




4.2 Boxplots

Box-and-whisker plots are an alternative (or complementary) way to visualize the distribution of numeric variables. The “box” spans the first and third quartiles, and the line in the middle of the box gives us the median value. Outliers are shown as dots beyond the whiskers.

# Boxplot of CEO salary (horizontal)
ggplot(data = tb.ceosal2, aes(x = salary)) + geom_boxplot() + scale_y_discrete( ) 

# Boxplot of firm profits (vertical)
ggplot(data = tb.ceosal2, aes(y = profits)) + geom_boxplot() + scale_x_discrete( ) 
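
To see which points end up as dots, we can reproduce the boxplot arithmetic by hand. The sketch below assumes geom_boxplot()'s default settings, where whiskers extend at most 1.5 times the interquartile range (IQR) from the box; toy_values is made-up data.

```r
# Hypothetical values with one clear outlier
toy_values <- c(2, 3, 4, 5, 6, 7, 8, 30)

# Box: first quartile, median, third quartile
q   <- quantile(toy_values, probs = c(0.25, 0.5, 0.75))
q1  <- q[["25%"]]
med <- q[["50%"]]
q3  <- q[["75%"]]

# Fences: 1.5 * IQR beyond the box
iqr         <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Points outside the fences are drawn as individual dots
outliers <- toy_values[toy_values < lower_fence | toy_values > upper_fence]
outliers   # 30
```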




4.3 Barplots

Categorical data can be visualized using barplots. Note that the plot will be ordered according to the pre-specified factor ordering.

# Barplot of grad
ggplot(data = tb.ceosal2, aes(x = grad)) + geom_bar()





5. Relationships Among Variables

Next in our analysis, we will investigate the relationships among the variables of interest.

5.1 Scatter plots

A scatterplot allows us to visualize the relationship between two numeric variables. Scatterplots can be easily created with ggplot2 by using geom_point(). As before, all plot elements can be customized.

To add a regression line to the scatterplot, we use the layer stat_smooth(method = lm) (the option method = lm indicates we want a linear regression line — the most common). The gray region is the confidence interval for the regression line.

# Create scatterplot and store it as a new object
plot.scatter <-  ggplot(data = tb.ceosal2
                     , aes(x = profits, y = salary)) +
                geom_point(size = 1) +
                theme_few() +
                labs(  x = "1990 profits, millions"
                     , y = "1990 CEO compensation, $1000s"
                     , title = "Relationship between CEO salaries and firm profits") +
                theme(title = element_text(size = 8))

# Print scatterplot
plot.scatter

# Add regression line to scatterplot
plot.scatter +
  # add regression line
  stat_smooth(method = lm, colour="black") 

The scatterplot suggests a positive (when one variable increases, the other increases as well) but weak linear relationship between the two variables (the points do not fall neatly on the straight line). In the scatterplot we can also more clearly detect outlier points.
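
The line drawn by stat_smooth(method = lm) is an ordinary least squares fit of y on x. As a sketch of what happens behind the plot, we can fit the same kind of model with lm() and read off the intercept and slope; the toy data frame below is made up and only borrows the variable names.

```r
# Hypothetical data with an almost perfectly linear relationship
toy <- data.frame(
  profits = c(1, 2, 3, 4, 5),
  salary  = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

# Fit salary as a linear function of profits
fit <- lm(salary ~ profits, data = toy)

# Intercept and slope of the regression line
coef(fit)
slope <- coef(fit)[["profits"]]

# A positive slope goes together with a positive correlation
slope > 0
cor(toy$profits, toy$salary) > 0
```

For the lab data, the equivalent call would be lm(salary ~ profits, data = tb.ceosal2).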

We can use color, point shape, or point size to add additional layers of information to scatterplots.

# Create scatterplot 
ggplot(data = tb.ceosal2
     , aes(x = profits, y = salary, color = grad)) +
  geom_point(size = 1) +
  # use custom theme and color scheme
  theme_classic() + scale_color_fivethirtyeight() +
  theme(title = element_text(size = 12)) +
  labs(  x     = "1990 profits, millions"
       , y     = "1990 CEO compensation, $1000s"
       , title = "Relationship between CEO salaries and firm profits"
       , color = "")





5.2 Categorical Data

We can use color in barplots to analyze relationships between two categorical variables.

# Barplot of college by grad - Counts
ggplot(data = tb.ceosal2
     , aes(x = college, fill = grad)) + 
  geom_bar() +
  # add custom theme and color scheme
  theme_fivethirtyeight() + scale_fill_tableau()

# Barplot of college by grad - Percent
ggplot(data = tb.ceosal2
     , aes(x = college, fill = grad)) + 
  geom_bar(position="fill") +
  # add custom theme and color scheme
  theme_fivethirtyeight() + scale_fill_tableau()
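
The numbers behind these two barplots are a two-way frequency table and its row proportions. Below is a minimal base R sketch on made-up data (toy is hypothetical): table() gives the counts shown by geom_bar(), and prop.table() with margin = 1 gives the within-bar shares shown with position = "fill".

```r
# Hypothetical data: six CEOs
toy <- data.frame(
  college = factor(c("College Degree", "College Degree", "College Degree",
                     "College Degree", "No College Degree", "No College Degree"),
                   levels = c("No College Degree", "College Degree")),
  grad    = factor(c("Grad. Degree", "Grad. Degree", "No Grad. Degree",
                     "Grad. Degree", "No Grad. Degree", "No Grad. Degree"),
                   levels = c("No Grad. Degree", "Grad. Degree"))
)

# Counts: rows are college categories, columns are grad categories
counts <- table(toy$college, toy$grad)
counts

# Shares within each college category (each row sums to 1)
shares <- prop.table(counts, margin = 1)
shares
```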

We can use side-by-side boxplots with or without color to analyze relationships between categorical and numerical variables.

# Boxplot of salary by grad
ggplot(data = tb.ceosal2
     , aes(x = grad, y = salary, fill = grad)) + 
  geom_boxplot() +
  # add custom theme and color scheme
  theme_fivethirtyeight() + scale_fill_tableau() 
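
The middle line of each box is the group median, which we can also compute directly. This is a short sketch with base R on made-up data (toy is hypothetical); tapply() and aggregate() are two equivalent ways of summarizing a numeric variable by group.

```r
# Hypothetical data: salary (in $1000s) for six CEOs
toy <- data.frame(
  grad   = factor(c("No Grad. Degree", "No Grad. Degree", "No Grad. Degree",
                    "Grad. Degree", "Grad. Degree", "Grad. Degree"),
                  levels = c("No Grad. Degree", "Grad. Degree")),
  salary = c(400, 600, 800, 500, 900, 1300)
)

# Median salary per group, as a named vector
group_medians <- tapply(toy$salary, toy$grad, median)
group_medians

# The same summary, returned as a data frame
aggregate(salary ~ grad, data = toy, FUN = median)
```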





5.3 Correlation Tables and Plots

Correlation tables give us the correlation coefficient for each pair of numeric variables in the data.

# Create correlation matrix 
cor_matrix <- tb.ceosal2 %>% 
  select(salary, age, comten, ceoten, sales, profits, mktval) %>% 
  correlate(diagonal = 1)
# Format table for a professional presentation
cor_matrix %>%
  rearrange() %>%            # rearrange by correlations
  shave() %>%                # Shave off the upper triangle for a clean result
  fashion(decimals = 3)      # Clean presentation
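
Each entry in a correlation table is a Pearson correlation coefficient: the covariance of the two variables divided by the product of their standard deviations. The sketch below checks this with base R's cor() rather than corrr, on made-up data (toy is hypothetical and only borrows variable names).

```r
# Hypothetical data for three numeric variables
toy <- data.frame(
  salary  = c(2.1, 3.9, 6.2, 7.8, 10.1),
  profits = c(1, 2, 3, 4, 5),
  age     = c(60, 45, 52, 48, 55)
)

# Full correlation matrix: symmetric, with 1s on the diagonal
cor_base <- cor(toy)
round(cor_base, 3)

# One entry computed by hand: cov(x, y) / (sd(x) * sd(y))
cor_manual <- cov(toy$salary, toy$profits) /
  (sd(toy$salary) * sd(toy$profits))

# Both approaches agree
all.equal(cor_manual, cor_base["salary", "profits"])
```

Because the matrix is symmetric, dropping the upper triangle, as shave() does above, loses no information.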

Correlations can also be presented visually. This can be done in different ways (e.g. heatmap, barplot, correlation plot), depending on the data and the researcher’s preferences.

A correlation plot can be created easily using the function rplot().

# Correlation plot
cor_matrix %>% rplot()





5.4 GGpairs

The function ggpairs() of the package GGally is a good shortcut for visualizing the relationship between several variables all at once. However, you should be careful when using this function: select only a few variables (otherwise the output becomes unreadable) and, if your dataset contains too many observations, make sure to select only a sample of those observations for plotting.

# Visualize distributions, scatterplots, and correlations
ggpairs(tb.ceosal2 %>% select(salary, grad, sales, profits))
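
With only 177 observations, ceosal2 does not need sampling. For a much larger data set, one possible approach is to draw a random subset of rows before plotting. The sketch below uses simulated data (big and the sample size of 500 are arbitrary assumptions) and base R's sample().

```r
# Simulated "large" data set (hypothetical)
set.seed(2025)
big <- data.frame(x = rnorm(10000), y = rnorm(10000))

# Draw 500 rows at random, without replacement
sampled_rows <- sample(nrow(big), size = 500)
plot_sample  <- big[sampled_rows, ]

nrow(plot_sample)   # 500

# The smaller data frame can then be passed to ggpairs(), e.g.:
# ggpairs(plot_sample)
```

Setting the seed first makes the sample, and therefore the plot, reproducible.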





Resources


  1. “There are multiple classes that are grouped together as “numeric” classes, the 2 most common of which are double (for double precision floating point numbers) and integer. R will automatically convert between the numeric classes when needed, so for the most part it does not matter to the casual user whether the number 3 is currently stored as an integer or as a double. Most math is done using double precision, so that is often the default storage.” (Source: https://stackoverflow.com/questions/23660094/whats-the-difference-between-integer-class-and-numeric-class-in-r)