Data Presentation and Visualization

Data visualization is the use of graphical displays to:

  • summarize data
  • present data

Data becomes more comprehensible and more useful when it is organized and presented.

Data Patterns in Graphs

Data patterns are commonly described in terms of the:

  • Center: data point where about half of the observations are on either side.
  • Spread: variability of the data.
  • Shape: can be described by the characteristics:
      • Symmetry
      • Number of peaks
      • Skewness

Other data patterns:

  • Unusual features: gaps in the data or outliers.
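
As a rough illustration of these descriptors, the sketch below computes a few base R summaries for the car-registration counts used in Example 2 later in this section. The choice of summaries is an assumption made for illustration, and comparing the mean with the median is only an informal hint about skewness.

```r
# Car-registration counts from Example 2 of this section.
cars <- c(1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0)

# Center: the median has about half of the observations on either side.
center <- median(cars)

# Spread: the range and the standard deviation describe variability.
spread.range <- range(cars)
spread.sd <- sd(cars)

# Shape (informal): a mean above the median suggests a longer right tail.
mean.minus.median <- mean(cars) - median(cars)

# Number of peaks: the value(s) with the highest frequency.
freq <- table(cars)
peaks <- names(freq)[as.vector(freq == max(freq))]

print(c(center = center, mean = mean(cars), sd = spread.sd))
print(peaks)
```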

Summarizing Qualitative and Quantitative Data for a Single Variable

FREQUENCY DISTRIBUTION TABLE

  • Shows how often each value (or set of values) of the variable in question occurs in a data set.
  • A tabular summary of data showing the frequency (number of observations) in each class.

Relative Frequency Distribution

  • Gives a tabular summary of data showing the relative frequency of each class.

Percent Frequency Distribution

  • Presents the percent frequency of the data for each class.

Frequency Distribution Table

Example 1
Create a Frequency Distribution Table for the data on soft drink purchases presented in the following table.

Purchase       Purchase       Purchase       Purchase
Coke Classic   Sprite         Pepsi          Diet Coke
Coke Classic   Coke Classic   Pepsi          Diet Coke
Coke Classic   Diet Coke      Coke Classic   Coke Classic
Coke Classic   Diet Coke      Pepsi          Coke Classic
Coke Classic   Dr. Pepper     Dr. Pepper     Sprite
Coke Classic   Diet Coke      Pepsi          Diet Coke
Pepsi          Coke Classic   Pepsi          Pepsi
Coke Classic   Pepsi          Coke Classic   Coke Classic
Pepsi          Dr. Pepper     Pepsi          Pepsi
Sprite         Coke Classic   Coke Classic   Coke Classic
Sprite         Dr. Pepper     Diet Coke      Dr. Pepper
Pepsi          Coke Classic   Pepsi          Sprite
Coke Classic   Diet Coke

The R Script:

#install.packages("readr")
#install.packages("pander")
library(readr)
library(pander)
# Import "purchase.csv" data and store it in 'pchase'.
pchase <- read.csv("purchase.csv")
# Determine the frequencies for each observation.
pchase.freq <- table(pchase)
pander(pchase.freq)
Coke Classic   Diet Coke   Dr. Pepper   Pepsi   Sprite
19             8           5            13      5
# Create the Frequency Distribution Table.
freq.dist <- cbind(pchase.freq)
colnames(freq.dist) <- c("Frequency")

The Tabular Output

# Generate created Frequency Distribution Table.
pander(freq.dist)
               Frequency
Coke Classic   19
Diet Coke      8
Dr. Pepper     5
Pepsi          13
Sprite         5

Relative Frequency Distribution Table

The R script:

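# Divide each frequency by the number of observations.
data.relfreq <- pchase.freq / nrow(pchase)
# Transform the table to column format and label the column.
relfreq.dist <- cbind(data.relfreq)
colnames(relfreq.dist) <- c("Relative Frequency")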

The Tabular Output

pander(relfreq.dist)
               Relative Frequency
Coke Classic   0.38
Diet Coke      0.16
Dr. Pepper     0.1
Pepsi          0.26
Sprite         0.1
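
The percent frequency distribution defined earlier follows directly from the relative frequencies. Below is a minimal sketch, assuming the counts from the frequency distribution table are typed in by hand so that no CSV file is needed; the object names are illustrative only.

```r
# Frequencies taken from the frequency distribution table for Example 1.
freq <- c("Coke Classic" = 19, "Diet Coke" = 8, "Dr. Pepper" = 5,
          "Pepsi" = 13, "Sprite" = 5)

# Relative frequency = class frequency / total number of observations.
relfreq <- freq / sum(freq)

# Percent frequency = relative frequency * 100.
pctfreq <- relfreq * 100

# Combine the three columns into one table.
full.dist <- cbind("Frequency" = freq,
                   "Relative Frequency" = relfreq,
                   "Percent Frequency" = pctfreq)
print(full.dist)
```

The relative frequencies should sum to 1 and the percent frequencies to 100, which is a quick check that no class was missed.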

Example 2
A survey was conducted on Aurora Avenue. In each of 20 homes, people were asked how many cars were registered to their household. The results were recorded as follows: 1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0

The R Script:

# Create the data vector.
cars <- c(1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0)
# Determine the frequencies.
car.freq <- table(cars)
# Compute the relative and percent frequencies, respectively.
car.relfreq <- car.freq / sum(car.freq)
car.pctfreq <- car.relfreq * 100
# Create the tabular output.
car.freqdist <- cbind(car.freq, car.relfreq, car.pctfreq)
colnames(car.freqdist) <- c("Frequency", "Relative Frequency", "Percent Frequency")

Complete Frequency Distribution

# Generate the tabular output.
pander(car.freqdist)
    Frequency   Relative Frequency   Percent Frequency
0   4           0.2                  20
1   6           0.3                  30
2   5           0.25                 25
3   3           0.15                 15
4   2           0.1                  10

Grouped Frequency Distribution Table

  • Individual data values are classified into categories called class intervals.
  • Not advisable as the basis for data analysis, because the individual values are lost once the data are grouped.
  • Created simply as a convenient means of organizing and summarizing data.

Steps in Creating a Grouped FDT:

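  • Determine the number of classes, k.
    (Use Sturges’ formula: k = 1 + 3.322 log n, where n is the number of observations and log is the base-10 logarithm.)
  • Calculate the class size (or class width), c: divide the range of the data by k, usually rounding up.
  • Enumerate the class intervals.
  • Tally the observations.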

Example
Consider the following data set presented in Example 3 of the module:
425, 430, 430, 435, 435, 435, 435, 435, 440, 440, 440, 440, 440, 445, 445, 445, 445, 445, 450, 450, 450, 450, 450, 450, 450, 460, 460, 460, 465, 465, 465, 470, 470, 472, 475, 475, 475, 480, 480, 480, 480, 485, 490, 490, 490, 500, 500, 500, 500, 510, 510, 515, 525, 525, 525, 535, 549, 550, 570, 570, 575, 575, 580, 590, 600, 600, 600, 600, 615, 615

Create a frequency distribution table with 7 class intervals.
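
The steps listed earlier can be sketched for this data set. The form of Sturges’ formula used here, k = 1 + 3.322 log10(n), the rounding of k to the nearest integer, and the rounding up of the class width are assumptions; conventions differ between textbooks (base R's nclass.Sturges(), for instance, rounds up). The data vector is typed in by hand, and the object names are illustrative.

```r
# Rent data from the example (n = 70).
rent <- c(425, 430, 430, 435, 435, 435, 435, 435, 440, 440, 440, 440, 440,
          445, 445, 445, 445, 445, 450, 450, 450, 450, 450, 450, 450, 460,
          460, 460, 465, 465, 465, 470, 470, 472, 475, 475, 475, 480, 480,
          480, 480, 485, 490, 490, 490, 500, 500, 500, 500, 510, 510, 515,
          525, 525, 525, 535, 549, 550, 570, 570, 575, 575, 580, 590, 600,
          600, 600, 600, 615, 615)
n <- length(rent)

# Step 1: number of classes via Sturges' formula (rounded to nearest integer).
k <- round(1 + 3.322 * log10(n))

# Step 2: class width = range / k, rounded up to a whole number.
c.width <- ceiling((max(rent) - min(rent)) / k)

# Step 3: class boundaries, starting at the minimum value.
breaks <- seq(min(rent), by = c.width, length.out = k + 1)

# Step 4: tally the observations in left-closed intervals [a, b).
freq <- table(cut(rent, breaks, right = FALSE))

print(c(k = k, width = c.width))
print(freq)
```

Under these assumptions the sketch gives 7 classes of width 28, starting at 425 and ending at 621.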

The R script

# Load necessary packages.
library(readr) 
library(pander)
# Import data into RStudio.
rent <- read.csv("rent.csv")
# Generate the class boundaries: 425 to 621 in steps of the class width, 28.
breaks <- seq(425, 621, by = 28)

# Create the left-closed class intervals [a, b) and assign the data values to them.
classint <- cut(rent$Rent, breaks, right = FALSE)
# Determine the frequency of each class interval.
freq <- table(classint)
# Transform the table to column format.
freq.dist <- cbind(freq)
# Label the column of frequencies.
colnames(freq.dist) <- c("Frequency")

The Tabular Output

# Generate the Grouped FDT.
pander(freq.dist)
            Frequency
[425,453)   25
[453,481)   16
[481,509)   8
[509,537)   7
[537,565)   2
[565,593)   6
[593,621)   6

Bar Chart

  • Used to display qualitative data summarized in a frequency, relative frequency, or percent frequency distribution.
  • In a vertical bar chart, the horizontal axis shows the categories and the vertical axis shows the value (frequency, relative frequency, or percent frequency).

Example for Data on Soft Drink Purchases

The R Script:

# install.packages("tidyverse")
# install.packages("forcats")
# Load necessary packages.
library(readr) 
library(tidyverse) 
library(forcats)

# Import "purchase.csv" file.
purchase <-read.csv("purchase.csv") 

The Chart:

# Create the chart and store it in 'bar1'.
bar1 <- ggplot(purchase, aes(x=Purchase)) + geom_bar(width=.5)+
  ggtitle("Soft Drink Purchases")

# Present the generated chart.
bar1

The Graphical Output (ordered bars):

# fct_infreq() reorders the categories from most to least frequent.
bar2 <- ggplot(mutate(purchase, Purchase = fct_infreq(Purchase))) +
  geom_bar(aes(x = Purchase), width = .5) + ggtitle("Soft Drink Purchases")
bar2
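
A bar chart can also show percent frequencies rather than raw counts. The sketch below swaps in base R's barplot() for ggplot2 so that it runs without the CSV file or extra packages; the counts are typed in from the frequency distribution table for the soft drink data, and the axis labels and y-axis limit are illustrative choices.

```r
# Frequencies from the soft drink frequency distribution table.
freq <- c("Coke Classic" = 19, "Diet Coke" = 8, "Dr. Pepper" = 5,
          "Pepsi" = 13, "Sprite" = 5)

# Convert to percent frequencies and order the bars from tallest to shortest.
pctfreq <- sort(100 * freq / sum(freq), decreasing = TRUE)

# Vertical bar chart: categories on the horizontal axis, values on the vertical.
barplot(pctfreq, main = "Soft Drink Purchases",
        xlab = "Purchase", ylab = "Percent Frequency", ylim = c(0, 40))
```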

Pie Chart

  • Provides another graphical device for presenting the frequency, relative frequency, or percent frequency distribution of qualitative data.
  • The pie chart uses sectors of a circle, where the numerical value represented by each sector can be the frequency, relative frequency, or percent frequency.
  • The angle of each sector is proportional to the frequency of the corresponding category of the variable.
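
To make the proportionality concrete, the sketch below computes each sector's angle as 360 degrees times the relative frequency and draws the chart with base R's pie(). It assumes the soft drink counts from the frequency distribution table are typed in by hand; the label format is an illustrative choice.

```r
# Frequencies from the soft drink frequency distribution table.
freq <- c("Coke Classic" = 19, "Diet Coke" = 8, "Dr. Pepper" = 5,
          "Pepsi" = 13, "Sprite" = 5)
relfreq <- freq / sum(freq)

# Sector angle in degrees: proportional to the relative frequency.
angles <- 360 * relfreq
print(angles)

# Label each sector with its category and percent frequency.
labels <- paste0(names(freq), " (", round(100 * relfreq), "%)")

# pie() sizes the sectors from the frequencies themselves.
pie(freq, labels = labels, main = "Soft Drink Purchases")
```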

The Dot Plot

  • Similar to a bar graph.
  • The height, represented by the number of dots, equals the number of items in a certain category.

Example (Data on the number of cars registered to each household).
Data: 1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0

The R Script

library(ggplot2) 
library(readr) 

# Import the "cars.csv" data file (its column is assumed to be named 'cars').
cars <- read.csv("cars.csv")

# Generate the plot: one dot per household, stacked at each value.
dplot <- ggplot(cars, aes(cars)) + geom_dotplot(binwidth = 0.25)

The Output Plot

dplot

Using the “stripchart” function

The R Script

# Manually construct the data vector.
cars <- c(1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0)

# Create the dot plot. stripchart() draws directly, so the result is not stored.
stripchart(cars, method = "stack", at = 0.05,
           pch = 20, cex = 3.2, las = 1, frame.plot = FALSE,
           xlim = c(0, 5), main = "Number of Cars Registered")

Stem-and-Leaf Plot

  • It shows both the rank order of the data and the shape of the distribution.
  • Useful when there are many data values.

Example 1

Data: 22, 29, 22, 31, 20, 12, 14, 24, 13, 4, 2, 1

The R Script

# Create the data vector
data <- c(22, 29, 22, 31, 20, 12, 14, 24, 13, 4, 2, 1)
# Create the stem-and-leaf plot
stem(data, scale = 1)
  The decimal point is 1 digit(s) to the right of the |

  0 | 124
  1 | 234
  2 | 02249
  3 | 1
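
To see where the stems and leaves come from, the display can be rebuilt by hand. This is a sketch for whole numbers below 100 only, with the tens digit as the stem and the units digit as the leaf; stem() chooses its stems automatically and handles far more cases. The object names are illustrative.

```r
# Data from Example 1.
data <- c(22, 29, 22, 31, 20, 12, 14, 24, 13, 4, 2, 1)

# Sort first so the leaves appear in rank order.
sorted <- sort(data)

# Stem = tens digit, leaf = units digit.
stems <- sorted %/% 10
leaves <- sorted %% 10

# Group the leaves by stem and paste each group into one row.
by.stem <- split(leaves, stems)
rows <- sapply(names(by.stem), function(s) {
  paste0(s, " | ", paste(by.stem[[s]], collapse = ""))
})
cat(rows, sep = "\n")
```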

Example 2

Data: 8.6, 11.7, 9.4, 9.1, 10.2, 11.0, 8.8

The R Script

# Create the data vector
data <- c(8.6, 11.7, 9.4, 9.1, 10.2, 11.0, 8.8)
# Create the stem-and-leaf plot
stem(data, scale = 1)
  The decimal point is at the |

   8 | 68
   9 | 14
  10 | 2
  11 | 07

Crosstabulation

A crosstabulation is a tabular summary of data for two variables.
It is also called a contingency table.

The two variables can be:

  • both qualitative
  • both quantitative
  • a combination (one qualitative and one quantitative)

Example: Consider the data below (stored in “example.csv”):

Respondent   Gender   Age   Education
1            Male     20    Bachelor’s Degree
2            Male     18    Undergraduate
3            Female   19    Undergraduate
4            Male     25    Bachelor’s Degree
5            Female   37    Master’s Degree
6            Female   15    Undergraduate
7            Female   40    PhD
8            Male     43    Bachelor’s Degree
9            Male     60    Bachelor’s Degree
10           Female   65    Master’s Degree
11           Female   42    Bachelor’s Degree

For the given data:

  1. Create a crosstabulation of Gender vs Education.

  2. Create a crosstabulation of Age vs Education, where each respondent is classified by age as a “Teen” (15 to 19 years of age), an “Adult” (20 to 59 years of age), or a “Senior” (60 years of age and above).
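
One possible way to approach both tasks is sketched below with base R's table() and cut(). It is not necessarily the intended solution. The data frame is typed in by hand rather than read from “example.csv”, straight apostrophes replace the curly ones in the education labels, and the names survey and AgeGroup are hypothetical.

```r
# The eleven respondents from the table above, entered by hand.
survey <- data.frame(
  Gender = c("Male", "Male", "Female", "Male", "Female", "Female",
             "Female", "Male", "Male", "Female", "Female"),
  Age = c(20, 18, 19, 25, 37, 15, 40, 43, 60, 65, 42),
  Education = c("Bachelor's Degree", "Undergraduate", "Undergraduate",
                "Bachelor's Degree", "Master's Degree", "Undergraduate",
                "PhD", "Bachelor's Degree", "Bachelor's Degree",
                "Master's Degree", "Bachelor's Degree")
)

# 1. Gender vs Education: both variables are qualitative.
gender.educ <- table(survey$Gender, survey$Education)
print(gender.educ)

# 2. Age vs Education: first classify the ages into groups.
#    Left-closed intervals: [15, 20) Teen, [20, 60) Adult, [60, Inf) Senior.
survey$AgeGroup <- cut(survey$Age, breaks = c(15, 20, 60, Inf),
                       right = FALSE,
                       labels = c("Teen", "Adult", "Senior"))
age.educ <- table(survey$AgeGroup, survey$Education)
print(age.educ)
```

Setting right = FALSE makes each interval include its lower bound, so a 20-year-old is counted as an “Adult” and a 60-year-old as a “Senior”, matching the age ranges given in the task.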