Data Visualization is a term used to describe the use of graphical displays to:
Data becomes more comprehensible and more useful when organized and presented
Data patterns are commonly described in terms of the:
Example 1
Create a Frequency Distribution Table for the data on soft drink purchases presented in the following table.
Purchase | Purchase | Purchase | Purchase |
---|---|---|---|
Coke Classic | Sprite | Pepsi | Diet Coke |
Coke Classic | Coke Classic | Pepsi | Diet Coke |
Coke Classic | Diet Coke | Coke Classic | Coke Classic |
Coke Classic | Diet Coke | Pepsi | Coke Classic |
Coke Classic | Dr. Pepper | Dr. Pepper | Sprite |
Coke Classic | Diet Coke | Pepsi | Diet Coke |
Pepsi | Coke Classic | Pepsi | Pepsi |
Coke Classic | Pepsi | Coke Classic | Coke Classic |
Pepsi | Dr. Pepper | Pepsi | Pepsi |
Sprite | Coke Classic | Coke Classic | Coke Classic |
Sprite | Dr. Pepper | Diet Coke | Dr. Pepper |
Pepsi | Coke Classic | Pepsi | Sprite |
Coke Classic | Diet Coke |
The R Script:
# install.packages("readr")
# install.packages("pander")
library(readr)
library(pander)
# Import "purchase.csv" data and store it in 'pchase'.
pchase <- read.csv("purchase.csv")
# Determine the frequencies for each observation.
pchase.freq <- table(pchase)
pander(pchase.freq)
Coke Classic | Diet Coke | Dr. Pepper | Pepsi | Sprite |
---|---|---|---|---|
19 | 8 | 5 | 13 | 5 |
# Create the Frequency Distribution Table.
freq.dist <- cbind(pchase.freq)
colnames(freq.dist) <- c("Frequency")
The Tabular Output
# Display the Frequency Distribution Table.
pander(freq.dist)
Purchase | Frequency |
---|---|
Coke Classic | 19 |
Diet Coke | 8 |
Dr. Pepper | 5 |
Pepsi | 13 |
Sprite | 5 |
The R script:
# Compute the relative frequencies and build the table.
data.relfreq <- pchase.freq / nrow(pchase)
relfreq.dist <- cbind(data.relfreq)
colnames(relfreq.dist) <- c("Relative Frequency")
The Tabular Output
pander(relfreq.dist)
Purchase | Relative Frequency |
---|---|
Coke Classic | 0.38 |
Diet Coke | 0.16 |
Dr. Pepper | 0.1 |
Pepsi | 0.26 |
Sprite | 0.1 |
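The two tables above can also be reproduced without the CSV file. The sketch below is illustrative only: it rebuilds the purchases from the counts reported in the frequency table (the object names `counts`, `purchases`, and `fdt` are assumptions, not part of the original script) and uses base R's `prop.table()` in place of dividing by `nrow()`.

```r
# Counts copied from the frequency table above (hypothetical object name).
counts <- c("Coke Classic" = 19, "Diet Coke" = 8, "Dr. Pepper" = 5,
            "Pepsi" = 13, "Sprite" = 5)

# Rebuild one observation per purchase.
purchases <- rep(names(counts), times = counts)

# Frequencies and relative frequencies.
freq <- table(purchases)
relfreq <- prop.table(freq)

# Combine both into a single table, one row per soft drink.
fdt <- cbind(Frequency = freq, "Relative Frequency" = relfreq)
print(fdt)
```

Since the relative frequencies are proportions of the same total, they should sum to 1, which is a quick check on any table built this way.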
Example 2
A survey was taken on Aurora Avenue. In each of 20 homes, people were asked how many cars were registered to their household. The results were recorded as follows: 1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0.
The R Script:
# Create the data vector.
cars <- c(1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0)
# Determine the frequencies.
car.freq <- table(cars)
# Compute the relative and percent frequencies, respectively.
car.relfreq <- car.freq / sum(car.freq)
car.pctfreq <- car.relfreq * 100
# Create the tabular output.
car.freqdist <- cbind(car.freq, car.relfreq, car.pctfreq)
colnames(car.freqdist) <- c("Frequency", "Relative Frequency", "Percent Frequency")
# Display the tabular output.
pander(car.freqdist)
Cars | Frequency | Relative Frequency | Percent Frequency |
---|---|---|---|
0 | 4 | 0.2 | 20 |
1 | 6 | 0.3 | 30 |
2 | 5 | 0.25 | 25 |
3 | 3 | 0.15 | 15 |
4 | 2 | 0.1 | 10 |
Steps in Creating a Grouped FDT:
Example
Consider the following data set presented in Example 3 of the module:
425, 430, 430, 435, 435, 435, 435, 435, 440, 440, 440, 440, 440, 445, 445, 445, 445, 445, 450, 450, 450, 450, 450, 450, 450, 460, 460, 460, 465, 465, 465, 470, 470, 472, 475, 475, 475, 480, 480, 480, 480, 485, 490, 490, 490, 500, 500, 500, 500, 510, 510, 515, 525, 525, 525, 535, 549, 550, 570, 570, 575, 575, 580, 590, 600, 600, 600, 600, 615, 615
Create a frequency distribution table with 7 class intervals.
The R script
# Load necessary packages.
library(readr)
library(pander)
# Import data into RStudio.
rent <- read.csv("rent.csv")
# Generate the class boundaries as a regular sequence of values.
breaks <- seq(425, 621, by = 28)
# Create the class intervals and assign the data values to them.
classint <- cut(rent$Rent, breaks, right = FALSE)
# Determine the frequency of each class interval.
freq <- table(classint)
# Transform the table to column format.
freq.dist <- cbind(freq)
# Label the column of frequencies.
colnames(freq.dist) <- c("Frequency")
The Tabular Output
# Generate the Grouped FDT.
pander(freq.dist)
Class Interval | Frequency |
---|---|
[425,453) | 25 |
[453,481) | 16 |
[481,509) | 8 |
[509,537) | 7 |
[537,565) | 2 |
[565,593) | 6 |
[593,621) | 6 |
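The breaks `seq(425, 621, by = 28)` follow from the range of the data and the requested number of classes. The sketch below shows one common way to arrive at them. The rule used here (range divided by the number of classes, rounded up to the next whole number) is an assumption about how the width of 28 was chosen; 425 and 615 are the smallest and largest values in the data set above, and the names `rent.min`, `rent.max`, `k`, and `width` are hypothetical.

```r
# Minimum and maximum of the rent data listed above.
rent.min <- 425
rent.max <- 615
k <- 7  # requested number of class intervals

# Assumed rule: class width = range / k, rounded up to a whole number.
width <- ceiling((rent.max - rent.min) / k)

# Lower class limits plus the final upper limit (k + 1 boundaries).
breaks <- seq(rent.min, by = width, length.out = k + 1)

print(width)
print(breaks)
```

Because the intervals are closed on the left (`right = FALSE`), the last boundary has to be strictly greater than the maximum value, which is the case here.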
Example for Data on Soft Drink Purchases
The R Script:
# install.packages("tidyverse")
# install.packages("forcats")
# Load necessary packages.
library(readr)
library(tidyverse)
library(forcats)
# Import "purchase.csv" file.
purchase <- read.csv("purchase.csv")
The Chart:
# Create the chart. Store it in 'bar1'.
bar1 <- ggplot(purchase, aes(x = Purchase)) +
  geom_bar(width = .5) +
  ggtitle("Soft Drink Purchases")
# Present the generated chart.
bar1
The Graphical Output (ordered bars):
bar2 <- ggplot(mutate(purchase, Purchase = fct_infreq(Purchase))) +
  geom_bar(aes(x = Purchase), width = .5) +
  ggtitle("Soft Drink Purchases")
bar2
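`fct_infreq()` reorders the levels of a factor from most to least frequent, which is what places the tallest bar first. As a rough illustration of the same idea in base R (not the `forcats` implementation), the sketch below reorders the levels by hand. The `drinks` vector is rebuilt from the counts in the frequency table for this data set, and all object names are hypothetical.

```r
# Rebuild the purchases from the reported counts (assumed stand-in for the CSV).
counts <- c("Coke Classic" = 19, "Diet Coke" = 8, "Dr. Pepper" = 5,
            "Pepsi" = 13, "Sprite" = 5)
drinks <- rep(names(counts), times = counts)

# Order the factor levels by decreasing frequency.
ordered.levels <- names(sort(table(drinks), decreasing = TRUE))
drinks.ordered <- factor(drinks, levels = ordered.levels)

# The leading levels are now the most frequently purchased drinks.
print(head(levels(drinks.ordered), 3))
```

Passing `table(drinks.ordered)` to base R's `barplot()` would draw the bars in this same order.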
Example (Data on the number of cars registered to each household).
Data: 1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0
The R Script
library(ggplot2)
library(readr)
# Import the "cars.csv" data file.
cars <- read.csv("cars.csv")
# Generate the plot.
dplot <- ggplot(cars, aes(cars)) +
  geom_dotplot(binwidth = 0.25)
The Output Plot
dplot
Using the “stripchart” function
The R Script
# Manually construct the data vector.
cars <- c(1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0)
# Create the dot plot.
dplot <- stripchart(cars, method = "stack", at = c(0.05), pch = 20,
                    cex = 3.2, las = 1, frame.plot = FALSE,
                    xlim = c(0, 5), main = "Number of Cars Registered")
Example 1
Data: 22, 29, 22, 31, 20, 12, 14, 24, 13, 4, 2, 1
The R Script
# Create the data vector.
data <- c(22, 29, 22, 31, 20, 12, 14, 24, 13, 4, 2, 1)
# Create the stem-and-leaf plot.
stem(data, scale = 1)
  The decimal point is 1 digit(s) to the right of the |
  0 | 124
  1 | 234
  2 | 02249
  3 | 1
Example 2
Data: 8.6, 11.7, 9.4, 9.1, 10.2, 11.0, 8.8
The R Script
# Create the data vector.
data <- c(8.6, 11.7, 9.4, 9.1, 10.2, 11.0, 8.8)
# Create the stem-and-leaf plot.
stem(data, scale = 1)
  The decimal point is at the |
   8 | 68
   9 | 14
  10 | 2
  11 | 07
A crosstabulation is a tabular summary of data for two variables.
It is also called a contingency table.
Example: Consider the data below (stored in “example.csv”):
Respondent | Gender | Age | Education |
---|---|---|---|
1 | Male | 20 | Bachelor’s Degree |
2 | Male | 18 | Undergraduate |
3 | Female | 19 | Undergraduate |
4 | Male | 25 | Bachelor’s Degree |
5 | Female | 37 | Master’s Degree |
6 | Female | 15 | Undergraduate |
7 | Female | 40 | PhD |
8 | Male | 43 | Bachelor’s Degree |
9 | Male | 60 | Bachelor’s Degree |
10 | Female | 65 | Master’s Degree |
11 | Female | 42 | Bachelor’s Degree |
Create a crosstabulation of Gender vs. Education.
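A minimal sketch of how this crosstabulation could be produced with base R's `table()` follows. To keep it self-contained, the two columns are typed in directly from the table above rather than read from “example.csv”, straight apostrophes replace the typographic ones, and the object names `survey` and `crosstab` are hypothetical.

```r
# Gender and Education columns copied from the table above.
survey <- data.frame(
  Gender = c("Male", "Male", "Female", "Male", "Female", "Female",
             "Female", "Male", "Male", "Female", "Female"),
  Education = c("Bachelor's Degree", "Undergraduate", "Undergraduate",
                "Bachelor's Degree", "Master's Degree", "Undergraduate",
                "PhD", "Bachelor's Degree", "Bachelor's Degree",
                "Master's Degree", "Bachelor's Degree")
)

# Rows are Gender, columns are Education.
crosstab <- table(survey$Gender, survey$Education)

# Show the crosstabulation with row and column totals appended.
print(addmargins(crosstab))
```

If the CSV file is available, the same `table()` call should work on the imported data frame, assuming its columns are named `Gender` and `Education` as in the table.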