Basic Data Visualization

Basic Visualization in R Programming: Base Package

Data Preparation

Before we begin, make sure that you have read the dataset ‘telco.csv’ into your R session. There are two ways of doing this:

1. Read the csv file by passing file.choose() to read.csv(), which opens a file picker, and name the data telco_data (stringsAsFactors = TRUE keeps the text columns as factors, matching the structure shown further down):

telco_data <- read.csv(file.choose(), stringsAsFactors = TRUE)

2. Copy the dataset into your working directory and read the csv file from there:

getwd()

## [1] "C:/Users/Asmui/Documents/Online Class"

For example, my working directory is “C:/Users/Asmui/Documents/Online Class”. Next, read the “telco.csv” dataset:

telco_data <- read.csv ("telco.csv", stringsAsFactors = TRUE)

3. You can check the structure of the dataset to make sure that R has read it correctly.

str(telco_data)

## 'data.frame':    45 obs. of  6 variables:
##  $ Gender       : Factor w/ 2 levels "Female","Male": 2 1 1 1 1 1 2 2 1 1 ...
##  $ Programs     : Factor w/ 4 levels "Account","Business",..: 4 2 3 4 3 3 1 4 4 3 ...
##  $ Car_Ownership: Factor w/ 2 levels "No","Yes": 2 2 1 1 1 2 2 1 2 1 ...
##  $ Telco_Prefer : Factor w/ 4 levels "Celcom","DiGi",..: 1 2 1 3 4 1 3 2 2 3 ...
##  $ Usage_GB     : num  14.6 15.7 14.8 15.4 12.9 22.4 28 19.2 25.4 25.3 ...
##  $ Hour_Perday  : num  3 3.8 4 3.5 3 6 6.5 5.5 6 6 ...

head (telco_data)

##   Gender   Programs Car_Ownership Telco_Prefer Usage_GB Hour_Perday
## 1   Male Statistics           Yes       Celcom     14.6         3.0
## 2 Female   Business           Yes         DiGi     15.7         3.8
## 3 Female   Sciences            No       Celcom     14.8         4.0
## 4 Female Statistics            No        Maxis     15.4         3.5
## 5 Female   Sciences            No     U-Mobile     12.9         3.0
## 6 Female   Sciences           Yes       Celcom     22.4         6.0
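
The effect of stringsAsFactors = TRUE can be checked on a small file. The sketch below writes a made-up three-row CSV to a temporary file (the file and its values are hypothetical, not the telco data) and reads it back both ways:

```r
# Hypothetical mini dataset written to a temporary file (not telco.csv)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Gender,Usage_GB",
             "Male,14.6",
             "Female,15.7",
             "Female,14.8"), tmp)

as_text   <- read.csv(tmp, stringsAsFactors = FALSE)  # text stays character
as_factor <- read.csv(tmp, stringsAsFactors = TRUE)   # text becomes factor

class(as_text$Gender)
class(as_factor$Gender)
levels(as_factor$Gender)   # factor levels are sorted alphabetically

unlink(tmp)  # remove the temporary file
```

Numeric columns such as Usage_GB are read as numbers in both cases; only the text columns are affected.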

Note that for this example, we will attach telco_data so that its columns can be referred to directly by name:

attach (telco_data)

Revision:
What is the difference between using and not using the attach() function?
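
As a hint for the revision question, the sketch below uses a small made-up data frame (toy is a hypothetical name) to compare three ways of reaching a column: with $, with with(), and after attach(). It also calls detach(), which is not used in this tutorial but undoes attach().

```r
# Hypothetical toy data frame standing in for telco_data
toy <- data.frame(Gender   = factor(c("Male", "Female", "Female")),
                  Usage_GB = c(14.6, 15.7, 14.8))

# Without attach(): name the data frame explicitly
mean_dollar <- mean(toy$Usage_GB)
mean_with   <- with(toy, mean(Usage_GB))

# With attach(): columns can be used by their bare names
attach(toy)
mean_attached <- mean(Usage_GB)
detach(toy)   # remove toy from the search path again

c(mean_dollar, mean_with, mean_attached)
```

All three calls read the same column, so the three means are equal; what changes is where R looks for the name Usage_GB.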

Data Visualization by using Base Package

Categorical Data (Summary)

Frequency and Contingency Table

table() produces a frequency table of counts for each level of the specified factors.

table (Gender)

## Gender
## Female   Male 
##     29     16

table (Programs)

## Programs
##    Account   Business   Sciences Statistics 
##         11          8         13         13

table (Gender, Programs)

##         Programs
## Gender   Account Business Sciences Statistics
##   Female       8        5       10          6
##   Male         3        3        3          7

table (Gender, Car_Ownership)

##         Car_Ownership
## Gender   No Yes
##   Female 16  13
##   Male    7   9

Contingency Table for More than Two Variables by using ftable()

ftable (Gender, Programs, Car_Ownership)

##                   Car_Ownership No Yes
## Gender Programs                       
## Female Account                   5   3
##        Business                  1   4
##        Sciences                  6   4
##        Statistics                4   2
## Male   Account                   1   2
##        Business                  1   2
##        Sciences                  1   2
##        Statistics                4   3
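
To see how table() and ftable() relate, the sketch below builds a three-way table from a small made-up data frame (toy and its values are hypothetical) and then flattens it. Both objects hold the same counts; only the layout differs.

```r
# Hypothetical toy data with the same column names as telco_data
toy <- data.frame(
  Gender        = c("Female", "Female", "Male", "Male", "Female", "Male"),
  Programs      = c("Sciences", "Business", "Sciences", "Business",
                    "Sciences", "Sciences"),
  Car_Ownership = c("No", "Yes", "Yes", "No", "No", "Yes")
)

# Three-way table: a 2 x 2 x 2 array of counts
three_way <- with(toy, table(Gender, Programs, Car_Ownership))
dim(three_way)

# Flattened version: Gender and Programs in rows, Car_Ownership in columns
flat <- ftable(three_way)
flat
dim(flat)

sum(three_way)   # total count equals the number of rows in toy
```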

Categorical Data (Plot)

1. Pie Chart

To plot a pie chart in R, we can use the pie() function. We call table() first so that pie() can plot the counts for each level of the factor.
The most basic pie chart is:

pie (table (Telco_Prefer))

As we can see, by default, R will assign pastel colours to our pie chart. We can customize the pie chart so that it gives us more information rather than a simple pie chart.

a. adding colour preference

pie (table (Telco_Prefer), col = c("Blue", "Yellow", "Red", "Orange"))

Here each telco operator is represented by its own brand colour.

Note that you can also give the col = argument as numbers, e.g.:

pie (table (Telco_Prefer), col = c(1, 2, 3, 4))

and R will then use the colours from its default palette: 1 = black, 2 = red, 3 = green, 4 = blue.

b. adding main title, Telco Preference by Students

pie (table (Telco_Prefer), col = c("Blue", "Yellow", "Red", "Orange"), main = "Telco Preference by Students")

Figure 1: Pie Chart for Telco Operator Preference by Students
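
The numbers accepted by col = index into R's current palette. A quick way to see which colour each number stands for in your own session is to print palette(). The exact shades depend on the R version, so the sketch below only inspects which channel (red, green or blue) dominates each colour; it assumes the palette has not been changed from its default.

```r
# Colours that the numbers 1 to 4 map to in the current session
first_four <- palette()[1:4]
first_four

# Red, green and blue components of each colour (one column per colour)
rgb_values <- col2rgb(first_four)
rgb_values

# Dominant channel of colours 2, 3 and 4
dominant <- rownames(rgb_values)[apply(rgb_values[, 2:4], 2, which.max)]
dominant   # with the default palette: "red" "green" "blue"
```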

2. Bar Chart

To plot a bar chart in R, we can use the barplot() function. Again, we call table() first so that barplot() can plot the counts for each level of the factor.

barplot (table (Programs))

Again, we can add our colour preference to the bar plot.
How many colours can we actually choose from in an R plot? You can visit http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf to get an idea of how many colours are available.

barplot (table (Programs), col = c ("lawngreen", "indian red", "khaki", "midnightblue"))

adjusting the y-axis limit;

barplot (table (Programs), col = c ("lawngreen", "indian red", "khaki", "midnightblue"), 
         ylim = c(0, 20))

labelling the x-axis and y-axis and adding a main title;

barplot (table (Programs), col = c ("lawngreen", "indian red", "khaki", "midnightblue"), 
         ylim = c(0, 20), xlab = "Programs", ylab = "Number of Students", main = "Number of Students by Program")

Figure 2: Bar Chart for the Number of Students by Programs
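
The built-in colour names can also be looked up inside R itself with colors(), which returns every named colour. A small sketch (the search pattern "khaki" is just an illustration):

```r
all_colours <- colors()    # character vector of built-in colour names
length(all_colours)        # how many named colours are available

# Names are listed without spaces, e.g. "indianred"
bar_colours <- c("lawngreen", "indianred", "khaki", "midnightblue")
bar_colours %in% all_colours

# Find every built-in name containing "khaki"
grep("khaki", all_colours, value = TRUE)
```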

3. Cluster Bar Chart

We can combine two variables (Gender and Programs) in one bar chart. By default, barplot() stacks the Gender counts on top of each other within each program:

barplot (table (Gender, Programs), col = c("deeppink", "cyan"), ylab = "Numbers of Students", xlab = "Programs", 
         main = "Numbers of Students by Gender and Programs")

Figure 3 (a): Stacked Bar Chart for the Number of Students by Gender and Programs

Adding the legend = TRUE and beside = TRUE arguments places the bars side by side and adds a legend, which gives the cluster bar chart:

barplot (table (Gender, Programs), col = c("deeppink", "cyan"), ylab = "Numbers of Students", 
         xlab = "Programs", main = "Numbers of Students by Gender and Programs", legend = TRUE, beside = TRUE)

Figure 3 (b): Cluster Bar Chart for the Number of Students by Gender and Programs
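
Whether the bars are stacked or placed side by side is controlled by the beside argument. The sketch below uses a small made-up table of counts (hypothetical values) and draws to a null PDF device so that it runs without a display; drop the pdf() and dev.off() lines to see the plots on screen.

```r
# Hypothetical counts: rows = Gender, columns = Programs
counts <- matrix(c(8, 3, 5, 3), nrow = 2,
                 dimnames = list(Gender   = c("Female", "Male"),
                                 Programs = c("Account", "Business")))

pdf(NULL)  # draw without opening a window or writing a file

# beside = FALSE (the default): one stacked bar per program
mid_stacked <- barplot(counts, col = c("deeppink", "cyan"),
                       legend.text = TRUE)

# beside = TRUE: one bar per gender within each program
mid_grouped <- barplot(counts, col = c("deeppink", "cyan"),
                       legend.text = TRUE, beside = TRUE)
dev.off()

length(mid_stacked)  # one bar position per program
dim(mid_grouped)     # one bar position per gender and program
```

barplot() returns the bar midpoints, which is why the stacked version gives a plain vector and the side-by-side version gives a matrix.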

Numerical Data (Summary)

Here are some important functions to know when dealing with numerical data.
We can get summary statistics for a variable by using the summary() function;

summary (Usage_GB)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   13.70   17.80   18.13   23.20   32.40

or we can get a specific statistic by using a specific function, such as;

a. finding the average/mean value

mean (Usage_GB)

## [1] 18.12889

b. finding the median value

median (Usage_GB)

## [1] 17.8

c. finding the minimum value

min (Usage_GB)

## [1] 5

d. finding the maximum value

max (Usage_GB)

## [1] 32.4

e. finding the range; note that range() returns the minimum and the maximum, not their difference (max - min)

range (Usage_GB)

## [1]  5.0 32.4

f. finding the variance value

var (Usage_GB)

## [1] 42.07528

g. finding the standard deviation value

sd (Usage_GB)

## [1] 6.486546
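
These functions are related to one another. The sketch below uses the six Usage_GB values shown by head() earlier (not the full dataset) to check two of the relationships: the standard deviation is the square root of the variance, and the width of the range is the maximum minus the minimum.

```r
# First six Usage_GB values, as printed by head(telco_data)
x <- c(14.6, 15.7, 14.8, 15.4, 12.9, 22.4)

r <- range(x)       # minimum and maximum as a vector of length two
width <- diff(r)    # max - min as a single number

all.equal(sd(x), sqrt(var(x)))   # TRUE
width == max(x) - min(x)         # TRUE
```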

Adding a description to our statistics by using print (paste0())

print (paste0("On average, the students use their internet quota by ", round (mean (Usage_GB), 2), "GB" ))

## [1] "On average, the students use their internet quota by 18.13GB"

round () rounds a number; in the example above it rounds the mean to 2 decimal places. paste0 () joins its arguments without any separator, so any space that is needed must be written inside the strings.

print (paste0("On maximum usage, the students use their internet quota by ", max (Usage_GB), "GB"))

## [1] "On maximum usage, the students use their internet quota by 32.4GB"
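
paste0() is one of several ways to build such a sentence. The sketch below compares it with paste(), which inserts a space between its arguments by default, and sprintf(), which rounds and formats in one step. The value 18.12889 is the mean reported earlier; avg is a hypothetical variable name.

```r
avg <- 18.12889  # mean of Usage_GB reported earlier

s1 <- paste0("Average usage: ", round(avg, 2), " GB")  # no separator added
s2 <- paste("Average usage:", round(avg, 2), "GB")     # sep = " " by default
s3 <- sprintf("Average usage: %.2f GB", avg)           # format with 2 decimals

s1
s2
s3
```

All three produce the same string, "Average usage: 18.13 GB".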

Numerical Data (Plot)

1. Stem and Leaf Plot

stem (Usage_GB)

## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   0 | 5889
##   1 | 00022334
##   1 | 555566677788999
##   2 | 0002223444
##   2 | 556888
##   3 | 02

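Key: 0 | 8 means 8 GB (the stem is the tens digit, the leaf is the units digit, and values are shown rounded to whole numbers)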

2. Box and Whisker Plot

boxplot (Usage_GB)

a. adding y-axis label and main label

boxplot (Usage_GB, ylab = "Internet Quota Usage in GB", main = "Figure 5")

Figure 5 (a): Box Plot for Usage GB

b. adding xlab = and horizontal = TRUE arguments

boxplot (Usage_GB, xlab = "Internet Quota Usage in GB", main = "Figure 5", horizontal = TRUE)

Figure 5 (b): Box Plot for Usage GB

c. adding par (mfrow = c(1,2)) before plotting, row = 1, column = 2

par (mfrow = c(1, 2))
boxplot (Usage_GB, ylab = "Internet Quota Usage in GB", main = "Figure 5")
boxplot (Usage_GB, xlab = "Internet Quota Usage in GB", main = "Figure 5", horizontal = TRUE)

Figure 5 (c): Box Plot for Usage GB
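
par(mfrow = ...) stays in effect for later plots until it is changed again. One common pattern, sketched below with made-up values, is to save the old settings and restore them afterwards. The null PDF device is only there so that the sketch runs without a display.

```r
x <- c(14.6, 15.7, 14.8, 15.4, 12.9, 22.4, 28.0)  # hypothetical usage values

pdf(NULL)                          # draw without opening a window
old_par <- par(mfrow = c(1, 2))    # par() returns the previous settings

boxplot(x, ylab = "Internet Quota Usage in GB")
boxplot(x, xlab = "Internet Quota Usage in GB", horizontal = TRUE)

par(old_par)                       # back to the previous layout
layout_after <- par("mfrow")
dev.off()

layout_after                       # one row, one column again
```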

3. Histogram with Density Line

hist(Usage_GB, prob=TRUE) #if probability instead of frequency is desired
lines(density(Usage_GB),lwd=4,col="red") #low level function
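
With prob = TRUE the bar heights are densities rather than counts, so the bar areas add up to 1 and the histogram shares a scale with the density() curve. The sketch below checks this on made-up values (x is hypothetical) without drawing anything.

```r
# Hypothetical usage values in GB
x <- c(5, 8, 10, 12, 13, 15, 15, 16, 17, 18, 19, 20, 22, 24, 25, 28, 30, 32)

h <- hist(x, plot = FALSE)     # compute the bins only, no plot

bin_width <- diff(h$breaks)    # width of each bar
sum(h$counts)                  # equals length(x)
sum(h$density * bin_width)     # total area of the bars: 1
```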

4. Scatter Plot with Linear Line (fit using lm)

plot(Hour_Perday, Usage_GB, main = "Hour per Day vs Usage in GB", pch = 19, col = "blue")
fit <- lm(Usage_GB ~ Hour_Perday) # fitting the linear model
abline(fit, col = "red", lwd = 4) # low-level function
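
abline(fit) draws the line from the intercept and slope stored in the fitted model. The sketch below fits the same kind of model to only the six rows shown by head() earlier, so its numbers will differ from the full-data fit, and pulls the two coefficients out with coef(). toy is a hypothetical name.

```r
# First six rows of Hour_Perday and Usage_GB, as printed by head(telco_data)
toy <- data.frame(Hour_Perday = c(3.0, 3.8, 4.0, 3.5, 3.0, 6.0),
                  Usage_GB    = c(14.6, 15.7, 14.8, 15.4, 12.9, 22.4))

fit <- lm(Usage_GB ~ Hour_Perday, data = toy)

coef(fit)   # intercept and slope of the fitted line

# Fitted usage for a student who is online 5 hours per day
predict(fit, data.frame(Hour_Perday = 5))
```

Passing data = toy is an alternative to attach(): lm() looks up the column names inside the data frame.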

5. Lines, Both, Steps, and High Density Plot

par(mfrow=c(2,2))
plot(Hour_Perday,type="l", main="lines")
plot(Hour_Perday,type="b", main="both")
plot(Hour_Perday,type="s", main="steps")
plot(Hour_Perday,type="h", main="high density")

adding lwd (line width) and lty (line type)

par(mfrow=c(2,2))
plot(Hour_Perday,type="l",lty=1,col=1,lwd=3,main="lines")
plot(Hour_Perday,type="b",lty=2,col=2,lwd=3,main="both")
plot(Hour_Perday,type="s",lty=3,col=3,lwd=3,main="steps")
plot(Hour_Perday,type="h",lty=4,col=4,lwd=3,main="high density")

Exercise

Try to replicate all the graphs using your own customization (colours, labels, titles).

Reply in the comments.

edited : 23 June 2020
by: Muhammad Asmui Abdul Rahim
email: asmui@tmsk.uitm.edu.my
created using: rmarkdown