To plot various charts, let’s first load and attach the *“Lung Capacity Dataset”.
LungCapData = read.table(file = "../Dataset/LungCapData.txt", header = T, sep = "\t")
attach(LungCapData)
head(LungCapData)
## LungCap Age Height Smoke Gender Caesarean
## 1 6.475 6 62.1 no male no
## 2 10.125 18 74.7 yes female no
## 3 9.550 16 69.7 no female yes
## 4 11.125 14 71.0 no male no
## 5 4.800 5 56.9 no male no
## 6 6.225 11 58.7 no female no
The above table gives us a brief look over the data and class of each variable.
Let’s graphically analyse the variables present in the dataset :
Barcharts are appropriate for summarizing the distribution of a categorical variable.
To get help on barplot in R :
help("barplot") #or,
?barplot
A barplot is a visual display of the frequency of each category/element of a categorical variable or, the relative frequency of each category/element.
Let’s first produce a proportion table of “Gender” to form a barplot :
Proportion = table(Gender)/length(Gender)
Proportion
## Gender
## female male
## 0.4937931 0.5062069
Now, let’s plot a barchart for the proportion table :
barplot(Proportion,
main = "Bar Plot",
xlab = "Gender",
ylab = "Proportion",
las = 1,
names.arg = c("Female", "Male"),
)
To get a horizotal barplot, we have to pass horiz = TRUE argument inside barplot() function.
barplot(Proportion,
main = "Bar Plot",
xlab = "Proportion",
ylab = "Gender",
las = 1,
names.arg = c("Female", "Male"),
horiz = TRUE
)
Piecharts are appropriate for summarizing the distribution of a categorical variable but, its intuitive only when we have fewer categories/elements (3 or, less)
To get help on piechart in R :
help("pie") #or,
?pie
A pie chart is a also visual display of the frequency of each category/element of a categorical variable or, the relative frequency of each category/element.
Now, let’s plot a pie chart for the “Proportion” table :
pie(Proportion,
radius = 0.7,
main = "Pie Chart",
border = "Black",
col = c("white", "darkslategrey")
)
We can also put a border around the pie chart, as follows :
pie(Proportion,
radius = 0.7,
main = "Pie Chart",
border = "Black",
col = c("white", "darkslategrey")
)
box()
A boxplot is appropriate for summarizing the distribution of a numeric variable.
To get help on boxplot in R :
help("boxplot") #or,
?boxplot
A boxplot is the visual display of the 5 level summary of a numeric variable.
We can visually deduce the following metrices from a single boxplot :
Let’s produce a boxplot for “LungCap” :
boxplot(LungCap,
main = "Box Plot",
ylab = "Lung Capacity",
ylim = c(0,16),
las = 1,
col = "darkslategrey",
border = "black"
)
We can get the numeric values of the boxplot summary, as follows :
quantile(LungCap, probs = c(0, 0.25, 0.5, 0.75, 1))
## 0% 25% 50% 75% 100%
## 0.507 6.150 8.000 9.800 14.675
We can also compare the distribution of a numerical variable for different elements/categories of a categorical variable, using box plots.
Let’s say, we want to see the distribution of “LungCap” of “Male” and “Females” :
boxplot(LungCap ~ Gender,
main = "Lung Capacity By Gender",
ylab = "Lung Capacity",
xlab = NULL,
names = c("Female", "Male"),
ylim = c(0,16),
las = 1,
col = c("darkslategrey","burlywood4"),
border = "black"
)
If we want to see the boxplots horizontally, then, we can do so by passing horizontal = TRUE inside boxplot() function :
boxplot(LungCap ~ Gender,
main = "Lung Capacity By Gender",
xlab = "Lung Capacity",
ylab = NULL,
names = c("Female", "Male"),
ylim = c(0,16),
las = 1,
col = c("darkslategrey","burlywood4"),
border = "black",
horizontal = TRUE
)
If we want see the “LungCap” of only “Males” then :
boxplot(LungCap[Gender == "male"],
main = "Lung Capacity of Males",
ylab = "Lung Capacity",
xlab = NULL,
names = "Male",
ylim = c(0,16),
las = 1,
col = "burlywood4",
border = "black"
)
Startified boxplots are useful for examining the relationship between a categorical and a numeric variable, within strata/groups defined by the variable.
Let’s create a “AgeGroup” variable from the categorical variable “Age” :
AgeGroup = cut(Age,
breaks = c(0,13,15,17,25),
labels = c("<13", "14-15", "16-17","18+")
)
levels(AgeGroup)
## [1] "<13" "14-15" "16-17" "18+"
Now, let’s check whether the “AgeGroup” variable is working properly or, not by taking the first 5 elements from the “Age” category.
Age[0:5]
## [1] 6 18 16 14 5
AgeGroup[0:5]
## [1] <13 18+ 16-17 14-15 <13
## Levels: <13 14-15 16-17 18+
So, we can see that our grouping works properly.
Now, let’s compare the lung capacity of smokers and non-smokers :
boxplot(LungCap ~ Smoke,
main = "Lung Capacity By Smoking Habits",
ylab = "Lung Capacity",
xlab = NULL,
names = c("Non-Smokers", "Smokers"),
ylim = c(0,16),
las = 1,
col = c("blue","red"),
border = "black"
)
We can see the that “Non-Smokers” have lower lung capacity than “Smokers” which seems incorrect and not taking the age factor into consideration might be the cause of this output.
Let’s compare the same output only for 18+ age group :
boxplot(LungCap[Age >= 18] ~ Smoke[Age >= 18],
main = "Lung Capacity By Smoking Habits (18+ Age Group)",
ylab = "Lung Capacity",
xlab = NULL,
names = c("Non-Smokers", "Smokers"),
ylim = c(7,15),
las = 1,
col = c("blue","red"),
border = "black"
)
Now, we got a more practical result, i.e.,among the 18+ age group, the lung capacity of non-smokers is higher than that of smokers
Now, let’s jump one step further and visualize the relationship between the lung capacity and smoking habits within each of the age strata.
boxplot(LungCap ~ Smoke * AgeGroup,
main = "Lung Capacity By Smoking Habits Within Age Starta",
ylab = "Lung Capacity",
xlab = NULL,
axes = FALSE,
las = 2,
col = c("blue","red"),
border = "black"
)
box()
axis(2, at = seq(0,20,2), seq(0,20,2), las = 1)
axis(1, at = c(1.5, 3.5, 5.5, 7.5), labels = c("<13", "14-15", "16-17","18+"))
legend(x = 7.0, y = 2.8,
legend = c("Non-Smokers", "Smokers"),
col =c("blue","red"),
pch = 15, cex = 0.8)
A histogram is appropriate for summarizing the distribution of a numeric variable.
To get help on histogram in R :
help("hist") #or,
?hist
Let’s plot a histogram for “LungCap” :
hist(LungCap,
xlab = "Lung Capacity",
col = "seagreen",
border = "white",
las = 1
)
If we want “Probability” instead of “Frequency” in the Y-axis then, we can set freq = FALSE or, Prob = TRUE, as follows :