A scatter plot is a graph that shows the values of two variables plotted on the x- and y-axes. This visual representation reveals any correlation between the two variables. Incorporating a grouping variable as a non-positional aesthetic, such as colour, allows other relationships to be revealed if they exist.
For this plot I am using the Students Performance data set, a collection of demographic information about test takers (gender, ethnicity, parental level of education, whether or not they receive a free or reduced lunch, and whether or not they have completed a preparation course) together with their scores on three measures: mathematics, reading and writing.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.6.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.6.3
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
data<-read.csv("StudentsPerformance.csv")
head(data)
I am interested in the relationships between scores on the three measures. My grouping variable for this project is gender. A first attempt at the three scatter plots shows that for reading and writing there is no noticeable difference between males and females, while for mathematics against each of the other two measures distinct relationships can be seen.
In p2 and p3 the mathematics score is on different axes, which makes the two plots hard to compare. The separation of males and females shows that males tend to score higher for mathematics while females tend to score higher for reading and writing. I therefore switch mathematics to the x-axis for both plots and clean up the labels for the variables and the legend.
p1<-ggplot(data=data, aes(x=reading.score, y=writing.score, colour=gender))+
geom_point(alpha=0.5) + labs(x="Reading Score",
y="Writing Score", color="Gender")
p1
p2<-ggplot(data=data, aes(x=reading.score, y=math.score, colour=gender))+
geom_point()+labs(x="Reading Score",
y="Math Score", color="Gender")
p2
p3<-ggplot(data=data, aes(x=math.score, y=writing.score, colour=gender))+
geom_point()+labs(x="Math Score",
y="Writing Score", color="Gender")
p3
Putting all the plots together in one image is the final step.
p4<-ggplot(data=data, aes(x=math.score, y=reading.score, colour=gender))+
geom_point()+labs(x="Math Score",
y="Reading Score", color="Gender")
plot1<-grid.arrange(p4, p3, p1, ncol=2)
plot1
## TableGrob (2 x 2) "arrange": 3 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 3 3 (2-2,1-1) arrange gtable[layout]
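The separation between the groups seen in the scatter plots can also be checked numerically. The following is a minimal base R sketch, assuming a small invented data frame (`toy` and all of its values are hypothetical; only the column names follow the data set). It computes the mean of each score by gender and the within-gender correlation between reading and mathematics.

```r
# Toy data with the same column names as the data set; the values are invented.
toy <- data.frame(
  gender        = c("female", "female", "female", "male", "male", "male"),
  math.score    = c(60, 70, 80, 70, 80, 90),
  reading.score = c(72, 80, 91, 65, 74, 86),
  writing.score = c(74, 82, 92, 62, 72, 84)
)

# Mean of each score by gender
group_means <- aggregate(cbind(math.score, reading.score, writing.score) ~ gender,
                         data = toy, FUN = mean)
print(group_means)

# Correlation between reading and math scores within each gender
group_cor <- sapply(split(toy, toy$gender),
                    function(d) cor(d$reading.score, d$math.score))
print(group_cor)
```

Replacing `toy` with the data frame read from the CSV file should give the corresponding summaries for the real scores.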
Smoothers are a layer that can be added to a scatter plot to summarise the trend in the relationship between two variables and to support predictions from it.
For scatter plots that are grouped by a variable it is possible to apply smoothers to each subset of data. For instance, for the Student Performance data, a scatter plot of students' Reading scores against their Mathematics scores was plotted and a linear smoother was fitted.
# Mathematics on the x-axis and reading on the y-axis, matching the axis labels
p6<-ggplot(data=data, aes(x=math.score, y=reading.score))+
geom_point(size=0.3, colour="purple")+
geom_smooth(method=lm, size=0.3, colour="red")+
labs(x="Math Score",y="Reading Score")
p6
## `geom_smooth()` using formula 'y ~ x'
Linear smoothers were applied to the same data, but this time grouped by the variable Gender. The two plots are given side by side for ease of comparison.
p5<-ggplot(data=data, aes(x=math.score, y=reading.score, colour=gender))+
geom_point(size=0.3)+ geom_smooth(method=lm, size=0.3)+
labs(x="Math Score",y="Reading Score", color="Gender")+
theme(legend.position = "bottom")
grid.arrange(p6,p5,ncol=2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
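To make the difference between the pooled and the grouped smoother concrete, here is a small base R sketch. It assumes an invented data frame (`toy` and its values are hypothetical; only the column names follow the data set) and fits one least-squares line to all observations and one line per gender, which is the kind of fit a linear smoother draws for each colour group.

```r
# Invented values; only the column names follow the data set.
toy <- data.frame(
  gender        = rep(c("female", "male"), each = 4),
  math.score    = c(50, 60, 70, 80, 55, 65, 75, 85),
  reading.score = c(62, 70, 81, 88, 50, 61, 68, 80)
)

# One straight-line fit over all observations (ungrouped smoother)
pooled_fit <- lm(reading.score ~ math.score, data = toy)

# One fit per gender (what a grouped linear smoother draws)
group_fits <- lapply(split(toy, toy$gender),
                     function(d) coef(lm(reading.score ~ math.score, data = d)))

print(coef(pooled_fit))
print(group_fits)
```

Comparing the intercepts and slopes in `group_fits` with those of `pooled_fit` shows how much of the pooled trend is hiding a shift between the groups.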
A histogram plots a continuous variable, divided into groups (bins), against the frequency (sometimes the frequency density) of observations in each bin. Histograms show the shape of the distribution of the data and give a quick check of whether the data are approximately normal, skewed, multimodal, or contain outliers.
The Writing Score variable from the same Student Performance data is used here.
x <- data$writing.score
h <- hist(x, breaks=12, col="skyblue", xlab="Writing Score",
main="Histogram with \nNormal Curve")
xfit <- seq(min(x), max(x), length.out=40) # 40 evenly spaced points across the range of x
yfit <- dnorm(xfit, mean=mean(x), sd=sd(x)) # normal density at each point, using the sample mean and sd
yfit <- yfit*diff(h$mids[1:2])*length(x) # rescale density to counts: bin width times number of observations
lines(xfit, yfit, col="blue", lwd=1.75) # overlay the normal curve on the histogram
box()
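The rescaling step in the code above multiplies the normal density by the bin width and the number of observations so that the curve is on the same scale as the bar heights. The sketch below illustrates this on simulated values (the seed, sample size, mean and standard deviation are arbitrary assumptions, not taken from the data set) by comparing the observed count in each bin with the count expected under the fitted normal curve.

```r
set.seed(1)                               # arbitrary seed so the example is repeatable
x <- rnorm(500, mean = 68, sd = 15)       # simulated scores, not the real data

h <- hist(x, breaks = 12, plot = FALSE)   # bin the data without drawing
bin_width <- diff(h$mids[1:2])            # bins have equal width, so any gap will do

# Expected count in a bin of width w centred at m is roughly n * w * density(m)
expected <- length(x) * bin_width * dnorm(h$mids, mean = mean(x), sd = sd(x))

print(round(cbind(mids = h$mids, observed = h$counts, expected = expected), 1))
```

If the data are close to normal, the observed and expected columns should be similar in every bin, and the expected counts should add up to roughly the sample size.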
A bubble plot is similar to a scatter plot in that two variables are mapped onto the Cartesian plane. A third variable is then mapped onto the size of each bubble. This allows the pairwise relationships between variables to be read off, as well as the relationship between all three variables together.
Bubble plots are best suited to small data sets, where there is not too much overlap between points and the plot remains readable. I will use the same Student Performance data, but sample 15 data points.
df<-data[sample(nrow(data),15),]
# x and y mappings match the axis labels below
p8<-ggplot(df, aes(x=reading.score, y=writing.score, size=math.score)) +
geom_point(colour="orange", alpha=0.5)+ labs(x="Reading Score",
y="Writing Score", size="Mathematics \nScore")
p8
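Because the 15 rows are drawn at random, the bubble plot will differ each time the code is run. One way to make the sample repeatable is to fix the random seed before sampling. The sketch below assumes a stand-in data frame (`toy`) and an arbitrary seed value, both hypothetical.

```r
toy <- data.frame(id = 1:100, score = seq(1, 100))   # stand-in for the real data

set.seed(2024)                                  # arbitrary seed value
first_draw <- toy[sample(nrow(toy), 15), ]

set.seed(2024)                                  # same seed, same rows
second_draw <- toy[sample(nrow(toy), 15), ]

print(identical(first_draw, second_draw))
```

Calling `set.seed()` with any fixed value immediately before the `sample()` call should have the same effect on the real data.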
A P-P plot is a probability plot for assessing how closely two data sets agree; it plots the two cumulative distribution functions against each other. P-P plots are widely used to evaluate the skewness of a distribution.
A Q-Q plot, on the other hand, compares two probability distributions by plotting their quantiles against each other.[1] First, the set of intervals for the quantiles is chosen. A point (x, y) on the plot corresponds to one of the quantiles of the second distribution (y-coordinate) plotted against the same quantile of the first distribution (x-coordinate). Thus the line is a parametric curve whose parameter is the index of the quantile interval.
For the purposes of this mini-project, I am interested in comparing the observed cumulative distribution with the theoretical cumulative distribution function, using the Gaussian distribution as the reference. If there is a good fit, there should be a linear relationship between the two.
The first step is to estimate the parameters of the reference normal distribution from the dataset.
mnp <- mean(data$math.score)
sdp <- sd(data$math.score)
np <- length(data$math.score)
pv <- (1 : np) / np - 0.5 / np
yp <- sort(pnorm(data$math.score,mnp,sdp))
dxyp <- data.frame(pv,yp)
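# Q-Q plot of the observed math scores against standard normal quantiles;
# the red line has intercept mean and slope sd, where a normal sample would fall
p10<-ggplot(data, aes(sample = math.score)) + geom_qq()+
geom_abline(intercept = mnp, slope = sdp,color = "red", size = 1)+
labs(x="Normal Quantiles", y="Observed Quantiles", title="Q-Q Plot")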
# yp holds the theoretical (fitted normal) probabilities, pv the empirical ones
p9<-ggplot(dxyp, aes(x=yp, y=pv)) +
geom_point() +
geom_smooth(method="lm",colour="red",size=1)+
labs(x="Theoretical Cumulative Probability", y="Empirical Cumulative Probability", title="P-P Plot")
grid.arrange(p10,p9, ncol=2)
## `geom_smooth()` using formula 'y ~ x'
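The coordinates behind both plots can be computed by hand, which makes the difference between them explicit. The following is a sketch on simulated values (the seed, sample size, mean and standard deviation are arbitrary assumptions, not the real scores). It uses the same plotting positions (i - 0.5)/n as the code above, which for samples of more than 10 points are also what base R's `ppoints()` returns.

```r
set.seed(42)                          # arbitrary seed for repeatability
x <- rnorm(200, mean = 66, sd = 15)   # simulated scores, not the real data
n <- length(x)

p_emp <- (seq_len(n) - 0.5) / n       # empirical cumulative probabilities
x_sorted <- sort(x)

# P-P plot coordinates: theoretical vs empirical cumulative probability
p_theo <- pnorm(x_sorted, mean = mean(x), sd = sd(x))

# Q-Q plot coordinates: theoretical normal quantiles vs observed quantiles
q_theo <- qnorm(p_emp, mean = mean(x), sd = sd(x))

# With a good fit both sets of points lie close to the line y = x
print(cor(p_theo, p_emp))
print(cor(q_theo, x_sorted))
```

The P-P plot compares probabilities, so both axes run from 0 to 1, while the Q-Q plot compares values on the scale of the data, which tends to make departures in the tails easier to see.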
Dot distribution plots are plots in which each observation of one continuous variable is represented by a dot that is stacked onto other observations in the same bin. They give a visual representation of the distribution of the data in a similar way to histograms. From them one can glean the presence or absence of outliers, the spread of the distribution, whether or not it is multimodal and whether it is skewed.
A categorical variable can be used to group observations, allowing for further comparison within the continuous variable.
The mapping of variables to either the x or y axis controls how the individual dot plots appear. In addition, the stacking direction (stackdir) can be set so that the dots stack in one direction like a histogram or bar chart (stackdir="up" or "down"), or centred (stackdir="center") so that the dot plot fits better into a boxplot and/or violin plot.
p11<-ggplot(data, aes(x=parental.level.of.education,y=reading.score, colour=parental.level.of.education))+
geom_dotplot(binaxis='y', stackdir='down', dotsize=0.4,fill="skyblue") +
labs(x="Parental Level of Education", y="Reading Score",
title="Reading Score Dot Plot by Parental Level of Education")
p11
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
The ordinal variable “Parental Level of Education” was read in as a factor, so its levels are presented in alphabetical order. I change the order by creating a new variable, par.level.of.education, in the data frame and imposing the order given in the c() call.
The axis labels also need attention to make the plot more readable. The legend is redundant, since the grouping by Parental Level of Education is already shown on the x-axis.
#The Final Plot
data$par.level.of.education<-factor(data$parental.level.of.education,c("some high school","high school","some college","associate's degree","bachelor's degree","master's degree"))
p12<-ggplot(data, aes(x=par.level.of.education,y=reading.score, colour=par.level.of.education))+
geom_dotplot(binaxis='y', stackdir='down', dotsize=0.4,fill="skyblue") +
labs(x="Parental Level of Education", y="Reading Score",
title="Reading Score Dot Plot by Parental Level of Education")+
theme(legend.position="none",
axis.text.x = element_text(angle=55, vjust=1, hjust=1))
p12
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
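The effect of reordering the factor levels can be seen on a small vector. This is a minimal sketch with made-up values (the `edu` vector is hypothetical and uses only three of the education levels). It shows the default alphabetical ordering, the ordering imposed through the `levels` argument, and what happens when a value is left out of `levels`.

```r
edu <- c("some college", "high school", "master's degree", "high school")

# Default factor levels are sorted alphabetically
alphabetical <- factor(edu)
print(levels(alphabetical))

# Supplying levels explicitly imposes the intended order
ordered_levels <- c("high school", "some college", "master's degree")
reordered <- factor(edu, levels = ordered_levels)
print(levels(reordered))

# Any value missing from `levels` becomes NA, so check for typos
print(anyNA(factor(edu, levels = c("high school", "some college"))))
```

Since a misspelled level silently turns observations into NA, it is worth checking the new variable with `anyNA()` or `table()` before plotting.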