Data visualization is one of the most important parts of data analysis. I didn’t realize this at first, but it’s becoming increasingly true for me as I mature in the field.
Most of us are familiar with plots and charts in Excel … and these are usually fine for everyday purposes. However, for reproduceable and finely-tuned data visualization, nothing beats the GGPLOT package in R.
# load tidyverse
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# the exact name of the package is actually "ggplot2"
library(ggplot2)
We’ve used this a few times during the course so far, but let’s get really involved into why this is useful, and also do some in-class practice.
#please install "ggrepel" if you haven't already: install.packages("ggrepel")
library(ggrepel)
geom_point - continuous numerical variables, ideal for understanding correlations or distributions between 2 - 5 variables (at most)
geom_line - good for time series or any kind of sequental data. Trends over time.
geom_bar - counts categorical variables or frequency counts
geom_histogram - similar to geom_bar, but only has one variable and is used to show distributions
geom_boxplot - comparing means across groups, continuous data only
ggplot(mtcars,aes(x=wt,y=mpg)) + geom_point()
Here’s our old friend, mtcars. By now this data makes perfect sense to you: as WT increases, mpg goes down. Let’s look at how we pull this from mtcars using ggplot:
First, we call ggplot on mtcars:
ggplot(mtcars)
Then, we add “aes” or “aesthetics” – we are specifying what the x and y are supposed to be
ggplot(mtcars,aes(x=wt,y=mpg))
If you try just running this, it won’t work:
ggplot(mtcars,aes(x=wt,y=mpg))
It’s missing the “geom” or “geometry” – the actual shape of the chart. This is added with a “plus” symbol:
ggplot(mtcars,aes(x=wt,y=mpg)) + geom_point()
Like this:
ggplot(mtcars,aes(x=wt,y=mpg)) + geom_point()
The really cool thing about ggplot is that you can add more than just “x” and “y” to aesthetics. In reality, you can add almost as many dimensions as you want. In practice, this is often a bad idea, as the human mind has trouble thinking in more than three to five dimensions or so.
ggplot(mtcars,aes(x=wt,y=mpg,color=as.factor(am))) + geom_point()
We’ve added a third dimension to the plot = transmission (AM). We had to call “as.factor()” on AM otherwise R tries to interpret the 0 or 1 as a continuous number.
Because we’ve added a third dimension (wt,mpg,am) we can now understand some of the relationship between these variables. We see that as WT goes up, MPG goes down, but also manual transmission (AM=1) clusters towards higher mpg and lower weight, while higher weight, lower mpg cars tend to be automatic transmissions. ? create new data frame<-district_data %>% mutate(Existing_column_name=ifelse(am==0,“manual”,“automatic”)) ^this will create a new column using the existing data
ggplot(mtcars,aes(x=wt,y=mpg,color=as.factor(am),shape=as.factor(vs))) + geom_point()
We’ve now added a fourth dimension (shape = VS (engine shape)) but the chart has become more difficult to interpret. We are now trying to balance four interrelationships in our heads and it’s quite difficult. In general, it’s better to have more plots than to try and cram all the dimensions or details into a single plot. In my personal opinion, the ideal number of dimensions for most plots is exactly 2 = x and y.
ggplot(mtcars,aes(x=wt,y=mpg)) + geom_point() + facet_wrap(vars(am))
Here, we’ve used the command “facet_wrap(vars())” to split the plot into two: the first is AM=0 (automatic), the second is AM=1 (manual). Now the relationship between the three variables (wt,mpg,am) is arguably clearer than by assigning a color or shape to AM.
To really make your plots useable, you need to add written information. As a rule, the best type of plot is one that can be understood even if you pulled it out of the paper and printed it on it’s own.
You can add titles and subtitles to ggplot with the “labs()” command:
ggplot(mtcars,aes(x=wt,y=mpg)) + geom_point() + labs(title="Weight vs. MPG",subtitle = "From the MTCARS Dataset - R (2024)")
Even more clarity can be added by labeling the points: all of the below need to be done in order for it to work properly
ggplot(mtcars,aes(x=wt,y=mpg)) + geom_point() + labs(title="Weight vs. MPG",subtitle = "From the MTCARS Dataset - R (2024)") + geom_text(label=rownames(mtcars))
Oh no, this is much worse, actually.
Luckily, we can clean this up a bit. For example, what if we wanted to just point out certain outliers?
car_table<-mtcars %>% rownames_to_column(var="car_names")
ggplot(car_table,aes(x=wt,y=mpg)) + geom_point() + labs(title="Weight vs. MPG",subtitle = "From the MTCARS Dataset - R (2024)") + geom_label_repel(data = car_table %>% filter(mpg < 14),aes(label = car_names))
In general, scatterplots or “dot-plots” such as this should be used to visualize correlations (relationships) between x and y (or sometimes z).
For different types of data, different plots should be used. Distributions, for example, should be visualized using a histogram: can also call witht the hist command
ggplot(mtcars,aes(x=mpg)) + geom_histogram(bins = 4) + labs(title="MPG Distribution",subtitle="From the MTCARS Dataset - R (2024)")
The thing to remember about histograms is that they don’t have an “x and
y” – they only have an “x”, the variable that goes across the bottom.
The variable going “up and down” is always the number of “x”.
For histograms, the “bins” argument is very important – as we have discussed before, it basically decides how many “towers” will exist in the histogram. Notice what happens when we select “bins=2”:
ggplot(mtcars,aes(x=mpg)) + geom_histogram(bins = 2) + labs(title="MPG Distribution - only two bins",subtitle="From the MTCARS Dataset - R (2024)")
So how many bins should you use? Is there a rule of thumb?
There are actually several, but I’m going to give you the simplest.
sqrt(nrow(car_table))
## [1] 5.656854
So, we can specify 6 bins for this data:
ggplot(car_table,aes(x=mpg)) + geom_histogram(bins=6)
This looks a lot smoother than before. It also captures the distribution very well.
There is a problem, though. If you have a large number of observations, you can get huge numbers of bins very quickly.
Diamonds, for example, has 53,940 observations, the square root of which is ~233:
ggplot(diamonds,aes(x=price)) + geom_histogram(bins=233)
It’s not bad per se, but you can see how it gets spikey and granular the more bins you have.
Again, this is just a “rule of thumb”, the real standard is “does this make the distribution look better or worse?”. More detail isn’t always better:
ggplot(diamonds,aes(x=price)) + geom_histogram(bins=100)
With our arbitrarily chosen 100 bins, the data looks a bit smoother.
There is a slightly more complicated method, known as Sturge’s Method:
bins = log2(nrow(a))+1 #where “a” is your dataset
log2(nrow(diamonds))+1
## [1] 16.71907
So, 17 bins for diamonds.
ggplot(diamonds,aes(x=price)) + geom_histogram(bins=17)
Incidentally, the Sturges Method is actually the default method for R’s “hist()” command:
hist(diamonds$price)
Which one to use? Whichever looks best, your choice.
Line plots are good when comparing different variables on the same scale:
ggplot(car_table,aes(x=mpg,y=hp,color = as.factor(cyl))) + geom_line(linewidth=2) + labs(title="MPG vs. HP by CYL",subtitle="From the MTCARS Dataset - R (2024)")
Here we see the relationship between mpg, hp and cyl much more clearly (in my opinion) than in the scatterplot. In a way, the geom_line() plot just “connects the dots” from the scatterplot, although this can sometimes be easier on the eyes.
A really good use of geom_line() (or line plots in general) is time series data: that is, seeing how a variable changes over time.
Here is a new dataset, called “ChickWeight” - it measures the weight of baby chickens over time depending on the type of feed they are given. For the plot to make sense, however, we need to find the average weight by feed over time, rather than the weight for each individual chicken.
chick<-ChickWeight
chick_avg<-chick %>% group_by(Diet,Time) %>% summarize(mean_wt=mean(weight))
## `summarise()` has grouped output by 'Diet'. You can override using the
## `.groups` argument.
ggplot(chick_avg,aes(x=Time,y=mean_wt,color=as.factor(Diet))) + geom_line(linewidth=1) + labs(title="Chick Weight by Feed Type",subtitle="From the ChickWeight dataset - R (2024)")
#in time series data, time is always going to be x and go across the
bottom
As we’ve discussed before, it’s also possible to visualize groups using boxplots:
ggplot(chick,aes(x=Diet,y=weight)) + geom_boxplot() + labs(title="Chick Weight by Feed Type (Boxplot)",subtitle="From the ChickWeight dataset - R (2024)")
This is useful when comparing groups but not as useful for time series. As we saw with the line graph, the time element is very crucial when discussing chicken growth vs. diet. If you look at the boxplot without the time-series line graph, you might think that the feed is all the same (but it’s not).
Can we run a linear model on time series data? Well, let’s check out our assumptions:
hist(chick$weight)
“Weight” - our response variable, is highly non-normal.
When we try to fit a linear model to explain “Weight”, we get some crazy looking residuals:
chick_lm<-lm(weight~Time+Diet,data=chick)
plot(chick_lm)
None of this is very good. We can try altering “weight” via the log transformation, or we can run a “GLM” or “generalized linear model”. Each GLM belongs to a “family” (there are several). We will be using the “Gamma” family for this GLM. “Gamma” is a non-parametric version of the linear model: you can use it with skewed distributions like “weight”, as long as the dependent variable is positive and continuous. Also, we add a “log link” to the Gamma family to try and linearize the results:
chick_glm<-glm(weight~Time+Diet,data=chick,family = Gamma(link=log))
summary(chick_glm)
##
## Call:
## glm(formula = weight ~ Time + Diet, family = Gamma(link = log),
## data = chick)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.68330 0.02015 182.801 < 2e-16 ***
## Time 0.07991 0.00133 60.092 < 2e-16 ***
## Diet2 0.12122 0.02450 4.949 9.84e-07 ***
## Diet3 0.23526 0.02450 9.604 < 2e-16 ***
## Diet4 0.22651 0.02463 9.197 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.0465709)
##
## Null deviance: 193.022 on 577 degrees of freedom
## Residual deviance: 28.338 on 573 degrees of freedom
## AIC: 5272.8
##
## Number of Fisher Scoring iterations: 5
no p value for the overall model b/c it is non-parametric. diet 1 is missing because it was used as the reference/baseline for the rest of the model. the time is significant and overtime the wt will increase regardless of the diet used. We see that all aspects of the formula are significant: including time. In general, as time progresses (“goes up”), weight will go up as well.
Degrees of freedom refer to the maximum number of logically independent values which may vary in a data sample. Degrees of freedom are calculated by subtracting one from the number of items within the data sample.
#Because this is non-parametric, there is no p-value for the whole formula, however we can estimate our own (rough) p-value: if the Null Deviance minus the Residual Deviance is greater than the Null deviance DF minus the Residual deviance DF, it’s probably significant:
193 - 28 = 165 577 - 573 = 4
165 > 4
#Model is probably significant!
All different diets cause weight to go up, but Diet 3 seems to do a little bit better than the others (based on the values in the estimate column), even when time is taken into account.
Interestingly, this is almost exactly what we see in the ggplot for this data:
chick_avg<-chick %>% group_by(Diet,Time) %>% summarize(mean_wt=mean(weight))
## `summarise()` has grouped output by 'Diet'. You can override using the
## `.groups` argument.
ggplot(chick_avg,aes(x=Time,y=mean_wt,color=as.factor(Diet))) + geom_line(linewidth=1) + labs(title="Chick Weight by Feed Type",subtitle="From the ChickWeight dataset - R (2024)")
The really cool thing about the log-link in GLM’s, is the ability to
turn the “estimates” into odds:
coef(chick_glm)
## (Intercept) Time Diet2 Diet3 Diet4
## 3.68329864 0.07991416 0.12122485 0.23526066 0.22650487
From the regular estimates, all we can tell is that Diet3 is better at chicken growth than Diet 2, Diet 1 (hidden) and Diet 4.
However, if we exponentiate the estimates, we get this:
exp(coef(chick_glm))
## (Intercept) Time Diet2 Diet3 Diet4
## 39.777389 1.083194 1.128879 1.265239 1.254209
The “1.26” under Diet3 means that Diet3 is 26% better than Diet 1. Why Diet 1? Because it’s the hidden “reference” variable. (Technically it lives in the intercept). This is great because instead of just getting estimates, we are now getting odds for how much a variable affects the outcome: Diet3 increases chicken weight by 26.5% more than Diet1, even when accounting for Time. interpretting the time variable: regardless of diet, weight goes up 8% per day
How would you interpret the Time odds?
Consequently, we have finally arrived at the point where you can run non-parametric statistical models for your extremely skewed data!
We can run AIC against the two models we’ve made so far to see which is more efficient:
AIC(chick_lm,chick_glm)
## df AIC
## chick_lm 6 5789.607
## chick_glm 6 5272.827
“chick_glm” is considerably lower than “chick_lm”, which means the non-parametric Gamma GLM is likely a better model for the data than a simple linear model. AIC is mathmatical equation to determine which model is better. the lower the AIC number the better in this case. essentially numerical representation of how accurately the dots are following along the predicted line in the models.
air<-airquality
ggplot(air,aes(x=Day,y=Temp)) + geom_line(linewidth=1)
ggplot(air,aes(x=Month,y=Temp)) + geom_line(linewidth=1)
ggplot(air,aes(x=Month,y=Temp)) + geom_boxplot(linewidth=1)
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
this doesnt work because the month numbers mess it up. add
“as.factor”
ggplot(air,aes(x=as.factor(Month),y=Temp)) + geom_boxplot(linewidth=1)
ggplot(air,aes(x=as.factor(Month),y=Ozone)) + geom_boxplot(linewidth=1)
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
ggplot(air,aes(x=Temp)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.