"Even when we controlled for x and y, we still saw this effect" – this is what the interpretation of a linear model sounds like.
Lit review – explain why you are researching what you are researching, why you are using a certain method, and why you chose your variables.
Include descriptive statistics in the paper; plots are not required, but they are appreciated.
Data visualization is one of the most important parts of data analysis. I didn’t realize this at first, but it’s becoming increasingly true for me as I mature in the field.
Most of us are familiar with plots and charts in Excel … and these are usually fine for everyday purposes. However, for reproducible and finely tuned data visualization, nothing beats the GGPLOT package in R.
# load tidyverse
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# the exact name of the package is actually "ggplot2"
library(ggplot2)
We’ve used this a few times during the course so far, but let’s dig into why it is so useful, and also do some in-class practice.
#please install "ggrepel" if you haven't already: install.packages("ggrepel")
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.5.2
geom_point - continuous numerical variables, ideal for understanding correlations or relationships between two and (at most) five variables
geom_line - good for time series or any kind of sequential data. Trends over time.
geom_bar - counts of categorical variables or frequency counts (a stacked bar chart could be useful for showing different scores across different categories of nursing homes); a quick sketch follows this list
geom_histogram - similar to geom_bar, but only has one variable and is used to show distributions
geom_boxplot - comparing distributions (medians and spread) across groups, continuous data only
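geom_bar is the only one of these we won’t demonstrate later, so here is a minimal sketch (my own, not from the notes) that counts the cars in mtcars by cylinder category:

# count of cars per cylinder category in mtcars
ggplot(mtcars, aes(x = as.factor(cyl))) + geom_bar() + labs(x = "Cylinders", y = "Count")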
ggplot(mtcars,aes(x=wt,y=mpg)) + geom_point()
Here’s our old friend, mtcars. By now this data makes perfect sense to you: as WT increases, mpg goes down. Let’s look at how we pull this from mtcars using ggplot:
First, we call ggplot on mtcars:
ggplot(mtcars)
Then, we add “aes” or “aesthetics” – we are specifying what the x and y are supposed to be
ggplot(mtcars,aes(x=wt,y=mpg))
If you try running just this, you won’t get any points, only a blank plotting area:
ggplot(mtcars,aes(x=wt,y=mpg))
It’s missing the “geom” or “geometry” – the actual shape of the chart. This is added with a “plus” symbol:
ggplot(mtcars,aes(x=wt,y=mpg)) + geom_point()
Like this:
ggplot(mtcars,aes(x=wt,y=mpg)) + geom_point()
The really cool thing about ggplot is that you can add more than just “x” and “y” to aesthetics. In reality, you can add almost as many dimensions as you want. In practice, this is often a bad idea, as the human mind has trouble thinking in more than three to five dimensions or so.
ggplot(mtcars,aes(x=wt,y=mpg,color=as.factor(am))) + geom_point()
We’ve added a third dimension to the plot: transmission (am). We had to call “as.factor()” on am; otherwise R interprets the 0 or 1 as a continuous number and tries to create a color gradient.
Because we’ve added a third dimension (wt, mpg, am), we can now understand some of the relationships between these variables. We see that as wt goes up, mpg goes down, but also that manual-transmission cars (am = 1) cluster towards higher mpg and lower weight, while heavier, lower-mpg cars tend to have automatic transmissions.
ggplot(mtcars,aes(x=wt,y=mpg,color=as.factor(am),shape=as.factor(vs))) + geom_point()
We’ve now added a fourth dimension (shape = vs, engine shape), but the chart has become more difficult to interpret. We are now trying to balance four interrelationships in our heads, and it’s quite difficult. In general, it’s better to have more plots than to try and cram all the dimensions or details into a single plot. In my personal opinion, the ideal number of dimensions for most plots is exactly two: x and y. Three is acceptable if you really need to show a relationship.
You can create two plots easily with facet_wrap:
ggplot(mtcars,aes(x=wt,y=mpg)) + geom_point() + facet_wrap(vars(am))
Here, we’ve used the command “facet_wrap(vars())” to split the plot into two: the first is AM=0 (automatic), the second is AM=1 (manual). Now the relationship between the three variables (wt,mpg,am) is arguably clearer than by assigning a color or shape to AM.
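As a small sketch of my own (not from the notes), you can also give the facets readable titles instead of 0/1 by passing a labeller:

# label the am facets "Automatic" and "Manual" instead of 0 and 1
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() + facet_wrap(vars(am), labeller = labeller(am = c("0" = "Automatic", "1" = "Manual")))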
To really make your plots usable, you need to add written information. As a rule, the best type of plot is one that can be understood even if you pulled it out of the paper and printed it on its own.
You can add titles and subtitles to ggplot with the “labs()” command:
ggplot(mtcars,aes(x=wt,y=mpg)) + geom_point() + labs(title="Weight vs. MPG",subtitle = "From the MTCARS Dataset - R (2024)")
Even more clarity can be added by labeling the points:
ggplot(mtcars,aes(x=wt,y=mpg)) + geom_point() + labs(title="Weight vs. MPG",subtitle = "From the MTCARS Dataset - R (2024)") + geom_text(label=rownames(mtcars))
Oh no, this is much worse, actually.
Luckily, we can clean this up a bit. For example, what if we wanted to just point out certain outliers?
car_table<-mtcars %>% rownames_to_column(var="car_names")
ggplot(car_table,aes(x=wt,y=mpg)) + geom_point() + labs(title="Weight vs. MPG",subtitle = "From the MTCARS Dataset - R (2024)") + geom_label_repel(data = car_table %>% filter(mpg < 14),aes(label = car_names))
In general, scatterplots or “dot-plots” such as this should be used to visualize correlations (relationships) between x and y (or sometimes z).
For different types of data, different plots should be used. Distributions, for example, should be visualized using a histogram:
ggplot(mtcars,aes(x=mpg)) + geom_histogram(bins = 4) + labs(title="MPG Distribution",subtitle="From the MTCARS Dataset - R (2024)")
The thing to remember about histograms is that they don’t have an “x and y” – they only have an “x”, the variable that goes across the bottom. The variable going “up and down” is always the count of observations falling into each bin of “x”.
For histograms, the “bins” argument is very important – as we have discussed before, it decides how many bars (“towers”) the histogram will have. Notice what happens when we select “bins=2”:
ggplot(mtcars,aes(x=mpg)) + geom_histogram(bins = 2) + labs(title="MPG Distribution - only two bins",subtitle="From the MTCARS Dataset - R (2024)")
So how many bins should you use? Is there a rule of thumb?
There are actually several, but I’m going to give you the simplest: use roughly the square root of the number of observations.
sqrt(nrow(car_table))
## [1] 5.656854
Rounding up, we can specify 6 bins for this data:
ggplot(car_table,aes(x=mpg)) + geom_histogram(bins=6)
This looks a lot smoother than before. It also captures the distribution very well.
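If you’d rather not round by hand, here is a quick sketch (my own) that computes the bin count directly from the rule:

# round the square root up to a whole number of bins
n_bins <- ceiling(sqrt(nrow(car_table)))  # 32 rows -> 6 bins
ggplot(car_table, aes(x = mpg)) + geom_histogram(bins = n_bins)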
There is a problem, though. If you have a large number of observations, you can get huge numbers of bins very quickly.
Diamonds, for example, has 53,940 observations, the square root of which is about 232 (233 after rounding up):
ggplot(diamonds,aes(x=price)) + geom_histogram(bins=233)
It’s not bad per se, but you can see how it gets spiky and granular the more bins you have.
Again, this is just a rule of thumb; the real standard is “does this make the distribution look better or worse?” More detail isn’t always better:
ggplot(diamonds,aes(x=price)) + geom_histogram(bins=100)
With our arbitrarily chosen 100 bins, the data looks a bit smoother.
There is a slightly more complicated method, known as Sturges’ rule:
bins = log2(nrow(a)) + 1  # where "a" is your dataset
log2(nrow(diamonds))+1
## [1] 16.71907
So, 17 bins for diamonds.
ggplot(diamonds,aes(x=price)) + geom_histogram(bins=17)
Incidentally, Sturges’ rule is actually the default binning method for R’s “hist()” command:
hist(diamonds$price)
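As a related convenience (a sketch of my own, not from the notes), base R also exposes this rule directly as nclass.Sturges(), if you just want the number of bins:

# Sturges' rule as implemented in base R (grDevices)
nclass.Sturges(diamonds$price)  # ceiling(log2(53940) + 1) = 17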
Which one to use? Whichever looks best, your choice.
Line plots are good when comparing different variables on the same scale:
ggplot(car_table,aes(x=mpg,y=hp,color = as.factor(cyl))) + geom_line(linewidth=2) + labs(title="MPG vs. HP by CYL",subtitle="From the MTCARS Dataset - R (2024)")
Even though mtcars is not a time-series dataset, this is a useful illustration of the relationship between hp, mpg, and cyl.
Here we see the relationship between mpg, hp and cyl much more clearly (in my opinion) than in the scatterplot. In a way, the geom_line() plot just “connects the dots” from the scatterplot, although this can sometimes be easier on the eyes.
A really good use of geom_line() (or line plots in general) is time series data: that is, seeing how a variable changes over time.
Here is a new dataset, called “ChickWeight” - it measures the weight of baby chickens over time depending on the type of feed they are given. For the plot to make sense, however, we need to find the average weight by feed over time, rather than the weight for each individual chicken.
chick<-ChickWeight
chick_avg<-chick %>% group_by(Diet,Time) %>% summarize(mean_wt=mean(weight))
## `summarise()` has grouped output by 'Diet'. You can override using the
## `.groups` argument.
#this summarize() call reduced 578 observations to 48 (4 diets x 12 time points)
ggplot(chick_avg,aes(x=Time,y=mean_wt,color=as.factor(Diet))) + geom_line(linewidth=1) + labs(title="Chick Weight by Feed Type",subtitle="From the ChickWeight dataset - R (2024)")
If you do have a time series, you need to include time as a variable in your linear model, and that can be difficult to interpret.
As we’ve discussed before, it’s also possible to visualize groups using boxplots:
ggplot(chick,aes(x=Diet,y=weight)) + geom_boxplot() + labs(title="Chick Weight by Feed Type (Boxplot)",subtitle="From the ChickWeight dataset - R (2024)")
This is useful when comparing groups but not as useful for time series. As we saw with the line graph, the time element is crucial when discussing chick growth vs. diet. If you look at the boxplot without the time-series line graph, you might conclude that the feeds all perform about the same (they don’t).
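One way to keep the group comparison but put the time element back in is to draw the boxplots at each measurement time; a sketch of my own (not from the lecture):

# boxplots of weight for each diet at each measurement time
ggplot(chick, aes(x = as.factor(Time), y = weight, fill = Diet)) + geom_boxplot() + labs(title = "Chick Weight by Diet and Time", x = "Time (days)", y = "Weight")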
Can we run a linear model on time series data? Well, let’s check out our assumptions:
hist(chick$weight)
“Weight”, our response variable, is highly non-normal.
When we try to fit a linear model to explain “Weight”, we get some crazy looking residuals:
chick_lm<-lm(weight~Time+Diet,data=chick)
plot(chick_lm)
None of this is very good. We can try transforming “weight” with a log transformation, or we can run a “GLM”, or “generalized linear model”. Each GLM belongs to a “family” (there are several). We will be using the “Gamma” family for this GLM. The Gamma family relaxes the normality assumption of the ordinary linear model: you can use it with skewed outcomes like “weight”, as long as the dependent variable is positive and continuous. We also add a “log link” to the Gamma family, so the predictors act linearly on the log of the expected value:
chick_glm<-glm(weight~Time+Diet,data=chick,family = Gamma(link=log))
summary(chick_glm)
##
## Call:
## glm(formula = weight ~ Time + Diet, family = Gamma(link = log),
## data = chick)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.68330 0.02015 182.801 < 2e-16 ***
## Time 0.07991 0.00133 60.092 < 2e-16 ***
## Diet2 0.12122 0.02450 4.949 9.84e-07 ***
## Diet3 0.23526 0.02450 9.604 < 2e-16 ***
## Diet4 0.22651 0.02463 9.197 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.0465709)
##
## Null deviance: 193.022 on 577 degrees of freedom
## Residual deviance: 28.338 on 573 degrees of freedom
## AIC: 5272.8
##
## Number of Fisher Scoring iterations: 5
We see that every term in the formula is significant, including Time. In general, as time progresses (“goes up”), weight goes up as well.
R always folds one level of a factor into the intercept (this is true for lm() as well) – in this model, it is Diet1, which becomes the reference. All of the other diets are better than Diet1; if they were not, their estimates would be negative.
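If you wanted a different diet to serve as the reference, you can re-level the factor before fitting. A quick sketch (my own, not from the notes):

# make Diet 3 the reference level instead of Diet 1, then refit
chick_relevel <- chick
chick_relevel$Diet <- relevel(chick_relevel$Diet, ref = "3")
summary(glm(weight ~ Time + Diet, data = chick_relevel, family = Gamma(link = log)))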
Unlike lm(), the GLM summary does not print a single p-value for the whole model. However, we can do a rough check ourselves: if the Null Deviance minus the Residual Deviance is much greater than the difference between their degrees of freedom, the model as a whole is probably significant:
193 - 28 = 165
577 - 573 = 4
165 > 4
The model is probably significant!
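For a more formal version of this check, you can compare the fitted model to an intercept-only (null) model with anova(); an F test is the usual choice when the dispersion has to be estimated, as it does for the Gamma family. A sketch (my own, not from the notes):

# compare the full model to an intercept-only model on the drop in deviance
chick_null <- glm(weight ~ 1, data = chick, family = Gamma(link = log))
anova(chick_null, chick_glm, test = "F")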
All different diets cause weight to go up, but Diet 3 seems to do a little bit better than the others, even when time is taken into account. [Diet3 is better even when controlling for time]
A GLM like this describes broad trends. We can’t read the raw coefficients the way we read a linear model’s (“as x goes up by 1, y goes up by 0.26 units”). On the raw scale, all we can say is that, broadly speaking, as x increases y increases; to say more, we exponentiate the coefficients (see below).
You can also use GLMs on binary data: the binomial family with a logit link, family = binomial(link = "logit"), is the right choice when the dependent variable is 0/1.
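As an illustrative sketch of my own (using mtcars, not anything from the paper data), predicting transmission type from weight looks like this; with a logit link, the exponentiated estimates really are odds ratios:

# logistic GLM: transmission type (am: 0 = automatic, 1 = manual) predicted by weight
mtcars_logit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))
summary(mtcars_logit)
exp(coef(mtcars_logit))  # odds ratios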
Interestingly, this is almost exactly what we see in the ggplot for this data:
ggplot(chick_avg,aes(x=Time,y=mean_wt,color=as.factor(Diet))) + geom_line(linewidth=1) + labs(title="Chick Weight by Feed Type",subtitle="From the ChickWeight dataset - R (2024)")
The really cool thing about the log link in GLMs is that we can turn the estimates into multiplicative effects:
coef(chick_glm)
## (Intercept) Time Diet2 Diet3 Diet4
## 3.68329864 0.07991416 0.12122485 0.23526066 0.22650487
From the raw estimates, all we can tell is that Diet3 is better for chick growth than Diet2, Diet4, and Diet1 (the hidden reference).
However, if we exponentiate the estimates, we get this:
exp(coef(chick_glm))
## (Intercept) Time Diet2 Diet3 Diet4
## 39.777389 1.083194 1.128879 1.265239 1.254209
The “1.265” under Diet3 means that Diet3 produces about 26.5% more expected weight than Diet1. Why Diet1? Because it’s the hidden “reference” level (technically, it lives in the intercept). This is great because instead of just getting estimates, we now get multiplicative effects telling us how much each variable changes the outcome: Diet3 increases expected chick weight by about 26.5% relative to Diet1, even when accounting for Time.
How would you interpret the Time estimate? Expected weight goes up about 8% for each one-unit increase in Time, i.e. each additional day. (The chicks were measured every two days, but the coefficient is per day of Time.)
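A quick sketch (my own) to make that concrete, converting the Time coefficient into a per-day effect and into the effect over a two-day measurement interval:

# multiplicative effect of Time on expected weight
exp(coef(chick_glm)["Time"])      # ~1.08: about 8% higher expected weight per day
exp(2 * coef(chick_glm)["Time"])  # ~1.17: about 17% over a two-day interval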
Consequently, we have finally arrived at the point where you can run statistical models that handle your extremely skewed data!
[for the paper: the data did not meet the assumptions of linear models even after several transformations so GLM was the best]
We can run AIC against the two models we’ve made so far to see which is more efficient:
AIC(chick_lm,chick_glm)
## df AIC
## chick_lm 6 5789.607
## chick_glm 6 5272.827
“chick_glm” has a considerably lower AIC than “chick_lm”, which means the Gamma GLM is likely a better model for this data than a simple linear model.
The lower the AIC, the better the model explains the data relative to the number of parameters it uses.
For the paper: run through the assumption tests.
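A small sketch of what those assumption tests might look like for the linear model above (my own suggestion; the lmtest package is an extra install, not loaded in these notes):

# formal checks on the linear model's residuals
shapiro.test(residuals(chick_lm))  # normality of residuals
library(lmtest)                    # install.packages("lmtest") if needed
bptest(chick_lm)                   # Breusch-Pagan test for non-constant variance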