We use transformations when our data do not meet the model assumptions. By adjusting the response variable, we can make the data fit those assumptions more closely.
Recap of the linear model assumptions:

1) Linearity: there is a linear relationship between the independent and dependent variables.
2) Normality: the errors (residuals) are normally distributed; this is usually checked with a Q-Q plot or a goodness-of-fit test.
3) No multicollinearity: the independent variables should not be strongly correlated with each other.
4) Homoscedasticity: the errors have constant variance (in other words, no heteroscedasticity).

If one or more of these assumptions is broken, one solution is to transform our data to correct for outliers and/or assumption failures. We want our ideal data set to have the following (a quick way to check these in R is sketched just below):

1) Linearity, where there is a straight-line relationship between the variables
2) Normality, where the residuals follow a bell-shaped (normal) distribution
3) Homogeneity, where each group has similar error variance
4) Homoscedasticity, where the points are evenly scattered around the regression line
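The sketch below fits a simple model to R's built-in cars data (the same data set we use in the outlier example later) and looks at base R's diagnostic plots: residuals vs. fitted gives a feel for linearity and equal variance, and the normal Q-Q plot gives a feel for normality of the residuals.

# A minimal sketch: check the assumptions visually with base R's lm diagnostics.
fit <- lm(dist ~ speed, data = cars)
par(mfrow = c(2, 2))  # show the diagnostic plots together
plot(fit)             # residuals vs. fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))  # reset the plotting layout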
If our data does not meet the assumptions, the data might be skewed. The larger the absolute value of the skewness, the more the distribution differs from a normal distribution. Trying to fit a linear model when your data do not meet the LINE assumptions may result in incorrect or misleading analyses.
That’s where transformations come into play! When assumptions are violated, transforming your data can help it fit the assumptions and create a linear relationship. More specifically, a transformation is the replacement of a variable by a function of that variable, chosen to change the shape of a distribution or relationship (often with normality as the goal).
These transformations can reduce skewness, produce more nearly equal spreads, and produce a more nearly linear relationship so that a linear regression can be performed. As a disclaimer, however, this is not always guaranteed.
In introductory data analysis, the most common transformations are outlier removal and the log, inverse, and square/cube root transformations, which we will go over next.
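Before applying any of these, it can help to put a number on how skewed a variable is. Below is a minimal base R sketch of the sample skewness (the helper name is ours; packages such as e1071 and moments provide equivalent functions), applied to the skewed values we use in the transformation examples later on. Values near 0 suggest rough symmetry, while large positive values indicate a long right tail.

# A minimal sketch: one common estimator of skewness (the third standardized moment).
sample_skewness <- function(x) {
  n <- length(x)
  sum((x - mean(x))^3) / (n * sd(x)^3)
}
sample_skewness(c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 6, 7, 8))  # clearly positive, i.e. right-skewed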
It is important to notice that this graph has outliers in it. With these outliers, the data do not fit a line as well as they would without them; once outliers are removed, there can be a noticeable change in the slope of the best-fit line. If we had multiple outliers in our dataset, our predictions would be more error prone and inaccurate. Let’s try removing the outliers.
library(car)
library(plotly)
library(gridExtra)
# Inject outliers into data.
cars1 <- cars[1:30, ] # original data
cars_outliers <- data.frame(speed=c(19,19,20,20,20), dist=c(190, 186, 210, 220, 218)) # introduce outliers.
cars2 <- rbind(cars1, cars_outliers) # data with outliers.
slope_outlier <- lm(data = cars2, dist ~ speed)
outlier_s <- round(coef(slope_outlier)[2], digits = 2)
slope_original <- lm(data = cars1, dist ~ speed)
original_s <- round(coef(slope_original)[2], digits = 2)
# Plot of data with outliers.
outlier <- ggplot(cars2, aes(x = speed, y = dist)) +
geom_point() +
geom_smooth(aes(text = paste("Slope", outlier_s)), method = "lm", se = FALSE) +
labs(title = "With Outliers",
x = "Speed",
y = "Distance") +
nicetheme
original <- ggplot(cars1, aes(x = speed, y = dist)) +
geom_point() +
stat_smooth(method = "lm", se = FALSE, aes(text = paste("Slope", original_s))) +
labs(title = "Without Outliers",
x = "Speed",
y = "Distance") +
coord_cartesian(xlim = c(0, 20), ylim = c(-50,250)) +
nicetheme
ggplotly(outlier, tooltip = "text")
ggplotly(original, tooltip = "text")
With the outliers removed, we can see that our line of best fit has a lower slope, which shows how strongly outliers can influence a dataset. With the simple step of removing outliers, the line of best fit represents the data more accurately.
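Here we knew exactly which points were the outliers because we injected them ourselves. With real data, it helps to flag influential points more systematically before deciding whether to remove them; below is a minimal sketch using Cook's distance with the common 4/n rule of thumb, applied to the cars2 data from above (the object names are ours).

# A minimal sketch: flag influential points with Cook's distance instead of eyeballing them.
fit_out <- lm(dist ~ speed, data = cars2)
cooks <- cooks.distance(fit_out)
flagged <- which(cooks > 4 / nrow(cars2))  # common rule-of-thumb cutoff
cars2[flagged, ]  # inspect the flagged rows before dropping anything
cars_clean <- if (length(flagged) > 0) cars2[-flagged, ] else cars2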
### Square Root Transformation
df <- data.frame(y=c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 6, 7, 8),
x1=c(7, 7, 8, 3, 2, 4, 4, 6, 6, 7, 5, 3, 3, 5, 8),
x2=c(3, 3, 6, 6, 8, 9, 9, 8, 8, 7, 4, 3, 3, 2, 7))
original <- ggplot(data = df, aes(y)) +
geom_histogram(bins = 30, col = "red", fill = I("blue"), alpha = 0.7) +
labs(title = "Original Graph with Skew",
y = "Count",
x = "x") +
nicetheme
new <- ggplot(data = df, aes(sqrt(y))) +
geom_histogram(bins = 30, col = "red", fill = I("blue"), alpha = 0.7) +
labs(title = "Graph w/ Square Root Transformation",
y = "Count",
x = "x") +
nicetheme
ggplotly(original)
ggplotly(new)
Overall, the square root is the mildest of these transformations. We can see that the data are still slightly positively skewed, but the histogram has a better “bell curve” shape than the original graph.
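Beyond eyeballing the histogram, a normal Q-Q plot gives another rough check: the closer the points follow the reference line, the closer the values are to a normal distribution. A minimal base R sketch comparing the original and square-root-transformed values:

# A minimal sketch: compare normal Q-Q plots before and after the square root transformation.
par(mfrow = c(1, 2))
qqnorm(df$y, main = "Original y"); qqline(df$y)
qqnorm(sqrt(df$y), main = "sqrt(y)"); qqline(sqrt(df$y))
par(mfrow = c(1, 1))  # reset the layout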
Let’s try a cube root transformation.
### Cube Root Transformation
original1 <- ggplot(data = df, aes(y)) +
geom_histogram(bins = 30, col = "red", fill = I("blue"), alpha = 0.7) +
labs(title = "Original Graph with Skew",
y = "Count",
x = "x") +
nicetheme
transform1 <- ggplot(data = df, aes(y^(1/3))) +
geom_histogram(bins = 30, col = "red", fill = I("blue"), alpha = 0.7) +
labs(title = "Graph w/ Cube Root Transformation",
y = "Count",
x = "x") +
nicetheme
ggplotly(original1)
ggplotly(transform1)
With the cube root transformation, we can see the peak has shifted further to the right. The data are still positively skewed, but they have improved a lot from the original graph, and a bell curve fits the histogram more closely, suggesting a roughly normal distribution.
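One practical note: in R, `y^(1/3)` returns `NaN` for negative values, so if your variable can be negative a signed cube root is a common workaround. A minimal sketch (the helper name is ours):

# A minimal sketch: a cube root that also handles negative values.
cube_root <- function(x) sign(x) * abs(x)^(1/3)
(-8)^(1/3)       # NaN in R, since ^ is undefined for negative bases with fractional exponents
cube_root(-8)    # -2
cube_root(df$y)  # identical to df$y^(1/3) here, since all of our values are positive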
### Log Transformation
original2 <- ggplot(data = df, aes(y)) +
geom_histogram(bins = 30, col = "red", fill = I("blue"), alpha = 0.7) +
labs(title = "Original Graph with Skew",
y = "Count",
x = "x") +
nicetheme
transform2 <- ggplot(data = df, aes(log10(y))) +
geom_histogram(bins = 30, col = "red", fill = I("blue"), alpha = 0.7) +
labs(title = "Graph w/ Log Transformation",
y = "Count",
x = "x") +
nicetheme
ggplotly(original2)
ggplotly(transform2)
With the log transformation, we can clearly see a much better fit than in the original graph. The bar at \(x = 0\) contains the observations where \(y = 1\) (since \(\log_{10}(1) = 0\)), and the left and right sides of the graph now follow a better trend. There is more symmetry in the bell-shaped curve as well. Note: the log of zero (or of a negative number) is undefined, so if your data contain zeros or negative values, add a constant to every value to make them all positive before taking the log.
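A minimal sketch of that shift-before-logging idea (the example vector is hypothetical; base R's `log1p()`, which computes \(\log(1 + y)\), is a convenient shortcut when the smallest value is 0):

# A minimal sketch: shift the variable so every value is positive before taking the log.
y_with_zero <- c(0, 1, 2, 2, 3, 6, 8)  # hypothetical counts that include a zero
log10(y_with_zero)                     # -Inf for the zero
log10(y_with_zero + 1)                 # shift by a constant first, then log
log1p(y_with_zero)                     # natural-log version of the same idea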
### Inverse Transformation
original3 <- ggplot(data = df, aes(y)) +
geom_histogram(bins = 30, col = "red", fill = I("blue"), alpha = 0.7) +
labs(title = "Original Graph with Skew",
y = "Count",
x = "x") +
nicetheme
transform3 <- ggplot(data = df, aes(1/y)) +
geom_histogram(bins = 30, col = "red", fill = I("blue"), alpha = 0.7) +
labs(title = "Graph w/ Inverse Trasformation",
y = "Count",
x = "x") +
nicetheme
ggplotly(original3)
ggplotly(transform3)
The inverse (reciprocal) transformation is the strongest of these transformations and is typically used when the data are severely skewed. Here we can see a clear bell-shaped curve if we imagine a line tracing the tops of the histogram bars.
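Two practical cautions with the inverse: it is undefined at zero (1/0 is `Inf` in R), and it reverses the ordering of the values, so the largest original values become the smallest transformed ones. A minimal sketch:

# A minimal sketch: watch out for zeros, and note that 1/y flips the ordering of the values.
1 / df$y        # fine here, since every value is positive
1 / 0           # Inf -- shift first (e.g. 1 / (y + 1)) if your variable can be zero
rank(df$y)      # small y values get small ranks ...
rank(1 / df$y)  # ... and the ranking flips after the inverse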
These transformations can be ranked by strength, which tells us which to try first: milder transformations (square and cube roots) for mildly skewed data, and stronger ones (log, and especially the inverse) for more severely skewed data.
It is also worth noting that all of the examples given above, \[y^{1/3},\quad \sqrt{y},\quad \log(y),\quad 1/y,\] are used for positively (right) skewed data, meaning the tail is on the right. For negatively skewed data with a left tail, first reflect the variable and then transform: \[\bigl(\max(y) + 1 - y\bigr)^{1/3},\quad \sqrt{\max(y) + 1 - y},\quad \log\bigl(\max(y) + 1 - y\bigr),\quad \frac{1}{\max(y) + 1 - y},\] where \(y\) is your variable. Overall, if your data does not meet the assumptions, try transforming your data. It is worth noting that transformations do not always work; however, if they do, feel free to continue working with your data!
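As a final illustration, here is a minimal sketch of that reflect-then-transform idea for left-skewed data (the example vector is hypothetical):

# A minimal sketch: reflect a left-skewed variable so the usual right-skew transformations apply.
y_left <- c(2, 5, 6, 7, 7, 8, 8, 8, 9, 9, 9)  # hypothetical left-skewed values
reflected <- max(y_left) + 1 - y_left         # now right-skewed, with every value >= 1
sqrt(reflected)                               # then apply sqrt / cube root / log / inverse as before
log(reflected)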
Self-grade:
I believe our project meets the criteria for an excellent project. We discussed the topic of transformations, gave an overview of which common transformations exist and why transformations are used. We also showed clear examples of what happens before and after each transformation, and how the transformed data better match a normal distribution. The project is also clearly organized, and it is easy to find any specific transformation you are looking for.
To successfully teach students about transformations, we included several examples of different types of transformations and their results. Before jumping into transformations, we wanted to explain why transformations are used and when we use them. Thus, we researched and wrote up the reasons transformations are needed, mainly when data do not align with the linear model assumptions.
The Linear Model Assumptions are an important stepping stone to transformations. Without prior knowledge of them, students would essentially be working blindly. Thus, we thought it was crucial to discuss them even though they are not strictly part of our topic. We included a recap of them, and of what a “good” graph looks like when all the assumptions are met. After an introduction to transformations, we went through the possible transformations: removing outliers, cube root, square root, logarithmic, and inverse. Although removing outliers is not a transformation in the strictest sense, we believed it was an important lesson to mention because outliers heavily influence datasets. We plotted two graphs for each transformation: the original skewed graph and the newly transformed graph. At the end, we included when each transformation should be applied and how strongly each one affects the dataset.
In order to create this lesson, we created skewed data sets that would look visibly different, and closer to a normal distribution or linear trend, after the transformation was performed. We then created plots such as histograms to show the distribution of the original data, transformed the y values based on the specific transformation being demonstrated (e.g. taking the square root of every value), and plotted a new graph to show the new and improved distribution.
We hope that this crash course in transformations will be useful to students in introductory statistics courses. Students in STAT021 at Swarthmore will be able to refer to our guide as a quick overview of the kinds of transformations that exist, and when learning about linear regression and its assumptions, we hope that students will be able to apply the concepts we explained to any data sets they are working with. When working with a skewed data set, the transformations we discuss (root, log, inverse, removal of outliers) can hopefully be a quick way to correct nonlinear data sets, so that students can move forward with their analyses and draw some interesting conclusions! In addition, if we publish our project on RPubs, it would be publicly accessible to anyone interested in learning how to transform their data. Learning how to transform your data when it doesn’t meet the linear regression assumptions is important well beyond introductory statistics courses. Those who move from introductory to advanced statistics courses will hopefully carry this knowledge of transformations forward to help them build more advanced models.