class: center, middle, inverse, title-slide

.title[
# Creating QQ-Plots in R using ggplot2
]
.author[
### Farhad
]

---

## 1. Introduction to QQ-Plots

- **What is a QQ-Plot?**: A QQ-plot (Quantile-Quantile plot) compares the quantiles of a sample dataset to those of a theoretical distribution, typically the normal distribution.
- **Purpose**: It helps determine whether the sample data follows the theoretical distribution.

<div style="display: flex;">
<div style="width: 42%; font-size: 14px; padding-right-bottom: 20px;">
<!-- -->
</div>
<div style="width: 50%; padding-left: 20px;">

``` r
library(ggplot2)

# Assumed setup (not shown on the original slide): simulated normal data
data <- data.frame(sample_data = rnorm(100))

# Create QQ plot for normal data
ggplot(data, aes(sample = sample_data)) +
  stat_qq(color = "blue", size = 2) +
  stat_qq_line(color = "red", linetype = "dashed", size = 1) +
  labs(title = "QQ-Plot for Normal Data",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal() +
  theme(plot.title = element_text(color = "black", size = 16, face = "bold"))
```

<!--Commentary: “On this slide, we start by introducing QQ-plots, which are used to compare the distribution of data to a theoretical one, such as the normal distribution. The plot shows sample quantiles (y-axis) versus theoretical quantiles (x-axis). If the points fall close to the line, it suggests the data is approximately normally distributed. Here, we’ve used vibrant colors to make the comparison clearer.”-->

---

## 2. What is a QQ-Plot?

- **Definition:** A QQ-plot plots the quantiles of the data sample against the quantiles of a theoretical distribution (e.g., normal).
- **Interpretation:** If the data is normally distributed, the points will lie approximately along a straight line.
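As a rough sketch of what this definition means in practice, the coordinates of a normal QQ-plot can be computed by hand in base R. The sketch assumes the plotting positions returned by `ppoints()`, which base R's `qqnorm()` also uses; the sample `x` and the seed are made up for illustration.

```r
# Illustrative sample (hypothetical data, not from the slides)
set.seed(42)
x <- rnorm(50, mean = 10, sd = 2)

# Sample quantiles: the ordered observations
sample_q <- sort(x)

# Theoretical quantiles: standard normal quantiles at the plotting positions
theoretical_q <- qnorm(ppoints(length(x)))

# qqnorm() returns the same coordinates, in the original data order
qq <- qqnorm(x, plot.it = FALSE)
all.equal(sort(qq$x), theoretical_q)
all.equal(sort(qq$y), sample_q)
```

Plotting `sample_q` against `theoretical_q` gives the points of the QQ-plot; the mean and standard deviation of the sample only change the intercept and slope of the line the points follow.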
<div style="display: flex;">
<div style="width: 45%; font-size: 14px; padding-right-bottom: 20px;">
<!-- -->
</div>
<div style="width: 50%; padding-left: 20px;">

``` r
# Create another colored QQ plot
ggplot(data, aes(sample = sample_data)) +
  stat_qq(color = "green4", size = 2) +
  stat_qq_line(color = "orange", linetype = "solid", size = 1) +
  labs(title = "Definition of QQ-Plot",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal() +
  theme(plot.title = element_text(color = "darkgreen", size = 16, face = "bold"))
```

<!--Commentary: “Here, we define the QQ-plot in more detail. In this example, we are again using normal data, and the points line up along the reference line. Any significant deviation from this line would indicate that the data does not follow a normal distribution. The choice of green and orange highlights the quantile points and the line.”-->

---

## 3. Real Data Example (mtcars Dataset)

- Let’s take the real-world mtcars dataset and check if the mpg (miles per gallon) variable follows a normal distribution.
- We’ll use a QQ-plot to visualize this comparison.

<div style="display: flex;">
<div style="width: 45%; font-size: 14px; padding-right-bottom: 20px;">
<!-- -->
</div>
<div style="width: 50%; padding-left: 20px;">

``` r
# Assumed setup (not shown on the original slide):
# data_mtcars is the built-in mtcars data frame
data_mtcars <- mtcars

# Create QQ plot for 'mpg' variable in
# mtcars dataset
ggplot(data_mtcars, aes(sample = mpg)) +
  stat_qq(color = "purple", size = 2) +
  stat_qq_line(color = "red", linetype = "dashed", size = 1) +
  labs(title = "QQ-Plot for MPG in mtcars Dataset",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal() +
  theme(plot.title = element_text(color = "darkred", size = 16, face = "bold"))
```

<!--Commentary: “In this slide, we use real data from the mtcars dataset. We plot the mpg (miles per gallon) variable against the theoretical normal quantiles.
While the points are generally close to the line, there is some deviation, especially at the tails, indicating the data might not perfectly follow a normal distribution.”-->

---

## 4. QQ-Plot for Skewed Data (Chi-Square Distribution)

- **Skewed Data:** Let’s generate a right-skewed sample from a chi-square distribution and compare it to a normal distribution.
- Skewed data will typically have points that curve away from the reference line, particularly in the tails.

<div style="display: flex;">
<div style="width: 42%; font-size: 14px; padding-right-bottom: 20px;">
<!-- -->
</div>
<div style="width: 50%; padding-left: 20px;">

``` r
# Create QQ plot for skewed data
# (Chi-square distribution)
data_skewed <- data.frame(sample_data = rchisq(100, df = 2))

ggplot(data_skewed, aes(sample = sample_data)) +
  stat_qq(color = "darkgreen", size = 2) +
  stat_qq_line(color = "orange", linetype = "dashed", size = 1) +
  labs(title = "QQ-Plot for Skewed Data (Chi-Square Distribution)",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal() +
  theme(plot.title = element_text(color = "darkgreen", size = 16, face = "bold"))
```

<!--Commentary: “This QQ-plot compares a skewed dataset generated from a chi-square distribution to a normal distribution. Notice how the points deviate from the reference line, particularly at the upper end of the distribution, indicating the long right tail (positive skew) characteristic of the chi-square distribution.”-->

---

## 5. QQ-Plot for Heavy-Tailed Data (t-Distribution)

- **Heavy-Tailed Data:** We can visualize heavy-tailed data, such as a sample from the t-distribution, where we expect significant deviation from the reference line at both tails.
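To see why heavy tails show up as deviations at both ends, one can compare extreme quantiles of a t-distribution with 2 degrees of freedom (the `df` used on this slide) against those of the standard normal. This is a small base-R sketch; the probability levels are chosen only for illustration.

```r
# Probability levels in the lower and upper tails (illustrative choice)
p <- c(0.01, 0.05, 0.95, 0.99)

tail_compare <- data.frame(
  p        = p,
  normal_q = qnorm(p),      # standard normal quantiles
  t2_q     = qt(p, df = 2)  # t quantiles with 2 degrees of freedom
)

# Ratios above 1 mean the t quantiles are more extreme than the normal ones
tail_compare$ratio <- tail_compare$t2_q / tail_compare$normal_q
print(tail_compare)
```

In the QQ-plot this appears as the left-most points falling below the reference line and the right-most points rising above it.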
<div style="display: flex;">
<div style="width: 45%; font-size: 14px; padding-right: 20px;">
<!-- -->
</div>
<div style="width: 50%; padding-left: 20px;">

``` r
# Create QQ plot for heavy-tailed data
# (t-distribution)
data_heavy_tail <- data.frame(sample_data = rt(100, df = 2))

ggplot(data_heavy_tail, aes(sample = sample_data)) +
  stat_qq(color = "blue", size = 2) +
  stat_qq_line(color = "red", linetype = "dashed", size = 1) +
  labs(title = "QQ-Plot for Heavy-Tailed Data (t-Distribution)",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal() +
  theme(plot.title = element_text(color = "darkblue", size = 16, face = "bold"))
```

<!--Commentary: “This plot demonstrates a QQ-plot for heavy-tailed data generated from a t-distribution. As we can see, the points at both tails deviate from the line, indicating the presence of more extreme values compared to the normal distribution.”-->

---

## 6. Comparing QQ-Plots of Different Distributions

- **Multiple Distributions:** Here we compare QQ-plots of normally distributed data, skewed data, and heavy-tailed data side by side.

<div style="display: flex;">
<div style="width: 50%; font-size: 14px; padding-right: 20px;">
<!-- -->
</div>
<div style="width: 50%; padding-left: 20px;">

``` r
# Combine QQ-plots for different
# data distributions
data_compare <- data.frame(
  type = rep(c("Normal", "Skewed", "Heavy-tailed"), each = 100),
  values = c(rnorm(100), rchisq(100, df = 2), rt(100, df = 2))
)

ggplot(data_compare, aes(sample = values)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~type) +
  labs(title = "Comparison of QQ-Plots for Different Distributions",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal() +
  theme(plot.title = element_text(color = "darkred", size = 16, face = "bold"))
```

<!--Commentary: “Here, we compare the QQ-plots of three different distributions: normal, skewed, and heavy-tailed.
The comparison illustrates how deviations from the reference line can indicate characteristics like skewness or heavy tails.”-->

---

## 7. Real-World Example (Sepal Length in Iris Dataset)

- **Iris Dataset:** Now, we apply QQ-plots to the famous iris dataset to check if Sepal.Length follows a normal distribution.

<div style="display: flex;">
<div style="width: 50%; font-size: 14px; padding-left: 20px;">
<!-- -->
</div>
<div style="width: 50%; padding-right: 20px;">

``` r
# Use the iris dataset
ggplot(iris, aes(sample = Sepal.Length)) +
  stat_qq(color = "blue4", size = 2) +
  stat_qq_line(color = "orange", linetype = "dashed", size = 1) +
  labs(title = "QQ-Plot for Sepal Length (Iris Dataset)",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal() +
  theme(plot.title = element_text(color = "darkblue", size = 16, face = "bold"),
        axis.title.x = element_text(color = "darkblue", size = 14),
        axis.title.y = element_text(color = "darkblue", size = 14),
        axis.text = element_text(color = "blue", size = 12))
```

<!--Commentary: “This QQ-plot shows the distribution of the Sepal.Length variable from the iris dataset. We compare its quantiles to a normal distribution. While many points fall close to the line, we notice slight deviations at the tails, suggesting that the variable may not perfectly follow a normal distribution.”-->

---

## 8. Customizing QQ-Plots in ggplot2

- **Customization:** You can customize the color, size, and appearance of QQ-plots in ggplot2 to enhance readability and presentation.
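One possible way to keep such customizations consistent across plots is to wrap them in a small helper. The function `qq_plot()` and its arguments are hypothetical names introduced for this sketch; they are not part of ggplot2.

```r
library(ggplot2)

# Hypothetical helper: builds a QQ-plot for one numeric column of a data frame
qq_plot <- function(df, var, point_color = "blue", line_color = "red",
                    title = NULL) {
  ggplot(df, aes(sample = .data[[var]])) +
    stat_qq(color = point_color, size = 2) +
    stat_qq_line(color = line_color, linetype = "dashed") +
    labs(title = title,
         x = "Theoretical Quantiles",
         y = "Sample Quantiles") +
    theme_minimal() +
    theme(plot.title = element_text(size = 16, face = "bold"))
}

# Usage with the built-in iris data
p <- qq_plot(iris, "Sepal.Length",
             point_color = "purple", line_color = "green4",
             title = "QQ-Plot for Sepal Length")
```

Printing `p` draws the plot; further `theme()` calls can still be added to the returned object with `+`.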
<div style="display: flex;">
<div style="width: 50%; font-size: 14px; padding-left: 20px;">
<!-- -->
</div>
<div style="width: 50%; padding-right: 20px;">

``` r
# Customized QQ-plot for aesthetics
ggplot(data, aes(sample = sample_data)) +
  stat_qq(color = "purple", size = 3) +
  stat_qq_line(color = "green", linetype = "solid", size = 1.2) +
  labs(title = "Customized QQ-Plot Example",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal() +
  theme(plot.title = element_text(color = "purple", size = 16, face = "bold"),
        axis.title.x = element_text(color = "darkgreen", size = 14),
        axis.title.y = element_text(color = "darkgreen", size = 14),
        axis.text = element_text(color = "blue", size = 12))
```

<!--Commentary: “In this slide, we demonstrate how to customize QQ-plots using ggplot2. By changing the color, line type, and font sizes, we can make our plots more visually engaging, which is helpful when presenting results to a non-technical audience. The customization options in ggplot2 are quite extensive.”-->

---

## 9. Applications of QQ-Plots in Research

- **Applications:** QQ-plots are used extensively in research to check assumptions of normality, a critical step before conducting parametric tests like t-tests or ANOVA.
- They are also used in regression analysis to check the normality of residuals.
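The residual check mentioned above could look like the following sketch. The model `mpg ~ wt` on the built-in mtcars data is an arbitrary choice made for illustration, not an analysis taken from these slides.

```r
library(ggplot2)

# Illustrative model (assumed for this sketch): mpg regressed on weight
fit <- lm(mpg ~ wt, data = mtcars)

# Collect the residuals in a data frame so ggplot2 can use them
resid_df <- data.frame(residual = residuals(fit))

# QQ-plot of the residuals against the normal distribution
p_resid <- ggplot(resid_df, aes(sample = residual)) +
  stat_qq(color = "blue", size = 2) +
  stat_qq_line(color = "red", linetype = "dashed") +
  labs(title = "QQ-Plot of Regression Residuals",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal()

p_resid
```

Points that stay close to the dashed line suggest the residuals are roughly normal; systematic curvature at the ends would point to skewness or heavy tails, as on the earlier slides.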
<div style="display: flex;">
<div style="width: 50%; font-size: 14px; padding-left: 20px;">
<!-- -->
</div>
<div style="width: 50%; padding-right: 20px;">

``` r
# Illustrative example using simulated normal data
ggplot(data, aes(sample = sample_data)) +
  stat_qq(color = "orange", size = 3) +
  stat_qq_line(color = "blue", linetype = "dashed", size = 1) +
  labs(title = "Real-World Example: Checking Normality",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal() +
  theme(plot.title = element_text(color = "darkorange", size = 16, face = "bold"),
        axis.title.x = element_text(color = "darkblue", size = 14),
        axis.title.y = element_text(color = "darkblue", size = 14),
        axis.text = element_text(color = "blue", size = 12))
```

<!--Commentary: “QQ-plots have numerous applications in research. One common use is to check the normality assumption for parametric tests such as the t-test, ANOVA, or linear regression. In this illustration, which uses simulated normal data, the points closely follow the normal distribution, as indicated by the close fit to the reference line.”-->

---

## 10. Conclusion and Final Example

- **Recap:** QQ-plots are powerful visual tools that help in assessing the distribution of data against a theoretical distribution.
- **Takeaway:** They are easy to generate and customize in ggplot2, making them ideal for statistical analysis and presentations.
<div style="display: flex;">
<div style="width: 42%; font-size: 14px; padding-right-bottom: 20px;">
<!-- -->
</div>
<div style="width: 50%; padding-left: 20px;">

``` r
# Final clean QQ plot example
ggplot(data, aes(sample = sample_data)) +
  stat_qq(color = "red", size = 2.5) +
  stat_qq_line(color = "black", linetype = "solid", size = 1.2) +
  labs(title = "Final QQ-Plot Example",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal() +
  theme(plot.title = element_text(color = "darkred", size = 16, face = "bold"),
        axis.title.x = element_text(color = "black", size = 14),
        axis.title.y = element_text(color = "black", size = 14),
        axis.text = element_text(color = "black", size = 12))
```

<!--Commentary: “To conclude, QQ-plots offer an intuitive way to check how well data fits a theoretical distribution. They are especially useful in normality checks and are easily customizable in ggplot2 for better visual presentation. This final example reiterates the concepts we’ve covered: simple, yet effective.”-->