March 3 2025

What is data visualization?

  • Data visualization is the practice of conveying data visually through the use of graphs and charts.

  • Graphs and charts help convey key information from the data. Popular graphs and charts include histograms, scatter plots, and pie charts.

  • Each type of graph or chart has its own uses. Scatter plots, for example, can be used to identify a potential relationship between two variables and to gauge its strength.

  • Two popular data visualization libraries are plotly and ggplot2.

Plotly vs Ggplot2

  • The most significant difference between a ggplot plot and plotly plot is that plotly plots are interactive. Plotly plots allow the user to zoom in and out and isolate a specific group by double-clicking its label on the legend.

  • ggplot2 offers more customization options than plotly but is harder to use; plotly is the more intuitive of the two.

  • ggplot2 plots can be converted into interactive plotly plots using ggplotly().

Data Visualization Example Dataset

  • To demonstrate the different graphs and charts used in data visualization, I will be using data from ‘Student_Performance.csv’. This is a dataset on Kaggle consisting of 10,000 student records.

  • The dataset consists of columns such as the number of hours a student studied, their previous scores, and whether or not they do extracurriculars. This allows the user to explore patterns and relationships in the data regarding students’ academic performance.

Plotly Scatter Plot and Linear Regression

This code shows the relationship between students’ previous scores and their current performance using a scatter plot and a regression line. First, it groups the dataset by previous score and calculates the average performance index for each group. Then, a linear regression model is fitted to quantify the relationship between past and current scores. Finally, the plot is customized with axis labels and a title to make the visualization clear and easy to understand. This helps in identifying whether students with higher past scores generally perform better in their current assessments.

Using plotly, I created a scatter plot of previous score against the average current score for each previous score, to see whether the two are related. I then added the fitted regression line.
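The steps described above could be sketched roughly as follows. This is a minimal illustration, not the original code: the data frame is a small synthetic stand-in for ‘Student_Performance.csv’, the column name Performance.Index is assumed (only Previous.Scores appears in the output of this report), and the object names students, avg_scores, model, and fig are hypothetical.

```r
library(dplyr)
library(plotly)

# Synthetic stand-in for Student_Performance.csv (values are made up)
set.seed(1)
students <- data.frame(Previous.Scores = sample(40:99, 500, replace = TRUE))
students$Performance.Index <- students$Previous.Scores - 15 + rnorm(500, sd = 8)

# Average performance index for each distinct previous score
avg_scores <- students %>%
  group_by(Previous.Scores) %>%
  summarise(Avg.Performance = mean(Performance.Index), .groups = "drop")

# Simple linear regression of the averages on previous score
model <- lm(Avg.Performance ~ Previous.Scores, data = avg_scores)
avg_scores$Fitted <- fitted(model)

# Scatter plot of the averages with the fitted regression line on top
fig <- plot_ly(avg_scores, x = ~Previous.Scores, y = ~Avg.Performance,
               type = "scatter", mode = "markers", name = "Average score") %>%
  add_lines(y = ~Fitted, name = "Regression line") %>%
  layout(title = "Previous Scores vs. Average Performance Index",
         xaxis = list(title = "Previous Scores"),
         yaxis = list(title = "Average Performance Index"))
fig
```

With the real dataset, the synthetic data frame would be replaced by something like read.csv("Student_Performance.csv"), assuming the file is in the working directory.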

Statistical Analysis

##                 coef.model. pval
## (Intercept)      -15.181799    0
## Previous.Scores    1.013837    0
## [1] "Multiple R-squared:  0.8376"
## [1] "Residual Standard Error:  7.7435"
  • The positive slope indicates a positive relationship between previous and current score.

  • The reported p-value of 0 is a rounded value (a p-value is never exactly zero); it is small enough to indicate a statistically significant relationship between previous score and the average current score of all students with the same previous score.

  • The model fits the data well (R² = 0.8376): 83.76% of the variation in average current score is explained by previous score, so previous scores are a strong predictor of average current performance.

  • The residual standard error of 7.7435 means the averages typically deviate from the regression line by about 7.7 points, which suggests that other factors might also influence performance (e.g., study habits, extracurriculars).
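A table like the one above can be pulled out of a fitted lm object along these lines. This is a sketch under assumptions: the data are synthetic, the names avg_scores, model, fit, and coef_table are hypothetical, and the rounding to four decimal places is a guess at why the p-values print as 0.

```r
# Synthetic stand-in for the per-score averages (values are made up)
set.seed(1)
avg_scores <- data.frame(Previous.Scores = 40:99)
avg_scores$Avg.Performance <- avg_scores$Previous.Scores - 15 + rnorm(60, sd = 3)

model <- lm(Avg.Performance ~ Previous.Scores, data = avg_scores)
fit <- summary(model)

# Coefficient estimates next to p-values rounded to 4 decimal places;
# very small p-values are displayed as 0 after rounding
coef_table <- data.frame(coef = coef(model),
                         pval = round(coef(fit)[, "Pr(>|t|)"], 4))
print(coef_table)

# Unrounded p-values, to show they are tiny rather than exactly zero
print(coef(fit)[, "Pr(>|t|)"])

print(paste("Multiple R-squared: ", round(fit$r.squared, 4)))
print(paste("Residual Standard Error: ", round(fit$sigma, 4)))
```

Printing the unrounded p-values alongside the rounded ones makes it clear that a displayed 0 is a formatting effect, not an exact result.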

3D Plotly Plot

This code creates a 3D scatter plot to analyze how students’ study habits relate to their performance. It first categorizes students into letter grades based on their previous scores, then uses plot_ly() to plot students’ hours studied (x-axis), sample question papers practiced (y-axis), and performance index (z-axis), with colors representing the different letter grades. This plot helps in understanding whether studying more hours or practicing more questions is associated with higher test scores.
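One way such a plot might be built is sketched below. The grade cutoffs (90/80/70/60), the helper to_letter_grade(), the synthetic data, and all column names other than Previous.Scores are assumptions made for illustration; the original cutoffs are not stated in this report.

```r
library(plotly)

# Hypothetical helper: map a numeric score to a letter grade (cutoffs assumed)
to_letter_grade <- function(score) {
  cut(score, breaks = c(-Inf, 59, 69, 79, 89, Inf),
      labels = c("F", "D", "C", "B", "A"))
}

# Synthetic stand-in for Student_Performance.csv (values are made up)
set.seed(1)
n <- 200
students <- data.frame(
  Hours.Studied = sample(1:9, n, replace = TRUE),
  Sample.Question.Papers.Practiced = sample(0:9, n, replace = TRUE),
  Previous.Scores = sample(40:99, n, replace = TRUE)
)
students$Performance.Index <- 2 * students$Hours.Studied +
  0.8 * students$Previous.Scores - 20 + rnorm(n, sd = 3)
students$Letter.Grade <- to_letter_grade(students$Previous.Scores)

# 3D scatter plot colored by letter grade
fig3d <- plot_ly(students,
                 x = ~Hours.Studied,
                 y = ~Sample.Question.Papers.Practiced,
                 z = ~Performance.Index,
                 color = ~Letter.Grade,
                 type = "scatter3d", mode = "markers",
                 marker = list(size = 3)) %>%
  layout(scene = list(
    xaxis = list(title = "Hours Studied"),
    yaxis = list(title = "Sample Question Papers Practiced"),
    zaxis = list(title = "Performance Index")
  ))
fig3d
```

In a 3D plotly figure the axis titles are set under scene rather than directly in layout(), which is why the labels are nested one level deeper than in a 2D plot.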

Pie Chart Plot

This code creates a pie chart to visualize the distribution of students by sleep hours. It first groups the dataset by Sleep Hours and counts the number of students in each category using the functions group_by() and summarise(). Then, it calculates the percentage of students in each sleep category and formats it as a label. The ggplot() function is used to create a bar chart, which is then transformed into a pie chart. The theme_void() function removes unnecessary background elements for a cleaner look. This visualization helps in understanding how students’ sleep patterns vary across the dataset.
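The pipeline described above might look something like the sketch below. The data are synthetic, the column name Sleep.Hours and the object names sleep_counts and pie are assumed, and coord_polar() is used here as the usual ggplot2 way to turn a stacked bar into a pie; the report does not say which function the original code used for that step.

```r
library(dplyr)
library(ggplot2)

# Synthetic stand-in for Student_Performance.csv (values are made up)
set.seed(1)
students <- data.frame(Sleep.Hours = sample(4:9, 300, replace = TRUE))

# Count students per sleep category and build a percentage label
sleep_counts <- students %>%
  group_by(Sleep.Hours) %>%
  summarise(Count = n(), .groups = "drop") %>%
  mutate(Percent = Count / sum(Count) * 100,
         Label = paste0(round(Percent, 1), "%"))

# Single stacked bar, wrapped into a circle with polar coordinates
pie <- ggplot(sleep_counts,
              aes(x = "", y = Count, fill = factor(Sleep.Hours))) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = Label), position = position_stack(vjust = 0.5)) +
  labs(fill = "Sleep Hours",
       title = "Distribution of Students by Sleep Hours") +
  theme_void()
pie
```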

Ggplot Box Plot

This code creates a box plot to compare sleep hours between students who participate in extracurricular activities and those who don’t. It sets extracurricular activities as the x-axis and sleep hours as the y-axis, with a different color for each group. The plot helps visualize how extracurricular involvement relates to students’ sleep patterns.
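A box plot of this kind could be sketched as follows. The data are synthetic, and the column names Extracurricular.Activities and Sleep.Hours and the "Yes"/"No" coding are assumptions; the plot is stored in an object named p so that it lines up with the ggplotly(p) call at the end of this report.

```r
library(ggplot2)

# Synthetic stand-in for Student_Performance.csv (values are made up)
set.seed(1)
students <- data.frame(
  Extracurricular.Activities = sample(c("Yes", "No"), 300, replace = TRUE),
  Sleep.Hours = sample(4:9, 300, replace = TRUE)
)

# One box per group, filled with a different color for each
p <- ggplot(students,
            aes(x = Extracurricular.Activities, y = Sleep.Hours,
                fill = Extracurricular.Activities)) +
  geom_boxplot() +
  labs(title = "Sleep Hours by Extracurricular Participation",
       x = "Extracurricular Activities",
       y = "Sleep Hours",
       fill = "Extracurricular Activities")
p
```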

Convert Ggplot Box Plot into Plotly Plot

This code converts the ggplot box plot p into an interactive plotly plot using ggplotly().

ggplotly(p)