Course Project for ANLY 512: Data Visualization

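library(readxl)   # provides read_excel()
library(ggplot2)  # static charts
library(plotly)   # interactive charts; also provides the %>% pipe

# Load the March running data from the Excel file
running_data <- read_excel("running_data_March.xlsx")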

# Create a line chart
line_chart <- plot_ly(running_data, x = ~Date, y = ~Distance_km, type = "scatter", mode = "lines", 
                       line = list(color = "blue", width = 2)) %>%
  layout(title = "Running Distance by Date",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Distance_km"))

line_chart

# Create a bar chart

bar_chart <- ggplot(running_data, aes(Date, ElevationGain_m)) +
  # geom_col() draws bars whose heights are the values in the data
  geom_col(fill = "lightblue") +
  labs(title = "Elevation Gain by Date",
       x = "Date",
       y = "Elevation gain (m)") +
  theme_minimal()

# Display the bar chart
print(bar_chart)

# Create a scatter plot
scatter_plot <- ggplot(running_data, aes(Speed, Distance_km)) +
  geom_point(color = "darkblue", size = 3) +
  labs(title = "Running Speed vs. Distance",
       x = "Speed",
       y = "Distance_km") +
  theme_minimal()

print(scatter_plot)

# Create histogram
ggplot(running_data, aes(x = Distance_km)) +
  geom_histogram(binwidth = 1, color = "black", fill = "skyblue") +
  labs(title = "Distance Distribution",
       x = "Distance (in km)",
       y = "Count") +
  theme_minimal()

# Create a pie chart
pie_chart <- plot_ly(running_data, labels = ~Activity, values = ~Distance_km, type = "pie",
                      marker = list(colors = c("lightblue", "lightgreen", "red"))) %>%
  layout(title = "Percentage of Running Distance by Activity")

pie_chart

The Quantified Self

Data Source: Strava
Data Type: Running data analyzed for March

Write-up

Introduction: Running is a popular form of exercise and a widely practiced sport. Modern fitness trackers and apps make it easy for runners to record distance, time, pace, and other details of each run. These data can be used to analyze performance, set goals, and gain insight into running patterns.

Data Acquisition: The first step in analyzing running data is to acquire it, for example from a fitness tracking app, a smartwatch, or an external sensor. Exports typically come as CSV, Excel, or JSON files. For this project the March runs come from Strava and are loaded into R from the Excel file running_data_March.xlsx with read_excel().
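As a minimal sketch of the loading step, the snippet below writes a few made-up rows to a temporary CSV file and reads them back with base R's read.csv(). The column names follow the ones used in this project; the values, the year, and the temporary file are hypothetical stand-ins for the real export.

```r
# Toy rows standing in for a real export (all values are hypothetical)
toy_runs <- data.frame(
  Date        = c("2023-03-01", "2023-03-04", "2023-03-07"),
  Distance_km = c(5.2, 8.0, 3.1)
)

# Write the toy rows to a temporary CSV file, then load them as a real export would be loaded
csv_path <- tempfile(fileext = ".csv")
write.csv(toy_runs, csv_path, row.names = FALSE)

loaded <- read.csv(csv_path, stringsAsFactors = FALSE)
loaded$Date <- as.Date(loaded$Date)  # read.csv() returns text; convert to Date

str(loaded)       # inspect column types after loading
unlink(csv_path)  # remove the temporary file
```

Checking the column types with str() right after loading catches dates that were read as text before they cause problems in a time-based chart.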

Data Storage: In R, data can be stored in various data structures, such as data frames, lists, or arrays. Data frames are commonly used for storing tabular data, where each column can represent a variable or attribute, and each row represents an observation. Data frames provide flexibility for data manipulation and visualization in R, and are often used for handling running data.
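To illustrate the row-per-observation layout described above, here is a small data frame built by hand. The column names match this project's data; the values, the year, and the activity labels are invented for the example.

```r
# One row per run, one column per variable (values are hypothetical)
running_toy <- data.frame(
  Date            = as.Date(c("2023-03-01", "2023-03-04", "2023-03-07")),
  Distance_km     = c(5.2, 8.0, 3.1),
  ElevationGain_m = c(40, 95, 12),
  Speed           = c(10.5, 9.8, 11.2),
  Activity        = c("Run", "Run", "Trail Run")
)

str(running_toy)          # column names and types
dim(running_toy)          # number of rows and columns
running_toy$Distance_km   # a single column is a plain vector
running_toy[2, ]          # a single row is one observation (one run)
```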

Data Manipulation: Data manipulation involves cleaning, transforming, and summarizing data to make it suitable for analysis and visualization. R provides numerous packages, such as dplyr, tidyr, and lubridate, that offer powerful tools for data manipulation. These packages allow tasks such as filtering, sorting, aggregating, and merging data, as well as handling missing or inconsistent data.
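The filtering and aggregating steps mentioned above might look like the following dplyr sketch. It assumes the project's Distance_km and Activity columns; the rows, including the missing value, are made up to show how incomplete records can be dropped before summarizing.

```r
library(dplyr)

# Hypothetical runs, one with a missing distance
running_toy <- data.frame(
  Date        = as.Date(c("2023-03-01", "2023-03-04", "2023-03-07", "2023-03-10")),
  Distance_km = c(5.2, 8.0, NA, 3.1),
  Activity    = c("Run", "Run", "Run", "Trail Run")
)

activity_summary <- running_toy %>%
  filter(!is.na(Distance_km)) %>%        # drop runs with no recorded distance
  group_by(Activity) %>%                 # one group per activity type
  summarise(runs     = n(),              # number of runs in the group
            total_km = sum(Distance_km), # total distance in the group
            .groups  = "drop") %>%
  arrange(desc(total_km))                # largest total first

print(activity_summary)
```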

Data Visualization: Data visualization is a critical step in exploring running data, as it helps to understand patterns, trends, and relationships in the data. R provides several packages for creating various types of visualizations, such as ggplot2, plotly, and leaflet. These packages offer a wide range of plot types, including scatter plots, bar charts, line charts, heatmaps, and maps, which can be customized with various aesthetics, colors, and themes.

Summary of Visualizations:

Scatter Plot: Scatter plots are useful for visualizing the relationship between two continuous variables, such as distance and time or distance and pace. They can help identify patterns, such as correlations or trends, and outliers in the data. Scatter plots can be created using the ggplot2 package in R, with the geom_point() function, which allows customization of point size, shape, color, and transparency.

Bar Chart: Bar charts are useful for comparing values across categories, such as the number of runs by day of the week, month, or year. In ggplot2, geom_bar() counts the rows in each category by default, while geom_col() (or geom_bar(stat = "identity")) draws bars whose heights are values already in the data, as in the elevation gain chart here. Bar charts can be customized with different bar widths, colors, labels, and themes.
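The runs-by-weekday case mentioned above can be sketched as follows. The dates are hypothetical; the weekday is derived with format(..., "%u"), which returns 1 for Monday through 7 for Sunday regardless of locale, and geom_bar() then counts the runs per weekday.

```r
library(ggplot2)

# Hypothetical run dates
run_dates <- as.Date(c("2023-03-01", "2023-03-04", "2023-03-06",
                       "2023-03-08", "2023-03-11"))

# ISO weekday number: 1 = Monday ... 7 = Sunday
weekday_num <- as.integer(format(run_dates, "%u"))

runs <- data.frame(
  Weekday = factor(weekday_num, levels = 1:7,
                   labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
)

weekday_chart <- ggplot(runs, aes(x = Weekday)) +
  geom_bar(fill = "lightblue") +      # default stat counts rows per weekday
  scale_x_discrete(drop = FALSE) +    # keep weekdays that have no runs
  labs(title = "Runs by Day of the Week",
       x = "Day of the week",
       y = "Number of runs") +
  theme_minimal()

print(weekday_chart)
```

Storing the weekday as an ordered factor keeps the bars in calendar order instead of alphabetical order.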

Line Chart: Line charts are useful for visualizing trends or changes in data over time, such as the distance or pace of runs over weeks, months, or years. They can provide insights into patterns, seasonality, or changes in performance. Line charts can be created in ggplot2 with geom_line(), which allows customization of line color, width, and style; the line chart in this project is built with plotly's plot_ly() using mode = "lines".
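For comparison with the plotly version, a ggplot2 line chart could be sketched as below. The data frame is a hypothetical stand-in with the project's Date and Distance_km columns, and the styling choices are illustrative only.

```r
library(ggplot2)

# Hypothetical runs spaced three days apart
running_toy <- data.frame(
  Date        = as.Date("2023-03-01") + c(0, 3, 6, 9, 12),
  Distance_km = c(4, 5, 4.5, 6, 7)
)

line_chart_gg <- ggplot(running_toy, aes(x = Date, y = Distance_km)) +
  geom_line(color = "blue", linewidth = 1, linetype = "solid") +  # color, width, style
  geom_point(color = "blue") +                                    # mark each run
  scale_x_date(date_labels = "%b %d") +                           # e.g. "Mar 01"
  labs(title = "Running Distance by Date",
       x = "Date",
       y = "Distance (km)") +
  theme_minimal()

print(line_chart_gg)
```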

Pie Charts: Pie charts are commonly used to visualize categorical data where the categories represent parts of a whole or percentages of a total. They are useful for displaying data with a small number of categories and are particularly effective in showing the relative proportions or percentages of different categories within a dataset.
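The parts-of-a-whole idea can be checked numerically before drawing anything. The sketch below sums distance per activity and converts the totals to percentages with base R; the activity labels and distances are hypothetical.

```r
# Hypothetical runs with an activity label
running_toy <- data.frame(
  Activity    = c("Run", "Run", "Trail Run", "Walk"),
  Distance_km = c(5, 10, 3, 2)
)

# Total distance per activity, then each activity's share of the overall total
totals <- tapply(running_toy$Distance_km, running_toy$Activity, sum)
shares <- round(100 * prop.table(totals), 1)

print(shares)

# Base R pie chart labelled with the computed percentages
pie(as.numeric(shares),
    labels = paste0(names(shares), " (", shares, "%)"),
    main   = "Share of Distance by Activity")
```

Printing the shares alongside the chart is useful because small differences between slices are hard to judge by eye.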

Questions:

1. What is the trend of running distance over time as shown in the line chart? Are there any noticeable patterns or changes in the distance covered over the specified time period?

Answer: The line chart shows running distance for each date in March. Distance generally increases over the month, though not steadily: several peaks and valleys mark runs that were noticeably longer or shorter than their neighbors.
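One way to put a number on a visual trend like this is to fit a straight line to distance against elapsed days. The sketch below does that with lm() on made-up values; it is an illustration of the approach, not a result from the actual March data.

```r
# Hypothetical runs spaced three days apart
running_toy <- data.frame(
  Date        = as.Date("2023-03-01") + c(0, 3, 6, 9, 12),
  Distance_km = c(4, 5, 4.5, 6, 7)
)

# Days elapsed since the first run, used as the predictor
running_toy$day <- as.numeric(running_toy$Date - min(running_toy$Date))

# Least-squares line: Distance_km = intercept + slope * day
trend_fit <- lm(Distance_km ~ day, data = running_toy)
slope_km_per_day <- coef(trend_fit)[["day"]]

# A positive slope means distance tends to grow over the period
print(slope_km_per_day)
```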

2. How does the elevation gain vary by date as shown in the bar chart? Are there any dates with significantly higher or lower elevation gains compared to others?

Answer: Elevation gain varies noticeably from date to date, with some runs climbing considerably more than others. This suggests that the runs followed routes with different elevation profiles.
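To decide which dates count as unusually high, one common convention is the box plot rule: flag values above the third quartile plus 1.5 times the interquartile range. The sketch below applies that rule to hypothetical elevation gains; the threshold is a choice made for this example, not something taken from the project.

```r
# Hypothetical elevation gains for six consecutive days
running_toy <- data.frame(
  Date            = as.Date("2023-03-01") + 0:5,
  ElevationGain_m = c(30, 35, 28, 40, 150, 32)
)

# First and third quartiles of elevation gain
q <- quantile(running_toy$ElevationGain_m, c(0.25, 0.75))

# Box plot rule: anything above Q3 + 1.5 * IQR is flagged as unusually high
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
high_gain   <- running_toy[running_toy$ElevationGain_m > upper_fence, ]

print(high_gain)
```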

3. Is there any relationship between running speed and distance covered as depicted in the scatter plot? Does the scatter plot show any trend or pattern in the data? Are there any outliers or interesting observations in the scatter plot?

Answer: The scatter plot shows no strong linear relationship between running speed and distance. There is, however, a loose tendency toward higher speeds on shorter runs and lower speeds on longer ones. A few outliers also stand out, which could be runs where my speed was markedly different from other runs of similar distance.
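A correlation coefficient is a simple way to back up a reading like this. The sketch below computes Pearson (linear) and Spearman (rank-based) correlations with cor() on invented values in which speed falls as distance grows; the numbers are illustrative and are not the project's results.

```r
# Hypothetical runs: faster on short runs, slower on long ones
running_toy <- data.frame(
  Speed       = c(12, 11, 10.5, 10, 9),
  Distance_km = c(3, 5, 6, 8, 12)
)

# Pearson measures the strength of a straight-line relationship
r_pearson <- cor(running_toy$Speed, running_toy$Distance_km)

# Spearman uses ranks, so it captures any consistently rising or falling pattern
r_spearman <- cor(running_toy$Speed, running_toy$Distance_km,
                  method = "spearman")

print(c(pearson = r_pearson, spearman = r_spearman))
```

A Spearman value well below zero alongside a weaker Pearson value would match the pattern described in the answer: a downward tendency that is not strictly linear.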

4. What is the distribution of running distances as shown in the histogram? What is the most common range of distances covered on the histogram?

Answer: The histogram shows the distribution of running distances. It appears that the most common range of distances covered is between 0 and 5 kilometers, as indicated by the peak in the histogram.
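The most common range can also be read off a table of bin counts rather than from the picture. The sketch below groups hypothetical distances into 5 km bins with cut() and picks the fullest bin; the 5 km width is chosen to match the range quoted in the answer, whereas the histogram itself uses 1 km bins.

```r
# Hypothetical run distances in km
distance_km <- c(2.5, 3.1, 4.0, 4.8, 5.2, 6.5, 8.0, 10.3)

# 5 km bins: [0,5), [5,10), [10,15); right = FALSE makes each bin include its lower edge
breaks <- seq(0, 15, by = 5)
bins   <- cut(distance_km, breaks = breaks, right = FALSE)

# Count the runs in each bin and pick the bin with the most runs
bin_counts  <- table(bins)
most_common <- names(bin_counts)[which.max(bin_counts)]

print(bin_counts)
print(most_common)
```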

---
title: "Course Project for ANLY 512: Data Visualization"
author: "Chirag Shetty"
date: "`r Sys.Date()`"
output:
  flexdashboard::flex_dashboard:
    orientation: rows
    source_code: embed
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r include=FALSE}
install.packages('plyr', repos = "http://cran.us.r-project.org")
options(repos = list(CRAN="http://cran.rstudio.com/"))
install.packages("devtools")
install.packages("treemap")
library(dplyr)
library(knitr)
library(kableExtra)
library(ggplot2)
library(readxl)
library(tidyr)
library(vcd)
library(devtools)
library(car)
library(plotly)
library(treemap)
library(reshape2)
```

# Running Data

Row {data-height=400}
-------------------------------------

###

```{r include=FALSE}
install.packages("Rtools")
install.packages('htmltools')
install.packages("plotly")
library(knitr)
library(ggplot2)
library(readxl)
library(shiny)
library(plotly)
```

# Load running data from Excel file

```{r 33}
running_data <- read_excel("running_data_March.xlsx")
```

library(ggplot2)

# Create a line chart

```{r line_chart, fig.show='asis'}
library(plotly)

# Create a line chart
line_chart <- plot_ly(running_data, x = ~Date, y = ~Distance_km, type = "scatter", mode = "lines",
                      line = list(color = "blue", width = 2)) %>%
  layout(title = "Running Distance by Date",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Distance_km"))

line_chart
```

library(ggplot2)

# Create a bar chart

```{r 3}
bar_chart <- ggplot(running_data, aes(Date, ElevationGain_m)) +
  geom_bar(stat = "identity", fill = "lightblue", linewidth = 0.1) +
  labs(title = "Elevation Gain by Date",
       x = "Date",
       y = "ElevationGain_m") +
  theme_minimal()

# Display the bar chart
print(bar_chart)
```

# Create a scatter plot

```{r 4}
# Create a scatter plot
scatter_plot <- ggplot(running_data, aes(Speed, Distance_km)) +
  geom_point(color = "darkblue", size = 3) +
  labs(title = "Running Speed vs. Distance",
       x = "Speed",
       y = "Distance_km") +
  theme_minimal()

print(scatter_plot)
```

library(ggplot2)

# Create a histogram

```{r 5}
# Assuming you have already loaded the running_data dataset into your R session

# Load required package for data visualization
library(ggplot2)

# Create histogram
ggplot(running_data, aes(x = Distance_km)) +
  geom_histogram(binwidth = 1, color = "black", fill = "skyblue") +
  labs(title = "Distance Distribution",
       x = "Distance (in km)",
       y = "Count") +
  theme_minimal()
```

library(plotly)

```{r pie_Chart}
# Create a pie chart
pie_chart <- plot_ly(running_data, labels = ~Activity, values = ~Distance_km, type = "pie",
                     marker = list(colors = c("lightblue", "lightgreen", "red"))) %>%
  layout(title = "Percentage of Running Distance by Activity")

pie_chart
```

# Summary

**Course Project for ANLY 512: Data Visualization**

**The Quantified Self**

* Data Source: **Strava**
* Data Type: **Running data analyzed for March**

**Write-up**

**Introduction:** Running is a popular form of exercise and a widely practiced sport. With the availability of modern fitness tracking devices and apps, runners can easily collect data on their running activities, such as distance, time, pace, and more. This data can be used to analyze performance, set goals, and gain insights into running patterns.

Data Acquisition: The first step in analyzing running data is to acquire the data. This can be done through various sources, such as fitness tracking apps, smartwatches, or external sensors. The data can be in different formats, such as CSV, Excel, or JSON. Once the data is obtained, it can be loaded into R for further analysis.

Data Storage: In R, data can be stored in various data structures, such as data frames, lists, or arrays. Data frames are commonly used for storing tabular data, where each column can represent a variable or attribute, and each row represents an observation. Data frames provide flexibility for data manipulation and visualization in R, and are often used for handling running data.

Data Manipulation: Data manipulation involves cleaning, transforming, and summarizing data to make it suitable for analysis and visualization. R provides numerous packages, such as dplyr, tidyr, and lubridate, that offer powerful tools for data manipulation. These packages allow tasks such as filtering, sorting, aggregating, and merging data, as well as handling missing or inconsistent data.

Data Visualization: Data visualization is a critical step in exploring running data, as it helps to understand patterns, trends, and relationships in the data. R provides several packages for creating various types of visualizations, such as ggplot2, plotly, and leaflet. These packages offer a wide range of plot types, including scatter plots, bar charts, line charts, heatmaps, and maps, which can be customized with various aesthetics, colors, and themes.

Summary of Visualizations:

Scatter Plot: Scatter plots are useful for visualizing the relationship between two continuous variables, such as distance and time or distance and pace. They can help identify patterns, such as correlations or trends, and outliers in the data. Scatter plots can be created using the ggplot2 package in R, with the geom_point() function, which allows customization of point size, shape, color, and transparency.

Bar Chart: Bar charts are useful for visualizing categorical data, such as the distribution of runs by day of the week, month, or year. They can provide insights into patterns or trends in the data, and can be created using the ggplot2 package in R, with the geom_bar() function. Bar charts can be customized with different bar widths, colors, labels, and themes.

Line Chart: Line charts are useful for visualizing trends or changes in data over time, such as the distance or pace of runs over weeks, months, or years. They can provide insights into patterns, seasonality, or changes in performance. Line charts can be created using the ggplot2 package in R, with the geom_line() function, which allows customization of line color, width, and style.

Pie Charts: Pie charts are commonly used to visualize categorical data where the categories represent parts of a whole or percentages of a total. They are useful for displaying data with a small number of categories and are particularly effective in showing the relative proportions or percentages of different categories within a dataset.

**Questions:**

1. What is the trend of running distance over time as shown in the line chart? Are there any noticeable patterns or changes in the distance covered over the specified time period?

Answer: The line chart shows the trend of running distance over time. It appears that the distance covered has generally increased over the specified time period, with some fluctuations. There are a few peaks and valleys in the line chart, indicating variations in the running distance over time.

2. How does the elevation gain vary by date as shown in the bar chart? Are there any dates with significantly higher or lower elevation gains compared to others?

Answer: The bar chart shows the variation in elevation gain by date. Some dates have higher elevation gains compared to others, while some dates have lower elevation gains. This could indicate that the activities have different elevation profiles on different dates.

3. Is there any relationship between running speed and distance covered as depicted in the scatter plot? Does the scatter plot show any trend or pattern in the data? Are there any outliers or interesting observations in the scatter plot?

Answer: The scatter plot shows the relationship between running speed and distance covered. It appears that there is no strong linear relationship between the two variables in the scatter plot. However, there seems to be a general trend of higher speeds for shorter distances and lower speeds for longer distances. There are also a few outliers in the scatter plot, which could represent instances where I had significantly different speeds for certain distances.

4. What is the distribution of running distances as shown in the histogram? What is the most common range of distances covered on the histogram?

Answer: The histogram shows the distribution of running distances. It appears that the most common range of distances covered is between 0 and 5 kilometers, as indicated by the peak in the histogram.