library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(trelliscopejs)
## This package is no longer maintained. Please use the 'trelliscope' package instead (see https://github.com/trelliscope/).
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

Graph 1 (base R)

Statement: For this graph, I will be using a dataset of sales information for multiple Walmart store locations. My goal is to investigate weekly sales trends in August and September 2010, specifically for Store 1. I selected a line plot because it is well suited to showing changes from week to week within the two-month period. The line will allow us to identify any noticeable increases, decreases, or consistent patterns in weekly sales.

walmart <- read.csv("Walmart_Sales.csv")

# convert date column
walmart$Date <- as.Date(walmart$Date, format = "%d-%m-%Y")


# subset for only store 1
walmart_subset1 <- walmart[walmart$Store == 1, ]

# subset for only August and September in 2010 
walmart_subset1 <- walmart_subset1[walmart_subset1$Date >= as.Date("2010-08-01") & walmart_subset1$Date <= as.Date("2010-09-30"), ]

# make plot
plot(walmart_subset1$Date, walmart_subset1$Weekly_Sales,
     type = "l",          
     col = "maroon4", 
     lwd = 2,
     xlab = "Date",
     ylab = "Weekly Sales",
     main = "Weekly Sales for Store 1", 
     sub = "(Year: 2010)",
     ylim = c(min(walmart_subset1$Weekly_Sales), max(walmart_subset1$Weekly_Sales)),
     xaxt = "n")

# add new x-axis labels
axis(1, at = walmart_subset1$Date, labels = format(walmart_subset1$Date, "%b %d"))

Discussion: This graph made it possible to compare the weekly sales for Store 1 across August and September 2010. I am happy with my choice of a line plot because it makes increases, decreases, and spikes in weekly sales easy to see. There appears to be a downward trend in weekly sales after the week of September 10. Sales in the weeks of September 17 and September 24 are both lower than in any other week in the period. It is also worth noting that the highest weekly sales in the period were recorded in the week of August 6. I would be curious to see whether July 2010 had weekly sales around this amount. Overall, I would expect there to be a reason why weekly sales were greater in August than in September. For instance, we could hypothesize that sales are lower in September because August is back-to-school season, which would likely affect Walmart. The visualization answered my investigative statement: the line shows how weekly sales for Store 1 changed over the two months.
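The peaks and dips read off the line plot can also be checked numerically. The following is a minimal sketch that uses a small made-up data frame; the `toy` values are hypothetical and are not the Walmart figures. With the real data, `walmart_subset1` would take the place of `toy`.

```r
# Hypothetical weekly sales; NOT taken from Walmart_Sales.csv
toy <- data.frame(
  Date = as.Date(c("2010-08-06", "2010-08-13", "2010-08-20", "2010-08-27")),
  Weekly_Sales = c(1600, 1500, 1510, 1450)
)

# weeks with the highest and lowest sales
peak_week <- toy$Date[which.max(toy$Weekly_Sales)]
low_week  <- toy$Date[which.min(toy$Weekly_Sales)]

# week-to-week change; negative values mark a drop from the previous week
weekly_change <- diff(toy$Weekly_Sales)

format(peak_week, "%b %d")
format(low_week, "%b %d")
weekly_change
```

Widening the date filter to start on 2010-07-01 and rerunning the same lines would be one way to follow up on the July question.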

Graph 2 (ggplot graph)

Statement: I thought that it would be interesting to investigate the counts of popular baby names. To do this, I used the babynames data set from the babynames R package. I chose the names Madison and Emma because these two names seem to be quite popular for people my age. I opted to look specifically at the years 2005 to 2010. I created a side-by-side bar plot in ggplot to make it easy to compare the counts. This graph should make it possible to see which name was more popular in a given year and how their relative counts shift over time.

# install.packages("babynames")
library(babynames)
# View(babynames)

# subset data for females named Emma and Madison in years 2005-2010
pop_names <- babynames %>%
  filter(name == "Emma" | name == "Madison",
         year >= 2005,
         year <= 2010,
         sex == "F")  

# create side-by-side bar plot
ggplot(pop_names, aes(x = factor(year), y = n, fill = name)) + # change to factor(year) for appropriate axis labels
  geom_col(position = "dodge") + 
  labs(title = "Comparing Counts of 2 Popular Names",
       subtitle = "From 2005-2010",
       x = "Year", 
       y = "Count", 
       fill = "Names"
       ) +
  scale_fill_manual(values = c( "navyblue", "skyblue"))

Discussion: In the end, I was happy with how this visualization turned out. The side-by-side bar plot was a good fit because it made comparing the counts of the two names intuitive and simple. We can clearly see that the dark blue bar is higher in every year from 2005 to 2010. This indicates that there were more babies named “Emma” in each year. In 2006 and 2007, the two bars are noticeably closer in height than in any other year graphed, which shows that the names had similar popularity in those years. For the name “Madison”, the light blue bar gets shorter year after year, so we can conclude that the name was declining in popularity. I would be interested to see this graph for the years 2000 to 2005. I wonder if “Madison” was ever a more popular name than “Emma”. Overall, there is a lot that can be identified from this visualization. I believe that I succeeded in comparing the counts of the baby names “Emma” and “Madison” over the six-year span.
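To put numbers on how close the bars are, the per-year gap between the two names can be computed directly. This is a rough sketch on hypothetical counts; `toy_names` is made up for illustration and only mimics the `year`, `name`, and `n` columns of babynames. With the real data, `pop_names` would be passed to `xtabs()` instead.

```r
# Hypothetical counts; NOT the real babynames values
toy_names <- data.frame(
  year = c(2005, 2005, 2006, 2006),
  name = c("Emma", "Madison", "Emma", "Madison"),
  n    = c(200, 190, 180, 178)
)

# one row per year, one column per name
wide <- xtabs(n ~ year + name, data = toy_names)

# how many more babies were named Emma than Madison in each year
gap <- wide[, "Emma"] - wide[, "Madison"]
gap
```

A small positive gap corresponds to years where the two bars are nearly the same height, and a negative gap would mean “Madison” was the more common name that year.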

Graph 3 (trelliscope)

Statement: I wanted to investigate the relationship between final exam scores and total score in the course. To do this, I found a data set on Kaggle that included these variables. The trelliscope graph I created can be used to analyze the relationship between these two variables across different departments. The display makes it easy to compare the pattern and strength of the relationship from one department to the next.

student <- read.csv("Students_Performance.csv")

## Graph 
ggplot(student, aes(x = Final_Score, y = Total_Score)) +
  geom_point(alpha = 0.5, color = "navyblue") +
  labs(x = "Final Score",
       y = "Total Score") +
  # one panel per department
  facet_trelliscope(~ Department,
                    name = "Final Score vs. Total Score",
                    desc = "Faceted by Department",
                    nrow = 2, ncol = 3,
                    scales = c("same", "same"),
                    self_contained = TRUE,
                    path = "."
                    )
## using data from the first layer

Discussion: To compare the relationship between final exam score and total score in the class, I decided it would be best to look at specific departments within the school. For this reason, I faceted on the Department variable. As a result, the trelliscope graph contains four scatterplots, one per department. In each individual plot, Final Score (grade on the exam) and Total Score (grade in the course) have a somewhat strong positive correlation. Students who perform better on the final exam tend to perform better in the class overall. The positive correlation did not surprise me. However, I did find it interesting that this was the case for all departments. This visualization was helpful for displaying the relationship between the two selected variables.
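The strength of the relationship in each panel can be summarized with a per-department correlation coefficient. The sketch below assumes the same column names as the `student` data frame, but the rows and the department labels "CS" and "Math" are hypothetical and are not taken from Students_Performance.csv.

```r
# Hypothetical scores; NOT taken from Students_Performance.csv
toy_students <- data.frame(
  Department  = c("CS", "CS", "CS", "Math", "Math", "Math"),
  Final_Score = c(60, 70, 80, 55, 65, 90),
  Total_Score = c(62, 71, 79, 60, 63, 88)
)

# Pearson correlation between final exam score and total score, per department
cor_by_dept <- sapply(
  split(toy_students, toy_students$Department),
  function(d) cor(d$Final_Score, d$Total_Score)
)
round(cor_by_dept, 2)
```

Values near 1 would back up the "somewhat strong positive" reading of a panel, while values near 0 would suggest the upward pattern is weaker than it looks.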

Graph 4 (plotly)

Statement: For my plotly graph, I used a data set from Kaggle that contained information about prices of fruits and vegetables. My goal for the graph was to identify trends between farm price and New York retail price over the years 1999 to 2019. I hypothesized that prices would be considerably higher for New York retail and was hoping to highlight this through my visualization. The animation in this graph makes it possible to see how the relationship shifts over time.

food <- read.csv("fruit.csv")

food$date <- as.Date(food$date, format = "%Y-%m-%d")
food$year <- as.numeric(format(food$date, "%Y"))

p_anim <- food %>% 
  plot_ly(x = ~farmprice, y = ~newyorkretail,
          hoverinfo = "text",
          text = ~ paste("Farm Price: ", farmprice,
                         "<br> New York Retail Price: ", newyorkretail)) %>% 
  add_markers(frame = ~year, 
              marker = list(opacity = 0.6))

p_anim %>% 
  animation_opts(frame = 3000,
                 transition = 500) %>%
  animation_slider(hide = FALSE,
                   currentvalue = list(prefix = NULL,
                                       font = list(size = 25, color = "gray6"))) %>%
  layout(legend = list(title = list(text = "Category")), 
         xaxis = list(title = "Farm Price (USD)", nticks = 15), # change x-axis labels
         yaxis = list(title = "New York Retail Price (USD)", nticks = 15), # change y-axis labels
         title = "Farm Price vs. New York Retail Price")

Discussion: This visualization compares Farm Price, in USD, to New York Retail Price, in USD. The animated graph was a good choice because it allows us to analyze the relationship over a period of time. For the majority of years there is a clear positive correlation between the two variables: as Farm Price increases, New York Retail Price also increases. I expected this when setting out to create the graph. Another point worth noting is that the values on the y-axis are greater overall than the values on the x-axis. This suggests that the New York Retail Price is generally greater than the Farm Price, which is consistent with my hypothesis. Watching the animation over the course of the years, we can see the data points shift to the right, indicating that the products are becoming more expensive over time. This can be explained by inflation: a dollar in 2019 was not worth as much as a dollar in 1999. On the whole, this visualization allowed me to analyze the relationship between the variables in an easy-to-read manner.
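Two of the observations above, that retail prices sit above farm prices and that points drift right over time, can be checked with a couple of summary statistics. This is only a sketch on hypothetical prices; `toy_food` is made up and reuses the `year`, `farmprice`, and `newyorkretail` column names from the `food` data frame.

```r
# Hypothetical prices; NOT taken from fruit.csv
toy_food <- data.frame(
  year          = c(1999, 1999, 1999, 2019, 2019, 2019),
  farmprice     = c(1.0, 2.0, 3.0, 2.0, 3.0, 4.0),
  newyorkretail = c(2.5, 3.5, 5.0, 4.0, 2.5, 7.0)
)

# share of observations where the retail price exceeds the farm price
share_retail_higher <- mean(toy_food$newyorkretail > toy_food$farmprice)

# median farm price per year, to check whether the points drift right over time
median_farm_by_year <- tapply(toy_food$farmprice, toy_food$year, median)

share_retail_higher
median_farm_by_year
```

On the real data, a share below 1 would mean the retail price is not higher in every single observation, and missing prices would need `na.rm = TRUE` in both summaries.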