Question 1

The data set BusinessData.csv shows sales data for Company X, which sells guitars. Suppose you work for Company X, and you are tasked with comparing the sales data across each of their business locations. The person who collected and disseminated the data did their best, but they (like many people) were drastically underpaid didn’t do the best job aggregating the data. Construct a graphic that best compares each business location’s success since launch. Then, interpret this graphic and rank each location’s success based on a meaningful, quantitative metric.

The final graphic that I have constructed is a bar graph showing the average daily revenue for each business location based on the days that are open. The quantitative metric that I have chosen to measure a market’s success is having each location’s average revenue taken on a day-by-day basis. However, I determined this to be a meaningful metric after creating graphs using different metrics.

I decided to test if the growth rates (simple growth rate and average growth rate) of the revenue would be a good metric to measure the success of the businesses. Because the data provided was in disarray, I mutated the business data to remove the commas and the DC/D.C. to combine the Market_Name data, then sorted the data into month and year. I then calculated for the total revenue over the total orders (or sales) to find the revenue generated per month with respect to the sales and graphed by month. The extra graph about total revenue over total sales shows the length of businesses and the fluctuations in the income. I then used the length of the youngest business (2 months) to calculate the growth rate and compound growth rate for every location’s initial two months to have a fair measure of their success in terms of how they started, which is shown in my extra graphs below. However, two months is rather short and there is value in comparing only the businesses that have been open for more than 2 months, and I found that even though growth rate may be a meaningful metric, that there are a few businesses that have very large discrepancies between their first month and their second month, which may suggest other factors that are not accounted for when calculating the revenue.

Now, I am basing success on the ranking of the total revenue for each market over the number of days open (n). This is a more relevant metric because it is accounting for the revenue earned for the days that each of the stores are open. I am also basing it off of revenue and not sales because the business sells guitars, which fluctuate in price based on the type and quality of guitar; so in this case, I find it more important to focus on the total revenue earned rather than the number of guitars sold. In this way, the average daily revenue can be measured for each business, regardless of the time it has been in operation.

When using this specific graph, the average daily revenue in US dollars is on the y-axis with the market locations on the x-axis. With this metric of ranking, the most successful businesses are Washington DC, Phoenix, Chicago, followed by Orlando, Tampa, Baltimore, San Antonio, Miami, Tucson, and Philadelphia. Washington DC generates the most revenue because it has been in operation for 366 days, with 336 of those days being open days. Philadelphia generated the least revenue, even though it was open for more days than San Antonio. For the days that San Antonio was open, the guitars they sold were priced much higher than the guitars sold in Philadelphia; it is possible that the San Antonio location sold more electric guitars, for example, whereas the Philadelphia location sold more acoustic guitars.

A few errors or factors that need to be taken into consideration are that there may be rounding errors when looking at the calculated average daily revenue and in the total revenue over total sales graph, the operation length of San Antonio, TX has a large gap between the first open day in June 2018 and the second open day in January 2019. This can be a particular problem when calculating the average daily revenue over the length of operation and not only the open days because there could be a mismatch of the actual revenue earned due to there being so many days without any revenue earned. San Antonio, TX may have closed business for a few months or moved locations, or some other speculation.

1 - Washington DC, 2 - Phoenix, AZ, 3- Chicago, IL, 4 - Orlando, FL, 5 - Tampa, FL, 6 - Baltimore, MD, 7 - San Antonio, TX, 8 - Miami, FL, 9 - Tucson, AZ, 10 - Philadelphia, PA.

#Naming the [2] and [3] columns
colnames(business)[2] <- "Market_Name"
colnames(business)[3] <- "Order_Count"
#Looking at growth rate/compound growth rate

#Removing the commas, DC, and D.C. from
#Washington to combine the location data
business_data <- business |>
  mutate(Market_Name = str_remove_all(Market_Name, ",")) |>
  mutate(Market_Name = str_remove_all(Market_Name, "DC")) |>
  mutate(Market_Name = str_remove_all(Market_Name, "D.C."))

#Sorting the data by month and year
month_year <- business_data |>
  group_by(Market_Name, month = lubridate::floor_date(Day, 'month')) |>
  summarize(sum_sales = sum(Order_Count),
            sum_revenue = sum(Revenue)) |>
  mutate(Day = format(as.Date(month), format = "%b-%Y"))

## `summarise()` has grouped output by 'Market_Name'. You can override using the
## `.groups` argument.

#Finding total revenue over total sales
revenue_per_sales <- month_year |>
  group_by(Market_Name) |>
  mutate(revenue_per_sales = round(sum_revenue / sum_sales, 2))

#Calculating simple growth rate
growth_rate <- revenue_per_sales |>
  arrange(month) |>
  group_by(Market_Name) |>
  summarize(Growth_rate = (diff(sum_revenue)[1]) / sum_revenue[1])

#Calculating compound growth rate for the
#initial two months of every location
compound_growth_rate <- revenue_per_sales |>
  arrange(month) |>
  group_by(Market_Name) |>
  summarize(Compound_Growth_Rate = ((sum_revenue[2]) / sum_revenue[1]) ^
              (1 / 2) - 1)

first_two_months <- revenue_per_sales |>
  group_by(Market_Name) |>
  filter(row_number() <= 1)

rates <-
  left_join(growth_rate, 
            first_two_months, 
            by = "Market_Name") |>
  left_join(compound_growth_rate, 
            first_two_months, 
            by = "Market_Name")

#Making the data in date format
revenue_per_sales$month <-
  as.Date(revenue_per_sales$month)

revenue_per_sales <-
  revenue_per_sales[order(revenue_per_sales$month),]

x_limits <- range(revenue_per_sales$month)

#Plot the data
revenue_per_sales |>
  group_by(Market_Name) |>
  mutate(revenue_per_sales = round(sum_revenue / sum_sales, 2)) |>
  ggplot(aes(x = month, 
             y = revenue_per_sales, 
             color = Market_Name)) +
  geom_point() +
  geom_line() +
  facet_wrap( ~ Market_Name,
              scales = "free_x", 
              ncol = 2) + coord_cartesian(xlim = x_limits) +
  labs(x = "Month", 
       y = "Revenue Over Sales ($)") +
  ggtitle("EXTRA: Total Revenue Over Total Sales Per Location") +
  theme(legend.position = "none")

x_labels_rates <- c(
  "Chicago IL",
  "Phoenix AZ",
  "Washington DC",
  "San Antonio TX",
  "Orlando FL",
  "Baltimore MD",
  "Tampa FL",
  "Philadelphia PA",
  "Tucson AZ",
  "Miami FL"
)

rates |>
  ggplot() +
  geom_col(aes(reorder(Market_Name,-Growth_rate), 
               Growth_rate, fill = Market_Name)) +
  geom_text(aes(label = round(Growth_rate, digits = 2),
                x = Market_Name,
                y = Growth_rate),
            vjust = -0.5,
            size = 3) +
  labs(x = "Location",
       y = "Growth Rate (%)",
       fill = "Business Location") +
  theme(axis.text.x = element_text(angle = 30,
                                   vjust = 0.75,
                                   hjust = 0.75),
        legend.position = "none") +
  scale_x_discrete(labels = x_labels_rates) +
  ggtitle("EXTRA: Initial Two-Month Growth Rates by Business Location")

rates |>
  ggplot() +
  geom_col(aes(reorder(Market_Name,
                       -Compound_Growth_Rate),
               Compound_Growth_Rate,
               fill = Market_Name)) +
  geom_text(aes(label = round(Compound_Growth_Rate, digits = 2),
                x = Market_Name,
                y = Compound_Growth_Rate),
            vjust = -0.5,
            size = 3) +
  labs(x = "Location", 
       y = "Compound Growth Rate (%)") +
  theme(axis.text.x = element_text(angle = 30,
                                   vjust = 0.75,
                                   hjust = 0.75),
        legend.position = "none") +
  scale_x_discrete(labels = x_labels_rates) +
  ggtitle("EXTRA: Initial Two-Month Compound Growth Rates by Business Location")

#Total Revenue and Total Sales
business_total <- business |>
  group_by(Market_Name) |>
  mutate(Market_Name = str_remove_all(Market_Name, ",")) |>
  mutate(Market_Name = str_remove_all(Market_Name, "DC")) |>
  mutate(Market_Name = str_remove_all(Market_Name, "D.C.")) |>
  summarize(Total_Revenue = sum(Revenue),
            Total_Sales = sum(Order_Count))

#Days that each market was open
days_open <- business |>
  group_by(Market_Name) |>
  mutate(Market_Name = str_remove_all(Market_Name, ",")) |>
  mutate(Market_Name = str_remove_all(Market_Name, "DC")) |>
  mutate(Market_Name = str_remove_all(Market_Name, "D.C.")) |>
  count(Market_Name)

first_day_open <- business |>
  group_by(Market_Name) |>
  mutate(Market_Name = str_remove_all(Market_Name, ",")) |>
  mutate(Market_Name = str_remove_all(Market_Name, "DC")) |>
  mutate(Market_Name = str_remove_all(Market_Name, "D.C.")) |>
  summarize(first_open_date = min(Day))

last_day_open <- business |>
  group_by(Market_Name) |>
  mutate(Market_Name = str_remove_all(Market_Name, ",")) |>
  mutate(Market_Name = str_remove_all(Market_Name, "DC")) |>
  mutate(Market_Name = str_remove_all(Market_Name, "D.C.")) |>
  summarize(last_open_date = max(Day)) 

first_last_day <- left_join(first_day_open, last_day_open, by = "Market_Name") |>
  left_join(days_open, 
            last_day_open, 
            by = "Market_Name")

#Finding the operation length of each market
days_operated <- first_last_day |>
  group_by(Market_Name) |>
  reframe(operation = floor(last_open_date - first_open_date)) |>
  mutate(operation_length = as.numeric(str_remove_all(operation, "days")))

operation_length <- left_join(days_operated, first_last_day, by = "Market_Name")

Final_Total <- left_join(business_total, 
                         operation_length, 
                         by = "Market_Name")

ranked_data <- Final_Total |>
  group_by(Market_Name) |>
  mutate(ranked = Total_Revenue/operation_length) |>
  arrange(-ranked)

#Where n is the number of days open
only_days_open <- Final_Total |>
  group_by(Market_Name) |>
  mutate(ranked_open = Total_Revenue/n) |>
  arrange(-ranked_open)

x_labels_operation <- c(
  "Washington DC",
  "Phoenix AZ",
  "Chicago IL",
  "Orlando FL",
  "Tampa FL",
  "Baltimore MD",
  "Miami FL",
  "Philadelphia PA",
  "Tucson AZ",
  "San Antonio TX"
)

ranked_data |>
  group_by(Market_Name) |>
  ggplot() +
  geom_col(aes(reorder(Market_Name, 
                       -ranked), 
               ranked, 
               fill = Market_Name)) +
  geom_text(aes(label = round(ranked, digits = 2), 
                x = Market_Name, 
                y = ranked), 
            vjust = -0.5, 
            size = 3) +
  xlab("Location") + ylab("Average Daily Revenue ($)") +
  theme(axis.text.x = element_text(angle = 30, 
                                   vjust = .9, 
                                   hjust = 0.9), legend.position = "none") +
  scale_x_discrete(labels = x_labels_operation) +
  ggtitle("EXTRA: Average Daily Revenue Per Business Location (Operation Length)")

x_labels_open <- c(
  "Washington DC",
  "Phoenix AZ",
  "Chicago IL",
  "Orlando FL",
  "Tampa FL",
  "Baltimore MD",
  "San Antonio TX",
  "Miami FL",
  "Tucson AZ",
  "Philadelphia PA"
  )

only_days_open|>
  group_by(Market_Name) |>
  ggplot() +
  geom_col(aes(reorder(Market_Name, 
                       -ranked_open), 
               ranked_open, 
               fill = Market_Name)) +
  geom_text(aes(label = round(ranked_open, digits = 2),
                x = Market_Name, y = ranked_open), 
            vjust = -0.5, size = 3) +
  xlab("Business Location") + ylab("Average Daily Revenue ($)") +
  labs(fill = "Location") +
  theme(axis.text.x = element_text(angle = 25, vjust = 0.75, hjust = 0.85), legend.position = "none") +
  scale_x_discrete(labels = x_labels_open) +
  ggtitle("Average Daily Revenue Per Business Location (Open Days)")

Question 3

There is a longstanding claim that eating peppermints increases brain activity, thereby making students do better on exams. The data set Peppermint.csv contains experimental data from one particular class of students. (The data have been cleaned slightly to make for easier analysis.) Using this data set, determine whether eating peppermints before taking tests leads to higher test scores. Be sure to support your conclusions using the data.

The data have:

– StudentID: An ID given to anonymize students participating in this study

– Test: A number given to each test, where 1 indicates the first test of the semester, and 5 indicates the fifth (and last) test of the semester

– Peppermint: A binary variable indicating whether the student was given a peppermint prior to the test (1) or was not given a peppermint (0)

– Score: The student’s score on the exam, graded on a curve

After analyzing the data and using the data set to create many different visual representations, I have determined that eating peppermints before taking tests does not lead to higher test scores.

First, I used the original data set to create a boxplot of students’ test performance factored by whether or not peppermints were consumed before taking the test, with 0 representing no presence of peppermints, and 1 representing the presence of peppermints. A jitter plot is laid over the boxplot to show the test scores for each test (color-coded as shown). This is a general graph that portrays the overall data. I wanted to create a second graph to portray the overall data, which is shown in the boxplot below, which is faceted by the tests taken. There are also jitter plots layered over the box plots to show the test scores, which I faceted by whether or not peppermints were consumed. These graphs, while interesting, are not insightful for studying the details of the data and coming to a conclusion. On first glance, the students who ate peppermints before taking their tests have higher scores appear to have higher test scores. This is indicated by the first boxplot, where the median for students who did not eat peppermints is approximately a score of 70, whereas the median for students who were given peppermints is about 80, almost ten points higher. The second graph with the five sets of boxplots allows a bit more explanation, but not enough. The discrepancies between the median score for every test except for Test 5 has a large gap. However, after properly studying the data set, it can be seen that each student had a different level of consumption for the peppermints.

This leads me to a deeper dive into the data set. To explore how many peppermints each student ate, I took the data provided and extracted how many peppermints each student ate by grouping by Student ID and summing up the number of peppermints eaten. Then, I left joined the peppermints_eaten data with the original peppermint data set. I created a bar chart that I faceted by Student ID to see the irregularity in the data. Because a few students did not consume peppermints and a select few consumed the maximum number (5 peppermints), this caused a skew in the data to imply that eating peppermints does increase test score results.

After finding out the cause of the discrepancies in the data, I filtered out the students that were not given any peppermints and the students that were given peppermints for all five tests. This ended up being my filtered_peppermint data set, which I then used to create a boxplot. This boxplot shows that with the skewed data removed, the medians are approximately the same, and the boxplot that is faceted by test elaborates upon that. For Test 5 specifically, there is evidence that students that were given the peppermints scored lower than those without, so there is not any concrete evidence that eating peppermints boosts test scores. In addition, this is further shown by the jitter plot at the bottom. Some students received their lowest scores after eating peppermints and some students received their highest score without eating peppermints. I also added the same graph but with the filtered data for further inspection.

Ultimately, whether the students are given peppermints or not does not appear to have a large impact on the students’ academic performance.

#Finding how many peppermints were eaten by each student
peppermints_eaten <- peppermint |>
  group_by(StudentID) |>
  summarize(Peppermints_Eaten = sum(Peppermint))

peppermint_data <- left_join(peppermints_eaten,
                             peppermint,
                             by = "StudentID")

#Filtering out the students who ate only peppermints
#and the students who ate no peppermints
filtered_peppermint <- peppermint_data |>
  group_by(StudentID) |>
  filter(Peppermints_Eaten > 0, Peppermints_Eaten < 5)

ranking_scores <- peppermint_data |>
  group_by(StudentID) |>
  arrange(StudentID, Score)

filtered_ranking <- ranking_scores |>
  group_by(StudentID) |>
  filter(Peppermints_Eaten > 0, Peppermints_Eaten < 5)

#Plot the Peppermint data for the overall graph
peppermint_data |>
  group_by(StudentID) |>
  ggplot() +
  geom_boxplot(aes(x = factor(Peppermint), 
                   y = Score)) +
  geom_jitter(aes(x = factor(Peppermint),
    y = Score,
    color = factor(Test))) +
  labs(x = "Peppermint Consumption",
       y = "Test Score",
       color = "Test") +
  ggtitle("Test Scores Based on Peppermint Consumption")

#Plot the Peppermint data for a general impression
peppermint |>
  group_by(StudentID) |>
  ggplot() +
  geom_boxplot(aes(x = factor(Peppermint), 
                   y = Score)) +
  geom_jitter(aes(x = factor(Peppermint),
    y = Score,
    color = factor(Peppermint))) + 
  facet_wrap( ~ Test) +
  labs(x = "Peppermint Consumption",
       y = "Test Score",
       color = "Peppermint") +
  ggtitle("Students' Performance Per Test Based on Peppermint Consumption")



```r
#This graph shows how many peppermints were eaten by each student
peppermint_data |>
  group_by(StudentID) |>
  ggplot() +
  geom_bar(aes(x = factor(Peppermint),
               fill = Peppermints_Eaten)) + 
  facet_wrap( ~ StudentID) +
  labs(x = "Peppermint Consumption (0: Not Eaten, 1: Eaten)",
       y = "Peppermint Count",
       fill = "Peppermints Eaten") +
  ggtitle("Number of Peppermints Eaten By Each Student")

#After filtering out students who ate no peppermints
#and students who ate only peppermints
filtered_peppermint |>
  group_by(StudentID) |>
  ggplot() +
  geom_boxplot(aes(x = factor(Peppermint), y = Score)) +
  geom_jitter(aes(
    x = factor(Peppermint),
    y = Score,
    color = factor(Test))) +
  labs(x = "Peppermint Consumption (0: Not Eaten, 1: Eaten)",
       y = "Test Score",
       color = "Test") +
  ggtitle("Peppermint Consumption Impact on Students' Test Performance")

filtered_peppermint |>
  group_by(StudentID) |>
  ggplot() +
  geom_boxplot(aes(x = factor(Peppermint),
                   y = Score)) +
  geom_jitter(aes(x = factor(Peppermint),
                  y = Score,
                  color = factor(Test))) + 
  facet_wrap( ~ Test) +
  labs(x = "Peppermint Consumption", 
       y = "Test Score",
       color = "Test") +
  ggtitle("Student Test Scores Based on Peppermint Consumption")

#Find the highest scores of each student and facet based on student ID
#Factor based on whether or not they ate the peppermint
ranking_scores |>
  group_by(Test) |>
  ggplot() +
  geom_jitter(aes(
    x = factor(Peppermint),
    y = Score,
    color = factor(Peppermint))) +
  labs(x = "Peppermint Consumption (0: Not Eaten, 1: Eaten)",
       y = "Test Score") +
  facet_wrap( ~ StudentID) +
  ggtitle("Impact of Peppermint Consumption on Students' Test Scores")

#After filtering out the students who ate no peppermints and
#the students who only ate peppermints
filtered_ranking |>
  group_by(Test) |>
  ggplot() +
  geom_jitter(aes(
    x = factor(Peppermint),
    y = Score,
    color = factor(Peppermint))) +
  labs(x = "Peppermint Consumption (0: Not Eaten, 1: Eaten)",
       y = "Test Score",
       color = "Peppermint Intake") +
  facet_wrap( ~ StudentID) +
  ggtitle("Impact of Peppermint Consumption on Students' Test Performance")

Test 1

2024-03-18