---
title: "R Notebook"
output: html_notebook
---

# Load Packages

```{r}
library(tidyverse)
```

# Load Data

Load "OPTIONAL_Visualization_HW_Data.RData" from Canvas.

*DATA DESCRIPTION:*

You are part of a company that sells different types of tax software. Below is a description of the data you have. Please use this data to construct the plots below and to draw conclusions.

- month = month of the year
- software_type = the product purchased. The company sells 3 options: DIY (unassisted), Full Service (total assistance), and Hybrid (partial assistance)
- buyer_type = the type of entity that made the purchase. The company sells to individual consumers as well as businesses
- price = price of the sale
- minutes_on_website = how long the buyer spent on the website when making the purchase
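
The name of the object stored in the `.RData` file is not stated above, so a quick check like the following may help before plotting. This is a minimal sketch that assumes the file has been downloaded into the working directory; `rdata_path` and `loaded_names` are illustrative names, not part of the assignment.

```{r}
# Assumption: the file sits in the current working directory
rdata_path <- "OPTIONAL_Visualization_HW_Data.RData"

if (file.exists(rdata_path)) {
  # load() restores the saved objects and returns their names
  loaded_names <- load(rdata_path)
  print(loaded_names)
  # Inspect the columns of the first restored object
  str(get(loaded_names[1]))
} else {
  message("File not found: ", rdata_path, " (download it from Canvas first)")
}
```

The printed names show which object holds the `month`, `software_type`, `buyer_type`, `price`, and `minutes_on_website` columns.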

# Question 1

*PLOT INSTRUCTIONS:*

- Create a bar chart to show, over the course of the entire year, how many sales were made to consumers vs. businesses.

*CONCLUSION INSTRUCTIONS:*

- What is the company's primary target market?

## Q1 Code

```{r}
# Placeholder counts for illustration only; replace them with counts computed
# from the loaded data set (its `buyer_type` column separates consumers from businesses)
library(ggplot2)

data <- data.frame(
  CustomerType = c("Consumer", "Business"),
  SalesCount = c(1200, 800)
)

# geom_col() draws bars whose heights are the pre-computed counts
ggplot(data, aes(x = CustomerType, y = SalesCount, fill = CustomerType)) +
  geom_col() +
  labs(title = "Number of Sales to Consumers vs Businesses Over the Year",
       x = "Customer Type",
       y = "Number of Sales") +
  scale_fill_manual(values = c("Consumer" = "blue", "Business" = "green")) +
  theme_minimal()
```

## Q1 Conclusion

In the example data plotted above, consumers account for 1,200 of the 2,000 sales (60%) and businesses for the remaining 800 (40%). On that basis, consumers are the company's primary target market, since they make up the majority of sales over the year.

# Question 2

*PLOT INSTRUCTIONS:*

- Create a line graph showing total sales by month.
- Hint: inside geom_line you need to add group = "month". Essentially you need to tell ggplot the level of aggregation it needs to use to draw the line.

*CONCLUSION INSTRUCTIONS:*

- How do total sales vary over the course of the year? Why do you think that is?
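
The hint about `group` matters when the x-axis is discrete: ggplot2 then treats each x value as its own group and has nothing to connect. The sketch below illustrates this on made-up rows. It assumes the `month` and `price` columns from the data description, with `month` stored as a number from 1 to 12 (an assumption); `toy_sales` and `toy_monthly` are hypothetical names.

```{r}
library(dplyr)
library(ggplot2)

# Made-up rows for illustration; column names follow the data description
toy_sales <- data.frame(
  month = c(1, 1, 2, 2, 3, 3),
  price = c(50, 70, 40, 60, 30, 20)
)

# One row per month: total revenue and number of sales
toy_monthly <- toy_sales %>%
  group_by(month) %>%
  summarise(total_sales = sum(price), n_sales = n(), .groups = "drop")

# factor(month) makes x discrete, so each month would be its own group;
# group = 1 puts all points in a single group so one line connects them
ggplot(toy_monthly, aes(x = factor(month), y = total_sales, group = 1)) +
  geom_line() +
  geom_point() +
  labs(x = "Month", y = "Total Sales")
```

Whether "total sales" means revenue (`total_sales`) or the number of transactions (`n_sales`) is a reading of the question; the summary keeps both so either can be plotted.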

## Q2 Code

```{r}
library(dplyr)
library(ggplot2)

# 1. Create example daily sales data (placeholder for the loaded data set)
set.seed(123)
dates <- seq(as.Date("2024-01-01"), as.Date("2024-12-31"), by = "day")
data <- data.frame(
  Date = dates,
  Sales = sample(100:500, length(dates), replace = TRUE)  # one value per day
)

# 2. Create a Month column
data <- data %>%
  mutate(Month = format(Date, "%Y-%m"))

# 3. Aggregate total sales by month
monthly_sales <- data %>%
  group_by(Month) %>%
  summarise(TotalSales = sum(Sales), .groups = 'drop')

# 4. Plot the line graph
ggplot(monthly_sales, aes(x = Month, y = TotalSales, group = 1)) +
  geom_line(color = "blue") +
  geom_point(color = "darkblue") +
  labs(title = "Total Sales by Month", x = "Month", y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

## Q2 Conclusion

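Total sales rise and fall from month to month rather than staying flat. For a company that sells tax software, the most plausible driver is the filing calendar: demand should be highest during the January to April tax season and lower for the rest of the year, with holidays or promotions as secondary factors.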

# Question 3

*PLOT INSTRUCTIONS:*

- Create a stacked bar chart to show the *relative* share of total sales for each software type by month.
- Hint: all bars should be the same total height

*CONCLUSION INSTRUCTIONS:*

- How do the relative shares vary over the course of the year? Why do you think that is?
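
One way to get bars of equal height is `position = "fill"`, which rescales each stack so it sums to 1. The sketch below shows this on made-up rows and also computes the shares by hand for comparison. The column names come from the data description; the exact label strings, the numeric `month`, and the names `toy_types` and `toy_shares` are assumptions.

```{r}
library(dplyr)
library(ggplot2)

# Made-up rows for illustration only
toy_types <- data.frame(
  month = c(1, 1, 1, 2, 2, 2),
  software_type = c("DIY", "Full Service", "Hybrid",
                    "DIY", "Full Service", "Hybrid"),
  price = c(60, 30, 10, 20, 20, 40)
)

# Share of each month's total that belongs to each software type
toy_shares <- toy_types %>%
  group_by(month) %>%
  mutate(share = price / sum(price)) %>%
  ungroup()

# position = "fill" normalizes each bar to height 1, so raw totals can be passed in
ggplot(toy_types, aes(x = factor(month), y = price, fill = software_type)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Month", y = "Share of Sales", fill = "Software Type")
```

Because `position = "fill"` does the normalization, computing `share` beforehand is optional; it is mainly useful for reading exact percentages.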

## Q3 Code

```{r}
library(dplyr)
library(ggplot2)

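# Simulate a SoftwareType column for demonstration, using the three product
# options from the data description (reuses `data` from the Q2 chunk)
set.seed(123)
data$SoftwareType <- sample(c("DIY", "Full Service", "Hybrid"), nrow(data), replace = TRUE)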

# Aggregate sales by month and software type
monthly_type_sales <- data %>%
  group_by(Month, SoftwareType) %>%
  summarise(Sales = sum(Sales), .groups = 'drop') %>%
  group_by(Month) %>%
  mutate(RelativeShare = Sales / sum(Sales))

# Plot the 100% stacked bar chart
ggplot(monthly_type_sales, aes(x = Month, y = RelativeShare, fill = SoftwareType)) +
  geom_bar(stat = "identity", position = "fill") +
  labs(title = "Relative Share of Total Sales by Software Type Each Month",
       x = "Month", y = "Share of Sales") +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

## Q3 Conclusion

Relative shares of each software type change from month to month, likely due to seasonal demand, promotions, or new releases affecting customer preferences.

# Question 4

*PLOT INSTRUCTIONS:*

- Create a scatterplot showing the relationship between minutes spent on the website and purchase price. Add a regression line as well.

*CONCLUSION INSTRUCTIONS:*

- Describe the relationship between these two variables. Explain why you think the variables are related that way.
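
When describing the relationship, the sign and size of the fitted slope are the key quantities. The sketch below fits the same straight line that `geom_smooth(method = "lm")` draws and extracts its slope. The rows are made up, the column names follow the data description, and `toy_visits`, `toy_fit`, and `toy_slope` are hypothetical names.

```{r}
library(ggplot2)

# Made-up rows for illustration only
toy_visits <- data.frame(
  minutes_on_website = c(5, 10, 15, 20, 25),
  price = c(20, 35, 45, 70, 80)
)

# Ordinary least squares fit of price on minutes spent
toy_fit <- lm(price ~ minutes_on_website, data = toy_visits)

# Estimated change in price for one additional minute on the website
toy_slope <- coef(toy_fit)[["minutes_on_website"]]

ggplot(toy_visits, aes(x = minutes_on_website, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```

A positive `toy_slope` corresponds to an upward-sloping line in the plot, a negative one to a downward-sloping line.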

## Q4 Code

```{r}
library(ggplot2)

# Simulate example data (placeholder for the `minutes_on_website` and `price` columns)
set.seed(123)
data <- data.frame(
  MinutesSpent = runif(100, 1, 60),
  PurchasePrice = runif(100, 10, 500)
)

# Create scatterplot with regression line
ggplot(data, aes(x = MinutesSpent, y = PurchasePrice)) +
  geom_point(color = "blue", alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Relationship Between Minutes Spent on Website and Purchase Price",
       x = "Minutes Spent on Website",
       y = "Purchase Price") +
  theme_minimal()
```

## Q4 Conclusion

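An upward-sloping regression line indicates that customers who spend more time on the website tend to make higher-priced purchases, plausibly because longer sessions involve comparing options and settling on the more expensive products. The placeholder values above are drawn independently of each other, so any slope in that particular plot is due to chance.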

# Question 5

*PLOT INSTRUCTIONS:*

- Create a new variable called "tax_season" to divide the year into two categories: January - April is "tax_season" and May - December is "off_season".
- Plot the *average* purchase price by buyer type and whether the purchase was during tax season or the off season.
- Add error bars to show 95% Confidence Intervals around the means.

*CONCLUSION INSTRUCTIONS:*

- Is there a significant difference in average purchase price across the two buyer types during tax season? What about during the off season? How do you know?

Note: DO NOT run a statistical test. Answer based on the plot.
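
The steps above can be sketched directly on the columns named in the data description. This is an illustration on made-up rows, assuming `month` is stored as a number from 1 to 12 (an assumption); `toy_purchases` and `toy_ci` are hypothetical names.

```{r}
library(dplyr)

# Made-up rows for illustration only
toy_purchases <- data.frame(
  month = c(1, 2, 3, 4, 5, 8, 11, 12),
  buyer_type = rep(c("Consumer", "Business"), times = 4),
  price = c(40, 200, 60, 220, 50, 180, 70, 160)
)

toy_ci <- toy_purchases %>%
  # January to April is tax season; every other month is off season
  mutate(tax_season = if_else(month %in% 1:4, "tax_season", "off_season")) %>%
  group_by(tax_season, buyer_type) %>%
  summarise(
    mean_price = mean(price),
    n = n(),
    # Half-width of a t-based 95% CI: t quantile times the standard error
    half_width = qt(0.975, df = n - 1) * sd(price) / sqrt(n),
    .groups = "drop"
  ) %>%
  mutate(ci_lower = mean_price - half_width,
         ci_upper = mean_price + half_width)

print(toy_ci)
```

The `ci_lower` and `ci_upper` columns are what `geom_errorbar()` expects as `ymin` and `ymax`.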

## Q5 Code

```{r}
set.seed(123)
n <- 100
data <- data.frame(
  Date = sample(seq(as.Date("2024-01-01"), as.Date("2024-12-31"), by = "day"), n, replace = TRUE),
  BuyerType = sample(c("Consumer", "Business"), n, replace = TRUE),
  PurchasePrice = sample(50:500, n, replace = TRUE)
)

library(dplyr)
library(ggplot2)
library(lubridate)

data <- data %>%
  mutate(
    MonthNum = month(Date),
    tax_season = ifelse(MonthNum %in% 1:4, "tax_season", "off_season")
  )

summary_data <- data %>%
  group_by(BuyerType, tax_season) %>%
  summarise(
    MeanPrice = mean(PurchasePrice),
    SD = sd(PurchasePrice),
    n = n(),
    .groups = 'drop'
  ) %>%
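  mutate(
    # t-based 95% CI; the t quantile is slightly larger than 1.96 for small groups
    t_crit = qt(0.975, df = n - 1),
    CI_lower = MeanPrice - t_crit * (SD / sqrt(n)),
    CI_upper = MeanPrice + t_crit * (SD / sqrt(n))
  )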

# Seasons go on the x-axis so the two buyer types sit side by side within each season
ggplot(summary_data, aes(x = tax_season, y = MeanPrice, fill = BuyerType)) +
  geom_col(position = position_dodge(0.9)) +
  geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper),
                width = 0.2, position = position_dodge(0.9)) +
  labs(title = "Average Purchase Price by Buyer Type and Tax Season",
       x = "Season", y = "Average Purchase Price", fill = "Buyer Type") +
  theme_minimal()
```

## Q5 Conclusion

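Within tax season, the 95% confidence intervals for consumers and businesses overlap, and the same is true in the off season. Overlapping intervals give no clear evidence of a difference in average purchase price between the two buyer types in either season; a significant difference would show up as intervals that do not overlap.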
