Libraries

library(tidyverse)
library(ggplot2)
library(plotly)
library(ggrepel)
library(scales)
library(sf)
library(geosphere)

a. Shipments of Household Appliances: Line Graphs. The file ApplianceShipments.csv contains the series of quarterly shipments (in millions of dollars) of US household appliances between 1985 and 1989.

Create a well-formatted time plot of the data using R.

The plot below shows the quarterly shipments of household goods between 1985 and 1989. We see that overall, shipments have increased over the years. Q2 of most year had the most shipments. Q4 of most years had the least shipments.

# load dataset
apps <- read.csv("~/R_Projects/Data Mining/data/ApplianceShipments.csv")
# separate Quarter variable into 2
apps_yq <- separate(apps, Quarter, c("Quarter", "Year"))
# create variable with numerical quarter
apps_yq$Quarter_Num <- sub('.', '', apps_yq$Quarter)
# change order of variables
apps_yq <- select(apps_yq, Quarter_Num, Year, Quarter, Shipments)
# create new variable with year and quarter
apps_yq <-  unite(apps_yq, "Year_Quarter", Year:Quarter_Num, sep = ".", remove = FALSE)
# change to factor
apps_yq$Quarter <- as.factor(apps_yq$Quarter)
# change to numeric
apps_yq <- mutate_if(apps_yq, is.character, as.numeric)

# plot data
apps_yq %>% 
  ggplot(aes(Year_Quarter, Shipments))+ # create ggplot object
  geom_line()+ # add line layer
  geom_label(aes(label = Quarter, color = Quarter), show.legend = F)+ # add label layer
  theme_light()+ # change theme
  labs(x = "Year") + # change x axis label
  ggtitle(label = "Appliance Shipments from 1985 to 1989") # add title

c. Using R, create one chart with four separate lines, one line for each of Q1, Q2, Q3,and Q4. In R, this can be achieved by generating a data.frame for each quarter Q1, Q2, Q3, Q4, and then plotting them as separate series on the line graph. Zoom in to the range of 3500–5000 on the y-axis. Does there appear to be a difference between quarters?

The plot below shows quarterly shipments of household appliances between 1985 and 1989. The Q2 has shown an increase in shipments year to year. Q1 has shown a decrease in appliance shipments year to year.

# Create last data point label
last_label <- apps_yq %>% filter(Year == 1989)
# plot data
apps_yq %>%
  ggplot(aes(Year, Shipments, color = Quarter), show.legend = FALSE) + # create ggplot object
  geom_line() + # add line layer
  geom_text_repel(
    data = last_label,
    aes(label = Quarter),
    nudge_x = 1,
    na.rm = TRUE
  ) +
  # change theme
  theme_light() +
  theme(legend.position = "none") +
  ggtitle(label = "Quarterly Appliance Shipments from 1985 to 1989") # add title

d. Using R, create a line graph of the series at a yearly aggregated level (i.e., the total shipments in each year).

The plot below shows that at a yearly aggregate level, there has been an increase in household appliance shipments. Shipments peaked in 1987, dropped slightly in 1988 and appear to be headed back up.

# prep data
apps_yq <- apps_yq %>% 
  group_by(Year) %>% # group dataframe by year 
   mutate(yearly_ship = sum(Shipments)) %>% # create column showing total shipments
  ungroup()
# plot data
apps_yq %>% 
  ggplot(aes(Year, yearly_ship))+ # create ggplot object
  geom_line()+ # add line layer
  theme_light()+ # change theme
  ylab("Shipments") + # change axis label
  ggtitle(label = "Total Appliance Shipments from 1985 to 1989") # add title

Sales of Riding Mowers: Scatter Plots. A company that manufactures riding mowers wants to identify the best sales prospects for an intensive sales campaign. In particular, the manufacturer is interested in classifying households as prospective owners or nonowners on the basis of Income (in $1000s) and Lot Size (in 1000 ft2). The marketing expert looked at a random sample of 24 households, given in the file RidingMowers.csv.

a. Using R, create a scatter plot of Lot Size vs. Income, color-coded by the outcome variable owner/nonowner. Make sure to obtain a well-formatted plot (create legible labels and a legend, etc.).

The plot below shows the relationship betwen income and lot size. We can see that there is a separation between the Owner and Non Owner groups. The owner group has income and lot size on the upper half on the distribution and the non-owner group has income and lot size on the bottom half of the distribution. Both groups appear to show a slight linear negative relationship between income and lot size.

# load dataset
RidingMowers <- read_csv("~/R_Projects/Data Mining/data/RidingMowers.csv") 
## Rows: 24 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Ownership
## dbl (2): Income, Lot_Size
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# change variable class to factor
RidingMowers$Ownership <- as.factor(RidingMowers$Ownership)
# create ggplot object
ggplot(RidingMowers, aes(Lot_Size, Income, color = Ownership, shape = Ownership))+
  geom_point()+
  geom_smooth(se = F, method = "lm")+ # add point layer
  theme_light()+ # change theme
  xlab("Lot Size")+ # specify x axis label
  scale_color_hue(labels = c("Non Owner", "Owner")) # change legend label
## `geom_smooth()` using formula 'y ~ x'

Problem 3: Laptop sales at a London computer chain (a, b)

Laptop Sales at a London Computer Chain: Bar Charts and Boxplots. The file LaptopSalesJanuary2008.csv contains data for all sales of laptops at a computer chain in London in January 2008. This is a subset of the full dataset that includes data for the entire year.

a. Create a bar chart, showing the average retail price by store. Which store has the highest average? Which has the lowest?

Based on the bar chart, all the store have around the same average retail price. The store with the highest average retail price is “N17 6QA” and the store with the lowest is “W4 3PH”.

# load dataset
LaptopSales2008 <- read_csv("~/R_Projects/Data Mining/data/LaptopSalesJanuary2008.csv")
## Rows: 7956 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): Date, Customer Postcode, Store Postcode, Integrated Wireless?, Bun...
## dbl (12): Configuration, Retail Price, Screen Size (Inches), Battery Life (H...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# change class to factor
LaptopSales2008$`Store Postcode` <- as.factor(LaptopSales2008$`Store Postcode`) 
# plot retail price by store
LaptopSales2008 %>% 
  group_by(`Store Postcode`) %>% # group by store code
  summarise(`Average Retail Price` = mean(`Retail Price`)) %>% # calculate mean retail price
  mutate(Store = fct_reorder(`Store Postcode`, - `Average Retail Price`)) %>% # reorder stores from most to least
  ggplot(aes(Store, `Average Retail Price`))+ # create ggplot object
  geom_col()+ # add bar graph layer
   ggtitle(label = "Average Retail Price by Store") # add title

b. To better compare retail prices across stores, create side-by-side boxplots of retail price by store. Now compare the prices in the two stores from (a). Does there seem to be a difference between their price distributions?

The box plots below show a difference in the price distribution of the store with the highest and lowest average retail price. The higher selling store has a narrower distribution with a greater mean than the lowest selling store. The lower selling store has longer tails with more outliers.

LaptopSales2008 %>% 
  subset(`Store Postcode` == c("N17 6QA", "W4 3PH")) %>% # subset 2 stores
  group_by(`Store Postcode`) %>% # group by post code
  ggplot(aes(`Retail Price`, color = `Store Postcode`))+ # create ggplot object
  geom_boxplot()+ # add boxplot layer
  facet_grid(vars( `Store Postcode`)) # facet grid by store postcode

Laptop Sales at a London Computer Chain: Interactive Visualization. The next exercises are designed for using an interactive visualization tool. The file LaptopSales.txt is a comma-separated file with nearly 300,000 rows. ENBIS (the European Network for Business and Industrial Statistics) provided these data as part of a contest organized in the fall of 2009. Scenario: Imagine that you are a new analyst for a company called Acell (a company selling laptops). You have been provided with data about products and sales. You need to help the company with their business goal of planning a product strategy and pricing policies that will maximize Acell’s projected revenues in 2009. Using an interactive visualization tool, answer the following questions.

a. Price Questions:

At what price are the laptops actually selling?

LaptopSales <-
  read_csv("~/R_Projects/Data Mining/data/LaptopSales.csv") %>%
  drop_na() # drop NA observations
## Rows: 297572 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): Date, Customer Postcode, Store Postcode, Integrated Wireless?, Bun...
## dbl (11): Configuration, Retail Price, Screen Size (Inches), Battery Life (H...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# convert store location from BNG to WGS84
store_loc_tab <- LaptopSales %>% 
  st_as_sf(coords = c("store X", "store Y"), crs = 27700) %>%
  st_transform(4326) %>%
  st_coordinates() %>% 
  as_tibble() %>% 
  rename(
    store_long = X,
    store_lat = Y) %>% 
  add_column(store_postcode = LaptopSales$`Store Postcode`) %>% 
  distinct(store_postcode, .keep_all = TRUE)

# export as csv for Tableau 
write.csv(store_loc_tab, "~/R_Projects/Data Mining/data/store_loc.csv", row.names = F )

# convert customer location from BNG to WGS84
cust_loc_tab <- LaptopSales %>% 
  st_as_sf(coords = c("customer X", "customer Y"), crs = 27700) %>%
  st_transform(4326) %>%
  st_coordinates() %>% 
  as_tibble() %>% 
    rename(
    cust_long = X,
    cust_lat = Y) %>% 
    add_column(cust_postcode = LaptopSales$`Customer Postcode`) 

# export as csv for Tableau  
write.csv(cust_loc_tab, "~/R_Projects/Data Mining/data/cust_loc.csv", row.names = F )