library(tidyverse)
library(ggplot2)
library(plotly)
library(ggrepel)
library(scales)
library(sf)
library(geosphere)
The plot below shows the quarterly shipments of household goods between 1985 and 1989. We see that overall, shipments have increased over the years. Q2 of most year had the most shipments. Q4 of most years had the least shipments.
# load dataset
apps <- read.csv("~/R_Projects/Data Mining/data/ApplianceShipments.csv")
# separate Quarter variable into 2
apps_yq <- separate(apps, Quarter, c("Quarter", "Year"))
# create variable with numerical quarter
apps_yq$Quarter_Num <- sub('.', '', apps_yq$Quarter)
# change order of variables
apps_yq <- select(apps_yq, Quarter_Num, Year, Quarter, Shipments)
# create new variable with year and quarter
apps_yq <- unite(apps_yq, "Year_Quarter", Year:Quarter_Num, sep = ".", remove = FALSE)
# change to factor
apps_yq$Quarter <- as.factor(apps_yq$Quarter)
# change to numeric
apps_yq <- mutate_if(apps_yq, is.character, as.numeric)
# plot data
apps_yq %>%
ggplot(aes(Year_Quarter, Shipments))+ # create ggplot object
geom_line()+ # add line layer
geom_label(aes(label = Quarter, color = Quarter), show.legend = F)+ # add label layer
theme_light()+ # change theme
labs(x = "Year") + # change x axis label
ggtitle(label = "Appliance Shipments from 1985 to 1989") # add title
The plot below shows quarterly shipments of household appliances between 1985 and 1989. The Q2 has shown an increase in shipments year to year. Q1 has shown a decrease in appliance shipments year to year.
# Create last data point label
last_label <- apps_yq %>% filter(Year == 1989)
# plot data
apps_yq %>%
ggplot(aes(Year, Shipments, color = Quarter), show.legend = FALSE) + # create ggplot object
geom_line() + # add line layer
geom_text_repel(
data = last_label,
aes(label = Quarter),
nudge_x = 1,
na.rm = TRUE
) +
# change theme
theme_light() +
theme(legend.position = "none") +
ggtitle(label = "Quarterly Appliance Shipments from 1985 to 1989") # add title
The plot below shows that at a yearly aggregate level, there has been an increase in household appliance shipments. Shipments peaked in 1987, dropped slightly in 1988 and appear to be headed back up.
# prep data
apps_yq <- apps_yq %>%
group_by(Year) %>% # group dataframe by year
mutate(yearly_ship = sum(Shipments)) %>% # create column showing total shipments
ungroup()
# plot data
apps_yq %>%
ggplot(aes(Year, yearly_ship))+ # create ggplot object
geom_line()+ # add line layer
theme_light()+ # change theme
ylab("Shipments") + # change axis label
ggtitle(label = "Total Appliance Shipments from 1985 to 1989") # add title
The plot below shows the relationship betwen income and lot size. We can see that there is a separation between the Owner and Non Owner groups. The owner group has income and lot size on the upper half on the distribution and the non-owner group has income and lot size on the bottom half of the distribution. Both groups appear to show a slight linear negative relationship between income and lot size.
# load dataset
RidingMowers <- read_csv("~/R_Projects/Data Mining/data/RidingMowers.csv")
## Rows: 24 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Ownership
## dbl (2): Income, Lot_Size
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# change variable class to factor
RidingMowers$Ownership <- as.factor(RidingMowers$Ownership)
# create ggplot object
ggplot(RidingMowers, aes(Lot_Size, Income, color = Ownership, shape = Ownership))+
geom_point()+
geom_smooth(se = F, method = "lm")+ # add point layer
theme_light()+ # change theme
xlab("Lot Size")+ # specify x axis label
scale_color_hue(labels = c("Non Owner", "Owner")) # change legend label
## `geom_smooth()` using formula 'y ~ x'
Based on the bar chart, all the store have around the same average retail price. The store with the highest average retail price is “N17 6QA” and the store with the lowest is “W4 3PH”.
# load dataset
LaptopSales2008 <- read_csv("~/R_Projects/Data Mining/data/LaptopSalesJanuary2008.csv")
## Rows: 7956 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Date, Customer Postcode, Store Postcode, Integrated Wireless?, Bun...
## dbl (12): Configuration, Retail Price, Screen Size (Inches), Battery Life (H...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# change class to factor
LaptopSales2008$`Store Postcode` <- as.factor(LaptopSales2008$`Store Postcode`)
# plot retail price by store
LaptopSales2008 %>%
group_by(`Store Postcode`) %>% # group by store code
summarise(`Average Retail Price` = mean(`Retail Price`)) %>% # calculate mean retail price
mutate(Store = fct_reorder(`Store Postcode`, - `Average Retail Price`)) %>% # reorder stores from most to least
ggplot(aes(Store, `Average Retail Price`))+ # create ggplot object
geom_col()+ # add bar graph layer
ggtitle(label = "Average Retail Price by Store") # add title
The box plots below show a difference in the price distribution of the store with the highest and lowest average retail price. The higher selling store has a narrower distribution with a greater mean than the lowest selling store. The lower selling store has longer tails with more outliers.
LaptopSales2008 %>%
subset(`Store Postcode` == c("N17 6QA", "W4 3PH")) %>% # subset 2 stores
group_by(`Store Postcode`) %>% # group by post code
ggplot(aes(`Retail Price`, color = `Store Postcode`))+ # create ggplot object
geom_boxplot()+ # add boxplot layer
facet_grid(vars( `Store Postcode`)) # facet grid by store postcode
LaptopSales <-
read_csv("~/R_Projects/Data Mining/data/LaptopSales.csv") %>%
drop_na() # drop NA observations
## Rows: 297572 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Date, Customer Postcode, Store Postcode, Integrated Wireless?, Bun...
## dbl (11): Configuration, Retail Price, Screen Size (Inches), Battery Life (H...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# convert store location from BNG to WGS84
store_loc_tab <- LaptopSales %>%
st_as_sf(coords = c("store X", "store Y"), crs = 27700) %>%
st_transform(4326) %>%
st_coordinates() %>%
as_tibble() %>%
rename(
store_long = X,
store_lat = Y) %>%
add_column(store_postcode = LaptopSales$`Store Postcode`) %>%
distinct(store_postcode, .keep_all = TRUE)
# export as csv for Tableau
write.csv(store_loc_tab, "~/R_Projects/Data Mining/data/store_loc.csv", row.names = F )
# convert customer location from BNG to WGS84
cust_loc_tab <- LaptopSales %>%
st_as_sf(coords = c("customer X", "customer Y"), crs = 27700) %>%
st_transform(4326) %>%
st_coordinates() %>%
as_tibble() %>%
rename(
cust_long = X,
cust_lat = Y) %>%
add_column(cust_postcode = LaptopSales$`Customer Postcode`)
# export as csv for Tableau
write.csv(cust_loc_tab, "~/R_Projects/Data Mining/data/cust_loc.csv", row.names = F )