The file SupermarketTransactions.csv contains data on over 14,000 transactions. There are two numeric variables, Units Sold and Revenue. The first of these is discrete and the second is continuous.
## load tidyverse and dplyr and read the file
library(tidyverse)
## ── Attaching packages ───────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.3.4 ✔ dplyr 0.7.4
## ✔ tidyr 0.7.2 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ──────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr, warn.conflicts = FALSE)
dtst <- read_csv("SupermarketTransactions.csv")
## Parsed with column specification:
## cols(
## Transaction = col_integer(),
## `Purchase Date` = col_character(),
## `Customer ID` = col_integer(),
## Gender = col_character(),
## `Marital Status` = col_character(),
## Homeowner = col_character(),
## Children = col_integer(),
## `Annual Income` = col_character(),
## City = col_character(),
## `State or Province` = col_character(),
## Country = col_character(),
## `Product Family` = col_character(),
## `Product Department` = col_character(),
## `Product Category` = col_character(),
## `Units Sold` = col_integer(),
## Revenue = col_double()
## )
For each of the following, do whatever it takes to create a bar chart of counts for Units Sold and a histogram of Revenue for each of the subpopulations of purchases given below.
a. All purchases made during January and February of 2008.
## subset the purchases made during January and February of 2008
## parse the dates once (month/day/year format, as in the original call)
purchase_date <- as.Date(dtst$`Purchase Date`, "%m/%d/%Y")
spmt <- subset(dtst,
               purchase_date >= as.Date("2008-01-01") & purchase_date <= as.Date("2008-02-29"),
               select = c(`Units Sold`, Revenue))
barplot(summary(factor(spmt$`Units Sold`)), xlab = "Units Sold", ylab="Counts", main = "Units Sold during January and February of 2008")
hist(spmt$Revenue, xlab = "Revenue", ylab = "Counts", main = "Revenue during January and February of 2008")
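The date filter above depends on two details: the `%m/%d/%Y` format string has to match how `Purchase Date` is stored, and 2008 is a leap year, so February 29 is a valid upper bound. The following is a minimal sketch of the same comparison on a small made-up data frame. The `toy` rows and the column names `purchase_date` and `units_sold` are hypothetical and are not taken from SupermarketTransactions.csv.

```r
# Hypothetical rows for illustration only; not from the real file
toy <- data.frame(
  purchase_date = c("12/31/2007", "1/1/2008", "2/29/2008", "3/1/2008"),
  units_sold    = c(4L, 3L, 5L, 2L),
  stringsAsFactors = FALSE
)

# Parse month/day/year strings into Date objects
parsed <- as.Date(toy$purchase_date, format = "%m/%d/%Y")

# Keep rows dated from January 1 through February 29, 2008, inclusive
in_window <- parsed >= as.Date("2008-01-01") & parsed <= as.Date("2008-02-29")
jan_feb <- toy[in_window, ]

print(jan_feb$purchase_date)
```

Comparing against explicit `Date` objects, rather than bare strings, makes the intended type of the comparison visible in the code.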
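b. All purchases made by married female homeowners in the state of California.
## subset the purchases made by married female homeowners in California
spmt1 <- subset(dtst,
                Gender == "F" & Homeowner == "Y" & `Marital Status` == "M" & `State or Province` == "CA",
                select = c(`Units Sold`, Revenue))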
barplot(summary(factor(spmt1$`Units Sold`)), xlab = "Units Sold", ylab="Counts", main = "Units Sold to married female homeowners\nin the state of California")
hist(spmt1$Revenue, xlab = "Revenue", ylab = "Counts", main = "Revenue from married female homeowners\nin the state of California")
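Both subpopulations get the same pair of plots, so the two calls could be wrapped in a small helper. This is only a sketch: the function name `plot_units_and_revenue`, its `label` argument, and the `toy` data are hypothetical, and the analysis above does not use them.

```r
# Avoid writing Rplots.pdf when the script is run non-interactively
if (!interactive()) pdf(NULL)

# Hypothetical helper: draws the bar chart and histogram for one subpopulation
plot_units_and_revenue <- function(df, label) {
  counts <- table(df$`Units Sold`)  # counts per distinct Units Sold value
  barplot(counts, xlab = "Units Sold", ylab = "Counts",
          main = paste("Units Sold:", label))
  hist(df$Revenue, xlab = "Revenue", ylab = "Counts",
       main = paste("Revenue:", label))
  invisible(counts)  # return the counts without printing them
}

# Made-up data for illustration only; not from the real file
toy <- data.frame(`Units Sold` = c(3L, 4L, 3L, 5L),
                  Revenue = c(2.5, 7.1, 3.8, 12.4),
                  check.names = FALSE)
counts <- plot_units_and_revenue(toy, "toy example")
```

With the real data, the same helper would be called once per subset, for example with `spmt` and `spmt1`.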
Write a summary that is less than 100 words that describes your analysis.
The Revenue distributions for both subpopulations are right-skewed. This indicates that most transactions bring in relatively little revenue, while a small number of high-revenue transactions form the long right tail.
The Units Sold distributions are more symmetric. Most transactions involve 3 to 5 units.
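The skewness described above is read from the histograms by eye. One way to back it up numerically is to compare the mean with the median and compute a moment-based skewness coefficient. The sketch below uses made-up revenue values, not the real data, and right skew usually (not always) shows up as a mean above the median.

```r
# Made-up revenue values with a long right tail (illustration only)
revenue <- c(2, 3, 3, 4, 5, 6, 8, 12, 20, 35)

# For right-skewed data the mean is typically pulled above the median
rev_mean   <- mean(revenue)
rev_median <- median(revenue)

# Moment-based sample skewness: third central moment over variance^(3/2)
dev <- revenue - rev_mean
skewness <- mean(dev^3) / mean(dev^2)^1.5

cat("mean:", rev_mean, " median:", rev_median,
    " positive skewness:", skewness > 0, "\n")
```

Applied to `spmt$Revenue` or `spmt1$Revenue`, a clearly positive coefficient would support the right-skew reading of the histograms.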