ETL and EDA Using R: Peer Reviewed Assignment

Transform columns so that datatypes are appropriate. Specifically, ensure that the CustomerCode variable is formatted as character, any other categorical variable is set as factor, and the date column is set as a date type (Date/POSIXlt/POSIXct).
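As a rough illustration of these conversions, here is a minimal base R sketch on a toy data frame. The data frame `toy` and its values are hypothetical, not taken from the assignment data. The main point is that `as.Date()` needs a `format` that matches the input strings; a mismatched format gives `NA` rather than an error.

```r
# Hypothetical toy data; not the assignment data set
toy <- data.frame(
  Date         = c("01/14/2016", "07/02/2016"),
  Department   = c("Kabobs", "Entrees"),
  CustomerCode = c(101, 102),
  stringsAsFactors = FALSE
)

toy$CustomerCode <- as.character(toy$CustomerCode)          # identifier, not a number
toy$Department   <- as.factor(toy$Department)               # categorical variable
toy$Date         <- as.Date(toy$Date, format = "%m/%d/%Y")  # month/day/year input

# A format that does not match the input strings yields NA
bad <- as.Date("01/14/2016", format = "%Y-%m-%d")

str(toy)
```

Checking `str()` after the conversion, and counting `NA` values in the date column, is a cheap way to confirm that the chosen format actually matched.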

# Packages used in this analysis
library(tidyverse)  # dplyr, purrr, ggplot2
library(cowplot)    # plot_grid()

# Convert the specific columns to the specified data types
df$CustomerCode <- as.character(df$CustomerCode)
df$Department <- as.factor(df$Department)
df$Category <- as.factor(df$Category)
df$Date <- as.Date(df$Date, format = '%m/%d/%Y')
str(df)
## tibble [34,432 x 6] (S3: tbl_df/tbl/data.frame)
##  $ Date        : Date[1:34432], format: "2016-01-14" "2016-07-02" ...
##  $ Department  : Factor w/ 3 levels "Entrees","Kabobs",..: 2 3 3 3 2 1 1 3 3 1 ...
##  $ Category    : Factor w/ 10 levels "Beef","Beef and Broccoli",..: 7 8 8 8 1 6 2 10 8 6 ...
##  $ CustomerCode: chr [1:34432] "CWM11331L8O" "CWM11331L8O" "CXP4593H7E" "CWM11331L8O" ...
##  $ Price       : int [1:34432] 28 9 9 9 25 18 26 12 9 12 ...
##  $ Quantity    : int [1:34432] 11 5 14 6 7 13 9 6 11 22 ...

Display and interpret the summaries for the Quantity and Price columns.

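summary(df[ , 5:6])
##      Price          Quantity
##  Min.   : 3.00   Min.   : 1.00
##  1st Qu.:12.00   1st Qu.: 8.00
##  Median :25.00   Median :11.00
##  Mean   :22.81   Mean   :11.31
##  3rd Qu.:33.00   3rd Qu.:15.00
##  Max.   :50.00   Max.   :24.00
##  NA's   :10      NA's   :7

Price ranges from 3 to 50 with a median of 25 and a mean of 22.81. Quantity ranges from 1 to 24 with a median of 11 and a mean of 11.31. In both columns the mean and median are close, so neither distribution looks strongly skewed. A few values are missing: 10 in Price and 7 in Quantity.

Display the count of NA values in each column.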

df %>% map(is.na) %>% map(sum)
## $Date
## [1] 0
##
## $Department
## [1] 0
##
## $Category
## [1] 0
##
## $CustomerCode
## [1] 0
##
## $Price
## [1] 10
##
## $Quantity
## [1] 7

Display a bar chart for the Category column.

# Get the total quantity for each category
category_df <- df %>%
  group_by(Category) %>%
  summarise(Total_Quantity = sum(Quantity, na.rm = TRUE)) %>%
  arrange(desc(Total_Quantity))

# Plot the data
ggplot(data = category_df,
       aes(x = reorder(Category, -Total_Quantity), y = Total_Quantity, fill = Total_Quantity)) +
  geom_col() +
  labs(x = 'Category', y = 'Quantity', title = 'Bar Chart of Quantity by Category') +
  theme(legend.position = 'none') +
  ggeasy::easy_rotate_x_labels(angle = 90)
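The call `reorder(Category, -Total_Quantity)` is what sorts the bars from largest to smallest. It returns a factor whose levels are ordered by increasing value of the second argument, so negating the totals puts the largest total first. A small sketch of that behaviour, using hypothetical category names and totals rather than the assignment data:

```r
# Hypothetical totals, one row per category
totals <- data.frame(
  Category       = c("A", "B", "C"),
  Total_Quantity = c(120, 450, 300)
)

# Default factor levels are alphabetical
default_levels <- levels(factor(totals$Category))

# reorder() sorts levels by increasing -Total_Quantity, i.e. largest total first
ordered_cat <- reorder(totals$Category, -totals$Total_Quantity)
levels(ordered_cat)   # "B" "C" "A"
```

ggplot2 draws discrete axes in level order, which is why the reordered factor, not the row order from `arrange()`, controls the bar order.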

Display the Departments and their revenue using a bar chart. Order the bars in a meaningful way. (Hint: You will need to create a new column Revenue by multiplying Price and Quantity.)

# Create a revenue column
revenue_df <- df %>%
  mutate(revenue = Price * Quantity)

# Group the data by Department
rev_by_dept <- revenue_df %>%
  group_by(Department) %>%
  summarise(Total_Revenue = sum(revenue, na.rm = TRUE))

# Plot the data
ggplot(rev_by_dept,
       aes(x = reorder(Department, -Total_Revenue), y = Total_Revenue, fill = Total_Revenue)) +
  geom_bar(stat = 'identity') +
  labs(title = 'Bar Chart of Revenue by Department', x = 'Department', y = 'Revenue') +
  theme(legend.position = 'none')
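Because Price and Quantity both contain missing values, some row-level revenues are `NA`, and `na.rm = TRUE` is what keeps one missing row from turning a whole department total into `NA`. A base R sketch of the same multiply-then-aggregate step, assuming a hypothetical four-row data frame (the department names appear in the data, the numbers are invented):

```r
# Hypothetical rows; numbers are invented for illustration
toy <- data.frame(
  Department = c("Kabobs", "Kabobs", "Entrees", "Entrees"),
  Price      = c(10, NA, 20, 5),
  Quantity   = c(2, 3, 1, 4)
)

toy$Revenue <- toy$Price * toy$Quantity   # NA wherever Price or Quantity is NA

by_dept <- split(toy$Revenue, toy$Department)

# Without na.rm, a single missing row makes the group total NA
with_na <- vapply(by_dept, sum, numeric(1))

# With na.rm = TRUE the missing row is skipped
totals <- vapply(by_dept, sum, numeric(1), na.rm = TRUE)
sort(totals, decreasing = TRUE)
```

Dropping missing rows this way treats them as zero revenue, which understates the totals slightly; with only a handful of missing values out of 34,432 rows, the effect on the chart should be small.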

Create a histogram and box and whisker plot of the Price and Quantity columns.

# Plot for histogram
# fill is set outside aes() so the bars are drawn in red
plot1 <- ggplot(data = df, aes(x = Price)) +
  geom_histogram(binwidth = 10, stat = 'bin', col = 'black', fill = 'red') +
  theme(legend.position = 'none') +
  labs(title = 'Histogram of Price', y = 'Count', x = 'Price')

# Plot for boxplot
plot2 <- ggplot(data = df, aes(y = Price)) +
  geom_boxplot(fill = 'red') +
  theme(legend.position = 'none') +
  labs(title = 'Boxplot of Price', y = 'Price')

# Plot side by side
plot_grid(plot1, plot2, labels = 'AUTO')
## Warning: Removed 10 rows containing non-finite values (stat_bin).
## Warning: Removed 10 rows containing non-finite values (stat_boxplot).

# Plot for histogram
graph1 <- ggplot(data = df, aes(x = Quantity)) +
  geom_histogram(binwidth = 5, stat = 'bin', col = 'black', fill = 'blue') +
  theme(legend.position = 'none') +
  labs(title = 'Histogram of Quantity', y = 'Count', x = 'Quantity')

# Plot for boxplot
graph2 <- ggplot(data = df, aes(y = Quantity)) +
  geom_boxplot(fill = 'blue') +
  theme(legend.position = 'none') +
  labs(title = 'Boxplot of Quantity', y = 'Quantity')

# Combine the plots
plot_grid(graph1, graph2, labels = 'AUTO')
## Warning: Removed 7 rows containing non-finite values (stat_bin).
## Warning: Removed 7 rows containing non-finite values (stat_boxplot).
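The warnings come from the rows with missing values: the counts of removed rows (10 for Price, 7 for Quantity) match the NA counts reported earlier. As a rough base R sketch of what a box-and-whisker plot summarises, and of how missing values are set aside first, consider a hypothetical vector (not the assignment data):

```r
# Hypothetical values with one missing entry and one extreme value
x <- c(1, 8, 9, 11, 12, 15, 16, NA, 40)

n_dropped <- sum(!is.finite(x))   # values a plot would have to drop

# boxplot.stats() ignores NA and returns the numbers a boxplot draws
bs <- boxplot.stats(x)
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # points lying more than 1.5 * IQR beyond the hinges
```

Here the value 40 falls outside the upper whisker and would be drawn as a separate point, while the `NA` is excluded before any statistics are computed.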

Write a short essay (150-200 words) to compare the strengths and weaknesses of (1) Power BI and (2) Alteryx with that of R, for this kind of analysis. You may discuss how each of these fare in terms of replicability, ease of use, cost, ability to share results with others, scalability, etc.

R

Strengths

Open source. It's free to use.

The results are reproducible. Your code can be rerun and reused whenever required.

Weakness

Hard to learn. It requires that you know how to program in R.

Power BI

Strengths

Free to use. Power BI Desktop can be installed and used for free.

Easy to learn. It uses a graphical interface to create visualizations.

Weakness

Results are harder to reproduce. Because the work is done through a graphical interface rather than a script, another person cannot easily retrace the steps you took.

Alteryx

Strengths

Easy to use. You drag the elements you want to use and connect them into a workflow.

The results are reproducible. Because you are creating a workflow, someone else can replicate your work.

Weakness

Pay to use. After the free trial you must pay for continued use.