Answer all questions. The test lasts 2 hours; please STOP after 2 hours. Each question carries the same weight. Most questions DO NOT have a single correct answer, so leave comments whenever you make an assumption or a choice. You may use all available resources (including Google and RStudio) to solve the questions. Feel free to discuss with others. Original thinking is given more weight, so letting others copy your code directly may not be a good idea (although discussion may clarify your own thought process by bringing in other perspectives).

Some questions are real-world questions. They may convey an ask from a stakeholder and might not be 100% clear. Make intelligent assumptions or simplifications, but state them clearly as comments.

Evaluation criteria

Submit a <yourname>.R file and a <yourname>_app.R file (for the Shiny app submission). I expect two .R files per person. Please add library() calls to your submission files for every package you use, so that your code runs as is when it is evaluated.

Questions

  1. Let: y <- list("x", "y", "z") and q <- list("X", "Y", "Z", "x", "y", "z"). Write code that returns all elements of q that are not in y, giving the following result:
[[1]]
[1] "X"

[[2]]
[1] "Y"

[[3]]
[1] "Z"
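
One possible approach, as a minimal sketch rather than the expected answer: `%in%` matches list elements by value, so negating it keeps only the elements of q that are absent from y. The name `result` is just an illustrative choice.

```r
y <- list("x", "y", "z")
q <- list("X", "Y", "Z", "x", "y", "z")

# %in% compares each element of q against the elements of y;
# negating the match keeps only the elements missing from y.
result <- q[!(q %in% y)]
print(result)
```

Unlike setdiff(), this keeps duplicates and the original order of q, which may or may not be what a given use case needs.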
  1. Use the iris dataset. Create 2 new columns called sepal.size and petal.size, each the average of the respective width and length. Draw a scatter plot of one size variable against the other, with each dot coloured by its Species. Write down (as comments after the code) any 2 obvious insights from this scatter plot
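
A minimal base-R sketch of the column creation and plot, assuming "average" means the simple mean of length and width. The data frame name `iris2` and the choice of axes are illustrative assumptions; the insights are left to the reader.

```r
# Assumption: size = (length + width) / 2 for sepals and petals.
iris2 <- transform(
  iris,
  sepal.size = (Sepal.Length + Sepal.Width) / 2,
  petal.size = (Petal.Length + Petal.Width) / 2
)

# Scatter plot of the two size variables, one colour per Species.
species_levels <- levels(iris2$Species)
plot(
  iris2$sepal.size, iris2$petal.size,
  col = as.integer(iris2$Species), pch = 19,
  xlab = "sepal.size", ylab = "petal.size"
)
legend("topleft", legend = species_levels,
       col = seq_along(species_levels), pch = 19)
```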

  2. Write code for a regression model (of your choice) that classifies the iris dataset. The target variable is Species. Check the accuracy of your model

  3. Write a function that prints out missing values for each column in a data frame. The function should be generic and work on any data frame
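
A minimal sketch of one way to read this question, assuming "missing values" means the count of NA entries per column. The function name `count_missing` is hypothetical.

```r
# Assumption: "missing" means NA; empty strings are not treated as missing.
count_missing <- function(df) {
  stopifnot(is.data.frame(df))
  # One NA count per column, returned as a named integer vector.
  counts <- vapply(df, function(col) sum(is.na(col)), integer(1))
  print(counts)
  invisible(counts)
}

# Usage on a built-in dataset that contains missing values.
count_missing(airquality)
```

Returning the counts invisibly keeps the printed output clean while still letting the caller reuse the result.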

  4. Write a function that prints out frequency counts for each factor column in a data frame. Consider a column to be a factor column if it has 10 or fewer distinct levels. The function should be generic and work on any data frame
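
A minimal sketch under stated assumptions: a column counts as a "factor column" when it has at most `max_levels` distinct non-missing values, whatever its class, and NA is tabulated only when present. The name `print_factor_counts` and the `max_levels` parameter are hypothetical.

```r
print_factor_counts <- function(df, max_levels = 10) {
  stopifnot(is.data.frame(df))
  tables <- list()
  for (name in names(df)) {
    column <- df[[name]]
    # Assumption: distinct non-missing values stand in for "factor levels".
    n_levels <- length(unique(column[!is.na(column)]))
    if (n_levels <= max_levels) {
      tables[[name]] <- table(column, useNA = "ifany")
      cat("\nColumn:", name, "\n")
      print(tables[[name]])
    }
  }
  # Return the tables invisibly so they can be reused by the caller.
  invisible(tables)
}

# Usage: only Species qualifies in iris.
print_factor_counts(iris)
```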

  5. Import the Test_Superstore_Sales.xls dataset that has been mailed to you. Import the 3 tabs as 3 different datasets

    • How many unique orders have been placed?
    • How many unique orders have been returned?
    • Join the orders and returns datasets to find out how many unique orders have been returned
    • Plot trendlines for Profit, Unit Price and Order Quantity vs time. Additionally, plot a derived variable called order_amount = Unit Price * Order Quantity
    • Regional Managers want to understand how many orders are being returned per region. Use a simple 4-region mapping (North, South, East, West). Map the given regions into one of these 4 regions (use Google Maps if you don’t know where each region is). They also want to understand how many returns occur across time frames (yearly, monthly, daily). Plot a graphic to illustrate this ask
    • The sales manager wants to understand the impact of shipping cost on the profitability of lower-cost items. His hypothesis is that profit is low for low-cost items because shipping charges cut into the profit (assume the company offers free shipping). Show a plot that will either prove or disprove his hypothesis
    • Which shipping mode has the highest return rate? Plot return rate vs shipping mode (costliest to cheapest on average)
  6. Make a Shiny app that works like a dashboard for the regional managers (managers can be found in the Users tab). There should be an overall tab that shows returns per region. This allows managers to compare their region’s performance against others. Design separate tabs for each region where a manager can see his/her region’s performance in detail. Make a call on what specific “performance details” a regional manager will need to see. Some hints are

All the best
—X—X—