library("plyr")
library("dplyr")
library("tidyr")
library("readxl")
library("knitr")

Exercise

Using R, clean the data set refine to make it easier to visualize and analyze.

Steps

1) Load the data in RStudio:

Save the data set as a CSV file called refine_original.csv and load it in RStudio into a data frame.

refine_original <- read_excel("C:/Users/xena0/Downloads/refine.xlsx")
View(refine_original)

2) Clean up brand names

Clean up the ‘company’ column so all of the misspellings of the brand names are standardized. For example, you can transform the values in the column to be: philips, akzo, van houten and unilever (all lowercase).

refine_clean <- refine_original %>% 
  mutate(company = ifelse(grepl("^phil|^fil|^phl", company, ignore.case = TRUE), "philips", company)) %>% 
  mutate(company = ifelse(grepl("ak", company, ignore.case = TRUE), "akzo", company)) %>% 
  mutate(company = ifelse(grepl("van", company, ignore.case = TRUE), "van houten", company)) %>% 
  mutate(company = ifelse(grepl("uni", company, ignore.case = TRUE), "unilever", company))

3) Separate product code and number

Separate the product code and product number into separate columns i.e. add two new columns called product_code and product_number, containing the product code and number respectively

refine_clean <- refine_clean %>%
  separate(col = "Product code / number",
          into = c("product_code", "product_number"),
          sep = "-")

4) Add product categories

You learn that the product codes actually represent the following product categories:

p = Smartphone

v = TV

x = Laptop

q = Tablet

In order to make the data more readable, add a column with the product category for each record.

refine_clean <- mutate(refine_clean, product_categories = 0)

refine_clean <- refine_clean %>% 
  mutate(product_categories = ifelse(grepl("p", product_code), "Smartphone", product_categories)) %>% 
  mutate(product_categories = ifelse(grepl("v", product_code), "TV", product_categories)) %>% 
  mutate(product_categories = ifelse(grepl("x", product_code), "Laptop", product_categories)) %>% 
  mutate(product_categories = ifelse(grepl("q", product_code), "Tablet", product_categories))

5) Add full address for geocoding

You’d like to view the customer information on a map. In order to do that, the addresses need to be in a form that can be easily geocoded. Create a new column full_address that concatenates the three address fields (address, city, country), separated by commas.

refine_clean <- refine_clean %>% 
  unite(full_address, address:country, sep = ", ", remove = FALSE)

6) Create dummy variables for company and product category

Both the company name and product category are categorical variables i.e. they take only a fixed set of values. In order to use them in further analysis you need to create dummy variables. Create dummy binary variables for each of them with the prefix company_ and product_ i.e.,

Add four binary (1 or 0) columns for company: company_philips, company_akzo, company_van_houten and company_unilever.

Add four binary (1 or 0) columns for product category: product_smartphone, product_tv, product_laptop and product_tablet.

refine_clean <- refine_clean %>% 
  mutate(company_philips = ifelse(grepl("philips", company), 1, 0)) %>% 
  mutate(company_akzo = ifelse(grepl("akzo", company),  1, 0)) %>% 
  mutate(company_van_houten = ifelse(grepl("van houten", company),  1, 0)) %>% 
  mutate(company_unilever = ifelse(grepl("unilever", company),  1, 0)) %>% 
  mutate(product_smartphone = ifelse(grepl("Smartphone", product_categories),  1, 0)) %>% 
  mutate(product_tv = ifelse(grepl("TV", product_categories),  1, 0)) %>% 
  mutate(product_laptop = ifelse(grepl("Laptop", product_categories),  1, 0)) %>% 
  mutate(product_tablet = ifelse(grepl("Tablet", product_categories),  1, 0))

6) Submit the project on Github

Include your code, the original data as a CSV file refine_original.csv, and the cleaned up data as a CSV file called refine_clean.csv.

kable(refine_clean[1:5, ], caption = "refine clean")
refine clean
company product_code product_number full_address address city country name product_categories company_philips company_akzo company_van_houten company_unilever product_smartphone product_tv product_laptop product_tablet
philips p 5 Groningensingel 147, arnhem, the netherlands Groningensingel 147 arnhem the netherlands dhr p. jansen Smartphone 1 0 0 0 1 0 0 0
philips p 43 Groningensingel 148, arnhem, the netherlands Groningensingel 148 arnhem the netherlands dhr p. hansen Smartphone 1 0 0 0 1 0 0 0
philips x 3 Groningensingel 149, arnhem, the netherlands Groningensingel 149 arnhem the netherlands dhr j. Gansen Laptop 1 0 0 0 0 0 1 0
philips x 34 Groningensingel 150, arnhem, the netherlands Groningensingel 150 arnhem the netherlands dhr p. mansen Laptop 1 0 0 0 0 0 1 0
philips x 12 Groningensingel 151, arnhem, the netherlands Groningensingel 151 arnhem the netherlands dhr p. fransen Laptop 1 0 0 0 0 0 1 0