Data Wrangling Exercise 1: Basic Data Manipulation

Notes:

Knitted RMarkdown HTML document of this project is available on Rpubs Here.
My Springboard GitHub repository for all projects is available here.

. .

0: Load the data in RStudio

Save the data set as a CSV file called refine_original.csv and load it in RStudio into a data frame.

library(readxl)
refine_original <- read_excel("refine.xlsx")
write.csv(refine_original,"refine_original.csv")
head(refine_original)

## # A tibble: 6 x 6
##   company  `Product code / num~ address        city  country     name      
##   <chr>    <chr>                <chr>          <chr> <chr>       <chr>     
## 1 Phillips p-5                  Groningensing~ arnh~ the nether~ dhr p. ja~
## 2 phillips p-43                 Groningensing~ arnh~ the nether~ dhr p. ha~
## 3 philips  x-3                  Groningensing~ arnh~ the nether~ dhr j. Ga~
## 4 phllips  x-34                 Groningensing~ arnh~ the nether~ dhr p. ma~
## 5 phillps  x-12                 Groningensing~ arnh~ the nether~ dhr p. fr~
## 6 phillipS p-23                 Groningensing~ arnh~ the nether~ dhr p. fr~

1: Clean up brand names

Clean up the ‘company’ column so all of the misspellings of the brand names are standardized. For example, you can transform the values in the column to be: philips, akzo, van houten and unilever (all lowercase).

library(tidyverse)

## -- Attaching packages --------

## v ggplot2 3.2.0     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## -- Conflicts -----------------
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

refine_original[[1]] <- tolower(refine_original[[1]])
head(refine_original)

## # A tibble: 6 x 6
##   company  `Product code / num~ address        city  country     name      
##   <chr>    <chr>                <chr>          <chr> <chr>       <chr>     
## 1 phillips p-5                  Groningensing~ arnh~ the nether~ dhr p. ja~
## 2 phillips p-43                 Groningensing~ arnh~ the nether~ dhr p. ha~
## 3 philips  x-3                  Groningensing~ arnh~ the nether~ dhr j. Ga~
## 4 phllips  x-34                 Groningensing~ arnh~ the nether~ dhr p. ma~
## 5 phillps  x-12                 Groningensing~ arnh~ the nether~ dhr p. fr~
## 6 phillips p-23                 Groningensing~ arnh~ the nether~ dhr p. fr~

# fix spelling errors
library(dplyr,tidyr)
refine_original <- refine_original %>% mutate(company = ifelse(grepl('ph*s|fi*s', company, NA), 'philips', company))  %>%
  mutate(company = ifelse(grepl('akzoz0|ak zo|akz0|akzo', company, NA), 'akzozo', company))  %>%
  mutate(company = ifelse(grepl('unilver', company, NA), 'unilever', company))  
  
#                           )
                           
view(refine_original)

2: Separate product code and number

Separate the product code and product number into separate columns i.e. add two new columns called product_code and product_number, containing the product code and number respectively

library(tidyr)
refine_original <- separate(data = refine_original, col = "Product code / number", into = c("product_code", "product_number"), sep = "\\-")
head(refine_original)

## # A tibble: 6 x 7
##   company product_code product_number address     city   country   name    
##   <chr>   <chr>        <chr>          <chr>       <chr>  <chr>     <chr>   
## 1 philips p            5              Groningens~ arnhem the neth~ dhr p. ~
## 2 philips p            43             Groningens~ arnhem the neth~ dhr p. ~
## 3 philips x            3              Groningens~ arnhem the neth~ dhr j. ~
## 4 philips x            34             Groningens~ arnhem the neth~ dhr p. ~
## 5 philips x            12             Groningens~ arnhem the neth~ dhr p. ~
## 6 philips p            23             Groningens~ arnhem the neth~ dhr p. ~

3: Add product categories

You learn that the product codes actually represent the following product categories:

p = Smartphone
v = TV
x = Laptop
q = Tablet

In order to make the data more readable, add a column with the product category for each record.

refine_original <- refine_original %>%
                   mutate(category = ifelse(product_code == "p", "Smartphone", "")) %>%
                   mutate(category = ifelse(product_code == "v", "TV", category)) %>%
                   mutate(category = ifelse(product_code == "x", "Laptop", category)) %>%
                   mutate(category = ifelse(product_code == "q", "Tablet", category))
head(refine_original)

## # A tibble: 6 x 8
##   company product_code product_number address  city  country name  category
##   <chr>   <chr>        <chr>          <chr>    <chr> <chr>   <chr> <chr>   
## 1 philips p            5              Groning~ arnh~ the ne~ dhr ~ Smartph~
## 2 philips p            43             Groning~ arnh~ the ne~ dhr ~ Smartph~
## 3 philips x            3              Groning~ arnh~ the ne~ dhr ~ Laptop  
## 4 philips x            34             Groning~ arnh~ the ne~ dhr ~ Laptop  
## 5 philips x            12             Groning~ arnh~ the ne~ dhr ~ Laptop  
## 6 philips p            23             Groning~ arnh~ the ne~ dhr ~ Smartph~

4: Add full address for geocoding

You’d like to view the customer information on a map. In order to do that, the addresses need to be in a form that can be easily geocoded. Create a new column full_address that concatenates the three address fields (address, city, country), separated by commas.

refine_original <- mutate(refine_original,
                          full_address = paste(address, city, country, sep = ','))
head(refine_original)

## # A tibble: 6 x 9
##   company product_code product_number address city  country name  category
##   <chr>   <chr>        <chr>          <chr>   <chr> <chr>   <chr> <chr>   
## 1 philips p            5              Gronin~ arnh~ the ne~ dhr ~ Smartph~
## 2 philips p            43             Gronin~ arnh~ the ne~ dhr ~ Smartph~
## 3 philips x            3              Gronin~ arnh~ the ne~ dhr ~ Laptop  
## 4 philips x            34             Gronin~ arnh~ the ne~ dhr ~ Laptop  
## 5 philips x            12             Gronin~ arnh~ the ne~ dhr ~ Laptop  
## 6 philips p            23             Gronin~ arnh~ the ne~ dhr ~ Smartph~
## # ... with 1 more variable: full_address <chr>

5: Create dummy variables for company and product category

Both the company name and product category are categorical variables i.e. they take only a fixed set of values. In order to use them in further analysis you need to create dummy variables. Create dummy binary variables for each of them with the prefix company_ and product_ i.e.,

Add four binary (1 or 0) columns for company: company_philips, company_akzo, company_van_houten and company_unilever.
Add four binary (1 or 0) columns for product category: product_smartphone, product_tv, product_laptop and product_tablet.

refine_original <- refine_original %>%
                   mutate(company_philips = ifelse(company == "philips", 1, 0)) %>%
                   mutate(company_akso = ifelse(company == "akzo", 1, 0)) %>%
                   mutate(company_van_houten = ifelse(company == "van houten", 1, 0)) %>%
                   mutate(company_unilever = ifelse(company == "unilever", 1, 0)) %>%
  
                   mutate(product_smartphone = ifelse(category == "Smartphone", 1, 0)) %>%
                   mutate(product_tv = ifelse(category == "TV", 1, 0)) %>%  
                   mutate(product_laptop = ifelse(category == "Laptop", 1, 0)) %>%
                   mutate(product_tablet = ifelse(category == "Tablet", 1, 0)) 

head(refine_original)

## # A tibble: 6 x 17
##   company product_code product_number address city  country name  category
##   <chr>   <chr>        <chr>          <chr>   <chr> <chr>   <chr> <chr>   
## 1 philips p            5              Gronin~ arnh~ the ne~ dhr ~ Smartph~
## 2 philips p            43             Gronin~ arnh~ the ne~ dhr ~ Smartph~
## 3 philips x            3              Gronin~ arnh~ the ne~ dhr ~ Laptop  
## 4 philips x            34             Gronin~ arnh~ the ne~ dhr ~ Laptop  
## 5 philips x            12             Gronin~ arnh~ the ne~ dhr ~ Laptop  
## 6 philips p            23             Gronin~ arnh~ the ne~ dhr ~ Smartph~
## # ... with 9 more variables: full_address <chr>, company_philips <dbl>,
## #   company_akso <dbl>, company_van_houten <dbl>, company_unilever <dbl>,
## #   product_smartphone <dbl>, product_tv <dbl>, product_laptop <dbl>,
## #   product_tablet <dbl>

6. Submit the project on Github

Include your code, the original data as a CSV file refine_original.csv, and the cleaned up data as a CSV file called refine_clean.csv.

View(refine_original)
write.csv(refine_original,"refine_clean.csv")