Notes:
Knitted RMarkdown HTML document of this project is available on Rpubs Here.
My Springboard GitHub repository for all projects is available here.
. .
Save the data set as a CSV file called refine_original.csv and load it in RStudio into a data frame.
library(readxl)
refine_original <- read_excel("refine.xlsx")
write.csv(refine_original,"refine_original.csv")
head(refine_original)
## # A tibble: 6 x 6
## company `Product code / num~ address city country name
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Phillips p-5 Groningensing~ arnh~ the nether~ dhr p. ja~
## 2 phillips p-43 Groningensing~ arnh~ the nether~ dhr p. ha~
## 3 philips x-3 Groningensing~ arnh~ the nether~ dhr j. Ga~
## 4 phllips x-34 Groningensing~ arnh~ the nether~ dhr p. ma~
## 5 phillps x-12 Groningensing~ arnh~ the nether~ dhr p. fr~
## 6 phillipS p-23 Groningensing~ arnh~ the nether~ dhr p. fr~
Clean up the ‘company’ column so all of the misspellings of the brand names are standardized. For example, you can transform the values in the column to be: philips, akzo, van houten and unilever (all lowercase).
library(tidyverse)
## -- Attaching packages --------
## v ggplot2 3.2.0 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts -----------------
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
refine_original[[1]] <- tolower(refine_original[[1]])
head(refine_original)
## # A tibble: 6 x 6
## company `Product code / num~ address city country name
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 phillips p-5 Groningensing~ arnh~ the nether~ dhr p. ja~
## 2 phillips p-43 Groningensing~ arnh~ the nether~ dhr p. ha~
## 3 philips x-3 Groningensing~ arnh~ the nether~ dhr j. Ga~
## 4 phllips x-34 Groningensing~ arnh~ the nether~ dhr p. ma~
## 5 phillps x-12 Groningensing~ arnh~ the nether~ dhr p. fr~
## 6 phillips p-23 Groningensing~ arnh~ the nether~ dhr p. fr~
# fix spelling errors
library(dplyr,tidyr)
refine_original <- refine_original %>% mutate(company = ifelse(grepl('ph*s|fi*s', company, NA), 'philips', company)) %>%
mutate(company = ifelse(grepl('akzoz0|ak zo|akz0|akzo', company, NA), 'akzozo', company)) %>%
mutate(company = ifelse(grepl('unilver', company, NA), 'unilever', company))
# )
view(refine_original)
Separate the product code and product number into separate columns i.e. add two new columns called product_code and product_number, containing the product code and number respectively
library(tidyr)
refine_original <- separate(data = refine_original, col = "Product code / number", into = c("product_code", "product_number"), sep = "\\-")
head(refine_original)
## # A tibble: 6 x 7
## company product_code product_number address city country name
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 philips p 5 Groningens~ arnhem the neth~ dhr p. ~
## 2 philips p 43 Groningens~ arnhem the neth~ dhr p. ~
## 3 philips x 3 Groningens~ arnhem the neth~ dhr j. ~
## 4 philips x 34 Groningens~ arnhem the neth~ dhr p. ~
## 5 philips x 12 Groningens~ arnhem the neth~ dhr p. ~
## 6 philips p 23 Groningens~ arnhem the neth~ dhr p. ~
You learn that the product codes actually represent the following product categories:
In order to make the data more readable, add a column with the product category for each record.
refine_original <- refine_original %>%
mutate(category = ifelse(product_code == "p", "Smartphone", "")) %>%
mutate(category = ifelse(product_code == "v", "TV", category)) %>%
mutate(category = ifelse(product_code == "x", "Laptop", category)) %>%
mutate(category = ifelse(product_code == "q", "Tablet", category))
head(refine_original)
## # A tibble: 6 x 8
## company product_code product_number address city country name category
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 philips p 5 Groning~ arnh~ the ne~ dhr ~ Smartph~
## 2 philips p 43 Groning~ arnh~ the ne~ dhr ~ Smartph~
## 3 philips x 3 Groning~ arnh~ the ne~ dhr ~ Laptop
## 4 philips x 34 Groning~ arnh~ the ne~ dhr ~ Laptop
## 5 philips x 12 Groning~ arnh~ the ne~ dhr ~ Laptop
## 6 philips p 23 Groning~ arnh~ the ne~ dhr ~ Smartph~
You’d like to view the customer information on a map. In order to do that, the addresses need to be in a form that can be easily geocoded. Create a new column full_address that concatenates the three address fields (address, city, country), separated by commas.
refine_original <- mutate(refine_original,
full_address = paste(address, city, country, sep = ','))
head(refine_original)
## # A tibble: 6 x 9
## company product_code product_number address city country name category
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 philips p 5 Gronin~ arnh~ the ne~ dhr ~ Smartph~
## 2 philips p 43 Gronin~ arnh~ the ne~ dhr ~ Smartph~
## 3 philips x 3 Gronin~ arnh~ the ne~ dhr ~ Laptop
## 4 philips x 34 Gronin~ arnh~ the ne~ dhr ~ Laptop
## 5 philips x 12 Gronin~ arnh~ the ne~ dhr ~ Laptop
## 6 philips p 23 Gronin~ arnh~ the ne~ dhr ~ Smartph~
## # ... with 1 more variable: full_address <chr>
Both the company name and product category are categorical variables i.e. they take only a fixed set of values. In order to use them in further analysis you need to create dummy variables. Create dummy binary variables for each of them with the prefix company_ and product_ i.e.,
Add four binary (1 or 0) columns for company: company_philips, company_akzo, company_van_houten and company_unilever.
Add four binary (1 or 0) columns for product category: product_smartphone, product_tv, product_laptop and product_tablet.
refine_original <- refine_original %>%
mutate(company_philips = ifelse(company == "philips", 1, 0)) %>%
mutate(company_akso = ifelse(company == "akzo", 1, 0)) %>%
mutate(company_van_houten = ifelse(company == "van houten", 1, 0)) %>%
mutate(company_unilever = ifelse(company == "unilever", 1, 0)) %>%
mutate(product_smartphone = ifelse(category == "Smartphone", 1, 0)) %>%
mutate(product_tv = ifelse(category == "TV", 1, 0)) %>%
mutate(product_laptop = ifelse(category == "Laptop", 1, 0)) %>%
mutate(product_tablet = ifelse(category == "Tablet", 1, 0))
head(refine_original)
## # A tibble: 6 x 17
## company product_code product_number address city country name category
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 philips p 5 Gronin~ arnh~ the ne~ dhr ~ Smartph~
## 2 philips p 43 Gronin~ arnh~ the ne~ dhr ~ Smartph~
## 3 philips x 3 Gronin~ arnh~ the ne~ dhr ~ Laptop
## 4 philips x 34 Gronin~ arnh~ the ne~ dhr ~ Laptop
## 5 philips x 12 Gronin~ arnh~ the ne~ dhr ~ Laptop
## 6 philips p 23 Gronin~ arnh~ the ne~ dhr ~ Smartph~
## # ... with 9 more variables: full_address <chr>, company_philips <dbl>,
## # company_akso <dbl>, company_van_houten <dbl>, company_unilever <dbl>,
## # product_smartphone <dbl>, product_tv <dbl>, product_laptop <dbl>,
## # product_tablet <dbl>
Include your code, the original data as a CSV file refine_original.csv, and the cleaned up data as a CSV file called refine_clean.csv.
View(refine_original)
write.csv(refine_original,"refine_clean.csv")
..