"Everything has beauty, but not everyone sees it." - Confucius
In this data wrangling project, I have performed various data pre-processing techniques to prepare a joint dataset of Japanese porn actresses' biodata and survey data on their birthplace prefectures for subsequent statistical analyses. The first dataset, containing biodata of the actresses, provides basic information such as their name and birthday and key biometrics such as their height, bust and hip sizes. The second dataset contains population densities of Japan's 47 prefectures in the year 2015. These two datasets were joined to form the version that subsequently underwent data analyses.
The first step was to remove technically "troublesome" and unnecessary columns in the actress dataset, for instance columns containing Japanese characters and the column containing URLs to the actresses' profile pictures. Then, numeric variables that were mixed together with strings were corrected using stringr's str_remove_all() function. To create a more informative dataset, a new variable storing the actresses' age was created from the existing birthday variable using functions of the lubridate package. The mutate() function of dplyr was also used to create a new variable storing the actresses' waist-to-hip ratios from the existing waist and hip size variables. To perform the necessary data type conversions, the base R as.numeric() and factor() functions were used. String operations were performed using functions of the stringr package to ensure a minimum-loss joining of this dataset with the second one.
Prior to merging the two datasets, the population densities dataset was first converted into a long (tidy) format using tidyr's gather() function. Merging of the datasets was then performed using dplyr's inner_join() function. The merged dataset was then inspected for missing values using the base R is.na() function combined with the which() and colSums() functions. Missing values were imputed using Hmisc's impute() function. Subsequently, special values such as Inf and NaN were detected by applying sum(is.infinite(x)) to the whole dataset using sapply(). To detect univariate outliers, boxplots were generated to apply Tukey's IQR approach. All variables containing outliers were then capped or Winsorised using a custom-made function, applied to each variable using magrittr's pipe operator. Finally, the population density variable was square-transformed to bring it closer to a normal distribution.
The first dataset that I worked on for this project is a profile or biodata of Japanese porn (JAV) actresses. The information in the dataset was obtained by web scraping the website Javhoo.com, done by the author twopothead curie. The online Japanese pornography data bank contains information on both actresses and the movies they star in, so the actress dataset I worked on is a subset of the entire website data repository. The dataset provides a biographical profile of the actresses, starting with basic information such as their name, birthday, hobby and birthplace, followed by the actresses' various biometrics such as height, bust size and waist circumference. The dataset was first published on Kaggle on 21 January 2020.
The original actress profile dataset in a comma-separated values format can be found in the following Kaggle URL: [URL: *https://www.kaggle.com/twopothead/japanese-pornstars-and-adult-videos*]
The original published dataset has 11 variables (columns) described in detail below (before any cleaning and type conversions):
1. id : This is an automatically generated number intended by the author to function as a primary key for the actress profile dataset. This is a double-precision floating point numeric variable.
2. name: This is the name of the porn actresses, written in Japanese characters (hiragana, katakana or kanji). This variable has the type of character (string).
3. imgurl: This is the URL of the actresses' profile photos or headshots stored in Javhoo.com. This variable has the type of character.
4. birthday: This is the date of birth of the actresses in YYYY-MM-DD format. This variable has the type of date.
5. height: This is the actresses' height in cm. The numeric height measurements are followed by the string suffixes "cm", so this variable has the (wrong) type of character.
6. cup_size : This is the breast-size category of the actresses. The categories follow the Japanese and South Korean alphabetical breast-size classification based on centimetre measurements. This is a character variable.
7. bust : This is the breast size of the actresses measured in centimetres. The numeric breast-size measurements are followed by the string suffixes "cm", so this variable has the (wrong) type of character.
8. waist : This is the actresses' waist circumference measured in centimetres. The numeric waist measurements are followed by the string suffixes "cm", so this variable has the (wrong) type of character.
9. hips : This is the actresses' hip circumference measured in centimetres. The numeric hip measurements are followed by the string suffixes "cm", so this variable has the (wrong) type of character.
10. birthplace : This is the birthplace of the actresses. Most are names of Japanese prefectures but for those actresses who were born outside Japan, the country or city names are written. All are written in Japanese characters. This variable has the type of character.
11. hobby : This is the actresses' hobby outside their professional life. For each observation, this variable can store more than one hobby, e.g. singing, cooking, swimming. The hobbies are written in Japanese characters. This variable has the type of character.
Being aware that reading non-English characters in R can be messy and unnecessarily complicated for the purposes of this assignment, I downloaded the original dataset and processed it in Google Sheets. Using the translate function of Google Sheets, I translated all columns containing Japanese characters into English and saved the output into new columns. For example, I translated the "name" column and stored the output in a new column "translated_name". I then exported the file back as a CSV file. This is the file version that I eventually imported into R and worked on for this project. This version of the dataset has 10,600 rows and 14 columns.
The second dataset of my project is survey data containing the population densities of Japan's 47 prefectures in 2015. The survey was conducted by the Statistics Bureau of Japan (Ministry of Internal Affairs and Communications). The dataset is an adaptation of the original survey data, performed by the author leeDataWhiz. The author modified the original data into a wide (untidy) format, with the intention of providing data tidying practice material to beginners in R, Python or other data science tools. Population density for each prefecture is a calculated value, obtained by dividing the total population of a prefecture in the year 2015 by its total area in square kilometres. The author leeDataWhiz published this dataset on Kaggle on 2 October 2020.
The Japan prefecture 2015 population density dataset in a comma-separated values format can be found in the following Kaggle URL: [URL: *https://www.kaggle.com/leedatawhiz/untidy-japanese-prefecture-2015-population-density*]
The dataset has 47 columns, representing the 47 prefectures of Japan, and only one row containing the 2015 population density for each prefecture. The population densities have the type of double-precision floating point numeric.
The working directory was set to a dedicated project folder where all files relevant to this project were kept. The getwd() function was then invoked to ascertain that the correct working directory has been set.
setwd("C:/Users/USER/Documents/DATA KULIAH/RMIT Master of Data Science/MATH2349-Data Wrangling/Assignments/Assignment 2")
getwd()
The following packages were loaded to carry out operations needed in this project. The packages used in this project fulfil the Assignment 2 requirement #10.
library(readr)
library(dplyr)
library(tidyr)
library(magrittr)
library(lubridate)
library(knitr)
library(Hmisc)
library(stringr)
library(outliers)
library(forecast)
The intact version of the first dataset was imported using the read_delim() function. The delimiter in the file is ";", therefore read_delim() with the specific delim argument was used. The file is a CSV file with UTF-8 encoding to accommodate Japanese (non-ASCII) characters. The locale argument was used to be able to properly import columns with Japanese characters.
actressprofile <- read_delim("javactresses.csv", delim = ";", col_names = TRUE, locale = locale(encoding = "UTF-8"))
The head() function was then used to get a sneak peek at the newly created data frame of the imported "javactresses" dataset; it displays the first six rows of the data frame. Successful output of this code confirmed a smooth import of the dataset.
head(actressprofile)
The str() function was used to give a summary of all columns in the data frame, as well as check the types of each variable. As observed from the output, the dataset contains multiple data types (more types would appear after further type conversions later in the process) and therefore satisfies requirement #2.
str(actressprofile)
## tibble [10,600 x 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:10600] 1 2 3 4 5 6 7 8 9 10 ...
## $ name : chr [1:10600] "<U+6CE2><U+591A><U+91CE><U+7D50><U+8863>" "<U+4E0A><U+539F><U+4E9C><U+8863>" "<U+5317><U+6761><U+9EBB><U+5983>" "<U+98A8><U+9593><U+3086><U+307F>" ...
## $ translated_name : chr [1:10600] "Yui Hatano" "Ai Uehara" "Hojo Asahi" "Yumi Kazama" ...
## $ imgurl : chr [1:10600] "https://pics.javhoo.net/2016/02/2jv_a-1.jpg" "https://pics.javhoo.net/2016/02/2ny_a.jpg" "https://pics.javhoo.net/2016/02/2lk.jpg" "https://pics.javhoo.net/2016/02/2t_a.jpg" ...
## $ birthday : Date[1:10600], format: "1988-05-24" "1992-11-12" ...
## $ height : chr [1:10600] "163cm" NA NA "160cm" ...
## $ cup_size : chr [1:10600] "D" "E" NA "F" ...
## $ bust : chr [1:10600] "88cm" "83cm" NA "93cm" ...
## $ waist : chr [1:10600] "59cm" "57cm" NA "60cm" ...
## $ hips : chr [1:10600] "85cm" "82cm" NA "90cm" ...
## $ birthplace : chr [1:10600] "<U+4EAC><U+90FD><U+5E9C>" NA NA "<U+6771><U+4EAC><U+90FD>" ...
## $ translated_birthplace: chr [1:10600] "Kyoto" "#VALUE!" "#VALUE!" "Tokyo" ...
## $ hobby : chr [1:10600] "<U+30B2><U+30FC><U+30E0>" NA NA "<U+6C34><U+6CF3>" ...
## $ translated_hobby : chr [1:10600] "game" "#VALUE!" "#VALUE!" "swimming" ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. name = col_character(),
## .. translated_name = col_character(),
## .. imgurl = col_character(),
## .. birthday = col_date(format = ""),
## .. height = col_character(),
## .. cup_size = col_character(),
## .. bust = col_character(),
## .. waist = col_character(),
## .. hips = col_character(),
## .. birthplace = col_character(),
## .. translated_birthplace = col_character(),
## .. hobby = col_character(),
## .. translated_hobby = col_character()
## .. )
Since Japanese characters were not displayed properly in the output of this function and considering that all columns containing Japanese characters already have their corresponding translations, it was decided to discard all columns containing Japanese characters to avoid information redundancy, as well as character encoding issues. Japanese characters were found in columns 2, 11 and 13 so these were removed using the standard subsetting technique.
actressprofile <- actressprofile[ , c(-2, -11, -13)]
The result of the subsetting was easily checked using the head() function. It was confirmed that all columns containing Japanese characters have been removed.
head(actressprofile)
The first column, containing arbitrary ID numbers of the actresses, appears to serve no purpose in describing this dataset or for further analyses later down the track. Likewise, the second column, containing URLs to the actresses' profile pictures, seems unnecessary. Also, there seems to be little interesting insight to be gained by knowing the hobbies of the actresses. To generate a lean and efficient dataset, I decided to remove these, again by subsetting to remove columns 1, 3 and 11.
actressprofile <- actressprofile[ , c(-1, -3, -11)]
Success of this second subsetting was confirmed again using the head() function. The dataset is now much more straightforward.
head(actressprofile)
Consider the variables height, bust, waist and hips. Practically these are numeric measurements, but the stored values have been "contaminated" with the string suffix "cm". All "cm" suffixes were removed using the str_remove_all() function of the stringr package. This is a crucial step because, if it were not performed, these numeric variables could not undergo any statistical analyses. This step fulfils requirement #3 and arguably requirement #5, as a numeric value combined with a string in one cell is essentially two variables in one cell and thus violates the tidy data principles of Wickham and Grolemund (2016).
actressprofile$height <- str_remove_all(actressprofile$height,"[cm]")
actressprofile$bust <- str_remove_all(actressprofile$bust,"[cm]")
actressprofile$waist <- str_remove_all(actressprofile$waist, "[cm]")
actressprofile$hips <- str_remove_all(actressprofile$hips, "[cm]")
Again, the head() function was used to confirm that all contaminating "cm" suffixes have been deleted.
head(actressprofile)
The next data exploration step was to find out the current age of the actresses. Lubridate's today() function was used to get the date at the time of writing. Lubridate's ymd() function was used to ensure that the original birthday variable was correctly parsed into a YYYY-MM-DD format. These two functions were nested inside a difftime() function to calculate the actresses' age in days. The resulting values were assigned to a new variable named age. This mutation of a new variable from an existing one satisfies requirement #6.
After the actresses' age in days has been obtained, it was converted to years by using the base R as.numeric() function with the division by 365.25 passed as part of the argument.
actressprofile$age <- difftime(today(), ymd(actressprofile$birthday), units = "days")
actressprofile$age <- as.numeric(actressprofile$age/365.25)
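An equivalent way to obtain fractional ages directly in years is lubridate's interval() and time_length() pair, which handles leap years through interval arithmetic rather than division by 365.25. A minimal sketch of this alternative (not the approach used above; the age_years name is hypothetical):
# Hypothetical alternative: fractional age in years via interval arithmetic
age_years <- time_length(interval(ymd(actressprofile$birthday), today()), unit = "years")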
The newly created age column was moved to be adjacent to the birthday column. Placing related information together maintains the logical structure of the dataset for easy and efficient analyses and interpretation. The move was done using dplyr's relocate() function. The head() function was then used to confirm the relocation.
actressprofile <- actressprofile %>% relocate(age, .after = birthday)
head(actressprofile)
Having performed these data manipulations, the structure of the data frame was thoroughly checked to see whether all variables were of the correct type. The str() function was used to get an overview of all variables and their types in the current state of the data frame.
str(actressprofile)
## tibble [10,600 x 9] (S3: tbl_df/tbl/data.frame)
## $ translated_name : chr [1:10600] "Yui Hatano" "Ai Uehara" "Hojo Asahi" "Yumi Kazama" ...
## $ birthday : Date[1:10600], format: "1988-05-24" "1992-11-12" ...
## $ age : num [1:10600] 32.4 27.9 NA 41.7 33.4 ...
## $ height : chr [1:10600] "163" NA NA "160" ...
## $ cup_size : chr [1:10600] "D" "E" NA "F" ...
## $ bust : chr [1:10600] "88" "83" NA "93" ...
## $ waist : chr [1:10600] "59" "57" NA "60" ...
## $ hips : chr [1:10600] "85" "82" NA "90" ...
## $ translated_birthplace: chr [1:10600] "Kyoto" "#VALUE!" "#VALUE!" "Tokyo" ...
There appeared to be quite a few variables that were not of the correct type and thus needed to be converted accordingly. The first variable that required a type conversion was height, because it was initially stored as a character variable. Height is a numeric variable that can have decimal points, so it was converted into a double floating point numeric value using the base R as.numeric() function. Subsequently, the typeof() function was used to confirm that the variable has indeed been converted to a double numeric (fulfils requirement #3).
actressprofile$height <- as.numeric(actressprofile$height)
typeof(actressprofile$height)
## [1] "double"
The variable cup_size stores the breast-size category of an actress. Since it is ordinal categorical data with an inherent ranking, it must be converted to an ordered factor. This was done using the base R factor() function. Note that the ordered = TRUE argument was used, because the sizes must be ranked in ascending alphabetical order, e.g. A < B < C. The class() function was then used to confirm a successful conversion from character to factor as well as the ordering (fulfils requirement #3 and #4).
actressprofile$cup_size <- factor(actressprofile$cup_size, ordered = TRUE)
class(actressprofile$cup_size)
## [1] "ordered" "factor"
The bust variable is a numeric measure of the actresses' breast sizes in centimetres. Initially it had the type of character, which made it impossible to be used in statistical analyses. Conversion to a double numeric type was again performed using the as.numeric() function. Subsequently, the class() and typeof() functions were used to ascertain that the variable has been converted to a double numeric. (fulfils requirement #3).
actressprofile$bust <- as.numeric(actressprofile$bust)
class(actressprofile$bust)
## [1] "numeric"
typeof(actressprofile$bust)
## [1] "double"
The waist variable is a numeric measure of the actresses' waist sizes in centimetres. Initially it had the type of character, which clearly is incorrect. Conversion to a double numeric type was again performed using the as.numeric() function. Subsequently, the class() and typeof() functions were used to ascertain that the variable has been converted to a double numeric (fulfils requirement #3).
actressprofile$waist <- as.numeric(actressprofile$waist)
class(actressprofile$waist)
## [1] "numeric"
typeof(actressprofile$waist)
## [1] "double"
Finally, the hips variable is a numeric measure of the actresses' hip circumference in centimetres. Similar to other body measurements mentioned previously, it initially had the inappropriate type of character. It was converted to a double numeric variable using the as.numeric() function. Again, the class() and typeof() functions were used afterwards to ascertain that the variable has been converted to a double numeric. (fulfils requirement #3).
actressprofile$hips <- as.numeric(actressprofile$hips)
class(actressprofile$hips)
## [1] "numeric"
typeof(actressprofile$hips)
## [1] "double"
A key metric in the beauty-verse is the waist-to-hip ratio. In order to calculate this and store it in a new variable called waist-to-hip_ratio, dplyr's mutate() function was used. The ratio was calculated as waist size / hip size. Note that the resulting double floating point values were not rounded, to minimise compounded rounding error in subsequent statistical analyses. Creation of the new variable was confirmed using the str() function. (fulfils requirement #6).
actressprofile <- mutate(actressprofile, `waist-to-hip_ratio` = waist/hips)
str(actressprofile)
## tibble [10,600 x 10] (S3: tbl_df/tbl/data.frame)
## $ translated_name : chr [1:10600] "Yui Hatano" "Ai Uehara" "Hojo Asahi" "Yumi Kazama" ...
## $ birthday : Date[1:10600], format: "1988-05-24" "1992-11-12" ...
## $ age : num [1:10600] 32.4 27.9 NA 41.7 33.4 ...
## $ height : num [1:10600] 163 NA NA 160 158 168 154 157 157 158 ...
## $ cup_size : Ord.factor w/ 17 levels "A"<"B"<"C"<"D"<..: 4 5 NA 6 10 11 3 3 5 2 ...
## $ bust : num [1:10600] 88 83 NA 93 101 101 84 87 88 76 ...
## $ waist : num [1:10600] 59 57 NA 60 55 59 58 57 58 57 ...
## $ hips : num [1:10600] 85 82 NA 90 84 92 83 83 87 81 ...
## $ translated_birthplace: chr [1:10600] "Kyoto" "#VALUE!" "#VALUE!" "Tokyo" ...
## $ waist-to-hip_ratio : num [1:10600] 0.694 0.695 NA 0.667 0.655 ...
The newly created column was relocated and placed next to the waist and hips columns for a more logical and intuitive arrangement of the data frame.
actressprofile <- actressprofile %>% relocate(`waist-to-hip_ratio`, .after = hips)
Upon reviewing the contents of the translated_birthplace column, it was found that many of the Japanese prefecture names were followed by the word "prefecture". These suffixes were removed in order to prepare this dataset for merging with the second dataset containing population data of Japanese prefectures. The operation was performed using the str_remove_all() function of the stringr package.
actressprofile$translated_birthplace <- str_remove_all(actressprofile$translated_birthplace,"[Prefecture]")
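Note that "[Prefecture]" is a regular-expression character class, so this call strips every occurrence of the individual letters P, r, e, f, c, t and u rather than the word "Prefecture" as a whole; this is what produces the garbled values (e.g. "Kyoo", "Saiama", "Osaka p") seen in the next step. A pattern targeting the literal word would have avoided the problem. The following sketch illustrates the difference on hypothetical example values (it was not applied here, so the manual corrections below were still required):
# Illustration only: character class versus literal word removal
example <- c("Kyoto Prefecture", "Osaka prefecture", "Tokyo")
str_remove_all(example, "[Prefecture]")                                  # "Kyoo " "Osaka p" "Tokyo" (letters stripped)
str_remove_all(example, regex("\\s*prefecture\\s*", ignore_case = TRUE)) # "Kyoto" "Osaka"   "Tokyo" (word stripped)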
All unique values of the translated_birthplace column were inspected using the base R unique() function. This check was done to detect and correct all garbled or misspelt prefecture names, largely produced by the character-class removal above, before joining this data frame with the second one. Such name errors are obvious data errors, so this exercise fulfils requirement #7.
unique(actressprofile$translated_birthplace)
## [1] "Kyoo" "#VALUE!" "Tokyo" "England"
## [5] "Kanagawa " "Miyazaki " "Hokkaido" "Chiba "
## [9] "Yamaghi " "Gi " "Yamagaa " "Saiama"
## [13] "Nagano " "Hyogo " "Okinawa " "Miyagi "
## [17] "Gnma " "Shizoka " "Shiga " "Aomoi "
## [21] "Akia" "Hioshima " "Ibaaki " "Kmamoo "
## [25] "Osaka p" "Aihi " "Toyama " "Fkoka "
## [29] "Kanagawa" "Tohigi " "Ishikawa " "Kagoshima p"
## [33] "Fki " "Niigaa " "Ehim " "Shiman "
## [37] "Saga " "Fkshima " "Naa " "Okayama "
## [41] "Finland" "Los Angls" "Iwa " "Tooi "
## [45] "Wakayama " "Rssia" "Nagasaki " "Fkshima"
## [49] "Mi " "Nohn Eop" "Nagano" "Fan"
## [53] "Kagawa " "USA" "Kagoshima" "Kohi "
## [57] "Unid Sas o Amia" "Iiomo Island" "Niigaa" "Yamanashi "
## [61] "Bijing China" "Kob Ciy" "Spain" "Dominian Rpbli"
## [65] "Chiba" "Bazil" "Amia" "Taiwan"
## [69] "Soh o h island" "Ialy" "Fkoka" "Hyogo"
## [73] "Tokshima " "Aihi" "oland" "Oia "
After all the garbled names were identified, they were corrected using stringr's str_replace_all() function (fulfils requirement #7).
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Kyoo", "Kyoto")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Yamagaa", "Yamagata")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Akia", "Akita")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Aihi", "Aichi")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Shiman", "Shimane")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Rssia", "Russia")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Kohi", "Kochi")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Bijing China", "China")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Bazil", "Brazil")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Fkoka", "Fukuoka")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Oia", "Oita")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "#VALUE!", "")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Kyoo", "Kyoto")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Saiama", "Saitama")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Gnma", "Gumma")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Hioshima", "Hiroshima")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Kagoshima p", "Kagoshima")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Los Angls", "USA")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Fan", "")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Unid Sas o Amia", "USA")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Kob Ciy", "Kobe")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Amia", "")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Shizoka", "Shizuoka")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Ibaaki", "Ibaraki")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Fkoka", "Fukuoka")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Fki", "Fukui")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Fkshima", "Fukushima")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Iwa", "Iwate")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Fkshima", "Fukushima")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Iiomo Island", "")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Tokshima", "Tokushima")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Yamaghi", "Yamaguchi")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Kmamoo", "Kumamoto")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Niigaa", "Niigata")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Naa", "Nara")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Tooi", "Tottori")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Mi", "Mie")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Niigaa", "Niigata")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Dominian Rpbli", "")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Soh o h island", "")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Aihi", "Aichi")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Gi", "Gifu")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Aomoi", "Aomori")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Osaka p", "Osaka")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Tohigi", "Tochigi")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Ehim", "Ehime")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Nohn Eop", "")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Ialy", "Italy")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "oland", "Poland")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Mieyazaki", "Miyazaki")
actressprofile$translated_birthplace <- str_replace_all(actressprofile$translated_birthplace, "Mieyagi", "Miyagi")
Success of the corrections was checked using the unique() function again.
unique(actressprofile$translated_birthplace)
## [1] "Kyoto" "" "Tokyo" "England" "Kanagawa "
## [6] "Miyazaki " "Hokkaido" "Chiba " "Yamaguchi " "Gifu "
## [11] "Yamagata " "Saitama" "Nagano " "Hyogo " "Okinawa "
## [16] "Miyagi " "Gumma " "Shizuoka " "Shiga " "Aomori "
## [21] "Akita" "Hiroshima " "Ibaraki " "Kumamoto " "Osaka"
## [26] "Aichi " "Toyama " "Fukuoka " "Kanagawa" "Tochigi "
## [31] "Ishikawa " "Kagoshima" "Fukui " "Niigata " "Ehime "
## [36] "Shimane " "Saga " "Fukushima " "Nara " "Okayama "
## [41] "Finland" "USA" "Iwate " "Tottori " "Wakayama "
## [46] "Russia" "Nagasaki " "Fukushima" "Mie " "Nagano"
## [51] "Kagawa " "Kochi " "Niigata" "Yamanashi " "China"
## [56] "Kobe" "Spain" "Chiba" "Brazil" "Taiwan"
## [61] "Italy" "Fukuoka" "Hyogo" "Tokushima " "Aichi"
## [66] "Poland" "Oita "
It was observed that there were numerous occurrences of trailing whitespace after prefecture names. This could become problematic in the subsequent joining process. All leading and trailing whitespace was removed using stringr's str_trim() function. The unique() function was then used to confirm that whitespace on both sides of the strings had been removed.
actressprofile$translated_birthplace <- str_trim(actressprofile$translated_birthplace, "right")
actressprofile$translated_birthplace <- str_trim(actressprofile$translated_birthplace, "left")
unique(actressprofile$translated_birthplace)
## [1] "Kyoto" "" "Tokyo" "England" "Kanagawa" "Miyazaki"
## [7] "Hokkaido" "Chiba" "Yamaguchi" "Gifu" "Yamagata" "Saitama"
## [13] "Nagano" "Hyogo" "Okinawa" "Miyagi" "Gumma" "Shizuoka"
## [19] "Shiga" "Aomori" "Akita" "Hiroshima" "Ibaraki" "Kumamoto"
## [25] "Osaka" "Aichi" "Toyama" "Fukuoka" "Tochigi" "Ishikawa"
## [31] "Kagoshima" "Fukui" "Niigata" "Ehime" "Shimane" "Saga"
## [37] "Fukushima" "Nara" "Okayama" "Finland" "USA" "Iwate"
## [43] "Tottori" "Wakayama" "Russia" "Nagasaki" "Mie" "Kagawa"
## [49] "Kochi" "Yamanashi" "China" "Kobe" "Spain" "Brazil"
## [55] "Taiwan" "Italy" "Tokushima" "Poland" "Oita"
The actressprofile data frame appeared to be in a tidy and ideal form at this stage. The final touch was to ensure that all column names were descriptive, succinct and professional. Renaming of the columns was performed using dplyr's rename() function.
actressprofile <- rename(actressprofile, Name = translated_name)
actressprofile <- rename(actressprofile, Birthday = birthday)
actressprofile <- rename(actressprofile, Age = age)
actressprofile <- rename(actressprofile, Height = height)
actressprofile <- rename(actressprofile, Cup_size = cup_size)
actressprofile <- rename(actressprofile, Bust = bust)
actressprofile <- rename(actressprofile, Waist = waist)
actressprofile <- rename(actressprofile, Hips = hips)
actressprofile <- rename(actressprofile, `Waist-Hip_ratio` = `waist-to-hip_ratio`)
actressprofile <- rename(actressprofile, Prefecture = translated_birthplace)
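dplyr's rename() also accepts several new_name = old_name pairs in one call, so the ten statements above could be collapsed into a single one. A sketch of the equivalent form (an alternative to, not in addition to, the sequence above):
# Equivalent single rename() call (sketch)
actressprofile <- rename(actressprofile,
                         Name = translated_name, Birthday = birthday, Age = age,
                         Height = height, Cup_size = cup_size, Bust = bust,
                         Waist = waist, Hips = hips,
                         `Waist-Hip_ratio` = `waist-to-hip_ratio`,
                         Prefecture = translated_birthplace)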
The second dataset containing information on population densities of Japan's 47 prefectures in 2015 was imported using the read_csv() function.
popdensity2015 <- read_csv("prefecturepopdensity2015.csv", col_names = TRUE)
To check successful importing of the dataset, the head() function was used.
head(popdensity2015)
The structure of the dataset was then checked using the str() function. Because there are 47 columns in the dataset, the output of the str() function would be too long. Therefore only the code is shown here.
str(popdensity2015)
It was obvious that the dataset violated the tidy data principles of Wickham and Grolemund (2016). In a tidy dataset, each column must represent one variable. Currently, the columns of popdensity2015 represent values that a "prefecture" variable would assume. The data frame is untidy and in a wide format. This form of data is not appropriate for further data analysis, especially in a vectorised programming environment such as R. To tidy and convert the data frame into a long (tidy) format, the gather() function of the tidyr package was used (fulfils requirement #5).
popdensity2015 <- gather(popdensity2015, key = "Prefecture", value = "Population_density_2015")
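gather() has since been superseded in tidyr by pivot_longer(). Assuming a tidyr version of 1.0 or later, a roughly equivalent call would look like the following sketch (not the function used in this project):
# Hypothetical equivalent using the newer tidyr verb
popdensity2015 <- pivot_longer(popdensity2015, cols = everything(),
                               names_to = "Prefecture", values_to = "Population_density_2015")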
To confirm a successful gather() tidying operation, the head() and str() functions were performed on the modified data frame.
head(popdensity2015)
str(popdensity2015)
## tibble [47 x 2] (S3: tbl_df/tbl/data.frame)
## $ Prefecture : chr [1:47] "Aichi" "Akita" "Aomori" "Chiba" ...
## $ Population_density_2015: num [1:47] 1446.9 87.9 135.7 1206.8 244.2 ...
A successful conversion of the data frame into a tidy (long) format was confirmed. The data frame also had all variables in the appropriate type. However, since the type of the population density variable was only indicated as a numeric, the typeof() function was used to ascertain that the type is indeed a double numeric.
typeof(popdensity2015$Population_density_2015)
## [1] "double"
In order to be able to get an idea of the characteristics of the actresses' birthplace prefectures, the actressprofile and popdensity2015 data frames were joined based on the common column or attribute of "Prefecture". The inner_join() function of dplyr was used to perform the joining operation. In this case, inner join was used because we only wanted to retain data where both datasets share the exact same prefecture names. This is crucial for data integrity. This joining of the two datasets fulfils requirement #1.
actressdata <- inner_join(actressprofile, popdensity2015, by = "Prefecture")
The newly created data frame generated by the join operation was inspected using the head() and str() functions.
head(actressdata)
str(actressdata)
## tibble [2,319 x 11] (S3: tbl_df/tbl/data.frame)
## $ Name : chr [1:2319] "Yui Hatano" "Yumi Kazama" "Rio (Tina Yuzuki)" "Aino Kishi" ...
## $ Birthday : Date[1:2319], format: "1988-05-24" "1979-02-22" ...
## $ Age : num [1:2319] 32.4 41.7 34 32.7 27.4 ...
## $ Height : num [1:2319] 163 160 154 157 157 158 160 158 160 162 ...
## $ Cup_size : Ord.factor w/ 17 levels "A"<"B"<"C"<"D"<..: 4 6 3 3 5 2 4 7 7 2 ...
## $ Bust : num [1:2319] 88 93 84 87 88 76 86 90 90 82 ...
## $ Waist : num [1:2319] 59 60 58 57 58 57 60 57 59 59 ...
## $ Hips : num [1:2319] 85 90 83 83 87 81 87 85 88 81 ...
## $ Waist-Hip_ratio : num [1:2319] 0.694 0.667 0.699 0.687 0.667 ...
## $ Prefecture : chr [1:2319] "Kyoto" "Tokyo" "Tokyo" "Tokyo" ...
## $ Population_density_2015: num [1:2319] 566 6168 6168 6168 6168 ...
It was confirmed that a new data frame, resulting from an inner join of the actressprofile and popdensity2015 data frames, had been successfully created. All variables were also confirmed to be of the appropriate types. Note that a substantial number of rows was lost after the join, even after the prefecture name spelling corrections had been performed. This was because there were a large number of missing values in the prefecture column of the actressprofile dataset. It was ascertained that the loss of rows was not due to any mismatch of prefecture names between the two data frames, as the unique() function was used on the Prefecture column of the newly generated actressdata dataset to confirm that all 47 prefectures of Japan exist in the set.
unique(actressdata$Prefecture)
## [1] "Kyoto" "Tokyo" "Kanagawa" "Miyazaki" "Hokkaido" "Chiba"
## [7] "Yamaguchi" "Gifu" "Yamagata" "Saitama" "Nagano" "Hyogo"
## [13] "Okinawa" "Miyagi" "Gumma" "Shizuoka" "Shiga" "Aomori"
## [19] "Akita" "Hiroshima" "Ibaraki" "Kumamoto" "Osaka" "Aichi"
## [25] "Toyama" "Fukuoka" "Tochigi" "Ishikawa" "Kagoshima" "Fukui"
## [31] "Niigata" "Ehime" "Shimane" "Saga" "Fukushima" "Nara"
## [37] "Okayama" "Iwate" "Tottori" "Wakayama" "Nagasaki" "Mie"
## [43] "Kagawa" "Kochi" "Yamanashi" "Tokushima" "Oita"
The presence of missing values was detected using the base R is.na() function. Checking was performed on each column of the data frame (fulfils requirement #7).
is.na(actressdata$Name)
is.na(actressdata$Birthday)
is.na(actressdata$Age)
is.na(actressdata$Height)
is.na(actressdata$Cup_size)
is.na(actressdata$Bust)
is.na(actressdata$Waist)
is.na(actressdata$Hips)
is.na(actressdata$`Waist-Hip_ratio`)
is.na(actressdata$Prefecture)
is.na(actressdata$Population_density_2015)
The precise locations of missing values in each column were identified using the which(is.na()) function. It was observed that the Name, Prefecture and Population_density_2015 columns did not contain any missing values.
which(is.na(actressdata$Name))
which(is.na(actressdata$Birthday))
which(is.na(actressdata$Age))
which(is.na(actressdata$Height))
which(is.na(actressdata$Cup_size))
which(is.na(actressdata$Bust))
which(is.na(actressdata$Waist))
which(is.na(actressdata$Hips))
which(is.na(actressdata$`Waist-Hip_ratio`))
which(is.na(actressdata$Prefecture))
which(is.na(actressdata$Population_density_2015))
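The per-column calls above could also be condensed into a single lookup, since is.na() on a data frame returns a logical matrix whose TRUE positions which() can report directly. A minimal sketch:
# Row and column indices of every missing value in one table
which(is.na(actressdata), arr.ind = TRUE)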
The total number of missing values in each column was calculated using the colSums(is.na()) function. The previous observation that three columns did not have any missing values was confirmed (fulfils requirement #7).
colSums(is.na(actressdata))
## Name Birthday Age
## 0 117 117
## Height Cup_size Bust
## 73 391 23
## Waist Hips Waist-Hip_ratio
## 17 17 18
## Prefecture Population_density_2015
## 0 0
An overview of the number of missing values in each column of the data frame was obtained. Missing values in columns containing numeric variables were dealt with first. Before imputation of missing values was performed, histograms of each variable with missing values were generated using the hist() function to examine the distribution of each variable. Knowledge of the distribution of each variable is crucial for deciding whether to impute the missing values with the mean or the median: if the distribution is approximately normal, the mean would be the chosen imputation value; otherwise the median would be used. The par() function with the mfrow parameter was used to display multiple histograms in one compact frame to save document space.
par(mfrow = c(2,3))
hist(actressdata$Age, main = "Actresses age", xlab = "Age (years)", ylab = "Frequency", col = "pink")
hist(actressdata$Height, main = "Actresses height", xlab = "Height (cm)", ylab = "Frequency", col = "pink")
hist(actressdata$Bust, main = "Actresses bust size", xlab = "Bust size (cm)", ylab = "Frequency", col = "pink")
hist(actressdata$Waist, main = "Actresses waist size", xlab = "Waist size (cm)", ylab = "Frequency", col = "pink")
hist(actressdata$Hips, main = "Actresses hip size", xlab = "Hip size (cm)", ylab = "Frequency", col = "pink")
hist(actressdata$`Waist-Hip_ratio`, main = "Actresses Waist-to-hip Ratio", xlab = "Waist-to-hip ratio", ylab = "Frequency", col = "pink")
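The visual assessment of skewness from the histograms can be supplemented with a simple numeric check. A minimal sketch using a hand-rolled, moment-based skewness measure (the skewness_approx function is not from any loaded package and was not part of the original workflow):
# Rough skewness: third standardised moment, ignoring missing values
skewness_approx <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}
sapply(actressdata[, c("Age", "Height", "Bust", "Waist", "Hips", "Waist-Hip_ratio")], skewness_approx)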
To handle the missing values in the Age column, the impute() function of the Hmisc package was used to replace all missing values with the median age of all actresses in the dataset. The median was chosen over the mean because, upon inspecting the histogram of Age, the distribution appears to have some degree of right-skewness, and the median is the appropriate measure of central tendency when working with skewed data (fulfils requirement #7).
actressdata$Age <- impute(actressdata$Age, fun = median)
To confirm successful imputations, the which(is.na()) function was used on the Age column. In addition, the which(is.imputed()) function was performed to check the locations of the imputations.(fulfils requirement #7).
which(is.na(actressdata$Age))
## named integer(0)
which(is.imputed(actressdata$Age))
## [1] 60 93 117 137 169 184 190 205 228 230 238 242 253 287 293
## [16] 294 309 350 358 432 487 488 495 504 511 528 534 566 573 574
## [31] 585 633 634 651 663 668 669 672 712 716 783 792 804 806 824
## [46] 855 865 919 924 925 949 986 987 1105 1121 1131 1154 1161 1170 1191
## [61] 1202 1207 1214 1272 1276 1283 1312 1356 1370 1409 1415 1451 1459 1489 1516
## [76] 1530 1577 1611 1625 1636 1643 1651 1658 1670 1671 1672 1688 1702 1711 1724
## [91] 1821 1865 1881 1901 1911 1913 1915 1920 1931 1948 1959 1970 1997 2001 2004
## [106] 2024 2032 2127 2129 2135 2208 2232 2246 2274 2278 2291 2305
Next, to handle the missing values in the Height column, the impute() function of the Hmisc package was used to replace all missing values with the mean height of all actresses in the dataset. The mean is appropriate for this variable because, inspecting the histogram for height, the values are almost perfectly normally distributed (fulfils requirement #7).
actressdata$Height <- impute(actressdata$Height, fun = mean)
As previously, to confirm successful imputations, the which(is.na()) function was used on the Height column. In addition, the which(is.imputed()) function was performed to check the locations of the imputations.(fulfils requirement #7).
which(is.na(actressdata$Height))
## named integer(0)
which(is.imputed(actressdata$Height))
## [1] 54 57 138 157 169 187 216 296 350 356 395 423 432 524 574
## [16] 577 600 678 792 854 865 874 931 947 967 988 1003 1018 1026 1064
## [31] 1088 1126 1141 1152 1154 1161 1176 1207 1247 1273 1276 1312 1326 1345 1360
## [46] 1426 1481 1515 1516 1539 1544 1552 1599 1606 1636 1651 1658 1746 1771 1782
## [61] 1791 1796 1866 1881 1920 1948 2001 2032 2053 2078 2154 2218 2253
Then, to handle the missing values in the Cup_size column, the impute() function of the Hmisc package was used to replace all missing values with the modal (most frequent) cup-size category of all actresses in the dataset. The mode was selected as the appropriate measure of central tendency in this case because the variable is categorical. Imputation was chosen over excluding the missing values because the missing values in this variable amount to approximately 17% of the total number of observations (fulfils requirement #7).
actressdata$Cup_size <- impute(actressdata$Cup_size, fun = mode)
As previously, to confirm successful imputations, the which(is.na()) function was used on the Cup_size column. In addition, the which(is.imputed()) function was performed to check the locations of the imputations.(fulfils requirement #7).
which(is.na(actressdata$Cup_size))
## named integer(0)
which(is.imputed(actressdata$Cup_size))
## [1] 17 51 66 112 133 157 161 162 168 207 210 214 219 226 227
## [16] 248 285 286 290 307 309 350 351 352 379 382 384 392 395 399
## [31] 403 405 420 423 435 438 442 452 456 464 473 482 490 493 506
## [46] 517 522 525 536 544 545 546 551 555 566 575 584 590 591 600
## [61] 620 628 638 640 641 646 649 650 653 660 676 692 697 713 715
## [76] 718 724 725 739 740 749 753 758 762 773 776 778 788 807 811
## [91] 813 831 850 856 858 865 874 888 912 917 922 923 934 935 939
## [106] 944 956 959 960 962 964 965 967 971 974 977 979 982 995 1002
## [121] 1003 1007 1009 1015 1018 1023 1029 1034 1039 1043 1051 1059 1067 1069 1076
## [136] 1082 1090 1093 1096 1097 1103 1112 1115 1119 1124 1125 1127 1135 1139 1141
## [151] 1145 1148 1149 1150 1152 1154 1157 1159 1167 1174 1177 1185 1188 1193 1209
## [166] 1211 1212 1223 1228 1232 1243 1250 1262 1270 1273 1274 1278 1282 1289 1291
## [181] 1295 1296 1299 1300 1305 1307 1308 1312 1313 1314 1320 1324 1329 1338 1342
## [196] 1344 1345 1348 1350 1351 1352 1355 1360 1370 1376 1377 1383 1390 1394 1396
## [211] 1400 1402 1403 1411 1415 1416 1417 1419 1420 1426 1429 1431 1432 1438 1447
## [226] 1451 1457 1466 1467 1468 1476 1481 1489 1491 1494 1503 1506 1509 1510 1512
## [241] 1518 1521 1527 1530 1531 1540 1546 1557 1558 1560 1561 1576 1577 1581 1585
## [256] 1586 1598 1599 1604 1606 1614 1626 1634 1649 1653 1666 1677 1689 1690 1692
## [271] 1693 1699 1703 1708 1715 1716 1723 1730 1742 1760 1769 1784 1788 1793 1797
## [286] 1801 1803 1804 1806 1816 1818 1819 1822 1823 1827 1834 1836 1840 1843 1844
## [301] 1848 1852 1866 1874 1877 1881 1884 1890 1892 1896 1910 1912 1913 1918 1919
## [316] 1923 1925 1934 1936 1941 1952 1957 1969 1975 1980 1981 1984 1985 1992 1993
## [331] 1997 2011 2012 2021 2026 2031 2041 2045 2047 2051 2059 2064 2075 2077 2084
## [346] 2091 2094 2100 2110 2111 2115 2131 2134 2141 2150 2159 2160 2163 2166 2183
## [361] 2184 2185 2192 2193 2195 2197 2214 2219 2221 2222 2236 2237 2239 2241 2249
## [376] 2252 2253 2254 2256 2257 2258 2260 2272 2273 2280 2288 2291 2292 2300 2302
## [391] 2316
To handle the missing values in the Bust column, the impute() function of the Hmisc package was used to replace all missing values with the mean bust size of all actresses in the dataset. The mean was selected as the measure of central tendency in this case as the distribution of bust sizes in the dataset was practically normally distributed (fulfils requirement #7).
actressdata$Bust <- impute(actressdata$Bust, fun = mean)
To confirm successful imputations, the which(is.na()) function was used on the Bust column. In addition, the which(is.imputed()) function was performed to check the locations of the imputations.(fulfils requirement #7).
which(is.na(actressdata$Bust))
## named integer(0)
which(is.imputed(actressdata$Bust))
## [1] 442 522 566 641 788 865 874 977 1152 1154 1300 1312 1370 1521 1530
## [16] 1599 1742 1840 1881 1919 2159 2166 2195
Next, to handle the missing values in the Waist column, the impute() function of the Hmisc package was used to replace all missing values with the mean waist circumference of all actresses in the dataset. The mean was selected as the measure of central tendency in this case as the distribution of waist circumference in the histogram exhibited only minimal skewness.
actressdata$Waist <- impute(actressdata$Waist, fun = mean)
As the routine check for imputations, the which(is.na()) function was used on the Waist column. In addition, the which(is.imputed()) function was performed to check the locations of the imputations.(fulfils requirement #7).
which(is.na(actressdata$Waist))
## named integer(0)
which(is.imputed(actressdata$Waist))
## [1] 227 442 522 566 829 865 874 1059 1152 1154 1300 1312 1370 1599 1632
## [16] 1881 2195
To handle the missing values in the Hips column, the impute() function of the Hmisc package was used to replace all missing values with the mean hip size of all actresses in the dataset. Mean was selected as a measure of central tendency in this case as the distribution of hip size in the histogram only exhibited minimal skewness.
actressdata$Hips <- impute(actressdata$Hips, fun = mean)
To confirm that missing values for the Hips variable have been completely removed, the which(is.na()) function was used on the Hips column. In addition, the which(is.imputed()) function was performed to check the locations of the imputations.(fulfils requirement #7).
which(is.na(actressdata$Hips))
## named integer(0)
which(is.imputed(actressdata$Hips))
## [1] 420 442 522 566 829 865 874 1059 1152 1154 1300 1312 1370 1599 1632
## [16] 1881 2195
Next, to handle the missing values in the Waist-Hip_ratio column, the impute() function of the Hmisc package was used to replace all missing values with the mean waist-to-hip ratio of all actresses in the dataset. The mean was selected as the measure of central tendency in this case as the distribution of the waist-to-hip ratio in the histogram exhibited only minimal skewness.
actressdata$`Waist-Hip_ratio` <- impute(actressdata$`Waist-Hip_ratio`, fun = mean)
To confirm that missing values for the Waist-to-hip variable have been completely removed, the which(is.na()) function was used. In addition, the which(is.imputed()) function was performed to check the locations of the imputations.(fulfils requirement #7).
which(is.na(actressdata$`Waist-Hip_ratio`))
## named integer(0)
which(is.imputed(actressdata$`Waist-Hip_ratio`))
## [1] 227 420 442 522 566 829 865 874 1059 1152 1154 1300 1312 1370 1599
## [16] 1632 1881 2195
Having dealt with all missing values in the numeric variables, the missing values in the Birthday variable were finally handled. There were 117 missing values for the Birthday variable, and the actressdata dataset has 2,319 observations in total, so the missing birthday values amount to approximately 5% of the observations. Since numerous authors in the literature recommend removing observations with missing values when they account for up to 5% of the dataset, the 117 observations with missing birthday values were excluded to avoid bias and problems in calculation during further statistical analyses. Exclusion of these observations was performed by a complete "flush" of the whole data frame of missing values. This was deemed a safe approach because all other missing values in the other columns had already been imputed appropriately. There was no reasonable way to impute missing birthday values, hence they were simply excluded. The na.omit() function was used for this exercise.
actressdata <- na.omit(actressdata)
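Since only the Birthday column still contained missing values at this point, the same result could have been achieved more explicitly with tidyr's drop_na() restricted to that column. A sketch of this alternative, assuming the Hmisc-imputed columns behave as ordinary vectors when the rows are filtered:
# Hypothetical targeted alternative: drop only rows with a missing Birthday
actressdata <- drop_na(actressdata, Birthday)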
Now that all columns had been checked for missing values and appropriate imputations and removals had been performed, all columns were checked for the presence of special values, namely NaN and Inf. To perform this efficiently without checking each column repetitively, a custom function was created that detects Inf or NaN values in columns with numeric variables and returns the count of such values, if any, for each column in the data frame. The function was passed as an argument inside sapply(). Note that since this technique gives the sums of special values in each column, long outputs were avoided even though the actressdata data frame is quite large (fulfils requirement #7).
sapply(actressdata, function(x) sum(is.infinite(x)))
## Name Birthday Age
## 0 0 0
## Height Cup_size Bust
## 0 0 0
## Waist Hips Waist-Hip_ratio
## 0 0 0
## Prefecture Population_density_2015
## 0 0
sapply(actressdata, function(x) sum(is.nan(x)))
## Name Birthday Age
## 0 0 0
## Height Cup_size Bust
## 0 0 0
## Waist Hips Waist-Hip_ratio
## 0 0 0
## Prefecture Population_density_2015
## 0 0
It was confirmed that none of the columns in the data frame had special values. Regarding obvious errors, simply reviewing the summary statistics of the dataset using the summary() function showed no nonsensical values in any variable of the dataset (e.g. age > 120 or height < 100).
summary(actressdata)
##
## 53 values imputed to 158.4176
##
##
## Imputed Values:
##
## X[[i]]
## n missing distinct
## 366 25 12
##
## lowest : A B C D E, highest: H I J K L
##
## Value A B C D E F G H I J K
## Frequency 1 15 71 155 47 33 27 9 4 2 1
## Proportion 0.003 0.041 0.194 0.423 0.128 0.090 0.074 0.025 0.011 0.005 0.003
##
## Value L
## Frequency 1
## Proportion 0.003
##
##
## 16 values imputed to 86.65897
##
##
## 11 values imputed to 58.39227
##
##
## 11 values imputed to 85.62728
##
##
## 12 values imputed to 0.6823638
## Name Birthday Age Height
## Length:2202 Min. :1957-02-09 Min. :23.44 Min. :138.0
## Class :character 1st Qu.:1981-03-28 1st Qu.:32.02 1st Qu.:155.0
## Mode :character Median :1985-03-30 Median :35.56 Median :158.0
## Mean :1984-08-15 Mean :36.18 Mean :158.4
## 3rd Qu.:1988-10-13 3rd Qu.:39.56 3rd Qu.:162.0
## Max. :1997-05-13 Max. :63.69 Max. :182.0
##
## Cup_size Bust Waist Hips
## D :813 Min. : 70.00 Min. :51.00 Min. : 58.00
## C :391 1st Qu.: 83.00 1st Qu.:57.00 1st Qu.: 84.00
## E :351 Median : 86.00 Median :58.00 Median : 85.00
## F :225 Mean : 86.57 Mean :58.34 Mean : 85.58
## G :172 3rd Qu.: 88.00 3rd Qu.:60.00 3rd Qu.: 88.00
## B :101 Max. :124.00 Max. :87.00 Max. :100.00
## (Other):149
## Waist-Hip_ratio Prefecture Population_density_2015
## Min. :0.5955 Length:2202 Min. : 68.6
## 1st Qu.:0.6705 Class :character 1st Qu.:1206.8
## Median :0.6824 Mode :character Median :4639.9
## Mean :0.6821 Mean :3894.7
## 3rd Qu.:0.6951 3rd Qu.:6168.1
## Max. :1.0235 Max. :6168.1
##
With the dataset being complete (no missing values) and containing no nonsensical figures, the next step was to check for outliers. This is a crucial step because outliers can bias the results of statistical analyses and reduce the power of statistical tests. The approach used to detect outliers in this dataset was Tukey's method of outlier detection based on the IQR. Tukey's method is non-parametric, meaning that it does not depend on the shape of the dataset's distribution; I decided to use it because it is robust to departures from normality in the data. Boxplots were used to detect outliers using Tukey's method, for which the base R boxplot() function was used. Note here that the variables had to be re-converted to a numeric type using as.numeric() because these variables had just undergone missing value imputation. Without this re-conversion, the boxplot() function was observed to return an error specifying an incorrect number of dimensions (fulfils requirement #8).
par(mfrow = c(2,4))
age_boxplot <- boxplot(as.numeric(actressdata$Age), main = "Actresses age", ylab = "Age", col = "pink")
height_boxplot <- boxplot(as.numeric(actressdata$Height), main = "Actresses height", ylab = "Height", col = "pink")
bust_boxplot <- boxplot(as.numeric(actressdata$Bust), main = "Actresses bust size", ylab = "Bust size", col = "pink")
waist_boxplot <- boxplot(as.numeric(actressdata$Waist), main = "Actresses waist size", ylab = "Waist size", col = "pink")
hips_boxplot <- boxplot(as.numeric(actressdata$Hips), main = "Actresses hip size", ylab = "Hip size", col = "pink")
waist_hip_boxplot <- boxplot(as.numeric(actressdata$`Waist-Hip_ratio`), main = "Actresses waist-hip ratio", ylab = "Waist-hip ratio", col = "pink")
popdensity_boxplot <- boxplot(as.numeric(actressdata$Population_density_2015), main = "Prefecture pop density", ylab = "Population density", col = "pink")
From the generated boxplots, we can clearly see that there are numerous outliers present in the dataset. In order to identify the exact values and locations of these outliers, the out component of each boxplot object was inspected, for example bust_boxplot$out.
age_boxplot$out
## [1] 51.80014 51.36482 56.09856 51.29637 51.75907 51.09377 53.70568 51.52909
## [9] 58.06708 53.16085 58.36003 63.14305 54.48871 56.73922 53.19918 52.58316
## [17] 54.89391 63.14031 52.79671 54.42574 52.90349 53.58795 53.25120 51.96988
## [25] 56.91170 55.13210 56.55031 51.54278 56.10678 51.66324 57.82341 54.35181
## [33] 55.21697 58.72416 52.46817 57.05133 55.05818 54.62286 59.55647 58.61739
## [41] 63.69336 52.80219 54.03422 52.35044 53.68378 58.29979 52.57769 55.11294
## [49] 54.79535 55.93155 52.42984 51.99179 52.18344 51.63860 52.99110 51.87953
## [57] 52.63518 52.96099 52.29021 53.43737 51.27173 51.42231
height_boxplot$out
## [1] 143 175 143 174 140 178 144 174 175 176 143 178 178 177 173 173 173 143 175
## [20] 178 182 144 177 143 138 174 175
bust_boxplot$out
## [1] 100 101 97 111 98 96 97 98 100 98 100 97 102 110 110 100 98 96
## [19] 108 111 74 96 100 96 124 98 97 100 105 72 100 101 96 98 103 98
## [37] 103 101 107 100 101 100 110 100 110 104 108 100 100 98 103 103 96 97
## [55] 105 70 97 100 96 103 100 100 101 97 98 105 75 96 98 96 97 98
## [73] 100 101 98 100 99 96 98 96 96 101 99 100 75 101 100 98 96 100
## [91] 99 100 98 100 96 105 98 96 96 96 101 97 99 97 98 100 96 103
## [109] 96 98 96 100 98 96 100 98 75 115 105 98 99 96 96 101 100 96
## [127] 98 105 98 101 105 100 105 97 106 98 110 100 100 101
waist_boxplot$out
## [1] 65 52 65 65 65 65 65 66 65 52 87 67 51 52 52 66 65 68 65 65 68 65 52 65 52
hips_boxplot$out
## [1] 95 98 96 95 98 95 95 75 75 95 58 95 96 95 77 98 100 96 95
## [20] 98 63
waist_hip_boxplot$out
## [1] 0.6315789 0.6021505 0.7375000 0.6321839 0.7386364 0.7325581 0.6236559
## [8] 0.6122449 0.6236559 0.6222222 0.7500000 0.7325581 0.7386364 0.6315789
## [15] 0.6333333 1.0235294 0.6309524 0.7444444 0.7500000 0.6071429 0.6117647
## [22] 0.6222222 0.7560976 0.7500000 0.6210526 0.7471264 0.8095238 1.0172414
## [29] 0.6250000 0.6162791 0.6000000 0.6236559 0.6210526 0.6279070 0.6333333
## [36] 0.6304348 0.7402597 0.6292135 0.6179775 0.6309524 0.7349398 0.7349398
## [43] 0.7411765 0.7647059 0.7411765 0.6250000 0.7407407 0.6145833 0.6111111
## [50] 0.7500000 0.6292135 0.6235294 0.6315789 0.6321839 0.6279070 0.7349398
## [57] 0.7349398 0.7441860 0.6250000 0.7500000 0.7558140 0.6263736 0.7500000
## [64] 0.7407407 0.5955056 0.6309524 0.6122449 0.8888889 0.7500000 0.7380952
## [71] 0.7375000
popdensity_boxplot$out
## numeric(0)
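For reference, the fences that drive these boxplot flags follow directly from Tukey's rule (lower fence = Q1 - 1.5*IQR, upper fence = Q3 + 1.5*IQR) and can be computed without drawing a plot. A minimal sketch for the Bust variable (the bust, q1 and q3 names are hypothetical):
# Tukey fences for the Bust variable, computed directly
bust <- as.numeric(actressdata$Bust)
q1 <- unname(quantile(bust, 0.25))
q3 <- unname(quantile(bust, 0.75))
c(lower_fence = q1 - 1.5 * IQR(bust), upper_fence = q3 + 1.5 * IQR(bust))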
From this exercise, it was found that the Bust variable had the greatest number of outliers (140 observations). This is roughly 6% of the entire dataset. The average number of outliers in all the numeric columns of the dataset is 58 observations or approximately 3% of the dataset. Considering the greatest number of outliers is 6% of the entire dataset, which is considerable (greater than the recommended 5% guideline), simply deleting outlier observations in the dataset was deemed inappropriate. Therefore, the capping or winsorising approach was chosen to deal with all outlier values in the dataset, column by column.
Capping was performed using the IQR approach, as shown in the custom function below. Essentially, the function replaces observations above the upper outlier fence with the value of the 95th percentile, and observations below the lower outlier fence with the value of the 5th percentile. (fulfils requirement #8).
winsoriser <- function(x){
  # Compute the 5th, 25th, 75th and 95th percentiles of the variable
  quantiles <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  # Cap values below the lower Tukey fence (Q1 - 1.5*IQR) at the 5th percentile
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  # Cap values above the upper Tukey fence (Q3 + 1.5*IQR) at the 95th percentile
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}
The newly created winsoriser() function was subsequently applied to every numeric column of the actressdata dataset found to contain outliers. Note that the magrittr pipe operator was used for this exercise.
actressdata$Age <- actressdata$Age %>% winsoriser()
actressdata$Height <- actressdata$Height %>% winsoriser()
actressdata$Bust <- actressdata$Bust %>% winsoriser()
actressdata$Waist <- actressdata$Waist %>% winsoriser()
actressdata$Hips <- actressdata$Hips %>% winsoriser()
actressdata$`Waist-Hip_ratio` <- actressdata$`Waist-Hip_ratio` %>% winsoriser()
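For reference, the six assignments above could also be written as a single dplyr call. This is only an equivalent sketch (assuming mutate_at() and vars() are available in the installed dplyr version); it was not run in addition to the column-by-column version:
actressdata <- actressdata %>%
  mutate_at(vars(Age, Height, Bust, Waist, Hips, `Waist-Hip_ratio`), winsoriser)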
After winsorising (capping) had been performed on all numeric columns containing outliers, the boxplots of those variables were regenerated to verify that the outliers had been handled successfully.
par(mfrow = c(2,3))
age_boxplot <- boxplot(as.numeric(actressdata$Age), main = "Actresses age", ylab = "Age", col = "pink")
height_boxplot <- boxplot(as.numeric(actressdata$Height), main = "Actresses height", ylab = "Height", col = "pink")
bust_boxplot <- boxplot(as.numeric(actressdata$Bust), main = "Actresses bust size", ylab = "Bust size", col = "pink")
waist_boxplot <- boxplot(as.numeric(actressdata$Waist), main = "Actresses waist size", ylab = "Waist size", col = "pink")
hips_boxplot <- boxplot(as.numeric(actressdata$Hips), main = "Actresses hip size", ylab = "Hip size", col = "pink")
waist_hip_boxplot <- boxplot(as.numeric(actressdata$`Waist-Hip_ratio`), main = "Actresses waist-hip ratio", ylab = "Waist-hip ratio", col = "pink")
Inspecting the boxplots, it was clear that almost all numeric variables had been cleaned of outliers. The exception is the Bust variable, where one outlier observation still exists. It was decided to leave that observation as it was, considering that it does not lie far beyond the boxplot's upper outlier fence.
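As a quick check of how far that remaining observation sits from the fence (a hypothetical step, not part of the original run), the re-drawn boxplot object can be queried directly:
bust_boxplot$out          # value(s) still flagged as outliers after capping
bust_boxplot$stats[5, ]   # upper whisker, i.e. the largest value inside the fence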
Having handled all outliers in the dataset, the final step was to perform any data transformations needed to make the dataset fully ready for statistical analyses. After examining the histograms of all numeric variables, it was found that the population density variable deviated most markedly from normality, as shown below.
hist(actressdata$Population_density_2015, main = "Japanese prefectures 2015 population density", xlab = "Population density 2015", ylab = "Frequency")
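For completeness, a minimal sketch of how the remaining numeric variables' histograms could be screened in a single panel (these plots are not reproduced here):
par(mfrow = c(2, 3))
hist(actressdata$Age, main = "Age", xlab = "Age")
hist(actressdata$Height, main = "Height", xlab = "Height")
hist(actressdata$Bust, main = "Bust size", xlab = "Bust size")
hist(actressdata$Waist, main = "Waist size", xlab = "Waist size")
hist(actressdata$Hips, main = "Hip size", xlab = "Hip size")
hist(actressdata$`Waist-Hip_ratio`, main = "Waist-hip ratio", xlab = "Waist-hip ratio")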
Different transformation techniques were tried on this variable to identify the one that produced a distribution closest to normal.
log_popdensity <- log10(actressdata$Population_density_2015)
ln_popdensity <- log(actressdata$Population_density_2015)
sqrt_popdensity <- sqrt(actressdata$Population_density_2015)
cuberoot_popdensity <- actressdata$Population_density_2015^(1/3)
square_popdensity <- actressdata$Population_density_2015^2
cube_popdensity <- actressdata$Population_density_2015^3
recip_popdensity <- 1/actressdata$Population_density_2015
recipsquare_popdensity <- actressdata$Population_density_2015^(-2)
fourthpower_popdensity <- actressdata$Population_density_2015^4
boxcox_popdensity <- BoxCox(actressdata$Population_density_2015, lambda = "auto")
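One way to compare these candidates numerically rather than by eye is to look at each transformed variable's sample skewness, where a value near zero indicates a roughly symmetric distribution. The helper below is only a sketch and was not part of the original evaluation:
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3   # simple sample skewness
sapply(list(raw = actressdata$Population_density_2015,
            log10 = log_popdensity, ln = ln_popdensity, sqrt = sqrt_popdensity,
            cuberoot = cuberoot_popdensity, square = square_popdensity,
            cube = cube_popdensity, reciprocal = recip_popdensity,
            recipsquare = recipsquare_popdensity,
            fourthpower = fourthpower_popdensity, boxcox = boxcox_popdensity),
       skewness)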
Evaluating the results of these transformations, the square transformation appeared to bring the distribution of the population density variable closest to an approximately normal shape (although the resulting distribution remains considerably far from normal).
hist(square_popdensity, main = "Square transformed population density", xlab = "Squared population density", ylab = "Frequency")
Therefore, the square-transformed population density was appended to the actressdata data frame (fulfils requirement #9).
actressdata$Squared_population_density_2015 <- square_popdensity
actressdata <- actressdata %>% relocate(Squared_population_density_2015, .after = Population_density_2015)
With the data transformation successfully conducted, wrangling of the JAV actresses data was fully completed. The final form of the data frame was summarised using the glimpse() function of the tibble package.
glimpse(actressdata)
## Rows: 2,202
## Columns: 12
## $ Name <chr> "Yui Hatano", "Yumi Kazama", "Rio (...
## $ Birthday <date> 1988-05-24, 1979-02-22, 1986-10-29...
## $ Age <dbl> 32.40794, 41.65914, 33.97673, 32.71...
## $ Height <dbl> 163, 160, 154, 157, 157, 158, 160, ...
## $ Cup_size <ord> D, F, C, C, E, B, D, G, G, B, C, D,...
## $ Bust <dbl> 88, 93, 84, 87, 88, 76, 86, 90, 90,...
## $ Waist <dbl> 59, 60, 58, 57, 58, 57, 60, 57, 59,...
## $ Hips <dbl> 85, 90, 83, 83, 87, 81, 87, 85, 88,...
## $ `Waist-Hip_ratio` <dbl> 0.6941176, 0.6666667, 0.6987952, 0....
## $ Prefecture <chr> "Kyoto", "Tokyo", "Tokyo", "Tokyo",...
## $ Population_density_2015 <dbl> 565.9, 6168.1, 6168.1, 6168.1, 6168...
## $ Squared_population_density_2015 <dbl> 320242.81, 38045457.61, 38045457.61...
Bache, SM & Wickham, H 2014, magrittr: A Forward-Pipe Operator for R, viewed 12 October 2020, < https://cran.r-project.org/web/packages/magrittr/index.html >.
François, R 2012, How to replace outliers with the 5th and 95th percentile values in R, Stack Overflow, 12 November, viewed 13 October 2020, < https://stackoverflow.com/questions/13339685/how-to-replace-outliers-with-the-5th-and-95th-percentile-values-in-r?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa >.
Harrell Jr, FE 2020, Hmisc: Harrell Miscellaneous, viewed 11 October 2020, < https://cran.r-project.org/web/packages/Hmisc/index.html >.
Hyndman, R, Athanasopoulos, G, Bergmeir, C, Caceres, G, Chhay, L, O'Hara-Wild, M, Petropoulos, F, Razbash, S, Wang, E & Yasmeen, F 2020, forecast: Forecasting Functions for Time Series and Linear Models, viewed 15 October 2020, < https://cran.r-project.org/web/packages/forecast/index.html >.
Komsta, L 2011, outliers: Tests for outliers, viewed 15 October 2020, < https://cran.r-project.org/web/packages/outliers/index.html >.
leeDataWhiz 2020, Untidy Japanese Prefecture 2015 Population Density - Population density data of Japanese prefectures in 2015 for data tidying, Kaggle, 2 October, viewed 8 October 2020, < https://www.kaggle.com/leedatawhiz/untidy-japanese-prefecture-2015-population-density >.
Spinu, V, Grolemund, G & Wickham, H 2020, lubridate: Make Dealing with Dates a Little Easier, viewed 9 October 2020, < https://cran.r-project.org/web/packages/lubridate/index.html >.
twopothead, c 2020, Japanese Pornstars and Adult Videos - metadata of Japanese adult videos and av idols, Kaggle, 21 January, viewed 2 October 2020, < https://www.kaggle.com/twopothead/japanese-pornstars-and-adult-videos >.
Wickham, H, François, R, Henry, L & Müller, K 2018, dplyr: A Grammar of Data Manipulation, R package version 0.7.6, < https://CRAN.R-project.org/package=dplyr >.
Wickham, H & Grolemund, G 2016, R for Data Science: Import, Tidy, Transform, Visualize, and Model Data, O'Reilly Media, Inc., California, USA.
Wickham, H & Hester, J 2020, readr: Read Rectangular Text Data, viewed 12 October 2020, < https://cran.r-project.org/web/packages/readr/index.html >.
Wickham, H 2019, stringr: Simple, Consistent Wrappers for Common String Operations, viewed 11 October 2020, < https://cran.r-project.org/web/packages/stringr/index.html >.
Wickham, H 2020, tidyr: Tidy Messy Data, viewed 12 October 2020, < https://cran.r-project.org/web/packages/tidyr/index.html >.
Xie, Y 2020, knitr: A General-Purpose Package for Dynamic Report Generation in R, viewed 9 October 2020, < https://cran.r-project.org/web/packages/knitr/index.html >.