For this project, we have chosen a dataset from Chicago Face Database containing characteristics of 600 different faces.
The dataset consists of variables of two kinds. There are straight physical appearance objective traits, such as face length, eye height, eye length, etc. Then we have also subjective variables, where the faces were rated from 1 to 7 on the basis of aggression, age appearance, happiness, dominance, etc. by a group of people. Later for every face, an average was calculated for every variable of this type.
This dataset gives us a lot of possibilities in terms of choosing the variable of interest – we can investigate many features and the influences of the different measures. We decided to choose Attractive as our dependent variable and to create a model predicting one’s attractiveness based on available attributes.
Dataset is available in a xlsx file with several worksheets. The first sheet contains a list of variables, as well as their detailed descriptions. The next sheets contain observations of some of the variables and differ from each other. For further analysis, we chose the second sheet that consists of 597 observations and 118 attributes. The first 7 rows are empty, so after removing them and one extra row, the dataset is loaded and ready for further processing.
setwd("C:\\Users\\lewsz\\OneDrive\\Desktop\\UW_materials_semester_2\\machine_learning\\project_small")
df = read_xlsx("CFD 3.0 Norming Data and Codebook.xlsx", sheet = 2, skip = 7)
data <- df[-1,]
head(data)
## # A tibble: 6 x 118
## Model EthnicitySelf GenderSelf AgeSelf AgeRated ...6 FemaleProb MaleProb
## <chr> <chr> <chr> <lgl> <chr> <chr> <dbl> <dbl>
## 1 AF-2~ A F NA 32.5714~ <NA> 1 0
## 2 AF-2~ A F NA 23.6666~ <NA> 1 0
## 3 AF-2~ A F NA 24.4482~ <NA> 0.828 0.172
## 4 AF-2~ A F NA 22.7586~ <NA> 1 0
## 5 AF-2~ A F NA 30.1379~ <NA> 1 0
## 6 AF-2~ A F NA 26.5925~ <NA> 1 0
## # ... with 110 more variables: AsianProb <dbl>, ChineseAsianProb <lgl>,
## # JapaneseAsianProb <lgl>, IndianAsianProb <lgl>, OtherAsianProb <lgl>,
## # MiddleEasternProb <lgl>, BlackProb <dbl>, LatinoProb <dbl>,
## # MultiProb <dbl>, OtherProb <dbl>, WhiteProb <dbl>, Afraid <chr>,
## # ...21 <chr>, Angry <chr>, ...23 <chr>, Attractive <chr>, ...25 <chr>,
## # Babyfaced <chr>, ...27 <chr>, Disgusted <chr>, ...29 <chr>, Dominant <chr>,
## # ...31 <chr>, Feminine <chr>, ...33 <chr>, Happy <chr>, ...35 <chr>,
## # Masculine <chr>, ...37 <chr>, Prototypic <chr>, ...39 <chr>, Sad <chr>,
## # ...41 <chr>, Suitability <chr>, ...43 <chr>, Surprised <chr>, ...45 <chr>,
## # Threatening <chr>, ...47 <chr>, Trustworthy <chr>, ...49 <chr>,
## # Unusual <chr>, ...51 <chr>, Warm <chr>, ...53 <chr>, Competent <chr>,
## # ...55 <chr>, SocialStatus <chr>, ...57 <chr>, LuminanceMedian <dbl>,
## # NoseWidth <dbl>, NoseLength <dbl>, LipThickness <dbl>, FaceLength <dbl>,
## # EyeHeightR <dbl>, EyeHeightL <dbl>, EyeHeightAvg <dbl>, EyeWidthR <dbl>,
## # EyeWidthL <dbl>, EyeWidthAvg <dbl>, FaceWidthCheeks <dbl>,
## # FaceWidthMouth <dbl>, FaceWidthBZ <dbl>, Forehead <dbl>,
## # UpperFaceLength2 <dbl>, PupilTopR <dbl>, PupilTopL <dbl>,
## # PupilTopAsymmetry <dbl>, PupilLipR <dbl>, PupilLipL <dbl>,
## # PupilLipAvg <dbl>, PupilLipAsymmetry <dbl>, EyeDistance <lgl>,
## # BottomLipChin <dbl>, MidcheekChinR <dbl>, MidcheekChinL <dbl>,
## # CheeksAvg <dbl>, MidbrowHairlineR <dbl>, MidbrowHairlineL <dbl>,
## # MidbrowHairlineAvg <dbl>, FaceColorRed <lgl>, FaceColorGreen <lgl>,
## # FaceColorBlue <lgl>, HairColorRed <lgl>, HairColorGreen <lgl>,
## # HairColorBlue <lgl>, HairLuminance <lgl>, EyeLuminanceR <lgl>,
## # EyeLuminanceL <lgl>, EyeBrowThicknessR <lgl>, EyeBrowThicknessL <lgl>,
## # EyeBrowThicknessAvg <lgl>, EyeLidThicknessR <lgl>, EyeLidThicknessL <lgl>,
## # EyeLidThicknessAvg <lgl>, FaceShape <dbl>, Heartshapeness <dbl>,
## # NoseShape <dbl>, LipFullness <dbl>, EyeShape <dbl>, ...
The next step is to check if some of the columns contain missing values.
sapply(data, function(x) sum(is.na(x)))
## Model EthnicitySelf GenderSelf AgeSelf
## 0 0 0 597
## AgeRated ...6 FemaleProb MaleProb
## 0 597 0 0
## AsianProb ChineseAsianProb JapaneseAsianProb IndianAsianProb
## 0 597 597 597
## OtherAsianProb MiddleEasternProb BlackProb LatinoProb
## 597 597 0 0
## MultiProb OtherProb WhiteProb Afraid
## 0 0 0 0
## ...21 Angry ...23 Attractive
## 597 0 597 0
## ...25 Babyfaced ...27 Disgusted
## 597 0 597 0
## ...29 Dominant ...31 Feminine
## 597 0 597 0
## ...33 Happy ...35 Masculine
## 597 0 597 0
## ...37 Prototypic ...39 Sad
## 597 0 597 0
## ...41 Suitability ...43 Surprised
## 597 0 597 0
## ...45 Threatening ...47 Trustworthy
## 597 0 597 0
## ...49 Unusual ...51 Warm
## 597 0 597 597
## ...53 Competent ...55 SocialStatus
## 597 597 597 597
## ...57 LuminanceMedian NoseWidth NoseLength
## 597 0 0 0
## LipThickness FaceLength EyeHeightR EyeHeightL
## 0 0 0 0
## EyeHeightAvg EyeWidthR EyeWidthL EyeWidthAvg
## 0 0 0 0
## FaceWidthCheeks FaceWidthMouth FaceWidthBZ Forehead
## 0 0 0 0
## UpperFaceLength2 PupilTopR PupilTopL PupilTopAsymmetry
## 0 0 0 0
## PupilLipR PupilLipL PupilLipAvg PupilLipAsymmetry
## 0 0 0 0
## EyeDistance BottomLipChin MidcheekChinR MidcheekChinL
## 597 0 0 0
## CheeksAvg MidbrowHairlineR MidbrowHairlineL MidbrowHairlineAvg
## 0 0 0 0
## FaceColorRed FaceColorGreen FaceColorBlue HairColorRed
## 597 597 597 597
## HairColorGreen HairColorBlue HairLuminance EyeLuminanceR
## 597 597 597 597
## EyeLuminanceL EyeBrowThicknessR EyeBrowThicknessL EyeBrowThicknessAvg
## 597 597 597 597
## EyeLidThicknessR EyeLidThicknessL EyeLidThicknessAvg FaceShape
## 597 597 597 0
## Heartshapeness NoseShape LipFullness EyeShape
## 0 0 0 0
## EyeSize UpperHeadLength MidfaceLength ChinLength
## 0 0 0 0
## ForeheadHeight CheekboneHeight CheekboneProminence FaceRoundness
## 0 0 0 0
## fWHR2 RaterN
## 0 0
Since we have a lot of columns thet are entirely empty, we can’t impute them. Let’s get rid of them.
data <- Filter(function(x)!all(is.na(x)), data)
print(paste('missings values: ', sapply(data, function(x) sum(is.na(x))) %>% sum()))
## [1] "missings values: 0"
The column Suitability is mostly composed of dots ‘.’, which serves as an empty value. Let’s remove this column. The first column, Model, is just a combination of the index, gender and ethnicity, so we don’t need it either. We can also notice, that we have two columns describing the same thing - how raters rated the gender of a person. We have the probability of a person being a woman or a man, but we need only one. Let’s get rid of MaleProb.
data$Suitability <- NULL
data$Model <- NULL
data$MaleProb <- NULL
Before further inspection of the variables, we need to convert the data to appropriate types. We have two nominal variables - EthnicitySelf and GenderSelf, and the rest of the variables are numerical.
data$EthnicitySelf <- as.factor(data$EthnicitySelf)
data$GenderSelf <- as.factor(data$GenderSelf)
data[3:70] <- sapply(data[3:70], as.numeric)
Before we move on, let’s save our data.
save(list = "data",
file = "data_prepped.RData")
Now we will explore individual variables and inspect their distributions. Since we want to perform all data transformation based on the training sample, we should perform exploratory data analysis on the training sample as well. We will split the data now.
set.seed(987654321)
data_which_train <- createDataPartition(data$Attractive, p = 0.7, list = FALSE)
data_train <- data[data_which_train,]
data_test <- data[-data_which_train,]
summary(data_train$Attractive) - summary(data_test$Attractive)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.265714 -0.001963 -0.006288 -0.011330 0.000000 0.358261
Our test and training sets have very similar distribution and key characteristics of the dependent variable. Brilliant.
Now that we have our training set, let’s explore all numeric variables and see their distributions. This investigation will help us decide which predictors should be dropped and which ones transformed.
We begin our analysis with Shapiro-Wilk test for normality for all the variables.
## AgeRated FemaleProb AsianProb BlackProb
## 9.740699e-13 3.445736e-28 1.224777e-31 6.707938e-29
## LatinoProb MultiProb OtherProb WhiteProb
## 7.864492e-28 1.227589e-26 3.291197e-32 2.264187e-27
## Afraid Angry Attractive Babyfaced
## 1.582241e-06 4.125152e-10 4.042334e-04 7.487910e-06
## Disgusted Dominant Feminine Happy
## 1.308519e-07 2.351465e-07 4.768585e-18 1.763661e-09
## Masculine Prototypic Sad Surprised
## 3.815395e-15 7.449087e-10 1.256757e-06 2.824375e-10
## Threatening Trustworthy Unusual LuminanceMedian
## 6.422250e-11 7.272072e-01 2.154571e-13 1.393703e-15
## NoseWidth NoseLength LipThickness FaceLength
## 5.306538e-03 6.740494e-01 3.172683e-01 1.737156e-01
## EyeHeightR EyeHeightL EyeHeightAvg EyeWidthR
## 1.168706e-02 2.006980e-02 6.379702e-03 8.260895e-01
## EyeWidthL EyeWidthAvg FaceWidthCheeks FaceWidthMouth
## 7.343919e-02 2.998899e-01 2.973269e-04 1.672890e-10
## FaceWidthBZ Forehead UpperFaceLength2 PupilTopR
## 3.039637e-16 7.692969e-01 3.963254e-02 9.154165e-01
## PupilTopL PupilTopAsymmetry PupilLipR PupilLipL
## 5.749035e-01 1.464373e-24 8.267845e-01 3.553666e-01
## PupilLipAvg PupilLipAsymmetry BottomLipChin MidcheekChinR
## 6.018057e-01 7.973380e-16 3.008053e-01 1.742065e-02
## MidcheekChinL CheeksAvg MidbrowHairlineR MidbrowHairlineL
## 8.874782e-02 1.639179e-02 5.015669e-01 2.101965e-01
## MidbrowHairlineAvg FaceShape Heartshapeness NoseShape
## 3.374247e-01 2.714406e-07 1.635449e-03 1.089110e-04
## LipFullness EyeShape EyeSize UpperHeadLength
## 3.076655e-01 2.651957e-02 1.497209e-01 9.797878e-04
## MidfaceLength ChinLength ForeheadHeight CheekboneHeight
## 6.428587e-01 7.025049e-02 8.831936e-01 1.739686e-04
## CheekboneProminence FaceRoundness fWHR2 RaterN
## 5.092498e-20 1.899210e-13 7.131977e-04 3.763099e-29
As we can observe, when it comes to larger data sets, normality tests can be unreliable. Slight deviations from a Gaussian distributions can yield very low p-values. To decide how to treat particular variables we generated plots for their distributions and distributions of their transformed forms. The graphs for all of the variables are accessible within the attached R code. Below we presented graphs of the variables chosen to be deleted or transformed.
The distributions for these are either extremely skewed, with almost all of the data being a value near zero, or have an opposite of a normal distribution, with extremely high frequency of values around zero and one, and nearly none in between. Log transformation does not help to transform the data into a normal distribution. For now we will get rid of this data.
Two of the irregular distributions stand out from the rest - AsianProb and MultiProb. If we look at the log transfrmations of these variables, we can see a they could work as categorical variables. If we look at the OtherProb and AsianProb variable, we can observe there are clusters at certain values. This means that they may be used to classify to which cluster, or a group, a certain value belongs. However, after performing these transformations, our models in a later section displayed worse performance, so eventually these variables were dropped as well.
In the distributions of those variables we can observe enough normality to keep them in our model after necessary transformations.
All of them display more similarity to normal distribution after log transformation.
After taking a closer look, we can see that our dependent variable Attractive resembles a normal distribution. Good.
There is no need to perform additional transformations.
Let’s apply the transformations to some of our variables that we selected earlier and delete unnecessary ones.
data_train[c("Babyfaced", "Happy", "Sad", "Surprised", "Trustworthy", "NoseWidth")] <- log(data_train[c("Babyfaced", "Happy", "Sad", "Surprised", "Trustworthy", "NoseWidth")])
data_test[c("Babyfaced", "Happy", "Sad", "Surprised", "Trustworthy", "NoseWidth")] <- log(data_test[c("Babyfaced", "Happy", "Sad", "Surprised", "Trustworthy", "NoseWidth")])
data_train[c( "Masculine", "LuminanceMedian", "PupilTopAsymMetry", "PupilLipAsymmetry", "CheekboneProminence", "RaterN", "FemaleProb", "WhiteProb", "BlackProb", "LatinoProb", "MultiProb", "AsianProb", "OtherProb")] <- NULL
data_test[c( "Masculine", "LuminanceMedian", "PupilTopAsymMetry", "PupilLipAsymmetry", "CheekboneProminence", "RaterN", "FemaleProb", "WhiteProb", "BlackProb", "LatinoProb", "MultiProb", "AsianProb", "OtherProb")] <- NULL
Before moving on, let’s save our training and testing set.
save(list = c("data_train", "data_test"),
file = "data_train_test.RData")
The next step in development of our predictive model will be reducing the number of input variables by checking which of them are most highly correlated with the dependent variable. In order to do that we create the lists of numeric and categorical (2 in our case) variables.
data_numeric_vars <- sapply(data_train, is.numeric) %>% which() %>% names()
data_numeric_vars
## [1] "AgeRated" "Afraid" "Angry"
## [4] "Attractive" "Babyfaced" "Disgusted"
## [7] "Dominant" "Feminine" "Happy"
## [10] "Prototypic" "Sad" "Surprised"
## [13] "Threatening" "Trustworthy" "Unusual"
## [16] "NoseWidth" "NoseLength" "LipThickness"
## [19] "FaceLength" "EyeHeightR" "EyeHeightL"
## [22] "EyeHeightAvg" "EyeWidthR" "EyeWidthL"
## [25] "EyeWidthAvg" "FaceWidthCheeks" "FaceWidthMouth"
## [28] "FaceWidthBZ" "Forehead" "UpperFaceLength2"
## [31] "PupilTopR" "PupilTopL" "PupilTopAsymmetry"
## [34] "PupilLipR" "PupilLipL" "PupilLipAvg"
## [37] "BottomLipChin" "MidcheekChinR" "MidcheekChinL"
## [40] "CheeksAvg" "MidbrowHairlineR" "MidbrowHairlineL"
## [43] "MidbrowHairlineAvg" "FaceShape" "Heartshapeness"
## [46] "NoseShape" "LipFullness" "EyeShape"
## [49] "EyeSize" "UpperHeadLength" "MidfaceLength"
## [52] "ChinLength" "ForeheadHeight" "CheekboneHeight"
## [55] "FaceRoundness" "fWHR2"
data_categorical_vars <- sapply(data_train, is.factor) %>% which() %>% names()
data_categorical_vars
## [1] "EthnicitySelf" "GenderSelf"
First, the correlation of the numeric variables will be explored. Let’s perform a check for multicolinearity - if any of our variables are correlated with each other and should be dropped.
data_linear_combinations <- findLinearCombos(data_train[, data_numeric_vars])
variables <- c()
for (item in data_linear_combinations$linearCombos){
print(names(data_train[, data_numeric_vars][item]))
}
## [1] "EyeWidthAvg" "EyeWidthR" "EyeWidthL"
## [1] "PupilLipAvg" "PupilLipR" "PupilLipL"
## [1] "CheeksAvg" "MidcheekChinR" "MidcheekChinL"
## [1] "MidbrowHairlineAvg" "MidbrowHairlineR" "MidbrowHairlineL"
names(data_train[, data_numeric_vars])[data_linear_combinations$remove]
## [1] "EyeWidthAvg" "PupilLipAvg" "CheeksAvg"
## [4] "MidbrowHairlineAvg"
However before we remove the suggested variables, let’s inspect these correlations.
colinear_variables <- c("EyeWidthAvg", "EyeWidthR", "EyeWidthL"
,"PupilLipAvg", "PupilLipR", "PupilLipL", "CheeksAvg", "MidcheekChinR", "MidcheekChinL", "MidbrowHairlineAvg", "MidbrowHairlineR", "MidbrowHairlineL")
correlations <- cor(data_train[,colinear_variables])
corrplot(correlations)
We can see that all the cases of multicolinearity regard the same part of the face. We have high correlations in eye width, pupil to lip ratio, cheek characteristics and hairline. However, if we remove the suggested variable, we will still have another two very highly correlated variables. Let’s go against the suggestions and keep the variables EyeWidthAvg, PupilLipAvg, CheeksAvg, MidbrowHairlineAvg and remove the other ones.
highly_correlated <- (c( "EyeWidthR", "EyeWidthL", "PupilLipR", "PupilLipL", "MidcheekChinR", "MidcheekChinL", "MidbrowHairlineR", "MidbrowHairlineL"))
data_train[c(highly_correlated)] <- NULL
data_test[c(highly_correlated)] <- NULL
Correlation explains how one or more variables are related to each other. It determines how attributes change in relation with the dependent variable which is extremely helpful in further feature selection. Let’s see if we caught all highly correlated variables in the previous section. In order to choose the most correlated variables, regardless the correlation sign (positive or negative), we plotted the absolute values of those correlations.
There is some high correlation among the EyeHeight, EyeHeightR, EyeHeightL and EyeSize and EyeShape. Let’s take a closer look.
Let’s keep EyeHeightAvg and EyeShape.
highly_correlated <- c( "EyeHeightR", "EyeHeightL", "EyeSize")
data_train[c(highly_correlated)] <- NULL
data_test[c(highly_correlated)] <- NULL
And we continue our exploration.
This is a bit unreadable. Let’s order the variables in relation to their correlation to our dependent variable.
Let’s see the mostly correlated variables - in our case the ones that show at least around 0.2 correlation with a dependent variable.
Fortunately we don’t have too high correlations with our dependent variable. Let’s see the relation of the target variable with 5 most correlated variables.
As it is visible on the graphs, two mostly correlated variables display a positive correlation with our dependent variable. The next three predictors are strongly negatively correlated.
Now let’s check the relationship of categorical predictors with the target variable. Our dependent variable is quantitative and predictors are qualitative, so we can use ANOVA to check their relation with dependent variable.
anova_ethnicity <- aov(data_train$Attractive ~ data_train$EthnicitySelf)
summary(anova_ethnicity)
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train$EthnicitySelf 3 2.26 0.7527 1.259 0.288
## Residuals 415 248.16 0.5980
The EthnicitySelf variable yielded a p-value of 0.288. This means we cannot reject the null hypothesis, that the variable is significant and we can drop it. Additionally, if you look at their boxplots, you will see that the attractiveness scores are very similar for all four ethnicities which is also confirmed by their mean value.
data_train %>% select(Attractive, EthnicitySelf) %>% group_by(EthnicitySelf) %>%
summarise(mean(Attractive))
## # A tibble: 4 x 2
## EthnicitySelf `mean(Attractive)`
## <fct> <dbl>
## 1 A 3.21
## 2 B 3.31
## 3 L 3.25
## 4 W 3.13
ggplot(data_train, aes(x = EthnicitySelf, y = Attractive, fill = EthnicitySelf)) +
geom_boxplot(alpha = 0.6) +
scale_fill_manual(values = c("#94D4FF", "#4FB0FF", "#1F8FFF", "#3079D1"))
anova_gender <- aov(data_train$Attractive ~ data_train$GenderSelf)
summary(anova_gender)
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train$GenderSelf 1 24.39 24.389 44.99 6.46e-11 ***
## Residuals 417 226.03 0.542
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we can see p-value of the F statistic obtains very low value for GenderSelf, so we can definitely reject the null hypothesis - variable GenderSelf impacts Attractive. The scores differ greatly, by roughly 0.5 on a scale of 5, that is a huge difference.
ggplot(data_train, aes(x = GenderSelf, y = Attractive, fill = GenderSelf)) +
geom_boxplot(alpha = 0.6) +
scale_fill_manual(values = c("#94D4FF", "#1F8FFF"))
Finally, we will remove EthnicitySelf and keep the GenderSelf variable.
data_train$EthnicitySelf <- NULL
data_test$EthnicitySelf <- NULL
Next step would be to identify low-varying variables. As visible below the frequency ratio for most of the variables takes value close to 1 which implies that they are diversified, none of them show low variance.
nearZeroVar(data_train, saveMetrics = TRUE)
## freqRatio percentUnique zeroVar nzv
## GenderSelf 1.043902 0.477327 FALSE FALSE
## AgeRated 1.000000 96.897375 FALSE FALSE
## Afraid 2.500000 64.200477 FALSE FALSE
## Angry 2.200000 74.463007 FALSE FALSE
## Attractive 1.200000 77.326969 FALSE FALSE
## Babyfaced 1.000000 74.701671 FALSE FALSE
## Disgusted 2.428571 66.109785 FALSE FALSE
## Dominant 1.333333 63.007160 FALSE FALSE
## Feminine 1.250000 80.190931 FALSE FALSE
## Happy 1.500000 79.474940 FALSE FALSE
## Prototypic 2.000000 84.248210 FALSE FALSE
## Sad 2.200000 74.940334 FALSE FALSE
## Surprised 1.125000 57.279236 FALSE FALSE
## Threatening 3.600000 68.973747 FALSE FALSE
## Trustworthy 2.400000 70.405728 FALSE FALSE
## Unusual 1.833333 67.303103 FALSE FALSE
## NoseWidth 1.375000 44.391408 FALSE FALSE
## NoseLength 1.142857 42.482100 FALSE FALSE
## LipThickness 1.142857 42.959427 FALSE FALSE
## FaceLength 1.000000 66.587112 FALSE FALSE
## EyeHeightAvg 1.333333 30.548926 FALSE FALSE
## EyeWidthAvg 1.000000 39.379475 FALSE FALSE
## FaceWidthCheeks 1.000000 48.448687 FALSE FALSE
## FaceWidthMouth 1.400000 57.756563 FALSE FALSE
## FaceWidthBZ 1.111111 42.243437 FALSE FALSE
## Forehead 1.250000 56.324582 FALSE FALSE
## UpperFaceLength2 1.200000 48.210024 FALSE FALSE
## PupilTopR 1.000000 58.472554 FALSE FALSE
## PupilTopL 1.000000 59.904535 FALSE FALSE
## PupilTopAsymmetry 1.117647 18.615752 FALSE FALSE
## PupilLipAvg 1.200000 63.245823 FALSE FALSE
## BottomLipChin 1.000000 46.300716 FALSE FALSE
## CheeksAvg 1.000000 65.393795 FALSE FALSE
## MidbrowHairlineAvg 1.250000 73.747017 FALSE FALSE
## FaceShape 1.000000 99.522673 FALSE FALSE
## Heartshapeness 1.000000 99.045346 FALSE FALSE
## NoseShape 1.000000 99.284010 FALSE FALSE
## LipFullness 1.000000 99.522673 FALSE FALSE
## EyeShape 1.000000 97.613365 FALSE FALSE
## UpperHeadLength 1.000000 99.045346 FALSE FALSE
## MidfaceLength 1.000000 99.522673 FALSE FALSE
## ChinLength 2.000000 99.761337 FALSE FALSE
## ForeheadHeight 1.000000 99.284010 FALSE FALSE
## CheekboneHeight 2.000000 99.761337 FALSE FALSE
## FaceRoundness 2.000000 99.761337 FALSE FALSE
## fWHR2 1.500000 98.329356 FALSE FALSE
For modeling we will be using two sets of data - the 20 most correlated numeric variables with added variable GenderSelf and all the variables from the dataset.
selected_variables <- data_correlation_ordered[1:21]
selected_variables <- append(selected_variables, c("GenderSelf"))
data_train_high <- data_train[selected_variables]
data_test_high <- data_test[selected_variables]
Let’s save the list of selected variables as well.
save(list = "selected_variables",
file = "selected_variables.RData")
Before estimating the models we will create a function that will help us present the results in a more convenient and clear way. The function will visualize the values predicted by the model and compare them with the real data.
visualize <- function(model){
predicted_values <- predict(model)
plot_1 <- ggplot(data.frame(error = data_train$Attractive - predicted_values), aes(x = error)) + geom_histogram(fill = "dodgerblue1", bins = 100) +
theme_bw()
plot_2 <- ggplot(data.frame(real = data_train$Attractive, predicted = predicted_values), aes(x = predicted, y = real)) +
geom_point(col = "dodgerblue1") +
theme_bw()
print(ggarrange(plot_1, plot_2))
cor(data_train$Attractive,
predicted_values)
}
The models tested below:
faces_lm1 <- lm(Attractive ~ .,
data = data_train_high)
faces_lm2 <- lm(Attractive ~ .,
data = data_train)
faces_lm1_backward_p_value <- ols_step_backward_p(faces_lm1,prem = 0.05, progress = TRUE)
faces_lm2_backward_p_value <- ols_step_backward_p(faces_lm2,prem = 0.05, progress = TRUE)
options(contrasts = c("contr.treatment", "contr.treatment"))
ctrl_cv7 <- trainControl(method = "cv", number = 7)
parameters_ridge <- expand.grid(alpha = 0, lambda = seq(10, 1e4, 10))
faces_ridge_1 <- train(Attractive ~ .,
data = data_train_high,
method = "glmnet",
tuneGrid = parameters_ridge,
trControl = ctrl_cv7)
faces_ridge_2 <- train(Attractive ~ .,
data = data_train,
method = "glmnet",
tuneGrid = parameters_ridge,
trControl = ctrl_cv7)
lambdas <- exp(log(10)*seq(-2, 9, length.out = 200))
parameters_lasso <- expand.grid(alpha = 1, lambda = lambdas)
faces_lasso_1 <-
train(Attractive ~ .,
data = data_train_high,
method = "glmnet",
tuneGrid = parameters_lasso,
trControl = ctrl_cv7)
faces_lasso_2 <-
train(Attractive ~ .,
data = data_train,
method = "glmnet",
tuneGrid = parameters_lasso,
trControl = ctrl_cv7)
parameters_elastic <- expand.grid(alpha = seq(0, 1, 0.2), lambda = seq(0, 1e2, 1))
faces_elastic_1 <- train(Attractive ~ .,
data = data_train_high,
method = "glmnet",
tuneGrid = parameters_elastic,
trControl = ctrl_cv7)
faces_elastic_2 <- train(Attractive ~ .,
data = data_train,
method = "glmnet",
tuneGrid = parameters_elastic,
trControl = ctrl_cv7)
model_list <- list(faces_lm1, faces_lm2, faces_lm1_backward_p_value$model,
faces_lm2_backward_p_value$model,
faces_ridge_1, faces_ridge_2, faces_lasso_1, faces_lasso_2,
faces_elastic_1, faces_elastic_2)
model_names <- c("linear_regression_high", "linear_regression_all",
"OLS_high", "OLS_all",
"ridge_high", "ridge_all", "lasso_high", "lasso_all",
"elastic_net_high", "elastic_net_all")
We calculated predicted values based on all models. Now, we will visualise those values versus the real ones and check the distribution of errors.
## [1] 0.8453847
## [1] 0.8850862
## [1] 0.8423868
## [1] 0.878183
## [1] 0.7062318
## [1] 0.6960377
## [1] 0.8382624
## [1] 0.873047
## [1] 0.8453302
## [1] 0.884708
Here we define a function regression metrics, that will show us the performance of our models.
regressionMetrics <- function(real, predicted) {
MSE <- mean((real - predicted)^2)
RMSE <- sqrt(MSE)
MAE <- mean(abs(real - predicted))
MAPE <- mean(abs(real - predicted)/real)
MedAE <- median(abs(real - predicted))
MSLE <- mean((log(1 + real) - log(1 + predicted))^2)
TSS <- sum((real - mean(real))^2)
RSS <- sum((predicted - real)^2)
R2 <- 1 - RSS/TSS
result <- data.frame(MSE, RMSE, MAE, MAPE, MedAE, MSLE, R2)
return(result)
}
We check the parameters of all estimated models and compare them. The investigated parameters:
results_dataframe <- data.frame()
for (item in model_list){
results <- regressionMetrics(data_test$Attractive,
predict(item, data_test))
results_dataframe <- rbind(results_dataframe, results)
}
row.names(results_dataframe) <- model_names
results_dataframe
## MSE RMSE MAE MAPE MedAE
## linear_regression_high 0.2103240 0.4586110 0.3592296 0.1210782 0.3033424
## linear_regression_all 0.2027092 0.4502324 0.3545443 0.1182855 0.2920863
## OLS_high 0.2033141 0.4509037 0.3553863 0.1194138 0.2919233
## OLS_all 0.2011840 0.4485354 0.3521739 0.1176472 0.2896622
## ridge_high 0.4530821 0.6731137 0.5530207 0.1799830 0.5107613
## ridge_all 0.4480510 0.6693661 0.5468848 0.1782385 0.4852803
## lasso_high 0.2027022 0.4502246 0.3521961 0.1171740 0.3054579
## lasso_all 0.1858561 0.4311103 0.3417936 0.1137362 0.3007018
## elastic_net_high 0.2095648 0.4577825 0.3583281 0.1207287 0.3025907
## elastic_net_all 0.2014788 0.4488639 0.3537071 0.1180233 0.2943244
## MSLE R2
## linear_regression_high 0.01280973 0.6320254
## linear_regression_all 0.01243196 0.6453480
## OLS_high 0.01230307 0.6442896
## OLS_all 0.01225716 0.6480163
## ridge_high 0.02503912 0.2073054
## ridge_all 0.02478734 0.2161076
## lasso_high 0.01201220 0.6453601
## lasso_all 0.01112291 0.6748334
## elastic_net_high 0.01274583 0.6333536
## elastic_net_all 0.01233853 0.6475007
We get the best results from the LASSO regression model set on all the variables. Let’s visualize the errors in our predictions from that model.
par(mfrow=c(3,1))
errors <- abs(predict(faces_lasso_2, data_test) - data_test$Attractive)
plot(abs(predict(faces_lasso_2, data_test) - data_test$Attractive),
main = "magnitude of the errors",
ylab = 'error', pch = 19)
hist(errors, breaks = 100)
plot(ecdf(errors))
If we look at the magnitude of errors, we can see that very few surpassed 1, and a staggering majority stayed between the range 0 to 0.5. If we direct our attention to the density function of the errors, we can see that 80% of the errors were 0.5 or lower, which means that with our model we are able to predict rather accurately someone’s attractiveness, with only a few outlier exceptions.
Final coefficients for variables in the LASSO model:
## 46 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -1.522361e+00
## GenderSelfM 1.141081e+00
## AgeRated -2.622102e-02
## Afraid .
## Angry .
## Babyfaced .
## Disgusted -9.407564e-03
## Dominant 3.935651e-01
## Feminine 5.788030e-01
## Happy 2.003957e-01
## Prototypic -3.195150e-02
## Sad -3.217708e-01
## Surprised -1.393618e-01
## Threatening -4.035229e-02
## Trustworthy 1.573323e+00
## Unusual -1.953583e-01
## NoseWidth .
## NoseLength .
## LipThickness .
## FaceLength .
## EyeHeightAvg 4.999653e-03
## EyeWidthAvg 1.479866e-03
## FaceWidthCheeks .
## FaceWidthMouth -8.676962e-04
## FaceWidthBZ 1.093849e-03
## Forehead .
## UpperFaceLength2 5.535949e-05
## PupilTopR 4.913545e-04
## PupilTopL 3.113078e-04
## PupilTopAsymmetry .
## PupilLipAvg .
## BottomLipChin .
## CheeksAvg 1.889630e-04
## MidbrowHairlineAvg .
## FaceShape .
## Heartshapeness .
## NoseShape .
## LipFullness .
## EyeShape .
## UpperHeadLength .
## MidfaceLength .
## ChinLength -1.895147e+00
## ForeheadHeight .
## CheekboneHeight 1.688344e+00
## FaceRoundness .
## fWHR2 -3.899939e-01
From all the available variables, 21 were selected for the model estimation and 10 different models were evaluated. From all of the estimated models, LASSO regression with all variables yielded the best results. We obtained R2 at the level of 67.48% and mean square error of 18.58%. The second best model turned out to be the cross validation elastic net model with all variables. The worst performance was displayed by ridge regression model and it obtained around two times worse results than the rest of the models.
The important thing that we learned from this project is also the fact that you cannot always trust the statistical estimates or results generated by a specific test, sometimes you need an intuition and a good knowledge of your data to make decisions and obtain the best results.