For this project, we have chosen a dataset from the Chicago Face Database containing characteristics of nearly 600 different faces.

The dataset consists of two kinds of variables. First, there are objective physical-appearance measurements, such as face length, eye height and eye width. Second, there are subjective variables, for which a group of raters scored each face from 1 to 7 on attributes such as aggression, apparent age, happiness and dominance. For every face, the ratings were then averaged for each of these variables.

This dataset offers a lot of possibilities when choosing a variable of interest – we can investigate many features and how the different measures influence them. We decided on Attractive as our dependent variable, with the goal of building a model that predicts a person’s attractiveness from the available attributes.

1. Initial Data Inspection

The dataset is available as an xlsx file with several worksheets. The first sheet contains the list of variables together with their detailed descriptions. The remaining sheets contain observations of different subsets of the variables. For further analysis, we chose the second sheet, which consists of 597 observations and 118 attributes. The first 7 rows are empty, so after skipping them and removing one extra row, the dataset is loaded and ready for further processing.
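
The code in this report relies on several R packages; the original loading chunk is not shown, so the list below is inferred from the functions used throughout and assumed to be loaded first.

library(readxl)    # read_xlsx()
library(dplyr)     # %>%, select(), group_by(), summarise()
library(ggplot2)   # plots
library(ggpubr)    # ggarrange()
library(corrplot)  # corrplot()
library(caret)     # createDataPartition(), findLinearCombos(), nearZeroVar(), train()
library(olsrr)     # ols_step_backward_p()
library(glmnet)    # backend for the ridge / LASSO / elastic net models fitted via caret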

setwd("C:\\Users\\lewsz\\OneDrive\\Desktop\\UW_materials_semester_2\\machine_learning\\project_small")
df = read_xlsx("CFD 3.0 Norming Data and Codebook.xlsx", sheet = 2, skip = 7)
data <- df[-1,]
head(data)
## # A tibble: 6 x 118
##   Model EthnicitySelf GenderSelf AgeSelf AgeRated ...6  FemaleProb MaleProb
##   <chr> <chr>         <chr>      <lgl>   <chr>    <chr>      <dbl>    <dbl>
## 1 AF-2~ A             F          NA      32.5714~ <NA>       1        0    
## 2 AF-2~ A             F          NA      23.6666~ <NA>       1        0    
## 3 AF-2~ A             F          NA      24.4482~ <NA>       0.828    0.172
## 4 AF-2~ A             F          NA      22.7586~ <NA>       1        0    
## 5 AF-2~ A             F          NA      30.1379~ <NA>       1        0    
## 6 AF-2~ A             F          NA      26.5925~ <NA>       1        0    
## # ... with 110 more variables: AsianProb <dbl>, ChineseAsianProb <lgl>,
## #   JapaneseAsianProb <lgl>, IndianAsianProb <lgl>, OtherAsianProb <lgl>,
## #   MiddleEasternProb <lgl>, BlackProb <dbl>, LatinoProb <dbl>,
## #   MultiProb <dbl>, OtherProb <dbl>, WhiteProb <dbl>, Afraid <chr>,
## #   ...21 <chr>, Angry <chr>, ...23 <chr>, Attractive <chr>, ...25 <chr>,
## #   Babyfaced <chr>, ...27 <chr>, Disgusted <chr>, ...29 <chr>, Dominant <chr>,
## #   ...31 <chr>, Feminine <chr>, ...33 <chr>, Happy <chr>, ...35 <chr>,
## #   Masculine <chr>, ...37 <chr>, Prototypic <chr>, ...39 <chr>, Sad <chr>,
## #   ...41 <chr>, Suitability <chr>, ...43 <chr>, Surprised <chr>, ...45 <chr>,
## #   Threatening <chr>, ...47 <chr>, Trustworthy <chr>, ...49 <chr>,
## #   Unusual <chr>, ...51 <chr>, Warm <chr>, ...53 <chr>, Competent <chr>,
## #   ...55 <chr>, SocialStatus <chr>, ...57 <chr>, LuminanceMedian <dbl>,
## #   NoseWidth <dbl>, NoseLength <dbl>, LipThickness <dbl>, FaceLength <dbl>,
## #   EyeHeightR <dbl>, EyeHeightL <dbl>, EyeHeightAvg <dbl>, EyeWidthR <dbl>,
## #   EyeWidthL <dbl>, EyeWidthAvg <dbl>, FaceWidthCheeks <dbl>,
## #   FaceWidthMouth <dbl>, FaceWidthBZ <dbl>, Forehead <dbl>,
## #   UpperFaceLength2 <dbl>, PupilTopR <dbl>, PupilTopL <dbl>,
## #   PupilTopAsymmetry <dbl>, PupilLipR <dbl>, PupilLipL <dbl>,
## #   PupilLipAvg <dbl>, PupilLipAsymmetry <dbl>, EyeDistance <lgl>,
## #   BottomLipChin <dbl>, MidcheekChinR <dbl>, MidcheekChinL <dbl>,
## #   CheeksAvg <dbl>, MidbrowHairlineR <dbl>, MidbrowHairlineL <dbl>,
## #   MidbrowHairlineAvg <dbl>, FaceColorRed <lgl>, FaceColorGreen <lgl>,
## #   FaceColorBlue <lgl>, HairColorRed <lgl>, HairColorGreen <lgl>,
## #   HairColorBlue <lgl>, HairLuminance <lgl>, EyeLuminanceR <lgl>,
## #   EyeLuminanceL <lgl>, EyeBrowThicknessR <lgl>, EyeBrowThicknessL <lgl>,
## #   EyeBrowThicknessAvg <lgl>, EyeLidThicknessR <lgl>, EyeLidThicknessL <lgl>,
## #   EyeLidThicknessAvg <lgl>, FaceShape <dbl>, Heartshapeness <dbl>,
## #   NoseShape <dbl>, LipFullness <dbl>, EyeShape <dbl>, ...

1.1. Handling Empty Values

The next step is to check if some of the columns contain missing values.

sapply(data, function(x) sum(is.na(x)))
##               Model       EthnicitySelf          GenderSelf             AgeSelf 
##                   0                   0                   0                 597 
##            AgeRated                ...6          FemaleProb            MaleProb 
##                   0                 597                   0                   0 
##           AsianProb    ChineseAsianProb   JapaneseAsianProb     IndianAsianProb 
##                   0                 597                 597                 597 
##      OtherAsianProb   MiddleEasternProb           BlackProb          LatinoProb 
##                 597                 597                   0                   0 
##           MultiProb           OtherProb           WhiteProb              Afraid 
##                   0                   0                   0                   0 
##               ...21               Angry               ...23          Attractive 
##                 597                   0                 597                   0 
##               ...25           Babyfaced               ...27           Disgusted 
##                 597                   0                 597                   0 
##               ...29            Dominant               ...31            Feminine 
##                 597                   0                 597                   0 
##               ...33               Happy               ...35           Masculine 
##                 597                   0                 597                   0 
##               ...37          Prototypic               ...39                 Sad 
##                 597                   0                 597                   0 
##               ...41         Suitability               ...43           Surprised 
##                 597                   0                 597                   0 
##               ...45         Threatening               ...47         Trustworthy 
##                 597                   0                 597                   0 
##               ...49             Unusual               ...51                Warm 
##                 597                   0                 597                 597 
##               ...53           Competent               ...55        SocialStatus 
##                 597                 597                 597                 597 
##               ...57     LuminanceMedian           NoseWidth          NoseLength 
##                 597                   0                   0                   0 
##        LipThickness          FaceLength          EyeHeightR          EyeHeightL 
##                   0                   0                   0                   0 
##        EyeHeightAvg           EyeWidthR           EyeWidthL         EyeWidthAvg 
##                   0                   0                   0                   0 
##     FaceWidthCheeks      FaceWidthMouth         FaceWidthBZ            Forehead 
##                   0                   0                   0                   0 
##    UpperFaceLength2           PupilTopR           PupilTopL   PupilTopAsymmetry 
##                   0                   0                   0                   0 
##           PupilLipR           PupilLipL         PupilLipAvg   PupilLipAsymmetry 
##                   0                   0                   0                   0 
##         EyeDistance       BottomLipChin       MidcheekChinR       MidcheekChinL 
##                 597                   0                   0                   0 
##           CheeksAvg    MidbrowHairlineR    MidbrowHairlineL  MidbrowHairlineAvg 
##                   0                   0                   0                   0 
##        FaceColorRed      FaceColorGreen       FaceColorBlue        HairColorRed 
##                 597                 597                 597                 597 
##      HairColorGreen       HairColorBlue       HairLuminance       EyeLuminanceR 
##                 597                 597                 597                 597 
##       EyeLuminanceL   EyeBrowThicknessR   EyeBrowThicknessL EyeBrowThicknessAvg 
##                 597                 597                 597                 597 
##    EyeLidThicknessR    EyeLidThicknessL  EyeLidThicknessAvg           FaceShape 
##                 597                 597                 597                   0 
##      Heartshapeness           NoseShape         LipFullness            EyeShape 
##                   0                   0                   0                   0 
##             EyeSize     UpperHeadLength       MidfaceLength          ChinLength 
##                   0                   0                   0                   0 
##      ForeheadHeight     CheekboneHeight CheekboneProminence       FaceRoundness 
##                   0                   0                   0                   0 
##               fWHR2              RaterN 
##                   0                   0

Since many of the columns are entirely empty, there is nothing to impute them from. Let’s get rid of them.

data <- Filter(function(x)!all(is.na(x)), data)
print(paste('missing values: ', sapply(data, function(x) sum(is.na(x))) %>% sum()))
## [1] "missing values:  0"

The column Suitability is mostly composed of dots (‘.’), which serve as empty values, so let’s remove it. The first column, Model, is just a combination of an index, gender and ethnicity, so we don’t need it either. We can also notice that two columns describe the same thing - how the raters perceived a person’s gender. Since FemaleProb and MaleProb are complementary, we need only one of them; let’s get rid of MaleProb.

data$Suitability <- NULL
data$Model <- NULL
data$MaleProb <- NULL

1.2. Data Conversion

Before further inspection of the variables, we need to convert the data to appropriate types. We have two nominal variables - EthnicitySelf and GenderSelf, and the rest of the variables are numerical.

data$EthnicitySelf <- as.factor(data$EthnicitySelf)
data$GenderSelf <- as.factor(data$GenderSelf)
data[3:70] <- sapply(data[3:70], as.numeric)

Before we move on, let’s save our data.

save(list = "data",
     file = "data_prepped.RData")

2. Splitting the Data

Now we will explore the individual variables and inspect their distributions. Since we want to base all data transformations on the training sample alone, we should also perform the exploratory data analysis on the training sample. We therefore split the data now.

set.seed(987654321)
data_which_train <- createDataPartition(data$Attractive, p = 0.7, list = FALSE) 

data_train <- data[data_which_train,]
data_test <- data[-data_which_train,]
summary(data_train$Attractive) - summary(data_test$Attractive)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.265714 -0.001963 -0.006288 -0.011330  0.000000  0.358261

Our training and test sets have very similar distributions and key characteristics of the dependent variable. Brilliant.

3. Variable Transformations

Now that we have our training set, let’s explore all numeric variables and see their distributions. This investigation will help us decide which predictors should be dropped and which ones transformed.

3.1. Variable Distributions

We begin our analysis with the Shapiro-Wilk test for normality, applied to all the numeric variables.
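
The exact chunk is part of the attached code; the p-values below can be reproduced roughly as follows (a sketch).

# Shapiro-Wilk p-value for every numeric column of the training set
num_vars <- names(data_train)[sapply(data_train, is.numeric)]
sapply(data_train[num_vars], function(x) shapiro.test(x)$p.value)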

##            AgeRated          FemaleProb           AsianProb           BlackProb 
##        9.740699e-13        3.445736e-28        1.224777e-31        6.707938e-29 
##          LatinoProb           MultiProb           OtherProb           WhiteProb 
##        7.864492e-28        1.227589e-26        3.291197e-32        2.264187e-27 
##              Afraid               Angry          Attractive           Babyfaced 
##        1.582241e-06        4.125152e-10        4.042334e-04        7.487910e-06 
##           Disgusted            Dominant            Feminine               Happy 
##        1.308519e-07        2.351465e-07        4.768585e-18        1.763661e-09 
##           Masculine          Prototypic                 Sad           Surprised 
##        3.815395e-15        7.449087e-10        1.256757e-06        2.824375e-10 
##         Threatening         Trustworthy             Unusual     LuminanceMedian 
##        6.422250e-11        7.272072e-01        2.154571e-13        1.393703e-15 
##           NoseWidth          NoseLength        LipThickness          FaceLength 
##        5.306538e-03        6.740494e-01        3.172683e-01        1.737156e-01 
##          EyeHeightR          EyeHeightL        EyeHeightAvg           EyeWidthR 
##        1.168706e-02        2.006980e-02        6.379702e-03        8.260895e-01 
##           EyeWidthL         EyeWidthAvg     FaceWidthCheeks      FaceWidthMouth 
##        7.343919e-02        2.998899e-01        2.973269e-04        1.672890e-10 
##         FaceWidthBZ            Forehead    UpperFaceLength2           PupilTopR 
##        3.039637e-16        7.692969e-01        3.963254e-02        9.154165e-01 
##           PupilTopL   PupilTopAsymmetry           PupilLipR           PupilLipL 
##        5.749035e-01        1.464373e-24        8.267845e-01        3.553666e-01 
##         PupilLipAvg   PupilLipAsymmetry       BottomLipChin       MidcheekChinR 
##        6.018057e-01        7.973380e-16        3.008053e-01        1.742065e-02 
##       MidcheekChinL           CheeksAvg    MidbrowHairlineR    MidbrowHairlineL 
##        8.874782e-02        1.639179e-02        5.015669e-01        2.101965e-01 
##  MidbrowHairlineAvg           FaceShape      Heartshapeness           NoseShape 
##        3.374247e-01        2.714406e-07        1.635449e-03        1.089110e-04 
##         LipFullness            EyeShape             EyeSize     UpperHeadLength 
##        3.076655e-01        2.651957e-02        1.497209e-01        9.797878e-04 
##       MidfaceLength          ChinLength      ForeheadHeight     CheekboneHeight 
##        6.428587e-01        7.025049e-02        8.831936e-01        1.739686e-04 
## CheekboneProminence       FaceRoundness               fWHR2              RaterN 
##        5.092498e-20        1.899210e-13        7.131977e-04        3.763099e-29

As we can observe, normality tests can be unreliable for larger data sets: slight deviations from a Gaussian distribution can already yield very low p-values. To decide how to treat particular variables, we generated plots of their distributions and of the distributions of their transformed forms. The graphs for all of the variables are available in the attached R code; below we present the variables chosen to be deleted or transformed.
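
As an illustration, the comparison we generated for each variable looked roughly like the sketch below (raw versus log-transformed distribution; Masculine is used here only as an example).

# Sketch: distribution of one variable before and after a log transformation
plot_raw <- ggplot(data_train, aes(x = Masculine)) +
  geom_histogram(bins = 100, fill = "dodgerblue1") +
  theme_bw()
plot_log <- ggplot(data_train, aes(x = log(Masculine))) +
  geom_histogram(bins = 100, fill = "dodgerblue1") +
  theme_bw()
ggarrange(plot_raw, plot_log, labels = c("raw", "log"))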

Selected for Deletion

The distributions of the variables listed below are either extremely skewed, with almost all values concentrated near zero, or the opposite of a normal distribution, with very high frequencies of values around zero and one and almost none in between. A log transformation does not bring them any closer to normality, so for now we will get rid of these variables.

  • Masculine
  • LuminanceMedian
  • PupilTopAsymmetry
  • PupilLipAsymmetry
  • CheekboneProminence
  • RaterN
  • FemaleProb
  • WhiteProb
  • BlackProb
  • AsianProb
  • LatinoProb
  • OtherProb
  • MultiProb

Two of the irregular distributions stand out from the rest - AsianProb and MultiProb. Looking at the log transformations of these variables, we can see that they could work as categorical variables. Looking at the OtherProb and AsianProb variables, we can observe clusters at certain values, which means each observation could be assigned to the cluster, or group, its value belongs to. However, after performing these transformations, our models in a later section displayed worse performance, so eventually these variables were dropped as well.
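
For completeness, a sketch of the kind of bucketing we experimented with - the breakpoints below are purely illustrative, not the ones used in the analysis.

# Illustrative only: turning a clustered probability into a categorical variable
asian_prob_group <- cut(data_train$AsianProb,
                        breaks = c(-Inf, 0.05, 0.5, 0.95, Inf),
                        labels = c("none", "low", "high", "full"))
table(asian_prob_group)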

Selected for Transformation

The distributions of the variables listed below show enough normality to keep them in our model after the necessary transformations.

  • Babyfaced
  • Happy
  • Sad
  • Surprised
  • Trustworthy
  • NoseWidth

All of them become more similar to a normal distribution after a log transformation.

3.2. Dependent Variable

After taking a closer look, we can see that our dependent variable Attractive resembles a normal distribution. Good.
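
A minimal sketch of how this can be inspected (assuming ggplot2 is loaded):

# Histogram of Attractive with a normal curve fitted to its mean and standard deviation
ggplot(data_train, aes(x = Attractive)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "dodgerblue1") +
  stat_function(fun = dnorm,
                args = list(mean = mean(data_train$Attractive),
                            sd = sd(data_train$Attractive))) +
  theme_bw()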

There is no need to perform additional transformations.

3.3. Predictors

Let’s apply the transformations to some of our variables that we selected earlier and delete unnecessary ones.

data_train[c("Babyfaced", "Happy", "Sad", "Surprised", "Trustworthy", "NoseWidth")] <- log(data_train[c("Babyfaced", "Happy", "Sad", "Surprised", "Trustworthy", "NoseWidth")])

data_test[c("Babyfaced", "Happy", "Sad", "Surprised", "Trustworthy", "NoseWidth")] <- log(data_test[c("Babyfaced", "Happy", "Sad", "Surprised", "Trustworthy", "NoseWidth")])
data_train[c( "Masculine", "LuminanceMedian", "PupilTopAsymMetry", "PupilLipAsymmetry", "CheekboneProminence", "RaterN", "FemaleProb", "WhiteProb", "BlackProb", "LatinoProb", "MultiProb", "AsianProb", "OtherProb")] <- NULL

data_test[c( "Masculine", "LuminanceMedian", "PupilTopAsymMetry", "PupilLipAsymmetry", "CheekboneProminence", "RaterN", "FemaleProb", "WhiteProb", "BlackProb", "LatinoProb", "MultiProb", "AsianProb", "OtherProb")] <- NULL

Before moving on, let’s save our training and testing set.

save(list = c("data_train", "data_test"),
     file = "data_train_test.RData")

4. Feature Selection

The next step in the development of our predictive model is to reduce the number of input variables by checking which of them are most highly correlated with the dependent variable. In order to do that, we create lists of the numeric and categorical variables (only 2 of the latter in our case).

data_numeric_vars <- sapply(data_train, is.numeric) %>% which() %>% names()
data_numeric_vars
##  [1] "AgeRated"           "Afraid"             "Angry"             
##  [4] "Attractive"         "Babyfaced"          "Disgusted"         
##  [7] "Dominant"           "Feminine"           "Happy"             
## [10] "Prototypic"         "Sad"                "Surprised"         
## [13] "Threatening"        "Trustworthy"        "Unusual"           
## [16] "NoseWidth"          "NoseLength"         "LipThickness"      
## [19] "FaceLength"         "EyeHeightR"         "EyeHeightL"        
## [22] "EyeHeightAvg"       "EyeWidthR"          "EyeWidthL"         
## [25] "EyeWidthAvg"        "FaceWidthCheeks"    "FaceWidthMouth"    
## [28] "FaceWidthBZ"        "Forehead"           "UpperFaceLength2"  
## [31] "PupilTopR"          "PupilTopL"          "PupilTopAsymmetry" 
## [34] "PupilLipR"          "PupilLipL"          "PupilLipAvg"       
## [37] "BottomLipChin"      "MidcheekChinR"      "MidcheekChinL"     
## [40] "CheeksAvg"          "MidbrowHairlineR"   "MidbrowHairlineL"  
## [43] "MidbrowHairlineAvg" "FaceShape"          "Heartshapeness"    
## [46] "NoseShape"          "LipFullness"        "EyeShape"          
## [49] "EyeSize"            "UpperHeadLength"    "MidfaceLength"     
## [52] "ChinLength"         "ForeheadHeight"     "CheekboneHeight"   
## [55] "FaceRoundness"      "fWHR2"
data_categorical_vars <- sapply(data_train, is.factor) %>% which() %>% names()
data_categorical_vars
## [1] "EthnicitySelf" "GenderSelf"

4.1. Numerical Variables

4.1.1. Collinearity

First, the correlations among the numeric variables will be explored. Let’s check for multicollinearity - whether any of our variables are linear combinations of others and should be dropped.

data_linear_combinations <- findLinearCombos(data_train[, data_numeric_vars])
for (item in data_linear_combinations$linearCombos){
  print(names(data_train[, data_numeric_vars][item]))
}
## [1] "EyeWidthAvg" "EyeWidthR"   "EyeWidthL"  
## [1] "PupilLipAvg" "PupilLipR"   "PupilLipL"  
## [1] "CheeksAvg"     "MidcheekChinR" "MidcheekChinL"
## [1] "MidbrowHairlineAvg" "MidbrowHairlineR"   "MidbrowHairlineL"
names(data_train[, data_numeric_vars])[data_linear_combinations$remove]
## [1] "EyeWidthAvg"        "PupilLipAvg"        "CheeksAvg"         
## [4] "MidbrowHairlineAvg"

However, before we remove the suggested variables, let’s inspect these correlations.

colinear_variables <- c("EyeWidthAvg", "EyeWidthR", "EyeWidthL",
                        "PupilLipAvg", "PupilLipR", "PupilLipL",
                        "CheeksAvg", "MidcheekChinR", "MidcheekChinL",
                        "MidbrowHairlineAvg", "MidbrowHairlineR", "MidbrowHairlineL")
correlations <- cor(data_train[, colinear_variables])
corrplot(correlations)

We can see that all the cases of multicollinearity concern the same part of the face: there are high correlations within the eye-width, pupil-to-lip, cheek and hairline measurements. However, if we removed only the variables suggested above (the averages), each group would still contain two very highly correlated left and right measurements. Let’s go against the suggestion, keep the averages EyeWidthAvg, PupilLipAvg, CheeksAvg and MidbrowHairlineAvg, and remove the individual left and right variables instead.

highly_correlated <- c("EyeWidthR", "EyeWidthL", "PupilLipR", "PupilLipL",
                       "MidcheekChinR", "MidcheekChinL", "MidbrowHairlineR", "MidbrowHairlineL")

data_train[highly_correlated] <- NULL
data_test[highly_correlated] <- NULL

4.1.2. Correlations

Correlation describes how strongly variables are related to each other. In particular, how each attribute varies with the dependent variable is extremely helpful for further feature selection. Let’s first see if we caught all highly correlated predictors in the previous section. In order to pick out the most correlated variables regardless of the sign of the correlation (positive or negative), we plotted the absolute values of those correlations.

There is some high correlation among EyeHeightAvg, EyeHeightR, EyeHeightL, EyeSize and EyeShape. Let’s take a closer look.

Let’s keep EyeHeightAvg and EyeShape and drop the remaining three.

highly_correlated <- c( "EyeHeightR", "EyeHeightL", "EyeSize")
data_train[c(highly_correlated)] <- NULL
data_test[c(highly_correlated)] <- NULL

And we continue our exploration.

The full correlation plot is a bit unreadable, so let’s order the variables by the strength of their correlation with our dependent variable.

Let’s look at the most correlated variables - in our case the ones that show a correlation of at least around 0.2 with the dependent variable.

Fortunately, we don’t have any excessively high correlations with our dependent variable. Let’s look at the relationship of the target variable with the 5 most correlated predictors.
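
The exact plotting code is in the attached script; the ordering itself can be obtained roughly as follows (a sketch - the resulting vector, data_correlation_ordered, is reused later in section 4.4).

# Numeric predictors ordered by their absolute correlation with Attractive
num_vars <- sapply(data_train, is.numeric) %>% which() %>% names()
abs_correlations <- cor(data_train[, num_vars])[, "Attractive"] %>% abs() %>% sort(decreasing = TRUE)
data_correlation_ordered <- names(abs_correlations)
barplot(abs_correlations[-1], las = 2, cex.names = 0.5,
        main = "absolute correlation with Attractive")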

Most Correlated

  • Trustworthy
  • Happy
  • Threatening
  • ChinLength
  • Sad

As the graphs show, the two most correlated variables (Trustworthy and Happy) display a positive correlation with our dependent variable, while the next three predictors (Threatening, ChinLength and Sad) are strongly negatively correlated with it.

4.2. Categorical Variables

Now let’s check the relationship of the categorical predictors with the target variable. Our dependent variable is quantitative and these predictors are qualitative, so we can use one-way ANOVA to test their relation with the dependent variable.

4.2.1. Ethnicity

anova_ethnicity <- aov(data_train$Attractive ~ data_train$EthnicitySelf) 
summary(anova_ethnicity)
##                           Df Sum Sq Mean Sq F value Pr(>F)
## data_train$EthnicitySelf   3   2.26  0.7527   1.259  0.288
## Residuals                415 248.16  0.5980

The EthnicitySelf variable yielded a p-value of 0.288. This means we cannot reject the null hypothesis that mean attractiveness is the same across ethnicities, so the variable does not appear to be informative and we can drop it. Additionally, looking at the boxplots, the attractiveness scores are very similar for all four ethnicities, which is also confirmed by their mean values.

data_train %>% select(Attractive, EthnicitySelf) %>% group_by(EthnicitySelf) %>% 
  summarise(mean(Attractive))
## # A tibble: 4 x 2
##   EthnicitySelf `mean(Attractive)`
##   <fct>                      <dbl>
## 1 A                           3.21
## 2 B                           3.31
## 3 L                           3.25
## 4 W                           3.13
ggplot(data_train, aes(x = EthnicitySelf, y = Attractive, fill = EthnicitySelf)) +
  geom_boxplot(alpha = 0.6)  +
  scale_fill_manual(values = c("#94D4FF", "#4FB0FF", "#1F8FFF", "#3079D1"))

4.2.2. Gender

anova_gender <- aov(data_train$Attractive ~ data_train$GenderSelf)
summary(anova_gender)
##                        Df Sum Sq Mean Sq F value   Pr(>F)    
## data_train$GenderSelf   1  24.39  24.389   44.99 6.46e-11 ***
## Residuals             417 226.03   0.542                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see, the p-value of the F statistic for GenderSelf is extremely low, so we can definitely reject the null hypothesis - the variable GenderSelf does affect Attractive. The mean scores differ by roughly 0.5 points on the rating scale, which is a substantial difference.
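
The group means can be checked the same way as for ethnicity (a sketch):

# Mean attractiveness by self-reported gender
data_train %>% select(Attractive, GenderSelf) %>% group_by(GenderSelf) %>%
  summarise(mean(Attractive))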

ggplot(data_train, aes(x = GenderSelf, y = Attractive, fill = GenderSelf)) +
  geom_boxplot(alpha = 0.6)  +
  scale_fill_manual(values = c("#94D4FF", "#1F8FFF"))

Finally, we will remove EthnicitySelf and keep the GenderSelf variable.

data_train$EthnicitySelf <- NULL
data_test$EthnicitySelf <- NULL

4.3. Diversity

The next step is to identify low-variance variables. As visible below, the frequency ratio (the frequency of the most common value divided by the frequency of the second most common value) is close to 1 for most of the variables, which means they are well diversified; none of them show near-zero variance.

nearZeroVar(data_train, saveMetrics = TRUE)
##                    freqRatio percentUnique zeroVar   nzv
## GenderSelf          1.043902      0.477327   FALSE FALSE
## AgeRated            1.000000     96.897375   FALSE FALSE
## Afraid              2.500000     64.200477   FALSE FALSE
## Angry               2.200000     74.463007   FALSE FALSE
## Attractive          1.200000     77.326969   FALSE FALSE
## Babyfaced           1.000000     74.701671   FALSE FALSE
## Disgusted           2.428571     66.109785   FALSE FALSE
## Dominant            1.333333     63.007160   FALSE FALSE
## Feminine            1.250000     80.190931   FALSE FALSE
## Happy               1.500000     79.474940   FALSE FALSE
## Prototypic          2.000000     84.248210   FALSE FALSE
## Sad                 2.200000     74.940334   FALSE FALSE
## Surprised           1.125000     57.279236   FALSE FALSE
## Threatening         3.600000     68.973747   FALSE FALSE
## Trustworthy         2.400000     70.405728   FALSE FALSE
## Unusual             1.833333     67.303103   FALSE FALSE
## NoseWidth           1.375000     44.391408   FALSE FALSE
## NoseLength          1.142857     42.482100   FALSE FALSE
## LipThickness        1.142857     42.959427   FALSE FALSE
## FaceLength          1.000000     66.587112   FALSE FALSE
## EyeHeightAvg        1.333333     30.548926   FALSE FALSE
## EyeWidthAvg         1.000000     39.379475   FALSE FALSE
## FaceWidthCheeks     1.000000     48.448687   FALSE FALSE
## FaceWidthMouth      1.400000     57.756563   FALSE FALSE
## FaceWidthBZ         1.111111     42.243437   FALSE FALSE
## Forehead            1.250000     56.324582   FALSE FALSE
## UpperFaceLength2    1.200000     48.210024   FALSE FALSE
## PupilTopR           1.000000     58.472554   FALSE FALSE
## PupilTopL           1.000000     59.904535   FALSE FALSE
## PupilTopAsymmetry   1.117647     18.615752   FALSE FALSE
## PupilLipAvg         1.200000     63.245823   FALSE FALSE
## BottomLipChin       1.000000     46.300716   FALSE FALSE
## CheeksAvg           1.000000     65.393795   FALSE FALSE
## MidbrowHairlineAvg  1.250000     73.747017   FALSE FALSE
## FaceShape           1.000000     99.522673   FALSE FALSE
## Heartshapeness      1.000000     99.045346   FALSE FALSE
## NoseShape           1.000000     99.284010   FALSE FALSE
## LipFullness         1.000000     99.522673   FALSE FALSE
## EyeShape            1.000000     97.613365   FALSE FALSE
## UpperHeadLength     1.000000     99.045346   FALSE FALSE
## MidfaceLength       1.000000     99.522673   FALSE FALSE
## ChinLength          2.000000     99.761337   FALSE FALSE
## ForeheadHeight      1.000000     99.284010   FALSE FALSE
## CheekboneHeight     2.000000     99.761337   FALSE FALSE
## FaceRoundness       2.000000     99.761337   FALSE FALSE
## fWHR2               1.500000     98.329356   FALSE FALSE

4.4. Final Selection

For modeling we will be using two sets of data - the 20 numeric variables most correlated with the dependent variable, with the variable GenderSelf added, and all the remaining variables in the dataset.

# Attractive itself plus its 20 most correlated numeric predictors, with GenderSelf added
selected_variables <- data_correlation_ordered[1:21]
selected_variables <- append(selected_variables, c("GenderSelf"))

data_train_high <- data_train[selected_variables]
data_test_high <- data_test[selected_variables]

Let’s save the list of selected variables as well.

save(list = "selected_variables",
     file = "selected_variables.RData")

5. Modeling

Before estimating the models, we will create a function that will help us present the results in a more convenient and clear way. The function visualizes the values predicted by a model, compares them with the real data, and returns the correlation between the two.

visualize <- function(model){
  # predictions on the training set
  predicted_values <- predict(model)
  
  # distribution of the in-sample errors
  plot_1 <- ggplot(data.frame(error = data_train$Attractive - predicted_values),
                   aes(x = error)) +
    geom_histogram(fill = "dodgerblue1", bins = 100) +
    theme_bw()
  
  # predicted versus real values
  plot_2 <- ggplot(data.frame(real = data_train$Attractive, predicted = predicted_values),
                   aes(x = predicted, y = real)) +
    geom_point(col = "dodgerblue1") +
    theme_bw()
  
  print(ggarrange(plot_1, plot_2))
  
  # correlation between the actual and the predicted values
  cor(data_train$Attractive, predicted_values)
}

5.1. Estimating All Models

The models tested below:

  • Linear regression model with highly correlated variables
  • Linear regression model with all variables
  • Linear model with backward elimination based on p-value with highly correlated variables
  • Linear model with backward elimination based on p-value with all variables
  • Ridge regression model with highly correlated variables
  • Ridge regression model with all variables
  • LASSO regression model with highly correlated variables
  • LASSO regression model with all variables
  • Elastic Net model with highly correlated variables
  • Elastic Net model with all variables

# linear regression on the reduced set of variables and on the full set
faces_lm1 <- lm(Attractive ~ ., 
                data = data_train_high)

faces_lm2 <- lm(Attractive ~ .,
                data = data_train)

# backward elimination: variables are removed as long as their p-value exceeds 0.05
faces_lm1_backward_p_value <- ols_step_backward_p(faces_lm1, prem = 0.05, progress = TRUE)

faces_lm2_backward_p_value <- ols_step_backward_p(faces_lm2, prem = 0.05, progress = TRUE)

# treatment contrasts for both unordered and ordered factors
options(contrasts = c("contr.treatment", "contr.treatment"))

# 7-fold cross-validation; in glmnet, alpha = 0 corresponds to ridge regression
ctrl_cv7 <- trainControl(method = "cv", number = 7)
parameters_ridge <- expand.grid(alpha = 0, lambda = seq(10, 1e4, 10))

faces_ridge_1 <- train(Attractive ~ .,
                        data = data_train_high,
                        method = "glmnet",
                        tuneGrid = parameters_ridge,
                        trControl = ctrl_cv7)

faces_ridge_2 <- train(Attractive ~ .,
                        data = data_train,
                        method = "glmnet",
                        tuneGrid = parameters_ridge,
                        trControl = ctrl_cv7)

# logarithmic grid of lambdas; alpha = 1 corresponds to LASSO
lambdas <- exp(log(10)*seq(-2, 9, length.out = 200))
parameters_lasso <- expand.grid(alpha = 1, lambda = lambdas)

faces_lasso_1 <- 
  train(Attractive ~ .,
                       data = data_train_high,
                       method = "glmnet", 
                       tuneGrid = parameters_lasso,
                       trControl = ctrl_cv7)

faces_lasso_2 <- 
  train(Attractive ~ .,
                       data = data_train,
                       method = "glmnet", 
                       tuneGrid = parameters_lasso,
                       trControl = ctrl_cv7)

# elastic net: search over the mixing parameter alpha between 0 (ridge) and 1 (LASSO)
parameters_elastic <- expand.grid(alpha = seq(0, 1, 0.2), lambda = seq(0, 1e2, 1))

faces_elastic_1 <- train(Attractive ~ .,
                        data = data_train_high,
                        method = "glmnet", 
                        tuneGrid = parameters_elastic,
                        trControl = ctrl_cv7)

faces_elastic_2 <- train(Attractive ~ .,
                        data = data_train,
                        method = "glmnet", 
                        tuneGrid = parameters_elastic,
                        trControl = ctrl_cv7)

model_list <- list(faces_lm1, faces_lm2, faces_lm1_backward_p_value$model, 
                faces_lm2_backward_p_value$model,
                faces_ridge_1, faces_ridge_2, faces_lasso_1, faces_lasso_2,
                faces_elastic_1, faces_elastic_2)

model_names <- c("linear_regression_high", "linear_regression_all",
                 "OLS_high", "OLS_all",
                "ridge_high", "ridge_all", "lasso_high", "lasso_all",
                "elastic_net_high", "elastic_net_all")

5.2. Visualization

We calculated predicted values based on all models. Now, we will visualise those values versus the real ones and check the distribution of errors.
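
The graphs and correlation coefficients reported below come from calling visualize() on each fitted model, along these lines (a sketch):

visualize(faces_lm1)       # linear regression, selected variables
visualize(faces_lm2)       # linear regression, all variables
visualize(faces_ridge_1)   # ridge regression, selected variables
visualize(faces_lasso_2)   # LASSO regression, all variables
# ... and analogously for the remaining models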

Linear Regression

Selected Variables

## [1] 0.8453847

All Variables

## [1] 0.8850862

OLS Backstepping

Selected Variables

## [1] 0.8423868

All Variables

## [1] 0.878183

Ridge Regression

Selected Variables

## [1] 0.7062318

All Variables

## [1] 0.6960377

LASSO Regression

Selected Variables

## [1] 0.8382624

All Variables

## [1] 0.873047

Cross Validation Elastic Net

Selected Variables

## [1] 0.8453302

All Variables

## [1] 0.884708

6. Evaluation

Here we define a function, regressionMetrics, that will show us the performance of our models.

regressionMetrics <- function(real, predicted) {
  MSE <- mean((real - predicted)^2)
  RMSE <- sqrt(MSE)
  MAE <- mean(abs(real - predicted))
  MAPE <- mean(abs(real - predicted)/real)
  MedAE <- median(abs(real - predicted))
  MSLE <- mean((log(1 + real) - log(1 + predicted))^2)
  TSS <- sum((real - mean(real))^2)
  RSS <- sum((predicted - real)^2)
  R2 <- 1 - RSS/TSS
  
  result <- data.frame(MSE, RMSE, MAE, MAPE, MedAE, MSLE, R2)
  return(result)
}

We check the metrics of all estimated models and compare them. The investigated metrics:

  • MSE - mean squared error - the average squared difference between the predictions and the observations
  • RMSE - root mean squared error - the square root of the MSE, i.e. the standard deviation of the residuals
  • MAE - mean absolute error - the average absolute difference between the predictions and the observations
  • MAPE - mean absolute percentage error - the average absolute error expressed as a fraction of the actual values
  • MedAE - median absolute error - the median of all absolute differences between the target and the prediction
  • MSLE - mean squared logarithmic error - the average squared difference between the logarithms of (1 + actual) and (1 + predicted) values
  • R2 - R squared - the proportion of the variance of the dependent variable explained by the model

results_dataframe <- data.frame()
for (item in model_list){
  results <- regressionMetrics(data_test$Attractive,
                               predict(item, data_test))
  results_dataframe <- rbind(results_dataframe, results)
}
row.names(results_dataframe) <- model_names
results_dataframe
##                              MSE      RMSE       MAE      MAPE     MedAE
## linear_regression_high 0.2103240 0.4586110 0.3592296 0.1210782 0.3033424
## linear_regression_all  0.2027092 0.4502324 0.3545443 0.1182855 0.2920863
## OLS_high               0.2033141 0.4509037 0.3553863 0.1194138 0.2919233
## OLS_all                0.2011840 0.4485354 0.3521739 0.1176472 0.2896622
## ridge_high             0.4530821 0.6731137 0.5530207 0.1799830 0.5107613
## ridge_all              0.4480510 0.6693661 0.5468848 0.1782385 0.4852803
## lasso_high             0.2027022 0.4502246 0.3521961 0.1171740 0.3054579
## lasso_all              0.1858561 0.4311103 0.3417936 0.1137362 0.3007018
## elastic_net_high       0.2095648 0.4577825 0.3583281 0.1207287 0.3025907
## elastic_net_all        0.2014788 0.4488639 0.3537071 0.1180233 0.2943244
##                              MSLE        R2
## linear_regression_high 0.01280973 0.6320254
## linear_regression_all  0.01243196 0.6453480
## OLS_high               0.01230307 0.6442896
## OLS_all                0.01225716 0.6480163
## ridge_high             0.02503912 0.2073054
## ridge_all              0.02478734 0.2161076
## lasso_high             0.01201220 0.6453601
## lasso_all              0.01112291 0.6748334
## elastic_net_high       0.01274583 0.6333536
## elastic_net_all        0.01233853 0.6475007

We get the best results from the LASSO regression model set on all the variables. Let’s visualize the errors in our predictions from that model.

par(mfrow=c(3,1))
errors <- abs(predict(faces_lasso_2, data_test) - data_test$Attractive)
plot(abs(predict(faces_lasso_2, data_test) - data_test$Attractive),
     main = "magnitude of the errors",
     ylab = 'error', pch = 19)
hist(errors, breaks = 100)
plot(ecdf(errors))

If we look at the magnitude of the errors, we can see that very few surpass 1, and the vast majority stay in the range from 0 to 0.5. If we direct our attention to the empirical cumulative distribution function of the errors, we can see that about 80% of them are 0.5 or lower, which means that our model predicts a person’s attractiveness rather accurately, with only a few outlier exceptions.
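
These shares can also be checked directly (a quick sketch):

# Fraction of test-set errors not exceeding 0.5, and the 80th percentile of the errors
mean(errors <= 0.5)
quantile(errors, 0.8)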

7. Results and Conclusions

Final coefficients for variables in the LASSO model:
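
A sketch of how these coefficients can be extracted from the fitted caret object at the cross-validated lambda:

coef(faces_lasso_2$finalModel, s = faces_lasso_2$bestTune$lambda)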

## 46 x 1 sparse Matrix of class "dgCMatrix"
##                                1
## (Intercept)        -1.522361e+00
## GenderSelfM         1.141081e+00
## AgeRated           -2.622102e-02
## Afraid              .           
## Angry               .           
## Babyfaced           .           
## Disgusted          -9.407564e-03
## Dominant            3.935651e-01
## Feminine            5.788030e-01
## Happy               2.003957e-01
## Prototypic         -3.195150e-02
## Sad                -3.217708e-01
## Surprised          -1.393618e-01
## Threatening        -4.035229e-02
## Trustworthy         1.573323e+00
## Unusual            -1.953583e-01
## NoseWidth           .           
## NoseLength          .           
## LipThickness        .           
## FaceLength          .           
## EyeHeightAvg        4.999653e-03
## EyeWidthAvg         1.479866e-03
## FaceWidthCheeks     .           
## FaceWidthMouth     -8.676962e-04
## FaceWidthBZ         1.093849e-03
## Forehead            .           
## UpperFaceLength2    5.535949e-05
## PupilTopR           4.913545e-04
## PupilTopL           3.113078e-04
## PupilTopAsymmetry   .           
## PupilLipAvg         .           
## BottomLipChin       .           
## CheeksAvg           1.889630e-04
## MidbrowHairlineAvg  .           
## FaceShape           .           
## Heartshapeness      .           
## NoseShape           .           
## LipFullness         .           
## EyeShape            .           
## UpperHeadLength     .           
## MidfaceLength       .           
## ChinLength         -1.895147e+00
## ForeheadHeight      .           
## CheekboneHeight     1.688344e+00
## FaceRoundness       .           
## fWHR2              -3.899939e-01

From all the available variables, 21 were selected for the reduced set, and 10 different models were estimated and evaluated. Of all of them, LASSO regression with all variables yielded the best results: an R2 of 67.48% and a mean squared error of 0.186. The next best results came from the backward-elimination model and the cross-validated elastic net model estimated on all variables, which performed almost identically. The worst performance was displayed by the ridge regression models, whose errors were roughly twice as large as those of the other models.

An important lesson from this project is that you cannot always trust the estimates or results generated by a single statistical test; sometimes you need intuition and a good knowledge of your data to make decisions and obtain the best results.