PCA: Missing Values, Factor Loadings, and Factor Scores¶

In [1]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA  # an alternative package is statsmodels
from sklearn.experimental import enable_iterative_imputer  # required to unlock the experimental IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor  # estimator used inside the imputer
In [2]:
import warnings
warnings.filterwarnings('ignore')
In [3]:
df = pd.read_csv("BEPF Data.csv") 
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 282 entries, 0 to 281
Data columns (total 30 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   PF_Dow        282 non-null    int64 
 1   PF_Sta        282 non-null    object
 2   PF_Res        282 non-null    int64 
 3   PF_Act        282 non-null    object
 4   PF_Dyn        282 non-null    int64 
 5   PF_Inn        282 non-null    int64 
 6   PF_Agg        282 non-null    int64 
 7   PF_Bol        282 non-null    int64 
 8   PF_Ord        282 non-null    int64 
 9   PF_Sim        282 non-null    int64 
 10  PF_Rom        282 non-null    int64 
 11  PF_Sen        282 non-null    object
 12  BE_Recog      282 non-null    int64 
 13  BE_Aware      282 non-null    int64 
 14  BE_ComMind    282 non-null    int64 
 15  BE_RecalLG    282 non-null    int64 
 16  BE_DIFF_Im    282 non-null    int64 
 17  BE_Quality    282 non-null    int64 
 18  BE_Function   282 non-null    int64 
 19  BE_BLoyal     282 non-null    int64 
 20  BE_FirstChoi  282 non-null    int64 
 21  BE_NBuyOth    282 non-null    object
 22  Gender        282 non-null    int64 
 23  Age           282 non-null    int64 
 24  Ethn          282 non-null    object
 25  Schyea        282 non-null    object
 26  EAtt_BG       282 non-null    object
 27  EAtt_DL       282 non-null    object
 28  EAtt_UP       282 non-null    object
 29  EAtt_UF       282 non-null    object
dtypes: int64(20), object(10)
memory usage: 66.2+ KB
In [4]:
df.iloc[:, 0:22] = df.iloc[:, 0:22].apply(pd.to_numeric, errors='coerce')
# df.astype() doesn't work here: it raises on non-numeric entries, whereas errors='coerce' converts them to NaN
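
To see why errors='coerce' is needed, consider a minimal sketch with a hypothetical object column containing one stray non-numeric entry:

s = pd.Series(['5', '7', 'n/a'], dtype=object)  # hypothetical column with one bad entry
# s.astype(float) would raise a ValueError on 'n/a';
# pd.to_numeric with errors='coerce' converts it to NaN instead.
pd.to_numeric(s, errors='coerce')  # 5.0, 7.0, NaN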
In [5]:
BEPF = df.iloc[:, 0:12]
BEPF.info()
# FYI: columns containing missing values are upcast to float64 by default,
# which is a quick sign of which variables have missing data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 282 entries, 0 to 281
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   PF_Dow  282 non-null    int64  
 1   PF_Sta  281 non-null    float64
 2   PF_Res  282 non-null    int64  
 3   PF_Act  281 non-null    float64
 4   PF_Dyn  282 non-null    int64  
 5   PF_Inn  282 non-null    int64  
 6   PF_Agg  282 non-null    int64  
 7   PF_Bol  282 non-null    int64  
 8   PF_Ord  282 non-null    int64  
 9   PF_Sim  282 non-null    int64  
 10  PF_Rom  282 non-null    int64  
 11  PF_Sen  280 non-null    float64
dtypes: float64(3), int64(9)
memory usage: 26.6 KB

Note that some modeling functions/packages cannot handle missing data.


You need to check the severity of missing values and, at times, where they are distributed in the dataset.
In [6]:
np.where(BEPF.isnull())  # positions of missing values: the 1st array holds row indices, the 2nd column indices
Out[6]:
(array([194, 275, 277, 280], dtype=int64),
 array([ 3,  1, 11, 11], dtype=int64))
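
A per-column summary is often more useful than raw positions. A minimal sketch, using only the BEPF frame defined above:

missing_summary = pd.DataFrame({
    'n_missing': BEPF.isnull().sum(),
    'pct_missing': (BEPF.isnull().mean() * 100).round(2)
})
missing_summary[missing_summary['n_missing'] > 0]  # only columns with missing data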

Types of Missing Data¶

  1. Missing completely at random (MCAR). All variables and observations have the same probability of being missing. This type of missingness is random in nature: the causes of the missing data are unrelated to the data (i.e., missing values of Y do not depend on X). For example, a fan did not respond to an (online) survey question because of an Internet outage.

  2. Missing at random (MAR). The probability of being missing is related to the values of other variables in the dataset (i.e., missing values of Y depend on X, but not on Y), e.g., asking low-income fans about their skybox experience in game attendance. "MAR is more general and more realistic than MCAR. Modern missing data methods generally start from the MAR assumption." (van Buuren, 2018)

  3. Missing not at random (MNAR). Missing values of Y depend on Y itself. For example, asking respondents for confidential information (e.g., cell phone number, home address, exact annual income).

How much is too much?¶

Under 10%
Any of the imputation methods can be applied when missing data are this low, although the complete-case method has been shown to be the least preferred.

10% - 20%
The increased presence of missing data makes the all-available, hot-deck case substitution, and regression methods most preferred for MCAR data, whereas model-based methods are necessary with a MAR missing data process.

Over 20%
If it is deemed necessary to impute missing data when the level is over 20 percent, the preferred methods are:

  • The regression method for MCAR situations;
  • Model-based methods when MAR missing data occur.
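
As a sketch of how these thresholds could be applied programmatically (the cutoffs follow the guidelines above; the helper name is made up for illustration):

def missing_severity(frame):
    """Classify each column's missing-data level per the thresholds above."""
    pct = frame.isnull().mean() * 100
    severity = pd.cut(pct, bins=[-0.001, 10, 20, 100],
                      labels=['under 10%', '10% - 20%', 'over 20%'])
    return pd.DataFrame({'pct_missing': pct.round(2), 'severity': severity})

missing_severity(BEPF)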

Imputation of Missing Data¶

A rule of thumb by Hair et al. (2010) is that missing data under 10% for an individual case or observation can generally be ignored.

You need to use one of the following strategies to address missing values:

  1. Casewise deletion: deleting entire rows that contain missing values.
  2. Imputation: filling in new values for the missing entries (for both continuous and categorical variables); see the sketch after this list for the simpler strategies.
    • mean
    • mode
    • median
    • random sampling
    • multivariate imputation
    • regression estimation
    • nearest-neighbors imputation
    • maximum likelihood estimation

    For references, see the resources below:
    - ***Flexible Imputation of Missing Data*** (https://stefvanbuuren.name/fimd/) by van Buuren (2018)
    - ***Imputation of missing values with scikit-learn*** (https://scikit-learn.org/stable/modules/impute.html)
    - ***Imputing missing values with variants of IterativeImputer*** (https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html)
  3. Use packages/functions with the aforementioned imputation methods built in.
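
For the simpler strategies on the list above, scikit-learn's SimpleImputer and KNNImputer cover mean/median/mode and nearest-neighbors imputation. A minimal sketch (the variable names are arbitrary):

from sklearn.impute import SimpleImputer, KNNImputer

# Mean imputation; use strategy='median' or 'most_frequent' for the other two
mean_imp = SimpleImputer(strategy='mean')
BEPF_mean = pd.DataFrame(mean_imp.fit_transform(BEPF), columns=BEPF.columns)

# Nearest-neighbors imputation
knn_imp = KNNImputer(n_neighbors=5)
BEPF_knn = pd.DataFrame(knn_imp.fit_transform(BEPF), columns=BEPF.columns)

The cell below instead uses IterativeImputer, a multivariate approach that models each feature with missing values as a function of the other features, here with a random-forest estimator.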
In [7]:
imp = IterativeImputer(estimator=RandomForestRegressor())  # multivariate imputation with a random-forest estimator
imp.fit(BEPF)
BEPF_new = pd.DataFrame(imp.transform(BEPF), columns=BEPF.columns)
BEPF_new
Out[7]:
PF_Dow PF_Sta PF_Res PF_Act PF_Dyn PF_Inn PF_Agg PF_Bol PF_Ord PF_Sim PF_Rom PF_Sen
0 5.0 6.0 7.0 7.0 6.0 5.0 7.0 6.0 2.0 2.0 1.0 2.00
1 4.0 7.0 5.0 5.0 4.0 6.0 5.0 5.0 6.0 3.0 1.0 2.00
2 6.0 6.0 6.0 5.0 5.0 7.0 3.0 5.0 4.0 4.0 4.0 4.00
3 2.0 6.0 5.0 6.0 4.0 5.0 6.0 6.0 5.0 4.0 1.0 2.00
4 5.0 6.0 5.0 7.0 6.0 5.0 3.0 5.0 2.0 4.0 2.0 4.00
... ... ... ... ... ... ... ... ... ... ... ... ...
277 6.0 7.0 7.0 7.0 4.0 3.0 2.0 7.0 5.0 1.0 3.0 2.21
278 5.0 5.0 6.0 6.0 4.0 4.0 3.0 3.0 2.0 6.0 1.0 2.00
279 5.0 6.0 6.0 7.0 6.0 5.0 6.0 4.0 3.0 1.0 1.0 1.00
280 5.0 5.0 5.0 6.0 6.0 6.0 6.0 6.0 2.0 5.0 2.0 3.06
281 5.0 6.0 5.0 3.0 5.0 3.0 4.0 4.0 6.0 2.0 1.0 2.00

282 rows × 12 columns
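
A quick sanity check on the imputed frame, as a minimal sketch: confirm no missing values remain and inspect the rows that previously contained NaNs (rows 194, 275, 277, and 280 per the np.where output above).

assert BEPF_new.isnull().sum().sum() == 0   # no missing values remain
BEPF_new.iloc[[194, 275, 277, 280], :]      # previously incomplete rows, now filled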

In [8]:
pca = PCA(n_components=5)  # retain five principal components
pca_results = pca.fit_transform(BEPF_new)
pca_results = pd.DataFrame(data=pca_results, columns=['PC1', 'PC2', 'PC3', 'PC4', 'PC5'])
pca_results
Out[8]:
PC1 PC2 PC3 PC4 PC5
0 0.884511 -4.149250 0.642189 0.498042 -0.537676
1 -1.100343 -0.657103 -0.968603 1.512359 -0.847872
2 1.065119 1.014804 0.751841 -1.609559 -0.361946
3 -1.197785 -0.902073 -0.422372 2.698197 0.809483
4 -0.151352 -0.806130 0.851878 -1.926525 1.432481
... ... ... ... ... ...
277 -0.632945 -1.070345 -0.116560 -1.940312 -3.383962
278 -2.931960 -0.057941 -1.240816 -1.826628 1.314025
279 -0.852092 -3.958035 0.415825 0.450932 -0.978745
280 0.980150 -1.343022 0.741052 0.845582 1.338341
281 -3.242454 0.189610 -0.309440 1.038124 -2.505080

282 rows × 5 columns

In [9]:
explained_variance = pca.explained_variance_ratio_  # variance share of each retained component
sum(explained_variance)
Out[9]:
0.7859135360278126
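
To see how the variance accumulates component by component, a minimal sketch (no plotting library assumed):

cum_var = np.cumsum(pca.explained_variance_ratio_)
for k, v in enumerate(cum_var, start=1):
    print(f'PC1..PC{k}: {v:.3f} of total variance')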

Factor Loading and Factor Score¶

A factor loading is the correlation coefficient between an original variable (feature) and a principal component/factor.

The squared factor loading gives the share of the variable's variance explained by that particular factor.

With each variable's factor loading available, we can calculate a factor score for each principal component via a regression model in which the factor loadings serve as regression coefficients. For example, we can generate PC1's factor score by regressing PC1 on all 12 PF variables.

In [10]:
loadings = pd.DataFrame(pca.components_.T,  # rows = variables, columns = components
                        columns=['PC1', 'PC2', 'PC3', 'PC4', 'PC5'],
                        index=BEPF.columns.values).round(3)
loadings
Out[10]:
PC1 PC2 PC3 PC4 PC5
PF_Dow 0.271 -0.008 -0.214 -0.270 -0.542
PF_Sta 0.188 -0.094 -0.276 -0.342 -0.244
PF_Res 0.249 -0.126 -0.248 -0.352 -0.088
PF_Act 0.270 -0.227 -0.165 -0.123 0.261
PF_Dyn 0.314 -0.200 -0.038 0.089 0.264
PF_Inn 0.388 -0.201 0.030 0.007 0.314
PF_Agg 0.355 -0.161 0.090 0.595 -0.159
PF_Bol 0.372 -0.196 0.047 0.233 -0.102
PF_Ord 0.086 0.426 -0.431 0.425 -0.222
PF_Sim 0.110 0.453 -0.530 0.031 0.421
PF_Rom 0.301 0.426 0.398 -0.041 -0.281
PF_Sen 0.366 0.460 0.390 -0.268 0.249
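
Note that pca.components_ holds unit-length eigenvector weights rather than correlations. A sketch of how to rescale them to the correlation-style loadings described above, under the common convention loading = weight × sqrt(eigenvalue), divided by each variable's standard deviation since the data were not standardized before the PCA:

# Correlation-scale loadings: eigenvector weights times sqrt(eigenvalue),
# divided by each variable's standard deviation
scaled = pca.components_.T * np.sqrt(pca.explained_variance_)
corr_loadings = pd.DataFrame(scaled / BEPF_new.std().values[:, None],
                             columns=loadings.columns,
                             index=BEPF_new.columns).round(3)
corr_loadings

# Factor scores follow from the weights: scores = (X - mean) @ components.T,
# which reproduces the pca_results frame above
scores_check = (BEPF_new - BEPF_new.mean()) @ pca.components_.T
np.allclose(scores_check.values, pca_results.values)  # should be True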