import pandas as pd
import numpy as np
from sklearn.decomposition import PCA #alternative package is statsmodels
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv("BEPF Data.csv")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 282 entries, 0 to 281
Data columns (total 30 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   PF_Dow        282 non-null    int64
 1   PF_Sta        282 non-null    object
 2   PF_Res        282 non-null    int64
 3   PF_Act        282 non-null    object
 4   PF_Dyn        282 non-null    int64
 5   PF_Inn        282 non-null    int64
 6   PF_Agg        282 non-null    int64
 7   PF_Bol        282 non-null    int64
 8   PF_Ord        282 non-null    int64
 9   PF_Sim        282 non-null    int64
 10  PF_Rom        282 non-null    int64
 11  PF_Sen        282 non-null    object
 12  BE_Recog      282 non-null    int64
 13  BE_Aware      282 non-null    int64
 14  BE_ComMind    282 non-null    int64
 15  BE_RecalLG    282 non-null    int64
 16  BE_DIFF_Im    282 non-null    int64
 17  BE_Quality    282 non-null    int64
 18  BE_Function   282 non-null    int64
 19  BE_BLoyal     282 non-null    int64
 20  BE_FirstChoi  282 non-null    int64
 21  BE_NBuyOth    282 non-null    object
 22  Gender        282 non-null    int64
 23  Age           282 non-null    int64
 24  Ethn          282 non-null    object
 25  Schyea        282 non-null    object
 26  EAtt_BG       282 non-null    object
 27  EAtt_DL       282 non-null    object
 28  EAtt_UP       282 non-null    object
 29  EAtt_UF       282 non-null    object
dtypes: int64(20), object(10)
memory usage: 66.2+ KB
df.iloc[:, 0:22] = df.iloc[:, 0:22].apply(pd.to_numeric, errors='coerce')
# df.astype() doesn't work here: these object columns contain non-numeric entries,
# so pd.to_numeric with errors='coerce' is used to turn them into NaN instead of raising an error
BEPF=df.iloc[:,0:12]
BEPF.info()
# FYI: columns that contain missing values are converted to float64 by default,
# so a float dtype here is a sign of which variables have missing values
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 282 entries, 0 to 281
Data columns (total 12 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   PF_Dow   282 non-null    int64
 1   PF_Sta   281 non-null    float64
 2   PF_Res   282 non-null    int64
 3   PF_Act   281 non-null    float64
 4   PF_Dyn   282 non-null    int64
 5   PF_Inn   282 non-null    int64
 6   PF_Agg   282 non-null    int64
 7   PF_Bol   282 non-null    int64
 8   PF_Ord   282 non-null    int64
 9   PF_Sim   282 non-null    int64
 10  PF_Rom   282 non-null    int64
 11  PF_Sen   280 non-null    float64
dtypes: float64(3), int64(9)
memory usage: 26.6 KB
Note that some modeling functions/packages cannot handle missing data.
np.where(BEPF.isnull())  # show the positions of missing values; the first array gives the row indices, the second the column indices
(array([194, 275, 277, 280], dtype=int64), array([ 3, 1, 11, 11], dtype=int64))
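Beyond the raw index positions, it can be useful to count the missing values per variable and to view the affected rows directly; a minimal sketch on the same BEPF frame:

BEPF.isnull().sum()                    # number of missing values in each variable
BEPF[BEPF.isnull().any(axis=1)]        # rows that contain at least one missing value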
Missing completely at random (MCAR). All variables and observations have the same probability of being missing; this type of missingness is random in nature. The causes of the missing data are unrelated to the data (i.e., missing values of Y depend neither on X nor on Y). For example, a fan did not respond to an online survey question because of an Internet outage.
Missing at random (MAR). The probability of being missing is linked to the values of other variables in the dataset (i.e., missing values of Y depend on X, but not on Y). For example, asking low-income fans about their skybox experience when attending games. "MAR is more general and more realistic than MCAR. Modern missing data methods generally start from the MAR assumption." (van Buuren, 2018)
Missing not at random (MNAR). Missing values of Y depend on Y itself. For example, asking respondents for confidential information (e.g., cell phone number, home address, specific annual income).
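To make the distinction concrete, here is a small hypothetical simulation (unrelated to the BEPF data) in which missingness in y is either unrelated to anything (MCAR) or driven by another observed variable x (MAR):

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                              # an observed covariate
y = x + rng.normal(size=1000)                          # the variable that will receive missing values
y_mcar = y.copy()
y_mcar[rng.random(1000) < 0.10] = np.nan               # MCAR: every value has the same 10% chance of being missing
y_mar = y.copy()
y_mar[(x < -1) & (rng.random(1000) < 0.5)] = np.nan    # MAR: missingness depends on x, not on y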
Under 10%
Any of the imputation methods can be applied when missing data are this low, although the complete-case method has been shown to be the least preferred.
10% - 20%
The increased presence of missing data makes the all-available, hot-deck, case-substitution, and regression methods most preferred for MCAR data, whereas model-based methods are necessary with an MAR missing data process.
Over 20%
If it is deemed necessary to impute missing data when the level is over 20 percent, the method must be chosen carefully to match the missing data process (Hair et al., 2010).
A rule of thumb by Hair et al. (2010) is that missing data under 10% for an individual case or observation can generally be ignored.
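Before choosing a strategy, it helps to check how much of the data is actually missing; a minimal sketch on the BEPF frame:

(BEPF.isnull().mean() * 100).round(2)               # percentage of missing values per variable
round(BEPF.isnull().any(axis=1).mean() * 100, 2)    # percentage of cases with at least one missing value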
You need to use one of the following strategies to address the missing values:
imp = IterativeImputer(estimator=RandomForestRegressor())  # multivariate imputation using a random forest as the estimator
imp.fit(BEPF)
BEPF_new = pd.DataFrame(imp.transform(BEPF), columns=BEPF.columns)  # keep the original column names
BEPF_new
|  | PF_Dow | PF_Sta | PF_Res | PF_Act | PF_Dyn | PF_Inn | PF_Agg | PF_Bol | PF_Ord | PF_Sim | PF_Rom | PF_Sen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.0 | 6.0 | 7.0 | 7.0 | 6.0 | 5.0 | 7.0 | 6.0 | 2.0 | 2.0 | 1.0 | 2.00 |
| 1 | 4.0 | 7.0 | 5.0 | 5.0 | 4.0 | 6.0 | 5.0 | 5.0 | 6.0 | 3.0 | 1.0 | 2.00 |
| 2 | 6.0 | 6.0 | 6.0 | 5.0 | 5.0 | 7.0 | 3.0 | 5.0 | 4.0 | 4.0 | 4.0 | 4.00 |
| 3 | 2.0 | 6.0 | 5.0 | 6.0 | 4.0 | 5.0 | 6.0 | 6.0 | 5.0 | 4.0 | 1.0 | 2.00 |
| 4 | 5.0 | 6.0 | 5.0 | 7.0 | 6.0 | 5.0 | 3.0 | 5.0 | 2.0 | 4.0 | 2.0 | 4.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 277 | 6.0 | 7.0 | 7.0 | 7.0 | 4.0 | 3.0 | 2.0 | 7.0 | 5.0 | 1.0 | 3.0 | 2.21 |
| 278 | 5.0 | 5.0 | 6.0 | 6.0 | 4.0 | 4.0 | 3.0 | 3.0 | 2.0 | 6.0 | 1.0 | 2.00 |
| 279 | 5.0 | 6.0 | 6.0 | 7.0 | 6.0 | 5.0 | 6.0 | 4.0 | 3.0 | 1.0 | 1.0 | 1.00 |
| 280 | 5.0 | 5.0 | 5.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 2.0 | 5.0 | 2.0 | 3.06 |
| 281 | 5.0 | 6.0 | 5.0 | 3.0 | 5.0 | 3.0 | 4.0 | 4.0 | 6.0 | 2.0 | 1.0 | 2.00 |
282 rows × 12 columns
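As a sanity check, we can look up the positions that were originally missing and confirm that BEPF_new now holds imputed values there:

rows, cols = np.where(BEPF.isnull())
for r, c in zip(rows, cols):
    print(BEPF.columns[c], 'row', r, '->', BEPF_new.iloc[r, c])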
pca = PCA(n_components=5)  # retain the first five principal components
pca_results = pca.fit_transform(BEPF_new)
pca_results = pd.DataFrame(data=pca_results, columns=['PC1', 'PC2', 'PC3', 'PC4', 'PC5'])
pca_results
|  | PC1 | PC2 | PC3 | PC4 | PC5 |
|---|---|---|---|---|---|
| 0 | 0.884511 | -4.149250 | 0.642189 | 0.498042 | -0.537676 |
| 1 | -1.100343 | -0.657103 | -0.968603 | 1.512359 | -0.847872 |
| 2 | 1.065119 | 1.014804 | 0.751841 | -1.609559 | -0.361946 |
| 3 | -1.197785 | -0.902073 | -0.422372 | 2.698197 | 0.809483 |
| 4 | -0.151352 | -0.806130 | 0.851878 | -1.926525 | 1.432481 |
| ... | ... | ... | ... | ... | ... |
| 277 | -0.632945 | -1.070345 | -0.116560 | -1.940312 | -3.383962 |
| 278 | -2.931960 | -0.057941 | -1.240816 | -1.826628 | 1.314025 |
| 279 | -0.852092 | -3.958035 | 0.415825 | 0.450932 | -0.978745 |
| 280 | 0.980150 | -1.343022 | 0.741052 | 0.845582 | 1.338341 |
| 281 | -3.242454 | 0.189610 | -0.309440 | 1.038124 | -2.505080 |
282 rows × 5 columns
explained_variance = pca.explained_variance_ratio_
explained_variance.sum()  # total proportion of variance explained by the five retained components
0.7859135360278126
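To judge how many components are worth retaining, it also helps to inspect the variance explained by each component individually, for example with a simple scree plot (this sketch assumes matplotlib is available):

import matplotlib.pyplot as plt
plt.plot(range(1, 6), pca.explained_variance_ratio_, marker='o')
plt.xlabel('Principal component')
plt.ylabel('Proportion of variance explained')
plt.title('Scree plot')
plt.show()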
A factor loading is the correlation coefficient between an original variable (feature) and a principal component/factor.
The (squared) factor loading shows how much of the variable's variance is explained by that particular factor.
With each variable's factor loadings available, we can calculate a factor score for each principal component as a weighted combination of the original variables, with the loadings acting as the weights (much like regression coefficients). For example, PC1's factor score for each respondent is obtained by combining all 12 PF variables using their PC1 loadings, as illustrated after the loadings table below.
loadings = pd.DataFrame(pca.components_.T,
columns=['PC1', 'PC2', 'PC3', 'PC4', 'PC5'],
index=BEPF.columns.values).round(3)
loadings
|  | PC1 | PC2 | PC3 | PC4 | PC5 |
|---|---|---|---|---|---|
| PF_Dow | 0.271 | -0.008 | -0.214 | -0.270 | -0.542 |
| PF_Sta | 0.188 | -0.094 | -0.276 | -0.342 | -0.244 |
| PF_Res | 0.249 | -0.126 | -0.248 | -0.352 | -0.088 |
| PF_Act | 0.270 | -0.227 | -0.165 | -0.123 | 0.261 |
| PF_Dyn | 0.314 | -0.200 | -0.038 | 0.089 | 0.264 |
| PF_Inn | 0.388 | -0.201 | 0.030 | 0.007 | 0.314 |
| PF_Agg | 0.355 | -0.161 | 0.090 | 0.595 | -0.159 |
| PF_Bol | 0.372 | -0.196 | 0.047 | 0.233 | -0.102 |
| PF_Ord | 0.086 | 0.426 | -0.431 | 0.425 | -0.222 |
| PF_Sim | 0.110 | 0.453 | -0.530 | 0.031 | 0.421 |
| PF_Rom | 0.301 | 0.426 | 0.398 | -0.041 | -0.281 |
| PF_Sen | 0.366 | 0.460 | 0.390 | -0.268 | 0.249 |
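As a check on the loading/score relationship described above, the component scores returned by fit_transform can be reproduced by centering the data and weighting it with the (unrounded) loadings in pca.components_:

scores_manual = (BEPF_new - BEPF_new.mean()) @ pca.components_.T   # center the data, then apply the loading weights
scores_manual.columns = ['PC1', 'PC2', 'PC3', 'PC4', 'PC5']
np.allclose(scores_manual.values, pca_results.values)              # True, up to floating-point error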