# Set working directory and path to data
cd <- "/Users/arvindsharma/Dropbox/WCAS/Econometrics/"   ### CHANGE !!!
setwd(cd)

df_train <- read.csv("moneyball-training-data.csv")
df_eval  <- read.csv("moneyball-evaluation-data.csv")

str(df_train)
'data.frame': 2276 obs. of 17 variables:
$ INDEX : int 1 2 3 4 5 6 7 8 11 12 ...
$ TARGET_WINS : int 39 70 86 70 82 75 80 85 86 76 ...
$ TEAM_BATTING_H : int 1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
$ TEAM_BATTING_2B : int 194 219 232 209 186 200 179 171 197 213 ...
$ TEAM_BATTING_3B : int 39 22 35 38 27 36 54 37 40 18 ...
$ TEAM_BATTING_HR : int 13 190 137 96 102 92 122 115 114 96 ...
$ TEAM_BATTING_BB : int 143 685 602 451 472 443 525 456 447 441 ...
$ TEAM_BATTING_SO : int 842 1075 917 922 920 973 1062 1027 922 827 ...
$ TEAM_BASERUN_SB : int NA 37 46 43 49 107 80 40 69 72 ...
$ TEAM_BASERUN_CS : int NA 28 27 30 39 59 54 36 27 34 ...
$ TEAM_BATTING_HBP: int NA NA NA NA NA NA NA NA NA NA ...
$ TEAM_PITCHING_H : int 9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
$ TEAM_PITCHING_HR: int 84 191 137 97 102 92 122 116 114 96 ...
$ TEAM_PITCHING_BB: int 927 689 602 454 472 443 525 459 447 441 ...
$ TEAM_PITCHING_SO: int 5456 1082 917 928 920 973 1062 1033 922 827 ...
$ TEAM_FIELDING_E : int 1011 193 175 164 138 123 136 112 127 131 ...
$ TEAM_FIELDING_DP: int NA 155 153 156 168 149 186 136 169 159 ...
Missing Observations
library(visdat)

vis_dat(df_train)   # variable types plus missingness, one tile per cell
vis_miss(df_train)  # share of missing values in each variable
Correlation Plot
cor(): Computes the correlation of x and y if these are vectors; given a matrix or data frame, as in our example below, it computes the full correlation matrix.
cor_pmat(): Computes a matrix of correlation p-values, which tell us whether each correlation coefficient is significantly different from 0 or not.
?cor

mycorr <- cor(x = df_train[, 1:ncol(df_train)],   # all columns
              use = "pairwise.complete.obs")

p.mat <- ggcorrplot::cor_pmat(x = df_train[, 1:ncol(df_train)])
Now, let's plot it out.
library(ggcorrplot)

myplot <- ggcorrplot(corr = mycorr,        # correlation matrix to visualize
                     method = "square",    # "square" (default) or "circle"
                     type = "lower",       # "full" (default), "lower", or "upper"
                     title = "Correlation Plot",
                     colors = c("red", "white", "green"),  # low, mid & high correlation values
                     lab = TRUE,           # print corr coefficients on the plot
                     lab_size = 2,         # label size; used when lab = TRUE
                     p.mat = p.mat,        # matrix of p-values; if NULL, sig.level, insig, pch, pch.col, pch.cex are ignored
                     insig = "pch",        # insignificant coefficients: "pch" (default) overlays a character, "blank" wipes the glyph
                     pch = 4,              # character drawn on insignificant coefficients (only valid when insig = "pch")
                     hc.order = FALSE,     # if TRUE, reorder the matrix using hclust
                     tl.cex = 8,           # size of the axis text labels
                     tl.col = "black",     # color of the axis text labels
                     digits = 2)
myplot
library(ggplot2)
library(reshape2)

# Assumption: df_melted is df_train reshaped to long format
# (one variable/value pair per row), keeping INDEX as the id
df_melted <- melt(df_train, id.vars = "INDEX")

# create histograms using ggplot2, one facet per variable
ggplot(df_melted, aes(value)) +
  geom_histogram() +
  facet_wrap(~variable, scales = "free_x")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 3478 rows containing non-finite outside the scale range
(`stat_bin()`).
PCA (prcomp)
PCA is a powerful technique for dimensionality reduction, data exploration, and feature selection in R.
Use prcomp() if you prefer Principal Component Analysis via the Singular Value Decomposition (SVD) method. SVD is a numerically stable way of decomposing a data matrix into orthogonal components.
You can adapt the steps below to your specific dataset and analysis goals.
Batting
We will try to use these six batting variables to create one composite batting variable.
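The stargazer call below relies on a batting_data object that is not defined in this excerpt. A minimal sketch, assuming it simply collects the six TEAM_BATTING_* columns of df_train:

# Assumption: batting_data holds the six batting columns of df_train
batting_data <- df_train[, c("TEAM_BATTING_H",  "TEAM_BATTING_2B",
                             "TEAM_BATTING_3B", "TEAM_BATTING_HR",
                             "TEAM_BATTING_BB", "TEAM_BATTING_SO")]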
library(stargazer)

stargazer(batting_data, type = "text",
          title = "Team Batting Summary Statistics",
          covariate.labels = c("Singles", "Doubles", "Triples",
                               "Home Runs", "BB", "Strike Outs"))
Team Batting Summary Statistics
==============================================
Statistic N Mean St. Dev. Min Max
----------------------------------------------
Singles 2,276 1,469.270 144.591 891 2,554
Doubles 2,276 241.247 46.801 69 458
Triples 2,276 55.250 27.939 0 223
Home Runs 2,276 99.612 60.547 0 264
BB 2,276 501.559 122.671 0 878
Strike Outs 2,174 735.605 248.526 0 1,399
----------------------------------------------
Step 2: Preprocess Your Data
PCA is sensitive to the scale of your variables, so it’s a good practice to standardize (center and scale) your data if the variables have different units or scales. You can use the scale() function for this purpose.
# typeof(batting_data)
# class(batting_data)

# Standardize the data
scaled_batting_data <- scale(batting_data)

# typeof(scaled_batting_data)
# class(scaled_batting_data)

# Store as a data frame instead of a matrix
scaled_batting_data <- as.data.frame(scaled_batting_data)
Check that the scaling was applied correctly.
stargazer(scaled_batting_data, type = "text",
          title = "Team Batting Summary Statistics (Scaled)",
          covariate.labels = c("Singles", "Doubles", "Triples",
                               "Home Runs", "BB", "Strike Outs"))
Team Batting Summary Statistics (Scaled)
==============================================
Statistic N Mean St. Dev. Min Max
----------------------------------------------
Singles 2,276 -0.000 1.000 -3.999 7.502
Doubles 2,276 -0.000 1.000 -3.680 4.631
Triples 2,276 -0.000 1.000 -1.978 6.004
Home Runs 2,276 -0.000 1.000 -1.645 2.715
BB 2,276 -0.000 1.000 -4.089 3.069
Strike Outs 2,174 0.000 1.000 -2.960 2.669
----------------------------------------------
Step 3: Perform PCA
Use the prcomp() function to perform PCA on your scaled data. You need to provide the standardized data as input, and you can specify additional options if needed.
You cannot have any missing observations in the variables: either impute the missing values or drop the rows with missing values. I will go with the latter approach. (You could also drop the columns with missing values.)
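The chunks below refer to a batting_data_noMissing object that is not defined in this excerpt; a minimal sketch, assuming complete-case deletion with na.omit():

# Assumption: drop every row with at least one missing value
batting_data_noMissing        <- na.omit(batting_data)
scaled_batting_data_noMissing <- na.omit(scaled_batting_data)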
?prcomp

# Perform PCA on the standardized, complete-case data
# (prcomp() errors on missing values, hence the na.omit()ed data)
pca_result <- prcomp(scaled_batting_data_noMissing)

# Alternatively, you can center and scale the data within prcomp() directly
pca_result2 <- prcomp(batting_data_noMissing,
                      scale  = TRUE,   # FALSE by default
                      center = TRUE)

?princomp
pca_result$rotation: This matrix contains the loadings of each original variable on each principal component.
The loadings indicate the strength and direction of the relationship between the original variables and the principal components.
Loadings close to zero suggest that a variable has little influence on a particular principal component, while loadings with higher absolute values indicate stronger influence.
This matrix is crucial for understanding which variables contribute most to each principal component and how they are related.
pca_result$x: This matrix contains the scores for each observation on each principal component.
Each row corresponds to an observation, and each column corresponds to a principal component.
These scores represent how each observation is positioned in the reduced-dimensional space defined by the principal components.
PCA reduces the dimensionality of the data by projecting the original observations onto these new axes, and the scores tell you the coordinates of each observation in this new space.
pca_result$sdev: This vector contains the standard deviations of the principal components.
pca_result$center and pca_result$scale: These record the centering and scaling applied to the data.
Centering involves subtracting the mean of each variable from the data so that the variables have zero means.
Scaling involves dividing the centered data by the standard deviation of each variable to ensure that the variables have the same scale.
These centering and scaling transformations are applied to the data before PCA to ensure that variables with different units or scales do not dominate the results.
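To make these components concrete, here is a quick look at each of them (a sketch, assuming the pca_result and pca_result2 objects created above):

pca_result$rotation    # loadings: variables (rows) on components (columns)
head(pca_result$x)     # scores of the first few observations
pca_result$sdev        # standard deviations of the components
pca_result2$center     # variable means used for centering (pca_result2 centered internally)
pca_result2$scale      # variable standard deviations used for scaling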
Understanding and interpreting these components is essential for making sense of the results of a PCA analysis and for using PCA effectively in data exploration, dimensionality reduction, and feature selection tasks.
You can visualize the results, create biplots, and explore how much variance each principal component explains.
Visualization:
You can create visualizations to explore the results. For example, you can plot the variance explained by each principal component to determine how many components to retain:
Graph of Variables
# Plot the variance explained by each component
plot(pca_result)
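plot() on a prcomp object shows the variances of the components; to see the proportion of variance explained instead, it can be computed from pca_result$sdev (a small sketch):

# Proportion of variance explained by each principal component
pve <- pca_result$sdev^2 / sum(pca_result$sdev^2)
plot(pve, type = "b",
     xlab = "Principal Component",
     ylab = "Proportion of Variance Explained")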
Interpret the results, including the loadings of variables on principal components and the scores of observations on principal components. This can help you understand which variables contribute the most to each component and how observations are positioned in the reduced-dimensional space.
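For the biplots mentioned above, base R's biplot() works directly on a prcomp object (a sketch, again assuming pca_result from the earlier chunk):

# Observations (scores) and variables (loading arrows) on PC1 vs PC2
biplot(pca_result, cex = 0.6)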
Alternative Commands
PCA (princomp)
princomp() is another standard function in base R that performs Principal Component Analysis, but it works via an eigendecomposition of the covariance (or correlation) matrix rather than SVD.
?princomp

# Perform PCA using princomp (it also requires complete cases)
princomp_pca_result <- princomp(scaled_batting_data_noMissing)
summary(princomp_pca_result)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Standard deviation 1.6804565 1.2759216 0.8500474 0.57492701 0.54269471
Proportion of Variance 0.4773952 0.2752146 0.1221546 0.05587904 0.04978915
Cumulative Proportion 0.4773952 0.7526098 0.8747644 0.93064343 0.98043258
Comp.6
Standard deviation 0.34021626
Proportion of Variance 0.01956742
Cumulative Proportion 1.00000000
This should give a similar answer to the prcomp() function, which is often preferred for large datasets, as it can be computationally more efficient.
Online guides may also be helpful for seeing how people build clusters on top of PCA results.
Predict
Principal Component Analysis (PCA) is primarily a dimensionality reduction technique and does not perform predictions in the traditional sense, such as making predictions for new or unseen data points. Instead, PCA is used for data exploration, feature selection, and reducing the dimensionality of a dataset while retaining as much variance as possible.
However, you can use PCA in combination with statistical techniques for prediction tasks. Here’s a general outline of how you can incorporate PCA into a predictive modeling workflow:
1. Data Preparation: Preprocess your data, including data cleaning, missing value imputation, and feature scaling. Standardize or normalize your data if it's not already done.
2. PCA: Perform PCA on your dataset to reduce its dimensionality. Determine the number of principal components to retain based on the explained variance or other criteria.
3. Feature Selection: You can choose to use the retained principal components as features for your prediction model. These components capture the most important information in the original data.
4. Split Data: Split your data into training and testing sets for model development and evaluation.
5. Model Building: Build a predictive model using the retained principal components (or other selected features) and the target variable. You can use various modeling techniques, such as regression, classification, or clustering, depending on your prediction task. (A sketch of this workflow follows the list.)
6. Model Training: Train your predictive model on the training data.
7. Model Evaluation: Evaluate the model's performance using the testing dataset. Common evaluation metrics depend on the specific prediction task but may include accuracy, precision, recall, F1-score, or mean squared error, among others.
8. Prediction: Use the trained model to make predictions on new or unseen data.
9. Inverse PCA (Optional): If needed, you can perform an inverse PCA transformation to obtain predictions or representations in the original data space. This step may be necessary if you want predictions in the original feature space.
10. Assess Model Performance: Evaluate the performance of your prediction model on the original data space or in the space of the retained principal components, depending on your goals.
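To make the outline concrete, here is a minimal sketch of steps 2 through 8, assuming the six TEAM_BATTING_* columns as predictors, TARGET_WINS as the outcome, a two-component cutoff, and an ordinary linear model; none of these choices are prescribed by the outline above.

set.seed(1)

vars     <- c("TEAM_BATTING_H",  "TEAM_BATTING_2B", "TEAM_BATTING_3B",
              "TEAM_BATTING_HR", "TEAM_BATTING_BB", "TEAM_BATTING_SO")
complete <- na.omit(df_train[, c("TARGET_WINS", vars)])

# Step 2: PCA on the predictors
pca_fit <- prcomp(complete[, vars], center = TRUE, scale = TRUE)

# Steps 3-4: retain the first two components, then split into train/test
scores      <- as.data.frame(pca_fit$x[, 1:2])
scores$wins <- complete$TARGET_WINS
train_idx   <- sample(nrow(scores), floor(0.8 * nrow(scores)))

# Steps 5-6: fit a linear model on the training scores
fit <- lm(wins ~ PC1 + PC2, data = scores[train_idx, ])

# Step 7: evaluate out of sample with mean squared error
test <- scores[-train_idx, ]
mean((test$wins - predict(fit, newdata = test))^2)

# Step 8: new data must be projected onto the same components first
new_scores <- predict(pca_fit, newdata = complete[1:5, vars])
predict(fit, newdata = as.data.frame(new_scores))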
It’s important to note that while PCA can be a useful preprocessing step to reduce dimensionality and remove multicollinearity in your data, it may not always improve predictive model performance. The decision to use PCA should be based on the specific characteristics of your dataset and the goals of your prediction task.
In summary, PCA is a valuable tool in data preprocessing, but its primary role is not prediction. Instead, it can help improve the efficiency and interpretability of predictive models when used as part of a broader machine learning or statistical analysis pipeline.