You’ll need the FactoMineR package for PCA and the factoextra package for visualization, along with readxl to load the spreadsheet; ggfortify and scales are loaded for additional plotting support:
library(ggfortify)
## Loading required package: ggplot2
library(FactoMineR)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(readxl)
library(scales)
df <- read_excel("FinVizScreener_Tech.xlsx")
head(df)
## # A tibble: 6 × 16
## No. Ticker `Perf Week` `Perf Month` `Perf Quart` `Perf Half` `Perf Year`
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 1 AAOI 0.156 -0.279 -0.2881 0.5768 5.1941
## 2 2 ACMR 0.138 0.498 0.7326 0.9871 1.7659
## 3 3 AEHR 0.0309 -0.0894 -0.4626 -0.6555 -0.5988
## 4 4 ASTS -0.121 -0.107 -0.4662 -0.2756 -0.533
## 5 5 BB -0.0632 0.0075 -0.3488 -0.4785 -0.2725
## 6 6 BBAI -0.180 -0.0291 0.1976 0.3514 0.087
## # ℹ 9 more variables: `Perf YTD` <dbl>, `Volatility W` <dbl>,
## # `Volatility M` <dbl>, Recom <chr>, `Avg Volume` <chr>, `Rel Volume` <dbl>,
## # Price <dbl>, Change <dbl>, Volume <dbl>
Before performing PCA, ensure the data frame only contains the numeric columns you want to analyze. You will need to exclude non-numeric columns such as ‘No.’ and ‘Ticker’ and convert any character columns containing numeric data into actual numeric columns:
# Assuming `df` is your data frame
df <- read_excel("FinVizScreener_Tech.xlsx")
df_numeric <- df[, -which(names(df) %in% c("No.", "Ticker", "Perf Half", "Perf Year", "Recom", "Avg Volume"))]
df_numeric <- data.frame(lapply(df_numeric, as.numeric))
## Warning in lapply(df_numeric, as.numeric): NAs introduced by coercion
# Convert the columns to character so stray non-numeric symbols can be stripped
df_numeric <- data.frame(lapply(df_numeric, function(x) as.character(x)))
# Replace any non-numeric characters with NA
df_numeric <- data.frame(lapply(df_numeric, function(x) {
x <- gsub("[^0-9.-]", "", x)
ifelse(x == "", NA, as.numeric(x))
}))
## Warning in ifelse(x == "", NA, as.numeric(x)): NAs introduced by coercion
# Handle NAs as needed; here we zero-fill first and then drop any rows that still contain NA
df_numeric[is.na(df_numeric)] <- 0 # Replace NA with 0
df_numeric <- na.omit(df_numeric) # Remove any remaining rows with NA
summary(df_numeric)
## Perf.Week Perf.Month Perf.Quart Perf.YTD
## Min. :-0.376400 Min. :-0.431100 Min. :-0.59140 Min. :-0.62030
## 1st Qu.:-0.031700 1st Qu.:-0.129700 1st Qu.:-0.29210 1st Qu.:-0.31020
## Median : 0.009300 Median :-0.055200 Median :-0.08640 Median :-0.14580
## Mean : 0.005502 Mean : 0.003578 Mean :-0.00819 Mean :-0.06197
## 3rd Qu.: 0.062700 3rd Qu.: 0.081600 3rd Qu.: 0.10930 3rd Qu.: 0.01150
## Max. : 0.295300 Max. : 1.027000 Max. : 1.93780 Max. : 1.89620
## Volatility.W Volatility.M Rel.Volume Price
## Min. :0.0022 Min. :0.00520 Min. :0.1000 Min. : 1.540
## 1st Qu.:0.0425 1st Qu.:0.04210 1st Qu.:0.5900 1st Qu.: 2.760
## Median :0.0558 Median :0.06230 Median :0.7900 Median : 5.420
## Mean :0.0670 Mean :0.07348 Mean :0.9678 Mean : 8.619
## 3rd Qu.:0.0859 3rd Qu.:0.09000 3rd Qu.:1.2500 3rd Qu.:11.650
## Max. :0.1646 Max. :0.24120 Max. :2.9100 Max. :34.810
## Change Volume
## Min. :-0.1168 Min. : 163530
## 1st Qu.:-0.0499 1st Qu.: 996694
## Median :-0.0231 Median : 1756819
## Mean :-0.0281 Mean : 4376523
## 3rd Qu.:-0.0043 3rd Qu.: 3231865
## Max. : 0.0858 Max. :85057485
We want to make sure no missing values remain before continuing.
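A quick check such as the following confirms that no missing values are left (a supplementary snippet, not part of the original pipeline):
# Confirm there are no remaining missing values in the cleaned frame
sum(is.na(df_numeric)) # should be 0
colSums(is.na(df_numeric)) # per-column count of missing values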
Principal Component Analysis (PCA) is a statistical procedure that transforms a dataset characterized by possibly correlated variables into a set of linearly uncorrelated variables known as principal components. The transformation is defined in such a way that the first principal component has the highest possible variance, and each succeeding component, in turn, has the highest variance possible under the constraint that it is orthogonal to the preceding components.
Mathematically, for a data matrix \(\mathbf{X}\) of zero empirical mean, where each column represents a variable, and each row represents an observation, PCA aims to project the data onto a new subspace with reduced dimensionality. This is achieved by the eigendecomposition of the covariance matrix \(\mathbf{C}\):
\[ \mathbf{C} = \frac{1}{n-1} (\mathbf{X}^\top \mathbf{X}) \]
Here, \(\mathbf{X}^\top\) denotes the transpose of \(\mathbf{X}\), and \(n\) is the number of observations.
The eigendecomposition is performed as follows:
\[ \mathbf{C} = \mathbf{V} \mathbf{D} \mathbf{V}^\top \]
Where \(\mathbf{V}\) is the matrix whose columns are the eigenvectors of \(\mathbf{C}\), and \(\mathbf{D}\) is the diagonal matrix of eigenvalues. The eigenvectors correspond to the principal directions, while the eigenvalues indicate the variance explained by each principal direction.
The principal components themselves are the result of this transformation applied to \(\mathbf{X}\), where the dataset is projected onto the new set of orthogonal axes defined by the eigenvectors. These components are ranked such that \(\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p\), where \(p\) is the number of variables and \(\lambda_i\) denotes the eigenvalue associated with the \(i\)-th principal component, reflecting the amount of variance it explains in the dataset.
In the context of PCA, the first principal component \(\mathbf{u}_1\) is aligned with the direction of maximum variance in \(\mathbf{X}\), and each subsequent principal component \(\mathbf{u}_i\) is orthogonal to the previous components while also aligning with the next highest variance direction.
PCA provides a powerful method for data analysis and pattern recognition, often utilized in exploratory data analysis, noise reduction, feature extraction, and data compression.
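To make the formulas above concrete, here is a minimal supplementary sketch that reproduces what prcomp() reports by eigendecomposing the correlation matrix of the df_numeric frame prepared earlier (illustrative only, not part of the main pipeline):
# Supplementary sketch: PCA via eigendecomposition of the covariance/correlation matrix
X <- scale(df_numeric) # center and scale, so cov(X) is the correlation matrix
C <- cov(X) # C = (1/(n-1)) t(X) %*% X
eig <- eigen(C) # eig$vectors = V (principal directions), eig$values = diag(D)
sqrt(eig$values) # matches prcomp(df_numeric, center = TRUE, scale. = TRUE)$sdev
head(X %*% eig$vectors) # projections onto the new axes: the principal component scores (up to sign)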
# Check the current names in the data frame
print(names(df))
## [1] "No." "Ticker" "Perf Week" "Perf Month" "Perf Quart"
## [6] "Perf Half" "Perf Year" "Perf YTD" "Volatility W" "Volatility M"
## [11] "Recom" "Avg Volume" "Rel Volume" "Price" "Change"
## [16] "Volume"
# Convert character columns to numeric, handling non-numeric characters
df$`Perf Quart` <- suppressWarnings(as.numeric(as.character(df$`Perf Quart`)))
df$`Perf Half` <- suppressWarnings(as.numeric(as.character(df$`Perf Half`)))
df$`Perf Year` <- suppressWarnings(as.numeric(as.character(df$`Perf Year`)))
tail(df)
## # A tibble: 6 × 16
## No. Ticker `Perf Week` `Perf Month` `Perf Quart` `Perf Half` `Perf Year`
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 44 SPWR -0.0283 -0.130 -0.378 -0.585 -0.785
## 2 45 STEM 0.0757 -0.266 -0.464 -0.550 -0.692
## 3 46 VMEO 0.0093 0.358 0.344 0.530 0.560
## 4 47 VYX 0.06 -0.106 -0.200 -0.213 -0.0204
## 5 48 YOU 0.0627 0.133 -0.0434 0.124 -0.0686
## 6 49 ZUO -0.0079 0.0771 -0.0146 0.0615 -0.0548
## # ℹ 9 more variables: `Perf YTD` <dbl>, `Volatility W` <dbl>,
## # `Volatility M` <dbl>, Recom <chr>, `Avg Volume` <chr>, `Rel Volume` <dbl>,
## # Price <dbl>, Change <dbl>, Volume <dbl>
# Note: This might introduce NAs for any non-numeric entries that cannot be converted
# this is a small tweak so that the rows line up with the PCA scores later on
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Step 2: Pre-process data (example, adjust according to your dataset specifics)
df <- na.omit(df) # Remove rows with NA values
# Step 3: Select relevant columns for PCA
pca_data <- df[, c("Perf Quart", "Price", "Rel Volume")]
# Ensure the data is numeric
pca_data <- sapply(pca_data, as.numeric)
# Step 4: Perform PCA
pca_result <- prcomp(pca_data, center = TRUE, scale. = TRUE)
# Summarize PCA results
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3
## Standard deviation 1.1624 0.9503 0.8636
## Proportion of Variance 0.4504 0.3010 0.2486
## Cumulative Proportion 0.4504 0.7514 1.0000
# Visualize the results
biplot(pca_result)
For this project, we focus on the financial dimensions that carry the most information: “Perf Quart”, “Price”, and “Rel Volume”.
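Since factoextra is already loaded, a scree plot is a quick supplementary check of how much variance each retained component carries (not part of the original pipeline):
# Supplementary: scree plot of the variance explained by each principal component
fviz_eig(pca_result, addlabels = TRUE)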
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# Extract the scores (principal components)
scores <- pca_result$x
# Ensure your labels are correctly matched with the PCA scores
# Assuming 'df$Ticker' exists and corresponds to the order of PCA scores
# Ensure the length of 'labels' is the same as the number of rows in 'scores'
labels <- df$Ticker[1:nrow(scores)]
# Now 'labels' has the same length as 'scores', so the sizes are compatible
# Create a 3D plot using plotly with hover labels
plotly_3d <- plot_ly(x = ~scores[, "PC1"], y = ~scores[, "PC2"], z = ~scores[, "PC3"],
type = "scatter3d", mode = "markers",
text = ~labels, # Add the ticker labels here
hoverinfo = 'text', # Show text on hover
marker = list(size = 5))
# Add custom axis labels
plotly_3d <- plotly_3d %>% layout(scene = list(
xaxis = list(title = 'PC1 (Perf Quart)'),
yaxis = list(title = 'PC2 (Price)'),
zaxis = list(title = 'PC3 (Relative Volume)')
))
# Render the plot
plotly_3d
Standard deviation: The square roots of the eigenvalues of the covariance/correlation matrix, which represent the amount of variance captured by each principal component.
Proportion of Variance: Indicates the fraction of the total variance explained by each principal component.
This tells us that the first principal component (PC1) captures 45.04% of the total variability in the data, the second (PC2) captures an additional 30.10%, and the third (PC3) captures the remaining 24.86%, totaling 100% of the variability.
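These proportions can be recomputed directly from the standard deviations stored in the prcomp object (a supplementary snippet):
# Recompute the proportion of variance explained from the component standard deviations
var_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
round(var_explained, 4) # approximately 0.4504, 0.3010, 0.2486, matching summary(pca_result)
cumsum(var_explained) # cumulative proportion, reaching 1 at PC3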
Vectors (Loadings): The arrows labeled “Perf Quart”, “Price”, and “Rel Volume” represent the loadings of each variable. The direction and length of these vectors indicate how each variable influences the principal components:
The direction shows which variables are correlated with one another or point in opposite directions. The length of a vector indicates the strength of that variable’s influence on the principal components: the longer the vector, the stronger its influence on a point’s placement in the principal component space.
Variable Contributions: “Perf Quart” seems to have a strong positive correlation with PC1, and it doesn’t contribute much to PC2 since its vector is almost horizontal. “Price” has a positive correlation with both PC1 and PC2 but more so with PC2. “Rel Volume” seems to have a slight negative correlation with PC1 and a slight positive correlation with PC2, given its direction pointing towards the negative side of the PC1 axis and the positive side of the PC2 axis.
Inter-Variable Relationships: You can infer the relationships between the variables based on the angles between their vectors:
“Perf Quart” and “Price” are somewhat positively correlated with each other since the angle between their vectors is less than 90 degrees. “Rel Volume” seems to be less correlated or even negatively correlated with “Perf Quart”, as the angle between their vectors is slightly more than 90 degrees.
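The loadings behind these arrows, and the correlations that the biplot angles approximate, can be inspected directly (a supplementary snippet):
# Loadings: the weight of each original variable on each principal component
pca_result$rotation
# Correlation matrix of the three inputs, which the angles between the arrows approximate
cor(pca_data)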
# Filter for small-cap stocks, assuming you've defined small-cap as stocks with a price < $25
small_caps <- df[df$Price < 25, ]
# Check and handle missing or infinite values
# Removing rows with NA or infinite values in the specific columns
small_caps <- small_caps[!is.na(small_caps$`Perf Quart`) & !is.infinite(small_caps$`Perf Quart`), ]
small_caps <- small_caps[!is.na(small_caps$Price) & !is.infinite(small_caps$Price), ]
small_caps <- small_caps[!is.na(small_caps$`Rel Volume`) & !is.infinite(small_caps$`Rel Volume`), ]
# Calculate PCA scores for small-cap stocks
small_caps_pca_scores <- prcomp(small_caps[, c("Perf Quart", "Price", "Rel Volume")], center = TRUE, scale. = TRUE)
# Extract the PCA scores as a data frame
scores <- as.data.frame(small_caps_pca_scores$x)
# Add the scores back to the small_caps data frame (bind once to avoid duplicate columns)
small_caps <- cbind(small_caps, scores)
# Define "best" based on PCA scores and other criteria
# Example: treat stocks with a high PC1 + PC2 sum as "best" (an assumption; adjust to your portfolio strategy)
small_caps$best_score <- rowSums(small_caps[, c("PC1", "PC2")])
# Get the top 2 small-cap stocks by best score (a risky, illustrative cutoff; investors can change the criterion or the number selected to match their own strategy or optimization)
top_small_caps <- head(small_caps[order(-small_caps$best_score), ], 2)
top_small_caps
## No. Ticker Perf Week Perf Month Perf Quart Perf Half Perf Year Perf YTD
## 38 43 SOUN -0.3109 0.5466 1.9378 2.2147 2.1979 1.8962
## 24 28 MTC 0.0227 1.0270 1.4194 3.6239 -0.0546 1.2500
## Volatility W Volatility M Recom Avg Volume Rel Volume Price Change Volume
## 38 0.1646 0.1929 2.0 56.39M 1.51 6.14 -0.0808 85057485
## 24 0.1368 0.2412 - 1.64M 0.10 2.25 -0.0217 163530
##         PC1        PC2       PC3 best_score
## 38 3.006265  0.5050745 -3.285945   3.511339
## 24 3.440785 -0.9044591 -1.001043   2.536326
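The equal-weight PC1 + PC2 score used above is only one possible criterion. As a supplementary sketch of an alternative (assuming the small_caps frame and small_caps_pca_scores object built above; the weighted_score column is introduced here purely for illustration), each component can be weighted by the proportion of variance it explains:
# Illustrative alternative: weight each component score by its share of explained variance
w <- small_caps_pca_scores$sdev^2 / sum(small_caps_pca_scores$sdev^2)
small_caps$weighted_score <- as.vector(as.matrix(small_caps[, c("PC1", "PC2", "PC3")]) %*% w)
head(small_caps[order(-small_caps$weighted_score), c("Ticker", "weighted_score")], 2)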