Libraries used

You’ll need the FactoMineR package for PCA and the factoextra package for visualization, along with readxl for importing the Excel data; ggfortify and scales are loaded as plotting helpers:

library(ggfortify)
## Loading required package: ggplot2
library(FactoMineR)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(readxl)
library(scales)

Importing Data

df <- read_excel("FinVizScreener_Tech.xlsx")
head(df)
## # A tibble: 6 × 16
##     No. Ticker `Perf Week` `Perf Month` `Perf Quart` `Perf Half` `Perf Year`
##   <dbl> <chr>        <dbl>        <dbl> <chr>        <chr>       <chr>      
## 1     1 AAOI        0.156       -0.279  -0.2881      0.5768      5.1941     
## 2     2 ACMR        0.138        0.498  0.7326       0.9871      1.7659     
## 3     3 AEHR        0.0309      -0.0894 -0.4626      -0.6555     -0.5988    
## 4     4 ASTS       -0.121       -0.107  -0.4662      -0.2756     -0.533     
## 5     5 BB         -0.0632       0.0075 -0.3488      -0.4785     -0.2725    
## 6     6 BBAI       -0.180       -0.0291 0.1976       0.3514      0.087      
## # ℹ 9 more variables: `Perf YTD` <dbl>, `Volatility W` <dbl>,
## #   `Volatility M` <dbl>, Recom <chr>, `Avg Volume` <chr>, `Rel Volume` <dbl>,
## #   Price <dbl>, Change <dbl>, Volume <dbl>

Before performing PCA, ensure the data frame only contains the numeric columns you want to analyze. You will need to exclude non-numeric columns such as ‘No.’ and ‘Ticker’ and convert any character columns containing numeric data into actual numeric columns:

# Re-read the raw data so the cleaning steps below start from a fresh copy of `df`
df <- read_excel("FinVizScreener_Tech.xlsx")
df_numeric <- df[, -which(names(df) %in% c("No.", "Ticker", "Perf Half", "Perf Year", "Recom", "Avg Volume"))]
df_numeric <- data.frame(lapply(df_numeric, as.numeric))
## Warning in lapply(df_numeric, as.numeric): NAs introduced by coercion
# Convert all columns back to character so stray non-numeric characters can be stripped
df_numeric <- data.frame(lapply(df_numeric, as.character))

# Replace any non-numeric characters with NA
df_numeric <- data.frame(lapply(df_numeric, function(x) {
  x <- gsub("[^0-9.-]", "", x)
  ifelse(x == "", NA, as.numeric(x))
}))
## Warning in ifelse(x == "", NA, as.numeric(x)): NAs introduced by coercion
# Handle any remaining NAs: replace them with 0, then drop any rows that are still incomplete
df_numeric[is.na(df_numeric)] <- 0 # Replace NA with 0
df_numeric <- na.omit(df_numeric)  # Remove rows with NA (a no-op here, since NAs were just replaced)

summary(df_numeric)
##    Perf.Week           Perf.Month          Perf.Quart          Perf.YTD       
##  Min.   :-0.376400   Min.   :-0.431100   Min.   :-0.59140   Min.   :-0.62030  
##  1st Qu.:-0.031700   1st Qu.:-0.129700   1st Qu.:-0.29210   1st Qu.:-0.31020  
##  Median : 0.009300   Median :-0.055200   Median :-0.08640   Median :-0.14580  
##  Mean   : 0.005502   Mean   : 0.003578   Mean   :-0.00819   Mean   :-0.06197  
##  3rd Qu.: 0.062700   3rd Qu.: 0.081600   3rd Qu.: 0.10930   3rd Qu.: 0.01150  
##  Max.   : 0.295300   Max.   : 1.027000   Max.   : 1.93780   Max.   : 1.89620  
##   Volatility.W     Volatility.M       Rel.Volume         Price       
##  Min.   :0.0022   Min.   :0.00520   Min.   :0.1000   Min.   : 1.540  
##  1st Qu.:0.0425   1st Qu.:0.04210   1st Qu.:0.5900   1st Qu.: 2.760  
##  Median :0.0558   Median :0.06230   Median :0.7900   Median : 5.420  
##  Mean   :0.0670   Mean   :0.07348   Mean   :0.9678   Mean   : 8.619  
##  3rd Qu.:0.0859   3rd Qu.:0.09000   3rd Qu.:1.2500   3rd Qu.:11.650  
##  Max.   :0.1646   Max.   :0.24120   Max.   :2.9100   Max.   :34.810  
##      Change            Volume        
##  Min.   :-0.1168   Min.   :  163530  
##  1st Qu.:-0.0499   1st Qu.:  996694  
##  Median :-0.0231   Median : 1756819  
##  Mean   :-0.0281   Mean   : 4376523  
##  3rd Qu.:-0.0043   3rd Qu.: 3231865  
##  Max.   : 0.0858   Max.   :85057485
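
As an aside, the same cleaning can be expressed more concisely with dplyr. The sketch below is an illustration only (it was not run as part of the original analysis) and assumes the same column choices as above; `where(is.character)` limits the text cleanup to the character columns:

library(dplyr)

df_numeric_alt <- df %>%
  # Drop identifier and non-analyzed columns
  select(-any_of(c("No.", "Ticker", "Perf Half", "Perf Year", "Recom", "Avg Volume"))) %>%
  # Strip non-numeric characters from character columns and convert them to numeric
  mutate(across(where(is.character),
                ~ suppressWarnings(as.numeric(gsub("[^0-9.-]", "", .x))))) %>%
  # Replace any remaining NAs with 0, mirroring the base-R version above
  mutate(across(everything(), ~ coalesce(.x, 0)))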

Before continuing, we want to make sure the data frame contains no missing values.
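
A quick sanity check (a short sketch, assuming `df_numeric` is the cleaned frame from above) confirms this:

# Confirm there are no missing values left before running PCA
anyNA(df_numeric)           # should return FALSE
colSums(is.na(df_numeric))  # per-column NA counts, all expected to be 0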

PCA

Principal Component Analysis (PCA) is a statistical procedure that transforms a dataset characterized by possibly correlated variables into a set of linearly uncorrelated variables known as principal components. The transformation is defined in such a way that the first principal component has the highest possible variance, and each succeeding component, in turn, has the highest variance possible under the constraint that it is orthogonal to the preceding components.

Mathematically, for a data matrix \(\mathbf{X}\) of zero empirical mean, where each column represents a variable, and each row represents an observation, PCA aims to project the data onto a new subspace with reduced dimensionality. This is achieved by the eigendecomposition of the covariance matrix \(\mathbf{C}\):

\[ \mathbf{C} = \frac{1}{n-1} (\mathbf{X}^\top \mathbf{X}) \]

Here, \(\mathbf{X}^\top\) denotes the transpose of \(\mathbf{X}\), and \(n\) is the number of observations.

The eigendecomposition is performed as follows:

\[ \mathbf{C} = \mathbf{V} \mathbf{D} \mathbf{V}^\top \]

Where \(\mathbf{V}\) is the matrix whose columns are the eigenvectors of \(\mathbf{C}\), and \(\mathbf{D}\) is the diagonal matrix of eigenvalues. The eigenvectors correspond to the principal directions, while the eigenvalues indicate the variance explained by each principal direction.

The principal components themselves are the result of this transformation applied to \(\mathbf{X}\): the dataset is projected onto the new set of orthogonal axes defined by the eigenvectors. These components are ranked such that \(\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p\), where \(p\) is the number of variables and \(\lambda_i\) denotes the eigenvalue associated with the \(i\)-th principal component, reflecting the amount of variance it explains in the dataset.

In the context of PCA, the first principal component \(\mathbf{u}_1\) is aligned with the direction of maximum variance in \(\mathbf{X}\), and each subsequent principal component \(\mathbf{u}_i\) is orthogonal to the previous components while also aligning with the next highest variance direction.

PCA provides a powerful method for data analysis and pattern recognition, and it is often utilized in exploratory data analysis, noise reduction, feature extraction, and data compression.
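
To connect these formulas to code, the following minimal R sketch (added for illustration, assuming the cleaned `df_numeric` from earlier) performs PCA by eigendecomposition of the covariance matrix of the standardized data and cross-checks the component standard deviations against `prcomp`:

# PCA "by hand" via eigendecomposition
X <- scale(df_numeric)           # center and scale each column
C <- cov(X)                      # C = (1/(n-1)) t(X) %*% X, here the correlation matrix
eig <- eigen(C)                  # V = eig$vectors, D = diag(eig$values), sorted decreasingly
eig$values                       # variance explained by each principal direction
pc_scores <- X %*% eig$vectors   # project the data onto the principal directions

# The component standard deviations should equal the square roots of the eigenvalues
all.equal(sqrt(eig$values),
          prcomp(df_numeric, center = TRUE, scale. = TRUE)$sdev)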

Check Column Names in df

# Check the current names in the data frame
print(names(df))
##  [1] "No."          "Ticker"       "Perf Week"    "Perf Month"   "Perf Quart"  
##  [6] "Perf Half"    "Perf Year"    "Perf YTD"     "Volatility W" "Volatility M"
## [11] "Recom"        "Avg Volume"   "Rel Volume"   "Price"        "Change"      
## [16] "Volume"

Double-check that columns are numeric

# Convert character columns to numeric, handling non-numeric characters
df$`Perf Quart` <- suppressWarnings(as.numeric(as.character(df$`Perf Quart`)))
df$`Perf Half` <- suppressWarnings(as.numeric(as.character(df$`Perf Half`)))
df$`Perf Year` <- suppressWarnings(as.numeric(as.character(df$`Perf Year`)))

tail(df)
## # A tibble: 6 × 16
##     No. Ticker `Perf Week` `Perf Month` `Perf Quart` `Perf Half` `Perf Year`
##   <dbl> <chr>        <dbl>        <dbl>        <dbl>       <dbl>       <dbl>
## 1    44 SPWR       -0.0283      -0.130       -0.378      -0.585      -0.785 
## 2    45 STEM        0.0757      -0.266       -0.464      -0.550      -0.692 
## 3    46 VMEO        0.0093       0.358        0.344       0.530       0.560 
## 4    47 VYX         0.06        -0.106       -0.200      -0.213      -0.0204
## 5    48 YOU         0.0627       0.133       -0.0434      0.124      -0.0686
## 6    49 ZUO        -0.0079       0.0771      -0.0146      0.0615     -0.0548
## # ℹ 9 more variables: `Perf YTD` <dbl>, `Volatility W` <dbl>,
## #   `Volatility M` <dbl>, Recom <chr>, `Avg Volume` <chr>, `Rel Volume` <dbl>,
## #   Price <dbl>, Change <dbl>, Volume <dbl>
# Note: This might introduce NAs for any non-numeric entries that cannot be converted
# dplyr is loaded here so its data-manipulation verbs (and the %>% pipe) are available for the later steps
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Step 2: Pre-process data (example, adjust according to your dataset specifics)
df <- na.omit(df)  # Remove rows with NA values

# Step 3: Select relevant columns for PCA
pca_data <- df[, c("Perf Quart", "Price", "Rel Volume")]

# Ensure the data is numeric
pca_data <- sapply(pca_data, as.numeric)

# Step 4: Perform PCA
pca_result <- prcomp(pca_data, center = TRUE, scale. = TRUE)

# Summarize PCA results
summary(pca_result)
## Importance of components:
##                           PC1    PC2    PC3
## Standard deviation     1.1624 0.9503 0.8636
## Proportion of Variance 0.4504 0.3010 0.2486
## Cumulative Proportion  0.4504 0.7514 1.0000
# Visualize the results
biplot(pca_result)
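
As an optional alternative (a sketch using factoextra, which is already loaded), the same results can be visualized with its ggplot2-based helpers:

# Scree plot of the variance explained by each component
fviz_eig(pca_result, addlabels = TRUE)

# Biplot of observations and variable loadings, with label repelling for readability
fviz_pca_biplot(pca_result, repel = TRUE)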

PCA - 3D Visual

For this project, we focus on the financial dimensions that carry the most useful information: “Perf Quart”, “Price”, and “Rel Volume”.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# Extract the scores (principal components)

scores <- pca_result$x

# Match the ticker labels to the PCA scores
# (assumes df$Ticker is in the same row order as 'scores' and has at least as many entries)
labels <- df$Ticker[1:nrow(scores)]

# Now 'labels' has the same length as 'scores', so the sizes are compatible
# Create a 3D plot using plotly with hover labels
plotly_3d <- plot_ly(x = ~scores[, "PC1"], y = ~scores[, "PC2"], z = ~scores[, "PC3"],
                     type = "scatter3d", mode = "markers",
                     text = ~labels,  # Add the ticker labels here
                     hoverinfo = 'text',  # Show text on hover
                     marker = list(size = 5))

# Add custom axis labels (the variable in parentheses indicates the one most strongly associated with that component, not a one-to-one mapping)
plotly_3d <- plotly_3d %>% layout(scene = list(
                                   xaxis = list(title = 'PC1 (Perf Quart)'),
                                   yaxis = list(title = 'PC2 (Price)'),
                                   zaxis = list(title = 'PC3 (Relative Volume)')
                                   ))

# Render the plot
plotly_3d

Standard deviation: The square roots of the eigenvalues of the covariance/correlation matrix; their squares give the amount of variance captured by each principal component.

Proportion of Variance: Indicates the fraction of the total variance explained by each principal component.
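
Both quantities can be recovered directly from the fitted object; the short sketch below assumes `pca_result` from the previous step:

# Standard deviations of the principal components
pca_result$sdev

# Proportion of variance explained by each component
pca_result$sdev^2 / sum(pca_result$sdev^2)

# Cumulative proportion of variance explained
cumsum(pca_result$sdev^2) / sum(pca_result$sdev^2)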

Quick Summary

This tells us that the first principal component (PC1) captures 45.04% of the total variability in the data, the second (PC2) captures an additional 30.10%, and the third (PC3) captures the remaining 24.86%, totaling 100% of the variability.

Vectors (Loadings): The arrows labeled “Perf Quart”, “Price”, and “Rel Volume” represent the loadings of the variables. The direction and length of these vectors indicate how each variable influences the principal components:

The direction indicates which variables are correlated with each other and which move in opposition. The length of a vector indicates the strength of that variable’s influence on the principal components: the longer the vector, the more strongly it drives a point’s placement in principal component space.

Variable Contributions: “Perf Quart” seems to have a strong positive correlation with PC1, and it doesn’t contribute much to PC2 since its vector is almost horizontal. “Price” has a positive correlation with both PC1 and PC2 but more so with PC2. “Rel Volume” seems to have a slight negative correlation with PC1 and a slight positive correlation with PC2, given its direction pointing towards the negative side of the PC1 axis and the positive side of the PC2 axis.

Inter-Variable Relationships: You can infer the relationships between the variables based on the angles between their vectors:

“Perf Quart” and “Price” are somewhat positively correlated with each other since the angle between their vectors is less than 90 degrees. “Rel Volume” seems to be less correlated or even negatively correlated with “Perf Quart”, as the angle between their vectors is slightly more than 90 degrees.
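
These visual readings can be checked numerically (a sketch, assuming `pca_result` and `pca_data` from above; note that loading signs may flip between runs, since principal directions are only defined up to sign):

# Variable loadings: how much each variable contributes to each principal component
pca_result$rotation

# Pairwise correlations between the original variables, for comparison with the biplot angles
cor(pca_data)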

# Filter for 'small-cap' stocks; here a share price below $25 is used as a rough proxy (market capitalization is the usual definition)
small_caps <- df[df$Price < 25, ]

# Check and handle missing or infinite values
# Removing rows with NA or infinite values in the specific columns
small_caps <- small_caps[!is.na(small_caps$`Perf Quart`) & !is.infinite(small_caps$`Perf Quart`), ]
small_caps <- small_caps[!is.na(small_caps$Price) & !is.infinite(small_caps$Price), ]
small_caps <- small_caps[!is.na(small_caps$`Rel Volume`) & !is.infinite(small_caps$`Rel Volume`), ]

# Calculate PCA scores for small-cap stocks
small_caps_pca_scores <- prcomp(small_caps[, c("Perf Quart", "Price", "Rel Volume")], center = TRUE, scale. = TRUE)

# Extract the PCA scores as a data frame and attach them to the small_caps data
scores <- as.data.frame(small_caps_pca_scores$x)
small_caps <- cbind(small_caps, scores)

# Define "best" based on PCA scores
# Example: treat a high PC1 + PC2 sum as "best" (an assumption -- adjust to your portfolio strategy)
small_caps$best_score <- rowSums(small_caps[, c("PC1", "PC2")])

# Get the top 2 small-cap stocks by best score (a risky pick; the number of picks and the scoring rule can be changed to match the investor's strategy and optimization criteria)
top_small_caps <- head(small_caps[order(-small_caps$best_score), ], 2)
top_small_caps
##    No. Ticker Perf Week Perf Month Perf Quart Perf Half Perf Year Perf YTD
## 38  43   SOUN   -0.3109     0.5466     1.9378    2.2147    2.1979   1.8962
## 24  28    MTC    0.0227     1.0270     1.4194    3.6239   -0.0546   1.2500
##    Volatility W Volatility M Recom Avg Volume Rel Volume Price  Change   Volume
## 38       0.1646       0.1929   2.0     56.39M       1.51  6.14 -0.0808 85057485
## 24       0.1368       0.2412     -      1.64M       0.10  2.25 -0.0217   163530
##         PC1        PC2       PC3 best_score
## 38 3.006265  0.5050745 -3.285945   3.511339
## 24 3.440785 -0.9044591 -1.001043   2.536326
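
To see where the two top-scoring tickers sit relative to the rest of the small-cap universe, a short ggplot2 sketch (an illustration only, not part of the original output) plots PC1 against PC2 and labels the selected stocks:

library(ggplot2)

# Highlight the top-scoring small caps in PC1/PC2 space
ggplot(small_caps, aes(x = PC1, y = PC2)) +
  geom_point(color = "grey60") +
  geom_point(data = top_small_caps, color = "red", size = 3) +
  geom_text(data = top_small_caps, aes(label = Ticker), vjust = -1) +
  labs(title = "Small-cap stocks in PCA space (top picks highlighted)",
       x = "PC1", y = "PC2")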