In the modern era of data-driven decision-making, datasets often contain numerous variables, many of which may be redundant or contribute minimally to the analysis. High-dimensional data can pose significant challenges, such as increased computational cost, difficulties in visualization, and risks of overfitting in predictive models. To address these issues, dimension reduction techniques are employed to condense the dataset while preserving its essential information.
Data Source: This analysis uses the Housing Prices Dataset, sourced from Kaggle. The dataset provides information on housing features and prices, making it an ideal candidate for regression and dimension reduction analysis.
Principal Component Analysis (PCA) is one of the most widely used linear techniques for dimension reduction. PCA transforms the original variables into orthogonal components that capture the maximum variance in the data. This ensures that the reduced dataset is easier to analyze and interpret while retaining its underlying structure.
This project explores the application of PCA to a housing dataset. We utilize statistical tests such as Bartlett’s test and the Kaiser-Meyer-Olkin (KMO) test to evaluate the dataset’s suitability for PCA. Through visualization and interpretation, we aim to highlight the contributions of variables and the insights gained from dimension reduction.
Understanding the structure and characteristics of the dataset is a critical first step in any analysis. Here, we load the housing dataset and inspect its structure and contents.
# Load required libraries (dplyr/tidyr for data cleaning, ggplot2 for plots, psych for Bartlett's test)
library(dplyr); library(tidyr); library(ggplot2); library(psych)
# Load the Housing dataset
housing_data <- read.csv("/Users/mjay/Desktop/Data science and Business Analytics/unsupervised learning project /Housing.csv", sep=",")
# Display structure and first few rows
str(housing_data)
## 'data.frame': 545 obs. of 13 variables:
## $ price : int 13300000 12250000 12250000 12215000 11410000 10850000 10150000 10150000 9870000 9800000 ...
## $ area : int 7420 8960 9960 7500 7420 7500 8580 16200 8100 5750 ...
## $ bedrooms : int 4 4 3 4 4 3 4 5 4 3 ...
## $ bathrooms : int 2 4 2 2 1 3 3 3 1 2 ...
## $ stories : int 3 4 2 2 2 1 4 2 2 4 ...
## $ mainroad : chr "yes" "yes" "yes" "yes" ...
## $ guestroom : chr "no" "no" "no" "no" ...
## $ basement : chr "no" "no" "yes" "yes" ...
## $ hotwaterheating : chr "no" "no" "no" "no" ...
## $ airconditioning : chr "yes" "yes" "no" "yes" ...
## $ parking : int 2 3 2 3 2 2 2 0 2 1 ...
## $ prefarea : chr "yes" "no" "yes" "yes" ...
## $ furnishingstatus: chr "furnished" "furnished" "semi-furnished" "furnished" ...
head(housing_data)
## price area bedrooms bathrooms stories mainroad guestroom basement
## 1 13300000 7420 4 2 3 yes no no
## 2 12250000 8960 4 4 4 yes no no
## 3 12250000 9960 3 2 2 yes no yes
## 4 12215000 7500 4 2 2 yes no yes
## 5 11410000 7420 4 1 2 yes yes yes
## 6 10850000 7500 3 3 1 yes no yes
## hotwaterheating airconditioning parking prefarea furnishingstatus
## 1 no yes 2 yes furnished
## 2 no yes 3 no furnished
## 3 no no 2 yes semi-furnished
## 4 no yes 3 yes furnished
## 5 no yes 2 no furnished
## 6 no yes 2 yes semi-furnished
Data cleaning ensures that the dataset is free from inconsistencies, missing values, and irrelevant variables. This step prepares the data for statistical analysis.
# Remove non-numeric columns
housing_data_cleaned <- housing_data %>% select_if(is.numeric)
# Check for missing values
sum(is.na(housing_data_cleaned))
## [1] 0
# Remove any rows with missing values (a defensive step; none are present here)
housing_data_cleaned <- housing_data_cleaned %>% drop_na()
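As a quick sanity check (a minimal base R sketch), we can confirm that the cleaned dataset keeps all 545 observations and only the six numeric columns:
# Confirm the dimensions and remaining columns of the cleaned dataset
dim(housing_data_cleaned)    # expected: 545 rows, 6 numeric columns
names(housing_data_cleaned)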
EDA provides a comprehensive understanding of the dataset by summarizing its statistical properties and visualizing variable relationships. Here, we calculate summary statistics and generate a correlation heatmap to uncover relationships between variables.
# Summary statistics
summary(housing_data_cleaned)
## price area bedrooms bathrooms
## Min. : 1750000 Min. : 1650 Min. :1.000 Min. :1.000
## 1st Qu.: 3430000 1st Qu.: 3600 1st Qu.:2.000 1st Qu.:1.000
## Median : 4340000 Median : 4600 Median :3.000 Median :1.000
## Mean : 4766729 Mean : 5151 Mean :2.965 Mean :1.286
## 3rd Qu.: 5740000 3rd Qu.: 6360 3rd Qu.:3.000 3rd Qu.:2.000
## Max. :13300000 Max. :16200 Max. :6.000 Max. :4.000
## stories parking
## Min. :1.000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.0000
## Median :2.000 Median :0.0000
## Mean :1.806 Mean :0.6936
## 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :4.000 Max. :3.0000
# Correlation heatmap
corr_matrix <- cor(housing_data_cleaned)
corr_data <- as.data.frame(as.table(corr_matrix))
ggplot(corr_data, aes(Var1, Var2, fill = Freq)) +
geom_tile() +
labs(title = "Correlation Heatmap", x = "Variables", y = "Variables") +
scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
theme_minimal()
The correlation heatmap visually represents the relationships between variables in the dataset. The intensity of the color indicates the strength of the correlation, with red signifying a high positive correlation and blue representing a strong negative correlation. For example:
Price and Area exhibit a strong positive correlation, indicating that larger properties tend to have higher prices.
Bedrooms and Bathrooms also show a positive correlation, suggesting that houses with more bedrooms are likely to have more bathrooms.
Parking has weaker correlations with the other variables, indicating that it may have limited direct influence on other housing features.
These insights provide a preliminary understanding of variable relationships, guiding further steps such as PCA to streamline the analysis.
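To attach numbers to these observations, the relevant pairwise correlations can be read directly from the matrix computed above (a small illustrative snippet; the printed values confirm or refine the qualitative reading):
# Inspect the specific correlations behind the heatmap interpretation
round(corr_matrix["price", "area"], 2)
round(corr_matrix["bedrooms", "bathrooms"], 2)
round(corr_matrix["parking", ], 2)   # parking against all other variables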
Standardization is essential for PCA, as it ensures that all variables contribute equally to the analysis. Without standardization, variables with larger scales could dominate the results.
# Standardize the data
scaled_data <- scale(housing_data_cleaned)
# Convert to a data frame for easier handling
scaled_data_df <- as.data.frame(scaled_data)
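As a quick verification (a minimal sketch), each standardized column should now have a mean of approximately 0 and a standard deviation of 1:
# Verify that standardization worked: means ~0, standard deviations ~1
round(colMeans(scaled_data_df), 3)
round(apply(scaled_data_df, 2, sd), 3)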
The Kaiser-Meyer-Olkin (KMO) test measures sampling adequacy. A KMO value closer to 1 indicates that PCA is suitable, whereas values below 0.5 suggest that the dataset is not appropriate for PCA.
# Custom KMO Calculation
msa_test <- function(correlation_matrix) {
partial_corr_matrix <- solve(correlation_matrix)
diag_elements <- diag(partial_corr_matrix)
partial_corr_matrix <- -partial_corr_matrix / sqrt(outer(diag_elements, diag_elements))
diag(partial_corr_matrix) <- diag_elements
kmo_numerator <- sum(correlation_matrix^2) - sum(diag(correlation_matrix^2))
kmo_denominator <- kmo_numerator + sum(partial_corr_matrix^2) - sum(diag(partial_corr_matrix^2))
kmo_value <- kmo_numerator / kmo_denominator
# Calculate individual MSA values: squared correlations vs. squared partial correlations (off-diagonal)
r2_off <- colSums(correlation_matrix^2) - diag(correlation_matrix)^2
q2_off <- colSums(partial_corr_matrix^2) - diag(partial_corr_matrix)^2
msa_values <- r2_off / (r2_off + q2_off)
return(list(KMO = kmo_value, MSA = msa_values))
}
# Run the KMO test
msa_results <- msa_test(cor(scaled_data_df))
kmo_value <- msa_results$KMO
cat("KMO Statistic:", kmo_value, "\n")
## KMO Statistic: 0.69553
KMO Statistic: The KMO value obtained is 0.69553, which is above the 0.5 threshold, suggesting that the dataset is moderately adequate for PCA; a value closer to 1 would indicate stronger sampling adequacy. Interpretation: The variables share sufficient common variance to justify the use of PCA as a dimension reduction technique.
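As a cross-check of the custom function, the same statistic can be computed with psych::KMO() from the psych package (also used below for Bartlett's test); small numerical differences between implementations are possible:
# Cross-check the overall KMO and per-variable MSA with the psych package
kmo_check <- psych::KMO(cor(scaled_data_df))
kmo_check$MSA    # overall measure of sampling adequacy
kmo_check$MSAi   # per-variable MSA values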
Bartlett’s test of sphericity is used to determine whether the correlation matrix differs significantly from an identity matrix. This test is a prerequisite for PCA, as it confirms that the variables are interrelated rather than mutually uncorrelated. A significant p-value (< 0.05) indicates that the dataset is appropriate for PCA.
# Bartlett's Test
bartlett_result <- cortest.bartlett(cor(scaled_data_df), n = nrow(scaled_data_df))
print(bartlett_result)
## $chisq
## [1] 756.9171
##
## $p.value
## [1] 1.349047e-151
##
## $df
## [1] 15
Results Interpretation:
Chi-Square Value: The test yielded a chi-square statistic of 756.9171. This large value indicates strong relationships among the variables in the dataset.
Degrees of Freedom (df): The test used 15 degrees of freedom, corresponding to the number of pairwise correlations among the six variables (6 × 5 / 2 = 15).
P-Value: The p-value is extremely small (1.349047e-151), well below the 0.05 threshold, so the null hypothesis that the correlation matrix is an identity matrix is strongly rejected.
Conclusion: The dataset contains significant correlations among variables, making it suitable for dimension reduction techniques such as PCA and providing a solid foundation for extracting principal components.
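For transparency, the chi-square statistic can be reproduced from the closed-form expression for Bartlett’s test of sphericity, χ² = −[(n − 1) − (2p + 5)/6] · ln|R| with df = p(p − 1)/2 (a sketch that should closely match the cortest.bartlett() output above):
# Reproduce Bartlett's chi-square and degrees of freedom from the standard formula
R <- cor(scaled_data_df)
n <- nrow(scaled_data_df)
p <- ncol(scaled_data_df)
chisq_manual <- -((n - 1) - (2 * p + 5) / 6) * log(det(R))
df_manual <- p * (p - 1) / 2   # 6 variables -> 15 pairwise correlations
c(chisq = chisq_manual, df = df_manual)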
Visualizing the contributions of variables helps identify their relative importance before dimension reduction.
# Calculate a simple contribution measure for each original column (sum of absolute standardized values)
initial_contributions <- colSums(abs(scaled_data_df))
initial_contributions_df <- data.frame(Variable = names(initial_contributions), Contribution = initial_contributions)
# Initial Columns Contribution Bar Chart
ggplot(initial_contributions_df, aes(x = reorder(Variable, Contribution), y = Contribution, fill = Variable)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Contributions of Initial Columns", x = "Variables", y = "Contribution") +
theme_minimal()
PCA transforms the dataset into principal components, ranked by their ability to explain variance. This step identifies and visualizes the contributions of these components.
# Perform PCA
pca_results <- prcomp(scaled_data_df, scale. = TRUE)  # data are already standardized, so scale. = TRUE is redundant but harmless
# Summary of PCA
summary(pca_results)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.599 1.1032 0.8229 0.8104 0.76866 0.54963
## Proportion of Variance 0.426 0.2029 0.1129 0.1094 0.09847 0.05035
## Cumulative Proportion 0.426 0.6289 0.7417 0.8512 0.94965 1.00000
# Eigenvalues
pca_eigenvalues <- pca_results$sdev^2
# Calculate a simple contribution measure per principal component (sum of absolute component scores)
pca_contributions <- colSums(abs(pca_results$x))
pca_contributions_df <- data.frame(PrincipalComponent = names(pca_contributions), Contribution = pca_contributions)
# Bar Chart of PCA Contributions
ggplot(pca_contributions_df, aes(x = reorder(PrincipalComponent, Contribution), y = Contribution, fill = PrincipalComponent)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Contributions of Principal Components", x = "Principal Components", y = "Contribution") +
theme_minimal()
Scree and cumulative variance plots help determine the optimal number of components to retain for analysis.
# Proportion of variance explained by each component
explained_variance <- pca_eigenvalues / sum(pca_eigenvalues)
# Scree Plot
plot(explained_variance, type = "b", pch = 19,
xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
main = "Scree Plot")
The scree plot provides a visual representation of how much variance is explained by each principal component. The key insights from the plot are as follows:
Variance Explained: The first component (PC1) accounts for approximately 43% of the total variance in the dataset, and the second component (PC2) explains about 20%. Together, the first two components explain roughly 63% of the total variance.
Elbow Point: The “elbow” in the scree plot, where the curve flattens significantly, is observed after the second component. This indicates that the first two components capture the majority of the variance and additional components contribute diminishing returns in terms of explained variance.
Retained Dimensions: Based on the scree plot, the first two principal components are the natural candidates to retain. These two dimensions collectively preserve a substantial portion of the dataset’s variability (approximately 63%), reducing the original six variables to two principal components and enhancing interpretability without significant loss of information.
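As a complementary retention rule (a hedged cross-check, not part of the original analysis above), the Kaiser criterion keeps components whose eigenvalues exceed 1, i.e., components that explain more variance than a single standardized variable; with the eigenvalues computed earlier, this rule also points to two components, consistent with the elbow:
# Kaiser criterion: retain components whose eigenvalue (variance) exceeds 1
data.frame(Component  = paste0("PC", seq_along(pca_eigenvalues)),
           Eigenvalue = round(pca_eigenvalues, 3),
           Retain     = pca_eigenvalues > 1)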
# Cumulative Variance
cumulative_variance <- cumsum(explained_variance)
plot(cumulative_variance, type = "b", pch = 19,
xlab = "Principal Component",
ylab = "Cumulative Variance Explained",
main = "Cumulative Variance Plot")
Cumulative Variance Plot: Analysis
The cumulative variance plot illustrates the proportion of total variance captured as principal components are added. In this analysis:
Principal Component Selection: The first two components explain approximately 63% of the total variance, which is substantial for summarizing the dataset. Including the third component raises the cumulative explained variance to roughly 74%, providing a more comprehensive representation while maintaining simplicity. By the fourth component the cumulative explained variance reaches about 85%, and additional components offer diminishing returns.
Final Dimensions: To balance interpretability and information retention, the analysis focuses on the first three principal components. Although the scree plot’s elbow suggested two components, retaining a third captures a meaningfully larger share of the variance while keeping the representation compact.
Decision Rationale: Selecting three components achieves a significant reduction from the original six variables while retaining sufficient variance for meaningful analysis. This supports computational efficiency and reduces the risk of overfitting in downstream tasks such as predictive modeling.
Implications for the Dataset: The reduced dataset retains the key relationships and variation among the variables, enabling effective use in regression or clustering tasks. Variables such as area, price, and stories are prominent contributors to the retained components, as shown in the biplot and variable contribution analyses. By reducing the data to the first three principal components, we obtain a streamlined dataset that is optimized for analysis while minimizing the loss of valuable information.
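To make the reduction concrete, the scores of the retained components can be extracted into a new data frame for use in downstream regression or clustering (a minimal sketch following the three-component decision above):
# Build the reduced dataset from the first three principal component scores
reduced_data <- as.data.frame(pca_results$x[, 1:3])
dim(reduced_data)    # 545 observations projected onto 3 components
head(reduced_data)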
The biplot combines variables and principal components, showing their relationships and contributions.
# PCA Biplot
biplot(pca_results, scale = 0, main = "PCA Biplot")
In the biplot, each arrow represents a variable, while the data points represent observations projected onto the principal components. Variables with arrows pointing in similar directions are positively correlated, while those pointing in opposite directions are negatively correlated. The length of an arrow indicates the strength of the variable’s contribution to the PCs. For instance:
Variables such as price and stories may cluster closely, indicating shared variance.
Area and bedrooms could contribute strongly to PC1, as their arrows align along that axis.
This plot helps identify which variables are most influential in the dataset and how they relate to one another, aiding dimension reduction and feature selection.
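The arrow directions and lengths in the biplot correspond to the loadings on the first two components, which can be inspected numerically (a small snippet; note that the signs of loadings are arbitrary and may differ across platforms):
# Numeric loadings behind the biplot arrows (first two principal components)
round(pca_results$rotation[, 1:2], 2)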
This section visualizes the contribution of original variables to the principal components, providing insights into their importance in the reduced dataset.
# Contributions of variables to PCs
loadings <- as.data.frame(pca_results$rotation)
loadings$Variable <- rownames(loadings)
loadings_long <- loadings %>% pivot_longer(cols = starts_with("PC"),
names_to = "PrincipalComponent",
values_to = "Contribution")
ggplot(loadings_long, aes(x = Variable, y = Contribution, fill = PrincipalComponent)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Variable Contributions to Principal Components",
x = "Variables", y = "Contribution") +
theme_minimal()
The “Variable Contributions to Principal Components” bar chart provides insight into how each original variable contributes to the principal components (PCs).
Overall Distribution of Contributions: Each variable contributes differently to the principal components. Variables such as area, bedrooms, and price show notable contributions across several components, indicating their importance in explaining the dataset’s variance.
Dominant Contributions: For PC1, variables such as area and price have higher contributions, implying that this component captures variance largely driven by these features. For PC2, stories appears dominant, showing that its unique variance is significant in this component. Similarly, PC3 and PC4 highlight contributions from variables such as bedrooms and parking.
Negative Contributions: Negative values in the chart indicate an inverse relationship between a variable and the corresponding principal component. For instance, bathrooms shows negative loadings on some components, suggesting a variance pattern that contrasts with the other variables.
Spread Across Components: The variables are distributed across multiple components, reflecting the orthogonal nature of principal components and confirming that PCA separates overlapping variance into distinct directions.
Interpretation of Component Usage: Components dominated by a small set of variables may represent specific patterns in the data (e.g., PC2 is heavily shaped by stories), while broader contributions across components indicate shared influence among variables.
Significance in Reduction: The chart illustrates why PCA is effective for dimension reduction, as it isolates variables with distinct contributions and relegates noise and redundant variation to the higher-order components.
This analysis supports the effectiveness of PCA in reducing the housing dataset while maintaining interpretability and variance representation.
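One common way to quantify these contributions (the convention used by packages such as factoextra) is the squared loading of each variable on a component, expressed as a percentage; because prcomp() returns unit-length loading vectors, this reduces to squaring the rotation matrix and multiplying by 100 (a hedged sketch that complements, rather than replaces, the signed loadings plotted above):
# Percentage contribution of each variable to each principal component
contrib_pct <- round(pca_results$rotation^2 * 100, 1)
contrib_pct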
The analysis demonstrates that: