Portfolio

Teja Savatapalli

2024-04-26

About Me

Over the span of 6+ years, I have delved deeply into Azure Cloud technology, honing my expertise in automating infrastructure, designing complex network architectures, and orchestrating the integration of IaaS and PaaS services. This extensive experience has not only fortified my grasp of creating robust and secure cloud frameworks but has also shed light on the critical role of data in shaping strategic decisions. Concurrently, more than two years dedicated to manual and Selenium automation testing refined my capacity for problem-solving and fueled my drive to innovate within the realm of software development.

My academic progression, marked by pursuing a Master’s degree in Data Science and Analytics at Grand Valley State University, represented a significant shift in my professional trajectory. It was here that my foundational technical skills were augmented with a profound appreciation for the nuanced application of data. The curriculum led me through various analytical projects, including an in-depth analysis of Portland Police incident reports and a comprehensive study of patterns emerging from the ICC Men’s T20 World Cup 2022. Using visualization and analysis tools such as Power BI, Tableau, and Google Colab, I was able to distill complex datasets into actionable insights.

As I step forward into the realm of Data Science and Analytics, I bring a synthesis of my extensive cloud infrastructure knowledge, a meticulous approach garnered from quality assurance roles, and an unabating enthusiasm for data discovery. I am driven by the goal of uncovering underlying trends, projecting future scenarios, and empowering organizations to craft strategies anchored in data. Positioned at this nexus, I am ready to champion and contribute to an ethos where informed, data-driven decision-making is the cornerstone of organizational success.

Course Objectives

As part of the Statistical Modeling and Regression course at GVSU, I worked toward the following objectives:

  • Probability as a foundation of statistical modeling (including inference and maximum likelihood estimation):

Demonstration: The project’s analysis included probability-based methods crucial for understanding statistical modeling, such as the interpretation of p-values and confidence intervals. These were demonstrated by applying inferential statistics to assess the significance of the predictor variables in the regression models.

Explanation: Probability theory was the bedrock upon which the entire modeling process was built, allowing for the extension of sample findings to broader population claims. This objective was met by applying inferential techniques grounded in probability, reflecting a deep understanding of the probabilistic nature of statistical models.

  • Determine and apply the appropriate generalized linear model for a specific data context:

Demonstration: The project showcased the use of Multiple Linear Regression (MLR), a specific type of generalized linear model, to examine the data’s underlying relationships. This choice was checked through residual analysis and validation of the model assumptions, which supported MLR as a reasonable starting point while flagging heteroscedasticity as an open issue.

Explanation: The exploratory phase led to the determination that MLR would best account for the multiple predictors influencing temperature, signifying an informed application of generalized linear models tailored to the unique aspects of the dataset.

  • Conduct model selection for a set of candidate models:

Demonstration: Model selection was illustrated by evaluating both Simple Linear Regression (SLR) and MLR models, using test-set RMSE as the criterion for performance comparison. The preferred model was selected based on its predictive accuracy and adherence to theoretical expectations.

Explanation: The project involved a systematic selection process, weighing statistical metrics alongside conceptual considerations, to ascertain the model that best encapsulated the data’s intricacies, fulfilling the objective of conducting thorough model selection.

  • Communicate the results of statistical models to a general audience:

Demonstration: The findings from the statistical models were translated into accessible visual narratives, clearly illustrating the comparison between predicted and actual temperature values and the distribution of model residuals.

Explanation: The results were communicated effectively to a general audience through clear visualizations and accompanying narrative, demonstrating an ability to distill complex statistical concepts into understandable forms.

  • Proficiency in Programming for Statistical Analysis:

Demonstration: The project’s thorough use of R for data manipulation, model fitting, diagnostics, and visualization highlighted a comprehensive skill set in programming for statistical analysis.

Explanation: By elaborating on the purpose and function of the R code utilized in each analytical phase, the project evidenced a nuanced understanding of programming within statistical methodology.
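The link between maximum likelihood estimation and least squares named in the first objective can be illustrated with a small sketch. The data below are simulated rather than taken from the weather dataset, and the names `neg_loglik`, `fit_mle`, and `fit_ols` are hypothetical, introduced only for this illustration. Under Gaussian errors, minimizing the negative log-likelihood should recover essentially the same intercept and slope as lm().

```r
# Sketch: maximum likelihood vs. least squares on simulated data (not the weather data)
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n, sd = 1.5)

# Negative Gaussian log-likelihood; par = (intercept, slope, log(sigma))
neg_loglik <- function(par) {
  mu <- par[1] + par[2] * x
  sigma <- exp(par[3])  # optimizing on the log scale keeps sigma positive
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

fit_mle <- optim(c(0, 0, 0), neg_loglik, method = "BFGS")
fit_ols <- lm(y ~ x)

# The two sets of coefficient estimates should agree closely
rbind(mle = fit_mle$par[1:2], ols = unname(coef(fit_ols)))
```

The agreement is a property of the Gaussian model: the log-likelihood is maximized in the coefficients exactly where the sum of squared residuals is minimized.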

Project Title: “Urban Canopy Influence: A Statistical Analysis of Green Spaces on Temperature Dynamics”

Introduction: In an era where urbanization’s environmental impact is more pronounced than ever, understanding the role of green spaces within cityscapes is vital. The project titled “Urban Canopy Influence” embarks on a statistical journey to decipher the relationship between urban greenery and local temperature patterns. Utilizing a data-driven approach, this analysis harnesses the power of Multiple Linear Regression (MLR) to unearth the subtle, yet significant, influence of urban green spaces on moderating temperatures.

The genesis of this project lies in my extensive background in Azure Cloud technology and a keen interest in leveraging data to elucidate environmental trends. This analysis is not only an academic endeavor but also a step towards sustainable urban planning and a testament to the potential of data science in contributing to ecological stewardship. By methodically analyzing historical weather data with a suite of statistical tools and programming languages like R, this project aims to provide actionable insights into how urban design can be optimized for environmental harmony.

In this portfolio, we navigate through the various phases of the project — from initial data preprocessing to complex model diagnostics — each step building upon the last to create a comprehensive picture of our urban ecosystems. The insights gained here are poised to inform policy-makers and urban developers alike, emphasizing the significance of green spaces in crafting the sustainable cities of tomorrow.

Determine and apply the appropriate generalized linear model for a specific data context

# Load the necessary packages.
# suppressWarnings() does not silence package startup messages, so
# suppressPackageStartupMessages() is used to keep the rendered output clean.
# Rendered with tidyverse 2.0.0 (dplyr 1.1.3, ggplot2 3.4.3, lubridate 1.9.3) and glmnet 4.1-8.
suppressPackageStartupMessages({
  library(tidyverse)
  library(lubridate)
  library(caret)
  library(glmnet)
  library(ISLR)
  library(leaps)
})

The weather dataset is imported from a CSV file. This dataset is the foundation of the project and contains the historical weather data necessary for the temperature forecast model.

# Data Import and Preprocessing
weather_data <- read.csv("C:/Users/sthar/Downloads/weather.csv")

This code converts the date strings to Date objects and coerces character-based numerical values to numeric types. It then selects the columns used for modeling (average temperature, wind speed, and precipitation) and removes any rows with missing data, ensuring a clean dataset for analysis. Note that the parsed Date column is not among the selected columns, so it is not carried into the modeling data.

# Data Transformation
weather_data <- weather_data %>%
  mutate(
    Date = as.Date(Date.Full, format="%m/%d/%Y"),
    AvgTemp = as.numeric(Data.Temperature.Avg.Temp),
    WindSpeed = as.numeric(Data.Wind.Speed),
    Precipitation = as.numeric(Data.Precipitation)
  ) %>%
  select(AvgTemp, WindSpeed, Precipitation) %>%
  na.omit() # Handle missing values

Here, the data is split into training and test sets, with 80% of the data used for training the model and the remaining 20% for testing its predictive power. The set.seed function ensures that the results are reproducible, an essential practice in data science.

# Train-Test Split
set.seed(123) # For reproducibility
training_indices <- createDataPartition(weather_data$AvgTemp, p = 0.8, list = FALSE)
train_data <- weather_data[training_indices, ]
test_data <- weather_data[-training_indices, ]

This code chunk performs a Simple Linear Regression, modeling temperature as a function of wind speed only. It’s a preliminary step that provides a baseline for model performance.

# Simple Linear Regression
slr_model <- lm(AvgTemp ~ WindSpeed, data = train_data)
slr_predictions <- predict(slr_model, newdata = test_data)

A Multiple Linear Regression model is developed here, which includes both wind speed and precipitation as predictors, providing a more sophisticated model than the simple linear regression.

# Multiple Linear Regression
mlr_model <- lm(AvgTemp ~ WindSpeed + Precipitation, data = train_data)
mlr_predictions <- predict(mlr_model, newdata = test_data)

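The performance of the simple and multiple linear regression models is compared on the test set using the Root Mean Square Error (RMSE), the square root of the mean squared prediction error. RMSE is expressed in the same units as the temperature variable, and lower values indicate better predictions.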

# Compare the performance of SLR vs MLR using RMSE
slr_rmse <- sqrt(mean((test_data$AvgTemp - slr_predictions)^2))
mlr_rmse <- sqrt(mean((test_data$AvgTemp - mlr_predictions)^2))

The RMSE values for both models are printed to the console, providing a clear indicator of which model performs better in predicting temperatures.

# Print RMSE values to the console
print(paste("SLR RMSE:", slr_rmse))
## [1] "SLR RMSE: 18.3522446023526"
print(paste("MLR RMSE:", mlr_rmse))
## [1] "MLR RMSE: 18.1964085534141"
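The RMSE calculation repeated above can be wrapped in a small helper. This is only a sketch: the name `rmse` is an assumption introduced here, not a function from the project code.

```r
# Hypothetical helper: root mean square error between observed and predicted values
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Toy check: the errors are 0, 0 and 2, so the result is sqrt(4 / 3)
rmse(c(1, 2, 3), c(1, 2, 5))
```

With the project objects, it would be called as rmse(test_data$AvgTemp, mlr_predictions). For context, the gap between the two printed values is about 0.16, which is under 1% of the SLR value.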

Model Visualization and Diagnostics

This code prepares a new data frame that includes both the actual temperatures and the predicted temperatures from both the SLR and MLR models, setting up for a comparative visualization.

# Prepare data for comparison plot
comparison_data <- test_data %>%
  mutate(
    SLR_Predicted = slr_predictions,
    MLR_Predicted = mlr_predictions
  )

The actual vs. predicted temperatures for both models are plotted here, allowing for a visual comparison of their predictions. This plot helps in visually assessing which model predicts temperature more accurately.

# Plot to compare actual vs. predicted values for SLR and MLR
ggplot(comparison_data) +
  geom_point(aes(x = AvgTemp, y = SLR_Predicted), color = '#4169E1', alpha = 0.5) +
  geom_point(aes(x = AvgTemp, y = MLR_Predicted), color = '#FF6B35', alpha = 0.5) +
  labs(title = "Comparison of SLR and MLR Predictions", x = "Actual Temperature", y = "Predicted Temperature") +
  theme_minimal() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black") # Line y=x for reference

This plot displays the actual temperatures against the predicted temperatures from both Simple Linear Regression (SLR) and Multiple Linear Regression (MLR) models. The blue points represent SLR predictions and the orange points represent MLR predictions.

The closeness of the points to the dashed identity line (where predicted temperature equals actual temperature) indicates the accuracy of the models. Both models show a wide spread of points around the line, reflecting substantial prediction error. The orange points (MLR) are slightly closer to the line, suggesting that the MLR model has a marginally better fit. The heavy overlap of blue and orange points indicates that adding precipitation as a predictor improved predictive power only slightly, consistent with the similar RMSE values (18.35 for SLR versus 18.20 for MLR).

The chunk below provides a summary of the coefficients from the Multiple Linear Regression model, including estimates, standard errors, t-values, and p-values. It gives insight into the significance and impact of each predictor in the model.

# MLR Model Coefficients Summary
summary(mlr_model)
## 
## Call:
## lm(formula = AvgTemp ~ WindSpeed + Precipitation, data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -86.222 -12.280   1.499  14.078  53.720 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   60.36753    0.34281  176.09   <2e-16 ***
## WindSpeed     -0.88523    0.04559  -19.42   <2e-16 ***
## Precipitation  2.26343    0.16039   14.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.46 on 13394 degrees of freedom
## Multiple R-squared:  0.04131,    Adjusted R-squared:  0.04116 
## F-statistic: 288.6 on 2 and 13394 DF,  p-value: < 2.2e-16
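The confidence intervals mentioned in the course objectives can be rebuilt from this table. The sketch below copies the estimates, standard errors, and residual degrees of freedom from the output above; the object names (`estimates`, `std_errors`, `ci`) are hypothetical.

```r
# Sketch: 95% confidence intervals rebuilt from the summary() table above
estimates  <- c(Intercept = 60.36753, WindSpeed = -0.88523, Precipitation = 2.26343)
std_errors <- c(Intercept = 0.34281,  WindSpeed = 0.04559,  Precipitation = 0.16039)
df_resid   <- 13394  # residual degrees of freedom reported by summary()

t_crit <- qt(0.975, df = df_resid)  # two-sided 95% critical value
ci <- cbind(lower = estimates - t_crit * std_errors,
            upper = estimates + t_crit * std_errors)
round(ci, 3)
```

None of the three intervals contains zero, which matches the very small p-values in the table. With the fitted object available, confint(mlr_model) should give the same intervals up to rounding.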

The normality of the residuals is checked here. Normally distributed residuals are an assumption behind the t-tests and confidence intervals reported for the MLR model.

# Model Diagnostics for MLR
# Checking for Residual Normality
residuals <- resid(mlr_model)
ggplot(data = data.frame(residuals), aes(x = residuals)) +
  # after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.4.0)
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.5, colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "#800020") +
  labs(title = "Residuals Distribution", x = "Residuals", y = "Density")

The histogram overlaid with a density plot shows the distribution of the residuals from the MLR model.

The residuals should ideally follow a normal distribution, which would appear as a symmetrical bell-shaped curve. The plot shows that the residuals are approximately normally distributed, with some deviations such as potential outliers or slight skewness. The residual quantiles in the model summary point the same way: the minimum (-86.2) lies farther from zero than the maximum (53.7), which hints at a longer left tail. The near-normal shape supports the use of t-based inference, while the deviations leave room for improvement, possibly by addressing outliers or transforming variables.
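A visual check like this can be complemented with simple numeric summaries. The sketch below uses simulated residuals, not the project residuals, and the helper `skewness` is a hypothetical name defined here. It computes the sample skewness and the correlation between theoretical and sample quantiles from a normal Q-Q comparison.

```r
# Sketch: numeric normality checks on simulated residuals (not the project residuals)
set.seed(42)
sim_resid <- rnorm(1000, mean = 0, sd = 18)

# Hypothetical helper: sample skewness (0 for a perfectly symmetric sample)
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

# Q-Q coordinates without drawing; a correlation near 1 indicates near-normality
qq <- qqnorm(sim_resid, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)

c(skewness = skewness(sim_resid), qq_correlation = qq_cor)
```

Applied to resid(mlr_model), a clearly negative skewness would confirm the longer left tail suggested by the residual quantiles.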

The plot below is used to check for homoscedasticity, the assumption that the residuals have constant variance across all levels of the fitted values. A pattern in this plot could suggest issues with the model, such as heteroscedasticity.

# Checking for Homoscedasticity
ggplot(data = data.frame(residuals), aes(x = fitted(mlr_model), y = residuals)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs Fitted Values", x = "Fitted Values", y = "Residuals")

This plot shows residuals on the y-axis and fitted (predicted) values on the x-axis.

We want to see no discernible patterns in this plot, which would indicate homoscedasticity, one of the assumptions of linear regression. The ‘funnel’ shape, where the spread of residuals increases with the fitted values, indicates heteroscedasticity. This could be a sign that the model is not capturing all the variance in the data and that the errors are not consistent across all levels of prediction.
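The visual impression of a funnel can be backed by a formal test. The following is a minimal sketch of a studentized Breusch-Pagan test written by hand in base R, run on simulated data whose error spread grows with the predictor. The helper name `bp_test` and the simulated variables are assumptions for illustration, not part of the project.

```r
# Sketch: studentized Breusch-Pagan test by hand, on simulated data
set.seed(7)
n <- 500
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 + 0.5 * x)  # error spread grows with x

fit <- lm(y ~ x)

# Hypothetical helper: regress squared residuals on the model's predictors
bp_test <- function(model) {
  e2 <- resid(model)^2
  X <- model.matrix(model)            # design matrix, including the intercept
  aux <- lm.fit(X, e2)                # auxiliary regression of e^2 on X
  r2 <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  stat <- length(e2) * r2             # test statistic: n * R^2
  df <- ncol(X) - 1                   # number of predictors
  c(statistic = stat, df = df, p_value = pchisq(stat, df, lower.tail = FALSE))
}

bp <- bp_test(fit)
bp
```

A small p-value is evidence against constant variance. If the lmtest package is installed, its bptest() function provides a maintained implementation of this test and would be the preferable choice in practice.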

The autocorrelation function (ACF) plot is used to check the independence of the residuals. For a well-fitted model, we expect no autocorrelation in the residuals. This check is particularly important for time series data.

# Checking for Independence
acf(residuals)

The ACF plot shows the correlation of the residuals with lagged versions of themselves.

Ideally, for a well-fitting model, the autocorrelation coefficients at nonzero lags should fall within the blue dashed confidence bands. In the ACF plot, all bars beyond lag 0 lie within the bounds, so the residuals show no significant autocorrelation. Because acf() works in row order, this supports independence over time only if the rows of the training data are ordered by date.
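A single-number companion to the ACF plot is the Durbin-Watson statistic, which focuses on lag-1 autocorrelation. The sketch below computes it by hand on two simulated series, one independent and one autocorrelated. The helper name `durbin_watson` and both series are assumptions for illustration only.

```r
# Sketch: Durbin-Watson statistic by hand, on simulated series
set.seed(99)
iid_resid <- rnorm(1000)                                      # independent errors
ar_resid  <- as.numeric(arima.sim(list(ar = 0.8), n = 1000))  # autocorrelated errors

# Hypothetical helper: values near 2 suggest no lag-1 autocorrelation,
# values well below 2 suggest positive autocorrelation
durbin_watson <- function(e) sum(diff(e)^2) / sum(e^2)

dw <- c(iid = durbin_watson(iid_resid), ar1 = durbin_watson(ar_resid))
dw
```

As with acf(), the result is only meaningful for the project residuals if they are ordered in time.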

The model diagnostics show that the MLR model satisfies some of the linear regression assumptions, such as the independence of errors, but they also point to heteroscedasticity and to departures from normality in the residuals. These issues could be mitigated with further refinement, such as variable transformation or the addition of interaction terms. The MLR model does provide a statistically significant fit to the data, and both predictors, ‘Wind Speed’ and ‘Precipitation’, have coefficients with p-values below 2e-16. Statistical significance is not the same as explanatory power, however: the multiple R-squared of 0.041 means the two predictors together account for only about 4% of the variance in ‘AvgTemp’.

The similar RMSE values for SLR and MLR suggest that the added complexity of the MLR model does not drastically improve predictive performance. However, the inclusion of additional relevant variables might capture more complex relationships and potentially provide more insightful and actionable predictions, aligning with the goals of a data-driven approach in the field of cloud infrastructure and data analytics.
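A single train-test split gives only one estimate of each model's error, so a comparison this close may be worth repeating with k-fold cross-validation. The sketch below does this in base R on simulated data. The data frame `sim`, its coefficients, and the helper `cv_rmse` are assumptions made for illustration and are not the project's data or results.

```r
# Sketch: 5-fold cross-validated RMSE for two candidate models, on simulated data
set.seed(2024)
n <- 300
sim <- data.frame(WindSpeed = runif(n, 0, 20), Precipitation = rexp(n, rate = 2))
sim$AvgTemp <- 60 - 0.9 * sim$WindSpeed + 2.3 * sim$Precipitation + rnorm(n, sd = 5)

# Assign every row to one of 5 folds; both models are scored on the same folds
folds <- sample(rep(seq_len(5), length.out = n))

# Hypothetical helper: average held-out RMSE across the folds
cv_rmse <- function(formula, data, folds) {
  response <- all.vars(formula)[1]
  fold_rmse <- sapply(sort(unique(folds)), function(i) {
    fit <- lm(formula, data = data[folds != i, ])
    pred <- predict(fit, newdata = data[folds == i, ])
    sqrt(mean((data[[response]][folds == i] - pred)^2))
  })
  mean(fold_rmse)
}

cv_slr <- cv_rmse(AvgTemp ~ WindSpeed, sim, folds)
cv_mlr <- cv_rmse(AvgTemp ~ WindSpeed + Precipitation, sim, folds)
c(SLR = cv_slr, MLR = cv_mlr)
```

Scoring both models on identical folds keeps the comparison fair, since any difference then comes from the models rather than from the random partition.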

Reflecting on a Journey of Statistical Modeling and Regression: Integrating Examples and Project

Over the duration of this course, I’ve diligently applied statistical methodologies to tangible datasets, transforming abstract concepts into applied practice. My primary project revolved around the impact of urban greenery on local temperature variations, a subject that bridges the gap between environmental concerns and data science. This exploration has been instrumental in honing my analytical skills, enabling me to embody the core principles of the course: a staunch commitment to evidence-based reasoning and a robust, data-centric approach to uncovering insights.

By examining the intricate relationship between green spaces and their thermoregulatory effects, I’ve navigated through a trove of data, using statistical tools to unearth patterns and establish correlations. This project not only enriched my understanding of urban ecological dynamics but also underscored the potential of data in addressing contemporary environmental challenges. Through this lens, I’ve seen firsthand how data analytics can serve as a vehicle for informed decision-making and strategic planning in urban development.

Reflection on My Participation

As I look back on my time in this course, I see it not just as a journey in learning but as a collective adventure steered by the professor. My role was diverse, reaching out from the standard classroom setting into a space where I could really see the theories come to life.

During the course, my exchanges with statistical models and regression analyses were lively and went beyond the usual class participation. The discussions I had were more than just Q&A; they were deep dives into the complex material we were tackling. With the professor’s guidance, these concepts became clearer, especially when we applied them to real-world data, which really drove my curiosity to dig deeper.

My interaction with classmates was crucial to my involvement. Working together on projects or reviewing each other’s work wasn’t just academic; it was about sharing our unique viewpoints to create something richer. Those group brainstorming sessions really showed me the power of putting our heads together. Bringing my cloud tech background to the table gave us new angles to look at our data projects, blending theory with my hands-on experience.

I approached my studies with a hands-on, proactive mindset. I dived into the tools we used, like Power BI and R programming, ready to master them. With each assignment, I aimed to better my skills, using precise statistical methods on everything from city weather patterns to big sports data. The challenging coursework pushed me to give my best, which is clear in the thoroughness of my project work and my in-depth analyses.

In hindsight, my participation was a process of continual growth in skills and analytical thinking. It was a mix of solid academic work, team learning, and a relentless push for precision in data analysis. As the weeks passed, my grasp of everything from basic linear regression to the finer points of model selection got stronger.

Our professor’s approach to teaching, focusing on real understanding instead of just memorization, really matched my own views on learning, and it’s something I plan to keep with me as I move forward in my career.

All in all, my time in the course was defined by active involvement, a keen mind, and teamwork. I wrote this chapter of my academic story with care and a true respect for the field of statistics, and it’s one I’ll remember fondly.

Citation

Dataset: https://www.ncdc.noaa.gov/cdo-web/datasets

  • Dataset Insights: The weather.csv dataset comprises daily weather measurements from various locations, crucial for analyzing the impact of urban green spaces on local temperatures. Key variables include:

Data.Precipitation: Rainfall in millimeters.
Data.Temperature.Avg.Temp: Average daily temperature.
Data.Wind.Speed: Wind speed, affecting temperature distribution.
Date.Full: Date of data recording, for temporal analysis.
Station.City and Station.State: Geographic identifiers.

  • Significance:

This dataset enables us to apply and evaluate statistical models that explore the relationship between urban green spaces and local climate conditions, particularly focusing on temperature modulation. By leveraging detailed meteorological data, the project assesses how effectively urban forestry and green infrastructure can contribute to urban heat island mitigation.

  • Data Integrity and Accessibility:

The integrity of the data is maintained through rigorous quality control measures by the data providers, ensuring high reliability for academic and practical applications. The dataset’s accessibility allows for reproducibility of the study and facilitates further research by scholars and policymakers interested in urban environmental planning.