The purpose of this discussion board post is to use a dataset available HERE to learn more about the locally weighted scatterplot smoothing (LOWESS) method for multiple regression, as well as how to run it in R.
I picked the following data set:
| Package | Item | Title | CSV | Doc |
|---|---|---|---|---|
| Stat2Data | TechStocks | Daily Prices of Three Tech Stocks | CSV | Doc |
I downloaded the CSV to a local folder and used the following commands to load the data into R:
# Setting working directory
setwd("/Users/jacob/Downloads")
# Load the CSV file into a data frame
tech_stocks <- read.csv("TechStocks.csv")
# Display the first few rows of the data frame
head(tech_stocks)
## rownames Date AAPL GOOG MSFT t
## 1 1 12/1/2015 117.34 767.04 55.22 1
## 2 2 12/2/2015 116.28 762.38 55.21 2
## 3 3 12/3/2015 115.20 752.54 54.20 3
## 4 4 12/4/2015 119.03 766.81 55.91 4
## 5 5 12/7/2015 118.28 763.25 55.81 5
## 6 6 12/8/2015 118.23 762.37 55.79 6
# View the column names of your dataset
colnames(tech_stocks)
## [1] "rownames" "Date" "AAPL" "GOOG" "MSFT" "t"
Loess stands for locally estimated scatterplot smoothing (lowess stands for locally weighted scatterplot smoothing) and is one of many non-parametric regression techniques, but arguably the most flexible. A smoothing function is a function that attempts to capture general patterns in stressor-response relationships while reducing the noise and it makes minimal assumptions about the relationships among variables. The result of a loess application is a line through the moving central tendency of the stressor-response relationship. Loess is essentially used to visually assess the relationship between two variables and is especially useful for large datasets, where trends can be hard to visualize (United States Environmental Protection Agency, 2016).
As the definition above makes clear, LOWESS is a non-parametric regression technique. It is primarily used for smoothing and for capturing local patterns in a dataset without assuming a specific functional form, and it is nonparametric because it does not rely on a predefined model structure. Compared with the more rigid constraints of traditional parametric regression models, the approach is more flexible and relies less on assumptions about the data distribution or the functional form.
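To make "local" and "weighted" concrete, here is a minimal sketch of a single local fit, assuming tricube distance weights and a straight-line fit inside the window. The data are synthetic and `local_fit` is a hypothetical helper written for this post, not a function from any package.

```r
# Synthetic data (an assumption for illustration; not the TechStocks file)
set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)

# Hypothetical helper: one weighted straight-line fit at x0,
# using roughly a fraction f of the data as the local window
local_fit <- function(x, y, x0, f = 0.3) {
  k <- max(2, floor(f * length(x)))          # number of neighbours in the window
  d <- abs(x - x0)                           # distance of every point from x0
  h <- sort(d)[k]                            # window half-width (k-th smallest distance)
  w <- ifelse(d < h, (1 - (d / h)^3)^3, 0)   # tricube weights, zero outside the window
  fit <- lm(y ~ x, weights = w)              # weighted least-squares line
  unname(predict(fit, newdata = data.frame(x = x0)))
}

x0 <- x[100]
fit_x0 <- local_fit(x, y, x0)

# R's own smoother at the same point, without robustness iterations
builtin <- lowess(x, y, f = 0.3, iter = 0, delta = 0)$y[100]
cat("hand-rolled:", fit_x0, " lowess():", builtin, "\n")
```

Repeating this fit at every x value traces out the smooth curve. The two printed values should be close, though `lowess()` has implementation details that this sketch leaves out.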
Locally Weighted Scatterplot Smoothing (LOWESS) has several advantages from a computational standpoint. One notable advantage is its adaptability to complex and non-linear relationships in data. LOWESS employs a local regression approach to capture intricate patterns without assuming a global functional form. This flexibility is particularly advantageous in situations where relationships may vary across different regions of the dataset. Additionally, LOWESS handles outliers and noisy data well: besides weighting each point by its distance from the location being fitted, it can run robustness iterations that downweight points with large residuals, which contributes to more robust and reliable results.
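In R's `lowess()`, the outlier handling described above is controlled by the `iter` argument (the number of robustness iterations). The sketch below uses synthetic data with one injected outlier, an assumption made purely for illustration, and compares the fit with and without those iterations.

```r
# Synthetic straight-line data with one gross outlier (illustrative assumption)
set.seed(2)
x <- 1:100
truth <- 0.5 * x
y <- truth + rnorm(100, sd = 1)
y[50] <- y[50] + 100                        # inject the outlier at x = 50

plain  <- lowess(x, y, f = 0.2, iter = 0)   # distance weights only
robust <- lowess(x, y, f = 0.2, iter = 3)   # plus three robustness iterations

# How far each fitted curve is pulled away from the true line at the outlier
err_plain  <- abs(plain$y[50]  - truth[50])
err_robust <- abs(robust$y[50] - truth[50])
cat("error at x = 50 with iter = 0:", err_plain, "\n")
cat("error at x = 50 with iter = 3:", err_robust, "\n")
```

With `iter = 0` the curve is dragged toward the outlier. The robustness iterations reduce that pull because points with large residuals receive little or no weight in the refits.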
However, LOWESS also comes with some computational challenges. The method’s computational complexity can be relatively high, especially for large datasets, as it involves fitting a local regression model at multiple points across the data. This can result in increased computation time and resource requirements. Additionally, the choice of smoothing parameter (the bandwidth, or span) affects both computational efficiency and smoothing performance, and finding a good value often involves trial and error. Despite these challenges, LOWESS remains a powerful tool for exploratory data analysis and visualization, trading some computational cost for the ability to capture complex relationships in the data.
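The trial and error over the span can be made a little more systematic. The following is a rough sketch, assuming synthetic data, an arbitrary grid of candidate spans, and 5-fold cross-validation with `loess()`. None of these choices come from the assignment, and other selection criteria are equally reasonable.

```r
# Synthetic data (illustrative assumption)
set.seed(42)
n <- 200
dat <- data.frame(x = seq(0, 10, length.out = n))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)

spans <- c(0.1, 0.3, 0.5, 0.9)              # candidate spans (arbitrary grid)
folds <- sample(rep(1:5, length.out = n))   # random assignment to 5 folds

# Mean squared prediction error on held-out folds for each span
cv_mse <- sapply(spans, function(s) {
  fold_mse <- sapply(1:5, function(k) {
    train <- dat[folds != k, ]
    test  <- dat[folds == k, ]
    # surface = "direct" lets predict() work slightly outside the training range
    fit <- loess(y ~ x, data = train, span = s, degree = 1,
                 control = loess.control(surface = "direct"))
    mean((test$y - predict(fit, newdata = test))^2)
  })
  mean(fold_mse)
})
names(cv_mse) <- spans

best_span <- spans[which.min(cv_mse)]
print(round(cv_mse, 4))
cat("span with lowest CV error:", best_span, "\n")
```

A very large span oversmooths and misses the curve, while a very small one chases noise. Cross-validation is one way to pick a value between those extremes instead of judging only by eye.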
# Explore the structure of the dataset
str(tech_stocks)
## 'data.frame': 504 obs. of 6 variables:
## $ rownames: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Date : chr "12/1/2015" "12/2/2015" "12/3/2015" "12/4/2015" ...
## $ AAPL : num 117 116 115 119 118 ...
## $ GOOG : num 767 762 753 767 763 ...
## $ MSFT : num 55.2 55.2 54.2 55.9 55.8 ...
## $ t : int 1 2 3 4 5 6 7 8 9 10 ...
# Display summary statistics of the dataset
summary(tech_stocks)
## rownames Date AAPL GOOG
## Min. : 1.0 Length:504 Min. : 90.34 Min. : 668.3
## 1st Qu.:126.8 Class :character 1st Qu.:105.70 1st Qu.: 742.5
## Median :252.5 Mode :character Median :116.22 Median : 786.9
## Mean :252.5 Mean :125.01 Mean : 820.5
## 3rd Qu.:378.2 3rd Qu.:146.59 3rd Qu.: 922.3
## Max. :504.0 Max. :175.88 Max. :1054.2
## MSFT t
## Min. :48.43 Min. : 1.0
## 1st Qu.:54.82 1st Qu.:126.8
## Median :60.75 Median :252.5
## Mean :62.39 Mean :252.5
## 3rd Qu.:69.41 3rd Qu.:378.2
## Max. :84.88 Max. :504.0
# Fit a multiple regression with 'rownames' (the row index) as the response and 'AAPL', 'GOOG', 'MSFT' as predictors
model <- lm(rownames ~ AAPL + GOOG + MSFT, data = tech_stocks)
# Display the regression summary
summary(model)
##
## Call:
## lm(formula = rownames ~ AAPL + GOOG + MSFT, data = tech_stocks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -162.21 -12.22 14.72 30.22 95.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -602.8625 28.2903 -21.310 < 2e-16 ***
## AAPL 1.2549 0.3594 3.492 0.000522 ***
## GOOG -0.1981 0.1014 -1.953 0.051432 .
## MSFT 13.8002 1.1254 12.262 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.46 on 500 degrees of freedom
## Multiple R-squared: 0.8944, Adjusted R-squared: 0.8938
## F-statistic: 1412 on 3 and 500 DF, p-value: < 2.2e-16
library(ggplot2)
# 'Date' was read in as character, so convert it to a Date object
tech_stocks$Date <- as.Date(tech_stocks$Date, format = "%m/%d/%Y")
# Perform LOWESS analysis on 'AAPL' stock prices
# (lowess() takes x first, then y, so the date goes first)
lowess_result <- lowess(as.numeric(tech_stocks$Date), tech_stocks$AAPL)
# Create a ggplot with the original data and a smoothed curve for AAPL
# (geom_smooth(method = 'loess') fits stats::loess(), not the lowess() result above)
ggplot(tech_stocks, aes(x = Date, y = AAPL)) +
  geom_line(aes(color = "Actual"), show.legend = TRUE) +
  geom_smooth(method = 'loess', aes(color = "LOWESS"), show.legend = TRUE) +
  scale_color_manual(values = c("Actual" = "blue", "LOWESS" = "red")) +
  labs(title = 'LOWESS Analysis for AAPL stock', x = 'Date', y = 'Stock Price')
## `geom_smooth()` using formula = 'y ~ x'
ggplot(tech_stocks, aes(x = Date, y = GOOG)) +
  geom_line(aes(color = "Actual"), show.legend = TRUE) +
  geom_smooth(method = 'loess', aes(color = "LOWESS"), show.legend = TRUE) +
  scale_color_manual(values = c("Actual" = "blue", "LOWESS" = "red")) +
  labs(title = 'LOWESS Analysis for GOOG stock', x = 'Date', y = 'Stock Price')
## `geom_smooth()` using formula = 'y ~ x'
ggplot(tech_stocks, aes(x = Date, y = MSFT)) +
  geom_line(aes(color = "Actual"), show.legend = TRUE) +
  geom_smooth(method = 'loess', aes(color = "LOWESS"), show.legend = TRUE) +
  scale_color_manual(values = c("Actual" = "blue", "LOWESS" = "red")) +
  labs(title = 'LOWESS Analysis for MSFT stock', x = 'Date', y = 'Stock Price')
## `geom_smooth()` using formula = 'y ~ x'
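One detail worth keeping in mind when reading the plots: `geom_smooth(method = 'loess')` calls `stats::loess()`, which is a different function from base R's `lowess()` and has different defaults. The sketch below uses a synthetic price-like series (an assumption, so that it runs without the CSV) and compares the two, and also shows a `loess()` call configured closer to `lowess()`.

```r
# Synthetic upward-trending series standing in for a stock price (assumption)
set.seed(3)
x <- 1:300
y <- 100 + 0.1 * x + 5 * sin(x / 30) + rnorm(300, sd = 2)

# lowess() defaults: f = 2/3, local linear fits, iter = 3 robustness iterations
lw <- lowess(x, y)

# loess() defaults: span = 0.75, degree = 2 (local quadratic), family = "gaussian"
lo_default <- predict(loess(y ~ x))

# loess() configured to resemble lowess(): same span, local linear, robust fitting
lo_matched <- predict(loess(y ~ x, span = 2/3, degree = 1, family = "symmetric"))

cat("max |lowess - default loess|:", max(abs(lw$y - lo_default)), "\n")
cat("max |lowess - matched loess|:", max(abs(lw$y - lo_matched)), "\n")
```

The curves are usually similar in shape but not identical, so a curve labelled "LOWESS" in a ggplot legend is, strictly speaking, a loess fit unless the `lowess()` output is plotted directly.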
The regression analysis was conducted using a dataset with 504 observations and six variables: ‘rownames’, ‘Date’, ‘AAPL’ (representing Apple stock values), ‘GOOG’ (representing Google stock values), ‘MSFT’ (representing Microsoft stock values), and ‘t’ (an index variable). The summary statistics reveal that the minimum closing prices for the stocks are 90.34 (AAPL), 668.3 (GOOG), and 48.43 (MSFT), while the maximum values are 175.88 (AAPL), 1054.2 (GOOG), and 84.88 (MSFT).
The multiple regression model was fitted using the ‘rownames ~ AAPL + GOOG + MSFT’ formula. Here, ‘rownames’ is not a price: it is the row index (1 to 504, matching ‘t’), so the model regresses the trading-day counter on the three stock prices. The coefficients indicate that, holding the other variables constant, each unit increase in ‘AAPL’ is associated with a 1.2549 unit increase in the response variable ‘rownames’. Similarly, each unit increase in ‘GOOG’ is associated with a 0.1981 unit decrease, and each unit increase in ‘MSFT’ with a 13.8002 unit increase. The intercept is -602.8625. The p-values indicate that ‘AAPL’ and ‘MSFT’ are statistically significant predictors, while ‘GOOG’ is only marginally significant (p = 0.051432), falling just short of the 0.05 level. The overall model is significant (p < 2.2e-16) and explains 89.44% of the variance in the response variable. The residuals range from -162.21 to 95.37, with a residual standard error of 47.46 on 500 degrees of freedom. The adjusted R-squared value is 0.8938, indicating a good fit, although with a time index as the response this may largely reflect how the prices trended over the period.
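Since the goal of the post is local regression with more than one predictor, it is worth noting that `lowess()` accepts only a single x variable, whereas `loess()` accepts a formula with one to four numeric predictors. The sketch below is a local-regression counterpart to the `lm()` call, run on synthetic data with a deliberately nonlinear relationship. The variable names `x1`, `x2`, and `y` are placeholders, not columns of TechStocks.

```r
# Synthetic data with a nonlinear surface (illustrative assumption)
set.seed(4)
n <- 300
dat <- data.frame(x1 = runif(n, 0, 2 * pi), x2 = runif(n, -1, 1))
dat$y <- sin(dat$x1) + dat$x2^2 + rnorm(n, sd = 0.1)

lin <- lm(y ~ x1 + x2, data = dat)                              # global linear model
loc <- loess(y ~ x1 + x2, data = dat, span = 0.5, degree = 2)   # local quadratic surface

rss_lin <- sum(residuals(lin)^2)
rss_loc <- sum(residuals(loc)^2)
cat("residual sum of squares, lm:   ", rss_lin, "\n")
cat("residual sum of squares, loess:", rss_loc, "\n")

# Prediction at a new point inside the range of the data
new_point <- data.frame(x1 = pi / 2, x2 = 0)
pred <- predict(loc, newdata = new_point)
```

Unlike `lm()`, the loess fit has no coefficient table to interpret. Its output is the fitted surface itself, which is why it is mostly used for exploration and prediction, not for reporting effect sizes.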
A systematic approach to selecting the critical parameters of the LOWESS technique is likely to produce data that better meets the assumptions made in the data preprocessing step, making studies that use the LOWESS method unambiguous and easier to repeat (Berger et al., 2004). Because LOWESS is a nonparametric method, it does not require the data to follow a parametric model. LOWESS effectively captures complex, nonlinear relationships in data without relying on specific assumptions about the underlying distribution. It excels in scenarios where parametric assumptions may not hold, providing flexibility in modeling diverse data patterns.
LOWESS distinguishes itself by offering flexibility in capturing intricate, nonlinear patterns in data without assuming a specific functional form. Unlike parametric methods, LOWESS adapts locally, making it robust to outliers and suitable for diverse datasets. However, it can be computationally intensive, especially for large datasets. Compared to other smoothing techniques like moving averages or polynomial fits, LOWESS provides more nuanced insights into the underlying trends, making it particularly valuable for exploratory data analysis and visualization when intricate patterns may exist in the data.
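As a small illustration of that comparison, the sketch below applies a centred moving average, a single global cubic polynomial, and `lowess()` to the same synthetic noisy curve. The window length, polynomial degree, and span are arbitrary choices made for the example, so the ranking it prints should not be read as a general result.

```r
# Synthetic noisy curve with a known underlying signal (illustrative assumption)
set.seed(5)
x <- seq(0, 10, length.out = 200)
signal <- sin(x)
y <- signal + rnorm(200, sd = 0.3)

# Centred 11-point moving average (undefined near both ends of the series)
ma <- as.numeric(stats::filter(y, rep(1 / 11, 11), sides = 2))
# One global cubic polynomial fitted to all the data
cubic <- fitted(lm(y ~ poly(x, 3)))
# LOWESS with a fairly small span
lw <- lowess(x, y, f = 0.2)$y

# Root mean squared error against the known signal, ignoring missing values
rmse <- function(fit) sqrt(mean((fit - signal)^2, na.rm = TRUE))
cat("NA values in moving average:", sum(is.na(ma)), "\n")
print(round(c(moving_average = rmse(ma), cubic = rmse(cubic), lowess = rmse(lw)), 3))
```

The moving average leaves gaps at the edges and the global cubic cannot bend enough to follow the full curve, while LOWESS returns a value at every point and adapts to local shape.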
Berger, J. A., Hautaniemi, S., Järvinen, A. K., Edgren, H., Mitra, S. K., & Astola, J. (2004). Optimized LOWESS normalization parameter selection for DNA microarray data. BMC Bioinformatics, 5(1), 1-13.
United States Environmental Protection Agency. (2016). LOESS (or LOWESS). https://www.epa.gov/sites/default/files/2016-07/documents/loess-lowess.pdf