12-07-2023 CS871-2305C-01 Unit 4

Unit 4 - Discussion Board - Purpose

The purpose of this discussion board is to use a dataset available HERE to learn more about the locally weighted scatterplot smoothing (LOWESS) method for multiple regression, as well as how to run it in R.

I picked the following data set:

Package	Item	Title	CSV	Doc
Stat2Data	TechStocks	Daily Prices of Three Tech Stocks	CSV	Doc

I downloaded the CSV file to a local folder and used the following commands to load the data into R:

# Setting working directory
setwd("/Users/jacob/Downloads")

# Load the CSV file into a data frame
tech_stocks <- read.csv("TechStocks.csv")

# Display the first few rows of the data frame
head(tech_stocks)

##   rownames      Date   AAPL   GOOG  MSFT t
## 1        1 12/1/2015 117.34 767.04 55.22 1
## 2        2 12/2/2015 116.28 762.38 55.21 2
## 3        3 12/3/2015 115.20 752.54 54.20 3
## 4        4 12/4/2015 119.03 766.81 55.91 4
## 5        5 12/7/2015 118.28 763.25 55.81 5
## 6        6 12/8/2015 118.23 762.37 55.79 6

# View the column names of your dataset
colnames(tech_stocks)

## [1] "rownames" "Date"     "AAPL"     "GOOG"     "MSFT"     "t"

Discussion Board Task 1: Is using the locally weighted scatterplot smoothing (LOWESS) method for multiple regression models in a k-nearest neighbors-based model a parametric or nonparametric method?

Loess stands for locally estimated scatterplot smoothing (lowess stands for locally weighted scatterplot smoothing) and is one of many non-parametric regression techniques, but arguably the most flexible. A smoothing function is a function that attempts to capture general patterns in stressor-response relationships while reducing the noise and it makes minimal assumptions about the relationships among variables. The result of a loess application is a line through the moving central tendency of the stressor-response relationship. Loess is essentially used to visually assess the relationship between two variables and is especially useful for large datasets, where trends can be hard to visualize (United States Environmental Protection Agency, 2016).

As the definition above makes clear, LOWESS is a nonparametric regression technique. The LOWESS (Locally Weighted Scatterplot Smoothing) method is primarily used for smoothing and for capturing local patterns in a dataset without assuming a specific functional form. It is nonparametric because it does not rely on a predefined model structure: each smoothed value comes from a weighted fit to the nearest neighbors of the point being estimated. Compared with traditional parametric regression models, the approach is more flexible and relies less on assumptions about the data distribution or the functional form.
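To make the "local, nearest-neighbor" idea concrete, here is a minimal sketch of a single LOWESS-style fit at one point. It is an illustration only, not the internal code of R's `lowess()`: the data are simulated, the helper name `local_fit` and the `span` value are assumptions, and the robustness iterations of the full method are left out.

```r
# Illustrative only: one locally weighted linear fit at a single point x0.
set.seed(1)
x <- sort(runif(100, 0, 10))          # simulated predictor (assumption)
y <- sin(x) + rnorm(100, sd = 0.3)    # simulated noisy response (assumption)

local_fit <- function(x0, x, y, span = 0.3) {
  k <- ceiling(span * length(x))            # number of nearest neighbors in the window
  d <- abs(x - x0)                          # distance of every point from x0
  h <- sort(d)[k]                           # distance to the k-th nearest neighbor
  w <- ifelse(d < h, (1 - (d / h)^3)^3, 0)  # tricube weights, zero outside the window
  fit <- lm(y ~ x, weights = w)             # weighted straight-line fit
  unname(predict(fit, newdata = data.frame(x = x0)))
}

val <- local_fit(5, x, y)   # smoothed value at x0 = 5
val
```

Repeating this fit at every x value traces out the smooth curve. The only things chosen in advance are the window size and the weight function, not a global equation for the relationship.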

Discussion Board Task 1.1: Discuss some of the advantages and disadvantages of LOWESS from a computational standpoint.

Locally Weighted Scatterplot Smoothing (LOWESS) has several advantages from a computational standpoint. One notable advantage is its adaptability to complex and non-linear relationships in data. LOWESS fits a simple regression locally, so it can capture intricate patterns without estimating a global functional form. This flexibility is particularly useful when the relationship varies across different regions of the dataset. Additionally, LOWESS handles outliers and noisy data well: nearby points receive larger weights than distant ones, and its robustness iterations reduce the influence of points with large residuals, which contributes to more robust and reliable results.

However, LOWESS also comes with computational challenges. Its cost can be relatively high, especially for large datasets, because a separate weighted regression is fitted at many points across the data, which increases computation time and resource requirements. The smoothing parameter (bandwidth, or span) affects both the cost and the quality of the smooth, and finding a good value typically involves trial and error. LOWESS also produces no compact formula, so the data must be kept in order to evaluate the fit at new points. Despite these challenges, LOWESS remains a powerful tool for exploratory data analysis and visualization, trading extra computation for the ability to capture complex relationships in the data.
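The trial-and-error nature of bandwidth selection can be sketched as follows. The data are simulated (an assumption, chosen so that the true curve is known), and the three candidate spans are arbitrary. The last lines show the `delta` argument of `lowess()`, which skips the local fit at closely spaced x values and interpolates instead, one way to reduce cost on large datasets.

```r
# Illustrative only: try several spans on simulated data with a known true curve.
set.seed(42)
n <- 2000
x <- sort(runif(n, 0, 10))        # simulated, already sorted predictor
truth <- sin(x)                   # known underlying curve (assumption)
y <- truth + rnorm(n, sd = 0.3)   # noisy observations

spans <- c(0.05, 0.2, 0.67)       # candidate smoother spans (f in lowess())
rmse <- sapply(spans, function(f) {
  fit <- lowess(x, y, f = f)
  sqrt(mean((fit$y - truth)^2))   # error against the known curve
})
data.frame(span = spans, rmse = rmse)

# delta = 0 fits at every point; delta > 0 interpolates between nearby points
exact <- lowess(x, y, f = 0.2, delta = 0)
fast  <- lowess(x, y, f = 0.2, delta = 0.1)
max(abs(exact$y - fast$y))        # how much the shortcut changes the smooth
```

With real data the true curve is unknown, so the comparison would have to rely on visual inspection or on a hold-out or cross-validation error instead.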

Discussion Board Task 2: Examine the variables, and prepare a multiple regression in R for the data set TechStocks.csv .

# Explore the structure of the dataset
str(tech_stocks)

## 'data.frame':    504 obs. of  6 variables:
##  $ rownames: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Date    : chr  "12/1/2015" "12/2/2015" "12/3/2015" "12/4/2015" ...
##  $ AAPL    : num  117 116 115 119 118 ...
##  $ GOOG    : num  767 762 753 767 763 ...
##  $ MSFT    : num  55.2 55.2 54.2 55.9 55.8 ...
##  $ t       : int  1 2 3 4 5 6 7 8 9 10 ...

# Display summary statistics of the dataset
summary(tech_stocks)

##     rownames         Date                AAPL             GOOG       
##  Min.   :  1.0   Length:504         Min.   : 90.34   Min.   : 668.3  
##  1st Qu.:126.8   Class :character   1st Qu.:105.70   1st Qu.: 742.5  
##  Median :252.5   Mode  :character   Median :116.22   Median : 786.9  
##  Mean   :252.5                      Mean   :125.01   Mean   : 820.5  
##  3rd Qu.:378.2                      3rd Qu.:146.59   3rd Qu.: 922.3  
##  Max.   :504.0                      Max.   :175.88   Max.   :1054.2  
##       MSFT             t        
##  Min.   :48.43   Min.   :  1.0  
##  1st Qu.:54.82   1st Qu.:126.8  
##  Median :60.75   Median :252.5  
##  Mean   :62.39   Mean   :252.5  
##  3rd Qu.:69.41   3rd Qu.:378.2  
##  Max.   :84.88   Max.   :504.0

# Perform multiple regression with 'rownames' (the row index) as the response and 'AAPL', 'GOOG', and 'MSFT' as predictors
model <- lm(rownames ~ AAPL + GOOG + MSFT, data = tech_stocks)

# Display the regression summary
summary(model)

## 
## Call:
## lm(formula = rownames ~ AAPL + GOOG + MSFT, data = tech_stocks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -162.21  -12.22   14.72   30.22   95.37 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -602.8625    28.2903 -21.310  < 2e-16 ***
## AAPL           1.2549     0.3594   3.492 0.000522 ***
## GOOG          -0.1981     0.1014  -1.953 0.051432 .  
## MSFT          13.8002     1.1254  12.262  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.46 on 500 degrees of freedom
## Multiple R-squared:  0.8944, Adjusted R-squared:  0.8938 
## F-statistic:  1412 on 3 and 500 DF,  p-value: < 2.2e-16
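Because `rownames` is a row counter rather than a price, one might also consider a specification with a stock price as the response. The sketch below is hypothetical and is not the model reported above: it rebuilds only the six rows printed by `head()` earlier (the name `stocks6` is an assumption) so that it runs on its own, and six rows are far too few to draw conclusions from.

```r
# Stand-in for tech_stocks built from the six rows printed by head() above
stocks6 <- data.frame(
  rownames = 1:6,
  AAPL = c(117.34, 116.28, 115.20, 119.03, 118.28, 118.23),
  GOOG = c(767.04, 762.38, 752.54, 766.81, 763.25, 762.37),
  MSFT = c(55.22, 55.21, 54.20, 55.91, 55.81, 55.79),
  t = 1:6
)

# Check whether 'rownames' and 't' carry the same values in these rows
same_index <- identical(stocks6$rownames, stocks6$t)
same_index

# Hypothetical alternative: predict one price from the other two
alt_model <- lm(AAPL ~ GOOG + MSFT, data = stocks6)
coef(alt_model)
```

On the full dataset the same formula could be passed to `lm()` with `data = tech_stocks`.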

Discussion Board Task 3.1 : Show the code and the output for LOWESS curve for AAPL

library(ggplot2)

# Convert 'Date' from character (month/day/year) to a Date object so it plots on a time axis
tech_stocks$Date <- as.Date(tech_stocks$Date, format = "%m/%d/%Y")

# Perform LOWESS analysis on 'AAPL' stock prices
# lowess() takes x (time) first, then y (price)
lowess_result <- lowess(as.numeric(tech_stocks$Date), tech_stocks$AAPL)

# Create a ggplot with the original data and a smoothed curve for AAPL
# Note: geom_smooth(method = 'loess') fits stats::loess(), a close relative of lowess()
ggplot(tech_stocks, aes(x = Date, y = AAPL)) +
  geom_line(aes(color = "Actual"), show.legend = TRUE) +
  geom_smooth(method = 'loess', aes(color = "LOWESS"), show.legend = TRUE) +
  scale_color_manual(values = c("Actual" = "blue", "LOWESS" = "red")) +
  labs(title = 'LOWESS Analysis for AAPL stock', x = 'Date', y = 'Stock Price')

## `geom_smooth()` using formula = 'y ~ x'
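The plot above draws its curve with `geom_smooth(method = 'loess')`, which calls `stats::loess()`, while the `lowess()` function is a separate, older implementation with different defaults. If the curve from `lowess()` itself is wanted, its output can be turned into a data frame for plotting. The sketch below assumes a simulated random-walk price series and hypothetical names (`days`, `price`, `smooth_df`), so it runs without the CSV file.

```r
# Illustrative only: simulated daily prices standing in for a stock series.
set.seed(7)
days  <- seq(as.Date("2015-12-01"), by = "day", length.out = 200)  # hypothetical dates
price <- 100 + cumsum(rnorm(200))                                   # simulated random walk

# lowess(x, y): x is the numeric date, y is the price
lw <- lowess(as.numeric(days), price, f = 2/3, iter = 3)

# Convert the numeric x values back to dates for plotting
smooth_df <- data.frame(
  Date   = as.Date(lw$x, origin = "1970-01-01"),
  smooth = lw$y
)
head(smooth_df)
```

With ggplot2 loaded, a layer such as `geom_line(data = smooth_df, aes(x = Date, y = smooth))` could then be added to a plot of the raw series.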

Discussion Board Task 3.2 : Show the code and the output for LOWESS curve for GOOG

ggplot(tech_stocks, aes(x = Date, y = GOOG)) +
  geom_line(aes(color = "Actual"), show.legend = TRUE) +
  geom_smooth(method = 'loess', aes(color = "LOWESS"), show.legend = TRUE) +
  scale_color_manual(values = c("Actual" = "blue", "LOWESS" = "red")) +
  labs(title = 'LOWESS Analysis for GOOG stock', x = 'Date', y = 'Stock Price')

## `geom_smooth()` using formula = 'y ~ x'

Discussion Board Task 3.3 : Show the code and the output for LOWESS curve for MSFT

ggplot(tech_stocks, aes(x = Date, y = MSFT)) +
  geom_line(aes(color = "Actual"), show.legend = TRUE) +
  geom_smooth(method = 'loess', aes(color = "LOWESS"), show.legend = TRUE) +
  scale_color_manual(values = c("Actual" = "blue", "LOWESS" = "red")) +
  labs(title = 'LOWESS Analysis for MSFT stock', x = 'Date', y = 'Stock Price')

## `geom_smooth()` using formula = 'y ~ x'

Discussion Board Task 4: Discuss the results.

The regression analysis was conducted using a dataset with 504 observations and six variables: ‘rownames’, ‘Date’, ‘AAPL’ (representing Apple stock values), ‘GOOG’ (representing Google stock values), ‘MSFT’ (representing Microsoft stock values), and ‘t’ (an index variable). The summary statistics reveal that the minimum closing prices for the stocks are 90.34 (AAPL), 668.3 (GOOG), and 48.43 (MSFT), while the maximum values are 175.88 (AAPL), 1054.2 (GOOG), and 84.88 (MSFT).

The multiple regression model was fitted using the formula ‘rownames ~ AAPL + GOOG + MSFT’. Note that ‘rownames’ is not a price: it is the row index of each observation and has the same summary statistics as the day counter ‘t’ in the output above, so the model regresses the position in time on the three stock prices. The coefficients indicate that, holding the other variables constant, each one-unit increase in ‘AAPL’ is associated with a 1.2549-unit increase in ‘rownames’, each one-unit increase in ‘GOOG’ with a 0.1981-unit decrease, and each one-unit increase in ‘MSFT’ with a 13.8002-unit increase. The intercept is -602.8625. The p-values indicate that ‘AAPL’ (p = 0.000522) and ‘MSFT’ (p < 2e-16) are statistically significant predictors, while ‘GOOG’ is only marginally significant and misses the 5% threshold (p = 0.051432). The overall model is significant (F = 1412 on 3 and 500 degrees of freedom, p < 2.2e-16) and explains 89.44% of the variance in the response, with an adjusted R-squared of 0.8938. The residuals range from -162.21 to 95.37, with a residual standard error of 47.46 on 500 degrees of freedom. The high R-squared shows that the three prices together track the day index closely over this period; it should not be read as evidence that prices cause the index.

Discussion Board Task 5: How important is parametric data for LOWESS?

Parametric data is not essential for Locally Weighted Scatterplot Smoothing (LOWESS), because LOWESS is a nonparametric method. It captures complex, nonlinear relationships without relying on specific assumptions about the underlying distribution, and it is most useful in scenarios where parametric assumptions may not hold. What does matter is the choice of its tuning parameters: Berger et al. (2004) argue that a systematic approach to obtaining the critical LOWESS parameters is likely to produce data that optimally meets the assumptions made in the preprocessing step, making studies that use LOWESS unambiguous and easier to repeat.

Discussion Board Task 6: How does LOWESS compare to the other procedures?

LOWESS distinguishes itself by capturing intricate, nonlinear patterns in data without assuming a specific functional form. Unlike parametric methods, it fits the data locally, and its weighting scheme limits the influence of outliers, which makes it suitable for diverse datasets. However, it can be computationally intensive, especially for large datasets. Compared with other smoothing techniques such as moving averages or polynomial fits, LOWESS can follow local changes in trend more closely, which makes it particularly valuable for exploratory data analysis and visualization when intricate patterns may exist in the data.
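As a rough illustration of this comparison, the sketch below applies a centered moving average, a global cubic polynomial, and `lowess()` to the same simulated series and measures each against the known underlying curve. The data, the 21-point window, the cubic degree, and the span of 0.2 are all assumptions chosen for the example, and the ranking of the three methods will depend on such choices.

```r
# Illustrative only: three smoothers on one simulated series with a known true curve.
set.seed(3)
x <- 1:300
truth <- 10 * sin(x / 40)           # known underlying curve (assumption)
y <- truth + rnorm(300, sd = 2)     # noisy observations

ma    <- stats::filter(y, rep(1 / 21, 21), sides = 2)  # centered 21-point moving average (NA at the ends)
poly3 <- fitted(lm(y ~ poly(x, 3)))                    # global cubic polynomial fit
lw    <- lowess(x, y, f = 0.2)$y                       # LOWESS with span 0.2

# Root mean squared error against the true curve, ignoring the NA ends
rmse <- function(est) sqrt(mean((est - truth)^2, na.rm = TRUE))
res <- c(moving_average = rmse(ma), cubic = rmse(poly3), lowess = rmse(lw))
res
```

The moving average leaves the first and last ten points undefined, and the cubic imposes one global shape, whereas LOWESS returns a value at every observed x.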

References

Berger, J. A., Hautaniemi, S., Järvinen, A. K., Edgren, H., Mitra, S. K., & Astola, J. (2004). Optimized LOWESS normalization parameter selection for DNA microarray data. BMC Bioinformatics, 5(1), 1-13.

United States Environmental Protection Agency. (2016). LOESS (or LOWESS). https://www.epa.gov/sites/default/files/2016-07/documents/loess-lowess.pdf

12-07-2023 CS871-2305C-01 Unit 4 - Discussion Board

Jeesmon Jacob

2023-12-08