2023-06-11

Regression

  • What is a Simple Linear Regression?
    • A predictive analysis method that uses one independent variable to estimate its effect on the dependent variable
  • Multiple Linear Regression
    • Same basic concept as before, but with multiple independent variables affecting a single dependent variable
  • Multivariate Regression
    • A higher-level form of regression in which one considers the impact that multiple independent variables have on multiple dependent variables
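
In R, these three cases differ mainly in the model formula. A minimal sketch, assuming R's built-in mtcars data set purely for illustration (it is not part of this presentation, and the variable choices are arbitrary):

```r
# Simple linear regression: one predictor, one response
simple <- lm(mpg ~ wt, data = mtcars)

# Multiple linear regression: several predictors, one response
multiple <- lm(mpg ~ wt + hp, data = mtcars)

# Multivariate regression: several predictors, several responses
# (cbind() on the left-hand side makes lm() fit one set of coefficients per response)
multivariate <- lm(cbind(mpg, qsec) ~ wt + hp, data = mtcars)

coef(simple)
coef(multiple)
coef(multivariate)
```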

Apple Example

Let’s take the following Apple data. Say we want to find out, with all else held constant, how Apple’s stock price varies from year to year. For this example, I created a data set with two columns: Price, the price of Apple stock on January 1 of each year, and Days, the number of days elapsed since January 1, 2013, when the recording begins. The observations run at yearly intervals through January 1, 2023.

##    Days  Price
## 1     0  18.82
## 2   365  19.31
## 3   730  27.33
## 4  1095  24.28
## 5  1461  30.00
## 6  1826  44.27
## 7  2129  39.06
## 8  2556  77.58
## 9  2922 139.07
## 10 3287 172.17
## 11 3652 137.81
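
The code that builds apple_data is not shown. A minimal sketch that rebuilds it from the printed table might look like the following. The values are copied exactly as printed, including Days = 2129 in row 7 (a strict yearly spacing would give 2191), because the coefficients reported later depend on them.

```r
# Rebuild apple_data from the values printed in the table above
apple_data <- data.frame(
  Days  = c(0, 365, 730, 1095, 1461, 1826, 2129, 2556, 2922, 3287, 3652),
  Price = c(18.82, 19.31, 27.33, 24.28, 30.00, 44.27, 39.06, 77.58,
            139.07, 172.17, 137.81)
)

# Print the data frame to compare against the table
print(apple_data)
```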

Code For Apple Scatter Plot

library(ggplot2)
scatter_plot <- ggplot(apple_data, aes(x = Days, y = Price)) +
  geom_point() +
  labs(x = "Days", y = "Price") +
  ggtitle("Apple Stock Price") +
  theme_minimal()

This code builds the scatter plot shown below. Next, we want to find the line that best fits these data points, giving a mathematical representation of the trend the stock has followed over time.

Apple Scatter Plot

print(scatter_plot)

Create Linear Equation

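With all of our data points, we want to create a linear model that follows this equation: \[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \] Here \(y_i\) is the Price and \(x_i\) the Days value of observation \(i\), \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\varepsilon_i\) is the error term.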
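
"Best fit" is usually meant in the least-squares sense, which is what R's lm() computes: the coefficients are chosen to minimize the sum of squared vertical distances between the points and the line. Writing \(x_i\) for Days and \(y_i\) for Price in observation \(i\), and \(\bar{x}\), \(\bar{y}\) for their sample means, the estimates for a single predictor have a closed form:

\[ \hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]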

We can use R's lm() function to create a linear regression model of Price on Days.

Linear Model

## 
## Call:
## lm(formula = Price ~ Days, data = apple_data)
## 
## Coefficients:
## (Intercept)         Days  
##    -9.17933      0.04149
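
The call that produced this output is not shown. Going by the Call: line, a minimal sketch that reproduces it could look like the following, with apple_data rebuilt from the table printed earlier. The closed-form cross-check at the end is an addition for illustration, and b0 and b1 are names chosen here.

```r
# apple_data rebuilt from the values printed in the table earlier
apple_data <- data.frame(
  Days  = c(0, 365, 730, 1095, 1461, 1826, 2129, 2556, 2922, 3287, 3652),
  Price = c(18.82, 19.31, 27.33, 24.28, 30.00, 44.27, 39.06, 77.58,
            139.07, 172.17, 137.81)
)

# Fit Price as a linear function of Days by ordinary least squares
model <- lm(Price ~ Days, data = apple_data)
print(model)

# Cross-check with the closed-form least-squares estimates for one predictor
b1 <- with(apple_data, cov(Days, Price) / var(Days))
b0 <- with(apple_data, mean(Price) - b1 * mean(Days))
c(b0 = b0, b1 = b1)
```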

Linear Model Cont

intercept <- coef(model)[1]  # this is our b0, the estimate of beta_0
slope <- coef(model)[2]      # this is our b1, the estimate of beta_1

print(intercept)
## (Intercept) 
##   -9.179325
print(slope)
##       Days 
## 0.04148592

\[ \widehat{\text{Price}} = -9.179325 + 0.04148592 \times \text{Days} \]
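
To see what the fitted equation implies, it can be wrapped in a small function. This is only a sketch: predict_price is a hypothetical helper name, the coefficients are copied from the output above, and evaluating the line past January 1, 2023 is extrapolation rather than a forecast.

```r
# Hypothetical helper: evaluates the fitted line using the coefficients printed above
predict_price <- function(days) {
  -9.179325 + 0.04148592 * days
}

# Days elapsed between the start of the series and January 1, 2024
days_2024 <- as.numeric(as.Date("2024-01-01") - as.Date("2013-01-01"))

predict_price(0)          # value of the line at the start of the series (the intercept)
predict_price(days_2024)  # extrapolated value one year past the last observation
```

At Days = 0 the line returns the intercept, a negative price, while the recorded price was 18.82. This is a reminder that a straight line is only a rough summary of this series.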

GGPlot Code of Regression Line

Using the data set, we calculated the intercept and slope of the best-fit line. We can now draw a scatter plot with that line across it to visualize the trajectory of the stock over time. Note that geom_smooth(method = "lm") refits the same linear model internally, so the line it draws matches the equation above.

best_fit <- ggplot(data = apple_data, aes(x = Days, y = Price)) +
  geom_point() +                            # raw observations
  geom_smooth(method = "lm", se = FALSE) +  # least-squares line, no confidence band
  labs(x = "Days", y = "Price")

Scatter Plot with Line of Best Fit

print(best_fit)
## `geom_smooth()` using formula = 'y ~ x'

Real World Examples

The example we just ran through was only an exercise to understand the basics of regression. Obviously, pricing a stock isn’t as simple as running a single regression and plugging a given year into the equation.

In real-world high finance, private equity, and even commercial real estate, large-scale models are developed that take in enormous amounts of data, because financial markets are affected by a very wide range of factors.

Let’s turn to the world of real estate and look at the following data set, which I found on Kaggle (https://www.kaggle.com/datasets/quantbruce/real-estate-price-prediction/code?datasetId=88705&language=R):

Real Estate Data Set

head(Data)
##   NMB Transaction_Date House_Age Dist_MRT_station Num_of_conv_store Latitude
## 1   1         2012.917      32.0         84.87882                10 24.98298
## 2   2         2012.917      19.5        306.59470                 9 24.98034
## 3   3         2013.583      13.3        561.98450                 5 24.98746
## 4   4         2013.500      13.3        561.98450                 5 24.98746
## 5   5         2012.833       5.0        390.56840                 5 24.97937
## 6   6         2012.667       7.1       2175.03000                 3 24.96305
##   Longitude Price_Unit
## 1  121.5402       37.9
## 2  121.5395       42.2
## 3  121.5439       47.3
## 4  121.5439       54.8
## 5  121.5425       43.1
## 6  121.5125       32.1

Real Estate: Convenience Stores to Price

For the sake of simplicity, we’ll look at one part of that larger predictive analysis: how the number of convenience stores near a home (Num_of_conv_store) affects its price per unit (Price_Unit).
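
A minimal sketch of that single-variable fit follows. It uses only the six rows printed by head(Data) above as a stand-in, so the resulting coefficients are not meaningful. With the full Kaggle data loaded as Data, the same formula would be passed with data = Data. The names sample_rows and store_model are chosen here for illustration.

```r
# Stand-in data: only the six rows shown by head(Data), not the full data set
sample_rows <- data.frame(
  Num_of_conv_store = c(10, 9, 5, 5, 5, 3),
  Price_Unit        = c(37.9, 42.2, 47.3, 54.8, 43.1, 32.1)
)

# Simple linear regression of price per unit on the number of nearby convenience stores
store_model <- lm(Price_Unit ~ Num_of_conv_store, data = sample_rows)

# Intercept and slope; the slope is the estimated change in Price_Unit per additional store
coef(store_model)
```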

Data Science, Finance, and Business

Working with data like the set above takes many data-processing steps, such as cleaning and normalizing, as well as testing the variance of each variable and its impact on the dependent variable. With this more thorough analysis, however, you can predict house prices more accurately. In high finance, where banks and financial institutions manage many classes of assets that each respond differently to market variables, advanced multiple regression techniques can help navigate the uncertainty of various markets. Regression tools are essential for selecting variables, testing theories, forecasting, and even risk assessment.