Question 1

The data set pcancer.data contains 97 observations on a prostate cancer marker (lpsa) and a number of clinical measures for men undergoing cancer treatment. The file pcancer-README.txt includes a brief description of the variables in the data set. Please attempt the following tasks:

## [1] 97
##  [1] "lcavol"  "lweight" "age"     "lbph"    "svi"     "lcp"     "gleason"
##  [8] "pgg45"   "lpsa"    "train"
  1. Produce summary statistics and plots for the independent variables as well as the response variable. Provide commentary on these outputs.

The descriptive statistics are as follows:

## Non-numerical variable(s) ignored: train

Descriptive Statistics
pcancer
N: 97

| | age | gleason | lbph | lcavol | lcp | lpsa | lweight | pgg45 | svi |
|---|---|---|---|---|---|---|---|---|---|
| Mean | 63.87 | 6.75 | 0.10 | 1.35 | -0.18 | 2.48 | 3.63 | 24.38 | 0.22 |
| Std.Dev | 7.45 | 0.72 | 1.45 | 1.18 | 1.40 | 1.15 | 0.43 | 28.20 | 0.41 |
| Min | 41.00 | 6.00 | -1.39 | -1.35 | -1.39 | -0.43 | 2.37 | 0.00 | 0.00 |
| Q1 | 60.00 | 6.00 | -1.39 | 0.51 | -1.39 | 1.73 | 3.38 | 0.00 | 0.00 |
| Median | 65.00 | 7.00 | 0.30 | 1.45 | -0.80 | 2.59 | 3.62 | 15.00 | 0.00 |
| Q3 | 68.00 | 7.00 | 1.56 | 2.13 | 1.18 | 3.06 | 3.88 | 40.00 | 0.00 |
| Max | 79.00 | 9.00 | 2.33 | 3.82 | 2.90 | 5.58 | 4.78 | 100.00 | 1.00 |
| MAD | 5.93 | 0.00 | 2.50 | 1.28 | 0.87 | 1.15 | 0.38 | 22.24 | 0.00 |
| IQR | 8.00 | 1.00 | 2.94 | 1.61 | 2.56 | 1.32 | 0.50 | 40.00 | 0.00 |
| CV | 0.12 | 0.11 | 14.46 | 0.87 | -7.80 | 0.47 | 0.12 | 1.16 | 1.91 |
| Skewness | -0.80 | 1.22 | 0.13 | -0.24 | 0.71 | 0.00 | 0.06 | 0.94 | 1.36 |
| SE.Skewness | 0.24 | 0.24 | 0.24 | 0.24 | 0.24 | 0.24 | 0.24 | 0.24 | 0.24 |
| Kurtosis | 0.96 | 2.36 | -1.75 | -0.60 | -1.01 | 0.43 | 0.36 | -0.37 | -0.16 |
| N.Valid | 97.00 | 97.00 | 97.00 | 97.00 | 97.00 | 97.00 | 97.00 | 97.00 | 97.00 |
| Pct.Valid | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
[1] 0
Comment::

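The data set has 97 observations on 10 variables: “lcavol”, “lweight”, “age”, “lbph”, “svi”, “lcp”, “gleason”, “pgg45”, “lpsa” and “train”. The variable train is non-numeric and is therefore excluded from the descriptive statistics. All nine numeric variables are complete (Pct.Valid is 100 for each).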

Visualization

We began by visualizing the variables, starting with a bar chart of train drawn with ggplot. This categorical variable splits the observations into two groups, FALSE and TRUE.

Comment::

The plot shows that the training set (train = TRUE) is larger than the test set (train = FALSE).

Comment::

The boxplot shows that outliers are present in the data.

Comment::

The density plot is used to check for normality. The curve is roughly bell-shaped, which is consistent with the data coming from an approximately normal distribution.

  1. Extract the training data (where train = T) and estimate an OLS model using all of the variables. Provide detailed commentary on the model output.
## 
## Call:
## lm(formula = lpsa ~ ., data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.64870 -0.34147 -0.05424  0.44941  1.48675 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.429170   1.553588   0.276  0.78334    
## lcavol       0.576543   0.107438   5.366 1.47e-06 ***
## lweight      0.614020   0.223216   2.751  0.00792 ** 
## age         -0.019001   0.013612  -1.396  0.16806    
## lbph         0.144848   0.070457   2.056  0.04431 *  
## svi          0.737209   0.298555   2.469  0.01651 *  
## lcp         -0.206324   0.110516  -1.867  0.06697 .  
## gleason     -0.029503   0.201136  -0.147  0.88389    
## pgg45        0.009465   0.005447   1.738  0.08755 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7123 on 58 degrees of freedom
## Multiple R-squared:  0.6944, Adjusted R-squared:  0.6522 
## F-statistic: 16.47 on 8 and 58 DF,  p-value: 2.042e-12

Interpretation

The estimated regression coefficients \(\beta_0,\dots,\beta_8\) are 0.429170, 0.576543, 0.614020, -0.019001, 0.144848, 0.737209, -0.206324, -0.029503 and 0.009465 respectively. The sign of each coefficient gives the direction in which lpsa moves when that predictor increases with the others held fixed: up for a positive coefficient and down for a negative one. Each t-value is the coefficient divided by its standard error. The multiple R-squared is 0.6944, the fraction of the variation in lpsa that is explained by the predictors, and the overall F-statistic (16.47 on 8 and 58 DF, p-value 2.042e-12) shows that the predictors are jointly significant. At the assumed significance level of 0.05, only lcavol, lweight, lbph and svi are individually statistically significant.

  1. Run the estimated model from (b) on the test data (train = F) and comment on the adequacy of the model. You may use your knowledge of diagnostic testing to suggest improvements for the model.
## [1] 0.7219931
## [1] 0.5233719
## [1] 0.5033799
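
The three values printed above are not labelled in the output. One plausible way to produce out-of-sample accuracy measures for this part is sketched below. The helper name `test_metrics` and the toy vectors are illustrative assumptions, not the code behind the numbers shown.

```r
# Hypothetical helper: accuracy measures for predictions on held-out data.
test_metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(
    RMSE = sqrt(mean(err^2)),                               # root mean squared error
    MAE  = mean(abs(err)),                                  # mean absolute error
    R2   = 1 - sum(err^2) / sum((actual - mean(actual))^2)  # out-of-sample R-squared
  )
}

# Toy example (made-up numbers, only to show the call)
actual    <- c(1.0, 2.0, 3.0, 4.0)
predicted <- c(1.5, 1.5, 3.5, 3.5)
metrics <- test_metrics(actual, predicted)
print(metrics)
```

With a fitted training model `fit` and a test data frame `test` (both hypothetical names), the call would be `test_metrics(test$lpsa, predict(fit, newdata = test))`.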

  1. Estimate an OLS model using all of the data available i.e. all 97 observations. Comment on the model output. Compare this result to results in (b) and (c).
## 
## Call:
## lm(formula = lpsa ~ ., data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.76644 -0.35510 -0.00328  0.38087  1.55770 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.181561   1.320568   0.137  0.89096    
## lcavol       0.564341   0.087833   6.425 6.55e-09 ***
## lweight      0.622020   0.200897   3.096  0.00263 ** 
## age         -0.021248   0.011084  -1.917  0.05848 .  
## lbph         0.096713   0.057913   1.670  0.09848 .  
## svi          0.761673   0.241176   3.158  0.00218 ** 
## lcp         -0.106051   0.089868  -1.180  0.24115    
## gleason      0.049228   0.155341   0.317  0.75207    
## pgg45        0.004458   0.004365   1.021  0.31000    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6995 on 88 degrees of freedom
## Multiple R-squared:  0.6634, Adjusted R-squared:  0.6328 
## F-statistic: 21.68 on 8 and 88 DF,  p-value: < 2.2e-16

Comment:: 

Interpretation

As with the initial model, the estimated regression coefficients \(\beta_0,\dots,\beta_8\) are 0.181561, 0.564341, 0.622020, -0.021248, 0.096713, 0.761673, -0.106051, 0.049228 and 0.004458 respectively, and their signs and t-values are interpreted in the same way. The multiple R-squared is 0.6634, slightly lower than the 0.6944 obtained on the training set. At the assumed significance level of 0.05, lcavol, lweight and svi remain statistically significant, while lbph (p-value 0.09848) is no longer significant.

  1. Estimate a model using the LASSO shrinkage method. Provide comments on how the tuning parameter was selected. Which of the variables were omitted? Compare these results to the OLS results in (d).

Note: Commentary (where requested) should be based on the statistical interpretation of the results. Students are not expected to provide interpretations in terms of the medical/clinical attributes.

Comment::

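A LASSO regression model was fitted because the regression results suggest that multicollinearity is present in the data. The residual sum of squares is \[\operatorname{RSS}=\sum_i(y_i-\hat{y}_i)^2\] and the LASSO chooses the coefficients that minimize \[\operatorname{RSS}+\lambda\sum_j|\beta_j|\] where the second term is the shrinkage penalty and \(\lambda \ge 0\) is the tuning parameter.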
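
The report does not show how \(\lambda\) was chosen. A common approach is k-fold cross-validation, sketched here on simulated data rather than the pcancer data. The sketch assumes the glmnet package is installed; the seed, sample size and true coefficients are illustrative assumptions.

```r
library(glmnet)  # assumption: the glmnet package is installed

set.seed(1)                                   # arbitrary seed for reproducible folds
n <- 100; p <- 8
x <- matrix(rnorm(n * p), n, p)               # simulated predictors, not the pcancer data
colnames(x) <- paste0("x", 1:p)
y <- 2 * x[, 1] - 1.5 * x[, 2] + rnorm(n)     # only x1 and x2 truly matter

cv_fit <- cv.glmnet(x, y, alpha = 1)          # alpha = 1 selects the lasso penalty
cv_fit$lambda.min                             # lambda with the smallest CV error
cv_fit$lambda.1se                             # largest lambda within 1 SE of that minimum

b <- as.matrix(coef(cv_fit, s = "lambda.1se"))  # coefficients at the chosen lambda
omitted <- rownames(b)[b[, 1] == 0]             # variables shrunk exactly to zero
omitted
```

The variables listed in `omitted` are the ones the LASSO drops at the chosen \(\lambda\), which is the comparison with OLS that this part asks for.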

Question 2

Use a simulation study to investigate the behaviour of shrinkage estimators. Start by simulating a regression model and attempt to estimate the true values of the coefficients using standard estimation techniques (OLS or MLE) as well as shrinkage estimators (Ridge or LASSO). Do the shrinkage estimates change if the: 

  1. sample size of the simulation changes?

Yes, it does. The fitted model has 8 predictors (SIZE.8 in the output below). Because the lasso penalty is the sum of the absolute values of the coefficients, large coefficients are not penalized proportionally more than small ones. The lasso penalty therefore does not prioritize shrinking any particular coefficient, unlike the ridge penalty, whose squared terms penalize large coefficients most heavily.
## Determine regression coefficients for SIZE.8 model
## Specified shrinkage intensity lambda (correlation matrix):  0 
## Specified shrinkage intensity lambda.var (variance vector): 0
## $regularization
##        lambda lambda.var
## SIZE.8      0          0
## 
## $std.coefficients
##        (Intercept)    lcavol   lweight       age      lbph       svi        lcp
## SIZE.8           0 0.5931445 0.2422914 -0.118023 0.1755303 0.2563475 -0.2392803
##            gleason     pgg45
## SIZE.8 -0.01731521 0.2296268
## 
## $coefficients
##        (Intercept)    lcavol lweight         age      lbph       svi        lcp
## SIZE.8   0.4291701 0.5765432 0.61402 -0.01900102 0.1448481 0.7372086 -0.2063242
##            gleason       pgg45
## SIZE.8 -0.02950288 0.009465162
## 
## $numpred
## SIZE.8 
##      8 
## 
## $R2
##    SIZE.8 
## 0.6943712 
## 
## $sd.resid
##    SIZE.8 
## 0.6677232 
## 
## attr(,"class")
## [1] "slm"
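
A minimal simulation sketch of the sample-size question, using the closed-form ridge estimator \((X'X+\lambda I)^{-1}X'y\) with a fixed penalty. The true coefficients, the value of `lambda`, the seed and the sample sizes are all assumptions chosen for illustration.

```r
set.seed(42)                          # arbitrary seed
beta_true <- c(2, -1)                 # assumed true slopes (no intercept)
lambda <- 10                          # assumed fixed ridge penalty

ridge_vs_ols <- function(n) {
  x <- matrix(rnorm(n * 2), n, 2)
  y <- drop(x %*% beta_true + rnorm(n))
  xtx <- crossprod(x)                             # X'X
  xty <- crossprod(x, y)                          # X'y
  ols   <- solve(xtx, xty)                        # (X'X)^{-1} X'y
  ridge <- solve(xtx + lambda * diag(2), xty)     # (X'X + lambda I)^{-1} X'y
  c(n = n, ols = drop(ols), ridge = drop(ridge))
}

results <- t(sapply(c(20, 200, 2000), ridge_vs_ols))
print(round(results, 3))
```

For a fixed penalty, the gap between the ridge and OLS columns narrows as n grows, because the data term \(X'X\) increasingly dominates \(\lambda I\).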
  1. size/magnitude of the true coefficients changes?

Comment::

Yes, it does. The original regression gives coefficient values of 0.429170, 0.576543, 0.614020, -0.019001, 0.144848, 0.737209, -0.206324, -0.029503 and 0.009465. The $std.coefficients row of the output above gives 0, 0.5931445, 0.2422914, -0.118023, 0.1755303, 0.2563475, -0.2392803, -0.01731521 and 0.2296268, which differ from the original values. Note that these are standardized coefficients and that the reported shrinkage intensity is lambda = 0, so the $coefficients row coincides with the OLS estimates.

  1. correlation between the explanatory variables changes? (Hint: generate a pair of correlated x variables). Clearly explain your findings for each of the above tasks.
## Warning in cor(x, method = "pearson"): the standard deviation is zero

Comment::

We use Karl Pearson’s coefficient of correlation to check whether there is a relationship between the independent variables. The Pearson product-moment correlation coefficient, denoted by r, measures the strength of a linear association between two variables. In essence, it draws a line of best fit through the data of two variables, and r indicates how far the data points lie from this line (i.e., how well the data points fit the line of best fit). The coefficient r takes values from -1 to +1. A value of 0 indicates that there is no linear association between the two variables. A value greater than 0 indicates a positive association: as the value of one variable increases, so does the value of the other. A value less than 0 indicates a negative association: as the value of one variable increases, the value of the other decreases. The stronger the association between the two variables, the closer r will be to +1 or -1, depending on whether the relationship is positive or negative.

Based on the above, there is a strong relationship between lcavol and lcp, with a correlation coefficient of 0.68; svi, lcp, gleason and pgg45 all show a strong relationship with lcavol. There is a weak relationship between lbph and lcavol, with a correlation coefficient of 0.03.
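
Following the hint in the question, a pair of correlated x variables can be generated and compared with an uncorrelated pair. This is a sketch under assumed values (target correlation `rho`, true coefficients, seed), and it uses OLS standard errors to show why correlated predictors motivate shrinkage.

```r
set.seed(7)                                  # arbitrary seed
n <- 500
rho <- 0.9                                   # assumed target correlation
x1 <- rnorm(n)
x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)  # Corr(x1, x2) = rho in the population
y  <- 1 + 2 * x1 + 2 * x2 + rnorm(n)         # assumed true coefficients

r <- cor(x1, x2, method = "pearson")
r

fit_corr  <- lm(y ~ x1 + x2)
x2_indep  <- rnorm(n)                        # uncorrelated replacement for x2
y_indep   <- 1 + 2 * x1 + 2 * x2_indep + rnorm(n)
fit_indep <- lm(y_indep ~ x1 + x2_indep)

# Standard errors of the x1 slope: inflated when the regressors are correlated
se_corr  <- summary(fit_corr)$coefficients["x1", "Std. Error"]
se_indep <- summary(fit_indep)$coefficients["x1", "Std. Error"]
c(correlated = se_corr, independent = se_indep)
```

The larger standard error in the correlated case reflects less stable OLS estimates, which is the situation where a shrinkage penalty tends to change the estimates most.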

Question 3

The following tasks require the data file interest-rate.csv. Please refer to interest-rate-README.txt for a brief description of the variables in the data set
  1. Using the variables in the data, create an additional variable called spread where spread = r5-Tbill. Plot this newly created variable ensuring that time is on the x-axis. Provide a very brief non-technical understanding of what spread is measuring as well as comment on the plot.

Comment::

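Spread measures the gap between the r5 rate and the Tbill rate. The time plot of spread looks stationary, meaning that properties such as its mean and variance do not change over time.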

  1. Test if spread is stationary and discuss the results. If it is not stationary, take the first difference \((\text{spread}_t - \text{spread}_{t-1})\), and retest for stationarity.

Implementing time series
## [1] "ts"

Augmented Dickey-Fuller test for stationarity

## Warning in adf.test(ts_model): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_model
## Dickey-Fuller = -5.0045, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary

Comment::

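The ADF test reports a p-value of 0.01 (with a warning that the true p-value is smaller), which is below the assumed alpha of 0.05. We therefore reject the null hypothesis of a unit root and conclude that spread is stationary, so no differencing is needed.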
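
The test-then-difference procedure can be sketched on simulated series, since the interest-rate data are not reproduced here. The sketch assumes the tseries package is installed; the seed and the two simulated series are illustrative assumptions.

```r
library(tseries)  # assumption: the tseries package is installed

set.seed(123)                                              # arbitrary seed
stationary  <- arima.sim(model = list(ar = 0.5), n = 300)  # stationary AR(1)
random_walk <- cumsum(rnorm(300))                          # unit-root series

adf_level <- adf.test(stationary)         # H0: unit root; small p-value -> stationary
adf_rw    <- adf.test(random_walk)        # a unit-root series usually fails to reject H0
adf_diff  <- adf.test(diff(random_walk))  # first difference of a random walk is white noise

c(level = adf_level$p.value, walk = adf_rw$p.value, diff = adf_diff$p.value)
```

`adf.test` truncates reported p-values to the range 0.01 to 0.1, which is why it warns that the true p-value is smaller than the printed one.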

  1. Plot the ACF and PACF of the final series from (b). Based on these plots, what is your modelling strategy? Discuss in terms of lag length for both autoregressive and moving-average models i.e. AR(p) and MA(q).

Choosing the moving-average order q

Comment::

The ACF cuts off after lag 6, which suggests a moving-average order of q = 6.

Choosing the autoregressive order p

Comment::

The PACF cuts off after lag 1, which suggests an autoregressive order of p = 1.
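
The cut-off rule used above can be made explicit by listing the lags whose sample autocorrelation falls outside the approximate 95% bound \(\pm 1.96/\sqrt{n}\). This is a sketch on a simulated AR(1) series, not the spread data; the seed and the AR coefficient are assumptions.

```r
set.seed(2024)                                       # arbitrary seed
ar1 <- arima.sim(model = list(ar = 0.8), n = 500)    # simulated AR(1), not the spread data

a <- acf(ar1, lag.max = 20, plot = FALSE)    # set plot = TRUE to draw the correlogram
p <- pacf(ar1, lag.max = 20, plot = FALSE)

bound <- qnorm(0.975) / sqrt(length(ar1))    # approximate 95% significance bound

# Lags whose sample (partial) autocorrelation lies outside the bound
sig_acf  <- which(abs(a$acf[-1]) > bound)    # drop lag 0, which is always 1
sig_pacf <- which(abs(p$acf) > bound)        # pacf() starts at lag 1
sig_acf
sig_pacf
```

For an AR(1) process the PACF is expected to cut off after lag 1 while the ACF decays gradually, so a long run of significant ACF lags does not by itself imply a high MA order.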

  1. Based on your strategy from (c), estimate three models - AR(p), MA(q) or ARMA(p,q). Compare your results to ARMA(1,1). Discuss your results.
## 
## Call:
## arma(x = ts_model, order = c(1, 6))
## 
## Model:
## ARMA(1,6)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.32920 -0.24187 -0.05142  0.24277  1.71216 
## 
## Coefficient(s):
##            Estimate  Std. Error  t value Pr(>|t|)    
## ar1         0.80495     0.10122    7.953 1.78e-15 ***
## ma1         0.27340     0.13350    2.048   0.0406 *  
## ma2        -0.11656     0.13123   -0.888   0.3744    
## ma3         0.15214     0.10399    1.463   0.1435    
## ma4         0.05109     0.09214    0.554   0.5793    
## ma5        -0.05265     0.12540   -0.420   0.6746    
## ma6        -0.05157     0.10831   -0.476   0.6340    
## intercept   0.22982     0.12579    1.827   0.0677 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Fit:
## sigma^2 estimated as 0.2215,  Conditional Sum-of-Squares = 45.4,  AIC = 298.03

Discussion::

\[Y_t=0.22982+0.80495Y_{t-1}+e_t+0.27340e_{t-1}-0.11656e_{t-2}+0.15214e_{t-3}+0.05109e_{t-4}-0.05265e_{t-5}-0.05157e_{t-6}\] The ARMA(1,6) model has AIC = 298.03. Only ar1 and ma1 are significant at the 0.05 level; ma2 to ma6 are not.

## 
## Call:
## arma(x = ts_model, order = c(1, 1))
## 
## Model:
## ARMA(1,1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.47000 -0.23479 -0.05317  0.19203  2.03767 
## 
## Coefficient(s):
##            Estimate  Std. Error  t value Pr(>|t|)    
## ar1         0.75873     0.05466   13.881  < 2e-16 ***
## ma1         0.37937     0.09475    4.004 6.23e-05 ***
## intercept   0.29362     0.08028    3.657 0.000255 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Fit:
## sigma^2 estimated as 0.2253,  Conditional Sum-of-Squares = 47.32,  AIC = 291.72

Discussion::

\[Y_t=0.29362+0.75873Y_{t-1}+e_t+0.37937e_{t-1}\] The ARMA(1,1) model has AIC = 291.72, and all three of its coefficients are highly significant.

  1. Which of these models would you recommend and why? Discuss your reasoning and provide evidence.
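Comment::

The best model is the one with the lowest AIC value, as it indicates a better trade-off between fit and model complexity. ARMA(1,1) has the lower AIC (291.72 against 298.03 for ARMA(1,6)) and all of its coefficients are significant, so ARMA(1,1) is the recommended model.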
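
An AIC comparison across several candidate orders can be sketched as below. This uses base R's `arima()` (maximum likelihood) instead of the `arma()` routine shown in the output above, so the AIC values are not comparable with those printed there; the simulated series, seed and candidate list are assumptions.

```r
set.seed(99)                                                   # arbitrary seed
x <- arima.sim(model = list(ar = 0.75, ma = 0.4), n = 400)     # simulated ARMA(1,1)

# order = c(p, d, q) with d = 0 because the series is already stationary
candidates <- list(
  "AR(1)"     = c(1, 0, 0),
  "MA(6)"     = c(0, 0, 6),
  "ARMA(1,6)" = c(1, 0, 6),
  "ARMA(1,1)" = c(1, 0, 1)
)
fits <- lapply(candidates, function(ord) arima(x, order = ord))
aic  <- sapply(fits, AIC)
sort(aic)                      # smallest AIC first
names(which.min(aic))
```

AIC penalizes each extra parameter, so a larger model such as ARMA(1,6) is preferred only if its additional MA terms improve the likelihood enough to offset the penalty.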

Question 4

Use your knowledge on time series concepts to complete the following questions:
(a) Simulate an AR(1) process \[y_t = \phi_0 + \phi_1y_{t-1} + u_t\] where \[\phi_0 = 0.5, \phi_1 = 1 \text{ and }y_0 = 1\]. Assume that \(u_t\) is normally distributed with mean 0 and variance 1 and let t = 2000. Plot the simulated series and briefly describe it. Focus on describing (not deriving) the behaviour of \(E(y_t)\) and \(V(y_t)\).

  1. Simulate another AR(1) process \[x_t = \theta_0 +\theta_1x_{t-1} + v_t\] where \[\theta_0 = 0.25, \theta_1 = 1 \text{ and } x_0 = 1\]. Assume that \(v_t\) is normally distributed with mean 0 and variance 1.5 and let t = 2000. Plot the simulated series and briefly describe it. Focus on describing (not deriving) the behaviour of \(E(x_t)\) and \(V(x_t)\).

  2. Produce a scatter plot of \(y_t\) (y-axis) and \(x_t\). Next, regress \(y_t\) on \(x_t\). Discuss the regression results, i.e. coefficient estimates, standard errors and other model outputs. You may perform regression diagnostics as well.


## 
## Call:
## lm(formula = y_t ~ x_t)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6774 -0.7862  0.0485  0.7888  4.0988 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.043133   0.025926  -1.664   0.0963 .
## x_t         -0.006934   0.020412  -0.340   0.7341  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.158 on 1998 degrees of freedom
## Multiple R-squared:  5.776e-05,  Adjusted R-squared:  -0.0004427 
## F-statistic: 0.1154 on 1 and 1998 DF,  p-value: 0.7341

The regression coefficients for this model, \(\beta_0\) and \(\beta_1\), are -0.043133 and -0.006934 respectively. The slope is negative, so the fitted value of y decreases as x increases, although the effect is negligible. Each t-value is the coefficient divided by its standard error. The multiple R-squared is 5.776e-05, the fraction of the variation in y that is explained by the predictor x, and the p-value of 0.7341 shows that x is not statistically significant.
  1. Given that we have simulated the two variables independently, can you explain what is happening? Do the results change if any of the model parameters (excl. error terms) are modified? Discuss this.

Comment::

Changing the parameters of either of the two simulated processes changes the estimated regression coefficients and the other model outputs.
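
Since the question sets \(\phi_1 = \theta_1 = 1\), each process is a random walk with drift, so \(E(y_t) = y_0 + \phi_0 t\) grows linearly in t and \(V(y_t) = t\sigma_u^2\) grows without bound. A minimal sketch of that specification follows. The seed and variable names are assumptions, and it is not the code behind the regression table above, so its numbers will differ.

```r
set.seed(2000)                           # arbitrary seed
n <- 2000
u <- rnorm(n, mean = 0, sd = 1)          # Var(u_t) = 1
v <- rnorm(n, mean = 0, sd = sqrt(1.5))  # Var(v_t) = 1.5

# With phi_1 = theta_1 = 1 each series is a random walk with drift:
# y_t = y_0 + phi_0 * t + (sum of shocks up to t)
y <- 1 + 0.5  * (1:n) + cumsum(u)
x <- 1 + 0.25 * (1:n) + cumsum(v)

fit <- lm(y ~ x)
summary(fit)$r.squared                   # typically high despite independent shocks

# Regressing first differences removes the trends shared by the two levels
fit_diff <- lm(diff(y) ~ diff(x))
summary(fit_diff)$r.squared
```

A high R-squared in the levels regression alongside a near-zero one in differences is the usual sign of a spurious regression between independently generated trending series.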

Question 5

A client would like to invest $20 million dollars in companies listed on the Australian Stock Exchange (ASX). The client would like your advice on (a) Which stocks they should invest in. (b) How much should they invest in each stock. Prepare a case outlining how you will attempt to find a solution to this problem i.e. methodology only. You should begin by considering an appropriate data source and outline a step by step approach. Do consider applying the topics/concepts that we have covered in this unit. However, you may also use concepts from other units that you have already completed. The final output will be a series of steps containing details of how you would fulfill the client’s request. Please explicitly state any assumptions you have made such as focusing on stock prices versus stock returns. Identifying and stating any challenges associated with your proposed methodology will demonstrate that you have thought about this in-depth.

Step 1: Selecting the Australian Stock Exchange companies

Here we consider two ASX-listed companies, BHP_GROUP_LIMITED and CSL LIMITED. The two are compared to guide the client on which stock to invest in and how much to invest in each.

Step 2: Getting the Stock data

The data for both companies were collected from 2019 onwards.

##            BHP.Close
## 2019-01-02  42.69402
## 2019-01-03  41.38269
## 2019-01-04  43.90723
## 2019-01-07  44.13916
## 2019-01-08  44.00535
## 2019-01-09  44.24621
##            CSL.Close
## 2019-01-02    100.97
## 2019-01-03     99.04
## 2019-01-04    101.04
## 2019-01-07    101.04
## 2019-01-08    102.59
## 2019-01-09    104.01

Step 3: Calculating the Stock data returns

## [1] "BHP_GROUP_LIMITED"
##            monthly.returns
## 2019-01-31      0.06726412
## 2019-02-28      0.03285917
## 2019-03-29      0.03291172
## 2019-04-30     -0.03196701
## 2019-05-31     -0.01983660
## 2019-06-28      0.11282631
## [1] "CSL LIMITED"
##            monthly.returns
## 2019-01-31      0.06480467
## 2019-02-28      0.13320644
## 2019-03-29     -0.00374440
## 2019-04-30      0.14264401
## 2019-05-31     -0.05905694
## 2019-06-28      0.05188951
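
The arithmetic behind a return series can be sketched with the six BHP closing prices printed in Step 2. These are daily prices, so the results are daily rather than monthly returns; the variable names are illustrative.

```r
# First six BHP closing prices shown in the output above
bhp_close <- c(42.69402, 41.38269, 43.90723, 44.13916, 44.00535, 44.24621)

# Simple return: R_t = P_t / P_{t-1} - 1
simple_ret <- diff(bhp_close) / head(bhp_close, -1)

# Log return: r_t = log(P_t) - log(P_{t-1})
log_ret <- diff(log(bhp_close))

round(cbind(simple = simple_ret, log = log_ret), 5)

# Return over the whole window: compounding the simple returns
total_ret <- prod(1 + simple_ret) - 1
total_ret
```

Log returns add up across periods while simple returns compound, which is one reason to state explicitly which definition the analysis uses.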

Step 4: Methodology

Here we adopt linear regression, a statistical method for prediction.

Linear regression

The initial step is formulating the model. A simple linear regression model has a single regressor \(x\) that is related to a response variable \(y\):

\[y=\beta_0+\beta_1x+\epsilon\]

The regression coefficients \(\beta_0\) and \(\beta_1\) are the intercept and slope parameters, and \(\epsilon\) is the random error component, which is assumed to have mean zero and unknown variance \(\sigma^2\):

\[\epsilon\sim N(0,\sigma^2)\]

The errors are assumed to be uncorrelated, meaning the error of one observation does not depend on that of another. The response has conditional mean and variance

\[E(y \mid x) =\beta_0+\beta_1x\]

\[\operatorname{Var}(y \mid x)= \operatorname{Var}(\beta_0+\beta_1 x+ \epsilon)=\sigma^2\]

Method of least squares for the regression coefficients

The estimation of \(\beta_0\) and \(\beta_1\) proceeds as follows. Let

\[y_i=\beta_0+\beta_1x_i+\epsilon_i , \quad i=1,\dots,n\]

\[S(\beta_0,\beta_1)=\sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i )^2\]

The least-squares estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) satisfy

\[\left.\frac{\partial S}{\partial \beta_0}\right|_{\hat{\beta}_0,\hat{\beta}_1}=-2\sum_{i=1}^n (y_i-\hat{\beta}_0-\hat{\beta}_1 x_i ) =0\]

and

\[\left.\frac{\partial S}{\partial \beta_1}\right|_{\hat{\beta}_0,\hat{\beta}_1}=-2\sum_{i=1}^n (y_i-\hat{\beta}_0-\hat{\beta}_1 x_i )x_i=0\]

which simplify to the normal equations

\[n\hat{\beta}_0+\hat{\beta}_1\sum_{i=1}^n x_i =\sum_{i=1}^n y_i\]

\[\hat{\beta}_0\sum_{i=1}^n x_i +\hat{\beta}_1\sum_{i=1}^n x_i^2 =\sum_{i=1}^n y_i x_i\]

Solving them gives

\[\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}\]

\[\hat{\beta}_1=\frac{\sum_{i=1}^n y_i x_i-\frac{(\sum_{i=1}^n y_i)(\sum_{i=1}^n x_i)}{n}}{\sum_{i=1}^n x_i^2-\frac{(\sum_{i=1}^n x_i )^2}{n}}\]

where \[\bar{y}=\frac{1}{n} \sum_{i=1}^n y_i\] and \[\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i\]

Here we build the model \[\text{stock price}=\beta_0+\beta_1\,\text{stock returns}+\epsilon\]
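
The closed-form least-squares formulas for the slope and intercept can be checked numerically against R's `lm()`. The data below are simulated, and the true values 3 and 2, the seed and the sample size are assumptions made for illustration.

```r
set.seed(10)                              # arbitrary seed
n <- 50
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n)                 # assumed true beta_0 = 3, beta_1 = 2

# Slope and intercept from the closed-form least-squares formulas
b1 <- (sum(x * y) - sum(x) * sum(y) / n) / (sum(x^2) - sum(x)^2 / n)
b0 <- mean(y) - b1 * mean(x)

fit <- lm(y ~ x)                          # R's built-in OLS for comparison
rbind(closed_form = c(b0, b1), lm = unname(coef(fit)))
```

The two rows agree up to floating-point error, which confirms that the normal equations and `lm()` solve the same minimization problem.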

Step 5: Analysis

## 
## Call:
## lm(formula = BHP.Close ~ monthly.returns, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.898  -5.440  -2.395   7.183  16.537 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       51.770      1.281  40.424   <2e-16 ***
## monthly.returns   23.418     14.045   1.667    0.103    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.68 on 44 degrees of freedom
## Multiple R-squared:  0.05943,    Adjusted R-squared:  0.03805 
## F-statistic:  2.78 on 1 and 44 DF,  p-value: 0.1025
## [1] "72.0692397137279 MSE"
## 
## Call:
## lm(formula = CSL.Close ~ monthly.returns, data = .)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -75.34 -45.38 -23.38  44.57 119.57 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      176.247      8.617  20.453   <2e-16 ***
## monthly.returns  105.229    111.190   0.946    0.349    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 55.8 on 44 degrees of freedom
## Multiple R-squared:  0.01995,    Adjusted R-squared:  -0.002324 
## F-statistic: 0.8957 on 1 and 44 DF,  p-value: 0.3491
## [1] "2978.09256343547 MSE"

Step 6: Interpretation
To choose the better company we compare the MSE of the two models, where a lower value indicates a better model. BHP_GROUP_LIMITED has the lower MSE (72.07 against 2978.09 for CSL LIMITED), so it is considered the better stock to invest in.

Based on each model's multiple R-squared, which measures its predictive capability, the client should invest a proportion of 0.05943 (5.943%) in BHP_GROUP_LIMITED and 0.01995 (1.995%) in CSL LIMITED.

Challenges and problems::

How to classify each company in the linear model