The data set pcancer.data contains 97 observations on a prostate cancer marker (lpsa) and a number of clinical measures for men undergoing cancer treatment. The file pcancer-README.txt includes a brief description of the variables in the data set. Please attempt the following tasks:
## [1] 97
## [1] "lcavol" "lweight" "age" "lbph" "svi" "lcp" "gleason"
## [8] "pgg45" "lpsa" "train"
## Non-numerical variable(s) ignored: train
Descriptive Statistics
pcancer
N: 97
| | age | gleason | lbph | lcavol | lcp | lpsa | lweight | pgg45 | svi |
|---|---|---|---|---|---|---|---|---|---|
| Mean | 63.87 | 6.75 | 0.10 | 1.35 | -0.18 | 2.48 | 3.63 | 24.38 | 0.22 |
| Std.Dev | 7.45 | 0.72 | 1.45 | 1.18 | 1.40 | 1.15 | 0.43 | 28.20 | 0.41 |
| Min | 41.00 | 6.00 | -1.39 | -1.35 | -1.39 | -0.43 | 2.37 | 0.00 | 0.00 |
| Q1 | 60.00 | 6.00 | -1.39 | 0.51 | -1.39 | 1.73 | 3.38 | 0.00 | 0.00 |
| Median | 65.00 | 7.00 | 0.30 | 1.45 | -0.80 | 2.59 | 3.62 | 15.00 | 0.00 |
| Q3 | 68.00 | 7.00 | 1.56 | 2.13 | 1.18 | 3.06 | 3.88 | 40.00 | 0.00 |
| Max | 79.00 | 9.00 | 2.33 | 3.82 | 2.90 | 5.58 | 4.78 | 100.00 | 1.00 |
| MAD | 5.93 | 0.00 | 2.50 | 1.28 | 0.87 | 1.15 | 0.38 | 22.24 | 0.00 |
| IQR | 8.00 | 1.00 | 2.94 | 1.61 | 2.56 | 1.32 | 0.50 | 40.00 | 0.00 |
| CV | 0.12 | 0.11 | 14.46 | 0.87 | -7.80 | 0.47 | 0.12 | 1.16 | 1.91 |
| Skewness | -0.80 | 1.22 | 0.13 | -0.24 | 0.71 | 0.00 | 0.06 | 0.94 | 1.36 |
| SE.Skewness | 0.24 | 0.24 | 0.24 | 0.24 | 0.24 | 0.24 | 0.24 | 0.24 | 0.24 |
| Kurtosis | 0.96 | 2.36 | -1.75 | -0.60 | -1.01 | 0.43 | 0.36 | -0.37 | -0.16 |
| N.Valid | 97.00 | 97.00 | 97.00 | 97.00 | 97.00 | 97.00 | 97.00 | 97.00 | 97.00 |
| Pct.Valid | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
## [1] 0
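The summary above matches the output style of summarytools::descr; below is a minimal sketch that could reproduce it and the missing-value count (the file name and whitespace delimiter are assumptions):

```r
library(summarytools)

# Read the data; pcancer.data is assumed to be whitespace-delimited with a header
pcancer <- read.table("pcancer.data", header = TRUE)

nrow(pcancer)        # sample size
names(pcancer)       # variable names
descr(pcancer)       # descriptive statistics; non-numeric columns are ignored
sum(is.na(pcancer))  # count of missing values
```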
Comment:
The data set has a sample size of 97 with 10 variables, namely
"lcavol", "lweight", "age", "lbph", "svi", "lcp", "gleason", "pgg45",
"lpsa" and "train", where train is a non-numeric variable.
We began by visualizing the variables. First, we drew a bar chart of
the training indicator using the ggplot command, grouping this
categorical variable into two classes, FALSE and TRUE. From the plot
we see that the training set is larger than the testing set.
From the density plot we checked for normality: the data are roughly
bell shaped, suggesting they come from an approximately normal
distribution.
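A minimal sketch of the visualizations described above, assuming ggplot2 and the pcancer data frame from the previous step:

```r
library(ggplot2)

# Bar chart of the training indicator, grouped into its two classes
ggplot(pcancer, aes(x = factor(train))) +
  geom_bar() +
  labs(x = "train", y = "count")

# Density plot of lpsa to check for an approximate bell shape
ggplot(pcancer, aes(x = lpsa)) +
  geom_density()
```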
##
## Call:
## lm(formula = lpsa ~ ., data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.64870 -0.34147 -0.05424 0.44941 1.48675
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.429170 1.553588 0.276 0.78334
## lcavol 0.576543 0.107438 5.366 1.47e-06 ***
## lweight 0.614020 0.223216 2.751 0.00792 **
## age -0.019001 0.013612 -1.396 0.16806
## lbph 0.144848 0.070457 2.056 0.04431 *
## svi 0.737209 0.298555 2.469 0.01651 *
## lcp -0.206324 0.110516 -1.867 0.06697 .
## gleason -0.029503 0.201136 -0.147 0.88389
## pgg45 0.009465 0.005447 1.738 0.08755 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7123 on 58 degrees of freedom
## Multiple R-squared: 0.6944, Adjusted R-squared: 0.6522
## F-statistic: 16.47 on 8 and 58 DF, p-value: 2.042e-12
The regression coefficients for our model, that is \[\beta_0,\dots,\beta_{8}\], are
0.429170, 0.576543, 0.614020, -0.019001, 0.144848, 0.737209, -0.206324, -0.029503
and 0.009465 respectively. Each coefficient shows how the dependent
variable changes with its predictor: increasing for a positive value,
decreasing for a negative one. The t-value is the model's coefficient
divided by its standard error. The model had a multiple R-squared of
0.6944, which is the fraction of the variation in the dependent
variable (lpsa) that can be predicted from the predictor variables.
From the p-values, only the following independent variables were
statistically significant (p-value less than the assumed alpha of
0.05): lcavol, lweight, lbph and svi.
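The 58 residual degrees of freedom imply the model was fit on the 67 training rows only. A sketch of such a fit, assuming dplyr piping as suggested by the `lm(formula = lpsa ~ ., data = .)` Call line:

```r
library(dplyr)

# Fit the full model on the training subset only
train_fit <- pcancer %>%
  filter(train %in% c(TRUE, "T")) %>%  # train may be stored as logical or "T"/"F"
  select(-train) %>%
  lm(lpsa ~ ., data = .)

summary(train_fit)
```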
## [1] 0.7219931
## [1] 0.5233719
## [1] 0.5033799
##
## Call:
## lm(formula = lpsa ~ ., data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.76644 -0.35510 -0.00328 0.38087 1.55770
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.181561 1.320568 0.137 0.89096
## lcavol 0.564341 0.087833 6.425 6.55e-09 ***
## lweight 0.622020 0.200897 3.096 0.00263 **
## age -0.021248 0.011084 -1.917 0.05848 .
## lbph 0.096713 0.057913 1.670 0.09848 .
## svi 0.761673 0.241176 3.158 0.00218 **
## lcp -0.106051 0.089868 -1.180 0.24115
## gleason 0.049228 0.155341 0.317 0.75207
## pgg45 0.004458 0.004365 1.021 0.31000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6995 on 88 degrees of freedom
## Multiple R-squared: 0.6634, Adjusted R-squared: 0.6328
## F-statistic: 21.68 on 8 and 88 DF, p-value: < 2.2e-16
As with the initial model, the regression coefficients for this model,
that is \[\beta_0,\dots,\beta_{8}\], are
0.181561, 0.564341, 0.622020, -0.021248, 0.096713, 0.761673, -0.106051, 0.049228
and 0.004458 respectively. Each coefficient shows how the dependent
variable changes with its predictor: increasing for a positive value,
decreasing for a negative one. The t-value is the model's coefficient
divided by its standard error. The model had a multiple R-squared of
0.6634, lower than that of the training-set model; it is the fraction
of the variation in the dependent variable (lpsa) that can be
predicted from the predictor variables. From the p-values, only the
following independent variables remain statistically significant
(p-value less than the assumed alpha of 0.05): lcavol, lweight and
svi, while lbph is no longer significant.
A LASSO regression model was fit because the regression model showed
that multicollinearity is present in the data. With the residual sum of
squares \[\operatorname{RSS}=\sum_i(y_i-\hat{y}_i)^2\]
the LASSO minimizes \[\operatorname{RSS}+\lambda\sum_j|\beta_j|\]
where the second term is the shrinkage penalty.
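This penalty can be fit with, for example, the glmnet package (a sketch; the package actually used for the output further below is not shown):

```r
library(glmnet)

# Predictor matrix and response
X <- as.matrix(pcancer[, c("lcavol", "lweight", "age", "lbph",
                           "svi", "lcp", "gleason", "pgg45")])
y <- pcancer$lpsa

# Cross-validation chooses the shrinkage intensity lambda;
# alpha = 1 gives the LASSO penalty
cv_fit <- cv.glmnet(X, y, alpha = 1)
coef(cv_fit, s = "lambda.min")  # coefficients at the selected lambda
```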
Use a simulation study to investigate the behaviour of shrinkage estimators. Start by simulating a regression model and attempt to estimate the true values of the coefficients using standard estimation techniques (OLS or MLE) as well as shrinkage estimators (Ridge or LASSO). Do the shrinkage estimates change if the:
## Determine regression coefficients for SIZE.8 model
## Specified shrinkage intensity lambda (correlation matrix): 0
## Specified shrinkage intensity lambda.var (variance vector): 0
## $regularization
## lambda lambda.var
## SIZE.8 0 0
##
## $std.coefficients
## (Intercept) lcavol lweight age lbph svi lcp
## SIZE.8 0 0.5931445 0.2422914 -0.118023 0.1755303 0.2563475 -0.2392803
## gleason pgg45
## SIZE.8 -0.01731521 0.2296268
##
## $coefficients
## (Intercept) lcavol lweight age lbph svi lcp
## SIZE.8 0.4291701 0.5765432 0.61402 -0.01900102 0.1448481 0.7372086 -0.2063242
## gleason pgg45
## SIZE.8 -0.02950288 0.009465162
##
## $numpred
## SIZE.8
## 8
##
## $R2
## SIZE.8
## 0.6943712
##
## $sd.resid
## SIZE.8
## 0.6677232
##
## attr(,"class")
## [1] "slm"
Yes, they do. The coefficient values from the original regression
model, 0.429170, 0.576543, 0.614020, -0.019001, 0.144848, 0.737209,
-0.206324, -0.029503 and 0.009465, changed under shrinkage to the
standardized estimates 0, 0.5931445, 0.2422914, -0.118023, 0.1755303,
0.2563475, -0.2392803, -0.01731521 and 0.2296268, showing clear
variation between the two sets of estimates.
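A minimal simulation sketch for the task stated earlier, comparing OLS with a LASSO shrinkage estimator (the true coefficients, seed, dimensions and use of glmnet are all illustrative assumptions):

```r
library(glmnet)
set.seed(123)

n <- 100; p <- 5
beta_true <- c(2, -1.5, 0, 0, 1)  # illustrative true coefficients
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% beta_true + rnorm(n))

ols <- coef(lm(y ~ X))                     # standard OLS estimates
lasso <- coef(cv.glmnet(X, y, alpha = 1),  # shrinkage estimates
              s = "lambda.min")
cbind(OLS = ols, LASSO = as.numeric(lasso))
```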
## Warning in cor(x, method = "pearson"): the standard deviation is zero
We will use Karl Pearson's coefficient of correlation to check whether
relationships exist among the independent variables. The Pearson
product-moment correlation coefficient measures the strength of a
linear association between two variables and is denoted by r.
Essentially, it fits a line of best fit through the data of the two
variables, and r indicates how far the data points lie from this line
(i.e., how well the points fit the line). The coefficient r can take
values from +1 to -1. A value of 0 indicates no association between
the two variables. A value greater than 0 indicates a positive
association: as the value of one variable increases, so does the value
of the other. A value less than 0 indicates a negative association: as
one variable increases, the other decreases. The stronger the
association between the two variables, the closer r is to +1 or -1,
depending on whether the relationship is positive or negative.
Based on this, we see a strong relationship between lcavol and lcp,
with a correlation coefficient of 0.68; svi, lcp, gleason and pgg45
all show strong relationships with lcavol. We also see a weak
relationship between lbph and lcavol, with a correlation coefficient
of 0.03.
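A sketch of the correlation computation referenced above (Pearson method, numeric columns only):

```r
# Pearson correlation matrix of the numeric variables
num_vars <- pcancer[, sapply(pcancer, is.numeric)]
round(cor(num_vars, method = "pearson"), 2)
```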
The trend plot looks stationary, meaning the statistical properties of
the series do not change over time.
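A sketch of how the series could have been constructed and tested; the column actually used as the series is not shown, so lpsa is a placeholder assumption, and the test uses the tseries package:

```r
library(tseries)

# Treat the lpsa values as a time series (an assumption about the original code)
ts_model <- ts(pcancer$lpsa)
class(ts_model)

# Augmented Dickey-Fuller test for stationarity
adf.test(ts_model)
```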
## [1] "ts"
## Warning in adf.test(ts_model): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: ts_model
## Dickey-Fuller = -5.0045, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
From the ADF test, the p-value is less than the assumed alpha, hence
the series is stationary.
The cut-off for the ACF is 6, meaning that the q value of the
moving-average component is 6, while the cut-off for the PACF is 1,
meaning that the p value of the auto-regressive component is 1.
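These cut-offs would be read from the correlogram plots, e.g.:

```r
# ACF suggests the MA order q; PACF suggests the AR order p
acf(ts_model)
pacf(ts_model)
```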
##
## Call:
## arma(x = ts_model, order = c(1, 6))
##
## Model:
## ARMA(1,6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.32920 -0.24187 -0.05142 0.24277 1.71216
##
## Coefficient(s):
## Estimate Std. Error t value Pr(>|t|)
## ar1 0.80495 0.10122 7.953 1.78e-15 ***
## ma1 0.27340 0.13350 2.048 0.0406 *
## ma2 -0.11656 0.13123 -0.888 0.3744
## ma3 0.15214 0.10399 1.463 0.1435
## ma4 0.05109 0.09214 0.554 0.5793
## ma5 -0.05265 0.12540 -0.420 0.6746
## ma6 -0.05157 0.10831 -0.476 0.6340
## intercept 0.22982 0.12579 1.827 0.0677 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Fit:
## sigma^2 estimated as 0.2215, Conditional Sum-of-Squares = 45.4, AIC = 298.03
\[\operatorname{Y_t}=0.22982+0.80495Y_{t-1}+e_t+0.27340e_{t-1}-0.11656e_{t-2}+0.15214e_{t-3}+0.05109e_{t-4}-0.05265e_{t-5}-0.05157e_{t-6}\] AIC = 298.03 measures the relative quality of the ARMA(1,6) model.
##
## Call:
## arma(x = ts_model, order = c(1, 1))
##
## Model:
## ARMA(1,1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.47000 -0.23479 -0.05317 0.19203 2.03767
##
## Coefficient(s):
## Estimate Std. Error t value Pr(>|t|)
## ar1 0.75873 0.05466 13.881 < 2e-16 ***
## ma1 0.37937 0.09475 4.004 6.23e-05 ***
## intercept 0.29362 0.08028 3.657 0.000255 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Fit:
## sigma^2 estimated as 0.2253, Conditional Sum-of-Squares = 47.32, AIC = 291.72
\[\operatorname{Y_t}=0.29362+0.75873Y_{t-1}+e_t+0.37937e_{t-1}\]
AIC = 291.72 is lower than that of the ARMA(1,6) model, indicating that ARMA(1,1) is the better-fitting model.
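A sketch of how the two candidate models can be fit and compared, assuming the tseries package suggested by the Call lines:

```r
library(tseries)

# Fit both candidates and compare their AIC values (lower is better)
fit_arma16 <- arma(ts_model, order = c(1, 6))
fit_arma11 <- arma(ts_model, order = c(1, 1))
summary(fit_arma16)  # AIC = 298.03
summary(fit_arma11)  # AIC = 291.72
```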
Use your knowledge on time series concepts to complete the following
questions:
(a) Simulate an AR(1) process \[y_t = \phi_0 + \phi_1y_{t-1} + u_t\]
where \[\phi_0 = 0.5, \phi_1 = 1 \text{ and } y_0 = 1.\] Assume that
\(u_t\) is normally distributed with mean 0 and variance 1 and let
t = 2000. Plot the simulated series and briefly describe it. Focus on
describing (not deriving) the behaviour of \(E(y_t)\) and \(V(y_t)\).
Simulate another AR(1) process \[x_t = \theta_0 +\theta_1x_{t-1} + v_t\]
where \[\theta_0 = 0.25, \theta_1 = 1 \text{ and } x_0 = 1.\] Assume
that \(v_t\) is normally distributed with mean 0 and variance 1.5 and
let t = 2000. Plot the simulated series and briefly describe it. Focus
on describing (not deriving) the behaviour of \(E(x_t)\) and \(V(x_t)\).
Produce a scatter plot of yt (y-axis) and xt. Next, regress yt on xt. Discuss the regression results, i.e. coefficient estimates, standard errors and other model outputs. You may perform regression diagnostics as well.
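A minimal sketch of the simulation and regression as specified (the seed is an assumption):

```r
set.seed(42)
t_max <- 2000

# y_t = 0.5 + 1 * y_{t-1} + u_t,  u_t ~ N(0, 1),  y_0 = 1
y_t <- numeric(t_max); y_t[1] <- 1
for (t in 2:t_max) y_t[t] <- 0.5 + y_t[t - 1] + rnorm(1)

# x_t = 0.25 + 1 * x_{t-1} + v_t,  v_t ~ N(0, 1.5),  x_0 = 1
x_t <- numeric(t_max); x_t[1] <- 1
for (t in 2:t_max) x_t[t] <- 0.25 + x_t[t - 1] + rnorm(1, sd = sqrt(1.5))

plot(ts(y_t)); plot(ts(x_t))  # plot each simulated series
plot(x_t, y_t)                # scatter plot of y_t against x_t
summary(lm(y_t ~ x_t))        # regression of y_t on x_t
```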
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
##
## Call:
## lm(formula = y_t ~ x_t)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6774 -0.7862 0.0485 0.7888 4.0988
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.043133 0.025926 -1.664 0.0963 .
## x_t -0.006934 0.020412 -0.340 0.7341
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.158 on 1998 degrees of freedom
## Multiple R-squared: 5.776e-05, Adjusted R-squared: -0.0004427
## F-statistic: 0.1154 on 1 and 1998 DF, p-value: 0.7341
The regression coefficients for this model, that is \[\beta_0 \text{ and } \beta_{1}\], are
-0.043133 and -0.006934 respectively; both are negative, indicating a
decreasing effect on the dependent variable. The t-value is the
model's coefficient divided by its standard error. The model had a
multiple R-squared of 5.776e-05, the negligible fraction of the
variation in the dependent variable y that can be predicted from the
predictor x. From the p-value, the independent variable was not
statistically significant.

A client would like to invest $20 million in companies listed
on the Australian Stock Exchange (ASX). The client would like your
advice on (a) which stocks they should invest in, and (b) how much
they should invest in each stock. Prepare a case outlining how you
will attempt to find a solution to this problem, i.e. methodology
only. You should begin by considering an appropriate data source and
outline a step-by-step approach. Do consider applying the
topics/concepts that we have covered in this unit. However, you may
also use concepts from other units that you have already completed.
The final output will be a series of steps containing details of how
you would fulfill the client's request. Please explicitly state any
assumptions you have made, such as focusing on stock prices versus
stock returns. Identifying and stating any challenges associated with
your proposed methodology will demonstrate that you have thought about
this in-depth.
Step 1: Selecting the Australian Stock Exchange companies
Here we will consider BHP Group Limited and CSL Limited; a comparison
of the two will guide the client on which stock to invest in and how
much to invest in each.
Step 2: Getting the stock data
The data for both stock companies were collected from 2019 onwards.
## Loading required package: xts
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
## Loading required package: TTR
## BHP.Close
## 2019-01-02 42.69402
## 2019-01-03 41.38269
## 2019-01-04 43.90723
## 2019-01-07 44.13916
## 2019-01-08 44.00535
## 2019-01-09 44.24621
## CSL.Close
## 2019-01-02 100.97
## 2019-01-03 99.04
## 2019-01-04 101.04
## 2019-01-07 101.04
## 2019-01-08 102.59
## 2019-01-09 104.01
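A sketch of how these closing prices could be pulled with quantmod (the tickers are assumptions inferred from the BHP.Close and CSL.Close column names above):

```r
library(quantmod)

# Daily prices from Yahoo Finance, starting 2019
getSymbols(c("BHP", "CSL"), src = "yahoo", from = "2019-01-01")

head(Cl(BHP))  # BHP Group Limited closing prices
head(Cl(CSL))  # CSL Limited closing prices
```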
Step 3: Calculating the Stock data returns
## [1] "BHP_GROUP_LIMITED"
## monthly.returns
## 2019-01-31 0.06726412
## 2019-02-28 0.03285917
## 2019-03-29 0.03291172
## 2019-04-30 -0.03196701
## 2019-05-31 -0.01983660
## 2019-06-28 0.11282631
## [1] "CSL LIMITED"
## monthly.returns
## 2019-01-31 0.06480467
## 2019-02-28 0.13320644
## 2019-03-29 -0.00374440
## 2019-04-30 0.14264401
## 2019-05-31 -0.05905694
## 2019-06-28 0.05188951
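The monthly.returns columns above match quantmod's monthlyReturn; a sketch continuing from the previous one:

```r
# Monthly simple returns computed from the closing prices
bhp_ret <- monthlyReturn(Cl(BHP))
csl_ret <- monthlyReturn(Cl(CSL))
head(bhp_ret)
head(csl_ret)
```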
Step 4: Methodology
Here we will adopt linear regression, a statistical method for prediction.
The initial step is formulating the model. A simple linear regression model has a single regressor x related to a response variable y: \[y=\beta_0+\beta_1x+\epsilon\] The regression coefficients \(\beta_0\) and \(\beta_1\) are the intercept and slope parameters, and \(\epsilon\) is the random error component, assumed to have mean zero and unknown variance: \[\epsilon\sim N(0,\sigma^2)\] The error terms are assumed to be uncorrelated, meaning the error of one observation does not depend on that of another. The model has conditional mean and variance \[\operatorname{E}(y \mid x) =\beta_0+\beta_1x\] \[\operatorname{Var}(y \mid x)= \operatorname{Var}(\beta_0+\beta_1 x+ \epsilon)=\sigma^2\]
Method of least squares for the regression coefficients: the estimation of \(\beta_0\) and \(\beta_1\) proceeds as follows. Let \[y_i=\beta_0+\beta_1x_i+\epsilon_i , \quad i=1,\dots,n\] \[S(\beta_0,\beta_1)=\sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)^2\]
Setting the partial derivatives to zero at \(\hat{\beta}_0,\hat{\beta}_1\) gives
\[\frac{\partial S}{\partial\beta_0}\Big|_{\hat{\beta}_0,\hat{\beta}_1}=-2\sum_{i=1}^n (y_i-\hat{\beta}_0-\hat{\beta}_1 x_i) =0\]
and
\[\frac{\partial S}{\partial \beta_1}\Big|_{\hat{\beta}_0,\hat{\beta}_1}=-2\sum_{i=1}^n (y_i-\hat{\beta}_0-\hat{\beta}_1 x_i)x_i=0\]
which yield the normal equations \[n\hat{\beta}_0+\hat{\beta}_1\sum_{i=1}^n x_i =\sum_{i=1}^n y_i\] \[\hat{\beta}_0\sum_{i=1}^n x_i +\hat{\beta}_1\sum_{i=1}^n x_i^2 =\sum_{i=1}^n y_i x_i\]
with solutions \[\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}\]
\[\hat{\beta}_1=\frac{\sum_{i=1}^n y_i x_i-\frac{(\sum_{i=1}^n y_i)(\sum_{i=1}^n x_i)}{n}}{\sum_{i=1}^n x_i^2-\frac{(\sum_{i=1}^n x_i )^2}{n}}\] where \[\bar{y}=\frac{1}{n} \sum_{i=1}^n y_i \quad\text{and}\quad \bar{x}=\frac{1}{n}\sum_{i=1}^n x_i\] Here we will build the model \[\operatorname{stock\ price}=\beta_0+\beta_1 \cdot \operatorname{stock\ returns}+\epsilon\]
Step 5: Analysis
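A sketch of how the regression and the MSE line below might have been produced for BHP (aligning month-end closes with the monthly returns is an assumption; CSL is analogous):

```r
# Month-end closing prices aligned with the monthly returns from the sketch above
bhp_close <- apply.monthly(Cl(BHP), last)
bhp_df <- data.frame(BHP.Close = as.numeric(bhp_close),
                     monthly.returns = as.numeric(bhp_ret))

bhp_fit <- lm(BHP.Close ~ monthly.returns, data = bhp_df)
summary(bhp_fit)
paste(mean(residuals(bhp_fit)^2), "MSE")  # mean squared error of the fit
```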
##
## Call:
## lm(formula = BHP.Close ~ monthly.returns, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.898 -5.440 -2.395 7.183 16.537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.770 1.281 40.424 <2e-16 ***
## monthly.returns 23.418 14.045 1.667 0.103
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.68 on 44 degrees of freedom
## Multiple R-squared: 0.05943, Adjusted R-squared: 0.03805
## F-statistic: 2.78 on 1 and 44 DF, p-value: 0.1025
## [1] "72.0692397137279 MSE"
##
## Call:
## lm(formula = CSL.Close ~ monthly.returns, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -75.34 -45.38 -23.38 44.57 119.57
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 176.247 8.617 20.453 <2e-16 ***
## monthly.returns 105.229 111.190 0.946 0.349
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 55.8 on 44 degrees of freedom
## Multiple R-squared: 0.01995, Adjusted R-squared: -0.002324
## F-statistic: 0.8957 on 1 and 44 DF, p-value: 0.3491
## [1] "2978.09256343547 MSE"
Step 6: Interpretation
To choose the better company we consider the MSE, where a lower value
indicates a better model; in this scenario we consider BHP Group
Limited the better stock to invest in.
Based on each model's multiple R-squared, which reflects its predictive capability, the client could weight the investment by 0.05943 (about 5.9%) towards BHP Group Limited and 0.01995 (about 2.0%) towards CSL.
A remaining question is how to classify each company in the linear model.
Comment:
From the boxplot, outliers were detected.
