IMPORTANT NOTE: I have coded my functions so that the argument n produces a vector of length n with n - 1 steps. This is a slight deviation from the assignment, which requested functions where the argument n produces a vector of size n + 1 with n steps.
The first question regards random walks. I have created two functions that produce identical results by setting the random seed to a pre-defined number. I have also measured elapsed time, to illustrate the benefits of vectorisation. Note that Sys.time() is less precise than system.time() (which also reports CPU time) or the dedicated benchmarking packages: it only reads the system clock, so the benchmark will be affected by other work still completing (e.g. graphics rendering) or by other system processes. This article was quite helpful: https://nicercode.github.io/guides/repeating-things/.
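As a sketch of the more precise alternative, base R's system.time() reports user and system CPU time alongside the elapsed wall-clock time:
system.time(cumsum(runif(10^7, -1, 1))) #returns user, system and elapsed times for the expression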
If n < 10^6, the functions will plot a line graph of the walk. By default the functions use a random seed each time they are called and do not produce graphs.
options(digits = 10)
options(scipen = 10)
#par(mfrow = c(2,1))
randomWalkslow.func <- function(n = 100,k = 1,graphs = FALSE, random = TRUE) { #defining the function
if (random == FALSE) {
set.seed(12345) #setting seed for reproducibility
}
timeStart_slow <- Sys.time() #measuring initial start time
slow_walk.vec <- NULL #vector is grown inside the loop; no starting 0 is stored, since R vectors are 1-indexed and assigning to position 0 is a silent no-op
for (i in 1:n) { #starts loop
if(i == 1) {
slow_walk.vec[i] <- (runif(1,-k,+k)) #sets first value using runif
}
else {
slow_walk.vec[i] <- (slow_walk.vec[i - 1] + runif(1,-k,+k)) # assigns next runif output plus previous vector value to current iteration position in the vector
}
}
timeFinish_slow <- Sys.time()
timeTaken_slow <- timeFinish_slow - timeStart_slow #calculating time taken
if (n < 1000000){ #skips making a graph for very large numbers
if (graphs == TRUE){
plot(slow_walk.vec, main = "Slow Walk", sub = paste("Time taken (s): ",timeTaken_slow, sep = " "), ylab = "Current Walk Value", xlab = "Step No.", type = "s")
mtext(text = paste("n = ",n,", k = ", k,sep = " "), col = "red",adj = 1)
#lines(slow_walk.vec)
}
}
slow_walk.list <- list(slow_walk.vec,timeTaken_slow) #bundles results as list for output
invisible(slow_walk.list) #suppresses output, but can still be assigned
}
randomWalkfast.func <- function(n = 100,k = 1,graphs = FALSE, random = TRUE) {
if (random == FALSE) {
set.seed(12345) #setting seed for reproducibility
}
timeStart_fast <- Sys.time() #measuring initial time
fast_walk.vec <- cumsum(runif(n, -k, +k)) #runif generates a vector of n random numbers from the uniform distribution; cumsum cumulatively sums them (no starting 0 is stored, since R vectors are 1-indexed)
timeFinish_fast <- Sys.time() #measuring final time
timeTaken_fast <- timeFinish_fast - timeStart_fast #calculating time taken
if (n < 1000000){
if (graphs == TRUE){
plot(fast_walk.vec, main = "Fast Walk", sub = paste("Time taken (s): ",timeTaken_fast), ylab = "Current Walk Value", xlab = "Step No.",type = "s")
mtext(text = paste("n = ",n,", k = ", k,sep = " "), col = "red",adj = 1)
#lines(fast_walk.vec, lwd = 0.1)
}
}
fast_walk.list <- list(fast_walk.vec,timeTaken_fast)
invisible(fast_walk.list) #suppresses output, but can still be assigned
}
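As a quick sanity check of the claim that the two functions produce identical results, a minimal sketch (assuming the functions as defined above): with random = FALSE both draw from the same seeded stream, and cumsum performs the same sequential additions as the loop.
all.equal(randomWalkslow.func(n = 10, random = FALSE)[[1]],
          randomWalkfast.func(n = 10, random = FALSE)[[1]]) #should return TRUE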
par(mfrow = c(2,1))
randomWalkfast.list <- randomWalkfast.func(100000, 10, graphs = TRUE, random = FALSE); randomWalkslow.list <- randomWalkslow.func(100000, 10, graphs = TRUE, random = FALSE)
Trying out 10^7 steps:
options(digits = 3)
options(scipen = 4)
randomWalkfast.list <- randomWalkfast.func(10000000,100, random = FALSE)
Sys.sleep(2)
randomWalkslow.list <-randomWalkslow.func(10000000,100, random = FALSE)
results.vec <- c(length(unlist(randomWalkfast.list[1])), length(unlist(randomWalkslow.list[1])),
                 sum(unlist(randomWalkfast.list[1])), sum(unlist(randomWalkslow.list[1])),
                 randomWalkfast.list[2], randomWalkslow.list[2])
kable(as.data.frame(results.vec,col.names = c("Length Fast","Length Slow","Sum Fast","Sum Slow","Time Fast","Time Slow")),caption = "A Comparison of Random Walk Functions for 10^7 steps")
print(head(unlist(randomWalkfast.list[1])))
print(head(unlist(randomWalkslow.list[1]))) #showing head of vector to check for errors
| Length.Fast | Length.Slow | Sum.Fast | Sum.Slow | Time.Fast | Time.Slow |
|---|---|---|---|---|---|
| 10000000 | 10000000 | 457733962914 | 457733962914 | 0.634 secs | 23 secs |
## [1] 44.2 119.3 171.5 248.8 240.1 173.3
## [1] 44.2 119.3 171.5 248.8 240.1 173.3
I have made the plot a ‘step’ type plot to emphasise the way the data was generated.
randomWalkfast.func(graphs = TRUE)
thousand_walks_final <- replicate(1000, unlist(randomWalkfast.func(n = 50, k = 5)[[1]][50]))
#plot(sort(thousand_walks_final)) - interesting ECDF looking plot
plot(density(thousand_walks_final), main = "Density Histogram of 1K Random Walks", ylim = c(0, 0.022), xlab = "x")
x <- seq(-60, 60, 0.1) #x values for the overlaid density curve
curve(dnorm(x, mean = 0, sd = 5 * sqrt(50/3)), add = TRUE, col = "red", lwd = 2) #overlaying the theoretical normal density
options(digits = 3)
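The standard deviation used for the overlaid normal curve follows from the variance of a single Uniform(-k, k) step, \((2k)^2/12 = k^2/3\), summed over the 50 independent steps with k = 5:
\[\mathrm{sd}\left(\sum_{i=1}^{50}U_i\right) = \sqrt{50\cdot\frac{k^2}{3}} = 5\sqrt{\frac{50}{3}} \approx 20.4\]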
#REF = https://www.r-bloggers.com/normal-distribution-functions/
I have excluded sex from the matrix scatterplot, as this is a difficult plot to interpret (a barplot would better serve this purpose).
blood.df <- read.csv("blood.csv")
plot(blood.df[c(2,3,4)], panel = panel.smooth, main = "Matrix Scatterplot of Blood Variables")
kable(cor(blood.df[c(2,3,4)]), caption = "Correlation matrix of blood variables")
|   | Age | RedBCC | Haemoglobin |
|---|---|---|---|
| Age | 1.000 | 0.046 | 0.142 |
| RedBCC | 0.046 | 1.000 | 0.822 |
| Haemoglobin | 0.142 | 0.822 | 1.000 |
Red blood cell count has a strong positive linear relationship with haemoglobin. The flat plateau seen in the bottom-middle panel can be attributed to sparsity of data (the curves shown are LOESS smoothers, not OLS regression lines).
Almost no relationship is observed between haemoglobin and age.
Age has a very weak negative quadratic relationship with red blood cell count, although it would be fair to say there is no relationship here at all.
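As a sketch, the strongest of these relationships can be tested formally with base R's cor.test():
cor.test(blood.df$RedBCC, blood.df$Haemoglobin) #tests the null hypothesis that the true correlation is 0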
par(mfrow = c(1,2),mai = c(0.4,1,0.8,0.6))
boxplot(blood.df$Haemoglobin[blood.df$Sex == "M"], main = "Male Haemoglobin Counts", ylim = c(11,16),ylab = "Haemoglobin count")
mtext("Whisker length is defined as 1.5 times the IQR", side = 1, line = 1)
boxplot(blood.df$Haemoglobin[blood.df$Sex == "F"],main = "Female Haemoglobin Counts", ylim = c(11,16),ylab = "Haemoglobin count")The distribution of male haemoglobin counts is located right of the distribution of female haemoglobin counts, with the median count being around 15, versus a female median count of around 13. The female distribution has greater spread than the male distribution, with the IQR being 2 grams per decilitre, versus the male IQR of about 1 gram per decilitre The female distribution is also more symmetric than the male distribution. The female first and third haemoglobin quartiles are roughly equi-distant from the median, whilst the male distribution has some leftward skew.
Hence males generally have more haemoglobin in their blood than females. All patients of both genders in our sample have between 11 and 16 grams of haemoglobin per decilitre.
There is one outlier noticeable on the boxplots, a male with a haemoglobin count of around 12.5g/dL. In practice this point would be worth investigating, as it may reflect an issue with data collection (perhaps a female patient miscoded as a male).
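A minimal sketch of how that record might be pulled out for inspection (the 13 g/dL cutoff is an assumed threshold, chosen only to isolate the visible outlier):
blood.df[blood.df$Sex == "M" & blood.df$Haemoglobin < 13, ] #candidate miscoded record(s)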
blood_fit_simple.lm <- lm(data = blood.df, Haemoglobin ~ Sex)
kable(summary(as.matrix(blood_fit_simple.lm$residuals)), col.names = "Residuals")
| Residuals |
|---|
| Min.   :-2.413 |
| 1st Qu.:-0.827 |
| Median : 0.152 |
| Mean   : 0.000 |
| 3rd Qu.: 0.687 |
| Max.   : 2.052 |
stargazer(blood_fit_simple.lm, title = "Haemoglobin as a function of Sex", align = T, type = "html", omit.stat = c("aic","bic","ll"))
|  | Dependent variable: |
|---|---|
|  | Haemoglobin |
| SexM | 1.860*** |
|  | (0.306) |
| Constant | 13.000*** |
|  | (0.207) |
| Observations | 50 |
| R2 | 0.437 |
| Adjusted R2 | 0.425 |
| Residual Std. Error | 1.080 (df = 48) |
| F Statistic | 37.200*** (df = 1; 48) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
The median of the residuals is close to the mean, relative to the scale of the residuals, indicating only a small amount of skew. This probably does not constitute a violation of the constant-variance or normality assumptions underlying the regression.
The estimated intercept for haemoglobin is 13.0. This indicates that the estimated haemoglobin value for a given patient when all predictors take the value 0 is 13.0 grams per decilitre of blood. This would give an estimate of haemoglobin for a female patient.
The p-value for SexM is less than 0.05, indicating evidence against the null hypothesis that gender does not influence haemoglobin counts. Hence we reject the null hypothesis in favour of the alternative hypothesis that gender does affect haemoglobin counts. The p-value for the intercept is also less than 0.05, indicating evidence against the null hypothesis that the regression line passes through the origin (i.e. that the true intercept is zero). Hence we reject that null hypothesis in favour of the alternative that the intercept is not zero.
The regression coefficient estimate for SexM is 1.9, indicating that the average expected shift in measured grams of haemoglobin per decilitre when moving from a female to male gender would be 1.9.
The p-value for the model is less than 0.05. Hence we can be confident our model explains at least some of the variation in the response variable and performs better than simply taking the mean of the response (a horizontal fitted line at mean haemoglobin) and computing the residuals from that line; the ratio of explained to residual variation produces a statistically unlikely result on the F-distribution. This means using gender to estimate haemoglobin performs significantly better than simply using mean haemoglobin as an estimate.
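As a worked check using the values reported above, the F statistic for this single-predictor model is recoverable from R-squared and the degrees of freedom:
(0.437 / 1) / ((1 - 0.437) / 48) #approx 37.2, matching the reported F statistic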
The estimated coefficient of determination is 0.44, indicating that 44% of the variation in haemoglobin counts is explained by our model. The adjusted coefficient of determination is roughly equivalent to the coefficient of determination, as we would expect for such a simple model. This is generally considered a moderate value, indicating that a model using gender alone has a moderate ability to explain the observed variation in measured haemoglobin.
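A sketch of the corresponding interval estimates, using base R's confint() on the fitted model:
confint(blood_fit_simple.lm, level = 0.95) #95% confidence intervals for the intercept and SexM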
#blood_fit.lm <- lm(data = blood.df, Haemoglobin ~ Sex + RedBCC + Sex*RedBCC) #unnecessary, as Sex*RedBCC already expands to include the additive terms; output identical
blood_fit.lm <- lm(data = blood.df, Haemoglobin ~ Sex*RedBCC)
kable(summary(as.matrix(blood_fit.lm$residuals)), col.names = "Residuals")
| Residuals |
|---|
| Min.   :-1.870 |
| 1st Qu.:-0.367 |
| Median : 0.165 |
| Mean   : 0.000 |
| 3rd Qu.: 0.435 |
| Max.   : 1.439 |
stargazer(blood_fit.lm, title = "Haemoglobin as a function of Sex and Red Blood Cell Count", align = T, type = "html", omit.stat = c("aic","bic","ll"))
|  | Dependent variable: |
|---|---|
|  | Haemoglobin |
| SexM | 9.740*** |
|  | (2.630) |
| RedBCC | 2.830*** |
|  | (0.382) |
| SexM:RedBCC | -1.990*** |
|  | (0.558) |
| Constant | 0.940 |
|  | (1.640) |
| Observations | 50 |
| R2 | 0.754 |
| Adjusted R2 | 0.738 |
| Residual Std. Error | 0.728 (df = 46) |
| F Statistic | 47.000*** (df = 3; 46) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
As above, the median of the residuals is close to the mean, relative to the scale of the residuals, indicating only a small amount of skew. This probably does not constitute a violation of normality of the residuals.
The estimated intercept for haemoglobin is 0.94, indicating when all predictors equal zero (impossible in reality), we would expect to measure 0.94 grams of haemoglobin per decilitre in a patient’s blood stream. This would give an estimate of haemoglobin for a female patient who has no red blood cells.
The estimated coefficient of SexM is 9.7, indicating that the expected difference in haemoglobin when moving from female to male on the binary sex predictor is 9.7 grams per decilitre, ignoring the effect of the interaction of sex and red blood cell count. This estimate is relative to the expected haemoglobin measurement for female patients.
The estimated coefficient of RedBCC is 2.8, showing that each additional unit of red blood cell count per microlitre of blood corresponds to a 2.8 increase in measured grams of haemoglobin per decilitre. This applies where SexM = 0, i.e. for female patients; for an interpretation involving male patients, see the interaction term.
The estimated coefficient of the interaction between sex and red blood cell count is -2.0 grams of haemoglobin per decilitre. This indicates that the positive effect of red blood cell count on haemoglobin is weaker for males: although being male increases haemoglobin counts overall, the per-unit effect of red blood cell count is reduced. Hence for a male, an increase of one unit of red blood cell count per microlitre corresponds to a 2.83 - 1.99 = 0.84 increase in measured grams of haemoglobin per decilitre.
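A sketch of how these implied slopes could be extracted from the fitted object (coefficient names follow R's default treatment coding, with female as the reference level):
coef(blood_fit.lm)["RedBCC"] #RedBCC slope for females, approx 2.83
coef(blood_fit.lm)["RedBCC"] + coef(blood_fit.lm)["SexM:RedBCC"] #RedBCC slope for males, approx 0.84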
The p-values for the estimated coefficients of the predictors are less than 0.05. Hence, given our model, we have sufficient evidence to reject the null hypothesis of no effect for each predictor. The p-value for the intercept, however, is greater than 0.05, so we do not reject the null hypothesis that the intercept is zero: it is reasonably likely we observed this OLS intercept estimate by chance, due to the variation involved in sampling from the population (this is obvious when comparing the standard error of the intercept to its estimated value).
The p-value for the model is less than 0.05. As with the simpler model, we can be confident the model explains at least some of the variation in the response variable and performs better than simply taking the mean of the response (a horizontal fitted line at mean haemoglobin) and computing the residuals from that line; the ratio of explained to residual variation produces a statistically unlikely result on the F-distribution. This means using red blood cell count, gender and their interaction to predict haemoglobin performs significantly better than using mean haemoglobin as an estimate.
The estimated coefficient of determination is 0.75, indicating that 75% of the variation in haemoglobin counts is explained by our model. The adjusted coefficient of determination is roughly equivalent to the coefficient of determination, indicating that the complexity of the model is not especially disadvantageous (logical, given the model has only an intercept, two predictors and an interaction term). This is generally considered quite a high value, and both R-sq and R-sq (adj.) are greatly improved upon the simpler model. This provides some evidence that adding RedBCC and the interaction term leads to a superior model, one with a strong ability to explain variation in the measured haemoglobin of patients.
The residual standard error is 0.73, an improvement upon the 1.08 of the simple model, showing that the fit has improved. Some improvement is guaranteed, as the simple model is nested within the more complicated model. Hence complexity-compensating statistics, such as adjusted R-sq or AIC/BIC, could help us decide whether the added complexity is worth the loss in degrees of freedom and the risk of overfitting.
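A sketch of that comparison: a partial F-test of the nested models via anova(), alongside the information criteria mentioned above.
anova(blood_fit_simple.lm, blood_fit.lm) #tests whether the added terms significantly improve fit
AIC(blood_fit_simple.lm, blood_fit.lm)
BIC(blood_fit_simple.lm, blood_fit.lm)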
par(mfrow = c(2,2))
plot(blood_fit.lm)
The residuals versus fitted plot shows relatively constant scatter in the residuals as the fitted values increase, so we can be confident the assumption of constant variance of the residuals is fulfilled. There is also no overarching trend in the residuals, which provides evidence that our model fits the data reasonably well (and consistently). As with the residuals versus fitted plot, the scale-location plot shows little cause for concern: there is no clear trend or pattern in the residuals, and the model appears to fit the data reasonably well.
The residuals versus fitted plot shows a few points that lie around -2 standard deviations from 0, but this is not especially unlikely. The rest of the residuals lie within +/- 1 standard deviation of 0, hence the model fits the data reasonably well.
The normal Q-Q plot shows some minor overarching curvature in the residuals. This systematic variation could be a cause for concern, as it suggests the model does not consistently fit the data (if it did, the residuals would be normally distributed). Hence a transformation of the data or a change in the model may be needed. In particular, there is clear deviation at the lowest theoretical quantiles of -2 to -1. Two of the marked outliers (indices 7 and 26) should be investigated. If a decision was required, this would probably not be a severe enough departure to constitute a violation of the assumption of normality of the residuals. As the sample size is reasonably large (n > 30), the confidence intervals for the coefficient estimates will likely be supported by the Central Limit Theorem.
The residuals versus leverage plot indicates several points of concern. Residuals that are large (poorly fitted by the model) and have high leverage (able to unduly influence the fit) are likely to be biasing the model. Some unmarked points have high leverage but are fitted quite well, and so are not of undue concern. However, point 26 exerts high leverage and sits 3 standard deviations from the fitted regression line, so it (and point 7 to a lesser extent) is a cause for concern. This is reflected by its high Cook's distance value, which is almost 1.
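A sketch for inspecting the flagged observations directly (indices 7 and 26, as read off the plot labels):
cooks.distance(blood_fit.lm)[c(7, 26)] #Cook's distances of the two flagged points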
Notes on interpreting diagnostic plots correctly were found here:
https://www.andrew.cmu.edu/user/achoulde/94842/lectures/lecture09/lecture09-94842.html
It is worth noting that generally, the regression method used here is reasonably robust to violations of the underlying assumptions. Hence, unless there are any clear or severe violations of the assumptions we pick up on the diagnostic plots we can probably move forward with the analysis.
shapiro.test(residuals(blood_fit.lm))
##
## Shapiro-Wilk normality test
##
## data: residuals(blood_fit.lm)
## W = 1, p-value = 0.2
The p-value for the Shapiro-Wilk test is greater than 0.05 (W = 1). This indicates a lack of evidence against the null hypothesis, supporting the notion that the residuals were drawn from a normal distribution.
I have used colour to differentiate the lines here as opposed to dashes.
seq_blood_female <- data.frame(RedBCC = seq(from = 3.5, to = 6.0, length.out = 100),Sex = 'F')
seq_blood_male <- data.frame(RedBCC = seq(from = 3.5, to = 6.0, length.out = 100),Sex = 'M')
predict_blood_male <- predict(blood_fit.lm,seq_blood_male)
predict_blood_female <- predict(blood_fit.lm,seq_blood_female)
male_predictions.df <- cbind(predict_blood_male, seq_blood_male)
female_predictions.df <- cbind(predict_blood_female, seq_blood_female)
colnames(male_predictions.df)[1] <- "Haemoglobin"
colnames(female_predictions.df)[1] <- "Haemoglobin"
blood_predictions.df <- rbind(male_predictions.df,female_predictions.df)
blood_predictions.df$Sex <- as.character(blood_predictions.df$Sex)
ggplot() +
geom_line(data = blood_predictions.df,mapping = aes(x = RedBCC, y = Haemoglobin, group = Sex, color = Sex)) +
geom_point(data = blood.df, mapping = aes(y = Haemoglobin, x = RedBCC, color = Sex)) +
labs(title = "Predicted Haemoglobin by Red Blood Cells and Gender", y = "Haemoglobin, g/dL", x = expression(paste("Red Blood Cell Count, cells/", mu, "L")), caption = "model includes interaction of red blood cell count and gender", subtitle = "original data points overlaid") +
scale_color_discrete(name = "Gender", labels = c("Female","Male"), breaks = c("F","M")) + theme_minimal()
blood_model_matrix.mx <- model.matrix(blood_fit.lm)
blood_response.vec <- blood.df$Haemoglobin
The ordinary least squares estimator for \(\beta\) is:
\[\hat{\beta} = {({X^T}{X})^{-1}}{X^T}{y}\]
beta_estimates_blood.mx <- solve(t(blood_model_matrix.mx) %*% blood_model_matrix.mx) %*% t(blood_model_matrix.mx) %*% blood_response.vec #solve() with a single argument returns the inverse of its input, here t(X) %*% X
beta_estimates_blood.mx %>% as.data.frame() -> beta_estimates_blood.df
colnames(beta_estimates_blood.df) <- c("Least Squares Coefficient Estimates")
kable(beta_estimates_blood.df, caption = "Beta Coefficient Estimates")
|  | Least Squares Coefficient Estimates |
|---|---|
| (Intercept) | 0.94 |
| SexM | 9.74 |
| RedBCC | 2.83 |
| SexM:RedBCC | -1.99 |
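As a sketch, these manual estimates should agree with R's own coefficients to floating-point precision:
all.equal(as.vector(beta_estimates_blood.mx), unname(coef(blood_fit.lm))) #should return TRUE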
Here I have calculated the hat matrix as per the formula in the notes, and obtained y-hat by multiplying the hat matrix with the haemoglobin values in the data set.
To verify this corresponds to what R's lm function is doing, I have subtracted the model's fitted y-hat values from the manually calculated y-hat values and summed the result. As can be seen, the difference is negligible. I have also printed the first three values of the hat vector from lm.influence() (the first three diagonals of the hat matrix), for comparison with the manually calculated hat matrix.
From pg 19 Practical Regression and Anova using R (Faraway, 2002), the hat-matrix projects y orthogonally onto the predictor space, X.
The formula used below corresponds to:
\[H = {X}{({X^T}{X})^{-1}}{X^T}\]
#blood_diagnostics <- influence(blood_fit.lm)
blood_hat.mx <- blood_model_matrix.mx %*% (solve(t(blood_model_matrix.mx) %*% blood_model_matrix.mx)) %*% t(blood_model_matrix.mx) #hat matrix calculation
blood_y_hat <- blood_hat.mx %*% blood_response.vec #obtaining y-hat from hat matrix and observed y
sum(blood_y_hat - blood_fit.lm$fitted.values) #calculating difference from manual y-hat and R lm fitted values
kable(blood_hat.mx[1:3,1:3],caption = "Manual Hat Matrix Sample")
kable(data.frame('Diagonals' = lm.influence(blood_fit.lm)$hat[1:3]), caption = "Lm.Influence - first three diagonal hat values")
## [1] -3.34e-11
| 1 | 2 | 3 |
|---|---|---|
| 0.041 | 0.000 | 0.000 |
| 0.000 | 0.049 | 0.037 |
| 0.000 | 0.037 | 0.052 |
| Diagonals |
|---|
| 0.041 |
| 0.049 |
| 0.052 |
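One further check (a sketch): the trace of the hat matrix equals the number of estimated parameters, here four (intercept, SexM, RedBCC and the interaction).
sum(diag(blood_hat.mx)) #should return 4 (up to floating-point error)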
Manually calculated RSS and the result of extracting RSS from the linear model using the deviance() function are shown below. This result follows from the normal equations. Ultimately, calculating RSS this way is identical to the geometric method, but has some nifty by-products (e.g. the hat matrix).
The formula used below directly corresponds to:
\[RSS = {y^T}{y}-{y^T}{X}{({X^T}{X})^{-1}}{X^T}{y}\]
Which is equivalent to:
\[RSS = {y^T}{[I - H]}{y}\]
The identity matrix in the calculation below is formed using:
\[{I} = ({X^T}{X})({X^T}{X})^{-1}\]
Source (page 5): http://pj.freefaculty.org/guides/stat/Regression/RegressionDiagnostics/OlsHatMatrix.pdf
Unfortunately, generating the identity matrix from the hat matrix itself (instead of from the model matrix, as in the formula above) proved difficult, resulting in a "singularity error": H is a rank-4 projection matrix and so is not invertible, and in any case \(({X^T}{X})({X^T}{X})^{-1}\) yields a 4 x 4 identity whereas \(I - H\) requires a 50 x 50 one. Hence the 'hackish' compromise of using the diag() function in R.
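A sketch demonstrating the rank deficiency behind that error:
qr(blood_hat.mx)$rank #should be 4, well below nrow(blood_hat.mx) = 50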
#t(blood_response.vec) %*% blood_response.vec - 2 %*% t(beta_estimates_blood.mx) %*% t(blood_model_matrix.mx) %*% blood_response.vec + t(beta_estimates_blood.mx) %*% t(blood_model_matrix.mx) %*% blood_model_matrix.mx %*% beta_estimates_blood.mx #using formula for least squares estimation / OLD
X <- blood_model_matrix.mx
H <- blood_hat.mx
y <- blood_response.vec
#I <- t(X) %*% X %*% (solve(t(X) %*% X)); I
I <- diag(dim(H)[1]) #creates an n x n identity matrix matching the dimensions of H (1s on the diagonal, 0s elsewhere)
# REF: https://www.r-bloggers.com/how-do-i-create-the-identity-matrix-in-r/
t(y) %*% y - t(y) %*% X %*% solve(t(X) %*% X) %*% t(X) %*% y #long formula
##      [,1]
## [1,] 24.4
t(y) %*% (I - H) %*% y #using only the y and H objects (simplified)
##      [,1]
## [1,] 24.4
deviance(blood_fit.lm)
## [1] 24.4
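As a closing sketch tying this back to the regression table, the residual standard error reported earlier follows directly from this RSS:
sqrt(deviance(blood_fit.lm) / df.residual(blood_fit.lm)) #approx 0.728, matching the stargazer table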