Welcome to your first HW assignment!
The first thing you should note is that we are not using your typical R script, but instead something called R Markdown. R Markdown is a format for writing reproducible, dynamic reports with R, and it will be the medium by which all HW assignments are worked on and submitted.
Some weeks, the first half of your Markdown script will be dedicated to answering conceptual questions that do not require any code or data, whereas the second half will be questions that are based on output that you generate in R. Other weeks, all of your HW questions will be based on estimating models and interpreting results in R.
Importantly, each week (Friday AM), you will be provided with a script like this that essentially serves as a skeleton. Your HW is to answer all the questions to the best of your ability within each script.
Basic Intro to Markdown
One thing you may have noticed already is that you do not have to preface comments with #. This is because all code that is run in Markdown must be run inside a chunk.
I provided you all with a Markdown cheat sheet and I encourage you all to familiarize yourself with it. While your HW will not require you to generate intricate Markdown features, it is good to learn and can be very helpful for disseminating research findings. One benefit of Markdown is that you can use Rpubs.com to host any script that you have. This enables you to create a script, host it for free and then send anyone a link to it. You can even include a table of content that allows collaborators or your advisor to easily navigate different analyses (I can show you how to do this during sections).
As a side note, I also encourage you all to download the newest version of RStudio (https://rstudio.com/products/rstudio/download/preview/), as it has some amazing features that allow you to visualize your Markdown as you go.
Okay… so how does Markdown work? Markdown works via chunks! You can create a chunk by pressing command + option + I (Ctrl + Alt + I on Windows).
A chunk is essentially a mini script. Running a chunk executes all the code inside it from top to bottom; you can do this by placing your cursor anywhere in the chunk and pressing command + shift + enter, or by clicking the little green arrow on the far right side of the chunk. (Pressing command + enter runs only the current line or selection.)
#this is a chunk, and you have many options for controlling how its content is handled; this is typically done by adding different arguments after the r inside the curly brackets (e.g., {r, echo = FALSE} will not display any code)
#Notice that any comments you want inside a chunk have to begin with a hash tag, else it will render an error (analogous to a typical R script)
The chunk above will not generate any output because all content within it is inside of a comment, which R will ignore.
install.packages('rmarkdown')
Notice that at the very top of the script you have a title, which should include the week and your name, along with the output, which is defaulted to html_document. Here is where you also have a lot of flexibility on how you want to format your script. But for now, we will leave as is.
What goes into a chunk?
I like to organize my chunks by steps of analyses.
For instance, I use one chunk to load all the necessary packages, another for any functions I may want for this script, another to load in a csv etc. Once all of your code is in your script and is working, you will complete it by clicking “Knit”, which will render your script and generate the HTML file that you will submit for HW.
REMEMBER, always make sure to save your script in the same folder where your csv is, and set your working directory to that folder, or else nothing will work.
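A quick way to confirm the setup before knitting is to check the working directory and whether the data file is visible from it. This is only a minimal sketch: the file name `my_data.csv` is a hypothetical placeholder, not a file from this assignment.

```r
# Print the folder R is currently working in
current_dir <- getwd()
print(current_dir)

# "my_data.csv" is a hypothetical file name; replace it with your own csv
csv_name <- "my_data.csv"

# TRUE only if the csv sits in the working directory
csv_found <- file.exists(file.path(current_dir, csv_name))
print(csv_found)

# If csv_found is FALSE, point R at the right folder with setwd("path/to/folder")
```

If `csv_found` prints `FALSE`, the script and the csv are probably not in the same folder.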
All your HW answers will be written OUTSIDE of a chunk where you are prompted (i.e. Answer: answer goes here)
The first set of questions entitled covariance will not require any code and therefore will not require any chunks.
1. What is a correlation? What can you interpret from the number? Answer: A correlation is a measure of the linear association between two variables. The value of a correlation always falls between -1 and 1. A correlation of 1 indicates a perfect positive linear relationship, and a correlation of -1 indicates a perfect negative linear relationship. A correlation of 0 indicates that there is no linear association between the variables. Therefore, from the correlation coefficient you can interpret the direction of the association between two variables and its strength (strongly associated, moderately associated, or weakly associated), but you cannot determine a cause and effect relationship based on a correlation analysis. Because it is standardized, the correlation coefficient also serves as a measure of effect size when evaluating the relative importance and statistical significance of an association between variables.
2. What are some problems with correlations (Hint: why do we use Fisher’s Zr)? Answer: In a correlation, the units of measurement do not matter, the relative sizes of the variances do not matter, and there is no distinction between the independent and dependent variables; it is simply bivariate. Therefore, correlations reveal limited information about an association. A correlation does not reveal the effect of the independent variable on the dependent variable. Additionally, the sampling distribution of r is skewed when the true correlation is not zero, and its variance depends on the size of the true correlation, so r alone is poorly suited for building confidence intervals or for testing the difference between two correlation coefficients from independent samples. This is why the Fisher z transformation is used: the transformed coefficient has an approximately normal sampling distribution with a variance that remains stable across different values of the underlying true correlation.
3. What is a covariance? What can you interpret from the number? Answer: A covariance is a measure of the joint variability of two random variables. From the covariance, one can read the direction of the relationship between the deviations of two variables from their means. The relationship is positive if the covariance is a positive number, and the points on a scatterplot will trend upward from left to right. The relationship is negative if the covariance is a negative number, and a scatterplot would show a downward trend. Additionally, covariance is useful for computing regression lines as well as correlation coefficients.
4. How does covariance avoid some issues that correlations encounter? Answer: Correlations can be difficult to interpret, as they do not reveal the effect the independent variable has on the dependent variable, do not reveal causation, and are dimensionless. Therefore, covariance can be used to make correlations more interpretable and to determine a directional relationship. As correlation is unitless, it can leave out meaningful information about an association. For example, if one is studying the relationship between hours of studying and blood pressure, the units are essential for meaningfully understanding the association. In this case, covariance is preferred because it measures the directional relationship in units. Covariance can be used in this way to avoid interpretation issues that correlation presents. Additionally, you can derive a correlation matrix from a covariance matrix, but you cannot derive a covariance matrix from a correlation matrix.
5. What is not useful about a covariance (Hint: there are at least two things)? Answer: Covariance is not a very helpful statistic on its own as it is difficult to interpret. The covariance is difficult to interpret because it is not standardized. Since the covariance is not unitless, if the independent variable is in inches and the dependent variable is in gallons, the covariance unit would be inch-gallons, a meaningless unit of measurement. Additionally, the value of covariance is not robust and does not stay consistent across scales. For example, if the value of the variables is multiplied by a constant, then the covariance changes.
6. Under what circumstances is a covariance a correlation? Answer: A covariance becomes a correlation when the covariance becomes standardized. If the independent and dependent variables happen to be z-scores, then the equation becomes the definitional formula for Pearson’s correlation, r. Therefore, the z-scores have standardized the covariance into a correlation.
7. Under what circumstances is a covariance a variance? Answer: The covariance of any variable with itself is the variance. The covariance between the independent variable and the dependent variable is the average cross product of their deviations from their means; when a variable is paired with itself, each cross product becomes a squared deviation, and the average of the squared deviations is the variance.
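Several of the claims in answers 5 through 7 can be checked numerically. The sketch below uses two short made-up vectors (`x` and `y` are illustrative, not assignment data) to show that covariance changes when a variable is rescaled while correlation does not, that the covariance of z-scores equals Pearson's r, and that the covariance of a variable with itself is its variance.

```r
# Illustrative data (made up for this sketch, not from the assignment)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Rescaling x (e.g., feet to inches) multiplies the covariance by the same constant...
cov_original <- cov(x, y)
cov_rescaled <- cov(x * 12, y)

# ...but leaves the correlation unchanged
cor_original <- cor(x, y)
cor_rescaled <- cor(x * 12, y)

# Covariance of z-scores is the correlation
# (as.numeric drops the matrix attributes that scale() adds)
zx <- as.numeric(scale(x))
zy <- as.numeric(scale(y))
cov_of_z <- cov(zx, zy)

# Covariance of a variable with itself is its variance
cov_self <- cov(x, x)

print(c(cov_original, cov_rescaled))
print(c(cor_original, cor_rescaled, cov_of_z))
print(c(cov_self, var(x)))
```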
This next section will incorporate code, and your answers will require running code and interpreting the output.
The first thing you will want to do is create a chunk to load your libraries
#we will only use the psych package for HW 1
library(psych)
## Warning: package 'psych' was built under R version 4.0.3
#remember you first have to install the psych package before you can load it which can be done with install.packages("psych")
install.packages("psych")
HW Prompt: You are offered a job with a starting salary of $54,000. Before you take the job, you want to know if the salary you are being offered is fair or not. To answer this question, you collect information on years of experience and salary for similar positions and obtain the following data:
#We are going to manually insert our data, using the values in the section assignment
#"Experience" is the name of our variable, the '<-' is equivalent to "equals" (you can also use "="),
#the 'c' means concatenate, which combines values into a vector
Experience <- c(0,2,5,8,10,4,12,5,8,10)
Salary <- c(35,44,52,42,56,50,58,42,48,50)
#Here we combine the columns using the 'cbind' command, creating one dataset
#it would be a matrix but we get it to be a data frame with as.data.frame
dataset <- as.data.frame(cbind(Experience, Salary))
#rm removes objects
rm(Experience, Salary)
#The plot command is pretty self-explanatory
#Note the dataset$variable format
Experience <- c(0,2,5,8,10,4,12,5,8,10)
Salary <- c(35,44,52,42,56,50,58,42,48,50)
cbind(Experience, Salary)
## Experience Salary
## [1,] 0 35
## [2,] 2 44
## [3,] 5 52
## [4,] 8 42
## [5,] 10 56
## [6,] 4 50
## [7,] 12 58
## [8,] 5 42
## [9,] 8 48
## [10,] 10 50
data.frame(Experience, Salary)
## Experience Salary
## 1 0 35
## 2 2 44
## 3 5 52
## 4 8 42
## 5 10 56
## 6 4 50
## 7 12 58
## 8 5 42
## 9 8 48
## 10 10 50
plot(dataset$Salary~dataset$Experience)
a. What is the first thing you should do when dealing with this data? Interpret your findings. Hint: It’s often best to start with plots.
plot(dataset$Salary~dataset$Experience)
#There are many ways to customize graphs to make them look nicer, which we will cover later
Answer: The first thing I should do with this data is organize the variables into a single data set and plot them in the plot that is easiest to interpret for exploratory analysis, in this case a scatterplot. This scatterplot shows that, in general, the more experience one has, the higher the salary. The participant with no experience makes the lowest salary. Additionally, there is more fluctuation in salary for mid-experienced people.
b. Calculate and interpret the correlation.
cor.test(dataset$Salary, dataset$Experience)
##
## Pearson's product-moment correlation
##
## data: dataset$Salary and dataset$Experience
## t = 3.1806, df = 8, p-value = 0.01299
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2221654 0.9363433
## sample estimates:
## cor
## 0.7472636
#When running correlations, we can correlate an entire dataset...
cor(dataset)
## Experience Salary
## Experience 1.0000000 0.7472636
## Salary 0.7472636 1.0000000
#Or specific variables in that dataset. cor.test is also necessary for inferential statistics
cor.test(dataset$Salary, dataset$Experience)
##
## Pearson's product-moment correlation
##
## data: dataset$Salary and dataset$Experience
## t = 3.1806, df = 8, p-value = 0.01299
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2221654 0.9363433
## sample estimates:
## cor
## 0.7472636
Answer: Salary and experience were found to be strongly positively correlated, r(8) = .747, p = .013. The results show that there is a strong association between more years of experience and higher salary. These results suggest that years of work experience is strongly related to job salary.
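The 95 percent confidence interval printed by `cor.test` is based on Fisher's z transformation, so it can be reproduced by hand. The following is a sketch of that calculation using the data from the HW prompt; the variable names `z`, `se_z`, `ci_z`, and `ci_r` are introduced here for illustration.

```r
# Data from the HW prompt (salary in $1000s)
Experience <- c(0, 2, 5, 8, 10, 4, 12, 5, 8, 10)
Salary <- c(35, 44, 52, 42, 56, 50, 58, 42, 48, 50)

n <- length(Salary)
r <- cor(Salary, Experience)

# Fisher's z transformation of r and its approximate standard error
z <- atanh(r)
se_z <- 1 / sqrt(n - 3)

# 95% interval on the z scale, then back-transformed to the r scale
ci_z <- z + c(-1, 1) * qnorm(0.975) * se_z
ci_r <- tanh(ci_z)
print(ci_r)
```

The interval is built on the z scale because that is where the sampling distribution is approximately normal; `tanh` then maps the limits back to the correlation scale.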
c. Calculate and interpret the covariance.
#The 'cov' command has fewer options, for good reason
cov(dataset$Salary, dataset$Experience)
## [1] 20.13333
Answer: The covariance was calculated between years of experience (independent variable) and salary (dependent variable). The covariance between these two variables is 20.133, showing that there is a positive covariance between them. This result reveals that years of experience and salary vary in the same direction: as experience increases, salary tends to increase, and as experience decreases, salary tends to decrease. The size of the covariance depends on the units of the two variables (years and $1000s), so it does not by itself show how strong the relationship is.
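To see where the value from `cov` comes from, the covariance can also be computed from its definition. This is a sketch using the data from the HW prompt; `dev_experience`, `dev_salary`, `cov_by_hand`, and `r_from_cov` are names made up for this example.

```r
# Data from the HW prompt (salary in $1000s)
Experience <- c(0, 2, 5, 8, 10, 4, 12, 5, 8, 10)
Salary <- c(35, 44, 52, 42, 56, 50, 58, 42, 48, 50)
n <- length(Salary)

# Deviations of each variable from its own mean
dev_experience <- Experience - mean(Experience)
dev_salary <- Salary - mean(Salary)

# Sample covariance: sum of cross products divided by n - 1
cov_by_hand <- sum(dev_experience * dev_salary) / (n - 1)

# Dividing by both standard deviations standardizes the covariance into r
r_from_cov <- cov_by_hand / (sd(Experience) * sd(Salary))

print(cov_by_hand)
print(r_from_cov)
```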
d. Calculate the means and standard deviations for each variable.
#the describe command from the psych package will give a variety of descriptive statistics (including sample SD)
describe(dataset)
## vars n mean sd median trimmed mad min max range skew kurtosis
## Experience 1 10 6.4 3.84 6.5 6.5 4.45 0 12 12 -0.16 -1.42
## Salary 2 10 47.7 7.02 49.0 48.0 8.90 35 58 23 -0.20 -1.16
## se
## Experience 1.21
## Salary 2.22
#Again, it could be used for an entire dataframe or individual variables
#Note: describe() takes one object per call; a second positional argument is read as na.rm, so describe each variable separately
describe(dataset$Salary)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 10 47.7 7.02 49 48 8.9 35 58 23 -0.2 -1.16 2.22
describe(dataset$Experience)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 10 6.4 3.84 6.5 6.5 4.45 0 12 12 -0.16 -1.42 1.21
Answer: The mean for salary is 47.7 and the standard deviation for salary is 7.02. The mean for years of experience is 6.4 and the standard deviation for years of experience is 3.84.
e. Based on the correlation, means, and standard deviations, calculate the regression equation (and then check in R) predicting salary from years of experience.
#The lm command creates a linear model, here Salary is our outcome and Experience our predictor
model <- lm(Salary~Experience, data = dataset)
#the summary command gives us most relevant info for a lm object
summary(model)
##
## Call:
## lm(formula = Salary ~ Experience, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.890 -3.495 0.216 3.189 6.216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.9411 3.1678 12.293 1.78e-06 ***
## Experience 1.3686 0.4303 3.181 0.013 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.951 on 8 degrees of freedom
## Multiple R-squared: 0.5584, Adjusted R-squared: 0.5032
## F-statistic: 10.12 on 1 and 8 DF, p-value: 0.01299
#If you have a question on any R command, there is a simple way of finding the source documentation
?lm
## starting httpd help server ... done
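Answer: From the correlation, means, and standard deviations, the slope is b1 = r(SDy/SDx) = 0.747(7.02/3.84), which is approximately 1.37 (1.3686 before rounding), and the intercept is b0 = 47.7 - 1.3686(6.4), which is approximately 38.94. This gives the regression equation Salary = 38.94 + 1.37(Experience), which matches the R output. A regression analysis was used to test if years of experience significantly predicted salary. A significant regression equation was found, F(1, 8) = 10.12, p = .013, with an R-squared of 0.558.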
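The hand calculation that question e asks for can be sketched in a few lines: with one predictor, the slope is the correlation times the ratio of the standard deviations, and the intercept follows from the two means. The names `b0` and `b1` are chosen here for illustration.

```r
# Data from the HW prompt (salary in $1000s)
Experience <- c(0, 2, 5, 8, 10, 4, 12, 5, 8, 10)
Salary <- c(35, 44, 52, 42, 56, 50, 58, 42, 48, 50)

# Slope from the correlation and the two standard deviations
r <- cor(Experience, Salary)
b1 <- r * sd(Salary) / sd(Experience)

# Intercept: the line must pass through the point of means
b0 <- mean(Salary) - b1 * mean(Experience)
print(c(intercept = b0, slope = b1))

# Check against lm()
model <- lm(Salary ~ Experience)
print(coef(model))
```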
f. We want to know how much we expect salary to change for each additional year of experience gained. What term are we looking for? How do we interpret it? What does the significance test tell us? Answer: The term we are looking for is the slope, b1. Here b1 = 1.3686, so each additional year of experience is associated with an expected salary increase of about $1,369. The significance test tells us whether the slope differs reliably from zero; here t(8) = 3.18, p = .013, so experience is a significant predictor of salary. For example, if the slope showed that salary increases with each year of experience but the result were not significant, then we could not conclude that experience predicts salary.
g. We want to know what salary would we expect someone with no experience to earn? What term would we look at? How would we interpret it? What does the significance test tell us? Answer: The term we would look at is the intercept, b0. It is the predicted salary when experience equals 0, so we would expect someone with no experience to earn about $38,941. The significance test tells us whether the intercept differs from zero; here t(8) = 12.29, p < .001, so the expected salary at 0 years of experience is reliably greater than $0.
h. Is our intercept (b0) actually an interpretable/meaningful value? Is the significance test a meaningful test? Explain why or why not for both questions. Answer: Yes, the intercept is interpretable here. Since the data are unstandardized and 0 years of experience falls within the observed range, b0 is the expected salary for someone with no experience. If the scores were standardized, b0 would simply be 0 and would not be very meaningful. The significance test of the intercept is less meaningful, because it only tests whether the expected salary at 0 years of experience differs from $0, which is not a hypothesis of real interest.
i. If we know that we have 6 years of experience, what salary should we expect to earn?
#you can do math by hand
(38.9411 + (1.3686 * 6)) * 1000
## [1] 47152.7
#or you can do it with code.
#first create a df with a column called experience with value = 6
predicted <- data.frame("Experience" = c(6))
#the predict command then uses the old model to predict new values.
predict(model, predicted)
## 1
## 47.15257
Answer: If we have 6 years of experience, we can expect to earn approximately $47,153 per year.
j. We are actually being offered $54,000, so given what we just predicted, should we take the job? Explain. Calculate how much better or worse it is.
#[] subsets
(54 - predict(model,predicted)[1]) * 1000
## 1
## 6847.432
Answer: Yes, we should take the job because it is approximately $6,847 higher than the predicted salary or “market salary” for 6 years experience. According to this model, this job is overpaying the average salary by almost $7,000, so we should take the job because according to this model, other jobs will likely pay less.
k. What did we just calculate? Answer: We just calculated a residual: the difference between the salary offered by this job (the observed value) and the salary the model predicts for six years of experience.
l. What is the standardized equation predicting salary from years of experience?
#the scale command allows us to both center and z-score our variables
#Here, the column 'Experience_z' is being added to the data frame 'dataset'
dataset$Experience_z <- scale(dataset$Experience)
dataset$Salary_z <- scale(dataset$Salary)
#Here we are running the same model, with z-scored variables
dataset$Experience_z <- scale(dataset$Experience)
dataset$Salary_z <- scale(dataset$Salary)
standardmodel <- lm(Salary_z ~ Experience_z, data=dataset)
summary(standardmodel)
##
## Call:
## lm(formula = Salary_z ~ Experience_z, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.12316 -0.49750 0.03075 0.45395 0.88490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.795e-16 2.229e-01 0.000 1.000
## Experience_z 7.473e-01 2.349e-01 3.181 0.013 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7048 on 8 degrees of freedom
## Multiple R-squared: 0.5584, Adjusted R-squared: 0.5032
## F-statistic: 10.12 on 1 and 8 DF, p-value: 0.01299
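Answer: The standardized equation is calculated using the z-scores: z_Salary = 0.747(z_Experience), with an intercept of essentially 0 (-5.795e-16 is rounding error). The overall test is identical to the unstandardized results, F(1, 8) = 10.12, p = .013, with an R-squared of 0.558. The residual standard error changes from 4.951 to 0.7048 only because it is now expressed in standard deviation units rather than in $1000s.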
m. Interpret the standardized equation. Answer: For every one standard deviation increase in years of experience, salary is predicted to increase by 0.747 standard deviations. With a single predictor, the standardized slope equals the correlation (r = .747), and the intercept is 0 because both variables have a mean of 0. The smaller residual standard error does not mean the predictions improved; it is simply on the z-score scale. The R-squared and the significance test are unchanged, and an R-squared of 0.558 shows that the model accounts for about 56% of the variation in salary, so some of the variation in the dependent variable is not accounted for by the model.
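As a check on questions l and m, the sketch below confirms that with a single predictor the standardized slope equals the Pearson correlation, and that multiplying it by the ratio of the standard deviations recovers the raw slope. The names `beta`, `r`, and `b1` are introduced here for illustration, and `as.numeric()` is used so the z-scores are stored as plain vectors.

```r
# Data from the HW prompt (salary in $1000s)
Experience <- c(0, 2, 5, 8, 10, 4, 12, 5, 8, 10)
Salary <- c(35, 44, 52, 42, 56, 50, 58, 42, 48, 50)
dataset <- data.frame(Experience, Salary)

# z-score both variables
dataset$Experience_z <- as.numeric(scale(dataset$Experience))
dataset$Salary_z <- as.numeric(scale(dataset$Salary))

# Standardized model
standardmodel <- lm(Salary_z ~ Experience_z, data = dataset)
beta <- unname(coef(standardmodel)["Experience_z"])

# With one predictor, the standardized slope is the correlation
r <- cor(dataset$Salary, dataset$Experience)
print(c(beta = beta, r = r))

# Converting the standardized slope back to raw units
b1 <- beta * sd(dataset$Salary) / sd(dataset$Experience)
print(b1)
```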
n. Now, we are going to be working with centered years of experience (rather than years of experience directly). Obtain the regression equation and interpret the slope and intercept.
#Here we are using the scale command in much the same way, except 'scale=FALSE' will center, but not z-score
dataset$Experience_c <-scale(dataset$Experience, scale=FALSE)
#we can examine our values
dataset$Experience_c
## [,1]
## [1,] -6.4
## [2,] -4.4
## [3,] -1.4
## [4,] 1.6
## [5,] 3.6
## [6,] -2.4
## [7,] 5.6
## [8,] -1.4
## [9,] 1.6
## [10,] 3.6
## attr(,"scaled:center")
## [1] 6.4
centeredmodel <- lm(Salary ~ Experience_c, data=dataset)
summary(centeredmodel)
##
## Call:
## lm(formula = Salary ~ Experience_c, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.890 -3.495 0.216 3.189 6.216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.7000 1.5657 30.466 1.46e-09 ***
## Experience_c 1.3686 0.4303 3.181 0.013 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.951 on 8 degrees of freedom
## Multiple R-squared: 0.5584, Adjusted R-squared: 0.5032
## F-statistic: 10.12 on 1 and 8 DF, p-value: 0.01299
Answer: The regression equation is Salary = 47.70 + 1.3686(Experience_c) (Yi = b0 + b1xi + ei). The slope is unchanged from the original model: each additional year of experience predicts an increase of about $1,369 in salary. The intercept is now the predicted salary when centered experience is 0, that is, at the mean of 6.4 years of experience, and it equals the mean salary of 47.7 ($47,700).
o. In general, why might we want to center our predictors (hint: what changed when we centered it)? Does it make sense in this particular case?
Answer: We center our predictors so that they have a mean of 0, which makes the intercept easier to interpret: it becomes the expected value of Yi when the predictors are set to their means, rather than when they are set to 0. The slope does not change. In this particular case centering is less necessary, because 0 years of experience is already a meaningful value that appears in the data.
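What centering changes, and what it leaves alone, can be seen by fitting the raw and centered models side by side. This sketch centers by subtracting the mean directly (equivalent to `scale(..., scale = FALSE)` but returning a plain vector); `rawmodel` is a name introduced here for illustration.

```r
# Data from the HW prompt (salary in $1000s)
Experience <- c(0, 2, 5, 8, 10, 4, 12, 5, 8, 10)
Salary <- c(35, 44, 52, 42, 56, 50, 58, 42, 48, 50)
dataset <- data.frame(Experience, Salary)

# Center experience at its mean
dataset$Experience_c <- dataset$Experience - mean(dataset$Experience)

rawmodel <- lm(Salary ~ Experience, data = dataset)
centeredmodel <- lm(Salary ~ Experience_c, data = dataset)

# Slopes match; only the intercept moves
print(coef(rawmodel))
print(coef(centeredmodel))

# The centered intercept is the mean of the outcome
print(mean(dataset$Salary))
```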
p. Rather than working with salaries in $1000, we want to work with salary in $1. Multiply all the salary values by 1000.
#Here we are just doing a simple mathematical operation to create a variable
dataset$Salary2 <- dataset$Salary*1000
describe(dataset$Salary2)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 10 47700 7024.56 49000 48000 8895.6 35000 58000 23000 -0.2 -1.16
## se
## X1 2221.36
Answer: When multiplying all the salary values by 1000, the mean becomes 47700 and the standard deviation becomes 7024.56 (as seen above).
q. Obtain the regression equation and interpret all pieces.
#running model and summarizing it
model2 <- lm(Salary2~Experience, data=dataset)
summary(model2)
##
## Call:
## lm(formula = Salary2 ~ Experience, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7890 -3495 216 3189 6216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38941.1 3167.8 12.293 1.78e-06 ***
## Experience 1368.6 430.3 3.181 0.013 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4951 on 8 degrees of freedom
## Multiple R-squared: 0.5584, Adjusted R-squared: 0.5032
## F-statistic: 10.12 on 1 and 8 DF, p-value: 0.01299
Answer: The regression equation is Salary2 = 38941.1 + 1368.6(Experience). The intercept means that someone with 0 years of experience is predicted to earn $38,941, and the slope means that each additional year of experience predicts $1,369 more in salary. Every coefficient and standard error is simply 1000 times its value in the original model. The residual standard error is now 4951, which is the same 4.951 expressed in dollars rather than $1000s, so the fit of the model has not changed. The F-statistic, R-squared value, t values, and p-value remained the same.
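Changing the units of the outcome rescales the coefficients but not the fit. The sketch below compares the model in $1000s with the model in dollars, using the data from the HW prompt.

```r
# Data from the HW prompt (salary in $1000s)
Experience <- c(0, 2, 5, 8, 10, 4, 12, 5, 8, 10)
Salary <- c(35, 44, 52, 42, 56, 50, 58, 42, 48, 50)
dataset <- data.frame(Experience, Salary)
dataset$Salary2 <- dataset$Salary * 1000

model <- lm(Salary ~ Experience, data = dataset)
model2 <- lm(Salary2 ~ Experience, data = dataset)

# Coefficients in dollars are 1000 times the coefficients in $1000s
print(coef(model) * 1000)
print(coef(model2))

# R-squared is identical because it does not depend on units
print(c(summary(model)$r.squared, summary(model2)$r.squared))
```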
r. Working with salary in $1 increments, calculate the standardized equation and interpret all pieces.
#We can also wrap everything up in one line of code
summary(lm (scale(Salary2, scale=TRUE)~scale(Experience, scale=TRUE), data=dataset))
##
## Call:
## lm(formula = scale(Salary2, scale = TRUE) ~ scale(Experience,
## scale = TRUE), data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.12316 -0.49750 0.03075 0.45395 0.88490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.631e-16 2.229e-01 0.000 1.000
## scale(Experience, scale = TRUE) 7.473e-01 2.349e-01 3.181 0.013 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7048 on 8 degrees of freedom
## Multiple R-squared: 0.5584, Adjusted R-squared: 0.5032
## F-statistic: 10.12 on 1 and 8 DF, p-value: 0.01299
Answer: Working with the data in $1 increments, the standardized equation is z_Salary2 = 0.747(z_Experience), with an intercept of essentially 0. A one standard deviation increase in experience predicts a 0.747 standard deviation increase in salary. The residual standard error is 0.7048, now in standard deviation units. This is identical to the standardized equation from the original data, because z-scoring removes the units of measurement. The R-squared value, F-statistic, and significance remain the same.
s. What should you be noticing about all the standardized equations that we ran using (1) original values, (2) centered experience, and (3) having changed salary into dollars?
Answer: I notice that the standardized equation is the same in every case: the slope is 0.747 (the correlation) and the intercept is 0, because centering a variable or changing its units does not change its z-scores. The residual standard error differs between the unstandardized and standardized models only because it is expressed in different units. The F-value, R-squared value, and p-value remain consistent across all of the models.
t. What is the relationship between correlation, residuals, and the accuracy of our predictions? (As the correlation increases, what happens to residuals and accuracy of predictions?)
Answer: If two variables are strongly correlated, then a linear regression equation predicts the dependent variable well from the independent variable. As the correlation increases in absolute value, the residuals shrink and the residual standard error decreases, and a low residual standard error indicates that the predictions are accurate. Therefore, as the correlation increases, the predictions become more accurate and the model becomes stronger because the residuals decrease.
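For a one-predictor model, the link between the correlation and the size of the residuals can be written down directly: the residual standard error equals the standard deviation of the outcome times sqrt((1 - r^2)(n - 1)/(n - 2)). The sketch below checks this against `lm` using the HW data; `rse_from_r` is a name introduced for illustration.

```r
# Data from the HW prompt (salary in $1000s)
Experience <- c(0, 2, 5, 8, 10, 4, 12, 5, 8, 10)
Salary <- c(35, 44, 52, 42, 56, 50, 58, 42, 48, 50)
n <- length(Salary)
r <- cor(Salary, Experience)

# Residual standard error implied by the correlation
rse_from_r <- sd(Salary) * sqrt((1 - r^2) * (n - 1) / (n - 2))

# Residual standard error reported by the fitted model
model <- lm(Salary ~ Experience)
rse_from_model <- summary(model)$sigma

print(c(rse_from_r, rse_from_model))
```

As r moves toward 1 or -1, the term (1 - r^2) moves toward 0, so the residual standard error shrinks and predictions become more accurate.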
u. Assuming that x was completely uncorrelated with y, (e.g., b = 0), what would our regression equation predict for y? Or another way to say it: What would be the best overall predictor of y? Or yet another way: What prediction of y would give us the smallest residuals?
Answer: If x is completely uncorrelated with y, the slope is 0 and the regression equation reduces to the intercept, which equals the mean of y. The equation would therefore predict the mean of y for every case. The mean is the best overall predictor because it is the single value that gives the smallest sum of squared residuals. Note that uncorrelated variables are not necessarily unrelated, as they might be strongly nonlinearly related.
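One way to explore question u is to fit a model with no predictor at all and compare the sum of squared residuals for a few candidate predictions. This is a sketch using the salary values from the HW prompt; `null_model` and the helper function `sse` are made up for this example.

```r
# Salary values from the HW prompt (in $1000s)
Salary <- c(35, 44, 52, 42, 56, 50, 58, 42, 48, 50)

# An intercept-only model uses no predictor, which is what b = 0 amounts to
null_model <- lm(Salary ~ 1)
predicted_y <- unname(coef(null_model))
print(predicted_y)
print(mean(Salary))

# Sum of squared residuals for a single constant prediction
sse <- function(guess) sum((Salary - guess)^2)

# Compare the mean with two other candidate predictions
print(c(mean = sse(mean(Salary)), median = sse(median(Salary)), fifty = sse(50)))
```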