Welcome to your first HW assignment!
The first thing you should note is that we are not using your typical R script, but instead something called R Markdown. R Markdown is a format for writing reproducible, dynamic reports with R, and it will be the medium by which all HW assignments are worked on and submitted.
Some weeks, the first half of your Markdown script will be dedicated to answering conceptual questions that do not require any code or data, whereas the second half will be questions that are based on output that you generate in R. Other weeks, all of your HW questions will be based on estimating models and interpreting results in R.
Importantly, each week (Friday AM), you will be provided with a script like this that essentially serves as a skeleton. Your HW is to answer all the questions to the best of your ability within each script.
Basic Intro to Markdown
One thing you may have noticed already is that you do not have to preface comments with #. This is because all code that is run in Markdown must be run inside a chunk.
I provided you all with a Markdown cheat sheet and I encourage you all to familiarize yourself with it. While your HW will not require you to generate intricate Markdown features, it is good to learn and can be very helpful for disseminating research findings. One benefit of Markdown is that you can use Rpubs.com to host any script that you have. This enables you to create a script, host it for free and then send anyone a link to it. You can even include a table of content that allows collaborators or your advisor to easily navigate different analyses (I can show you how to do this during sections).
As a side note, I also encourage you all to download the newest version of R Studio (https://rstudio.com/products/rstudio/download/preview/), as it has some amazing features that allow you to visualize your markdown as you go.
Okay... so how does Markdown work? Markdown works via chunks! You can create a chunk by clicking command + option + I.
A chunk is essentially a mini script, and running a chunk (which can be done by placing your cursor anywhere in the chunk and clicking enter + command or by clicking on the little green arrow on the far right side) will run all code linearly inside of the chunk.
#this is a chunk, and you have many options with how to control the content within your chunk and this is typically done by adding different arguments after the r inside the curly brackets (e.g., {r, echo = False} will not display any code)
#Notice that any comments you want inside a chunk have to begin with a hash tag, else it will render an error (analogous to a typical R script)
The chunk above will not generate any code because all content within is inside of a comment, which R will ignore.
Notice that at the very top of the script you have a title, which should include the week and your name, along with the output, which is defaulted to html_document. Here is where you also have a lot of flexibility on how you want to format your script. But for now, we will leave as is.
What goes into a chunk?
I like to organize my chunks by steps of analyses.
For instance, I use one chunk to load all the necessary packages, another for any functions I may want for this script, another to load in a csv etc. Once all of your code is in your script and is working, you will complete it by clicking "Knit", which will render your script and generate the HTML file that you will submit for HW.
REMEMBER, always make sure to save your script in the same folder where your csv is, and set your working directory to that folder, or else nothing will work.
All your HW answers will be written OUTSIDE of a chunk where you are prompted (i.e. Answer: answer goes here)
The first set of questions entitled covariance will not require any code and therefore will not require any chunks.
1. What is a correlation? What can you interpret from the number? Answer: A correlation is a relationship between two variables. From it, we can interpret how much one variable changes with a change in the other variable.
2. What are some problems with correlations (Hint: why do we use Fisher’s Zr)? Answer: Correlation is usually standardized, which means we can compare the strength of correlations among different analyses. We use z-scores for this. However, this also means that the correlation does not provide us much information about the data itself.
3. What is a covariance? What can you interpret from the number? Answer: A covariance can be seen as an unstandardized correlation, or how much the variables "go toghether". It is difficult to interpret, as its units are unintuitive.
4. How does covariance avoid some issues that correlations encounter? Answer: The covariance uses raw/unstandardized variables, so we have information about the actual scores in the data.
5. What is not useful about a covariance (Hint: there are at least two things)? Answer: A covariance is not standardized, so the same covariance value has a different meaning depending on the sample we got it from.
6. Under what circumstances is a covariance a correlation? Answer: A standardized covariance is the same as a correaltion.
7. Under what circumstances is a covariance a variance? Answer: An unstandardized covariance is a variance.
This next section will incorporate code, and your answers will require running code and interpreting the output.
The first thing you will want to do is create a chunk to load your libraries
#we will only use the psych package for HW 1
library(psych)
#remember you first have to install the psych package before you can load it which can be done with install.packages("psych")
You are offered a job with a starting salary of $54,000. Before you take the job, you want to know if the salary you are being offered is fair or not. To answer this question, you collect information on years of experience and salary for similar positions and obtain the following data:
#We are going to manually insert our data, using the values in the section assignment
#"Experience" is the name of our variable, the '<-' is equivalent to "equals" (you can also use "="),
#the 'c' means concatenate, which combines values into a vector
Experience <- c(0,2,5,8,10,4,12,5,8,10)
Salary <- c(35,44,52,42,56,50,58,42,48,50)
#Here we combine the columns using the 'cbind' command, creating one dataset
#it would be a matrix but we get it to be a data frame with as.data.frame
dataset <- as.data.frame(cbind(Experience, Salary))
#rm removes objects
rm(Experience, Salary)
#The plot command is pretty self-explanatory
#Note the dataset$variable format
a. What is the first thing you should do when dealing with this data? Interpret your findings. Hint: It's often best to start with plots.
plot(dataset$Salary~dataset$Experience)
#There are many ways to customize graphs to make them look nicer, which we will cover later
Answer: We plot the data set for a visual glance at the data. Here there seems to be a positive realtionship.
b. Calculate and interpret the correlation.
#When running correlations, we can correlate an entire dataset...
cor(dataset)
## Experience Salary
## Experience 1.0000000 0.7472636
## Salary 0.7472636 1.0000000
#Or specific variables in that dataset. cor.test is also necessary for inferential statistics
cor.test(dataset$Salary, dataset$Experience)
##
## Pearson's product-moment correlation
##
## data: dataset$Salary and dataset$Experience
## t = 3.1806, df = 8, p-value = 0.01299
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2221654 0.9363433
## sample estimates:
## cor
## 0.7472636
Answer: The correlation is about 0.75.
c. Calculate and interpret the covariance.
#The 'cov' command has fewer options, for good reason
cov(dataset$Salary, dataset$Experience)
## [1] 20.13333
Answer: The covariance is 20.13.
d. Calculate the means and standard deviations for each variable.
#the describe command from the psych package will give a variety of descriptive statistics (including sample SD)
describe(dataset)
## vars n mean sd median trimmed mad min max range skew kurtosis
## Experience 1 10 6.4 3.84 6.5 6.5 4.45 0 12 12 -0.16 -1.42
## Salary 2 10 47.7 7.02 49.0 48.0 8.90 35 58 23 -0.20 -1.16
## se
## Experience 1.21
## Salary 2.22
#Again, it could be used for an entire dataframe or individual variables
describe(dataset$Salary)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 10 47.7 7.02 49 48 8.9 35 58 23 -0.2 -1.16 2.22
Answer: Expereince: mean 6.4, SD 3.84 Salary: mean 47.7, SD 7.02
e. Based on the correlation, means, and standard deviations, calculate the regression equation (and then check in R) predicting salary from years of experience.
#The lm command creates a linear model, here Salary is our outcome and Experience our predictor
model <- lm(Salary~Experience, data = dataset)
#the summary command gives us most relevant info for a lm object
summary(model)
##
## Call:
## lm(formula = Salary ~ Experience, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.890 -3.495 0.216 3.189 6.216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.9411 3.1678 12.293 1.78e-06 ***
## Experience 1.3686 0.4303 3.181 0.013 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.951 on 8 degrees of freedom
## Multiple R-squared: 0.5584, Adjusted R-squared: 0.5032
## F-statistic: 10.12 on 1 and 8 DF, p-value: 0.01299
#If you have a question on any R command, there is a simple way of finding the source documentation
?lm
## starting httpd help server ... done
Answer:
f. We want to know how much we expect salary to change for each additional year of experience gained. What term are we looking for? How do we interpret it? What does the significance test tell us? Answer: We are looking for the regression coefficient, which here is 1.3686, significant at p < .01.
g. We want to know what salary would we expect someone with no experience to earn? What term would we look at? How would we interpret it? What does the significant test tell us? Answer: We would look at the intercept for salary. We would interpret it as the base salary with no experience. Here it is 38.8411, and statsitcially significant so not likely to occur as such if H0 is true.
h. Is our intercept (b0) actually an interpretable/meaningful value? Is the significance test a meaningful test? Explain why or why not for both questions. Answer: The b0 has a theoretical meaning, but often is not useful in practice. In this case, 0 years of experience actually does mean something, so the test is meaningful. The significance test, however, is not meaningful as we do not have a null hypothesis.
i. If we know that we have 6 years of experience, what salary should we expect to earn?
#you can do math by hand
(38.9411 + (1.3686 * 6)) * 1000
## [1] 47152.7
#or you can do it with code.
#first create a df with a column called experience with value = 6
predicted <- data.frame("Experience" = c(6))
#the predict command then uses the old model to predict new values.
predict(model, predicted)
## 1
## 47.15257
Answer: 47.1527
j. We are actually being offered $54,000, so given what we just predicted, should we take the job? Explain. Calculate how much better or worse it is.
#[] subsets
(54 - predict(model,predicted)[1]) * 1000
## 1
## 6847.432
Answer: Just this number does not tell us if this difference is meaningful. Otherwise, the position does pay more than is expected for our experience.
k. What did we just calculate? Answer: We calculated how much experience we should have given the offer of $54,000.
l. What is the standardized equation predicting salary from years of experience?
#the scale command allows us to both center and z-score our variables
#Here, the column 'Experience_z' is being added to the data frame 'dataset'
dataset$Experience_z <- scale(dataset$Experience)
dataset$Salary_z <- scale(dataset$Salary)
#Here we are running the same model, with z-scored variables
standardmodel <- lm(Salary_z ~ Experience_z, data=dataset)
summary(standardmodel)
##
## Call:
## lm(formula = Salary_z ~ Experience_z, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.12316 -0.49750 0.03075 0.45395 0.88490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.795e-16 2.229e-01 0.000 1.000
## Experience_z 7.473e-01 2.349e-01 3.181 0.013 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7048 on 8 degrees of freedom
## Multiple R-squared: 0.5584, Adjusted R-squared: 0.5032
## F-statistic: 10.12 on 1 and 8 DF, p-value: 0.01299
Answer:
m. Interpret the standardized equation. Answer: For each 1-unit change in salary, there is a 7e-01 change in experience.
n. Now, we are going to be working with centered years of experience (rather than years of experience directly). Obtain the regression equation and interpret the slope and intercept.
#Here we are using the scale command in much the same way, except 'scale=FALSE' will center, but not z-score
dataset$Experience_c <-scale(dataset$Experience, scale=FALSE)
#we can examine our values
dataset$Experience_c
## [,1]
## [1,] -6.4
## [2,] -4.4
## [3,] -1.4
## [4,] 1.6
## [5,] 3.6
## [6,] -2.4
## [7,] 5.6
## [8,] -1.4
## [9,] 1.6
## [10,] 3.6
## attr(,"scaled:center")
## [1] 6.4
centeredmodel <- lm(Salary ~ Experience_c, data=dataset)
summary(centeredmodel)
##
## Call:
## lm(formula = Salary ~ Experience_c, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.890 -3.495 0.216 3.189 6.216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.7000 1.5657 30.466 1.46e-09 ***
## Experience_c 1.3686 0.4303 3.181 0.013 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.951 on 8 degrees of freedom
## Multiple R-squared: 0.5584, Adjusted R-squared: 0.5032
## F-statistic: 10.12 on 1 and 8 DF, p-value: 0.01299
Answer: The intercept, or the value where b0 is 0, is 47.7, meaning that for no experience there should be a salary of about $47,700. For each 1 SD increase in salary, there is a 1.37 SD increase in experience.
o. In general, why might we want to center our predictors (hint: what changed when we centered it)? Does it make sense in this particular case?
Answer: We want to center predictors in order to compare items to the mean (which is assigned 0). Centering at 0 does not make sense for this dataset as for both values 0 is the actual possible minimum.
p. Rather than working with salaries in $1000, we want to work with salary in $1. Multiply all the salary values by 1000.
#Here we are just doing a simple mathematical operation to create a variable
dataset$Salary2 <- dataset$Salary*1000
describe(dataset$Salary2)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 10 47700 7024.56 49000 48000 8895.6 35000 58000 23000 -0.2 -1.16
## se
## X1 2221.36
Answer: Mean 47700, SD 7024.56
q. Obtain the regression equation and interpret all pieces.
#running model and summarizing it
model2 <- lm(Salary2~Experience, data=dataset)
summary(model2)
##
## Call:
## lm(formula = Salary2 ~ Experience, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7890 -3495 216 3189 6216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38941.1 3167.8 12.293 1.78e-06 ***
## Experience 1368.6 430.3 3.181 0.013 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4951 on 8 degrees of freedom
## Multiple R-squared: 0.5584, Adjusted R-squared: 0.5032
## F-statistic: 10.12 on 1 and 8 DF, p-value: 0.01299
Answer: The salary intercept (b0) is 38941.1. For each 1000 increase in salary there is a 1368.6 increase in experience. All values are significant. Experience explains about 50% of the variance in salary.
r. Working with salary in $1 increments, calculate the standardized equation and interpret all pieces.
#We can also wrap everyting up in one line of code
summary(lm (scale(Salary2, scale=TRUE)~scale(Experience, scale=TRUE), data=dataset))
##
## Call:
## lm(formula = scale(Salary2, scale = TRUE) ~ scale(Experience,
## scale = TRUE), data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.12316 -0.49750 0.03075 0.45395 0.88490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.631e-16 2.229e-01 0.000 1.000
## scale(Experience, scale = TRUE) 7.473e-01 2.349e-01 3.181 0.013 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7048 on 8 degrees of freedom
## Multiple R-squared: 0.5584, Adjusted R-squared: 0.5032
## F-statistic: 10.12 on 1 and 8 DF, p-value: 0.01299
Answer: Same interpretation as above, but measured in standard deviations rather than original units.
s. What should you be noticing about all the standardized equations that we ran using (1) original values, (2) centered experience, and (3) having changed salary into dollars?
Answer: The original values give us the relationship between salary and experience based on the original units, so the values are fairly small given that salaries are measured in thousands of dollars. Centering changes the values, but not the relationships. Changing the salary into individual dollars makes the relationships seem even larger than originally.
t. What is the relationship between correlation, residuals, and the accuracy of our predictions? (As the correlation increases, what happens to residuals and accuracy of predictions?)
Answer: Correaltion shows the strength of the relationship, but not its accurary (in itself). Large residuals mean a large amount of unexplained variance, and they reduce accuracy. A stronger correlation suggests that more of the relationship was due to the relationship between the variables and less is due to third unexplained factors.
u. Assuming that x was completely uncorrelated with y, (e.g., b = 0), what would our regression equation predict for y? Or another way to say it: What would be the best overall predictor of y? Or yet another way: What prediction of y would give us the smallest residuals?
Answer: The best overall predictor of y would be its expected value.
Congratulations! You finished the first HW :)