Simple linear regression is a straightforward approach for predicting a quantitative response y on the basis of a single predictor variable x. It assumes that there is approximately a linear relationship between x and y.
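In equation form, the model is y = b0 + b1*x + e, where b0 is the intercept, b1 is the slope, and e is a random error term; lm() estimates b0 and b1 by least squares. A minimal sketch with simulated data (all names here are illustrative, not part of the exercises below):

set.seed(42)                  #for reproducibility
x<-runif(30, 0, 10)           #a single predictor
y<-2 + 0.5*x + rnorm(30)      #linear signal plus random noise
toy_mod<-lm(y~x)              #least squares fit of y = b0 + b1*x
coef(toy_mod)                 #estimated intercept and slope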

Q1. According to Advertising Age’s annual salary review, Mark Hurd, the 49-year-old chairman, president, and CEO of Hewlett-Packard Co., received an annual salary of $817,000, a bonus of more than $5 million, and other compensation exceeding $17 million. His total compensation was slightly better than the average CEO total pay of $12.4 million. The file ExecSalary.csv shows the age and annual salary (in thousands of dollars) for Mark Hurd and 14 other executives who led publicly held companies (Advertising Age, December 5, 2006).

setwd("C:/Users/plu5638/Desktop/Business Analytics/Module 5/")
Exec<-read.csv("ExecSalary.csv")
print(Exec)
##              Executive             Title               Company Age
## 1       Charles Prince          Chmn/CEO             Citigroup  56
## 2    Harold McGraw III     Chmn/Pres/CEO      McGraw-Hill Cos.  57
## 3          James Dimon          Pres/CEO JP Morgan Chase & Co.  50
## 4    K. Rupert Murdoch          Chmn/CEO            News Corp.  75
## 5     Kenneth D. Lewis     Chmn/Pres/CEO       Bank of America  58
## 6  Kenneth I. Chenault          Chmn/CEO  American Express Co.  54
## 7   Louis C. Camilleri          Chmn/CEO          Altria Group  51
## 8         Mark V. Hurd     Chmn/Pres/CEO   Hewlett-Packard Co.  49
## 9    Martin S. Sorrell               CEO             WPP Group  61
## 10  Robert L. Nardelli     Chmn/Pres/CEO            Home Depot  57
## 11 Samuel J. Palmisano     Chmn/Pres/CEO             IBM Corp.  55
## 12      David C. Novak     Chmn/Pres/CEO            Yum Brands  53
## 13  Henry R. Silverman          Chmn/CEO         Cendant Corp.  65
## 14    Robert C. Wright          Chmn/CEO         NBC Universal  62
## 15     Sumner Redstone Exec Chmn/Founder                Viacom  82
##    Salary_in_Thousands_USD
## 1                     1000
## 2                     1172
## 3                     1000
## 4                     4509
## 5                     1500
## 6                     1092
## 7                     1663
## 8                      817
## 9                     1562
## 10                    2164
## 11                    1680
## 12                    1173
## 13                    3300
## 14                    2500
## 15                    5807
  a. Develop a scatter diagram for these data with the age of the executive as the independent variable. What does the scatter diagram indicate about the relationship between the two variables?
#Create a Scatter Plot 
plot(Exec$Age~Exec$Salary_in_Thousands_USD)
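As an optional sketch, plot() also accepts xlab, ylab, and main arguments for labeled axes; in the y ~ x formula interface, the variable before the tilde goes on the y-axis:

#Same scatter plot with labeled axes and a title
plot(Exec$Age~Exec$Salary_in_Thousands_USD, xlab="Salary (thousands USD)", ylab="Age", main="Executive Age vs. Salary", pch=19)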

Now let’s create a simple regression model using Salary as the predictor (independent) variable and Age as the response (dependent) variable.

  b. Show the output of the regression. Develop the least squares estimated regression equation. Interpret the meaning of the intercept and slope in the context of this exercise.
#Simple Linear Regression Model
linear_mod_exec<-lm(Age~Salary_in_Thousands_USD, data=Exec) # In lm(), the dependent (response) variable, here Age, comes first, followed by the independent variable after the tilde.
summary(linear_mod_exec) # this is the regression model output
## 
## Call:
## lm(formula = Age ~ Salary_in_Thousands_USD, data = Exec)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5806 -2.0710  0.3294  1.7972  5.0309 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4.651e+01  1.362e+00   34.16 4.10e-14 ***
## Salary_in_Thousands_USD 6.054e-03  5.476e-04   11.06 5.55e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.946 on 13 degrees of freedom
## Multiple R-squared:  0.9039, Adjusted R-squared:  0.8965 
## F-statistic: 122.2 on 1 and 13 DF,  p-value: 5.546e-08

Adding the best-fit line to the scatter plot

#Scatter Plot with best fit line

plot(Exec$Age~Exec$Salary_in_Thousands_USD)
abline(lm(Exec$Age~Exec$Salary_in_Thousands_USD), col="red", lwd=3) #Adds best fit line to the plot

#Adding R-squared value to the scatter plot

#Option 1: Reading the output and adding the R-squared value on the plot

text(4500, 55, "R-squared = 0.9039") # The (x, y) position must fall within the plotted data range; refer to page 129 of Stowell for selecting the position of text on the plot

#Option 2: Letting R fill out the R-squared value on the plot  

summ<-summary(linear_mod_exec) #we are saving the regression model output in object summ, then we can use it to call the value we want. 

rsq<-round((summ$r.squared), digits=4) #Here, we are calling the r.squared value from the summ object, rounding it to 4 digits, and saving the value as rsq

text(4500, 55, paste0("R-squared = ", rsq)) # we paste the rsq value on the plot

Interpretation of Simple Linear Regression output

The scatter diagram indicates that Age and Salary have a positive linear relationship.
summary(linear_mod_exec)
## 
## Call:
## lm(formula = Age ~ Salary_in_Thousands_USD, data = Exec)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5806 -2.0710  0.3294  1.7972  5.0309 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4.651e+01  1.362e+00   34.16 4.10e-14 ***
## Salary_in_Thousands_USD 6.054e-03  5.476e-04   11.06 5.55e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.946 on 13 degrees of freedom
## Multiple R-squared:  0.9039, Adjusted R-squared:  0.8965 
## F-statistic: 122.2 on 1 and 13 DF,  p-value: 5.546e-08
  c. Suppose Bill Gustin is the 72-year-old chairman, president, and CEO of a major electronics company. Predict the annual salary for Bill Gustin.

The summary(linear_mod_exec) command gives us p-values and standard errors for the coefficients, as well as the R-squared statistic and the F-statistic for the model.

The value of R-squared is a measure of the goodness of fit of the estimated regression equation. The Multiple R-squared of 0.9039 indicates that, using Salary as the independent variable, our model explains about 90.39% of the variability in Age. In other words, 90.39% of the variability in the ages in the sample can be explained by the linear relationship between age and salary.
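If we want these statistics programmatically rather than reading them off the printed output, they are stored as components of the summary object; a sketch using the summ object created above:

#Pulling individual statistics out of the saved summary object
summ$coefficients   # estimates, standard errors, t values, and p-values
summ$r.squared      # Multiple R-squared (0.9039 here)
summ$adj.r.squared  # Adjusted R-squared
summ$fstatistic     # F value with its numerator and denominator df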

The intercept is 4.651e+01 (about 46.51). It is the estimated value of the dependent variable y when the independent variable x is equal to 0. In other words, if the salary for an executive were $0, the predicted age would be about 46.51 years. (Since x = 0 lies well outside the range of the data, the intercept has no practical interpretation here.)

The slope for the independent variable Salary_in_Thousands_USD is 6.054e-03. It means that for every one thousand dollar increase in salary, the predicted age increases by about 0.006054 years.

The residuals are the differences between the actual response values (Age) and the model-predicted response values. We will discuss residuals in the following sections.
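As a quick sketch, this definition can be confirmed directly from the model object: subtracting the fitted values from the observed ages reproduces the residuals.

#Residuals are observed values minus fitted values
head(Exec$Age - linear_mod_exec$fitted.values) #should match the stored residuals
head(linear_mod_exec$residuals)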

Now let’s write our simple linear regression equation using the intercept and slope estimates: Age = 4.651e+01 + 6.054e-03*Salary. Because the model was fit with Age as the response, predicting the annual salary of the 72-year-old Bill Gustin requires solving this equation for Salary:

Salary_72<-(72 - 4.651e+01)/6.054e-03 ## Solving the regression equation for Salary when Age = 72
Salary_72
## [1] 4210.439

The predicted annual salary for Bill Gustin is therefore about $4,210 thousand, or roughly $4.2 million.
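More generally, R’s predict() function computes predictions from the stored, full-precision coefficients, which is less error-prone than retyping rounded values. Note that for this model it predicts the response (Age) for a supplied salary; a sketch:

#predict() predicts the response variable (Age here) for a new value of the
#predictor; the salary value 2000 below is just an illustration
predict(linear_mod_exec, newdata = data.frame(Salary_in_Thousands_USD = 2000))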

Q. 2. The Dow Jones Industrial Average (DJIA) and the Standard & Poor’s 500 (S&P 500) indexes are used as measures of overall movement in the stock market. The DJIA is based on the price movements of 30 large companies; the S&P 500 is an index composed of 500 stocks. Some say the S&P 500 is a better measure of stock market performance because it is broader based. The closing prices for the DJIA and the S&P 500 for 15 weeks, beginning with January 6, 2012 (Barron’s web site, April 17, 2012), are given in DJIAS_P500.csv. [12 Points]

setwd("C:/Users/plu5638/Desktop/Business Analytics/Module 5/")
DJIA_500<-read.csv("DJIAs_P500.csv")
print(DJIA_500)
##           Date  DJIA S.P.500
## 1    January 6 12360    1278
## 2   January 13 12422    1289
## 3   January 20 12720    1315
## 4   January 27 12660    1316
## 5   February 3 12862    1345
## 6  February 10 12801    1343
## 7  February 17 12950    1362
## 8  February 24 12983    1366
## 9      March 2 12978    1370
## 10     March 9 12922    1371
## 11    March 16 13233    1404
## 12    March 23 13081    1397
## 13    March 30 13212    1408
## 14     April 5 13060    1398
## 15    April 13 12850    1370
  a. Develop a scatter chart for these data with DJIA as the independent variable. What does the scatter chart indicate about the relationship between DJIA and S&P 500?
#Create a Scatter Plot 
plot(DJIA_500$DJIA~DJIA_500$S.P.500)

  b. Show the regression output and develop an estimated regression equation. Provide an interpretation of the regression equation in the context of how DJIA is related to S&P 500.
#Simple Linear Regression Model
linear_mod_DJIA_500<-lm(DJIA~S.P.500, data=DJIA_500) # Here the dependent variable, DJIA (which we are trying to predict), comes first, followed by the independent variable.
summary(linear_mod_DJIA_500) # this is the regression model output
## 
## Call:
## lm(formula = DJIA ~ S.P.500, data = DJIA_500)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110.59  -45.15   17.41   42.10   91.15 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4697.1066   528.0917   8.894 6.88e-07 ***
## S.P.500        6.0317     0.3894  15.488 9.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 59.5 on 13 degrees of freedom
## Multiple R-squared:  0.9486, Adjusted R-squared:  0.9446 
## F-statistic: 239.9 on 1 and 13 DF,  p-value: 9.292e-10
#Scatter Plot with best fit line

plot(DJIA_500$DJIA~DJIA_500$S.P.500)
abline(lm(DJIA_500$DJIA~DJIA_500$S.P.500), col="red", lwd=3) #Adds best fit line to the plot

#Adding R-squared value to the scatter plot

#Option 1: Reading the output and adding the R-squared value on the plot

text(1300, 13100, "R-squared = 0.9486") # The (x, y) position must fall within the plotted data range; refer to page 129 of Stowell for selecting the position of text on the plot

#Option 2: Letting R fill out the R-squared value on the plot  

summ_DJIA_500<-summary(linear_mod_DJIA_500) #we are saving the regression model output in the object summ_DJIA_500, then we can use it to call the value we want. 

rsq_DJIA_500<-round((summ_DJIA_500$r.squared), digits=4) #Here, we are calling the r.squared value from the summ_DJIA_500 object, rounding it to 4 digits, and saving the value as rsq_DJIA_500

text(1300, 13100, paste0("R-squared = ", rsq_DJIA_500)) # we paste the rsq_DJIA_500 value on the plot

Interpretation of Simple Linear Regression output

The scatter chart indicates that the DJIA and S&P 500 have a positive linear relationship.
summary(linear_mod_DJIA_500)
## 
## Call:
## lm(formula = DJIA ~ S.P.500, data = DJIA_500)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110.59  -45.15   17.41   42.10   91.15 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4697.1066   528.0917   8.894 6.88e-07 ***
## S.P.500        6.0317     0.3894  15.488 9.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 59.5 on 13 degrees of freedom
## Multiple R-squared:  0.9486, Adjusted R-squared:  0.9446 
## F-statistic: 239.9 on 1 and 13 DF,  p-value: 9.292e-10
  c. What is the 95% confidence interval for the regression parameter β1? Based on this interval, what conclusion can you make about the hypothesis that the regression parameter β1 is equal to zero?
confint(linear_mod_DJIA_500, level = 0.95)# computes a confidence interval for the coefficient estimates
##                   2.5 %      97.5 %
## (Intercept) 3556.233890 5837.979265
## S.P.500        5.190417    6.873069
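Because this interval for the S.P.500 coefficient (about 5.19 to 6.87) does not contain zero, we can reject the hypothesis that the regression parameter β1 equals zero at the 0.05 significance level. As a sketch of how confint() arrives at these numbers, the interval is the coefficient estimate plus or minus the 97.5th percentile of the t distribution with 13 degrees of freedom times the standard error; using the rounded estimate and standard error from the summary output:

#Hand-computed 95% CI for the S.P.500 slope; matches confint() up to rounding
6.0317 + c(-1, 1) * qt(0.975, df = 13) * 0.3894
## [1] 5.190452 6.872948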
attributes(linear_mod_DJIA_500)# This prints the name of various attributes calculated in the linear model
## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"        
## 
## $class
## [1] "lm"
linear_mod_DJIA_500$coefficients # intercept and coefficient of S.P.500 as seen in the regression model output
## (Intercept)     S.P.500 
## 4697.106578    6.031743
linear_mod_DJIA_500$fitted.values #These are the predicted values calculated by using the regression equation
##        1        2        3        4        5        6        7        8 
## 12405.67 12472.02 12628.85 12634.88 12809.80 12797.74 12912.34 12936.47 
##        9       10       11       12       13       14       15 
## 12960.59 12966.63 13165.67 13123.45 13189.80 13129.48 12960.59
linear_mod_DJIA_500$residuals #These are the errors
##           1           2           3           4           5           6 
##  -45.674299  -50.023473   91.151205   25.119462   52.198911    3.262398 
##           7           8           9          10          11          12 
##   37.659278   46.532306   17.405333  -44.626410   67.326067  -42.451731 
##          13          14          15 
##   22.199094  -69.483474 -110.594667
pred_time_DJIA_500<-linear_mod_DJIA_500$fitted.values
residual_error_DJIA_500<-linear_mod_DJIA_500$residuals
DJIA_500_pred<-cbind(DJIA_500,pred_time_DJIA_500,residual_error_DJIA_500) # the command cbind adds these two columns to the table
print(DJIA_500_pred)
##           Date  DJIA S.P.500 pred_time_DJIA_500 residual_error_DJIA_500
## 1    January 6 12360    1278           12405.67              -45.674299
## 2   January 13 12422    1289           12472.02              -50.023473
## 3   January 20 12720    1315           12628.85               91.151205
## 4   January 27 12660    1316           12634.88               25.119462
## 5   February 3 12862    1345           12809.80               52.198911
## 6  February 10 12801    1343           12797.74                3.262398
## 7  February 17 12950    1362           12912.34               37.659278
## 8  February 24 12983    1366           12936.47               46.532306
## 9      March 2 12978    1370           12960.59               17.405333
## 10     March 9 12922    1371           12966.63              -44.626410
## 11    March 16 13233    1404           13165.67               67.326067
## 12    March 23 13081    1397           13123.45              -42.451731
## 13    March 30 13212    1408           13189.80               22.199094
## 14     April 5 13060    1398           13129.48              -69.483474
## 15    April 13 12850    1370           12960.59             -110.594667
summary(linear_mod_DJIA_500)
## 
## Call:
## lm(formula = DJIA ~ S.P.500, data = DJIA_500)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110.59  -45.15   17.41   42.10   91.15 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4697.1066   528.0917   8.894 6.88e-07 ***
## S.P.500        6.0317     0.3894  15.488 9.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 59.5 on 13 degrees of freedom
## Multiple R-squared:  0.9486, Adjusted R-squared:  0.9446 
## F-statistic: 239.9 on 1 and 13 DF,  p-value: 9.292e-10
#sum of squares due to error (SSE) 

SSE_DJIA_500<-sum(((DJIA_500$DJIA)-(linear_mod_DJIA_500$fitted.values))^2) # It measures the total sum of squared error in using our regression equation.
SSE_DJIA_500
## [1] 46028.21
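#Cross-check: deviance() on an lm object returns this same residual sum of
#squares, so it should agree with SSE_DJIA_500 above
deviance(linear_mod_DJIA_500)
## [1] 46028.21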
n<-15 #number of observations
RSE_DJIA_500<-sqrt((1/(n-2))*SSE_DJIA_500) #This is the residual standard error reported on the regression output
RSE_DJIA_500
## [1] 59.50321
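#summary.lm also stores this value as sigma; a quick cross-check using the
#summ_DJIA_500 object saved earlier:
summ_DJIA_500$sigma
## [1] 59.50321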
#TOTAL SUM OF SQUARES, SST 

SST_DJIA_500<-sum((DJIA_500$DJIA - mean(DJIA_500$DJIA))^2) #It measures the total sum of squared error from using the mean of y (DJIA) to predict. 

SST_DJIA_500
## [1] 895390.9
#SUM OF SQUARES DUE TO REGRESSION, SSR

SSR_DJIA_500<-sum((linear_mod_DJIA_500$fitted.values - mean(DJIA_500$DJIA))^2)

SSR_DJIA_500
## [1] 849362.7
#sum of squares due to error (SSE): another formula 

SSE_DJIA_500<-SST_DJIA_500-SSR_DJIA_500

SSE_DJIA_500
## [1] 46028.21
#The COEFFICIENT OF DETERMINATION (R-SQUARED; we also get this value on the linear regression model output)

RSQ_DJIA_500<-SSR_DJIA_500/SST_DJIA_500 #Please see the interpretation of R-squared in the previous section.
RSQ_DJIA_500
## [1] 0.9485943
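#With one predictor, R-squared also equals the squared sample correlation
#between the two variables; another quick sanity check:
cor(DJIA_500$DJIA, DJIA_500$S.P.500)^2
## [1] 0.9485943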
#Residual Plot Against the Predicted Values
plot(linear_mod_DJIA_500$fitted.values, linear_mod_DJIA_500$residuals, xlab="Predicted DJIA", ylab="Residuals", col="red",pch=19, main="Residual Plot Against Predicted DJIA")
abline(h=0, lty=3) #Horizontal line at 0 residual value (y-axis); lty=3 creates a dotted line

#Normal Probability Plot of Residual
std_residual_DJIA_500<-rstandard(linear_mod_DJIA_500)
qqnorm(std_residual_DJIA_500,xlab="Normal Scores", ylab="Residuals", col="magenta3",  pch=19, main="Normal Probability Plot of Residual" )
qqline(std_residual_DJIA_500)
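As an alternative sketch, base R can draw these diagnostic plots directly from the fitted model object with plot.lm; which = 1 gives residuals versus fitted values and which = 2 gives the normal Q-Q plot:

#Built-in diagnostic plots from the lm object
plot(linear_mod_DJIA_500, which = 1) #residuals vs. fitted values
plot(linear_mod_DJIA_500, which = 2) #normal Q-Q plot of standardized residuals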

  d. How much of the variation in the sample values of S&P 500 does the model estimated in part (b) explain?

From the regression output, the Multiple R-squared is 0.9486, so the model estimated in part (b) explains about 94.86% of the variability in the sample values of the response variable.

We can also test whether the residuals are normally distributed using the Shapiro-Wilk test:
shapiro.test(linear_mod_DJIA_500$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  linear_mod_DJIA_500$residuals
## W = 0.95691, p-value = 0.6389

Because the p-value of the test (0.6389) is greater than 0.05, we fail to reject the null hypothesis of normality; the residuals of our regression model appear to be normally distributed.

Q. 3. In 2011, home prices and mortgage rates fell so far that in a number of cities the monthly cost of owning a home was less expensive than renting. The RentMortgage.csv data show the average asking rent for 10 markets and the monthly mortgage on the median-priced home (including taxes and insurance) for 10 cities where the average monthly mortgage payment was less than the average asking rent (The Wall Street Journal, November 26-27, 2011). [10 Points]

setwd("C:/Users/plu5638/Desktop/Business Analytics/Module 5/")
RentMort<-read.csv("RentMortgage.csv")
print(RentMort)
##                  City Rent.... Mortgage....
## 1             Atlanta      840          539
## 2             Chicago     1062         1002
## 3             Detroit      823          626
## 4  Jacksonville, Fla.      779          711
## 5           Las Vegas      796          655
## 6               Miami     1071          977
## 7         Minneapolis      953          776
## 8       Orlando, Fla.      851          695
## 9             Phoenix      762          651
## 10          St. Louis      723          654
  a. Develop a simple linear regression model using the average asking rent as the independent variable and show the linear regression output. Develop the estimated regression equation that can be used to predict the monthly mortgage given the average asking rent.
#Simple Linear Regression Model
linear_mod_RentMort<-lm(Rent....~Mortgage...., data=RentMort) # Here the dependent variable, Rent (which we are trying to predict), comes first, followed by the independent variable.
summary(linear_mod_RentMort) # this is the regression model output
## 
## Call:
## lm(formula = Rent.... ~ Mortgage...., data = RentMort)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -90.278 -41.365   5.764  29.495 107.995 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  351.0815   105.3493   3.333  0.01035 * 
## Mortgage....   0.7067     0.1419   4.981  0.00108 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 64.03 on 8 degrees of freedom
## Multiple R-squared:  0.7561, Adjusted R-squared:  0.7257 
## F-statistic: 24.81 on 1 and 8 DF,  p-value: 0.001079
#Scatter Plot with best fit line

plot(RentMort$Rent....~RentMort$Mortgage....)
abline(lm(RentMort$Rent....~RentMort$Mortgage....), col="red", lwd=3) #Adds best fit line to the plot

#Adding R-squared value to the scatter plot

#Option 1: Reading the output and adding the R-squared value on the plot

text(950, 780, "R-squared = 0.7561") # The (x, y) position must fall within the plotted data range; refer to page 129 of Stowell for selecting the position of text on the plot

#Option 2: Letting R fill out the R-squared value on the plot  

summ_RentMort<-summary(linear_mod_RentMort) #we are saving the regression model output in the object summ_RentMort, then we can use it to call the value we want. 

rsq_RentMort<-round((summ_RentMort$r.squared), digits=4) #Here, we are calling the r.squared value from the summ_RentMort object, rounding it to 4 digits, and saving the value as rsq_RentMort

text(950, 780, paste0("R-squared = ", rsq_RentMort)) # we paste the rsq_RentMort value on the plot

Interpretation of Simple Linear Regression output

summary(linear_mod_RentMort)
## 
## Call:
## lm(formula = Rent.... ~ Mortgage...., data = RentMort)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -90.278 -41.365   5.764  29.495 107.995 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  351.0815   105.3493   3.333  0.01035 * 
## Mortgage....   0.7067     0.1419   4.981  0.00108 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 64.03 on 8 degrees of freedom
## Multiple R-squared:  0.7561, Adjusted R-squared:  0.7257 
## F-statistic: 24.81 on 1 and 8 DF,  p-value: 0.001079
confint(linear_mod_RentMort, level = 0.95)# computes a confidence interval for the coefficient estimates
##                    2.5 %     97.5 %
## (Intercept)  108.1455790 594.017520
## Mortgage....   0.3795108   1.033935
attributes(linear_mod_RentMort)# This prints the name of various attributes calculated in the linear model
## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"        
## 
## $class
## [1] "lm"
linear_mod_RentMort$coefficients # intercept and coefficient of Mortgage.... as seen in the regression model output
##  (Intercept) Mortgage.... 
##  351.0815492    0.7067231
linear_mod_RentMort$fitted.values #These are the predicted values calculated by using the regression equation
##         1         2         3         4         5         6         7         8 
##  732.0053 1059.2181  793.4902  853.5617  813.9852 1041.5500  899.4987  842.2541 
##         9        10 
##  811.1583  813.2785
linear_mod_RentMort$residuals #These are the errors
##          1          2          3          4          5          6          7 
## 107.994700   2.781904  29.509790 -74.561673 -17.985180  29.449982  53.501325 
##          8          9         10 
##   8.745896 -49.158287 -90.278457
pred_time_RentMort<-linear_mod_RentMort$fitted.values
residual_error_RentMort<-linear_mod_RentMort$residuals
RentMort_pred<-cbind(RentMort,pred_time_RentMort,residual_error_RentMort) # the command cbind adds these two columns to the table
print(RentMort_pred)
##                  City Rent.... Mortgage.... pred_time_RentMort
## 1             Atlanta      840          539           732.0053
## 2             Chicago     1062         1002          1059.2181
## 3             Detroit      823          626           793.4902
## 4  Jacksonville, Fla.      779          711           853.5617
## 5           Las Vegas      796          655           813.9852
## 6               Miami     1071          977          1041.5500
## 7         Minneapolis      953          776           899.4987
## 8       Orlando, Fla.      851          695           842.2541
## 9             Phoenix      762          651           811.1583
## 10          St. Louis      723          654           813.2785
##    residual_error_RentMort
## 1               107.994700
## 2                 2.781904
## 3                29.509790
## 4               -74.561673
## 5               -17.985180
## 6                29.449982
## 7                53.501325
## 8                 8.745896
## 9               -49.158287
## 10              -90.278457
summary(linear_mod_RentMort)
## 
## Call:
## lm(formula = Rent.... ~ Mortgage...., data = RentMort)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -90.278 -41.365   5.764  29.495 107.995 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  351.0815   105.3493   3.333  0.01035 * 
## Mortgage....   0.7067     0.1419   4.981  0.00108 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 64.03 on 8 degrees of freedom
## Multiple R-squared:  0.7561, Adjusted R-squared:  0.7257 
## F-statistic: 24.81 on 1 and 8 DF,  p-value: 0.001079
#sum of squares due to error (SSE) 

SSE_RentMort<-sum(((RentMort$Rent....)-(linear_mod_RentMort$fitted.values))^2) # It measures the total sum of squared error in using our regression equation.
SSE_RentMort
## [1] 32797.25
n<-10 #number of observations
RSE_RentMort<-sqrt((1/(n-2))*SSE_RentMort) #We can find this value on regression output
RSE_RentMort
## [1] 64.02856
#TOTAL SUM OF SQUARES, SST 

SST_RentMort<-sum((RentMort$Rent.... - mean(RentMort$Rent....))^2) #It measures the total sum of squared error from using the mean of y (Rent) to predict. 

SST_RentMort
## [1] 134494
#SUM OF SQUARES DUE TO REGRESSION, SSR

SSR_RentMort<-sum((linear_mod_RentMort$fitted.values - mean(RentMort$Rent....))^2)

SSR_RentMort
## [1] 101696.8
#sum of squares due to error (SSE): another formula 

SSE_RentMort<-SST_RentMort-SSR_RentMort

SSE_RentMort
## [1] 32797.25
#The COEFFICIENT OF DETERMINATION (R-SQUARED; we also get this value on the linear regression model output)

RSQ_RentMort<-SSR_RentMort/SST_RentMort #Please see the interpretation of R-squared in the previous section.
RSQ_RentMort
## [1] 0.7561431
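As a cross-check, the same decomposition can be read off the model’s ANOVA table: the Sum Sq entry for Mortgage.... is SSR and the Residuals entry is SSE. Given the values above, the expected (rounded) output would be:

anova(linear_mod_RentMort)
## Analysis of Variance Table
## 
## Response: Rent....
##              Df Sum Sq Mean Sq F value  Pr(>F)   
## Mortgage....  1 101697  101697  24.806 0.001079 **
## Residuals     8  32797    4100                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1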
  b. Construct a residual plot against the predicted y values and a normal probability plot. Perform a statistical test of normality using the residual values.
#Residual Plot Against the Predicted Values
plot(linear_mod_RentMort$fitted.values, linear_mod_RentMort$residuals, xlab="Predicted Rent", ylab="Residuals", col="red",pch=19, main="Residual Plot Against Predicted Rent")
abline(h=0, lty=3) #Horizontal line at 0 residual value (y-axis); lty=3 creates a dotted line

#Normal Probability Plot of Residual
std_residual_RentMort<-rstandard(linear_mod_RentMort)
qqnorm(std_residual_RentMort,xlab="Normal Scores", ylab="Residuals", col="magenta3",  pch=19, main="Normal Probability Plot of Residual" )
qqline(std_residual_RentMort)

shapiro.test(linear_mod_RentMort$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  linear_mod_RentMort$residuals
## W = 0.97484, p-value = 0.9317
  c. Briefly summarize your interpretation of the residual plot, the normal probability plot, and the statistical test of normality.

The residual plot shows the residuals scattered around zero with no obvious pattern, which supports the linearity and constant-variance assumptions. The distribution of the residuals would be considered perfectly normal if the data points fell exactly on the straight line in the normal Q-Q plot. On the plot, the data points lie close to the straight line (although not all points are exactly on it), so we can assume that the residuals are very close to normally distributed. In addition, the p-value of the Shapiro-Wilk test (0.9317) is greater than the alpha level of 0.05, so we conclude that the observed distribution is consistent with a normal distribution.