GSM 5103 Final

Instructions

Answers should be written in clear English and complete sentences with reference to appropriate statistical output. Highlighted output is not an answer; taking the output and translating it into a clear sentence that captures the meaning is your task. When intervals are called for, use them: both boundaries.

Finding Larry’s Replacement

The cold has arrived in Chicago and your investment management firm – MoneyMagic – faces a dilemma. After many years of dedicated and profitable service, the firm’s leading fund manager, Larry LeDuc, has announced his intention to step aside. The management team must find a replacement for the star manager of their two signature funds: MoneyMagic GRI 1 and MoneyMagic GRI 2 – both growth and income funds.

There are two primary candidates to replace Larry. Hank Hogan is a thirty-five-year-old graduate of the Ohio State University with five years of experience managing a growth and income fund. He does not hold an MBA and lists as his hobbies games of strategy, military history, and family time. The other primary candidate is Prudence Preudhomme, a thirty-two-year-old graduate of Princeton University with a Harvard MBA who has managed a growth and income fund for two years. She lists her hobbies as Harvard Alumni network events.

The hiring team is divided based on their personal perspectives on what makes a good fund manager. The CEO, Robert Maxwell, holds an MBA from Kellogg and was an undergraduate at Yale and he is impressed by pedigree. He believes that managers with MBAs and top school pedigree are smarter, better educated, better networked, and just better equipped to manage other people’s money. Maxwell believes that Prudence Preudhomme is an ideal hire.

Larry LeDuc leads the other camp. Larry was born in rural Illinois and dropped out of university to trade stocks. His early success trading on his own led to his hire by a fund management company and to his eventual rise to managing MoneyMagic GRI 1 and 2. He has successfully managed the latter two funds for almost ten years and points to himself as data for hiring. “I never finished college and I haven’t needed an MBA to be a success. Age, MBAs, undergraduate pedigree – all meaningless! People with pedigree are too picky, critical, and political!” Needless to say, Larry believes that Hank Hogan should be hired.

Your background

Everyone in the firm is being called upon to contribute to this hire. You are a 27 year old new analyst with MoneyMagic and you worked in investments for 2 years after Yale and before receiving your MBA. You are young and upwardly mobile with large loans to pay off. It is clearly in your personal best interests for the firm to prefer MBAs from prestigious undergraduate institutions because these are parts of your pedigree. But others are aware of your incentives; only claim that which the data justifies.

Relevant to the problem at hand, you recall that your pre-MBA employer was sued for employment discrimination and the decisive evidence had been multiple regression. Controlling for age, education, rank, and specialization, it had been shown that gender (male as 0 and female as 1) had a negative impact on gross wages that, with 95% confidence, ranged from -$9004 to -$6286. The importance of regression in the disposition of the case alongside the growth of human resource analytics led you to focus on decision sciences during your MBA and you studied the case and the data. Here is your chance to shine by bringing your analytics to a key hiring problem for your company. And a fortuitous opportunity has arisen….

An Opportunity Arises….

Mr. Maxwell is having his shoes shined in the break room when you enter for a cup of coffee. Not five seconds pass before breaking news flashes across the screen that your previous employer has paid a multibillion dollar settlement in the aforementioned discrimination case; the statistical evidence was cited. You see a light go off in Maxwell’s head and he says to you, “Meet me in my office in five minutes!” You arrive in Maxwell’s office and he presents you with some regression output.

A Regression

MFP.LM
## 
## Call:
## lm(formula = Returns ~ GRI + SAT + MBA + Age + Tenure, data = MFPerform)
## 
## Coefficients:
## (Intercept)       GRIGRI          SAT       MBAYes          Age       Tenure  
##   -0.297887    -2.788525     0.005379     1.192412    -0.108892     0.010426

Maxwell’s Belief

Maxwell says, “That news report made me think that some statistical analysis might be in order so I looked at some data that I have from Morningstar. I wanted to know if pedigree and an MBA matter for fund managers. But the low r-squared made the data useless, or at least that is my vague recollection from analytics in graduate school. I remember that and that 95% confidence is common, use that if you need. Perhaps you can have a closer look and convince me that there is something here…..” Maxwell continues, “These data come from a random sample of 540 Morningstar mutual funds. For each fund, the following characteristics are measured in MFPerform.RData [an .RData file].”

The Data and Definitions

Variable Definition
Returns The percent excess returns of the fund in the year of the observation. Excess returns are the returns over or under the percentage return on a benchmark portfolio consisting of all stocks traded on the two major American stock markets [Nasdaq and NYSE]. For example, if a fund returns 7 percent and the benchmark returned 10 percent, then the Returns are -3.
GRI A qualitative variable that is either GRI [a Growth and Income fund] or Growth [a Growth fund] according to Morningstar classifications.
SAT The average composite SAT scores of enrolling students at the institution where the fund manager received his/her undergraduate degree.
MBA The fund manager either holds an MBA (Yes) or does not hold an MBA (No).
Age The age in calendar years of the fund manager at the end of the previous calendar year, e.g. the 2008 data would contain the age of the fund manager on December 31, 2007.
Tenure The tenure of the fund manager in whole numbers of years managing the fund. Note that this is not how long an individual has been a fund manager, it only measures how long this manager has managed this fund.

The First Question [5 pts.]

Convince Mr. Maxwell why r-squared alone isn’t enough to tell you whether or not the model is useful and come up with an example of high r-squared in a useless regression.

  1. What is r-squared for this regression and what does it mean? Maxwell has forgotten, he just remembers \(r^2\).

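The r-squared is not one of the coefficients: -0.297887 is the intercept, the predicted excess return when every predictor is zero. The output Maxwell showed prints only the coefficients, so r-squared has to come from summary(MFP.LM). Using the residual sum of squares reported later in the stepwise output (10574) and the total variation in Returns (539 × 4.895² ≈ 12915), r-squared is about 1 - 10574/12915 ≈ 0.18. It means that the model accounts for roughly 18% of the variation in excess returns around their mean. For a regression with an intercept, r-squared lies between 0 and 1, so it cannot be negative. A low r-squared does not make the model useless: it says that individual funds are hard to predict, not that MBA, SAT, age, or fund type have no effect on average.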
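As one illustration of a high r-squared in a useless regression, the sketch below uses simulated data rather than MFPerform. It regresses returns in percent on the same returns written as decimals. The names `returns_pct`, `returns_dec`, and `useless_fit` are made up for this example.

```r
# Simulated excess returns; mean and sd mimic the sample values (0.729, 4.895)
set.seed(5103)
returns_pct <- rnorm(540, mean = 0.729, sd = 4.895)

# The "predictor" is the outcome itself, only rescaled to decimals
returns_dec <- returns_pct / 100

useless_fit <- lm(returns_pct ~ returns_dec)

# r-squared by hand: 1 - residual sum of squares / total sum of squares
rss <- sum(residuals(useless_fit)^2)
tss <- sum((returns_pct - mean(returns_pct))^2)
r_squared <- 1 - rss / tss
r_squared
```

The fit is essentially perfect, yet it tells us nothing about what drives returns, which is why r-squared alone cannot decide whether a model is useful.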

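  1. What statistic from the information provided above would you choose instead to evaluate a regression and to make the case for further analysis?

Instead of r-squared, I would look at the overall F-statistic and its p-value, which test whether the predictors jointly explain any variation at all, together with the residual standard error, which says how far actual returns typically sit from the fitted values in percentage points. The size of a raw coefficient is not a good guide because it depends on units: GRIGRI (-2.79) compares two fund types, while SAT (0.005) is the change per single SAT point. These statistics are not printed in the coefficient-only output, so they come from summary(MFP.LM). For the stepwise model reported later, F = 29.6 on 4 and 535 degrees of freedom with p < 2.2e-16, which is a strong case for further analysis.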

Some Preliminaries [35 pts]

The following questions will require that you compute things except 1(c).

    1. Numerically summarise the data.

Summary of returns

result1 <- explore(
  MFPerform, 
  vars = "Returns", 
  fun = c("n_obs", "mean", "min", "max", "sd", "se"), 
  nr = Inf
)
# summary()
dtab(result1) %>% render()

I’m guessing you’re asking for just Returns? The mean of Returns is 0.729, with a substantial range (-14.056 to 16.865) and a standard deviation of 4.895.

If you meant a summary of every variable, I’ll provide it just in case.

result2 <- explore(
  MFPerform, 
  vars = c("Returns", "GRI", "SAT", "MBA", "Age", "Tenure"), 
  fun = c("n_obs", "mean", "min", "max", "sd", "se"), 
  nr = Inf
)
# summary()
dtab(result2) %>% render()
  1. Provide a graphical description of Returns.

Histogram for Returns

ggplot(MFPerform) +
  aes(x = Returns) +
  geom_histogram(bins = 30L, fill = "#0C778A") +
  theme_minimal()

A simple histogram shows the spread of the data. Most returns are small percentages, both negative and positive, with the highest counts near 0% and only a few funds with gains or losses of 10% or more.

  1. Just considering the scatterplot below (There is no graph below, I need to produce it), is there a bivariate relationship between Tenure and Returns [with at least 95% confidence]? How do you know?
visualize(
  MFPerform,
  xvar = "Tenure", 
  yvar = "Returns", 
  type = "scatter", 
  nrobs = -1, 
  check = "line", 
  custom = FALSE
)

plot(x = MFPerform$Tenure, y = MFPerform$Returns)

model1 <- lm(MFPerform$Returns ~ MFPerform$Tenure)
model1
## 
## Call:
## lm(formula = MFPerform$Returns ~ MFPerform$Tenure)
## 
## Coefficients:
##      (Intercept)  MFPerform$Tenure  
##           1.2266           -0.1338
summary(model1)
## 
## Call:
## lm(formula = MFPerform$Returns ~ MFPerform$Tenure)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.8808  -3.0615  -0.3684   3.3597  16.1736 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.22662    0.26129   4.694  3.4e-06 ***
## MFPerform$Tenure -0.13385    0.04225  -3.168  0.00162 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.854 on 538 degrees of freedom
## Multiple R-squared:  0.01831,    Adjusted R-squared:  0.01649 
## F-statistic: 10.04 on 1 and 538 DF,  p-value: 0.001622

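The scatterplot threw me off at first because the points form a wide cloud, but there is a bivariate relationship between Tenure and Returns with 95% confidence. The p-value for the Tenure slope is 0.0016, well below 0.05, so a slope of zero is rejected; on the scatterplot, this shows up as a fitted line that slopes downward. The slope estimate is -0.134, so each additional year of tenure is associated with excess returns about 0.13 percentage points lower, and the 95% interval for the slope (estimate plus or minus about 1.96 standard errors) is roughly -0.217 to -0.051, which excludes zero. The relationship is weak, though: r-squared is 0.018, so Tenure alone explains less than 2% of the variation in Returns.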
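The interval for the Tenure slope can be rebuilt from the estimate and standard error printed in summary(model1). This is a minimal sketch that copies those rounded values by hand instead of using the fitted model object; confint(model1) should give the same interval up to rounding.

```r
# Values copied from the summary(model1) coefficient table
slope <- -0.13385
se_slope <- 0.04225
df_resid <- 538

# Two-sided 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
slope_ci <- slope + c(-1, 1) * t_crit * se_slope
round(slope_ci, 3)
```

An interval that excludes zero is the 95% confidence evidence that a relationship exists.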

  1. Are Returns normal?

The data for Returns look approximately normal. As previously seen in the histogram I created, the data follow a bell-curve shape. Further, I can create a new histogram with the normal curve overlaid, showing that the data follow it closely. Most of the returns are concentrated near the mean, and only a few points go out further to the +/- 10% ranges. A histogram is suggestive rather than conclusive; a Q-Q plot or a Shapiro-Wilk test would be a more formal check.

ggplot(MFPerform, aes(Returns)) +
  geom_histogram(aes(y = after_stat(density)), bins = 7) +
  stat_function(fun = dnorm, args = list(mean = mean(MFPerform$Returns), sd = sd(MFPerform$Returns)))

  1. Assuming that Returns are normal, with a mean exactly equal to the sample mean and standard deviation exactly equal to the sample standard deviation:
    1. with probability 0.9, excess returns should range between XXX and XXX. What is XXX and XXX?
t.test(MFPerform$Returns, conf.level = 0.90)
## 
##  One Sample t-test
## 
## data:  MFPerform$Returns
## t = 3.4627, df = 539, p-value = 0.0005773
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
##  0.3823241 1.0764845
## sample estimates:
## mean of x 
## 0.7294043

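The t-test interval above (0.382 to 1.076) is a 90% confidence interval for the average return, not the range that returns themselves fall in. Under the stated assumption that Returns are normal with mean 0.729 and standard deviation 4.895, the central 90% is 0.729 ± 1.645 × 4.895. With probability 0.9, excess returns should range between about -7.32 and 8.78 percentage points.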
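The range for individual returns comes from the normal quantile function. A minimal sketch, assuming the rounded sample mean and standard deviation from the Returns summary:

```r
sample_mean <- 0.729  # rounded sample mean of Returns
sample_sd <- 4.895    # rounded sample standard deviation of Returns

# Central 90% of a normal distribution: the 5th and 95th percentiles
central_90 <- qnorm(c(0.05, 0.95), mean = sample_mean, sd = sample_sd)
round(central_90, 2)
```

The interval is much wider than the t-test interval because it describes single funds, whose spread is governed by the standard deviation, whereas the t-test describes the mean, whose spread is governed by the standard error.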

(b) what is the probability of excess Returns greater than -3?
result3 <- prob_norm(mean = .729, stdev = 4.895, lb = -3)
summary(result1)
## Explore
## Data        : MFPerform 
## Functions   : n_obs, mean, min, max, sd, se 
## Top         : Function 
## 
##  variable n_obs  mean     min    max    sd    se
##   Returns   540 0.729 -14.056 16.865 4.895 0.211
plot(result3)

The probability of excess Returns greater than -3 is about 0.777, or 77.7%.
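The same probability can be checked in base R without radiant. This sketch assumes the rounded mean and standard deviation used in the prob_norm call:

```r
# Upper-tail probability P(Returns > -3) under Normal(0.729, 4.895)
p_above <- pnorm(-3, mean = 0.729, sd = 4.895, lower.tail = FALSE)
round(p_above, 3)
```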

  1. The Cauchy distribution is defined as the ratio of two normal random variables. The noun, in R, is cauchy so the relevant functions are pcauchy, qcauchy, rcauchy, and dcauchy. The verbs here are 0 and 1 (with location = 0 and scale = 1, the ratio of two \(z\)).
    1. What is the probability that a Cauchy random variable takes values greater than -1?
pcauchy(-1, location = 0, scale = 1, lower.tail = TRUE, log.p = FALSE)
## [1] 0.25

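pcauchy(-1) with lower.tail = TRUE returns 0.25, which is the probability of a value no greater than -1. The probability that a Cauchy random variable is greater than -1 is the complement, 1 - 0.25 = 0.75, or 75%.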

(b) With probability 0.75, a Cauchy random variable is no greater than ______.
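Both Cauchy questions can be answered with the p and q functions named in the prompt. A minimal sketch for the standard Cauchy (location 0, scale 1); the variable names are illustrative:

```r
# (a) P(X > -1): the upper tail, so lower.tail = FALSE
p_greater <- pcauchy(-1, location = 0, scale = 1, lower.tail = FALSE)

# (b) The value with 75% of the distribution at or below it
q_75 <- qcauchy(0.75, location = 0, scale = 1)

c(p_greater = p_greater, q_75 = q_75)
```

By symmetry the two answers mirror each other: 75% of the distribution lies above -1, and 75% lies at or below approximately 1.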
  1. With 95% confidence, average Returns range between XXX and XXX. What are XXX and XXX?
t.test(MFPerform$Returns)
## 
##  One Sample t-test
## 
## data:  MFPerform$Returns
## t = 3.4627, df = 539, p-value = 0.0005773
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.3156144 1.1431942
## sample estimates:
## mean of x 
## 0.7294043

With 95% confidence, average Returns range from 0.316 to 1.143.

  1. With 95% confidence, do MBA’s generate higher returns than non-MBA’s? If so, by how much?

Yes vs No MBA Returns Table

result4 <- explore(
  MFPerform, 
  vars = "Returns", 
  byvar = "MBA", 
  fun = c("n_obs", "mean", "min", "max", "sd", "se", "p95"), 
  nr = Inf
)
# summary()
dtab(result4) %>% render()
result5 <- compare_means(
  MFPerform, 
  var1 = "MBA", 
  var2 = "Returns"
)
summary(result5, show = FALSE)
## Pairwise mean comparisons (t-test)
## Data      : MFPerform 
## Variables : MBA, Returns 
## Samples   : independent 
## Confidence: 0.95 
## Adjustment: None 
## 
##  MBA   mean   n n_missing    sd    se    me
##   No -0.293 205         0 4.692 0.328 0.646
##  Yes  1.355 335         0 4.918 0.269 0.529
## 
##  Null hyp.  Alt. hyp.             diff   p.value    
##  No = Yes   No not equal to Yes   -1.649 < .001  ***
## 
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(result5, plots = "scatter", custom = FALSE)

Yes. With 95% confidence, MBAs generate higher returns than non-MBAs (p < .001). MBAs average 1.355 (SD = 4.918) and non-MBAs average -0.293 (SD = 4.692), a difference of 1.649 percentage points. Using the reported standard errors (0.269 and 0.328), the 95% interval for the difference is roughly 0.8 to 2.5 percentage points.
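The interval for the difference can be rebuilt from the group summaries in the comparison table. The sketch below assumes an unequal-variance (Welch) t interval; `welch_ci` is a hypothetical helper written for this example, not a radiant function.

```r
# Welch interval for mean1 - mean2 from summary statistics
welch_ci <- function(mean1, sd1, n1, mean2, sd2, n2, level = 0.95) {
  v1 <- sd1^2 / n1
  v2 <- sd2^2 / n2
  se <- sqrt(v1 + v2)
  # Welch-Satterthwaite degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  t_crit <- qt(1 - (1 - level) / 2, df = df)
  diff <- mean1 - mean2
  c(diff = diff, lower = diff - t_crit * se, upper = diff + t_crit * se)
}

# MBA (Yes) minus non-MBA (No) Returns, values from the table
mba_ci <- welch_ci(1.355, 4.918, 335, -0.293, 4.692, 205)
round(mba_ci, 2)
```

The same helper works for any two-group comparison reported with means, standard deviations, and group sizes, such as the Age comparison by MBA status.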

  1. Let’s analyse the type of fund: GRI.
    1. Provide a table of Growth vs. GRI types.

Table for Total GRI vs Growth Funds

result6 <- pivotr(MFPerform, cvars = "GRI", nr = Inf)
plot(result6)

# summary()
dtab(result6) %>% render()

There are 540 funds in total, with more Growth funds (327) than GRI funds (213).

  1. With 95% confidence, are growth funds more common than GRI funds?
binom.test(327, 540)
## 
##  Exact binomial test
## 
## data:  327 and 540
## number of successes = 327, number of trials = 540, p-value = 1.061e-06
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.5629224 0.6470270
## sample estimates:
## probability of success 
##              0.6055556

With 95% confidence, the share of Growth funds is between 0.563 and 0.647. The whole interval lies above 0.5, so Growth funds are more common than GRI funds.

  1. Provide a 95% confidence interval for the probability that a randomly chosen fund is Growth type?
prop.test(table(MFPerform$GRI))
## 
##  1-sample proportions test with continuity correction
## 
## data:  table(MFPerform$GRI), null probability 0.5
## X-squared = 23.646, df = 1, p-value = 1.158e-06
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.5627918 0.6467948
## sample estimates:
##         p 
## 0.6055556

The 95% confidence interval for the probability that a randomly chosen fund is Growth type runs from 0.563 to 0.647.

  1. With 95% confidence, is there a difference [or no difference] in the proportion of MBA’s across the two types of funds?

Table for MBA Holdings between GRI Funds

result7 <- pivotr(
  MFPerform, 
  cvars = c("GRI", "MBA"), 
  nr = Inf
)
# summary()
dtab(result7) %>% render()
result8 <- compare_props(
  MFPerform, 
  var1 = "GRI", 
  var2 = "MBA", 
  levs = "No"
)
summary(result8, show = FALSE)
## Pairwise proportion comparisons
## Data      : MFPerform 
## Variables : GRI, MBA 
## Level     : No in MBA 
## Confidence: 0.95 
## Adjustment: None 
## 
##     GRI  No Yes     p   n n_missing    sd    se    me
##  Growth 126 201 0.385 327         0 0.487 0.027 0.053
##     GRI  79 134 0.371 213         0 0.483 0.033 0.065
## 
##  Null hyp.      Alt. hyp.                 diff  p.value  
##  Growth = GRI   Growth not equal to GRI   0.014 0.736    
## 
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(result8, plots = "bar", custom = FALSE)

prop.test(table(MFPerform$GRI, MFPerform$MBA))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  table(MFPerform$GRI, MFPerform$MBA)
## X-squared = 0.060988, df = 1, p-value = 0.8049
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.07305678  0.10191495
## sample estimates:
##    prop 1    prop 2 
## 0.3853211 0.3708920

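With 95% confidence, there is no difference in the proportion of MBAs across the two types of funds. The difference in the proportion of non-MBAs (Growth minus GRI) ranges from -0.073 to 0.102, an interval that includes zero, and the p-value is 0.805 (the radiant comparison reports 0.736). The sample shares are close: 38.5% of Growth managers and 37.1% of GRI managers lack an MBA, a difference of 0.014.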

  1. With 95% confidence, compare the age of MBAs and non-MBAs. Interpret what you find in light of the fact that the average MBA takes two years. Is either group older with 95% confidence? If so, which?
result9 <- compare_props(
  MFPerform, 
  var1 = "Age", 
  var2 = "MBA", 
  levs = "No", 
  comb = "24:27"
)
summary(result9, show = FALSE)
## Pairwise proportion comparisons
## Data      : MFPerform 
## Variables : Age, MBA 
## Level     : No in MBA 
## Confidence: 0.95 
## Adjustment: None 
## 
##  Age No Yes     p  n n_missing    sd    se    me
##   24  0   1 0.000  1         0 0.000 0.000 0.000
##   27  1   2 0.333  3         0 0.471 0.272 0.533
##   28  1   4 0.200  5         0 0.400 0.179 0.351
##   29  6   7 0.462 13         0 0.499 0.138 0.271
##   30  5  13 0.278 18         0 0.448 0.106 0.207
##   31  5  13 0.278 18         0 0.448 0.106 0.207
##   32  5  13 0.278 18         0 0.448 0.106 0.207
##   33  5  15 0.250 20         0 0.433 0.097 0.190
##   34  8   9 0.471 17         0 0.499 0.121 0.237
##   35  6  19 0.240 25         0 0.427 0.085 0.167
##   36  4  10 0.286 14         0 0.452 0.121 0.237
##   37 11   5 0.688 16         0 0.464 0.116 0.227
##   38  8   4 0.667 12         0 0.471 0.136 0.267
##   39  5   5 0.500 10         0 0.500 0.158 0.310
##   40  2   8 0.200 10         0 0.400 0.126 0.248
##   41  3  12 0.200 15         0 0.400 0.103 0.202
##   42  8   8 0.500 16         0 0.500 0.125 0.245
##   43  7  14 0.333 21         0 0.471 0.103 0.202
##   44 13  15 0.464 28         0 0.499 0.094 0.185
##   45  6  14 0.300 20         0 0.458 0.102 0.201
##   46 12  14 0.462 26         0 0.499 0.098 0.192
##   47  8  18 0.308 26         0 0.462 0.091 0.177
##   48  9  19 0.321 28         0 0.467 0.088 0.173
##   49  8  23 0.258 31         0 0.438 0.079 0.154
##   50  3  11 0.214 14         0 0.410 0.110 0.215
##   51  4  10 0.286 14         0 0.452 0.121 0.237
##   52  6   8 0.429 14         0 0.495 0.132 0.259
##   53  3   3 0.500  6         0 0.500 0.204 0.400
##   54  4   4 0.500  8         0 0.500 0.177 0.346
##   55  3   4 0.429  7         0 0.495 0.187 0.367
##   56  5   5 0.500 10         0 0.500 0.158 0.310
##   57  2   2 0.500  4         0 0.500 0.250 0.490
##   58  2   4 0.333  6         0 0.471 0.192 0.377
##   59  2   2 0.500  4         0 0.500 0.250 0.490
##   60  6   4 0.600 10         0 0.490 0.155 0.304
##   61  2   3 0.400  5         0 0.490 0.219 0.429
##   62  0   5 0.000  5         0 0.000 0.000 0.000
##   63  1   1 0.500  2         0 0.500 0.354 0.693
##   64  2   2 0.500  4         0 0.500 0.250 0.490
##   65  1   1 0.500  2         0 0.500 0.354 0.693
##   66  3   1 0.750  4         0 0.433 0.217 0.424
##   67  3   0 1.000  3         0 0.000 0.000 0.000
##   71  1   0 1.000  1         0 0.000 0.000 0.000
##   73  1   0 1.000  1         0 0.000 0.000 0.000
##   75  2   0 1.000  2         0 0.000 0.000 0.000
##   76  1   0 1.000  1         0 0.000 0.000 0.000
##   77  1   0 1.000  1         0 0.000 0.000 0.000
##   79  1   0 1.000  1         0 0.000 0.000 0.000
## 
##  Null hyp. Alt. hyp.            diff   p.value              
##  24 = 27   24 not equal to 27   -0.333 1 (2000 replicates)  
## 
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(result9, plots = "bar", custom = FALSE)

Pivot Table for Average Age Between MBA’s and Non-MBA’s

result10 <- pivotr(
  MFPerform, 
  cvars = "MBA", 
  nvar = "Age", 
  nr = Inf
)
# summary()
dtab(result10) %>% render()

Total Summary Table for MBA’s from Age

result11 <- explore(
  MFPerform, 
  vars = "Age", 
  byvar = "MBA", 
  fun = c("n_obs", "mean", "min", "max", "sd", "se"), 
  nr = Inf
)
# summary()
dtab(result11) %>% render()
result12 <- compare_means(MFPerform, var1 = "MBA", var2 = "Age")
summary(result12, show = FALSE)
## Pairwise mean comparisons (t-test)
## Data      : MFPerform 
## Variables : MBA, Age 
## Samples   : independent 
## Confidence: 0.95 
## Adjustment: None 
## 
##  MBA   mean   n n_missing     sd    se    me
##   No 45.473 205         0 11.035 0.771 1.520
##  Yes 42.982 335         0  9.001 0.492 0.967
## 
##  Null hyp.  Alt. hyp.             diff  p.value   
##  No = Yes   No not equal to Yes   2.491 0.007   **
## 
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(result12, plots = "scatter", custom = FALSE)

For those with MBAs, the average age is 42.98 (range 24 to 66, SD 9.00). For those without MBAs, the average age is 45.47 (range 27 to 79, SD 11.04). With 95% confidence there is a difference in age (p = 0.007): non-MBAs are older by 2.49 years on average, and an interval built from the reported standard errors (0.771 and 0.492) runs from about 0.7 to 4.3 years. In light of the two years an MBA takes, this is the opposite of what time in school alone would suggest: MBA holders gave up two years to the degree and are still the younger group of managers. Note that Age is the manager's current age, not the age at which the degree was earned, so these data say nothing about when people start an MBA.

  1. With 95% confidence, how much time, in years, has the average fund manager spent outside their current job?
result13 <- explore(
  MFPerform, 
  vars = "Tenure", 
  fun = c("n_obs", "mean", "min", "max", "sd"), 
  nr = Inf
)
# summary()
dtab(result13) %>% render()
result14 <- single_mean(MFPerform, var = "Tenure")
summary(result14)
## Single mean test
## Data      : MFPerform 
## Variable  : Tenure 
## Confidence: 0.95 
## Null hyp. : the mean of Tenure = 0 
## Alt. hyp. : the mean of Tenure is not equal to 0 
## 
##   mean   n n_missing    sd    se    me
##  3.715 540         0 4.949 0.213 0.418
## 
##   diff    se t.value p.value  df  2.5% 97.5%    
##  3.715 0.213  17.442  < .001 539 3.296 4.133 ***
## 
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(result14, plots = "hist", custom = FALSE)

I am guessing that this has to do with Tenure. Tenure measures time in the current job, not outside it. With 95% confidence, the average fund manager has managed their current fund for between 3.296 and 4.133 years (mean 3.715, SD 4.949, range 0 to 35). Time outside the current job is Age minus Tenure. Its mean is about 43.93 - 3.72 = 40.21 years, but a 95% interval needs the standard deviation of that difference, which I did not compute.
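One reading of the question is that time outside the current job equals Age minus Tenure for each manager. The sketch below shows the mechanics on a small made-up data frame; `toy` and its values are hypothetical stand-ins for MFPerform, so the printed interval is not an answer for the real data.

```r
# Hypothetical toy rows standing in for MFPerform (values are made up)
toy <- data.frame(
  Age    = c(35, 42, 51, 29, 60, 47, 38, 44),
  Tenure = c(5, 2, 10, 1, 20, 3, 0, 7)
)

# Years outside the current job for each manager
toy$Outside <- toy$Age - toy$Tenure

# 95% confidence interval for the mean of the new variable
outside_test <- t.test(toy$Outside, conf.level = 0.95)
outside_test$conf.int
```

On the real data the same two steps would be applied to MFPerform$Age - MFPerform$Tenure.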

Multiple Regression

Use the 90% level of confidence where needed. The following questions relate to the regression model presented previously.

Maxwell continues, “Here are the key pieces.  Does having an MBA matter?  Does undergraduate SAT matter?  Does age matter?  Does the type of fund that one is managing matter?  Does tenure as a manager matter?”  Eventually, you will clearly answer these questions, but let's think about this regression.

Further Questions [30 pts]

  1. Are the residuals normal?
source(url("https://raw.githubusercontent.com/robertwwalker/DADMStuff/master/ResidPlotter.R"))
resid.plotter(MFP.LM)

It appears that the residuals are normal. The Shapiro-Wilk test has a high p-value, so normality is not rejected; the residuals have a roughly normal shape, and they are scattered evenly around zero across the fitted values.

  1. Assume that the residuals are normal; what is the probability that the regression predicts returns to within plus or minus 8 percentage points?
library(gvlma)
gvlma(MFP.LM)
## 
## Call:
## lm(formula = Returns ~ GRI + SAT + MBA + Age + Tenure, data = MFPerform)
## 
## Coefficients:
## (Intercept)       GRIGRI          SAT       MBAYes          Age       Tenure  
##   -0.297887    -2.788525     0.005379     1.192412    -0.108892     0.010426  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = MFP.LM) 
## 
##                      Value p-value                Decision
## Global Stat        3.53128  0.4731 Assumptions acceptable.
## Skewness           1.03327  0.3094 Assumptions acceptable.
## Kurtosis           0.03059  0.8612 Assumptions acceptable.
## Link Function      0.04675  0.8288 Assumptions acceptable.
## Heteroscedasticity 2.42067  0.1197 Assumptions acceptable.
car::qqPlot(residuals(MFP.LM))

## [1]   8 482
pnorm(.08)
## [1] 0.5318814

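pnorm(.08) is not the right calculation, because it treats 0.08 as a z-score. The prediction errors are the residuals, which have mean zero and a standard deviation equal to the residual standard error, sqrt(10574/534) ≈ 4.45 percentage points (the full model's residual sum of squares from the stepwise table over 540 - 6 degrees of freedom). The probability that the regression predicts returns to within plus or minus 8 percentage points is P(|Z| < 8/4.45) = P(|Z| < 1.80) ≈ 0.93.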
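The probability of a prediction within plus or minus 8 points depends on the residual standard error, not on pnorm applied to 0.08. A minimal sketch, assuming normal residuals and taking the full model's residual sum of squares (10574) from the step() table; it ignores the extra uncertainty from estimating the coefficients.

```r
rss_full <- 10574    # residual sum of squares, full model (step() table)
df_full <- 540 - 6   # observations minus estimated coefficients
rse_full <- sqrt(rss_full / df_full)

# P(-8 < residual < 8) for a normal with mean 0 and sd = residual standard error
p_within_8 <- pnorm(8, mean = 0, sd = rse_full) - pnorm(-8, mean = 0, sd = rse_full)
round(c(rse = rse_full, p_within_8 = p_within_8), 3)
```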

  1. Consider the 95% confidence interval for the slope for Age. In a sentence or two, what does it mean and what is the relevant metric?
confint(MFP.LM,'Age',level=0.95)
##          2.5 %      97.5 %
## Age -0.1526457 -0.06513859

The 95% confidence interval for the slope of Age runs from -0.1526 to -0.0651. This means that, with 95% confidence and holding fund type, SAT, MBA, and tenure constant, each additional year of manager age is associated with excess returns that are lower by 0.065 to 0.153 percentage points. The relevant metric is percentage points of excess return per year of age.

  1. Complete the following. With 95% confidence, all other things equal, Growth and Income Funds earn XXX to XXX more/less than Growth funds.
result21 <- explore(
  MFPerform, 
  vars = "Returns", 
  byvar = "GRI", 
  fun = c("n_obs", "mean", "min", "max", "sd", "p025", "p975"), 
  nr = Inf
)
# summary()
dtab(result21) %>% render()

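The percentiles of Returns by fund type do not answer this question, because they describe the spread of individual funds without holding anything else equal. The relevant interval is the confidence interval for the GRIGRI slope. With 95% confidence, all other things equal, Growth and Income funds earn roughly 2.02 to 3.56 percentage points less than Growth funds. This is the GRIGRI interval reported later for the stepwise model (-3.561 to -2.016), whose coefficient (-2.7889) is nearly identical to the full model's (-2.7885).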
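The all-else-equal interval for fund type is the confidence interval for the GRIGRI coefficient, which confint(MFP.LM, "GRIGRI") would print directly. As a cross-check that needs no data, the sketch below recovers the standard error from the step() table, using the fact that the F statistic for dropping a single coefficient equals its squared t statistic. All inputs are copied from output in this document.

```r
coef_gri <- -2.788525   # GRIGRI coefficient, full model
ss_drop_gri <- 994.22   # "Sum of Sq" for dropping GRI in the step() table
rss_full <- 10574       # residual sum of squares, full model
df_full <- 540 - 6      # residual degrees of freedom

# For one coefficient, F = t^2, so the standard error is |coef| / sqrt(F)
f_gri <- ss_drop_gri / (rss_full / df_full)
se_gri <- abs(coef_gri) / sqrt(f_gri)

gri_ci <- coef_gri + c(-1, 1) * qt(0.975, df_full) * se_gri
round(gri_ci, 2)
```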

  1. Let’s think about unexplained excess returns.
    1. Does the worst performing fund also have the most unexplained losses?
library(gvlma)
gvlma(MFP.LM)
## 
## Call:
## lm(formula = Returns ~ GRI + SAT + MBA + Age + Tenure, data = MFPerform)
## 
## Coefficients:
## (Intercept)       GRIGRI          SAT       MBAYes          Age       Tenure  
##   -0.297887    -2.788525     0.005379     1.192412    -0.108892     0.010426  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = MFP.LM) 
## 
##                      Value p-value                Decision
## Global Stat        3.53128  0.4731 Assumptions acceptable.
## Skewness           1.03327  0.3094 Assumptions acceptable.
## Kurtosis           0.03059  0.8612 Assumptions acceptable.
## Link Function      0.04675  0.8288 Assumptions acceptable.
## Heteroscedasticity 2.42067  0.1197 Assumptions acceptable.
car::qqPlot(residuals(MFP.LM))

## [1]   8 482
  1. Are the most unexplained gains obtained by an MBA?
  2. The fund with the most unexplained losses is a GRI [Growth and Income] fund, True or False?
  3. What is the median value of unexplained returns?
result30 <- explore(
  MFPerform, 
  vars = "Returns", 
  byvar = "GRI", 
  fun = c("n_obs", "median"), 
  nr = Inf
)
# summary()
dtab(result30) %>% render()
result31 <- explore(
  MFPerform, 
  vars = "Returns", 
  fun = c("n_obs", "median"), 
  nr = Inf
)
# summary()
dtab(result31) %>% render()
  1. The middle 50% of unexplained returns range between XXX and XXX.

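(I was not sure what these questions were asking. Unexplained returns are the residuals of the regression: actual Returns minus the returns the model predicts, obtained with residuals(MFP.LM). The medians of Returns computed above are therefore not the answer. The two row numbers printed by qqPlot, 8 and 482, are the most extreme residuals, so those funds are the candidates for the most unexplained gains and losses. For reference, the stepwise model reported later has a median residual of 0.154 and a middle 50% from -2.986 to 2.761; the full model should be close, but I did not compute it.)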
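Questions about unexplained returns are questions about residuals. The sketch below shows the mechanics on simulated data with the same column names as MFPerform; the data frame `sim` and every number in it are made up, so its output says nothing about the real funds.

```r
set.seed(42)
n <- 200

# Simulated stand-in for MFPerform (hypothetical values)
sim <- data.frame(
  GRI    = factor(sample(c("Growth", "GRI"), n, replace = TRUE),
                  levels = c("Growth", "GRI")),
  SAT    = round(rnorm(n, 1100, 120)),
  MBA    = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  Age    = sample(27:65, n, replace = TRUE),
  Tenure = sample(0:15, n, replace = TRUE)
)
sim$Returns <- -0.3 - 2.8 * (sim$GRI == "GRI") + 0.005 * sim$SAT +
  1.2 * (sim$MBA == "Yes") - 0.1 * sim$Age + rnorm(n, sd = 4.4)

sim_fit <- lm(Returns ~ GRI + SAT + MBA + Age + Tenure, data = sim)

# Unexplained returns = actual minus fitted
sim$Unexplained <- residuals(sim_fit)

worst_actual <- which.min(sim$Returns)          # worst performing fund
worst_unexplained <- which.min(sim$Unexplained) # most unexplained losses
best_unexplained <- which.max(sim$Unexplained)  # most unexplained gains

worst_actual == worst_unexplained
sim[best_unexplained, c("MBA", "GRI", "Unexplained")]
sim[worst_unexplained, c("MBA", "GRI", "Unexplained")]

# Median and middle 50% of unexplained returns
quantile(sim$Unexplained, probs = c(0.25, 0.5, 0.75))
```

Replacing `sim` and `sim_fit` with MFPerform and MFP.LM would answer each part of the question for the real data.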

  1. The following statements are either True or False. If true, explain that it is so. If false, change one word to make it true.

    1. The top 5 funds, in terms of expected excess returns are Growth funds managed by young and newly hired MBA’s from above average schools.

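This statement is False based on the table below. Only one of the top 5 managers did not have an MBA, but the managers are not young: their ages run from 32 to 51. Note that the table sorts on actual Returns; expected excess returns are the fitted values from the regression, fitted(MFP.LM), so the ranking should be checked on those.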

## filter and sort the dataset
MFPerform %>%
  arrange(desc(Returns)) %>%
  dtab(dec = 2, nr = 5) %>% render()
(b) The bottom 5 funds, in terms of expected excess returns, are Growth funds managed by older and long-tenured MBA's from below average schools.

This statement is False based on the table below. All but one of these managers lack an MBA, and only two of them are long-tenured. As above, the table sorts on actual Returns, not on the fitted values.

## filter and sort the dataset
MFPerform %>%
  arrange(Returns) %>%
  dtab(dec = 2, pageLength = 5, nr = 5) %>% render()
  1. What variables explain the most and least variation?

Judging by the Sum of Sq column in the stepwise output, GRI explains the most variation (dropping it raises the residual sum of squares by 994.22) and Tenure explains the least (1.09).

  2. Predict the average and the distribution of excess returns for Hank [Ohio State has average SAT of 1042] and for Prudence [Princeton has average SAT of 1355] in their current funds [all the necessary facts are given in the text]? Who should perform better and is this true on average, overall, both, or neither?

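## both candidates currently manage a GRI fund; the other facts are from the case text
candidates <- data.frame(GRI = c("GRI", "GRI"), SAT = c(1042, 1355), MBA = c("No", "Yes"), Age = c(35, 32), Tenure = c(5, 2), row.names = c("Hank", "Prudence"))
predict(MFP.LM, newdata = candidates, interval = "confidence")
predict(MFP.LM, newdata = candidates, interval = "prediction")

Plugging each candidate into the coefficients gives a predicted excess return of about -1.24 for Hank and about 1.93 for Prudence, a gap of roughly 3.2 percentage points in Prudence's favor. The gap comes from her MBA (1.19), her school's higher SAT (313 points × 0.005379 ≈ 1.68), and her being three years younger (≈ 0.33), slightly offset by Hank's longer tenure (≈ 0.03). Both manage GRI funds, so fund type does not separate them. Prudence should perform better on average. Overall, meaning for an individual fund in a given year, the answer is much less clear: the residual standard error is about 4.45 percentage points, so the two distributions of returns overlap heavily and Hank could easily beat her in any one year. The confidence and prediction intervals themselves come from the two predict() calls above.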
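The point predictions for the two candidates can be worked out by hand from the printed coefficients. A minimal sketch, assuming both candidates stay in a GRI fund and taking age, tenure, MBA status, and school SAT from the case text; the vector names are illustrative.

```r
# Coefficients from the full regression output
b <- c(Intercept = -0.297887, GRIGRI = -2.788525, SAT = 0.005379,
       MBAYes = 1.192412, Age = -0.108892, Tenure = 0.010426)

# Order matches b: intercept, GRI fund (1 = yes), SAT, MBA (1 = yes), Age, Tenure
hank     <- c(1, 1, 1042, 0, 35, 5)
prudence <- c(1, 1, 1355, 1, 32, 2)

pred <- c(Hank = sum(b * hank), Prudence = sum(b * prudence))
round(pred, 2)
```

These are averages for managers with each profile. Interval estimates need the fitted model object, for example through predict() with interval = "confidence" or interval = "prediction".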

The Following Refers to the Results of the Stepwise Fitting [20 pts]

Everything Prior to this refers to the initial regression.

  1. A stepwise algorithm should be applied to the regression model. What steps are taken and why?
Sliced.regression <- step(MFP.LM, direction = "both")
## Start:  AIC=1618.27
## Returns ~ GRI + SAT + MBA + Age + Tenure
## 
##          Df Sum of Sq   RSS    AIC
## - Tenure  1      1.09 10575 1616.3
## <none>                10574 1618.3
## - MBA     1    173.35 10747 1625.0
## - SAT     1    285.50 10859 1630.7
## - Age     1    473.29 11047 1639.9
## - GRI     1    994.22 11568 1664.8
## 
## Step:  AIC=1616.33
## Returns ~ GRI + SAT + MBA + Age
## 
##          Df Sum of Sq   RSS    AIC
## <none>                10575 1616.3
## + Tenure  1      1.09 10574 1618.3
## - MBA     1    172.78 10748 1623.1
## - SAT     1    287.04 10862 1628.8
## - Age     1    581.56 11157 1643.2
## - GRI     1    994.48 11570 1662.9
summary(Sliced.regression)
## 
## Call:
## lm(formula = Returns ~ GRI + SAT + MBA + Age, data = MFPerform)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.731  -2.986   0.154   2.761  15.167 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.306479   1.780482  -0.172 0.863398    
## GRIGRI      -2.788863   0.393181  -7.093 4.18e-12 ***
## SAT          0.005327   0.001398   3.811 0.000155 ***
## MBAYes       1.190063   0.402524   2.957 0.003249 ** 
## Age         -0.106429   0.019621  -5.424 8.83e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.446 on 535 degrees of freedom
## Multiple R-squared:  0.1812, Adjusted R-squared:  0.1751 
## F-statistic:  29.6 on 4 and 535 DF,  p-value: < 2.2e-16

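The stepwise algorithm adds and removes predictors based on AIC, not the F statistic: at each step it makes the single change that lowers AIC the most and stops when no change lowers it. Starting from the full model (AIC = 1618.27), dropping Tenure lowers AIC to 1616.33, so Tenure is removed. From there, dropping MBA, SAT, Age, or GRI, or adding Tenure back, would each raise AIC, so the algorithm stops with Returns ~ GRI + SAT + MBA + Age.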
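The AIC values in the step() trace can be reproduced from the residual sums of squares it prints. For a linear model, step() uses n × log(RSS / n) + 2 × (number of coefficients). The helper `aic_lm` below is a hypothetical name for that formula, and the inputs are the rounded RSS values from the trace.

```r
n <- 540

# AIC as used by step() for lm fits, up to an additive constant
aic_lm <- function(rss, n_coef) n * log(rss / n) + 2 * n_coef

aic_full <- aic_lm(10574, 6)       # intercept + 5 slopes
aic_no_tenure <- aic_lm(10575, 5)  # Tenure dropped

round(c(full = aic_full, no_tenure = aic_no_tenure), 2)
```

Dropping Tenure barely changes the residual sum of squares but saves the 2-point penalty for one coefficient, which is why AIC falls.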

    1. Are the residuals from this regression normal? How do you know?
lmfit <- lm(Sliced.regression) 
lm_resid <- resid(lmfit) 

hist(resid(lmfit), breaks=100) 

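The residuals from this regression look normal: the histogram is roughly bell shaped and centered near zero, and the residual summary is close to symmetric (median 0.154, quartiles -2.986 and 2.761). A histogram alone is not a formal test.

(I tried solving it with Q-Q plots and a jarque.bera.test, but I kept getting errors. Base R's qqnorm(lm_resid) followed by qqline(lm_resid), or shapiro.test(lm_resid), would give a more formal check without extra packages.)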

  1. Interpret the residual standard error from the table. What is the metric? What does it mean? Is it better that it is smaller or larger?

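The residual standard error from the table is 4.446 on 535 degrees of freedom. The 1.78 in the coefficient table is the standard error of the intercept, and the other values in that column (GRIGRI 0.393181, SAT 0.001398, MBAYes 0.402524, Age 0.019621) are the standard errors of the individual slopes, not the residual standard error.

The residual standard error is measured in the units of Returns, percentage points of excess return. It is the typical distance between a fund's actual return and the return the regression predicts, here sqrt(10575/535) ≈ 4.45 percentage points. Smaller is better, because a smaller value means the model fits the data more closely and its predictions are more precise.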

  1. Provide a complete interpretation of each of the confidence intervals for slopes of the remaining factors.
wack <- confint(Sliced.regression) %>% data.frame() %>% rename(High=2, Low=1)
wack$Returns <- rownames(confint(Sliced.regression))
Result20 <- wack %>% rowwise() %>% mutate(Interpretation = paste("Percent point per one unit change in ", Returns), to = "to") %>%  select(Low, to, High, Interpretation) 
Result20[1,"Interpretation"] <- "Percent points if all predictors are zero"
Result20 %>% knitr::kable()
Low to High Interpretation
-3.8040718 to 3.1911131 Percent points if all predictors are zero
-3.5612314 to -2.0164956 Percent point per one unit change in GRIGRI
0.0025809 to 0.0080726 Percent point per one unit change in SAT
0.3993415 to 1.9807837 Percent point per one unit change in MBAYes
-0.1449727 to -0.0678849 Percent point per one unit change in Age
  1. List the factors in order of importance.

Ordering the remaining factors by the Sum of Sq lost when each is dropped from the final stepwise model gives GRI (994.48), Age (581.56), SAT (287.04), and MBA (172.78); the t values rank them the same way. Tenure was removed by the algorithm.

  2. Plot at least one relevant effect from the regression.

library(jtools)
effect_plot(Sliced.regression, pred=SAT, interval=TRUE, int.type="confidence") + labs(y="Expected Mean Returns", x="SAT", title="The Estimated Effect of SAT on Fund Returns", subtitle="From a model simplified by stepwise")

A Final Question and Summary [10 pts]

Provide advice on the hiring decision that summarises what you have learned including the construction of at least one appropriate graphic that clearly summarises the case. This should include a discussion of the candidates and how their attributes fit into the results.

When deciding between the two candidates, I favored Hank before doing the analysis. It made sense to me that people who work their way up have a broader perspective.

But the data I have analyzed point the other way.

Age and fund type do little to separate the two candidates. Age matters in the regression (older managers earn lower returns, all else equal), but Hank and Prudence are only three years apart, and both manage GRI funds, which earn less than Growth funds regardless of who runs them. Tenure did not matter: the stepwise algorithm dropped it, and both candidates have short tenures in their current positions.

What did matter was pedigree. MBAs earn higher average returns than non-MBAs (1.355 versus -0.293), and the MBA effect remains positive after controlling for the other factors (about 1.19 percentage points). SAT also has a positive and significant slope, which favors Prudence's Princeton degree over Hank's Ohio State degree. On these results Prudence is the better hire on average, although with an r-squared of about 0.18 most of the variation in returns is unexplained, so the model cannot promise that she outperforms Hank in any given year.

I tried my best, the 2nd half seemed impossible at my level of understanding.