11.8

a)

$F = \frac{SSM}{DFM} = \frac{16.5}{4} = 4.125$

$MSM = \frac{SSE}{DFE} \frac{100.8}{52} = 1.938$

$F = \frac{4.125}{1.938} = 2.13$

b)

DFs are 4 and 52 = 57 - 4 - 1 = (n-p-1)

c)

The P-value is between 0.050 and 0.100

d)

$R^2 = \frac{SSM}{SST} = \frac{16.5}{16.5 + 100.8} = 0.141$

14% of the response variable is explained by the explantory variables.

11.24

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(readr)

EX11_24TWISTER <- read_csv("EX11-24TWISTER.csv", col_types = cols(...4 = col_skip()))

## New names:
## * `` -> ...4

twister = EX11_24TWISTER
head(twister)

## # A tibble: 6 × 3
##    Year Tornadoes `Census (thousands)`
##   <dbl>     <dbl>                <dbl>
## 1  1953       421               158956
## 2  1954       550               161884
## 3  1955       593               165069
## 4  1956       504               168088
## 5  1957       856               171187
## 6  1958       564               174149

a) graphical and numerical summaries of data

# install.packages("GGally")
# library(GGally)
GGally::ggpairs(twister)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

The output shows a relatively strong linear correlation between the three variables indicated by the correlation coefficients. The data indicates that as years progress, census numbers increase favorably. It also indicates that year and tornadoes have a strong positive relationship.

b) multiple linear regression

twister_fit = lm(twister$Tornadoes ~ twister$Year + twister$`Census (thousands)`)

summary(twister_fit)

## 
## Call:
## lm(formula = twister$Tornadoes ~ twister$Year + twister$`Census (thousands)`)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.81 -115.82  -10.99  134.76  596.21 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)
## (Intercept)                   1.783e+04  4.776e+04   0.373    0.710
## twister$Year                 -9.555e+00  2.525e+01  -0.378    0.706
## twister$`Census (thousands)`  8.672e-03  9.764e-03   0.888    0.378
## 
## Residual standard error: 201.8 on 59 degrees of freedom
## Multiple R-squared:  0.5788, Adjusted R-squared:  0.5645 
## F-statistic: 40.53 on 2 and 59 DF,  p-value: 8.385e-12

Tornadoes_hat = 17,830 − 9.555 ∗ Year + 0.008672 ∗ Census

c) Residuals

Residuals plotted against variables

res_twister = residuals(twister_fit)

plot(twister$`Census (thousands)`, res_twister)

plot(twister$Tornadoes, res_twister)

The above residua polots both show evidance of heterscedasticity given the unequal variance of response variable above the mean 0. It is more pronounced in the Tornado data as the numbers increase.

Normal Quantile

qqnorm(res_twister); qqline(res_twister, col = 2,lwd=2,lty=2)

This Normal quantil plot shows the residuals are approximately normal.

d) Hypothesis testing

H_0: β_year = 0, H_a: β_year > 0

dim(twister)

## [1] 62  3

n = 62 
p = 2

# mean_year = mean(twister$Year)
# 
# twister$error = (twister$Year - mean_year)
# 
# 
# s2_twister = sum(res_twister)^2 / n - p - 1
# 
# s_twister = sqrt(s2_twister)
# 
# SE_b_year = s_twister / sqrt(sum(twister$error))
# 
# t_twister_year =  -9.55 / SE_b_year
# 
# SE_b_year*-0.378 



pt(coef(summary(twister_fit))[, 3], twister_fit$df, lower = FALSE)

##                  (Intercept)                 twister$Year 
##                    0.3551500                    0.6467625 
## twister$`Census (thousands)` 
##                    0.1890349

p_val = 0.706/2

p_val

## [1] 0.353

The P-value of 0.353 for the one-sided alternative of this test is not significant at common alpha levels. There is not sufficient evidence suggesting a linear increase in tornadoes over time as a function of these explanatory variables.

11.33

library(readr) 
EX11_33RANKING <- read_csv("EX11-33RANKING.csv")

## Rows: 55 Columns: 7

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Name, IndIncome
## dbl (5): Citations, Research, IntOutlook, Teaching, Overall

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

rankings = EX11_33RANKING

rankings2 = select(rankings, -c(Name, IndIncome, IntOutlook))

a)

GGally::ggpairs(rankings2, cardinality_threshold =  80)

All variables are skewed (3 of 4 to the right) and do not look Normally-distributed.

b)

Research and Teaching seem to have a postive linear relatiosnhip. Research and Citations do not appear to have an relationship in any direction. Teaching and Citations also seem to have a very weak relationship (if any).

11.34

GGally::ggpairs(rankings2, cardinality_threshold =  80)

a)

Overall scores seems to have a stornger linear relationship with Research and Teaching, whears with Citations, it looks less linear.

b)

It appaears that all explantory variables are positively correlated with overall score, but citations has the weakest relationship with 0.44. Teaching and Research have an $R^2$ of 0.92 and 0.90, respectiely, and demonstrate a stronger, more linear, relationship with overall score.

11.35

a)

The statistical model for tis analysis is:

Overall Score = $\beta_0 + \beta_1$(Research _score) + $\beta_2$(Citation_score) + $\beta_3$(Teaching_score)

The assumptions of this are that 1) the error deviations are independent and normally distributed; 2) Overall score (response variable) is a linear function fo the explanatory variables (teaching score, research score, citation score). Both of these assumptions are met in this analysis.

b)

rankings_fit = lm(rankings2$Overall ~ rankings2$Citations + rankings2$Research + rankings2$Teaching)

summary(rankings_fit)

## 
## Call:
## lm(formula = rankings2$Overall ~ rankings2$Citations + rankings2$Research + 
##     rankings2$Teaching)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6850 -0.9686 -0.2867  1.0734  3.7384 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          8.16814    1.50453   5.429 1.58e-06 ***
## rankings2$Citations  0.27513    0.01825  15.078  < 2e-16 ***
## rankings2$Research   0.32800    0.02994  10.956 5.23e-15 ***
## rankings2$Teaching   0.26432    0.03562   7.420 1.18e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.723 on 51 degrees of freedom
## Multiple R-squared:  0.9774, Adjusted R-squared:  0.9761 
## F-statistic: 736.4 on 3 and 51 DF,  p-value: < 2.2e-16

Overall Score = 8.16814 + $\beta_1$(0.3280)+ $\beta_2$(0.2751) + $\beta_3$(0.2643)

c)

95% confidence interval

Citations

# EXAMPLE MANUAL CALCULATION;
# xbar_citations = mean(rankings2$Citations)
# 
# rankings$x_xbar_c = rankings$Citations - xbar_citations
# 
# rankings$x_xbar_c2 = rankings$x_xbar_c^2
# 
# sum_squares_c = sum(rankings$x_xbar_c2)
# 
# res_rank = resid(rankings_fit)
# res_rank2 = resid(rankings_fit)^2
# sum_res_rank2 = sum(res_rank2)
# 
# # dim(rankings)
# 
# s2_rank = sum_res_rank2 / 55 - 3 - 1
# 
# s_rank = sqrt(pmax(s2_rank))
# 
# SE_b_citation =  / sqrt(sum_squares_c)
# 
# SE_b_citation

# ci(rankings2$Citations, conf)

confint(rankings_fit, level = 0.95)

##                         2.5 %     97.5 %
## (Intercept)         5.1476819 11.1886044
## rankings2$Citations 0.2384990  0.3117622
## rankings2$Research  0.2678983  0.3881030
## rankings2$Teaching  0.1928017  0.3358340

The confidence interval for Teaching is (0.193, 0.336); Research is (0.268, 0.388), Citations is (0.238, 0.311). None contain 0, which recapitulates the significance test

d)

97.74% of the variation observed in overall scores is explained by the linear model. Sigma is the residual standard error and is estimated at 1.723.

11.36

sevengr = read_csv("EX11-36SEVENGR.csv")

## Rows: 78 Columns: 5

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (5): obs, GPA, IQ, GENDER, CONCEPT

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(sevengr)

## # A tibble: 6 × 5
##     obs   GPA    IQ GENDER CONCEPT
##   <dbl> <dbl> <dbl>  <dbl>   <dbl>
## 1     1  7.94   111      2      67
## 2     2  8.29   107      2      43
## 3     3  4.64   100      2      52
## 4     4  7.47   107      2      66
## 5     5  8.88   114      1      58
## 6     6  7.58   115      2      51

GGally::ggpairs(select(sevengr, -obs), cardinality_threshold =  80)

a)

IQ, Gender, and Concept explain 63.4, -0.097, and 54.2 % of the variaiton in GPA.

b)

GPA = $\beta_0 + \beta_1$(IQ) + $\beta_2$(Self-Concept)

$\beta_2$ is the indication of how much self-concept will have an effect on GPA.

c)

sevengr_fit = lm(sevengr$GPA ~ sevengr$IQ + sevengr$CONCEPT )

summary(sevengr_fit)

## 
## Call:
## lm(formula = sevengr$GPA ~ sevengr$IQ + sevengr$CONCEPT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2562 -0.6751  0.2194  1.0829  2.8508 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -3.88204    1.47223  -2.637  0.01017 *  
## sevengr$IQ       0.07720    0.01539   5.017 3.42e-06 ***
## sevengr$CONCEPT  0.05125    0.01633   3.139  0.00243 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.547 on 75 degrees of freedom
## Multiple R-squared:  0.4711, Adjusted R-squared:  0.457 
## F-statistic:  33.4 on 2 and 75 DF,  p-value: 4.233e-11

$\widehat{GPA}$ = -3.882 + 0.077(IQ) + 0.051(CONCEPT)

47.1% of the variaiton in GPA is splained by the variables in the above model.

d)

$H_0: \beta_2 = 0; H_1: \beta_2 \neq 0$

$t = 3.139$; P-value = 0.00243

Since the p-values is smaller than 0.05 (and 0.01), we reject the null hypothesis and conclude that there is sufficient evidence that self-concept contributes signficantly to GPA and they have a linear relationship.

11.40

biomark = read_csv("EX11-40BIOMARK.csv")

## Rows: 31 Columns: 8

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): voplus, vominus, oc, loc, trap, ltrap, lvoplus, lvominus

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

a)

GGally::ggpairs(select(biomark, -c(loc, ltrap, lvoplus, lvominus)), cardinality_threshold =  80)

Each of the variables appears to be skewed significantly to the right.

b)

VO+ and VO- seem to have the strongest linear relationship, while TRAP and VO+ having the second strongest. VO- and OC seem to have the linear relationship. Trap seems to have a moderate relationship qith VO- and OC.

11.41

biomark_fit1 = lm(biomark$voplus ~ biomark$oc)
summary(biomark_fit1)

## 
## Call:
## lm(formula = biomark$voplus ~ biomark$oc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -727.45 -234.43  -85.08   43.66 1500.99 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  334.034    159.241   2.098   0.0448 *  
## biomark$oc    19.505      4.127   4.726 5.43e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 443.3 on 29 degrees of freedom
## Multiple R-squared:  0.4351, Adjusted R-squared:  0.4156 
## F-statistic: 22.34 on 1 and 29 DF,  p-value: 5.429e-05

The regression model for predicting VO+ with OC is: $\widehat{VO+} = \beta_0 + \beta_1$(OC) $= 334.034 + 19.505$(OC)

43.51% of the variance in VO+ is explained by OC.

t = 4.726

P-value = 5.34e-5

The test statistics show that OC has a significant linear relationship with VO+

plot(biomark$oc, resid(biomark_fit1))

There is some heteroscedasticity in the the data for the OC variable that dhows inflated variance near the median of the data range.

b)

biomark_fit2 = lm(biomark$voplus ~ biomark$oc + biomark$trap)
summary(biomark_fit2)

## 
## Call:
## lm(formula = biomark$voplus ~ biomark$oc + biomark$trap)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -708.2 -198.6 -100.2  125.8 1224.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    57.704    156.539   0.369  0.71518   
## biomark$oc      6.415      5.125   1.252  0.22102   
## biomark$trap   53.874     15.393   3.500  0.00158 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 376.3 on 28 degrees of freedom
## Multiple R-squared:  0.607,  Adjusted R-squared:  0.5789 
## F-statistic: 21.62 on 2 and 28 DF,  p-value: 2.096e-06

The regression model for predicting VO+ with OC is: $\widehat{VO+} = \beta_0 + \beta_1$(OC) + $\beta_2$(TRAP) $= 334.034 + 19.505$(OC)

It appears that OC is no longer a significant predictor of VO+ and TRAP is much better with a P-value of -.00158. OC is no longer significant at common thresholds.

60.7% of the variance in VO+ is explained by OC and TRAP.

This view is actually consistent with the pattern of relationships described in the previous exercise, since it was stated that TRAP appears to have a stornger linear relationship with VO+ than OC. OC has more outliers. The estimate is sigma decreased as the number of variables and the $R^2$ increased.

11.42

a)

$\widehat{VO+} = \beta_0 + \beta_1$(OC) + $\beta_2$(TRAP) + $\beta_3$(VO-)

b)

biomark_fit3 = lm(biomark$voplus ~ biomark$oc + biomark$trap + biomark$vominus )
summary(biomark_fit3)

## 
## Call:
## lm(formula = biomark$voplus ~ biomark$oc + biomark$trap + biomark$vominus)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -364.19 -158.57  -15.13  120.08  441.11 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -243.4877    94.2183  -2.584  0.01549 *  
## biomark$oc         8.2349     2.8397   2.900  0.00733 ** 
## biomark$trap       6.6071    10.3340   0.639  0.52797    
## biomark$vominus    0.9746     0.1211   8.048  1.2e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 207.8 on 27 degrees of freedom
## Multiple R-squared:  0.8844, Adjusted R-squared:  0.8715 
## F-statistic: 68.84 on 3 and 27 DF,  p-value: 9.031e-13

Once including VO-, TRAP now becomes a worse predictor of VO+ and VO- becomes the most significant. This recapitulates earlier observations that VO+ and VO- are strongly correlated. These three variables explain 88.4% of the variation in VO+ and signma has decrease even more from just including 2 variables.

c)

# install.packages("broom")
broom::tidy(biomark_fit1) %>% 
  mutate_if(is.numeric, funs(as.character(signif(., 4)))) %>% 
  knitr::kable()

## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

term	estimate	std.error	statistic	p.value
(Intercept)	334	159.2	2.098	0.04476
biomark$oc	19.5	4.127	4.726	5.429e-05

broom::tidy(biomark_fit2) %>% 
  mutate_if(is.numeric, funs(as.character(signif(., 4)))) %>% 
  knitr::kable()

term	estimate	std.error	statistic	p.value
(Intercept)	57.7	156.5	0.3686	0.7152
biomark$oc \|6.415 \|5.125 \|1.252 \|0.221 \| \|biomark$trap	53.87	15.39	3.5	0.001577

broom::tidy(biomark_fit3) %>% 
  mutate_if(is.numeric, funs(as.character(signif(., 4)))) %>% 
  knitr::kable()

term	estimate	std.error	statistic	p.value
(Intercept)	-243.5	94.22	-2.584	0.01549
biomark$oc \|8.235 \|2.84 \|2.9 \|0.007332 \| \|biomark$trap	6.607	10.33	0.6394	0.528
biomark$vominus	0.9746	0.1211	8.048	1.199e-08

When OC is the only predictor and when there are three predictors, OC is significant.TRAP is significant with two predictors but not three predictors.

d)

Number_Predictors = c(1, 2, 3)

R2 = c(0.435, 0.607, 0.884)

sigma = c(443.27, 376.26, 207.84)

data.frame(Number_Predictors, R2, sigma)

##   Number_Predictors    R2  sigma
## 1                 1 0.435 443.27
## 2                 2 0.607 376.26
## 3                 3 0.884 207.84

As the number of predictors increases $R^2$ increases and $\sigma$ decreases.

e)

The results in b) suggest the following model: VO+ = $\beta_0$ + $\beta_1$(OC) + $\beta_2$(VO-)

biomark_fit4 = lm(biomark$voplus ~ biomark$oc + biomark$vominus)
summary(biomark_fit4)

## 
## Call:
## lm(formula = biomark$voplus ~ biomark$oc + biomark$vominus)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -350.25 -153.94  -13.22  148.19  428.09 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -234.14400   92.09009  -2.543 0.016818 *  
## biomark$oc         9.40388    2.14964   4.375 0.000153 ***
## biomark$vominus    1.01857    0.09858  10.333 4.65e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 205.6 on 28 degrees of freedom
## Multiple R-squared:  0.8826, Adjusted R-squared:  0.8742 
## F-statistic: 105.3 on 2 and 28 DF,  p-value: 9.418e-14

$R^2$ = 0.8826, p-value = 9.418e-14, $\sigma$ = 205.6

The correlation coefficient is about the same as the previous mdoel with thre predictors, which may indicate that it does not need to be included int he overall model. VO-, however, still shows the highest significance for this model.

11.43

log_biomark = log(biomark)

GGally::ggpairs(select(log_biomark, -c(loc, ltrap, lvoplus, lvominus)), cardinality_threshold =  80)

After the log transformation of the data, all variables appear to be more normal than before. The relationships between variable also appear to have stronger linear trends.

The new statistical model is now: log(VO+) = $\beta_0 + \beta_1$(log(OC)) + $\beta_2$(log(TRAP)) + $\beta_3$(log(VO-))

log_biomark_fit1 = lm(log_biomark$voplus ~ log_biomark$oc + log_biomark$trap + log_biomark$vominus)
summary(log_biomark_fit1)

## 
## Call:
## lm(formula = log_biomark$voplus ~ log_biomark$oc + log_biomark$trap + 
##     log_biomark$vominus)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.44054 -0.14746 -0.00628  0.16318  0.39937 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.87239    0.64020   1.363  0.18424    
## log_biomark$oc       0.39203    0.11538   3.398  0.00212 ** 
## log_biomark$trap     0.02746    0.15697   0.175  0.86242    
## log_biomark$vominus  0.67248    0.11778   5.709 4.56e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2327 on 27 degrees of freedom
## Multiple R-squared:  0.8421, Adjusted R-squared:  0.8245 
## F-statistic: 47.99 on 3 and 27 DF,  p-value: 5.943e-11

This models shows that 84.2% of the variation in VO+ is accounted for by these variables. The $\sigma$ is 0.2327 and the p-value for this model is 5.943e-11. This model suggests, again, that VO- and OC are the only significant predictors for this response variable, so this new model was proposed:

log(VO+) = $\beta_0 + \beta_1$(log(OC)) + $\beta_2$(log(VO-))

log_biomark_fit2 = lm(log_biomark$voplus ~ log_biomark$oc + log_biomark$vominus)
summary(log_biomark_fit2)

## 
## Call:
## lm(formula = log_biomark$voplus ~ log_biomark$oc + log_biomark$vominus)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.44153 -0.14522 -0.00953  0.16513  0.40122 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.83298    0.58878   1.415    0.168    
## log_biomark$oc       0.40589    0.08242   4.925 3.41e-05 ***
## log_biomark$vominus  0.68159    0.10378   6.567 4.03e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2286 on 28 degrees of freedom
## Multiple R-squared:  0.8419, Adjusted R-squared:  0.8306 
## F-statistic: 74.55 on 2 and 28 DF,  p-value: 6.099e-12

This models shows that 84.2% of the variation in VO+ is accounted for by these two variables, which is not different from the previous model with three variables. The $\sigma$ is 0.2286, which is actually less thann the previous model with more variables, and the p-value for this model is 6.099e-12.

It seemse that the coonclusion form this experiment is that working with the log-transformed variables improves normality and possiblty heteroscedasticity in the reisudals, but the inclusion of TRAP as a variable to predict VO+ does not make a significant difference when OC and VO- are used. This does support the higher correlation coefficients that OC and VO- have (0.77, 0.84, respectively) with VO+ compared to TRAP (0.75).

STAT 500 - Adams_HW4

Jenea Adams

November 11, 2021