R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Part 1: Simulation Exercise Instructions

Executive Summary

Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials.

  1. The sample mean and compare it to the theoretical mean of the distribution.
  2. How variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
  3. Proove the distribution is approximately normal.

Preliminary Analysis: Simulations:

Exploring the dataset. Main dataset appearance:

Data Description: Plots

The data is simulated and do not exist in reality. The data plot looks like this:

**From the charts we see that our distribution is close to normal. It is a little bit right-skewed, but still is OK with the normal distribution concept. The mean of the sample and the mean that is expected theoretically are actually extremely close to each other. This corresponds with the results we expected.

# Theoretical mean: 
1/lambda 
## [1] 2
# Smple mean: 
mean(raw_data_frame_1$simulated_1)
## [1] 2.000029
# Absolute difference:
1/lambda  - mean(raw_data_frame_1$simulated_1)
## [1] -2.941894e-05
# Relative Difference:
round(abs(1/lambda  - mean(raw_data_frame_1$simulated_1))/min(1/lambda, mean(raw_data_frame_1$simulated_1)), 10)
## [1] 1.47095e-05
# Is difference significant:
ifelse(round(abs(1/lambda  - mean(raw_data_frame_1$simulated_1))/min(1/lambda, mean(raw_data_frame_1$simulated_1)), 10) < 0.05, "Is difference significant? The difference is insignificant; the results of the theoretical mean and of the actual mean are very close to each other", "Is difference significant? Yes, the difference between the theoretical mean and  the actual mean may be significant")
## [1] "Is difference significant? The difference is insignificant; the results of the theoretical mean and of the actual mean are very close to each other"
Statistical Inference Analysis

Plotting the chart: As we see, we have our distribution close to normal - the hiostogram and the lines distributed almost the same.

Part 2: Basic Inferential Data Analysis Instructions

Load the ToothGrowth data and perform some basic exploratory data analyses; Provide a basic summary of the data.

my_new_data = ToothGrowth
summary(my_new_data)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
str(my_new_data)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
knitr::kable(head(my_new_data, 10))
len supp dose
4.2 VC 0.5
11.5 VC 0.5
7.3 VC 0.5
5.8 VC 0.5
6.4 VC 0.5
10.0 VC 0.5
11.2 VC 0.5
11.2 VC 0.5
5.2 VC 0.5
7.0 VC 0.5
# head(my_new_data, 10)
a_tooth_for_showing <- my_new_data %>% group_by(supp, dose) %>% summarise(len = mean(len))
## `summarise()` regrouping output by 'supp' (override with `.groups` argument)
knitr::kable(a_tooth_for_showing)
supp dose len
OJ 0.5 13.23
OJ 1.0 22.70
OJ 2.0 26.06
VC 0.5 7.98
VC 1.0 16.77
VC 2.0 26.14
#### Us e confi dence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there’s other approaches worth considering)
OJ = ToothGrowth$len[ToothGrowth$supp == 'OJ']
VC = ToothGrowth$len[ToothGrowth$supp == 'VC']
 
t.test(OJ, VC, alternative = "greater", paired = FALSE, var.equal = FALSE, conf.level = 0.95)
## 
##  Welch Two Sample t-test
## 
## data:  OJ and VC
## t = 1.9153, df = 55.309, p-value = 0.03032
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.4682687       Inf
## sample estimates:
## mean of x mean of y 
##  20.66333  16.96333
ttt = t.test(OJ, VC, alternative = "greater", paired = FALSE, var.equal = FALSE, conf.level = 0.95)
ttt$p.value
## [1] 0.03031725
ifelse(ttt$p.value < 0.05, "There are significant clues to reject the null hypothesis at 5% level which means there are clues about the linear relations between the OJ and VC parameters.", "We cannot reject H0 in our case that there is no dependency between the OJ and VC indicators.")
## [1] "There are significant clues to reject the null hypothesis at 5% level which means there are clues about the linear relations between the OJ and VC parameters."
dose_0_5 = ToothGrowth$len[ToothGrowth$dose == 0.5]
dose_1_0 = ToothGrowth$len[ToothGrowth$dose == 1]
dose_2_0 = ToothGrowth$len[ToothGrowth$dose == 2]

ttt_1 = t.test(dose_0_5, dose_1_0, alternative = "less", paired = FALSE, var.equal = FALSE, conf.level = 0.95)
ttt_2 = t.test(dose_1_0, dose_2_0, alternative = "less", paired = FALSE, var.equal = FALSE, conf.level = 0.95)
ttt_1$p.value
## [1] 6.341504e-08
ifelse(ttt_1$p.value < 0.05, "There are significant clues to reject the null hypothesis at 5% level which means there are clues about the linear relations between the 1/2 and 1 dose parameters.", "We cannot reject H0 in our case that there is no dependency between the 1/2 and 1 dose indicators.")
## [1] "There are significant clues to reject the null hypothesis at 5% level which means there are clues about the linear relations between the 1/2 and 1 dose parameters."
ttt_2$p.value
## [1] 9.532148e-06
ifelse(ttt_2$p.value < 0.05, "There are significant clues to reject the null hypothesis at 5% level which means there are clues about the linear relations between the 2 and 1 dose parameters.", "We cannot reject H0 in our case that there is no dependency between the 2 and 1 dose indicators.")
## [1] "There are significant clues to reject the null hypothesis at 5% level which means there are clues about the linear relations between the 2 and 1 dose parameters."

Conclusion on the part 2

ifelse(ttt_2$p.value < 0.05 & ttt_1$p.value < 0.05 & ttt$p.value < 0.05, "We couldn't reject H0 for all the 3 requested indicators.", "We found at least one parameter where we used H0 for the t-test.")
## [1] "We couldn't reject H0 for all the 3 requested indicators."

Hene, we found clues for making further analysis which goes beyond this coursework.

Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there’s other approaches worth considering)

Appendix: Codes Missing In The Text

Fit the full model of the data

library(GGally)
library(dplyr)
library(ggplot2)
#set seed
set.seed(2020)
#inputs
lambda = .5
n=400
sim = 100000
#simulation
simulated_1 <- NULL

for (i in 1:sim) simulated_1 = c(simulated_1, mean(rexp(n, lambda)))
raw_data_frame_1 <- as.data.frame(simulated_1)
a_statistics_1 <- data.frame(statistics = c("sample mean", "theoretical mean"),
                      values = c(mean(raw_data_frame_1$simulated_1), 1/lambda))
aplot1<- ggplot(raw_data_frame_1, aes(x = simulated_1))
aplot1<- aplot1 + geom_histogram(fill = "green"
                               , color = "blue"
                               , binwidth = .05
                               ) 
aplot1<- aplot1 + geom_vline(data = a_statistics_1
                           , aes(xintercept = values, col = statistics)
                           , size = 1.5
                           ) 
aplot1<- aplot1 + theme_light()
aplot1<- aplot1 + labs(title="Exponential means distribution" 
                     ,subtitle = "Theoretical mean vs. Sample mean"
                     ,x = "exp means"
                     ,y = "frequency"
                    )
# aplot1
aplot2 <- NULL
aplot2 <- ggplot(raw_data_frame_1, aes( x = simulated_1)
               )
aplot2 <- aplot2 + geom_histogram( aes(y=..density..)
                                ,binwidth = .05
                                ,col = "white"
                                ,fill = "grey"
                                ,size = .5
                                ,show.legend = TRUE
                                 )
aplot2 <- aplot2 + stat_function( fun = dnorm
                              , args = list(mean = mean(simulated_1)
                                           ,sd = sqrt(var(simulated_1))
                                           )
                              , col = "red"
                              , size = 1
                              )
aplot2 <- aplot2 + stat_function( fun = dnorm
                              , args = list(mean = 1/lambda 
                                           ,sd = sqrt((1/lambda)^2/n)
                                           )
                              , col = "black"
                              , size = 1
                              )
aplot2 <- aplot2 + theme_light()
aplot2 <- aplot2 + labs(title="Exponential means distribution" 
                     ,subtitle = "Normal distribution vs. Sample distribution"
                     ,x = "exp means"
                     )
# aplot2

We have the only coefficient significant at 10% level (WT) and 0 coefficients from the entire dataset significant at 5% or lower levels in terms of the t-statistics p-values. It means, we need another model. We can use automatic model selection algorithms like stepwise or best subset in order to fulfil this task and have some estimates of the dependencies. Small sample of the original dataset (below 300 observations) prevents us from having clear Gauss-Markov assumptions for the OLS-type models. Increasing the sample size would be of significant help here. At the same time, we work with what we have and suggested methods may cope with out task here.

Backward selection algorithm:

fullModel <- lm(mpg ~ ., data = mtcars)
bwFit <- step(fullModel)
summary(bwFit)       # knitr: results hidden
summary(bwFit)$coeff # knitr: results hidden

We see the next outcome: Estimate Std. Error t value Pr(>|t|) (Intercept) 9.617781 6.9595930 1.381946 1.779152e-01 wt -3.916504 0.7112016 -5.506882 6.952711e-06 qsec 1.225886 0.2886696 4.246676 2.161737e-04 am-manual 2.935837 1.4109045 2.080819 4.671551e-02 Or, in a more appropriate representation:

backfit = as.data.frame(summary(bwFit)$coeff)
knitr::kable(backfit)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.617781 6.9595930 1.381946 0.1779152
wt -3.916504 0.7112016 -5.506882 0.0000070
qsec 1.225886 0.2886696 4.246676 0.0002162
am 2.935837 1.4109045 2.080819 0.0467155

We see that manual transmission (am-manual) increases MPG, on average, by 2.93 units comparing to the automatic transmission [value is in the intercept as the basic value] (p-value < 0.01, hence, our results are statistically significant). The variable named “wt”, or Weight (in 1000 lbs or circa 1/2 tonns) has significant (p-value is below 0.01) negative correlation on MPG; each additional 1000 lbs of weight decreases the MPG, or a distance we can run on the comparable amount of fuel (gallon). The variable named “qsec” and interpreted as “1/4 mile time” in the instructions to the dataset reveals the more time we need to spend to drive a comparable distance (1/4 of the mile), the more MPG will have this car (because sppedy cars usually tend to “eat” more fuel and, hence, would have lower MPG indicator comparing to the slower cars). Or, in other terms, each additional unit of the 1/4 mile time increases the MPG indicator on about 2.93 miles per galon.

Residuals & Diagnostics

Residual Plot

As seen from the plot, the only outlier has some significance at 1 level, but has no significance on 1.96 t-distribution level, which means, the results may be interpreted as OK.

sum((abs(dfbetas(bwFit)))>1.96)
## [1] 0
sum((abs(dfbetas(bwFit)))>1.0)
## [1] 1

Conclusion

The results we have are as expected: manual transmission, non-V-shaped engine and slower 1/4 miles capacity speed may be an evidence of fuel-economy car for the cars produced in 1970-s (as our primary database suggests).

Additional research template for submission

Executive Summary

Looking at a data set of a collection of cars, we are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). We are particularly interested in the following two questions: “Is an automatic or manual transmission better for MPG” “Quantify the MPG difference between automatic and manual transmissions”

Preliminary Analysis

Exploring the dataset. Main dataset appearance: [1] 32 11

mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1

[1] “Str() function output:” ‘data.frame’: 32 obs. of 11 variables: $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 … $ cyl : num 6 6 4 6 8 6 8 4 4 6 … $ disp: num 160 160 108 258 360 … $ hp : num 110 110 93 110 175 105 245 62 95 123 … $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 … $ wt : num 2.62 2.88 2.32 3.21 3.44 … $ qsec: num 16.5 17 18.6 19.4 17 … $ vs : num 0 0 1 1 0 1 0 1 1 1 … $ am : num 1 1 1 0 0 0 0 0 0 0 … $ gear: num 4 4 4 3 3 3 3 4 4 4 … $ carb: num 4 4 1 1 2 1 4 2 2 4 …

## Warning in kable_markdown(x, padding = padding, ...): The table should have a
## header (column names)

|| || || ||

[1] “Summary() function output:”

mpg cyl disp hp drat wt qsec vs am gear carb
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000 Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
Median :19.20 Median :6.000 Median :196.3 Median :123.0 Median :3.695 Median :3.325 Median :17.71 Median :0.0000 Median :0.0000 Median :4.000 Median :2.000
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375 Mean :0.4062 Mean :3.688 Mean :2.812
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000 Max. :1.0000 Max. :5.000 Max. :8.000
#### Data Description
The d ata was extracte d from the 1974 Motor Trend US m agazine, and com prises fuel cons umption and 10 a spects of automo bile design and p erformance for 32 automobiles (19 73–74 models).
#### Data Format
A dat a frame with 32 observations on 11 (numeric) var iables.

[, 1] mpg Miles/(US) gallon [, 2] cyl Number of cylinders [, 3] disp Displacement (cu.in.) [, 4] hp Gross horsepower [, 5] drat Rear axle ratio [, 6] wt Weight (1000 lbs) [, 7] qsec 1/4 mile time [, 8] vs Engine (0 = V-shaped, 1 = straight) [, 9] am Transmission (0 = automatic, 1 = manual) [,10] gear Number of forward gears [,11] carb Number of carburetors

Exploratory Analysis

Basic descriptive charts will help us to understand the main data patterns regarding to the mpg: **From the charts we see that cylinders, AM and VS parameters have potential significant impact to the MPG (as seen from the first chart line of the last chart [the boxplot charts]). Let us check from the basic statistical inference method - t-test in the next paragraph
##### Statistical Inference Analysis T-Test for the selected covariates versus MPG

## [1] "This code is generated automatically (including text) by the code compiled by \n       \n       Alexander Shemetev: "
## [1] "Covariate 1 AM: "
## [1] "P-value of the t-test result for the AM covariate is: 0.00137364 or, in percentages it is 0%."
## [1] "We need to reject the H0 that the difference petween the mentioned covariates is 0 (a\n\nsignificant effect may be expected). "
## mean in group automatic    mean in group manual 
##                17.14737                24.39231
## [1] "The absolute difference in the estimates for the MPG is: |24.392 - 17.147|, which, in absolute value is: 7.245"
## [1] "Covariate 2 CYL: "
## [1] "P-value of the t-test result for the VS covariate is: 0.00010984 or, in percentages it is 0%."
## [1] "We need to reject the H0 that the difference petween the mentioned covariates is 0 \n\n(a significant effect may be expected). "
## mean in group V mean in group S 
##        16.61667        24.55714
## [1] "The absolute difference in the estimates for the VS is: |24.557 - 16.617|, which, in absolute value is: 7.94"

As we see, we have tested the 2 binary variables on the t-test statistics. Let’s apply the t-test to all the variables in the dataset. We see that manual AM and S may look better in terms of MPG than automatic AM and V (V-shaped engine).

Regression Analysis

Fit the full model of the data

fullModel <- lm(mpg ~ ., data = mtcars)
summary(fullModel)        # knitr: results hidden
summary(fullModel)$coeff  # knitr: results hidden

We have the only coefficient significant at 10% level (WT) and 0 coefficients from the entire dataset significant at 5% or lower levels in terms of the t-statistics p-values. It means, we need another model. We can use automatic model selection algorithms like stepwise or best subset in order to fulfil this task and have some estimates of the dependencies. Small sample of the original dataset (below 300 observations) prevents us from having clear Gauss-Markov assumptions for the OLS-type models. Increasing the sample size would be of significant help here. At the same time, we work with what we have and suggested methods may cope with out task here.

Backward selection algorithm:

fullModel <- lm(mpg ~ ., data = mtcars)
bwFit <- step(fullModel)
summary(bwFit)       # knitr: results hidden
summary(bwFit)$coeff # knitr: results hidden

We see the next outcome: Estimate Std. Error t value Pr(>|t|) (Intercept) 9.617781 6.9595930 1.381946 1.779152e-01 wt -3.916504 0.7112016 -5.506882 6.952711e-06 qsec 1.225886 0.2886696 4.246676 2.161737e-04 am-manual 2.935837 1.4109045 2.080819 4.671551e-02 Or, in a more appropriate representation:

backfit = as.data.frame(summary(bwFit)$coeff)
knitr::kable(backfit)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.7083239 2.6048862 12.940421 0.0000000
cyl6 -3.0313445 1.4072835 -2.154040 0.0406827
cyl8 -2.1636753 2.2842517 -0.947214 0.3522509
hp -0.0321094 0.0136926 -2.345025 0.0269346
wt -2.4968294 0.8855878 -2.819404 0.0090814
am1 1.8092114 1.3963045 1.295714 0.2064597

We see that manual transmission (am-manual) increases MPG, on average, by 2.93 units comparing to the automatic transmission [value is in the intercept as the basic value] (p-value < 0.01, hence, our results are statistically significant). The variable named “wt”, or Weight (in 1000 lbs or circa 1/2 tonns) has significant (p-value is below 0.01) negative correlation on MPG; each additional 1000 lbs of weight decreases the MPG, or a distance we can run on the comparable amount of fuel (gallon). The variable named “qsec” and interpreted as “1/4 mile time” in the instructions to the dataset reveals the more time we need to spend to drive a comparable distance (1/4 of the mile), the more MPG will have this car (because sppedy cars usually tend to “eat” more fuel and, hence, would have lower MPG indicator comparing to the slower cars). Or, in other terms, each additional unit of the 1/4 mile time increases the MPG indicator on about 2.93 miles per galon.

Residuals & Diagnostics

Residual Plot

As seen from the plot, the only outlier has some significance at 1 level, but has no significance on 1.96 t-distribution level, which means, the results may be interpreted as OK.

sum((abs(dfbetas(bwFit)))>1.96)
## [1] 0
sum((abs(dfbetas(bwFit)))>1.0)
## [1] 0

Conclusion

The results we have are as expected: manual transmission, non-V-shaped engine and slower 1/4 miles capacity speed may be an evidence of fuel-economy car for the cars produced in 1970-s (as our primary database suggests).