This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials.
Exploring the dataset. Main dataset appearance:
The data are simulated and do not exist in reality. The plots of the data (produced by the simulation code in the appendix at the end of this part) look like this:
From the charts we see that our distribution is close to normal. It is slightly right-skewed, but still consistent with the normal-distribution concept. The sample mean and the theoretically expected mean are extremely close to each other, which corresponds with the results we expected.
# Theoretical mean:
1/lambda
## [1] 2
# Sample mean:
mean(raw_data_frame_1$simulated_1)
## [1] 2.000029
# Absolute difference:
1/lambda - mean(raw_data_frame_1$simulated_1)
## [1] -2.941894e-05
# Relative Difference:
round(abs(1/lambda - mean(raw_data_frame_1$simulated_1))/min(1/lambda, mean(raw_data_frame_1$simulated_1)), 10)
## [1] 1.47095e-05
# Is difference significant:
ifelse(round(abs(1/lambda - mean(raw_data_frame_1$simulated_1))/min(1/lambda, mean(raw_data_frame_1$simulated_1)), 10) < 0.05, "Is difference significant? The difference is insignificant; the results of the theoretical mean and of the actual mean are very close to each other", "Is difference significant? Yes, the difference between the theoretical mean and the actual mean may be significant")
## [1] "Is difference significant? The difference is insignificant; the results of the theoretical mean and of the actual mean are very close to each other"
Plotting the chart: as we see, the distribution is close to normal; the histogram and the overlaid density curves are distributed almost identically.
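Beyond the mean, the theoretical variance of the mean of n exponentials, (1/lambda)^2/n, can be compared with the observed variance, and a quantile-quantile plot gives a quick visual normality check (a sketch, again assuming the appendix objects are in scope):

# Theoretical variance of the sample mean vs. the observed variance
(1/lambda)^2 / n
var(raw_data_frame_1$simulated_1)
# Q-Q plot against the normal: points near the line support normality
qqnorm(raw_data_frame_1$simulated_1); qqline(raw_data_frame_1$simulated_1)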
my_new_data = ToothGrowth
summary(my_new_data)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
str(my_new_data)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
knitr::kable(head(my_new_data, 10))
len | supp | dose |
---|---|---|
4.2 | VC | 0.5 |
11.5 | VC | 0.5 |
7.3 | VC | 0.5 |
5.8 | VC | 0.5 |
6.4 | VC | 0.5 |
10.0 | VC | 0.5 |
11.2 | VC | 0.5 |
11.2 | VC | 0.5 |
5.2 | VC | 0.5 |
7.0 | VC | 0.5 |
# head(my_new_data, 10)
a_tooth_for_showing <- my_new_data %>% group_by(supp, dose) %>% summarise(len = mean(len))
## `summarise()` regrouping output by 'supp' (override with `.groups` argument)
knitr::kable(a_tooth_for_showing)
supp | dose | len |
---|---|---|
OJ | 0.5 | 13.23 |
OJ | 1.0 | 22.70 |
OJ | 2.0 | 26.06 |
VC | 0.5 | 7.98 |
VC | 1.0 | 16.77 |
VC | 2.0 | 26.14 |
#### Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there are other approaches worth considering.)
OJ = ToothGrowth$len[ToothGrowth$supp == 'OJ']
VC = ToothGrowth$len[ToothGrowth$supp == 'VC']
t.test(OJ, VC, alternative = "greater", paired = FALSE, var.equal = FALSE, conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: OJ and VC
## t = 1.9153, df = 55.309, p-value = 0.03032
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.4682687 Inf
## sample estimates:
## mean of x mean of y
## 20.66333 16.96333
ttt = t.test(OJ, VC, alternative = "greater", paired = FALSE, var.equal = FALSE, conf.level = 0.95)
ttt$p.value
## [1] 0.03031725
ifelse(ttt$p.value < 0.05, "We reject the null hypothesis of equal means at the 5% level: the OJ supplement is associated with greater tooth growth than VC.", "We cannot reject H0 that mean tooth growth is the same for OJ and VC.")
## [1] "We reject the null hypothesis of equal means at the 5% level: the OJ supplement is associated with greater tooth growth than VC."
dose_0_5 = ToothGrowth$len[ToothGrowth$dose == 0.5]
dose_1_0 = ToothGrowth$len[ToothGrowth$dose == 1]
dose_2_0 = ToothGrowth$len[ToothGrowth$dose == 2]
ttt_1 = t.test(dose_0_5, dose_1_0, alternative = "less", paired = FALSE, var.equal = FALSE, conf.level = 0.95)
ttt_2 = t.test(dose_1_0, dose_2_0, alternative = "less", paired = FALSE, var.equal = FALSE, conf.level = 0.95)
ttt_1$p.value
## [1] 6.341504e-08
ifelse(ttt_1$p.value < 0.05, "We reject the null hypothesis of equal means at the 5% level: the 1.0 dose is associated with greater tooth growth than the 0.5 dose.", "We cannot reject H0 that mean tooth growth is the same for the 0.5 and 1.0 doses.")
## [1] "We reject the null hypothesis of equal means at the 5% level: the 1.0 dose is associated with greater tooth growth than the 0.5 dose."
ttt_2$p.value
## [1] 9.532148e-06
ifelse(ttt_2$p.value < 0.05, "We reject the null hypothesis of equal means at the 5% level: the 2.0 dose is associated with greater tooth growth than the 1.0 dose.", "We cannot reject H0 that mean tooth growth is the same for the 1.0 and 2.0 doses.")
## [1] "We reject the null hypothesis of equal means at the 5% level: the 2.0 dose is associated with greater tooth growth than the 1.0 dose."
ifelse(ttt_2$p.value < 0.05 & ttt_1$p.value < 0.05 & ttt$p.value < 0.05, "We rejected H0 in all 3 requested comparisons.", "There was at least one comparison where we could not reject H0.")
## [1] "We rejected H0 in all 3 requested comparisons."
Hence, we found leads for further analysis, which goes beyond this coursework.
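One natural follow-up, using only the same class techniques (a sketch; not part of the original analysis), is to compare OJ against VC separately within each dose level:

# Hedged sketch: Welch t-test of OJ vs VC within each dose level
for (d in c(0.5, 1, 2)) {
  oj <- ToothGrowth$len[ToothGrowth$supp == "OJ" & ToothGrowth$dose == d]
  vc <- ToothGrowth$len[ToothGrowth$supp == "VC" & ToothGrowth$dose == d]
  cat("dose =", d, ": p-value =", round(t.test(oj, vc)$p.value, 4), "\n")
}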
Appendix: simulation and plotting code for the exponential-means analysis in the first part of this document.
library(GGally)
library(dplyr)
library(ggplot2)
#set seed
set.seed(2020)
# inputs
lambda <- 0.5    # rate of the exponential distribution (theoretical mean = 1/lambda)
n <- 400         # number of exponentials averaged in each simulated sample
sim <- 100000    # number of simulated sample means
# simulation: draw `sim` samples of n exponentials and keep each sample mean
# (replicate() avoids growing a vector inside a loop)
simulated_1 <- replicate(sim, mean(rexp(n, lambda)))
raw_data_frame_1 <- as.data.frame(simulated_1)
a_statistics_1 <- data.frame(statistics = c("sample mean", "theoretical mean"),
values = c(mean(raw_data_frame_1$simulated_1), 1/lambda))
aplot1<- ggplot(raw_data_frame_1, aes(x = simulated_1))
aplot1<- aplot1 + geom_histogram(fill = "green"
, color = "blue"
, binwidth = .05
)
aplot1<- aplot1 + geom_vline(data = a_statistics_1
, aes(xintercept = values, col = statistics)
, size = 1.5
)
aplot1<- aplot1 + theme_light()
aplot1<- aplot1 + labs(title="Exponential means distribution"
,subtitle = "Theoretical mean vs. Sample mean"
,x = "exp means"
,y = "frequency"
)
# aplot1
aplot2 <- NULL
aplot2 <- ggplot(raw_data_frame_1, aes( x = simulated_1)
)
aplot2 <- aplot2 + geom_histogram( aes(y=..density..)
,binwidth = .05
,col = "white"
,fill = "grey"
,size = .5
,show.legend = TRUE
)
aplot2 <- aplot2 + stat_function( fun = dnorm
, args = list(mean = mean(simulated_1)
,sd = sqrt(var(simulated_1))
)
, col = "red"
, size = 1
)
aplot2 <- aplot2 + stat_function( fun = dnorm
, args = list(mean = 1/lambda
,sd = sqrt((1/lambda)^2/n)
)
, col = "black"
, size = 1
)
aplot2 <- aplot2 + theme_light()
aplot2 <- aplot2 + labs(title="Exponential means distribution"
,subtitle = "Normal distribution vs. Sample distribution"
,x = "exp means"
)
# aplot2
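Note that the task statement mentions means of 40 exponentials, while the code above uses n = 400. A minimal variant for the stated n (a sketch, not the original code; the 1000 repetitions are an arbitrary choice) would be:

# Hedged sketch: the same simulation with n = 40, as in the task statement
set.seed(2020)
simulated_40 <- replicate(1000, mean(rexp(40, lambda)))
mean(simulated_40)   # should be close to 1/lambda = 2
var(simulated_40)    # should be close to (1/lambda)^2 / 40 = 0.1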
Looking at a data set of a collection of cars, we are interested in exploring the relationship between a set of variables and miles per gallon (MPG, the outcome). We are particularly interested in the following two questions: "Is an automatic or manual transmission better for MPG?" and "Quantify the MPG difference between automatic and manual transmissions."
Exploring the dataset. Main dataset appearance (dim() output):
## [1] 32 11
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.90 | 1 | 0 | 4 | 4 |
Merc 450SE | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.070 | 17.40 | 0 | 0 | 3 | 3 |
Merc 450SL | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.730 | 17.60 | 0 | 0 | 3 | 3 |
Merc 450SLC | 15.2 | 8 | 275.8 | 180 | 3.07 | 3.780 | 18.00 | 0 | 0 | 3 | 3 |
Cadillac Fleetwood | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 |
Lincoln Continental | 10.4 | 8 | 460.0 | 215 | 3.00 | 5.424 | 17.82 | 0 | 0 | 3 | 4 |
Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 |
Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
## [1] "str() function output:"
## 'data.frame': 32 obs. of 11 variables:
##  $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num 160 160 108 258 360 ...
##  $ hp  : num 110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num 2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num 16.5 17 18.6 19.4 17 ...
##  $ vs  : num 0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num 1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
## [1] "summary() function output:"
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
Min. :10.40 | Min. :4.000 | Min. : 71.1 | Min. : 52.0 | Min. :2.760 | Min. :1.513 | Min. :14.50 | Min. :0.0000 | Min. :0.0000 | Min. :3.000 | Min. :1.000 | |
1st Qu.:15.43 | 1st Qu.:4.000 | 1st Qu.:120.8 | 1st Qu.: 96.5 | 1st Qu.:3.080 | 1st Qu.:2.581 | 1st Qu.:16.89 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:3.000 | 1st Qu.:2.000 | |
Median :19.20 | Median :6.000 | Median :196.3 | Median :123.0 | Median :3.695 | Median :3.325 | Median :17.71 | Median :0.0000 | Median :0.0000 | Median :4.000 | Median :2.000 | |
Mean :20.09 | Mean :6.188 | Mean :230.7 | Mean :146.7 | Mean :3.597 | Mean :3.217 | Mean :17.85 | Mean :0.4375 | Mean :0.4062 | Mean :3.688 | Mean :2.812 | |
3rd Qu.:22.80 | 3rd Qu.:8.000 | 3rd Qu.:326.0 | 3rd Qu.:180.0 | 3rd Qu.:3.920 | 3rd Qu.:3.610 | 3rd Qu.:18.90 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:4.000 | 3rd Qu.:4.000 | |
Max. :33.90 | Max. :8.000 | Max. :472.0 | Max. :335.0 | Max. :4.930 | Max. :5.424 | Max. :22.90 | Max. :1.0000 | Max. :1.0000 | Max. :5.000 | Max. :8.000 | |
#### Data Description

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

#### Data Format

A data frame with 32 observations on 11 (numeric) variables.
Variable | Description |
---|---|
mpg | Miles/(US) gallon |
cyl | Number of cylinders |
disp | Displacement (cu. in.) |
hp | Gross horsepower |
drat | Rear axle ratio |
wt | Weight (1000 lbs) |
qsec | 1/4 mile time |
vs | Engine (0 = V-shaped, 1 = straight) |
am | Transmission (0 = automatic, 1 = manual) |
gear | Number of forward gears |
carb | Number of carburetors |
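The t-test output below labels the transmission groups "automatic"/"manual" and the engine groups "V"/"S", which implies a factor conversion not shown in the document. A minimal sketch of what that step likely looked like (the copy cars_df is an assumption, not the author's code):

# Hedged sketch: label the binary covariates, working on a copy of mtcars
cars_df <- mtcars
cars_df$am <- factor(cars_df$am, levels = c(0, 1), labels = c("automatic", "manual"))
cars_df$vs <- factor(cars_df$vs, levels = c(0, 1), labels = c("V", "S"))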
Basic descriptive charts help us understand the main data patterns with respect to mpg. From the charts we see that the cyl, am, and vs parameters have a potentially significant impact on MPG (as seen in the first row of the last chart, the boxplots). Let us check this with a basic statistical-inference method, the t-test, in the next paragraph.
##### Statistical Inference Analysis: t-tests for the selected covariates versus MPG
## [1] "This code is generated automatically (including text) by the code compiled by \n \n Alexander Shemetev: "
## [1] "Covariate 1 AM: "
## [1] "P-value of the t-test result for the AM covariate is: 0.00137364 or, in percentages it is 0%."
## [1] "We need to reject the H0 that the difference petween the mentioned covariates is 0 (a\n\nsignificant effect may be expected). "
## mean in group automatic mean in group manual
## 17.14737 24.39231
## [1] "The absolute difference in the estimates for the MPG is: |24.392 - 17.147|, which, in absolute value is: 7.245"
## [1] "Covariate 2 CYL: "
## [1] "P-value of the t-test result for the VS covariate is: 0.00010984 or, in percentages it is 0%."
## [1] "We need to reject the H0 that the difference petween the mentioned covariates is 0 \n\n(a significant effect may be expected). "
## mean in group V mean in group S
## 16.61667 24.55714
## [1] "The absolute difference in the estimates for the VS is: |24.557 - 16.617|, which, in absolute value is: 7.94"
As we see, we have tested the two binary variables with the t-test: manual transmission (am = manual) and a straight engine (vs = S) look better in terms of MPG than an automatic transmission and a V-shaped engine. The same test can be applied systematically across the dataset's binary covariates, as sketched below.
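A minimal sketch (not the author's original code; only the binary covariates admit a two-group t-test directly):

# Hedged sketch: two-group t-test of mpg across each binary covariate
for (v in c("am", "vs")) {
  groups <- split(mtcars$mpg, mtcars[[v]])
  cat(v, ": p-value =", round(t.test(groups[[1]], groups[[2]])$p.value, 5), "\n")
}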
Fitting the full model to the data:
fullModel <- lm(mpg ~ ., data = mtcars)
summary(fullModel) # knitr: results hidden
summary(fullModel)$coeff # knitr: results hidden
Only one coefficient (wt) is significant at the 10% level, and no coefficient in the full model is significant at the 5% level or lower in terms of the t-statistic p-values. This means we need another model. We can use automatic model-selection algorithms such as stepwise or best-subset selection to carry out this task and obtain estimates of the dependencies. The small sample size of the original dataset (below 300 observations) prevents us from fully relying on the Gauss-Markov assumptions for OLS-type models; increasing the sample size would help considerably. Still, we work with what we have, and the suggested methods can cope with our task. A best-subset sketch is given below, followed by the backward-selection algorithm.
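A minimal best-subset sketch using the leaps package (an assumption: leaps is not used anywhere in the original document):

# Hedged sketch: best-subset selection (install.packages("leaps") if needed)
library(leaps)
best <- regsubsets(mpg ~ ., data = mtcars, nbest = 1)
summary(best)$which   # which variables enter the best model of each size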
Backward selection algorithm:
fullModel <- lm(mpg ~ ., data = mtcars)
bwFit <- step(fullModel)
summary(bwFit) # knitr: results hidden
summary(bwFit)$coeff # knitr: results hidden
The backward-selection algorithm retains wt, qsec, and am. The coefficient table, in a more appropriate representation:
backfit = as.data.frame(summary(bwFit)$coeff)
knitr::kable(backfit)
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 9.617781 | 6.9595930 | 1.381946 | 0.1779152 |
wt | -3.916504 | 0.7112016 | -5.506882 | 0.0000070 |
qsec | 1.225886 | 0.2886696 | 4.246676 | 0.0002162 |
am | 2.935837 | 1.4109045 | 2.080819 | 0.0467155 |
We see that a manual transmission (am = manual) increases MPG, on average, by 2.94 units compared to an automatic transmission [the automatic baseline is absorbed into the intercept] (p-value < 0.05, so the result is statistically significant). The variable wt, weight in 1000 lbs (roughly half a tonne), has a significant (p-value below 0.01) negative effect on MPG: each additional 1000 lbs of weight decreases the distance the car can travel on a gallon of fuel. The variable qsec, described as the 1/4 mile time in the dataset documentation, shows that the more time a car needs to cover a quarter mile, the higher its MPG (speedy cars tend to burn more fuel and hence have lower MPG than slower cars): each additional second of 1/4 mile time increases MPG by about 1.23 miles per gallon.
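To quantify the uncertainty around these estimates, confidence intervals can be read off directly (a sketch; bwFit is the backward-selected model above, and the row name for the transmission effect may appear as am, am1, or ammanual depending on how am was coded):

# 95% confidence intervals for the retained coefficients
confint(bwFit)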
Residual Plot
As seen from the residual plot (the figure is not reproduced here) and the dfbetas check below, exactly one dfbeta exceeds the 1.0 level and none exceed the 1.96 level (the usual two-sided 5% normal cutoff), so no single observation drives the results, and they may be interpreted as sound.
sum((abs(dfbetas(bwFit)))>1.96)
## [1] 0
sum((abs(dfbetas(bwFit)))>1.0)
## [1] 1
The results are as expected: a manual transmission, a non-V-shaped engine, and a slower quarter-mile time can be evidence of a fuel-economical car among the 1970s models covered by our dataset.