This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials.
Exploring the dataset. Main dataset appearance:
The data are simulated and do not exist in reality. The plots of the data (produced by the simulation code in the appendix at the end of this part) look like this:
From the charts we see that our distribution is close to normal. It is slightly right-skewed, but still consistent with the normal-distribution concept. The sample mean and the theoretically expected mean are extremely close to each other, which corresponds with the results we expected.
# Theoretical mean:
1/lambda
## [1] 2
# Sample mean:
mean(raw_data_frame_1$simulated_1)
## [1] 2.000029
# Absolute difference:
1/lambda - mean(raw_data_frame_1$simulated_1)
## [1] -2.941894e-05
# Relative Difference:
round(abs(1/lambda - mean(raw_data_frame_1$simulated_1))/min(1/lambda, mean(raw_data_frame_1$simulated_1)), 10)
## [1] 1.47095e-05
# Is difference significant:
ifelse(round(abs(1/lambda - mean(raw_data_frame_1$simulated_1))/min(1/lambda, mean(raw_data_frame_1$simulated_1)), 10) < 0.05, "Is difference significant? The difference is insignificant; the results of the theoretical mean and of the actual mean are very close to each other", "Is difference significant? Yes, the difference between the theoretical mean and the actual mean may be significant")
## [1] "Is difference significant? The difference is insignificant; the results of the theoretical mean and of the actual mean are very close to each other"
Plotting the chart: as we see, the distribution is close to normal; the histogram and the overlaid density curves are distributed almost identically.
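Beyond the mean, the theoretical variance of the mean of n exponentials, (1/lambda)^2/n, can be compared with the observed variance, and a quantile-quantile plot gives a quick visual normality check (a sketch, again assuming the appendix objects are in scope):

# Theoretical variance of the sample mean vs. the observed variance
(1/lambda)^2 / n
var(raw_data_frame_1$simulated_1)
# Q-Q plot against the normal: points near the line support normality
qqnorm(raw_data_frame_1$simulated_1); qqline(raw_data_frame_1$simulated_1)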
my_new_data = ToothGrowth
summary(my_new_data)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
str(my_new_data)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
knitr::kable(head(my_new_data, 10))
len | supp | dose |
---|---|---|
4.2 | VC | 0.5 |
11.5 | VC | 0.5 |
7.3 | VC | 0.5 |
5.8 | VC | 0.5 |
6.4 | VC | 0.5 |
10.0 | VC | 0.5 |
11.2 | VC | 0.5 |
11.2 | VC | 0.5 |
5.2 | VC | 0.5 |
7.0 | VC | 0.5 |
# head(my_new_data, 10)
a_tooth_for_showing <- my_new_data %>% group_by(supp, dose) %>% summarise(len = mean(len))
## `summarise()` regrouping output by 'supp' (override with `.groups` argument)
knitr::kable(a_tooth_for_showing)
supp | dose | len |
---|---|---|
OJ | 0.5 | 13.23 |
OJ | 1.0 | 22.70 |
OJ | 2.0 | 26.06 |
VC | 0.5 | 7.98 |
VC | 1.0 | 16.77 |
VC | 2.0 | 26.14 |
#### Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there are other approaches worth considering.)
OJ = ToothGrowth$len[ToothGrowth$supp == 'OJ']
VC = ToothGrowth$len[ToothGrowth$supp == 'VC']
t.test(OJ, VC, alternative = "greater", paired = FALSE, var.equal = FALSE, conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: OJ and VC
## t = 1.9153, df = 55.309, p-value = 0.03032
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.4682687 Inf
## sample estimates:
## mean of x mean of y
## 20.66333 16.96333
ttt = t.test(OJ, VC, alternative = "greater", paired = FALSE, var.equal = FALSE, conf.level = 0.95)
ttt$p.value
## [1] 0.03031725
ifelse(ttt$p.value < 0.05, "We reject the null hypothesis of equal means at the 5% level: the OJ supplement is associated with greater tooth growth than VC.", "We cannot reject H0 that mean tooth growth is the same for OJ and VC.")
## [1] "We reject the null hypothesis of equal means at the 5% level: the OJ supplement is associated with greater tooth growth than VC."
dose_0_5 = ToothGrowth$len[ToothGrowth$dose == 0.5]
dose_1_0 = ToothGrowth$len[ToothGrowth$dose == 1]
dose_2_0 = ToothGrowth$len[ToothGrowth$dose == 2]
ttt_1 = t.test(dose_0_5, dose_1_0, alternative = "less", paired = FALSE, var.equal = FALSE, conf.level = 0.95)
ttt_2 = t.test(dose_1_0, dose_2_0, alternative = "less", paired = FALSE, var.equal = FALSE, conf.level = 0.95)
ttt_1$p.value
## [1] 6.341504e-08
ifelse(ttt_1$p.value < 0.05, "We reject the null hypothesis of equal means at the 5% level: the 1.0 dose is associated with greater tooth growth than the 0.5 dose.", "We cannot reject H0 that mean tooth growth is the same for the 0.5 and 1.0 doses.")
## [1] "We reject the null hypothesis of equal means at the 5% level: the 1.0 dose is associated with greater tooth growth than the 0.5 dose."
ttt_2$p.value
## [1] 9.532148e-06
ifelse(ttt_2$p.value < 0.05, "We reject the null hypothesis of equal means at the 5% level: the 2.0 dose is associated with greater tooth growth than the 1.0 dose.", "We cannot reject H0 that mean tooth growth is the same for the 1.0 and 2.0 doses.")
## [1] "We reject the null hypothesis of equal means at the 5% level: the 2.0 dose is associated with greater tooth growth than the 1.0 dose."
ifelse(ttt_2$p.value < 0.05 & ttt_1$p.value < 0.05 & ttt$p.value < 0.05, "We rejected H0 in all 3 requested comparisons.", "There was at least one comparison where we could not reject H0.")
## [1] "We rejected H0 in all 3 requested comparisons."
Hence, we found leads for further analysis, which goes beyond this coursework.
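One natural follow-up, using only the same class techniques (a sketch; not part of the original analysis), is to compare OJ against VC separately within each dose level:

# Hedged sketch: Welch t-test of OJ vs VC within each dose level
for (d in c(0.5, 1, 2)) {
  oj <- ToothGrowth$len[ToothGrowth$supp == "OJ" & ToothGrowth$dose == d]
  vc <- ToothGrowth$len[ToothGrowth$supp == "VC" & ToothGrowth$dose == d]
  cat("dose =", d, ": p-value =", round(t.test(oj, vc)$p.value, 4), "\n")
}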
Appendix: simulation and plotting code for the exponential-means analysis in the first part of this document.
library(GGally)
library(dplyr)
library(ggplot2)
#set seed
set.seed(2020)
# inputs
lambda <- 0.5    # rate of the exponential distribution (theoretical mean = 1/lambda)
n <- 400         # number of exponentials averaged in each simulated sample
sim <- 100000    # number of simulated sample means
# simulation: draw `sim` samples of n exponentials and keep each sample mean
# (replicate() avoids growing a vector inside a loop)
simulated_1 <- replicate(sim, mean(rexp(n, lambda)))
raw_data_frame_1 <- as.data.frame(simulated_1)
a_statistics_1 <- data.frame(statistics = c("sample mean", "theoretical mean"),
values = c(mean(raw_data_frame_1$simulated_1), 1/lambda))
aplot1<- ggplot(raw_data_frame_1, aes(x = simulated_1))
aplot1<- aplot1 + geom_histogram(fill = "green"
, color = "blue"
, binwidth = .05
)
aplot1<- aplot1 + geom_vline(data = a_statistics_1
, aes(xintercept = values, col = statistics)
, size = 1.5
)
aplot1<- aplot1 + theme_light()
aplot1<- aplot1 + labs(title="Exponential means distribution"
,subtitle = "Theoretical mean vs. Sample mean"
,x = "exp means"
,y = "frequency"
)
# aplot1
aplot2 <- NULL
aplot2 <- ggplot(raw_data_frame_1, aes( x = simulated_1)
)
aplot2 <- aplot2 + geom_histogram( aes(y=..density..)
,binwidth = .05
,col = "white"
,fill = "grey"
,size = .5
,show.legend = TRUE
)
aplot2 <- aplot2 + stat_function( fun = dnorm
, args = list(mean = mean(simulated_1)
,sd = sqrt(var(simulated_1))
)
, col = "red"
, size = 1
)
aplot2 <- aplot2 + stat_function( fun = dnorm
, args = list(mean = 1/lambda
,sd = sqrt((1/lambda)^2/n)
)
, col = "black"
, size = 1
)
aplot2 <- aplot2 + theme_light()
aplot2 <- aplot2 + labs(title="Exponential means distribution"
,subtitle = "Normal distribution vs. Sample distribution"
,x = "exp means"
)
# aplot2
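Note that the task statement mentions means of 40 exponentials, while the code above uses n = 400. A minimal variant for the stated n (a sketch, not the original code; the 1000 repetitions are an arbitrary choice) would be:

# Hedged sketch: the same simulation with n = 40, as in the task statement
set.seed(2020)
simulated_40 <- replicate(1000, mean(rexp(40, lambda)))
mean(simulated_40)   # should be close to 1/lambda = 2
var(simulated_40)    # should be close to (1/lambda)^2 / 40 = 0.1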
Looking at a data set of a collection of cars, we are interested in exploring the relationship between a set of variables and miles per gallon (MPG, the outcome). We are particularly interested in the following two questions: "Is an automatic or manual transmission better for MPG?" and "Quantify the MPG difference between automatic and manual transmissions."
Exploring the dataset. Main dataset appearance (dim() output):
## [1] 32 11
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.90 | 1 | 0 | 4 | 4 |
Merc 450SE | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.070 | 17.40 | 0 | 0 | 3 | 3 |
Merc 450SL | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.730 | 17.60 | 0 | 0 | 3 | 3 |
Merc 450SLC | 15.2 | 8 | 275.8 | 180 | 3.07 | 3.780 | 18.00 | 0 | 0 | 3 | 3 |
Cadillac Fleetwood | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 |
Lincoln Continental | 10.4 | 8 | 460.0 | 215 | 3.00 | 5.424 | 17.82 | 0 | 0 | 3 | 4 |
Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 |
Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
## [1] "str() function output:"
## 'data.frame': 32 obs. of 11 variables:
##  $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num 160 160 108 258 360 ...
##  $ hp  : num 110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num 2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num 16.5 17 18.6 19.4 17 ...
##  $ vs  : num 0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num 1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
## [1] "summary() function output:"
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
Min. :10.40 | Min. :4.000 | Min. : 71.1 | Min. : 52.0 | Min. :2.760 | Min. :1.513 | Min. :14.50 | Min. :0.0000 | Min. :0.0000 | Min. :3.000 | Min. :1.000 | |
1st Qu.:15.43 | 1st Qu.:4.000 | 1st Qu.:120.8 | 1st Qu.: 96.5 | 1st Qu.:3.080 | 1st Qu.:2.581 | 1st Qu.:16.89 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:3.000 | 1st Qu.:2.000 | |
Median :19.20 | Median :6.000 | Median :196.3 | Median :123.0 | Median :3.695 | Median :3.325 | Median :17.71 | Median :0.0000 | Median :0.0000 | Median :4.000 | Median :2.000 | |
Mean :20.09 | Mean :6.188 | Mean :230.7 | Mean :146.7 | Mean :3.597 | Mean :3.217 | Mean :17.85 | Mean :0.4375 | Mean :0.4062 | Mean :3.688 | Mean :2.812 | |
3rd Qu.:22.80 | 3rd Qu.:8.000 | 3rd Qu.:326.0 | 3rd Qu.:180.0 | 3rd Qu.:3.920 | 3rd Qu.:3.610 | 3rd Qu.:18.90 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:4.000 | 3rd Qu.:4.000 | |
Max. :33.90 | Max. :8.000 | Max. :472.0 | Max. :335.0 | Max. :4.930 | Max. :5.424 | Max. :22.90 | Max. :1.0000 | Max. :1.0000 | Max. :5.000 | Max. :8.000 | |
#### Data Description

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

#### Data Format

A data frame with 32 observations on 11 (numeric) variables.
Variable | Description |
---|---|
mpg | Miles/(US) gallon |
cyl | Number of cylinders |
disp | Displacement (cu. in.) |
hp | Gross horsepower |
drat | Rear axle ratio |
wt | Weight (1000 lbs) |
qsec | 1/4 mile time |
vs | Engine (0 = V-shaped, 1 = straight) |
am | Transmission (0 = automatic, 1 = manual) |
gear | Number of forward gears |
carb | Number of carburetors |
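The t-test output below labels the transmission groups "automatic"/"manual" and the engine groups "V"/"S", which implies a factor conversion not shown in the document. A minimal sketch of what that step likely looked like (the copy cars_df is an assumption, not the author's code):

# Hedged sketch: label the binary covariates, working on a copy of mtcars
cars_df <- mtcars
cars_df$am <- factor(cars_df$am, levels = c(0, 1), labels = c("automatic", "manual"))
cars_df$vs <- factor(cars_df$vs, levels = c(0, 1), labels = c("V", "S"))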
Basic descriptive charts help us understand the main data patterns with respect to mpg. From the charts we see that the cyl, am, and vs parameters have a potentially significant impact on MPG (as seen in the first row of the last chart, the boxplots). Let us check this with a basic statistical-inference method, the t-test, in the next paragraph.
##### Statistical Inference Analysis: t-tests for the selected covariates versus MPG
## [1] "This code is generated automatically (including text) by the code compiled by \n \n Alexander Shemetev: "
## [1] "Covariate 1 AM: "
## [1] "P-value of the t-test result for the AM covariate is: 0.00137364 or, in percentages it is 0%."
## [1] "We need to reject the H0 that the difference petween the mentioned covariates is 0 (a\n\nsignificant effect may be expected). "
## mean in group automatic mean in group manual
## 17.14737 24.39231
## [1] "The absolute difference in the estimates for the MPG is: |24.392 - 17.147|, which, in absolute value is: 7.245"
## [1] "Covariate 2 CYL: "
## [1] "P-value of the t-test result for the VS covariate is: 0.00010984 or, in percentages it is 0%."
## [1] "We need to reject the H0 that the difference petween the mentioned covariates is 0 \n\n(a significant effect may be expected). "
## mean in group V mean in group S
## 16.61667 24.55714
## [1] "The absolute difference in the estimates for the VS is: |24.557 - 16.617|, which, in absolute value is: 7.94"
As we see, we have tested the two binary variables with the t-test: manual transmission (am = manual) and a straight engine (vs = S) look better in terms of MPG than an automatic transmission and a V-shaped engine. The same test can be applied systematically across the dataset's binary covariates, as sketched below.
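A minimal sketch (not the author's original code; only the binary covariates admit a two-group t-test directly):

# Hedged sketch: two-group t-test of mpg across each binary covariate
for (v in c("am", "vs")) {
  groups <- split(mtcars$mpg, mtcars[[v]])
  cat(v, ": p-value =", round(t.test(groups[[1]], groups[[2]])$p.value, 5), "\n")
}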
Fitting the full model to the data:
fullModel <- lm(mpg ~ ., data = mtcars)
summary(fullModel) # knitr: results hidden
summary(fullModel)$coeff # knitr: results hidden
Only one coefficient (wt) is significant at the 10% level, and no coefficient in the full model is significant at the 5% level or lower in terms of the t-statistic p-values. This means we need another model. We can use automatic model-selection algorithms such as stepwise or best-subset selection to carry out this task and obtain estimates of the dependencies. The small sample size of the original dataset (below 300 observations) prevents us from fully relying on the Gauss-Markov assumptions for OLS-type models; increasing the sample size would help considerably. Still, we work with what we have, and the suggested methods can cope with our task. A best-subset sketch is given below, followed by the backward-selection algorithm.
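A minimal best-subset sketch using the leaps package (an assumption: leaps is not used anywhere in the original document):

# Hedged sketch: best-subset selection (install.packages("leaps") if needed)
library(leaps)
best <- regsubsets(mpg ~ ., data = mtcars, nbest = 1)
summary(best)$which   # which variables enter the best model of each size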
Backward selection algorithm:
fullModel <- lm(mpg ~ ., data = mtcars)
bwFit <- step(fullModel)
summary(bwFit) # knitr: results hidden
summary(bwFit)$coeff # knitr: results hidden
The backward-selection algorithm retains wt, qsec, and am. The coefficient table, in a more appropriate representation:
backfit = as.data.frame(summary(bwFit)$coeff)
knitr::kable(backfit)
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 9.617781 | 6.9595930 | 1.381946 | 0.1779152 |
wt | -3.916504 | 0.7112016 | -5.506882 | 0.0000070 |
qsec | 1.225886 | 0.2886696 | 4.246676 | 0.0002162 |
am | 2.935837 | 1.4109045 | 2.080819 | 0.0467155 |
We see that a manual transmission (am = manual) increases MPG, on average, by 2.94 units compared to an automatic transmission [the automatic baseline is absorbed into the intercept] (p-value < 0.05, so the result is statistically significant). The variable wt, weight in 1000 lbs (roughly half a tonne), has a significant (p-value below 0.01) negative effect on MPG: each additional 1000 lbs of weight decreases the distance the car can travel on a gallon of fuel. The variable qsec, described as the 1/4 mile time in the dataset documentation, shows that the more time a car needs to cover a quarter mile, the higher its MPG (speedy cars tend to burn more fuel and hence have lower MPG than slower cars): each additional second of 1/4 mile time increases MPG by about 1.23 miles per gallon.
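To quantify the uncertainty around these estimates, confidence intervals can be read off directly (a sketch; bwFit is the backward-selected model above, and the row name for the transmission effect may appear as am, am1, or ammanual depending on how am was coded):

# 95% confidence intervals for the retained coefficients
confint(bwFit)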
Residual Plot
As seen from the residual plot (the figure is not reproduced here) and the dfbetas check below, exactly one dfbeta exceeds the 1.0 level and none exceed the 1.96 level (the usual two-sided 5% normal cutoff), so no single observation drives the results, and they may be interpreted as sound.
sum((abs(dfbetas(bwFit)))>1.96)
## [1] 0
sum((abs(dfbetas(bwFit)))>1.0)
## [1] 1
The results are as expected: a manual transmission, a non-V-shaped engine, and a slower quarter-mile time can be evidence of a fuel-economical car among the 1970s models covered by our dataset.