[1] 45.14724 70.21451 61.49509 53.18533 53.18820
These distinctions arise from the scale of the random variable we are using.
Recall that the difference between discrete and continuous random variables is the following:
You can always represent observations with numbers.
There is no reason why you cannot do calculations with these numbers.
The problem is that the results of those calculations may not be defined and, depending on the type of discrete random variable, could be completely meaningless.
Car brands are nominal and can be represented by numbers, but the numbers are only labels. You might prefer one brand over another, but that preference is based on other attributes (safety, power, etc.) rather than on the brand's position on any numeric scale.
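To make the point concrete, here is a minimal sketch in R. The brand names and the numeric codes are made up for illustration and do not come from any dataset used in these notes.

```r
# Hypothetical car brands (illustrative values only)
brands <- c("Toyota", "Ford", "Honda", "Ford", "Toyota", "Honda", "Ford")

# Storing the brands as a factor assigns each level an integer code
brand_factor <- factor(brands)
brand_codes <- as.integer(brand_factor)

# Levels are sorted alphabetically: Ford = 1, Honda = 2, Toyota = 3
levels(brand_factor)
brand_codes

# R will happily average the codes, but the result depends only on the
# arbitrary labeling and says nothing about the cars themselves
mean(brand_codes)

# Relabeling the same brands with different codes changes the "mean"
relabeled_codes <- c(Toyota = 10, Ford = 20, Honda = 30)[brands]
mean(relabeled_codes)

# Counts per category are a meaningful summary for nominal data
table(brand_factor)
```

The two averages differ even though the underlying observations are identical, which is why the arithmetic is not meaningful for a nominal variable.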
Real-world processes are in many cases more complex than the distributions we have mentioned. In these cases we can improve our understanding of the real-world phenomenon by increasing the complexity of the distribution we work with.
For instance, we can create a mixed distribution of inflated 0s and positive values. The price is that interpretation becomes more challenging.
library(cowplot)
library(ggpubr)
#obtain the first 5 rows from the data dataframe.
#combine them with a column of the numbers from 1 to 5 and the normal density at each value.
Y=round(data.frame(V1=data[1:5,1], #the first five observations
V2=1:5, #index from 1 to 5
V3=dnorm(data[1:5,1],mean=50,sd=10)),1) #normal density at each observation
p2=ggplot(data = data.frame(x = c(20, 80)), aes(x)) +
ggtitle(bquote("Densities of X when" ~ mu ~ "= mean(data) and" ~ sigma ~ "= sd(data)"))+
theme(plot.title = element_text(hjust = 0.5))+
stat_function(fun = dnorm, n = 101, args = list(mean = mean(data$x1), sd = sd(data$x1))) + ylab("")+
scale_x_continuous(breaks = Y$V1)+
#mark the density at each of the five observations; constant size and color are set outside aes()
geom_point(data=Y,aes(x=V1,y=dnorm(V1,mean=mean(data$x1),sd=sd(data$x1))),size=4,color="red")+theme(legend.position = "none")+
#drop a vertical segment from each point down to the x axis
geom_segment(data=Y,aes(x = V1, y = 0,
xend = V1, yend =dnorm(V1,mean=mean(data$x1),sd=sd(data$x1))),color="blue")
p2
As you know, the normal distribution is symmetric, yet few things in practice are symmetrically distributed.
Why, then, do we talk about this distribution so much in intro classes?
Central Limit Theorem
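A short simulation can illustrate the answer. The sketch below draws repeated samples from an exponential distribution, which is strongly right-skewed, and looks at the distribution of the sample means. The choice of distribution, the sample size, the number of samples, the seed, and the helper `skewness` are all assumptions made for illustration.

```r
set.seed(2024)  # arbitrary seed for reproducibility

n_obs <- 30        # observations per sample
n_samples <- 2000  # number of repeated samples

# Each column is one sample from Exponential(rate = 1): mean 1, sd 1
samples <- matrix(rexp(n_obs * n_samples, rate = 1), nrow = n_obs)

# One mean per sample
sample_means <- colMeans(samples)

# Simple moment-based skewness (hypothetical helper, 0 for a symmetric shape)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# The CLT suggests the means are roughly normal with
# mean 1 and standard deviation 1 / sqrt(n_obs)
c(mean_of_means = mean(sample_means),
  sd_of_means = sd(sample_means),
  clt_sd = 1 / sqrt(n_obs))

# The sample means are much less skewed than the raw observations
c(skew_raw = skewness(as.vector(samples)),
  skew_means = skewness(sample_means))
```

Even though the individual observations are far from symmetric, their averages are much closer to a symmetric, normal-looking shape, and the approximation improves as `n_obs` grows.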
## N random numbers, that have a mixed distribution.
## K numbers greater than 0.
## K+1 to N numbers that are 0
## p is the probability of a number being greater than 0.
N=50000
p=0.8
K=round(N*p,0)
data=as.data.frame(matrix(nrow=N,ncol=1))
data[1:K,1]=(rgamma(n=K,shape=250,rate=5))
data[(K+1):N,1]=0
names(data)='x1'
ggplot(data=data,aes(x=x1))+geom_density()
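One way to see why interpretation gets harder: the overall mean of this mixture is no longer the gamma mean. With probability p a value comes from the gamma distribution with mean shape/rate, and otherwise it is 0, so the mixture mean is p * shape / rate = 0.8 * 250 / 5 = 40, a value around which relatively few observations actually fall. The sketch below repeats the simulation with a fixed seed and checks this; the seed and the names `x_check`, `theoretical_mean`, `simulated_mean`, and `share_near_mean` are assumptions for illustration.

```r
set.seed(123)  # arbitrary seed, only to make the sketch reproducible

N <- 50000
p <- 0.8
K <- round(N * p, 0)
shape <- 250
rate <- 5

# K positive draws from the gamma distribution, followed by N - K zeros
x_check <- c(rgamma(n = K, shape = shape, rate = rate), rep(0, N - K))

# Theoretical mean of the mixture: p * (shape / rate)
theoretical_mean <- p * shape / rate

# Simulated mean, which should be close to the theoretical value
simulated_mean <- mean(x_check)

# Share of observations within 5 units of the overall mean
share_near_mean <- mean(abs(x_check - theoretical_mean) < 5)

c(theoretical = theoretical_mean,
  simulated = simulated_mean,
  share_near_mean = share_near_mean)
```

The mean summarizes the whole population but describes neither the zeros nor the positive values well, so it is often more informative to report the share of zeros and the mean of the positive part separately.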
This idea is used to make inferences about \(\mu\).
-If you do not know \(\mu\) and \(\sigma\), but instead know \(\bar{x}\) and \(s\) for a given \(n\), you can build a confidence interval whose upper and lower bounds give an idea of where \(\mu\) is likely to be.
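\[ \text{Upper Bound} = \bar{x} + t_{(1-\frac{\alpha}{2},\, n-1)}\frac{s}{\sqrt{n}} \] \[ \text{Lower Bound} = \bar{x} - t_{(1-\frac{\alpha}{2},\, n-1)}\frac{s}{\sqrt{n}} \]
Here \(t_{(1-\frac{\alpha}{2},\, n-1)}\) is the \(1-\frac{\alpha}{2}\) quantile of the t distribution with \(n-1\) degrees of freedom.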
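The bounds can be computed directly in R. The sketch below is one possible way to do it: `t_interval` is a hypothetical helper written for this example, and the sample `x` is made up. The critical value comes from `qt()` with n - 1 degrees of freedom.

```r
# Hypothetical helper: t-based confidence interval for the mean
t_interval <- function(x, alpha = 0.05) {
  n <- length(x)
  x_bar <- mean(x)
  s <- sd(x)
  # Quantile of the t distribution with n - 1 degrees of freedom
  t_crit <- qt(1 - alpha / 2, df = n - 1)
  margin <- t_crit * s / sqrt(n)
  c(lower = x_bar - margin, upper = x_bar + margin)
}

# Made-up sample, for illustration only
x <- c(45.1, 70.2, 61.5, 53.2, 53.2, 48.7, 57.9, 50.3)
t_interval(x)

# Built-in cross-check: t.test() reports the same interval
t.test(x, conf.level = 0.95)$conf.int
```

Writing the formula out by hand makes each term visible; in practice `t.test()` gives the same bounds with less code.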
We will set up a three-dimensional array.
The first dimension determines the distribution type of the random variable to be simulated.
The second dimension indexes the observations within a sample (the sample size).
The third dimension indexes the samples (the number of samples).
Populate the array with data.
Create a matrix of sample means and standard deviations.
Create 95% confidence intervals around each sample mean.
Discuss whether the C.L.T. holds using Boolean vectors.
#Setting parameters
sample_size=10
num_samples=500
# Shape and rate parameters; the mean of the gamma distribution is sh/rt
sh=2
rt=1
#Setting up the array described above
demo=array(dim=c(2,sample_size,num_samples))
#The first set of distributions to simulate.
demo[1,,]=rgamma(n=sample_size*num_samples,shape=sh,rate=rt)
# Where summary statistics are going to be
summary_statistics=matrix(nrow=2,ncol=num_samples)
#Sample means
summary_statistics[1,]=colMeans(demo[1,,])
#Sample Standard deviation
summary_statistics[2,]=apply(demo[1,,],2,sd)
[1] 34
[1] 3
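The last two steps of the plan, building the intervals and checking them with Boolean vectors, might look like the following. This is a self-contained sketch under stated assumptions rather than the code that produced the output above: it regenerates its own data with an arbitrary seed, and the names `draws`, `missed_low`, `missed_high`, and `covered` are hypothetical. The counts it prints depend on the random draw.

```r
set.seed(1)  # arbitrary seed for reproducibility

sample_size <- 10
num_samples <- 500
sh <- 2
rt <- 1
true_mean <- sh / rt  # mean of the gamma distribution

# Each column is one sample of size sample_size
draws <- matrix(rgamma(n = sample_size * num_samples, shape = sh, rate = rt),
                nrow = sample_size)

x_bar <- colMeans(draws)
s <- apply(draws, 2, sd)

# 95% t-based interval around each sample mean
t_crit <- qt(0.975, df = sample_size - 1)
lower <- x_bar - t_crit * s / sqrt(sample_size)
upper <- x_bar + t_crit * s / sqrt(sample_size)

# Boolean vectors: which intervals miss the true mean, and on which side
missed_low <- upper < true_mean   # whole interval below the true mean
missed_high <- lower > true_mean  # whole interval above the true mean
covered <- !(missed_low | missed_high)

c(missed_low = sum(missed_low),
  missed_high = sum(missed_high),
  coverage = mean(covered))
```

If the normal approximation were accurate at this sample size, roughly 5% of the intervals would miss the true mean, split about evenly between the two sides. A clear imbalance between `missed_low` and `missed_high` is a sign that a sample of 10 is too small for the skewed gamma distribution.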