Discussion 5

Author

Allison Shrivastava

Part 1. The (Student's) t distribution converges to the normal distribution as the degrees of freedom increase (beyond 120). Please plot a normal distribution and a few t distributions on the same chart, with 2, 5, 15, 30, and 120 degrees of freedom.

# load ggplot2, which the plotting code below needs
library(ggplot2)

# set the range
x <- seq(-4, 4, length.out = 1000)

# curve labels, in the order they should appear in the legend
dist_labels <- c("Normal", "t (df=2)", "t (df=5)",
                 "t (df=15)", "t (df=30)", "t (df=120)")

# create a data frame with the densities laid out in the problem
# (the density column is named `density` so it is not confused with degrees of freedom)
dense <- data.frame(
  x = rep(x, 6),
  density = c(dnorm(x),
              dt(x, df = 2),
              dt(x, df = 5),
              dt(x, df = 15),
              dt(x, df = 30),
              dt(x, df = 120)),
  distribution = factor(rep(dist_labels, each = length(x)),
                        levels = dist_labels)
)

# now plot
ggplot(dense, aes(x = x, y = density, color = distribution)) +
  geom_line(linewidth = 1) +
  labs(title = "Normal vs Student's t distributions",
       x = "",
       y = "") +
  theme_minimal()
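
The convergence visible in the chart can also be checked numerically. The following is a minimal sketch, not part of the original assignment: it measures the largest vertical gap between each t density and the standard normal density on the same grid. The names `dfs` and `max_gap` are chosen here for illustration.

```r
# Same grid as the plot above
x <- seq(-4, 4, length.out = 1000)

# Degrees of freedom used in the chart
dfs <- c(2, 5, 15, 30, 120)

# Largest absolute difference between each t density and the normal density
max_gap <- sapply(dfs, function(k) max(abs(dt(x, df = k) - dnorm(x))))
names(max_gap) <- paste0("df=", dfs)

# The gap shrinks as the degrees of freedom grow
print(round(max_gap, 4))
```

If the gaps shrink steadily from df=2 to df=120, that agrees with the picture: the heavy tails and lower peak of the t distribution fade as the degrees of freedom increase.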

Part 2. Let's work with normal data below (1000 observations with a mean of 108 and an sd of 7.2).

set.seed(123) # Set seed for reproducibility

mu <- 108

sigma <- 7.2

data_values <- rnorm(n = 1000, mean = mu, sd = sigma )

Plot two charts: the normally distributed data (above) and the z-score distribution of the same data. Do they have the same distributional shape? Why or why not?

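Yes, the two plots have the same distributional shape. The z-score is a linear transformation, z = (x - mean) / sd, so it only shifts the center to 0 and rescales the spread to 1. It does not change the relative positions of the observations, so features such as symmetry and tail behavior are preserved; only the values on the x-axis change.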

# set values 
set.seed(123)
mu<-108
sigma<-7.2
data_values<-rnorm(n=1000, mean=mu, sd=sigma)

#compute the z scores
z_scores<-(data_values-mean(data_values))/sd(data_values)

# now create the data frames for the plots
df2<-data.frame(value=data_values)
dfz<-data.frame(value=z_scores)

### now plot the data
p1<-ggplot(df2, aes(x=value))+
  geom_histogram(aes(y=after_stat(density)),
                 bins=30,
                 fill="violet",
                 color="violet")+
  labs(title="distribution of data",
       x="",
       y="")+
  theme_minimal()

# now plot the z-score distribution
p2 <- ggplot(dfz, aes(x = value)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 30,
                 fill = "skyblue",
                 color = "skyblue") +
  labs(title = "distribution of z-scores",
       x = "",
       y = "") +
  theme_minimal()

# print and compare
p1

p2
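
To back up the visual comparison, here is a minimal sketch (an illustration, not part of the original assignment) that checks shape preservation numerically. It assumes the same seed and parameters as above; `skew` is a hypothetical helper defined here for the sample skewness, not a base R function.

```r
# Recreate the data and z-scores with the same seed and parameters
set.seed(123)
data_values <- rnorm(n = 1000, mean = 108, sd = 7.2)
z_scores <- (data_values - mean(data_values)) / sd(data_values)

# Hypothetical helper: sample skewness (third standardized moment)
skew <- function(v) {
  centered <- v - mean(v)
  mean(centered^3) / mean(centered^2)^1.5
}

# A positive linear rescaling keeps the correlation at 1
cor_xz <- cor(data_values, z_scores)

# Skewness is unchanged by shifting and rescaling
skew_x <- skew(data_values)
skew_z <- skew(z_scores)

print(c(correlation = cor_xz, skew_raw = skew_x, skew_z = skew_z))
print(c(mean_z = mean(z_scores), sd_z = sd(z_scores)))
```

The correlation of 1 and the matching skewness values reflect that standardizing only moves and stretches the x-axis; the z-scores end up with mean 0 and standard deviation 1.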

Part 3. In your own words, please explain what is p-value?

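The p-value is the probability, calculated under the assumption that the null hypothesis is true, of observing a result at least as extreme as the one in our data. It is not the probability that the null hypothesis is true. A small p-value means the observed data would be unlikely if the null hypothesis were true, which counts as evidence against the null. A large p-value means the data are consistent with the null hypothesis, but it does not prove that the null is true.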
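
As a concrete illustration of how a p-value is computed, here is a minimal sketch of a two-sided one-sample t-test. The sample is simulated and is an assumption made for this example (30 observations with a true mean of 110), and the null value `mu0 = 108` is borrowed from Part 2; none of this comes from the original assignment.

```r
# Simulated sample (assumed for illustration only)
set.seed(123)
sample_values <- rnorm(n = 30, mean = 110, sd = 7.2)

# Null hypothesis: the population mean equals mu0
mu0 <- 108
n <- length(sample_values)

# t statistic: how many standard errors the sample mean is from mu0
t_stat <- (mean(sample_values) - mu0) / (sd(sample_values) / sqrt(n))

# Two-sided p-value: probability, if the null is true, of a t statistic
# at least this far from 0 in either direction
p_manual <- 2 * pt(-abs(t_stat), df = n - 1)

# The built-in test should report the same p-value
p_builtin <- t.test(sample_values, mu = mu0)$p.value

print(c(t = t_stat, p_manual = p_manual, p_builtin = p_builtin))
```

The manual calculation uses the tail area of the t distribution with n - 1 degrees of freedom, which ties back to Part 1: the p-value is an area under one of those t curves.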