Chapter 5 - Inference for Numerical Data
Practice: 5.5, 5.13, 5.19, 5.31, 5.45 Graded: 5.6, 5.14, 5.20, 5.32, 5.48
n <- 25
mean <- (65+77)/2
m.err <- (77-65)/2
# margin of error = t*sd/sqrt(n)
df <- n - 1
t <- qt(.95, df) # critical value for a 90% CI: 5% in each tail
# solve the margin-of-error formula for sd
sd <- sqrt(n) * m.err / t
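As a quick check, the recovered mean and standard deviation should reproduce the original interval (65, 77):
mean + c(-1, 1) * qt(.95, df) * sd / sqrt(n) # lower and upper bounds of the 90% CI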
# a. required sample size: n = (z * sd / m.err)^2
z <- qnorm(.95, mean = 0, sd = 1) # critical z for a 90% confidence interval
sd <- 250
m.err <- 25
n.a <- ((z*sd) / m.err) ^ 2
# b. larger or smaller sample for higher confidence?
# c. required sample size for a 99% confidence interval
z <- qnorm(.995, mean = 0, sd = 1) # critical z for a 99% confidence interval
n.c <- ((z*sd) / m.err) ^ 2
a. n.a = 270.5543454, so at least 271 subjects should be enrolled.
b. Larger. A higher confidence level uses a larger critical value, so a larger sample is needed to keep the same margin of error.
c. n.c = 663.4896601, so at least 664 subjects should be enrolled.
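Rounding the required sample sizes up to whole subjects:
ceiling(n.a) # 271 subjects for 90% confidence
ceiling(n.c) # 664 subjects for 99% confidence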
No. There is no clear difference between the average reading and writing scores; the differences in scores are centered around 0.
The reading and writing scores of each student are paired, but across students we can reasonably assume the observations (and hence the score differences) are independent.
n <- 200
mean.diff <- -0.545
df <- n-1
sd.diff <- 8.887
se <- sd.diff/sqrt(n)
t <- ((mean.diff - 0) / se)
p <- 2 * pt(t, df) # two-sided p-value (t is negative, so pt gives the lower tail)
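For comparison, a 95% confidence interval for the average difference built from the same quantities; consistent with the test result, it should contain 0:
mean.diff + c(-1, 1) * qt(.975, df) * se # 95% CI for the average difference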
If there truly is a difference, this would be a Type II error: the test shows no evidence of a difference when one actually exists, possibly because the sample is not large enough to detect it.
Yes. Since we failed to reject the null hypothesis of no difference (null value 0), we would expect a confidence interval for the average difference between the reading and writing scores to include 0.
n <- 26
auto.mean <- 16.12
auto.sd <- 3.58
man.mean <- 19.85
man.sd <- 4.51
mean.diff <- auto.mean - man.mean
se <- sqrt((auto.sd^2 / n) + (man.sd^2/n))
t <- (mean.diff - 0) / se
df <- n - 1 # conservative df: smaller sample size minus 1 (both groups have n = 26)
p <- pt(t, df) * 2 # two-tailed p-value (t is negative, so pt gives the lower tail)
The p-value 0.0028836 is smaller than 0.05, so we reject H0: the data provide convincing evidence of a difference in average fuel efficiency between automatic and manual transmissions.
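As a complementary check, a 95% confidence interval for the difference in means (automatic - manual), built from the same quantities, should not contain 0:
mean.diff + c(-1, 1) * qt(.975, df) * se # 95% CI for the difference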
# dfG = k - 1, degrees of freedom between groups
# dfE = n - k, degrees of freedom within groups (error)
# SSG = sum of squares between groups
# MSG = SSG / dfG, mean square between groups; measures the variability of the group means
# SSE = sum of squared errors
# MSE = SSE / dfE, mean square error; measures the variability within the groups
# SST = total sum of squares = SSG + SSE
# F = MSG / MSE, the ratio of between-group to within-group variability
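For reference, a minimal self-contained sketch (simulated data with hypothetical names, not the exercise data) of the table layout that aov() reports for a one-way ANOVA with k = 5 groups; the code below rebuilds the same columns by hand:
# illustrative only: simulate 5 groups and let aov() build the ANOVA table
set.seed(42)
toy <- data.frame(
  group = factor(rep(paste0("g", 1:5), each = 30)), # 5 hypothetical groups
  hours = rnorm(150, mean = 40, sd = 15)            # simulated response
)
summary(aov(hours ~ group, data = toy)) # columns: Df, Sum Sq, Mean Sq, F value, Pr(>F)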
mean <- c(38.67, 39.6, 41.39, 42.55, 40.85)
sd <- c(15.81, 14.97, 18.1, 13.62, 15.51)
n <- c(121, 546, 97, 253, 155)
df <- data.frame(mean, sd, n)
n <- sum(df$n)
k <- 5
# degrees of freedom
dof <- k - 1 # between groups
dof.res <- n - k # residual (within groups)
# recover the F statistic from the reported tail probability Pr(>F)
prf <- 0.0682
f.stat <- qf(1 - prf, dof, dof.res)
# F = MSG / MSE, so MSE = MSG / F
msg <- 501.54 # mean square between groups, given in the exercise
mse <- msg / f.stat
# MSG = SSG / dfG, so SSG = dfG * MSG
ssg <- dof * msg
sse <- 267382 # sum of squared errors, given in the exercise
# SST = SSG + SSE
sst <- ssg + sse
# dof total = dof + dof residual
dft <- dof + dof.res
degree <- c(dof, ssg, msg, f.stat, prf)
residuals <- c(dof.res, sse, mse, NA, NA)
total <- c(dof+dof.res, ssg+sse, NA, NA, NA)
table <- rbind(degree, residuals, total)
colnames(table) <- c("DF", "SUM SQ", "MEAN SQ", "F value", "Pr(>F)")
library(knitr)
kable(table)
|           |   DF |    SUM SQ |  MEAN SQ |  F value | Pr(>F) |
|-----------|-----:|----------:|---------:|---------:|-------:|
| degree    |    4 |   2006.16 | 501.5400 | 2.188931 | 0.0682 |
| residuals | 1167 | 267382.00 | 229.1255 |       NA |     NA |
| total     | 1171 | 269388.16 |       NA |       NA |     NA |
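As a consistency check on the reconstruction, the tail probability implied by the recovered F statistic should match the reported value:
pf(f.stat, dof, dof.res, lower.tail = FALSE) # should return approximately 0.0682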