Answer: b. The null hypothesis is true.
A p-value is the probability of observing a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.
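As a minimal sketch of this definition, the snippet below computes a two-sided p-value under a standard normal null distribution. The z statistic of 2.1 is a hypothetical value chosen for illustration, not a number from the problem set.

```r
# Hypothetical observed z statistic (illustrative value only)
z_obs <- 2.1

# Two-sided p-value: probability, under H0, of a statistic
# at least as extreme as z_obs in either direction
p_value <- 2 * pnorm(abs(z_obs), lower.tail = FALSE)
p_value
```

Because the result is below 0.05, this hypothetical statistic would lead to rejecting the null hypothesis at the 5 percent level.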
Answer: b. Small.
We reject the null hypothesis when the p-value is small, meaning the observed result would be unlikely if the null hypothesis were true.
Answer: b. Rejecting \(H_0\) when \(H_0\) is true.
A Type I error occurs when we reject a null hypothesis that is actually true.
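A small simulation can make the Type I error rate concrete. The sketch below assumes normal data with true mean 0, a sample size of 30, and 5000 replications; all of these settings (and the seed) are illustrative choices, not part of the original question.

```r
set.seed(1)      # arbitrary seed, used only for reproducibility
alpha  <- 0.05
n_sims <- 5000   # assumed number of simulated datasets

# Generate data for which H0: mu = 0 is true and store each test's p-value
p_values <- replicate(
  n_sims,
  t.test(rnorm(30, mean = 0, sd = 1), mu = 0)$p.value
)

# Proportion of simulations in which a true H0 is rejected
type1_rate <- mean(p_values < alpha)
type1_rate
```

The observed rejection rate should be close to \(\alpha = 0.05\), since every rejection here is a Type I error.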
These conditions ensure the sampling distribution of the test statistic is approximately normal, allowing us to use the z or t distributions for statistical inference.
Answer: b. Reject \(H_0 : \beta_1 = 0\).
Because the p-value (0.02) is less than \(\alpha = 0.05\), we reject the null hypothesis and conclude that the slope is statistically significantly different from zero.
Answer: b. There is evidence of a linear relationship.
Since the 95% confidence interval (1.3, 4.1) does not contain 0, this suggests the slope is significantly different from zero and there is evidence of a linear relationship.
Using the Central Limit Theorem, the sampling distribution of the sample mean is approximately
\[ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right), \]
that is, normal with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\).
Since the population standard deviation \(\sigma\) is unknown, we estimate it with the sample standard deviation \(s\), so in practice we use
\[ \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1} \]
The assumptions are that the observations are independent and that the sample size is sufficiently large so that, by the Central Limit Theorem, the sampling distribution of the sample mean is approximately normal.
This sampling distribution is problematic for constructing p-values for \(\mu\) because the population standard deviation \(\sigma\) is unknown. Since we must estimate \(\sigma\) using the sample standard deviation \(s\), there is extra uncertainty, so we use a \(t\) distribution rather than a normal distribution.
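To illustrate the extra uncertainty, the following sketch compares the 97.5th percentile of the \(t\) distribution with that of the standard normal. The sample sizes are hypothetical values chosen for illustration.

```r
n <- c(5, 10, 30, 100)             # hypothetical sample sizes

t_crit <- qt(0.975, df = n - 1)    # critical values from t with n - 1 df
z_crit <- qnorm(0.975)             # normal critical value (about 1.96)

data.frame(n = n, t_crit = t_crit, z_crit = z_crit)
```

The \(t\) critical values are always larger than the normal one and shrink toward it as \(n\) grows, which is why intervals based on \(s\) are wider for small samples.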
## [1] 17.91768 22.26357
The 95 percent confidence interval for the population mean miles per gallon is (17.92, 22.26). This means we are 95 percent confident that the true mean fuel efficiency of the population of cars lies between about 17.9 mpg and 22.3 mpg.
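The interval can be computed by hand as \(\bar{x} \pm t^{*}_{n-1} \, s/\sqrt{n}\). The sketch below assumes the data are the `mpg` column of R's built-in `mtcars` dataset; the source does not name the dataset, so this is an assumption.

```r
# Assumption: the sample is the mpg column of the built-in mtcars data
x <- mtcars$mpg
n <- length(x)

se     <- sd(x) / sqrt(n)           # estimated standard error of the mean
t_crit <- qt(0.975, df = n - 1)     # critical value for 95% confidence

ci <- mean(x) + c(-1, 1) * t_crit * se
ci
```

Under that assumption, `t.test(x)$conf.int` should return the same two endpoints.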
## [1] 0.08506004
## [1] 0.9327606
The test statistic is \(t = 0.085\) and the p-value is 0.933. Since the p-value is much larger than 0.05, we fail to reject the null hypothesis that the population mean mpg is 20. There is not sufficient statistical evidence to conclude that the true mean fuel efficiency differs from 20 mpg.
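The test statistic and p-value follow from \(t = (\bar{x} - \mu_0) / (s/\sqrt{n})\). As a sketch, again assuming the data are `mtcars$mpg` (an assumption, since the dataset is not named in the text):

```r
# Assumption: the sample is the mpg column of the built-in mtcars data
x   <- mtcars$mpg
mu0 <- 20                            # hypothesized population mean

t_stat  <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = length(x) - 1, lower.tail = FALSE)

c(t = t_stat, p = p_value)
```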
Yes. The confidence interval (17.92, 22.26) contains the claimed value of 20 mpg. This agrees with the result from the hypothesis test in 2.1e, where we failed to reject the null hypothesis that the population mean mpg is 20.
## [1] 0.05
The upper endpoint of the confidence interval is 22.26. Testing \(H_0 : \mu = 22.26\) gives a two-sided p-value of 0.05, which is exactly the significance level associated with a 95 percent confidence interval.
Confidence intervals and hypothesis tests are dual to each other. A 95 percent confidence interval contains exactly the values of \(\mu_0\) that would not be rejected by a two-sided hypothesis test at the 5 percent significance level. In this problem, 20 lies inside the interval, so we failed to reject \(H_0 : \mu = 20\). The boundary of the interval has p-value 0.05, illustrating the connection between these two methods.
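The duality can be checked numerically. This sketch assumes the data are `mtcars$mpg` (an assumption) and tests the upper endpoint of the interval, plus two hypothetical values of \(\mu_0\), one inside and one outside the interval.

```r
# Assumption: the sample is the mpg column of the built-in mtcars data
x <- mtcars$mpg

upper <- t.test(x, conf.level = 0.95)$conf.int[2]

# Testing the endpoint itself gives a p-value equal to 0.05
p_boundary <- t.test(x, mu = upper)$p.value

# Values inside the interval are not rejected; values outside are
p_inside  <- t.test(x, mu = 20)$p.value   # 20 lies inside the interval
p_outside <- t.test(x, mu = 25)$p.value   # 25 is a hypothetical value outside

c(boundary = p_boundary, inside = p_inside, outside = p_outside)
```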
The ncbirths dataset contains information about births in North Carolina. Each observation represents a single birth. The dataset includes variables describing the parents and pregnancy, such as the father’s age (fage), the mother’s age (mage), whether the mother is considered mature (mature), the length of the pregnancy in weeks (weeks), and whether the baby was premature (premie). It also includes health and demographic variables such as the number of prenatal visits (visits), marital status (marital), weight gained during pregnancy (gained), the baby’s birth weight (weight), whether the baby had low birth weight (lowbirthweight), the baby’s gender (gender), whether the mother smoked during pregnancy (habit), and whether the mother is white (whitemom).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
The scatterplot shows a positive relationship between the length of pregnancy and birth weight. Babies born after longer pregnancies tend to have higher birth weights, while babies born earlier tend to weigh less. Most births occur between about 36 and 40 weeks, with birth weights generally between about 6 and 9 pounds. Although there is some variation, the overall pattern shows that birth weight increases as the length of pregnancy increases.
##
## Call:
## lm(formula = weight ~ weeks, data = ncbirths)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.5775 -0.7048 -0.0235  0.7022  4.4165
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.09529    0.46464  -13.12   <2e-16 ***
## weeks        0.34433    0.01209   28.49   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.119 on 996 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.449, Adjusted R-squared: 0.4485
## F-statistic: 811.7 on 1 and 996 DF, p-value: < 2.2e-16
The fitted regression model is
\[ \widehat{\text{weight}} = -6.095 + 0.344 \times \text{weeks} \]
The slope of 0.344 means that for each additional week of pregnancy, the predicted birth weight increases by about 0.344 pounds on average. This suggests that longer pregnancies tend to result in heavier babies.
The intercept of -6.095 represents the predicted birth weight when the pregnancy length is 0 weeks. This value does not have a meaningful interpretation in this context because a pregnancy length of 0 weeks is not realistic.
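As a quick illustration of how the fitted equation is used, the sketch below plugs the coefficients from the regression output into a small helper. The function name `predict_weight` and the chosen week values are hypothetical, introduced only for this example.

```r
# Coefficients copied from the regression output
b0 <- -6.09529   # intercept
b1 <- 0.34433    # slope for weeks

# Hypothetical helper: predicted birth weight (pounds) for a pregnancy length
predict_weight <- function(weeks) b0 + b1 * weeks

# Predictions within the range where most births occur
predict_weight(c(36, 40))

# One additional week changes the prediction by exactly the slope
predict_weight(40) - predict_weight(39)
```

Predictions are only sensible for pregnancy lengths close to those observed, which is why the intercept at 0 weeks has no practical meaning.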
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
The residuals vs fitted plot does not show a clear funnel shape, suggesting that there is no strong evidence of heteroskedasticity. The spread of the residuals appears relatively constant across the fitted values, although there is some natural variation.
The Q-Q plot shows that most of the residuals lie close to the reference line, indicating that the residuals are approximately normally distributed. There are small deviations at the extremes, but overall the normality assumption appears reasonable.
We test the hypotheses
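\(H_0 : \beta_1 = 0\)
\(H_A : \beta_1 > 0\)
From the regression output, the two-sided p-value for \(\beta_1\) (weeks) is less than 2e-16. Because the alternative is one-sided and the t statistic is positive (28.49), the one-sided p-value is half the two-sided value, so it is also far below 0.05. We reject the null hypothesis: there is strong statistical evidence that longer pregnancies are associated with heavier babies.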
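The p-value for the one-sided alternative can be recovered from the reported estimate and standard error. The sketch below uses only numbers printed in the regression output; the small difference from the reported t value of 28.49 comes from rounding in the printed coefficients.

```r
# Values copied from the regression output for the weeks coefficient
estimate  <- 0.34433
std_error <- 0.01209
df        <- 996                     # residual degrees of freedom

t_stat <- estimate / std_error

# Two-sided p-value, as reported by summary(lm(...))
p_two_sided <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)
# One-sided p-value for HA: beta1 > 0 (upper tail only)
p_one_sided <- pt(t_stat, df, lower.tail = FALSE)

c(t = t_stat, two_sided = p_two_sided, one_sided = p_one_sided)
```

Both p-values are far below 0.05, so the conclusion is the same for either alternative.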
##
## Call:
## lm(formula = weight ~ weeks + habit, data = ncbirths)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.3788 -0.6893 -0.0344  0.7157  4.3708
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.07220    0.46228 -13.135  < 2e-16 ***
## weeks        0.34491    0.01202  28.685  < 2e-16 ***
## habitsmoker -0.35882    0.10607  -3.383 0.000746 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.113 on 995 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.4553, Adjusted R-squared: 0.4542
## F-statistic: 415.8 on 2 and 995 DF, p-value: < 2.2e-16
To improve the model, we add the mother’s smoking status (habit) as an additional predictor because smoking during pregnancy can affect birth weight.
In the new model, the coefficient for weeks is still positive (0.3449) and highly statistically significant with a p-value less than 2e-16. This indicates that even after controlling for smoking behavior, longer pregnancies are still associated with heavier babies.
The coefficient for habitsmoker is -0.3588 and is statistically significant (p = 0.000746), suggesting that babies born to mothers who smoke tend to weigh about 0.36 pounds less on average than babies born to non-smoking mothers.
Therefore, even after adding smoking status to the model, there is still strong statistical evidence that pregnancy length is positively associated with birth weight.
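To see what the two coefficients mean together, the sketch below compares predicted weights for a smoker and a non-smoker at the same pregnancy length. The coefficients are copied from the output; the helper `predict_weight` and the choice of 39 weeks are hypothetical and used only for illustration.

```r
# Coefficients copied from the two-predictor regression output
coefs <- c(intercept = -6.07220, weeks = 0.34491, habitsmoker = -0.35882)

# Hypothetical helper: predicted birth weight (pounds)
predict_weight <- function(weeks, smoker) {
  coefs[["intercept"]] +
    coefs[["weeks"]] * weeks +
    coefs[["habitsmoker"]] * as.numeric(smoker)  # 1 for smokers, 0 otherwise
}

nonsmoker <- predict_weight(39, smoker = FALSE)
smoker    <- predict_weight(39, smoker = TRUE)

c(nonsmoker = nonsmoker, smoker = smoker, difference = smoker - nonsmoker)
```

The difference equals the `habitsmoker` coefficient regardless of the week chosen, because the model has no interaction term.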
In part 2.2g, I selected the variable habit (whether the mother smoked during pregnancy) because smoking is known to affect birth weight and is a relevant factor to include in the model. Adding this variable helps control for another factor that may influence the baby’s weight.
However, the model was not chosen using a systematic model selection procedure, and there may be other variables in the dataset that also affect birth weight. Because the model was selected based on a small number of variables rather than a formal model selection method, the inference should be interpreted with some caution. Nevertheless, the strong statistical significance of the weeks variable suggests that pregnancy length is still an important predictor of birth weight.
##            x          y
## 1 -1.1317843 -4.9824226
## 2 -0.4404935  5.5259577
## 3 -0.3364002  1.0693278
## 4 -0.8470490  8.3809235
## 5 -0.1460274  3.5940863
## 6 -1.4305493  0.5417081
rnorm(n, mean = 0, sd = 4) generates random error terms from a normal distribution with mean 0 and standard deviation 4. The command set.seed(202) ensures that the same random numbers are generated each time the code is run, making the simulation reproducible.
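A simulation of this kind might look like the sketch below. The seed, the error standard deviation, and the true line \(y = 2 + 3x\) come from the text; the sample size of 50 and the standard normal predictor are assumptions, so the numbers produced will not match the output shown above.

```r
set.seed(202)                        # same seed => same random draws

n   <- 50                            # assumed sample size (not stated)
x   <- rnorm(n)                      # assumed distribution for the predictor
eps <- rnorm(n, mean = 0, sd = 4)    # random errors with mean 0 and sd 4

y   <- 2 + 3 * x + eps               # true line plus noise
sim <- data.frame(x = x, y = y)
head(sim)
```

Rerunning the block from `set.seed(202)` reproduces exactly the same `sim`, which is what makes the simulation reproducible.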
## (Intercept)           x
##    2.084993    2.865824
The true line (red) and the least-squares line (blue) are very close but not exactly the same. The least-squares line is estimated from the simulated data, while the true line represents the actual model used to generate the data. Because the data include random error, the estimated line will differ slightly from the true line. However, since the sample size is fairly large and the model is correct, the least-squares line is close to the true line.
## sse_true  sse_fit
## 712.5971 711.4451
The SSE for the least-squares line (711.45) is slightly smaller than the SSE for the true line (712.60). This happens because the least-squares regression line is specifically chosen to minimize the sum of squared errors for the observed data. Even though the true line generated the data, the least-squares line will always produce the smallest SSE for that particular sample.
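The comparison can be sketched as follows. The true line and error standard deviation follow the text, while the sample size and predictor distribution are assumptions, so the SSE values will differ from those reported above.

```r
set.seed(202)
n <- 50                               # assumed sample size
x <- rnorm(n)                         # assumed predictor distribution
y <- 2 + 3 * x + rnorm(n, mean = 0, sd = 4)

fit <- lm(y ~ x)

# SSE around the true line y = 2 + 3x
sse_true <- sum((y - (2 + 3 * x))^2)
# SSE around the fitted least-squares line
sse_fit  <- sum(residuals(fit)^2)

c(sse_true = sse_true, sse_fit = sse_fit)
```

Whatever the seed or sample size, `sse_fit` cannot exceed `sse_true`, because the least-squares coefficients minimize the SSE over all possible lines, including the true one.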
## (Intercept)           x
##    2.242752    2.842498
The least-squares line changes because the new dataset contains different random errors. In this simulation, the estimated intercept is 2.2428 and the estimated slope is 2.8425, which are slightly different from the true values of 2 and 3. This happens because the regression line is estimated from the sample data, which includes random variation.
However, the true line does not change. The true model used to generate the data is still \(y = 2 + 3x + \epsilon\), so the underlying relationship between \(x\) and \(y\) remains the same even though the estimated regression line changes.
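Repeating the simulation many times shows how the estimated slope varies around the fixed true slope. In this sketch the number of replications, the sample size, the predictor distribution, and the seed are all assumptions chosen for illustration.

```r
set.seed(1)                           # arbitrary seed for reproducibility

# Simulate 1000 datasets from y = 2 + 3x + error and refit each time
slopes <- replicate(1000, {
  x <- rnorm(50)                      # assumed sample size and predictor
  y <- 2 + 3 * x + rnorm(50, mean = 0, sd = 4)
  coef(lm(y ~ x))[["x"]]              # estimated slope for this dataset
})

c(mean = mean(slopes), sd = sd(slopes))
```

The individual estimates differ from sample to sample, but their average stays close to the true slope of 3.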
## `geom_smooth()` using formula = 'y ~ x'
With the smaller error standard deviation (σ = 0.5), the points lie much closer to the regression line than in the earlier plot. This means there is less random noise in the data. Because the errors are smaller, the simulated points follow the true linear relationship more closely, and the least-squares line is almost identical to the true line.
## `geom_smooth()` using formula = 'y ~ x'
This scatterplot shows more extreme observations compared to the earlier simulation with normally distributed errors. The t-distribution with 3 degrees of freedom has heavier tails than the normal distribution, meaning it produces more large positive or negative errors. As a result, there are more outliers and the points are spread farther from the regression line.
The residuals do not appear to be normally distributed. The histogram shows heavy tails and several extreme values rather than a symmetric bell-shaped distribution. The Q-Q plot also shows strong deviations from the reference line, especially in the tails where the points move far away from the line.
If the residuals were normally distributed, the points in the Q-Q plot would lie close to the straight line. Instead, the large deviations in both tails reflect the heavy-tailed \(t\)-distributed errors.
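One way to quantify "heavy tails" is to compare upper quantiles of the standard normal with those of \(t\) distributions. The sketch below uses 3 degrees of freedom and 1 degree of freedom (the latter is the Cauchy distribution); the probability levels are arbitrary illustrative choices.

```r
probs <- c(0.975, 0.995, 0.9995)      # illustrative upper-tail probabilities

tails <- data.frame(
  prob   = probs,
  normal = qnorm(probs),              # standard normal quantiles
  t3     = qt(probs, df = 3),         # t with 3 degrees of freedom
  cauchy = qt(probs, df = 1)          # t with 1 df, i.e. Cauchy
)
tails
```

The further into the tail, the larger the gap between the normal and \(t\) quantiles, which is the pattern a Q-Q plot of heavy-tailed residuals displays at its extremes.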