Problem 1

Last homework we examined whether the proportion of children who consume enough calcium differs between age groups. The data look like this:

| Met requirement | 5 to 10 years | 11 to 13 years |
|-----------------|---------------|----------------|
| No              | 194           | 557            |
| Yes             | 861           | 417            |

  1. Use a two-sample proportion test to examine whether the proportion of children who met the calcium requirement is the same for children 5 to 10 years old as for children 11 to 13 years old. Report the test statistic and p-value.

Our pooled proportion is 0.63 (1278/2029). The difference in sample proportions is \(\hat{p}_1-\hat{p}_2 = 0.816 - 0.428 = 0.388\), and our standard error is 0.0215.

Therefore our test statistic is \(z = 18.04\). With a test statistic this large, our p-value is nearly 0, so we have sufficient evidence to reject the null hypothesis that the two proportions are the same and conclude that the proportion of children meeting the calcium requirement differs between the two age groups.
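
As a check, the same test can be run in R. This is a minimal sketch assuming the counts are typed in by hand from the table above; prop.test reports the test in its chi-square form, whose value is the square of the z statistic.

# two-sample proportion test; "successes" are children meeting the requirement
# counts entered manually from the table above
prop.test(x = c(861, 417), n = c(1055, 974), correct = FALSE)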

  1. Use a chi-square test to examine whether the distribution of calcium intake for children 5 to 10 years old is the same as the distribution for children 11 to 13 years old. Report your test statistic with degrees of freedom and the p-value. State your conclusion at \(\alpha=0.05\). (You may use the function “chisq.test” or “pchisq” in R to get the exact p-value.)

After finding the expected counts and doing the calculations, I found the chi-square statistic to be 98.87 + 107.094 + 58.1 + 62.93 = 326.994. With \((2-1)(2-1) = 1\) degree of freedom, this gives a p-value of nearly 0 (too small to look up in Table F). At the significance level of 0.05, we reject the null hypothesis and conclude that the distributions are in fact different.
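
The same statistic can be obtained in R. A minimal sketch, assuming the table is entered as a matrix (the object name and labels are my own); correct = FALSE turns off the continuity correction so the result matches the hand calculation.

# enter the 2 x 2 table and run the chi-square test
calcium <- matrix(c(194, 557, 861, 417), nrow = 2, byrow = TRUE,
                  dimnames = list(Met = c("No", "Yes"),
                                  Age = c("5 to 10", "11 to 13")))
chisq.test(calcium, correct = FALSE)
# exact p-value for the hand-computed statistic with df = 1
pchisq(326.994, df = 1, lower.tail = FALSE)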

  1. Based on results from (a) and (b), explain the relationship between the chi-sq test and the \(z\) test for comparing two proportions.

Both tests gave p-values of essentially 0, and both rejected the null hypothesis. In fact, for a \(2\times 2\) table the two tests are equivalent: the chi-square statistic is the square of the \(z\) statistic, so the two p-values are identical. Here \(18.04^2 \approx 325\) agrees with the 326.994 from part (b) up to rounding error in the hand calculations (a more precise check is sketched below). The chi-square test works by summing, over all cells, how far the observed counts are from their expected counts, so a large chi-square value with a small number of degrees of freedom leads to rejecting the null hypothesis.
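
A quick numerical check of this relationship, recomputing \(z\) from the raw counts (a sketch; the variable names are my own):

p1 <- 861/1055                 # proportion meeting the requirement, ages 5 to 10
p2 <- 417/974                  # proportion meeting the requirement, ages 11 to 13
p  <- 1278/2029                # pooled proportion
z  <- (p1 - p2) / sqrt(p * (1 - p) * (1/1055 + 1/974))
z^2                            # equals the chi-square statistic (about 327) from part (b)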

Problem 2

Most errors in billing insurance providers for health services involve honest mistakes by patients, physicians, or others involved in the health care system. However, fraud is a serious problem. When fraud is suspected, an audit of randomly selected billings is often conducted. The selected claims are then reviewed by experts, and each claim is classified as allowed or not allowed. There are a large number of small claims and a small number of large claims. Here are data from an audit that used three strata based on the sizes of the claims (small, medium, and large):

| Strata | Sampled claims | Number not allowed |
|--------|----------------|--------------------|
| Small  | 399            | 42                 |
| Medium | 119            | 35                 |
| Large  | 35             | 7                  |

  1. Construct the \(3\times 2\) table of counts of these data that includes the marginal totals.

| Strata | Claims allowed | Claims not allowed | Sampled total |
|--------|----------------|--------------------|---------------|
| Small  | 357            | 42                 | 399           |
| Medium | 84             | 35                 | 119           |
| Large  | 28             | 7                  | 35            |
| Total  | 469            | 84                 | 553           |

  1. Find the percent of claims that were not allowed in each of the three strata.

| Strata | Percent of claims not allowed |
|--------|-------------------------------|
| Small  | 10.53%                        |
| Medium | 29.4%                         |
| Large  | 20%                           |
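
A sketch of how the table with margins in part (a) and the row percentages in part (b) could be produced in R (the matrix name and labels are my own):

claims <- matrix(c(357, 42, 84, 35, 28, 7), nrow = 3, byrow = TRUE,
                 dimnames = list(Strata = c("Small", "Medium", "Large"),
                                 Claim = c("Allowed", "Not allowed")))
addmargins(claims)                    # adds the row and column totals
prop.table(claims, margin = 1) * 100  # row percentages; second column = percent not allowed
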
  1. State an appropriate null hypothesis to be tested for these data. Perform a significance test for your hypotheses and report your test statistic with degrees of freedom and the P-value. State your conclusion at \(\alpha=0.05\).

H0: the proportion of claims that are not allowed is the same for small, medium, and large claims (equivalently, claim size and whether a claim is allowed are independent).

After calculating the expected counts, I calculated the chi-square statistic to be: 1.023 + 5.713 + 2.838 + 15.845 + 0.096 + 0.533 = 26.05

The degrees of freedom is (3-1)(2-1)= 2

With a chi-square test statistic of 26.05 and df = 2, the p-value (about \(2\times 10^{-6}\)) is far below our significance level of 0.05, so we reject the null hypothesis and conclude that the proportion of claims not allowed differs among the three strata.
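
As a check, the same test can be run in R on this table; a minimal sketch, reusing the claims matrix from the sketch above:

chisq.test(claims)                         # 3 x 2 table: chi-square statistic with df = 2
pchisq(26.05, df = 2, lower.tail = FALSE)  # p-value for the hand-computed statistic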

Problem 3

For each of the following statements, explain why it is wrong and correct the statement:

  1. The slope describes the changes in \(x\) for a change in \(y\).

It is the other way around: the slope describes the change in \(y\) for a change in \(x\).

  1. The population regression line is \(y=b_0+b_1x\).

The equation uses the estimated coefficients and omits the mean response. The population regression line is \(\mu_y=\beta_0+\beta_1x\), with population parameters \(\beta_0\) and \(\beta_1\).

  1. A 95% confidence interval for the mean response has the same width regardless of \(x\).

The width is not the same for every \(x\): the interval is narrowest at \(\bar{x}\) and grows wider as \(x\) moves away from \(\bar{x}\).

  1. The parameters of the simple linear regression model are \(b_0,b_1\) and \(s\).

The parameters of the simple linear regression model are \(\beta_0\), \(\beta_1\), and \(\sigma\); the quantities \(b_0\), \(b_1\), and \(s\) are their estimates from the data.

  1. To test \(H_0:b_1=0\), we use a \(t\)-test.

The null hypothesis should be stated in terms of the parameter, \(H_0:\beta_1=0\), not the estimate \(b_1\). We then use the \(t\) statistic \(t=(b_1-0)/SE_{b_1}\).

  1. For a particular value of the explanatory variable \(x\), the confidence interval for the mean response will be wider than the prediction interval for a future observation.

It is the reverse: the prediction interval for a future observation is wider than the confidence interval for the mean response, because it must account for the variation of an individual observation around the mean in addition to the uncertainty in estimating the mean itself.

Problem 4

Sales price vs. assessed value. Real estate is typically reassessed annually for property tax purposes. This assessed value, however, is not necessarily the same as the fair market value of the property. The following data summarizes an SRS of 29 homes recently sold in a Midwestern city. Both variables are measured in thousands of dollars.

##           [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8] [,9] [,10] [,11]
## Sales    179.9 240.0 113.5 281.5 186.0 275.0 281.5 210.0  210 184.0 239.0
## Assessed 188.7 220.4 118.1 232.4 188.1 240.1 232.4 211.8  168 180.3 209.2
##          [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
## Sales    185.0 251.0 180.0 160.0 255.0 220.0 160.0 200.0 265.0 190.0 150.5
## Assessed 162.3 236.8 123.7 191.7 245.6 219.3 181.6 177.4 307.2 229.7 168.9
##          [,23] [,24] [,25] [,26] [,27] [,28] [,29]
## Sales    189.0 157.0 171.5 157.0   175 159.0 229.0
## Assessed 194.4 143.9 201.4 143.9   181 125.1 195.3
  1. Make a scatterplot with assessed value on the horizontal axis. Briefly describe the relationship between assessed value and sales price.
plot(data$Assessed, data$Sales, xlab = "Assessed Value", ylab = "Sales")

There is a positive, moderately strong, roughly linear association between assessed value and sales price: in general, as the assessed value increases, so does the sales price.

  1. Find the least-squares regression line for predicting sales price from assessed value, and report the estimated parameters (\(b_0,b_1,s\)) and other summary statistics. Add the fitted line to your scatterplot.
plot(data$Assessed, data$Sales, xlab = "Assessed Value", ylab = "Sales")
abline(lm(data$Sales ~ data$Assessed))

Linearmodel = lm(data$Sales ~ data$Assessed)
summary(Linearmodel)
## 
## Call:
## lm(formula = data$Sales ~ data$Assessed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.394 -17.691  -2.562  15.500  46.814 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    37.4102    23.9272   1.564     0.13    
## data$Assessed   0.8489     0.1208   7.027 1.49e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.8 on 27 degrees of freedom
## Multiple R-squared:  0.6465, Adjusted R-squared:  0.6334 
## F-statistic: 49.38 on 1 and 27 DF,  p-value: 1.486e-07

From this output, the estimated parameters are \(b_0 = 37.41\), \(b_1 = 0.849\), and \(s = 26.8\) (the residual standard error on 27 degrees of freedom), with \(R^2 = 0.6465\).
  1. Obtain the residuals and plot them vs. assessed value. Do the assumptions for linear regression analysis appear reasonable here? Explain your answer.
resid = residuals(Linearmodel)  # residuals from the fitted model
plot(data$Assessed, resid)      # residuals vs. assessed value
abline(h = 0)                   # reference line at zero

The scatterplot shows a roughly linear relationship, and in the residual plot the residuals are scattered fairly evenly around zero with roughly constant spread and no obvious pattern. Because the homes are an SRS, the observations can be treated as independent. Taking all of this into account, the assumptions for linear regression analysis appear reasonable.

  1. Add the line \(y=x\) to your scatterplot and compare it with the fitted line. Describe what you observe from the two lines. Is the sales price typically larger or smaller than the assessed value? Explain your answer.
plot(data$Assessed, data$Sales, xlab = "Assessed", ylab = "Sales")
abline(lm(data$Sales ~ data$Assessed))  # fitted least-squares line
abline(a = 0, b = 1)                    # reference line y = x

Comparing the two lines on the scatterplot, the fitted line lies above the line \(y=x\) for assessed values up to about 250 and below it afterwards. Since most homes in the sample are assessed below 250, the sales price is typically larger than the assessed value; only for the most highly assessed homes does the sales price tend to fall below the assessed value.

Problem 5

Refer to the previous problem and base your answers on your fitted line.

  1. Calculate the predicted sales prices for homes currently assessed at 155K, 220K, and 285K dollars.

Using the fitted model \(\hat{y} = 37.4102 + 0.8489x\), the predicted sales prices are 168.9897 for a home assessed at 155, 224.1682 for 220, and 279.3467 for 285 (all values in thousands of dollars).
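
These predictions can also be obtained with predict(). A minimal sketch, refitting the model with a data-frame formula so that new x values can be supplied (this assumes data is a data frame with columns Sales and Assessed, as used above):

fit <- lm(Sales ~ Assessed, data = data)  # same model as before, refit for use with predict()
predict(fit, newdata = data.frame(Assessed = c(155, 220, 285)))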

  1. Construct a 95% confidence interval for the slope. Explain what this model tells you in terms of the relationship between assessed value and sales price.

The interval is \(b_1 \pm t^* \times SE_{b_1}\). From the regression summary we have df = 27, \(SE_{b_1} = 0.1208\), and \(b_1 = 0.8489\), and for a 95% interval \(t^* = 2.052\).

Thus, the interval is (0.601, 1.097). This tells us that for each additional thousand dollars of assessed value, the mean sales price is estimated to increase by 0.8489 thousand dollars, and we are 95% confident that the true increase is between about 0.60 and 1.10 thousand dollars.
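
The same interval can be read directly from confint(), using the model object from Problem 4:

qt(0.975, df = 27)                  # t* for a 95% interval with 27 df (about 2.052)
confint(Linearmodel, level = 0.95)  # 95% confidence intervals for the intercept and slope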

  1. Based on your results, what is the sum of squared residuals for your model? What is the fraction of variation in sales price that is explained by the regression line?

The sum of squared residuals is \(SSE = s^2 \times df = 26.8^2 \times 27 \approx 19{,}400\). The fraction of variation in sales price explained by the regression line is \(R^2 = 0.6465\), so about 64.65% of the variation in sales price is explained by its linear relationship with assessed value.
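
Both quantities can be read off the fitted model in R; a minimal sketch using the Linearmodel object from Problem 4:

sum(residuals(Linearmodel)^2)   # sum of squared residuals (SSE)
summary(Linearmodel)$r.squared  # fraction of variation explained, R^2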

  1. Consider the line \(y=x\). Compare the sum of squared residuals from this model to the model in part (c).

For the line \(y=x\), the residuals are the differences between sales price and assessed value, so the sum of squared residuals is \(\sum_i (y_i - x_i)^2\). This sum is larger than the sum of squared residuals from the fitted model in part (c), because the least-squares line is, by construction, the line that minimizes the sum of squared residuals; any other line, including \(y=x\), must do at least as badly.
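
A sketch of the comparison in R, using the data and model objects from Problem 4 (for the line y = x the residuals are simply Sales minus Assessed):

sse_fit <- sum(residuals(Linearmodel)^2)        # SSE for the least-squares line
sse_yx  <- sum((data$Sales - data$Assessed)^2)  # SSE for the line y = x
c(least_squares = sse_fit, y_equals_x = sse_yx) # the y = x line has the larger SSE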