Last homework we examined whether the proportion of children who consume enough calcium differs between age groups. The data look like this:
| Met requirement | 5 to 10 years | 11 to 13 years |
|---|---|---|
| No | 194 | 557 |
| Yes | 861 | 417 |
Our pooled proportion is .63 (1278/2029). The difference in sample proportions is \(\hat{p}_1 - \hat{p}_2\) = .816 - .428 = .388, and our standard error is .0215.
Therefore our test statistic is z = 18.04. With a test statistic this large, our p-value is nearly 0, so we have sufficient evidence to reject the null hypothesis that the two proportions are the same. We conclude that the proportion of children who meet the required calcium intake differs between the two age groups.
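The arithmetic above can be checked with a short R sketch of the pooled two-proportion z test. The variable names and the group labels `young` and `old` are illustrative, not part of the original assignment. Carrying full precision gives a z of roughly 18.1 rather than the hand-rounded 18.04; the conclusion is the same.

```r
# Counts from the calcium table: children meeting the requirement, by age group
yes <- c(young = 861, old = 417)
n   <- c(young = 861 + 194, old = 417 + 557)

p_hat    <- yes / n                 # sample proportions for each age group
p_pooled <- sum(yes) / sum(n)       # pooled proportion under H0: p1 = p2
se       <- sqrt(p_pooled * (1 - p_pooled) * sum(1 / n))  # pooled standard error

z       <- unname((p_hat["young"] - p_hat["old"]) / se)
p_value <- 2 * pnorm(-abs(z))       # two-sided p-value

round(c(p_pooled = p_pooled, se = se, z = z), 4)
p_value
```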
After finding the expected counts and doing the calculations, I found the chi-sq statistic to be 98.87 + 107.094 + 58.1 + 62.93 = 326.994. With (2-1)(2-1) = 1 degree of freedom, this leads to a p-value of nearly 0 (the statistic is far larger than any df = 1 critical value listed in Table F). At the significance level of .05, we reject the null hypothesis and conclude that the distributions are in fact different.
Both tests ended up with p-values of nearly 0, so both rejected the null hypothesis. I think that both are effective for doing significance tests on two different populations to see if they have the same distribution, and for a 2×2 table they are equivalent: the chi-sq statistic is the square of the z statistic (18.04² ≈ 325.4, which matches 326.994 up to rounding). The chi-sq test works by calculating a value that represents how far the observed counts are from their expected counts. A large chi-sq value with few degrees of freedom is therefore strong evidence against the null hypothesis.
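As a rough check on the comparison, the sketch below runs both tests in R on the same 2×2 table. `chisq.test()` and `prop.test()` are base R functions; `correct = FALSE` turns off the Yates continuity correction so the result lines up with the hand calculation. The object names are illustrative.

```r
# 2x2 table of counts: rows = met requirement, columns = age group
calcium <- matrix(c(194, 861, 557, 417), nrow = 2,
                  dimnames = list(Met = c("No", "Yes"),
                                  Age = c("5 to 10", "11 to 13")))

# Chi-square test without the continuity correction
chi <- chisq.test(calcium, correct = FALSE)
chi$expected                          # expected counts under H0
chi_stat <- unname(chi$statistic)

# Two-proportion test on the "Yes" counts; its statistic is reported as X-squared
prop <- prop.test(x = c(861, 417), n = c(1055, 974), correct = FALSE)
prop_stat <- unname(prop$statistic)

# For a 2x2 table the statistics coincide, and sqrt(X-squared) is |z|
c(chi_square = chi_stat, prop_test = prop_stat, z = sqrt(chi_stat))
```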
Most errors in billing insurance providers for health services involve honest mistakes by patients, physicians, or others involved in the health care system. However, fraud is a serious problem. When fraud is suspected, an audit of randomly selected billings is often conducted. The selected claims are then reviewed by experts, and each claim is classified as allowed or not allowed. There are a large number of small claims and a small number of large claims. Here are data from an audit that used three strata based on the sizes of the claims (small, medium, and large):
| Strata | Sampled Claims | Number not allowed |
|---|---|---|
| Small | 399 | 42 |
| Medium | 119 | 35 |
| Large | 35 | 7 |
| Strata | Claims allowed | Claims not allowed | Sampled total |
|---|---|---|---|
| Small | 357 | 42 | 399 |
| Medium | 84 | 35 | 119 |
| Large | 28 | 7 | 35 |
| Total | 469 | 84 | 553 |
| Strata | Percent of Claims not allowed |
|---|---|
| Small | 10.53% |
| Medium | 29.41% |
| Large | 20.00% |
H0: The proportion of claims not allowed is the same for small, medium, and large claims.
After calculating the expected counts, I calculated the chi-sq statistic to be: 1.023 + 5.713 + 2.838 + 15.845 + .095 + .533 = 26.047
The degrees of freedom are (3-1)(2-1) = 2
With a chi-sq test statistic of 26.047 and df = 2, the p-value is far less than our alpha level of .05. Therefore we reject the null hypothesis and conclude that the proportion of claims not allowed differs across the strata.
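The expected counts and the six cell contributions can be recomputed in R as a check on the hand arithmetic. This is a minimal sketch using base R's `chisq.test()`; the object names are illustrative. No continuity correction is involved, since R applies it only to 2×2 tables.

```r
# 3x2 table of audited claims: rows = stratum, columns = outcome
claims <- matrix(c(357, 84, 28,    # allowed
                   42, 35, 7),     # not allowed
                 nrow = 3,
                 dimnames = list(Stratum = c("Small", "Medium", "Large"),
                                 Outcome = c("Allowed", "Not allowed")))

chi_claims <- chisq.test(claims)

chi_claims$expected                       # expected counts under H0
contrib <- (claims - chi_claims$expected)^2 / chi_claims$expected
round(contrib, 3)                         # contribution of each cell to the statistic

claims_stat <- unname(chi_claims$statistic)
claims_df   <- unname(chi_claims$parameter)
c(statistic = claims_stat, df = claims_df, p_value = chi_claims$p.value)
```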
For each of the following statements, explain why it is wrong and correct the statement:
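It is the other way around: the slope describes the change in y for a change in x.
It is missing the mu symbol in front of the y. The population regression line is \(\mu_y = \beta_0 + \beta_1 x\).
The width changes with x; it does not stay the same. The interval is narrowest at the mean of x and widens as x moves away from it.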
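The parameters of the simple linear regression model are \(\beta_0\), \(\beta_1\), and \(\sigma\); \(b_0\) and \(b_1\) are estimates of the first two computed from the sample.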
You want to use a t statistic: \(t = (b_1 - \text{hypothesized value}) / SE_{b_1}\).
No, it is the prediction interval for a future observation that is wider than the confidence interval for the mean response.
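The two statements about interval widths can be illustrated with a small simulation. The data here are entirely hypothetical (simulated with a fixed seed) and are not from the assignment. The sketch compares the widths of confidence and prediction intervals at three x values using base R's `predict()`.

```r
# Hypothetical simulated data, used only to illustrate interval widths
set.seed(1)
x <- runif(40, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(40, sd = 1)
fit <- lm(y ~ x)

# Evaluate intervals at the smallest x, the mean of x, and the largest x
new_x <- data.frame(x = c(min(x), mean(x), max(x)))
conf_int <- predict(fit, new_x, interval = "confidence")  # for the mean response
pred_int <- predict(fit, new_x, interval = "prediction")  # for a future observation

ci_width <- unname(conf_int[, "upr"] - conf_int[, "lwr"])
pi_width <- unname(pred_int[, "upr"] - pred_int[, "lwr"])

# Confidence interval is narrowest at mean(x); prediction interval is always wider
data.frame(x = new_x$x, ci_width = ci_width, pi_width = pi_width)
```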
Sales price vs. assessed value. Real estate is typically reassessed annually for property tax purposes. This assessed value, however, is not necessarily the same as the fair market value of the property. The following data summarizes an SRS of 29 homes recently sold in a Midwestern city. Both variables are measured in thousands of dollars.
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## Sales 179.9 240.0 113.5 281.5 186.0 275.0 281.5 210.0 210 184.0 239.0
## Assessed 188.7 220.4 118.1 232.4 188.1 240.1 232.4 211.8 168 180.3 209.2
## [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
## Sales 185.0 251.0 180.0 160.0 255.0 220.0 160.0 200.0 265.0 190.0 150.5
## Assessed 162.3 236.8 123.7 191.7 245.6 219.3 181.6 177.4 307.2 229.7 168.9
## [,23] [,24] [,25] [,26] [,27] [,28] [,29]
## Sales 189.0 157.0 171.5 157.0 175 159.0 229.0
## Assessed 194.4 143.9 201.4 143.9 181 125.1 195.3
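The plotting and `lm()` calls in this problem assume a data frame called `data` with columns `Sales` and `Assessed`. Its construction is not shown, so the following is only one way it might be built from the values printed above.

```r
# Sales price and assessed value for the 29 homes, in thousands of dollars
data <- data.frame(
  Sales = c(179.9, 240.0, 113.5, 281.5, 186.0, 275.0, 281.5, 210.0, 210.0,
            184.0, 239.0, 185.0, 251.0, 180.0, 160.0, 255.0, 220.0, 160.0,
            200.0, 265.0, 190.0, 150.5, 189.0, 157.0, 171.5, 157.0, 175.0,
            159.0, 229.0),
  Assessed = c(188.7, 220.4, 118.1, 232.4, 188.1, 240.1, 232.4, 211.8, 168.0,
               180.3, 209.2, 162.3, 236.8, 123.7, 191.7, 245.6, 219.3, 181.6,
               177.4, 307.2, 229.7, 168.9, 194.4, 143.9, 201.4, 143.9, 181.0,
               125.1, 195.3)
)

str(data)   # 29 observations of 2 numeric variables
```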
plot(data$Assessed, data$Sales, xlab = "Assessed Value", ylab = "Sales")
There is a positive correlation between assessed value and sales price. The linear association is moderately strong, which means that generally as assessed value increases, so does the sales price.
plot(data$Assessed, data$Sales, xlab = "Assessed Value", ylab = "Sales")
abline(lm(data$Sales ~ data$Assessed))
Linearmodel = lm(data$Sales ~ data$Assessed)
summary(Linearmodel)
##
## Call:
## lm(formula = data$Sales ~ data$Assessed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.394 -17.691 -2.562 15.500 46.814
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.4102 23.9272 1.564 0.13
## data$Assessed 0.8489 0.1208 7.027 1.49e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.8 on 27 degrees of freedom
## Multiple R-squared: 0.6465, Adjusted R-squared: 0.6334
## F-statistic: 49.38 on 1 and 27 DF, p-value: 1.486e-07
resid = residuals(Linearmodel)
plot(data$Assessed, resid)
abline(h=0)
The scatterplot shows a roughly linear relationship. In the residual plot, the residuals scatter randomly around the zero line with no clear pattern. Furthermore, because the homes are an SRS, the observations are independent of each other. Taking this all into account, it appears reasonable to do linear regression analysis.
plot(data$Assessed, data$Sales, xlab = "Assessed", ylab = "Sales")
abline(lm(data$Sales ~ data$Assessed))
abline(a = 0, b = 1)  # reference line y = x
Looking at the two lines in relation to my scatterplot, the fitted line lies above the line y = x for assessed values below about 250, so in that range the predicted sales price is larger than the assessed value. Above that point the fitted line falls below y = x and the predicted sales price is lower than the assessed value.
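The point where the fitted line crosses y = x can be found by solving b0 + b1 x = x for x. A minimal sketch using the coefficients from the summary output (the variable names are illustrative):

```r
# Coefficients copied from the lm() summary
b0 <- 37.4102   # intercept
b1 <- 0.8489    # slope

# Solve b0 + b1 * x = x  =>  x = b0 / (1 - b1)
crossing <- b0 / (1 - b1)
round(crossing, 1)   # assessed value (thousands of dollars) where the lines cross
```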
Refer to the previous problem. Based on your fitted line.
Using the model fitted by the computer (intercept = 37.4102, slope = .8489), the predicted sales prices, in thousands of dollars, are: 168.9897 for an assessed value of 155, 224.1682 for an assessed value of 220, and 279.3467 for an assessed value of 285.
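These predictions follow directly from the fitted equation. A short sketch using the rounded coefficients from the summary (variable names are illustrative):

```r
# Coefficients copied from the lm() summary
b0 <- 37.4102   # intercept
b1 <- 0.8489    # slope

assessed_values <- c(155, 220, 285)            # thousands of dollars
predicted_sales <- b0 + b1 * assessed_values   # predicted sales price, thousands of dollars

data.frame(assessed = assessed_values, predicted = round(predicted_sales, 4))
```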
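The interval is \(b_1 \pm t^* SE_{b_1}\). Using the statistics from the summary, we have df = 27, a standard error of .1208, a slope estimate of .8489, and a t* value of 3.057 (the df = 27 critical value for an upper-tail probability of .0025, which corresponds to a 99.5% confidence level).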
Thus, the interval is (.4796, 1.218). The slope tells us that the predicted sales price increases by about .8489 thousand dollars for each additional thousand dollars of assessed value.
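The interval can be reproduced, and the confidence level implied by a given t* checked, with base R's `qt()` and `pt()`. This is a sketch with illustrative variable names. The 95% interval is included only for comparison, since the confidence level the exercise asks for is not stated here.

```r
# Values copied from the lm() summary
b1       <- 0.8489   # slope estimate
se_b1    <- 0.1208   # standard error of the slope
df_resid <- 27       # residual degrees of freedom (n - 2)

# Interval using the critical value quoted in the text
t_used  <- 3.057
ci_used <- b1 + c(-1, 1) * t_used * se_b1

# Confidence level implied by that critical value (two-sided)
level_used <- 1 - 2 * pt(t_used, df_resid, lower.tail = FALSE)

# For comparison: the usual 95% critical value and interval
t_95  <- qt(0.975, df_resid)
ci_95 <- b1 + c(-1, 1) * t_95 * se_b1

round(ci_used, 4)
round(level_used, 4)
round(c(t_95 = t_95, lower = ci_95[1], upper = ci_95[2]), 4)
```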
The \(R^2\) value is .6465 (the Multiple R-squared in the summary; .6334 is the adjusted value), which means that 64.65% of the variation in sales price can be explained by the linear relationship between sales price and assessed value.
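In simple linear regression, R-squared can be recovered from the slope's t statistic as t² / (t² + df), and the adjusted value follows from it. The sketch below uses the numbers in the summary output to show which reported figure is which; the variable names are illustrative.

```r
# Values copied from the lm() summary
t_slope  <- 7.027   # t value for the slope
df_resid <- 27      # residual degrees of freedom
n        <- 29      # number of homes

# R-squared from the slope's t statistic (simple linear regression only)
r_squared <- t_slope^2 / (t_slope^2 + df_resid)

# Adjusted R-squared penalizes for the one predictor in the model
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_resid

round(c(r_squared = r_squared, adjusted = adj_r_squared), 4)
```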
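If we evaluate the line y = x instead, the sum of the squared residuals will not be zero, because the sales prices are not all equal to the assessed values. Since the least-squares line minimizes the sum of squared residuals, the line y = x must give a sum at least as large as that of the fitted line.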
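The two sums of squared residuals can be compared directly. This sketch re-enters the 29 printed observations as plain vectors (the names `sales` and `assessed` are illustrative) and computes the sum of squared deviations about the least-squares line and about y = x.

```r
# Sales price and assessed value for the 29 homes, in thousands of dollars
sales <- c(179.9, 240.0, 113.5, 281.5, 186.0, 275.0, 281.5, 210.0, 210.0,
           184.0, 239.0, 185.0, 251.0, 180.0, 160.0, 255.0, 220.0, 160.0,
           200.0, 265.0, 190.0, 150.5, 189.0, 157.0, 171.5, 157.0, 175.0,
           159.0, 229.0)
assessed <- c(188.7, 220.4, 118.1, 232.4, 188.1, 240.1, 232.4, 211.8, 168.0,
              180.3, 209.2, 162.3, 236.8, 123.7, 191.7, 245.6, 219.3, 181.6,
              177.4, 307.2, 229.7, 168.9, 194.4, 143.9, 201.4, 143.9, 181.0,
              125.1, 195.3)

fit <- lm(sales ~ assessed)

sse_fit      <- sum(residuals(fit)^2)       # about the least-squares line
sse_identity <- sum((sales - assessed)^2)   # about the line y = x

c(least_squares = sse_fit, y_equals_x = sse_identity)
```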