Part I
- A student is gathering data on the driving experiences of other college students. A description of the data is presented below. Which of the variables are quantitative and discrete?
car: 1 = compact, 2 = standard size, 3 = minivan, 4 = SUV, and 5 = truck
color: red, blue, green, black, white
daysDrive: number of days per week the student drives
gasMonth: the amount of money the student spends on gas per month
- car
- daysDrive
- daysDrive, car
- daysDrive, gasMonth
- car, daysDrive, gasMonth
Ans : B
- A histogram of the GPAs of 132 students from the Fall 2012 class of this course is presented below. Which estimates of the mean and median are most plausible?
Fig - 2
Ans : A Mean = 3.3 and Median = 3.5
- A researcher wants to determine if a new treatment is effective for reducing Ebola-related fever. What type of study should be conducted in order to establish that the treatment does indeed cause improvement in Ebola patients?
- Randomly assign Ebola patients to one of two groups, either the treatment or placebo group, and then compare the fever of the two groups.
- Identify Ebola patients who received the new treatment and those who did not, and then compare the fever of those two groups.
- Identify clusters of villages and then stratify them by gender and compare the fevers of male and female groups.
- Both studies (a) and (b) can be conducted in order to establish that the treatment does indeed cause improvement with regards to fever in Ebola patients.
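Ans : A. Only random assignment to a treatment or placebo group can establish that the treatment causes the improvement. The observational comparison in (b) can show an association but cannot rule out confounding.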
- A study is designed to test whether there is a relationship between natural hair color (brunette, blond, red) and eye color (blue, green, brown). If a large χ2 test statistic is obtained, this suggests that:
- there is a difference between average eye color and average hair color.
- a person’s hair color is determined by his or her eye color.
- there is an association between natural hair color and eye color.
- eye color and natural hair color are independent
Ans : C. A large χ2 statistic leads us to reject the null hypothesis that hair color and eye color are independent, which suggests an association between them.
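How a χ2 statistic grows when observed counts depart from independence can be sketched as follows. This is a minimal Python illustration, not part of the original exam: the counts in `observed` are invented, and the helper name `chi_square` is an assumption made for the example.

```python
# Hypothetical counts (rows: hair color, columns: eye color); not data from the study.
observed = [
    [30, 10, 20],  # brunette: blue, green, brown
    [25, 10, 5],   # blond
    [5, 10, 5],    # red
]

def chi_square(table):
    """Return the chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

stat, df = chi_square(observed)
print(f"chi-square = {stat:.2f} on {df} degrees of freedom")
```

For a 3 × 3 table there are 4 degrees of freedom, and the 5% critical value is about 9.49, so a statistic well above that leads to rejecting independence.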
- A researcher studying how monkeys remember is interested in examining the distribution of the score on a standard memory task. The researcher wants to produce a boxplot to examine this distribution. Below are summary statistics from the memory task. What values should the researcher use to determine if a particular score is a potential outlier in the boxplot?
min   Q1   median   Q3     max   mean   sd    n
26    37   45       49.3   65    44.4   8.4   50
Ans : A. Potential outliers are identified from the quartile values, Q1 = 37 and Q3 = 49.3.
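The quartiles lead to outlier cutoffs as sketched below, assuming the usual boxplot convention of 1.5 × IQR beyond the quartiles (the exam text does not state which rule it uses). The variable names are chosen for this illustration only.

```python
# Quartiles and extremes taken from the summary statistics in the question.
q1, q3 = 37.0, 49.3
minimum, maximum = 26.0, 65.0

iqr = q3 - q1                   # interquartile range
lower_fence = q1 - 1.5 * iqr    # scores below this are potential outliers
upper_fence = q3 + 1.5 * iqr    # scores above this are potential outliers

print(f"IQR = {iqr:.2f}, fences = ({lower_fence:.2f}, {upper_fence:.2f})")

# The most extreme scores decide whether any point falls outside the fences.
has_outliers = minimum < lower_fence or maximum > upper_fence
print("potential outliers present:", has_outliers)
```

Under this rule the fences are 18.55 and 67.75, so neither the minimum (26) nor the maximum (65) would be flagged.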
- The _____ are resistant to outliers, whereas the _____ are not.
- mean and median; standard deviation and interquartile range
- mean and standard deviation; median and interquartile range
- standard deviation and interquartile range; mean and median
- median and interquartile range; mean and standard deviation
- median and standard deviation; mean and interquartile range
Ans : D
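A quick way to see this resistance is to replace one value with an extreme one and compare the four statistics before and after. The sketch below uses invented scores; the helper `iqr` is an assumption for this example and relies on the default method of `statistics.quantiles`.

```python
import statistics

def iqr(values):
    """Interquartile range, using the default quartile method of statistics.quantiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

scores = [4, 5, 5, 6, 6, 7, 7, 8, 9]   # hypothetical data
with_outlier = scores[:-1] + [90]      # largest value replaced by an extreme one

for name, fn in [("mean", statistics.mean), ("sd", statistics.stdev),
                 ("median", statistics.median), ("IQR", iqr)]:
    print(f"{name:>6}: {fn(scores):6.2f} -> {fn(with_outlier):6.2f}")
```

In this example the median and IQR are unchanged by the extreme value, while the mean and standard deviation shift substantially.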
- Figure A below represents the distribution of an observed variable. Figure B below represents the distribution of the mean from 500 random samples of size 30 from A. The mean of A is 5.05 and the mean of B is 5.04. The standard deviations of A and B are 3.22 and 0.58, respectively.
Fig - 7
a.) Describe the two distributions (2 pts)
Distribution A: skewed right, with a wide spread (standard deviation 3.22) and a mean of about 5. This is the distribution of the observed variable. Distribution B: no visible skew and approximately normal, with a much narrower spread (standard deviation 0.58) and a mean of about 5. This is the sampling distribution of the mean of samples of size 30 drawn from A.
b.) Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).
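The means are similar because the sample mean is an unbiased estimator of the population mean, so the sampling distribution in B is centered at the mean of A. The standard deviations differ because averaging 30 observations smooths out individual variation: the standard deviation of the sample mean is the standard error, σ/√n = 3.22/√30 ≈ 0.59, which agrees with the observed 0.58. Smaller samples would give a wider spread, and larger samples a narrower one.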
c.) What is the statistical principle that describes this phenomenon (2 pts)?
The principle is the Central Limit Theorem. It states that the sampling distribution of the mean is nearly normal when the sample observations are independent and the sample size is sufficiently large (n ≥ 30 is a common rule of thumb), even when the population itself, like distribution A, is not normal.
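The behavior described in this question can be reproduced with a small simulation. The sketch below is an illustration only: it assumes an exponential population with mean 5 as a stand-in for the right-skewed distribution A (the actual distribution behind Figure A is not given), and the seed and helper name are arbitrary.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

POPULATION_MEAN = 5.0  # assumed; an exponential with this mean also has SD 5

def draw():
    """One observation from the hypothetical right-skewed population."""
    return random.expovariate(1 / POPULATION_MEAN)

n = 30           # observations per sample, as in the question
n_samples = 500  # number of samples, as in the question

observations = [draw() for _ in range(5000)]
sample_means = [statistics.mean([draw() for _ in range(n)])
                for _ in range(n_samples)]

sd_obs = statistics.stdev(observations)
sd_means = statistics.stdev(sample_means)

print(f"mean of observations = {statistics.mean(observations):.2f}")
print(f"mean of sample means = {statistics.mean(sample_means):.2f}")
print(f"SD of observations   = {sd_obs:.2f}")
print(f"SD of sample means   = {sd_means:.2f}")
print(f"SD of observations / sqrt(n) = {sd_obs / n ** 0.5:.2f}")
```

The two means should land close together, while the standard deviation of the sample means should be close to the population standard deviation divided by √30.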
Part II
- The mean (for x and y separately; 1 pt).
[1] 9
[1] 7.5
[1] 9
[1] 7.5
[1] 9
[1] 7.5
[1] 9
[1] 7.5
- The median (for x and y separately; 1 pt).
[1] 9
[1] 7.6
[1] 9
[1] 8.1
[1] 9
[1] 7.1
[1] 8
[1] 7
- The standard deviation (for x and y separately; 1 pt).
[1] 3.3
[1] 2
[1] 3.3
[1] 2
[1] 3.3
[1] 2
[1] 3.3
[1] 2
For each x and y pair, calculate (also to two decimal places; 1 pt):
- The correlation (1 pt).
[1] 0.82
[1] 0.82
[1] 0.82
[1] 0.82
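The summary statistics requested so far can be computed by hand as in the sketch below. It is written in Python for illustration (the output above comes from R), the data are invented rather than taken from the four pairs, and `pearson_r` is a helper name assumed for this example.

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation from sums of squared deviations and cross-products."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical pair, not one of the four data sets in the exam.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

for name, values in [("x", x), ("y", y)]:
    print(f"{name}: mean = {statistics.mean(values):.2f}, "
          f"median = {statistics.median(values):.2f}, "
          f"sd = {statistics.stdev(values):.2f}")
print(f"correlation = {pearson_r(x, y):.2f}")
```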
- Linear regression equation (2 pts).
Call:
lm(formula = data1$y ~ data1$x)
Coefficients:
(Intercept) data1$x
3.0001 0.5001
Call:
lm(formula = data2$y ~ data2$x)
Coefficients:
(Intercept) data2$x
3.001 0.500
Call:
lm(formula = data3$y ~ data3$x)
Coefficients:
(Intercept) data3$x
3.0025 0.4997
Call:
lm(formula = data4$y ~ data4$x)
Coefficients:
(Intercept) data4$x
3.0017 0.4999
\[ \hat{y}_1 = 3.00 + 0.50\,x \]
\[ \hat{y}_2 = 3.00 + 0.50\,x \]
\[ \hat{y}_3 = 3.00 + 0.50\,x \]
\[ \hat{y}_4 = 3.00 + 0.50\,x \]
- R-Squared (2 pts).
[1] 0.6665425
[1] 0.666242
[1] 0.666324
[1] 0.6667073
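For reference, the intercept, slope, and R-squared of a simple linear regression can be computed directly from the least-squares formulas. The sketch below is a Python illustration with invented data, not the `lm` call used above, and the function name `fit_line` is an assumption.

```python
def fit_line(x, y):
    """Least-squares line y = intercept + slope * x, plus R-squared."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    fitted = [intercept + slope * a for a in x]
    ss_res = sum((b - f) ** 2 for b, f in zip(y, fitted))  # unexplained variation
    ss_tot = sum((b - my) ** 2 for b in y)                 # total variation
    return intercept, slope, 1 - ss_res / ss_tot

# Hypothetical data, not one of the four pairs in the exam.
x = [4, 6, 8, 10, 12, 14]
y = [5.2, 5.8, 7.3, 7.9, 9.4, 9.6]

intercept, slope, r_squared = fit_line(x, y)
print(f"y-hat = {intercept:.2f} + {slope:.2f} * x, R-squared = {r_squared:.2f}")
```

In simple linear regression R-squared equals the squared correlation, which is consistent with the values reported above: 0.82² ≈ 0.67.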
For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)
Fig 1: No. The scatterplot looks almost linear, but the residuals in plot 4 are not normally distributed.
Fig 2: No. The scatterplot follows a parabola, so the relationship is not linear.
Fig 3: Yes. The data follow a linear trend and the residuals look normal, so a linear model is a good fit.
Fig 4: No. The plot is dominated by outliers; there are too many for a linear fit to be reliable.
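A residual check of the kind used in these answers can be sketched numerically. The example below fits a line to invented data that follow a parabola (an assumption chosen to mimic a curved scatterplot, not the exam data) and prints the sign of each residual in order of x.

```python
def linear_fit(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical curved relationship: y = x squared for x = 1..11.
x = list(range(1, 12))
y = [a ** 2 for a in x]

intercept, slope = linear_fit(x, y)
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]

# Random scatter would mix signs; a run of one sign reveals curvature.
signs = "".join("+" if r > 0 else "-" for r in residuals)
print(signs)  # prints ++-------++
```

The residuals are positive at both ends and negative in the middle, a systematic pattern that signals a poor linear fit even though R-squared for this line is about 0.95.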
Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)
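Visualizations show features of the relationship between the independent and dependent variables that summary statistics hide. For example, the four pairs above have nearly identical means, standard deviations, correlations, regression lines, and R-squared values, yet their scatterplots and residual plots look very different. A model should therefore always be checked against plots of the data and of the residuals.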