Statistics A-level Questions


0.1 Question 1

A company manager is investigating the times taken, t minutes, to complete an aptitude test. The human resources manager produced the table below of coded times, x minutes, for a random sample of 30 applicants.

(You may use \(\sum fy=355\) and \(\sum fy^{2}=5675\), where y is the midpoint of each class of coded times.)

With 30 values, the median lies midway between the 15th and 16th values, which falls in the class \(5 \leq x < 10\). Using linear interpolation, it lies 12.5/15 = 5/6 of the way across the class, so the median is 5 + (5/6 x 5) = 9.17 minutes.
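The interpolation can be checked numerically. This is a minimal sketch: the fraction 12.5/15 suggests that the median position is taken as the 15.5th value, that 3 values lie below the class, and that the class contains 15 values, but the table itself is not reproduced here, so those three inputs are assumptions. The function name is illustrative only.

```python
def interpolated_median(lower, width, position, cum_below, class_freq):
    """Linear interpolation for the median of grouped data."""
    # Fraction of the way across the median class.
    fraction = (position - cum_below) / class_freq
    return lower + fraction * width


# Assumed inputs (inferred from the 12.5/15 in the text, not read from the table).
median_x = interpolated_median(lower=5, width=5, position=15.5,
                               cum_below=3, class_freq=15)
print(round(median_x, 2))  # 9.17
```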

\[SD=\sqrt{\dfrac{\sum{fy^{2}}-\frac{(\sum{fy})^{2}}{n}}{n}}\]

\[SD=\sqrt{\dfrac{5675-\frac{(355)^{2}}{30}}{30}} \approx 7.0 \text{ minutes}\]
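As a quick check of the arithmetic, the standard deviation can be computed directly from the summary values given in the question. The helper name `sd_from_sums` is an illustrative choice.

```python
import math


def sd_from_sums(n, sum_fy, sum_fy2):
    """Standard deviation from n, the sum of f*y and the sum of f*y^2."""
    mean = sum_fy / n
    variance = sum_fy2 / n - mean ** 2
    return math.sqrt(variance)


sd_x = sd_from_sums(30, 355, 5675)
print(round(sd_x, 2))  # 7.01
```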

The company manager is told by the human resources manager that he subtracted 15 from each of the times and then divided by 2, to calculate the coded times.

The coding is the linear transformation x = (t - 15)/2, so the median transforms in the same way as the individual values. The median of the coded times was 9.17, so the median of the times t is (9.17 x 2) + 15 = 33.3 minutes.

The coding involves a scale factor of 2 and a shift of 15. Only the scale factor affects the standard deviation, so the standard deviation of t is double that of the coded times, which gives about 14 minutes.

This can be checked by tabulating the data again for t, replacing x by \(2x+15\) and writing z for the midpoint of each class of t.

(\(\sum fz=1160\) and \(\sum fz^{2}=50750\))

\[SD=\sqrt{\dfrac{50750-\frac{(1160)^{2}}{30}}{30}} \approx 14 \text{ minutes}\]
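The decoded sums do not have to come from a new table. Since z = 2y + 15, they follow from the coded sums by expanding the brackets. The sketch below reproduces the quoted totals and confirms that the standard deviation doubles; the variable names are illustrative.

```python
import math

n, sum_fy, sum_fy2 = 30, 355, 5675  # coded summary values from the question
a, b = 2, 15                        # decoding: z = a*y + b

# Sum of f*z and sum of f*z^2, obtained by expanding (a*y + b) and (a*y + b)^2.
sum_fz = a * sum_fy + b * n
sum_fz2 = a**2 * sum_fy2 + 2 * a * b * sum_fy + b**2 * n


def sd(n, s1, s2):
    """Standard deviation from the sum and the sum of squares."""
    return math.sqrt((s2 - s1**2 / n) / n)


ratio = sd(n, sum_fz, sum_fz2) / sd(n, sum_fy, sum_fy2)
print(sum_fz, sum_fz2)   # 1160 50750
print(round(ratio, 6))   # 2.0
```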

The following year the company has 25 positions available. The company manager decides not to offer a position to any applicant who takes 35 minutes or more to complete the aptitude test.

The company has 60 applicants.

The median time is about 33.3 minutes, which is less than 35 minutes, so more than 50% of applicants would be expected to complete the test in under 35 minutes. With 60 applicants, at least 30 should meet the time limit, so the company should be able to fill all 25 positions.

0.2 Question 2

Using a calculator, find the binomial probability for the range from 5 to 11 inclusive, which is the difference between two cumulative probabilities. The value 12 is not included because the question states less than 12. This is the area shaded in tan.

The answer is:

0.8443677

Past records from a large supermarket show that 25% of people who buy eggs buy organic eggs. On one particular day, a random sample of 40 people is taken from those who had bought eggs, and 16 people are found to have bought organic eggs.

A hypothesis is always written as a hypothesis pair.

There is the null hypothesis \(H_{0}\) and the alternative hypothesis \(H_{1}\)

In this case the null hypothesis is that the proportion of shoppers who buy organic eggs is 25%, \(H_{0}: p = 0.25\). The alternative hypothesis is that the proportion of shoppers who buy organic eggs is greater than 25%, \(H_{1}: p > 0.25\).

To test the hypothesis at the 1% level you assume that the null hypothesis is true and you calculate the probability of observing 16 or more shoppers buying organic eggs if the proportion of shoppers who buy organic eggs is 25%.

This is the area under the curve for 16 or more people buying organic eggs coloured in tan.

Using a calculator, find the binomial probability for the range from 16 to 40, that is \(P(X \geq 16) = 1 - P(X \leq 15)\) where \(X \sim B(40, 0.25)\).

This is 2.62%, which is above the 1% significance level of the test. So there is not enough evidence at the 1% level to reject the null hypothesis, and the data do not support the claim that the proportion of customers who buy organic eggs has increased.

If the significance level of the test were 5%, then 2.62% is less than this threshold, so we would reject the null hypothesis and there would be evidence that the proportion of people buying organic eggs has increased from 25%.
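The tail probability can be reproduced without a calculator's distribution menu. The sketch below sums the binomial probabilities for 16 to 40 successes under the null hypothesis and compares the result with both significance levels; the function names are illustrative.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def upper_tail(k_min, n, p):
    """P(X >= k_min) for X ~ B(n, p)."""
    return sum(binomial_pmf(k, n, p) for k in range(k_min, n + 1))


p_value = upper_tail(16, 40, 0.25)
print(round(p_value, 4))               # 0.0262
print(p_value < 0.01, p_value < 0.05)  # False True
```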

0.3 Question 3

Pete is investigating the relationship between daily rainfall, w (mm), and daily mean pressure, p (hPa), in Perth during 2015. He used the large data set to take a sample of size 12 and obtained the following results.

The data
p	w
1007	102.0
1012	63.0
1013	63.0
1009	38.4
1019	38.0
1010	35.0
1010	34.2
1010	32.0
1013	30.4
1011	28.0
1014	28.0
1022	15.0

The Calculated Quartiles
	p	w
Q1	1010.0	29.2
Q2	1011.5	34.6
Q3	1013.5	50.7
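The quartiles in the table can be reproduced with the usual hand rule: find the position kn/4, and if it is a whole number take the mean of that value and the next one, otherwise round up. The sketch below assumes that rule; the function name `quartile` is illustrative.

```python
p = [1007, 1012, 1013, 1009, 1019, 1010, 1010, 1010, 1013, 1011, 1014, 1022]
w = [102.0, 63.0, 63.0, 38.4, 38.0, 35.0, 34.2, 32.0, 30.4, 28.0, 28.0, 15.0]


def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) using the kn/4 position rule."""
    data = sorted(values)
    pos = k * len(data) / 4
    if pos == int(pos):
        i = int(pos)
        # Whole-number position: mean of the pos-th and (pos+1)-th values.
        return (data[i - 1] + data[i]) / 2
    # Otherwise round the position up (1-indexed), which is index int(pos).
    return data[int(pos)]


print([round(quartile(p, k), 2) for k in (1, 2, 3)])  # [1010.0, 1011.5, 1013.5]
print([round(quartile(w, k), 2) for k in (1, 2, 3)])  # [29.2, 34.6, 50.7]
```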

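Summary Statistics for the Data from Statistical Software (the quartiles use a different interpolation rule from the hand calculation above, so Q1 and Q3 can differ)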
	p	w
	Min. :1007	Min. : 15.00
	1st Qu.:1010	1st Qu.: 29.80
	Median :1012	Median : 34.60
	Mean :1012	Mean : 42.25
	3rd Qu.:1013	3rd Qu.: 44.55
	Max. :1022	Max. :102.00

An outlier is a value which is more than 1.5 times the interquartile range above Q3 or more than 1.5 times the interquartile range below Q1.

For rainfall, the upper limit is 50.7 + (1.5 x 21.5) = 82.95 mm. The first point has a value of w over 100 mm, so it is an outlier based on rainfall. For pressure, the upper limit is 1013.5 + (1.5 x 3.5) = 1018.75 hPa. The other two points are outliers in pressure as they have values above 1018.75 hPa.
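The outlier limits can be checked against the data. This sketch uses the hand-calculated quartiles from the table above (Q1 and Q3 for each variable) and lists every value outside the limits; the function name is illustrative.

```python
p = [1007, 1012, 1013, 1009, 1019, 1010, 1010, 1010, 1013, 1011, 1014, 1022]
w = [102.0, 63.0, 63.0, 38.4, 38.0, 35.0, 34.2, 32.0, 30.4, 28.0, 28.0, 15.0]


def find_outliers(values, q1, q3):
    """Values more than 1.5 x IQR below Q1 or above Q3."""
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


print(find_outliers(p, 1010.0, 1013.5))  # [1019, 1022]
print(find_outliers(w, 29.2, 50.7))      # [102.0]
```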

Outliers can dominate a correlation coefficient, so the correlation is a better measure of the relationship for the bulk of the data once they are removed. In this case the negative correlation is driven largely by the outliers, and only a weak correlation remains once they are removed.

The actual Pearson’s correlation for the complete data set is -0.4985878 and the correlation once the outliers have been removed is 0.1990396.
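Both correlation values can be reproduced from the sample. The sketch below computes Pearson's coefficient from first principles, then repeats the calculation after dropping the three points that lie above the upper outlier limits (1018.75 hPa and 82.95 mm). The function and variable names are illustrative.

```python
from math import sqrt

p = [1007, 1012, 1013, 1009, 1019, 1010, 1010, 1010, 1013, 1011, 1014, 1022]
w = [102.0, 63.0, 63.0, 38.4, 38.0, 35.0, 34.2, 32.0, 30.4, 28.0, 28.0, 15.0]


def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r_all = pearson_r(p, w)

# Keep only the points inside both upper outlier limits.
kept = [(pi, wi) for pi, wi in zip(p, w) if pi <= 1018.75 and wi <= 82.95]
r_trimmed = pearson_r([pi for pi, _ in kept], [wi for _, wi in kept])

print(len(kept))                              # 9
print(round(r_all, 3), round(r_trimmed, 3))   # -0.499 0.199
```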

John has also been studying the large data set and believes that the sample Pete has taken is not random.

The values chosen by Pete are unusual because all of them have large amounts of rainfall whereas in the complete dataset at least half of the measured days have no rainfall at all.

John finds that the equation of the regression line of w on p using all of the data in the large data set is:

\(w = 1023 - 0.223p\)

-0.223 is the gradient of the regression line. It is the change in the dependent variable (w) when the independent variable (p) increases by one unit. As the gradient is negative, the daily rainfall decreases by 0.223 mm for each 1 hPa increase in daily mean pressure.
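To see why the gradient is the change in w per unit change in p, compare the values given by the regression line at p and at p + 1:

\[\left[1023 - 0.223(p+1)\right] - \left[1023 - 0.223p\right] = -0.223\]

The intercept cancels, so each additional 1 hPa of pressure lowers the predicted rainfall by 0.223 mm, whatever the starting pressure.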

John decided to use the regression line to estimate the daily rainfall for a day in December when the daily mean pressure is 1011 hPa.

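John’s estimate is an interpolation with respect to pressure, since 1011 hPa is close to the centre of the data, and the line is fitted to all of the data rather than to a small sample. On those grounds it is more reliable than an estimate based on Pete’s sample. However, the large data set only covers the months from May to October, so a day in December lies outside the period the data describe, which limits the reliability of the estimate.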

0.4 Question 4

Alyona, Dawn and Sergei are sometimes late for school. The events A, D and S are as follows

The Venn diagram below shows the three events A, D and S and the probabilities associated with each region of D. The constants p, q and r each represent probabilities associated with the three separate regions outside D.

Venn Diagram

A and S are mutually exclusive as there is no overlap between the two sets, so the probability that both of them occur is zero. Alyona and Sergei are never both late to school.

The probability that Sergei is late for school is 0.2. The events A and D are independent.

Dawn and Sergei’s teacher believes that when Sergei is late for school, Dawn tends to be late for school.

The teacher is wrong to think that Sergei’s being late for school has any influence on Dawn’s being late. The events D and S are independent, that is \(P(D \cap S) = P(D) \times P(S)\), so one does not affect the other.
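The independence condition is easy to test once the probabilities have been read from the Venn diagram. The sketch below only illustrates the check: P(S) = 0.2 comes from the question, but the values used for P(D) and P(D and S) are hypothetical placeholders, not values taken from the diagram.

```python
from math import isclose


def are_independent(p_d, p_s, p_d_and_s):
    """True when P(D and S) equals P(D) x P(S), within rounding error."""
    return isclose(p_d_and_s, p_d * p_s, abs_tol=1e-9)


# P(S) = 0.2 is from the question; the other inputs are hypothetical placeholders.
print(are_independent(p_d=0.5, p_s=0.2, p_d_and_s=0.1))   # True
print(are_independent(p_d=0.5, p_s=0.2, p_d_and_s=0.15))  # False
```

If the check returns False, comparing P(D given S) = P(D and S) / P(S) with P(D) shows the direction of the association.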

Andrew Dalby

2021/09/27
