A lecturer is interested in experiments where the “Wisdom of the Crowd” is put to the test. In these cases, combined answers from a large group usually proves better than the individuals. A simple example is to get people to guess the number of jelly beans in a jar. Individual guesses vary widely and are usually all wrong, but the average can be quite close to the actual answer.
An internet experiment was carried out where people where asked to guess the weight of a cow (in pounds) that was pictured. The actual weight of the cow in the picture was 1355 pounds. A large number of people responded with their estimated weight. For this question, we will use a random subset of the respondents.
We will treat the random sample of guesses as representative of group wisdom. What we are interested in are is was the group wisdom consistent with the actual weight of the cow or was there evidence that it differed? If there was evidence that the group wisdom estimate differed, what was the group wisdom estimate? (Remember, we want an interval for the estimate.)
The data on the 500 guesses of the cows weight are in the file Cow.csv, which contains the variable:
| Variable | Description |
|---|---|
| Weight | the guess of the weight of the cow (in pounds) |
Instructions:
We are interested in seeing if group wisdom can guess the average weight of a cow that is shown in a picture on the internet.
Cow.df=read.csv("Cow.csv", header=T)
hist(Cow.df$Weight,breaks=20)
summary(Cow.df$Weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 300 950 1276 1286 1532 4450
Yes I can see that there is a sign from the plot that the weight guess low, this is because the plot is right skewed where its bimodial. This is indictaed by the cluster less than 1000 pounds, which indicates that there are groups of people who are weighting it down in kilograms.
Formulas: \(T = \frac{\bar{y}-\mu_0}{se(\bar{y})}\) and 95% confidence interval \(\bar{y} \pm t_{df, 0.975} \times se(\bar{y})\)
NOTES: The R code mean(y) calculates \(\bar{y}\). The standard error is \(se(\bar{y}) = \frac{s}{\sqrt{n}}\) where
\(s\) is the standard deviation of
\(y\) and is calculated by
sd(y), and \(n\) is the
number of data-points calculated by length(y). The degrees
of freedom is \(df = n-1\). The \(t_{df,0.975}\) multiplier is given by the R
code qt(0.975, df).
ybar = mean(Cow.df$Weight)
n = length(Cow.df$Weight)
se.ybar = sd(Cow.df$Weight)/sqrt(n)
# t-statistic for H0: mu=1355 :
(ybar - 1355) / se.ybar
## [1] -3.026792
# 95% confidence interval for the mean:
ybar - qt(0.975, n-1) * se.ybar
## [1] 1241.15
ybar + qt(0.975, n-1) * se.ybar
## [1] 1330.776
t.test(Cow.df$Weight, mu=1355)
##
## One Sample t-test
##
## data: Cow.df$Weight
## t = -3.0268, df = 499, p-value = 0.002599
## alternative hypothesis: true mean is not equal to 1355
## 95 percent confidence interval:
## 1241.150 1330.776
## sample estimates:
## mean of x
## 1285.963
Note: You should get exactly the same results from the manual calculations and using the \(t.test\) function. Doing this was to give you practice using some R code. The \(t.test\) function also delivers the p-value that we did not calculate above.
Cow.fit=lm(Weight~1,data=Cow.df)
normcheck(Cow.fit)
cooks20x(Cow.fit)
summary(Cow.fit);
##
## Call:
## lm(formula = Weight ~ 1, data = Cow.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -986.0 -336.0 -9.5 246.0 3164.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1285.96 22.81 56.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 510 on 499 degrees of freedom
confint(Cow.fit)
## 2.5 % 97.5 %
## (Intercept) 1241.15 1330.776
As this data consists of 500 guesses (of a cows weight). We have applied a one sample t-test to it, equivalent to an intercept only linear model (null model).
We have a random sample of 500 guesses, and we wished to see if their average guess is consistent with the actual cows weight of 1355 pounds. The guesses should be independent of each other. Though the data is skewed, we are happy with the normality assumption (see answer to previous question). There were no unduly influential points.
Our model is: \(Weight_i = \mu + \epsilon_i\) where \(\epsilon_i \sim iid ~ N(0,\sigma^2)\)
We are interested in seeing if group wisdom can guess the average weight of a cow that is shown in a picture on the internet. We aim to predict the average weight of a cow shown a picture.
We have evidence that the underlying mean of gueses is not the same as the actal weight, this can be seen through the p-value(<2e-16).
It is estimated that those who are shown the picture will guess an average wight of the cow to be between 1241.15 and 1330.78 pounds.
A manufacturer of electric bikes wants to investigate how power consumption of their model of bike increases with speed. 100 independent measurements of speed vs power consumption were recorded for their bikes.
The data is in the file CyclePower.csv, which contains the variables:
| Variable | Description |
|---|---|
| kph | Speed (kilometres per hour) |
| watts | Power consumption (watts) |
Instructions:
We are interested in investigating how the power consumption of an electric bike changes with the bikes speed.
Ebike.df=read.csv("CyclePower.csv", header=T)
plot(watts~kph,data=Ebike.df)
The scatter plot of watts verses kph shows a strong, increasing linear relationship. The larger the speed (kph), the greater the mean power consumption (watts) of an electric bike.
The scatter of the graph is constant.
## Fitting the simple linear model first.
Ebike.fit1 = lm(watts~kph, data=Ebike.df)
modelcheck(Ebike.fit1)
summary(Ebike.fit1)
##
## Call:
## lm(formula = watts ~ kph, data = Ebike.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.744 -9.365 0.080 9.919 31.595
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -77.7181 4.3908 -17.70 <2e-16 ***
## kph 18.2442 0.2079 87.74 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.26 on 103 degrees of freedom
## Multiple R-squared: 0.9868, Adjusted R-squared: 0.9867
## F-statistic: 7698 on 1 and 103 DF, p-value: < 2.2e-16
#watts.fit1=lm(watts~temp, data = watts.df)
#modelcheck(watts.fit1)
## Fit a quadratic relationship.
Ebike.fit2 = lm(watts~kph + I(kph^2), data=Ebike.df)
plot(Ebike.fit2,which=1)
modcheck(Ebike.fit2)
summary(Ebike.fit2)
##
## Call:
## lm(formula = watts ~ kph + I(kph^2), data = Ebike.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.182 -5.054 1.478 6.219 24.628
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.80175 11.23558 -0.160 0.873
## kph 9.92299 1.17888 8.417 2.52e-13 ***
## I(kph^2) 0.20630 0.02892 7.134 1.45e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.89 on 102 degrees of freedom
## Multiple R-squared: 0.9912, Adjusted R-squared: 0.991
## F-statistic: 5739 on 2 and 102 DF, p-value: < 2.2e-16
Pred.df=data.frame(kph=9:32)
pred.watts = predict(Ebike.fit2, Pred.df)
#pred.watts
pred.watts.linear = predict(Ebike.fit1, Pred.df)
#pred.watts.linear
Diff= pred.watts-pred.watts.linear
Diff
## 1 2 3 4 5 6 7
## 17.7360920 13.3346268 9.3457635 5.7695020 2.6058424 -0.1452153 -2.4836712
## 8 9 10 11 12 13 14
## -4.4095252 -5.9227773 -7.0234276 -7.7114760 -7.9869226 -7.8497673 -7.3000101
## 15 16 17 18 19 20 21
## -6.3376511 -4.9626902 -3.1751275 -0.9749628 1.6378036 4.6631720 8.1011421
## 22 23 24
## 11.9517142 16.2148881 20.8906639
# Have already predicted values over a range for the model above so can use the lines command to add these as the appropriate line/curve to the plot.
plot(watts ~ kph, data = Ebike.df)
lines(9:32, pred.watts, col="purple")
lines(9:32, pred.watts.linear, col="pink")
Both models share similar assumption and each observation is independen, with equal variance (EOV) maintained. The R-squared value for the quadratic model is 0.9912, while for the linear model it is 0.9868, showing comparable levels of variability. However, when examining the residual plots, the linear model displays more curvature, whereas the quadratic model shows a straighter pattern, indicating more uniform data and consistent variance. Thus, the quadratic model is considered more accurate.
\(Ebike_i = \beta_0 + \beta_1 \times kph_i + \beta_2\times kph_i^2 + \epsilon_i\) where \(\epsilon_i \sim iid ~ N(0,\sigma^2)\)
There is a positive linear relationship between the speed and power of electric bikes, this indicates that as the speed (kph) increases so does the power conumption (Watts).
The linear model has an easy to read relationship between the power conumption and the speed, therefore the model I would use would be the linear model.
1.3 Comment on the plot/exploratory data analysis
The data is not normally distributed, as shown on the histogram. The shape of the distribution shows a cluster of data points with more peaks towards the lower weight values, particularly between 800 and 1600 pounds. There is a peak around 1300 to 1400 pounds, indicating that most of the data points lie within this range.
From the summary statistics, we can infer that the distribution is not symmetrical. This is evident because the mean weight is 1286 pounds, while the median weight is 1276 pounds. Since the mean is greater than the median, we can confirm that the data is right skewed, meaning there is a longer tail on the right side of the distribution. The lowest weight recorded is 300 pounds, and the maximum weight recorded is 4450 pounds, which indicates large variability.