1 Question 1 [10 Marks]

A lecturer is interested in experiments where the “Wisdom of the Crowd” is put to the test. In these cases, combined answers from a large group usually proves better than the individuals. A simple example is to get people to guess the number of jelly beans in a jar. Individual guesses vary widely and are usually all wrong, but the average can be quite close to the actual answer.

An internet experiment was carried out where people where asked to guess the weight of a cow (in pounds) that was pictured. The actual weight of the cow in the picture was 1355 pounds. A large number of people responded with their estimated weight. For this question, we will use a random subset of the respondents.

We will treat the random sample of guesses as representative of group wisdom. What we are interested in are is was the group wisdom consistent with the actual weight of the cow or was there evidence that it differed? If there was evidence that the group wisdom estimate differed, what was the group wisdom estimate? (Remember, we want an interval for the estimate.)

The data on the 500 guesses of the cows weight are in the file Cow.csv, which contains the variable:

Variable	Description
Weight	the guess of the weight of the cow (in pounds)

Instructions:

Make sure you change your name and UPI/ID number at the top of the assignment.
Comment on the plot of the data.
One concern that the lecturer has is that, since the respondents were self-selected from an internet website with readers from all around the world, some would be more used to estimating weights in kilograms rather than pounds and tend to guess low. Looking at the plot, comment if there seem to be any sign of this?
Manually calculate the t-statistic for comparing the mean height to 1355 pounds and the corresponding 95% confidence interval.
Write an appropriate Executive Summary.

1.1 Question of interest/goal of the study

We are interested in seeing if group wisdom can guess the average weight of a cow that is shown in a picture on the internet.

1.2 Read in and inspect the data:

Cow.df=read.csv("Cow.csv", header=T)
hist(Cow.df$Weight,breaks=20)

summary(Cow.df$Weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     300     950    1276    1286    1532    4450

1.3 Comment on the plot/exploratory data analysis

The data is not normally distributed, as shown on the histogram. The shape of the distribution shows a cluster of data points with more peaks towards the lower weight values, particularly between 800 and 1600 pounds. There is a peak around 1300 to 1400 pounds, indicating that most of the data points lie within this range.

From the summary statistics, we can infer that the distribution is not symmetrical. This is evident because the mean weight is 1286 pounds, while the median weight is 1276 pounds. Since the mean is greater than the median, we can confirm that the data is right skewed, meaning there is a longer tail on the right side of the distribution. The lowest weight recorded is 300 pounds, and the maximum weight recorded is 4450 pounds, which indicates large variability.

1.4 One concern that the lecturer has is that, since the respondents were self-selected from an internet website with readers from all around the world, some would be more used to estimating weights in kilograms rather than pounds and tend to guess low. Looking at the plot, comment if there seem to be any sign of this?

Yes I can see that there is a sign from the plot that the weight guess low, this is because the plot is right skewed where its bimodial. This is indictaed by the cluster less than 1000 pounds, which indicates that there are groups of people who are weighting it down in kilograms.

Formulas: \(T = \frac{\bar{y}-\mu_0}{se(\bar{y})}\) and 95% confidence interval \(\bar{y} \pm t_{df, 0.975} \times se(\bar{y})\)

NOTES: The R code mean(y) calculates \(\bar{y}\). The standard error is \(se(\bar{y}) = \frac{s}{\sqrt{n}}\) where \(s\) is the standard deviation of \(y\) and is calculated by sd(y), and \(n\) is the number of data-points calculated by length(y). The degrees of freedom is \(df = n-1\). The \(t_{df,0.975}\) multiplier is given by the R code qt(0.975, df).

ybar = mean(Cow.df$Weight)
n = length(Cow.df$Weight)
se.ybar = sd(Cow.df$Weight)/sqrt(n)

# t-statistic for H0: mu=1355 :
(ybar - 1355) / se.ybar

## [1] -3.026792

# 95% confidence interval for the mean:
ybar - qt(0.975, n-1) * se.ybar

## [1] 1241.15

ybar + qt(0.975, n-1) * se.ybar

## [1] 1330.776

1.5 Repeat the same calculation using the t.test function (done for you):

t.test(Cow.df$Weight, mu=1355)

## 
##  One Sample t-test
## 
## data:  Cow.df$Weight
## t = -3.0268, df = 499, p-value = 0.002599
## alternative hypothesis: true mean is not equal to 1355
## 95 percent confidence interval:
##  1241.150 1330.776
## sample estimates:
## mean of x 
##  1285.963

Note: You should get exactly the same results from the manual calculations and using the \(t.test\) function. Doing this was to give you practice using some R code. The \(t.test\) function also delivers the p-value that we did not calculate above.

1.6 Fit and check the null model (done for you):

Cow.fit=lm(Weight~1,data=Cow.df)
normcheck(Cow.fit)

cooks20x(Cow.fit)

summary(Cow.fit);

## 
## Call:
## lm(formula = Weight ~ 1, data = Cow.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -986.0 -336.0   -9.5  246.0 3164.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1285.96      22.81   56.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 510 on 499 degrees of freedom

confint(Cow.fit)

##               2.5 %   97.5 %
## (Intercept) 1241.15 1330.776

1.7 Method and Assumption Checks

As this data consists of 500 guesses (of a cows weight). We have applied a one sample t-test to it, equivalent to an intercept only linear model (null model).

We have a random sample of 500 guesses, and we wished to see if their average guess is consistent with the actual cows weight of 1355 pounds. The guesses should be independent of each other. Though the data is skewed, we are happy with the normality assumption (see answer to previous question). There were no unduly influential points.

Our model is: \(Weight_i = \mu + \epsilon_i\) where \(\epsilon_i \sim iid ~ N(0,\sigma^2)\)

1.8 Executive Summary

We are interested in seeing if group wisdom can guess the average weight of a cow that is shown in a picture on the internet. We aim to predict the average weight of a cow shown a picture.

We have evidence that the underlying mean of gueses is not the same as the actal weight, this can be seen through the p-value(<2e-16).

It is estimated that those who are shown the picture will guess an average wight of the cow to be between 1241.15 and 1330.78 pounds.

2 Question 2 [16 Marks]

A manufacturer of electric bikes wants to investigate how power consumption of their model of bike increases with speed. 100 independent measurements of speed vs power consumption were recorded for their bikes.

The data is in the file CyclePower.csv, which contains the variables:

Variable	Description
kph	Speed (kilometres per hour)
watts	Power consumption (watts)

Instructions:

Comment on the initial plot of the data.
Fit BOTH a simple linear model and a second linear model with an appropriate quadratic term, including model checks.
Create predicted power consumptions for BOTH models for kph ranging from 9 to 32 kph (changing in steps of 1 kph.) You do not need to list these.
Create and list the differences between the two sets of predictions.
Plot the data with BOTH models superimposed over it.
Which of the two models is the most correct according to the assumptions and would be the most accurate for prediction? Justify your answer.
Write the equation of the model you chose above as if for Methods and Assumption Checks.
In one sentence, explain the relationship between speed and power consumption from the simple linear model.
Describe how the quadratic model further adjusts this relationship from the simple linear relationship. (You will find looking at your plot and the list of differences helpful here.)
If you wanted to just have a simple rule of thumb description for the relationship, which model would you use? Justify your answer.

2.1 Question of interest/goal of the study

We are interested in investigating how the power consumption of an electric bike changes with the bikes speed.

2.2 Read in and inspect the data:

Ebike.df=read.csv("CyclePower.csv", header=T)
plot(watts~kph,data=Ebike.df)

2.3 Comment on the plot

The scatter plot of watts verses kph shows a strong, increasing linear relationship. The larger the speed (kph), the greater the mean power consumption (watts) of an electric bike.

The scatter of the graph is constant.

2.4 Fit a linear model with an appropriate quadratic term, including model checks.

## Fitting the simple linear model first.

Ebike.fit1 = lm(watts~kph, data=Ebike.df)
modelcheck(Ebike.fit1)

summary(Ebike.fit1)

## 
## Call:
## lm(formula = watts ~ kph, data = Ebike.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.744  -9.365   0.080   9.919  31.595 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -77.7181     4.3908  -17.70   <2e-16 ***
## kph          18.2442     0.2079   87.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.26 on 103 degrees of freedom
## Multiple R-squared:  0.9868, Adjusted R-squared:  0.9867 
## F-statistic:  7698 on 1 and 103 DF,  p-value: < 2.2e-16

#watts.fit1=lm(watts~temp, data = watts.df)
#modelcheck(watts.fit1)

## Fit a quadratic relationship.

Ebike.fit2 = lm(watts~kph + I(kph^2), data=Ebike.df)
plot(Ebike.fit2,which=1)

modcheck(Ebike.fit2)

summary(Ebike.fit2)

## 
## Call:
## lm(formula = watts ~ kph + I(kph^2), data = Ebike.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.182  -5.054   1.478   6.219  24.628 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.80175   11.23558  -0.160    0.873    
## kph          9.92299    1.17888   8.417 2.52e-13 ***
## I(kph^2)     0.20630    0.02892   7.134 1.45e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.89 on 102 degrees of freedom
## Multiple R-squared:  0.9912, Adjusted R-squared:  0.991 
## F-statistic:  5739 on 2 and 102 DF,  p-value: < 2.2e-16

3 Prediction of power consumption ranging from 9 to 32 kph for BOTH models and find differences.

Pred.df=data.frame(kph=9:32)
pred.watts = predict(Ebike.fit2, Pred.df)
#pred.watts

pred.watts.linear = predict(Ebike.fit1, Pred.df)
#pred.watts.linear


Diff= pred.watts-pred.watts.linear
Diff

##          1          2          3          4          5          6          7 
## 17.7360920 13.3346268  9.3457635  5.7695020  2.6058424 -0.1452153 -2.4836712 
##          8          9         10         11         12         13         14 
## -4.4095252 -5.9227773 -7.0234276 -7.7114760 -7.9869226 -7.8497673 -7.3000101 
##         15         16         17         18         19         20         21 
## -6.3376511 -4.9626902 -3.1751275 -0.9749628  1.6378036  4.6631720  8.1011421 
##         22         23         24 
## 11.9517142 16.2148881 20.8906639

3.1 Plot the data with BOTH model superimposed over it.

# Have already predicted values over a range for the model above so can use the lines command to add these as the appropriate line/curve to the plot.

plot(watts ~ kph, data = Ebike.df)
lines(9:32, pred.watts, col="purple")
lines(9:32, pred.watts.linear, col="pink")

3.2 Which of the two models is the most correct according to the assumptions and would be the most accurate for prediction? Justify your answer.

Both models share similar assumption and each observation is independen, with equal variance (EOV) maintained. The R-squared value for the quadratic model is 0.9912, while for the linear model it is 0.9868, showing comparable levels of variability. However, when examining the residual plots, the linear model displays more curvature, whereas the quadratic model shows a straighter pattern, indicating more uniform data and consistent variance. Thus, the quadratic model is considered more accurate.

3.3 Write the equation of the model you chose above as if for Methods and Assumption Checks.

\(Ebike_i = \beta_0 + \beta_1 \times kph_i + \beta_2\times kph_i^2 + \epsilon_i\) where \(\epsilon_i \sim iid ~ N(0,\sigma^2)\)

3.4 In one sentence, explain the relationship between speed and power consumption from the simple linear model.

There is a positive linear relationship between the speed and power of electric bikes, this indicates that as the speed (kph) increases so does the power conumption (Watts).

3.5 Describe how the quadratic model further adjusts this relationship from the simple linear relationship. (You will find looking at your plot and the list of differences helpful here.)

3.6 If you wanted to just have a simple rule of thumb description for the relationship, which model would you use? Justify your answer.

The linear model has an easy to read relationship between the power conumption and the speed, therefore the model I would use would be the linear model.

STATS 201/8 Assignment 2

Marva MFAQ168

Due Date: 3pm, Tuesday 6th August 2024