2.28
What’s in the beer? The website beer100.com advertises itself as “Your Place for All Things Beer.” One of their “things” is a list of 159 domestic beer brands with the percent alcohol, calories per 12 ounces, and carbohydrates per 12 ounces (in grams).
Figure 2.12 gives a scatterplot of calories versus percent alcohol. Give a short summary of what can be learned from the plot.
# get a feel for the data (assumes library(tidyverse) has been loaded)
beerGraph <- beer %>%
  ggplot(aes(y = Calories, x = Alcohol)) +
  geom_point(alpha = .7, colour = "darkgreen") +
  labs(title = "Scatterplot of calories versus percent alcohol",
       y = "Calories per 12 ounces") +
  geom_smooth(method = "lm", se = FALSE)
beerGraph
The plot above shows a positive linear relationship between calories and percent alcohol. There is one clear outlier at the far left of the graph, where the percent alcohol is unusually low. Most of the points lie between 3 and 6 percent alcohol and 100 to 200 calories per 12 ounces; outside this range, the points are more scattered.
- One of the points is an outlier. Find the brand of the outlier. How is this brand of beer different from the other brands?
From the graph above, we know that the outlier lies at the minimum of the alcohol range. We will filter for this observation and identify its brand and brewery:
# filter outlier
outlier <- beer %>%
  filter(Alcohol == min(Alcohol))
outlier
## # A tibble: 1 x 6
## BEER Brewery Calories Carbohydrates Alcohol Type
## <chr> <chr> <int> <dbl> <dbl> <chr>
## 1 O'Doul's Anheuser Busch 70 13.3 0.4 Domestic
knitr::kable(outlier)
BEER | Brewery | Calories | Carbohydrates | Alcohol | Type |
---|---|---|---|---|---|
O’Doul’s | Anheuser Busch | 70 | 13.3 | 0.4 | Domestic |
The outlier is the beer O’Doul’s, brewed by Anheuser Busch, with a percent alcohol of only 0.4.
brand <- beer %>%
  ggplot(aes(x = Brewery, y = Alcohol, colour = Brewery)) +
  geom_jitter() +
  labs(title = "Percent alcohol by brewery") +
  coord_flip() +
  theme(legend.position = "none")
brand
The graph above shows the percent alcohol of each beer, grouped by brewery. Anheuser Busch is the only brewery with a beer below 3 percent alcohol: O’Doul’s, at 0.4 percent, which is marketed as a non-alcoholic beer. The brewery’s other beers fall in the typical alcohol range, so what sets this brand apart is this single near-zero-alcohol beer.
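As a quick check, we can list all Anheuser Busch beers in the data set (a sketch using the column names shown above):
# all beers from the outlier's brewery
beer %>%
  filter(Brewery == "Anheuser Busch") %>%
  select(BEER, Alcohol, Calories)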
- Remove the outlier from the data set and generate a scatterplot of the remaining data.
We can remove the outlier from the data set by filtering out the observation with the lowest percent alcohol:
# remove the outlier
beerRmOutlier <- beer %>%
  filter(Alcohol > min(Alcohol))
# plot the graph (grid.arrange below requires the gridExtra package)
beerRmGraph <- beerRmOutlier %>%
  ggplot(aes(y = Calories, x = Alcohol)) +
  geom_point(alpha = .7, colour = "darkblue") +
  geom_smooth(method = "lm", se = FALSE)
grid.arrange(beerGraph, beerRmGraph, ncol=2)
- Describe the relationship between calories and percent alcohol based on what you see in your scatterplot.
The two graphs are combined above: the left one is the scatterplot with the outlier, and the right one is without it. The plot with the outlier removed shows a similar positive linear relationship between alcohol and calories. Removing the outlier has no obvious effect here; we see little difference between the two graphs.
2.54
Alcohol and calories in beer revisited. Refer to the previous exercise. The data that you used to compute the correlation includes an outlier.
- Remove the outlier and recompute the correlation.
We will compare the correlation before and after the outlier is removed:
cor(beer$Calories, beer$Alcohol)
## [1] 0.9051336
cor(beerRmOutlier$Calories,beerRmOutlier$Alcohol)
## [1] 0.9081986
The correlation for the original data set is 0.905, and the correlation after removing the outlier is 0.908, so the difference is minuscule.
- Write a short paragraph about the possible effects of outliers on a correlation using this example to illustrate your ideas.
In this example, the effect of the outlier is very small: the correlation barely changes when it is removed. With 159 observations following a strong linear pattern, a single unusual point has limited influence on r. In a smaller data set, or with outliers lying much farther from the overall pattern, the effect can be far more dramatic. It is therefore worth checking how much an outlier actually influences the correlation before deciding whether removing it is justified.
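A minimal sketch with simulated data (not the beer data) illustrates how a single point far from the pattern can change r in a small sample:
# simulated illustration: one extreme point in a small sample
set.seed(1)
x <- rnorm(20)
y <- 2 * x + rnorm(20, sd = 0.5)
cor(x, y)                  # strong positive correlation on the clean data
cor(c(x, 10), c(y, -10))   # one point far off the pattern drags r down sharply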
2.83
Alcohol and calories in beer. Figure 2.12 (page 98) gives a scatterplot of calories versus percent alcohol in 159 brands of domestic beer.
- Find the equation of the least-squares regression line for these data.
We will use the lm function to find the least-squares regression line for the data.
modelBeer <- lm(beer$Calories ~ beer$Alcohol)
modelBeer
##
## Call:
## lm(formula = beer$Calories ~ beer$Alcohol)
##
## Coefficients:
## (Intercept) beer$Alcohol
## 6.419 28.347
The equation of the least-squares regression line is:
\[\hat{y} = 6.419 + 28.347x,\]
where \(x\) is percent alcohol and \(\hat{y}\) is the predicted calories per 12 ounces.
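For example, the fitted line predicts that a hypothetical beer with 5 percent alcohol contains about 148 calories per 12 ounces:
# predicted calories at 5 percent alcohol, using the fitted coefficients
6.419 + 28.347 * 5
## [1] 148.154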
- Find the value of \(r^2\) and interpret it in the regression context.
We use the summary function to find the R-squared value:
summary(modelBeer)
##
## Call:
## lm(formula = beer$Calories ~ beer$Alcohol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.500 -15.476 2.516 12.268 55.634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.419 5.758 1.115 0.267
## beer$Alcohol 28.347 1.063 26.677 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.78 on 157 degrees of freedom
## Multiple R-squared: 0.8193, Adjusted R-squared: 0.8181
## F-statistic: 711.7 on 1 and 157 DF, p-value: < 2.2e-16
The summary shows that the R-squared value is 0.82. This means that approximately 82% of the variation in calories is explained by the least-squares regression on percent alcohol.
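As a cross-check, \(r^2\) is simply the square of the correlation computed earlier:
# R-squared equals the squared correlation
cor(beer$Calories, beer$Alcohol)^2
## [1] 0.8192668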
- Write a short report on the relationship between calories and percent alcohol in beer. Include graphical and numerical summaries for each variable separately as well as graphical and numerical summaries for the relationship in your report.
We will draw a scatterplot to see the relationship between calories and percent alcohol more clearly:
beerGraph
The scatterplot shows a positive linear relationship with a slope of about 28: each additional percent of alcohol is associated with roughly 28 more calories per 12 ounces. The R-squared value of 82% means that the great majority of the variation in calories is explained by the linear model fitted above.
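The exercise also asks for graphical and numerical summaries of each variable separately. A sketch of how these could be produced, using the same column names (the bin count is an arbitrary choice):
# numerical summaries for each variable on its own
beer %>%
  summarise(meanCal = mean(Calories), sdCal = sd(Calories),
            meanAlc = mean(Alcohol), sdAlc = sd(Alcohol))
# histograms of each variable separately
beer %>% ggplot(aes(x = Calories)) + geom_histogram(bins = 20)
beer %>% ggplot(aes(x = Alcohol)) + geom_histogram(bins = 20)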
2.88
Metabolic rate and lean body mass. Compute the mean and the standard deviation of the metabolic rates and lean body masses in Exercise 2.37 (page 100) and the correlation between these two variables. Use these values to find the slope of the regression line of metabolic rate on lean body mass. Also find the slope of the regression line of lean body mass on metabolic rate. What are the units for each of the two slopes?
Answer
The mean, the standard deviation, and the correlation are summarised in the table below:
bodySum <- body %>%
  summarise(n = n(),
            meanMass = mean(Mass),
            meanRate = mean(Rate),
            sdMass = sd(Mass),
            sdRate = sd(Rate),
            cor = cor(Rate, Mass))
knitr::kable(bodySum)
n | meanMass | meanRate | sdMass | sdRate | cor |
---|---|---|---|---|---|
19 | 46.74211 | 1369.526 | 8.284411 | 257.5041 | 0.8647361 |
The mean lean body mass is about 46.7 kg, and the mean metabolic rate is about 1370 calories per 24 hours. The SD of lean body mass is 8.3 kg, and the SD of the metabolic rate is 257.5 calories. The correlation between body mass and metabolic rate is 0.86, which is reasonably high.
Use these values to find the slope of the regression line of metabolic rate on lean body mass.
The slope of the regression line of metabolic rate on lean body mass is:
modelBody <- lm(body$Rate ~ body$Mass)
modelBody
##
## Call:
## lm(formula = body$Rate ~ body$Mass)
##
## Coefficients:
## (Intercept) body$Mass
## 113.17 26.88
The output shows that the slope is about 26.9; its units are calories (of metabolic rate) per kilogram of lean body mass.
Also find the slope of the regression line of lean body mass on metabolic rate. What are the units for each of the two slopes?
The slope of the regression line of lean body mass on metabolic rate is:
lm(body$Mass ~ body$Rate)
##
## Call:
## lm(formula = body$Mass ~ body$Rate)
##
## Coefficients:
## (Intercept) body$Rate
## 8.64154 0.02782
The output shows that the slope is about 0.028; its units are kilograms of lean body mass per calorie of metabolic rate.
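The exercise asks us to find these slopes from the summary statistics. As a check, the formula \(b = r \, s_y / s_x\) reproduces both lm results using the values stored in bodySum:
# slope of Rate on Mass: b = r * sd(Rate) / sd(Mass)  (~26.88, matches lm)
with(bodySum, cor * sdRate / sdMass)
# slope of Mass on Rate: b = r * sd(Mass) / sd(Rate)  (~0.0278, matches lm)
with(bodySum, cor * sdMass / sdRate)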
2.144
Dwelling permits and sales for 23 countries. The Organisation for Economic Co-operation and Development collects data on main economic indicators (MEIs) for many countries. Each variable is recorded as an index with the year 2000 serving as a base year. This means that the variable for each year is reported as a ratio of the value for the year divided by the value for 2000. Use of indices in this way makes it easier to compare values for different countries. Table 2.3 gives the values of three MEIs for 23 countries.
- Make a scatterplot with sales as the response variable and permits issued for new dwellings as the explanatory variable. Describe the relationship.
We draw the scatterplot displaying the relationship between Dwelling Permits and Sales:
meis %>%
  ggplot(aes(x = DwellPermit, y = Sales)) +
  geom_point(alpha = 0.7, colour = "darkblue") +
  labs(title = "Scatterplot of Dwelling Permits versus Sales") +
  geom_smooth(method = "lm", se = FALSE)
Are there any outliers or influential observations?
The relationship appears weak and positive, with a great deal of scatter around the line. There are a few potential outliers, especially the points lying near 100-110 and around 160 on the Sales axis.
- Find the least-squares regression line and add it to your plot.
We use the lm function to find the least-squares regression line:
modelMeis <- lm(meis$Sales ~ meis$DwellPermit)
modelMeis
##
## Call:
## lm(formula = meis$Sales ~ meis$DwellPermit)
##
## Coefficients:
## (Intercept) meis$DwellPermit
## 109.8204 0.1263
The least-squares regression line of Sales on Dwelling Permits is: \[\hat{y} = 109.8204 + 0.1263x\]
- Interpret the slope of the line in the context of this exercise.
The slope indicates how much the response variable (Sales) is expected to change as the explanatory variable (Dwelling Permits) increases by one unit. Here the slope, 0.13, is positive but small: as the dwelling-permit index increases by one point, the sales index is predicted to increase by only about 0.13 points.
- Interpret the intercept of the line in the context of this exercise. Explain whether or not this interpretation is useful in explaining the relationship between these two variables.
The intercept has no meaningful interpretation in this context. Literally, it says that a country whose dwelling-permit index is zero would have a predicted sales index of about 110, but an index of zero lies far outside the range of the data. The intercept is therefore not useful for explaining the relationship between these two variables.
- What is the predicted value of sales for a country that has an index of 224 for dwelling permits?
- Canada has an index of 224 for dwelling permits. Find the residual for this country.
To answer both questions, we filter for the country with an index of 224 (Canada); the augment function from the broom package provides the fitted values and residuals:
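The code below assumes the augmented table meisTidy was built from the fitted model, for example as follows (older broom versions name the columns meis.Sales/meis.DwellPermit as shown in the table; newer versions need se_fit = TRUE to include the .se.fit column):
# assumed earlier step, not shown in the original: augment the fitted model
meisTidy <- broom::augment(modelMeis, se_fit = TRUE)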
# filter the country that has index 224
meis224 <- meisTidy %>%
  filter(meis.DwellPermit == 224)
# show the first five columns of the augmented table
knitr::kable(meis224[ , 1:5])
meis.Sales | meis.DwellPermit | .fitted | .se.fit | .resid |
---|---|---|---|---|
122 | 224 | 138.1224 | 8.847853 | -16.12235 |
The table above shows that Canada, with an index of 224, has a predicted sales value of 138.12 and a residual of -16.12.
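A hand check with the rounded coefficients gives nearly the same numbers (the small discrepancy from the table is due to rounding):
# predicted sales index for a dwelling-permit index of 224
109.8204 + 0.1263 * 224
## [1] 138.1116
# residual: observed minus predicted
122 - 138.1116
## [1] -16.1116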
- What percent of the variation in sales is explained by dwelling permits?
To answer this question, we need to find the R squared value of the model:
summary(modelMeis)
##
## Call:
## lm(formula = meis$Sales ~ meis$DwellPermit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.036 -16.122 1.556 11.397 32.859
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 109.8204 11.5582 9.501 1.19e-08 ***
## meis$DwellPermit 0.1263 0.0857 1.474 0.157
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.7 on 19 degrees of freedom
## Multiple R-squared: 0.1027, Adjusted R-squared: 0.05543
## F-statistic: 2.174 on 1 and 19 DF, p-value: 0.1568
As shown above, the R-squared of the model is about 0.10, which is very low: only about 10% of the variation in sales is explained by the regression on dwelling permits, and the remaining 90% is due to other, unexplained factors. Note also that the slope is not statistically significant (p = 0.157).
[F] Prof. Tucker’s BIG DATA Challenge questions… In Canada, are women better drivers than men? Are older drivers safer than younger ones? What weather conditions are the most dangerous for driving?
- We’re still at the descriptive-statistics stage of the course, so tables of data, basic descriptive statistics, and some histograms/bar charts will be used to answer the three questions posed above.
Are women better drivers than men?
We draw the graph to see the number of accidents based on sex:
# ncdb: the collision data set; P_SEX records the person's sex
ncdb %>%
  ggplot(aes(x = P_SEX)) +
  labs(title = "Number of Accidents based on Sex",
       x = "Sex",
       y = "Number of Accidents") +
  geom_bar()
We ignore the U and N categories on the x-axis, since no statistical conclusion can be drawn from them. The graph shows that males are involved in more accidents than females. On its own, this is not strong enough evidence to conclude that females are safer drivers: there may simply be more male drivers than female drivers, in which case males would account for more accidents even at equal skill. Still, given the large amount of historical data collected over many years, we can at least note that females tend to be involved in fewer recorded accidents than males.
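A simple frequency table gives the same information numerically (a sketch using the column shown above):
# counts of accident records by sex
ncdb %>%
  count(P_SEX, sort = TRUE)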
Are older drivers safer than younger ones?
ncdbAge <- ncdb %>%
  as_tibble() %>%
  ggplot(aes(x = P_AGE)) +
  labs(title = "Histogram of Accidents based on Ages",
       x = "Ages",
       y = "Number of Accidents") +
  geom_bar(na.rm = TRUE)
ncdbAge
We will ignore the very tall bar at the far right of the graph, since it most likely corresponds to an unknown-age code rather than a real age. The exact ages are hard to read off this graph, but the bulk of the distribution clearly lies to the left: there are more accidents involving younger people than older ones. As with sex, this alone is not enough to conclude that older drivers are safer; there may simply be more young drivers than older ones, which would lead to a higher number of accidents among the young. Nevertheless, we can be fairly confident that age has some predictive power for the number of accidents.
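Basic descriptive statistics for age complement the graph (a sketch that assumes P_AGE is numeric, with unknown ages recorded as NA):
# summary statistics for the age of people involved in accidents
ncdb %>%
  filter(!is.na(P_AGE)) %>%
  summarise(meanAge = mean(P_AGE),
            medianAge = median(P_AGE))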
What weather conditions are the most dangerous for driving?
Now we graph the number of accidents based on Weather:
ncdb %>%
  ggplot(aes(x = C_WTHR)) +
  labs(title = "Number of Accidents based on Weather",
       x = "Weather conditions",
       y = "Number of Accidents") +
  geom_bar(na.rm = TRUE)
Interestingly, most accidents fall under category "1", which indicates clear and sunny weather. Seemingly more dangerous conditions such as cloud, rain, or snow account for far fewer accidents. This is probably because there are many more cars on the road in good weather, which leads to a higher raw number of accidents. Without more evidence, we should avoid speculating about other explanations.
There are three findings we can draw from the data. First, females are involved in fewer accidents than males. Second, more accidents occurred in good weather than in bad weather. Third, younger people are involved in more accidents than older ones. However, this evidence is not enough to show that women or older people are safer drivers than men and younger people. To conclude that, we would have to compare the number of accidents against demographic statistics on drivers: for example, the ratio of young drivers' accidents to the total number of young people who drive. Since this data set does not include such demographic information, we cannot draw any conclusion about a causal relationship between age (or sex) and safety.
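If demographic counts were available, the rate computation would look something like this sketch (the drivers table and its numbers are entirely hypothetical):
# hypothetical table of licensed drivers by sex (made-up totals, for illustration only)
drivers <- tibble(P_SEX = c("F", "M"),
                  licensed = c(11e6, 12e6))
# accidents per licensed driver, by sex
ncdb %>%
  filter(P_SEX %in% c("F", "M")) %>%
  count(P_SEX) %>%
  left_join(drivers, by = "P_SEX") %>%
  mutate(ratePerDriver = n / licensed)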