606 Homework 7

7.23 Tourism spending.

The Association of Turkish Travel Agencies reports the number of foreign tourists visiting Turkey and tourist spending by year. Three plots are provided: a scatterplot showing the relationship between these two variables along with the least squares fit, a residuals plot, and a histogram of residuals.

Describe the relationship between number of tourists and spending.

Positive and nearly perfectly linear.
What are the explanatory and response variables?

Explanatory variable is number of tourists.
Response variable is spending.
Why might we want to fit a regression line to these data?

To predict the amount of spending Turkey could expect given an increase or decrease in tourism.
Do the data meet the conditions required for fitting a least squares line? In addition to the scatterplot, use the residual plot and histogram to answer this question.

Not really. The residual plot does not show constant variability: the spread of the residuals changes across the range of the data.

7.24 Nutrition at Starbucks, Part I.

The scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we are interested in predicting the amount of carbs a menu item has based on its calorie content.

Describe the relationship between number of calories and amount of carbohydrates (in grams) that Starbucks food menu items contain.

Loose, at best. I would guess R to be around 0.5.
In this scenario, what are the explanatory and response variables?

Explanatory variable is calories.
Response variable is carbs.
Why might we want to fit a regression line to these data?

To understand how many carbs might be in a product based on the number of calories.
Do these data meet the conditions required for fitting a least squares line?

No. The residuals do not show constant variability in the residual plot.

7.25 The Coast Starlight, Part II.

Exercise 7.13 introduces data on the Coast Starlight Amtrak train that runs from Seattle to Los Angeles. The mean travel time from one stop to the next on the Coast Starlight is 129 mins, with a standard deviation of 113 minutes. The mean distance traveled from one stop to the next is 108 miles with a standard deviation of 99 miles. The correlation between travel time and distance is 0.636.

Write the equation of the regression line for predicting travel time.

y = time (minutes)
x = distance (miles)

Equation for slope using SD
$b_1 = \frac{s_y}{s_x} \cdot R$

$b_1 = \frac{113}{99} \cdot 0.636$

$b_1 = 0.726$

Point-slope equation for regression line, using the point of means (108, 129)
$y - y_0 = b_1 \cdot (x - x_0)$

$y - 129 = 0.726 \cdot (x - 108)$

$y - 129 = 0.726x - 78.41$

Slope-intercept equation for regression line
$y = 0.726x + 50.59$
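The same line can be checked in R from the summary statistics alone. This is a minimal sketch: the variable names are illustrative, and the numbers are the ones quoted in the exercise text.

```r
# Summary statistics quoted in the exercise text
mean_x <- 108; sd_x <- 99    # distance between stops (miles)
mean_y <- 129; sd_y <- 113   # travel time between stops (minutes)
r <- 0.636                   # correlation between time and distance

# Slope from the SDs and correlation; intercept from the point of means
b1 <- sd_y / sd_x * r
b0 <- mean_y - b1 * mean_x

round(c(intercept = b0, slope = b1), 2)
```

Carrying the unrounded slope through gives an intercept near 50.60 rather than 50.59. The difference comes only from rounding the slope to 0.726 in the hand calculation.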
Interpret the slope and the intercept in this context.

The slope is 0.726: each additional mile between stops is associated with about 0.726 additional minutes of travel time, on average. The line crosses the y axis at 50.59.

Taken literally, the intercept says that a trip of 0 miles takes almost an hour (time = 50.59 when distance = 0). That makes no physical sense, which illustrates that the intercept often just anchors the line and cannot be interpreted literally.

Calculate R² of the regression line for predicting travel time from distance traveled for the Coast Starlight, and interpret R² in the context of the application.

The correlation (R) is 0.636, so R² = 0.636² ≈ 0.404, meaning that about 40% of the variability in travel time is explained by distance.
The distance between Santa Barbara and Los Angeles is 103 miles. Use the model to estimate the time it takes for the Starlight to travel between these two cities.

y = 0.726x + 50.59
y = 0.726 $\cdot$ 103 + 50.59

y = 125.4 minutes
It actually takes the Coast Starlight about 168 mins to travel from Santa Barbara to Los Angeles. Calculate the residual and explain the meaning of this residual value.

The residual for this point is 168 - 125.4 = 42.6 minutes. It is positive, meaning that the model underestimated the actual travel time by 42.6 minutes.
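The prediction and residual can be reproduced with a few lines of R. This sketch uses the rounded coefficients from the fitted line and takes the residual as observed minus predicted; the variable names are illustrative.

```r
# Rounded coefficients from the fitted line
b0 <- 50.59
b1 <- 0.726

distance <- 103   # miles, Santa Barbara to Los Angeles
observed <- 168   # minutes, actual travel time

predicted <- b0 + b1 * distance
residual  <- observed - predicted   # observed minus predicted

round(c(predicted = predicted, residual = residual), 1)
```

A positive residual means the observed point lies above the regression line.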
Suppose Amtrak is considering adding a stop to the Coast Starlight 500 miles away from Los Angeles. Would it be appropriate to use this linear model to predict the travel time from Los Angeles to this point?

The question does not say in which direction the new stop would lie, so it is hard to judge how similar the travel conditions would be. 500 miles to the south would mean crossing the border, and other directions would mean crossing mountains. We also do not know the travel conditions along the existing route to Seattle.

More generally, the question asks for a prediction well outside the range of the data used to fit the model: 500 miles is about four standard deviations above the mean stop-to-stop distance of 108 miles. So ultimately the answer is no, because this would be extrapolation.
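One rough way to gauge how far outside the data a 500-mile stop would be is to express it in standard deviations from the mean distance. This is only a sketch based on the summary statistics in the exercise; the raw distances are not available here, so the actual observed range is not checked.

```r
mean_x <- 108   # mean stop-to-stop distance (miles)
sd_x   <- 99    # standard deviation of distance (miles)
new_x  <- 500   # proposed distance (miles)

# Number of standard deviations above the mean
z <- (new_x - mean_x) / sd_x
round(z, 2)
```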

7.26 Body measurements, Part III.

Exercise 7.15 introduces data on shoulder girth and height of a group of individuals. The mean shoulder girth is 107.20 cm with a standard deviation of 10.37 cm. The mean height is 171.14 cm with a standard deviation of 9.41 cm. The correlation between height and shoulder girth is 0.67.

Write the equation of the regression line for predicting height.

y = height (cm)
x = girth (cm)

Equation for slope using SD
$b_1 = \frac{s_y}{s_x} \cdot R$

$b_1 = \frac{9.41}{10.37} \cdot 0.67$

$b_1 = 0.608$

Point-slope equation for regression line, using the point of means (107.2, 171.14)
$y - y_0 = b_1 \cdot (x - x_0)$

$y - 171.14 = 0.608 \cdot (x - 107.2)$

$y - 171.14 = 0.608x - 65.18$

Slope-intercept equation for regression line
$y = 0.608x + 105.96$
Interpret the slope and the intercept in this context.

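The slope is 0.608: each additional centimeter of shoulder girth is associated with about 0.608 cm of additional height, on average. The line crosses the y axis at 105.96, which would be the predicted height of a person with a shoulder girth of 0 cm. That value is not meaningful on its own; the intercept only anchors the line.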

Calculate R² of the regression line for predicting height from shoulder girth, and interpret it in the context of the application.

The correlation (R) is 0.67, so R² = 0.67² ≈ 0.449, meaning that about 45% of the variability in height is explained by shoulder girth.
A randomly selected student from your class has a shoulder girth of 100 cm. Predict the height of this student using the model.

y = 0.608x + 105.96
y = 0.608 $\cdot$ 100 + 105.96

Predicted height based on this model is about 167 cm (166.76 cm).
The student from part (d) is 160 cm tall. Calculate the residual, and explain what this residual means.

The residual for this observation is 160 - 167 = -7 cm (observed minus predicted), meaning that the model overestimated the observed height by about 7 cm.
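The whole chain for this exercise (slope, intercept, prediction, residual) can be sketched in R from the summary statistics. The variable names are illustrative, and the residual follows the observed-minus-predicted convention.

```r
# Summary statistics quoted in the exercise text
mean_girth  <- 107.20; sd_girth  <- 10.37   # shoulder girth (cm)
mean_height <- 171.14; sd_height <- 9.41    # height (cm)
r <- 0.67

# Least squares line from summary statistics
b1 <- sd_height / sd_girth * r
b0 <- mean_height - b1 * mean_girth

# Prediction and residual for a student with girth 100 cm and height 160 cm
predicted <- b0 + b1 * 100
residual  <- 160 - predicted   # observed minus predicted

round(c(slope = b1, intercept = b0, predicted = predicted, residual = residual), 2)
```

A negative residual means the observed height falls below the line, so the model overestimates this student's height.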
A one year old has a shoulder girth of 56 cm. Would it be appropriate to use this linear model to predict the height of this child?

The age of the participants is not specified, so we cannot rule out using the model on that basis alone. However, given that the shortest person in the sample is over 4 1/2 feet tall, it is reasonable to assume that no one-year-olds were included.

That being the case, we should not use the model to predict the height of the child. A shoulder girth of 56 cm is about five standard deviations below the sample mean of 107.20 cm, so the prediction would be an extrapolation, and the data tell us nothing about how age affects height or girth.

7.29 Murders and poverty, Part I.

The following regression output is for predicting annual murders per million from percentage living in poverty in a random sample of 20 metropolitan areas.

Write out the linear model.

$\widehat{murders} = \beta_0 + \beta_1 \cdot \%poverty$

$\widehat{murders} = -29.901 + 2.559 \cdot \%poverty$

y = 2.559x - 29.901
Interpret the intercept.

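The line crosses the y axis at -29.901: in a metropolitan area with 0% poverty, the model predicts -29.901 murders per million. A negative murder rate is not possible, so the intercept only anchors the line and has no meaningful interpretation here.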

Interpret the slope.

For each additional percentage point of poverty, the predicted murder rate increases by about 2.56 murders per million people.
Interpret R².

R² is 70.52%, which means that poverty level explains about 71% of the variability in murder rates across these metropolitan areas.
Calculate the correlation coefficient.

a <- round(sqrt(0.7052), 4)

paste("The correlation coefficient is the square root of R^2, with the sign of the slope (positive here). In this example the correlation coefficient is", a)

## [1] "The correlation coefficient is the square root of R^2, with the sign of the slope (positive here). In this example the correlation coefficient is 0.8398"

7.30 Cats, Part I.

The following regression output is for predicting the heart weight (in g) of cats from their body weight (in kg). The coefficients are estimated using a dataset of 144 domestic cats.

Write out the linear model.

$\widehat{\text{heart weight}} = \beta_0 + \beta_1 \cdot \text{body weight}$

$\widehat{\text{heart weight}} = -0.357 + 4.034 \cdot \text{body weight}$

y = 4.034x - 0.357
Interpret the intercept.

The line crosses the y axis at -0.357: a cat with a body weight of 0 kg would have a predicted heart weight of -0.357 g. Neither value is possible, so the intercept only anchors the line.

Interpret the slope.

For every 1 kg increase in body weight, the predicted heart weight increases by about 4.034 g.

Interpret R².

R² is 64.66%, which means that body weight explains about 65% of the variability in the cats' heart weights.
Calculate the correlation coefficient.

a <- round(sqrt(0.6466), 4)

paste("The correlation coefficient is the square root of R^2, with the sign of the slope (positive here). In this example the correlation coefficient is", a)

## [1] "The correlation coefficient is the square root of R^2, with the sign of the slope (positive here). In this example the correlation coefficient is 0.8041"

7.39 Urban homeowners, Part II.

Exercise 7.33 gives a scatterplot displaying the relationship between the percent of families that own their home and the percent of the population living in urban areas. Below is a similar scatterplot, excluding District of Columbia, as well as the residuals plot. There were 51 cases.

For these data, R² = 0.28. What is the correlation?

a <- sqrt(0.28)

paste("The magnitude of the correlation coefficient is the square root of R^2. In this example the magnitude is", a)

## [1] "The magnitude of the correlation coefficient is the square root of R^2. In this example the magnitude is 0.529150262212918"

How can you tell if it is positive or negative?

The correlation is negative because the trend line runs from the upper left to the lower right, so R ≈ -0.53.
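Since the square root of R² only gives the magnitude, the sign has to be supplied from the direction of the trend. A minimal sketch, where `trend_sign` is an illustrative variable set by reading the scatterplot rather than anything computed from the data:

```r
r_squared  <- 0.28
trend_sign <- -1   # assumption: -1 for a downward trend, +1 for an upward one

# Signed correlation coefficient
r <- trend_sign * sqrt(r_squared)
round(r, 3)
```

With a fitted model available, the sign of the slope estimate would serve the same purpose.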

Examine the residual plot. What do you observe?

The spread of the residuals increases as the independent variable increases.

Is a simple least squares fit appropriate for these data?

No. The variability of the residuals is not constant (it increases), which violates the constant variability condition, so a simple least squares fit is not appropriate.

7.40 Rate my professor.

Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching related characteristics, such as the physical appearance of the instructor. Researchers at University of Texas, Austin collected data on teaching evaluation score (higher score means better) and standardized beauty score (a score of 0 means average, negative score means below average, and a positive score means above average) for a sample of 463 professors. The scatterplot below shows the relationship between these variables, and also provided is a regression output for predicting teaching evaluation score from beauty score.

Given that the average standardized beauty score is -0.0883 and average teaching evaluation score is 3.9983, calculate the slope. Alternatively, the slope may be computed using just the information provided in the model summary table.

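The value of x at the y intercept (4.010) is 0, so the first point is (0, 4.010).
The regression line also passes through the means of both variables, so the second point is (-0.0883, 3.9983).

$m = \frac{y_1 - y_0}{x_1 - x_0}$
$m = \frac{3.9983 - 4.010}{-0.0883 - 0}$
$m = \frac{-0.0117}{-0.0883}$

$m = 0.133$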
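The two-point slope calculation is easy to verify in R. A small sketch using the intercept and the point of means given in the exercise; the variable names are illustrative.

```r
# Two points on the fitted line
x0 <- 0;       y0 <- 4.010    # intercept
x1 <- -0.0883; y1 <- 3.9983   # point of means (beauty, evaluation score)

# Rise over run
m <- (y1 - y0) / (x1 - x0)
round(m, 3)
```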

Do these data provide convincing evidence that the slope of the relationship between teaching evaluation and beauty is positive? Explain your reasoning.

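$T = \frac{\text{point estimate} - \text{null value}}{SE}$

$T = \frac{0.133 - 0}{SE}$, with SE taken from the model summary table

$T = 4.13$

With 463 - 2 = 461 degrees of freedom, the one-sided p-value P(t > 4.13) is essentially 0.

While my eyes say otherwise when looking at the scatterplot, the math suggests that there is convincing evidence that the slope is positive.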
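The p-value for this test can be obtained from the t statistic with R's `pt()`. This sketch assumes a one-sided alternative (slope greater than 0) and n - 2 degrees of freedom for simple linear regression; the variable names are illustrative.

```r
t_stat <- 4.13
n      <- 463
df     <- n - 2   # two estimated coefficients: intercept and slope

# One-sided p-value: upper-tail area beyond the observed t statistic
p_value <- pt(t_stat, df = df, lower.tail = FALSE)
p_value < 0.001
```

Using `lower.tail = FALSE` returns the upper-tail area directly, while the default lower-tail call returns a value close to 1 for a large positive t statistic.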
List the conditions required for linear regression and check if each one is satisfied for this model based on the following diagnostic plots.
1. Linearity: Yes. Data follow a fairly linear trend.
2. Nearly normal residuals: Yes. Residuals are fairly normally distributed.
3. Constant variability: No. The spread of the residuals does not look constant in the residual plot.