#install.packages("tidyverse")
#install.packages("Stat2Data")
#install.packages("skimr")
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.4
## Warning: package 'ggplot2' was built under R version 3.4.4
## Warning: package 'tibble' was built under R version 3.4.4
## Warning: package 'tidyr' was built under R version 3.4.4
## Warning: package 'readr' was built under R version 3.4.4
## Warning: package 'purrr' was built under R version 3.4.4
## Warning: package 'dplyr' was built under R version 3.4.4
## Warning: package 'stringr' was built under R version 3.4.4
## Warning: package 'forcats' was built under R version 3.4.4
library(Stat2Data)
library(skimr)
## Warning: package 'skimr' was built under R version 3.4.4
1.8 Breakfast cereal. The number of calories and number of grams of sugar per serving were measured for 36 breakfast cereals. The data are in the file Cereal. We are interested in trying to predict the calories using the sugar content.
data(Cereal)
head(Cereal)
## Cereal Calories Sugar Fiber
## 1 Common Sense Oat Bran 100 6 3
## 2 Product 19 100 3 1
## 3 All Bran Xtra Fiber 50 0 14
## 4 Just Right 140 9 2
## 5 Original Oat Bran 70 5 10
## 6 Heartwise 90 5 6
attach(Cereal) # attaches Cereal so its columns can be used by name without the Cereal$ prefix
## The following object is masked _by_ .GlobalEnv:
##
## Cereal
A. Make a scatterplot and comment on what you see.
plot(Sugar, Calories, main="Scatterplot of Calories and Sugar of Cereals",
xlab="Sugar (grams per serving)", ylab="Calories (per serving)", pch=19)
abline(lm(Calories~Sugar))#adds a line of the simple linear regression of Calories and Sugar to the graph
Comments: The scatterplot shows that Sugar and Calories have a modest positive relationship: as the grams of sugar per serving increase, the calories per serving tend to increase as well. There are no obvious outliers in the scatterplot. I added the least squares regression line to the graph to make the relationship easier to see.
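The attach() call above prints a masking message, and attached columns are easy to shadow by accident. A minimal sketch of an alternative is to pass the data frame through the data argument instead. The data frame below is rebuilt from only the six rows printed by head(Cereal), so its coefficients will not match the 36-cereal fit; the names cereal_head and fit_head are made up for illustration.

```r
# Six rows copied from the head(Cereal) output; NOT the full 36-cereal dataset.
cereal_head <- data.frame(
  Cereal   = c("Common Sense Oat Bran", "Product 19", "All Bran Xtra Fiber",
               "Just Right", "Original Oat Bran", "Heartwise"),
  Calories = c(100, 100, 50, 140, 70, 90),
  Sugar    = c(6, 3, 0, 9, 5, 5)
)

# data = tells lm() where to find Calories and Sugar, so attach() is not needed.
fit_head <- lm(Calories ~ Sugar, data = cereal_head)
coef(fit_head)
```

The formula interface of plot() follows the same pattern, for example plot(Calories ~ Sugar, data = cereal_head).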
B. Find the least squares regression line for predicting calories based on sugar content.
LeastSquareLine<-(lm(Calories~Sugar))
LeastSquareLine
##
## Call:
## lm(formula = Calories ~ Sugar)
##
## Coefficients:
## (Intercept) Sugar
## 87.428 2.481
summary(LeastSquareLine)
##
## Call:
## lm(formula = Calories ~ Sugar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.428 -9.832 0.245 8.909 40.322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 87.4277 5.1627 16.935 <2e-16 ***
## Sugar 2.4808 0.7074 3.507 0.0013 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.27 on 34 degrees of freedom
## Multiple R-squared: 0.2656, Adjusted R-squared: 0.244
## F-statistic: 12.3 on 1 and 34 DF, p-value: 0.001296
Comments: The least squares regression line is the simple linear regression line, which R fits with lm(response variable~predictor variable). From this code, we were able to find that the line is hat(Calories) = 87.4277 + 2.4808*(grams of sugar). This means that the predicted number of calories in a serving of cereal increases by 2.4808 for each additional gram of sugar per serving. Calling summary() on the fitted model also reports the standard errors, p-values, and R^2, which show how well the predictor explains the response.
C. Interpret the value (not just the sign) of the slope of the fitted model in the context of this setting.
Comments: As stated in the comments for part B, the line is hat(Calories) = 87.4277 + 2.4808*(grams of sugar). This means that the predicted number of calories in a serving of cereal increases by 2.4808 for each additional gram of sugar per serving. This is a positive relationship between the number of calories in a serving of cereal and the number of grams of sugar in that same serving. The p-value for Sugar (0.0013) shows that the grams of sugar in a serving is a statistically significant predictor of the calories in the cereal, while the R^2 is only 0.2656, which might indicate that we are missing other significant variables.
More breakfast cereal. Refer to the data on breakfast cereals in Cereal that is described in Exercise 1.8. The number of calories and number of grams of sugar per serving were measured for 36 breakfast cereals. The data are in the file Cereal. We are interested in trying to predict the calories using the sugar content.
A. How many calories would the fitted model predict for a cereal that has 10 grams of sugar?
LeastSquareLine<-(lm(Calories~Sugar))
LeastSquareLine
##
## Call:
## lm(formula = Calories ~ Sugar)
##
## Coefficients:
## (Intercept) Sugar
## 87.428 2.481
CaloriesPredicted<-87.428 +2.481*(10)
CaloriesPredicted
## [1] 112.238
Comments: The equation of the least squares line is hat(Calories)=87.428+2.481*(grams of sugar per serving). To predict the calories for a cereal that has 10 grams of sugar, we put 10 in place of (grams of sugar per serving). As shown above, the fitted model predicts 112.238 calories per serving for a cereal with 10 grams of sugar.
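The hand calculation above types in coefficients rounded to three decimals. As a small sketch, the same prediction can be wrapped in a helper that uses the four-decimal estimates from the summary() output; predict_calories is a hypothetical name introduced only for this example.

```r
# Estimates copied from the summary() output above (four decimals).
b0 <- 87.4277  # intercept
b1 <- 2.4808   # slope for Sugar

# Hypothetical helper: predicted calories for a given number of grams of sugar.
predict_calories <- function(sugar) b0 + b1 * sugar

predict_calories(10)

# With the fitted model still in the session, the equivalent call would be:
# predict(LeastSquareLine, newdata = data.frame(Sugar = 10))
```

Using predict() on the model object avoids retyping coefficients at all, so no rounding enters the prediction.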
B. Cheerios has 110 calories but just 1 gram of sugar. Find the residual for this data point.
ExpectedCheerios<-87.428 +2.481*(1)
ExpectedCheerios
## [1] 89.909
ObservedCheerios<-110
ResidualsCheerios<-(ObservedCheerios-ExpectedCheerios)
ResidualsCheerios
## [1] 20.091
Comments: A residual is the observed value minus the expected (fitted) value. For Cheerios, the observed number of calories is 110 per serving and, as shown above in the code "ExpectedCheerios<-87.428 +2.481*(1)", the expected number of calories according to our fitted model is 89.909 per serving. The residual for Cheerios is therefore 20.091, meaning that Cheerios has 20.091 more calories per serving than the model predicts. Another way to get this would be to find which row Cheerios occupies in the Cereal dataset (row 14) and take the 14th residual of the fitted model LeastSquareLine<-(lm(Calories~Sugar)). The code for this would be residuals(LeastSquareLine)[14]; the result may differ slightly in the last digits because the hand calculation uses rounded coefficients.
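The row-lookup approach can be sketched as follows. Cheerios is not among the six rows printed by head(Cereal), so this illustration looks up "Just Right" in a data frame rebuilt from those six rows; the fit is therefore not the 36-cereal model, and cereal_head, fit_head, and row are names made up for the example.

```r
# Six rows copied from the head(Cereal) output; NOT the full 36-cereal dataset.
cereal_head <- data.frame(
  Cereal   = c("Common Sense Oat Bran", "Product 19", "All Bran Xtra Fiber",
               "Just Right", "Original Oat Bran", "Heartwise"),
  Calories = c(100, 100, 50, 140, 70, 90),
  Sugar    = c(6, 3, 0, 9, 5, 5)
)
fit_head <- lm(Calories ~ Sugar, data = cereal_head)

# Find the row by name rather than counting rows by eye.
row <- which(cereal_head$Cereal == "Just Right")

# residuals() returns observed minus fitted for every row; index it by that row.
res_from_model <- unname(residuals(fit_head)[row])
res_by_hand    <- cereal_head$Calories[row] - unname(fitted(fit_head)[row])
c(res_from_model, res_by_hand)
```

Both numbers agree, which is the observed-minus-fitted definition applied to a single row.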
C. Does the linear regression model appear to be a good summary of the relationship between calories and sugar content of breakfast cereals?
summary(LeastSquareLine)
##
## Call:
## lm(formula = Calories ~ Sugar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.428 -9.832 0.245 8.909 40.322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 87.4277 5.1627 16.935 <2e-16 ***
## Sugar 2.4808 0.7074 3.507 0.0013 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.27 on 34 degrees of freedom
## Multiple R-squared: 0.2656, Adjusted R-squared: 0.244
## F-statistic: 12.3 on 1 and 34 DF, p-value: 0.001296
Comments: Looking at the summary again, we can see that the p-value for Sugar is 0.0013. Since it is less than an alpha of 0.05, we can conclude that Sugar is a statistically significant predictor of the calories in a serving of cereal. The only problem is that the Multiple R-squared is only 0.2656, which means that the linear regression on Sugar accounts for only 26.56% of the variation in calories. This is pretty okay for a single-variable, real-world linear regression model, so I will say that it is a good summary.
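In simple linear regression the F-statistic is the square of the slope's t value, and R-squared can be recovered from F and the residual degrees of freedom. The short check below, using only numbers copied from the summary() output, is a sketch of how those pieces fit together; the variable names are illustrative.

```r
t_sugar <- 3.507  # t value for Sugar, copied from summary()
df_res  <- 34     # residual degrees of freedom, copied from summary()

f_stat <- t_sugar^2                  # F = t^2 when there is one predictor
r2     <- f_stat / (f_stat + df_res) # R^2 = F / (F + residual df)
r      <- sqrt(r2)                   # correlation; positive because the slope is positive

round(c(F = f_stat, R2 = r2, r = r), 4)
```

The values agree with the reported F-statistic of 12.3 and Multiple R-squared of 0.2656 up to rounding.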
1.13 Caterpillar nitrogen assimilation versus mass. The Nassim variable in the Caterpillars dataset measures nitrogen assimilation, which might be associated with the size of the caterpillars as measured with Mass. Use the data to examine this relationship as outlined below.
data(Caterpillars)
head(Caterpillars)
## Instar ActiveFeeding Fgp Mgp Mass LogMass Intake LogIntake
## 1 1 Y Y Y 0.002064 -2.685290 0.165118 -0.7822056
## 2 1 Y N N 0.005191 -2.284749 0.201008 -0.6967867
## 3 2 N Y N 0.005603 -2.251579 0.189125 -0.7232511
## 4 2 Y N N 0.019300 -1.714443 0.283280 -0.5477841
## 5 2 N Y Y 0.029300 -1.533132 0.259569 -0.5857472
## 6 3 Y Y N 0.062600 -1.203426 0.327864 -0.4843063
## WetFrass LogWetFrass DryFrass LogDryFrass Cassim LogCassim Nfrass
## 1 0.000241 -3.617983 0.000208 -3.681937 0.01422378 -1.846985 6.61e-06
## 2 0.000063 -4.200659 0.000061 -4.214670 0.01739189 -1.759653 1.03e-06
## 3 0.001401 -2.853562 0.000969 -3.013676 0.01639923 -1.785177 2.78e-05
## 4 0.002045 -2.689307 0.001834 -2.736601 0.02392468 -1.621154 4.64e-05
## 5 0.005377 -2.269460 0.003523 -2.453087 0.02122857 -1.673079 9.97e-05
## 6 0.029500 -1.530178 0.000789 -3.102923 0.02836365 -1.547238 1.84e-05
## LogNfrass Nassim LogNassim
## 1 -5.179510 0.001858999 -2.730721
## 2 -5.986783 0.002270091 -2.643957
## 3 -4.555794 0.002302210 -2.637855
## 4 -4.333480 0.003041352 -2.516933
## 5 -4.001301 0.002791898 -2.554100
## 6 -4.735567 0.003627464 -2.440397
A. Produce a scatterplot for predicting nitrogen assimilation (Nassim) based on Mass. Comment on any patterns.
attach (Caterpillars)
plot(Mass, Nassim, main="Scatterplot of Nitrogen Assimilation and Mass of Caterpillars",
xlab="Mass (grams)", ylab="Nitrogen Assimilation (ingestion - excretion)", pch=19)
abline(lm(Nassim~Mass))#adds a line of the simple linear regression of Mass and Nassim to the graph
Comments: The scatterplot shows that the Nitrogen Assimilation increases with the mass of the caterpillar until the caterpillar reaches a mass of around 6 grams, at which point the Nitrogen Assimilation decreases while the mass still increases. This gives the scatterplot the shape of a frown, or an upside-down U. As a side note, the largest cluster of caterpillars is at around 0-2 grams in mass and 0-0.02 Nitrogen Assimilation.
B. Produce a similar plot using the log (base 10) transformed variables, LogNassim versus LogMass. Again, comment on any patterns.
plot(log(Mass), log(Nassim), main="Scatterplot of Nitrogen Assimilation and Mass of Caterpillars",
xlab="log(Mass) (grams)", ylab="log(Nitrogen Assimilation) (ingestion - excretion)", pch=19)
## Warning in log(Nassim): NaNs produced
abline(predict<-lm(log(Nassim)~log(Mass)))#adds a line of the simple linear regression of log(Mass) and log(Nassim) to the graph
## Warning in log(Nassim): NaNs produced
Comments: The scatterplot shows that log(Nassim) increases with log(Mass) in a fairly strong positive relationship. The problem with this scatterplot is that there are outliers, namely the points near log(Mass)=0 and log(Nassim)=-7. Even so, it is still better suited to a linear model than the scatterplot in part A, even if it is a bit harder to explain to people who don’t know much about statistics.
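The "NaNs produced" warning comes from taking the log of negative numbers, which suggests that some Nassim values in the data are negative. The sketch below reproduces the behavior on three made-up values (they are not taken from the Caterpillars data) and shows one way to keep only the positive entries before logging.

```r
# Hypothetical values for illustration only; the last one is negative.
nassim_example <- c(0.0019, 0.0023, -0.0004)

# log() of a negative number is NaN; suppressWarnings() hides the warning here.
suppressWarnings(log(nassim_example))

# Keep only strictly positive values before taking logs.
keep <- nassim_example > 0
log(nassim_example[keep])
```

Dropping rows this way changes which caterpillars are in the analysis, so the number of excluded rows is worth reporting alongside the fit.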
[Instead of the LogWetFrass and LogMass variables, make your own logged variables using natural log]
C. Would you prefer the plot in part (a) or part (b) to predict the nitrogen assimilation of caterpillars with a linear model? Fit a linear regression model for the plot you chose and write down the prediction equation.
Comments: The scatterplot I prefer is the one from part B, since it shows log(Nassim) increasing steadily with log(Mass) and doesn’t form the shape of an upside-down U. The linear regression model for this plot is the one already drawn on it, predict<-lm(log(Nassim)~log(Mass)). This means that the response variable is the log of Nitrogen Assimilation and the predictor variable is the log of the Mass in grams of the caterpillars.
predict<-lm(log(Nassim)~log(Mass))
## Warning in log(Nassim): NaNs produced
predict
##
## Call:
## lm(formula = log(Nassim) ~ log(Mass))
##
## Coefficients:
## (Intercept) log(Mass)
## -4.346 0.371
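As shown above, the prediction equation is hat(log(Nassim)) = -4.346 + 0.371*log(Mass in grams). Because both variables are logged, the slope does not describe a change per gram: each one-unit increase in log(Mass) raises the predicted log(Nassim) by 0.371, which on the original scale means that a 1% increase in mass is associated with roughly a 0.371% increase in nitrogen assimilation.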
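To see what the log-log equation implies on the original scale, it can be back-transformed with exp(). The sketch below uses the two coefficients printed above; predict_nassim is a hypothetical helper name, and the back-transformed value is only a rough point prediction.

```r
# Coefficients copied from the printed lm() output above (natural logs).
a <- -4.346  # intercept
b <- 0.371   # slope for log(Mass)

# Hypothetical helper: predicted Nassim on the original scale for a mass in grams.
predict_nassim <- function(mass) exp(a + b * log(mass))

predict_nassim(1)                      # equals exp(a), because log(1) = 0
predict_nassim(2) / predict_nassim(1)  # equals 2^b: the multiplier for doubling the mass
```

The ratio does not depend on the starting mass, which is why the slope of a log-log model is read as a percentage change rather than a change per gram.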
D. Add a plotting symbol for the grouping variable Instar to the scatterplot that you chose in (c). Does the linear trend appear consistent for all five stages of a caterpillar’s life? (Note: We are not ready to fit more complicated models yet, but we will return to this experiment in Chapter 3.)
Changing colors is fine, you don’t need to change the symbols.
# factor() treats Instar as a grouping variable so each stage gets its own discrete color
ggplot(Caterpillars,aes(x=log(Mass),y=log(Nassim),color=factor(Instar))) +
geom_point(size=5)
## Warning in log(Nassim): NaNs produced
## Warning in log(Nassim): NaNs produced
## Warning: Removed 14 rows containing missing values (geom_point).
Comments: Yes, the linear trend appears consistent for all five stages of a caterpillar’s life. The youngest stage has the smallest log(Mass) and log(Nassim), and as the caterpillars get older both log(Mass) and log(Nassim) increase.
E. Repeat part (d) using plotting symbols (or colors) for the groups defined by the free growth period variable Fgp. Does the linear trend appear to be better when the caterpillars are in a free growth period? (Again, we are not ready to fit more complicated models, but we are looking at the plot for linear trend in the two groups.)
ggplot(Caterpillars,aes(x=log(Mass),y=log(Nassim),color=Fgp)) +
geom_point(size=5)
## Warning in log(Nassim): NaNs produced
## Warning in log(Nassim): NaNs produced
## Warning: Removed 14 rows containing missing values (geom_point).
Comments: Yes, the linear trend appears to be better when the caterpillars are in a free growth period. As shown in the graph above, the blue markers (the caterpillars in a free growth period) follow a line far more closely than those that are not in a free growth period.
data(Pines)
head(Pines)
## Row Col Hgt90 Hgt96 Diam96 Grow96 Hgt97 Diam97 Spread.97 Needles97
## 1 1 1 NA NA NA NA NA NA NA NA
## 2 1 2 14 284 4.2 96 362 6.6 162 66
## 3 1 3 17 387 7.4 110 442 9.3 250 77
## 4 1 4 NA NA NA NA NA NA NA NA
## 5 1 5 24 294 3.9 70 369 7.0 176 72
## 6 1 6 22 310 5.6 84 365 6.9 215 76
## Deer95 Deer97 Cover95 Fert Spacing
## 1 NA NA 0 0 15
## 2 0 1 2 0 15
## 3 0 0 1 0 15
## 4 NA NA 0 0 15
## 5 0 0 2 0 15
## 6 0 0 1 0 15
attach(Pines)
1.18 Pines. The dataset Pines contains data from an experiment conducted by the Department of Biology at Kenyon College at a site near the campus in Gambier, Ohio. In April 1990, student and faculty volunteers planted 1000 white pine (Pinus strobus) seedlings at the Brown Family Environmental Center. These seedlings were planted in two grids, distinguished by 10- and 15-foot spacings between the seedlings. Several variables, described below, were measured and recorded for each seedling over time.
A. Construct a scatterplot to examine the relationship between the initial height in 1990 and the height in 1996. Comment on any relationship seen.
plot(Hgt90, Hgt96, main="Scatterplot of initial height in 1990 and the height in 1996",
xlab="Initial Height in 1990 (cm)", ylab="Height in 1996 (cm)", pch=19)
abline(height<-lm(Hgt96~Hgt90))#adds a line of the simple linear regression of initial height in 1990 and the height in 1996 to the graph
Comments: As shown in the graph above, with the least squares line that I added, there is a weak positive relationship between the height in 1990 and the height in 1996.
B. Fit a least squares line for predicting the height in 1996 from the initial height in 1990.
Comments: I already fit a least square line to the graph in part A. This least square line was height<-lm(Hgt96~Hgt90).
height<-lm(Hgt96~Hgt90)
height
##
## Call:
## lm(formula = Hgt96 ~ Hgt90)
##
## Coefficients:
## (Intercept) Hgt90
## 241.28 2.25
The line means that hat(height in 1996 in cm)=241.28 +2.25(height in 1990 in cm). That means that for each additional centimeter of height in 1990, the predicted height of the tree in 1996 is 2.25 cm greater.
C. Are you satisfied with the fit of this simple linear model? Explain.
summary(height)
##
## Call:
## lm(formula = Hgt96 ~ Hgt90)
##
## Residuals:
## Min 1Q Median 3Q Max
## -275.293 -42.798 7.208 46.332 181.457
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 241.2846 8.6209 27.99 < 2e-16 ***
## Hgt90 2.2504 0.4311 5.22 2.28e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 69.02 on 805 degrees of freedom
## (193 observations deleted due to missingness)
## Multiple R-squared: 0.03274, Adjusted R-squared: 0.03154
## F-statistic: 27.25 on 1 and 805 DF, p-value: 2.276e-07
Comments: Looking at the summary of the simple linear model, I am not satisfied with it. It’s not that the p-value isn’t small enough, because 2.276e-07 is pretty small. The problem is that the Multiple R-squared is 0.03274, which means that the height in 1990 accounts for only 3.274% of the variation in the trees’ heights in 1996. This is not enough for me to feel satisfied with the linear model. There is too much unexplained.
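The summary reports 805 residual degrees of freedom and 193 observations deleted due to missingness; 805 + 2 estimated coefficients gives 807 trees used, and 807 + 193 accounts for the 1000 seedlings planted. The sketch below illustrates how lm() silently drops incomplete rows, using only the six rows printed by head(Pines); pines_head and fit_pines_head are illustrative names, and the resulting fit is not the full-data model.

```r
# Hgt90 and Hgt96 copied from the head(Pines) output; NOT the full dataset.
pines_head <- data.frame(
  Hgt90 = c(NA, 14, 17, NA, 24, 22),
  Hgt96 = c(NA, 284, 387, NA, 294, 310)
)

# By default lm() omits any row with a missing value in the model variables.
fit_pines_head <- lm(Hgt96 ~ Hgt90, data = pines_head)

nobs(fit_pines_head)              # number of rows actually used in the fit
sum(!complete.cases(pines_head))  # number of rows dropped for missingness
```

Checking these two counts is a quick way to confirm how many trees the regression is really based on.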
1.26 Textbook prices. Two undergraduate students at Cal Poly took a random sample of 30 textbooks from the campus bookstore in the fall of 2006. They recorded the price and number of pages in each book, in order to investigate the question of whether the number of pages can be used to predict price. Their data are stored in the file TextPrices and appear in Table 1.5.
data(TextPrices)
head(TextPrices)
## Pages Price
## 1 600 95.00
## 2 91 19.95
## 3 200 51.50
## 4 400 128.50
## 5 521 96.00
## 6 315 48.50
attach(TextPrices)
A. Produce the relevant scatterplot to investigate the students’ question. Comment on what the scatterplot reveals about the question.
plot(Pages, Price, main="Scatterplot of Number of Pages and Price of Textbooks",
xlab="Number of Pages", ylab="Price of Textbooks(in dollars)", pch=19)
abline(PagePrice<-lm(Price~Pages))#adds a line of the simple linear regression of the number of pages and the price of the textbook
Comments: The scatterplot reveals that there is a strong positive relationship between the number of pages and the prices of textbooks. This means that as the number of pages in the textbook increases so does the price of the textbook.
B. Determine the equation of the regression line for predicting price from number of pages.
Comments: Once again I used this line in the plot; the code for it is PagePrice<-lm(Price~Pages).
PagePrice<-lm(Price~Pages)
PagePrice
##
## Call:
## lm(formula = Price ~ Pages)
##
## Coefficients:
## (Intercept) Pages
## -3.4223 0.1473
The fitted regression equation is hat(Price)= -3.4223 + 0.1473(number of Pages). This means that for each additional page the predicted price of the textbook goes up by 0.1473 dollars, or about 15 cents.
C. Produce and examine relevant residual plots, and comment on what they reveal about whether the conditions for inference are met with these data.
plot(PagePrice)
Comments: The Residuals vs Fitted plot shows the residuals scattered around 0 with no clear pattern, which is what we want, since this plot checks linearity. A curve or other pattern in this plot would be a sign for concern, because it would mean that a linear model is not appropriate for these variables. The same plot also checks for constant variance: the vertical spread of the points around the 0 line should be about the same across the whole range of fitted values. Since the plot looks this way, we can say the residuals have roughly constant variance.
The Normal Q-Q plot shows that the majority of the points fall close to the line, with a few departures at the upper and lower ends, which is not too bad. Since this plot checks normality, we can say that the residuals are approximately normal, which is also what we want.
The last condition is independence. Here independence means that the errors for different textbooks are unrelated to one another, not that Price and Pages are unrelated (part A shows they are strongly related). We can’t tell this from the plots; we have to think about how the data were collected. Because the students took a random sample of textbooks from the bookstore, it is reasonable to treat the observations as independent.
So yes, the conditions for inference are met with the data and model.
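The two diagnostic plots discussed above can also be drawn by hand from the residuals and fitted values, which makes it clearer what each one displays. The sketch below uses only the six rows printed by head(TextPrices), so the pictures will not match the 30-book plots; books_head, fit_books, and res are names made up for the example.

```r
# Six rows copied from the head(TextPrices) output; NOT the full 30-book sample.
books_head <- data.frame(
  Pages = c(600, 91, 200, 400, 521, 315),
  Price = c(95.00, 19.95, 51.50, 128.50, 96.00, 48.50)
)
fit_books <- lm(Price ~ Pages, data = books_head)
res <- residuals(fit_books)

if (!interactive()) pdf(NULL)  # when run as a script, draw to a null device

# Residuals vs fitted: look for curvature (linearity) and changing spread (variance).
plot(fitted(fit_books), res, xlab = "Fitted price", ylab = "Residual", pch = 19)
abline(h = 0, lty = 2)

# Normal Q-Q plot: points near the reference line suggest roughly normal residuals.
qqnorm(res)
qqline(res)

mean(res)  # zero up to rounding error for any least squares fit with an intercept
```

Because the residuals of a least squares fit with an intercept always average to zero, the useful information in the first plot is the shape and spread of the points rather than their overall mean.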