library(openintro)
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following objects are masked from 'package:datasets':
##
## cars, trees
# 7.6 Husbands and wives, Part I. The Great Britain Office of Population
#Census and Surveys once collected data on a random sample of 170
#married couples in Britain, recording the age (in years) and
#heights (converted here to inches) of the husbands and wives.
#The scatterplot on the left shows the wife's age plotted against
#her husband's age, and the plot on the right shows wife's height
#plotted against husband's height.
#(a) Describe the relationship between husbands' and wives' ages.
hw.dt <- husbands.wives
age.lm <- lm(Age_Husband ~ Age_Wife, data = hw.dt)
plot(Age_Husband ~ Age_Wife, data = hw.dt)
abline(coef(age.lm))

#Husbands' and wives' ages appear to have a strong positive linear relationship
#(b) Describe the relationship between husbands' and wives' heights.
ht.lm <- lm(data = hw.dt, Ht_Husband ~ Ht_Wife)
plot(data = hw.dt, Ht_Husband ~ Ht_Wife)
abline(coef(ht.lm))

#There appears to be a weak positive linear relationship; the points are widely scattered around the line
#(c) Which plot shows a stronger correlation? Explain your reasoning.
#The age plot shows a stronger correlation because its points cluster much more tightly around the line
#(d) Data on heights were originally collected in centimeters, and then converted to inches. Does this conversion affect the correlation between husbands' and wives' heights?
#No. Correlation is unitless, so converting centimeters to inches (a linear rescaling) leaves it unchanged
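# A quick numerical check of (d): correlation is unchanged by a change of units.
# This is only a sketch. The heights below are hypothetical values in
# centimeters, made up for illustration (they are not the survey data), and
# the names husband.cm, wife.cm, r.cm, and r.in are illustrative.
husband.cm <- c(170, 175, 180, 168, 185, 178)
wife.cm <- c(160, 165, 170, 158, 172, 163)
r.cm <- cor(husband.cm, wife.cm)
# Convert both variables to inches (1 inch = 2.54 cm) and recompute
r.in <- cor(husband.cm / 2.54, wife.cm / 2.54)
# Compare the two correlations up to floating-point tolerance
all.equal(r.cm, r.in)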
# 7.12 Trees. The scatterplots below show the relationship between height, diameter, and volume of timber in 31 felled black cherry trees. The diameter of the tree is measured 4.5 feet above the ground.
trees.dt <- trees
#(a) Describe the relationship between volume and height of these trees.
# Volume is the response, since the exercise is about predicting volume
vh.lm <- lm(Volume ~ Height, data = trees.dt)
plot(Volume ~ Height, data = trees.dt)
abline(coef(vh.lm))

#There appears to be a weak positive linear relationship with high variability. Several outlying points could be influential
#(b) Describe the relationship between volume and diameter of these trees.
# Girth is the diameter; volume is again the response
vd.lm <- lm(Volume ~ Girth, data = trees.dt)
plot(Volume ~ Girth, data = trees.dt)
abline(coef(vd.lm))

#There is a far clearer positive linear relationship between volume and diameter
#(c) Suppose you have height and diameter measurements for another black cherry tree. Which of these variables would be preferable to use to predict the volume of timber in this tree using a simple linear regression model? Explain your reasoning.
#Diameter is clearly the better predictor of the volume of the tree because its relationship with volume is much stronger, so it would be the preferred variable.
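# A rough numerical check of (c), sketched under one assumption: that the
# built-in datasets::trees data (31 black cherry trees; Girth is the diameter
# in inches) matches the data used above. The names r2.height and r2.girth
# are illustrative.
r2.height <- summary(lm(Volume ~ Height, data = datasets::trees))$r.squared
r2.girth <- summary(lm(Volume ~ Girth, data = datasets::trees))$r.squared
# The predictor with the larger R-squared explains more of the variability in volume
round(c(height = r2.height, girth = r2.girth), 3)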
# 7.18 Correlation, Part II.
#What would be the correlation between the annual salaries of
#males and females at a company if for a certain type of position men always made
#(a) $5,000 more than women?
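#salMen = salWomen + 5000, so the correlation is 1 (an exact linear relationship with a positive slope)
#(b) 25% more than women?
#salMen = 1.25 * salWomen, so the correlation is 1
#(c) 15% less than women?
#salMen = 0.85 * salWomen, so the correlation is still 1 because the slope is positive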
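# The three cases in 7.18 can be checked numerically. This is only a sketch:
# the salaries below are hypothetical values made up for illustration, and
# the name sal.women is illustrative.
sal.women <- c(40000, 52000, 61000, 75000, 90000)
cor(sal.women + 5000, sal.women)   # (a) men make $5,000 more
cor(1.25 * sal.women, sal.women)   # (b) men make 25% more
cor(0.85 * sal.women, sal.women)   # (c) men make 15% less
# Each call should return 1, up to floating-point rounding, because each
# transformation is an exact linear function with a positive slope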
#7.24 Nutrition at Starbucks, Part I.
#The scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we are interested in predicting the amount of carbs a menu item has based on its calorie content.
#(a) Describe the relationship between number of calories and amount of carbohydrates (in grams) that Starbucks food menu items contain.
#Calories and carbohydrates have a positive linear relationship. The residuals fan out at higher calorie counts, but their distribution appears nearly normal
#(b) In this scenario, what are the explanatory and response variables?
#The explanatory variable is the number of calories and the response variable is the amount of carbohydrates (in grams)
#(c) Why might we want to fit a regression line to these data?
#We could predict the amount of carbs in a Starbucks food menu item based on its number of calories
#(d) Do these data meet the conditions required for fitting a least squares line?
#Linearity -> The data should show a linear trend. From the scatterplot described in (a), the data follow a weak but roughly linear trend.
#Nearly normal residuals -> The histogram of the residuals is slightly left-skewed but close to normal.
#Constant variability -> The residual plot shows that the variability is not constant. The linear model fits much better at lower calorie counts than at higher ones, where the residuals are much larger.
#Independent observations -> Each menu item is presumably independent of the others, although they are all Starbucks menu items.
# 7.30 Cats, Part I. The following regression output is for predicting
#the heart weight (in g) of cats from their body weight (in kg).
#The coefficients are estimated using a dataset of 144 domestic cats.
#(a) Write out the linear model.
#heartWeight = -0.357 + 4.034 * bodyWeight
#(b) Interpret the intercept.
#The model predicts a heart weight of -0.357 g for a cat with a body weight of 0 kg. This is not meaningful in context; the intercept only adjusts the height of the line
#(c) Interpret the slope.
#For each additional kg of body weight, the model predicts about 4.034 g of additional heart weight, on average
#(d) Interpret R2.
#Body weight explains about 65% of the variability in the heart weight of cats
#(e) Calculate the correlation coefficient.
# r takes the sign of the slope, which is positive here, so r = +sqrt(R2)
R2 <- 0.6466
round(sqrt(R2), 3)
## [1] 0.804
# 7.36 Beer and blood alcohol content. Many people believe that gender, weight, drinking habits, and many other factors are much more important in predicting blood alcohol content (BAC) than simply considering the number of drinks a person consumed. Here we examine data from sixteen student volunteers at Ohio State University who each drank a randomly assigned number of cans of beer. These students were evenly divided between men and women, and they differed in weight and drinking habits. Thirty minutes later, a police officer measured their blood alcohol content (BAC) in grams of alcohol per deciliter of blood.
#(a) Describe the relationship between the number of cans of beer and BAC.
bac.dt <- bac
bac.lm <- lm(data = bac.dt, BAC ~ Beers)
plot(data = bac.dt, BAC ~ Beers)
abline(coef(bac.lm))

#The number of cans a student drinks has a strong positive linear relationship with BAC
#(b) Write the equation of the regression line. Interpret the slope and intercept in context.
coef(bac.lm)
## (Intercept) Beers
## -0.01270060 0.01796376
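#BAC = -0.0127 + 0.0180 * Beers
#Slope: each additional can of beer is associated with an increase of about 0.018 in BAC.
#Intercept: the predicted BAC for someone who drank no beer is -0.0127, which is not meaningful because BAC cannot be negative.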
#(c) Do the data provide strong evidence that drinking more cans of beer is associated with an increase in blood alcohol? State the null and alternative hypotheses, report the p-value, and state your conclusion.
#Null hypothesis, Ho: b1 = 0 (the number of beers is not associated with BAC).
#Alternative hypothesis, Ha: b1 != 0 (the number of beers is associated with BAC).
#The p-value for the 'Beers' coefficient is approximately 0 (reported as 0.0000), which is below 0.05, so we reject Ho. The data provide strong evidence that drinking more cans of beer is associated with an increase in blood alcohol.
#(d) The correlation coefficient for number of cans of beer and BAC is 0.89. Calculate R2 and interpret it in context.
r <- 0.89
paste("R2: ", round(r^2,3))
## [1] "R2: 0.792"
#About 79% of the variability in BAC is explained by the number of cans of beer consumed.
#(e) Suppose we visit a bar, ask people how many drinks they have had, and also take their BAC. Do you think the relationship between number of drinks and BAC would be as strong as the relationship found in the Ohio State study?
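#Probably not. The Ohio State study was controlled: the number of beers was randomly assigned and BAC was measured thirty minutes later. In a bar, drink type and size, timing, and self-reported counts all vary, so the relationship would likely be weaker.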
# 7.42 Babies. Is the gestational age (time between conception and birth) of a low birth-weight baby useful in predicting head circumference at birth? Twenty-five low birth-weight babies were studied at a Harvard teaching hospital; the investigators calculated the regression of head circumference (measured in centimeters) against gestational age (measured in weeks).
#The estimated regression line is headCircumference = 3.91 + 0.78 * gestationalAge.
#(a) What is the predicted head circumference for a baby whose gestational age is 28 weeks?
headCirc.fct <- function(gestAge) {
  return(3.91 + 0.78 * gestAge)
}
headCirc.fct(28)
## [1] 25.75
#(b) The standard error for the coefficient of gestational age is 0.35, which is associated with df = 23. Does the model provide strong evidence that gestational age is significantly associated with head circumference?
se <- .35
df <- 23
#The slope estimate is 0.78, which is positive, so the fitted association between the two variables is positive.
#No p-value is given for the slope, so we carry out a hypothesis test to check whether the association is significant.
#Null hypothesis, Ho: b1 = 0 (no association).
#Alternative hypothesis, Ha: b1 != 0 (some association).
#The sample is small (n = 25, so df = n - 2 = 23). t = (0.78 - 0) / 0.35 = 2.229. For df = 23 the one-sided p-value from the t-table is about 0.018, so the two-sided p-value is about 0.036. Since this is below 0.05, we reject Ho: the data provide evidence that gestational age is associated with head circumference.
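# The test in (b) can be reproduced with pt(). A minimal sketch: the names
# b1, se.b1, df.resid, t.stat, and p.two.sided are illustrative, and the
# values repeat those given in the exercise.
b1 <- 0.78
se.b1 <- 0.35
df.resid <- 23
# Test statistic for Ho: b1 = 0
t.stat <- (b1 - 0) / se.b1
# Two-sided p-value: twice the upper-tail area beyond |t|
p.two.sided <- 2 * pt(abs(t.stat), df = df.resid, lower.tail = FALSE)
round(c(t = t.stat, p = p.two.sided), 4)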