** Please submit both the .Rmd file and the compiled .html file.

Check your working directory.

getwd()
## [1] "/Users/krishapatel/Downloads"
library(foreign)

kidiq = read.dta('regression_methods/kidiq.dta')
head(kidiq)
##   kid_score mom_hs    mom_iq mom_work mom_age
## 1        65      1 121.11753        4      27
## 2        98      1  89.36188        4      25
## 3        85      1 115.44316        4      27
## 4        83      1  99.44964        3      25
## 5       115      1  92.74571        4      27
## 6        98      0 107.90184        1      18
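
Before fitting anything, a quick look at the variable types and the size of the two mom_hs groups can help (not part of the assignment, just a sanity check):

str(kidiq)            # variable types and number of observations
table(kidiq$mom_hs)   # how many kids fall in the mom_hs = 0 and mom_hs = 1 groups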

Problem 1

(a) Make boxplots of kid_score grouped by mom_hs

boxplot(kid_score~mom_hs, data=kidiq,
        main="Kid's IQ score grouped by mom's high-school completion",
        xlab="mom_hs (0 = no HS, 1 = HS)", ylab="kid_score")

(b) Use lm()

Fit a simple linear regression model with kid_score as the response and mom_hs as the predictor using the R function lm(). Report the estimated coefficients.

lm(kid_score ~ mom_hs, data = kidiq)
## 
## Call:
## lm(formula = kid_score ~ mom_hs, data = kidiq)
## 
## Coefficients:
## (Intercept)       mom_hs  
##       77.55        11.77
#the response (y variable) is kid_score; the predictor (x variable) is mom_hs

#the estimated coefficients are 77.55 (intercept, beta 0)
#and 11.77 (mom_hs slope, beta 1)
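
To pull the estimates out programmatically rather than reading them off the printed output, one option (a small sketch; fit1 is just an illustrative name) is:

fit1 <- lm(kid_score ~ mom_hs, data = kidiq)
coef(fit1)      # named vector with (Intercept) and mom_hs
summary(fit1)   # also gives standard errors, t-values, and R-squared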

(c) Interpret the slope and the intercept of the regression line.

–> The intercept is 77.55 (beta 0) and the slope is 11.77 (beta 1). The intercept is the predicted kid_score when mom_hs = 0, i.e. the average score of kids whose mothers did not complete high school. The slope is the estimated difference in average kid_score between the two groups: kids whose mothers completed high school score about 11.77 points higher on average, for a predicted score of 77.55 + 11.77 = 89.32.
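
Since mom_hs is binary, the intercept and slope are just the two group means and their difference; a quick sketch to verify the interpretation above:

# mean kid_score when mom_hs = 0: should match the intercept (about 77.55)
mean(kidiq$kid_score[kidiq$mom_hs == 0])
# mean kid_score when mom_hs = 1: should match intercept + slope (about 89.32)
mean(kidiq$kid_score[kidiq$mom_hs == 1])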

Problem 2

Split the data into two subgroups, corresponding to mom_hs = 0 and mom_hs = 1 respectively. For each subgroup, use the kid's score (kid_score) as the response and the mom's IQ (mom_iq) as the predictor, and do the following:

#note: mom_hs = 0 means the mom did not complete high school, mom_hs = 1 means she did
mom_not_in_hs <- kidiq[kidiq$mom_hs == 0,]
#View(mom_not_in_hs)

mom_in_hs <- kidiq[kidiq$mom_hs == 1,]
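
An equivalent way to build the two subgroups (a sketch, not required by the problem) is split(), which returns a named list of data frames keyed by the value of mom_hs:

groups <- split(kidiq, kidiq$mom_hs)
mom_not_in_hs <- groups[["0"]]   # mothers with mom_hs = 0
mom_in_hs     <- groups[["1"]]   # mothers with mom_hs = 1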

(a) Scatter plot

Give a scatter plot of the pair of variables. Use lightblue and lightgreen for mom_hs = 0 and mom_hs = 1, respectively.

plot(mom_not_in_hs$mom_iq, mom_not_in_hs$kid_score, col = "lightblue",
     xlab = "mom_iq", ylab = "kid_score", main = "mom_hs = 0")

plot(mom_in_hs$mom_iq, mom_in_hs$kid_score, col = "lightgreen",
     xlab = "mom_iq", ylab = "kid_score", main = "mom_hs = 1")
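
Both groups can also be shown on a single scatter plot by coloring the points according to mom_hs (a sketch, not asked for by the problem):

plot(kidiq$mom_iq, kidiq$kid_score,
     col = ifelse(kidiq$mom_hs == 0, "lightblue", "lightgreen"),
     xlab = "mom_iq", ylab = "kid_score",
     main = "Kid's score vs. mom's IQ, colored by mom_hs")
legend("topleft", legend = c("mom_hs = 0", "mom_hs = 1"),
       col = c("lightblue", "lightgreen"), pch = 1)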

(b) Fit a regression model using lm(). Then add the regression line to the scatter plot.

Fit a simple linear regression model for each subgroup using the R function lm(). Report the estimated coefficients. Add the regression line, in black, to each of the scatter plots above.

#model for the mom_hs = 1 subgroup
lm_model_in_hs <- lm(kid_score~mom_iq, data = mom_in_hs)
#fitted line: y = 39.7862 + 0.4846x
#coefficients:
# beta 0 = 39.7862
# beta 1 = 0.4846

#model for the mom_hs = 0 subgroup
lm_model_not_in_hs <- lm(kid_score~mom_iq, data = mom_not_in_hs)
#fitted line: y = -11.4820 + 0.9689x
#coefficients:
# beta 0 = -11.4820
# beta 1 = 0.9689

print(lm_model_in_hs)
## 
## Call:
## lm(formula = kid_score ~ mom_iq, data = mom_in_hs)
## 
## Coefficients:
## (Intercept)       mom_iq  
##     39.7862       0.4846
print(lm_model_not_in_hs)
## 
## Call:
## lm(formula = kid_score ~ mom_iq, data = mom_not_in_hs)
## 
## Coefficients:
## (Intercept)       mom_iq  
##    -11.4820       0.9689
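
The same two lines can be recovered from a single model on the full data by interacting mom_hs with mom_iq; this sketch shows the equivalent formulation (the combined coefficients should reproduce the two fits above):

fit_int <- lm(kid_score ~ mom_hs * mom_iq, data = kidiq)
coef(fit_int)
# mom_hs = 0 line: (Intercept) and mom_iq terms (about -11.48 and 0.97)
# mom_hs = 1 line: (Intercept) + mom_hs and mom_iq + mom_hs:mom_iq (about 39.79 and 0.48)
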
plot(mom_not_in_hs$mom_iq, mom_not_in_hs$kid_score,
     main = "Kid's score vs. IQ of moms who did not go to HS",
     xlab = "mom_iq",
     ylab = "kid_score",
     col = "lightblue")
abline(lm_model_not_in_hs, col = "black")

plot(mom_in_hs$mom_iq, mom_in_hs$kid_score,
     main = "Kid's score vs. IQ of moms who went to HS",
     xlab = "mom_iq",
     ylab = "kid_score",
     col = "lightgreen")
abline(lm_model_in_hs, col = "black") #black is the default
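
For a direct visual comparison of the two subgroups, both fitted lines can also be overlaid on a single scatter plot (a sketch, not part of the assignment):

plot(kidiq$mom_iq, kidiq$kid_score,
     xlab = "mom_iq", ylab = "kid_score",
     main = "Both subgroup regression lines")
abline(lm_model_not_in_hs, lty = 2)  # dashed: mom_hs = 0
abline(lm_model_in_hs, lty = 1)      # solid: mom_hs = 1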

Problem 3

hw1 = read.table("regression_methods/hw1.txt",header=T)
hw1
##         Country       GDP Satisfaction
## 1     Australia 27.055725     7.894780
## 2       Finland 25.860430     7.905812
## 3         Japan 25.592535     6.579316
## 4         Korea  7.351448     5.334750
## 5        Mexico 13.613936     7.964578
## 6        Sweden 29.394784     8.010560
## 7 United States 33.824743     7.658895

The data object hw1 has 7 rows and 3 columns: Country, GDP and Satisfaction. GDP is the GDP per capita (unit: thousand dollars) in 1984, and Satisfaction is a measure of how satisfied one is with life as a whole, calculated for each country from the World Value Survey (https://www.worldvaluessurvey.org/) Wave 1 data.

(a) 1. Give a scatter plot, in black, of the Satisfaction against the GDP; label the x and y axes and give the plot a main title. 2. Use the R function lm() to fit a linear regression model with Satisfaction as the response and GDP as the predictor. 3. Add the line in blue to the scatter plot. (2 pts)

plot(hw1$GDP, hw1$Satisfaction,
     main = "Satisfaction vs. GDP from World Value Survey",
     xlab = "GDP (unit: thousand dollars)", 
     ylab = "Satisfaction"
     )

model <- lm(Satisfaction ~ GDP, data = hw1)
print(model)
## 
## Call:
## lm(formula = Satisfaction ~ GDP, data = hw1)
## 
## Coefficients:
## (Intercept)          GDP  
##     5.76995      0.06736
#the line is y = 5.76995 + 0.06736x


plot(hw1$GDP, hw1$Satisfaction, 
     main = "Satisfaction vs. GDP from World Value Survey with lm() fit",
     xlab = "GDP (unit: thousand dollars)", 
     ylab = "Satisfaction",
     asp = 1) #aspect ratio of 1 so one unit on the x axis equals one unit on the y axis
abline(model, col = "blue", lwd = 2)
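
The fitted line can be sanity-checked by predicting Satisfaction at a chosen GDP (a sketch; the GDP value of 20 is just an illustrative input):

predict(model, newdata = data.frame(GDP = 20))
# by hand, from the reported coefficients:
5.76995 + 0.06736 * 20   # about 7.12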

(b) Plot Satisfaction against GDP and then use lm()

Generate the same scatter plot of the Satisfaction against the GDP as in the previous question, and use the R function lm() to fit a linear regression model with Satisfaction as the response and GDP as the predictor, without an intercept. Add the line in red to the scatter plot.

model2 <- lm(Satisfaction ~ 0 + GDP, data = hw1)
print(model2)
## 
## Call:
## lm(formula = Satisfaction ~ 0 + GDP, data = hw1)
## 
## Coefficients:
##    GDP  
## 0.2855
#the model has only a slope (beta 1 = 0.2855) and no intercept

plot(hw1$GDP, hw1$Satisfaction,
     main = "Satisfaction against GDP from World Value Survey (no intercept)",
     xlab = "GDP (unit: thousand dollars)", 
     ylab = "Satisfaction",
     asp = 1) #aspect ratio of 1, same as the previous plot, so the two plots are comparable
abline(model2, col = "red", lwd = 2)
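
For regression through the origin, the least-squares slope has the closed form sum(x * y) / sum(x^2); this sketch checks the reported slope of about 0.2855 against that formula:

with(hw1, sum(GDP * Satisfaction) / sum(GDP^2))   # should match coef(model2)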

(c) Compare these two models and explain the differences along with your observations

The main difference between Model 1 and Model 2 is that Model 1 fits a regression line with both an intercept and a slope, while Model 2 forces the line through the origin (intercept = 0). Because Model 2 must pass through (0, 0), it compensates with a much larger slope (0.2855 versus 0.06736 in Model 1).

Plotting the two regression lines on the scatter plots shows the effect of including versus excluding the intercept. Model 1 appears to fit the data much better: its line enters the plot at a Satisfaction value of around 6, between the first two data points, and then passes roughly through the middle of the remaining points, so the positive and negative residuals roughly balance out (with an intercept in the model, the least-squares residuals in fact sum to exactly zero).

In contrast, Model 2 has two data points that are far from its regression line, both positioned above it. This inflates the residual sum of squares, which is undesirable when selecting a regression model. The deviation from the line also keeps growing toward the right-hand end of the graph for Model 2, especially when compared to the Model 1 graph. It is therefore evident that Model 1 fits the data better than Model 2.
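
One way to back up this visual comparison with a single number (a sketch; the actual values are not printed here) is to compare the residual sums of squares of the two fits:

sum(resid(model)^2)    # RSS of Model 1 (with intercept)
sum(resid(model2)^2)   # RSS of Model 2 (through the origin)
# the fit with the smaller RSS tracks the observed points more closely;
# since Model 2 is Model 1 with the intercept forced to 0, its RSS can never be smaller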