Problem 1. Faraway Chapter 1, Exercise 1.

The dataset teengamb concerns a study of teenage gambling in Britain. Make a numerical and graphical summary of the data, commenting on any features that you find interesting. Limit the output you present to a quantity that a busy reader would find sufficient to get a basic understanding of the data.

library(ggplot2)  # needed for the ggplot() calls below
data("teengamb", package = "faraway")
teengamb$sex <- factor(teengamb$sex)
levels(teengamb$sex) <- c("male","female")
# distribution of gambling
ggplot(teengamb, aes(x = gamble)) + 
  geom_histogram(binwidth = 10, fill = "blue", color = "black") + 
  labs(x = "Gamble Amount", y = "Frequency")

# sex
table(teengamb$sex)
## 
##   male female 
##     28     19
ggplot(teengamb, aes(x = sex, y = gamble, fill = sex)) +
  geom_boxplot() +
  labs(title = "Gambling Amount by Sex", x = "Sex", y = "Gamble Amount")

# socioeconomic status (score based on parents' occupation)
ggplot(teengamb, aes(x = status, y = gamble)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(title = "Gambling Amount by Socioeconomic Status", x = "Status", y = "Gamble Amount")
## `geom_smooth()` using formula = 'y ~ x'

# individual income
ggplot(teengamb, aes(x = income, y = gamble)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(title = "Gambling Amount by Income", x = "Income", y = "Gamble Amount")
## `geom_smooth()` using formula = 'y ~ x'

# verbal ability
ggplot(teengamb, aes(x = verbal, y = gamble)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(title = "Gambling Amount by Verbal Ability", x = "Verbal Score", y = "Gamble Amount")
## `geom_smooth()` using formula = 'y ~ x'

The plots of the teengamb dataset suggest the following:

  1. Gambling Amount Distribution: The distribution is strongly right-skewed. Most teens gamble small amounts, while a few gamble far more.
  2. Gambling Amount by Sex: Males tend to gamble more than females.
  3. Gambling Amount by Income: Teens with higher income tend to gamble more.
  4. Gambling Amount by Verbal Ability: There is little visible relationship between verbal score and gambling.

In conclusion, these plots suggest that sex and income are the variables most clearly associated with gambling expenditure, while verbal ability shows no notable relationship. These are descriptive impressions only, since no model has been fitted yet.
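
The exercise also asks for a numerical summary, which the plots alone do not provide. The following is a minimal sketch of one way to produce it, assuming the faraway package is installed. The helper `group_summary` is a hypothetical name introduced here for illustration and is not part of the original analysis.

```r
# Hypothetical helper: count, mean, and median of a numeric column per group.
group_summary <- function(df, value, group) {
  parts <- split(df[[value]], df[[group]])
  data.frame(
    group  = names(parts),
    n      = vapply(parts, length, integer(1)),
    mean   = vapply(parts, mean, numeric(1)),
    median = vapply(parts, median, numeric(1)),
    row.names = NULL
  )
}

# Apply to teengamb only if the faraway package is available.
if (requireNamespace("faraway", quietly = TRUE)) {
  data("teengamb", package = "faraway")
  teengamb$sex <- factor(teengamb$sex, labels = c("male", "female"))
  print(summary(teengamb))                          # five-number summaries
  print(group_summary(teengamb, "gamble", "sex"))   # gambling by sex
  # Pearson correlation of each numeric predictor with gamble
  print(round(cor(teengamb[, c("status", "income", "verbal")], teengamb$gamble), 2))
}
```

Comparing the mean and the median of gamble within each sex gives a quick numerical check on the skewness seen in the histogram.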

ggplot(teengamb, aes(x = income, y = gamble)) +
  # jitter horizontally only, so the plotted gambling amounts are not displaced
  geom_point(aes(color = sex), size = 3, alpha = 0.6, position = position_jitter(width = 0.3, height = 0)) +
  geom_smooth(aes(group = 1), method = "lm", col = "blue", se = FALSE, linetype ="solid") +
  labs(
    title = "Gambling Amount by Income and Gender",
    x = "Income",
    y = "Gambling Amount",
    color = "Gender"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'




Problem 2. Faraway Chapter 1, Exercise 3

The dataset prostate is from a study on 97 men with prostate cancer who were due to receive a radical prostatectomy. Make a numerical and graphical summary of the data as in the first question. Note: similarly, in the future we will perform regression by treating lpsa as the response, and various subsets of other variables as predictors.

data("prostate", package = "faraway")
# log(cancer volume)
ggplot(prostate, aes(x = lcavol, y = lpsa)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(x = "lcavol", y = "lpsa")
## `geom_smooth()` using formula = 'y ~ x'

# log(prostate weight)
ggplot(prostate, aes(x = lweight, y = lpsa)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(x = "lweight", y = "lpsa")
## `geom_smooth()` using formula = 'y ~ x'

# age
ggplot(prostate, aes(x = age, y = lpsa)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(x = "Age", y = "lpsa")
## `geom_smooth()` using formula = 'y ~ x'

# log(benign prostatic hyperplasia amount)
ggplot(prostate, aes(x = lbph, y = lpsa)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(x = "lbph", y = "lpsa")
## `geom_smooth()` using formula = 'y ~ x'

# seminal vesicle invasion
ggplot(prostate, aes(x = svi, y = lpsa)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(x = "svi", y = "lpsa")
## `geom_smooth()` using formula = 'y ~ x'

# log(capsular penetration)
ggplot(prostate, aes(x = lcp, y = lpsa)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(x = "lcp", y = "lpsa")
## `geom_smooth()` using formula = 'y ~ x'

# Gleason score
ggplot(prostate, aes(x = gleason, y = lpsa)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(x = "Gleason Score", y = "lpsa")
## `geom_smooth()` using formula = 'y ~ x'

# percentage Gleason scores 4 or 5
ggplot(prostate, aes(x = pgg45, y = lpsa)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(x = "pgg45", y = "lpsa")
## `geom_smooth()` using formula = 'y ~ x'

The plots of the prostate dataset suggest the following:

  1. Cancer Volume (lcavol) and lpsa: lpsa increases with log cancer volume.
  2. Prostate Weight (lweight) and lpsa: Heavier prostates tend to have higher lpsa.
  3. Seminal Vesicle Invasion (svi) and lpsa: When svi is present, lpsa is noticeably higher.
  4. Gleason Score (gleason) and lpsa: Higher Gleason scores tend to go with higher lpsa.

In conclusion, cancer volume, prostate weight, seminal vesicle invasion, and Gleason score all show a positive association with lpsa, which makes them natural candidate predictors for the later regression.
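
To back the visual impressions with numbers, one option is to rank the predictors by their correlation with lpsa. This is a minimal sketch, assuming the faraway package is installed and that every column of the data frame is numeric. `cor_with_response` is a hypothetical helper name, not part of the original analysis.

```r
# Hypothetical helper: Pearson correlation of every other column with the
# response, sorted by absolute value (strongest association first).
cor_with_response <- function(df, response) {
  predictors <- setdiff(names(df), response)
  r <- vapply(predictors, function(p) cor(df[[p]], df[[response]]), numeric(1))
  r[order(abs(r), decreasing = TRUE)]
}

# Apply to prostate only if the faraway package is available.
if (requireNamespace("faraway", quietly = TRUE)) {
  data("prostate", package = "faraway")
  print(summary(prostate))
  print(round(cor_with_response(prostate, "lpsa"), 2))
}
```

Pearson correlation only captures linear association, and svi is binary, so the ranking is a rough guide rather than a substitute for the plots.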




Problem 3.

(Proposition 6.2 of Review of Matrix Algebra) Let \(A \in \mathbb{R}^{n \times n}\) be a real symmetric and idempotent matrix, and \(\{\lambda_1, \ldots, \lambda_n\}\) its eigenvalues. Prove that: (a) \(\lambda_i\) is either 0 or 1 for all \(1 \le i \le n\); (b) \(\operatorname{tr}(A) = \operatorname{rank}(A)\); (c) \(\operatorname{rank}(A) + \operatorname{rank}(I - A) = n\).

knitr::include_graphics("HW1_3.jpeg")
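
The three claims can also be checked numerically on an example. The sketch below is not a proof. It assumes a hat matrix \(X(X'X)^{-1}X'\) built from a random full-column-rank \(X\) as a convenient symmetric idempotent matrix. The function name `check_idempotent_props` and the tolerance are illustrative choices.

```r
# Hypothetical checker for parts (a)-(c) on a given symmetric idempotent A.
check_idempotent_props <- function(A, tol = 1e-8) {
  n <- nrow(A)
  lambda  <- eigen(A, symmetric = TRUE)$values
  rank_of <- function(M) qr(M)$rank            # numerical rank via QR
  c(
    eigenvalues_0_or_1 = all(abs(lambda * (lambda - 1)) < tol),     # (a)
    trace_equals_rank  = abs(sum(diag(A)) - rank_of(A)) < tol,      # (b)
    ranks_sum_to_n     = rank_of(A) + rank_of(diag(n) - A) == n     # (c)
  )
}

set.seed(1)
X <- matrix(rnorm(6 * 2), nrow = 6, ncol = 2)   # assumed example matrix
A <- X %*% solve(crossprod(X)) %*% t(X)         # symmetric and idempotent
print(check_idempotent_props(A))
```

Each entry of the returned vector corresponds to one part of the proposition, so a `FALSE` would point to the claim that failed for the chosen example.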




Problem 4.

  1. Prove that \(P = I_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n'\) is a projection matrix.
  2. Identify the vector Py, the projection of an arbitrary y by P.
  3. Based on the answer from the previous part, what does the projection matrix P do to y? Explain in your own words.
knitr::include_graphics("HW1_4.jpeg")
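
A small numerical check of the three parts, as a sketch: it builds \(P\) for a short example vector and compares \(Py\) with the centered vector \(y - \bar{y}\mathbf{1}_n\), which is the standard result for this matrix. The name `centering_matrix` and the example values of `y` are assumptions made for illustration.

```r
# Hypothetical helper: P = I_n - (1/n) 1_n 1_n'
centering_matrix <- function(n) diag(n) - matrix(1 / n, nrow = n, ncol = n)

y <- c(2, 4, 9)                      # arbitrary example vector
P <- centering_matrix(length(y))

# Part 1: P is symmetric and idempotent, i.e. an orthogonal projection
print(isTRUE(all.equal(P, t(P))))
print(isTRUE(all.equal(P %*% P, P)))

# Parts 2 and 3: Py subtracts the sample mean from every entry of y
print(drop(P %*% y))
print(isTRUE(all.equal(drop(P %*% y), y - mean(y))))
```

In words, \(P\) centers \(y\): it removes the component along \(\mathbf{1}_n\) and leaves the deviations from the mean, which always sum to zero.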