HW-Q 1: For answering the following questions, we will use the diabetes dataset from the faraway package.

  1. Install and load the faraway package. Do not include the installation command in your .Rmd file. Do include the command to load the package into your environment.
# install.packages("faraway")
library(faraway)
  1. What is the mean HDL level (High Density Lipoprotein) of individuals in this sample?
x = diabetes$hdl
mean(x, na.rm = TRUE)
## [1] 50.44527
  1. What is the mean HDL of females in this sample?
mean(subset(diabetes, gender == "female")$hdl)
## [1] 52.11111
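
When a variable contains missing values, a group mean computed without `na.rm = TRUE` returns `NA`. The sketch below computes the mean for every group in one call. It uses a small made-up data frame (`toy`, with invented values) as a stand-in for `diabetes`, so the numbers are illustrative only.

```r
# Hypothetical toy data; these are NOT values from faraway::diabetes
toy = data.frame(
  gender = c("female", "female", "male", "male", "male"),
  hdl    = c(50, 60, 40, NA, 44)
)

# Mean of hdl within each gender; na.rm = TRUE is passed on to mean()
group_means = tapply(toy$hdl, toy$gender, mean, na.rm = TRUE)
group_means
```

The same call with `diabetes$hdl` and `diabetes$gender` should give both group means at once, assuming the column names used above.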

  1. Create a scatter plot of total cholesterol (y-axis) vs weight (x-axis). Use a non-default color for the points.

plot(chol ~ weight, data = diabetes,
     xlab = "Weight",
     ylab = "Total Cholesterol",
     main = "Total Cholesterol vs Weight",
     pch  = 20,
     cex  = 2,
     col  = "lightpink")

  1. Based on the scatter plot, does there seem to be a relationship between the two variables? Briefly explain.
  1. Create side-by-side boxplots for HDL by gender. Use non-default colors for the plot.
boxplot(hdl ~ gender, data = diabetes,
     xlab = "Gender",
     ylab = "High-Density Lipoprotein",
     main = "HDL vs Gender",
     pch  = 20,
     cex  = 2,
     col    = "lightpink",
     border = "darkgreen")

  1. Based on the boxplot, does there seem to be a difference in HDL level between the genders? Briefly explain.

HW-Q 2: For this exercise we will use the data stored in nutrition.csv provided with the homework handout.

It contains the nutritional values per serving size for a large variety of foods as calculated by the USDA.

  1. How many variables does this data set have?
nutrition <- read.csv("~/Downloads/nutrition.csv")
summary(nutrition)
##        ID            Desc               Water           Calories    
##  Min.   : 1001   Length:5138        Min.   :  0.00   Min.   :  0.0  
##  1st Qu.: 7925   Class :character   1st Qu.: 20.59   1st Qu.: 75.0  
##  Median :11800   Mode  :character   Median : 64.58   Median :177.0  
##  Mean   :14271                      Mean   : 54.20   Mean   :223.1  
##  3rd Qu.:18968                      3rd Qu.: 81.30   3rd Qu.:347.0  
##  Max.   :93600                      Max.   :100.00   Max.   :902.0  
##     Protein            Fat              Carbs            Fiber       
##  Min.   : 0.000   Min.   :  0.000   Min.   :  0.00   Min.   : 0.000  
##  1st Qu.: 1.710   1st Qu.:  0.590   1st Qu.:  1.55   1st Qu.: 0.000  
##  Median : 6.700   Median :  3.955   Median : 11.15   Median : 0.800  
##  Mean   : 9.961   Mean   : 10.313   Mean   : 23.59   Mean   : 2.342  
##  3rd Qu.:16.500   3rd Qu.: 12.280   3rd Qu.: 38.35   3rd Qu.: 2.775  
##  Max.   :88.320   Max.   :100.000   Max.   :100.00   Max.   :79.000  
##      Sugar           Calcium          Potassium           Sodium       
##  Min.   : 0.000   Min.   :   0.00   Min.   :    0.0   Min.   :    0.0  
##  1st Qu.: 0.000   1st Qu.:  10.00   1st Qu.:  118.0   1st Qu.:   27.0  
##  Median : 2.330   Median :  24.00   Median :  210.0   Median :  113.0  
##  Mean   : 9.000   Mean   :  85.41   Mean   :  275.6   Mean   :  343.9  
##  3rd Qu.: 9.795   3rd Qu.:  75.00   3rd Qu.:  328.0   3rd Qu.:  440.8  
##  Max.   :99.800   Max.   :7364.00   Max.   :16500.0   Max.   :38758.0  
##     VitaminC            Chol           Portion         
##  Min.   :   0.00   Min.   :   0.00   Length:5138       
##  1st Qu.:   0.00   1st Qu.:   0.00   Class :character  
##  Median :   0.10   Median :   0.00   Mode  :character  
##  Mean   :  10.38   Mean   :  32.44                     
##  3rd Qu.:   4.20   3rd Qu.:  53.00                     
##  Max.   :2732.00   Max.   :3100.00
  1. How many observations does the data set have for each of its variables?
NROW(nutrition)
## [1] 5138
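
The counts of variables and observations can also be read off directly instead of from `summary()` output. The sketch below uses R's built-in `mtcars` data as a stand-in (an assumption for illustration only); the same calls should apply to `nutrition`.

```r
# mtcars ships with base R and stands in for the nutrition data here
n_vars = ncol(mtcars)   # number of variables (columns)
n_obs  = nrow(mtcars)   # number of observations (rows)
dim(mtcars)             # rows and columns together

# Non-missing observations for each variable
obs_per_var = colSums(!is.na(mtcars))
obs_per_var
```

Counting non-missing values per column answers the "for each of the variables" part of the question even when some columns contain `NA`.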
  1. Create a histogram of the variable Calories. Do not modify R’s default bin selection. Make the plot presentable.
library(readr)
nutrition = read_csv("nutrition.csv")
## Rows: 5138 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): Desc, Portion
## dbl (13): ID, Water, Calories, Protein, Fat, Carbs, Fiber, Sugar, Calcium, P...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hist(nutrition$Calories,
     xlab = "Calories",
     main = "Histogram of Calories for Various Foods",
     border = "darkgreen",
     col  = "lightpink")

  1. Describe the shape of the histogram.
  1. Do you notice anything unusual?
  1. Create a scatter plot of Calories (y-axis) vs 4 * Protein + 4 * Carbs + 9 * Fat + 2 * Fiber (x-axis). Make the plot presentable.
plot(Calories ~ I(4 * Protein + 4 * Carbs + 9 * Fat + 2 * Fiber), data = nutrition,
     xlab = "4 * Protein + 4 * Carbs + 9 * Fat + 2 * Fiber",
     ylab = "Calories",
     main = "Calories vs 4 * Protein + 4 * Carbs + 9 * Fat + 2 * Fiber",
     pch  = 20,
     cex  = 1,
     col  = "lightpink")

  1. From the graph above, what could be a possible fitted function for the data? Briefly explain your reasoning.

HW-Q 3: For the following questions, we will use the Advertising data from the ISLR2 package, with sales as the response \((y_i)\) and TV as the predictor \((x_i)\).

  1. RSS was defined in Section 3.1.1, and is given by the formula

\(\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\)

For the given data, calculate the RSS.

Advertising <- read.csv("~/Downloads/Advertising.csv")
yi = Advertising$sales
xi = Advertising$TV

model1 = lm(yi ~ xi)
summary(model1)
## 
## Call:
## lm(formula = yi ~ xi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3860 -1.9545 -0.1913  2.0671  7.2124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.032594   0.457843   15.36   <2e-16 ***
## xi          0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16
yi.hat = model1$fitted.values
head(yi.hat)
##         1         2         3         4         5         6 
## 17.970775  9.147974  7.850224 14.234395 15.627218  7.446162
RSS = sum((yi - yi.hat)^2)
RSS
## [1] 2102.531
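
As a cross-check, the residual sum of squares can be obtained in more than one way from a fitted `lm` object. The sketch below uses R's built-in `cars` data as a stand-in (an assumption for illustration, not the Advertising data) and compares three equivalent computations.

```r
# Built-in cars data stands in for Advertising (dist ~ speed instead of sales ~ TV)
fit = lm(dist ~ speed, data = cars)

# 1) Directly from the definition: observed minus fitted, squared and summed
rss_manual = sum((cars$dist - fitted(fit))^2)

# 2) From the stored residuals
rss_resid = sum(residuals(fit)^2)

# 3) deviance() on an lm object returns the residual sum of squares
rss_dev = deviance(fit)

c(rss_manual, rss_resid, rss_dev)
```

All three values should agree up to floating-point error.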
  1. TSS measures the total variance in the response \(Y\). It can be calculated using the formula below:

\(\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2\)

TSS is also known as the total sum of squares, and it is used to calculate \(R^2\). For the given data, calculate the TSS.

y.bar = mean(yi)
y.bar
## [1] 14.0225
TSS = sum((yi - y.bar)^2)
TSS
## [1] 5417.149
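
The total sum of squares is the sample variance scaled by \(n - 1\), which gives a quick sanity check on the hand computation. The sketch below again uses the built-in `cars` data as a stand-in (an assumption for illustration).

```r
# Response from the built-in cars data, standing in for Advertising$sales
y = cars$dist

# From the definition: squared deviations from the mean, summed
tss_manual = sum((y - mean(y))^2)

# Equivalent form: var() divides by (n - 1), so multiply it back
tss_var = (length(y) - 1) * var(y)

c(tss_manual, tss_var)
```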
  1. \(R^2\) measures the proportion of variability in \(Y\) that can be explained using \(X\). For simple linear regression, this \(R^2\) statistic can also be calculated using the following formula:

\(r^2 = (\mathrm{Cor}(x, y))^2 = \left(\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}\right)^2\)

Using the data given in this exercise, calculate the above quantity.

R_sqr = summary(model1)$r.squared
R_sqr
## [1] 0.6118751
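
The value reported by `summary()` can be verified against the correlation formula and against \(1 - \mathrm{RSS}/\mathrm{TSS}\). The sketch below does this on the built-in `cars` data, used here as a stand-in (an assumption for illustration); with the Advertising data, `x` and `y` would be `xi` and `yi`.

```r
# Built-in cars data stands in for Advertising
x = cars$speed
y = cars$dist
fit = lm(y ~ x)

# Squared sample correlation, written out term by term
num = sum((x - mean(x)) * (y - mean(y)))
den = sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2))
r2_formula = (num / den)^2

# The same quantity via cor()
r2_cor = cor(x, y)^2

# Via the sums of squares: 1 - RSS / TSS
rss = sum(residuals(fit)^2)
tss = sum((y - mean(y))^2)
r2_ss = 1 - rss / tss

# Value stored by summary()
r2_lm = summary(fit)$r.squared

c(r2_formula, r2_cor, r2_ss, r2_lm)
```

For a simple linear regression with an intercept, all four values should agree up to floating-point error.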