HW2 MATH-420

HW-Q 1: For answering the following questions, we will use the `diabetes` dataset from the faraway package.

Install and load the faraway package. Do not include the installation command in your .Rmd file. Do include the command to load the package into your environment.

# install.packages("faraway")
library(faraway)

What is the mean HDL level (High Density Lipoprotein) of individuals in this sample?

x = diabetes$hdl
mean(x, na.rm = TRUE)

## [1] 50.44527

What is the mean HDL of females in this sample?

mean(subset(diabetes, gender == "female")$hdl)

## [1] 52.11111
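As a cross-check on a subgroup mean like the one above, the means for every group can be computed in a single call. The following is a minimal sketch on a small made-up data frame; the `toy` object and its values are assumptions for illustration and are not taken from `diabetes`.

```r
# Hypothetical toy data standing in for diabetes; all values are made up.
toy <- data.frame(
  gender = c("male", "female", "female", "male", "female"),
  hdl    = c(40, 55, NA, 45, 65)
)

# Mean of hdl within each gender; na.rm = TRUE is passed through to mean()
# so the missing value is dropped instead of turning the result into NA.
group_means <- tapply(toy$hdl, toy$gender, mean, na.rm = TRUE)
group_means
```

Without `na.rm = TRUE`, any group that contains a missing value would return `NA`, so it is usually safer to set it explicitly when summarising by group.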

Create a scatter plot of total cholesterol (y-axis) vs weight (x-axis). Use a non-default color for the points.

plot(chol ~ weight, data = diabetes,
     xlab = "Weight",
     ylab = "Total Cholesterol",
     main = "Total Cholesterol vs Weight",
     pch  = 20,
     cex  = 2,
     col  = "lightpink")

Based on the scatter plot, does there seem to be a relationship between the two variables? Briefly explain.

There doesn’t seem to be a clear trend between the two variables. Average total cholesterol appears almost constant across different weights.

Create side-by-side boxplots for HDL by gender. Use non-default colors for the plot.

boxplot(hdl ~ gender, data = diabetes,
     xlab = "Gender",
     ylab = "High-Density Lipoprotein",
     main = "HDL vs Gender",
     pch  = 20,
     cex  = 2,
     col    = "lightpink",
     border = "darkgreen")

Based on the boxplot, does there seem to be a difference in HDL level between the genders? Briefly explain.

HDL levels are generally similar between the genders, apart from females having a smaller range of HDL.

HW-Q 2: For this exercise we will use the data stored in `nutrition.csv` provided with the homework handle.

It contains the nutritional values per serving size for a large variety of foods as calculated by the USDA.

How many variables does this data set have?

nutrition <- read.csv("~/Downloads/nutrition.csv")
summary(nutrition)

##        ID            Desc               Water           Calories    
##  Min.   : 1001   Length:5138        Min.   :  0.00   Min.   :  0.0  
##  1st Qu.: 7925   Class :character   1st Qu.: 20.59   1st Qu.: 75.0  
##  Median :11800   Mode  :character   Median : 64.58   Median :177.0  
##  Mean   :14271                      Mean   : 54.20   Mean   :223.1  
##  3rd Qu.:18968                      3rd Qu.: 81.30   3rd Qu.:347.0  
##  Max.   :93600                      Max.   :100.00   Max.   :902.0  
##     Protein            Fat              Carbs            Fiber       
##  Min.   : 0.000   Min.   :  0.000   Min.   :  0.00   Min.   : 0.000  
##  1st Qu.: 1.710   1st Qu.:  0.590   1st Qu.:  1.55   1st Qu.: 0.000  
##  Median : 6.700   Median :  3.955   Median : 11.15   Median : 0.800  
##  Mean   : 9.961   Mean   : 10.313   Mean   : 23.59   Mean   : 2.342  
##  3rd Qu.:16.500   3rd Qu.: 12.280   3rd Qu.: 38.35   3rd Qu.: 2.775  
##  Max.   :88.320   Max.   :100.000   Max.   :100.00   Max.   :79.000  
##      Sugar           Calcium          Potassium           Sodium       
##  Min.   : 0.000   Min.   :   0.00   Min.   :    0.0   Min.   :    0.0  
##  1st Qu.: 0.000   1st Qu.:  10.00   1st Qu.:  118.0   1st Qu.:   27.0  
##  Median : 2.330   Median :  24.00   Median :  210.0   Median :  113.0  
##  Mean   : 9.000   Mean   :  85.41   Mean   :  275.6   Mean   :  343.9  
##  3rd Qu.: 9.795   3rd Qu.:  75.00   3rd Qu.:  328.0   3rd Qu.:  440.8  
##  Max.   :99.800   Max.   :7364.00   Max.   :16500.0   Max.   :38758.0  
##     VitaminC            Chol           Portion         
##  Min.   :   0.00   Min.   :   0.00   Length:5138       
##  1st Qu.:   0.00   1st Qu.:   0.00   Class :character  
##  Median :   0.10   Median :   0.00   Mode  :character  
##  Mean   :  10.38   Mean   :  32.44                     
##  3rd Qu.:   4.20   3rd Qu.:  53.00                     
##  Max.   :2732.00   Max.   :3100.00

There are 15 variables.

How many observations does the data set have for each of its variables?

NROW(nutrition)

## [1] 5138

There are 5138 observations.
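The row and column counts can also be read off directly, without scanning a `summary()` printout. This is a minimal sketch on a made-up data frame; `toy` is a hypothetical stand-in, and the same calls would apply to `nutrition`.

```r
# Hypothetical toy data frame; values are made up for illustration.
toy <- data.frame(
  ID       = 1:4,
  Calories = c(100, 250, 0, 902),
  Fat      = c(1, 12, 0, 100)
)

n_obs  <- nrow(toy)  # number of observations (rows)
n_vars <- ncol(toy)  # number of variables (columns)
dim(toy)             # both at once: rows first, then columns
```

For a rectangular data frame every variable has the same number of rows, so `nrow()` answers the "observations per variable" question as long as missing values are not being excluded.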

Create a histogram of the variable Calories. Do not modify R’s default bin selection. Make the plot presentable.

library(readr)
nutrition = read_csv("nutrition.csv")

## Rows: 5138 Columns: 15

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): Desc, Portion
## dbl (13): ID, Water, Calories, Protein, Fat, Carbs, Fiber, Sugar, Calcium, P...

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

hist(nutrition$Calories,
     xlab = "Calories",
     main = "Histogram of Calories for Various Foods",
     border = "darkgreen",
     col  = "lightpink")

Describe the shape of the histogram.

The distribution of Calories is right-skewed and approximately unimodal.

Do you notice anything unusual?

There are two unusual peaks, one at 400 kcal and one above 800 kcal.

Create a scatter plot of Calories (y-axis) vs `4 * Protein + 4 * Carbs + 9 * Fat + 2 * Fiber` (x-axis). Make the plot presentable.

plot(Calories ~ I(4 * Protein + 4 * Carbs + 9 * Fat + 2 * Fiber), data = nutrition,
     xlab = "4 * Protein + 4 * Carbs + 9 * Fat + 2 * Fiber",
     ylab = "Calories",
     main = "Calories vs Weighted Sum of Protein, Carbs, Fat, and Fiber",
     pch  = 20,
     cex  = 1,
     col  = "lightpink")

From the graph above, what could be a possible fitted function for the data? Briefly explain your reasoning.

A possible fitted function could be a straight line through the points on the graph, because the data follow a roughly linear trend.
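The straight-line idea can be illustrated by fitting a simple linear model and overlaying it on the scatter plot. The sketch below uses simulated data rather than `nutrition.csv`: `x` and `y` are made-up stand-ins for the weighted sum and for Calories, and the noise level is an arbitrary assumption.

```r
set.seed(1)  # make the made-up data reproducible

# x plays the role of 4*Protein + 4*Carbs + 9*Fat + 2*Fiber; y plays Calories.
x <- runif(50, min = 0, max = 900)
y <- x + rnorm(50, sd = 10)

# Fit a straight line y = b0 + b1 * x by least squares.
fit <- lm(y ~ x)
coef(fit)  # estimated intercept and slope

# Scatter plot with the fitted line drawn on top.
plot(y ~ x, pch = 20, col = "lightpink",
     xlab = "x (made-up predictor)", ylab = "y (made-up response)",
     main = "Simulated Data with Fitted Line")
abline(fit, col = "darkgreen", lwd = 2)
```

If the points hug the line across the whole range of `x`, a linear function is a reasonable candidate; visible curvature or fanning would suggest otherwise.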

HW-Q 3: For the following questions, we will use the `Advertising` data from the ISLR2 package, using `sales` as the response \((y_i)\) and `TV` as the predictor \((x_i)\).

RSS was defined in Section 3.1.1, and is given by the formula

\(\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\)

For the given data, calculate the RSS using the above formula.

Advertising <- read.csv("~/Downloads/Advertising.csv")
yi = Advertising$sales
xi = Advertising$TV

model1 = lm(yi ~ xi)
summary(model1)

## 
## Call:
## lm(formula = yi ~ xi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3860 -1.9545 -0.1913  2.0671  7.2124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.032594   0.457843   15.36   <2e-16 ***
## xi          0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16

yi.hat = model1$fitted.values
head(yi.hat)

##         1         2         3         4         5         6 
## 17.970775  9.147974  7.850224 14.234395 15.627218  7.446162

RSS = sum((yi - yi.hat)^2)
RSS

## [1] 2102.531
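The RSS computed from the definition can be cross-checked against values that R derives from the fitted model. This is a minimal sketch on made-up numbers; `x`, `y`, and the `rss_*` names are hypothetical and are not the Advertising data.

```r
# Made-up data for illustration only.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)

rss_manual   <- sum((y - fitted(fit))^2)  # definition: sum of squared residuals
rss_resid    <- sum(residuals(fit)^2)     # same quantity via residuals()
rss_deviance <- deviance(fit)             # for an lm fit, deviance() is the RSS

all.equal(rss_manual, rss_resid)
all.equal(rss_manual, rss_deviance)
```

All three routes should agree up to floating-point error, which makes them a quick sanity check on a hand-written formula.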

TSS measures the total variance in the response \(Y\). It can be calculated using the formula below:

\(\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2\)

TSS is also known as the total sum of squares, and it is used to calculate \(R^2\). For the given data, calculate the TSS.

y.bar = mean(yi)
head(y.bar)

## [1] 14.0225

TSS = sum((yi - y.bar)^2)
TSS

## [1] 5417.149

\(R^2\) measures the proportion of variability in \(Y\) that can be explained using \(X\). This \(R^2\) statistic can also be calculated using the following formula:

\(r^2 = (\mathrm{Cor}(x, y))^2 = \left(\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}\right)^2\)

Using the data given in this exercise, calculate the above quantity.

R_sqr = summary(model1)$r.squared
R_sqr

## [1] 0.6118751
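For simple linear regression, the squared-correlation formula, the ratio \(1 - \mathrm{RSS}/\mathrm{TSS}\), and the value reported by `summary()` all give the same \(R^2\). With the RSS and TSS reported above, \(1 - 2102.531/5417.149 \approx 0.6119\), which matches. The sketch below shows the equivalence on made-up numbers; the data and the `r2_*` names are hypothetical.

```r
# Made-up data for illustration only.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)

# 1) Squared correlation written out term by term.
num    <- sum((x - mean(x)) * (y - mean(y)))
den    <- sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2))
r2_cor <- (num / den)^2

# 2) Same thing with the built-in correlation function.
r2_builtin <- cor(x, y)^2

# 3) From the sums of squares: 1 - RSS / TSS.
rss    <- sum(residuals(fit)^2)
tss    <- sum((y - mean(y))^2)
r2_rss <- 1 - rss / tss

# 4) Value stored in the model summary.
r2_lm <- summary(fit)$r.squared

c(r2_cor, r2_builtin, r2_rss, r2_lm)
```

The agreement between the correlation-based and regression-based values holds for a single predictor with an intercept; with several predictors, \(R^2\) is no longer the square of one pairwise correlation.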

Ghita Belaid

2/27/2022
