1. Exploratory Data Analysis

1.1 Load the mtcars dataset, review its codebook, and report summary statistics for each column. Is there anything you find abnormal? (10pts)

# Load required libraries
library(ggplot2)
library(corrplot)
## corrplot 0.95 loaded
# Load the mtcars dataset
data(mtcars)

# Review the codebook via the built-in help page
?mtcars

# Display the first few rows and summary statistics
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

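Answer: All 11 columns are stored as numeric, but several are really categorical. vs and am are binary 0/1 indicators, and cyl, gear, and carb take only a few discrete values, so they are better treated as factors. The maxima of hp (335) and carb (8) also sit well above their third quartiles (180 and 4), which suggests possible outliers.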
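
The recoding of vs and am as factors can be sketched as follows. The copy name cars_f and the label strings are illustrative choices, not part of the assignment; the meaning of the 0/1 codes is taken from the ?mtcars codebook.

```r
# Work on a copy so the original numeric columns stay available for cor() etc.
cars_f <- mtcars

# Labels follow the ?mtcars codebook:
# vs (0 = V-shaped, 1 = straight), am (0 = automatic, 1 = manual)
cars_f$vs <- factor(cars_f$vs, levels = c(0, 1), labels = c("V-shaped", "straight"))
cars_f$am <- factor(cars_f$am, levels = c(0, 1), labels = c("automatic", "manual"))

# summary() now reports counts per level instead of the mean of a 0/1 column
summary(cars_f[, c("vs", "am")])
```

Keeping the factors in a separate copy is a deliberate choice here: cor(mtcars) in a later step needs every column to be numeric.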

1.2 Provide any ONE visualization you used to understand the data (10pts)

# Example: A scatter plot of mpg vs. weight
plot(mtcars$wt, mtcars$mpg)

Answer: I am plotting miles per gallon (mpg) against car weight (wt). The plot shows that heavier cars generally get fewer miles per gallon.
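
To make the downward trend easier to see, a fitted line can be overlaid on the same scatter plot. This is a minimal sketch using a simple linear regression; the object name fit and the axis labels are my own additions, and a straight line is only an assumption about the shape of the relationship.

```r
# Fit a simple linear regression of mpg on weight (wt is in 1000 lbs)
fit <- lm(mpg ~ wt, data = mtcars)

# Redraw the scatter plot with axis labels and overlay the fitted line
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit, col = "red")

# A negative slope means mpg falls as weight rises
coef(fit)["wt"]
```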

1.4 What variables are most strongly correlated with mpg? Answer this with code ALONG WITH markdown narrative.

# Correlation between variables and mpg
cor_mpg <- cor(mtcars)[, "mpg"]
sort(abs(cor_mpg), decreasing = TRUE)
##       mpg        wt       cyl      disp        hp      drat        vs        am 
## 1.0000000 0.8676594 0.8521620 0.8475514 0.7761684 0.6811719 0.6640389 0.5998324 
##      carb      gear      qsec 
## 0.5509251 0.4802848 0.4186840

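Answer: wt, cyl, disp, hp, and drat have the strongest correlations with mpg in this dataset (absolute values from about 0.87 down to 0.68). The sorted values above are absolute, so they do not show direction: wt, cyl, disp, and hp are negatively correlated with mpg, while drat is positively correlated.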
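
Because sort(abs(...)) drops the sign, a small variation keeps the ranking by strength while still showing direction. This is a sketch; cor_signed is a new name introduced only for this example.

```r
# Correlation of every column with mpg, signs preserved
cor_signed <- cor(mtcars)[, "mpg"]

# Order by absolute strength, but print the signed values
cor_signed <- cor_signed[order(abs(cor_signed), decreasing = TRUE)]
round(cor_signed, 3)
```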

1.5 Check whether there are missing data using R codes. If not, show evidence. If yes, which column and how many. 10pts

# Check for missing values in each column
missing_val <- colSums(is.na(mtcars))
print(missing_val)
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##    0    0    0    0    0    0    0    0    0    0    0
# Total missing across entire dataset
cat("Total missing values:", sum(is.na(mtcars)), "\n")
## Total missing values: 0

Answer: There are no missing values in the dataset. Every column has a missing count of 0, and the total across the whole dataset is also 0.

1.6 Use Box plot to decide if any variable has significant outliers. 10pts

# Box plots for all variables
par(mfrow = c(3, 4))
for (var in names(mtcars)) {
  boxplot(mtcars[[var]], main = var, col = "lightblue")
}
par(mfrow = c(1, 1))

Answer: The box plots show a few outliers. The most noticeable are in qsec (quarter-mile time), hp (horsepower), and carb. wt (weight) also has some outliers, but they are less pronounced.
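
The points drawn beyond the whiskers can also be listed numerically instead of being read off the plots. This is a sketch using boxplot.stats() from base R, which applies the same 1.5 * IQR rule as boxplot() by default; the name outliers is introduced here for illustration.

```r
# boxplot.stats()$out returns the values that boxplot() would draw
# as individual points beyond the whiskers
outliers <- lapply(mtcars, function(x) boxplot.stats(x)$out)

# Keep only the variables that have at least one flagged value
outliers[lengths(outliers) > 0]
```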

1.7 Apply range standardization on variable hp, and save it with the new name hp_rs in data.frame mtcars. Report the max and min of the standardized hp_rs to validate. 20pts

# Range standardization formula: (x - min) / (max - min)
mtcars$hp_rs <- (mtcars$hp - min(mtcars$hp)) / (max(mtcars$hp) - min(mtcars$hp))

# Validate: min should be 0, max should be 1
cat("Min of hp_rs:", min(mtcars$hp_rs), "\n")
## Min of hp_rs: 0
cat("Max of hp_rs:", max(mtcars$hp_rs), "\n")
## Max of hp_rs: 1

1.8 Winsorize the variable wt with 5%/95%, save it with the new name wt_win, and report the new max and min. 20pts

# 5th and 95th percentile cutoffs
lower <- quantile(mtcars$wt, 0.05)
upper <- quantile(mtcars$wt, 0.95)

# Winsorizing keeps values inside the cutoffs unchanged and caps values outside them at the cutoffs
mtcars$wt_win <- mtcars$wt
mtcars$wt_win[mtcars$wt_win < lower] <- lower
mtcars$wt_win[mtcars$wt_win > upper] <- upper

# New min and max
cat("Min of wt_win:", min(mtcars$wt_win), "\n")
## Min of wt_win: 1.736
cat("Max of wt_win:", max(mtcars$wt_win), "\n")
## Max of wt_win: 5.29275
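
The same capping can be written more compactly with pmax() and pmin(). The helper below is a hypothetical convenience function, not part of the assignment; winsorize and wt_win2 are names made up for this sketch, and it relies on the default quantile type, as the code above does.

```r
# Hypothetical helper: cap x at its lower and upper quantiles
winsorize <- function(x, probs = c(0.05, 0.95)) {
  cutoffs <- quantile(x, probs, na.rm = TRUE, names = FALSE)
  # pmax() raises values below the lower cutoff, pmin() lowers values above the upper one
  pmin(pmax(x, cutoffs[1]), cutoffs[2])
}

wt_win2 <- winsorize(mtcars$wt)
range(wt_win2)
```

The result should match wt_win from the step-by-step version, which gives a quick cross-check of both approaches.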