1. Exploratory Data Analysis

1.1 Load the mtcars dataset, review its codebook, and report summary statistics for each column. Is there anything you find abnormal? (10pts)

# Load required libraries
library(ggplot2)
library(corrplot)
## corrplot 0.95 loaded
# Load the mtcars dataset
data(mtcars)

# Review the codebook with the help operator ?
?mtcars
## starting httpd help server ...
##  done
# Display the first few rows and summary statistics
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

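Answer: Overall, I do not see anything clearly abnormal in this data, and every value falls in a plausible range. Two points are worth noting. First, cyl, vs, am, gear, and carb are categorical codes stored as numbers, so their means (for example, 0.4062 for am) are proportions or averages of codes rather than physical measurements. Second, the maxima of hp (335) and carb (8) sit well above their third quartiles (180 and 4), which hints at possible outliers. I'm sure more will be discovered as I dig further into the data.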
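One quick way to see which columns are coded categories is to count distinct values per column. This is only a sketch: the copy named cars, the helper name n_distinct, and the cutoff of 6 distinct values are my own choices for illustration, not part of the assignment.

```r
# Work on a copy so the mtcars object used elsewhere is not modified
cars <- datasets::mtcars

# Number of distinct values in each column; small counts suggest a coded category
n_distinct <- sapply(cars, function(x) length(unique(x)))
n_distinct

# Columns at or below an arbitrary cutoff of 6 distinct values (an assumption)
coded_cols <- names(n_distinct)[n_distinct <= 6]

# Frequency table for each of those columns
lapply(cars[coded_cols], table)
```

Columns flagged this way are usually better summarized with table() than with a mean.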

1.2 Provide any ONE visualization you used to understand the data (10pts)

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 3, fill = "steelblue", color = "black") +
  labs(
    title = "Distribution of Miles Per Gallon (mpg)",
    x = "Miles Per Gallon",
    y = "Frequency"
  ) +
  theme_minimal()

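Answer: Here is a histogram of mpg, used to understand the distribution of fuel efficiency across vehicles. The plot shows that most cars fall between 15 and 25 mpg. The distribution appears slightly right-skewed, which is consistent with the mean (20.09) exceeding the median (19.20) in the summary statistics.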
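The visual impression of skew can be checked numerically. The sketch below uses base R only and the simple moment-based skewness formula (third central moment divided by the 3/2 power of the second); other definitions apply small-sample corrections and give slightly different values. The variable names are illustrative.

```r
# Work on a copy so the mtcars object used elsewhere is not modified
cars <- datasets::mtcars
x <- cars$mpg

# A positive gap between mean and median points to a longer right tail
mean_minus_median <- mean(x) - median(x)
mean_minus_median

# Moment-based skewness (no small-sample correction); positive means right-skewed
m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)
skew <- m3 / m2^(3 / 2)
skew
```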

1.4 What variables are most strongly correlated with mpg? Answer this with code ALONG WITH markdown narrative. 10 pts

# Computing correlation matrix
cor_matrix <- cor(mtcars)

# Extracting correlations with mpg
mpg_correlations <- cor_matrix["mpg", ]

# Sorting by strength (absolute value)
mpg_correlations_sorted <- sort(abs(mpg_correlations), decreasing = TRUE)

mpg_correlations_sorted
##       mpg        wt       cyl      disp        hp      drat        vs        am 
## 1.0000000 0.8676594 0.8521620 0.8475514 0.7761684 0.6811719 0.6640389 0.5998324 
##      carb      gear      qsec 
## 0.5509251 0.4802848 0.4186840

Answer: The variables most strongly correlated with mpg are wt (weight), cyl (number of cylinders), disp (displacement), and hp (gross horsepower), each with an absolute correlation above 0.77. Because the code sorts by abs(), the output above shows magnitudes only; the sign has to be read from the unsorted cor_matrix["mpg", ] values. The strongest relationship is with wt, at approximately -0.87, which is a strong negative correlation: as weight increases, miles per gallon decreases.
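To keep the direction of each relationship visible, the correlations can be ordered by absolute value without replacing them by their absolute values. This is a minimal sketch; cars, mpg_cor, and mpg_cor_signed are names I introduced for illustration.

```r
# Work on a copy so the mtcars object used elsewhere is not modified
cars <- datasets::mtcars

# Correlations of every column with mpg, dropping mpg's correlation with itself
mpg_cor <- cor(cars)["mpg", ]
mpg_cor <- mpg_cor[names(mpg_cor) != "mpg"]

# Order by strength, but keep the original signed values
mpg_cor_signed <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
round(mpg_cor_signed, 3)
```

Using order() on the absolute values and indexing the original vector is what preserves the sign.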

1.5 Check whether there are missing data using R codes. If not, show evidence. If yes, which column and how many. 10 pts

# Checking if there are any missing values
any(is.na(mtcars))
## [1] FALSE
# Counting missing values per column
colSums(is.na(mtcars))
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##    0    0    0    0    0    0    0    0    0    0    0

Answer: There are no missing values. The first line of code checks whether any value in the data frame is missing and returns FALSE. As further evidence, colSums(is.na(mtcars)) counts the missing values in each column, and every count is 0.
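A row-wise check gives the same conclusion from a different angle. The sketch below uses complete.cases(), which marks rows that contain no NA; the names cars and n_complete are illustrative.

```r
# Work on a copy so the mtcars object used elsewhere is not modified
cars <- datasets::mtcars

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(cars))

# Compare against the total number of rows
c(complete_rows = n_complete, total_rows = nrow(cars))
```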

1.6 Use Box plot to decide if any variable has significant outliers. 10pts

boxplot(mtcars,
        main = "Boxplots of All Variables in mtcars",
        col = "blue",
        las = 2)   # rotates labels for readability

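Answer: This box plot shows which variables may have significant outliers. The variable with the clearest outlier is hp. The variables wt (weight), qsec (1/4 mile time), and carb (number of carburetors) also appear to have possible outliers. Because all variables share one axis, disp and hp dominate the scale, so outliers in the small-scale variables are harder to see.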
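Since the shared axis makes small-scale variables hard to read, the points a box plot would draw as outliers can also be listed per variable. This sketch relies on boxplot.stats() from grDevices, which applies the same 1.5 x IQR whisker rule as boxplot() by default; the names cars and outliers are illustrative.

```r
# Work on a copy so the mtcars object used elsewhere is not modified
cars <- datasets::mtcars

# For each column, the values lying beyond the box plot whiskers
outliers <- lapply(cars, function(x) boxplot.stats(x)$out)

# Keep only the columns that have at least one such value
outliers <- outliers[lengths(outliers) > 0]
outliers
```

Whether a flagged value counts as a "significant" outlier is still a judgment call; the rule only marks candidates.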

1.7 Apply range standardization on variable hp, and save it with the new name hp_rs in data.frame mtcars. Report the max and min of the standardized hp_rs to validate. 20 pts

# Applying min-max (range) standardization
mtcars$hp_rs <- (mtcars$hp - min(mtcars$hp)) / 
                (max(mtcars$hp) - min(mtcars$hp))
# Check min and max of standardized variable
min(mtcars$hp_rs)
## [1] 0
max(mtcars$hp_rs)
## [1] 1

Answer: Range standardization was applied to hp by subtracting its minimum and dividing by its range, and the result was saved as hp_rs. The min() and max() calls confirm that hp_rs has a minimum of 0 and a maximum of 1, so hp has been rescaled to the fixed interval [0, 1].
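If the same rescaling is needed for several variables, the formula can be wrapped in a small helper. This is a sketch, not part of the assignment: range_standardize is a hypothetical function name, and the guard against constant input is my own addition.

```r
# Hypothetical helper: rescale a numeric vector to [0, 1]
range_standardize <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # A constant vector has zero range, so the formula would divide by zero
  if (rng[1] == rng[2]) stop("x is constant; range standardization is undefined")
  (x - rng[1]) / (rng[2] - rng[1])
}

# Work on a copy so the mtcars object used elsewhere is not modified
cars <- datasets::mtcars
cars$hp_rs <- range_standardize(cars$hp)

# Validate: the result should span exactly 0 to 1
range(cars$hp_rs)
```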

1.8 Winsorize the variable wt with 5%/95%, save it with the new name wt_win, and report the new max and min. 20 pts

# Step 1
# Calculating 5th and 95th percentiles
lower <- quantile(mtcars$wt, 0.05)
upper <- quantile(mtcars$wt, 0.95)

lower
##    5% 
## 1.736
upper
##     95% 
## 5.29275
# Step 2
# Creating winsorized variable
mtcars$wt_win <- ifelse(mtcars$wt < lower, lower,
                        ifelse(mtcars$wt > upper, upper,
                               mtcars$wt))
# Step 3
min(mtcars$wt_win)
## [1] 1.736
max(mtcars$wt_win)
## [1] 5.29275

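Answer: The variable wt was winsorized at the 5th and 95th percentiles and saved as wt_win. Values below the 5th percentile were raised to 1.736 and values above the 95th percentile were lowered to 5.29275, so after winsorization the minimum is 1.736 and the maximum is 5.29275.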
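The nested ifelse() can also be written as a clamp with pmax() and pmin(), which some readers find easier to follow. The sketch below is an equivalent alternative under the same 5%/95% cutoffs; cars and bounds are names I introduced for illustration.

```r
# Work on a copy so the mtcars object used elsewhere is not modified
cars <- datasets::mtcars

# 5th and 95th percentiles in one call (default quantile type)
bounds <- quantile(cars$wt, probs = c(0.05, 0.95))

# Clamp: raise values below the lower bound, lower values above the upper bound
cars$wt_win <- pmin(pmax(cars$wt, bounds[[1]]), bounds[[2]])

# New minimum and maximum, and how many values were changed
range(cars$wt_win)
sum(cars$wt != cars$wt_win)
```

pmax() applies the lower bound element by element and pmin() then applies the upper bound, so values already inside the bounds pass through unchanged.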