In Part II of the project, you identified the dataset, read the data into R, and performed initial exploration of the data. In this part of the project, you will focus on the following tasks:

1. (4 points) Perform any necessary data cleaning and/or wrangling (reshape data, define and add new relevant variables computed from the variables that are already in the dataset).

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
car_data <- read_csv("car_data.csv") #import dataset into R
## Rows: 550 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): class, drive, fuel_type, make, model, transmission
## dbl (6): city_mpg, combination_mpg, cylinders, displacement, highway_mpg, year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(car_data) #summary of the car_data
##     city_mpg         class           combination_mpg    cylinders     
##  Min.   : 11.00   Length:550         Min.   : 14.00   Min.   : 3.000  
##  1st Qu.: 17.00   Class :character   1st Qu.: 20.00   1st Qu.: 4.000  
##  Median : 20.00   Mode  :character   Median : 23.00   Median : 4.000  
##  Mean   : 21.46                      Mean   : 24.07   Mean   : 5.316  
##  3rd Qu.: 24.00                      3rd Qu.: 27.00   3rd Qu.: 6.000  
##  Max.   :126.00                      Max.   :112.00   Max.   :12.000  
##                                                       NA's   :2       
##   displacement      drive            fuel_type          highway_mpg    
##  Min.   :1.200   Length:550         Length:550         Min.   : 18.00  
##  1st Qu.:2.000   Class :character   Class :character   1st Qu.: 24.00  
##  Median :2.500   Mode  :character   Mode  :character   Median : 28.00  
##  Mean   :2.932                                         Mean   : 28.61  
##  3rd Qu.:3.500                                         3rd Qu.: 32.00  
##  Max.   :6.800                                         Max.   :102.00  
##  NA's   :2                                                             
##      make              model           transmission            year     
##  Length:550         Length:550         Length:550         Min.   :2014  
##  Class :character   Class :character   Class :character   1st Qu.:2016  
##  Mode  :character   Mode  :character   Mode  :character   Median :2019  
##                                                           Mean   :2019  
##                                                           3rd Qu.:2022  
##                                                           Max.   :2024  
## 

Checking for missing values:

sum(is.na(car_data)) #Check for missing values
## [1] 4
#[1] 4

colSums(is.na(car_data)) #Check missing values per column
##        city_mpg           class combination_mpg       cylinders    displacement 
##               0               0               0               2               2 
##           drive       fuel_type     highway_mpg            make           model 
##               0               0               0               0               0 
##    transmission            year 
##               0               0
#Data before being cleaned 
#       city_mpg           class combination_mpg       cylinders 
#              0               0               0               2 
#   displacement           drive       fuel_type     highway_mpg 
#              2               0               0               0 
#           make           model    transmission            year 
#              0               0               0               0 

Remove rows with missing values:

data_clean <- na.omit(car_data)

car_data$cylinders[is.na(car_data$cylinders)] <- mean(car_data$cylinders, na.rm = TRUE)
car_data$displacement[is.na(car_data$displacement)] <- mean(car_data$displacement, na.rm = TRUE)

colSums(is.na(car_data))
##        city_mpg           class combination_mpg       cylinders    displacement 
##               0               0               0               0               0 
##           drive       fuel_type     highway_mpg            make           model 
##               0               0               0               0               0 
##    transmission            year 
##               0               0

Check for Duplicates:

sum(duplicated(car_data)) #Looking for duplicate rows 
## [1] 2
#[1] 2

print(duplicated(car_data)) 
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [181] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [217] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [229] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [241] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [253] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [277] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [289] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [301] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [313] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [325] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [337] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [349] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [361] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [373] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [385] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [397] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [409] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [421] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [433] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [445] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [457] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [469] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [481] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [493] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [505] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [517] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [529] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [541] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

Remove duplicates:

data_clean <- car_data[!duplicated(car_data[, c("fuel_type", "make")]), ]

sum(duplicated(data_clean))
## [1] 0
#[1] 0

Save the now cleaned data:

library(readr)
cleaned_car_data <- read_csv("cleaned_car_data.csv")
## Rows: 550 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): class, drive, fuel_type, make, model, transmission
## dbl (6): city_mpg, combination_mpg, cylinders, displacement, highway_mpg, year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#View(cleaned_car_data)

#write.csv(car_data, "cleaned_car_data.csv", row.names = FALSE)
#Save cleaned data 
summary(cleaned_car_data)
##     city_mpg         class           combination_mpg    cylinders     
##  Min.   : 11.00   Length:550         Min.   : 14.00   Min.   : 3.000  
##  1st Qu.: 17.00   Class :character   1st Qu.: 20.00   1st Qu.: 4.000  
##  Median : 20.00   Mode  :character   Median : 23.00   Median : 4.000  
##  Mean   : 21.46                      Mean   : 24.07   Mean   : 5.316  
##  3rd Qu.: 24.00                      3rd Qu.: 27.00   3rd Qu.: 6.000  
##  Max.   :126.00                      Max.   :112.00   Max.   :12.000  
##                                                       NA's   :2       
##   displacement      drive            fuel_type          highway_mpg    
##  Min.   :1.200   Length:550         Length:550         Min.   : 18.00  
##  1st Qu.:2.000   Class :character   Class :character   1st Qu.: 24.00  
##  Median :2.500   Mode  :character   Mode  :character   Median : 28.00  
##  Mean   :2.932                                         Mean   : 28.61  
##  3rd Qu.:3.500                                         3rd Qu.: 32.00  
##  Max.   :6.800                                         Max.   :102.00  
##  NA's   :2                                                             
##      make              model           transmission            year     
##  Length:550         Length:550         Length:550         Min.   :2014  
##  Class :character   Class :character   Class :character   1st Qu.:2016  
##  Mode  :character   Mode  :character   Mode  :character   Median :2019  
##                                                           Mean   :2019  
##                                                           3rd Qu.:2022  
##                                                           Max.   :2024  
## 

2. (6 points) Use summary statistics (mean, standard deviation, and five number summary) to summarizes the quantitative variables in your data paying special attention to the outcome variables that you identified in Part I of the project. To look for patterns and evaluate potential correlations, compute those summaries by groups defined using categorical variables. Comment on the patterns/correlations seen from those summaries.

List of quantitative columns:

quantitative_columns <- c("city_mpg", "combination_mpg", "cylinders", 
                          "displacement", "highway_mpg", "year")

summary_stats <- summary(cleaned_car_data[quantitative_columns])

print(summary_stats)
##     city_mpg      combination_mpg    cylinders       displacement  
##  Min.   : 11.00   Min.   : 14.00   Min.   : 3.000   Min.   :1.200  
##  1st Qu.: 17.00   1st Qu.: 20.00   1st Qu.: 4.000   1st Qu.:2.000  
##  Median : 20.00   Median : 23.00   Median : 4.000   Median :2.500  
##  Mean   : 21.46   Mean   : 24.07   Mean   : 5.316   Mean   :2.932  
##  3rd Qu.: 24.00   3rd Qu.: 27.00   3rd Qu.: 6.000   3rd Qu.:3.500  
##  Max.   :126.00   Max.   :112.00   Max.   :12.000   Max.   :6.800  
##                                    NA's   :2        NA's   :2      
##   highway_mpg          year     
##  Min.   : 18.00   Min.   :2014  
##  1st Qu.: 24.00   1st Qu.:2016  
##  Median : 28.00   Median :2019  
##  Mean   : 28.61   Mean   :2019  
##  3rd Qu.: 32.00   3rd Qu.:2022  
##  Max.   :102.00   Max.   :2024  
## 
means <- colMeans(cleaned_car_data[quantitative_columns], na.rm = TRUE)
print("Means:")
## [1] "Means:"
print(means)
##        city_mpg combination_mpg       cylinders    displacement     highway_mpg 
##       21.460000       24.069091        5.315693        2.931752       28.609091 
##            year 
##     2019.000000
sds <- sapply(cleaned_car_data[quantitative_columns], sd, na.rm = TRUE)
print("Standard Deviations:")
## [1] "Standard Deviations:"
print(sds)
##        city_mpg combination_mpg       cylinders    displacement     highway_mpg 
##        8.147392        7.478369        1.759999        1.248419        6.832228 
##            year 
##        3.165156
five_num_summary <- sapply(cleaned_car_data[quantitative_columns], function(x) quantile(x, na.rm = TRUE))
print("Five Number Summary:")
## [1] "Five Number Summary:"
print(five_num_summary)
##      city_mpg combination_mpg cylinders displacement highway_mpg year
## 0%         11              14         3          1.2          18 2014
## 25%        17              20         4          2.0          24 2016
## 50%        20              23         4          2.5          28 2019
## 75%        24              27         6          3.5          32 2022
## 100%      126             112        12          6.8         102 2024

Summarizing the quantitative variables by fuel_type:

quantitative_columns <- c("city_mpg", "combination_mpg", "cylinders", 
                          "displacement", "highway_mpg", "year")

grouped_summary <- cleaned_car_data %>%
  group_by(fuel_type) %>%
  summarize(
    city_mpg_mean = mean(city_mpg, na.rm = TRUE),
    city_mpg_sd = sd(city_mpg, na.rm = TRUE),
    city_mpg_min = min(city_mpg, na.rm = TRUE),
    city_mpg_q1 = quantile(city_mpg, 0.25, na.rm = TRUE),
    city_mpg_median = median(city_mpg, na.rm = TRUE),
    city_mpg_q3 = quantile(city_mpg, 0.75, na.rm = TRUE),
    city_mpg_max = max(city_mpg, na.rm = TRUE),

    combination_mpg_mean = mean(combination_mpg, na.rm = TRUE),
    combination_mpg_sd = sd(combination_mpg, na.rm = TRUE),
    combination_mpg_min = min(combination_mpg, na.rm = TRUE),
    combination_mpg_q1 = quantile(combination_mpg, 0.25, na.rm = TRUE),
    combination_mpg_median = median(combination_mpg, na.rm = TRUE),
    combination_mpg_q3 = quantile(combination_mpg, 0.75, na.rm = TRUE),
    combination_mpg_max = max(combination_mpg, na.rm = TRUE),

    cylinders_mean = mean(cylinders, na.rm = TRUE),
    cylinders_sd = sd(cylinders, na.rm = TRUE),
    cylinders_min = min(cylinders, na.rm = TRUE),
    cylinders_q1 = quantile(cylinders, 0.25, na.rm = TRUE),
    cylinders_median = median(cylinders, na.rm = TRUE),
    cylinders_q3 = quantile(cylinders, 0.75, na.rm = TRUE),
    cylinders_max = max(cylinders, na.rm = TRUE),

    displacement_mean = mean(displacement, na.rm = TRUE),
    displacement_sd = sd(displacement, na.rm = TRUE),
    displacement_min = min(displacement, na.rm = TRUE),
    displacement_q1 = quantile(displacement, 0.25, na.rm = TRUE),
    displacement_median = median(displacement, na.rm = TRUE),
    displacement_q3 = quantile(displacement, 0.75, na.rm = TRUE),
    displacement_max = max(displacement, na.rm = TRUE),

    highway_mpg_mean = mean(highway_mpg, na.rm = TRUE),
    highway_mpg_sd = sd(highway_mpg, na.rm = TRUE),
    highway_mpg_min = min(highway_mpg, na.rm = TRUE),
    highway_mpg_q1 = quantile(highway_mpg, 0.25, na.rm = TRUE),
    highway_mpg_median = median(highway_mpg, na.rm = TRUE),
    highway_mpg_q3 = quantile(highway_mpg, 0.75, na.rm = TRUE),
    highway_mpg_max = max(highway_mpg, na.rm = TRUE),

    year_mean = mean(year, na.rm = TRUE),
    year_sd = sd(year, na.rm = TRUE),
    year_min = min(year, na.rm = TRUE),
    year_q1 = quantile(year, 0.25, na.rm = TRUE),
    year_median = median(year, na.rm = TRUE),
    year_q3 = quantile(year, 0.75, na.rm = TRUE),
    year_max = max(year, na.rm = TRUE)
  )
## Warning: There were 4 warnings in `summarize()`.
## The first warning was:
## ℹ In argument: `cylinders_min = min(cylinders, na.rm = TRUE)`.
## ℹ In group 2: `fuel_type = "electricity"`.
## Caused by warning in `min()`:
## ! no non-missing arguments to min; returning Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 3 remaining warnings.
print(grouped_summary)
## # A tibble: 3 × 43
##   fuel_type   city_mpg_mean city_mpg_sd city_mpg_min city_mpg_q1 city_mpg_median
##   <chr>               <dbl>       <dbl>        <dbl>       <dbl>           <dbl>
## 1 diesel               24.5       0.707           24        24.2            24.5
## 2 electricity         124.        3.54           121       122.            124. 
## 3 gas                  21.1       5.33            11        17              20  
## # ℹ 37 more variables: city_mpg_q3 <dbl>, city_mpg_max <dbl>,
## #   combination_mpg_mean <dbl>, combination_mpg_sd <dbl>,
## #   combination_mpg_min <dbl>, combination_mpg_q1 <dbl>,
## #   combination_mpg_median <dbl>, combination_mpg_q3 <dbl>,
## #   combination_mpg_max <dbl>, cylinders_mean <dbl>, cylinders_sd <dbl>,
## #   cylinders_min <dbl>, cylinders_q1 <dbl>, cylinders_median <dbl>,
## #   cylinders_q3 <dbl>, cylinders_max <dbl>, displacement_mean <dbl>, …

3. (10 points) Use different visual displays to identify patterns and evaluate potential correlations between the outcome variables and the remaining variables in the dataset. Comment on the patterns/correlations seen from those visualizations.

Boxplot for city_mpg grouped by fuel_type

ggplot(cleaned_car_data, aes(x = fuel_type, y = city_mpg, fill = fuel_type)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "City MPG by Fuel Type", x = "Fuel Type", y = "City MPG")

Patterns: If one category (e.g., “gas” fuel type) has a higher median city MPG compared to another category (e.g., “diesel”), this could suggest a relationship between fuel type and fuel efficiency.