1. (4 points) Perform any necessary data cleaning and/or wrangling (reshape data, define and add new relevant variables computed from the variables that are already in the dataset).
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
car_data <- read_csv("car_data.csv") #import dataset into R
## Rows: 550 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): class, drive, fuel_type, make, model, transmission
## dbl (6): city_mpg, combination_mpg, cylinders, displacement, highway_mpg, year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(car_data) #summary of the car_data
## city_mpg class combination_mpg cylinders
## Min. : 11.00 Length:550 Min. : 14.00 Min. : 3.000
## 1st Qu.: 17.00 Class :character 1st Qu.: 20.00 1st Qu.: 4.000
## Median : 20.00 Mode :character Median : 23.00 Median : 4.000
## Mean : 21.46 Mean : 24.07 Mean : 5.316
## 3rd Qu.: 24.00 3rd Qu.: 27.00 3rd Qu.: 6.000
## Max. :126.00 Max. :112.00 Max. :12.000
## NA's :2
## displacement drive fuel_type highway_mpg
## Min. :1.200 Length:550 Length:550 Min. : 18.00
## 1st Qu.:2.000 Class :character Class :character 1st Qu.: 24.00
## Median :2.500 Mode :character Mode :character Median : 28.00
## Mean :2.932 Mean : 28.61
## 3rd Qu.:3.500 3rd Qu.: 32.00
## Max. :6.800 Max. :102.00
## NA's :2
## make model transmission year
## Length:550 Length:550 Length:550 Min. :2014
## Class :character Class :character Class :character 1st Qu.:2016
## Mode :character Mode :character Mode :character Median :2019
## Mean :2019
## 3rd Qu.:2022
## Max. :2024
##
Checking for missing values:
sum(is.na(car_data)) #Check for missing values
## [1] 4
#[1] 4
colSums(is.na(car_data)) #Check missing values per column
## city_mpg class combination_mpg cylinders displacement
## 0 0 0 2 2
## drive fuel_type highway_mpg make model
## 0 0 0 0 0
## transmission year
## 0 0
#Data before being cleaned
# city_mpg class combination_mpg cylinders
# 0 0 0 2
# displacement drive fuel_type highway_mpg
# 2 0 0 0
# make model transmission year
# 0 0 0 0
Remove rows with missing values:
data_clean <- na.omit(car_data)
car_data$cylinders[is.na(car_data$cylinders)] <- mean(car_data$cylinders, na.rm = TRUE)
car_data$displacement[is.na(car_data$displacement)] <- mean(car_data$displacement, na.rm = TRUE)
colSums(is.na(car_data))
## city_mpg class combination_mpg cylinders displacement
## 0 0 0 0 0
## drive fuel_type highway_mpg make model
## 0 0 0 0 0
## transmission year
## 0 0
Check for Duplicates:
sum(duplicated(car_data)) #Looking for duplicate rows
## [1] 2
#[1] 2
print(duplicated(car_data))
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [181] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [217] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [229] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [241] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [253] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [277] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [289] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [301] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [313] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [325] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [337] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [349] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [361] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [373] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [385] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [397] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [409] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [421] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [433] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [445] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [457] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [469] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [481] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [493] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [505] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [517] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [529] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [541] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Remove duplicates:
data_clean <- car_data[!duplicated(car_data[, c("fuel_type", "make")]), ]
sum(duplicated(data_clean))
## [1] 0
#[1] 0
Save the now cleaned data:
library(readr)
cleaned_car_data <- read_csv("cleaned_car_data.csv")
## Rows: 550 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): class, drive, fuel_type, make, model, transmission
## dbl (6): city_mpg, combination_mpg, cylinders, displacement, highway_mpg, year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#View(cleaned_car_data)
#write.csv(car_data, "cleaned_car_data.csv", row.names = FALSE)
#Save cleaned data
summary(cleaned_car_data)
## city_mpg class combination_mpg cylinders
## Min. : 11.00 Length:550 Min. : 14.00 Min. : 3.000
## 1st Qu.: 17.00 Class :character 1st Qu.: 20.00 1st Qu.: 4.000
## Median : 20.00 Mode :character Median : 23.00 Median : 4.000
## Mean : 21.46 Mean : 24.07 Mean : 5.316
## 3rd Qu.: 24.00 3rd Qu.: 27.00 3rd Qu.: 6.000
## Max. :126.00 Max. :112.00 Max. :12.000
## NA's :2
## displacement drive fuel_type highway_mpg
## Min. :1.200 Length:550 Length:550 Min. : 18.00
## 1st Qu.:2.000 Class :character Class :character 1st Qu.: 24.00
## Median :2.500 Mode :character Mode :character Median : 28.00
## Mean :2.932 Mean : 28.61
## 3rd Qu.:3.500 3rd Qu.: 32.00
## Max. :6.800 Max. :102.00
## NA's :2
## make model transmission year
## Length:550 Length:550 Length:550 Min. :2014
## Class :character Class :character Class :character 1st Qu.:2016
## Mode :character Mode :character Mode :character Median :2019
## Mean :2019
## 3rd Qu.:2022
## Max. :2024
##
2. (6 points) Use summary statistics (mean, standard deviation, and five number summary) to summarizes the quantitative variables in your data paying special attention to the outcome variables that you identified in Part I of the project. To look for patterns and evaluate potential correlations, compute those summaries by groups defined using categorical variables. Comment on the patterns/correlations seen from those summaries.
List of quantitative columns:
quantitative_columns <- c("city_mpg", "combination_mpg", "cylinders",
"displacement", "highway_mpg", "year")
summary_stats <- summary(cleaned_car_data[quantitative_columns])
print(summary_stats)
## city_mpg combination_mpg cylinders displacement
## Min. : 11.00 Min. : 14.00 Min. : 3.000 Min. :1.200
## 1st Qu.: 17.00 1st Qu.: 20.00 1st Qu.: 4.000 1st Qu.:2.000
## Median : 20.00 Median : 23.00 Median : 4.000 Median :2.500
## Mean : 21.46 Mean : 24.07 Mean : 5.316 Mean :2.932
## 3rd Qu.: 24.00 3rd Qu.: 27.00 3rd Qu.: 6.000 3rd Qu.:3.500
## Max. :126.00 Max. :112.00 Max. :12.000 Max. :6.800
## NA's :2 NA's :2
## highway_mpg year
## Min. : 18.00 Min. :2014
## 1st Qu.: 24.00 1st Qu.:2016
## Median : 28.00 Median :2019
## Mean : 28.61 Mean :2019
## 3rd Qu.: 32.00 3rd Qu.:2022
## Max. :102.00 Max. :2024
##
means <- colMeans(cleaned_car_data[quantitative_columns], na.rm = TRUE)
print("Means:")
## [1] "Means:"
print(means)
## city_mpg combination_mpg cylinders displacement highway_mpg
## 21.460000 24.069091 5.315693 2.931752 28.609091
## year
## 2019.000000
sds <- sapply(cleaned_car_data[quantitative_columns], sd, na.rm = TRUE)
print("Standard Deviations:")
## [1] "Standard Deviations:"
print(sds)
## city_mpg combination_mpg cylinders displacement highway_mpg
## 8.147392 7.478369 1.759999 1.248419 6.832228
## year
## 3.165156
five_num_summary <- sapply(cleaned_car_data[quantitative_columns], function(x) quantile(x, na.rm = TRUE))
print("Five Number Summary:")
## [1] "Five Number Summary:"
print(five_num_summary)
## city_mpg combination_mpg cylinders displacement highway_mpg year
## 0% 11 14 3 1.2 18 2014
## 25% 17 20 4 2.0 24 2016
## 50% 20 23 4 2.5 28 2019
## 75% 24 27 6 3.5 32 2022
## 100% 126 112 12 6.8 102 2024
Summarizing the quantitative variables by fuel_type:
quantitative_columns <- c("city_mpg", "combination_mpg", "cylinders",
"displacement", "highway_mpg", "year")
grouped_summary <- cleaned_car_data %>%
group_by(fuel_type) %>%
summarize(
city_mpg_mean = mean(city_mpg, na.rm = TRUE),
city_mpg_sd = sd(city_mpg, na.rm = TRUE),
city_mpg_min = min(city_mpg, na.rm = TRUE),
city_mpg_q1 = quantile(city_mpg, 0.25, na.rm = TRUE),
city_mpg_median = median(city_mpg, na.rm = TRUE),
city_mpg_q3 = quantile(city_mpg, 0.75, na.rm = TRUE),
city_mpg_max = max(city_mpg, na.rm = TRUE),
combination_mpg_mean = mean(combination_mpg, na.rm = TRUE),
combination_mpg_sd = sd(combination_mpg, na.rm = TRUE),
combination_mpg_min = min(combination_mpg, na.rm = TRUE),
combination_mpg_q1 = quantile(combination_mpg, 0.25, na.rm = TRUE),
combination_mpg_median = median(combination_mpg, na.rm = TRUE),
combination_mpg_q3 = quantile(combination_mpg, 0.75, na.rm = TRUE),
combination_mpg_max = max(combination_mpg, na.rm = TRUE),
cylinders_mean = mean(cylinders, na.rm = TRUE),
cylinders_sd = sd(cylinders, na.rm = TRUE),
cylinders_min = min(cylinders, na.rm = TRUE),
cylinders_q1 = quantile(cylinders, 0.25, na.rm = TRUE),
cylinders_median = median(cylinders, na.rm = TRUE),
cylinders_q3 = quantile(cylinders, 0.75, na.rm = TRUE),
cylinders_max = max(cylinders, na.rm = TRUE),
displacement_mean = mean(displacement, na.rm = TRUE),
displacement_sd = sd(displacement, na.rm = TRUE),
displacement_min = min(displacement, na.rm = TRUE),
displacement_q1 = quantile(displacement, 0.25, na.rm = TRUE),
displacement_median = median(displacement, na.rm = TRUE),
displacement_q3 = quantile(displacement, 0.75, na.rm = TRUE),
displacement_max = max(displacement, na.rm = TRUE),
highway_mpg_mean = mean(highway_mpg, na.rm = TRUE),
highway_mpg_sd = sd(highway_mpg, na.rm = TRUE),
highway_mpg_min = min(highway_mpg, na.rm = TRUE),
highway_mpg_q1 = quantile(highway_mpg, 0.25, na.rm = TRUE),
highway_mpg_median = median(highway_mpg, na.rm = TRUE),
highway_mpg_q3 = quantile(highway_mpg, 0.75, na.rm = TRUE),
highway_mpg_max = max(highway_mpg, na.rm = TRUE),
year_mean = mean(year, na.rm = TRUE),
year_sd = sd(year, na.rm = TRUE),
year_min = min(year, na.rm = TRUE),
year_q1 = quantile(year, 0.25, na.rm = TRUE),
year_median = median(year, na.rm = TRUE),
year_q3 = quantile(year, 0.75, na.rm = TRUE),
year_max = max(year, na.rm = TRUE)
)
## Warning: There were 4 warnings in `summarize()`.
## The first warning was:
## ℹ In argument: `cylinders_min = min(cylinders, na.rm = TRUE)`.
## ℹ In group 2: `fuel_type = "electricity"`.
## Caused by warning in `min()`:
## ! no non-missing arguments to min; returning Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 3 remaining warnings.
print(grouped_summary)
## # A tibble: 3 × 43
## fuel_type city_mpg_mean city_mpg_sd city_mpg_min city_mpg_q1 city_mpg_median
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 diesel 24.5 0.707 24 24.2 24.5
## 2 electricity 124. 3.54 121 122. 124.
## 3 gas 21.1 5.33 11 17 20
## # ℹ 37 more variables: city_mpg_q3 <dbl>, city_mpg_max <dbl>,
## # combination_mpg_mean <dbl>, combination_mpg_sd <dbl>,
## # combination_mpg_min <dbl>, combination_mpg_q1 <dbl>,
## # combination_mpg_median <dbl>, combination_mpg_q3 <dbl>,
## # combination_mpg_max <dbl>, cylinders_mean <dbl>, cylinders_sd <dbl>,
## # cylinders_min <dbl>, cylinders_q1 <dbl>, cylinders_median <dbl>,
## # cylinders_q3 <dbl>, cylinders_max <dbl>, displacement_mean <dbl>, …
3. (10 points) Use different visual displays to identify patterns and evaluate potential correlations between the outcome variables and the remaining variables in the dataset. Comment on the patterns/correlations seen from those visualizations.
Boxplot for city_mpg grouped by fuel_type
ggplot(cleaned_car_data, aes(x = fuel_type, y = city_mpg, fill = fuel_type)) +
geom_boxplot() +
theme_minimal() +
labs(title = "City MPG by Fuel Type", x = "Fuel Type", y = "City MPG")
Patterns: If one category (e.g., “gas” fuel type) has a higher median city MPG compared to another category (e.g., “diesel”), this could suggest a relationship between fuel type and fuel efficiency.