Final Project Part 1 & 2: Introductory Analysis

Author

Luis Tapia

CIS Large Car Dataset: First Contact with Your Dataset Using Arrow

Assignment Overview

This week you’ll apply the READY + SCAN frameworks to your own dataset using Arrow for efficient big data exploration. You’ll become a “data detective” investigating your dataset systematically.

Learning Objectives

By completing this assignment, you will:

  • Apply the READY framework to plan your data investigation
  • Use the SCAN framework to systematically explore your dataset
  • Practice using Arrow for memory-efficient data loading
  • Document your initial findings and develop investigation questions

Part 1: Data Setup and Loading

Step 1: Extract and Load Your Data

Use the appropriate code pattern below based on your data format:

LOAD LIBRARIES

# Packages installed once during initial troubleshooting (kept commented out):
# install.packages("zip")
# install.packages("writexl")
# install.packages("openxlsx")
# Load required libraries
library(knitr)
Warning: package 'knitr' was built under R version 4.5.1
library(openxlsx)
Warning: package 'openxlsx' was built under R version 4.5.1
library(writexl) # for outputting/printing excel files for analysis
Warning: package 'writexl' was built under R version 4.5.1
library(arrow)
Warning: package 'arrow' was built under R version 4.5.1

Attaching package: 'arrow'
The following object is masked from 'package:utils':

    timestamp
library(glue)
Warning: package 'glue' was built under R version 4.5.1
library(zip)
Warning: package 'zip' was built under R version 4.5.1

Attaching package: 'zip'
The following objects are masked from 'package:utils':

    unzip, zip
library(dplyr)
Warning: package 'dplyr' was built under R version 4.5.1

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(readr)
Warning: package 'readr' was built under R version 4.5.1
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.1
Warning: package 'ggplot2' was built under R version 4.5.1
Warning: package 'tibble' was built under R version 4.5.1
Warning: package 'tidyr' was built under R version 4.5.1
Warning: package 'purrr' was built under R version 4.5.1
Warning: package 'stringr' was built under R version 4.5.1
Warning: package 'forcats' was built under R version 4.5.1
Warning: package 'lubridate' was built under R version 4.5.1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ lubridate::duration() masks arrow::duration()
✖ dplyr::filter()       masks stats::filter()
✖ dplyr::lag()          masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

For ZIP files containing CSV(s):

Note on Loading in Dataset

Although the large dataset I was working with consisted of only a single compressed file, I still encountered issues using the “open_dataset()” function directly to load the data after unzipping. Specifically, the fields hold a mix of data types (such as categorical and numeric), and Arrow’s automatic type detection assigned some columns a type that later values did not fit, which produced errors when reading those columns.

Error Example:

Error: Invalid: In CSV column #9: Row #1204: CSV conversion error to null: invalid value 'Ebony'

As such, I drew on research from StackOverflow and the assistance of Claude, an AI tool, to guide the approach for loading data that contains many different types of fields/variables. This also involved converting the file to the Parquet format early on for efficiency, based on what was noted in class (10/8), to optimize performance when working with the dataset during the cleaning process.
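The loading approach described above can be sketched on a tiny stand-in file. This is a minimal illustration rather than the exact code used for the project: the three-row CSV, the `csv_path` and `parquet_path` names, and the choice to declare only `interiorColor` as a string are all assumptions made for the example. It shows one way to state a column's type up front with `read_csv_arrow()` and then save the result as Parquet.

```r
library(arrow)

# Hypothetical miniature CSV: interiorColor is empty until the last row
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "vin,askPrice,interiorColor",
  "a1,1498,",
  "b2,10589,",
  "c3,9940,Ebony"
), csv_path)

# Declare the text column as string up front; the other columns are inferred
cars <- read_csv_arrow(
  csv_path,
  col_types = schema(interiorColor = string()),
  as_data_frame = FALSE
)

# Save as Parquet, then reopen it as a dataset
parquet_path <- tempfile(fileext = ".parquet")
write_parquet(cars, parquet_path)

cars_ds <- open_dataset(parquet_path, format = "parquet")
cars_ds$schema
```

Declaring types only for the columns known to cause trouble keeps automatic detection for the rest, which may be preferable when a file has many columns.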

Note: To run the unzipping and extraction procedure, uncomment the code in the next three code chunks and change the path to match your system.

# path to zip file 

# CHANGE PATH WHEN RUNNING ON DIFFERENT SYSTEM 
#zip_path <- "C:/Univeristy_Assignments/Fall 2025/DSA 406/406_Final/Final_Project/archive.zip"

# create folder for holding extracted dataset 
#outdir <- file.path(dirname(zip_path), "Extracted Data Folder")

# extract files if needed and use if statements to ensure 
# extracted file is not already present
# this was implemented due to multiple rounds of testing with the dataset

#if (!dir.exists(outdir)) {
 # dir.create(outdir)
 # unzip(zip_path, exdir = outdir) # unzips the dataset
 # message("Files extracted")
 # } else{
 # message("Files already extracted")
 # }

# get list of CSV files present
#csv_files <- list.files(outdir, pattern = "\\.csv$", full.names = TRUE)

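# use read_csv_arrow() to read in the dataset rather than open_dataset()
# because of the column-reading issues described above
# NOTE: the saved Parquet schema still shows int32/date32/null columns, so
# this col_types setting does not appear to force every column to string

#baseData <- read_csv_arrow(
  #csv_files[1],
 # col_types = schema(.default = string()) 
#) %>%
  #collect()

# Check memory usage of the loaded data
# (baseData is already an in-memory R data frame at this point)
  
#glue("Memory used by loaded data: {format(object.size(baseData), units = 'MB')}")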

Initial View of Data (without conversion ~ only for testing)

# reading in the first 5 rows of the data
#baseData %>% 
 # head(5)

Converting CSV to Parquet

# convert and save csv file as parquet for efficiency
#parquet_path <- file.path(outdir, "baseDataAauto.parquet")
#write_parquet(baseData, parquet_path)
#message("Parquet file created")

# Check memory usage after collecting
#glue("Memory used after collect(): {format(object.size(baseData), units = 'MB')}")

Parquet File Validation & Viewing (Run this to bring file into system)

# verification of file presence
file.exists("Extracted Data Folder/baseDataAauto.parquet")
[1] TRUE
file.size("Extracted Data Folder/baseDataAauto.parquet") / 1024^3  # Size in GB
[1] 0.9410913
#loading in the dataset
autoInfoData <- open_dataset("Extracted Data Folder/baseDataAauto.parquet", format = "parquet")

Using glimpse() on Parquet File

autoInfoData %>%
  glimpse()
FileSystemDataset with 1 Parquet file
5,695,015 rows x 156 columns
$ vin                                    <string> "abc5f0360059cf7b6fa8368db57f2…
$ stockNum                               <string> "11701A", "9055B", "11816A", "…
$ firstSeen                         <date32[day]> 2019-05-06, 2019-05-06, 2017-0…
$ lastSeen                          <date32[day]> 2019-05-06, 2019-05-06, 2019-0…
$ msrp                                    <int32> 1498, 10589, 11992, 12387, 416…
$ askPrice                                <int32> 1498, 10589, 9940, 12387, 4165…
$ mileage                                 <int32> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ isNew                                  <string> "False", "False", "False", "Fa…
$ color                                  <string> "Gray", "Super Black", "White"…
$ interiorColor                          <string> "N/A", "N/A", "N/A", "N/A", "N…
$ brandName                              <string> "MITSUBISHI", "NISSAN", "FORD"…
$ modelName                              <string> "Eclipse Spyder", "Altima", "E…
$ dealerID                                <int32> 7514, 7514, 7514, 7514, 7514, …
$ vf_ABS                                 <string> NA, NA, NA, "Standard", "Stand…
$ vf_ActiveSafetySysNote                 <string> NA, NA, NA, NA, NA, NA, "My Ke…
$ vf_AdaptiveCruiseControl               <string> NA, NA, NA, NA, NA, "Optional"…
$ vf_AdaptiveDrivingBeam                 <string> NA, NA, NA, "Optional", "Stand…
$ vf_AdaptiveHeadlights                  <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_AdditionalErrorText                 <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_AirBagLocCurtain                    <string> NA, "1st & 2nd Rows", NA, "All…
$ vf_AirBagLocFront                      <string> "1st Row (Driver & Passenger)"…
$ vf_AirBagLocKnee                       <string> NA, NA, "Driver Seat Only", "1…
$ vf_AirBagLocSeatCushion                <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_AirBagLocSide                       <string> NA, "1st Row (Driver & Passeng…
$ vf_AutoReverseSystem                   <string> NA, NA, NA, "Standard", "Stand…
$ vf_AutomaticPedestrianAlertingSound    <string> NA, NA, NA, NA, NA, "Standard"…
$ vf_AxleConfiguration                   <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_Axles                                <int32> NA, NA, NA, 2, 2, 2, 2, 2, 2, …
$ vf_BasePrice                           <double> NA, NA, NA, 23475, NA, 26500, …
$ vf_BatteryA                             <int32> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BatteryA_to                           <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BatteryCells                         <int32> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BatteryInfo                         <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BatteryKWh                          <double> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BatteryKWh_to                        <int32> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BatteryModules                        <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BatteryPacks                         <int32> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BatteryType                         <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BatteryV                             <int32> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BatteryV_to                           <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BedLengthIN                          <int32> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BedType                             <string> "Not Applicable", "Not Applica…
$ vf_BlindSpotMon                        <string> NA, NA, NA, "Optional", NA, "O…
$ vf_BodyCabType                         <string> "Not Applicable", "Not Applica…
$ vf_BodyClass                           <string> "Convertible/Cabriolet", "Seda…
$ vf_BrakeSystemDesc                     <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BrakeSystemType                     <string> NA, NA, "Hydraulic", NA, "Hydr…
$ vf_BusFloorConfigType                  <string> "Not Applicable", "Not Applica…
$ vf_BusLength                             <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BusType                             <string> "Not Applicable", "Not Applica…
$ vf_CAN_AACN                            <string> NA, NA, NA, "Standard", "Stand…
$ vf_CIB                                 <string> NA, NA, NA, NA, "Standard", "O…
$ vf_CashForClunkers                       <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_ChargerLevel                        <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_ChargerPowerKW                       <int32> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_CoolingType                         <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_CurbWeightLB                         <int32> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_CustomMotorcycleType                <string> "Not Applicable", "Not Applica…
$ vf_DaytimeRunningLight                 <string> NA, NA, NA, "Standard", "Stand…
$ vf_DestinationMarket                   <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_DisplacementCC                      <double> 3000, 2500, 1600, 1400, 5000, …
$ vf_DisplacementCI                      <double> 183.07123, 152.55936, 97.63799…
$ vf_DisplacementL                       <double> 3.0, 2.5, 1.6, 1.4, 5.0, 2.0, …
$ vf_Doors                                <int32> 2, 4, 4, 4, NA, 4, 4, 4, 4, 4,…
$ vf_DriveType                           <string> NA, "4x2", "4x2", NA, "4WD/4-W…
$ vf_DriverAssist                        <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_DynamicBrakeSupport                 <string> NA, NA, NA, "Standard", "Stand…
$ vf_EDR                                 <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_ESC                                 <string> NA, NA, NA, "Standard", "Stand…
$ vf_EVDriveUnit                         <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_ElectrificationLevel                <string> NA, NA, NA, NA, NA, "Strong HE…
$ vf_EngineConfiguration                 <string> NA, "In-Line", "In-Line", "In-…
$ vf_EngineCycles                         <int32> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_EngineCylinders                      <int32> NA, 4, 4, 4, 8, 4, 4, 6, 6, 4,…
$ vf_EngineHP                            <double> NA, NA, 178, NA, 395, 188, 171…
$ vf_EngineHP_to                         <double> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_EngineKW                            <double> NA, NA, 132.7346, NA, 294.5515…
$ vf_EngineManufacturer                  <string> NA, NA, "Ford", "GMNA", "Ford"…
$ vf_EngineModel                         <string> NA, NA, NA, "LE2 -DI: Direct I…
$ vf_EntertainmentSystem                 <string> NA, NA, NA, NA, "CD + stereo",…
$ vf_ForwardCollisionWarning             <string> NA, NA, NA, "Optional", "Stand…
$ vf_FuelInjectionType                   <string> "Multipoint Fuel Injection (MP…
$ vf_FuelTypePrimary                     <string> NA, "Gasoline", "Gasoline", "G…
$ vf_FuelTypeSecondary                   <string> NA, NA, NA, NA, NA, "Electric"…
$ vf_GCWR                                  <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_GCWR_to                               <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_GVWR                                <string> NA, NA, "Class 1C: 4001 - 5000…
$ vf_GVWR_to                               <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_KeylessIgnition                     <string> NA, NA, NA, "Standard", NA, "O…
$ vf_LaneDepartureWarning                <string> NA, NA, NA, "Optional", "Optio…
$ vf_LaneKeepSystem                      <string> NA, NA, NA, "Optional", "Optio…
$ vf_LowerBeamHeadlampLightSource        <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_Make                                <string> "MITSUBISHI", "NISSAN", "FORD"…
$ vf_MakeID                              <double> 481, 478, 460, 467, 460, 460, …
$ vf_Manufacturer                        <string> "MITSUBISHI MOTORS NORTH AMERI…
$ vf_ManufacturerId                       <int32> 1054, 997, 976, 984, 976, 979,…
$ vf_Model                               <string> "Eclipse Spyder", "Altima", "E…
$ vf_ModelID                              <int32> 2321, 1904, 1798, 1832, 1801, …
$ vf_ModelYear                            <int32> 2002, 2016, 2014, 2017, 2019, …
$ vf_MotorcycleChassisType               <string> "Not Applicable", "Not Applica…
$ vf_MotorcycleSuspensionType            <string> "Not Applicable", "Not Applica…
$ vf_NCSABodyType                          <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_NCSAMake                              <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_NCSAMapExcApprovedBy                  <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_NCSAMapExcApprovedOn                  <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_NCSAMappingException                  <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_NCSAModel                             <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_NCSANote                            <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_Note                                <string> NA, "position 6:Model change n…
$ vf_OtherBusInfo                          <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_OtherEngineInfo                     <string> "MPI", NA, "Ti-VCT ", NA, NA, …
$ vf_OtherMotorcycleInfo                 <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_OtherRestraintSystemInfo            <string> NA, "2nd row outboard and cent…
$ vf_OtherTrailerInfo                    <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_ParkAssist                          <string> NA, NA, NA, NA, NA, NA, NA, "O…
$ vf_PedestrianAutomaticEmergencyBraking <string> NA, NA, NA, NA, "Standard", "O…
$ vf_PlantCity                           <string> "BLOOMINGTON-NORMAL", "CANTON"…
$ vf_PlantCompanyName                    <string> NA, "Nissan North America Inc.…
$ vf_PlantCountry                        <string> "UNITED STATES (USA)", "UNITED…
$ vf_PlantState                          <string> "ILLINOIS", "MISSISSIPPI", "KE…
$ vf_PossibleValues                      <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_Pretensioner                        <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_RearCrossTrafficAlert               <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_RearVisibilitySystem                <string> NA, NA, NA, "Standard", "Stand…
$ vf_SAEAutomationLevel                   <int32> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_SAEAutomationLevel_to                 <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_SeatBeltsAll                        <string> "Manual", "Manual", "Manual", …
$ vf_SeatRows                             <int32> NA, NA, NA, 2, 2, 2, 2, NA, 3,…
$ vf_Seats                                <int32> NA, NA, NA, 5, 6, 5, 5, 7, 7, …
$ vf_SemiautomaticHeadlampBeamSwitching  <string> NA, NA, NA, "Standard", "Stand…
$ vf_Series                              <string> "SPORTS", NA, "SE", "Premier",…
$ vf_Series2                             <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_SteeringLocation                    <string> NA, NA, NA, "Left Hand Drive (…
$ vf_SuggestedVIN                        <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_TPMS                                <string> NA, "Direct", "Direct", "Direc…
$ vf_TopSpeedMPH                          <int32> NA, NA, NA, 130, NA, 105, 114,…
$ vf_TrackWidth                          <double> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_TractionControl                     <string> NA, NA, NA, "Standard", "Stand…
$ vf_TrailerBodyType                     <string> "Not Applicable", "Not Applica…
$ vf_TrailerLength                        <int32> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_TrailerType                         <string> "Not Applicable", "Not Applica…
$ vf_TransmissionSpeeds                   <int32> NA, NA, NA, NA, NA, NA, 6, 6, …
$ vf_TransmissionStyle                   <string> NA, NA, NA, "Automatic", "Auto…
$ vf_Trim                                <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_Trim2                               <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_Turbo                               <string> NA, NA, NA, "Yes", NA, NA, NA,…
$ vf_VIN                                 <string> "abc5f0360059cf7b6fa8368db57f2…
$ vf_ValveTrainDesign                    <string> NA, NA, NA, "Dual Overhead Cam…
$ vf_VehicleType                         <string> "PASSENGER CAR", "PASSENGER CA…
$ vf_WheelBaseLong                       <double> NA, NA, NA, NA, 156.8, NA, NA,…
$ vf_WheelBaseShort                      <double> NA, NA, NA, 106.3, 145.0, 112.…
$ vf_WheelBaseType                       <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_WheelSizeFront                       <int32> NA, NA, NA, 17, 17, 17, 17, 19…
$ vf_WheelSizeRear                        <int32> NA, NA, NA, 17, 17, 17, 17, 19…
$ vf_Wheels                               <int32> NA, NA, NA, 4, 4, 4, 4, 4, 4, …
$ vf_Windows                              <int32> NA, NA, NA, 4, NA, NA, NA, NA,…
Call `print()` for full schema details

Part 2: READY Framework Analysis

Work through each component of READY with your dataset:

R - Representative Data

Document your thoughts as comments:

E - Executive Driven Questions

Who would care about insights from your data?

  • Primary stakeholders:
  • Key business/research questions they might ask:
  • What decisions could this data inform?

Examples:

  • If this is sales data: “How can we optimize our sales strategy?”
  • If this is health data: “What patterns affect patient outcomes?”
  • If this is social media data: “How can we improve engagement?”

Your stakeholder/research questions:

  1. How much do specific technical and physical attributes of a car affect a dealership’s asking price?
  2. Which aspects of a car beyond its physical attributes, such as the location where it is assembled or manufactured, are correlated with pricing?
  3. Which car brands and models are most prominent at dealership locations, and how does that affect how frequently they are sold?
  4. Which car features are most prominent in dealership vehicles, and which should manufacturers prioritize because they sell best?
  5. How much do MSRP and asking price differ across vehicles based on factors such as mileage?
    • This also pertains to how much less or more a dealership prices a vehicle compared to the manufacturer’s recommended pricing.
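As a rough sketch of how question 5 could be measured, the gap between MSRP and asking price can be computed per listing in both dollars and percent. The small data frame below is illustrative only, and `priceGap` and `priceGapPct` are hypothetical column names introduced for this example; only `msrp`, `askPrice`, and `mileage` come from the dataset.

```r
# Small illustrative data frame using the dataset's msrp/askPrice/mileage fields
listings <- data.frame(
  msrp     = c(36990, 28228, 33660, 27135),
  askPrice = c(31990, 28228, 28631, 21015),
  mileage  = c(19341, 0, 0, 0)
)

# Dollar gap: positive means the dealer asks less than MSRP
listings$priceGap <- listings$msrp - listings$askPrice

# Same gap as a percentage of MSRP, rounded to one decimal place
listings$priceGapPct <- round(100 * listings$priceGap / listings$msrp, 1)

listings
```

Expressing the gap as a percentage makes vehicles at very different price points comparable, which a raw dollar difference does not.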

A - Analytical Framework

Your exploration strategy:

  • Phase 1: Data Quality Assessment
    • Check for missing values
    • Identify data types and consistency
    • Look for outliers or anomalies

  • Phase 2: Descriptive Analysis
    • What are the key variables?
    • What’s the distribution of important metrics?
    • What time patterns exist?

  • Phase 3: Pattern Investigation
    • What relationships might exist between variables?
    • Are there seasonal or temporal patterns?
    • What groupings or segments emerge?

Your specific analytical approach:

  1. Begin by checking for potential relationships within the data, using measures such as a correlation matrix to identify areas to explore.

  2. After determining the key variables or fields to focus on, develop visualizations or informational tables that depict anything noteworthy at first glance.

  3. Consider “feature engineering” to derive new variables from existing fields; for example, extract a separate year field from any month-day-year field to support a year-by-year analysis of key variables.
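Steps 1 and 3 above might look roughly like the following in base R. The rows and dates are made up for illustration, and `firstSeenYear`, `numericCols`, and `corMatrix` are hypothetical names; the sketch assumes the date field is already stored as a `Date`.

```r
# Made-up rows; firstSeen, msrp, askPrice, and mileage mirror dataset fields
listings <- data.frame(
  firstSeen = as.Date(c("2017-03-14", "2018-07-02", "2019-05-06", "2019-11-20")),
  msrp      = c(11992, 18479, 27135, 36990),
  askPrice  = c(9940, 18254, 21015, 31990),
  mileage   = c(88252, 35901, 0, 19341)
)

# Feature engineering: derive a year field from the full date
listings$firstSeenYear <- as.integer(format(listings$firstSeen, "%Y"))

# Correlation matrix over the numeric fields only (the Date column is skipped)
numericCols <- listings[sapply(listings, is.numeric)]
corMatrix <- cor(numericCols, use = "pairwise.complete.obs")

round(corMatrix, 2)
```

The `use = "pairwise.complete.obs"` setting lets each pair of columns use whatever rows they both have, which matters for a dataset with many NA values.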

D - Data Best Practices

Quality checks to perform:

Missing data assessment:

Data type verification: Are numeric columns actually numeric? Are dates properly formatted? Are categorical variables consistent?

Your quality concerns:

  1. Based on a first glance at the dataset, many variables/fields contain a large number of NA values, which limits the usefulness of some fields. Some fields are read in entirely as the null type (for example, ‘BusLength’ and ‘CashForClunkers’); however, these appear to be variables that are not relevant to the scope of the analysis.

  2. Inconsistent formatting between categorical variables, such as ‘VehicleType’ (all uppercase) and ‘Model’ (title case), would require further cleaning to ensure consistent reporting.

  3. There are several variations of the same measure, such as price (‘askPrice’, ‘msrp’, and ‘BasePrice’), which could make pricing/value analyses difficult without knowing which field is most appropriate for a specific case.
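The first two concerns can be checked with a few lines of base R. This is a sketch on made-up rows, and `naPct` and `emptyCols` are hypothetical names; the trailing space in "TRUCK " is included because whitespace of that kind would otherwise split one category into two.

```r
# Made-up rows using column names from the dataset
listings <- data.frame(
  VehicleType = c("PASSENGER CAR", "TRUCK ",
                  "MULTIPURPOSE PASSENGER VEHICLE (MPV)", "TRUCK "),
  Model       = c("Altima", "Colorado", "CR-V", "Silverado"),
  DriveType   = c(NA, "4x2", "4x2", NA),
  BusLength   = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Percentage of missing values in each column
naPct <- round(100 * colMeans(is.na(listings)), 1)

# Columns with no usable values at all
emptyCols <- names(naPct)[naPct == 100]

# Normalize case and strip stray whitespace so categories match
listings$VehicleType <- toupper(trimws(listings$VehicleType))
listings$Model       <- toupper(trimws(listings$Model))

naPct
emptyCols
```

Columns listed in `emptyCols` are candidates to drop before analysis, while the percentages in `naPct` help decide which partially filled fields are still worth keeping.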

Y - Your Insights

Initial hypotheses about what you might find:

Based on your domain knowledge, what patterns do you expect? What would surprise you? What would be most valuable to discover?

Your predictions:

  1. Certain fields (such as vehicle type) might show distinct year-by-year patterns that indicate trends or preferences by dealership location.

  2. Specific periods might show a general increase in ‘msrp’ for new car offerings due to increased implementation of newer technologies in vehicles.

  3. Larger vehicle types such as SUVs and/or trucks are likely to be more prominent in the dataset than sedans, given the noted popularity of larger vehicles for their practicality, higher driving position, and comfort.

Part 3: Data Quality Assessment Summary

S - Stakeholders (Revisited)

Standardizing Naming Scheme & Obtain Column Names

# before further analysis, ensure column names/variables
# maintain a consistent naming scheme by removing the "vf_" prefix
# note: collect() pulls the full table into memory as a tibble
autoInfoData <- autoInfoData %>%
  rename_with(~ str_replace(., "^vf_", "")) %>% 
  collect()

# prints out column names of the dataset
colnames(autoInfoData)
  [1] "vin"                                
  [2] "stockNum"                           
  [3] "firstSeen"                          
  [4] "lastSeen"                           
  [5] "msrp"                               
  [6] "askPrice"                           
  [7] "mileage"                            
  [8] "isNew"                              
  [9] "color"                              
 [10] "interiorColor"                      
 [11] "brandName"                          
 [12] "modelName"                          
 [13] "dealerID"                           
 [14] "ABS"                                
 [15] "ActiveSafetySysNote"                
 [16] "AdaptiveCruiseControl"              
 [17] "AdaptiveDrivingBeam"                
 [18] "AdaptiveHeadlights"                 
 [19] "AdditionalErrorText"                
 [20] "AirBagLocCurtain"                   
 [21] "AirBagLocFront"                     
 [22] "AirBagLocKnee"                      
 [23] "AirBagLocSeatCushion"               
 [24] "AirBagLocSide"                      
 [25] "AutoReverseSystem"                  
 [26] "AutomaticPedestrianAlertingSound"   
 [27] "AxleConfiguration"                  
 [28] "Axles"                              
 [29] "BasePrice"                          
 [30] "BatteryA"                           
 [31] "BatteryA_to"                        
 [32] "BatteryCells"                       
 [33] "BatteryInfo"                        
 [34] "BatteryKWh"                         
 [35] "BatteryKWh_to"                      
 [36] "BatteryModules"                     
 [37] "BatteryPacks"                       
 [38] "BatteryType"                        
 [39] "BatteryV"                           
 [40] "BatteryV_to"                        
 [41] "BedLengthIN"                        
 [42] "BedType"                            
 [43] "BlindSpotMon"                       
 [44] "BodyCabType"                        
 [45] "BodyClass"                          
 [46] "BrakeSystemDesc"                    
 [47] "BrakeSystemType"                    
 [48] "BusFloorConfigType"                 
 [49] "BusLength"                          
 [50] "BusType"                            
 [51] "CAN_AACN"                           
 [52] "CIB"                                
 [53] "CashForClunkers"                    
 [54] "ChargerLevel"                       
 [55] "ChargerPowerKW"                     
 [56] "CoolingType"                        
 [57] "CurbWeightLB"                       
 [58] "CustomMotorcycleType"               
 [59] "DaytimeRunningLight"                
 [60] "DestinationMarket"                  
 [61] "DisplacementCC"                     
 [62] "DisplacementCI"                     
 [63] "DisplacementL"                      
 [64] "Doors"                              
 [65] "DriveType"                          
 [66] "DriverAssist"                       
 [67] "DynamicBrakeSupport"                
 [68] "EDR"                                
 [69] "ESC"                                
 [70] "EVDriveUnit"                        
 [71] "ElectrificationLevel"               
 [72] "EngineConfiguration"                
 [73] "EngineCycles"                       
 [74] "EngineCylinders"                    
 [75] "EngineHP"                           
 [76] "EngineHP_to"                        
 [77] "EngineKW"                           
 [78] "EngineManufacturer"                 
 [79] "EngineModel"                        
 [80] "EntertainmentSystem"                
 [81] "ForwardCollisionWarning"            
 [82] "FuelInjectionType"                  
 [83] "FuelTypePrimary"                    
 [84] "FuelTypeSecondary"                  
 [85] "GCWR"                               
 [86] "GCWR_to"                            
 [87] "GVWR"                               
 [88] "GVWR_to"                            
 [89] "KeylessIgnition"                    
 [90] "LaneDepartureWarning"               
 [91] "LaneKeepSystem"                     
 [92] "LowerBeamHeadlampLightSource"       
 [93] "Make"                               
 [94] "MakeID"                             
 [95] "Manufacturer"                       
 [96] "ManufacturerId"                     
 [97] "Model"                              
 [98] "ModelID"                            
 [99] "ModelYear"                          
[100] "MotorcycleChassisType"              
[101] "MotorcycleSuspensionType"           
[102] "NCSABodyType"                       
[103] "NCSAMake"                           
[104] "NCSAMapExcApprovedBy"               
[105] "NCSAMapExcApprovedOn"               
[106] "NCSAMappingException"               
[107] "NCSAModel"                          
[108] "NCSANote"                           
[109] "Note"                               
[110] "OtherBusInfo"                       
[111] "OtherEngineInfo"                    
[112] "OtherMotorcycleInfo"                
[113] "OtherRestraintSystemInfo"           
[114] "OtherTrailerInfo"                   
[115] "ParkAssist"                         
[116] "PedestrianAutomaticEmergencyBraking"
[117] "PlantCity"                          
[118] "PlantCompanyName"                   
[119] "PlantCountry"                       
[120] "PlantState"                         
[121] "PossibleValues"                     
[122] "Pretensioner"                       
[123] "RearCrossTrafficAlert"              
[124] "RearVisibilitySystem"               
[125] "SAEAutomationLevel"                 
[126] "SAEAutomationLevel_to"              
[127] "SeatBeltsAll"                       
[128] "SeatRows"                           
[129] "Seats"                              
[130] "SemiautomaticHeadlampBeamSwitching" 
[131] "Series"                             
[132] "Series2"                            
[133] "SteeringLocation"                   
[134] "SuggestedVIN"                       
[135] "TPMS"                               
[136] "TopSpeedMPH"                        
[137] "TrackWidth"                         
[138] "TractionControl"                    
[139] "TrailerBodyType"                    
[140] "TrailerLength"                      
[141] "TrailerType"                        
[142] "TransmissionSpeeds"                 
[143] "TransmissionStyle"                  
[144] "Trim"                               
[145] "Trim2"                              
[146] "Turbo"                              
[147] "VIN"                                
[148] "ValveTrainDesign"                   
[149] "VehicleType"                        
[150] "WheelBaseLong"                      
[151] "WheelBaseShort"                     
[152] "WheelBaseType"                      
[153] "WheelSizeFront"                     
[154] "WheelSizeRear"                      
[155] "Wheels"                             
[156] "Windows"                            

Other Dataset Structure Analysis

# this section analyzes the overall structure of the data
# take a sample from the data
# to avoid loading the entire dataset repeatedly

# OTHER DETAILS
# this version of the data selects specific variables that stakeholders
# would be most interested in, avoiding heavily technical details
autoInfoSample_1 <- autoInfoData %>% 
  slice_sample(n = 100) %>% 
  select(
    ModelYear,
    brandName,
    modelName,
    BodyClass,
    VehicleType,
    msrp,
    askPrice,
    mileage,
    isNew,
    color,
    interiorColor,
    Doors,
    FuelTypePrimary,
    SeatBeltsAll
  ) %>% 
  collect()

# prints out structure of data
str(autoInfoSample_1)
tibble [100 × 14] (S3: tbl_df/tbl/data.frame)
 $ ModelYear      : int [1:100] 2019 2019 2019 2019 2018 2015 2018 2014 2016 2019 ...
 $ brandName      : chr [1:100] "DODGE" "HONDA" "CHEVROLET" "JEEP" ...
 $ modelName      : chr [1:100] "Durango" "CR-V" "Colorado" "Cherokee" ...
 $ BodyClass      : chr [1:100] "Sport Utility Vehicle (SUV)/Multi-Purpose Vehicle (MPV)" "Sport Utility Vehicle (SUV)/Multi-Purpose Vehicle (MPV)" "Pickup" "Sport Utility Vehicle (SUV)/Multi-Purpose Vehicle (MPV)" ...
 $ VehicleType    : chr [1:100] "MULTIPURPOSE PASSENGER VEHICLE (MPV)" "MULTIPURPOSE PASSENGER VEHICLE (MPV)" "TRUCK " "MULTIPURPOSE PASSENGER VEHICLE (MPV)" ...
 $ msrp           : int [1:100] 36990 28228 33660 27135 18479 10852 14994 8995 24785 22980 ...
 $ askPrice       : int [1:100] 31990 28228 28631 21015 18254 10852 14994 8995 22000 22980 ...
 $ mileage        : int [1:100] 19341 0 0 0 35901 88252 49048 40278 96732 0 ...
 $ isNew          : chr [1:100] "False" "True" "True" "True" ...
 $ color          : chr [1:100] "DB Black" "N/A" "Steel Metallic" "N/A" ...
 $ interiorColor  : chr [1:100] "Black" "N/A" "Jet Black/Dark Ash" "N/A" ...
 $ Doors          : int [1:100] 4 5 4 4 4 4 4 4 4 4 ...
 $ FuelTypePrimary: chr [1:100] "Gasoline" "Gasoline" "Gasoline" "Gasoline" ...
 $ SeatBeltsAll   : chr [1:100] "Manual" "Manual" "Manual" "Manual" ...
# display the dimensions of the sample
cat("Dimension Information:\n") # cat() prints the text without surrounding quotes
Dimension Information:
dim(autoInfoSample_1)
[1] 100  14
# data timeframe
autotime <- autoInfoData %>%
  summarise(
    earliest_firstSeen = min(as.Date(firstSeen), na.rm = TRUE),
    recent_lastSeen = max(as.Date(lastSeen), na.rm = TRUE)
  )

# print readable text output
glue(
  "Date Information:\n", 
  "The earliest first seen date is: {autotime$earliest_firstSeen}\n",
  "The latest last seen date is: {autotime$recent_lastSeen}"
)
Date Information:
The earliest first seen date is: 2016-04-05
The latest last seen date is: 2020-05-31
# which model year occurs most often in the dataset
cat("Most Recurring 'ModelYear':\n")
Most Recurring 'ModelYear':
modelYearMode <- autoInfoData %>%
  count(ModelYear) %>% # count of 'ModelYear'
  arrange(desc(n)) %>%
  slice(1) %>% # selects first row of arranged table
  pull(ModelYear) # value extraction is performed

modelYearMode
[1] 2019
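
The lookup above keeps only the first row after sorting, so a tie between two model years would go unnoticed. A minimal, tie-aware sketch follows, using base R and a small made-up vector (`toyYears` is hypothetical and not drawn from the dataset):

```r
# toy data for illustration only; not taken from autoInfoData
toyYears <- c(2019, 2018, 2019, 2018, 2017, NA)

# table() drops NA by default, so missing years are not counted
yearCounts <- table(toyYears)

# keep every year whose count equals the maximum count
yearModes <- as.integer(names(yearCounts)[yearCounts == max(yearCounts)])

yearModes
```

In a dplyr pipeline like the one above, `slice_max(n, n = 1)` after `count(ModelYear)` should behave similarly, since it keeps ties by default.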

After examining the data structure, who else might be interested?

The dataset’s main audience is dealerships and technical analysts, but it might also appeal to consumers and to others within the automotive industry, such as manufacturers.

What specific questions would they have?

  1. How does the presence of specific features and different configurations affect the overall price?

  2. How much does brand affect a car’s price when the manufacturer’s price (MSRP) is compared with the dealer’s asking price?

  3. Which vehicle types (such as SUVs or trucks) are most commonly found at dealerships?

  4. Are there any trends over the time span of the data that seem to indicate a shift in what types of cars are acquired by dealers (such as a shift from primarily gas-powered cars to electric or hybrid vehicles)?
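
As a starting point for question 2, the gap between `msrp` and `askPrice` could be summarized per brand. The sketch below is one possible approach rather than the analysis itself: it uses base R, the listings in `toyListings` are invented, and `discountPct` is a hypothetical helper column.

```r
# toy listings for illustration only; all values are invented
toyListings <- data.frame(
  brandName = c("DODGE", "DODGE", "HONDA", "HONDA"),
  msrp      = c(40000, 30000, 28000, 25000),
  askPrice  = c(36000, 27000, 28000, 24000)
)

# percent discount of the dealer's asking price relative to MSRP
toyListings$discountPct <-
  (toyListings$msrp - toyListings$askPrice) / toyListings$msrp * 100

# mean discount per brand
brandDiscount <- aggregate(discountPct ~ brandName, data = toyListings, FUN = mean)

brandDiscount
```

On the real data, any rows where `msrp` is zero or implausibly low would need to be filtered out first, because the division would otherwise produce infinite or misleading percentages. Whether such rows exist has not been checked here.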

What concerns might they have about data quality?

  • The relevance of some fields in the dataset is questionable, so the dataset would need adjustment or further cleaning to narrow its variables down to the core fields.

  • Duplicated data, such as a vehicle’s model name, requires further cleaning and handling before the data can be interpreted properly.

  • Some stakeholder-relevant fields have a portion of missing (NA) values; however, it is not sizable enough to hinder reporting. As such, only acknowledgement is warranted for this matter.
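
These concerns can be turned into quick checks. The sketch below uses base R and an invented data frame (`toyData`) whose column names mirror the dataset. The 90% cutoff is an assumed threshold for illustration, not a rule taken from the assignment.

```r
# toy data for illustration only; column names mirror the dataset, values are invented
toyData <- data.frame(
  vin        = c("A1", "A2", "A2", "A4"),
  Doors      = c(4, NA, NA, 2),
  BatteryKWh = c(NA, NA, NA, NA)
)

# percent of missing values per column
missingPct <- colMeans(is.na(toyData)) * 100

# hypothetical cutoff: flag columns that are more than 90% missing
sparseCols <- names(missingPct)[missingPct > 90]

# count rows that repeat an earlier VIN
duplicateVins <- sum(duplicated(toyData$vin))

sparseCols
duplicateVins
```

A repeated VIN is not necessarily an error, since the same vehicle may be listed more than once over time, so the duplicate count is best treated as a prompt for closer inspection.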

C - Columns and Coverage

Create a summary table of your variables:

# enhanced summary table
# count missing values once and reuse the result for both columns
missingCounts <- colSums(is.na(autoInfoData))

variableSummary <- data.frame(
  Variable = colnames(autoInfoData),
  Type = sapply(autoInfoData, function(x) class(x)[1]),
  MissingQuantity = missingCounts,
  MissingPercentage = round(missingCounts / nrow(autoInfoData) * 100, 2)
)

# print out summary
print(variableSummary)
                                                               Variable
vin                                                                 vin
stockNum                                                       stockNum
firstSeen                                                     firstSeen
lastSeen                                                       lastSeen
msrp                                                               msrp
askPrice                                                       askPrice
mileage                                                         mileage
isNew                                                             isNew
color                                                             color
interiorColor                                             interiorColor
brandName                                                     brandName
modelName                                                     modelName
dealerID                                                       dealerID
ABS                                                                 ABS
ActiveSafetySysNote                                 ActiveSafetySysNote
AdaptiveCruiseControl                             AdaptiveCruiseControl
AdaptiveDrivingBeam                                 AdaptiveDrivingBeam
AdaptiveHeadlights                                   AdaptiveHeadlights
AdditionalErrorText                                 AdditionalErrorText
AirBagLocCurtain                                       AirBagLocCurtain
AirBagLocFront                                           AirBagLocFront
AirBagLocKnee                                             AirBagLocKnee
AirBagLocSeatCushion                               AirBagLocSeatCushion
AirBagLocSide                                             AirBagLocSide
AutoReverseSystem                                     AutoReverseSystem
AutomaticPedestrianAlertingSound       AutomaticPedestrianAlertingSound
AxleConfiguration                                     AxleConfiguration
Axles                                                             Axles
BasePrice                                                     BasePrice
BatteryA                                                       BatteryA
BatteryA_to                                                 BatteryA_to
BatteryCells                                               BatteryCells
BatteryInfo                                                 BatteryInfo
BatteryKWh                                                   BatteryKWh
BatteryKWh_to                                             BatteryKWh_to
BatteryModules                                           BatteryModules
BatteryPacks                                               BatteryPacks
BatteryType                                                 BatteryType
BatteryV                                                       BatteryV
BatteryV_to                                                 BatteryV_to
BedLengthIN                                                 BedLengthIN
BedType                                                         BedType
BlindSpotMon                                               BlindSpotMon
BodyCabType                                                 BodyCabType
BodyClass                                                     BodyClass
BrakeSystemDesc                                         BrakeSystemDesc
BrakeSystemType                                         BrakeSystemType
BusFloorConfigType                                   BusFloorConfigType
BusLength                                                     BusLength
BusType                                                         BusType
CAN_AACN                                                       CAN_AACN
CIB                                                                 CIB
CashForClunkers                                         CashForClunkers
ChargerLevel                                               ChargerLevel
ChargerPowerKW                                           ChargerPowerKW
CoolingType                                                 CoolingType
CurbWeightLB                                               CurbWeightLB
CustomMotorcycleType                               CustomMotorcycleType
DaytimeRunningLight                                 DaytimeRunningLight
DestinationMarket                                     DestinationMarket
DisplacementCC                                           DisplacementCC
DisplacementCI                                           DisplacementCI
DisplacementL                                             DisplacementL
Doors                                                             Doors
DriveType                                                     DriveType
DriverAssist                                               DriverAssist
DynamicBrakeSupport                                 DynamicBrakeSupport
EDR                                                                 EDR
ESC                                                                 ESC
EVDriveUnit                                                 EVDriveUnit
ElectrificationLevel                               ElectrificationLevel
EngineConfiguration                                 EngineConfiguration
EngineCycles                                               EngineCycles
EngineCylinders                                         EngineCylinders
EngineHP                                                       EngineHP
EngineHP_to                                                 EngineHP_to
EngineKW                                                       EngineKW
EngineManufacturer                                   EngineManufacturer
EngineModel                                                 EngineModel
EntertainmentSystem                                 EntertainmentSystem
ForwardCollisionWarning                         ForwardCollisionWarning
FuelInjectionType                                     FuelInjectionType
FuelTypePrimary                                         FuelTypePrimary
FuelTypeSecondary                                     FuelTypeSecondary
GCWR                                                               GCWR
GCWR_to                                                         GCWR_to
GVWR                                                               GVWR
GVWR_to                                                         GVWR_to
KeylessIgnition                                         KeylessIgnition
LaneDepartureWarning                               LaneDepartureWarning
LaneKeepSystem                                           LaneKeepSystem
LowerBeamHeadlampLightSource               LowerBeamHeadlampLightSource
Make                                                               Make
MakeID                                                           MakeID
Manufacturer                                               Manufacturer
ManufacturerId                                           ManufacturerId
Model                                                             Model
ModelID                                                         ModelID
ModelYear                                                     ModelYear
MotorcycleChassisType                             MotorcycleChassisType
MotorcycleSuspensionType                       MotorcycleSuspensionType
NCSABodyType                                               NCSABodyType
NCSAMake                                                       NCSAMake
NCSAMapExcApprovedBy                               NCSAMapExcApprovedBy
NCSAMapExcApprovedOn                               NCSAMapExcApprovedOn
NCSAMappingException                               NCSAMappingException
NCSAModel                                                     NCSAModel
NCSANote                                                       NCSANote
Note                                                               Note
OtherBusInfo                                               OtherBusInfo
OtherEngineInfo                                         OtherEngineInfo
OtherMotorcycleInfo                                 OtherMotorcycleInfo
OtherRestraintSystemInfo                       OtherRestraintSystemInfo
OtherTrailerInfo                                       OtherTrailerInfo
ParkAssist                                                   ParkAssist
PedestrianAutomaticEmergencyBraking PedestrianAutomaticEmergencyBraking
PlantCity                                                     PlantCity
PlantCompanyName                                       PlantCompanyName
PlantCountry                                               PlantCountry
PlantState                                                   PlantState
PossibleValues                                           PossibleValues
Pretensioner                                               Pretensioner
RearCrossTrafficAlert                             RearCrossTrafficAlert
RearVisibilitySystem                               RearVisibilitySystem
SAEAutomationLevel                                   SAEAutomationLevel
SAEAutomationLevel_to                             SAEAutomationLevel_to
SeatBeltsAll                                               SeatBeltsAll
SeatRows                                                       SeatRows
Seats                                                             Seats
SemiautomaticHeadlampBeamSwitching   SemiautomaticHeadlampBeamSwitching
Series                                                           Series
Series2                                                         Series2
SteeringLocation                                       SteeringLocation
SuggestedVIN                                               SuggestedVIN
TPMS                                                               TPMS
TopSpeedMPH                                                 TopSpeedMPH
TrackWidth                                                   TrackWidth
TractionControl                                         TractionControl
TrailerBodyType                                         TrailerBodyType
TrailerLength                                             TrailerLength
TrailerType                                                 TrailerType
TransmissionSpeeds                                   TransmissionSpeeds
TransmissionStyle                                     TransmissionStyle
Trim                                                               Trim
Trim2                                                             Trim2
Turbo                                                             Turbo
VIN                                                                 VIN
ValveTrainDesign                                       ValveTrainDesign
VehicleType                                                 VehicleType
WheelBaseLong                                             WheelBaseLong
WheelBaseShort                                           WheelBaseShort
WheelBaseType                                             WheelBaseType
WheelSizeFront                                           WheelSizeFront
WheelSizeRear                                             WheelSizeRear
Wheels                                                           Wheels
Windows                                                         Windows
                                                 Type MissingQuantity
vin                                         character               0
stockNum                                    character            6806
firstSeen                                        Date               0
lastSeen                                         Date               0
msrp                                          integer               0
askPrice                                      integer               0
mileage                                       integer               0
isNew                                       character               0
color                                       character               4
interiorColor                               character               1
brandName                                   character            1260
modelName                                   character            5843
dealerID                                      integer               0
ABS                                         character         3364367
ActiveSafetySysNote                         character         4858219
AdaptiveCruiseControl                       character         4732470
AdaptiveDrivingBeam                         character         4859853
AdaptiveHeadlights                          character         5692273
AdditionalErrorText                         character         5663198
AirBagLocCurtain                            character         2346025
AirBagLocFront                              character          302911
AirBagLocKnee                               character         4038037
AirBagLocSeatCushion                        character         5364405
AirBagLocSide                               character          612020
AutoReverseSystem                           character         3724453
AutomaticPedestrianAlertingSound            character         5611953
AxleConfiguration                           character         5694492
Axles                                         integer         3595438
BasePrice                                     numeric         3838272
BatteryA                                      integer         5695013
BatteryA_to                         vctrs_unspecified         5695015
BatteryCells                                  integer         5695013
BatteryInfo                                 character         5681514
BatteryKWh                                    numeric         5693362
BatteryKWh_to                                 integer         5694777
BatteryModules                      vctrs_unspecified         5695015
BatteryPacks                                  integer         5634621
BatteryType                                 character         5676852
BatteryV                                      integer         5685448
BatteryV_to                         vctrs_unspecified         5695015
BedLengthIN                                   integer         5689399
BedType                                     character         3628238
BlindSpotMon                                character         4297756
BodyCabType                                 character         3011749
BodyClass                                   character           11559
BrakeSystemDesc                             character         5590227
BrakeSystemType                             character         3409195
BusFloorConfigType                          character           29202
BusLength                           vctrs_unspecified         5695015
BusType                                     character           29202
CAN_AACN                                    character         4537162
CIB                                         character         4522241
CashForClunkers                     vctrs_unspecified         5695015
ChargerLevel                                character         5694818
ChargerPowerKW                                integer         5693367
CoolingType                                 character         5193992
CurbWeightLB                                  integer         5577825
CustomMotorcycleType                        character            3852
DaytimeRunningLight                         character         3719303
DestinationMarket                           character         5551219
DisplacementCC                                numeric           46149
DisplacementCI                                numeric           46149
DisplacementL                                 numeric           46149
Doors                                         integer          780801
DriveType                                   character         1411574
DriverAssist                                character         5692273
DynamicBrakeSupport                         character         3657055
EDR                                         character         5632193
ESC                                         character         3507431
EVDriveUnit                                 character         5694160
ElectrificationLevel                        character         5615959
EngineConfiguration                         character         2198693
EngineCycles                                  integer         5299045
EngineCylinders                               integer          416098
EngineHP                                      numeric         2517224
EngineHP_to                                   numeric         5500857
EngineKW                                      numeric         2518459
EngineManufacturer                          character         2561190
EngineModel                                 character         2416151
EntertainmentSystem                         character         5482863
ForwardCollisionWarning                     character         4423758
FuelInjectionType                           character         4293878
FuelTypePrimary                             character          166433
FuelTypeSecondary                           character         5157971
GCWR                                vctrs_unspecified         5695015
GCWR_to                             vctrs_unspecified         5695015
GVWR                                        character         1785733
GVWR_to                             vctrs_unspecified         5695015
KeylessIgnition                             character         4147281
LaneDepartureWarning                        character         4508384
LaneKeepSystem                              character         4629374
LowerBeamHeadlampLightSource                character         5682593
Make                                        character            1260
MakeID                                        numeric            1260
Manufacturer                                character            1260
ManufacturerId                                integer            1260
Model                                       character            5843
ModelID                                       integer            5843
ModelYear                                     integer            1282
MotorcycleChassisType                       character            3919
MotorcycleSuspensionType                    character            3916
NCSABodyType                        vctrs_unspecified         5695015
NCSAMake                            vctrs_unspecified         5695015
NCSAMapExcApprovedBy                vctrs_unspecified         5695015
NCSAMapExcApprovedOn                vctrs_unspecified         5695015
NCSAMappingException                vctrs_unspecified         5695015
NCSAModel                           vctrs_unspecified         5695015
NCSANote                                    character         5680634
Note                                        character         5043647
OtherBusInfo                        vctrs_unspecified         5695015
OtherEngineInfo                             character         3805614
OtherMotorcycleInfo                         character         5693679
OtherRestraintSystemInfo                    character         3794935
OtherTrailerInfo                            character         5695005
ParkAssist                                  character         5288671
PedestrianAutomaticEmergencyBraking         character         5136236
PlantCity                                   character          614448
PlantCompanyName                            character         1568615
PlantCountry                                character          178494
PlantState                                  character         1522665
PossibleValues                              character         5687691
Pretensioner                                character         5276379
RearCrossTrafficAlert                       character         5682328
RearVisibilitySystem                        character         3534758
SAEAutomationLevel                            integer         5694761
SAEAutomationLevel_to               vctrs_unspecified         5695015
SeatBeltsAll                                character          358162
SeatRows                                      integer         3769624
Seats                                         integer         3590951
SemiautomaticHeadlampBeamSwitching          character         3705261
Series                                      character         1404119
Series2                                     character         5429563
SteeringLocation                            character         3129603
SuggestedVIN                                character         5663196
TPMS                                        character         1002845
TopSpeedMPH                                   integer         4797588
TrackWidth                                    numeric         5640611
TractionControl                             character         3600224
TrailerBodyType                             character            1299
TrailerLength                                 integer         5694994
TrailerType                                 character            1322
TransmissionSpeeds                            integer         4401493
TransmissionStyle                           character         3770877
Trim                                        character         3445574
Trim2                                       character         5575717
Turbo                                       character         4488758
VIN                                         character               0
ValveTrainDesign                            character         3494358
VehicleType                                 character            1260
WheelBaseLong                                 numeric         5607936
WheelBaseShort                                numeric         3632320
WheelBaseType                               character         5515108
WheelSizeFront                                integer         3964805
WheelSizeRear                                 integer         3965238
Wheels                                        integer         3506892
Windows                                       integer         5357885
                                    MissingPercentage
vin                                              0.00
stockNum                                         0.12
firstSeen                                        0.00
lastSeen                                         0.00
msrp                                             0.00
askPrice                                         0.00
mileage                                          0.00
isNew                                            0.00
color                                            0.00
interiorColor                                    0.00
brandName                                        0.02
modelName                                        0.10
dealerID                                         0.00
ABS                                             59.08
ActiveSafetySysNote                             85.31
AdaptiveCruiseControl                           83.10
AdaptiveDrivingBeam                             85.34
AdaptiveHeadlights                              99.95
AdditionalErrorText                             99.44
AirBagLocCurtain                                41.19
AirBagLocFront                                   5.32
AirBagLocKnee                                   70.90
AirBagLocSeatCushion                            94.19
AirBagLocSide                                   10.75
AutoReverseSystem                               65.40
AutomaticPedestrianAlertingSound                98.54
AxleConfiguration                               99.99
Axles                                           63.13
BasePrice                                       67.40
BatteryA                                       100.00
BatteryA_to                                    100.00
BatteryCells                                   100.00
BatteryInfo                                     99.76
BatteryKWh                                      99.97
BatteryKWh_to                                  100.00
BatteryModules                                 100.00
BatteryPacks                                    98.94
BatteryType                                     99.68
BatteryV                                        99.83
BatteryV_to                                    100.00
BedLengthIN                                     99.90
BedType                                         63.71
BlindSpotMon                                    75.47
BodyCabType                                     52.88
BodyClass                                        0.20
BrakeSystemDesc                                 98.16
BrakeSystemType                                 59.86
BusFloorConfigType                               0.51
BusLength                                      100.00
BusType                                          0.51
CAN_AACN                                        79.67
CIB                                             79.41
CashForClunkers                                100.00
ChargerLevel                                   100.00
ChargerPowerKW                                  99.97
CoolingType                                     91.20
CurbWeightLB                                    97.94
CustomMotorcycleType                             0.07
DaytimeRunningLight                             65.31
DestinationMarket                               97.48
DisplacementCC                                   0.81
DisplacementCI                                   0.81
DisplacementL                                    0.81
Doors                                           13.71
DriveType                                       24.79
DriverAssist                                    99.95
DynamicBrakeSupport                             64.22
EDR                                             98.90
ESC                                             61.59
EVDriveUnit                                     99.98
ElectrificationLevel                            98.61
EngineConfiguration                             38.61
EngineCycles                                    93.05
EngineCylinders                                  7.31
EngineHP                                        44.20
EngineHP_to                                     96.59
EngineKW                                        44.22
EngineManufacturer                              44.97
EngineModel                                     42.43
EntertainmentSystem                             96.27
ForwardCollisionWarning                         77.68
FuelInjectionType                               75.40
FuelTypePrimary                                  2.92
FuelTypeSecondary                               90.57
GCWR                                           100.00
GCWR_to                                        100.00
GVWR                                            31.36
GVWR_to                                        100.00
KeylessIgnition                                 72.82
LaneDepartureWarning                            79.16
LaneKeepSystem                                  81.29
LowerBeamHeadlampLightSource                    99.78
Make                                             0.02
MakeID                                           0.02
Manufacturer                                     0.02
ManufacturerId                                   0.02
Model                                            0.10
ModelID                                          0.10
ModelYear                                        0.02
MotorcycleChassisType                            0.07
MotorcycleSuspensionType                         0.07
NCSABodyType                                   100.00
NCSAMake                                       100.00
NCSAMapExcApprovedBy                           100.00
NCSAMapExcApprovedOn                           100.00
NCSAMappingException                           100.00
NCSAModel                                      100.00
NCSANote                                        99.75
Note                                            88.56
OtherBusInfo                                   100.00
OtherEngineInfo                                 66.82
OtherMotorcycleInfo                             99.98
OtherRestraintSystemInfo                        66.64
OtherTrailerInfo                               100.00
ParkAssist                                      92.86
PedestrianAutomaticEmergencyBraking             90.19
PlantCity                                       10.79
PlantCompanyName                                27.54
PlantCountry                                     3.13
PlantState                                      26.74
PossibleValues                                  99.87
Pretensioner                                    92.65
RearCrossTrafficAlert                           99.78
RearVisibilitySystem                            62.07
SAEAutomationLevel                             100.00
SAEAutomationLevel_to                          100.00
SeatBeltsAll                                     6.29
SeatRows                                        66.19
Seats                                           63.05
SemiautomaticHeadlampBeamSwitching              65.06
Series                                          24.66
Series2                                         95.34
SteeringLocation                                54.95
SuggestedVIN                                    99.44
TPMS                                            17.61
TopSpeedMPH                                     84.24
TrackWidth                                      99.04
TractionControl                                 63.22
TrailerBodyType                                  0.02
TrailerLength                                  100.00
TrailerType                                      0.02
TransmissionSpeeds                              77.29
TransmissionStyle                               66.21
Trim                                            60.50
Trim2                                           97.91
Turbo                                           78.82
VIN                                              0.00
ValveTrainDesign                                61.36
VehicleType                                      0.02
WheelBaseLong                                   98.47
WheelBaseShort                                  63.78
WheelBaseType                                   96.84
WheelSizeFront                                  69.62
WheelSizeRear                                   69.63
Wheels                                          61.58
Windows                                         94.08
#output summary as excel file for further analysis
#write.xlsx(variableSummary, "auto_variable_summary.xlsx")

A - Aggregates: Overall Picture

# get comprehensive dataset statistics
autoNumStats <- autoInfoData %>% 
  # drop columns in which every value is NA
  select(where(~ !all(is.na(.x)))) %>%
  # remove dealerID and ModelYear before averaging
  select(-dealerID, -ModelYear) %>%
  # summarize the remaining numeric columns
  summarise(
    total_rows = n(),
    across(where(is.numeric), list(
      average = ~mean(.x, na.rm = TRUE)
    ))
  ) %>%
  collect()

# display as a formatted table
print(autoNumStats)
# A tibble: 1 × 41
  total_rows msrp_average askPrice_average mileage_average Axles_average
       <int>        <dbl>            <dbl>           <dbl>         <dbl>
1    5695015      744570.          186921.          22415.          2.00
# ℹ 36 more variables: BasePrice_average <dbl>, BatteryA_average <dbl>,
#   BatteryCells_average <dbl>, BatteryKWh_average <dbl>,
#   BatteryKWh_to_average <dbl>, BatteryPacks_average <dbl>,
#   BatteryV_average <dbl>, BedLengthIN_average <dbl>,
#   ChargerPowerKW_average <dbl>, CurbWeightLB_average <dbl>,
#   DisplacementCC_average <dbl>, DisplacementCI_average <dbl>,
#   DisplacementL_average <dbl>, Doors_average <dbl>, …

N - Notable Segments

  • Analyze key categorical variables

  • Modify based on your specific data

# must be run prior to conducting calculations for formatting
options(scipen = 999) # used to avoid scientific notation
totalRecords <- nrow(autoInfoData) 
# OVERALL BRAND COUNT
# selecting car 'brandName' to observe frequency values
brandCount <- autoInfoData %>%
  # group by 'brandName' Variable
  group_by(brandName) %>% 
  # conducts count
  summarise(Frequency = n()) %>%   
  # calculate percentage in respect to the dataset
  mutate(Percentage = (Frequency / sum(Frequency)) * 100) %>% 
  # sort from highest to lowest frequency
  arrange(desc(Frequency))           

# print out table
print(brandCount)
# A tibble: 111 × 3
   brandName  Frequency Percentage
   <chr>          <int>      <dbl>
 1 CHEVROLET     890213      15.6 
 2 FORD          782063      13.7 
 3 TOYOTA        398976       7.01
 4 JEEP          369067       6.48
 5 NISSAN        312876       5.49
 6 HONDA         278725       4.89
 7 HYUNDAI       264381       4.64
 8 GMC           232111       4.08
 9 DODGE         218588       3.84
10 VOLKSWAGEN    210886       3.70
# ℹ 101 more rows

*Note: For price-based calculations, the median was used instead of the mean so that outlier prices on individual vehicle listings do not skew the results.
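
As a minimal sketch of why the median is the safer choice here, the example below compares the two measures on a handful of made-up prices. The values are hypothetical and are not taken from the dataset; the last one mimics a data-entry outlier.

```r
# Hypothetical asking prices in USD (not from the dataset);
# the last value mimics an outlier listing
prices <- c(15000, 18000, 21000, 24000, 9999999)

mean_price   <- mean(prices)    # pulled far upward by the single outlier
median_price <- median(prices)  # middle value, unaffected by the outlier

cat("Mean:  ", mean_price, "\n")
cat("Median:", median_price, "\n")
```

A single extreme listing moves the mean into the millions while the median stays at a typical price.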

Pricing Differences:

  • MSRP - Manufacturer’s Suggested Retail Price

  • askPrice - Last price seen before vehicle was sold

# MODEL NAME & VEHICLE TYPE COUNT 
modelCount <- autoInfoData %>%
  # group by 'brandName', 'modelName', and 'VehicleType'
  group_by(brandName, modelName, VehicleType) %>% 
  # conducts count
  summarise(
        MedianMSRP =(median(msrp)), 
        MedianAskPrice =(median(askPrice)), 
        Frequency = n()) %>%  
  # calculate percentage in respect to the dataset
  mutate(Percentage = (Frequency /totalRecords) * 100) %>% 
  # sort from highest to lowest frequency
  arrange(desc(Frequency))           
`summarise()` has grouped output by 'brandName', 'modelName'. You can override
using the `.groups` argument.
# print out table
print(modelCount)
# A tibble: 1,683 × 7
# Groups:   brandName, modelName [1,548]
   brandName modelName      VehicleType      MedianMSRP MedianAskPrice Frequency
   <chr>     <chr>          <chr>                 <dbl>          <dbl>     <int>
 1 FORD      F-150          "TRUCK "              30981         28995     175036
 2 CHEVROLET Silverado      "TRUCK "              31959         29973     161960
 3 CHEVROLET Equinox        "MULTIPURPOSE P…      21990         20129     157648
 4 FORD      Escape         "MULTIPURPOSE P…      17950         16601     117659
 5 JEEP      Grand Cherokee "MULTIPURPOSE P…      29994         28777     104822
 6 CHEVROLET Malibu         "PASSENGER CAR"       16660         15691      96262
 7 RAM       1500           "TRUCK "              32622         30250      86582
 8 FORD      Explorer       "MULTIPURPOSE P…      27990         26738.     83510
 9 GMC       Sierra         "TRUCK "              40267         37745      81360
10 FORD      Fusion         "PASSENGER CAR"       15908         14777      80275
# ℹ 1,673 more rows
# ℹ 1 more variable: Percentage <dbl>
# sum(modelCount$Percentage) ~ this is a check to make sure values
# do equal 100
#MOST COMMON VEHICLE TYPE
vehicleTypeMode <- autoInfoData %>%
  filter(!is.na(VehicleType)) %>%   
  count(VehicleType) %>%        
  arrange(desc(n)) %>%             
  slice(1) %>%                   
  pull(VehicleType)                

cat("Mode of 'VehicleType':", vehicleTypeMode, "\n")
Mode of 'VehicleType': MULTIPURPOSE PASSENGER VEHICLE (MPV) 
#USA MANUFACTURING TABLE
plantTable <- autoInfoData %>%
  # only focus on data from the US
  filter(PlantCountry %in% c("UNITED STATES (USA)")) %>% 
  # removes NA values
  filter(!is.na(PlantState)) %>% 
  # group by Plant State
  group_by(PlantState) %>%                           
  # counts values       
  summarise(Frequency = n(), .groups = "drop") %>%   
  # arrange count from high to low            
  arrange(desc(Frequency))                        

#print out table
print(plantTable)
# A tibble: 32 × 2
   PlantState Frequency
   <chr>          <int>
 1 MICHIGAN      677742
 2 OHIO          322011
 3 INDIANA       311711
 4 KENTUCKY      288568
 5 TENNESSEE     261068
 6 ALABAMA       220984
 7 ILLINOIS      176490
 8 MISSOURI      172641
 9 KANSAS        110277
10 TEXAS         107274
# ℹ 22 more rows
# MANUFACTURING BY COUNTRY TABLE
countryPlantTable <- autoInfoData %>%
  # removes NA values
  filter(!is.na(PlantCountry)) %>%
  # group by 'PlantCountry'
  group_by(PlantCountry) %>%
  # counts values
  summarise(Frequency = n(), .groups = "drop") %>%
  # arrange count from high to low
  arrange(desc(Frequency))

#print out table
print(countryPlantTable)
# A tibble: 32 × 2
   PlantCountry        Frequency
   <chr>                   <int>
 1 UNITED STATES (USA)   2951430
 2 MEXICO                 711108
 3 CANADA                 601480
 4 JAPAN                  458808
 5 SOUTH KOREA            285261
 6 GERMANY                225355
 7 UNITED KINGDOM (UK)     47228
 8 ITALY                   40050
 9 ENGLAND                 29034
10 SWEDEN                  22859
# ℹ 22 more rows

Complete this comprehensive assessment:

DATASET OVERVIEW:

Records: There are 5,695,015 records in total, representing vehicles present at dealerships across the state of Illinois.

Time span: The time-frame of the data is roughly from 9/29/2017 to 5/30/2020.

Key metrics:

  • The most common model year is 2019.

  • The most frequent car model across Illinois dealerships is the Ford F-150, which makes up 3.07% of the dataset.

    • The median asking price for a Ford F-150 is $28,995, compared with a median MSRP of $30,981.
  • However, Chevrolet is the most frequent brand across Illinois dealerships, accounting for 15.63% of records.

  • The most common vehicle type across dealerships is Multipurpose Passenger Vehicle (MPV).

  • The US is the most common country of manufacture in the dataset, and Michigan is the most common US plant state.
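
The percentages in these key metrics can be rechecked from the counts printed in the tables above. The sketch below is only an arithmetic check; the variable names are my own, and the price gap is the difference between the two reported medians, which is not necessarily the median of the per-vehicle differences.

```r
# Counts and medians copied from the printed tables above
total_records   <- 5695015
f150_count      <- 175036
chevrolet_count <- 890213

f150_share      <- f150_count / total_records * 100
chevrolet_share <- chevrolet_count / total_records * 100

# Gap between the F-150 median MSRP and median asking price
f150_gap     <- 30981 - 28995
f150_gap_pct <- f150_gap / 30981 * 100

print(round(c(f150_share      = f150_share,
              chevrolet_share = chevrolet_share,
              f150_gap_pct    = f150_gap_pct), 2))
```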

DATA COMPLETENESS:

Potential Core fields:

Variable Name Completeness
firstSeen 100.00%
lastSeen 100.00%
msrp 100.00%
askPrice 100.00%
mileage 100.00%
isNew 100.00%
color 100.00%
interiorColor 100.00%
brandName 99.98%
modelName 99.90%
BodyClass 99.80%
Doors 86.29%
EngineCylinders 92.69%
FuelTypePrimary 97.08%
ModelYear 99.98%
PlantCity 89.21%
PlantCountry 96.87%
PlantState 73.26%
VehicleType 99.98%
KeylessIgnition 27.18%
LaneDepartureWarning 20.84%
LaneKeepSystem 18.71%
BlindSpotMon 24.53%
BodyCabType 47.12%

Overall “Variable Completeness” Table

Variable Name Completeness
vin 100.00%
stockNum 99.88%
firstSeen 100.00%
lastSeen 100.00%
msrp 100.00%
askPrice 100.00%
mileage 100.00%
isNew 100.00%
color 100.00%
interiorColor 100.00%
brandName 99.98%
modelName 99.90%
dealerID 100.00%
ABS 40.92%
ActiveSafetySysNote 14.69%
AdaptiveCruiseControl 16.90%
AdaptiveDrivingBeam 14.66%
AdaptiveHeadlights 0.05%
AdditionalErrorText 0.56%
AirBagLocCurtain 58.81%
AirBagLocFront 94.68%
AirBagLocKnee 29.10%
AirBagLocSeatCushion 5.81%
AirBagLocSide 89.25%
AutoReverseSystem 34.60%
AutomaticPedestrianAlertingSound 1.46%
AxleConfiguration 0.01%
Axles 36.87%
BasePrice 32.60%
BatteryA 0.00%
BatteryA_to 0.00%
BatteryCells 0.00%
BatteryInfo 0.24%
BatteryKWh 0.03%
BatteryKWh_to 0.00%
BatteryModules 0.00%
BatteryPacks 1.06%
BatteryType 0.32%
BatteryV 0.17%
BatteryV_to 0.00%
BedLengthIN 0.10%
BedType 36.29%
BlindSpotMon 24.53%
BodyCabType 47.12%
BodyClass 99.80%
BrakeSystemDesc 1.84%
BrakeSystemType 40.14%
BusFloorConfigType 99.49%
BusLength 0.00%
BusType 99.49%
CAN_AACN 20.33%
CIB 20.59%
CashForClunkers 0.00%
ChargerLevel 0.00%
ChargerPowerKW 0.03%
CoolingType 8.80%
CurbWeightLB 2.06%
CustomMotorcycleType 99.93%
DaytimeRunningLight 34.69%
DestinationMarket 2.52%
DisplacementCC 99.19%
DisplacementCI 99.19%
DisplacementL 99.19%
Doors 86.29%
DriveType 75.21%
DriverAssist 0.05%
DynamicBrakeSupport 35.78%
EDR 1.10%
ESC 38.41%
EVDriveUnit 0.02%
ElectrificationLevel 1.39%
EngineConfiguration 61.39%
EngineCycles 6.95%
EngineCylinders 92.69%
EngineHP 55.80%
EngineHP_to 3.41%
EngineKW 55.78%
EngineManufacturer 55.03%
EngineModel 57.57%
EntertainmentSystem 3.73%
ForwardCollisionWarning 22.32%
FuelInjectionType 24.60%
FuelTypePrimary 97.08%
FuelTypeSecondary 9.43%
GCWR 0.00%
GCWR_to 0.00%
GVWR 68.64%
GVWR_to 0.00%
KeylessIgnition 27.18%
LaneDepartureWarning 20.84%
LaneKeepSystem 18.71%
LowerBeamHeadlampLightSource 0.22%
Make 99.98%
MakeID 99.98%
Manufacturer 99.98%
ManufacturerId 99.98%
Model 99.90%
ModelID 99.90%
ModelYear 99.98%
MotorcycleChassisType 99.93%
MotorcycleSuspensionType 99.93%
NCSABodyType 0.00%
NCSAMake 0.00%
NCSAMapExcApprovedBy 0.00%
NCSAMapExcApprovedOn 0.00%
NCSAMappingException 0.00%
NCSAModel 0.00%
NCSANote 0.25%
Note 11.44%
OtherBusInfo 0.00%
OtherEngineInfo 33.18%
OtherMotorcycleInfo 0.02%
OtherRestraintSystemInfo 33.36%
OtherTrailerInfo 0.00%
ParkAssist 7.14%
PedestrianAutomaticEmergencyBraking 9.81%
PlantCity 89.21%
PlantCompanyName 72.46%
PlantCountry 96.87%
PlantState 73.26%
PossibleValues 0.13%
Pretensioner 7.35%
RearCrossTrafficAlert 0.22%
RearVisibilitySystem 37.93%
SAEAutomationLevel 0.00%
SAEAutomationLevel_to 0.00%
SeatBeltsAll 93.71%
SeatRows 33.81%
Seats 36.95%
SemiautomaticHeadlampBeamSwitching 34.94%
Series 75.34%
Series2 4.66%
SteeringLocation 45.05%
SuggestedVIN 0.56%
TPMS 82.39%
TopSpeedMPH 15.76%
TrackWidth 0.96%
TractionControl 36.78%
TrailerBodyType 99.98%
TrailerLength 0.00%
TrailerType 99.98%
TransmissionSpeeds 22.71%
TransmissionStyle 33.79%
Trim 39.50%
Trim2 2.09%
Turbo 21.18%
VIN 100.00%
ValveTrainDesign 38.64%
VehicleType 99.98%
WheelBaseLong 1.53%
WheelBaseShort 36.22%
WheelBaseType 3.16%
WheelSizeFront 30.38%
WheelSizeRear 30.37%
Wheels 38.42%
Windows 5.92%
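
The completeness rates in these tables are the complement of the missing-value percentages printed earlier. A minimal sketch of that conversion, using four fields copied from the missing-value output (the vector names are my own):

```r
# Missing-value percentages copied from the earlier output
missing_pct <- c(
  EngineCylinders = 7.31,
  FuelTypePrimary = 2.92,
  KeylessIgnition = 72.82,
  LaneKeepSystem  = 81.29
)

# Completeness is whatever share of the column is not missing
completeness_pct <- 100 - missing_pct
print(completeness_pct)
```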

DATA QUALITY STRENGTHS:

  1. What aspects are high quality?

    The breadth of the dataset is its main strength: it records a wide range of measures on the components and attributes of a vehicle. It also includes attributes that I did not initially consider but that are highly relevant to an analysis focused on price/value impact.

  2. What makes this reliable?

    The dataset is reliable because most of the core fields in my analysis are nearly complete; only a few fall below a completeness rate of 85%. The fields below that preferred rate can still serve as secondary support in other parts of the exploratory analysis.

  3. What coverage is excellent?

    Coverage is excellent for the exterior physical aspects of vehicles and for geographic information (such as fields related to manufacturer location), both of which are complete enough for analysis.

DATA QUALITY CONCERNS:

  1. What are the main issues?

    The main issue is that many technical fields have N/A or missing values, which may limit certain observations or make them impossible where too much coverage is missing.

    Secondly, the dataset is large, and precisely filtering its fields is harder because values are formatted inconsistently. For instance, some categorical fields use all-uppercase values, while others do not.

    Thirdly, the variable names are not standardized and would need to be converted to a consistent “camelCase” format to make data cleaning and manipulation easier.

  2. What might limit analysis?

    Some variables are missing so much information that they may be impractical to use in the analysis. However, most of the target variables, as previously noted, are largely complete; the few that do not meet the preferred threshold may need to be replaced by other variables/fields.

  3. What needs careful handling?

    What requires the most careful handling is further data cleaning and manipulation, to ensure the dataset contains only information relevant to the analysis. I also need to ensure that any additional variables I focus on remain relevant to the objective of the analysis and match the scope of what I want to present.
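
The formatting concerns raised above (inconsistent casing in categorical values and non-standardized variable names) could be handled along the following lines. This is only a sketch with hypothetical sample values and a hypothetical helper, `to_lower_camel()`, not the cleaning code used in this project; names that are acronyms, such as VIN or ABS, would still need manual handling.

```r
# Hypothetical raw values; "TRUCK " with a trailing space does appear
# in the printed model table
vehicle_type_raw <- c("TRUCK ", "Passenger Car", "truck", " PASSENGER CAR")

# Trim surrounding whitespace, then force a single case
vehicle_type_clean <- toupper(trimws(vehicle_type_raw))
print(unique(vehicle_type_clean))

# Hypothetical helper: lower-case only the first letter of each name
to_lower_camel <- function(x) {
  paste0(tolower(substr(x, 1, 1)), substring(x, 2))
}

col_names <- c("brandName", "PlantCity", "VehicleType")
print(to_lower_camel(col_names))
```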

MISSING DATA IMPACT:

  • Most missing:

    Among the core fields initially selected, ‘LaneKeepSystem’ has the lowest completeness rate, at 18.71%.

  • Impact on analysis:

    Since most of this field is missing, any analysis that relies on it may be unrepresentative (for example, through skewed values).

  • Handling strategy:

    My default strategy is to leave out fields in which most values are missing or NA. I choose which variables to focus on primarily by completeness. If a variable has a completeness rate of 50% or less, I will consider carefully whether to include it in any form, because most of its values are unusable and including it would reduce the reliability of the reporting; in those cases I will consider other variables instead. ‘LaneKeepSystem’ and similar variables will be set aside only if further investigation does not warrant their use.

  • Most reliable variables:

    • firstSeen

    • lastSeen

    • msrp

    • askPrice

    • mileage

    • isNew

    • color

    • interiorColor

    • brandName

    • modelName

    • BodyClass

    • EngineCylinders

    • FuelTypePrimary

    • ModelYear

    • PlantCity

    • PlantCountry

    • VehicleType

  • Variables needing caution:

    • KeylessIgnition

    • LaneDepartureWarning

    • LaneKeepSystem

    • BlindSpotMon

    • BodyCabType

  • Overall confidence level:

    My overall confidence level remains high for the data analysis and reporting.
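
As a rough illustration of the completeness-based selection described in this section, the sketch below sorts a few fields into tiers. The completeness values are copied from the tables above; the cut-offs follow the 50% and 85% rates mentioned in the text, while the three tier labels are my own assumption.

```r
# Completeness rates copied from the tables above
completeness <- c(
  msrp           = 100.00,
  PlantCity      = 89.21,
  PlantState     = 73.26,
  BodyCabType    = 47.12,
  LaneKeepSystem = 18.71
)

# 50% or less: use with caution; at least 85%: reliable;
# anything in between: secondary support (labels are assumptions)
status <- ifelse(completeness <= 50, "caution",
          ifelse(completeness >= 85, "reliable", "secondary"))

print(status)
```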

JUSTIFICATION:

I remain highly confident in the analysis, despite some fields having a completeness rate below 50%, because most of the selected target variables are largely complete (at least 86%) and can be relied upon. In addition, because the analysis focuses on the value of vehicles and the key features that may affect pricing, I have a clear and specific objective to pursue through continued analysis of the dataset. I will continue to bring in other fields where they support or reinforce the findings, while remaining cautious about which ones to incorporate.

Professional Summary & Next Steps:

After successfully loading the dataset, I performed a basic analysis of selected aspects of the data to begin addressing the key objective related to vehicle pricing/value. This involved a data validity assessment (checking the structure of the data) and using various features to bring out key metrics about vehicles beyond just value. These included which vehicle brands appear most frequently across dealerships in Illinois, the most common vehicle model year, and the most common vehicle type. Measuring the frequency of categorical variables also showed how prevalent certain vehicle feature sets are in the data.

Moving forward, the next step is further data cleaning to standardize the formatting of the dataset so that analysis can continue without issues. After this crucial step, exploration and investigation will continue toward the key objectives, leading to additional data tables and visualizations that present the key findings and develop the story in the data, with a focus on vehicle value/pricing.

Deliverable Checklist

Ensure your submission includes:

  • Complete READY framework analysis with thoughtful responses

  • Systematic SCAN framework exploration with specific findings

  • Successful data loading with Arrow

  • Professional data description and summary statistics

  • Comprehensive missing value analysis with percentages

  • Variable summary table documenting key fields

  • Memory efficiency demonstration - (Parquet Conversion)

  • 3-5 well-defined, specific exploratory research questions

    • How strongly do specific technical and physical attributes of a car affect its asking price at a dealership?

    • Which aspects of a car beyond its physical attributes, such as the location where it was assembled or manufactured, are correlated with pricing?

    • Which car brands and models are most prominent at dealership locations, and how does that relate to how frequently they are sold?

    • Which car features are most prominent in vehicles at dealerships, and which should manufacturers prioritize because they sell best?

    • How much do MSRP and asking price differ based on factors such as mileage?

  • Data quality assessment with honest evaluation

  • Professional summary with clear next steps

Grading Criteria

  • READY Framework (20%): Thoughtful strategic planning showing understanding of stakeholders and analytical approach

  • Data Loading (15%): Successful Arrow implementation with proper documentation

  • SCAN Framework (25%): Systematic exploration with specific, meaningful findings

  • Data Quality Assessment (20%): Comprehensive evaluation with specific evidence

  • Research Questions (15%): Clear, answerable questions tied to stakeholder needs and data capabilities

  • Professional Communication (5%): Clear, honest, well-organized presentation throughout

Tips for Success

  • Be specific in your observations - avoid vague statements

  • Think like a stakeholder - what would decision-makers actually want to know?

  • Document your reasoning for all assessment decisions

  • Be honest about limitations - this builds credibility

  • Focus on actionable insights - what can actually be learned from this data?

  • Ask for help if your data format doesn’t match the provided templates

Remember: This is exploratory data analysis - you’re learning about your data, not proving predetermined hypotheses. Let your curiosity guide your investigation while maintaining systematic rigor.