Part 1 Abstract

Predicting Car Fuel Efficiency with Engine Features

This project investigates the impact of engine characteristics on car fuel efficiency, measured in miles per gallon (MPG). Given the significant role of the automobile industry in petroleum consumption and greenhouse gas emissions, improving fuel efficiency is crucial.

We hypothesize that key engine features, namely the number of cylinders (cyl) and engine displacement (displ) in liters, can be valuable predictors of a car’s fuel efficiency.

To test this hypothesis, we perform a multiple regression analysis on a dataset containing 33,442 vehicle testing records with 12 quantitative and qualitative variables. From these data, we focus on cyl and displ (explanatory variables) and on combined MPG (response variable), which is derived from the city and highway MPG ratings.

The results of this analysis will help determine the significance of these engine features in predicting fuel efficiency. This information can be valuable for car manufacturers, policymakers, and consumers seeking to improve fuel economy and reduce environmental impact.

Keywords:

Fuel Efficiency, Car, Engine, Cylinders, Displacement, MPG, Multiple Regression, Prediction, Environment, Policymakers, Consumers

Introduction

Addressing environmental concerns and reducing reliance on fossil fuels have made the pursuit of sustainable transportation solutions essential. Central to this effort is understanding and improving vehicle fuel economy. The Environmental Protection Agency (EPA) in the United States plays a pivotal role in evaluating vehicle fuel efficiency through rigorous testing at its National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan.

The fuel economy data collected through these tests provide valuable insights into vehicle performance and environmental impact. These assessments are crucial benchmarks for manufacturers, helping them develop vehicles that meet stringent emissions standards and fuel efficiency requirements.

Under the EPA’s supervision, manufacturers conduct comprehensive testing to ensure regulatory compliance and enhance the sustainability of their vehicle fleets. This collaborative effort between regulatory agencies and manufacturers highlights a shared commitment to advancing environmentally responsible transportation solutions.

In this project, titled “Predicting Car Fuel Efficiency with Engine Features,” we aim to leverage the fuel economy data from the EPA’s testing initiatives to analyze trends, identify patterns, and gain a deeper understanding of vehicle performance characteristics. By utilizing data analytics, we seek to extract actionable insights to inform decisions that promote fuel efficiency, reduce emissions, and drive innovation in the automotive industry.

Research question:

Do a car's engine features (cylinders, displacement) affect its fuel efficiency (MPG)? We address this question with a multiple linear regression analysis.

Null Hypothesis (H0): There is no significant relationship between the features of a car and its fuel efficiency. In other words, features such as Cylinders and Displacement do not have a significant impact on determining fuel efficiency.

Alternative Hypothesis (H1): There is a significant relationship between the features of a car and its fuel efficiency. Features such as Cylinders and Displacement do have a significant impact on determining fuel efficiency.
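
Stated in terms of regression coefficients, the model and the two hypotheses can be written as follows. The symbols β and ε are notation introduced here for illustration; the variable names cyl, displ, and combined are the ones used in this report.

```latex
\[
\text{combined}_i = \beta_0 + \beta_1\,\text{cyl}_i + \beta_2\,\text{displ}_i + \varepsilon_i
\]
\[
H_0:\ \beta_1 = \beta_2 = 0
\qquad\text{versus}\qquad
H_1:\ \beta_1 \neq 0 \ \text{or}\ \beta_2 \neq 0
\]
```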

By conducting a multiple linear regression analysis, we can evaluate the significance of these features and determine whether we can reject the null hypothesis (H0) in favor of the alternative hypothesis (H1). The regression model will estimate the coefficients for each feature, indicating the direction and magnitude of their influence on fuel efficiency.

If the p-values associated with the coefficients fall below the chosen significance level (typically 0.05), this provides evidence to reject the null hypothesis (H0) and supports the presence of a significant relationship between the features and the fuel efficiency of a car.
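
The decision rule above can be sketched in R. This is a minimal illustration only: it uses R's built-in mtcars data (columns mpg, cyl, disp) as a stand-in, not the fuel economy dataset analysed in this report, and the object names are chosen here for the example.

```r
# Illustrative stand-in: R's built-in mtcars data (mpg, cyl, disp),
# NOT the fuel economy dataset analysed in this report.
fit <- lm(mpg ~ cyl + disp, data = mtcars)

# Coefficient table: estimate, standard error, t value, p-value
coef_table <- summary(fit)$coefficients
p_values <- coef_table[c("cyl", "disp"), "Pr(>|t|)"]

alpha <- 0.05                    # significance level named in the text
reject_h0 <- p_values < alpha    # TRUE where a coefficient is significant at alpha
print(round(p_values, 4))
print(reject_h0)
```

The same pattern would apply to the model fitted later in this report, with combined, cyl, and displ in place of the mtcars columns.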

Ultimately, this hypothesis test will help determine whether the selected features are valuable predictors of a car's fuel efficiency.

Part 2 Data

Load all necessary libraries and the plot theme

Packages attached (startup messages condensed): tidyverse 2.0.0 (dplyr 1.1.4, readr 2.1.5, forcats 1.0.0, stringr 1.5.1, lubridate 1.9.3, tibble 3.2.1, purrr 1.0.2, tidyr 1.3.1), GGally, Hmisc, gridExtra, and factoextra.

Data Source

The data were compiled by Hadley Wickham and are available online at https://github.com/hadley/fueleconomy

##  [1] "id"    "make"  "model" "year"  "class" "trans" "drive" "cyl"   "displ"
## [10] "fuel"  "hwy"   "cty"
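
A minimal sketch of how the data could be loaded, assuming the fueleconomy package from the repository above is installed and that its vehicles table is the one used here. The object name fuel_data is taken from the model call that appears later in this report; the guard around the load is added only so that the snippet runs without the package.

```r
# Assumes the 'fueleconomy' package is installed; 'vehicles' is assumed to be
# the table whose column names are listed above.
if (requireNamespace("fueleconomy", quietly = TRUE)) {
  fuel_data <- fueleconomy::vehicles
  print(names(fuel_data))   # column names
  print(dim(fuel_data))     # number of rows and columns
} else {
  message("Package 'fueleconomy' is not installed; skipping the load step.")
}
```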

Part 3 - Data Preparation

## # A tibble: 6 × 12
##      id make  model        year class  trans drive   cyl displ fuel    hwy   cty
##   <dbl> <chr> <chr>       <dbl> <chr>  <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 13309 Acura 2.2CL/3.0CL  1997 Subco… Auto… Fron…     4   2.2 Regu…    26    20
## 2 13310 Acura 2.2CL/3.0CL  1997 Subco… Manu… Fron…     4   2.2 Regu…    28    22
## 3 13311 Acura 2.2CL/3.0CL  1997 Subco… Auto… Fron…     6   3   Regu…    26    18
## 4 14038 Acura 2.3CL/3.0CL  1998 Subco… Auto… Fron…     4   2.3 Regu…    27    19
## 5 14039 Acura 2.3CL/3.0CL  1998 Subco… Manu… Fron…     4   2.3 Regu…    29    21
## 6 14040 Acura 2.3CL/3.0CL  1998 Subco… Auto… Fron…     6   3   Regu…    26    17
## tibble [33,442 × 12] (S3: tbl_df/tbl/data.frame)
##  $ id   : num [1:33442] 13309 13310 13311 14038 14039 ...
##  $ make : chr [1:33442] "Acura" "Acura" "Acura" "Acura" ...
##  $ model: chr [1:33442] "2.2CL/3.0CL" "2.2CL/3.0CL" "2.2CL/3.0CL" "2.3CL/3.0CL" ...
##  $ year : num [1:33442] 1997 1997 1997 1998 1998 ...
##  $ class: chr [1:33442] "Subcompact Cars" "Subcompact Cars" "Subcompact Cars" "Subcompact Cars" ...
##  $ trans: chr [1:33442] "Automatic 4-spd" "Manual 5-spd" "Automatic 4-spd" "Automatic 4-spd" ...
##  $ drive: chr [1:33442] "Front-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" ...
##  $ cyl  : num [1:33442] 4 4 6 4 4 6 4 4 6 5 ...
##  $ displ: num [1:33442] 2.2 2.2 3 2.3 2.3 3 2.3 2.3 3 2.5 ...
##  $ fuel : chr [1:33442] "Regular" "Regular" "Regular" "Regular" ...
##  $ hwy  : num [1:33442] 26 28 26 27 29 26 27 29 26 23 ...
##  $ cty  : num [1:33442] 20 22 18 19 21 17 20 21 17 18 ...

Variables

This dataset comprises 33,442 vehicle testing records. The fuel economy data originate from tests conducted at the Environmental Protection Agency's National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan, and from tests that manufacturers conduct under the supervision of the U.S. Environmental Protection Agency (EPA). The dataset includes detailed information on various vehicle features, including cylinders and displacement, which this project uses to analyze and predict fuel efficiency.

There are 11 variables, excluding the record identifier id.

-   class - Vehicle Class



    +--------------------------+-----------------------------------+
    | Class                    | Passenger & Cargo Volume(cu. ft.) |
    |                          | (trucks, vans, SUVs: GVWR in lb)  |
    +==========================+:=================================:+
    | TWO-SEATER CARS          | any                               |
    +--------------------------+-----------------------------------+
    | SEDANS                   |                                   |
    +--------------------------+-----------------------------------+
    | -   Minicompact          | Under 85                          |
    +--------------------------+-----------------------------------+
    | -   Subcompact           | 85 to 99                          |
    +--------------------------+-----------------------------------+
    | -   Compact              | 100 to 109                        |
    +--------------------------+-----------------------------------+
    | -   Midsize              | 110 to 119                        |
    +--------------------------+-----------------------------------+
    | -   Large                | 120 or more                       |
    +--------------------------+-----------------------------------+
    | STATION WAGONS           |                                   |
    +--------------------------+-----------------------------------+
    | -   Small                | Under 130                         |
    +--------------------------+-----------------------------------+
    | -   Midsize              | 130 to 159                        |
    +--------------------------+-----------------------------------+
    | -   Large                | 160 or more                       |
    +--------------------------+-----------------------------------+
    | PICKUP TRUCKS            |                                   |
    +--------------------------+-----------------------------------+
    | -   Small                | Under 6,000                       |
    +--------------------------+-----------------------------------+
    | -   Standard             | 6,000 to 8,500                    |
    +--------------------------+-----------------------------------+
    | VANS                     |                                   |
    +--------------------------+-----------------------------------+
    | -   Passenger            | Under 10,000                      |
    +--------------------------+-----------------------------------+
    | -   Cargo                | Under 8,500                       |
    +--------------------------+-----------------------------------+
    | MINIVANS                 | Under 8,500                       |
    +--------------------------+-----------------------------------+
    | SPORT UTILITY VEHICLES   |                                   |
    +--------------------------+-----------------------------------+
    | -   Small                | Under 6,000                       |
    +--------------------------+-----------------------------------+
    | -   Standard             | 6,000 to 9,999                    |
    +--------------------------+-----------------------------------+
    | SPECIAL PURPOSE VEHICLES | Under 8,500                       |
    +--------------------------+-----------------------------------+

-   trans - Transmission

Of these eleven variables, the following 5 are quantitative:

year, cyl, displ, hwy, cty

and the following 6 are qualitative:

make, model, class, trans, drive, fuel

Check for missing values and drop all rows that contain null values

## [1] TRUE
##    id  make model  year class trans drive   cyl displ  fuel   hwy   cty 
##     0     0     0     0     0     8     0    58    57     0     0     0
##    id  make model  year class trans drive   cyl displ  fuel   hwy   cty 
##     0     0     0     0     0     0     0     0     0     0     0     0
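
The three outputs above match a check, count, and drop sequence; the later structure output records an na.action of 'omit' for 60 rows, which agrees with 33,442 - 33,382 = 60. A minimal sketch on a toy data frame (the rows below are hypothetical and only mimic the kinds of gaps seen in trans, cyl, and displ):

```r
# Toy data frame (hypothetical rows) with the same kinds of gaps as the real data
toy <- data.frame(
  trans = c("Manual 5-spd", NA, "Automatic 4-spd", "Automatic 4-spd"),
  cyl   = c(4, 6, NA, 4),
  displ = c(2.2, 3.0, NA, 2.3)
)

print(anyNA(toy))             # TRUE while any value is missing
print(colSums(is.na(toy)))    # number of missing values per column

toy_clean <- na.omit(toy)     # keep only complete rows
print(colSums(is.na(toy_clean)))   # all zeros after dropping
print(nrow(toy_clean))             # 2 complete rows remain
```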

Fuel efficiency and Combined MPG

The EPA Fuel Economy Guide (https://www.fueleconomy.gov/feg/pdfs/guides/FEG2023.pdf) describes three estimates:

-   A “city” estimate that represents urban driving, in which a vehicle is started in the morning (after being parked all night) and driven in stop-and-go traffic.
-   A “highway” estimate that represents a mixture of rural and interstate highway driving in a warmed-up vehicle, typical of longer trips in free-flowing traffic.
-   A “combined” estimate that represents a combination of city driving (55%) and highway driving (45%).

The combined variable represents fuel efficiency in this analysis. MPG (“miles per gallon”) is a common unit of measurement for the fuel efficiency, or fuel economy, of vehicles.
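
The combined column can be sketched as an arithmetic 55/45 weighted average of cty and hwy. This weighting is inferred from the numbers, since it reproduces the combined values printed in the structure output further down (22.7, 24.7, 21.6 for the first three records); the data frame name below is chosen for the example.

```r
# City and highway ratings of the first three records printed in this report
first_rows <- data.frame(cty = c(20, 22, 18), hwy = c(26, 28, 26))

# 55% city + 45% highway, as an arithmetic weighted average
first_rows$combined <- 0.55 * first_rows$cty + 0.45 * first_rows$hwy
print(first_rows$combined)   # 22.7 24.7 21.6
```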

## # A tibble: 6 × 13
##      id make  model        year class  trans drive   cyl displ fuel    hwy   cty
##   <dbl> <chr> <chr>       <dbl> <chr>  <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 13309 Acura 2.2CL/3.0CL  1997 Subco… Auto… Fron…     4   2.2 Regu…    26    20
## 2 13310 Acura 2.2CL/3.0CL  1997 Subco… Manu… Fron…     4   2.2 Regu…    28    22
## 3 13311 Acura 2.2CL/3.0CL  1997 Subco… Auto… Fron…     6   3   Regu…    26    18
## 4 14038 Acura 2.3CL/3.0CL  1998 Subco… Auto… Fron…     4   2.3 Regu…    27    19
## 5 14039 Acura 2.3CL/3.0CL  1998 Subco… Manu… Fron…     4   2.3 Regu…    29    21
## 6 14040 Acura 2.3CL/3.0CL  1998 Subco… Auto… Fron…     6   3   Regu…    26    17
## # ℹ 1 more variable: combined <dbl>

Summary statistics

##        id            make              model                year     
##  Min.   :    1   Length:33382       Length:33382       Min.   :1984  
##  1st Qu.: 8346   Class :character   Class :character   1st Qu.:1991  
##  Median :16694   Mode  :character   Mode  :character   Median :1999  
##  Mean   :17013                                         Mean   :1999  
##  3rd Qu.:25226                                         3rd Qu.:2007  
##  Max.   :34932                                         Max.   :2015  
##     class              trans              drive                cyl        
##  Length:33382       Length:33382       Length:33382       Min.   : 2.000  
##  Class :character   Class :character   Class :character   1st Qu.: 4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
##                                                           Mean   : 5.772  
##                                                           3rd Qu.: 6.000  
##                                                           Max.   :16.000  
##      displ           fuel                hwy             cty       
##  Min.   :0.000   Length:33382       Min.   : 9.00   Min.   : 6.00  
##  1st Qu.:2.300   Class :character   1st Qu.:19.00   1st Qu.:15.00  
##  Median :3.000   Mode  :character   Median :23.00   Median :17.00  
##  Mean   :3.353                      Mean   :23.46   Mean   :17.37  
##  3rd Qu.:4.300                      3rd Qu.:27.00   3rd Qu.:20.00  
##  Max.   :8.400                      Max.   :61.00   Max.   :53.00  
##     combined    
##  Min.   : 7.80  
##  1st Qu.:16.70  
##  Median :19.70  
##  Mean   :20.11  
##  3rd Qu.:22.60  
##  Max.   :54.40

Check the structure of the dataset

## tibble [33,382 × 13] (S3: tbl_df/tbl/data.frame)
##  $ id      : num [1:33382] 13309 13310 13311 14038 14039 ...
##  $ make    : chr [1:33382] "Acura" "Acura" "Acura" "Acura" ...
##  $ model   : chr [1:33382] "2.2CL/3.0CL" "2.2CL/3.0CL" "2.2CL/3.0CL" "2.3CL/3.0CL" ...
##  $ year    : num [1:33382] 1997 1997 1997 1998 1998 ...
##  $ class   : chr [1:33382] "Subcompact Cars" "Subcompact Cars" "Subcompact Cars" "Subcompact Cars" ...
##  $ trans   : chr [1:33382] "Automatic 4-spd" "Manual 5-spd" "Automatic 4-spd" "Automatic 4-spd" ...
##  $ drive   : chr [1:33382] "Front-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" ...
##  $ cyl     : num [1:33382] 4 4 6 4 4 6 4 4 6 5 ...
##  $ displ   : num [1:33382] 2.2 2.2 3 2.3 2.3 3 2.3 2.3 3 2.5 ...
##  $ fuel    : chr [1:33382] "Regular" "Regular" "Regular" "Regular" ...
##  $ hwy     : num [1:33382] 26 28 26 27 29 26 27 29 26 23 ...
##  $ cty     : num [1:33382] 20 22 18 19 21 17 20 21 17 18 ...
##  $ combined: num [1:33382] 22.7 24.7 21.6 22.6 24.6 ...
##  - attr(*, "na.action")= 'omit' Named int [1:60] 1232 1233 2347 3246 3247 3248 6115 6116 6533 7783 ...
##   ..- attr(*, "names")= chr [1:60] "1232" "1233" "2347" "3246" ...

Check the transmission names and merge labels that refer to the same transmission

##  [1] "Automatic 4-spd"                  "Manual 5-spd"                    
##  [3] "Automatic (S5)"                   "Manual 6-spd"                    
##  [5] "Automatic 5-spd"                  "Auto(AV-S7)"                     
##  [7] "Automatic (S6)"                   "Automatic (S4)"                  
##  [9] "Automatic (S7)"                   "Automatic 6-spd"                 
## [11] "Automatic 3-spd"                  "Manual 4-spd"                    
## [13] "Auto(AM6)"                        "Auto(AM7)"                       
## [15] "Auto(AM-S6)"                      "Automatic (variable gear ratios)"
## [17] "Automatic (AV)"                   "Auto(AV-S8)"                     
## [19] "Automatic (S8)"                   "Automatic (AM6)"                 
## [21] "Auto(AM-S7)"                      "Automatic (A6)"                  
## [23] "Automatic 8-spd"                  "Automatic (AM-S7)"               
## [25] "Auto(AV-S6)"                      "Manual 3-spd"                    
## [27] "Manual 7-spd"                     "Auto (AV)"                       
## [29] "Automatic 9-spd"                  "Automatic 8spd"                  
## [31] "Auto(A1)"                         "Automatic 6spd"                  
## [33] "Auto(L4)"                         "Auto(L3)"                        
## [35] "Automatic (S9)"                   "Auto (AV-S6)"                    
## [37] "Auto (AV-S8)"                     "Automatic (AV-S6)"               
## [39] "Automatic 7-spd"                  "Manual(M7)"                      
## [41] "Auto(A8)"                         "Auto(AM-S8)"                     
## [43] "Auto(AM5)"                        "Automatic (AM5)"                 
## [45] "Automatic (AM-S6)"                "Manual 5 spd"
##  [1] "Auto4-spd"                  "Manual 5-spd"              
##  [3] "Auto(S5)"                   "Manual 6-spd"              
##  [5] "Auto5-spd"                  "Auto(AV-S7)"               
##  [7] "Auto(S6)"                   "Auto(S4)"                  
##  [9] "Auto(S7)"                   "Auto6-spd"                 
## [11] "Auto3-spd"                  "Manual 4-spd"              
## [13] "Auto(AM6)"                  "Auto(AM7)"                 
## [15] "Auto(AM-S6)"                "Auto(variable gear ratios)"
## [17] "Auto(AV)"                   "Auto(AV-S8)"               
## [19] "Auto(S8)"                   "Auto(AM-S7)"               
## [21] "Auto(A6)"                   "Auto8-spd"                 
## [23] "Auto(AV-S6)"                "Manual 3-spd"              
## [25] "Manual 7-spd"               "Auto9-spd"                 
## [27] "Auto8spd"                   "Auto(A1)"                  
## [29] "Auto6spd"                   "Auto(L4)"                  
## [31] "Auto(L3)"                   "Auto(S9)"                  
## [33] "Auto7-spd"                  "Manual(M7)"                
## [35] "Auto(A8)"                   "Auto(AM-S8)"               
## [37] "Auto(AM5)"                  "Manual 5 spd"
##  [1] "Auto 4-spd"                  "Manual 5-spd"               
##  [3] "Auto (S5)"                   "Manual 6-spd"               
##  [5] "Auto 5-spd"                  "Auto (AV-S7)"               
##  [7] "Auto (S6)"                   "Auto (S4)"                  
##  [9] "Auto (S7)"                   "Auto 6-spd"                 
## [11] "Auto 3-spd"                  "Manual 4-spd"               
## [13] "Auto (AM6)"                  "Auto (AM7)"                 
## [15] "Auto (AM-S6)"                "Auto (variable gear ratios)"
## [17] "Auto (AV)"                   "Auto (AV-S8)"               
## [19] "Auto (S8)"                   "Auto (AM-S7)"               
## [21] "Auto (A6)"                   "Auto 8-spd"                 
## [23] "Auto (AV-S6)"                "Manual 3-spd"               
## [25] "Manual 7-spd"                "Auto 9-spd"                 
## [27] "Auto 8spd"                   "Auto (A1)"                  
## [29] "Auto 6spd"                   "Auto (L4)"                  
## [31] "Auto (L3)"                   "Auto (S9)"                  
## [33] "Auto 7-spd"                  "Manual(M7)"                 
## [35] "Auto (A8)"                   "Auto (AM-S8)"               
## [37] "Auto (AM5)"                  "Manual 5 spd"
##  [1] "Auto 4-spd"                  "Manual 5-spd"               
##  [3] "Auto (S5)"                   "Manual 6-spd"               
##  [5] "Auto (AV-S7)"                "Auto (S6)"                  
##  [7] "Auto (S4)"                   "Auto (S7)"                  
##  [9] "Auto (S3)"                   "Manual 4-spd"               
## [11] "Auto (AM6)"                  "Auto (AM7)"                 
## [13] "Auto (AM-S6)"                "Auto (variable gear ratios)"
## [15] "Auto (AV)"                   "Auto (AV-S8)"               
## [17] "Auto (S8)"                   "Auto (AM-S7)"               
## [19] "Auto (A6)"                   "Auto (AV-S6)"               
## [21] "Manual 3-spd"                "Manual 7-spd"               
## [23] "Auto (S9)"                   "Auto (A1)"                  
## [25] "Auto (L4)"                   "Auto (L3)"                  
## [27] "Manual(M7)"                  "Auto (A8)"                  
## [29] "Auto (AM-S8)"                "Auto (AM5)"
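
The first clean-up steps shown in the transmission listings above (collapsing "Automatic" and "Auto" prefixes into one spelling) could be sketched as follows. The helper name and the regular expression are assumptions made for illustration, and the later merging of the speed labels is not reproduced here.

```r
# Hypothetical helper: collapse "Automatic"/"Auto" prefixes to a single "Auto " form
normalise_trans <- function(trans) {
  sub("^Auto(matic)?[[:space:]]*", "Auto ", trans)
}

labels <- c("Automatic 4-spd", "Automatic (S5)", "Auto(AM6)", "Auto (AV)", "Manual 5-spd")
print(normalise_trans(labels))
# "Auto 4-spd" "Auto (S5)" "Auto (AM6)" "Auto (AV)" "Manual 5-spd"
```
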
##   [1] "Acura"                              "Alfa Romeo"                        
##   [3] "AM General"                         "American Motors Corporation"       
##   [5] "ASC Incorporated"                   "Aston Martin"                      
##   [7] "Audi"                               "Aurora Cars Ltd"                   
##   [9] "Autokraft Limited"                  "Bentley"                           
##  [11] "Bertone"                            "Bill Dovell Motor Car Company"     
##  [13] "Bitter Gmbh and Co. Kg"             "BMW"                               
##  [15] "BMW Alpina"                         "Bugatti"                           
##  [17] "Buick"                              "Cadillac"                          
##  [19] "CCC Engineering"                    "Chevrolet"                         
##  [21] "Chrysler"                           "Consulier Industries Inc"          
##  [23] "CX Automotive"                      "Dabryan Coach Builders Inc"        
##  [25] "Dacia"                              "Daewoo"                            
##  [27] "Daihatsu"                           "Dodge"                             
##  [29] "E. P. Dutton, Inc."                 "Eagle"                             
##  [31] "Environmental Rsch and Devp Corp"   "Evans Automobiles"                 
##  [33] "Excalibur Autos"                    "Federal Coach"                     
##  [35] "Ferrari"                            "Fiat"                              
##  [37] "Fisker"                             "Ford"                              
##  [39] "General Motors"                     "Geo"                               
##  [41] "GMC"                                "Goldacre"                          
##  [43] "Grumman Allied Industries"          "Grumman Olson"                     
##  [45] "Honda"                              "Hummer"                            
##  [47] "Hyundai"                            "Import Foreign Auto Sales Inc"     
##  [49] "Import Trade Services"              "Infiniti"                          
##  [51] "Isis Imports Ltd"                   "Isuzu"                             
##  [53] "J.K. Motors"                        "Jaguar"                            
##  [55] "JBA Motorcars, Inc."                "Jeep"                              
##  [57] "Kia"                                "Laforza Automobile Inc"            
##  [59] "Lambda Control Systems"             "Lamborghini"                       
##  [61] "Land Rover"                         "Lexus"                             
##  [63] "Lincoln"                            "London Coach Co Inc"               
##  [65] "London Taxi"                        "Lotus"                             
##  [67] "Mahindra"                           "Maserati"                          
##  [69] "Maybach"                            "Mazda"                             
##  [71] "Mcevoy Motors"                      "McLaren Automotive"                
##  [73] "Mercedes-Benz"                      "Mercury"                           
##  [75] "Merkur"                             "MINI"                              
##  [77] "Mitsubishi"                         "Morgan"                            
##  [79] "Nissan"                             "Oldsmobile"                        
##  [81] "Panos"                              "Panoz Auto-Development"            
##  [83] "Panther Car Company Limited"        "PAS Inc - GMC"                     
##  [85] "PAS, Inc"                           "Peugeot"                           
##  [87] "Pininfarina"                        "Plymouth"                          
##  [89] "Pontiac"                            "Porsche"                           
##  [91] "Quantum Technologies"               "Qvale"                             
##  [93] "Ram"                                "Red Shift Ltd."                    
##  [95] "Renault"                            "Rolls-Royce"                       
##  [97] "Roush Performance"                  "Ruf Automobile Gmbh"               
##  [99] "S and S Coach Company  E.p. Dutton" "Saab"                              
## [101] "Saleen"                             "Saleen Performance"                
## [103] "Saturn"                             "Scion"                             
## [105] "Shelby"                             "smart"                             
## [107] "Spyker"                             "SRT"                               
## [109] "Sterling"                           "Subaru"                            
## [111] "Superior Coaches Div E.p. Dutton"   "Suzuki"                            
## [113] "Tecstar, LP"                        "Texas Coach Company"               
## [115] "Toyota"                             "TVR Engineering Ltd"               
## [117] "Vector"                             "Vixen Motor Company"               
## [119] "Volga Associated Automobile"        "Volkswagen"                        
## [121] "Volvo"                              "VPG"                               
## [123] "Wallace Environmental"              "Yugo"

Manufacturers with more than 500 vehicles

##  [1] "Audi"          "BMW"           "Chevrolet"     "Chrysler"     
##  [5] "Dodge"         "Ford"          "GMC"           "Honda"        
##  [9] "Hyundai"       "Jeep"          "Mazda"         "Mercedes-Benz"
## [13] "Mercury"       "Mitsubishi"    "Nissan"        "Pontiac"      
## [17] "Porsche"       "Subaru"        "Suzuki"        "Toyota"       
## [21] "Volkswagen"    "Volvo"
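
A list like the one above can be obtained by counting records per make and keeping the names above a threshold. A minimal sketch, with a hypothetical helper name and made-up toy counts:

```r
# Hypothetical helper: names of makes with more than `min_count` records
frequent_makes <- function(make, min_count = 500) {
  counts <- table(make)
  sort(names(counts)[as.vector(counts) > min_count])
}

# Toy check with a small threshold (made-up counts)
toy_make <- c(rep("Ford", 5), rep("Audi", 3), rep("Yugo", 1))
print(frequent_makes(toy_make, min_count = 2))   # "Audi" "Ford"
```
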
##  [1] "Subcompact Cars"                    "Compact Cars"                      
##  [3] "Midsize Cars"                       "Sport Utility Vehicle - 4WD"       
##  [5] "Small Sport Utility Vehicle 2WD"    "Small Sport Utility Vehicle 4WD"   
##  [7] "Two Seaters"                        "Sport Utility Vehicle - 2WD"       
##  [9] "Special Purpose Vehicles"           "Special Purpose Vehicle 4WD"       
## [11] "Small Station Wagons"               "Minicompact Cars"                  
## [13] "Special Purpose Vehicle 2WD"        "Midsize-Large Station Wagons"      
## [15] "Midsize Station Wagons"             "Large Cars"                        
## [17] "Standard Sport Utility Vehicle 4WD" "Standard Sport Utility Vehicle 2WD"
## [19] "Minivan - 4WD"                      "Minivan - 2WD"                     
## [21] "Vans"                               "Vans, Cargo Type"                  
## [23] "Vans, Passenger Type"               "Standard Pickup Trucks 2WD"        
## [25] "Standard Pickup Trucks"             "Standard Pickup Trucks/2wd"        
## [27] "Small Pickup Trucks 2WD"            "Standard Pickup Trucks 4WD"        
## [29] "Small Pickup Trucks 4WD"            "Small Pickup Trucks"               
## [31] "Vans Passenger"                     "Special Purpose Vehicle"           
## [33] "Special Purpose Vehicles/2wd"       "Special Purpose Vehicles/4wd"
##  [1] "Subcompact Cars"              "Compact Cars"                
##  [3] "Midsize Cars"                 "SUV - 4WD"                   
##  [5] "Small SUV 2WD"                "Small SUV 4WD"               
##  [7] "Two Seaters"                  "SUV - 2WD"                   
##  [9] "SPVs"                         "SPV 4WD"                     
## [11] "Small Station Wagons"         "Minicompact Cars"            
## [13] "SPV 2WD"                      "Midsize-Large Station Wagons"
## [15] "Midsize Station Wagons"       "Large Cars"                  
## [17] "Standard SUV 4WD"             "Standard SUV 2WD"            
## [19] "Minivan - 4WD"                "Minivan - 2WD"               
## [21] "Vans"                         "Vans, Cargo Type"            
## [23] "Vans Passenger"               "Standard Pickup Trucks 2WD"  
## [25] "Standard Pickup Trucks"       "Small Pickup Trucks 2WD"     
## [27] "Standard Pickup Trucks 4WD"   "Small Pickup Trucks 4WD"     
## [29] "Small Pickup Trucks"          "SPV"                         
## [31] "SPVs/2wd"                     "SPVs/4wd"

Replace values in the ‘drive’ column

## [1] "Front-Wheel Drive"          "4-Wheel or All-Wheel Drive"
## [3] "All-Wheel Drive"            "Rear-Wheel Drive"          
## [5] "2-Wheel Drive"              "4-Wheel Drive"             
## [7] "Part-time 4-Wheel Drive"
## [1] "Front-Wheel Drive" "4-Wheel Drive"     "2-Wheel Drive"
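
One way to collapse the seven drive labels into the three shown above is a named lookup vector. The mapping below is an assumption: the report shows only the labels before and after recoding, not which old label was assigned to which new one.

```r
# ASSUMED mapping: the report shows only the labels before and after recoding,
# so the assignment of old labels to new labels is a guess made for illustration.
drive_map <- c(
  "Front-Wheel Drive"          = "Front-Wheel Drive",
  "Rear-Wheel Drive"           = "2-Wheel Drive",
  "2-Wheel Drive"              = "2-Wheel Drive",
  "4-Wheel or All-Wheel Drive" = "4-Wheel Drive",
  "All-Wheel Drive"            = "4-Wheel Drive",
  "4-Wheel Drive"              = "4-Wheel Drive",
  "Part-time 4-Wheel Drive"    = "4-Wheel Drive"
)

# Look up each label in the map and drop the names from the result
recode_drive <- function(drive) unname(drive_map[drive])

print(unique(recode_drive(names(drive_map))))
```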

Part 4 - Exploratory data analysis

Distribution of Manufacturer by Class

Distribution of Manufacturer by Drive

Plot city and highway MPG by drive and manufacturer

Count the number of cars per manufacturer

Histogram of fuel efficiency

Bar chart of cylinder count

Relationship between MPG and cylinders and displacement

Variables:

The explanatory variables are cyl (number of engine cylinders) and displ (engine displacement in liters).

The response variable is combined, the combined fuel efficiency (MPG).

Check the correlation between the explanatory variables and the response variable

## [1] "Correlation between cyl and fuel efficiency: -0.695079260610909"
## [1] "Correlation between displ and fuel efficiency: -0.746847994847246"
## [1] "Correlation between explanatory variable(s): 0.898882035529359"
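
The correlation of about 0.90 between cyl and displ suggests the two predictors carry overlapping information. One common way to quantify this is the variance inflation factor (VIF), which for a model with exactly two predictors equals 1 / (1 - r^2). The sketch below uses the correlation printed above; the VIF is a diagnostic added here for illustration and is not part of the original analysis.

```r
# Correlation between cyl and displ, copied from the output above.
# On the cleaned data it would come from cor(fuel_data$cyl, fuel_data$displ).
r_predictors <- 0.898882035529359

# With exactly two predictors, each has VIF = 1 / (1 - r^2)
vif <- 1 / (1 - r_predictors^2)
print(round(vif, 2))   # 5.21
```

A VIF of roughly 5 means the variance of each coefficient estimate is about five times what it would be with uncorrelated predictors, so individual coefficients in the joint model should be read with some care.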

Part 5

Type of Study

Study Type: This is an observational study.

Response Variable: Fuel efficiency

Part 6: Inference and analysis

Fit a simple linear regression model for each explanatory variable

## 
## Call:
## lm(formula = formula(paste(y, "~", x)), data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.648  -2.406  -0.406   1.944  28.756 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 31.63233    0.06812   464.3   <2e-16 ***
## cyl         -1.99606    0.01130  -176.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.594 on 33380 degrees of freedom
## Multiple R-squared:  0.4831, Adjusted R-squared:  0.4831 
## F-statistic: 3.12e+04 on 1 and 33380 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = formula(paste(y, "~", x)), data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.5172  -2.0286  -0.3803   1.5976  27.8287 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.31715    0.04841   605.6   <2e-16 ***
## displ       -2.74589    0.01338  -205.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.325 on 33380 degrees of freedom
## Multiple R-squared:  0.5578, Adjusted R-squared:  0.5578 
## F-statistic: 4.21e+04 on 1 and 33380 DF,  p-value: < 2.2e-16
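
The `Call:` lines above show that each single-predictor model was built from variable names passed as strings. A hedged sketch of such a helper follows; the function name `fit_simple` and the stand-in data frame `fuel_demo` (built from `mtcars`, not the EPA data) are assumptions for illustration only.

```r
# Stand-in data: built-in mtcars, NOT the EPA records used in this report.
fuel_demo <- data.frame(
  combined = mtcars$mpg,
  cyl      = mtcars$cyl,
  displ    = mtcars$disp * 0.0163871   # cubic inches converted to liters
)

# Hypothetical helper: fit y ~ x where x and y are column names given as strings.
fit_simple <- function(data, x, y) {
  lm(formula(paste(y, "~", x)), data = data)
}

fit_cyl   <- fit_simple(fuel_demo, "cyl", "combined")
fit_displ <- fit_simple(fuel_demo, "displ", "combined")

# Each summary reports the slope, its standard error, and R-squared.
summary(fit_cyl)
summary(fit_displ)
```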

Linear Regression Model fit

## 
## Call:
## lm(formula = combined ~ cyl + displ, data = fuel_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9670  -2.0638  -0.4054   1.6154  27.8059 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.99689    0.06632  452.32   <2e-16 ***
## cyl         -0.35522    0.02377  -14.94   <2e-16 ***
## displ       -2.33710    0.03044  -76.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.314 on 33379 degrees of freedom
## Multiple R-squared:  0.5607, Adjusted R-squared:  0.5607 
## F-statistic: 2.13e+04 on 2 and 33379 DF,  p-value: < 2.2e-16


## (Intercept)         cyl       displ 
##  29.9968902  -0.3552182  -2.3370956
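
As a sketch of how a fit like the one above is produced, the same formula can be run on stand-in data. The data frame `fuel_demo` below is built from the built-in `mtcars` data set, which is an assumption made so the example runs on its own; the estimates it prints are not the report's.

```r
# Stand-in data: built-in mtcars, NOT the EPA records used in this report.
fuel_demo <- data.frame(
  combined = mtcars$mpg,
  cyl      = mtcars$cyl,
  displ    = mtcars$disp * 0.0163871   # cubic inches converted to liters
)

# Multiple regression with both engine features as predictors
fit_multi <- lm(combined ~ cyl + displ, data = fuel_demo)

summary(fit_multi)   # coefficient table, R-squared, F-statistic
coef(fit_multi)      # named vector: (Intercept), cyl, displ
```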

Interpretation:

The p-value, the adjusted R-squared, and the coefficients summarize the results of a linear regression model. Here is how to interpret them.

P-value: This is the probability of obtaining a test statistic at least as extreme as the one observed in the sample, assuming that the null hypothesis is true. The null hypothesis for each coefficient in a regression model is that it equals zero, meaning there is no linear relationship between that predictor and the response variable once the other predictors are accounted for. A low p-value (conventionally below 0.05) lets us reject the null hypothesis and conclude that the relationship is statistically significant. In our output, the p-values for all predictor variables are below 0.05 (each is reported as <2e-16), so all predictors are statistically significant.

Adjusted R-squared: This is a modified version of R-squared that adjusts for the number of predictors in a regression model. It tells you how well the model explains the variation in the response variable, taking into account the number of predictors. A higher adjusted R-squared indicates a better fit. In our output, the adjusted R-squared is 0.5607, which means that our model explains about 56% of the variation in combined MPG, after accounting for the number of predictors.

Coefficient: This is the estimate of the effect of each predictor variable on the response variable. It tells how much the response variable changes on average when the predictor increases by one unit, holding all other predictors constant. In our output, the coefficient for cyl is -0.3552182, which means that each additional cylinder is associated with a decrease of 0.3552182 MPG in combined fuel efficiency on average, holding displacement constant. The coefficient for displ is -2.3370956, which means that each additional liter of displacement is associated with a decrease of 2.3370956 MPG on average, holding the number of cylinders constant. The intercept is 29.9968902, the predicted combined MPG when both predictors are zero; no real engine has zero cylinders and zero displacement, so it anchors the fitted equation rather than describing an actual car.
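
The fitted equation implied by these coefficients can be evaluated directly. The sketch below hard-codes the three reported estimates; the function name `predict_mpg` and the example engines are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the regression output above
coefs <- c(intercept = 29.9968902, cyl = -0.3552182, displ = -2.3370956)

# Hypothetical helper: predicted combined MPG for a given engine
predict_mpg <- function(cyl, displ) {
  unname(coefs["intercept"] + coefs["cyl"] * cyl + coefs["displ"] * displ)
}

# Example: a 4-cylinder, 2.0-liter engine (illustrative values)
mpg_4cyl_2l <- predict_mpg(cyl = 4, displ = 2.0)
mpg_4cyl_2l   # about 23.9 MPG

# Adding one liter of displacement at the same cylinder count
# lowers the prediction by exactly the displ coefficient.
predict_mpg(4, 3.0) - predict_mpg(4, 2.0)
```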

F-statistic: The F-statistic is 2.13e+04 on 2 and 33379 degrees of freedom. Such a large value, with a p-value below 2.2e-16, suggests that at least one of the predictor variables is related to the response variable.

Analysis of Variance Table

## Analysis of Variance Table
## 
## Response: combined
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## cyl           1 403101  403101 36711.4 < 2.2e-16 ***
## displ         1  64732   64732  5895.3 < 2.2e-16 ***
## Residuals 33379 366511      11                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The analysis of variance table (ANOVA) is a way of summarizing the results of a multiple linear regression model. It shows how the total variation in the response variable (mpg) is partitioned into the variation explained by the model (regression) and the variation that is not explained by the model (residuals). The table also provides some statistics to test the significance of each predictor variable and the overall model fit.

Here is a brief explanation of each column and row in the table,

Df: This stands for degrees of freedom, which is the number of independent pieces of information used to estimate a parameter or a statistic. For example, the Df for residuals is 33379, which means that 33379 independent pieces of information are used to estimate the error variance. The Df for each predictor variable is 1, which means that each variable uses one degree of freedom to estimate its effect on the response variable.

Sum Sq: This stands for sum of squares, which is a measure of variation or deviation from the mean. For example, the Sum Sq for residuals is 366511, which means that the sum of the squared differences between the observed mpg values and the predicted mpg values is 366511. The sums of squares for the predictors in this table are sequential, so the order of the terms matters: the Sum Sq for cyl (403101) is the variation in mpg explained by cyl alone, and the Sum Sq for displ (64732) is the additional variation explained by displ once cyl is already in the model.

Mean Sq: This stands for mean square, which is obtained by dividing the sum of squares by the degrees of freedom. For example, the Mean Sq for residuals is 11 (rounded), which means that the average squared difference between the observed mpg values and the predicted mpg values is about 11. The Mean Sq for each predictor variable is the variation in mpg attributed to that variable per degree of freedom.

F value: This is a statistic that compares the mean square for each predictor variable to the mean square for residuals. It measures how much more variation in mpg is explained by that variable than by random error. For example, the F value for cyl is 36711.4, which means that the mean square for cyl is 36711.4 times larger than the mean square for residuals. A large F value indicates a strong relationship between the predictor variable and the response variable.

Pr(>F): This is the p-value associated with the F value, which is the probability of obtaining an F value as large as or larger than the observed one, assuming that there is no relationship between the predictor variable and the response variable. A small p-value indicates strong evidence to reject this assumption and conclude that there is a significant relationship between the predictor variable and the response variable. For example, the Pr(>F) for cyl is less than 2.2e-16, which means that there is almost zero chance of getting an F value as large as 36711.4 if cyl had no relationship with mpg.
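
The arithmetic behind the table can be checked by hand. The sketch below uses only the numbers printed in the ANOVA table; because the printed sums of squares are rounded, the results match the table only approximately. All variable names are illustrative.

```r
# Values copied from the ANOVA table above (Sum Sq is rounded in the printout)
ss_cyl   <- 403101
ss_displ <- 64732
ss_resid <- 366511
df_resid <- 33379

# Mean square = sum of squares / degrees of freedom
ms_resid <- ss_resid / df_resid        # about 10.98, printed as 11

# F value = predictor mean square / residual mean square (each predictor has Df = 1)
f_cyl   <- (ss_cyl / 1) / ms_resid     # close to the reported 36711.4
f_displ <- (ss_displ / 1) / ms_resid   # close to the reported 5895.3

# R-squared = explained variation / total variation
ss_total  <- ss_cyl + ss_displ + ss_resid
r_squared <- (ss_cyl + ss_displ) / ss_total   # close to the reported 0.5607

# Overall F-statistic of the model with 2 predictors
f_overall <- ((ss_cyl + ss_displ) / 2) / ms_resid   # close to the reported 2.13e+04

round(c(ms_resid = ms_resid, f_cyl = f_cyl, f_displ = f_displ,
        r_squared = r_squared, f_overall = f_overall), 4)
```

This also ties the ANOVA table back to the regression summary: the two predictor sums of squares together account for the R-squared of about 0.56 reported earlier.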

Part 7 - Conclusion

Both regression coefficients have p-values below 0.05 and are statistically significant. The conditions for inference were also met: linearity, normally distributed residuals (after outlier removal), and randomly scattered residuals. Because a low p-value (conventionally below 0.05) lets us reject the null hypothesis (H0) in favor of the alternative hypothesis (H1), we conclude that there is a significant relationship between the engine features of a car (cylinders and displacement) and its fuel efficiency (combined MPG).

Part 8 - Principal component analysis

Encode Categorical Variable: Convert the categorical variable drive into numerical values. One common method is to use one-hot encoding or dummy encoding.

Merge with Numerical Data:

Standardize the Data

##          cyl      displ drive2-Wheel Drive drive4-Wheel Drive
## 1 -1.0177222 -0.8475746         -0.7720502         -0.5940174
## 2 -1.0177222 -0.8475746         -0.7720502         -0.5940174
## 3  0.1310825 -0.2592479         -0.7720502         -0.5940174
## 4 -1.0177222 -0.7740338         -0.7720502         -0.5940174
## 5 -1.0177222 -0.7740338         -0.7720502         -0.5940174
## 6  0.1310825 -0.2592479         -0.7720502         -0.5940174
##   driveFront-Wheel Drive
## 1                1.31696
## 2                1.31696
## 3                1.31696
## 4                1.31696
## 5                1.31696
## 6                1.31696

Perform PCA

## Importance of components:
##                           PC1    PC2    PC3     PC4       PC5
## Standard deviation     1.6409 1.1960 0.8821 0.31433 1.734e-13
## Proportion of Variance 0.5385 0.2861 0.1556 0.01976 0.000e+00
## Cumulative Proportion  0.5385 0.8246 0.9802 1.00000 1.000e+00
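
The encode, merge, standardize, and PCA steps above can be sketched end to end as follows. This is an illustration under stated assumptions rather than the report's own code: the data frame `toy` and its values are invented, and `model.matrix()` is assumed as the encoding method because it yields column names of the form `driveFront-Wheel Drive`, as in the output above.

```r
# Hypothetical toy data; column names mirror the report, values are invented.
toy <- data.frame(
  cyl   = c(4, 4, 6, 4, 6, 8, 8, 6),
  displ = c(1.8, 2.0, 3.0, 2.4, 3.5, 5.0, 5.7, 3.6),
  drive = factor(c("Front-Wheel Drive", "Front-Wheel Drive", "4-Wheel Drive",
                   "2-Wheel Drive", "Front-Wheel Drive", "4-Wheel Drive",
                   "2-Wheel Drive", "2-Wheel Drive"))
)

# One-hot encode drive: "- 1" drops the intercept so every level gets a column.
drive_dummies <- model.matrix(~ drive - 1, data = toy)

# Merge the numeric columns with the indicator columns.
features <- cbind(toy[, c("cyl", "displ")], drive_dummies)

# Standardize each column to mean 0 and standard deviation 1.
features_scaled <- scale(features)

# Run PCA on the standardized matrix and print the importance table.
pca <- prcomp(features_scaled)
summary(pca)
```

With a full set of indicator columns, the last component has a standard deviation of essentially zero, because the indicators always sum to one and one of them is therefore redundant.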

Principal Component Analysis (PCA) result interpretation:

The table above summarizes the results of a Principal Component Analysis (PCA). Here is how to interpret each row:

Standard Deviation: This row shows the standard deviation (or square root of the eigenvalues) of each principal component (PC). It indicates the spread or variability of the data along each principal component axis.

PC1: Standard deviation is 1.6409
PC2: Standard deviation is 1.1960
PC3: Standard deviation is 0.8821
PC4: Standard deviation is 0.31433
PC5: Standard deviation is approximately 0 (1.734e-13)

The larger the standard deviation, the more variability the principal component captures in the original data.

Proportion of Variance: This row shows the proportion of total variance explained by each principal component. It indicates the relative importance of each PC in explaining the variability of the data.

PC1: Explains 53.85% of the total variance
PC2: Explains 28.61% of the total variance
PC3: Explains 15.56% of the total variance
PC4: Explains 1.976% of the total variance
PC5: Explains 0% of the total variance (very small, practically zero)

Principal components are ordered by the amount of variance they explain, with PC1 explaining the most variance and subsequent components explaining decreasing amounts of variance.

Cumulative Proportion: This row shows the cumulative proportion of variance explained up to each principal component. It helps assess how much of the total variance is explained as more components are included.

PC1: Cumulative proportion is 53.85%
PC2: Cumulative proportion is 82.46%
PC3: Cumulative proportion is 98.02%
PC4: Cumulative proportion is 100% (explains the remaining variance)
PC5: Cumulative proportion is 100%

PC1 to PC3 together explain 98.02% of the total variance, suggesting that these components capture the majority of the variability in the dataset. PC4 and PC5 explain very little additional variance.
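
The three rows of the importance table are linked by simple arithmetic: each proportion is a squared standard deviation divided by the sum of all squared standard deviations. The sketch below recomputes the lower two rows from the reported standard deviations; the variable names are illustrative.

```r
# Standard deviations copied from the PCA output above
sdev <- c(PC1 = 1.6409, PC2 = 1.1960, PC3 = 0.8821, PC4 = 0.31433, PC5 = 1.734e-13)

# Variance of each component is its squared standard deviation (the eigenvalue)
variances <- sdev^2

# With 5 standardized variables the variances sum to roughly 5
total_variance <- sum(variances)

# Proportion of variance and its running total
prop_var <- variances / total_variance
cum_prop <- cumsum(prop_var)

round(rbind(prop_var, cum_prop), 4)
```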

Interpretation:

PC1: This component (with the highest standard deviation and explaining 53.85% of the variance) captures the most significant patterns or relationships in the data. It represents the direction in the data that has the highest variability.

PC2 and PC3: These components also capture substantial variance and provide additional insights into the data structure beyond PC1. Together with PC1, they explain 98.02% of the variance, indicating they collectively provide a comprehensive view of the dataset’s variability.

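PC4 and PC5: These components contribute very little to explaining the variance (less than 2% cumulatively), suggesting they are far less informative in describing the dataset than the first three components. PC5 in particular has essentially zero variance because the three drive indicator columns always sum to one, so one of them is redundant and the five input columns span only four dimensions.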

In summary, PCA has reduced the original dataset into a smaller set of uncorrelated components (PCs) that explain the variability in the data. The results show that the first few components (PC1 to PC3) are the most important for understanding the data structure and patterns, while the remaining components contribute minimally.

Part 9 Reference

https://www.fueleconomy.gov/feg/pdfs/guides/FEG2023.pdf