Loading packages

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading the dataset

Cars_2 = read_csv("Cars_02.csv")

## Rows: 203 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): car_name, fuel_type, transmission_type, body_type
## dbl (11): reviews_count, engine_displacement, no_cylinder, seating_capacity,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Introduction

The dataset CARS_2 includes information about cars that are available for sale in India. It has details such as price, fuel type, engine size, seating capacity, and horsepower. The goal of this project is to explore the data and understand what factors affect the price of a car. By using graphs and basic data analysis, this project will look for patterns and relationships between different car features and their prices.

Load and explore the dataset assigned to you for your project to answer the following questions. You can use the glimpse function from the tidyverse package to peek into the dataset and find answers to these questions.

glimpse(Cars_2)

## Rows: 203
## Columns: 15
## $ car_name            <chr> "Maruti Alto K10", "Maruti Brezza", "Mahindra Thar…
## $ reviews_count       <dbl> 51, 86, 242, 313, 107, 99, 731, 381, 107, 205, 568…
## $ fuel_type           <chr> "Petrol", "Petrol", "Diesel", "Diesel", "Diesel", …
## $ engine_displacement <dbl> 998, 1462, 2184, 2198, 2198, 2755, 1493, 1199, 149…
## $ no_cylinder         <dbl> 3, 4, 4, 4, 4, 4, 4, 3, 3, 4, 4, 3, 4, 4, 4, 4, 4,…
## $ seating_capacity    <dbl> 5, 5, 4, 7, 7, 7, 5, 5, 7, 5, 5, 5, 5, 5, 5, 5, 8,…
## $ transmission_type   <chr> "Automatic", "Automatic", "Automatic", "Automatic"…
## $ fuel_tank_capacity  <dbl> 27, 48, 57, 60, 57, 80, 50, 37, 60, 37, 44, 45, 50…
## $ body_type           <chr> "Hatchback", "SUV", "SUV", "SUV", "SUV", "SUV", "S…
## $ rating              <dbl> 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.0, …
## $ max_torque_nm       <dbl> 89.00, 136.80, 300.00, 450.00, 400.00, 500.00, 250…
## $ max_torque_rpm      <dbl> 3500, 4400, 2800, 2800, 2750, 2800, 2750, 3400, 22…
## $ max_power_bhp       <dbl> 65.71, 101.65, 130.00, 182.38, 172.45, 201.15, 113…
## $ max_power_rp        <dbl> 5500, 6000, 3750, 3500, 3500, 3400, 4000, 6000, 36…
## $ price               <dbl> 6383.0, 14267.5, 19214.0, 24544.0, 23328.5, 53280.…

How many cases (instances/rows) are in your dataset?

There are 203 rows (instances) in my dataset.

How many variables (attributes/columns) are in your dataset?

There are 15 columns (attributes) in my dataset.

Identify the main response (also known as dependent) variable in the data.

The price of a car is the main response (dependent) variable.

Does your dataset contain missing values? Which variables contain missing values?

anyNA(Cars_2)

## [1] TRUE

table(is.na(Cars_2)) ## counts missing (TRUE) and non-missing (FALSE) cells

## 
## FALSE  TRUE 
##  3044     1

colSums(is.na(Cars_2)) ## number of missing values in each variable

##            car_name       reviews_count           fuel_type engine_displacement 
##                   0                   0                   0                   0 
##         no_cylinder    seating_capacity   transmission_type  fuel_tank_capacity 
##                   0                   1                   0                   0 
##           body_type              rating       max_torque_nm      max_torque_rpm 
##                   0                   0                   0                   0 
##       max_power_bhp        max_power_rp               price 
##                   0                   0                   0

Yes, my dataset contains a missing value. There is 1 missing value for seating_capacity.
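As a minimal sketch of how a single missing value like this can be located and kept out of summaries, the example below uses a small hypothetical data frame (`toy`, with made-up values, not rows from Cars_02.csv). The column name `seating_capacity` mirrors the dataset; everything else is an assumption for illustration.

```r
# Hypothetical mini-dataset with one missing seating_capacity (illustration only)
toy <- data.frame(
  car_name = c("A", "B", "C", "D"),
  seating_capacity = c(5, NA, 7, 5),
  price = c(10, 20, 30, 40)
)

# Row index of the missing value
missing_rows <- which(is.na(toy$seating_capacity))

# Inspect the incomplete row
toy[missing_rows, ]

# Summaries of this column need na.rm = TRUE, otherwise they return NA
mean_seats <- mean(toy$seating_capacity, na.rm = TRUE)
mean_seats
```

Because the missing value is in seating_capacity only, summaries of price are unaffected by it.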

Research Question and Hypotheses

Research Question

Identify one research question that you plan to answer using your dataset.

Do automatic transmission cars have higher prices than manual transmission cars?

Write your research question and the corresponding hypothesis to be tested.

Hypotheses

H0: Mean Price (Transmission = Automatic) = Mean Price (Transmission = Manual)

HA: Mean Price (Transmission = Automatic) > Mean Price (Transmission = Manual)

Identify and list the relevant variables that will be used in the analysis.

price: numerical (the response variable)
transmission_type: categorical (Automatic or Manual)
engine_displacement: numerical
max_power_bhp: numerical (maximum horsepower)

Exploratory Data Analysis

What type of variable is your main response variable? (Categorical/Numerical)

The main response variable price is numerical.

Create 2 different histograms of your response variable with 2 different bin widths.

Cars_2 %>%
  ggplot(aes(x = price))+
  geom_histogram(color= "pink",fill="pink", bins=25)

Cars_2 %>%
  ggplot(aes(x = price))+
  geom_histogram(color= "pink",fill="pink", bins=30)

Create a density plot of your response variable.

Cars_2 %>%
  ggplot(aes(x = price))+
  geom_density(color = "pink", fill = "pink")

Comment on the shape, modality and potential outliers.

The distribution is strongly right-skewed and unimodal. The long right tail suggests that a few very expensive cars are potential outliers.

What measures should you use to describe the center and the spread of your response variable?

Because the distribution is strongly right-skewed with potential outliers, the median is the appropriate measure of center and the interquartile range (IQR) is the appropriate measure of spread for car prices.
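To illustrate why resistant measures suit a skewed variable, here is a small sketch using a hypothetical right-skewed vector (`toy_price`, made-up values rather than the actual prices). It compares the median and IQR with the mean and standard deviation.

```r
# Hypothetical right-skewed prices, for illustration only (not from Cars_02.csv)
toy_price <- c(6, 8, 9, 11, 12, 14, 15, 19, 24, 53, 210)

centre <- median(toy_price)   # resistant measure of center
spread <- IQR(toy_price)      # Q3 - Q1, resistant measure of spread

# The mean and SD are pulled upward by the two large values
c(median = centre, IQR = spread, mean = mean(toy_price), sd = sd(toy_price))
```

In this toy vector the mean lands well above the median, which is the same pattern a long right tail would produce in the price data.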

Graphing some variables in your dataset.

Create at least 3 graphs displaying the distributions of at least three different variables in your data set NOT including your response variable that relate to research question(s) you might have for your dataset. For example, if the variable is categorical, report a bar chart, while for quantitative variables, report histogram, dotplot or boxplot. Note: You need to make a graph for each variable. You shouldn’t use the same variable for different graphs.

Cars_2 %>%
  ggplot(aes(x = transmission_type)) +
  geom_bar(color = "pink", fill = "pink")

Cars_2 %>%
  ggplot(aes(x = engine_displacement)) +
  geom_histogram(color = "pink", fill = "pink")

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Cars_2 %>%
  ggplot(aes(x = max_power_bhp)) +
  geom_boxplot(color = "pink", fill = "pink")

Write a few sentences about what you have learned from your graphs.

From the plots above, the distribution of engine displacement appears bimodal, while maximum horsepower is strongly right-skewed. Automatic transmission is the most common type in the dataset, clearly outnumbering manual transmission vehicles.

Create at least 2 graphs that include at least 2 variables in each, one of which is the response variable identified in 1c.

Cars_2 %>%
  ggplot(aes(x = engine_displacement, y = price)) +
  geom_point(color = "pink")

Cars_2 %>%
  ggplot(aes(x = transmission_type, y = price, fill = transmission_type)) +
  geom_boxplot() +
  labs(title = "Car Prices by Transmission Type")

Write a few sentences about what you have learned from your graphs.

From the graphs, automatic transmission cars appear to have a noticeably higher median price than manual transmission cars, which is consistent with my research hypothesis. Additionally, there appears to be a positive relationship between engine size and price: cars with larger engine displacements tend to be more expensive.

DAP Part 2

Summary of the Response Variable

Compute and report descriptive statistics (mean, standard deviation, and five-number summary) for your response variable identified in Part I. Create appropriate visualizations (histogram or boxplot) and describe the shape of the distribution. Discuss what these summaries reveal about your response variable.

Cars_2 %>%
  summarise(Min = min(price),
            Q1 = quantile(price, 0.25),
            Median = median(price),  # completes the five-number summary
            Mean = mean(price),
            Q3 = quantile(price, 0.75),
            Max = max(price),
            SD = sd(price))

Cars_2 %>%
  ggplot(aes(x = price))+
  geom_histogram(col = "pink",fill = "black",bins = 30)

The mean and the histogram both indicate that price is right-skewed. The mean is $133,664, which is well above what most cars cost: a few very expensive cars form a long right tail and pull the average upward, while most cars are priced lower.

Confidence Interval for the Population Mean

Construct and interpret a 95% confidence interval for the population mean of your response variable. Assume your sample size is large enough for the Central Limit Theorem to apply, even if the data are not perfectly normal.

Report the sample mean, standard deviation, and sample size (n) for your response variable.

x_bar=mean(Cars_2$price)
S= sd(Cars_2$price)
n= 203

The sample mean is 133664.014, the standard deviation is 187915.353, and the sample size is 203.

Compute the standard error (SE) and the margin of error (ME) manually using R arithmetic.

Recall: $SE = \frac{s}{\sqrt{n}}, \quad ME = z^* \times SE, \quad \text{where} \quad z^* = 1.96$ for a 95% confidence level.

s <- sd(Cars_2$price, na.rm = TRUE)
x_bar <- mean(Cars_2$price, na.rm = TRUE)
n <- nrow(Cars_2)

z = 1.96

se = s / sqrt(n)

me = z * se

x_bar - me

## [1] 107813.4

x_bar + me

## [1] 159514.6

Report and interpret your confidence interval in the context of your dataset (e.g., “We are 95% confident that the true mean house price is captured between . . .”).

We are 95% confident that the true mean car price is between $107,813.40 and $159,514.60.
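As a cross-check on the arithmetic, the sketch below recomputes the interval from the summary statistics reported in this document (sample mean, standard deviation, and n) rather than from the data file. The t-based interval is an alternative that the original analysis did not use; it is shown only for comparison, on the assumption that a t critical value is acceptable here.

```r
# Summary statistics as reported in this document
x_bar <- 133664.014
s     <- 187915.353
n     <- 203

se <- s / sqrt(n)                        # standard error of the mean

# z-based interval with critical value 1.96, as in the text
ci_z <- x_bar + c(-1, 1) * 1.96 * se

# t-based interval (alternative, not part of the original analysis)
t_star <- qt(0.975, df = n - 1)
ci_t   <- x_bar + c(-1, 1) * t_star * se

round(rbind(z = ci_z, t = ci_t), 1)
```

With n = 203 the t critical value is only slightly larger than 1.96, so the two intervals differ very little.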

Comparing Groups

Select one categorical variable (e.g., gender, region, smoker/nonsmoker, etc.) and summarize your response variable by the groups defined by this variable.

Report and compare the group means and standard deviations.
Create side-by-side boxplots or bar charts to visualize group differences.
Comment on any patterns or differences between the groups.

Cars_2 %>%
  group_by(transmission_type) %>%
  summarise(Mean = mean(price),
            SD= sd(price)) %>% arrange(desc(Mean), desc(SD))

Cars_2 %>%
  ggplot(aes(x=transmission_type,y=price, fill = transmission_type))+
  geom_boxplot(show.legend = F)

From the summary and boxplot, automatic transmission cars have a higher average price than manual transmission cars. The spread of prices for automatic cars is also larger, with some very expensive models appearing as outliers. Manual cars tend to be less expensive and have a smaller range of prices. This pattern is consistent with the hypothesis that automatic cars are generally more expensive than manual cars, although it is a descriptive comparison rather than a formal test.
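The one-sided hypothesis (HA: mean price of automatic cars is greater than that of manual cars) could be tested formally with a Welch two-sample t-test. The sketch below is not part of the original analysis and uses a small hypothetical data frame (`toy`, made-up prices) so that it runs on its own; with the real data, `toy` would be replaced by `Cars_2`.

```r
# Hypothetical data for illustration only (not from Cars_02.csv)
toy <- data.frame(
  transmission_type = rep(c("Automatic", "Manual"), each = 6),
  price = c(30, 45, 52, 61, 80, 120, 8, 10, 12, 15, 18, 25)
)

# Groups are ordered alphabetically, so "Automatic" is the first group and
# alternative = "greater" tests mean(Automatic) - mean(Manual) > 0.
# t.test() uses the Welch (unequal variance) version by default.
res <- t.test(price ~ transmission_type, data = toy, alternative = "greater")

res$estimate   # group means
res$p.value    # one-sided p-value
```

A small p-value would provide evidence for HA. Because the price distribution is strongly right-skewed, the result relies on the sample being large enough for the Central Limit Theorem to apply.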

Exploring Relationships Using Correlation and Regression

Use correlation and regression to describe and model the association between your response variable and two numerical explanatory variables.

Create two scatterplots displaying the relationship between the response variable and two different numerical explanatory variables that you think might be related.

Cars_2 %>%
  ggplot(aes(x = engine_displacement, y = price)) +
  geom_point(color = "pink") +
  geom_smooth(method = lm, se = FALSE)

## `geom_smooth()` using formula = 'y ~ x'

Cars_2 %>%
  ggplot(aes(x = max_power_bhp, y = price)) +
  geom_point(color = "pink") +
  geom_smooth(method = lm, se = FALSE)

## `geom_smooth()` using formula = 'y ~ x'

Describe the direction, form, and strength of the relationship in each plot.

Both scatterplots show a positive association. Price versus engine displacement shows a moderate positive association (r ≈ 0.63), indicating that cars with larger engines tend to cost more. Price versus maximum horsepower shows a slightly stronger positive association (r ≈ 0.68), so horsepower tracks price at least as consistently as engine displacement does.

Fit two simple linear regression models (one for each explanatory variable) using the lm() function.

model1 = lm(price ~ engine_displacement, data= Cars_2)
summary(model1)

## 
## Call:
## lm(formula = price ~ engine_displacement, data = Cars_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -312331  -58937  -36368   23097 1005372 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -49871.523  18864.050  -2.644  0.00885 ** 
## engine_displacement     79.593      6.871  11.584  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 145900 on 201 degrees of freedom
## Multiple R-squared:  0.4003, Adjusted R-squared:  0.3974 
## F-statistic: 134.2 on 1 and 201 DF,  p-value: < 2.2e-16

model2 = lm(price ~ max_power_bhp, data= Cars_2)
summary(model2 )

## 
## Call:
## lm(formula = price ~ max_power_bhp, data = Cars_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -457511  -44815  -17083    6396  738100 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -41159.56   16630.77  -2.475   0.0142 *  
## max_power_bhp    655.81      50.53  12.978   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 139000 on 201 degrees of freedom
## Multiple R-squared:  0.4559, Adjusted R-squared:  0.4532 
## F-statistic: 168.4 on 1 and 201 DF,  p-value: < 2.2e-16

Report the correlation coefficient (r) and coefficient of determination (R2) for each model.

r1 =cor(Cars_2$price, Cars_2$engine_displacement)
R_square1= r1^2 ## coefficient of determination
r2= cor(Cars_2$price, Cars_2$max_power_bhp)
R_square2= r2^2 ## coefficient of determination 
c(r1,r2)

## [1] 0.6327278 0.6752110

c(R_square1, R_square2)

## [1] 0.4003444 0.4559099

About 40.0% of the variation in car prices (the response variable) can be explained by engine displacement (the explanatory variable). This indicates a moderately strong relationship, meaning engine size is an important factor in predicting car prices.

About 45.6% of the variation in car prices can be explained by maximum horsepower. This indicates a moderately strong relationship, slightly stronger than engine displacement, suggesting that horsepower also plays an important role in predicting car prices.

Write out each model equation and explain which variable produces the better model and why.

\[ \widehat{Price}_i = \hat\beta_0 + \hat\beta_1 \times X_i = -49871.523 + 79.593 \times \text{engine displacement}_i \]

\[ \widehat{Price}_i = \hat\beta_0 + \hat\beta_1 \times X_i = -41159.56 + 655.81 \times \text{max power bhp}_i \]

Comparing the two models, the one with maximum horsepower as the explanatory variable is slightly better. The $R^2$ for horsepower is 0.456, which means it explains about 45.6% of the variation in car prices, while engine displacement explains only about 40.0%. Both engine size and horsepower are associated with car prices, but horsepower does a little better at predicting price in this dataset.

For every 1 bhp increase in maximum horsepower, the predicted price increases by about $655.81 on average.

Using the better model, make a prediction for your response variable based on a specific explanatory value.

Using the better model based on maximum horsepower, we can make predictions for car prices. The regression equation is price = -41,159.56 + 655.81 × max_power_bhp.

\[ \widehat{Price}_i = \hat\beta_0 + \hat\beta_1 \times X_i = -41159.56 + 655.81 \times \text{max power bhp}_i \]

X = 105 bhp

x = 105
intercept = -41159.56
slope = 655.81
y_hat = intercept + slope * x
y_hat

## [1] 27700.49

This means that a car with 105 bhp is predicted to cost approximately $27,700 according to our model.

Combine your results from Part I and Part II into a single, cohesive story-telling report that summarizes your findings. Your final report should:
• Restate your research questions.
• Summarize your descriptive, inferential, and regression findings.
• Include figures and tables as supporting evidence.
• Provide a clear and meaningful conclusion connecting your analysis to your research questions.

Research Question: Do automatic transmission cars have higher prices than manual transmission cars?

Summary of Findings:

Descriptive: Car prices are strongly right-skewed. Automatic cars are more expensive on average than manual cars.

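Inferential: The 95% confidence interval for the overall mean car price is $107,813.40 to $159,514.60. No formal test comparing the two transmission groups was run, so the higher mean price of automatic cars is a descriptive finding rather than a confirmed difference.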

Regression: Both engine displacement and maximum horsepower positively predict price. Maximum horsepower provides the better model, explaining about 45.6% of the variation in prices. For every 1 bhp increase, price increases by roughly $655.81. Using this model, a car with 105 bhp is predicted to cost approximately $27,700.

Conclusion: In this dataset, automatic cars are generally more expensive than manual cars, and performance features such as engine size and horsepower are positively associated with price. Of the two predictors modeled, maximum horsepower is the stronger, explaining about 45.6% of the variation in price. These results suggest that both transmission type and car performance are related to car value, although the transmission comparison is descriptive and was not formally tested.
