library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Cars_2 = read_csv("Cars_02.csv")
## Rows: 203 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): car_name, fuel_type, transmission_type, body_type
## dbl (11): reviews_count, engine_displacement, no_cylinder, seating_capacity,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The dataset Cars_2 includes information about cars that are available for sale in India. It has details such as price, fuel type, engine displacement, seating capacity, and maximum power (bhp). The goal of this project is to explore the data and understand which factors are associated with the price of a car. Using graphs and basic data analysis, this project looks for patterns and relationships between different car features and their prices.
glimpse(Cars_2)
## Rows: 203
## Columns: 15
## $ car_name <chr> "Maruti Alto K10", "Maruti Brezza", "Mahindra Thar…
## $ reviews_count <dbl> 51, 86, 242, 313, 107, 99, 731, 381, 107, 205, 568…
## $ fuel_type <chr> "Petrol", "Petrol", "Diesel", "Diesel", "Diesel", …
## $ engine_displacement <dbl> 998, 1462, 2184, 2198, 2198, 2755, 1493, 1199, 149…
## $ no_cylinder <dbl> 3, 4, 4, 4, 4, 4, 4, 3, 3, 4, 4, 3, 4, 4, 4, 4, 4,…
## $ seating_capacity <dbl> 5, 5, 4, 7, 7, 7, 5, 5, 7, 5, 5, 5, 5, 5, 5, 5, 8,…
## $ transmission_type <chr> "Automatic", "Automatic", "Automatic", "Automatic"…
## $ fuel_tank_capacity <dbl> 27, 48, 57, 60, 57, 80, 50, 37, 60, 37, 44, 45, 50…
## $ body_type <chr> "Hatchback", "SUV", "SUV", "SUV", "SUV", "SUV", "S…
## $ rating <dbl> 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.0, …
## $ max_torque_nm <dbl> 89.00, 136.80, 300.00, 450.00, 400.00, 500.00, 250…
## $ max_torque_rpm <dbl> 3500, 4400, 2800, 2800, 2750, 2800, 2750, 3400, 22…
## $ max_power_bhp <dbl> 65.71, 101.65, 130.00, 182.38, 172.45, 201.15, 113…
## $ max_power_rp <dbl> 5500, 6000, 3750, 3500, 3500, 3400, 4000, 6000, 36…
## $ price <dbl> 6383.0, 14267.5, 19214.0, 24544.0, 23328.5, 53280.…
There are 203 rows (instances) in my dataset.
There are 15 columns (attributes) in my dataset.
d. Does your dataset contain missing values? Which variables contain missing values?
anyNA(Cars_2)
## [1] TRUE
table(is.na(Cars_2)) ## count non-missing (FALSE) and missing (TRUE) cells
##
## FALSE TRUE
## 3044 1
colSums(is.na(Cars_2)) ## number of missing values in each variable
## car_name reviews_count fuel_type engine_displacement
## 0 0 0 0
## no_cylinder seating_capacity transmission_type fuel_tank_capacity
## 0 1 0 0
## body_type rating max_torque_nm max_torque_rpm
## 0 0 0 0
## max_power_bhp max_power_rp price
## 0 0 0
Yes, my dataset contains a missing value. There is 1 missing value for seating_capacity.
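To answer the second half of the question directly, the per-column counts can be filtered down to only the variables that have missing values. The sketch below is one way to do that; the helper name `columns_with_na` is hypothetical and is not part of the original analysis.

```r
# Hypothetical helper (an assumption, not from the original report):
# return the number of missing values for each column that has any.
columns_with_na <- function(data) {
  na_counts <- colSums(is.na(data))  # missing values per column
  na_counts[na_counts > 0]           # keep only columns with at least one NA
}
```

Based on the counts shown above, calling `columns_with_na(Cars_2)` should return only seating_capacity with a count of 1.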
Research question: Do automatic transmission cars have higher prices than manual transmission cars?
Hypotheses:
H0: Mean Price (Transmission = Automatic) = Mean Price (Transmission = Manual)
HA: Mean Price (Transmission = Automatic) > Mean Price (Transmission = Manual)
Variables:
Response: price (numeric, in dollars)
Explanatory: transmission_type (categorical: Automatic or Manual), engine_displacement, max_power_bhp
Cars_2 %>%
ggplot(aes(x = price))+
geom_histogram(color= "pink",fill="pink", bins=25)
Cars_2 %>%
ggplot(aes(x = price))+
geom_histogram(color= "pink",fill="pink", bins=30)
Cars_2 %>%
  ggplot(aes(x = price)) +
  geom_density(color = "pink", fill = "pink")
Comment on the shape, modality and potential outliers. The distribution is unimodal and strongly right-skewed. The long right tail suggests that a few very expensive cars are potential outliers.
What measures should you use to describe the center and the spread of your response variable? Due to the strongly right-skewed distribution with outliers, I should use the median to describe the center and the interquartile range (IQR) to describe the spread of the car prices.
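The median and IQR named above can be computed with base R. This is a minimal sketch; the helper name `robust_summary` is hypothetical and does not appear in the original analysis.

```r
# Hypothetical helper (an assumption, not from the original report):
# median and interquartile range, ignoring missing values.
robust_summary <- function(x) {
  c(median = median(x, na.rm = TRUE),  # resistant measure of center
    iqr    = IQR(x, na.rm = TRUE))     # resistant measure of spread (Q3 - Q1)
}
```

For this report the call would be `robust_summary(Cars_2$price)`. Both measures are resistant to the few very high prices in the right tail, which is why they suit a skewed variable better than the mean and standard deviation.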
Cars_2 %>%
  ggplot(aes(x = transmission_type)) +
  geom_bar(color = "pink", fill = "pink")
Cars_2 %>%
  ggplot(aes(x = engine_displacement)) +
  geom_histogram(color = "pink", fill = "pink")
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Cars_2 %>%
  ggplot(aes(x = max_power_bhp)) +
  geom_boxplot(color = "pink", fill = "pink")
Create at least 2 graphs that include at least 2 variables in each, one of which is the response variable identified in 1c.
Cars_2 %>%
  ggplot(aes(x = engine_displacement, y = price)) +
  geom_point(color = "pink")
Cars_2 %>%
ggplot(aes(x = transmission_type, y = price,fill = transmission_type))+
geom_boxplot()+labs(title = "Car Prices by Transmission Type")
From the graphs, automatic transmission cars appear to have a noticeably higher median price than manual transmission cars, which is consistent with my research hypothesis. Additionally, there appears to be a positive relationship between engine size and price: cars with larger engine displacements tend to be more expensive.
#DAP Part 2
Cars_2 %>%
summarise(Min = min(price),
Q1= quantile(price, 0.25),
Mean = mean(price),
Q3 = quantile(price, 0.75),
Max = max(price),
SD = sd(price))
Cars_2 %>%
ggplot(aes(x = price))+
geom_histogram(col = "pink",fill = "black",bins = 30)
Looking at the mean and the histogram, the variable “price” is
right-skewed. The mean is $133,664, but since it is much higher than
what most cars cost, the long tail on the right side of the histogram
pulls the average upward. This happens because a few very expensive cars
make the mean larger, while most cars are priced lower.
x_bar=mean(Cars_2$price)
S= sd(Cars_2$price)
n= 203
The sample mean is 133664.014, the standard deviation is 187915.353, and the sample size is 203.
Recall: \(SE = \frac{s}{\sqrt{n}}, \quad ME = z^* \times SE \quad \text{where}\quad z^*=1.96\) for a 95% confidence level.
s <- sd(Cars_2$price, na.rm = TRUE)
x_bar <- mean(Cars_2$price, na.rm = TRUE)
n <- nrow(Cars_2)
z = 1.96
se = s / sqrt(n)
me = z * se
x_bar - me
## [1] 107813.4
x_bar + me
## [1] 159514.6
Report and interpret your confidence interval in the context of your dataset (e.g., “We are 95% confident that the true mean house price is captured between ...”).
We are 95% confident that the true mean car price is captured between 107813.4 and 159514.6.
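The interval calculation above can be wrapped in a small function so the same steps apply to any numeric variable or confidence level. This is only a sketch: the name `mean_ci` is hypothetical, it uses `qnorm()` for the critical value instead of the rounded 1.96, and it counts only non-missing values for n.

```r
# Hypothetical helper (an assumption, not from the original report):
# z-based confidence interval for a mean.
mean_ci <- function(x, conf_level = 0.95) {
  x <- x[!is.na(x)]                          # drop missing values first
  n <- length(x)                             # sample size after dropping NAs
  se <- sd(x) / sqrt(n)                      # standard error: s / sqrt(n)
  z_star <- qnorm(1 - (1 - conf_level) / 2)  # about 1.96 when conf_level = 0.95
  me <- z_star * se                          # margin of error
  c(lower = mean(x) - me, upper = mean(x) + me)
}
```

Calling `mean_ci(Cars_2$price)` should closely match the interval reported above; small differences would come from using the unrounded critical value.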
Cars_2 %>%
group_by(transmission_type) %>%
summarise(Mean = mean(price),
SD= sd(price)) %>% arrange(desc(Mean), desc(SD))
Cars_2 %>%
ggplot(aes(x=transmission_type,y=price, fill = transmission_type))+
geom_boxplot(show.legend = F)
From the summary and boxplot, automatic transmission cars have a higher
average price compared to manual transmission cars. The spread of prices
for automatic cars is also larger, with some very expensive models
acting as outliers. Manual cars tend to be less expensive and have a
smaller range of prices. This pattern supports the hypothesis that
automatic cars are generally more expensive than manual cars.
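The summary and boxplot are descriptive, so a formal check of the hypotheses stated earlier could use a one-sided two-sample t-test. The sketch below uses Welch's version (the default in `t.test()`), which does not assume equal spread in the two groups. The helper name `compare_transmission` is hypothetical, and the code assumes the two groups are labelled exactly "Automatic" and "Manual" in transmission_type.

```r
# Hypothetical helper (an assumption, not from the original report):
# Welch one-sided t-test of H0: mean(Automatic) = mean(Manual)
# against HA: mean(Automatic) > mean(Manual).
compare_transmission <- function(data) {
  # Keep only the two groups named in the hypotheses (assumed labels).
  two_groups <- data[data$transmission_type %in% c("Automatic", "Manual"), ]
  # Put "Automatic" first so alternative = "greater" tests Automatic > Manual.
  two_groups$transmission_type <- factor(two_groups$transmission_type,
                                         levels = c("Automatic", "Manual"))
  t.test(price ~ transmission_type, data = two_groups,
         alternative = "greater")
}
```

For this report the call would be `compare_transmission(Cars_2)`. Because prices are strongly right-skewed, the result leans on the sample being reasonably large, so the p-value is best read together with the boxplot.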
Cars_2 %>%
  ggplot(aes(x = engine_displacement, y = price)) +
  geom_point(color = "pink") +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
Cars_2 %>%
  ggplot(aes(x = max_power_bhp, y = price)) +
  geom_point(color = "pink") +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
The scatterplot of price versus engine displacement shows a moderate positive association, indicating that cars with larger engines tend to cost more. The scatterplot of price versus horsepower also shows a moderate positive association, and it is slightly stronger: the correlations reported later are about 0.63 for engine displacement and 0.68 for horsepower. Horsepower therefore predicts price a little more consistently than engine displacement, although neither variable alone explains most of the variation in price.
model1 = lm(price ~ engine_displacement, data= Cars_2)
summary(model1)
##
## Call:
## lm(formula = price ~ engine_displacement, data = Cars_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -312331 -58937 -36368 23097 1005372
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -49871.523 18864.050 -2.644 0.00885 **
## engine_displacement 79.593 6.871 11.584 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 145900 on 201 degrees of freedom
## Multiple R-squared: 0.4003, Adjusted R-squared: 0.3974
## F-statistic: 134.2 on 1 and 201 DF, p-value: < 2.2e-16
model2 = lm(price ~ max_power_bhp, data= Cars_2)
summary(model2 )
##
## Call:
## lm(formula = price ~ max_power_bhp, data = Cars_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -457511 -44815 -17083 6396 738100
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -41159.56 16630.77 -2.475 0.0142 *
## max_power_bhp 655.81 50.53 12.978 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 139000 on 201 degrees of freedom
## Multiple R-squared: 0.4559, Adjusted R-squared: 0.4532
## F-statistic: 168.4 on 1 and 201 DF, p-value: < 2.2e-16
r1 =cor(Cars_2$price, Cars_2$engine_displacement)
R_square1= r1^2 ## coefficient of determination
r2= cor(Cars_2$price, Cars_2$max_power_bhp)
R_square2= r2^2 ## coefficient of determination
c(r1,r2)
## [1] 0.6327278 0.6752110
c(R_square1, R_square2)
## [1] 0.4003444 0.4559099
About 40.0% of the variation in car prices (the response variable) can be explained by engine displacement (the explanatory variable). This indicates a moderately strong relationship, meaning engine size is an important factor in predicting car prices.
About 45.6% of the variation in car prices can be explained by maximum horsepower. This indicates a moderately strong relationship, slightly stronger than engine displacement, suggesting that horsepower also plays an important role in predicting car prices.
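In a simple linear regression with one predictor, the R-squared reported by `summary()` equals the squared correlation between the two variables, which is why the values from `cor()` match the model output. The sketch below checks that identity; the helper name `r_squared_two_ways` is hypothetical.

```r
# Hypothetical helper (an assumption, not from the original report):
# compute R-squared from lm() and from the squared correlation.
r_squared_two_ways <- function(x, y) {
  fit <- lm(y ~ x)                       # simple linear regression of y on x
  c(from_lm  = summary(fit)$r.squared,   # R-squared from the model summary
    from_cor = cor(x, y)^2)              # squared Pearson correlation
}
```

For the horsepower model the call would be `r_squared_two_ways(Cars_2$max_power_bhp, Cars_2$price)`, and both entries should agree with the 0.4559 shown in the output above.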
For every 1 bhp increase in maximum power, the predicted price increases by about 655.81 on average.
Using the better model based on maximum horsepower, we can make predictions for car prices. The regression equation is price = -41,159.56 + 655.81 × max_power_bhp.
\[ \widehat{Price}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i = -41159.56 + 655.81 \times \text{max\_power\_bhp}_i \] with X = 105 bhp.
x = 105
intercept = -41159.56
slope = 655.81
y_hat = intercept + slope * x
y_hat
## [1] 27700.49
This means that a car with 105 bhp is predicted to cost approximately $27,700 according to our model.
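The same prediction can be obtained from the fitted model object with `predict()`, which avoids retyping rounded coefficients. This is a sketch under the assumption that the model was fitted with a predictor column named max_power_bhp, as model2 was; the helper name `predict_price` is hypothetical.

```r
# Hypothetical helper (an assumption, not from the original report):
# predict price for one or more horsepower values from a fitted lm object.
predict_price <- function(model, bhp) {
  # The column name must match the predictor used when the model was fitted.
  new_cars <- data.frame(max_power_bhp = bhp)
  unname(predict(model, newdata = new_cars))
}
```

For this report the call would be `predict_price(model2, 105)`, which should agree with the hand calculation up to rounding of the coefficients. Predictions are most trustworthy for horsepower values inside the range observed in the data.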
Combine your results from Part I and Part II into a single, cohesive story-telling report that summarizes your findings. Your final report should:
• Restate your research questions.
• Summarize your descriptive, inferential, and regression findings.
• Include figures and tables as supporting evidence.
• Provide a clear and meaningful conclusion connecting your analysis to your research questions.
Research Question: Do automatic transmission cars have higher prices than manual transmission cars?
Summary of Findings:
Descriptive: Car prices are strongly right-skewed. Automatic cars are more expensive on average than manual cars.
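Inferential: The 95% confidence interval for the overall mean car price is 107813.4 to 159514.6. The comparison between automatic and manual cars rests on group means and boxplots; no confidence interval or test for the difference in means was computed, so that difference is not formally confirmed here.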
Regression: Both engine displacement and maximum horsepower positively predict price. Maximum horsepower provides the better model, explaining about 45.6% of the variation in prices. For every 1 bhp increase, price increases by roughly $655.81. Using this model, a car with 105 bhp is predicted to cost approximately $27,700.
Conclusion: In this dataset, automatic cars are more expensive on average than manual cars, and both engine displacement and maximum horsepower are positively and statistically significantly associated with price (p < 2e-16 for each slope). Of the two predictors examined, maximum horsepower explains more of the variation in price. These results are consistent with the research hypothesis, but a formal test of the difference in mean price between transmission types is still needed to confirm it.