Indian Car Sales Report
ALY6010: Probability Theory and
Introductory Statistics
John Wood-Downey
Dr. Dee Chiluiza, PhD
03 March, 2023
Introduction
The global automotive industry is a substantial market. According
to Carlier (2023), the auto manufacturing industry generates 2.86
trillion USD annually and produces nearly 57 million cars yearly. In
2022, global car sales were 66.1 million; in 2021, China and the United
States were two of the biggest markets for car sales (Carlier, 2023). In
2021, Toyota sold the most cars worldwide, but the Volkswagen group had
the highest revenue, with $295.73 billion (Carlier, 2023). Some of the
most significant challenges in the automotive industry are supply-chain
disruptions caused by continued chip shortages since the pandemic,
increased demand for electric vehicles (EVs), and manufacturers
transitioning production to EVs (Carlier, 2023). In addition, in 2020,
passenger vehicles accounted for 41% of global transport-related carbon
dioxide emissions, and many countries want to reduce their carbon
footprint with new regulations (Carlier, 2023). Growing electric vehicle
companies are disrupting the automotive
industry, and sales are shifting toward the EV market. In a recent
article, Elon Musk, the CEO of Tesla, said that Tesla plans to cut
production costs by 50% and offer more affordable cars to increase the
output of electric vehicles (Reed, 2023).
More specifically, the Indian car market is substantial on a global
scale. In 2021, the Indian automotive market was valued at $32.7 billion
and should grow at a compound annual growth rate (CAGR) of 9% from 2022
to 2027 to reach $54.84 billion in 2027 (IBEF, 2022). In addition, the EV
market should reach $7.09 billion by 2025 (IBEF, 2022). Some of the
automobile clusters in India are Mumbai, Chennai, Kolkata, and Delhi
(IBEF, 2022), and the data set evaluates these locations. Some key trends
in the Indian automobile industry are rising demand from a growing young
and wealthy middle class, increased incorporation of electric
standards, substantial foreign investment in the automotive sector, and
strong policies supporting the industry (IBEF, 2022). In 2022, Suzuki was
the most popular car brand in India, with a market share of 46%; the
other big brands in the Indian car market are Kia Motors, Toyota, Honda,
and Volkswagen (Statista Research Department, 2023). In another article,
Reuters (2023) claims India is the fourth largest car market and that
vehicle sales should grow nearly 10% in 2024, to a level more than 20%
above pre-pandemic sales. According to Reuters (2023), the SUV category
is in the highest demand in India, and it will help the total number of
vehicles sold reach nearly 5 million units in 2024. Overall, the Indian
car market is in solid health.
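As a quick check of the IBEF growth figures cited above, the compound growth formula reproduces the 2027 projection if six annual compounding periods (2022 through 2027) are applied to the 2021 value. The number of periods is an assumption inferred from the dates, not something stated by the source.

$$
V_{2027} = V_{2021}\,(1 + r)^{n} = 32.7 \times (1.09)^{6} \approx 32.7 \times 1.6771 \approx 54.84 \text{ billion USD}
$$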
Discrete and continuous probability distributions are important in
statistics and data visualization. According to Bluman (2018), discrete
variables take values that can be counted. For example, there are nine
cars in the parking lot. Discrete probability distributions are usually
presented in graphs like bar plots because the variable takes separate,
countable values (Young, 2023). The main types of discrete probability
distributions are Bernoulli, a single trial with two outcomes coded as
one for success and zero for failure; binomial, the number of successes
in a fixed number of independent two-outcome trials; multinomial, the
counts of each outcome when trials have more than two outcomes; and
Poisson, the number of events that take place over a fixed period
(Young, 2023). A continuous probability distribution describes variables
that can take decimal and fractional values (Bluman, 2018). An example
is measuring the temperature in classrooms at Northeastern: a
temperature reading can contain fractions and need not be a whole
number. According to Albright (2023), a continuous variable X can assume
infinitely many values, and its probabilities are displayed as areas
under a curve. There are numerous types of continuous distributions:
uniform, where every value in an interval is equally likely; normal, a
symmetric bell-shaped curve whose standard form has a mean of zero and a
standard deviation of one; log-normal, where the logarithm of x is
normally distributed; t distribution, which has thicker tails than a
normal distribution; chi-square, whose shape depends on its degrees of
freedom; and exponential (Madiraju, 2021).
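The difference between discrete and continuous distributions can be illustrated with a minimal base R sketch. The parameter values below (10 trials, a rate of 4, and so on) are arbitrary choices for illustration and do not come from the car sales data.

```r
# Discrete distributions assign probability to individual countable values.
p_binom <- dbinom(3, size = 10, prob = 0.5)  # P(exactly 3 successes in 10 fair trials)
p_pois  <- dpois(2, lambda = 4)              # P(exactly 2 events when 4 are expected)

# Continuous distributions assign probability to intervals (area under the curve).
p_norm <- pnorm(1) - pnorm(-1)               # P(-1 < Z < 1) for the standard normal
p_unif <- punif(0.75) - punif(0.25)          # P(0.25 < U < 0.75) for uniform on [0, 1]

print(round(c(binomial = p_binom, poisson = p_pois,
              normal = p_norm, uniform = p_unif), 4))
```

For a continuous variable, the probability of any single exact value is zero, which is why the last two lines work with intervals rather than points.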
The data used in this report is about car sales in India. The
data covers car models, location, year, fuel type, transmission,
owner, efficiency, engine, power, seats, kilometers, and price. The data
is presented in tables, bar charts, pie charts, density graphs, box
plots, histograms, and scatterplots to give an overview of car sales in
India and how variables affect each other. The goal of the report is to
examine the price of cars in India, their kilometers, and the location
of sales to present the market and explain how the Indian market is one
of the biggest car sales markets in the world.
Analysis
Task 1
This task presents a table with descriptive statistics of the car
sales data set
| | Efficiency | Power_bhp | Seats | Km | Price |
|---|---|---|---|---|---|
| vars | 1.00 | 2.00 | 3.00 | 4.00 | 5.00 |
| n | 4949.00 | 4949.00 | 4949.00 | 4949.00 | 4949.00 |
| mean | 18.47 | 123.73 | 5.16 | 55809.14 | 8383.22 |
| sd | 4.17 | 41.46 | 0.56 | 28764.20 | 5158.19 |
| median | 18.00 | 100.00 | 5.00 | 54000.00 | 7146.00 |
| trimmed | 18.32 | 117.48 | 5.03 | 54060.90 | 7668.28 |
| mad | 2.97 | 0.00 | 0.00 | 28169.40 | 4121.63 |
| min | 8.00 | 50.00 | 4.00 | 171.00 | 618.00 |
| max | 30.00 | 600.00 | 10.00 | 149000.00 | 25270.00 |
| range | 22.00 | 550.00 | 6.00 | 148829.00 | 24652.00 |
| skew | 0.41 | 1.96 | 3.85 | 0.58 | 1.23 |
| kurtosis | -0.17 | 8.61 | 18.71 | 0.23 | 1.16 |
| se | 0.06 | 0.59 | 0.01 | 408.88 | 73.32 |
The table above describes the car sales data set with
descriptive statistics like the mean, median, range, skew, number of
observations (n) for each variable, and other information. According to
Bluman (2018), descriptive statistics help describe a situation. From
the table, we understand that some of the main variables in the car
sales data set are car efficiency, power output in brake horsepower,
seats per vehicle, kilometers driven, and the price of the car. These
variables are important to highlight because they influence car sales
significantly. Two of the most impactful factors when buying used cars
are mileage and condition: the higher the mileage, the worse the
vehicle’s condition usually is and the less a customer wants to pay for
it, because there is a big difference between 10,000km and 200,000km
(D’Allegro, 2021). From the table, we learn there are 4949 observations
in the car sales data set. The mean for efficiency is 18.47, the mean
for horsepower is 123.73, the mean for seats is 5.16, the mean for
kilometers is 55,809.14km, and the average price per car in the data set
is $8,383.22. The vehicle with the lowest price is $618, and the highest
is $25,270. The table was created with dplyr::select(), which keeps only
the variables seen in the table; this is useful in data analysis for
presenting only the variables of interest. Secondly, psych::describe()
produced the descriptive statistics. The function t() transposed the
result to have the variables in columns and the statistics in rows.
Finally, round() rounded the values to two decimals to make the table
easier to read. The table was rendered with knitr::kable to create a
better HTML table (Zhu, 2021).
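The steps described above can be sketched roughly as follows. This is not the report's code: it is a base R analogue that avoids the dplyr, psych, and knitr packages so that it runs without extra installs, and the data frame `cars`, its values, and the helper `describe_basic()` are hypothetical.

```r
# Hypothetical stand-in for the car sales data; values are made up for illustration.
cars <- data.frame(
  Model      = c("A", "B", "C", "D", "E"),
  Efficiency = c(18.0, 21.5, 15.2, 19.8, 17.5),
  Km         = c(42000, 55000, 61000, 30000, 87000),
  Price      = c(7100, 5400, 9800, 12500, 4300)
)

# Keep only the numeric variables of interest (base R analogue of dplyr::select()).
num_vars <- cars[, c("Efficiency", "Km", "Price")]

# Hypothetical helper returning a few of the statistics psych::describe() reports.
describe_basic <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x), median = median(x),
    min = min(x), max = max(x), range = max(x) - min(x),
    se = sd(x) / sqrt(length(x)))
}

# sapply() yields one column per variable and one row per statistic,
# the same orientation as the transposed table in the report.
stats_table <- round(sapply(num_vars, describe_basic), 2)
print(stats_table)
```

In an R Markdown document, passing `stats_table` to knitr::kable() would render it as a formatted table.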
Task 2
This task presents a bar chart for each location of car sales and a
pie chart for each type of fuel in the car sales data set
Here we have two graphs displayed side by side using par(mfcol).
The first graph is a bar chart showing the
frequency of car sales in different locations in India, and the second
is a pie chart showing the types of fuel for the cars in the car sales
data set. From the bar chart, the location with the most car sales was
Mumbai, with 677 cars, and the lowest was Ahmedabad, with 196 cars sold.
The data is essential because we can see which locations sell the most
cars and which sell the fewest. Consumers can use this historical sales
information when looking for a reasonable sales price, and sellers can
use it to find where they can sell their cars the fastest. Bluman (2018)
says bar graphs accurately show the frequencies of categories, so a bar
graph is appropriate in this example to show sales by location. The
second graph is a pie chart for the type of fuel Indian cars use. From
the pie chart, we see that most vehicles are diesel or gasoline
vehicles. Very few cars are powered by compressed natural gas (CNG) or
liquefied petroleum gas (LPG). The chart is interesting because India
has many diesel vehicles available compared to North America. According
to the IEA (2021), diesel engines accounted for 47% of vehicle sales in
India in 2017 and 40% in 2019. The pie chart is consistent with these
figures and shows that many diesel-powered vehicles are available in
these locations in India.
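A side-by-side bar chart and pie chart of this kind can be sketched in base R as shown below. The `location` and `fuel` vectors are a hypothetical handful of rows, not the report's data, and the colors and labels are arbitrary choices.

```r
# Hypothetical sample of categorical columns; the real data set has 4949 rows.
location <- c("Mumbai", "Mumbai", "Delhi", "Pune", "Mumbai", "Delhi", "Kochi")
fuel     <- c("Diesel", "Gasoline", "Diesel", "Gasoline", "Diesel", "CNG", "Gasoline")

# Count how many cars fall in each category, largest location first.
location_counts <- sort(table(location), decreasing = TRUE)
fuel_counts     <- table(fuel)

pdf(NULL)                        # draw to a null device; remove for on-screen output
old_par <- par(mfcol = c(1, 2))  # one row, two columns of plots
barplot(location_counts, las = 2, col = "steelblue",
        ylab = "Number of cars", main = "Cars by location")
pie(fuel_counts, main = "Fuel type")
par(old_par)                     # restore the previous layout
invisible(dev.off())

print(location_counts)
```

With a one-row layout, par(mfcol) and par(mfrow) place the plots identically; they differ only in fill order when there is more than one row.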
Task 3
This task shows a table for the frequency of owner type (first,
second, third, or fourth) with the percentage of each
| Owner | Frequency | Cum Frequency | Proportion | Cum Proportion |
|---|---|---|---|---|
| First | 2921 | 2921 | 0.59 | 0.59 |
| Fourth | 782 | 3703 | 0.16 | 0.75 |
| Second | 916 | 4619 | 0.19 | 0.93 |
| Third | 330 | 4949 | 0.07 | 1.00 |
The table above shows the frequency and percentage of each owner
type in the data set. The table divides vehicle ownership among first,
second, third, and fourth owners. Most of the data is for first owners,
with 2921 cars, followed by second owners at 916, then fourth owners at
782, and finally third owners at 330. First owners make up 59% of the
data for car sales in India, while third owners are only 7% of Indian
car sales. The function knitr::kable, styled with the kableExtra
library, allowed for a custom table (Zhu, 2021). The function mutate()
added new columns to the table to display the cumulative frequency,
percentage, and cumulative percentage. According to Bluman (2018), the
frequencies and percentages in this table can also be displayed in a pie
chart because pie charts divide sections according to the percentage
for each categorical value. Therefore, car sales agents can use the
information in this table to target the appropriate owner type and
increase sales.
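The cumulative and relative columns can be reproduced from the reported frequencies with a short sketch. It uses base R's transform() in place of dplyr::mutate() so it runs without extra packages; the column names are illustrative and not necessarily those used in the report.

```r
# Owner-type counts as reported in the table above.
freq <- c(First = 2921, Fourth = 782, Second = 916, Third = 330)

owner_table <- data.frame(Owner = names(freq), Frequency = as.vector(freq))

# Base R analogue of dplyr::mutate(): add cumulative and relative columns.
owner_table <- transform(
  owner_table,
  Cum_Frequency  = cumsum(Frequency),
  Percentage     = round(Frequency / sum(Frequency), 2),
  Cum_Percentage = round(cumsum(Frequency) / sum(Frequency), 2)
)
print(owner_table)
```

Because the rows are in alphabetical order, the cumulative columns accumulate in that order; sorting by frequency first would give a more conventional cumulative reading.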
Task 4
This task presents a density plot for the kilometers of cars in the
car sales data set
Here we created a density plot to show the distribution of the
variable kilometers in the car sales data set. The density plot graphs
the data for all kilometers, so we can interpret what values are
most common in car mileage for car sales in India. From the plot, the
mean for kilometers is 55,809km, and the observed values run from 171km
to 149,000km for cars in India. In the plot, there are also two vertical
lines marking z-scores. According to Bluman (2018), the standard score
(z-score) tells us how many standard deviations a data value falls
above or below the mean. We find the standard score by subtracting the
mean from a value and dividing the result by the standard deviation of
the data set (Bluman, 2018). In this example, we used z-scores of 2.4
and -3.1 to show where the corresponding values fall on the curve. A
z-score of 2.4 corresponds to 124,843km in the car sales data set,
while a z-score of -3.1 corresponds to -33,360km. The negative value has
no practical meaning for customers because car kilometers cannot be
below zero; the smallest observed value, 171km, is only about 1.9
standard deviations below the mean. Using the graph, we can take any
kilometer value and find how many standard deviations it lies above or
below the mean. The density plot is essential in car sales because
customers want to know where a vehicle stands compared to other
vehicles in terms of kilometers before purchasing. The function text()
added numerical values beside the lines on the graph (Chiluiza, 2021).
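The kilometer values quoted for the two z-scores follow from rearranging the standard score formula, using the mean (55,809.14) and standard deviation (28,764.20) from the Task 1 table:

$$
\begin{aligned}
z &= \frac{x - \bar{x}}{s} \quad\Longrightarrow\quad x = \bar{x} + z\,s \\
x_{z=2.4} &= 55{,}809.14 + 2.4 \times 28{,}764.20 \approx 124{,}843 \text{ km} \\
x_{z=-3.1} &= 55{,}809.14 - 3.1 \times 28{,}764.20 \approx -33{,}360 \text{ km}
\end{aligned}
$$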
Task 5
This task presents a box plot and a histogram for the kilometers of
the cars from the car sales data set
Here we have two graphs showing kilometers for cars in the
Indian car sales data set. These two graphs accompany the density plot
in the last task to show more visuals about the data distribution. From
the box plot, the blue data tells us that the mean is greater than the
median. According to Bluman (2018), when the mean is to the right of the
median, most of the data is to the left of the mean, resulting in a
positive skew. The dots on the right side of the box plot show potential
outliers in the kilometers data. The second graph is a histogram that
shows the same data as the box plot and the density curve in another
visual format. As kilometers are continuous data, the histogram divides
the values into bins and paints a picture of the distribution.
Continuous data are values that can take infinitely many values between
two specific points and include fractions or decimals (Bluman, 2018).
The red line shows the median, and the pink line shows the mean. The
data is right-skewed, as explained for the box plot. The function
par(mfrow) presents the two graphs in a row to optimize data
visualization. In these graphs, the argument main=NA hides the titles
because the graphs continue the explanation of the density curve.
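A box plot and histogram pair with mean and median markers can be sketched as follows. The `km` vector holds hypothetical right-skewed values rather than the report's data, and the bin count and line widths are arbitrary.

```r
# Hypothetical right-skewed mileage values in kilometers (not the report's data).
km <- c(12000, 25000, 31000, 40000, 48000, 54000, 60000, 75000, 98000, 160000)

km_mean   <- mean(km)
km_median <- median(km)

pdf(NULL)                        # null device; remove to view the plots
old_par <- par(mfrow = c(1, 2))  # two plots in one row
boxplot(km, horizontal = TRUE, main = NA, xlab = "Kilometers")
points(km_mean, 1, col = "blue", pch = 19)   # mark the mean on the box plot
hist(km, breaks = 8, main = NA, xlab = "Kilometers")
abline(v = km_median, col = "red", lwd = 2)  # median
abline(v = km_mean, col = "pink", lwd = 2)   # mean
par(old_par)
invisible(dev.off())

# A mean above the median is consistent with a positive (right) skew.
print(c(mean = km_mean, median = km_median))
```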
Task 6
This task presents a box plot and a histogram for the price of cars
in the car sales data set
Here we have two graphs showing the price of cars in the car sales
data set. The box plot and the histogram show the data
distribution. The box plot shows that the mean is larger than the
median, making the data positively skewed and confirming that most sales
prices are less than $8,000. The histogram shows another view of the
sales price distribution because it groups the data in bins and helps us
see the positive skew. The tallest bin in the histogram is at $5,000,
showing us that more data points fall in this price range than in any
other. Finally, the box plot shows some outliers in the data
because some cars have higher prices than others based on factors
like mileage and condition (D’Allegro, 2021). The red line on the
histogram is the median, and the orange line is the mean.
Task 7
This task presents box plots for the price of cars per different
owner
Here is a box plot for the distribution of price by type of
owner in the car sales data set. As previously explained, ownership
is divided among first, second, third, and fourth owners. Each owner
pays a varying price for a vehicle based on their desired
specifications. From the box plots, the owner type with the lowest
median price is the third owner, followed by the fourth, first, and
second. Based on the data, second owners are more willing to pay higher
prices for their vehicles. The fourth owners have more separation in the
data because there are numerous outliers, meaning that while some owners
pay low prices, others are willing to pay higher prices for their
vehicle. The third owners have the widest spread because their box plot
shows an elongated interquartile range. The box plots show that first
owners pay between a few hundred dollars and $16,000 for their
car. According to Bluman (2018), box plots are an excellent tool for
comparing data distributions between categories.
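A grouped box plot like the one described, together with the group medians it displays, can be sketched with base R's formula interface. The `sales` data frame and its prices are hypothetical and only illustrate the mechanics.

```r
# Hypothetical prices by owner type (illustrative values only).
sales <- data.frame(
  Owner = c("First", "First", "First", "Second", "Second", "Second",
            "Third", "Third", "Third"),
  Price = c(6000, 7000, 9500, 7500, 8200, 11000, 3000, 5500, 9000)
)

# Median price per owner type, sorted from lowest to highest.
median_by_owner <- sort(tapply(sales$Price, sales$Owner, median))
print(median_by_owner)

pdf(NULL)  # null device; remove to view the plot
boxplot(Price ~ Owner, data = sales, col = "lightblue",
        xlab = "Owner type", ylab = "Price")
invisible(dev.off())
```

Printing the medians alongside the plot makes the ranking of owner types explicit instead of relying on reading the center lines off the chart.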
Task 8
This task presents box plots of kilometers of cars in different
locations
Task 8 presents a box plot showing the kilometers of cars in
different locations in India in the Indian car sales data set. For
example, the box plot shows that Hyderabad has the highest median
kilometers in India, and Kolkata has the lowest median kilometers.
According to Yi (2021), box plots are good visualization tools because
they emphasize the middle 50% of the data with the box, or interquartile
range. Having a clear view of the data allows for a better understanding
of how one distribution compares to other categories. In this chart, for
example, we can easily see which location in India has the most outliers
in car kilometers. In Kolkata, there are many outliers. We can
also see that the kilometers have a more extended interquartile range in
Chennai. Box plots offer information about the skewness of the data,
variance, symmetry, and possible outliers (Yi, 2021). This box plot
gives a global overview of data we previously explained, like kilometers
and location, and combines them to give the audience a better
understanding of the overall Indian car sales data set.
Task 9
This task shows the box plot statistics of the kilometers of cars
in the car sales data set
| Descriptive_Statistics | Values |
|---|---|
| Min | 171 |
| Q1 | 34994 |
| Median | 54000 |
| Q3 | 72618 |
| Max | 129000 |
The table above shows box plot statistics for the
variable kilometers in the Indian car sales data set. The function
boxplot.stats() returns the five values used to draw a box plot: the
lower whisker, the lower hinge (Q1), the median, the upper hinge (Q3),
and the upper whisker. The variable kilometers is
important for car sales because, as D’Allegro (2021) stated, kilometers
affect the car’s condition and the price customers are willing to
pay. The output of the function was stored in a vector to create a
table using knitr::kable (Zhu, 2021). From the information, the car with
the lowest kilometers had 171km, which is a nearly new car. The Q1 was
34,994km, the median kilometers for the data set was 54,000km, and the
Q3 was 72,618km. The value labeled Max, 129,000km, is the end of the
upper whisker: the largest observation within 1.5 times the
interquartile range of Q3 (72,618 + 1.5 × 37,624 = 129,054km). Cars above
that limit, up to the true maximum of 149,000km in the Task 1 table, are
treated as outliers, so the whisker-to-whisker range is 128,829km.
According to Bluman (2018), a five-number summary helps understand
data, and these points create visualizations, as seen in the next chart
below.
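The difference between a whisker end and the true maximum can be seen in a small sketch of boxplot.stats(). The `km` vector is hypothetical and chosen so that one value lies beyond the upper fence; the row labels are illustrative.

```r
# Hypothetical mileage values with one high outlier (not the report's data).
km <- c(171, 20000, 40000, 54000, 70000, 80000, 149000)

bs <- boxplot.stats(km)  # default coef = 1.5 sets the outlier fences

box_stats <- data.frame(
  Statistic = c("Lower whisker", "Lower hinge (Q1)", "Median",
                "Upper hinge (Q3)", "Upper whisker"),
  Value = bs$stats
)
print(box_stats)

# Values beyond the fences are reported separately and excluded from the whiskers.
print(bs$out)
print(max(km))
```

Here the upper whisker stops at 80,000 even though the largest value is 149,000, which mirrors how the report's table shows 129,000km while the overall maximum is 149,000km.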
Task 10
This task presents a dot chart with the quartiles of kilometers in
cars from the previous table
The final analysis of the Indian car sales data set is a dot
chart to present the box plot statistics from Task 9. The dot chart
shows five red dots to explain the variable kilometers in car sales.
Each dot shows one box plot statistic: the min, Q1, median, Q3, and max.
From the chart, we see the kilometers increasing from one statistic to
the next. The visual helps gain a deeper understanding of what the data
points mean for the overall data set.
Conclusion
From the data, we found that factors like location, kilometers,
ownership, and condition affect the price of car sales. The analysis
section of the report gave basic descriptive statistics of efficiency,
power, seats, kilometers, and price to paint a picture of the data set.
Moreover, we found the location frequency and fuel types of cars in
India. Some key findings are that Mumbai has the highest car sales and
that gasoline cars are the most popular, but, surprisingly, diesel is
also a popular fuel source in Indian cars. Furthermore, we found that
most cars in the data set belong to first owners (59%) and very few to
third owners, with only 7% in this category.
Additionally, the mean price of cars is $8,383.22, and most sales
are near the median of $7,146. Finally, through the density plot and the
box plot statistics, we found that the mean kilometers for cars in the
data set were 55,809.14km, and the median was 54,000km. The car with the
highest kilometers in the sales data set had 149,000km, and the lowest
had 171km.
From the data, we can conclude that the cars that sell the most
are either gasoline or diesel. Although there is a push for EVs in
India, the EV market still needs to produce more sales, according to the
data set. Some locations with high car sales are Mumbai,
Hyderabad, Pune, Kochi, Coimbatore, and Delhi. Car sales companies can
use the data to evaluate where they should sell cars and how much
consumers will pay. Car manufacturers can also use the data when
deciding what product line of cars to sell in India. The descriptive
statistics give information about sales in the country and what factors
to consider. The report can also help new car owners compare how many
kilometers are typical in car sales and what price first, second, third,
or fourth owners are paying for cars in India.
Through the report, new skills were acquired, like adding
columns to data tables, creating various graphs like density plots and
scatterplots, and using new libraries. From the overall project, the
visualizations suggest that India is still an emerging market with a
need for foreign direct investment. The manufacturing sector is
growing, and the opportunity to leverage car sales to make
profitable investments is apparent through the data.
Bibliography
Albright, E. (2023). ENV710 Statistics Review Website. Duke
Nicholas School of the Environment. Retrieved March 2, 2023, from https://sites.nicholas.duke.edu/statsreview/continuous-probability-distributions/#:~:text=Continuous%20probability%20distribution%3A%20A%20probability,50).
Bluman, A. G. (2018). Elementary statistics: A step by step approach
(10th ed.). McGraw-Hill Education.
Carlier, M. (2023). Automotive Industry Worldwide - statistics &
facts. Statista. Retrieved March 2, 2023, from https://www.statista.com/topics/1487/automotive-industry/#topicOverview
Chiluiza, D. (2021). Introduction to data analysis using R, R Studio
and R Markdown- Short Manual Series: Histograms. RPubs. Retrieved March
1, 2023, from https://rpubs.com/Dee_Chiluiza/816756
D’Allegro, J. (2021). Just what factors into the value of your used
car. Investopedia. Retrieved February 28, 2023, from https://www.investopedia.com/articles/investing/090314/just-what-factors-value-your-used-car.asp
IBEF. (2022). Automobile Industry in India. India Brand Equity
Foundation. Retrieved March 2, 2023, from https://www.ibef.org/industry/india-automobiles
IEA. (2021). Fuel economy in India - part of the global fuel economy
initiative 2021. International Energy Agency. Retrieved March 1, 2023,
from https://www.iea.org/articles/fuel-economy-in-india
Madiraju, P. (2021). Statistics 101: Beginners Guide to Continuous
Probability Distributions. Analytics Vidhya. Retrieved March 2, 2023, from
https://www.analyticsvidhya.com/blog/2021/02/statistics-101-beginners-guide-to-continuous-probability-distributions/
Reed, B. (2023). Musk unveils plans for low production cost while
skirting affordable car option. Guardian. Retrieved March 2, 2023, from
https://www.theguardian.com/technology/2023/mar/01/elon-musk-tesla-fossil-fuels-economy
Reuters. (2023). India’s passenger vehicle sales to grow 9%-10% in
2024 - Crisil Ratings. Reuters. Retrieved March 2, 2023, from https://www.reuters.com/world/india/indias-passenger-vehicle-sales-grow-9-10-2024-crisil-ratings-2023-02-28/
Statista Research Department. (2023). Passenger car market share
across India in 2022, by vendor. Statista. Retrieved March 2, 2023, from
https://www.statista.com/statistics/316850/indian-passenger-car-market-share/#statisticContainer
Yi, M. (2021). A complete guide to box plots. Chartio. Retrieved
March 2, 2023, from https://chartio.com/learn/charts/box-plot-complete-guide/#:~:text=Box%20plots%20are%20used%20to,skew%2C%20variance%2C%20and%20outliers.
Young, J. (2023). Discrete Probability Distribution: Overview and
Examples. Investopedia. Retrieved March 2, 2023, from https://www.investopedia.com/terms/d/discrete-distribution.asp#:~:text=A%20discrete%20probability%20distribution%20counts,%2C%20Poisson%2C%20and%20Bernoulli%20distributions.
Zhu, H. (2021). Create awesome HTML table with knitr::kable and
kableExtra. CRAN. Retrieved March 1, 2023, from https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html
Appendix
Due to the submission of this report in an HTML file, an R
Markdown file has been attached to the report. The file name is
Project1_JohnWoodDowney.Rmd