The Car_Data dataset is taken from Kaggle (https://www.kaggle.com/datasets/athirags/car-data) and contains 301 records in 9 columns. It describes cars and their features using four categorical variables, two continuous variables, and three integer variables.
a ) Describe the dataset using appropriate plots/curves/charts
cars = read.csv(file.choose())
head(cars)
Histogram
hist(cars$Selling_Price, col = 'red')
library(ggplot2)
Scatter plot
ggplot(data = cars, aes(x = Year, y = Selling_Price)) + geom_point()
Bar plot
ggplot(data = cars, aes(x=Year, y=Selling_Price)) +
geom_bar(stat="identity")
Scatter plot
ggplot(data = cars, aes(x = Fuel_Type, y = Selling_Price)) + geom_point()
Scatter plot
ggplot(data = cars, aes(x = Car_Name, y = Selling_Price)) + geom_point()
Pie chart
# Count the cars per transmission type and draw the stacked bar in polar coordinates
ggplot(cars, aes(x="", fill=Transmission)) + geom_bar(width=1) + coord_polar("y", start=0)
b ) Consider one of continuous attributes, and compute central and variational measures.
Central Measures
Price=cars$Selling_Price
# 1.Mean
# Mean is the average of all values. Here we are calculating the average of Selling_Price
mean_Price = mean(Price)
mean_Price
[1] 4.661296
# 2.Median
# Median is the middle value in the column Selling_Price when it is in sorted order
median_Price = median(Price)
median_Price
[1] 3.6
# 3.Mode
# Mode is the most common or repeating value in Selling_Price
Price_table =table(Price)
Price_table
Price
0.1 0.12 0.15 0.16 0.17 0.18 0.2 0.25 0.27 0.3
1 1 1 1 1 1 6 5 1 3
0.31 0.35 0.38 0.4 0.42 0.45 0.48 0.5 0.51 0.52
1 4 2 5 2 8 4 5 1 1
0.55 0.6 0.65 0.72 0.75 0.78 0.8 0.9 0.95 1
2 8 4 1 4 1 1 2 1 1
1.05 1.1 1.11 1.15 1.2 1.25 1.35 1.45 1.5 1.65
5 3 1 4 3 2 3 1 1 1
1.7 1.75 1.95 2 2.1 2.25 2.35 2.5 2.55 2.65
1 1 2 1 1 3 1 2 2 3
2.7 2.75 2.85 2.9 2.95 3 3.1 3.15 3.25 3.35
1 2 3 3 2 4 4 1 3 2
3.45 3.49 3.5 3.51 3.6 3.65 3.75 3.8 3.9 3.95
1 1 2 1 1 1 2 1 2 2
4 4.1 4.15 4.35 4.4 4.5 4.6 4.65 4.75 4.8
5 2 1 1 3 7 1 1 6 2
4.85 4.9 4.95 5 5.11 5.15 5.2 5.25 5.3 5.35
1 2 2 1 1 1 1 7 2 1
5.4 5.5 5.65 5.75 5.8 5.85 5.9 5.95 6 6.1
2 5 1 2 1 2 1 2 4 1
6.15 6.25 6.4 6.45 6.5 6.6 6.7 6.75 6.85 6.95
1 2 1 1 2 1 1 1 1 1
7.05 7.2 7.25 7.4 7.45 7.5 7.75 7.9 8.25 8.35
1 1 2 1 3 3 3 1 2 1
8.4 8.5 8.55 8.65 8.75 8.99 9.1 9.15 9.25 9.5
2 1 1 1 1 1 1 1 3 1
9.65 9.7 10.11 10.25 10.9 11.25 11.45 11.5 11.75 12.5
1 1 1 1 1 3 1 1 1 1
12.9 14.25 14.5 14.73 14.9 16 17 18 18.75 19.75
1 1 1 1 1 1 1 1 1 1
19.99 20.75 23 23.5 33 35
1 1 3 1 1 1
mode_Price = names(Price_table)[which(Price_table==max(Price_table))]
mode_Price
[1] "0.45" "0.6"
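The table-based lookup above can be wrapped in a small helper that returns every tied mode. This is only a sketch: the function name sample_modes and the example vector are illustrative and are not part of the original analysis.

```r
# Hypothetical helper: return every value that attains the highest frequency.
sample_modes <- function(x) {
  counts <- table(x)
  # names() yields character labels, so convert back for numeric input
  modes <- names(counts)[counts == max(counts)]
  if (is.numeric(x)) as.numeric(modes) else modes
}

# Made-up prices, chosen so that two values tie for the mode
example_prices <- c(0.45, 0.60, 0.45, 0.60, 1.05, 3.60)
sample_modes(example_prices)
```

Returning numeric values rather than character labels keeps the result usable in later arithmetic.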
Variational Measures
# 1.Range
# Range is the difference between the highest and lowest value in the column
range_Price = max(Price)-min(Price)
range_Price
[1] 34.9
# 2.Interquartile Range
# IQR is the spread of the middle 50% of the values: the third quartile (Q3) minus the first quartile (Q1)
Q=quantile(Price)
Q
0% 25% 50% 75% 100%
0.1 0.9 3.6 6.0 35.0
Q1 =Q[2];Q1
25%
0.9
Q3 =Q[4];Q3
75%
6
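# Use a separate name so the built-in IQR() function is not masked; unname() drops the "75%" label
IQR_Price = unname(Q3-Q1)
IQR_Price
[1] 5.1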
# 3.Standard Deviation
# Standard deviation measures the typical distance of the values from the mean; it is the square root of the variance
SD_Price = sd(Price)
SD_Price
[1] 5.082812
# 4.Variance
# Variance is the average squared difference between each data point and the mean (var() divides by n - 1)
varience_Price = var(Price)
varience_Price
[1] 25.83497
c ) For a particular variable of the dataset, use Chebyshev’s rule, and propose one-sigma interval. Based on your proposed interval, specify the outliers if any.
# Chebyshev's rule: the k-sigma interval is mean ± k * standard deviation (not variance)
k = 1
lowerRange = mean_Price - k * SD_Price
upperRange = mean_Price + k * SD_Price
round(lowerRange, 2)
[1] -0.42
round(upperRange, 2)
[1] 9.74
outliers = cars$Selling_Price < lowerRange | cars$Selling_Price > upperRange
sum(outliers)
[1] 28
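Chebyshev's inequality states that at least 1 - 1/k^2 of the values lie within k standard deviations of the mean, so for k = 1 the guaranteed coverage is 0 and the one-sigma interval is only a descriptive cut-off. The sketch below wraps the interval in a function and reports that bound. The name chebyshev_interval and the toy vector are assumptions made for illustration.

```r
# Hypothetical helper: k-sigma interval, Chebyshev coverage bound, and flagged values.
chebyshev_interval <- function(x, k = 1) {
  m <- mean(x)
  s <- sd(x)
  lower <- m - k * s
  upper <- m + k * s
  list(lower = lower,
       upper = upper,
       min_coverage = max(0, 1 - 1 / k^2),  # guaranteed share inside the interval
       outliers = x[x < lower | x > upper])
}

# Made-up data with one clearly extreme value
toy <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 20)
res <- chebyshev_interval(toy, k = 2)
res$min_coverage
res$outliers
```

Trying k = 2 or k = 3 alongside k = 1 shows how quickly the flagged set shrinks as the interval widens.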
d ) Explain how the box-plot technique can be used to detect outliers. Apply this technique for one attribute of the dataset
The box plot is an effective graphical technique for spotting outliers in a dataset. It displays the median, quartiles, and extreme values (minimum and maximum) of a numerical variable to provide a visual representation of its distribution. Outliers can be identified by measuring the distance between the quartiles (interquartile range or IQR) and the distribution’s extremes.
An outlier is defined as an observation that lies more than 1.5 times the IQR below the first quartile (Q1) or above the third quartile (Q3). In other words, an outlier is an observation that lies outside the range spanned by the box plot’s whiskers. To apply this technique, let’s consider the Selling_Price attribute. We can create a box plot for this variable in R as follows:
boxplot(cars$Selling_Price, main = "Box Plot of Selling_Price")
The graph above is a box plot of the Selling_Price attribute, showing the distribution of the data and the location of any outliers. Outliers, if any exist, appear as isolated points beyond the box plot’s whiskers.
This box plot can be used to identify potential outliers and determine whether they are real data points or data entry errors. If they are real data points, we must decide whether to keep or discard them from the analysis, based on the research question and the impact of the outliers on the results.
q = quantile(cars$Selling_Price, c(0.25, 0.75))
iqr = q[2] - q[1]
upperOutliers = q[2] + 1.5 * iqr
lowerOutliers = q[1] - 1.5 * iqr
boxplotOutliers = cars$Selling_Price > upperOutliers | cars$Selling_Price < lowerOutliers
sum(boxplotOutliers)
[1] 17
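The fence calculation above can be packaged as a function and cross-checked against base R's boxplot.stats(). Note that boxplot.stats() uses hinges rather than quantile(), so the two can differ slightly on small samples. The name iqr_outliers and the toy vector are illustrative assumptions.

```r
# Hypothetical helper: values outside Q1 - coef * IQR and Q3 + coef * IQR.
iqr_outliers <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - coef * iqr
  upper <- q[2] + coef * iqr
  x[x < lower | x > upper]
}

# Made-up prices with one extreme value
toy <- c(0.5, 0.9, 1.2, 3.6, 4.0, 6.0, 6.5, 35)
iqr_outliers(toy)
boxplot.stats(toy)$out   # base R's own whisker rule, for comparison
```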
a ) Select four variables of the dataset, and propose an appropriate probability model to quantify uncertainty of each variable.
We use the Car_Data dataset from Kaggle (https://www.kaggle.com/datasets/athirags/car-data) for Q2. It contains the details of different cars and their features. We consider the following four variables and propose a probability model to quantify the uncertainty of each:
Present_Price: Normal Distribution
Fuel_Type: Multinomial Distribution
Transmission: Bernoulli Distribution
Kms_Driven: Gamma Distribution
The variables and their proposed probability models are described in detail below:
The ‘Present_Price’ column contains continuous numerical data that represents the car’s current ex-showroom pricing. Because the data is continuous, we can use a ‘Normal Distribution’ to model it. Using the sample mean and sample standard deviation of the Present Price column, we can estimate the normal distribution parameters (mean and standard deviation).
The ‘Fuel_Type’ column contains categorical data that represents the type of fuel that the car uses (petrol, diesel, or CNG). A ‘Multinomial Distribution’, which gives the probability of each of several possible outcomes, can be used to model it; in this case, the probabilities of the car using petrol, diesel, or CNG. Using the proportion of cars in the dataset that use each fuel type, we can estimate the parameters of the multinomial distribution (the probability of each fuel type).
The ‘Transmission’ column contains categorical data indicating whether the vehicle has a manual or an automatic transmission. A ‘Bernoulli Distribution’, which gives the probability of a single binary outcome, can be used to model it; in this case, the probability of the car having an automatic transmission. Using the proportion of cars in the dataset that have automatic transmissions, we can estimate the parameter of the Bernoulli distribution (the probability of the car having an automatic transmission).
The ‘Kms_Driven’ column contains continuous numerical data that represents the total number of kilometers driven by the vehicle. This statistic is always positive and frequently skewed to the right, with most automobiles having low mileage and a few having very high mileage. The ‘Gamma Distribution’ is an appropriate probability model for quantifying uncertainty for this variable.
data=read.csv(file.choose())
head(data)
Normal Distribution
X=data$Present_Price
# Estimate the parameters; mu and sigma avoid masking the mean() and sd() functions
mu = mean(X)
sigma = sd(X)
sim=rnorm(1000,mu,sigma)
pred=mean(sim)
pred
[1] 7.591407
# find Pr(8 < P < 20.9)
pnorm(20.9,mu,sigma)-pnorm(8,mu,sigma)
[1] 0.4205066
hist(data$Present_Price, main = "Normal Distribution - Present_Price", xlab = "Price", freq = FALSE)
curve(dnorm(x, mean = mu, sd = sigma), add = TRUE, col = "red")
Multinomial Distribution
X=data$Fuel_Type
t=table(X); t
X
CNG Diesel Petrol
2 60 239
mode=names(t)[which(t==max(t))]; print(mode)
[1] "Petrol"
p=t/sum(t);p
X
CNG Diesel Petrol
0.006644518 0.199335548 0.794019934
# Bar heights are the estimated probabilities of the three fuel types
barplot(p, main = "Multinomial Distribution - Fuel_Type", xlab = "Fuel Type", ylab = "Estimated probability", ylim = c(0, 1))
Bernoulli Distribution
transmission = data$Transmission
transmission_prop = table(transmission) / nrow(data)
# Bernoulli parameter: estimated probability of an automatic transmission
transmission_param = transmission_prop["Automatic"]
barplot(transmission_prop, main = "Bernoulli Distribution - Transmission", xlab = "Transmission", ylab = "Proportion", ylim = c(0, 1))
abline(h = transmission_param, col = "blue")
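As a rough illustration of the Bernoulli model, the sketch below builds its probability mass function from transmission counts. The counts 40 (Automatic) and 261 (Manual) are taken from the column totals of the Seller_Type by Transmission contingency table reported later in this document. The variable names are illustrative.

```r
# Transmission counts (column totals of the contingency table: 29 + 11 and 166 + 95)
counts <- c(Automatic = 40, Manual = 261)

# Estimated Bernoulli parameter: P(Automatic)
p_hat <- unname(counts["Automatic"] / sum(counts))

# Bernoulli pmf with X = 1 meaning "Automatic": P(X = 0) = 1 - p, P(X = 1) = p
pmf <- dbinom(c(0, 1), size = 1, prob = p_hat)
names(pmf) <- c("Manual", "Automatic")
pmf

# The most probable class serves as the point prediction
names(pmf)[which.max(pmf)]
```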
Gamma Distribution
X = data$Kms_Driven
E=mean(X); V=var(X)
#parameter estimation
lambda=E/V
alpha=lambda*E
lambda
[1] 2.443292e-05
alpha
[1] 0.902728
#P(X>6500)=1-P(X<6500)
1-pgamma(6500,alpha,lambda)
[1] 0.8168217
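The estimators above follow from the gamma moments E[X] = alpha / lambda and Var[X] = alpha / lambda^2, which give lambda = E / V and alpha = lambda * E. The sketch below checks this on simulated data with known parameters. The seed, sample size, and true values are arbitrary choices made for illustration.

```r
set.seed(1)

# Simulate from a gamma distribution with known shape and rate
true_shape <- 2
true_rate <- 0.5
x_sim <- rgamma(100000, shape = true_shape, rate = true_rate)

# Method-of-moments estimates, same formulas as above
rate_hat <- mean(x_sim) / var(x_sim)
shape_hat <- rate_hat * mean(x_sim)
c(shape = shape_hat, rate = rate_hat)
```

With a sample this large the estimates should land close to the true values, which suggests the formulas are applied in the right direction (rate, not scale).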
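# Illustrative gamma densities for two shape values (not the fitted parameters);
# lambda_demo keeps the fitted lambda from being overwritten
alpha1=5
alpha2=1
lambda_demo=.5
x=seq(0,15,0.01)
pdf1=dgamma(x,alpha1,lambda_demo)
pdf2=dgamma(x,alpha2,lambda_demo)
plot(x,pdf1,ylim=c(0,.5),col='red',main="Gamma Distribution - example shapes")
lines(x,pdf2,col='blue')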
c ) Express the way in which each model can be used for the predictive analytics, then find the prediction for each attribute.
Normal Distribution
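# Under the normal model the point prediction is the expected value,
# estimated by the sample mean of Present_Price
price = data$Present_Price
price_prediction = mean(price)
## prediction
c("The predicted price is", round(price_prediction, 2))
# The value equals the sample mean of Present_Price (the simulation above averaged about 7.59)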
Multinomial Distribution
x=c(1,1,0); n =2
dmultinom(x,n,p)
[1] 0.002648977
c('The predicted Fuel_type is:',mode)
[1] "The predicted Fuel_type is:" "Petrol"
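The dmultinom() call above evaluates n! / (x1! x2! x3!) * p1^x1 * p2^x2 * p3^x3. The sketch below recomputes that value by hand from the fuel-type counts shown earlier (CNG 2, Diesel 60, Petrol 239) and compares it with the built-in function. The variable names are illustrative.

```r
# Fuel-type counts from the frequency table above
fuel_counts <- c(CNG = 2, Diesel = 60, Petrol = 239)
p_fuel <- fuel_counts / sum(fuel_counts)

# Outcome of interest: one CNG car and one diesel car in two draws
draw <- c(CNG = 1, Diesel = 1, Petrol = 0)

# Multinomial probability written out from its formula
by_hand <- factorial(sum(draw)) / prod(factorial(draw)) * prod(p_fuel^draw)

# Same probability from base R (size defaults to sum(draw))
built_in <- dmultinom(draw, prob = p_fuel)
c(by_hand = by_hand, built_in = built_in)
```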
Bernoulli Distribution
trans = transmission_prop
transmission_pred = names(transmission_prop)[which.max(transmission_prop)]
c("The predicted transmission type is", transmission_pred)
[1] "The predicted transmission type is"
[2] "Manual"
Gamma Distribution
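# Refit the gamma parameters from Kms_Driven so the simulation uses the fitted rate
kms = data$Kms_Driven
rate = mean(kms)/var(kms); shape = rate*mean(kms)
kms_sim = rgamma(10000,shape,rate)
pred=mean(kms_sim)
c("The predicted Kms_Driven is", pred)
# pred varies from run to run but should be close to shape/rate = 0.902728/2.443292e-05, about 36947 km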
a ) Consider two categorical variables of the dataset, develop a binary decision making strategy to check whether two variables are independent at the significance level alpha=0.01
Step 1: State the hypotheses
H0: X1 (Seller_Type) and X2 (Transmission) are independent
H1: X1 (Seller_Type) and X2 (Transmission) are dependent
# Load the dataset
data=read.csv(file.choose())
# Define X1 and X2
X1=data$Seller_Type
X2=data$Transmission
Step 2: Set significance level
#set 0.01 as significance level
alpha = 0.01
# Obtain the contingency table
C_table=table(X1,X2)
c("Contingency Table :"); C_table
[1] "Contingency Table :"
X2
X1 Automatic Manual
Dealer 29 166
Individual 11 95
# Define the matrix
E=matrix(NA,2,2);E
[,1] [,2]
[1,] NA NA
[2,] NA NA
#calculate No. of rows
N = nrow(data);N
[1] 301
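# Expected count under independence: (row total * column total) / N
n_rows=nrow(C_table); n_cols=ncol(C_table)
for(i in 1:n_rows){
for(j in 1:n_cols){
Ci=sum(C_table[i,]);
Cj=sum(C_table[,j])
E[i,j]=(Ci*Cj)/N
}
}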
Step 3: Compute the test.value
test.value=sum((C_table-E)^2/E)
test.value
[1] 1.203807
Step 4: Find the c.value
# Degrees of freedom = (rows - 1) * (columns - 1) = 1 for a 2 x 2 table
c.value = qchisq(1-alpha,1)
c.value
[1] 6.634897
Step 5: Specify the decision rule
If test.value ≥ c.value, H0 is rejected; otherwise H0 is not rejected.
if (test.value < c.value){
c("H0 is not rejected: ie, X1 and X2 are independent")
}else{
c("H0 is rejected: ie, X1 and X2 are dependent")
}
[1] "H0 is not rejected: ie, X1 and X2 are independent"
Step 6: Make a decision and conclusion
Here test.value (1.203807) is less than c.value, so H0 is not rejected, where H0: X1 and X2 are independent.
Had the rejection condition in step 5 been satisfied, the two categorical variables would be judged dependent at the significance level α.
There is no evidence of dependence at the significance level α = 0.01, therefore the two categorical variables Seller_Type and Transmission are treated as independent.
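The manual computation can be cross-checked with base R's chisq.test(), which for an r x c table uses (r - 1)(c - 1) degrees of freedom, that is, 1 for this 2 x 2 table. The sketch below re-enters the observed counts from the contingency table above. Setting correct = FALSE is a choice made here so that the statistic matches the plain sum((O - E)^2 / E) formula.

```r
# Observed counts from the contingency table above (filled column by column)
observed <- matrix(c(29, 11, 166, 95), nrow = 2,
                   dimnames = list(Seller_Type = c("Dealer", "Individual"),
                                   Transmission = c("Automatic", "Manual")))

# correct = FALSE turns off Yates' continuity correction
res <- chisq.test(observed, correct = FALSE)
res$statistic   # chi-squared statistic
res$parameter   # degrees of freedom
res$p.value

# Reject H0 at the 0.01 level only if the p-value falls below it
res$p.value < 0.01
```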
b ) Consider one categorical variable, apply goodness of fit test to evaluate whether a candidate set of probabilities can be appropriate to quantify the uncertainty of class frequency at the significance level alpha=0.05.
Step 1: State the hypotheses
H0: p1=p2=p3=1/3
H1: Not H0
X=data$Owner
Step 2: Set significance level
alpha = 0.05
C_table2=table(X)
Probability =C_table2/sum(C_table2)
P0=rep(1/3,3)
N=nrow(data)
E=N*P0
Step 3: Compute the test.value
test.value=sum((C_table2-E)^2/E)
test.value
[1] 538.2126
Step 4: Find the c.value
# Degrees of freedom = number of categories - 1 (not the number of records - 1)
c.value = qchisq(1-alpha, length(P0)-1)
c.value
[1] 5.991465
Step 5: Specify the decision rule
If test.value ≥ c.value, H0 is rejected; otherwise H0 is not rejected.
if (test.value < c.value){
c("H0 is not rejected")
}else{
c("H0 is rejected")
}
[1] "H0 is rejected"
Step 6: Make a decision and conclusion
Here test.value (538.2126) is greater than c.value, therefore the null hypothesis (H0) is rejected in favour of H1: the three Owner categories are not equally likely.
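A goodness-of-fit test with k categories has k - 1 degrees of freedom, and base R's chisq.test() applies this automatically when given a vector of counts and candidate probabilities. The sketch below uses the counts 290, 10 and 1 for the three Owner categories. These counts are an assumption: the report does not print the frequency table, and the values were chosen only because they sum to 301 and reproduce the reported statistic.

```r
# ASSUMED counts for the three Owner categories (not printed in the report)
owner_counts <- c(290, 10, 1)

# Candidate probabilities under H0: all three categories equally likely
p0 <- rep(1/3, 3)

res <- chisq.test(x = owner_counts, p = p0)
res$statistic   # chi-squared statistic
res$parameter   # degrees of freedom = 3 - 1
res$p.value < 0.05
```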
c ) Consider one continuous variable in the data set, and apply test of mean for a proposed candidate of μ at the significance level alpha=0.05.
Y=data$Selling_Price
Step 1: State the hypotheses
Let Mu0 be 4.5 and let μ be the population mean of Selling_Price
H0: μ ≤ Mu0
H1: μ > Mu0
Step 2: Set significance level
alpha = 0.05
# Upper one sided test
Mu0 = 4.5
# Mean of Selling_Price
Y_bar=mean(Y);Y_bar
[1] 4.661296
# Obtain Standard Deviation
SD=sd(Y);SD
[1] 5.082812
N=length(Y);N
[1] 301
Step 3: Compute the test.value
test.value=(Y_bar-Mu0)/(SD/sqrt(N))
test.value
[1] 0.5505566
Step 4: Find the c.value
c.value = qnorm(1-alpha)
c.value
[1] 1.644854
Step 5: Specify the decision rule
If test.value ≥ c.value, H0 is rejected; otherwise H0 is not rejected.
if (test.value < c.value){
c("H0 is not rejected")
}else{
c("H0 is rejected")
}
[1] "H0 is not rejected"
Step 6: Make a decision and conclusion
Had the rejection condition in step 5 been satisfied, the data would support a mean greater than Mu0.
Here test.value (0.5505566) is less than c.value (1.644854), therefore the null hypothesis (H0) is not rejected: there is no evidence at α = 0.05 that the mean Selling_Price exceeds 4.5.
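The six steps can be condensed into one function that works from summary statistics. The sketch below is a large-sample z-test for an upper one-sided alternative, fed with the mean, standard deviation, and sample size reported above. The function name z_test_upper and its argument names are illustrative assumptions.

```r
# Hypothetical helper: upper one-sided z-test of H0: mu <= mu0 against H1: mu > mu0.
z_test_upper <- function(y_bar, s, n, mu0, alpha = 0.05) {
  z <- (y_bar - mu0) / (s / sqrt(n))     # test statistic
  critical <- qnorm(1 - alpha)           # upper-tail critical value
  list(z = z,
       critical = critical,
       p_value = pnorm(z, lower.tail = FALSE),
       reject = z >= critical)
}

# Summary statistics of Selling_Price as reported in this section
res <- z_test_upper(y_bar = 4.661296, s = 5.082812, n = 301, mu0 = 4.5)
res$z
res$p_value
res$reject
```

Reporting the p-value alongside the critical-value comparison makes it easier to see how far the statistic is from the rejection region.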