Step 1: Find the Data

When determining which dataset to use, I followed the guidelines provided in the assignment:

  1. The sample size must be at least 20.

  2. The data must include at least 3 different variables.

  3. Out of the three variables, at least one variable must be categorical with 2 (up to 5 possible) categories.

Consequently, I decided to use the “mtcars” dataset, which is already integrated in R.
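
As a quick sanity check, the three guidelines can be tested directly against the dataset. This is only a sketch: the object names (n_obs, n_vars, n_levels, categorical_candidates) are my own, and treating every variable with 2 to 5 distinct values as a categorical candidate is an assumption, not part of the assignment.

```r
# Check the three assignment guidelines against the built-in mtcars dataset.
n_obs  <- nrow(mtcars)   # sample size
n_vars <- ncol(mtcars)   # number of variables

# Count the distinct values of each variable. Variables with 2 to 5
# distinct values are treated here (by assumption) as categorical candidates.
n_levels <- sapply(mtcars, function(x) length(unique(x)))
categorical_candidates <- names(n_levels)[n_levels >= 2 & n_levels <= 5]

n_obs >= 20              # guideline 1
n_vars >= 3              # guideline 2
categorical_candidates   # guideline 3
```

Note that cyl and gear are stored as numbers, so this check only flags them as candidates; whether to treat them as categories is a judgment call.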

Step 2: Import the Data Into RStudio

#Create a data frame using the data.frame function
data <- data.frame(mtcars)

Step 3: Displaying the Data

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

The head function allows us to gain a brief insight into the dataset. For example, here we get an overview of the 11 variables and the first 6 cars from the sample.

Step 4: Explain Your Data

The unit of observation in the mtcars dataset is individual car models from 1973 - 1974.

The sample size (n) is 32, where each unit represents an individual car model from 1973 - 1974.
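
The sample size and the unit of observation can be confirmed from the data frame itself. The following is a minimal sketch that relies only on the built-in mtcars object, where each row name is a car model.

```r
# Rows are the units of observation (car models), columns are the variables.
dim(mtcars)

# The row names hold the car model for each observation.
head(rownames(mtcars), 3)

# anyDuplicated() returns 0 when no model name appears more than once.
anyDuplicated(rownames(mtcars))
```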

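The dataset includes 11 variables with definitions and units as follows:

  1. mpg - Miles per (US) gallon
  2. cyl - Number of cylinders
  3. disp - Displacement (cubic inches)
  4. hp - Gross horsepower
  5. drat - Rear axle ratio
  6. wt - Weight (1000 lbs)
  7. qsec - Quarter mile time (seconds)
  8. vs - Engine configuration (0 = V-shaped, 1 = straight)
  9. am - Transmission type (0 = automatic, 1 = manual)
  10. gear - Number of forward gears
  11. carb - Number of carburetors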

Step 5: Source of the Data

The data used in the mtcars dataset was extracted from the 1974 Motor Trend US magazine.

Step 6: Data Manipulation

The two categorical variables engine configuration (vs) and transmission type (am) were transformed into factor variables to make the graphical analysis easier and their interpretation more straightforward.

data$VSFactor <- factor(data$vs, 
                              levels = c(0, 1), 
                              labels = c("V-shaped", "Straight"))

data$AMFactor <- factor(data$am, 
                              levels = c(0, 1), 
                              labels = c("Automatic", "Manual"))
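
To make sure the labels were attached to the intended codes, the original 0/1 values can be cross-tabulated against the new factors. This is a sketch on a separate copy of the data (the name cars and the two *_check tables are hypothetical), so it does not alter the data frame used in the rest of the report.

```r
# Work on a copy so the check is self-contained.
cars <- mtcars
cars$VSFactor <- factor(cars$vs, levels = c(0, 1),
                        labels = c("V-shaped", "Straight"))
cars$AMFactor <- factor(cars$am, levels = c(0, 1),
                        labels = c("Automatic", "Manual"))

# Rows are the original codes, columns are the new labels.
# Every car should fall on the diagonal (0 with the first label, 1 with the second).
vs_check <- table(code = cars$vs, label = cars$VSFactor)
am_check <- table(code = cars$am, label = cars$AMFactor)
vs_check
am_check
```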

Step 7: Descriptive Statistics

#install.packages("psych")
library(psych)
describe(data)
##           vars  n   mean     sd median trimmed    mad   min    max  range  skew
## mpg          1 32  20.09   6.03  19.20   19.70   5.41 10.40  33.90  23.50  0.61
## cyl          2 32   6.19   1.79   6.00    6.23   2.97  4.00   8.00   4.00 -0.17
## disp         3 32 230.72 123.94 196.30  222.52 140.48 71.10 472.00 400.90  0.38
## hp           4 32 146.69  68.56 123.00  141.19  77.10 52.00 335.00 283.00  0.73
## drat         5 32   3.60   0.53   3.70    3.58   0.70  2.76   4.93   2.17  0.27
## wt           6 32   3.22   0.98   3.33    3.15   0.77  1.51   5.42   3.91  0.42
## qsec         7 32  17.85   1.79  17.71   17.83   1.42 14.50  22.90   8.40  0.37
## vs           8 32   0.44   0.50   0.00    0.42   0.00  0.00   1.00   1.00  0.24
## am           9 32   0.41   0.50   0.00    0.38   0.00  0.00   1.00   1.00  0.36
## gear        10 32   3.69   0.74   4.00    3.62   1.48  3.00   5.00   2.00  0.53
## carb        11 32   2.81   1.62   2.00    2.65   1.48  1.00   8.00   7.00  1.05
## VSFactor*   12 32   1.44   0.50   1.00    1.42   0.00  1.00   2.00   1.00  0.24
## AMFactor*   13 32   1.41   0.50   1.00    1.38   0.00  1.00   2.00   1.00  0.36
##           kurtosis    se
## mpg          -0.37  1.07
## cyl          -1.76  0.32
## disp         -1.21 21.91
## hp           -0.14 12.12
## drat         -0.71  0.09
## wt           -0.02  0.17
## qsec          0.34  0.32
## vs           -2.00  0.09
## am           -1.92  0.09
## gear         -1.07  0.13
## carb          1.26  0.29
## VSFactor*    -2.00  0.09
## AMFactor*    -1.92  0.09
  1. mpg - On average a car in this dataset can travel 20.09 miles per US gallon of fuel.
  2. cyl - The minimum number of cylinders in at least one car is 4 and the maximum number of cylinders in at least one car is 8. Therefore, the difference (range) between minimum and maximum number of cylinders is 4.
  3. disp - On average, a car’s displacement deviates by 123.94 cubic inches from the mean displacement of cars in the dataset (the standard deviation).
  4. hp - The maximum horsepower of at least one of the 32 cars in the sample is 335.
  5. drat - The median rear axle ratio is 3.7, meaning that 50% of the cars in the sample had a drat of 3.7 or less and 50% had 3.7 or higher.
  6. wt - The trimmed mean weight of the sample is 3.15 thousand pounds. This is only slightly lower than the regular mean of 3.22. The trimmed mean excludes the lowest and the highest 10% of weights, so the small difference suggests that the heaviest cars pull the regular mean upward only slightly.
  7. qsec - The average quarter mile time is 17.85 seconds.
  8. vs - The median value for the engine configuration is 0. This means that at least 50% of cars have a v-shaped engine.
  9. am - The median value for the transmission type is 0. This means that at least 50% of cars have an automatic transmission type.
  10. gear - The maximum number of gears of any car in the dataset is 5, while the minimum is 3. This means that the difference between the minimum and maximum number of gears in cars in the dataset is 2.
  11. carb - The average number of carburetors is 2.81.
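
The figures for weight in the table can be reproduced with base R, which also makes the trimmed mean easier to interpret. This is a sketch that assumes the default 10% trimming of psych::describe; the names wt, wt_summary, and n_trim are my own.

```r
wt <- mtcars$wt

# Mean, 10% trimmed mean, median and standard deviation of weight.
wt_summary <- c(mean    = mean(wt),
                trimmed = mean(wt, trim = 0.1),
                median  = median(wt),
                sd      = sd(wt))
wt_summary

# With trim = 0.1, floor(0.1 * n) observations are dropped from EACH end
# of the sorted data before the mean is taken.
n_trim <- floor(0.1 * length(wt))
head(sort(wt), n_trim)   # lightest cars excluded from the trimmed mean
tail(sort(wt), n_trim)   # heaviest cars excluded from the trimmed mean
```

Because values are removed from both ends, the trimmed mean falls below the regular mean only when the excluded heavy cars lie further from the center than the excluded light cars.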

Step 8: Graphical Analysis

Histograms

Histograms are best used when displaying continuous numerical data. Therefore, the variables miles per gallon (mpg), displacement (disp), horsepower (hp), weight (wt), quarter mile time (qsec), and rear axle ratio (drat) were visualized using histograms.

# Miles per Gallon (mpg)
hist(mtcars$mpg, 
     main = "Distribution of Miles per Gallon (mpg)", 
     xlab = "Miles per Gallon", 
     ylab = "Frequency",
     col = "lightblue",
     breaks = seq(from = 10, to = 35, by = 1))

The histogram shows the distribution of miles per gallon (mpg) for the sample of 32 cars. There are multiple things to note in this histogram:

Firstly, the histogram appears to be skewed to the right where the tail is longer. This means that more cars have a lower mpg value and fewer have a high value.

Secondly, most of the data is concentrated in the 15 to 22 miles per gallon range. This means that the majority of cars have an mpg between these values.

Thirdly, the distribution somewhat looks bi-modal with two peaks at around 15 and 30 mpg. This indicates that there are two groups of cars within the dataset that have different average mpg values.

Lastly, there are no extreme outliers present.
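
Observations read off a histogram can be backed up with the underlying bin counts. The sketch below reuses the same breaks as the plot but suppresses the drawing; the names mpg_bins and bin_counts are hypothetical.

```r
# Recompute the histogram bins without drawing the plot.
mpg_bins <- hist(mtcars$mpg,
                 breaks = seq(from = 10, to = 35, by = 1),
                 plot = FALSE)

# Number of cars in each 1-mpg bin, labelled by the bin's lower edge.
bin_counts <- setNames(mpg_bins$counts, head(mpg_bins$breaks, -1))
bin_counts

# A mean above the median is one rough indication of right skew.
c(mean = mean(mtcars$mpg), median = median(mtcars$mpg))
```

The same approach works for the other histograms in this step by swapping in the variable and its breaks.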

# Displacement (disp)
hist(mtcars$disp, 
     main = "Distribution of Displacement (disp)", 
     xlab = "Displacement (cu.in.)", 
     ylab = "Frequency",
     col = "lightgreen",
     breaks = seq(from = 0, to = 500, by = 10))

The histogram shows the distribution of displacement for the sample of 32 cars. There are multiple things to note in this histogram:

Firstly, there are multiple peaks in the histogram, indicating that it is multimodal. This might indicate that there are multiple types of cars in the dataset which have different average displacement values.

Secondly, the data appears to center around two peaks: 80-150 and 280-370. This indicates that the majority of cars have a displacement of approximately these values.

Thirdly, the range of values in the sample is quite large (roughly 70 to 470 cubic inches), which means that displacement varies widely between cars.

# Horsepower (hp)
hist(mtcars$hp, 
     main = "Distribution of Horsepower (hp)", 
     xlab = "Horsepower", 
     ylab = "Frequency",
     col = "salmon",
     breaks = seq(from = 10, to = 400, by = 10))

The histogram shows the distribution of horsepower for the sample of 32 cars. There are multiple things to note in this histogram:

Firstly, there are multiple peaks in this distribution, indicating a multimodal distribution. Hence, there are multiple groups of cars with different horsepower values (one group around 60 hp, another around 100 hp, and another around 170 hp).

Secondly, the data appears to be centered around the 90 hp to 140 hp range. Hence, most types of cars have a horsepower value in this range.

Thirdly, the distribution appears to be skewed to the right where the tail is longer. Therefore, there are more cars with lower horsepower than cars with high horsepower in the sample.

Lastly, there appear to be no extreme outliers.

# Weight (wt)
hist(mtcars$wt, 
     main = "Distribution of Weight (wt)", 
     xlab = "Weight (1000 lbs)", 
     ylab = "Frequency",
     col = "purple",
     breaks = seq(from = 1, to = 6, by = 0.2))

The histogram shows the distribution of weight for the sample of 32 cars. There are multiple things to note in this histogram:

Firstly, the data has one clear peak at around 3.6 (thousand pounds). Therefore, this is the most common weight range in the sample.

Secondly, the data appears to be centered around 3 (thousand pounds), indicating that most cars have a weight close to this value.

Thirdly, there do not appear to be any outliers or skewness.

# Quarter Mile Time (qsec)
hist(mtcars$qsec, 
     main = "Distribution of 1/4 Mile Time (qsec)", 
     xlab = "1/4 Mile Time (seconds)", 
     ylab = "Frequency",
     col = "skyblue",
     breaks = seq(from = 10, to = 25, by = 0.5))

The histogram shows the distribution of quarter mile time (in seconds) for the sample of 32 cars. There are multiple things to note in this histogram:

Firstly, the distribution appears to be bimodal, with one peak at around 17.5 seconds and another at around 19 seconds. Hence, the most common quarter mile times are close to these two values.

Secondly, the data appears to be clustered around 18 seconds, meaning that the majority of cars have a value similar to this one.

Thirdly, there do not appear to be any outliers and the distribution does not appear skewed.

# Rear Axle Ratio (drat)
hist(mtcars$drat, 
     main = "Distribution of Rear Axle Ratio (drat)", 
     xlab = "Rear Axle Ratio", 
     ylab = "Frequency",
     col = "peachpuff",
     breaks = seq(from = 2.5, to = 5, by = 0.1))

The histogram shows the distribution of rear axle ratio for the sample of 32 cars. There are multiple things to note in this histogram:

Firstly, the distribution appears to be multimodal, with peaks at around 3.0, 3.6, 3.8, and 3.9. Hence, the most common rear axle ratios are equal or similar to these values.

Secondly, there appear to be two clusters of data, one centered around 3.0 and another centered around 3.7. Hence, the majority of cars have a rear axle ratio of approximately these values.

Lastly, there do not appear to be any significant outliers or significant skewness.

Boxplots

Boxplots are useful when comparing distributions of numerical data across multiple groups. Therefore, I have decided to use boxplots to compare miles per gallon across the categories of the variables cylinders, gears, carburetors, engine configuration, and transmission type.

# Cylinders (cyl)
boxplot(mtcars$mpg ~ mtcars$cyl, 
        main = "MPG by Number of Cylinders", 
        xlab = "Number of Cylinders", 
        ylab = "Miles per Gallon",
        col = "lightblue")

These boxplots show the miles per gallon for each number of cylinders present in the sample of 32 cars. There are multiple things to note in these boxplots:

Firstly, the cars with four cylinders have a much larger range of mpg values than cars with six or eight cylinders. Furthermore, their median is approximately 26, which is higher than that of both other groups (approximately 20 and 15 mpg). Hence, it appears that in general the mpg decreases as the number of cylinders increases. Also, the interquartile range is much larger for the four-cylinder cars than for the other two groups, indicating that their mpg values vary much more. The interquartile ranges for the six- and eight-cylinder cars are similar; however, there are more extreme values for the eight-cylinder cars, with one outlier at around 10 mpg.
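
The values read off the boxplots can be checked numerically. The sketch below computes the median and interquartile range of mpg per cylinder group and lists the points that boxplot() flags as outliers among the eight-cylinder cars; the object names are my own.

```r
# Median and interquartile range of mpg within each cylinder group.
# IQR() uses quantiles, which can differ slightly from the hinges
# that boxplot() draws, but the comparison between groups is the same.
mpg_median_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, median)
mpg_iqr_by_cyl    <- tapply(mtcars$mpg, mtcars$cyl, IQR)
mpg_median_by_cyl
mpg_iqr_by_cyl

# Values beyond 1.5 times the box length from the hinges, i.e. the
# points boxplot() draws individually for the 8-cylinder cars.
out_8cyl <- boxplot.stats(mtcars$mpg[mtcars$cyl == 8])$out
out_8cyl
```

If several cars share the same outlying value, they are drawn on top of each other in the plot, so a single point can represent more than one car.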

# Gears (gear)
boxplot(mtcars$mpg ~ mtcars$gear, 
        main = "MPG by Number of Gears", 
        xlab = "Number of Gears", 
        ylab = "Miles per Gallon",
        col = "lightgreen")

These boxplots show the miles per gallon for each number of gears present in the sample of 32 cars. There are multiple things to note in these boxplots:

Firstly, the median mpg is highest for cars with four gears. Hence, the best fuel efficiency is found not at the lowest or highest number of gears but in the middle. Further, the IQR is smallest for cars with three gears and largest for cars with five gears. Hence, the variation in mpg increases as the number of gears increases. Lastly, there are no outliers; the maximum mpg is achieved by a car with four gears and the minimum mpg by a car with three gears.

# Carburetors (carb)
boxplot(mtcars$mpg ~ mtcars$carb, 
        main = "MPG by Number of Carburetors", 
        xlab = "Number of Carburetors", 
        ylab = "Miles per Gallon",
        col = "salmon")

These boxplots show the miles per gallon for each number of carburetors present in the sample of 32 cars. There are multiple things to note in these boxplots:

Firstly, with the exception of six carburetors, the median value of mpg appears to decrease with the number of carburetors. Because there is only one car with six carburetors, it is difficult to judge whether it contradicts this trend. More cars with six carburetors would be required to really contradict or support it. The same goes for the one car with eight carburetors.

Secondly, the range of mpg values is largest for the cars with one and two carburetors, so fuel efficiency varies considerably within these groups. Notably, cars with three carburetors have a much smaller range than those with one, two, or four carburetors. Again, while the range for cars with six and eight carburetors is minimal, this is due to their sample size being one. Hence, more data is required to make reliable assertions about these groups.
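
How much weight each boxplot can bear depends on the number of cars behind it. A frequency table of carb shows the group sizes; the names carb_counts and single_car_groups are hypothetical.

```r
# Number of cars for each carburetor count.
carb_counts <- table(mtcars$carb)
carb_counts

# Groups that consist of a single car and therefore have no spread at all.
single_car_groups <- names(carb_counts)[carb_counts == 1]
single_car_groups
```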

# Engine Configuration (vs) with MPG
boxplot(mpg ~ VSFactor, 
        data = data, 
        main = "MPG by Engine Configuration", 
        xlab = "Engine Configuration", 
        ylab = "Miles per Gallon",
        col = "lightgray")

These boxplots show the miles per gallon for each engine configuration present in the sample of 32 cars. There are multiple things to note in these boxplots:

Firstly, the median mpg is higher for cars with a straight engine configuration than for cars with a v-shaped engine configuration. Specifically, the median mpg for cars with a straight engine configuration is approximately 23. This means that 50% of these cars have an mpg of 23 or higher and 50% have an mpg of 23 or lower.

Secondly, the range of mpg is larger for cars with a straight engine configuration. This means that there is more variation in mpg for those cars than for cars with a v-shaped engine configuration. Nevertheless, they tend to be more fuel efficient than the v-shaped engine cars. However, it should be noted that there is one outlier in the v-shaped category that has very high mpg.

# Transmission Type (am) with MPG
boxplot(mpg ~ AMFactor, 
        data = data, 
        main = "MPG by Transmission Type", 
        xlab = "Transmission Type", 
        ylab = "Miles per Gallon",
        col = "aquamarine")

These boxplots show the miles per gallon for each transmission type present in the sample of 32 cars. There are multiple things to note in these boxplots:

Firstly, the median mpg is higher for manual cars than for automatic cars. For manual cars it is approximately 23 and for automatic cars it is approximately 17. Hence, for manual cars 50% of cars have an mpg of 23 or below and 50% have an mpg of 23 or above. The same holds for automatic cars, but centered around 17 mpg.

Secondly, the range of values for the manual cars is much larger than for the automatic cars. However, generally the manual cars are still more fuel efficient than the automatic cars.
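
The medians and ranges behind this comparison can be computed rather than read off the plot. This is a minimal sketch that rebuilds the transmission labels locally (am_labels is a hypothetical name) so that it runs on its own.

```r
# Label the transmission codes (0 = automatic, 1 = manual).
am_labels <- factor(mtcars$am, levels = c(0, 1),
                    labels = c("Automatic", "Manual"))

# Median mpg and width of the mpg range for each transmission type.
mpg_median_by_am <- tapply(mtcars$mpg, am_labels, median)
mpg_range_by_am  <- tapply(mtcars$mpg, am_labels,
                           function(x) diff(range(x)))
mpg_median_by_am
mpg_range_by_am
```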

Scatter Plot

Scatter plots are valuable when you want to compare two variables, specifically two continuous variables. Therefore, I chose to compare the miles per gallon and weight variables.

plot(mtcars$wt, mtcars$mpg,
     main = "MPG vs. Weight",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per Gallon",
     pch = 19,
     col = "blue")
abline(lm(mpg ~ wt, data = mtcars), col = "red")

This scatter plot shows a negative relationship between mpg and weight. This means that as the car becomes heavier, the fuel efficiency becomes worse. This trend is indicated by the red trendline which is drawn through the 32 data points.
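
The strength of this relationship can be quantified with the same linear model that draws the trendline. The sketch below stores the model in a hypothetical object called fit and reports its coefficients together with the Pearson correlation.

```r
# Fit the simple linear regression used for the red trendline.
fit <- lm(mpg ~ wt, data = mtcars)

# The intercept and the slope: the slope is the estimated change in mpg
# for each additional 1000 lbs of weight.
coef(fit)

# Pearson correlation between weight and mpg (negative = downward trend).
wt_mpg_cor <- cor(mtcars$wt, mtcars$mpg)
wt_mpg_cor
```

A negative slope and a correlation close to -1 are both consistent with the downward trend seen in the plot, although neither shows that weight alone causes the lower fuel efficiency.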