Data visualization homework no.1

  1. Choose a data set with more than 5 attributes and explain why it is important or interesting to you.
  2. Formulate research questions (for which you expect to find the answers).
  3. Make some visualizations for the formulated questions.
  4. Prepare a presentation (where you explain the data, questions, problems, results) and upload it.

  library(readr)
  library(tidyverse)
  library(DataExplorer)
  library(treemap)
  library(scales)
  library(dplyr)
  library(ggplot2)
  cars <- read_csv("train.csv")

Preparing the data set for visualization.

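  data <- select(cars, -ID, -Doors) # Drop the ID and Doors columns
  data <- as.data.frame(data)
  data <- data[rowSums(is.na(data)) == 0, ] # Drop rows that contain NA values
  data1 <- data[data$Fueltype != "Hydrogen", ] # Exclude hydrogen cars (data1 is used in Question 1)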
  head(data)
##   Price Levy Manufacturer   Model Prod.year  Category Leatherinterior Fueltype
## 1 13328 1399        LEXUS   RX450      2010      Jeep             Yes   Hybrid
## 2 16621 1018    CHEVROLET Equinox      2011      Jeep              No   Petrol
## 3  8467    -        HONDA     FIT      2006 Hatchback              No   Petrol
## 4  3607  862         FORD  Escape      2011      Jeep             Yes   Hybrid
## 5 11726  446        HONDA     FIT      2014 Hatchback             Yes   Petrol
## 6 39493  891      HYUNDAI SantaFE      2016      Jeep             Yes   Diesel
##   Enginevolume Mileage Cylinders Gearboxtype Drivewheels           Wheel  Color
## 1          3.5  186005         6   Automatic         4x4       Leftwheel Silver
## 2            3  192000         6   Tiptronic         4x4       Leftwheel  Black
## 3          1.3  200000         4    Variator       Front Right-handdrive  Black
## 4          2.5  168966         4   Automatic         4x4       Leftwheel  White
## 5          1.3   91901         4   Automatic       Front       Leftwheel Silver
## 6            2  160931         4   Automatic       Front       Leftwheel  White
##   Airbags
## 1      12
## 2       8
## 3       2
## 4       0
## 5       4
## 6       4
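
In the preview above, the third row has `-` in the Levy column, which suggests that missing levies are stored as a dash rather than as NA. If so, the `rowSums(is.na(data))` filter would not drop those rows. A minimal sketch of one way to handle this, using a small data frame `toy` (a hypothetical name) that copies Price and Levy from the first three preview rows:

```r
# Miniature stand-in for the cars table: Price and Levy of the first three
# rows shown in the head() output. `toy` is a hypothetical name.
toy <- data.frame(
  Price = c(13328, 16621, 8467),
  Levy  = c("1399", "1018", "-"),
  stringsAsFactors = FALSE
)

# Treat "-" as a missing value, then convert the column to numeric.
toy$Levy[toy$Levy == "-"] <- NA
toy$Levy <- as.numeric(toy$Levy)

# The same complete-row filter used above now also drops the dash row.
toy_complete <- toy[rowSums(is.na(toy)) == 0, ]
print(toy_complete)
```

With readr, passing `na = c("", "NA", "-")` to `read_csv()` should have a similar effect at load time, although it applies to every column. Whether rows without a levy should be dropped at all is a separate decision that depends on the question being asked.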

This dataset is published at https://www.kaggle.com/sidharth178/car-prices-dataset for car price prediction. Cars now come with a wide variety of capabilities and features, such as model, production year, category, brand, fuel type, engine volume, mileage, cylinders, colour, airbags and many more, and the author poses a car price prediction challenge based on them.

I chose this dataset for its variety of variables and its current relevance.

Question 1. What kind of transmission is the most common for each fuel type?

ggplot(data = data1, aes(x = Fueltype, fill = Gearboxtype)) +
       geom_bar() +
       xlab('FUEL TYPE') +
       ylab('COUNT') +
       ggtitle('GEARBOX TYPE BASED ON FUEL TYPE') +
       labs(fill = "GEARBOX TYPE")

Answer: The most common combination is petrol with automatic transmission. Among petrol cars, the continuously variable transmission (labelled Variator in the data) is the least common. The second most popular fuel type is diesel, also with automatic transmission.
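
The stacked bars answer the question visually. The same answer can also be read from a contingency table. The sketch below uses made-up rows in a hypothetical data frame `toy`. On the real data the two columns would be `data1$Fueltype` and `data1$Gearboxtype`.

```r
# Made-up sample for illustration only; not taken from train.csv.
toy <- data.frame(
  Fueltype    = c("Petrol", "Petrol", "Petrol", "Diesel", "Diesel", "Diesel", "Hybrid"),
  Gearboxtype = c("Automatic", "Automatic", "Tiptronic", "Manual", "Manual",
                  "Automatic", "Automatic"),
  stringsAsFactors = FALSE
)

# Cross-tabulate fuel type (rows) against gearbox type (columns).
counts <- table(toy$Fueltype, toy$Gearboxtype)

# For each fuel type, find the column with the largest count.
# which.max() returns the first maximum, so ties go to the first column.
best_idx <- apply(counts, 1, which.max)
most_common <- setNames(colnames(counts)[best_idx], rownames(counts))

print(counts)
print(most_common)
```

A table like this is a useful check when the differences between bar segments are too small to judge by eye.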

Question 2. Which car brand is the most common in the data?

model_count <- data %>%
    group_by(Manufacturer)%>%
    summarize(count=n())%>%
    arrange(desc(count))
model_count_10 <- model_count[1:10,]
  
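# One bar per manufacturer, ordered from the highest count to the lowest.
ggplot(model_count_10, aes(x = reorder(Manufacturer, -count), y = count, fill = Manufacturer)) +
       geom_col() +
       scale_fill_brewer(palette = "Paired") +
       theme_minimal() +
       theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
       labs(x = "MANUFACTURER", y = "COUNT", fill = "MANUFACTURER") +
       ggtitle('TOP 10 BIGGEST CAR MANUFACTURERS')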

Answer: The most common manufacturer is Hyundai with 3769 cars. Toyota is second (3661) and Mercedes-Benz is third (2073).

Question 3. Which manufacturers' cars are the most expensive on average?

# Average price per manufacturer, top 10, drawn as horizontal bars.
data %>% group_by(Manufacturer) %>% summarise(AVERAGE = mean(Price)) %>%
    arrange(desc(AVERAGE)) %>% head(10) %>%
    ggplot(aes(x = reorder(Manufacturer, AVERAGE), y = AVERAGE, fill = factor(Manufacturer))) +
    geom_col() + coord_flip() +
    theme_light() +
    ggtitle("TOP 10 MANUFACTURERS BY AVERAGE PRICE") + xlab("MANUFACTURER") + ylab("AVERAGE PRICE") +
    theme(legend.position = "none")

Answer: This graph shows the 10 manufacturers with the highest average price in the data. Lamborghini is clearly the most expensive.
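
An average price per manufacturer can be dominated by a single very expensive listing, or can rest on only one or two cars. The sketch below puts the mean next to the median and the number of listings to make that visible. The data frame `toy`, the manufacturer names "A" and "B", and all prices are made up for illustration.

```r
# Made-up prices: manufacturer A has one extreme listing, B has a single car.
toy <- data.frame(
  Manufacturer = c("A", "A", "A", "A", "B"),
  Price        = c(10000, 12000, 11000, 500000, 90000),
  stringsAsFactors = FALSE
)

# Per-manufacturer listing count, mean price and median price.
n   <- tapply(toy$Price, toy$Manufacturer, length)
avg <- tapply(toy$Price, toy$Manufacturer, mean)
med <- tapply(toy$Price, toy$Manufacturer, median)

summary_tbl <- data.frame(
  Manufacturer = names(n),
  n            = as.vector(n),
  mean         = as.vector(avg),
  median       = as.vector(med)
)

# Sort by mean price, highest first, as in the plot above.
summary_tbl <- summary_tbl[order(-summary_tbl$mean), ]
print(summary_tbl)
```

Here A ranks first by mean but far below B by median, and B's figure is based on one car. Checking `n` and the median alongside the mean helps judge how far a ranking of this kind can be trusted.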

Question 4. What is the most popular car color?

color <- data %>%
    group_by(Color) %>%
    summarize(count = n()) %>%
    arrange(desc(count))

# Widen the bottom margin so the rotated color labels fit.
par(mar = c(7, 4, 2, 2) + 0.2)
# The fill colors must be listed in the same order as the sorted counts.
barplot(color$count, names.arg = color$Color, las = 2, main = "Cars popularity by color",
        col = c("Black", "White", "lightgray", "Grey40", "Blue", "Red", "Green", "Orange",
                "saddlebrown", "orangered3", "goldenrod2", "beige", "skyblue1", "yellow",
                "purple", "pink"))

Answer: The most popular colors are black and white, and the least popular are purple and pink.
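
The `barplot()` call above depends on the fill colors being listed in the same order as the sorted counts, so a change in the data could silently mismatch bars and colors. A named lookup vector is one less fragile alternative. This is only a sketch: `color_counts`, `palette_lookup` and the counts are hypothetical, and only three of the color categories are shown.

```r
# Made-up counts for three color categories, already sorted by count.
color_counts <- data.frame(
  Color = c("Black", "White", "Silver"),
  count = c(5, 4, 3),
  stringsAsFactors = FALSE
)

# Map each category name to an R color name. The order here does not matter.
palette_lookup <- c(Silver = "lightgray", White = "white", Black = "black")

# Look up the fill color for each bar by category name.
bar_cols <- unname(palette_lookup[as.character(color_counts$Color)])

# Fall back to a neutral grey for any category missing from the lookup.
bar_cols[is.na(bar_cols)] <- "grey70"
print(bar_cols)
```

The resulting vector can be passed as `col = bar_cols` to `barplot()`, and it stays correct however the rows are sorted.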

Question 5. Which car body types are the most common, and do they have a leather interior?

body_plot <- ggplot(data) +
    geom_bar(aes(x = Category, fill = Leatherinterior), position = position_dodge(preserve = 'single')) +
    ggtitle("CAR BODY TYPES BY LEATHER INTERIOR") + xlab("CATEGORY") + ylab("COUNT") +
    labs(fill = "LEATHER INTERIOR")
# Rotate the x-axis labels so the category names do not overlap.
body_plot + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Answer: The most common combination is a sedan with a leather interior.
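
Raw counts favor the largest categories, so it can also help to look at the share of leather interiors within each body type. A minimal sketch with `prop.table()`, using made-up rows in a hypothetical data frame `toy`:

```r
# Made-up sample for illustration only; not taken from train.csv.
toy <- data.frame(
  Category        = c("Sedan", "Sedan", "Sedan", "Jeep", "Jeep", "Hatchback"),
  Leatherinterior = c("Yes", "Yes", "No", "Yes", "No", "No"),
  stringsAsFactors = FALSE
)

# Counts per body type (rows) and leather interior (columns).
counts <- table(toy$Category, toy$Leatherinterior)

# Row-wise shares: each row sums to 1.
shares <- prop.table(counts, margin = 1)

print(counts)
print(round(shares, 2))
```

On the real data the same two calls would take `data$Category` and `data$Leatherinterior`. The shares show whether leather interiors are typical for a body type regardless of how many cars of that type are listed.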