Reading the data

File:

data <- read.csv("C:/Users/kristinarako/Downloads/CARPRICE.csv") 

Data table:

library(DT)
datatable(data, options = list(scrollX = TRUE, pageLength = 5, searching = FALSE, scroller = TRUE, scrollY = 200))

Dataset structure

Overview of the dataset structure:

str(data)
## 'data.frame':    205 obs. of  26 variables:
##  $ car_ID          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ symboling       : int  3 3 1 2 2 2 1 1 1 0 ...
##  $ CarName         : chr  "alfa-romero giulia" "alfa-romero stelvio" "alfa-romero Quadrifoglio" "audi 100 ls" ...
##  $ fueltype        : chr  "gas" "gas" "gas" "gas" ...
##  $ aspiration      : chr  "std" "std" "std" "std" ...
##  $ doornumber      : chr  "two" "two" "two" "four" ...
##  $ carbody         : chr  "convertible" "convertible" "hatchback" "sedan" ...
##  $ drivewheel      : chr  "rwd" "rwd" "rwd" "fwd" ...
##  $ enginelocation  : chr  "front" "front" "front" "front" ...
##  $ wheelbase       : num  88.6 88.6 94.5 99.8 99.4 ...
##  $ carlength       : num  169 169 171 177 177 ...
##  $ carwidth        : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
##  $ carheight       : num  48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
##  $ curbweight      : int  2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
##  $ enginetype      : chr  "dohc" "dohc" "ohcv" "ohc" ...
##  $ cylindernumber  : chr  "four" "four" "six" "four" ...
##  $ enginesize      : int  130 130 152 109 136 136 136 136 131 131 ...
##  $ fuelsystem      : chr  "mpfi" "mpfi" "mpfi" "mpfi" ...
##  $ boreratio       : num  3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
##  $ stroke          : num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
##  $ compressionratio: num  9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
##  $ horsepower      : int  111 111 154 102 115 110 110 110 140 160 ...
##  $ peakrpm         : int  5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
##  $ citympg         : int  21 21 19 24 18 19 19 19 17 16 ...
##  $ highwaympg      : int  27 27 26 30 22 25 25 25 20 22 ...
##  $ price           : num  13495 16500 16500 13950 17450 ...

Dataset summary:

library(summarytools)
dfSummary(data,plain.ascii = FALSE, style = "grid",tmp.img.dir  = "/tmp",graph.magnif = 0.85)
## temporary images written to 'C:\tmp'

Data Frame Summary

data

Dimensions: 205 x 26
Duplicates: 0

No | Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing
1 | car_ID [integer] | Mean (sd): 103 (59.3); min < med < max: 1 < 103 < 205; IQR (CV): 102 (0.6) | 205 distinct values (integer sequence) | 205 (100.0%) | 0 (0.0%)
2 | symboling [integer] | Mean (sd): 0.8 (1.2); min < med < max: -2 < 1 < 3; IQR (CV): 2 (1.5) | -2: 3 (1.5%); -1: 22 (10.7%); 0: 67 (32.7%); 1: 54 (26.3%); 2: 32 (15.6%); 3: 27 (13.2%) | 205 (100.0%) | 0 (0.0%)
3 | CarName [character] | 1. peugeot 504; 2. toyota corolla; 3. toyota corona; 4. subaru dl; 5. honda civic; 6. mazda 626; 7. mitsubishi g4; 8. mitsubishi mirage g4; 9. mitsubishi outlander; 10. toyota mark ii; [137 others] | 6 (2.9%); 6 (2.9%); 6 (2.9%); 4 (2.0%); 3 (1.5%); 3 (1.5%); 3 (1.5%); 3 (1.5%); 3 (1.5%); 3 (1.5%); 165 (80.5%) | 205 (100.0%) | 0 (0.0%)
4 | fueltype [character] | 1. diesel; 2. gas | 20 (9.8%); 185 (90.2%) | 205 (100.0%) | 0 (0.0%)
5 | aspiration [character] | 1. std; 2. turbo | 168 (82.0%); 37 (18.0%) | 205 (100.0%) | 0 (0.0%)
6 | doornumber [character] | 1. four; 2. two | 115 (56.1%); 90 (43.9%) | 205 (100.0%) | 0 (0.0%)
7 | carbody [character] | 1. convertible; 2. hardtop; 3. hatchback; 4. sedan; 5. wagon | 6 (2.9%); 8 (3.9%); 70 (34.1%); 96 (46.8%); 25 (12.2%) | 205 (100.0%) | 0 (0.0%)
8 | drivewheel [character] | 1. 4wd; 2. fwd; 3. rwd | 9 (4.4%); 120 (58.5%); 76 (37.1%) | 205 (100.0%) | 0 (0.0%)
9 | enginelocation [character] | 1. front; 2. rear | 202 (98.5%); 3 (1.5%) | 205 (100.0%) | 0 (0.0%)
10 | wheelbase [numeric] | Mean (sd): 98.8 (6); min < med < max: 86.6 < 97 < 120.9; IQR (CV): 7.9 (0.1) | 53 distinct values | 205 (100.0%) | 0 (0.0%)
11 | carlength [numeric] | Mean (sd): 174 (12.3); min < med < max: 141.1 < 173.2 < 208.1; IQR (CV): 16.8 (0.1) | 75 distinct values | 205 (100.0%) | 0 (0.0%)
12 | carwidth [numeric] | Mean (sd): 65.9 (2.1); min < med < max: 60.3 < 65.5 < 72.3; IQR (CV): 2.8 (0) | 44 distinct values | 205 (100.0%) | 0 (0.0%)
13 | carheight [numeric] | Mean (sd): 53.7 (2.4); min < med < max: 47.8 < 54.1 < 59.8; IQR (CV): 3.5 (0) | 49 distinct values | 205 (100.0%) | 0 (0.0%)
14 | curbweight [integer] | Mean (sd): 2555.6 (520.7); min < med < max: 1488 < 2414 < 4066; IQR (CV): 790 (0.2) | 171 distinct values | 205 (100.0%) | 0 (0.0%)
15 | enginetype [character] | 1. dohc; 2. dohcv; 3. l; 4. ohc; 5. ohcf; 6. ohcv; 7. rotor | 12 (5.9%); 1 (0.5%); 12 (5.9%); 148 (72.2%); 15 (7.3%); 13 (6.3%); 4 (2.0%) | 205 (100.0%) | 0 (0.0%)
16 | cylindernumber [character] | 1. eight; 2. five; 3. four; 4. six; 5. three; 6. twelve; 7. two | 5 (2.4%); 11 (5.4%); 159 (77.6%); 24 (11.7%); 1 (0.5%); 1 (0.5%); 4 (2.0%) | 205 (100.0%) | 0 (0.0%)
17 | enginesize [integer] | Mean (sd): 126.9 (41.6); min < med < max: 61 < 120 < 326; IQR (CV): 44 (0.3) | 44 distinct values | 205 (100.0%) | 0 (0.0%)
18 | fuelsystem [character] | 1. 1bbl; 2. 2bbl; 3. 4bbl; 4. idi; 5. mfi; 6. mpfi; 7. spdi; 8. spfi | 11 (5.4%); 66 (32.2%); 3 (1.5%); 20 (9.8%); 1 (0.5%); 94 (45.9%); 9 (4.4%); 1 (0.5%) | 205 (100.0%) | 0 (0.0%)
19 | boreratio [numeric] | Mean (sd): 3.3 (0.3); min < med < max: 2.5 < 3.3 < 3.9; IQR (CV): 0.4 (0.1) | 38 distinct values | 205 (100.0%) | 0 (0.0%)
20 | stroke [numeric] | Mean (sd): 3.3 (0.3); min < med < max: 2.1 < 3.3 < 4.2; IQR (CV): 0.3 (0.1) | 37 distinct values | 205 (100.0%) | 0 (0.0%)
21 | compressionratio [numeric] | Mean (sd): 10.1 (4); min < med < max: 7 < 9 < 23; IQR (CV): 0.8 (0.4) | 32 distinct values | 205 (100.0%) | 0 (0.0%)
22 | horsepower [integer] | Mean (sd): 104.1 (39.5); min < med < max: 48 < 95 < 288; IQR (CV): 46 (0.4) | 59 distinct values | 205 (100.0%) | 0 (0.0%)
23 | peakrpm [integer] | Mean (sd): 5125.1 (477); min < med < max: 4150 < 5200 < 6600; IQR (CV): 700 (0.1) | 23 distinct values | 205 (100.0%) | 0 (0.0%)
24 | citympg [integer] | Mean (sd): 25.2 (6.5); min < med < max: 13 < 24 < 49; IQR (CV): 11 (0.3) | 29 distinct values | 205 (100.0%) | 0 (0.0%)
25 | highwaympg [integer] | Mean (sd): 30.8 (6.9); min < med < max: 16 < 30 < 54; IQR (CV): 9 (0.2) | 30 distinct values | 205 (100.0%) | 0 (0.0%)
26 | price [numeric] | Mean (sd): 13276.7 (7988.9); min < med < max: 5118 < 10295 < 45400; IQR (CV): 8715 (0.6) | 189 distinct values | 205 (100.0%) | 0 (0.0%)

Data cleaning

Checking for duplicates

sum(duplicated(data))
## [1] 0

The dataset contains no duplicate rows.

Checking for missing values

sum(is.na(data))
## [1] 0

There are no missing values.
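
A single total does not show where missing values would sit if there were any. A minimal sketch of a per-column and per-row check, run on a small made-up data frame (`toy` and its contents are hypothetical and are not taken from CARPRICE.csv):

```r
# Hypothetical toy data frame with two missing values (not the car data)
toy <- data.frame(
  horsepower = c(111, NA, 154),
  carbody    = c("sedan", "wagon", NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Indices of the rows that contain at least one missing value
incomplete_rows <- which(!complete.cases(toy))
incomplete_rows
```

On the car data the same calls would be `colSums(is.na(data))` and `which(!complete.cases(data))`.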

sum(!complete.cases(data))
## [1] 0

No rows contain missing values.

sum(complete.cases(data))
## [1] 205

The dataset contains 205 complete rows. We look at the dataset structure once more:

str(data)
## 'data.frame':    205 obs. of  26 variables:
##  $ car_ID          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ symboling       : int  3 3 1 2 2 2 1 1 1 0 ...
##  $ CarName         : chr  "alfa-romero giulia" "alfa-romero stelvio" "alfa-romero Quadrifoglio" "audi 100 ls" ...
##  $ fueltype        : chr  "gas" "gas" "gas" "gas" ...
##  $ aspiration      : chr  "std" "std" "std" "std" ...
##  $ doornumber      : chr  "two" "two" "two" "four" ...
##  $ carbody         : chr  "convertible" "convertible" "hatchback" "sedan" ...
##  $ drivewheel      : chr  "rwd" "rwd" "rwd" "fwd" ...
##  $ enginelocation  : chr  "front" "front" "front" "front" ...
##  $ wheelbase       : num  88.6 88.6 94.5 99.8 99.4 ...
##  $ carlength       : num  169 169 171 177 177 ...
##  $ carwidth        : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
##  $ carheight       : num  48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
##  $ curbweight      : int  2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
##  $ enginetype      : chr  "dohc" "dohc" "ohcv" "ohc" ...
##  $ cylindernumber  : chr  "four" "four" "six" "four" ...
##  $ enginesize      : int  130 130 152 109 136 136 136 136 131 131 ...
##  $ fuelsystem      : chr  "mpfi" "mpfi" "mpfi" "mpfi" ...
##  $ boreratio       : num  3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
##  $ stroke          : num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
##  $ compressionratio: num  9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
##  $ horsepower      : int  111 111 154 102 115 110 110 110 140 160 ...
##  $ peakrpm         : int  5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
##  $ citympg         : int  21 21 19 24 18 19 19 19 17 16 ...
##  $ highwaympg      : int  27 27 26 30 22 25 25 25 20 22 ...
##  $ price           : num  13495 16500 16500 13950 17450 ...

Creating factor variables

data<- as.data.frame(unclass(data),stringsAsFactors = TRUE)
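
The call above relies on `as.data.frame()` rebuilding every character column as a factor. A more explicit way to express the same idea might look like the sketch below; `toy` is a hypothetical stand-in for the car data, and `is_chr` is just an illustrative helper name.

```r
# Hypothetical stand-in for the car data: one character and one integer column
toy <- data.frame(
  fueltype   = c("gas", "diesel", "gas"),
  horsepower = c(111L, 68L, 154L),
  stringsAsFactors = FALSE
)

# Find the character columns and convert only those to factors
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

str(toy)
```

Numeric columns are left untouched, and factor levels are sorted alphabetically by default.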

Structure of the prepared dataset:

skimr::skim(data)
Data summary
Name data
Number of rows 205
Number of columns 26
_______________________
Column type frequency:
factor 10
numeric 16
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
CarName 0 1 FALSE 147 peu: 6, toy: 6, toy: 6, sub: 4
fueltype 0 1 FALSE 2 gas: 185, die: 20
aspiration 0 1 FALSE 2 std: 168, tur: 37
doornumber 0 1 FALSE 2 fou: 115, two: 90
carbody 0 1 FALSE 5 sed: 96, hat: 70, wag: 25, har: 8
drivewheel 0 1 FALSE 3 fwd: 120, rwd: 76, 4wd: 9
enginelocation 0 1 FALSE 2 fro: 202, rea: 3
enginetype 0 1 FALSE 7 ohc: 148, ohc: 15, ohc: 13, doh: 12
cylindernumber 0 1 FALSE 7 fou: 159, six: 24, fiv: 11, eig: 5
fuelsystem 0 1 FALSE 8 mpf: 94, 2bb: 66, idi: 20, 1bb: 11

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
car_ID 0 1 103.00 59.32 1.00 52.00 103.00 154.00 205.00 ▇▇▇▇▇
symboling 0 1 0.83 1.25 -2.00 0.00 1.00 2.00 3.00 ▃▇▆▃▃
wheelbase 0 1 98.76 6.02 86.60 94.50 97.00 102.40 120.90 ▁▇▂▁▁
carlength 0 1 174.05 12.34 141.10 166.30 173.20 183.10 208.10 ▁▅▇▃▁
carwidth 0 1 65.91 2.15 60.30 64.10 65.50 66.90 72.30 ▁▇▇▂▁
carheight 0 1 53.72 2.44 47.80 52.00 54.10 55.50 59.80 ▁▆▇▆▂
curbweight 0 1 2555.57 520.68 1488.00 2145.00 2414.00 2935.00 4066.00 ▃▇▅▃▁
enginesize 0 1 126.91 41.64 61.00 97.00 120.00 141.00 326.00 ▇▆▂▁▁
boreratio 0 1 3.33 0.27 2.54 3.15 3.31 3.58 3.94 ▁▅▇▇▂
stroke 0 1 3.26 0.31 2.07 3.11 3.29 3.41 4.17 ▁▂▇▇▁
compressionratio 0 1 10.14 3.97 7.00 8.60 9.00 9.40 23.00 ▇▁▁▁▁
horsepower 0 1 104.12 39.54 48.00 70.00 95.00 116.00 288.00 ▇▅▂▁▁
peakrpm 0 1 5125.12 476.99 4150.00 4800.00 5200.00 5500.00 6600.00 ▂▇▇▂▁
citympg 0 1 25.22 6.54 13.00 19.00 24.00 30.00 49.00 ▆▇▅▂▁
highwaympg 0 1 30.75 6.89 16.00 25.00 30.00 34.00 54.00 ▂▇▆▁▁
price 0 1 13276.71 7988.85 5118.00 7788.00 10295.00 16503.00 45400.00 ▇▃▁▁▁

The final dataset has 10 factor variables and 16 numeric variables (including the ID variable).

Descriptive analysis of the data

For the further analysis I select the numeric variable horsepower and the non-numeric variable carbody.

Analysis of the non-numeric variable

The non-numeric variable carbody is a nominal variable: body styles have no natural ranking.

We set the order of the factor levels explicitly (here alphabetical):

data$carbody <- factor(data$carbody,levels = c("convertible","hardtop","hatchback","sedan", "wagon"))

We visualise the distribution of the variable's values in a bar chart:

barplot(prop.table(table(data$carbody))*100, col = "skyblue", cex.names = 0.7, cex.axis = 0.8, main = "Tunnuse Carbody suhteliste sageduste tulpdiagramm", ylab = "suht. sagedus, %")

Frequency table of the variable (the columns sagedus and suht.sagedus hold the absolute and relative frequencies):

sagedus <- table(data$carbody)
suht.sagedus <-round(prop.table(table(data$carbody))*100,2)
sagedustabel <- cbind(rownames(sagedus),sagedus,suht.sagedus)
sagedustabel <- data.frame(sagedustabel,row.names = NULL)
sagedustabel <-rbind(sagedustabel,c("Total",sum(sagedus),sum(suht.sagedus)))
names(sagedustabel) <- c("Carbody","sagedus","suht.sagedus,%")

library(knitr)
library(kableExtra)

kable_styling(kable(sagedustabel,align = c('l','r','r')),bootstrap_options = c("striped","hover"),full_width = 0,position = "left")
Carbody sagedus suht.sagedus,%
convertible 6 2.93
hardtop 8 3.9
hatchback 70 34.15
sedan 96 46.83
wagon 25 12.2
Total 205 100.01
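
The total of 100.01 is a rounding artifact: the shares are rounded to two decimals first and summed afterwards. A small sketch with the carbody counts from the table (the variable names `counts`, `share`, `total_of_rounded` and `total_exact` are illustrative only):

```r
# Carbody counts as reported in the frequency table
counts <- c(convertible = 6, hardtop = 8, hatchback = 70, sedan = 96, wagon = 25)

# Unrounded percentage shares
share <- counts / sum(counts) * 100

# Summing the rounded shares accumulates the rounding error
total_of_rounded <- sum(round(share, 2))

# Summing first and rounding afterwards gives the exact total
total_exact <- round(sum(share), 2)

c(total_of_rounded = total_of_rounded, total_exact = total_exact)
```

If an exact 100 is wanted in the Total row, the total can be computed from the unrounded shares in this way.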

Conclusion: sedans are the most common body style, making up about 47% of the dataset. The least common are convertibles (about 3%) and hardtops (3.9%).

Analysis of the numeric variable

The target variable horsepower is treated as a continuous variable. We visualise the distribution of its values:

h <- hist(data$horsepower, col = "skyblue",main = "Tunnuse horsepower sageduste histogramm", xlab="horsepower", ylab = "sagedus")

Relative frequency histogram and boxplot of horsepower:

par(mfrow=c(1,2))
h$counts <- h$counts/sum(h$counts)*100
plot(h,col = "skyblue",main = "Tunnuse Horsepower \n suhteliste sageduste histogramm", xlab="Horsepower", ylab = "suht.sagedus,%",xlim = c(0,300),cex.main=0.8)

b <- boxplot(data$horsepower, col = "skyblue", horizontal = 1, main="Tunnuse Horsepower karpdiagramm",xlab="Horsepower",cex.main=0.8, range = 3)

Outlier analysis. The range parameter of the boxplot controls which points are flagged as outliers (the default is range = 1.5; increasing it keeps only the more extreme outliers). With range = 3, two extreme outliers remain on the right-hand side. Outlier values:

b$out
## [1] 262 288
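
The effect of `range` can be checked by hand: a point is flagged when it lies more than `range` times the interquartile range beyond the box. The sketch below uses the horsepower quartiles reported in this document (Q1 = 70, Q3 = 116) and assumes that the boxplot hinges coincide with them; `upper_fence` and `high_values` are illustrative names, and `high_values` lists only a few of the largest horsepower values from the data.

```r
# Quartiles of horsepower as reported in the summaries (assumed equal to the hinges)
q1 <- 70
q3 <- 116
iqr <- q3 - q1

# Upper limit beyond which a value is flagged as an outlier
upper_fence <- function(coef) q3 + coef * iqr

# A few of the largest horsepower values in the data
high_values <- c(184, 200, 207, 262, 288)

out_15 <- high_values[high_values > upper_fence(1.5)]  # default range
out_3  <- high_values[high_values > upper_fence(3)]    # range used here

upper_fence(1.5)
upper_fence(3)
out_15
out_3
```

With the default multiplier the fence is at 185, so 200 and 207 are flagged as well; with range = 3 the fence moves to 254 and only 262 and 288 remain.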

Rows that are flagged as outliers:

which(data$horsepower%in%b$out)
## [1]  50 130

Details of the outliers:

datatable(data[which(data$horsepower %in% b$out), ], options = list(scrollX = TRUE, pageLength = 5, searching = FALSE, scroller = TRUE, scrollY = 200))

The outliers are the Jaguar XK and the Porsche Cayenne, whose engines deliver 262 and 288 horsepower respectively.

Frequency table of the variable:

sagedus <- table(data$horsepower)
suht.sagedus <-round(prop.table(table(data$horsepower))*100,2)
sagedustabel <- cbind(rownames(sagedus),sagedus,suht.sagedus)
sagedustabel <- data.frame(sagedustabel,row.names = NULL)
sagedustabel <-rbind(sagedustabel,c("Total",sum(sagedus),sum(suht.sagedus)))
names(sagedustabel) <- c("Horsepower","sagedus","suht.sagedus,%")

library(knitr)
library(kableExtra)

kable_styling(kable(sagedustabel,align = c('l','r','r')),bootstrap_options = c("striped","hover"),full_width = 0,position = "left")
Horsepower sagedus suht.sagedus,%
48 1 0.49
52 2 0.98
55 1 0.49
56 2 0.98
58 1 0.49
60 1 0.49
62 6 2.93
64 1 0.49
68 19 9.27
69 10 4.88
70 11 5.37
72 1 0.49
73 3 1.46
76 5 2.44
78 1 0.49
82 5 2.44
84 5 2.44
85 3 1.46
86 4 1.95
88 6 2.93
90 5 2.44
92 4 1.95
94 2 0.98
95 7 3.41
97 5 2.44
100 2 0.98
101 6 2.93
102 5 2.44
106 1 0.49
110 8 3.9
111 4 1.95
112 2 0.98
114 6 2.93
115 1 0.49
116 9 4.39
120 1 0.49
121 3 1.46
123 4 1.95
134 1 0.49
135 1 0.49
140 1 0.49
142 1 0.49
143 1 0.49
145 5 2.44
152 3 1.46
154 1 0.49
155 2 0.98
156 2 0.98
160 6 2.93
161 2 0.98
162 2 0.98
175 1 0.49
176 2 0.98
182 3 1.46
184 2 0.98
200 1 0.49
207 3 1.46
262 1 0.49
288 1 0.49
Total 205 100.09
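
With 59 distinct values, a value-by-value table is long. Grouping the values into intervals is one possible alternative. The sketch below uses `cut()` on a short hypothetical vector `hp`; the 50-unit bin width is an arbitrary assumption, not a choice made in this analysis.

```r
# Hypothetical horsepower values (not the full dataset)
hp <- c(48, 68, 68, 95, 111, 154, 262, 288)

# Assign each value to a right-closed interval: (0,50], (50,100], ...
bins <- cut(hp, breaks = seq(0, 300, by = 50))

# Absolute and relative frequencies per interval
grouped <- table(bins)
grouped_pct <- round(prop.table(grouped) * 100, 2)

cbind(count = grouped, percent = grouped_pct)
```

On the car data, `hp` would be replaced by `data$horsepower`.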

Conclusion: the most common value in the dataset is 68 horsepower, which accounts for about 9% of the cars. The rarest values, each occurring once, are 48, 55, 58, 60, 64, 72, 78, 106, 115, 120, 134, 135, 140, 142, 143, 154, 175, 200, 262 and 288.

Summary statistics of the variable:

library(FSA)
arvkar <- Summarize(data$horsepower, digits = 2)
kable_styling(kable(t(arvkar)))
n mean sd min Q1 median Q3 max
205 104.12 39.54 48 70 95 116 288

Conclusion: the distribution of horsepower has a weak right (positive) skew: the mean engine horsepower is 104.12, which is greater than the median (95). The minimum is 48 and the maximum is 288. The variable has outliers: two objects, in rows 50 and 130, with 262 and 288 horsepower.
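
The direction of the asymmetry can also be checked numerically. Below is a minimal sketch of the moment-based sample skewness; `skewness_g1` is a helper written for this illustration (it is not a function from FSA), and it is applied to two hypothetical vectors rather than to the car data.

```r
# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 1.5
skewness_g1 <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

symmetric    <- c(1, 2, 3, 4, 5)    # hypothetical symmetric sample
right_skewed <- c(1, 2, 3, 4, 20)   # hypothetical sample with a long right tail

skewness_g1(symmetric)      # zero for a symmetric sample
skewness_g1(right_skewed)   # positive for a right-skewed sample

# In a right-skewed sample the mean typically exceeds the median
mean(right_skewed) > median(right_skewed)
```

On the actual data the call would be `skewness_g1(data$horsepower)`; a positive result would agree with the mean lying above the median.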