title: "car tema livia"
author: "Raluca Popp"
date: "19 November 2017"
output: html_document
library(foreign)
car <- read.csv("C:/Users/poppr/Desktop/cars.csv")
Explore the dataset with head(), names(), and summary().
names(car) # lists the variables in the dataset
## [1] "mpg" "engine" "horse" "weight" "accel" "year"
## [7] "origin" "cylinder" "American" "Japanese" "European" "temp"
summary(car) # gives summary statistics for all variables
## mpg engine horse weight
## Min. : 9.00 Min. : 4.0 Min. : 46.00 Min. : 732
## 1st Qu.:17.50 1st Qu.:104.2 1st Qu.: 75.75 1st Qu.:2224
## Median :23.00 Median :148.5 Median : 95.00 Median :2811
## Mean :23.51 Mean :194.0 Mean :104.83 Mean :2970
## 3rd Qu.:29.00 3rd Qu.:293.2 3rd Qu.:129.25 3rd Qu.:3612
## Max. :46.60 Max. :455.0 Max. :230.00 Max. :5140
## NA's :8 NA's :6
## accel year origin cylinder
## Min. : 8.00 Min. : 0.00 Min. :1.00 Min. :3.000
## 1st Qu.:13.62 1st Qu.:73.00 1st Qu.:1.00 1st Qu.:4.000
## Median :15.50 Median :76.00 Median :1.00 Median :4.000
## Mean :15.50 Mean :75.75 Mean :1.57 Mean :5.469
## 3rd Qu.:17.07 3rd Qu.:79.00 3rd Qu.:2.00 3rd Qu.:8.000
## Max. :24.80 Max. :82.00 Max. :3.00 Max. :8.000
## NA's :1 NA's :1
## American Japanese European temp
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. : 9.00
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:13.00
## Median :1.0000 Median :0.0000 Median :0.0000 Median :16.00
## Mean :0.6256 Mean :0.1951 Mean :0.1802 Mean :16.14
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:19.00
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :27.00
## NA's :1 NA's :1 NA's :9
You want to buy a car with good fuel economy. You have a dataset “cars.csv” of cars that contains how many miles a car can go on a gallon of gas. Convert this to the European measure of how many liters of gasoline a car uses on a 100 km stretch. (You will need to think about this conversion a little.) Compute the new variable.
car$kmph <- (235.05/(car$mpg))
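The conversion factor can be derived from the unit definitions. The sketch below assumes US gallons (3.785411784 L) and international miles (1.609344 km); the helper name `mpg_to_l100km` is made up for illustration. Under those assumptions the factor comes out near 235.21, slightly different from the 235.05 used above, so values computed this way would differ a little from the output shown in this document.

```r
# Liters per 100 km from miles per US gallon.
# mpg_to_l100km is a hypothetical helper, not part of the original analysis.
liters_per_gallon <- 3.785411784   # US liquid gallon (assumption)
km_per_mile       <- 1.609344      # international mile

mpg_to_l100km <- function(mpg) {
  # liters per gallon divided by km per gallon gives liters per km;
  # multiplying by 100 gives liters per 100 km
  100 * liters_per_gallon / (km_per_mile * mpg)
}

mpg_to_l100km(c(10, 23.5, 46.6))
```

Note that the new measure runs in the opposite direction from mpg: a higher mpg is better, while a higher value in liters per 100 km means the car uses more fuel.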
Run and report basic descriptives of the original and new variable. Are there any outliers (that are over 3 standard deviations away from the mean)?
summary(car$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9.00 17.50 23.00 23.51 29.00 46.60 8
sd(car$mpg, na.rm = TRUE)
## [1] 7.815984
summary(car$kmph)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5.044 8.105 10.220 11.210 13.430 26.120 8
sd(car$kmph, na.rm = TRUE)
## [1] 3.899351
Any outliers will have mpg values more than 3 standard deviations from the mean. To identify them, compute the standard deviation with sd(), multiply it by 3 (7.815984 * 3 = 23.44795), and add the result to the mean (23.51). Observations with mpg above 23.45 + 23.51 = 46.96 are outliers. The summary function returns a maximum of 46.60 for mpg, so there are no outliers on the high side. On the low side the cutoff is 23.51 - 23.45 = 0.06, far below the minimum of 9.00, so mpg has no outliers at all.
Outliers for kmph: 3 standard deviations above the mean is 3.899351 * 3 + 11.210, that is 11.70 + 11.210 = 22.9. The maximum value here is 26.120.
So after the conversion to kmph, there are outliers. To see them, run this:
car$kmph > 22.9
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE NA
## [12] NA NA NA NA FALSE FALSE NA FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
## [34] FALSE TRUE FALSE FALSE FALSE FALSE NA FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [144] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [155] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [166] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [177] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [188] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [199] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [210] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [221] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [232] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [243] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [254] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [276] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [287] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [298] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [309] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [320] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [331] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [342] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [353] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [364] FALSE FALSE FALSE FALSE NA FALSE FALSE FALSE FALSE FALSE FALSE
## [375] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [386] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [397] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
This prints a TRUE/FALSE vector with TRUE at the observations that are outliers. That is how you can see that there are outliers.
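The cutoffs can also be computed instead of typed in by hand, and which() returns the positions of the outliers rather than a long TRUE/FALSE vector. A minimal sketch on made-up values (the vector `x` is illustrative, not taken from cars.csv):

```r
# Toy values with one missing value and one extreme value
x <- c(rep(10, 20), NA, 40)

m <- mean(x, na.rm = TRUE)
s <- sd(x, na.rm = TRUE)

lower <- m - 3 * s
upper <- m + 3 * s

# which() ignores NA and returns the positions of values outside the band
outlier_idx <- which(x < lower | x > upper)
outlier_idx
x[outlier_idx]
```

With the real data, the same pattern would use car$kmph in place of `x`.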
Using the “cars.csv” database, divide the fuel economy in two categories (low=0 and high=1) by making a new variable. It is up to you where you draw the line between low and high. I drew the line at the mean. I don't know if that is the best solution, but the assignment says it is up to me.
summary(car$kmph)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5.044 8.105 10.220 11.210 13.430 26.120 8
car$fueleconomy <- NA
car$fueleconomy[car$kmph <= 11.21] <- "0"
car$fueleconomy[car$kmph > 11.21] <- "1"
table(car$fueleconomy)
##
## 0 1
## 227 171
Or you can recode using ifelse(). The result is the same, but I don't know which one you learned in class. Use whichever you prefer.
car$fueleconomy2 <- ifelse(car$kmph <= 11.21, 0, 1)
table(car$fueleconomy2)
##
## 0 1
## 227 171
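Typing the cut point by hand works, but it can also be computed from the data so that it stays correct if the variable changes. A sketch on made-up values (`consumption` is an illustrative vector); storing the result as a labelled factor is an assumption about what is convenient, not something the assignment asks for:

```r
consumption <- c(6, 8, 9, 12, 15, NA)   # invented L/100 km values

cut_point <- mean(consumption, na.rm = TRUE)   # 10 for these values

# 0 = at or below the mean, 1 = above the mean; NA stays NA
group <- ifelse(consumption <= cut_point, 0, 1)

# Labels make tables easier to read than bare 0/1
group_f <- factor(group, levels = c(0, 1),
                  labels = c("at or below mean", "above mean"))

table(group_f, useNA = "ifany")
```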
Run a crosstab between your new dichotomous fuel economy variable and origin of car and report the results in a meaningful way (think about best information to present in the cells). Run a chi-square test of independence for the categories of fuel economy and country of origin and report the results.
Crosstab. This shows frequencies, so the result is count data. But it is hard to tell which variable is which, so I name the variables in the command after this one.
table(car$origin, car$fueleconomy)
##
## 0 1
## 1 92 156
## 2 61 9
## 3 74 5
Now it looks nicer.
table("car origin" = car$origin, "fuel economy" = car$fueleconomy)
## fuel economy
## car origin 0 1
## 1 92 156
## 2 61 9
## 3 74 5
But maybe I should look at proportions rather than counts, because raw counts are harder to interpret. This is what the prompt means by thinking about the best information to present in the cells. Instead of table() I use prop.table(). But prop.table() does not work on its own; it needs a table() call inside it. Here is how.
prop.table(table("car origin" = car$origin,
"fuel economy" = car$fueleconomy))
## fuel economy
## car origin 0 1
## 1 0.23173804 0.39294710
## 2 0.15365239 0.02267003
## 3 0.18639798 0.01259446
But to get percentages, we multiply by 100.
prop.table(table("car origin" = car$origin,
"fuel economy" = car$fueleconomy)) * 100
## fuel economy
## car origin 0 1
## 1 23.173804 39.294710
## 2 15.365239 2.267003
## 3 18.639798 1.259446
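prop.table() without a margin divides every cell by the grand total. For comparing origins, row percentages are often easier to read, since each origin then sums to 100. A sketch using the counts from the crosstab above, typed in as a matrix (the object name `counts` is illustrative):

```r
# Counts copied from the origin x fuel economy crosstab
counts <- matrix(c(92, 156,
                   61,   9,
                   74,   5),
                 nrow = 3, byrow = TRUE,
                 dimnames = list("car origin" = c("1", "2", "3"),
                                 "fuel economy" = c("0", "1")))

# margin = 1 computes proportions within each row (each origin)
row_pct <- round(prop.table(counts, margin = 1) * 100, 1)
row_pct
```

With these counts, about 63% of origin 1 cars fall in category 1, against about 13% for origin 2 and about 6% for origin 3.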
Because you saw earlier that there are NAs in the data, you need to include them too, with the argument exclude = NULL. This adds an extra row and column for NA.
prop.table(table("car origin" = car$origin,
"fuel economy" = car$fueleconomy, exclude= NULL)) * 100
## fuel economy
## car origin 0 1 <NA>
## 1 22.6600985 38.4236453 1.2315271
## 2 15.0246305 2.2167488 0.7389163
## 3 18.2266010 1.2315271 0.0000000
## <NA> 0.0000000 0.2463054 0.0000000
Now you can see that cars from origin 1 are the most numerous in the dataset. They also have the largest share of cars coded 1 on the fuel economy variable. (1 = American, 2 = European, 3 = Japanese)
table(car$origin)
##
## 1 2 3
## 253 73 79
You can see that American cars are the most numerous.
Now we run the chi-square test of independence, which I will let you interpret :P (you can find how to interpret it here, it is very simple: https://www.r-bloggers.com/chi-squared-test/)
chisq.test(car$origin, car$fueleconomy)
##
## Pearson's Chi-squared test
##
## data: car$origin and car$fueleconomy
## X-squared = 109.48, df = 2, p-value < 2.2e-16
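For reporting, the test object holds more than the printed summary. A sketch that reruns the test on the same counts typed in as a matrix (object names are illustrative) and looks at expected counts and standardized residuals to see which cells drive the result:

```r
# Counts copied from the origin x fuel economy crosstab
counts <- matrix(c(92, 156,
                   61,   9,
                   74,   5),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(origin = c("1", "2", "3"),
                                 fueleconomy = c("0", "1")))

res <- chisq.test(counts)

res$statistic           # chi-square statistic
res$parameter           # degrees of freedom
res$p.value
round(res$expected, 1)  # counts expected if the two variables were independent
round(res$stdres, 2)    # standardized residuals; large absolute values mark cells far from independence
```

A p-value this small means the hypothesis that fuel economy category and origin are independent is rejected at any conventional significance level.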
There are three car salesmen in your town: one of them has American cars, another one has Japanese cars, and the other has European cars. What do you think? Which one will offer the best solution for you? Why do you think this? Does it make a real difference?
To answer this question, look at the prop.table from earlier. In the table, American cars have the largest share in category 1, but read as high fuel economy that does not make sense, because American cars are massive and not known for being the most efficient. The reason is the coding: kmph measures liters used per 100 km, so values above the mean (category 1) mean higher consumption, not better economy. It is still worth thinking about which other variables to use to decide what origin the car we buy should have.
How do you find out which other variables are in the dataset?
names(car)
## [1] "mpg" "engine" "horse" "weight"
## [5] "accel" "year" "origin" "cylinder"
## [9] "American" "Japanese" "European" "temp"
## [13] "kmph" "fueleconomy" "fueleconomy2"
Look at year:
prop.table(table(car$year, car$origin)) * 100
##
## 1 2 3
## 0 0.0000000 0.0000000 0.0000000
## 70 6.4197531 1.4814815 0.4938272
## 71 4.9382716 1.2345679 0.9876543
## 72 4.4444444 1.2345679 1.2345679
## 73 7.1604938 1.7283951 0.9876543
## 74 3.7037037 1.4814815 1.4814815
## 75 4.9382716 1.4814815 0.9876543
## 76 5.4320988 1.9753086 0.9876543
## 77 4.4444444 0.9876543 1.4814815
## 78 5.4320988 1.4814815 1.9753086
## 79 5.6790123 0.9876543 0.4938272
## 80 1.7283951 2.2222222 3.2098765
## 81 3.2098765 1.2345679 2.9629630
## 82 4.9382716 0.4938272 2.2222222
Subset only the newest cars:
newcars <- subset(car, year == 82)
I kept only the cars from 82, the newest ones.
prop.table(table(newcars$fueleconomy, newcars$origin)) * 100
##
## 1 2 3
## 0 64.516129 6.451613 29.032258
Or we subset only the cars in fuel economy category 1:
econcars <- subset(car, fueleconomy==1)
prop.table(table(econcars$cylinder, econcars$origin)) * 100
##
## 1 2 3
## 3 0.0000000 0.0000000 1.1764706
## 4 1.1764706 2.9411765 0.5882353
## 5 0.0000000 0.5882353 0.0000000
## 6 32.3529412 1.7647059 1.1764706
## 8 58.2352941 0.0000000 0.0000000
Remove the outliers:
hist(car$kmph)
carclean <- subset(car, kmph < 22.9)
I check which origin has the largest share in category 1 after removing the outliers, and it is still American.
prop.table(table(carclean$fueleconomy, carclean$origin)) * 100
##
## 1 2 3
## 0 23.291139 15.443038 18.734177
## 1 38.987342 2.278481 1.265823
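Another way to compare origins that avoids the cut point entirely is to compare average consumption per origin. A sketch on a small made-up data frame (all values are invented; with the real data the same call would use car$kmph and car$origin):

```r
toy <- data.frame(
  origin = c(1, 1, 1, 2, 2, 3, 3),
  kmph   = c(14, 16, NA, 9, 11, 7, 9)   # invented L/100 km values
)

# Mean consumption per origin, ignoring missing values
means <- tapply(toy$kmph, toy$origin, mean, na.rm = TRUE)
means
```

Lower means indicate lower fuel use per 100 km, so the origin with the smallest mean would be the most economical on this measure.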