title: “car tema livia”
author: “Raluca Popp”
date: “19 November 2017”
output: html_document

library(foreign)
car <- read.csv("C:/Users/poppr/Desktop/cars.csv")

Explore the dataset with head(), names(), summary().

names(car) # lists the variables in the dataset

##  [1] "mpg"      "engine"   "horse"    "weight"   "accel"    "year"    
##  [7] "origin"   "cylinder" "American" "Japanese" "European" "temp"

summary(car) # gives summary statistics for all variables

##       mpg            engine          horse            weight    
##  Min.   : 9.00   Min.   :  4.0   Min.   : 46.00   Min.   : 732  
##  1st Qu.:17.50   1st Qu.:104.2   1st Qu.: 75.75   1st Qu.:2224  
##  Median :23.00   Median :148.5   Median : 95.00   Median :2811  
##  Mean   :23.51   Mean   :194.0   Mean   :104.83   Mean   :2970  
##  3rd Qu.:29.00   3rd Qu.:293.2   3rd Qu.:129.25   3rd Qu.:3612  
##  Max.   :46.60   Max.   :455.0   Max.   :230.00   Max.   :5140  
##  NA's   :8                       NA's   :6                      
##      accel            year           origin        cylinder    
##  Min.   : 8.00   Min.   : 0.00   Min.   :1.00   Min.   :3.000  
##  1st Qu.:13.62   1st Qu.:73.00   1st Qu.:1.00   1st Qu.:4.000  
##  Median :15.50   Median :76.00   Median :1.00   Median :4.000  
##  Mean   :15.50   Mean   :75.75   Mean   :1.57   Mean   :5.469  
##  3rd Qu.:17.07   3rd Qu.:79.00   3rd Qu.:2.00   3rd Qu.:8.000  
##  Max.   :24.80   Max.   :82.00   Max.   :3.00   Max.   :8.000  
##                                  NA's   :1      NA's   :1      
##     American         Japanese         European           temp      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   : 9.00  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:13.00  
##  Median :1.0000   Median :0.0000   Median :0.0000   Median :16.00  
##  Mean   :0.6256   Mean   :0.1951   Mean   :0.1802   Mean   :16.14  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:19.00  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :27.00  
##                   NA's   :1        NA's   :1        NA's   :9

You want to buy a car with good fuel economy. You have a dataset “cars.csv” of cars that contains how many miles a car can go on a gallon of gas. Convert this to the European measure of how many liters of gasoline a car drinks on a 100km stretch. (You will need to think about this conversion a little.) Compute the new variable.

# Despite its name, kmph holds liters per 100 km, not a speed
car$kmph <- 235.05 / car$mpg
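A quick sketch of where the conversion factor comes from, assuming mpg means miles per US gallon (the dataset does not say which gallon is meant). One US gallon is 3.785411784 liters and one mile is 1.609344 km, so liters per 100 km is a constant divided by mpg. The names `factor_l100`, `l_per_100km` and `mpg_example` are made up for this illustration. Under that assumption the constant comes out near 235.21, so the 235.05 used above differs from it by less than 0.1%.

```r
# Assumption: US gallons. Both conversion constants are exact by definition.
liters_per_gallon <- 3.785411784
km_per_mile <- 1.609344

# liters per 100 km = 100 * liters_per_gallon / (km_per_mile * mpg)
factor_l100 <- 100 * liters_per_gallon / km_per_mile

# Hypothetical helper: convert miles per gallon to liters per 100 km
l_per_100km <- function(mpg) factor_l100 / mpg

mpg_example <- c(9, 23, 46.6)  # min, median and max mpg from the summary
print(factor_l100)
print(l_per_100km(mpg_example))
```

Because the relationship is a reciprocal, the car with the highest mpg ends up with the lowest liters per 100 km, and the other way around.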

Run and report basic descriptives of the original and new variable. Are there any outliers (that are over 3 standard deviations away from the mean)?

summary(car$mpg)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    9.00   17.50   23.00   23.51   29.00   46.60       8

sd(car$mpg, na.rm = TRUE)

## [1] 7.815984

summary(car$kmph)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   5.044   8.105  10.220  11.210  13.430  26.120       8

sd(car$kmph, na.rm = TRUE)

## [1] 3.899351

Any outliers will have mpg values more than 3 standard deviations from the mean. To find them, compute the standard deviation with sd(), multiply it by 3 (7.815984 * 3 = 23.44795), and add the result to the mean (23.51). Observations with mpg above 23.45 + 23.51 = 46.96 are outliers. The summary function reports a maximum of 46.60, which is below that cutoff, so mpg has no outliers on the high side. On the low side the cutoff is 23.51 - 23.45 = 0.06, far below the minimum of 9.00.

Outliers for kmph: 3 SD above the mean is 3.899351 * 3 + 11.210, that is 11.70 + 11.21 = 22.9. The maximum here is 26.120, which is above the cutoff.
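The arithmetic above can also be done in R instead of by hand. This is only a sketch: `sd_bounds` is a made-up helper name, and the numbers plugged in at the end are the mean and SD values reported in the output above, not recomputed from the data.

```r
# Hypothetical helper: lower and upper cutoffs k standard deviations from the mean
sd_bounds <- function(x, k = 3) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  c(lower = m - k * s, upper = m + k * s)
}

# Same rule applied to the summary numbers reported in the text
mpg_upper  <- 23.51 + 3 * 7.815984
kmph_upper <- 11.21 + 3 * 3.899351

print(mpg_upper)           # cutoff for mpg
print(46.60 > mpg_upper)   # is the mpg maximum an outlier?
print(kmph_upper)          # cutoff for kmph
print(26.12 > kmph_upper)  # is the kmph maximum an outlier?
```

With the real data you would call `sd_bounds(car$mpg)` and `sd_bounds(car$kmph)`, which avoids copying rounded numbers around.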

So after the conversion to kmph there are outliers. To see them, run this:

car$kmph > 22.9

##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE    NA
##  [12]    NA    NA    NA    NA FALSE FALSE    NA FALSE FALSE FALSE FALSE
##  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
##  [34] FALSE  TRUE FALSE FALSE FALSE FALSE    NA FALSE FALSE FALSE FALSE
##  [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [144] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [155] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [166] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [177] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [188] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [199] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [210] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [221] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [232] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [243] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [254] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [276] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [287] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [298] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [309] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [320] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [331] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [342] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [353] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [364] FALSE FALSE FALSE FALSE    NA FALSE FALSE FALSE FALSE FALSE FALSE
## [375] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [386] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [397] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

This prints a TRUE/FALSE vector with TRUE at the observations that are outliers (here rows 32, 33 and 35; NA marks the cars with missing mpg). That is how you can see that there are outliers.
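Scanning a long TRUE/FALSE printout is tedious, so here is a small sketch of a shortcut using which(), shown on a made-up vector (`kmph_toy` is not part of the dataset).

```r
# Toy consumption values, including a missing one
kmph_toy <- c(10.2, NA, 25.0, 8.1, 26.1)

flags <- kmph_toy > 22.9       # TRUE / FALSE / NA, same as in the long printout

print(which(flags))            # positions of the TRUE values; NAs are skipped
print(sum(flags, na.rm = TRUE))  # how many outliers there are
```

On the real data the equivalent calls would be `which(car$kmph > 22.9)` and `sum(car$kmph > 22.9, na.rm = TRUE)`.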

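Using the “cars.csv” database, divide the fuel economy in two categories (low=0 and high=1) by making a new variable. It is up to you where you draw the line between low and high. I drew the line at the mean. I don't know if that is the best solution, but they said it's up to me. Note that kmph measures consumption, so in the coding below 1 marks the cars that burn more fuel than the mean, which is the worse fuel economy.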

summary(car$kmph)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   5.044   8.105  10.220  11.210  13.430  26.120       8

car$fueleconomy <- NA
car$fueleconomy[car$kmph <= 11.21] <- "0" 
car$fueleconomy[car$kmph > 11.21] <- "1"

table(car$fueleconomy)

## 
##   0   1 
## 227 171

Or you can recode using ifelse(). The result is the same, but I don't know which one you learned in class, so use whichever you like.

car$fueleconomy2 <- ifelse(car$kmph <= 11.21, 0, 1)
table(car$fueleconomy2)

## 
##   0   1 
## 227 171
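A possible variation, sketched on toy numbers: compute the cutoff with mean() instead of typing 11.21, and store the result as a labelled factor so the tables are easier to read. The vector `kmph_toy` and the label texts are made up for this example.

```r
# Toy consumption values in liters per 100 km, one of them missing
kmph_toy <- c(6.5, 9.8, 12.4, NA, 15.0)

# Cutoff taken from the data rather than hard-coded
cutoff <- mean(kmph_toy, na.rm = TRUE)

# 0 = at or below the mean, 1 = above the mean; NA stays NA
consumption <- factor(ifelse(kmph_toy <= cutoff, 0, 1),
                      levels = c(0, 1),
                      labels = c("at or below mean", "above mean"))

print(cutoff)
print(table(consumption, useNA = "ifany"))
```

Labels that say what the category means make it harder to mix up "high consumption" with "high economy" later on.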

Run a crosstab between your new dichotomous fuel economy variable and origin of car and report the results in a meaningful way (think about best information to present in the cells). Run a chi-square test of independence for the categories of fuel economy and country of origin and report the results.

Crosstab. This shows frequencies, meaning the result is count data. But it is hard to tell which variable is which, so I name the variables in the second command below.

table(car$origin, car$fueleconomy)

##    
##       0   1
##   1  92 156
##   2  61   9
##   3  74   5

Now it looks nicer.

table("car origin" = car$origin, "fuel economy" = car$fueleconomy)

##           fuel economy
## car origin   0   1
##          1  92 156
##          2  61   9
##          3  74   5

But maybe I should look at proportions rather than counts, because raw counts are harder to interpret. This is what the task means by thinking about what information to present in the cells. Instead of table() I use prop.table(). But prop.table() does not work on its own: it needs a table() call inside it. Here is how.

prop.table(table("car origin" = car$origin, 
           "fuel economy" = car$fueleconomy))

##           fuel economy
## car origin          0          1
##          1 0.23173804 0.39294710
##          2 0.15365239 0.02267003
##          3 0.18639798 0.01259446

To get percentages, multiply by 100.

prop.table(table("car origin" = car$origin, 
           "fuel economy" = car$fueleconomy)) * 100

##           fuel economy
## car origin         0         1
##          1 23.173804 39.294710
##          2 15.365239  2.267003
##          3 18.639798  1.259446
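The percentages above are shares of the whole table. Since the origins differ a lot in size, row percentages (the share within each origin) may be easier to compare. A sketch, assuming the counts are typed in by hand from the crosstab shown earlier; the object names `counts` and `row_pct` are made up.

```r
# Counts copied from the crosstab: rows = car origin, columns = fuel economy code
counts <- matrix(c(92, 61, 74,    # code 0
                   156, 9, 5),    # code 1
                 nrow = 3,
                 dimnames = list("car origin" = c("1", "2", "3"),
                                 "fuel economy" = c("0", "1")))

# margin = 1 makes every row sum to 100%
row_pct <- round(prop.table(counts, margin = 1) * 100, 1)
print(row_pct)
```

With the data frame itself the same thing would be `prop.table(table(car$origin, car$fueleconomy), margin = 1) * 100`.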

Because you saw earlier that the data contain NAs, you need to include them too, with the argument exclude = NULL. This adds an extra row and column for NA.

prop.table(table("car origin" = car$origin, 
           "fuel economy" = car$fueleconomy, exclude= NULL)) * 100

##           fuel economy
## car origin          0          1       <NA>
##       1    22.6600985 38.4236453  1.2315271
##       2    15.0246305  2.2167488  0.7389163
##       3    18.2266010  1.2315271  0.0000000
##       <NA>  0.0000000  0.2463054  0.0000000
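As an aside, table() also has a `useNA` argument that does the same job and states the intent a bit more directly. A small sketch on made-up vectors (`origin_toy` and `economy_toy` are not from the dataset):

```r
# Toy vectors with one missing value each
origin_toy  <- c(1, 1, 2, NA, 3)
economy_toy <- c("0", NA, "1", "1", "0")

# "ifany" adds an NA row or column only when missing values are present
tab_na <- table("car origin" = origin_toy,
                "fuel economy" = economy_toy,
                useNA = "ifany")
print(tab_na)
print(sum(tab_na))  # every observation is counted, including the NAs
```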

Now you can see that cars of origin 1 are the most numerous in the dataset. They also have the largest percentage of cars in category 1, which, given the coding (kmph above the mean), means fuel consumption above the mean. (1 = American, 2 = European, 3 = Japanese)

table(car$origin)

## 
##   1   2   3 
## 253  73  79

You can see that American cars are the most numerous.

Now we run the chi-square test of independence, which I'll leave for you to interpret :P (you can find how to interpret it here, it's super simple: https://www.r-bloggers.com/chi-squared-test/)

chisq.test(car$origin, car$fueleconomy)

## 
##  Pearson's Chi-squared test
## 
## data:  car$origin and car$fueleconomy
## X-squared = 109.48, df = 2, p-value < 2.2e-16
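To help with the interpretation, here is a sketch that reruns the test on the counts typed in by hand from the crosstab and looks at the expected counts. The object names `counts` and `res` are made up, and typing the counts in is only a shortcut for this illustration.

```r
# Observed counts from the crosstab: rows = car origin, columns = fuel economy code
counts <- matrix(c(92, 61, 74,
                   156, 9, 5),
                 nrow = 3,
                 dimnames = list(origin = c("1", "2", "3"),
                                 economy = c("0", "1")))

res <- chisq.test(counts)

print(res$parameter)        # degrees of freedom: (3 - 1) * (2 - 1)
print(res$p.value)          # a tiny p-value means rejecting independence
print(round(res$expected))  # counts we would expect if origin did not matter
```

Comparing `res$expected` with the observed counts shows which cells drive the result: origin 1 has far more cars in code 1 than independence would predict.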

There are three car salesmen in your town: one of them has American cars, another one has Japanese cars and the other has European cars. What do you think? Which one will offer the best solution for you? Why do you think this? Does it make a real difference?

To answer this question you need to look at the prop.table from earlier. In the table, American cars have the largest share in category 1. Read as "high fuel economy" that makes no sense, because American cars are massive and not known for being the most efficient; but remember that 1 was coded as kmph above the mean, so it flags higher consumption rather than better economy. It is still worth thinking about which other variables to use to decide what origin the car we buy should have.

How do you find out what other variables are in the dataset?

names(car)

##  [1] "mpg"          "engine"       "horse"        "weight"      
##  [5] "accel"        "year"         "origin"       "cylinder"    
##  [9] "American"     "Japanese"     "European"     "temp"        
## [13] "kmph"         "fueleconomy"  "fueleconomy2"

Look at year.

prop.table(table(car$year, car$origin))  * 100

##     
##              1         2         3
##   0  0.0000000 0.0000000 0.0000000
##   70 6.4197531 1.4814815 0.4938272
##   71 4.9382716 1.2345679 0.9876543
##   72 4.4444444 1.2345679 1.2345679
##   73 7.1604938 1.7283951 0.9876543
##   74 3.7037037 1.4814815 1.4814815
##   75 4.9382716 1.4814815 0.9876543
##   76 5.4320988 1.9753086 0.9876543
##   77 4.4444444 0.9876543 1.4814815
##   78 5.4320988 1.4814815 1.9753086
##   79 5.6790123 0.9876543 0.4938272
##   80 1.7283951 2.2222222 3.2098765
##   81 3.2098765 1.2345679 2.9629630
##   82 4.9382716 0.4938272 2.2222222

Subset only the newest cars.

newcars <- subset(car, year == 82) # year is numeric, so compare it with a number

I kept only the cars from 82, the newest ones.

prop.table(table(newcars$fueleconomy, newcars$origin)) * 100

##    
##             1         2         3
##   0 64.516129  6.451613 29.032258

Or we subset only the cars coded 1 on fueleconomy (kmph above the mean).

econcars <- subset(car, fueleconomy==1)

prop.table(table(econcars$cylinder, econcars$origin)) * 100 # full column name instead of the partial match cyl

##    
##              1          2          3
##   3  0.0000000  0.0000000  1.1764706
##   4  1.1764706  2.9411765  0.5882353
##   5  0.0000000  0.5882353  0.0000000
##   6 32.3529412  1.7647059  1.1764706
##   8 58.2352941  0.0000000  0.0000000

Remove the outliers.

hist(car$kmph)

carclean <- subset(car, kmph < 22.9)
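One thing to keep in mind with subset(): rows where the condition is NA are dropped too, so the cars with missing mpg disappear along with the outliers. A sketch on a made-up data frame (`toy` and `toy_clean` are hypothetical names):

```r
# Toy data: one missing value and one value above the 22.9 cutoff
toy <- data.frame(kmph = c(10, NA, 25, 8))

toy_clean <- subset(toy, kmph < 22.9)  # NA < 22.9 is NA, which subset() treats as FALSE

print(nrow(toy))        # 4 rows before
print(nrow(toy_clean))  # only the rows with a known value below the cutoff remain
```

If the NA rows should be kept, one option is `subset(toy, is.na(kmph) | kmph < 22.9)`.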

I check which origin stands out after getting rid of the outliers, but it is still the American cars that dominate category 1 (kmph above the mean).

prop.table(table(carclean$fueleconomy, carclean$origin)) * 100

##    
##             1         2         3
##   0 23.291139 15.443038 18.734177
##   1 38.987342  2.278481  1.265823