Exploring numeric variables
Summarize by using the function summary()
summary(usedcars)
year model price mileage color
Min. :2000 Length:150 Min. : 3800 Min. : 4867 Length:150
1st Qu.:2008 Class :character 1st Qu.:10995 1st Qu.: 27200 Class :character
Median :2009 Mode :character Median :13592 Median : 36385 Mode :character
Mean :2009 Mean :12962 Mean : 44261
3rd Qu.:2010 3rd Qu.:14904 3rd Qu.: 55124
Max. :2012 Max. :21992 Max. :151479
transmission
Length:150
Class :character
Mode :character
You can get the summary of particular columns
summary(usedcars[c("price","mileage")])
price mileage
Min. : 3800 Min. : 4867
1st Qu.:10995 1st Qu.: 27200
Median :13592 Median : 36385
Mean :12962 Mean : 44261
3rd Qu.:14904 3rd Qu.: 55124
Max. :21992 Max. :151479
summary(usedcars[c("model")])
model
Length:150
Class :character
Mode :character
The min/max of used car prices
range(usedcars$year)
[1] 2000 2012
range(usedcars$model)
[1] "SE" "SES"
range(usedcars$mileage)
[1] 4867 151479
The difference of the range
diff(range(usedcars$price))
[1] 18192
Inter Quartile Range (IQR)
IQR tells you how wide the middle of your data is
it’s calculated by: IQR = Q3 − Q1
IQR(usedcars$price)
[1] 3909.5
Use quantile to calculate five-number summary
quantile(usedcars$price)
0% 25% 50% 75% 100%
3800.0 10995.0 13591.5 14904.5 21992.0
The 99th percentile
this lets us filter the outliers by specifying the probability
range
quantile(usedcars$price, probs = c(0.01,0.99))
quintiles
quantile(usedcars$price, seq(from = 0, to = 1, by = 0.2))
0% 20% 40% 60% 80% 100%
3800.0 10759.4 12993.8 13992.0 14999.0 21992.0
boxplot of used car prices and mileage
boxplot(usedcars$price, main="Boxplot of Used Car Prices",
ylab="Price ($)")

histograms of used car prices and mileage
hist(usedcars$price, main = "Histogram of Used Car Prices",
xlab = "Price ($)")

variance and standard deviation with var() and std()
sd(usedcars$price)
[1] 3122.482
var(usedcars$mileage)
[1] 728033954
Exploring numeric Variable
one-way tables for the used car data
table(usedcars$year)
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
3 1 1 1 3 2 6 11 14 42 49 16 1
compute table proportions
model_table <- table(usedcars$model)
prop.table(model_table)
SE SEL SES
0.5200000 0.1533333 0.3266667
round the data
color_table <- table(usedcars$color)
color_pct <- prop.table(color_table) * 100
round(color_pct, digits = 1)
Black Blue Gold Gray Green Red Silver White Yellow
23.3 11.3 0.7 10.7 3.3 16.7 21.3 10.7 2.0
Exploring relationships between variables
scatterplot of price vs. mileage
plot(x = usedcars$mileage, y = usedcars$price,
main = "Scatterplot of Price vs. Mileage",
xlab = "Used Car Odometer (mi.)",
ylab = "Used Car Price ($)")

new variable indicating conservative colors
usedcars$conservative <-
usedcars$color %in% c("Black", "Gray", "Silver", "White")
# checking our variable
table(usedcars$conservative)
FALSE TRUE
51 99
installing gmodels package
install.packages("gmodels")
trying URL 'http://rspm/default/__linux__/focal/latest/src/contrib/gtools_3.9.5.tar.gz'
trying URL 'http://rspm/default/__linux__/focal/latest/src/contrib/gdata_3.0.1.tar.gz'
trying URL 'http://rspm/default/__linux__/focal/latest/src/contrib/gmodels_2.19.1.tar.gz'
The downloaded source packages are in
‘/tmp/Rtmp12cPec/downloaded_packages’
Crosstab of conservative by model
gmodels is mainly used for data analysis helpers
CrossTable() for contingency tables (way nicer than table())
Odds ratios, proportions, chi-square summaries, etc
library(gmodels)
CrossTable(x = usedcars$model, y = usedcars$conservative)
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 150
| usedcars$conservative
usedcars$model | FALSE | TRUE | Row Total |
---------------|-----------|-----------|-----------|
SE | 27 | 51 | 78 |
| 0.009 | 0.004 | |
| 0.346 | 0.654 | 0.520 |
| 0.529 | 0.515 | |
| 0.180 | 0.340 | |
---------------|-----------|-----------|-----------|
SEL | 7 | 16 | 23 |
| 0.086 | 0.044 | |
| 0.304 | 0.696 | 0.153 |
| 0.137 | 0.162 | |
| 0.047 | 0.107 | |
---------------|-----------|-----------|-----------|
SES | 17 | 32 | 49 |
| 0.007 | 0.004 | |
| 0.347 | 0.653 | 0.327 |
| 0.333 | 0.323 | |
| 0.113 | 0.213 | |
---------------|-----------|-----------|-----------|
Column Total | 51 | 99 | 150 |
| 0.340 | 0.660 | |
---------------|-----------|-----------|-----------|
The contribution of True is much more significant than False,
which means that more cars have conservative colors (grey, black,
white and Silver)
Now, we will try more args in the Crosstable() function.
CrossTable(x = usedcars$model, y = usedcars$conservative,
chisq = TRUE,
expected = TRUE,
prop.r = TRUE,
prop.c = TRUE,
prop.t = TRUE,
prop.chisq = TRUE,
)
Cell Contents
|-------------------------|
| N |
| Expected N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 150
| usedcars$conservative
usedcars$model | FALSE | TRUE | Row Total |
---------------|-----------|-----------|-----------|
SE | 27 | 51 | 78 |
| 26.520 | 51.480 | |
| 0.009 | 0.004 | |
| 0.346 | 0.654 | 0.520 |
| 0.529 | 0.515 | |
| 0.180 | 0.340 | |
---------------|-----------|-----------|-----------|
SEL | 7 | 16 | 23 |
| 7.820 | 15.180 | |
| 0.086 | 0.044 | |
| 0.304 | 0.696 | 0.153 |
| 0.137 | 0.162 | |
| 0.047 | 0.107 | |
---------------|-----------|-----------|-----------|
SES | 17 | 32 | 49 |
| 16.660 | 32.340 | |
| 0.007 | 0.004 | |
| 0.347 | 0.653 | 0.327 |
| 0.333 | 0.323 | |
| 0.113 | 0.213 | |
---------------|-----------|-----------|-----------|
Column Total | 51 | 99 | 150 |
| 0.340 | 0.660 | |
---------------|-----------|-----------|-----------|
Statistics for All Table Factors
Pearson's Chi-squared test
------------------------------------------------------------
Chi^2 = 0.1539564 d.f. = 2 p = 0.92591
Now we will try the assocstats()
It returns chi-squared test, Cramer’s V, Contingency Coeff, and
Phi)
assocplot(table(x = usedcars$model,y=usedcars$conservative))

now we can use ci() for confidence intervals for proportions
ci(x=usedcars$price, confidence = 0.95)
Estimate CI lower CI upper Std. Error
12961.9333 12458.1496 13465.7170 254.9496