Descriptive Statistics

Computers

Research Question: We want to investigate whether the price of ‘premium’ computers differs significantly from the price of non-premium ones. What are the typical properties of both groups?

Data wrangling

As you can see, not all of our variables are stored in the right format yet. We need to convert them to formats that match their measurement scales and intended use.

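# price is measured on a ratio scale, so store it as numeric
computers$price <- as.numeric(computers$price)
# labels= renames the sorted levels by position; check with table() that "yes" and "no" were not swapped
computers$premium <- factor(computers$premium, labels = c("yes", "no"))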
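The `labels` argument of `factor()` deserves a quick check, because labels are matched to the sorted levels by position rather than by name. The sketch below uses a made-up vector (not the real data) and assumes the raw column holds the strings "no" and "yes"; if the raw coding is different, the outcome differs too.

```r
# Toy vector standing in for the raw column (hypothetical values)
raw <- c("yes", "no", "yes", "yes", "no")

# Levels are sorted alphabetically ("no", "yes") and labels are matched by
# position, so this call renames "no" to "yes" and "yes" to "no"
swapped <- factor(raw, labels = c("yes", "no"))
table(raw, swapped)

# Naming the levels explicitly keeps the values and only sets their order
safe <- factor(raw, levels = c("yes", "no"))
table(raw, safe)
```

Comparing the two tables shows whether a relabelling kept the original meaning of each category.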

Do not forget to attach the data so that variables can be called by name instead of “data set + $ + variable”. It is also best to start without any outliers or NAs. Let’s check for missing values and, if there are many, handle them with the “dlookr” package:

sum(is.na(computers))

## [1] 0

?imputate_na #in case of many NA's
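If the total were not zero, a per-column count would show where the gaps are. The following is a minimal base R sketch on a made-up data frame; the median fill is a simple stand-in chosen for illustration and is not the method used by dlookr.

```r
# Toy data frame with a few missing values (hypothetical numbers)
toy <- data.frame(price = c(1500, NA, 2200, 1800, NA, 2500),
                  ram   = c(4, 8, 8, NA, 16, 4))

# Missing values per column, more informative than a single total
na_per_column <- colSums(is.na(toy))
na_per_column

# Median imputation for one numeric column (illustrative stand-in only)
toy$price_imputed <- ifelse(is.na(toy$price),
                            median(toy$price, na.rm = TRUE),
                            toy$price)
toy
```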

Frequency Table and TAI

In the first stage of our analysis we group the data into a simple frequency table. First, let’s look at the distribution of prices in our sample and check the properties of the tabulated data using the tabular accuracy index (TAI):

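library(classInt)  # provides classIntervals() and jenks.tests()
# 11 fixed classes of width 500 covering prices from 0 to 5500
tabelka2 <- classIntervals(computers$price, n = 11, style = "fixed", fixedBreaks = seq(0, 5500, by = 500))
jenks.tests(tabelka2)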

##        # classes  Goodness of fit Tabular accuracy 
##       11.0000000        0.9428233        0.7469290

As we can see, the TAI index…
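To make the two measures less of a black box, they can be reproduced by hand. The sketch below uses the usual textbook definitions (TAI from absolute deviations, goodness of variance fit from squared deviations) on simulated prices, not the real data. It is assumed here that these are the quantities jenks.tests() reports; results may differ slightly depending on how values that fall exactly on a class boundary are assigned.

```r
set.seed(1)
# Simulated prices standing in for computers$price (hypothetical values)
price <- round(runif(200, min = 900, max = 5400))

# Fixed 500-unit classes from 0 to 5500
breaks <- seq(0, 5500, by = 500)
class_id <- cut(price, breaks = breaks, include.lowest = TRUE)

# For every observation, the mean of the class it belongs to
class_mean <- ave(price, class_id)

# TAI: 1 - (absolute deviations from class means) /
#          (absolute deviations from the overall mean)
tai <- 1 - sum(abs(price - class_mean)) / sum(abs(price - mean(price)))

# Goodness of variance fit: the same idea with squared deviations
gvf <- 1 - sum((price - class_mean)^2) / sum((price - mean(price))^2)

c(TAI = tai, GVF = gvf)
```

Both values approach 1 when the class means describe the individual observations well, and fall towards 0 when grouping loses most of the information.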

We can use different recipes… (style):

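# style = "sd" derives breaks from the centred and scaled data, so the number of classes can differ from n
tabelka3 <- classIntervals(computers$price, n = 10, style = "sd")
plot(tabelka3, pal = c(1:10))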

jenks.tests(tabelka3)

##        # classes  Goodness of fit Tabular accuracy 
##        9.0000000        0.9273661        0.7161721

Contingency tables

Now, let’s take a look at the categorical data and make some cross-tabulations.

##     premium
## ram   yes   no
##   2     0  394
##   4   472 1764
##   8   140 2180
##   16    0  996
##   24    0  297
##   32    0   16

##        ram
## premium    2    4    8   16   24   32
##     yes    0  472  140    0    0    0
##     no   394 1764 2180  996  297   16
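Tables like the two above can be produced with table() and t(). The sketch below is one possible way to do it, shown on a small made-up data frame rather than the real data; the variable names mirror the ones in the output.

```r
# Toy data standing in for the computers data set (hypothetical values)
toy <- data.frame(
  ram     = c(2, 4, 4, 8, 8, 8, 16, 4),
  premium = factor(c("no", "yes", "no", "no", "yes", "no", "no", "yes"),
                   levels = c("yes", "no"))
)

# Cross-tabulation: rows are ram, columns are premium
tab <- table(ram = toy$ram, premium = toy$premium)
tab

# Transposed view, with premium in the rows
t(tab)

# Row percentages and totals often make the table easier to read
round(prop.table(tab, margin = 1) * 100, 1)
addmargins(tab)
```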

Basic plots

In this section we present our data using base graphics (those installed with R). Choose the most appropriate plots for the measurement scale of the chosen variables. Investigate the heterogeneity of the distribution by presenting the data by groups (e.g. by cd, multi, premium, etc.). Do not forget main titles, axis labels and a legend. Read more about graphical parameters in the help page ?par. Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

We can also plot qualitative data only, presenting the frequencies from the contingency tables using … plots.
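Since the plotting code is hidden, here is a minimal sketch of what such base graphics could look like. The data are simulated and the plot types are only one reasonable choice (a histogram and boxplots for a quantitative variable, a grouped bar plot for two qualitative ones); they are not necessarily the plots used in the original analysis.

```r
set.seed(1)
# Simulated data standing in for the computers data set (hypothetical values)
toy <- data.frame(
  price   = round(rnorm(120, mean = 2200, sd = 500)),
  premium = factor(sample(c("yes", "no"), 120, replace = TRUE)),
  ram     = factor(sample(c(4, 8, 16), 120, replace = TRUE))
)

# Quantitative variable: histogram of prices
hist(toy$price, main = "Distribution of price", xlab = "Price", col = "grey80")

# Quantitative variable by group: side-by-side boxplots
boxplot(price ~ premium, data = toy, main = "Price by premium status",
        xlab = "Premium", ylab = "Price", col = c("tomato", "steelblue"))

# Qualitative variables only: grouped bar plot of a contingency table
counts <- table(toy$premium, toy$ram)
barplot(counts, beside = TRUE, legend.text = rownames(counts),
        main = "RAM by premium status", xlab = "RAM", ylab = "Count")
```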

ggplot2 plots

In this section we will present the same plots but with the use of ggplot2 and ggpubr packages.

ggplot2 can show the average value of each group directly with the stat_summary() function, so there is no need to calculate the group means before plotting.
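A minimal sketch of this idea, assuming ggplot2 version 3.3.0 or later (where the argument is called `fun`) and using simulated data in place of the real data set:

```r
library(ggplot2)

set.seed(1)
# Simulated data standing in for the computers data set (hypothetical values)
toy <- data.frame(
  price   = round(rnorm(120, mean = 2200, sd = 500)),
  premium = factor(sample(c("yes", "no"), 120, replace = TRUE))
)

p <- ggplot(toy, aes(x = premium, y = price, fill = premium)) +
  geom_boxplot(alpha = 0.6) +
  # stat_summary() computes the group means itself and draws them as points
  stat_summary(fun = mean, geom = "point", shape = 18, size = 4) +
  labs(title = "Price by premium status", x = "Premium", y = "Price")
p
```

The diamond in each box marks the group mean, which can then be compared with the median line drawn by the boxplot.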

Using facets

Faceting generates small multiples, each showing a different subset of the data. Small multiples are a powerful tool for exploratory data analysis: you can rapidly compare patterns in different parts of the data and see whether they are the same or different. Read more about facets in the ggplot2 help pages for facet_wrap() and facet_grid().
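As an illustration, the sketch below splits a price histogram into one panel per group with facet_wrap(). The data are simulated, and the grouping variable cd is used only as an example of a two-level factor.

```r
library(ggplot2)

set.seed(1)
# Simulated data standing in for the computers data set (hypothetical values)
toy <- data.frame(
  price = round(rnorm(200, mean = 2200, sd = 500)),
  cd    = factor(sample(c("yes", "no"), 200, replace = TRUE))
)

p_facets <- ggplot(toy, aes(x = price)) +
  geom_histogram(bins = 20, fill = "steelblue", colour = "white") +
  # One panel per level of cd, sharing the same axes for easy comparison
  facet_wrap(~ cd) +
  labs(title = "Price distribution by cd", x = "Price", y = "Count")
p_facets
```

Because the panels share their axes by default, differences in location and spread between the groups are visible at a glance.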