Research Question: We want to investigate whether the price of ‘premium’ computers differs significantly from the price of non-premium ones. What are the typical properties of both groups?
As you can see, not all variables have the appropriate type yet. We need to convert each variable to the type that matches its measurement scale and intended use.
computers$price <- as.numeric(computers$price)
computers$premium <- factor(computers$premium, labels=c("yes","no"))
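One caveat about this kind of conversion: `labels` are assigned to the levels in their sorted order, not matched by name. The sketch below uses a small made-up vector (not the computers data) and assumes the raw column holds the strings "yes" and "no"; if the column is coded differently, the mapping can be checked in the same way.

```r
# Hypothetical raw values, for illustration only
raw <- c("yes", "no", "no", "yes", "no")

# Levels are sorted alphabetically ("no", "yes"), so these labels swap the meaning
swapped <- factor(raw, labels = c("yes", "no"))

# Naming the levels explicitly keeps every value and still puts "yes" first
safe <- factor(raw, levels = c("yes", "no"))

# Compare the raw values with both versions
data.frame(raw, swapped, safe)
table(raw, swapped)
```

A quick `table()` of the raw column against the converted one is a cheap way to confirm that no category changed its meaning.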
Do not forget to attach the data, so that variables can be called by name instead of as “data set + $ + variable”. It would be nice to start without any outliers or NAs. Let’s check for them and, if needed, handle them with the “dlookr” package:
sum(is.na(computers))
## [1] 0
?imputate_na #in case of many NA's
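The check above counts missing values in the whole data set at once. As a minimal base-R sketch, missing values can also be counted per column and potential outliers flagged with the boxplot rule. The data frame `toy` and its values are made up for illustration, and the median fill at the end is a hand-written stand-in for a simple imputation, not the dlookr implementation.

```r
# Toy data with one missing price and one extreme price (made-up values)
toy <- data.frame(
  price = c(1500, 1800, NA, 2100, 1700, 9900),
  ram   = c(4, 8, 4, 16, 8, 8)
)

# Missing values per column
na_per_column <- colSums(is.na(toy))
na_per_column

# Values flagged by the 1.5 * IQR rule that boxplot() also uses
price_outliers <- boxplot.stats(toy$price)$out
price_outliers

# One simple fix for a numeric column: replace NA with the median
toy$price[is.na(toy$price)] <- median(toy$price, na.rm = TRUE)
sum(is.na(toy))
```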
In the first stage of our analysis we are going to group our data into a simple frequency table. First, let’s take a look at the distribution of prices in our sample, and then verify the tabular accuracy using the TAI (tabular accuracy index) measure.
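A simple frequency table of this kind can be sketched in base R with cut() and table(). The prices below are simulated stand-ins (the name `price_sim` and its values are assumptions, not the computers data), and the class limits are set every 500 units.

```r
set.seed(1)
# Simulated prices standing in for computers$price (assumption)
price_sim <- round(runif(200, min = 900, max = 5400))

# Class limits every 500 units; cut() closes intervals on the right by default
breaks <- seq(0, 5500, by = 500)
price_class <- cut(price_sim, breaks = breaks, dig.lab = 4)

# Counts, shares and cumulative shares per class
counts <- table(price_class)
freq_table <- data.frame(
  class     = names(counts),
  n         = as.vector(counts),
  share     = as.vector(prop.table(counts)),
  cum_share = cumsum(as.vector(prop.table(counts)))
)
freq_table
```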
Let’s calculate the TAI index to check the properties of the tabulated data.
tabelka2 <- classIntervals(computers$price, n=11, style="fixed", fixedBreaks=seq(0,5500,by=500))
jenks.tests(tabelka2)
##  # classes Goodness of fit Tabular accuracy
## 11.0000000       0.9428233        0.7469290
As we can see, the TAI index…
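To make the two measures reported by jenks.tests() more concrete: under the usual definitions, the goodness of variance fit compares squared deviations within classes with squared deviations from the overall mean, and the tabular accuracy index does the same with absolute deviations, so values closer to 1 mean that the classes lose less information. The sketch below recomputes both by hand on simulated prices. The helper names `gvf_by_hand` and `tai_by_hand` are made up for this illustration, and the result is not expected to match the output above because the data differ.

```r
set.seed(1)
# Simulated prices standing in for computers$price (assumption)
price_sim <- round(runif(200, min = 900, max = 5400))

# Assign each price to a 500-unit class (integer class codes)
class_id <- cut(price_sim, breaks = seq(0, 5500, by = 500), labels = FALSE)

# Goodness of variance fit: 1 - within-class sum of squares / total sum of squares
gvf_by_hand <- function(x, cls) {
  ss <- function(v) sum((v - mean(v))^2)
  1 - sum(tapply(x, cls, ss)) / ss(x)
}

# Tabular accuracy index: the same idea with absolute deviations
tai_by_hand <- function(x, cls) {
  sad <- function(v) sum(abs(v - mean(v)))
  1 - sum(tapply(x, cls, sad)) / sad(x)
}

c(GVF = gvf_by_hand(price_sim, class_id),
  TAI = tai_by_hand(price_sim, class_id))
```

Putting all observations into a single class gives 0 for both measures, and giving every distinct value its own class gives 1, which is why the indices are only meaningful when compared at a similar number of classes.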
We can also use different recipes… (the style argument of classIntervals()):
tabelka3<-classIntervals(computers$price, n=10, style="sd")
plot(tabelka3,pal=c(1:10))
jenks.tests(tabelka3)
## # classes Goodness of fit Tabular accuracy
## 9.0000000       0.9273661        0.7161721
Note that only 9 classes were produced although n=10 was requested: with style="sd" the breaks are chosen as pretty values on the centred and scaled variable, so n is treated as a target rather than an exact number of classes.
Now, let’s take a look at the categorical data and make some cross-tabulations.
##     premium
## ram   yes   no
##   2     0  394
##   4   472 1764
##   8   140 2180
##   16    0  996
##   24    0  297
##   32    0   16
##        ram
## premium    2    4    8   16   24   32
##     yes    0  472  140    0    0    0
##     no   394 1764 2180  996  297   16
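The code that produced these tables is not shown. A minimal sketch of how such cross-tabulations can be built in base R follows, using a small made-up data frame `toy` instead of the computers data; prop.table() and addmargins() are optional extra steps that add within-row shares and totals.

```r
# Made-up example data (not the computers data)
toy <- data.frame(
  ram     = c(2, 4, 4, 8, 8, 8, 16, 4),
  premium = factor(c("no", "yes", "no", "yes", "no", "no", "no", "yes"),
                   levels = c("yes", "no"))
)

# Two-way table with ram in rows and premium in columns
tab_ram_premium <- table(ram = toy$ram, premium = toy$premium)
tab_ram_premium

# The same counts transposed: premium in rows, ram in columns
tab_premium_ram <- t(tab_ram_premium)
tab_premium_ram

# Share of premium / non-premium within each ram size (rows sum to 1)
round(prop.table(tab_ram_premium, margin = 1), 2)

# Add row and column totals
addmargins(tab_ram_premium)
```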
In this section we should present our data using basic (pre-installed with R) graphics. Choose the most appropriate plots according to the measurement scale of the chosen variables. Investigate the heterogeneity of the distribution by presenting the data by groups (e.g. by cd, multi or premium type). Do not forget about main titles, axis labels and a legend. Read more about graphical parameters in the help page for par(). Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
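As one possible starting point, here is a sketch of a grouped box plot and a histogram with group means in base graphics, with titles, axis labels and a legend. It runs on simulated data; the data frame `sim`, its values and the colour choices are assumptions for illustration only.

```r
set.seed(42)
# Simulated stand-in for the computers data (assumption)
sim <- data.frame(
  price   = c(rnorm(100, mean = 2300, sd = 400), rnorm(100, mean = 2000, sd = 500)),
  premium = factor(rep(c("yes", "no"), each = 100), levels = c("yes", "no"))
)

group_cols <- c(yes = "steelblue", no = "orange")

# Two plots side by side
old_par <- par(mfrow = c(1, 2))

# Box plot of a numeric variable split by a factor
boxplot(price ~ premium, data = sim, col = group_cols,
        main = "Price by premium status",
        xlab = "Premium", ylab = "Price")

# Histogram of all prices, with group means marked by vertical lines
group_means <- tapply(sim$price, sim$premium, mean)
hist(sim$price, breaks = 20, col = "grey85", border = "white",
     main = "Distribution of price", xlab = "Price")
abline(v = group_means, col = group_cols, lwd = 2)
legend("topright", legend = names(group_means), title = "Premium",
       col = group_cols, lwd = 2, bty = "n")

# Restore the previous graphical settings
par(old_par)
```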
We can also plot qualitative data only, presenting the frequencies from the contingency tables using … plots.
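The text does not say which plot type is meant here. Two common base-R options for a contingency table are a grouped bar plot and a mosaic plot; the sketch below shows both, with the counts from the ram by premium table above typed in by hand. The colours and titles are arbitrary choices.

```r
# Counts copied from the ram by premium cross-tabulation shown earlier
counts <- matrix(
  c(  0,  472,  140,   0,   0,  0,
    394, 1764, 2180, 996, 297, 16),
  nrow = 2, byrow = TRUE,
  dimnames = list(premium = c("yes", "no"),
                  ram = c("2", "4", "8", "16", "24", "32"))
)
tab <- as.table(counts)

old_par <- par(mfrow = c(1, 2))

# Grouped bar plot: one cluster of bars per ram size
barplot(tab, beside = TRUE, col = c("steelblue", "orange"),
        main = "Computers by RAM and premium status",
        xlab = "RAM", ylab = "Number of computers",
        legend.text = TRUE, args.legend = list(title = "Premium", bty = "n"))

# Mosaic plot: tile areas are proportional to the cell counts
mosaicplot(t(tab), color = c("steelblue", "orange"),
           main = "RAM by premium status")

# Restore the previous graphical settings
par(old_par)
```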
In this section we will present the same plots, this time using the ggplot2 and ggpubr packages.
ggplot2 can show the average value of each group with the stat_summary() function, so there is no need to calculate the mean values before plotting!
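A minimal sketch of stat_summary() on simulated data follows. The data frame `sim` and its values are assumptions, and the `fun` argument assumes ggplot2 3.3.0 or later (older versions used `fun.y`).

```r
library(ggplot2)

set.seed(42)
# Simulated stand-in for the computers data (assumption)
sim <- data.frame(
  price   = c(rnorm(100, mean = 2300, sd = 400), rnorm(100, mean = 2000, sd = 500)),
  premium = factor(rep(c("yes", "no"), each = 100), levels = c("yes", "no"))
)

# Box plots per group, with the group mean added as a red diamond
p <- ggplot(sim, aes(x = premium, y = price, fill = premium)) +
  geom_boxplot(alpha = 0.6) +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 4, colour = "red") +
  labs(title = "Price by premium status", x = "Premium", y = "Price") +
  theme_minimal()

print(p)
```

The mean is computed inside the plotting layer for each value of `premium`, so the data frame passed to ggplot() stays in its raw, one-row-per-computer form.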
Faceting generates small multiples, each showing a different subset of the data. Small multiples are a powerful tool for exploratory data analysis: you can rapidly compare patterns in different parts of the data and see whether they are the same or different. Read more about facets in the ggplot2 documentation for facet_wrap() and facet_grid().
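As a sketch of faceting, the example below draws one histogram panel per combination of two grouping variables. The data are simulated: the data frame `sim`, the random assignment of `premium` and `cd`, and the number of bins are assumptions for illustration.

```r
library(ggplot2)

set.seed(42)
# Simulated stand-in for the computers data (assumption)
sim <- data.frame(
  price   = rnorm(300, mean = 2200, sd = 500),
  premium = factor(sample(c("yes", "no"), 300, replace = TRUE), levels = c("yes", "no")),
  cd      = factor(sample(c("yes", "no"), 300, replace = TRUE), levels = c("yes", "no"))
)

# One histogram panel per combination of cd (rows) and premium (columns)
p <- ggplot(sim, aes(x = price)) +
  geom_histogram(bins = 25, fill = "steelblue", colour = "white") +
  facet_grid(cd ~ premium, labeller = label_both) +
  labs(title = "Price distribution by cd and premium", x = "Price", y = "Count")

print(p)
```

With a single grouping variable, facet_wrap(~ premium) gives the same kind of small multiples in a wrapped layout.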