Portfolio assignment 1: Statistical summaries of a single variable and data types

I will import the RBPottery dataset from the archdata package, which covers Romano-British pottery, to be used throughout this assignment. The dataset contains 48 observations of 12 different variables.

library(archdata)
data("RBPottery")
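Before working with the data, it can be useful to check its size and the type of each variable. The following is a minimal sketch, assuming the archdata package is installed; it only uses base R functions on the loaded data frame.

```r
library(archdata)
data("RBPottery")

# Number of rows (observations) and columns (variables)
dim(RBPottery)

# Name and type of every variable, with the first few values
str(RBPottery)
```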

Section 1 - interval or ratio data

This section of the assignment will focus on numerical data from the dataset. Here we’ll focus on the percentage of iron trioxide (Fe2O3) in the pottery.

Percentage of iron trioxide

To make our lives easier, and to avoid writing both the dataset and the variable each time, we will first save the variable under a shorter name:

Fe203<-RBPottery$Fe2O3

Then we will create a summary of all the numerical data of the variable ‘Fe203’.

summary(Fe203)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.920   5.428   6.895   5.826   7.353   9.520

This summary gives an overview of the variable: the lowest iron trioxide value is 0.92%, while the highest is 9.52%. The lowest 25% of the observations have an iron trioxide percentage below 5.43%, while the highest 25% have a value above 7.35%. The median is 6.895%, while the mean is 5.83%. That the mean lies well below the median suggests that a number of low values pull the average down.

Let us dwell on the last two: the median and the mean. Both describe the central tendency of a dataset, but from different perspectives. The mean is the sum of all values divided by the number of values, which here is 48. Manually, the calculation would be as follows:

sum(Fe203)/48

## [1] 5.825833

As seen, this gives us the same value as the mean in the summary, only shown with more decimals.
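The manual calculation can also be written without hard-coding the number 48, so that it keeps working if the data changes. This is only a sketch; the names iron, n and manual_mean are chosen for illustration and are not part of the assignment.

```r
library(archdata)
data("RBPottery")

iron <- RBPottery$Fe2O3        # illustrative short name

n <- length(iron)              # number of observations, counted by R
manual_mean <- sum(iron) / n   # sum of all values divided by their count

# Compare with the built-in function (TRUE if they agree)
all.equal(manual_mean, mean(iron))
```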

The median, on the other hand, is the half-way value in the distribution of the variable. Finding it by hand would require ranking all 48 values from lowest to highest - half the values lie beneath the median and the other half above. Because 48 is an even number, there is no single middle value, so the median is the average of the 24th and 25th ranked values. Here, we’ll let R do the ranking for us with median():

median(Fe203)

## [1] 6.895
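To see what median() does behind the scenes, the ranking can be sketched by hand. The variable names below are illustrative assumptions; the logic is simply to sort the values and take the middle one, or the average of the two middle ones when the count is even.

```r
library(archdata)
data("RBPottery")

iron <- RBPottery$Fe2O3          # illustrative short name
iron_sorted <- sort(iron)        # rank the values from lowest to highest
n <- length(iron_sorted)

if (n %% 2 == 0) {
  # Even number of values: average the two middle values
  manual_median <- mean(iron_sorted[c(n / 2, n / 2 + 1)])
} else {
  # Odd number of values: take the single middle value
  manual_median <- iron_sorted[(n + 1) / 2]
}

manual_median
```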

Now, let’s visualize the data. We’ll make both a histogram and a boxplot to visualize and compare the data. The histogram shows how many pots fall within each range of iron trioxide percentages. Furthermore, we have added two vertical lines with abline(): a green one marking the mean and a blue one marking the median.

hist(Fe203,main="Percentage of iron trioxide",xlab="%",ylab="Frequency")
abline(v=mean(Fe203),col=3,lwd=2)
abline(v=median(Fe203),col=4,lwd=2)

The histogram provides a way to visualize the distribution of the data, showing us a complete overview of the frequency of iron trioxide percentage in the pottery.
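The numbers behind the bars can also be inspected directly. As a small sketch, hist() can be asked to return the bins and counts without drawing anything; the object names here are illustrative, and the bins are whatever R picks by default.

```r
library(archdata)
data("RBPottery")

iron <- RBPottery$Fe2O3             # illustrative short name

# Compute the histogram without plotting it
h <- hist(iron, plot = FALSE)

# One row per bar: lower edge, upper edge and number of pots in that range
data.frame(from = head(h$breaks, -1),
           to   = h$breaks[-1],
           pots = h$counts)
```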

Now, let’s try visualizing the data with a boxplot.

boxplot(Fe203,main="Percentage of iron trioxide",ylab="%")

This kind of graph includes the median automatically, here shown as the black line in the grey box, as well as the other values of our summary(Fe203). The grey box represents the middle 50% of the data: its bottom and top edges mark the 1st and 3rd quartiles. The whiskers extending from the box reach the lowest and highest values that lie within 1.5 times the interquartile range (the height of the box) from the box edges. However, we see a group of datapoints lying below the lower whisker - these are outliers in the dataset, falling significantly outside of the main distribution.
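The points drawn as outliers can be listed with boxplot.stats(), which is the base R function boxplot() relies on. The sketch below also recomputes the 1.5 times interquartile range limits by hand from the hinges returned by fivenum(); the variable names are illustrative.

```r
library(archdata)
data("RBPottery")

iron <- RBPottery$Fe2O3     # illustrative short name

stats <- boxplot.stats(iron)
stats$stats                 # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out                   # values drawn as individual outlier points

# The same limits by hand: 1.5 times the box height beyond each box edge
hinges <- fivenum(iron)     # min, lower hinge, median, upper hinge, max
box_height <- hinges[4] - hinges[2]
lower_limit <- hinges[2] - 1.5 * box_height
upper_limit <- hinges[4] + 1.5 * box_height
manual_out <- iron[iron < lower_limit | iron > upper_limit]
```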

But which method of visualization is better? The truth is that the two methods are great at visualizing different aspects of a dataset. While the histogram gave us the full distribution of Fe203 within the specific dataset, showing us how many pots fall within each range of values, the boxplot instead gave us an overview of the summary of the dataset, indicating the median and the quartiles, as well as the outliers in the data distribution. I would argue that in general, boxplots are great for comparing tendencies across datasets - in a case like this, it could have been the percentage of iron trioxide in assemblages from e.g. different locations, or the percentage of different oxides - to get an outline of the distribution in each assemblage. By contrast, histograms are great at visualizing the distribution within a single dataset, such as the case above, where the frequency of the iron trioxide percentages is seen in greater detail.
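As an illustration of the comparison mentioned above, boxplots of the iron trioxide percentage can be drawn side by side for each kiln using the formula interface of boxplot(). This is a sketch that goes slightly beyond the analysis in this section; the title and axis labels are my own choices.

```r
library(archdata)
data("RBPottery")

# One box per kiln, all on the same percentage scale
bp <- boxplot(Fe2O3 ~ Kiln, data = RBPottery,
              main = "Percentage of iron trioxide by kiln",
              xlab = "Kiln", ylab = "%")

# Number of pots behind each box
setNames(bp$n, bp$names)
```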

Section 2 - nominal or ordinal data

Now, let’s discuss the nominal data in the dataset. Here we will focus on the different locations of the kilns used to manufacture the Roman pottery in the dataset.

Location of kilns

Again, to make our lives easier, and to avoid writing both the dataset and the variable each time, we’ll first save the variable under a shorter name. This time, it is Kilns:

Kilns<-RBPottery$Kiln

Now, as mentioned, this data is not numerical but categorical, so to use it for visualization, we will first have to create a table of the data.

table(Kilns)

## Kilns
##     Gloucester     Llanedeyrn       Caldicot Islands Thorns   Ashley Rails 
##             22             14              2              5              5

With the table created, we can already see that the pots are unevenly spread across the kilns. However, let us put the data into a barplot instead to create a visualization.

barplot(table(Kilns),main="Kilns",ylab="number of pots")

This barplot shows the number of pots from each of the five kilns. However, it is important to note that as of now, the bars show absolute counts and not proportions. Should we wish to compare the kiln distribution in this dataset to that of another, it would be beneficial to turn our table into a proportional table, which would instead show the share of the pottery coming from each kiln. This would make our dataset more comparable to datasets of a different size - however, we will not do that today.
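For reference, such a proportional table could be sketched with prop.table(), which divides each count by the total. The object names and axis label below are illustrative assumptions.

```r
library(archdata)
data("RBPottery")

kiln_counts <- table(RBPottery$Kiln)    # absolute number of pots per kiln
kiln_props <- prop.table(kiln_counts)   # each count divided by the total

# Shares in percent, rounded to one decimal
round(100 * kiln_props, 1)

barplot(kiln_props, main = "Kilns", ylab = "proportion of pots")
```

Gloucester, for example, accounts for 22 of the 48 pots, or roughly 46%.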

Qualitative vs Quantitative data

Instead, we will discuss the nature of the data itself. As the previous sections show, some data in the dataset is numerical while other data is nominal. The numerical data is, as the name implies, quantitative data that can be measured or counted. This could be the number of Romano-British pots or their measured height. Numerical data allows us to calculate measures of central tendency, such as the mean or median, as we did in section 1. Nominal data, on the other hand, is one of the two kinds of qualitative data, the other being ordinal data. While ordinal data can be ordered, nominal data can only be grouped into categories, not arranged in a specific order. This is the case with our kiln data above: all the data falls within the kiln-location categories, but there is no internal hierarchy. Thus, it is not meaningful to calculate a mean or median on this kind of data - instead, we can count how many observations fall within each category and visualize those counts, as we did with the barplot.
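The difference between nominal and ordinal data can be sketched in R with factors. The kiln variable comes from the dataset, while the size vector below is made up purely for illustration and is not part of RBPottery.

```r
library(archdata)
data("RBPottery")

kilns <- RBPottery$Kiln

# Nominal data: we can count categories and find the most common one
kiln_counts <- table(kilns)
most_common <- names(which.max(kiln_counts))
most_common

# Ordinal data (hypothetical example): categories with a meaningful order
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)

# Comparisons are meaningful for ordered factors
size < "large"
```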