Data Analytics Module
Lecturer: Hans van der Zwan
Handout 02
Topic: summary statistics
Literature
Rumsey D. J. (2010). Statistical Essentials for Dummies. Hoboken: Wiley Publishing.
Ismay C. & Kim A. Y. (2019). ModernDive. Statistical Inference for Data Science. https://moderndive.com.
Recommended literature
Preparation class
See module description
Graphs show the form of the distribution of the data and are a very usefull tool in exploring a dataset. Besides graphs, statistics that summarize the (distribution of the) data, are used to transform data into information. The five-number summary, which forms the basis for a boxplot, is a good example of summarizing data.
The most important statistics are statistics which measure central tendency and spread. Which summary statistics are appropriate depends on the type of a variable (categorical or numerical) and the context of the data to be studied.
Metrics for central tendency: (arithmetric) mean and median. For grouped data, also the mode can be used to measure the central tendency. With individual measurements of a numerical variable, the mode is in most cases not a good metric for central tendency, because an individual observation can have the highest frequency without being in the center of the distribution.
Most common metrics to measure variation in the data: range, interquartile range (IQR), variance and standard deviation.
There are many other statistics that can be used to describe a data set. Some examples:
- the number of observations in the data set; actually this is almost always the first statistic reported;
- the proportion of the data that has a certain property; e.g. the proportion of voters supporting the proposal in the last tweet of president Trump;
- 5% trimmed mean: the mean of the data after removing the 5% lowest and the 5% highest values;
- coefficent of variation (CV): \(\frac{SD}{MEAN}\);
- skewness coefficient.
In many cases data are summarized per group. E.g. houseprices in the Netherlands can be summarized per province or per municipality, healthcare costs can be sumarized per age group etc.
Example: table with summary statistics of houses sold in London in January 2019
Table 1
Summary statistics houses sold in London in January 2019
TYPE | COUNT | AVERAGE | MEDIAN | SD | PERC_ABOVE_1mln |
FLAT | 782 | 592,700 | 445,000 | 612,700 | 5.1 |
DETACHED | 37 | 1,498,800 | 1,250,000 | 873,100 | 1.7 |
SEMI-DETACHED | 125 | 982,200 | 805,000 | 642,400 | 3.0 |
TERRACED | 442 | 876,100 | 630,000 | 919,700 | 7.5 |
MS Excel comes with a lot of statistical functions which can be used to create a data summary. For an overview of statistical functions in Excel, see here.
Another possibility to create an overview of the most common summary statistics is using the analyses toolpak. On the internet, one can find many instruction manuals how to use this add-in, see for instance here.
In many cases it is useful to transform values to another unit of measurement. E.g. if in a data set there is a temperature variable (TEMP) measured in degrees Fahrenheit and for one reason or another the unit should be degrees Celsius, the following variable transformation can be used: TEMP_CELCIUS = (TEMP – 32) x 5/9.
A common transformation used is standardizing; the values of a variable (X-variable) are transformed into a new variable (Z-variable) which is the location of the observation relative to the mean, expressed in standard deviations: \(z = \frac{x\;-\;mean}{standard\;deviation}\) .
The z-value gives information about the place of a certain observation in the distribution.
Open data sources offer data in a variety of different formats. A common used format is the csv (comma seperated values) format. Be aware that sometimes the semi-colon instead of the comma is used as a seperator in such a file.
In many cases the same data can be presensted in different ways. A common used transformation is from long to wide data format or the other way around.
In May 2019 the municipality administration of Amsterdam announced that from 2030 on cars that run on diesel or petrol are not allowed in the city [https://www.cnet.com/roadshow/news/amsterdam-ban-gas-diesel-cars-motorcycles-2030/]. This because of the effect of these vehicles on the air quality in the city.
Although there were a lot of negative reactions on this intention, it is well known that bad air quality has a negative influence on people’s health, see for instance this WHO publication. The question can be asked if the other major cities in The Netherlands should consider measures similar to these of Amsterdam.
The above considerations lead to the research question: Are the four largest cities in the Netherlands different in terms of air quality? To answer this question, it has been subdivided in a couple of sub questions:
(Q1) How can air quality be measured?
(Q2) What are standards for healthy air quality?
(Q3) How does air quality relate to human health?
(Q4) Which patterns can be identified in the air quality in the four major cities over the year?
(Q5) Which patterns can be identified in the air quality in the four major cities during a day?
(Q6) Are there significant differences between the quality of the air in Amsterdam, Rotterdam, Utrecht and The Hague.
Assume that, based on the answer on Q1 and Q2, the PM10 level has been used as metric for air quality in this research. To answer the field research questions (Q4, Q5 and Q6) use has been made of secundary data which is colleted by the Netherlands National Institute for Public Health and the Environment (RIVM) and published on luchtmeetnet.nl. This research is based on the published figures from 2018.
PM10 levels as measured at six locations in Amsterdam city centre, two locations in The Hague city centre, two locations in Utrecht and four locations in Rotterdam city centre, are used.
At all locations the PM10 level is measured every hour. The data set contains negative values. These are due to problems with the measurement instruments and are replaced by NA (not available). A graphical data analysis has been performed to detect outliers. Outliers are examined and it has been discussed how to deal with these values.
Identify daily patterns in the data has been done by:
Option 1.
See the research example above about air pollution in the four major Dutch cities. Choose one of these cities and investigate the pattern over the year (see section Operationalization (1): pattern over the year).
Option 2.
The file hp_london_January161718.xlsx contains information about the houses sold in London in January 2016, January 2017 and in January 2018.
Analyze the data in this file:
(i) compare the number of properties sold in January 2016, 2017 and 2018.
(ii) plot side-by-side and stacked barplots with the number of properties sold in Janaury 2016, 2017 and 2018, divided by property type.
(iii) analyze the selling prices using boxplots and histograms.
(iv) create tables with the most common summary statistics to analyze the selling prizes of the different property types over the years.
Option 3.
Analyze the data in your own data set using graphs and summary statistics.