Datasaurus and the ABC principle - Assignment 2

Tam Nguyen

[A]

1.134

High-density lipoprotein (HDL) is sometimes called the “good cholesterol” because low values are associated with a higher risk of heart disease. According to the American Heart Association, people over the age of 20 years should have at least 40 milligrams per deciliter (mg/dl) of HDL cholesterol. U.S. women aged 20 and over have a mean HDL of 55 mg/dl with a standard deviation of 15.5 mg/dl. Assume that the distribution is Normal.


(a) What percent of women have low values of HDL (40 mg/dl or less)?

We know that women aged 20 and over have a mean HDL of 55 mg/dl and an SD of 15.5 mg/dl. From these we can calculate the z score for an HDL value of 40 mg/dl. We calculate z scores using the formula \[z = \frac{x - \mu}{\sigma}\]

# define the standardized score
findZscore <- function(x, mean_x, sd_x) {
  (x - mean_x)/sd_x 
} 

# find the z score of women with HDL of 40mg/dl
findZscore(40, 55, 15.5)
## [1] -0.9677419

With a z score of -0.97, we look at the z table and find that about 16.6% of women have HDL of 40 mg/dl or less.
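Instead of reading a printed z table, we can also get the cumulative probability directly from R with pnorm(), as a quick check of the table lookup above:

# cumulative probability below z = -0.97, i.e. P(HDL <= 40)
round(pnorm(findZscore(40, 55, 15.5)), 4)
## [1] 0.1666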


(b) HDL levels of 60 mg/dl and higher are believed to protect people from heart disease. What percent of women have protective levels of HDL?

We do the same calculation as above for women who have HDL levels of 60 mg/dl or higher. The area to the right of that z score gives the percentage of women with protective levels of HDL:

findZscore(60, 55, 15.5)
## [1] 0.3225806
# find the area to the right of the z score (the upper tail):
protect <- 1 - pnorm(findZscore(60, 55, 15.5))
round(protect, 4)
## [1] 0.3735

With a z score of 0.32, the z table gives an area of about 0.63 to the left, so the area to the right is about 0.37: roughly 37% of women have protective levels of HDL.


(c) Women with more than 40 mg/dl but less than 60 mg/dl of HDL are in the intermediate range, neither very good nor very bad. What proportion are in this category?

We can calculate the proportion in the intermediate range by subtracting the proportion of women below 40 mg/dl from the proportion below 60 mg/dl.

findZscore(40, 55, 15.5) 
## [1] -0.9677419
findZscore(60, 55, 15.5) 
## [1] 0.3225806

From the z scores above (-0.97 and 0.32), the z table shows that about 17% of women have HDL below 40 mg/dl and about 63% have HDL below 60 mg/dl. Therefore, the intermediate range holds 63% - 17% = 46% of women.
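As a check, pnorm() can compute this proportion directly from the N(55, 15.5) distribution, without the intermediate table lookups:

# P(40 < HDL < 60) as a difference of cumulative probabilities
round(pnorm(60, mean = 55, sd = 15.5) - pnorm(40, mean = 55, sd = 15.5), 4)
## [1] 0.4599

The exact value, about 46%, matches the table-based answer.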


1.138

Quartiles for Normal distributions. The quartiles of any distribution are the values with cumulative proportions 0.25 and 0.75.

(a) What are the quartiles of the standard Normal distribution?

We can use R to approximate the quartiles of the standard Normal distribution by simulation. We generate 10,000 observations with mean 0 and standard deviation 1 using the rnorm function:

set.seed(10000)
normal <- rnorm(10000, mean = 0, sd = 1)

# calculate the quantiles for this normal distribution:
quantile(normal)
##            0%           25%           50%           75%          100% 
## -4.0226781179 -0.6733417727 -0.0005222524  0.6951116476  3.3967841641

In this simulation, the first quartile (Q1) has a standardized score of about -0.67 and the third quartile (Q3) about 0.70, both close to the theoretical values of ±0.674.

Similarly, given that the first quartile and the third quartile have proportions of 0.25 and 0.75, we can look at the z table and see z scores of these quartiles respectively. The first quartile has a z score of approximately -0.67 and the third quartile has a z score of approximately 0.67.
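The exact quartiles come from qnorm(), the inverse of the Normal cumulative distribution function, and agree with both the simulation and the table:

# exact quartiles of the standard Normal distribution
qnorm(c(0.25, 0.75))
## [1] -0.6744898  0.6744898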

(b) Using your numerical values from part (a), write an equation that gives the quartiles of the N(μ, σ) distribution in terms of μ and σ.

Each quartile of the N(μ, σ) distribution is found by multiplying the corresponding standard Normal quartile by σ and then adding μ:

\[Q_1 = \mu - 0.674\sigma\]

\[Q_3 = \mu + 0.674\sigma\]

(The median, Q2, is simply μ; the value of -0.00052 in the simulation above is just sampling noise around zero.)
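As a quick illustration, plugging the HDL distribution from Exercise 1.134, N(55, 15.5), into these equations gives quartiles of about 44.5 and 65.5 mg/dl:

# quartiles of HDL, where X ~ N(mean = 55, sd = 15.5)
55 + qnorm(c(0.25, 0.75)) * 15.5
## [1] 44.54541 65.45459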


1.146

CO2 emissions in vehicles. Natural Resources Canada tests new vehicles each year and reports several variables related to fuel consumption for vehicles in different classes. For 2015, it provides data for 526 vehicles that use regular fuel. Two variables reported are carbon dioxide (CO2) emissions and highway fuel consumption. CO2 is measured in grams per kilometer (g/km), and highway fuel consumption is measured in liters per 100 kilometers (L/100 km). Use graphical and numerical summaries to describe the distribution of CO2 emissions for these vehicles. Be sure to justify your choice of summaries.

Answer

First, we display the dataset to get a feel for the data:

## make the table of emission
knitr::kable(
  emission[1:5, ],
  caption = "CO2 emission table"
)
Table: CO2 emission table

| ModelYear | Manufacturer | Model | VehicleClass | EngineSize | Cylinders | Transmission | FUEL | FuelConsCity | FuelConsHwy | FuelConsComb | CO2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015 | BUICK | ENCLAVE | SUV - STANDARD | 3.6 | 6 | A6 | X | 14.2 | 9.9 | 12.3 | 283 |
| 2015 | BUICK | ENCLAVE AWD | SUV - STANDARD | 3.6 | 6 | A6 | X | 14.6 | 10.2 | 12.6 | 290 |
| 2015 | BUICK | ENCORE | SUV - SMALL | 1.4 | 4 | AS6 | X | 9.5 | 7.2 | 8.5 | 196 |
| 2015 | BUICK | ENCORE AWD | SUV - SMALL | 1.4 | 4 | AS6 | X | 10.2 | 8.0 | 9.2 | 212 |
| 2015 | BUICK | LACROSSE | MID-SIZE | 3.6 | 6 | AS6 | X | 13.7 | 8.6 | 11.4 | 262 |

Given that there are many columns that are not essential for our analysis, such as model year or engine size, we select only the columns related to our graphical summary: VehicleClass, CO2, and highway fuel consumption (FuelConsHwy):

# select only Vehicle Class and their emission
newEmission <- emission %>% 
  select(VehicleClass, CO2, FuelConsHwy) %>% 
  mutate(Vehicle = as.factor(as.character(VehicleClass)), 
         VehicleClass = NULL) 


# make a table
knitr::kable(
  newEmission[1:5, ],
  caption = "Table of CO2 emissions based on vehicle classes"
)

Table: Table of CO2 emissions based on vehicle classes

| CO2 | FuelConsHwy | Vehicle |
|---|---|---|
| 283 | 9.9 | SUV - STANDARD |
| 290 | 10.2 | SUV - STANDARD |
| 196 | 7.2 | SUV - SMALL |
| 212 | 8.0 | SUV - SMALL |
| 262 | 8.6 | MID-SIZE |

The above table shows only the variables that we need to analyze. The next step is to create a graphical summary so that we can better visualise the distribution:

#------------------
#Graphical Summary 
#------------------
newEmission %>% 
  ggplot(aes(x = FuelConsHwy, y = CO2, colour = Vehicle)) +
  geom_jitter( alpha = 1/3) +
  facet_wrap(~Vehicle) +
   theme(legend.position = "none",
        plot.title = element_text(face = "bold"),
        panel.background = element_rect(fill = "#fdf9ff"),
        strip.background = element_rect(fill = "#a6bddb", color = "#400156", size =0.5),
        strip.text = element_text(size = 10, color = "black"),
        panel.border = element_rect(color = "grey30", fill = NA, size = 0.5)) +
  labs(title = "CO2 and High Way Consumptions based on Vehicles",
       x = "Fuel Consumption in Litres per 100km (L/km)",
       y = "CO2 emission (g/km)") 

This graphic shows the relationship between highway fuel consumption (x axis) and CO2 emissions (y axis) for each vehicle class. We can see a strong, linear correlation between CO2 emissions and highway fuel consumption within each vehicle class: the more fuel a vehicle consumes, the more CO2 it emits. The graphic also illustrates which vehicle class emits the most CO2: VAN - PASSENGER produces the most, at over 400 g/km, while TWO-SEATER produces the least, at only about 150 g/km. For many of the other vehicle classes, the distributions of CO2 emissions and highway fuel consumption vary widely.

The advantage of this graphic is that it clearly shows the relationship between highway fuel consumption and CO2 emissions, and it lets us examine these distributions for each vehicle class individually.
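Since the exercise also asks for numerical summaries, here is a minimal sketch of how they could be obtained (output omitted here, since it depends on the full 526-vehicle data file):

# five-number summary, mean, and standard deviation of CO2 emissions
summary(newEmission$CO2)
sd(newEmission$CO2)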


1.165

Blueberries and anthocyanins. Anthocyanins are compounds that have been associated with health benefits associated with the heart, bones, and the brain. Blueberries are a good source of many different anthocyanins. Researchers at the Piedmont Research Station of North Carolina State University have assembled a database giving the concentrations of 18 different anthocyanins for 267 varieties of blueberries. Four of the anthocyanins measured are delphinidin-3-arabinoside, malvidin-3-arabinoside, cyanidin-3-galactoside, and delphinidin-3-glucoside, all measured in units of mg/100g of berries. In the data file, we have simplified the names of these anthocyanins to Antho1, Antho2, Antho3, and Antho4. Figure 1.35 gives graphical and numeric summaries from JMP for Antho1. Use this output to write a summary of the distribution of Antho1 using the methods and ideas that you learned in this chapter.

Answer

We will select only the variable of interest. In this case it is Antho1:

# select Antho1
berry <- berry %>% select(Antho1)

## draw the histogram for Antho1
berry %>% 
  ggplot(aes(x = Antho1)) +
  geom_histogram(binwidth = 0.2, alpha = 0.7,  fill = "orange") +
  labs(title = "Histogram of Antho1",
       subtitle = "Antho1 is measured in mg/100g of berries",
       x = "mg/100g")

The above histogram is approximately symmetric and bell shaped, except for a slightly higher frequency of values near 2 mg/100g of berries.

We can also compute some summary statistics and draw a Normal quantile (QQ) plot:

# summary statistics for Antho1
quantile(berry$Antho1)
##        0%       25%       50%       75%      100% 
## 0.2070234 1.2856481 1.6204762 1.9153709 3.2373255
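# describe() below comes from the psych package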
describe(berry$Antho1)
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis
## X1    1 267 1.63 0.52   1.62    1.61 0.47 0.21 3.24  3.03 0.28     0.51
##      se
## X1 0.03
qqnorm(berry$Antho1)
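# optionally, qqline() (base R) adds a reference line through the quartiles,
# making departures from the straight line easier to judge
qqline(berry$Antho1)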

This plot shows the relationship between the observed quantiles and the theoretical Normal quantiles. The points fall along an (almost) straight line, which indicates that the distribution of Antho1 is approximately Normal.


Datasaurus

[F] Prof. Tucker’s BIG DATA Challenge (Datasaurus). At Moodle, you’ll find a tab-separated values (.tsv) file called “DatasaurusDozen.tsv”. It has 1,847 records; there are 12 distinct data sets inside. For example, row #712 reads: “star 58.2136082599 91.881891513”. The value in the first column, star, refers to the data set: all of the “star” rows will form one of the dozen. The next column has “x” values, while the third and final column has “y” values. For each of the 12 data sets, find the mean of X & Y, the standard deviations of X & Y, and the correlation coefficient between the two. The final step should be to plot the data. For the write-up, you don’t need to include your analysis of all twelve data sets, but describe in high-level terms what is going on with this dozen datasets. Why are we seeing what we’re seeing? Hint: data.frame and factor are R ideas that could help make this easier.

Answer
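The analysis below assumes the data file has already been read into a data frame called datasaurus. For completeness, here is a minimal sketch of one way to do that (assuming the readr package, the file sitting in the working directory, and a header row naming the columns dataset, x, and y, as the code below expects):

library(readr)

# read the tab-separated file into a data frame
datasaurus <- read_tsv("DatasaurusDozen.tsv")

# per the hint, treat the dataset column as a factor so each of the
# distinct data sets can be handled as a group
datasaurus$dataset <- as.factor(datasaurus$dataset)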

First, we are going to find the mean, the SD, and the correlation coefficient for each of the datasets:

# group the dataset together and produce a summary table
datasaurusNew <- datasaurus %>% 
  group_by(dataset) %>% 
  summarise(mean_x = mean(x),
            mean_y = mean(y),
            sd_x = sd(x),
            sd_y = sd(y),
            cor = cor(x, y))

# display the table
knitr::kable(datasaurusNew[ , ],
             caption = "Datasaurus Table with summary")
Table: Datasaurus Table with summary

| dataset | mean_x | mean_y | sd_x | sd_y | cor |
|---|---|---|---|---|---|
| away | 54.26610 | 47.83472 | 16.76983 | 26.93974 | -0.0641284 |
| bullseye | 54.26873 | 47.83082 | 16.76924 | 26.93573 | -0.0685864 |
| circle | 54.26732 | 47.83772 | 16.76001 | 26.93004 | -0.0683434 |
| dino | 54.26327 | 47.83225 | 16.76514 | 26.93540 | -0.0644719 |
| dots | 54.26030 | 47.83983 | 16.76774 | 26.93019 | -0.0603414 |
| h_lines | 54.26144 | 47.83025 | 16.76590 | 26.93988 | -0.0617148 |
| high_lines | 54.26881 | 47.83545 | 16.76670 | 26.94000 | -0.0685042 |
| slant_down | 54.26785 | 47.83590 | 16.76676 | 26.93610 | -0.0689797 |
| slant_up | 54.26588 | 47.83150 | 16.76885 | 26.93861 | -0.0686092 |
| star | 54.26734 | 47.83955 | 16.76896 | 26.93027 | -0.0629611 |
| v_lines | 54.26993 | 47.83699 | 16.76996 | 26.93768 | -0.0694456 |
| wide_lines | 54.26692 | 47.83160 | 16.77000 | 26.93790 | -0.0665752 |
| x_shape | 54.26015 | 47.83972 | 16.76996 | 26.93000 | -0.0655833 |

Interestingly, the summary statistics of these datasets are almost identical! This is surprising because the individual observations in each dataset are quite different from one another.

We can also plot the datasets to clearly see the distributions:

datasaurus %>% 
  ggplot(aes(x, y, colour = dataset)) +
  geom_point(alpha = 1/3) +
  labs(title = "Plot for each dataset in Datasaurus",
       x = "Value of X",
       y = "Value of Y") +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"),
        panel.background = element_rect(fill = "#fdf9ff"),
        strip.background = element_rect(fill = "#a6bddb", color = "#400156", size =0.5),
        strip.text = element_text(size = 10, color = "black"),
        panel.border = element_rect(color = "grey30", fill = NA, size = 0.5)) +
    facet_wrap(~dataset)

This graph reveals interesting and beautiful patterns for each of the datasets, even though their summary statistics are nearly identical. It gives us an insight into how much difference there can be between summary statistics and the actual distributions of the data. We can conclude from this graph that it is very, very important to Always Be Charting (A.B.C.): summary statistics can give us a general feel for the data, but they do not show the whole picture, that is, the distribution of the dataset itself.