Datasaurus and the ABC principle - Assignment 2

Tam Nguyen

[A]

1.134

High-density lipoprotein (HDL) is sometimes called the “good cholesterol” because low values are associated with a higher risk of heart disease. According to the American Heart Association, people over the age of 20 years should have at least 40 milligrams per deciliter (mg/dl) of HDL cholesterol. U.S. women aged 20 and over have a mean HDL of 55 mg/dl with a standard deviation of 15.5 mg/dl. Assume that the distribution is Normal.


(a) What percent of women have low values of HDL (40 mg/dl or less)?

We know that women aged 20 and over have a mean HDL of 55 mg/dl and an SD of 15.5 mg/dl. From these we can calculate the z score for an HDL value of 40 mg/dl. We calculate z scores using the formula \[z = \frac{x - \mu}{\sigma}\]

# define the standardized score
findZscore <- function(x, mean_x, sd_x) {
  (x - mean_x)/sd_x 
} 

# find the z score of women with HDL of 40mg/dl
findZscore(40, 55, 15.5)
## [1] -0.9677419

With a z score of -0.97, we look at the z table and find that about 16.6% of women have HDL of 40 mg/dl or less.
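Instead of reading a printed z table, we can also get the cumulative probability directly from R with pnorm(), as a quick check of the table lookup above:

# cumulative probability below z = -0.97, i.e. P(HDL <= 40)
round(pnorm(findZscore(40, 55, 15.5)), 4)
## [1] 0.1666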


(b) HDL levels of 60 mg/dl and higher are believed to protect people from heart disease. What percent of women have protective levels of HDL?

We do the same calculation as above for women who have HDL levels of 60 mg/dl or higher. The area to the right of that z score gives the percentage of women with protective levels of HDL:

findZscore(60, 55, 15.5)
## [1] 0.3225806
# find the area to the right of the z score (the upper tail):
protect <- 1 - pnorm(findZscore(60, 55, 15.5))
round(protect, 4)
## [1] 0.3735

With a z score of 0.32, the z table gives an area of about 0.63 to the left, so the area to the right is about 0.37: roughly 37% of women have protective levels of HDL.


(c) Women with more than 40 mg/dl but less than 60 mg/dl of HDL are in the intermediate range, neither very good nor very bad. What proportion are in this category?

We can calculate the proportion in the intermediate range by subtracting the proportion of women below 40 mg/dl from the proportion below 60 mg/dl.

findZscore(40, 55, 15.5) 
## [1] -0.9677419
findZscore(60, 55, 15.5) 
## [1] 0.3225806

From the z scores above (-0.97 and 0.32), the z table shows that about 17% of women have HDL below 40 mg/dl and about 63% have HDL below 60 mg/dl. Therefore, the intermediate range holds 63% - 17% = 46% of women.
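As a check, pnorm() can compute this proportion directly from the N(55, 15.5) distribution, without the intermediate table lookups:

# P(40 < HDL < 60) as a difference of cumulative probabilities
round(pnorm(60, mean = 55, sd = 15.5) - pnorm(40, mean = 55, sd = 15.5), 4)
## [1] 0.4599

The exact value, about 46%, matches the table-based answer.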


1.138

Quartiles for Normal distributions. The quartiles of any distribution are the values with cumulative proportions 0.25 and 0.75.

(a) What are the quartiles of the standard Normal distribution?

We can use R to approximate the quartiles of the standard Normal distribution by simulation. We generate 10,000 observations with mean 0 and standard deviation 1 using the rnorm function:

set.seed(10000)
normal <- rnorm(10000, mean = 0, sd = 1)

# calculate the quantiles for this normal distribution:
quantile(normal)
##            0%           25%           50%           75%          100% 
## -4.0226781179 -0.6733417727 -0.0005222524  0.6951116476  3.3967841641

In this simulation, the first quartile (Q1) has a standardized score of about -0.67 and the third quartile (Q3) about 0.70, both close to the theoretical values of ±0.674.

Similarly, given that the first quartile and the third quartile have proportions of 0.25 and 0.75, we can look at the z table and see z scores of these quartiles respectively. The first quartile has a z score of approximately -0.67 and the third quartile has a z score of approximately 0.67.
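The exact quartiles come from qnorm(), the inverse of the Normal cumulative distribution function, and agree with both the simulation and the table:

# exact quartiles of the standard Normal distribution
qnorm(c(0.25, 0.75))
## [1] -0.6744898  0.6744898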

(b) Using your numerical values from part (a), write an equation that gives the quartiles of the N(μ, σ) distribution in terms of μ and σ.

Each quartile of the N(μ, σ) distribution is found by multiplying the corresponding standard Normal quartile by σ and then adding μ:

\[Q_1 = \mu - 0.674\sigma\]

\[Q_3 = \mu + 0.674\sigma\]

(The median, Q2, is simply μ; the value of -0.00052 in the simulation above is just sampling noise around zero.)
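As a quick illustration, plugging the HDL distribution from Exercise 1.134, N(55, 15.5), into these equations gives quartiles of about 44.5 and 65.5 mg/dl:

# quartiles of HDL, where X ~ N(mean = 55, sd = 15.5)
55 + qnorm(c(0.25, 0.75)) * 15.5
## [1] 44.54541 65.45459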


1.146

CO2 emissions in vehicles. Natural Resources Canada tests new vehicles each year and reports several variables related to fuel consumption for vehicles in different classes. For 2015, it provides data for 526 vehicles that use regular fuel. Two variables reported are carbon dioxide (CO2) emissions and highway fuel consumption. CO2 is measured in grams per kilometer (g/km), and highway fuel consumption is measured in liters per 100 kilometers (L/100 km). Use graphical and numerical summaries to describe the distribution of CO2 emissions for these vehicles. Be sure to justify your choice of summaries.

Answer

First, we display the dataset to get a feel for the data:

## make the table of emission
knitr::kable(
  emission[1:5, ],
  caption = "CO2 emission table"
)
Table: CO2 emission table

| ModelYear | Manufacturer | Model | VehicleClass | EngineSize | Cylinders | Transmission | FUEL | FuelConsCity | FuelConsHwy | FuelConsComb | CO2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015 | BUICK | ENCLAVE | SUV - STANDARD | 3.6 | 6 | A6 | X | 14.2 | 9.9 | 12.3 | 283 |
| 2015 | BUICK | ENCLAVE AWD | SUV - STANDARD | 3.6 | 6 | A6 | X | 14.6 | 10.2 | 12.6 | 290 |
| 2015 | BUICK | ENCORE | SUV - SMALL | 1.4 | 4 | AS6 | X | 9.5 | 7.2 | 8.5 | 196 |
| 2015 | BUICK | ENCORE AWD | SUV - SMALL | 1.4 | 4 | AS6 | X | 10.2 | 8.0 | 9.2 | 212 |
| 2015 | BUICK | LACROSSE | MID-SIZE | 3.6 | 6 | AS6 | X | 13.7 | 8.6 | 11.4 | 262 |

Given that there are many columns that are not essential for our analysis, such as model year or engine size, we select only the columns related to our graphical summary: VehicleClass, CO2, and highway fuel consumption (FuelConsHwy):

# select only Vehicle Class and their emission
newEmission <- emission %>% 
  select(VehicleClass, CO2, FuelConsHwy) %>% 
  mutate(Vehicle = as.factor(as.character(VehicleClass)), 
         VehicleClass = NULL) 


# make a table
knitr::kable(
  newEmission[1:5, ],
  caption = "Table of CO2 emissions based on vehicle classes"
)

Table: Table of CO2 emissions based on vehicle classes

| CO2 | FuelConsHwy | Vehicle |
|---|---|---|
| 283 | 9.9 | SUV - STANDARD |
| 290 | 10.2 | SUV - STANDARD |
| 196 | 7.2 | SUV - SMALL |
| 212 | 8.0 | SUV - SMALL |
| 262 | 8.6 | MID-SIZE |

The above table shows only the variables that we need to analyze. The next step is to create a graphical summary so that we can better visualise the distribution:

#------------------
#Graphical Summary 
#------------------
newEmission %>% 
  ggplot(aes(x = FuelConsHwy, y = CO2, colour = Vehicle)) +
  geom_jitter( alpha = 1/3) +
  facet_wrap(~Vehicle) +
   theme(legend.position = "none",
        plot.title = element_text(face = "bold"),
        panel.background = element_rect(fill = "#fdf9ff"),
        strip.background = element_rect(fill = "#a6bddb", color = "#400156", size =0.5),
        strip.text = element_text(size = 10, color = "black"),
        panel.border = element_rect(color = "grey30", fill = NA, size = 0.5)) +
  labs(title = "CO2 and High Way Consumptions based on Vehicles",
       x = "Fuel Consumption in Litres per 100km (L/km)",
       y = "CO2 emission (g/km)") 

This graphic shows the relationship between highway fuel consumption (x axis) and CO2 emissions (y axis) for each vehicle class. We can see a strong, linear correlation between CO2 emissions and highway fuel consumption within each vehicle class: the more fuel a vehicle consumes, the more CO2 it emits. The graphic also illustrates which vehicle class emits the most CO2: VAN - PASSENGER produces the most, at over 400 g/km, while TWO-SEATER produces the least, at only about 150 g/km. For many of the other vehicle classes, the distributions of CO2 emissions and highway fuel consumption vary widely.

The advantage of this graphic is that it clearly shows the relationship between highway fuel consumption and CO2 emissions, and it lets us examine these distributions for each vehicle class individually.
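Since the exercise also asks for numerical summaries, here is a minimal sketch of how they could be obtained (output omitted here, since it depends on the full 526-vehicle data file):

# five-number summary, mean, and standard deviation of CO2 emissions
summary(newEmission$CO2)
sd(newEmission$CO2)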


1.165

Blueberries and anthocyanins. Anthocyanins are compounds that have been associated with health benefits associated with the heart, bones, and the brain. Blueberries are a good source of many different anthocyanins. Researchers at the Piedmont Research Station of North Carolina State University have assembled a database giving the concentrations of 18 different anthocyanins for 267 varieties of blueberries. Four of the anthocyanins measured are delphinidin-3-arabinoside, malvidin-3-arabinoside, cyanidin-3-galactoside, and delphinidin-3-glucoside, all measured in units of mg/100g of berries. In the data file, we have simplified the names of these anthocyanins to Antho1, Antho2, Antho3, and Antho4. Figure 1.35 gives graphical and numeric summaries from JMP for Antho1. Use this output to write a summary of the distribution of Antho1 using the methods and ideas that you learned in this chapter.

Answer

We will select only the variable of interest. In this case it is Antho1:

# select Antho1
berry <- berry %>% select(Antho1)

## draw the histogram for Antho1
berry %>% 
  ggplot(aes(x = Antho1)) +
  geom_histogram(binwidth = 0.2, alpha = 0.7,  fill = "orange") +
  labs(title = "Histogram of Antho1",
       subtitle = "Antho1 is measured in mg/100g of berries",
       x = "mg/100g")

The above histogram is approximately symmetric and bell shaped, except for a slightly higher frequency of values near 2 mg/100g of berries.

We can also compute some summary statistics and draw a Normal quantile (QQ) plot:

# summary statistics for Antho1
quantile(berry$Antho1)
##        0%       25%       50%       75%      100% 
## 0.2070234 1.2856481 1.6204762 1.9153709 3.2373255
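# describe() below comes from the psych package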
describe(berry$Antho1)
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis
## X1    1 267 1.63 0.52   1.62    1.61 0.47 0.21 3.24  3.03 0.28     0.51
##      se
## X1 0.03
qqnorm(berry$Antho1)
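# optionally, qqline() (base R) adds a reference line through the quartiles,
# making departures from the straight line easier to judge
qqline(berry$Antho1)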

This plot shows the relationship between the observed quantiles and the theoretical Normal quantiles. The points fall along an (almost) straight line, which indicates that the distribution of Antho1 is approximately Normal.


Datasaurus

[F] Prof. Tucker’s BIG DATA Challenge (Datasaurus). At Moodle, you’ll find a tab-separated values (.tsv) file called “DatasaurusDozen.tsv”. It has 1,847 records; there are 12 distinct data sets inside. For example, row #712 reads: “star 58.2136082599 91.881891513”. The value in the first column, star, refers to the data set: all of the “star” rows will form one of the dozen. The next column has “x” values, while the third and final column has “y” values. For each of the 12 data sets, find the mean of X & Y, the standard deviations of X & Y, and the correlation coefficient between the two. The final step should be to plot the data. For the write-up, you don’t need to include your analysis of all twelve data sets, but describe in high-level terms what is going on with this dozen datasets. Why are we seeing what we’re seeing? Hint: data.frame and factor are R ideas that could help make this easier.

Answer
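The analysis below assumes the data file has already been read into a data frame called datasaurus. For completeness, here is a minimal sketch of one way to do that (assuming the readr package, the file sitting in the working directory, and a header row naming the columns dataset, x, and y, as the code below expects):

library(readr)

# read the tab-separated file into a data frame
datasaurus <- read_tsv("DatasaurusDozen.tsv")

# per the hint, treat the dataset column as a factor so each of the
# distinct data sets can be handled as a group
datasaurus$dataset <- as.factor(datasaurus$dataset)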

First, we are going to find the mean, the SD, and the correlation coefficient for each of the datasets:

# group the dataset together and produce a summary table
datasaurusNew <- datasaurus %>% 
  group_by(dataset) %>% 
  summarise(mean_x = mean(x),
            mean_y = mean(y),
            sd_x = sd(x),
            sd_y = sd(y),
            cor = cor(x, y))

# display the table
knitr::kable(datasaurusNew[ , ],
             caption = "Datasaurus Table with summary")
Table: Datasaurus Table with summary

| dataset | mean_x | mean_y | sd_x | sd_y | cor |
|---|---|---|---|---|---|
| away | 54.26610 | 47.83472 | 16.76983 | 26.93974 | -0.0641284 |
| bullseye | 54.26873 | 47.83082 | 16.76924 | 26.93573 | -0.0685864 |
| circle | 54.26732 | 47.83772 | 16.76001 | 26.93004 | -0.0683434 |
| dino | 54.26327 | 47.83225 | 16.76514 | 26.93540 | -0.0644719 |
| dots | 54.26030 | 47.83983 | 16.76774 | 26.93019 | -0.0603414 |
| h_lines | 54.26144 | 47.83025 | 16.76590 | 26.93988 | -0.0617148 |
| high_lines | 54.26881 | 47.83545 | 16.76670 | 26.94000 | -0.0685042 |
| slant_down | 54.26785 | 47.83590 | 16.76676 | 26.93610 | -0.0689797 |
| slant_up | 54.26588 | 47.83150 | 16.76885 | 26.93861 | -0.0686092 |
| star | 54.26734 | 47.83955 | 16.76896 | 26.93027 | -0.0629611 |
| v_lines | 54.26993 | 47.83699 | 16.76996 | 26.93768 | -0.0694456 |
| wide_lines | 54.26692 | 47.83160 | 16.77000 | 26.93790 | -0.0665752 |
| x_shape | 54.26015 | 47.83972 | 16.76996 | 26.93000 | -0.0655833 |

Interestingly, the summary statistics of these datasets are almost identical! This is surprising because the individual observations in each dataset are quite different from one another.

We can also plot the datasets to clearly see the distributions:

datasaurus %>% 
  ggplot(aes(x, y, colour = dataset)) +
  geom_point(alpha = 1/3) +
  labs(title = "Plot for each dataset in Datasaurus",
       x = "Value of X",
       y = "Value of Y") +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"),
        panel.background = element_rect(fill = "#fdf9ff"),
        strip.background = element_rect(fill = "#a6bddb", color = "#400156", size =0.5),
        strip.text = element_text(size = 10, color = "black"),
        panel.border = element_rect(color = "grey30", fill = NA, size = 0.5)) +
    facet_wrap(~dataset)

This graph reveals interesting and beautiful patterns for each of the datasets, even though their summary statistics are nearly identical. It gives us an insight into how much difference there can be between summary statistics and the actual distributions of the data. We can conclude from this graph that it is very, very important to Always Be Charting (A.B.C.): summary statistics can give us a general feel for the data, but they do not show the whole picture, that is, the distribution of the dataset itself.