---
title: "Lab 4: Distributions of Random Variables"
author: "Lwin Shwe"
date: "`r Sys.Date()`"
output: openintro::lab_report
---

```{r load-packages, message=FALSE}
library(tidyverse)
library(openintro)
library(dplyr)
library(ggplot2)
```

### Exercise 1

Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?


```{r view-first-food}
# Load the "fastfood" dataset from the openintro package
data("fastfood", package = "openintro")
head(fastfood)

# Create two new data sets, "mcdonalds" and "dairy_queen"
mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
head(mcdonalds)

dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")
head(dairy_queen)

hist(mcdonalds$cal_fat)
hist(dairy_queen$cal_fat)

```

These two histograms show the distribution of `cal_fat` for McDonald's and Dairy Queen. McDonald's items tend to have more calories from fat than Dairy Queen's. Both distributions are right-skewed: most items sit at the lower end, and a few items have very high calories from fat. The default bins are 200 calories wide in the McDonald's histogram but only 100 calories wide in the Dairy Queen histogram, which suggests that the McDonald's values span a wider range.
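
The visual comparison of center, spread, and shape can be backed up with a few summary statistics. The sketch below uses base R only; the helper `describe()` and the two vectors are hypothetical stand-ins, not values from the `fastfood` data.

```r
# Summarise center (mean, median) and spread (sd, IQR) of a numeric vector.
describe <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), iqr = IQR(x))
}

# Hypothetical calories-from-fat values, NOT taken from the fastfood data
rest_a <- c(60, 120, 150, 200, 240, 250, 280, 410, 600, 1270)
rest_b <- c(0, 90, 160, 180, 220, 270, 310, 330, 460, 670)

summary_table <- rbind(rest_a = describe(rest_a), rest_b = describe(rest_b))
print(round(summary_table, 1))

# A mean above the median is a quick numeric hint of right skew
right_skew_hint <- summary_table[, "mean"] > summary_table[, "median"]
print(right_skew_hint)
```

For the real data, `describe(mcdonalds$cal_fat)` and `describe(dairy_queen$cal_fat)` would give the corresponding numbers.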

### Exercise 2

Based on this plot, does it appear that the data follow a nearly normal distribution?

```{r normal-distribution}
dqmean <- mean(dairy_queen$cal_fat)
dqsd   <- sd(dairy_queen$cal_fat)
ggplot(data = dairy_queen, aes(x = cal_fat)) +
  geom_blank() +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(aes(y = after_stat(density)), bins = 30) +
  stat_function(fun = dnorm, args = c(mean = dqmean, sd = dqsd), col = "tomato")
```


The standard deviation measures how far individual values typically fall from the mean (here, the average calories from fat).
The histogram follows the overlaid normal curve fairly closely, with relatively few departures, so the "calories from fat" data for Dairy Queen appear to follow a nearly normal distribution.
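
To make the description of the standard deviation concrete, it can be computed by hand and checked against `sd()`. This is a minimal sketch on a hypothetical vector; under a normal model roughly 68% of values are expected within one standard deviation of the mean.

```r
# Hypothetical calories-from-fat values, NOT taken from the fastfood data
x <- c(160, 220, 270, 310, 330, 460)

# Deviation of each value from the mean
deviations <- x - mean(x)

# Sample standard deviation: root of the summed squared deviations over n - 1
manual_sd <- sqrt(sum(deviations^2) / (length(x) - 1))

# z-scores express each deviation in units of standard deviations
z <- deviations / manual_sd

# Share of values within one standard deviation of the mean
share_within_1sd <- mean(abs(z) <= 1)

print(c(manual_sd = manual_sd, builtin_sd = sd(x), share_within_1sd = share_within_1sd))
```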


### Exercise 3

Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a data frame, it can be put directly into the sample argument and the data argument can be dropped.)

```{r QQ-plot}
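# Normal probability plot of the observed Dairy Queen data
ggplot(data = dairy_queen, aes(sample = cal_fat)) +
  geom_line(stat = "qq")

# Simulate normal data with the same size, mean, and sd as the observed data
sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)

# Create a Q-Q plot for the simulated data; mapping `sample` in ggplot()
# makes it available to both geom_qq() and geom_qq_line()
qqplot_sim_norm <- ggplot(data = NULL, aes(sample = sim_norm)) +
  geom_qq() +
  geom_qq_line() +
  labs(title = "Normal Q-Q Plot for Simulated Data")

qqplot_sim_norm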

```


The first plot is a normal probability (Q-Q) plot of the "calories from fat" data for Dairy Queen, used to assess its normality. The `rnorm` function then generates `sim_norm`, a random sample from a normal distribution whose mean and standard deviation match those of the Dairy Queen data. Not every simulated point falls exactly on the line, but most follow it closely, so the Q-Q plot of the simulated data looks approximately normal.
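
A normal probability plot pairs the sorted sample with the quantiles a standard normal would produce at the same positions. The sketch below rebuilds those pairs in base R and checks them against `qqnorm()`. The sample is simulated with assumed parameters (mean 260, sd 156) that are illustrative only.

```r
set.seed(42)

# Simulated sample with assumed, illustrative parameters
x <- rnorm(42, mean = 260, sd = 156)
n <- length(x)

# Theoretical standard-normal quantiles at the plotting positions
theoretical <- qnorm(ppoints(n))

# Observed quantiles are simply the sorted sample
observed <- sort(x)

# Points close to a straight line give a correlation close to 1
qq_cor <- cor(theoretical, observed)
print(qq_cor)

# qqnorm() computes the same theoretical quantiles (returned in data order)
qq <- qqnorm(x, plot.it = FALSE)
print(isTRUE(all.equal(sort(qq$x), theoretical)))
```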


### Exercise 4

Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories are nearly normal?

```{r norm-prob}
qqnormsim(sample = cal_fat, data = dairy_queen)
```

`qqnormsim` draws several Q-Q plots so that the original data can be compared with simulated normal data sets. If the points for the original data follow the reference line about as closely as the points in the simulated plots do, the data are plausibly normal.
Here the answer is yes: the normal probability plot for the `cal_fat` data looks similar to the plots created for the simulated data.
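
The same comparison can be approximated numerically. The sketch below is not how `qqnormsim` works internally; it is a swapped-in technique that scores each sample by the correlation between its sorted values and the theoretical normal quantiles, then checks where the observed score falls among scores from simulated normal samples. The helper `qq_cor()` and the skewed `observed` vector are hypothetical.

```r
# Correlation between sorted data and standard-normal quantiles (1 = perfectly straight)
qq_cor <- function(x) cor(qnorm(ppoints(length(x))), sort(x))

set.seed(123)

# Hypothetical right-skewed data standing in for an observed variable
observed <- rexp(42, rate = 1 / 260)
observed_score <- qq_cor(observed)

# Scores for 200 simulated normal samples with matching size, mean, and sd
sims <- replicate(
  200,
  qq_cor(rnorm(length(observed), mean = mean(observed), sd = sd(observed)))
)

# Share of simulated samples that look no straighter than the observed data;
# a very small share suggests the observed data are less normal than the simulations
share_lower <- mean(sims <= observed_score)
print(c(observed_score = observed_score, share_lower = share_lower))
```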

### Exercise 5

Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.

```{r QQ-McDonald}
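# Mean and standard deviation of calories from fat for McDonald's
dqmean1 <- mean(mcdonalds$cal_fat)
dqsd1   <- sd(mcdonalds$cal_fat)
ggplot(data = mcdonalds, aes(x = cal_fat)) +
  geom_blank() +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(aes(y = after_stat(density)), bins = 30) +
  stat_function(fun = dnorm, args = c(mean = dqmean1, sd = dqsd1), col = "tomato")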


# Create a Q-Q plot for McDonald's data
ggplot(data = mcdonalds, aes(sample = cal_fat)) +
  geom_qq() +
  geom_qq_line() +
  labs(title = "Normal Q-Q Plot for Calories from Fat (McDonald's)")

qqnormsim(sample = cal_fat, data = mcdonalds)

```

The histogram is unimodal (one main peak) and somewhat resembles a normal distribution. The `cal_fat` data from McDonald's menu approximately follow a normal distribution, but there are some departures from normality, most notably right skew.


### Exercise 6

Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?

#### Question 1. What is the probability that a randomly chosen Dairy Queen item has fewer than 600 calories from fat?

#### Question 2. What is the probability that a randomly chosen menu item from McDonald's has between 200 and 600 calories from fat?

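```{r pnorm-distribution}
# Question 1, theoretical: P(cal_fat < 600) for Dairy Queen under the normal model
pnorm(q = 600, mean = dqmean, sd = dqsd)

# Question 1, empirical: share of Dairy Queen items with fewer than 600 calories from fat
dairy_queen %>%
  filter(cal_fat < 600) %>%
  summarise(percent = n() / nrow(dairy_queen))

# Question 2, theoretical: P(200 <= cal_fat <= 600) for McDonald's under the normal model
pnorm(q = 600, mean = dqmean1, sd = dqsd1) - pnorm(q = 200, mean = dqmean1, sd = dqsd1)

# Question 2, empirical: share of McDonald's items between 200 and 600 calories from fat
mcdonalds %>%
  filter(cal_fat >= 200 & cal_fat <= 600) %>%
  summarise(percent = n() / nrow(mcdonalds))
```

In the original run of this lab, `1 - pnorm(q = 600, mean = dqmean, sd = dqsd)` returned 0.015, which is the probability of more than 600 calories from fat; its complement, about 0.985, is the normal-model answer to Question 1. The matching empirical proportion was 0.952. For Question 2 the empirical proportion was 0.561.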
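
The theoretical and empirical calculations can be wrapped in one helper so that both numbers for a question come out side by side. This is a minimal sketch: `compare_prob()` and `cal_fat_demo` are hypothetical, and the bounds are treated as inclusive, which only matters for the empirical proportion.

```r
# Compare a normal-model probability with the observed proportion for one interval.
# Bounds are inclusive here, which is an assumption of this sketch.
compare_prob <- function(x, lower = -Inf, upper = Inf) {
  m <- mean(x)
  s <- sd(x)
  theoretical <- pnorm(upper, mean = m, sd = s) - pnorm(lower, mean = m, sd = s)
  empirical <- mean(x >= lower & x <= upper)
  c(theoretical = theoretical, empirical = empirical,
    gap = abs(theoretical - empirical))
}

# Hypothetical calories-from-fat values, NOT taken from the fastfood data
cal_fat_demo <- c(60, 120, 150, 200, 240, 250, 280, 410, 600, 1270)

q1 <- compare_prob(cal_fat_demo, upper = 600)               # at most 600
q2 <- compare_prob(cal_fat_demo, lower = 200, upper = 600)  # between 200 and 600
print(round(rbind(q1, q2), 3))
```

The `gap` column gives a direct way to say which question shows closer agreement between the two methods.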


### Exercise 7

Now let’s consider some of the other variables in the data set. Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?

```{r sodium-distribution}
# Find mean and standard deviation for each restaurant
restaurant_summary <- fastfood %>%
  group_by(restaurant) %>%
  summarise(mean_sodium = mean(sodium), sd_sodium = sd(sodium))
restaurant_summary

# Plot sodium distribution for each restaurant
ggplot(data = fastfood, aes(x = sodium)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20) +
  facet_wrap(~restaurant, scales = "free") +
  geom_density() +
  labs(title = "Sodium Distribution by Restaurant")

```

Arbys and Burger King have the sodium distributions that are closest to normal.
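
A judgement made by eye across eight panels can be cross-checked with a per-group score. The sketch below ranks groups by how straight their normal probability plot would be, using the correlation between sorted values and normal quantiles. The data frame `demo` and its three groups are simulated and hypothetical; with the real data, `split(fastfood$sodium, fastfood$restaurant)` would take the place of the simulated split.

```r
# Correlation between sorted data and standard-normal quantiles (1 = perfectly straight)
qq_cor <- function(x) cor(qnorm(ppoints(length(x))), sort(x))

set.seed(7)

# Simulated, hypothetical sodium-like values for three made-up groups
demo <- data.frame(
  restaurant = rep(c("A", "B", "C"), each = 50),
  sodium = c(
    rnorm(50, mean = 1200, sd = 500),  # roughly normal
    rexp(50, rate = 1 / 1200),         # right-skewed
    runif(50, min = 200, max = 2500)   # flat
  )
)

# One score per group, then sort from most to least normal-looking
scores <- sapply(split(demo$sodium, demo$restaurant), qq_cor)
ranked <- sort(scores, decreasing = TRUE)
print(round(ranked, 3))
```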


### Exercise 8

Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. Why do you think this might be the case?

Some of the plots have a stepwise pattern because sodium is reported in rounded, discrete amounts rather than on a truly continuous scale. Many items therefore share exactly the same value, and these ties show up as flat steps in a normal probability plot.
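
The effect of rounding can be shown with simulated data. In this sketch the rounding step of 100 is an assumption chosen to exaggerate the effect; it is not the precision used in the `fastfood` data.

```r
set.seed(99)

# Continuous normal values: essentially no ties
smooth <- rnorm(100, mean = 1200, sd = 500)

# The same values rounded to the nearest 100 (assumed step, for illustration)
rounded <- round(smooth / 100) * 100

# Rounding collapses many values onto a small set of levels
distinct_counts <- c(
  smooth = length(unique(smooth)),
  rounded = length(unique(rounded))
)
print(distinct_counts)

# Size of the largest group of tied values; each group is one flat step in a Q-Q plot
largest_tie <- max(table(rounded))
print(largest_tie)
```

Calling `qqnorm(rounded)` next to `qqnorm(smooth)` shows the steps directly.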


### Exercise 9

As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

```{r carbohydrates}
# create a new data set "Subway"
Subway <- fastfood %>%
  filter(restaurant == "Subway")
head(Subway)

# Plot the total carbohydrates distribution for Subway
ggplot(data = Subway, aes(x = total_carb)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20) +
  geom_density() +
  labs(title = "Total Carbohydrates Distribution of Subway")

# Create a Q-Q plot
ggplot(data = Subway, aes(sample = total_carb)) +
  geom_qq() +
  geom_qq_line() +
  labs(title = "Normal Q-Q Plot for Total Carbohydrates")

# Create a histogram
hist(Subway$total_carb)


```


According to the normal probability plot, the "total carbohydrates" variable for Subway is right-skewed. Right skew shows up in such a plot as points that bend upward, away from the line, at the high end.
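
The direction of skew can also be confirmed with a number. The sketch below computes the moment-based sample skewness, which is positive for right skew, negative for left skew, and zero for a symmetric sample. The helper `skewness()` and the vectors are hypothetical; for the real data the call would be `skewness(Subway$total_carb)`.

```r
# Moment-based sample skewness: third central moment over the 1.5 power of the second
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Hypothetical vectors, NOT taken from the fastfood data
right_tailed <- c(1, 2, 2, 3, 3, 3, 4, 10, 20)
left_tailed <- -right_tailed
symmetric <- c(-2, -1, 0, 1, 2)

skew_values <- c(
  right = skewness(right_tailed),
  left = skewness(left_tailed),
  symmetric = skewness(symmetric)
)
print(round(skew_values, 3))
```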




