library(tidyverse)
library(openintro)
data("fastfood", package='openintro')
head(fastfood)
## # A tibble: 6 x 17
##   restaurant item       calories cal_fat total_fat sat_fat trans_fat cholesterol
##   <chr>      <chr>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
## 1 Mcdonalds  Artisan G~      380      60         7       2       0            95
## 2 Mcdonalds  Single Ba~      840     410        45      17       1.5         130
## 3 Mcdonalds  Double Ba~     1130     600        67      27       3           220
## 4 Mcdonalds  Grilled B~      750     280        31      10       0.5         155
## 5 Mcdonalds  Crispy Ba~      920     410        45      12       0.5         120
## 6 Mcdonalds  Big Mac         540     250        28      10       1            80
## # ... with 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>,
## #   sugar <dbl>, protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>,
## #   salad <chr>
mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")

Exercise 1

  1. Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?

I used a histogram for each restaurant. The McDonald's distribution appears right skewed: most items have 500 or fewer calories from fat, with a tail of higher values, and there is a single peak. The Dairy Queen distribution appears roughly symmetric and close to normal, with a single peak and no obvious skew.

ggplot(data=mcdonalds,aes(x=cal_fat)) +
  geom_histogram() +
  labs(title="Mcdonalds")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data=dairy_queen,aes(x=cal_fat)) + 
  geom_histogram()+
  labs(title="Dairy Queen")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exercise 2

  2. Based on this plot, does it appear that the data follow a nearly normal distribution?

Yes, the data appear to follow a nearly normal distribution: the histogram looks roughly symmetric and matches the bell shape of the overlaid normal curve.

Exercise 3

  3. Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a data frame, it can be put directly into the sample argument and the data argument can be dropped.)

Most of the points follow the line, but this plot shows more errant points that stray from the line than the plot for the real data.

dqmean <- mean(dairy_queen$cal_fat)
dqsd   <- sd(dairy_queen$cal_fat)
sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)
sim_norm1 <-as.data.frame(sim_norm)
ggplot(data=sim_norm1,aes(sample=sim_norm)) + 
  geom_line(stat="qq")

Exercise 4

  4. Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories are nearly normal?

The normal probability plot for the calories from fat looks somewhat similar to the plots for the simulated data, so the plots do provide evidence that the calories from fat are nearly normal: the points closely follow a diagonal line, with a few errant points.

Exercise 5

Judging by the plots, the calories from fat at McDonald's appear to come from a roughly normal distribution: the points follow a mostly diagonal line.

qqnormsim(sample = cal_fat, data = mcdonalds)

Exercise 6

  6. Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?

Question 1

What is the probability that a randomly chosen Dairy Queen item has more than 300 calories?

avgdq <- mean(dairy_queen$calories)
sddq <- sd(dairy_queen$calories)

## Theoretical Method:

  1 - pnorm(q=300, mean=avgdq ,sd=sddq)
## [1] 0.8021241
## Empirical Method:

dairy_queen %>% 
  filter(calories > 300) %>%
  summarise(percent = n() / nrow(dairy_queen))
## # A tibble: 1 x 1
##   percent
##     <dbl>
## 1   0.833

The two methods give reasonably close probabilities: about 80% with the theoretical method and 83% with the empirical method.

Question 2

What is the probability that a randomly chosen McDonald's item has more than 300 calories?

avgmcd <- mean(mcdonalds$calories)
sdmcd <- sd(mcdonalds$calories)


## Theoretical Method
1 - pnorm(q=300,mean=avgmcd,sd=sdmcd)
## [1] 0.7963677
## Empirical Method:

mcdonalds %>% 
  filter(calories > 300) %>%
  summarise(percent = n() / nrow(mcdonalds))
## # A tibble: 1 x 1
##   percent
##     <dbl>
## 1   0.842

Using the theoretical method, the probability that a McDonald's item has more than 300 calories is approximately 80%; using the empirical method, it is 84%.

Comparing the four probabilities, Dairy Queen had the closer agreement between the two methods: its theoretical and empirical probabilities differ by about 3 percentage points (80.2% vs. 83.3%), while McDonald's differ by about 5 percentage points (79.6% vs. 84.2%). Despite these small differences, both pairs of estimates are close, which suggests that calories at both restaurants are reasonably well approximated by a normal distribution.

Exercise 7

  7. Now let’s consider some of the other variables in the dataset. Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?
##1
ggplot(data = mcdonalds, aes(sample = sodium)) + 
  geom_line(stat = "qq")

##2
ggplot(data = dairy_queen, aes(sample = sodium)) + 
  geom_line(stat = "qq")

##3
ChickFilA <- fastfood %>%
  filter(restaurant == "Chick Fil-A")

ggplot(data = ChickFilA, aes(sample = sodium)) + 
  geom_line(stat = "qq")

##4
Sonic <- fastfood %>%
  filter(restaurant == "Sonic")

ggplot(data = Sonic, aes(sample = sodium)) + 
  geom_line(stat = "qq")

Arbys <- fastfood %>%
  filter(restaurant == "Arbys")

ggplot(data = Arbys, aes(sample = sodium)) + 
  geom_line(stat = "qq")

Burger_King <- fastfood %>%
  filter(restaurant =="Burger King") 

ggplot(data = Burger_King, aes(sample = sodium)) + 
  geom_line(stat = "qq")

Subway <- fastfood %>%
  filter(restaurant =="Subway")

ggplot(data = Subway, aes(sample = sodium)) + 
  geom_line(stat = "qq")

Taco_Bell <- fastfood %>%
  filter(restaurant =="Taco Bell")

ggplot(data = Taco_Bell, aes(sample = sodium)) + 
  geom_line(stat = "qq")

The distribution closest to normal appears to be Subway's: although there are some errant points in the lower tail, the plot mostly resembles a straight diagonal line.

Exercise 8

  8. Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. Why do you think this might be the case?

I first wondered whether unusually high sodium in certain items causes the graph to look this way. A better explanation is that the sodium data contain many repeated values: tied observations share the same sample value, so they line up as flat steps in the plot.

Exercise 9

  9. As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.
ggplot(data=mcdonalds,aes(sample = total_carb)) +
  geom_line(stat="qq")

I think total carbohydrates for McDonald's is right skewed, because the points deviate noticeably from the line in the right tail while the rest follow a roughly straight line.

ggplot(data = mcdonalds, aes(x = total_carb)) +
        geom_blank() +
        geom_histogram(aes(y = after_stat(density))) +
        stat_function(fun = dnorm, args = list(mean = mean(mcdonalds$total_carb), sd = sd(mcdonalds$total_carb)), col = "blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram shows a tail forming on the right-hand side, which confirms that the skew is positive (right skewed).

---
title: "Lab 4"
author: "Al Haque"
date: "Feb 24"
output: openintro::lab_report
---

```{r load-packages, message=FALSE}
library(tidyverse)
library(openintro)
data("fastfood", package='openintro')
head(fastfood)
```

```{r}
mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")
```

### Exercise 1

1.  Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants.  How do their centers, shapes, and spreads compare?

I used a histogram for each restaurant. The McDonald's distribution appears right skewed: most items have 500 or fewer calories from fat, with a tail of higher values, and there is a single peak. The Dairy Queen distribution appears roughly symmetric and close to normal, with a single peak and no obvious skew.

```{r}
ggplot(data=mcdonalds,aes(x=cal_fat)) +
  geom_histogram() +
  labs(title="Mcdonalds")
```


```{r}
ggplot(data=dairy_queen,aes(x=cal_fat)) + 
  geom_histogram()+
  labs(title="Dairy Queen")
```
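
To back up the visual comparison with numbers, a minimal sketch like the one below could summarise the center and spread of calories from fat for each restaurant. The summary column names (`mean_cal_fat`, `median_cal_fat`, and so on) are my own labels and are not part of the dataset.

```r
library(tidyverse)
library(openintro)
data("fastfood", package = "openintro")

# Center (mean, median) and spread (sd, IQR) of calories from fat,
# one row per restaurant. The column names are illustrative labels.
cal_fat_summary <- fastfood %>%
  filter(restaurant %in% c("Mcdonalds", "Dairy Queen")) %>%
  group_by(restaurant) %>%
  summarise(
    n              = n(),
    mean_cal_fat   = mean(cal_fat),
    median_cal_fat = median(cal_fat),
    sd_cal_fat     = sd(cal_fat),
    iqr_cal_fat    = IQR(cal_fat)
  )

cal_fat_summary
```

As a rule of thumb, a mean that sits well above the median would be one more hint of right skew.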


### Exercise 2 

2.  Based on this plot, does it appear that the data follow a nearly normal 
    distribution?

  Yes, the data appear to follow a nearly normal distribution: the histogram looks roughly symmetric and matches the bell shape of the overlaid normal curve.
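
The plot this answer refers to is not included in the report. A sketch of how such a plot could be drawn is shown below, assuming a density-scaled histogram of the Dairy Queen calories from fat with a normal curve on top. The choice of `bins = 15` and the curve colour are arbitrary assumptions of mine.

```r
library(tidyverse)
library(openintro)
data("fastfood", package = "openintro")

dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")

# Parameters of the normal curve: the sample mean and standard deviation.
dqmean <- mean(dairy_queen$cal_fat)
dqsd   <- sd(dairy_queen$cal_fat)

# Density-scaled histogram so the bars and the curve share a y-axis.
# bins = 15 is an arbitrary choice for illustration.
dq_density_plot <- ggplot(data = dairy_queen, aes(x = cal_fat)) +
  geom_histogram(aes(y = after_stat(density)), bins = 15) +
  stat_function(fun = dnorm, args = list(mean = dqmean, sd = dqsd),
                col = "tomato")

dq_density_plot
```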
  
  


### Exercise 3
  
3.  Make a normal probability plot of `sim_norm`.  Do all of the points fall on the line?  How does this plot compare to the probability plot for the real data? (Since `sim_norm` is not a data frame, it can be put directly into the `sample` argument and the `data` argument can be dropped.)


 Most of the points follow the line, but this plot shows more errant points that stray from the line than the plot for the real data.

```{r}
dqmean <- mean(dairy_queen$cal_fat)
dqsd   <- sd(dairy_queen$cal_fat)
sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)
sim_norm1 <-as.data.frame(sim_norm)
ggplot(data=sim_norm1,aes(sample=sim_norm)) + 
  geom_line(stat="qq")
```
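
The chunk above connects the quantiles with a line but draws no reference line to compare against. As an alternative sketch, assuming `stat_qq()` and `stat_qq_line()` from ggplot2, the simulated vector can go straight into the `sample` argument as the exercise suggests. The seed value is an arbitrary assumption, added only so the simulation is reproducible.

```r
library(tidyverse)
library(openintro)
data("fastfood", package = "openintro")

dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")

set.seed(2024)  # arbitrary seed so the simulated sample is reproducible

# Simulate normal data with the same size, mean and sd as the real data.
sim_norm <- rnorm(n = nrow(dairy_queen),
                  mean = mean(dairy_queen$cal_fat),
                  sd = sd(dairy_queen$cal_fat))

# sim_norm is a plain vector, so it goes directly into `sample`
# and the data argument is left as NULL.
sim_qq_plot <- ggplot(data = NULL, aes(sample = sim_norm)) +
  stat_qq() +       # one point per simulated observation
  stat_qq_line()    # reference line through the first and third quartiles

sim_qq_plot
```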
  
### Exercise 4

4.  Does the normal probability plot for the calories from fat look similar to the plots 
    created for the simulated data?  That is, do the plots provide evidence that the
    calories are nearly normal?

The normal probability plot for the calories from fat looks somewhat similar to the plots for the simulated data, so the plots do provide evidence that the calories from fat are nearly normal: the points closely follow a diagonal line, with a few errant points.


### Exercise 5

  Judging by the plots, the calories from fat at McDonald's appear to come from a roughly normal distribution: the points follow a mostly diagonal line.

```{r}
qqnormsim(sample = cal_fat, data = mcdonalds)
```



### Exercise 6

6.  Write out two probability questions that you would like to answer about any of the restaurants in this dataset.  Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all).  Which one had a closer agreement between the two methods?

#####  Question 1
  What is the probability that a randomly chosen Dairy Queen item has more than 300 calories?

```{r}
avgdq <- mean(dairy_queen$calories)
sddq <- sd(dairy_queen$calories)

## Theoretical Method:

  1 - pnorm(q=300, mean=avgdq ,sd=sddq)
```




```{r}
## Empirical Method:

dairy_queen %>% 
  filter(calories > 300) %>%
  summarise(percent = n() / nrow(dairy_queen))
```


The two methods give reasonably close probabilities: about 80% with the theoretical method and 83% with the empirical method.


##### Question 2 
  What is the probability that a randomly chosen McDonald's item has more than 300 calories?
  
```{r}
avgmcd <- mean(mcdonalds$calories)
sdmcd <- sd(mcdonalds$calories)


## Theoretical Method
1 - pnorm(q=300,mean=avgmcd,sd=sdmcd)
```
  

```{r}

## Empirical Method:

mcdonalds %>% 
  filter(calories > 300) %>%
  summarise(percent = n() / nrow(mcdonalds))
```

  Using the theoretical method, the probability that a McDonald's item has more than 300 calories is approximately 80%; using the empirical method, it is 84%.
  

  Comparing the four probabilities, Dairy Queen had the closer agreement between the two methods: its theoretical and empirical probabilities differ by about 3 percentage points (80.2% vs. 83.3%), while McDonald's differ by about 5 percentage points (79.6% vs. 84.2%). Despite these small differences, both pairs of estimates are close, which suggests that calories at both restaurants are reasonably well approximated by a normal distribution.
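
To compare the agreement between the two methods more directly, the same calculation could be wrapped in a small function. This is only a sketch: `tail_prob_compare` is a hypothetical helper name of mine, and the toy vector is made-up data used to show the call, not values from the dataset.

```r
# Hypothetical helper: compare P(X > cutoff) under a fitted normal model
# with the observed proportion of values above the cutoff.
tail_prob_compare <- function(x, cutoff) {
  theoretical <- 1 - pnorm(q = cutoff, mean = mean(x), sd = sd(x))
  empirical   <- mean(x > cutoff)  # share of observations above the cutoff
  c(theoretical = theoretical,
    empirical   = empirical,
    gap         = abs(theoretical - empirical))
}

# Toy data only, to show the call. In the lab I would pass
# dairy_queen$calories or mcdonalds$calories instead.
toy_calories <- c(150, 220, 310, 340, 400, 480, 520, 610, 700, 950)
toy_result <- tail_prob_compare(toy_calories, cutoff = 300)
toy_result
```

The restaurant with the smaller `gap` is the one where the normal model and the data agree more closely.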


### Exercise 7

7.  Now let's consider some of the other variables in the dataset.  Out of all the different restaurants, which ones' distribution is the closest to normal for sodium?

```{r}
##1
ggplot(data = mcdonalds, aes(sample = sodium)) + 
  geom_line(stat = "qq")
```

```{r}
##2
ggplot(data = dairy_queen, aes(sample = sodium)) + 
  geom_line(stat = "qq")
```

```{r}
##3
ChickFilA <- fastfood %>%
  filter(restaurant == "Chick Fil-A")

ggplot(data = ChickFilA, aes(sample = sodium)) + 
  geom_line(stat = "qq")
```

```{r}
##4
Sonic <- fastfood %>%
  filter(restaurant == "Sonic")

ggplot(data = Sonic, aes(sample = sodium)) + 
  geom_line(stat = "qq")

```


```{r}
Arbys <- fastfood %>%
  filter(restaurant == "Arbys")

ggplot(data = Arbys, aes(sample = sodium)) + 
  geom_line(stat = "qq")

```

```{r}
	
Burger_King <- fastfood %>%
  filter(restaurant =="Burger King") 

ggplot(data = Burger_King, aes(sample = sodium)) + 
  geom_line(stat = "qq")

```

```{r}
Subway <- fastfood %>%
  filter(restaurant =="Subway")

ggplot(data = Subway, aes(sample = sodium)) + 
  geom_line(stat = "qq")


```

```{r}

Taco_Bell <- fastfood %>%
  filter(restaurant =="Taco Bell")

ggplot(data = Taco_Bell, aes(sample = sodium)) + 
  geom_line(stat = "qq")
```

  The distribution closest to normal appears to be Subway's: although there are some errant points in the lower tail, the plot mostly resembles a straight diagonal line.
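
A more compact alternative to the eight separate chunks above would be a single faceted figure. This is a sketch of that idea, not what I ran for this lab. It assumes `facet_wrap()` with a free y-axis per panel so that restaurants with different sodium ranges stay readable.

```r
library(tidyverse)
library(openintro)
data("fastfood", package = "openintro")

# One normal probability plot of sodium per restaurant, in a single figure.
# scales = "free_y" lets each panel use its own sodium range.
sodium_qq_plot <- ggplot(data = fastfood, aes(sample = sodium)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~ restaurant, scales = "free_y")

sodium_qq_plot
```

Having the panels side by side makes it easier to judge which restaurant stays closest to its reference line.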
  


### Exercise 8   


8.  Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. Why do you think this might be the case?
    
  I first wondered whether unusually high sodium in certain items causes the graph to look this way. A better explanation is that the sodium data contain many repeated values: tied observations share the same sample value, so they line up as flat steps in the plot.
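
The effect of repeated values can be illustrated with simulated data. In the sketch below the numbers are made up (they are not taken from the dataset), and rounding to the nearest 100 is just an assumption chosen to force ties.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated, made-up "sodium-like" values; not taken from the dataset.
smooth_values  <- rnorm(n = 60, mean = 1200, sd = 400)
# Rounding to the nearest 100 forces many observations to share a value.
rounded_values <- round(smooth_values, digits = -2)

# Fewer distinct values after rounding means more ties.
length(unique(smooth_values))
length(unique(rounded_values))

# Base R normal probability plots, side by side:
# the tied values in the rounded sample show up as flat steps.
old_par <- par(mfrow = c(1, 2))
qqnorm(smooth_values,  main = "Unrounded")
qqnorm(rounded_values, main = "Rounded to nearest 100")
par(old_par)
```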


### Exercise 9

9.  As you can see, normal probability plots can be used both to assess normality and visualize skewness.  Make a normal probability plot for the total carbohydrates from a restaurant of your choice.  Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

```{r}

ggplot(data=mcdonalds,aes(sample = total_carb)) +
  geom_line(stat="qq")
```

I think total carbohydrates for McDonald's is right skewed, because the points deviate noticeably from the line in the right tail while the rest follow a roughly straight line.

```{r}
ggplot(data = mcdonalds, aes(x = total_carb)) +
        geom_blank() +
        geom_histogram(aes(y = after_stat(density))) +
        stat_function(fun = dnorm, args = list(mean = mean(mcdonalds$total_carb), sd = sd(mcdonalds$total_carb)), col = "blue")
```

  The histogram shows a tail forming on the right-hand side, which confirms that the skew is positive (right skewed).
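
The direction of the skew could also be checked numerically. The sketch below uses a moment-based skewness coefficient. `sample_skewness` is a hypothetical helper of mine (base R has no built-in skewness function), and the toy vector is illustrative data only.

```r
# Hypothetical helper: moment-based sample skewness.
# Positive values point to a right skew, negative values to a left skew.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]                       # drop missing values, if any
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Toy vector with one large value in the right tail (illustration only).
# In the lab I would call sample_skewness(mcdonalds$total_carb).
toy_carbs <- c(20, 30, 35, 40, 45, 50, 60, 150)
sample_skewness(toy_carbs)
```

A positive result for `mcdonalds$total_carb` would agree with the right skew seen in the histogram.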
