Setup

You’ll need the following packages:

## Loading required package: tidyverse

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Once you upload your data set, you should end up with a table in your workspace as we did in class. Change the file name to your file’s name, and remove the # from the start of the line.

fastfood <- read_csv("fastfood.csv")

For your midterm take-home assignment, you will carry out an exploratory data analysis on the data set you’ve chosen. While there is no one way to perform EDA, your analysis will be structured in a way similar to what we’ve seen in class.

Submit your responses to these questions as a PDF which you knit from this file. Be sure to include the code you use. Submit it on Canvas under the assignment labeled “Midterm 1 Take-Home”.

Using RStudio, you will produce plots, calculate summary statistics to go along with the plots, and interpret the results. For each plot, you will also generate some questions and answers.

Variables and Types, Missing Data

List all your variables and give their types. What does a row in your data set represent? Are there any missing values? Do you know why?

Categorical: restaurant, item, salad

Numerical: calories, cal_fat, total_fat, sat_fat, trans_fat, cholesterol, sodium, total_carb, fiber, sugar, protein, vit_a, vit_c, calcium

The data set is a data frame with 515 observations (rows) and 17 variables (columns), listed above by type.

A row represents one food item from a specific restaurant.

No values are missing.
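
One way to back up the missing-values answer is to count the `NA` entries in every column. The sketch below is only an illustration: `toy` is a hypothetical stand-in table (not the real `fastfood` data), and the same pipeline could be applied to `fastfood` instead.

```r
library(dplyr)

# Hypothetical toy data standing in for the real `fastfood` table
toy <- tibble(
  restaurant = c("Subway", "Sonic", "Arbys"),
  calories   = c(460, 570, 550),
  fiber      = c(5, NA, 2)
)

# Count the NA values in every column; a column of all zeros has nothing missing
na_counts <- toy %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

na_counts
```

If every count is zero for the real data, the claim of no missing values holds; any nonzero count identifies a column worth a closer look.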

What type of variation occurs within variables?

Choose a numerical variable, and plot a histogram of this variable.

fastfood %>% ggplot(aes(x = calories)) + geom_histogram(bins = 10)

Describe its modality and skew.

The graph is unimodal and skewed to the right.

Calculate the mean, median, and SD.

fastfood %>% summarize(mean(calories))

## # A tibble: 1 × 1
##   `mean(calories)`
##              <dbl>
## 1             531.

fastfood %>% summarize(median(calories))

## # A tibble: 1 × 1
##   `median(calories)`
##                <dbl>
## 1                490

fastfood %>% summarize(sd(calories))

## # A tibble: 1 × 1
##   `sd(calories)`
##            <dbl>
## 1           282.
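
The three summaries above can also be computed in a single `summarize()` call with named columns, which keeps the results in one table. This is a minimal sketch on a hypothetical `toy` table; the column names `mean_cal`, `median_cal`, and `sd_cal` are illustrative choices, not names from the assignment.

```r
library(dplyr)

# Hypothetical toy data standing in for the real `fastfood` table
toy <- tibble(calories = c(200, 400, 600, 800))

# One call, three named summary columns
stats <- toy %>%
  summarize(
    mean_cal   = mean(calories),
    median_cal = median(calories),
    sd_cal     = sd(calories)
  )

stats
```

Replacing `toy` with `fastfood` would produce the same three statistics reported above in one row.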

Interpret the SD of the numerical variable you chose, using the framework sentence we discussed in class.

A randomly chosen fast food item is, on average, about 282 calories away from the mean of about 531 calories.

Choose a categorical variable, and generate a bar plot for this categorical variable.

fastfood %>% ggplot(aes(x = restaurant)) + geom_bar()

Use filter to narrow down your data to a certain subset you find interesting, and create a histogram for the same numerical variable you chose above.

fastfood %>% filter(restaurant == "Taco Bell") %>% ggplot(aes(x = calories)) + geom_histogram(bins = 10)

Is its distribution different from the first histogram you made? Can you say why or why not?

The histogram differs from the first one because it uses only the calorie observations for Taco Bell items.

What relationships do you see between variables?

Choose two categorical variables, and compare them with an appropriate plot.
```
fastfood %>% ggplot(aes(x=restaurant, fill=as.factor(salad)))+geom_bar()
```
- Describe any relationships you see between the variables in this plot.

  At every restaurant, more than 50% of the menu items are “Not Salad” rather than “Salad”.
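
The stacked bar chart shows counts, so a statement about percentages can be checked by computing the share of each `salad` value within each restaurant. The sketch below uses a hypothetical `toy` table with made-up restaurant labels; the `prop` column name is an illustrative choice.

```r
library(dplyr)

# Hypothetical toy data standing in for the real `fastfood` table
toy <- tibble(
  restaurant = c("A", "A", "A", "B", "B"),
  salad      = c("Salad", "Not Salad", "Not Salad", "Not Salad", "Not Salad")
)

# Count items per restaurant/salad pair, then convert to within-restaurant shares
salad_share <- toy %>%
  count(restaurant, salad) %>%
  group_by(restaurant) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

salad_share
```

A similar picture is available directly in the plot by using `geom_bar(position = "fill")`, which rescales each bar to proportions.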
Choose two numerical variables and compare them with an appropriate plot.

fastfood %>% ggplot(mapping = aes(x = calories, y = sodium)) + geom_point() + geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula = 'y ~ x'

Calculate the correlation coefficient, and describe any relationships you see. Use the sentences we introduced in class for describing relationships between these types of variables.
```
fastfood %>% summarize(cor(calories, sodium))
```
```
## # A tibble: 1 × 1
##   `cor(calories, sodium)`
##                     <dbl>
## 1                   0.818
```
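
The correlation coefficient measures how strong the linear association is, not how much sodium changes per calorie; that rate is the slope of the fitted line. The following sketch, on a hypothetical `toy` data frame with made-up values, shows how the two quantities could be obtained separately with `cor()` and `lm()`.

```r
# Hypothetical toy data standing in for the real `fastfood` table
toy <- data.frame(
  calories = c(200, 400, 600, 800),
  sodium   = c(500, 900, 1500, 1700)
)

# Strength and direction of the linear association (unitless, between -1 and 1)
r <- cor(toy$calories, toy$sodium)

# Slope of the least-squares line: change in sodium per one-unit change in calories
fit   <- lm(sodium ~ calories, data = toy)
slope <- coef(fit)[["calories"]]

# The two are linked by slope = r * sd(y) / sd(x)
slope_from_r <- r * sd(toy$sodium) / sd(toy$calories)
```

Running `lm(sodium ~ calories, data = fastfood)` on the real data would give the slope to use in a sentence of the form "for each additional calorie, sodium increases by about ...".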
If you saw a relationship between the variables in your plot, why do you think this is the case? If you didn’t see a relationship there, why do you think there isn’t one?

From the graph, most food items under 1000 calories contain less than 2500 mg of sodium. The correlation coefficient of 0.818 indicates a strong, positive, linear relationship between calories and sodium: as calories increase, sodium tends to increase as well.

Choose a categorical variable and a numerical variable, and use the appropriate type of plot to visualize their relationship below.

fastfood %>% ggplot(aes(x = as.factor(restaurant), y = calories)) + geom_boxplot()

Use group_by to calculate the median within each value of your categorical variable.

fastfood %>% group_by(restaurant) %>% summarize(median(calories))

## # A tibble: 8 × 2
##   restaurant  `median(calories)`
##   <chr>                    <dbl>
## 1 Arbys                      550
## 2 Burger King                555
## 3 Chick Fil-A                390
## 4 Dairy Queen                485
## 5 Mcdonalds                  540
## 6 Sonic                      570
## 7 Subway                     460
## 8 Taco Bell                  420
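
To judge whether the groups differ, it can help to sort the medians and put a measure of spread next to them. The sketch below is one possible approach on a hypothetical `toy` table; the column names `median_cal`, `iqr_cal`, and `n_items` are illustrative.

```r
library(dplyr)

# Hypothetical toy data standing in for the real `fastfood` table
toy <- tibble(
  restaurant = c("A", "A", "A", "B", "B", "B"),
  calories   = c(300, 400, 500, 500, 600, 900)
)

# Median, interquartile range, and item count per restaurant, highest median first
by_rest <- toy %>%
  group_by(restaurant) %>%
  summarize(
    median_cal = median(calories),
    iqr_cal    = IQR(calories),
    n_items    = n()
  ) %>%
  arrange(desc(median_cal))

by_rest
```

When the gap between two medians is small relative to the IQRs, the corresponding boxes in the boxplot overlap heavily.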

If you saw or did not see differences between groups in your plot, why might this be the case?

I did not see much of a difference between groups; the medians calculated with group_by are consistent with what the boxplot shows.

Conclusion: On your own

After showing you can use some basic data manipulation and visualization tools, I’ll want to know what observations you can draw from these visualizations and summary statistics. Here is your chance to shine - tell me what you know about the dataset you chose! Here are some prompts to help frame this discussion.

What is something you observe in this dataset which is surprising or unexpected? Show what you find surprising using a chart or summary statistic calculation below.

fastfood %>% ggplot(mapping = aes(x = sugar, y = sodium, color = calories)) + geom_point() + facet_grid(. ~ restaurant)

Why is this observation surprising?

1. Chick-fil-A has the lowest calorie count, while Subway, which has the most salad options, has one of the highest calorie counts.
2. Taco Bell has lower sugar and sodium content despite having no salad options, which is surprising.
3. Subway, despite having more salad options, is the second-highest in terms of sugar content, which contradicts the expectation that it would offer healthier options.
4. McDonald’s has the highest sodium and sugar content, which aligns with having the highest calorie count compared to other restaurants.

Based on these observations, it can be concluded that McDonald’s is the least healthy restaurant, and Chick-fil-A is the healthiest among the ones analyzed.

What further questions do you have now that you’ve seen this data?
- What percentage of the food items at each restaurant is made up of sodium?

What other data sets might be interesting to compare to this one?

A data set from health-food restaurants would be interesting for comparing calories, sodium, sugar, and other nutrients.

Math 120 Midterm 1 - Take-Home

Chanelle Russell

2023-11-26
