You’ll need the following packages:
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Once you upload your data set, you should end up with a table in your workspace as we did in class. Change the file name to your file’s name, and remove the # from the start of the line.
fastfood <- read_csv("fastfood.csv")
For your midterm take-home assignment, you will carry out an exploratory data analysis on the data set you’ve chosen. While there is no one way to perform EDA, your analysis will be structured in a way similar to what we’ve seen in class.
Submit your responses to these questions as a PDF which you knit from this file. Be sure to include the code you use. This assignment is due on Canvas under the assignment labeled “Midterm 1 Take-Home”.
Using RStudio, you will produce plots, calculate summary statistics to go along with the plots, and interpret the results. For each plot, you will also generate some questions and answers.
List all your variables and give their types. What does a row in your data set represent? Are there any missing values? Do you know why?
Categorical: restaurant, item, salad
Numerical: calories, cal_fat, total_fat, sat_fat, trans_fat, cholesterol, sodium, total_carb, fiber, sugar, protein, vit_a, vit_c, calcium
The data set is a data frame with 515 observations (rows) and 17 variables (columns) listed in the table below, along with the description and variable type.
A row represents a food item from a specific restaurant.
No values are missing.
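The missing-value claim can be checked directly rather than by eye. The sketch below counts NA entries per column; it runs on a small made-up tibble (`toy` and its values are hypothetical, not taken from fastfood.csv), and the same `summarize()` call should work on `fastfood`.

```r
library(dplyr)

# Toy stand-in for the real data (hypothetical values, not from fastfood.csv)
toy <- tibble(
  restaurant = c("A", "A", "B", "B"),
  calories   = c(300, 450, 500, NA),
  fiber      = c(3, NA, NA, 2)
)

# Count the NA entries in every column
na_counts <- toy %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

na_counts
```

If every count comes back as 0 for the real data, the statement above holds; any nonzero column is worth a sentence on why values might be missing.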
fastfood %>% ggplot(aes(x=calories))+geom_histogram(bins=10)
Describe its modality and skew.
The graph is unimodal and skewed to the right.
Calculate the mean, median, and SD.
fastfood%>% summarize(mean(calories))
## # A tibble: 1 × 1
## `mean(calories)`
## <dbl>
## 1 531.
fastfood%>% summarize(median(calories))
## # A tibble: 1 × 1
## `median(calories)`
## <dbl>
## 1 490
fastfood%>% summarize(sd(calories))
## # A tibble: 1 × 1
## `sd(calories)`
## <dbl>
## 1 282.
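The three statistics can also be computed in a single `summarize()` call, which keeps them side by side in one table. This is only a sketch on made-up numbers (`toy` and the column names `mean_cal`, `median_cal`, `sd_cal` are hypothetical); swapping in `fastfood` should give the values reported above.

```r
library(dplyr)

# Toy stand-in for the calories column (hypothetical values)
toy <- tibble(calories = c(200, 400, 600, 1000))

# Mean, median, and SD in one named summary row
calorie_stats <- toy %>%
  summarize(
    mean_cal   = mean(calories),
    median_cal = median(calories),
    sd_cal     = sd(calories)
  )

calorie_stats
```

A mean larger than the median, as in this toy example, is what a right-skewed distribution usually produces.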
Interpret the SD of the numerical variable you chose, using the framework sentence we discussed in class.
A randomly chosen fast food item is, on average, about 282 calories away from the mean of 531 calories.
Choose a categorical variable, and generate a bar plot for this categorical variable.
fastfood%>%ggplot(aes(x=restaurant))+geom_bar()
Use filter to narrow down your data to a certain subset you find interesting, and create a histogram for the same numerical variable you chose above.
fastfood %>% filter(restaurant == "Taco Bell") %>% ggplot(aes(x = calories)) + geom_histogram(bins = 10)
Choose two categorical variables, and compare them with an appropriate plot.
fastfood %>% ggplot(aes(x=restaurant, fill=as.factor(salad)))+geom_bar()
Describe any relationships you see between the variables in this plot.
At every restaurant, more than 50% of the menu items are “Not Salad” options rather than “Salad” options.
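A claim about percentages is easier to defend with the proportions computed explicitly. The sketch below does this on a made-up tibble (`toy`, its restaurant names, and the `prop` column are hypothetical), assuming the salad column holds the labels “Salad” and “Not Salad” as in the plot legend.

```r
library(dplyr)

# Toy stand-in for the restaurant and salad columns (hypothetical values)
toy <- tibble(
  restaurant = c("A", "A", "A", "B", "B", "B", "B"),
  salad      = c("Salad", "Not Salad", "Not Salad",
                 "Not Salad", "Not Salad", "Not Salad", "Salad")
)

# Count items per restaurant and salad label, then convert to within-restaurant shares
salad_share <- toy %>%
  count(restaurant, salad) %>%
  group_by(restaurant) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

salad_share
```

On the plot itself, `geom_bar(position = "fill")` rescales each bar to proportions, which makes the same comparison visible without the table.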
Choose two numerical variables and compare them with an appropriate plot.
fastfood%>% ggplot(mapping=aes(x=calories, y=sodium))+geom_point()+geom_smooth(method="lm",se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'
Calculate the correlation coefficient, and describe any relationships you see. Use the sentences we introduced in class for describing relationships between these types of variables.
fastfood %>% summarize(cor(calories, sodium))
## # A tibble: 1 × 1
## `cor(calories, sodium)`
## <dbl>
## 1 0.818
If you saw a relationship between the variables in your plot, why do you think this is the case? If you didn’t see a relationship there, why do you think there isn’t one?
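From the graph it is evident that most food items under 1000 calories contain under 2500 mg of sodium. As the calories increase, the sodium found in the food items also tends to increase. The correlation coefficient of 0.818 indicates a strong, positive, linear relationship between calories and sodium; it describes how tightly the points follow a line, not how much sodium rises per calorie.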
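The correlation coefficient and the slope of the fitted line answer different questions, and it is easy to mix them up. The sketch below computes both on a made-up tibble (`toy` and its values are hypothetical); `lm()` is the same linear fit that `geom_smooth(method = "lm")` draws.

```r
library(dplyr)

# Toy stand-in for calories and sodium (hypothetical values)
toy <- tibble(
  calories = c(200, 400, 600, 800),
  sodium   = c(500, 900, 1500, 1700)
)

# Correlation: strength and direction of the linear relationship (unitless, between -1 and 1)
r <- cor(toy$calories, toy$sodium)

# Slope: expected change in sodium for each additional calorie
fit   <- lm(sodium ~ calories, data = toy)
slope <- coef(fit)[["calories"]]

r
slope
```

Here the slope is the number to quote in a sentence like "for each additional calorie, sodium increases by about ...", while the correlation only says how strong the linear pattern is.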
Choose a categorical variable and a numerical variable, and use the appropriate type of plot to visualize their relationship below.
fastfood%>% ggplot(aes(x=as.factor(restaurant), y=calories))+geom_boxplot()
fastfood%>%group_by(restaurant) %>% summarize(median(calories))
## # A tibble: 8 × 2
## restaurant `median(calories)`
## <chr> <dbl>
## 1 Arbys 550
## 2 Burger King 555
## 3 Chick Fil-A 390
## 4 Dairy Queen 485
## 5 Mcdonalds 540
## 6 Sonic 570
## 7 Subway 460
## 8 Taco Bell 420
If you saw or did not see differences between groups in your plot, why might this be the case?
I did not see much of a difference between groups, and the medians computed with group_by agree with what the boxplot shows.
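Medians alone can hide differences in spread, so it may help to summarize the IQR and group size alongside them. The sketch below does this on a made-up tibble (`toy` and the column names `median_cal`, `iqr_cal`, `n` are hypothetical); the same pipeline should work on `fastfood`.

```r
library(dplyr)

# Toy stand-in for restaurant and calories (hypothetical values)
toy <- tibble(
  restaurant = c("A", "A", "A", "B", "B", "B"),
  calories   = c(300, 400, 500, 350, 450, 1000)
)

# Per-restaurant center, spread, and number of items
by_restaurant <- toy %>%
  group_by(restaurant) %>%
  summarize(
    median_cal = median(calories),
    iqr_cal    = IQR(calories),
    n          = n()
  )

by_restaurant
```

In this toy example the medians are close, but the IQRs differ a lot, which is the kind of group difference a boxplot shows through box height rather than the center line.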
After showing you can use some basic data manipulation and visualization tools, I’ll want to know what observations you can draw from these visualizations and summary statistics. Here is your chance to shine - tell me what you know about the dataset you chose! Here are some prompts to help frame this discussion.
fastfood %>% ggplot(mapping = aes(x = sugar, y = sodium, color = calories)) + geom_point() + facet_grid(. ~ restaurant)
Why is this observation surprising?
Chick-fil-A has the lowest calorie count, while Subway, which has the most salad options, has one of the highest calorie counts.
Taco Bell has lower sugar and sodium content despite having no salad options, which is surprising.
Subway, despite having more salad options, is the second-highest in terms of sugar content, which contradicts the expectation that it would offer healthier options.
McDonald’s has the highest sodium and sugar content, which aligns with its having the highest calorie count compared to the other restaurants.
Based on these observations, it can be concluded that McDonald’s is the least healthy restaurant and Chick-fil-A is the healthiest among the ones analyzed.
What further questions do you have now that you’ve seen this data?
What percentage of each food item is made up of sodium at each restaurant?
What other data sets might be interesting to compare to this one?
A data set from health-focused restaurants, to compare calories, sodium, sugar, etc.