Set up RStudio

Setting up R Markdown in RStudio lets you create dynamic, reproducible, and visually appealing reports, presentations, and documents that help you communicate your data analysis and research findings more effectively.

Load the following libraries

library(tidyverse)
library(tidyr)
library(magrittr)
library(kableExtra)
library(jtools)
library(gtsummary)
library(broom)

This fast food dataset comes from https://fastfoodnutrition.org/. The dataset contains the following variables: calories and cal_fat (calories from fat) are calorie counts; total_fat, sat_fat, trans_fat, cholesterol, total_carb, fiber, sugar, and protein are expressed in grams; sodium is expressed in milligrams; and vit_a, vit_c, and calcium are expressed as percent daily value.

Load the fast food calories dataset (look in your files to see what the file name is)

data <- read.csv("fastfood_calories.csv")
attach(data)
head(data,5)
  X restaurant                                      item calories cal_fat
1 1  Mcdonalds          Artisan Grilled Chicken Sandwich      380      60
2 2  Mcdonalds            Single Bacon Smokehouse Burger      840     410
3 3  Mcdonalds            Double Bacon Smokehouse Burger     1130     600
4 4  Mcdonalds Grilled Bacon Smokehouse Chicken Sandwich      750     280
5 5  Mcdonalds  Crispy Bacon Smokehouse Chicken Sandwich      920     410
  total_fat sat_fat trans_fat cholesterol sodium total_carb fiber sugar protein
1         7       2       0.0          95   1110         44     3    11      37
2        45      17       1.5         130   1580         62     2    18      46
3        67      27       3.0         220   1920         63     3    18      70
4        31      10       0.5         155   1940         62     2    18      55
5        45      12       0.5         120   1980         81     4    18      46
  vit_a vit_c calcium salad
1     4    20      20 Other
2     6    20      20 Other
3    10    20      50 Other
4     6    25      20 Other
5     6    20      20 Other

1) Is there a significant difference in the average number of calories for at least one of the following restaurants: Chick Fil-A, Mcdonalds, Subway?

State the Null and Alternative hypotheses.

Null: The three restaurants do not differ in their average number of calories.

Alternative: At least one of the three restaurants differs from the others in its average number of calories.

(Hint: use the filter function from tidyverse to create a new dataset that includes only the three restaurants. Be sure to ALWAYS run library(tidyverse) when you reopen R!)

data1<-data %>%
  dplyr::select(calories, restaurant)%>%
  filter(restaurant == "Chick Fil-A"|
           restaurant == "Mcdonalds"|
           restaurant == "Subway")

head(data1,5)
  calories restaurant
1      380  Mcdonalds
2      840  Mcdonalds
3     1130  Mcdonalds
4      750  Mcdonalds
5      920  Mcdonalds
tail(data1,5)
    calories restaurant
176      810     Subway
177      740     Subway
178      680     Subway
179      790     Subway
180      820     Subway

One-Way ANOVA

Test the assumption of ANOVA
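
One-way ANOVA assumes roughly normal residuals and similar variances across groups. The sketch below shows one common way to check both, using Shapiro-Wilk and Bartlett's test. It is illustrative only: the built-in PlantGrowth dataset stands in for data1 so the block runs without the CSV file, and the object names are assumptions, not part of the original analysis.

```r
# Illustrative stand-in: PlantGrowth (built into R) has a numeric outcome
# (weight) and a three-level grouping factor (group), like calories ~ restaurant.
model <- aov(weight ~ group, data = PlantGrowth)

# Normality of residuals: Shapiro-Wilk test (H0: residuals are normally distributed)
normality <- shapiro.test(residuals(model))

# Homogeneity of variance: Bartlett's test (H0: group variances are equal)
homogeneity <- bartlett.test(weight ~ group, data = PlantGrowth)

# Small p-values (for example, below 0.05) would cast doubt on the assumption
normality$p.value
homogeneity$p.value
```

For the fast food data, the analogous calls would be shapiro.test(residuals(anova_model)) and bartlett.test(calories ~ restaurant, data = data1), run after the model is fitted.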

Run the test

anova_model <- aov(calories~restaurant,data = data1)
summary(anova_model)
             Df   Sum Sq Mean Sq F value  Pr(>F)   
restaurant    2  1335723  667861   6.468 0.00194 **
Residuals   177 18276284  103256                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Visual Representation

boxplot(calories~restaurant,data = data1)

Using ggplot2

# Load ggplot2 package
library(ggplot2)

# Create a boxplot of calories by restaurant
ggplot(data = data1, aes(x = restaurant, y = calories)) +
  geom_boxplot() +
  ggtitle("Boxplot of Calories Across Restaurants") +
  xlab("Restaurants") +
  ylab("Calories")

2) Which, if any, of the three restaurants has a statistically significantly lower number of calories than the others? (see Canvas for answer choices)

# use aggregate function to find the mean value for each category
aggregate(calories ~ restaurant, data = data1, FUN = mean)
   restaurant calories
1 Chick Fil-A 384.4444
2   Mcdonalds 640.3509
3      Subway 503.0208

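Chick Fil-A had the lowest mean calories (384.4), compared with Subway (503.0) and Mcdonalds (640.4). The ANOVA above shows only that at least one mean differs; the group means by themselves do not establish which pairwise differences are statistically significant, so a post-hoc comparison is needed to answer that part of the question.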
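
One common post-hoc comparison after a significant one-way ANOVA is Tukey's HSD, which tests every pair of group means. The sketch below is illustrative only: it uses the built-in PlantGrowth dataset as a stand-in for data1, and the object names are assumptions rather than part of the original analysis.

```r
# Illustrative stand-in for calories ~ restaurant
model <- aov(weight ~ group, data = PlantGrowth)

# Tukey's Honest Significant Differences for all pairwise group comparisons
tukey <- TukeyHSD(model)

# One row per pair of groups; columns are the difference in means (diff),
# the confidence limits (lwr, upr), and the adjusted p-value (p adj)
tukey$group
```

For the fast food data, the analogous call would be TukeyHSD() on a model fitted with restaurant as a factor, for example aov(calories ~ factor(restaurant), data = data1). A pair differs significantly when its adjusted p-value is below 0.05.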

3) Run a linear regression on the entire fast food dataset (not just the three restaurant dataset you created for the first questions!) using the following variables to predict calories: total_fat, total_carb, protein, sodium, fiber, and sugar. Which, if any, variables were statistically significant predictors? (select all that apply – see Canvas for answer choices)

Estimate the model

fit <- lm(calories~total_fat+total_carb+protein+sodium+fiber+sugar, data = data)
summ(fit,confint = TRUE, digits = 3)
Observations         503 (12 missing obs. deleted)
Dependent variable   calories
Type                 OLS linear regression

F(6,496)   3090.241
R²         0.974
Adj. R²    0.974

                Est.     2.5%    97.5%   t val.       p
(Intercept)    3.462   -5.796   12.719    0.735   0.463
total_fat      8.555    8.210    8.899   48.768   0.000
total_carb     3.995    3.685    4.304   25.330   0.000
protein        3.993    3.594    4.392   19.645   0.000
sodium         0.008   -0.003    0.019    1.423   0.155
fiber         -0.671   -2.549    1.207   -0.702   0.483
sugar         -0.202   -0.933    0.528   -0.544   0.587

Standard errors: OLS

In the output, the “p” values associated with each coefficient estimate indicate the statistical significance of each predictor variable. A “p” value less than 0.05 indicates that the predictor variable is statistically significant.

In this model output, we can see that all predictor variables except for “sodium”, “fiber” and “sugar” are statistically significant, as their “p” values are less than 0.05.

Therefore, the statistically significant predictor variables in this model are “total_fat”, “total_carb”, and “protein”.

4) For every additional gram of total_fat, how much do calories go up, if at all?

Based on the output above, the coefficient estimate for the “total_fat” variable is 8.555 (95% confidence interval 8.210 to 8.899). This means that each additional gram of total fat is associated with an estimated increase of about 8.56 calories, holding all other variables in the model constant.
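
The "holding all other variables constant" reading of a coefficient can be checked directly: raising one predictor by one unit while fixing the others changes the predicted value by exactly that coefficient. The sketch below is illustrative only and uses the built-in mtcars dataset rather than the fast food data; the variable names are assumptions for the example.

```r
# Illustrative model on built-in data: mpg predicted by weight (wt) and horsepower (hp)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Estimated slope for wt
slope_wt <- coef(fit)[["wt"]]

# Two hypothetical cars that differ by one unit of wt and share the same hp
base     <- data.frame(wt = 3, hp = 110)
plus_one <- data.frame(wt = 4, hp = 110)

# The difference in predictions equals the wt coefficient
change <- unname(predict(fit, plus_one) - predict(fit, base))
all.equal(change, slope_wt)
```

For the fast food model, coef(fit)[["total_fat"]] and confint(fit)["total_fat", ] would return the estimate and its confidence interval in the same way.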

OPIATE OVERDOSE IN TRAVIS COUNTY

This opiate overdose dataset comes from: https://data.austintexas.gov/Health-and-Community-Services/Opiate-Overdoses-by-Age-Range-Gender-and-Drug-Type/njyb-3fuz. It is published by Austin-Travis County EMS and reports the number of overdoses in Austin for the 2018 fiscal year. The dataset contains a row_ID (like a record id), age_group in five-year bands, sex (Male, Female), and substance. Substance is one of 1) Heroin/Street, 2) Pharmacy, 3) Heroin/Street AND Pharmacy, or 4) Unknown.

Load the dataset of opiate overdoses in Austin in 2018 (look in your files to see what the file name is)

mydata <-read.csv("opiate_overdoes_austin_2018.csv")
attach(mydata)
head(mydata,5)
  row_ID age_group    sex                  substance
1      1     30-34 Female Heroin/Street AND Pharmacy
2      2     45-49 Female Heroin/Street AND Pharmacy
3      3     55-59   Male Heroin/Street AND Pharmacy
4      4     20-24   Male                   Pharmacy
5      5     15-19   Male              Heroin/Street

5) According to this dataset, how many people overdosed on opiates in Austin in 2018?

Count

str(mydata)
'data.frame':   392 obs. of  4 variables:
 $ row_ID   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ age_group: chr  "30-34" "45-49" "55-59" "20-24" ...
 $ sex      : chr  "Female" "Female" "Male" "Male" ...
 $ substance: chr  "Heroin/Street AND Pharmacy" "Heroin/Street AND Pharmacy" "Heroin/Street AND Pharmacy" "Pharmacy" ...

The dataset has 392 observations (one row per record), so 392 opiate overdoses were recorded in Austin in fiscal year 2018.

6) What proportion of all opiate overdose cases are: Male? Female? (round to two decimal places, be sure to report proportions, not percentages or raw numbers – hint: use two R functions together to get proportions)

## Get one-way table for male and female
mytable <- table(sex)
mytable
sex
Female   Male 
   121    271 
### Get a table of proportions
prop_table <- prop.table(mytable)
print(prop_table, digits = 2)
sex
Female   Male 
  0.31   0.69 

From the results above, the proportions of males and females among opiate overdose cases in Austin in 2018 are 0.69 and 0.31, respectively.

7) Do men and women in Austin overdose on opiates in similar proportions to what we’d expect? (hint: what proportion of overdoses would you expect to be female if the null hypothesis were true)

State the Null and Alternative hypotheses.

Null:

Men and women in Austin overdose on opiates in equal proportions.

Alternative:

Men and women in Austin overdose on opiates in different proportions.

If the null hypothesis were true, the expected proportion of female overdose cases in Austin in 2018 would be 0.5.
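
The hypothesis can be tested with an exact binomial test against an expected proportion of 0.5. The sketch below is one possible approach, not necessarily the method intended by the assignment; it uses the counts from the one-way table of sex above (121 female cases out of 392), and the object names are assumptions for illustration.

```r
# Counts taken from the one-way table of sex above
female_cases <- 121
total_cases  <- 392

# Exact binomial test of H0: proportion female = 0.5
sex_test <- binom.test(female_cases, total_cases, p = 0.5)

sex_test$estimate  # observed proportion of female cases
sex_test$p.value   # p-value for the two-sided test
```

The p-value is far below 0.05, so the null hypothesis of equal proportions is rejected: men account for a larger share of overdose cases than expected under a 50/50 split.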

8) There are four drug categories in this dataset. Report the proportion of each drug type. (round to two decimal places, be sure to report proportions not percentages)

## get the one-way table for substances
mytable2<- table(substance)
print(mytable2)
substance
             Heroin/Street Heroin/Street AND Pharmacy 
                       229                          3 
                  Pharmacy                    Unknown 
                        96                         64 
### get the proportions
prop_table2 <- prop.table(mytable2)
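
The proportions computed above are never printed. A minimal sketch of the remaining step follows, rebuilding the counts from the one-way table of substances so the block runs without the CSV file; the names substance_counts and substance_props are introduced here for illustration.

```r
# Counts copied from the one-way table of substances above
substance_counts <- c("Heroin/Street"              = 229,
                      "Heroin/Street AND Pharmacy" = 3,
                      "Pharmacy"                   = 96,
                      "Unknown"                    = 64)

# Convert counts to proportions and round to two decimal places
substance_props <- round(prop.table(substance_counts), 2)
substance_props
```

With the data loaded, the equivalent call is round(prop_table2, 2). The proportions are 0.58 for Heroin/Street, 0.01 for Heroin/Street AND Pharmacy, 0.24 for Pharmacy, and 0.16 for Unknown (they sum to 0.99 because of rounding).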

9) Is the substance of choice statistically independent from sex? (i.e. Do men and women who overdose use the same substances in equal proportions?)

State the Null and Alternative hypotheses.

Null:

The substance involved in an overdose is independent of sex (there is no association between substance and sex).

Alternative:

The substance involved in an overdose is not independent of sex (there is an association between substance and sex).

Test the hypothesis

dat <- mydata[,c(3,4)] %>%
  tbl_summary(by = sex) %>%
  add_p() %>%
  add_overall() %>% 
  bold_labels()
dat
Characteristic                    Overall, N = 392¹   Female, N = 121¹   Male, N = 271¹   p-value²
substance                                                                                 <0.001
    Heroin/Street                 229 (58%)           56 (46%)           173 (64%)
    Heroin/Street AND Pharmacy    3 (0.8%)            2 (1.7%)           1 (0.4%)
    Pharmacy                      96 (24%)            44 (36%)           52 (19%)
    Unknown                       64 (16%)            19 (16%)           45 (17%)
¹ n (%)
² Fisher's exact test

From the results above, there is a statistically significant association between the substance involved in an overdose and sex (Fisher's exact test, p-value < 0.001).

10) Based on the expected vs. observed data from your chi-squared test, the data suggest: (see Canvas for answer choices)

chi_table <- table(sex,substance)
chi_table
        substance
sex      Heroin/Street Heroin/Street AND Pharmacy Pharmacy Unknown
  Female            56                          2       44      19
  Male             173                          1       52      45

Run the chi-square test

chi_results <- chisq.test(chi_table)
chi_results

    Pearson's Chi-squared test

data:  chi_table
X-squared = 16.333, df = 3, p-value = 0.0009687
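
To compare expected and observed counts, the expected values can be pulled from the test object. The sketch below rebuilds the contingency table from the counts printed above so it runs without the CSV file; the object names are assumptions for illustration.

```r
# Observed counts copied from the sex-by-substance table above
observed <- matrix(c(56, 2, 44, 19,
                     173, 1, 52, 45),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(
                     sex = c("Female", "Male"),
                     substance = c("Heroin/Street", "Heroin/Street AND Pharmacy",
                                   "Pharmacy", "Unknown")))

# chisq.test() warns here because one column has very small expected counts;
# the warning is suppressed only to keep the example output clean
chi <- suppressWarnings(chisq.test(observed))

round(chi$expected, 1)             # counts expected if sex and substance were independent
round(observed - chi$expected, 1)  # positive values: more cases observed than expected
```

Women had more Pharmacy overdoses than expected (44 observed against about 29.6 expected), while men had more Heroin/Street overdoses than expected (173 observed against about 158.3 expected). With the data loaded, chi_results$expected gives the same expected counts.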

11) Which age group had the highest opiate overdose cases? (see Canvas for answer choices)

agegroup <- table(age_group)
agegroup
age_group
15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 
    7    71    61    60    31    22    22    22    32    22    14    16     4 
80-84 85-89 90-94 
    3     3     2 
library(knitr)

# create vectors of age groups and overdose counts (copied from the table above)
age <- c("15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54", "55-59", 
         "60-64", "65-69", "70-74", "75-79", "80-84", "85-89", "90-94")
overdoses <- c(7, 71, 61, 60, 31, 22, 22, 22, 32, 22, 14, 16, 4, 3, 3, 2)
proportion <- overdoses/sum(overdoses)

# create data frame
df <- data.frame(age, overdoses, proportion)

# format table with kable
kable(df, 
      col.names = c("Age", "Overdoses", "proportion"),
      row.names = FALSE,
      align = "c") %>%
  kable_styling()
 Age    Overdoses   proportion
15-19       7       0.0178571
20-24      71       0.1811224
25-29      61       0.1556122
30-34      60       0.1530612
35-39      31       0.0790816
40-44      22       0.0561224
45-49      22       0.0561224
50-54      22       0.0561224
55-59      32       0.0816327
60-64      22       0.0561224
65-69      14       0.0357143
70-74      16       0.0408163
75-79       4       0.0102041
80-84       3       0.0076531
85-89       3       0.0076531
90-94       2       0.0051020

The table shows the distribution of opiate overdoses across age categories. The column labeled “Overdoses” displays the number of overdoses for each age category, and the column labeled “proportion” shows each category's share of all 392 cases.

From the table, we can see that the age category with the highest number of overdoses is 20-24, with 71 cases. The 25-29 and 30-34 age categories also have a relatively high number of overdoses, with 61 and 60 cases, respectively. The lowest number of overdoses is in the 90-94 age category, with only 2 cases.

Overall, the data suggest that opiate overdoses are most common among young adults (ages 20-34), with generally lower counts in older age categories apart from a smaller secondary peak at 55-59 (32 cases).
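
The age group with the most cases can also be picked out programmatically rather than by reading the table. A minimal sketch, using the counts from the table above so it runs without the CSV file; the names top_group and top_count are introduced here for illustration.

```r
# Age groups and overdose counts copied from the table above
age <- c("15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54",
         "55-59", "60-64", "65-69", "70-74", "75-79", "80-84", "85-89", "90-94")
overdoses <- c(7, 71, 61, 60, 31, 22, 22, 22, 32, 22, 14, 16, 4, 3, 3, 2)

# which.max() returns the position of the largest count
top_group <- age[which.max(overdoses)]
top_count <- max(overdoses)

top_group  # "20-24"
top_count  # 71
```

With the data loaded, names(which.max(table(mydata$age_group))) gives the same answer directly from the raw records.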