Set up Rstudio

Setting up RMarkdown when opening it enables you to create dynamic, reproducible, and visually appealing reports, presentations, and documents, that can help you communicate your data analysis and research findings more effectively.

Load the following libraries

library(tidyverse)
library(tidyr)
library(magrittr)
library(kableExtra)
library(jtools)
library(gtsummary)
library(broom)
library(ggplot2)

This fast food data set comes from: https://fastfoodnutrition.org/ The following are variable in the dataset; calories, cal_fat, total_fat, sat_fat, trans_fat, cholesterol, total_carb, fiber, sugar, and protein are expressed in grams, sodium is expressed as milligrams, and vit_a, vit_c, and calcium are expressed as percent daily value

Load the dataset fastfood calories – look in your files to see what the file name it

data <- read.csv("fastfood_calories.csv")
attach(data)
attach(data)
head(data,5)

1) Is there a significant difference in the average number of calories for at least one of the following restaurants: Chick Fil-A, Mcdonalds, Subway?

State the Null and Alternative hypotheses.

Null: The three restaurants do not differ in their average number of calories.
Alternative: At least one of the three restaurants differ in their average number of calories.

(Hint: use the filter function from tidyverse (be sure to ALWAYS run library(tidyverse) when you reopen R!) to create a new dataset that only includes the three restaurants)

data1<-data %>%
  dplyr::select(calories, restaurant)%>%
  filter(restaurant == "Chick Fil-A"|
           restaurant == "Mcdonalds"|
           restaurant == "Subway")

head(data1,5)

One_Way ANOVA

Test the assumption of ANOVA

anova_model <- aov(calories~restaurant,data = data1)
summary(anova_model)

             Df   Sum Sq Mean Sq F value  Pr(>F)   
restaurant    2  1335723  667861   6.468 0.00194 **
Residuals   177 18276284  103256                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

View the plot

library(ggplot2)

# Create a boxplot of salaries by job category
ggplot(data = data1, aes(x = restaurant, y = calories)) +
  geom_boxplot() +
  ggtitle("Boxplot of Calories Across Restaurants") +
  xlab("Restaurants") +
  ylab("Calories")

2) Which if any of the three restaurants has statistically significantly the lowest number of calories? (see Canvas for answer choices)

use aggregate function to find the mean value for each category

# Use aggregate to calculate the mean by Restaurant
Mean_calories<- aggregate(calories, by = list(restaurant), FUN = mean)
# Rename the columns
names(Mean_calories) <- c("restaurant", "mean_calories")

# Print the result
print(Mean_calories)

   restaurant mean_calories
1       Arbys      532.7273
2 Burger King      608.5714
3 Chick Fil-A      384.4444
4 Dairy Queen      520.2381
5   Mcdonalds      640.3509
6       Sonic      631.6981
7      Subway      503.0208
8   Taco Bell      443.6522

Chick Fil-A had the lowest average calories as compared to Mcdonalds and Subway. Consider the results above.

3) Run a linear regression on the entire fast food dataset (not just the three restaurant dataset you created for the first questions!) using the following variables to predict calories: total_fat, total_carb, protein, sodium, fiber, and sugar. Which, if any, variables were statistically significant predictors? (select all that apply – see Canvas for answer choices)

render = 'normal_print'
### Estimate the model
fit <- lm(calories~total_fat+total_carb+protein+sodium+fiber+sugar, data = data)
summ(fit,confint = TRUE, digits = 3)

Observations	503 (12 missing obs. deleted)
Dependent variable	calories
Type	OLS linear regression

F(6,496)	3090.241
R²	0.974
Adj. R²	0.974

	Est.	2.5%	97.5%	t val.	p
(Intercept)	3.462	-5.796	12.719	0.735	0.463
total_fat	8.555	8.210	8.899	48.768	0.000
total_carb	3.995	3.685	4.304	25.330	0.000
protein	3.993	3.594	4.392	19.645	0.000
sodium	0.008	-0.003	0.019	1.423	0.155
fiber	-0.671	-2.549	1.207	-0.702	0.483
sugar	-0.202	-0.933	0.528	-0.544	0.587
Standard errors: OLS

In the output, the “p” values associated with each coefficient estimate indicate the statistical significance of each predictor variable. A “p” value less than 0.05 indicates that the predictor variable is statistically significant.

In this model output, we can see that all predictor variables except for “sodium”, “fiber” and “sugar” are statistically significant, as their “p” values are less than 0.05.

Therefore, the statistically significant predictor variables in this model are “total_fat”, “total_carb”, and “protein”.

4) For every additional gram of total_fat, how much do calories go up, if at all?

Based on the output provided, the coefficient estimate for the “total_fat” variable is 8.555. This means that for every additional gram of total fat, calories are estimated to increase by 8.555 units, all other variables being held constant in the model.

OPIATE OVERDOSE IN TRAVIS COUNTY

This opiate overdose dataset comes from: https://data.austintexas.gov/Health-and-Community-Services/Opiate-Overdoses-by-Age-Range-Gender-and-Drug-Type/njyb-3fuz

This dataset comes from Austin-Travis County EMS and reports the number of overdoses in Austin for the 2018 fiscal year. In this dataset we have a row_id (like a record id), age_groups in groups of five years, sex (male, female), and substance. Substance could be either 1) Heroin/Street, 2) Pharmacy, 3) Heroin/Street AND Pharmacy, or 4) Unknown Load the dataset of opiate overdoes in Austin 2018 – look in your files to see what the file name it

mydata <-read.csv("opiate_overdoes_austin_2018.csv")
attach(mydata)
head(mydata,5)

5) According to this dataset, how many people overdoes on opiates in Austin 2018?

Count

count(mydata)

From the dataset, there are 392 people who participated in this study that overdoes on opiates in August 2018. Alternatively, we can check the structure of the data using the str() functio as shown below

str(mydata)

'data.frame':   392 obs. of  4 variables:
 $ row_ID   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ age_group: chr  "30-34" "45-49" "55-59" "20-24" ...
 $ sex      : chr  "Female" "Female" "Male" "Male" ...
 $ substance: chr  "Heroin/Street AND Pharmacy" "Heroin/Street AND Pharmacy" "Heroin/Street AND Pharmacy" "Pharmacy" ...

6) What proportion of all opiate overdose cases are: Male? Female? (round to two decimal places, be sure to report proportions, not percentages or raw numbers – hint: use two R functions together to get proportions)

Get one-way table for male and female

library(sjmisc)
frq(mydata, sex)

sex <character> 
# total N=392 valid N=392 mean=1.69 sd=0.46

Value  |   N | Raw % | Valid % | Cum. %
---------------------------------------
Female | 121 | 30.87 |   30.87 |  30.87
Male   | 271 | 69.13 |   69.13 | 100.00
<NA>   |   0 |  0.00 |    <NA> |   <NA>

Get a table of proportions

mytable <- table(sex)
prop_table <- prop.table(mytable)
print(prop_table, digits = 3)

sex
Female   Male 
 0.309  0.691

From results above, the proportions of male and female who overdoes opiates in August 2018 are 0.68, and 0.32, respectively.

7) Do men and women in Austin overdose on opiates in similar proportions to what we’d expect? (hint: what proportion of overdoses would you expect to be female if the null hypothesis were true)

State the Null and Alternative hypotheses.

Null: Male and female in Austin overdose on opiates in similar proportions.
Alternative: Male and female in Austin overdose on opiates in different proportions.

Chi-square Test

# Calculate the proportion of female overdoses
proportion_female <- sum(mydata$sex == "Female") / nrow(mydata)

# Print the proportion of female overdoses
print(paste("Proportion of female overdoses:", proportion_female))

[1] "Proportion of female overdoses: 0.308673469387755"

# Expected proportion of female overdoses under null hypothesis
expected_proportion <- 0.5  # Assuming equal proportions of male and female overdoses under null hypothesis

# Perform chi-square test to assess the difference between observed and expected proportions
observed_counts <- table(mydata$sex)
expected_counts <- c(expected_proportion * sum(observed_counts), (1 - expected_proportion) * sum(observed_counts))
chi_squared <- chisq.test(observed_counts, p = c(expected_proportion, 1 - expected_proportion))

# Print the results of the chi-square test
print(chi_squared)


    Chi-squared test for given probabilities

data:  observed_counts
X-squared = 57.398, df = 1, p-value = 3.56e-14

Note!!! if the null hypothesis were true, the proportion of female who overdoes opiates in Austin 2018 would be 0.5.

8) There are four drug categories in this dataset. Report the proportion of each drug types. (round to two decimal places, be sure to report proportions not percentages)

get the one-way table for substances

mytable2<- table(substance)
kable(mytable2)

substance	Freq
Heroin/Street	229
Heroin/Street AND Pharmacy	3
Pharmacy	96
Unknown	64

get the proportions

Proportions <- kable(prop.table(mytable2))
names(Proportions) <- c("Substance", "Proportion")
Proportions

substance	Freq
Heroin/Street	0.5841837
Heroin/Street AND Pharmacy	0.0076531
Pharmacy	0.2448980
Unknown	0.1632653

9) Is the substance of choice statistically independent from sex? (i.e. Do men and women who overdose use the same substances in equal proportions?)

####State the Null and Alternative hypotheses.

Null: There is no statistically significant association between substance overdoes and sex
Alternative: There is a statistically significant association between substance overdoes and sex

Test the hypothesis

dat <- mydata[,c(3,4)] %>%
  tbl_summary(by = sex) %>%
  add_p() %>%
  add_overall() %>% 
  bold_labels()
dat

Characteristic	Overall, N = 392¹	Female, N = 121¹	Male, N = 271¹	p-value²
substance				<0.001
Heroin/Street	229 (58%)	56 (46%)	173 (64%)
Heroin/Street AND Pharmacy	3 (0.8%)	2 (1.7%)	1 (0.4%)
Pharmacy	96 (24%)	44 (36%)	52 (19%)
Unknown	64 (16%)	19 (16%)	45 (17%)
¹ n (%)
² Fisher’s exact test

Notes on Chi-square and Fisher’s Exact Test

The chi-square test and Fisher’s exact test are both statistical tests used to analyze categorical data, particularly when comparing the frequencies or proportions of different categories between groups. However, they differ in their assumptions, applicability, and computational methods:

Assumptions:

Chi-square test:

It assumes that the sample size is sufficiently large and that the expected frequency in each cell of the contingency table is not too small (typically, all expected frequencies should be greater than 5).

Fisher’s exact test:

It does not rely on any specific assumptions about the sample size or expected frequencies. It is applicable even when the sample size is small and when expected frequencies in some cells are less than 5.

Applicability:

Chi-square test

It is commonly used for analyzing categorical data with two or more groups or factors.

Fisher’s exact test

It is particularly useful when dealing with small sample sizes or when the assumptions of the chi-square test are violated, such as having expected cell frequencies less than 5.

Computational Methods:

Chi-square test

It calculates a test statistic based on the difference between observed and expected frequencies, squared and divided by the expected frequencies, which follows a chi-square distribution under the null hypothesis.

Fisher’s exact test

It calculates the exact probability of observing the given distribution of frequencies (or more extreme) under the null hypothesis by considering all possible arrangements of the data that have the same row and column totals, using combinatorial methods.

In summary, while both tests are used for analyzing categorical data, the chi-square test is more commonly used and applicable for larger sample sizes, whereas Fisher’s exact test is preferred for small sample sizes or when the assumptions of the chi-square test are violated.

From the results above, there is a statistically significant association between substance overdoes and sex as indicated by p-value <0.0001.

10) Based on the expected vs. observed data your chi-squared test, the data suggest: (see Canvas for answer choices)

chi_table <- table(sex,substance)
chi_table

        substance
sex      Heroin/Street Heroin/Street AND Pharmacy Pharmacy Unknown
  Female            56                          2       44      19
  Male             173                          1       52      45

results <-chisq.test(chi_table)
results


    Pearson's Chi-squared test

data:  chi_table
X-squared = 16.333, df = 3, p-value = 0.0009687

Print the observed frequency

results$observed

        substance
sex      Heroin/Street Heroin/Street AND Pharmacy Pharmacy Unknown
  Female            56                          2       44      19
  Male             173                          1       52      45

Print the Expected Frequencies

results$expected

        substance
sex      Heroin/Street Heroin/Street AND Pharmacy Pharmacy Unknown
  Female      70.68622                  0.9260204 29.63265 19.7551
  Male       158.31378                  2.0739796 66.36735 44.2449

# create the observed and expected frequency tables
observed <- matrix(c(56, 2, 44, 19, 173, 1, 52, 45), nrow = 2, byrow = TRUE)
rownames(observed) <- c("Female", "Male")
colnames(observed) <- c("Heroin/Street", "Heroin/Street AND Pharmacy", "Pharmacy", "Unknown")

observed

       Heroin/Street Heroin/Street AND Pharmacy Pharmacy Unknown
Female            56                          2       44      19
Male             173                          1       52      45

expected <- matrix(c(70.68622, 0.9260204, 29.63265, 19.7551, 158.31378, 2.0739796, 66.36735, 44.2449), nrow = 2, byrow = TRUE)
rownames(expected) <- c("Female", "Male")
colnames(expected) <- c("Heroin/Street", "Heroin/Street AND Pharmacy", "Pharmacy", "Unknown")

expected

       Heroin/Street Heroin/Street AND Pharmacy Pharmacy Unknown
Female      70.68622                  0.9260204 29.63265 19.7551
Male       158.31378                  2.0739796 66.36735 44.2449

# conduct a chi-square test
result <- chisq.test(observed, correct = FALSE)

# print the result
print(result)


    Pearson's Chi-squared test

data:  observed
X-squared = 16.333, df = 3, p-value = 0.0009687

11) Which age group had the highest opiate overdose cases? (see Canvas for answer choices)

agegroup <- table(age_group)
agegroup

age_group
15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 
    7    71    61    60    31    22    22    22    32    22    14    16     4 
80-84 85-89 90-94 
    3     3     2

library(knitr)

# create data frame
age <- c("15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54", "55-59", 
         "60-64", "65-69", "70-74", "75-79", "80-84", "85-89", "90-94")
overdoses <- c(7, 71, 61, 60, 31, 22, 22, 22, 32, 22, 14, 16, 4, 3, 3, 2)
proportion <- overdoses/sum(overdoses)

# create data frame
df <- data.frame(age, overdoses, proportion)

# format table with kable
kable(df, 
      col.names = c("Age", "Overdoses", "proportion"),
      row.names = FALSE,
      align = "c") %>%
  kable_styling()

Age	Overdoses	proportion
15-19	7	0.0178571
20-24	71	0.1811224
25-29	61	0.1556122
30-34	60	0.1530612
35-39	31	0.0790816
40-44	22	0.0561224
45-49	22	0.0561224
50-54	22	0.0561224
55-59	32	0.0816327
60-64	22	0.0561224
65-69	14	0.0357143
70-74	16	0.0408163
75-79	4	0.0102041
80-84	3	0.0076531
85-89	3	0.0076531
90-94	2	0.0051020

The table shows the distribution of opiate overdoses across age categories. The column labeled “Overdoses” displays the number of overdoses for each age category, and the column labeled “Total” shows the total number of cases across all age categories.

From the table, we can see that the age category with the highest number of overdoses is 20-24, with 71 cases. The 25-29 and 30-34 age categories also have a relatively high number of overdoses, with 61 and 60 cases, respectively. The lowest number of overdoses are in the 90-94 age category, with only 2 cases.

Overall, the data suggests that opiate overdoses are most common in young adults, with decreasing frequency in older age categories.

Alternatively (preferably the best)

# Assuming you have already loaded your data into a dataframe called 'data'

# Aggregate counts of occurrences of substance by age bracket
counts <- aggregate(substance ~ age_group, data = mydata, FUN = length)

# Rename the columns for clarity
colnames(counts) <- c("age_group", "count_of_substances")

# Print the result
print(counts)

   age_group count_of_substances
1      15-19                   7
2      20-24                  71
3      25-29                  61
4      30-34                  60
5      35-39                  31
6      40-44                  22
7      45-49                  22
8      50-54                  22
9      55-59                  32
10     60-64                  22
11     65-69                  14
12     70-74                  16
13     75-79                   4
14     80-84                   3
15     85-89                   3
16     90-94                   2

counts$Proportion <- counts$count_of_substances/sum(counts$count_of_substances)
counts

Seperate Substance Overdose across age bracked

# Assuming you have already loaded your data into a dataframe called 'data'

# Count occurrences of each substance within each age group
substance_counts <- table(mydata$age_group, mydata$substance)
substance_counts

       
        Heroin/Street Heroin/Street AND Pharmacy Pharmacy Unknown
  15-19             5                          0        0       2
  20-24            58                          0        8       5
  25-29            46                          0        7       8
  30-34            48                          1        5       6
  35-39            25                          0        3       3
  40-44            10                          0        7       5
  45-49            13                          1        3       5
  50-54             6                          0        8       8
  55-59             8                          1       13      10
  60-64             5                          0       13       4
  65-69             4                          0        8       2
  70-74             0                          0       11       5
  75-79             0                          0        3       1
  80-84             0                          0        3       0
  85-89             1                          0        2       0
  90-94             0                          0        2       0

CHI-SQUARE TEST AND HYPOTHESIS TESTING DONE USING RSTUDIO

LUMUMBA WANDERA VICTOR

April 03, 2024

Set up Rstudio

Load the following libraries

Load the dataset fastfood calories – look in your files to see what the file name it

1) Is there a significant difference in the average number of calories for at least one of the following restaurants: Chick Fil-A, Mcdonalds, Subway?

State the Null and Alternative hypotheses.

One_Way ANOVA

Test the assumption of ANOVA

View the plot

2) Which if any of the three restaurants has statistically significantly the lowest number of calories? (see Canvas for answer choices)

use aggregate function to find the mean value for each category

4) For every additional gram of total_fat, how much do calories go up, if at all?

OPIATE OVERDOSE IN TRAVIS COUNTY

5) According to this dataset, how many people overdoes on opiates in Austin 2018?

Count

6) What proportion of all opiate overdose cases are: Male? Female? (round to two decimal places, be sure to report proportions, not percentages or raw numbers – hint: use two R functions together to get proportions)

Get one-way table for male and female

Get a table of proportions

7) Do men and women in Austin overdose on opiates in similar proportions to what we’d expect? (hint: what proportion of overdoses would you expect to be female if the null hypothesis were true)

State the Null and Alternative hypotheses.

Chi-square Test

8) There are four drug categories in this dataset. Report the proportion of each drug types. (round to two decimal places, be sure to report proportions not percentages)

get the one-way table for substances

get the proportions

9) Is the substance of choice statistically independent from sex? (i.e. Do men and women who overdose use the same substances in equal proportions?)

Test the hypothesis

Notes on Chi-square and Fisher’s Exact Test

Assumptions:

Chi-square test:

Fisher’s exact test:

Applicability:

Chi-square test

Fisher’s exact test

Computational Methods:

Chi-square test

Fisher’s exact test

10) Based on the expected vs. observed data your chi-squared test, the data suggest: (see Canvas for answer choices)

Print the observed frequency

Print the Expected Frequencies

11) Which age group had the highest opiate overdose cases? (see Canvas for answer choices)

Alternatively (preferably the best)

Seperate Substance Overdose across age bracked