Assignment 3- ITEC Research

About Dataset

This dataset features metrics on red meat. Specifically, it features data on its per capita availability which has been adjusted for loss due to waste. The dataset comes from the Economic Research Service or ERC of the United States. (LINK: https://www.ers.usda.gov/data-products/food-availability-per-capita-data-system) Specifically, the data featured is from a cleaned dataset about red meat with only metrics pertaining to beef. But the metric that’d be focused on is the calories available daily number from 1970 to 2019.

library("ggplot2")
library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library("tidyr")

Question 1

data <-read.csv("redmeat.csv")

ggplot(data,aes(x=Year, y = Value)) +
geom_bar(stat= "identity", fill = "steelblue") +
  labs(title="Caloric Availability for Expenditure  of Beef from 1970-2019 in the United States", X = "Year", y= "Calories available daily- Number") +theme_minimal()

This graph shows the calories available for people’s consumption from 1970 to 2019. It shows the trend of people eating beef peaking between 1975 to 1980 and then gradually decreasing to present day.

Question 2

#plot mean
data_summary <- data%>%
  group_by(Year) %>%
 summarize(mean_calories = mean(Value, na.rm = TRUE))
#Plot mean per year

ggplot(data_summary, aes(x = Year, y = mean_calories)) +
  geom_line(color = "blue", size =1)

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

  geom_point(color = "red", size = 2)

## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity

  labs(title = "Mean Calories per Year" ,
       x = "Year",
       y = "Mean Calories of Red Meat Available (Beef)") +
    theme_minimal()

## NULL

The results show that on average, people are eating less and less red meat per year due to the per capita availability of beef decreasing.

Question 3

#Scatterplot and Regression Analysis

#Scatterplot (Part 1)
ggplot(data_summary, aes(x = Year, y = mean_calories)) +
  geom_point(color = "blue", size = 2) +
  geom_smooth(method = "lm", se = TRUE, color = "red") + #linear regression line
  labs(title = "Regression Analysis: Caloric Availability by Year",
       x = "Year",
       y = "Mean Average Caloric Availability of Red Meat (beef)" ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

#Regression Analysis Model (part 2)
model <- lm(mean_calories ~ Year, data = data_summary)
summary(model)

## 
## Call:
## lm(formula = mean_calories ~ Year, data = data_summary)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1988 -1.1493 -0.1794  1.0034  5.9368 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 706.8842    36.7086   19.26   <2e-16 ***
## Year         -0.3300     0.0184  -17.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.878 on 48 degrees of freedom
## Multiple R-squared:  0.8701, Adjusted R-squared:  0.8674 
## F-statistic: 321.5 on 1 and 48 DF,  p-value: < 2.2e-16

There is a strong negative correlation of 0.8701 indicating caloric availability of red meat and beef having a downward trend year by year to the present day.

Question 4

data <- read.csv("redmeat.csv")

# Convert Value column to numeric
data$Value <- as.numeric(data$Value)

# Remove missing values
dataset <- subset(data, !is.na(Value))

# Plot histogram
hist(dataset$Value,
     main = "Distribution of Calories Available Daily",
     xlab = "Calories Available Daily",
     ylab = "Frequency",
     col = "lightblue",
     border = "black",
     breaks = 10)

Question 5

#Change values to numeric
data$Value <- as.numeric(data$Value)

dataset <- subset(data, !is.na(Value) & !is.na(Year))

#Split into two samples

sample1 <- subset(dataset, Year >=1970 & Year <=1995)
sample2 <- subset(dataset, Year<=1995 & Year <=2019)

#Run two-sample t-test (unequal variance)
t_test_result <- wilcox.test(sample1$Value, sample2$Value, var.equal = FALSE)

# Show result
print(t_test_result)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  sample1$Value and sample2$Value
## W = 86528, p-value = 1
## alternative hypothesis: true location shift is not equal to 0

With a p-value of 1, this means the null hypothesis can be accepted due to it being more than 0.05 and that the observed correlation is due to random chance.