data <- read.csv("C:\\Users\\SHREYA\\OneDrive\\Documents\\Gitstuff\\modified_dataset.csv")
Rating is likely a key measure of interest in a dataset about cocoa products, so I treat rating as the response variable here.
The “company_location” column, which records where the company making the cocoa product is based, is a categorical field that may influence the “rating” response variable. Location could affect rating through the production process, the quality of the cocoa beans used, and other regional factors.
H0: The mean rating is the same across all groups of the chosen categorical column (the explanatory variable, company_location).
H1: At least one company location has a mean rating that differs from the others.
model <- aov(rating ~ company_location, data = data)
summary(model)
##                    Df Sum Sq Mean Sq F value   Pr(>F)
## company_location   66   21.9  0.3314   1.702 0.000418 ***
## Residuals        2463  479.7  0.1947
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA findings:
- The p-value (Pr(>F)) of 0.000418 is lower than the usual 0.05
significance level. This indicates that the mean rating differs between
at least some company locations.
- The F-value of 1.702 means the variation between the group means is
about 1.7 times the variation within groups, which is more than would be
expected by chance with this many observations.
As a result, we reject the null hypothesis and conclude that there is a
statistically significant difference in the mean rating between company
locations. Significance alone does not tell us that the differences are
large.
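To see where the F-value comes from and how much of the rating variation company location accounts for, the ANOVA table can be recomputed by hand. This is a minimal sketch that uses only the rounded numbers printed in the table above, so the results will differ slightly from the reported values; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the printed ANOVA table (rounded, so results are approximate)
ss_between <- 21.9    # Sum Sq for company_location
ss_within  <- 479.7   # Sum Sq for Residuals
df_between <- 66
df_within  <- 2463

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of between-group to within-group mean squares
f_value <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Eta squared: share of total variation attributed to company_location
eta_sq <- ss_between / (ss_between + ss_within)

print(c(F = f_value, p = p_value, eta_sq = eta_sq))
```

The recomputed F should land close to the reported 1.702, and eta squared comes out at roughly 0.04, which suggests that company location explains only a small share of the variation in rating even though the test is significant.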
Next, we consider the cocoa_percent column, which might influence the response variable rating.
Checking the correlation between cocoa_percent and rating:
correlation <- cor(data$cocoa_percent, data$rating)
correlation
## [1] -0.1466896
The correlation coefficient of roughly -0.1467 indicates a weak negative linear association between “rating” and “cocoa_percent”. Despite the weak association, we can still fit a linear regression model to investigate the relationship between “rating” and “cocoa_percent”.
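Whether a correlation this weak is distinguishable from zero can be checked with the usual t-test for a Pearson correlation. The sketch below works from the printed coefficient rather than the raw data, and it assumes n = 2530 observations, inferred from the ANOVA degrees of freedom (66 + 2463 + 1) and valid only if no rows were dropped for missing values.

```r
# Printed correlation between cocoa_percent and rating
r <- -0.1466896

# Assumption: sample size inferred from the ANOVA table (66 + 2463 + 1)
n <- 2530

# t statistic for testing H0: true correlation = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# Squared correlation: share of rating variance linearly related to cocoa_percent
r_squared <- r^2

print(c(t = t_stat, p = p_value, r_squared = r_squared))
```

With the raw data available, `cor.test(data$cocoa_percent, data$rating)` performs the same test directly. The squared correlation of about 0.0215 shows that only around 2% of the variation in rating is linearly associated with cocoa_percent.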
Linear regression
# Build the linear regression model
model <- lm(rating ~ cocoa_percent, data = data)
summary(model)
##
## Call:
## lm(formula = rating ~ cocoa_percent, data = data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.21541 -0.23867  0.03459  0.28459  0.99393
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)     4.0295     0.1121  35.949  < 2e-16 ***
## cocoa_percent  -1.1630     0.1560  -7.456 1.22e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4406 on 2528 degrees of freedom
## Multiple R-squared: 0.02152, Adjusted R-squared: 0.02113
## F-statistic: 55.59 on 1 and 2528 DF, p-value: 1.218e-13
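- The intercept, 4.0295, is the predicted rating when cocoa_percent is 0. This value has little practical meaning, because a product with no cocoa lies far outside the range of the data, so the prediction is an extrapolation.
- The coefficient of cocoa_percent is -1.1630, meaning the predicted rating falls by about 1.1630 for every one-unit increase in cocoa_percent. The size of the estimates suggests cocoa_percent is stored as a proportion between 0 and 1, in which case a one-unit increase spans the whole range from 0% to 100% cocoa, and a more useful reading is a drop of about 0.12 rating points per 10 percentage points of cocoa.
- Both the intercept and the cocoa_percent coefficient are statistically significant (p < 0.001), so it is unlikely that their true values are zero.
- The multiple R-squared of 0.02152 shows that cocoa_percent explains only about 2% of the variation in rating, so the relationship is reliable but weak.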
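To make the coefficients concrete, the fitted line can be evaluated at a few cocoa levels. This is a rough sketch using the estimates printed in the regression summary; it assumes cocoa_percent is stored as a proportion (for example 0.70 for 70%), which the magnitude of the estimates suggests but the output does not confirm. The function name `predict_rating` is illustrative only.

```r
# Coefficients copied from the regression summary above
intercept <- 4.0295
slope     <- -1.1630

# Predicted rating for a given cocoa proportion
# (assumption: cocoa_percent is on a 0 to 1 scale)
predict_rating <- function(cocoa_prop) {
  intercept + slope * cocoa_prop
}

# Predictions at 60%, 70% and 80% cocoa
cocoa_levels <- c(0.60, 0.70, 0.80)
predictions  <- predict_rating(cocoa_levels)
print(data.frame(cocoa_prop = cocoa_levels, predicted_rating = predictions))

# Expected change in rating for a 10 percentage point increase in cocoa
change_per_10_points <- slope * 0.10
print(change_per_10_points)
```

Under that assumption the predicted rating moves from about 3.33 at 60% cocoa to about 3.10 at 80% cocoa, a small shift compared with the residual standard error of 0.4406. With the fitted model object available, `predict(model, newdata = data.frame(cocoa_percent = c(0.60, 0.70, 0.80)))` gives the same predictions without copying coefficients by hand.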