data <- read.csv("C:\\Users\\SHREYA\\OneDrive\\Documents\\Gitstuff\\modified_dataset.csv")

Response variable

The rating is likely to be a key measure of interest in a dataset about cocoa products, so I am using rating as the response variable here.

Categorical Column Of Data (explanatory variable)

The “company_location” column records where the company making the cocoa product is based. It is a categorical field that may affect the “rating” response variable, because a company’s location can influence the production process, the quality of the cocoa beans used, and other regional factors.

Null hypothesis for an ANOVA test

H0: The mean rating is the same across all groups of the chosen categorical column (company_location).

ANOVA test

model <- aov(rating ~ company_location, data = data)


summary(model)
##                    Df Sum Sq Mean Sq F value   Pr(>F)    
## company_location   66   21.9  0.3314   1.702 0.000418 ***
## Residuals        2463  479.7  0.1947                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA findings:
- The p-value (Pr(>F)) of 0.000418 is below the usual significance level of 0.05. This indicates that the mean rating differs between at least some company locations.
- The F-value of 1.702 is the ratio of the between-group mean square to the residual mean square. With 66 and 2463 degrees of freedom, a ratio this large would be unlikely if all group means were equal.
As a result, we reject the null hypothesis and conclude that there is a statistically significant difference in mean rating between company locations. The test does not show which locations differ or how large the differences are.
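
The F-value and p-value in the table can be reproduced by hand from the reported mean squares and degrees of freedom. The sketch below is only an arithmetic check: it uses the rounded figures printed above, so the results should match the table approximately rather than exactly (F should come out close to 1.702). The variable names are illustrative and not part of the original analysis.

```r
# Values copied from the summary(model) output above (rounded as printed)
ms_location <- 0.3314   # Mean Sq for company_location
ms_residual <- 0.1947   # Mean Sq for Residuals
df_location <- 66       # Df for company_location
df_residual <- 2463     # Df for Residuals

# F is the ratio of the between-group to the within-group mean square
f_value <- ms_location / ms_residual

# The upper-tail probability of the F distribution gives Pr(>F)
p_value <- pf(f_value, df_location, df_residual, lower.tail = FALSE)

round(f_value, 3)
p_value
```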

Finding a single continuous (or ordered integer, non-binary) column

Here we consider the cocoa_percent column, which might influence the response variable rating.

Checking the correlation between cocoa_percent and rating

correlation <- cor(data$cocoa_percent, data$rating)
correlation
## [1] -0.1466896

The correlation coefficient of roughly -0.1467 indicates a weak negative linear association between “rating” and “cocoa_percent”. Despite the weak association, we can still fit a linear regression model to quantify the relationship between the two.
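
A weak correlation can still be statistically distinguishable from zero when the sample is large. The sketch below works this out from the reported coefficient alone. It assumes that no rows were dropped for missing values, so that the sample size is the 2530 observations implied by the ANOVA degrees of freedom (66 + 2463 + 1); the variable names are illustrative.

```r
r <- -0.1466896        # correlation reported above
n <- 66 + 2463 + 1     # assumed sample size, from the ANOVA degrees of freedom

# Share of the variance in rating linearly associated with cocoa_percent
r_squared <- r^2

# t statistic and two-sided p-value for H0: the true correlation is zero
t_value <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_value), df = n - 2)

r_squared
t_value
p_value
```

With the data frame loaded, `cor.test(data$cocoa_percent, data$rating)` would run the same test directly.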

Linear regression

# Build the linear regression model
model <- lm(rating ~ cocoa_percent, data = data)

summary(model)
## 
## Call:
## lm(formula = rating ~ cocoa_percent, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.21541 -0.23867  0.03459  0.28459  0.99393 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.0295     0.1121  35.949  < 2e-16 ***
## cocoa_percent  -1.1630     0.1560  -7.456 1.22e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4406 on 2528 degrees of freedom
## Multiple R-squared:  0.02152,    Adjusted R-squared:  0.02113 
## F-statistic: 55.59 on 1 and 2528 DF,  p-value: 1.218e-13

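- The intercept, 4.0295, is the predicted rating when cocoa_percent is 0. This value has no useful interpretation here, because a product with 0% cocoa lies far outside the range of the data.
- The coefficient of cocoa_percent is -1.1630, so the predicted rating falls by about 1.163 for every one-unit increase in cocoa_percent. The size of the intercept and slope suggests that cocoa_percent is stored as a proportion (for example, 0.70 for 70%). If so, a one-unit increase spans the whole range from 0% to 100% cocoa, and a 10-percentage-point increase corresponds to a drop of about 0.12 in rating.
- Both the intercept and the cocoa_percent coefficient are statistically significant (p < 0.001), so there is strong evidence that neither is zero.
- The multiple R-squared of 0.02152 shows that cocoa_percent explains only about 2% of the variation in rating, which is consistent with the weak correlation found earlier.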
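
To make the coefficients more concrete, the sketch below plugs the reported estimates into the fitted line and builds an approximate 95% confidence interval for the slope. It assumes that cocoa_percent is stored as a proportion between 0 and 1, which the output does not confirm; `predict_rating` is a hypothetical helper written for this illustration.

```r
# Estimates copied from the summary(model) output above
intercept <- 4.0295
slope     <- -1.1630
slope_se  <- 0.1560
df_resid  <- 2528

# Hypothetical helper: predicted rating for a given cocoa proportion
# (assumes cocoa_percent is on a 0-1 scale)
predict_rating <- function(cocoa_prop) intercept + slope * cocoa_prop

pred_70 <- predict_rating(0.70)   # predicted rating at 70% cocoa
pred_80 <- predict_rating(0.80)   # predicted rating at 80% cocoa
pred_80 - pred_70                 # change for a 10-percentage-point increase

# Approximate 95% confidence interval for the slope
slope_ci <- slope + c(-1, 1) * qt(0.975, df_resid) * slope_se
slope_ci
```

With the fitted model object available, `predict(model, newdata = data.frame(cocoa_percent = c(0.70, 0.80)))` and `confint(model)` should give the same quantities without copying numbers by hand.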