Loading Necessary Packages and Data

# Load necessary packages
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library(broom)
library(forcats)
data <- read.csv("C:/Users/aiden/OneDrive/mergedfile.csv")

Selecting Response and Explanatory Variables

# Select your response variable and explanatory variable
response_var <- data$Adj.Close
explanatory_var <- data$Sector

In this example, Adj.Close (the adjusted closing price of stocks) is selected as the continuous response variable. The adjusted close is the closing price corrected for corporate actions such as dividends and stock splits, which makes prices comparable over time.

The categorical variable Sector is chosen as the explanatory variable. This variable represents different industries, and we expect it to influence stock prices (the response variable). Different sectors may have varying effects on stock prices, making this a reasonable choice.

Performing the ANOVA Test
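Before fitting, it may help to check how many levels Sector has and how many rows contain missing values, because aov() and lm() drop incomplete rows. The sketch below is only an illustration: `toy` is a made-up data frame that reuses the column names Adj.Close and Sector, and its values are invented, not taken from mergedfile.csv.

```r
# Hypothetical toy data standing in for the real file; all values are invented.
toy <- data.frame(
  Adj.Close = c(10.5, 12.1, NA, 55.0, 60.2, 58.7, NA, 101.3),
  Sector    = c("Energy", "Energy", "Energy", "Utilities",
                "Utilities", "Utilities", "Technology", "Technology"),
  stringsAsFactors = TRUE
)

# Number of distinct sectors (levels of the factor)
n_sectors <- nlevels(toy$Sector)

# Rows per sector, counting missing sector labels if there are any
sector_counts <- table(toy$Sector, useNA = "ifany")

# Missing values per column
na_per_column <- colSums(is.na(toy))

# Rows that would actually enter a model of Adj.Close ~ Sector
n_complete <- sum(complete.cases(toy[, c("Adj.Close", "Sector")]))

print(n_sectors)
print(sector_counts)
print(na_per_column)
print(n_complete)
```

Run against the real data, the same calls would show where the rows reported as "deleted due to missingness" come from.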

# Fit an ANOVA model
anova_model <- aov(response_var ~ explanatory_var, data = data)

# Print ANOVA summary
summary(anova_model)
##                     Df    Sum Sq   Mean Sq F value Pr(>F)    
## explanatory_var     10 1.222e+09 122160896    2406 <2e-16 ***
## Residuals       339524 1.724e+10     50766                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 12562 observations deleted due to missingness
# If there are more than 10 categories in 'Sector', you can consolidate them manually or using a threshold
data$Sector <- fct_lump_n(data$Sector, n = 10)

# Refresh the extracted vector so the model sees the consolidated categories
explanatory_var <- data$Sector

# Re-run the ANOVA with the consolidated categories if necessary
anova_model_consolidated <- aov(response_var ~ explanatory_var, data = data)
summary(anova_model_consolidated)
##                     Df    Sum Sq   Mean Sq F value Pr(>F)    
## explanatory_var     10 1.222e+09 122160896    2406 <2e-16 ***
## Residuals       339524 1.724e+10     50766                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 12562 observations deleted due to missingness
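
A note on the extracted vectors: a column pulled out with `$` is a copy, so a later change to data$Sector does not reach explanatory_var unless the vector is reassigned. The following is a minimal sketch of that behaviour on an invented data frame (`toy` and `explanatory_copy` are hypothetical names), with a base-R level rename standing in for fct_lump_n.

```r
# Hypothetical toy data; the column name mirrors the document, the values are invented.
toy <- data.frame(
  Sector = factor(c("Energy", "Energy", "Utilities", "Technology"))
)

# Copy the column out of the data frame, as was done for explanatory_var
explanatory_copy <- toy$Sector

# Modify the column inside the data frame (base-R stand-in for fct_lump_n)
levels(toy$Sector)[levels(toy$Sector) == "Technology"] <- "Other"

print(levels(toy$Sector))        # the data frame column now contains "Other"
print(levels(explanatory_copy))  # the extracted copy still contains "Technology"
```

Naming the columns directly in the formula, for example aov(Adj.Close ~ Sector, data = data), is one way to avoid a stale copy, since the model then reads whatever is currently in the data frame.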

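The ANOVA model tests whether the mean adjusted closing price differs across sectors. The null hypothesis is that all sector means are equal; if the p-value in the ANOVA table is less than 0.05, you reject it. Here F = 2406 with p < 2e-16, so at least one sector's mean price differs from the others. With roughly 340,000 observations even small differences reach significance, so this result says the means differ, not that the differences are large.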
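
A significant F test does not say how large the sector effect is or which sectors differ. Two common follow-ups are eta squared (the between-group sum of squares divided by the total) and Tukey's HSD for pairwise comparisons. With the sums of squares in the table above, eta squared works out to about 1.222e+09 / (1.222e+09 + 1.724e+10), or roughly 0.066. The sketch below shows both steps on simulated data; the three sectors "A", "B", "C" and their means are assumptions made for illustration only.

```r
set.seed(1)

# Simulated data: three hypothetical sectors with different mean prices
toy <- data.frame(
  Sector = factor(rep(c("A", "B", "C"), each = 30)),
  Adj.Close = c(rnorm(30, mean = 50, sd = 10),
                rnorm(30, mean = 60, sd = 10),
                rnorm(30, mean = 90, sd = 10))
)

fit <- aov(Adj.Close ~ Sector, data = toy)

# Eta squared = SS_between / (SS_between + SS_residual)
ss <- summary(fit)[[1]][["Sum Sq"]]
eta_squared <- ss[1] / sum(ss)

# Tukey's HSD: pairwise differences in sector means with adjusted p-values
tukey <- TukeyHSD(fit)

print(eta_squared)
print(tukey$Sector)
```

Each row of the Tukey table is one pair of sectors, with the estimated difference in means, a confidence interval, and an adjusted p-value.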

Visualizing the ANOVA Results

# Visualize the response variable across sectors using boxplots
ggplot(data, aes(x = Sector, y = Adj.Close)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Distribution of Adjusted Closing Prices Across Sectors",
       x = "Sector", y = "Adjusted Closing Price")
## Warning: Removed 12562 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Building a Simple Linear Regression Model

# Select another continuous variable for linear regression, e.g., Revenue Growth
linear_model <- lm(Adj.Close ~ Revenuegrowth, data = data)

# Print model summary
summary(linear_model)
## 
## Call:
## lm(formula = Adj.Close ~ Revenuegrowth, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -111.6  -78.1  -53.0   -2.9 4012.7 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   104.7525     0.4273 245.172  < 2e-16 ***
## Revenuegrowth  22.8919     3.5845   6.386  1.7e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 233.1 on 339533 degrees of freedom
##   (12562 observations deleted due to missingness)
## Multiple R-squared:  0.0001201,  Adjusted R-squared:  0.0001172 
## F-statistic: 40.79 on 1 and 339533 DF,  p-value: 1.701e-10

Interpreting Coefficients and Model Fit

# Print regression coefficients and model fit statistics
tidy(linear_model)  # Coefficients
## # A tibble: 2 × 5
##   term          estimate std.error statistic  p.value
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)      105.      0.427    245.   0       
## 2 Revenuegrowth     22.9     3.58       6.39 1.70e-10
glance(linear_model)  # Model summary (R-squared, etc.)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df    logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>     <dbl>  <dbl>  <dbl>
## 1  0.000120      0.000117  233.      40.8 1.70e-10     1 -2332807. 4.67e6 4.67e6
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# Visualize the regression line
ggplot(data, aes(x = Revenuegrowth, y = Adj.Close)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(title = "Linear Relationship Between Revenue Growth and Adjusted Close",
       x = "Revenue Growth", y = "Adjusted Closing Price")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 12562 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 12562 rows containing missing values or values outside the scale range
## (`geom_point()`).

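The regression output provides the coefficients, indicating how much the stock price changes for each unit increase in Revenuegrowth. For example, the estimated slope of 22.89 means the predicted adjusted closing price rises by about 22.89 for each one-unit increase in Revenuegrowth, and the intercept of 104.75 is the predicted price when Revenuegrowth is zero. The slope is statistically significant (p = 1.7e-10), but the R-squared of 0.00012 shows that revenue growth explains almost none of the variation in price.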
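
To make the coefficients concrete, the fitted line can be evaluated by hand. The sketch below copies the intercept and slope from the summary(linear_model) output; the Revenuegrowth values are hypothetical, and the source does not state the variable's units, so the inputs are illustrative only.

```r
# Coefficients copied from the summary(linear_model) output
intercept <- 104.7525
slope     <- 22.8919

# Hypothetical Revenuegrowth values chosen for illustration
revenue_growth <- c(0, 0.10, 1)

# Fitted line: predicted Adj.Close = intercept + slope * Revenuegrowth
predicted_close <- intercept + slope * revenue_growth

print(data.frame(revenue_growth, predicted_close))
```

With the fitted model object available, predict(linear_model, newdata = data.frame(Revenuegrowth = revenue_growth)) should give the same values up to rounding of the printed coefficients.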

Summary

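ANOVA Results: Mean adjusted closing price differs significantly across sectors (F = 2406 on 10 and 339,524 degrees of freedom, p < 2e-16). Sector accounts for about 6.6% of the total sum of squares (1.222e+09 of roughly 1.846e+10).

Regression Analysis: Revenuegrowth has a positive, statistically significant slope (22.89, standard error 3.58, p = 1.7e-10), but the model explains almost none of the variation in Adj.Close (R-squared = 0.00012, residual standard error 233.1).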

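Both analyses describe how the response variable (stock price) relates to the explanatory variables (sector, revenue growth). The p-values are very small largely because the sample is very large, so the effect sizes matter more when using these results for decision-making with financial data: sector explains a modest share of the variation in price, and revenue growth explains very little.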