Introduction

This analysis uses the mergedfile.csv dataset to build and evaluate a linear model. The goals are to:

  • Build a linear model using relevant variables.
  • Diagnose the model to identify potential issues.
  • Interpret significant coefficients.
  • Summarize insights and propose further questions.

Data Inspection

# Load necessary libraries
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
# Load the dataset
data <- read.csv("C:/Users/aiden/OneDrive/mergedfile.csv")

# Inspect the structure and summary statistics of the dataset
str(data)
## 'data.frame':    352097 obs. of  23 variables:
##  $ Date               : chr  "2010-01-04" "2010-01-05" "2010-01-06" "2010-01-07" ...
##  $ Symbol             : chr  "MMM" "MMM" "MMM" "MMM" ...
##  $ Adj.Close          : num  44.3 44 44.6 44.6 44.9 ...
##  $ Close              : num  69.4 69 70 70 70.5 ...
##  $ High               : num  69.8 69.6 70.7 70 70.5 ...
##  $ Low                : num  69.1 68.3 69.8 68.7 69.6 ...
##  $ Open               : num  69.5 69.2 70.1 69.7 70 ...
##  $ Volume             : num  3640265 3405012 6301126 5346240 4073337 ...
##  $ Exchange           : chr  "NYQ" "NYQ" "NYQ" "NYQ" ...
##  $ Shortname          : chr  "3M Company" "3M Company" "3M Company" "3M Company" ...
##  $ Longname           : chr  "3M Company" "3M Company" "3M Company" "3M Company" ...
##  $ Sector             : chr  "Industrials" "Industrials" "Industrials" "Industrials" ...
##  $ Industry           : chr  "Conglomerates" "Conglomerates" "Conglomerates" "Conglomerates" ...
##  $ Currentprice       : num  131 131 131 131 131 ...
##  $ Marketcap          : num  7.17e+10 7.17e+10 7.17e+10 7.17e+10 7.17e+10 ...
##  $ Ebitda             : num  7.35e+09 7.35e+09 7.35e+09 7.35e+09 7.35e+09 ...
##  $ Revenuegrowth      : num  -0.004 -0.004 -0.004 -0.004 -0.004 -0.004 -0.004 -0.004 -0.004 -0.004 ...
##  $ City               : chr  "Saint Paul" "Saint Paul" "Saint Paul" "Saint Paul" ...
##  $ State              : chr  "MN" "MN" "MN" "MN" ...
##  $ Country            : chr  "United States" "United States" "United States" "United States" ...
##  $ Fulltimeemployees  : num  85000 85000 85000 85000 85000 85000 85000 85000 85000 85000 ...
##  $ Longbusinesssummary: chr  "3M Company provides diversified technology services in the United States and internationally. The company's Saf"| __truncated__ "3M Company provides diversified technology services in the United States and internationally. The company's Saf"| __truncated__ "3M Company provides diversified technology services in the United States and internationally. The company's Saf"| __truncated__ "3M Company provides diversified technology services in the United States and internationally. The company's Saf"| __truncated__ ...
##  $ Weight             : num  0.00137 0.00137 0.00137 0.00137 0.00137 ...
summary(data)
##      Date              Symbol            Adj.Close           Close        
##  Length:352097      Length:352097      Min.   :   1.03   Min.   :   1.03  
##  Class :character   Class :character   1st Qu.:  28.07   1st Qu.:  33.74  
##  Mode  :character   Mode  :character   Median :  52.26   Median :  60.79  
##                                        Mean   : 105.71   Mean   : 113.01  
##                                        3rd Qu.: 102.52   3rd Qu.: 113.34  
##                                        Max.   :4119.09   Max.   :4119.09  
##                                        NA's   :12562     NA's   :12562    
##       High              Low               Open             Volume         
##  Min.   :   1.26   Min.   :   1.01   Min.   :   1.03   Min.   :0.000e+00  
##  1st Qu.:  34.10   1st Qu.:  33.34   1st Qu.:  33.72   1st Qu.:9.424e+05  
##  Median :  61.41   Median :  60.14   Median :  60.75   Median :2.131e+06  
##  Mean   : 114.22   Mean   : 111.75   Mean   : 113.00   Mean   :9.998e+06  
##  3rd Qu.: 114.50   3rd Qu.: 112.10   3rd Qu.: 113.31   3rd Qu.:4.965e+06  
##  Max.   :4144.32   Max.   :4110.64   Max.   :4117.00   Max.   :1.881e+09  
##  NA's   :12562     NA's   :12562     NA's   :12562     NA's   :12562      
##    Exchange          Shortname           Longname            Sector         
##  Length:352097      Length:352097      Length:352097      Length:352097     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    Industry          Currentprice       Marketcap             Ebitda          
##  Length:352097      Min.   :  10.39   Min.   :6.823e+09   Min.   :-8.410e+08  
##  Class :character   1st Qu.:  72.29   1st Qu.:1.748e+10   1st Qu.: 1.568e+09  
##  Mode  :character   Median : 130.55   Median :4.050e+10   Median : 2.977e+09  
##                     Mean   : 232.74   Mean   :1.769e+11   Mean   : 1.078e+10  
##                     3rd Qu.: 227.00   3rd Qu.:1.099e+11   3rd Qu.: 6.499e+09  
##                     Max.   :3830.58   Max.   :3.449e+12   Max.   : 1.318e+11  
##                                                           NA's   :22110       
##  Revenuegrowth          City              State             Country         
##  Min.   :-0.39700   Length:352097      Length:352097      Length:352097     
##  1st Qu.:-0.01600   Class :character   Class :character   Class :character  
##  Median : 0.04700   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 0.04264                                                           
##  3rd Qu.: 0.09600                                                           
##  Max.   : 0.46300                                                           
##                                                                             
##  Fulltimeemployees Longbusinesssummary     Weight         
##  Min.   :    568   Length:352097       Min.   :0.0001304  
##  1st Qu.:   9372   Class :character    1st Qu.:0.0003340  
##  Median :  24150   Mode  :character    Median :0.0007741  
##  Mean   :  68873                       Mean   :0.0033815  
##  3rd Qu.:  57000                       3rd Qu.:0.0021012  
##  Max.   :1525000                       Max.   :0.0659147  
## 
# Check for missing values
sum(is.na(data))
## [1] 97482
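
The single total hides where the gaps are. According to the summary output, six columns (Adj.Close, Close, High, Low, Open, Volume) each have 12,562 missing values and Ebitda has 22,110, which together account for all 97,482. A per-column count makes this visible directly. The sketch below uses a small made-up data frame (`toy` is hypothetical and stands in for `data`):

```r
# Toy data frame standing in for `data` (values are made up for illustration)
toy <- data.frame(
  Adj.Close = c(44.3, NA, 44.6, 44.6),
  Volume    = c(3640265, NA, 6301126, 5346240),
  Ebitda    = c(NA, NA, 7.35e9, 7.35e9),
  Sector    = c("Industrials", "Industrials", "Energy", "Energy")
)

# Missing values per column, rather than one overall total
na_by_column <- colSums(is.na(toy))
print(na_by_column)

# Rows that lm() would keep for a model on Adj.Close, Volume and Sector
# (lm() drops any row with a missing value in a model variable by default)
complete_rows <- sum(complete.cases(toy[, c("Adj.Close", "Volume", "Sector")]))
print(complete_rows)
```

Applied to the real data, `colSums(is.na(data))` should reproduce the NA's rows of the summary table.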

Linear Model

# Build the linear model
model <- lm(Adj.Close ~ Volume + High + Sector, data = data)

# Summary of the model
summary(model)
## 
## Call:
## lm(formula = Adj.Close ~ Volume + High + Sector, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -149.936   -3.015    1.834    4.870   63.040 
## 
## Coefficients:
##                                Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)                  -1.283e+01  1.051e-01  -122.068   <2e-16 ***
## Volume                        9.619e-09  4.108e-10    23.415   <2e-16 ***
## High                          9.847e-01  6.763e-05 14560.407   <2e-16 ***
## SectorCommunication Services  9.359e+00  1.367e-01    68.454   <2e-16 ***
## SectorConsumer Cyclical       1.016e+01  1.130e-01    89.916   <2e-16 ***
## SectorConsumer Defensive      3.650e+00  1.240e-01    29.439   <2e-16 ***
## SectorEnergy                  1.302e+00  1.482e-01     8.786   <2e-16 ***
## SectorFinancial Services      4.381e+00  1.107e-01    39.564   <2e-16 ***
## SectorHealthcare              6.894e+00  1.122e-01    61.443   <2e-16 ***
## SectorIndustrials             5.888e+00  1.147e-01    51.336   <2e-16 ***
## SectorReal Estate            -5.274e+00  1.209e-01   -43.628   <2e-16 ***
## SectorTechnology              9.809e+00  1.118e-01    87.714   <2e-16 ***
## SectorUtilities               5.198e+00  1.210e-01    42.960   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.987 on 339522 degrees of freedom
##   (12562 observations deleted due to missingness)
## Multiple R-squared:  0.9985, Adjusted R-squared:  0.9985 
## F-statistic: 1.902e+07 on 12 and 339522 DF,  p-value: < 2.2e-16

Explanation of Outputs

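  • Intercept and Coefficients: The intercept (-12.83) is the expected Adj.Close when Volume and High are both zero and the stock belongs to the reference sector (the one Sector level that does not appear in the coefficient table). No stock has a High of zero, so the intercept has no practical interpretation here. The High coefficient (0.985) means that a one-unit increase in the daily high is associated with a 0.985 increase in Adj.Close, holding Volume and Sector constant. Each Sector coefficient is the difference in expected Adj.Close relative to the reference sector.
  • P-values: Indicate the significance of each coefficient. Every term has p < 2e-16, but with about 340,000 observations even very small effects are statistically significant, so the p-values say little about practical importance.
  • R-squared: Represents the proportion of variance in Adj.Close explained by the model, here 0.9985. Most of this is likely due to High, a same-day price for the same stock (t value above 14,000), so the figure reflects that close relationship more than predictive power.
  • Adjusted R-squared: Adjusts for the number of predictors. It matches R-squared (0.9985) because 12 predictors are negligible next to the number of observations.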
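
The Sector coefficients are contrasts against a reference level. By default R sorts factor levels alphabetically and uses the first one as the baseline. The following is a minimal sketch with made-up prices (the `toy` data frame and its values are hypothetical) showing how the baseline is chosen and how it could be changed with relevel():

```r
# Made-up prices for three sectors (illustrative values only)
toy <- data.frame(
  price  = c(10, 12, 20, 22, 30, 32),
  sector = c("Energy", "Energy", "Technology", "Technology",
             "Utilities", "Utilities")
)

# Factor levels are sorted alphabetically, so the first level ("Energy")
# becomes the reference and gets no coefficient of its own
toy$sector <- factor(toy$sector)
fit_default <- lm(price ~ sector, data = toy)
print(coef(fit_default))

# Choose a different baseline explicitly with relevel()
toy$sector2 <- relevel(toy$sector, ref = "Technology")
fit_releveled <- lm(price ~ sector2, data = toy)
print(coef(fit_releveled))
```

In the first fit the intercept is the mean price of the reference sector and the other coefficients are differences from it; after releveling, the same differences are expressed relative to Technology.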

Model Diagnostics

Diagnostic Plots

# Generate diagnostic plots
par(mfrow = c(2, 2))
plot(model)

Multicollinearity Check

# Check Variance Inflation Factor (VIF)
vif(model)
##            GVIF Df GVIF^(1/(2*Df))
## Volume 1.078364  1        1.038443
## High   1.073093  1        1.035902
## Sector 1.145058 10        1.006796

Explanation of Outputs

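  • Residuals vs. Fitted: Checks linearity and homoscedasticity. Residuals should scatter evenly around zero with no curve or funnel shape.
  • Q-Q Plot: Assesses normality of residuals. Points should follow the diagonal line; the residual summary of the model (minimum -149.9, maximum 63.0, median 1.8) already points to a long left tail.
  • Scale-Location: Evaluates whether the spread of residuals stays constant across fitted values. A roughly flat trend line indicates constant variance.
  • Residuals vs. Leverage: Identifies outliers and influential points, in particular those lying beyond the Cook's distance contours.
  • VIF: Identifies multicollinearity (VIF > 5 indicates concern). Because Sector is a factor with 10 degrees of freedom, car reports generalized VIFs, and the GVIF^(1/(2*Df)) column is the one to compare across terms. All three values (1.04, 1.04, 1.01) are close to 1, so multicollinearity is not a concern in this model.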
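
For a numeric predictor, the VIF is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand on the built-in mtcars data, which stands in for the stock data; `vif_manual` is a hypothetical helper written for this illustration, not a function from car, and it covers only plain VIFs, not the generalized version that car reports for factors.

```r
# VIF for one predictor: regress it on the other predictors and
# compute 1 / (1 - R^2) from that auxiliary regression
vif_manual <- function(predictor, others, data) {
  aux <- lm(reformulate(others, response = predictor), data = data)
  1 / (1 - summary(aux)$r.squared)
}

# mtcars ships with base R; these three predictors are known to be correlated
predictors <- c("wt", "hp", "disp")
vifs <- sapply(predictors, function(p) {
  vif_manual(p, setdiff(predictors, p), mtcars)
})
print(round(vifs, 2))
```

A value of 1 means the predictor is uncorrelated with the others; larger values mean its coefficient's variance is inflated by that factor.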

Coefficient Interpretation

# Confidence interval for coefficients
confint(model, level = 0.95)
##                                      2.5 %        97.5 %
## (Intercept)                  -1.303543e+01 -1.262344e+01
## Volume                        8.813470e-09  1.042371e-08
## High                          9.845292e-01  9.847943e-01
## SectorCommunication Services  9.090566e+00  9.626474e+00
## SectorConsumer Cyclical       9.940391e+00  1.038341e+01
## SectorConsumer Defensive      3.407430e+00  3.893513e+00
## SectorEnergy                  1.011475e+00  1.592309e+00
## SectorFinancial Services      4.164445e+00  4.598553e+00
## SectorHealthcare              6.674019e+00  7.113834e+00
## SectorIndustrials             5.663546e+00  6.113175e+00
## SectorReal Estate            -5.511336e+00 -5.037432e+00
## SectorTechnology              9.589440e+00  1.002779e+01
## SectorUtilities               4.961137e+00  5.435466e+00

Explanation

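  • Volume: The estimate is 9.619e-09, so each additional share traded is associated with an increase of about 9.6e-09 in Adj.Close, holding High and Sector constant. Even an extra million shares corresponds to a change of only about 0.01 in Adj.Close, so the effect is statistically significant but practically negligible.
  • Confidence intervals: The 95% interval for Volume, [8.81e-09, 1.04e-08], gives the range within which the true coefficient likely falls and excludes zero. The interval for High, [0.9845, 0.9848], is very narrow, reflecting the large sample size.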
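
Because Volume is measured in single shares, its coefficient is easier to read after rescaling. The following sketch copies the estimate and interval from the output above and converts them to an effect per one million shares; the variable names are chosen for this illustration only.

```r
# Estimate and 95% confidence bounds for Volume, copied from the output above
volume_est <- 9.619e-09
volume_ci  <- c(lower = 8.813470e-09, upper = 1.042371e-08)

# Rescale from "per share" to "per one million shares"
shares <- 1e6
effect_per_million <- volume_est * shares
ci_per_million     <- volume_ci * shares

print(effect_per_million)
print(ci_per_million)
```

The same result could be obtained by refitting the model with Volume divided by 1e6; rescaling a predictor changes its coefficient and interval by that factor but leaves the fit unchanged.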

Insights and Further Questions

Conclusion

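This analysis provides a foundation for understanding the predictors of Adj.Close. The model fits almost perfectly (R-squared 0.9985), but this appears to be driven chiefly by High, a same-day price for the same stock, while Volume and Sector contribute comparatively little. Further investigations may refine the model and address the issues identified in the diagnostics.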