This analysis uses the mergedfile.csv dataset to build and evaluate a linear model. The goals are to:
- Build a linear model using relevant variables.
- Diagnose the model to identify potential issues.
- Interpret significant coefficients.
- Summarize insights and propose further questions.
# Load necessary libraries
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
# Load the dataset
data <- read.csv("C:/Users/aiden/OneDrive/mergedfile.csv")
# Inspect the structure and summary statistics of the dataset
str(data)
## 'data.frame': 352097 obs. of 23 variables:
## $ Date : chr "2010-01-04" "2010-01-05" "2010-01-06" "2010-01-07" ...
## $ Symbol : chr "MMM" "MMM" "MMM" "MMM" ...
## $ Adj.Close : num 44.3 44 44.6 44.6 44.9 ...
## $ Close : num 69.4 69 70 70 70.5 ...
## $ High : num 69.8 69.6 70.7 70 70.5 ...
## $ Low : num 69.1 68.3 69.8 68.7 69.6 ...
## $ Open : num 69.5 69.2 70.1 69.7 70 ...
## $ Volume : num 3640265 3405012 6301126 5346240 4073337 ...
## $ Exchange : chr "NYQ" "NYQ" "NYQ" "NYQ" ...
## $ Shortname : chr "3M Company" "3M Company" "3M Company" "3M Company" ...
## $ Longname : chr "3M Company" "3M Company" "3M Company" "3M Company" ...
## $ Sector : chr "Industrials" "Industrials" "Industrials" "Industrials" ...
## $ Industry : chr "Conglomerates" "Conglomerates" "Conglomerates" "Conglomerates" ...
## $ Currentprice : num 131 131 131 131 131 ...
## $ Marketcap : num 7.17e+10 7.17e+10 7.17e+10 7.17e+10 7.17e+10 ...
## $ Ebitda : num 7.35e+09 7.35e+09 7.35e+09 7.35e+09 7.35e+09 ...
## $ Revenuegrowth : num -0.004 -0.004 -0.004 -0.004 -0.004 -0.004 -0.004 -0.004 -0.004 -0.004 ...
## $ City : chr "Saint Paul" "Saint Paul" "Saint Paul" "Saint Paul" ...
## $ State : chr "MN" "MN" "MN" "MN" ...
## $ Country : chr "United States" "United States" "United States" "United States" ...
## $ Fulltimeemployees : num 85000 85000 85000 85000 85000 85000 85000 85000 85000 85000 ...
## $ Longbusinesssummary: chr "3M Company provides diversified technology services in the United States and internationally. The company's Saf"| __truncated__ "3M Company provides diversified technology services in the United States and internationally. The company's Saf"| __truncated__ "3M Company provides diversified technology services in the United States and internationally. The company's Saf"| __truncated__ "3M Company provides diversified technology services in the United States and internationally. The company's Saf"| __truncated__ ...
## $ Weight : num 0.00137 0.00137 0.00137 0.00137 0.00137 ...
summary(data)
## Date Symbol Adj.Close Close
## Length:352097 Length:352097 Min. : 1.03 Min. : 1.03
## Class :character Class :character 1st Qu.: 28.07 1st Qu.: 33.74
## Mode :character Mode :character Median : 52.26 Median : 60.79
## Mean : 105.71 Mean : 113.01
## 3rd Qu.: 102.52 3rd Qu.: 113.34
## Max. :4119.09 Max. :4119.09
## NA's :12562 NA's :12562
## High Low Open Volume
## Min. : 1.26 Min. : 1.01 Min. : 1.03 Min. :0.000e+00
## 1st Qu.: 34.10 1st Qu.: 33.34 1st Qu.: 33.72 1st Qu.:9.424e+05
## Median : 61.41 Median : 60.14 Median : 60.75 Median :2.131e+06
## Mean : 114.22 Mean : 111.75 Mean : 113.00 Mean :9.998e+06
## 3rd Qu.: 114.50 3rd Qu.: 112.10 3rd Qu.: 113.31 3rd Qu.:4.965e+06
## Max. :4144.32 Max. :4110.64 Max. :4117.00 Max. :1.881e+09
## NA's :12562 NA's :12562 NA's :12562 NA's :12562
## Exchange Shortname Longname Sector
## Length:352097 Length:352097 Length:352097 Length:352097
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Industry Currentprice Marketcap Ebitda
## Length:352097 Min. : 10.39 Min. :6.823e+09 Min. :-8.410e+08
## Class :character 1st Qu.: 72.29 1st Qu.:1.748e+10 1st Qu.: 1.568e+09
## Mode :character Median : 130.55 Median :4.050e+10 Median : 2.977e+09
## Mean : 232.74 Mean :1.769e+11 Mean : 1.078e+10
## 3rd Qu.: 227.00 3rd Qu.:1.099e+11 3rd Qu.: 6.499e+09
## Max. :3830.58 Max. :3.449e+12 Max. : 1.318e+11
## NA's :22110
## Revenuegrowth City State Country
## Min. :-0.39700 Length:352097 Length:352097 Length:352097
## 1st Qu.:-0.01600 Class :character Class :character Class :character
## Median : 0.04700 Mode :character Mode :character Mode :character
## Mean : 0.04264
## 3rd Qu.: 0.09600
## Max. : 0.46300
##
## Fulltimeemployees Longbusinesssummary Weight
## Min. : 568 Length:352097 Min. :0.0001304
## 1st Qu.: 9372 Class :character 1st Qu.:0.0003340
## Median : 24150 Mode :character Median :0.0007741
## Mean : 68873 Mean :0.0033815
## 3rd Qu.: 57000 3rd Qu.:0.0021012
## Max. :1525000 Max. :0.0659147
##
# Check for missing values
sum(is.na(data))
## [1] 97482
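The total of 97482 missing cells matches the per-column counts reported by summary(data): six price and volume columns with 12562 NA's each, plus 22110 in Ebitda. A per-column breakdown makes this easier to see than a single total. The sketch below uses a small made-up data frame (`toy`, a hypothetical stand-in) rather than mergedfile.csv, so the numbers it prints are illustrative only.

```r
# Toy data frame standing in for the real dataset (values are made up)
toy <- data.frame(
  Adj.Close = c(44.3, NA, 44.6, 44.9),
  Volume    = c(3640265, NA, 6301126, 4073337),
  Ebitda    = c(7.35e9, 7.35e9, NA, NA),
  Sector    = c("Industrials", "Industrials", "Energy", "Energy")
)

# Total number of missing cells, as in sum(is.na(data))
total_na <- sum(is.na(toy))

# Missing cells per column; these add up to the total
na_by_column <- colSums(is.na(toy))

# Rows that lm() would keep for a model using only these three columns
rows_used <- sum(complete.cases(toy[, c("Adj.Close", "Volume", "Sector")]))

print(total_na)
print(na_by_column)
print(rows_used)
```

Applied to the real data, `colSums(is.na(data))` would show which columns drive the total, and `complete.cases()` on the model's columns would give the number of rows lm() actually uses.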
# Build the linear model
model <- lm(Adj.Close ~ Volume + High + Sector, data = data)
# Summary of the model
summary(model)
##
## Call:
## lm(formula = Adj.Close ~ Volume + High + Sector, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -149.936 -3.015 1.834 4.870 63.040
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.283e+01 1.051e-01 -122.068 <2e-16 ***
## Volume 9.619e-09 4.108e-10 23.415 <2e-16 ***
## High 9.847e-01 6.763e-05 14560.407 <2e-16 ***
## SectorCommunication Services 9.359e+00 1.367e-01 68.454 <2e-16 ***
## SectorConsumer Cyclical 1.016e+01 1.130e-01 89.916 <2e-16 ***
## SectorConsumer Defensive 3.650e+00 1.240e-01 29.439 <2e-16 ***
## SectorEnergy 1.302e+00 1.482e-01 8.786 <2e-16 ***
## SectorFinancial Services 4.381e+00 1.107e-01 39.564 <2e-16 ***
## SectorHealthcare 6.894e+00 1.122e-01 61.443 <2e-16 ***
## SectorIndustrials 5.888e+00 1.147e-01 51.336 <2e-16 ***
## SectorReal Estate -5.274e+00 1.209e-01 -43.628 <2e-16 ***
## SectorTechnology 9.809e+00 1.118e-01 87.714 <2e-16 ***
## SectorUtilities 5.198e+00 1.210e-01 42.960 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.987 on 339522 degrees of freedom
## (12562 observations deleted due to missingness)
## Multiple R-squared: 0.9985, Adjusted R-squared: 0.9985
## F-statistic: 1.902e+07 on 12 and 339522 DF, p-value: < 2.2e-16
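Two quantities in the summary above are easy to misread when a factor such as Sector is in the model: the intercept and R-squared. The sketch below illustrates both on R's built-in mtcars data (an assumption made to keep the example self-contained; it does not use mergedfile.csv). It checks that the intercept equals the fitted value at the reference factor level with the numeric predictor at zero, and that R-squared equals 1 - RSS/TSS.

```r
# Illustration on the built-in mtcars data, not on mergedfile.csv
cars <- transform(mtcars, cyl = factor(cyl))
fit <- lm(mpg ~ wt + cyl, data = cars)

# The intercept is the fitted value when wt = 0 and cyl is at its
# reference level (the first factor level, here "4")
baseline <- data.frame(wt = 0, cyl = factor("4", levels = levels(cars$cyl)))
intercept <- unname(coef(fit)["(Intercept)"])
pred_at_baseline <- unname(predict(fit, newdata = baseline))

# R-squared is the share of variance in the response explained by the
# model: 1 - RSS / TSS
rss <- sum(residuals(fit)^2)
tss <- sum((cars$mpg - mean(cars$mpg))^2)
r_squared_manual <- 1 - rss / tss
r_squared_summary <- summary(fit)$r.squared

print(all.equal(intercept, pred_at_baseline))
print(all.equal(r_squared_manual, r_squared_summary))
```

By the same logic, each Sector coefficient in the output above is a shift relative to the sector that does not appear in the coefficient table.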
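- Intercept (-12.83): the expected Adj.Close when all numeric predictors are zero and Sector is at its baseline level (not always meaningful depending on context).
- R-squared (0.9985): the proportion of variance in Adj.Close explained by the model.
# Generate diagnostic plots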
par(mfrow = c(2, 2))
plot(model)
# Check Variance Inflation Factor (VIF)
vif(model)
## GVIF Df GVIF^(1/(2*Df))
## Volume 1.078364 1 1.038443
## High 1.073093 1 1.035902
## Sector 1.145058 10 1.006796
# Confidence interval for coefficients
confint(model, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -1.303543e+01 -1.262344e+01
## Volume 8.813470e-09 1.042371e-08
## High 9.845292e-01 9.847943e-01
## SectorCommunication Services 9.090566e+00 9.626474e+00
## SectorConsumer Cyclical 9.940391e+00 1.038341e+01
## SectorConsumer Defensive 3.407430e+00 3.893513e+00
## SectorEnergy 1.011475e+00 1.592309e+00
## SectorFinancial Services 4.164445e+00 4.598553e+00
## SectorHealthcare 6.674019e+00 7.113834e+00
## SectorIndustrials 5.663546e+00 6.113175e+00
## SectorReal Estate -5.511336e+00 -5.037432e+00
## SectorTechnology 9.589440e+00 1.002779e+01
## SectorUtilities 4.961137e+00 5.435466e+00
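The Volume estimate and its interval are tiny because they are expressed per single share traded. Rescaling to a change of one million shares makes them easier to read. The sketch below only rescales the values printed in the output above; the step size `shares_per_step` is a choice made for illustration, not part of the original analysis.

```r
# Values copied from the model summary and confint() output above
volume_coef <- 9.619e-09
volume_ci   <- c(lower = 8.813470e-09, upper = 1.042371e-08)

# Express the effect per one million shares instead of per share
shares_per_step  <- 1e6
coef_per_million <- volume_coef * shares_per_step
ci_per_million   <- volume_ci * shares_per_step

print(coef_per_million)
print(ci_per_million)
```

On this scale the estimated change in Adj.Close is roughly 0.0096 per additional million shares, holding High and Sector constant.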
- Volume: for every one-unit (one-share) increase in Volume, Adj.Close is expected to change by the coefficient value (9.619e-09), holding other variables constant.
- Statistically significant predictors of Adj.Close were identified.
- How would adding other variables (e.g., Low, Open) affect the model?

This analysis provides a foundation for understanding the predictors of Adj.Close. Further investigations may refine the model and address identified issues.
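One way to start on the question about adding Low and Open is to check collinearity first, since the summary statistics show that Open, High and Low have nearly identical distributions. The sketch below uses simulated prices (an assumption, not the real data) and computes a variance inflation factor by hand as 1 / (1 - R^2). The helper `vif_manual` is hypothetical and written for this illustration. For numeric predictors this is the same quantity car::vif() reports.

```r
# Simulated prices, for illustration only (not drawn from mergedfile.csv)
set.seed(1)
n <- 200
open_price <- runif(n, min = 50, max = 150)
high_price <- open_price + abs(rnorm(n, sd = 1))  # High sits just above Open
low_price  <- open_price - abs(rnorm(n, sd = 1))  # Low sits just below Open
sim <- data.frame(Open = open_price, High = high_price, Low = low_price)

# VIF of a numeric predictor: 1 / (1 - R^2), where R^2 comes from
# regressing that predictor on the other predictors
vif_manual <- function(target, others, data) {
  aux <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(aux)$r.squared)
}

vif_high <- vif_manual("High", c("Open", "Low"), sim)
print(vif_high)
```

A very large value here would suggest that adding Low and Open alongside High is likely to inflate standard errors without adding much explanatory power. Whether that holds for the real data would need to be checked with vif() on the refitted model.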