library(tidyverse)  # attaches ggplot2, readr, dplyr, and the other core packages
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
urlfile <- "https://raw.githubusercontent.com/uzmabb182/Data605_Assignment/main/sal_by_exp.csv"
mydata <- read_csv(urlfile)  # read_csv() can read directly from a URL
## New names:
## • `` -> `...1`
## Rows: 30 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): ...1, YearsExperience, Salary
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
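As the message itself suggests, the column types can be declared up front to quiet the guessing output; a minimal sketch (the compact string "ddd" means three double columns):
# Declaring the column types ("ddd" = three doubles) suppresses the
# column-specification message; the blank header is still renamed to `...1`
mydata <- read_csv(urlfile, col_types = "ddd")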
head(mydata)
## # A tibble: 6 × 3
## ...1 YearsExperience Salary
## <dbl> <dbl> <dbl>
## 1 0 1.2 39344
## 2 1 1.4 46206
## 3 2 1.6 37732
## 4 3 2.1 43526
## 5 4 2.3 39892
## 6 5 3 56643
nrow(mydata)
## [1] 30
ncol(mydata)
## [1] 3
str(mydata)
## spc_tbl_ [30 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ...1 : num [1:30] 0 1 2 3 4 5 6 7 8 9 ...
## $ YearsExperience: num [1:30] 1.2 1.4 1.6 2.1 2.3 3 3.1 3.3 3.3 3.8 ...
## $ Salary : num [1:30] 39344 46206 37732 43526 39892 ...
## - attr(*, "spec")=
## .. cols(
## .. ...1 = col_double(),
## .. YearsExperience = col_double(),
## .. Salary = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
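Since dplyr is attached via the tidyverse, glimpse() offers a similar compact overview of the structure:
# Tidyverse alternative to str(): one row per column with type and preview
glimpse(mydata)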
# Fetching column names and renaming
colnames(mydata)
## [1] "...1" "YearsExperience" "Salary"
# Renaming the auto-generated `...1` index column to "ID"
names(mydata)[names(mydata) == "...1"] <- "ID"
mydata
## # A tibble: 30 × 3
## ID YearsExperience Salary
## <dbl> <dbl> <dbl>
## 1 0 1.2 39344
## 2 1 1.4 46206
## 3 2 1.6 37732
## 4 3 2.1 43526
## 5 4 2.3 39892
## 6 5 3 56643
## 7 6 3.1 60151
## 8 7 3.3 54446
## 9 8 3.3 64446
## 10 9 3.8 57190
## # ℹ 20 more rows
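An equivalent dplyr rename (had we not already done it in base R) is the one-liner below; note the backticks needed around the non-syntactic name `...1`:
# dplyr alternative to the base-R rename above
mydata <- mydata %>% rename(ID = `...1`)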
# Convert Tibble to Data Frame
df_data <- data.frame(mydata)
df_data
## ID YearsExperience Salary
## 1 0 1.2 39344
## 2 1 1.4 46206
## 3 2 1.6 37732
## 4 3 2.1 43526
## 5 4 2.3 39892
## 6 5 3.0 56643
## 7 6 3.1 60151
## 8 7 3.3 54446
## 9 8 3.3 64446
## 10 9 3.8 57190
## 11 10 4.0 63219
## 12 11 4.1 55795
## 13 12 4.1 56958
## 14 13 4.2 57082
## 15 14 4.6 61112
## 16 15 5.0 67939
## 17 16 5.2 66030
## 18 17 5.4 83089
## 19 18 6.0 81364
## 20 19 6.1 93941
## 21 20 6.9 91739
## 22 21 7.2 98274
## 23 22 8.0 101303
## 24 23 8.3 113813
## 25 24 8.8 109432
## 26 25 9.1 105583
## 27 26 9.6 116970
## 28 27 9.7 112636
## 29 28 10.4 122392
## 30 29 10.6 121873
plot(df_data[, "YearsExperience"], df_data[, "Salary"], main = "Relationship Trend",
     xlab = "Years Experience", ylab = "Salary")
Dependent variable: Salary, plotted on the y-axis.
Independent variable: YearsExperience, plotted on the x-axis.
The chart shows that salaries increase as years of experience increase.
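Since ggplot2 is already attached via the tidyverse, the same scatter plot can also be drawn with it; a minimal sketch:
# ggplot2 version of the scatter plot above
ggplot(df_data, aes(x = YearsExperience, y = Salary)) +
  geom_point() +
  labs(title = "Relationship Trend", x = "Years Experience", y = "Salary")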
The simplest regression model is a straight line. It has the mathematical form:
\[\hat{y} = a_{0} + a_{1}x_{1}\]
where \(x_1\) is the independent variable,
\(y\) is the dependent variable,
\(a_1\) is the slope,
\(a_0\) is the y-intercept of the line,
and \(\hat{y}\) is the output value the model predicts. The hat indicates a predicted or estimated value, not the actual observed value.
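For a single predictor, the least-squares estimates have a closed form, \(a_1 = \operatorname{cov}(x, y)/\operatorname{var}(x)\) and \(a_0 = \bar{y} - a_1\bar{x}\). A quick sketch of that computation, using Salary as \(x\) and YearsExperience as \(y\) to match the lm() fit below:
# Closed-form least-squares estimates, for comparison with lm()
x <- df_data$Salary
y <- df_data$YearsExperience
a1 <- cov(x, y) / var(x)        # slope
a0 <- mean(y) - a1 * mean(x)    # intercept
c(intercept = a0, slope = a1)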
R provides the function lm(), which generates a linear model from the data contained in a data frame. Note that here YearsExperience is treated as the response and Salary as the predictor:
df_data.lm <- lm(YearsExperience ~ Salary, data = df_data)
df_data.lm
##
## Call:
## lm(formula = YearsExperience ~ Salary, data = df_data)
##
## Coefficients:
## (Intercept) Salary
## -2.2832618 0.0001013
In this case, the y-intercept is \(a_0 = -2.2832618\) and the slope is \(a_1 = 0.0001013\).
Thus, the fitted regression model is:
\[\widehat{\text{YearsExperience}} = -2.2832618 + 0.0001013 \times \text{Salary}\]
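With the fitted object, predictions come from predict(); a minimal sketch, where the 80,000 salary is an arbitrary illustrative value, not taken from the data:
# Predicted years of experience for a hypothetical salary of 80,000
predict(df_data.lm, newdata = data.frame(Salary = 80000))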
The following code plots the original data along with the fitted line:
plot(YearsExperience ~ Salary, data = df_data)
abline(df_data.lm)
summary(df_data.lm)
##
## Call:
## lm(formula = YearsExperience ~ Salary, data = df_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.12974 -0.46457 0.04105 0.54311 0.79669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.283e+00 3.273e-01 -6.976 1.38e-07 ***
## Salary 1.013e-04 4.059e-06 24.950 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5992 on 28 degrees of freedom
## Multiple R-squared: 0.957, Adjusted R-squared: 0.9554
## F-statistic: 622.5 on 1 and 28 DF, p-value: < 2.2e-16
These results come from a linear regression analysis where the dependent variable is “YearsExperience” and the independent variable is “Salary”. Here’s an interpretation:
Residuals: These are the differences between the observed values of the dependent variable and the values predicted by the model. The summary statistics (Min, 1Q, Median, 3Q, Max) describe the distribution of these residuals:
Min and Max - the residuals of the points furthest below and above the regression line, respectively.
Median - the median value of all of the residuals.
1Q and 3Q - the first and third quartiles of the sorted residual values.
If the line is a good fit for the data, we expect residual values that are roughly normally distributed around a mean of zero.
Intercept: The estimated intercept of the regression line, approximately -2.283.
Salary: The estimated coefficient for the “Salary” variable. It indicates the change in the dependent variable for a one-unit change in the independent variable, here approximately 0.0001013 years of experience per dollar of salary.
Std. Error: The statistical standard error for each of the coefficient estimates.
Significance codes: These asterisks denote the level of significance of the coefficients. For instance, ’***’ indicates very high significance, meaning the corresponding coefficient is highly likely to be different from zero.
Residual standard error: This is an estimate of the standard deviation of the error term in the regression model.
Multiple R-squared: This value (0.957) is the proportion of the variance in the dependent variable that is predictable from the independent variable(s), meaning about 95.7% of the variability in “YearsExperience” can be explained by “Salary” in this model.
Adjusted R-squared: This is R-squared adjusted for the number of predictors in the model. It is often considered a more reliable measure when there are multiple predictors.
F-statistic and p-value: The F-statistic tests the overall significance of the regression model. The very small p-value (< 2.2e-16) implies that the regression model is statistically significant, i.e., at least one independent variable has a non-zero coefficient in predicting the dependent variable.
As indicated by the high R-squared value and the significance of the coefficients, the overall model fits the data well.
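The quantities in the summary can also be extracted programmatically; a brief sketch:
# Coefficient table (estimates, standard errors, t values, p values)
coef(summary(df_data.lm))
# 95% confidence intervals for the coefficients
confint(df_data.lm, level = 0.95)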
The following function calls produce the residuals plot for our model:
plot(fitted(df_data.lm),resid(df_data.lm))
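An optional horizontal reference line at zero makes systematic departures easier to see:
# Residuals should scatter randomly around this zero line
abline(h = 0, lty = 2)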
Another test of the residuals uses the quantile-versus-quantile, or Q-Q, plot.
qqnorm(resid(df_data.lm))
qqline(resid(df_data.lm))
If the residuals were normally distributed, we would expect the points plotted in this figure to follow a straight line. As we can see, the two ends do not diverge considerably from that line, which indicates that the residuals are approximately normally distributed.
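A formal complement to this visual check is the Shapiro-Wilk test from base R's stats package, where a large p-value gives no evidence against normality:
# Shapiro-Wilk normality test on the residuals
shapiro.test(resid(df_data.lm))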
par(mfrow=c(2,2))
plot(df_data.lm)
Calling plot() on an lm object produces four diagnostic panels. The “Scale-Location” panel above is an alternative way of visualizing the residuals versus the fitted values from the linear regression model.
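That panel plots the square root of the absolute standardized residuals against the fitted values; roughly, it can be reproduced by hand as sketched below, where a flat trend suggests constant error variance:
# Hand-rolled Scale-Location plot: sqrt(|standardized residuals|)
# against fitted values
plot(fitted(df_data.lm), sqrt(abs(rstandard(df_data.lm))),
     xlab = "Fitted values", ylab = "sqrt(|standardized residuals|)")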