library(tidyverse)  # attaches ggplot2, readr, dplyr, and the other core packages
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
urlfile <- "https://raw.githubusercontent.com/uzmabb182/Data605_Assignment/main/sal_by_exp.csv"
mydata <- read_csv(urlfile)  # read_csv() can read directly from a URL
## New names:
## • `` -> `...1`
## Rows: 30 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): ...1, YearsExperience, Salary
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
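As the message itself suggests, the column types can be declared up front to quiet the guessing output; a minimal sketch (the compact string "ddd" means three double columns):
# Declaring the column types ("ddd" = three doubles) suppresses the
# column-specification message; the blank header is still renamed to `...1`
mydata <- read_csv(urlfile, col_types = "ddd")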
head(mydata)
## # A tibble: 6 × 3
## ...1 YearsExperience Salary
## <dbl> <dbl> <dbl>
## 1 0 1.2 39344
## 2 1 1.4 46206
## 3 2 1.6 37732
## 4 3 2.1 43526
## 5 4 2.3 39892
## 6 5 3 56643
nrow(mydata)
## [1] 30
ncol(mydata)
## [1] 3
str(mydata)
## spc_tbl_ [30 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ...1 : num [1:30] 0 1 2 3 4 5 6 7 8 9 ...
## $ YearsExperience: num [1:30] 1.2 1.4 1.6 2.1 2.3 3 3.1 3.3 3.3 3.8 ...
## $ Salary : num [1:30] 39344 46206 37732 43526 39892 ...
## - attr(*, "spec")=
## .. cols(
## .. ...1 = col_double(),
## .. YearsExperience = col_double(),
## .. Salary = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
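Since dplyr is attached via the tidyverse, glimpse() offers a similar compact overview of the structure:
# Tidyverse alternative to str(): one row per column with type and preview
glimpse(mydata)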
# Fetching column names and renaming
colnames(mydata)
## [1] "...1" "YearsExperience" "Salary"
# Renaming the auto-generated `...1` index column to "ID"
names(mydata)[names(mydata) == "...1"] <- "ID"
mydata
## # A tibble: 30 × 3
## ID YearsExperience Salary
## <dbl> <dbl> <dbl>
## 1 0 1.2 39344
## 2 1 1.4 46206
## 3 2 1.6 37732
## 4 3 2.1 43526
## 5 4 2.3 39892
## 6 5 3 56643
## 7 6 3.1 60151
## 8 7 3.3 54446
## 9 8 3.3 64446
## 10 9 3.8 57190
## # ℹ 20 more rows
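An equivalent dplyr rename (had we not already done it in base R) is the one-liner below; note the backticks needed around the non-syntactic name `...1`:
# dplyr alternative to the base-R rename above
mydata <- mydata %>% rename(ID = `...1`)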
# Convert Tibble to Data Frame
df_data <- data.frame(mydata)
df_data
## ID YearsExperience Salary
## 1 0 1.2 39344
## 2 1 1.4 46206
## 3 2 1.6 37732
## 4 3 2.1 43526
## 5 4 2.3 39892
## 6 5 3.0 56643
## 7 6 3.1 60151
## 8 7 3.3 54446
## 9 8 3.3 64446
## 10 9 3.8 57190
## 11 10 4.0 63219
## 12 11 4.1 55795
## 13 12 4.1 56958
## 14 13 4.2 57082
## 15 14 4.6 61112
## 16 15 5.0 67939
## 17 16 5.2 66030
## 18 17 5.4 83089
## 19 18 6.0 81364
## 20 19 6.1 93941
## 21 20 6.9 91739
## 22 21 7.2 98274
## 23 22 8.0 101303
## 24 23 8.3 113813
## 25 24 8.8 109432
## 26 25 9.1 105583
## 27 26 9.6 116970
## 28 27 9.7 112636
## 29 28 10.4 122392
## 30 29 10.6 121873
plot(df_data[, "YearsExperience"], df_data[, "Salary"], main = "Relationship Trend",
     xlab = "Years Experience", ylab = "Salary")
Dependent variable: Salary, plotted on the y-axis.
Independent variable: YearsExperience, plotted on the x-axis.
The chart shows that salaries increase as years of experience increase.
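Since ggplot2 is already attached via the tidyverse, the same scatter plot can also be drawn with it; a minimal sketch:
# ggplot2 version of the scatter plot above
ggplot(df_data, aes(x = YearsExperience, y = Salary)) +
  geom_point() +
  labs(title = "Relationship Trend", x = "Years Experience", y = "Salary")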
The simplest regression model is a straight line. It has the mathematical form:
\[\hat{y} = a_{0} + a_{1}x_{1}\]
where \(x_1\) is the independent variable,
\(y\) is the dependent variable,
\(a_1\) is the slope,
\(a_0\) is the y-intercept of the line,
and \(\hat{y}\) is the output value the model predicts. The hat indicates a predicted or estimated value, not the actual observed value.
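For a single predictor, the least-squares estimates have a closed form, \(a_1 = \operatorname{cov}(x, y)/\operatorname{var}(x)\) and \(a_0 = \bar{y} - a_1\bar{x}\). A quick sketch of that computation, using Salary as \(x\) and YearsExperience as \(y\) to match the lm() fit below:
# Closed-form least-squares estimates, for comparison with lm()
x <- df_data$Salary
y <- df_data$YearsExperience
a1 <- cov(x, y) / var(x)        # slope
a0 <- mean(y) - a1 * mean(x)    # intercept
c(intercept = a0, slope = a1)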
R provides the function lm(), which generates a linear model from the data contained in a data frame. Note that here YearsExperience is treated as the response and Salary as the predictor:
df_data.lm <- lm(YearsExperience ~ Salary, data = df_data)
df_data.lm
##
## Call:
## lm(formula = YearsExperience ~ Salary, data = df_data)
##
## Coefficients:
## (Intercept) Salary
## -2.2832618 0.0001013
In this case, the y-intercept is \(a_0 = -2.2832618\) and the slope is \(a_1 = 0.0001013\).
Thus, the fitted regression model is:
\[\widehat{\text{YearsExperience}} = -2.2832618 + 0.0001013 \times \text{Salary}\]
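With the fitted object, predictions come from predict(); a minimal sketch, where the 80,000 salary is an arbitrary illustrative value, not taken from the data:
# Predicted years of experience for a hypothetical salary of 80,000
predict(df_data.lm, newdata = data.frame(Salary = 80000))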
The following code plots the original data along with the fitted line:
plot(YearsExperience ~ Salary, data = df_data)
abline(df_data.lm)
summary(df_data.lm)
##
## Call:
## lm(formula = YearsExperience ~ Salary, data = df_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.12974 -0.46457 0.04105 0.54311 0.79669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.283e+00 3.273e-01 -6.976 1.38e-07 ***
## Salary 1.013e-04 4.059e-06 24.950 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5992 on 28 degrees of freedom
## Multiple R-squared: 0.957, Adjusted R-squared: 0.9554
## F-statistic: 622.5 on 1 and 28 DF, p-value: < 2.2e-16
These results come from a linear regression analysis where the dependent variable is “YearsExperience” and the independent variable is “Salary”. Here’s an interpretation:
Residuals: These are the differences between the observed values of the dependent variable and the values predicted by the model. The summary statistics (Min, 1Q, Median, 3Q, Max) describe the distribution of these residuals:
Min and Max - the residuals of the points furthest below and above the regression line, respectively.
Median - the median value of all of the residuals.
1Q and 3Q - the first and third quartiles of the sorted residual values.
If the line is a good fit for the data, we expect residual values that are roughly normally distributed around a mean of zero.
Intercept: The estimated intercept of the regression line, approximately -2.283.
Salary: The estimated coefficient for the “Salary” variable. It indicates the change in the dependent variable for a one-unit change in the independent variable, here approximately 0.0001013 years of experience per dollar of salary.
Std. Error: The statistical standard error for each of the coefficient estimates.
Significance codes: These asterisks denote the level of significance of the coefficients. For instance, ’***’ indicates very high significance, meaning the corresponding coefficient is highly likely to be different from zero.
Residual standard error: This is an estimate of the standard deviation of the error term in the regression model.
Multiple R-squared: This value (0.957) is the proportion of the variance in the dependent variable that is predictable from the independent variable(s), meaning about 95.7% of the variability in “YearsExperience” can be explained by “Salary” in this model.
Adjusted R-squared: This is R-squared adjusted for the number of predictors in the model. It is often considered a more reliable measure when there are multiple predictors.
F-statistic and p-value: The F-statistic tests the overall significance of the regression model. The very small p-value (< 2.2e-16) implies that the regression model is statistically significant, i.e., at least one independent variable has a non-zero coefficient in predicting the dependent variable.
As indicated by the high R-squared value and the significance of the coefficients, the overall model fits the data well.
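The quantities in the summary can also be extracted programmatically; a brief sketch:
# Coefficient table (estimates, standard errors, t values, p values)
coef(summary(df_data.lm))
# 95% confidence intervals for the coefficients
confint(df_data.lm, level = 0.95)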
The following function calls produce the residuals plot for our model:
plot(fitted(df_data.lm),resid(df_data.lm))
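An optional horizontal reference line at zero makes systematic departures easier to see:
# Residuals should scatter randomly around this zero line
abline(h = 0, lty = 2)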
Another test of the residuals uses the quantile-versus-quantile, or Q-Q, plot.
qqnorm(resid(df_data.lm))
qqline(resid(df_data.lm))
If the residuals were normally distributed, we would expect the points plotted in this figure to follow a straight line. As we can see, the two ends do not diverge considerably from that line, which indicates that the residuals are approximately normally distributed.
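A formal complement to this visual check is the Shapiro-Wilk test from base R's stats package, where a large p-value gives no evidence against normality:
# Shapiro-Wilk normality test on the residuals
shapiro.test(resid(df_data.lm))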
par(mfrow=c(2,2))
plot(df_data.lm)
Calling plot() on an lm object produces four diagnostic panels. The “Scale-Location” panel above is an alternative way of visualizing the residuals versus the fitted values from the linear regression model.
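That panel plots the square root of the absolute standardized residuals against the fitted values; roughly, it can be reproduced by hand as sketched below, where a flat trend suggests constant error variance:
# Hand-rolled Scale-Location plot: sqrt(|standardized residuals|)
# against fitted values
plot(fitted(df_data.lm), sqrt(abs(rstandard(df_data.lm))),
     xlab = "Fitted values", ylab = "sqrt(|standardized residuals|)")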