Fraud Analysis Linear Regression

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

This exercise is under construction. Please report any errors at https://forms.gle/2W4tffs4YJA1jeBv9

Goal: Practice basic linear regression techniques in R.

Before starting: 1. You are not allowed to search for solutions to this assignment. 2. You are allowed to search for information about packages and functions that can help you.

Individual assignment only: 47 total points (Rmd and html solution)

[1 point] Q1.

Start by entering your name and today’s date in Lines 3 and 4, respectively. Then, run the chunk of code below by clicking on the green arrow (that points to the right) on the top right of the chunk. Tip: The code chunks are numbered to match their question numbers. Chunk 1 specifies the knitting parameters.

[2 points] Q2.

Before getting started, clear your Environment using the rm command inline. Then, restart R and clear output.

rm(list = ls())

[2 points] Q3.

Read the linearregression.csv file we provided into R. This is simple simulated data that we will use to revise linear regression.

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Read the linearregression.csv file into R
data <- read_csv("linearregression.csv")

## Rows: 100 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): x, y
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Display the first few rows of the data
head(data)

## # A tibble: 6 × 2
##        x      y
##    <dbl>  <dbl>
## 1  0.373 -1.31 
## 2  1.42   1.15 
## 3  2.40   4.47 
## 4 -0.762 -0.163
## 5  0.416 -0.124
## 6 -0.741 -3.06

[5 points] Q4.

Perform a simple linear regression of y (y-axis) against x (x-axis) and assign it to the variable reg1. Display the summary of your regression.

# Perform simple linear regression
reg1 <- lm(y ~ x, data = data)

# Display the summary of the regression
summary(reg1)

## 
## Call:
## lm(formula = y ~ x, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3520 -0.7891  0.1100  0.6959  3.6718 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.5936     0.1330  -4.465 2.15e-05 ***
## x             0.8639     0.1383   6.246 1.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.306 on 98 degrees of freedom
## Multiple R-squared:  0.2847, Adjusted R-squared:  0.2775 
## F-statistic: 39.01 on 1 and 98 DF,  p-value: 1.087e-08
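
The columns of the coefficient table are linked by simple arithmetic: each t value is the estimate divided by its standard error (for x above, 0.8639 / 0.1383 ≈ 6.25), and each p-value is the two-sided tail probability of that t value. The sketch below checks this on simulated stand-in data. The data frame `sim` and the model `fit` are hypothetical names used only for illustration, and the simulated numbers will not match the output above.

```r
# Simulated stand-in data (assumption: NOT the real linearregression.csv)
set.seed(1)
sim <- data.frame(x = rnorm(100))
sim$y <- -1 + 0.6 * sim$x + 0.6 * sim$x^2 + rnorm(100)

fit <- lm(y ~ x, data = sim)

# Coefficient table as a numeric matrix with the columns
# "Estimate", "Std. Error", "t value" and "Pr(>|t|)"
coef_table <- coef(summary(fit))
coef_table

# The t value is the estimate divided by its standard error
t_by_hand <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]

# The p-value is the two-sided tail probability of that t value,
# using the residual degrees of freedom of the model
p_by_hand <- 2 * pt(abs(t_by_hand), df = df.residual(fit), lower.tail = FALSE)

# 95% confidence intervals for the intercept and the slope
confint(fit, level = 0.95)
```

Working from `coef(summary(fit))` rather than the printed summary gives the unrounded values, which is convenient when a later answer needs to quote them.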

[2 points] Q5.

What is the R squared of your regression?

# Extract the R-squared value from the regression summary
r_squared <- summary(reg1)$r.squared

# Display the R-squared value
r_squared

## [1] 0.2847496

# The R squared is 0.2847, so x explains about 28% of the variation in y.
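
For reference, R squared is one minus the ratio of the residual sum of squares to the total sum of squares, and the adjusted version penalizes the number of predictors. With the values reported above, 1 - (1 - 0.2847) × 99/98 ≈ 0.277, which agrees with the adjusted R squared in the summary. The following is a minimal sketch on simulated stand-in data; `sim`, `fit` and the `*_by_hand` names are hypothetical and the numbers will differ from the assignment's.

```r
# Simulated stand-in data (assumption: NOT the real linearregression.csv)
set.seed(1)
sim <- data.frame(x = rnorm(100))
sim$y <- -1 + 0.6 * sim$x + 0.6 * sim$x^2 + rnorm(100)

fit <- lm(y ~ x, data = sim)

# Residual and total sums of squares
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((sim$y - mean(sim$y))^2)

# R squared: the share of the variation in y that the model accounts for
r2_by_hand <- 1 - ss_res / ss_tot

# Adjusted R squared with n observations and p predictors (p = 1 here)
n <- nrow(sim)
p <- 1
adj_r2_by_hand <- 1 - (1 - r2_by_hand) * (n - 1) / (n - p - 1)

# In a simple regression, R squared equals the squared correlation of x and y
r2_from_cor <- cor(sim$x, sim$y)^2

c(r2 = r2_by_hand, adj_r2 = adj_r2_by_hand, cor_squared = r2_from_cor)
```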

[5 points] Q6.

Which variables are significant in your regression? How do you know?

### This section doesn't require code. Just knit and submit the Rmd and html files.### 
# x is significant in my regression: the p-value for its coefficient is 1.09e-08, which is far below common significance levels (0.05, 0.01, etc.), as the *** next to it in the summary also shows. The intercept (p-value 2.15e-05) is significant as well.

[5 points] Q7.

In simple English, tell us what it means to say that a regression variable is significant?

### This section doesn't require code. Just knit and submit the Rmd and html files.### 
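# When a regression variable is significant, it means that the relationship we estimated between that variable and the outcome is unlikely to be the result of random chance alone. If the variable truly had no relationship with the outcome, we would very rarely see an estimated effect this large, so we conclude that the variable is associated with the outcome under study.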
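
One way to see what "not just random" means is to simulate many regressions in which x has no relationship with y at all, and count how often x still looks significant at the 5% level. By construction this should happen in roughly 5% of the runs. This is only an illustrative sketch: the seed, the number of runs and all variable names are arbitrary choices, not part of the assignment.

```r
set.seed(42)
n_sims <- 2000  # number of simulated regressions (arbitrary choice)
n_obs  <- 100   # observations per regression, same size as the assignment data

# Each run regresses y on an x that has NO true relationship with y
p_values <- replicate(n_sims, {
  x <- rnorm(n_obs)
  y <- rnorm(n_obs)  # y is generated independently of x
  coef(summary(lm(y ~ x)))["x", "Pr(>|t|)"]
})

# Share of runs in which x looks "significant" at the 5% level purely by chance
false_positive_rate <- mean(p_values < 0.05)
false_positive_rate
```

A p-value as small as 1.09e-08 would almost never arise in such a no-relationship world, which is why the coefficient of x in reg1 is called significant.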

[5 points] Q8.

Make a scatterplot of y against x in base R and superimpose a colored straight line of best fit.

# Create a scatterplot
plot(data$x, data$y, main="Scatterplot of y against x",
     xlab="x", ylab="y", pch=19, col="blue")

# Add a straight line of best fit
abline(reg1, col="red")

# Add a legend
legend("topleft", legend="Line of Best Fit", col="red", lty=1)

[10 points] Q9.

As you have seen from the previous questions, the R squared is low and the line of best fit (which is just the regression line) does not fit the points well. Now is your chance to change this. Make a new regression, reg2, in which you add another term to the independent variables to make the regression fit better. Explain your choice of this new variable. Display the summary of this new regression. Check that your new variable is significant. What is your new R squared? It should be about twice as large as the R squared for reg1. Why is your R squared so much better?

# Create a new variable x_squared
data$x_squared <- data$x^2

# Perform the new regression (reg2)
reg2 <- lm(y ~ x + x_squared, data = data)

# Display the summary of the new regression
summary(reg2)

## 
## Call:
## lm(formula = y ~ x + x_squared, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3889 -0.3905  0.3306  0.6682  1.3573 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.09838    0.12106  -9.073 1.35e-14 ***
## x            0.60083    0.11255   5.338 6.16e-07 ***
## x_squared    0.59754    0.07431   8.041 2.20e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.017 on 97 degrees of freedom
## Multiple R-squared:  0.5708, Adjusted R-squared:  0.562 
## F-statistic: 64.51 on 2 and 97 DF,  p-value: < 2.2e-16

# I chose to add x^2 because a quadratic term allows the fitted line to curve. My new variable is significant (p-value 2.20e-12). The new R squared is 0.5708, about twice the 0.2847 from reg1. It is so much better because the relationship between y and x appears to be curved, so the x^2 term explains variation in y that a straight line leaves in the residuals.
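
A formal way to check whether the extra term is worth adding is a nested-model F-test with `anova()`. The sketch below also writes the quadratic term as `I(x^2)` inside the formula, which is an alternative to creating a separate column. It runs on simulated stand-in data; `sim`, `fit_linear` and `fit_quad` are hypothetical names and the results will not match reg1 and reg2 exactly.

```r
# Simulated stand-in data (assumption: NOT the real linearregression.csv)
set.seed(1)
sim <- data.frame(x = rnorm(100))
sim$y <- -1 + 0.6 * sim$x + 0.6 * sim$x^2 + rnorm(100)

fit_linear <- lm(y ~ x, data = sim)

# I() makes ^ mean arithmetic squaring inside a formula,
# so no extra x_squared column is needed
fit_quad <- lm(y ~ x + I(x^2), data = sim)

# Nested-model F-test: does adding the quadratic term reduce the
# residual sum of squares by more than chance would explain?
comparison <- anova(fit_linear, fit_quad)
comparison

# R squared of both models side by side
r2_linear <- summary(fit_linear)$r.squared
r2_quad   <- summary(fit_quad)$r.squared
c(linear = r2_linear, quadratic = r2_quad)
```

Because only one term is added, the F statistic of this comparison equals the square of the t value of the quadratic term, so the test agrees with the significance shown in the model summary.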

[5 points] Q10.

Use the following code to see four common diagnostic plots for your new regression. These plots are often used to evaluate regressions. The first of these plots is a Residuals vs Fitted Values plot (commonly just called a residual plot). When points are randomly scattered on both sides of the dotted line in this plot, we know our regression is doing well. Do you think your reg2 does well as a regression over the data provided?

# Create diagnostic plots for reg2
par(mfrow = c(2, 2))

# Residuals vs Fitted Values plot
plot(reg2, which = 1, main = "Residuals vs Fitted Values")

# Normal Q-Q Plot
plot(reg2, which = 2, main = "Normal Q-Q Plot")

# Scale-Location Plot
plot(reg2, which = 3, main = "Scale-Location Plot")

# Residuals vs Leverage Plot
plot(reg2, which = 5, main = "Residuals vs Leverage Plot")

# I think it's not good, because the points are concentrated above the dotted line instead of being randomly scattered on both sides of it.
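
The visual impression can be backed up with a few numbers. In the reg2 summary above, the median residual is 0.3306 while the extremes are -3.3889 and 1.3573, which is consistent with many small positive residuals and a few large negative ones. The sketch below shows how such checks could be computed. It uses simulated stand-in data with well-behaved errors, so unlike reg2 it is not expected to show a problem; all names are hypothetical.

```r
# Simulated stand-in data (assumption: NOT the real linearregression.csv)
set.seed(1)
sim <- data.frame(x = rnorm(100))
sim$y <- -1 + 0.6 * sim$x + 0.6 * sim$x^2 + rnorm(100)

fit_quad <- lm(y ~ x + I(x^2), data = sim)
res <- residuals(fit_quad)

# With an intercept in the model, least-squares residuals always average
# to zero, so the mean cannot reveal a one-sided pattern
mean(res)

# The median and the share of positive residuals show whether the points
# pile up on one side of the zero line
median(res)
share_above <- mean(res > 0)
share_above

# Shapiro-Wilk test of normality; a small p-value suggests that the
# residuals are not normally distributed
shapiro.test(res)
```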

[5 points] Q11.

Knit to html after eliminating all the errors. Submit both the Rmd and html files. Tip: Do not worry about minor formatting issues.

### This section doesn't require code. Just knit and submit the Rmd and html files.###