2024-06-04

Inferential Statistics

Inferential statistics are statistical methods that allow us to draw conclusions and make meaningful predictions about a population based on sample data.

Uses of Inferential Statistics:

  • Testing hypotheses
  • Estimating parameters or boundaries of data values
  • Making meaningful predictions

Simple Linear Regression

What is Linear Regression?

  • A line that models the relationship between a dependent variable and an independent variable through a data set.

  • Often, the dependent variable (the result we want to explain) is denoted by “y” and is plotted on the vertical axis.

  • The independent variable (or the variable that may “cause” a certain result) is denoted by “x” and is plotted on the horizontal axis.

  • Simple Linear Regression is often referred to as the “line of best fit”.

Checking the Fit of a Linear Regression

One way to calculate how well a line fits a data set is through the R-Squared method.

This method measures the proportion of the variance in the dependent variable that is explained by the independent variable.

\[ R^2 = 1 - \frac{\text{SS}_{\text{Residual}}}{\text{SS}_{\text{Total}}} \]
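
As a minimal sketch of the formula above, the following R code computes R-Squared by hand (error sum of squares in the numerator, total sum of squares in the denominator) and compares it with the value reported by lm(). It uses R's built-in cars data set (speed vs. stopping distance) instead of train.csv, purely as an assumption for illustration. The names ss_res, ss_tot and r_squared are also just illustrative.

```r
# Illustrative only: uses R's built-in `cars` data, not train.csv
fit <- lm(dist ~ speed, data = cars)

# Sum of squared residuals (variation the line does not explain)
ss_res <- sum(residuals(fit)^2)

# Total sum of squares (variation of y around its mean)
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

r_squared <- 1 - ss_res / ss_tot

# Should agree with the value reported by summary()
all.equal(r_squared, summary(fit)$r.squared)
```

A value close to 1 suggests the line explains most of the variation in y, while a value close to 0 suggests it explains very little.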

Simple Linear Regression Model

Simple Linear Regression is a line and is therefore represented by the following equation.

\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \]

In layman’s terms:

Y = Y-intercept + Slope*x + error
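
To make the pieces of this equation concrete, here is a small sketch that estimates the y-intercept and slope with the standard least-squares formulas and checks them against lm(). It again assumes R's built-in cars data set rather than train.csv, and the variable names (slope, intercept, errors) are illustrative.

```r
# Illustrative only: uses R's built-in `cars` data, not train.csv
x <- cars$speed
y <- cars$dist

# Least-squares estimates of the slope and the y-intercept
slope <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# The error for each point is what the fitted line leaves unexplained
errors <- y - (intercept + slope * x)

# lm() should return the same intercept and slope
fit <- lm(dist ~ speed, data = cars)
coef(fit)
c(intercept, slope)
```

Here intercept and slope play the roles of the y-intercept and slope in the equation, and errors holds the leftover error for each observation.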

Simple Linear Regression Example

The following plot is from the data set train.csv. Points on the graph represent second-hand vehicles. Price, our dependent variable, is on the y-axis, and mileage, our independent variable, is on the x-axis.

R code used to generate the previous plot:

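# Load the data (assumes train.csv is in the working directory)
train <- read.csv("train.csv")

x <- train$km             # mileage
y <- train$current.price  # price
plot(x, y, pch = 16,
     col = "slateblue4",
     xlab = "Mileage",
     ylab = "Price",
     main = "Mileage vs Price of Used Car Market"
     )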

Example using ggplot

Using the same data, ggplot2 can also be used to generate a plot:

Layering a linear regression line onto the ggplot:


Full R code of ggplot with linear regression line:

library(ggplot2)

ggplot(train, aes(x = km, y = current.price)) +
  geom_point(color = "mediumslateblue", size = .3) +
  labs(title = "Mileage vs Price: Used Car Market",
       x = "Mileage",
       y = "Price") +
  # Overlay the least-squares regression line without a confidence band
  geom_smooth(method = "lm", se = FALSE, color = "orangered") +
  theme_minimal()

Conclusion

Simple linear regression allows us to make meaningful predictions based on data. For the used car market data set, we are able to see a distinct negative correlation between price and mileage: as mileage goes up, price goes down. This makes sense because mileage indicates the amount of wear on a vehicle. A car with more mileage is prone to more repairs and thus has a lower value.
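
As a rough sketch of how such a prediction might be made in R, the code below fits a model with lm() and calls predict() on new mileage values. Because train.csv is not reproduced here, it assumes a simulated stand-in data frame (train_demo) that only borrows the column names km and current.price. The numbers are made up and do not reflect the real data set.

```r
# Made-up stand-in data: NOT the real train.csv values
set.seed(1)
km <- runif(200, min = 10000, max = 150000)
current.price <- 600000 - 2 * km + rnorm(200, sd = 20000)
train_demo <- data.frame(km = km, current.price = current.price)

fit <- lm(current.price ~ km, data = train_demo)

# A negative slope means price falls as mileage rises
coef(fit)["km"]

# Predicted price for hypothetical cars with 50,000 and 100,000 km
new_cars <- data.frame(km = c(50000, 100000))
predicted <- predict(fit, newdata = new_cars)
predicted
```

With the real data, replacing train_demo with train would give predictions based on the actual used car market.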