population

# Set a seed for reproducibility
set.seed(123)

# Generate independent variable (predictor)
x <- 1:100

# Generate dependent variable (response) with some noise
# y = 2*x + 5 + random_noise
y <- 2 * x + 5 + rnorm(100, mean = 0, sd = 20)

# Create a data frame
my_data <- data.frame(x = x, y = y)

# View the first few rows of the data
head(my_data)

Explanation:

set.seed(123): This ensures that the random numbers generated are the same every time you run this code. This is crucial for reproducible research.
x <- 1:100: Creates a vector x containing numbers from 1 to 100. This will be our predictor variable.
y <- 2 * x + 5 + rnorm(100, mean = 0, sd = 20):
- 2 * x + 5: This creates a linear relationship between y and x with a slope of 2 and an intercept of 5.
- rnorm(100, mean = 0, sd = 20): This adds random noise to the y values. rnorm generates random numbers from a normal distribution. We’re generating 100 such numbers with a mean of 0 and a standard deviation of 20. This simulates real-world data, where observations scatter around the underlying linear trend rather than falling exactly on the line.
my_data <- data.frame(x = x, y = y): Combines the x and y vectors into a data frame, which is a standard way to store and manipulate tabular data in R.
head(my_data): Displays the first 6 rows of the my_data data frame.
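
To make the effect of set.seed() concrete, here is a minimal sketch that draws the noise twice from the same seed and checks that both draws match. The names noise_a, noise_b, same_noise, noise_mean and noise_sd are introduced only for this illustration.

```r
# Draw the same noise twice, resetting the seed before each draw
set.seed(123)
noise_a <- rnorm(100, mean = 0, sd = 20)

set.seed(123)
noise_b <- rnorm(100, mean = 0, sd = 20)

# identical() is TRUE only if both vectors match exactly
same_noise <- identical(noise_a, noise_b)
print(same_noise)

# The sample mean and sd are close to, but not exactly, the requested 0 and 20
noise_mean <- mean(noise_a)
noise_sd <- sd(noise_a)
print(c(mean = noise_mean, sd = noise_sd))
```

Without the second set.seed(123) call, noise_b would continue the random sequence and differ from noise_a.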

2. Fitting the Linear Regression Model

The core function in R for linear regression is lm().

# Fit the linear regression model
# The formula y ~ x means "model y as a function of x"
model <- lm(y ~ x, data = my_data)

Explanation:

lm(): This is the function to fit linear models.
y ~ x: This is the formula for the model.
- y is the dependent variable (the variable you want to predict).
- ~ separates the dependent variable from the independent variables.
- x is the independent variable (the variable used to predict y).
data = my_data: Specifies the data frame containing the variables y and x.
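
For a single predictor, the least-squares estimates that lm() computes can also be written in closed form: the slope is cov(x, y) / var(x), and the intercept is mean(y) minus the slope times mean(x). The sketch below rebuilds the example data and compares both routes. The names slope_by_hand, intercept_by_hand, by_hand and from_lm are introduced only for this illustration.

```r
# Recreate the example data so this block runs on its own
set.seed(123)
x <- 1:100
y <- 2 * x + 5 + rnorm(100, mean = 0, sd = 20)
my_data <- data.frame(x = x, y = y)

model <- lm(y ~ x, data = my_data)

# Closed-form least-squares estimates for one predictor
slope_by_hand <- cov(my_data$x, my_data$y) / var(my_data$x)
intercept_by_hand <- mean(my_data$y) - slope_by_hand * mean(my_data$x)

# Compare with lm(); unname() drops the coefficient labels
by_hand <- c(intercept_by_hand, slope_by_hand)
from_lm <- unname(coef(model))
print(all.equal(by_hand, from_lm))
```

Note that the formula y ~ x includes an intercept by default, which is why lm() returns two coefficients.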

3. Examining the Model Results

Once the model is fitted, you can inspect its results using several functions.

# Get a summary of the model
summary(model)

# Get the coefficients of the model
coefficients(model)

# Get the fitted (predicted) values
fitted(model)

# Get the residuals (differences between actual and fitted values)
residuals(model)

Explanation:

summary(model): This is the most comprehensive output. It provides:
- Call: The command used to create the model.
- Residuals: A five-number summary of the residuals, labeled Min, 1Q, Median, 3Q and Max in the output.
- Coefficients: This is the key part. It shows:
  - Estimate: The estimated values for the intercept and the coefficient for x.
  - Std. Error: The standard error of the estimated coefficients.
  - t value: The test statistic for testing if the coefficient is significantly different from zero.
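  - Pr(>|t|): The p-value associated with the t-statistic. A low p-value (typically < 0.05) is evidence that the coefficient differs from zero, i.e. that the predictor is statistically significant at that level.
- Residual standard error: An estimate of the standard deviation of the error term, computed from the residuals and the residual degrees of freedom. In this simulation it should land in the neighborhood of the sd = 20 used to generate the noise.
- Multiple R-squared: The proportion of the variance in the dependent variable that is explained by the independent variable(s), a value between 0 and 1.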
- Adjusted R-squared: Similar to R-squared but adjusted for the number of predictors in the model.
- F-statistic: A test statistic for the overall significance of the model.
coefficients(model): Returns a named vector containing the intercept and the slope coefficient.
fitted(model): Returns a vector of the predicted values of y for each observation in your data.
residuals(model): Returns a vector of the differences between the actual y values and the predicted y values.
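
The printed summary is meant for reading, but the same numbers can be pulled out programmatically. The sketch below, which rebuilds the example data so it runs on its own, extracts a few of them from the summary object and also uses confint() to get confidence intervals for the coefficients. The variable names on the left-hand side are introduced only for this illustration.

```r
# Recreate the example data and model
set.seed(123)
my_data <- data.frame(x = 1:100)
my_data$y <- 2 * my_data$x + 5 + rnorm(100, mean = 0, sd = 20)
model <- lm(y ~ x, data = my_data)

model_summary <- summary(model)

# Coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- model_summary$coefficients
slope_p_value <- coef_table["x", "Pr(>|t|)"]

# Single-number pieces of the summary
r_squared <- model_summary$r.squared
residual_se <- model_summary$sigma

# 95% confidence intervals for the intercept and slope
conf_intervals <- confint(model, level = 0.95)
print(conf_intervals)

# Fitted values plus residuals give back the observed y values
reconstructed_y <- fitted(model) + residuals(model)
print(all.equal(unname(reconstructed_y), my_data$y))
```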

4. Visualizing the Model

It is good practice to plot your data together with the fitted regression line, so you can see at a glance whether a straight line is a reasonable description.

# Plot the data and the regression line
plot(my_data$x, my_data$y, main = "Linear Regression Model",
     xlab = "Independent Variable (x)", ylab = "Dependent Variable (y)")
abline(model, col = "red", lwd = 2) # Add the regression line
legend("topleft", legend = paste("y =", round(coef(model)[2], 2), "* x +", round(coef(model)[1], 2)),
       col = "red", lty = 1, lwd = 2)

Explanation:

plot(my_data$x, my_data$y, ...): Creates a scatter plot of your data points.
- main: Sets the title of the plot.
- xlab, ylab: Set the labels for the x and y axes.
abline(model, col = "red", lwd = 2):
- abline(): A function that draws straight lines on a plot.
- model: When given a linear model object, it automatically draws the regression line defined by the model’s intercept and slope.
- col = "red": Sets the color of the line to red.
- lwd = 2: Sets the line width to 2.
legend(...): Adds a legend to the plot, showing the equation of the fitted line.
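
A common companion to the scatter plot is a plot of residuals against fitted values, where a roughly even band around zero suggests the straight-line model is adequate. This is a minimal sketch rather than a full diagnostic workflow. It writes to a temporary PDF (plot_file, a name introduced here) so that it also runs in a non-interactive session.

```r
# Recreate the example data and model
set.seed(123)
my_data <- data.frame(x = 1:100)
my_data$y <- 2 * my_data$x + 5 + rnorm(100, mean = 0, sd = 20)
model <- lm(y ~ x, data = my_data)

# Write to a temporary PDF so the sketch also runs non-interactively;
# remove the pdf() and dev.off() lines to draw on screen instead
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)

plot(fitted(model), residuals(model),
     main = "Residuals vs. Fitted Values",
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red", lty = 2)  # horizontal reference line at zero

invisible(dev.off())  # close the PDF device
```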

5. Making Predictions

You can use the fitted model to predict y values for new, unseen x values.

# Create new data for prediction
new_data <- data.frame(x = c(50, 110, 150))

# Predict y values for the new data
predictions <- predict(model, newdata = new_data)

# Display the predictions
print(predictions)

Explanation:

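new_data <- data.frame(x = c(50, 110, 150)): Creates a new data frame with the x values for which you want to make predictions. It’s crucial that this data frame has the same column name (x) as used in the model. Note that 110 and 150 lie outside the observed range of x (1 to 100), so those two predictions are extrapolations and deserve more caution than the prediction at 50.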
predict(model, newdata = new_data):
- predict(): The function for making predictions from a fitted model.
- model: The fitted linear regression model.
- newdata = new_data: Specifies the data frame containing the new predictor values.
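
Under the hood, a prediction from this model is just the intercept plus the slope times the new x value. The sketch below checks that against predict(), and then requests prediction intervals through the interval argument of predict(). The names manual, with_interval and outside_range are introduced only for this illustration.

```r
# Recreate the example data and model
set.seed(123)
my_data <- data.frame(x = 1:100)
my_data$y <- 2 * my_data$x + 5 + rnorm(100, mean = 0, sd = 20)
model <- lm(y ~ x, data = my_data)

new_data <- data.frame(x = c(50, 110, 150))

# A point prediction is intercept + slope * x
manual <- coef(model)[["(Intercept)"]] + coef(model)[["x"]] * new_data$x
predictions <- predict(model, newdata = new_data)
print(all.equal(manual, unname(predictions)))

# interval = "prediction" returns a matrix with columns fit, lwr and upr
with_interval <- predict(model, newdata = new_data,
                         interval = "prediction", level = 0.95)
print(with_interval)

# Flag new x values that fall outside the range seen during fitting
outside_range <- new_data$x < min(my_data$x) | new_data$x > max(my_data$x)
print(outside_range)
```

The intervals widen as x moves away from the center of the observed data, which is one way the uncertainty of extrapolation shows up.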

Example with Real Data (Conceptual)

If you had a CSV file named sales_data.csv with columns advertising_spend and sales, you would do this:

# 1. Load your data
sales_data <- read.csv("sales_data.csv")

# 2. Fit the model
sales_model <- lm(sales ~ advertising_spend, data = sales_data)

# 3. Examine results
summary(sales_model)

# 4. Visualize
plot(sales_data$advertising_spend, sales_data$sales,
     main = "Sales vs. Advertising Spend",
     xlab = "Advertising Spend", ylab = "Sales")
abline(sales_model, col = "blue", lwd = 2)

# 5. Predict for a new advertising spend value
new_ad_spend <- data.frame(advertising_spend = 5000)
predicted_sales <- predict(sales_model, newdata = new_ad_spend)
print(predicted_sales)
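
The steps above need a sales_data.csv file to run. If you do not have one, the following sketch writes a simulated stand-in to a temporary directory and then runs the same workflow on it. Everything about the file contents is an assumption made for illustration: the 50 rows, the spend range, and the coefficients 20000 and 3 used to generate sales are invented, not real data.

```r
# Simulate a hypothetical data set (all numbers here are invented)
set.seed(42)
n <- 50
advertising_spend <- round(runif(n, min = 1000, max = 10000))
sales <- 20000 + 3 * advertising_spend + rnorm(n, mean = 0, sd = 2000)

# Write it to a temporary CSV with the column names used above
csv_path <- file.path(tempdir(), "sales_data.csv")
write.csv(data.frame(advertising_spend = advertising_spend, sales = sales),
          csv_path, row.names = FALSE)

# Same workflow as above, reading from the temporary file
sales_data <- read.csv(csv_path)
sales_model <- lm(sales ~ advertising_spend, data = sales_data)
print(coef(sales_model))

new_ad_spend <- data.frame(advertising_spend = 5000)
predicted_sales <- predict(sales_model, newdata = new_ad_spend)
print(predicted_sales)
```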