Learn Linear and Logistic Regression in R

library(ggplot2)

## Registered S3 methods overwritten by 'tibble':
##   method     from  
##   format.tbl pillar
##   print.tbl  pillar

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(fst)

taiwan_real_estate<-read.fst('taiwan_real_estate.fst')
str(taiwan_real_estate)

## 'data.frame':    414 obs. of  4 variables:
##  $ dist_to_mrt_m  : num  84.9 306.6 562 562 390.6 ...
##  $ n_convenience  : num  10 9 5 5 5 3 7 6 1 3 ...
##  $ house_age_years: Ord.factor w/ 3 levels "0 to 15"<"15 to 30"<..: 3 2 1 1 1 1 3 2 3 2 ...
##  $ price_twd_msq  : num  11.5 12.8 14.3 16.6 13 ...

Plot price_twd_msq (y) against n_convenience (x), with a trend line fitted by linear regression

ggplot(taiwan_real_estate,aes(x=n_convenience,y=price_twd_msq)) +
  # scatter plot
  geom_point(shape=1,size=2,alpha=0.5) +
  # adding trend line by Linear Regression method
  geom_smooth( method='lm', se=FALSE)

## `geom_smooth()` using formula 'y ~ x'

## Update the plot by making the points 50% transparent (alpha = 0.5)
ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq)) +
  geom_point(alpha = 0.5)

# Run a linear regression of price_twd_msq vs. n_convenience
lm(price_twd_msq ~ n_convenience, data = taiwan_real_estate)

## 
## Call:
## lm(formula = price_twd_msq ~ n_convenience, data = taiwan_real_estate)
## 
## Coefficients:
##   (Intercept)  n_convenience  
##        8.2242         0.7981

Intercept 8.2242 meaning:

On average, a house with zero convenience stores nearby had a price of 8.2242 TWD per square meter.

0.7981 x n_convenience meaning:

If the number of nearby convenience stores increases by one, the expected house price increases by 0.7981 TWD per square meter.

Using taiwan_real_estate: 1-Plot price_twd_msq 2-Make it a histogram with 10 bins 3-Facet the plot so each house age group gets its own panel

ggplot(taiwan_real_estate,aes(price_twd_msq)) +
   geom_histogram(bins=10) +
   facet_wrap(vars(house_age_years))

Group taiwan_real_estate by house_age_years. Summarize to calculate the mean price_twd_msq for each group, naming the column mean_by_group. Assign the result to summary_stats and look at the numbers.

summary_stats <- taiwan_real_estate %>% 
  group_by(house_age_years) %>% 
  summarize(mean_by_group = mean(price_twd_msq))

summary_stats

## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?

## # A tibble: 3 x 2
##   house_age_years mean_by_group
##   <ord>                   <dbl>
## 1 0 to 15                 12.6 
## 2 15 to 30                 9.88
## 3 30 to 45                11.4

lm() with a categorical explanatory variable

Linear regressions also work with categorical explanatory variables. In this case, the code to run the model is the same, but the coefficients returned by the model are different. Here we’ll run a linear regression on the Taiwan real estate dataset.

mdl_price_vs_age <- lm(
  price_twd_msq ~ house_age_years, 
  data = taiwan_real_estate
)

mdl_price_vs_age

## 
## Call:
## lm(formula = price_twd_msq ~ house_age_years, data = taiwan_real_estate)
## 
## Coefficients:
##       (Intercept)  house_age_years.L  house_age_years.Q  
##           11.3025            -0.8798             1.7462
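The coefficient names ending in .L and .Q appear because house_age_years is an ordered factor. By default, R codes ordered factors with polynomial contrasts (contr.poly), so the coefficients are a linear (.L) and a quadratic (.Q) trend across the levels rather than per-category differences. The following is a minimal sketch on made-up toy data (the `toy` data frame and its column names are assumptions for illustration, not part of the Taiwan dataset) comparing an ordered and an unordered version of the same factor.

```r
# Toy data (hypothetical): an ordered factor with three levels
toy <- data.frame(
  age = factor(
    c("new", "new", "mid", "mid", "old", "old"),
    levels = c("new", "mid", "old"),
    ordered = TRUE
  ),
  price = c(10, 12, 7, 9, 20, 22)
)

# Ordered factor: polynomial contrasts, coefficients named age.L and age.Q
mdl_ordered <- lm(price ~ age, data = toy)
coefficients(mdl_ordered)

# Same data as an unordered factor: treatment contrasts, where the intercept
# is the mean of the first level and the other coefficients are differences
# from that level
toy$age_unordered <- factor(toy$age, ordered = FALSE)
mdl_unordered <- lm(price ~ age_unordered, data = toy)
coefficients(mdl_unordered)

# Both parameterizations describe the same model: identical fitted values
all.equal(fitted(mdl_ordered), fitted(mdl_unordered))
```

Either coding fits the same group means; only the way those means are split into coefficients changes.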

Removing the intercept from mdl_price_vs_age by adding + 0 to the formula

mdl_price_vs_age_no_intercept <- lm(
  price_twd_msq ~ house_age_years + 0, 
  data = taiwan_real_estate
)

mdl_price_vs_age_no_intercept

## 
## Call:
## lm(formula = price_twd_msq ~ house_age_years + 0, data = taiwan_real_estate)
## 
## Coefficients:
##  house_age_years0 to 15  house_age_years15 to 30  house_age_years30 to 45  
##                  12.637                    9.877                   11.393

We see that, with a single categorical explanatory variable and no intercept, the coefficients of the model are just the mean of each category (compare with summary_stats).
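The same equivalence can be checked on a small example. This is a sketch using made-up toy data (the `toy` data frame is an assumption for illustration, not the Taiwan dataset): the group means from tapply() match the coefficients of a no-intercept model.

```r
# Toy data (hypothetical): prices in three groups
toy <- data.frame(
  group = factor(c("a", "a", "b", "b", "c", "c")),
  price = c(10, 12, 7, 9, 20, 22)
)

# Group means, computed directly
group_means <- tapply(toy$price, toy$group, mean)

# Coefficients of a no-intercept model with one categorical variable
mdl <- lm(price ~ group + 0, data = toy)

group_means
coefficients(mdl)

# The two agree once names are dropped
all.equal(unname(coefficients(mdl)), as.vector(group_means))
```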

Predicting

mdl_price_vs_conv <- lm(
  price_twd_msq ~ n_convenience, 
  data = taiwan_real_estate
)
mdl_price_vs_conv

## 
## Call:
## lm(formula = price_twd_msq ~ n_convenience, data = taiwan_real_estate)
## 
## Coefficients:
##   (Intercept)  n_convenience  
##        8.2242         0.7981

explanatory_data <- tibble(
  n_convenience = 0:10
)

 # Use mdl_price_vs_conv to predict with explanatory_data
predict(mdl_price_vs_conv, explanatory_data)

##         1         2         3         4         5         6         7         8 
##  8.224237  9.022317  9.820397 10.618477 11.416556 12.214636 13.012716 13.810795 
##         9        10        11 
## 14.608875 15.406955 16.205035

Predicting inside a data frame (by adding a new column for the predictions)

# Store the predictions in a new column of prediction_data
prediction_data <- explanatory_data %>% 
  mutate(
    price_twd_msq = predict(mdl_price_vs_conv, explanatory_data)
  )

# See the result
prediction_data

## # A tibble: 11 x 2
##    n_convenience price_twd_msq
##            <int>         <dbl>
##  1             0          8.22
##  2             1          9.02
##  3             2          9.82
##  4             3         10.6 
##  5             4         11.4 
##  6             5         12.2 
##  7             6         13.0 
##  8             7         13.8 
##  9             8         14.6 
## 10             9         15.4 
## 11            10         16.2

Extend the plotting code to include the point predictions in prediction_data. Color the points yellow.

# Scatter plot with trend line and point predictions
ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq)) +
  geom_point() +
  # Add LM model layer
  geom_smooth(method = "lm", se = FALSE) +
  # Add a point layer of prediction data, colored yellow
  geom_point(data = prediction_data, color = "yellow")

## `geom_smooth()` using formula 'y ~ x'

How to access the elements of an lm model object:
1-coefficients(linear_model): the fitted intercept and slope(s).
2-fitted(linear_model): the predicted response for each row of the original data.
3-residuals(linear_model): actual response minus predicted response. They should roughly follow a normal distribution, so the median should be close to zero and the first and third quartiles should have about the same absolute value.
4-Using the “broom” library, tidy() returns a data frame of coefficient-level results and augment() returns observation-level results.
5-glance() returns model-level results.

coefficients(mdl_price_vs_conv)

##   (Intercept) n_convenience 
##     8.2242375     0.7980797

# fitted() returns the predicted response for each row of the original data
# fitted(mdl_price_vs_conv)

# residuals(mdl_price_vs_conv)

summary(mdl_price_vs_conv)

## 
## Call:
## lm(formula = price_twd_msq ~ n_convenience, data = taiwan_real_estate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.7132  -2.2213  -0.5409   1.8105  26.5299 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    8.22424    0.28500   28.86   <2e-16 ***
## n_convenience  0.79808    0.05653   14.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.384 on 412 degrees of freedom
## Multiple R-squared:  0.326,  Adjusted R-squared:  0.3244 
## F-statistic: 199.3 on 1 and 412 DF,  p-value: < 2.2e-16
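The relationships between these accessor functions can be checked directly. The sketch below uses R's built-in `cars` dataset purely for illustration (it is not the Taiwan data, and `mdl_cars` is a name introduced here): fitted values equal predict() on the training data, and residuals equal the actual response minus the fitted values.

```r
# Built-in `cars` dataset, used only for illustration
mdl_cars <- lm(dist ~ speed, data = cars)

# Fitted values: the model's predictions for the rows it was fitted on
fitted_values <- fitted(mdl_cars)

# The same numbers via predict() on the original data
all.equal(unname(fitted_values), unname(predict(mdl_cars, cars)))

# Residuals: actual response minus fitted response
all.equal(unname(residuals(mdl_cars)), cars$dist - unname(fitted_values))

# Quartiles of the residuals, as reported in the "Residuals" block of summary()
quantile(residuals(mdl_cars))
```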

Doing predictions Manually

# Get the coefficients of mdl_price_vs_conv
coeffs <- coefficients(mdl_price_vs_conv)

# Get the intercept
intercept <- coeffs[1]

# Get the slope
slope <- coeffs[2]

explanatory_data %>% 
  mutate(
    # Manually calculate the predictions
    price_twd_msq = intercept + slope * n_convenience
  )

## # A tibble: 11 x 2
##    n_convenience price_twd_msq
##            <int>         <dbl>
##  1             0          8.22
##  2             1          9.02
##  3             2          9.82
##  4             3         10.6 
##  5             4         11.4 
##  6             5         12.2 
##  7             6         13.0 
##  8             7         13.8 
##  9             8         14.6 
## 10             9         15.4 
## 11            10         16.2

# Compare to the results from predict()
# predict(mdl_price_vs_conv, explanatory_data)

# Get the coefficient-level elements of the model
#library(broom)
# tidy(mdl_price_vs_conv)
# Get the observation-level elements of the model
# augment(mdl_price_vs_conv)

Transforming the explanatory variable

If there is no straight line relationship between the response variable and the explanatory variable, it is sometimes possible to create one by transforming one or both of the variables.

# Plot price_twd_msq against the untransformed dist_to_mrt_m
ggplot(taiwan_real_estate, aes(dist_to_mrt_m, price_twd_msq)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

Now we transform the x aesthetic by taking its square root

# The x-axis is now the square root of dist_to_mrt_m
ggplot(taiwan_real_estate, aes(sqrt(dist_to_mrt_m), price_twd_msq)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

Run a linear regression of price_twd_msq versus the square root of dist_to_mrt_m using taiwan_real_estate.

mdl_price_vs_dist <- lm(price_twd_msq ~ sqrt(dist_to_mrt_m), data = taiwan_real_estate)
mdl_price_vs_dist

## 
## Call:
## lm(formula = price_twd_msq ~ sqrt(dist_to_mrt_m), data = taiwan_real_estate)
## 
## Coefficients:
##         (Intercept)  sqrt(dist_to_mrt_m)  
##             16.7098              -0.1828
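To see why transforming the explanatory variable can help, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration (the `sim` data frame, the coefficients 20 and -0.3, the noise level, and the seed); it is not the Taiwan dataset. The response is generated to be linear in sqrt(x), and the R-squared of a model on the raw x is compared with that of a model on sqrt(x).

```r
set.seed(1)  # arbitrary seed, for reproducibility

# Simulated data (hypothetical): y is linear in sqrt(x), plus noise
x <- seq(1, 2500, length.out = 200)
y <- 20 - 0.3 * sqrt(x) + rnorm(200, sd = 0.5)
sim <- data.frame(x = x, y = y)

# Model on the raw explanatory variable vs. the transformed one
mdl_raw  <- lm(y ~ x, data = sim)
mdl_sqrt <- lm(y ~ sqrt(x), data = sim)

# R-squared of each model
r2_raw  <- summary(mdl_raw)$r.squared
r2_sqrt <- summary(mdl_sqrt)$r.squared
c(raw = r2_raw, sqrt = r2_sqrt)
```

In this simulated setup, the model that uses sqrt(x) explains more of the variance because it matches how the data were generated.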

1-Create a data frame of prediction data named prediction_data. 2-Start with explanatory_data, and add a column named after the response variable. 3-Predict values using mdl_price_vs_dist and explanatory_data

# Make a tibble of the values 0, 10, ..., 80 squared, so that their square roots are evenly spaced
explanatory_data <- tibble(
  dist_to_mrt_m = seq(0, 80, 10) ^ 2
)
prediction_data <- explanatory_data %>% 
  mutate(
    price_twd_msq = predict(mdl_price_vs_dist, explanatory_data)
  )

# See the result
prediction_data

## # A tibble: 9 x 2
##   dist_to_mrt_m price_twd_msq
##           <dbl>         <dbl>
## 1             0         16.7 
## 2           100         14.9 
## 3           400         13.1 
## 4           900         11.2 
## 5          1600          9.40
## 6          2500          7.57
## 7          3600          5.74
## 8          4900          3.91
## 9          6400          2.08

plot

ggplot(taiwan_real_estate, aes(sqrt(dist_to_mrt_m), price_twd_msq)) +
  geom_point()

# Add points from prediction_data, colored green, size 5
ggplot(taiwan_real_estate, aes(sqrt(dist_to_mrt_m), price_twd_msq)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # Now add the PREDICTION DATA , colored green & size 5
  geom_point(data = prediction_data,color='green',size=5)

## `geom_smooth()` using formula 'y ~ x'

Result: By transforming the explanatory variable, the relationship with the response variable became linear, and so a linear regression became an appropriate model.

Transforming the response variable too

The response variable can be transformed too, but this means you need an extra step at the end to undo that transformation. That is, we “back transform” the predictions.

ad_conversion<-read.fst('ad_conversion.fst')

# Raise both the x and y aesthetics to the power 0.25
ggplot(ad_conversion, aes(n_impressions^0.25, n_clicks^0.25)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

Run a linear regression of n_clicks to the power 0.25 vs. n_impressions to the power 0.25 using ad_conversion

mdl_click_vs_impression <- lm(I(n_clicks^0.25) ~ I(n_impressions^0.25), data = ad_conversion)

For predictions, make a tibble with n_impressions = seq(0, 3e6, 5e5), predict with it, and store the result in a new column called n_clicks_025. Because the model was fitted to n_clicks raised to the power 0.25 (that is, 1/4), these predictions are on the transformed scale. Raise them to the power 4 and store the result in the n_clicks column. This extra step undoes the transformation. That is, we “back transform” the predictions.

# Use this explanatory data
explanatory_data <- tibble(
  n_impressions = seq(0, 3e6, 5e5)
)

prediction_data <- explanatory_data %>% 
  mutate(
    n_clicks_025 = predict(mdl_click_vs_impression, explanatory_data),
    n_clicks = n_clicks_025 ^ 4
  )

ggplot(ad_conversion, aes(n_impressions ^ 0.25, n_clicks ^ 0.25)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # Add points from prediction_data, colored green
  geom_point(data = prediction_data, color = "green")

## `geom_smooth()` using formula 'y ~ x'
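The back transformation can be checked by hand on a small example. This is a sketch on simulated data (the `sim` data frame, its coefficients 0.1 and 0.2, the noise level, and the seed are assumptions for illustration, not the ad_conversion data): predictions made on the power-0.25 scale are raised to the power 4 and compared with the same calculation done from the coefficients.

```r
set.seed(1)  # arbitrary seed, for reproducibility

# Simulated data (hypothetical): y^0.25 is linear in x^0.25, plus noise
sim <- data.frame(x = seq(1, 1e6, length.out = 100))
sim$y <- (0.1 + 0.2 * sim$x^0.25 + rnorm(100, sd = 0.05))^4

# Fit on the transformed scale; I() protects ^ inside the formula
mdl <- lm(I(y^0.25) ~ I(x^0.25), data = sim)

new_data <- data.frame(x = c(1e4, 1e5))
pred_025 <- predict(mdl, new_data)  # predictions on the transformed scale
pred     <- pred_025^4              # back-transformed to the original scale

# The same numbers computed manually from the coefficients
coeffs <- coefficients(mdl)
manual <- (coeffs[1] + coeffs[2] * new_data$x^0.25)^4
all.equal(unname(pred), unname(manual))
```

Forgetting the final ^4 step would leave the predictions on the power-0.25 scale, which is not comparable to the original response.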