MSBA320_Midterm_R_Kanupriya.utf8

MSBA 320 - Midterm 1 - Kanupriya

Facebook Friends Analysis - Using Linear Regression to predict no. of friends on Facebook

Data cleaning: – 26 records with column “Children” containing values other than 0,1 have been modified to reflect Children=1. This field now displays presence/absence of children, instead of number of children. – 7 records with column “Gender” containing values other than 0,1 have been deleted as there was no way to categorize them as either 0 or 1.

#install.packages("goeveg")
library(goeveg)

## Warning: package 'goeveg' was built under R version 4.0.2

## Welcome to the GoeVeg Package

library(tidyverse)

## -- Attaching packages ------------------------------------------ tidyverse 1.3.0 --

## v ggplot2 3.3.0     v purrr   0.3.4
## v tibble  3.0.0     v dplyr   0.8.5
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts --------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

#install.packages("tigerstats")
library(tigerstats)

## Warning: package 'tigerstats' was built under R version 4.0.2

## Loading required package: abd

## Warning: package 'abd' was built under R version 4.0.2

## Loading required package: nlme

## 
## Attaching package: 'nlme'

## The following object is masked from 'package:dplyr':
## 
##     collapse

## Loading required package: lattice

## Loading required package: grid

## Loading required package: mosaic

## Warning: package 'mosaic' was built under R version 4.0.2

## Loading required package: ggformula

## Warning: package 'ggformula' was built under R version 4.0.2

## Loading required package: ggstance

## Warning: package 'ggstance' was built under R version 4.0.2

## 
## Attaching package: 'ggstance'

## The following objects are masked from 'package:ggplot2':
## 
##     geom_errorbarh, GeomErrorbarh

## 
## New to ggformula?  Try the tutorials: 
##  learnr::run_tutorial("introduction", package = "ggformula")
##  learnr::run_tutorial("refining", package = "ggformula")

## Loading required package: mosaicData

## Warning: package 'mosaicData' was built under R version 4.0.2

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## Registered S3 method overwritten by 'mosaic':
##   method                           from   
##   fortify.SpatialPolygonsDataFrame ggplot2

## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.
## 
## Note: If you use the Matrix package, be sure to load it BEFORE loading mosaic.
## 
## Have you tried the ggformula package for your plots?

## 
## Attaching package: 'mosaic'

## The following object is masked from 'package:Matrix':
## 
##     mean

## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally

## The following object is masked from 'package:purrr':
## 
##     cross

## The following object is masked from 'package:ggplot2':
## 
##     stat

## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var

## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum

## Welcome to tigerstats!
## To learn more about this package, consult its website:
##  http://homerhanumat.github.io/tigerstats

#install.packages("corrplot")
library(corrplot)

## Warning: package 'corrplot' was built under R version 4.0.2

## corrplot 0.84 loaded

library(car)

## Warning: package 'car' was built under R version 4.0.2

## Loading required package: carData

## 
## Attaching package: 'car'

## The following objects are masked from 'package:mosaic':
## 
##     deltaMethod, logit

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tigerstats':
## 
##     tips

## The following object is masked from 'package:tidyr':
## 
##     smiths

Describe the data structure and identify int and binary variables.

fbdata = read.csv("C:\\temp\\FacebookFriends.csv", header = TRUE)
names(fbdata) # displays column names

##  [1] "ï..Sample"    "Age"          "Photos"       "Tags"         "Albums"      
##  [6] "Gender"       "Emp"          "Profile"      "Cover"        "Orientation" 
## [11] "Relationship" "Posts"        "Replies"      "Children"     "Likes"       
## [16] "Edu"          "Events"       "Friends"

str(fbdata) # displays dataframe structure

## 'data.frame':    708 obs. of  18 variables:
##  $ ï..Sample   : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age         : int  57 42 42 55 30 25 28 32 29 25 ...
##  $ Photos      : int  7 531 1396 394 916 241 138 167 922 231 ...
##  $ Tags        : int  27 241 423 139 231 345 206 581 1702 209 ...
##  $ Albums      : int  5 19 40 48 78 6 15 10 23 8 ...
##  $ Gender      : int  1 0 0 0 1 1 1 1 0 0 ...
##  $ Emp         : int  1 0 1 1 1 1 1 1 1 1 ...
##  $ Profile     : int  1 0 1 1 1 1 0 1 1 1 ...
##  $ Cover       : int  0 0 0 1 1 1 1 0 1 1 ...
##  $ Orientation : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Relationship: int  1 1 1 1 1 0 0 0 1 0 ...
##  $ Posts       : int  1 5 186 6 412 543 254 236 8 11 ...
##  $ Replies     : int  0 1 31 4 752 654 163 195 9 4 ...
##  $ Children    : int  1 1 1 1 1 0 0 0 1 1 ...
##  $ Likes       : int  21 9 43 187 34 48 151 64 9 29 ...
##  $ Edu         : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ Events      : int  0 1 1 4 2 5 1 3 2 27 ...
##  $ Friends     : int  68 234 727 437 372 552 248 875 636 440 ...

Above output identifies object “fbdata” as a dataframe with 708 observations of 18 variables. The variables in this dataset are of the following two types: 1. integer type/continuous - Age, Photos, Tags, Albums, Posts, Replies, Likes, Events, Friends 2. binary type/discrete - Gender, Emp, Profile, Cover, Orientation, Relationship, Children, Edu

Get the mean, median, sample standard deviation, and coefficient of variation for all variables. What do these statistics tell you about the distributions?

summary(fbdata) # summary statistics for EDA

##    ï..Sample          Age            Photos             Tags         
##  Min.   :  1.0   Min.   :13.00   Min.   :    0.0   Min.   :    0.00  
##  1st Qu.:177.8   1st Qu.:21.00   1st Qu.:  119.5   1st Qu.:   74.25  
##  Median :354.5   Median :22.00   Median :  317.5   Median :  245.00  
##  Mean   :354.5   Mean   :24.39   Mean   :  725.2   Mean   :  459.87  
##  3rd Qu.:531.2   3rd Qu.:25.00   3rd Qu.:  757.0   3rd Qu.:  572.25  
##  Max.   :708.0   Max.   :81.00   Max.   :11995.0   Max.   :10460.00  
##      Albums           Gender            Emp            Profile      
##  Min.   :  0.00   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:  6.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.0000  
##  Median : 12.00   Median :1.0000   Median :1.0000   Median :1.0000  
##  Mean   : 18.97   Mean   :0.5268   Mean   :0.7429   Mean   :0.9167  
##  3rd Qu.: 22.25   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :172.00   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      Cover         Orientation      Relationship        Posts       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :  0.00  
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:  6.00  
##  Median :1.0000   Median :1.0000   Median :0.0000   Median : 15.00  
##  Mean   :0.5042   Mean   :0.9562   Mean   :0.4915   Mean   : 37.85  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.: 30.00  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :794.00  
##     Replies          Children          Likes             Edu        
##  Min.   :  0.00   Min.   :0.0000   Min.   :   0.0   Min.   :0.0000  
##  1st Qu.:  5.00   1st Qu.:0.0000   1st Qu.:  21.0   1st Qu.:0.0000  
##  Median : 14.00   Median :0.0000   Median :  59.0   Median :0.0000  
##  Mean   : 35.01   Mean   :0.1398   Mean   : 143.9   Mean   :0.2599  
##  3rd Qu.: 29.00   3rd Qu.:0.0000   3rd Qu.: 145.0   3rd Qu.:1.0000  
##  Max.   :752.00   Max.   :1.0000   Max.   :2619.0   Max.   :1.0000  
##      Events           Friends      
##  Min.   :  0.000   Min.   :   3.0  
##  1st Qu.:  1.000   1st Qu.: 304.8  
##  Median :  4.000   Median : 565.5  
##  Mean   :  8.874   Mean   : 699.6  
##  3rd Qu.:  9.000   3rd Qu.: 899.8  
##  Max.   :396.000   Max.   :4999.0

Above output shows mean and median for all variables. These are the measures of central tendency and tell us about the centrality of the data, i.e., what’s the average value? Ideally, we would expect mean to be equal to median. However, that’s not usually the case as data may contain outliers. Since mean is impacted by outliers while median is more robust, we can use the difference between the two to understand the direction of skew.

Mean > Median : Positive skew Mean < Median : Negative skew Mean = Median : No skew

In our dataset “fbdata”, all continuous variables are positively skewed.

It is important to note that mean and median are useful measures but not for binary data. Binary data is best summarized through counts and percentages of the constituents (see 3a below). In the case of binary data, the mean might give us some indication of the majority preference (like, “Gender” has Mean=0.5268, shows the population has more 1s(males) than 0s (females)), but the median (i.e., middle value) provides no useful information. In fact, it is incorrect to calculate a median for binary variables as it can only be meaningfully calculated for quantities that have order and are numeric.

Next, we calculate sample standard deviation (SD) and coefficient of variation (CV) for each variable.

# calculates and stores SD for all variables in the exact order as they appear in the dataset
SDfbdata = c(sd(fbdata$ï..Sample),sd(fbdata$Age),sd(fbdata$Photos),sd(fbdata$Tags),sd(fbdata$Albums),sd(fbdata$Gender),sd(fbdata$Emp),sd(fbdata$Profile),sd(fbdata$Cover),sd(fbdata$Orientation),sd(fbdata$Relationship),sd(fbdata$Posts),sd(fbdata$Replies),sd(fbdata$Children),sd(fbdata$Likes),sd(fbdata$Edu),sd(fbdata$Events),sd(fbdata$Friends))

# calculates and stores CV for all variables in the exact order as they appear in the dataset
CVfbdata = c(cv(fbdata$ï..Sample),cv(fbdata$Age),cv(fbdata$Photos),cv(fbdata$Tags),cv(fbdata$Albums),cv(fbdata$Gender),cv(fbdata$Emp),cv(fbdata$Profile),cv(fbdata$Cover),cv(fbdata$Orientation),cv(fbdata$Relationship),cv(fbdata$Posts),cv(fbdata$Replies),cv(fbdata$Children),cv(fbdata$Likes),cv(fbdata$Edu),cv(fbdata$Events),cv(fbdata$Friends))

# retrieves column names from fbdata and stores in vector LblNames
LblNames = colnames(fbdata)
  
# Creating dataframes to hold newly calculated SD and CV variables for our attributes
SDfbdata = data.frame(LblNames,SDfbdata)
print(SDfbdata)

##        LblNames     SDfbdata
## 1     ï..Sample  204.5262819
## 2           Age    6.9498059
## 3        Photos 1270.9496198
## 4          Tags  724.0591398
## 5        Albums   22.5988178
## 6        Gender    0.4996323
## 7           Emp    0.4373229
## 8       Profile    0.2765808
## 9         Cover    0.5003355
## 10  Orientation    0.2047618
## 11 Relationship    0.5002816
## 12        Posts   90.8427790
## 13      Replies   84.0115155
## 14     Children    0.3470563
## 15        Likes  261.3122451
## 16          Edu    0.4388825
## 17       Events   24.1071339
## 18      Friends  600.9328886

CVfbdata = data.frame(LblNames,CVfbdata)
print(CVfbdata)

##        LblNames  CVfbdata
## 1     ï..Sample 0.5769430
## 2           Age 0.2849799
## 3        Photos 1.7524365
## 4          Tags 1.5744864
## 5        Albums 1.1910051
## 6        Gender 0.9483637
## 7           Emp 0.5886400
## 8       Profile 0.3017245
## 9         Cover 0.9922620
## 10  Orientation 0.2141378
## 11 Relationship 1.0178143
## 12        Posts 2.4003242
## 13      Replies 2.3993607
## 14     Children 2.4819782
## 15        Likes 1.8157013
## 16          Edu 1.6887434
## 17       Events 2.7165129
## 18      Friends 0.8589227

Standard Deviation (SD) is a useful measure in understanding the spread of the data. It is derived by calculating the square root of the variance - the metric that measures the average degree to which each number is different from the mean. Variance is calculated using squared values to ensure that positive and negative distances from the mean do not cancel each other out when added together. The more spread out a group of numbers is, the higher is the standard deviation of that group.

Coefficient of Variation (CV) is another measure of data dispersion and shows us the extent of variability around the mean. It is calculated as a ratio of (std dev:mean) and, being unit-less, is useful when measuring degree of dispersion between different variables. However, there are certain limitations to this interpretation and CV can be considered a reasonable metric only for ratio-scale variables. It is not a useful metric for variables with mean close to zero (as is the case with binary types) since CV approaches infinity in this case.

In the given dataset (comparing only non-binary variables), we can conclude that field “Events” demonstrates the highest variability among all, while “Age” displays the least.

Choose all non-binary variables and create frequency tabulations, boxplots, and histograms for all. Describe the distribution for each.

3a. FREQUENCY TABULATIONS - created for binary types below:

table(fbdata$Gender)

## 
##   0   1 
## 335 373

table(fbdata$Emp)

## 
##   0   1 
## 182 526

table(fbdata$Profile)

## 
##   0   1 
##  59 649

table(fbdata$Cover)

## 
##   0   1 
## 351 357

table(fbdata$Orientation)

## 
##   0   1 
##  31 677

table(fbdata$Relationship)

## 
##   0   1 
## 360 348

table(fbdata$Children)

## 
##   0   1 
## 609  99

table(fbdata$Edu)

## 
##   0   1 
## 524 184

Frequency tabulations show us how frequently a specific value occurs in a set of data. Alternatively, we can also examine above distributions using % values for each factor:

rowPerc(table(fbdata$Gender))

##   
##        0     1 Total
##    47.32 52.68   100

rowPerc(table(fbdata$Emp))

##   
##        0     1 Total
##    25.71 74.29   100

rowPerc(table(fbdata$Profile))

##   
##       0     1 Total
##    8.33 91.67   100

rowPerc(table(fbdata$Cover))

##   
##        0     1 Total
##    49.58 50.42   100

rowPerc(table(fbdata$Orientation))

##   
##       0     1 Total
##    4.38 95.62   100

rowPerc(table(fbdata$Relationship))

##   
##        0     1 Total
##    50.85 49.15   100

rowPerc(table(fbdata$Children))

##   
##        0     1 Total
##    86.02 13.98   100

rowPerc(table(fbdata$Edu))

##   
##        0     1 Total
##    74.01 25.99   100

3b. BOXPLOTS - created for non-binary types below

ggplot(fbdata, aes(x=Age)) + 
  geom_boxplot()

ggplot(fbdata, aes(x=Photos)) + 
  geom_boxplot()

ggplot(fbdata, aes(x=Tags)) + 
  geom_boxplot()

ggplot(fbdata, aes(x=Albums)) + 
  geom_boxplot()

ggplot(fbdata, aes(x=Posts)) + 
  geom_boxplot()

ggplot(fbdata, aes(x=Replies)) + 
  geom_boxplot()

ggplot(fbdata, aes(x=Likes)) + 
  geom_boxplot()

ggplot(fbdata, aes(x=Events)) + 
  geom_boxplot()

ggplot(fbdata, aes(x=Friends)) + 
  geom_boxplot()

Boxplots display a 5-number summary, including minimum, 1st quartile/25th percentile, median, 3rd quartile/75% percentile, and maximum, which is useful when describing data distribution, specifically: -Centrality: The average values can be inferred via median -Dispersion: We can infer two important measures from a boxplot - range and IQR (Inter-Quartile Range). Range is the distance between the two whiskers, while IQR is the distance between the lines signifying 1st and 3rd quartiles. -Signs of skewness: The positioning of the median line w.r.t. the 1st and 3rd quartile lines tells us about the skew of the data: -median close to 3rd quartile - skewed left -median in the center - symmetric/not skewed -median close to the 1st quartile - skewed right In addition to above, boxplots also give a quick view of the outliers present. These are plotted as individual points outside the plot. It is important to note that boxplots calculate outliers according to Tukey’s method - values > 1.5*IQR from the 1st/3rd quartiles (not using the empirical rule of (+-)3SD).

3c. HISTOGRAMS - created for non-binary types below:

ggplot(fbdata, aes(x=Age)) + 
  geom_histogram(binwidth = function(x) 2*IQR(x)/(length(x)^(1/3)))

ggplot(fbdata, aes(x=Photos)) + 
  geom_histogram(binwidth = function(x) 2*IQR(x)/(length(x)^(1/3)))

ggplot(fbdata, aes(x=Tags)) + 
  geom_histogram(binwidth = function(x) 2*IQR(x)/(length(x)^(1/3)))

ggplot(fbdata, aes(x=Albums)) + 
  geom_histogram(binwidth = function(x) 2*IQR(x)/(length(x)^(1/3)))

ggplot(fbdata, aes(x=Posts)) + 
  geom_histogram(binwidth = function(x) 2*IQR(x)/(length(x)^(1/3)))

ggplot(fbdata, aes(x=Replies)) + 
  geom_histogram(binwidth = function(x) 2*IQR(x)/(length(x)^(1/3)))

ggplot(fbdata, aes(x=Likes)) + 
  geom_histogram(binwidth = function(x) 2*IQR(x)/(length(x)^(1/3)))

ggplot(fbdata, aes(x=Events)) + 
  geom_histogram(binwidth = function(x) 2*IQR(x)/(length(x)^(1/3)))

ggplot(fbdata, aes(x=Friends)) + 
  geom_histogram(binwidth = function(x) 2*IQR(x)/(length(x)^(1/3)))

Histograms are a graphical visualization of frequency tabuations and provide insight on the following aspects of a distribution’s shape: - mode - modal class is one that is higher than those on either side of it. Can be unimodal/bimodal/multimodal/no mode(uniform) - skew - right/positive skew (long right tail), left/negative skew (long left tail), no skew (symmetric)

In our dataset fbdata, all non-binary variables show a multimodal, right skewed distribution.This is in line with our expectations from business data which is usually bound by zero on the left but remains unbounded on the right.

Propose a regression model to predict number of Friends by selecting significant non-binary variables. Estimate the model and describe the fit (R-squared, etc.). Which of your proposed predictors are significant? (Note: need to use the correlation map / heatmap to provide quantitative values for significance).

Our first task here is to analyze correlation between the non-binary variables.

#creating a new dataframe to hold all non-binary values from fbdata for further analysis.
fbdatanb = data.frame(fbdata$Age,fbdata$Photos,fbdata$Tags,fbdata$Albums,fbdata$Posts,fbdata$Replies,fbdata$Likes,fbdata$Events,fbdata$Friends)

cormat = cor(fbdatanb) #creates and stores correlation matrix in variable "cormat"

corrplot(cormat, type = "lower") #plots correlation matrix - method 1

heatmap(cormat, symm = TRUE) #plots correlation matrix - heatmap style - method 2

qplot(x=Var1, y=Var2, data=melt(cormat), fill=value, geom="tile") #plots correlation matrix - heatmap style - method 3

Looking at the correlation matrix, we observe: - strong positive correlation “Albums and Photos” “Replies and Posts” - Moderate positive correlation “Tags and Albums” “Tags and Photos” “Likes and Photos” “Likes and Albums” “Likes and Friends” “Replies and Albums” “Likes and Albums” “Posts and Albums” “Albums and Friends” “Tags and Friends” “Photos and Friends” - Moderate negative correlation “Age and Friends” “Age and Likes” “Age and Tags” “Age and Photos” All other combinations show weakly correlation and will therefore be ignored in further analysis.

As we start building the regression model, we choose variables “Photos” as the explanatory variable since it displays the strongest correlation with “Friends”.

Model 1 Response Variable = Friends Potential Explanatory Variables = Photos

model1 <- lm(Friends~Photos, data = fbdata)
summary(model1)

## 
## Call:
## lm(formula = Friends ~ Photos, data = fbdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1612.8  -369.6  -120.6   167.7  4319.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 615.56714   25.23145  24.397  < 2e-16 ***
## Photos        0.11592    0.01725   6.719 3.77e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 583 on 706 degrees of freedom
## Multiple R-squared:  0.0601, Adjusted R-squared:  0.05877 
## F-statistic: 45.15 on 1 and 706 DF,  p-value: 3.768e-11

Above output shows that “Photos” is a significant variable (p-value tends to zero). However, looking at the R squared value, we see that this model explains only 6% of the variation in Friends. To boost the explanatory power of our regression model, we add another explanatory variable to it.

Model 2 Response Variable = Friends Potential Explanatory Variables = Photos, Tags

model2 <- lm(Friends~Photos+Tags, data = fbdata)
summary(model2)

## 
## Call:
## lm(formula = Friends ~ Photos + Tags, data = fbdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1310.8  -359.4  -121.5   167.7  4348.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 585.72015   26.44987  22.145  < 2e-16 ***
## Photos        0.07858    0.02018   3.895 0.000108 ***
## Tags          0.12378    0.03541   3.495 0.000503 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 578.4 on 705 degrees of freedom
## Multiple R-squared:  0.07611,    Adjusted R-squared:  0.07349 
## F-statistic: 29.04 on 2 and 705 DF,  p-value: 7.6e-13

Since p-values for both variables are small (ideally <0.05), both are significant. We also notice an uplift in the R-squared value - this model explains 7.6% variation in Friends. To compare model 1 and 2, we look at adjusted R-squared value (see reason below) which turns out to be higher for model 2 - confirming model 2 is better than model 1. To boost model 2’s predictive power, we add another explanatory variable next.

“The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance. The adjusted R-squared can be negative, but it’s usually not. It is always lower than the R-squared.” (source - https://blog.minitab.com/blog/adventures-in-statistics-2/multiple-regession-analysis-use-adjusted-r-squared-and-predicted-r-squared-to-include-the-correct-number-of-variables)

Model 3 Response Variable = Friends Potential Explanatory Variables = Photos, Tags, Likes

model3 <- lm(Friends~Photos+Tags+Likes, data = fbdata)
summary(model3)

## 
## Call:
## lm(formula = Friends ~ Photos + Tags + Likes, data = fbdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1242.7  -344.5  -118.6   172.3  4345.1 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 526.40835   27.79659  18.938  < 2e-16 ***
## Photos        0.06765    0.01981   3.415 0.000675 ***
## Tags          0.12087    0.03462   3.491 0.000511 ***
## Likes         0.47651    0.08199   5.812 9.35e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 565.4 on 704 degrees of freedom
## Multiple R-squared:  0.1184, Adjusted R-squared:  0.1147 
## F-statistic: 31.52 on 3 and 704 DF,  p-value: < 2.2e-16

Once again, all explanatory variables are significant and both R-squared and adjusted R-squared values show an uplift. Adding another variable to further improve our model.

Model 4 Response Variable = Friends Potential Explanatory Variables = Photos, Tags, Likes, Albums

model4 <- lm(Friends~Photos+Tags+Likes+Albums, data = fbdata)
summary(model4)

## 
## Call:
## lm(formula = Friends ~ Photos + Tags + Likes + Albums, data = fbdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1234.2  -344.3  -117.2   171.4  4345.3 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 525.12179   29.86306  17.584  < 2e-16 ***
## Photos        0.06530    0.02808   2.326 0.020318 *  
## Tags          0.11981    0.03579   3.348 0.000858 ***
## Likes         0.47554    0.08245   5.768  1.2e-08 ***
## Albums        0.19083    1.61170   0.118 0.905782    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 565.8 on 703 degrees of freedom
## Multiple R-squared:  0.1184, Adjusted R-squared:  0.1134 
## F-statistic: 23.61 on 4 and 703 DF,  p-value: < 2.2e-16

Adjusted R-squared falls in model 4, so removing “Albums” from our regression. Replacing with the next variable in-line “Age”.

Model 5 Response Variable = Friends Potential Explanatory Variables = Photos, Tags, Likes, Age

model5 <- lm(Friends~Photos+Tags+Likes+Age, data = fbdata)
summary(model5)

## 
## Call:
## lm(formula = Friends ~ Photos + Tags + Likes + Age, data = fbdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1216.4  -328.5  -105.6   172.2  4467.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 831.12312   83.57385   9.945  < 2e-16 ***
## Photos        0.06708    0.01962   3.419 0.000665 ***
## Tags          0.10835    0.03444   3.146 0.001725 ** 
## Likes         0.42649    0.08221   5.187 2.79e-07 ***
## Age         -11.94655    3.09375  -3.862 0.000123 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 559.9 on 703 degrees of freedom
## Multiple R-squared:  0.1367, Adjusted R-squared:  0.1318 
## F-statistic: 27.84 on 4 and 703 DF,  p-value: < 2.2e-16

This model looks better than model 3 since adjust R-squared value has increased.

Next, we will calculate the VIF (Variance Inflationary Factor) for model 5 to diagnose any possible multicollinearity issues.

vif(model5)

##   Photos     Tags    Likes      Age 
## 1.402139 1.402147 1.040797 1.042485

Since all VIF values are low (<10), we can conclude that this model does not have a multicollinearity issue. Plotting the model below:

par(mfrow=c(2,2))
plot(model5)

Graph at (-1,1) - shows the residuals against the fitted values. It looks good as the red line is quite close to the dotted line (confirming linearity) and we do not see any evidence of curvature (confirming homoskedasticity). Graph at (1,1) - shows the ordered residuals plotted against the quantiles of the standard normal distribution. If, as we hope, our errors really are normal, then this plot should be linear. The data are well behaved with only 3 points having a larger residual than expected (confirming normality of errors). Graph at (-1,-1) - very like the first plot but shows the square root of the standardised residuals against the fitted values (useful for detecting non-constant variance). The red line is almost horizontal and the spread around it doesn’t vary with the fitted values, ruling out heteroskedasticity. Graph at (1,-1) - shows the Residuals vs Leverage plot which is useful in identifying outliers in our regression model. The dotted line represents Cook’s distance - anything outside it is an outlier. Here, we see that 1 value (240) is an outlier.

Therefore, we accept model 5 as our final regression model. Next, we take one of the 7 data points that we had excluded earlier (with Gender=2), and use model 5 to predict number of Friends for that user:

predict(model5,list(Photos=1260,Tags=431,Likes=140,Age=24))

##        1 
## 735.3308

Actual no. of Friends = 723 Predicted no. of Friends = 735