MSBA 320 - Midterm 1 - Kanupriya
Facebook Friends Analysis - Using Linear Regression to predict no. of friends on Facebook
Data cleaning: – 26 records with column “Children” containing values other than 0,1 have been modified to reflect Children=1. This field now displays presence/absence of children, instead of number of children. – 7 records with column “Gender” containing values other than 0,1 have been deleted as there was no way to categorize them as either 0 or 1.
#install.packages("goeveg")
library(goeveg)
## Warning: package 'goeveg' was built under R version 4.0.2
## Welcome to the GoeVeg Package
library(tidyverse)
## -- Attaching packages ------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.3.0 v purrr 0.3.4
## v tibble 3.0.0 v dplyr 0.8.5
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts --------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
#install.packages("tigerstats")
library(tigerstats)
## Warning: package 'tigerstats' was built under R version 4.0.2
## Loading required package: abd
## Warning: package 'abd' was built under R version 4.0.2
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## Loading required package: lattice
## Loading required package: grid
## Loading required package: mosaic
## Warning: package 'mosaic' was built under R version 4.0.2
## Loading required package: ggformula
## Warning: package 'ggformula' was built under R version 4.0.2
## Loading required package: ggstance
## Warning: package 'ggstance' was built under R version 4.0.2
##
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
##
## geom_errorbarh, GeomErrorbarh
##
## New to ggformula? Try the tutorials:
## learnr::run_tutorial("introduction", package = "ggformula")
## learnr::run_tutorial("refining", package = "ggformula")
## Loading required package: mosaicData
## Warning: package 'mosaicData' was built under R version 4.0.2
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Registered S3 method overwritten by 'mosaic':
## method from
## fortify.SpatialPolygonsDataFrame ggplot2
##
## The 'mosaic' package masks several functions from core packages in order to add
## additional features. The original behavior of these functions should not be affected by this.
##
## Note: If you use the Matrix package, be sure to load it BEFORE loading mosaic.
##
## Have you tried the ggformula package for your plots?
##
## Attaching package: 'mosaic'
## The following object is masked from 'package:Matrix':
##
## mean
## The following objects are masked from 'package:dplyr':
##
## count, do, tally
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:ggplot2':
##
## stat
## The following objects are masked from 'package:stats':
##
## binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
## quantile, sd, t.test, var
## The following objects are masked from 'package:base':
##
## max, mean, min, prod, range, sample, sum
## Welcome to tigerstats!
## To learn more about this package, consult its website:
## http://homerhanumat.github.io/tigerstats
#install.packages("corrplot")
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.0.2
## corrplot 0.84 loaded
library(car)
## Warning: package 'car' was built under R version 4.0.2
## Loading required package: carData
##
## Attaching package: 'car'
## The following objects are masked from 'package:mosaic':
##
## deltaMethod, logit
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tigerstats':
##
## tips
## The following object is masked from 'package:tidyr':
##
## smiths
fbdata = read.csv("C:\\temp\\FacebookFriends.csv", header = TRUE)
names(fbdata) # displays column names
## [1] "ï..Sample" "Age" "Photos" "Tags" "Albums"
## [6] "Gender" "Emp" "Profile" "Cover" "Orientation"
## [11] "Relationship" "Posts" "Replies" "Children" "Likes"
## [16] "Edu" "Events" "Friends"
str(fbdata) # displays dataframe structure
## 'data.frame': 708 obs. of 18 variables:
## $ ï..Sample : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 57 42 42 55 30 25 28 32 29 25 ...
## $ Photos : int 7 531 1396 394 916 241 138 167 922 231 ...
## $ Tags : int 27 241 423 139 231 345 206 581 1702 209 ...
## $ Albums : int 5 19 40 48 78 6 15 10 23 8 ...
## $ Gender : int 1 0 0 0 1 1 1 1 0 0 ...
## $ Emp : int 1 0 1 1 1 1 1 1 1 1 ...
## $ Profile : int 1 0 1 1 1 1 0 1 1 1 ...
## $ Cover : int 0 0 0 1 1 1 1 0 1 1 ...
## $ Orientation : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Relationship: int 1 1 1 1 1 0 0 0 1 0 ...
## $ Posts : int 1 5 186 6 412 543 254 236 8 11 ...
## $ Replies : int 0 1 31 4 752 654 163 195 9 4 ...
## $ Children : int 1 1 1 1 1 0 0 0 1 1 ...
## $ Likes : int 21 9 43 187 34 48 151 64 9 29 ...
## $ Edu : int 0 0 1 0 0 0 0 0 0 0 ...
## $ Events : int 0 1 1 4 2 5 1 3 2 27 ...
## $ Friends : int 68 234 727 437 372 552 248 875 636 440 ...
Above output identifies object “fbdata” as a dataframe with 708 observations of 18 variables. The variables in this dataset are of the following two types: 1. integer type/continuous - Age, Photos, Tags, Albums, Posts, Replies, Likes, Events, Friends 2. binary type/discrete - Gender, Emp, Profile, Cover, Orientation, Relationship, Children, Edu
summary(fbdata) # summary statistics for EDA
## ï..Sample Age Photos Tags
## Min. : 1.0 Min. :13.00 Min. : 0.0 Min. : 0.00
## 1st Qu.:177.8 1st Qu.:21.00 1st Qu.: 119.5 1st Qu.: 74.25
## Median :354.5 Median :22.00 Median : 317.5 Median : 245.00
## Mean :354.5 Mean :24.39 Mean : 725.2 Mean : 459.87
## 3rd Qu.:531.2 3rd Qu.:25.00 3rd Qu.: 757.0 3rd Qu.: 572.25
## Max. :708.0 Max. :81.00 Max. :11995.0 Max. :10460.00
## Albums Gender Emp Profile
## Min. : 0.00 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 6.00 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.0000
## Median : 12.00 Median :1.0000 Median :1.0000 Median :1.0000
## Mean : 18.97 Mean :0.5268 Mean :0.7429 Mean :0.9167
## 3rd Qu.: 22.25 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :172.00 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Cover Orientation Relationship Posts
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.: 6.00
## Median :1.0000 Median :1.0000 Median :0.0000 Median : 15.00
## Mean :0.5042 Mean :0.9562 Mean :0.4915 Mean : 37.85
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 30.00
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :794.00
## Replies Children Likes Edu
## Min. : 0.00 Min. :0.0000 Min. : 0.0 Min. :0.0000
## 1st Qu.: 5.00 1st Qu.:0.0000 1st Qu.: 21.0 1st Qu.:0.0000
## Median : 14.00 Median :0.0000 Median : 59.0 Median :0.0000
## Mean : 35.01 Mean :0.1398 Mean : 143.9 Mean :0.2599
## 3rd Qu.: 29.00 3rd Qu.:0.0000 3rd Qu.: 145.0 3rd Qu.:1.0000
## Max. :752.00 Max. :1.0000 Max. :2619.0 Max. :1.0000
## Events Friends
## Min. : 0.000 Min. : 3.0
## 1st Qu.: 1.000 1st Qu.: 304.8
## Median : 4.000 Median : 565.5
## Mean : 8.874 Mean : 699.6
## 3rd Qu.: 9.000 3rd Qu.: 899.8
## Max. :396.000 Max. :4999.0
Above output shows mean and median for all variables. These are the measures of central tendency and tell us about the centrality of the data, i.e., what’s the average value? Ideally, we would expect mean to be equal to median. However, that’s not usually the case as data may contain outliers. Since mean is impacted by outliers while median is more robust, we can use the difference between the two to understand the direction of skew.
Mean > Median : Positive skew Mean < Median : Negative skew Mean = Median : No skew
In our dataset “fbdata”, all continuous variables are positively skewed.
It is important to note that mean and median are useful measures but not for binary data. Binary data is best summarized through counts and percentages of the constituents (see 3a below). In the case of binary data, the mean might give us some indication of the majority preference (like, “Gender” has Mean=0.5268, shows the population has more 1s(males) than 0s (females)), but the median (i.e., middle value) provides no useful information. In fact, it is incorrect to calculate a median for binary variables as it can only be meaningfully calculated for quantities that have order and are numeric.
Next, we calculate sample standard deviation (SD) and coefficient of variation (CV) for each variable.
# calculates and stores SD for all variables in the exact order as they appear in the dataset
SDfbdata = c(sd(fbdata$ï..Sample),sd(fbdata$Age),sd(fbdata$Photos),sd(fbdata$Tags),sd(fbdata$Albums),sd(fbdata$Gender),sd(fbdata$Emp),sd(fbdata$Profile),sd(fbdata$Cover),sd(fbdata$Orientation),sd(fbdata$Relationship),sd(fbdata$Posts),sd(fbdata$Replies),sd(fbdata$Children),sd(fbdata$Likes),sd(fbdata$Edu),sd(fbdata$Events),sd(fbdata$Friends))
# calculates and stores CV for all variables in the exact order as they appear in the dataset
CVfbdata = c(cv(fbdata$ï..Sample),cv(fbdata$Age),cv(fbdata$Photos),cv(fbdata$Tags),cv(fbdata$Albums),cv(fbdata$Gender),cv(fbdata$Emp),cv(fbdata$Profile),cv(fbdata$Cover),cv(fbdata$Orientation),cv(fbdata$Relationship),cv(fbdata$Posts),cv(fbdata$Replies),cv(fbdata$Children),cv(fbdata$Likes),cv(fbdata$Edu),cv(fbdata$Events),cv(fbdata$Friends))
# retrieves column names from fbdata and stores in vector LblNames
LblNames = colnames(fbdata)
# Creating dataframes to hold newly calculated SD and CV variables for our attributes
SDfbdata = data.frame(LblNames,SDfbdata)
print(SDfbdata)
## LblNames SDfbdata
## 1 ï..Sample 204.5262819
## 2 Age 6.9498059
## 3 Photos 1270.9496198
## 4 Tags 724.0591398
## 5 Albums 22.5988178
## 6 Gender 0.4996323
## 7 Emp 0.4373229
## 8 Profile 0.2765808
## 9 Cover 0.5003355
## 10 Orientation 0.2047618
## 11 Relationship 0.5002816
## 12 Posts 90.8427790
## 13 Replies 84.0115155
## 14 Children 0.3470563
## 15 Likes 261.3122451
## 16 Edu 0.4388825
## 17 Events 24.1071339
## 18 Friends 600.9328886
CVfbdata = data.frame(LblNames,CVfbdata)
print(CVfbdata)
## LblNames CVfbdata
## 1 ï..Sample 0.5769430
## 2 Age 0.2849799
## 3 Photos 1.7524365
## 4 Tags 1.5744864
## 5 Albums 1.1910051
## 6 Gender 0.9483637
## 7 Emp 0.5886400
## 8 Profile 0.3017245
## 9 Cover 0.9922620
## 10 Orientation 0.2141378
## 11 Relationship 1.0178143
## 12 Posts 2.4003242
## 13 Replies 2.3993607
## 14 Children 2.4819782
## 15 Likes 1.8157013
## 16 Edu 1.6887434
## 17 Events 2.7165129
## 18 Friends 0.8589227
Standard Deviation (SD) is a useful measure in understanding the spread of the data. It is derived by calculating the square root of the variance - the metric that measures the average degree to which each number is different from the mean. Variance is calculated using squared values to ensure that positive and negative distances from the mean do not cancel each other out when added together. The more spread out a group of numbers is, the higher is the standard deviation of that group.
Coefficient of Variation (CV) is another measure of data dispersion and shows us the extent of variability around the mean. It is calculated as a ratio of (std dev:mean) and, being unit-less, is useful when measuring degree of dispersion between different variables. However, there are certain limitations to this interpretation and CV can be considered a reasonable metric only for ratio-scale variables. It is not a useful metric for variables with mean close to zero (as is the case with binary types) since CV approaches infinity in this case.
In the given dataset (comparing only non-binary variables), we can conclude that field “Events” demonstrates the highest variability among all, while “Age” displays the least.
3a. FREQUENCY TABULATIONS - created for binary types below:
table(fbdata$Gender)
##
## 0 1
## 335 373
table(fbdata$Emp)
##
## 0 1
## 182 526
table(fbdata$Profile)
##
## 0 1
## 59 649
table(fbdata$Cover)
##
## 0 1
## 351 357
table(fbdata$Orientation)
##
## 0 1
## 31 677
table(fbdata$Relationship)
##
## 0 1
## 360 348
table(fbdata$Children)
##
## 0 1
## 609 99
table(fbdata$Edu)
##
## 0 1
## 524 184
Frequency tabulations show us how frequently a specific value occurs in a set of data. Alternatively, we can also examine above distributions using % values for each factor:
rowPerc(table(fbdata$Gender))
##
## 0 1 Total
## 47.32 52.68 100
rowPerc(table(fbdata$Emp))
##
## 0 1 Total
## 25.71 74.29 100
rowPerc(table(fbdata$Profile))
##
## 0 1 Total
## 8.33 91.67 100
rowPerc(table(fbdata$Cover))
##
## 0 1 Total
## 49.58 50.42 100
rowPerc(table(fbdata$Orientation))
##
## 0 1 Total
## 4.38 95.62 100
rowPerc(table(fbdata$Relationship))
##
## 0 1 Total
## 50.85 49.15 100
rowPerc(table(fbdata$Children))
##
## 0 1 Total
## 86.02 13.98 100
rowPerc(table(fbdata$Edu))
##
## 0 1 Total
## 74.01 25.99 100
3b. BOXPLOTS - created for non-binary types below
ggplot(fbdata, aes(x=Age)) +
geom_boxplot()
ggplot(fbdata, aes(x=Photos)) +
geom_boxplot()
ggplot(fbdata, aes(x=Tags)) +
geom_boxplot()
ggplot(fbdata, aes(x=Albums)) +
geom_boxplot()
ggplot(fbdata, aes(x=Posts)) +
geom_boxplot()
ggplot(fbdata, aes(x=Replies)) +
geom_boxplot()
ggplot(fbdata, aes(x=Likes)) +
geom_boxplot()
ggplot(fbdata, aes(x=Events)) +
geom_boxplot()
ggplot(fbdata, aes(x=Friends)) +
geom_boxplot()
Boxplots display a 5-number summary, including minimum, 1st quartile/25th percentile, median, 3rd quartile/75% percentile, and maximum, which is useful when describing data distribution, specifically: -Centrality: The average values can be inferred via median -Dispersion: We can infer two important measures from a boxplot - range and IQR (Inter-Quartile Range). Range is the distance between the two whiskers, while IQR is the distance between the lines signifying 1st and 3rd quartiles. -Signs of skewness: The positioning of the median line w.r.t. the 1st and 3rd quartile lines tells us about the skew of the data: -median close to 3rd quartile - skewed left -median in the center - symmetric/not skewed -median close to the 1st quartile - skewed right In addition to above, boxplots also give a quick view of the outliers present. These are plotted as individual points outside the plot. It is important to note that boxplots calculate outliers according to Tukey’s method - values > 1.5*IQR from the 1st/3rd quartiles (not using the empirical rule of (+-)3SD).
3c. HISTOGRAMS - created for non-binary types below:
ggplot(fbdata, aes(x=Age)) +
geom_histogram(binwidth = function(x) 2*IQR(x)/(length(x)^(1/3)))
ggplot(fbdata, aes(x=Photos)) +
geom_histogram(binwidth = function(x) 2*IQR(x)/(length(x)^(1/3)))
ggplot(fbdata, aes(x=Tags)) +
geom_histogram(binwidth = function(x) 2*IQR(x)/(length(x)^(1/3)))
ggplot(fbdata, aes(x=Albums)) +
geom_histogram(binwidth = function(x) 2*IQR(x)/(length(x)^(1/3)))
ggplot(fbdata, aes(x=Posts)) +
geom_histogram(binwidth = function(x) 2*IQR(x)/(length(x)^(1/3)))
ggplot(fbdata, aes(x=Replies)) +
geom_histogram(binwidth = function(x) 2*IQR(x)/(length(x)^(1/3)))
ggplot(fbdata, aes(x=Likes)) +
geom_histogram(binwidth = function(x) 2*IQR(x)/(length(x)^(1/3)))
ggplot(fbdata, aes(x=Events)) +
geom_histogram(binwidth = function(x) 2*IQR(x)/(length(x)^(1/3)))
ggplot(fbdata, aes(x=Friends)) +
geom_histogram(binwidth = function(x) 2*IQR(x)/(length(x)^(1/3)))
Histograms are a graphical visualization of frequency tabuations and provide insight on the following aspects of a distribution’s shape: - mode - modal class is one that is higher than those on either side of it. Can be unimodal/bimodal/multimodal/no mode(uniform) - skew - right/positive skew (long right tail), left/negative skew (long left tail), no skew (symmetric)
In our dataset fbdata, all non-binary variables show a multimodal, right skewed distribution.This is in line with our expectations from business data which is usually bound by zero on the left but remains unbounded on the right.
Our first task here is to analyze correlation between the non-binary variables.
#creating a new dataframe to hold all non-binary values from fbdata for further analysis.
fbdatanb = data.frame(fbdata$Age,fbdata$Photos,fbdata$Tags,fbdata$Albums,fbdata$Posts,fbdata$Replies,fbdata$Likes,fbdata$Events,fbdata$Friends)
cormat = cor(fbdatanb) #creates and stores correlation matrix in variable "cormat"
corrplot(cormat, type = "lower") #plots correlation matrix - method 1
heatmap(cormat, symm = TRUE) #plots correlation matrix - heatmap style - method 2
qplot(x=Var1, y=Var2, data=melt(cormat), fill=value, geom="tile") #plots correlation matrix - heatmap style - method 3
Looking at the correlation matrix, we observe: - strong positive correlation “Albums and Photos” “Replies and Posts” - Moderate positive correlation “Tags and Albums” “Tags and Photos” “Likes and Photos” “Likes and Albums” “Likes and Friends” “Replies and Albums” “Likes and Albums” “Posts and Albums” “Albums and Friends” “Tags and Friends” “Photos and Friends” - Moderate negative correlation “Age and Friends” “Age and Likes” “Age and Tags” “Age and Photos” All other combinations show weakly correlation and will therefore be ignored in further analysis.
As we start building the regression model, we choose variables “Photos” as the explanatory variable since it displays the strongest correlation with “Friends”.
Model 1 Response Variable = Friends Potential Explanatory Variables = Photos
model1 <- lm(Friends~Photos, data = fbdata)
summary(model1)
##
## Call:
## lm(formula = Friends ~ Photos, data = fbdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1612.8 -369.6 -120.6 167.7 4319.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 615.56714 25.23145 24.397 < 2e-16 ***
## Photos 0.11592 0.01725 6.719 3.77e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 583 on 706 degrees of freedom
## Multiple R-squared: 0.0601, Adjusted R-squared: 0.05877
## F-statistic: 45.15 on 1 and 706 DF, p-value: 3.768e-11
Above output shows that “Photos” is a significant variable (p-value tends to zero). However, looking at the R squared value, we see that this model explains only 6% of the variation in Friends. To boost the explanatory power of our regression model, we add another explanatory variable to it.
Model 2 Response Variable = Friends Potential Explanatory Variables = Photos, Tags
model2 <- lm(Friends~Photos+Tags, data = fbdata)
summary(model2)
##
## Call:
## lm(formula = Friends ~ Photos + Tags, data = fbdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1310.8 -359.4 -121.5 167.7 4348.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 585.72015 26.44987 22.145 < 2e-16 ***
## Photos 0.07858 0.02018 3.895 0.000108 ***
## Tags 0.12378 0.03541 3.495 0.000503 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 578.4 on 705 degrees of freedom
## Multiple R-squared: 0.07611, Adjusted R-squared: 0.07349
## F-statistic: 29.04 on 2 and 705 DF, p-value: 7.6e-13
Since p-values for both variables are small (ideally <0.05), both are significant. We also notice an uplift in the R-squared value - this model explains 7.6% variation in Friends. To compare model 1 and 2, we look at adjusted R-squared value (see reason below) which turns out to be higher for model 2 - confirming model 2 is better than model 1. To boost model 2’s predictive power, we add another explanatory variable next.
“The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance. The adjusted R-squared can be negative, but it’s usually not. It is always lower than the R-squared.” (source - https://blog.minitab.com/blog/adventures-in-statistics-2/multiple-regession-analysis-use-adjusted-r-squared-and-predicted-r-squared-to-include-the-correct-number-of-variables)
Model 3 Response Variable = Friends Potential Explanatory Variables = Photos, Tags, Likes
model3 <- lm(Friends~Photos+Tags+Likes, data = fbdata)
summary(model3)
##
## Call:
## lm(formula = Friends ~ Photos + Tags + Likes, data = fbdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1242.7 -344.5 -118.6 172.3 4345.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 526.40835 27.79659 18.938 < 2e-16 ***
## Photos 0.06765 0.01981 3.415 0.000675 ***
## Tags 0.12087 0.03462 3.491 0.000511 ***
## Likes 0.47651 0.08199 5.812 9.35e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 565.4 on 704 degrees of freedom
## Multiple R-squared: 0.1184, Adjusted R-squared: 0.1147
## F-statistic: 31.52 on 3 and 704 DF, p-value: < 2.2e-16
Once again, all explanatory variables are significant and both R-squared and adjusted R-squared values show an uplift. Adding another variable to further improve our model.
Model 4 Response Variable = Friends Potential Explanatory Variables = Photos, Tags, Likes, Albums
model4 <- lm(Friends~Photos+Tags+Likes+Albums, data = fbdata)
summary(model4)
##
## Call:
## lm(formula = Friends ~ Photos + Tags + Likes + Albums, data = fbdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1234.2 -344.3 -117.2 171.4 4345.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 525.12179 29.86306 17.584 < 2e-16 ***
## Photos 0.06530 0.02808 2.326 0.020318 *
## Tags 0.11981 0.03579 3.348 0.000858 ***
## Likes 0.47554 0.08245 5.768 1.2e-08 ***
## Albums 0.19083 1.61170 0.118 0.905782
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 565.8 on 703 degrees of freedom
## Multiple R-squared: 0.1184, Adjusted R-squared: 0.1134
## F-statistic: 23.61 on 4 and 703 DF, p-value: < 2.2e-16
Adjusted R-squared falls in model 4, so removing “Albums” from our regression. Replacing with the next variable in-line “Age”.
Model 5 Response Variable = Friends Potential Explanatory Variables = Photos, Tags, Likes, Age
model5 <- lm(Friends~Photos+Tags+Likes+Age, data = fbdata)
summary(model5)
##
## Call:
## lm(formula = Friends ~ Photos + Tags + Likes + Age, data = fbdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1216.4 -328.5 -105.6 172.2 4467.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 831.12312 83.57385 9.945 < 2e-16 ***
## Photos 0.06708 0.01962 3.419 0.000665 ***
## Tags 0.10835 0.03444 3.146 0.001725 **
## Likes 0.42649 0.08221 5.187 2.79e-07 ***
## Age -11.94655 3.09375 -3.862 0.000123 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 559.9 on 703 degrees of freedom
## Multiple R-squared: 0.1367, Adjusted R-squared: 0.1318
## F-statistic: 27.84 on 4 and 703 DF, p-value: < 2.2e-16
This model looks better than model 3 since adjust R-squared value has increased.
Next, we will calculate the VIF (Variance Inflationary Factor) for model 5 to diagnose any possible multicollinearity issues.
vif(model5)
## Photos Tags Likes Age
## 1.402139 1.402147 1.040797 1.042485
Since all VIF values are low (<10), we can conclude that this model does not have a multicollinearity issue. Plotting the model below:
par(mfrow=c(2,2))
plot(model5)
Graph at (-1,1) - shows the residuals against the fitted values. It looks good as the red line is quite close to the dotted line (confirming linearity) and we do not see any evidence of curvature (confirming homoskedasticity). Graph at (1,1) - shows the ordered residuals plotted against the quantiles of the standard normal distribution. If, as we hope, our errors really are normal, then this plot should be linear. The data are well behaved with only 3 points having a larger residual than expected (confirming normality of errors). Graph at (-1,-1) - very like the first plot but shows the square root of the standardised residuals against the fitted values (useful for detecting non-constant variance). The red line is almost horizontal and the spread around it doesn’t vary with the fitted values, ruling out heteroskedasticity. Graph at (1,-1) - shows the Residuals vs Leverage plot which is useful in identifying outliers in our regression model. The dotted line represents Cook’s distance - anything outside it is an outlier. Here, we see that 1 value (240) is an outlier.
Therefore, we accept model 5 as our final regression model. Next, we take one of the 7 data points that we had excluded earlier (with Gender=2), and use model 5 to predict number of Friends for that user:
predict(model5,list(Photos=1260,Tags=431,Likes=140,Age=24))
## 1
## 735.3308
Actual no. of Friends = 723 Predicted no. of Friends = 735