Let us import the csv which we know is in our working directory into our datafram named “data”
data<-read.csv("quiz2.csv",header = TRUE,sep = ",")
data
summary(data)
## subregion region pop income
## Length:29 Length:29 Min. : 2311 Min. :26784
## Class :character Class :character 1st Qu.: 19005 1st Qu.:37970
## Mode :character Mode :character Median : 32122 Median :41595
## Mean : 135940 Mean :43858
## 3rd Qu.: 101482 3rd Qu.:47469
## Max. :1554720 Max. :70821
## ipaddr ufo2010 infections
## Min. : 637 Min. : 0.00 Min. : 39
## 1st Qu.: 12294 1st Qu.: 0.00 1st Qu.: 123
## Median : 30418 Median : 2.00 Median : 245
## Mean : 440130 Mean : 16.66 Mean :1117
## 3rd Qu.: 102104 3rd Qu.: 9.00 3rd Qu.: 672
## Max. :5394949 Max. :169.00 Max. :6781
Above we get a statistical analysis of our data frame we see seven columns two with categorical character data and the other five with what appears to be numeric quantitative data Lets further inspect the data frames data structure by using the str() function
str(data)
## 'data.frame': 29 obs. of 7 variables:
## $ subregion : chr "abbeville" "acadia" "accomack" "ada" ...
## $ region : chr "south carolina" "louisiana" "virginia" "idaho" ...
## $ pop : int 25101 61912 33341 409061 7481 18675 25581 22286 459598 3915 ...
## $ income : int 34670 37970 41595 55304 47623 31775 33157 31038 56089 36845 ...
## $ ipaddr : int 30330 38203 41338 1035427 3762 14303 75347 4543 3206226 1916 ...
## $ ufo2010 : int 2 6 2 59 0 1 1 0 115 0 ...
## $ infections: int 245 215 2076 5023 189 195 123 116 3298 430 ...
Lets do more eda by inspecting the top 6 rows of the data frame by using the head() function
head(data)
lets do a linear regression of the infections field aginst the income and population fields. We will store this below as model1 and then review its coeficients, r squared value and p value. by using the summary() function on it.
model1<-lm(infections~income+pop,data = data)
summary(model1)
##
## Call:
## lm(formula = infections ~ income + pop, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1676.4 -809.5 -249.5 83.6 6087.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.888e+03 1.649e+03 -1.145 0.2626
## income 6.161e-02 3.969e-02 1.552 0.1327
## pop 2.227e-03 1.293e-03 1.723 0.0968 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1489 on 26 degrees of freedom
## Multiple R-squared: 0.3955, Adjusted R-squared: 0.349
## F-statistic: 8.504 on 2 and 26 DF, p-value: 0.00144
We cab see that the p value is within a %1 significance level, the are is low at 34% Now let us inspect the confidense intervals of this linear model using the confint() function
confint(model1)
## 2.5 % 97.5 %
## (Intercept) -5.277208e+03 1.501182e+03
## income -1.996819e-02 1.431971e-01
## pop -4.300653e-04 4.884768e-03
Let us proceed to remove regions and subregions and test for multicolinearility.
Lets remove the qulitative data so we can correlate the quantitative attributes
data1<-data[,c(3,4,5,6,7)]
#View(data1)
lets correlate this numeric attributes so we can inspect them in a confusion matrix
cor.data<-cor(data1)
cor.data
## pop income ipaddr ufo2010 infections
## pop 1.0000000 0.6844901 0.9511769 0.9418469 0.5826141
## income 0.6844901 1.0000000 0.6928019 0.6840534 0.5713655
## ipaddr 0.9511769 0.6928019 1.0000000 0.9829151 0.5696741
## ufo2010 0.9418469 0.6840534 0.9829151 1.0000000 0.6005828
## infections 0.5826141 0.5713655 0.5696741 0.6005828 1.0000000
lets install the package necessary to plot this correlation matrix. it is called corrplot
#install.packages("Correlplot")
#install.packages("corrplot")
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.1
## corrplot 0.92 loaded
now we will use corplot to show out correlation matrix with the ellipse method
corrplot(cor.data,method = "ellipse")
Notice Ufo and ip adress has high correlation as we as population and ip
address Visualization methods in corrplot
There are seven visualization methods (parameter method) in corrplot package, named “circle”, “square”, “ellipse”, “number”, “shade”, “color”, “pie”.
Positive correlations are displayed in blue and negative correlations in red color. Color intensity and the size of the circle are proportional to the correlation coefficients.
Let us now proceed to install the leap package in order to do further analysis of the model.
#install.packages("leaps")
library("leaps")
## Warning: package 'leaps' was built under R version 4.2.1
Lets build another linear regression model this with infections against all the quantitative found in the last data frame we created.
model2<-lm(infections~.,data = data1)
summary(model2)
##
## Call:
## lm(formula = infections ~ ., data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1575.4 -735.6 -332.8 178.0 6131.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.976e+03 1.696e+03 -1.165 0.255
## pop 1.612e-03 3.110e-03 0.518 0.609
## income 6.079e-02 4.057e-02 1.499 0.147
## ipaddr -1.585e-03 1.455e-03 -1.089 0.287
## ufo2010 5.438e+01 4.080e+01 1.333 0.195
##
## Residual standard error: 1496 on 24 degrees of freedom
## Multiple R-squared: 0.4372, Adjusted R-squared: 0.3434
## F-statistic: 4.66 on 4 and 24 DF, p-value: 0.006312
Let us proceed to use regsubsets (a function for model selection)
submodel2<-regsubsets(infections~.,data = data1)
best.summary<-summary(submodel2)
names(best.summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
which.min(best.summary$rss)
## [1] 4
par(mfrow=c(1,2))
plot(best.summary$cp,xlab = "number of features",ylab = "cp")
which.min(best.summary$bic)
## [1] 1
which.max(best.summary$adjr2)
## [1] 2
bestsummary.fit1<-lm(infections~income+ufo2010,data = data1)
summary(bestsummary.fit1)
##
## Call:
## lm(formula = infections ~ income + ufo2010, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1686.3 -715.9 -360.5 70.1 6096.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.713e+03 1.633e+03 -1.049 0.3037
## income 5.725e-02 3.922e-02 1.460 0.1563
## ufo2010 1.919e+01 1.006e+01 1.907 0.0676 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1472 on 26 degrees of freedom
## Multiple R-squared: 0.4091, Adjusted R-squared: 0.3637
## F-statistic: 9.002 on 2 and 26 DF, p-value: 0.00107
The model is acceptable in significance but can improve an variance given its r squared is low.