Handout

Let us import the csv which we know is in our working directory into our datafram named “data”

data<-read.csv("quiz2.csv",header = TRUE,sep = ",")
data

summary(data)

##   subregion            region               pop              income     
##  Length:29          Length:29          Min.   :   2311   Min.   :26784  
##  Class :character   Class :character   1st Qu.:  19005   1st Qu.:37970  
##  Mode  :character   Mode  :character   Median :  32122   Median :41595  
##                                        Mean   : 135940   Mean   :43858  
##                                        3rd Qu.: 101482   3rd Qu.:47469  
##                                        Max.   :1554720   Max.   :70821  
##      ipaddr           ufo2010         infections  
##  Min.   :    637   Min.   :  0.00   Min.   :  39  
##  1st Qu.:  12294   1st Qu.:  0.00   1st Qu.: 123  
##  Median :  30418   Median :  2.00   Median : 245  
##  Mean   : 440130   Mean   : 16.66   Mean   :1117  
##  3rd Qu.: 102104   3rd Qu.:  9.00   3rd Qu.: 672  
##  Max.   :5394949   Max.   :169.00   Max.   :6781

Above we get a statistical analysis of our data frame we see seven columns two with categorical character data and the other five with what appears to be numeric quantitative data Lets further inspect the data frames data structure by using the str() function

str(data)

## 'data.frame':    29 obs. of  7 variables:
##  $ subregion : chr  "abbeville" "acadia" "accomack" "ada" ...
##  $ region    : chr  "south carolina" "louisiana" "virginia" "idaho" ...
##  $ pop       : int  25101 61912 33341 409061 7481 18675 25581 22286 459598 3915 ...
##  $ income    : int  34670 37970 41595 55304 47623 31775 33157 31038 56089 36845 ...
##  $ ipaddr    : int  30330 38203 41338 1035427 3762 14303 75347 4543 3206226 1916 ...
##  $ ufo2010   : int  2 6 2 59 0 1 1 0 115 0 ...
##  $ infections: int  245 215 2076 5023 189 195 123 116 3298 430 ...

Lets do more eda by inspecting the top 6 rows of the data frame by using the head() function

head(data)

lets do a linear regression of the infections field aginst the income and population fields. We will store this below as model1 and then review its coeficients, r squared value and p value. by using the summary() function on it.

model1<-lm(infections~income+pop,data = data)
summary(model1)

## 
## Call:
## lm(formula = infections ~ income + pop, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1676.4  -809.5  -249.5    83.6  6087.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -1.888e+03  1.649e+03  -1.145   0.2626  
## income       6.161e-02  3.969e-02   1.552   0.1327  
## pop          2.227e-03  1.293e-03   1.723   0.0968 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1489 on 26 degrees of freedom
## Multiple R-squared:  0.3955, Adjusted R-squared:  0.349 
## F-statistic: 8.504 on 2 and 26 DF,  p-value: 0.00144

We cab see that the p value is within a %1 significance level, the are is low at 34% Now let us inspect the confidense intervals of this linear model using the confint() function

confint(model1)

##                     2.5 %       97.5 %
## (Intercept) -5.277208e+03 1.501182e+03
## income      -1.996819e-02 1.431971e-01
## pop         -4.300653e-04 4.884768e-03

Let us proceed to remove regions and subregions and test for multicolinearility.

Lets remove the qulitative data so we can correlate the quantitative attributes

data1<-data[,c(3,4,5,6,7)]
#View(data1)

lets correlate this numeric attributes so we can inspect them in a confusion matrix

cor.data<-cor(data1)
cor.data

##                  pop    income    ipaddr   ufo2010 infections
## pop        1.0000000 0.6844901 0.9511769 0.9418469  0.5826141
## income     0.6844901 1.0000000 0.6928019 0.6840534  0.5713655
## ipaddr     0.9511769 0.6928019 1.0000000 0.9829151  0.5696741
## ufo2010    0.9418469 0.6840534 0.9829151 1.0000000  0.6005828
## infections 0.5826141 0.5713655 0.5696741 0.6005828  1.0000000

lets install the package necessary to plot this correlation matrix. it is called corrplot

#install.packages("Correlplot")
#install.packages("corrplot")

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.2.1

## corrplot 0.92 loaded

now we will use corplot to show out correlation matrix with the ellipse method

corrplot(cor.data,method = "ellipse")

Notice Ufo and ip adress has high correlation as we as population and ip address Visualization methods in corrplot

There are seven visualization methods (parameter method) in corrplot package, named “circle”, “square”, “ellipse”, “number”, “shade”, “color”, “pie”.

Positive correlations are displayed in blue and negative correlations in red color. Color intensity and the size of the circle are proportional to the correlation coefficients.

Let us now proceed to install the leap package in order to do further analysis of the model.

#install.packages("leaps")
library("leaps")

## Warning: package 'leaps' was built under R version 4.2.1

Lets build another linear regression model this with infections against all the quantitative found in the last data frame we created.

model2<-lm(infections~.,data = data1)
summary(model2)

## 
## Call:
## lm(formula = infections ~ ., data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1575.4  -735.6  -332.8   178.0  6131.8 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.976e+03  1.696e+03  -1.165    0.255
## pop          1.612e-03  3.110e-03   0.518    0.609
## income       6.079e-02  4.057e-02   1.499    0.147
## ipaddr      -1.585e-03  1.455e-03  -1.089    0.287
## ufo2010      5.438e+01  4.080e+01   1.333    0.195
## 
## Residual standard error: 1496 on 24 degrees of freedom
## Multiple R-squared:  0.4372, Adjusted R-squared:  0.3434 
## F-statistic:  4.66 on 4 and 24 DF,  p-value: 0.006312

Let us proceed to use regsubsets (a function for model selection)

submodel2<-regsubsets(infections~.,data = data1)
best.summary<-summary(submodel2)

names(best.summary)

## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"

which.min(best.summary$rss)

## [1] 4

par(mfrow=c(1,2))

plot(best.summary$cp,xlab = "number of features",ylab = "cp")

which.min(best.summary$bic)

## [1] 1

which.max(best.summary$adjr2)

## [1] 2

bestsummary.fit1<-lm(infections~income+ufo2010,data = data1)

summary(bestsummary.fit1)

## 
## Call:
## lm(formula = infections ~ income + ufo2010, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1686.3  -715.9  -360.5    70.1  6096.2 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -1.713e+03  1.633e+03  -1.049   0.3037  
## income       5.725e-02  3.922e-02   1.460   0.1563  
## ufo2010      1.919e+01  1.006e+01   1.907   0.0676 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1472 on 26 degrees of freedom
## Multiple R-squared:  0.4091, Adjusted R-squared:  0.3637 
## F-statistic: 9.002 on 2 and 26 DF,  p-value: 0.00107

The model is acceptable in significance but can improve an variance given its r squared is low.

Handout_11