Assignment #1 Data Mining

Conceptual Questions

Question 2

Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.
- We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary
  - Regression, because the response of interest is CEO salary, which is a continuous variable. Inference is the main interest because we want to understand which factors affect salary.
  - n = 500, p = 3 (profit, number of employees, and industry; CEO salary is the response)
- We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
  - Classification, because the response has two classes: success or failure. Prediction is of most interest since we would like to know whether the product will succeed. An argument could be made for inference as well, since understanding the predictors would help us set the price and budget of the product to maximize the likelihood of success.
  - n = 20, p = 13 (price, marketing budget, competition price, and ten other variables)
- We are interested in predicting the % change in the USD/Euro
  - Regression, since % change is a continuous variable. Prediction is the focus because that is what is requested. In a foreign exchange setting, though, you may instead want to classify whether the change will be positive or negative.
  - n = number of time periods under study (time series), p = many (could relate to the US and EU stock markets, birth rate, GDP, etc.)

Question 5

What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?
- A more flexible approach is beneficial because it can model many different types of problems by making fewer assumptions about the form of the relationship. The downside is that it is more likely to overfit the training data and so predict poorly on new data. Less flexible approaches have the opposite problem: they suit only certain types of problems and may not fit the data closely enough. However, less flexible methods are easier to interpret, so they can be used for inference about the variables. A more flexible model is preferred when prediction is the focus of the analysis and a lot of data is available. A less flexible approach is preferred when data is scarce and inference is the goal of the analysis.
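
The trade-off described above can be sketched with a small simulation. Everything here is an assumption made for illustration (the sine-shaped true function, the noise level, the sample sizes, and the polynomial degrees); none of it comes from the assignment data. Polynomial degree stands in for flexibility, and the training and test mean squared errors are printed side by side.

```r
# Simulated data (hypothetical; chosen only to illustrate flexibility)
set.seed(1)
make_data <- function(n) {
  x <- runif(n, 0, 1)
  data.frame(x = x, y = sin(2 * pi * x) + rnorm(n, sd = 0.3))
}
train <- make_data(50)
test  <- make_data(1000)

# Higher polynomial degree = more flexible model
degrees <- c(1, 3, 10)
mse <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(train = mean((train$y - predict(fit, train))^2),
    test  = mean((test$y  - predict(fit, test))^2))
})
colnames(mse) <- paste0("degree_", degrees)
print(round(mse, 3))
```

Because the models are nested, the training error can only go down as the degree grows. The test error is the number to watch when judging whether the extra flexibility is actually helping.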

Question 6

Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?
- Parametric methods assume a particular form for the relationship between the variables (for example, linear), which reduces the problem to estimating a small set of parameters. A non-parametric method does not assume any explicit form for the relationship between X and y. The advantages of a parametric approach are that it is easier to explain, requires less data to fit effectively, and is less prone to overfitting than non-parametric models. Its disadvantage is that the assumed form may not match the real-life phenomenon, which may follow a more complex function, so its predictions can be less accurate.
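
To make the contrast concrete, here is a minimal sketch on simulated data (the data, the helper knn_predict, and the choice k = 5 are all hypothetical and only for illustration). The linear model summarizes the relationship with two estimated numbers, while the nearest-neighbour average assumes no functional form and needs the whole training set to make a prediction.

```r
set.seed(2)
x <- sort(runif(100, 0, 10))
y <- 2 + 0.5 * x + rnorm(100, sd = 1)  # simulated, truly linear relationship

# Parametric: assumes y = b0 + b1 * x, so only two parameters are estimated
lin_fit <- lm(y ~ x)
n_params <- length(coef(lin_fit))
print(coef(lin_fit))

# Non-parametric: average the responses of the k nearest training points
knn_predict <- function(x0, x, y, k = 5) {
  nearest <- order(abs(x - x0))[1:k]
  mean(y[nearest])
}
knn_hat <- sapply(x, function(x0) knn_predict(x0, x, y, k = 5))

# In-sample errors of the two fits
print(c(linear = mean(residuals(lin_fit)^2),
        knn    = mean((y - knn_hat)^2)))
```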

Applied Questions

Question 8

This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US.

Reading in data

college = read.csv("College.csv")

Editing Data

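# Use the college names (first column) as row names
rownames(college) = college[, 1]
fix(college)  # opens an interactive data editor to inspect the result

# Drop the name column now that it is stored in the row names
college = college[, -1]
fix(college)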
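
Since fix() opens an interactive editor, it does not suit a script that runs unattended. A possible non-interactive alternative is to let read.csv assign the row names directly. The sketch below uses a tiny inline CSV with made-up values standing in for College.csv, so the column names and numbers are assumptions.

```r
# Toy CSV text standing in for College.csv (hypothetical values)
csv_text <- "Name,Private,Apps
College A,Yes,1660
College B,No,2186
College C,Yes,1428"

# row.names = 1 uses the first column as row names and drops it from the data
toy <- read.csv(text = csv_text, row.names = 1)
print(dim(toy))
print(rownames(toy))
```

With the real file, the equivalent call would pass the file name in place of the text argument.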

Summary of Data, Exploratory plots

summary(college)

##    Private               Apps           Accept          Enroll    
##  Length:777         Min.   :   81   Min.   :   72   Min.   :  35  
##  Class :character   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242  
##  Mode  :character   Median : 1558   Median : 1110   Median : 434  
##                     Mean   : 3002   Mean   : 2019   Mean   : 780  
##                     3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902  
##                     Max.   :48094   Max.   :26330   Max.   :6392  
##    Top10perc       Top25perc      F.Undergrad     P.Undergrad     
##  Min.   : 1.00   Min.   :  9.0   Min.   :  139   Min.   :    1.0  
##  1st Qu.:15.00   1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0  
##  Median :23.00   Median : 54.0   Median : 1707   Median :  353.0  
##  Mean   :27.56   Mean   : 55.8   Mean   : 3700   Mean   :  855.3  
##  3rd Qu.:35.00   3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0  
##  Max.   :96.00   Max.   :100.0   Max.   :31643   Max.   :21836.0  
##     Outstate       Room.Board       Books           Personal   
##  Min.   : 2340   Min.   :1780   Min.   :  96.0   Min.   : 250  
##  1st Qu.: 7320   1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850  
##  Median : 9990   Median :4200   Median : 500.0   Median :1200  
##  Mean   :10441   Mean   :4358   Mean   : 549.4   Mean   :1341  
##  3rd Qu.:12925   3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700  
##  Max.   :21700   Max.   :8124   Max.   :2340.0   Max.   :6800  
##       PhD            Terminal       S.F.Ratio      perc.alumni   
##  Min.   :  8.00   Min.   : 24.0   Min.   : 2.50   Min.   : 0.00  
##  1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00  
##  Median : 75.00   Median : 82.0   Median :13.60   Median :21.00  
##  Mean   : 72.66   Mean   : 79.7   Mean   :14.09   Mean   :22.74  
##  3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00  
##  Max.   :103.00   Max.   :100.0   Max.   :39.80   Max.   :64.00  
##      Expend        Grad.Rate     
##  Min.   : 3186   Min.   : 10.00  
##  1st Qu.: 6751   1st Qu.: 53.00  
##  Median : 8377   Median : 65.00  
##  Mean   : 9660   Mean   : 65.46  
##  3rd Qu.:10830   3rd Qu.: 78.00  
##  Max.   :56233   Max.   :118.00

pairs(college[,2:11])

plot(as.factor(college$Private), college$Outstate, xlab = "Private", ylab= "Out of State Tuition ")

Exploring Elite Schools

Elite = rep("No", nrow(college))
Elite[college$Top10perc > 50] = "Yes"  # no leading space in the label
Elite = as.factor(Elite)
college = data.frame(college, Elite)

summary(college$Elite)

##  No Yes 
## 699  78

plot(college$Elite, college$Outstate,
     xlab = "Elite School", ylab= "Out of State Tuition ")
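
The same indicator can be built in one step with ifelse(), fixing the level order explicitly so the labels cannot be affected by sorting. This is a sketch on a made-up vector; the values in top10 are hypothetical.

```r
top10 <- c(23, 60, 51, 50, 96)  # hypothetical Top10perc values

# "Yes" when more than 50% of students come from the top 10% of their class
elite <- factor(ifelse(top10 > 50, "Yes", "No"), levels = c("No", "Yes"))
print(table(elite))
```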

Histograms

par(mfrow=c(2,2))
hist(college$Enroll)
hist(college$PhD)
hist(college$F.Undergrad)
hist(college$Apps)

Explore Data and Summary

num_cols <- unlist(lapply(college, is.numeric)) 
x1 = college[,num_cols] #Taking all the numeric columns
library(tidyverse)

## -- Attaching packages --------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts ------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

x1 %>%
  gather() %>%                             
  ggplot(aes(value)) +                     
    facet_wrap(~ key, scales = "free") +  
    geom_density()

cormat <- round(cor(x1),2)
melted_cormat <- melt(cormat)
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value)) +
    geom_tile(color = "white") +
    scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                         midpoint = 0, limit = c(-1, 1), space = "Lab",
                         name = "Pearson\nCorrelation") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Exploring the data further shows that most of the variables are unimodal, with many skewed to the right. Examining Pearson's correlation, some relationships make intuitive sense. For example, there is a negative relationship between the student-faculty ratio (S.F.Ratio) and expenditure per student (Expend): as the S.F. ratio increases, the expenditure per student goes down, because there are more students per faculty member.

Question 9

Auto data

Quantitative vs Qualitative

Auto = read.csv("Auto.csv")

str(Auto)

## 'data.frame':    397 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : chr  "130" "165" "150" "150" ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...

Variable	Quantitative vs Qualitative
mpg	Quantitative
cylinders	Quantitative
displacement	Quantitative
horsepower	Quantitative
weight	Quantitative
acceleration	Quantitative
year	Qualitative
origin	Qualitative
name	Qualitative

Range of Quantitative variables

Auto$horsepower <- as.numeric(Auto$horsepower) #Making it numeric

## Warning: NAs introduced by coercion
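
The coercion warning appears because some horsepower entries are not numbers. Assuming the missing entries are coded as "?" (true of the ISLR copy of Auto.csv, but an assumption about the file used here), the marker can be declared at read time so the column is parsed as numeric from the start. The inline CSV below is a made-up stand-in.

```r
# Toy CSV text standing in for Auto.csv (hypothetical values)
csv_text <- "mpg,horsepower
18,130
25,?
15,165"

# Declaring "?" as a missing-value marker lets read.csv parse the column as numeric
toy <- read.csv(text = csv_text, na.strings = "?")
str(toy)
print(sum(is.na(toy$horsepower)))
```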

Auto$origin <- as.factor(Auto$origin)
Auto$year <- as.factor(Auto$year)
num_cols <- unlist(lapply(Auto, is.numeric)) 
x2 = na.omit(Auto[,num_cols])
lapply(x2,range)

## $mpg
## [1]  9.0 46.6
## 
## $cylinders
## [1] 3 8
## 
## $displacement
## [1]  68 455
## 
## $horsepower
## [1]  46 230
## 
## $weight
## [1] 1613 5140
## 
## $acceleration
## [1]  8.0 24.8

Mean

lapply(x2,mean)

## $mpg
## [1] 23.44592
## 
## $cylinders
## [1] 5.471939
## 
## $displacement
## [1] 194.412
## 
## $horsepower
## [1] 104.4694
## 
## $weight
## [1] 2977.584
## 
## $acceleration
## [1] 15.54133

Standard deviation

lapply(x2,sd)

## $mpg
## [1] 7.805007
## 
## $cylinders
## [1] 1.705783
## 
## $displacement
## [1] 104.644
## 
## $horsepower
## [1] 38.49116
## 
## $weight
## [1] 849.4026
## 
## $acceleration
## [1] 2.758864

Subset of data

subsetx2 = x2[-(10:85), ]  # drop the 10th through 85th rows
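
Negative indices drop rows by position. The short check below, on a toy data frame (hypothetical, 100 rows), confirms that the spellings -10:-85 and -(10:85) select the same rows, and that 76 rows are removed.

```r
idx_a <- -10:-85
idx_b <- -(10:85)
print(identical(idx_a, idx_b))

toy <- data.frame(id = 1:100)            # hypothetical data
kept <- toy[idx_a, , drop = FALSE]       # rows 1-9 and 86-100 remain
print(nrow(kept))
```

Note that the positions refer to rows of the object being indexed, so after na.omit() they are positions in the cleaned data, not the original row labels.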

Range

lapply(subsetx2,range)

## $mpg
## [1] 11.0 46.6
## 
## $cylinders
## [1] 3 8
## 
## $displacement
## [1]  68 455
## 
## $horsepower
## [1]  46 230
## 
## $weight
## [1] 1649 4997
## 
## $acceleration
## [1]  8.5 24.8

Mean

lapply(subsetx2,mean)

## $mpg
## [1] 24.40443
## 
## $cylinders
## [1] 5.373418
## 
## $displacement
## [1] 187.2405
## 
## $horsepower
## [1] 100.7215
## 
## $weight
## [1] 2935.972
## 
## $acceleration
## [1] 15.7269

Standard deviation

lapply(subsetx2,sd)

## $mpg
## [1] 7.867283
## 
## $cylinders
## [1] 1.654179
## 
## $displacement
## [1] 99.67837
## 
## $horsepower
## [1] 35.70885
## 
## $weight
## [1] 811.3002
## 
## $acceleration
## [1] 2.693721

Exploring Auto Data

x2 %>%
  gather() %>%                             
  ggplot(aes(value)) +                     
    facet_wrap(~ key, scales = "free") +  
    geom_density()

cormat <- round(cor(x2),2)
melted_cormat <- melt(cormat)
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value)) +
    geom_tile(color = "white") +
    scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                         midpoint = 0, limit = c(-1, 1), space = "Lab",
                         name = "Pearson\nCorrelation") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

pairs(x2)

Exploring the Auto data shows many linear relationships between variables. Acceleration stands out as the one variable without a strong linear relationship with the others.

Suppose Predicting MPG

On the topic of predicting miles per gallon, our plots suggest that many variables have a roughly linear relationship with mpg. Displacement, horsepower, and weight all have a negative relationship with mpg, so a fairly simple model should capture much of the variation in mpg. One possible issue is multicollinearity: displacement, horsepower, and weight all have positive linear relationships with each other.
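
One common way to quantify the multicollinearity concern is the variance inflation factor (VIF), computed by regressing each predictor on the others. The sketch below uses the built-in mtcars data as a stand-in for the Auto data (an assumption made so the example is self-contained); its disp, hp, and wt columns play the roles of displacement, horsepower, and weight.

```r
# mtcars ships with R; it is used here only as a stand-in for the Auto data
vars <- c("disp", "hp", "wt")

# VIF for a predictor = 1 / (1 - R^2), where R^2 comes from regressing
# that predictor on the remaining predictors
vif <- sapply(vars, function(v) {
  others <- setdiff(vars, v)
  fit <- lm(reformulate(others, response = v), data = mtcars)
  1 / (1 - summary(fit)$r.squared)
})
print(round(vif, 2))
```

A VIF of 1 means a predictor is uncorrelated with the others, and larger values mean its coefficient estimate is less stable. The same function could be applied to x2 with the Auto column names.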

Question 10

Boston Data

Loading Data

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

head(Boston)

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7

dim(Boston)

## [1] 506  14

The Boston dataset contains 506 rows and 14 columns. Each row is an observation for one suburb of Boston. Each column is a variable describing that suburb, such as the per capita crime rate or the average number of rooms per dwelling.

Pairwise Scatterplots

pairs(Boston)

Based on the pairwise scatterplots, one relationship of note is between nox and dis: the closer a suburb is to the employment centers, the higher its nitrogen oxides concentration, which makes sense. Another is between nox and age, where I think spatial autocorrelation is at work. Age is positively correlated with nox, probably because older homes tend to be closer to the employment centers. I don't think the age of a home itself influences the nitrogen oxides in the atmosphere, unless the homes have a special kind of chimney.

Crime Rate Observation

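Regarding the per capita crime rate by town, the predictors that stand out in the scatterplots are age and dis. Crime tends to be higher in suburbs with older housing (a positive association with age) and lower in suburbs farther from the employment centers (a negative association with dis). As mentioned above, older homes tend to be closer to urban centers.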
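
Reading associations off a 14 by 14 scatterplot matrix is hard, so a numeric check can help. The sketch below lists the Pearson correlation of every other variable with crim, sorted from most positive to most negative. It only measures linear association, so it complements the plots without replacing them.

```r
library(MASS)  # provides the Boston data frame

# Correlation of each variable with the per capita crime rate
crim_cor <- cor(Boston)[, "crim"]
crim_cor <- sort(crim_cor[names(crim_cor) != "crim"], decreasing = TRUE)
print(round(crim_cor, 2))
```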

Range

summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

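Looking at the ranges, several variables reach extreme values in some suburbs. The crime rate reaches 88.98 against a median of 0.26, age reaches 100, zn (residential land zoned for large lots) reaches 100 while its median is 0, nox reaches 0.871, lstat (lower status of the population) reaches 37.97, and medv reaches 50, which marks the most expensive homes in the data.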

Suburbs by the River

sum(Boston$chas)

## [1] 35

There are 35 suburbs that bound the Charles River.

Median pupil-teacher ratios

median(Boston$ptratio)

## [1] 19.05

The median pupil-teacher ratio among the suburbs is 19.05.

Lowest Owner Value

Boston[Boston$medv == min(Boston$medv),]

##        crim zn indus chas   nox    rm age    dis rad tax ptratio  black lstat
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 396.90 30.59
## 406 67.9208  0  18.1    0 0.693 5.683 100 1.4254  24 666    20.2 384.97 22.98
##     medv
## 399    5
## 406    5

There are two suburbs that share the lowest median value of owner-occupied homes (medv = 5). Both have a high crime rate and somewhat elevated nitrogen oxides levels. Age is at its maximum (100), and dis is near the bottom of its range. Both have a high pupil-teacher ratio (20.2, equal to the third quartile), a high black value, and a high lstat (lower status of the population).
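
Words like "high" and "low" can be made precise by locating each value within the distribution of its variable. The sketch below uses the empirical CDF to report, for each of the lowest-medv suburbs, the share of all suburbs with a value at or below theirs. Using ecdf() for this is one possible approach among several.

```r
library(MASS)  # provides the Boston data frame

lowest <- Boston[Boston$medv == min(Boston$medv), ]

# For each variable: share of suburbs with a value <= the lowest-medv suburb's value
pct <- sapply(names(Boston), function(v) ecdf(Boston[[v]])(lowest[[v]]))
rownames(pct) <- rownames(lowest)
print(round(t(pct), 2))
```

A value near 1 means the suburb sits at the top of the range for that variable, and a value near 0 means it sits at the bottom.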

Rooms per Dwelling

sum(Boston$rm > 7)

## [1] 64

sum(Boston$rm > 8)

## [1] 13

Boston[Boston$rm > 8,]

##        crim zn indus chas    nox    rm  age    dis rad tax ptratio  black lstat
## 98  0.12083  0  2.89    0 0.4450 8.069 76.0 3.4952   2 276    18.0 396.90  4.21
## 164 1.51902  0 19.58    1 0.6050 8.375 93.9 2.1620   5 403    14.7 388.45  3.32
## 205 0.02009 95  2.68    0 0.4161 8.034 31.9 5.1180   4 224    14.7 390.55  2.88
## 225 0.31533  0  6.20    0 0.5040 8.266 78.3 2.8944   8 307    17.4 385.05  4.14
## 226 0.52693  0  6.20    0 0.5040 8.725 83.0 2.8944   8 307    17.4 382.00  4.63
## 227 0.38214  0  6.20    0 0.5040 8.040 86.5 3.2157   8 307    17.4 387.38  3.13
## 233 0.57529  0  6.20    0 0.5070 8.337 73.3 3.8384   8 307    17.4 385.91  2.47
## 234 0.33147  0  6.20    0 0.5070 8.247 70.4 3.6519   8 307    17.4 378.95  3.95
## 254 0.36894 22  5.86    0 0.4310 8.259  8.4 8.9067   7 330    19.1 396.90  3.54
## 258 0.61154 20  3.97    0 0.6470 8.704 86.9 1.8010   5 264    13.0 389.70  5.12
## 263 0.52014 20  3.97    0 0.6470 8.398 91.5 2.2885   5 264    13.0 386.86  5.91
## 268 0.57834 20  3.97    0 0.5750 8.297 67.0 2.4216   5 264    13.0 384.54  7.44
## 365 3.47428  0 18.10    1 0.7180 8.780 82.9 1.9047  24 666    20.2 354.55  5.29
##     medv
## 98  38.7
## 164 50.0
## 205 50.0
## 225 44.8
## 226 50.0
## 227 37.6
## 233 41.7
## 234 48.3
## 254 42.8
## 258 50.0
## 263 48.8
## 268 50.0
## 365 21.9

Suburbs that average more than 8 rooms per dwelling mostly have low crime rates and a low lstat (lower status of the population). The median value of owner-occupied homes is also high in most of them. Strangely, despite the high home values, the property tax rate is below average in all but one of these suburbs.
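
These comparisons can be checked numerically by putting the group means next to the overall means. This is a minimal sketch, and the choice of columns to display is just for illustration.

```r
library(MASS)  # provides the Boston data frame

big <- Boston[Boston$rm > 8, ]

# Mean of each variable for the rm > 8 suburbs versus all suburbs
comparison <- rbind(rm_over_8   = colMeans(big),
                    all_suburbs = colMeans(Boston))
print(round(comparison[, c("crim", "lstat", "medv", "tax")], 2))
```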

Victor Feagins

1/24/2021
