Species: categorical
All other variables: continuous
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 0 0 0 0 0
A large gap between the median and the mean suggests a skewed distribution, often caused by many outliers. (Important)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
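The mean-versus-median check noted before the summary can be computed directly. This is a minimal sketch on the built-in iris data; the names `gap` and `largest` are illustrative and do not come from the original notes.

```r
# Mean minus median for each continuous iris column
num_cols <- iris[, 1:4]
gap <- sapply(num_cols, function(v) mean(v) - median(v))
round(gap, 3)

# Column with the largest absolute gap; a candidate for a closer look
largest <- names(which.max(abs(gap)))
largest
```

For Petal.Length the gap comes from two well-separated groups of flowers rather than a few extreme points, so a gap is a prompt to inspect the distribution, not proof of outliers.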
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
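Standard deviation: its square is the variance of the component, which equals the eigenvalue
Proportion of Variance: share of the total variance explained by the component
Cumulative Proportion: running total of those shares
PC1 explains 72.96% of the total variance.
PC1 + PC2 together explain 95.81%.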
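The three rows of the summary can be rebuilt from `sdev` alone. The sketch below assumes the PCA was fit with `prcomp()` on the four centred and scaled numeric iris columns; the original call is not shown, but the printed standard deviations are consistent with that setup.

```r
# Assumed fit: centred and scaled PCA on the four numeric columns
iris.pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

eig <- iris.pca$sdev^2       # eigenvalues = component variances
prop <- eig / sum(eig)       # proportion of variance
cum_prop <- cumsum(prop)     # cumulative proportion

round(rbind(eigenvalue = eig, proportion = prop, cumulative = cum_prop), 4)
```

With scaled inputs the eigenvalues sum to the number of variables (4 here), which is why each proportion is simply the eigenvalue divided by 4.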
Rotation
The eigenvector of each principal component (= the weight of each variable)
## PC1 PC2 PC3 PC4
## Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863
## Sepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971
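To check that the rotation matrix really holds eigenvectors, it can be compared with an eigen decomposition of the correlation matrix. This is a sketch under the assumption that the PCA was run on scaled data; the error variables are illustrative names.

```r
X <- scale(iris[, 1:4])          # centre and scale the numeric columns
iris.pca <- prcomp(X)
ev <- eigen(cor(iris[, 1:4]))    # eigen decomposition of the correlation matrix

# Columns of rotation are unit length and mutually orthogonal
orth_err <- max(abs(crossprod(iris.pca$rotation) - diag(4)))

# Same vectors as eigen(), up to the sign of each column
vec_err <- max(abs(abs(ev$vectors) - abs(iris.pca$rotation)))

# Eigenvalues match the squared standard deviations
val_err <- max(abs(ev$values - iris.pca$sdev^2))

c(orth_err = orth_err, vec_err = vec_err, val_err = val_err)
```

The sign of an eigenvector is arbitrary, so a whole column may appear negated on another machine without changing the analysis.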
X
## PC1 PC2 PC3 PC4
## [1,] -2.257141 -0.4784238 0.12727962 0.024087508
## [2,] -2.074013 0.6718827 0.23382552 0.102662845
## [3,] -2.356335 0.3407664 -0.04405390 0.028282305
## [4,] -2.291707 0.5953999 -0.09098530 -0.065735340
## [5,] -2.381863 -0.6446757 -0.01568565 -0.035802870
## [6,] -2.068701 -1.4842053 -0.02687825 0.006586116
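The scores in `x` are the scaled observations projected onto the eigenvectors. A minimal sketch, again assuming a scaled PCA on the four numeric iris columns:

```r
X <- scale(iris[, 1:4])
iris.pca <- prcomp(X)

# Project each observation onto the principal axes
scores <- X %*% iris.pca$rotation

# Largest difference from the scores stored by prcomp()
score_err <- max(abs(scores - iris.pca$x))
head(scores[, 1:2])
```

Taking only the first two columns of `scores` gives the reduced two-dimensional representation.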
Scree Plot
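Why not keep a third component: a component's variance is its eigenvalue, and PC3's eigenvalue (0.38309^2, about 0.15) is far below 1, while the first two components already explain 95.81% of the variance, so two components are appropriate. (PC2's eigenvalue, about 0.91, is itself slightly below 1, so the eigenvalue-above-1 rule applied strictly would keep only PC1.)
Reduce to two dimensions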
## PC1 PC2
## [1,] -2.257141 -0.47842383
## [2,] -2.074013 0.67188269
## [3,] -2.356335 0.34076642
## [4,] -2.291707 0.59539986
## [5,] -2.381863 -0.64467566
## [6,] -2.068701 -1.48420530
## [7,] -2.435868 -0.04748512
## [8,] -2.225392 -0.22240300
## [9,] -2.326845 1.11160370
## [10,] -2.177035 0.46744757
library(factoextra)
fviz_pca_biplot(iris.pca,
                # Individuals
                geom.ind = "point",
                fill.ind = iris$Species, col.ind = "black",
                pointshape = 21, pointsize = 2,
                palette = "jco",
                addEllipses = TRUE,
                # Variables
                alpha.var = "contrib", col.var = "contrib",
                gradient.cols = "RdYlBu",
                legend.title = list(fill = "Species", color = "Contrib",
                                    alpha = "Contrib"))
Advanced visualization 2
fviz_pca_biplot(iris.pca,
                # Fill individuals by groups
                geom.ind = "point",
                pointshape = 21,
                pointsize = 2.5,
                fill.ind = iris$Species,
                col.ind = "black",
                # Color variable by groups
                col.var = factor(c("sepal", "sepal", "petal", "petal")),
                legend.title = list(fill = "Species", color = "Clusters"),
                repel = TRUE) +
  ggpubr::fill_palette("jco")
Arrows that lie close together and point in the same direction indicate higher correlation. Read left/right along PC1 (Dim1) and up/down along PC2 (Dim2).
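The reading rule for arrows can be checked numerically: the cosine of the angle between two variable arrows in the PC1-PC2 plane should be close to the correlation of the variables when those two components capture most of the variance. This is a sketch assuming a scaled PCA on iris; `cosine` is a hypothetical helper, not a factoextra function.

```r
iris.pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Variable coordinates in the PC1-PC2 plane: loadings scaled by sdev
coords <- sweep(iris.pca$rotation[, 1:2], 2, iris.pca$sdev[1:2], FUN = "*")

# Cosine of the angle between two vectors
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

cos_petal <- cosine(coords["Petal.Length", ], coords["Petal.Width", ])
cor_petal <- cor(iris$Petal.Length, iris$Petal.Width)
round(c(cosine = cos_petal, correlation = cor_petal), 3)
```

The two petal arrows are almost parallel and the two petal variables are strongly correlated, which is the pattern the biplot is meant to reveal.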
Since 2008, guests and hosts have used Airbnb to travel and to get a more distinctive, personalized travel experience.
Column Books
id: listing ID
name: name of the listing
host_id: host ID
host_name: name of the host
neighbourhood_group: location
neighbourhood: area
latitude: latitude coordinates
longitude: longitude coordinates
room_type: listing space type
price: price in dollars
minimum_nights: minimum number of nights
number_of_reviews: number of reviews
last_review: latest review
reviews_per_month: number of reviews per month
calculated_host_listings_count: number of listings per host
availability_365: number of days when the listing is available for booking
Data Preprocessing
setwd("C:/Users/Administrator/Desktop/R Analysis/Business R Aanlysis source")
read.csv("AB_NYC_2019.csv") -> df
str(df)
## 'data.frame': 48895 obs. of 16 variables:
## $ id : int 2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
## $ name : chr "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
## $ host_id : int 2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
## $ host_name : chr "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
## $ neighbourhood_group : chr "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
## $ neighbourhood : chr "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
## $ latitude : num 40.6 40.8 40.8 40.7 40.8 ...
## $ longitude : num -74 -74 -73.9 -74 -73.9 ...
## $ room_type : chr "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
## $ price : int 149 225 150 89 80 200 60 79 79 150 ...
## $ minimum_nights : int 1 1 3 1 10 3 45 2 2 1 ...
## $ number_of_reviews : int 9 45 0 270 9 74 49 430 118 160 ...
## $ last_review : chr "2018-10-19" "2019-05-21" "" "2019-07-05" ...
## $ reviews_per_month : num 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
## $ calculated_host_listings_count: int 6 2 1 1 1 1 1 1 1 4 ...
## $ availability_365 : int 365 355 365 194 0 129 0 220 0 188 ...
library(dplyr)
# Drop unused columns and set column types
df %>%
  select(-reviews_per_month, -last_review, -latitude, -longitude) %>%
  mutate(name = as.character(name),
         id = as.character(id),
         host_id = as.character(host_id),
         host_name = as.character(host_name),
         price = as.numeric(price)) -> df_2
Extract the numeric columns only
Descriptive Statistics
Correlation
## price minimum_nights number_of_reviews
## price 1.00000000 0.04279933 -0.04795423
## minimum_nights 0.04279933 1.00000000 -0.08011607
## number_of_reviews -0.04795423 -0.08011607 1.00000000
## calculated_host_listings_count 0.05747169 0.12795963 -0.07237606
## availability_365 0.08182883 0.14430306 0.17202758
## calculated_host_listings_count availability_365
## price 0.05747169 0.08182883
## minimum_nights 0.12795963 0.14430306
## number_of_reviews -0.07237606 0.17202758
## calculated_host_listings_count 1.00000000 0.22570137
## availability_365 0.22570137 1.00000000
PCA Analysis
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 1.1695 1.0646 0.9845 0.9334 0.8115
## Proportion of Variance 0.2735 0.2267 0.1938 0.1742 0.1317
## Cumulative Proportion 0.2735 0.5002 0.6940 0.8683 1.0000
The first four components (PC1 to PC4) together explain 86.83% of the variance, so four components look appropriate.
X
## PC1 PC2 PC3 PC4 PC5
## [1,] 1.0032784 0.5813873 0.07919756 0.30552371 -1.5550835
## [2,] 1.0039239 1.1772506 0.45957024 0.03989221 -1.1041550
## [3,] 0.9551646 0.4063914 0.06127306 0.17283789 -1.7050652
## [4,] 0.2656214 5.0369605 0.35645301 -0.85687593 2.2226285
## [5,] -0.6608942 -0.5561055 -0.32651731 -0.15276554 0.3290961
Rotation - eigenvectors
## PC1 PC2 PC3 PC4
## price 0.27569841 -0.2114304 0.920046259 -0.12497146
## minimum_nights 0.46486745 -0.2822177 -0.335773478 -0.73645904
## number_of_reviews 0.03681049 0.8354198 0.083556017 -0.18126643
## calculated_host_listings_count 0.57439556 -0.1461332 -0.183649326 0.63909258
## availability_365 0.61368019 0.3954354 0.007897413 0.02670907
## PC5
## price 0.1310900
## minimum_nights 0.2216658
## number_of_reviews 0.5107637
## calculated_host_listings_count 0.4544760
## availability_365 -0.6828263
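Interpretation) PC1: availability_365 (days available out of 365) has the largest positive loading.
PC2: number_of_reviews has the largest positive loading.
PC3: price.
PC4: minimum_nights and calculated_host_listings_count load with opposite signs, so along PC4 a lower minimum stay goes with a higher listing count.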
sdev
## [1] 1.1694633 1.0645600 0.9845079 0.9334173 0.8115071
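Running PCA on the numeric variables only, PC1 (27.35%) and PC2 (22.67%) together account for about 50% of the total variance. On PC1, calculated_host_listings_count and availability_365 explain the most variation; on PC2 it is number_of_reviews. The remaining numeric variables appear to be slightly collinear with one another.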
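###(1) The variable whose arrow runs most nearly parallel to a PC axis is the one that influences that PC the most.
: The variable that influences PC1 the most is calculated_host_listings_count.
###(2) The length of each red arrow represents the variance of the original variable; the longer the arrow, the larger the variance.
: The next variable that influences PC1 can be taken to be minimum_nights.
###(3) The closer two red arrows are, the more correlated the variables. (Conversely, the farther apart they are, the weaker the correlation.)
: calculated_host_listings_count, price and minimum_nights appear related in the biplot, although their pairwise correlations in the matrix above are small (about 0.04 to 0.13).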
Identify the variables most strongly associated with crime
Using the USArrests dataset
library(corrplot)
cor(USArrests, method = "pearson") -> corr
corrplot(corr, method = "number") -> corr_image
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.5749 0.9949 0.59713 0.41645
## Proportion of Variance 0.6201 0.2474 0.08914 0.04336
## Cumulative Proportion 0.6201 0.8675 0.95664 1.00000
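PC2's variance (eigenvalue, 0.9949^2, about 0.99) sits essentially at the cutoff of 1, while PC3's falls well below it, so two principal components are retained.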
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
## PC1 PC2 PC3 PC4
## Murder -0.5358995 0.4181809 -0.3412327 0.64922780
## Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
## Rape -0.5434321 -0.1673186 0.8177779 0.08902432
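The choice of two components can be reproduced from the eigenvalues. The sketch assumes the PCA was fit with `prcomp()` on scaled USArrests (the call is not shown in the notes, but the printed standard deviations are consistent with it); the 85% cumulative threshold is an illustrative assumption, not a value from the original.

```r
# Assumed fit: centred and scaled PCA on all four USArrests columns
us.pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

eig <- us.pca$sdev^2
cum_prop <- cumsum(eig) / sum(eig)

n_kaiser <- sum(eig > 1)              # components with eigenvalue above 1
n_85 <- which(cum_prop >= 0.85)[1]    # first count reaching 85% of variance

round(rbind(eigenvalue = eig, cumulative = cum_prop), 4)
c(kaiser = n_kaiser, cumulative_85 = n_85)
```

The strict eigenvalue rule keeps one component because PC2 falls just short of 1, whereas the cumulative rule keeps two; the scree plot and the cumulative share support the two-component reading.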
K-means: two examples
setwd("C:/Users/Administrator/Desktop/R Analysis/Fast Campus")
read.csv("Wholesale customers data.csv", header = T,
         stringsAsFactors = T) -> df
## Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 2 3 12669 9656 7561 214 2674 1338
## 2 2 3 7057 9810 9568 1762 3293 1776
## 3 2 3 6353 8808 7684 2405 3516 7844
## 4 1 3 13265 1196 4221 6404 507 1788
## 5 2 3 22615 5410 7198 3915 1777 5185
## 6 2 3 9413 8259 5126 666 1795 1451
Convert the Channel and Region variables to factors
Check for missing values
## Channel Region Fresh Milk
## 0 0 0 0
## Grocery Frozen Detergents_Paper Delicassen
## 0 0 0 0
Descriptive statistics and distributions
## Channel Region Fresh Milk Grocery Frozen
## median NA NA 8.504000e+03 3.627000e+03 4.755500e+03 1.526000e+03
## mean NA NA 1.200030e+04 5.796266e+03 7.951277e+03 3.071932e+03
## SE.mean NA NA 6.029377e+02 3.518457e+02 4.530455e+02 2.314375e+02
## CI.mean NA NA 1.185003e+03 6.915113e+02 8.904077e+02 4.548631e+02
## var NA NA 1.599549e+08 5.446997e+07 9.031010e+07 2.356785e+07
## std.dev NA NA 1.264733e+04 7.380377e+03 9.503163e+03 4.854673e+03
## coef.var NA NA 1.053918e+00 1.273299e+00 1.195174e+00 1.580332e+00
## Detergents_Paper Delicassen
## median 8.165000e+02 9.655000e+02
## mean 2.881493e+03 1.524870e+03
## SE.mean 2.272985e+02 1.344433e+02
## CI.mean 4.467286e+02 2.642325e+02
## var 2.273244e+07 7.952997e+06
## std.dev 4.767854e+03 2.820106e+03
## coef.var 1.654647e+00 1.849407e+00
## Channel Region Fresh Milk Grocery
## 1:298 1: 77 Min. : 3 Min. : 55 Min. : 3
## 2:142 2: 47 1st Qu.: 3128 1st Qu.: 1533 1st Qu.: 2153
## 3:316 Median : 8504 Median : 3627 Median : 4756
## Mean : 12000 Mean : 5796 Mean : 7951
## 3rd Qu.: 16934 3rd Qu.: 7190 3rd Qu.:10656
## Max. :112151 Max. :73498 Max. :92780
## Frozen Detergents_Paper Delicassen
## Min. : 25.0 Min. : 3.0 Min. : 3.0
## 1st Qu.: 742.2 1st Qu.: 256.8 1st Qu.: 408.2
## Median : 1526.0 Median : 816.5 Median : 965.5
## Mean : 3071.9 Mean : 2881.5 Mean : 1524.9
## 3rd Qu.: 3554.2 3rd Qu.: 3922.0 3rd Qu.: 1820.2
## Max. :60869.0 Max. :40827.0 Max. :47943.0
K-means is strongly affected by outliers, so it is better to remove them first.
temp <- NULL
# Collect the five largest rows for each numeric column (columns 3 onward)
for (i in 3:ncol(df)) {
  temp <- rbind(temp,
                df[order(df[, i], decreasing = TRUE), ] %>%
                  slice(1:5))
}
temp %>%
  arrange(Fresh) %>%
  head()
## Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 2 3 85 20959 45828 36 24231 1423
## 2 2 3 85 20959 45828 36 24231 1423
## 3 2 2 8565 4980 67298 131 38102 1215
## 4 2 2 8565 4980 67298 131 38102 1215
## 5 1 3 11314 3090 2062 35009 71 2698
## 6 2 3 16117 46197 92780 1026 40827 2944
Remove the duplicates with the distinct function
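The same idea (take the top rows per column, drop repeats, then remove them) can be sketched in base R with row indices, which avoids duplicates without a separate de-duplication step. The data frame `toy` and the helper `top_n_rows` are hypothetical stand-ins; this is an alternative to the dplyr approach in the notes, not a reproduction of it.

```r
set.seed(1)
# Toy data standing in for the numeric purchase columns
toy <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))

# Row indices of the n largest values in each column, without repeats
top_n_rows <- function(data, n = 5) {
  idx <- unlist(lapply(data, function(col) {
    order(col, decreasing = TRUE)[seq_len(n)]
  }))
  unique(idx)
}

out_idx <- top_n_rows(toy, n = 5)
toy_clean <- toy[-out_idx, , drop = FALSE]

c(flagged = length(out_idx), remaining = nrow(toy_clean))
```

A row that is extreme in several columns is counted once, so the number of flagged rows can be smaller than 5 times the number of columns.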
1. Choosing the number of clusters K (Elbow Method)
Point where the WSS curve levels off (the elbow): 5 clusters
library(factoextra)
set.seed(1234)
fviz_nbclust(df.rm.outlier[, 3:ncol(df.rm.outlier)],
kmeans, method = "wss", k.max = 15)+
  theme_minimal()
2. Choosing the number of clusters K (Silhouette Method)
K = 3
set.seed(1234)
fviz_nbclust(df.rm.outlier[, 3:ncol(df.rm.outlier)],
kmeans, method = "silhouette", k.max = 15)+
  theme_minimal()
Because this is customer clustering on purchase data, K = 5 seems the better choice.
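What the silhouette method computes can be sketched without the plotting helper: fit k-means for each candidate K and average the silhouette widths. The example uses scaled iris columns as stand-in data and `silhouette()` from the cluster package (a recommended package that is usually installed with R); the range of K and `nstart = 10` are illustrative choices.

```r
library(cluster)  # provides silhouette()

X <- scale(iris[, 1:4])
d <- dist(X)          # pairwise Euclidean distances
ks <- 2:6             # candidate numbers of clusters

set.seed(1234)
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(X, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- ks

best_k <- ks[which.max(avg_sil)]
round(avg_sil, 3)
best_k
```

The K with the highest average width is the silhouette suggestion; as in the notes, domain knowledge may still justify a different K.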
3. K-Means Modelling
kmeans(df.rm.outlier[, 3:ncol(df.rm.outlier)],
       centers = 5,
       iter.max = 1000) -> df.kmeans
# iter.max: maximum number of times the assign-and-update step is repeated
df.kmeans
## K-means clustering with 5 clusters of sizes 179, 42, 72, 110, 18
##
## Cluster means:
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 4267.933 3751.480 4672.950 2211.313 1550.4469 1036.006
## 2 25332.000 5603.548 7160.024 4144.667 1449.2381 2053.333
## 3 5152.250 12536.694 19616.472 1644.014 8794.1389 1696.653
## 4 14527.509 2606.064 3503.873 3202.073 804.8091 1037.882
## 5 40558.056 3113.444 3814.333 2974.833 684.2778 1271.333
##
## Clustering vector:
## [1] 4 1 1 4 2 1 4 1 1 3 1 4 2 2 2 4 1 1 2 1 4 1 2 2 4 4 4 3 5 2 1 4 2 1 1 2 3
## [38] 3 2 4 3 3 1 3 3 4 3 1 1 5 3 2 1 3 3 4 1 1 1 3 1 1 2 1 1 4 1 2 1 4 1 3 4 1
## [75] 1 3 1 4 4 1 2 4 4 3 3 1 1 1 1 4 3 3 1 4 4 1 3 1 3 4 3 4 4 4 4 4 1 4 1 4 1
## [112] 4 4 5 4 2 1 5 1 1 4 4 1 1 1 1 4 1 4 2 5 4 4 3 1 1 1 5 4 1 4 1 1 3 3 4 1 3
## [149] 1 4 4 3 1 3 1 1 1 1 3 3 1 3 1 1 5 4 4 1 4 1 1 1 1 1 3 3 4 4 1 3 1 4 1 4 4
## [186] 3 3 2 1 1 3 1 1 1 3 4 3 1 1 1 3 3 4 3 1 4 1 1 1 1 4 2 1 1 1 4 1 2 1 4 1 1
## [223] 4 1 5 2 2 4 4 1 3 1 4 4 1 1 3 1 2 1 5 4 1 5 1 1 2 1 3 3 3 4 3 4 1 1 1 5 1
## [260] 1 2 4 4 4 1 4 5 2 5 1 4 4 5 1 1 1 3 2 1 4 1 1 1 4 3 1 3 3 1 3 4 1 3 1 2 3
## [297] 4 4 3 1 1 4 3 1 1 4 4 2 1 1 4 1 4 3 2 4 2 4 4 1 1 1 1 1 3 1 1 3 2 1 3 1 3
## [334] 1 3 4 1 4 3 1 1 4 1 1 1 1 1 4 1 2 1 5 4 1 4 1 1 3 5 1 1 2 4 5 1 3 4 1 4 4
## [371] 4 1 1 1 2 4 4 1 4 4 4 1 2 2 2 4 1 2 3 1 1 1 1 1 1 1 1 3 1 3 1 3 4 2 4 4 4
## [408] 3 2 1 1 1 1 4 1 4 2 5 3 4 1
##
## Within cluster sum of squares by cluster:
## [1] 7488224454 2823135964 9143410363 3900150510 861057236
## (between_SS / total_SS = 70.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
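The `between_SS / total_SS` figure in the printout can be recomputed from the components listed above. A minimal sketch on scaled iris columns as stand-in data; K = 3 and `nstart = 10` are illustrative choices.

```r
set.seed(1234)
X <- scale(iris[, 1:4])
km <- kmeans(X, centers = 3, nstart = 10)

# Share of the total sum of squares explained by the cluster assignment
ratio <- km$betweenss / km$totss

# Total SS splits exactly into within-cluster and between-cluster parts
ss_gap <- km$totss - (km$tot.withinss + km$betweenss)

round(100 * ratio, 1)
```

A higher ratio means tighter, better separated clusters, but it always rises as K grows, so it should be read together with the elbow or silhouette results.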
4. Visualization
barplot(t(df.kmeans$centers), beside=TRUE, col = 1:6)
legend("topleft", colnames(df[,3:8]), fill = 1:6, cex = 0.5)
## Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen cluster
## 1 2 3 12669 9656 7561 214 2674 1338 4
## 2 2 3 7057 9810 9568 1762 3293 1776 1
## 3 2 3 6353 8808 7684 2405 3516 7844 1
## 4 1 3 13265 1196 4221 6404 507 1788 4
## 5 2 3 22615 5410 7198 3915 1777 5185 2
## 6 2 3 9413 8259 5126 666 1795 1451 1
setwd("C:/Users/Administrator/Desktop/R Analysis/Business R Aanlysis source")
read.csv("AB_NYC_2019.csv") -> df
str(df)
## 'data.frame': 48895 obs. of 16 variables:
## $ id : int 2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
## $ name : chr "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
## $ host_id : int 2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
## $ host_name : chr "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
## $ neighbourhood_group : chr "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
## $ neighbourhood : chr "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
## $ latitude : num 40.6 40.8 40.8 40.7 40.8 ...
## $ longitude : num -74 -74 -73.9 -74 -73.9 ...
## $ room_type : chr "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
## $ price : int 149 225 150 89 80 200 60 79 79 150 ...
## $ minimum_nights : int 1 1 3 1 10 3 45 2 2 1 ...
## $ number_of_reviews : int 9 45 0 270 9 74 49 430 118 160 ...
## $ last_review : chr "2018-10-19" "2019-05-21" "" "2019-07-05" ...
## $ reviews_per_month : num 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
## $ calculated_host_listings_count: int 6 2 1 1 1 1 1 1 1 4 ...
## $ availability_365 : int 365 355 365 194 0 129 0 220 0 188 ...
df %>%
select(-reviews_per_month, -last_review, -latitude, -longitude) %>%
mutate(name = as.character(name),
id = as.character(id),
host_id= as.character(host_id),
host_name = as.character(host_name),
price = as.numeric(price)) -> df_2
df_num <-df_2 %>%
  select_if(is.numeric)
#--------------------------------------------
# Remove outliers in price
#---------------------------------------------
df_num %>%
filter(availability_365 == 0) -> temp
anti_join(df_num, temp) -> df_num_outlier
df_num_outlier %>%
filter(price >= 9000 ) -> temp_2
anti_join(df_num_outlier, temp_2) -> df_2
boxplot(df_2$price)
#--------------------------------------------
# Scaling
#---------------------------------------------
df_num_scale <- as.data.frame(scale(df_2))
kmeans(df_num_scale,7) -> df_km
fviz_cluster(df_km,
data=df_num_scale)+
  theme_minimal()
#--------------------------------------------
# Finding K - manual
#---------------------------------------------
# Determine K
wss <- function(data, maxCluster = 20) {
  # For K = 1 the within sum of squares equals the total sum of squares
  SSw <- numeric(maxCluster)
  SSw[1] <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:maxCluster) {
    SSw[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:maxCluster, SSw, type = "o", xlab = "Number of Clusters",
       ylab = "Within groups sum of squares", pch = 19)
}
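wss(df_num_scale)  # elbow at about 8 clusters
# Assumption: df_km8 is the 8-cluster fit; its creation is not shown in the notes
kmeans(df_num_scale, centers = 8) -> df_km8
df_km8$cluster -> df_2$cluster
# One bar colour per variable, matching the legend entries
barplot(t(df_km8$centers), beside = TRUE, col = seq_len(ncol(df_km8$centers)))
legend("topleft", colnames(df_km8$centers), fill = seq_len(ncol(df_km8$centers)), cex = 0.5)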