Author's book web page: https://dataninja.me/ipds-kr/

First, load the essential tidyverse package. (suppressMessages() is used to hide the loading messages.)

# install.packages("tidyverse")
suppressMessages(library(tidyverse))

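1. (Visualizing the IMDB data)

Download the following IMDB (Internet Movie Database) movie data from the Kaggle website (https://goo.gl/R08lpm or https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset; a free Kaggle account is required).

For a description of the data, see the Chapter 3 exercise solutions: http://rpubs.com/dataninja/ipds-kr-solutions-ch03

After downloading the data zip file, read it into R (read_csv() decompresses .zip files automatically):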

df2 <- read_csv("imdb-5000-movie-dataset.zip", guess_max = 1e6)

## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   color = col_character(),
##   director_name = col_character(),
##   actor_2_name = col_character(),
##   genres = col_character(),
##   actor_1_name = col_character(),
##   movie_title = col_character(),
##   actor_3_name = col_character(),
##   plot_keywords = col_character(),
##   movie_imdb_link = col_character(),
##   language = col_character(),
##   country = col_character(),
##   content_rating = col_character(),
##   budget = col_double(),
##   imdb_score = col_double(),
##   aspect_ratio = col_double()
## )

## See spec(...) for full column specifications.

a. What variables does this data set contain?

df2 %>% glimpse()

## Observations: 5,043
## Variables: 28
## $ color                     <chr> "Color", "Color", "Color", "Color", ...
## $ director_name             <chr> "James Cameron", "Gore Verbinski", "...
## $ num_critic_for_reviews    <int> 723, 302, 602, 813, NA, 462, 392, 32...
## $ duration                  <int> 178, 169, 148, 164, NA, 132, 156, 10...
## $ director_facebook_likes   <int> 0, 563, 0, 22000, 131, 475, 0, 15, 0...
## $ actor_3_facebook_likes    <int> 855, 1000, 161, 23000, NA, 530, 4000...
## $ actor_2_name              <chr> "Joel David Moore", "Orlando Bloom",...
## $ actor_1_facebook_likes    <int> 1000, 40000, 11000, 27000, 131, 640,...
## $ gross                     <int> 760505847, 309404152, 200074175, 448...
## $ genres                    <chr> "Action|Adventure|Fantasy|Sci-Fi", "...
## $ actor_1_name              <chr> "CCH Pounder", "Johnny Depp", "Chris...
## $ movie_title               <chr> "Avatar ", "Pirates of the Caribbean...
## $ num_voted_users           <int> 886204, 471220, 275868, 1144337, 8, ...
## $ cast_total_facebook_likes <int> 4834, 48350, 11700, 106759, 143, 187...
## $ actor_3_name              <chr> "Wes Studi", "Jack Davenport", "Step...
## $ facenumber_in_poster      <int> 0, 0, 1, 0, 0, 1, 0, 1, 4, 3, 0, 0, ...
## $ plot_keywords             <chr> "avatar|future|marine|native|paraple...
## $ movie_imdb_link           <chr> "http://www.imdb.com/title/tt0499549...
## $ num_user_for_reviews      <int> 3054, 1238, 994, 2701, NA, 738, 1902...
## $ language                  <chr> "English", "English", "English", "En...
## $ country                   <chr> "USA", "USA", "UK", "USA", NA, "USA"...
## $ content_rating            <chr> "PG-13", "PG-13", "PG-13", "PG-13", ...
## $ budget                    <dbl> 237000000, 300000000, 245000000, 250...
## $ title_year                <int> 2009, 2007, 2015, 2012, NA, 2012, 20...
## $ actor_2_facebook_likes    <int> 936, 5000, 393, 23000, 12, 632, 1100...
## $ imdb_score                <dbl> 7.9, 7.1, 6.8, 8.5, 7.1, 6.6, 6.2, 7...
## $ aspect_ratio              <dbl> 1.78, 2.35, 2.35, 2.35, NA, 2.35, 2....
## $ movie_facebook_likes      <int> 33000, 0, 85000, 164000, 0, 24000, 0...

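b. Answer the following questions through visualization.

(An example analysis used to be available at https://goo.gl/pYPzvi. Unfortunately, the original page, https://www.kaggle.com/adhok93/d/deepmatrix/imdb-5000-movie-dataset/eda-with-plotly, has since been deleted.)

i. How many reviewed movies are there per year?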

df2 %>%
  group_by(title_year) %>%
  summarize(n_movies=n()) %>% 
  ggplot(aes(title_year, n_movies)) + geom_point() + geom_line()

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing missing values (geom_path).
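
The two warnings come from the single summary row whose title_year is NA: group_by() keeps missing years as their own group, and ggplot2 then drops that row. As a minimal sketch on a made-up toy table (toy_movies is hypothetical, not the IMDB data), the row can be removed before counting, and count() can stand in for group_by() plus summarize():

```r
library(dplyr)

# Toy stand-in for df2: title_year contains one missing value (made-up data).
toy_movies <- tibble(
  title_year = c(2009, 2009, 2010, NA, 2010, 2010),
  imdb_score = c(7.9, 7.1, 6.8, 7.1, 6.6, 6.2)
)

# count() is shorthand for group_by() + summarize(n = n()).
movies_per_year <- toy_movies %>%
  filter(!is.na(title_year)) %>%          # drop rows with an unknown year first
  count(title_year, name = "n_movies")    # one row per year

print(movies_per_year)
```

On the real data, the same two steps applied to df2 should give a plot without the missing-value warnings.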

ii. How has the review score changed over the years?

df2 %>%
  group_by(title_year) %>%
  summarize(avg_imdb_score = mean(imdb_score)) %>%
  ggplot(aes(title_year, avg_imdb_score)) + geom_point() + geom_line()

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing missing values (geom_path).

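The average score has been gradually declining over the years.

(Advanced analysis: what might be causing this downward trend in the average score?)

iii. Does the distribution of review scores differ by content rating (content_rating)?

Let us first look at the distribution of the ratings: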

df2 %>%
  ggplot(aes(content_rating)) + geom_bar()

This shows that most movies have one of four content ratings: G, PG, PG-13, or R. The analysis below focuses on movies with these four ratings.

Side-by-side boxplots of the review score for each rating:

df2 %>%
  filter(content_rating %in% c("G", "PG", "PG-13", "R")) %>%
  ggplot(aes(content_rating, imdb_score)) + geom_boxplot()

From this, the median review score follows the order G > R > PG > PG-13. There are also R-rated movies whose top scores are high enough to be close to outliers.

(Advanced: which R-rated movies received these top scores?)
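
One way to approach this question is to filter to the R rating and sort by score. The sketch below uses a made-up table (toy_movies and its titles "A" to "E" are hypothetical); on the real data, df2 would take its place, using the movie_title, content_rating, and imdb_score columns listed in the glimpse() output.

```r
library(dplyr)

# Made-up rows that mimic three columns of df2.
toy_movies <- tibble(
  movie_title    = c("A", "B", "C", "D", "E"),
  content_rating = c("R", "PG", "R", "R", "G"),
  imdb_score     = c(9.2, 7.0, 8.9, 6.1, 8.0)
)

top_r <- toy_movies %>%
  filter(content_rating == "R") %>%   # keep R-rated movies only
  arrange(desc(imdb_score)) %>%       # highest score first
  head(2)                             # the top two rows

print(top_r)
```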

As a similar visualization, the probability density of the score can be overlaid for each rating:

df2 %>%
  filter(content_rating %in% c("G", "PG", "PG-13", "R")) %>%
  ggplot(aes(imdb_score, fill=content_rating, linetype=content_rating)) + 
  geom_density(alpha=.3)

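(The author is color-blind and cannot distinguish the groups well enough with fill= alone, so linetype= is used as well.) What this visualization additionally shows is that a fairly large share of G-rated movies have high scores. (Disney movies, perhaps?)

For a more traditional statistical hypothesis test, analysis of variance (ANOVA) can be applied.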

summary(lm(imdb_score ~ content_rating, 
           data=df2 %>% 
             filter(content_rating %in% c("G", "PG", "PG-13", "R"))))

## 
## Call:
## lm(formula = imdb_score ~ content_rating, data = df2 %>% filter(content_rating %in% 
##     c("G", "PG", "PG-13", "R")))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9295 -0.6271  0.0729  0.7425  2.7729 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          6.529464   0.103061  63.356   <2e-16 ***
## content_ratingPG    -0.235028   0.110989  -2.118   0.0343 *  
## content_ratingPG-13 -0.271969   0.106938  -2.543   0.0110 *  
## content_ratingR     -0.002363   0.105750  -0.022   0.9822    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.091 on 4388 degrees of freedom
## Multiple R-squared:  0.0139, Adjusted R-squared:  0.01322 
## F-statistic: 20.62 on 3 and 4388 DF,  p-value: 2.92e-13

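Although this is partly because the sample is so large, the mean score differs between the rating groups with statistical significance (F-test p-value 2.92e-13). The effect is small, however: content rating explains only about 1.4% of the variance in scores (R-squared 0.0139).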
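
The coefficient table of summary(lm()) compares each rating only against the baseline level (G). To get the ANOVA table itself and all pairwise comparisons, anova() and TukeyHSD() from base R can be used. This is a sketch on the built-in PlantGrowth data set, used purely as a stand-in for the filtered IMDB data. The same calls with imdb_score ~ content_rating should work on it.

```r
# PlantGrowth ships with base R: plant weight for three treatment groups.
fit <- lm(weight ~ group, data = PlantGrowth)

# One-way ANOVA table; its F statistic is the same one that
# summary(fit) reports on its last line.
anova_table <- anova(fit)
print(anova_table)

# Tukey's HSD: every pairwise difference of group means,
# with p-values adjusted for multiple comparisons.
tukey <- TukeyHSD(aov(weight ~ group, data = PlantGrowth))
print(tukey)
```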

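iv. What is the relationship between the number of Facebook likes and the review score?

First, look at the distribution of the Facebook likes variable (movie_facebook_likes). Because the distribution has a very long tail, a square-root transformation is applied. (Readers are encouraged to try a log10 transformation as well.)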

df2 %>%
  ggplot(aes(movie_facebook_likes)) +
  geom_histogram() +
  scale_x_sqrt()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
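
For the log10 variant, one detail needs care: log10(0) is -Inf, so movies with zero likes would fall off a log axis. A minimal sketch on made-up counts (toy_likes is hypothetical, not the IMDB data), where adding 1 before the transformation is one common workaround and is an assumption of this example:

```r
library(ggplot2)

# Made-up, long-tailed like counts including zeros (not the IMDB data).
set.seed(1)
toy_likes <- data.frame(
  likes = c(rep(0, 50), round(rexp(450, rate = 1 / 5000)))
)

# log10(0) is -Inf, so shift by 1 to keep the zero-like movies on the axis.
p_log <- ggplot(toy_likes, aes(likes + 1)) +
  geom_histogram(bins = 30) +   # setting bins explicitly avoids the stat_bin() message
  scale_x_log10()

print(p_log)
```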

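One odd feature of the distribution stands out in this visualization: after the transformation, there is a strange gap in the middle of the distribution.

(The author has not yet found the reason; if you figure it out, please share.)

In contrast, the distribution of scores looks quite regular: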

df2 %>%
  ggplot(aes(imdb_score)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now draw a scatterplot of the score against the square-root-transformed number of likes.

df2 %>%
  ggplot(aes(movie_facebook_likes, imdb_score)) + 
  geom_point() +
  scale_x_sqrt() +
  geom_smooth()

## `geom_smooth()` using method = 'gam'

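The smoothed curve indicates a positive association. Somewhat in line with common sense, “the more Facebook likes a movie has, the higher its review score.”

But is this analysis accurate? People have only been using Facebook fairly recently, so an older movie may have few Facebook likes even if it is a good movie.

To check this, look at the distribution of Facebook likes by year: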

df2 %>%
  ggplot(aes(as.factor(title_year), movie_facebook_likes)) +
  geom_boxplot() +
  scale_y_sqrt()

As expected, the distribution of likes differs greatly before and after 2010.

So draw the same scatterplot as before, but restricted to US movies released after 2010 (title_year > 2010):

df2 %>%
  filter(title_year > 2010 & country == "USA") %>%
  ggplot(aes(movie_facebook_likes, imdb_score)) + 
  geom_point() +
  scale_x_sqrt() +
  geom_smooth()

## `geom_smooth()` using method = 'loess'

On the same subset, for movies with more than 100 likes, the correlation between the two variables is moderately strong (about 0.50):

df3 <- df2 %>%
  filter(title_year > 2010 & country == "USA") %>%
  filter(movie_facebook_likes > 100)
cor(sqrt(df3$movie_facebook_likes), df3$imdb_score)

## [1] 0.5038561
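
The Pearson correlation above depends on the chosen transformation (square root here). A rank-based alternative, Spearman's correlation, is unchanged by any increasing transformation such as sqrt() or log10(). The sketch below illustrates this on simulated data. The variables likes and score and the numbers used to generate them are assumptions for illustration only.

```r
# Simulated stand-ins for movie_facebook_likes and imdb_score (not real data).
set.seed(42)
likes <- round(rexp(200, rate = 1 / 20000)) + 101   # all values above 100
score <- 5.4 + 0.006 * sqrt(likes) + rnorm(200, sd = 1)

# Pearson correlation changes with the transformation of likes ...
pearson_raw  <- cor(likes, score)
pearson_sqrt <- cor(sqrt(likes), score)

# ... while Spearman's rank correlation does not.
spearman_raw  <- cor(likes, score, method = "spearman")
spearman_sqrt <- cor(sqrt(likes), score, method = "spearman")

print(c(pearson_raw = pearson_raw, pearson_sqrt = pearson_sqrt,
        spearman_raw = spearman_raw, spearman_sqrt = spearman_sqrt))
```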

Linear regression also provides parameter estimates and hypothesis test results:

summary(lm(imdb_score ~ sqrt(movie_facebook_likes), data=df3))

## 
## Call:
## lm(formula = imdb_score ~ sqrt(movie_facebook_likes), data = df3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3556 -0.4786  0.1203  0.6417  2.6799 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                5.4265227  0.0728755   74.46   <2e-16 ***
## sqrt(movie_facebook_likes) 0.0061409  0.0004082   15.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9776 on 665 degrees of freedom
## Multiple R-squared:  0.2539, Adjusted R-squared:  0.2527 
## F-statistic: 226.3 on 1 and 665 DF,  p-value: < 2.2e-16
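
To read the fitted model concretely, the estimates can be plugged in by hand. The sketch below copies the two coefficients from the output above. The helper predict_score() is a hypothetical name introduced only for this illustration. It also checks that, for a regression with a single predictor, R-squared equals the squared correlation reported earlier (0.5038561).

```r
# Coefficients copied from the regression output above.
intercept <- 5.4265227
slope     <- 0.0061409

# Fitted line: imdb_score = intercept + slope * sqrt(movie_facebook_likes)
predict_score <- function(likes) intercept + slope * sqrt(likes)

# Predicted score for a movie with 10,000 likes: sqrt(10000) = 100,
# so the prediction is 5.4265227 + 0.61409, about 6.04.
print(predict_score(10000))

# With one predictor, R-squared is the squared Pearson correlation.
r_squared <- 0.5038561^2
print(round(r_squared, 4))
```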

c. What other interesting visualizations of this data are possible?

(Omitted)

2. (Pokémon data)

Download the following Pokémon data from the Kaggle website (https://goo.gl/sMPKtX or https://www.kaggle.com/abcsds/pokemon; a free Kaggle account is required). Visualize this data. See https://goo.gl/3fxt2x or https://www.kaggle.com/ndrewgele/visualizing-pok-mon-stats-with-seaborn for reference.

After downloading pokemon.zip from the web page, read it into R as follows:

df_pkm <- read_csv("pokemon.zip")

## Parsed with column specification:
## cols(
##   `#` = col_integer(),
##   Name = col_character(),
##   `Type 1` = col_character(),
##   `Type 2` = col_character(),
##   Total = col_integer(),
##   HP = col_integer(),
##   Attack = col_integer(),
##   Defense = col_integer(),
##   `Sp. Atk` = col_integer(),
##   `Sp. Def` = col_integer(),
##   Speed = col_integer(),
##   Generation = col_integer(),
##   Legendary = col_character()
## )

The data looks roughly like this:

df_pkm %>% glimpse()

## Observations: 800
## Variables: 13
## $ `#`        <int> 1, 2, 3, 3, 4, 5, 6, 6, 6, 7, 8, 9, 9, 10, 11, 12, ...
## $ Name       <chr> "Bulbasaur", "Ivysaur", "Venusaur", "VenusaurMega V...
## $ `Type 1`   <chr> "Grass", "Grass", "Grass", "Grass", "Fire", "Fire",...
## $ `Type 2`   <chr> "Poison", "Poison", "Poison", "Poison", NA, NA, "Fl...
## $ Total      <int> 318, 405, 525, 625, 309, 405, 534, 634, 634, 314, 4...
## $ HP         <int> 45, 60, 80, 80, 39, 58, 78, 78, 78, 44, 59, 79, 79,...
## $ Attack     <int> 49, 62, 82, 100, 52, 64, 84, 130, 104, 48, 63, 83, ...
## $ Defense    <int> 49, 63, 83, 123, 43, 58, 78, 111, 78, 65, 80, 100, ...
## $ `Sp. Atk`  <int> 65, 80, 100, 122, 60, 80, 109, 130, 159, 50, 65, 85...
## $ `Sp. Def`  <int> 65, 80, 100, 120, 50, 65, 85, 85, 115, 64, 80, 105,...
## $ Speed      <int> 45, 60, 80, 80, 65, 80, 100, 100, 100, 43, 58, 78, ...
## $ Generation <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Legendary  <chr> "False", "False", "False", "False", "False", "False...

Many visualizations are possible; the following reproduce some of those shown on the example page above:

df_pkm %>% ggplot(aes(HP)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

df_pkm %>% ggplot(aes(Attack)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

df_pkm %>% ggplot(aes(HP, Attack)) + geom_point(alpha=.3)

Beyond these, to see how the relationship between HP and Attack varies by primary type (the `Type 1` column), the facet_wrap() function is useful:

df_pkm %>% 
  ggplot(aes(HP, Attack)) + 
  geom_point(alpha=.3) +
  # geom_smooth() + 
  facet_wrap(~`Type 1`)
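
Small panels can be hard to compare by eye, so a per-type numeric summary may be a useful companion to the faceted plot. This is a sketch on a toy table (toy_pkm is hypothetical; its five rows copy the HP and Attack values of the first non-Mega entries visible in the glimpse() output). On the real data, df_pkm would take its place.

```r
library(dplyr)

# Toy rows taken from the glimpse() output above (first few Pokémon only).
toy_pkm <- tibble(
  `Type 1` = c("Grass", "Grass", "Fire", "Fire", "Water"),
  HP       = c(45, 60, 39, 58, 44),
  Attack   = c(49, 62, 52, 64, 48)
)

# Count and average stats per primary type; column names with
# spaces need backticks.
type_summary <- toy_pkm %>%
  group_by(`Type 1`) %>%
  summarize(n = n(), mean_hp = mean(HP), mean_attack = mean(Attack)) %>%
  arrange(desc(n))

print(type_summary)
```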

<따라 하며 배우는 데이터 과학> 4장 연습문제 해답

권재명

9/27/2017