Loading an Excel File, Exploring the Data, and Completing the Exercises

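Last time we learned the command for loading an Excel file. This time, use that command to load the given Excel file (mlu.xlsx) and then carry out the operations covered today. The data are on the second sheet of the workbook, so set the command's options accordingly when loading.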

Q1. Load the data, assign it to mlu_data, and make a copy.

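library(readxl)
# The data are on the second sheet, so select it explicitly
mlu <- read_excel("mlu.xlsx", sheet = 2)
# Work on a data-frame copy; mlu keeps the data as originally read
mlu_data <- as.data.frame(mlu)
summary(mlu_data)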
##      File               age            utterances_mlu    words_mlu   
##  Length:35          Length:35          Min.   :323.0   Min.   : 813  
##  Class :character   Class :character   1st Qu.:561.0   1st Qu.:1368  
##  Mode  :character   Mode  :character   Median :621.0   Median :1716  
##                                        Mean   :631.8   Mean   :1710  
##                                        3rd Qu.:716.0   3rd Qu.:2060  
##                                        Max.   :890.0   Max.   :2766  
##                                                                      
##   DurationTime                  DurationSec       Types_freq       Token_freq  
##  Min.   :1899-12-31 00:08:47   Min.   : 527.0   Min.   : 378.0   Min.   : 832  
##  1st Qu.:1899-12-31 00:15:17   1st Qu.: 940.5   1st Qu.: 567.5   1st Qu.:1446  
##  Median :1899-12-31 00:17:39   Median :1060.5   Median : 694.0   Median :1798  
##  Mean   :1899-12-31 00:17:57   Mean   :1088.1   Mean   : 754.8   Mean   :1778  
##  3rd Qu.:1899-12-31 00:20:45   3rd Qu.:1245.5   3rd Qu.: 775.5   3rd Qu.:2134  
##  Max.   :1899-12-31 00:29:22   Max.   :1762.0   Max.   :4014.0   Max.   :2827  
##                                NA's   :1
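
Since the data sit on the second sheet, it can help to list the sheet names before reading. The sketch below uses the example workbook bundled with readxl (datasets.xlsx) as a stand-in for mlu.xlsx, an assumption made only so the snippet runs anywhere. The names path, sheets, second, and second_copy are illustrative.

```r
library(readxl)

# Stand-in workbook shipped with readxl; swap in "mlu.xlsx" for the real task
path <- readxl_example("datasets.xlsx")

# List the sheet names to confirm which sheet holds the data
sheets <- excel_sheets(path)
sheets

# Read the second sheet by position (a sheet name also works)
second <- as.data.frame(read_excel(path, sheet = 2))

# Keep an untouched copy before making any changes
second_copy <- second
str(second)
```

Selecting the sheet by position matches the wording of the task; selecting it by name is more robust if the sheets are later reordered.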

Q2. How many factor levels are in the age column?

class(mlu_data$age)
## [1] "character"
length(mlu_data$age)
## [1] 35
table(mlu_data$age)
## 
## A0 A1 A2 
## 12 11 12
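
The table output shows three groups (A0, A1, A2). The count can also be obtained directly, as in this minimal sketch. The age vector here is a toy stand-in with hypothetical values, not the real column.

```r
# Toy stand-in for mlu_data$age (hypothetical values)
age <- c("A0", "A1", "A2", "A0", "A1", "A2", "A0")

# Number of distinct groups in a character column
n_groups <- length(unique(age))

# Same count after converting the column to a factor
age_f <- factor(age)
n_levels <- nlevels(age_f)
levels(age_f)
```

Converting to a factor is optional here, but it makes the grouping explicit for later tabulation and plotting.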

Q3. Rename utterances_mlu to utterances and words_mlu to words.

library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# rename() takes new_name = old_name pairs; both columns are renamed in one call
mlu_data <- dplyr::rename(mlu_data, utterances = utterances_mlu, words = words_mlu)
summary(mlu_data)
##      File               age              utterances        words     
##  Length:35          Length:35          Min.   :323.0   Min.   : 813  
##  Class :character   Class :character   1st Qu.:561.0   1st Qu.:1368  
##  Mode  :character   Mode  :character   Median :621.0   Median :1716  
##                                        Mean   :631.8   Mean   :1710  
##                                        3rd Qu.:716.0   3rd Qu.:2060  
##                                        Max.   :890.0   Max.   :2766  
##                                                                      
##   DurationTime                  DurationSec       Types_freq       Token_freq  
##  Min.   :1899-12-31 00:08:47   Min.   : 527.0   Min.   : 378.0   Min.   : 832  
##  1st Qu.:1899-12-31 00:15:17   1st Qu.: 940.5   1st Qu.: 567.5   1st Qu.:1446  
##  Median :1899-12-31 00:17:39   Median :1060.5   Median : 694.0   Median :1798  
##  Mean   :1899-12-31 00:17:57   Mean   :1088.1   Mean   : 754.8   Mean   :1778  
##  3rd Qu.:1899-12-31 00:20:45   3rd Qu.:1245.5   3rd Qu.: 775.5   3rd Qu.:2134  
##  Max.   :1899-12-31 00:29:22   Max.   :1762.0   Max.   :4014.0   Max.   :2827  
##                                NA's   :1

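Q4. Compute the average number of words per utterance and store it in a derived variable named mlu.

# Mean length of utterance = total words / total utterances (base R, no package needed)
mlu_data$mlu <- mlu_data$words / mlu_data$utterances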
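
The same derived variable can be written with dplyr's mutate(), which is already loaded for rename(). This is a sketch on a toy data frame named demo with hypothetical counts, not the real data.

```r
library(dplyr)

# Toy data with the renamed columns (hypothetical counts)
demo <- data.frame(utterances = c(400, 500, 800),
                   words      = c(800, 1500, 2000))

# Mean length of utterance = words per utterance
demo <- demo %>% mutate(mlu = words / utterances)
demo$mlu
```

Both forms give the same column; mutate() is mainly convenient when several derived variables are created in one step.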

Q5. Use the summary command to find the mean and the quartiles of mlu.

# Minimum, quartiles, mean, and maximum of the derived mlu column
summary(mlu_data$mlu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.730   2.447   2.745   2.696   2.916   3.476

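Q6. Split the mlu values into four grades, labelling the group with the longest utterances A and the following groups B, C, and D in order, and store the result in a grade column.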

The quartiles from Q5 serve as the cut points:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.730   2.447   2.745   2.696   2.916   3.476

# Test from the top grade down: 3rd quartile, median, 1st quartile
mlu_data$grade <- ifelse(mlu_data$mlu >= 2.916, "A",
                         ifelse(mlu_data$mlu >= 2.745, "B",
                                ifelse(mlu_data$mlu >= 2.447, "C", "D")))
table(mlu_data$grade)
## 
## A B C D 
## 9 9 8 9
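
The cut points above are typed in from the rounded summary() output, so a value sitting right at a boundary could land in the wrong grade. One alternative is to compute the quartiles with quantile() and bin with cut(). This is a sketch on a toy vector with hypothetical values, not the method used in the solution above.

```r
# Toy mlu values (hypothetical); replace with mlu_data$mlu
mlu <- c(1.7, 2.1, 2.3, 2.5, 2.6, 2.8, 2.9, 3.4)

# Quartile boundaries computed from the data: 0%, 25%, 50%, 75%, 100%
breaks <- quantile(mlu, probs = seq(0, 1, 0.25))

# right = FALSE makes each bin [lower, upper), matching the >= tests;
# include.lowest = TRUE closes the top bin so the maximum is kept
grade <- cut(mlu, breaks = breaks, labels = c("D", "C", "B", "A"),
             right = FALSE, include.lowest = TRUE)
grade <- as.character(grade)
table(grade)
```

cut() requires distinct break values, so this approach assumes the quartiles of the data do not coincide.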

Q7. Use the table command to obtain the frequency distribution of age against mlu.

# summary() of a table reports a chi-squared test of independence, not the counts themselves
summary(table(mlu_data$age, mlu_data$mlu))
## Number of cases in table: 35 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 70, df = 68, p-value = 0.4102
##  Chi-squared approximation may be incorrect
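
Because mlu is continuous, table(age, mlu) produces one column per distinct value. A more compact view, offered here only as an alternative, is to cross-tabulate age against the grade column from Q6. The data frame demo below is a toy example with hypothetical rows.

```r
# Toy data (hypothetical): one row per recording
demo <- data.frame(age   = c("A0", "A0", "A1", "A1", "A2", "A2"),
                   grade = c("D",  "C",  "C",  "B",  "A",  "A"))

# Cross-tabulate age group against grade
tab <- table(demo$age, demo$grade)
tab

# Row-wise proportions: share of each grade within an age group
prop.table(tab, margin = 1)
```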

Q8. Use qplot to plot the distribution of mlu values with the age groups on the x-axis.

library("ggplot2")
table(mlu_data$age, mlu_data$mlu)
##     
##      1.72978723404255 2.05936920222635 2.13109243697479 2.20781527531083
##   A0                1                1                1                1
##   A1                0                0                0                0
##   A2                0                0                0                0
##     
##      2.20955882352941 2.2791519434629 2.35725938009788 2.40298507462687
##   A0                0               1                0                0
##   A1                1               0                1                1
##   A2                0               0                0                0
##     
##      2.42857142857143 2.46551724137931 2.54157303370787 2.61985815602837
##   A0                0                0                0                1
##   A1                1                0                1                0
##   A2                0                1                0                0
##     
##      2.63072776280323 2.63979848866499 2.66183574879227 2.68727272727273
##   A0                1                0                0                0
##   A1                0                1                1                0
##   A2                0                0                0                1
##     
##      2.69240196078431 2.74540682414698 2.75541795665635 2.79177057356608
##   A0                0                0                0                1
##   A1                1                1                0                0
##   A2                0                0                1                0
##     
##      2.80441176470588 2.81474820143885 2.81914893617021 2.8353982300885
##   A0                0                0                1               1
##   A1                1                0                0               0
##   A2                0                1                0               0
##     
##      2.84929577464789 2.90047393364929 2.93213296398892 2.97038327526132
##   A0                0                1                0                1
##   A1                0                0                0                0
##   A2                1                0                1                0
##     
##      2.98074074074074 3.02551020408163 3.02862254025045 3.18450704225352
##   A0                0                0                0                0
##   A1                0                0                1                0
##   A2                1                1                0                1
##     
##      3.34057971014493 3.35387323943662 3.47609147609148
##   A0                0                0                0
##   A1                0                0                0
##   A2                1                1                1
qplot(mlu_data$age, mlu_data$mlu)
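
qplot() is deprecated as of ggplot2 3.4.0, so the same plot may be worth writing with ggplot() as well. The sketch below uses a toy data frame named demo with hypothetical values. The boxplot layer is an addition for showing the spread per group and is not part of the original solution.

```r
library(ggplot2)

# Toy data (hypothetical values); replace with mlu_data
demo <- data.frame(age = c("A0", "A0", "A1", "A1", "A2", "A2"),
                   mlu = c(1.9, 2.3, 2.4, 2.7, 2.9, 3.3))

# Treat age as a categorical x-axis
demo$age <- factor(demo$age)

# Boxplot per age group with the raw points overlaid
p <- ggplot(demo, aes(x = age, y = mlu)) +
  geom_boxplot() +
  geom_jitter(width = 0.1) +
  labs(x = "Age group", y = "MLU (words per utterance)")
p
```

Dropping geom_boxplot() and using geom_point() in place of geom_jitter() gives a plain scatter of mlu by age group.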