Practical Data Processing & Analysis with R: ctree() Modeling Using Family Information

ctree() Modeling Using Family Information

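Now let's build a model from the all data and evaluate its performance. Of the columns added so far, we will use type (to separate training rows from validation rows), avg_prob, maybe_parent, maybe_child, avg_parent_prob, and avg_child_prob.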

After applying everything described so far to the first fold, the all data frame looks like this.

> str(all)
'data.frame': 1178 obs. of 19 variables:
 $ pclass         : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ survived       : Factor w/ 2 levels "dead","survived": 2 2 2 2 1 2 1 1 1 1 ...
 $ name           : chr "Cherry, Miss. Gladys" "Maioni, Miss. Roberta" "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)" "Taussig, Miss. Ruth" ...
 $ sex            : Factor w/ 2 levels "female","male": 1 1 1 1 2 1 2 2 2 2 ...
 $ age            : num 30 16 33 18 52 39 NA 47 30 42 ...
 $ sibsp          : int 0 0 0 0 1 1 0 0 0 0 ...
 $ parch          : int 0 0 0 2 1 1 0 0 0 0 ...
 $ ticket         : chr "110152" "110152" "110152" "110413" ...
 $ fare           : num 86.5 86.5 86.5 79.7 79.7 ...
 $ cabin          : chr "B77" "B79" "B77" "E68" ...
 $ embarked       : Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 3 3 3 3 ...
 $ type           : chr "T" "T" "T" "T" ...
 $ prob           : num 0.0591 0.0591 0.0591 0.0591 0.6364 ...
 $ family_id      : Factor w/ 861 levels "TICKET_1","TICKET_2",..: 1 1 1 2 2 2 3 3 4 5 ...
 $ avg_prob       : num 0.0591 0.0591 0.0591 0.2515 0.2515 ...
 $ maybe_parent   : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
 $ maybe_child    : logi FALSE FALSE FALSE TRUE FALSE FALSE ...
 $ avg_parent_prob: num 0.0591 0.0591 0.0591 0.6364 0.6364 ...
 $ avg_child_prob : num 0.0591 0.0591 0.0591 0.0591 0.0591 ...
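
As a rough illustration of how the family-level columns relate to prob, the sketch below recomputes avg_prob and avg_parent_prob on a toy data frame. The toy values, the use of ave(), and the fallback to avg_prob for families without a likely parent are assumptions chosen to be consistent with the rows shown above; this is not necessarily the code that was used to build all.

```r
# Toy data (hypothetical values): two families identified by family_id.
toy <- data.frame(
  family_id    = c("TICKET_1", "TICKET_1", "TICKET_2", "TICKET_2", "TICKET_2"),
  prob         = c(0.1, 0.1, 0.1, 0.7, 0.1),
  maybe_parent = c(FALSE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Family-wide mean of prob, repeated on every row of the family.
toy$avg_prob <- ave(toy$prob, toy$family_id)

# Mean prob over the likely parents of each family.
# Rows that are not likely parents are masked with NA before averaging.
parent_prob <- ifelse(toy$maybe_parent, toy$prob, NA)
toy$avg_parent_prob <- ave(parent_prob, toy$family_id,
                           FUN = function(p) mean(p, na.rm = TRUE))

# Assumption: families with no likely parent fall back to avg_prob.
no_parent <- is.na(toy$avg_parent_prob)
toy$avg_parent_prob[no_parent] <- toy$avg_prob[no_parent]

print(toy)
```

An avg_child_prob column could be sketched the same way by masking on maybe_child instead of maybe_parent.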

Let's run ctree() on the training rows of all and apply the resulting model to the validation rows.

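library(party)  # ctree() is provided by the party package

# Split the first fold into training (type "T") and validation (type "V") rows.
f$train <- subset(all, type == "T")
f$validation <- subset(all, type == "V")

# Fit the tree on the training rows; each predictor is listed once.
m <- ctree(survived ~ pclass + sex + age + sibsp + parch + fare + embarked
           + maybe_parent + maybe_child + avg_prob + avg_parent_prob + avg_child_prob,
           data=f$train)

print(m)
# Predict survival for the validation rows.
predicted <- predict(m, newdata=f$validation)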
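
To evaluate performance, the predictions can be compared with the true labels of the validation rows. The following is a minimal sketch of that comparison using small made-up vectors (toy_actual and toy_predicted are hypothetical stand-ins for f$validation$survived and predicted), so that it runs on its own.

```r
# Toy stand-ins (hypothetical values) for the true and predicted labels.
lv <- c("dead", "survived")
toy_actual    <- factor(c("dead", "survived", "survived", "dead", "dead"), levels = lv)
toy_predicted <- factor(c("dead", "survived", "dead",     "dead", "dead"), levels = lv)

# Confusion matrix: rows are predictions, columns are the true labels.
confusion <- table(predicted = toy_predicted, actual = toy_actual)
print(confusion)

# Accuracy: share of rows whose prediction matches the true label.
accuracy <- sum(toy_predicted == toy_actual) / length(toy_actual)
print(accuracy)
```

With the real objects, the same accuracy would be sum(predicted == f$validation$survived) / nrow(f$validation).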

The ctree() model built from the first fold is as follows.

> print(m)

    Conditional inference tree with 7 terminal nodes

Response: survived
Inputs: pclass, sex, age, sibsp, parch, fare, embarked, maybe_parent, maybe_child, avg_prob, avg_parent_prob, avg_child_prob
Number of observations: 1060

1) avg_prob <= 0.5751314; criterion = 1, statistic = 385.146
  2) sex == {male}; criterion = 1, statistic = 65.12
    3) age <= 11; criterion = 1, statistic = 20.899
      4)* weights = 21
    3) age > 11
      5)* weights = 80
  2) sex == {female}
    6) pclass == {3}; criterion = 1, statistic = 63.64
      7) avg_parent_prob <= 0.4238411; criterion = 0.965, statistic = 8.81
        8)* weights = 116
      7) avg_parent_prob > 0.4238411
        9)* weights = 9
    6) pclass == {1, 2}
      10)* weights = 203
1) avg_prob > 0.5751314
  11) avg_child_prob <= 0.6507765; criterion = 1, statistic = 47.28
    12)* weights = 126
  11) avg_child_prob > 0.6507765
    13)* weights = 505

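The tree shows that the family-derived variables are put to good use: avg_prob is the root split (node 1), avg_child_prob splits node 11, and avg_parent_prob splits node 7.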
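
To make the printed tree easier to read, the sketch below hand-copies its splits into a plain R function that returns the terminal node number for one passenger. The function name terminal_node is hypothetical, the thresholds are taken from the output above, and the sketch assumes none of the inputs is missing.

```r
# Hand-written copy of the splits in the printed tree; returns the terminal node id.
# Assumes no missing values in the inputs.
terminal_node <- function(avg_prob, sex, age, pclass,
                          avg_parent_prob, avg_child_prob) {
  if (avg_prob <= 0.5751314) {
    if (sex == "male") {
      # Node 3: males are split on age.
      if (age <= 11) 4L else 5L
    } else if (pclass == "3") {
      # Node 7: third-class females are split on avg_parent_prob.
      if (avg_parent_prob <= 0.4238411) 8L else 9L
    } else {
      # First- and second-class females.
      10L
    }
  } else {
    # Node 11: the remaining passengers are split on avg_child_prob.
    if (avg_child_prob <= 0.6507765) 12L else 13L
  }
}

# A first-class woman with avg_prob 0.0591, like the first row of all.
print(terminal_node(0.0591, "female", 30, "1", 0.0591, 0.0591))
```

Tracing a few passengers by hand this way shows which of the seven terminal nodes each split path leads to.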
