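Modeling with ctree() Using Family Information
Now let's build a model from the all data frame and evaluate its performance. Of the columns added so far, we will use type, avg_prob, maybe_parent, maybe_child, avg_parent_prob, and avg_child_prob. The type column separates training rows ("T") from validation rows ("V"); the others serve as predictors.
After applying everything described so far to the first fold, the all data frame looks like this.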
> str(all)
'data.frame': 1178 obs. of 19 variables:
$ pclass : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ survived : Factor w/ 2 levels "dead","survived": 2 2 2 2 1 2 1 1 1 1 ...
$ name : chr "Cherry, Miss. Gladys" "Maioni, Miss. Roberta" "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)" "Taussig, Miss. Ruth" ...
$ sex : Factor w/ 2 levels "female","male": 1 1 1 1 2 1 2 2 2 2 ...
$ age : num 30 16 33 18 52 39 NA 47 30 42 ...
$ sibsp : int 0 0 0 0 1 1 0 0 0 0 ...
$ parch : int 0 0 0 2 1 1 0 0 0 0 ...
$ ticket : chr "110152" "110152" "110152" "110413" ...
$ fare : num 86.5 86.5 86.5 79.7 79.7 ...
$ cabin : chr "B77" "B79" "B77" "E68" ...
$ embarked : Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 3 3 3 3 ...
$ type : chr "T" "T" "T" "T" ...
$ prob : num 0.0591 0.0591 0.0591 0.0591 0.6364 ...
$ family_id : Factor w/ 861 levels "TICKET_1","TICKET_2",..: 1 1 1 2 2 2 3 3 4 5 ...
$ avg_prob : num 0.0591 0.0591 0.0591 0.2515 0.2515 ...
$ maybe_parent : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
$ maybe_child : logi FALSE FALSE FALSE TRUE FALSE FALSE ...
$ avg_parent_prob: num 0.0591 0.0591 0.0591 0.6364 0.6364 ...
$ avg_child_prob : num 0.0591 0.0591 0.0591 0.0591 0.0591 ...
Let's fit ctree() on the training rows of all and apply the resulting model to the validation rows.
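library(party)  # provides ctree()
# "T" marks training rows, "V" marks validation rows.
f$train <- subset(all, type == "T")
f$validation <- subset(all, type == "V")
# Fit the conditional inference tree on the training rows.
m <- ctree(survived ~ pclass + sex + age + sibsp + parch + fare + embarked +
           maybe_parent + maybe_child + avg_prob + avg_parent_prob +
           avg_child_prob, data=f$train)
print(m)
# Predict survival for the held-out validation rows.
predicted <- predict(m, newdata=f$validation)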
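Once predicted is available, one simple way to evaluate it is to compare it with the actual survived values of the validation rows. The following is a minimal base R sketch of that comparison, not the evaluation code used in this chapter. The vectors actual and predicted here are small made-up stand-ins for f$validation$survived and the output of predict(), so the numbers say nothing about the real model.

```r
# Hypothetical toy data standing in for f$validation$survived and predicted.
lv <- c("dead", "survived")
actual    <- factor(c("dead", "survived", "survived", "dead", "dead"), levels = lv)
predicted <- factor(c("dead", "survived", "dead", "dead", "dead"), levels = lv)

# Confusion matrix: rows are actual classes, columns are predicted classes.
confusion <- table(actual = actual, predicted = predicted)
print(confusion)

# Accuracy is the share of rows where the prediction matches the truth.
accuracy <- sum(predicted == actual) / length(actual)
print(accuracy)
```

With the real objects, the same two steps would use f$validation$survived in place of actual.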
The ctree() model built from the first fold is shown below.
> print(m)
Conditional inference tree with 7 terminal nodes
Response: survived
Inputs: pclass, sex, age, sibsp, parch, fare, embarked, maybe_parent, maybe_child, avg_prob, avg_parent_prob, avg_child_prob
Number of observations: 1060
1) avg_prob <= 0.5751314; criterion = 1, statistic = 385.146
  2) sex == {male}; criterion = 1, statistic = 65.12
    3) age <= 11; criterion = 1, statistic = 20.899
      4)* weights = 21
    3) age > 11
      5)* weights = 80
  2) sex == {female}
    6) pclass == {3}; criterion = 1, statistic = 63.64
      7) avg_parent_prob <= 0.4238411; criterion = 0.965, statistic = 8.81
        8)* weights = 116
      7) avg_parent_prob > 0.4238411
        9)* weights = 9
    6) pclass == {1, 2}
      10)* weights = 203
1) avg_prob > 0.5751314
  11) avg_child_prob <= 0.6507765; criterion = 1, statistic = 47.28
    12)* weights = 126
  11) avg_child_prob > 0.6507765
    13)* weights = 505
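The tree confirms that the family-based features are doing useful work: avg_prob is the root split, avg_child_prob splits the avg_prob > 0.5751314 branch, and avg_parent_prob is used to split third-class female passengers.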
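As a quick sanity check on the printed tree, the terminal-node weights should add up to the number of training observations (1060). The sketch below copies the weights by hand from the output above; the variable names are illustrative only and are not part of the party API.

```r
# Terminal-node weights copied from the printed tree, keyed by node number.
node_weights <- c(`4` = 21, `5` = 80, `8` = 116, `9` = 9,
                  `10` = 203, `12` = 126, `13` = 505)

# The weights should cover every training row exactly once.
total <- sum(node_weights)
stopifnot(total == 1060)

# Rows on each side of the root split on avg_prob.
low  <- sum(node_weights[c("4", "5", "8", "9", "10")])  # avg_prob <= 0.5751314
high <- sum(node_weights[c("12", "13")])                # avg_prob >  0.5751314
print(c(low = low, high = high))

# Share of training rows that end up in each terminal node.
print(round(node_weights / total, 3))
```

The root split sends 429 rows to the left branch and 631 to the right, which is part of why avg_prob carries so much of the tree's structure.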