Data Processing & Analysis in Practice with R: Matrix Representation of Documents

The following example represents the crude corpus as a term-document matrix.

> (x <- TermDocumentMatrix(crude))
A term-document matrix (1266 terms, 20 documents)

Non-/sparse entries: 2255/23065
Sparsity           : 91%
Maximal term length: 17
Weighting          : term frequency (tf)

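The output shows that the term-document matrix has dimensions 1266×20 (1266 terms as rows, 20 documents as columns) and that its entries are TF values, that is, the number of times each term occurs in each document.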
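To make the shape and the Sparsity figure concrete, here is a minimal base R sketch that builds a small term-document count matrix by hand. The documents and the whitespace tokenization are assumptions made for illustration; they are not taken from crude, and this is not how tm tokenizes internally.

```r
# Toy documents (hypothetical; not taken from the crude corpus)
docs <- c(d1 = "oil prices rise",
          d2 = "oil output falls",
          d3 = "prices fall as oil output rises")

# Split each document into lowercase tokens on whitespace
tokens <- strsplit(tolower(docs), "\\s+")

# The vocabulary is the sorted set of distinct tokens
terms <- sort(unique(unlist(tokens)))

# Count how often each term occurs in each document:
# rows are terms, columns are documents
tdm <- vapply(tokens,
              function(tok) as.integer(table(factor(tok, levels = terms))),
              integer(length(terms)))
rownames(tdm) <- terms

dim(tdm)        # number of terms x number of documents

# Sparsity is the share of zero entries in the matrix
sparsity <- sum(tdm == 0) / length(tdm)
```

The same arithmetic applies to the output above: 23065 sparse entries out of 2255 + 23065 = 25320 cells (1266 × 20) is about 91%.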

The contents of the matrix can be viewed with inspect( ).

> inspect(x[1:10, 1:10])
A term-document matrix (10 terms, 10 documents)

Non-/sparse entries: 4/96
Sparsity           : 96%
Maximal term length: 7
Weighting          : term frequency (tf)

        Docs
Terms    127 144 191 194 211 236 237 242 246 248
...        0   0   0   0   0   0   0   0   0   1
100,000    0   0   0   0   0   0   0   0   0   0
10.8       0   0   0   0   0   0   0   0   0   0
1.1        0   0   0   0   0   0   0   0   0   0
1.11       0   0   0   0   0   0   0   0   0   0
1.15       0   0   0   0   0   0   0   0   0   0
1.2        0   0   0   0   0   1   0   0   0   0
12.        0   0   0   1   0   0   0   0   0   0
12.217     0   0   0   0   0   0   0   0   1   0
12.32      0   0   0   0   0   0   0   0   0   0

To use a different weighting scheme, specify weighting in the control argument of TermDocumentMatrix( ). The following example uses TF-IDF.

> x <- TermDocumentMatrix(crude, control=list(weighting=weightTfIdf))
> inspect(x[1:10, 1:5])
A term-document matrix (10 terms, 5 documents)

Non-/sparse entries: 1/49
Sparsity           : 98%
Maximal term length: 7
Weighting          :
    term frequency - inverse document frequency (normalized) (tf-idf)

        Docs
Terms    127 144 191 194 211
...      0 0 0 0.000000 0
100,000  0 0 0 0.000000 0
10.8     0 0 0 0.000000 0
1.1      0 0 0 0.000000 0
1.11     0 0 0 0.000000 0
1.15     0 0 0 0.000000 0
1.2      0 0 0 0.000000 0
12.      0 0 0 0.074516 0
12.217   0 0 0 0.000000 0
12.32    0 0 0 0.000000 0
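The "(normalized)" label in the Weighting line suggests that term counts are divided by document length before the IDF factor is applied. The base R sketch below works through that calculation on a toy matrix. It assumes the formula is tf divided by the total term count of the document, multiplied by log2(number of documents / number of documents containing the term); this reflects a common definition and should be checked against help(weightTfIdf) rather than taken as tm's exact implementation. All values are hypothetical.

```r
# Toy term-document counts (hypothetical values):
# rows are terms, columns are documents
tf <- matrix(c(1, 1, 1,
               2, 0, 0,
               0, 1, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(Terms = c("oil", "prices", "output"),
                             Docs  = c("d1", "d2", "d3")))

# Normalize each column by the total number of term occurrences in that document
tf_norm <- sweep(tf, 2, colSums(tf), "/")

# Inverse document frequency (assumed form):
# log2(number of documents / number of documents containing the term)
idf <- log2(ncol(tf) / rowSums(tf > 0))

# Multiply each row by its term's idf; the vector recycles down the columns,
# so element i of idf lines up with row i of the matrix
tfidf <- tf_norm * idf
```

Under this definition a term that occurs in every document, such as "oil" here, receives weight 0 everywhere, while a term concentrated in few documents is weighted up.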

help(TermDocumentMatrix) illustrates a wider range of use cases with examples, so refer to it for more detail.
