Practical Data Processing & Analysis with R: Frequent Terms

Frequent Terms

findFreqTerms() finds the terms that occur frequently in a term-document matrix or a document-term matrix.

▼ Table 10-16 Finding frequent terms

tm::findFreqTerms : Finds frequently occurring terms in a term-document or document-term matrix.

tm::findFreqTerms(
  x,            # term-document or document-term matrix
  lowfreq=0,    # minimum number of occurrences
  highfreq=Inf  # maximum number of occurrences; defaults to infinity
)

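The return value is a character vector of the terms whose total number of occurrences across all documents is at least lowfreq and at most highfreq (both bounds inclusive).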
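The filtering rule above can be sketched in base R without the tm package. This is only an illustration of the idea, not tm's actual code: the helper frequent_terms() is a hypothetical name, and the counts in the toy matrix are made up.

```r
# Toy term-document matrix: rows are terms, columns are documents.
# The counts are invented purely for illustration.
tdm <- matrix(
  c(3, 0, 2,
    1, 1, 0,
    0, 4, 5,
    0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("oil", "opec", "price", "saudi"),
                  c("doc1", "doc2", "doc3"))
)

# Hypothetical helper that mirrors the behavior described for findFreqTerms():
# keep terms whose total count lies in [lowfreq, highfreq].
frequent_terms <- function(m, lowfreq = 0, highfreq = Inf) {
  totals <- rowSums(m)  # total occurrences of each term across all documents
  names(totals)[totals >= lowfreq & totals <= highfreq]
}

frequent_terms(tdm, lowfreq = 2)                # "oil" "opec" "price"
frequent_terms(tdm, lowfreq = 2, highfreq = 5)  # "oil" "opec"
```

Here the row totals are 5, 2, 9, and 1, so "saudi" drops out at lowfreq=2 and "price" drops out once highfreq=5 is added.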

The following example finds the terms that occur 10 or more times in the crude corpus, which consists of 20 documents.

> library(tm)
> data("crude")
> findFreqTerms(TermDocumentMatrix(crude), lowfreq=10)
 [1] "about"    "and"     "are"      "bpd"        "but"
 [6] "crude"    "dlrs"    "for"      "from"       "government"
[11] "has"      "its"     "kuwait"   "last"       "market"
[16] "mln"      "new"     "not"      "official"   "oil"
[21] "one"      "opec"    "pct"      "price"      "prices"
[26] "reuter"   "said"    "said."    "saudi"      "sheikh"
[31] "that"     "the"     "they"     "u.s."       "was"
[36] "were"     "will"    "with"     "would"

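For reference, the full lists of terms and documents in the matrix can be viewed with rownames() and colnames(). In a term-document matrix the rows are terms and the columns are documents.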

> x <- TermDocumentMatrix(crude)
> head(rownames(x))
[1] "..."    "100,000" "10.8"    "1.1"    "1.11"   "1.15"
> head(colnames(x))
[1] "127" "144" "191" "194" "211" "236"
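To make the row and column layout concrete, the sketch below builds a tiny term-document matrix by hand in base R. The three sample documents and their names (d1 to d3) are invented for illustration, and the whitespace tokenization is a simplification of what TermDocumentMatrix() does.

```r
# Three made-up documents, named d1..d3.
docs <- c(d1 = "oil prices rose", d2 = "oil output fell", d3 = "prices fell")

# Split each document on single spaces (a simplification of real tokenization).
tokens <- strsplit(docs, " ", fixed = TRUE)

# Vocabulary: every distinct term, sorted alphabetically.
terms <- sort(unique(unlist(tokens)))

# Count each term per document; one column per document.
tdm <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = terms))),
  integer(length(terms))
)
rownames(tdm) <- terms

rownames(tdm)  # the terms:     "fell" "oil" "output" "prices" "rose"
colnames(tdm)  # the documents: "d1" "d2" "d3"
```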
