Practical Data Processing & Analysis with R: Creating a Corpus from a File


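Suppose we have a file named docs.csv, shown below, that stores a title, a body, and a classification label for each document. The label is either notspam or spam and records whether the document is spam.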

Title,Body,Label
"Hello World","This is the body of hello world document","notspam"
"Linear algebra","This book is about linear algebra","notspam"
"Online Casino","$50,000 is waiting for you","spam"
"Poker","Poker game players","spam"

This file can be read into a data frame with read.csv( ).

> (docs <- read.csv("docs.csv", stringsAsFactors=FALSE))

           Title                                     Body Label
1    Hello World This is the body of hello world document notspam
2 Linear algebra        This book is about linear algebra notspam
3  Online Casino               $50,000 is waiting for you    spam
4          Poker                       Poker game players    spam
> str(docs)
'data.frame': 4 obs. of 3 variables:
 $ Title: chr "Hello World" "Linear algebra" "Online Casino" "Poker"
 $ Body : chr "This is the body of hello world document" "This book is about linear algebra" "$50,000 is waiting for you" "Poker game players"
 $ Label: chr "notspam" "notspam" "spam" "spam"

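Note that stringsAsFactors=FALSE is passed to read.csv( ). In versions of R before 4.0.0, omitting this argument causes every string column to be stored as a factor. Since R 4.0.0 the default is FALSE, but setting it explicitly keeps the result the same across versions.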
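The effect of stringsAsFactors can be checked directly. The following is a minimal sketch using only base R. It feeds two rows of the sample data to read.csv( ) through its text argument so that it runs without docs.csv on disk. The variable names csv_text, as_chr, and as_fct are made up for this illustration.

```r
# Two rows of the sample data, supplied inline instead of reading docs.csv.
csv_text <- 'Title,Body,Label
"Hello World","This is the body of hello world document","notspam"
"Online Casino","$50,000 is waiting for you","spam"'

# Read the same text twice, once per setting of stringsAsFactors.
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

sapply(as_chr, class)  # every column is "character"
sapply(as_fct, class)  # every column is "factor"
```

Text-mining functions generally expect character vectors, which is why the character form is the one used in the rest of this section.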

The resulting data frame can be converted to a Corpus by passing DataframeSource( ) to Corpus( ).

> corpus <- Corpus(DataframeSource(docs[,1:2]))  # exclude Label
> inspect(corpus)
A corpus with 4 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID

$`1`
Hello World
This is the body of hello world document

$`2`
Linear algebra
This book is about linear algebra

$`3`
Online Casino
$50,000 is waiting for you

$`4`
Poker
Poker game players

In this example the classification of each document (Label) has not yet been stored in the corpus. How to store it is explained in the next section.
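The transcript above appears to come from an older release of the tm package. In recent tm releases, DataframeSource( ) instead expects a data frame whose first column is named doc_id and whose second column is named text. The following is a rough sketch of the same conversion under that assumption. It uses two rows of the sample data, and the name src_df and the choice to join Title and Body with a newline are illustrative only, not something taken from the text.

```r
library(tm)  # assumes a recent tm release is installed

# Two rows of the sample data, built inline so docs.csv is not needed.
docs <- data.frame(
  Title = c("Hello World", "Online Casino"),
  Body  = c("This is the body of hello world document",
            "$50,000 is waiting for you"),
  Label = c("notspam", "spam"),
  stringsAsFactors = FALSE
)

# Recent DataframeSource() requires a 'doc_id' column followed by a 'text' column.
# Title and Body are joined into one text field; Label is left out, as in the text.
src_df <- data.frame(
  doc_id = as.character(seq_len(nrow(docs))),
  text   = paste(docs$Title, docs$Body, sep = "\n"),
  stringsAsFactors = FALSE
)

corpus <- VCorpus(DataframeSource(src_df))
length(corpus)        # number of documents in the corpus
content(corpus[[1]])  # title and body of the first document
```

VCorpus( ) is used here rather than Corpus( ) so that the kind of corpus object returned does not depend on the source type.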
