Data Science for Everyone with Python: 2 Normalization

Let's tie together everything covered so far and find the most frequent word stems (excluding stopwords) in the index.html file. BeautifulSoup, covered in "UNIT 13. Processing HTML Files", does the HTML parsing.

import nltk
from bs4 import BeautifulSoup
from collections import Counter
from nltk.corpus import stopwords
from nltk import LancasterStemmer

# Create the stemmer.
ls = LancasterStemmer()

# Read the file and build the soup.
with open("index.html") as infile:
    soup = BeautifulSoup(infile, "html.parser")

# Extract the text and tokenize it.
words = nltk.word_tokenize(soup.text)

# Convert the words to lowercase.
words = [w.lower() for w in words]

# Remove stopwords and non-alphanumeric tokens, then stem what remains.
# The stopword list is loaded once into a set for fast lookups.
stop_words = set(stopwords.words("english"))
words = [ls.stem(w) for w in words
         if w not in stop_words and w.isalnum()]

# Extract the 10 most frequent stems.
freqs = Counter(words)
print(freqs.most_common(10))

>>>

[('h', 1), ('e', 1), ('l', 1), ('p', 1), ('d', 1)]

Code like this can also be seen as a first step toward topic extraction.
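The same pipeline (extract text, tokenize, lowercase, drop stopwords, stem, count) can be sketched with only the standard library, which may help when NLTK or its data files are not installed. This is a minimal illustration under stated assumptions, not the book's method: `TextExtractor`, `naive_stem`, `top_stems`, the tiny `STOPWORDS` set, and the `sample` HTML string are all hypothetical names and data introduced here, and the crude suffix stripper is only a stand-in for the Lancaster stemmer.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stopword list (NLTK's English list is far longer).
STOPWORDS = {"the", "a", "an", "and", "is", "are", "of", "to", "in", "it"}


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def naive_stem(word):
    """Crude suffix stripper used as a stand-in; this is NOT the Lancaster algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def top_stems(html, n=10):
    """Return the n most common stems in the HTML text, excluding stopwords."""
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks)
    # Lowercase, then keep runs of letters and digits as tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    stems = [naive_stem(t) for t in tokens if t not in STOPWORDS]
    return Counter(stems).most_common(n)


sample = (
    "<html><body><h1>Linking pages</h1>"
    "<p>A page links to linked pages, and the links are counted.</p>"
    "</body></html>"
)
print(top_stems(sample, 3))
```

For the sample string, this prints `[('link', 4), ('page', 3), ('count', 1)]`. A real stemmer and a full stopword list would give different, and generally better, groupings on actual pages.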
