스파크를 다루는 기술 (Spark in Action): 5.1.7 Joining Data

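Let's move on to an example. First, load the italianVotes.csv file from the book's GitHub repository (if you cloned the repository into your home directory, the file is under the first-edition folder) and convert its data into a DataFrame.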

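import java.sql.Timestamp
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Read the file and split each line on the "~" delimiter
val itVotesRaw = sc.textFile("first-edition/ch05/italianVotes.csv").
  map(x => x.split("~"))
// Convert each field to the type declared in the schema below
val itVotesRows = itVotesRaw.map(row => Row(row(0).toLong, row(1).toLong,
  row(2).toInt, Timestamp.valueOf(row(3))))
// Describe the four columns; none of them is nullable
val votesSchema = StructType(Seq(
  StructField("id", LongType, false),
  StructField("postId", LongType, false),
  StructField("voteTypeId", IntegerType, false),
  StructField("creationDate", TimestampType, false)))
val votesDf = spark.createDataFrame(itVotesRows, votesSchema)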

Now you can join the two DataFrames, postsDf and votesDf, on the postId column as follows.

val postsVotes = postsDf.join(votesDf, postsDf("id") === 'postId)

This code performs an inner join. Alternatively, you can perform an outer join by adding a third argument, as follows.

val postsVotesOuter = postsDf.join(votesDf,
  postsDf("id") === 'postId, "outer")
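To make the difference between the two join types concrete, here is a minimal, self-contained sketch. It is not taken from the book: the JoinSketch object, the joinCounts helper, and the toy posts and votes rows are hypothetical, and the sketch assumes Spark SQL is on the classpath and runs in local mode. It builds two tiny DataFrames and counts the rows produced by an inner join and by an outer join on the same condition.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object for illustration only; not part of the book's code.
object JoinSketch {

  // Returns (inner join row count, outer join row count) for toy data.
  def joinCounts(spark: SparkSession): (Long, Long) = {
    import spark.implicits._

    // Toy data (assumed): posts 2 and 3 have no votes,
    // and vote 12 refers to a post (id 4) that does not exist.
    val posts = Seq((1L, "first post"), (2L, "second post"), (3L, "third post"))
      .toDF("id", "title")
    val votes = Seq((10L, 1L), (11L, 1L), (12L, 4L))
      .toDF("voteId", "postId")

    val condition = posts("id") === votes("postId")

    // Inner join: keeps only the rows that satisfy the condition on both sides.
    val inner = posts.join(votes, condition)

    // Outer join: also keeps unmatched rows from either side,
    // filling the missing columns with null.
    val outer = posts.join(votes, condition, "outer")

    (inner.count(), outer.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-sketch")
      .getOrCreate()
    val (innerCount, outerCount) = joinCounts(spark)
    println(s"inner: $innerCount, outer: $outerCount")
    spark.stop()
  }
}
```

With this toy data the inner join yields 2 rows (the two votes for post 1), while the outer join yields 5: the same 2 matches, plus posts 2 and 3 with null vote columns, plus vote 12 with null post columns.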
