Algebird's HyperLogLog support for Apache Spark. This package can be used in concert with presto-hyperloglog to share HyperLogLog sets between Spark and Presto.
import org.apache.spark.sql.functions.expr
import com.mozilla.spark.sql.hyperloglog.aggregates._
import com.mozilla.spark.sql.hyperloglog.functions._

// Register the merge aggregate and the create/cardinality functions as SQL UDFs
val hllMerge = new HyperLogLogMerge
sqlContext.udf.register("hll_merge", hllMerge)
sqlContext.udf.register("hll_create", hllCreate _)
sqlContext.udf.register("hll_cardinality", hllCardinality _)

// Build one HLL per row (12 bits), merge them all, and estimate the distinct count
val frame = sc.parallelize(List("a", "b", "c", "c")).toDF("id")
frame
  .select(expr("hll_create(id, 12) as hll"))
  .groupBy()
  .agg(expr("hll_cardinality(hll_merge(hll)) as count"))
  .show()
yields:
+-----+
|count|
+-----+
|    3|
+-----+
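The same three functions can also be applied per group rather than over the whole frame. The following is a minimal sketch, assuming a spark-shell session where `sc` and `sqlContext` are already defined and this package is on the classpath; the `country` column and the sample rows are hypothetical and only serve as illustration.

```scala
import org.apache.spark.sql.functions.expr
import com.mozilla.spark.sql.hyperloglog.aggregates._
import com.mozilla.spark.sql.hyperloglog.functions._
import sqlContext.implicits._

// Register the UDFs under the same names as in the example above.
sqlContext.udf.register("hll_merge", new HyperLogLogMerge)
sqlContext.udf.register("hll_create", hllCreate _)
sqlContext.udf.register("hll_cardinality", hllCardinality _)

// Hypothetical sample data: (country, id) pairs with a repeated id.
val events = sc
  .parallelize(List(("DE", "a"), ("DE", "b"), ("DE", "b"), ("US", "c")))
  .toDF("country", "id")

// One HLL per row, merged per country, then reduced to an estimated distinct count.
events
  .select(expr("country"), expr("hll_create(id, 12) as hll"))
  .groupBy("country")
  .agg(expr("hll_cardinality(hll_merge(hll)) as count"))
  .show()
```

Because HyperLogLog is a probabilistic structure, the reported counts are estimates. Keeping the merged `hll` column instead of reducing it to a cardinality lets the sets be merged again later.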
- Configure your credentials for the Spark Packages repository in ~/.ivy2/.sbtcredentials, e.g.:

    realm=Sonatype Nexus Repository Manager
    host=oss.sonatype.org
    user=foo
    password=bar

- Publish a new release with sbt publishSigned
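For the credentials step, sbt only reads a credentials file if the build points at it. The setting below is a sketch of how a build.sbt could load the file named above; whether this project's build already contains such a line is an assumption to check against the repository. The `publishSigned` task itself is provided by the sbt-pgp plugin.

```scala
// build.sbt (sketch): load the credentials file described in the steps above.
// Assumption: the file lives at ~/.ivy2/.sbtcredentials and uses the
// realm/host/user/password keys shown in the example.
credentials += Credentials(Path.userHome / ".ivy2" / ".sbtcredentials")
```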