Scala: How to get the upper N elements of an Iterable with Grouping (or Binning)

advertisements

I have used the solution mentioned here to get the top n elements of a Scala Iterable, efficiently.

End example:

scala> val li = List (4, 3, 6, 7, 1, 2, 9, 5)
li: List[Int] = List(4, 3, 6, 7, 1, 2, 9, 5)

scala> top (2, li)
res0: List[Int] = List(2, 1)

Now, suppose I want to get the top n elements with a lower resolution. The range of integers may somehow be divided/binned/grouped to sub-ranges such as modulo 2: {0-1, 2-3, 4-5, ...}, and in each sub-range I do not differentiate between integers, e.g. 0 and 1 are all the same to me. Therefore, the top element in the above example would still be 1, but the next element would either be 2 or 3. More clearly these results are equivalent:

scala> top (2, li)
res0: List[Int] = List(2, 1)

scala> top (2, li)
res0: List[Int] = List(3, 1)

  1. How do I change this nice function to fit these needs?
  2. Is my intuition correct and this sort should be faster? Since the sort is on the bins/groups, then taking all or some of the elements of the bins with no specific order until we get to n elements.

Comments:

  1. The binning/grouping is something simple and fixed like modulo k, doesn't have to be generic like allowing different lengths of sub-ranges
  2. Inside each bin, assuming we need only some of the elements, we can just take first elements, or even some random elements, doesn't have to be some specific system.

Per the comment, you're just changing the comparison.

In this version, 4 and 3 compare equal and 4 is taken first.

object Firstly extends App {

  def firstly(taking: Int, vs: List[Int]) = {
    import collection.mutable.{ SortedSet => S }

    def bucketed(i: Int) = (i + 1) / 2

    vs.foldLeft(S.empty[Int]) { (s, i) =>
      if (s.size < taking) s += i
      else if (bucketed(i) >= bucketed(s.last)) s
      else {
        s += i
        s -= s.last
      }
    }
  }

  assert(firstly(taking = 2, List(4, 6, 7, 1, 9, 3, 5)) == Set(4, 1))
}

Edit: example of sorting buckets instead of keeping sorted "top N":

scala> List(4, 6, 7, 1, 9, 3, 5).groupBy(bucketed).toList.sortBy {
     | case (i, vs) => i }.flatMap {
     | case (i, vs) => vs }.take(5)
res10: List[Int] = List(1, 4, 3, 6, 5)

scala> List(4, 6, 7, 1, 9, 3, 5).groupBy(bucketed).toList.sortBy {
     | case (i, vs) => i }.map {
     | case (i, vs) => vs.head }.take(5)
res11: List[Int] = List(1, 4, 6, 7, 9)

Not sure which result you prefer, of the last two.

As to whether sorting buckets is better, it depends how many buckets.