Unable to select top 10 records per group in Spark SQL

SQL Problem :

Hi, I am new to Spark SQL. I have a DataFrame like this:

|tag id| timestamp|listner|orgid|org2id|RSSI|
|     4|1496745912|    362|    4|     3|0.60|
|     4|1496745924|   1901|    4|     3|0.60|
|     4|1496746030|   1901|    4|     3|0.60|
|     4|1496746110|    718|    4|     3|0.30|
|     2|1496746128|    718|    4|     3|0.60|
|     2|1496746188|   1901|    4|     3|0.10|

For each listner, I want to select the top 10 timestamp values using Spark SQL.

I tried the following queries. The first throws an error, and the second returns only 10 records in total.

  val avg = sqlContext.sql("select top 10 * from avg_table") // throws error.

  val avg = sqlContext.sql("select rssi,timestamp,tagid from avg_table order by desc limit 10")  // it prints only 10 records.

For each listner, I need the top 10 timestamp values. Any help will be appreciated.

Solution :

Doesn’t this work?

select rssi, timestamp, tagid
from avg_table
order by timestamp desc
limit 10;


Oh, I get it. You want row_number():

select rssi, timestamp, tagid
from (select t.*,
             row_number() over (partition by listner order by timestamp desc) as seqnum
      from avg_table t
     ) a
where seqnum <= 10
order by a.timestamp desc;
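
The same top-10-per-listner selection can also be written with the DataFrame API instead of a SQL string. The following is a minimal sketch, not the answerer's own code: it assumes Spark 2.x or later (it uses SparkSession rather than the sqlContext from the question), a local master for experimentation, and column names tagid and rssi taken from the queries above. The object name TopNPerListner is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object TopNPerListner {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimentation only.
    val spark = SparkSession.builder()
      .appName("top-n-per-listner")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question; column names are assumed.
    val df = Seq(
      (4, 1496745912L, 362, 4, 3, 0.60),
      (4, 1496745924L, 1901, 4, 3, 0.60),
      (4, 1496746030L, 1901, 4, 3, 0.60),
      (4, 1496746110L, 718, 4, 3, 0.30),
      (2, 1496746128L, 718, 4, 3, 0.60),
      (2, 1496746188L, 1901, 4, 3, 0.10)
    ).toDF("tagid", "timestamp", "listner", "orgid", "org2id", "rssi")

    // One window per listner, newest timestamp first.
    val byListner = Window.partitionBy("listner").orderBy(col("timestamp").desc)

    // Number the rows inside each window and keep at most 10 per listner.
    val top10 = df
      .withColumn("seqnum", row_number().over(byListner))
      .filter(col("seqnum") <= 10)
      .drop("seqnum")

    top10.show()
    spark.stop()
  }
}
```

With the six sample rows, every listner has fewer than 10 rows, so all of them are kept; the filter only starts dropping rows once a listner has more than 10.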

You can also use dense_rank() here:

select *
from (select *,
             dense_rank() over (partition by listner order by timestamp desc) as rnk
      from avg_table
     ) t
where rnk <= 10;

The difference between dense_rank() and row_number() is that dense_rank() assigns the same rank to rows that have equal values in the ordering column within a partition, whereas row_number() assigns a unique sequential number to every row, even when the values tie. With dense_rank(), a listner can therefore return more than 10 rows if timestamps tie.
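
To see the difference on tied values, both functions can be run side by side over the same window. This is a small sketch under assumptions: Spark 2.x or later, a local session, and a made-up three-row view named demo_table in which two rows share a timestamp. None of these names come from the question.

```scala
import org.apache.spark.sql.SparkSession

object RankingFunctionsDemo {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimentation only.
    val spark = SparkSession.builder()
      .appName("ranking-functions-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: two rows for listner 1901 share timestamp 100.
    Seq((1901, 100L), (1901, 100L), (1901, 90L))
      .toDF("listner", "timestamp")
      .createOrReplaceTempView("demo_table")

    spark.sql(
      """select listner, timestamp,
        |       row_number() over (partition by listner order by timestamp desc) as row_num,
        |       dense_rank() over (partition by listner order by timestamp desc) as dense_rnk
        |from demo_table""".stripMargin
    ).show()
    // row_num takes the values 1, 2 and 3; which of the two tied rows
    // gets 1 and which gets 2 is not guaranteed.
    // dense_rnk is 1 for both rows with timestamp 100 and 2 for timestamp 90.

    spark.stop()
  }
}
```

Filtering on row_num <= 2 would keep exactly two rows here, while filtering on dense_rnk <= 2 would keep all three.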


Use LIMIT in your query (LIMIT 10 in your case). Note that this caps the whole result at 10 rows; it does not return 10 rows per listner.

EXAMPLE: sqlContext.sql("SELECT text FROM yourTable LIMIT 10")

Or you can select everything from your table and save the result to a DataFrame or Dataset
(or to an RDD, but then you need to call its toDS() or toDF() method).
Then you can just call the show(10) method.
