Skip to content

Frequency weighted sampling in SQL

Let's say we have a table with some large number of rows and we want a randm sample of them. The easiest approach to this is to use something like NEWID() as follows:

SELECT TOP 100 *
FROM t1
ORDER BY NEWID()

This works greats if all rows are considered equal. Let's say, however, that we have a column that gives each row a different weight or importance. The earlier approach is indiscriminate, will ignore those weights, and result ina much more tail-skewed sample.

Instead, we can use something like this:

SELECT Query, QueryCount
FROM 
(
  SELECT TOP 2000 *
  FROM t1
  ORDER BY RAND(CAST(NEWID() AS VARBINARY)) * 1/QueryCount
) x 
ORDER BY QueryCount DESC

We change the RAND() seed for each row otherwise it'll give the same value for every row. The inverse frequency will tend to put higher weighted rows closer to the top, subject to the randomization.