:-$

Ryan's work blog

My Links

News

The WeatherPixie
Subscribe with Bloglines
About this blog

Tools I use:

Post Categories

Article Categories

Archives

Image Galleries

Blog Stats

Personal

Projects

Random Blogs

Random other

Reference

Web comics

Work

Getting a random sample with SQL Server 2000 revisted

I received a helpful comment on my last entry, and sat down to try out this seemingly ideal solution:

--limit results to @limit rows
SET ROWCOUNT @limit
DELETE FROM #temp WHERE Id NOT IN(
--due to rowcount, this should be my random sample of @limit Ids
SELECT TOP 100 PERCENT Id FROM Items
ORDER BY NEWID()
)
--clear the rowcount, so results aren't limited
SET ROWCOUNT 0
It didn't work, and I got really confused when I tried to figure out why, so I started making some simple tests.

Setup and basic test

I started by creating a table variable, and populating it with some data:

DECLARE @limiter TABLE (id int not null)
--get our set into storage
INSERT INTO @limiter (id) (
SELECT TOP 2000 Id
FROM Items
)
Next step, lets see if TOP 100 PERCENT works with ROWCOUNT the way I'd like it to:
SET ROWCOUNT 10
SELECT TOP 100 PERCENT id FROM @limiter
ORDER BY NEWID()
Running this gives me 10 rows, and running it again give me 10 different rows, so my randomish requirement is met.

Subquery test

Next test is my goal, the randomish sample. I make the 10 row random sample in a subquery, and then delete everything not in that sample. Yes, this is kinda backwards, but the actual code is more complicated than my simplification here, so its easier to generate all the possibilities then remove all that don't fall into our sample.

--limit SELECT results
SET ROWCOUNT 10
DELETE FROM @limiter WHERE Id NOT IN (
SELECT TOP 100 PERCENT id FROM @limiter
ORDER BY NEWID()
)
--clear the rowcount, so results aren't limited
SET ROWCOUNT 0
SELECT * FROM @limiter
I expect this to return 10 rows. It returns 2000, the full contents of @limiter. That implies the TOP 100 PERCENT in the subquery returned the full 2000 rows, so nothing was deleted. Checking the execution plan verifies this. SET ROWCOUNT doesn't work on subqueries? Or maybe the TOP 100 PERCENT is confusing SQL Server, so I'll try using a TOP N:
SET ROWCOUNT 10
DELETE FROM @limiter WHERE Id NOT IN (
SELECT TOP 20000 id FROM @limiter
ORDER BY NEWID()
)
--clear the rowcount, so results aren't limited
SET ROWCOUNT 0
SELECT COUNT(id) FROM @limiter
Nope, still get 2000 rows in @limiter after the delete. For grins, I start playing with the TOP statement, and get some confusing results:
TOP NRows left in @limiter
TOP 200002000
TOP 20002000
TOP 19991999
TOP 19981997
Huh? The top 1998 should have returned 1998 rows; the 2 rows not in that sample should have been deleted. Lemme run that again to make sure I didn't mess up.
TOP 19982000
What the hell? Is this some weird query caching thing? Let me run it a few more times...
TOP 19981998
TOP 19981999
Seriously, WTF. The execution plan just creates more questions: It's somehow processing 4M rows from a table variable with 2K rows!? Then the TOP 1998 gets performed in the Filter operation, which reduces it to 1999 rows? Ok, so 2K * 2K = 4M, so the WHERE Id NOT IN () is effectively performing a cross join, which makes sense, but why the 1999 rows?
TOP 101990
Ok, so something is seriously inconsistent with ROWCOUNT and TOP in subqueries. When I take the SET ROWCOUNT out, it still is inconsistent. When I take out the ORDER BY NEWID(), it is still inconsistent.

TOP in a subquery is broken?

Consider this repro:

DECLARE @t TABLE (id int not null)
INSERT INTO @t (id) (
SELECT TOP 20 Id FROM [someTable]
)
SELECT COUNT(Id) FROM @t
WHERE Id IN (
SELECT TOP 10 Id FROM @t
ORDER BY NEWID()
)
What do you think the count should be? Hint: there's more than one answer. Run that query repeatedly, and bask in inconsistency. But... try this one:
SELECT COUNT(Id) FROM [someTable]
WHERE Id IN (
SELECT TOP 10 Id FROM [someTable]
ORDER BY NEWID()
)
And bam, we get 10 every time. So its not TOP and ORDER BY in subqueries, maybe its something to do with table variables? So, with that in mind, lets go back to the original test, but using a temp table instead of a table variable.

Out with table variables

So, back to the beginning:

SELECT TOP 2000 Id INTO #limiter
FROM Items
--limit SELECT results
SET ROWCOUNT 10
DELETE FROM #limiter WHERE Id NOT IN (
SELECT TOP 100 PERCENT id FROM #limiter
ORDER BY NEWID()
)
--clear the rowcount, so results aren't limited
SET ROWCOUNT 0
SELECT COUNT(id) FROM #limiter
Result: 1990 rows. It should have 10. So, table variables aren't the problem.

Fine, you win, SQL Server

Fine. If you want it that badly MSSQL, you can have it. I'll keep my slimy solution from the last entry (for now), as you obviously don't want to let me do anything in a straighforward manner. I'll try to forget the inconsistencies I found today, but it'll take time to heal. Be patient with me. And don't get all jealous if you see me installing postgresql to see if it can do this better than you. You know I'm locked in to you.

posted on Friday, February 24, 2006 11:38 AM