:-$

Ryan's work blog

My Links

News

The WeatherPixie
Subscribe with Bloglines
About this blog

Tools I use:

Post Categories

Article Categories

Archives

Image Galleries

Blog Stats

Personal

Projects

Random Blogs

Random other

Reference

Web comics

Work

Getting a randomish sample with SQL Server 2000

I was recently presented with a seemingly simple task, write a query to return a random* sample of rows from a table, with the number of rows to pull determined at runtime. The surrounding problem is a bit more complex, but this was the specific task I was trying to accomplish, and SQL Server 2000 didn't make it very easy. The rest of my task involved deleting rows where the id wasn't in my random sample.

First try, my almost-dream syntax

My first attempt was a bit naive:

SELECT TOP @limit Id 
FROM Items 
ORDER BY NEWID() 
@limit is a variable which has how many rows I should be returning. I order by NEWID(), which makes a GUID for every row, as this is the best way I know of to get it ordered randomly. I don't think its a terribly good way, but its the best I know. If you have a better way, please comment or email.

That snippet will of course fail, giving this error message:

Incorrect syntax near '@limit'.
Ok, so I can't dynamically set the TOP. Fine.

Use ROWCOUNT

By using SET ROWCOUNT @limit, I could instruct SQL Server to only return @limit rows when performing a SELECT. So, this was the query to delete rows not in my random sample:

--limit results to @limit rows 
SET ROWCOUNT @limit 
 
DELETE FROM #temp WHERE Id NOT IN
    --due to rowcount, this should be my random sample of @limit Ids 
    SELECT Id FROM Items 
    ORDER BY NEWID() 
    ) 
--clear the rowcount, so results aren't limited 
SET ROWCOUNT 0 
That snippet also fails, giving this error:
The ORDER BY clause is invalid in views, inline functions, derived tables, and subqueries, unless TOP is also specified.
Ok, so I can't use ORDER BY in the subquery. Well damn, that would have been really clean. I could add a TOP 2147483647 in there, but that seemed like a bad idea.

Use a table variable

Ok, next thought is to use a table variable, insert my rows in a random order, numbering each row, and then pull the rows with a number below my limit:

--make the table variable, the num column will count upwards as we insert, numbering the rows 
DECLARE @temp TABLE (id int NOT NULL, num int NOT NULL IDENTITY(1,1)) 
 
--insert into the table in random order 
INSERT INTO @temp (id) 
(SELECT Id  
FROM Items 
ORDER BY NEWID()) 
 
--get out our sample 
SELECT Id FROM @temp WHERE num <= @limit 
That snippet also fails, giving this error:
Incorrect syntax near the keyword 'ORDER'.
Oh right, I can't use ORDER BY in the subquery. Fine.

Use a temporary table and INSERT INTO

I was trying to avoid temporary tables, since they have some performance issues, but figured maybe there wasn't any other way. So, I broke down and used a temp table:

--select into the temp table, using IDENTITY to get my row counter 
SELECT Id, IDENTITY(int, 1,1) [num] 
INTO #temp  
FROM Items 
ORDER BY NEWID() 
 
--get out our sample 
SELECT Id FROM #temp WHERE num < @limit 
 
--clean up 
DROP TABLE #temp 
That snippet also fails, giving this error:
Cannot add identity column, using the SELECT INTO statement, to table '#temp', which already has column 'Id' that inherits the identity property.
Ok... So the 'identity' property gets inherited into the temp table, and so I can't specify another identity column. I did some searching in the docs, and there didn't seem to be a way to get an autoincrement field without it being an identity field. Ok, fine, I know a retarded way to get around that.

Use a temporary table, INSERT INTO, and a table variable to mask the identity column

I actually wrote a variant of this:

--Make a table to copy my ids into 
DECLARE @retarded TABLE (id int
 
--copy all my ids in, but leave the 'identity' property behind 
INSERT INTO @retarded (id) 
    (SELECT Id FROM Items) 
 
--select into the temp table, using IDENTITY to get my row counter 
SELECT Id, IDENTITY(int, 1,1) [num] 
INTO #temp  
FROM @retarded 
ORDER BY NEWID() 
 
--get out our sample 
SELECT Id FROM #temp WHERE num < @limit 
 
--clean up 
DROP TABLE #temp 
This worked, but made me sick to my stomach. There had to be a better way.

Use a temporary table, INSERT INTO, and a dummy expression

So, I dug through some more documentation, finding this gem:

When an existing identity column is selected into a new table, the new column inherits the IDENTITY property, unless one of the following conditions is true:
  • The SELECT statement contains a join, GROUP BY clause, or aggregate function.
  • Multiple SELECT statements are joined by using UNION.
  • The identity column is listed more than one time in the select list.
  • The identity column is part of an expression.
If any one of these conditions is true, the column is created NOT NULL instead of inheriting the IDENTITY property. All rules and restrictions for the identity columns apply to the new table.
Ok, so I decided the least despicable thing to do would be to make a dummy expression to drop the identity property:
--select into the temp table, using IDENTITY to get my row counter 
SELECT Id*1 [Id], IDENTITY(int, 1,1) [num] 
INTO #temp  
FROM Items 
ORDER BY NEWID() 
 
--get out our sample 
SELECT Id FROM #temp WHERE num < @limit 
 
--clean up 
DROP TABLE #temp 
I did some profiling, and doing id*1 was the same as doing id+0, so I left it as multiplication.

All in all, very, very frustrating, and I'm not sure if the TOP 2147483647 option above is worse than the hack at the end. Votes are welcome.

I hate how much I jump through hoops and reinvent to do what should be very simple tasks. I mean, c'mon, we've been programming for over 50 years now, and I'm pretty sure pulling a random sample from a dataset is not a new task. We should be better than this.

posted on Thursday, February 09, 2006 7:10 PM