I was recently presented with a seemingly simple task, write a query to return a random* sample of rows
from a table, with the number of rows to pull determined at runtime.
The surrounding problem is a bit more complex, but this was the specific task I was trying to accomplish,
and SQL Server 2000 didn't make it very easy. The rest of my task involved deleting rows where the id wasn't in my random sample.
First try, my almost-dream syntax
My first attempt was a bit naive:
|
| SELECT TOP @limit Id |
| FROM Items |
| ORDER BY NEWID() |
@limit is a variable which has how many rows I should be returning.
I order by NEWID(), which makes a GUID for every row, as this is the best way I know of to get it ordered randomly. I don't
think its a terribly good way, but its the best I know. If you have a better way, please comment or email.
That snippet will of course fail, giving this error message:
Incorrect syntax near '@limit'.
Ok, so I can't dynamically set the TOP. Fine.
Use ROWCOUNT
By using SET ROWCOUNT @limit, I could instruct SQL Server to only return @limit rows when performing a SELECT. So, this was the query to
delete rows not in my random sample:
|
| |
| SET ROWCOUNT @limit |
| |
| DELETE FROM #temp WHERE Id NOT IN( |
| |
| SELECT Id FROM Items |
| ORDER BY NEWID() |
| ) |
| |
| SET ROWCOUNT 0 |
That snippet also fails, giving this error:
The ORDER BY clause is invalid in views, inline functions, derived tables, and subqueries, unless TOP is also specified.
Ok, so I can't use ORDER BY in the subquery. Well damn, that would have been really clean. I could add a
TOP 2147483647 in there, but that seemed like a bad idea.
Use a table variable
Ok, next thought is to use a table variable, insert my rows in a random order, numbering each row, and then pull the rows with a number
below my limit:
|
| |
| DECLARE @temp TABLE (id int NOT NULL, num int NOT NULL IDENTITY(1,1)) |
| |
| |
| INSERT INTO @temp (id) |
| (SELECT Id |
| FROM Items |
| ORDER BY NEWID()) |
| |
| |
| SELECT Id FROM @temp WHERE num <= @limit |
That snippet also fails, giving this error:
Incorrect syntax near the keyword 'ORDER'.
Oh right, I can't use ORDER BY in the subquery. Fine.
Use a temporary table and INSERT INTO
I was trying to avoid temporary tables, since they have some performance issues, but figured maybe there wasn't any other way. So,
I broke down and used a temp table:
|
| |
| SELECT Id, IDENTITY(int, 1,1) [num] |
| INTO #temp |
| FROM Items |
| ORDER BY NEWID() |
| |
| |
| SELECT Id FROM #temp WHERE num < @limit |
| |
| |
| DROP TABLE #temp |
That snippet also fails, giving this error:
Cannot add identity column, using the SELECT INTO statement, to table '#temp', which already has column 'Id' that inherits the identity property.
Ok... So the 'identity' property gets inherited into the temp table, and so I can't specify another identity column.
I did some searching in the docs, and there didn't seem to be a way to get an autoincrement field without it
being an identity field. Ok, fine, I know a retarded way to get around that.
Use a temporary table, INSERT INTO, and a table variable to mask the identity column
I actually wrote a variant of this:
|
| |
| DECLARE @retarded TABLE (id int) |
| |
| |
| INSERT INTO @retarded (id) |
| (SELECT Id FROM Items) |
| |
| |
| SELECT Id, IDENTITY(int, 1,1) [num] |
| INTO #temp |
| FROM @retarded |
| ORDER BY NEWID() |
| |
| |
| SELECT Id FROM #temp WHERE num < @limit |
| |
| |
| DROP TABLE #temp |
This worked, but made me sick to my stomach. There
had to be a better way.
Use a temporary table, INSERT INTO, and a dummy expression
So, I dug through some more documentation, finding this gem:
When an existing identity column is selected into a new table, the new column inherits the IDENTITY property, unless one of the following conditions is true:
- The SELECT statement contains a join, GROUP BY clause, or aggregate function.
- Multiple SELECT statements are joined by using UNION.
- The identity column is listed more than one time in the select list.
- The identity column is part of an expression.
If any one of these conditions is true, the column is created NOT NULL instead of inheriting the IDENTITY property. All rules and restrictions for the identity columns apply to the new table.
Ok, so I decided the least despicable thing to do would be to make a dummy expression to drop the identity property:
|
| |
| SELECT Id*1 [Id], IDENTITY(int, 1,1) [num] |
| INTO #temp |
| FROM Items |
| ORDER BY NEWID() |
| |
| |
| SELECT Id FROM #temp WHERE num < @limit |
| |
| |
| DROP TABLE #temp |
I did some profiling, and doing
id*1 was the same as doing
id+0, so I left it as multiplication.
All in all, very, very frustrating, and I'm not sure if the TOP 2147483647 option above is worse than the hack at the end. Votes are welcome.
I hate how much I jump through hoops and reinvent to do what should be very simple tasks.
I mean, c'mon, we've been programming for over 50 years now, and I'm pretty sure pulling a random sample from a dataset is
not a new task. We should be better than this.