I received a helpful comment on my last entry, and sat down to try out this seemingly ideal
solution:
|
| |
| SET ROWCOUNT @limit |
| |
| DELETE FROM #temp WHERE Id NOT IN( |
| |
| SELECT TOP 100 PERCENT Id FROM Items |
| ORDER BY NEWID() |
| ) |
| |
| SET ROWCOUNT 0 |
It didn't work, and I got really confused when I tried to figure out why, so I started making some
simple tests.
Setup and basic test
I started by creating a table variable, and populating it with some data:
|
| DECLARE @limiter TABLE (id int not null) |
| |
| INSERT INTO @limiter (id) ( |
| SELECT TOP 2000 Id |
| FROM Items |
| ) |
Next step, lets see if TOP 100 PERCENT works with ROWCOUNT the way I'd like it to:
|
| SET ROWCOUNT 10 |
| |
| SELECT TOP 100 PERCENT id FROM @limiter |
| ORDER BY NEWID() |
Running this gives me 10 rows, and running it again give me 10 different rows, so my randomish requirement is met.
Subquery test
Next test is my goal, the randomish sample. I make the 10 row random sample in a subquery, and then delete everything not in that sample. Yes, this is kinda backwards, but the actual code is more complicated than my simplification here, so its easier to generate all the possibilities then remove all that don't fall into our sample.
|
| |
| SET ROWCOUNT 10 |
| |
| DELETE FROM @limiter WHERE Id NOT IN ( |
| SELECT TOP 100 PERCENT id FROM @limiter |
| ORDER BY NEWID() |
| ) |
| |
| |
| SET ROWCOUNT 0 |
| SELECT * FROM @limiter |
I expect this to return 10 rows. It returns 2000, the full contents of @limiter. That implies the TOP 100 PERCENT in the subquery returned the full 2000 rows, so nothing was deleted. Checking the execution plan verifies this. SET ROWCOUNT doesn't work on subqueries? Or maybe the TOP 100 PERCENT is confusing SQL Server, so I'll try using a TOP N:
|
| SET ROWCOUNT 10 |
| |
| DELETE FROM @limiter WHERE Id NOT IN ( |
| SELECT TOP 20000 id FROM @limiter |
| ORDER BY NEWID() |
| ) |
| |
| |
| SET ROWCOUNT 0 |
| SELECT COUNT(id) FROM @limiter |
Nope, still get 2000 rows in @limiter after the delete. For grins, I start playing with the TOP statement, and get some confusing results:
| TOP N | Rows left in @limiter |
| TOP 20000 | 2000 |
| TOP 2000 | 2000 |
| TOP 1999 | 1999 |
| TOP 1998 | 1997 |
| Huh? The top 1998 should have returned 1998 rows; the 2 rows not in that sample should have been deleted. Lemme run that again to make sure I didn't mess up. |
| TOP 1998 | 2000 |
| What the hell? Is this some weird query caching thing? Let me run it a few more times... |
| TOP 1998 | 1998 |
| TOP 1998 | 1999 |
Seriously, WTF. The execution plan just creates more questions:
It's somehow processing 4M rows from a table variable with 2K rows!? Then the TOP 1998 gets performed in the Filter operation, which reduces it to 1999 rows? Ok, so 2K * 2K = 4M, so the WHERE Id NOT IN () is effectively performing a cross join, which makes sense, but why the 1999 rows?
|
| TOP 10 | 1990 |
Ok, so something is seriously inconsistent with ROWCOUNT and TOP in subqueries. When I take the SET ROWCOUNT out, it still is inconsistent. When I take out the ORDER BY NEWID(), it is still inconsistent.
TOP in a subquery is broken?
Consider this repro:
|
| DECLARE @t TABLE (id int not null) |
| |
| INSERT INTO @t (id) ( |
| SELECT TOP 20 Id FROM [someTable] |
| ) |
| |
| SELECT COUNT(Id) FROM @t |
| WHERE Id IN ( |
| SELECT TOP 10 Id FROM @t |
| ORDER BY NEWID() |
| ) |
What do you think the count should be? Hint: there's more than one answer. Run that query repeatedly, and bask in inconsistency. But... try this one:
|
| SELECT COUNT(Id) FROM [someTable] |
| WHERE Id IN ( |
| SELECT TOP 10 Id FROM [someTable] |
| ORDER BY NEWID() |
| ) |
And bam, we get 10 every time. So its not TOP and ORDER BY in subqueries, maybe its something to do with table variables? So, with that in mind, lets go back to the original test, but using a temp table instead of a table variable.
Out with table variables
So, back to the beginning:
|
| SELECT TOP 2000 Id INTO #limiter |
| FROM Items |
| |
| |
| SET ROWCOUNT 10 |
| |
| DELETE FROM #limiter WHERE Id NOT IN ( |
| SELECT TOP 100 PERCENT id FROM #limiter |
| ORDER BY NEWID() |
| ) |
| |
| |
| SET ROWCOUNT 0 |
| SELECT COUNT(id) FROM #limiter |
Result: 1990 rows. It should have 10. So, table variables aren't the problem.
Fine, you win, SQL Server
Fine. If you want it that badly MSSQL, you can have it. I'll keep my slimy solution from the last entry (for now), as you obviously don't want to let me do anything in a straighforward manner. I'll try to forget the inconsistencies I found today, but it'll take time to heal. Be patient with me. And don't get all jealous if you see me installing postgresql to see if it can do this better than you. You know I'm locked in to you.