Kevin Mintmier, Slalom Consulting

OPENQUERY(BRAIN, 'SELECT * FROM EVERYTHING');

Archive for the ‘SQL Server’ Category

Linked Server Issue with SSAS Provider


I was just working on a new post about using linked servers to interact with Analysis Services, and I ran into some odd behavior. I tried to test a new linked server to a local SSAS instance, and I got this error message:

The 32-bit OLE DB provider "MSOLAP" cannot be loaded in-process on a 64-bit SQL Server.

I'm running 64-bit SQL Server 2012 on my laptop, so the latter part of the message made sense. Fortunately, I had seen a similar issue recently and figured SQL Server was just trying to use the 32-bit SSAS driver.

Turns out both drivers were registered, but in the wrong order. SQL Server will use the most recently registered driver, and apparently the 32-bit driver had been registered after the 64-bit version.

This was a simple case of unregistering and re-registering the SSAS drivers.
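As an aside, another commonly cited approach for the in-process error (I didn't need it here, so treat this as a hedged alternative rather than the fix) is to explicitly allow the MSOLAP provider to load in-process. It's the same setting as the "Allow inprocess" checkbox under Server Objects > Linked Servers > Providers in SSMS:

```sql
-- Alternative approach (not the fix used in this post):
-- mark the MSOLAP provider as allowed to load in-process.
EXEC master.dbo.sp_MSset_oledb_prop N'MSOLAP', N'AllowInProcess', 1;
```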

1. Unregister both drivers.

regsvr32 /u "C:\Program Files (x86)\Microsoft Analysis Services\AS OLEDB\110\msolap110.dll"
regsvr32 /u "C:\Program Files\Microsoft Analysis Services\AS OLEDB\110\msolap110.dll"

2. Re-register both drivers, in the right order this time.

regsvr32 "C:\Program Files (x86)\Microsoft Analysis Services\AS OLEDB\110\msolap110.dll"
regsvr32 "C:\Program Files\Microsoft Analysis Services\AS OLEDB\110\msolap110.dll"

3. Restart SQL Server.

After this, the linked server to SSAS worked correctly.


Written by Kevin Mintmier

March 11, 2013 at 12:00 pm

Posted in SQL Server


Self-Referencing Linked Servers


Call me lazy, but I don’t always enjoy having to explicitly create table variables (or temp tables) to hold the output of a SELECT statement. This makes me a big fan of SELECT INTO. Unfortunately, we can’t SELECT INTO from a stored procedure.
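To illustrate the convenience (a minimal sketch against a system view, so it runs anywhere):

```sql
-- SELECT INTO creates the target table on the fly, no DDL required...
SELECT name, object_id
INTO #Objects
FROM sys.objects;

-- ...but there is no direct equivalent for a stored procedure:
-- SELECT * INTO #Results FROM EXEC dbo.SomeProc;  -- not valid T-SQL

DROP TABLE #Objects;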

Or can we? Not directly, of course, but it turns out that we can do it by passing the stored procedure call through a linked server using OPENQUERY.

A linked server encapsulates a connection string and uses a data provider (driver) to give us access to other servers. OPENQUERY is one of several commands (others being OPENROWSET and EXEC) that can leverage linked servers to interact with other servers. For this post, we’ll be treating the local server as a remote server, and we’ll use OPENQUERY with a linked server to issue a stored procedure call to the local server.

Since the system stored procedure sys.sp_who returns a table, we’ll use it in our example.

Here’s how we create a self-referencing linked server. We’ll call it SELF. Note in the data source definition that we use the global variable @@SERVERNAME – so we don’t even have to know our server or instance name. Now that’s satisfyingly lazy.

EXEC master.dbo.sp_addlinkedserver
  @server = N'SELF'
  , @srvproduct = N''
  , @provider = N'SQLNCLI'
  , @datasrc = @@SERVERNAME;

Note: @@SERVERNAME is incredibly convenient here, but it doesn't automatically reflect changes to the network name of a server the way SERVERPROPERTY('SERVERNAME') does. If your server name changes, @@SERVERNAME can be misleading until you update sys.servers, but that's a topic for another post.
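If you're curious, you can compare the two side by side (a quick sanity check, nothing more):

```sql
-- After a server rename, these two can disagree until sys.servers is
-- updated (sp_dropserver/sp_addserver) and the service is restarted.
SELECT
  @@SERVERNAME AS ServerNameVariable
  , SERVERPROPERTY('SERVERNAME') AS ServerNameProperty;
```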

For good measure, this is how we would drop our linked server, should the need arise:

EXEC master.dbo.sp_dropserver
  @server = N'SELF';

Now, here’s the good stuff.

SELECT q.*
INTO #Temp
FROM OPENQUERY(
  SELF
  , 'EXEC sys.sp_who'
) q;

SELECT * FROM #Temp;

-- TODO: do amazing things with #Temp

DROP TABLE #Temp;

 

Again, OPENQUERY allows us to execute a T-SQL query against a linked server (SELF, in this case), and since the query is executed in a different context (that of the callee vs. that of the caller), it doesn’t have the limitation mentioned in the first paragraph – so we can insert the results directly into a temp table.

Incidentally, if you plan to use OPENROWSET or OPENDATASOURCE to perform distributed queries (another great topic), you'll need to enable Ad Hoc Distributed Queries. I mention this because I bumped into a related issue while using a linked server to access an OLAP data source (spoiler alert: that's in a future post).

EXEC sp_configure 'show advanced options', 1;
GO
RECONFIGURE;
GO
EXEC sp_configure 'Ad Hoc Distributed Queries', 1;
GO
RECONFIGURE;
GO

So…problem solved, right? Well, sort of. OPENQUERY works, but I picked sp_who on purpose because it doesn’t create any temp tables during its own execution. Why is that important?

Because, of course, there's a caveat to this little gem. Using a temp table to capture the results of a stored procedure that itself uses temp tables will cause an error. If we had tried to run sp_who2 instead of sp_who, we would have gotten this lovely message:

Msg 11526, Level 16, State 1, Procedure sp_describe_first_result_set, Line 1
The metadata could not be determined because statement 'delete #tb1_sysprocesses
        where   lower(status)  = 'sleeping'
        and     upper(cmd)' in procedure 'sp_who2' uses a temp table.

So, if we wanted to capture the results of sp_who2, we’d have to explicitly create our temp table and use INSERT INTO…EXEC instead. You can try that yourself…I did, and it worked.
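In case you'd like to see roughly what that looks like, here's a sketch. (I've lazily sized the character columns as VARCHAR(MAX) rather than matching sp_who2's exact metadata, so consider the widths an assumption.)

```sql
-- Explicit temp table matching sp_who2's result shape.
CREATE TABLE #Who2 (
  SPID INT
  , [Status] VARCHAR(MAX)
  , [Login] VARCHAR(MAX)
  , HostName VARCHAR(MAX)
  , BlkBy VARCHAR(MAX)
  , DBName VARCHAR(MAX)
  , Command VARCHAR(MAX)
  , CPUTime INT
  , DiskIO INT
  , LastBatch VARCHAR(MAX)
  , ProgramName VARCHAR(MAX)
  , SPID_1 INT
  , REQUESTID INT
);

-- INSERT INTO...EXEC works where OPENQUERY + SELECT INTO does not.
INSERT INTO #Who2
EXEC sys.sp_who2;

SELECT * FROM #Who2;

DROP TABLE #Who2;
```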

One final thought: I also wondered if a multi-statement table-valued function represented a comparable workaround. Let's give it a shot.

CREATE FUNCTION tvf_who2 ()
RETURNS @Who2 TABLE (
  SPID INT
  , [Status] VARCHAR(MAX)
  , [LOGIN] VARCHAR(MAX)
  , HostName VARCHAR(MAX)
  , BlkBy VARCHAR(MAX)
  , DBName VARCHAR(MAX)
  , Command VARCHAR(MAX)
  , CPUTime INT
  , DiskIO INT
  , LastBatch VARCHAR(MAX)
  , ProgramName VARCHAR(MAX)
  , SPID_1 INT
  , REQUESTID INT
)
AS
BEGIN
  INSERT INTO @Who2
  EXEC sys.sp_who2;

  RETURN;
END
GO

SELECT *
FROM tvf_who2();

INSERT…EXEC is completely valid, so in theory, we achieve the simplicity of the subsequent SELECT statement, right? Survey says…

Msg 443, Level 16, State 14, Procedure tvf_who2, Line 18
Invalid use of a side-effecting operator 'INSERT EXEC' within a function.

Looks like the linked-server-OPENQUERY technique takes the prize here. I use it when the situation warrants, and I hope you find it useful.

Written by Kevin Mintmier

March 10, 2013 at 12:00 pm

Posted in SQL Server


Conjuring Data with MDX


Yesterday one of my teammates asked me a great question: is there an equivalent to the following in MDX?

SELECT 'Value 1'
UNION
SELECT 'Value 2'

An MDX purist might answer “no”, since in T-SQL, you can execute an “independent” query (such as the one above) without the context of a specific database (i.e. without a FROM clause). MDX queries require an SSAS database (specifically a cube) against which to execute.

An MDX adventurist (you heard it here first) will say “yes”, since almost anything is possible with enough blood, sweat and carbs. Without further ado, here’s the response I provided.

If you don’t mind the two “values” being on the columns, it’s easy:

WITH

MEMBER [Measures].[First Value] AS "Value 1"
MEMBER [Measures].[Second Value] AS "Value 2"

SELECT
{
    [Measures].[First Value]
    , [Measures].[Second Value]
} ON Columns
FROM [My Cube];

The results should look something like this:

First Value    Second Value
Value 1        Value 2

In MDX, you must put something on Columns if you want to put anything on Rows. (Despite reading the textbook explanations, I've never really understood why…but it seems to be true even as I test these scripts.) If you want to get the values on the rows (more like the T-SQL output), you need to create a surrogate aggregate member to fill the Columns axis of the SELECT statement. It can simply aggregate the current measure. The name of the surrogate becomes the equivalent of a column name.

You can’t select from the same hierarchy in both the Columns and the Rows, so I just added [My Value] to the first hierarchy that came to mind.

WITH

MEMBER [My Dimension].[My Hierarchy].[My Value] AS Aggregate([Measures].CurrentMember)

MEMBER [Measures].[First Value] AS "Value 1"
MEMBER [Measures].[Second Value] AS "Value 2"

SELECT
{
    [My Dimension].[My Hierarchy].[My Value]
} ON Columns
, {
    [Measures].[First Value]
    , [Measures].[Second Value]
} ON Rows
FROM [My Cube];

The results should now resemble this:

               My Value
First Value    Value 1
Second Value   Value 2

Better, right? Unless you've got a trick up your sleeve that I haven't seen, this is as good as it's going to get for this example. You might be tempted to "simplify" the query (my attempt follows), but I don't think it can be done. Please let me know if you determine otherwise.

WITH

MEMBER [Measures].[First Value] AS "Value 1"
MEMBER [Measures].[Second Value] AS "Value 2"

SELECT
{
  NULL
} ON Columns
, {
  [Measures].[First Value]
  , [Measures].[Second Value]
} ON Rows
FROM [My Cube];

Seems legit, doesn’t it? Not even close.

First Value
Second Value

That’s odd… and useless. So the second option appears to be the best solution.

Look for more MDX tidbits here in the future.

Written by Kevin Mintmier

March 9, 2013 at 6:02 pm

Posted in SQL Server


Scripting SQL Server Service Control


Since I use my laptop for development, I regularly need to do two mutually exclusive things: (1) run SQL Server, and (2) preserve battery life. As a result, I find myself starting and stopping the SQL Server services on a regular basis.

Rather than use the SQL Server Configuration Manager (or the Services control panel), I use two scripts. Both leverage the NET command. If you’ve never used it before, I encourage you to do a quick web search. There are already lots of good explanations and tutorials out there. For this purpose, know that we can use NET START and NET STOP to control Windows services – either by name or by identifier. I prefer names because they’re more human-readable.

One caveat to these scripts – on my Slalom laptop, I have to right-click on them and select “Run as Administrator”. In my virtualized environments where I’m the machine admin, that step isn’t necessary.

Here’s the startup script.

Start SQL.cmd

@ECHO OFF
NET START "SQL Full-text Filter Daemon Launcher (MSSQLSERVER)"
NET START "SQL Server Browser"
NET START "SQL Server (MSSQLSERVER)"
NET START "SQL Server Agent (MSSQLSERVER)"
NET START "SQL Server Reporting Services (MSSQLSERVER)"
NET START "SQL Server Analysis Services (MSSQLSERVER)"
NET START "SQL Server Analysis Services (TABULAR)"
NET START "SQL Server Integration Services 11.0"

Simple, right? The shutdown script is very similar. The only two differences are the NET command (STOP instead of START) and the service order. This ensures that dependent services are shut down first – otherwise some services might remain running after the script terminates.

Stop SQL.cmd

@ECHO OFF
NET STOP "SQL Server Browser"
NET STOP "SQL Server Agent (MSSQLSERVER)"
NET STOP "SQL Server (MSSQLSERVER)"
NET STOP "SQL Server Reporting Services (MSSQLSERVER)"
NET STOP "SQL Server Analysis Services (MSSQLSERVER)"
NET STOP "SQL Server Analysis Services (TABULAR)"
NET STOP "SQL Server Integration Services 11.0"
NET STOP "SQL Full-text Filter Daemon Launcher (MSSQLSERVER)"

If you’re wondering where I got my list of services, just take a look at your SQL Server Configuration Manager. Make sure you update the script to use your service names in the appropriate order, and you’re all set.


Hope these scripts save you some time.

Written by Kevin Mintmier

March 9, 2013 at 12:00 pm

Graph Traversal


This is the last post in a 4-part series. All of the scripts in this series are designed to run in SQL Server (presumably 2005 forward) without the need for any non-system tables.

If you’re just tuning in, we’ve been talking about recursive CTEs. In my last post, we used them to create a world of interconnected areas and regions. If we wanted to immerse ourselves in that virtual space, we would need the ability to select a destination and travel there. This brings us to the topic of path-finding – how do we get from Point A to Point B?

Path-finding algorithms are a form of graph traversal. A graph is a system of interconnected nodes. Certain smart people (in this case, mathematicians) insist on carrying the definition much further, so let’s see what they have to say on the subject.

  • Undirected graphs are sets of connected nodes where the links (called “edges”) are all two-way, meaning that you can travel back and forth between connected nodes.
  • Directed graphs are sets of connected nodes where the edges are all only one-way.
  • Mixed graphs have some directed edges and some undirected edges.
  • Cyclic graphs have edges that enable a path to lead back to a previous node.
  • Acyclic graphs have directed edges that do not allow a path to lead back to a previous node. A hierarchy is a great example of a directed, acyclic graph.
  • In your mind, add about 64 more bullets here, and silently thank me for not continuing this list. Graphs are a rich topic, worthy of a 200-level course.

Wikipedia will gladly compromise your day with a ludicrous number of additional graph types, properties, and examples. Based on the most relevant definitions, though, the world we built in the previous post is essentially a cyclic mixed graph implemented as a directed graph.

I’ve seen some pretty slick path-finding algorithms in T-SQL, but none of them were a perfect fit for this scenario. The closest conceptual technique is based on Dijkstra’s algorithm, which Edsger Dijkstra conceived in 1956 and which is still widely used in computer science today. That’s quite a track record. Seriously, I’ll be pleased if this post shows up in your search results a month from now.

I’ll leave it up to you to research Dijkstra’s algorithm, and I trust that a web search for DIJKSTRA TSQL will lead you to the same queries that inspired my solution. Credit where it’s due, I wouldn’t be posting this without the authors of those algorithms. In the end, though, I put in the time to understand what their code was actually doing – and I adapted those scripts for this purpose.

One of my major adaptations pertains to the fact that Dijkstra’s algorithm allows edges to be weighted. This is similar to determining the number of miles between cities and charting the shortest driving route. When Dijkstra used the expression “shortest path”, he was referring to the “lightest” path – that is, the one with the lowest cumulative “weight”. For our exercise, the “shortest path” is a bit simpler – it’s the one with the fewest connections from the start area to the end area.
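For context, a weighted variant might look like this. This is purely hypothetical; the Connection table we actually built in the previous post has no Weight column:

```sql
-- Hypothetical weighted version of the Connection table.
CREATE TABLE dbo.WeightedConnection (
  StartAreaId INT NOT NULL
  , EndAreaId INT NOT NULL
  , Weight INT NOT NULL  -- e.g. miles between cities
  , CONSTRAINT PK_WeightedConnection PRIMARY KEY CLUSTERED (
    StartAreaId ASC
    , EndAreaId ASC
  )
);
-- Dijkstra's algorithm would then minimize the cumulative Weight
-- along each path instead of simply counting hops, as we do below.
```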

Before we jump into the code, let’s talk about how we’re going to walk through the world. Let’s say we’re looking for the path from Area 1 to Area 10. We need to start in Area 1, find all of its destinations, and their destinations (etc.) until we reach Area 10.

Through each iteration of our search, we need to do two things:

  1. Keep track of where we came from. We’ll call this our “predecessor area”.
  2. Never return to an area we already traveled through.
    • Said differently – as an example, since we started in Area 1, we should never travel back to Area 1. This would create a circular loop at runtime, and we might never reach our destination. I don’t know about you, but I like to get where I’m going…so this sort of sounds bad.

Once we reach Area 10, we can walk backwards through the list of “predecessor areas” until we reach Area 1. This gives us the shortest path from our start area to our destination area. Of course, it will be in reverse-order, but we’re smart (like mathematicians!), so we can deal with that. Once we reach our destination, our code needs to return the series of connections that leads from our start area to our end area.

A special case exists where we walk through all of the “reachable” areas (via our connections), and we never reach the destination area. This is entirely possible since we only checked for orphan areas in the last post – we didn’t do an exhaustive check to ensure that every area was reachable from every area. (That’s largely because we hadn’t covered path-finding yet, so feel free to improve the code after we finish discussing this bit of awesome-sauce.) If we can’t reach the destination area, our code needs to fail gracefully and return an empty series of connections.

That’s enough philosophy. Let’s write some code!

DECLARE
  @StartAreaId INT = 1
  , @EndAreaId INT = 10;

Yeah. You knew that was coming. Don’t deny it. If we decided to push all this code into a multi-statement table-valued function later (hint, hint), these variables would be the parameters.

DECLARE @Path TABLE (
  Step INT
  , AreaId INT
  , [Path] VARCHAR(8000)
);

This is the table we’re eventually going to populate. The Step column will be a list (0..N) of the steps we take to reach our @EndAreaId. The AreaId will indicate the Areas we travel through to reach our destination. The Path column will be a textual representation of the steps from each area to our destination area.

DECLARE
  @Step INT = 1
  , @Found BIT = 0
  , @Done BIT = 0
  , @StepAreaCount INT;

These are the only variables we need for our algorithm. We’ll track our current step (0 at our start area, 1 after our first move, etc.), whether the destination has been found, whether we’re done (out of areas to traverse), and the number of areas we reached in the current step (when this is zero, we’re done).

DECLARE @Graph TABLE (
  Step INT
  , PredecessorId INT
  , AreaId INT PRIMARY KEY CLUSTERED WITH (IGNORE_DUP_KEY = ON)
);

A graph is always explicitly defined from a starting point. Our world is defined generically in the database; we will define its graph specifically from our @StartAreaId.

An important point here is that I want you to think of @Graph as a hierarchy. We’re starting from a node, and every subsequent node is a child of that node, by at least one level. In order to eliminate circular loops, we need to ensure that we never return to a node that’s already in the tree (graph).

Keep the IGNORE_DUP_KEY option in that table definition in your head. We’ll come back to it momentarily.

INSERT INTO @Graph (Step, AreaId)
VALUES (0, @StartAreaId);

We’ve recorded our start area as step 0.

Now, let’s walk through the first of two major steps in this process. We’ll work through our available (connected) areas, keeping track of where we came from.

WHILE @Done = 0 AND @Found = 0
BEGIN
  INSERT INTO @Graph (
    Step
    , PredecessorId
    , AreaId
  )
  SELECT
    @Step
    , MIN(c.StartAreaId) AS PredecessorId
    , c.EndAreaId AS AreaId
  FROM
    dbo.Connection c
    JOIN @Graph g
      ON c.StartAreaId = g.AreaId
      AND g.Step = @Step - 1
  GROUP BY
    c.EndAreaId;

  SET @StepAreaCount = @@ROWCOUNT;

  SET @Done = CASE WHEN @StepAreaCount > 0 THEN 0 ELSE 1 END;
  SET @Found =
    (SELECT COUNT(1) FROM @Graph WHERE AreaId = @EndAreaId);

  SET @Step = @Step + 1;
END;

Yeah, that’s a while loop. If you can figure out a way to do this part with a CTE, I’d love to learn how you did it.

There are two related things to note in the query above. If more than one of the areas we’ve reached in a given step has connections to the same destination, we should only include that destination once in our result set. Together, the MIN and the GROUP BY accomplish this nicely.

The while-loop continues so long as we’re “not done” and “not found”. @Done implies that we’re out of areas, meaning that we’ve traversed the entire graph and there are no areas we haven’t visited. @Found indicates that we’ve reached our destination.

The first part of this loop updates our @Graph based on the available destinations (connections). Let’s return our attention to a bit of code mentioned earlier, in the definition of the @Graph table. We told SQL Server to ignore duplicate keys via IGNORE_DUP_KEY. Since AreaId is a KEY, each AreaId can appear in the @Graph table only once. Under normal circumstances, if we were to insert the same AreaId into the @Graph table a second time, it would throw an error:

Violation of PRIMARY KEY constraint '[…]'. Cannot insert duplicate key in object '@Graph'. The duplicate key value is ([…]).

 

I should preface the following statement by saying that I would never do this in a production environment, outside the context of temporary tables and unusual use cases (like our current endeavor). That said, ignoring duplicate keys here is downright awesome. Simply put, it prevents us from re-visiting previously-visited graph nodes (areas).

Got that? If we allow ourselves to try and insert duplicate keys (for areas we’ve already visited), but we categorically deny the right to do so – without throwing an error – we avoid the need to check for cases where we’ve already visited a given area. Those types of checks (“WHERE NOT IN…”) are expensive. This keeps our query fast and light. Nice.
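The effect is easy to demonstrate in isolation. This is just a sketch (the table variable name is made up), using the same constraint syntax as @Graph:

```sql
DECLARE @Visited TABLE (
  AreaId INT PRIMARY KEY CLUSTERED WITH (IGNORE_DUP_KEY = ON)
);

INSERT INTO @Visited (AreaId) VALUES (1);
-- The second insert is silently discarded with the informational
-- message "Duplicate key was ignored." -- no error is thrown.
INSERT INTO @Visited (AreaId) VALUES (1);

SELECT COUNT(1) AS VisitedAreas FROM @Visited;  -- one row survives
```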

Moving on…

IF @Found = 0 RETURN;

 

Abort, Scotty, if we couldn’t reach our destination. Sounds reasonable.

DECLARE @Delimiter VARCHAR(4) = ' > ';

 

Jumping ahead a bit, we want to build textual path descriptions. This delimiter determines how we subdivide those lists (e.g. “1 > 10”).

Now, Spartans, prepare for glory.

WITH

[Path] (
  Step
  , PredecessorId
  , AreaId
  , [Path]
)
AS (
  SELECT
    g.Step
    , g.PredecessorId
    , g.AreaId
    , CONVERT(VARCHAR(8000), g.AreaId)
  FROM @Graph g
  WHERE g.AreaId = @EndAreaId

  UNION ALL

  SELECT
    g.Step
    , g.PredecessorId
    , g.AreaId
    , CONVERT(VARCHAR(8000), CONVERT(VARCHAR(8000), g.AreaId)
        + @Delimiter + p.[Path])
  FROM
    @Graph g
    JOIN [Path] p
      ON p.PredecessorId = g.AreaId
)

INSERT INTO @Path (
  Step
  , AreaId
  , [Path]
)
SELECT TOP 99.99 PERCENT
  p.Step
  , p.AreaId
  , p.[Path]
FROM [Path] p
ORDER BY p.Step;

Another CTE? Of course! Makes my brain tingle. Plus, this achieves the second major step in our process. We walk backward through the areas we identified in the first major step, starting from @EndAreaId (have a look at the anchor), from each area to its predecessor. We’re guaranteed we’ll reach the start area (thankfully), and we build a delimited list of areas (our “path”) as we go.

One weird line deserves mention: the TOP 99.99 PERCENT. I don’t understand why SQL Server requires this syntax, but it enables us to correctly observe the ORDER BY clause while we INSERT from a CTE. Typically we wouldn’t care about ORDER during an INSERT, but we needed to reverse the order of our path as we were backtracking.

Only one action remains – it’s time to enjoy the fruit of our labor.

SELECT * FROM @Path;

 

We’ve got an ordered list of navigational steps to follow, and we can see (in delimited text) the path from each area in the route to our destination.

For example:

1 > 709 > 943 > 10

 

That looks like a route I could follow. Especially if this were a game world, and my character were attempting to travel from Area 1 to Area 10.

What’s the Point?

Good question. Let’s imagine that one of two things is true:

  1. You’re writing a game. Sweet. Now you can do some degree of path-finding in your database. Glad I could help.
  2. You’re analyzing social networks. (Bet you didn’t see that coming.) Ever wonder how LinkedIn knows who you’re indirectly connected to, and how many “degrees” away they are? Now you know, at least conceptually.

Bottom line, beyond just facilitating navigation in a game environment, this concept enables us to begin analyzing social, business and physical networks. Your implementation may vary based on intent, but the idea has thrived since Dijkstra and will continue long after we hang up our keyboards.

I’ll let you ruminate on that last point, as actual social network analysis is beyond the scope of this series. Hopefully, though, your brain is in high gear, its creative centers humming like a string beneath the fingers of the wind. (Yeah…that killed the mood.)

With that, we’ve reached the end of the series. Together, we’ve slain the CTE dragon and rescued the graph-search princess (or prince, if that’s a better metaphor). I hope you’ve enjoyed reading this as much as I’ve enjoyed writing it.

Written by Kevin Mintmier

March 8, 2013 at 3:03 pm

World-Building


This is the third post in a 4-part series. All of the scripts in this series are designed to run in SQL Server (presumably 2005 forward) without the need for any non-system tables.

I hope you’re enjoying this series on recursive CTEs. Up until now we’ve been looking at philosophical, textbook, and conceptual examples. Seems like it’s time to have some fun. Now we’re going to build a virtual space.

What’s in a World?

Let’s be ambitious and set out to simulate a world. A really, really big world. To make such an environment manageable, it needs to be broken down into small areas. Areas will be grouped in larger regions, and areas will need to be connected. That will be our focus in this post.

To create order and keep this reasonably simple, I’m going to suggest that we create 26 regions named for the letters of the alphabet (A to Z) and a variable number of areas (initially 1,000) that are randomly assigned to regions, and randomly connected.

Up until now in this series, two things have been true:

  1. You haven’t needed any particular data source to follow along; all of the queries have either referenced SQL Server system tables or generated their own data. This will remain true.
  2. You haven’t needed permission to CREATE or ALTER anything. This is about to go out the window. Get ready to start building things. You’ll need a SQL Server sandbox if you want to follow along. I’m using SQL Server 2012, but I expect 2008 R2 would suffice.

With that said, let’s create a table to represent our regions.

CREATE TABLE dbo.Region (
  RegionId INT NOT NULL
  , Name VARCHAR(8) NOT NULL
  , CONSTRAINT PK_Region PRIMARY KEY CLUSTERED (RegionId ASC)
);

 

Simple – just an identifier and a name. Now, let’s do the same thing for areas, and be sure we include an identifier for each one’s region.

CREATE TABLE dbo.Area (
  AreaId INT NOT NULL
  , RegionId INT NOT NULL
  , CONSTRAINT PK_Area PRIMARY KEY CLUSTERED (AreaId ASC)
);

 

Good. One more table is required for this stage of our world-building – we need a way to establish connections between areas.

CREATE TABLE dbo.Connection (
  StartAreaId INT NOT NULL
  , EndAreaId INT NOT NULL
  , CONSTRAINT PK_Connection PRIMARY KEY CLUSTERED (
    StartAreaId ASC
    , EndAreaId ASC
  )
);

 

These are the only tables we need in order to establish the basic geography of our world and (in the next post) enable navigation therein. Next, let’s look at how to populate those tables.

If you’re familiar with a traditional, high-level programming language like C# or Java, I’m sure you could generate data outside of SQL Server and simply insert it. This exercise will demonstrate that it can be done as effectively (and likely faster) without external influence.

Regions

This is the easiest table to build, since regions have no dependencies. Here’s the code:

DECLARE @NumRegions INT = 26;

TRUNCATE TABLE dbo.Region;

WITH RegionList AS (
  SELECT 1 AS RegionId

  UNION ALL

  SELECT RegionId + 1
  FROM RegionList
  WHERE RegionId < @NumRegions
)

INSERT INTO dbo.Region (RegionId, Name)
SELECT
  r.RegionId
  , 'Region ' + CHAR(64 + r.RegionId)
FROM RegionList r
OPTION (MAXRECURSION 0);

 

That should look familiar from previous posts in this series. It’s a simple recursive CTE. The only new point of interest is the way we’re naming the regions. Since the letter A has an ASCII value of 65, we can use our list of region identifiers (1 to 26) to walk through the alphabet.
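The CHAR trick is easy to verify on its own:

```sql
-- ASCII 65 = 'A', so 64 + RegionId walks A..Z for RegionId 1..26.
SELECT
  'Region ' + CHAR(64 + 1) AS FirstRegion    -- Region A
  , 'Region ' + CHAR(64 + 26) AS LastRegion; -- Region Z
```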

Areas

Populating the areas is very similar.

DECLARE @NumAreas INT = 1000;

TRUNCATE TABLE dbo.Area;

WITH AreaList AS (
  SELECT 1 AS AreaId

  UNION ALL

  SELECT AreaId + 1
  FROM AreaList
  WHERE AreaId < @NumAreas
)

INSERT INTO dbo.Area (AreaId, RegionId)
SELECT
  a.AreaId
  , ROUND(RAND(CONVERT(VARBINARY, NEWID())) * (@NumRegions - 1), 0) + 1
FROM AreaList a
OPTION (MAXRECURSION 0);

Again, this should look very familiar. This time, to randomly assign areas to regions, we’re using NEWID to reseed RAND on the fly, a technique covered in a previous post.
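If you want to convince yourself the reseeding matters, run this a few times (any multi-row source will do; I’m borrowing sys.objects):

```sql
-- RAND() with no seed is evaluated once per query, so it repeats on
-- every row; reseeding from NEWID() varies it row by row.
SELECT TOP 5
  RAND() AS SameEveryRow
  , RAND(CONVERT(VARBINARY, NEWID())) AS DifferentEveryRow
FROM sys.objects;
```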

Along the way, we should inspect the data. Let’s look at the areas with their respective Regions:

SELECT
  a.AreaId AS Area
  , r.Name AS Region
FROM
  dbo.Area a
  JOIN dbo.Region r
    ON a.RegionId = r.RegionId;

 

And let’s look at the regions and see how many areas they contain:

SELECT
  r.Name AS Region
  , COUNT(1) AS Areas
FROM
  dbo.Region r
  JOIN dbo.Area a
    ON r.RegionId = a.RegionId
GROUP BY
  r.Name;

 

Looks good on my end; I expect you’re seeing something similar, but thanks to the randomization it’s unlikely our datasets have much in common.

Connections

Now comes the tricky part…we need to connect the areas. Without connections we would be forever stuck in a single segment of our simulated world. Let’s establish some business rules to govern the creation of our connections.

  1. Connections are explicitly one-way.
    • A connection from Area 1 to Area 2 does not imply that you can travel back to Area 1 from Area 2. A separate and distinct connection from Area 2 to Area 1 must exist to support a direct return trip.
  2. Initially, each area can be the starting point for up to 5 connections.
    • This number should be configurable.
  3. Areas cannot connect to themselves.
    • This would just be a pointless waste of pointlessness.
  4. Only distinct connections can exist.
    • There cannot be multiple connections from Area 1 to Area 2.
  5. There can be no “orphan” areas.
    • Each area must have at least one connection leading to it.
  6. A minimum of 10% of the paths should be two way.
    • This means that at least 10% of the connections should have a matching, inverse connection (such as 2 > 1 to complement 1 > 2).
    • This number should be configurable.

That feels like a lot of rules for our exercise, but (spoiler alert) they will make it much easier to do the exciting things we’ll cover in the next post.

Step one of connection-building is very similar to the Inception-style lists-within-lists example in the previous post. If you followed that logic without a hitch, you’ll be fine here. Just like we generated a random number of die rolls for an easily-configurable number of people, we’re going to generate a random number of destinations for our easily-configurable number of areas.

DECLARE @MaxInitialConnectionsPerArea INT = 5;

DECLARE @Base TABLE (
  AreaId INT PRIMARY KEY CLUSTERED
  , NumConnections INT
);

TRUNCATE TABLE dbo.Connection;

INSERT INTO @Base (AreaId, NumConnections)
SELECT
  a.AreaId
  , ROUND(
      RAND(CONVERT(VARBINARY, NEWID()))
        * (@MaxInitialConnectionsPerArea - 1) + 1
  , 0) AS NumPaths
FROM dbo.Area a;

 

WITH

ConnectionList (
  AreaId
  , NumConnections
  , Iterator
)
AS (
  SELECT
    b.AreaId
    , b.NumConnections
    , 1
  FROM
    @Base b

  UNION ALL

  SELECT
    b.AreaId
    , b.NumConnections
    , c.Iterator + 1
  FROM
    @Base b
    JOIN ConnectionList c
      ON b.AreaId = c.AreaId
      AND c.Iterator + 1 <= b.NumConnections
)

, ConnectionSource AS (
  SELECT
    cl.AreaId AS StartAreaId
    , ROUND(
        RAND(CONVERT(VARBINARY, NEWID())) * (@NumAreas - 1)
        , 0) + 1 AS EndAreaId
  FROM ConnectionList cl
)

INSERT INTO dbo.Connection
SELECT DISTINCT
  cs.StartAreaId
  , cs.EndAreaId
FROM ConnectionSource cs
WHERE cs.StartAreaId <> cs.EndAreaId
ORDER BY
  cs.StartAreaId
  , cs.EndAreaId;

That was a bit long, but necessarily so. First we created a @Base dataset that determined the number of connections to create for each area. Then we followed our pattern of iterating through the set of sets, after which we randomly assigned destination areas to each connection and stuffed them all into the Connection table.

Towards the end, two lines deserve special attention: the DISTINCT keyword in the final SELECT and the WHERE clause beneath it. These relate to business rule number 4 (only distinct connections can exist) and 3 (areas cannot connect to themselves). At this point, we’ve actually satisfied all the business rules except for 5 and 6. Nifty. But brace for impact – it’s about to get real up in here.

Orphan Areas

To find and eliminate orphan areas, we need to start with a complete list of areas and LEFT JOIN to a distinct list of destinations found in the Connection table. Any area that isn’t a destination is, by definition, an orphan – so after the LEFT JOIN, we check for NULL start area counts. Once we have a list of orphans, we can insert a single connection to each one – all in one convenient statement. Let’s see what it looks like:

INSERT INTO dbo.Connection (StartAreaId, EndAreaId)
SELECT
  ROUND(RAND(CONVERT(VARBINARY, NEWID())) * (@NumAreas - 1) + 1, 0)
  , s.EndAreaId
FROM (
    SELECT StartAreaId AS EndAreaId
    FROM dbo.Connection
    GROUP BY StartAreaId
  ) s
  LEFT JOIN (
    SELECT EndAreaId, COUNT(1) AS NumStartAreas
    FROM dbo.Connection
    GROUP BY EndAreaId
  ) e
  ON s.EndAreaId = e.EndAreaId
WHERE
  e.NumStartAreas IS NULL
ORDER BY
  s.EndAreaId;

 

It’s possible, though unlikely, that we’ll randomly generate a StartAreaId that’s the same as the orphan AreaId we’re trying to adopt. While this would be a violation of business rule 3, I’ve decided it’s reasonable to move on. You’re welcome to address this yourself if it keeps you awake tonight.
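For anyone it does keep awake: one possible patch (my sketch, not part of the original scripts) is to compute the random start once per orphan, then shift it by one – wrapping around – whenever it collides with the orphan itself:

```sql
-- Hypothetical variant of the orphan-adoption SELECT. "Orphans" stands in for
-- the derived tables above; the CASE guarantees StartAreaId <> EndAreaId.
SELECT
  CASE
    WHEN r.RandomStart = o.EndAreaId
    THEN r.RandomStart % @NumAreas + 1  -- collision: move to the next area
    ELSE r.RandomStart
  END AS StartAreaId
  , o.EndAreaId
FROM Orphans o
CROSS APPLY (
  SELECT ROUND(
      RAND(CONVERT(VARBINARY, NEWID())) * (@NumAreas - 1) + 1
    , 0) AS RandomStart
) r;
```

The modulo wrap works because a colliding start of @NumAreas becomes 1, and any smaller value simply becomes its neighbor – never itself.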

That takes care of business rule 5. If your eyes aren’t bleeding yet, let’s move on to business rule 6.

Two-Way Connections

To identify two-way connections, we can leverage two of our most woefully underutilized allies. INTERSECT and EXCEPT are syntactic wallflowers, typically relegated to the sidelines, forced to watch while UNION shamelessly tears up the dance floor. You don’t often see these unsung heroes of T-SQL theory, but today you get to use them both in one place. Weep for joy, friends, and behold.
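If these two haven’t crossed your path before, here’s a tiny standalone demonstration (my example, separate from the main script) using a three-row connection list:

```sql
-- INTERSECT keeps rows present in both sets; swapping it for EXCEPT keeps
-- rows from the first set that are missing from the second.
WITH C (StartAreaId, EndAreaId) AS (
  SELECT 1, 2 UNION ALL
  SELECT 2, 1 UNION ALL
  SELECT 1, 3
)
SELECT StartAreaId, EndAreaId FROM C
INTERSECT
SELECT EndAreaId, StartAreaId FROM C;
-- Returns (1, 2) and (2, 1): the 1 <-> 2 pair is two-way.
-- With EXCEPT instead, only (1, 3) survives: the lone one-way connection.
```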

DECLARE
  @TwoWayConnectionRatio INT = 10
  , @MinTwoWayConnections INT
  , @ExistingTwoWayConnections INT
  , @NewTwoWayConnections INT;

SELECT @ExistingTwoWayConnections = COUNT(1) FROM (
  SELECT
    StartAreaId
    , EndAreaId
  FROM dbo.Connection

  INTERSECT

  SELECT
    EndAreaId AS StartAreaId
    , StartAreaId AS EndAreaId
  FROM dbo.Connection
) q;

Let’s break there for a second. So far we’ve found the intersection between start-to-end connections and end-to-start connections. Said differently, if there are connections from Area 1 to Area 2 and from Area 2 to Area 1, this combination shows up in our INTERSECT query and counts as a two-way connection.

Moving on…

SELECT @MinTwoWayConnections = COUNT(1) * @TwoWayConnectionRatio / 100
FROM dbo.Connection;

SET @NewTwoWayConnections = @MinTwoWayConnections - @ExistingTwoWayConnections;

SET @NewTwoWayConnections =
  CASE
    WHEN @NewTwoWayConnections < 0
    THEN 0
    ELSE @NewTwoWayConnections
  END;

Now we know how many more (“new”) two-way connections are needed. It’s possible that we already have enough, so @NewTwoWayConnections could be negative. Since we’re going to use it in a TOP expression, we ensure that it’s non-negative.

One statement to go. Let’s use EXCEPT to find a list of one-way connections – basically the inverse of the INTERSECT we performed above. We’ll use ORDER BY NEWID to randomize the order of the one-way connections and SELECT TOP to limit the number of connections we’re about to create. Once we reverse their directionality (swap the start and end areas), we can INSERT the results into the Connection table.

INSERT INTO dbo.Connection (StartAreaId, EndAreaId)
SELECT TOP (@NewTwoWayConnections)
  EndAreaId
  , StartAreaId
FROM (
  SELECT
    StartAreaId
    , EndAreaId
  FROM dbo.Connection

  EXCEPT

  SELECT
    EndAreaId AS StartAreaId
    , StartAreaId AS EndAreaId
  FROM dbo.Connection
) q
ORDER BY
  NEWID();

With that, business rule 6 is satisfied. We’re essentially done. Let’s take some time to admire our handiwork.
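While we’re admiring, here are two optional sanity checks (my addition, not from the original post). The first may occasionally return a row, per the orphan-adoption caveat above; the second should always come back empty:

```sql
-- Rule 3: no area should connect to itself.
SELECT StartAreaId, EndAreaId
FROM dbo.Connection
WHERE StartAreaId = EndAreaId;

-- Rule 4: no duplicate connections.
SELECT StartAreaId, EndAreaId
FROM dbo.Connection
GROUP BY StartAreaId, EndAreaId
HAVING COUNT(1) > 1;
```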

How many connections did we create?

SELECT COUNT(1) AS Connections
FROM dbo.Connection;

I’m seeing close to 6000, which means that with the addition of our two-way connections, we’re averaging almost 6 connections for each of the 1000 areas we created.

How many destinations are there from each area?

SELECT
  c.StartAreaId
  , COUNT(1) AS Destinations
FROM dbo.Connection c
GROUP BY c.StartAreaId;

Looks good. Let’s find the true average number of connections from each area.

SELECT
  AVG(d.Destinations * 1.0) AS AverageDestinations
FROM (
  SELECT
    c.StartAreaId
    , COUNT(1) AS Destinations
  FROM dbo.Connection c
  GROUP BY c.StartAreaId
) d;

I’m multiplying by 1.0 to implicitly cast the destination count from INT to DECIMAL.

I show an average of 5.98 destinations, which is consistent with our estimate above.
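If the multiply-by-1.0 trick looks odd, here’s an isolated example (mine, not the post’s) of why it matters:

```sql
-- AVG over INT performs integer division and discards the fraction;
-- widening to DECIMAL first keeps it.
SELECT AVG(n) AS IntegerAvg, AVG(n * 1.0) AS DecimalAvg
FROM (SELECT 1 AS n UNION ALL SELECT 2) v;
-- IntegerAvg = 1; DecimalAvg = 1.500000
```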

Finally, let’s see what areas we can get to from Area 1.

SELECT c.EndAreaId AS DestinationAreaId
FROM dbo.Connection c
WHERE c.StartAreaId = 1;

I imagine those are the signposts on a massive freeway – exit here for Area 27 in Region E, exit in 1 mile for Area 408 in Region Z. Considering the size of our “map”, we could drive these roads forever.

Onward

We’ve successfully built the geography of a simple world, and I encourage you to play with the following variables:

  • @NumAreas – try making a really big world.
  • @MaxInitialConnectionsPerArea – make it easier or harder to move between areas.
  • @TwoWayConnectionRatio – determine the impact of high vs. low ratios.

The effects of those variables on our simulated world will be clearer next time when we talk about how to navigate our virtual spider web. We’ve essentially built ourselves a system of highways; now we just need to chart a course from point A to point B.

Before reading my next post, give it a try. If you’re in Area 1 and you need to get to Area 10, what’s the shortest path? Which areas would you pass through? How do we devise a path-finding algorithm that works reliably and efficiently? Finding a solution myself was a fun challenge, and once you read the next post, I’d be interested to know if you found a different way.

Written by Kevin Mintmier

March 7, 2013 at 8:47 am

Generating Data

leave a comment »

This is the second post in a 4-part series. All of the scripts in this series are designed to run in SQL Server (presumably 2005 forward) without the need for any non-system tables.

In my last post I discussed recursive CTEs and gave a few examples involving familiar calculations like factorials and Fibonacci numbers. Let’s take that a step further and build a platform for generating data.

Say you’re working on a demo, and you need some data that represents a group of people. These people were all asked to roll a die at least twice and no more than ten times. That seems like a pretty common scenario, right? Happens all the time? Work with me here. This is a fun exercise (read: good candidate) for a recursive CTE.

Let’s start by generating a list of ten people.

DECLARE
  @NumPeople INT = 10;

WITH People AS (
  SELECT 1 AS PersonId

  UNION ALL

  SELECT PersonId + 1
  FROM People
  WHERE PersonId < @NumPeople
)

SELECT PersonId
FROM People
OPTION (MAXRECURSION 0);

This code should look remarkably familiar (assuming you read the last post). Let’s move on and simulate that these folks all rolled a die between two and ten times.

 

DECLARE
  @NumPeople INT = 10
  , @MinRolls INT = 2
  , @MaxRolls INT = 10;

DECLARE @RollCount TABLE (
  PersonId INT
  , NumRolls INT
);

WITH People AS (
  SELECT 1 AS PersonId

  UNION ALL

  SELECT PersonId + 1
  FROM People
  WHERE PersonId < @NumPeople
)

INSERT INTO @RollCount
SELECT
  PersonId
  , ROUND(RAND() * (@MaxRolls - @MinRolls), 0) + @MinRolls AS NumRolls
FROM People
OPTION (MAXRECURSION 0);

SELECT *
FROM @RollCount;

Pop quiz time. What’s wrong with this query? (Hint: It isn’t visible in the script. It will be hard to identify if you aren’t following along in Management Studio.)

Answer: Apparently, the entire group of people rolled the die the exact same number of times. Isn’t RAND supposed to generate random numbers? Apparently it’s failing. Any thoughts on why?

This stumped me despite a series of web searches, and I finally stumbled upon the reason. RAND isn’t reseeded between calls in this context. This is a fairly rich topic on its own, but suffice it to say that after trying several solutions, I found this one to be the cleanest. Replace the definition of NumRolls as follows (formatted for readability):

  , ROUND(
      RAND(CONVERT(VARBINARY, NEWID())) * (@MaxRolls - @MinRolls)
     , 0) + @MinRolls AS NumRolls

Unlike RAND, NEWID returns a unique value in each of the recursive iterations, so we simply convert the UNIQUEIDENTIFIER to a compatible numeric type and use it to reseed RAND. Problem solved.
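To see the difference for yourself, here’s a quick standalone comparison (my example):

```sql
-- RAND() with no seed is evaluated once per query, so every row shows the
-- same value; seeding it with a fresh NEWID() forces a new value per row.
SELECT
  v.n
  , RAND() AS SameOnEveryRow
  , RAND(CONVERT(VARBINARY, NEWID())) AS DifferentOnEveryRow
FROM (SELECT 1 AS n UNION ALL SELECT 2 UNION ALL SELECT 3) v;
```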

So now we have a list of ten people and the number of times they each rolled the die. Let’s expand our result set so each person has the appropriate number of rows (one per die roll).

The same technique we used to generate the initial list (ten people) can be used to generate “sub-lists” – lists within lists. Inception-style.

If you’re following along in Management Studio (and I sincerely hope you are), this code should follow (or replace) the SELECT statement at the end of your current code block.

WITH

Roll (
  PersonId
  , NumRolls
  , RollNumber
)
AS (
  SELECT
    rc.PersonId
    , rc.NumRolls
    , 1
  FROM
    @RollCount rc

  UNION ALL

  SELECT
    rc.PersonId
    , rc.NumRolls
    , r.RollNumber + 1
  FROM
    @RollCount rc
    JOIN Roll r
      ON rc.PersonId = r.PersonId
      AND r.RollNumber + 1 <= rc.NumRolls
)

SELECT
  r.PersonId
  , r.NumRolls
  , r.RollNumber
FROM Roll r
ORDER BY
  r.PersonId
  , r.NumRolls
  , r.RollNumber;

Two points of interest here:

  1. Now we’re selecting from a source table and joining recursively to the CTE. Level up.
  2. If you exclude the ORDER BY clause at the end, you’ll see the natural, step-by-step behavior of the CTE in all its glory. Simply put, you’ll see the anchor list followed by each iteration of its “child” lists.

Finally, it’s time to roll the dice. We need to generate the distinct, random value of each roll. Let’s start by adding a variable to represent our standard, six-sided die.

DECLARE
  @NumPeople INT = 10
  , @MinRolls INT = 2
  , @MaxRolls INT = 10
  , @NumSides INT = 6;

And we’ll add a calculated column to our last select statement:

SELECT
  r.PersonId
  , r.NumRolls
  , r.RollNumber
  , ROUND(
      RAND(CONVERT(VARBINARY, NEWID())) * (@NumSides - 1)
      , 0) + 1 AS Roll
FROM Roll r
ORDER BY
  r.PersonId
  , r.NumRolls
  , r.RollNumber;

Problem solved; achievement unlocked. We can run this repeatedly and see a completely different set of data each time, and it always conforms to the business rules we established at the beginning.

In the next post I’ll use this technique to generate a similar data set that represents a nodal graph and its relationships. If you’re interested in things like path-finding and shortest-distance algorithms, stay tuned…we’re still only halfway to our destination.

Written by Kevin Mintmier

March 6, 2013 at 8:44 am