Creative Commons License
All Blog Articles, Data Models and Free Source Code by Simon Greener, The SpatialDB Advisor is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

TSQL String Tokenizer Function for SQL Server

Thursday April 12 2012 at 03:59

Keywords: string tokenizer separator delimiter

Summary

String tokenizer for SQL Server 2008 and Denali.

Update: The Tokenizer has been updated to use the analytic functions in Denali (SQL Server 2012) and to expose the separators in the output.

Applications often need a string tokenizer. I needed one for some TSQL development I am currently doing for a customer. I had previously written one for Oracle, so I decided to rewrite it for SQL Server 2008 (I believe it should also work in 2005).

The main limitations in the conversion are the lack of a hierarchical “CONNECT BY LEVEL” clause and the lack of a LEAD function in SQL Server 2008’s limited implementation of analytics. The function also depends on my generate_series() function, described in the article “generate_series for SQL Server 2008”.
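The listings in this article call dbo.generate_series(start, end, step) and read a column named IntValue from it; the original function is not reproduced here. For readers who only want to try the tokenizer, the following is an illustrative stand-in, not the original generate_series(). The parameter names, the ceiling of 1,000,000 values and the restriction to a positive step are assumptions made for this sketch, and it is not tuned for speed.

```sql
-- Hypothetical stand-in for dbo.generate_series(); NOT the original function.
-- Returns IntValue = @p_start, @p_start + @p_step, ... while <= @p_end.
-- Assumptions: @p_step > 0, at most 1,000,000 values, SQL Server 2008 or later.
CREATE FUNCTION dbo.generate_series(@p_start INT, @p_end INT, @p_step INT)
RETURNS TABLE
AS
RETURN
  WITH digits AS (
    SELECT v.n
      FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS v(n)
  ),
  numbers AS (
    -- Six-way cross join of ten digits: rows numbered 0 .. 999999.
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
      FROM digits a CROSS JOIN digits b CROSS JOIN digits c
           CROSS JOIN digits d CROSS JOIN digits e CROSS JOIN digits f
  )
  -- ROW_NUMBER() is BIGINT, so IntValue is returned as BIGINT.
  SELECT @p_start + x.n * @p_step AS IntValue
    FROM numbers x
   WHERE @p_step > 0
     AND @p_start + x.n * @p_step <= @p_end;
GO
```

An empty input string gives generate_series(1, 0, 1), which returns no rows under this sketch, so the tokenizer simply finds no separators.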

Still, with a little perseverance I came up with a working implementation.

Here it is.

    USE [GISDB]  -- You need to change this if you use this function.
    GO
    /*********************************************************************************
    ** @function    : Tokenizer
    ** @precis      : Splits any string into its tokens.
    ** @description : Supplied a string and a list of separators this function
    **                returns resultant tokens as a table collection.
    ** @example     : SELECT t.token
    **                  FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;
    ** @param       : p_string. The string to be Tokenized.
    ** @param       : p_separators. The characters that are used to split the string.
    ** @depend      : dbo.generate_series()
    ** @history     : Pawel Barut, http://pbarut.blogspot.com/2007/03/yet-another-tokenizer-in-oracle.html
    ** @history     : Simon Greener - Jul 2006 - Original coding (extended SQL)
    ** @history     : Simon Greener - Aug 2008 - Converted to SQL Server 2008
    **/
    DROP FUNCTION Tokenizer;
    GO
    CREATE FUNCTION Tokenizer(@p_string     VARCHAR(MAX),
                              @p_separators VARCHAR(254))
      RETURNS @varchar_table TABLE
     (
       token VARCHAR(MAX)
      )
    AS
    BEGIN
      BEGIN
          -- MyCTE holds one row per boundary: every position in the string that
          -- contains a separator, plus position 0 and one position past the end.
          WITH MyCTE AS (
          SELECT c.beg,
                 c.fullstring,
                 ROW_NUMBER() OVER(ORDER BY c.beg ASC) RowVersion
            FROM (SELECT b.beg, b.fullstring
                    FROM (SELECT a.beg, @p_string AS fullstring
                            FROM (SELECT c.IntValue AS beg
                                    FROM dbo.generate_series(1,DATALENGTH(@p_string),1) c
                                  ) a
                         ) b,
                         (SELECT SUBSTRING(@p_separators,d.IntValue,1) AS delim
                            FROM dbo.generate_series(1,DATALENGTH(@p_separators),1) d
                         ) c
                   WHERE CHARINDEX(c.delim,SUBSTRING(b.fullstring,b.beg,1)) > 0
                   UNION ALL SELECT 0 AS beg, @p_string AS fullstring
                   UNION ALL SELECT DATALENGTH(@p_string)+1 AS beg, @p_string AS fullstring
                 ) c
          )
          INSERT INTO @varchar_table
          SELECT SUBSTRING(d.fullstring, (d.beg + 1), (d.end_p - d.beg - 1) ) token
            -- The self-join pairs each boundary with the next one (a substitute for LEAD).
            FROM (SELECT BASE.beg,
                         LEAD.beg end_p,
                         BASE.fullstring
                    FROM MyCTE BASE LEFT JOIN MyCTE LEAD ON BASE.RowVersion = LEAD.RowVersion-1
                 ) d
           WHERE d.end_p IS NOT NULL
             AND d.end_p > d.beg + 1;  -- skip empty tokens between adjacent separators
          RETURN;
      END;
    END
    GO
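The self-join on RowVersion is what stands in for LEAD: each boundary row is matched with the row that follows it, and the token is whatever lies between the two positions. A minimal sketch of just that pairing, using hand-written boundary positions instead of generate_series (the variable @s, the CTE name bounds and the sample string are made up for illustration):

```sql
-- Boundaries for 'ab,cd;e' with separators ',' and ';':
--   0 (virtual start), 3 (','), 6 (';'), 8 (one past the end).
DECLARE @s VARCHAR(20) = 'ab,cd;e';

WITH bounds AS (
  SELECT v.beg,
         ROW_NUMBER() OVER (ORDER BY v.beg) AS rn
    FROM (VALUES (0),(3),(6),(8)) AS v(beg)
)
SELECT SUBSTRING(@s, b.beg + 1, n.beg - b.beg - 1) AS token
  FROM bounds b
       INNER JOIN bounds n          -- n is the "next" boundary after b
          ON n.rn = b.rn + 1
 ORDER BY b.beg;
-- tokens: ab, cd, e
```

The INNER JOIN here plays the role of the LEFT JOIN plus "end_p IS NOT NULL" filter in the function: the last boundary has no successor and so produces no token.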

Here are my simple tests.

    SELECT DISTINCT t.token
      FROM dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') AS t;

Result.

token
LineString
MultiLineString
MultiPoint
MultiPolygon
Point
Polygon

    SELECT t.token
      FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;

Result.

token
The
rain
in
spain
stays
mainly
on
the
plain

Now, if you want to collect them back into a single string, here’s an example of what you can do.

    -- Note: xml data type methods are case-sensitive, so it must be .value(), not .VALUE().
    SELECT (STUFF((SELECT DISTINCT ':' + a.gtype
                     FROM ( SELECT DISTINCT t.token AS gtype
                              FROM dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') AS t
                           ) a
                    ORDER BY ':' + a.gtype
                    FOR XML PATH(''), TYPE, ROOT).value('root[1]','nvarchar(max)'),1,1,'')
            ) AS GeometryTypes;

Result.

GeometryTypes
LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Polygon
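
The STUFF / FOR XML PATH idiom is what SQL Server 2008 and 2012 offer for this. On SQL Server 2017 or later, which postdates this article, the built-in STRING_AGG aggregate can do the same job more directly. A sketch under that assumption:

```sql
-- Assumes SQL Server 2017 or later, where STRING_AGG is available.
-- STRING_AGG has no DISTINCT option, so duplicates are removed
-- in the derived table before aggregating.
SELECT STRING_AGG(a.gtype, ':') WITHIN GROUP (ORDER BY a.gtype) AS GeometryTypes
  FROM (SELECT DISTINCT t.token AS gtype
          FROM dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') AS t
       ) AS a;
```

Because token is VARCHAR(MAX), the aggregated result is also a MAX type and is not limited to 8000 bytes.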

Upgraded Version for Denali

    -- @history : Simon Greener - Apr 2012 - Extended to include returning of tokens
    -- Connect to database holding generate_series
    --
    USE [GISDB] -- Change to your database
    GO
    -- Drop function if exists
    --
    IF EXISTS (SELECT *
                 FROM dbo.sysobjects
                WHERE id = object_id (N'[dbo].[Tokenizer]')
                  AND OBJECTPROPERTY(id, N'IsTableFunction') = 1)
    DROP FUNCTION [dbo].[Tokenizer]
    GO
    --
    -- Now let's create it
    --
    /*********************************************************************************
    ** @function    : Tokenizer
    ** @precis      : Splits any string into its tokens.
    ** @description : Supplied a string and a list of separators this function
    **                returns resultant tokens as a table collection.
    ** @example     : SELECT t.token
    **                  FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;
    ** @param       : p_string. The string to be Tokenized.
    ** @param       : p_separators. The characters that are used to split the string.
    ** @history     : Pawel Barut, http://pbarut.blogspot.com/2007/03/yet-another-tokenizer-in-oracle.html
    ** @history     : Simon Greener - Jul 2006 - Original coding (extended SQL)
    ** @history     : Simon Greener - Aug 2008 - Converted to SQL Server 2008
    ** @history     : Simon Greener - Aug 2012 - Converted to SQL Server 2012 and return separators
    **/
    CREATE FUNCTION [dbo].[Tokenizer](@p_string     VARCHAR(MAX),
                                      @p_separators VARCHAR(254))
      RETURNS @varchar_table TABLE
     (
       id        INT,
       token     VARCHAR(MAX),
       separator VARCHAR(MAX)
      )
    AS
    BEGIN
      BEGIN
        WITH MyCTE AS (
          SELECT c.beg, c.sep, ROW_NUMBER() OVER(ORDER BY c.beg ASC) AS rid
            FROM (SELECT b.beg, c.sep
                    FROM (SELECT a.beg
                            FROM (SELECT c.IntValue AS beg
                                    FROM dbo.generate_series(1,DATALENGTH(@p_string),1) c
                                  ) a
                          ) b,
                          (SELECT SUBSTRING(@p_separators,d.IntValue,1) AS sep
                              FROM dbo.generate_series(1,DATALENGTH(@p_separators),1) d
                            ) c
                    WHERE CHARINDEX(c.sep,SUBSTRING(@p_string,b.beg,1)) > 0
                  UNION ALL SELECT 0 AS beg, CAST(NULL AS VARCHAR) AS sep
                  -- End boundary, as in the 2008 version. Without it a final token
                  -- that is not followed by a separator (e.g. 'b' in 'a:b') is lost.
                  UNION ALL SELECT DATALENGTH(@p_string)+1 AS beg, CAST(NULL AS VARCHAR) AS sep
                 ) c
        )
        INSERT INTO @varchar_table
        SELECT ROW_NUMBER() OVER (ORDER BY a.rid ASC) AS Id,
               CASE WHEN DataLength(a.token) = 0 THEN NULL ELSE a.token END AS token,
               a.sep
          FROM (SELECT d.rid,
                       -- LEAD fetches the next boundary, replacing the 2008 self-join.
                       SUBSTRING(@p_string, (d.beg + 1), (Lead(d.beg,1) OVER (ORDER BY d.rid ASC) - d.beg - 1) ) AS token,
                       Lead(d.sep,1) OVER (ORDER BY d.rid ASC) AS sep
                  FROM MyCTE d
               ) AS a
         WHERE DataLength(a.token) <> 0 OR DataLength(a.sep) <> 0;
        RETURN;
      END;
    END
    GO
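
To see what LEAD contributes in isolation, here is a small sketch on hand-written boundary positions, including one just past the end of the string. The variable @s, the sample string and the boundary list are made up for illustration; LEAD itself requires SQL Server 2012 or later.

```sql
-- Boundaries for 'ab,cd;e': 0 (virtual start), 3 (','), 6 (';'), 8 (one past the end).
DECLARE @s VARCHAR(20) = 'ab,cd;e';

SELECT x.token, x.sep
  FROM (SELECT v.beg,
               -- Token = text between this boundary and the next one.
               SUBSTRING(@s, v.beg + 1,
                         LEAD(v.beg) OVER (ORDER BY v.beg) - v.beg - 1) AS token,
               -- Separator = the character sitting at the next boundary.
               LEAD(v.sep) OVER (ORDER BY v.beg) AS sep
          FROM (VALUES (0, CAST(NULL AS VARCHAR(1))),
                       (3, ','),
                       (6, ';'),
                       (8, CAST(NULL AS VARCHAR(1)))) AS v(beg, sep)
       ) AS x
 WHERE DATALENGTH(x.token) <> 0 OR DATALENGTH(x.sep) <> 0
 ORDER BY x.beg;
-- rows: ('ab', ','), ('cd', ';'), ('e', NULL)
```

The last boundary has no successor, so both of its LEAD values are NULL and the WHERE clause drops that row.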

Testing

    SELECT DISTINCT t.token
      FROM dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') AS t;

Results

token
LineString
MultiLineString
MultiPoint
MultiPolygon
Point
Polygon

The classic “Rain in Spain…”.

    SELECT t.*
      FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;

Results

id  token   separator
1   The     {SPACE}
2   rain    {SPACE}
3   in      {SPACE}
4   spain   ,
5   NULL    {SPACE}
6   stays   {SPACE}
7   mainly  {SPACE}
8   on      {SPACE}
9   the     {SPACE}
10  plain   .
11  NULL    !
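
The id column makes the token order explicit, so a particular token can be picked out reliably. Rows with a NULL token also receive an id, so the sketch below renumbers only the real tokens before choosing one. It assumes the upgraded function above is installed; the alias word_no is made up for illustration.

```sql
-- Pick the second word, ignoring the NULL-token rows
-- produced by adjacent separators.
SELECT w.token
  FROM (SELECT t.token,
               ROW_NUMBER() OVER (ORDER BY t.id) AS word_no
          FROM dbo.Tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') AS t
         WHERE t.token IS NOT NULL
       ) AS w
 WHERE w.word_no = 2;
```

Given the results listed above, this should return the single token rain.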

Now, let’s process a POLYGON WKT.

    SELECT t.id, t.token, t.separator
      FROM dbo.tokenizer('POLYGON((2300 400, 2300 700, 2800 1100, 2300 1100, 1800 1100, 2300 400), (2300 1000, 2400  900, 2200 900, 2300 1000))',' ,()') AS t;

Results

id  token    separator
1   POLYGON  (
2   NULL     (
3   2300     {SPACE}
4   400      ,
5   NULL     {SPACE}
6   2300     {SPACE}
7   700      ,
8   NULL     {SPACE}
9   2800     {SPACE}
10  1100     ,
11  NULL     {SPACE}
12  2300     {SPACE}
13  1100     ,
14  NULL     {SPACE}
15  1800     {SPACE}
16  1100     ,
17  NULL     {SPACE}
18  2300     {SPACE}
19  400      )
20  NULL     ,
21  NULL     {SPACE}
22  NULL     (
23  2300     {SPACE}
24  1000     ,
25  NULL     {SPACE}
26  2400     {SPACE}
27  NULL     {SPACE}
28  900      ,
29  NULL     {SPACE}
30  2200     {SPACE}
31  900      ,
32  NULL     {SPACE}
33  2300     {SPACE}
34  1000     )
35  NULL     )

This time don’t include the space as a separator.

    SELECT t.id, t.token, t.separator
      FROM dbo.tokenizer('POLYGON((2300 400, 2300 700, 2800 1100, 2300 1100, 1800 1100, 2300 400), (2300 1000, 2400  900, 2200 900, 2300 1000))',',()') AS t;

Results

id  token      separator
1   POLYGON    (
2   NULL       (
3   2300 400   ,
4   2300 700   ,
5   2800 1100  ,
6   2300 1100  ,
7   1800 1100  ,
8   2300 400   )
9   NULL       ,
10  {SPACE}    (
11  2300 1000  ,
12  2400 900   ,
13  2200 900   ,
14  2300 1000  )
15  NULL       )

I hope that someone out there finds this useful.


Comment [8]

Simon,

Thanks for the article. I’ve had several occasions when I thought it would be nice to have a string tokenizer, but I’ve been too lazy to look for a script for one for SQL Server or roll my own.

For a group concatenation function for strings I have one of those .NET custom aggregate functions, mostly because I can never remember the XML path syntax and it’s a bit easier to write.

The disadvantage of that is .NET by default is disabled in SQL Server so have to enable it in surface area etc. and getting an admin for a customer to do this is oh so frustrating – I often wait for it to be escalated to some administrator who has a clue which sometimes takes a week for a 1 minute configuration change.

It would have been really nice if SQL Server 2008 allowed defining aggregates in T-SQL similar to what PostgreSQL allows with sql/plpgsql.

By the way I think your code would work fine in SQL Server 2005 too. Will have to give it a try.

Regina · 29 August 2009, 14:23 · #

Simon

Just a small note that all your @ symbols have been removed from your variables in the function.

— James · 30 August 2009, 23:17 · #

James,

Thanks for letting me know: the article should, now, be fixed.

regards
Simon

Simon · 30 August 2009, 23:56 · #

Nice. However, this would have been even better if it returned some kind of order for the tokens. Let’s say I want to reliably identify the second token in the string… how do I do that? As far as I know, an SQL table is unordered by definition.

— Darren · 21 January 2011, 17:33 · #

Yes, relational theory says that a relation (table), is not ordered. To order you have to include an ORDER BY clause in the SQL.

So, if you pulled the SQL out of the function, removed the INSERT, and instead of:

    Select SUBSTRING(d.fullstring, (d.beg + 1), (d.end_p - d.beg - 1) ) token

you put

    Select d.beg, SUBSTRING(d.fullstring, (d.beg + 1), (d.end_p - d.beg - 1) ) token

you will see the ordering is preserved.

If not, just add:

    Order by d.beg

at the end. That is:

    Select SUBSTRING(d.fullstring, (d.beg + 1), (d.end_p - d.beg - 1) ) token
      From (Select BASE.beg,
                   LEAD.beg end_p,
                   BASE.fullstring
              From MyCTE BASE
                   LEFT JOIN MyCTE LEAD
                   ON BASE.RowVersion = LEAD.RowVersion-1
           ) d
     Where d.end_p Is Not Null
       And d.end_p > d.beg + 1
     Order by d.beg;

Simon

Simon Greener · 22 January 2011, 02:37 · #

Great tool Simon!
Could you kindly advise how one would implement the updated function without the LEAD function, i.e. for SQL Server 2008?

— Leon Xavier · 27 May 2012, 20:46 · #

Leon,
The first Tokenizer in the article works for 2008. The last one for 2012. If you want the updated capability (ie the return of the tokens and the separators) please have a go yourself and if you can’t get it to work then contact me directly (simon at spatialdbadvisor dot com) with what you have done and I will have a look.
Simon

— Simon Greener · 27 May 2012, 23:57 · #

This works for me with SQL Server 2008 and all “features” (sorry but I did not figure out how to format this…)

<blockquote>USE [GISDB] -- You need to change this if you use this function.
GO
/*********************************************************************************
** @function    : Tokenizer
** @precis      : Splits any string into its tokens.
** @description : Supplied a string and a list of separators this function
**                returns resultant tokens as a table collection.
** @example     : SELECT t.token
**                  FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;
** @param       : p_string. The string to be Tokenized.
** @param       : p_separators. The characters that are used to split the string.
** @depend      : dbo.generate_series()
** @history     : Pawel Barut, http://pbarut.blogspot.com/2007/03/yet-another-tokenizer-in-oracle.html
** @history     : Simon Greener - Jul 2006 - Original coding (extended SQL)
** @history     : Simon Greener - Aug 2008 - Converted to SQL Server 2008
** @history     : Simon Greener - Aug 2012 - Converted to SQL Server 2012 and return separators
**/
-- Drop function if exists
--
IF EXISTS (SELECT *
             FROM dbo.sysobjects
            WHERE id = object_id (N'[dbo].[Tokenizer]')
              AND OBJECTPROPERTY(id, N'IsTableFunction') = 1)
DROP FUNCTION [dbo].[Tokenizer]
GO
CREATE FUNCTION dbo.Tokenizer(@p_string     VARCHAR(MAX),
                              @p_separators VARCHAR(254))
  RETURNS @varchar_table TABLE
  (
    id        INT,
    token     VARCHAR(MAX),
    separator VARCHAR(MAX)
  )
AS
BEGIN
  BEGIN
    WITH MyCTE AS (
      SELECT c.beg, c.sep, ROW_NUMBER() OVER(ORDER BY c.beg ASC) AS rid
        FROM (SELECT b.beg, c.sep
                FROM (SELECT a.beg
                        FROM (SELECT c.IntValue AS beg
                                FROM dbo.generate_series(1,DATALENGTH(@p_string),1) c
                             ) a
                     ) b,
                     (SELECT SUBSTRING(@p_separators,d.IntValue,1) AS sep
                        FROM dbo.generate_series(1,DATALENGTH(@p_separators),1) d
                     ) c
               WHERE CHARINDEX(c.sep,SUBSTRING(@p_string,b.beg,1)) > 0
              UNION ALL SELECT 0 AS beg, CAST(NULL AS VARCHAR) AS sep
             ) c
    )
    INSERT INTO @varchar_table
    SELECT ROW_NUMBER() OVER (ORDER BY d.rid ASC) AS Id,
           CASE WHEN DataLength(SUBSTRING(d.fullstring, (d.beg + 1), (d.end_p - d.beg - 1) )) = 0
                THEN NULL
                ELSE SUBSTRING(d.fullstring, (d.beg + 1), (d.end_p - d.beg - 1) )
            END AS token,
           d.sepstr AS separator
      FROM (SELECT BASE.rid,
                   BASE.beg,
                   LEAD.beg end_p,
                   @p_string AS fullstring,
                   LEAD.sep AS sepstr
              FROM MyCTE BASE LEFT JOIN MyCTE LEAD ON BASE.rid = LEAD.rid-1
           ) d
     WHERE d.end_p IS NOT NULL
       AND d.end_p > d.beg;
    RETURN;
  END;
END
GO</blockquote>

— Dieter Hofrichter · 2 August 2012, 10:10 · #