
Creative Commons License
All Blog Articles, Data Models and Free Source Code by Simon Greener, The SpatialDB Advisor is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

TSQL String Tokenizer Function for SQL Server

Thursday April 12 2012 at 03:59

Keywords: string tokenizer, separator, delimiter

Summary

String tokenizer for SQL Server 2008 and Denali.

Update: Tokenizer has been updated for Denali analytic features and to expose the separators in the output.

We often have need of string tokenizers in applications. I had need of one in some TSQL development I am currently doing for a customer. I had previously written one for Oracle so I decided to re-write it for SQL Server 2008 (I believe it should work in 2005).

The main limitations in the conversion are the lack of a hierarchical “CONNECT BY LEVEL” clause and, in SQL Server 2008’s limited implementation of analytics, the lack of a LEAD function. Also, the function depends on my generate_series() function, as described in this article.
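
For readers who just want to follow along, here is a minimal sketch of what such a generate_series table function might look like. This is an assumption about its interface (an IntValue column from start to end by step), not the implementation from the linked article, and a recursive CTE is subject to the default 100-level MAXRECURSION limit, so use the real implementation for long strings.

```sql
-- Hypothetical minimal dbo.generate_series (assumed interface only;
-- see the linked article for the real implementation).
-- Returns IntValue from @p_start to @p_end in steps of @p_step.
CREATE FUNCTION dbo.generate_series(@p_start INT, @p_end INT, @p_step INT)
RETURNS TABLE
AS
RETURN
  WITH series(IntValue) AS (
    SELECT @p_start
    UNION ALL
    SELECT IntValue + @p_step FROM series WHERE IntValue + @p_step <= @p_end
  )
  SELECT IntValue FROM series;
GO
```

When calling it from a plain query you can lift the recursion cap with OPTION (MAXRECURSION 0), but that hint cannot be embedded in the function itself.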

Still, with a little perseverance I came up with a working implementation.

Here it is.

USE [GISDB]  -- You need to change this if you use this function.
GO
/*********************************************************************************
** @function    : Tokenizer
** @precis      : Splits any string into its tokens.
** @description : Supplied a string and a list of separators this function
**                returns resultant tokens as a table collection.
** @example     : SELECT t.token
**                  FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;
** @param       : p_string. The string to be Tokenized.
** @param       : p_separators. The characters that are used to split the string.
** @depend      : dbo.generate_series()
** @history     : Pawel Barut, http://pbarut.blogspot.com/2007/03/yet-another-tokenizer-in-oracle.html
** @history     : Simon Greener - Jul 2006 - Original coding (extended SQL)
** @history     : Simon Greener - Aug 2008 - Converted to SQL Server 2008
**/
DROP FUNCTION Tokenizer;
GO
CREATE FUNCTION Tokenizer(@p_string     VARCHAR(MAX),
                          @p_separators VARCHAR(254))
  RETURNS @varchar_table TABLE
 (
   token VARCHAR(MAX)
  )
AS
BEGIN
  BEGIN
      WITH myCte AS (
      SELECT c.beg,
             c.fullstring,
             ROW_NUMBER() OVER(ORDER BY c.beg ASC) RowVersion
        FROM (SELECT b.beg, b.fullstring
                FROM (SELECT a.beg, @p_string AS fullstring
                        FROM (SELECT c.IntValue AS beg
                                FROM dbo.generate_series(1,DATALENGTH(@p_string),1) c
                              ) a
                     ) b,
                     (SELECT SUBSTRING(@p_separators,d.IntValue,1) AS delim
                        FROM dbo.generate_series(1,DATALENGTH(@p_separators),1) d
                     ) c
               WHERE CHARINDEX(c.delim,SUBSTRING(b.fullstring,b.beg,1)) > 0
               UNION ALL SELECT 0 AS beg, @p_string AS fullstring
               UNION ALL SELECT DATALENGTH(@p_string)+1 AS beg, @p_string AS fullstring
             ) c
      )
      INSERT INTO @varchar_table
      SELECT SUBSTRING(d.fullstring, (d.beg + 1), (d.end_p - d.beg - 1) ) token
        FROM (SELECT BASE.beg,
                     LEAD.beg end_p,
                     BASE.fullstring
                FROM MyCTE BASE LEFT JOIN MyCTE LEAD ON BASE.RowVersion = LEAD.RowVersion-1
             ) d
       WHERE d.end_p IS NOT NULL
         AND d.end_p > d.beg + 1;
      RETURN;
  END;
END
GO

Here are my simple tests.

SELECT DISTINCT t.token
  FROM dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') AS t;

Result.

token
LineString
MultiLineString
MultiPoint
MultiPolygon
Point
Polygon
SELECT t.token
  FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;

Result.

token
The
rain
in
spain
stays
mainly
on
the
plain

Now, if you want to collect them back into a single string, here’s an example of what you can do.

SELECT (STUFF((SELECT DISTINCT ':' + a.gtype
                 FROM ( SELECT DISTINCT t.token AS gtype
                          FROM dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') AS t
                      ) a
                ORDER BY ':' + a.gtype
                FOR XML PATH(''), TYPE, ROOT).value('root[1]','nvarchar(max)'),1,1,'')
       ) AS GeometryTypes;

Result.

GeometryTypes
LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Polygon
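
As an aside, on SQL Server 2017 and later (well after this article was written) the STUFF/FOR XML PATH idiom can be replaced by STRING_AGG. A sketch, assuming that version is available:

```sql
-- SQL Server 2017+ only: STRING_AGG does the group concatenation directly.
SELECT STRING_AGG(a.gtype, ':') WITHIN GROUP (ORDER BY a.gtype) AS GeometryTypes
  FROM (SELECT DISTINCT t.token AS gtype
          FROM dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') AS t
       ) a;
```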

Upgraded Version for Denali

-- Connect to database holding generate_series
--
USE [GISDB] -- Change to your database
GO
-- Drop function if exists
--
IF EXISTS (SELECT *
             FROM dbo.sysobjects
            WHERE id = object_id (N'[dbo].[Tokenizer]')
              AND OBJECTPROPERTY(id, N'IsTableFunction') = 1)
DROP FUNCTION [dbo].[Tokenizer]
GO
--
-- Now let's create it
--
/*********************************************************************************
** @function    : Tokenizer
** @precis      : Splits any string into its tokens.
** @description : Supplied a string and a list of separators this function
**                returns resultant tokens as a table collection.
** @example     : SELECT t.token
**                  FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;
** @param       : p_string. The string to be Tokenized.
** @param       : p_separators. The characters that are used to split the string.
** @history     : Pawel Barut, http://pbarut.blogspot.com/2007/03/yet-another-tokenizer-in-oracle.html
** @history     : Simon Greener - Jul 2006 - Original coding (extended SQL)
** @history     : Simon Greener - Aug 2008 - Converted to SQL Server 2008
** @history     : Simon Greener - Apr 2012 - Extended to include returning of separators
** @history     : Simon Greener - Aug 2012 - Converted to SQL Server 2012 and return separators
**/
CREATE FUNCTION [dbo].[Tokenizer](@p_string     VARCHAR(MAX),
                                  @p_separators VARCHAR(254))
  RETURNS @varchar_table TABLE
 (
   id        INT,
   token     VARCHAR(MAX),
   separator VARCHAR(MAX)
  )
AS
BEGIN
  BEGIN
    WITH MyCTE AS (
      SELECT c.beg, c.sep, ROW_NUMBER() OVER(ORDER BY c.beg ASC) AS rid
        FROM (SELECT b.beg, c.sep
                FROM (SELECT a.beg
                        FROM (SELECT c.IntValue AS beg
                                FROM dbo.generate_series(1,DATALENGTH(@p_string),1) c
                              ) a
                      ) b,
                      (SELECT SUBSTRING(@p_separators,d.IntValue,1) AS sep
                          FROM dbo.generate_series(1,DATALENGTH(@p_separators),1) d
                        ) c
                WHERE CHARINDEX(c.sep,SUBSTRING(@p_string,b.beg,1)) > 0
              UNION ALL SELECT 0 AS beg, CAST(NULL AS VARCHAR) AS sep
             ) c
    )
    INSERT INTO @varchar_table
    SELECT ROW_NUMBER() OVER (ORDER BY a.rid ASC) AS Id,
           CASE WHEN DataLength(a.token) = 0 THEN NULL ELSE a.token END AS token,
           a.sep
      FROM (SELECT d.rid,
                   SUBSTRING(@p_string, (d.beg + 1), (Lead(d.beg,1) OVER (ORDER BY d.rid ASC) - d.beg - 1) ) AS token,
                   Lead(d.sep,1) OVER (ORDER BY d.rid ASC) AS sep
              FROM MyCTE d
           ) AS a
     WHERE DataLength(a.token) <> 0 OR DataLength(a.sep) <> 0;
    RETURN;
  END;
END
GO

Testing

SELECT DISTINCT t.token
  FROM dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') AS t;

Results

token
LineString
MultiLineString
MultiPoint
MultiPolygon
Point
Polygon

The classic “Rain in Spain…”.

SELECT t.*
  FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;

Results

id token separator
1 The {SPACE}
2 rain {SPACE}
3 in {SPACE}
4 spain ,
5 NULL {SPACE}
6 stays {SPACE}
7 mainly {SPACE}
8 on {SPACE}
9 the {SPACE}
10 plain .
11 NULL !
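
As a sanity check, the same token-and-separator semantics can be sketched in a few lines of Python. This mirrors the function's output columns (id, token, separator); it is not part of the T-SQL implementation.

```python
import re

def tokenize(s, seps):
    """Split s on any single character in seps, returning
    (id, token, separator) rows; empty tokens become None,
    matching the NULL rows in the T-SQL output."""
    # Capturing group keeps each separator character in the split result.
    parts = re.split('([' + re.escape(seps) + '])', s)
    rows, rid, i = [], 0, 0
    while i < len(parts):
        token = parts[i] if parts[i] != '' else None
        sep = parts[i + 1] if i + 1 < len(parts) else None
        if token is not None or sep is not None:
            rid += 1
            rows.append((rid, token, sep))
        i += 2
    return rows
```

Running it on the “rain in Spain” example reproduces the eleven rows above, including the NULL tokens between adjacent separators.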

Now, let’s process a POLYGON WKT.

SELECT t.id, t.token, t.separator
  FROM dbo.tokenizer('POLYGON((2300 400, 2300 700, 2800 1100, 2300 1100, 1800 1100, 2300 400), (2300 1000, 2400  900, 2200 900, 2300 1000))',' ,()') AS t;

Results

id token separator
1 POLYGON (
2 NULL (
3 2300 {SPACE}
4 400 ,
5 NULL {SPACE}
6 2300 {SPACE}
7 700 ,
8 NULL {SPACE}
9 2800 {SPACE}
10 1100 ,
11 NULL {SPACE}
12 2300 {SPACE}
13 1100 ,
14 NULL {SPACE}
15 1800 {SPACE}
16 1100 ,
17 NULL {SPACE}
18 2300 {SPACE}
19 400 )
20 NULL ,
21 NULL {SPACE}
22 NULL (
23 2300 {SPACE}
24 1000 ,
25 NULL {SPACE}
26 2400 {SPACE}
27 NULL {SPACE}
28 900 ,
29 NULL {SPACE}
30 2200 {SPACE}
31 900 ,
32 NULL {SPACE}
33 2300 {SPACE}
34 1000 )
35 NULL )

This time don’t include the space as a separator.

SELECT t.id, t.token, t.separator
  FROM dbo.tokenizer('POLYGON((2300 400, 2300 700, 2800 1100, 2300 1100, 1800 1100, 2300 400), (2300 1000, 2400  900, 2200 900, 2300 1000))',',()') AS t;

Results

id token separator
1 POLYGON (
2 NULL (
3 2300 400 ,
4 2300 700 ,
5 2800 1100 ,
6 2300 1100 ,
7 1800 1100 ,
8 2300 400 )
9 NULL ,
10 {SPACE} (
11 2300 1000 ,
12 2400 900 ,
13 2200 900 ,
14 2300 1000 )
15 NULL )

I hope that someone out there finds this useful.


Comment [8]

Simon,

Thanks for the article. I think I’ve had several occasions when I thought it would be nice to have a string tokenizer, but I’ve been too lazy to look for a script for one for SQL Server or to roll my own.

For a group concatenation function for strings I have one of those .NET custom aggregate functions, mostly because I can never remember the XML path syntax and it’s a bit easier to write.

The disadvantage of that is that .NET is disabled by default in SQL Server, so you have to enable it in the surface area configuration, and getting an admin for a customer to do this is oh so frustrating: I often wait for it to be escalated to some administrator who has a clue, which sometimes takes a week for a one-minute configuration change.

It would have been really nice if SQL Server 2008 allowed defining aggregates in T-SQL similar to what PostgreSQL allows with sql/plpgsql.

By the way I think your code would work fine in SQL Server 2005 too. Will have to give it a try.

Regina · 29 August 2009, 14:23 · #

Simon

Just a small note that all your @ symbols have been removed from your variables in the function.

— James · 30 August 2009, 23:17 · #

James,

Thanks for letting me know: the article should now be fixed.

regards
Simon

Simon · 30 August 2009, 23:56 · #

Nice. However, this would have been even better if it returned some kind of order for the tokens. Let’s say I want to reliably identify the second token in the string… how do I do that? As far as I know, an SQL table is unordered by definition.

— Darren · 21 January 2011, 17:33 · #

Yes, relational theory says that a relation (table) is not ordered. To order the output you have to include an ORDER BY clause in the SQL.

So, if you pulled the SQL out of the function, removed the INSERT, and instead of:

SELECT SUBSTRING(d.fullstring, (d.beg + 1), (d.end_p - d.beg - 1)) token

you put

SELECT d.beg, SUBSTRING(d.fullstring, (d.beg + 1), (d.end_p - d.beg - 1)) token

you will see the ordering is preserved.

If not, just add:

Order by d.beg

at the end. That is:

SELECT SUBSTRING(d.fullstring, (d.beg + 1), (d.end_p - d.beg - 1)) token
  FROM (SELECT BASE.beg,
               LEAD.beg end_p,
               BASE.fullstring
          FROM MyCTE BASE
          LEFT JOIN MyCTE LEAD
            ON BASE.RowVersion = LEAD.RowVersion - 1
       ) d
 WHERE d.end_p IS NOT NULL
   AND d.end_p > d.beg + 1
 ORDER BY d.beg;

Simon

Simon Greener · 22 January 2011, 02:37 · #

Great tool Simon!
Could you kindly assist on how one would implement the updated function without the LEAD function, i.e. for SQL Server 2008?

— Leon Xavier · 27 May 2012, 20:46 · #

Leon,
The first Tokenizer in the article works for 2008. The last one is for 2012. If you want the updated capability (i.e. the return of the tokens and the separators), please have a go yourself and, if you can’t get it to work, then contact me directly (simon at spatialdbadvisor dot com) with what you have done and I will have a look.
Simon

— Simon Greener · 27 May 2012, 23:57 · #

This works for me with SQL Server 2008 and all “features” (sorry but I did not figure out how to format this…)

<blockquote>USE [GISDB] -- You need to change this if you use this function.
GO
/*********************************************************************************
** @function    : Tokenizer
** @precis      : Splits any string into its tokens.
** @description : Supplied a string and a list of separators this function
**                returns resultant tokens as a table collection.
** @example     : SELECT t.token
**                  FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;
** @param       : p_string. The string to be Tokenized.
** @param       : p_separators. The characters that are used to split the string.
** @depend      : dbo.generate_series()
** @history     : Pawel Barut, http://pbarut.blogspot.com/2007/03/yet-another-tokenizer-in-oracle.html
** @history     : Simon Greener - Jul 2006 - Original coding (extended SQL)
** @history     : Simon Greener - Aug 2008 - Converted to SQL Server 2008
** @history     : Simon Greener - Aug 2012 - Converted to SQL Server 2012 and return separators
**/
-- Drop function if exists
--
IF EXISTS (SELECT *
             FROM dbo.sysobjects
            WHERE id = object_id (N'[dbo].[Tokenizer]')
              AND OBJECTPROPERTY(id, N'IsTableFunction') = 1)
DROP FUNCTION [dbo].[Tokenizer]
GO
CREATE FUNCTION dbo.Tokenizer(@p_string     VARCHAR(MAX),
                              @p_separators VARCHAR(254))
  RETURNS @varchar_table TABLE
 (
   id        INT,
   token     VARCHAR(MAX),
   separator VARCHAR(MAX)
  )
AS
BEGIN
  BEGIN
    WITH myCte AS (
      SELECT c.beg, c.sep, ROW_NUMBER() OVER (ORDER BY c.beg ASC) AS rid
        FROM (SELECT b.beg, c.sep
                FROM (SELECT a.beg
                        FROM (SELECT c.IntValue AS beg
                                FROM dbo.generate_series(1,DATALENGTH(@p_string),1) c
                             ) a
                     ) b,
                     (SELECT SUBSTRING(@p_separators,d.IntValue,1) AS sep
                        FROM dbo.generate_series(1,DATALENGTH(@p_separators),1) d
                     ) c
               WHERE CHARINDEX(c.sep,SUBSTRING(@p_string,b.beg,1)) > 0
              UNION ALL SELECT 0 AS beg, CAST(NULL AS VARCHAR) AS sep
             ) c
    )
    INSERT INTO @varchar_table
    SELECT ROW_NUMBER() OVER (ORDER BY d.rid ASC) AS Id,
           CASE WHEN DataLength(SUBSTRING(d.fullstring, (d.beg + 1), (d.end_p - d.beg - 1))) = 0
                THEN NULL
                ELSE SUBSTRING(d.fullstring, (d.beg + 1), (d.end_p - d.beg - 1))
           END AS token,
           d.sepstr AS separator
      FROM (SELECT BASE.rid, BASE.beg,
                   LEAD.beg end_p,
                   @p_string AS fullstring,
                   LEAD.sep AS sepstr
              FROM MyCTE BASE LEFT JOIN MyCTE LEAD ON BASE.rid = LEAD.rid-1
           ) d
     WHERE d.end_p IS NOT NULL
       AND d.end_p > d.beg;
    RETURN;
  END;
END
GO</blockquote>

— Dieter Hofrichter · 2 August 2012, 10:10 · #