PDA

View Full Version : MSSQL 2019 and Unicode



Ian Smith
11-Apr-2023, 11:27 AM
In the Learning Center "Migrating to DataFlex 2021" course the statement is made that UFT8 is supported with MS SQL 2019 but that NChar & NVarChar are still recommended because of the “field size interpretation“, why?

Please can anyone explain / expand on this statement?

Thanks

Sture
11-Apr-2023, 12:06 PM
Perhaps because UTF8 characters may be more than 1 byte? Only NChar and NVarChar takes this into account. I''m guessing

Samuel Pizarro
11-Apr-2023, 12:14 PM
UTF8 is not the same as Unicode.

UTF8 is only one of the possible encoding to represent a Unicode character (the image you see).

NCHAR and NVARCAR are not UTF8, they are UTF16. which is a 2-bytes for each char.

NCHAR(20) means you can hold up to 20 chars/symbols in your column, but behinds the scenes it consumes 40 bytes of space. each char will use 2 bytes.

In a UTF8, VARCHAR(20) means 20 bytes only.. not chars.
if you have any char that would is beyond the 127 ascII position, that char will consume 2 bytes..

So depending on your string content, it won't fit anymore in a varchar(20) column, if you are converting that from a non-unicode source.. which is our case


hope that makes sense ...

Peter van Mil
11-Apr-2023, 03:10 PM
When using NChar & NVarChar (UTF-16) each character takes 2 bytes. If you have a Customer database, the customer name will take a field length of 70 in stead of 35.

With UTF-8 chacacters can take 2 to 4 bytes. But in the UK and Western Europe most characters will take just one byte. Only customer names from Eastern Europe will contain characters, that will take 2 or more bytes.

Thefor I use the collating sequence Latin1_General_100_CI_AS_SC_UTF8 for my databases and Char or Varchar as datatype.. This will work unless you need to store real Chinese or Japanese customer names.

Harm Wibier
12-Apr-2023, 03:35 AM
We had multiple reasons to recommend the N.. fields over the UTF-8 support. One was that the UTF-8 support on MS SQL was relatively new meaning that it wasn’t very ‘proven’ and not available on older versions of MS SQL.

But the main reason is that the way field lengths are interpreted on N(Var)Char fields matches how the user interface on both Windows and Web does it in DataFlex. A field length of 20 actually means 20 UTF-16 code units in DataFlex which matches what MS SQL does for N(Var)Char fields. For UTF-8 fields I believe (distant memory talking here) MS SQL counts UTF-8 code units so you can actually store less than DataFlex allows you to enter.

Of course, UTF-16 is less efficient in storage, but can be faster in usage because it requires less conversions. An advantage of the N(Var)Char fields is that you don’t have to convert the entire database at once. You could decide to first only do it for Unicode critical fields like customer name but leave your product information in char fields.

Martin Moleman
12-Apr-2023, 04:46 AM
Not sure what you mean with Eastern Europe characters, but any character not in the Ascii range ( < 127) will take 2 or more bytes in utf-8.
That includes accented characters that are common in for example German, French or Spanish. For example 'Müller' will take 7 bytes.

Peter van Mil
12-Apr-2023, 07:05 AM
Hi Martin,

I was under the impression that it applied to characters > 254. That doesn't make a big difference in our country, but for international applications it does.

Ian Smith
12-Apr-2023, 10:52 AM
Now I understand, if using an MS SQL UTF8 collation you would need to / should SizeOfString all fields (probably in Creating and Updating) to confirm the data will fit.

Back to n(var)char - thanks everyone.