NEXTGEN First Take (Alpha-2)



Clive Richmond
13-Jul-2020, 01:39 AM
So far so good. Apart from some cryptography code that required refactoring, our application source compiles and runs. Undoubtedly, we still have work to do before it's Unicode-ready, and with that in mind we have decided to leave 64-bit for another release.

Reading the available literature, am I right in thinking these are the main areas of source code that need reviewing?


Length function. Are we counting characters or the size of the string in bytes? May need to switch to SizeOfString (see the sketch after this list).
Sequential I/O. Where we are using Direct_Input/Direct_Output, do we need to maintain the existing file format or add a BOM to identify the file as Unicode?
External APIs. Identify which ‘A’ calls have an equivalent ‘W’ call and update accordingly.
3rd Party Controls. Review and update accordingly.
MSSQL. Convert char and varchar data types to nchar and nvarchar data types respectively.
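
For the first item, something like this is what I have in mind (my assumption being that in DF20 Length counts characters while SizeOfString returns the size in bytes; to be confirmed against the DF20 help):

// Hedged sketch: character count vs. byte size of a UTF-8 string in DF20.
// Length and SizeOfString are the functions named in the list above; their
// exact semantics should be verified against the DataFlex 2020 documentation.
String sName
Move "Grüße" to sName
Showln "Length (characters): " (Length(sName))         // expected: 5 characters
Showln "SizeOfString (bytes): " (SizeOfString(sName))  // expected: 7 bytes in UTF-8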

Our customer base is still a spread of embedded and SQL at this time, but the pendulum is slowly swinging towards SQL. Note our applications don't support mixed database environments. It's either one database type or the other, and this includes tables such as codemast etc. With that in mind I noted the following in the What's New.


Database
The embedded database does not support Unicode and data written to it is converted to OEM by the runtime. It is backwards compatible and the database can be shared with older revisions of DataFlex. The sorting of the indexes is done according to the Df_collate.cfg in bin or bin64. Note that string comparisons in the language are now performed using the new Unicode comparisons and can be different than the embedded database collation.

All of that is fine with the exception of the string comparisons, in particular where a control block (e.g. a While/Loop) is used to step through a table and the primary index segment is a string. If the database collation returns records in a certain order, is there not the potential for the loop to terminate prematurely, since the string comparison evaluation will be different?

If that is the case, will DAW provide a DF_Collate.cfg that matches the string comparison used by the runtime?
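
To illustrate the scenario I have in mind, a rough sketch (table, column and index names are made up; the point is that the Find order follows the index/database collation while the While condition uses the runtime's string comparison):

// Hedged illustration only; Customer, Customer.Name and Index.2 are hypothetical.
// The Finds return records in index (database collation) order, but the loop's
// termination test uses the runtime's string comparison. If the two collations
// disagree, the loop could stop early or run past the intended range.
String sEnd
Move "MÜLLER" to sEnd
Clear Customer
Find ge Customer by Index.2
While (Found and (Customer.Name <= sEnd))
    // ... process the record ...
    Find gt Customer by Index.2
Loop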

Under the Collating section there is mention of a new attribute called DF_LOCALE_CODE. From what I gather this doesn’t apply to the embedded database. I was thinking we could use something similar to set a custom df_collate.cfg. NB: I appreciate the database would require reindexing.

There isn’t much information on this new attribute. Could someone please explain how it might be used and when?

Harm Wibier
13-Jul-2020, 02:57 AM
Hi Clive,

Your list looks all right. I'd probably add searching the source code for any ToOEM / ToANSI conversions you might have. The warning system will tell you about those as well (and a lot of other potential issues).

As for embedded database string comparisons, I would like to note that this is how it has always been when using SQL. It is very likely that 80% of the systems out there work on a SQL database that collates differently from the DF_Collate.cfg of DataFlex. So we do not expect big issues in that area.

We can't supply a standard DF_Collate that matches the new collation because the new collation is dependent on machine settings (unless you change it using DF_LOCALE_CODE). It might be possible to create a program that generates one based on the new collation. We are working on more documentation on this. But here is a forum thread with some details on DF_LOCALE_CODE: https://support.dataaccess.com/Forums/showthread.php?65401-Congratulations-about-ICU-addoption-request-to-not-stop-there!&p=351649#post351649

Regards,

DaveR
29-Jul-2020, 02:51 AM
Clive's point 2 was a question. What's the answer?

Focus
29-Jul-2020, 03:11 AM
BOM markers are always optional. DF20 just assumes Unicode, in the same way that <20 assumed OEM.

Peter van Mil
29-Jul-2020, 03:49 AM
I also don't know the answer, but a lot of external files will be in ANSI (or OEM) format. A BOM character is only useful when a file is supposed to be in UTF format (and when external parties are expecting that).

Focus
29-Jul-2020, 04:08 AM
Up to 19.1 it will always read/write as OEM; if you are reading something else you have to code for it. 20 is no different, other than the starting point is Unicode and not OEM.
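
e.g. something along these lines for reading a legacy OEM file in 20 (OemToUtf8 is my assumption for the counterpart of Utf8ToOem mentioned elsewhere in this thread; check the actual conversion function names in the DF20 help, and the file name is just an example):

// Hedged sketch: read an OEM-encoded legacy file and convert each line so the
// rest of the program works with normal (UTF-8) DataFlex strings.
// OemToUtf8 is assumed here; verify the exact function name against the DF20 help.
String sLine
Direct_Input "legacy_oem.txt"
While (not(SeqEof))
    Readln sLine
    If (not(SeqEof)) Begin
        Move (OemToUtf8(sLine)) to sLine   // convert as close to the file I/O as possible
        // ... use sLine as a normal DataFlex string ...
    End
Loop
Close_Input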

Peter van Mil
29-Jul-2020, 04:18 AM
That's right, but my comment was about the external files. If you are using BOM characters for external files and an external party is expecting an ANSI file, then things go wrong.

Focus
29-Jul-2020, 04:27 AM
But you have to code it yourself, and you would not put one in an ANSI file, because the BOM is there to tell the receiver which UTF encoding has been used in a Unicode file.

DaveR
29-Jul-2020, 09:02 AM
Hmm. Most CSV files we get are assumed to be UTF-8. I suppose we'll have to annotate all programs with a warning that this might change. Lots of furriners send CSV, but they assume we are US-ish, and so far so good.

Focus
29-Jul-2020, 09:13 AM
If they ARE UTF-8 then there is no change, other than DF20 will properly handle characters past the first 127. If they are NOT UTF-8 then yes, one side or the other will have to do something. Although, given you must currently be presenting UTF-8 files to something expecting OEM, you can't have any chars over 127 in the files?

DaveR
29-Jul-2020, 11:48 AM
If they ARE UTF-8 then there is no change, other than DF20 will properly handle characters past the first 127. If they are NOT UTF-8 then yes, one side or the other will have to do something. Although, given you must currently be presenting UTF-8 files to something expecting OEM, you can't have any chars over 127 in the files?

We'll cross that bridge...

Mostly we encounter accented European characters, but I suppose vendors are eventually going to expect that they can throw other languages at us. It's taken 30 years to get some of them to send Excel files; can't be hasty :(

wila
29-Jul-2020, 01:04 PM
Peter,


That's right, but my comment was about the external files. If you are using BOM characters for external files and an external party is expecting an ANSI file, then things go wrong.

Any time you need ANSI string encoding and communicate with a 3rd party in ANSI or OEM, you will have a problem in DataFlex 20, as you do not even have an ANSI string type. Basically the answer is "you cannot".
Be it just writing to text files or anything else.

I'm sure that there will be ways around that, for example by using Utf8ToOemBuffer or Utf8ToStr, but you'll have to store the result in a UTF-8-typed string or in a raw memory location, so that certainly is going to be "interesting times".
See also this thread from earlier for more details on the topic:
https://support.dataaccess.com/Forums/showthread.php?65181-Windows-API-wrappers-and-the-state-of-brokenness

--
Wil

Clive Richmond
13-Aug-2020, 03:47 AM
Hi Harm,

Thanks for the replies.


As for embedded database string comparisons, I would like to note that this is how it has always been when using SQL. It is very likely that 80% of the systems out there work on a SQL database that collates differently from the DF_Collate.cfg of DataFlex. So we do not expect big issues in that area.

We can't supply a standard DF_Collate that matches the new collation because the new collation is dependent on machine settings (unless you change it using DF_LOCALE_CODE). It might be possible to create a program that generates one based on the new collation. We are working on more documentation on this. But here is a forum thread with some details on DF_LOCALE_CODE: https://support.dataaccess.com/Forums/showthread.php?65401-Congratulations-about-ICU-addoption-request-to-not-stop-there!&p=351649#post351649


We still have an issue with the above. Just because "this is how it has always been when using SQL" does not mean it's right, and it certainly wasn't something we were aware of until it bit us with some nasty consequences. :confused::(

Take the example below. The SQL database collation is 'out of the box', which for our region is Latin1_General_CI_AS. I assume that the DataFlex runtime, prior to DF20, took its string comparison collation from df_collate.cfg, hence the result below. There are a couple of ways to fix this (in SQL), but the option we use is to change the df_collate.cfg file to best match the SQL collation. Doing so gives us the results as if it were embedded.

[screenshot attachment 13898]

In DF20, "string comparisons in the language are now performed using the new Unicode comparisons". I assume that this will eliminate how we have used the df_collate.cfg file in the past. Therefore, how in DF20 do we match the database collating sequence? Is this where DF_LOCALE_CODE comes in?

Harm Wibier
13-Aug-2020, 08:28 AM
Sure, it isn't perfect, as we do not live in a perfect world. All I can do is explain how it works and if you have suggestions on how to make it better, feel free to let us know!

String comparisons in the language now compare Unicode strings, which can hold a virtually unlimited number of characters. The old string comparison logic relying on the order specified in DF_Collate.cfg obviously did not hold up, and we have replaced it with string comparison logic provided by the ICU library. Customization can be done by setting DF_LOCALE_CODE, which is passed on into the ICU library. See this post (https://support.dataaccess.com/Forums/showthread.php?65401-Congratulations-about-ICU-addoption-request-to-not-stop-there!&p=351649#post351649) for some information on how to use it.
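
For example, something along these lines (the exact Get_Attribute/Set_Attribute usage and the locale string should be checked against the help and the post linked above; "nl-NL" is just an illustrative locale):

// Hedged sketch: steering the ICU string comparison via DF_LOCALE_CODE.
// The locale string is illustrative; see the linked post / DF20 help for details.
String sOldLocale
Get_Attribute DF_LOCALE_CODE to sOldLocale     // remember the machine-derived default
Set_Attribute DF_LOCALE_CODE to "nl-NL"        // string comparisons now use this ICU locale
// ... perform the comparisons / finds that must match this collation ...
Set_Attribute DF_LOCALE_CODE to sOldLocale     // restore the original setting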

It would be great if we could somehow create a list of DF_LOCALE_CODE strings for the various SQL collations that are available. I am not sure if we can match them all perfectly.

So in DataFlex 2020 the usage of DF_Collate.cfg is limited to the generation and usage of embedded database indexes. While we could have hooked this up with the new string comparison logic, we did not, as this would create an incompatibility between older DataFlex revs and 2020. We have (prototyped) an example of how to generate a DF_Collate.cfg that matches the ICU collation used by DataFlex. If you are interested in that I am happy to send it to you, but it is pretty rough, as we haven't decided what to do with it yet.

Harm Wibier
13-Aug-2020, 08:36 AM
Any time you need ANSI string encoding and communicate with a 3rd party in ANSI or OEM, you will have a problem in DataFlex 20, as you do not even have an ANSI string type. Basically the answer is "you cannot".
Be it just writing to text files or anything else.
Note that between the technology previews and the Alphas we have improved the handling of strings that do not contain valid UTF-8 data. So you can temporarily convert a string to a different encoding (without having to put it into a UChar array or work with pointers) without crashes and funky debugger behaviors (although it will not display the strings properly). So you can simply do a WriteLn (Utf8ToOem(sMyString)) to write out an OEM file. I would always recommend doing these conversions as close to the external API or file I/O as possible. Also, to work with binary data we still recommend using UChar arrays, as that still has multiple advantages over using strings.
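
In code, the pattern would look roughly like this (the file name and variable are just illustrative; Utf8ToOem is the function named above):

// Hedged sketch of the pattern described above: keep the string as UTF-8
// everywhere and only convert to OEM at the point of output.
String sMyString
Move "Grüße" to sMyString
Direct_Output "legacy_export.txt"
WriteLn (Utf8ToOem(sMyString))   // conversion happens as close to the file I/O as possible
Close_Output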

wila
13-Aug-2020, 10:45 AM
Hi Harm,


Note that between the technology previews and the Alphas we have improved the handling of strings that do not contain valid UTF-8 data. So you can temporarily convert a string to a different encoding (without having to put it into a UChar array or work with pointers) without crashes and funky debugger behaviors (although it will not display the strings properly). So you can simply do a WriteLn (Utf8ToOem(sMyString)) to write out an OEM file. I would always recommend doing these conversions as close to the external API or file I/O as possible. Also, to work with binary data we still recommend using UChar arrays, as that still has multiple advantages over using strings.

That's good to know and indeed an improvement over how it was.
Can't help thinking of it as a partial solution though, as you can't inspect the OEM or ANSI data in the debugger at the moment.
But at least there's a viable workaround, which I guess is all we can hope for at this stage.

Thanks!
--
Wil

Harm Wibier
13-Aug-2020, 12:14 PM
Can't help thinking of it as a partial solution though, as you can't inspect the OEM or ANSI data in the debugger at the moment.
Look at it as a temporary solution until everything is converted to Unicode.

Marco
13-Aug-2020, 05:39 PM
Just a thought. Have a ‘generate collate from database’ utility.
It creates a table, inserts a series of values, queries them back with ORDER BY, records the resulting order and writes out the matching collate file to be kept with your program. Oh, and drops the table again.

Would something like that work? At least for the 255-odd ‘common’ characters.

Clive Richmond
16-Aug-2020, 10:33 PM
Hi Harm,


Sure, it isn't perfect, as we do not live in a perfect world. All I can do is explain how it works and if you have suggestions on how to make it better, feel free to let us know!

We’re always happy to contribute Harm. I think the frustration is we couldn’t see any light at the end of the tunnel on this one. However, perhaps there is something that can be done to help.


String comparisons in the language now compare Unicode strings, which can hold a virtually unlimited number of characters. The old string comparison logic relying on the order specified in DF_Collate.cfg obviously did not hold up, and we have replaced it with string comparison logic provided by the ICU library. Customization can be done by setting DF_LOCALE_CODE, which is passed on into the ICU library. See this post (https://support.dataaccess.com/Forums/showthread.php?65401-Congratulations-about-ICU-addoption-request-to-not-stop-there!&p=351649#post351649) for some information on how to use it.

I took your demo sample, Harm, and experimented with it by adding data where the SQL Server collation and string comparison differ. After cycling through the various options, I hit upon one that worked. I took that and added the code to change the locale prior to finding, and this has worked.

[screenshot attachment 13902]

[screenshot attachment 13903]

It would be great if we could somehow create a list of DF_LOCALE_CODE strings for the various SQL collations that are available. I am not sure if we can match them all perfectly.

I like Marco’s suggestion (https://support.dataaccess.com/Forums/showthread.php?66145-NEXTGEN-First-Take-(Alpha-2)&p=357640#post357640). If we had a utility that took a SQL Server collation and then generated the appropriate locale settings for string comparison, that would go a long way towards taking out the guesswork.


So in DataFlex 2020 the usage of DF_Collate.cfg is limited to the generation and usage of embedded database indexes. While we could have hooked this up with the new string comparison logic, we did not, as this would create an incompatibility between older DataFlex revs and 2020. We have (prototyped) an example of how to generate a DF_Collate.cfg that matches the ICU collation used by DataFlex. If you are interested in that I am happy to send it to you, but it is pretty rough, as we haven't decided what to do with it yet.

Totally agree. So far, in DF20 we haven’t noticed any change with the embedded database collation and string comparison. My understanding is that as long as we don’t try to mix databases, i.e. SQL Server with Unicode characters (nvarchar) and embedded, it should be fine. However, this is not something we support in our application. There is no half-and-half; sites are entirely embedded or entirely SQL Server.

Assuming my understanding is correct, the utility you mention should only be needed when mixing databases? Harm, I am certainly interested in taking a look, even if it is just to get a better understanding of ICU collations. Thanks.

DaveR
18-Aug-2020, 04:24 PM
Look at it as a temporary solution until everything is converted to Unicode.

the U.S. will have signs in Kilometres before that happens on this side of the planet :cool:

Samuel Pizarro
18-Aug-2020, 07:25 PM
I hope not!

Focus
24-Aug-2020, 07:41 AM
Hi Clive

I've just been looking at the general area of collation myself

Just to ask the daft question... The first two lines in both your screenshots seem to be swapped, and the last line in both cases is the same. Are you not trying to achieve the same order for all three lines?

On a separate note, I created a SQL table with chars 1-255 in a char column and then tried different ORDER BY clauses with different collations, and they are different.

For reference, this was the sequence I got for Latin1_General_CI_AS, so if it were important one could create a DF_COLLATE.CFG with the characters in this order:



32, 1, 2, 3, 4, 5, 6, 7, 8, 14, 15, 16, 17, 18, 19
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 127, 129, 141
143, 144, 157, 39, 45, 173, 150, 151, 160, 9, 10, 11, 12, 13, 33
34, 35, 36, 37, 38, 40, 41, 42, 44, 46, 47, 58, 59, 63, 64
91, 92, 93, 94, 136, 95, 96, 123, 124, 125, 126, 161, 166, 168, 175
180, 184, 191, 152, 145, 146, 130, 147, 148, 132, 139, 155, 43, 60, 61
62, 177, 171, 187, 215, 247, 162, 163, 164, 165, 167, 169, 172, 174, 176
181, 182, 183, 134, 135, 149, 133, 137, 128, 48, 188, 189, 190, 185, 49
50, 178, 179, 51, 52, 53, 54, 55, 56, 57, 65, 97, 170, 225, 193
224, 192, 194, 226, 228, 196, 195, 227, 229, 197, 198, 230, 98, 66, 67
99, 231, 199, 100, 68, 208, 240, 69, 101, 233, 201, 200, 232, 202, 234
235, 203, 70, 102, 131, 103, 71, 72, 104, 105, 73, 237, 205, 204, 236
238, 206, 207, 239, 74, 106, 107, 75, 76, 108, 109, 77, 78, 110, 241
209, 79, 111, 186, 243, 211, 210, 242, 244, 212, 214, 246, 245, 213, 216
248, 156, 140, 112, 80, 81, 113, 114, 82, 83, 115, 138, 154, 223, 84
116, 254, 222, 153, 85, 117, 250, 218, 217, 249, 219, 251, 252, 220, 86
118, 119, 87, 88, 120, 121, 89, 253, 221, 159, 255, 122, 90, 142, 158

Clive Richmond
24-Aug-2020, 11:00 AM
Hi Andrew,


Just to ask the daft question...The first two lines in both your screen shots seem to be swapped and the last line in both cases is the same. Are you not trying to achieve the same order for all three lines ?

No, this question is not about trying to achieve the same order. The ordering shown is correct for the collation used by the respective databases, i.e. embedded and SQL.

The issue is the collation used by the runtime to compare strings. If the database returns records in a certain order for a primary key that is a string (e.g. nvarchar), then the string collation used by the runtime should match when comparisons are made. You’ll notice in my second post (https://support.dataaccess.com/Forums/showthread.php?66145-NEXTGEN-First-Take-(Alpha-2)&p=357621#post357621) the screenshot shows SQL failing to select the records it should have, based on the first and last records in the table.


For reference this was the sequence I got for Latin1_General_CI_AS, so if it was important one could create a DF_COLLATE.CFG with the characters in this order

Changing df_collate.cfg will not solve this problem in DataFlex 2020. I believe the solution is to set the correct DF_LOCALE_CODE, as suggested by Harm. There is an example of this in the code of the second screen of my previous reply (https://support.dataaccess.com/Forums/showthread.php?66145-NEXTGEN-First-Take-(Alpha-2)&p=357695#post357695). What DAW might be able to do is provide a utility that helps developers generate the locale code for the SQL Server database collation being used.

HTH