
View Full Version : readln bug



Mike Cooper
18-Apr-2020, 06:46 AM
Another issue with characters above 7 bits (i.e. ASCII 128 and over).

Create a view
Add a form called oFileNew
Add the following button



Object oButton2 is a Button
Set Location to 153 371
Set Label to 'Read In Test'

// fires when the button is clicked
Procedure OnClick
String sFile sLine
Integer iChannel iCount

Get Value of oFileNew to sFile
If (sFile='') Begin
Send Stop_Box 'Invalid File'
Procedure_Return
End

Move (Seq_New_Channel()) to iChannel

Direct_Input channel iChannel sFile
If (SeqEof) Begin
Close_Input channel iChannel
Send Seq_Release_Channel iChannel
Send Stop_Box 'Invalid File'
Procedure_Return
End
Repeat
Readln channel iChannel sLine
Move (Trim(sLine)) to sLine // <-- put a Stop breakpoint here
If (sLine<>'') Begin
Increment iCount
End
Until [SeqEof]
Close_Input channel iChannel
Send Seq_Release_Channel iChannel
Send Info_Box 'Complete'




End_Procedure

End_Object



Now create a txt file with the following lines:



accordons
accords
accordât
accordèrent
accordé
accordée
accordées
accordéon
accordéoniste
accordéonistes
accordéons
accordés
accore
accores




Save the file and then set the value of oFileNew to the filename and path of this file.

In DF19, sLine reads in correctly (except that it is ANSI and not UTF... but I'm not necessarily worried about that),
but in DF20 sLine does not end up with the same value.

starzen
18-Apr-2020, 07:25 AM
DF20 is all UTF-8.

In UTF-8, every byte has to follow the UTF-8 rules, so a single-byte character can only be 0-127.
2-byte sequences start with a byte of the form 110xxxxx
3-byte sequences start with a byte of the form 1110xxxx
4-byte sequences start with a byte of the form 11110xxx

This also means there are specific bytes and byte combinations that are invalid.
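
These byte patterns can be checked mechanically. Here is a small illustration (in Python rather than DataFlex, purely to demonstrate the rules above; the function name is made up for the example):

```python
def utf8_length_from_lead(b: int) -> int:
    """Return the sequence length implied by a UTF-8 lead byte, or 0 if invalid."""
    if b < 0x80:            # 0xxxxxxx: single-byte (ASCII) character
        return 1
    if 0xC2 <= b <= 0xDF:   # 110xxxxx: lead byte of a 2-byte sequence
        return 2
    if 0xE0 <= b <= 0xEF:   # 1110xxxx: lead byte of a 3-byte sequence
        return 3
    if 0xF0 <= b <= 0xF4:   # 11110xxx: lead byte of a 4-byte sequence
        return 4
    return 0                # continuation byte (10xxxxxx) or invalid lead byte

print(utf8_length_from_lead(ord('A')))  # 1
print(utf8_length_from_lead(0xC3))      # 2 (lead byte of 'ï' encoded in UTF-8)
print(utf8_length_from_lead(0x82))      # 0: a lone continuation byte is invalid
# A raw ANSI byte such as 0xEF ('ï' in Windows-1252) is not valid UTF-8 on its own:
print(bytes([0xEF]).decode('utf-8', errors='replace'))  # '\ufffd' (replacement character)
```

This is exactly why a file full of 8-bit ANSI or OEM bytes looks broken to anything that insists on UTF-8.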

What are you actually trying to do? Are you loading a non-Unicode file and trying to convert it to Unicode?

Mike Cooper
18-Apr-2020, 09:59 AM
Thanks for the explanation Michael

This all started because I was reading in a line-delimited text file that contained French words, some of which had accented characters.

For example, one word was Aloïs; the 4th character, ï, is character 139 in the OEM world but character 239 in the ANSI world.

Prior to DF20, I could just read the string in fine. In DF20 it didn't like any character above 7 bits.

So I read the file into a UChar array. In the UChar array I could see that the value was represented as 239, which indicated that it was truly ANSI and not OEM.
So I converted the UChar array to OEM, and this worked perfectly. The value of that character in the converted array was now 139 (the OEM value).

At that point I thought I should just be able to use the Character function (i.e. Character(139)) and it should now show in my DataFlex string as "ï", but this no longer works this way in DF20... just in versions leading up to DF20.

I solved the issue by reading the file in using DF19, but my underlying issue is that the Readln command and Character function cannot handle a character that is not 7-bit. So how is one supposed to read in any text file that contains 8-bit characters?
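
For reference, the 139/239 values mentioned above can be verified with a quick sketch (Python is used here just to show the encodings; this is not DataFlex code):

```python
ch = 'ï'  # the 4th character of "Aloïs"
print(ch.encode('cp850'))   # b'\x8b' -> 139, the value in the OEM code page
print(ch.encode('cp1252'))  # b'\xef' -> 239, the value in the ANSI code page
print(ch.encode('utf-8'))   # b'\xc3\xaf' -> two bytes in UTF-8
```

The same character has three different byte representations, which is the whole source of the confusion in this thread.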

starzen
18-Apr-2020, 10:40 AM
So you start out with an ANSI file, read it into binary data, and then convert it to OEM characters. All good.

Now you are using the Character function. I don't really know what it does in DF20, but the fact that you have issues with characters above 127 tells me it expects (at least at this time) a UTF-8 character.

here is something i tried



UChar[] test
String sVal
Address aUTF8Buffer
Integer iVoid

Move 65 to test[0]  // 'A'
Move 66 to test[1]  // 'B'
Move 130 to test[2] // 'é' in the OEM code page
Move 67 to test[3]  // 'C'

Move (OemToUtf8Buffer(AddressOf(test), 4)) to aUTF8Buffer
Move (PointerToString(aUTF8Buffer)) to sVal
Move (Free(aUTF8Buffer)) to iVoid


You have to be careful with strings, as using strings will trigger automatic conversions.

So here I am using a UChar array and put these characters in.
OemToUtf8Buffer converts OEM to UTF-8, and then PointerToString converts it to a DF20 string, which is a multi-byte string.
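
The same OEM-to-UTF-8 round trip can be sketched outside DataFlex to show what the buffer conversion does with byte 130 (Python here, with cp850 assumed as the OEM code page for illustration):

```python
raw = bytes([65, 66, 130, 67])  # the same byte values as the UChar array above
text = raw.decode('cp850')      # byte 130 is 'é' in the OEM code page
print(text)                     # ABéC
utf8 = text.encode('utf-8')
print(list(utf8))               # [65, 66, 195, 169, 67] -> 'é' became two bytes
```

Note how the single OEM byte 130 becomes the two-byte sequence 195, 169 in UTF-8, which is why lengths and positions can shift after conversion.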

Mike Cooper
18-Apr-2020, 11:19 AM
Thanks Michael.

I will give that a shot. It is really something that I hadn't considered, but it will definitely be an issue moving to DF20. Most of my software uses the Character() function a lot, because many of my customers are French.

Thanks for the input and insight.

M

starzen
18-Apr-2020, 01:18 PM
Of course, in DF20 that shouldn't be needed any longer, as everything is Unicode.

So your source code will support the proper Unicode characters without any special handling.

SQL databases will be fine as well.

Text files would need to be UTF-8 encoded.

Mike Cooper
18-Apr-2020, 01:45 PM
So Readln should have read that file correctly in DF20.

That's what I am saying: Readln doesn't read the extended characters correctly in DF20.

Samuel Pizarro
18-Apr-2020, 02:04 PM
It does if the file is Unicode (UTF-8 encoded), Mike.

As you believe your file is ANSI, you need to convert it from ANSI to UTF-8. That's why I suggested you use the cSeqHelper class.

Mike Cooper
18-Apr-2020, 03:37 PM
Thanks Samuel.

I haven't got to that yet, but will try it.

starzen
18-Apr-2020, 05:33 PM
There are two issues here:

1) reading data
2) using data

Reading data should not have any issues; it should simply read the binary values. It is when you use the data that the problem comes in.

DataFlex 20 internally stores strings in Unicode. Reading data into a string is no longer simply reading binary data.

I can only guess, but I assume DF expects the input file to be Unicode, and therefore a byte like 130 is invalid.

In order to read an ASCII/ANSI file you probably need to read it into a byte array; this will not mess with the incoming data. Then use the conversion functions to convert it to UTF-8.
Now it can be used in DF20 strings.
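
The read-into-a-byte-array-then-convert approach can be sketched like this (a Python stand-in for the DataFlex steps, with cp1252 assumed as the ANSI code page):

```python
raw = b'Alo\xefs'            # "Aloïs" as raw ANSI (Windows-1252) bytes
text = raw.decode('cp1252')  # interpret the bytes using the known legacy encoding
print(text)                  # Aloïs
print(text.encode('utf-8'))  # b'Alo\xc3\xafs' -> valid UTF-8, safe for further use
```

The key point is that the raw bytes are never pushed through a string type that assumes UTF-8; the conversion happens explicitly, with the source encoding named.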

raveens
19-Apr-2020, 04:58 PM
Interesting...

Maybe "Direct_Input" should detect the file format/character set for "file: pc-text:" and auto-convert the text to UTF-8, as "file: pc-text:" is currently the default driver/FileMode for Direct_Input.

Alternatively, maybe there need to be other file modes, i.e. ANSI and OEM, rather than assuming the file will always be UTF-8.

Secondly, how will resources be handled? We use the "include_resource (https://docs.dataaccess.com/dataflexhelp/mergedProjects/LanguageReference/Include_resource_Command.htm)" command a lot in our application and then use "Direct_Input resource:SQL1" to read the resource into a UChar[] or sometimes into a string, depending on whether the resource is of type binary or line. Will this still work in DF20?

starzen
19-Apr-2020, 05:29 PM
We would need to be able to:

1) Open a file with the default encoding.
This will open the file and check whether it has a BOM; if so, it reads it using the encoding indicated by the BOM. If it does not have a BOM, it should read the file using the computer's default encoding/code page.

2) Open a file specifying a particular encoding.
This allows a file to be opened with a specified encoding; needed if you have a file without a BOM that is in a known encoding different from the default.

3) The ability to convert between encodings.

We built a .NET library for our sequential IO a while ago and use it for the majority of things, but DF20 certainly needs some of these things working.

As long as you don't use any special characters you will generally slide through and it will work. Once you use characters past 127, that is where things start breaking.
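
Point 1 above (BOM sniffing with a fallback to a default encoding) can be sketched as follows; the fallback choice of cp1252 and the function name are assumptions for illustration only:

```python
import codecs

def sniff_encoding(raw: bytes, default: str = 'cp1252') -> str:
    """Pick an encoding from the BOM if one is present, else fall back to a default."""
    if raw.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'   # this codec also strips the BOM on decode
    if raw.startswith(codecs.BOM_UTF16_LE):
        return 'utf-16-le'
    if raw.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16-be'
    return default

print(sniff_encoding(codecs.BOM_UTF8 + b'hello'))  # utf-8-sig
print(sniff_encoding(b'plain text'))               # cp1252
```

A real implementation would also need to deal with BOM-less UTF-8 detection, but the shape of the API is the point here.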

Harm Wibier
20-Apr-2020, 04:28 AM
So... just to make sure:
Direct_Input and ReadLn do not do any conversions. They actually never did; they just put the value into a string. So in 19.1 that would all work fine if your file was saved in the OEM format, and for anything else you'd need to add conversions.

In DataFlex 20, strings are assumed to be UTF-8, which means that reading data into them unconverted assumes UTF-8. So if your file is stored in UTF-8 encoding, all is fine. For anything else you'd need to add conversions.

Luckily, UTF-8 was designed to be partially compatible with ANSI & OEM, so the newline character is the same and the ReadLn command functions properly on all three formats. So if your file is ANSI or OEM you can still use ReadLn and put it directly into a string, but you then have to convert to UTF-8 before doing any further processing. All other string functions will assume the string to be UTF-8 and will break on any extended character (128 and above).



Readln channel iChannel sLine
Move (AnsiToUtf8(sLine)) to sLine
Move (Trim(sLine)) to sLine //<---Put a Stop Break here
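
The compatibility claim behind this pattern, that ANSI, OEM and UTF-8 all use the same newline bytes so ReadLn can still split lines, can be checked with a quick sketch in Python:

```python
# The line terminator bytes are identical in the legacy code pages and in UTF-8,
# which is why ReadLn can still find line boundaries in ANSI/OEM files.
for enc in ('cp850', 'cp1252', 'utf-8'):
    assert '\r\n'.encode(enc) == b'\x0d\x0a'
print('newline bytes agree across encodings')
```

Only the bytes above 127 differ between the encodings, which is why the conversion can safely happen per line, after ReadLn has done its job.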


Between TP2 and Alpha 1 we adjusted the debugger to use safer string conversions when displaying string values, so it shouldn't break or read past the end of strings any more when a string contains invalid UTF-8. Unfortunately this made the debugger a bit slower, but fewer weird things will happen when there is ANSI or OEM data in a string.

Note that I do not disagree on the need for file-handling APIs that do this for you (and support text files with a BOM and such). Not so much because there is more technical reason for it than before, but since we all have to look at our file IO anyway, this might be a good time to properly review things.

wila
20-Apr-2020, 05:14 AM
So if your file is ANSI or OEM you can still use ReadLn and put it directly into a string, but you then have to convert to UTF-8 before doing any further processing. All other string functions will assume the string to be UTF-8 and will break on any extended character (128 and above).



Readln channel iChannel sLine
Move (AnsiToUtf8(sLine)) to sLine
Move (Trim(sLine)) to sLine //<---Put a Stop Break here



This smells like a hack to me.

As mentioned earlier, I would prefer to have an AnsiString variable type added as well.
This makes things much clearer, and for these types of things you can change the type of your string variable to AnsiString so that it is crystal clear what you are doing.

Might be a "bit" more work for you guys though...
--
Wil

Harm Wibier
20-Apr-2020, 05:34 AM
This smells like a hack to me.
This is what we've been doing in DataFlex since the first Windows version came out with OEM & ANSI. At least now we know that it is just temporary, until you have all converted your text files to UTF-8.



As mentioned earlier, I would prefer to have an AnsiString variable type added as well.
This makes things much clearer, and for these types of things you can change the type of your string variable to AnsiString so that it is crystal clear what you are doing.

Might be a "bit" more work for you guys though...

We explicitly chose not to go in this direction. We don't want to make DataFlex like C, where there are hundreds of ways to store and work with strings. Also, we'd be adding a type for a legacy string encoding (likely two, because I assume you'd want to have OemString as well) that shouldn't be necessary any more in a few years. On top of that, besides the amount of work it would take, there are limits to how many types DataFlex can have without making major changes to the compiler and the runtime.

wila
20-Apr-2020, 06:04 AM
We explicitly chose not to go in this direction. We don't want to make DataFlex like C, where there are hundreds of ways to store and work with strings. Also, we'd be adding a type for a legacy string encoding (likely two, because I assume you'd want to have OemString as well) that shouldn't be necessary any more in a few years. On top of that, besides the amount of work it would take, there are limits to how many types DataFlex can have without making major changes to the compiler and the runtime.

Nope, not asking for OemString; it's the same bytes. You could call it an "obsoleteString" type for all I care.
But OK, OK... I will stop bringing this up, at least for now.

Besides, if I'm the only person asking for this, then clearly it isn't that important.

Thanks for your answer (and patience),

edit: Please make sure that this particular trick is well documented!
-
Wil

Mike Cooper
20-Apr-2020, 08:31 AM
Thanks Harm

DaveR
21-Apr-2020, 04:03 PM
It was the mid-1990s when I first asked vendors to send copies of invoices in CSV or spreadsheet form (Lotus 1-2-3 back then!). As of 2020, we have about 5% complying. That didn't stop us using the 'load from CSV' paradigm pretty much everywhere, though, so we have Direct_Input stuff all over the system.

It would be nice if the class could be extended to contain functional replacements for Read, Readln and the rest, so that we could search and drop in a similar-syntax replacement without having to worry about the format of the original file.

Mike Cooper
21-Apr-2020, 04:25 PM
+1