Page 4 of 5 FirstFirst 12345 LastLast
Results 31 to 40 of 45

Thread: Character and ASCII, how they are supposed to work

  1. #31
    Join Date
    Feb 2009
    Posts
    5,468

    Default Re: Character and ASCII, how they are supposed to work

    I've not read the whole of this thread, so this may already have been mentioned, but this page

    https://en.wikipedia.org/wiki/Precomposed_character

    seems to suggest that é represented as 00E9 (as opposed to 0065 0301) exists within Unicode for legacy/compatibility reasons
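
    To make the two representations concrete, here is a minimal Python sketch (Python is used purely for illustration; it says nothing about how DataFlex stores strings). It builds é both ways and shows that normalization converts one form into the other. The variable names are made up for this example.

```python
import unicodedata

precomposed = "\u00e9"   # é as the single code point U+00E9
decomposed = "e\u0301"   # e (U+0065) followed by COMBINING ACUTE ACCENT (U+0301)

# Both render as é, but they are different code point sequences.
same_raw = precomposed == decomposed

# Normalizing to a common form makes them identical.
nfc = unicodedata.normalize("NFC", decomposed)    # composes to U+00E9
nfd = unicodedata.normalize("NFD", precomposed)   # decomposes to U+0065 U+0301

print(same_raw)                            # raw comparison of the two forms
print(nfc == precomposed, nfd == decomposed)
print(len(precomposed), len(decomposed))   # code point counts differ
```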

    But of course we and most others are going to have é in the database represented by 00E9

    I guess there are no easy answers and it is almost as bad as codepages !!

    But, like others are saying, if I do Mid etc. then the "character(s)" that come back must be what any reasonable human looking at the string on screen would expect, without getting lost in the weeds of "it is doing x because of code points" and so on

    It's not easy, and I guess it is as difficult as searching for the position of ß and expecting the position of "ss" to be returned if that is what is found

    I guess most of the combining diacritical marks are in the range 0300 to 036F (Unicode has a few further combining blocks), so that is at least something, but the fact that they come after the character they modify does not help
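
    As a rough sketch of what a "visual" Mid could look like, the Python below attaches each combining mark to the base character before it and then slices by those groups. The helper names visual_chars and visual_mid are hypothetical, and this is a simplification: it relies on unicodedata.combining and is not full grapheme cluster segmentation, so emoji sequences and similar cases are not handled.

```python
import unicodedata

def visual_chars(text):
    """Group each base code point with the combining marks that follow it."""
    groups = []
    for ch in text:
        # combining() returns a non-zero class for combining marks such as U+0301
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups

def visual_mid(text, start, length):
    """Mid-like helper (1-based start) that counts visual characters."""
    return "".join(visual_chars(text)[start - 1:start - 1 + length])

word = "cafe\u0301s"                 # "cafés" with a decomposed é
by_code_point = word[3:4]            # plain slice: just "e", the accent is lost
by_visual = visual_mid(word, 4, 1)   # "e" plus its combining accent
```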

    I will stop waffling now
    Success consists of going from failure to failure without loss of enthusiasm - Winston Churchill

  2. #32
    Join Date
    Feb 2009
    Location
    The Netherlands
    Posts
    4,674

    Default Re: Character and ASCII, how they are supposed to work

    We are going to look into adding normalization function(s) to DataFlex to provide developers with a more complete interface for manipulating strings. When adjusting the string functions for Unicode we researched what other environments did and looked at the available APIs. This has led to the current code point based implementations. Changing the implementation of string functions at this stage of the release cycle is not feasible, and may just result in a different set of issues.

    String comparisons (s1 = s2) consider strings in different normalization forms equal because they use ICU string comparison methods, which are influenced by the DF_LOCALE_CODE setting. ICU is used because you can also do greater-than and less-than comparisons. If strings in different forms were not considered equivalent, one would have to be evaluated as greater or less than the other.

    The Pos function and contains operator perform binary searches and do require normalization of input strings to handle composed characters properly.
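
    The effect of searching without normalization can be sketched in Python (an illustration only, not DataFlex code; normalized_pos is a hypothetical helper). A precomposed é is not found in a string that holds the decomposed form until both sides are normalized to the same form.

```python
import unicodedata

def normalized_pos(needle, haystack, form="NFC"):
    """Return the 1-based position of needle in haystack after normalizing both (0 if absent)."""
    n = unicodedata.normalize(form, needle)
    h = unicodedata.normalize(form, haystack)
    return h.find(n) + 1

haystack = "cafe\u0301"   # "café" stored in decomposed form
needle = "\u00e9"         # precomposed é

raw_pos = haystack.find(needle) + 1           # search on the raw code points
norm_pos = normalized_pos(needle, haystack)   # search after normalizing both to NFC
```

    Note that the position returned refers to the normalized string, which may be shorter than the original.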

  3. #33
    Join Date
    Feb 2009
    Location
    Brazil
    Posts
    5,446

    Default Re: Character and ASCII, how they are supposed to work

    Nice article about everything we have been talking about, but from a JavaScript perspective, and about what has been done in the ECMAScript 6 and 7 releases to handle this stuff.

    https://mathiasbynens.be/notes/javascript-unicode
    Samuel Pizarro

  4. #34
    Join Date
    Feb 2009
    Location
    Sweden
    Posts
    1,803

    Default Re: Character and ASCII, how they are supposed to work

    Harm,

    Quote Originally Posted by Harm Wibier View Post
    We are going to look into adding normalization function(s) to DataFlex
    Great, thanks.


    Quote Originally Posted by Harm Wibier View Post
    The Pos function and contains operator perform binary searches and do require normalization of input strings to handle composed characters properly.
    IIUC, what this means for developers in practice is to make sure your database is normalized and that any input from external sources (files, web services, web apps, anything else) is normalized manually when read/received. Assume users do not copy-paste non-normalized text; that should be unlikely, at least for "mostly ASCII" languages. If it is a problem, hook into the DDs somewhere and normalize as needed. Is that correct, and did I miss anything?
    // Anders

  5. #35
    Join Date
    Mar 2009
    Posts
    1,292

    Default Re: Character and ASCII, how they are supposed to work

    Did a quick wrap, check out

    http://www.frankcheng.com

    Frank Cheng

  6. #36

    Default Re: Character and ASCII, how they are supposed to work

    Frank,

    Nice!
    Also curious what all the "coming soon" parts on your website are going to be.
    No doubt they will be interesting as well.
    --
    Wil

  7. #37
    Join Date
    Feb 2009
    Location
    Cayman Islands
    Posts
    3,969

    Default Re: Character and ASCII, how they are supposed to work

    Quote Originally Posted by Frank Cheng View Post
    Did a quick wrap, check out

    http://www.frankcheng.com

    Frank Cheng
    That Roman Numerals converter looks like something with which I could really mess people's heads
    I should be on a beach ...

  8. #38
    Join Date
    Feb 2009
    Location
    Sweden
    Posts
    1,803

    Default Re: Character and ASCII, how they are supposed to work

    Frank,

    The Roman Numerals Converter does not compile, row 33 should be removed.
    // Anders

  9. #39
    Join Date
    Mar 2009
    Posts
    1,292

    Default Re: Character and ASCII, how they are supposed to work

    Hi Anders.

    It's fixed now. Thanks.

    Frank Cheng

  10. #40
    Join Date
    Feb 2009
    Location
    Adelaide, South Australia
    Posts
    2,863

    Default Re: Character and ASCII, how they are supposed to work

    Well, I am still confused..

    Let's say I want an é character.
    In this browser I do Alt+130 (numpad).

    Issue 1:
    In Studio, Alt+130 gives a music note.

    Issue 2:
    When I copy an é from the web into a (UTF-8) DataFlex source file and then do (Ascii(sChar)), I get 233.
    On the web I found that this character can be:
    UTF-8: 0xC3 0xA9 or dec bytes 195 169 or combined 50089
    UTF-16BE: 0x00E9 or dec bytes 0 233 or combined 233
    UTF-16LE: 0xE900 or dec bytes 233 0 or combined 59648
    ANSI: 233
    OEM: 130
    With Diacritic: e (U+0065) - ◌́ (U+0301)
    Alt code: Alt 130

    The Ascii function returns 233, so this is either the ANSI code or the UTF-16BE value; I was expecting 50089, the UTF-8 decimal value.

    When moving a String to a UChar array I do get 195, 169 (UTF-8)
    When moving a WString to a UChar array I get 233, 0 (UTF-16LE)
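
    The numbers above can be cross-checked with a short Python sketch (illustration only). It assumes "ANSI" means Windows-1252 and "OEM" means code page 437. It also shows that 233 is the Unicode code point of é (U+00E9), which happens to coincide with the ANSI value and with the UTF-16 code unit.

```python
ch = "\u00e9"  # é

code_point = ord(ch)                       # Unicode code point
utf8 = list(ch.encode("utf-8"))            # UTF-8 bytes
utf16_be = list(ch.encode("utf-16-be"))    # UTF-16 big-endian bytes
utf16_le = list(ch.encode("utf-16-le"))    # UTF-16 little-endian bytes
ansi = list(ch.encode("cp1252"))           # assumption: ANSI = Windows-1252
oem = list(ch.encode("cp437"))             # assumption: OEM = code page 437

# The "combined" UTF-8 value is just the two bytes read as one big-endian integer.
utf8_combined = int.from_bytes(ch.encode("utf-8"), "big")

print(code_point, utf8, utf16_be, utf16_le, ansi, oem, utf8_combined)
```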

    So we have three 'standards':
    UTF-8 = String
    UTF-16BE = Character and Ascii functions
    UTF-16LE = WString

    I think this is a bit much, and I agree with the original statement that the Character and Ascii functions should be based on the UTF-8 decimal value of 50089.

    Copy the code below into a source file and compile it, then set a breakpoint on the Inkey line and inspect the locals.

    Code:
    Use Windows.pkg
    
    Procedure CharTest
        String sChar
        String sChar2
        WString sWchar
        UChar[] ucString
        UChar[] ucWString
        Integer iAscii
        // C3 = 195 a9=169
    
    
        // Should have been able to do Alt+130 but that showed a music note, had to copy from https://unicode-table.com/en/00E9/
        Move "é" to sChar
        Move (Ascii(sChar)) to iAscii
        // I am surprised to see that this is value 233, the decimal value of the UTF-16BE 00E9
        // I was expecting 50089, the decimal value of UTF-8 C3A9
        
        Move sChar to sWchar
        
        Move (StringToUCharArray(sChar)) to ucString
        // Shows 195, 169 the correct UTF-8 codes
        
        Move (WStringToUCharArray(sWchar)) to ucWString
        // Shows 233, 0 the correct bytes for UTF-16LE, this is unexpected
        // I expected 0, 233 the UTF-16BE as that is how the Ascii function works...
        
        Move (Character(iAscii)) to sChar2
        Inkey windowindex
        
    End_Procedure
    
    
    Send CharTest

    PS: this is the released DataFlex 2021, version 20.0

    Kind regards
    Marco
    Marco Kuipers
    DataFlex Consultant
    28 IT Pty Ltd - DataFlex Specialist Consultancy
    DataFlex Channel Partner for Australia and Pacific region
    Adelaide, South Australia
    www.28it.com.au
