
Thread: Character and ASCII, how they are supposed to work

  1. #1
    Join Date
    Mar 2009
    Posts
    1,291

    Default Character and ASCII, how they are supposed to work

    Hi all,

    Quick recap, I did read this - https://support.dataaccess.com/Forum...acter-function

    I have a string that looks like this:

    Ā

    If you copy and paste that into Notepad and save it as UTF-8 (utf8.txt), you get the following bytes (without BOM):
    41 CC 84

    (the character "A" followed by an overline combining character, |CI$84CC, or 33996). So far so good.

    If you copy and paste that into Notepad and save it as UTF-16 LE (utf16.txt), you get the following bytes (without BOM):
    41 00 04 03

    (the character "A" followed by an overline combining character, |CI$0304, or 772). So far so good.

    When I read the UTF-8 file into a string and do a Showln, it displays the correct string:
    Code:
    String s
    Direct_Input "utf8.txt" // Reading 41 CC 84
    Readln s
    Showln (Ascii(Mid(s,1,2))) // I get 772
    Showln s // Shows the string Ā correctly
    Close_Input
    Since DF20+ deals with UTF-8 strings throughout, why do the Ascii / Character functions deal with UTF-16 code points? Why do I get 772 instead of 33996 when I call the Ascii function?

    Frank Cheng

  2. #2
    Join Date
    Feb 2009
    Posts
    5,467

    Default Re: Character and ASCII, how they are supposed to work

    Just a guess.

    Is it perhaps because you have ended up with binary data in a string?

    If you had read it into a UChar array and then converted that to a string, maybe it would be stored differently under the hood?
    Success consists of going from failure to failure without loss of enthusiasm - Winston Churchill

  3. #3
    Join Date
    Mar 2009
    Posts
    1,291

    Default Re: Character and ASCII, how they are supposed to work

    If I call StringToUCharArray on the string, I get a UChar[] with 3 elements:

    65 204 132

    Converted to hex, that is 41 CC 84 (exactly what I expected).

    It seems reasonable to think that the Ascii and Character functions should operate on UTF-8 code points!

    Code:
    Use UI
    String s
    UChar[] aa
    Move 65 to aa[0]  // 41
    Move 204 to aa[1] // CC
    Move 132 to aa[2] // 84
    Move (UCharArrayToString(aa)) to s
    Showln s
    Showln (Ascii(Mid(s,1,2))) // should be 33996 (|CI$84CC), but it's showing 772 (|CI$0304)
    InKey FieldIndex
    Frank Cheng
    Last edited by Frank Cheng; 14-Jan-2021 at 11:00 AM. Reason: Added code sample

  4. #4
    Join Date
    Feb 2009
    Location
    Brazil
    Posts
    5,446

    Default Re: Character and ASCII, how they are supposed to work

    Why do you want the ASCII code only for the 2nd byte/position?

    Ascii is meant to return the code point of the entire "character", in this case your "Ã", which is 195 (U+00C3).


    Look at the following example:

    Code:
                // https://www.unicode.org/charts/case/chart_Latin.html
                // 
                // Ế - 1EBE - 7870 
                // ế - 1EBF - 7871
                // Ề - 1EC0 - 7872
                // ề - 1EC1 - 7873 
                // Ể - 1EC2 - 7874
                // ể - 1EC3 - 7875
                // Ễ - 1EC4 - 7876
                // ễ - 1EC5 - 7877
                // Ệ - 1EC6 - 7878
                // ệ - 1EC7 - 7879 
                
                Move "ẾếỀềỂểỄễỆệ" to sValue 
                
                For i from 1 to (Length(sValue)) 
                    Show ("sValue = " + Mid(sValue, 1, i )) 
                    Showln (" | Ascii(sValue) = " + String(Ascii(Mid(sValue, 1, i ))))
                Loop
    This will produce the expected Unicode code point for each "char":
    Code:
    sValue = Ế | Ascii(sValue) = 7870
    sValue = ế | Ascii(sValue) = 7871
    sValue = Ề | Ascii(sValue) = 7872
    sValue = ề | Ascii(sValue) = 7873
    sValue = Ể | Ascii(sValue) = 7874
    sValue = ể | Ascii(sValue) = 7875
    sValue = Ễ | Ascii(sValue) = 7876
    sValue = ễ | Ascii(sValue) = 7877
    sValue = Ệ | Ascii(sValue) = 7878
    sValue = ệ | Ascii(sValue) = 7879


    Now, what I don't understand in your sample is WHY your Mid(s,1,2) actually returned something, as your string has only a single Unicode "char". The 2nd position simply "should not exist".

    Mid("Ã", 1, 2) should return nothing, in my opinion, as Mid should no longer count byte positions.
    • If {position} is greater than the length of {string-value}, the function will return an empty string.
    And it does, if I hardcode it:

    Code:
        Showln ('Mid("Ã", 1, 2) = ' + Mid("Ã", 1, 2))
        Showln ('Ascii(Mid("Ã", 1, 2)) = ' + String(AscII(Mid("Ã", 1, 2))))
    Results in
    Code:
    Mid("Ã", 1, 2) = 
    Ascii(Mid("Ã", 1, 2)) = 0
    As I expected. Now, why are you getting something when you read it from the file? Something seems not right. Maybe because your file has no BOM, it is being read as ANSI instead of UTF-8?
    Last edited by Samuel Pizarro; 14-Jan-2021 at 03:59 PM.
    Samuel Pizarro

  5. #5

    Default Re: Character and ASCII, how they are supposed to work

    Samuel,

    "Ã" <> "Ā"

    I agree with your "why does it return anything for the second position" part though.

    If you get the Ascii value of the first character, it returns 65.

    E.g.:

    Code:
    Use Windows.pkg
    Use UI

    Procedure Test
        String s
        UChar[] aa
        Move 65 to aa[0]  // 41
        Move 204 to aa[1] // CC
        Move 132 to aa[2] // 84
        Move (UCharArrayToString(aa)) to s
        Showln s
        Showln (Ascii(Mid(s,1,1))) // first position only: shows 65 (|CI$41), the code point of "A"
        InKey FieldIndex
    End_Procedure

    Send Test

    65
    --
    Wil

  6. #6
    Join Date
    Feb 2009
    Location
    Brazil
    Posts
    5,446

    Default Re: Character and ASCII, how they are supposed to work

    "Ã" <> "Ā"
    I should have increased my browser font size. I really saw "A"+"~" there...

    Ok, hehehe, I keep learning about this every day. I would like to get some comments from DAW on this one. Unicode is a nightmare.

    It seems the same resulting letter can be encoded in different ways. When I pasted your Ā into the Studio editor, I realized it is a "composed" type: if you paste it and use the right arrow key to move the cursor, you need 2 keystrokes to get to the next position.

    So his Ā is a combination of 2 different code points (U+0041 / U+0304):
    https://apps.timwhitlock.info/unicode/inspect?s=A%CC%84


    There is another code point for the same graphic representation, "Ā", which is a single code point (U+0100):
    https://apps.timwhitlock.info/unicode/inspect/hex/0100
    This last one has length = 1.

    But his one has length = 2 (2 different code points).
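
The two forms described above can be compared outside DataFlex. The following is a minimal sketch in Python (used here only as a neutral illustration, not something from the original post), relying on the standard `unicodedata` module. The variable names are made up for the example.

```python
import unicodedata

decomposed = "A\u0304"   # U+0041 followed by U+0304 COMBINING MACRON
precomposed = "\u0100"   # U+0100 LATIN CAPITAL LETTER A WITH MACRON

# Same glyph on screen, but different code point sequences.
lengths = (len(decomposed), len(precomposed))
equal_raw = decomposed == precomposed

# NFC composes base + combining mark where a precomposed form exists;
# NFD does the reverse and splits the precomposed form again.
equal_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
equal_nfd = unicodedata.normalize("NFD", precomposed) == decomposed

print(lengths, equal_raw, equal_nfc, equal_nfd)
```

The point of the sketch is only that the two strings compare as different until they are brought to the same normalization form. Whether DataFlex offers an equivalent normalization call is not covered in this thread.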

    Now, I am not an expert and am still learning, but I guess Frank is mixing concepts.
    Code points (which is what our DataFlex Ascii() returns) are different from the UTF-8 / UTF-16 hex encoding representations.

    Showln (Ascii(Mid(s,1,2))) // should be 33996 (|CI$84CC), but it's showing 772 (|CI$0304)


    He was expecting to get 33996 (hex 84CC), but this is not the code point. It is just the UTF-8 encoding of code point U+0304 (the bytes CC 84, read here with the low byte first).
    The actual code point (column "Code" in the link I provided) for "COMBINING MACRON" is hex 0304, which in decimal is 772, exactly what the Ascii function gives us back.
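
To make the difference between code point and encoding concrete, here is a small Python sketch (again only an illustration outside DataFlex; the variable names are invented for the example). It derives both numbers from the thread, 772 and 33996, from the same two-code-point string.

```python
s = "A\u0304"  # "A" followed by COMBINING MACRON

code_points = [ord(ch) for ch in s]     # what a code-point function reports
utf8_bytes = s.encode("utf-8")          # bytes as saved in utf8.txt
utf16_bytes = s.encode("utf-16-le")     # bytes as saved in utf16.txt

# Reading the macron's two UTF-8 bytes as one little-endian integer gives
# the 33996 that was expected; it is a byte pattern, not a code point.
macron_utf8_as_int = int.from_bytes(utf8_bytes[1:], "little")

print(code_points)
print(utf8_bytes.hex(" "), "|", utf16_bytes.hex(" "))
print(macron_utf8_as_int, hex(macron_utf8_as_int))
```

The code point list is the same whichever encoding the file was saved in; only the byte sequences differ.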


    Now, putting all this in practical terms: it is going to be fun handling these kinds of things if we need to manipulate strings internally. I was not aware that the same resulting "char" could be produced in different ways.

    having fun with unicode ...
    Samuel Pizarro

  7. #7
    Join Date
    Mar 2009
    Posts
    1,291

    Default Re: Character and ASCII, how they are supposed to work

    Hi Samuel,

    You are right. I was confusing code point with encoding. Your explanation makes sense.

    Thanks,

    Frank Cheng

  8. #8

    Default Re: Character and ASCII, how they are supposed to work

    Samuel,

    Good stuff.

    Yes, unicode is both a blessing and a curse, it can be very confusing at times.

    Looking back at one of those favorite examples here at the forum, "Hello" in Thai, สวัสดี (an example I somehow have some affiliation with):

    At first sight "สวัสดี" looks like it is 4 characters, but in reality there are 6.

    As you will also learn in Thai language lessons, the wo wen (ว) and mai han akaat ( ั) together make วั , same for daw dek (ด) and sara ee ( ี), together they make "ดี". Mai han akaat and sara ee are vowels, but there are also tone markers used as compound characters.

    Code:
        String s t
        Integer i
        
        Move "สวัสดี" to s
        Showln s
        Showln "Length " (Length(s))
        For i from 1 to (Length(s))
            Move (Mid(s,1,i)) to t
            Show t " - "
            Showln (Ascii(t))
        Loop
        InKey FieldIndex
    This will output:
    สวัสดี
    Length 6
    ส - 3626
    ว - 3623
    - 3633
    ส - 3626
    ด - 3604
    - 3637
    Apart from DataFlex not being able to print mai han akaat and sara ee on their own, it looks fine.
    BTW, it is not unexpected that those 2 vowels don't print, as they are only used as combining characters, never standalone.
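
The split between the four visible characters and the six code points can be checked with a short Python sketch (an illustration outside DataFlex, using the standard `unicodedata` module; the names `rows`, `marks` and `bases` are invented here). It assumes the Unicode general category "Mn" (nonspacing mark) is a reasonable way to spot the two vowels that attach to the preceding consonant.

```python
import unicodedata

# The Thai greeting from the post, written as escapes so the code points are explicit.
s = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"

# Pair each code point with its Unicode general category.
rows = [(ord(ch), unicodedata.category(ch)) for ch in s]

# "Mn" marks are drawn on the preceding base character.
marks = [cp for cp, cat in rows if cat == "Mn"]
bases = [cp for cp, cat in rows if cat != "Mn"]

for cp, cat in rows:
    print(cp, cat)
print(len(bases), "base characters,", len(marks), "combining marks")
```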

    To illustrate, if I want to type them here on my Thai keyboard, then I have to type a space character before typing these vowels in order for them to show up.

    --
    Wil

  9. #9
    Join Date
    Jan 2009
    Location
    Richmond, VA
    Posts
    5,854

    Default Re: Character and ASCII, how they are supposed to work

    Yes, unicode is both a blessing and a curse, it can be very confusing at times.
    To paraphrase an old joke...

    "Just add Unicode support", she said
    "It'll be great", she said

    Best regards,

    -SWM-

  10. #10
    Join Date
    Feb 2009
    Location
    Brazil
    Posts
    5,446

    Default Re: Character and ASCII, how they are supposed to work

    Just that... simple isn't it !? hehehe
    Samuel Pizarro
