A little on charAt(), charCodeAt(), fromCharCode(), and at()
Spoiler alert: JavaScript has a Unicode problem.
charAt
The charAt() method of the String object takes an optional index parameter and returns the character located at that position.
The index parameter must be an integer between 0 and string.length - 1. You can also count backwards yourself and grab the penultimate index and so on, that's ok. You can even pass the string '5' and charAt() will convert it to the number 5!
If no index is provided, the default is 0 (so the first character will be returned).
If you try to use a floating point number rather than an integer (7.45 rather than just 7), then only the integer part will be taken into account.
Alternatively, if your index parameter can't be converted to an integer, charAt() will also apply the default 0.
Lastly, if the integer provided is out of the range of the string - that is, if you try to get the character at index 50 but the string is only 20 characters long - then an empty string will be returned.
To make this easier to visualize:
const string = 'Well, look!'
console.log(string.charAt())
// output: "W"
const index = 10
console.log(string.charAt(index))
// output: "!"
const indexString = '10'
console.log(string.charAt(indexString))
// output: "!"
const floatingIndex = 1.14
console.log(string.charAt(floatingIndex))
// output: "e"
const penultimateIndex = string.length - 2
console.log(string.charAt(penultimateIndex))
// output: "k"
const outOfRange = 99
console.log(string.charAt(outOfRange))
// output: ""
const invalidIndex = 'a'
console.log(string.charAt(invalidIndex))
// output: "W"
Ghosts have characters too
When I said charAt() returned a character, I wasn't lying... but it must be mentioned that, by definition, a character here is ONE UTF-16 code unit long, and some characters can't be represented in a single code unit.
Let's take the string 'G' (for Ghost).
let string = 'G'
console.log(string.length)
// output: 1
You see one character, and so does the computer.
Now let's spice things up a bit and make our string the ghost emoji 👻:
let string = '👻'
console.log(string.length)
// output: 2
Well, would you look at that! A single ghost 👻 is actually two UTF-16 code units long! You see a single entity, but the computer sees two.
Lesson of the day: String length is NOT equal to character count.
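By the way, if you want a count closer to what your eyes see, one quick trick is to spread the string into an array first, since the string iterator walks code points instead of code units (it still won't handle multi-code-point graphemes, but it handles our ghost just fine):
const ghost = '👻'
console.log(ghost.length)
// output: 2
console.log([...ghost].length)
// output: 1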
That means your charAt() output might look a little different...
let string = '👻'
console.log(string.charAt(), string.charAt(1), string.charAt(2))
// output: "�" "�" ""
So we can log the code unit at index 0, the code unit at index 1, and from then on we get an empty string because 2 is out of the range of the string.
What a weird little �! What is it? Before we move on to charCodeAt(), let me show you something interesting. Look what happens if you remove that console.log and go straight to the source:
'👻'.charAt()
// output: '\uD83D'
'👻'.charAt(1)
// output: '\uDC7B'
or even
let string = '👻'
string.charAt()
//output: '\uD83D'
string.charAt(1)
//output: '\uDC7B'
We've now got the UTF-16 code units, in hex.
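And if you glue those two halves back together, you get the ghost back. A quick check that they really are two halves of the same character:
const ghost = '👻'
ghost.charAt(0) + ghost.charAt(1)
// output: '👻'
ghost.charAt(0) + ghost.charAt(1) === ghost
// output: true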
Here are some resources
They've definitely helped me understand this Unicode stuff a little better:
- Mathias Bynens has given quite a few talks on the topic. The slides are embedded in the post on his website, or you can check them out separately here
- Axel Rauschmayer has hosted workshops that cover strings and grapheme clusters (those sets of code points), and this part of his free online book is exactly about that.
- Last but not least, the ECMAScript language specification.
Well, let's move on to charCodeAt() to help it make more sense.
charCodeAt
A UTF-16 code unit is an integer ranging from 0 to 65535, and that integer is what charCodeAt() will return.
It works similarly to charAt() in that it will take an index parameter, but this time around an index that's out of range will return NaN.
To use the same examples from above to help see the difference better:
const string = 'Well, look!'
console.log(string.charCodeAt())
// output: 87
const index = 10
console.log(string.charCodeAt(index))
// output: 33
const indexString = '10'
console.log(string.charCodeAt(indexString))
// output: 33
const floatingIndex = 1.14
console.log(string.charCodeAt(floatingIndex))
// output: 101
const penultimateIndex = string.length - 2
console.log(string.charCodeAt(penultimateIndex))
// output: 107
const outOfRange = 99
console.log(string.charCodeAt(outOfRange))
// output: NaN
const invalidIndex = 'a'
console.log(string.charCodeAt(invalidIndex))
// output: 87
So now those are the UTF-16 values of our characters: W (uppercase) is 87, e (lowercase) is 101, and so on.
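If you'd like to see all of them at once, here's a little sketch that splits the string into its code units and maps each one through charCodeAt():
const string = 'Well, look!'
string.split('').map(char => char.charCodeAt(0))
// output: [87, 101, 108, 108, 44, 32, 108, 111, 111, 107, 33]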
What happens to the ghost?
We know our friendly ghost is two characters long, so let's try checking the characters at index 0, 1 and 2 (which is out of range). Note that the output will be in decimal, not hex.
const string = '👻'
console.log(string.charCodeAt())
// output: 55357
const index = 1
console.log(string.charCodeAt(index))
// output: 56443
const indexString = '1'
console.log(string.charCodeAt(indexString))
// output: 56443
const outOfRange = 2
console.log(string.charCodeAt(outOfRange))
// output: NaN
const invalidIndex = 'a'
console.log(string.charCodeAt(invalidIndex))
// output: 55357
And then you can go the other way around with our next stop: fromCharCode().
fromCharCode
fromCharCode() returns a string built from the UTF-16 code units you feed it, as many as you want. Note that it's a static method of the String object rather than a method on string instances, so you'll always write it as String.fromCharCode() rather than call it on a string you created.
In practice:
String.fromCharCode(87, 101, 108, 108, 44, 32, 108, 111, 111, 107, 33)
//output: 'Well, look!'
It'll work with the string equivalents, too:
String.fromCharCode('87', '101', '108', '108', '44', '32', '108', '111', '111', '107', '33')
//output: 'Well, look!'
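And since it accepts as many arguments as you like, you can also spread an array of code units into it. A small sketch, assuming an array holding the same codes as above:
const codes = [87, 101, 108, 108, 44, 32, 108, 111, 111, 107, 33]
String.fromCharCode(...codes)
// output: 'Well, look!'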
It won't work if the char code is invalid, though. Anything that can't be converted to a number is treated as 0, so you'll just get back the NUL character, which looks like an empty value:
String.fromCharCode('a')
//output: '\x00'
And so that our Ghost isn't left behind, you can do that with it too!
Playing around with the ghost
Do note that if you input one of the decimal code units we were getting above, you'll get its hex counterpart back! Since a lone surrogate isn't something that can be displayed on its own, it'll be shown Unicode escaped, with \u before it.
String.fromCharCode(55357)
//output: '\uD83D'
String.fromCharCode(56443)
//output: '\uDC7B'
String.fromCharCode(55357, 56443)
//output: '👻'
// Slap a 0x in front of the non-escaped hex and see the magic!
String.fromCharCode(0xD83D,0xDC7B)
//output: '👻'
Boo~!
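And to close the loop between the two methods, here's a small sketch that rebuilds the ghost straight from its own charCodeAt() output:
const ghost = '👻'
String.fromCharCode(ghost.charCodeAt(0), ghost.charCodeAt(1))
// output: '👻'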
What about at?
at() is pretty much another way of writing charAt(), sure, but there's a difference that means the world to us devs who like typing as little as possible.
It allows us... THIS:
const string = 'Well, look!'
console.log(string.at(-1))
// output: "!"
Compare it with charAt(), which returns an empty string because -1 is outside the string's range:
const string = 'Well, look!'
console.log(string.charAt(-1))
// output: ""
console.log(string.charAt(string.length - 1))
// output: "!"
Since at() takes negative values out of the box, it does make for a cleaner-looking option.
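One thing to keep in mind: at() still works on UTF-16 code units, so our ghost friend behaves exactly the same here. A quick check:
const ghost = '👻'
ghost.at(0)
// output: '\uD83D'
ghost.at(-1)
// output: '\uDC7B'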
Addendum
We can access individual characters in a string directly by their index, using bracket notation and treating the string like an array:
let string = 'Witness me!'
string[0]
//output: 'W'
string[0].toLowerCase()
//output: 'w'
string.charAt()
//output: 'W'
string.charAt().toLowerCase()
//output: 'w'
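One small difference worth noting: with bracket notation, an out-of-range index gives you undefined instead of the empty string that charAt() returns:
let string = 'Witness me!'
string[99]
// output: undefined
string.charAt(99)
// output: ''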
Put them into practice
If you do wish to practice them, here's one interesting option:
This 7kyu CodeWars kata goes over the explanation and offers an interesting challenge decoding a secret message!
And that's it for these four string methods! I hope I was of help and that we've learned something together.