How to interact with Number Systems and Encoding Schemes in PHP.

In this blog, I’ll walk through how to work with data in various number systems and different encoding schemes. I’ve always loved the process of being creative and building stuff when writing code, but I often find binary and hex notations confusing. I felt like I needed a recap on number systems and encoding schemes to fill in some blanks, which was my main motivation for this article.

My goal with this article is to make it heavy on code examples so that it’s easy and quick to reference when I, or anyone out there, need a quick reminder on how to work with number systems and encoding schemes in PHP. I’ll start by going through number systems, then text encodings, and lastly, I’ll go a little bit deeper into UTF-8 and its byte sequences, finishing off with some fun emojis. Let’s get right to it!

Number Systems

In this section, I’ll go through the number systems that are often used in computer-related contexts: Base-2 (Binary), Base-8 (Octal) and Base-16 (Hexadecimal or Hex). Not to forget Base-10 (Decimal), which is what we’ve used to count on our fingers since the early days of humanity.

A number system has a base or radix that indicates how many digits or symbols are used to represent numbers. Base-2 has two symbols (0 and 1), Base-8 has eight (0-7) and Base-16 has sixteen (0-9 and A-F, in capital or lowercase).

Since a larger base has more symbols to choose from for each digit, the same number needs fewer digits in a larger base and more digits in a smaller one. Take for example the decimal number 127, which is 7f in hex but 1111111 in binary.
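To make the length difference concrete, here’s a small sketch that counts how many digits the same value needs in each base. The value 65535 is just an example I picked for illustration.

```php
<?php
// Example value chosen for illustration (the largest 16-bit number)
$value = 65535;

$digits = [
    'binary'  => strlen(decbin($value)),  // "1111111111111111"
    'octal'   => strlen(decoct($value)),  // "177777"
    'decimal' => strlen((string) $value), // "65535"
    'hex'     => strlen(dechex($value)),  // "ffff"
];

foreach ($digits as $base => $count) {
    echo $base . ': ' . $count . " digits\n";
}
```

The larger the base, the shorter the output: 16 digits in binary, 6 in octal, 5 in decimal and 4 in hex.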

Base-2 (Binary)

Let's have a look at how we can work with binary in PHP. PHP doesn't have a native binary type, but internally all strings are byte arrays. Integers are normally 8 bytes on a 64-bit system, but you can check the value of the constant PHP_INT_SIZE on your system to verify.

// binary notation (0b)
// underscore can be added to integers in PHP to improve readability
$int = 0b11111111;  // int(255)
$int = 0b1111_1111; // int(255)

// bitwise operators
$int = 0b1010 & 0b0011; // int(2)   (AND: 0010)
$int = 0b1010 | 0b0011; // int(11)  (OR:  1011)
$int = 0b1010 ^ 0b0011; // int(9)   (XOR: 1001)
$int = ~0b1111;         // int(-16) (NOT: 1111...0000)

// bit shifting
$int = 8 << 3;     // int(64) (00001000 -> 01000000)
$int = 8 >> 3;     // int(1)  (00001000 -> 00000001)
$int = 8 >> 3 & 1; // int(1)  get bit 3 (4th from the right): shift it to the LSB position, then mask

// convert binary formatted string to decimal
$int    = bindec('11111111'); // int(255)
$string = base_convert('11111111', 2, 10); // string(3) "255"

// convert decimal to binary formatted string
// use sprintf() to add padding
$string = decbin(255); // string(8) "11111111"
$string = decbin(127); // string(7) "1111111" (7 bit)
$string = base_convert('127', 10, 2); // string(7) "1111111" (7 bit)
$string = sprintf('%08b', '127'); // string(8) "01111111" (8 bit)
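
The shift-and-mask idiom from the bit shifting lines can be wrapped in a small helper. This is only a sketch: getBit() is a hypothetical function name I made up, not a PHP built-in.

```php
<?php
// Hypothetical helper (not a PHP built-in): read the bit at a zero-based
// position counted from the right, where position 0 is the LSB.
function getBit(int $value, int $position): int
{
    return ($value >> $position) & 1;
}

$value = 0b1000; // int(8)

$lsb  = getBit($value, 0); // int(0), the rightmost bit
$bit3 = getBit($value, 3); // int(1), the fourth bit from the right

// setting and clearing a single bit with a mask
$mask    = 1 << 1;         // 0b0010
$set     = $value | $mask; // int(10) (1010)
$cleared = $set & ~$mask;  // int(8)  (1000)
```

The explicit parentheses aren’t strictly needed, since >> binds tighter than & in PHP, but they make the intent easier to read.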

Base-8 (Octal)

Octal numbers are used in file permissions on Unix-like systems. You've probably seen 0755 (or 1755 with a sticky bit) before when working in the terminal browsing and managing files.

// octal notation (explicit 0o prefix since PHP 8.1, a plain leading 0 like 0644 also works)
$int = 0o644; // int(420)
$int = 0o777; // int(511)

// octal string notation (\)
// bytes for characters in current charset in octal format
$string = "\101"; // string(1) "A"
$string = "\101\102\103"; // string(3) "ABC"
$string = "\360\237\245\263"; // string(4) "🥳"

// convert octal to decimal
$int    = octdec('777'); // int(511)
$string = base_convert('777', 8, 10); // string(3) "511"

// convert decimal to octal
$string = decoct(511); // string(3) "777"
$string = base_convert('511', 10, 8); // string(3) "777"
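
Since each octal digit maps to exactly 3 bits, a permission mode can be read as three read/write/execute groups. Here’s a minimal sketch of that idea, assuming only the lower 9 permission bits matter. modeToString() is a hypothetical helper name, not something PHP ships with.

```php
<?php
// Hypothetical helper (not a PHP built-in): turn a permission mode such as
// 0755 into the "rwxr-xr-x" notation shown by `ls -l`.
// Special bits (setuid, setgid, sticky) are ignored in this sketch.
function modeToString(int $mode): string
{
    $out = '';
    // walk the three 3-bit groups: owner, group, others
    foreach ([6, 3, 0] as $shift) {
        $bits = ($mode >> $shift) & 0b111;
        $out .= ($bits & 0b100) ? 'r' : '-';
        $out .= ($bits & 0b010) ? 'w' : '-';
        $out .= ($bits & 0b001) ? 'x' : '-';
    }
    return $out;
}

// the legacy leading-0 octal prefix is used so this runs on older PHP versions too
$dir  = modeToString(0755); // string(9) "rwxr-xr-x"
$file = modeToString(0644); // string(9) "rw-r--r--"
```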

Base-16 (Hexadecimal, Hex)

Hex is often used to make a shorter representation of a binary number. One hex digit equals 4 bits, or a nibble, so you need two of them to represent a byte. Hex is often used when representing colors in an RGB (Red, Green and Blue) color model, where each channel is represented in 1 byte, ranging from 0 to 255. The color white, or max intensity on each color channel (255, 255, 255), is thus represented as #FFFFFF.

// hex notation (0x)
$int = 0xFF; // int(255)
$int = 0x0;  // int(0)

// hex string notation (\x)
// bytes for characters in current charset in hex format
$string = "\x41"; // string(1) "A"
$string = "\x41\x42\x43"; // string(3) "ABC"
$string = "\xf0\x9f\xa5\xb3"; // string(4) "🥳"

// convert string to bytes in hex format for characters in current charset
$string = bin2hex('ABC'); // string(6) "414243"
$string = bin2hex('🥳');  // string(8) "f09fa5b3"

// convert bytes in hex format for characters in current charset to string
$string = hex2bin('414243'); // string(3) "ABC"
$string = hex2bin('f09fa5b3'); // string(4) "🥳"

// convert hex to decimal
$int    = hexdec('FF'); // int(255)
$string = base_convert('FFFFFF', 16, 10); // string(8) "16777215" (3 bytes)

// convert decimal to hex
$string = dechex(127); // string(2) "7f"
$string = base_convert('255', 10, 16); // string(2) "ff"

// convert hex to binary (hex->decimal, decimal->binary)
$string = decbin(hexdec('ff')); // string(8) "11111111"
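
To tie this back to the RGB color example, here’s a sketch that converts between a hex color and its three channels. hexToRgb() and rgbToHex() are hypothetical helper names, and the sketch assumes a well-formed 6-digit color with no validation.

```php
<?php
// Hypothetical helper: split a "#RRGGBB" color into its three channels.
// Assumes exactly 6 hex digits after the optional "#".
function hexToRgb(string $color): array
{
    $hex = ltrim($color, '#');
    // two hex digits per channel, so split into 2-character chunks
    return array_map('hexdec', str_split($hex, 2));
}

// Hypothetical helper: the reverse direction, zero-padded to 2 digits per channel.
function rgbToHex(int $r, int $g, int $b): string
{
    return sprintf('#%02X%02X%02X', $r, $g, $b);
}

$white  = hexToRgb('#FFFFFF');   // array(3) [255,255,255]
$orange = rgbToHex(255, 165, 0); // string(7) "#FFA500"
```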

Encoding Schemes

Now that we’ve covered the most common number systems, let’s take a look at some common text encoding schemes used in web development. After that, we’ll take a closer look at UTF-8 in the last part of the article.

Base-64

Base-64 encoding has the word base in its name but it’s actually not a number system but rather an encoding scheme. In Base-64, 6 bits are used to represent 64 ASCII printable characters. Base-64 is used to represent binary as printable characters and is great for transferring data on text-based protocols or where binary data needs to be stored as text.

Due to its design, when you Base-64 encode binary data you increase the original data by about 33% in size. That’s because each input byte carries 8 bits while each Base-64 character carries only 6 bits, so every 3 input bytes become 4 output characters.

In addition, an encoded Base-64 string always comes in groups of 4 bytes (or characters), so if the length of the encoded string is not a multiple of 4, padding is added to the end of the string in the form of the character "=", making the output even larger.
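
The size overhead can be checked with a bit of arithmetic. The sketch below assumes standard padded Base-64 as produced by base64_encode(). base64Length() is a hypothetical helper name of mine.

```php
<?php
// Hypothetical helper: every started group of 3 input bytes becomes
// 4 output characters, padding included.
function base64Length(int $inputBytes): int
{
    return 4 * intdiv($inputBytes + 2, 3);
}

// compare the predicted length with what PHP actually produces
foreach ([1, 2, 3, 100] as $n) {
    $encoded = base64_encode(str_repeat('F', $n));
    printf("%3d bytes -> %3d characters (predicted %d)\n", $n, strlen($encoded), base64Length($n));
}
```

For 100 input bytes the output is 136 characters, so the overhead ends up slightly above the theoretical 33% because of the padding.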

Let’s go through an example to see how the encoding scheme works. In the example, we’ll Base-64 encode the binary 01000110 which is the letter F in ASCII.

// base64 character set
$charset = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// 1. get binary for input
$codepoint = sprintf('%08b', ord('F')); // string(8) "01000110"

// 2. divide 8 bits binary to 6 bit chunks, 6 bits are needed for 64 combinations
$chunks = str_split('01000110', 6); // array(2) ["010001","10"]

// 3. pad the last chunk with zeros on the right if it's shorter than 6 bits
$chunks = array_map(fn ($bin) => str_pad($bin, 6, '0'), $chunks); // array(2) ["010001","100000"]

// 4. get base64 encoded characters from base64 charset
$base64Chars = array_map(fn ($bin) => $charset[bindec($bin)], $chunks); // array(2) ["R","g"]

// 5. base64 encoded strings are always groups of 4 bytes, if the output isn't a multiple of 4, then add padding
$base64Chars = str_pad(implode('', $base64Chars), 4, '='); // string(4) "Rg=="

// check output with PHP native function
$string = base64_encode('F'); // string(4) "Rg=="
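
The walkthrough above handles a single input byte. As a rough sketch, the same five steps can be generalized to input of any length. base64EncodeManual() is a hypothetical name, and in real code you’d of course just call base64_encode().

```php
<?php
// Hypothetical helper that follows the same five steps for any input length.
function base64EncodeManual(string $input): string
{
    $charset = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

    if ($input === '') {
        return '';
    }

    // 1. get all input bytes as one long string of bits
    $bits = '';
    foreach (unpack('C*', $input) as $byte) {
        $bits .= sprintf('%08b', $byte);
    }

    // 2. + 3. split into 6 bit chunks and zero-pad the last chunk on the right
    $chunks = array_map(fn ($bin) => str_pad($bin, 6, '0'), str_split($bits, 6));

    // 4. look up each chunk in the base64 charset
    $encoded = implode('', array_map(fn ($bin) => $charset[bindec($bin)], $chunks));

    // 5. pad with "=" up to the next multiple of 4 characters
    $length = 4 * intdiv(strlen($encoded) + 3, 4);

    return str_pad($encoded, $length, '=');
}

$one  = base64EncodeManual('F');             // string(4) "Rg=="
$same = base64EncodeManual('Hello, World!') === base64_encode('Hello, World!'); // bool(true)
```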

URL Encoding (Percent-encoding)

In URL Encoding (also called Percent-encoding), you encode each byte of a character by using a % sign followed by two hex digits.

The urlencode() / urldecode() functions produce the application/x-www-form-urlencoded format that HTML forms use, which goes back to the older RFC 1738 era, and can be used for form data and backward compatibility. The rawurlencode() / rawurldecode() functions follow the newer RFC 3986 specification, which is the more modern and recommended specification to use.

The two mostly differ in how they treat the tilde (~) character and spaces. The older urlencode() function uses + for spaces and encodes tildes to %7E. The newer rawurlencode() doesn’t encode the tilde character but encodes spaces as %20.

urlencode('~ '); // string(4) "%7E+"
rawurlencode('~ '); // string(4) "~%20"
urldecode('%7E+~%20'); // string(4) "~ ~ "
rawurldecode('%7E+~%20'); // string(4) "~+~ "

// multibyte utf-8 characters
urlencode('あ'); // string(9) "%E3%81%82"
rawurlencode('あ'); // string(9) "%E3%81%82"
urlencode('🥳'); // string(12) "%F0%9F%A5%B3"
rawurlencode('🥳'); // string(12) "%F0%9F%A5%B3"
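
To show what happens under the hood, here’s a minimal sketch of percent-encoding done by hand. percentEncode() is a hypothetical helper, and it assumes the RFC 3986 "unreserved" set (letters, digits and - . _ ~) is what should be left untouched.

```php
<?php
// Hypothetical helper: replace every byte outside the unreserved set
// with "%" followed by the byte value as two uppercase hex digits.
function percentEncode(string $input): string
{
    // without the /u modifier the pattern works byte by byte,
    // so a multibyte character becomes one %XX group per byte
    return (string) preg_replace_callback(
        '/[^A-Za-z0-9\-._~]/',
        fn (array $match) => sprintf('%%%02X', ord($match[0])),
        $input
    );
}

$a = percentEncode('~ ');  // string(4) "~%20"
$b = percentEncode('あ');  // string(9) "%E3%81%82"
$c = percentEncode('🥳'); // string(12) "%F0%9F%A5%B3"
```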

HTML Entity Encoding

Entities in HTML come in three forms. Named entities for special characters are written as &name; (for example &nbsp; for a non-breaking space), while any character can be written as a numeric reference, either &#decimal; (&#65; for A) with the codepoint in decimal format or &#xhex; (&#x1F973; for 🥳) with the codepoint in hexadecimal.

HTML entity encoding is used to prevent issues like XSS (cross-site scripting). But, it can also be used to escape HTML entities to stop the HTML interpreter from parsing HTML special characters like < and > when you want to display them as characters and not entities in your HTML.

htmlspecialchars() / htmlspecialchars_decode() only encode and decode HTML special characters (& " ' < >) and are what you need if you want to escape user input to prevent cross-site scripting. Note that single quotes are only converted when the ENT_QUOTES flag is set, which is the default since PHP 8.1. htmlentities() / html_entity_decode(), on the other hand, convert all applicable characters to and from HTML entities.

If a character set is not set as an argument in the encoding functions, it defaults to whatever is set in the php.ini default_charset configuration option, in most cases, UTF-8.

// special characters (&;)
htmlspecialchars('<©á&'); // "&lt;©á&amp;"
htmlentities('<©á&');     // "&lt;&copy;&aacute;&amp;"
htmlspecialchars_decode('&lt;&copy;&aacute;&amp;'); // "<&copy;&aacute;&"
html_entity_decode('&lt;&copy;&aacute;&amp;');      // "<©á&"

// decimal (&#;)
htmlspecialchars_decode('&#12354;'); // "&#12354;" (no decoding)
html_entity_decode('&#12354;');      // "あ"

// hex (&#x;)
htmlspecialchars_decode('&#x3042;'); // "&#x3042;" (no decoding)
html_entity_decode('&#x3042;');      // "あ"
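
For the XSS use case, here’s a small sketch of escaping user input before it’s put into HTML. escapeHtml() is a hypothetical wrapper name, and passing the flags and charset explicitly is just my way of not depending on the PHP version or php.ini settings.

```php
<?php
// Hypothetical wrapper around htmlspecialchars() with explicit flags:
// ENT_QUOTES also converts single quotes, ENT_SUBSTITUTE replaces
// invalid byte sequences instead of returning an empty string.
function escapeHtml(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

$userInput = '<script>alert("xss")</script>';

$html = '<p>' . escapeHtml($userInput) . '</p>';
// "<p>&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;</p>"

$quote = escapeHtml("it's"); // "it&#039;s"
```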

Strings and UTF-8

The ASCII charset uses 7 bits and contains 128 codepoints, 95 of them printable. Extended ASCII character sets like the ISO-8859 family use the last bit to add another 128 codepoints with region-specific characters, making up a charset of a full byte. In UTF-8, depending on the codepoint, 1 to 4 bytes are used, which can represent all of the 1,112,064 valid codepoints in the Unicode charset.

UTF-8 is ASCII compatible but not compatible with extended ASCII character sets like the ISO-8859-X family. The first charset in the family, ISO-8859-1 is a single-byte character set that contains Nordic characters. Since it’s a single-byte charset you could output a Nordic character like “Ä” with only 1 byte while it would be 2 bytes in UTF-8.

strlen(mb_convert_encoding('ä', 'ISO-8859-1', 'UTF-8')); // int(1)
strlen('ä'); // int(2) (UTF-8)

Strings in PHP are internally represented as arrays of bytes. This means that you can access the individual bytes of a string by using array brackets like so '🥳'[3]. We can also use the strlen() function to check not how many characters, but how many bytes a string is. Let’s take a look at an example of 1, 2, 3 and 4-byte UTF-8 characters.

// strings are byte arrays
$a = 'A';
$b = 'å';
$c = 'あ';
$d = '🥳';

// check how many bytes the string is
strlen($a); // int(1)
strlen($b); // int(2)
strlen($c); // int(3)
strlen($d); // int(4)

// use ord() to get the byte at given index as a 8 bit unsigned integer
$aBytes = [ord($a[0])]; // [65]
$bBytes = [ord($b[0]), ord($b[1])]; // [195,165]
$cBytes = [ord($c[0]), ord($c[1]), ord($c[2])]; // [227,129,130]
$dBytes = [ord($d[0]), ord($d[1]), ord($d[2]), ord($d[3])]; // [240,159,165,179]

// unpack can also be used to create an array of bytes in desired format
unpack('C*', $d); // array(4) [240,159,165,179] // 8 bit unsigned integers
unpack('H*', $d); // array(1) ["f09fa5b3"] // hex
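
Because strings are byte arrays, the byte-oriented functions and the multibyte-aware mb_* functions give different answers for the same string. Here’s a short sketch of the difference, assuming the mbstring extension is loaded and the source file is saved as UTF-8.

```php
<?php
$string = 'åあ🥳';

$bytes      = strlen($string);             // int(9) (2 + 3 + 4 bytes)
$characters = mb_strlen($string, 'UTF-8'); // int(3) (3 codepoints)

// cutting on a byte boundary can split a character in half
$firstByte = substr($string, 0, 1);             // only the first byte of "å"
$firstChar = mb_substr($string, 0, 1, 'UTF-8'); // "å"

// the half character is no longer valid UTF-8
$isValid = mb_check_encoding($firstByte, 'UTF-8'); // bool(false)
```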

Unicode codepoints are usually prefixed with \u, \U or U+. In PHP, we can use the escape sequence "\u{hex}" to get a UTF-8 encoded character from a codepoint given in hexadecimal format. Let’s try it out with the value (f09fa5b3) we got on the last line from the previous example where we unpacked the string to create a hex value from the party emoji character.

$string = "\u{f09fa5b3}";
// PHP Parse error: Invalid UTF-8 codepoint escape sequence: Codepoint too large on line 1

Whoops, looks like something went wrong. Simply fetching the bytes from a string doesn’t create a valid codepoint. To understand what’s going on and why this doesn’t work, we have to look into how codepoints are stored in the UTF-8 byte sequence.

UTF-8 Codepoints

ASCII characters are 1 byte, so the codepoint and the bits that represent the character are the same. But for characters that are multiple bytes in UTF-8, they are not. In the Unicode character set, codepoints range from U+0000 to U+10FFFF. This equals 1,114,112 codepoints, but 2,048 of them (U+D800 to U+DFFF) are reserved as UTF-16 surrogates, so in total there are 1,112,064 valid codepoints.

The largest codepoint, U+10FFFF, needs only 21 bits, so it would fit in 3 bytes. Why, then, are some UTF-8 characters 4 bytes? That’s because the codepoint is injected into the UTF-8 byte sequence, where the leading bits of every byte are reserved as markers, and that leaves fewer than 8 bits per byte for the codepoint itself. Let’s take a look at an example of how we can extract the codepoint for the party emoji in UTF-8.

// UTF-8 Byte Sequence (10xxxxxx is a continuation byte)
//
// 1 byte:  0xxxxxxx
// 2 bytes: 110xxxxx 10xxxxxx
// 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
// 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

// let's get the binary representation for the party emoji
implode(' ', array_map('decbin', unpack('C*', '🥳'))); // string(35) "11110000 10011111 10100101 10110011"

// and check that it's indeed a 4 byte character
strlen('🥳'); // int(4)

// party emoji represented as a character in UTF-8
// 11110000 10011111 10100101 10110011
// 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (UTF-8 4 byte sequence)
//      000   011111   100101   110011 (codepoint for 🥳)

bindec('000011111100101110011'); // int(129395)
mb_ord('🥳'); // int(129395)
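
Going in the other direction, a codepoint can be injected into the 4 byte pattern with a few shifts and masks. This is only a sketch of the 4 byte case (U+10000 to U+10FFFF). encodeFourByteUtf8() is a hypothetical helper name, and mb_chr() does the same job for every range.

```php
<?php
// Hypothetical helper: encode a codepoint in the 4 byte range as UTF-8.
function encodeFourByteUtf8(int $codepoint): string
{
    if ($codepoint < 0x10000 || $codepoint > 0x10FFFF) {
        throw new InvalidArgumentException('Codepoint is outside the 4 byte range');
    }

    $bytes = [
        0b11110000 | ($codepoint >> 18),              // 11110xxx (top 3 bits)
        0b10000000 | (($codepoint >> 12) & 0b111111), // 10xxxxxx (next 6 bits)
        0b10000000 | (($codepoint >> 6) & 0b111111),  // 10xxxxxx (next 6 bits)
        0b10000000 | ($codepoint & 0b111111),         // 10xxxxxx (last 6 bits)
    ];

    // pack the four integers into a 4 byte string
    return pack('C*', ...$bytes);
}

$party = encodeFourByteUtf8(129395); // string(4) "🥳"
$hex   = bin2hex($party);            // string(8) "f09fa5b3"
```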

In PHP though, we don’t have to make these calculations on our own since the convenient functions ord() / chr() and mb_ord() / mb_chr() do the job for us. ord() returns the value of the first byte of a string, which equals the codepoint only for single-byte (ASCII) characters, while mb_ord() works for both single and multibyte characters. The same goes for chr() / mb_chr(), but they go the other way and return the character for a given codepoint.

// 1 byte (capital letter A, codepoint 65, byte value 65)
$str = 'A';
unpack('C', $str); // array(1) [65] (byte used to represent 'A')
ord($str); // int(65) (codepoint for 'A')
chr(65); // string(1) "A"

// 4 bytes (party emoji, codepoint 129395, bytes read as one integer 4036994483)
$str = '🥳';
unpack('C*', $str); // array(4) [240,159,165,179] (f09fa5b3 in hex)
mb_ord($str); // int(129395) (1f973 in hex)
mb_chr(129395); // string(4) "🥳"
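
With the real codepoint in hand, the "\u{hex}" escape that failed earlier works as expected. Here’s a short sketch of moving between the character, the codepoint and the U+ notation, assuming the mbstring extension is loaded.

```php
<?php
// character -> codepoint -> U+ notation
$codepoint = mb_ord('🥳', 'UTF-8');         // int(129395)
$notation  = sprintf('U+%04X', $codepoint); // string(7) "U+1F973"

// codepoint -> character, using the codepoint (1f973) and not the raw bytes (f09fa5b3)
$fromEscape = "\u{1F973}";                          // string(4) "🥳"
$fromHex    = mb_chr(hexdec('1F973'), 'UTF-8');     // string(4) "🥳"
```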

Emojis

Emojis in UTF-8 are a lot of fun, so let’s finish off the article on a fun note by taking a closer look at them and how they are constructed. As we saw in the previous examples, the party emoji is represented by the Unicode codepoint U+1F973. Some emojis, though, can be combined to create new emojis, which is super interesting.

👨 - Man (Basic) (U+1F468 - 👨)
👨‍🚀 - Man Astronaut (U+1F468 + U+1F680 - 👨 🚀)
👨‍🍳 - Man Cook (U+1F468 + U+1F373 - 👨 🍳)
👨‍🔧 - Man Mechanic (U+1F468 + U+1F527 - 👨 🔧)
👨‍🔬 - Man Scientist (U+1F468 + U+1F52C - 👨 🔬)
👨‍🎨 - Man Artist (U+1F468 + U+1F3A8 - 👨 🎨)
👨‍🌾 - Man Farmer (U+1F468 + U+1F33E - 👨 🌾)
👨‍🚒 - Man Firefighter (U+1F468 + U+1F692 - 👨 🚒)
👨‍🏫 - Man Teacher (U+1F468 + U+1F3EB - 👨 🏫)
👨‍🎤 - Man Singer (U+1F468 + U+1F3A4 - 👨 🎤)
👨‍💼 - Man Office Worker (U+1F468 + U+1F4BC - 👨 💼)
👨‍💻 - Man Programmer (U+1F468 + U+1F4BB - 👨 💻)

👩 - Woman (Basic) (U+1F469 - 👩)
👩‍🚀 - Woman Astronaut (U+1F469 + U+1F680 - 👩 🚀)
👩‍🍳 - Woman Cook (U+1F469 + U+1F373 - 👩 🍳)
👩‍🔧 - Woman Mechanic (U+1F469 + U+1F527 - 👩 🔧)
👩‍🔬 - Woman Scientist (U+1F469 + U+1F52C - 👩 🔬)
👩‍🎨 - Woman Artist (U+1F469 + U+1F3A8 - 👩 🎨)
👩‍🌾 - Woman Farmer (U+1F469 + U+1F33E - 👩 🌾)
👩‍🚒 - Woman Firefighter (U+1F469 + U+1F692 - 👩 🚒)
👩‍🏫 - Woman Teacher (U+1F469 + U+1F3EB - 👩 🏫)
👩‍🎤 - Woman Singer (U+1F469 + U+1F3A4 - 👩 🎤)
👩‍💼 - Woman Office Worker (U+1F469 + U+1F4BC - 👩 💼)
👩‍💻 - Woman Programmer (U+1F469 + U+1F4BB - 👩 💻)

This only works for certain emoji combinations, and it’s done by gluing them together with the ZWJ (Zero Width Joiner) codepoint. Some emojis also have text and emoji presentation variants, which can be selected by adding U+FE0E (text) or U+FE0F (emoji) after the character. The general format for combining emojis is:

emoji + skin tone (omit for default) + ZWJ (U+200D) + emoji

// Composite Emoji Format:
// emoji + skin tone (omit for default) + ZWJ + emoji
//
// Light Skin Tone         U+1F3FB
// Medium-Light Skin Tone  U+1F3FC
// Medium Skin Tone        U+1F3FD
// Medium-Dark Skin Tone   U+1F3FE
// Dark Skin Tone          U+1F3FF

// Zero Width Joiner, ZWJ  U+200D

// Man + Medium Skin Tone + ZWJ + School = Teacher
// U+1F468 + U+1F3FD + U+200D + U+1F3EB
$emoji = "\u{1F468}\u{1F3FD}\u{200D}\u{1F3EB}"; // "👨🏽‍🏫"

// Woman + ZWJ + Woman + ZWJ + Girl = Multi-Parent Family
// U+1F469 + U+200D + U+1F469 + U+200D + U+1F467
$emoji = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}"; // "👩‍👩‍👧"
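
A composite emoji can also be taken apart again. The sketch below splits the teacher emoji into its codepoints, assuming PHP 7.4 or later for mb_str_split() and arrow functions.

```php
<?php
// Man + Medium Skin Tone + ZWJ + School
$emoji = "\u{1F468}\u{1F3FD}\u{200D}\u{1F3EB}";

// split into codepoints and format each one in U+ notation
$codepoints = array_map(
    fn (string $char) => sprintf('U+%04X', mb_ord($char, 'UTF-8')),
    mb_str_split($emoji, 1, 'UTF-8')
);
// array(4) ["U+1F468","U+1F3FD","U+200D","U+1F3EB"]

$bytes = strlen($emoji);             // int(15) (4 + 4 + 3 + 4 bytes)
$chars = mb_strlen($emoji, 'UTF-8'); // int(4)  (4 codepoints, displayed as 1 emoji)
```

So what looks like a single character on screen is 4 codepoints and 15 bytes. If the intl extension is available, grapheme_strlen() counts user-perceived characters instead, though the result can depend on the ICU version it was built against.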

That’s it for this time, I hope that this will be useful for someone out there looking to get more insight, or a refresher, into number systems and encoding schemes and how to interact with them in PHP.

Also, I hope that this article can serve as a reference in case you, like I often do, forget how some things work, like the encodings featured in this article or how to convert data from one number system to another.

Until next time, have a good one!