In a previous blog, I wrote about IPC in PHP with Sockets [link] and in my last blog [link] I wrote about server-sent events where I briefly touched on WebSockets. So, I thought that would be a nice segue into doing an article about the WebSockets protocol and how to use PHP Sockets to set up a WebSocket server.
After doing a walk-through of the WebSocket protocol, I’ll demonstrate an app and WebSocket server that I built. To make it more interesting, we’ll be unorthodox by reinventing the wheel and building everything from scratch with pure PHP socket_* functions.
I do believe that when learning something new, it often makes sense to roll up your sleeves and get your hands dirty: digging deep and figuring out the details in practice teaches you things that reading about the theory alone does not.
WebTransport over HTTP/3 is expected to eventually replace WebSockets for many use cases. However, its API is not finalized and Safari still hasn’t received support for the draft specification, so WebSocket will likely remain relevant for quite some time. That’s what we’ll be focusing on in this article.
WebSocket and HTTP
WebSocket is a bidirectional communication protocol, which means that it can send messages back and forth between the client and server, unlike unidirectional server-sent events that I wrote about in my last blog.
First, let’s get a high-level view of how the WebSocket protocol works and examine its many similarities with HTTP/1 and HTTP/2 (hereafter simply HTTP).
Both WebSockets and HTTP are message-based protocols positioned in the application layer (layer 7 of the OSI model), and both use TCP in the transport layer (layer 4 of the OSI model). Since WebSockets use TCP, we can call socket_create(AF_INET, SOCK_STREAM, SOL_TCP) in PHP to create a TCP socket for the communication.
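As a minimal sketch of that first step, the snippet below creates a TCP socket, binds it and starts listening. The helper name createListeningSocket() is hypothetical and not taken from my server code, and the example assumes PHP 8 with the sockets extension enabled.

```php
<?php
declare(strict_types=1);

// Hypothetical helper for illustration: creates a TCP socket, binds it to
// the given address and starts listening for incoming connections.
function createListeningSocket(string $host, int $port): Socket
{
    $socket = socket_create(AF_INET, SOCK_STREAM, SOL_TCP);
    if ($socket === false) {
        throw new RuntimeException(socket_strerror(socket_last_error()));
    }

    // Allow quick restarts without waiting for the old socket to time out.
    socket_set_option($socket, SOL_SOCKET, SO_REUSEADDR, 1);

    if (!socket_bind($socket, $host, $port) || !socket_listen($socket)) {
        $error = socket_strerror(socket_last_error($socket));
        socket_close($socket);
        throw new RuntimeException($error);
    }

    return $socket;
}

// Port 0 asks the operating system for any free port.
$server = createListeningSocket('127.0.0.1', 0);
socket_getsockname($server, $address, $boundPort);
echo "Listening on {$address}:{$boundPort}\n";
socket_close($server);
```

A real server would keep the socket open and call socket_accept() on it to pick up each incoming connection.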
When a WebSocket connection is first initiated, a handshake has to take place. This handshake is a normal HTTP/1.1 request. So, when the client executes the JavaScript <script>const socket = new WebSocket('ws://0.0.0.0')</script> in their browser to open a WebSocket connection, the browser sends an HTTP request to the server, asking if it understands and accepts WebSocket communication on the given URL.
GET / HTTP/1.1
Connection: Upgrade
Upgrade: websocket
Sec-WebSocket-Version: 13
Sec-WebSocket-Key: ...
Sec-WebSocket-Extensions: ...
To avoid overwhelming you with details at this point, I’ve left out the values of the Sec-WebSocket headers. If you’d like to see how the Sec-WebSocket-Accept header in the server’s response is generated, take a look at this example [here].
Now, if the server understands the request and accepts WebSocket communication, it will respond with the following HTTP response.
HTTP/1.1 101 Switching Protocols
Connection: upgrade
Upgrade: websocket
Sec-WebSocket-Version: 13
Sec-WebSocket-Accept: ...
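To make the handshake a bit more concrete, here is a rough sketch of how a server could produce that response. The Sec-WebSocket-Accept value is the Base64-encoded SHA-1 hash of the client’s Sec-WebSocket-Key concatenated with a fixed GUID from RFC 6455. The function names are hypothetical, the sample key is the one used in the RFC, and the sketch only sends the headers the RFC requires for a successful handshake.

```php
<?php
declare(strict_types=1);

// Fixed GUID defined by RFC 6455 for computing Sec-WebSocket-Accept.
const WEBSOCKET_GUID = '258EAFA5-E914-47DA-95CA-C5AB0DC85B11';

// Hypothetical helper: pulls the Sec-WebSocket-Key value out of a raw request.
function extractWebSocketKey(string $request): ?string
{
    if (preg_match('/^Sec-WebSocket-Key:\s*(\S+)\s*$/mi', $request, $matches) === 1) {
        return $matches[1];
    }

    return null;
}

// Base64 of the raw (binary) SHA-1 hash of key + GUID.
function createAcceptValue(string $key): string
{
    return base64_encode(sha1($key . WEBSOCKET_GUID, true));
}

function buildHandshakeResponse(string $key): string
{
    return "HTTP/1.1 101 Switching Protocols\r\n"
        . "Connection: Upgrade\r\n"
        . "Upgrade: websocket\r\n"
        . 'Sec-WebSocket-Accept: ' . createAcceptValue($key) . "\r\n"
        . "\r\n";
}

// Sample request using the example key from RFC 6455.
$request = "GET / HTTP/1.1\r\n"
    . "Connection: Upgrade\r\n"
    . "Upgrade: websocket\r\n"
    . "Sec-WebSocket-Version: 13\r\n"
    . "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==\r\n"
    . "\r\n";

$key = extractWebSocketKey($request);
if ($key === null) {
    throw new RuntimeException('Not a WebSocket handshake request.');
}

echo buildHandshakeResponse($key);
```

For the sample key, the computed header value is s3pPLMBiTxaQ9kYGzzhZRbK+xOo=, which matches the example in the RFC.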
All communication exchanged on this connection from this point will now be conducted in the WebSocket protocol. To see what that data transfer looks like, let’s take a look at the structure of the WebSocket protocol.
Header and Payload
WebSocket data is defined in frames, each containing a header and a payload. This is similar to how an HTTP message has headers and a message body. In HTTP, the first line, or start-line, indicates the protocol version together with a path and type of request. This is followed by headers, an empty line and a message body.
POST / HTTP/1.1
Content-Type: application/json; charset=utf-8
Content-Length: 21

{"message": "Hello!"}
With WebSockets, the protocol version is determined during the handshake but similar to the HTTP start-line, the type of request, or type of frame, is defined in the header.
In a WebSocket frame, the header is defined in binary and is between 2 and 10 bytes. The first byte of the header is used to define what type of frame it is and if the payload is a UTF-8 text message or binary data.
The bytes after the first one define the length of the payload, similar to the Content-Length header in an HTTP request. If the payload is short, one byte is enough to define its length, so the WebSocket frame header will be 2 bytes in total. The rest of the frame is the payload, or message body, itself.
10000001 // indicates that frame is a UTF-8 text message
00010101 // content-length, 21 in binary
{"message": "Hello!"}
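If you want to reproduce those exact bytes yourself, a small sketch like the following should do. It builds the frame with pack() and prints the two header bytes as bits so they can be compared with the listing above. The variable names are only for illustration.

```php
<?php
declare(strict_types=1);

$message = '{"message": "Hello!"}';

// Byte 1: 10000001 (final frame, text). Byte 2: the payload length, 21 here.
$frame = pack('CC', 0b10000001, strlen($message)) . $message;

// Print the two header bytes as bits, followed by the payload.
$firstByte = sprintf('%08b', ord($frame[0]));
$secondByte = sprintf('%08b', ord($frame[1]));
echo $firstByte, "\n", $secondByte, "\n", substr($frame, 2), "\n";
```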
WebSocket Frames
So we now know that the frame type is defined by the first byte and that the binary 10000001 in WebSockets means a UTF-8 text message. Let’s look at which frame types are available.
The first byte consists of 8 bits. The first bit indicates whether this is the final frame of a message: 1 means the frame is final, either a self-contained message or the last fragment of one, and 0 means the message is fragmented and continues in the following frames. Bits 2, 3 and 4 are reserved (they stay 0 unless an extension defines a meaning for them), and bits 5, 6, 7 and 8 hold the opcode, which determines the frame type.
0000 0001 Fragmented UTF-8 Text Frame Start (First message in chain)
0000 0000 Fragmented Frame
1000 0000 Fragmented Frame End (Last message in chain)
1000 0001 UTF-8 Text Frame
1000 0010 Binary Data Frame
1000 1000 (Close Control Frame) Sent when a connection should close.
1000 1001 (Ping Control Frame) Sent to check if connection is alive.
1000 1010 (Pong Control Frame) Sent as a response to an incoming ping frame.
You don’t have to remember all of these. In most scenarios you would probably only use the UTF-8 Text Frame, which is 129 in decimal.
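Going the other way, a server has to pick the first byte apart when a frame comes in. Here is a minimal sketch of that using bitwise AND. The Opcode enum and parseFirstByte() are hypothetical names made up for this example (enums require PHP 8.1), while the opcode values themselves come from the list above.

```php
<?php
declare(strict_types=1);

// Hypothetical enum for this sketch; the values are the opcodes (bits 5-8).
enum Opcode: int
{
    case Continuation = 0x0;
    case Text = 0x1;
    case Binary = 0x2;
    case Close = 0x8;
    case Ping = 0x9;
    case Pong = 0xA;
}

/** @return array{final: bool, opcode: Opcode} */
function parseFirstByte(int $byte): array
{
    return [
        // Bit 1: set when this is the final frame of the message.
        'final' => ($byte & 0b10000000) !== 0,
        // Bits 5-8: the opcode. from() throws a ValueError for unknown opcodes.
        'opcode' => Opcode::from($byte & 0b00001111),
    ];
}

$text = parseFirstByte(0b10000001);          // UTF-8 Text Frame
$fragmentStart = parseFirstByte(0b00000001); // first frame of a fragmented text message
$ping = parseFirstByte(0b10001001);          // Ping Control Frame

echo $text['opcode']->name, "\n";
```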
Payload Length
Now that we understand the first byte of the WebSocket frame header, let’s look at how the payload length is expressed. There are three types of payload lengths: short, medium and long. They require different numbers of bytes to define their length in the WebSocket frame header.
If the payload length is less than or equal to 125 bytes, it can be expressed in one byte, meaning that the total header size will be two bytes. If the payload length is larger than 125 but less than 65536, it’s of medium length and the part of the header that holds the length will be 2 bytes. If the payload length is larger than or equal to 65536, the length part of the header will use 8 bytes.
Short Payload (Total Header Size 2 bytes, Payload Length <= 125)
frame type 1 byte + payload length 1 byte.
Medium Payload (Total Header Size 4 bytes, Payload Length > 125 && < 65536)
frame type 1 byte + 1 byte with value 126 to indicate medium size payload + payload length 2 bytes.
Long Payload (Total Header Size 10 bytes, Payload Length >= 65536)
frame type 1 byte + 1 byte with value 127 to indicate long size payload + payload length 8 bytes.
It can be written in PHP like so:
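$header = match (true) {
    // short: the length fits in the second byte
    $length <= 125 => pack('CC', $frameType, $length),
    // medium: marker 126 followed by a 16-bit big-endian length
    $length < 65536 => pack('CCn', $frameType, 126, $length),
    // long: marker 127 followed by a 64-bit big-endian length
    default => pack('CCJ', $frameType, 127, $length),
};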
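To see what the three header variants look like on the wire, the sketch below wraps an equivalent match expression in a function and prints the resulting header bytes as hex. The function name buildFrameHeader() is hypothetical, and the 'J' format assumes a 64-bit build of PHP.

```php
<?php
declare(strict_types=1);

// Hypothetical wrapper, equivalent to the match expression shown above.
function buildFrameHeader(int $frameType, int $length): string
{
    return match (true) {
        $length <= 125 => pack('CC', $frameType, $length),
        $length < 65536 => pack('CCn', $frameType, 126, $length),
        default => pack('CCJ', $frameType, 127, $length),
    };
}

$short = bin2hex(buildFrameHeader(0b10000001, 21));
$medium = bin2hex(buildFrameHeader(0b10000001, 300));
$long = bin2hex(buildFrameHeader(0b10000001, 70000));

echo $short, "\n";  // 8115
echo $medium, "\n"; // 817e012c
echo $long, "\n";   // 817f0000000000011170
```

In the medium header, 7e is the marker 126 and 012c is 300. In the long header, 7f is the marker 127 and the remaining 8 bytes hold 70000.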
Masked bytes
This is the last part we’ll cover before moving on to the demo app I built with WebSockets, which shows how all of this can be put to use.
Messages sent from the client to the server need to be masked. The mask consists of 4 random bytes that the client sends along with the payload and uses to scramble the message sent over the wire. The server decodes the message by looping over the payload and XORing each byte with the mask byte at the matching position, cycling through the four mask bytes in order.
But how do we know if the message is masked or not? I said that we were done covering the WebSocket frame header, but there’s one last thing: a bit called the mask bit. Its value is 1 if the message is masked and 0 if it’s not. According to the protocol specification (RFC 6455), frames sent from the client to the server must be masked, while frames sent from the server to the client must not be, so they have this bit set to 0.
Let’s go back to the example I showed at the beginning of the article with the first byte indicating the frame type and the second byte indicating the payload length:
10000001 // indicates that frame is a UTF-8 text message
00010101 // content-length, 21 in binary
{"message": "Hello!"}
Let’s focus on the second byte indicating the payload length. Only the last 7 bits actually express the length. The first bit is the so-called mask bit, which indicates whether the message is masked.
00010101 // content-length, 21 in binary (not masked)
10010101 // content-length, 21 in binary (masked)
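Splitting the second byte into its two parts takes only a couple of bitwise operations. The following is a small sketch of that, with parseSecondByte() being a hypothetical name chosen for this example.

```php
<?php
declare(strict_types=1);

/** @return array{masked: bool, length: int} */
function parseSecondByte(int $byte): array
{
    return [
        // First bit: the mask bit.
        'masked' => ($byte & 0b10000000) !== 0,
        // Last 7 bits: 0-125 is the length itself, 126 and 127 are the
        // markers for medium and long payloads.
        'length' => $byte & 0b01111111,
    ];
}

$unmasked = parseSecondByte(0b00010101);
$masked = parseSecondByte(0b10010101);

var_dump($unmasked, $masked);
```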
If the message is masked, then the 4 mask bytes are always located directly after the header so they can be extracted like this:
$maskBytes = match (true) {
    // 4 mask bytes after a 2 byte header
    $payloadType === PayloadType::SHORT => substr($webSocketFrame, 2, 4),
    // 4 mask bytes after a 4 byte header
    $payloadType === PayloadType::MEDIUM => substr($webSocketFrame, 4, 4),
    // 4 mask bytes after a 10 byte header
    $payloadType === PayloadType::LONG => substr($webSocketFrame, 10, 4),
};
These 4 mask bytes can then be used to unmask the message like this:
function unmaskPayload(string $payload, string $maskBytes): string
{
    $unmaskedText = '';
    $length = strlen($payload);
    for ($i = 0; $i < $length; $i++) {
        // XOR each payload byte with the mask byte at position i mod 4
        $unmaskedText .= $payload[$i] ^ $maskBytes[$i % 4];
    }
    return $unmaskedText;
}
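Because XOR is its own inverse, the same loop both masks and unmasks a payload. The sketch below demonstrates this with the masked "Hello" example from RFC 6455 and with a random mask like the one a client would generate. The function name applyMask() is hypothetical and simply mirrors the unmasking logic above.

```php
<?php
declare(strict_types=1);

// XOR every payload byte with the mask byte at position i mod 4.
// The same function masks and unmasks, because (x ^ k) ^ k === x.
function applyMask(string $payload, string $maskBytes): string
{
    $result = '';
    $length = strlen($payload);
    for ($i = 0; $i < $length; $i++) {
        $result .= $payload[$i] ^ $maskBytes[$i % 4];
    }

    return $result;
}

// Mask bytes and masked payload of "Hello", taken from the example in RFC 6455.
$maskBytes = hex2bin('37fa213d');
$maskedPayload = hex2bin('7f9f4d5158');
$unmasked = applyMask($maskedPayload, $maskBytes);
echo $unmasked, "\n"; // Hello

// A client picks fresh random mask bytes for every frame it sends.
$randomMask = random_bytes(4);
$original = '{"message": "Hello!"}';
$roundTrip = applyMask(applyMask($original, $randomMask), $randomMask);
```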
Payload
I hope you’re following along so far. We’re almost done: all that is left is the message itself. Depending on whether the WebSocket frame has a short, medium or long payload and whether the frame contains mask bytes, the starting index of the payload will differ. The payload can be extracted like so:
$payload = match (true) {
    // payload start index: 2 or 6
    $type === PayloadType::SHORT => substr($frame, $isMasked ? 6 : 2),
    // payload start index: 4 or 8
    $type === PayloadType::MEDIUM => substr($frame, $isMasked ? 8 : 4),
    // payload start index: 10 or 14
    $type === PayloadType::LONG => substr($frame, $isMasked ? 14 : 10),
};
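Putting the pieces together, here is a rough sketch of a decoder that reads the opcode, the mask bit, the payload length and finally the unmasked payload from one frame. It is not the implementation from my WebSocket server: decodeFrame() is a hypothetical function, it assumes that $frame holds one complete frame, and it ignores fragmentation.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical decoder for one complete frame held in $frame.
 *
 * @return array{opcode: int, masked: bool, payload: string}
 */
function decodeFrame(string $frame): array
{
    $opcode = ord($frame[0]) & 0b00001111;
    $secondByte = ord($frame[1]);
    $isMasked = ($secondByte & 0b10000000) !== 0;
    $length = $secondByte & 0b01111111;
    $offset = 2;

    if ($length === 126) {
        // Medium payload: the real length is in the next 2 bytes.
        $length = unpack('n', substr($frame, 2, 2))[1];
        $offset = 4;
    } elseif ($length === 127) {
        // Long payload: the real length is in the next 8 bytes.
        $length = unpack('J', substr($frame, 2, 8))[1];
        $offset = 10;
    }

    $maskBytes = '';
    if ($isMasked) {
        $maskBytes = substr($frame, $offset, 4);
        $offset += 4;
    }

    // Read exactly $length bytes, in case more data follows in the buffer.
    $payload = substr($frame, $offset, $length);
    if ($isMasked) {
        $count = strlen($payload);
        for ($i = 0; $i < $count; $i++) {
            $payload[$i] = $payload[$i] ^ $maskBytes[$i % 4];
        }
    }

    return ['opcode' => $opcode, 'masked' => $isMasked, 'payload' => $payload];
}

// An unmasked text frame, as a server would send it.
$fromServer = decodeFrame(pack('CC', 0b10000001, 21) . '{"message": "Hello!"}');
// The masked "Hello" text frame from the example in RFC 6455.
$fromClient = decodeFrame(hex2bin('818537fa213d7f9f4d5158'));

echo $fromServer['payload'], "\n", $fromClient['payload'], "\n";
```

Passing the payload length to substr() is a deliberate choice here: a TCP read can return more than one frame at a time, so taking everything up to the end of the buffer could pull in bytes that belong to the next frame.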
Chat App Demo
I think that just like a to-do list is the typical example app for a JavaScript framework, the most typical demo for WebSockets is a chat app. Sadly, I’m no different: I built a chat app to put all this newly gained knowledge about WebSockets to use.
But, to make things a bit more interesting, I set up a Docker environment with a load balancer that handles a web app cluster and a WebSocket server cluster. They both boot 3 instances each and this setup makes it possible to do horizontal scaling, meaning that you can keep on adding instances if you need more power supplied to your app.
Setting up multiple WebSocket servers that keep each other in sync while communicating back and forth with the clients was an interesting challenge. Choosing to use pure PHP socket_*() functions instead of simply installing a library like SwoolePHP was also challenging but super interesting.
Closing notes
Learning a new protocol can be a bit tiresome, but it’s also very rewarding, especially when you put your newly gained knowledge into building an app, or really anything that can be used for something. I’ve been wading through WebSockets documentation all over the web this past month, some of it easier to understand than the rest. With this blog, my goal is to explain the WebSocket protocol in terms that are as simple as possible, backed by multiple code examples.
I had the intention to write more about how I created the WebSocket server but realized that it would never fit into one blog article, so I’m listing some links to the example chat app and WebSocket server that I built so you can check it out for yourself if you’re interested - after all, real code examples are worth a thousand words, right?
• Server Logic
• Client Logic
• Load Balancer Config
That’s it for this time, I hope that this will be useful for someone out there looking to get more insight into the WebSocket protocol and how pure PHP socket_* functions can be used to build a WebSocket server.
Until next time, have a good one!