@danecando

How the Cloudflare Calls orange Demo App Works

Situation: You're building a Slack killer and need to implement a version of the "Huddles" feature. You find out Cloudflare has a service for WebRTC-powered calls. You read the docs and think hmmm... what am I supposed to do with this?

The answer is to read and copy code from the demo application called "orange" on GitHub. It turns out that there’s a LOT of work to do. There's a bunch of features and UI to implement after you pull out the base pieces that you intend to use from the demo app.

In this post, we'll look at the architecture and code of the orange demo application. Once we have at least a surface level understanding of all the moving parts, we can make progress on building a custom app with new features based on the demo.

9/14/24 Update: The implementation has changed significantly since this was originally written. The basic structure and concepts are still mostly relevant though.

Architecture breakdown

A lot of the basic info is available in the README file. orange is a Remix app deployed to Cloudflare Workers.

There are a lot of things that make developing this type of application challenging: complex state management, managing user connections to the WebSocket server, handling video/audio with WebRTC, and more.

Let's look at the Architecture Diagram:

Orange Demo App Architecture

As we can see in the diagram, the application includes:

  • A client/JavaScript application
  • A server application
  • Cloudflare Calls API

The application code lives in the app directory of the repository. Here's a high level look at some of the important files/directories that represent different parts of the app.

- app
	- api
		- roomsApi.server.ts > This file creates a durable object for a meeting room
	- durableObjects
		- ChatRoom.server.ts > This is the durable object code that manages room state and messaging with connected clients
	- routes > All the routes / pages for the site
		- api.calls.$.tsx > This api route file proxies requests to the Cloudflare Calls API
		- api.room.$roomName.$.tsx > This route proxies requests to the roomsApi.server.ts file to interact with the durable object
	- hooks > A pretty robust React hooks API for the client application
	- utils > There is some really important lower level client code in here like Signal.ts and Peer.ts for signaling and WebRTC peer connections

Next, we'll start by going through the flow of the application and see how all the pieces come together.

The Flow

I have a room with one user publishing audio and video. We can learn a lot about how the application works by connecting another user to this room and inspecting dev tools to see what is happening.

Connecting to a room

GET http://localhost:8787/cea93e5e/room - A request for the room document

GET ws://localhost:8787/api/room/cea93e5e/websocket - The local user opens a WebSocket connection to the ChatRoom.server.ts Durable Object

Messages are exchanged from the durable object:

(The state managed in the durable object powers the room/meeting application. You can think of the durable object state as the brains and source of truth for a call)

%% Establish Identity %%
{"from":"server","timestamp":0,"message":{"type":"identity","id":"a9437aa2-1056-45f9-a8f1-b2b0ad16a81d"}}

%% Send Room State to Client %%
{"from":"server","timestamp":0,"message":{"type":"roomState","state":{"users":[{"id":"e2924ab3-c6be-4591-a41d-8697966bf605","name":"other","joined":true,"raisedHand":false,"speaking":false,"transceiverSessionId":"85c7967b9313662ef565446ea35049c1","tracks":{"audioEnabled":false,"videoEnabled":true,"screenShareEnabled":false,"audio":"85c7967b9313662ef565446ea35049c1/8100727b-48f9-4a5a-a82a-07cd364a355c"}},{"id":"65b27b8b-aa39-4700-bc55-ca2fc4caf780","name":"Dane","joined":false,"raisedHand":false,"speaking":false,"tracks":{"audioEnabled":false,"videoEnabled":false,"screenShareEnabled":false}},{"id":"a9437aa2-1056-45f9-a8f1-b2b0ad16a81d","name":"Dane","joined":false,"raisedHand":false,"speaking":false,"tracks":{"audioEnabled":false,"videoEnabled":false,"screenShareEnabled":false}}]}}}

%% Send Local Participant State to Server/Durable Object %%
{"type":"userUpdate","user":{"id":"a9437aa2-1056-45f9-a8f1-b2b0ad16a81d","name":"Dane","joined":true,"raisedHand":false,"speaking":false,"tracks":{"audioEnabled":false,"videoEnabled":true,"screenShareEnabled":false}}}

%% Send Updated Room State To Client %%
{"from":"server","timestamp":0,"message":{"type":"roomState","state":{"users":[{"id":"e2924ab3-c6be-4591-a41d-8697966bf605","name":"other","joined":true,"raisedHand":false,"speaking":false,"transceiverSessionId":"85c7967b9313662ef565446ea35049c1","tracks":{"audioEnabled":false,"videoEnabled":true,"screenShareEnabled":false,"audio":"85c7967b9313662ef565446ea35049c1/8100727b-48f9-4a5a-a82a-07cd364a355c"}},{"id":"65b27b8b-aa39-4700-bc55-ca2fc4caf780","name":"Dane","joined":false,"raisedHand":false,"speaking":false,"tracks":{"audioEnabled":false,"videoEnabled":false,"screenShareEnabled":false}},{"id":"a9437aa2-1056-45f9-a8f1-b2b0ad16a81d","name":"Dane","joined":true,"raisedHand":false,"speaking":false,"tracks":{"audioEnabled":false,"videoEnabled":true,"screenShareEnabled":false}}]}}}

The first message sends back a session identifier for the local user. After that, the durable object passes along the state of all the participants in the room, and the local user sends their details to be merged into the room state. There is one remote peer active in this room, so the durable object will send along their details, including any tracks they have published.

POST http://localhost:8787/api/calls/sessions/new - Proxy request to Calls API to create a new session

POST http://localhost:8787/api/calls/sessions/7a4268ed8018ee3e886b1abf51796072/tracks/new - Proxy request to the Calls API to add tracks to the local participant's session (our tracks as well as ones from remote peers that were provided by the Durable Object state)
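The two proxied requests above could be sketched like this. Note this is a hypothetical illustration, not orange's actual code: the response shape (`sessionId`) and the injected fetch-like parameter are assumptions made to keep the flow easy to follow.

```typescript
// Hypothetical sketch of the join flow: create a Calls session via the server
// proxy, then register tracks on it. Response/body shapes are assumptions.
type FetchLike = (
  url: string,
  init?: { method?: string; body?: string }
) => Promise<{ json(): Promise<any> }>

async function joinCall(baseUrl: string, doFetch: FetchLike): Promise<string> {
  // 1. Create a new Calls session for the local participant
  const sessionRes = await doFetch(`${baseUrl}/api/calls/sessions/new`, {
    method: 'POST',
  })
  const { sessionId } = await sessionRes.json()

  // 2. Add tracks to the session (local tracks to push, plus remote tracks
  //    learned from the Durable Object's room state to pull)
  await doFetch(`${baseUrl}/api/calls/sessions/${sessionId}/tracks/new`, {
    method: 'POST',
    body: JSON.stringify({ tracks: [] }), // track descriptions would go here
  })
  return sessionId
}
```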

This is a high-level look at what happens when a user connects to a room. There may be some additional requests for renegotiation or for other tracks as they are added and removed, but this is good enough for a basic mental model. I recommend running the app and inspecting what's happening in dev tools as you use it.

Now that we have at least a surface-level understanding of how a call is set up, we can dig a little further into the core components of the application.

App Client

The client app is a pretty large and complex React app. We'll go more in-depth on certain parts of it in future posts, but for now we can identify some of the important parts and how they work. Let's get started by tracing our way through some of the code.

The Room

The naming convention for the routes is a little confusing if you're not very experienced with Remix but there are two different routes and a layout for a room.

app/routes/_room.tsx This is the layout for the room views/routes. This component is actually very important as it sets up most of the state and puts it into a React Context for the room routes and child components.

const userMedia = useUserMedia(mode)
const room = useRoom({ roomName, userMedia })
const { peer, debugInfo, iceConnectionState } = usePeerConnection(apiExtraParams)

const pushedVideoTrack = usePushedTrack(peer, userMedia.videoStreamTrack)
const pushedAudioTrack = usePushedTrack(peer, userMedia.audioStreamTrack)
const pushedScreenSharingTrack = usePushedTrack(peer, userMedia.screenShareVideoTrack)

These are all custom hooks defined in the orange application and together they handle most of the work involved with setting up the room and call for the local user.

  • useUserMedia - Gets all the local user tracks (audio/video/screen) and related metadata
  • useRoom - Sets up the WebSocket connection to the durable object, listens for events, and manages the room state
  • usePeerConnection - Establishes the WebRTC Peer Connection with the Calls API, handles communication between the client and Calls API through the server proxy
  • usePushedTrack - Manages the local peer media tracks with the WebRTC connection

Next we have the context that gets created and passed to the actual route component (lobby or room):

const context: RoomContextType = {
  joined,
  setJoined,
  traceLink,
  userMedia,
  userDirectoryUrl,
  peer,
  peerDebugInfo: debugInfo,
  iceConnectionState,
  room,
  pushedTracks: {
    video: pushedVideoTrack,
    audio: pushedAudioTrack,
    screenshare: pushedScreenSharingTrack,
  },
}

return <Outlet context={context} />

As you can see, this RoomContext holds pretty much all the state needed to power the call: the room state, the peer WebRTC connection, userMedia, and the pushedTracks local user media tracks.

app/routes/_room.$roomName._index.tsx This route is the lobby / waiting room view for the room. We're not going to look at this page, but the functionality is similar to the waiting room / lobby of a Google Meet or Zoom call, where the user can set up their media and manage settings before joining the room.

app/routes/_room.$roomName.room.tsx This route is the root component for a "joined" room. The JoinedRoom component will give us a good idea of what it looks like when all the pieces of the client API are put together.

Some of the data from the RoomContext is pulled in:

const {
  userMedia,
  peer,
  pushedTracks,
  room: { otherUsers, signal, identity },
} = useRoomContext()

Laying out the video grid is pretty involved and important for the meeting UI:

const { GridDebugControls, fakeUsers } = useGridDebugControls({
  defaultEnabled: false,
  initialCount: 0,
})

const [containerRef, { width: containerWidth, height: containerHeight }] =
  useMeasure<HTMLDivElement>()
const [firstFlexChildRef, { width: firstFlexChildWidth }] = useMeasure<HTMLDivElement>()

const { width } = useWindowSize()

const stageLimit = width < 600 ? 2 : 8
const flexContainerWidth = useMemo(
  () =>
    100 /
      calculateLayout({
        count: totalUsers,
        height: containerHeight,
        width: containerWidth,
      }).cols +
    '%',
  [totalUsers, containerHeight, containerWidth]
)
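To make the math above concrete, here is a minimal sketch of what a `calculateLayout` helper might do: try every column count and keep the one that yields the largest video tiles. This is an assumption for illustration (the 16:9 tile aspect ratio and the scoring are mine, not orange's actual implementation).

```typescript
// Hypothetical layout helper: pick the grid shape that maximizes tile size
// for `count` 16:9 tiles inside a width x height container.
function calculateLayout({
  count,
  width,
  height,
}: {
  count: number
  width: number
  height: number
}): { cols: number; rows: number } {
  let best = { cols: 1, rows: count, area: 0 }
  for (let cols = 1; cols <= count; cols++) {
    const rows = Math.ceil(count / cols)
    // Tile width at a 16:9 aspect ratio, constrained by container dimensions
    const tileWidth = Math.min(width / cols, (height / rows) * (16 / 9))
    const area = tileWidth * ((tileWidth * 9) / 16)
    if (area > best.area) best = { cols, rows, area }
  }
  return { cols: best.cols, rows: best.rows }
}
```

The `cols` result is what feeds the `100 / cols + '%'` flex width calculation shown above.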

useBroadcastStatus sends messages to the Durable Object to keep the local user's state on the server in sync for the room:

useBroadcastStatus({
  userMedia,
  peer,
  signal,
  identity,
  pushedTracks,
  raisedHand,
  speaking,
})

useStageManager manages a list of participants to track who is visible on screen and active in the call.

const { recordActivity, actorsOnStage } = useStageManager(otherUsers, stageLimit)
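The hook's internals aren't shown here, but its core selection logic could be sketched as a pure function. The recency heuristic below is an assumption for illustration, not necessarily what orange does:

```typescript
// Hypothetical stage selection: keep the most recently active participants
// on stage, capped at `limit`.
function selectStage<T extends { id: string }>(
  users: T[],
  lastActivity: Map<string, number>, // user id -> timestamp of last activity
  limit: number
): T[] {
  return [...users]
    .sort((a, b) => (lastActivity.get(b.id) ?? 0) - (lastActivity.get(a.id) ?? 0))
    .slice(0, limit)
}
```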

actorsOnStage is then used to render the video grid:

{
  actorsOnStage.map((user) => (
    <Fragment key={user.id}>
      <PullVideoTrack video={user.tracks.video} audio={user.tracks.audio}>
        {({ videoTrack, audioTrack }) => (
          <Participant
            user={user}
            flipId={user.id}
            videoTrack={videoTrack}
            audioTrack={audioTrack}
            pinnedId={pinnedId}
            setPinnedId={setPinnedId}
          ></Participant>
        )}
      </PullVideoTrack>
      {user.tracks.screenshare && user.tracks.screenShareEnabled && (
        <PullVideoTrack video={user.tracks.screenshare}>
          {({ videoTrack }) => (
            <Participant
              user={user}
              videoTrack={videoTrack}
              flipId={user.id + 'screenshare'}
              isScreenShare
              pinnedId={pinnedId}
              setPinnedId={setPinnedId}
            />
          )}
        </PullVideoTrack>
      )}
    </Fragment>
  ))
}

There are a bunch of components used to help render the call and user controls and you can see how it's all put together in the render function. Aside from the video grid, there is a participant list, microphone/camera toggle, and other controls for the local user.

WebRTC and Signaling

This part of the client code could and should be its own post. WebRTC and signaling are the bits that really power this real-time media application. For now we'll just get a surface-level idea of what these pieces are doing.

/app/utils/Signal.ts

If you're unfamiliar with WebRTC, signaling is an important concept. Typically it's used to allow peers to discover each other and exchange the information required to establish a connection to one another (peer-to-peer). It works a little bit differently in a Cloudflare Calls app since Calls is an SFU (Selective Forwarding Unit) and doesn't use peer-to-peer connections.

The Signal class is actually pretty straightforward though. It has a connect method for establishing the WebSocket connection to the Durable Object and it allows sending and receiving messages.

/app/utils/Peer.client.ts

As we mentioned in the Signal section, the local user doesn't need to establish connections to all the peers in the call, only to the Cloudflare Calls server. So the user establishes a connection to the Calls server and sends their media to it. Through this connection they will also receive media from the tracks that they request from other peers.

We can break down the code in another post, but it is pretty similar to the code you would write to establish a connection to a peer. In our case the peer is just the Calls API/server. RTCPeerConnection is the WebRTC API that gets set up and used to send and receive media tracks.
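The core offer/answer exchange with the SFU can be sketched abstractly. The `PeerConnectionLike` type below mirrors the slice of `RTCPeerConnection` involved, and the proxy callback stands in for the POST to the Calls API; both are assumptions for illustration.

```typescript
// Hedged sketch of WebRTC negotiation with the Calls server acting as the
// single remote "peer". postSdp stands in for the server-proxied API call.
type PeerConnectionLike = {
  createOffer(): Promise<{ sdp?: string; type: string }>
  setLocalDescription(desc: { sdp?: string; type: string }): Promise<void>
  setRemoteDescription(desc: { sdp: string; type: string }): Promise<void>
}

async function negotiate(
  pc: PeerConnectionLike,
  postSdp: (offerSdp: string) => Promise<string>
): Promise<void> {
  // Describe our side of the connection
  const offer = await pc.createOffer()
  await pc.setLocalDescription(offer)
  // Send the offer to Calls via the proxy; it responds with an answer SDP
  const answerSdp = await postSdp(offer.sdp ?? '')
  await pc.setRemoteDescription({ type: 'answer', sdp: answerSdp })
}
```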

Client Summary

The client code is a good window into all the bits and pieces that make up a Cloudflare Calls application.

  • Signal - Manages a WebSocket connection that provides a shared room state for all connected participants
  • Peer - Establishes a connection to send and receive media from Cloudflare Calls
  • RoomContext - Holds all the state that powers the client application

App Server

The server application consists of some endpoints and a Durable Object. If you're not yet familiar with Durable Objects, I suggest taking a look at the documentation. Durable Objects are a really amazing Cloudflare service commonly used for building real-time applications.

Endpoints

We can split up the API endpoints into two categories.

  • Cloudflare Calls API proxy
  • WebSocket / Durable Object endpoint

The Cloudflare Calls API is not safe to use directly from a client, so you need to have your own server that makes requests to the API. In orange this is done via a simple proxy at /app/routes/api.calls.$.tsx.

The WebSocket / Durable Object endpoint is /api.room.$roomName.$.tsx, and it forwards to /app/api/roomsApi.server.ts. All this endpoint does is get a handle to the Durable Object for the room and make a fetch request to it. This is the URL that gets passed to new WebSocket to establish the WebSocket connection to the room's Durable Object.
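As an illustration of the Calls proxy, the path rewriting it performs might look like the helper below. The helper name is hypothetical; the Calls REST base URL shown is Cloudflare's documented endpoint, and the actual handler would fetch the resulting URL server-side with an `Authorization: Bearer <app secret>` header attached.

```typescript
// Hypothetical path-rewriting helper for a catch-all proxy in the style of
// api.calls.$.tsx: strip the local /api/calls prefix and map the remainder
// onto the Cloudflare Calls REST API for the given app.
function callsUrl(requestUrl: string, appId: string): string {
  const url = new URL(requestUrl)
  const path = url.pathname.replace(/^\/api\/calls/, '')
  return `https://rtc.live.cloudflare.com/v1/apps/${appId}${path}${url.search}`
}
```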

Durable Object

This is where all the magic happens on the server. The implementation is located at /app/durableObjects/ChatRoom.server.ts. The Durable Object is implemented as a class named ChatRoom.

The ChatRoom DO does a few things:

  • manages room state
  • receives messages
  • broadcasts messages (to one or all user sessions)

The roomsApi.server.ts handler calls the fetch method on the ChatRoom DO. The fetch method does the following:

  • Set up the WebSocket connection
  • Establish an identity for the user session
  • Create the initial state for the user session
  • Update the state of the room with the new user session
  • Broadcast the state changes to all other sessions

The messages that are sent and received have a defined contract. We'll skip reviewing the code for the Durable Object in depth because I think we can get a good view of what it does by looking at the types.

The RoomState is just an array of the established user sessions. The server and client both have a set of defined messages that they can send. You'll find code in ChatRoom that knows how to handle each defined ClientMessage, and the same is true on the client for each ServerMessage.

export type RoomState = {
  users: User[]
}

export type ServerMessage =
  | {
      type: 'roomState'
      state: RoomState
    }
  | {
      type: 'error'
      error?: string
    }
  | {
      type: 'identity'
      id: string
    }
  | {
      type: 'directMessage'
      from: string
      message: string
    }
  | {
      type: 'muteMic'
    }

export type ClientMessage =
  | {
      type: 'userUpdate'
      user: User
    }
  | {
      type: 'directMessage'
      to: string
      message: string
    }
  | {
      type: 'muteUser'
      id: string
    }
  | {
      type: 'userLeft'
    }
  | {
      type: 'heartBeat'
    }
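On the client side, handling these messages typically comes down to an exhaustive switch over the union. A sketch (the relevant types are repeated so the snippet stands alone, and the handler return strings are placeholders, not orange's actual handlers):

```typescript
// Dispatch on the ServerMessage union. Because the switch is exhaustive,
// TypeScript will flag any message variant that goes unhandled.
type ServerMessage =
  | { type: 'roomState'; state: { users: { id: string }[] } }
  | { type: 'error'; error?: string }
  | { type: 'identity'; id: string }
  | { type: 'directMessage'; from: string; message: string }
  | { type: 'muteMic' }

function handleServerMessage(msg: ServerMessage): string {
  switch (msg.type) {
    case 'identity':
      return `assigned session id ${msg.id}`
    case 'roomState':
      return `room has ${msg.state.users.length} user(s)`
    case 'directMessage':
      return `dm from ${msg.from}: ${msg.message}`
    case 'muteMic':
      return 'server asked us to mute'
    case 'error':
      return `error: ${msg.error ?? 'unknown'}`
  }
}
```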

Summary

As we have seen, building this type of application is pretty involved. Thankfully the Cloudflare team has already written a lot of great code that we can use and learn from. That being said, orange is mostly set up as a reference application, and building your own Calls-based application will still require quite a bit of work.