What does it mean to “render” a webpage? This may sound like a simple question, but when you dive into the technical details you begin to realize how much work a browser does in an incredibly short amount of time. Knowing more about this process will allow you to make better decisions when it comes to optimizing the performance of your site.
While external factors often come into play, much of the responsibility for laying the foundation of a smooth experience rests on the shoulders of the developer. Here at Something Digital, we've found that many people don't fully understand the path a webpage takes from files on a server to a complete page in the browser, so it's worth taking a closer look at what happens behind the scenes.
From a technical standpoint, the process for loading a webpage can be broken into four stages: navigation, parsing, rendering, and interaction. Let’s break each one of those steps down in detail.
Navigation
First, the browser needs to retrieve the files that make up the page from a remote server. How quickly this happens is limited by factors such as the end user's connection speed and network latency. For this reason, data centers are spread around the world, and content delivery networks (CDNs) serve content from many locations, sending each user to a nearby one, often whichever is geographically closest.
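To make "geographically closest" concrete, here is a minimal sketch that picks the nearest point of presence (PoP) by great-circle distance using the haversine formula. The PoP list, its coordinates, and the function names are hypothetical, and real CDNs usually route on measured latency (via DNS or anycast) rather than raw distance, so treat this as an illustration of the idea only.

```typescript
// Hypothetical edge locations; real CDNs have many more and route on latency, not just distance.
interface PointOfPresence {
  name: string;
  lat: number; // degrees
  lon: number; // degrees
}

const EARTH_RADIUS_KM = 6371;

// Great-circle distance between two coordinates using the haversine formula.
function haversineKm(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const toRad = (deg: number) => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Returns the PoP with the smallest distance to the user.
function nearestPoP(userLat: number, userLon: number, pops: PointOfPresence[]): PointOfPresence {
  if (pops.length === 0) throw new Error("no points of presence");
  let best = pops[0];
  let bestKm = Infinity;
  for (const pop of pops) {
    const km = haversineKm(userLat, userLon, pop.lat, pop.lon);
    if (km < bestKm) {
      best = pop;
      bestKm = km;
    }
  }
  return best;
}

const pops: PointOfPresence[] = [
  { name: "new-york", lat: 40.71, lon: -74.01 },
  { name: "london", lat: 51.51, lon: -0.13 },
  { name: "tokyo", lat: 35.68, lon: 139.69 },
];

console.log(nearestPoP(42.36, -71.06, pops).name); // a user in Boston: new-york
console.log(nearestPoP(48.86, 2.35, pops).name); // a user in Paris: london
```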
Files are requested over HTTP, the HyperText Transfer Protocol. Although its origins date back to the late 1980s, the standard continues to evolve: HTTP/2 improves upon HTTP/1.1 in a number of ways, and has itself since been followed by HTTP/3. One key advantage of HTTP/2 is multiplexing, which lets the browser request multiple assets (images, stylesheets, JavaScript, etc.) concurrently over a single TCP connection, whereas with HTTP/1.1 transferring data in parallel in practice requires opening separate connections.
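The benefit of multiplexing can be seen with a deliberately simplified cost model. The sketch below assumes every request costs exactly one round trip, connections are already open, bandwidth is unlimited, and HTTP/1.1 pipelining is unused; the limit of six connections per host is a common browser default, assumed here rather than guaranteed.

```typescript
// Simplified model only: one round trip per request, no bandwidth limit, no connection setup.
const HTTP1_CONNECTIONS_PER_HOST = 6; // common browser default, assumed here

// HTTP/1.1: each connection carries one request at a time, so requests queue in "rounds".
function http1LoadTimeMs(requests: number, rttMs: number, connections = HTTP1_CONNECTIONS_PER_HOST): number {
  return Math.ceil(requests / connections) * rttMs;
}

// HTTP/2: all requests are multiplexed as streams on one connection and share a round trip.
function http2LoadTimeMs(requests: number, rttMs: number): number {
  return requests > 0 ? rttMs : 0;
}

const assets = 18; // e.g. images, stylesheets, and scripts served from one host
console.log(http1LoadTimeMs(assets, 100)); // 300
console.log(http2LoadTimeMs(assets, 100)); // 100
```

Real-world gains are smaller than this model suggests, since bandwidth and server processing still take time, but the queuing effect under HTTP/1.1 is the part multiplexing removes.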
“Time to first byte” (or TTFB) is a common measurement of how long it takes, from the moment the request starts, for the browser to receive the first byte of the response. Ideally it should be half a second or less.
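TTFB can be derived from Navigation Timing data. The sketch below defines its own interface holding three fields that PerformanceNavigationTiming also exposes (startTime, requestStart, responseStart), so it runs outside a browser; the sample numbers and function names are made up for illustration.

```typescript
// Mirrors the subset of PerformanceNavigationTiming fields used here (all in milliseconds).
interface NavigationTimingLike {
  startTime: number; // 0 for the navigation entry
  requestStart: number; // when the browser began sending the request
  responseStart: number; // when the first byte of the response arrived
}

// TTFB as commonly defined: from the start of navigation to the first response byte.
function timeToFirstByte(t: NavigationTimingLike): number {
  return t.responseStart - t.startTime;
}

// The portion spent waiting on the server after the request was sent.
function serverWaitTime(t: NavigationTimingLike): number {
  return t.responseStart - t.requestStart;
}

const TTFB_TARGET_MS = 500; // the half-second target mentioned in the text

// Sample values; in a browser, performance.getEntriesByType("navigation")[0] has these fields.
const sample: NavigationTimingLike = { startTime: 0, requestStart: 120, responseStart: 380 };
console.log(timeToFirstByte(sample), serverWaitTime(sample), timeToFirstByte(sample) <= TTFB_TARGET_MS);
```

Splitting out the server wait helps tell a slow backend apart from slow DNS, connection setup, or redirects before the request was even sent.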
Parsing
As soon as the HTML begins to arrive, the browser starts interpreting it; it does not wait for every file to finish downloading. This is when the files are read and the structure of the website begins to take shape through two steps: lexical analysis (tokenization) and syntax analysis.
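The lexer (or tokenizer) breaks the raw text into “tokens”, such as start tags, end tags, and runs of text, that can be easily processed. Each token is then passed to the syntax analyzer, which applies the language's grammar rules and adds it to the parse tree. How syntax errors are handled depends on the language: HTML and CSS parsers are error-tolerant and recover from malformed input (CSS simply skips invalid declarations), whereas a syntax error in JavaScript raises a SyntaxError and the script is not executed at all.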
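A toy tokenizer shows the idea of turning raw markup into a flat list of tokens. This is a simplified regex-based sketch, not the HTML standard's tokenizer state machine: it skips attributes and does not handle comments, DOCTYPEs, character references, or raw-text elements like script. The Token type and function name are hypothetical.

```typescript
// Simplified token types; a real HTML tokenizer also emits comments, DOCTYPEs, and attributes.
type Token =
  | { kind: "startTag"; name: string }
  | { kind: "endTag"; name: string }
  | { kind: "text"; value: string };

// Splits markup into tag and text tokens. Attributes inside a tag are skipped.
function tokenize(html: string): Token[] {
  const tokens: Token[] = [];
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g;
  let cursor = 0;
  let match: RegExpExecArray | null;
  while ((match = tagPattern.exec(html)) !== null) {
    // Any characters between the previous tag and this one become a text token.
    if (match.index > cursor) {
      tokens.push({ kind: "text", value: html.slice(cursor, match.index) });
    }
    const name = match[2].toLowerCase();
    if (match[1] === "/") {
      tokens.push({ kind: "endTag", name });
    } else {
      tokens.push({ kind: "startTag", name });
    }
    cursor = tagPattern.lastIndex;
  }
  // Trailing text after the last tag.
  if (cursor < html.length) {
    tokens.push({ kind: "text", value: html.slice(cursor) });
  }
  return tokens;
}

console.log(tokenize('<p class="intro">Hello <b>world</b></p>'));
```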
As tokens pass through syntax analysis, the parsed HTML (or XML) elements are used to build the Document Object Model (or DOM), a tree of element and text nodes. JavaScript can then use these DOM nodes to read and manipulate the document's contents.
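Tree construction can be sketched with a stack of open elements: a start tag creates a child of the current element and becomes the new current element, and an end tag pops back up. This sketch uses a hypothetical simplified token shape and also illustrates HTML's error tolerance by closing unclosed elements implicitly and ignoring stray end tags; the real HTML tree-construction algorithm has far more rules (insertion modes, the adoption agency algorithm, and so on).

```typescript
// Simplified token shape, as a basic tokenizer might produce.
type Token =
  | { kind: "startTag"; name: string }
  | { kind: "endTag"; name: string }
  | { kind: "text"; value: string };

interface ElementNode { type: "element"; name: string; children: DomNode[] }
interface TextNode { type: "text"; value: string }
type DomNode = ElementNode | TextNode;

// Builds a tree with a stack of open elements. Unclosed elements are closed implicitly,
// and an end tag with no matching open element is ignored rather than treated as fatal.
function buildTree(tokens: Token[]): ElementNode {
  const root: ElementNode = { type: "element", name: "#document", children: [] };
  const openElements: ElementNode[] = [root];
  for (const token of tokens) {
    const current = openElements[openElements.length - 1];
    if (token.kind === "text") {
      current.children.push({ type: "text", value: token.value });
    } else if (token.kind === "startTag") {
      const element: ElementNode = { type: "element", name: token.name, children: [] };
      current.children.push(element);
      openElements.push(element);
    } else {
      const index = openElements.map((el) => el.name).lastIndexOf(token.name);
      // Pop the matching element and anything opened after it; index 0 is the root.
      if (index > 0) openElements.length = index;
    }
  }
  return root;
}

// Serializes the tree back to markup so the result is easy to inspect.
function serialize(node: DomNode): string {
  if (node.type === "text") return node.value;
  const inner = node.children.map(serialize).join("");
  return node.name === "#document" ? inner : `<${node.name}>${inner}</${node.name}>`;
}

// Malformed input: <b> is never closed and </div> has no matching start tag.
const tokens: Token[] = [
  { kind: "startTag", name: "p" },
  { kind: "text", value: "Hi " },
  { kind: "startTag", name: "b" },
  { kind: "text", value: "there" },
  { kind: "endTag", name: "div" },
  { kind: "endTag", name: "p" },
];
console.log(serialize(buildTree(tokens))); // <p>Hi <b>there</b></p>
```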
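Similar to the DOM, the CSS Object Model (or CSSOM) is also constructed at this stage, allowing JavaScript to read and modify CSS rules dynamically. When several rules target the same element, the “cascade” decides which one wins: declarations are compared primarily by origin and importance, then by selector specificity, and finally by source order, with the later rule winning a tie. Inherited properties such as color then flow down the DOM tree from parent to child.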
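The specificity and source-order steps of the cascade can be sketched as a comparison function. This minimal version assumes specificity has already been computed for each selector as an (IDs, classes, types) triple and deliberately ignores origin, !important, and cascade layers; the type and function names are hypothetical.

```typescript
// Specificity as (ID selectors, class/attribute/pseudo-class selectors, type selectors).
type Specificity = [ids: number, classes: number, types: number];

interface CssDeclaration {
  selector: string;
  specificity: Specificity; // precomputed here; a real engine derives it from the selector
  value: string;
  sourceOrder: number; // position in the stylesheet(s)
}

// Compares specificity component by component, most significant first.
function compareSpecificity(a: Specificity, b: Specificity): number {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return 0;
}

// Picks the winning declaration for one property on one element.
// Only specificity and source order are modeled; origin, !important, and layers are ignored.
function cascade(candidates: CssDeclaration[]): CssDeclaration | undefined {
  let winner: CssDeclaration | undefined;
  for (const decl of candidates) {
    if (winner === undefined) {
      winner = decl;
      continue;
    }
    const cmp = compareSpecificity(decl.specificity, winner.specificity);
    if (cmp > 0 || (cmp === 0 && decl.sourceOrder > winner.sourceOrder)) winner = decl;
  }
  return winner;
}

// Rules that set `color` on <p class="note">:
const colorRules: CssDeclaration[] = [
  { selector: "p", specificity: [0, 0, 1], value: "black", sourceOrder: 0 },
  { selector: ".note", specificity: [0, 1, 0], value: "blue", sourceOrder: 1 },
  { selector: "p", specificity: [0, 0, 1], value: "gray", sourceOrder: 2 },
];
console.log(cascade(colorRules)?.value); // blue: a class beats a later type selector
```

Note that the later `p` rule loses despite coming last: source order only breaks ties between equally specific selectors.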
CSS and JavaScript can both delay the first render. CSS is “render blocking”: the browser will not paint the page until the stylesheets in the document's head have been downloaded and parsed into the CSSOM. A regular <script> tag is also “parser blocking”: HTML parsing pauses until the script has been downloaded and executed, because the script might modify the document. Inline scripts always run synchronously at the point where they appear. Since this can have a large impact on overall load time, it is generally good practice to load non-critical external scripts with the defer or async attribute, so they download without halting the parser.
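The rules for when a script runs depend on a handful of attributes, and they are easy to misremember. The function below encodes the HTML standard's behavior for classic and module scripts; the interface, the function name, and the file names in the usage lines are hypothetical.

```typescript
// Describes a <script> element by the attributes that affect when it runs.
interface ScriptElementInfo {
  src?: string; // absent for inline scripts
  type?: "classic" | "module";
  async?: boolean;
  defer?: boolean;
}

type ScriptTiming = "parser-blocking" | "async" | "deferred";

function scriptTiming(script: ScriptElementInfo): ScriptTiming {
  if (script.type === "module") {
    // Module scripts are deferred by default; `defer` has no effect on them, `async` does.
    return script.async ? "async" : "deferred";
  }
  if (script.src === undefined) {
    // Inline classic scripts ignore both async and defer and run immediately.
    return "parser-blocking";
  }
  if (script.async) return "async"; // async takes precedence when both are present
  if (script.defer) return "deferred";
  return "parser-blocking";
}

console.log(scriptTiming({ src: "analytics.js", async: true })); // async
console.log(scriptTiming({ src: "app.js", defer: true })); // deferred
console.log(scriptTiming({ defer: true })); // parser-blocking (inline script)
console.log(scriptTiming({ src: "main.js", type: "module" })); // deferred
```

"Async" scripts run as soon as they finish downloading, in no guaranteed order, while "deferred" scripts run in document order once parsing is complete, which makes defer the safer choice for scripts that depend on each other.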
Rendering
Rendering is the multi-step process in which the content of the page begins to become visible to the user. This can be a relatively expensive task for the browser to perform, depending on the complexity of the styles and animations being rendered.
First, the DOM and CSSOM are combined into the “Render Tree” by traversing the DOM nodes and finding the CSSOM rules that apply to each one. Only nodes that will actually be rendered are included: an element with display: none is omitted along with all of its descendants, as are non-visual elements such as <head> and <script>, whereas an element with visibility: hidden still occupies space and stays in the tree.
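Building the render tree amounts to a recursive walk that drops any node whose computed display is none, together with its subtree. The sketch assumes each node's display value has already been resolved from the CSSOM (including the browser's default stylesheet, which gives <head> and <script> display: none); the interfaces and function names are hypothetical.

```typescript
// A DOM node with its computed `display` value already resolved from the CSSOM.
interface StyledNode {
  tag: string;
  display: string; // e.g. "block", "inline", "none"
  children: StyledNode[];
}

interface RenderNode {
  tag: string;
  children: RenderNode[];
}

// Keeps only nodes that generate boxes. display: none removes the node and its entire
// subtree, even if a descendant has a visible display value of its own.
function buildRenderTree(node: StyledNode): RenderNode | null {
  if (node.display === "none") return null;
  const children: RenderNode[] = [];
  for (const child of node.children) {
    const rendered = buildRenderTree(child);
    if (rendered !== null) children.push(rendered);
  }
  return { tag: node.tag, children };
}

// Compact textual outline, e.g. "html(body(div,p))".
function outline(node: RenderNode): string {
  return node.children.length > 0 ? `${node.tag}(${node.children.map(outline).join(",")})` : node.tag;
}

const styledDom: StyledNode = {
  tag: "html", display: "block", children: [
    { tag: "head", display: "none", children: [{ tag: "title", display: "none", children: [] }] },
    { tag: "body", display: "block", children: [
      { tag: "div", display: "block", children: [] },
      { tag: "span", display: "none", children: [{ tag: "b", display: "inline", children: [] }] },
      { tag: "p", display: "block", children: [] },
      { tag: "script", display: "none", children: [] },
    ] },
  ],
};
const tree = buildRenderTree(styledDom);
console.log(tree ? outline(tree) : "(nothing rendered)"); // html(body(div,p))
```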
Next, the layout stage computes the exact size and position of each node according to the CSS box model and reserves that space on the page. This step is also commonly referred to as “reflow”, particularly when it has to run again after a change.
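A heavily simplified version of block layout shows how sizes and positions fall out of the box model: each block fills its container's width minus its margins, children stack vertically, and a parent's height grows to fit them. This sketch assumes uniform margin, border, and padding on all sides and ignores margin collapsing, inline content, and explicit widths; the interfaces and function names are hypothetical.

```typescript
// Input: a block box with uniform margin, border, and padding (in px).
// Leaf boxes give an explicit content height; parents size to fit their children.
interface BoxSpec {
  name: string;
  margin: number;
  border: number;
  padding: number;
  contentHeight?: number;
  children?: BoxSpec[];
}

// Output: the border box (position and size) computed for each element.
interface LayoutBox {
  name: string;
  x: number;
  y: number;
  width: number;
  height: number;
  children: LayoutBox[];
}

// Lays out a block and its block children top to bottom, like normal flow.
function layoutBlock(spec: BoxSpec, containerX: number, containerY: number, containerWidth: number): LayoutBox {
  const x = containerX + spec.margin;
  const y = containerY + spec.margin;
  const width = containerWidth - 2 * spec.margin; // a block fills its container
  const inset = spec.border + spec.padding;
  const contentTop = y + inset;
  let cursorY = contentTop;
  const children: LayoutBox[] = [];
  for (const child of spec.children ?? []) {
    const box = layoutBlock(child, x + inset, cursorY, width - 2 * inset);
    children.push(box);
    cursorY = box.y + box.height + child.margin; // next sibling starts below this margin box
  }
  const contentHeight = children.length > 0 ? cursorY - contentTop : spec.contentHeight ?? 0;
  return { name: spec.name, x, y, width, height: contentHeight + 2 * inset, children };
}

const pageSpec: BoxSpec = {
  name: "body", margin: 8, border: 0, padding: 0, children: [
    { name: "header", margin: 0, border: 1, padding: 10, contentHeight: 50 },
    { name: "main", margin: 10, border: 0, padding: 0, contentHeight: 200 },
  ],
};
const bodyBox = layoutBlock(pageSpec, 0, 0, 800); // an 800px-wide viewport
for (const box of bodyBox.children) console.log(box.name, box.x, box.y, box.width, box.height);
// header 8 8 784 72
// main 18 90 764 200
```

Because every position depends on the boxes laid out before it, a change to one element's size can shift everything after it, which is why reflow can be expensive.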
Content is then drawn to the screen through a process called “painting”. More complex styles such as drop shadows are more expensive to compute than a solid color, so they can take longer to paint. First Meaningful Paint marks the moment the user first sees meaningful content on your webpage; note that Lighthouse has since deprecated this metric in favor of Largest Contentful Paint (LCP).
Since parts of the page may have been painted onto separate layers, the browser then composites them: the layers are combined, often with the help of the GPU, into the final image, with each layer drawn in the correct stacking order.
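At its core, compositing means sorting layers by stacking order and blending each one over the result so far. The sketch below reduces every layer to a single pixel and uses the standard Porter-Duff "source over" formula with straight (non-premultiplied) alpha; real compositors work on whole GPU textures, and the interfaces and names here are hypothetical.

```typescript
// Straight (non-premultiplied) RGBA, each channel in [0, 1].
interface RGBA { r: number; g: number; b: number; a: number }

// A painted layer reduced to the single pixel being composited.
interface Layer { name: string; zIndex: number; color: RGBA }

const TRANSPARENT: RGBA = { r: 0, g: 0, b: 0, a: 0 };

// Porter-Duff "source over": draws src on top of dst.
function over(src: RGBA, dst: RGBA): RGBA {
  const a = src.a + dst.a * (1 - src.a);
  if (a === 0) return TRANSPARENT;
  const mix = (s: number, d: number) => (s * src.a + d * dst.a * (1 - src.a)) / a;
  return { r: mix(src.r, dst.r), g: mix(src.g, dst.g), b: mix(src.b, dst.b), a };
}

// Sorts layers by stacking order and blends them bottom to top.
function compositeLayers(layers: Layer[]): RGBA {
  return [...layers]
    .sort((lower, upper) => lower.zIndex - upper.zIndex)
    .reduce((acc, layer) => over(layer.color, acc), TRANSPARENT);
}

const layers: Layer[] = [
  { name: "modal-overlay", zIndex: 10, color: { r: 1, g: 0, b: 0, a: 0.5 } }, // listed first on purpose
  { name: "page", zIndex: 0, color: { r: 1, g: 1, b: 1, a: 1 } },
];
console.log(compositeLayers(layers)); // { r: 1, g: 0.5, b: 0.5, a: 1 }
```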
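Actions such as adding or removing elements from the DOM, changing styles that affect geometry (width, height, font size, and so on), or resizing the window trigger additional reflow, since the browser must recalculate the positions of the affected elements and then repaint them. Reading a geometric property such as offsetHeight immediately after such a change forces the browser to perform that layout synchronously. Reflow runs on the main thread and blocks user interaction while it runs, so it should be minimized, as frequent reflows can lead to an unpleasant user experience.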
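The cost of interleaving geometry reads with style writes (often called "layout thrashing") can be shown with a toy model of the browser's bookkeeping. In this sketch, readWidth stands in for reading a property like offsetWidth and writeWidth for setting style.width; the class and method names are hypothetical, and a real engine tracks dirtiness per element rather than with one flag.

```typescript
// Toy model: writing a style marks layout dirty; reading geometry while dirty
// forces a synchronous layout before the value can be returned.
class LayoutModel {
  forcedLayouts = 0;
  private dirty = false;
  private widths: number[];

  constructor(widths: number[]) {
    this.widths = [...widths];
  }

  readWidth(i: number): number {
    if (this.dirty) {
      this.forcedLayouts++; // the browser must recompute layout before answering
      this.dirty = false;
    }
    return this.widths[i];
  }

  writeWidth(i: number, px: number): void {
    this.widths[i] = px;
    this.dirty = true;
  }
}

// Read, write, read, write...: every read after the first forces a layout.
function growInterleaved(model: LayoutModel, count: number): void {
  for (let i = 0; i < count; i++) model.writeWidth(i, model.readWidth(i) + 10);
}

// All reads first, then all writes: layout can wait until the next frame.
function growBatched(model: LayoutModel, count: number): void {
  const current: number[] = [];
  for (let i = 0; i < count; i++) current.push(model.readWidth(i));
  for (let i = 0; i < count; i++) model.writeWidth(i, current[i] + 10);
}

const interleaved = new LayoutModel([100, 100, 100, 100, 100]);
growInterleaved(interleaved, 5);
const batched = new LayoutModel([100, 100, 100, 100, 100]);
growBatched(batched, 5);
console.log(interleaved.forcedLayouts, batched.forcedLayouts); // 4 0
```

Both versions produce the same widths; only the order of reads and writes differs, which is why batching DOM reads before writes is a common optimization.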
As images download, smaller, more heavily compressed files typically finish first and appear sooner. Newer image formats such as WebP and AVIF encode images more efficiently and can reduce load times, but support varies between browsers and browser versions, so it is wise to provide a fallback, for example with the <picture> element.
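The fallback behavior of the <picture> element boils down to "use the first source whose type the browser supports, otherwise use the <img>". The sketch below models just that rule; the set of supported types is a hypothetical input, the file names are made up, and real browsers also evaluate media queries, srcset descriptors, and sizes.

```typescript
// One <source> inside a <picture> element: a MIME type plus the image URL to use.
interface PictureSource {
  type: string; // e.g. "image/avif"
  srcset: string; // simplified to a single URL
}

// The first <source> whose type is supported wins; otherwise the <img> fallback is used.
function selectImage(sources: PictureSource[], fallbackSrc: string, supportedTypes: ReadonlySet<string>): string {
  const match = sources.find((source) => supportedTypes.has(source.type));
  return match ? match.srcset : fallbackSrc;
}

// Roughly equivalent markup:
// <picture>
//   <source type="image/avif" srcset="hero.avif">
//   <source type="image/webp" srcset="hero.webp">
//   <img src="hero.jpg" alt="">
// </picture>
const heroSources: PictureSource[] = [
  { type: "image/avif", srcset: "hero.avif" },
  { type: "image/webp", srcset: "hero.webp" },
];
console.log(selectImage(heroSources, "hero.jpg", new Set(["image/webp", "image/jpeg"]))); // hero.webp
console.log(selectImage(heroSources, "hero.jpg", new Set(["image/jpeg"]))); // hero.jpg
```

Listing the most efficient format first matters, since the browser stops at the first supported match.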
Interaction
Finally, the interaction stage is when the user can begin browsing and using the page. A page is considered “fully interactive” (the point captured by the Time to Interactive metric) when the previous stages have completed and the page responds reliably as users scroll, type, and interact with its elements.
First CPU Idle represents the point at which the page is minimally interactive, meaning it has loaded enough to handle a user's input. Most, but not all, of the UI is interactive, and the page responds to user input in a reasonable amount of time. (Lighthouse has since deprecated this metric.)
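Both interactivity metrics are driven by "long tasks", main-thread tasks that run longer than 50 ms and keep the page from responding. The sketch below estimates Time to Interactive in simplified form: starting from First Contentful Paint, it looks for the first 5-second window free of long tasks and reports the end of the last long task before it. It ignores the network-quiet condition of the real definition and assumes the trace continues quietly after its last task; the task data and names are hypothetical. In a browser, long tasks can be observed with a PerformanceObserver for the "longtask" entry type.

```typescript
interface MainThreadTask {
  start: number; // ms since navigation start
  duration: number; // ms
}

const LONG_TASK_MS = 50; // tasks longer than this count as "long tasks"
const QUIET_WINDOW_MS = 5000; // quiet period required by the TTI definition

function estimateTti(firstContentfulPaint: number, tasks: MainThreadTask[]): number {
  const longTasks = tasks
    .filter((t) => t.duration > LONG_TASK_MS && t.start + t.duration > firstContentfulPaint)
    .sort((a, b) => a.start - b.start);
  let candidate = firstContentfulPaint;
  for (const task of longTasks) {
    if (task.start - candidate >= QUIET_WINDOW_MS) break; // found a quiet window
    candidate = Math.max(candidate, task.start + task.duration);
  }
  return candidate;
}

const trace: MainThreadTask[] = [
  { start: 1200, duration: 30 }, // short task, ignored
  { start: 1500, duration: 200 }, // long task ending at 1700
  { start: 2500, duration: 80 }, // long task ending at 2580
  { start: 9000, duration: 120 }, // comes after a 5 s quiet window, so it does not count
];
console.log(estimateTti(1000, trace)); // 2580
```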
When animations run, 60 frames per second is the usual target for smooth motion, which gives the browser roughly 16.7 ms to produce each frame; anything slower starts to become noticeable as “jank”. At lower frame rates, scrolling can appear choppy or stop responding altogether, whereas a consistently high frame rate is usually a good indicator of a site's overall responsiveness.
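The 60 fps target translates directly into a per-frame time budget, which makes jank easy to quantify. The sketch below counts frames that exceed the budget and computes the average frame rate from a list of frame durations; the sample durations are made up, and in a browser they could be gathered from the gaps between successive requestAnimationFrame callbacks.

```typescript
const TARGET_FPS = 60;
const FRAME_BUDGET_MS = 1000 / TARGET_FPS; // about 16.67 ms per frame

// Summarizes measured frame durations (ms): frames over budget and the average frame rate.
function frameStats(frameDurationsMs: number[]): { jankyFrames: number; averageFps: number } {
  if (frameDurationsMs.length === 0) return { jankyFrames: 0, averageFps: 0 };
  const jankyFrames = frameDurationsMs.filter((d) => d > FRAME_BUDGET_MS).length;
  const totalMs = frameDurationsMs.reduce((sum, d) => sum + d, 0);
  const averageFps = 1000 / (totalMs / frameDurationsMs.length);
  return { jankyFrames, averageFps };
}

console.log(frameStats([16, 16, 33, 15])); // { jankyFrames: 1, averageFps: 50 }
```

An average can hide problems: a single 33 ms frame is a visible stutter even when the average frame rate looks acceptable, so counting over-budget frames is often more telling.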
The process a browser follows to construct a webpage is far more complex than one might think. Each step brings more opportunities for developers to make decisions that provide a smoother user experience and, ultimately, improve conversion rates.