
How do recommender systems work on digital platforms?

Jan 12, 2024

Few things are as vital to democracy as the free flow of information. If an enlightened citizenry is essential for democracy, as Thomas Jefferson suggested, then citizens need a way to stay informed. For most of the modern era, that role has been played by the press—and especially the editors and producers who exercise control over what news to publish and air.

Yet as the flow of information has changed, the distribution and consumption of news have increasingly shifted away from traditional media and toward social media and digital platforms, with over a quarter of Americans now getting news from YouTube alone and more than half from social media. Whereas editors once decided which stories should receive the broadest reach, today recommender systems determine what content users encounter on online platforms—and what information enjoys mass distribution. As a result, the recommender systems underlying these platforms—and the recommendation algorithms and trained models they encompass—have acquired newfound importance. If accurate and reliable information is the lifeblood of democracy, recommender systems increasingly serve as its heart.

As recommender systems have grown to occupy a central role in society, a growing body of scholarship has documented potential links between these systems and a range of harms—from the spread of hate speech, to foreign propaganda, to political extremism. Nonetheless, the models themselves remain poorly understood, among both the public and the policy communities tasked with regulating and overseeing them. Given both their outsized importance and the need for informed oversight, this article aims to demystify recommender systems by walking through how they have evolved and how modern recommendation algorithms and models work. The goal is to offer researchers and policymakers a baseline from which they can ultimately make informed decisions about how to oversee and govern them.

Suppose you run a social media or digital platform. Each time your users open your app, you want to show them compelling content within a second. How would you go about surfacing that content?

The quickest and most efficient approach is just to sort content by time. Since most social networks and digital platforms have a large back catalogue of content, the most recent or "freshest" content is more likely to be compelling than content drawn at random. Simply displaying the most recent items in reverse-chronological order is thus a good place to start. As a bonus, this approach is both easy to implement and straightforward to understand—your users will always have a clear sense of why they are seeing a given piece of content and an accurate mental model of how the app behaves. While the industry has moved beyond them, reverse-chronological recommendation algorithms powered the first generation of social media feeds and are why most feeds are still known today as "timelines."
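To make the idea concrete, here is a minimal sketch of a reverse-chronological feed in Python. The Post structure and field names are illustrative, not drawn from any particular platform:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Post:
    author: str
    text: str
    created_at: datetime

def reverse_chronological_feed(posts: list[Post], limit: int = 50) -> list[Post]:
    """Return the newest posts first -- the classic 'timeline'."""
    return sorted(posts, key=lambda p: p.created_at, reverse=True)[:limit]
```

The entire algorithm is a single sort, which is exactly why it is cheap to run and easy for users to reason about.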

While appealing in their simplicity, purely reverse-chronological feeds have a massive downside: They don't scale well. As platforms expand, the amount of content they host grows exponentially, but a user's free time does not. The most recently added content will therefore serve as a less and less effective proxy for the most compelling content. Worse, users who want to build a wide audience will flood the platform with new content in a bid to stay at the top of other users' feeds. As a result, your app will quickly become biased toward the most active users rather than the most interesting ones. Less engaging content—or even outright spam—will start to inundate user timelines.

To address that problem, you could craft hard-coded rules to prioritize among the most recent content. For instance, you could write a rule that says: If Nicole has liked posts from Dia more than any other user, then show Nicole Dia's latest post from today before anything else. Or you could write a rule that says: If Nicole likes video more than any other form of content, then the most recently added video from her friends should be shown to Nicole first, before any other content. By mixing and matching these manual rules, attribute- and category-based recommendation algorithms can more reliably surface compelling content than a purely reverse-chronological feed.
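In code, a ruleset like the one above might look something like the following sketch, where the attributes, rules, and weights are illustrative assumptions rather than any platform's actual logic:

```python
from dataclasses import dataclass

@dataclass
class Viewer:
    most_liked_author: str
    favorite_content_type: str

@dataclass
class Post:
    author: str
    content_type: str  # e.g. "video" or "text"
    age_hours: float

def rule_based_score(viewer: Viewer, post: Post) -> float:
    """Score a recent post with hand-coded heuristics; higher scores show first."""
    score = 0.0
    if post.author == viewer.most_liked_author:            # favorite-author rule
        score += 10.0
    if post.content_type == viewer.favorite_content_type:  # favorite-format rule
        score += 5.0
    score += 1.0 / (1.0 + post.age_hours)                  # recency as a tiebreaker
    return score

nicole = Viewer(most_liked_author="dia", favorite_content_type="video")
posts = [Post("dia", "text", 2.0), Post("sam", "video", 0.5)]
feed = sorted(posts, key=lambda p: rule_based_score(nicole, p), reverse=True)
```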

However, relying on hand-coded rules also has its drawbacks. It forces developers to bake in a lot of assumptions about what users will be most interested in, many of which may not actually be true. Do users always like video more than text? And when a user likes a given post, do they always want to see more from its author? So long as a recommendation algorithm is purely hand-coded, it will be biased toward developers' assumptions about what users are most interested in viewing. This approach also doesn't scale well: As more rules are added by hand, each incremental rule becomes less effective and makes the codebase more difficult to maintain.

At a certain size, the best approach for efficiently surfacing compelling content is to rely on machine learning. By drawing on past user data, deep learning recommendation algorithms—and the recommendation models they train—have proven particularly effective at "learning" what content users will find compelling and surfacing it for them. Every major platform now relies on some version of deep learning to choose what content to display, but these approaches come at a cost: Whereas reverse-chronological algorithms are easy to implement and understand, large-scale deep learning algorithms are complex to implement and effectively impossible to comprehend and interpret.
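The core shift is from weights a developer writes down to weights learned from behavior. Here is a deliberately tiny sketch of that idea using a simple logistic-regression classifier on made-up engagement data; real systems use deep neural networks trained on billions of examples:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row describes one (user, item) pair: [is_favorite_author,
# is_favorite_format, item_age_hours]. Labels: did the user engage? (Toy data.)
X = np.array([[1, 1, 0.5], [1, 0, 2.0], [0, 1, 5.0], [0, 0, 8.0]])
y = np.array([1, 1, 0, 0])

# The model learns its own weight for each feature from past behavior,
# replacing the hand-tuned 10.0 and 5.0 of the rule-based sketch above.
model = LogisticRegression().fit(X, y)

candidate = np.array([[1, 1, 1.0]])
print(model.predict_proba(candidate)[:, 1])  # predicted engagement probability
```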

Which recommendation algorithm works best for your platform will depend on tradeoffs between performance, cost, and interpretability, or how easy it is to identify why the algorithm is behaving in a certain way. For large social networks and digital platforms, the performance gains of deep learning recommendation algorithms far outweigh both the cost of developing them and the corresponding decline in interpretability.

While that tradeoff may make users more likely to continue engaging with content on the platform, it has important externalities for democratic societies. In the United States alone, researchers have documented how recommender systems exposed users to far-right extremist movements, as well as conspiracy theories regarding COVID-19 and the outcome of the 2020 election. Despite the role recommender systems played in spreading content related to those movements and narratives—which have been instrumental in fomenting recent political violence—they nonetheless remain poorly understood by both policymakers and the public. Understanding how the technology works is thus a vital first step toward an "enlightened citizenry" capable of governing it.

Although the details vary slightly by platform, large-scale recommender systems generally follow the same basic steps. As Figure 1 shows, they typically move through five stages: they first produce an inventory of available content, run integrity processes that filter it in line with their content moderation policies, generate a pool of candidate items, and then rank—and re-rank—that pool down to only the items users are most likely to be interested in.

Figure 1: The stages of a modern recommender system: Inventory → Integrity processes → Candidate generation → Ranking → Re-ranking.
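Those stages can be sketched as a single pipeline. In the illustrative Python below, the field names and thresholds are placeholders; each stage is far more sophisticated in practice:

```python
def recommend(user, all_content, limit=25):
    # 1. Inventory: everything that could appear in this user's feed.
    inventory = [c for c in all_content if c["visible_to"] in ("all", user)]
    # 2. Integrity: drop content flagged by moderation systems.
    eligible = [c for c in inventory if not c["flagged"]]
    # 3. Candidate generation: cheaply narrow to a manageable pool.
    candidates = eligible[:500]
    # 4. Ranking: score each candidate (a learned model in production).
    ranked = sorted(candidates, key=lambda c: c["predicted_engagement"], reverse=True)
    # 5. Re-ranking: final business rules (diversity, demotions, and so on).
    return ranked[:limit]

posts = [
    {"visible_to": "all", "flagged": False, "predicted_engagement": 0.9},
    {"visible_to": "all", "flagged": True, "predicted_engagement": 0.8},
]
print(recommend("nicole", posts, limit=1))
```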

In recent years, many of the policy conversations around mitigating the harms linked to digital platforms have focused on the integrity step—especially the content moderation policies that determine whether a piece of content can be published or shared—but far greater attention needs to be paid to the ranking step. If recommender systems are in fact having a significant impact on everything from electoral integrity to public health, then the process by which they sort and rank content matters a great deal as well. By better understanding the complex systems behind content ranking, policymakers will be in a better position to oversee their use.

Although social media platforms architect their ranking algorithms slightly differently from other digital platforms, in general nearly all large platforms now use a variant of what is known as a "two-tower" architecture to rank items.

To see what that means in practice, imagine you have two different spreadsheets. The first is a spreadsheet where every row is a user, and every column is a user attribute (e.g., age, location, search history). In the second spreadsheet, every row is a piece of content, and every column is a content attribute (e.g., content type, title, number of likes). By modeling the information in each spreadsheet in separate parts of a deep neural network—an algorithm whose structure is (very) loosely analogous to the way neurons connect in the brain—a "two-tower" approach learns over time how likely a given user is to engage with a particular piece of content.
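A minimal two-tower model might be sketched as follows in PyTorch; the layer sizes and feature dimensions are made-up placeholders, and production towers are vastly larger:

```python
import torch
import torch.nn as nn

class TwoTowerModel(nn.Module):
    """One tower embeds users, the other embeds items."""
    def __init__(self, user_dim: int, item_dim: int, embed_dim: int = 32):
        super().__init__()
        # "User tower": maps user attributes (age, location, history...) to a vector.
        self.user_tower = nn.Sequential(
            nn.Linear(user_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim)
        )
        # "Item tower": maps content attributes (type, title, likes...) to a vector.
        self.item_tower = nn.Sequential(
            nn.Linear(item_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim)
        )

    def forward(self, user_features, item_features):
        u = self.user_tower(user_features)
        v = self.item_tower(item_features)
        # The dot product of the two embeddings is the predicted engagement logit.
        return (u * v).sum(dim=-1)

model = TwoTowerModel(user_dim=8, item_dim=12)
users = torch.randn(4, 8)   # 4 users, 8 attributes each
items = torch.randn(4, 12)  # 4 items, 12 attributes each
probs = torch.sigmoid(model(users, items))  # per-pair engagement probabilities
```

One design consequence worth noting: because each tower runs independently, item embeddings can be precomputed and stored, letting a platform quickly retrieve promising candidates for a user without scoring the entire catalogue from scratch.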

Although this approach has proven remarkably successful, platforms with a large user base and a deep catalogue of content end up needing to train models that are exceedingly large. A platform with a billion users and a trillion pieces of content, for instance, would need to learn a model capable of efficiently generalizing to 10^21 potential user-item pairs, a challenge made all the more daunting by the fact that most users never engage with the vast majority of content. As a result, these models need to include an extraordinarily large number of parameters—the learned weights that connect the "neurons" of a neural network—to perform well across so many different user-item pairs. For this reason, recommendation models are much larger than other kinds of deep learning models. Whereas GPT-3, a powerful large language model released in 2020 by OpenAI, had 175 billion parameters, the recommendation model powering Facebook's newsfeed has 12 trillion. With so many parameters, it is effectively impossible to understand and reason about how the model behaves merely by examining the trained model itself.
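Back-of-the-envelope arithmetic shows why the numbers balloon so quickly. In the sketch below, the 64-dimensional embedding size is an assumption for illustration, as is the simplified premise that per-ID embedding tables dominate the parameter count:

```python
users = 10**9    # a billion users
items = 10**12   # a trillion pieces of content
print(f"{users * items:.0e} potential user-item pairs")  # 1e+21

# Recommendation models typically store a learned vector ("embedding") per
# user and per item ID. With an assumed 64-dimensional embedding, the
# embedding tables alone would hold tens of trillions of parameters:
embed_dim = 64
print(f"{(users + items) * embed_dim:.0e} embedding parameters")  # ~6e+13
```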

The architecture of modern recommender systems has important implications for policymakers and the public at large, yet they may not be obvious to non-technical audiences. Four implications are especially important: the outcome metric matters (a lot); the models are too large to explain and interpret; frequent retraining and model updates make evaluation a challenge; and algorithmic impacts cannot be assessed by auditing the underlying code and trained model alone.

Since the architecture of large recommender systems makes it difficult to understand how they behave, finding better ways to evaluate their behavior is vital. Regulators, researchers, and the technology industry can all take steps to better evaluate models. From platform-researcher collaborations to simulated environments and other privacy-preserving techniques, it is possible to gain greater clarity on the behavior and impact of recommender systems than we currently enjoy.

Seizing those opportunities will be ever more vital as recommender systems continue to grow in importance. TikTok, a viral video app, recently eclipsed Google in internet traffic largely by virtue of its improved recommender system, which surfaces content from across the app's entire user base rather than just a user's connections. In response, social media platforms like Facebook and Twitter have started to similarly expand the "inventory" initially surfaced by their recommender systems to include more content from across the entire platform. Mark Zuckerberg, for example, recently said that he expects that by 2023 more than 30% of the items in a user's feed on Instagram and Facebook will come from accounts a user has not friended or followed. As other platforms rush to keep pace, they too will almost certainly increase their reliance on purely recommended content.

In turn, the potential impact of recommender systems on democratic societies will only grow—as will the importance of understanding how they work.

Chris Meserole is a fellow in Foreign Policy at the Brookings Institution and director of research for the Brookings Artificial Intelligence and Emerging Technology Initiative.

Facebook and Google provide financial support to the Brookings Institution, a nonprofit organization devoted to rigorous, independent, in-depth public policy research.
