# The Math Behind Comment Quality Metrics

By Francis Tseng

First published in November 2015

At The Coral Project, we’re trying to come up with some interesting and useful metrics about community members and discussion on news sites for our first product.

It’s an interesting exercise to develop metrics which embody an organization’s principles. For instance, perhaps we see our content as the catalyst for conversations, in which case we would measure an article’s success by how much discussion it generates.

Generally, there are two groups of metrics that I have been focusing on:

• Asset-level metrics, computed for individual articles or whatever else may be commented on
• User-level metrics, computed for individual users

For the past couple of weeks I’ve been sketching out a few ideas for these metrics:

• For assets, the principles that these metrics aspire to capture are around quantity and diversity of discussion.
• For users, I look at organizational approval, community approval, how much discussion this user tends to generate, and how likely they are to be moderated.

Here I’ll walk through my thought process for these initial ideas.

## Asset-level metrics

For assets, I wanted to value not only the amount of discussion generated but also the diversity of those discussions. A good discussion is one in which there’s a lot of high-quality exchange (something else to be measured, but not captured in this first iteration) from many different people.

There are two scores to start:

• a discussion score, which quantifies how much discussion an asset generated. This looks at how much people are talking to each other as opposed to just counting up the number of comments. For instance, a comments section in which all comments are top-level should not have a high discussion score. A comments section in which there are some really deep back-and-forths should have a higher discussion score.
• a diversity score, which quantifies how many different people are involved in the discussions. Again, we don’t want to look at diversity in the comments section as a whole because we are looking for diversity within discussions, i.e. within threads.

The current sketch for computing the discussion score is via two values:

• maximum thread depth: how many levels of replies does the deepest chain in the thread have?
• maximum thread width: what is the highest number of replies for a comment?

These are pretty rough approximations of “how much discussion” there is. The idea is that for sites which only allow one level of replies, a lot of replies to a comment can signal a discussion, and that a very deep thread signals the same for sites which allow more nesting.

The discussion score of a top-level thread is the product of these two intermediary metrics:

$$\text{discussion score}_{\text{thread}} = \max(\text{thread}_{\text{depth}}) \max(\text{thread}_{\text{width}})$$
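
As a rough sketch of how these two intermediate values might be computed, assume each comment is a dict with a `replies` list of child comments. That structure is hypothetical, not The Coral Project's actual schema. Depth here counts the levels of replies beneath the top-level comment, and width is the largest number of direct replies any single comment in the thread has; both definitions are assumptions filled in from the description above.

```python
def max_thread_depth(comment):
    """Number of reply levels below `comment` (0 if it has no replies)."""
    replies = comment.get('replies', [])
    if not replies:
        return 0
    return 1 + max(max_thread_depth(r) for r in replies)


def max_thread_width(comment):
    """Largest number of direct replies to any comment in the thread."""
    replies = comment.get('replies', [])
    return max([len(replies)] + [max_thread_width(r) for r in replies])


def thread_discussion_score(thread):
    """Product of the two intermediate values for one top-level thread."""
    return max_thread_depth(thread) * max_thread_width(thread)


# Hypothetical thread: a top-level comment with two replies,
# one of which has a further reply.
thread = {'replies': [
    {'replies': [{'replies': []}]},
    {'replies': []},
]}
lonely = {'replies': []}  # top-level comment nobody replied to

print(thread_discussion_score(thread))  # depth 2 * width 2
print(thread_discussion_score(lonely))  # no replies, so no discussion
```

Under these definitions a top-level comment with no replies scores zero, which matches the intent that a section made up only of top-level comments should not score highly.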

The discussion score for the entire asset is the value that answers this question: if a new thread were to start in this asset, what discussion score would it have?

The idea is that if a section is generating a lot of discussion, a new thread would likely also involve a lot of discussion.

The nice thing about this approach (which is similar to the one used throughout all these sketches) is that we can capture uncertainty. When a new article is posted, we have no idea how good a discussion a new thread might produce. When we have one or two threads (maybe one is long and one is short) we’re still not too sure, so we keep a fairly conservative score. But as more and more people comment, we begin to narrow in on the “true” score for the article.

More concretely (skip ahead to be spared the gory details), we assume that a thread’s discussion score is drawn from a Poisson distribution. This makes things a bit easier to model because the gamma distribution is a conjugate prior for the Poisson rate.

By default, the gamma prior is parameterized with shape $k=1$ and scale $\theta=2$, since that is a fairly conservative estimate to start. That is, we begin with the assumption that any new thread is unlikely to generate a lot of discussion, so it will take a lot of discussion to really convince us otherwise.

Since this gamma-Poisson model will be reused elsewhere, it is defined as its own function:

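    import numpy as np
    from scipy import stats

    def gamma_poission_model(X, n, k, theta, quantile):
        # Posterior is Gamma(k + sum(X), scale=theta / (theta*n + 1))
        k = np.sum(X) + k
        t = theta/(theta*n + 1)
        return stats.gamma.ppf(quantile, k, scale=t)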

Since the gamma is a conjugate prior here, the posterior is also a gamma distribution with easily-computed parameters based on the observed data (i.e. the “actual” top-level threads in the discussion).
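
Written out, the standard gamma-Poisson update that the function above applies is as follows, where $x_1, \dots, x_n$ are the observed thread scores (the symbols $x_i$ and $\lambda$ are introduced here for illustration only):

$$\lambda \sim \text{Gamma}(k, \theta), \quad x_i \mid \lambda \sim \text{Poisson}(\lambda) \;\Longrightarrow\; \lambda \mid x_1, \dots, x_n \sim \text{Gamma}\left(k + \sum_{i=1}^{n} x_i,\; \frac{\theta}{n\theta + 1}\right)$$

The posterior mean is $\left(k + \sum_i x_i\right) \frac{\theta}{n\theta + 1}$, which equals the prior mean $k\theta$ when no threads have been observed and approaches the average observed score as $n$ grows.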

We still need a single number to work with, so we need a point estimate of the expected discussion score. The posterior mean may be too optimistic, especially when we only have a couple of top-level threads to look at. Instead, we take the lower bound of the 90% credible interval (the 0.05 quantile) as a more conservative estimate.

So the final function for computing an asset’s discussion score is:

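    def asset_discussion_score(threads, k=1, theta=2):
        # One observation per top-level thread: max depth * max width.
        # max_thread_depth / max_thread_width are assumed helpers (not defined in this snippet).
        X = np.array([max_thread_depth(t) * max_thread_width(t) for t in threads])
        n = len(X)

        # The model applies the posterior update itself, so pass the prior k and theta.
        return {'discussion_score': gamma_poission_model(X, n, k, theta, 0.05)}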
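
To see the effect of the conservative estimate, here is a small self-contained sketch using made-up thread scores (the numbers are illustrative, not real data). It applies the same posterior update directly and compares the 0.05 quantile with the posterior mean as more threads are observed; the function name `conservative_score` is introduced only for this example.

```python
import numpy as np
from scipy import stats


def conservative_score(X, k=1, theta=2, quantile=0.05):
    """Return (0.05 quantile, mean) of the gamma posterior over thread scores."""
    n = len(X)
    k_post = np.sum(X) + k            # posterior shape
    t_post = theta / (theta * n + 1)  # posterior scale
    lower = stats.gamma.ppf(quantile, k_post, scale=t_post)
    mean = k_post * t_post
    return lower, mean


# Made-up per-thread discussion scores, all equal to 6.
few = [6, 6]
many = [6] * 50

for X in ([], few, many):
    lower, mean = conservative_score(X)
    print(len(X), round(lower, 2), round(mean, 2))
```

With no threads the score reflects only the prior. With two threads the lower bound sits well below the posterior mean, and with fifty it moves much closer to the observed value of 6.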

A similar approach is used for an asset’s diversity score. Here we ask the question: if a new comment is posted, how likely is it to be posted by someone new to the discussion?

We can model this with a beta-binomial model; again, the beta distribution is a conjugate prior for the binomial distribution, so we can compute the posterior’s parameters very easily:

    def beta_binomial_model(y, n, alpha, beta, quantile):
        alpha_ = y + alpha
        beta_ = n - y + beta
        return stats.beta.ppf(quantile, alpha_, beta_)
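
In symbols, with $y$ successes out of $n$ trials, the standard conjugate update that this function applies is (the symbol $p$ for the unknown probability is introduced here for illustration only):

$$p \sim \text{Beta}(\alpha, \beta), \quad y \mid p \sim \text{Binomial}(n, p) \;\Longrightarrow\; p \mid y \sim \text{Beta}(\alpha + y,\; \beta + n - y)$$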

Again we start with conservative parameters for the prior, $\alpha=2, \beta=2$, and then compute using threads as evidence:

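    def asset_diversity_score(threads, alpha=2, beta=2):
        X = set()  # distinct users seen so far
        n = 0      # total number of comments
        for t in threads:
            # unique_participants is an assumed helper (not defined in this snippet)
            # returning the set of users in a thread and its comment count.
            users, n_comments = unique_participants(t)
            X = X | users
            n += n_comments
        y = len(X)

        return {'diversity_score': beta_binomial_model(y, n, alpha, beta, 0.05)}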
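
As a quick illustration of how this score behaves, the sketch below runs the same beta-binomial estimate on two made-up lists of comment authors. It is a simplification: it takes a flat list of author names rather than thread objects, and the names and the `diversity_score` helper are hypothetical.

```python
from scipy import stats


def beta_binomial_model(y, n, alpha, beta, quantile):
    alpha_ = y + alpha
    beta_ = n - y + beta
    return stats.beta.ppf(quantile, alpha_, beta_)


def diversity_score(comment_authors, alpha=2, beta=2):
    """Conservative estimate that a new comment comes from someone new."""
    n = len(comment_authors)       # total comments observed
    y = len(set(comment_authors))  # distinct commenters
    return beta_binomial_model(y, n, alpha, beta, 0.05)


# Made-up author lists: same number of comments, different variety.
two_people = ['ana', 'bo'] * 5                              # 10 comments, 2 authors
many_people = ['u%d' % i for i in range(8)] + ['u0', 'u1']  # 10 comments, 8 authors

print(diversity_score(two_people))
print(diversity_score(many_people))
```

Ten comments traded between two people score lower than ten comments spread across eight people, which is the behavior the diversity score is meant to capture.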

Then averages for these scores are computed across the entire sample of assets in order to give some context as to what good and bad scores are.

## User-level metrics

User-level metrics are computed in a similar fashion. For each user, four metrics are computed:

• a community score, which quantifies how much the community approves of them. This is computed by trying to predict the number of likes a new post by this user will get.
• an organization score, which quantifies how much the organization approves of them. This is the probability that a post by this user will get “editor’s pick” or some equivalent (in the case of Reddit, “gilded”, which isn’t “organizational” but holds a similar revered status).
• a discussion score, which quantifies how much discussion this user tends to generate. This answers the question: if this user starts a new thread, how many replies do we expect it to have?
• a moderation probability, which is the probability that a post by this user will be moderated.

The community score and discussion score are both modeled as gamma-Poisson models using the same function as above. The organization score and moderation probability are both modeled as beta-binomial models using the same function as above.
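
A minimal sketch of how the four user-level metrics could be assembled from these two models follows. The input names (`likes_per_post`, `replies_per_thread`, `n_picked`, `n_moderated`) are hypothetical, and reusing the asset-level priors and the 0.05 quantile for every user metric is an assumption; the post does not say which priors or quantiles are used for users.

```python
import numpy as np
from scipy import stats


def gamma_poisson_lower(X, k=1, theta=2, quantile=0.05):
    """Conservative estimate of a count (likes, replies) for the next post."""
    n = len(X)
    return stats.gamma.ppf(quantile, np.sum(X) + k, scale=theta / (theta * n + 1))


def beta_binomial_lower(y, n, alpha=2, beta=2, quantile=0.05):
    """Conservative estimate of a probability from y events in n posts."""
    return stats.beta.ppf(quantile, y + alpha, n - y + beta)


def user_scores(likes_per_post, replies_per_thread, n_picked, n_moderated):
    # Assumption: priors and quantile mirror the asset-level defaults.
    n_posts = len(likes_per_post)
    return {
        'community_score': gamma_poisson_lower(likes_per_post),
        'discussion_score': gamma_poisson_lower(replies_per_thread),
        'organization_score': beta_binomial_lower(n_picked, n_posts),
        'moderation_prob': beta_binomial_lower(n_moderated, n_posts),
    }


# Made-up history for one user: four posts, two of which started threads.
scores = user_scores(
    likes_per_post=[3, 0, 5, 2],
    replies_per_thread=[1, 4],
    n_picked=1,
    n_moderated=0,
)
print(scores)
```

Whether a lower bound is the right summary for the moderation probability is a judgment call: an organization that wants to err on the side of caution might prefer an upper quantile there instead.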

## Time for more refinement

These metrics are just a few starting points to shape into more sophisticated and nuanced scoring systems. There are some desirable properties missing, and of course, every organization has different principles and values, and so the ideas presented here are not one-size-fits-all, by any means. The challenge is to create some more general framework that allows people to easily define these metrics according to what they value.

Image by DarwinPeacock, Maklaan, CC-BY