(2017-04-03) Data + Intuition: A Hybrid Approach to Developing Product North Star Metrics
Data + Intuition: A Hybrid Approach to Developing Product North Star Metrics (at LinkedIn). We present a five-step framework for developing quality metrics using a combination of machine learning and product intuition. Machine learning ensures that the metric accurately captures user experience. Product intuition makes the metric interpretable and actionable.
Through a case study of the Endorsements product at LinkedIn
wrong metrics have the potential to mislead the business. For example, at Bing, the Microsoft-owned search engine, two key metrics, queries per user and revenue per user, actually increased when a bug degraded search relevance
In addition to the usual emphasis on prediction accuracy, our approach puts a heavy emphasis on making the metric actionable and intuitive, an important concern for a metric to be adopted in product development. We show how product intuition can be used jointly with machine learning for this purpose
We demonstrate that an in-product survey can be a scalable way to collect high-quality labeled data.
2. BACKGROUND AND RELATED WORK
2.1 LinkedIn Endorsements
The vision of Endorsements is to build the largest and most trusted professional peer validation system. Endorsements allow users to vouch for the expertise of other users
The Endorsements product was first introduced to LinkedIn in 2012, and has been heavily used by LinkedIn users since inception. Today there are more than 10 billion total endorsements
This work was motivated by the realization that not all endorsements are equally valuable. For example, an endorsement from an adviser or colleague is likely more significant than an endorsement from a social acquaintance. We wanted to identify the ones which best serve the product vision.
2.2 Literature on what constitutes a good endorsement
there is literature around related products such as reviews and (up/down) votes
Yet endorsements are not entirely like reviews; they are much more lightweight and other users cannot comment on whether or not they are helpful. In that sense, endorsements are more similar to the upvotes on question and answer sites like Quora and StackOverflow
The common theme across reviews and upvotes is that the credibility of the reviewer or voter determines the value of their opinion
2.3 Literature on using metrics to guide product decisions
2.3.1 Measuring volume count metrics
Engagement metrics are easy to track and give a good baseline of overall performance, but a common problem is that raw activity doesn’t necessarily translate to the quality of the user experience or the business objectives
2.3.2 Measuring experience quality metrics
Simple aggregation (e.g. count, ratio) of user actions may be insufficient. For example, the Web Search community often wants to measure search success or searcher satisfaction. Neither of these can be reliably defined by counting intuitive actions like search queries or result clicks; for example, in the case of “good abandonment,” the searcher finds the information on the search result page and has no need to reformulate the query or click a result
An important difference between our work and most of the above literature is that the ultimate goal of our work is to create a metric that can be used to drive product development, while the focus of most of the cited work is on prediction accuracy.
Our work expands upon the modeling-based work reviewed in this section by presenting a concrete framework for developing accurate, intuitive, and actionable metrics that capture user experience. The framework shows how data insights and product intuition play complementary roles in this process
3. METRIC DEFINITION FRAMEWORK
There are five steps
3.1 Collect labeled data
The first step is to collect labeled data that measures the True North success of the product (e.g. quality of user experience).
In many cases, clickstream data are unable to measure quality, so a survey may be appropriate
In our case, we want to know which endorsements serve the purpose of validating a LinkedIn user’s skills. We believe it is difficult to infer quality from clickstream logs. We could try to measure it based on whether an endorsement leads to more interest from recruiters. However, from discussions with our recruiter colleagues at LinkedIn, the presence of an endorsement is a small part of the decision, whereas past experience bears more weight. That seems like a major piece of info. Do Endorsements have any functional value?
We therefore took the survey approach, asking endorsement recipients one of two questions when they are notified about receiving the endorsement.
The first question distinguishes valid endorsements from social ones. The endorser should be able to assess the recipient’s skill level well enough to give a lightweight recommendation. The second question asks which endorsements satisfy the recipient’s goal to improve their reputation. We took care to be specific in our wording so that the questions are clear, explicitly stating both the endorser’s name and the skill name. We used the familiar five-point rating scale with standard Likert scale choices (visible upon tapping on stars).
For the purposes of metric development, we collected survey data over 18 days, for a total of 30,563 responses from 25,422 users.
3.2 Identify broad set of relevant signals
The goal is to identify all the quality endorsements without surveying all recipients. To do this, we need signals indicative of quality that are known at the time an endorsement is given. In this step, we identify as many relevant signals as possible. Not all will be used in the final metric definition.
By discussing with the product experts, we identified a total of 84 signals
3.3 Apply machine learning to identify top signals
To prepare the survey responses for modeling, 1-3 star endorsements were considered ‘not quality’ and 5-star were considered ‘quality.’ We discarded 4-star endorsements as a neutral buffer
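The labeling rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the `(endorsement_id, stars)` input shape are assumptions made for the example.

```python
from typing import List, Optional, Tuple


def label_from_stars(stars: int) -> Optional[bool]:
    """Map a 1-5 star survey rating to a binary quality label.

    Returns True for 'quality' (5 stars), False for 'not quality' (1-3 stars),
    and None for 4 stars, which are discarded as a neutral buffer.
    """
    if not 1 <= stars <= 5:
        raise ValueError(f"rating must be between 1 and 5, got {stars}")
    if stars == 5:
        return True
    if stars == 4:
        return None
    return False


def build_training_labels(
    responses: List[Tuple[str, int]],
) -> List[Tuple[str, bool]]:
    """Keep only responses with a usable label (hypothetical input shape)."""
    labeled = []
    for endorsement_id, stars in responses:
        label = label_from_stars(stars)
        if label is not None:
            labeled.append((endorsement_id, label))
    return labeled
```

Dropping the 4-star responses shrinks the training set but widens the gap between the two classes, which is the trade-off the buffer is meant to buy.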
The top features of each model were selected, looking for a natural cutoff in feature importance. As a result, this narrowed our original set of 84 features down to 12
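The notes do not say how the "natural cutoff" was located. One simple heuristic, shown here purely as an assumption, is to rank features by importance and cut at the largest drop between consecutive scores. The function name, the `min_keep` parameter, and the sample importances are all hypothetical.

```python
from typing import Dict, List


def select_top_features(importances: Dict[str, float], min_keep: int = 1) -> List[str]:
    """Keep features ranked above the largest gap in importance.

    Assumed heuristic: sort by importance (descending) and cut where the
    drop between two neighbouring scores is largest, keeping at least
    `min_keep` features.
    """
    if min_keep < 1:
        raise ValueError("min_keep must be at least 1")
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) <= min_keep:
        return [name for name, _ in ranked]
    # gaps[i] is the drop from the feature at rank i to the one at rank i + 1.
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = max(range(min_keep - 1, len(gaps)), key=lambda i: gaps[i])
    return [name for name, _ in ranked[: cut + 1]]


# Made-up importances for illustration only.
example = {"signal_a": 0.40, "signal_b": 0.35, "signal_c": 0.05, "signal_d": 0.04}
top = select_top_features(example)
```

In practice the cutoff would likely be checked by eye against a plot of the ranked importances rather than chosen mechanically.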
3.4 Propose candidate definitions using top signals
The goal is to find 2-3 intuitive metrics
For Endorsements, product intuition was guided by user research. From interviewing users, we learned that the endorser’s reputation and relationship to the recipient affect how the endorsement is perceived.
The result was a set of three candidate definitions of Quality Endorsements
3.5 Pick winning definition using human judgment
By letting the team choose the final metric, we secured buy-in from the stakeholders who will be using and relying on the metric
4. RESULTS AND DISCUSSION
4.1 Model performance
4.2 Comparison of definitions
Based on the model output, we constructed three candidate definitions of Quality Endorsements, each of which included the top signals identified from the models. We compared them with two baseline definitions. The first baseline definition is effectively what the team used before this work took place: treating every endorsement as quality (B1). The second baseline definition is based solely on product intuition (B2): we discussed with the product experts before surveying users and hypothesized that a quality endorsement is one given by a coworker, classmate, or senior-level user, who is a top expert in the skill area. M1, M2, and M3 are the final three candidates crafted through the metric development framework. They each include the components of knowing the person and knowing the skill. But they vary in strictness of what it means to satisfy each condition
After seeing these results, the product team chose M1 as the final definition over M2 and M3 because it achieves the highest recall while maintaining high precision
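The comparison rests on precision and recall of each rule-based definition against the survey labels. The sketch below shows that calculation on made-up data. The signal names `knows_person` and `knows_skill` echo the two components described above, but the rules and records are hypothetical and are not the actual M1, M2, or M3.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Endorsement = Dict[str, bool]


def precision_recall(predicted: Sequence[bool], actual: Sequence[bool]) -> Tuple[float, float]:
    """Precision and recall of predicted 'quality' flags against survey labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def evaluate(
    definition: Callable[[Endorsement], bool],
    endorsements: List[Endorsement],
    labels: Sequence[bool],
) -> Tuple[float, float]:
    """Apply a candidate definition to every endorsement and score it."""
    predicted = [definition(e) for e in endorsements]
    return precision_recall(predicted, labels)


# Toy records and labels, invented for the example.
endorsements = [
    {"knows_person": True, "knows_skill": True},
    {"knows_person": True, "knows_skill": False},
    {"knows_person": False, "knows_skill": True},
    {"knows_person": False, "knows_skill": False},
]
labels = [True, True, False, False]


def baseline_all(e: Endorsement) -> bool:
    """Baseline in the spirit of B1: every endorsement counts as quality."""
    return True


def strict_rule(e: Endorsement) -> bool:
    """Hypothetical strict rule: endorser knows both the person and the skill."""
    return e["knows_person"] and e["knows_skill"]


baseline_scores = evaluate(baseline_all, endorsements, labels)
strict_scores = evaluate(strict_rule, endorsements, labels)
```

A treat-everything-as-quality baseline always reaches full recall, with precision equal to the share of quality endorsements in the sample, so the candidates are effectively judged on how much precision they gain for the recall they give up.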
By combining machine learning with product intuition, the result is an accurate and sensitive metric that is easy to communicate and use.
5. PRODUCT IMPACT
In a data-driven organization, it is important to have the right metrics to create the right product. In this section, we show how optimizing for total endorsements influenced the existing Endorsements product, and how the new Quality Endorsement metric is driving changes in the right direction.
5.1 Past Metrics Drove the Wrong Goal
When Endorsements was introduced in 2012, the North Star metric was total endorsements given
While the total endorsements metric certainly increased as a result of these promos, the value of the Endorsements product did not necessarily improve.
the recipient is concerned about building a strong reputation, and the profile viewer (e.g. hiring manager) is interested in making an evaluation.
Although total endorsements was the right metric at product launch, it became a misleading metric over time. Focusing on a misleading metric blinded us to a user experience that had drifted away from the product's main purpose: validating users’ skills and giving the viewer a way to assess expertise. Users were skeptical about trusting endorsements as a measure of expertise, because it was hard to tell the signal from the noise.
5.2 New Product Direction
we changed how we present the suggestions to users. We explain the reason for the suggestion in the context of the endorser’s skill and relationship with the recipient (Figure 4). Our A/B tests indicate that these changes increased Quality Endorsements given by over 50%.
Survey responses from members have improved noticeably over the last eight months (Figure 6). In absolute terms, the percentage of 5-star responses has increased by nearly 5 percentage points, while 1-star responses have decreased by around 1 percentage point.