
How do you know early on whether a new game concept is worth pursuing? This guide explains how to evaluate early player data, choose the right North Star metric, use benchmarks and experiments, and make better product decisions during prototype and soft launch.
When a new game concept starts showing early data, the hardest part is rarely finding numbers. It is knowing which numbers matter, which ones can be trusted, and when mixed signals mean “keep going” versus “rethink the concept.”
That is the core challenge explored in this conversation with Clara Livingston of New York Times Games: how to tell, as early as possible, whether a game idea has real potential. Watch the full episode here.
The answer is not to watch every metric equally. It is to decide what success means first, then use data to test whether the game is moving toward that goal.
As Clara puts it, “You have to have that path. You have to have a North Star and a benchmark.”
Start with a North Star, not a dashboard
In early-stage game analysis, one of the biggest mistakes is trying to optimize everything at once. New concepts produce mixed signals almost by definition. Installs may look promising while retention looks weak. Early engagement may be solid, but progression falls apart. If the team reacts equally to every number, it becomes easy to justify almost any conclusion.
That is why Clara emphasizes the value of choosing a North Star metric early. Not because it is the only thing worth watching, but because it gives the team a consistent way to interpret mixed data. For example, if the main goal of a soft launch is to understand whether players will stick with the game, then retention should dominate the analysis. Other metrics still matter, but they should not distract from the central question.
Her warning is practical: “It is hard to try to move all of them at the same time.”
Why benchmarks matter early
A North Star alone is not enough. Teams also need a reference point. Benchmarks help answer a simple but critical question: is this number good for this kind of game, at this stage? For some studios, internal comparisons are available. If the team has tested or launched similar titles before, those products become useful context. For earlier-stage studios without that history, external comparisons matter more.
That does not mean copying top performers blindly. It means using category-level data to understand what realistic targets look like. Clara’s point here is important: benchmarks are not there to flatten ambition. They are there to prevent teams from drifting into fantasy. If a team is expecting a retention curve far above what similar games achieve, it needs to understand how high that bar really is.
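To make the benchmark comparison concrete, here is a minimal sketch of how a team might line up its soft-launch retention against category-level reference points. All of the numbers are hypothetical placeholders, not figures from the conversation.

```python
# Minimal sketch: compare observed soft-launch retention against category
# benchmarks. All numbers below are hypothetical placeholders.

observed = {"D1": 0.34, "D7": 0.11, "D30": 0.04}

# Illustrative category-level benchmarks (e.g. a median for similar games).
benchmark = {"D1": 0.40, "D7": 0.15, "D30": 0.06}

for day, target in benchmark.items():
    actual = observed[day]
    gap = actual - target
    status = "at or above benchmark" if gap >= 0 else f"{abs(gap):.0%} below benchmark"
    print(f"{day}: observed {actual:.0%} vs benchmark {target:.0%} -> {status}")
```

The output is not a verdict on the game; it simply shows how far the current numbers sit from a realistic bar, which is exactly the drift-into-fantasy check described above.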
Big changes are more useful than tiny ones
Once a team has a North Star and sees that performance is off target, the next question is how to move it. Clara’s answer is very clear: if the metric that matters is far from where it needs to be, small changes are not enough.
She argues that when teams are worried about something as important as retention, they should test bold interventions first. Her logic is that a major change gives the clearest signal. If it works, the team learns where value may be unlocked. If it does not, that is also information.
"You’re gonna move the needle the furthest with your biggest bet.”
That makes early-stage experimentation less about polishing and more about discovering whether the game has enough elasticity to improve.
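One way to see why bold tests give the clearest signal is detectability: at a fixed soft-launch audience size, a large retention lift stands out from sampling noise while a tiny one does not. The sketch below illustrates this with a rough two-sample proportion power calculation; the baseline rate, lifts, and cohort size are hypothetical assumptions, not numbers from the discussion.

```python
# Minimal sketch: how detectable a D1 retention lift is at a given audience
# size, using an approximate two-sample proportion power calculation.
# Baseline, lifts, and sample sizes are hypothetical placeholders.
from math import sqrt
from statistics import NormalDist

def power_for_lift(p_base, lift, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test for p_base vs p_base + lift."""
    p_test = p_base + lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(p_base * (1 - p_base) / n_per_arm + p_test * (1 - p_test) / n_per_arm)
    return 1 - NormalDist().cdf(z_alpha - lift / se)

baseline_d1 = 0.35       # hypothetical control D1 retention
players_per_arm = 2000   # hypothetical soft-launch cohort size per test arm

for lift in (0.01, 0.03, 0.08):
    p = power_for_lift(baseline_d1, lift, players_per_arm)
    print(f"+{lift:.0%} D1 lift with {players_per_arm} players/arm -> ~{p:.0%} power")
```

Under these assumptions, a one-point lift is essentially invisible, while an eight-point lift is unambiguous, which is the practical argument for leading with the biggest bet.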
If the big change does nothing, that is the real warning sign
One of the strongest ideas in the discussion is that teams should be most concerned not when a bold test goes the wrong way, but when it does not move anything at all. If a major change produces no meaningful movement in the target metric, that can signal a deeper issue with the concept, core loop, or audience fit.
Clara says it plainly: “You can’t move an immovable object.” That is a useful mindset shift. A failed big change is still informative. It tells the team something about where the real constraints may be. But if everything becomes a fight over tiny percentage differences, the team may already be too far into refinement mode for a product that has not yet proven its base potential.
Mixed metrics do not mean you should change your goal
Another common problem in early analysis is that one metric improves while another worsens. For example, a major onboarding or progression change may boost short-term retention but reduce a later-stage metric. Or the primary target gets worse while some secondary metrics improve. Clara’s advice is not to panic and choose whichever metric looks best after the fact. If the team already agreed on a North Star, that should remain the anchor.
Her phrasing is helpful here: “Don’t change your North Star just because your numbers flipped around.”
Instead, teams should ask where in the funnel the change was supposed to have impact, and whether the movement in other metrics reveals something about the player experience. If an onboarding change moves D1, that makes sense. If a later progression adjustment unexpectedly shifts D1, that deserves investigation. The point is not to ignore side effects. It is to interpret them in relation to the original product question.
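As a minimal illustration of that kind of check, the sketch below compares retention at a few points in the funnel between a control group and a variant. The metric names and numbers are hypothetical, not figures from the discussion.

```python
# Minimal sketch: check where in the retention funnel a change actually landed.
# Cohort numbers are hypothetical placeholders.

control = {"D1": 0.35, "D7": 0.14, "D30": 0.05}
variant = {"D1": 0.41, "D7": 0.13, "D30": 0.05}   # e.g. after an onboarding change

for day in control:
    delta = variant[day] - control[day]
    print(f"{day}: control {control[day]:.0%}, variant {variant[day]:.0%}, delta {delta:+.0%}")

# Read against the original question: an onboarding change moving D1 is
# expected; an unexpected drop at D7 or D30 would deserve investigation
# before anyone concludes the test "worked" or "failed".
```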
Small audiences create noisy data, but not useless data
A major challenge in early-stage testing is limited audience size. When a game only has a few thousand players, day-to-day variability can look dramatic. Metrics like retention or DAU may swing in ways that feel alarming, even when the underlying average is relatively stable. Clara explains this through distributions: smaller audiences naturally create wider variance. Larger audiences compress that variance and make smaller changes easier to interpret.
That means teams in soft launch need to get comfortable with the shape of their data before overreacting to short-term movement. The question is not whether there is fluctuation. The question is whether the fluctuation is normal for the audience size and acquisition setup. This is why time matters. If a team chooses to stay in a narrow test environment, it gains control, but it sacrifices speed and certainty.
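To make the variance point concrete, the sketch below estimates how much a daily D1 retention reading is expected to swing from sampling noise alone at different audience sizes. The underlying rate and cohort sizes are hypothetical assumptions.

```python
# Minimal sketch: expected sampling noise in a daily D1 retention reading
# at different audience sizes, assuming a stable underlying rate.
# The baseline rate and cohort sizes are hypothetical placeholders.
from math import sqrt

true_d1 = 0.35  # hypothetical stable underlying D1 retention

for daily_cohort in (200, 1000, 5000, 50000):
    se = sqrt(true_d1 * (1 - true_d1) / daily_cohort)
    low, high = true_d1 - 2 * se, true_d1 + 2 * se
    print(f"{daily_cohort:>6} new players/day: typical daily reading "
          f"{low:.1%} to {high:.1%} (±{2 * se:.1%})")
```

With a few hundred new players a day, swings of several percentage points are normal even when nothing has changed; at larger volumes the same swing would be a real signal.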
When to stop making giant changes
Eventually, every game reaches a point where the question shifts from “does this concept have potential?” to “how do we improve what is already working?” That is the point where giant changes become less useful and finer iteration becomes more appropriate.
Clara suggests that this usually happens when:
- the game has found a more stable player base
- the core metrics have become more consistent
- bigger tests have already moved the game closer to its target
- the audience is large enough for smaller effects to be measurable
At that stage, the team can begin refining specific parts of the funnel more precisely. But until then, trying to optimize micro-details too early can become expensive and misleading.
As she notes, “It’s really expensive to get stuck in that loop.”
Early concepts still need player context
Another important layer in the discussion is that numbers are not enough on their own. Clara highlights the value of combining product data with player understanding. That means not only watching the metrics, but also understanding what players are experiencing and why.
In practice, that often means:
- looking at where players drop in specific funnels (see the sketch after this list)
- comparing player behavior across cohorts
- pairing analytics with research
- identifying whether the game is failing because of concept, flow, or expectation mismatch
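For the first two items, a minimal sketch of what that inspection can look like is shown below; the step names, cohorts, and counts are hypothetical placeholders.

```python
# Minimal sketch: spot where players drop inside a specific funnel, broken out
# by cohort. Step names and counts are hypothetical placeholders.

funnel_steps = ["install", "finished_tutorial", "completed_level_3", "returned_day_1"]

cohorts = {
    "organic": [1000, 720, 430, 310],
    "paid_ua": [1000, 610, 280, 170],
}

for cohort, counts in cohorts.items():
    print(f"\n{cohort}")
    for prev, step, n_prev, n in zip(funnel_steps, funnel_steps[1:], counts, counts[1:]):
        drop = 1 - n / n_prev
        print(f"  {prev} -> {step}: {n}/{n_prev} kept ({drop:.0%} drop)")
```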
This matters even more when the first audience is unusually loyal or brand-aware. Early players are not always representative of the wider market. They can help validate some parts of the experience, but not necessarily predict broader adoption.
Final takeaway
The clearest message from this discussion is that early concept evaluation is less about collecting more data and more about using the right frame. Teams need to decide what success means, choose a North Star metric, set a realistic benchmark, and then be willing to test big enough changes to learn whether the concept can actually move. If those bigger changes create real movement, the team can refine. If they do not, that is a sign to step back and rethink what the game needs.
Clara’s framing is the most useful summary: “If your big thing or your big change isn’t causing any difference, you need to consider maybe some bigger changes overall or core gameplay changes.”
That is what early analysis should do. Not just measure a prototype, but help a team understand whether it has something worth pushing further.
FAQ
How do you know early if a new game concept has potential?
Start by choosing a North Star metric and a benchmark. Then test whether the game can move that metric in a meaningful way through bold enough product changes.
Should early-stage teams focus on one metric or many?
Teams will look at multiple metrics, but they should prioritize one clear North Star to guide decisions. Trying to optimize everything at once usually creates confusion.
Why are benchmarks important in soft launch?
Benchmarks help teams understand whether their current numbers are realistic, strong, or weak compared to similar games or past internal tests.
Are big product changes better than small iterations early on?
Usually yes. If the game is far from its target, larger changes are more likely to reveal whether the concept can improve in a meaningful way.
What does it mean if a big experiment changes nothing?
That is often a stronger warning sign than a failed result. It may suggest that the core issue is deeper than the specific thing being tested.
How should teams think about noisy early data?
Small player bases naturally create more variance. Teams should look at trends, distributions, and normal variability for their audience size rather than overreacting to single-day changes.

