KPI Design

The Highest-Leverage Evaluation Decision

A federally qualified health center in eastern Oregon receives a three-year HRSA Rural Communities Opioid Response grant. The program manager inherits a set of 12 KPIs from the application narrative, written six months earlier by a grant writer who is no longer with the organization. Three of the KPIs are solid. The other nine will produce either data nobody can act on, numbers that arrive too late to matter, or incentives that actively degrade the program they were meant to measure.

This is not a hypothetical edge case. It is the default condition. KPI design is the single highest-leverage decision in program evaluation because KPIs are not passive observation instruments. They are feedback loops. A well-designed KPI tells the program manager whether the intervention is working in time to adjust course. A poorly designed KPI tells the program manager nothing useful until the reporting deadline, at which point the data confirms a failure that can no longer be corrected. And the worst-designed KPIs — the ones that are simultaneously measurable, visible, and tied to funding consequences — do not merely fail to inform. They actively redirect program effort away from outcomes and toward metric performance, through the Goodhart’s Law dynamics described in Human Factors Module 8.

The logic model from the preceding page (PF M5, Logic Models) establishes the causal chain: inputs produce activities, activities produce outputs, outputs produce outcomes. KPIs are the measurement instruments placed along that chain. Where you place them, how you define them, and how frequently you measure them determine whether the evaluation system detects reality or constructs a parallel accounting of activity that may have no relationship to actual program impact.

KPIs as Feedback Loops

The purpose of a KPI is not documentation. It is course correction.

This distinction is the source of most KPI design failures in grant-funded programs. When KPIs are designed for reporting — to satisfy a funder’s data requirements at the end of a performance period — they optimize for completeness and defensibility. The question becomes: can we produce a number for every required field? When KPIs are designed for learning — to tell the program team whether the intervention is working — they optimize for timeliness and actionability. The question becomes: does this number tell us something we can act on before the next measurement?

Donabedian’s structure-process-outcome framework (1966) provides the foundational taxonomy. Structure KPIs measure what is in place (staffing levels, equipment, protocols). Process KPIs measure what is being done (screening rates, referral completion, training hours). Outcome KPIs measure what changed (symptom reduction, wait time decrease, readmission reduction). Each level has a causal relationship to the next: adequate structure enables effective process, effective process produces desired outcomes. A KPI set that measures only structure tells you what the program has. A KPI set that measures only process tells you what the program does. Only outcome KPIs tell you what the program achieved. And only the combination — with structure and process KPIs serving as leading indicators of outcome KPIs — creates a feedback loop that supports intervention while there is still time to intervene.

The IHI (Institute for Healthcare Improvement) measurement framework distinguishes three purposes for measurement: improvement, accountability, and research. Improvement measurement requires frequent data, small sample sizes, and rapid feedback — the goal is to detect change in time to act. Accountability measurement requires standardized definitions, risk adjustment, and periodic reporting — the goal is fair comparison across programs. Research measurement requires statistical rigor, controlled conditions, and sufficient power — the goal is causal inference. Most grant KPIs attempt to serve all three purposes simultaneously and succeed at none. A quarterly-reported, self-collected, unstandardized metric is too infrequent for improvement, too unstandardized for accountability, and too uncontrolled for research. KPI design must choose a primary purpose and optimize for it.

For grant-funded healthcare transformation programs, the primary purpose should be improvement. The program team needs to know, month by month, whether the intervention is producing the expected changes along the logic model’s causal chain. Accountability data for the funder can be derived from improvement data by aggregating and standardizing at reporting intervals. But the reverse does not work: accountability metrics collected quarterly cannot be disaggregated into monthly improvement signals after the fact. Design for improvement first, then aggregate for accountability.
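The one-way relationship between improvement data and accountability data can be shown with a minimal sketch. The monthly counts below are hypothetical, and the function name is illustrative only. The point is that quarterly rates fall out of monthly counts by summing numerators and denominators, while nothing in a quarterly figure allows the monthly values to be recovered.

```python
from collections import defaultdict

# Hypothetical monthly screening counts: (month, patients screened, eligible encounters).
monthly = [
    (1, 90, 300), (2, 120, 310), (3, 150, 290),
    (4, 170, 305), (5, 185, 300), (6, 200, 295),
]

def quarterly_rates(rows):
    """Aggregate monthly counts into quarterly rates by summing counts first."""
    totals = defaultdict(lambda: [0, 0])
    for month, numerator, denominator in rows:
        quarter = (month - 1) // 3 + 1
        totals[quarter][0] += numerator
        totals[quarter][1] += denominator
    return {q: n / d for q, (n, d) in sorted(totals.items())}

rates = quarterly_rates(monthly)
for quarter, rate in rates.items():
    print(f"Q{quarter}: {rate:.1%}")
```

Summing counts before dividing, rather than averaging the three monthly rates, keeps the quarterly figure correct when monthly volumes differ.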

The KPI Quality Test

Not all measurable things are worth measuring, and not all important things are measurable with available resources. The quality test for a KPI has five dimensions, each assessed independently. A KPI that fails a dimension should be redesigned to address that dimension or replaced.

Valid. The KPI measures what it claims to measure. A “behavioral health access” KPI that counts the number of behavioral health appointments scheduled measures scheduling activity, not access. Access requires that the appointment is available within a clinically appropriate timeframe, that the patient can reach the service (transportation, telehealth availability), and that the patient actually receives care. The appointment count is a proxy for access, and the gap between the proxy and the construct is where gaming enters (HF M8). Validity requires explicit articulation of what the KPI is intended to represent and critical examination of whether the operational definition actually captures it. The CDC evaluation framework (1999) identifies accuracy as one of four evaluation standards; validity is the measurement-level expression of accuracy.

Reliable. Repeated measurement under the same conditions produces the same result. If two different staff members extract the same KPI from the same data source and get different numbers, the KPI is unreliable. Reliability fails most often when KPIs depend on clinical judgment (was this a “positive” screen?), manual chart review (does this encounter qualify?), or inconsistent data entry practices (was the PHQ-9 entered as a structured field or a free-text note?). EHR-extracted KPIs are more reliable than manually collected KPIs, but only if the EHR data entry is standardized. A PHQ-9 screening rate extracted from structured flowsheets is reliable. A PHQ-9 screening rate reconstructed from progress notes is not.

Feasible. The data required to compute the KPI actually exists and can be collected within available resources. This is the constraint that kills the most KPIs in practice. A wait time KPI requires that the scheduling system records both the date of referral and the date of the first available appointment — two data points that many scheduling systems do not capture in queryable fields. A care coordination KPI that requires matching records across two organizations’ EHR systems requires either a health information exchange or a manual matching process, both of which have costs and error rates. The CDC evaluation framework elevates feasibility to one of its four standards precisely because evaluations routinely specify measures that cannot be collected with available infrastructure. If the data does not exist in a queryable form, the KPI is aspirational, not operational. Either build the data infrastructure first (and make that a milestone, per PF M4) or choose a different KPI.

Actionable. Someone can do something with the result. A KPI that reports “65% of patients completed the PHQ-9” is actionable only if the program team can identify which patients were not screened, why they were not screened, and what workflow change would increase the rate. If the KPI is a black-box number with no diagnostic pathway, it is a score, not a signal. Actionability requires that the KPI be decomposable — that a program manager who sees the number move in the wrong direction can trace the cause and identify an intervention. Aggregate outcome KPIs (overall readmission rate) are less actionable than disaggregated ones (readmission rate by diagnosis, by discharge provider, by day of week) because the aggregate obscures the mechanism.
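As a rough illustration of decomposability, the sketch below breaks one aggregate readmission rate into rates by discharge day of week. The records are invented for the example and the helper name is hypothetical; a real version would read from the EHR or a discharge report.

```python
from collections import Counter

# Hypothetical discharge records: (discharge day of week, readmitted within 30 days).
discharges = [
    ("Mon", False), ("Mon", False), ("Tue", False), ("Wed", True),
    ("Thu", False), ("Fri", True), ("Fri", True), ("Fri", False),
]

def rate_by_group(records):
    """Return the readmission rate for each group key."""
    totals, events = Counter(), Counter()
    for group, readmitted in records:
        totals[group] += 1
        events[group] += int(readmitted)
    return {group: events[group] / totals[group] for group in totals}

overall = sum(readmitted for _, readmitted in discharges) / len(discharges)
by_day = rate_by_group(discharges)

print(f"overall: {overall:.0%}")
for day, rate in by_day.items():
    print(f"{day}: {rate:.0%}")
```

In this invented data the aggregate rate hides a concentration of readmissions among Friday discharges, which is the kind of mechanism a program manager could investigate.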

Timely. The KPI is available in time to matter. A quarterly KPI in a 12-month performance period provides exactly three data points before the annual report is due — and the first data point arrives at month 3, when the intervention may have been running for only 6-8 weeks. If the KPI shows the program is off track at month 3, there are only two quarters to correct course. If the KPI shows it at month 6, there is one quarter. At month 9, the annual report is already being drafted. Monthly KPIs provide 11 data points before the annual report and detect deviation early enough for meaningful intervention. Weekly KPIs — feasible for process measures like screening rate and appointment utilization — provide near-real-time feedback. The IHI improvement measurement framework recommends the highest frequency that is feasible, precisely because more frequent data enables faster learning cycles.
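The data-point counts above can be reproduced with a small sketch. It assumes readings arrive at the end of each measurement interval and that a reading landing on the reporting deadline itself does not count as arriving before it; the function name is illustrative.

```python
def readings_before_report(interval_months, period_months=12):
    """Months at which a reading arrives strictly before the reporting deadline."""
    return list(range(interval_months, period_months, interval_months))

for label, interval in [("quarterly", 3), ("monthly", 1)]:
    months = readings_before_report(interval)
    print(f"{label}: {len(months)} readings, at months {months}")
```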

A KPI that passes all five dimensions is rare. Most grant programs operate with KPIs that pass two or three. The quality test is not an all-or-nothing gate; it is a diagnostic tool that identifies which dimension is weakest and what redesign is needed.
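One way to use the test as a diagnostic is a simple scorecard. The sketch below is an assumption-laden illustration: the 0-2 scoring scale, the class name, and the example scores are all hypothetical and not part of any funder's template.

```python
from dataclasses import dataclass, fields

@dataclass
class QualityScore:
    """Scores on an assumed 0-2 scale: 0 = fails, 1 = partial, 2 = passes."""
    valid: int
    reliable: int
    feasible: int
    actionable: int
    timely: int

    def as_dict(self):
        return {f.name: getattr(self, f.name) for f in fields(self)}

    def weakest(self):
        """Name of the lowest-scoring dimension (first one on ties)."""
        scores = self.as_dict()
        return min(scores, key=scores.get)

    def below(self, passing=2):
        """Dimensions scoring under the passing mark, in test order."""
        return [name for name, value in self.as_dict().items() if value < passing]

# Hypothetical scoring of an annual patient satisfaction survey.
annual_survey = QualityScore(valid=2, reliable=2, feasible=2, actionable=1, timely=0)
print(annual_survey.weakest(), annual_survey.below())
```

The output names the dimension to redesign first, which matches the diagnostic use of the test rather than a pass/fail gate.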

Process vs. Outcome KPIs

Process KPIs measure what the program does. Outcome KPIs measure what changed because of what the program did. The distinction maps directly to Donabedian’s framework and to the logic model: process KPIs correspond to the activities and outputs levels, outcome KPIs correspond to the short-term and long-term outcomes levels.

Most grant programs over-index on process KPIs for three reasons. First, process KPIs are easier to measure — they count activities under the program’s control (number of screenings, training sessions, referrals made). Second, process KPIs are easier to achieve — the program can always screen more patients or conduct more trainings, regardless of whether those activities produce results. Third, process KPIs are safer to report — a process KPI that shows “we screened 800 patients” is always a positive number, while an outcome KPI that shows “PHQ-9 scores did not improve” is an uncomfortable finding that may threaten continuation funding.

Funders increasingly recognize this pattern. HRSA’s Performance Improvement and Measurement System (PIMS) requires both process and outcome measures. SAMHSA’s GPRA/NOMs (Government Performance and Results Act / National Outcome Measures) framework mandates outcome-level reporting: functional status, employment, housing stability, substance use frequency — measures that cannot be satisfied by counting activities. The shift from process to outcome measurement is not a trend; it is a federal reporting requirement that grantees must design for from the start.

The operational solution is not to replace process KPIs with outcome KPIs. It is to link them through the logic model. A process KPI (PHQ-9 screening rate) should connect to an outcome KPI (mean PHQ-9 score change at 6-month follow-up) through an explicit causal hypothesis: screening identifies patients who need treatment, treatment produces symptom improvement, symptom improvement is captured by the PHQ-9 follow-up score. If the process KPI improves (screening rate rises) but the outcome KPI does not (PHQ-9 scores are unchanged), the causal chain has a broken link. Either screening is not leading to treatment, or treatment is not producing improvement. The process-outcome linkage converts the KPI set from a collection of independent numbers into a diagnostic system that identifies where the causal chain is failing.
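The diagnostic use of the process-outcome linkage can be sketched as a check on each link in the chain. All counts and the minimum expected rates below are hypothetical placeholders; a program would set its own expectations from its logic model.

```python
# Hypothetical counts along the causal chain for one reporting period.
counts = {
    "eligible": 400,   # eligible primary care encounters
    "screened": 300,   # PHQ-9 administered
    "positive": 60,    # positive screens
    "treated": 15,     # positive screens who started treatment
    "improved": 9,     # treated patients improved at follow-up
}

# Assumed minimum acceptable rate for each link (illustrative values only).
expected = {
    "screening rate": 0.70,
    "treatment start among positive screens": 0.50,
    "improvement among treated": 0.40,
}

def link_rates(c):
    """Compute the rate for each link in the screening-to-outcome chain."""
    return {
        "screening rate": c["screened"] / c["eligible"],
        "treatment start among positive screens": c["treated"] / c["positive"],
        "improvement among treated": c["improved"] / c["treated"],
    }

def broken_links(c):
    """Links whose observed rate falls below the assumed minimum."""
    return [name for name, rate in link_rates(c).items() if rate < expected[name]]

print(broken_links(counts))
```

Here screening looks healthy and treated patients improve, but few positive screens reach treatment, which points at the handoff rather than at screening or at treatment quality.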

Leading vs. Lagging Indicators

Leading indicators predict future performance. Lagging indicators confirm past performance. Both are necessary; leading indicators are more actionable because they create time to intervene.

Leading indicators are upstream in the causal chain. In a behavioral health integration program: referral volume to behavioral health services predicts future caseload. Screening rate predicts future identification of patients needing services. Provider appointment availability predicts future access. Staff satisfaction and burnout scores predict future retention. Each of these measures tells the program manager what is likely to happen in the next quarter, not what happened in the last one. A declining referral volume at month 6 predicts an access problem at month 9 — and month 6 is early enough to investigate and intervene (increase primary care provider training on referral criteria, remove workflow barriers, address stigma concerns).

Lagging indicators are downstream in the causal chain. Readmission rate change is a lagging indicator — it reflects the cumulative effect of months of care delivery, and by the time it moves, the conditions that produced the change are already in the past. PHQ-9 improvement at 6-month follow-up is lagging by definition — it requires six months of elapsed time after intake. ED utilization for behavioral health crises is lagging — it reflects whether community-based services prevented crises over the preceding period.

The failure pattern is a KPI set composed entirely of lagging indicators measured quarterly. The program team receives its first outcome data at month 6 (covering months 1-3, with a 3-month data lag). The data shows no improvement. But the program only achieved operational capacity at month 4 — the first quarter of data reflects the pre-capacity period. By month 9, the second quarter’s data arrives (covering months 4-6), and it shows marginal improvement. The program team cannot determine whether the trajectory is sufficient to meet Year 1 targets until month 12, when the third quarter’s data arrives — which is also the annual reporting deadline. There has been zero time for course correction. The program has been flying blind.

The solution is to pair every lagging indicator with at least one leading indicator that predicts it. Readmission rate (lagging) pairs with post-discharge follow-up completion rate (leading). PHQ-9 improvement (lagging) pairs with treatment engagement rate (leading) — the percentage of patients with a positive screen who attend at least three therapy sessions. ED utilization for behavioral health crises (lagging) pairs with crisis plan completion rate (leading) — the percentage of high-risk patients with a documented safety plan. The leading indicator does not guarantee the lagging outcome, but it provides early evidence that the preconditions for the outcome are being established.
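The pairing rule lends itself to a mechanical check of the KPI set. In the sketch below, the three pairings come from the text above; the unpaired satisfaction KPI and the dictionary layout are hypothetical additions for illustration.

```python
# Lagging KPIs mapped to the leading indicators paired with them.
kpis = {
    "readmission rate": ["post-discharge follow-up completion rate"],
    "PHQ-9 improvement": ["treatment engagement rate"],
    "ED utilization for behavioral health crises": ["crisis plan completion rate"],
    # Hypothetical unpaired KPI, included to show what the check catches.
    "patient satisfaction (annual survey)": [],
}

def unpaired_lagging(pairings):
    """Return lagging KPIs that have no leading indicator paired with them."""
    return [kpi for kpi, leading in pairings.items() if not leading]

print(unpaired_lagging(kpis))
```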

The Goodhart’s Law Problem

Any KPI that is tied to incentives or reporting will eventually be gamed. This is not a possibility to be guarded against. It is a certainty to be designed for. The mechanism is described in detail in Human Factors Module 8 (Incentive Gaming); here the focus is on how it applies specifically to grant program KPIs.

Screening rate gaming. A program reports PHQ-9 screening rate as the number of patients screened divided by the number of eligible primary care encounters. The screening rate improves from 30% to 75% over 12 months, an apparent success. But examination reveals that the definition of “eligible encounter” was narrowed: annual wellness visits only, excluding acute visits, follow-up visits, and telehealth encounters. The denominator shrank. The number of patients actually screened increased modestly. The rate improved dramatically because the definition of the population at risk was manipulated. The KPI reported success. The program served roughly the same number of patients.

Wait time gaming. A program reports average wait time for behavioral health intake as the interval between referral and first appointment. Wait times improve from 42 days to 14 days. But the definition of “referral” was changed: only formal electronic referrals count, not phone referrals, fax referrals, or patient self-referrals. Patients who call directly are scheduled as “new patient” appointments rather than “referred” appointments, and their wait times are not captured in the KPI. The measured population is a subset of the actual population seeking access. The KPI shows improved access. The patient experience of access may be unchanged.

Training completion gaming. A program reports that 95% of clinical staff completed behavioral health integration training. The training was a 30-minute online module completed during a staff meeting, with a post-test that could be retaken unlimited times. The milestone is met. Whether any clinician’s practice changed is unmeasured and, based on the training design, unlikely. The KPI captured seat-time, not competency.

These are not exotic failure modes. They are the predictable consequences of attaching reporting requirements to metrics that have a gap between the proxy (what is measured) and the construct (what matters). KPI design must anticipate gaming by applying the adversarial design principle from HF M8: before deploying a KPI, ask how a competent, motivated program manager could maximize the metric without improving the underlying outcome. If the answer is easy and obvious, the KPI needs a tighter definition, a paired metric that would diverge under gaming, or replacement with a less gameable measure.

Healthcare Example: Redesigning a HRSA Rural Transformation Grant’s KPI Set

Consider a three-year, $1.8M HRSA-funded rural health transformation grant at a 25-bed critical access hospital with two primary care clinics in central Montana, population 18,000. The original application specified 12 KPIs. Below is the analysis of each, categorized by failure mode, followed by the redesign.

The Original 12 KPIs

Process-only KPIs (counting activities without connecting to outcomes):

1. Number of community health worker (CHW) home visits conducted per quarter. Counts activity. A CHW could conduct 200 visits that produce no change in patient self-management. No connection to whether visits improved outcomes.
2. Number of staff trained in trauma-informed care. Counts seat-time. Identical to the training gaming pattern above. No assessment of whether training changed clinical behavior.
3. Number of telehealth encounters delivered per quarter. Counts utilization of a modality. A telehealth encounter is not inherently better or worse than an in-person encounter; counting encounters reveals nothing about quality, appropriateness, or outcomes.
4. Number of referral agreements signed with community partners. Counts documents. A signed agreement that produces zero referrals is a meaningless artifact. The agreement is a structural input; the KPI should measure the process or outcome it enables.

Lagging-only KPIs (quarterly measurement too slow for course correction):

5. 30-day readmission rate for behavioral health patients, reported quarterly. Lagging and low-frequency. The first data point arrives at month 6 (quarter 1 data with processing lag). With small rural volumes (perhaps 15-25 BH admissions per quarter), random variation dominates. No leading indicator paired with it.
6. ED utilization rate for behavioral health crises, reported quarterly. Same problem. Small denominators, long lag, no leading indicator. By the time a quarterly trend is visible, nine months have passed.
7. Patient satisfaction score from annual survey. Annual measurement provides exactly one data point per grant year. Useless for course correction. By the time dissatisfaction is detected, a full year of patient experience has occurred without intervention.
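The remark about random variation at small rural volumes can be made concrete with a confidence interval. The sketch below uses the Wilson score interval, a standard choice for proportions with small samples; the admission and readmission counts are hypothetical.

```python
import math

def wilson_interval(events, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion events / n."""
    p = events / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

# Hypothetical quarter: 4 readmissions among 20 behavioral health admissions.
low, high = wilson_interval(4, 20)
print(f"observed 20%, plausible range {low:.0%} to {high:.0%}")
```

With a range this wide, a quarter-to-quarter move of a few percentage points cannot be distinguished from chance, which is why the quarterly readmission KPI needs a higher-volume leading indicator beside it.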

Gameable KPIs (definitions allow manipulation):

8. PHQ-9 screening rate in primary care. The denominator (“eligible primary care encounters”) is undefined. As described above, narrowing the denominator inflates the rate. Without a locked denominator definition, this KPI is a Goodhart’s Law case study.
9. Average wait time for behavioral health intake. The definition of “wait” and the definition of “referral” are both ambiguous. Wait time measured from electronic referral only, excluding self-referrals and external referrals, captures a subset of the access problem.

Well-designed KPIs (valid, timely, connected to logic model):

10. Percentage of patients with positive PHQ-9 screen (score >= 10) who receive a warm handoff to behavioral health within the same visit, measured monthly from EHR workflow data. Valid: measures the care integration the program is designed to produce. Reliable: EHR-extractable with a clear operational definition. Feasible: requires a structured warm-handoff field in the EHR (buildable). Actionable: if the rate drops, the program manager can identify which clinic, which provider, which workflow step is failing. Timely: monthly.
11. Mean PHQ-9 score change from intake to 6-month follow-up for patients enrolled in behavioral health treatment, reported monthly on a rolling basis. Valid: directly measures symptom improvement. Reliable: PHQ-9 is a validated instrument. The rolling measurement means new data arrives every month as patients reach their 6-month mark, rather than waiting for a quarterly aggregate. Connects directly to the logic model’s outcome level.
12. Behavioral health provider panel utilization rate (scheduled appointments as a percentage of available appointment slots), measured weekly. Valid: measures whether the capacity the program built is being used. Reliable: scheduling system data. Feasible: standard scheduling report. Actionable: low utilization triggers investigation (referral volume? no-show rate? insufficient provider hours?). Timely: weekly.
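The rolling PHQ-9 measure in the list above can be sketched as follows. Patient records are hypothetical, and the sketch assumes each patient's change is reported in the month they reach the 6-month mark; the function name is illustrative.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical patients: (intake month, intake PHQ-9, 6-month follow-up PHQ-9).
patients = [(1, 16, 9), (1, 14, 12), (2, 18, 10), (3, 15, 15), (3, 12, 7)]

def rolling_change(records, follow_up_months=6):
    """Mean PHQ-9 change grouped by the month each patient reaches follow-up."""
    by_month = defaultdict(list)
    for intake_month, intake_score, follow_up_score in records:
        report_month = intake_month + follow_up_months
        by_month[report_month].append(follow_up_score - intake_score)
    return {month: mean(changes) for month, changes in sorted(by_month.items())}

result = rolling_change(patients)
for month, change in result.items():
    print(f"month {month}: mean change {change:+.1f}")
```

Negative values mean symptom scores fell. From month 7 onward a new value arrives every month, so the outcome KPI stops being a once-a-quarter surprise.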

The Redesign

The nine weak KPIs are redesigned using the quality test framework:

Original KPI	Failure Mode	Redesigned KPI
1. CHW home visits (count)	Process-only	CHW patients with documented self-management goal progress at 90-day review (% of active caseload), measured monthly
2. Staff trained (count)	Process-only	Percentage of primary care encounters where behavioral health screening protocol is followed correctly (chart audit of 20 random encounters per month)
3. Telehealth encounters (count)	Process-only	Telehealth appointment completion rate (attended vs. scheduled) and patient-reported telehealth satisfaction (post-visit text survey), measured monthly
4. Referral agreements signed	Process-only	Bidirectional referral completion rate per partner (referrals sent that result in completed intake at receiving organization / total referrals sent), measured monthly
5. BH readmission rate (quarterly)	Lagging-only	Retain as quarterly outcome KPI; add leading indicator: post-discharge follow-up contact within 72 hours (%), measured weekly
6. ED BH crisis utilization (quarterly)	Lagging-only	Retain as quarterly outcome KPI; add leading indicator: crisis safety plan completion rate for high-risk patients (%), measured monthly
7. Patient satisfaction (annual)	Lagging-only	Replace annual survey with monthly 3-question post-visit text survey (access, communication, overall); retain annual comprehensive survey as supplemental
8. PHQ-9 screening rate (gameable)	Gameable denominator	Lock denominator: all primary care encounters with established patients aged 12+, excluding procedures-only visits; denominator definition reviewed and approved by evaluator before first measurement
9. Wait time (gameable)	Gameable definition	Redefine: calendar days from any referral source (electronic, phone, fax, self-referral) to first completed behavioral health encounter; measure from scheduling system plus manual log for non-electronic referrals; measured monthly

The redesigned set has 15 KPIs (the 9 redesigned KPIs and the 3 original well-designed KPIs, plus the leading indicators added to the lagging KPIs and the paired measures introduced in the redesign). This is more KPIs, but each passes the quality test and connects to a specific node in the logic model. The process KPIs (screening rate, referral completion, follow-up contact) are linked to outcome KPIs (PHQ-9 improvement, readmission rate, ED utilization) through explicit causal hypotheses. The leading indicators (follow-up contact rate, safety plan completion, referral completion) pair with lagging indicators (readmission rate, ED utilization) to provide early signal.
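The effect of the redefined wait-time KPI can be illustrated with a short sketch. The referral log, dates, and function name are hypothetical; the calculation follows the redesigned definition of calendar days from any referral source to the first completed encounter.

```python
from datetime import date
from statistics import mean

# Hypothetical referral log: (source, referral date, first completed encounter or None).
referrals = [
    ("electronic", date(2024, 3, 1), date(2024, 3, 15)),
    ("phone", date(2024, 3, 4), date(2024, 4, 15)),
    ("self", date(2024, 3, 10), date(2024, 4, 9)),
    ("fax", date(2024, 3, 12), None),  # no completed encounter yet
]

def mean_wait_days(rows, sources=None):
    """Mean calendar days from referral to first completed encounter.

    sources=None counts every referral source; pass a set to restrict it.
    Referrals with no completed encounter yet are left out of the mean.
    """
    waits = [
        (completed - referred).days
        for source, referred, completed in rows
        if completed is not None and (sources is None or source in sources)
    ]
    return mean(waits)

all_sources = mean_wait_days(referrals)
electronic_only = mean_wait_days(referrals, {"electronic"})
pending = sum(1 for _, _, completed in referrals if completed is None)
print(f"all sources: {all_sources:.1f} days, electronic only: {electronic_only} days, pending: {pending}")
```

In this invented log the electronic-only figure is half the all-source figure. Referrals still waiting are excluded from the mean, so the pending count should be reported next to it to avoid understating waits.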

The Denominator Problem

Healthcare KPIs have a structural vulnerability that most other domains do not: denominator instability. The population being measured changes as the program operates, and that change can produce apparent improvement or deterioration that has nothing to do with program effectiveness.

A PHQ-9 screening rate improves from 30% to 60%. Possible explanation one: the program doubled its screening activity, with twice as many patients screened out of the same population. This is genuine improvement. Possible explanation two: the denominator was redefined. Initially, all adult primary care encounters were included. After six months, the program narrowed eligibility to annual wellness visits only. The denominator dropped by 45%. The numerator (patients screened) increased by 10%. The rate doubled (1.10 / 0.55 = 2.0). The program screened marginally more patients while reporting dramatically better performance.

The denominator problem is not always intentional. Enrollment changes, eligibility redefinition, seasonal variation in patient volume, and provider panel changes all shift the denominator organically. A substance use treatment program that reports “percentage of enrolled patients completing 90-day treatment” will see the denominator shift as enrollment patterns change. Early enrollees are more motivated (selection bias); later enrollees may include court-ordered patients with different completion profiles. The completion rate at month 6 and the completion rate at month 18 may not be measuring the same population even if the KPI definition has not changed.

SAMHSA’s GPRA/NOMs framework addresses this partially by requiring standardized intake and follow-up instruments administered at fixed intervals, creating a denominator defined by enrollment rather than by encounter. HRSA’s PIMS measures similarly define populations by enrollment or attribution rather than by encounter volume. These approaches stabilize the denominator but do not eliminate the problem — enrollment itself is subject to the same selection and definition pressures.

The operational defense is denominator documentation. Every KPI that involves a rate or percentage must have a denominator definition that is: specified in writing before first measurement, approved by the program evaluator, and unchanged throughout the measurement period unless a formal revision is documented with justification. Denominator changes should be reported alongside KPI changes so that the funder and the program team can distinguish genuine improvement from definitional artifact. A KPI dashboard that shows the numerator, denominator, and rate as three separate trend lines — rather than the rate alone — makes denominator instability visible rather than hidden.
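A minimal sketch of the three-series view follows. The monthly counts are hypothetical, and the 20% month-over-month tolerance for flagging a denominator shift is an arbitrary illustrative threshold, not a standard.

```python
# Hypothetical monthly data: (month, patients screened, eligible encounters).
months = [
    ("2024-01", 120, 400), ("2024-02", 124, 410), ("2024-03", 128, 395),
    ("2024-04", 130, 220), ("2024-05", 132, 215),
]

def denominator_shifts(rows, tolerance=0.20):
    """Months where the denominator moved more than `tolerance` from the prior month."""
    flagged = []
    for (_, _, previous), (label, _, current) in zip(rows, rows[1:]):
        if abs(current - previous) / previous > tolerance:
            flagged.append(label)
    return flagged

for label, numerator, denominator in months:
    print(f"{label}  screened={numerator}  eligible={denominator}  rate={numerator / denominator:.0%}")
print("denominator shifts:", denominator_shifts(months))
```

The rate alone jumps sharply in April while the screened count barely moves; showing all three columns makes it obvious that the denominator, not screening activity, drove the change.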

The Product Owner Lens

What is the funding/compliance/execution problem? Grant programs select KPIs during the application process, often under time pressure and without measurement expertise, then discover during execution that the KPIs are unmeasurable, gameable, or uninformative. By the time KPI problems become visible (the first reporting period), the KPI definitions are locked into the award terms and cannot be easily changed.

What mechanism explains the operational bottleneck? KPI design requires simultaneous competency in measurement science (validity, reliability), data infrastructure (feasibility), program theory (logic model alignment), and behavioral anticipation (gaming resistance). These competencies rarely coexist in the person writing the grant application. The result is KPIs that satisfy the funder’s template without satisfying the quality test, producing a measurement system that generates numbers without generating learning.

What controls or workflows improve it? KPI quality review at application time using the five-dimension test. Denominator locking before first measurement. Leading-lagging pairing for every outcome KPI. Monthly measurement frequency as default, with quarterly aggregation for reporting. Adversarial review of KPI definitions (how would you game this?) before finalizing the evaluation plan.

What should software surface? KPI trend dashboard showing numerator, denominator, and rate as separate time series — making denominator instability visible. Process-outcome correlation tracking: when a process KPI and its linked outcome KPI diverge, flag the divergence as a potential gaming signal or broken causal link. Leading indicator trajectory projection: given the current leading indicator trend, is the lagging outcome KPI on track to meet its target? Evidence readiness status for each KPI: is the data collection method active, is the data source producing queryable results, has the baseline been captured? KPI quality scorecard at application time: does each proposed KPI pass all five dimensions of the quality test?
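The trajectory projection can be sketched with an ordinary least-squares line through the monthly values. This is a simplification: it assumes the recent trend continues in a straight line, it projects a single series rather than modeling how a leading indicator translates into a lagging outcome, and the data and 70% target are hypothetical.

```python
def linear_projection(values, target_month):
    """Project a monthly series (months 1..n) to target_month with a least-squares line."""
    n = len(values)
    xs = range(1, n + 1)
    x_mean = sum(xs) / n
    y_mean = sum(values) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    return y_mean + slope * (target_month - x_mean)

# Hypothetical leading indicator: monthly follow-up contact rate for months 1-6.
follow_up_rate = [0.40, 0.44, 0.47, 0.52, 0.55, 0.58]
target = 0.70  # assumed month-12 target

projected = linear_projection(follow_up_rate, target_month=12)
print(f"projected month-12 rate: {projected:.0%}, on track: {projected >= target}")
```

A straight line can project a rate above 100%, so a real dashboard would cap the projection and treat it as a prompt for review rather than a forecast.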

What metric reveals risk earliest? The process-outcome correlation coefficient, measured monthly on a rolling basis. When screening rate (process) and PHQ-9 improvement (outcome) are both trending upward, the program is working. When screening rate trends upward but PHQ-9 improvement is flat, one of three things is happening: the causal link is broken (screening is not leading to treatment), the outcome measure is lagging (improvement will appear later), or the process KPI is being gamed (the screening rate increase is definitional, not real). The divergence is the earliest signal that the KPI system needs investigation — earlier than the quarterly report, earlier than the annual evaluation, and early enough to intervene.
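A simplified stand-in for the divergence check is sketched below. It compares least-squares trend slopes rather than computing a correlation coefficient, because slopes stay defined when the outcome series is flat. The series, thresholds, and labels are hypothetical.

```python
def slope(values):
    """Least-squares slope of a series against its index (change per month)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

def divergence_flag(process, outcome, process_min=0.01, outcome_min=0.05):
    """Classify the process-outcome relationship using assumed minimum trends."""
    process_up = slope(process) > process_min
    outcome_up = slope(outcome) > outcome_min
    if process_up and outcome_up:
        return "both improving"
    if process_up:
        return "diverging: investigate"
    return "process not improving"

# Hypothetical six months: screening rate, and mean PHQ-9 point reduction at follow-up.
screening_rate = [0.30, 0.38, 0.45, 0.55, 0.62, 0.70]
phq9_reduction = [2.1, 2.0, 2.2, 2.1, 2.0, 2.1]

status = divergence_flag(screening_rate, phq9_reduction)
print(status)
```

The flag only says that investigation is needed; deciding among a broken causal link, a lagging outcome, and a gamed process KPI still requires looking at the underlying numerators and denominators.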

Warning Signs

All KPIs are process measures. If the KPI set counts activities (screenings conducted, trainings completed, meetings held) but does not measure whether those activities produced outcomes (symptom improvement, behavior change, access improvement), the evaluation will demonstrate effort without demonstrating impact. Funders will notice.

No KPI has a defined denominator. If rate-based KPIs do not specify exactly which patients, encounters, or events constitute the denominator — and if that definition is not documented before the first measurement — the denominator will shift in whatever direction produces a better-looking number.

The KPI set has no leading indicators. If every KPI measures outcomes that lag by 3-6 months, the program team will not detect problems until it is too late to fix them. The absence of leading indicators is a structural blind spot.

KPIs were designed by the grant writer, not the program team. If the people responsible for achieving the KPIs did not participate in defining them, the KPIs may be unmeasurable with existing infrastructure, disconnected from the actual intervention, or set at thresholds that are either trivially achievable or impossibly ambitious.

The same KPI set has been used for three consecutive grant cycles. If KPIs have not evolved, they have been optimized. The program has learned how to produce good numbers on these specific metrics, and the metrics have stopped generating useful information about actual performance. This is Goodhart’s Law in its mature form.

Improving KPIs do not match staff experience. If the dashboard shows improving performance but frontline staff report that things are not getting better — or are getting worse — the KPIs are measuring something other than what staff experience. This divergence is the most reliable qualitative signal that the measurement system has decoupled from reality.

Integration Hooks

Human Factors Module 8 (Incentive Gaming and Goodhart’s Law). HF M8 describes the mechanism: when a measure becomes a target, it ceases to be a good measure. This page applies that mechanism to the specific context of grant program KPIs. The four gaming types from HF M8 (cherry-picking, teaching to the test, threshold manipulation, and definitional gaming) each have direct KPI design analogs. Cherry-picking: narrowing the patient population included in the KPI. Teaching to the test: concentrating resources on measured KPIs while unmeasured program dimensions degrade. Threshold manipulation: redefining what counts as a “positive” screen or a “completed” referral. Definitional gaming: changing what constitutes an “eligible encounter” in the denominator. The adversarial design principle from HF M8, red-teaming KPIs before deployment, is among the most underused tools in grant evaluation design. A program that asks “how would we game this KPI?” before locking the evaluation plan will design KPIs that are structurally harder to game.

Public Finance Module 4 (Milestone Design). Milestones require evidence, and evidence requires KPI data. If the KPIs are wrong — measuring the wrong thing, arriving too late, or susceptible to gaming — the milestone evidence is cosmetic. A milestone of “achieve 70% PHQ-9 screening rate by month 18” is only as good as the screening rate KPI that underlies it. If the screening rate KPI has an undefined denominator, the milestone can be “achieved” by manipulating the denominator rather than by improving screening. If the screening rate KPI is measured quarterly, the program will not know whether it is on trajectory until it is too late to adjust. The milestone-KPI relationship is bidirectional: milestone design (PF M4) determines what must be measured; KPI design (this page) determines whether the measurement is trustworthy. A program with strong milestones and weak KPIs is a program that sets the right targets and then cannot tell whether it hit them.
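
The cost of a slow measurement cadence against the month-18 milestone can be illustrated with a short Python sketch. It rests on several assumptions that are not part of the milestone itself: a straight-line trajectory from a hypothetical 34% baseline to the 70% target, a one-point tolerance before a shortfall counts, and invented monthly observations in which progress stalls after month 6. The function names are illustrative only.

```python
def expected_rate(month, baseline, target, deadline):
    """Rate (in percentage points) the program should have reached by `month`,
    assuming a straight-line path from baseline to target."""
    return baseline + (target - baseline) * month / deadline


def first_shortfall(observed, cadence, baseline, target, deadline, tolerance=1.0):
    """First review month at which the observed rate trails the trajectory
    by more than `tolerance` points; None if no review detects a shortfall.

    `observed[0]` is month 1. Reviews happen every `cadence` months.
    """
    last_month = min(len(observed), deadline)
    for month in range(cadence, last_month + 1, cadence):
        gap = expected_rate(month, baseline, target, deadline) - observed[month - 1]
        if gap > tolerance:
            return month
    return None


# Hypothetical screening rates (%), months 1 through 12:
# on trajectory through month 6, then stalled at 46%.
observed = [36, 38, 40, 42, 44, 46, 46, 46, 46, 46, 46, 46]

monthly = first_shortfall(observed, cadence=1, baseline=34, target=70, deadline=18)
quarterly = first_shortfall(observed, cadence=3, baseline=34, target=70, deadline=18)
print(f"monthly review detects the stall at month {monthly}")
print(f"quarterly review detects the stall at month {quarterly}")
```

In this invented scenario the monthly review surfaces the stall two months before the quarterly review does. How much that lead time matters depends on how long a corrective action takes to show up in the rate, which the sketch does not model.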

Key Frameworks and References

Donabedian, A. (1966) — structure-process-outcome framework for quality assessment; the foundational taxonomy for classifying healthcare quality measures and the basis for distinguishing KPI types
IHI Measurement Framework — distinguishes improvement, accountability, and research as three purposes of measurement, each with different data requirements; critical for designing KPIs that serve the right purpose
CDC Framework for Program Evaluation in Public Health (1999) — establishes utility, feasibility, propriety, and accuracy as evaluation standards; feasibility is the binding constraint for most grant program KPIs
HRSA Performance Improvement and Measurement System (PIMS) — HRSA’s framework requiring grantees to report on program-specific performance measures; increasingly demands outcome-level data
SAMHSA GPRA/NOMs (Government Performance and Results Act / National Outcome Measures) — federal outcome measurement requirements for SAMHSA grantees; mandates standardized intake and follow-up instruments with fixed-interval measurement
Goodhart, C.A.E. (1975) — “Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes”; the mechanism by which KPIs degrade when tied to incentives
Campbell, D.T. (1979) — “The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures”; extends Goodhart by identifying active behavioral distortion
Strathern, M. (1997) — “When a measure becomes a target, it ceases to be a good measure”; the concise restatement of Goodhart’s Law widely adopted in quality measurement literature
Muller, J.Z. (2018), The Tyranny of Metrics — comprehensive analysis of metric fixation and its effects across public and private sectors; directly applicable to grant program KPI design
W.K. Kellogg Foundation Logic Model Development Guide (2004) — the standard reference for connecting KPIs to the causal chain from inputs through outcomes