Discrete-Event Simulation

Module 6: Simulation and Scenario Analysis | Depth: Foundation | Target: ~2,500 words

Thesis: Discrete-event simulation models healthcare operations as a sequence of events (arrivals, service starts, departures) and reveals emergent system behavior that intuition misses.


The Operational Problem

Queueing theory (Module 2) gives you closed-form answers about how waiting systems behave — but only when the assumptions hold. The moment you face time-varying arrival rates, multi-stage patient routing, heterogeneous providers, shared resources that create dependencies between service lines, or priority rules that shift with system state, the elegant formulas stop working. Most real healthcare operations combine all of these features simultaneously.

Discrete-event simulation (DES) is what you reach for when the system is too complex for closed-form models but the stakes are too high for guessing. It is the computational generalization of queueing theory: rather than solving equations that describe a stylized system, you build a model of the actual system and run it forward in time, watching what happens. The result is not a single number — it is a distribution of outcomes that reveals emergent behavior no amount of intuition would predict.

DES has been the dominant simulation method in healthcare operations research since the 1980s. Sally Brailsford’s extensive surveys of healthcare simulation confirm that DES accounts for the majority of published healthcare simulation studies, with applications spanning emergency departments, surgical suites, outpatient clinics, bed management, and entire hospital systems. The method is mature, the tools are commercially available, and the barrier to entry is not mathematical sophistication — it is disciplined model-building.


How DES Works: The Mechanics

Understanding DES requires understanding five components and the algorithm that connects them. This is not a black box. Every DES model, from a textbook exercise to a multimillion-dollar hospital planning tool, runs on the same engine.

Entities

Entities are the things that flow through the system. In healthcare DES, entities are almost always patients — but they can also be lab samples, referral requests, prior authorization forms, or ambulances. Each entity carries attributes: arrival time, acuity level, insurance type, required services, appointment status. These attributes determine how the entity is routed and processed.

Resources

Resources are the things entities compete for. Providers, exam rooms, lab instruments, beds, registration clerks, imaging equipment. A resource has a capacity (how many entities it can serve simultaneously) and a state (idle, busy, or down for maintenance). The fundamental tension in every DES model is that entities need resources, resources are finite, and when demand exceeds supply, queues form.

Queues

When an entity needs a resource that is currently busy, it enters a queue. Queues have disciplines — FIFO, priority-based (triage acuity), shortest-expected-processing-time, or custom rules. Queue behavior is not programmed directly; it emerges from the interaction of arrival patterns, service times, resource availability, and routing logic. This is the core insight of DES: you specify the rules, and the queues reveal themselves.
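
A priority discipline with FIFO tie-breaking can be sketched with a heap. This is a minimal illustration rather than any particular tool's implementation: the PriorityQueue class, the patient labels, and the convention that a lower rank is served first are all assumptions made for the example.

```python
import heapq
import itertools


class PriorityQueue:
    """Waiting line ordered by rank (lower rank served first), FIFO within a rank."""

    def __init__(self):
        self._heap = []
        self._sequence = itertools.count()  # arrival order breaks ties FIFO

    def join(self, patient_id, rank):
        heapq.heappush(self._heap, (rank, next(self._sequence), patient_id))

    def next_patient(self):
        _rank, _seq, patient_id = heapq.heappop(self._heap)
        return patient_id

    def __len__(self):
        return len(self._heap)


queue = PriorityQueue()
for patient_id, rank in [("A", 3), ("B", 2), ("C", 3), ("D", 1)]:
    queue.join(patient_id, rank)

service_order = [queue.next_patient() for _ in range(len(queue))]
print(service_order)  # ['D', 'B', 'A', 'C']
```

Swapping the sort key (for example, expected processing time instead of rank) changes the discipline without touching the rest of the model, which is why the discipline is usually treated as a pluggable rule.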

Events

An event is an instantaneous occurrence that changes the state of the system. The canonical events are:

  • Arrival: A new entity enters the system. If a required resource is available, service begins immediately (generating a “service start” event). If not, the entity joins a queue.
  • Service start: An entity seizes a resource and begins processing. A service completion event is scheduled at the current time plus the sampled service duration.
  • Service completion (departure): An entity releases a resource. The entity either moves to the next stage of its route (generating a new resource request) or exits the system. The released resource checks its queue — if anyone is waiting, the next entity begins service immediately.

Other events may include scheduled shift changes, equipment failures, appointment arrivals at fixed times, or rule-triggered events (e.g., activating an overflow protocol when the ED queue exceeds a threshold).

The Simulation Clock and Event List

This is the mechanism that makes DES computationally efficient. The simulation does not step forward in fixed time increments (that would be time-step simulation, a different approach that spends computation on intervals in which nothing changes). Instead, DES maintains an event list, a time-ordered schedule of future events, and advances the clock directly from one event to the next.

The algorithm, as presented by Banks, Carson, Nelson, and Nicol in their canonical textbook Discrete-Event System Simulation, runs as follows:

  1. Initialize. Set the simulation clock to zero. Schedule initial events (typically, the first patient arrival). Initialize all resources to idle, all queues to empty, all statistical counters to zero.
  2. Advance the clock. Remove the earliest event from the event list. Set the simulation clock to that event’s time. No computation occurs between events — if the next event is 7.3 minutes away, the clock jumps directly to that point.
  3. Process the event. Execute the logic associated with the event type. This may change the state of entities, resources, and queues, and it may schedule new future events. An arrival event, for instance, schedules the next arrival (by sampling from the inter-arrival time distribution) and either starts service or enqueues the entity.
  4. Update statistics. Record whatever performance measures matter: time-in-system, queue length, resource utilization, wait times.
  5. Repeat. Return to step 2. Continue until a termination condition is met (end of simulated day, target number of patients processed, etc.).

This next-event time advance is what makes DES “discrete-event” — the system state changes only at event times, and the simulation skips over the intervals between events. A clinic that processes 60 patients in an 8-hour day might involve 300-400 events (arrivals, service starts, completions, room assignments, departures). The simulation executes those 300-400 state changes and ignores the idle time between them.
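
The five steps above can be sketched for the simplest possible case: one provider with a FIFO queue. This is a minimal sketch under stated assumptions, not a production engine. The function name, the parameter values, and the exponential placeholders for arrivals and service are all illustrative, and the service start is handled as an immediate state change rather than as a separately scheduled event.

```python
import heapq
import random


def simulate_single_server(mean_interarrival, mean_service, close_time, seed=0):
    """Next-event time advance for one provider with a FIFO queue (times in minutes)."""
    rng = random.Random(seed)
    # Step 1: initialize the clock, system state, counters, and the first arrival.
    clock, busy, waiting, waits, n_events = 0.0, False, [], [], 0
    events = [(rng.expovariate(1 / mean_interarrival), "arrival")]

    while events:
        # Step 2: jump the clock straight to the earliest scheduled event.
        clock, kind = heapq.heappop(events)
        n_events += 1
        # Step 3: process the event; this may schedule future events.
        if kind == "arrival":
            next_arrival = clock + rng.expovariate(1 / mean_interarrival)
            if next_arrival < close_time:  # stop admitting at closing time
                heapq.heappush(events, (next_arrival, "arrival"))
            waiting.append(clock)
        else:  # "departure": the provider becomes free
            busy = False
        if not busy and waiting:
            arrived_at = waiting.pop(0)
            busy = True
            # Step 4: update statistics.
            waits.append(clock - arrived_at)
            departure = clock + rng.expovariate(1 / mean_service)
            heapq.heappush(events, (departure, "departure"))
        # Step 5: repeat until the event list is empty.
    return {
        "patients": len(waits),
        "mean_wait": sum(waits) / len(waits),
        "events": n_events,
        "end": clock,
    }


result = simulate_single_server(10.0, 8.0, 480.0, seed=1)
print(result)
```

In this reduced model every patient generates exactly two scheduled events (an arrival and a departure), so the loop body runs twice per patient no matter how long the simulated day is.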


The Core Insight: Emergence from Rules

The power of DES is not that it runs an algorithm. Spreadsheets run algorithms. The power is that DES generates system-level behavior from component-level rules. You specify:

  • How patients arrive (distributions fit to real scheduling and walk-in data)
  • How long each service takes (distributions fit to real process times)
  • How patients are routed (decision rules: if lab needed, go to lab; if acuity >= 3, see physician first)
  • What resources are available and when (staffing schedules, room counts, equipment uptime)

You do not specify the queue lengths, the wait times, the bottleneck locations, or the throughput rate. Those emerge. And what emerges is frequently surprising, because human intuition is poor at predicting the behavior of systems with stochastic interactions and shared resources. Michael Pidd, in his work on simulation modeling, emphasizes that the value of simulation is precisely in revealing system behaviors that cannot be deduced from understanding the components in isolation.

The implications for decision-making are direct: DES lets you test interventions before you implement them. What happens if we add a fourth exam room? What if we shift the MA start time 30 minutes earlier? What if walk-in volume increases 20%? Each question becomes a simulation experiment — change the parameter, run the model, observe the outcome distribution.


Input Modeling: Fitting Distributions to Reality

A DES model is only as good as its input distributions. This is where most naive simulation efforts fail — not in the algorithm, but in the assumptions fed to it.

Why Exponential Is Often Wrong

The exponential distribution has a single parameter (the rate) and the memoryless property: the probability of completing service in the next instant is the same regardless of how long service has already lasted. This is mathematically convenient — it underpins the Markovian (M) assumption in queueing theory — but it is a poor fit for most healthcare service times. Exponential distributions have a coefficient of variation (CV) of exactly 1.0, meaning the standard deviation equals the mean. They put substantial probability mass near zero (very short services) and have a long right tail.

Healthcare service times rarely look like this. A primary care visit has a practical minimum duration: no matter how straightforward the case, the provider needs time to review the chart, greet the patient, perform an assessment, and document. This creates an effective lower bound on service time that the exponential, whose density is highest at zero, cannot represent. At the same time, complex cases can extend the right tail beyond what a fitted exponential predicts.
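
A quick numerical check makes the mismatch concrete. The sketch below assumes an illustrative 20-minute mean visit (not a figure from clinic data) and asks how often an exponential model would produce a visit shorter than 5 minutes. The exact answer is 1 - exp(-5/20), about 22%.

```python
import math
import random
import statistics

mean_visit = 20.0  # minutes; illustrative value, not from clinic data
rng = random.Random(42)
samples = [rng.expovariate(1 / mean_visit) for _ in range(100_000)]

# Coefficient of variation: standard deviation divided by the mean.
cv = statistics.pstdev(samples) / statistics.fmean(samples)

# Share of simulated visits shorter than 5 minutes, sampled and exact.
share_under_5 = sum(s < 5 for s in samples) / len(samples)
exact_under_5 = 1 - math.exp(-5 / mean_visit)  # exponential CDF at 5 minutes

print(f"CV = {cv:.2f}")
print(f"share under 5 min: sampled {share_under_5:.3f}, exact {exact_under_5:.3f}")
```

A model that sends more than a fifth of patients through a 20-minute visit type in under 5 minutes will understate room and provider occupancy for those patients.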

Better Alternatives

Lognormal distribution. Service times in healthcare frequently follow a lognormal distribution: the logarithm of the service time is normally distributed. This produces a strictly positive, right-skewed distribution that, for moderate variability, places little probability mass near zero, peaks at a typical duration, and has a tail that captures occasional complex cases. Law’s Simulation Modeling and Analysis, the standard reference for input modeling, identifies the lognormal as a strong default for service processes where times are positive-valued and right-skewed.

Gamma distribution. The gamma offers flexible shape control via two parameters and can represent distributions ranging from exponential-like (shape = 1) to nearly symmetric (large shape parameter). It is commonly used for length-of-stay modeling in hospital units.

Empirical distributions. When sufficient data exists (hundreds of observations), you can use the empirical distribution directly — drawing service times from the actual observed dataset. This avoids distributional assumptions entirely but requires that the data represent the operating conditions you want to simulate.

Fitting process. The standard approach (detailed in Law & Kelton) is: (1) collect data, (2) hypothesize candidate distributions, (3) estimate parameters via maximum likelihood, (4) test goodness-of-fit via chi-square or Kolmogorov-Smirnov tests, (5) select the best-fitting distribution. In practice, many healthcare DES projects skip this rigor and default to exponential or triangular distributions, producing models that underestimate variability in some processes and overestimate it in others.
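
The five fitting steps can be sketched with the standard library alone. This is an illustration under stated assumptions: the durations are synthetic (drawn from a lognormal so the example is self-contained), only two candidates are compared, and the selection rule is simply the smaller Kolmogorov-Smirnov distance. Note that standard KS critical values are not exact when parameters are estimated from the same data, so a real study would lean on the procedures in a reference such as Law's text.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev


def ks_statistic(times, cdf):
    """Kolmogorov-Smirnov distance between the empirical CDF and a fitted CDF."""
    xs = sorted(times)
    n = len(xs)
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(xs))


# Step 1: data. Synthetic stand-in for encounter durations in minutes (not real data).
rng = random.Random(7)
durations = [rng.lognormvariate(3.0, 0.4) for _ in range(500)]

# Steps 2-3: candidate distributions with maximum-likelihood parameters.
logs = [math.log(t) for t in durations]
mu_hat, sigma_hat = fmean(logs), pstdev(logs)  # lognormal MLE
mean_hat = fmean(durations)                    # exponential MLE (mean = 1 / rate)

fitted_normal = NormalDist(mu_hat, sigma_hat)
candidates = {
    "lognormal": lambda x: fitted_normal.cdf(math.log(x)),
    "exponential": lambda x: 1 - math.exp(-x / mean_hat),
}

# Steps 4-5: compare goodness of fit and keep the closest candidate.
distances = {name: ks_statistic(durations, cdf) for name, cdf in candidates.items()}
best = min(distances, key=distances.get)
print(distances, "->", best)
```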


Output Analysis: Why One Run Is Not an Answer

A single DES run is one sample path through a stochastic system. Run the same model with a different random number seed and you get different arrival times, different service durations, different queue realizations — and therefore different performance metrics. Treating a single run as “the answer” is the simulation equivalent of drawing one patient from a population and declaring their blood pressure to be the population mean.

Replications

The standard approach: run the model N times with different random seeds and treat the N results as a sample from the distribution of possible outcomes. Compute confidence intervals around performance metrics. Banks et al. recommend a minimum of 20-30 replications for stable estimates of means; estimating tail quantiles (e.g., 95th percentile wait time) requires more.
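
A replication study reduces to a loop over seeds followed by an ordinary confidence interval. The sketch below is a toy under stated assumptions: the "model" is a single-server queue computed with the Lindley recursion, the parameter values are illustrative, and the critical value 2.045 is the two-sided 95% Student-t value for 29 degrees of freedom (30 replications).

```python
import random
import statistics


def replication_mean_wait(seed, n_patients=60, mean_interarrival=8.0, mean_service=7.0):
    """One replication: mean wait in a single-server FIFO queue that starts empty."""
    rng = random.Random(seed)
    wait, total = 0.0, 0.0
    for _ in range(n_patients):
        total += wait
        service = rng.expovariate(1 / mean_service)
        gap = rng.expovariate(1 / mean_interarrival)
        wait = max(0.0, wait + service - gap)  # Lindley recursion: next patient's wait
    return total / n_patients


N_REPS = 30
T_CRIT_29 = 2.045  # two-sided 95% Student-t critical value, 29 degrees of freedom

results = [replication_mean_wait(seed) for seed in range(N_REPS)]
mean = statistics.fmean(results)
half_width = T_CRIT_29 * statistics.stdev(results) / N_REPS ** 0.5

print(f"mean wait: {mean:.1f} min, 95% CI half-width: {half_width:.1f} min")
print(f"range across replications: {min(results):.1f} to {max(results):.1f} min")
```

The spread between the best and worst replication is usually the most persuasive output: it shows stakeholders how far a single run could have landed from the mean.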

Warm-Up Periods

If a DES model starts with an empty system (no patients, all resources idle), the initial period is unrepresentative of steady-state operation. A clinic model that starts empty at 8:00 AM will show low utilization and short waits for the first hour simply because the system has not loaded up yet. If you are estimating steady-state performance, you must identify and discard this warm-up period — the transient phase before the system reaches statistical equilibrium.

Welch’s method (described in Law & Kelton) provides a systematic approach: run multiple replications, average the performance metric across replications at each point in simulated time, smooth that averaged series with a moving average, and identify the point where the smoothed curve levels off. Everything before that point is warm-up data and should be excluded from analysis.
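
The averaging and smoothing steps of Welch's procedure can be sketched as follows. The queue model, the parameter values, and the window size are assumptions chosen for illustration; the cutoff itself is normally picked by inspecting a plot of the smoothed curve, so the sketch stops at producing that curve.

```python
import random


def waits_from_empty(seed, n_patients, mean_interarrival=10.0, mean_service=9.0):
    """Per-patient waits for a single-server queue that starts empty."""
    rng = random.Random(seed)
    wait, waits = 0.0, []
    for _ in range(n_patients):
        waits.append(wait)
        service = rng.expovariate(1 / mean_service)
        gap = rng.expovariate(1 / mean_interarrival)
        wait = max(0.0, wait + service - gap)
    return waits


def welch_curve(replications, window):
    """Average across replications, then smooth with a centered moving average."""
    n = len(replications[0])
    across = [sum(rep[j] for rep in replications) / len(replications) for j in range(n)]
    smoothed = []
    for j in range(n - window):
        w = min(window, j)  # the window shrinks near the start of the series
        segment = across[j - w : j + w + 1]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


replications = [waits_from_empty(seed, n_patients=600) for seed in range(20)]
curve = welch_curve(replications, window=25)

for j in (0, 50, 100, 200, 400):
    print(f"patient {j:3d}: smoothed mean wait {curve[j]:.1f} min")
```

The curve starts at zero because every replication begins empty, then climbs toward its long-run level; observations before the curve flattens are the ones to discard.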

For terminating systems — a clinic that opens at 8 AM and closes at 5 PM — the warm-up question is different. The transient behavior is the behavior of interest. The model should start in the same state the real system starts in each day.

Common Mistake: Pseudo-Replication

Running one long simulation and dividing it into time windows is not the same as running independent replications. The windows are autocorrelated — a queue that built up in window 3 has not disappeared by window 4. This inflates apparent sample size and produces artificially narrow confidence intervals. Independent replications with different random seeds are the correct approach for most healthcare DES analyses.
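
The autocorrelation problem can be seen directly by comparing window means from one long run with means from independent replications. This is a sketch under stated assumptions: a single-server queue at roughly 90% utilization, windows of 50 patients, and lag-1 autocorrelation as the diagnostic. All names and values are illustrative.

```python
import random
import statistics


def waits(seed, n_patients, mean_interarrival=10.0, mean_service=9.0):
    """Per-patient waits for a single-server FIFO queue that starts empty."""
    rng = random.Random(seed)
    wait, out = 0.0, []
    for _ in range(n_patients):
        out.append(wait)
        wait = max(0.0, wait + rng.expovariate(1 / mean_service)
                   - rng.expovariate(1 / mean_interarrival))
    return out


def lag1_autocorrelation(xs):
    """Correlation between each value and the one that follows it."""
    m = statistics.fmean(xs)
    numerator = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    return numerator / sum((x - m) ** 2 for x in xs)


N_WINDOWS, WINDOW = 200, 50

# One long run, sliced into consecutive windows.
long_run = waits(seed=1, n_patients=N_WINDOWS * WINDOW)
window_means = [statistics.fmean(long_run[i * WINDOW:(i + 1) * WINDOW])
                for i in range(N_WINDOWS)]

# The same number of short runs, each with its own seed and an empty start.
replication_means = [statistics.fmean(waits(seed=1000 + i, n_patients=WINDOW))
                     for i in range(N_WINDOWS)]

r_windows = lag1_autocorrelation(window_means)
r_replications = lag1_autocorrelation(replication_means)
print(f"lag-1 autocorrelation, windows of one run: {r_windows:.2f}")
print(f"lag-1 autocorrelation, independent runs:   {r_replications:.2f}")
```

A clearly positive value for the windowed series means the 200 window means carry far fewer than 200 independent observations' worth of information, which is what makes the resulting interval too narrow.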


Healthcare Example: FQHC Clinic Flow

Consider a Federally Qualified Health Center with the following configuration:

  • 3 providers (two physicians, one NP), each with a daily panel of 18-20 scheduled patients plus walk-in capacity
  • 2 exam rooms per provider (6 total), shared across the clinic
  • 1 shared lab for point-of-care testing (HbA1c, rapid strep, urinalysis)
  • 2 medical assistants handling vitals, room prep, and patient rooming
  • Patient mix: 70% scheduled (arriving within a 15-minute window of appointment time), 30% walk-in (arrivals following a time-varying Poisson process peaking at 10:00-11:30 AM)

An operations manager looks at this system and sees three providers with two rooms each. Intuition says the binding constraint is provider time — add a fourth provider, get more throughput. The clinic’s CFO models the problem as a spreadsheet: 3 providers x 8 hours x 2.5 patients/hour = 60 patients/day capacity. Actual throughput is 48-52 patients/day. The gap looks like a provider efficiency problem.

A DES model of this clinic, with service times fit from EHR timestamps (lognormal for provider encounters, gamma for lab processing, empirical for room turnover), reveals something different:

The binding constraint during morning hours (8:00-11:30 AM) is exam room turnover, not provider availability. Here is the mechanism: When a provider finishes with a patient, the room must be cleaned, restocked, and the next patient roomed — a process taking 8-12 minutes. During morning peak, both rooms assigned to a provider are frequently occupied: one with the current patient, one being turned over. The provider is ready for the next patient but has no room to see them in. They idle for 4-6 minutes per cycle — invisible in the schedule but devastating to throughput when compounded across a morning of 10-12 patients per provider.

The DES reveals that this room turnover bottleneck costs the clinic approximately 6-8 patient slots per day across the three providers. It also reveals that the bottleneck is time-dependent: by afternoon, the walk-in surge has subsided, scheduled patients are more spread out, and room turnover is no longer binding. The system has two different bottlenecks at two different times of day — something no static analysis would capture.
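
The reported loss can be sanity-checked with back-of-envelope arithmetic. This is a consistency check on the figures quoted above, not a reproduction of the DES: it assumes one patient slot equals 24 minutes (the 2.5 patients/hour planning rate) and simply converts morning idle minutes into slots.

```python
PROVIDERS = 3
SLOT_MINUTES = 60 / 2.5  # 24 minutes per patient at the 2.5 patients/hour planning rate


def lost_slots(idle_minutes_per_cycle, morning_patients_per_provider):
    """Convert provider idle time waiting for rooms into lost patient slots per day."""
    idle_total = PROVIDERS * morning_patients_per_provider * idle_minutes_per_cycle
    return idle_total / SLOT_MINUTES


low = lost_slots(4, 10)    # 120 idle minutes
mid = lost_slots(5, 11)    # 165 idle minutes
high = lost_slots(6, 12)   # 216 idle minutes

print(f"lost slots per day: low {low:.1f}, mid {mid:.1f}, high {high:.1f}")
# lost slots per day: low 5.0, mid 6.9, high 9.0
```

The 6-8 slot estimate sits inside the 5-9 range implied by the idle-time figures, so the headline number is at least internally consistent with the mechanism described.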

The intervention test: The simulation tests adding a dedicated medical assistant whose sole morning responsibility is room turnover — stripping, cleaning, restocking, and pre-staging the next patient’s chart. With this role added, room turnover drops from 8-12 minutes to 4-6 minutes. Simulated throughput increases by 7-8 patients per day (approximately 15%). By contrast, a simulation adding a fourth provider with existing room infrastructure shows a throughput increase of only 4-5 patients per day — the new provider hits the same room bottleneck and spends even more aggregate time waiting for rooms.

The DES made the non-obvious case: a $35,000/year MA role produces more throughput than a $180,000/year additional provider. No amount of spreadsheet modeling or provider utilization reporting would surface this finding, because it requires modeling the temporal interaction between provider pace, room turnover, patient flow sequencing, and time-varying demand.
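
The comparison can be restated as salary cost per additional visit. The sketch below uses only the figures quoted above plus one assumption that is not in the source: 250 clinic days per year. It ignores benefits, revenue, and space costs, so it is a rough ranking aid rather than a financial model.

```python
CLINIC_DAYS_PER_YEAR = 250  # assumption: roughly five days a week; not from the source

options = {
    "dedicated turnover MA": {"annual_cost": 35_000, "added_per_day": (7, 8)},
    "fourth provider": {"annual_cost": 180_000, "added_per_day": (4, 5)},
}


def cost_per_added_visit(annual_cost, added_per_day):
    """Return (best case, worst case) salary cost per additional visit."""
    low_gain, high_gain = added_per_day
    return (annual_cost / (high_gain * CLINIC_DAYS_PER_YEAR),
            annual_cost / (low_gain * CLINIC_DAYS_PER_YEAR))


costs = {name: cost_per_added_visit(**spec) for name, spec in options.items()}
for name, (best, worst) in costs.items():
    print(f"{name}: ${best:.2f} to ${worst:.2f} per added visit")
# dedicated turnover MA: $17.50 to $20.00 per added visit
# fourth provider: $144.00 to $180.00 per added visit
```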


Build vs. Buy: Choosing a Simulation Approach

Commercial DES Tools

Arena (Rockwell Automation), Simul8, AnyLogic, and FlexSim are the major commercial platforms. They provide graphical model-building environments, built-in statistical analysis, animation, and output reporting. Their strength is rapid model construction for standard process flows; their weakness is cost ($5,000-$50,000+ annually) and the learning curve for complex logic.

For a health system that needs to model a specific operational question — “Should we reconfigure our ED fast-track?” or “What happens to surgical throughput if we add a second PACU bay?” — commercial tools are usually the right choice. The model can be built in days to weeks by someone with simulation training, and the animated output helps communicate results to non-technical stakeholders.

Custom Code

Python (with SimPy), R (with simmer), or Julia-based DES models offer full flexibility and zero licensing cost. The tradeoff is development time: a custom model requires programming every component that a commercial tool provides out of the box. Custom builds are justified when the model will be embedded in a production system (e.g., a real-time scheduling optimizer), when the logic is too complex for drag-and-drop tools, or when the organization has software engineering capacity but not simulation tool licenses.

Simplified Spreadsheet Models

For many operational questions, a full DES is overkill. If the system approximates a standard queueing configuration (M/M/c or similar), a spreadsheet implementing Erlang formulas or the Kingman approximation (see Module 2) gives directional answers in hours rather than weeks. Robinson’s work on conceptual modeling emphasizes that the appropriate level of model complexity is the simplest level that captures the dynamics relevant to the decision. A spreadsheet model that correctly identifies the binding constraint is more valuable than a DES model that takes three months to build and validate.
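
For reference, the two spreadsheet-level formulas fit in a few lines. The sketch below uses the standard textbook forms of the Erlang C delay probability and the Kingman approximation; the function names and the example rates are assumptions for illustration and may not match the notation used in Module 2.

```python
import math


def erlang_c_wait_probability(arrival_rate, service_rate, servers):
    """Probability that an arriving patient must wait in an M/M/c queue."""
    offered_load = arrival_rate / service_rate
    rho = offered_load / servers
    if rho >= 1:
        raise ValueError("utilization must be below 1 for a stable queue")
    below_c = sum(offered_load ** k / math.factorial(k) for k in range(servers))
    at_c = offered_load ** servers / (math.factorial(servers) * (1 - rho))
    return at_c / (below_c + at_c)


def mmc_mean_wait(arrival_rate, service_rate, servers):
    """Mean time in queue for M/M/c, in the time unit of the rates."""
    p_wait = erlang_c_wait_probability(arrival_rate, service_rate, servers)
    return p_wait / (servers * service_rate - arrival_rate)


def kingman_mean_wait(rho, mean_service, cv_arrival, cv_service):
    """Kingman approximation for mean time in queue with a single server."""
    return (rho / (1 - rho)) * ((cv_arrival ** 2 + cv_service ** 2) / 2) * mean_service


# Illustrative clinic: 3 providers at 2.5 patients/hour each, 6.5 arrivals/hour.
wait_hours = mmc_mean_wait(arrival_rate=6.5, service_rate=2.5, servers=3)
print(f"M/M/3 mean wait: {wait_hours * 60:.1f} minutes")

# With one server and both CVs equal to 1, Kingman reduces to the M/M/1 result.
print(mmc_mean_wait(0.9, 1.0, 1), kingman_mean_wait(0.9, 1.0, 1.0, 1.0))
```

If a formula like this already answers the question to the precision the decision needs, the simulation can wait.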

Decision heuristic: Use a spreadsheet model if the system has stable arrival rates, a single service stage, and homogeneous servers. Use DES when you face multi-stage routing, time-varying demand, shared resources across service lines, or you need to test time-dependent interventions (like shifting staff schedules). Use commercial DES for one-off studies; use custom code for models that will be maintained and rerun.


Verification and Validation

A simulation model that has not been verified and validated is a random number generator with a narrative attached. Stewart Robinson’s framework for simulation model credibility distinguishes two essential activities:

Verification answers: “Does the model do what we told it to do?” This is debugging. Does the arrival generator produce the specified distribution? Does the routing logic correctly assign patients to the right service? Does the resource release mechanism work? Verification is a software engineering problem — code review, unit tests, trace analysis, and structured walkthroughs. The most effective verification technique is trace-driven simulation: feed the model a known sequence of inputs (deterministic arrival times and service durations) and confirm the model produces the expected sequence of events.
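
A trace-driven check is small enough to write as a unit test. In the sketch below, trace_single_server is a hypothetical stand-in for the model under test; the point is the pattern: feed a deterministic trace whose outcome can be worked out by hand, then assert that the model reproduces it exactly.

```python
def trace_single_server(arrivals, services):
    """Return (patient, service start, service end) for one FIFO server."""
    free_at = 0
    log = []
    for patient, (arrive, service) in enumerate(zip(arrivals, services)):
        start = max(arrive, free_at)  # wait if the server is still busy
        free_at = start + service
        log.append((patient, start, free_at))
    return log


# Hand-computed expectation (minutes):
#   patient 0 arrives at 0, served 0-10
#   patient 1 arrives at 5, must wait until 10, served 10-20
#   patient 2 arrives at 30, server idle, served 30-35
expected = [(0, 0, 10), (1, 10, 20), (2, 30, 35)]
observed = trace_single_server(arrivals=[0, 5, 30], services=[10, 10, 5])

assert observed == expected, f"trace mismatch: {observed}"
print("trace reproduced")
```

Keeping a handful of such traces in the test suite catches regressions whenever routing or resource logic is edited.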

Validation answers: “Does the model match reality?” This is a statistical and operational question. Run the model with current parameters and compare outputs to actual performance data — average wait times, throughput, utilization by resource, queue lengths at specific times of day. Brailsford emphasizes that validation in healthcare DES is complicated by the difficulty of collecting clean operational data and by the fact that the “real system” itself may not be in steady state.

Validation is not binary. A model is validated with respect to a specific purpose. A model validated for estimating average daily throughput may not be valid for estimating 95th percentile wait times. The validation scope must match the decision scope.

Warning signs of invalid models:

  • The model matches average performance but misses variance (it produces the right mean wait time but too-narrow confidence intervals)
  • The model cannot reproduce known system responses to past changes (if you added a provider last year and throughput increased 10%, the model should predict something similar)
  • Stakeholders with operational knowledge reject the model’s behavior as unrealistic — “patients don’t actually flow that way” — and these objections have not been investigated
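
The first warning sign, matching the mean while missing the variance, is easy to test for. The sketch below is illustrative only: the daily throughput figures are made up, and the 10% mean tolerance is an assumed threshold that a real project would set according to the decision at hand.

```python
import statistics


def validation_report(simulated_daily, observed_daily, mean_tolerance=0.10):
    """Compare simulated and observed daily throughput on both level and spread."""
    sim_mean = statistics.fmean(simulated_daily)
    obs_mean = statistics.fmean(observed_daily)
    sim_sd = statistics.stdev(simulated_daily)
    obs_sd = statistics.stdev(observed_daily)
    return {
        "mean_ok": abs(sim_mean - obs_mean) <= mean_tolerance * obs_mean,
        "spread_ratio": sim_sd / obs_sd,  # well below 1 means the model is too calm
    }


# Made-up daily patient counts: same average, very different day-to-day variation.
simulated = [50, 51, 49, 50, 50, 51, 49, 50]
observed = [44, 56, 48, 53, 47, 55, 46, 51]

report = validation_report(simulated, observed)
print(report)
```

Here the means agree, yet the simulated spread is a small fraction of the observed spread, so the model would pass a mean-only check while badly understating bad days.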


Integration Points

Module 2 (Queueing Theory and Wait-Time Dynamics). DES is the computational extension of queueing theory. The M/M/c model gives you the utilization-delay curve for a stylized system with Poisson arrivals, exponential service, and identical servers. DES lets you model the actual system — with lognormal service times, priority-based triage, time-varying arrivals, heterogeneous providers, and multi-stage routing — and observe the delay dynamics that emerge. The queueing formulas tell you which parameters to measure and approximately what to expect. The DES model tells you what will actually happen with this specific configuration. The workflow is sequential: queueing theory for insight, DES for precision.

Human Factors Module 6 (Human Factors in Product Design). If a DES model is to be used by operators — not just built by analysts and presented in reports — its interface must be designed for human usability. Simulation tools that require users to specify 40 input distributions before seeing any output will not be used. Effective simulation products apply progressive disclosure (show the simple scenario first, let users add complexity), provide animated visualizations that build intuition for system dynamics, and present output as decision-relevant summaries rather than raw statistical tables. The gap between “we built a simulation” and “operators use the simulation to make better decisions” is a human factors problem, not a simulation problem.


Product Owner Lens

What is the operational problem? Healthcare systems make capacity and process design decisions based on averages, intuition, and precedent — then are surprised by emergent bottlenecks, cascading delays, and interventions that fail to produce expected improvements.

What mechanism explains the system behavior? Stochastic interactions between time-varying demand, multiple shared resources, and multi-stage patient routing produce emergent system behavior that cannot be predicted from component-level analysis. The nonlinearities identified in queueing theory (Module 2) are amplified by the complexity of real systems.

What intervention levers exist? DES does not provide levers directly — it tests them. The levers are the same as in any operations problem: staffing levels and schedules, resource counts, routing rules, scheduling policies, and process redesign. DES reveals which levers matter most and quantifies the expected impact before implementation.

What should software surface? A simulation capability embedded in an operational product should let users define scenarios (add a provider, change hours, shift demand), run simulations, and see projected impacts on wait times, throughput, utilization, and patient volumes — with confidence intervals, not point estimates. The interface should flag when a scenario pushes any resource into the steep region of the utilization-delay curve (above 85% utilization).
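
The utilization flag can be as simple as a threshold check on each resource's simulated busy time. The sketch below is one possible shape for that check: the function name, the resource names, and the minute totals are hypothetical, and only the 85% threshold comes from the text above.

```python
UTILIZATION_THRESHOLD = 0.85  # steep region of the utilization-delay curve


def flag_resources(busy_minutes, available_minutes, threshold=UTILIZATION_THRESHOLD):
    """Return the resources whose simulated utilization exceeds the threshold."""
    flagged = {}
    for name, busy in busy_minutes.items():
        utilization = busy / available_minutes[name]
        if utilization > threshold:
            flagged[name] = round(utilization, 3)
    return flagged


# Hypothetical scenario output for one simulated 8-hour day.
busy = {"exam_rooms": 2650, "providers": 1150, "lab": 300}
available = {"exam_rooms": 6 * 480, "providers": 3 * 480, "lab": 1 * 480}

flagged = flag_resources(busy, available)
print(flagged)  # {'exam_rooms': 0.92}
```

In a product, the same check would run on each replication so the flag can be reported as a frequency (for example, the share of simulated days on which rooms exceed the threshold) rather than as a single yes or no.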

What metric reveals degradation earliest? The divergence between simulated performance under current parameters and actual observed performance. When the model (validated against recent data) starts predicting better performance than reality delivers, something has changed that the model does not capture — a process has degraded, a resource has been quietly lost, or demand has shifted. The model-vs-actual gap is the canary.


Summary

Discrete-event simulation is the workhorse method for understanding healthcare operations that are too complex for closed-form analysis. It works by modeling individual events — arrivals, service starts, departures — and letting system behavior emerge from the interaction of stochastic processes, resource constraints, and routing logic. The method is mechanistically transparent: you specify what you know (arrival patterns, service time distributions, resource counts, routing rules), and the simulation reveals what you cannot deduce (where bottlenecks form, which interventions work, how sensitive the system is to parameter changes).

The discipline of DES is not primarily computational. It is modeling discipline: fitting input distributions honestly, running enough replications to produce credible estimates, validating the model against reality, and communicating results as distributions rather than point predictions. The most common failure mode is not algorithmic — it is the analyst who fits exponential distributions to everything, runs one replication, and presents the output as fact.

For healthcare operators, DES offers something that no other analytical method provides: the ability to test operational changes in silico before implementing them in a system where the cost of getting it wrong is measured in patient access, staff burden, and financial sustainability.