Traffic Simulation Runs: How Many Needed?
FHWA conducted a CORSIM case study to examine whether computer models can generate results reliable enough to support transportation investment decisions.
Engineers use microsimulation models to replicate individual vehicle movements on a second-by-second or even a subsecond basis to assess the traffic performance of highway and street systems. Microsimulation software uses factors such as vehicle type, road geometry, and driver aggressiveness to replicate the day-to-day variability of drivers and the decisions they make within the model. To use these models to make significant decisions about infrastructure improvements and investments, however, State departments of transportation (DOTs) and others need mean values or ranges of values that account for the hour-to-hour and day-to-day variability of traffic. To determine these mean values so they can be used in analyzing transportation design alternatives, engineers run a microscopic simulation model multiple times for the roadway segment and traffic period being analyzed.
"All too often we see only one run used for a given alternative," says Grant Zammit, operations team manager of the Federal Highway Administration (FHWA) Resource Center. "This approach does not recognize the stochastic nature of simulation, may erode the confidence and credibility of the recommendation or decision, and may unintentionally mislead the decisionmaker and the public." With tight budgets and limited human resources, DOTs often elect to reduce the number of runs and save money in the microsimulation analysis portion of project development.
So the question is: "How does a State DOT strike a balance between available resources and acceptable results?"
To address this question, FHWA carried out an in-house study in 2009 and 2010 investigating the relationship between the number of traffic simulation runs and the aggregate results under various levels of error. The goal was to determine the number of simulation runs required for a calibrated network -- a model that reproduces field-measured traffic conditions -- to produce results that are statistically valid at a predefined confidence level for various measures of effectiveness (MOEs).
MOEs are system performance statistics that show the degree to which a model meets performance objectives. Common systemwide MOEs, measuring traffic operations across the entire model network, include vehicle-miles traveled (total miles traveled by all vehicles on the network), vehicle-hours traveled (total hours of travel by all vehicles on the network), and mean system speed (mean speed of all vehicles on the network). Common link-level MOEs, which measure only the traffic conditions on a specific link (freeway segment), include mean speed, mean link discharge (number of vehicles exiting a link per time period), and link density (number of vehicles on a link in a specific time period).
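These systemwide measures are linked by a simple identity: mean system speed is commonly computed as vehicle-miles traveled divided by vehicle-hours traveled. The following minimal Python sketch illustrates the relationship using invented per-vehicle records rather than actual CORSIM output:

```python
# Hypothetical per-vehicle records from one simulation run:
# (miles traveled on the network, hours spent on the network)
vehicles = [(5.2, 0.15), (3.8, 0.09), (7.1, 0.22), (2.4, 0.05)]

vmt = sum(miles for miles, _ in vehicles)  # vehicle-miles traveled
vht = sum(hours for _, hours in vehicles)  # vehicle-hours traveled
mean_system_speed = vmt / vht              # space-mean speed, mi/h

print(f"VMT = {vmt:.1f}, VHT = {vht:.2f}, mean speed = {mean_system_speed:.1f} mi/h")
```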
Typically, a State DOT or other sponsoring agency devises predetermined calibration levels of acceptance for specific MOEs. However, funding and time available for a project often limit the extent of traffic analysis and thus the number of simulations run by the performing entity.
According to FHWA's Traffic Analysis Toolbox Volume III: Guidelines for Applying Traffic Microsimulation Modeling Software (FHWA-HRT-04-040), the results from individual runs can vary by 25 percent or more, and a single run cannot be expected to reflect any specific field condition. But without analyzing the traffic variability within the network and across the runs' results, how can a transportation professional conclude that an analysis is complete at 5 or 10 runs? Because of traffic variability, or the randomness of driver characteristics, a statistically calculated number of runs is necessary to achieve a predetermined level of confidence in the results, as prescribed by the sponsoring agency. A model may require 25, 30, 50, or more runs to minimize variability and stabilize results, providing a mean value or range of values that decisionmakers can use with confidence that they are designing for real-life conditions. What would be the effects on the results of concluding that the mean values from a model at 5 runs are acceptable, when in reality 30 runs are needed to provide results that best reflect typical traffic conditions at a predetermined level of confidence?
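Before turning to the case study, it helps to see what "a statistically calculated number of runs" looks like concretely. One common approach, sketched below in Python, is to find the smallest run count whose confidence-interval half-width falls within the tolerated error of the mean; the pilot statistics in the example are invented for illustration, not taken from the study:

```python
import math
from scipy import stats

def required_runs(pilot_mean, pilot_std, rel_error=0.10, confidence=0.95,
                  max_runs=1000):
    """Smallest number of runs N whose confidence-interval half-width
    is within rel_error of the mean. Uses the Student t distribution
    because the variance is estimated from the runs themselves."""
    tolerance = rel_error * pilot_mean
    for n in range(2, max_runs + 1):
        t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
        if t_crit * pilot_std / math.sqrt(n) <= tolerance:
            return n
    return max_runs

# Invented pilot statistics: mean link speed 45 mi/h, std dev 9 mi/h.
print(required_runs(45.0, 9.0))  # -> 18 runs for 10% error at 95% confidence
```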
The FHWA researchers recently set out to answer these questions using FHWA's Corridor Simulation (CORSIM) software, a traffic microsimulation program, in a case study of computer model runs.
Design of the Study
The FHWA research utilized a calibrated model of six freeway links on I-694 and I-35W in the northern Minneapolis/St. Paul, MN, metropolitan area. This model was calibrated in accordance with Traffic Analysis Toolbox Volume IV: Guidelines for Applying CORSIM Microsimulation Modeling Software (FHWA-HOP-07-079). Each freeway facility had either two or three through lanes in each direction, with peak hour volumes of nearly 2,000 vehicles per lane. The researchers analyzed the six links for the afternoon peak period of 3 to 6 p.m. They calculated three link-level MOEs -- lane density, link discharge, and link speed -- for each link by analyzing the measured data collected during the CORSIM simulation runs. They also calculated the sampling error for the mean value at 95 percent confidence for each MOE. For this study, the researchers followed the rule of thumb that a sampling error under 10 percent represents stability in the mean value because the number of runs minimizes the variability of results.
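The stability metric itself is simple to compute. A short sketch follows, assuming that sampling error here means the half-width of the 95 percent confidence interval expressed as a percentage of the mean; the speed values are hypothetical:

```python
import statistics
from scipy import stats

def sampling_error_pct(values, confidence=0.95):
    """Half-width of the t-based confidence interval for the mean,
    expressed as a percentage of the mean."""
    n = len(values)
    mean = statistics.mean(values)
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    return 100.0 * t_crit * statistics.stdev(values) / (n ** 0.5 * mean)

# Hypothetical mean link speeds (mi/h) from five independent runs:
speeds = [42.1, 38.5, 45.0, 30.2, 41.7]
print(f"{sampling_error_pct(speeds):.1f}%")  # -> 17.9%, above the 10% rule of thumb
```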
For each of the six freeway links, the researchers carried out six independent run sets (5, 10, 15, 20, 25, and 30 runs). To ensure that there was no correlation between individual runs or between run sets, each run was different and independent of all other runs and run sets. This independence ensures that the full variability between runs is captured and provides examples of the progressions toward stabilized runs.
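In practice, this independence is typically achieved by giving every run its own random number seed and never reusing a seed across sets. The sketch below does not reproduce the CORSIM interface; it uses a toy stand-in for a run to show the seed bookkeeping:

```python
import random

def simulate_run(seed):
    """Toy stand-in for one microsimulation run: returns a hypothetical
    mean link speed (mi/h) from an independently seeded random process."""
    rng = random.Random(seed)
    return 45.0 + rng.gauss(0.0, 9.0)

# Draw each run set from its own block of seeds so that no run is
# shared or correlated across the 5- through 30-run sets.
run_sets, next_seed = {}, 1
for size in (5, 10, 15, 20, 25, 30):
    run_sets[size] = [simulate_run(next_seed + i) for i in range(size)]
    next_seed += size

for size, results in run_sets.items():
    print(f"{size:>2}-run set: mean speed {sum(results) / size:.1f} mi/h")
```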
Level of Effort
For each model simulation, the researchers documented the level of effort required to run the sets for the six scenarios, noting the time required for additional runs, the total time to complete all runs and analysis, the average time per run, and the range of times per run for each set.
The following were the total times to complete all runs within a given set:
- 5 runs: 45 minutes
- 10 runs: 1 hour 17 minutes
- 15 runs: 1 hour 53 minutes
- 20 runs: 3 hours 19 minutes
- 25 runs: 3 hours 18 minutes (a likely reason for the 1-minute reduction in run time was a greater availability of computer and processing resources because fewer other computer programs were open at the same time)
- 30 runs: 4 hours 29 minutes
Across all sets, the time per run ranged from 6.8 to 11.1 minutes.
The researchers considered these scenarios to be adequate examples of real-world urban freeway models. The level of effort will vary, depending on model complexity, size, data collection, and traffic volumes. Additional variation in simulation duration may occur because of factors such as the number of programs running on the computer and the number of simulations already completed on a given day.
Based on this case study, the researchers concluded that the level of effort for additional runs was not significant. Thirty simulation runs can be completed in a few hours (approximately 4.5 hours for this case study), and the runs can be carried out in the background while working on other tasks on the computer.
Overall Results
Many of the numerical and graphical representations of the run sets revealed recognizable improvements in the results between 5 and 10 runs. But the overall evidence from this particular case study indicated that at least 10 to 15 simulation runs were needed to stabilize the results. It is important to note that the number of simulation runs necessary to attain stabilized results varies from one project to the next. Further, stabilized results do not necessarily meet the predetermined levels of confidence in reflecting real-world traffic conditions. A statistical analysis is necessary to validate the model results for project approval.
After the researchers achieved a stabilized run set, minimal benefit was obtained from additional runs. In addition, the researchers found that the most significant variability between runs occurred at the onset of congested conditions (that is, peak hours), because the beginning of queue development varied in intensity and initiation time.
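One way to operationalize "run until stabilized" is a sequential stopping rule: start with a minimum batch, then add runs one at a time until the sampling error of the mean drops below the acceptance threshold. A minimal Python sketch under assumed names and thresholds, with a toy stand-in where a real workflow would invoke the microsimulation model:

```python
import random
import statistics
from scipy import stats

def sampling_error_pct(values, confidence=0.95):
    n, mean = len(values), statistics.mean(values)
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    return 100.0 * t_crit * statistics.stdev(values) / (n ** 0.5 * mean)

def run_until_stable(simulate, threshold_pct=10.0, min_runs=5, max_runs=50):
    """Add one run at a time until the sampling error of the mean falls
    below the acceptance threshold (or a practical cap is reached)."""
    results = [simulate(seed) for seed in range(min_runs)]
    while (len(results) < max_runs
           and sampling_error_pct(results) > threshold_pct):
        results.append(simulate(len(results)))
    return results

# Toy stand-in for a run; seeds are never reused across calls.
runs = run_until_stable(lambda s: 45.0 + random.Random(s).gauss(0.0, 9.0))
print(f"stabilized after {len(runs)} runs")
```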
Of the six scenarios, the two that contained readily apparent nuances for comparison were the northbound segments 147-149 on I-35W and westbound segments 718-720 on I-694.
Scenario A: Link 147-149
Link 147-149 on I-35W was entered in the model as a three-lane freeway segment approximately 1,300 feet (396 meters) in length. As determined from field data, the traffic on this link exhibited free-flow conditions through the first hour of the afternoon peak period analyzed, until approximately 4:15 p.m. From there, speed steadily decreased from 60 to 30 miles per hour, mi/h (97 to 48 kilometers per hour, km/h), by 5 p.m., resulting in increases in density and a gradual decrease in discharge volumes. Congestion was present at this location until approximately 5:45 p.m., at which time traffic returned to free-flow conditions.
The mean link discharge model values did not vary significantly between run sets and generally followed the shape of the field data within 5 to 10 percent. The link discharge sampling error was less than 10 percent for all run sets, showing minimal variability across all runs, and tended to stabilize around 2 percent after more runs. Overall, minimal improvement in the results for mean link discharge was gained from additional runs.
Although the model's mean speed values generally followed a similar trend across all run sets, the 5-run set had a mean speed approximately 5 to 10 mi/h (8 to 16 km/h) lower than the others during congestion. The 15-run set was similar to the 30-run set, indicating minimal return on investment beyond 15 simulation runs.
Further investigation of the model's speed values showed the variability of the results from individual runs. The mean speed sampling error was high for the 5-run set, exceeding 50 percent of the mean. The error was reduced by nearly half, to 25 percent, for the 10- through 20-run sets. However, the incremental error reductions among the 20-, 25-, and 30-run sets were minimal, reducing the error only to just under 20 percent before leveling out. Stabilization improved significantly with each additional run up to 10 runs; beyond 10 runs, however, additional runs provided minimal further benefit in reducing error.
Similar to the trend shown in the mean speed graph, the 5-run set stood out from the other run sets for the mean lane densities. The 5-run set varied from the other sets by as much as 30 vehicles per mile per lane between the onset and closeout of congestion. The other sets varied among each other by a range of 5 to 10 vehicles per mile per lane.
The sampling error of the mean exceeded 60 percent for the 5-run set between 5:20 p.m. and 6 p.m., indicating significant variability in the mean density values between runs. For this time period, as the number of runs increased, the sampling error decreased dramatically. The 15-run set had a peak sampling error of nearly 30 percent, and the 30-run set had a peak sampling error of 20 percent, both significant improvements from the 60 percent for the 5-run set. With this trend, more than 30 runs would be necessary to reduce the sampling error to less than 10 percent.
In conclusion, for Link 147-149 the return on additional runs typically diminished after about 20 runs. The scenario provided several examples of variability in queue initiation time and intensity, as well as their effect on the sampling error for lane density and link speed. Although the model results for mean link discharge were quite similar across all run sets, the mean link speed and lane density MOEs deviated much more significantly. This result was most evident in the differences between the mean values of the 5-run set and those of the other run sets.
Scenario B: Link 718-720
Link 718-720 on I-694 was coded in the model as a two-lane freeway segment approximately 1,400 feet (427 meters) in length. According to the field data, traffic exhibited free-flow conditions until 4 p.m., the onset of the link's congested conditions. Operations then broke down, reaching a point of minimal throughput between 5 and 6 p.m. This scenario highlighted the effects of simulation variability when a link transitions from uncongested to congested conditions, as well as the importance of performing multiple runs.
Comparable to Link 147-149, the simulation results for the mean link discharge were similar across the six run sets. The sampling error exceeded 10 percent for only one 15-minute time interval for the 5-run set. The error continued to be reduced with additional runs and stabilized between 2 and 4 percent at 15 runs. Thereafter, the researchers found minimal improvement with additional runs.
The mean speed plot highlighted the transition from uncongested to congested conditions beginning at 4 p.m. Vehicle speeds steadily declined until 5 p.m., when they stabilized between 10 and 15 mi/h (16 and 24 km/h) and remained there through the end of the analysis period. Although each run set followed the same general trend, noticeable differences appeared between them. The onset and intensity of the congested conditions varied between run sets, showing the inherent variability of individual runs. This was highlighted in two time periods:
- At 4:10 p.m., the mean speed for the 5-run set was 40 mi/h (64 km/h), while the mean for the 10-run set was nearly 55 mi/h (89 km/h) and the mean for the 30-run set was about 49 mi/h (79 km/h).
- Similarly, at 4:45 p.m., the mean speed of the 5-run set was about 11 mi/h (18 km/h), while the mean speed with 25 runs was over 25 mi/h (40 km/h).
The variability of individual runs was further evident in the sampling error for mean speed. The sampling error across all run sets exceeded 25 percent between 4 p.m. and 5 p.m., with the 5- and 10-run sets exceeding 50 percent on two occasions. Significant error was still present in the 25-run set, with values nearly three times higher than those for the 10-, 15-, and 20-run sets from 5 p.m. to 6 p.m. Additional runs would be necessary to reduce the error below 10 percent for the mean speed MOE.
Although all run sets followed a similar trend of increasing mean lane densities, there were notable extremes for the 5- and 10-run sets. The 15- through 30-run sets tended to show similar mean values, indicating a gravitation toward stabilization. The sampling error peaked at over 70 percent for the 5-run set during the queue formation but was reduced to 30 percent for the 10-run set, which was a significant improvement. The error was further reduced to 20 percent for the 30-run set.
Results from Link 718-720 showed the importance of establishing predetermined levels of acceptance prior to calibration in order to identify the goals for level of effort and the desired confidence in the results. Looking only at the link discharge MOE, a transportation professional might conclude that 5 or 10 simulation runs were enough to reduce the sampling error below 10 percent. However, other MOEs showed that more than 30 runs were needed to bring the errors under 10 percent at the desired confidence level. The onset and continued deterioration of congested conditions were difficult to model because of varying queue characteristics, such as start time, intensity, and upstream and downstream traffic conditions.
Field Data and Model Results
Traffic data collected in the field is the basis for modeling real-life traffic conditions on a selected network. The selected network is calibrated toward this field data, which represents a typical, or average, traffic period. As shown in the examples in this research, a certain number of runs are needed to provide model results that mirror the field results, thus minimizing the individual variability of the model runs.
The examples in this research look only at link-level MOEs, not network-level MOEs. The specific links and calibration MOEs used by the model developers during calibration are unknown and thus might differ from the links and MOEs analyzed in this case study. Although the model results were similar to the field data for mean link discharge, the variation between field-measured data and model results for mean speed often was significant.
Concluding Remarks
Results generated from Scenarios A and B on calibrated models of I-35W and I-694 in the northern Minneapolis/St. Paul metropolitan area provide graphical examples of how the number of microsimulation runs affects the results that engineers ultimately use to help make transportation improvement and investment decisions. As shown, the improvements to the mean values vary between run sets, some with significant differences and others with negligible ones. According to the case study, the most notable differences were between the 5- and 10-run sets, and results generally stabilized at some point after 10 to 15 simulation runs.
Sharp peaks and valleys were noticeable in the sets with a smaller number of simulation runs. With fewer runs, the presence of outliers and the influence of variability in queue initiation were more apparent in the mean and sampling error plots. Sets with a larger number of runs tended to generate more rounded and stable line plots. Additional runs reduced the sampling error: as the number of runs increased, the mean and the sampling error of each set stabilized. In this case study, that stabilization occurred within the 20-, 25-, and 30-run sets for the three MOEs.
The examples provided in this case study show the consequences of carrying out too few simulation runs: the variability in the results is not adequately reduced. Similarly, when future increases in annual traffic demand are applied, a poorly calibrated model might skew the results further as the level of demand grows. State DOTs and local agencies usually use calibration acceptance targets, such as error tolerance and confidence intervals, to ensure that an acceptable level of calibration is achieved.
Jonathan D. Wiegand is a transportation engineer with FHWA's Nebraska Division Office. Wiegand joined FHWA in 2007. He received a B.S. in civil and environmental engineering from South Dakota State University and an M.S. in civil engineering (transportation) from Iowa State University.
C. Y. David Yang, Ph.D., is a research engineer with FHWA's Office of Operations Research and Development in McLean, VA. Yang joined FHWA in 2008 and is responsible for traffic modeling and simulation research at the Turner-Fairbank Highway Research Center. He is the chair of the Transportation Research Board's Committee on User Information Systems and serves on the editorial board of the Journal of Intelligent Transportation Systems. He attended Purdue University and received B.S., M.S., and Ph.D. degrees in civil engineering.
The authors would like to thank James McCarthy of FHWA's Minnesota Division Office for providing the data used in the case study and Randall VanGorder of FHWA's Office of Operations Research and Development for providing two of the CORSIM photos, and acknowledge John Halkias from the Office of Operations for his advice and comments at the beginning stage of this work.
For more information, contact Jonathan Wiegand at jonathan.wiegand@dot.gov or 402-742-8475, or David Yang at david.yang@dot.gov or 202-493-3284.