Simulating Survey Results
This simulation is designed to demonstrate the difficulty of accurately estimating a death toll on the basis of a random sample of survivors. Estimates will generally be low, and the margin of error increases as clustering of mortality increases. (Please note that this is a beta version.)
An overview of the problems demonstrated by this simulation is outlined below, in the section entitled Understanding the Simulation.
To run the simulation, you'll need to set five parameters: a population size, an average family size, a death rate, the number of survivors who will be polled, and the extent of clustering of mortality. For details on each of these parameters, and the other fields of the table, click the headings below:
Starting Population
Average Family Size
Death Rate
Number of Families to Poll
Extent of Clustering
Number of Families Successfully Polled
Total Deaths in Polled Families
Average Number of Deaths per Polled Family
Total Estimated Deaths
Total Actual Deaths
Discrepancy
Correction Factor
A technical description of the algorithms used in the simulation follows the descriptions of the various fields.
Understanding the Simulation
The Purpose Of This Program
History is untidy: It is not always easy to reconstruct what really happened. This is particularly true in the case of catastrophic events. War, genocide, and natural disasters leave an indelible mark on victims and survivors, but these events often disrupt or destroy the institutions responsible for recordkeeping. The effects of these disasters, however memorable, become difficult to quantify.
One method of estimating the toll of these events is by polling survivors to determine the effects of these crises on their own families. Death rates are then calculated on the basis of these responses, and applied to the affected population to estimate an overall death toll. As an example, let's suppose that a particular village had, before the crisis, a population of 3,000. If a poll of survivors revealed that, on the average, one out of every six family members died during the crisis, we could apply this death rate (about 17%) to the entire population. This would suggest a death toll of around 500.
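The arithmetic behind this estimation method can be sketched as a small JavaScript function. (This is a hypothetical helper for illustration; it is not taken from the simulation's own code.)

```javascript
// Estimate total deaths from a survey: divide the population into
// families, then scale the per-family death average up to all families.
function estimateDeaths(population, familySize, avgDeathsPerFamily) {
  var totalFamilies = population / familySize;   // e.g. 3000 / 6 = 500 families
  return avgDeathsPerFamily * totalFamilies;     // e.g. 1 * 500 = 500 deaths
}
```

With the village example above (population 3,000, families of six, an average of one death per family), the function returns an estimate of 500.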
At a glance, this seems like a reasonable method for estimating mortality. In reality, however, there is a serious problem with this method: it will typically result in an underestimate. This program is designed to simulate a crisis, and the probable results of estimations derived from polling the survivors.
First, let's examine why this method is flawed; then, we'll look at how the simulation works.
Why An Underestimate?
The first shortcoming of this method is fairly straightforward: if everyone in a family dies, that family cannot be represented in a poll of survivors. This means that the families with the highest rate of fatalities cannot be included in the overall average. One might argue that it is fairly rare for every member of a family to die, and that this would have a negligible effect on the overall estimate. There is, however, a related phenomenon which does significantly affect the calculated death rate. The problem is that not all families have an equal chance of being represented in the survey. Random surveys will tend to overrepresent families with more survivors; after all, there are more of them. In fact, those with the highest likelihood of inclusion in the survey are those from the families with the fewest deaths.
To illustrate this, consider a simple example. Imagine that we return to the village we described earlier, with a starting population of 3,000. What happens if the fatalities in this village are not distributed evenly? Suppose that we have a total of 300 families, each with ten members, and that 150 of those families suffered three fatalities per family. The other 150 families, meanwhile, suffered one fatality per family. The total death toll, then, is 600. Initially, it might seem that our method of sampling survivors will work just fine: half of our families had three fatalities, and half had one; so on the average, there were two deaths per family, which matches the actual death toll precisely.
So why is there a problem? Remember, we will not be polling all of the survivors. We're going to poll only a fraction of them, on the assumption that a random sample will accurately reflect the entire group. Is that true of our hypothetical village? Of the 2,400 survivors, there are 1,350 from families where only one member died... but only 1,050 from families where three members died. Imagine a raffle where one person buys 1,350 tickets, and another buys 1,050. Who has a better chance of winning? The same principle applies to the random sample: because survivors from families with a lower death rate outnumber the survivors from families with a higher death rate, the former are more likely to be selected for inclusion in the survey.
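The village example can be worked through in code. (The variable names below are ours, chosen for illustration.)

```javascript
// 150 families lost 1 member each; 150 families lost 3 members each.
var lowLossSurvivors  = 150 * (10 - 1);  // 1,350 survivors from low-loss families
var highLossSurvivors = 150 * (10 - 3);  // 1,050 survivors from high-loss families
var totalSurvivors = lowLossSurvivors + highLossSurvivors;  // 2,400

// Probability that a randomly chosen survivor comes from each group:
var pLow  = lowLossSurvivors / totalSurvivors;   // 0.5625
var pHigh = highLossSurvivors / totalSurvivors;  // 0.4375

// Expected deaths-per-family reported by a randomly chosen survivor:
var expectedReported = pLow * 1 + pHigh * 3;     // 1.875
```

The expected survey result is 1.875 deaths per family, below the true average of 2; scaled up to 300 families, the survey would suggest about 562 deaths instead of the actual 600.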
How severe is this problem? It depends. Two critical factors are the methods used to select the respondents, and the extent to which mortality is clustered among certain families.
Clustering of Mortality
Is it likely that deaths will be distributed fairly evenly among all families? Or is it more likely that some families will suffer far more than others? The answer depends primarily on the causes of the excess mortality. Consider the case of an epidemic that affects mainly the elderly. If we assume that most households have a similar demographic composition (for example, two grandparents, two parents, two children), we would probably see a fairly even distribution of mortality. Contrast this with the case of a village bombarded by artillery: a shell hitting a single house might kill everyone inside, while the family next door is unharmed. As deaths become increasingly clustered within particular families, it becomes less and less likely that a random sample of survivors will yield an accurate estimate of deaths.
While this problem is most extreme when deaths are deliberately concentrated within particular families (in the case of ethnic cleansing, for example), it's important to remember that some clustering is likely to occur even when the probability of dying is purely random. Why? This is because reality does not always match probability, particularly within small samples. Imagine tossing a coin: we know that there is a 50/50 probability of heads or tails. Out of every ten tosses, however, we will not always have five heads and five tails; sometimes we might wind up with six of one and four of the other, or seven and three, or even ten and zero. Results will typically conform to a bell curve; and since the nature of the survey means that results from one tail of the curve are overrepresented, while results from the other tail are underrepresented, the calculations derived from the survey are likely to be skewed.
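The coin-toss point can be demonstrated with a short sketch: even when every individual faces the same death rate, the number of deaths varies from family to family. (This is an illustrative snippet, not taken from the simulation's code.)

```javascript
// Apply a uniform death rate to one family and count the deaths;
// across many families, the counts follow a binomial (bell-like) curve.
function simulateFamilyDeaths(familySize, deathRate) {
  var deaths = 0;
  for (var i = 0; i < familySize; i++) {
    if (Math.random() < deathRate) deaths++;
  }
  return deaths;
}

// Tally deaths across 10,000 ten-member families at a 20% death rate:
// most families lose about 2 members, but some lose none and a few
// lose 5 or more, even with no "real" clustering at all.
var counts = {};
for (var f = 0; f < 10000; f++) {
  var d = simulateFamilyDeaths(10, 0.2);
  counts[d] = (counts[d] || 0) + 1;
}
```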
How Does the Simulation Work?
The form at the top of this page contains twelve fields. The top five fields can be set by the user; the bottom seven fields display the probable results, based on the values entered by the user.
The Input Fields
Starting Population
This number sets the size of the population for our simulation. This should be set to a number between 1000 and 100000. Simulations from a too-small population will not always yield consistent results, and exceedingly large simulations will be very time-consuming, and may cause your browser to appear to stop responding. Recommended setting: 50000.
Average Family Size
Because we will estimate the number of deaths by determining the average number of deaths in each family, we need to define exactly how big each family will be. We can think of this number as the size of the immediate family, or the size of the extended family. This number must be at least 4, and at most 30. In general, larger family sizes will yield more consistent results. Excessively large family sizes, however, will force the use of smaller sample rates, which will again lead to less reliable results. Recommended setting: 10.
Death Rate
The death rate represents the probability of dying during the crisis. If set to 25%, for example, the simulation will apply a one-in-four chance of dying to every member of the starting population. Think of a 25% death rate as rolling a four-sided die for every individual: if the result is 1, 2, or 3, that individual lives; if the result is 4, they die. (The highest death rate allowed in this simulation is 50%.) Recommended setting: 20.
Number of Families to Poll
This field determines how many families we will survey after the crisis. Setting this to 1000 would mean that we will try to find a survivor from each of 1,000 different families, and we will ask that survivor how many members of their family died. Just as in an actual survey, a sample that is too small may give us poor results. There is also a maximum sample rate that can't be exceeded; after all, we can't ask more families than we have in our starting population. Recommended setting: 1000.
Extent of Clustering
This simulation allows us to model three different levels of clustering of mortality: None, Moderate, or High. When the extent of clustering is set to "None," all families in the simulation are subject to the same probability of deaths. When set to "Moderate" clustering, some families have a slightly elevated risk of death, and other families have a slightly decreased risk of death; and when set to "High," some families have a significantly increased risk of death, while others have a significantly decreased risk of death. (A more detailed explanation, including a table that illustrates probable death rates, is outlined in the description of the clustering algorithms in the technical description of this program.)
The Results Fields
Number of Families Successfully Polled
The dead tell no tales: we can't ask them how many other members of their family died. At very high sample rates, we may not be able to find as many people as we want to poll. This field will show how many survivors we successfully interviewed. A more detailed explanation of how this number is calculated is outlined in the technical description of this simulation.
Total Deaths in Polled Families
This field displays the exact total of deaths among the families we interviewed.
Average Number of Deaths per Polled Family
This field shows the average number of deaths in each family we polled, determined by dividing the total number of deaths by the number of families successfully polled.
Total Estimated Deaths
The Total Estimated Deaths field displays the estimate we would arrive at if we assume that our survey has given us an accurate representation of the death rate. The estimate is calculated by dividing the starting population by the average family size; this gives us the total number of families. Next, we take the average number of deaths per family (calculated from our poll responses) and multiply it by the total number of families. This gives us our estimate of the total number of deaths for the entire community.
Total Actual Deaths
This field shows how many individuals died during our simulated crisis. It is calculated by applying the probability of dying (that is, the death rate entered by the user) to every member of the starting population.
Discrepancy
The Discrepancy field displays how much the estimate deviates from the actual total; it is shown as a percentage of the true toll.
Correction Factor
The correction factor is the number by which we would need to multiply the estimate in order to arrive at the correct toll.
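The last two fields can be expressed as simple formulas. (These function names are ours, for illustration; the page's own script may compute them differently.)

```javascript
// Discrepancy: how far the estimate deviates from the true toll,
// expressed as a percentage of the true toll (positive when the
// estimate falls short).
function discrepancyPercent(estimatedDeaths, actualDeaths) {
  return 100 * (actualDeaths - estimatedDeaths) / actualDeaths;
}

// Correction factor: the number to multiply the estimate by in order
// to recover the actual toll.
function correctionFactor(estimatedDeaths, actualDeaths) {
  return actualDeaths / estimatedDeaths;
}
```

For example, an estimate of 450 deaths against an actual toll of 600 gives a discrepancy of 25% and a correction factor of about 1.33.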
Technical Details
This simulation is written in JavaScript. The actual code for the script can be seen by displaying the HTML source of this page. An overview of the methods used in the simulation is provided below.
Modeling a Crisis
How can we simulate a mortality crisis? Two elements are necessary: a starting population, and a probability of dying. As in reality, there is an element of chance here. We don't know exactly how many people will die, nor do we know who will survive, and who will not. We model this by setting a death rate, and applying this probability to each and every person in the population. In effect, if we begin with a population of 10,000, we are going to roll the dice 10,000 times.
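A minimal sketch of this model follows, with one random "roll" per person against the death rate. (The function name is ours; the page's own source may differ.)

```javascript
// Apply the death rate to every member of the population, one roll of
// the dice per person, and return the resulting number of deaths.
function simulateCrisis(population, deathRate) {
  var deaths = 0;
  for (var i = 0; i < population; i++) {
    if (Math.random() < deathRate) deaths++;
  }
  return deaths;
}
```

Because each death is decided by chance, the returned total will vary from run to run, clustering around population × deathRate.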
Polling the Survivors
Once we have simulated a crisis, how do we determine who will be included in our poll? This is also left to chance. Given a family size of n, we will try to ask the nth person in each family. (That is, if the family size specified in the simulation is 5 members, we will ask the 5th person in the family; if the family size is 10, we will ask the 10th member, and so on.) An important point here is that we will try to ask. Since every member of the population has a chance of dying during the crisis, there will be times when the family member we want to query has died.
But why do we need to ask one particular member of the family? If the tenth member of the family has died, why couldn't we just as easily ask the eighth member, or the ninth member? We use this method because the availability of one specific member of the family is a reflection of the mortality within that particular family. If, for example, there has been only one death in a family of ten, there is only a one-in-ten chance that it was the tenth member who died; but if there have been nine deaths in the family, there is a nine-in-ten chance that Member 10 is among the dead. By consistently questioning the same member within each group, we mimic the real-world difficulty of finding survivors from families with higher mortality. The probability that a family with eight survivors would be represented in a poll of survivors is four times as great as the probability that a family with only two survivors would be represented. The same ratio holds true for polling a particular member of each family: in a family with four times as many survivors, the likelihood that the specific member we want to poll has survived is four times as great. Querying a specific member of the family is also consistent with certain real-world constraints, in that we might not be able to get accurate answers from every survivor. (If, for example, the sole survivor is a two-year-old, we can't expect to get an accurate accounting of the rest of the family.)
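The polling rule can be sketched as follows, assuming each family is represented as an array of booleans (true = survived). The function name and data layout are ours, chosen for illustration; the actual script may store families differently.

```javascript
// Try to poll the last (nth) member of each family. The poll succeeds
// only if that member survived; if so, record the family's death count.
function pollFamilies(families, familiesToPoll) {
  var polled = 0;
  var totalDeaths = 0;
  for (var f = 0; f < familiesToPoll && f < families.length; f++) {
    var family = families[f];
    if (family[family.length - 1]) {  // the designated respondent survived
      polled++;
      for (var m = 0; m < family.length; m++) {
        if (!family[m]) totalDeaths++;
      }
    }
  }
  return { polled: polled, totalDeaths: totalDeaths };
}
```

Note that a family whose designated respondent died contributes nothing to the tally, even if other members survived; this is exactly the bias the simulation is designed to expose.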
An alternative algorithm for modeling this difficulty would be as follows: we simulate the mortality crisis in the same manner, but record the identity of each survivor, and how many of that survivor's family members died. All the survivors are then put into an array, and respondents are randomly selected from that array, taking measures to ensure that we do not ask more than one survivor from each family. One could argue that this is in fact a more accurate representation of real-world conditions. Mathematically, either method models the effects of higher mortality fairly accurately, but the second method is far less efficient programmatically.
Modeling Clustering of Mortality
As noted previously, clustering of mortality refers to a situation where fatalities are disproportionately concentrated within certain groups or areas. The program allows for the simulation of increased mortality within particular families: in other words, a situation where some families suffer significantly higher risk of death than other families.
What constitutes "Moderate" clustering? What constitutes "High" clustering? These terms are, of course, subjective, and there is no universal formula to express the "real" extent of clustering; after all, the amount of clustering in any crisis can be determined by an almost infinite number of variables.
So how can we simulate clustering? Essentially, we break our overall population into smaller groups, and apply higher and lower mortality to some subsets. For the sake of example, let's call our base death rate (x). Breaking our starting population into ten subsets, the death rate under our various clustering scenarios would be as follows:
Subset      No Clustering   Moderate Clustering   High Clustering
Subset 1    x               .5x                   .1x
Subset 2    x               .75x                  .2x
Subset 3    x               x                     .4x
Subset 4    x               x                     .8x
Subset 5    x               x                     x
Subset 6    x               x                     x
Subset 7    x               x                     1.2x
Subset 8    x               x                     1.6x
Subset 9    x               1.25x                 1.8x
Subset 10   x               1.5x                  1.9x
Strictly speaking, the table represents the probable breakdown of the death rates in the Moderate and High clustering scenarios. Here again, we rely on chance: for every family, we generate a random number between 1 and 10. That random number is then used to determine which subset the family belongs to: if it is 1, they are assigned the death rate for Subset 1; if it is 2, they are assigned the death rate from Subset 2, and so on.
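The subset assignment can be sketched as follows, using the multipliers from the table. (The names below are ours; the actual script may organize this differently.) Note that each set of multipliers averages to 1, so clustering redistributes risk among families without changing the expected overall death rate.

```javascript
// Death-rate multipliers for the ten subsets under each clustering level.
var CLUSTER_MULTIPLIERS = {
  none:     [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
  moderate: [0.5, 0.75, 1, 1, 1, 1, 1, 1, 1.25, 1.5],
  high:     [0.1, 0.2, 0.4, 0.8, 1, 1, 1.2, 1.6, 1.8, 1.9]
};

// Assign a family to a random subset and scale its base death rate.
function familyDeathRate(baseRate, clustering) {
  var subset = Math.floor(Math.random() * 10);  // random subset, 0 through 9
  return baseRate * CLUSTER_MULTIPLIERS[clustering][subset];
}
```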
Limitations
The methods used in this simulation impose certain restrictions. Some input parameters are limited because extreme values would yield statistically unreliable results. Even beyond these hard-coded restrictions, however, there are obvious limitations to the use of the program. Survey size is a perfect example: just as in reality, polling too few survivors would produce unreliable results. (In the real world, for example, one would not attempt to predict the outcome of an election on the basis of a survey of ten people.)
The methods used in the simulation also require an upper limit on the number of survivors we can poll. As noted previously, since we are going to calculate the death toll by tallying deaths within individual families, there is no point in asking more than one person in each family. Additionally, there will be times when we can't find as many survivors as we want to poll. Because the simulation relies on probable mortality, there is no way to know precisely how many survivors we will have in any given simulation. Typically, however, we will start to have difficulty finding respondents when the number of families we want to poll reaches (a − ab), where a equals the number of families at the start of the simulation, and b equals the death rate. As an example, let's say we have a starting population of 1000 people, with 10 people in each family. The death rate is 20%, and we want to poll 85 people. Since we will be asking the tenth member of each family how their family fared, we have 100 potential respondents. Those 100 respondents, however, will be subject to the same death rate as the rest of the population. With a 20% chance of dying, we will probably have only about 80 survivors among our respondents; so it is unlikely that we will be able to successfully poll 85. The number of people we were able to ask is displayed in the "Number of Families Successfully Polled" field once the simulation has run.
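The ceiling described above can be expressed as a short formula. (This is a hypothetical helper for illustration; the simulation does not necessarily compute it explicitly.)

```javascript
// Expected number of pollable families: a - ab, where a is the number
// of families and b is the death rate applied to each family's
// designated respondent.
function expectedRespondents(startingPopulation, familySize, deathRate) {
  var families = startingPopulation / familySize;  // a
  return families - families * deathRate;          // a - ab
}
```

For the worked example (population 1000, families of 10, 20% death rate), this yields about 80 expected respondents, matching the discussion above.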
Because this program is written in JavaScript, it suffers from some performance limitations. JavaScript is not particularly efficient, and large simulations can be very slow. The notes above indicate that the starting population should not be set to more than 100,000. Strictly speaking, the program can model larger numbers, but your processor will be heavily taxed. On a Celeron 650, a simulation using a starting population of 200,000 took about 18 seconds; midway through the simulation, Internet Explorer popped up an alert saying that the script was causing the browser to run slowly, and offering an option to cancel its execution. A simulation with a starting population of 8,000,000, meanwhile, took nearly five minutes on a Duron 1.6 GHz with 256MB of RAM.
Known Bugs
Currently, the simulation discards "excess" family members if the total population is not evenly divisible by family size.
There is at this point virtually no input validation. Numbers must be entered as numbers only, with no commas or other characters.
Unquestionably there are other bugs as well; this is, again, a beta version.
Credits
This program was written by John May and Bruce Sharp, February 2005.
Related Page
This simulation was originally created to accompany an evaluation of various estimates of the death toll in Cambodia. That article, entitled Counting Hell, is online at www.mekong.net/cambodia/deaths.htm.