CALEA Update Magazine | Issue 84



Measuring the Performance of Law Enforcement Agencies - Part 2 of a 2-part article appearing in the CALEA Update (Part 2 published February 2004; Part 1 published September 2003)


This is the second segment of a two-part article on measuring the performance of law enforcement agencies.  It is written for a policing audience, and draws in part on my discussions with members of CALEA’s Performance Measurement Subcommittee and those who have attended my training workshops at the last three CALEA Conferences. In the first segment, I introduced the general concepts, terminology, and history of comparative performance measurement in policing. In this second segment, I show you how to develop, pilot-test, and implement comparative performance measurement in your agency. This article is one small part of a larger effort by CALEA to explore the feasibility and utility of agency-level performance measurement in policing. That journey is just beginning, and will proceed slowly, but it will be a worthwhile one.


Once the goals of policing have been identified and a salient list of dimensions (and perhaps sub-dimensions) has been prepared, it is time to begin formulating specific performance measurements. Many times, people start off in the middle of the process by generating a list of performance measures without having first done the necessary and far more difficult work of thinking about the broad dimensions of police performance. The process I am recommending is a rough analogue to the deductive model of science in which we start by identifying theories and concepts and then collect data on specific measures that reflect those broader theories and concepts. The search for specific performance measures should be a liberating, unconstrained process in which participants are encouraged to think well outside of the traditional boundaries. In this section, I provide a number of suggestions about potential data sources and research methods for generating performance measures.

Traditional performance measures in policing are often derived from administrative data maintained by the police department. While these data can often be very useful and should be included, official police data should not be the only source used in a comprehensive performance measurement system. Below I present some other options.

1. General Community Surveys

Nationwide, nearly 1/3 of local police departments conduct community surveys each year. The Bureau of Justice Statistics now makes available to police departments a free software package and guide for conducting community surveys.[i]  These kinds of surveys are useful for several purposes: learning about crime, fear of crime, victimization experiences, and overall impressions about the police.[ii]  They are sometimes used as a crutch, however. Research on customer and client satisfaction across industry types has shown that satisfaction levels reported in response to very general survey questions are routinely high and do not tend to differ greatly across organizations.[iii]  Other research, however, shows that the specificity and wording of the survey question can have a profound impact on satisfaction levels.[iv]  Therefore, police organizations can get out of a community survey what they put into it. If they want a public relations gimmick, they can ask one or two very general questions about citizen satisfaction with police. If they are interested in using the survey as a platform for organizational learning, they can ask a number of more specific questions about the quality of policing in the community. Another problem with general community surveys is that many of the respondents have not had any contact with the police. Therefore, it is difficult to know if their impressions of the police were formed through the media, through vicarious contact with friends or relatives, or through previous experiences with other police organizations.

2. Contact Surveys

Contact surveys are administered to those who have had recent contact with the police. These kinds of surveys can be very revealing, particularly when they are focused on different kinds of contacts. Surveys of victims, for instance, can be very useful for learning whether the department is responding appropriately to their needs. When police in Toronto surveyed rape victims, they received numerous complaints about the uniformed officers who responded to the initial call, but nearly universal praise for the department’s sex crimes unit.[v]  Surveys of drivers stopped and/or searched by the police can be used to learn about citizen perceptions of police practices. Even arrestee surveys can be very useful. Although a common perception is that such surveys would not be useful because all arrestees will be dissatisfied with the outcome, research has shown that citizens are willing to accept negative outcomes if they view the process that led to the outcome as fair.[vi]  Arrestee surveys administered in multiple cities could be very useful for learning whether a department is perceived as more or less fair than others.[vii]  Contact surveys could also be administered over time within a single department to learn whether certain training programs or supervisory approaches are improving citizen perceptions of police.

3. Employee Surveys

Employee surveys are valuable for many reasons. They can be used to gauge the perceptions of employees about certain administrative initiatives. They can be used to assess morale issues. Employee surveys have also been used in some novel and helpful ways in recent years. For instance, researchers from the Federal Bureau of Prisons aggregated (combined) data from individual employee surveys to form composite measures of the organizational social climate in the Bureau’s various prison facilities.[viii]  A similar approach was recently applied to the measurement of police integrity. Researchers aggregated the responses of more than 3,000 individual police officers to form an aggregate measure of the “environment of integrity” in 30 police agencies.[ix]  The results showed that police agencies vary widely with regard to their overall environments of integrity. This information was presumably very useful to police executives in those agencies, particularly those who ranked at the bottom of the list.

4. Direct Observation

Direct observation by trained raters or coders can also be a very useful method for collecting valuable performance information. For example, the Early Childhood Environment Rating Scale uses trained observers to rate the quality of child care facilities based on direct observation of the space and furnishings, the interaction between children and teachers, and several other dimensions of performance.[x]  Observers in New York City use vehicles equipped with specially designed measuring instruments to rate the “smoothness” of 670 miles of streets in 59 districts.[xi]  In policing and criminology, there is some precedent for using direct observation to generate data useful for performance measurement. For instance, coders can use “systematic social observation” techniques to record the volume of physical and social disorder in neighborhoods.[xii]  This is a useful technique for generating data, independent of police, on quality of life issues in the community. Or, using techniques developed by Mastrofski and his colleagues, trained coders can conduct standardized observations of police-citizen encounters.[xiii]   While direct observation can be a very useful technique for gathering data on performance, it is personnel intensive, and therefore very expensive.

5. Independent Testing or Simulation Studies

A fifth alternative source for collecting data on police performance is independent testing or simulation studies. Rather than observing performance in completely “natural” settings, independent tests create artificial opportunities to measure performance. For instance, the Insurance Institute for Highway Safety uses crash tests to rank vehicle safety. The Institute’s primary mission is research, and insurance companies rather than auto manufacturers fund it. Many firms hire people to pose as customers (known as “secret shoppers” or “mystery shoppers”) who visit their facilities to perform checks on quality of service, cashier accuracy, ethical standards, and many other issues. Internal affairs units in large police agencies have conducted various kinds of “integrity tests” for many years. ABC News conducted independent integrity tests of police in New York and Los Angeles by turning over 40 wallets or purses to police officers chosen at random. In every case, the officers turned in the wallets and purses with contents intact.[xiv]  The Police Complaint Center (PCC) is a Florida-based firm that conducts proactive investigations of police misconduct. The PCC videotapes its investigators in a variety of settings: being stopped by officers, trying to secure complaint forms from police agencies, and other situations. PCC investigators have videotaped numerous instances of police misconduct. While certainly controversial, testing and simulation offer some interesting promise for collecting performance data that are truly independent of the police.

Computing Aggregate Measures

Once the performance measures have been selected, and the data have been collected, the next question is what kind of analysis to perform. The first step, depending on the data being collected, is to aggregate the data to compute an overall organizational score for each individual performance measure. If the measure is a count variable (such as the number of arrests), it can be summed, or an average or ratio can be computed. If the measure is categorical (such as a survey question with five response categories ranging from strongly disagree to strongly agree), the proportion of people choosing each response can be computed. Since a comparative performance measure is intended to measure some aspect of the organization, each measure needs to be aggregated so it represents an organization-wide score.
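As a rough illustration of this first step, the sketch below aggregates one count measure and one categorical survey item into agency-wide figures. All of the data, the population figure, and the function names (`aggregate_counts`, `response_proportions`) are hypothetical and are used only to show the arithmetic.

```python
from collections import Counter

# Hypothetical records for one agency; all values are made up for illustration.
monthly_arrests = [112, 98, 130, 105]   # a count measure, one value per month
population = 50_000                      # assumed jurisdiction population
survey_responses = [                     # one five-category survey item
    "agree", "strongly agree", "neutral", "agree", "disagree",
    "agree", "strongly agree", "agree", "neutral", "agree",
]

def aggregate_counts(counts, population):
    """Sum a count measure and express it as an average and a per-1,000 ratio."""
    total = sum(counts)
    return {
        "total": total,
        "average_per_period": total / len(counts),
        "per_1000_residents": 1000 * total / population,
    }

def response_proportions(responses):
    """Return the share of respondents choosing each response category."""
    tallies = Counter(responses)
    return {category: n / len(responses) for category, n in tallies.items()}

arrest_summary = aggregate_counts(monthly_arrests, population)
satisfaction = response_proportions(survey_responses)
```

Expressing counts as a ratio (here, per 1,000 residents) is one common way to make a figure comparable across jurisdictions of different sizes; other denominators may suit a given measure better.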

Standardized Composites or Individual Measures?

When a student takes the SAT, the GRE, or other similar standardized tests, the overall scores represent “composites” of the individual test questions. These composite scores are standardized so that they fall within a certain well-known range, such as 200-800 for the SAT. Nobody is very interested in his or her performance on individual test questions, only the overall score within each dimension (such as math and verbal). Anytime we create new performance measures, we have a series of analytical choices about how we want to use the data. For instance, suppose we generate a list of 7 general dimensions of police performance. Within each one, we collect data on 7 specific performance measures. We will end up with 49 (7 x 7) separate pieces of information from each organization. One possibility is to treat each of the 49 specific items as a performance measure. In some ways, this is analogous to inspecting a student’s performance on each individual SAT question. There are commonly used statistical methods, however, that can be used to reduce these 49 separate items into 7 composite scores representing each overall dimension. Furthermore, these composites can also be standardized (so that, for instance, an agency would receive a score falling between 0 and 100 on each dimension). This approach is common in psychology when making standardized instruments to measure a variety of individual traits. Our efforts to create performance measures nationally will focus on creating composite scores. Local police departments implementing performance measures on their own may not have access to sufficient statistical expertise to form composite measures.
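For agencies that want a feel for what a standardized composite involves, here is a minimal sketch of one simple approach. It assumes equal weighting of items, standardizes each item across agencies, averages the items within a dimension, and rescales the result to 0-100. This is not the statistical method a national system would necessarily use (such systems often rely on techniques like factor analysis); the data and names are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical item scores: agency -> dimension -> list of item values.
# Two dimensions with three items each stand in for the 7 x 7 layout.
scores = {
    "Agency A": {"fairness": [3.9, 4.1, 3.5], "crime_control": [62, 70, 55]},
    "Agency B": {"fairness": [3.1, 3.4, 3.0], "crime_control": [75, 81, 68]},
    "Agency C": {"fairness": [4.4, 4.0, 4.2], "crime_control": [58, 60, 49]},
}

def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

def composite_scores(scores, dimension):
    """Average the standardized items in one dimension, then rescale to 0-100."""
    agencies = list(scores)
    n_items = len(scores[agencies[0]][dimension])
    # Standardize each item across agencies so items on different scales are comparable.
    standardized = [
        z_scores([scores[a][dimension][i] for a in agencies]) for i in range(n_items)
    ]
    raw = [mean(item[k] for item in standardized) for k in range(len(agencies))]
    low, high = min(raw), max(raw)
    # Min-max rescaling: the lowest agency gets 0 and the highest gets 100.
    return {
        a: 100 * (r - low) / (high - low) if high > low else 50.0
        for a, r in zip(agencies, raw)
    }

fairness = composite_scores(scores, "fairness")
```

Note that min-max rescaling makes the scores relative to the agencies in the comparison: adding or removing an agency can shift everyone's score, which is one reason a national effort would need more careful calibration.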


One of the complaints about some performance measurement systems is that they treat each measure equally. This is acceptable as long as the different domains of performance are equivalent, but if some are much more important than others, it is misleading. Sometimes it is useful to assign greater weight to certain measures when computing composite performance scores. There are a variety of methods for doing this; they require technical expertise, but they can easily be done. The more pressing question is how to assign the weights in a manner that is not totally arbitrary. For instance, former NYPD Commissioner Bratton explained that his strategies for reducing crime in New York came with some consequences:

“We defined brutality as unnecessary behavior that caused broken bones, stitches, and internal injuries. But those were not the figures that had gone up significantly. What had risen were reports of police inappropriately pushing, shoving, sometimes only touching citizens. We were taking back the streets . . . we were being more proactive, we were engaging more people, and often they didn’t like it.”[xv]

Implicit in this explanation is the argument that crime control is a more important function of policing than citizen satisfaction or appropriate use of minor levels of force. Policing is certainly not the only industry in which these kinds of questions arise. As Gormley and Weimer note:

“A physician with a good bedside manner is not enough when a patient’s life is at stake. A teacher with a winning smile is not enough if challenging subjects are being taught.”[xvi]

Some goals may be more important than others. An important decision for those designing comparative performance measures is how to quantify differences in importance between multiple goals. If the differences are minor, they may be worth overlooking. If there are major differences in importance, such as the friendliness of the hospital staff versus its mortality rate, then it will be useful to either consider each performance measure individually, or to use a weighting procedure before forming composite scores.

How can the weights be formed? One method might be to use an expert group and ask them to compile a ranking system. Mark Moore and his colleagues at Harvard’s Kennedy School have already used a similar approach for ranking the most important innovations in policing.[xvii]  Focus groups or surveys of citizens could also be used to determine which goals are the most important to them. Once again, a national system of performance measurement should take pains to compute weights for each dimension of performance. Local law enforcement agencies might not have access to the statistical expertise necessary to form an elaborate weighting system, but they should still go through the process of thinking about which dimensions of performance are most important.   
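To make the idea concrete, the following sketch converts rankings from a small expert group into weights and applies them to an agency's dimension scores. The "rank-sum" conversion used here is an assumption chosen for simplicity, not a method prescribed in this article, and the experts, dimensions, and scores are all hypothetical.

```python
from statistics import mean

# Hypothetical rankings from three experts (1 = most important dimension).
expert_rankings = [
    {"crime_control": 1, "fairness": 2, "satisfaction": 3},
    {"crime_control": 2, "fairness": 1, "satisfaction": 3},
    {"crime_control": 1, "fairness": 3, "satisfaction": 2},
]

def rank_sum_weights(rankings):
    """Turn average ranks into weights that sum to 1 (better rank, larger weight)."""
    dimensions = list(rankings[0])
    n = len(dimensions)
    avg_rank = {d: mean(r[d] for r in rankings) for d in dimensions}
    points = {d: n + 1 - avg_rank[d] for d in dimensions}
    total = sum(points.values())
    return {d: p / total for d, p in points.items()}

def weighted_composite(dimension_scores, weights):
    """Combine 0-100 dimension scores into one weighted overall score."""
    return sum(weights[d] * dimension_scores[d] for d in weights)

weights = rank_sum_weights(expert_rankings)
agency_scores = {"crime_control": 80.0, "fairness": 50.0, "satisfaction": 65.0}
overall = weighted_composite(agency_scores, weights)
```

Even this simple scheme forces the key conversation: the weights are only as defensible as the process used to rank the dimensions, whether that is an expert panel, focus groups, or a citizen survey.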

After the performance measures have been specified, the data have been collected, and the analysis has proceeded through the possible stages of aggregation, formation of composites, and weighting, it is time to use the data to make comparisons. In the next section, I examine two methods for ensuring that comparisons are as fair as possible.


In 1923, Clarence Smith raised a number of concerns about using statistics to compare police departments. His argument, quite simply, was that police in different communities face very different circumstances that need to be taken into account when comparing agencies. These differences range from demographic and economic features to topography and culture, including: race, population density, the nature of industrial development, the condition and distribution of the streets and highways, the volume of tourist traffic, and “the habits, traditions, and natural law-abiding inclination and disposition of the people of the city”.[xviii]   Smith’s concern with comparative statistics is apropos today as we consider how to develop the systematic capacity for comparative performance measurement in policing.

This concern with making fair comparisons is not unique to the police. It affects all kinds of organizations. For instance, students graduating from Harvard University are likely to have GRE scores that are higher than those of students attending state universities. Does that mean Harvard performs at a higher level than state universities?  Not necessarily. The typical student admitted to Harvard presumably entered with much greater aptitude and higher SAT scores than the typical state university student. The important question is not whether one organization has better inputs than the other, but which one adds more value. The key point here is that organizations often have very different inputs, and this variation in inputs should be reflected in performance measures. This notion of “value-added” applies to schools, hospitals, police departments, and many other kinds of organizations.

Like other kinds of organizations, police departments face drastically different workloads, challenges, and environments. One department might work in a poor, ethnically heterogeneous community with high rates of crime and disorder. Another might work in a wealthy, ethnically homogenous, sleepy suburb in which a patrol officer’s greatest challenge is to write traffic citations and make an occasional arrest. The key to comparing these two organizations, despite their differences, is known as risk adjustment. Hospitals admitting the most at-risk patients might be expected to have the highest death rates. Prisons admitting the worst offenders might be expected to have the highest recidivism rates. However, we can control for the inputs of an organization when measuring performance. There are two primary methods: stratification (forming peer groups), and calculating “risk-adjusted” performance measures.

Stratification, or forming peer groups containing similar agencies, is one useful way to account for differing inputs. Groups of agencies that are approximately similar in size, type, jurisdiction, and workload will become peers. Each agency within the peer group can compare its performance measurements with those of the other peer agencies. Forming peer groups is much easier than doing risk adjustment, but it too will be tricky. Some cities are simply unique. Others may belong in certain classes of cities that are difficult to identify in advance. For instance, some “edge cities” have a small population but due to their proximity to large urban areas, they may face issues that make them unique compared with other similarly sized communities.[xix]  Nonetheless, the difficulties inherent in peer groups are much less formidable than the difficulties with risk adjustment.
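A minimal sketch of stratification might look like the following. The agencies, the clearance-rate measure, and especially the size-band cutoffs are hypothetical; real peer groups would also need to consider jurisdiction and workload, and the cutoffs would have to be chosen with care.

```python
from collections import defaultdict

# Hypothetical agencies; the size bands below are illustrative, not a standard.
agencies = [
    {"name": "Agency A", "type": "municipal", "sworn": 45, "clearance_rate": 0.31},
    {"name": "Agency B", "type": "municipal", "sworn": 60, "clearance_rate": 0.27},
    {"name": "Agency C", "type": "municipal", "sworn": 900, "clearance_rate": 0.22},
    {"name": "Agency D", "type": "sheriff", "sworn": 70, "clearance_rate": 0.35},
    {"name": "Agency E", "type": "municipal", "sworn": 1200, "clearance_rate": 0.25},
]

def size_band(sworn):
    """Assign an agency to an assumed size band by number of sworn officers."""
    if sworn < 100:
        return "small"
    if sworn < 500:
        return "medium"
    return "large"

def form_peer_groups(agencies):
    """Group agencies that share the same type and size band."""
    groups = defaultdict(list)
    for agency in agencies:
        groups[(agency["type"], size_band(agency["sworn"]))].append(agency)
    return dict(groups)

def rank_within_peers(group, measure):
    """Order the agencies in one peer group from highest to lowest on a measure."""
    return [a["name"] for a in sorted(group, key=lambda a: a[measure], reverse=True)]

peer_groups = form_peer_groups(agencies)
small_municipal = rank_within_peers(peer_groups[("municipal", "small")], "clearance_rate")
```

In this toy example Agency D ends up in a peer group of one, which mirrors the point above: some agencies are simply unique, and a fixed grouping rule will not always find them a fair comparison.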

Criminologist Lawrence Sherman acknowledges that cities vary widely with regard to the social and economic correlates of crime. He proposes using statistical methods to purge homicide rate measures of the influence of these other factors. The resulting measure will be a “risk-adjusted homicide rate” that is very similar to the risk-adjusted mortality rates used by hospitals. For instance, one could use relatively straightforward statistical techniques (such as regression analysis) to purge homicide rates of the influence of poverty, unemployment, race, divorce, and population density. Once such factors are controlled statistically, the resulting measure can more easily be compared across cities, even those that are very different from one another. One research team has already created a prototype ranking system based on risk-adjusted homicide rates for 21 cities.[xx]  Sherman suggests that such a measure can be used to rank the performance of police agencies in dealing with crime.[xxi]  This process will require technical expertise and a substantial investment in testing and calibration to ensure that the risk-adjustment procedures are scientifically defensible. Furthermore, since risk-adjusted crime rates are based on an implicit assumption that demographic and structural characteristics (such as poverty, race, and region) influence crime, the risk-adjustment procedures might inspire controversy. Although the research on risk adjustment in policing extends back to at least 1971,[xxii] much more work remains to be done before police agencies can rely on its scientific foundation. For now, comparative performance measurement initiatives will need to rely on stratification or the formation of peer groups.
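The regression idea can be illustrated with a deliberately simplified sketch. It fits a straight line relating homicide rates to a single risk factor (poverty) and treats each city's residual, the observed rate minus the predicted rate, as its risk-adjusted figure. A defensible procedure would use several factors and far more cities; the single predictor, the five made-up cities, and the function names are assumptions for illustration only.

```python
from statistics import mean

# Hypothetical city data: (poverty rate in percent, homicides per 100,000 residents).
cities = {
    "City A": (10.0, 4.0),
    "City B": (15.0, 9.0),
    "City C": (20.0, 8.0),
    "City D": (25.0, 15.0),
    "City E": (30.0, 14.0),
}

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return y_bar - slope * x_bar, slope

def risk_adjusted(cities):
    """Return observed minus predicted homicide rate for each city."""
    xs = [poverty for poverty, _ in cities.values()]
    ys = [rate for _, rate in cities.values()]
    intercept, slope = fit_line(xs, ys)
    return {
        name: rate - (intercept + slope * poverty)
        for name, (poverty, rate) in cities.items()
    }

adjusted = risk_adjusted(cities)
# Negative values mean fewer homicides than the city's poverty rate would predict.
ranking = sorted(adjusted, key=adjusted.get)
```

In this made-up data, City E has one of the highest raw homicide rates yet ranks second once poverty is taken into account, which is exactly the kind of reversal that risk adjustment is meant to reveal.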


Much of this background information might leave you wondering how to establish comparative performance measures in your agency. This section walks you through the steps, providing brief pointers to help keep you on track.

1. Make a commitment to COMPARATIVE performance measures.

This involves: (a) comparing your agency’s performance over multiple time periods, or (b) comparing your agency to other agencies. Other options include comparing district or beat-level performance within a large agency.

Conducting one data collection exercise (such as a citizen survey) in one jurisdiction does not provide you with a “comparative” performance measure because it does not offer the opportunity for comparison.

2. Select the units that you intend to compare.

· Will you compare time periods (months, years), beats, districts, or different agencies?

· Use caution in selecting “peer” agencies. Make sure they are comparable.

3. Select the dimensions of performance that are valuable for your agency.

· This will feel like a philosophical or theoretical exercise.

· The search for specific performance measures should be a liberating, unconstrained process in which participants are encouraged to think well outside of the traditional boundaries.

· Do not focus yet on whether you can measure these items. That comes later.

· What does your community want from its police?

· Determine the relative importance of your dimensions: Are some more important than others?

4. Figure out how to measure those dimensions of performance.

· Think broadly about potential data sources. Some will be contained in agency data and some will need to be collected using surveys or other methods.

· Some alternative methods include: general community surveys, contact surveys, employee surveys, direct observation, and independent testing or simulation studies.

· You may not be able to measure all of the important dimensions you’ve identified.

· Do not reverse steps 3 and 4. Decide what matters (step 3) before asking what can be measured (step 4), so that the available data do not end up defining your goals.

5. Use the measures to improve your organization.

· All organizations are capable of self-learning, adaptation, adjustment, experimentation, and innovation. To do so, organizations need information and feedback.

· Comparative performance measures will provide police organizations with crucial information: how they are doing relative to other police agencies on a variety of performance dimensions, and how they are improving relative to their own previous levels of performance.

· Use them and act on them. Don’t just use them as a public relations gimmick for a news article or an annual report.

· Treat the process as an integral step in organizational learning. Take your organization’s temperature. Take its blood pressure. Then use those measurements to form a diagnosis and implement organizational change.

6. Repeat the process routinely.
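The two kinds of comparison named in step 1 can be sketched in a few lines. The example below assumes an agency already has a 0-100 score on one dimension for several years, plus the same year's scores for a set of peer agencies; the figures and function names are hypothetical.

```python
from statistics import median

# Hypothetical yearly scores (0-100) on one performance dimension.
own_history = {2001: 58.0, 2002: 61.0, 2003: 66.0}
peer_scores_2003 = {"Peer 1": 70.0, "Peer 2": 62.0, "Peer 3": 64.0}

def change_over_time(history):
    """Return the change in score from each period to the next."""
    years = sorted(history)
    return {
        later: history[later] - history[earlier]
        for earlier, later in zip(years, years[1:])
    }

def gap_from_peers(own_score, peer_scores):
    """Return the agency's score minus the median score of its peer group."""
    return own_score - median(peer_scores.values())

trend = change_over_time(own_history)
gap = gap_from_peers(own_history[2003], peer_scores_2003)
```

Repeating the same calculation each period, as step 6 recommends, is what turns a one-time snapshot into a comparative measure.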


There are many ways to change organizations, from improved recruitment, hiring and training, to the selection of a new leader.  This article presents just one potential method for improving police organizations: comparative performance measurement. All organizations are capable of self-learning, adaptation, adjustment, experimentation, and innovation. To undergo these processes, however, organizations need information and feedback. This article presents a systematic framework for improving policing by creating comparative performance measures. Such measures will provide police organizations with crucial information: how they are doing relative to other police agencies on a variety of performance dimensions, or how they are improving relative to their own previous levels of performance. Performance measures are an essential component of an ongoing “organizational learning” strategy.

I applaud CALEA for beginning to think about how to measure the performance of law enforcement agencies. I also applaud those of you who attempt to implement such measures in your own agency. I look forward to learning about your experiences.


ABC News (2001). Testing police honesty: A PrimeTime investigation with lost wallets. PrimeTime Thursday, May 17th [television show].

Bratton, W., with Knobler, P. (1998). Turnaround: How America’s top cop reversed the crime epidemic. New York: Random House.

Camp, S. D., Saylor, W. G., & Harer, M. D. (1997). Aggregating individual-level evaluations of the organizational social climate: A multilevel investigation of the work environment at the Federal Bureau of Prisons. Justice Quarterly, 14(4), 739-761.

Fund for the City of New York (2001). How Smooth are New York City’s Streets?  Center on Municipal Government Performance.

Gallagher, C., Maguire, E.R., Mastrofski, S.D., & Reisig, M.D. (2001). The public image of the police: Final report to the International Association of Chiefs of Police. Manassas, VA: George Mason University, Administration of Justice Program.

Garreau, J. (1991). Edge City: Life on the New Frontier. New York: Doubleday.

Gormley, W. T., Jr., and D. L. Weimer (1999). Organizational report cards. Cambridge, MA: Harvard University Press.

Harms, T., R.M. Clifford, and D. Cryer (1998). Early childhood environment rating scale. New York: Teachers College Press.

Hickman, Matthew J. and Brian A. Reaves. (2001). Community Policing in Local Police Departments, 1997 and 1999. Washington, DC: Bureau of Justice Statistics.

Hoffman, R. B. (1971). Performance measurements in crime control. Journal of Research in Crime & Delinquency, 8(2), 165-174.

Klockars, C. B., S. K. Ivkovich, W. E. Harver, and M. R. Haberfeld (2000). The measurement of police integrity. National Institute of Justice: Research in Brief. NCJ-181465. Washington, D.C.: U.S. Department of Justice, Office of Justice Programs, National Institute of Justice.

Mastrofski, S.D., Parks, R.B., Reiss, A.J. Jr., Worden, R.E., DeJong, C., Snipes, J.B., & Terrill, W. (1998). Systematic Observation of Public Police: Applying Field Research Methods to Policy Issues. Washington, DC: National Institute of Justice (December).

Moore, M.H., Spelman, W., & Young, R. (1992). Innovations in Policing: A Test of Three Different Methodologies for Identifying Important Innovations in a Substantive Field. Unpublished Paper. Cambridge, MA: Harvard University, Kennedy School of Government.

Rape victims rate police performance (1998). Toronto Star, July 23: C3.

Sampson, R.J., and Raudenbush, S. (1999). Systematic social observation of public spaces: A new look at disorder in urban neighborhoods. American Journal of Sociology, 105, 603-651.

Sherman, L.W. (1998). Evidence-based policing. Ideas in American Policing, Washington, D.C.: U.S., Department of Justice, Police Foundation (July).

Smith, C.B., Jr. (1923). The adequacy of police forces. Journal of the American Institute of Criminal Law and Criminology, 13, 266-271.

Tyler, T.R. (1990). Why People Obey the Law. New Haven, CT: Yale University Press.

Weisel, Deborah L. (1999). Conducting Community Surveys: A Practical Guide for Law Enforcement Agencies. Washington, DC: Office of Community Oriented Policing Services and Bureau of Justice Statistics.


[i].              Weisel, 1999.

[ii].             Hickman and Reaves, 2001.

[iii].            Gormley and Weimer, 1999.

[iv].            Gallagher, Maguire, Mastrofski, and Reisig, 2001.

[v].             Rape victims rate police performance, 1998.

[vi].            Tyler, 1990.

[vii].         Arrestee surveys are common in the study of drug use, but rare in the study of police performance. In the United States, the Arrestee Drug Abuse Monitoring (ADAM) project collects urine samples from arrestees in multiple cities. Local ADAM sites sometimes conduct arrestee surveys that focus on issues other than drug use, but these are isolated efforts. The ADAM program represents an ideal framework on which to build a national data collection effort on police performance as viewed through the eyes of the arrestee population. Similar programs exist in Australia, South Africa, and other nations, suggesting the possibility of collecting such measures internationally.

[viii].       Camp, Saylor, and Harer, 1997.

[ix].          Klockars, Ivkovich, Harver, and Haberfeld, 2000.

[x].           Harms, Clifford, and Cryer, 1998.

[xi].          Fund for the City of New York, 2001.

[xii].         Sampson and Raudenbush, 1999.

[xiii].        Mastrofski, Parks, Reiss, Worden, DeJong, Snipes, and Terrill, 1998.

[xiv].        ABC News, 2001.

[xv].         Bratton with Knobler, 1998, p. 291.

[xvi].        Gormley and Weimer, 1999, p. 205.

[xvii].       Moore, Spelman, and Young, 1992.

[xviii].      Smith, 1923,  p. 267.

[xix].         Garreau, 1991.


[xxi].           Sherman, 1998.

[xxii].      Hoffman, 1971.

Copyright Commission on Accreditation for Law Enforcement Agencies, Inc. 2010-All Rights Reserved.

Edward R. Maguire, Ph.D.,
Associate Professor, Administration of Justice Program
George Mason University
Fairfax, Virginia