for Evaluating Public R&D Investment
CHAPTER 1: Evaluation Fundamentals
Objectives of Evaluation
Evaluation is the systematic investigation of the value or merit of a thing or an activity. Evaluation has a long history, reportedly dating back 4,000 years to China, where it was used to assess public programs. While evaluation often is viewed as an adversarial process, it can also be viewed as a tool that not only measures, but also contributes to success.
Program evaluation looks at the impacts of a collection of projects. Project evaluation focuses on individual projects. Project evaluation can start sooner than program evaluation, which requires time for multiple projects to make progress toward program goals.
Evaluation can tell an organization what is working and what is not. It can reveal what outputs are being produced, and with what efficiency. It can indicate when performance is improving and when performance is declining. It can answer a host of questions needed to determine if an activity or suite of activities is on track to produce desired outcomes. Evaluation can also address “why” questions. In sum, evaluation can be as much a method of learning as it is a method of documentation, and certainly far more than a mandated obligation.
Table 1–1 summarizes major reasons for evaluating programs and activities and gives examples of the kinds of information that may be developed in pursuit of each purpose.
*GPRA is the abbreviation for the Government Performance and Results Act of 1993, which is further discussed below.
For Internal Program Management
Managers who are not continually assessing their operations lack the information to manage strategically. Systematic monitoring, analysis, and feedback of information enable managers to identify and implement needed changes.
To Answer Stakeholder Questions
All organizations are accountable to someone, and the persons or organizations to which they are accountable are called stakeholders. Stakeholders generally offer their support to a program on the condition that their targeted goals and objectives are met, and stakeholders usually require evidence that they are getting the desired return on their investment. Stakeholders in private organizations are stockholders and other owners and investors, with employees, managers, and boards of directors also in stakeholder roles. While stakeholders in public agencies are ultimately the citizens, the legislative and executive branches are the de facto stakeholders. Other individuals and groups, including agency employees, program administrators, oversight committees, program participants, and the communities from which these participants are drawn, also have stakeholder roles. Evaluation can be used as a tool for stakeholders to learn if their objectives are being met. 17
To Meet Official Requirements
Steps are sometimes taken on stakeholders’ behalf to institute specific evaluation and reporting requirements. In 1993, for example, the U.S. Congress passed the Government Performance and Results Act of 1993 (GPRA), requiring, among other things, all federal agencies to develop strategic plans that identify agency objectives, to relate budgetary requests to specific outcome goals, to measure performance, and to report on the degree to which goals are met. These requirements compel administrators to move beyond a focus on activities to a focus on results. The GPRA is aimed at improving efficiency, effectiveness, and accountability in federal management and budgeting. 18
In a study of the feasibility of implementing the GPRA for basic and applied research, the Committee on Science, Engineering, and Public Policy of the National Academy of Sciences took the following position with respect to applied research: 19
GPRA now drives the evaluation efforts of many federal agencies. Some state legislatures have also put into place requirements for evaluation of state programs. 20 As a result, more public programs are subjected to evaluation, and more data on their operations and results are becoming available. In 2002, the Bush Administration added requirements related to program evaluation aimed at improving performance management practices of federal agencies. 21 The President’s Management Agenda includes new investment criteria and a program assessment rating tool, known as PART, for federal R&D programs.
To Understand Specific Phenomena
A program frequently encompasses a complex array of theories, issues, and perspectives. Evaluation of the program thus may present an opportunity to investigate a wide range of phenomena of interest. For example, ATP’s legislative charge to foster collaborative activities provides a singular opportunity to explore various aspects of collaboration. Evaluation thus helps ATP, as well as others who use collaboration to accomplish goals, learn more about the strengths and weaknesses of collaborations.
To Promote Interest in and Support of a Program or Activity
Evaluations inform and educate a large audience. Most audiences wish to hear not just about goals and objectives but also about accomplishments. Administrators of both public and private programs who must make and defend budget requests and solicit support are increasingly expected to present well-documented evidence that past budgets are producing desired results and that there are reasons to expect that future budgets will also pay off. Public policy makers look for evidence that programs are, or are not, working. 22 Over the long run, public opinion and underlying support can be influenced by evaluation-based evidence that specific programs work or do not work. In contested political environments, pro and con evaluations may compete for public attention. The quality and credibility of the evaluations thus can become part of the larger competition for public acceptance and support. Program administrators need to be able to explain their core competencies and their unique contributions and to document their claims of impact with the best possible evidence.
Mapping Evaluation to Mission and Stages of Implementation: A Generic Evaluation Logic Model
A recommended starting point in planning evaluation is to develop a logic model. 23 A logic model is intended to provide a clear diagram of the basic elements of a program, subprogram, or project, revealing what it is to do, how it is to do it, and with what intended consequences. It shows the logical linkages among mission, activities, resources (inputs), outputs, outcomes, and impacts. It is a first step in identifying critical measures of performance. The logic-model tool has been used in program evaluation for more than 20 years and has been adapted to program planning, where it helps ensure a correspondence among all the elements of a program. 24
Evaluation works best when it is closely mapped to a program’s mission and to each stage of program implementation through a logic model. Figure 1–1 shows a generic evaluation model that reveals the integral relationships of a hypothetical public program and its evaluation program. This evaluation logic model is tailored to assess effects at each stage and to provide feedback from evaluation to program administrators and policymakers.
Starting at the top left of Figure 1–1 and working down and across, a societal goal, such as economic prosperity or improved health, provides the impetus for establishing a public program. The program is then presumably designed to carry out the intended mission, and it must obtain resources to carry out its mission. These resources, or “inputs,” expressed in monetary terms, convey the program’s costs to the public. A program’s structure and operational mechanisms for carrying out its mission determine how the inputs are used and what they produce in terms of program “outputs.” For example, a program driven by the societal goal of improved health might fund research in infectious diseases. Program appropriations allow it to purchase labor and materials needed for operations. Short-run program outputs might include publications, presentations, workshops, and test results. Next-stage outputs might include prototype therapies and prototype vaccines, followed by clinical trials. Longer-term program outcomes might include treatments and vaccines applied by medical establishments. Long-run impacts might include reduced rates of disease spread, higher survival rates of those infected, and reduced mortality rates for the nation. All of the impacts of a program should be assessed against the program’s mission, intended results, and costs. A final, important step is to feed the evaluative findings back to inform program administrators and policymakers and to improve the program’s structure and operations.
Note: The dynamics by which the transformations among these various stages occur are often complex and are themselves the subject of evaluation.
Evaluation also looks at the process dynamics whereby inputs are converted to outputs, which in turn may lead to outcomes, which in turn may translate into impacts. Evaluations made at various stages can determine how a program is progressing toward its mission. To continue the above health example, evaluative techniques may be used to assess the transfer of the program’s research results to pharmaceutical companies. Evaluators might explore how the rate of transfer can be accelerated. They might ask what percentage of medical establishments is using program-derived improved treatments and vaccines at a given time. They might investigate how the program affects production costs, what the likely outcome would have been without the program, and what return society is likely to receive from investing in the research program. From these investigations, insights may be gained for modifying the program in ways to bring its dynamics in closer accord with its mission.
Evaluation can provide answers to these and other questions that arise during a program’s implementation and operations. Figure 1–1 suggests the complexity and scope of a full-fledged, fully integrated evaluation program.
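The chain from inputs to impacts described above can be written down as a simple data structure, which makes it easy to see which stages an evaluation plan does not yet cover. The following is only a minimal sketch: the class name `LogicModel`, its fields, and the method `unmeasured_stages` are hypothetical names chosen for illustration, and the entries are taken loosely from the health example in the text.

```python
from dataclasses import dataclass, field


@dataclass
class LogicModel:
    """Hypothetical container for the stages of a program logic model."""
    societal_goal: str
    mission: str
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)
    outcomes: list[str] = field(default_factory=list)
    impacts: list[str] = field(default_factory=list)

    def stages(self):
        """Return (name, entries) pairs in the order the chain is traced."""
        return [
            ("inputs", self.inputs),
            ("outputs", self.outputs),
            ("outcomes", self.outcomes),
            ("impacts", self.impacts),
        ]

    def unmeasured_stages(self):
        """Names of stages with no entries, i.e., gaps in the plan."""
        return [name for name, entries in self.stages() if not entries]


# Entries adapted from the health example; the impacts list is left
# empty on purpose to show how a gap would surface.
health_program = LogicModel(
    societal_goal="improved health",
    mission="fund research in infectious diseases",
    inputs=["program appropriations", "labor", "materials"],
    outputs=["publications", "workshops", "prototype vaccines"],
    outcomes=["treatments and vaccines applied by medical establishments"],
    impacts=[],
)

print(health_program.unmeasured_stages())
```

A structure of this kind is a planning aid, not a substitute for the diagram: it records what is to be measured at each stage, while the dynamics that link the stages still have to be studied in their own right.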
Regardless of their degree of comprehensiveness, most evaluation studies can follow a systematic procedure. Table 1–2 summarizes the steps that are normally involved in organizing and carrying out an evaluation study. In practice, a study advances in an iterative manner, as previous steps tend to be revisited as a study proceeds.
Study Scope and Rigor
Evaluation appropriately comes in many forms. It can be comprehensive, encompassing the design, conceptualization, implementation, and impacts of a program. It can be tailored to selected features of a program. It can be focused on measurement of specific outputs or outcomes, or constructed broadly to permit tests of competing causal linkages. It can be “rigorous” in the sense of searching for the most comprehensive and systematic set of causal linkages between and among variables, employing carefully constructed and sifted data. Or it can be “good enough,” that is, offering a defensible answer sufficient to the question at hand, given often severe constraints on time, budget, and access to data. 25
Designing Appropriate Tests of Program Success
It is important to define what constitutes program success before designing evaluation metrics, success being the degree to which a program produces its desired outcomes. That a program’s founding legislation often contains multiple desired outcomes, some of which may be contradictory or entail sizeable tradeoffs, is a common conundrum. At the least, multiple tests of success may be needed, as is the case with ATP.
How to measure success is itself one of the decisions surrounding the design of an evaluation program. Rossi and Freeman, for example, in their treatment of measuring efficiency, list 15 key concepts. 26 Included in the list are standard measures such as benefits, cost effectiveness, cost-benefit ratio, distributional effects, externalities, opportunity costs, and shadow prices. Specification of the measure(s) of success can influence the selection of evaluation design, identification of data to be collected, type of statistical or other tests of causation or significance to be employed, and, ultimately, conclusions about program impacts. 27 (See page 67 for discussion of specific tests used to assess ATP’s success.)
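To make two of the listed measures concrete, the sketch below computes a benefit-cost ratio and a net present value from yearly benefit and cost streams. It uses the standard discounting formula rather than anything specific to Rossi and Freeman, and the dollar figures, the 7 percent discount rate, and the function names are all assumptions made for illustration.

```python
def present_value(amounts, discount_rate):
    """Discount yearly amounts (year 0 first) back to year 0."""
    return sum(amount / (1 + discount_rate) ** year
               for year, amount in enumerate(amounts))


def benefit_cost_ratio(benefits, costs, discount_rate):
    """Discounted benefits divided by discounted costs."""
    return (present_value(benefits, discount_rate)
            / present_value(costs, discount_rate))


def net_present_value(benefits, costs, discount_rate):
    """Discounted benefits minus discounted costs."""
    return (present_value(benefits, discount_rate)
            - present_value(costs, discount_rate))


# Hypothetical program: costs are paid up front, benefits arrive later.
costs = [100.0, 0.0, 0.0]
benefits = [0.0, 60.0, 60.0]
rate = 0.07  # assumed discount rate, not taken from the text

print(round(benefit_cost_ratio(benefits, costs, rate), 3))
print(round(net_present_value(benefits, costs, rate), 2))
```

Because later benefits are discounted more heavily than up-front costs, the choice of discount rate is itself one of the assumptions that can change the conclusion about program success.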
Use of Control Groups and Counterfactuals
Evaluation seeks not only to measure change, but also to determine if the cause of change is attributable to program intervention. Evaluation thus is directed at ruling out alternative, competing explanations for the change.
Indeed, alternative explanations for observed changes frequently abound. To cite a simple but important example, a program begun at a cyclical trough may be followed by economic improvement on the part of a firm or region that participates in the program, but this improvement could plausibly have been associated with the economy’s general recovery. A well-designed and implemented evaluation, however, may show that the program caused the positive impacts. This ability to “isolate” or “demonstrate” cause is one of the most important tasks evaluation performs for an agency.
To rule out alternative explanations, it is often necessary to contrast the changes that occurred in the group participating in the program with a comparable or “like” group. Comparison and control groups are generic techniques used to gauge whether the observed changes would have occurred even without the program. In each case, evaluators seek to find a population held to be like the participating population in all relevant respects other than participation. 28
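One common way to use a comparison group, offered here as an illustration and not as a method prescribed by the text, is a difference-in-differences calculation: the change observed in the comparison group stands in for what would have happened anyway (for example, a general economic recovery), and it is subtracted from the change observed among participants. The sales figures and names below are hypothetical.

```python
from statistics import mean


def difference_in_differences(treated_before, treated_after,
                              comparison_before, comparison_after):
    """Participant change minus comparison-group change."""
    treated_change = mean(treated_after) - mean(treated_before)
    comparison_change = mean(comparison_after) - mean(comparison_before)
    return treated_change - comparison_change


# Hypothetical firm sales before and after a program begun at a
# cyclical trough. Both groups improve as the economy recovers.
participants_before = [100, 120, 110]
participants_after = [130, 150, 140]
comparison_before = [105, 115, 110]
comparison_after = [120, 130, 125]

effect = difference_in_differences(
    participants_before, participants_after,
    comparison_before, comparison_after,
)
print(effect)
```

In this made-up case participants improve by 30 on average and the comparison firms by 15, so only the remaining 15 is a candidate for attribution to the program. The estimate is credible only to the extent that the two groups really are alike in all relevant respects other than participation.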
Random assignment of participants to either experimental or control conditions is an approach that allows evaluators to generate two groups that have the same general characteristics in all salient variables other than the one to be tested. This is the approach generally used, for example, in testing the effectiveness of medical treatments.
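The mechanics of random assignment are simple, as the following sketch suggests. The function name, the applicant labels, and the even split between groups are assumptions for illustration; the seed is fixed only so that the assignment can be reproduced and audited.

```python
import random


def randomly_assign(units, seed=None):
    """Shuffle the units and split them into experimental and control groups."""
    rng = random.Random(seed)
    shuffled = list(units)  # copy so the caller's list is left unchanged
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical pool of eligible applicants.
applicants = [f"firm_{i}" for i in range(10)]
experimental, control = randomly_assign(applicants, seed=2024)

print(len(experimental), len(control))
```

Because chance alone decides who lands in which group, differences between the groups in any characteristic, measured or not, tend to even out as the pool grows. With a small pool, the groups can still differ noticeably by chance.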
Whether projects in a public program are selected randomly, by merit, or by other criteria, evaluation requires the construction of a comparable group. The critical decision is the selection of what Mohr has termed the “criterion” population. 29 To whom should program participants be compared? Non-funded applicants? All firms in the same technology/industry sector? All firms? As Judd and Kenny note, in referring to different ways in which the population may be tiered, “Deciding on which level is the most appropriate in any piece of applied research is a fundamental and frequently difficult problem.” 30
After an evaluator chooses the criterion population, the next decision becomes how members of this population are chosen for the comparison group. Random assignment can work if the criterion population is large enough, but other considerations may lead to a more purposeful selection of a “matched” set of actors. For instance, say an agency can fund only one half of its eligible applicants, and awardees and non-awardees are alike in program-relevant criteria. In this case, the impacts of the program will be better measured by comparing the technical and economic performance of these two groups rather than comparing the program’s participants to a random selection of all firms within the relevant technology/industrial sectors. 31
To take things further, an evaluator might assess the program’s impacts by comparing the awardee/non-awardee performance as described above, and also comparing each of those groups to a randomly selected set of firms from the criterion population. Comprehensiveness of design, however, comes with a price. The larger and more differentiated the comparison and/or control groups, the more expensive the project and the more complex the conduct of the evaluation.
An alternative approach, termed a “counterfactual,” solicits expert judgment about what would likely have happened in the absence of the program in question. For example, had a program not funded the development of a new technology, would someone else have developed it? In the same timeframe? With the same features? By the same parties? If it had not been developed, what would have been used in lieu of the new technology? Or if it had been developed differently without the program, what would the differences likely have been? Expert judgment about the most likely alternative scenario is used to establish a baseline against which the program result can be compared. Generally, the use of a counterfactual to isolate the effect attributable to a program is less rigorous than use of a control or comparison group. But it has the advantage of often being feasible when a control or comparison group is not, and it is a respected approach in social sciences research, where controlled experimentation is usually difficult, if not impossible.
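The arithmetic behind a counterfactual comparison can be sketched as the observed result minus the expert-judged baseline, year by year. In the sketch below, the benefit figures, the three scenarios, and the function name are hypothetical; in practice the baseline would come from expert judgment about the most likely alternative scenario.

```python
def attributable_benefit(observed, counterfactual):
    """Sum of yearly differences between what happened and the baseline."""
    if len(observed) != len(counterfactual):
        raise ValueError("both series must cover the same years")
    return sum(o - c for o, c in zip(observed, counterfactual))


# Hypothetical yearly benefits from a technology developed with
# program funding.
observed = [0, 10, 20, 20, 20]

# Hypothetical expert scenarios for what would have happened without
# the program.
scenarios = {
    "never developed": [0, 0, 0, 0, 0],
    "developed two years later": [0, 0, 0, 10, 20],
    "developed on the same schedule by others": [0, 10, 20, 20, 20],
}

estimates = {
    name: attributable_benefit(observed, baseline)
    for name, baseline in scenarios.items()
}
print(estimates)
```

The three scenarios attribute 70, 40, and 0 units of benefit to the program, respectively. The spread shows how heavily the result depends on the chosen baseline, which is why the basis for the expert judgment needs to be documented.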
Deciding Who Should Perform Evaluation
The skills to perform evaluation are diffused throughout several sectors of the economy. As attested to by the large and diverse membership of the American Evaluation Association and the existence of several evaluation journals, evaluators come from economics, sociology, history, statistics, public administration, and other fields. The nature of the study and the methods to be used influence the selection.
The use of internal or external evaluators is often an important decision; each choice has advantages and disadvantages. Evaluation by in-house staff offers several advantages: staff evaluators can make important contributions to mission-focused studies; provide continuous feedback between evaluation staff, operations staff, and agency decision makers; and translate findings into agency-relevant publications and briefings. Using staff evaluators facilitates access to and use of confidential data related to private sector activities that are controlled by federal agencies. Perhaps more important than any of the above, command of the analytical and empirical skills to conduct evaluation is essential to an agency’s ability to design, manage, monitor, interpret, assess, and disseminate studies and findings performed by outside contractors.
Sole reliance on in-house staff, though, runs the risk of neglecting newer conceptual, methodological, or empirical advances occurring in program evaluation in other organizations. It also runs the risk of inconsistent, incomplete evaluations when agency personnel are detailed for other pressing agency needs. Legislatively imposed constraints on personnel classifications, staffing limits, and salaries may simply make it impossible for an agency to assemble the staff required to conduct the range and depth of an evaluation program deemed necessary. But, most importantly, relying totally on an in-house staff for evaluation runs the risk of reducing credibility for the studies. In short, especially in politically contested domains, recourse to external evaluators is as much a matter of political credibility as of comparative methodological expertise.
Use of outside contractors can be an indispensable component of an agency’s evaluation efforts. Reliance on outside evaluators, however, presents its own set of problems. External evaluators may not know the ins and outs of a program and may not want to invest the time to learn. External evaluators may have their own agendas, which may place more importance on pursuing a particular line of research than on addressing questions of interest to program administrators. External evaluators may see their audience in very different terms than do program administrators, and may resist translating their studies to language that program stakeholders can understand. Program participants who are reluctant to share data with “outsiders” may meet external evaluators with distrust. Finally, external evaluators may be lacking in objectivity and credibility if, for example, they have established their reputations based on particular findings, such as that all federal programs are either wasteful or efficient, or that a particular technology or industry is of prime importance or of little importance.
Transparency and Replicability
Good evaluations follow the dictates of good scientific research; that is, good evaluations set forth explicit hypotheses and research protocols, make public the documentation used, and delineate the tests of impact or causation, statistical or otherwise, used to formulate conclusions. Holding aside possible differences in interpretation that may follow from two or more researchers examining the same data and tests, a key to a sound evaluation is that its inner workings are transparent, thus permitting others to replicate it if they choose, or to adjust the evaluation’s mechanics to determine how sensitive the findings and recommendations are to specific assumptions.
Codes and criteria for good evaluation practice abound, albeit flexibility rather than orthodoxy characterizes the best and most recent outlooks of nationally prominent evaluators. 32 At a minimum, best practice means: (1) addressing significant programmatic questions, (2) linking evaluation questions and design to program mission, (3) focusing on outputs and outcomes rather than only inputs, (4) carefully identifying and collecting relevant data, (5) considering alternative explanations for observed changes, (6) ensuring transparency in treatment of assumptions and presentation of data and other evidence, (7) using a degree of methodological rigor and care that can withstand critical scrutiny, and (8) communicating the findings effectively.
Evaluations are often research-oriented undertakings, designed to test hypotheses and generate primary data. Such efforts need to be conducted with the same level of precision that would follow in seeking publication in high-quality, peer-reviewed journals. Program administrators anxious for bottom-line results may meet the research nature of some evaluation studies with impatience. Evaluation staff will need to recognize the possible tension between pursuing research-oriented, exploratory studies, which advance the tools of their trade, and producing and communicating studies of immediate applicability. Finding a workable balance is important to achieving both short- and long-run success for an evaluation program.
18 An overview of the GPRA is provided in Appendix 1 of the General Accounting Office Executive Guide, “Effectively Implementing the Government Performance and Results Act,” GAO Report GGD–96–118, Washington, DC, 1996.
19 National Academy of Sciences, Committee on Science, Engineering, and Public Policy, Evaluating Federal Research Programs: Research and the Government Performance and Results Act (Washington, DC: National Academy Press, 1999).
20 See, for example, Susan Cozzens and Julia Melkers, “Use and Usefulness of Performance Measurement in State Science and Technology Programs,” Policy Studies Journal, 25: 425–435, 1997.
21 Office of Management and Budget, The President’s Management Agenda, available online at http://www.whitehouse.gov/omb.
22 Politics and ideology also play major roles in determining support for a particular effort. Here the focus is on the objective basis of support.
23 A number of guides to logic modeling are available. For a listing of guides that focus on the forms and structures, strengths and weaknesses, and uses of logic modeling, see Molly den Heyer, ed., A Bibliography for Program Logic Models/Logframe Analysis (Ottawa, Canada: Evaluation Unit, International Development Research Centre, 2001). Included in the list is John A. McLaughlin and Gretchen B. Jordan, “Logic Models: A Tool for Telling your Program Performance Story,” Evaluation and Program Planning, 22:65–72, 1999.
24 Paul F. McCawley, The Logic Model for Program Planning and Evaluation, CIS 1097 (Moscow, Idaho: University of Idaho Extension Program), p. 1.
25 P. Rossi and H. Freeman, Evaluation: A Systematic Approach, 4th ed. (Beverly Hills, CA: Sage, 1989); Shadish et al., Foundations of Program Evaluation, 1991.
26 P. Rossi and H. Freeman, Evaluation: A Systematic Approach, p. 377.
27 Though it may seem self-evident, stakeholders sometimes need to be reminded that a program, to be fairly tested, must be measured against its mission. This is sometimes overlooked, inadvertently or deliberately, either to the detriment or benefit of the program in question. For example, ATP was once criticized for having “merely accelerated technology development,” when, in fact, accelerating technology development is an accomplishment that is core to ATP’s legislated mission. This criticism of the program’s accomplishments is therefore invalid. Being able to link the measured effects directly back to mission is critical for program administrators who present and defend a program. Similarly, when presenting unintended results, it is incumbent on program administrators to acknowledge that the results presented, though they may be desirable, are not within the mission scope.
28 D. Campbell and J. Stanley, Experimental and Quasi-Experimental Designs for Research (Chicago: Rand McNally, 1966).
29 L. Mohr, Impact Analysis for Program Evaluation, 2nd ed. (Thousand Oaks, CA: Sage, 1995).
30 Charles M. Judd and David A. Kenny, Estimating the Effects of Social Interventions (New York: Cambridge University Press, 1981), p. 55.
31 Randomization is less useful—indeed it is likely to be politically unacceptable if used for project selection—when an agency seeks the “best” of a set of proposals. Phrased differently, it would not be expected that an agency announce a program competition, set criteria, and then randomly select grantees from among a set of applicants. That agency would instead measure the applicants against the specified selection criteria. An “intermediate” process does exist: An agency could announce a program competition, convene selection panels to sort proposals into acceptable/non-acceptable categories, and randomly select awardees from the acceptable category. No federal agency has yet been willing to experiment with the technique to the authors’ knowledge.
32 Mark et al. Evaluation: An Integrated Framework for Understanding, Guiding, and Improving Policies and Programs, 2000.
Date created: July 13,
NIST is an agency of the U.S. Commerce Department