Identifying and Implementing Educational Practices Supported By Rigorous Evidence



3. The study should provide data showing that there were no systematic differences between the intervention and control groups before the intervention.
As discussed above, the random assignment process ensures, to a high degree of confidence, that there are no systematic differences between the characteristics of the intervention and control groups prior to the intervention. However, in rare cases - particularly in smaller trials - random assignment might by chance produce intervention and control groups that differ systematically in various characteristics (e.g., academic achievement levels, socioeconomic status, ethnic mix). Such differences could lead to inaccurate results. Thus, the study should provide data showing that, before the intervention, the intervention and control groups did not differ systematically in the vast majority of measured characteristics (allowing that, by chance, there might have been some minor differences).
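As a concrete illustration, the sketch below tabulates pre-intervention differences between the two groups the way a reader or researcher might. The data, the variable names (prior_reading_score, free_lunch_eligible), and the informal 0.25 threshold mentioned in the comments are hypothetical assumptions for illustration; they are not requirements from this Guide.

```python
# Illustrative sketch (not from the Guide): checking baseline equivalence
# between randomly assigned groups, using hypothetical pre-intervention data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 150
baseline = pd.DataFrame({
    "group": ["intervention"] * n + ["control"] * n,
    "prior_reading_score": rng.normal(200, 30, 2 * n),
    "free_lunch_eligible": rng.integers(0, 2, 2 * n),
})

for covariate in ["prior_reading_score", "free_lunch_eligible"]:
    treat = baseline.loc[baseline["group"] == "intervention", covariate]
    ctrl = baseline.loc[baseline["group"] == "control", covariate]
    # Standardized difference: group difference in units of the pooled SD.
    pooled_sd = np.sqrt((treat.var(ddof=1) + ctrl.var(ddof=1)) / 2)
    std_diff = (treat.mean() - ctrl.mean()) / pooled_sd
    print(f"{covariate}: intervention mean={treat.mean():.2f}, "
          f"control mean={ctrl.mean():.2f}, standardized diff={std_diff:.3f}")

# Large standardized differences (e.g., above roughly 0.25) would suggest
# that the groups were not equivalent before the intervention.
```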
Key items to look for in the study's collection of outcome data
4. The study should use outcome measures that are "valid" -- i.e., that accurately measure the true outcomes that the intervention is designed to affect. Specifically:

a. To test academic achievement outcomes (e.g., reading/math skills), a study should use tests whose ability to accurately measure true skill levels is well-established (for example, the Woodcock-Johnson Psychoeducational Battery, the Stanford Achievement Test, etc.).

b. Wherever possible, a study should use objective, "real-world" measures of the outcomes that the intervention is designed to affect (e.g., for a delinquency prevention program, the students' official suspensions from school).

c. If outcomes are measured through interviews or observation, the interviewers/observers preferably should be kept unaware of who is in the intervention and control groups.

Such "blinding" of the interviewers/observers, where possible, helps protect against the possibility that any bias they may have (e.g., as proponents of the intervention) could influence their outcome measurements. Blinding would be appropriate, for example, in a study of a violence prevention program for elementary school students, where an outcome measure is the incidence of hitting on the playground as detected by an adult observer.

d. When study participants are asked to "self-report" outcomes, their reports should, if possible, be corroborated by independent and/or objective measures.

For instance, when participants in a substance-abuse or violence prevention program are asked to self-report their drug or tobacco use or criminal behavior, they tend to under-report such undesirable behaviors. In some cases, this may lead to inaccurate study results, depending on whether the intervention and control groups under-report by different amounts.

Thus, studies that use such self-reported outcomes should, if possible, corroborate them with other measures (e.g., saliva thiocyanate tests for smoking, official arrest data, third-party observations).
5. The percentage of study participants that the study has lost track of when collecting outcome data should be small, and should not differ between the intervention and control groups.
A general guideline is that the study should lose track of fewer than 25 percent of the individuals originally randomized - the fewer lost, the better. This is sometimes referred to as the requirement for "low attrition." (Studies that choose to follow only a representative subsample of the randomized individuals should lose track of less than 25 percent of the subsample.)
Furthermore, the percentage of subjects lost track of should be approximately the same for the intervention and the control groups. This is because differential losses between the two groups can create systematic differences between the two groups, and thereby lead to inaccurate estimates of the intervention's effect. This is sometimes referred to as the requirement for "no differential attrition."
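A minimal sketch of how these two attrition guidelines might be checked from a study's reported numbers follows; the counts below are hypothetical and are used only for illustration.

```python
# Illustrative sketch (not from the Guide): checking the "low attrition" and
# "no differential attrition" guidelines, using hypothetical counts.
randomized = {"intervention": 150, "control": 150}
with_outcome_data = {"intervention": 128, "control": 119}

loss_rates = {}
for group in randomized:
    lost = randomized[group] - with_outcome_data[group]
    loss_rates[group] = lost / randomized[group]
    print(f"{group}: lost {lost} of {randomized[group]} ({loss_rates[group]:.0%})")

overall = 1 - sum(with_outcome_data.values()) / sum(randomized.values())
differential = abs(loss_rates["intervention"] - loss_rates["control"])
print(f"overall attrition: {overall:.0%} (guideline: under 25%)")
print(f"differential attrition: {differential:.0%} (guideline: close to zero)")
```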
6. The study should collect and report outcome data even for those members of the intervention group who don't participate in or complete the intervention.
This is sometimes referred to as the study's use of an "intention-to-treat" approach, the importance of which is best illustrated with an example.
Example. Consider a randomized controlled trial of a school voucher program, in which students from disadvantaged backgrounds are randomly assigned to an intervention group - whose members are offered vouchers to attend private school - or to a control group that does not receive voucher offers. It's likely that some of the students in the intervention group will not accept their voucher offers and will choose instead to remain in their existing schools. Suppose that, as may well be the case, these students as a group are less motivated to succeed than their counterparts who accept the offer. If the trial then drops the students not accepting the offer from the intervention group, leaving the more motivated students, it would create a systematic difference between the intervention and control groups - namely, motivation level. Thus the study may well over-estimate the voucher program's effect on educational success, erroneously attributing a superior outcome for the intervention group to the vouchers when in fact it was due to the difference in motivation.
Therefore, the study should collect outcome data for all of the individuals randomly assigned to the intervention group, whether they participated in the intervention or not, and should use all such data in estimating the intervention's effect. The study should also report on how many of the individuals assigned to the intervention group actually participated in the intervention.
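The sketch below illustrates such an intention-to-treat comparison with hypothetical data: the 40 "no-show" students stay in the intervention group for the analysis. The data, variable names, and numbers are assumptions for illustration only, not drawn from any actual trial.

```python
# Illustrative sketch (not from the Guide): an "intention-to-treat" comparison
# using hypothetical outcome data. Every student randomized to the intervention
# group is kept in that group, whether or not they actually participated.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
students = pd.DataFrame({
    "assigned": ["intervention"] * 150 + ["control"] * 150,
    "participated": [True] * 110 + [False] * 40 + [False] * 150,  # 40 no-shows
    "outcome": rng.normal(210, 30, 300),
})

# Intention-to-treat estimate: compare the groups as randomized.
itt_effect = (students.loc[students["assigned"] == "intervention", "outcome"].mean()
              - students.loc[students["assigned"] == "control", "outcome"].mean())

# A comparison that dropped the 40 non-participants would no longer compare
# randomly equivalent groups, and could be biased.
print(f"ITT estimate of the effect: {itt_effect:.2f} points")
print("Intervention-group participation rate:",
      students.loc[students["assigned"] == "intervention", "participated"].mean())
```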
7. The study should preferably obtain data on long-term outcomes of the intervention, so that you can judge whether the intervention's effects were sustained over time.
This is important because the effect of many interventions diminishes substantially within 2-3 years after the intervention ends. This has been demonstrated in randomized controlled trials in diverse areas such as early reading, school-based substance-abuse prevention, prevention of childhood depression, and welfare-to-work and employment. In most cases, it is the longer-term effect, rather than the immediate effect, that is of greatest practical and policy significance.
Key items to look for in the study's reporting of results

8. If the study claims that the intervention improves one or more outcomes, it should report (i) the size of the effect, and (ii) statistical tests showing the effect is unlikely to be due to chance.


Specifically, the study should report the size of the difference in outcomes between the intervention and control groups. It should also report the results of tests showing the difference is "statistically significant" at conventional levels -- generally the .05 level. Such a finding means that there is only a 1 in 20 probability that the difference could have occurred by chance if the intervention's true effect is zero.
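As an illustration, the sketch below computes both pieces of information for hypothetical post-test scores: the size of the difference between groups, and a standard two-sample significance test (here a t-test via scipy, which is one common choice rather than a method prescribed by this Guide).

```python
# Illustrative sketch (not from the Guide): reporting both the size of the
# difference in outcomes and a test of statistical significance, using
# hypothetical post-test scores for each group.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
intervention_scores = rng.normal(215, 30, 150)  # hypothetical scores
control_scores = rng.normal(205, 30, 150)

effect = intervention_scores.mean() - control_scores.mean()
t_stat, p_value = stats.ttest_ind(intervention_scores, control_scores)

print(f"difference in mean scores: {effect:.1f} points")
print(f"p-value: {p_value:.3f} (statistically significant at the .05 level "
      f"if p < 0.05)")
```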
a. In order to obtain such a finding of statistically significant effects, a study usually needs to have a relatively large sample size.
A rough rule of thumb is that a sample size of at least 300 students (150 in the intervention group and 150 in the control group) is needed to obtain a finding of statistical significance for an intervention that is modestly effective. If schools or classrooms, rather than individual students, are randomized, a minimum sample size of 50 to 60 schools or classrooms (25-30 in the intervention group and 25-30 in the control group) is needed to obtain such a finding. (This rule of thumb assumes that the researchers choose a sample of individuals or schools/classrooms that do not differ widely in initial achievement levels.)15 If an intervention is highly effective, smaller sample sizes than this may be able to generate a finding of statistical significance.
If the study seeks to examine the intervention's effect on particular subgroups within the overall sample (e.g., Hispanic students), larger sample sizes than those above may be needed to generate a finding of statistical significance for the subgroups.
In general, larger sample sizes are better than smaller sample sizes, because they provide greater confidence that any difference in outcomes between the intervention and control groups is due to the intervention rather than chance.
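For readers who want to see where such rules of thumb come from, the sketch below runs a standard power calculation using the statsmodels library. The assumed standardized effect size of 0.33 is our stand-in for "modestly effective" and is an illustrative assumption, not a figure taken from this Guide.

```python
# Illustrative sketch (not from the Guide): an approximate power calculation
# consistent with the rule of thumb above. The effect size of 0.33 is an
# assumption representing a "modestly effective" intervention.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.33, alpha=0.05, power=0.80,
                                   alternative="two-sided")
print(f"students needed per group: about {n_per_group:.0f}")
# With an assumed effect size of about 0.33, this comes out to roughly 145
# students per group - in line with the ~300-student rule of thumb.
```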
b. If the study randomizes groups (e.g., schools) rather than individuals, the sample size that the study uses in tests for statistical significance should be the number of groups rather than the number of individuals in those groups.
Occasionally, a study will erroneously use the number of individuals as its sample size, and thus generate false findings of statistical significance.
Example. If a study randomly assigns two schools to an intervention group and two schools to a control group, the sample size that the study should use in tests for statistical significance is just four, regardless of how many hundreds of students are in the schools. (And it is very unlikely that such a small study could obtain a finding of statistical significance.)
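One simple way to respect this rule, shown in the hypothetical sketch below, is to aggregate student outcomes to school-level means and run the significance test on the schools themselves. (More sophisticated multilevel models are commonly used in practice; this conservative approach is shown only to make clear that the effective sample size is the number of schools, not the number of students.)

```python
# Illustrative sketch (not from the Guide): when whole schools are randomized,
# aggregate student outcomes to school means and test at the school level, so
# the effective sample size is the number of schools.
import pandas as pd
from scipy import stats

# Hypothetical student-level data: school id, assignment, and outcome score.
students = pd.DataFrame({
    "school": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "assigned": ["intervention"] * 4 + ["control"] * 4,
    "score": [212, 220, 208, 215, 201, 207, 205, 199],
})

school_means = students.groupby(["school", "assigned"])["score"].mean().reset_index()
treat = school_means.loc[school_means["assigned"] == "intervention", "score"]
ctrl = school_means.loc[school_means["assigned"] == "control", "score"]

t_stat, p_value = stats.ttest_ind(treat, ctrl)
print(f"school-level sample size: {len(school_means)}")  # 4 schools, not 8 students
print(f"difference in school means: {treat.mean() - ctrl.mean():.1f}, p = {p_value:.2f}")
```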
c. The study should preferably report the size of the intervention's effects in easily understandable, real-world terms (e.g., an improvement in reading skill by two grade levels, a 20 percent reduction in weekly use of illicit drugs, a 20 percent increase in high school graduation rates).
It is important for a study to report the size of the intervention's effects in this way, in addition to whether the effects are statistically significant, so that you (the reader) can judge their educational importance. For example, it is possible that a study with a large sample size could show effects that are statistically significant but so small that they have little practical or policy significance (e.g., a 2 point increase in SAT scores). Unfortunately, some studies report only whether the intervention's effects are statistically significant, and not their magnitude.
Some studies describe the size of the intervention's effects in "standardized effect sizes."16 A full discussion of this concept is beyond the scope of this Guide. We merely comment that standardized effect sizes may not accurately convey the educational importance of an intervention, and, when used, should preferably be translated into understandable, real-world terms like those above.
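For readers who encounter standardized effect sizes, the sketch below shows how one common version (Cohen's d) is computed from hypothetical scores and how it can be translated back into the test's own score points; the data are illustrative assumptions.

```python
# Illustrative sketch (not from the Guide): computing a standardized effect
# size (Cohen's d) from hypothetical group outcomes, and translating it back
# into the test's own score points so the result is easier to interpret.
import numpy as np

rng = np.random.default_rng(3)
intervention_scores = rng.normal(215, 30, 150)  # hypothetical
control_scores = rng.normal(205, 30, 150)

diff = intervention_scores.mean() - control_scores.mean()
pooled_sd = np.sqrt((intervention_scores.var(ddof=1) + control_scores.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd

print(f"standardized effect size (Cohen's d): {cohens_d:.2f}")
print(f"same effect in test-score points: {diff:.1f} points "
      f"(pooled SD = {pooled_sd:.1f})")
```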
9. A study's claim that the intervention's effect on a subgroup (e.g., Hispanic students) is different than its effect on the overall population in the study should be treated with caution.
Specifically, we recommend that you look for corroborating evidence of such subgroup effects in other studies before accepting them as valid.
This is because a study will sometimes show different effects for different subgroups just by chance, particularly when the researchers examine a large number of subgroups and/or the subgroups contain a small number of individuals. For example, even if an intervention's true effect is the same on all subgroups, we would expect a study's analysis of 20 subgroups to "demonstrate" a different effect on one of those subgroups just by chance (at conventional levels of statistical significance). Thus, studies that engage in a post-hoc search for different subgroup effects (as some do) will sometimes turn up spurious effects rather than legitimate ones.
Example. In a large randomized controlled trial of aspirin for the emergency treatment of heart attacks, aspirin was found to be highly effective, resulting in a 23 percent reduction in vascular deaths at the one-month follow-up. To illustrate the unreliability of subgroup analyses, these overall results were subdivided by the patients' astrological birth signs into 12 subgroups. Aspirin's effects were similar in most subgroups to those for the whole population. However, for two of the subgroups, Libra and Gemini, aspirin appeared to have no effect in reducing mortality. Clearly it would be wrong to conclude from this analysis that heart attack patients born under the astrological signs of Libra and Gemini do not benefit from aspirin.17
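A short calculation illustrates how easily chance alone produces such spurious subgroup findings when many subgroups are examined; the numbers mirror the 20-subgroup example above and are illustrative only.

```python
# Illustrative sketch (not from the Guide): with 20 independent subgroup tests
# at the .05 level and a true effect of zero everywhere, at least one spurious
# "significant" finding is more likely than not.
p_spurious_per_test = 0.05
n_subgroups = 20

expected_false_positives = n_subgroups * p_spurious_per_test
p_at_least_one = 1 - (1 - p_spurious_per_test) ** n_subgroups

print(f"expected spurious 'significant' subgroups: {expected_false_positives:.0f}")
print(f"chance of at least one spurious finding: {p_at_least_one:.0%}")  # about 64%
```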
10. The study should report the intervention's effects on all the outcomes that the study measured, not just those for which there is a positive effect.
This is because if a study measures a large number of outcomes, it may, by chance alone, find positive (and statistically significant) effects on one or a few of those outcomes. Thus, the study should report the intervention's effects on all measured outcomes so that you can judge whether the positive effects are the exception or the pattern.

B. Quantity of evidence needed to establish "strong" evidence of effectiveness.


1. For reasons set out below, we believe "strong" evidence of effectiveness requires:
(i) that the intervention be demonstrated effective, through well-designed randomized controlled trials, in more than one site of implementation, and
(ii) that these sites be typical school or community settings, such as public school classrooms taught by regular teachers. Typical settings would not include, for example, specialized classrooms set up and taught by researchers for purposes of the study.
Such a demonstration of effectiveness may require more than one randomized controlled trial of the intervention, or one large trial with more than one implementation site.
2. In addition, the trials should demonstrate the intervention's effectiveness in school settings similar to yours, before you can be confident it will work in your schools and classrooms.
For example, if you are considering implementing an intervention in a large inner-city public school serving primarily minority students, you should look for randomized controlled trials demonstrating the intervention's effectiveness in similar settings. Randomized controlled trials demonstrating its effectiveness in a white, suburban population do not constitute strong evidence that it will work in your school.
3. Main reasons why a demonstration of effectiveness in more than one site is needed:
a. A single finding of effectiveness can sometimes occur by chance alone. For example, even if all educational interventions tested in randomized controlled trials were ineffective, we would expect 1 in 20 of those trials to "demonstrate" effectiveness by chance alone at conventional levels of statistical significance. Requiring that an intervention be shown effective in two trials (or in two sites of one large trial) reduces the likelihood of such a false-positive result to 1 in 400 (see the short calculation after the example below).

b. The results of a trial in any one site may be dependent on site-specific factors and thus may not be generalizable to other sites. It is possible, for instance, that an intervention may be highly effective in a school with an unusually talented individual managing the details of implementation, but would not be effective in another school with other individuals managing the detailed implementation.


Example. Two multi-site randomized controlled trials of the Quantum Opportunity Program - a community-based program for disadvantaged high school students providing academic assistance, college and career planning, community service and work experiences, and other services - have found that the program's effects vary greatly among the various program sites. A few sites - including the original program site (Philadelphia) - produced sizeable effects on participants' academic and/or career outcomes, whereas many sites had little or no effect on the same outcomes.18 Thus, the program's effects appear to be highly dependent on site-specific factors, and it is not clear that its success can be widely replicated.
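The 1-in-400 figure in item 3a follows directly from multiplying the chance of a false positive in one trial by the chance in a second, independent trial. A minimal sketch of that arithmetic (illustrative only, not from the Guide):

```python
# The arithmetic behind "1 in 400": if a truly ineffective intervention has a
# 5% chance of a false-positive finding in any single trial (or site), the
# chance that two independent trials/sites both produce false positives is
# 5% of 5%.
p_false_positive = 0.05
p_both = p_false_positive ** 2
print(p_both)                    # 0.0025
print(f"1 in {1 / p_both:.0f}")  # 1 in 400
```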

4. Pharmaceutical medicine provides an important precedent for the concept that "strong" evidence requires a showing of effectiveness in more than one instance.
Specifically, the Food and Drug Administration (FDA) usually requires that a new pharmaceutical drug or medical device be shown effective in more than one randomized controlled trial before the FDA will grant it a license to be marketed. The FDA's reasons for this policy are similar to those discussed above.19

III. HOW TO EVALUATE WHETHER AN INTERVENTION IS BACKED BY "POSSIBLE" EVIDENCE OF EFFECTIVENESS.
Because well-designed and implemented randomized controlled trials are not very common in education, the evidence supporting an intervention frequently falls short of the above criteria for "strong" evidence of effectiveness in one or more respects. For example, the supporting evidence may consist of:

a. Only nonrandomized studies;

b. Only one well-designed randomized controlled trial showing the intervention's effectiveness at a single site;

c. Randomized controlled trials whose design and implementation contain one or more flaws noted above (e.g., high attrition);

d. Randomized controlled trials showing the intervention's effectiveness as implemented by researchers in a laboratory-like setting, rather than in a typical school or community setting; or

e. Randomized controlled trials showing the intervention's effectiveness for students with different academic skills and socioeconomic backgrounds than the students in your schools or classrooms.

Whether an intervention not supported by "strong" evidence is nevertheless supported by "possible" evidence of effectiveness (as opposed to no meaningful evidence of effectiveness) is a judgment call that depends, for example, on the extent of the flaws in the randomized controlled trials of the intervention and the quality of any nonrandomized studies that have been done. While this Guide cannot foresee and provide advice on all possible scenarios of evidence, it offers in this section a few factors to consider in evaluating whether an intervention not supported by "strong" evidence is nevertheless supported by "possible" evidence.


A. Circumstances in which a comparison-group study can constitute "possible" evidence of effectiveness:


1. The study's intervention and comparison groups should be very closely matched in academic achievement levels, demographics, and other characteristics prior to the intervention.
The investigations, discussed in section I, that compare comparison-group designs with randomized controlled trials generally support the value of comparison-group designs in which the comparison group is very closely matched with the intervention group. In the context of education studies, the two groups should be matched closely in characteristics including:

a. Prior test scores and other measures of academic achievement (preferably, the same measures that the study will use to evaluate outcomes for the two groups);

b. Demographic characteristics, such as age, sex, ethnicity, poverty level, parents' educational attainment, and single or two-parent family background;

c. Time period in which the two groups are studied (e.g., the two groups are children entering kindergarten in the same year as opposed to sequential years); and

d. Methods used to collect outcome data (e.g., the same test of reading skills administered in the same way to both groups).

These investigations have also found that when the intervention and comparison groups differ in such characteristics, the study is unlikely to generate accurate results even when statistical techniques are then used to adjust for these differences in estimating the intervention's effects.
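To make the idea of close matching concrete, the sketch below pairs each intervention student with the most similar untreated student on prior test score within the same poverty category. Everything here - the data, the variable names (prior_score, free_lunch), and the one-to-one nearest-score matching rule - is a hypothetical illustration, not a method prescribed by this Guide.

```python
# Illustrative sketch (not from the Guide): forming a closely matched
# comparison group from a pool of untreated students, using hypothetical data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
treated = pd.DataFrame({
    "prior_score": rng.normal(200, 25, 100),
    "free_lunch": rng.integers(0, 2, 100),
})
pool = pd.DataFrame({
    "prior_score": rng.normal(205, 25, 500),
    "free_lunch": rng.integers(0, 2, 500),
})

matched_rows = []
available = pool.copy()
for _, student in treated.iterrows():
    # Consider only pool students with the same poverty indicator, then take
    # the one with the closest prior score (matching without replacement).
    candidates = available[available["free_lunch"] == student["free_lunch"]]
    best = (candidates["prior_score"] - student["prior_score"]).abs().idxmin()
    matched_rows.append(available.loc[best])
    available = available.drop(index=best)

matched = pd.DataFrame(matched_rows)
print("treated mean prior score:", round(treated["prior_score"].mean(), 1))
print("matched comparison mean prior score:", round(matched["prior_score"].mean(), 1))
```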


2. The comparison group should not be comprised of individuals who had the option to participate in the intervention but declined.

This is because individuals choosing not to participate in an intervention may differ systematically in their level of motivation and other important characteristics from the individuals who do choose to participate. The difference in motivation (or other characteristics) may itself lead to different outcomes for the two groups, and thus contaminate the study's estimates of the intervention's effects.

Therefore, the comparison group should be comprised of individuals who did not have the option to participate in the intervention, rather than individuals who had the option but declined.
3. The study should preferably choose the intervention/comparison groups and outcome measures "prospectively" - that is, before the intervention is administered.
This is because if the groups and outcome measures are chosen by the researchers after the intervention is administered ("retrospectively"), the researchers may consciously or unconsciously select groups and outcome measures so as to generate their desired results. Furthermore, it is often difficult or impossible for the reader of the study to determine whether the researchers did so.


Prospective comparison-group studies are, like randomized controlled trials, much less susceptible to this problem. In the words of the director of drug evaluation for the Food and Drug Administration, "The great thing about a [randomized controlled trial or prospective comparison-group study] is that, within limits, you don't have to believe anybody or trust anybody. The planning for [the study] is prospective; they've written the protocol before they've done the study, and any deviation that you introduce later is completely visible." By contrast, in a retrospective study, "you always wonder how many ways they cut the data. It's very hard to be reassured, because there are no rules for doing it."20


4. The study should meet the guidelines set out in section II for a well-designed randomized controlled trial (other than guideline 2 concerning the random-assignment process).
That is, the study should use valid outcome measures, have low attrition, report tests for statistical significance, and so on.


B. Studies that do not meet the threshold for "possible" evidence of effectiveness:

1. Pre-post studies, which often produce erroneous results, as discussed in section I.

2. Comparison-group studies in which the intervention and comparison groups are not well-matched.

As discussed in section I, such studies also produce erroneous results in many cases, even when statistical techniques are used to adjust for differences between the two groups.
Example. As reported in Education Week, several comparison-group studies have been carried out to evaluate the effects of "high-stakes testing" - i.e., state-level policies in which student test scores are used to determine various consequences, such as whether the students graduate or are promoted to the next grade, whether their teachers are awarded bonuses, or whether their school is taken over by the state. These studies compare changes in test scores and dropout rates for students in states with high-stakes testing (the intervention group) to those for students in other states (the comparison groups). Because students in different states differ in many characteristics, such as demographics and initial levels of academic achievement, it is unlikely that these studies provide accurate measures of the effects of high-stakes testing. It is not surprising that these studies reach differing conclusions about the effects of such testing.21
3. "Meta-analyses" that combine the results of individual studies that do

