Designing assessment tasks: interactive speaking
The final two categories of oral production assessment (interactive and extensive speaking) include tasks that involve relatively long stretches of interactive discourse (interviews, role plays, discussions, games) and tasks of equally long duration that involve less interaction (speeches, telling longer stories, and extended explanations and translations). The obvious difference between the two sets of tasks is the degree of interaction with an interlocutor. Also, interactive tasks are what some would describe as interpersonal, while the final category includes more transactional speech events.
When oral production assessment is mentioned, the first thing that comes to mind is an oral interview: a test administrator and a test-taker sit down in a direct face-to-face exchange and proceed through a protocol of questions and directives. The interview, which may be tape-recorded for relistening, is then scored on one or more parameters such as accuracy in pronunciation and/or grammar, vocabulary usage, fluency, sociolinguistic/pragmatic appropriateness, task accomplishment, and even comprehension.
Interviews can vary in length from perhaps five to forty-five minutes, depending on their purpose and context. Placement interviews, designed to get a quick spoken sample from a student in order to verify placement into a course, may need only five minutes if the interviewer is trained to evaluate the output accurately. Longer comprehensive interviews such as the OPI are designed to cover predetermined oral production contexts and require the better part of an hour.
Every effective interview contains a number of mandatory stages. Two decades ago, Michael Canale (1984) proposed a framework for oral proficiency testing that has withstood the test of time. He suggested that test-takers will perform at their best if they are led through four stages:
- Warm-up
In a minute or so of preliminary small talk, the interviewer directs mutual introductions, apprises the test-taker of the format, and allays anxieties. No scoring of this phase takes place.
- Level check
Through a series of preplanned questions, the interviewer stimulates the test-taker to respond using expected or predicted forms and functions. If, for example, from previous test information, grades, or other data, the test-taker has been judged to be a Level 2 (see below) speaker, the interviewer’s prompts will attempt to confirm this assumption. The responses may take very simple or very complex form, depending on the entry level of the learner. Questions are usually designed to elicit grammatical categories (such as past tense or subject-verb agreement), discourse structure (a sequence of events), vocabulary usage, and/or sociolinguistic factors (politeness conventions, formal/informal language). This stage can also give the interviewer a picture of the test-taker’s extroversion, readiness to speak, and confidence, all of which may be of significant consequence in the interview results. Linguistic target criteria are scored in this phase. If this stage is lengthy, a tape recording of the interview is important.
- Probe
Probe questions and prompts challenge test-takers to go to the heights of their ability, to extend beyond the limits of the interviewer’s expectation, through increasingly difficult questions. Probe questions may be complex in their framing and/or complex in their cognitive and linguistic demands. Through probe items, the interviewer discovers the ceiling or limitation of the test-taker’s proficiency. This need not be an entirely separate stage; it might be a set of questions interspersed into the previous stage. At the lower levels of proficiency, probe items may simply demand a higher range of vocabulary or grammar from the test-taker than predicted. At the higher levels, probe items will typically ask the test-taker to give an opinion or a value judgment, to discuss his or her field of specialization, to recount a narrative, or to respond to questions that are worded in complex form. Responses to probe questions may be scored, or they may be ignored if the test-taker displays an inability to handle such complexity.
- Wind-down
This final phase of the interview is simply a short period of time during which the interviewer puts the test-taker’s mind at ease and provides information about when and where to obtain the results of the interview. This part is not scored.
The suggested set of content specifications for an oral interview (below) may serve as sample questions that can be adapted to individual situations.
The success of an oral interview will depend on
- Clearly specifying administrative procedures of the assessment (practicality),
- Focusing the questions and probes on the purpose of the assessment (validity),
- Appropriately eliciting an optimal amount and quality of oral production from the test taker (biased for best performance), and
- Creating a consistent, workable scoring system (reliability).
This last issue is the thorniest. In oral production tasks that are open-ended and that involve a significant level of interaction, the interviewer is forced to make judgments that are susceptible to some unreliability. Through experience, training, and careful attention to the linguistic criteria being assessed, you will acquire the ability to make such judgments accurately. In Table 7.2, a set of descriptions is given for scoring open-ended oral interviews. These descriptions come from an earlier version of the Oral Proficiency Interview and are useful for classroom purposes.
The test administrator’s challenge is to assign a score, ranging from 1 to 5, for each of the six categories indicated above. It may look easy to do, but in reality the lines of distinction between levels are quite difficult to pinpoint. Some training, or at least a good deal of interviewing experience, is required to make accurate assessments of oral production in the six categories. Usually the six scores are then amalgamated into one holistic score, a process that might not be relegated to a simple mathematical average if you wish to put more weight on some categories than on others.
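The amalgamation step described above can be sketched as a weighted average. The following is a minimal illustration only: the category names follow the six-category scale discussed in the text, but the particular weights are hypothetical and would be set by the test designer according to the assessment's objectives.

```python
# Sketch: combining six category scores (1-5 each) into one holistic score.
# The weights below are illustrative, not prescribed; a designer who wants a
# plain average would simply set all weights equal.

WEIGHTS = {
    "grammar": 0.20,
    "vocabulary": 0.20,
    "comprehension": 0.15,
    "fluency": 0.15,
    "pronunciation": 0.10,
    "task": 0.20,
}

def holistic_score(category_scores: dict[str, int]) -> float:
    """Amalgamate 1-5 category scores into one weighted holistic score."""
    for name, score in category_scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{name} score must be 1-5, got {score}")
    # Normalize by the weight total in case the weights don't sum to 1.
    total_weight = sum(WEIGHTS.values())
    weighted = sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS)
    return round(weighted / total_weight, 2)

scores = {"grammar": 3, "vocabulary": 4, "comprehension": 4,
          "fluency": 3, "pronunciation": 2, "task": 4}
print(holistic_score(scores))  # a single holistic score, e.g. 3.45
```

Weighting pronunciation at 0.10 while grammar carries 0.20, for instance, keeps a heavy accent from dragging down an otherwise fluent, accurate speaker's overall rating.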
This five-point scale, once known as FSI levels (because they were first advocated by the Foreign Service Institute in Washington, D.C.), is still in popular use among U.S. government foreign service staff for designating proficiency in a foreign language. To complicate the scoring somewhat, the five holistic scoring categories have historically been subdivided into “pluses” and “minuses,” as indicated in Table 7.3. To this day, even though the official nomenclature has now changed (see the OPI description below), in-group conversations refer to colleagues and co-workers by their FSI level: “Oh, Bob, yeah, he’s a good 3+ in Turkish; he can easily handle that assignment.”
A variation on the usual one-on-one format with one interviewer and one test taker is to place two test takers at a time with the interviewer. An advantage of a two-on-one interview is the practicality of scheduling twice as many candidates in the same time frame, but more significant is the opportunity for student-student interaction. By deftly posing questions, problems, and role plays, the interviewer can maximize the output of the test takers while lessening the need for his or her own output. A further benefit is the probable increase in authenticity when two test-takers can actually converse with each other. Disadvantages are equalizing the output between the two test-takers, discerning the interaction effect of unequal comprehension and production abilities, and scoring two people simultaneously.
Role play
Role playing is a popular pedagogical activity in communicative language-teaching classes. Within constraints set forth by the guidelines, it frees students to be somewhat creative in their linguistic output. In some versions, role play allows some rehearsal time so that students can map out what they are going to say. And it has the effect of lowering anxieties as students can, even for a few moments, take on the persona of someone other than themselves.
As an assessment device, role play opens some windows of opportunity for the test-takers to use discourse that might otherwise be difficult to elicit. With prompts such as “pretend that you’re a tourist asking me for directions” or “you’re buying a necklace from me in a flea market, and you want to get a lower price,” certain personal, strategic, and linguistic factors come into the foreground of the test-taker’s oral abilities. While role play can be controlled or “guided” by the interviewer, this technique takes test-takers beyond simple intensive and responsive levels to a level of creativity and complexity that approaches real-world pragmatics. Scoring presents the usual issue in any task that elicits somewhat unpredictable responses from test-takers. The test administrator must determine the assessment objectives of the role play, then devise a scoring technique that appropriately pinpoints those objectives.
Discussions and conversations
As formal assessment devices, discussions and conversations with and among students are difficult to specify and even more difficult to score. But as informal techniques to assess learners, they offer a level of authenticity and spontaneity that other assessment techniques may not provide. Discussions may be especially appropriate tasks through which to elicit and observe such abilities as
- Topic nomination, maintenance, and termination;
- Attention getting, interrupting, floor holding, control;
- Clarifying, questioning, paraphrasing;
- Comprehension signals (nodding, “uh-uh,” “hmm,” etc.);
- Negotiating meaning;
- Intonation patterns for pragmatic effect;
- Kinesics, eye contact, proxemics, body language; and
- Politeness, formality, and other sociolinguistic factors.
Assessing the performance of participants through scores or checklists (in which appropriate or inappropriate manifestations of any category are noted) should be carefully designed to suit the objectives of the observed discussion. Of course, discussion is an integrative task, so it is also advisable to give some cognizance to comprehension performance in evaluating learners.
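A checklist of the kind mentioned above can be kept as a simple tally: each time the rater observes one of the discourse features in the bulleted list, it is marked as appropriate or inappropriate. The sketch below is hypothetical; the category labels are abbreviated from the list in the text, and how the per-category proportions feed a final score is left to the test designer.

```python
# Sketch of an observation checklist for a class discussion. Each mark
# records one observed use of a discourse feature, judged appropriate or
# inappropriate; summary() reports the proportion appropriate per category.

CATEGORIES = [
    "topic nomination", "interrupting", "clarifying",
    "comprehension signals", "negotiating meaning", "politeness",
]

class DiscussionChecklist:
    def __init__(self):
        # counts[category] -> [appropriate_count, inappropriate_count]
        self.counts = {c: [0, 0] for c in CATEGORIES}

    def mark(self, category: str, appropriate: bool) -> None:
        """Record one observed manifestation of a category."""
        self.counts[category][0 if appropriate else 1] += 1

    def summary(self) -> dict[str, float]:
        """Proportion of appropriate uses for each category actually observed."""
        result = {}
        for cat, (ok, bad) in self.counts.items():
            if ok + bad:  # skip categories with no observations
                result[cat] = ok / (ok + bad)
        return result

checklist = DiscussionChecklist()
checklist.mark("clarifying", True)      # e.g., student paraphrases a peer
checklist.mark("interrupting", False)   # e.g., cuts a peer off abruptly
checklist.mark("interrupting", True)    # e.g., interrupts politely
print(checklist.summary())
```

Keeping raw appropriate/inappropriate counts, rather than a single running score, lets the rater revisit the tally after the discussion and weight categories differently for different lesson objectives.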
Games
Among informal assessment devices are a variety of games that directly involve language production. Consider the following types:
- “Tinkertoy” game: A Tinkertoy (or Lego block) structure is built behind a screen. One or two learners are allowed to view the structure. In successive stages of construction, the learners tell “runners” (who can’t observe the structure) how to re-create the structure. The runners then tell “builders” behind another screen how to build the structure. The builders may question or confirm as they proceed, but only through the two degrees of separation. Object: re-create the structure as accurately as possible.
- Crossword puzzles are created in which the names of all members of a class are clued by obscure information about them. Each class member must ask questions of others to determine who matches the clues in the puzzle.
- Information gap grids are created such that class members must conduct mini-interviews of other classmates to fill in boxes, e.g., “born in July,” “plays the violin,” “has a two-year-old child,” etc.
- City maps are distributed to class members. Predetermined map directions are given to one student who, with a city map in front of him or her, describes the route to a partner, who must then trace the route and arrive at the correct final destination.
Clearly, such tasks have wandered away from the traditional notion of an oral production test and may even be well beyond assessments, but if you remember the discussion of these terms in Chapter 1 of this book, you can put the tasks into perspective. As assessment, the key is to specify a set of criteria and a reasonably practical and reliable scoring method. The benefit of such an informal assessment may not be as much in a summative evaluation as in its formative nature, with washback for the students.
Oral Proficiency Interview (OPI)
The best-known oral interview format is one that has gone through a considerable metamorphosis over the last half-century: the Oral Proficiency Interview (OPI). Originally known as the Foreign Service Institute (FSI) test, the OPI is the result of a historical progression of revisions under the auspices of several agencies, including the Educational Testing Service and the American Council on the Teaching of Foreign Languages (ACTFL). The latter, a professional society for research on foreign language instruction and assessment, has now become the principal body for promoting the use of the OPI; certification workshops are available, at costs of around $700 for ACTFL members, through ACTFL at selected sites and conferences throughout the year.
Specifications for the OPI approximate those delineated above in the discussion of oral interviews in general. In a series of structured tasks, the OPI is carefully designed to elicit pronunciation, fluency and integrative ability, sociolinguistic and cultural knowledge, grammar, and vocabulary. Performance is judged by the examiner to be at one of ten possible levels on the ACTFL-designated proficiency guidelines for speaking: Superior; Advanced-High, -Mid, -Low; Intermediate-High, -Mid, -Low; Novice-High, -Mid, -Low. A summary of those levels is provided in Table 7.4.
The ACTFL proficiency guidelines may appear to be just another form of the “FSI levels” described earlier. Holistic evaluation is still implied, and in this case four levels are described. On closer scrutiny, however, they offer a markedly different set of descriptors. First, they are more reflective of a unitary definition of ability, as discussed earlier in this book (page 71). Instead of focusing on separate abilities in grammar, vocabulary, comprehension, fluency, and pronunciation, they focus more strongly on the overall task and on the discourse ability needed to accomplish the goals of the task.
Second, for classroom assessment purposes, the six FSI categories more appropriately describe the components of oral ability than do the ACTFL holistic scores, and therefore offer better washback potential.
Third, the ACTFL requirement for specialized training renders the OPI less useful for classroom adaptation. Which form of evaluation is best is an issue that is still hotly debated.
It was noted above that for official purposes, the OPI relies on an administrative network that mandates certified examiners, who pay a significant fee to achieve examiner status. This systematic control of the OPI adds test reliability to the procedure and assures test-takers that examiners are specialists who have gone through a rigorous training course. All these safeguards discourage the appearance of “outlaw” examiners who might render unreliable scores.
On the other hand, the whole idea of an oral interview under the control of an interviewer has come under harsh criticism from a number of language-testing specialists. Valdman (1988, p. 125) summed up the complaint:
From a Vygotskyan perspective, the OPI forces test-takers into a closed system where, because the interviewer is endowed with full social control, they are unable to negotiate a social world. For example, they cannot nominate topics for discussion, they cannot switch formality levels, they cannot display a full range of stylistic maneuvers. The total control the OPI interviewers possess is reflected by the parlance of the test methodology…. In short, the OPI can only inform us of how learners can deal with an artificial social imposition rather than enabling us to predict how they would be likely to manage authentic linguistic interactions with target-language native speakers.
Bachman (1988, p. 149) also pointed out that the validity of the OPI simply cannot be demonstrated “because it confounds abilities with elicitation procedures in its design, and it provides only a single rating, which has no basis in either theory or research.”
Meanwhile, a great deal of experimentation continues to be conducted to design better oral proficiency testing methods (Bailey, 1998; Young & He, 1998). With ongoing critical attention to issues of language assessment in the years to come, we may be able to solve some of the thorny problems of how best to elicit oral production in authentic contexts and to create valid and reliable scoring methods.