Writing Assessment for Second Language Learners
Writing assessment for second language (L2) learners involves evaluating their writing abilities while considering various factors that can affect the assessment's fairness and accuracy. Unlike native speakers, L2 learners face additional complexities, as these assessments often measure both writing skill and overall language proficiency. Different assessment purposes, such as diagnostic, developmental, and promotional evaluations, influence the choice of scoring methods, which can be holistic—providing a single score based on overall impression—or analytic, focusing on specific writing traits.
Key elements that impact assessment outcomes include the writer’s background, the writing task, the written product, and the raters involved. Each of these components can introduce variability, as writers bring unique cultural contexts and experiences that shape their writing. Furthermore, raters’ judgments can differ based on their backgrounds and perspectives. Assessment validity is crucial, as it ensures that tests measure what they intend to measure, and concerns about high-stakes testing highlight the potential for misuse, particularly regarding cultural biases.
Alternative assessment methods, such as writing portfolios, are gaining recognition for promoting self-reflection and a more comprehensive evaluation of a writer's abilities. Overall, effective writing assessment for L2 learners requires careful consideration of these factors to provide meaningful feedback and support learners' growth in a culturally sensitive manner.
Keywords: Analytic Scoring; Assessment; Concurrent Validity; Construct Validity; Contrastive Rhetoric; Critical Language Testing; High-Stakes Tests; Holistic Scoring; Interrater Reliability; Second Language (L2) Learners; Writing Assessment; Writing Portfolio
Overview
In the evaluation of second language (L2) learners' writing abilities, many factors can influence the fairness and accuracy of the assessment. As in evaluating native speakers, the assessment results may vary depending on what kind of test is given, who scores the test, and what criteria are applied to the final product. Because L2 writing assessments are frequently used to measure not just writing ability but also language proficiency, an additional complicating factor is present. This article examines the most important factors that must be considered when appraising L2 student writing.
Before designing or administering a writing assessment, the purpose of the assessment must be clearly identified. Three common purposes for evaluating L2 writers are:
• Diagnostic,
• Developmental and
• Promotional.
Diagnostic or entrance exams are given in order to determine placement within language programs. Developmental or progress assessments give an indication of a writer's growth over time. Promotional or exit tests assess whether the writer is ready to advance to the next level or graduate from a program.
Holistic Scoring
The assessment's purpose influences the choice of rating scale used to evaluate the test. Typically, assessments are evaluated using either holistic or analytical scales. Holistic scoring involves assigning a single score based on a rater's overall impression of a particular piece of work. The rater is usually given a rubric of criteria outlining expected norms for each level of writing. However, scoring relies heavily on a rater's training and expertise to intuitively weight the criteria before producing a final mark. Holistic scoring is frequently used for diagnostic testing. It is also popular for large-scale assessments such as the Test of English as a Foreign Language (TOEFL) because it saves time and money.
Analytic Scoring
Analytic scoring, on the other hand, is often used when the assessment is meant to provide feedback regarding a student's progress or readiness for promotion. Analytic scales identify one or more traits of writing to be assessed. Primary trait scales evaluate one main aspect of a text. Multiple trait scales focus on more than one component of a work (Bacha, 2001).
An issue in both analytic and holistic scoring is determining what aspects of the writing should be assessed. In other words, what are the components of good writing? Investigations in both the field of composition and second language learning have been conducted to answer this question (Casanave, 2004; Cumming, Kantor & Powers, 2002). However, because writing is a complex mental activity that involves creativity, because there are many purposes for which one writes, and because formats affect style, no one set of criteria can be said to definitively capture the best qualities of writing. Moreover, no one has yet identified a definite developmental sequence in writing. Rather, it appears that students' writing performance is variable. They may perform better in one area, such as being able to write a complex sentence, while doing poorly in another, such as organizing a longer text (Bacha, 2001).
What Should Be Assessed?
Despite the difficulties in defining the best qualities of writing, in order to make judgments, criteria must be established. Studies have been conducted to determine what native speakers and experienced raters consider to be good writing. Cumming et al. (2002) compared the evaluation processes of experienced English mother-tongue composition raters and experienced ESL/EFL raters and found that both groups tended to report the following qualities as being particularly effective in writing for a composition exam:
• Rhetorical organization: including introductory statements, development, cohesion, fulfillment of the writing task;
• Expression of ideas: including logic, argumentation, clarity, uniqueness, and supporting points;
• Accuracy and fluency of English grammar and vocabulary; and
• The amount of written text produced (p. 72).
Other researchers have developed frameworks for evaluation that include specific grammatical and discourse features. Chiang (1999), in a study that looked at the relative importance of these features to raters, evaluated 35 textual features under the categories of morphology, syntax, cohesion and coherence. Haan & Van Esch's (2004) framework considered overall quality, linguistic accuracy, syntactic complexity, lexical features, content, mechanics, coherence & discourse, fluency and revision. Connor and Mbaye (2002), in an attempt to apply a model of communicative competence to writing assessment, suggest the following breakdown of linguistic skills:
• Grammatical competence - spelling, punctuation, knowledge of words and structures;
• Discourse competence - knowledge of discourse, organization of genre, cohesion and coherence;
• Sociolinguistic competence - written genre appropriateness, audience awareness and appeals to audience, pertinence of claim and tone; and
• Strategic competence - use of transitions and metatextual markers.
Finally, to conclude this brief survey of ways to categorize writing components, the ESL Composition Profile, which has been frequently adopted or modified for research and assessment, consists of five categories (a simple worked illustration of analytic scoring with these categories follows the list):
• Content,
• Organization,
• Vocabulary,
• Language and
• Mechanics (Jacobs, Zinkgraf, Wormuth, Hartfiel, & Hughey, 1981).
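To make the contrast with holistic scoring concrete, the short sketch below shows how a multiple-trait analytic score might be combined from separate trait ratings. The trait names echo the ESL Composition Profile categories listed above, but the 1-5 rating scale and the weights are hypothetical values chosen only for illustration; actual instruments define their own scales and weightings.

```python
# A minimal sketch of multiple-trait analytic scoring.
# Trait names follow the ESL Composition Profile categories above;
# the weights and the 1-5 rating scale are hypothetical, for illustration only.

TRAIT_WEIGHTS = {
    "content": 0.30,
    "organization": 0.20,
    "vocabulary": 0.20,
    "language": 0.25,
    "mechanics": 0.05,
}

def analytic_score(ratings: dict[str, int]) -> float:
    """Combine per-trait ratings (each 1-5) into a single weighted composite."""
    if set(ratings) != set(TRAIT_WEIGHTS):
        raise ValueError("A rating is required for every trait.")
    return sum(TRAIT_WEIGHTS[trait] * rating for trait, rating in ratings.items())

# Example: a writer strong in content but weak in mechanics.
sample = {"content": 4, "organization": 3, "vocabulary": 3, "language": 3, "mechanics": 2}
print(round(analytic_score(sample), 2))  # weighted composite on the 1-5 scale
```

Because the per-trait ratings are retained alongside the composite, an analytic assessment can also serve diagnostic or developmental feedback purposes in a way that a single holistic score cannot.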
Other Variables
While a clearly defined purpose and an appropriate assessment scale are important, these two factors alone cannot ensure that tests are scored fairly and accurately. Into the testing mix are thrown several other variables that influence the outcome of any assessment. Kroll (1998) summarizes these critical variables as follows:
• The writer whose work is to be assessed;
• The writing task(s) that writers have been asked to complete;
• The written product(s) subject to assessment;
• The reader(s) who score or rank the product;
• The scoring procedures, which can be subdivided into the scoring guidelines and the actual reading procedures (p. 223).
The Writer
The writer, naturally, has a central role in the outcome of the writing assessment. The writer brings to the task a unique background composed of several variables that affect writing performance. Level of language proficiency, cultural background, familiarity with testing situations, motivation to complete the task, and prior educational experiences can all influence how well the writer performs in a novel testing situation (Kroll, 1998).
The Task
Creating a writing task that is valid and reliable is important in any kind of assessment. In writing, tests are said to have content validity when they ask test-takers to perform the same kinds of tasks that they would perform in the classroom (Bacha, 2001). Yet ESL writers in a language program typically represent multiple disciplines. Creating prompts that are general enough to be fair to all but specific enough to allow individuals to draw upon prior knowledge can be difficult. The wording of the prompt, its mode of discourse, rhetorical specifications, and subject matter can affect results (Tedick, 1990). For instance, Tedick (1990) compared the impact on writing performance of a field-specific topic vs. a general writing topic. Drawing on research in cognitive psychology and reading research showing that comprehension of a text depends in part on a reader's prior knowledge and experience, she hypothesized that if students were familiar with the subject matter of the prompt, they would write more effectively. In a study of 105 graduate students who were asked to write on both general and discipline-specific topics, she found that not only was student writing performance on a field-specific topic superior, but such topics were also better at identifying writers' varying levels of writing proficiency.
Additionally, in order for tasks to give a fair assessment, prompts must not be laden with cultural knowledge that is unfamiliar to someone outside the target language culture (e.g. questions about pop culture or historical figures) and enough time must be given for students to complete the task. Because writing is a process, sufficient time frequently involves not just time to write a response to a prompt, but also time to plan and revise the response (Casanave, 2004; Kroll, 1998).
The Product
Although the product that is produced during an assessment does not have an active role in the testing situation, perceptions of the product are in part due to the expectations that readers have for particular texts. Texts generally fall into a variety of genres that include particular rhetorical strategies. Some research indicates that genres, discourse structures and rhetorical forms may vary across cultures. In the field of contrastive rhetoric, researchers seek to understand these culturally bound differences. While the research is somewhat controversial, in that some have argued that early research in this area overemphasized culturally related differences in discourse structure, it remains a point of investigation in the field of L2 writing (Casanave, 2004). One of the main differences noted between English academic text and other forms of writing is that English academic prose tends to be linear in its development. For instance, in a study of English and French paragraph writing, it was noted that English paragraphs began with a topic sentence and developed the main idea of the sentence. French paragraphs, on the other hand, tended not to be organized around a topic sentence but instead seemed to be more of a loose collection of data (Régent, 1985, as cited in Casanave, 2004).
The Rater
Perhaps as significant as the writer is the reader, or rater, of the test. In writing assessment, a consensus in scoring by two or more raters is generally considered an accurate assessment of a writer's work. Interrater reliability is achieved when raters apply the same criteria to a single piece of writing and assign the same or very similar scores. Intrarater reliability is achieved when the same rater applies the same criteria in the same way over time (an important consideration, since it would not be fair for a rater to assign a different score when tired than when wide awake). Ensuring that raters apply the same criteria in the same way is not simple. Like the writer, raters have extensive background knowledge and experience that affect their perceptions of the written product. Raters with backgrounds in the fields of ESL, English composition, and other disciplines have been found to differ on what features they deem to be important in writing and on how much weight they give to various writing components (Cumming et al., 2002; Weigle, Boldt, & Valsecchi, 2003). Similarly, raters from different cultural backgrounds sometimes appreciate different features of writing. In a study of American and Chinese raters who read English essays, Americans expressed appreciation for clear logic and a clear opening, while the Chinese raters valued essays that expressed sentiment, natural scenes, and a moral message (Casanave, 2004). Finally, raters may become less reliable when they move away from judging accuracy of form to judging more subjective aspects of writing, such as appropriateness of meaning (Chiang, 1999).
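As a rough illustration of how interrater reliability is often monitored, the sketch below computes exact-agreement and adjacent-agreement rates (scores differing by no more than one point) for two raters scoring the same set of essays. The scores are invented for illustration; operational programs typically supplement such figures with correlation or kappa statistics and may route widely discrepant papers to a third rater.

```python
# A minimal sketch of checking interrater agreement between two raters.
# The essay scores are invented; real programs also report correlations or kappa.

rater_a = [4, 3, 5, 2, 4, 3, 5, 1]  # hypothetical scores on an eight-essay set
rater_b = [4, 4, 5, 2, 3, 3, 4, 2]

pairs = list(zip(rater_a, rater_b))
exact = sum(a == b for a, b in pairs) / len(pairs)
adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)

print(f"Exact agreement:    {exact:.0%}")     # identical scores
print(f"Adjacent agreement: {adjacent:.0%}")  # scores within one point
```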
In order to compensate for raters' individual differences and achieve the greatest degree of interrater reliability, Bacha (2001) suggests the following guidelines: raters should focus on the task objectives and apply the same criteria with a common understanding of them, novice raters should be trained to approximate expert raters, and all raters should be sensitive to the writing strategies of other cultures. At the same time, others suggest training for students who take the test. By making test-takers aware of audience expectations, writers can formulate their ideas in ways that are most acceptable to raters of a particular test (Casanave, 2004).
An interesting finding about raters who are also instructors concerns how instructors conceptualize their teaching purpose and how this conceptualization influences their use of assessments. Cumming (2001) found that instructors who viewed themselves as teaching English for specific purposes (such as to get into a particular university program) viewed and used assessment differently than instructors who perceived their purpose as teaching English for more general purposes. Instructors teaching for specific purposes offered clear rationales for choosing assessments and gave specific standards for achievement. As a result of these very specific standards, they used limited forms of assessment. On the other hand, instructors who viewed their purpose as preparing students to use English in more general settings used a wide range of assessments and offered a broad range of criteria for judging achievement. These findings highlight the fact that the subjectivity of the tester/rater has a great impact on both the testing situation and its outcomes.
Further Insights
Validity
Test-makers often go out of their way to ensure that tests have construct and concurrent validity. In other words, they seek to ensure that the tests measure what they say they measure and that results on the tests are similar to results achieved on a comparable test with a similar population. However, some people question the validity of timed testing in general. They argue that writing under timed conditions is unnatural and that, because writing is a process that involves revision, timed tests do not give students sufficient opportunity to fully utilize their writing skills. Moreover, the quality of any one individual's writing varies depending on the type of writing being required and the writer's familiarity with the format or topic. Thus, some advocate for the use of alternative assessments such as the writing portfolio (Casanave, 2004; Yang, 2003).
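For readers unfamiliar with the statistics behind such claims, the brief sketch below shows the kind of check that underlies concurrent validity: scores on a new test are correlated with scores the same examinees earned on an established, comparable test. The score data here are invented for illustration only.

```python
# A minimal sketch of a concurrent-validity check: correlate scores on a new
# writing test with scores the same examinees earned on an established test.
# The data are invented for illustration.
from statistics import correlation  # Pearson's r; available in Python 3.10+

new_test    = [62, 71, 55, 80, 68, 74, 59, 85]
established = [60, 75, 58, 82, 65, 70, 61, 88]

r = correlation(new_test, established)
print(f"Pearson r = {r:.2f}")  # values near +1 support a claim of concurrent validity
```

A strong positive correlation supports, but does not by itself establish, the validity argument for a new test.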
Portfolios
A writing portfolio is a collection of student writing on a variety of topics in a variety of different formats. Typically, students are involved in the evaluation process, selecting, with the help of teachers, pieces they feel represent their best work. Proponents of portfolios say that the self-reflection and self-assessment that students engage in when preparing a portfolio is one of this assessment's greatest benefits. By developing metacognitive awareness of their writing processes and development, students, they say, not only gain a sense of achievement, but also take control of their own learning in a way that impacts future learning situations (Yang, 2003).
Another benefit of portfolios is that students report feeling more positive about their learning after participating in the portfolio assessment process (Barootchi & Keshavarz, 2002). In particular, students who typically like learning strategies that involve writing down what they hear, read, or speak, or who like working in pairs and small groups, enjoy using portfolios and believe the process improves their writing abilities (Yang, 2003).
On the other hand, portfolios can be time-consuming and difficult to manage. In a compilation of student comments on portfolios, Yang (2003) reported that some students complained of the work involved in portfolio preparation. Student comments indicated that the obstacles they faced when organizing their portfolios included time management, overcoming poor learning attitudes such as laziness or procrastination, maintaining records, deciding how to select and organize work, and dealing with computer-related and/or other access problems. Other concerns about portfolios relate to the ability of raters to score them reliably, since each student's portfolio consists of a different set of documents. In general, this issue is what makes portfolio assessment impractical for most large-scale assessments (Casanave, 2004).
Yet despite these difficulties, many feel that portfolios are worth the effort. Along with the benefits mentioned above, portfolios have been used to improve home-school connections for students who are learning a second language (Paratore, Homza, & Krol-Sinclair, 1995). Studies comparing student achievement as measured in portfolios with achievement measured through other types of assessment have found that portfolios are sometimes more successful than other forms of assessment in identifying students who will succeed in future writing courses (Song & August, 2002).
Viewpoints
One of the controversies in L2 writing assessment, as in all areas of assessment, surrounds the use of high-stakes tests. High-stakes tests are single tests that have far-reaching consequences for the test-taker. Examples include tests that determine whether a student graduates from high school, is admitted to a particular university program, or is placed on an individualized learning plan in elementary school. One of the reasons that high-stakes tests are controversial is that there is evidence that such tests are frequently used by those in power to manipulate educational systems and to impose particular ideological agendas (Shohamy, 2001).
Tests can have this effect because those whose lives are most affected by the results, whether an individual or a collective system, typically change their behavior in order to maximize their test scores. Thus, schools will "teach to the test," emphasizing information that the test-makers deem important while downplaying other information, and individuals will adopt ideological positions that they think the test-makers believe to be "correct." In second language learning situations, tests may be used by one dominant cultural group to maintain power over another. For example, tests that require the test-taker to have particular cultural knowledge or to adopt certain ideologies could function to control the minority cultural group either by requiring acceptance of dominant cultural norms or by limiting access to those who do not adopt the norms (e.g., citizenship tests).
In response to concerns about the power of tests, Shohamy (2001) writes that all testers should apply the standards of critical language testing to monitor the use of tests, to challenge test assumptions, and to examine the consequences of their use. Shohamy advocates for democratic alternatives to traditional assessments, arguing that individuals and citizens need to participate collectively in systems that require testing. Such participation could include parents, teachers, and students working collaboratively to develop tests or other assessments in a school. With a critical language testing approach, Shohamy believes, assessments would be more likely to consider the knowledge base of different groups as well as provide multiple procedures for interpreting the knowledge of individuals or groups. Thus, the ability of those in power to use tests to control others would be lessened.
In conclusion, evaluating second language learners' writing is a complex process that requires accounting for multiple factors to ensure a fair and accurate judgment of the writer's abilities. Testers must be sure to set clear purposes for assessment, choose appropriate scoring criteria, and be aware of how multiple variables in the testing situation can influence outcomes. Testers should also be aware of how assessments can be used to control or manipulate educational systems and/or individuals. By attending to these issues, students, teachers, and others involved in the testing process have a greater chance of producing meaningful assessments that provide useful feedback to learners.
Terms & Concepts
Analytic Scoring: Analytic scoring involves identifying one or more traits of a written work to be assessed. A score is assigned for each component of the writing.
Concurrent Validity: A test is said to have concurrent validity when scores reported on the test correlate significantly and positively with scores on another test that attempts to measure the same or similar skills.
Construct Validity: Construct validity describes the ability of an assessment to measure what it is supposed to measure.
Contrastive Rhetoric: In the field of contrastive rhetoric, researchers compare texts from different cultures in order to identify differences in structure and rhetoric.
Critical Language Testing: Critical language testing refers to a perspective on testing that seeks to question and understand the underlying power issues that relate to testing.
Holistic Scoring: Holistic scoring involves assigning one score to a written text based on the rater's overall impression of the text. This approach is frequently used for placement purposes.
Interrater Reliability: Interrater reliability refers to the consistency with which individual raters will give the same score to the same text when they use the same scoring criteria.
Writing Portfolio: A writing portfolio is a collection of student writing samples on a variety of topics and in a variety of formats.
Bibliography
Bacha, N. (2001). Writing evaluation: What can analytic versus holistic essay scoring tell us? System: An International Journal of Educational Technology and Applied Linguistics, 29, 371-383.
Barootchi, N., & Keshavarz, M. H. (2002). Assessment of achievement through portfolios and teacher-made tests. Educational Research, 44, 279-288.
Campbell, H., Espin, C., & McMaster, K. (2013). The technical adequacy of curriculum-based writing measures with English learners. Reading & Writing, 26, 431-452. Retrieved December 15, 2013, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=85399321&site=ehost-live
Casanave, C. (2004). Controversies in second language writing. Ann Arbor: The University of Michigan Press.
Chiang, S. (1999). Assessing grammatical and textual features in L2 writing samples: The case of French as a foreign language. The Modern Language Journal, 83, 219-232.
Connor, U., & Mbaye, A. (2002). Discourse approaches to writing assessment. Annual Review of Applied Linguistics, 22, 263-278.
Cumming, A. (2001). ESL/EFL instructors' practices for writing assessment: Specific purposes or general purposes? Language Testing, 18, 207-224.
Cumming, A., Kantor, R., & Powers, D. (2002). Decision making while rating ESL/EFL writing tasks: A descriptive framework. The Modern Language Journal, 86, 67-96.
Fahim, M., & Jalili, S. (2013). The impact of writing portfolio assessment on developing editing ability of Iranian EFL learners. Journal of Language Teaching & Research, 4, 496-503. Retrieved December 15, 2013, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=90356426&site=ehost-live
Haan, P., & Van Esch, K. (2004). Towards an instrument for the assessment of the development of writing skills. In U. Connor & T. Upton (Eds.), Applied corpus linguistics: A multidimensional perspective (pp. 267-279). Amsterdam, The Netherlands.
Jacobs, H.J., Zinkgraf, S.A., Wormuth, D.R., Hartfiel, V.F., & Hughey, J.B. (1981). Testing ESL composition: A practical approach. Rowley, MA: Newbury House.
Kachchaf, R., & Solano-Flores, G. (2012). Rater language background as a source of measurement error in the testing of English Language Learners. Applied Measurement in Education, 25, 162-177. Retrieved December 15, 2013, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=74073516&site=ehost-live
Kroll, B. (1998). Assessing writing abilities. Annual Review of Applied Linguistics, 18, 219-240.
Paratore, J., Homza, A., & Krol-Sinclair, B. (1995). Shifting boundaries in home and school responsibilities: The construction of home-based literacy portfolios by immigrant parents and their children. Research in the Teaching of English, 29, 367-389.
Shohamy, E. (2001). Democratic assessment as an alternative. Language Testing, 18, 373-391.
Song, B., & August, B. (2002). Using portfolios to assess the writing of ESL students: A powerful alternative? Journal of Second Language Writing, 11, 49-72.
Tedick, D. J. (1990). ESL writing assessment: Subject-matter knowledge and its impact on performance. English for Specific Purposes, 9, 123-143.
Weigle, S. C., Boldt, H., & Valsecchi, M. I. (2003). Effects of task and rater background on the evaluation of ESL student writing: A pilot study. TESOL Quarterly, 37, 345-354.
Yang, N. (2003). Integrating portfolios into learning strategy-based instruction for EFL college students. IRAL, 41, 293-317.
Suggested Reading
Casanave, C. (2004). Controversies in second language writing. Ann Arbor: The University of Michigan Press.
Hamp-Lyons, L. (Ed.). (1991). Assessing second language writing in academic contexts. Norwood, NJ: Ablex.
Hamp-Lyons, L., & Kroll, B. (1996). Issues in ESL writing assessment: An overview. College ESL, 6(1), 52-72.