
Importance of moderation processes in ensuring the reliability of teacher judgments

Author: Chris Payne

Chris examines the importance of moderation processes in ensuring the reliability of teacher judgments, which in turn affects Catholic Education South Australia’s funding allocations for English as an Additional Language.

1 — Context and focus


In Catholic Education South Australia, schools that wish to be considered for biennial funding allocations from the English as an Additional Language (EAL) program are invited to submit data about the English language proficiency of their EAL cohort. This data comprises assessments using the Language and literacy levels across the Australian Curriculum: EALD Students (DECD 2014).

The language and literacy levels are used predominantly as an assessment for learning tool; however, the data collection for funding allocations necessarily shifts the emphasis to one of reporting and system-level accountability (Maxwell 2002). As Maxwell (2002, p 21) states, ‘official reporting demands some form of moderation if the assessments are to have credibility’; hence, sound assessment moderation practices are crucial to ensure stakeholder confidence, quality assurance, and the resultant equitable allocation of funds.

Current assessment and moderation practices

To submit EAL data, assessors are expected to have participated in training in the use of the language and literacy levels, which includes learning about the model of language that underpins them. Assessors are also expected to participate in a central assessment moderation process.

Apart from the initial training workshops, the official moderation days tied to the data collection are the only formal opportunities for ensuring consistency and comparability of professional assessment judgments (Maxwell 2002).

Traditionally, the main focus of the so-called moderation days was to give teachers a quiet space to do their assessments and occasionally clarify their judgments with a colleague. In more recent years, the EAL team has provided a social moderation process (Linn 1993, in Klenowski & Wyatt-Smith 2013) during which teachers have collaborated to reach a consensus about interpretations. This has helped to support consistency of judgment within and between assessors (Klenowski & Wyatt-Smith 2013). However, there was still some acceptance of a degree of disagreement, which contradicts Maxwell’s (2002) assertion about the intention of moderation: ‘[it] is to resolve any differences of opinion rather than to calculate and accept the degree of disagreement’ (cited in Klenowski & Wyatt-Smith 2013, p 17).

In addition, until now there has been no evaluation of the reliability of the assessment judgments.

The focus group for this study is the cohort of teachers who are responsible for the assessments. The assessors who attend the moderation days are mainly specialist EAL teachers who are experienced in using the language and literacy levels, but some are non-specialists with limited experience.

As a result of my analysis of the current situation, the main questions to be answered are: How reliable are the assessment data? How can reliability be improved? In what ways does the moderation process enhance assessment literacy?


2 — Action plan

The plan for this case study was to refine and enhance the moderation processes in order to ensure the reliability of the assessment judgments, and to gather evidence of this reliability for quality assurance. In doing so, it was intended that the assessment literacy of the assessors would also be enhanced. This reflects the thinking of Klenowski and Wyatt-Smith (2010) and Wyatt-Smith (2016), who assert that positive participation in moderation, even as a system initiative, ‘build[s] teacher assessment capacity, as well as teacher confidence in the judgments they make’ (Klenowski & Wyatt-Smith 2010, p 115).

In designing the professional learning, I based my decisions on Maxwell’s (2002) idea that a moderation process seeks to approve assessor judgments, bearing in mind that there may need to be some adjustment of those judgments to conform to the common standard (or level). It is not a passive process that simply checks how much agreement there is. It is an active process in which assessment judgments are aligned with each other to ensure consistency of interpretation and implementation of standards across the whole system.

Moderation for accountability — based on Maxwell (2002)

Quality assurance (prior)

  • Use of language and literacy levels as a consistent assessment instrument.
  • Requirement that all those who conduct assessments using the language and literacy levels have completed relevant professional development, and understand and know how to use the levels as assessment criteria.
  • Recommendations for task types and the number of assessment tasks to be assessed.
  • Members of a ‘guild of professionals’ (Meiers, Ozolins & McKenzie 2007) work together to establish benchmark levels against which two exemplar writing tasks are assessed. They determine the language and literacy level allocations through discussion and close examination of the writing sample, as per the quality control process outlined in steps 1, 2, and 3 below.

Quality control during moderation sessions

LEARNING: enhance the assessment literacy of assessors.

  1. Assessors collaboratively and then individually assess provided written language samples.
  2. For each assessment, facilitators lead a clarifying discussion to compare and calibrate judgments: What level do you think it is? What evidence indicated or led to your decision about specific language features, e.g. cohesion, text structure, appraisal, word groups, sentence structure?
  3. Assessors annotate their sample and may need to recalibrate their interpretation based on an exemplar which makes explicit the match between the evidence in the work and the standards (levels). Assessors should also provide a commentary about how the standards have been applied (Klenowski & Wyatt-Smith 2013).
    • Sadler (1987) argues that exemplars of student work provide concrete referents to illustrate standards and can show the different manifestations (or variety of evidence) of a particular language and literacy level.
    • Consistency of judgment is possible through making direct links between the identified evidence in the student writing sample and the standards described in the language and literacy levels; in this way, judgments are made defensible (Klenowski & Wyatt-Smith 2013).
  4. Teachers then assess the remainder of their writing samples, applying their ‘calibrated’ interpretation of the levels.

ACCOUNTABILITY: ensuring inter-rater reliability.

  1. During the morning, each assessor submits a sample they have already assessed. The assigned level is recorded on a master sheet but not on the sample.
  2. This sample is given ‘blind’ to another assessor, who assesses and records their judgment of the level.
  3. The levels assigned are reviewed for consistency in interpretation and application of the standard criteria. If there is a discrepancy of two or more levels, a member of the ‘guild of professionals’ conducts a third assessment for comparability.
  4. The assessment judgment that conforms to the standard is verified with the assessor. The judgment that requires adjustment is immediately shared with the relevant assessor so that they can understand why the judgment was altered.

This check also enabled the collection of data for evaluating the overall reliability of assessor judgments.
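The blind double-marking logic in the accountability steps above can be sketched in code. The following Python fragment is illustrative only: the sample IDs and level numbers are invented, not data from the study, and it simply assumes each judgment is recorded as an integer level on a master sheet.

```python
# Illustrative sketch of the blind reliability check described above.
# Two or more levels of disagreement triggers a third assessment by a
# member of the 'guild of professionals'; smaller gaps are accepted.

DISCREPANCY_THRESHOLD = 2

def review_pair(original_level, blind_level):
    """Classify one double-marked sample by the gap between judgments."""
    if abs(original_level - blind_level) >= DISCREPANCY_THRESHOLD:
        return "refer to guild"
    return "consistent"

# Master sheet: sample -> (assessor's level, blind re-assessor's level).
# All values are hypothetical.
master_sheet = {
    "sample_01": (7, 7),
    "sample_02": (5, 6),
    "sample_03": (9, 6),  # three levels apart: needs a third assessment
}

for sample, (original, blind) in master_sheet.items():
    print(sample, review_pair(original, blind))
```

Only the flagged samples go forward to the third, comparability assessment; the rest are treated as consistent interpretations of the standard.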


3 — Findings and recommendations

How reliable are the assessment data?

The data from the reliability check indicated that 85% of the assessors made judgments within the allowable error margin of one level. About 15% of the assessors had discrepancies that required a third assessment review, and these judgments were typically two to three levels ‘out’. Analysis of the data revealed that the assessors who demonstrated the least reliability (i.e. the greatest margin of error in their assessment judgments) were those with the least experience in using the language and literacy levels, or those who had not undertaken the requisite training. That is, they were ‘novices’ as opposed to ‘expert’ assessors (Grainger & Adie 2014).
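As a sketch of how an agreement figure like the one above can be calculated from the double-marked pairs, the following assumes each judgment is an integer level; the pairs shown are invented purely to illustrate the arithmetic.

```python
def reliability_rate(pairs, tolerance=1):
    """Proportion of double-marked samples whose two judgments
    differ by no more than `tolerance` levels."""
    within = sum(1 for first, second in pairs if abs(first - second) <= tolerance)
    return within / len(pairs)

# Invented (assessor level, blind re-assessment level) pairs: 20 samples,
# 17 agreeing within one level and 3 with larger discrepancies.
pairs = [(4, 4), (6, 5), (7, 8), (3, 3), (5, 5), (2, 2), (8, 7),
         (6, 6), (9, 9), (4, 5), (7, 7), (3, 4), (5, 6), (8, 8),
         (2, 3), (6, 7), (9, 8), (5, 2), (7, 10), (3, 6)]

print(f"{reliability_rate(pairs):.0%} of judgments within one level")  # 85%
```

The same function with `tolerance=0` would report exact agreement, a stricter reading of consistency than the one-level margin used here.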

Another option, during the non-data collection years, would be to build moderation practices into the language and literacy levels training workshops so that they are experienced as a ‘natural’ part of building assessment literacy. This would also support the development of an assessment community and the de-privatisation of assessment practice (Wyatt-Smith 2016).

There is also scope for developing additional exemplars that represent typical achievement of a level, as well as showing what the top and lower end of a level actually looks like. This knowledge would also be useful for students (Wyatt-Smith 2016).


4 — Evaluation

I have learned that the moderation process can be a positive professional learning experience while still meeting quality assurance standards. Initially I was nervous about introducing the ‘extra’ element of the reliability check; I feared that experienced assessors might resent even more of their ‘levelling’ time being taken away. However, teachers overwhelmingly valued the affirmation they received from the process, and those whose judgments needed to be adjusted valued the ‘just-in-time’ professional learning it offered. Although there was a risk of novice assessors feeling inadequate, the feedback indicates that they, too, saw it as a professional learning opportunity.

Furthermore, as an organisation we can now be confident about the reliability of the data, and that confidence is supported by evidence.

Overall I feel that the moderation processes were conducted in a spirit of professional learning and quality improvement and they led to improved learning and teacher practice (Certified Educational Assessor [CEA], Module 5). The next step is to make this a more regular and embedded practice, beyond the requirements of the data collection process.



References

Custance, B 2014, Language and literacy levels across the Australian Curriculum: EALD students, Department of Education and Child Development, https://www.decd.sa.gov.au/teaching/curriculum-and-teaching/numeracy-and-literacy/english-additional-language-or-dialect

Grainger, P & Adie, L 2014, ‘How do preservice teacher education students move from novice to expert assessors?’, Australian Journal of Teacher Education, vol 39, no 7, viewed 21 March 2018, http://ro.ecu.edu.au/ajte/vol39/iss7/6/

Klenowski, V & Wyatt-Smith, C 2010, ‘Standards, teacher judgement and moderation in contexts of national curriculum and assessment reform’, Assessment Matters, vol 2, pp 107-131

Klenowski, V & Wyatt-Smith, C 2013, Assessment for education: standards, judgement and moderation, SAGE Publications

Maxwell, G 2002, Moderation of teacher judgements in student assessment, Queensland School Curriculum Council, Brisbane

Meiers, M, Ozolins, C & McKenzie, P 2007, Improving consistency in teacher judgements, Australian Council for Educational Research, viewed 21 March 2018, http://research.acer.edu.au/cgi/viewcontent.cgi?article=1020&context=tll_misc

Sadler, R 1987, ‘Specifying and promulgating achievement standards’, Oxford Review of Education, vol 13, no 2

Sadler, R 2012, ‘Assuring academic achievement standards: from moderation to calibration’, Assessment in Education: Principles, Policy and Practice, vol 20, no 1, viewed 21 March 2018, https://www.tandfonline.com/doi/full/10.1080/0969594X.2012.714742

Wyatt-Smith, C 2016, ‘Why teacher practice, classroom assessment, standards and moderation matter more than ever’, The power of assessment: what every teacher must understand conference, Institute of Educational Assessors, Adelaide, 27 June 2016
