Adapted from M.E. Piontek (2008)
Center for Research on Learning and Teaching
The most obvious function of assessment methods (such as exams, quizzes, papers, and presentations) is to enable instructors to make judgments about the quality of student learning (i.e., assign grades). However, the method of assessment also can have a direct impact on the quality of student learning. Students assume that the focus of exams and assignments reflects the educational goals most valued by an instructor, and they direct their learning and studying accordingly (McKeachie & Svinicki, 2006). General grading systems can have an impact as well. For example, a strict bell curve (i.e., norm-reference grading) has the potential to dampen motivation and cooperation in a classroom, while a system that strictly rewards proficiency (i.e., criterion-referenced grading) could be perceived as contributing to grade inflation. Given the importance of assessment for both faculty and student interactions about learning, how can instructors develop exams that provide useful and relevant data about their students' learning and also direct students to spend their time on the important aspects of a course or course unit? How do grading practices further influence this process?
Guidelines for Designing Valid and Reliable Exams
Ideally, effective exams have four characteristics. They are:
- Valid, (providing useful information about the concepts they were designed to test),
- Reliable (allowing consistent measurement and discriminating between different levels of performance),
- Recognizable (instruction has prepared students for the assessment), and
- Realistic (concerning time and effort required to complete the assignment) (Svinicki, 1999).
Most importantly, exams and assignments should focus on the most important content and behaviors emphasized during the course (or particular section of the course). What are the primary ideas, issues, and skills you hope students learn during a particular course/unit/module? These are the learning outcomes you wish to measure. For example, if your learning outcome involves memorization, then you should assess for memorization or classification; if you hope students will develop problem-solving capacities, your exams should focus on assessing students’ application and analysis skills. As a general rule, assessments that focus too heavily on details (e.g., isolated facts, figures, etc.) “will probably lead to better student retention of the footnotes at the cost of the main points" (Halpern & Hakel, 2003, p. 40). As noted in Table 1, each type of exam item may be better suited to measuring some learning outcomes than others, and each has its advantages and disadvantages in terms of ease of design, implementation, and scoring.
Table 1: Advantages and Disadvantages of Commonly Used Types of Achievement Test Items
Type of Item
Many items can be administered in a relatively short time. Moderately easy to write; easily scored.
Limited primarily to testing knowledge of information. Easy to guess correctly on many items, even if material has not been mastered.
Can be used to assess broad range of content in a brief period. Skillfully written items can measure higher order cognitive skills. Can be scored quickly.
Difficult and time consuming to write good items. Possible to assess higher order cognitive skills, but most items assess only knowledge. Some correct answers can be guesses.
Items can be written quickly. A broad range of content can be assessed. Scoring can be done efficiently.
Higher order cognitive skills are difficult to assess.
Short Answer or Completion
Many can be administered in a brief amount of time. Relatively efficient to score. Moderately easy to write.
Difficult to identify defensible criteria for correct answers. Limited to questions that can be answered or completed in very few words.
Can be used to measure higher order cognitive skills. Relatively easy to write questions. Difficult for respondent to get correct answer by guessing.
Time consuming to administer and score. Difficult to identify reliable criteria for scoring. Only a limited range of content can be sampled during any one testing period.
Adapted from Table 10.1 of Worthen, et al., 1993, p. 261.
General Guidelines for Developing Multiple-Choice and Essay Questions
The following sections highlight general guidelines for developing multiple-choice and essay questions, which are often used in college-level assessment because they readily lend themselves to measuring higher order thinking skills (e.g., application, justification, inference, analysis and evaluation). Yet instructors often struggle to create, implement, and score these types of questions (McMillan, 2001; Worthen, et al., 1993).
Multiple-choice questions have a number of advantages. First, they can measure various kinds of knowledge, including students' understanding of terminology, facts, principles, methods, and procedures, as well as their ability to apply, interpret, and justify. When carefully designed, multiple-choice items also can assess higher-order thinking skills.
Multiple-choice questions are less ambiguous than short-answer items, thereby providing a more focused assessment of student knowledge. Multiple-choice items are superior to true-false items in several ways: on true-false items, students can receive credit for knowing that a statement is incorrect, without knowing what is correct. Multiple-choice items offer greater reliability than true-false items as the opportunity for guessing is reduced with the larger number of options. Finally, an instructor can diagnose misunderstanding by analyzing the incorrect options chosen by students.
A disadvantage of multiple-choice items is that they require developing incorrect, yet plausible, options that can be difficult to create. In addition, multiple- choice questions do not allow instructors to measure students’ ability to organize and present ideas. Finally, because it is much easier to create multiple-choice items that test recall and recognition rather than higher order thinking, multiple-choice exams run the risk of not assessing the deep learning that many instructors consider important (Greenland & Linn, 1990; McMillan, 2001).
Guidelines for writing multiple-choice items include advice about stems, correct answers, and distractors (McMillan, 2001, p. 150; Piontek, 2008):
- Stems pose the problem or question.
- Is the stem stated as clearly, directly, and simply as possible?
- Is the problem described fully in the stem?
- Is the stem stated positively, to avoid the possibility that students will overlook terms like “no,” “not,” or “least”?
- Does the stem provide only information relevant to the problem?
Possible responses include the correct answer and distractors, or the incorrect choices. Multiple-choice questions usually have at least three distractors.
- Are the distractors plausible to students who do not know the correct answer?
- Is there only one correct answer?
- Are all the possible answers parallel with respect to grammatical structure, length, and complexity?
- Are the options short?
- Are complex options avoided? Are options placed in logical order?
- Are correct answers spread equally among all the choices? (For example, is answer “A” correct about the same number of times as options “B” or “C” or “D”)?
An example of good multiple-choice questions that assess higher-order thinking skills is the following test question from pharmacy (Park, 2008):
Patient WC was admitted for third-degree burns over 75% of his body. The attending physician asks you to start this patient on antibiotic therapy. Which one of the following is the best reason why WC would need antibiotic prophylaxis?
a. His burn injuries have broken down the innate immunity that prevents microbial invasion. b. His injuries have inhibited his cellular immunity. c. His injuries have impaired antibody production. d. His injuries have induced the bone marrow, thus activated immune system
A second question builds on the first by describing the patient’s labs two days later, asking the students to develop an explanation for the subsequent lab results. (See Piontek, 2008 for the full question.)
Essay questions can tap complex thinking by requiring students to organize and integrate information, interpret information, construct arguments, give explanations, evaluate the merit of ideas, and carry out other types of reasoning (Cashin, 1987; Gronlund & Linn, 1990; McMillan, 2001; Thorndike, 1997; Worthen, et al., 1993). Restricted response essay questions are good for assessing basic knowledge and understanding and generally require a brief written response (e.g., “State two hypotheses about why birds migrate. Summarize the evidence supporting each hypothesis” [Worthen, et al., 1993, p. 277].) Extended response essay items allow students to construct a variety of strategies, processes, interpretations and explanations for a question, such as the following:
The framers of the Constitution strove to create an effective national government that balanced the tension between majority rule and the rights of minorities. What aspects of American politics favor majority rule? What aspects protect the rights of those not in the majority? Drawing upon material from your readings and the lectures, did the framers successfully balance this tension? Why or why not? (Shipan, 2008).
In addition to measuring complex thinking and reasoning, advantages of essays include the potential for motivating better study habits and providing the students flexibility in their responses. Instructors can evaluate how well students are able to communicate their reasoning with essay items, and they are usually less time consuming to construct than multiple-choice items that measure reasoning.
The major disadvantages of essays include the amount of time instructors must devote to reading and scoring student responses, and the importance of developing and using carefully constructed criteria/rubrics to insure reliability of scoring. Essays can assess only a limited amount of content in one testing period/exam due to the length of time required for students to respond to each essay item. As a result, essays do not provide a good sampling of content knowledge across a curriculum (Gronlund & Linn, 1990; McMillan, 2001).
Guidelines for writing essay questions include the following (Gronlund & Linn, 1990; McMillan, 2001; Worthen, et al., 1993):
- Restrict the use of essay questions to educational outcomes that are difficult to measure using other formats. For example, to test recall knowledge, true-false, fill-in-the-blank, or multiple-choice questions are better measures.
- Identify the specific skills and knowledge that will be assessed. Piontek (2008, available online: http://www.crlt.umich.edu/sites/default/files/resource_files/CRLT_no24.pdf) gives several examples of question stems that assess different types of reasoning skills, such as:
- Generalizations: State a set of principles that can explain the following events.
- Synthesis: Write a well-organized report that shows…
- Evaluation: Describe the strengths and weaknesses of…
- Write the question clearly so that students do not feel that they are guessing at “what the instructor wants me to do.”
- Indicate the amount of time and effort students should spend on each essay item.
- Avoid giving students options for which essay questions they should answer. This choice decreases the validity and reliability of the test because each student is essentially taking a different exam.
- Consider using several narrowly focused questions (rather than one broad question) that elicit different aspects of students’ skills and knowledge.
- Make sure there is enough time to answer the questions.
Guidelines for scoring essay questions include the following (Gronlund & Linn, 1990; McMillan, 2001; Wiggins, 1998; Worthen, et al., 1993; Writing and grading essay questions, 1990):
- Outline what constitutes an expected answer.
- Select an appropriate scoring method based on the criteria. A rubric is a scoring key that indicates the criteria for scoring and the amount of points to be assigned for each criterion. A sample rubric for a take-home history exam question might look like the following:
Level of performance
Number of references to class reading sources
Lots of inaccuracies
No apparent inaccuracies
No argument made; little evidence for argument
Argument is vague and unevenly supported by evidence
Argument is clear and well-supported by evidence
Many grammar and spelling errors
Few (1-2) grammar or spelling errors
No grammar or spelling errors
Total Points (out of 8 possible):
For other examples of rubrics, see CRLT Occasional Paper #24 (Piontek, 2008).
- Clarify the role of writing mechanics and other factors independent of the educational outcomes being measured. For example, how does grammar or use of scientific notation figure into your scoring criteria?
- Create anonymity for students’ responses while scoring and create a random order in which tests are graded (e.g., shuffle the pile) to increase accuracy of the scoring.
- Use a systematic process for scoring each essay item. Assessment guidelines suggest scoring all answers for an individual essay question in one continuous process, rather than scoring all answers to all questions for an individual student. This system makes it easier to remember the criteria for scoring each answer.
You can also use these guidelines for scoring essay items to create grading processes and rubrics for students’ papers, oral presentations, course projects, and websites. For other grading strategies, see Responding to Student Writing – Principles & Practices and Commenting Effectively on Student Writing.
Cashin, W. E. (1987). Improving essay tests. Idea Paper, No. 17. Manhattan, KS: Center for Faculty Evaluation and Development, Kansas State University.
Gronlund, N. E., & Linn, R. L. (1990). Measurement and evaluation in teaching (6th ed.). New York: Macmillan Publishing Company.
Halpern, D. H., & Hakel, M. D. (2003). Applying the science of learning to the university and beyond. Change, 35(4), 37-41.
McKeachie, W. J., & Svinicki, M. D. (2006). Assessing, testing, and evaluating: Grading is not the most important function. In McKeachie's Teaching tips: Strategies, research, and theory for college and university teachers (12th ed., pp. 74-86). Boston: Houghton Mifflin Company.
McMillan, J. H. (2001). Classroom assessment: Principles and practice for effective instruction. Boston: Allyn and Bacon.
Park, J. (2008, February 4). Personal communication. University of Michigan College of Pharmacy.
Piontek, M. (2008). Best practices for designing and grading exams. CRLT Occasional Paper No. 24. Ann Arbor, MI. Center for Research on Learning and Teaching.>
Shipan, C. (2008, February 4). Personal communication. University of Michigan Department of Political Science.
Svinicki, M. D. (1999a). Evaluating and grading students. In Teachers and students: A sourcebook for UT- Austin faculty (pp. 1-14). Austin, TX: Center for Teaching Effectiveness, University of Texas at Austin.
Thorndike, R. M. (1997). Measurement and evaluation in psychology and education. Upper Saddle River, NJ: Prentice-Hall, Inc.
Wiggins, G. P. (1998). Educative assessment: Designing assessments to inform and improve student performance. San Francisco: Jossey-Bass Publishers.
Worthen, B. R., Borg, W. R., & White, K. R. (1993). Measurement and evaluation in the schools. New York: Longman.
Writing and grading essay questions. (1990, September). For Your Consideration, No. 7. Chapel Hill, NC: Center for Teaching and Learning, University of North Carolina at Chapel Hill.
back to top
Candace Caraco, TA, Department of English
Grading student papers for a course in any discipline presents a series of challenges different from grading other kinds of assignments. Typically, a wide range of responses will be acceptable, and every paper (unless it is plagiarized) will have some merit. Consequently, grading essays demands a teacher’s close attention to insure that each paper is judged by the same standards. A method for evaluating essays that breaks the grading process into parts can help an instructor work more consistently and efficiently. By assessing papers based upon the three general categories of ideas, argument, and mechanics and style, categories easily adapted for each discipline and assignment, an instructor can more easily recognize and comment on an essay’s strengths and weaknesses and so face that daunting pile of twenty, forty, or even one hundred essays with less trepidation. Furthermore, if teachers make clear to students how this method works, fewer students will be confused about their grades or apt to charge that papers are graded in an arbitrary or purely subjective way.
Before applying the three categories for evaluation, think through what it is you want an assignment to accomplish. Grades should reflect the most significant strengths and weaknesses of an essay, so a teacher should carefully consider ahead of time what expectations he or she has for a paper and especially what he or she most wants students to do for a particular assignment. For example,
- Do the instructions to students require specific tasks, such as agreeing or disagreeing with an author, outlining a book’s argument for review, or analyzing a particular section of a work?
- What is it that students should show they understood?
More generally, a teacher may also consider the following:
- Has the student presented ideas in a logical order?
- Is the essay written in clear, grammatically correct prose?
- Has the student offered explanation or examples to support generalizations?
For any given assignment, your criteria for success may vary in the details; whatever they are, make a list of them. Ideally, students would receive a copy of this list before they begin writing their essays.
The problem with such a list of criteria, however, is that it can quickly grow unwieldy. While we need some specific questions as a checklist for student writing success, we can benefit from a streamlined evaluation system. The ideas/argument/mechanics and style format is a simple way to group criteria, both for yourself and your students. Once you have a set of criteria for an essay to succeed, you can decide how these questions fit under the three headings. A general breakdown of these questions might look like this:
- Does the student understand the accompanying reading or the principles behind the experiment, etc.?
- Does the student offer original interpretations?
- Do the student’s explanations of terms, ideas, and examples demonstrate an ability to grasp the main points, paraphrase them, and apply them?
- Does the student answer the question(s) assigned?
- Does the essay demonstrate an understanding of a subject, or does it wander from one subject to the next without offering more than superficial remarks?
- Can we easily determine what the author’s main point is?
- Does the essay provide a series of points that add up to an argument supporting the main point (thesis)?
- Does the essay proceed logically from point to point?
- Does the student provide examples and explanations to support his or her generalizations?
- Does the essay contain contradictions? Is the paragraph structure logical?
MECHANICS AND STYLE
- Is it clear what the student’s point of view is?
- Does the student control tone? Is the essay free of grammatical errors?
- Is the essay punctuated appropriately?
- Do citations and bibliography follow the correct format?
- Is the prose clear or do you puzzle over individual sentences?
- Are words spelled correctly?
What I am suggesting is essentially adapted from the methods of two English professors, Charlene Sedgwick and Steve Cushman. Sedgwick’s “ENWR Handbook” offers guidelines for evaluating freshman composition papers by assessing focus, organization, style, and mechanics; Cushman has in the past recommended that graders for his upper-level literature courses weigh mechanics and style (together) as one-third of a grade, and ideas and argument as the other two-thirds. Though instructors for non-English courses may want less emphasis on writing skills per se in an essay grade, I would argue that papers for all courses should be evaluated at least in part for their grammar, punctuation, and prose style because these fundamentals of writing are everywhere necessary for readers to understand writers. And a teacher in any discipline can easily tailor the three categories of ideas/argument/mechanics and style to the conventions of the course and its academic discipline.
Simplified (and Platonized) then, these three categories translate into the following grade scale: essays with good ideas that are logically organized into an argument and written in clear and mechanically clean prose receive an A; essays lacking in one category (e.g., have poor organization) receive a B; essays weak in two categories receive a C; and essays that manage none the three general criteria garner a D or fail. What constitutes an “A” within any given category will also depend upon the course level and the assignment, but in a very general way, if a student’s essay can answer “yes” to all of your questions for a category, then the student should have an “A” for that portion of the grade. (More explicit criteria appear in “Responding to Student Writing” by Stella Deen in the November, 1995, Teaching Concerns.)
Particularly for new teachers, it is sometimes helpful to read through several essays to see what an average paper for a class looks like. Checking to see if several papers have similar difficulties can also help you detect unclear instructions in the assignment or a content issue that may require further class discussion: if we have been unclear in some way, then we should be prepared to cut our students some slack when evaluating that part of the assignment.
However much we simplify the process, grading essays will never be as simple as marking multiple choice exams. Most student essays are some combination of good ideas and slight misunderstanding, clear argument and less clear argument: they don’t neatly divide into three parts. Typically the problems in an essay are closely related: for example, a misunderstanding of content can lead to a logical flaw in the argument and to prose that is full of short sentences because the author is not certain which ideas should be subordinated to others. Because of this system of logical relations, it is all the more important to include a final comment with a grade.
Writing final comments may indeed slow grading, but the pedagogical benefits of comments far outweigh the few minutes per paper needed to write them. Students continue to learn from an assignment if they understand what their work accomplished and what it didn’t. More importantly, final comments can help students write more fully conceived and better executed papers on the next assignment. (For a time-saving method of offering detailed comments about common problems in a set of essays, see Nancy Childress’s essay “Using General Comment Sheets,” published in the October, 1995 issue of Teaching Concerns; she recommends preparing a handout for the entire class in addition to [shorter] written comments on individual essays.)
One way to organize an end comment is to write at least one sentence pertaining to each of the three categories of ideas, argument, and style and mechanics. Breaking an essay into these three components can help us comment on an essay’s strengths and weaknesses more quickly than if we had no set criteria or if we had too many. A particularly successful comment will explain to a student how ideas, argument, style, and even grammar work together. Final comments also serve as a check on ourselves, especially if we tie our general end comments to specific examples within the paper. For example, when I finish reading Student A’s essay, I may sense that he didn’t offer proof in support of assertions. But when I look for an example of an unsupported assertion, I find there are passages that might serve as supporting evidence; however, he has not explained very carefully how the examples work, so my impression has been that his essay lacks proof. Even when we are sure that we have avoided bias and inconsistency, comments pointing to examples will better illustrate to students what they can improve. Above all, comments should not be mere justifications for grades, though they may coincidentally deter students from seeking explanations as to why the received a “B” instead of an “A.”
TRC NOTE: For help in implementing these suggestions, request a Writing Workshop.