New Moves: Discretized Grades

Two of the courses I teach, AP Calculus AB and IB Mathematics SL year two, have clear curricula to follow, which is both a blessing and a curse. While I primarily report standards-based grades in these courses, I have also included a unit exam component that measures comprehensive performance. These are old-fashioned summative assessments that I haven't felt comfortable expelling from these particular courses. Both courses end with a comprehensive exam in May. The scores on these exams are scaled to either 1-5 (AP) or 1-7 (IB). The longer I have taught, the more I have grown to like the idea of reporting grades as one of a limited set of discrete scores.

Over my entire teaching career I have worked within systems that report grades as a percentage, usually to two-digit precision. Sometimes these grades are mapped to an A-F scale, but students and parents tend not to pay attention to the letters. One downside of the percentage reporting system is that it implies that we have measured learning to within a single percentage point. Let's set aside for the moment the question of whether we should be measuring learning numerically at all, and talk about why discrete grades are a better choice.

As a teacher, I need to make sure that I grade assignments consistently across a course, or across a section at a minimum. I'm not sure I can be consistent within a percentage point when you consider the number of my students multiplied by the number of assessment items I give them. I'm likely consistent within five percent, and very likely consistent within ten. Because of the standards-based component of my grading system, I am also confident in my ability to have a conversation with any student about what he or she can do to improve.

One big problem I see with grading scales that map to letter grades is the arbitrary mapping between multiples of ten and the letter grades themselves. As I mentioned before, many don't pay attention to the letter at all when the number is next to it. Students who see a score of 79 wonder what one thing they should have done on the assessment to be bumped up by a percentage point to an 80, and with it a letter grade of B. That one point also becomes much more consequential than the single point that raises a 75 to a 76.

Another issue comes from the imprecise definition of the points for each question. Is that single point increase the result of a sign error, or of a conceptual issue that is more significant? Single-point precision suggests that we can talk about things this accurately, but it is not common to plan assessments in such a way that these differences are clearly identified. I know I don't have psychometricians on staff.

For all of these reasons and more, I've been experimenting with grading exams in a way that acknowledges this imprecision and attempts to deal with it appropriately.

The simplest way I did this was with final exams for my Precalculus course last year. In this case, all scores were reported after being rounded to the nearest three percentage points. This meant that student scores were rounded roughly to the divisions of the letter grades for plus, regular, or minus (e.g. B-/B/B+).
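One way that rounding could be sketched in Python is shown here. The function name `round_to_step` and the `anchor` parameter are made up for this illustration, and the sketch assumes the grid sits on multiples of three (87, 90, 93, and so on); the exact grid used for the Precalculus exams is not stated above.

```python
import math

def round_to_step(score, step=3, anchor=0):
    """Round a percentage to the nearest value on a grid of `step` points.

    The grid is anchor, anchor +/- step, anchor +/- 2*step, and so on.
    Ties round up. Both defaults are assumptions for illustration.
    """
    return anchor + step * math.floor((score - anchor) / step + 0.5)

# A few raw percentages and where they land on the three-point grid.
for raw in (88, 89, 91, 92):
    print(raw, "->", round_to_step(raw))
```

On a multiples-of-three grid a perfect 100 would be reported as 99. Passing `anchor=100` shifts the grid to 100, 97, 94, and so on, which keeps a perfect score intact.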

In the AP and IB courses, this process was more involved. I decided that exam scores would be 97, 93, 85, 75, and 65, which would map to 5-4-3-2-1 for AP and 7-6-5-4-3 for IB. I entered student performance on each question into a spreadsheet. Sometimes before, and sometimes after, I would also go through each question and decide what sort of representative mistakes I would expect a 5 student to make, a 4 student, and so on. I would also work through a couple of different scoring scenarios at each level to find how much variation in points might result in a given score. That led me to decide which cut scores should apply, or at least suggested what they might be for this particular exam. Here is an example of what this looks like:

At this point I would also look at individual papers again, identify holistically which score I thought the student should earn, and then compare their raw scores to the scores of the representative papers. If there was any clear discrepancy, this would lead to a change in the cut scores. Once I thought most students were graded appropriately, I added the cut scores into a Google script to scale all of the raw scores to the discrete scores.
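The scaling step can be sketched in a few lines. This is a Python illustration and not the Google script itself. The reported scores and the AP and IB bands come from the description above, but the cut scores in `EXAMPLE_CUTS` and the 50-point exam they imply are invented, since the real cut scores changed from exam to exam. Treating 65 as a floor for every raw score below the lowest cut is also an assumption.

```python
# Reported gradebook scores and their AP / IB equivalents, highest band first.
REPORTED = [97, 93, 85, 75, 65]
AP_BANDS = [5, 4, 3, 2, 1]
IB_BANDS = [7, 6, 5, 4, 3]

# Hypothetical cut scores: minimum raw points for each of the top four bands
# on an invented 50-point exam. Anything below the last cut lands in the
# lowest band (an assumption for this sketch).
EXAMPLE_CUTS = [42, 35, 26, 18]

def band_index(raw, cuts):
    """Return the index of the highest band whose cut score `raw` meets."""
    for i, cut in enumerate(cuts):
        if raw >= cut:
            return i
    return len(cuts)

def scale(raw, cuts=EXAMPLE_CUTS):
    """Map raw exam points to the discrete reported score and its bands."""
    i = band_index(raw, cuts)
    return {"reported": REPORTED[i], "ap": AP_BANDS[i], "ib": IB_BANDS[i]}

for raw in (45, 41, 30, 10):
    print(raw, "->", scale(raw))
```

With cut scores kept in one list, changing them after another look at the papers means editing a single line and rescaling everyone.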

This process of norming the papers took time, but it always felt worth it in the end. I felt comfortable talking to students about their scores and the work that qualified them for that score. Because these totals were independent of the standard 90/80/70/60 mapping between percentages and letter grades, the scores were appropriate indicators of how students did, regardless of the percentage of points earned. Students weren't excited to learn that they couldn't work out their score from their total point percentage, but this was not a major issue for them. Going through this process felt much more appropriate than applying a 10*sqrt(score) type of mapping to the raw scores.
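For comparison, here is what a 10*sqrt(score) mapping does to a few raw percentages. This is a small Python sketch, and the function name `sqrt_curve` is made up for the illustration.

```python
import math

def sqrt_curve(percent):
    """Apply the 10*sqrt(score) curve to a raw percentage from 0 to 100."""
    return 10 * math.sqrt(percent)

# Perfect squares make the effect easy to read.
for raw in (49, 64, 81, 100):
    print(raw, "->", sqrt_curve(raw))
```

A raw 49 becomes a 70 and a raw 64 becomes an 80. The formula decides the adjustment on its own, with no look at the work on the paper, whereas the cut scores came out of the papers themselves.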

In my end-of-semester feedback, some students reported their frustration that they would receive the same score as other students who earned fewer points. I understand this frustration in principle, but not in practice. Under the standard system, 92.44% and 91.56% also receive the same score, since both round to 92%. I think in the big picture, the grades students received were fair, and students have also reported a feeling of fairness with respect to the grades I give them.

I'm in favor of eliminating the plus and minus designations from letter grades. They are communication marks and nothing more, and I would rather communicate those distinctions through written comments or in person rather than by a symbol. These marks are more numerical consequences of the percentage grade scale than they are intentional comments on student learning, and they do more harm than good.

New Moves: Reassessment

I’ve been a bit swamped over the course of the semester and unfortunately haven’t made the time to write regularly. There were lots of factors converging, and nothing negative, so I accepted that it might be one of the things to slip. This is something I will adjust for semester two.

I’ve written in the past about my reassessment systems and use of WeinbergCloud to manage them. I knew something had to change and thought a lot about what I was going to do to make my system more reasonable, something the old system was not.

At the beginning of the year, I sat down and started to reprogram the site...and then stopped. As much as I enjoyed the process of tweaking its features and solving problems that arose with its use, it was not where I wanted to spend my time. I also knew that I was going to teach a course with a colleague who also was planning to do reassessment, but I was not ready to build my system to manage multiple teachers.

I made an executive decision and stepped away from the WeinbergCloud project. It served me well, but it was time to come up with a different solution. We use Google for Education at my school, and the students are well versed in the use of calendars for school events. I decided to make this the main platform for all sorts of reasons. Putting my full class and meeting schedule into Google Calendar meant that I could schedule student reassessments by actually seeing what my schedule looked like in a given week. Last year, students would sign up to reassess at times when I had lunch duty or an after-school meeting because my site didn’t have any way to block out times. This was a major improvement.

I also limited students to one reassessment per week. They needed to email me before the beginning of any given week and tell me which standard they wanted to reassess. I would then send them an invite for the time they would show up to do their reassessment. This improved both student preparation and my ability to plan ahead for reassessments, since I knew what my schedule looked like for the day. Students liked it up until the final week of the semester, when they really wanted to reassess multiple times. I think this is a feature, not a bug, and will incentivize planning ahead.

I recorded student reassessments in PowerSchool in the comment tab. Grades with comments appear with a small flag next to them. This meant I could scan across horizontally to see what an individual student had reassessed on. I could also look vertically to see which standards were being assessed most frequently. The visual record was much more effective for qualitative views of the system than what I had previously with WeinbergCloud.

The system above was for my IB and AP classes. For Algebra 2 (of which I teach two sections, sharing the course with another teacher) we had a simpler system. Students would be quizzed on standards, usually two at a time. Exams would be reassessments on all of the standards. Students would then have a third opportunity to be quizzed on up to three of the standards of each unit later in the semester. Students who scored below an 8 were required to reassess. This system worked well for the most part. Some students thought that the types of questions on the quizzes and exams were different enough that they were not equivalent assessments of the standards. My colleague and I spent a lot of time talking through the questions, identifying the types of mistakes on individual questions that were indicators of 6 versus 8 versus 10, and also unifying the feedback we gave students after assessments. The system isn’t perfect, but every student was given up to three opportunities to be assessed on every standard. This equity is not something that I’ve achieved in my previous manifestations of SBG.

On the whole, both flavors of reassessment systems were much more reasonable and manageable, and I think they are here to stay. I’ll spend some time during the winter break thinking about what tweaks might be needed, if any, for the second half of the year.