Hacking the 100-point Scale - Part 4: Playing with Neural Networks

First, a review of where we've been in the series:

  • The 100-point scale suffers from issues tied to its historical use and the difficulty of communicating what a given score means.
  • It might be beneficial to have a solid link between the 100-point scale (since it likely isn't going anywhere) and the idea of achievement levels. This link does not need to be rigidly defined as 90-100 = A, 80-89 = B, and so on.
  • I asked you to help me collect some data. I gave you a made-up rubric with three categories and three descriptors for each, and asked you to assign achievement levels of 1-4 to combinations of scores. Thank you to everyone who participated!

This brings us to today's post, where I try to bring these ideas together.

In case you only have time for a quick overview, here's the tl;dr:

I used the rubric scores you all sent in after the previous post to train a neural network. I then used that network to evaluate all possible rubric score combinations and assign achievement levels of 1, 2, 3, or 4.

Scroll down to the image to see the results.

Now to the meat of the matter.

Rubric design is not easy. It takes quite a bit of careful thought to decide on descriptors and point values, and much of the time we don't have a team of experts on the payroll to do this for us.

On the other hand, we're asked to make judgments about students all the time. These judgments are difficult and subjective at times. Mathematical tools like averages help reduce the workload, but they do so at the expense of the information available.

The data you all gave me was the result of educational judgment, and that judgment comes from what you prioritize. In the final step of my Desmos activity, I asked what you typically use to relate a rubric score to a numerical grade. Here are some of the responses.

From @aknauft:

I need to see a consistent pattern of top rubric scores before I assign the top numerical grade. Similarly, if the student does *not* have a consistent set of low rubric scores, I will *not* give them the low numerical grade.
Here specifically, I was looking for:
3 scores of 1 --> skill level 1
2 scores of 2 or 1 score of 3 --> skill level 2 or more
2 scores of 3 --> skill level 3 or more
3 scores of 3 --> skill level 4
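aknauft's thresholds could be sketched as a small function. This is a hypothetical helper, not code aknauft provided, and since the stated rules only give lower bounds ("level 2 or more"), the sketch resolves the uncovered cases downward:

```javascript
// Map three rubric scores (each 1-3) to an achievement level
// using aknauft's stated thresholds. Hypothetical sketch; cases the
// rules leave open fall through to level 2.
function aknauftLevel(scores) {
  const threes = scores.filter(s => s === 3).length;
  const ones = scores.filter(s => s === 1).length;
  if (ones === 3) return 1;   // 3 scores of 1 -> level 1
  if (threes === 3) return 4; // 3 scores of 3 -> level 4
  if (threes === 2) return 3; // 2 scores of 3 -> level 3 or more
  return 2;                   // 2 scores of 2, or 1 score of 3 -> level 2 or more
}
```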

From Will:

Sum 'points'
3 or 4 points = 1
5 or 6 points = 2
7 points = 3
8 or 9 points = 4
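Will's sum-based cutoffs translate directly into code. A minimal sketch (the function name is mine, not Will's):

```javascript
// Convert the sum of three rubric scores (range 3-9) to an
// achievement level using Will's cutoffs.
function sumToLevel(sum) {
  if (sum <= 4) return 1;  // 3 or 4 points
  if (sum <= 6) return 2;  // 5 or 6 points
  if (sum === 7) return 3; // 7 points
  return 4;                // 8 or 9 points
}
```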

From Clara:

1 is 60-70
2 is 70-80
3 is 80-90
4 is 90-100
However, 4 is not achievable based on your image.
Also to finely split each point into 10 gradients feels too subjective.
Equivalency to 100 (proportion) would leave everyone except those scoring 3s on the scale failing.

Participant Paul also shared some helpful percentages that directly map the 1-4 scale onto the 100-point scale, perhaps drawn from his school's grading policy. I'd love to know more. Dennis (on the previous post) commented that a multi-component analysis should be done to set the relative weights of the different categories. I agree that this is important and that it can easily be done in a spreadsheet. The difficulty is setting the weights.
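Dennis's weighted-category idea is simple to compute once the weights exist; the hard part, as noted, is choosing them. A sketch with invented weights:

```javascript
// Weighted sum of rubric category scores. The weights here are
// made up for illustration; picking them is the real difficulty.
function weightedScore(scores, weights) {
  return scores.reduce((total, s, i) => total + s * weights[i], 0);
}

// Example: the third category counts half as much as the first two.
weightedScore([3, 2, 3], [1, 1, 0.5]); // 3 + 2 + 1.5 = 6.5
```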

Assigning grades using percentages saves time and feels easy because of its historical use. Generating scales as the contributors above did helps relate how a student did on a task to their level. My suggestion is that the percentages we use for achievement levels should be an output of the rubric design process, not an input. In other words, we've got it all backwards.

I took the data you all gave me and fed it into a neural network. This is a way of teaching a computer to make decisions based on a set of example data. I wanted the network to learn how you all related a particular set of rubric scores to an achievement level, and then see how it would score the remaining sets of rubric scores.
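The training data is just example pairs: rubric scores in, teacher-chosen achievement level out. Most JavaScript neural network libraries (brain.js, for instance) expect roughly the shape below; the specific scores and labels here are invented for illustration, not the actual responses:

```javascript
// Hypothetical training examples: three rubric scores (1-3 each,
// scaled onto 0-1) paired with the achievement level a teacher chose.
const scale = s => (s - 1) / 2; // map a rubric score of 1-3 onto 0-1

const trainingData = [
  { input: [1, 1, 1].map(scale), output: { level1: 1 } },
  { input: [2, 2, 1].map(scale), output: { level2: 1 } },
  { input: [2, 3, 2].map(scale), output: { level3: 1 } },
  { input: [3, 3, 3].map(scale), output: { level4: 1 } },
];
// A library call like net.train(trainingData) would fit the network
// to these judgments, and net.run(scaledScores) would then classify
// any of the 27 possible score combinations.
```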

Based solely on the six example grades I asked you to give, here are the achievement levels the neural network spit out:

[Image: ml-rubric-output]

I was impressed with how the network scored the twenty-one permutations (out of 27 possible) that you didn't score. It might not be perfect, and you might not agree with every result. The amazing part of this process, however, is that any results you disagree with could be tagged with the score you prefer, and the network could then retrain on that additional training data. You (or a department of teachers) could go through this process and train your own rubric fairly quickly.
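For reference, the input space is tiny: all 27 combinations of three scores can be enumerated in a few lines, which is what makes hand-labeling a handful and letting the network fill in the rest so practical. A sketch:

```javascript
// Enumerate every combination of three rubric scores, each 1-3.
function allCombinations() {
  const combos = [];
  for (let a = 1; a <= 3; a++)
    for (let b = 1; b <= 3; b++)
      for (let c = 1; c <= 3; c++)
        combos.push([a, b, c]);
  return combos;
}
// 27 combinations in total: label six by hand and the network
// generalizes to the remaining twenty-one.
```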

I was also curious about the sums of the scores that led to a given achievement level. This is, after all, what we usually do with these rubrics and record in the grade book. I graphed the rounded results in Desmos, with achievement level on the vertical axis and sum on the horizontal.

One thing that struck me is the fuzziness around certain sum values. A sum of 6, for example, can lead to a 1, a 2, or a 3. I thought there might be some clear sum values that could serve as good thresholds for the different levels, but this isn't the case. This means that simply taking the percentage of points earned and scaling it into the ten-point ranges for A, B, C, and D removes important information about what a student actually did on the rubric.
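That fuzziness is easy to see by grouping the 27 combinations by their total: several very different score patterns collapse onto the same sum. A sketch:

```javascript
// Group all 27 three-score combinations (scores 1-3) by their sum,
// showing how many distinct patterns share each total.
function combosBySum() {
  const groups = {};
  for (let a = 1; a <= 3; a++)
    for (let b = 1; b <= 3; b++)
      for (let c = 1; c <= 3; c++) {
        const sum = a + b + c;
        (groups[sum] = groups[sum] || []).push([a, b, c]);
      }
  return groups;
}
// A sum of 6 covers [2,2,2] and every ordering of [1,2,3]:
// seven different patterns, hidden behind one number.
```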

A better way to translate these rubric scores might be to simply give numerical grades that indicate the levels, and communicate the levels that way as part of the score in the grade book. "A score of 75 indicates the student was a level 2."

Where do we go from here? I'm not sure. I'm not advocating that a computer do our grading for us. Along the lines of many of my posts here, I think the computer can help alleviate some of the busy work and increase our efficiency. We're the ones saying what's important. I built a second data set by going through the same process, but treating the third category as less important than the other two. Here's the result of using that modified training data:

[Image: ml-rubric-output-modified]

It's interesting how this changed the results, but I haven't dug into the differences very deeply.

I just know that something needs to change. Students came to me after final exam grades were posted last week (raw percentage grades, by the way) confused about what their grades meant. The floor for failing grades is a 50, and some students interpreted this to mean that they started with a 50 and any additional points they earned were added on top. In fact, I use the 50 as a floor, meaning a 30% raw score is recorded as a 50% for the final exam grade. We need to improve our communication, and there's a lot of work to do if the scale isn't going away.
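The floor policy is a one-line computation, which is exactly why it's so easy to misread without clear communication. A sketch of the rule as I apply it:

```javascript
// A 50 floor: any raw percentage below 50 is recorded as 50.
// It is not "50 plus whatever points you earned."
const floored = raw => Math.max(50, raw);

floored(30); // recorded as 50
floored(85); // recorded as 85
```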

I'm interested in the idea of a page that would let you train any rubric of any size through a series of clicks. What thoughts do you have at the end of this exploration?


Technical Details:

I used a JavaScript implementation of a neural network to do the training. The visualizations were all made using the Raphael JS library.