Hacking the 100-point Scale – Part 4: Playing with Neural Networks
First, a review of where we’ve been in the series:
- The 100-point scale suffers from issues related to its historical use and the difficulty of communicating what it means.
- It might be beneficial to have a solid link between the 100-point scale (since it likely isn’t going anywhere) and the idea of achievement levels. This does not need to be rigidly defined as 90-100 = A, 80-89 = B, and so on.
- I asked you to help me collect some data. I gave you a made-up rubric with three categories, and three descriptors for each, and asked you to categorize these as achievement levels 1 – 4. Thank you to everyone who participated!
This brings us to today’s post, where I try to bring these ideas together.
In case you only have time for a quick overview, here’s the tl;dr:
I used the rubric scores you all sent me after the previous post to train a neural network. I then used that neural network to grade all possible rubric scores and generate achievement levels of 1, 2, 3, or 4.
Scroll down to the image to see the results.
Now to the meat of the matter.
Rubric design is not easy. It takes quite a bit of careful thought to decide on descriptors and point values, and much of the time we don’t have a team of experts on the payroll to do this for us.
On the other hand, we’re asked to make judgements on students all the time. These judgements are difficult and subjective at times. Mathematical tools like averages help reduce the workload, but they do this at the expense of reducing the information available.
The data you all gave me was the result of educational judgement, and that judgement comes from what you prioritize. In the final step of my Desmos activity, I asked what you typically use to relate a rubric score to a numerical grade. Here are some of the responses.
From @aknauft:
I need to see a consistent pattern of top rubric scores before I assign the top numerical grade. Similarly, if the student does *not* have a consistent set of low rubric scores, I will *not* give them the low numerical grade.
Here specifically, I was looking for:
3 scores of 1 –> skill level 1
2 scores of 2 or 1 score of 3 –> skill level 2 or more
2 scores of 3 –> skill level 3 or more
3 scores of 3 –> skill level 4
From Will:
Sum ‘points’
3 or 4 points= 1
5 or 6 points = 2
7 points= 3
8 or 9 points = 4
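Will's sum-based rule is simple enough to write down directly. Here is a minimal sketch in Python, assuming three categories each scored 1 to 3 (so sums run from 3 to 9); the function name is made up for illustration.

```python
def level_from_sum(scores):
    """Map a tuple of rubric scores to a level using Will's sum thresholds."""
    total = sum(scores)
    if total <= 4:   # 3 or 4 points
        return 1
    if total <= 6:   # 5 or 6 points
        return 2
    if total == 7:   # 7 points
        return 3
    return 4         # 8 or 9 points


print(level_from_sum((2, 2, 3)))  # a sum of 7 gives level 3
```

Note that a rule like this treats every combination with the same sum identically, regardless of which categories the points came from.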
From Clara:
1 is 60-70
2 is 70-80
3 is 80-90
4 is 90-100
However, 4 is not achievable based on your image.
Also to finely split each point into 10 gradients feels too subjective.
Equivalency to 100 (proportion) would leave everyone except those scoring 3 on the 4 or scale, failing.
Participant Paul also shared some helpful percentages that directly relate the 1 – 4 scale to percentages, perhaps off of his school’s grading policy. I’d love to know more. Dennis (on the previous post) commented that multi-component analysis should be done to set the relative weights of the different categories. I agree with his point that this is important and that it can easily be done in a spreadsheet. The difficulty is setting the weights.
Assigning grades using percentages saves time, and it feels easy because of how long we have done it this way. Generating the scales as the contributors above did is helpful for relating how a student did on a task to their level. My suggestion is that the percentages we use for achievement levels should be an output of the rubric design process, not an input. In other words, we’ve got it all backwards.
I used the data you all gave me and fed it into a neural network. This is a way of teaching a computer to make decisions based on a set of example data. I wanted the network to understand how you all thought a particular set of rubric scores would relate to achievement level, and then see how the network would then score a different set of rubric scores.
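To give a feel for what this involves, here is a minimal sketch of the idea in plain Python. It is not the JavaScript library used for this post, and the six training examples are invented placeholders, not the responses you sent. The function names, the network size, the learning rate, and the number of training passes are all assumptions chosen for illustration.

```python
import math
import random
from itertools import product

# HYPOTHETICAL training data (not the data collected for this post):
# each rubric triple is paired with an achievement level from 1 to 4.
TRAINING = [
    ((1, 1, 1), 1), ((1, 2, 1), 1), ((2, 2, 2), 2),
    ((2, 3, 2), 3), ((3, 3, 2), 3), ((3, 3, 3), 4),
]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def scale_in(triple):
    # Map rubric scores 1..3 onto 0..1 and append a constant bias input.
    return [(s - 1) / 2 for s in triple] + [1.0]


def forward(net, triple):
    w1, w2 = net
    x = scale_in(triple)
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x))) for row in w1]
    hb = hidden + [1.0]  # hidden activations plus a bias unit
    out = sigmoid(sum(w * v for w, v in zip(w2, hb)))
    return x, hb, out


def train(data, hidden=4, epochs=2000, rate=0.5, seed=0):
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(hidden)]
    w2 = [rng.uniform(-1, 1) for _ in range(hidden + 1)]
    net = (w1, w2)
    for _ in range(epochs):
        for triple, level in data:
            target = (level - 1) / 3  # map levels 1..4 onto 0..1
            x, hb, out = forward(net, triple)
            # Backpropagate the squared error through both layers.
            d_out = (out - target) * out * (1 - out)
            d_hid = [d_out * w2[j] * hb[j] * (1 - hb[j]) for j in range(hidden)]
            for j in range(hidden + 1):
                w2[j] -= rate * d_out * hb[j]
            for j in range(hidden):
                for i in range(4):
                    w1[j][i] -= rate * d_hid[j] * x[i]
    return net


def predict(net, triple):
    # Convert the 0..1 network output back to the 1..4 level scale.
    return 1 + 3 * forward(net, triple)[2]


net = train(TRAINING)
# Score every possible rubric triple, rounding to a whole achievement level.
levels = {t: round(predict(net, t)) for t in product((1, 2, 3), repeat=3)}
for triple in sorted(levels):
    print(triple, levels[triple])
```

The network only ever sees the six labelled triples; the levels it gives the other twenty-one are its own interpolation, so different placeholder labels would produce a different table.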
Based solely on the six example grades I asked you to give, here are the achievement levels the neural network spit out:
I was impressed with how the network scored the twenty-one combinations (out of 27 possible) that you didn’t score. It might not be perfect, and you might not agree with every one. The amazing part of this process, however, is that any results you disagree with could be tagged with the score you prefer, and then the network could retrain on that additional training data. You (or a department of teachers) could go through this process and train your own rubric fairly quickly.
I was also curious about the sums of the scores that led to a given achievement level. This is after all what we usually do with these rubrics and record in the grade book. I graphed the rounded results in Desmos. Achievement level is on the vertical axis, and sum is on the horizontal.
One thing that struck me is the fuzziness around certain sum values. A sum of 6, for example, leads to a 1, 2, or a 3. I thought there might be some clear sum values that could serve as good thresholds for the different levels, but this isn’t the case. This means that simply taking the percentage of points earned and scaling it into the ten-point ranges for A, B, C, and D removes some important information about what a student actually did on the rubric.
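Part of the reason for that fuzziness is that very different rubric results collapse onto the same sum. A short Python sketch, assuming three categories each scored 1 to 3 (the variable names are illustrative), groups all 27 combinations by their sum:

```python
from collections import defaultdict
from itertools import product

# Group every possible rubric triple by the sum of its three scores.
combos_by_sum = defaultdict(list)
for triple in product((1, 2, 3), repeat=3):
    combos_by_sum[sum(triple)].append(triple)

for total in sorted(combos_by_sum):
    print(total, len(combos_by_sum[total]), combos_by_sum[total])
```

A sum of 6 covers seven combinations: (2, 2, 2) plus the six orderings of (1, 2, 3). A single threshold on the sum cannot tell those apart.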
A better way to translate these rubric scores might be to simply give numerical grades that indicate the levels, and communicate the levels that way as part of the score in the grade book. “A score of 75 indicates the student was a level 2.”
Where do we go from here? I’m not sure. I’m not advocating that a computer do our grading for us. Along the lines of many of my posts here, I think the computer can help alleviate some of the busy work and increase our efficiency. We’re the ones saying what’s important. I did another data set where I went through the same process, but acted like the third category was less important than the other two. Here’s the result of using that modified training data:
It’s interesting how this changed the results, but I haven’t dug into them very deeply.
I just know that something needs to change. Students came to me after final exam grades were put in last week (which, by the way, were raw percentage grades), confused about what their grades meant. The floor for failing grades is a 50, and some students interpreted this to mean that they started with a 50, and then any additional points they earned were added on to that grade. I use the 50 as a floor, meaning that a 30% raw score is listed as a 50% in the final exam grade. We need to improve our communication, and there’s a lot of work to do if the scale isn’t going away.
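For what it’s worth, the floor rule is just a maximum. A tiny Python sketch (the function name is made up for illustration):

```python
def reported_grade(raw_percent, floor=50):
    """Return the grade that goes in the grade book: never below the floor."""
    return max(floor, raw_percent)


print(reported_grade(30))  # 50: a raw 30% is listed as 50%
print(reported_grade(72))  # 72: scores above the floor are unchanged
```

Nothing is added to a score above 50; the floor only changes grades that would otherwise fall below it.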
I’m interested in the idea of a page that would let you train any rubric of any size through a series of clicks. What thoughts do you have at the end of this exploration?
Technical Details:
I used the JavaScript implementation of a neural network here to do the training. The visualizations were all made using the Raphael JS library.
Evan, thanks for the deep thoughts in writing this up and trying to find a consistent way to have our students know where they stand. I was the “Paul” who submitted a grading scale. I attempted standards based grading and had to find a way to translate it to a 100 point scale. I only gave a student a Zero if they did not attempt the quiz/test in a meaningful way.
I originally went with a 1,2,3,4 whole number scale, trying to keep it as simple as possible, but I would find when I was grading that a student was almost there, but didn’t quite get it. So I ended up putting in half steps.
1 = 50% Made an attempt, but showing no understanding of the standard.
1.5 = 60% Little understanding, but on the way.
2 = 70% Some understanding, but missing some concepts
2.5 = 78% Almost there, but not quite.
3 = 85% Understand the concept, making little mistakes, or not giving full explanations of how you understand the concept.
3.5 = 94% Full understanding, good explanations, maybe one or two small mistakes.
4 = 100% Full understanding, Full explanations, no mistakes.
Students (7th grade math) seemed to think it was clear. If I was still in a classroom I would definitely still be using this scale.
Paul
Hi Paul,
Thanks for identifying yourself – the extra detail here is useful background information. I’m curious about the grade percentages you picked for these standard levels. What did you use to generate these?
Thanks, Evan, for all these posts. These days I have been pondering assessment. My conclusion is that each activity could be assessed on different criteria/skills. For example, a word problem could involve these skills: 1) understand the problem, 2) plan a strategy to solve the problem, 3) apply this strategy coherently, 4) give the correct answer, and 5) interpret the results and make predictions. When a student does an activity, she gets some number: 6, for example. But what does a 6 mean here? Which skills are missing? Which are her strongest abilities?
In my next course I will give 5 numbers for each activity and add each number to the corresponding skill. So I will assess by skills, not by “exams” or “additions of skills”. At the end of the course, each student will have an evolution of each skill, and I will decide whether she has it or not. If she has the majority of the skills, or the main ones, she will pass. Otherwise not.
What do you think? I have thought about this a lot, but only because you have published all of these posts.
Thanks a lot,
That’s the same thing I’m looking to do as I develop this idea. The only thing is that I need to make a tool that simplifies the process of differently sized rubrics and skill levels, as you described. You are describing the essence of standards based grading there, something I’ve used for the past three years and have grown to love.
The only thing I’d adjust is the sum. I’m not convinced yet of the benefit of combining all of those scores into one through addition – it’s a loss of information as far as I can tell. I might not be understanding what you mean though.
I like the idea that the grading process should be driven by the rubric, and then fit into the 100 pt scale b/c of the grading systems used by our schools and put out to our parents.
If I look at your images generating the 1, 2, 3, 4: all 3s generates a 4, which appears to be equivalent to 100.
The variation in 3s would perhaps be able to be attached to points within the “A” range of most school systems (some range from 90-100; mine uses 93-100). The computation could place various combinations within the 93-99 values.
Same thing for 2, for 1.
I have kids that don’t do anything- except put their name on the paper, or attempt a few scribbles that are essentially zero (yes- I target them for intervention), so what grade do they get until I can get with them to determine level of mastery?
Sorry- going beyond the scope here. I like the process you are trying to build.