Scaling up SBG for the New Year

In my new school, the mean size of my classes has doubled. The maximum size is now 22 students, a fact about which I am not complaining. I’ve missed the ease of getting students to interact with simple proximity as the major factor.

I have also been given the freedom to continue with the standards based grading system that I’ve used over the past four years. The reality of needing to adapt my systems of assessment to these larger sizes has required me to reflect upon which aspects of my system need to be scaled, and what (if anything) needs to change.

The end result of that reflection has identified these three elements that need to remain in my system:

  • Students need to be assessed frequently through quizzes relating to one to two standards maximum.
  • These quizzes need to be graded and returned within the class period to ensure a short feedback cycle.
  • There must still be a tie between work done preparing for a reassessment and signing up for one.

Including the first element requires planning ahead. If quizzes are going to take up fifteen to twenty minutes of a class block, the rest of the block needs to be appropriately planned to ensure a balance between activities that respond to student learning needs, encourage reinforcement of old concepts, and allow interaction with new material. The second element dictates that those activities need to provide me time to grade the quizzes and enter them as standards grades before returning them to students. The third happens a bit later in the cycle as students act on their individualized needs to reassess on individual standards.

The major realization this year has been a refined need for standards that can be assessed within a twenty-minute block. In the past, I’ve believed that a quiz that hits one or two aspects of the topic is good enough, and that an end of unit assessment will allow complete assessment on the whole topic. Now I see that a standard that needs to have one component assessed on a quiz, and another component assessed on a test, really should be broken up into multiple standards. This has also meant that single standard quizzes are the way to go. I gave one quiz this week that tested a previously assessed standard, and then also assessed two new ones. Given how frantic I was in assessing mastery levels on three standards, I won’t be doing that again.

The other part of this first element is the importance of writing efficiently targeted assessment questions. I need students to arrive at a right answer by applying their knowledge, not by accident or application of an algorithm. I need mistakes to be evidence of misunderstanding, not management of computational complexity. In short, I need assessment questions that assess what they are designed to assess. That takes time, but with my simplified schedule this year, I’m finding the time to do this important work.

My last post was about my excitement over using the Numbas web site to create and generate the quizzes. A major bottleneck in grading these quizzes quickly in the past has been not necessarily having answers to the questions I give. Numbas allows me to program and display calculated answers based on the randomized values used to generate the questions.

Numbas has a feature that allows students to take the exam entirely online and enter their answers to be graded automatically. In this situation, I have students pass in their work as well. While I like the speed this offers, that advantage primarily exists in cases where students answer questions correctly. If they make mistakes, I look at the written work and figure out what went wrong, and individual values require that I recalculate along the way. This isn’t a huge problem, but it brings into question the need for individualized values which are (as far as I know right now) the only option for the fully online assessment. The option I like more is the printed worksheet theme that allows generation of printable quizzes. I make four versions and pass these out, and then there are only four sets of answers to have to compare student work against.
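To make the four-version idea concrete, here is a minimal Python sketch of generating a small, fixed set of quiz versions along with an answer key. It is not Numbas code and does not use the Numbas API; the function name `make_version`, the question type, and the number ranges are all assumptions chosen for illustration.

```python
import random


def make_version(seed):
    """Build one version of a 'solve ax + b = c' question with its answer.

    Hypothetical helper: the question type and ranges are illustrative only.
    """
    rng = random.Random(seed)   # seeded so a version can be regenerated later
    a = rng.randint(2, 9)
    x = rng.randint(-9, 9)      # choose the answer first so it is an integer
    b = rng.randint(1, 20)
    c = a * x + b
    return {
        "a": a, "b": b, "c": c,
        "question": f"Solve for x: {a}x + {b} = {c}",
        "answer": x,
    }


# Four printed versions, labeled A-D, each with its own answer key entry.
versions = {label: make_version(seed) for seed, label in enumerate("ABCD")}

for label, v in versions.items():
    print(f"Version {label}: {v['question']}    (key: x = {v['answer']})")
```

With only four seeds there are only four answer keys to check student work against, which is the trade-off described above compared with fully individualized values.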

With the answers, I can grade the quizzes and give feedback where needed on wrong answers in no more than ten or fifteen minutes total. This time is divided into short intervals throughout the class block while students are working individually. The lesson and class activities need to be designed to provide this time so I can focus on grading.

The third element is still under development, but my credit system from previous years is going to make an appearance. Construction is still underway on that one. Please pardon the dust.


P.S:

If you’re an ed-tech company that wants to impress me, make it easy for me to (a) generate different versions of good assessment questions with answers, (b) distribute those questions to students, (c) capture the student thinking and writing that goes with that question so that I can adjust my instruction accordingly, and (d) make it super easy to share that thinking in different ways.

That step of capturing student work is the roughest element of the user experience of the four. At this time, nothing beats looking at a student’s paper for evidence of their thinking, and then deciding what comes next based on experience. Snapping a picture with a phone is the best I’ve got right now. Please don’t bring up using tablets and a stylus. We aren’t there yet.

Right now there are solutions that hit two or three, but I’m greedy. Let me know if you know about a tool that might be what I’m looking for.

Context and Learning Names

I wrote yesterday about my decision to try learning names of my students on the first day.

As of the middle of week two, I’ve learned the names of every student within each class with few exceptions. In some of the bigger groups, I mix one or two names that start with the same first letter, but I correct myself pretty quickly. I’ve come to recognize some individual traits that make each student unique within the group, and am feeling comfortable building on my knowledge of their names to find out more about who they are.

In the hallways, in line for lunch, and walking around campus, I struggle. Outside of the classroom, I lack the context of those names that I can usually lean back upon to remember them. With the students all mixed up together, including with students that I don’t have in my classes, it takes longer to put a name with the face. As I develop an understanding of the students beyond names, this struggle will go away.

The analogy to learning in any classroom context stands on its own, so I won’t ruin it with more commentary.


Generality vs. Specificity

We want our students to have problem solving methods that are general enough to work in any situation. If we assign a series of exercises that are too similar to each other, it becomes easy for students to lock onto the wrong pattern, or to use a ‘trick’ that works just frequently enough to seem worth the effort to learn it.

One thing I tried this year was to prompt students to make themselves aware of the spectrum from generality to specificity. What works for solving specifically this question? What general ideas apply to answering all of the problems on the page?

I used my randomized question generator to help create problems that worked this way. Here’s an example:

[Screenshot: example problems from the randomized question generator]

I only started a deliberate effort to prompt these conversations at the middle of the second semester. I wish I had been doing it all year.

Hacking the 100-point Scale – Part 4: Playing with Neural Networks

First, a review of where we’ve been in the series:

  • The 100 point scale suffers from issues related to its historical use and difficulties of communicating what it means.
  • It might be beneficial to have a solid link between the 100 point scale (since it likely isn’t going anywhere) and the idea of achievement levels. This does not need to be rigidly defined as 90 – 100 = A, 80-89 = B, and so on.
  • I asked for you to help me collect some data. I gave you a made up rubric with three categories, and three descriptors for each, and asked you to categorize these as achievement levels 1 – 4. Thank you to everyone who participated!

This brings us to today’s post, where I try to bring these ideas together.

In case you only have time for a quick overview, here’s the tl;dr:

I fed the rubric scores you all sent me after the previous post to train a neural network. I then used that neural network to grade all possible rubric scores and generate achievement levels of 1, 2, 3, or 4.

Scroll down to the image to see the results.

Now to the meat of the matter.

Rubric design is not easy. It takes quite a bit of careful thought to decide on descriptors and point values, and much of the time we don’t have a team of experts on the payroll to do this for us.

On the other hand, we’re asked to make judgements on students all the time. These judgements are difficult and subjective at times. Mathematical tools like averages help reduce the workload, but they do this at the expense of reducing the information available.

The data you all gave me was the result of educational judgement, and that judgement comes from what you prioritize. In the final step of my Desmos activity, I asked what you typically use to relate a rubric score to a numerical grade. Here are some of the responses.

From @aknauft:

I need to see a consistent pattern of top rubric scores before I assign the top numerical grade. Similarly, if the student does *not* have a consistent set of low rubric scores, I will *not* give them the low numerical grade.
Here specifically, I was looking for:
3 scores of 1 –> skill level 1
2 scores of 2 or 1 score of 3 –> skill level 2 or more
2 scores of 3 –> skill level 3 or more
3 scores of 3 –> skill level 4

From Will:

Sum ‘points’
3 or 4 points= 1
5 or 6 points = 2
7 points= 3
8 or 9 points = 4

From Clara:

1 is 60-70
2 is 70-80
3 is 80-90
4 is 90-100
However, 4 is not achievable based on your image.
Also to finely split each point into 10 gradients feels too subjective.
Equivalency to 100 (proportion) would leave everyone except those scoring 3 on the 4 or scale, failing.

Participant Paul also shared some helpful percentages that directly relate the 1 – 4 scale to percentages, perhaps off of his school’s grading policy. I’d love to know more. Dennis (on the previous post) commented that multi-component analysis should be done to set the relative weights of the different categories. I agree with his point that this is important and that it can easily be done in a spreadsheet. The difficulty is setting the weights.
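Of the responses above, Will's sum-based rule is the most direct to write down. The sketch below applies it to every possible set of three rubric scores. The function name `level_from_sum` is hypothetical; the thresholds are taken from Will's response.

```python
from itertools import product


def level_from_sum(scores):
    """Will's rule: 3-4 points -> 1, 5-6 points -> 2, 7 points -> 3, 8-9 points -> 4."""
    total = sum(scores)
    if total <= 4:
        return 1
    if total <= 6:
        return 2
    if total == 7:
        return 3
    return 4


# All 27 combinations of three categories scored 1, 2, or 3.
all_rubrics = list(product([1, 2, 3], repeat=3))
levels = {scores: level_from_sum(scores) for scores in all_rubrics}

for scores in all_rubrics:
    print(scores, "->", levels[scores])
```

A rule like this treats every combination with the same sum identically, which is the kind of choice the rest of this post puts under the microscope.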

The experience of assigning grades using percentages is a time saver, and is easy because of its historical use. Generating the scales as the contributors above did is helpful for relating how a student did on a task to their level. My suggestion is that the percentages we use for achievement levels should be an output of the rubric design process, not an input. In other words, we’ve got it all backwards.

I used the data you all gave me and fed it into a neural network. This is a way of teaching a computer to make decisions based on a set of example data. I wanted the network to understand how you all thought a particular set of rubric scores would relate to achievement level, and then see how the network would then score a different set of rubric scores.

Based solely on the six example grades I asked you to give, here are the achievement levels the neural network spit out:

ml-rubric-output

I was impressed with how the network scored the twenty-one combinations (out of 27 possible permutations) that you didn’t score. It might not be perfect, and you might not agree with every one. The amazing part of this process, however, is that any results you disagree with could be tagged with the score you prefer, and then the network could retrain on that additional training data. You (or a department of teachers) could go through this process and train your own rubric fairly quickly.
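For readers who want to see the shape of this process, here is a rough sketch in Python. It is a stand-in, not the JavaScript library used for this post: a tiny one-hidden-layer network written from scratch, trained on six training examples that are HYPOTHETICAL (invented for the sketch, not the responses collected). The layer size, learning rate, and epoch count are also arbitrary assumptions.

```python
import math
import random
from itertools import product

# HYPOTHETICAL training examples (rubric scores -> achievement level),
# invented for this sketch. They are not the data collected for the post.
EXAMPLES = [
    ((1, 1, 1), 1), ((1, 1, 2), 1), ((1, 2, 3), 2),
    ((2, 2, 3), 3), ((2, 3, 3), 3), ((3, 3, 3), 4),
]
HIDDEN = 4  # number of hidden units, an arbitrary choice


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def scale(scores):
    """Map rubric scores 1..3 onto 0..1 for the network inputs."""
    return [(s - 1) / 2 for s in scores]


def forward(net, x):
    """Return the hidden activations and the single output (between 0 and 1)."""
    w1, b1, w2, b2 = net
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(w1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(w2, h)) + b2)
    return h, y


def loss(net, data):
    """Mean squared error between network outputs and scaled target levels."""
    return sum((forward(net, x)[1] - t) ** 2 for x, t in data) / len(data)


def train(data, epochs=5000, rate=0.5, seed=0):
    """Plain stochastic gradient descent with backpropagation."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(HIDDEN)]
    b1 = [0.0] * HIDDEN
    w2 = [rng.uniform(-1, 1) for _ in range(HIDDEN)]
    b2 = 0.0
    for _ in range(epochs):
        for x, t in data:
            h, y = forward((w1, b1, w2, b2), x)
            # Error signals (the constant factor 2 is folded into the rate).
            d_out = (y - t) * y * (1 - y)
            d_hid = [d_out * w2[j] * h[j] * (1 - h[j]) for j in range(HIDDEN)]
            for j in range(HIDDEN):
                w2[j] -= rate * d_out * h[j]
                b1[j] -= rate * d_hid[j]
                for i in range(3):
                    w1[j][i] -= rate * d_hid[j] * x[i]
            b2 -= rate * d_out
    return w1, b1, w2, b2


def predict_level(net, scores):
    """Convert the 0..1 network output back to an achievement level 1..4."""
    y = forward(net, scale(scores))[1]
    return round(y * 3) + 1


DATA = [(scale(s), (level - 1) / 3) for s, level in EXAMPLES]
net = train(DATA)
predictions = {s: predict_level(net, s) for s in product([1, 2, 3], repeat=3)}
for scores, level in sorted(predictions.items()):
    print(scores, "->", level)
```

Retraining after a disagreement would amount to appending another `(scores, level)` pair to `EXAMPLES` and calling `train` again.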

I was also curious about the sums of the scores that led to a given achievement level. This is after all what we usually do with these rubrics and record in the grade book. I graphed the rounded results in Desmos. Achievement level is on the vertical axis, and sum is on the horizontal.

One thing that struck me is the fuzziness around certain sum values. A score of 6, for example, leads to a 1, 2, or a 3. I thought there might be some clear sum values that might serve as good thresholds for the different levels, but this isn’t the case. This means that simply taking the percentage of points earned and scaling into the ten point ranges for A, B, C, and D removes some important information about what a student actually did on the rubric.
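One way to see why a sum cannot tell the whole story is to count how many different rubric profiles collapse onto each sum. This sketch only does that counting; it makes no claim about which level each profile deserves.

```python
from collections import defaultdict
from itertools import product

# Group all 27 score combinations by their sum.
by_sum = defaultdict(list)
for scores in product([1, 2, 3], repeat=3):
    by_sum[sum(scores)].append(scores)

for total in sorted(by_sum):
    print(total, len(by_sum[total]), by_sum[total])
```

A sum of 6, for instance, covers both the even profile (2, 2, 2) and every ordering of (1, 2, 3), so a single number in the grade book cannot distinguish them.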

A better way to translate these rubric scores might be to simply give numerical grades that indicate the levels, and communicate the levels that way as part of the score in the grade book. “A score of 75 indicates the student was a level 2.”
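A small sketch of what that lookup could look like follows. Only the pairing of 75 with level 2 comes from the sentence above; the other three values are placeholders assumed for illustration, and the names `LEVEL_TO_GRADE` and `describe` are hypothetical.

```python
# Only "75 <-> level 2" comes from the text; the other values are placeholders.
LEVEL_TO_GRADE = {1: 65, 2: 75, 3: 85, 4: 95}
GRADE_TO_LEVEL = {grade: level for level, grade in LEVEL_TO_GRADE.items()}


def describe(grade):
    """Turn a grade book entry back into a statement about the level."""
    level = GRADE_TO_LEVEL[grade]
    return f"A score of {grade} indicates the student was a level {level}."


print(describe(LEVEL_TO_GRADE[2]))
```

Because only four grade values are ever recorded, the number in the grade book works as a label for a level rather than as a percentage of points.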

Where do we go from here? I’m not sure. I’m not advocating that a computer do our grading for us. Along the lines of many of my posts here, I think the computer can help alleviate some of the busy work and increase our efficiency. We’re the ones saying what’s important. I did another data set where I went through the same process, but acted like the third category was less important than the other two. Here’s the result of using that modified training data:

ml-rubric-output-modified

It’s interesting how this changed the results, but I haven’t dug into them very deeply.

I just know that something needs to change. I had students come to me after final exam grades were put in last week (which, by the way, were raw percentage grades), confused by what their grades meant. The floor for failing grades is a 50, and some students interpreted this to mean that they started with a 50, and then any additional points they earned were added on to that grade. I use the 50 as a floor, meaning that a 30% raw score is listed as a 50% in the final exam grade. We need to improve our communication, and there’s a lot of work to do if the scale isn’t going away.
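The floor described here is simple to state in code, which may help when explaining it. In this minimal sketch the function name `reported_grade` is hypothetical; the floor of 50 and the 30% example come from the paragraph above.

```python
def reported_grade(raw_percent, floor=50):
    """Report the raw percentage unless it falls below the floor."""
    return max(floor, raw_percent)


for raw in (30, 50, 72, 95):
    print(f"raw {raw}% -> reported {reported_grade(raw)}")
```

Under this rule a raw 30 and a raw 50 are reported identically, and nothing is added to scores above the floor.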

I’m interested in the idea of a page that would let you train any rubric of any size through a series of clicks. What thoughts do you have at the end of this exploration?


Technical Details:

I used the JavaScript implementation of a neural network here to do the training. The visualizations were all made using the Raphael JS library.

Rubrics and Numerical Grades – Hacking the 100-Point Scale, Part 3

As part of thinking through my 100-point scale redesign, I’d like you to share some of your thoughts on a rubric scenario.

Rubrics are great for how they clearly classify different components of assessment for a given task. They also use language that, ideally, gives students the feedback to know what they did well, and where they fell short on that assessment. Here’s an example rubric with three performance levels and three categories for a generic assignment:

[Screenshot: example rubric with three performance levels and three categories, with one student’s scores marked]

I realize some of you might be craving some details of the task and associated descriptors for each level. I’m looking for something here that I think might be independent of the task details.

The student shown above has scores of 1, 2, and 3 respectively for the three categories on this assignment, and all three categories are equally important. Suppose also that in my assessment system, I need to identify a student as being a 1, 2, 3, or 4 in the associated skills based on this assessment.

More generally, I want to be able to take a set of three scores on the rubric and generate a performance level of the student that earned them. I’d like to get your sense of classifying students into the four levels this way.

Here are the rubrics I’d like your help with:
rubrics1

I’ve created a Desmos Activity using Activity Builder to collect your thoughts. I chose Activity Builder because (a) Desmos is awesome, and (b) the internet is keeping me from Google Docs.

You can access that activity here.

I’ll be using the results as an input for a prototype idea I have to make this process a bit easier for all involved. Thanks in advance!

Hacking the 100-Point Scale – Part 2

My previous post focused on the main weakness of the 100-point scale which is the imprecision with which it is defined. Is it percentage of material mastered? Homework percentage completion? Total points earned? It might be all of these things, or none of them, depending on the details of one person’s grade book.

Individual departments or schools might try to define uniformity in grading policies, give common final assessments, or spread grading of final exams amongst all teachers to ensure fairness. This might make it easier to compare two students across a course, but still does not clearly define what the grade means. What, however, does it signify that a student in an AP course has an 80 while a student in a regular section of the same course has a 90?

Part of the answer here is based in curriculum. Understanding what students are learning and in what order defines what is being learned, and would add some needed information to compare the AP and regular students just mentioned. The other part is assessment: a well crafted assessment policy based in learning objectives and communicated to a student helps with understanding his or her progress during the school year. I hope it goes without saying that these two components must be present for a teacher to be able to craft and communicate a measure of the student’s learning that students, teachers, parents, and administrators can understand.

At this point, I think the elementary teachers have the right idea. I’ve been in two different school systems now that use a 1 – 4 scale for different skills, with clear descriptors that signify the meaning of each level. Together with detailed written comments, these can paint a picture of what knowledge, skills, and understanding a student has developed during a block of the school year. These levels might describe the understanding of grade level benchmarks using labels such as limited, basic, good, and thorough understanding. These might classify a student using the state of their progress with terms like novice/beginner/intermediate/advanced. The point is that these descriptors are attached to a student and ideally are assigned after reviewing the learning that the student has done over a period of time. I grant that the language can be vague, but this also demands that a teacher must put time into understanding the criteria at his or her school in order to assign grades to a particular student.

When it comes to the 100 point scale, it’s all too easy to avoid this deliberate process. I can report assignments as a series of total point values, and then report a student’s grade as a percentage of the total using grade book software. Why is a student failing? He didn’t earn enough points. How can he do better? Earn more points. How can he do that? Bonus assignments, improving test scores, or by developing better work habits. The ease of generating grades cheapens the deliberate process that may (or may not) have been involved in generating them. Some of the imprecision of the meaning of this grade comes, ironically, from an assumption that the precision of a numerical grade makes it a better indicator. It actually requires more on the part of the teacher to define components of the grade clearly using numerical indicators, and defining these in a way that avoids unintended consequences requires a lot of work to get right.

Numerical grades inform a student’s progress, but don’t tell the whole story. The A-B-C-D-F grading system hasn’t been in use in any of the schools where I’ve taught, but it escapes some of the baggage of the numerical grade in that it requires that the school report somehow what each letter grade represents. An A might be mapped from a 90-100% average in the class, or 85-100 depending on the school. As with a verbal description, there needs to be some deliberate conversation and communication about the meaning of those grades, and this process opens the door for descriptors for what grades might represent. Numerical grades on the 100 point scale lack this specificity because grades on this scale can be generated with nothing more than a calculation. That isn’t to say that a teacher can’t put in the time to make that calculation meaningful, but it does mean it’s easy to give the impression of precision that isn’t there.

Compounding the challenge of its imprecision is the reality that we use this scale for many purposes. Honor roll or merit roll are often based in having a minimum average over courses taken in a given semester. Students on probation, often measured by having a grade below a cut-off score, might not be able to participate in sports or activities. Students with a given GPA have automatic admission to some universities.

I’m not proposing breaking away from grading, and I don’t think the 100 point scale is going away. I want to hack the 100 point scale to do a better job of what it is supposed to do. While technology makes it easier to generate a grade than it used to be, I believe it also provides opportunity to do some things that weren’t feasible for a teacher to do in the past. We can improve the process of generating the grade to be a measure of learning, and in communicating that measure to all stakeholders.

Some ideas on this have been brewing as I’ve started grading finals and packing for the end of the year. Summer is a great time to reflect on what we do, isn’t it?

Hacking The 100-Point Scale – Part 1

One highlight of teaching at an international school is the intersection of many different philosophies in one place. As you might expect, the most striking of these is that of students comparing their experiences. It’s impressive how the experienced students that have moved around quickly learn the system of the school they are currently attending and adjust accordingly. What unites these particularly successful students is their awareness that they must understand the system they are in if they are to thrive there. 

This is the case with teachers, as we share with each other just as much. We discuss different school systems and school structures, traditions, and assessment methods. Identifying the similarities and differences in general is an engaging exercise. In general, these conversations lead to a better understanding of why we do what we do in the classroom. Also, in general, these conversations end with specific ideas for what we might do differently on the next meeting with students.

There is one important exception. No single conversation topic has caused more argument, debate, and unresolved conflict at the end of a staff meeting than the use of the 100-point scale.

The reason it’s so prevalent is that it’s easy to use. Multiply the total points earned by 100, and then divide by the total possible points. What could go wrong with this system that has been used for so long by so many?

There are a number of conversation threads that have been particularly troublesome in our international context, and I’d like to share one here.

“A 75 isn’t a bad score.”

For a course that is difficult, this might be true. Depending on the Advanced Placement course, you can earn the top score of 5 on the exam by earning anywhere between around 65% and 100% of the possible points. The International Baccalaureate exams work the same way. I took a modern physics exam during university on which I earned a 75 right on the nose. The professor said that considering the content, that was excellent, and that I would probably end up with an A in the course. 

The difference between these courses and typical school report cards is that the International Baccalaureate Organization (IBO), College Board, and college professor all did some sort of scaling to map their raw percentages to what shows up on the report card. They have specific criteria for setting up the scaling that goes from a raw score to the 1 – 5 or 1 – 7 scores for AP or IB grades respectively.

What are these criteria? The IBO, to its credit, has a document that describes what each score indicates about a student with remarkable specificity. Here is their description of a student that receives a score of 3 in mathematics:

Demonstrates some knowledge and understanding of the subject; a basic sense of structure that is not sustained throughout the answers; a basic use of terminology appropriate to the subject; some ability to establish links between facts or ideas; some ability to comprehend data or to solve problems.

Compare this to their description of a score of 7:

Demonstrates conceptual awareness, insight, and knowledge and understanding which are evident in the skills of critical thinking; a high level of ability to provide answers which are fully developed, structured in a logical and coherent manner and illustrated with appropriate examples; a precise use of terminology which is specific to the subject; familiarity with the literature of the subject; the ability to analyse and evaluate evidence and to synthesize knowledge and concepts; awareness of alternative points of view and subjective and ideological biases, and the ability to come to reasonable, albeit tentative, conclusions; consistent evidence of critical reflective thinking; a high level of proficiency in analysing and evaluating data or problem solving.

I believe the IBO uses statistical and norm referenced methods to determine the cut scores between certain score bands. I’m also reasonably sure the College Board has a similar process. The point, however, is that these bands are determined so that a given score matches

The college professor used his professional judgement (or a bell curve, I don’t actually know) to make his scaling. This connects the raw score to the ‘A’ on my report card that indicated I knew what I was doing in physics.

The reason this causes trouble in discussions of grades in our school, and I imagine in other schools as well, is the much more ill-defined definition of what percentage grades mean on the report card. Put quite simply, does a 90% on the report card mean the student has mastered 90% of the material? Completed 90% of the assignments? Behaved appropriately 90% of the time? If there are different weights assigned to categories of assignments in the grade book, what does an average of 90% mean?
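To illustrate how the same work can produce different report card numbers, here is a short sketch comparing a total-points percentage with a category-weighted average. The grade book entries and the weights are HYPOTHETICAL, made up for the example.

```python
def percent(earned, possible):
    """The basic 100-point calculation: points earned times 100, over points possible."""
    return 100 * earned / possible


# HYPOTHETICAL grade book: (points earned, points possible) per category.
scores = {"homework": (45, 50), "tests": (140, 200)}
# HYPOTHETICAL category weights for a weighted grade book.
weights = {"homework": 0.4, "tests": 0.6}

total_points_grade = percent(
    sum(earned for earned, _ in scores.values()),
    sum(possible for _, possible in scores.values()),
)
weighted_grade = sum(
    weights[name] * percent(earned, possible)
    for name, (earned, possible) in scores.items()
)

print(f"total points: {total_points_grade:.1f}")
print(f"weighted:     {weighted_grade:.1f}")
```

The two grade books record identical student work, yet the printed numbers differ, so the number alone does not say which policy produced it.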

This is obviously an important discussion for a school to have. Understanding the meaning of the individual percentage grades and what they indicate about student learning should be clear to administrators, teachers, parents, and most importantly, the students themselves. This is a tough conversation.

Who decided that 60% is the percentage of the knowledge I need to get credit? On a quiz on tool safety in the maker space, is 60% an appropriate cut score for someone to know enough? I say no. On the report card, I’d indicate that a student has a 50 as their grade until they demonstrate they can get 100% of the safety questions correct. Here, I’ve redefined the grade in the grade book as being different from the percentage of points earned, however. In other words, I’ve done the work of relating a performance measure to a grade indicator. These should not be assumed to be the same thing, but being explicit about this requires a conversation defining this to be the case, and communication of this definition to students and teachers sharing sections of the same course.

Most of the time, I don’t think there is time for this conversation to happen, which is the first reason I believe this issue exists. The second is the fact that a percentage calculation is mathematically simple and understood as a concept by students, teachers, and parents alike. Grades have been done this way for so long that a grade on the 100-point scale is generally assumed to be this percentage mastered or completed concept.

This is too important to be left to assumption. I’ll share more about the dangers of this assumption in a future post.

Building Functions – Thinking Ahead to Calculus

My ninth graders are working on building functions and modeling in the final unit of the year. There is plenty of good material out there for doing these tasks as a way to master the Common Core standards that describe these skills.

I had a sudden realization that a great source for these types of tasks might be my Calculus materials. Related rates, optimization, and applications of integrals in a Calculus course generally require students to write models of functions and then apply their differentiation or integration knowledge to arrive at a result. The first step in these questions usually involves writing a function, with subsequent question parts requiring Calculus methods to be applied to that function.

I dug into my resources for these topics and found that these questions might be excellent modeling tasks for the ninth grade students if I simply pull out the steps that require Calculus. Today’s lesson using these adapted questions was really smooth, and felt good from a vertical planning standpoint.

I could be late to this party. My apologies if you realized this well before I did.

Problems vs. Exercises

My high school mathematics teacher, Mr. Davis, classified all learning tasks in our classroom into two categories: problems and exercises. The distinction between the two is pretty simple. Problems set up a non-routine mathematical conflict. Once that conflict is resolved, problems cease to be problems – they become exercises. Exercises tend to develop content skills or application of knowledge – problems serve to develop one’s habits of mathematical practice and understanding.

I tend to give a mixture of the two types to my students. The immediate question in an assessment context is whether my students have a particular skill or can apply concepts. Sometimes this can be established by doing several problems of the same or similar type. This is usually the situation when students sign up for a reassessment on a learning standard. In cases where I believe my students have overfit their understanding to a particular question type, I might throw them a problem – a new task that requires higher levels of understanding. I might also give them a task that I know is similar to a question they had wrong last time, with a twist. What I have found over time is that there needs to be a difference between what I give them on a subsequent assessment, or I won’t get a good reading on their mastery level.

The difficulty I’ve encountered over the past few years learning to use SBG has been curating my own set of problems and exercises for assessment. I have textbooks, both electronic and hard copy, and I’ve noted down the locations of good problems in analog and digital forms. I’ve always felt the need to guard these and not share them with students so that they don’t become exercises. My sense is that good problems are hard to find. Good exercises, on the other hand, are all over the place. This also means that if I’ve given Student A a particular problem, I have to find an entirely different one for Student B in case the two pool their resources. In other words, Student A’s problem then becomes Student B’s exercise. I haven’t found that students end up thinking that way, but I still feel weird about using the same problem multiple times.

What I’ve always wanted was a source of problems that somehow straddled the two categories. I want to be able to give Student A a specific problem that I carefully designed for assessing a particular standard, and student B a different manifestation of that same problem. This might mean different numbers, or a slight variation that still assesses the same thing. I don’t want to have to reinvent the problem every single time – there must be a way to avoid repeating that effort. By carefully designing a problem once, and letting, say, a computer make randomized changes to different instances of that problem, I’ve created a task I can use with different students. Even if I’m in the market for exercises, it would be nice to be able to create those quickly and efficiently too. Being able to share that initial effort with other teachers who also share a need would be a bonus.
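As a rough illustration of "design once, randomize per student," the sketch below derives a repeatable seed from a standard, a student, and an attempt number, then uses it to build one instance of a problem template. Everything here is an assumption for illustration: the names `instance_seed` and `slope_problem`, the standard label, and the slope question itself.

```python
import hashlib
import random


def instance_seed(standard, student, attempt):
    """Derive a repeatable seed so the same request always gives the same instance.

    hashlib is used because Python's built-in hash() of strings changes between runs.
    """
    key = f"{standard}|{student}|{attempt}".encode()
    return int(hashlib.sha256(key).hexdigest(), 16)


def slope_problem(seed):
    """One carefully designed template; only the numbers vary per instance."""
    rng = random.Random(seed)
    x1, y1 = rng.randint(-5, 5), rng.randint(-5, 5)
    run = rng.choice([1, 2, 4, 5])     # nonzero, so the slope is always defined
    rise = rng.randint(-6, 6)
    x2, y2 = x1 + run, y1 + rise
    return {
        "prompt": f"Find the slope of the line through ({x1}, {y1}) and ({x2}, {y2}).",
        "answer": rise / run,
        "points": ((x1, y1), (x2, y2)),
    }


# Hypothetical standard label and student names.
for student in ("Student A", "Student B"):
    problem = slope_problem(instance_seed("LIN.2", student, attempt=1))
    print(student, "|", problem["prompt"], "| key:", problem["answer"])
```

Bumping `attempt` for a reassessment gives the same student a fresh instance of the same underlying problem, while the seed keeps every instance reproducible for grading.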

I think I’ve made an initial stab at creating something to fit that need.

Taking Time Learning Math: A Student’s Perspective

Yesterday was our school’s student led conference day. I’ve written previously on how proud these days make me as an educator. When students do genuine reflection on their learning and share the ups and downs of their school days, it’s hard not to see the value of this as an exercise.

During one conference, a student shared a fascinating perspective on her learning in math. This is not the usual level of specificity that we get from our students, so I am eager to share her thinking. Here’s the student’s comment during the conference:

“It isn’t that I don’t like math. Learning takes time in math, and I don’t always get the time it takes to really understand it.”

I asked her for further clarification, and this was her response:

…Math is such an interesting subject that can be “explored” in so many different ways, however, in school here I don’t really get to learn it to a point where I say yeah this is what I know, I fully understand it. We move on from topic to topic so quickly that the process of me creating links is interrupted and I practice only for the test in order to get high grades.

It’s certainly striking to get this sort of feedback from a student who is doing all the things we ask her to do. The activities this student is doing in class are not day-after-day repetitions of “I do, we do, you do” – we do a range of class activities that involve exploring, questioning, and interacting with other students.

This student’s comment is about limitations of time. She isn’t saying that we aren’t doing enough of X, Y, or Z – quite the contrary, she just is asking for time to let it sink in. She doesn’t answer the question of what that time looks like, but that’s not her job, it’s ours.

I know I always feel compelled to nudge a class forward in some way. This doesn’t mean I am moving through material more quickly, but I do push for increased depth, intuition, or quality conversation about the content in every class period. Her comment makes me realize that something still stands to be improved. Great food for thought for the weekend.
