The AI algorithms grading student essays are a black box.
Algorithms are grading student essays across the country. So can artificial intelligence really teach us to write better?
Todd Feathers, who wrote about AI essay grading for Motherboard, called up every state in the country and found that at least 21 states use some form of automated scoring.
“The algorithms are prone to a couple of flaws. One is that they can be fooled by any kind of nonsense gibberish sophisticated words. It looks good from afar but it doesn’t actually mean anything. And the other problem is that some of the algorithms have been proven by the testing vendors themselves to be biased against people from certain language backgrounds.”
Feathers wasn’t able to pin down exactly how many students are affected by this. But here’s what we do know: These programs are being used to grade students of all ages and levels, from high school students to students applying to grad school, from middle school students even down to those in elementary school.
It’s hard to figure out who’s affected by AI grading because no single program is being used. There are a bunch of different algorithms, made by a bunch of different companies. But they’re all made in basically the same way: First, an automated scoring company looks at how human graders score essays. Then, based on that data, the company trains an algorithm to predict how a human grader might score an essay. Depending on the program, those predictions can be consistently wrong in the same way. In other words, they can be biased. And once those algorithms are built, explains Reset host Arielle Duhaime-Ross, they can reproduce those biases at a huge scale.
And the worst part? You can’t cross-examine an algorithm and get to the bottom of why it made a specific decision. It’s a black box.
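To make the training process described above concrete, here is a minimal sketch in Python. It is not any vendor’s actual system. The essays, the human scores, and the single feature (average word length) are all invented for illustration, and real scoring engines use far more data and many more features. The sketch fits a straight line to human-assigned scores and then shows how a model that has learned "longer words mean higher scores" can rate sophisticated-sounding gibberish above a plain, sensible sentence.

```python
from statistics import mean


def avg_word_length(essay: str) -> float:
    """Hypothetical feature: mean number of characters per whitespace-separated word."""
    words = essay.split()
    return mean(len(w) for w in words) if words else 0.0


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Invented training data: (essay text, score a human grader gave it).
# In this toy set, graders happened to reward essays with longer words.
training = [
    ("The dog ran to the park and sat", 2),
    ("My summer trip was fun because we swam a lot", 3),
    ("Education shapes communities through shared knowledge", 5),
    ("Persuasive arguments require evidence, structure, and clarity", 6),
]

xs = [avg_word_length(essay) for essay, _ in training]
ys = [score for _, score in training]
slope, intercept = fit_line(xs, ys)


def predict(essay: str) -> float:
    """Predict the score a human grader might give, using only the learned line."""
    return slope * avg_word_length(essay) + intercept


plain = "We should plant more trees so the town has shade"
gibberish = "Multitudinous paradigmatic epistemological juxtaposition notwithstanding"

print("plain:", round(predict(plain), 2))
print("gibberish:", round(predict(gibberish), 2))
```

Because the toy model only sees word length, it cannot tell a meaningful sentence from a string of long words, and it applies whatever pattern it picked up from the graders to every essay it scores.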
Listen to the entire discussion on this episode of Reset. Below, we’ve also shared a lightly edited transcript of the episode. In addition to Feathers, you’ll hear from Utah parent David Hart; Aoife Cahill, a managing senior research scientist at Educational Testing Service; and Vox reporter Sigal Samuel.
Subscribe to Reset on Apple Podcasts, Stitcher, Spotify, or wherever you listen to podcasts.
Arielle Duhaime-Ross spoke with Aoife Cahill, a managing senior research scientist at Educational Testing Service. ETS’s algorithm grades the GRE and other standardized tests.
It’s very possible that programs can be biased if you don’t train them correctly. So, you want to make sure that the data that you use to feed the system to train the system is as unbiased as possible. But it is very possible that you can introduce it because of course the systems are learning from humans. So, [if the] dataset you happen to choose is biased, the machine is going to learn that bias.
When you’re picking a dataset, how do you even know if that dataset might be biased and then how do you know if that’s actually affecting the machine?
It’s a very challenging topic, actually. We have a number of checks in place. We first of all try and make sure that the humans that are scoring the essays in the first place are well-trained. They get monitored to make sure that they’re sticking to the rubrics. We make sure that responses would be scored by multiple humans to make sure that they’re all roughly in agreement. But it’s not perfect; it’s not a perfect system. It can happen potentially that you might end up with a biased dataset.
We spoke to a parent who is frustrated that one of these language systems wasn’t really teaching his child how to write. He thought the program was teaching his kid how to write big words rather than how to write well. How would you respond to that?
He’s probably not wrong. At least when we develop tools that try and support learners of writing, we try and collaborate with the writing community to try and find out what are the things that people who are researching writing, what are the things that they teach? What are the things that they find important? Having a system teach big words is, you know, it’s a particular skill but it’s maybe not core to being able to write well. The ability to write well has a whole range of skills; maybe vocabulary is one piece of it, but it’s not the whole thing.
You read the Motherboard article. What was your reaction to it?
What I felt was that people don’t always get the full picture of how these systems are used. These systems can be used inappropriately and if they’re allowed then of course there’s going to be problems with them. But I think these systems actually can provide a lot of benefit and support to teachers and students if they’re used appropriately. And I think there was some … My biggest disappointment with the article was that it didn’t give that side of the thing.
Duhaime-Ross also spoke with Vox reporter Sigal Samuel, who’s written extensively about artificial intelligence. She’s also a novelist. And, recently, she’s been applying AI to her writing.
I had a bizarre thought enter my head when I first heard about these language models which was, “I wonder if, at some point, these AIs are going to be able to write my novel ideas better than I could.”
I decided to sort of like test this by actually taking the novel that I published in 2015, which is called The Mystics of Mile End, and plunking paragraphs from that novel into GPT-2. It’s at https://talktotransformer.com.
So you can actually just go on this website and put in like a couple sentences and see what happens?
Exactly. It’s super fun. I put in three, four sentences from my novel, and then it generates a bunch of text, a continuation. The algorithm is sort of analyzing your words, your syntax, and then it’ll spit out how it thinks your text should be continued.
Here, I’ll give you an example. There’s one scene where one of my characters, a young woman, is actually kind of losing her sanity. Her father has died — uh, spoiler. And she’s actually in a moment of great distress eating this manuscript that he had been writing. So I’ll read you a little bit of what I wrote and then what the AI wrote.
“Letters stumbled into my mouth and I swallowed them; ink poured down my throat and I drank it.” And then the AI says, “Words I didn’t know flowed through my skin and I drank them and drank them and drank them all over again. I ate, sated, until I vomited.”
The AI came up with this great idea, which is that my character, after gobbling up her father’s words in a sort of strange attempt to reconnect with him, her body has this violent physical reaction to this attempt and she vomits, and I love that idea. And I didn’t think of it. And in retrospect it would’ve been perfect.
How does that make you feel as an artist, as a writer? I feel like all I can think is that was kind of hurtful.
I mean, part of me is like, “Well, damn.” I spent years honing my craft and getting a degree in creative writing. But honestly the bigger part of me is just pretty delighted because A) this kind of new AI is just super cool and it’s a fun toy to play with, but B) I really sincerely do think that it’s going to make my future writing stronger. And I’m excited for how I’m gonna get to use GPT-2 to write my next novel.
You’re actually going to use this to write your novel. How are you gonna use it?
One of the next projects I’m working on is a children’s book. It’s about two little girls who discover a hotel with infinite rooms and there’s a black hole in the middle of it. And so they jump into the black hole, and obviously there’s a ton of wormholes in the black hole. So they have to figure out how to navigate them.
As a writer, you don’t always have the luxury of being in the middle of an MFA workshop or just having friends who you can bat these ideas around with. So it’s kind of nice to have this machine sounding board slash collaborator.
You sound really positive about this but I can only assume that there are limitations. So what is it bad at?
It can be really useful on the localized level, helping you think of specific questions or writing a few terrific sentences, but it’s really bad at larger story structure. It can only generate something based on what it’s already … what you’ve already put down. It can’t generate like a whole narrative arc, a larger plot structure that you need for a novel and that makes a novel satisfying.
Do you think it could get there at some point?
It’s conceivable. We’re not anywhere close to that. But you know, it has been said that in all of literature there are only six main story arcs. There’s like this Cinderella arc there. You know, there’s rags to riches, there are specific arcs that are common to a lot of our literature. It’s conceivable to me that an AI could be taught to mimic those basic templates and then kind of like slot in the specifics of characters and words and scenes. I am skeptical, though, that an AI by itself without any human involvement is ever going to write a Pulitzer Prize-winning novel.