January 16, 1998
New Research Casts Doubt on Value of Student Evaluations of Professors
Studies find that faculty members dumb down material and inflate grades to get good reviews
By ROBIN WILSON
As an assistant professor of marketing, Robert S. Owen knows the importance of keeping the customer satisfied. His job depends on it.
That's why, in his courses at the State University of New York College at Oswego, he gives multiple-choice rather than essay exams and asks students to evaluate research papers rather than write their own. A student who questions the fairness of a question on a test might receive extra credit simply for expressing interest.
"If students come to my office," he says, "I have to make sure they walk out happy."
Dr. Owen learned that lesson the hard way. Three years ago, he lost his job at Bloomsburg University of Pennsylvania when students gave his teaching mixed reviews. He describes himself as a casualty of an era in which administrators increasingly rely on student evaluations of teaching to decide who gets tenure and who doesn't.
"The student in college is being treated as a customer in a retail environment," he says, "and I have to worry about customer complaints."
Now academics are paying a lot of attention to two new studies that raise questions about the validity of student ratings of teaching and the tendency of professors to "teach to the evaluations." The reports on the studies, published last fall in Change magazine and American Psychologist, say professors who want high ratings have learned that they must dumb down material, inflate grades, and keep students entertained. The ratings can make or break a professor's career even though they do not always accurately measure teaching skills, the authors say.
"Evaluations may encourage faculty to grade easier and make course workloads lighter," says Anthony G. Greenwald, a psychology professor at the University of Washington who wrote the article in American Psychologist. He and Gerald Gilmore, director of the university's Office of Educational Assessment, examined student ratings of hundreds of courses at Washington and found that professors who are easy graders receive better evaluations than do professors who are tougher.
The Washington study and the one described in Change -- which showed that by being more enthusiastic, a professor sharply improved his student ratings even though the students did not actually learn more -- have shaken a long-standing consensus among researchers. Dozens of scholars in the United States and abroad have agreed for years that student evaluations are a good measure of a teacher's skills. Nearly 2,000 studies have been completed on the topic, making it the most extensive area of research on higher education.
As the number of studies supporting the value of student evaluations has grown, so has their use. Only about 30 per cent of colleges and universities asked students to evaluate professors in 1973, but it is hard to find an institution that doesn't today. And student ratings carry more and more weight, especially on campuses where the focus is on teaching. Such evaluations are now the most important, and sometimes the sole, measure of an instructor's teaching ability.
Good evaluations don't guarantee a professor tenure, particularly at research universities. "If you get good evaluations, that is just one hurdle you've cleared," says Tom Dickson, an associate professor of journalism at Southwest Missouri State University. But "if you don't get good evaluations," he says, "it doesn't matter how else you do." Dr. Dickson is trying to persuade administrators at his university to give less importance to student ratings.
Teaching evaluations were initiated on many U.S. campuses in the 1960s, as students clamored for more of a say in their education. Many institutions in Europe, Canada, Australia, and Asia have since adopted them as well. Typically, evaluation forms are passed out before the final exam in a course. Students are asked to rate a professor's communications skills, knowledge of the subject matter, ability to organize material, and fairness in grading. Many of the forms are designed to be analyzed by a computer, which responds with a numerical rating for each professor.
Many of the forms ask students to rate a professor's teaching techniques on a scale from "very effective" to "ineffective." They also ask students to give written comments, which can be the most devastating to professors' careers, and which often have nothing to do with teaching.
Undergraduates have been known to comment on a professor's clothing, hairstyle, and personal hygiene. Stephen J. Ceci, a Cornell University professor who wrote the article on the study in Change, the monthly magazine of the American Association for Higher Education, says one student suggested in an evaluation that he stop wearing a pair of orange corduroy pants. "You look like you work at Hardees," the student wrote.
Dr. Ceci acknowledges that the pants were a bit ostentatious. He also remembers what another student wrote in evaluating a female professor who had given a lecture in one of his courses. Asked if she should change anything about her presentation, the student wrote: "She shouldn't wear that outfit. Her hips are too big."
When students do stick to the subject, they can be just as critical. They complain that courses are boring and put them to sleep. Sometimes the evaluators get really nasty. A few years ago, a physics professor at Cornell who was using a pendulum as part of a lecture on friction had to duck to get out of its way. One student later wrote that the course would have been better had the pendulum hit the professor instead of the wall behind him.
Wendy L. Williams, a professor of human development at Cornell who wrote the Change article with Dr. Ceci, says just a few "bitter, nasty comments" can raise questions about a professor's competence. "It is worth examining whether we want people's entire careers to be derailed by a bunch of snitty undergraduates who didn't want to do an extra term paper," she says.
Proponents of the evaluations say mediocre teachers elicit low ratings from students. They note that service in most industries is judged by the customer. Besides, they contend, no one has developed a better measure of professors' performance. Faculty members have always been squeamish about passing judgment on each other's teaching, and most instructors don't relish having colleagues sit in on their lectures. "Research on peer review shows that if you have three or four different people go in and sit in on lectures by the same teacher, there will be relatively little agreement among them," says Herbert W. Marsh, a professor of education and dean of graduate research studies at the University of Western Sydney, in Australia, who has studied student evaluations for 25 years. "With student ratings, you've got someone who's sat through 40 hours of a course."
Undergraduates can be sincere in their comments, offering praise and acknowledging that a course has changed their lives. Dan Mansfield, a junior majoring in psychology at the University of Michigan, says students' comments about teaching are worthwhile. "I always try to be thoughtful about what I write. If I'm able to tell my professors what I like or dislike, I'm going to ultimately get a better education."
Michael Theall, an associate professor of educational administration at the University of Illinois at Springfield, calls the evaluations "valid measures of students' satisfaction with their experience." He and other researchers point to a set of about four dozen studies that have tested the validity of student ratings. Researchers compared the quality of teaching in several sections of the same course and gave students in each section the same final exam. The studies found that sections of students who did well overall tended to give higher ratings to the instructor than did those of students who did poorly. Many scholars read that data as confirmation of the worth of student evaluations as a measure of how well professors teach. For the researchers, it also served to show that how severely a professor grades has little to do with students' comments, because in these experiments, the exams were designed and graded by outsiders.
Dr. Marsh, the professor at the University of Western Sydney, has found that professors are likely to agree with students in choosing the best teachers on a campus. His research in Australia, he says, persuades him that high ratings are related to effectiveness. Professors have raised their ratings, he has found, by getting help to improve areas of their teaching that students have complained about.
Scholars say the tenure-and-promotion system works to counteract any incentive on the part of professors to inflate grades in order to improve their ratings. "If a promotion-and-review committee suspects somebody is grading too high, they really get dinged," says Wilbert J. McKeachie, a professor of psychology at Michigan. "Even though they may have gotten higher teaching evaluations, it is not popular among peers to give higher grades than other people are giving."
According to the article in Change magazine, however, giving high grades isn't the only way to boost evaluations. Dr. Ceci had taught developmental psychology at Cornell for nearly 20 years and was drawing mediocre reviews. Administrators had even asked him to attend a workshop with a "media consultant" to try to spice up his lectures.
To Dr. Ceci, the situation became an opportunity for research: What if he could improve his ratings simply by being a more enthusiastic lecturer, as the media consultant had advised? What would that say about the value of student ratings?
During a recent spring semester (Dr. Ceci would not identify which one), the professor taught developmental psychology covering the same material as in the previous semester, and using the same textbook he had used for years. But he added more hand gestures to his teaching style, varied the pitch of his voice, and generally tried to be more exuberant. The outcome was astounding: Students' ratings of Dr. Ceci soared. They even gave higher marks to the textbook, a factor that shouldn't have been affected by differences in his teaching style.
Despite the higher ratings, however, Dr. Ceci found no real improvement in students' performance on exams in the spring compared to those in the fall. He concluded that his new teaching style was probably no more effective than his old one.
In their article in Change, Dr. Ceci and Dr. Williams offered a blunt indictment of evaluations: "Student ratings are far from the bias-free indicators of instructor effectiveness that many have touted them to be. Student ratings can make or break the careers of instructors on grounds unrelated to objective measures of student learning, and for factors correctable with minor coaching."
Their findings have been dismissed by many scholars who have spent their careers assessing the validity of student evaluations. Of course a professor who is enthusiastic will satisfy students, and may even encourage them to retain more information, proponents of evaluations say. That's simply common sense and a basic tenet of good teaching. It doesn't mean that a bad teacher can get great ratings simply by being entertaining, they say.
What the Cornell professors found, though, rings true in the everyday lives of many other professors. Peter Sacks has written a book in which he recounts how poor evaluations from students almost cost him his job as an assistant professor of journalism at an unidentified community college on the West Coast. He salvaged his career, he says, after changing his teaching style.
"In my mind, I became a teaching teddy bear," he writes in Generation X Goes to College (Carus Publishing, 1996). "Students could do no wrong, and I did almost anything possible to keep them happy, all of the time, no matter how childish or rude their behavior, no matter how poorly they performed in the course."
His colleagues, too, told him that if he wanted to improve his ratings and get tenure, he should be more entertaining, both in and out of class, Mr. Sacks writes. "After the field trip, meet the class for a pizza dinner," one of his colleagues suggested. "Bring donuts," advised another.
The superficial changes he made in his teaching style led to significantly better ratings from students, he writes, although he doesn't believe that students learned any more. He earned tenure in 1995, but the following year -- after four years at the community college -- he left teaching for free-lance writing. (While teaching at the college, he used his first name, which he won't reveal now; Peter is his middle name.) "This was higher education as a consumeristic, pandering enterprise," he says in an interview. "The love of learning was completely whitewashed out."
Paul A. Trout, an associate professor of English at Montana State University, says most professors have learned how to get good ratings from students. "Some professors stick to their guns and get punished. But an awful lot of people have figured out how to get their numbers high enough so that the evaluations are not a liability to them. People are changing their teaching, the rigor of their courses, to insure they get tenure."
As a young assistant professor of political science at Kalamazoo College, Jeremy D. Mayer is acutely aware of how important student ratings are. He says he has always received topnotch ratings from students, in part because he is a good teacher. But he is also aware that this generation of students, raised on Sesame Street and MTV, wants to be entertained. "A college professor today, if he wants to be effective, should be able to be a bit of a Quentin Tarantino in the classroom," he says.
Mr. Sacks says a student once told him: "We want you guys to dance, sing, and cry." But a culture that allows students to determine what is good teaching, he says, "does not lend itself to the kind of critical, messy thinking that we need to be encouraging in higher education."
Copyright (c) 1998 by The Chronicle of Higher Education
Section: The Faculty
Do the Best Teachers Get the Best Ratings?
Nate Kornell* and Hannah Hausman
Department of Psychology, Williams College, Williamstown, MA, USA
Edited by: Lynne D. Roberts, Curtin University, Australia
Reviewed by: Sherri Horner, Bowling Green State University, USA; Ronny Scherer, Centre for Educational Measurement at the University of Oslo, Norway
*Correspondence: Nate Kornell, nkornell@gmail.com
This article was submitted to Educational Psychology, a section of the journal Frontiers in Psychology
Received 2016 Jan 21; Accepted 2016 Apr 6.
Copyright © 2016 Kornell and Hausman.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
We review recent studies that asked: do college students learn relatively more from teachers whom they rate highly on student evaluation forms? Recent studies measured learning at two time points. When learning was measured with a test at the end of the course, the teachers who got the highest ratings were the ones who contributed the most to learning. But when learning was measured as performance in subsequent related courses, the teachers who had received relatively low ratings appeared to have been most effective. We speculate about why these effects occurred: making a course difficult in productive ways may decrease ratings but enhance learning. Despite these limitations, we do not suggest abandoning student ratings, but we do recommend that student evaluation scores not be the sole basis for evaluating college teaching and that they be recognized for what they are.
Keywords: student evaluations of teaching, teacher ratings, long-term learning, grades, ratings
Calvin: “Here’s the latest poll on your performance as dad. Your approval rating is pretty low, I’m afraid.” Dad: “That’s because there’s not necessarily any connection between what’s good and what’s popular. I do what’s right, not what gets approval.” Calvin: “You’ll never keep the job with that attitude.” Dad: “If someone else offers to do it, let me know.”
–Calvin and Hobbes, Bill Watterson, February 13, 1994
Student evaluations of teaching are one of the main tools to evaluate college teaching (Clayson, 2009; Miller and Seldin, 2014). Ratings of factors like clarity, organization, and overall quality influence promotion, pay raises and tenure in higher education. Thus, we asked: Do better teachers get better ratings? Being a “better teacher” can be defined many ways, such as teaching that inspires students to work hard or get interested in a subject, but in this article we equate good teaching with meaningful student learning. Therefore, our question is, do students give the highest ratings to the teachers from whom they learn the most? Given the ubiquity and importance of teacher ratings in higher education, we limited our review to research conducted with college students.
A Framework for Understanding Teacher Ratings
Figure 1 presents a framework for understanding teacher ratings. This framework is simply a way of organizing the possible relationships among what students experience in a course, the ratings they give their instructor, and how much they learn. In this article, “ratings” refers to students’ responses to a single survey question about overall instructor quality. While students also typically rate instructors on preparedness, content knowledge, enthusiasm, clarity of lectures, etc., responses to these questions were not the primary focus of the studies we reviewed.
Figure 1. Framework for understanding possible influences on student evaluations of teaching.
In the figure, educational experience is the broad term we are using to refer to everything students experience in connection with the course they are evaluating (e.g., teacher age, gender, and charisma, topic of the course, font used on class handouts, and temperature in the classroom). The first course is the one taught by the professor being evaluated. Performance in the first course reflects students’ knowledge of the information that course was designed to teach. Subsequent course performance means how those same students do in related, follow-up courses. Subsequent course performance is included because, for example, a good Calculus I teacher should have students who do relatively well in follow-up courses that rely on calculus knowledge, like Calculus II and engineering. Our main interest was the relationship between how college students evaluate an instructor and how much they learn from that instructor, which is represented by the C and D links in Figure 1.
Educational Experience and Ratings
Some links in Figure 1 have been researched more extensively than others. “Literally thousands” (Marsh, 2007, p. 320) of articles have examined the relationship between educational experience and teacher ratings—that is, the B link in Figure 1. They have identified an extensive list of student, instructor, and course characteristics that can influence ratings, including student gender, prior subject interest, and expectations for the course; instructor gender, ethnicity, attractiveness, charisma, rank, and experience; and course subject area, level, and workload (for reviews, see Neath, 1996; Marsh and Roche, 1997; Wachtel, 1998; Kulik, 2001; Feldman, 2007; Pounder, 2007; Benton and Cashin, 2012; Spooren et al., 2013).
This literature is difficult to succinctly review because the results are so mixed. For many of the questions one can ask, it is possible to find two articles that arrive at opposite answers. For example, a recent randomized controlled experiment found that students gave online instructors who were supposedly male higher ratings than instructors who were supposedly female, regardless of their actual gender (MacNell et al., 2014). On the other hand, Aleamoni (1999) referred to the effect of instructor gender on teacher ratings as a “myth” (p. 156). Other studies suggest that the relationship between a teacher’s gender and ratings may depend on the student’s gender as well as whether the teacher’s behavior conforms to gender stereotypes (for a review see Pounder, 2007; e.g., Boring, 2015). One reason studies come to such different conclusions may be the fact that many studies do not exercise high levels of experimental control: They do not experimentally manipulate the variable of interest or do not control for other confounding variables. But variable results may also be inherent in effects of variables like instructor gender, which might not be the same for all types of students, professors, subjects, and course levels. Finally, the mixed results in this literature may be due to variability in how different teacher evaluation surveys are designed (e.g., negatively worded questions, number of response options, and neutral response options) and administered (e.g., was the teacher present, was a tough assignment handed back just prior, did it take place online, were there incentives for filling it out; Wachtel, 1998; Spooren et al., 2013; Stark and Freishtat, 2014).
The point is that it is difficult to draw general conclusions from existing research on the relationship between teacher ratings and students’ educational experiences. Our goal is not to review this literature in detail, but to discuss what it means for the question of whether better teachers get higher ratings. The educational experience variables that affect ratings can be classified into two categories: those that also affect learning and those that do not. Presumably, instructor attractiveness and ethnicity should not be related to how much students learn. Instructor experience, however, could be. Instructors who have taught for a few years might give clearer lectures and assign homework that helps students learn more than instructors who have never taught before (McPherson, 2006; Pounder, 2007). If teacher ratings are mostly affected by educational experience variables that are presumably not related to learning, like instructor attractiveness and ethnicity, then teacher ratings are not a fair way to identify the best teachers. It is possible, though, that teacher ratings primarily reflect student learning, even if some variables like attractiveness and ethnicity also affect ratings to a much smaller degree. However, most of the studies covered in the reviews of the B link do not measure student learning objectively, if at all. Therefore, the studies identify educational experience factors that affect ratings, but do not shed light on whether students give higher ratings to teachers from whom they learn the most. Thus, they are not directly relevant to the present article.
Features of the Ideal Study of Ratings and Student Learning
To answer our main question—whether teachers with higher ratings engender more learning (i.e., the C and D links in Figure 1)—a study would, ideally, have all of the characteristics described in Table 1. These features describe what a randomized controlled experiment on the relationship between ratings and learning would look like in an educational setting.
Table 1. Ideal features of a study that measures the relationship between ratings and learning.
The features in Table 1 are desirable for the following reasons. First, a lab study cannot simulate spending a semester with a professor. Second, if the subsequent courses are not required, the interpretation of the results could be obscured by differential dropout rates. For example, a particular teacher would appear more effective if only his best students took follow-up courses. Third, random assignment is necessary or else preexisting student characteristics could differ across groups—for example, students with low GPAs might gravitate toward teachers with reputations for being easy. Fourth, comparable (or identical) measures of student knowledge allow for a fair comparison of instructors. (Course grades are not a valid measure of learning because teachers write their own exams and the exams differ from course to course.) Next, we review the relationship between ratings and first course performance (i.e., in the professor’s own course). Then we turn to newer literature on the relationship between teacher ratings and subsequent course performance.
Teacher Ratings and First Course Performance
A wealth of research has examined the relationship between how much students learn in a course and the ratings they give their instructors (i.e., the C link in Figure 1). This research has been synthesized in numerous reviews (Abrami et al., 1990; Cashin, 1995; Kulik, 2001; Feldman, 2007; Marsh, 2007) and meta-analyses (Cohen, 1981, 1983; Dowell and Neal, 1982; McCallum, 1984; Feldman, 1989; Clayson, 2009). The studies included in these meta-analyses had the following basic design: Students took a course with multiple sections and multiple instructors. Objective measures of knowledge (e.g., a common final exam or essay) and teacher evaluations were administered at the end of the course.
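As a rough illustration of the multi-section design described above, the following Python sketch averages student ratings and common-exam scores within each section and then correlates the section means. The data, the section labels, and the function names (`pearson`, `section_level_correlation`) are made up for illustration; this is not the procedure of any particular meta-analysis.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def section_level_correlation(rows):
    """rows holds one (section, rating, exam_score) tuple per student.

    Ratings and scores are averaged within each section first, so the
    correlation is computed across sections, not across students.
    """
    ratings, scores = defaultdict(list), defaultdict(list)
    for section, rating, score in rows:
        ratings[section].append(rating)
        scores[section].append(score)
    sections = sorted(ratings)
    return pearson(
        [mean(ratings[s]) for s in sections],
        [mean(scores[s]) for s in sections],
    )


# Hypothetical data: three sections, two students each, one common exam.
rows = [
    ("s1", 4, 80), ("s1", 5, 70),
    ("s2", 3, 65), ("s2", 3, 75),
    ("s3", 2, 60), ("s3", 4, 70),
]
r = section_level_correlation(rows)
print(round(r, 3))
```

With only three made-up sections the resulting number carries no evidential weight; the sketch only shows where the aggregation step sits relative to the correlation.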
The evidence from all of the meta-analyses suggests that there is a small positive relationship between ratings and first course performance: better teachers, as measured by the average of their students’ end-of-semester performance, do get slightly higher average ratings. Table 2 shows the mean correlation between an overall measure of teacher effectiveness and first course performance. Cohen (1981) reported the highest average correlation of 0.43 with a 95% confidence interval ranging from 0.21 to 0.61. This means that teacher ratings account for only about 18% of the variability in how much students learn, at best. Clayson (2009) reported the lowest mean correlation of 0.13 with a standard error of 0.19, concluding the correlation between ratings and first course performance is not significantly different from zero. Table 2 also shows that first course performance was positively correlated with other evaluation questions as well, such as ratings of the instructor’s preparation, the organization of the course material, and how much students thought they had learned. The studies in Table 2, and the studies described in the sections that follow, did not examine individual students’ ratings and performance; they measured something more interesting for present purposes: the correlation at the course-section level between an instructor’s mean ratings and his or her section’s mean test scores. (For a technical take on why and how to aggregate individual student ratings at the course-section level to estimate teacher effectiveness, see Lüdtke et al., 2009; Marsh et al., 2012; Scherer and Gustafsson, 2015).
Table 2. Mean correlations between ratings and first course performance.
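The arithmetic behind the two extreme estimates can be checked in a few lines of Python. The figures come from the text; the significance check is an assumption on our part (a normal approximation with a two-sided 5% cutoff), since the text does not say which test Clayson (2009) used.

```python
# Cohen (1981): highest mean correlation reported in the text.
r_cohen = 0.43
variance_explained = r_cohen ** 2  # share of variability accounted for

# Clayson (2009): lowest mean correlation and its standard error.
r_clayson, se_clayson = 0.13, 0.19
z = r_clayson / se_clayson  # estimate divided by its standard error

# Assumed criterion: |z| > 1.96 for a two-sided test at the 5% level.
significant = abs(z) > 1.96

print(f"variance explained: {variance_explained:.1%}")
print(f"z = {z:.2f}, significant at 5%: {significant}")
```

Squaring 0.43 gives roughly 18%, matching the figure in the text, and the Clayson estimate is well under two standard errors from zero.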
Teacher Ratings and Subsequent Course Performance
A few recent studies have examined the relationship between ratings, first course performance, and, crucially, subsequent course performance, which has been advocated as a measure of long-term learning (Johnson, 2003; Yunker and Yunker, 2003; Clayson, 2009; Weinberg et al., 2009; Carrell and West, 2010; Braga et al., 2014). Performance in subsequent related courses is arguably more important than first course performance because the long-term goal of education is for students to be able to make use of knowledge after a course is over.
It is important to distinguish between student knowledge and teacher contribution to student knowledge. Students who do well in the first course will tend to do well in subsequent related courses (e.g., Hahn and Polik, 2004; Lee and Lee, 2009; Kim et al., 2012), but raw student performance is not the issue at hand when evaluating teacher effectiveness. The issue is how much the teacher contributes to the student’s knowledge. The studies we describe next used value-added measures to estimate teacher contribution to knowledge.
Since there is typically a positive relationship between ratings and first course performance, we might also predict a positive relationship between ratings and subsequent performance. Yet, three recent studies suggest that ratings do not predict subsequent course performance (Johnson, 2003; Yunker and Yunker, 2003; Weinberg et al., 2009). These studies represent an important step forward, but they are open to subject-selection effects because students were not assigned to teachers randomly and follow-up courses were not required; additionally, only Yunker and Yunker (2003) used an objective measure of learning (a common final exam).
Only two studies, conducted by Carrell and West (2010) and Braga et al. (2014), have all of the characteristics in Table 1. We review these studies next. Carrell and West (2010) examined data collected over a 10-year period from over 10,000 students at the United States Air Force Academy. This dataset has many virtues. There was an objective measure of learning because students enrolled in different sections of a course took the same exam. (The professors could see the exams before they were administered.) Lenient grading was not a factor because each professor graded test questions for every student enrolled in the course. Students were randomly assigned to professors. Finally, the effectiveness of the nearly 100 Calculus I instructors was measured in two ways, once based on their students’ grades in Calculus I and once based on their students’ grades in related follow-up courses. The concern that only the best students in certain professors’ courses chose to take a follow-up course was eliminated because follow-up courses were mandatory.
Carrell and West (2010) used value-added scores to measure teacher effectiveness. For each student in a particular Calculus I section, the student’s characteristics (e.g., incoming test scores) and characteristics of the class (e.g., class size) were used to predict the student’s grade. The predicted grade was compared to the student’s actual grade. The difference between the actual and predicted grade can be attributed to the effect of the teacher, since non-teacher variables were controlled for. A single value-added score was then computed for each teacher. This score was meant to capture the difference between the actual and predicted grades for all the students in their course section. A high value-added score indicates that overall, the teacher instilled more learning than the model predicted. Therefore, the value-added score is a measure of the teacher’s contribution to learning in Calculus I. A similar procedure was used to compute the Calculus I instructors’ contribution to learning in subsequent courses. The same non-teacher variables were used to predict grades in Calculus II and other follow-up courses, which were then compared to actual grades.
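The logic of a value-added score can be sketched in Python under heavy simplifying assumptions: a single predictor (incoming test score), an ordinary least-squares line fitted across all students, and a teacher score equal to the mean of actual minus predicted grades. The data and the names `fit_line` and `value_added` are hypothetical. Carrell and West's actual model used several student and class characteristics and is not reproduced here.

```python
from collections import defaultdict
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit with one predictor; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def value_added(records):
    """records holds one (teacher, incoming_score, grade) tuple per student.

    Returns {teacher: mean(actual grade - predicted grade)}.
    """
    intercept, slope = fit_line(
        [r[1] for r in records], [r[2] for r in records]
    )
    residuals = defaultdict(list)
    for teacher, incoming, grade in records:
        predicted = intercept + slope * incoming
        residuals[teacher].append(grade - predicted)
    return {teacher: mean(rs) for teacher, rs in residuals.items()}


# Hypothetical data: two teachers whose students have matched incoming scores.
records = [
    ("A", 50, 62), ("A", 60, 71), ("A", 70, 82),
    ("B", 50, 58), ("B", 60, 69), ("B", 70, 78),
]
scores = value_added(records)
print(scores)
```

In this toy data, teacher A's students score above the fitted line and teacher B's below it, so A gets a positive score and B a negative one. Running the same function on grades from follow-up courses, still grouped by the first-course teacher, would give the second kind of score described above.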
Carrell and West (2010) found, consistent with prior studies, that teacher ratings were positively correlated with the teacher’s contribution to learning in the first course. Subsequent course performance told a different story, however. The teachers who contributed more to learning as measured in follow-up courses had been given relatively low ratings in the first course. These teachers were also generally the more experienced teachers. In other words, getting low ratings in Calculus I was a sign that a teacher had made a relatively small contribution to learning as measured in Calculus I but a relatively large contribution to learning as measured in subsequent courses requiring calculus (Figure 2).
Summary of the relationship between teacher ratings, value-added to first course and value-added to subsequent courses. + Indicates positive and p < 0.05, ++ indicates positive and p < 0.01. - Indicates negative and p < 0.05, and...
Braga et al. (2014) did a similar study of a cohort of approximately 1,200 first-year students who enrolled in 230 course-sections over the course of their 4 years at Bocconi University. This dataset had the same virtues as Carrell and West’s (2010) dataset (Table 1). Braga et al.’s (2014) data are a good complement to Carrell and West’s (2010) data because they ask the same question about a different type of student population and learning materials: Instead of a military academy in the United States, the students attended a non-military school in Italy, and instead of calculus-based courses, the students in Braga et al.’s (2014) study took a fixed sequence of management, economics, or law courses.
Braga et al. (2014) found the same basic pattern of results as Carrell and West (2010). Teachers given higher ratings tended to have less experience. Receiving low ratings at the end of course 1 predicted that a teacher had (i) made a relatively small contribution to learning as measured at the end of course 1 and (ii) made a relatively large contribution to learning as measured in subsequent courses (Figure 2).
There is one other key finding from Braga et al.’s (2014) study. Intuitively, it seems obvious that a good teacher is a good teacher, regardless of whether knowledge is measured at the end of the teacher’s course or in subsequent courses. Braga et al.’s data belied this assumption. When a teacher made a relatively large contribution to knowledge in the first course, it could be reliably predicted that the teacher’s contribution to knowledge as measured in subsequent courses would be smaller than average. Carrell and West’s (2010) data showed a similar negative correlation, but it was not significant. (In one analysis, Carrell and West ranked teachers in terms of both contribution to course 1 and contribution to subsequent courses. The correlation between ranks was r = -0.68, but a significance level was not reported.) It is important to remember that these claims have to do with teacher contribution to learning, not individual student aptitude. Students who did better in course 1 also did better in subsequent courses, but individual student aptitude was controlled for in the value-added models (and by the fact that students were assigned to courses randomly).
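As an illustration of what a negative correlation between ranks means here, the sketch below ranks a handful of teachers on two value-added measures and computes a Spearman-style rank correlation. The value-added numbers are invented and do not come from either study. The formula assumes there are no tied values, and we do not know exactly how Carrell and West (2010) computed their reported rank correlation.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n. Assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation using the tie-free shortcut formula."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical value-added scores for five teachers.
first_course = [0.30, 0.10, -0.05, -0.20, 0.15]   # measured in their own course
follow_up = [-0.25, 0.10, 0.05, 0.20, -0.10]      # measured in later courses

rho = spearman(first_course, follow_up)
print(round(rho, 2))  # strongly negative for these made-up numbers
```

A strongly negative value means that the teachers ranked highest on contribution to the first course tend to be ranked lowest on contribution to subsequent courses, which is the pattern described above.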
It is difficult to interpret the strength of the correlations in Figure 2 because of the complexity of the value-added models, but three things seem clear. First, there is evidence from Carrell and West (2010), Braga et al. (2014), and other studies (Clayson, 2009) that when teacher contribution to learning is measured based on the teacher’s own course, teacher ratings are positively correlated with teacher effectiveness. Second, the data do not support the assumption that student ratings are an accurate way to estimate a teacher’s contribution to student knowledge after a course is over. Third, the data do not support the assumption that teacher contribution to learning in the teacher’s course is an accurate way to estimate a teacher’s contribution to student knowledge after a course is over.
Why Did Better Teachers Get Lower Ratings?
Our conclusion is that better teachers got lower ratings in the studies conducted by Carrell and West (2010) and Braga et al. (2014). In drawing this conclusion, we assume that the long-term goal of education is for knowledge to be accessible and useful after a course is over. Therefore, we consider the better teachers to be the ones who contribute the most to learning in subsequent courses. We refer to this kind of generalizable knowledge as “deep learning.”
Future research should examine how teachers’ decisions and classroom behavior affect ratings and deep learning. Until this research has been done, we can only speculate about why better teachers got lower ratings in these two studies. Our hypothesis is that better teachers may have created “desirable difficulties” for their students. Research shows that making learning more difficult has the same three effects that good teachers had in the studies reviewed here: it decreases short-term performance, decreases students’ subjective ratings of their own learning, and simultaneously increases long-term learning and transfer (Schmidt and Bjork, 1992; Bjork, 1994; Rohrer and Pashler, 2010; Bjork and Bjork, 2011). For example, mixing different types of math problems on a problem set, rather than practicing one type of problem at a time, impairs performance on the problem set but enhances performance on a later test (e.g., Taylor and Rohrer, 2010; Rohrer et al., 2014). Most research on desirable difficulties has examined memory over a short period of time. Short-term performance typically refers to a test a few minutes after studying and long-term learning is usually measured within a week, whereas course evaluations take a full semester into account. However, the benefits of desirable difficulties have also been observed over the course of a semester (Rohrer et al., 2014).
Multiple studies have shown that learners rate desirable difficulties as counterproductive because their short-term performance suffers (e.g., Simon and Bjork, 2001; Kornell and Bjork, 2008). A similar effect seems to occur with teacher ratings: Making information fluent and easy to process can create an illusion of knowledge (Abrami et al., 1982; Carpenter et al., 2013), whereas classes that students perceive as more difficult tend to receive low ratings (Clayson and Haley, 1990; Marks, 2000; Paswan and Young, 2002; Centra, 2003).
It is not always clear which difficulties are desirable and which are not. Difficulties that have been shown to benefit classroom learning include frequent testing (e.g., Roediger and Karpicke, 2006; Lyle and Crawford, 2011) and interleaving problem types (e.g., Rohrer et al., 2014). However, not all difficulty is desirable; for example, poorly organized lectures make students’ lives difficult but are unlikely to enhance learning. Table 3 lists teacher behaviors that seem likely to increase course difficulty and deep learning, but simultaneously decrease ratings. These behaviors are relevant even in situations where teaching to the test is not relevant, and their effects might be worth investigating in future research.
Activities that seem likely to increase difficulty and long-term learning but decrease teacher ratings (based solely on the authors’ intuition).
Of course, not every teaching decision that instills deep learning will decrease teacher ratings. In some circumstances students may be able to identify effective teaching, even when it makes learning difficult. For example, Armbruster et al. (2009) reordered course content and added new lectures in an undergraduate introductory biology course to emphasize conceptual connections between topics. They also added daily in-class “clicker” quizzes and group problem-solving activities. Final exam performance was significantly higher in semesters following the changes to the course than in the semester prior to the changes. Furthermore, students gave higher overall ratings to the instructor after the course changed, despite also saying the course was more challenging. While there was no assessment of student performance in follow-up courses, these changes to the course seem likely to be desirable difficulties that enhanced deep learning, and not just performance on the end-of-semester exam.
Desirable difficulties have to do with the activities and processes learners are engaged in. It is possible that effective teachers also changed the content that they covered. In particular, perhaps these teachers broadened the curriculum and focused most on the most important, or difficult, concepts. Less effective teachers, by contrast, might have focused on preparing students to do well on the pre-determined set of questions that they knew would be on the test—that is, they might have been “teaching to the test” (Carrell and West, 2010; Braga et al., 2014).
Teaching to the test raises a potential limitation to our conclusions, because in a typical college course, if a teacher broadens the material, she can make the test correspond to the material she covered (i.e., “test to what she taught”). The existence of a pre-determined test might have meant the best teachers did not have the ability to adjust the test to fit their teaching. Thus, the results we have reviewed might have been different if there had been no common test to “teach to.” (In a typical college course there is no predetermined, unmodifiable test to teach to.) However, evidence against this potential concern already exists: Weinberg et al. (2009) examined courses in which teachers created their own tests and found that teacher ratings did not predict subsequent course performance when controlling for grades in the first course. As mentioned earlier, Weinberg et al.’s study has its own limitations: it did not involve objective measures of learning and might have been affected by subject-selection effects. More research is needed about this potential objection.
How Teacher Ratings Shape Teacher Incentives
Based on the negative relationship between ratings and deep learning, teacher ratings seem like a bad basis for decisions about hiring and promotion. However, we do not believe student ratings should be abolished, because ratings affect what they measure and the overall set of incentives they create might boost overall learning even if their correlation with learning is negative. As an analogy, imagine a teacher who is such a bad grader that when he grades papers, the correlation between grades and paper quality is slightly negative. One might argue that because these grades are unfair, it would be better if the teacher did not change the assignment save for one thing: no more grades. The problem, of course, is that the students would put far less effort into their papers—especially the students who were already not motivated—and the paper quality would drop precipitously. The measure of performance (student evaluations of teaching or, in the analogy, grades) might not be accurate or fair, but abolishing it could make performance (teaching, or in the analogy student papers) far worse. Whether abolishing ratings would be beneficial is an empirical question. To test this question would require a study that manipulated whether or not teachers were being rated and examined outcomes in terms of fairness to the teachers, teacher performance, and student learning. (A natural experiment of sorts already exists, in the sense that some schools put more emphasis on evaluations than others—and the former tend to have better teachers—but this difference is confounded with many other factors such as the proclivities of faculty who apply for such jobs.)
There are two reasons why student ratings might have overall net benefits for teachers. One is that they provide teachers with feedback on how they are seen by their students. The other is that they create a set of incentives that probably have a mix of positive and negative effects. On the positive side, they ensure that teachers are prepared, organized, and responsive to students. We suspect that the average professor would put less time and effort into teaching if they were not concerned about student ratings (Love and Kotchen, 2010). As we have said, we think the positives might outweigh the negatives. On the negative side, the incentive to get good ratings can push teachers into making decisions that hurt student learning. We have already described some of these decisions (Table 3). Teachers should serve their students broccoli, but they tend to get higher ratings when they serve chocolate, and this is not just an analogy: one study showed that ratings increased when teachers literally served their students chocolate (Youmans and Jee, 2007). More generally, students tend to give high ratings when courses are easier or when they expect teachers to give them good grades, even if higher grades do not mean the students have learned more (Greenwald and Gillmore, 1997; Worthington, 2002; Isely and Singh, 2005; McPherson, 2006; Ewing, 2012). As a result, teacher ratings may be one factor contributing to grade inflation in recent decades (Krautmann and Sander, 1999; Eiszler, 2002; Johnson, 2003; Love and Kotchen, 2010).
It is not just professors who have incentives to ensure that teacher ratings are high. Colleges and universities have strong incentives to boost the satisfaction, and perhaps charitable giving, of future alumni. Student ratings may be a perfect way to identify teachers who accomplish this goal, that is, teachers whose students enjoyed their courses and think they have learned a lot. (There is also an incentive for schools to ensure that their students get a good education so they can succeed in their lives and careers, but it is far easier to measure student evaluations than it is to parse out a single professor’s impact on their students’ lives twenty years later.)
Two recent studies found that when learning was measured as performance in subsequent related courses (i.e., when deep learning was measured), teachers who made relatively large contributions to student learning received relatively low teacher ratings (Carrell and West, 2010; Braga et al., 2014). If a college’s main goal is to instill deep, long-term learning, then teacher ratings have serious limitations.
Just as it is misguided to assume that ratings have any obvious relationship with student learning, it is also misguided to assume that end-of-semester test performance is a valid measure of deep learning. Teacher effectiveness as measured by students’ performance on end-of-semester exams was negatively correlated with teacher effectiveness as measured in subsequent courses (Braga et al., 2014). If these results generalize, it would suggest that standardized test performance can be at odds with durable, flexible knowledge (though it seems safe to assume that they match up well in some situations). This would be a troubling conclusion for at least two reasons. First, most measures of learning focus on the material learned during the preceding semester or year. Such measures may fail to capture deep learning, or even create an impression opposite to the truth. Second, primary and secondary school teachers are often incentivized, or required, to teach to tests. This requirement might actually undermine deep learning.
We do not recommend abandoning teacher ratings, at least not in the absence of controlled experiments comparing teachers who are being rated to teachers who are not. Teacher ratings provide incentives for teachers to invest effort in their teaching. However, these incentives might also harm teaching in some ways (Table 3), and we recommend that future research should investigate ways of eliciting student ratings that correlate better with deep learning.
How should teacher effectiveness be assessed? The student perspective is important, but students do not necessarily have the expertise to recognize good teaching. Their reports reflect their experiences, including whether they enjoyed the class, whether the instructor helped them appreciate the material, and whether the instructor made them more likely to take a related follow-up course. We think that these factors should be taken into account when assessing how good a teacher is. But it is also important to take into account how much the students learned, and students are simply not in a position to make accurate judgments about their learning. In the end, student ratings bear more than a passing similarity to a popularity contest.
We recommend that student ratings should be combined with two additional sources of data, both of which would require a significant investment of resources. First, teachers should receive more “coaching” from expert teachers, who can assess and rate them and also make suggestions (Murray, 1983; Braskamp and Ory, 1994; Paulsen, 2002; Berk, 2005). For one example of what a more holistic faculty review system could look like, consider the University of California, Berkeley’s statistics department (Stark and Freishtat, 2014). Second, where possible, steps should be taken to measure deep knowledge by examining teacher contribution to knowledge in a fair and objective way after students have completed a professor’s course.
The idea originated with NK. NK and HH did the writing together. HH did the majority of the literature review.
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Mattia Wruble helped with the initial literature review.
Funding. This research was supported by a Scholar Award from the James S. McDonnell Foundation.
- Abrami P. C., D’Apollonia S., Cohen P. A. (1990). Validity of student ratings of instruction: what we know and what we do not. J. Educ. Psychol. 82, 219–231. doi: 10.1037/0022-0663.82.2.219
- Abrami P. C., Leventhal L., Perry R. P. (1982). Educational seduction. Rev. Educ. Res. 52, 446–464. doi: 10.3102/00346543052003446
- Aleamoni L. M. (1999). Student rating myths versus research facts from 1924 to 1998. J. Pers. Eval. Educ. 13, 153–166. doi: 10.1023/A:1008168421283
- Armbruster P., Patel M., Johnson E., Weiss M. (2009). Active learning and student-centered pedagogy improve student attitudes and performance in introductory biology. CBE-Life Sci. Educ. 8, 203–213. doi: 10.1187/cbe.09-03-0025
- Benton S. L., Cashin W. E. (2012). Student Ratings of Teaching: A Summary of Research and Literature. Available at: http://ideaedu.org/sites/default/files/idea-paper_50pdf
- Berk R. A. (2005). Survey of 12 strategies to measure teaching effectiveness. Int. J. Teach. Learn. Higher Educ. 17, 48–62.
- Bjork E. L., Bjork R. A. (2011). “Making things hard on yourself, but in a good way: creating desirable difficulties to enhance learning,” in Psychology and the Real World: Essays Illustrating Fundamental Contributions to Society, eds Gernsbacher M. A., Pew R. W., Hough L. M., Pomerantz J. R. (New York, NY: Worth Publishers), 56–64.
- Bjork R. (1994). “Memory and metamemory considerations in the training of human beings,” in Metacognition: Knowing about Knowing, eds Metcalfe J. E., Shimamura A. P. (Cambridge, MA: The MIT Press), 185–205.
- Boring A. (2015). Gender Biases in Student Evaluations of Teachers. Available at: http://www.programme-presage.com/tl_files/presage/docs/Publications/anne%20boring%20-%20gender%20biases.pdf
- Braga M., Paccagnella M., Pellizzari M. (2014). Evaluating students’ evaluations of professors. Econ. Educ. Rev. 41, 71–88. doi: 10.1016/j.econedurev.2014.04.002
- Braskamp L. A., Ory J. C. (1994). Assessing Faculty Work: Enhancing Individual and Institutional Performance. San Francisco: Jossey-Bass.
- Carpenter S. K., Wilford M. M., Kornell N., Mullaney K. M. (2013). Appearances can be deceiving: instructor fluency increases perceptions of learning without increasing actual learning. Psychon. Bull. Rev. 20, 1350–1356. doi: 10.3758/s13423-013-0442-z
- Carrell S. E., West J. E. (2010). Does professor quality matter? Evidence from random assignment of students to professors. J. Polit. Econ. 118, 409–432. doi: 10.1086/653808
- Cashin W. E. (1995). Student Ratings of Teaching: The Research Revisited. Available at: http://ideaedu.org/sites/default/files/Idea_Paper_32.pdf
- Centra J. A. (2003). Will teachers receive higher student evaluations by giving higher grades and less course work? Res. Higher Educ. 44, 495–518. doi: 10.1023/A:1025492407752
- Clayson D. E. (2009). Student evaluations of teaching: are they related to what students learn? A meta-analysis and review of the literature. J. Market. Educ. 31, 16–30. doi: 10.1177/0273475308324086
- Clayson D. E., Haley D. A. (1990). Student evaluations in marketing: what is actually being measured? J. Market. Educ. 12, 9–17. doi: 10.1177/027347539001200302
- Cohen P. A. (1981). Student ratings of instruction and student achievement: a meta-analysis of multisection validity studies. Rev. Educ. Res. 51, 281–309. doi: 10.3102/00346543051003281
- Cohen P. A. (1983). Comment on a selective review of the validity of student ratings of teaching. J. Higher Educ. 54, 448–458. doi: 10.2307/1981907
- Dowell D. A., Neal J. A. (1982). A selective review of the validity of student ratings of teachings. J. Higher Educ. 54, 51–62. doi: 10.2307/1981538
- Eiszler C. F. (2002). College students’ evaluations of teaching and grade inflation. Res. High. Educ. 43, 483–501. doi: 10.1023/A:1015579817194
- Ewing A. M. (2012). Estimating the impact of relative expected grade on student evaluations of teachers. Econ. Educ. Rev. 31, 141–154. doi: 10.1016/j.econedurev.2011.10.002
- Feldman K. A. (1989). The association between student ratings of specific instructional dimensions and student achievement: refining and extending the synthesis of data from multisection validity studies. Res. High. Educ. 30, 583–645. doi: 10.1007/BF00992392
- Feldman K. A. (2007). “Identifying exemplary teachers and teaching: evidence from student ratings,” in The Scholarship of Teaching and Learning in Higher Education: An Evidence-Based Perspective, eds Perry R. P., Smart J. C. (Dordrecht: Springer), 93–143.
- Greenwald A. G., Gillmore G. M. (1997). No pain, no gain? The importance of measuring course workload in student ratings of instruction. J. Educ. Psychol. 89, 743–751. doi: 10.1037/0022-0663.89.4.743
- Hahn K. E., Polik W. F. (2004). Factors influencing success in physical chemistry. J. Chem. Educ. 81:567. doi: 10.1021/ed081p567
- Isely P., Singh H. (2005). Do higher grades lead to favorable student evaluations? J. Econ. Educ. 36, 29–42. doi: 10.3200/JECE.36.1.29-42
- Johnson V. E. (2003). Grade Inflation: A Crisis in College Education. New York, NY: Springer-Verlag.
- Kim D. G., Garcia F., Dey I. (2012). Calculus and success in a business school. Res. Higher Educ. J. 17:1.
- Kornell N., Bjork R. A. (2008). Learning concepts and categories: is spacing the “enemy of induction”? Psychol. Sci. 19, 585–592. doi: 10.1111/j.1467-9280.2008.02127.x
- Krautmann A. C., Sander W. (1999). Grades and student evaluations of teachers. Econ. Educ. Rev. 18, 59–63. doi: 10.1016/S0272-7757(98)00004-1
- Kulik J. A. (2001). Student ratings: validity, utility, and controversy. New Dir. Institut. Res. 2001, 9–25. doi: 10.1002/ir.1
- Lee B. B., Lee J. (2009). Mathematics and academic success in three disciplines: engineering, business and the humanities. Acad. Educ. Leadersh. J. 13, 95–105.
- Love D. A., Kotchen M. J. (2010). Grades, course evaluations, and academic incentives. Eastern Econ. J. 36, 151–163. doi: 10.1057/eej.2009.6
- Lüdtke O., Robitzsch A., Trautwein U., Kunter M. (2009). Assessing the impact of learning environments: how to use student ratings of classroom or school characteristics in multilevel modeling. Contemp. Educ. Psychol. 34, 120–131. doi: 10.1016/j.cedpsych.2008.12.001
- Lyle K. B., Crawford N. A. (2011). Retrieving essential material at the end of lectures improves performance on statistics exams. Teach. Psychol. 38, 94–97. doi: 10.1177/0098628311401587
- MacNell L., Driscoll A., Hunt A. N. (2014). What’s in a name: exposing gender bias in student ratings of teaching. Innov. Higher Educ. 40, 1–13. doi: 10.1007/s10755-014-9313-4
- Marks R. B. (2000). Determinants of student evaluations of global measures of instructor and course value. J. Market. Educ. 22, 108–119. doi: 10.1177/0273475300222005
- Marsh H. W. (2007). “Students’ evaluations of university teaching: dimensionality, reliability, validity, potential biases and usefulness,” in The Scholarship of Teaching and Learning in Higher Education: An Evidence-Based Perspective, eds Perry R. P., Smart J. C. (Dordrecht: Springer), 319–383.
- Marsh H. W., Lüdtke O., Nagengast B., Trautwein U., Morin A. J. S., Abduljabbar A. S., et al. (2012). Classroom climate and contextual effects: conceptual and methodological issues in the evaluation of group-level effects. Educ. Psychol. 47, 106–124. doi: 10.1080/00461520.2012.670488
- Marsh H. W., Roche L. A. (1997). Making students’ evaluations of teaching effectiveness effective: the critical issues of validity, bias, and utility. Am. Psychol. 52, 1187–1197. doi: 10.1037/0003-066X.52.11.1187
- McCallum L. W. (1984). A meta-analysis of course evaluation data and its use in the tenure decision. Res. High. Educ. 21, 150–158. doi: 10.1007/BF00975102
- McPherson M. A. (2006). Determinants of how students evaluate teachers. J. Econ. Educ. 37, 3–20. doi: 10.3200/JECE.37.1.3-20
- Miller J. E., Seldin P. (2014). Changing practices in faculty evaluation. Academe 100:35.
- Murray H. G. (1983). Low-inference classroom teaching behaviors and student ratings of college teaching effectiveness. J. Educ. Psychol. 75, 138–149. doi: 10.1037/0022-0663.75.1.138
- Neath I. (1996). How to improve your teaching evaluations without improving your teaching. Psychol. Rep. 78, 1363–1372. doi: 10.2466/pr0.1996.78.3c.1363
- Paswan A. K., Young J. A. (2002). Student evaluation of instructor: a nomological investigation using structural equation modeling. J. Market. Educ. 24, 193–202. doi: 10.1177/0273475302238042
- Paulsen M. B. (2002). Evaluating teaching performance. New Direct. Institut. Res. 2002, 5–18. doi: 10.1002/ir.42
- Pounder J. S. (2007). Is student evaluation of teaching worthwhile? An analytical framework for answering the question. Q. Assurance Educ. 15, 178–191. doi: 10.1108/09684880710748938
- Roediger H. L., Karpicke J. D. (2006). Test-enhanced learning: taking memory tests improves long-term retention. Psychol. Sci. 17, 249–255. doi: 10.1111/j.1467-9280.2006.01693.x
- Rohrer D., Dedrick R. F., Burgess K. (2014). The benefit of interleaved mathematics practice is not limited to superficially similar kinds of problems. Psychon. Bull. Rev. 21, 1323–1330. doi: 10.3758/s13423-014-0588-3
- Rohrer D., Pashler H. (2010). Recent research on human learning challenges conventional instructional strategies. Educ. Res. 39, 406–412. doi: 10.3102/0013189X10374770
- Scherer R., Gustafsson J. E. (2015). Student assessment of teaching as a source of information about aspects of teaching quality in multiple subject domains: an application of multilevel bifactor structural equation modeling. Front. Psychol. 6:1550. doi: 10.3389/fpsyg.2015.01550
- Schmidt R. A., Bjork R. A. (1992). New conceptualizations of practice: common principles in three paradigms suggest new concepts for training. Psychol. Sci. 3, 207–217. doi: 10.1111/j.1467-9280.1992.tb00029.x
- Simon D. A., Bjork R. A. (2001). Metacognition in motor learning. J. Exp. Psychol. 27, 907–912. doi: 10.1037/0278-7393.27.4.907
- Spooren P., Brockx B., Mortelmans D. (2013). On the validity of student evaluation of teaching: the state of the art. Rev. Educ. Res. 83, 598–642. doi: 10.3102/0034654313496870
- Stark P. B., Freishtat R. (2014). An Evaluation of Course Evaluations. Berkeley, CA: Center for Teaching and Learning.
- Taylor K., Rohrer D. (2010). The effects of interleaved practice. Appl. Cogn. Psychol. 24, 837–848. doi: 10.1002/acp.1598
- Wachtel H. K. (1998). Student evaluation of college teaching effectiveness: a brief review. Assess. Eval. Higher Educ. 23, 191–212. doi: 10.1080/0260293980230207
- Weinberg B. A., Hashimoto M., Fleisher B. M. (2009). Evaluating teaching in higher education. J. Econ. Educ. 40, 227–261. doi: 10.3200/JECE.40.3.227-261
- Worthington A. C. (2002). The impact of student perceptions and characteristics on teaching evaluations: a case study in finance education. Assess. Eval. High. Educ. 27, 49–64. doi: 10.1080/02602930120105054
- Youmans R. J., Jee B. D. (2007). Fudging the numbers: distributing chocolate influences student evaluations of an undergraduate course. Teach. Psychol. 34, 245–247. doi: 10.1080/00986280701700318