The Results – How Did the Students Do?
We know that many people are awaiting the results of this summer's Envision Academy pilot using Khan Academy. Before we get to the results, a quick recap. We set out to pilot a new way to run classrooms via blended learning. At Envision Academy in Oakland, California, high school students who had failed algebra were randomly assigned to one of two summer school classes. The "control" classroom received a traditional five-week summer school curriculum for Algebra 1. The "treatment" classroom used Khan Academy for almost all of the period each day, and both classes had the same teacher. We were curious about how the roles of the teacher and student might change with blended learning, and we wanted to better understand its challenges and potential.
We took an open source approach to the experiment, welcoming visitors to the classroom and blogging about our learning. There were some immediately clear benefits to the blended learning environment and some equally clear challenges. We did not have a pre-set opinion in the blended learning debate, and we tried to remain objective. The d.school at Stanford helped to observe the classes, interview students, and capture our learning on this blog.
Before we discuss the results, a few important caveats are in order. First, no statistician will take our results particularly seriously, and they shouldn't: the sample size is too small to attribute any real significance to the findings. Second, the pilot was very brief, lasting only five weeks, or twenty-four class sessions of two hours each. Third, there is always the risk of the Hawthorne effect: the students inevitably knew they were part of a study, which may itself have affected their behavior. Finally, and perhaps most importantly, it was difficult to find the right measure by which to evaluate the progress of students in the two classes. After consulting several researchers, we settled on the University of California Mathematics Diagnostic Testing Program (MDTP) and its Elementary Algebra Diagnostic exam (EA50A90), which is designed to measure students' readiness for an Algebra II course. In consultation with the team at MDTP, we chose this exam as an appropriate means of measuring students' success at the end of an Algebra I course (*see comments for a listing of the topics assessed and the number of questions per topic). A major concern with this assessment, however, was that it would not pick up any gains made on pre-algebra content, because it focuses primarily on algebra content.
From the beginning, we knew that the pre- and post-course test data could not definitively assess the success of the pilot. For all the reasons listed above, we view the data as a single quantitative measure that should be considered alongside the qualitative data captured through the observations and interviews. As such, our hope is that others will not cite this data as proof one way or the other of the effectiveness of Khan or blended learning; it would be dangerous to over-generalize our findings. We see this pilot as providing one small piece of data that suggests reason to be cautiously optimistic, while also clearly showing the need for additional study.
Among the students in the study who had valid scores on the pre- and post-course assessment, the results were similar for the treatment and control groups. Students in the "control" or traditional summer school course improved their scores by an average of 5.2 percentage points over the five-week period. Students in the "treatment" or Khan class improved by an average of 6.4 percentage points. For example, a student who started the summer answering 60% of the questions correctly in the traditional class would, on average, end the five weeks answering 65.2% correctly. The same student in the Khan class would, on average, be able to answer 66.4% of the questions correctly at the end of the same period.
[Chart: Increase in Percent of Questions Answered Correctly on the MDTP Algebra II Readiness Exam]
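For readers who want to see exactly what these figures mean, here is a minimal sketch in Python of the calculation described above. The scores below are made-up illustrative numbers, not our actual student data, and the helper name average_gain is our own; the only logic assumed is the one stated in the text, namely that each student's gain is their post-test percent correct minus their pre-test percent correct, averaged across the class.

    # Hypothetical (pre, post) percent-correct scores for illustration only;
    # these are NOT the actual pilot data.
    control = [(55, 60), (70, 74), (62, 68)]
    treatment = [(50, 58), (65, 70), (60, 66)]

    def average_gain(pairs):
        """Mean gain in percentage points: post-test score minus pre-test score."""
        return sum(post - pre for pre, post in pairs) / len(pairs)

    print(f"Control average gain:   {average_gain(control):.1f} points")
    print(f"Treatment average gain: {average_gain(treatment):.1f} points")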
Averages can obviously be deceiving. In terms of distribution, in each class approximately one third of the students saw significant gains (ten percentage points or more in the share of questions answered correctly), while the other two thirds of the students' scores were essentially flat (less than a four-percentage-point change in either direction). There were no particularly strong patterns in where the two classes' gains were concentrated by content area, with one exception: students in the traditional class saw most of their gains in "Graphical Representations" and "Polynomials and Polynomial Functions," whereas students in the Khan class saw gains spread across almost all the categories.
Bearing in mind the limits of this data's reliability, it is interesting to note that students in the two groups scored roughly the same, each showing slight improvement over the five-week course. We wonder whether this trend would hold over a full-year course, and whether the slightly higher gains of the Khan students would grow or shrink over the course of a school year.
It would be easy (and wrong) to use this data to conclude that blended learning and Khan are without value. If anything, we find it interesting that the teacher "doing her best" in the control class produced roughly the same gains as the Khan Academy class, where students did more of the learning on their own with the teacher as the guide. In the treatment class, the teacher ended up doing mostly one-on-one consulting with pupils, and the students progressed through the assignments at their own pace and in their own sequence. If it is true that Khan-centered classes can match or even exceed traditional teacher-led pedagogy, there could be interesting implications for the future.
Regarding the more concentrated student gains on "Graphical Representations" and "Polynomials" in the teacher-led class, it is plausible that this concentration was due to the teacher focusing more time on those topics. In the Khan classroom, the teacher had less control over which content students devoted the most time to, so it makes sense that the gains were spread more evenly across the various topics.
It is also interesting to note that in the Khan classroom, many students spent a significant amount of time working on pre-algebra skills such as fractions, percentages, decimals, and even basic computation. If we could do it over again, we would have used a second measure to evaluate student progress on these pre-algebra skills as well. Our hypothesis is that the Khan students would have shown significant gains versus the control classroom, which did not spend much time on these topics. The data within the Khan software shows that the treatment students were able to correctly answer ten questions in a row on many of the pre-algebra sections.
It is also interesting to consider that students in the treatment group spent approximately half of the summer working on pre-algebra skills. Because the Khan software is individualized, it identified that most of our students had significant pre-algebra skill gaps and delivered instruction and practice problems to address those deficits. Students in the Khan/treatment group therefore spent up to 50% less time than the control group on the algebra content that the MDTP exam measured, yet they still performed at a similar level to the control group on the algebra measures.
Questions that Remain
As with any pilot, we are left with as many questions as answers. We wonder how a Khan-type classroom would work with a less skilled teacher. The teacher in this summer pilot had a positive rapport with students, good classroom management, and was a strong motivator for both classes. In the hands of a weaker teacher, we wonder whether the results would hold. That said, with a weaker teacher, the traditional classroom would likely suffer significantly too.
We saw strong engagement and interest from the students in the Khan/treatment classroom, as documented throughout this blog. We are curious whether this engagement is inherent to such individualized, self-paced content, or whether students would "hit the wall" if the Khan approach were used for a longer period of time. Anecdotally, most of the students told us they preferred the Khan classroom to what they had experienced previously and would prefer to take a "blended" course next year.
Prior to seeing the results of the summer experiment, the teacher predicted that her students would do better on a traditional measure of proficiency, such as the California Standards Test (CST), if she ran her classroom in the Khan manner rather than the traditional way. Given that she was not a convert prior to this pilot and developed these opinions only through teaching the course, we find this an interesting perspective to consider in the dialogue about how teachers will respond to blended learning.
Finally, there is still much to learn. We hope that this small experiment inspires others to tackle and document their learning in the blended learning space. Clearly we need larger sample sizes and longer trial periods with which to evaluate the approach. Envision Academy has decided to run a year-long pilot of Khan Academy for all ninth grade students this school year. To evaluate its success, they will compare end-of-year CST results against the three other Envision high school campuses in the Bay Area, as well as against previous years' ninth grade classes within Envision Academy.
The qualitative evidence of these past five weeks points to the potential of blended learning. We are curious to hear what others think. Small sample size aside, do the findings that Khan students performed roughly the same as – or even slightly higher than – the traditional classroom students support or undermine the value of blended learning? Weigh in by clicking the comment bubble at the top of the post, and let us know your opinion.