Nightcrawler wrote: Question for Joe.
First of all, thanks to you I finally understood why a weaker pool of applicants results in a lower scale (even if it's counterintuitive to anyone used to the good ol' grading curve from law school).
With that said, what can you tell me about the grading calibration sessions? In particular, I read on the Cal Bar website that the graders meet, analyze a sample of candidates' written answers, settle on a grading grid, and repeat this process a few times to make sure the grid is applied uniformly.
So here is my specific question: doesn't this method produce a "less demanding" grid on an examination with a weaker pool of applicants (such as February), so that a strong applicant scores more points under that grid than the same applicant would in July, when the grid is calibrated against stronger average answers?
Thanks in advance!
Be careful, if you get me started on this, I may never stop. I have looked at thousands of graded essays/MPTs and I sometimes can't understand how they are scored. I study them in all sorts of ways, even breaking them down into keyword analysis. For example, please take a look at the following:
https://seperac.com/pdf/J14-Essay%20Ana ... ay%201.pdf
It contains obvious and serious mistakes in J14 essay grading. You will need to zoom in on this PDF to read the material (I put a number of essays on each page so they can be compared visually). This PDF is a small sample of 15 answers to Essay 1 from the July 2014 exam. As part of this essay analysis, I try to determine the weight of each issue and then calculate each examinee's score on each issue (for example, PROF-RES: Solicitation/Referral Fees (Seperac Est. score of 2/10)). The final result is the "Seperac Estimated Score." Bar graders have neither the time nor the interest to put similarly scored essays side by side to check whether the grading is actually accurate. However, when I do this, grading inaccuracies often come to light. For example, look at the fifth essay (Jul2014-Essay-001-ID 002-Typed-Score 38.66): this "Examinee J" received a score of 38.66. If you compare this essay to the other essays that scored around 38.66, you will see that it is far superior. I feel this essay's score was severely discounted; just compare it to the released Model Answers and you will see what I mean. How this is not a passing essay is a complete mystery to me.
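To give you an idea of the mechanics, here is a minimal sketch of that issue-weighting calculation. The issue names, the weights, and the 100-point scale are illustrative assumptions I made up for this post, not my actual model:

```python
# A minimal sketch of issue-weighted essay scoring (illustrative issue
# names and weights, not the actual Seperac model). Each issue gets an
# estimated weight, each examinee earns partial credit on each issue, and
# the estimated score is earned credit as a share of total weight.

ISSUE_WEIGHTS = {
    "PROF-RES: Solicitation/Referral Fees": 10,
    "PROF-RES: Duty of Confidentiality":    15,
    "PROF-RES: Conflicts of Interest":      20,
}

def estimated_score(credit_earned: dict, scale: float = 100.0) -> float:
    """credit_earned maps issue -> points earned (capped at that issue's weight)."""
    total = sum(ISSUE_WEIGHTS.values())
    earned = sum(min(credit_earned.get(issue, 0), weight)
                 for issue, weight in ISSUE_WEIGHTS.items())
    return scale * earned / total

# An examinee who scored 2/10 on referral fees (as in the PDF) but did
# well on the other two issues:
print(estimated_score({"PROF-RES: Solicitation/Referral Fees": 2,
                       "PROF-RES: Duty of Confidentiality": 12,
                       "PROF-RES: Conflicts of Interest": 18}))  # ~71.1
```

Once each issue has an estimated weight, essays with the same reported score can be compared issue by issue, and that is exactly where the inconsistencies show up.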
The biggest reason for unreliability in essay grading is multiple graders. In small states this is not an issue: one grader handles all the essays for a question. This makes the appraisal more consistent, though still subject to the whims of a single subjective grader. NCBE tells graders to put the essays in "buckets" (a 1 bucket, a 2 bucket, and so on) and to make sure there is an even distribution across buckets (I sketch this below). Thus, if you are competing against strong essay writers, you will have a hard time. However, I still believe issue-spotting is paramount. If the graders are using the NCBE Scoring Analysis (which they should be), then if you spot an issue, you must receive credit for it. If all the examinees spot all the issues, you will still have a problem, because their writing is likely stronger than yours; but if they do not, you can do OK on the essays.
For example, the following was a passing J16 essay:
http://seperac.com/pdf/Jul2016-Essay%201-49.46.pdf
Although poorly written, it spotted all the issues. Looking at other examinees' essays, many failed to spot the dissociation issue. I suspect that when the grader saw this examinee had spotted every issue, including the one almost no one else spotted, the grader felt the essay had to be regarded as a passing essay.
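Here is the bucketing sketch I promised above. It assumes graders effectively rank-order the essays and slice the ranking into equal-sized groups; six buckets and the strictly even split are my assumptions, and the actual NCBE guidance to graders may differ in detail:

```python
# A sketch of "bucketing" with an even distribution: rank-order the
# essays, then slice the ranking into equal-sized buckets (six buckets is
# an assumption for this illustration).

def assign_buckets(raw_impressions: list, n_buckets: int = 6) -> list:
    """raw_impressions: one holistic number per essay. Returns a 1..n bucket per essay."""
    order = sorted(range(len(raw_impressions)), key=lambda i: raw_impressions[i])
    buckets = [0] * len(raw_impressions)
    for rank, idx in enumerate(order):
        buckets[idx] = 1 + (rank * n_buckets) // len(raw_impressions)
    return buckets

# Twelve essays land two per bucket no matter how strong the pool is overall:
print(assign_buckets([55, 72, 61, 80, 43, 68, 74, 50, 66, 59, 77, 47]))
```

Note what this implies: your bucket depends on your rank within the pool, not on the absolute quality of your answer.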
According to a 1977 study entitled An Analysis of Grading Practices on the California Bar Examination by Stephen P. Klein, Ph.D., the "grading standards for the California bar exam essays are based on: analysis of the problem, knowledge of the law, application of legal principles, reasoning, and the appropriateness of the conclusions reached. The objective 'correctness' of the answer is not supposed to affect the grade assigned." Bar exam graders have to be trained to apply these scoring rules consistently through a process called "calibration." According to NCBE's publication The Bar Examiner:
The MBE has a reliability of about 0.9. The reliability of the MBE varies a little from administration to administration (from about 0.89 to 0.91) but is consistently high enough to meet the reliability requirement by itself. The reliability of the written component is generally lower and more variable than the reliability of the MBE. Assuming that the written component includes 6 to 10 tasks (including essay questions and performance tasks), that the candidate responses to each essay question and/or performance task are graded by a single grader (or a set of calibrated graders who have been trained to apply the scoring rules consistently), and that the overall written component score is the sum or average of the scores on the individual tasks, the reliability will tend to be about 0.7. So, the written components of most bar examinations are not reliable enough in themselves to meet the rule of thumb of 0.8 or 0.9, but when combined appropriately with the MBE, the overall score tends to have a reliability higher than 0.9.
The Bar Examiner: Volume 78, Number 4, November 2009
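Both numbers in that quote fall out of standard classical test theory. Here is a sketch; the per-task reliability (0.25), the MBE/written correlation (0.7), and the 2:1 weighting are assumptions I picked to illustrate the quoted figures, not NCBE's actual parameters:

```python
# A sketch of the two reliability figures in the quote, using standard
# classical test theory. All inputs below are illustrative assumptions.

def spearman_brown(r_single: float, k: int) -> float:
    """Reliability of a written component built from k comparable tasks."""
    return k * r_single / (1 + (k - 1) * r_single)

def composite_reliability(r1: float, r2: float, w1: float, w2: float,
                          r12: float) -> float:
    """Reliability of w1*X1 + w2*X2, assuming unit variance for both parts."""
    true_var  = w1**2 * r1 + w2**2 * r2 + 2 * w1 * w2 * r12
    total_var = w1**2 + w2**2 + 2 * w1 * w2 * r12
    return true_var / total_var

# Eight tasks at ~0.25 reliability each gives roughly the quoted 0.7:
print(spearman_brown(0.25, 8))                     # ~0.73
# Weighting the 0.9-reliable MBE more heavily than the 0.7-reliable
# written part pushes the combined score back above 0.9:
print(composite_reliability(0.9, 0.7, 2, 1, 0.7))  # ~0.91
```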
It can be inferred from this article that the reliability of the essays, already lower than that of the MBE, diminishes further if the graders are not sufficiently trained to apply the scoring rules consistently. This was confirmed in the 1977 Klein study:
"there was far more consistency among the readers before the regular reading process began (calibration data set) than there was once this process was underway. This difference is evident on all three indices of agreement and clearly illustrates that the initial calibration data does not reflect accurately the degree of agreement among, the readers in the scores that are subsequently used in determining an applicant's pass/fail status. For instance, with the calibration sample there was a range of 70-85 percent agreement on the pass/fail decision~ whereas this range dropped to 27-57 percent at the beginning and to 23-53 percent at the end of the regular reading period. In other words, during the normal reading process, the readers agreed with one another about one-half as well as they did during the calibration process!"
Basically, the graders were at their most consistent during calibration itself (70-85% agreement on the pass/fail decision). Once the regular reading was underway, agreement fell to 27-57% at the start and to 23-53% by the end. At the time of this 1977 study, the California bar exam graders convened just once to calibrate the essays. Currently, they convene three times to "calibrate" (see The State Bar of California Committee of Bar Examiners/Office of Admissions, Description and Grading of the California Bar Examination – General Bar Examination and Attorneys' Examination,
http://admissions.calbar.ca.gov/LinkCli ... iqbATHUwY=)
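For concreteness, here is what I understand Klein's pass/fail agreement index to be, assuming it is plain percent agreement between two readers; the scores and cut score below are invented:

```python
# A sketch of a pass/fail agreement index, assuming it is plain percent
# agreement: the share of essays on which two readers land on the same
# side of the cut score. All numbers below are invented.

def pass_fail_agreement(reader_a: list, reader_b: list, cut: float) -> float:
    same = sum((a >= cut) == (b >= cut) for a, b in zip(reader_a, reader_b))
    return same / len(reader_a)

reader_a = [72, 58, 65, 49, 70, 61, 55, 68]
reader_b = [66, 62, 54, 47, 73, 59, 64, 60]
print(pass_fail_agreement(reader_a, reader_b, cut=60))  # 0.5 here
```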
However, despite currently convening three times to calibrate, California bar exam essay grading is still unreliable. Sometimes examinees who failed the California bar exam send me their score sheets to review. On the California bar exam, if an examinee's total score falls above 1390 but below the 1440 passing score, the examinee's essays and PTs are re-read to ensure accuracy, and both read scores are reported and averaged. I have seen instances where there is a 20-point difference in an essay's score between re-reads.
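The re-read rule, as I understand it, works like this (numbers invented):

```python
# The re-read rule as described above, with invented numbers: total scores
# in the 1390-1440 window trigger a second read, and the reported essay
# score is the average of the two reads.
from typing import Optional

def reported_score(first_read: float, second_read: Optional[float] = None) -> float:
    if second_read is None:  # no re-read was triggered
        return first_read
    return (first_read + second_read) / 2

# A 20-point disagreement between readers moves the reported score by 10:
print(reported_score(55, 75))  # 65.0
```

So a 20-point disagreement between readers shifts the reported essay score by 10 points, on an exam where the pass/fail margin can come down to a handful of points.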
Meanwhile, the NY graders convene only once to "calibrate." In a March 2011 discussion at New York Law School, Bryan R. Williams of the NYS Board of Law Examiners stated:
"The grading of the exam is done by seven people throughout state - all practicing lawyers. There are five board members of the NY Board of Law Examiners who are appointed by the Court of Appeals. Each one of those board members have seven people who are in their team. Each person is responsible for one essay, and that team of people, then they grade the essay and the MPT. So what happens is we have the question written, and then we have a model answer. And just like this exam that was just given, a few days after the exam, all of us, the seven graders and myself, will receive about 50 sample answers given by candidates, so we all get the same 50, and we individually go and we grade those exams based upon the model answer that they did, and then we have a meeting and we come together and we make sure that we are all grading the same way, so we can get calibrated, and there has never been a time since I've been doing this, and I've been doing this since 1986, there has never been a time where we would have had that meeting and because of the kinds of answers we get back, we don't in some way change our model answer because what we are trying to do, we are trying to rank order people."
See 2011 NYLS Bar Kickoff video @ 13:15-14:40
Think about this for a minute. California does more than New York to ensure its essay grading is reliable (e.g., commissioning studies on the reliability of its essay grading, having graders convene three times for calibration, scoring essays in 5-point increments), yet California essay scores can still swing 20 points between different graders. Imagine what point swings can occur with New York essays! This is why I tell examinees to focus on the MBE and take calculated risks on the MEE/MPT. On the essays, you can do everything right (study the right material, answer the questions properly, etc.), but you are statistically less likely to get the score you deserve than you are on the MBE.