My research
I am in the process of writing a manuscript on the four methods of measuring the reliability of human judgement and the reliability of psychological tests. It occurred to me several years ago that there was a mistake in the way measurement error was being calculated. This led me to collect some data to test my hypothesis. I submitted my results to the Journal of Applied Psychology, and my manuscript was soundly trounced by the three reviewers and the editor.
Since then, I have gathered further evidence. I am retired, but I am working with a young PhD from the local university who graduated from the same school I did and even worked with the same advisor I had years ago. Even though my young colleague knows the field very well, it took him about a month to grasp my concept, since it is at odds with what is being taught. Once the light came on, he was the one who saw that my concept would apply to tests. I kind of saw that, but not as clearly as he did. Originally, I was more focused on the implications for human judgement.
At any rate, we have collected a very complete set of data and are ready to write a much expanded, more detailed manuscript.
So, here are the different measures:
Alpha coefficient: This measures the degree of consistency among the items in a test. For people, it is an index of their consistency of judgement at one point in time.
Test-retest reliability: This reflects the error associated with real change in people over time.
Construct validity: This measure contains all sources of error: item inconsistency, changes in people, and nonparallelism in measurement between two tests or two people.
To correct construct validity for item inconsistency, take the square root of the validity coefficient; the result should equal the test-retest value, since no item inconsistency is reflected in the test-retest correlation. This correction, though a standard method, has not been considered in the past for this kind of data.
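To make the mechanics concrete, here is a minimal sketch in Python of how the three coefficients and the square-root comparison could be computed. The score matrices, sample size, and noise levels are made up purely for illustration; they are not our data, and the simulation is only a toy version of parallel tests.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_persons x n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()      # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)        # variance of total scores
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical score matrices: rows are people, columns are items.
rng = np.random.default_rng(0)
true_score = rng.normal(size=(200, 1))
time1 = true_score + rng.normal(scale=0.4, size=(200, 10))   # test at time 1
time2 = true_score + rng.normal(scale=0.4, size=(200, 10))   # same test, retest
other = true_score + rng.normal(scale=0.4, size=(200, 10))   # a second, parallel test

alpha = cronbach_alpha(time1)
test_retest = np.corrcoef(time1.sum(axis=1), time2.sum(axis=1))[0, 1]
validity = np.corrcoef(time1.sum(axis=1), other.sum(axis=1))[0, 1]

print(f"alpha          = {alpha:.3f}")
print(f"test-retest    = {test_retest:.3f}")
print(f"validity       = {validity:.3f}")
print(f"sqrt(validity) = {validity ** 0.5:.3f}  (compare with test-retest)")
```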
Here are the results for cognitive tests:
Average alpha coefficient = .914. 1.000 - .914 = .086, which means that item inconsistency produces 8.6% error.
Average test-retest reliability is .892, which means that changes in people account for .108, or 10.8%, error variance. The sum of these two sources of error is .194. 1.000 - .194 is .806, which is the highest correlation that could be obtained for construct validity. The average construct validity is .790, only .016 less than this value. Therefore, the degree of nonparallelism between two tests measuring the same construct is less than 2% error, and item inconsistency and changes in people account for most of the error. Furthermore, the square root of construct validity is .888, which is only .004 less than test-retest reliability. So, two different methods, adding up the error terms and correcting with the square root value, produce nearly identical results.
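The arithmetic is simple enough to lay out directly. This short sketch just restates the calculation above, using the reported averages for cognitive tests:

```python
# Error decomposition using the reported averages for cognitive tests.
alpha = 0.914          # average alpha coefficient
test_retest = 0.892    # average test-retest reliability
validity = 0.790       # average construct validity

item_error = 1.0 - alpha           # 0.086 -> item inconsistency
change_error = 1.0 - test_retest   # 0.108 -> real change in people over time

ceiling = 1.0 - (item_error + change_error)   # 0.806, highest attainable validity
nonparallel_error = ceiling - validity        # 0.016, nonparallelism between tests

print(f"ceiling            = {ceiling:.3f}")
print(f"nonparallel error  = {nonparallel_error:.3f}")
print(f"sqrt(validity)     = {validity ** 0.5:.3f}  vs test-retest {test_retest:.3f}")
```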
This same relationship holds for personality tests and for measures of human judgement. No one in the 120-year history of psychometrics has tied this all together.
One other measure of reliability is one set of scores correlated against an average from a large set of other scores. This cannot be done with tests or measures of human judgement in most cases, but it can be done with simulated data using a method developed by Walter Borman in 1976. He termed his calculation an accuracy score. In a set of data my colleague and I have collected using a human judgement task, we can compare all four measures of reliability. The data will be entered into a database next week, and we should have the analysis done in the near future.
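As I have described it here (one set of scores correlated against the average of a large set of other scores), the calculation reduces to something like the sketch below. The simulated ratings, rater counts, and noise level are purely illustrative, and this is only my shorthand for the idea, not a claim about the exact details of Borman's procedure.

```python
import numpy as np

def accuracy_score(one_rater, other_raters):
    """Correlate one rater's scores with the mean of many other raters' scores.

    one_rater:    array of shape (n_targets,)
    other_raters: array of shape (n_raters, n_targets)
    """
    reference = np.asarray(other_raters, dtype=float).mean(axis=0)
    return np.corrcoef(np.asarray(one_rater, dtype=float), reference)[0, 1]

# Hypothetical simulated ratings: 30 targets judged by 50 raters.
rng = np.random.default_rng(1)
true_levels = rng.normal(size=30)
ratings = true_levels + rng.normal(scale=0.5, size=(50, 30))

print(accuracy_score(ratings[0], ratings[1:]))
```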
I know this is fine-grained detail, but it has major implications for some big issues in assessment, such as the degree of bias in testing or judging protected groups.
Just thought I would lay this all out for a few of my detractors, who think this Gramps is going brain dead.