For some studies, agreement of 0.6 may be acceptable; when assessing physicians' agreement on which patients should undergo invasive surgery, however, one would want nearly perfect agreement. These are therefore only general guidelines, and the purpose of the study and the consequences of inaccurate ratings must be taken into account. A good example of the concern over how to interpret an obtained kappa appears in an article that compared visual detection of abnormalities in biological samples by humans with automated detection (12). The results showed only moderate agreement between the human and automated raters (κ = 0.555), yet the same data yielded an excellent percent agreement of 94.2%. The problem in interpreting these two statistics is: how are researchers to decide whether the raters are reliable or not? Do the results indicate that the vast majority of patients receive accurate laboratory results, and thus correct medical diagnoses, or not? In the same study, the researchers chose one data collector as the standard and compared the results of five other technicians against that standard. Although the article does not report enough data to calculate percent agreement, the kappa results were only moderate. How is the laboratory manager to know whether the results represent high-quality readings with little disagreement among well-trained laboratory technicians, or whether there is a serious problem requiring additional training? Unfortunately, the kappa statistic alone does not provide enough information to make such a decision. In addition, a kappa can have such a wide confidence interval (CI) that it spans everything from good to poor agreement.
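To make the divergence concrete, the short sketch below (with purely hypothetical counts, not the data from the cited study) computes percent agreement and Cohen's kappa from the same 2×2 table; because one category dominates, expected chance agreement is high and kappa ends up only moderate even though raw agreement is 94%.

```python
import numpy as np

# Hypothetical 2x2 table of counts (NOT the data from the cited study):
# rows = human rater, columns = automated rater;
# "abnormal" findings are rare, so agreement on "normal" dominates.
table = np.array([[90, 3],   # human normal:   auto normal, auto abnormal
                  [ 3, 4]])  # human abnormal: auto normal, auto abnormal

n = table.sum()
po = np.trace(table) / n                 # observed (percent) agreement
row_marg = table.sum(axis=1) / n         # marginal proportions, human rater
col_marg = table.sum(axis=0) / n         # marginal proportions, automated rater
pe = np.sum(row_marg * col_marg)         # chance agreement implied by marginals
kappa = (po - pe) / (1 - pe)

print(f"Percent agreement: {po:.1%}")    # 94.0%
print(f"Cohen's kappa:     {kappa:.3f}") # about 0.539
```

With these counts, percent agreement is 94% while kappa is only about 0.54, close to the split described above.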
Pearson r is the most commonly used measure of bivariate correlation. It describes the degree to which there is a linear relationship between two continuous variables. It is often used to test theories, assess instrument reliability, evaluate evidence of validity (predictive and concurrent), evaluate the strength of intervention programs, and for other descriptive and inferential purposes. It provides a measure of the direction and strength of the relationship between two variables, and, when squared (r²), it also provides a measure of the shared variance between them. Caution is needed, however: Pearson r may be misleading when the data contain outliers (extreme values), particularly when the sample is small, or when the range is restricted (the sample is not representative of the population). It must also be remembered that correlation does not imply causation. Pearson r can, in addition, be used as a measure of effect size. It ranges from -1 to 1 and is most appropriate for interval- and ratio-scaled data.
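As a brief illustration (simulated data, not from any study), the sketch below computes Pearson r, its square as a measure of shared variance, and shows how a single extreme point can distort r in a small sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 0.8 * x + rng.normal(scale=0.5, size=30)   # roughly linear relationship

r, p = stats.pearsonr(x, y)
print(f"Pearson r = {r:.3f}, r^2 (shared variance) = {r**2:.3f}, p = {p:.3g}")

# A single extreme outlier can distort r, especially with few observations:
x_out = np.append(x, 10.0)
y_out = np.append(y, -10.0)
r_out, _ = stats.pearsonr(x_out, y_out)
print(f"Pearson r with one outlier added = {r_out:.3f}")
```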
Unfortunately, marginal totals may or may not provide a good estimate of the amount of chance rater agreement under uncertainty. It is therefore questionable whether the reduction in the agreement estimate produced by the kappa statistic truly reflects the amount of chance rater agreement. Theoretically, Pr(e) is an estimate of the rate of agreement that would occur if the raters guessed on every item, guessed at rates similar to the marginal proportions, and were entirely independent of one another (11). None of these assumptions is well justified, and so there is considerable disagreement among researchers and statisticians about the use of kappa (see, e.g., Cicchetti DV, Feinstein AR. High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol 1990;43:551-8, doi:10.1016/0895-4356(90)90159-M; Gwet KL. Computing inter-rater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol 2008;61:29-48, doi:10.1348/000711006X126600).
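For reference, the standard formulation these assumptions feed into is the following, where Pr(a) is the observed agreement and Pr(e) the chance agreement estimated from the raters' marginal proportions (the notation is generic, not taken verbatim from the cited sources):

\[
\kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)},
\qquad
\Pr(e) = \sum_{k} p_{k,1}\, p_{k,2},
\]

where \(p_{k,1}\) and \(p_{k,2}\) are the proportions of items that rater 1 and rater 2, respectively, assign to category \(k\).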
The kappa statistic, or Cohen's kappa, is a measure of interrater reliability for categorical variables; in fact, it is nearly synonymous with interrater reliability. To this point, the discussion has assumed that the majority of raters were correct, that the minority raters were wrong in their scores, and that all raters made a deliberate choice of rating. Jacob Cohen recognized that this assumption could be false. Indeed, he explicitly noted: "In the typical situation, there is no criterion for the 'correctness' of judgments" (5). Cohen raised the possibility that, for at least some of the variables, none of the raters was sure which score to assign and they simply made random guesses.
In that case, the observed agreement is spurious. Cohen's kappa was developed to address this concern. Once kappa has been calculated, the researcher will likely want to evaluate the significance of the obtained kappa by calculating confidence intervals around it. Percent agreement, by contrast, is a direct measure rather than an estimate, so there is little need for confidence intervals around it.
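As a minimal sketch of such an interval, the code below uses a commonly cited large-sample approximation to the standard error of kappa, SE = sqrt(Pr(a)(1 - Pr(a)) / (n(1 - Pr(e))^2)); the 2×2 table of counts is hypothetical:

```python
import numpy as np

def kappa_with_ci(table, z=1.96):
    """Cohen's kappa with an approximate large-sample confidence interval.

    `table` is a square array of counts (rows = rater A, columns = rater B).
    Uses the simple standard-error approximation
    SE = sqrt(po * (1 - po) / (n * (1 - pe)**2)).
    """
    table = np.asarray(table, dtype=float)
    n = table.sum()
    po = np.trace(table) / n                                  # observed agreement
    pe = np.sum(table.sum(axis=1) * table.sum(axis=0)) / n**2 # chance agreement
    kappa = (po - pe) / (1 - pe)
    se = np.sqrt(po * (1 - po) / (n * (1 - pe) ** 2))
    return kappa, (kappa - z * se, kappa + z * se)

# Small hypothetical sample of 30 paired ratings.
k, (lo, hi) = kappa_with_ci([[18, 2], [3, 7]])
print(f"kappa = {k:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

With only 30 ratings, the 95% interval here runs from roughly 0.31 to 0.92, that is, from fair-to-moderate up to almost perfect agreement, which is exactly the interpretive problem described above.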