
Reliability (statistics)

The test-retest method assesses the external consistency of a test.

Assessing Reliability

The goal of estimating reliability is to determine how much of the variability in test scores is due to errors in measurement and how much is due to variability in true scores.

A true score is the replicable feature of the concept being measured. It is the part of the observed score that would recur across different measurement occasions in the absence of error. Errors of measurement are composed of both random error and systematic error. They represent the discrepancies between scores obtained on tests and the corresponding true scores.

The goal of reliability theory is to estimate errors in measurement and to suggest ways of improving tests so that errors are minimized. The central assumption of reliability theory is that measurement errors are essentially random.

This does not mean that errors arise from random processes. For any individual, an error in measurement is not a completely random event. However, across a large number of individuals, the causes of measurement error are assumed to be so varied that measurement errors act as random variables.

If errors have the essential characteristics of random variables, then it is reasonable to assume that errors are equally likely to be positive or negative, and that they are not correlated with true scores or with errors on other tests. Under these assumptions, reliability theory shows that the variance of obtained scores is simply the sum of the variance of true scores and the variance of errors of measurement.
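The variance result stated above can be written out in a few lines. The symbols are assumptions introduced here for illustration, not notation from the text: X is the observed score, T the true score, and E the error of measurement.

```latex
\begin{align*}
X &= T + E \\
\sigma_X^2 &= \sigma_T^2 + \sigma_E^2 + 2\,\operatorname{Cov}(T, E) \\
           &= \sigma_T^2 + \sigma_E^2
              \quad \text{because } \operatorname{Cov}(T, E) = 0
\end{align*}
```

The covariance term vanishes only because errors are assumed to be uncorrelated with true scores; if that assumption fails, the simple sum no longer holds.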

In its general form, the reliability coefficient is defined as the ratio of true score variance to the total variance of test scores. Equivalently, it is one minus the ratio of the error score variance to the observed score variance. Unfortunately, there is no way to directly observe or calculate the true score, so a variety of methods are used to estimate the reliability of a test.
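The two equivalent forms of the reliability coefficient can be checked on a toy data set. This is only a sketch: in practice true scores cannot be observed, so the values below are hypothetical and were chosen by hand so that the errors average zero and are exactly uncorrelated with the true scores.

```python
from statistics import pvariance

# Hypothetical true scores and errors (not observable in real testing).
# Errors have mean zero and zero covariance with the true scores.
true_scores = [10, 10, 20, 20]
errors = [-1, 1, -1, 1]

# Each observed score is a true score plus an error of measurement.
observed = [t + e for t, e in zip(true_scores, errors)]

var_true = pvariance(true_scores)
var_error = pvariance(errors)
var_observed = pvariance(observed)

# Form 1: true score variance divided by observed score variance.
reliability_ratio = var_true / var_observed

# Form 2: one minus error variance divided by observed score variance.
reliability_complement = 1 - var_error / var_observed

print(var_true, var_error, var_observed)
print(reliability_ratio, reliability_complement)
```

Both forms agree here because the observed variance is exactly the sum of the true and error variances in this constructed example.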

Some examples of the methods to estimate reliability include test-retest reliability, internal consistency reliability, and parallel-test reliability. Each method comes at the problem of figuring out the source of error in the test somewhat differently.

It was well known to classical test theorists that measurement precision is not uniform across the scale of measurement.

Tests tend to distinguish better among test-takers with moderate trait levels and worse among high- and low-scoring test-takers. Item response theory extends the concept of reliability from a single index to a function called the information function. The IRT information function is the inverse of the squared conditional observed score standard error at any given test score.

Four practical strategies have been developed that provide workable methods of estimating test reliability.

The correlation between scores on the first test and the scores on the retest is used to estimate the reliability of the test using the Pearson product-moment correlation coefficient.

The key to the parallel-forms method is the development of alternate test forms that are equivalent in terms of content, response processes and statistical characteristics.

For example, alternate forms exist for several tests of general intelligence, and these tests are generally seen as equivalent. If both forms of the test were administered to a number of people, differences between scores on form A and form B may be due to errors in measurement only.
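The test-retest estimate described earlier is a plain Pearson correlation between paired scores, and the same computation applies when the two score lists come from form A and form B. The sketch below is illustrative only: the function name `pearson_r` and the score lists are hypothetical, not data from any real test.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each list.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a list has no variance")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical scores for six people on the first test and on the retest.
first_administration = [12, 15, 9, 20, 17, 11]
retest = [13, 14, 10, 19, 18, 10]

test_retest_reliability = pearson_r(first_administration, retest)
print(round(test_retest_reliability, 3))
```

Each position in the two lists must refer to the same person; the estimate is meaningless if the pairing is lost.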

The correlation between scores on the two alternate forms is used to estimate the reliability of the test. This method provides a partial solution to many of the problems inherent in the test-retest reliability method. For example, since the two forms of the test are different, carryover effect is less of a problem. Reactivity effects are also partially controlled, although taking the first test may change responses to the second test.

However, it is reasonable to assume that the effect will not be as strong with alternate forms of the test as with two administrations of the same test.

The split-half method treats the two halves of a measure as alternate forms. It provides a simple solution to the problem that the parallel-forms method faces: the difficulty of developing equivalent alternate forms. The correlation between these two split halves is used in estimating the reliability of the test. This half-test reliability estimate is then stepped up to the full test length using the Spearman-Brown prediction formula.
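The step-up mentioned above can be sketched with the standard form of the Spearman-Brown prediction formula, in which a test lengthened by a factor k has predicted reliability k·r / (1 + (k − 1)·r). For a split-half estimate k is 2. The function name and parameter names below are illustrative assumptions.

```python
def spearman_brown(r, length_factor=2.0):
    """Predict reliability after changing test length by `length_factor`.

    r: reliability (correlation) of the shorter test, e.g. one half.
    length_factor: 2.0 steps a half-test estimate up to the full test.
    """
    denominator = 1 + (length_factor - 1) * r
    if denominator <= 0:
        raise ValueError("formula is undefined for this r and length factor")
    return length_factor * r / denominator


# A half-test correlation of 0.6 steps up to 0.75 for the full-length test.
half_test_r = 0.6
full_test_reliability = spearman_brown(half_test_r)
print(full_test_reliability)
```

The correction is needed because each half contains only half the items, and shorter tests are generally less reliable than longer ones.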

There are several ways of splitting a test to estimate reliability. For example, a vocabulary test could be split into two subtests, the first one made up of items 1 through 20 and the second made up of the remaining items. However, the responses from the first half may be systematically different from responses in the second half due to an increase in item difficulty and fatigue.

In splitting a test, the two halves would need to be as similar as possible, both in terms of their content and in terms of the probable state of the respondent. The simplest method is to adopt an odd-even split, in which the odd-numbered items form one half of the test and the even-numbered items form the other. This arrangement guarantees that each half will contain an equal number of items from the beginning, middle, and end of the original test.
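The odd-even split described above can be sketched as follows. The response matrix is hypothetical (five respondents, six items scored 0 or 1), and the helper names are assumptions made for illustration. The half-test correlation is stepped up to full length with the Spearman-Brown formula for a doubled test.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a half has no variance")
    return cross / sqrt(ss_x * ss_y)


def odd_even_split_half(responses):
    """Return (half-test correlation, stepped-up full-test estimate).

    responses: one row per respondent, each row a list of item scores
    in the order the items appear on the test.
    """
    # Items 1, 3, 5, ... sit at list indices 0, 2, 4, ...
    odd_totals = [sum(row[0::2]) for row in responses]
    # Items 2, 4, 6, ... sit at list indices 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in responses]
    r_half = pearson_r(odd_totals, even_totals)
    if r_half <= -1:
        raise ValueError("step-up is undefined for a correlation of -1")
    # Spearman-Brown step-up for a test twice as long as each half.
    r_full = 2 * r_half / (1 + r_half)
    return r_half, r_full


# Hypothetical item responses (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
]

r_half, r_full = odd_even_split_half(responses)
print(round(r_half, 3), round(r_full, 3))
```

Because odd and even items alternate through the test, each half draws equally from the easy early items and the harder later ones, which is the property the odd-even split is chosen for.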

The most common internal consistency measure is Cronbach's alpha, which is usually interpreted as the mean of all possible split-half coefficients.

As part of a stress experiment, people are shown photos of war atrocities. After the study, they are asked how the pictures made them feel, and they respond that the pictures were very upsetting. In this study, the photos have good internal validity as stress producers.

External validity: the results can be generalized beyond the immediate study.

In order to have external validity, the claim that spaced study (studying in several sessions ahead of time) is better than cramming for exams should apply to more than one subject. It should also apply to people beyond the sample in the study. Different methods vary with regard to these two aspects of validity. Experiments, because they tend to be structured and controlled, are often high on internal validity.

However, their strength with regard to structure and control may result in low external validity. The results may be so limited as to prevent generalizing to other situations. In contrast, observational research may have high external validity (generalizability) because it has taken place in the real world.

However, the presence of so many uncontrolled variables may lead to low internal validity, in that we can't be sure which variables are affecting the observed behaviors.

Relationship between reliability and validity

If data are valid, they must be reliable. If people receive very different scores on a test every time they take it, the test is not likely to predict anything.

However, if a test is reliable, that does not mean that it is valid. For example, we can measure strength of grip very reliably, but that does not make it a valid measure of intelligence or even of mechanical ability. Reliability is a necessary, but not sufficient, condition for validity.


Reliability has to do with the quality of measurement. In its everyday sense, reliability is the "consistency" or "repeatability" of your measures. Before we can define reliability precisely we have to lay the groundwork.


Reliability refers to whether or not you get the same answer by using an instrument to measure something more than once. In simple terms, research reliability is the degree to which a research method produces stable and consistent results. A specific measure is considered to be reliable if its.