Common misconceptions

Common mistake
Wrong: A highly reliable measurement instrument is also valid.
Right: A measure can be highly reliable (consistent) yet invalid if it consistently measures the wrong construct; reliability is necessary but not sufficient for validity.
Reliability and validity are distinct properties: consistency alone does not show that a measure captures the right thing. Think of a scale that is consistently miscalibrated to read 5 pounds too heavy. It is perfectly reliable (same error every time) but invalid (it does not reflect true weight). On the MCAT, if a question tells you an instrument has high test-retest reliability, that alone does not establish that it measures the intended construct.
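To make the miscalibrated scale concrete, here is a minimal sketch. The five readings and the 150 lb true weight are made-up numbers for illustration, not data from any study. It separates the two ideas: the spread of repeated readings stands in for reliability, and the average error against the true weight stands in for validity.

```python
from statistics import mean, pstdev

# Hypothetical data: five repeated weighings of a 150 lb object
# on a scale assumed to read about 5 lb too heavy every time.
TRUE_WEIGHT_LB = 150.0
readings_lb = [155.0, 155.1, 154.9, 155.0, 155.0]

# Small spread across repeated readings -> consistent (reliable).
spread = pstdev(readings_lb)

# Large systematic error vs. the true value -> inaccurate (invalid).
bias = mean(readings_lb) - TRUE_WEIGHT_LB

print(f"spread across readings: {spread:.2f} lb")
print(f"average error vs. true weight: {bias:+.2f} lb")
```

A tiny spread paired with a large average error is the "reliable but invalid" pattern.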
Common mistake
Wrong: Inter-rater reliability and test-retest reliability both assess consistency of the same measure over time.
Right: Test-retest reliability assesses consistency of a measure across time points; inter-rater reliability assesses consistency of a measure across different raters or observers.
These two types of reliability address different sources of inconsistency. Test-retest reliability asks: does the same subject, measured with the same instrument, get the same result at two different time points? Inter-rater reliability asks: do two different raters, using the same instrument, get the same result for the same subject? Test-retest is about stability over time; inter-rater is about agreement across observers. If a passage says two researchers scored the same video differently, that is an inter-rater problem, not a test-retest problem.
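The difference shows up in which two columns of numbers you compare. The sketch below uses invented scores and a hand-written Pearson correlation (`pearson_r` is a helper defined here, not a library function). Correlation is used only as a simple stand-in for agreement; real studies often report statistics such as Cohen's kappa or an intraclass correlation for inter-rater reliability.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical test-retest data: the same five participants complete
# the same questionnaire twice, two weeks apart.
week0 = [12, 18, 25, 9, 30]
week2 = [13, 17, 26, 10, 29]

# Hypothetical inter-rater data: two raters score the same five videos.
rater_a = [4, 7, 2, 9, 5]
rater_b = [8, 3, 6, 4, 7]

test_retest_r = pearson_r(week0, week2)     # same measure, two time points
inter_rater_r = pearson_r(rater_a, rater_b)  # same subjects, two observers

print(f"test-retest r: {test_retest_r:.2f}")
print(f"inter-rater r: {inter_rater_r:.2f}")
```

In this made-up data set the questionnaire is stable over time but the two raters disagree, so only the inter-rater reliability is in question.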
Common mistake
Wrong: Criterion validity means a test measures the theoretical construct it is designed to assess.
Right: Criterion validity means a test correlates with an external criterion (concurrent or predictive); construct validity means the test measures the intended theoretical construct.
Construct validity is about the theory: does the test actually measure the theoretical construct it is supposed to (e.g., does an 'anxiety scale' actually capture anxiety as psychologists define it)? Criterion validity is empirical: does the test correlate with some external, real-world criterion (e.g., does a new anxiety scale correlate with psychiatrist diagnoses)? The confusion arises because both sound like they are asking 'does the test measure what it should.' The key distinction: criterion validity requires an external benchmark, whereas construct validity concerns the conceptual fit to a theory.
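A criterion validity check boils down to comparing test scores against the external benchmark. The sketch below uses invented anxiety-scale scores and invented yes/no diagnoses, and computes a point-biserial correlation (the Pearson correlation when one variable is coded 0/1). All names and numbers are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical data: scores on a new anxiety scale, paired with an
# independent psychiatrist diagnosis (1 = diagnosed, 0 = not diagnosed).
scale_scores = [22, 8, 30, 12, 27, 6, 25, 10]
diagnosed = [1, 0, 1, 0, 1, 0, 1, 0]

with_dx = [s for s, d in zip(scale_scores, diagnosed) if d == 1]
without_dx = [s for s, d in zip(scale_scores, diagnosed) if d == 0]

# Difference in average score between the two diagnostic groups.
mean_gap = mean(with_dx) - mean(without_dx)

# Point-biserial correlation: group mean difference scaled by the
# population standard deviation of all scores and the group proportions.
p = len(with_dx) / len(scale_scores)
r_pb = mean_gap / pstdev(scale_scores) * sqrt(p * (1 - p))

print(f"mean score gap between groups: {mean_gap:.1f}")
print(f"point-biserial r with diagnosis: {r_pb:.2f}")
```

A strong correlation with the diagnoses supports criterion validity. It does not by itself show that the scale captures anxiety as the theory defines it, which is the construct validity question.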

What the exam tests

  1. Know the core definitions: reliability means a measurement produces consistent, repeatable results; validity means it actually measures what it's supposed to measure — and understand that a reliable instrument is not automatically a valid one.
  2. Distinguish between the three types of reliability: test-retest (same measure, different time points), inter-rater (same measure, different observers), and internal consistency (different items on the same instrument measuring the same construct).
  3. Distinguish between the main types of validity: content validity (covers all aspects of the construct), construct validity (measures the intended theoretical construct), criterion validity (correlates with an external criterion — either concurrently or predictively), and face validity (appears to measure what it claims on the surface).
  4. Read a passage describing a study's measurement tool and identify specific reliability or validity weaknesses — for example, recognizing that a questionnaire with inconsistent responses across administrations has a test-retest reliability problem, or that a test correlating with an external gold standard is demonstrating criterion validity.
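Internal consistency is the one reliability type in the list above that is usually summarized with a single statistic, commonly Cronbach's alpha. The sketch below computes it from a small invented response table (rows are respondents, columns are items); `cronbach_alpha` is a helper written for this example, not a library call. You will not be asked to calculate alpha on the exam, but seeing the formula helps explain why items that rise and fall together produce a high value.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a table of responses.

    responses: list of rows, one per respondent, each row holding that
    respondent's score on every item of the instrument.
    """
    k = len(responses[0])                                    # number of items
    item_vars = [variance(item) for item in zip(*responses)]  # per-item variance
    total_var = variance([sum(row) for row in responses])     # variance of total scores
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: five respondents answer four items on a 1-5 scale.
responses = [
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [3, 3, 3, 4],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha: {alpha:.2f}")
```

Here each respondent answers all four items in a similar way, so alpha comes out high, which is what you would expect if the items tap the same construct.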

Can you avoid these mistakes?

A researcher develops a new blood pressure cuff that consistently reads 10 mmHg higher than the gold-standard device for every patient. How would you characterize this instrument in terms of reliability and validity?
A psychology study uses a depression questionnaire. Two trained clinicians score the same patient's responses and get very different depression scores. What type of reliability is compromised, and what would you change to address it?
A new cognitive test for early Alzheimer's detection is administered to patients and the scores are compared against neurologist diagnoses made independently. The test scores correlate strongly with the diagnoses. What type of validity does this demonstrate — and how is it different from construct validity?
A passage describes a study where participants completed a stress questionnaire twice, two weeks apart, with no intervention in between. The scores were highly correlated across the two administrations. What type of reliability does this demonstrate, and does this finding alone tell you the questionnaire is a valid measure of stress? Why or why not?
