A test is a sample of behavior, i.e., a series of tasks (e.g., items) used to obtain systematic observations presumed to represent attributes or characteristics. A test is used as a measurement tool. Measurement is the process of assigning numbers to human attributes or characteristics. Assessment is the use of methods or processes to gather data about, or evidence of, human behavior.
Assessment is a preferred term because it merely connotes the collection of data concerning the present state of human behavior, whereas the term diagnosis connotes determination of the degree of abnormality. Interpretation is the act of stating the meaning and/or usefulness of behavioral data. Evaluation is the process of applying judgments to and/or making decisions based on the results of measurement. An evaluation program is a program designed to measure and assess an individual’s growth, adjustment, and/or achievement, or a program’s effectiveness. Tests used in the counseling professions are usually classified into five categories: intelligence, aptitude, achievement, interest, and personality.
Aptitude, achievement, and intelligence tests are sometimes clustered under the heading ability tests. An ability test is a standardized test that measures a test taker’s current level of performance in a specified area of cognitive, psychomotor, or physical functioning. An achievement test measures a test taker’s achievement level in one or more content or subject matter areas.
An adjustment inventory is a self-report instrument used to identify personal and social adjustment problems.
Cognitive assessment is a data-collecting technique used to assess an individual’s ability to perform mental activities relative to acquiring, processing, retaining, conceptualizing, and organizing verbal, spatial, psychomotor, sensory, and perceptual information. A diagnostic (ability) test measures specific aspects of achievement in a single subject or field. An intelligence test is a psychological or educational test designed to measure intellectual operations, functions, and general abilities. An inventory assesses an individual’s opinions, interests, and dispositions about specific situations. A mastery test assesses whether an individual has achieved mastery, generally defined by a passing or cut score, in a specific domain of knowledge or skill.
A multi-factor test measures multiple constructs that are relatively uncorrelated with one another. A performance test is one that generally requires the use and manipulation of physical objects and the application of physical and manual skills in situations rather than oral or written responses. A screening test is a beginning point in a selection or diagnostic process that identifies broad classifications of test takers. A projective test (or technique) assesses personality dynamics through psychological projection. Test takers respond to ink blots, pictures, incomplete sentences, or other unstructured stimuli in such a manner that they “project” into their responses manifestations of personality characteristics. An aptitude test is a cognitive or psychomotor measure used to predict success in a course, job, or educational or training program.
An interest inventory measures preferences for one or more activities from a large set of possible activities.
A personality inventory measures one or more aspects of personality, including attributes, dynamics, or characteristic ways of behaving. A self-report inventory usually consists of questionnaire-type statements requiring a limited form of responding (e.g., true-false or multiple-choice items). An individual test is administered to one person at a time. A group test is administered simultaneously to a group of people. In a power test, speed is not measured as a component of performance, i.e., there is more than sufficient time to respond. In a speed(ed) test, time is measured as a component of performance. A verbal test necessitates command of language for effective responding. A nonverbal test de-emphasizes comprehension of language as a requirement for effective responding. An objective test has clear and unambiguous scoring criteria. A placement test is used to assign individuals to different levels or categories. Construct equivalence is the degree to which multiple tests measure essentially the same construct. It also refers to the extent to which the same test measures the same construct when administered to two different cultural or linguistic groups. Documentation includes supporting materials such as test manuals and research reports created by test authors and publishers to provide evidence of a test’s quality and promote use of that test. Discriminating power is the ability of a test item to differentiate between individuals who possess much of a given characteristic such as skill, knowledge, or attitude, and individuals who possess little of the characteristic.
Adaptive testing is an individualized, sequential form of testing in which successive test items are selected on the basis of a test taker’s responses to previous items. Test items also are selected based on psychometric properties and test content.
A pilot test is the administration of a test to a representative sample of examinees so that the test’s properties may be determined. A test battery is a group of tests for which the results are valued individually and/or in combination. It is standardized on the same population so that norm-referenced scores can be derived and used for comparison and decision-making purposes. A standardized test is one in which testing conditions are the same for all examinees, including directions, scoring procedures, test use, data on reliability and validity, and adequately determined norms. A field test is an administration of a test employed to examine the quality of testing procedures such as test administration, responding, scoring, and reporting, in a manner that is more extensive than in pilot testing. Alternate forms are two or more interchangeable versions of a test that generally assess the same construct, use the same instructions for test administration, and are given for the same purposes. Alternate forms include: parallel forms, which have identical content and psychometric properties; equivalent forms, which sample the same content areas and are considered equivalent in regard to derived scores; and comparable forms, which have similar content areas but do not share statistical similarity. Neuropsychological assessment is an evaluation that generates possible hypotheses and conclusions regarding processes that affect the central nervous system, or psychological or behavioral dysfunctions related to pathology in the central nervous system. Outcome evaluation is a practitioner-generated assessment of the efficacy of a particular intervention, program, or service.
A job analysis identifies the (a) knowledge, skills, abilities, and other personal qualities needed to perform a given job; and (b) the specific tasks to be performed relative to the job.
Portfolio assessment is the evaluation of systematically collected educational or work products over a period of time. Performance assessment is evaluation of observable products or behaviors in settings designed to represent real-life contexts in which knowledge and skills are actually utilized. Program evaluation is assessment of the efficacy of a planned set of procedures. Personality assessment is evaluation of normal or abnormal dimensions of personality. Psychological assessment is an evaluation of an individual’s psychological functioning that includes administering, scoring, and interpreting tests and inventories, behavioral observations, client and third-party interviews, and analysis of prior educational, occupational, medical, and psychological records. Psychological testing is employment of tests and inventories to measure an individual’s psychological traits and dimensions. Vocational assessment is a form of psychological assessment that generates hypotheses and inferences related to constructs such as the test taker’s values, work needs, interests, and career-development status. Norms are statistics that describe the performance of individuals of various ages or grade levels who comprise the standardization group for the test. Age norms are scores that represent average performances for individuals by chronological age. They usually are expressed as central tendencies, scores, percentiles, standard scores, or stanines. Local norms are a set of scores obtained from a specific sample that are not considered generalizable to populations beyond the sample. The reference population is the group of people from which a sample was used to establish norms for a given test.
The standardized sample is the group of people from the reference population whose performances were used to establish the norms for a given test.
Utility is an evaluation, often in cost-benefit form, of the relative value of using or not using a given test for a specific purpose. Ability is the power to perform a designated responsive act. The power may be potential or actual, native, or acquired. Achievement level is an individual’s performance and competency in a specified subject area. The description of achievement level is usually defined as a category on a continuum that ranges from “basic” to “advanced.” Aptitude is the capacity to gain proficiency with training. Intelligence is the cognitive ability to perceive and understand relationships, such as logical, spatial, verbal, numerical, and recall of associated meanings. Intelligence is sometimes considered synonymous with academic aptitude, scholastic aptitude, mental ability, capacity, or mental maturity. A raw score is an original and unadjusted test score, usually characterized by a sum of the correct answers or another combination of item scores. The “ceiling” is the upper limit of ability measured by a test. The “ceiling effect” is when many respondents achieve very high (raw) scores on a test or measurement, i.e., the test is too easy for most of the respondents. A criterion is a standard, norm, or judgment used as the basis for quantitative and/or qualitative comparison. In a criterion-referenced test, score interpretations are made based on the test taker’s independent performance level, rather than relative to the performance levels achieved by others. In a norm-referenced test, score interpretations are made relative to the performance levels achieved by others.
A composite score results from the combination of several scores as specified by a certain formula.
A cut score is the particular score value or point on a score scale that differentiates interpretation of scores below or above the point. If one cut score is used, the potential scores may fall into ranges of either “pass” and “mastery” or “fail” and “nonmastery.” A gain score is the difference between an individual’s two test scores on the same or equivalent test. Holistic scoring is a method that uses previously specified criteria to determine an overall appraisal of performance on a test or test item. A derived score is one numerically converted from a quantitative or qualitative mark on one scale into the units of another scale. It is also referred to as a scaled score. Examples include grade placement, chronological age equivalent, chronological age placement, educational age, intelligence quotient, percentile rank, and standard score. An equated score is a derived score that is comparable from test to test, such as standard scores, grade placements, and mental ages. A grade-equivalent score is the real or estimated mean or median score for a grade-level population. An intelligence quotient (IQ) is a measure of potential rate of intellectual growth that is expressed as the ratio of mental age (MA) to chronological age (CA). The formula is IQ = MA/CA x 100. A mental age is the average or normal chronological age for a given score on an intelligence test. A deviation IQ is an intelligence test score that is a derived score based on the individual’s deviation from the mean of the norm group in standard deviation units. A scaled score is a unit in a system of equated scores established for the raw scores of a test.
A scaled score usually is interpreted relative to the mean performance of a given reference group, whereby the interval between any pair of scaled scores represents meaningful differences in terms of the characteristics of the reference group.
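The ratio IQ formula and the deviation IQ definition given above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the deviation IQ scale (mean 100, standard deviation 15) is an assumed convention that the text itself does not specify.

```python
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Ratio IQ as stated in the text: IQ = MA / CA x 100.

    Both ages must be in the same unit (e.g., months).
    """
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    return mental_age / chronological_age * 100


def deviation_iq(raw_score: float, norm_mean: float, norm_sd: float,
                 scale_mean: float = 100.0, scale_sd: float = 15.0) -> float:
    """Deviation IQ: distance from the norm-group mean in SD units, rescaled.

    scale_mean and scale_sd are assumptions for illustration only.
    """
    if norm_sd <= 0:
        raise ValueError("norm-group standard deviation must be positive")
    z = (raw_score - norm_mean) / norm_sd  # deviation in SD units
    return scale_mean + scale_sd * z


if __name__ == "__main__":
    # A child with a mental age of 120 months and a chronological age of 96 months
    print(ratio_iq(120, 96))
    # A raw score one standard deviation above the norm-group mean
    print(deviation_iq(60, norm_mean=50, norm_sd=10))
```

With these inputs the ratio IQ is 125, and the deviation IQ is 115 under the assumed scale. The two functions make the contrast explicit: the first depends only on the two ages, the second only on the norm group's mean and standard deviation.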
A scoring rubric is the set of principles, rules, and standards used to assess an individual’s performance, a product, or a response to a test item. Scoring rubrics vary by the amount of judgment involved, number of distinct score levels, and latitude for intermediate or fractional score values. A standard score (e.g., Sigma score, T score, or z score) is a type of derived score that indicates the extent to which a score deviates from the mean. A distribution of standard scores for a specified population will have values for the mean and standard deviation that can be readily interpreted and understood. A “true score” is the mean score of the theoretical distribution of scores that would be obtained by the individual test taker on an unlimited number of identical administrations of the same test. In “true score theory,” X (the actual/observed score received) = true score (i.e., actual trait level) + systematic error (e.g., test anxiety) + random error (e.g., not feeling well).
Classification accuracy is the degree of accurate categorizations and diagnoses when a test is used to classify an individual or event. A false negative is an error whereby an outcome or performance that is predicted not to meet expected criteria actually meets those criteria. A false positive is an error whereby an outcome or performance is predicted to meet expected criteria but actually does not fulfill those criteria. In a high-stakes test, results have a significant and direct impact for the individual test taker, program, or institution being evaluated.
In a low-stakes test, results have inconsequential impact on the individual test taker, program, or institution.
Intervention planning is the work behavior of a practicing helping professional that involves the development of treatment goals, plans, and protocols. The local setting is the place where a test is used. Local evidence is the reliability and/or validity data collected for a given set of test takers at a single institution or specific location. A test user is an individual or organization that chooses to administer a test and interpret the scores elicited in a given setting so that test-based decisions and actions may be made. Psychodiagnosis is the use of psychological test data to classify an individual’s mental health status. Selection is an objective of testing that results in either accepting or rejecting candidates for specific opportunities in educational and employment contexts. Sensitivity is the extent to which a diagnostic test identifies a disorder when it actually is present. (Test) Bias is the underrepresentation or irrelevance of construct components in test scores that results in one group of test takers being typically favored over another. Response bias is the systematic error caused by the test taker’s tendency to respond in a certain way to test items. Translational equivalence is the extent to which the (original) content of a test corresponds to a linguistically translated version of the test. Sociometry is the measurement of the interpersonal relationships among members of a group. Coaching is the process of helping prospective examinees increase their test scores. It includes practices such as learning test-taking strategies that are independent of the curricula of schools and training programs.
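Sensitivity, together with the false positive, false negative, and classification accuracy terms defined earlier, can be illustrated with a small tally of test decisions against actual status. The following is a minimal sketch: the function name and the counts are hypothetical and are not drawn from any real test.

```python
def classification_summary(true_pos: int, false_pos: int,
                           true_neg: int, false_neg: int) -> dict:
    """Summarize how well test-based classifications match actual status.

    true_pos:  disorder present, test says present
    false_pos: disorder absent,  test says present
    true_neg:  disorder absent,  test says absent
    false_neg: disorder present, test says absent
    """
    total = true_pos + false_pos + true_neg + false_neg
    actually_present = true_pos + false_neg
    if total == 0 or actually_present == 0:
        raise ValueError("need at least one case with the disorder present")
    return {
        # share of cases with the disorder that the test identifies
        "sensitivity": true_pos / actually_present,
        # share of all classifications that are correct
        "classification_accuracy": (true_pos + true_neg) / total,
    }


if __name__ == "__main__":
    # Hypothetical counts for 100 test takers
    print(classification_summary(true_pos=45, false_pos=10,
                                 true_neg=40, false_neg=5))
```

In this made-up example the test identifies 45 of the 50 people who actually have the disorder (sensitivity of 0.9), while 85 of the 100 classifications overall are correct.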
Correction for guessing is a score-change technique that compensates for guessing on a test. The number of right answers on a test is adjusted by subtracting a proportion of the total number of incorrect responses from the total number of correct answers.
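As a rough illustration of the adjustment just described, the sketch below subtracts a proportion of the wrong answers from the right answers. The text does not say which proportion to use; the code assumes the commonly used convention of 1/(k - 1), where k is the number of answer options per item, and the function name is hypothetical.

```python
def corrected_score(num_right: int, num_wrong: int, options_per_item: int) -> float:
    """Correction for guessing: rights minus a proportion of wrongs.

    Assumes the proportion is 1 / (options_per_item - 1), a common
    convention that the source text does not itself specify.
    Omitted (unanswered) items are counted as neither right nor wrong.
    """
    if options_per_item < 2:
        raise ValueError("items need at least two answer options")
    if num_right < 0 or num_wrong < 0:
        raise ValueError("counts cannot be negative")
    return num_right - num_wrong / (options_per_item - 1)


if __name__ == "__main__":
    # 40 right and 12 wrong on four-option multiple-choice items
    print(corrected_score(40, 12, options_per_item=4))
```

Under this assumption, 40 right and 12 wrong on four-option items gives a corrected score of 36, since one third of the 12 wrong answers is subtracted.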
Flagging is the process of attaching an indicator to a test score to signify that the score was obtained in a nonstandardized testing administration. Item analysis is a method used in test construction to determine how well a given test item discriminates among individuals who differ in some characteristic. Item-effectiveness considerations include validity relative to curriculum content and educational objectives, discriminating power relative to validity and internal consistency, and level of difficulty. A construct is the underlying theoretical concept or characteristic to be measured by a specific test. The construct domain is a set of associated attributes to be assessed by a specific test. The content domain is the specific set of skills or level of knowledge that is measured by a given test. The criterion domain is the variable used as a frame of reference when making comparisons for a specific test. An item pool is a set of potential items from which items are extracted for either the development of a test or the selection of successive items when adapting the test. An item prompt is a stimulus, such as a question or set of instructions, that guides the test taker in formulating a response. A test manual is a publication (aka a “user’s guide”) prepared by test developers and publishers to provide information on administering and scoring the test, and interpreting scores. It also may provide information on test characteristics, and procedures used in developing the test and evaluating the technical quality of its test scores. A technical manual is a publication prepared by test authors and publishers that provides technical and psychometric data concerning the respective test.
A test developer is the individual(s) or organization that constructed a test and its supporting materials.
Test development is the process of designing, constructing, assessing, and modifying a test. It includes the development of content, administration, and scoring procedures, and determination of technical quality. Test documents are publications, written works, and technical information concerning a test that test users may use to evaluate the test for appropriateness and technical adequacy for a particular intended purpose. Classical test theory is a school of thought that defines an individual’s observed test score as the sum of two separate components: a true test score and an independent error of measurement. Classical test theory and its premises about the components of a test score yield (traditional) implications for relationships among validity, reliability, and other statistical measures. Generalizability theory is an extension of classical test theory in which analyses are used to evaluate the generalizability of scores beyond the specific sample of items, persons, and observational conditions that were studied. Item response theory (IRT) is a theory of test performance that highlights the relationship between the mean item score and the calibrated level of the ability or trait measured by the item to theoretically yield the maximally appropriate items for each respondent. A population is the group of people to whom results will apply, typically considered as the group to whom results will be generalized. A sample is a subset of a given population. A random sample is a sample of a given population that is selected in such a way that selection bias is eliminated and every member of the population has an equal chance of being included in the sample.
Appraisal Part 2
Statistical Concepts for Appraisal
A frequency distribution is a tabulation of scores in numerical order showing the number of persons who obtain each score or group of scores. A frequency distribution is usually described in terms of its measures of central tendency (i.e., mean, median, and mode), range, and standard deviation. The (arithmetic) mean is the sum of a set of scores divided by the number of scores. The median is the middle score or point above and below which an equal number of ranked scores lie; it corresponds to the 50th percentile. The mode is the most frequently occurring score or value in a distribution of scores. The range is the arithmetic difference between the lowest and the highest scores obtained on a test by a given group. Variability is the dispersion or spread of a set of scores; it is usually discussed in terms of standard deviations. The standard deviation is a measure of the variability in a set of scores (i.e., frequency distribution). The standard deviation is the square root of the mean of the squared deviations around the mean (i.e., the square root of the variance for the set of scores). The normal distribution curve is a bell-shaped curve derived from the assumption that variations from the mean are by chance, as determined through repeated occurrences in the frequency distributions of sets of measurements of human characteristics in the behavioral sciences.
In a normal distribution, scores are symmetrically distributed above and below the mean, with the percentage of scores falling within each successive standard deviation unit decreasing as the scores progress away from the mean.
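The descriptive statistics defined in this section can be computed with Python's standard `statistics` module, as in the sketch below. The function names and the score list are hypothetical. The code assumes the population form of the variance (dividing by the number of scores), and it also shows z scores and T scores as examples of the standard scores defined earlier, with T scores using a mean of 50 and a standard deviation of 10.

```python
import statistics


def describe(scores: list) -> dict:
    """Central tendency, range, and variability for a set of scores."""
    return {
        "mean": statistics.fmean(scores),          # sum of scores / number of scores
        "median": statistics.median(scores),       # middle of the ranked scores
        "mode": statistics.mode(scores),           # most frequently occurring score
        "range": max(scores) - min(scores),        # highest minus lowest
        "variance": statistics.pvariance(scores),  # mean squared deviation (population form)
        "sd": statistics.pstdev(scores),           # square root of the variance
    }


def z_scores(scores: list) -> list:
    """Each score expressed as its deviation from the mean in SD units."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        raise ValueError("all scores are identical; z scores are undefined")
    return [(x - mean) / sd for x in scores]


def t_scores(scores: list) -> list:
    """T scores: z scores rescaled to a mean of 50 and an SD of 10."""
    return [50 + 10 * z for z in z_scores(scores)]


if __name__ == "__main__":
    sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical raw scores
    print(describe(sample))
    print(z_scores(sample))
    print(t_scores(sample))
```

For this sample the mean is 5, the median 4.5, the mode 4, the range 7, and the standard deviation 2, so the highest raw score of 9 lies two standard deviations above the mean.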