This resource provides explanations of the key terms encountered when exploring the Measurement Guidance Toolkit.


Association

A measure of the relation between two variables (or factors). If the values of one variable increase as the values of another increase, then the two variables have a positive association. If the values of one variable decrease as the values of another increase, the two variables have a negative association. An association does not necessarily indicate that one variable causes the other. It could be, for example, that two variables are associated because they are both influenced by another, third variable, and thus tend to increase or decrease together for this reason. Also see Correlation.


Biomarker

A measurable substance in an organism that indicates biological processes, pathogenic processes, or pharmacologic responses. It refers to indications of an objective physical state, as opposed to symptoms, which are limited to indications of health or illness perceived by persons themselves.

Comparison group

A group of people that does not receive the program being studied and that is compared to a group of program recipients on measures of interest. Comparison groups differ from “control groups” in that participants are not necessarily assigned randomly to be in the comparison group or the group receiving the program. Also see Control group.


Condition

A variation in services received as part of an evaluation. In a study that examines the difference between youth who receive an intervention (i.e., “intervention group”) and those who do not (i.e., “control” or “comparison group”), each of these groups would represent a condition in the study.

Continuous score

A score that can take on a large number of possible values within a particular range. In comparison, discrete scores are those that can take on only a limited number of possible values. For example, a question that asks a youth whether he or she engages in a behavior or not would generate discrete values (e.g., “yes” or “no”). The scoring for some of the scales provided in this toolkit assumes that the score generated for a respondent is continuous and thus that a youth responding to the scale can score anywhere between the minimum and maximum values. For example, a scale with 8 questions, each scored from 0 to 4, would have a minimum value of 0 and a maximum value of 32. A youth responding to the questions could have a score anywhere within that range. The dividing line between whether a score is best regarded as continuous or discrete is not always clear-cut and often may be informed by different types of statistical analyses conducted as part of the scale validation process.

Control group; Randomly assigned control group

A group that had the chance of being offered an intervention, but was randomly selected to not receive it. The data collected for this group are intended to represent what would have been the case for the intervention group had the youth in this group not been offered the program. Also see Comparison group.

Control statistically

Reducing the potential effect of a variable or set of variables (or factors) on an association between two other variables. For example, when looking at the association between mentoring with high school graduation, other variables that could affect graduation (e.g., academic performance, parent education, school connectedness) could be held constant to ensure that their influence on high school graduation does not distort or bias conclusions that are made about the potential contributions of mentoring to this outcome.


Correlation

A statistical measure that indicates the degree to which two or more variables are associated with each other. A positive correlation indicates that as one variable increases (or decreases), the other does as well; a negative correlation indicates that as one variable increases, the other decreases (or vice versa). The correlation between two variables can be anywhere from -1.0 (perfect negative correlation) to 1.0 (perfect positive correlation). A correlation does not necessarily indicate that one variable causes the other. It could be, for example, that two variables are correlated because they are both influenced by another, third variable, and thus tend to increase or decrease together for this reason. Also see Association.
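As an illustration, one common correlation coefficient (Pearson's r) can be computed directly from paired scores. This is a minimal sketch; the variable names and data are hypothetical and purely illustrative:

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient: ranges from -1.0 to 1.0."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical paired scores: as one variable rises, so does the other,
# giving a positive correlation close to 1.0
mentoring_hours = [1, 2, 3, 4, 5]
connectedness_scores = [10, 12, 13, 15, 20]
print(pearson_r(mentoring_hours, connectedness_scores))
```

A coefficient of exactly 1.0 or -1.0 would indicate a perfect positive or negative correlation, respectively.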

Cut-off score

An established score that categorizes responses on a particular outcome measure. For example, in assessing youth physical activity levels using the measure provided in this toolkit, a cut-off score of 3 or more days of physical activity a week can be used to classify youth as “physically active.”

Demographic subgroup

A group that shares demographic characteristics. For example, African American youth are a demographic subgroup of youth, and female African American youth are a subgroup of African American youth.

Developmental applicability; Developmentally appropriate

An approach that considers the developmental stage of the young person. A developmentally appropriate measure is one that takes the age and cognitive abilities of youth into account in its structure (e.g., wording/reading level, applicability of terms/scenarios, length, etc.).


Dimension

An aspect of a construct being measured. For example, school engagement typically includes three dimensions in its measurement: behavioral, emotional, and cognitive engagement.

Effect size

A measure of the magnitude of the difference between groups, such as intervention and control groups, on an outcome measure of interest. Effect sizes are often expressed in standardized units to permit comparisons across both different outcomes and different measures of the same outcome. Cohen's d is an example of this type of index and classifies effect sizes as small (less than a 0.2 standard deviation difference between groups), medium (greater than 0.2 but less than 0.8), and large (greater than or equal to 0.8). For a more detailed description of effect size and how to calculate it, see:
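As an illustration, Cohen's d for two independent groups can be computed as the difference in group means divided by the pooled standard deviation. This is a minimal sketch using made-up scores, not a substitute for the detailed resources referenced above:

```python
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """Cohen's d: standardized mean difference between two groups."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled standard deviation across the two groups
    pooled_var = ((n_a - 1) * stdev(group_a) ** 2 +
                  (n_b - 1) * stdev(group_b) ** 2) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

# Hypothetical outcome scores for intervention vs. control youth
intervention = [14, 15, 16, 17, 18]
control = [12, 13, 14, 15, 16]
print(round(cohens_d(intervention, control), 2))  # 1.26 (a large effect)
```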

Empirical evidence; Evidence base

Current knowledge within a particular field that is based on scientifically acquired information (i.e., experiments or observation).

Environmental risk factors

A challenge in a young person’s surrounding life circumstances, such as poverty or living in a dangerous neighborhood, that is associated with an increased likelihood of later difficulties (e.g., not graduating from high school).

Forced choice

Questions that ask a respondent to select one or more responses from a fixed set of options. “Open-ended” questions, by contrast, allow respondents to answer in their own words.

Grounded in theory

Being informed by, or based on, the principles or assumptions of a particular theory. For example, a program that aims to increase helping behavior in youth by exposing them to peers exhibiting this behavior is grounded in social modeling theory.

Individual risk factors

A challenge in a young person’s own behavior, including those related to personality traits or cognitive abilities, that is associated with increased likelihood of later difficulties.


Interaction

When the strength or direction of the association between one variable and an outcome depends on the level of one or more other variables. For example, when examining gender in relation to the association between mentoring and school misbehavior, the presence of an interaction would suggest that receiving mentoring or not has a different association with school misbehavior among boys (for example, a weaker or stronger association) than it does among girls.

Intervention group; Program group; Treatment group

A group that is offered the intervention or program that is being evaluated in a study.

Likert-type response

Options provided to survey respondents that use ordered response levels from which respondents choose one option that best aligns with their view (for example, the extent to which they agree or disagree with a particular statement). These response levels are anchored with consecutive numbers and/or labels that connote fairly evenly-spaced gradations along a continuum from least to most (e.g., strongly disagree, disagree, neutral, agree, strongly agree). More information available at


Meta-analysis

A statistical technique that systematically combines results from several studies to develop a single conclusion. For example, results from several mentoring evaluations could be combined to assess the effectiveness of mentoring programs across those studies.

Nationally representative sample

A group of participants selected in a way that makes their characteristics closely match those of the population across the nation as a whole.

Normative data

Data that characterize how a characteristic, attitude, or skill is distributed in a given population to give a sense for what is usual in that population. Normative data are typically obtained from a large, randomly selected, representative sample from the wider population, and often may be provided separately for different subgroups (e.g., males and females).


Out-of-school time

The time young people spend outside of the typical school day, including, for example, time before school, after school, on weekends, or during school breaks (e.g., summer).

Psychometric properties; Psychometric evidence

How well the instrument measures what it is designed to measure (e.g., the reliability and validity of the instrument). Also see Reliability and Validity.


Psychosocial

The combination of social and psychological factors.

Public domain

Being available to the public and therefore not subject to copyright.

Randomized controlled trial; Randomized control trial; Random assignment

A study in which, or the process through which, individuals are assigned to intervention and control groups on the basis of chance so that every individual has the same probability as any other of being selected for either group. Using chance to assign people to groups helps to ensure that the groups are similar on all characteristics except for the intervention group’s opportunity to receive the intervention. Also see Intervention group and Control group.


Reinforcement

Encouraging or establishing a belief or pattern of behavior (e.g., through reward).

Reliability; Reliable

The consistency of scores on a measure over repeated use. For a more detailed discussion of reliability as it applies to youth program outcomes, see page 53 of the Forum for Youth Investment’s soft skills measures compendium.

Response format; Response choices

How the answers to a question are collected from respondents. For example, a Likert response scale is one type of response format. Also see Likert-type responses.

Reverse-scoring; Reverse direction

The process of reversing the numerical values assigned to responses on an item that reflects the absence (or opposite) of the outcome being measured. For example, when assessing Social Acceptance, a child’s response to a question asking to what extent he/she feels lonely might be reversed so that a response of greater disagreement reflects higher Social Acceptance and can then be averaged with responses to the other items in the scale. The formula for reverse-scoring an item is: new score = number of response options possible, plus 1, minus the original score. For example, if a youth chooses a response that would normally be scored a “4” on a 6-point scale and that item is reverse-scored, her new score for the item would be 6 + 1 - 4 = 3.
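The reverse-scoring formula can be expressed as a one-line function. This is a simple sketch; the example values mirror the one in the definition:

```python
def reverse_score(original, num_options):
    """New score = number of response options + 1 - original score."""
    return num_options + 1 - original

# A response scored "4" on a 6-point scale becomes 6 + 1 - 4 = 3
print(reverse_score(4, 6))  # 3

# Applying the formula twice returns the original score
print(reverse_score(reverse_score(4, 6), 6))  # 4
```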

Scale vs. subscales

A set of questions used to assess an outcome of interest. A subscale is a part of an overall scale that measures a component of the outcome of interest. For example, the Self-Control subscale is part of the larger Social-Emotional and Character Development Scale (SECDS).

School-based sample

A group of participants selected from the larger body of students attending a particular school or set of schools.


Sensitivity; Specificity

A measure’s ability to correctly identify those respondents who exhibit a state or outcome of interest (e.g., a measure of depressive symptoms being “sensitive” to detecting actual clinical depression). It is formally calculated as the number of persons correctly identified by the measure as having the state or outcome of interest (referred to as “true positives”) divided by the sum of all persons who actually have the state or outcome (i.e., true positives + false negatives). In contrast, a measure’s ability to correctly identify those respondents who do not exhibit a state or outcome of interest is referred to as its “specificity.” Specificity is formally calculated as the number of persons correctly identified by the measure as not having the state or outcome of interest (“true negatives”) divided by the sum of all persons for whom the state or outcome is in fact not present (i.e., true negatives + false positives).
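The two formulas can be sketched directly from counts of true/false positives and negatives. The counts below are hypothetical screening results used purely for illustration:

```python
def sensitivity(true_pos, false_neg):
    """Proportion of actual cases the measure correctly flags."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Proportion of actual non-cases the measure correctly clears."""
    return true_neg / (true_neg + false_pos)

# Hypothetical results: 80 true positives, 20 false negatives,
# 90 true negatives, 10 false positives
print(sensitivity(80, 20))  # 0.8
print(specificity(90, 10))  # 0.9
```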

Standard deviation

A measure of the extent to which the scores for the members of a group vary from the group’s mean (i.e., average). For example, if a T-score of 50 represents a score equal to the average score for youth in the normative sample, a score of 60 would indicate a score that is one standard deviation higher than this average and thus an elevated level on the outcome measure compared to the typical youth in the normative sample.
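As an illustration, the sample standard deviation can be computed from a set of scores as follows. The scores here are hypothetical:

```python
from statistics import mean

def std_dev(scores):
    """Sample standard deviation: spread of scores around the group mean."""
    m = mean(scores)
    return (sum((s - m) ** 2 for s in scores) / (len(scores) - 1)) ** 0.5

# Hypothetical outcome scores for a group of youth (mean = 50)
scores = [40, 45, 50, 55, 60]
print(std_dev(scores))
```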

Social modeling

The idea that we learn by observing others’ behavior.

Summing vs. averaging scores

When an outcome measure uses a Likert or Likert-type response set, you can compute a score to represent the youth’s overall response on the scale by either summing or averaging the values across each item in the scale. In this toolkit, the scoring method used by the developers of the scale is provided. For purposes of most statistical analyses (for example, looking at differences in an outcome between those in a mentored group and those in a comparison group), findings and conclusions will not be affected by whether summing or averaging is used in scoring the outcome measure. Also see Likert-type responses.
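A short sketch shows why the choice rarely matters: for a fixed number of items, summed and averaged scores differ only by a constant factor, so they order respondents identically. The responses below are hypothetical:

```python
responses = [3, 4, 2, 5]  # hypothetical item scores on a 1-5 Likert-type scale

total = sum(responses)                      # summed score
average = sum(responses) / len(responses)   # averaged score

# The summed score is simply the averaged score times the number of items,
# so group comparisons reach the same conclusions with either method
print(total, average)  # 14 3.5
```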


T-score

A measure of how far a given score is (in standard deviations) from a group average or mean. T-scores are constructed so that they have a standard deviation of 10 and an average of 50. For example, a T-score of 60 for a respondent would indicate that his or her score on a measure is one standard deviation higher than the average score of a reference group (for example, youth in the sample that was used to develop norms for the measure).
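The conversion described here amounts to T = 50 + 10 × (raw score − normative mean) / normative SD. A minimal sketch, with hypothetical normative values:

```python
def t_score(raw, norm_mean, norm_sd):
    """Convert a raw score to a T-score (mean 50, standard deviation 10)."""
    return 50 + 10 * (raw - norm_mean) / norm_sd

# A raw score one SD above a hypothetical normative mean of 25 (SD = 5)
# maps to a T-score of 60
print(t_score(30, 25, 5))  # 60.0
```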


Validation

The process of assessing and confirming the reliability and validity of a measure.


Validity

The extent to which a measure actually measures what it is intended to measure. For a more detailed discussion of validity as it applies to youth program outcomes, see page 55 of the Forum for Youth Investment’s soft skills measures compendium.
