Collecting Data


Summary based on

Ravid, R.  (1994). Practical Statistics for Educators

Gall, M., Borg, W. & Gall, J.  (1996).  Educational Research: An Introduction (6th Edition).

Popham, J.  (1993).  Educational Evaluation.


Research Ethics & Collecting Data from Human Subjects


Guidelines for legal and ethical standards in the design, conduct, and reporting of research come from three sources:  government agencies have issued regulations, professional associations of researchers have developed statements, and universities and research institutes have established institutional review boards.


Governmental Regulation

Began after World War II in response to research conducted during the war.  Nuremberg Code provided a statement of human rights to understand and freely choose whether to participate in research.

U.S. federal regulations began in the 1960s and expanded in the 1970s. Updated in 1991 (Federal Policy for the Protection of Human Subjects: Notices and Rules).  Requires any institution that conducts research funded by U.S. government agencies to establish an Institutional Review Board (IRB). Government takes the position that institutions should review research and protect the rights of subjects regardless of funding source.

Some research is exempt from review because of minimal risks - - including research involving the use of educational tests, survey procedures or observations of public behavior.  Exemptions do not apply to research with children except when observing public behavior when the investigator is not participating in the activities being observed.


Professional Associations

AERA Ethical Standards - -  most relevant to educational research.  45 standards organized under six topics. 

American Psychological Association has an Ethics Code which includes a general category of ethical standards and standards for evaluation, assessment, or intervention; advertising or other public statements; therapy; privacy and confidentiality; teaching, training, supervision, research, and publishing; forensic activities; and resolution of ethical issues.


Institutional Review Boards

Established locally; procedures and regulations may vary.  Generally the researcher must complete a form or protocol with attached instruments, letters of consent, and procedures for review by the IRB.  General criteria include: clear research design, fair selection of subjects, subjects informed of how they were selected, voluntary nature specified, informed consent procedures, adequate protection of privacy/confidentiality, potential risks identified and mitigated, benefits outweigh the risks.


Procedures to protect human subjects from risks

Selection - subjects must be selected fairly (equitably)

Informed Consent - individuals must be informed about the research and give their consent. (Minors give assent and parents/guardians must give consent for their participation).  Schools/districts must also provide consent for research carried out in the schools. Subjects should be informed of: purpose of the research, description of procedures to be used (time required, etc.), a description of what will be done with the information, explanation of how they were selected, the voluntary nature of participation, confidentiality of data, potential risks, and their right to withdraw without penalty. Written letters must be in a language and at a level understandable to the subject and should contain information about who to contact with questions.


Right to Privacy

Subjects have a right to privacy and confidentiality.  They should be told who has access to the data. Every effort should be made to prohibit unauthorized access to the data; a good rule - minimize the number of individuals who know the identity of the participants.  In general, research data do not have privileged status.  Confidentiality should be maintained in publications/ presentations (do not use the names of individuals, locations, etc.).  There are some situations when the participants may want to be identified.    Ways to enhance confidentiality: ask for anonymous information, use third parties to select sample and collect data, use a detachable identifier, have subjects make up a code when matched data is required, dispose of sensitive data after study is completed.


Right to Informed Consent and Protection from Harm

May consider risk-benefit ratio - - how much risk is the participant exposed to and how much good will likely result from the study.  This comparison is subjective. Risks to subjects may be physical, emotional, psychological, or legal.  In some cases, deception is needed in order to gather accurate information (deception creates a false impression in the minds of the participants by withholding information, telling lies or using accomplices). Some researchers are opposed to using deception in any research because they deem it morally wrong, because it is not practical if one is interested in participant constructions, or because it has been too widely used among certain populations (college students).  If deception is used, there must be two forms of debriefing - - dehoaxing (convincing subjects that they were deceived) and desensitization (removing undesirable effects by suggesting that their behavior resulted from the circumstances of the experiment and not from a defect in their personality/character, or that their behavior is not abnormal or unusual; video aggression study example). Some argue that this promotes unethical behavior.


Ethical and Legal Issues

Planning and Designing Research

Researcher qualifications - - competence, perspective and character of researcher

Vulnerable populations - - dissimilarity with researcher may expose individuals to risk because of researcher’s lack of knowledge.  Good approach to have members of the group help design the study.

Conflict of interest - - choice of data collection instrument or intervention has financial implications for the researcher.

Neglecting important topics - - generalizing findings too far, ignoring ethnic differences, ignoring areas where research on the topic is relevant and needed

Conducting Research

Control group procedures - - subjects receive differential treatment

Termination of treatment condition - - research schedule may not coincide with participant needs

Use of tests - - test-anxiety, issues of self-disclosure, computer-based testing and issues of privacy.

Reporting Research

Partial and dual publication (salami science)

Plagiarism and paraphragiarism



Human Relations

Locating a Research Site - - advantages and disadvantages of using your own institution (pro: easier to get approval, familiarity with context.  con: limited view, bias, personal relationships)

Securing permission and cooperation - - clarity in describing research, concerns about costs and inconvenience, concerns about use of the information (negative reflection), issues of administrative hierarchy, informing of key groups/stakeholders (parents and community). Three areas of potential concern: conceptual soundness of the research (what’s the question, is it important, and is this likely to give us an answer), feasibility (how will data be collected, will it be disruptive, how much time is needed), and ethical concerns (does it place people at risk, what are the responsibilities of those participating, how will informed consent be obtained).

Building institutional relationships - - accessibility to site personnel, keeping them informed, developing warm and personal relationships, using expertise to address their needs.

Dealing with human relations problems - - build in additional time to deal with problems, public relations efforts.


Selecting Data Collection Tools (Types of Instruments/Measures)


Standardized versus locally constructed tests. 

Standardized - developed by commercial publishers or government agencies, used in a large number of sites, items generally well written, standard conditions of administration and scoring, tables of norms are provided. Disadvantages include guessing, response sets, random or careless answers, restricted time limits, and failure to reflect unique experiences of individuals.  Scores on standardized tests correlate highly with socioeconomic status but minimally with indicators of instruction. Claims of bias against certain groups.

Locally constructed tests are generally inadequate for research purposes.  Developers lack training in test construction. 



Test - any structured performance situation that can be analyzed to yield numerical scores from which inferences can be made.

Test bias - a test consistently and unfairly discriminates against a group of people (gender-biased, culturally-biased).

Criteria for test selection

Objectivity - - ruling out possibility of personal errors in measurement due to subjectivity in administration and scoring

Standard conditions of administration and scoring - - manual specifying standards for administration (time allowed, instructions repeated, how to answer questions, etc.) and scoring procedures.  Tests with consistency in administration and scoring are called standardized tests.

Normative data - - tests are interpreted relative to something - - norm-referenced compared to performance of a defined group and criterion-referenced relative to a performance standard.

Validity and reliability - - good tests have reliable scores from which one can make valid inferences


Norm Referenced Tests

Allow comparison of individual performance with performance of a norm-group (a group of similar persons who have taken the test previously).

Assumes normal distribution.

Norm groups must be large and represent the characteristics of potential test takers. 

Test creators should report the demographic characteristics of the norm group and identify when the norming was done.

Norm-referenced tests typically report data in terms of percentile ranks, stanines, and grade equivalents.

Percentile rank - percentage of people who scored below that score.

Percentile bands - provide an estimated range of the true percentile rank since tests always contain some measurement error

Stanines - “Standard nine” 9 point scale with a mean of 5 and a standard deviation of 2, converts percentile ranks into larger units based on normal curve with 23% in stanines 1-3, 23% in stanines 7-9, and 54% in stanines 4-6.

Grade equivalents - convert raw scores to grade level norms. Often misunderstood and misinterpreted.
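
The percentile rank and stanine definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a publisher's scoring procedure: the function names and the norm group are hypothetical, and the cut points assume the conventional 4-7-12-17-20-17-12-7-4 percent split across the nine stanines (which yields the 23% / 54% / 23% grouping given above). Published tests may treat ties and boundary scores differently.

```python
def percentile_rank(score, norm_scores):
    """Percentage of norm-group scores that fall below `score`."""
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)


# Upper percentile-rank bounds for stanines 1-8 (assumed convention);
# any percentile rank at or above the last bound falls in stanine 9.
STANINE_UPPER_BOUNDS = [4, 11, 23, 40, 60, 77, 89, 96]


def stanine(pr):
    """Convert a percentile rank (0-100) to a stanine (1-9)."""
    for level, upper in enumerate(STANINE_UPPER_BOUNDS, start=1):
        if pr < upper:
            return level
    return 9


# Hypothetical norm group: raw scores 1 through 100, one person each.
norm_group = list(range(1, 101))
pr = percentile_rank(50, norm_group)   # 49 of 100 scores are below 50
print(pr, stanine(pr))
```

A raw score of 50 in this made-up norm group has 49 scores below it, which places it in the middle stanine.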

Test items for norm-referenced tests are written to maximize differences (variability) - - some easy items, some difficult items, and the majority in the average difficulty range designed to be answered correctly by 30-80%.

Major problem is test score pollution - - over time performance of test takers may increase/decrease for various reasons making the norms meaningless (Lake Wobegon phenomenon).

Problems:  cultural bias, mismatches between testing and teaching.

Limited usefulness for evaluation because of (1) curricular incongruence, (2) descriptive vagueness (generality), (3) systematic elimination of items on which most students succeed.


Criterion-Referenced Tests

Allow comparison of individual performance to pre-specified criteria.

Also called domain-referenced or content-referenced.

Criteria must be specific, clear, based on skills or objectives.

Looks at learners’ level of performance and specific deficiencies or used for making absolute decisions (as in mastery learning).

Task is to measure the extent to which criteria have been met, either through pencil-paper or performance-based tests.

Typically reported in terms of percent correct or mastery/nonmastery (pass/no pass).

Two types: Domain-referenced involves selection of items from pool that is representative.  Objectives-referenced looks at items measuring attainment of specific objectives.


Individual-referenced measurement

compares individual’s performance at one point with their performance at another point. Used to track changes in performance.


Types of Tests

Performance tests


intelligence - - general intellectual ability - - (IQ tests, Otis-Lennon, Stanford-Binet, Wechsler), construct validity is most important. Does not measure innate ability.  Represents one type of intelligence acquired in a specific cultural context. Have low to modest validity for predicting achievement in school.


aptitude - - predicting future performance in specific area, (Differential Aptitude Test, Metropolitan Readiness, Measure of Musical Talent, SAT), predictive validity is most important.


achievement - - measure knowledge of specific facts, understanding, or problem-solving ability (Comprehensive Test of Basic Skills, Stanford Achievement Test), content validity is most important.  Achievement testing issues with “teaching to the test” and problems of “test ceiling.”


diagnostic measures - - used to identify specific strengths and weaknesses, usually focusing on low-end of spectrum. Subscores often have low reliability. (Diagnostic Math Inventory), content validity most important.


Performance assessment (authentic assessment, alternative assessment) - - examining performance on tasks that represent complex, complete, real-life tasks.  One type is portfolio which includes samples of student work in the content domain incorporating criteria for inclusion/judgment (rubrics) and personal reflections.


Behavior Observations

observe public behavior. Use of hidden recording instruments. Use of accomplices.



Personality measures


personality inventories - - assess variety of personality traits, typically pencil-and paper, depend on truthfulness, problems with response set  (social desirability - represent self in positive light, acquiescence - tend to agree with items, deviance - tend to respond in atypical way)

(Adjective Checklist, California Psychological Inventory)


Projective Technique - - amorphous stimuli and freedom of response, less subject to faking than self-report inventories. (Thematic Apperception Test, Rorschach Test)


Self-Concept - - set of cognitions and feelings of the individual (Piers-Harris self-concept Scale, Tennessee Scale)


Measures of Learning Styles and Habits - -  measure approaches to learning tasks (Learning Style Inventory, Learning and Study Strategies Inventory)


Attitude Scales - - measures viewpoint or disposition and have three components: affective, cognitive, and behavioral. Measured using Likert, Thurstone, and semantic-differential scales. (Teacher Attitude Inventory)


Vocational Interest - - measure degree of interest in or preferences for various activities, etc. (Kuder Occupational Interest Survey, Strong-Campbell Interest Inventory)



Assessing Affect - - creating a valid measure is difficult.

Generating items:  imagine an individual who possesses the desired affective attitude and one who does not, generate potential behavior-differentiating situations, select practicable and valid situations.

Assumes honesty of responses.  Cultural forces create dispositions to answer in socially acceptable ways.  We are measuring humans, who are unlikely to report absolute truth.

Increasing honesty of responses:  reducing cues that trigger “expected” responses. Use “most people” instead of “you feel”. Anonymity. Partially concealing the purpose.


Measuring Satisfaction

Five rules for developing satisfaction instruments: (1) Clarity - - clear writing, no ambiguous phrasing or complex language, careful editing. (2) decision focus - -focused on program relevant decisions, information is useful. Three areas typically covered: content, instructional activities, logistical arrangements. (3) brevity - - inverse relationship between number of items and quality of responses. (4) anonymity - careful with use of demographic items, separation of handwritten responses. (5) suggestions - provide opportunity for additional suggestions, favorite/least favorite.


Locating Tests AND Instruments - - best place on the web

Mental Measurements Yearbook (Buros)

Tests in Print

Questions to Ask:  Is evidence of validity/reliability given? Is reading level appropriate? Can test be administered within time constraints? Is it at an appropriate level of difficulty? Do the norms come from a similar population?


Validity of Measures


An instrument is valid when it measures what it is supposed to measure.  Appropriateness and usefulness of specific inferences and interpretations made.  A test is not “valid”, rather it is valid for an intended use.  Validity is not inherent in the instrument; an instrument is considered valid for a particular purpose for a particular group only. Validity is relative to the purpose of the testing.  Validity is a matter of degree - - how valid is it, not whether it is valid.

A valid test is assumed to be reliable, but a reliable test may not be valid for a specific purpose.

Types of Validity





Content Validity

expert judgments of the appropriateness of the content

adequacy with which an instrument measures a representative sample of behaviors and content domain about which inferences are to be made.  Must be a representative sample of the content domain to be valid.

Most important for achievement tests.

Items are examined and compared to the content to be covered.

Well-defined content domain and behaviors increase test content validity.

For instructional use, tests should cover items that correspond to what was covered in class or the test is not valid or has low validity.

Standardized tests may have high content validity for some schools and low validity for others because of differences in curricula. For example, tests that ask for facts are probably not valid tests to measure the content domain in constructivist classes focused on problem-solving.

For content validity, a broad sample of content is better than a narrow one, important material should be emphasized, and questions should be written to measure the appropriate skills covered in instruction.

Most closely associated with achievement testing, but also useful in measuring constructs (such as self-concept).


Face Validity

judgments based on superficial appearance using casual, subjective inspection

Extent an instrument appears to measure what it was intended to measure. Affects the acceptability of a test. Usually based on a superficial inspection of the test.  Face validity is no guarantee that the test measures what it is supposed to measure.



Criterion-Related Validity

Extent performance on instrument is related to performance on another measure (criterion measure).


Concurrent validity

 how well test correlates with another instrument measuring the same thing (criterion).  The two measures are administered to the same group at the same time and the scores are correlated.


Predictive validity

how well a test predicts some future performance (especially useful for aptitude and readiness tests).  The test to be validated is the predictor and the future performance is the criterion. Data are collected for the same group on both variables and the scores are correlated; the resulting correlation (called a validity coefficient) indicates the extent of the instrument's predictive validity. The predictor measure is collected at the present time and the criterion measure is administered later.
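
As a rough illustration of a validity coefficient, the sketch below correlates predictor scores with later criterion scores using the Pearson product-moment correlation. The data and variable names are made up for the example, and Pearson's r is assumed here as the correlation index because it is the usual choice for interval-level scores; the notes do not specify one.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical data for the same five students:
# readiness test given now (predictor), course grade a year later (criterion).
predictor = [12, 15, 18, 20, 25]
criterion = [60, 65, 72, 70, 88]

validity_coefficient = pearson_r(predictor, criterion)
print(round(validity_coefficient, 2))
```

The same calculation applies to concurrent validity; the only difference is that both measures are collected at the same time.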




Construct Validity

Constructs are characteristics that cannot be measured directly (sociability, aggression, honesty, depression, introversion, self-concept, etc.).  Construct validity is the extent to which a test measures and provides accurate information about a theoretical trait or characteristic.  The test is given to a group of people and other data are also collected on the group.  Theoretical predictions are made about the other information based on how people score on the test.  Establishing construct validity consists of accumulating supporting evidence (more than once, with many samples, and multiple data sources). Begin with a hypothesis about how people who possess different degrees of the construct might behave, then test the hypothesis.  Most common methods of obtaining construct validity evidence are (1) comparisons of scores before and after a particular treatment, (2) comparisons of scores of known groups, (3) correlations with other tests assumed to be valid measures of the same variable.

Most closely associated with validity of personality scales but useful for all types of measures.




Consequential validity

Assumes test scores, the theory and beliefs behind the constructs, and the language used to label the construct embody values and have value-laden consequences when used to make decisions about individuals. Must check values and consequences to determine if inferences and the ways scores are used are valid for a particular use.  Examples: intelligence tests measure abilities required to do well in school (not innate ability) and imply that school performance is important.  The scores can be used appropriately or inappropriately and their use has consequences.  With intelligence tests, the intended effect is to identify and advance gifted students without regard to social class or geographic location; the unintended effect is to create a large difference in the percentage of whites who are promoted to certain environments.  One of the arguments for performance/authentic assessment is that they reflect values for instruction and learning not found in traditional standardized tests.



Reliability of Measures


Provides consistent and accurate results.

Measures of physical traits are more reliable than measures of affective or cognitive traits.  Other factors are more likely to influence measures of affect and cognition (mood, pressure, fatigue, anxiety, etc.)

Reliability reported as “r” (same as correlation) and can range from 0 to 1.00 (although measures of human behavior will never quite reach 1.00).

Theory:  Observed scores include two components - true score and error score.  True scores are not directly observable.  Assumes that errors of measurement are random.
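
The true score model can be made concrete with a small numeric sketch. In classical test theory, when errors are random (uncorrelated with true scores), observed score variance is the sum of true score variance and error variance, and reliability is the proportion of observed variance that is true variance. The scores below are invented, and the errors were deliberately chosen to be exactly uncorrelated with the true scores so the arithmetic comes out clean; in real testing, true scores and errors are never observed separately.

```python
from statistics import pvariance

# Hypothetical true scores and measurement errors for four examinees.
# The errors average zero and are uncorrelated with the true scores.
true_scores = [10, 20, 30, 40]
errors = [2, -2, -2, 2]

# Observed score = true score + error score
observed = [t + e for t, e in zip(true_scores, errors)]

var_true = pvariance(true_scores)      # 125
var_error = pvariance(errors)          # 4
var_observed = pvariance(observed)     # 129 = 125 + 4

# Reliability as the share of observed variance that is true variance
reliability = var_true / var_observed
print(observed, round(reliability, 3))
```

This is why shrinking error variance, or increasing the spread of true scores in the group tested, raises the reliability estimate.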

Error variance can be decreased by: writing good test items, including clear instructions, providing optimum environments for test-taking. 

Observed score variance can be increased by using heterogeneous groups and by using longer tests.

Factors that might cause measurement error: items on the test are not equivalent in how they sample the construct domain; tests are not administered consistently; inconsistent scoring procedures; poor testing conditions; individual variability that causes atypical performance.



Methods of assessing reliability



Test-Retest

(Coefficient of stability)

Administer the same test twice to the same group of people and correlate the scores.

Problems: time interval (too short may cause higher reliability as they remember the items, too long and other factors may influence results); need for testing twice increases costs and is time-consuming.

Recommended interval is less than 6 months.


Alternate/Equivalent/Parallel Forms

(Coefficient of equivalence)

Administer two forms of the test to the same group and correlate their scores.

The two forms must be equivalent in terms of statistical properties (means, variances, item intercorrelations), content coverage, and types of items used.

Problems: Students have to be tested twice which increases costs and is time-consuming; difficult and impractical to develop alternate forms.

Alternate forms are useful for security reasons and for use in pre-test/post-test studies to eliminate effect of previous exposure to the test.


Measures of Internal Consistency

Uses scores from a single testing session to estimate reliability. Based on the assumptions that the test measures a single concept, that items should correlate with one another, and that those who answer one item correctly are likely to answer similar items correctly.


Split half method

(Coefficient of internal consistency) - test is split into two halves and scores on the halves are correlated (the halves are treated as alternate forms). The two halves should be comparable in terms of content coverage and item difficulty, and the split must take into account students’ level of fatigue and practice.

The half-test correlation underestimates reliability because longer tests are more reliable, so the Spearman-Brown prophecy formula is used to estimate reliability for the entire test.
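
A minimal sketch of the split-half procedure follows. It assumes an odd/even item split (one common way to balance fatigue and practice across halves) and uses the standard Spearman-Brown formula, r_full = k * r / (1 + (k - 1) * r), with k = 2 for doubling the half-test length. The response matrix and function names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


def spearman_brown(r, factor=2):
    """Projected reliability when test length is multiplied by `factor`."""
    return factor * r / (1 + (factor - 1) * r)


# Hypothetical item responses: rows are examinees, columns are items (1 = correct).
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

# Score the odd-numbered and even-numbered items separately.
odd_half = [sum(row[0::2]) for row in responses]
even_half = [sum(row[1::2]) for row in responses]

r_half = pearson_r(odd_half, even_half)
r_full = spearman_brown(r_half)
print(round(r_half, 2), round(r_full, 2))
```

The stepped-up estimate is always higher than the half-test correlation when that correlation is positive, which matches the point that longer tests are more reliable.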


Kuder-Richardson (KR-20 and KR-21)

(method of rational equivalence) used with dichotomous items (yes/no, correct/incorrect) and measures intercorrelation among test items. 
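
For readers who want to see the arithmetic, here is a sketch of KR-20 using its standard form: (k / (k - 1)) * (1 - sum(p * q) / variance of total scores), where k is the number of items, p is the proportion answering an item correctly, and q = 1 - p. The data are hypothetical, and the population variance is assumed for the total scores so that it is on the same footing as the p * q item variances.

```python
from statistics import pvariance


def kr20(responses):
    """KR-20 for a matrix of 0/1 item scores (rows = examinees)."""
    n_people = len(responses)
    n_items = len(responses[0])

    # Variance of examinees' total scores
    total_variance = pvariance([sum(row) for row in responses])

    # Sum of item variances, p * q, for dichotomous items
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in responses) / n_people
        sum_pq += p * (1 - p)

    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)


# Hypothetical responses: 5 examinees by 6 items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
print(round(kr20(responses), 3))
```

KR-21 is a simpler approximation that assumes all items are equally difficult; it is not shown here.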


Coefficient Alpha (Cronbach Alpha)

similar to KR-20 but used when scoring is not dichotomous (e.g., Likert scales). (Available in SPSS and widely used in education.)
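
Outside of SPSS, coefficient alpha is straightforward to compute by hand. The sketch below uses the standard formula, (k / (k - 1)) * (1 - sum of item variances / variance of total scores). The Likert responses and the function name are hypothetical, and sample variances (n - 1 in the denominator) are assumed throughout; the ratio is the same as long as one convention is used for both item and total variances.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Coefficient alpha for a matrix of item scores (rows = respondents)."""
    n_items = len(responses[0])

    # Variance of each item across respondents
    item_variances = [
        variance([row[i] for row in responses]) for i in range(n_items)
    ]

    # Variance of respondents' total scores
    total_variance = variance([sum(row) for row in responses])

    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical 5-point Likert responses: 5 respondents by 4 items.
likert = [
    [4, 5, 4, 5],
    [3, 3, 4, 3],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]
alpha = cronbach_alpha(likert)
print(round(alpha, 2))
```

With so few respondents the estimate is only illustrative; the very high value reflects how consistently these made-up respondents answer across items.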


Inter-Scorer, Inter-tester  or Inter-Rater reliability

Used when the measure is not an objective instrument and involves some subjectivity and judgment of the raters/scorers (e.g., essay tests, writing samples, behavior ratings, etc.).  Compares the scores of two or more raters and computes either a percentage of agreement or a correlation coefficient. Good scoring guidelines or rubrics increase inter-rater reliability. Errors are more likely if testers are not well trained or the test is individually administered and requires interaction between tester and subject.
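
The percentage-of-agreement index mentioned above is simple to compute. This sketch counts only exact matches between two raters; the ratings and function name are hypothetical, and some studies also count adjacent scores as agreement, which is not done here.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of cases on which two raters gave exactly the same score."""
    if len(rater_a) != len(rater_b):
        raise ValueError("Both raters must score the same set of cases")
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return 100.0 * matches / len(rater_a)


# Hypothetical rubric scores (1-5) given by two raters to the same 8 essays.
rater_a = [4, 3, 5, 2, 4, 1, 3, 5]
rater_b = [4, 3, 4, 2, 4, 2, 3, 5]

print(percent_agreement(rater_a, rater_b))  # 6 of 8 essays match
```

Percent agreement does not correct for agreement that would occur by chance, so it tends to look more favorable when the rating scale has few categories.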


Generalizability Theory

Analysis of variance is used to analyze the data in order to assess the effect of each source of measurement error and their interactions. A generalizability coefficient can be calculated reflecting the combined measurement error due to all sources investigated. Not in common use.


Standard Error of Measurement

Allows determination of the probable range within which the true score falls. The more reliable the test, the less measurement error and the narrower the estimate band (range) in which the true score lies.  Lower reliability results in higher standard errors of measurement.
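
The relationship between reliability and the standard error of measurement can be shown with the standard formula SEM = SD * sqrt(1 - reliability). The sketch below is illustrative only: the score scale (standard deviation of 15) and reliability values are hypothetical, and the band of plus or minus one SEM is assumed to cover the true score roughly 68% of the time under normally distributed errors.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """SEM from the test's standard deviation and reliability coefficient."""
    return sd * sqrt(1 - reliability)


def score_band(observed, sd, reliability, n_sems=1):
    """Range of observed score plus or minus `n_sems` standard errors."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - n_sems * sem, observed + n_sems * sem


# Hypothetical test with SD = 15, compared at two reliability levels.
sem_high = standard_error_of_measurement(15, 0.91)   # about 4.5
sem_low = standard_error_of_measurement(15, 0.75)    # 7.5

low, high = score_band(100, 15, 0.91)
print(round(sem_high, 1), round(sem_low, 1), round(low, 1), round(high, 1))
```

Dropping reliability from .91 to .75 widens the band around an observed score of 100 from about 95.5 to 104.5 out to 92.5 to 107.5.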


Factors that Affect Reliability



Variability of Scores

Higher variability of scores results in higher reliability estimates (remember correlation rules and the effect of range restriction?).  Pay attention to the group the developers used to estimate reliability.  If their estimates were based on scores of a group of high school students (grades 9-12) and you plan to use the test with only 9th graders, the reliability of the test for your group would likely be lower than the one reported.


Test Length

The longer the test, in general, the more reliable. The effect of guessing is reduced. This means the reliabilities of subsections are generally lower than the reliability of the entire test.


Difficulty of Items

Tests that are too easy or too difficult reduce variability of scores and thus reduce reliability.


Quality of Items

Better quality items (item clarity, reduced ambiguity, clear instructions, readability, standardized administration and scoring) increase reliability.


How High Should Reliability Be?

Teacher made tests have lower reliabilities.

Tests of affective domain have lower reliability than tests of cognitive domain.

Subtests have lower reliability than full tests.

Higher reliabilities are always preferable.

For exploratory research, reliabilities as low as .50 are generally acceptable.

For group decisions .60 and higher is generally acceptable. For individual decisions, generally a level of .90 or higher is preferred.