Classical Test Theory
By Muhammad Tahir
Executive Summary
Educational assessment has become a central part of improving teaching and learning as well as of standardized testing. The results of an assessment may provide information about the extent to which the assessment affects teaching outcomes and about for whom the assessment is used. Conducting assessment means that we assess a variety of objects, such as students, teachers, and the test itself. A test is also examined for its implications and social consequences in shaping a standard curriculum, standard tests, and standard measurement, so that all educational goals can be achieved. Designing and applying testing for educational purposes is a challenging task, and anyone applying testing in the educational field should be familiar with classical test theory, item response theory, and item discrimination. Besides this, the tester should also pay close attention to Cronbach's alpha. "TIMSS 2007 uses Item Response Theory (IRT) scaling to summarize students' achievement on the assessment and to provide accurate measures of trends from previous assessments" (Foy & Olson, 2009).
The purpose of this paper is to explore classical test theory, which has been the keystone of measurement and testing for several decades. It also discusses some ideas regarding item discrimination and item response theory. More specifically, it tries to establish the validity and reliability of a test using TIMSS 2007 data as secondary data. TIMSS helps us understand the contexts in which students learn best: it enables international comparisons among the key policy variables in curriculum, instruction, and resources that result in higher levels of student achievement (Mullis et al., 2005).
TIMSS 2007 consists of science and mathematics items and has two kinds of items, multiple-choice and constructed-response. A multiple-choice item is scored 1 while a constructed-response item is scored 2 (Mullis et al., 2005).
Introduction
Classical test theory is a theory about test scores that introduces three concepts: the observed score, the true score, and the error score (Jones & Hambleton, 1993). Classical test theory has a simple formula, X = T + E, where the observable test score X is the sum of two unobservable (or latent) variables, the true score (T) and the error score (E) (Jones & Hambleton, 1993, p. 255). The true score of a person can be found by taking the mean score that the person would get on the same test if they had an infinite number of testing sessions.
According to Jones and Hambleton (1993, p. 256), "the assumptions in classical test theory are that true scores and error scores are uncorrelated, the average error score in the population of examinees is zero, and error scores on parallel tests are uncorrelated." In addition, the true score can be derived from the error score and the total score on a test (T = X - E).
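To make the decomposition concrete, the short sketch below (not from the original paper; all values are hypothetical) simulates observed scores as X = T + E for a population of examinees and checks the two assumptions just quoted: the average error is zero and true scores are uncorrelated with error scores.

```python
import numpy as np

# Illustrative simulation of X = T + E with hypothetical values.
rng = np.random.default_rng(0)

n_examinees = 10_000
true_scores = rng.normal(loc=20.0, scale=4.0, size=n_examinees)   # T (latent)
errors = rng.normal(loc=0.0, scale=3.0, size=n_examinees)         # E (latent)
observed = true_scores + errors                                    # X = T + E

print(f"mean(E)    ~ {errors.mean():.3f}   (assumed to be 0)")
print(f"corr(T, E) ~ {np.corrcoef(true_scores, errors)[0, 1]:.3f}   (assumed to be 0)")
print(f"var(X) = var(T) + var(E): {observed.var():.1f} ~ "
      f"{true_scores.var() + errors.var():.1f}")
```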
Item response theory (IRT) is a general statistical theory about examinee item and test performance and about how performance relates to the abilities that are measured by the items in the test. Basically, an IRT model has two fundamental assumptions: one refers to the structure of the test data, and the other relates to the mathematical form of the item characteristic function, or curve (Jones & Hambleton, 1993).
It is clear that item difficulty and item discrimination may have detrimental effects on decision makers, curriculum developers, and stakeholders, so a test should be re-examined before we implement it. Indonesia, for example, has begun standardizing its national examination, in which only five subjects are covered. This national examination is administered annually by a national committee responsible for designing the test, and each student should achieve a score of 5.0 to pass. Unfortunately, in 19 senior high schools around the country not a single student passed (a 100% failure rate) (www.kompas.com). This means that the test is probably too difficult for the students and needs to be revised for the next administration. Furthermore, a test maker should also attend to students' needs and to the curriculum underpinning the national examination.
In this regard, Cronbach (in Moss, 1998, p. 115) suggested "using stakeholders' interests as well as evaluators' concerns to generate a list of potential questions and then prioritizing the questions based on: prior uncertainty about the questions; information to be yielded by a feasible study compared with how much uncertainty will remain; cost of the investigation in terms of time and dollars; and leverage for achieving consensus about the use of the test in the relevant audience."
This paper uses the TIMSS 2007 assessment as secondary data. Unlike PISA, the TIMSS 2007 assessment covered only two subjects, mathematics and science, and it has 353 items for the fourth grade plus 429 items for the eighth grade (Olson et al., 2008). Australia Booklet 1 is chosen as the data set for the analysis, with the fourth grade as the sample of the study. The study mainly focuses on the mathematics subject, with 29 multiple-choice items.
Validity of the Test
Validity has an essential role in educational assessment. It can be established by using techniques along with theories and evidence to support and analyze the results of measurement so that they can be used for a specific purpose (AERA et al., 1999, p. 17). In this regard, McDonald (2007, p. 20) contended that validity refers to the appropriateness of the interpretation of test scores, not to the measures or the tests themselves. In other words, the validity of a test should be supported by evidence on which we can base our interpretation of the test scores. Therefore a test can be valid if it is used to measure what the researcher really wants to measure and the measure is not used for purposes other than those of the test.
Similarly, McMillan (2001) clearly stated that a tester should have extensive experience in making judgments through test validation. Furthermore, McDonald (2007) stated that the skill of validating a test consists of gathering sufficient evidence that can lead us to make inferences regarding the test results. This evidence is very important for developing the interpretation of the test results, which in turn can help people use a particular test result to make decisions (AERA et al., 1999, p. 11). With respect to this, Nitko and Brookhart (2007) suggested that to achieve the validity of a test, a test developer should obtain more and more evidence through a series of actions, documenting, checking, weighting, and combining all relevant evidence, which can enrich the interpretation of the test results. In the same vein, Nitko and Brookhart (2007, p. 67) stated that "validity is the confidence we may have in interpreting students' assessment results and using them to make decisions."
Accordingly, with respect to validity, a test developer should check whether what they want to measure is in fact what appears to be measured. As suggested by McMillan (2007), "a test needs face validity so that it appears to be valid to the test consumer." However, if the test seems irrelevant or inconsistent with what the researcher really wants to test, it can affect not only the students' performance but also the test takers themselves. This matters because if the test is inappropriate or irrelevant to the students, it can cause poor test results (Anastasi & Urbina, 1997).
The data are derived from the TIMSS 2007 data set and consist of 29 multiple-choice items. Before analysing the file Book1.BSG_Aus, we first have to score the data dichotomously, because the variables as extracted are nominal and scale data and need to be converted. All responses are recoded into two categories: 1 for a correct answer and 0 for an incorrect answer. In the original data a correct answer already has the value 1 and an incorrect answer the value 0; however, there are also missing-data codes of one, two, and three digits (99, 999, 96, 9, 7 and 6). These missing values are recoded to 0, since they are treated the same as an incorrect response (omitted or not reached). In other words, an item is scored 1 if it is answered correctly and 0 otherwise. After scoring the data, we analyse the mean, the P value, the discrimination index, the reliability coefficient, and the standard error of measurement, as sketched below.
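The following sketch illustrates the recoding step described above. It assumes a hypothetical CSV export of the Book1.BSG_Aus file with item columns named item01 to item29, already coded 1 = correct and 0 = incorrect, plus the missing-data codes listed above; the actual TIMSS variable names and export format will differ.

```python
import pandas as pd

# Hypothetical export of Book1.BSG_Aus with columns item01 ... item29.
# Responses are coded 1 = correct, 0 = incorrect, and the codes
# 6, 7, 9, 96, 99, and 999 mark omitted / not reached / not administered items.
data = pd.read_csv("Book1_BSG_Aus.csv")
item_cols = [c for c in data.columns if c.startswith("item")]

# Dichotomous scoring: keep 1 for a correct answer; everything else,
# including the missing-data codes, is treated as 0 (incorrect), as described above.
scored = (data[item_cols] == 1).astype(int)
scored["total"] = scored[item_cols].sum(axis=1)   # raw total score out of 29
```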
Item Analysis
According to Nitko and Brookhart (2007), the item difficulty level can be defined as the percentage of students answering the item correctly. There are two basic purposes for conducting assessment: first, to rank student performance (the best, second best, and next best students); and second, to place students into several levels of achievement (beginner, intermediate, and advanced), in which case the items should include more difficult items (Nitko & Brookhart, 2007, p. 120). Therefore it is very important to bear in mind that a purpose of item difficulty is to distinguish between higher-ability and lower-ability students. Similarly, McDonald (2007) pointed out that if an item has a P value of 1.00 (100%) it cannot contribute to the reliability of the test; a good test, on average, has P values of roughly between .30 and .80 (Kehoe, in McDonald, 2007, p. 242). The equation for the adjusted item difficulty is:
$P^{*} = \dfrac{\bar{X} - X_{\min}}{X_{\max} - X_{\min}}$

where
$P^{*}$ = adjusted item difficulty
$\bar{X}$ = weighted mean of the item
$X_{\max}$ = maximum score point available for the item
$X_{\min}$ = minimum score point available for the item
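As a continuation of the scoring sketch above, the fragment below computes the P value of each dichotomous item as the proportion of correct answers (with a 0/1 item, the adjusted formula reduces to the item mean) and flags items outside the .30-.80 range suggested by Kehoe. It reuses the `scored` data frame and `item_cols` list defined earlier.

```python
# For dichotomous items (min = 0, max = 1) the adjusted difficulty P* is simply
# the item mean, i.e. the proportion of students answering correctly.
p_values = scored[item_cols].mean()

print(p_values.round(2))
print("Too difficult (P < .30):", p_values[p_values < 0.30].index.tolist())
print("Too easy (P > .80):     ", p_values[p_values > 0.80].index.tolist())
```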
Table 1 depicts the mean P value of each item, representing the percentage of correct answers. This gives a brief description of how difficult the test is as a whole. It can also be seen from Table 1 that items 5, 10, 12, 22, 27, and 28 are not good items because they do not have the power to discriminate between high-achieving and low-achieving students. In this respect, Kehoe (1995) proposed that an item can be considered good if its P value lies between .30 and .80, and that if the P value is above about .85 the item has little discrimination power. For example, the last two items (27 and 28) have P values above .80; according to these criteria, both upper-level and lower-level students can answer these questions correctly. The other items (5, 10, 12, and 22) are categorised as difficult items and need revision because their P values are below .30.
A discrimination index (D) can be calculated for each dichotomous item. The higher the D, the more the item discriminates. Items with p levels in the midrange usually have the best D values, and the opportunity for D to be highest occurs when the p level for the item is 0.50.
The extreme group method is used to calculate D. There are three
simple steps to calculating D. First, those who have the highest and
lowest overall test scores are grouped into upper and lower groups. The upper
group is made up of the 25%–33% who are the best performers (have the highest
overall test scores), and the lower group is made up of the bottom 25%–33% who
are the poorest performers (have the lowest overall test scores).
The most appropriate percentage to use in creating these extreme groups is the top and bottom 27% of the distribution, as this is the critical ratio that separates the tails from the mean of the standard normal distribution of response error (Cureton, 1957). Step two is to examine each item and determine the p levels for the upper and lower groups, respectively. Step three is to subtract the p levels of the two groups; this provides the D. Accordingly, Nitko and Brookhart (2007, p. 324) define the item discrimination index (D) as the difference between the fraction of the upper group answering the item correctly and the fraction of the lower group answering the item correctly. The formula for D can be written as follows:
$D = p_{U} - p_{L}$, or equivalently $d^{*} = p_{73} - p_{27}$

where
$p_{27}$ = the item difficulty for the bottom 27% of the students (those at or below the 27th percentile of total scores)
$p_{73}$ = the item difficulty for the top 27% of the students (those above the 73rd percentile of total scores)
To calculate each item's difficulty at the 27% and 73% cut points of the total-score distribution, that is, $p_{27}$ and $p_{73}$, we use the SPSS Frequencies command. The resulting discrimination index of each item is presented in Table 2; a computational sketch is also given below.
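The sketch below implements the three steps of the extreme-group method on the same scored data: form upper and lower groups at the 27th and 73rd percentiles of the total score, take each item's p level in both groups, and subtract. It is an illustration, not the SPSS procedure used in the paper.

```python
# Extreme-group discrimination index D with 27% / 73% cut points on the total score.
totals = scored["total"]
lower_group = scored[totals <= totals.quantile(0.27)]   # bottom ~27% of performers
upper_group = scored[totals >= totals.quantile(0.73)]   # top ~27% of performers

# p level of each item in each group, then their difference (steps two and three).
d_index = upper_group[item_cols].mean() - lower_group[item_cols].mean()

print(d_index.round(2))
print("D below 0.2 (candidates for rejection):",
      d_index[d_index < 0.2].index.tolist())
```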
The index can range from -1.00 to +1.00. According to McDonald (2007), it is rare to obtain a discrimination index of -1.00 or +1.00 in practice. Table 2 gives a description of how well each item discriminates. It is interesting to notice that item 12 and item 22 have extremely low values (D = 0). Wu and Adams (2007, p. 64) argued that a discrimination value of 0 shows that there is no relationship between the item score and the total score. They also stated clearly that the higher the discrimination index, the better the item is able to discriminate between students according to their ability level (p. 64). In addition, one would not normally accept any item with a discrimination below 0.2, and Wu and Adams stated that it would be preferable to accept only items with discrimination indices above 0.4. Based on this criterion, items 1, 6, 10, 11, 14, 15, 17, 24, and 30 have D values below 0.4, which means these items might be discarded.
McDonald (2007) stated that the point-biserial index also describes the power of an item to discriminate, and that it is a basic indicator of item quality for multiple-choice tests. He further noted that the higher the point-biserial index, the better the item discriminates between upper-level and lower-level students (McDonald, 2007, p. 243).
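A minimal sketch of a point-biserial style index is given below, computed here as the correlation between each 0/1 item score and the total score on the remaining items (a corrected item-total correlation); this is one common way to obtain the index, not necessarily the exact procedure McDonald describes. It again reuses `scored` and `item_cols` from the earlier sketch.

```python
import numpy as np

# Corrected item-total correlation as a point-biserial style discrimination index:
# correlate each 0/1 item with the total score computed from the other items.
pbi = {}
for col in item_cols:
    rest_total = scored[item_cols].drop(columns=col).sum(axis=1)
    pbi[col] = np.corrcoef(scored[col], rest_total)[0, 1]   # NaN if the item has zero variance

for item, r in sorted(pbi.items(), key=lambda kv: kv[1]):
    print(f"{item}: point-biserial = {r:.2f}")
```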
Internal Consistency Reliability Estimates
One of the most popular reliability statistics in use today is Cronbach's alpha. Cronbach's alpha reflects the internal consistency, or average correlation, of the items in an instrument and is used to gauge its reliability (Moss, 1998). When a variable generated from such a set of questions returns a stable response, the variable is said to be reliable. Cronbach's alpha is an index of reliability associated with the variation accounted for by the true score of the "underlying construct," the construct being the hypothetical variable that is being measured (Hatcher, 1994).
The alpha coefficient ranges in value from 0 to 1 and may be used to describe the reliability of factors extracted from dichotomous questions (that is, questions with two possible answers) and/or multi-point formatted questionnaires or scales (e.g., a rating scale: 1 = poor, 5 = excellent). The higher the score, the more reliable the generated scale is. Nunnally (in McDonald, 2007) indicated 0.7 to be an acceptable reliability coefficient, although lower thresholds are sometimes used in the literature. One good method of screening for efficient items is to run an exploratory factor analysis on all the items contained in the instrument to weed out those variables that fail to show high correlations.
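As a rough illustration, Cronbach's alpha can be computed directly from the item-score matrix with the usual formula $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum \sigma_i^2}{\sigma_X^2}\right)$; the sketch below applies it to the recoded 0/1 data from the earlier steps (for dichotomous items this is equivalent to KR-20).

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for an (examinees x items) matrix of item scores."""
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Applied to the 0/1 item matrix from the scoring sketch above.
alpha = cronbach_alpha(scored[item_cols].to_numpy())
print(f"Cronbach's alpha for the {len(item_cols)} items: {alpha:.2f}")
```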
Table
3. Cronbach Alpha reliability
It can be seen from Table 3 that, computed with SPSS, the Cronbach alpha is 0.86 for the 29 items, which means that the test is reliable. This result is consistent with several experts who state that the reliability coefficient of a test usually varies between 0.60 and 0.85 (Linn & Gronlund, 2000; Kehoe, 1995; Frisbie, 1988). In other words, the reliability coefficient can support decision making, particularly in assigning grades.
Standard Error of Measurement
The standard deviation of the distribution of random errors around the true score is called the standard error of measurement (SEM). The lower it is, the more tightly packed around the true score the random errors will be. Nitko and Brookhart (2007, p. 76) describe the SEM as an estimate of how much students' test results differ from their true scores. In addition, when one refers to the standard error of measurement on a test, one is referring to the standard deviation of the test scores that would have been obtained from a single student had that student been tested multiple times. It is a measure of the "spread" of scores within a student had the student been tested repeatedly (http://ritter.tea.state.tx.us/student.assessment/taks/standards/sem.pdf).
The formula for the SEM is:

$SEM = S\sqrt{1 - \alpha}$

where S is the standard deviation of the obtained scores on the assessment and $\alpha$ is the Cronbach alpha reliability estimate. Nitko and Brookhart (2007) pointed out that if the obtained scores tend to be close to the true scores, then the scores also tend to be consistent, and consistent scores produce a smaller error of measurement. In other words, the larger the SEM, the lower the reliability of the test and the less precision there is in the measures taken and the scores obtained (http://www.fldoe.org/ese/pdf/y1996-7.pdf).
One useful application of the standard error of measurement is that it can be used to estimate a band of scores around any cut point within which students should be treated with special care. For instance, if the test in question had a cut point for failing of 30 and we wanted to be 68% sure of our decision, the standard error of measurement indicates that students within one SEM of the cut point (e.g., 30 ± 1.47, or 28.53 to 31.47) might fluctuate randomly to the other side of the cut point if they were to take the test again, so it might behoove us to gather additional information (e.g., other test scores, homework grades, or interviews with the students) before deciding whether or not they should pass the test. The SEM thus also functions as an estimate of how far students' observed scores may differ from their true scores. This is important because treating a single observed score as if it were the student's true score can be misleading (Lyman, in McDonald, 2007).
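The hypothetical cut-score example above can be restated in a few lines of code; the cut score of 30 and the SEM of 1.47 are the illustrative values from that paragraph, not values computed from the TIMSS data.

```python
# One-SEM (about 68%) uncertainty band around a hypothetical failing cut score.
cut_score = 30.0
sem = 1.47

band_low, band_high = cut_score - sem, cut_score + sem
print(f"Band around the cut score: {band_low:.2f} to {band_high:.2f}")

observed = 29.2   # a hypothetical student just below the cut
if band_low <= observed <= band_high:
    print("Within one SEM of the cut: gather extra evidence before a pass/fail decision.")
```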
In this respect, McDonald (2007, p. 241) stated that "the best way of grading students' scores is at the end of the course, by calculating the reliability coefficients, means, and SEMs for all exams; consider the final score, and add points to the final grade assignment." The formula for the SEM, restated, is:

$SEM = S\sqrt{1 - \alpha}$
Table 4. Reliability coefficient and standard deviation
From the results in Table 4, we can estimate the standard error of measurement, where the standard deviation S = 47.53 and the reliability coefficient α = 0.86. The SEM can be computed with the previous formula as follows:

$SEM = S\sqrt{1 - \alpha} = 47.53\sqrt{1 - 0.86} = 47.53 \times 0.37 \approx 17.8$

Based on the calculation above, an SEM of about 17.8 means that the scores students obtained are likely to lie roughly 18 points above or below their true scores.
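For completeness, the same calculation in code, using the values reported above (S = 47.53 and α = 0.86); the exact rounding may differ slightly from the hand calculation.

```python
import math

S = 47.53       # standard deviation of the obtained scores (from Table 4)
alpha = 0.86    # Cronbach alpha reliability estimate (from Table 3 / Table 4)

sem = S * math.sqrt(1 - alpha)
print(f"SEM = {S} * sqrt(1 - {alpha}) = {sem:.2f}")   # approximately 17.8
```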
Conclusion
Currently, item response theory and classical test theory are widely used in the psychological and educational fields. A test can be trusted if it follows some fundamental procedures for establishing validity and reliability. There are several components to consider in measuring test results, such as the reliability coefficient (for example, Cronbach alpha) and the standard error of measurement. A test is assumed reliable if its reliability coefficient is above 0.60, and preferably greater than 0.80. Item difficulty tells us how difficult each item in a test is; its purpose is to discriminate between upper-level and lower-level students. If an item is too easy it cannot discriminate among students; conversely, if the item is too difficult it can affect students' performance. A good test has mean p values between 0.30 and 0.80, so that each item is answered correctly by some students and incorrectly by others. In addition, if an item has a p value greater than .85, it has poor discrimination power. Other techniques for estimating reliability are Cronbach alpha, KR-20, and KR-21; however, the latter two are not discussed in this paper.
Bibliography
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. 1999. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
Anastasi, A., & Urbina, S. 1997. Psychological Testing (7th ed.). Upper Saddle River, NJ: Prentice Hall.
Brown, J.D. 1999. Questions and answers about language testing statistics: Standard error vs. standard error of measurement. JALT Testing & Evaluation SIG Newsletter. 3(1): 20-25.
Foy, P., & Olson, J.F. 2009. TIMSS 2007 User Guide for the International Database. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
Jones, R.W., & Hambleton, R.K. 1993. Comparison of classical test theory and item response theory and their applications to test development. Practical Assessment, Research & Evaluation. 4(6). Retrieved September 2, 2009, from http://pareonline.net/getvn.asp?v=4&n=6
Kehoe, J. 1995. Basic item analysis for multiple-choice tests. Practical Assessment, Research & Evaluation. 4(10). Retrieved September 2, 2009, from http://pareonline.net/getvn.asp?v=4&n=10
Linn, R.L., & Gronlund, N.E. 2000. Measurement and Assessment in Teaching. Upper Saddle River, NJ: Prentice Hall.
McDonald, M.E. 2007. Guide to Assessing Learning Outcomes. Brooklyn, NY: Jones & Bartlett Publishers.
McMillan, J.H. 2001. Essential Assessment Concepts for Teachers and Administrators. Thousand Oaks, CA: Sage Publications.
Moss, P.A. 1998. Testing the test of the test: A response to "Multiple inquiry in the validation of writing tests". Assessing Writing. 5(1): 111-122.
Mullis, I.V.S., Martin, M.O., Ruddock, G.J., O'Sullivan, C.Y., Arora, A., & Erberber, E. 2005. TIMSS 2007 Assessment Frameworks. TIMSS & PIRLS International Study Center, Boston College.
Nitko, A.J., & Brookhart, S.M. 2007. Educational Assessment of Students. New Jersey: Pearson Prentice Hall.
Olson, J.F., Martin, M.O., & Mullis, I.V.S. 2008. TIMSS 2007 Technical Report. Boston: TIMSS & PIRLS International Study Center.
Wu, M., & Adams, R. 2007. Applying the Rasch Model to Psycho-social Measurement: A Practical Approach. Educational Measurement Solutions.