Friday, 12 June 2015

Classical Test Theory
By Muhammad Tahir

Executive Summary

Educational assessment has become central to improving teaching and learning as well as to standardized testing. The results of an assessment can provide information about the extent to which the assessment affects teaching outcomes and about whom the assessment is intended to serve. Conducting an assessment means assessing a variety of subjects, such as students, teachers, and the test itself. The test is intended to reveal its implications and social consequences in shaping a standard curriculum, standard tests, and standard measurement so that educational goals can be achieved. Designing and applying tests for educational purposes is a challenging task; anyone applying tests in education should therefore be familiar with classical test theory, item response theory, and item discrimination, and should also pay close attention to Cronbach's alpha. "TIMSS 2007 uses Item Response Theory (IRT) scaling to summarize students' achievement on the assessment and to provide accurate measures of trends from previous assessments" (Foy & Olson, 2009).
The purpose of this paper is to explore classical test theory, which has been the keystone of measurement and testing for several decades. It also discusses item discrimination and item response theory. More specifically, it examines the validity and reliability of a test using TIMSS 2007 data as secondary data. TIMSS helps us understand the contexts in which students learn best: it enables international comparisons among the key policy variables in curriculum, instruction, and resources that result in higher levels of student achievement (Mullis et al., 2005).
TIMSS 2007 consists of science and mathematics items and uses two kinds of item formats: multiple choice and constructed response. A multiple-choice item is scored 1 point, while a constructed-response item is scored up to 2 points (Mullis et al., 2005).

Introduction

Classical test theory is a theory about test scores that introduces three concepts: the test score (often called the observed score), the true score, and the error score (Jones & Hambleton, 1993). Classical test theory has a simple formula, X = T + E, where the observed test score X is the sum of two unobservable (often called latent) variables, the true score (T) and the error score (E) (Jones & Hambleton, 1993, p. 255). The true score of a person can be thought of as the mean score that the person would obtain on the same test over an infinite number of testing sessions.
According to Jones and Hambleton (1993, p. 256), the assumptions of classical test theory are that true scores and error scores are uncorrelated, that the average error score in the population of examinees is zero, and that error scores on parallel tests are uncorrelated. In addition, the true score is derived from the error score and the total score on a test.
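As an illustration of this decomposition (not part of the original discussion), the short Python sketch below simulates repeated testing of a single examinee under the model X = T + E; the true score of 20 and the error standard deviation of 2 are arbitrary assumptions, and the average observed score approaches the true score as the number of hypothetical sessions grows.

```python
import numpy as np

rng = np.random.default_rng(42)

true_score = 20.0      # assumed latent true score T (arbitrary)
error_sd = 2.0         # assumed standard deviation of the error E (arbitrary)
n_sessions = 10_000    # number of hypothetical parallel test administrations

# Observed scores X = T + E, with E drawn from a zero-mean error distribution
errors = rng.normal(loc=0.0, scale=error_sd, size=n_sessions)
observed = true_score + errors

# The mean observed score approaches the true score as n_sessions grows,
# and the spread of the observed scores reflects the error SD.
print(f"mean observed score: {observed.mean():.3f} (true score = {true_score})")
print(f"SD of observed scores: {observed.std(ddof=1):.3f} (error SD = {error_sd})")
```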
Item response theory is a general statistical theory about examinees' performance on items and tests and about how that performance relates to the abilities measured by the items in the test. Basically, an IRT model rests on two fundamental assumptions: one concerns the structure of the test data, and the other concerns the mathematical form of the item characteristic function, or curve (Jones & Hambleton, 1993).
It is clear that item difficulty and item discrimination can have serious consequences for decision makers, curriculum developers, and stakeholders, so both should be examined before a test is put to use. Indonesia, for example, has begun standardizing its national examination, in which only five subjects are covered. This national examination is administered annually by a national committee responsible for designing the test, and each student must achieve a score of 5.0 to pass. Unfortunately, in 19 senior high schools around the country not a single student passed (www.kompas.com). This suggests that the test is probably too difficult for students and needs to be revised for the next administration. Furthermore, a test maker should also attend to students' needs and to the national curriculum underpinning the test.
In this regard, Cronbach (in Moss, 1998, p. 115) suggested "using stakeholders' interests as well as the evaluator's concerns to generate a list of potential questions and then prioritizing the questions based on: prior uncertainty about the questions; the information to be yielded by a feasible study compared with how much uncertainty will remain; the cost of the investigation in terms of time and dollars; and leverage for achieving consensus about the use of the test in the relevant audience".
This paper uses the TIMSS 2007 assessment as secondary data. Unlike PISA, the TIMSS 2007 assessment covered only two subjects, mathematics and science, with 353 items at the fourth grade plus 429 items at the eighth grade (Olson et al., 2008). Australia, Booklet 1, was chosen for the data analysis, with the fourth grade as the sample for the study. The study focuses mainly on the mathematics items, comprising 29 multiple-choice items.

Validity of the Test

Validity has an essential role in educational assessment. It can be established by using various techniques together with theories and evidence that support and analyze the measurement results so that they can be used for a specific purpose (AERA et al., 1999, p. 17). In this regard, McDonald (2007, p. 20) contended that validity refers to the appropriateness of the interpretation of test scores, not to the measures or the tests themselves. In other words, the validity of a test should be supported with evidence on which we can base our interpretation of the test scores. Therefore, a test can be valid if it is used to measure what the researcher really wants to measure and the measure is not used for purposes other than those of the test.
Similarly, McMillan (2001) stated clearly that a tester should have considerable experience in making judgments when validating a test. Furthermore, McDonald (2007) stated that the skill of validating a test consists of gathering sufficient evidence to support inferences about the test results. This evidence is very important for developing interpretations of the test results, which in turn can help people use a particular test result to make decisions (AERA et al., 1999, p. 11). In this respect, Nitko and Brookhart (2007) suggested that to achieve test validity, a test developer should obtain as much evidence as possible through a series of actions, such as documenting, checking, weighting, and combining all relevant evidence, which can enrich the interpretation of the test results. As Nitko and Brookhart (2007, p. 67) put it, "validity is the confidence we may have in interpreting students' assessment results and using them to make decisions."
Accordingly, with regard to validity, a test developer should ensure that what they want to measure is what actually appears to be measured. As McMillan (2007) suggested, "a test needs face validity so that it appears to be valid to the test consumer." If the test seems irrelevant or inconsistent with what the researcher really wants to test, it can affect not only students' performance but also the test takers themselves. This point is important because a test that is inappropriate or irrelevant to the students can lead to poor test results (Anastasi & Urbina, 1997).
The data are derived from the TIMSS 2007 dataset and consist of 29 multiple-choice items. Before analyzing the file Book1.BSG_Aus, the responses must first be scored dichotomously, because the data as delivered are a mix of nominal and scale variables and need to be converted. All responses are recoded into two categories: 1 for a correct answer and 0 for an incorrect answer. In the original file, a correct answer was coded 1 and an incorrect answer 0, but there are also one-, two-, and three-digit missing-value codes (6, 7, 9, 96, 99, and 999) for omitted or not-reached responses. These codes are recoded to 0, since for scoring purposes they carry the same value as an incorrect answer. In other words, an item is scored 1 if it is answered correctly and 0 otherwise. After scoring the data, we analyze the mean, the p value, the discrimination index, the reliability coefficient, and the standard error of measurement.
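A minimal sketch of this recoding step is given below in Python with pandas (the paper itself uses SPSS). The file name book1_aus_grade4.csv, the item-column prefix, and the assumption that correct answers are already coded 1 are illustrative placeholders; the actual variable names and codes should be taken from the TIMSS 2007 Booklet 1 codebook.

```python
import pandas as pd

# Hypothetical export of the Book1.BSG_Aus file: one row per student,
# one column per multiple-choice item (column names are placeholders).
raw = pd.read_csv("book1_aus_grade4.csv")
item_cols = [c for c in raw.columns if c.startswith("ITEM")]

# Dichotomous scoring as described above: 1 stays 1 (correct); 0 and the
# missing/not-reached codes (6, 7, 9, 96, 99, 999) all become 0 (incorrect).
scored = raw[item_cols].apply(lambda col: (col == 1).astype(int))

# Total raw score per student, used later for the discrimination analysis.
scored["total"] = scored[item_cols].sum(axis=1)
print(scored["total"].describe())
```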

Item Analysis

According to Nitko and Brookhart (2007), the item difficulty level can be defined as the percentage of students answering the item correctly. There are two basic purposes of conducting an assessment: first, to rank students' performance (best, second best, and so on), and second, to place students into several levels of achievement (beginner, intermediate, and advanced), in which case the test should include more difficult items (Nitko & Brookhart, 2007, p. 120). It is therefore important to bear in mind that one purpose of item difficulty is to distinguish between higher-performing and lower-performing students. Similarly, McDonald (2007) pointed out that if an item has a p value of 1.00 (100%) it contributes nothing to the reliability of the test; good items generally have p values between .30 and .80 (Kehoe, in McDonald, 2007, p. 242). The equation for the adjusted item difficulty is:
$$p^* = \frac{\bar{x} - X_{min}}{X_{max} - X_{min}}$$

where
$p^*$ = adjusted item difficulty,
$\bar{x}$ = weighted mean of the item score,
$X_{max}$ = maximum score point available for the item,
$X_{min}$ = minimum score point for the item.
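To make these quantities concrete, the following Python sketch (an illustration, not the paper's SPSS procedure) computes the classical p value for a dichotomous item and the adjusted p* for a partial-credit item; the ten response vectors are made-up toy data.

```python
import numpy as np

def p_value(item_scores):
    """Classical difficulty for a dichotomous item: proportion answering correctly."""
    return float(np.mean(item_scores))

def adjusted_p(item_scores, max_score, min_score=0):
    """Adjusted difficulty p* = (mean - min) / (max - min), usable for partial-credit items."""
    mean = float(np.mean(item_scores))
    return (mean - min_score) / (max_score - min_score)

# Toy data: 10 students on a dichotomous item and on a 0-2 constructed-response item.
dichotomous = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
partial_credit = [2, 1, 0, 2, 1, 2, 0, 1, 2, 2]

print(f"p  = {p_value(dichotomous):.2f}")                     # 0.70
print(f"p* = {adjusted_p(partial_credit, max_score=2):.2f}")  # 1.3 / 2 = 0.65
```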
Table 1 reports the mean p value of each item, representing the percentage of correct answers; this gives a short description of how difficult the test is overall. It can also be seen from Table 1 that items 5, 10, 12, 22, 27, and 28 are not good items because they lack the power to discriminate between high-achieving and low-achieving students. In this regard, Kehoe (1995) suggested that an item works well if its p value ranges from .30 to .80, and that if the p value is above .85 (or 1.00) the item has little discriminating power. For example, the last two items (27 and 28) have p values above .80; according to these criteria, both upper-level and lower-level students can answer them correctly. The other items (5, 10, 12, and 22) are categorized as too difficult and need revision because their p values are below .30.
Discrimination indices (D) can be calculated for each dichotomous item. The higher the D, the better the item discriminates. Items with p levels in the midrange usually have the best D values, and D has the greatest opportunity to be high when the item's p level is 0.50.
The extreme group method is used to calculate D. There are three simple steps to calculating D. First, those who have the highest and lowest overall test scores are grouped into upper and lower groups. The upper group is made up of the 25%–33% who are the best performers (have the highest overall test scores), and the lower group is made up of the bottom 25%–33% who are the poorest performers (have the lowest overall test scores).
The most appropriate percentage to use in creating these extreme groups is the top and bottom 27% of the distribution, as this is the critical ratio that best separates the tails from the mean of the standard normal distribution of response error (Cureton, 1957). Step two is to examine each item and determine the p levels for the upper and lower groups, respectively. Step three is to subtract the p level of the lower group from that of the upper group; this difference is D. Accordingly, for Nitko and Brookhart (2007, p. 324), the item discrimination index (D) is the difference between the fraction of the upper group answering the item correctly and the fraction of the lower group answering the item correctly. The formula for D can be written as follows:
$$D = p_U - p_L \quad\text{or}\quad d^* = \bar{p}_{73} - \bar{p}_{27}$$

where
$p_L$ ($\bar{p}_{27}$) = the item difficulty for the bottom 27% of the students (those below the 27th-percentile cut point),
$p_U$ ($\bar{p}_{73}$) = the item difficulty for the top group of students (those above the 73rd-percentile cut point).

To calculate the item difficulties for the groups defined by the 27% and 73% cut points, $p_L$ and $p_U$, we use the SPSS Frequencies command. The resulting discrimination index for each item is summarized in Table 2.
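The same extreme-group computation can be sketched outside SPSS. The Python function below, which assumes the dichotomously scored matrix `scored` from the earlier sketch, forms the upper and lower 27% groups on the total score and subtracts their p values, following the three steps described above.

```python
import pandas as pd

def discrimination_index(scored: pd.DataFrame, item_cols, proportion=0.27):
    """D = p(upper group) - p(lower group), using extreme groups on the total score."""
    totals = scored[item_cols].sum(axis=1)
    n_extreme = max(1, int(round(proportion * len(scored))))

    upper = scored.loc[totals.nlargest(n_extreme).index]   # top ~27% of total scores
    lower = scored.loc[totals.nsmallest(n_extreme).index]  # bottom ~27% of total scores

    d_values = {}
    for col in item_cols:
        p_upper = upper[col].mean()   # p level of the item in the upper group
        p_lower = lower[col].mean()   # p level of the item in the lower group
        d_values[col] = p_upper - p_lower
    return pd.Series(d_values)

# Usage (assuming `scored` and `item_cols` from the recoding sketch):
# print(discrimination_index(scored, item_cols).round(2))
```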

The index can range from -1.00 to +1.00. According to McDonald (2007), it is rare to observe values of -1.00 or +1.00 in practice. Table 2 gives a description of how well each item discriminates. It is interesting to note that item 12 and item 22 have extremely low values (D = 0). Wu and Adams (2007, p. 64) argued that a discrimination value of 0 shows that there is no relationship between the item score and the total score, and they stated clearly that the higher the discrimination index, the better the item is able to discriminate between students according to their ability level (p. 64). In addition, one would not normally accept any item with a discrimination index below 0.2, and Wu and Adams stated that it is preferable to accept items with discrimination indices above 0.4. On this basis, items 1, 6, 10, 11, 14, 15, 17, 24, and 30 have D values below 0.4, which means these items might need to be revised or discarded.
McDonald (2007) stated that the point-biserial index describes an item's power to discriminate, which is the basic indicator of item quality for multiple-choice tests. He further noted that the higher the point-biserial index, the better the item discriminates between upper-level and lower-level students (McDonald, 2007, p. 243).
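The point-biserial index can be computed as an item-total correlation; the sketch below uses the corrected version (each item correlated with the total of the remaining items), which is one common way of calculating it, and again assumes the `scored` matrix from the earlier sketch.

```python
import numpy as np
import pandas as pd

def point_biserial(scored: pd.DataFrame, item_cols):
    """Corrected item-total (point-biserial) correlation for each dichotomous item."""
    pbis = {}
    for col in item_cols:
        # Total of all items except the one being examined.
        rest_total = scored[item_cols].drop(columns=col).sum(axis=1)
        # The Pearson correlation between a 0/1 item and a continuous total
        # is the point-biserial correlation.
        pbis[col] = np.corrcoef(scored[col], rest_total)[0, 1]
    return pd.Series(pbis)

# Usage: print(point_biserial(scored, item_cols).round(2))
```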

Internal Consistency Reliability Estimates

One of the most popular reliability statistics in use today is Cronbach's alpha. Cronbach's alpha estimates the internal consistency, or average correlation, of the items in an instrument to gauge its reliability (Moss, 1998). When a variable generated from such a set of questions returns stable responses, the variable is said to be reliable. Cronbach's alpha is an index of reliability associated with the variation accounted for by the true score of the underlying construct, the construct being the hypothetical variable that is being measured (Hatcher, 1994).
The alpha coefficient ranges in value from 0 to 1 and may be used to describe the reliability of scales built from dichotomous items (questions with two possible answers) and/or multi-point formatted questionnaires or rating scales (e.g., 1 = poor, 5 = excellent). The higher the value, the more reliable the resulting scale. Nunnally (in McDonald, 2007) considered 0.7 to be an acceptable reliability coefficient, although lower thresholds are sometimes used in the literature. One good way of screening for efficient items is to run an exploratory factor analysis on all the items in the instrument to weed out those that fail to show high correlations.
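For reference, Cronbach's alpha can be computed directly from the item-score matrix using the standard formula $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum \sigma_i^2}{\sigma_{total}^2}\right)$; the function below is a minimal Python sketch of that computation, not the SPSS procedure used in the paper.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for an (n_students x n_items) score matrix."""
    X = np.asarray(item_scores, dtype=float)
    k = X.shape[1]                               # number of items
    item_variances = X.var(axis=0, ddof=1)       # variance of each item
    total_variance = X.sum(axis=1).var(ddof=1)   # variance of the total score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Usage (with the dichotomous matrix from the scoring step):
# alpha = cronbach_alpha(scored[item_cols].to_numpy())
# print(f"Cronbach's alpha = {alpha:.2f}")
```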
Table 3. Cronbach Alpha reliability


Table 3 shows that, computed with the SPSS reliability procedure, Cronbach's alpha is 0.86 for the 29 items, which means the test is reliable. This result is broadly consistent with experts who state that the reliability coefficient of a test usually varies between 0.60 and 0.85 (Linn & Gronlund, 2000; Kehoe, 1995; Frisbie, 1988). In other words, the reliability coefficient can support decisions, particularly in assigning grades.

Standard Error of Measurement

The standard deviation of the distribution of random errors around the true score is called the standard error of measurement (SEM). The lower it is, the more tightly the random errors are packed around the true score. Nitko and Brookhart (2007, p. 76) describe the SEM as an estimate of how much students' obtained test scores differ from their true scores. When one refers to the standard error of measurement on a test, one is referring to the standard deviation of the test scores that would have been obtained from a single student had that student been tested multiple times; it is a measure of the "spread" of scores within a student under repeated testing.
(http://ritter.tea.state.tx.us/student.assessment/taks/standards/sem.pdf)
 The formula of SEM is:

$$SEM = \sigma_x\sqrt{1 - \rho_{xx}} \quad\text{or}\quad SEM = S\sqrt{1 - \alpha}$$

where $\sigma_x$ (or $S$) is the standard deviation of the obtained scores on the assessment and $\rho_{xx}$ (or $\alpha$) is the Cronbach alpha reliability estimate. Nitko and Brookhart (2007) pointed out that if the obtained scores tend to be close to the true scores, the scores also tend to be consistent, and consistent scores produce a smaller error of measurement. In other words, the larger the SEM, the lower the reliability of the test and the less precision there is in the measures taken and the scores obtained (http://www.fldoe.org/ese/pdf/y1996-7.pdf).
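The SEM follows directly from the score standard deviation and the reliability estimate; the Python sketch below simply wraps the formula and, purely as an illustration, plugs in the values reported later in Table 4.

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

# Illustration with the values reported in Table 4 of this paper.
sem = standard_error_of_measurement(sd=47.53, reliability=0.86)
print(f"SEM = {sem:.2f}")   # about 17.8
```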
One useful application of the standard error of measurement is that it can be used to estimate a band of scores around any cut point within which students should be treated with special care. For instance, if the test in question had a failing cut point of 30 and we wanted to be 68% sure of our decision, a standard error of measurement of 1.47 would indicate that students within one SEM of the cut point (i.e., 30 ± 1.47, or 28.53 to 31.47) might fluctuate randomly to the other side of the cut point if they took the test again; it would therefore behoove us to gather additional information (e.g., other test scores, homework grades, or interviews with the students) before deciding whether or not they should pass the test. The SEM also functions as an estimate of how far students' observed scores may differ from their true scores. This is important because relying solely on the observed score as if it were the true score can be misleading (Lyman, in McDonald, 2007).
In this respect, McDonald (2007, p. 241) stated that "the best way of grading students' scores is, at the end of the course, to calculate the reliability coefficients, means, and SEMs for all exams; consider the final score; and add points to the final grade assignment". The formula for the SEM applied here is given below:
$$SEM = S\sqrt{1 - \alpha}$$
Table 4. Reliability coefficient and standard deviation ($S$)
From the results in Table 4, we can estimate the standard error of measurement with $S = 47.53$ and reliability coefficient $\alpha = 0.86$. Using the formula above:

$$SEM = 47.53\sqrt{1 - 0.86} = 47.53\sqrt{0.14} \approx 47.53 \times 0.37 \approx 17.8$$

Based on this calculation, an SEM of about 17.8 means that students' obtained scores can be expected to fall roughly within about 18 points above or below their true scores.

Conclusion

Currently, item response theory and classical test theory are widely used in psychological and educational fields. A test can be trusted if it follows some fundamental procedures concerning validity and reliability. Several components must be considered in evaluating test results, such as the reliability coefficient (e.g., Cronbach's alpha) and the standard error of measurement. A test is generally assumed to be reliable if its reliability coefficient is above 0.60, and preferably above 0.80. Item difficulty tells us how difficult each item in a test is; its purpose is to discriminate between upper-level and lower-level students. If an item is too easy it cannot discriminate among students, and conversely, if it is too difficult it can affect students' performance. A good test has items with mean p values between 0.30 and 0.80, that is, items answered correctly by some students and incorrectly by others. In addition, if items have p values above .85, the test has poor discriminating power. Other techniques for estimating reliability include KR-20 and KR-21, but these are not discussed in this paper.





Bibliography
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. 1999. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
Anastasi, A., & Urbina, S. 1997. Psychological Testing (7th ed.). Upper Saddle River, NJ: Prentice Hall.
Brown, J.D. 1999. Questions and answers about language testing statistics: Standard error versus standard error of measurement. JALT Testing & Evaluation SIG Newsletter, 3(1), 20-25.
Foy, P., & Olson, J.F. 2009. TIMSS 2007 User Guide for the International Database. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
Jones, R.W., & Hambleton, R.K. 1993. Comparison of classical test theory and item response theory and their applications to test development. Instructional Topics in Measurement, Practical Assessment, Research & Evaluation, 4(6). Retrieved September 2, 2009, from http://pareonline.net/getvn.asp?v=4&n=6
Kehoe, J. 1995. Basic item analysis for multiple-choice tests. Practical Assessment, Research & Evaluation, 4(10). Retrieved September 2, 2009, from http://pareonline.net/getvn.asp?v=4&n=10
Linn, R.L., & Gronlund, N.E. 2000. Measurement and Assessment in Teaching. Upper Saddle River, NJ: Prentice Hall.
McDonald, M.E. 2007. Guide to Assessing Learning Outcomes. Brooklyn, NY: Jones & Bartlett Publishers.
McMillan, J.H. 2001. Essential Assessment Concepts for Teachers and Administrators. Thousand Oaks, CA: Sage Publications.
Moss, P.A. 1998. Testing the test of the test: A response to "Multiple inquiry in the validation of writing tests". Assessing Writing, 5(1), 111-122.
Mullis, I.V.S., Martin, M.O., Ruddock, G.J., O'Sullivan, C.Y., Arora, A., & Erberber, E. 2005. TIMSS 2007 Assessment Frameworks. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College.
Nitko, A.J., & Brookhart, S.M. 2007. Educational Assessment of Students. Upper Saddle River, NJ: Pearson Prentice Hall.
Olson, J.F., Martin, M.O., & Mullis, I.V.S. 2008. TIMSS 2007 Technical Report. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College.
Wu, M., & Adams, R. 2007. Applying the Rasch Model to Psycho-Social Measurement: A Practical Approach. Melbourne: Educational Measurement Solutions.
















