How reliable are multiple intelligence 'quick' tests?

Note: This text is from a larger dissertation. Labeling of figures retains original system.

One of the legacies of prevalent learning styles theories on Multiple Intelligences Theory is the assumption that a student’s MI profile can be uncovered by the use of a quick test or checklist.

The central challenge of applying MI theory to education is its inherent lack of testability, and Gardner’s resistance to the use of quick psychometric tests.

‘…for most of us in Western society, intelligence is a construct or capacity that can be measured by a set of short questions and answers, presented orally or in writing.’ (Gardner, 1999, p135)

Indeed, Gardner has much to say on the topic. Of course, certain of the intelligences can be more feasibly tested: linguistic and logical-mathematical intelligences are essentially what traditional tests have assessed. Gardner himself was initially drawn into the search for suitable assessment tools to create such a test, but soon found ‘…that the standard technology could not be applied appropriately to several of the intelligences. For instance, how do you measure someone’s understanding of himself, or of other people, using a short-answer instrument? What would be an appropriate short-answer measure of individual’s bodily-kinesthetic intelligence.’ (Gardner, 1999, p136)

His eventual response was to test MI in the spirit of MI. The resulting ‘Spectrum’ project more resembled an interactive, hands-on section of a children’s museum: a range of activities and materials that children could explore so that their intelligence profile could be uncovered over time (Gardner, 1993). The resulting assessment ‘product’ was neither a grade nor a percentage point but, at year end, an essay revealing the child’s intellectual profile and informal advice as to possible applications of that profile. Within many educational settings, particularly that of ELT, such testing is, of course, wildly infeasible.

Multiple Intelligence Checklists

And so quick tests, or checklists, abound. While tests vary in their validity, reliability and supporting data, all MI tests would seem to face the following conundrums:

1 Paper-based tests are inherently linguistic in nature, and in fact are part of the original problem to which Gardner sought an alternative.

2 All but linguistic, logical and, to an extent, spatial intelligence are too difficult to pin down in paper-based tests.

3 While a Likert scale (preferably with a ‘not-applicable’ option) will alleviate the extremes of the following reservations, these inherent flaws need to be borne in mind when selecting a checklist.

A: People with lower interpersonal and intrapersonal intelligences will, by definition, be less able to self-reflect accurately: the intrapersonal being more self-aware of strengths and weaknesses; the interpersonal more able to judge how others respond to them.

Are you a good judge of character? (MIDAS)

I have a pleasant singing voice. (LDP)

Can you sing ‘in tune’? (MIDAS)

B: Such tests cannot adequately distinguish between a person’s real skill in an area and a mere interest (compare, for example, someone who enjoys moderate exercise and watching sport on TV with a fully fledged professional athlete).

I enjoy physical exercise. (MB/ WM)

I enjoy art activities. (MB/ WM)

C: Popular misconceptions of intelligence abound, in particular the intrapersonal/introvert confusion and confusion over the actual nature of naturalist intelligence.

(On intrapersonal intelligence)

I would prefer to spend a weekend alone in a cabin in the woods rather than at a fancy resort with lots of people around. (LDP on intrapersonal intelligence)

I go to the cinema alone. (MB/ WM)

Classification helps me make sense of new data (WM on naturalist intelligence)

D: Questions that target a particular intelligence often merely sum up attitudes that few people would deny.

My life would be poorer if there were no music in it. (LDP)

I enjoy informal chat and serious discussion. (WM)

I enjoy talking to my friends (MB/ WM)

I can tell you some things I’m good at doing. (MB/ WM)

E: Questions in response to which someone with a higher specific intelligence may grade themselves lower than someone without that intelligence strength due to increased awareness or sensitivity within specific related domains.

I’m sensitive to colour (LDP)

I can tell when a musical note is off key. (LDP)

I’m a good singer. (MB/ WM)

The ‘MIDAS’ Test

One assessment tool that has been written, not specifically with ELT in mind, but with an awareness of potential misunderstanding and misuse of MI, is the MIDAS (Multiple Intelligences Development Assessment Scales) test. Its developer was keen to avoid the ‘pitfalls’ of the quick checklist that can create, or perpetuate, ‘superficial’, ‘quick-fix’ and ‘mindless’ understanding. (Branton Shearer 2005:1-2)

While the test may, on the surface, look like any other, it was developed over a six-year period, has items which may score in different intelligence areas, and assesses subsets of skills within the broader intelligence boundaries. Musical intelligence, for example, is divided into appreciation, instrument, vocal and composer; kinesthetic into athletic and dexterity. The MIDAS development involved a huge sample, and strong correlations were found between intelligence and job type.

The professional/administration version is only available after the administrator has completed an assessment of knowledge of fundamental MI concepts. In this way the developer seeks to limit misuse through limited understanding of MI and also place the assessment not in the hands of a student, but in the hands of a practitioner with proven expertise. It also has to be paid for. So unless a practitioner with institutional backing is committed to pursuing MI theory in their school, this is not a test that is going to be used and so is not the test used in this research.

I am more interested in using and assessing the sort of checklists that are both instantly and freely available (copyright permission notwithstanding) to an ELT practitioner, either on the web or included in an MI resource book, as that is the sort more likely to be actually used. However, I also sought to see how such tests can be used carefully.

The best checklist I could find beyond the MIDAS was the test found in the non-ELT-specific book ‘So Each May Learn’ (Silver et al 2000). Because this test, like every other test to my knowledge, includes no descriptors for existential intelligence, I added 10 ‘existential’ questions. The subsequent stages of the development project helped us, as a team, to gain a deeper understanding of how to apply the intelligences. I now see that my questions are to some extent flawed. The questions concern themselves with the bigger elements of ‘existential’ intelligence, but could perhaps have focused on smaller ethical and interest-focused issues.

Reliability

As the checklist results were needed for developmental as well as research purposes, reliability assessment using a regular technique such as the split-half method was inappropriate, inasmuch as all investigation needed to be done in parallel with, and with sensitivity to, the development project rather than compromising it by making the participants feel like specimens. This limited my options, and I realized that the best I could hope for was the opportunity to reveal a tendency of reliability as opposed to more concrete ‘proof’. I chose a rather extreme alternate-form approach, devising two different instruments which, while being less transparently MI related (so the participants would not feel they were repeating the same activity) and enabling discussion and reflection in workshop contexts, would also give me an indication of internal reliability while acting as an indication of intra-observer reliability.

For external reliability I used an inter-observer technique that could also play a legitimate role within a development as well as research context.

I would suggest that MIT itself has implications for how the reliability of such MI checklists is determined. The Johari Window (Luft, 1969), shown in Figure 3.1, frames the situation well.

The Johari Window (JW) is a simple heuristic device illustrating communication about self to others, and communication from others about the self. MI checklists usually stay firmly within area 1: that which is obvious to the individual within the ‘public’ realm.

Firstly, the model helps to demonstrate how the personal intelligences will affect the success of the self-assessment. Interpersonal individuals have a greater propensity to see how others react to their own actions and therefore be aware of non-verbal feedback from area 2 without necessitating actual feedback. Interpersonal individuals who are also strong intrapersonally should, with their increased self-knowledge and propensity to self-reflect, be that much more able to incorporate such insights into behaviour and self-image. By comparison, individuals with less ‘personal’ proclivity, or less balanced intrapersonal and interpersonal intelligences, will, if the theory is correct, be less able to access and assimilate non-verbal feedback and be less able to accurately self-assess.

The impact of area 2 will also be an affective factor in the self-assessment of people who are highly skilled in a more performance-related domain of a particular intelligence. For example, during the testing phase I observed that a participant who sings in a well-known a cappella group rarely rated herself beyond a score of 3 in any question related to musical productive skills.

To increase feedback from JW area 2, therefore, I requested that, in addition to doing the test themselves, the teachers give a copy, where possible, to a close friend or relative, who was asked to complete the checklist about the teacher. Any score (on a 1 to 5 Likert scale) that differed by 2 or more (in the testing stage a difference of 1 seemed too subjective) was then to be discussed with the partner. Based on this input, the teacher could then choose to adjust their original score for that descriptor.
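The discussion rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, item identifiers and ratings are hypothetical, and the threshold of 2 is taken from the description above.

```python
def items_to_discuss(teacher_scores, partner_scores, threshold=2):
    """Return the checklist items whose two ratings differ by `threshold` or more."""
    flagged = []
    for item, own_rating in teacher_scores.items():
        partner_rating = partner_scores.get(item)
        if partner_rating is None:
            continue  # the partner left this item blank
        if abs(own_rating - partner_rating) >= threshold:
            flagged.append(item)
    return flagged


# Hypothetical ratings on the 1 to 5 Likert scale
teacher = {"q1": 3, "q2": 2, "q3": 5, "q4": 4}
partner = {"q1": 4, "q2": 4, "q3": 2, "q4": 4}

flagged = items_to_discuss(teacher, partner)
print(flagged)
```

With a threshold of 1, "q1" would also be flagged, which reflects the concern that a one-point gap is too subjective to be worth discussing.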

The main challenge in assessing the internal reliability of the tests is the question of what to compare them with for correlation, given the infeasibility of putting the staff through anything like the Spectrum project. In keeping with the ELT context, I therefore created two tests for correlation assessment. (Importantly, these were given prior to the MI test, which in turn was given before any input on the theory itself. I felt this was important for maintaining a more objective response than might have been obtained had the teachers known more about the theory and second-guessed the questions.)

The alternate-form tests were:

1) A survey in which teachers rate a list of 27 potential unit topics, that is, 3 examples from each intelligence type (see appendix 3). The teachers were simply asked to select the 3 they would most prefer to teach and the 3 they would least prefer. These results were then to be correlated with the results of the MI test. Again, I by no means expected to find any direct correlations, as the two tests are completely different, but rather was seeking a tendency of correlation.

2) A survey, similar to the first, but listing classic classroom activities (see Appendix 4). The list was adapted from Silver et al’s ‘So Each May Learn’ with additional existential activity types. (Silver et al, 1997:102-103) Looking back, I would have framed these existential activities differently. As mentioned, at the start of the project my own grasp of how to actually apply existential (and naturalist) intelligence to the classroom was, with hindsight, hazy. The survey comprised 63 items, 7 items for each intelligence. Teachers were asked to mark each one using the rubric shown in the survey (appendix 4).

I had anticipated that the relative specificity of activity types, as opposed to the generic thematic areas of survey 1, would provide greater correlation to the MI test, in part due to the fact that in reality teachers have more day to day experience of selecting activities than they do course unit themes.

Inter-observer Reliability of the Multiple Intelligences Indicator

As explained in Chapter 3, prior to giving the teachers the Multiple Intelligences Indicator (MII), two questionnaires were given to further indicate the teachers’ individual intelligence profiles. No input on the theory itself had been given at this point. Despite the actual sequence, to contextualize the two questionnaires for the reader I will present the MII results first.

Of the 18 MIIs given and returned, 12 were completed with a partner. These 12, therefore, are the focus of the correlation study. The full table of data is shown in appendix 4.1. For the purposes of individual reflection the teachers were each given a chart to visually represent the results as in figure 4.1.1. I will briefly discuss these individual results before discussing the results more generally.

Figure 4.1.1. Example of an Individual Teacher Profile (1)

This particular chart reveals a general tendency of agreement between the teacher (‘My’) and the partner for most of the intelligences. Note that actual ‘scores’ are omitted to focus attention on the relative levels of each intelligence indicated, and away from the arbitrary nature of the scores themselves (i.e. so as not to be contrasted as ‘absolutes’ with other participants). The teacher is generally rating herself lower than her partner does. This may be an objective interpretation of the 1 to 5 Likert scale, in which the partner is providing a more ‘flattering’ assessment of the teacher than the teacher herself. Conversely, it may be based on a more subjective interpretation of the scale descriptors. The advice to the teachers to discuss each questionnaire item on which the two scores differed by two or more points, rather than by one, sought to add balance to such subjectivity. The resulting conversations between the teacher and partner show that the teacher did, in each case, seek to balance her own scores in the light of the partner’s feedback.

Interestingly, though somewhat beyond the scope of this research to analyze in depth, it may be noted that the teacher’s highest intelligence is interpersonal – theoretically suggesting that the teacher is more likely to actively incorporate feedback from others. Compare with figure 4.1.2 in which the teacher’s interpersonal intelligence is rated by both parties to be comparatively more average while intrapersonal is higher. This teacher generally kept her re-evaluation at similar levels as her own original scores as opposed to seeking the balance we saw in the first example.

Figure 4.1.2. Example of an Individual Teacher Profile (2)

For the more general analysis, and to aid explanation of the data in appendix 4.1, table 4.1 shows, by way of example, the results for one teacher. The initial raw figures are translated into percentages. Though this may negate somewhat the objective differences on the Likert scale, the percentages allow for easier comparison and analysis of the group as a whole, while leveling subjective differences between teachers and partners and across the teaching team as a whole.

Table 4.1. Example of data for an individual student

Once the percentages were calculated, subsequent calculations could then take place:

The percentage difference between the teacher scores and the partner scores for each intelligence (T-P)

The percentage difference between the teacher scores and the re-evaluation scores for each intelligence (T-R)

In both cases negative scores were made positive, as the direction of difference was irrelevant (i.e. scores of +5 and -5 show the same degree of difference).

Each of the above was then averaged for each individual, and then collated and averaged for the 12 teachers in the sample (Table 4.2).
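The calculation steps above can be sketched in Python as follows. This is a minimal illustration, not the original working: it assumes that each raw score was converted to a share of that rater's total across the nine intelligences, which the text does not state explicitly, and all function names and figures are hypothetical.

```python
def to_percentages(raw_scores):
    """Convert raw totals per intelligence to a share of the rater's overall total.

    ASSUMPTION: 'percentage' means share of the total across all intelligences.
    """
    total = sum(raw_scores.values())
    return {name: 100 * score / total for name, score in raw_scores.items()}


def mean_abs_difference(scores_a, scores_b):
    """Average absolute percentage difference across intelligences (sign ignored)."""
    pct_a = to_percentages(scores_a)
    pct_b = to_percentages(scores_b)
    return sum(abs(pct_a[name] - pct_b[name]) for name in pct_a) / len(pct_a)


def group_average(pairs):
    """Average the per-individual figures over every (first, second) pair in the sample."""
    per_person = [mean_abs_difference(first, second) for first, second in pairs]
    return sum(per_person) / len(per_person)


# Hypothetical raw totals for one teacher (T) and her partner (P)
teacher = {"linguistic": 40, "logical-mathematical": 30, "spatial": 28,
           "musical": 25, "kinesthetic": 35, "interpersonal": 38,
           "intrapersonal": 36, "naturalist": 27, "existential": 31}
partner = {"linguistic": 42, "logical-mathematical": 33, "spatial": 30,
           "musical": 29, "kinesthetic": 36, "interpersonal": 41,
           "intrapersonal": 34, "naturalist": 30, "existential": 30}

print(round(mean_abs_difference(teacher, partner), 2))  # T-P for this teacher
print(round(group_average([(teacher, partner)]), 2))    # averaged over the sample
```

The same two functions serve for the T-R comparison by passing the teacher's initial scores and her re-evaluation scores instead.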

Table 4.2. Differences Between Teacher and Partner Scores, and Teacher Initial and Reevaluation scores

The average discrepancy between teacher scores and partner scores was therefore 1.8% for each of the nine intelligences, or 16.8% overall if multiplied by nine (intelligences). As such, there was an 83.2% correlation over the 108 different scores under comparison (12 individuals by 9 intelligences). This would suggest a strong tendency of reliability from an inter-observer perspective. The conceptual reliability is of course not assumed, and this research only used the tool for what it was: an indicator, not an actual measurement of an individual’s intelligence profile.

The average discrepancy between the teacher’s initial score and their re-assessment score was somewhat lower: 0.6% per intelligence or 5.4% overall. While this shows that the teachers, on average, settled more towards their original assessment than that of their partner, and while 5.4% is not insignificant, it would also suggest that using the indicator without the use of a partner is, for the purposes of a simple ‘indicated’ profile, still fairly useful. This is important in the context of a teacher wishing to give the indicator to a group of ‘exchange’ students who may not yet have an appropriately close friend or relative at hand to complete the test with. It also indicates the validity of the MIIs of the teachers who did not complete the test with a partner.

Not initially anticipated, but emerging from the analysis of data, an additional indication of reliability is illustrated by figure 4.3.

Figure 4.3. All Teacher MII Samples Averaged

The pie chart represents the average MII scores for the sample of the faculty from which 18 questionnaires in total were received (12 with partners, 6 without). As would be theoretically expected from a balanced indicator, each individual has a different profile containing strengths and weaknesses in different areas. Consequently, when the results are averaged over the whole group, the intelligences would be expected to display a high degree of equilibrium (rather than being skewed towards a smaller group of intelligences), which in fact they do. Given that the sample is one of city-dwelling English teachers, I would suggest that the presumed equilibrium is inherent in the results, though, as expected, the linguistic intelligence overall is highest, at the expense perhaps of the logical-mathematical. The urban location perhaps impacts or reflects the lower naturalist score. Notably, five of the nine intelligences provide the mode of 11%.

Figure 4.4 shows the results when students were given the test. In this case the sample is of 96 students.

Figure 4.4. All Student MII Samples Averaged

This chart is even less ‘peaked’ than the teacher faculty chart, which would be expected given that there is less specific group cohesion of interest or skill than the predictably linguistic English teachers, and a far wider range of geographical and cultural background. Naturalist is therefore on a par with logical, kinesthetic and linguistic. Given the main spread of the age group (16-22) it may also be unsurprising that the interpersonal is a little higher and existential a little lower than the other intelligences. Still, there is only a 6% difference between the lowest and highest intelligence, and, exactly in line with the faculty results, the mode is 11%.

Alternate Form Test: The Topic Survey

The challenge in presenting these particular results is how to correlate two completely different assessment tools. My solution was to label each teacher’s intelligence results from the MII as follows:

Highest three intelligences = A

Middle three intelligences = B

Lowest three intelligences = C

These labels were then cross-referenced with the teachers’ 3 most preferred (Table 4.3) and 3 least preferred (Table 4.4) choices from the topics survey. For example, in Table 4.3, two of Teacher 1’s three most preferred topics, ‘Extreme sports’ and ‘Study Skills’, fall, respectively, in the intelligences of kinesthetic and intrapersonal. As these are two of Teacher 1’s three highest intelligences on the MII, the boxes are marked with an ‘A’. Her other topic choice, ‘Social Skills’, falls under interpersonal intelligence, which is one of her middle three intelligences, and so is marked with a ‘B’. In Table 4.4 (least preferred topics), two of Teacher 1’s choices are ‘Great Artists’ (Spatial) and ‘Musical Instruments’ (Musical), both of which are amongst her three lowest intelligences and so are marked with a ‘C’. Her other topic choice of ‘Dance’ represents one of her highest intelligences, kinesthetic, and therefore the box is marked with an ‘A’.

In order for the Topic Questionnaire to affirm the reliability of the MII, it would be expected that Table 4.3, ‘Most preferred topics’, would consist mainly of As and perhaps Bs (given the imprecision of the tool due to the wide range of potential domains within each intelligence), while Table 4.4, ‘Least preferred topics’, would consist mainly of Cs and perhaps Bs. This is in fact largely the case.
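The labelling and cross-referencing just described can be sketched in Python. The topic-to-intelligence pairings are those given in the example above; the MII scores are invented purely so that the sketch reproduces Teacher 1's labels, and the handling of tied scores (whatever order the sort happens to leave them in) is an assumption, as the text does not say how ties were resolved.

```python
def band_labels(mii_scores):
    """Label the three highest intelligences 'A', the middle three 'B', the lowest three 'C'."""
    if len(mii_scores) != 9:
        raise ValueError("expected scores for exactly nine intelligences")
    ranked = sorted(mii_scores, key=mii_scores.get, reverse=True)
    return {name: "ABC"[position // 3] for position, name in enumerate(ranked)}


def cross_reference(topics, topic_to_intelligence, labels):
    """Return the A/B/C label of the intelligence behind each chosen topic."""
    return [labels[topic_to_intelligence[topic]] for topic in topics]


# Topic pairings taken from the worked example in the text
TOPIC_TO_INTELLIGENCE = {
    "Extreme sports": "kinesthetic",
    "Study Skills": "intrapersonal",
    "Social Skills": "interpersonal",
    "Great Artists": "spatial",
    "Musical Instruments": "musical",
    "Dance": "kinesthetic",
}

# HYPOTHETICAL MII scores, chosen only to match Teacher 1's described profile
teacher_1 = {"kinesthetic": 40, "intrapersonal": 38, "linguistic": 36,
             "interpersonal": 33, "logical-mathematical": 31, "existential": 30,
             "naturalist": 27, "spatial": 25, "musical": 22}

labels = band_labels(teacher_1)
most_preferred = cross_reference(
    ["Extreme sports", "Study Skills", "Social Skills"], TOPIC_TO_INTELLIGENCE, labels)
least_preferred = cross_reference(
    ["Great Artists", "Musical Instruments", "Dance"], TOPIC_TO_INTELLIGENCE, labels)

print(most_preferred)   # labels for the Table 4.3 row
print(least_preferred)  # labels for the Table 4.4 row
```

Counting the As, Bs and Cs across all teachers' rows would then give the tallies against which the expected pattern can be checked.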

Table 4.3. Most preferred topics cross-referenced with MII results

Table 4.4 Least preferred topics cross referenced with MII results

The main flaw of the Topic Survey is related to Gardner’s concept of ‘domains’ (see Chapter 1: Definition of Terms). For example, in Table 4.4, one of Teacher 1’s least preferred topics is ‘Dance’, which falls under kinesthetic intelligence, one of her highest intelligences. Unlike another of her kinesthetic choices, ‘Extreme Sports’, the topic of dance, though a kinesthetic domain, is not one that interests the teacher, perhaps because dance also relies on a higher musical proclivity. The survey therefore presents an over-simplified view of MIT. This perhaps accounts for the number of Bs and especially Cs in Table 4.3 and the number of As and Bs in Table 4.4.

The second problematic area is that of existential intelligence in Table 4.3. This is the most polarized section, with 7 As, 1 B and 5 Cs, and accounts for 5 of the 7 Cs in the whole table. I would suggest that to an extent the topics I chose to represent existential intelligence are skewed. ‘Great philosophers’, ‘World religions’ and ‘Good and Evil’, I now realize, are rather extreme examples of how existential intelligence may be reached via topic in the ESL classroom. More balanced descriptors may have been ‘ethical dilemmas’, ‘health and society’ and ‘future what ifs’. Such titles, through being less emotive, may have been less appealing to the less existential type than ‘world religions’ and ‘good and evil’.

Despite such possible skewing, the general tendencies are as predicted. Figure 4.5 demonstrates this tendency for correlation with the MII.

Figure 4.5 Tendency of Correlation of the Multiple Intelligences Indicator and the Topic Survey

Alternate Form Test: The Activity Survey

The final strategy to assess the reliability of the MII was the classroom activity survey (see appendix 4). Figure 4.6 shows the results when added together across the faculty. The proximity of totals to 100 is coincidental and does not represent a percentage.

Figure 4.6 Results of the Activity Survey

When the activities from the ‘used and like’ and the ‘not used but would like’ categories are taken alone and compared to the results from the MII, a degree of correlation is evident (these results appear as percentages):

Figure 4.7. Comparison of Activity Survey with Averaged MII Scores

Though not an exact correlation, the ‘liked and used’ activities differ markedly from the averaged MII results in only the kinesthetic and interpersonal categories. With the leveling addition of ‘liked and not used’ activities we see a strong similarity of mode to the MII: 7 of the 9 intelligences scored either 11% or 12%, while the MII mode was 11%.

Given the complete difference in alternate format of the MII and the activities survey, I would suggest that these similarities indicate a reasonable degree of correlation.

However, given subsequent findings and also an increased grasp of the application of the theory to classroom reality, I have some reservations about the survey questionnaire.

This reservation is based on a concern over wording. The findings are based on ‘activities used’. This gives no indication of how frequently such activities are used, only that, at some point, they have been used; it does not distinguish between an activity used once and an activity used daily. However, while this issue would spur the next phase of the research project, the aim of both the activity and topic surveys was to provide evidence of a tendency of correlation with the MII. The results, particularly of the activity survey, are actually closer than I had anticipated.
