PRINTED FROM the OXFORD RESEARCH ENCYCLOPEDIA, CRIMINOLOGY AND CRIMINAL JUSTICE. (c) Oxford University Press USA, 2016. All Rights Reserved. Personal use only; commercial use is strictly prohibited (for details see Privacy Policy and Legal Notice).

Using Cognitive Interviews to Guide Questionnaire Construction for Cross-National Crime Surveys

Summary and Keywords

What is a “snowball”? For some, a snowball is a drink made of advocaat and lemonade; for others, a mix of heroin and cocaine injected; for yet others, a handful of packed snow commonly thrown at objects or people; for gamblers, it refers to a cash prize that accumulates over successive games; for social scientists, it is a form of sampling. There are other uses for the term in the stock market and further historical usages that refer to stealing things from washing lines or that are racist. Clearly then, different people in different contexts and different times will have used the term “snowball” to refer to various activities or processes. Problems like this—whereby a particular word or phrase may have various meanings or may be interpreted variously—are just one of the issues for which cognitive interviews can offer insights (and possible solutions).

Cognitive interviews can also help researchers designing surveys to identify problems with mistranslation of words, or near-translations that do not quite convey the intended meaning. They are also useful for ensuring that terms are understood in the same way by all sections of society, and they can be used to assess the degree to which organizational structures are similar in different countries (not all jurisdictions have traffic police, for example). They can also assess conceptual equivalence. Among the issues explored here are the following:

• What cognitive interviews are

• The background to their development

• Why they might be used in cross-national crime and victimization surveys

• Some of the challenges associated with cross-national surveys

• Ways cognitive interviews can help with these challenges

• Different approaches to cognitive interviewing (and the advantages of each)

• How to undertake cognitive interviews

• A “real-world” example of a cognitive interviewing exercise

• Whether different probing styles make any difference to the quality of the data derived.

Keywords: research methods, victimology, criminal victimization, surveys, cognitive interviews

What Are Cognitive Interviews?

Cognitive interviewing is the administration of survey questions in draft form to respondents in order to collect additional verbal material about their thought processes as they answer the questions and to gain some insight into the quality of the questions in order to refine them (Beatty & Willis, 2007, p. 287; De Maio & Landreth, 2004). The technique is commonly used to explore specific aspects of survey questions by asking respondents how they understood key elements of the draft questions and what thought processes they used to derive answers. Typically, researchers are interested in how respondents interpreted keywords or concepts in the questions (e.g., “slapped you,” Ackerman, 2016; “safe,” Ferraro & LaGrange, 1987; or “the police make reasonable decisions,” Farrall et al., 2012). As such, the process reveals something about the quality of the responses collected (in terms of whether the participant is answering the question in the way intended by the question designers) and any difficulties that the participant had when answering the draft questions (Beatty, 2004). In recent years cognitive interviewing, in particular that which uses verbal probing, has become more commonly used as a survey pretesting technique (Thompson, 2008; Willis, 2005).

What Is the Background to Cognitive Interviewing?

Cognitive interviewing was developed during the 1980s as part of a wider exploration of the cognitive aspects of survey methodology (CASM; Belson, 1986). The basic tenet of CASM is that responding to survey questions requires a complex set of cognitive processes (Tourangeau, Rips, & Rasinski, 2000). Although no one model is currently universally accepted, the model outlined by Willis (2005, pp. 37–42) summarizes the main processes and the order in which they happen (see also Tourangeau et al., 2000):

  1. The respondents read the question; they assess whether or not they understand it.

  2. They retrieve from memory the relevant information needed to answer the question.

  3. They assess how accurately and thoughtfully they should answer the question, and whether they wish to tell the truth or give a more “socially acceptable” answer.1

  4. They tailor their response to match any multiple-choice options offered.

Over time, as documented by both Beatty and Willis (2007) and Conrad and Blair (2009), two families of cognitive interviewing techniques emerged, each with a quite different theoretical paradigm and methodological characteristics.

Why Undertake Cross-National Crime and Victimization Surveys?

International social surveys are becoming increasingly popular with researchers seeking to make comparisons between countries undergoing similar social, economic, or political transformations (Smith, 2009, p. 267; Medina et al., 2009, p. 333). Indeed, some see comparative social research as a necessity (Hantrais, 1999). As a result, processes once considered to be unique to one society have become recognized as part of wider transformations. This has led to a further recognition that unless more comparative research is undertaken, global or continentally based processes may continue to be viewed and tackled as purely “domestic” matters (Smith, 2009). As Medina et al. (2009, p. 334) note, by comparing attitudes and behaviors across countries and by evaluating the relationships between variables and how these vary by context, social scientists can clarify their understanding and appreciation of causal mechanisms, relationships, and processes and how these unfold over time and in different societies. Cross-national research, like interdisciplinary research, forces social scientists to consider new possibilities in terms of causal relationships and mechanisms for accounting for change and is to be welcomed (Smith, 2009, p. 267).2 Cross-national research has produced theoretical insights into a range of topics in political science (such as rates of political participation, the role of social values in economic growth, and levels of political action). Similar research has elucidated the frequency with which people experience “core” emotions (Scherer et al., 2004), for example. In these respects, social scientists are becoming increasingly involved in debates about social processes and social change in more than just one country. Crime rates, responses to these rates, and citizen views on them are one such topic.

What Challenges Are Associated with Cross-National Surveys?

The challenges presented by comparative social survey research are many, varied, and not easily resolved (Smith, 2009). The sorts of factors considered when designing survey items in one language include, but are not limited to, question order effects, general wording of questions, the need to avoid “leading” questions, the creation of “nonattitudes,” the use of “self” or “proxy” questions, scale length (1 to 4 vs. 1 to 7), the use of a midpoint, the labeling (1 = “very good,” 2 = “good,” and so on) or anchoring (0 = “always do” and 10 = “never do”) of scales,3 whether to refer to a time period (and if so, how long this ought to be), whether a filter question is needed, and how to avoid socially desirable responses (Schuman & Presser, 1981; Tourangeau et al., 2000). As such, while the design of survey questions is a complex matter in one language, it becomes more complex when one is trying to accomplish all of this in a survey to be used in several countries. As Smith notes:

The basic goal of cross-national survey research is to construct questionnaires that are functionally equivalent across populations. Questions need not only be valid, but also must have comparable validity across nations. But the very differences in language, culture, and structure that make cross-national research so analytically valuable, seriously hinder achieving measurement equivalency.

(2009, p. 268)

In short, what may seem like a “workable” question in one language may prove to be unfeasible for people who speak a different language or who live in a different context. As Heath et al. (2009, p. 293) summarize, the problem presented by cross-national research is one of deciding if observed differences are “real” or if they are due to methodological shortcomings. As recent reviews (Smith, 2004, pp. 431–432; Blasius & Thiessen, 2006, cited in Heath et al., 2009) have suggested, few researchers are willing to recognize that observed differences are due to poor research design rather than real differences. This is particularly worrying, given that question design is the “weakest link” in producing data of sufficient quality to enable rigorous cross-national research. Presented with the problems inherent in designing good survey questions for cross-national crime surveys, researchers may find the task of undertaking high-quality research employing survey data insurmountable. Yet techniques for understanding how respondents interpret survey questions (usually employed to assess the “meaning” of both questions and the answers they elicit) are emerging.

How Can Cognitive Interviews Assist with the Design of Cross-National Surveys?

Cognitive interviews are one tool for assessing the validity of cross-national survey questions (Willis, 2005, p. 266). Few studies, however, have carried out cognitive interviews in an international or cross-cultural context; Willis (2015) reviews 32 such studies. Willis and Zahnd (2007) carried out cognitive interviews to understand how health care survey questions worked with four different groups: monolingual Koreans; non-Korean English speakers; bilingual Koreans interviewed in English; and bilingual Koreans interviewed in Korean. This study was undertaken to determine whether Koreans understood questions in the same way as the non-Koreans and whether the questions were understood differently by bilingual Koreans in Korean or English. Willis and Zahnd discovered that certain types of error were more common for particular cultural groups. They also reported on assumptions that had been made in the wording of the questions and that did not apply to people’s situations. They identified three types of cross-cultural problems: translation, cultural adaptation, and generic questionnaire design problems. Cognitive interviewing has also been used as a tool in the piloting of national and cross-national surveys, including the bilingual version of the U.S. census (Goerman et al., 2007).

Researchers from the European Social Survey (ESS)4 have also worked to develop cognitive interviewing for cross-national research design (Fitzgerald et al., 2009). A cross-national cognitive interviewing project was initiated to test questions in six European countries and the United States. Interviewers investigated people’s understandings of a sample of survey questions. This project involved the examination of “nonresponses” (“don’t knows” or refusals to answer); examination of other behavior(s) suggesting confusion (such as hesitation or requests for repetition of questions); assessment of how each respondent understood and answered each question, and whether it was understood in the way the survey designers intended; classification of the errors discovered; production of recommended improvements; and country verification, where research teams were asked to verify the findings from their countries.

Are There Different Approaches to Cognitive Interviewing?

As noted earlier, two broad approaches to cognitive interviewing can be identified in the literature. These are known as “Thinking Aloud” and “Verbal Probing.”

Thinking Aloud

Seen as the original cognitive interview technique (Beatty & Willis, 2007), in Thinking Aloud (TA) the respondents are encouraged to vocalize their thought processes as they answer survey questions. After the interview, the researcher uses the interview transcript to examine the respondents’ understanding of the question and information that they drew upon when answering. The researcher can then redesign questions to remove as many potential sources of error as possible. It is acceptable within the TA approach for interviewers to use some interjections in order to keep the interviewee focused on the task (e.g., “keep talking” or “tell me what you’re thinking”; Willis, 2005, pp. 46–47).

Verbal Probing

Researchers employing the Verbal Probing (VP) technique take a more active role, asking questions and probing the respondent in order to elicit data. Beatty and Willis (2007) suggest that VP was developed for pragmatic reasons, as researchers realized that they required more information about people’s thought processes than could reliably be gained through the TA style. Beatty and Willis (2007, p. 300) identify four different types of probes used within this technique, with regard to the way that they are presented by the researcher. These are:

  1. Anticipated Probes: scripted in anticipation of the question being problematic (e.g., “could you tell me what you understand by the term ‘police’?”).

  2. Spontaneous Probes: used by researchers to search for potential problems that may have occurred to them within the course of the interview.

  3. Conditional Probes: scripted probes that are only used if the respondent exhibits certain behavior (e.g., “I noticed you paused for a long time before you answered, could you tell me why?”).

  4. Emergent Probes: unscripted probes that may occur to the interviewer within the course of the interview in response to comments by or behaviors of the participant (e.g., “You mentioned that your house had been burgled and the police were not very effective; could you tell me some more about that?”).

What Are the Advantages and Disadvantages of Each Approach?

Advocates of TA suggest that it is more valid than VP because:

  • It is a standardized procedure, in which each participant is asked to complete an identical task (Beatty & Willis, 2007, p. 292; Willis, 2005, p. 56).

  • VP can change the content and flow of people’s responses to questions, thereby creating artificial problems (Beatty & Willis, 2007, p. 292).

  • TA is reported during the process of comprehending a question (rather than reconstructed afterward), and so TA is a more authentic reflection of the thought process (Beatty & Willis, 2007, p. 292). (See Conrad et al., 2000, p. 4 for a refutation of this claim.)

  • TA is less open to interviewer effects, which may distort the answers given (since the interviewer is relatively silent during the interview), and is more open-ended (Willis, 2005, p. 53).

While supporters of TA suggest that the technique does not interfere with the thought process, others (Willis, 2005) suggest that TA leads to people thinking about the question differently (and for longer) than they would do otherwise, and thus their answer and attitude to the question are changed through this process. Other cited issues with TA include the fact that some subjects may need training in how to “think aloud” and may never be able to undertake this activity (Beatty & Willis, 2007, p. 293; Willis, 2005, p. 54), that they may stray from the task (Willis, 2005, p. 54), and that useful information may not always be forthcoming (i.e., TA may allow respondents to avoid giving an answer to a question, thereby raising uncertainty over whether the question was unanswerable or just not answered; Willis, 2005, p. 54). It has also been suggested that TA may indicate that problems exist without being able to identify what the problems are or how they may be solved (Beatty & Willis, 2007, p. 294).

Advocates of the VP approach make the following suggestions:

  • As some respondents are poor at thinking aloud, VP overcomes this difficulty.

  • VP means that the researcher can focus attention on pertinent issues (rather than allowing the respondent to wander off topic).

  • If the VP is undertaken immediately after the question has been delivered, then respondents’ thought processes are not interfered with. It is argued that this thought process is still within the short-term memory of the respondent and can thus be asked about afterward (Willis, 2004).

  • The VP approach generates information that may not come to light unless explicitly asked about (Beatty, 2004, p. 64).

  • The VP approach enables the interviewer to retain greater control of the interview and keeps the respondent focused on the task (Willis, 2005, p. 55) and provides data that are useful to both the identification and resolution of problems (Beatty & Willis, 2007, p. 294).

Although there is increasing support for VP, it is not necessarily without its drawbacks. Willis (2005) suggests that VP can lead to reactivity, with respondents’ subsequent answers being affected by probing. He also suggests that there is the potential for bias to be introduced in the phrasing of probes and that there is an increased need for the training of interviewers in comparison to what is required for TA.

Which One Is Better?

Very few studies have examined which of these two approaches produces better data. One such study, conducted by Priede and Farrall (2011), found that the VP approach revealed more problems than the TA approach. However, this difference was not statistically significant (p = .258). Both approaches found similar sorts of problems. They reported that VP allowed for a more in-depth investigation of some of the issues emanating from the questions than did the TA approach. They felt that VP was better than TA for understanding respondents’ comprehension of questions (see also Beatty & Willis, 2007) and that VP was clearly better than TA for understanding the definitions of key terms (such as “the police” and “the courts”). It is less clear whether VP is better for understanding how respondents define other terms in the questions examined (such as “deal effectively” or “handling problems”), where probed responses appear to have a lot of post-hoc elaboration.

It has been suggested that TA means that the respondent does not always provide codeable answers to survey questions (Willis, 2005). In the Priede and Farrall (2011) study, 12 of the TA participants did not give an answer to at least one of the survey questions, as opposed to 3 of the VP participants (p = .001). These “no answer” incidents were particularly clustered around more complicated, longer questions. This is most likely because participants spent a long time thinking aloud about these questions and thus did not get around to placing an answer on the scale provided. Alternatively, the processes being asked about became so complex in the minds of the interviewees that they could not give one answer. It has also been suggested that TA is a more open-ended technique, whereas VP allows for more focus to be placed on specific issues and thus allows the interviewer to exert more control over the interview (Willis, 2005). Priede and Farrall (2011) also reported on interview control.

Willis (2005) suggests that TAs are freer from interviewer bias than VPs. This may be the case, for there are fewer interviewer interventions with the TA approach, but unless a “pure” TA approach is undertaken, there is always going to be some input from the interviewer, so TA cannot be seen to be completely free of interviewer bias. While a completely nonbiased approach may be desirable in a purely scientific study, the interviews in the Priede and Farrall study were undertaken for pragmatic, question (re)design reasons associated with a wider project. As such, the interviews were conducted to determine whether certain concepts worked in the field when “translated” into survey questions, and if interviewees had not been guided to address these concepts, information collected in them would have been of less worth to the wider research project. Therefore, while a pure TA may be the most bias-free style of interview, this quality does not necessarily make it the “best” or most useful style.

Both TA and VP approaches have their benefits and drawbacks. VP allows for the exploration of specific concepts, but some probes apparently elicit information that respondents did not use in their original answers to the question. While TA does not necessarily provide the desired information about specific concepts, it does not suffer the same drawback of post-hoc elaboration. TA also allows better understanding of the retrieval aspect of the cognition process. In addition, if the interviewer makes greater use of discretionary probes during TAs, some of the problems that the TA approach hints at but does not allow to be explored could be investigated. When Priede and Farrall were undertaking their research, there was a need to discover both specific issues (such as the respondent’s knowledge of a public body) and more general conceptual issues surrounding the questions. Therefore, careful thought had to be given to construction of the probes used in order to ensure that they delivered both the general and the specific. The probes also had to be made user friendly for the respondents, and not be too repetitive, so as to keep them engaged in the interview process.

How Does One Undertake Cognitive Interviewing in Practice?

This discussion is based on both previous reviews (Fitzgerald et al., 2009; Willis & Zahnd, 2007), which provided some guidelines as to the cross-cultural interviewing process, and personal experiences.

The Selection of Countries

  • Cognitive interviews must be carried out in the source country (i.e., the country in which the questions were first developed or designed). This approach helps detect whether the questions themselves are flawed and whether changes are needed in the source country and elsewhere (Willis, 2015, p. 380).

  • It is necessary to work with different language groups. In our research (Farrall et al., 2012), we conducted interviews in countries speaking a Romance language (Italy), a Finno-Ugric language (Finland), a Slavic language (Bulgaria), and a Germanic language (England). Ideally, this approach enables identification of as many potential linguistic issues as possible in the design stage, rather than leaving them undiscovered until the pilot or main fieldwork.

  • Countries might need to be chosen on the basis of assumed variance in the object of enquiry. In our research, countries were selected on the basis of their differing confidence in their criminal justice systems. Traditionally, Finns have a high level of confidence in their police and court systems, whereas Italian confidence is very low. As a comparatively recent democracy, Bulgaria faces its own challenges regarding confidence in the justice system, whereas the UK is a stable democracy.

  • Countries should also be selected on the basis of their differences in key variables. All four of the countries we researched had criminal justice systems that differed in their compositions. As a result, we could identify problems with the conceptual apparatus and the assumptions that were being made about the nature and operation of the processes being explored.

  • When appropriate and possible, countries should also be selected on the basis of their different economic, social, and political histories. Differences in these areas may lead to different attitudes to the criminal justice system inasmuch as relationships between public and state in these countries are traditionally different.

The Process of Cognitive Interviewing

  • From the beginning, a dialogue should be maintained between members of the research team and those translating the questions. This will minimize the number of translation errors and provide guidance when designers of questions are choosing among a variety of possible translations of terms or concepts.

  • It therefore follows that the team undertaking the cognitive interviews should be multilingual (Willis, 2015, p. 382), enabling them to understand the differences in how respondents use the same word to mean different things or how two seemingly similar words in different languages may mean quite different things.

  • In deciding which words, questions, or response codes to take into cognitive interviewing, researchers should include key concepts (such as “trust”), institutions (“your family”), organizations (“the police”), jargon (“stop and search”), and any other aspects they feel may be contentious. The cognitive interviews may also be used to explore the public acceptability of some words or phrases.

  • A training day should be held with interviewers from all participating countries. Training should be given in the use of probes, when to ask them, and what key issues are being investigated. The process of interview data analysis should also be outlined, including the use of an error coding framework.

  • The process of interviewing and analysis should be documented thoroughly in all countries involved. This allows the spotting and explanation of errors or oddities in results.

  • Demographic data about the respondents in cognitive interviews ought to be collected, so that consistency of usage and comprehension can be assessed along key sociodemographic variables (i.e., do older people use a word differently than younger people, do some ethnic groups differ in their preference for response codes, and so on). Such data ought to extend to age, educational level, ethnicity, gender, income, and any other variables that may influence the understanding of key concepts in the study (in some cases, data about housing type or type of prior victimization may be required).

  • Recruitment of participants should be carried out in similar ways in all countries involved, if possible. This means that a similar cross section of the population should be recruited. A similar number of participants should be interviewed in each country. The cognitive interview technique needs just a relatively small number of participants, which ought to be achieved in all countries so that each country has an equal chance to identify errors in draft survey questions.

  • For most studies that require cognitive interviews, around 15 to 30 interviews are sufficient. However, cross-national studies appear to place a greater reliance on larger sample sizes (Willis, 2015, p. 380). This is in part driven by the need to have 15 to 30 interviews in each country, so a four-country study will inevitably involve 60 to 120 interviews.

  • Some studies undertake two rounds of interviews; the first is aimed at identifying problems, and the second seeks to assess the degree to which the drafted solutions have improved the questions.
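The sample-size arithmetic in the guidelines above can be sketched in a few lines of Python. This is only an illustration: the function name and its parameters are hypothetical, and the assumption that a second round repeats the full per-country sample is ours rather than something stated in the sources cited.

```python
from typing import Tuple


def total_interviews(countries: int,
                     per_country_min: int = 15,
                     per_country_max: int = 30,
                     rounds: int = 1) -> Tuple[int, int]:
    """Return the (minimum, maximum) number of interviews for a study.

    Assumes every country completes the same range of interviews in
    every round, which is a simplification made for illustration.
    """
    if countries < 1 or rounds < 1:
        raise ValueError("countries and rounds must be at least 1")
    if per_country_min > per_country_max:
        raise ValueError("per_country_min cannot exceed per_country_max")
    return (countries * per_country_min * rounds,
            countries * per_country_max * rounds)


# A four-country study with 15 to 30 interviews per country.
print(total_interviews(4))            # (60, 120)
# The same study run as two rounds (identify problems, then retest).
print(total_interviews(4, rounds=2))  # (120, 240)
```

In practice a second round may well be smaller than the first, so the two-round figures are best read as an upper bound for budgeting.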

In terms of analyzing cognitive interviews, Fitzgerald et al. (2009, p. 12) suggest the following analysis stages:

  1. Examination of evidence of “nonresponses” (“don’t know” or refusals to answer).

  2. Examination of evidence of other behavior suggesting confusion, such as hesitation or requests for repetition of questions.

  3. Identification of contextual information that may account for or explain findings. This includes contextual information identified by the country representatives, in terms of translation or other country-specific issues.

  4. Assessment of how each respondent understood and answered each question, and whether the questions were understood in the way the question designers intended.

  5. Identification of the key findings from each country and for each question.

  6. Identification of overall conclusions, including the classification of errors discovered, using an error source typology.

  7. Production of recommended improvements and changes to questions.

  8. Country verification, where research teams were asked to verify the findings from their countries.

Fitzgerald and colleagues (2009) suggest that all of these stages are necessary for understanding how and whether questions “work” in different countries. They used an error-source typology, and as the name suggests, this typology is based on identifying how errors were introduced into the questions. The four sources identified were the following:

  1. The source question is inherently flawed.

  2. Translation errors were introduced into the question (e.g., in the ESS, “wealthy” was mistakenly translated as “healthy” in France).

  3. The source question appears to work well, but some features in its design make translation difficult.

  4. Cultural differences mean that the concepts being measured do not exist in all countries (for example, in the U.S. Census work, it was discovered there was no concept of “foster child” in Spanish).

These are similar to the errors identified by Willis and Zahnd (2007), but with the additional issue of the interaction of the source question with the translation.
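For teams that record their coding electronically, the four-source typology could be represented along the following lines. This is a minimal sketch rather than the coding framework used in the ESS work: the member names, the record layout (country, question, source), and the example records are all hypothetical.

```python
from collections import Counter
from enum import Enum


class ErrorSource(Enum):
    """The four error sources listed above (member names are illustrative)."""
    SOURCE_QUESTION = "source question is inherently flawed"
    TRANSLATION = "translation error introduced into the question"
    SOURCE_TRANSLATION_INTERACTION = "source design makes translation difficult"
    CULTURAL = "concept does not exist in all countries"


def tally_by_source(coded_errors):
    """Count coded errors for each error source."""
    return Counter(source for _country, _question, source in coded_errors)


def tally_by_country(coded_errors):
    """Count coded errors for each country."""
    return Counter(country for country, _question, _source in coded_errors)


# Hypothetical coded errors: (country, question id, error source).
coded_errors = [
    ("England", "Q1", ErrorSource.SOURCE_QUESTION),
    ("Italy", "Q1", ErrorSource.SOURCE_QUESTION),
    ("Bulgaria", "Q3", ErrorSource.TRANSLATION),
    ("Italy", "Q3", ErrorSource.SOURCE_TRANSLATION_INTERACTION),
    ("Finland", "Q7", ErrorSource.CULTURAL),
]

for source, count in tally_by_source(coded_errors).most_common():
    print(f"{source.name}: {count}")
```

Keeping the country and question alongside each coded error makes it straightforward to compare countries later and to check whether apparent differences reflect the questions or the way individual teams applied the codes.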

Euro-Justis: A Worked Example of an International Comparative Cognitive Interviewing Exercise

Euro-Justis, a project funded by the European Commission, was designed to provide EU institutions and member states with new indicators for assessing public confidence in justice. For some time, member states had used social indicators to improve policy and its assessment, but limited progress had been made in this regard in criminal justice. Common-sense indicators based on readily available statistics (such as crime trends) were used extensively. Much less attention was paid to crucial but hard-to-measure indicators of public confidence in “justice,” a term that embraces issues relating to fairness, trust, and insecurity. Without such indicators, there is a risk that crime policies may become overfocused on short-term objectives of crime control, at the expense of equally important longer-term objectives relating to justice.


The project aimed to develop and pilot survey-based indicators of public confidence in justice. One key aspect of this work was the design and cognitive testing of measures of trust and confidence in the criminal justice system. This work took place in four countries: England, Italy, Bulgaria, and Finland. These countries were selected because they represented different language groups and had different social, economic, and political histories, which led them to develop different attitudes to the police. All countries completed 16 to 30 interviews.

It was necessary to develop a standardized protocol for the interviews and to train interviewers from different countries together to ensure that the differences encountered in responses were due to cross-national variances rather than different interviewing techniques. It was also important to take a standardized approach to analysis, so that problems were discussed in similar ways in all countries and country reports could therefore be directly compared. In order to check for translation errors, or issues with how the questions interacted with the translation which the international researchers may have overlooked, the translated questionnaires were “back translated” into English.

In all, 21 questions from the draft questionnaire were selected for examination. The interviews were carried out using standardized probes, developed with the aims of understanding how certain terms were understood by respondents; determining whether certain concepts and the words used to express them “worked” across Europe; and examining the cognitive processes by which people come from thinking about the question to selecting an answer from the scale offered. Interviewers were also allowed to use discretionary probes to discover more about issues that arose during the interviews. The interviews were recorded and analyzed using a framework developed by De Maio and Landreth (2004), with additional codes developed for issues that arose due to question translation.

Some of the questions came from existing surveys, whereas others were based on redrafts of existing survey questions or were entirely new. Researchers were asked to document any changes they made to the questions during the translation process (for example, there is no Finnish equivalent to “probation service,” and so an alternative similar wording was found). Researchers were also asked to comment on whether the questions and themes worked well in their country, as suggested in the ESS work (Fitzgerald et al., 2009).

Headline Results

Between 16 and 30 interviews were carried out in the four countries in June and July 2009 (total n = 94). Each country made efforts to include participants from a wide range of age groups, although, as shown in Table 1, there was variation between countries in terms of the proportion of respondents from different age groups. In Finland and Italy, a 50:50 gender split was achieved, whereas in the UK and Bulgaria, more women than men were interviewed. The number of errors coded per question ranged from 6 to 47, with a mean of 21 errors per question (Table 2).

Table 1. Demographic Profile of Respondents (numbers)

Table 2. Number of Errors Coded per Question

Number of Errors Coded

Number of Questions Affected

A total of 440 errors were coded by researchers (averaging 4.7 per participant, Table 3). By far the greatest number of errors were coded by Bulgaria (262 errors, accounting for 59.5% of total errors; 9.4 errors per participant), followed by Finland, the UK, and then Italy. When these results were examined in more detail, it appeared that these differences could be ascribed to the Bulgarian researchers classifying different things as “errors” from researchers in other countries. For example, the Bulgarian researchers tended to code a comprehension error when a respondent, asked for the basis of his or her answer, referred to only one type of court or a certain branch of the police. When researchers in the UK and other countries encountered similar responses, they coded them not as an “error” but rather as one of a number of legitimate interpretations of the term “police.”
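The headline rates in this paragraph follow directly from the counts. As a minimal sketch (the variable names are illustrative, and the figure of 28 Bulgarian participants is an assumption taken from the discussion of the probation question later in this article), the arithmetic can be checked as follows:

```python
# Counts reported in the text; variable names are illustrative only.
total_errors = 440
total_participants = 94
bulgaria_errors = 262
# Assumption: 28 Bulgarian participants, as implied by "17 of 28 respondents"
# in the probation-service example.
bulgaria_participants = 28


def rate(numerator: int, denominator: int) -> float:
    """Return numerator / denominator rounded to one decimal place."""
    return round(numerator / denominator, 1)


errors_per_participant = rate(total_errors, total_participants)
bulgaria_share_pct = round(100 * bulgaria_errors / total_errors, 1)
bulgaria_per_participant = rate(bulgaria_errors, bulgaria_participants)

print(errors_per_participant, bulgaria_share_pct, bulgaria_per_participant)
# 4.7 59.5 9.4
```

Comparing the per-participant rate for one country against the overall rate is a quick way to spot the kind of coding difference between research teams described here.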

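Table 3. Types of Errors Recorded by Country

Totals across all countries, No. of errors (%):

Interviewer Difficulties: 7 (2)
Comprehension: 281 (64)
Retrieval Issues: 34 (8)
[column heading not recovered]: 23 (5)
Response Selection: 80 (18)
[column heading not recovered]: 15 (3)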

The largest number of coded errors were comprehension errors, where the respondents had difficulty understanding the survey question. The next most common type of error related to “response selection,” where a respondent had difficulty choosing which answer to select from a scale. This is similar to (but not the same as) judgment errors (where a respondent has difficulty making a numerical estimation). The only country to code for specific interviewer difficulties was the UK, where one particular question contained a phrase that the interviewer pronounced in a way that respondents did not understand (mishearing “all people” for “old people”).

Noncoded Errors

In addition to errors that were coded by the researchers, there were also some question errors that were revealed only when the translated questionnaires were examined in more detail. It was discovered that some of the questions were translated in ways that led them to have somewhat different meanings than intended. For example, one question asked respondents to what extent they thought the police “treat people fairly.” In Finland, this phrase was translated as “treat people equally,” which leads to a subtly different meaning, as it is possible for the police to treat people equally unfairly. Another example of misinterpretation is found in the Italian questionnaire. In Italy, the word “tribunali” refers to both the “court” (in the sense of the physical building) and “judges.” Therefore, some of the questions about the courts were answered solely about either the building or judges.

Through cognitive interviewing, it was discovered that certain questions were more problematic than others. Some questions were very straightforward and produced few issues that would lead to revisions being needed. Other questions, such as those involving lesser known branches of the criminal justice system, did not appear to work in a cross-national context. More importantly, however, key concepts that the surveys were designed to investigate appeared to be understood in the same way in all the countries. For example, it was important for the survey that “fairness” was understood in the same way in all countries, so a probe was designed to examine respondents’ understandings of this term. This probe showed that fairness was understood across Europe as “[b]ehave[ing] similarly to all people they come across in the same sort of context.” The idea of trust in the police being essential to social order was also understood across Europe. For example, in Finland one respondent replied: “[L]ike I earlier said, they represent legal order and stuff, if everybody did like they wanted and stuff, it would lead to some kind of chaos and disorder.”

Examples of Specific Errors

The first question we will examine (How good a job do you think the probation service is doing?) attracted one of the highest numbers of coded errors (47 in total). The majority of these errors were coded in Bulgaria (23) and Italy (15), with the UK and Finland having 4 errors each. Most related to comprehension (n = 42); respondents had trouble understanding what the “probation service” was. This question was designed as a basic measure of the respondent’s confidence in the performance of the probation service and has been used in the British Crime Survey. The most common substantive answers were that the probation service was doing a “good” or a “fair” job (25%, n = 23, for each). However, a significant proportion of respondents selected “don’t know” (n = 36, 40%), and there appeared to be problems with people’s understanding of the term “probation.” For example, there were issues with the initial Finnish translation of “the probation service.” In Bulgaria, few respondents had knowledge of the role of the probation service (17 of 28 respondents answered “don’t know”). The Bulgarian team expanded the definition offered; the lengthier definition was used only when respondents said they did not know what the probation service did. However, it did not appear to help, as 11 of the 19 respondents (58%) who were offered it still could not provide an answer. In Bulgaria, the probation service had only recently been introduced, and the Bulgarian team suggested that this was the reason for the high level of “don’t knows”: even when individuals were offered a definition, they still had nothing on which to base an assessment.

Finland has no probation service, and thus the question was translated as How well do you think the supervision and aftercare of released prisoners work? There are a number of problems with this question. First, “released prisoners” covers both people who have been released on parole and people who have been released “for good.” Second, some respondents answered the question twice, once for “supervision” and once for “aftercare.” Third, this translation did not encompass sentences that are carried out in the community, which also come under the banner of community disposals. Therefore, while few issues were coded with the Finnish respondents’ understanding of the question, this consideration is immaterial, as the question being answered was not equivalent to that employed in other countries.

Researchers in Italy reported that the authority responsible for probation (the “magistratura di sorveglianza”) was not well known. When asked about their understanding of the question, the vast majority of respondents (15 out of 20) were unable to define what they understood by the term “probation service.” So, even though at first this question about probation did not appear to be a problem in Italy, in reality respondents were basing their attitudes on many considerations other than the probation service itself.

Our investigations have suggested that the concept of “probation” does not work in a European context, as there is either no term for probation or the service is not well known. For this question to be used, more thought had to be given to how it was worded and defined. The above question demonstrated the fact that certain services are not well known (or indeed do not exist) across Europe.

In addition to some countries’ criminal justice systems not being organized in such a way as to enable the use of some terms which British or North American criminologists would find unproblematic, other questions contained argot that was so specific in the source language that the entire question did not work in a cross-cultural context. For example, one question (Do you strongly agree, somewhat agree, somewhat disagree or strongly disagree that the police make decisions on who to stop and search based on reasonable suspicion, not prejudice?) was designed to examine the respondent’s moral alignment with the police, and whether respondents approved of and supported the way that the police act. This question taps into ideas involving police legitimacy and whether individuals view the police as being legitimate (Tyler, 2006). Although the majority of responses to the question were positive (57% n = 53), there were a significant number of negative responses (39% n = 36). This was not split equally between countries, with around 90% of Italian respondents agreeing with the above statement.

In the UK, no problems were identified with this question, and only Finland (n = 8) and Bulgaria (n = 13) coded errors. In Bulgaria, problems were encountered with the phrase “stop and search.” Stop and search on foot rarely happens in Bulgaria, other than to specific social groups (one respondent said: “the Roma and certain youth groups provoke suspicion, looking in a certain way”). As such, most people associated stop and search solely with the traffic police. In Finland, there is no expression for “stop and search” because it does not happen, so the Finnish team used the term “police raids.” This term resulted in a few Finnish respondents interpreting the question as referring to raids performed by customs officers rather than police officers, as customs officers also carry out raids when looking for drugs and the like. Either way, “raids” are different from “stop and searches” by a number of criteria, meaning that the question asked in Finland was not equivalent to that asked in the other European countries. Even within an English context, “stop and search” was seen as irrelevant by some respondents, as they saw it as being something that was relevant in London but not in other areas of England and Wales. Therefore, this question required alteration for an international audience.

As well as highlighting some problematic questions, the cognitive interviews also demonstrated that some questions worked well, and that differences in the responses between countries could be ascribed to differences in the countries’ criminal justice systems rather than to different understandings of the question. The following question about the courts’ processing of cases (To what extent do you feel the courts do the following?: Process cases quickly and efficiently) demonstrates this point. This question was one of a series of new questions designed to examine court effectiveness. Overall, the response to this question was largely negative (71% of responses), and there were a larger than usual number of “don’t know” responses (n = 9). The majority of “don’t know” responses were given in England, as some respondents there did not feel they had the necessary knowledge and experience to make a decision. England was the only country that had an equal split between positive and negative responses, and the only two “very effective” responses came from there. Italy was the most negative country, with over 70% of respondents saying “not at all effectively” and no one answering higher than “neither effectively nor ineffectively.” Did this represent a problem with the translation from English to Italian or some other bias in the question design in Italy? It appears to be neither; rather, it reflects the length of time that cases take to come to court in Italy and the fact that cases not heard within a number of years are dismissed. Respondents in all countries described case processing as too slow. For example, F3 in Finland said: “I have the impression that the length of proceedings is too long.” In Italy, however, delay can lead to cases never being heard, which makes the issue far more significant there than in other countries.

Few other errors were coded with this question (none in the UK or Italy, 3 in Bulgaria, and 2 in Finland). In both Finland and Bulgaria, it appeared that some respondents did not feel they knew enough about how the courts worked to answer this question. Other than this, there were no issues with this question, and conceptually this question appeared to work well. People from all the countries seemed to base their answer on what they knew of court procedures:

F5 (Finland): I don’t think that they are quick or efficient. I think “not at all effectively.” I think you have to wait for a pretty long time before anything happens.

I20 (Italy): “In Italy justice is too slow in deciding a case.”

It appears, therefore, that with this question, cross-cultural equivalence in understanding was achieved. All participants understood this question as being about the effectiveness of the courts and whether the speed they took to process cases reduced this effectiveness. The differences observed in countries’ responses to this question are thus linked to genuine differences in opinion rather than to differences in understanding the question.

Conceptual Equivalence of Terms

As Smith (2004, p. 431) notes, one of the key goals in designing cross-national surveys is to produce questions that are functionally equivalent across languages. Hantrais (1999, pp. 104–105) notes that not all concepts travel well across boundaries and that the issue of conceptual equivalence has accordingly increased as this recognition has grown. In the case of Euro-Justis, it was key for us to design survey questions that measured the key aspects of confidence in the Criminal Justice System in a way that was uniformly understood by respondents in each country. This section deals with some of the problems we found when we used the word “respect” in two survey questions:

  • To what extent do you agree with these statements about the police? They treat people with respect.

  • Could you tell me the extent to which you agree with these statements about the courts? The courts treat defendants, victims and witnesses with respect.

The Finnish, UK, and Italian reports suggest broadly similar understandings to one another. The Finnish team wrote in their report that “‘respect’ was understood mainly as police treating people equally in relation to their own position and so that people have a chance to be heard. The police must treat people respectfully and take their problems seriously no matter how disadvantaged they might be. ‘Respectfully’ was also understood as a synonym to friendly and polite.” Similar sentiments were found in both the UK and Italy. This finding conformed with wider evidence about the ways in which respondents in these three countries thought about issues relating to trust, effectiveness, and fairness (all key concepts for the project). The Finnish report concluded:

The conceptual separation of trust in the police as effectiveness and fairness works. The respondents did not have problems when answering these questions and they interpreted the core terms fairly homogenously. Also, there were differences in the respondents’ answers to effectiveness and fairness questions which could indicate that the separation between these two concepts actually exists in people’s minds too.

Further evidence came from excerpts from interviews conducted in England and Italy:

I think how they treat people depends on how they perceive people to be. Because I’m middle class and educated I get treated with respect, but I think if I was a black person or not a very well dressed person or not a very articulate person I don’t think I’d be treated with respect. So I’d say 3. E28.

All people must be treated in the same way, and their finances, political position should be irrelevant, especially in the eyes and actions of those who should guarantee the enforcing of the law. Respect means that also if you committed a crime, you are still a person and the police, in dealing with you, must remember this and act accordingly. I1.

In Bulgaria, however, things were not so straightforward: there was a problem with the use of the term “respect,” which has become established as a foreign loanword in Bulgarian (“respekt”). For some of the respondents, this term is identical to the Bulgarian word “uvazhenie.” The term “respect” when used in Bulgarian, as well as meaning “uvazhenie,” also implies showing regard for the rights of people and equal treatment of all people, regardless of the situation or their personal characteristics. As such, in one question, “respekt” was used in place of “uvazhenie.” During the interviews, only a few respondents discussed the use of “respekt” or “uvazhenie” in relation to this question. In those interviews where there was an evident problem in the respondents’ understanding of the term “respect,” the problem was resolved when the word “uvazhenie” was put in its place.

Unlike this question, where this problem was touched upon by only a few respondents, in another question this was a central problem in the majority of interviews. In translating the questionnaire into Bulgarian, it was again decided to retain the term “respekt” for this question. In many of the interviews, however, the respondents interpreted the term by substituting (when thinking about the question) the Bulgarian version “uvazhenie,” and suggested that one should not use foreign loanwords but the original Bulgarian term. According to some other respondents, the two terms (“respekt” and “uvazhenie”) did not have the same meaning and were not interchangeable. Of key importance here was the impression that “respekt” is a term related to the institution, while “uvazhenie” refers to the person. Hence, use of the term “uvazhenie” in one question was viewed by some respondents as entirely or partially erroneous. The fact that some respondents distinguish between “respekt” and “uvazhenie” in terms of the legitimacy of institutions suggests that it would be more appropriate to use “respekt” in the final questionnaire, as this usage is closer to the research team’s interests in the social institutional roles played by the courts and police. This appeared to lead to respondents interpreting the second question when formulated with “respekt” as referring to observance of civil and human rights in court procedures, including the right to have one’s say (as desired and in keeping with views in other countries).

The Revisions That Were Made

While the key concepts these questions were designed to investigate were understood as they were intended, errors and issues discovered through the cognitive interviewing process led to several revisions of the questions before they were used in surveys. It was realized that little was to be gained by asking about prosecution services or probation services across Europe, so it was decided to drop these questions. Certain other terms such as “stop and search” were found to be too Anglo-specific to use, and others, such as “public disorder,” had specific connotations that led to them being thought of in different ways in different countries. In the UK, public disorder was near-exclusively seen as referring to police response to political protests rather than as meaning “low-level public crime” as it was meant by the question designers. It was also suggested that certain questions should be introduced in different ways to make sure that participants were clear as to the focus of the questions. For example, Bulgaria has no equivalent term to “criminal justice system,” and as such it was necessary to ensure that all questions about courts referred to “criminal courts.” The Bulgarian researchers also noted that people hold very negative attitudes toward the traffic police, and thus questions about the police should be introduced in a way that ensured respondents answered with reference to the nontraffic police.

Do Different Probes Make a Difference?

The degree to which different styles of cognitive interviewing may make a difference to the data produced has been fully assessed. However, little research has been done on the benefits and drawbacks of the different types of probes that might be used in cognitive interviews. The research that has been undertaken is anecdotal and reports impressionistic data from interviewers about respondent behavior and the quality of responses. In Priede et al. (2014), which, again, drew on the Euro-Justis data, albeit just for the UK and Finland, the aim was to provide initial answers to the following questions:

  • Do certain types of probes produce data that are more useful in understanding people’s responses to survey questions than other types of probes?

  • Do some respondents produce more useful data than others?

  • Do different interviewers produce more useful data?

Within the VP technique of cognitive interviewing, respondents are asked survey questions, which are followed by a series of probe questions to understand more about their thought processes as they answered the questions. These probe questions can either be asked after each survey question (concurrent probing—sometimes called “immediate retrospective”) or at the end of the whole survey (retrospective probing) (Willis, 2005). The study reported here used concurrent probing within a researcher as investigator paradigm where the researcher is “guided by intuition, experience and flexibility” (Beatty & Willis, 2007) and allowed to adapt probes and produce new ones through the course of an interview.

Four different types of probes were used within this technique, distinguished by the way they are presented by the researcher: anticipated probes, spontaneous probes, conditional probes (Conrad & Blair, 2009), and emergent probes. In addition, the interviewers used what are termed functional remarks (Beatty et al., 1997). These are remarks uttered by the interviewer in order to keep the respondent talking (for example, “ah-ha,” “I see,” or “that’s interesting”).

Analytic Orientation

Each utterance from the interviewer that prompted a response was coded using a framework set out by Beatty and Willis (2007). Responses to the probes were also coded in terms of their usefulness. This coding scheme ranged from 1 (not very useful) to 3 (very useful):

  1. Not very useful: Little is revealed about how the respondents understood the question or produced their answer. A further probe was required to get the required information.

  2. Useful: Sufficient information is provided about how the question/coding scheme was understood and on what information the respondent based their response. The data were of sufficient quality for the researchers to assess whether there was a problem or not with that question and what the cause of the problem might be.

  3. Very useful: As in (2) above but with additional insights into the question/coding scheme revealed.

In total, 2955 probes were delivered in the 49 interviews. The number of probes per interview ranged from 39 to 80, and the interview length between 20 and 50 minutes. The interviews were open and expansive, and the length of interviews tended to vary because of the length of discussion about various questions. Similarly, the number of probes in a cognitive interview can vary because of the amount of prompting (or clarification) that is required during the interview. In terms of the distribution of probes, it can be seen from Table 4 that anticipated probes dominated.

Table 4. Distribution of Probe Styles

Functional Remarks

The key focus of our work was the usefulness of each type of probe (see Table 5). The majority of responses were judged to be “useful”— that is, that the probe worked as intended and revealed enough information without it being in some way groundbreaking or new to the research. The missing values equate to the times when a probe was asked which produced no codable information.

Table 5. Usefulness of Probes

Not very useful

Very useful

In order to analyze the data, ordinal multilevel modeling was employed since the data on the usefulness of the probes were “nested” within respondents (n = 49), while both probe type and interviewer were used as fixed effects in the model. Multilevel modeling (Raudenbush & Bryk, 1992) is a statistical technique employed when one or more observations share some common source. For example, siblings share parents, schoolchildren share classes, schools, and education authorities, while employees share an employer. Under such conditions, the assumption of independence of observations is violated, and correctly attributing observed variances to the individual (schoolchild) or collective (class) levels becomes impossible. Multilevel modeling solves this conundrum by approaching observations as being nested within higher levels or groups. In our case, as each respondent had answered several different probes, answers were nested within probe types that were nested within respondents. Respondents themselves were nested within three different interviewers.

To explore the impact of differing types of probes on the usefulness of the data generated, we estimated a series of increasingly complex models. First, we estimated a variance components model (M1), which serves as a useful comparison against subsequent models. This gave us an idea of the percentage of variance in the usefulness of the probes at the individual (i.e., respondent) level. All of our models used “very useful” as the reference category of the dependent variable (hence, strictly speaking, we were modeling unusefulness).
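One common way to express the respondent-level share of variance in a multilevel ordinal logit model is the latent-variable variance partition coefficient, in which the level-1 residual variance is fixed at π²/3. The article does not state which formula was used, so the sketch below is an assumption for illustration only; the function names and the example variance are hypothetical.

```python
import math

# Level-1 residual variance of the standard logistic distribution (pi^2 / 3).
# Using it here is an assumption: the article does not name its method.
LOGISTIC_RESIDUAL_VARIANCE = math.pi ** 2 / 3


def variance_partition(respondent_variance: float) -> float:
    """Share of latent variance lying between respondents."""
    return respondent_variance / (respondent_variance + LOGISTIC_RESIDUAL_VARIANCE)


def respondent_variance_for(share: float) -> float:
    """Invert variance_partition: the intercept variance implying a given share."""
    return share * LOGISTIC_RESIDUAL_VARIANCE / (1.0 - share)


# Hypothetical random-intercept variance, not a value reported in the study.
example_variance = 0.5
print(f"{variance_partition(example_variance):.3f}")
```

Under this assumption, a respondent-level share of 9% (the figure reported for M1) would correspond to a random-intercept variance of roughly 0.33 on the latent scale.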

Following M1, a model (M2) was estimated which included information about which interviewer carried out the interview (using Interviewer 1 as the reference category). This tested the extent to which different interviewers may have been associated with greater levels of usefulness of the probes. The third model (M3) added to this characteristics of the respondent (such as their age, gender, and educational status) to assess the extent to which such factors influence the usefulness of data produced by the probes. Being male, aged 16–24, and having completed only compulsory schooling were the respective reference categories for these variables. Finally, the fourth model (M4) added to M3 data relating to the nature of the probe itself (whether it was a spontaneous, conditional, emergent, or functional remark), using anticipated probes as the reference category. Table 6 reports on these models.

Table 6. Modeling Usefulness of Data Produced from Different Types of Probes

Interviewer 2 (ref: Interviewer 1)

Interviewer 3

Gender (ref: male)

Aged 25–40 (ref: 16–24)

Aged 41–55

Aged 56–65

Aged 66 and above

Edn noncompulsory (ref: only compulsory schooling)

Edn 1st degree

Spontaneous (ref: anticipated)

Functional remark

% variance

Notes: (*) = significant at the p < .05 level;

(**) = significant at the p < .01 level;

(***) = significant at the p < .001 level.

The percentage of variance refers to the proportion of variance that is attributable to between-respondent differences. For the base model (M1), this is 9% of the variability. The fact that the inclusion of more terms leads the proportion of variance to decline in smaller increments is not surprising and would be the same in any model (if one entered each variable separately, one would be likely to see each of them make a larger contribution). Additionally, in M4, the variables entered are measured at the question level (not the respondent level). Given that the inclusion of these will explain variability between questions, it is perfectly possible for the proportion of the remaining unexplained variance attributable to differences between respondents actually to go up a little. In short, despite the decreasing proportion of variance that is attributable to between-respondent differences, the models are robust. Taking the inverse log (the exponential) of the coefficient gives us the odds ratio. For the significant variables in our final model, this means that spontaneous probes are 3.36 times less useful than anticipated probes (3.36 being the exponential of the coefficient, 1.212), although because these probes were used so infrequently, there is a high standard error and accordingly a large confidence interval around this estimate. Conversely, conditional probes are much more useful than anticipated probes, with the odds of the probe being un-useful more than 50% lower for the conditional probes. The effects of interviewers 2 and 3 when compared to interviewer 1 were similar (1.7 and 1.8 times less useful). These findings suggest the following:

1. That spontaneous probes were generally less useful than anticipated probes, while conditional probes were more useful than anticipated probes.

This is in line with expectations. First, spontaneous probes were asked when an interviewer had a hunch that there was a problem with the respondents’ answer to a question. Although such probes may sometimes be useful (as a whole new issue or way of understanding how a question operates may be revealed), more often than not, spontaneous probes do not reveal any new insights. As well, the small number of spontaneous probes used may attest to their relative uselessness; in the case of these questions, quite simply the interviewers did not ask spontaneous probes, as there was no need to do so. In cases where a different set of anticipated probes is used, it may be that there is the need to ask more spontaneous probes, which may be found to be of more use.

Conditional probes are likely to be of more use than other probes because they are asked only when a respondent has given a certain answer or exhibited certain behavior, and therefore are more likely to produce information than anticipated probes (which are not tailored to the respondent’s individual situation and thus may not be of relevance to them). This is not, however, to suggest that other forms of probing should be abandoned in favor of conditional probes; to do so would be to miss the richness and variety of information that comes from a varied probing strategy. It may be wise, however, other than in any pilot interviews undertaken, to avoid doing too much in the way of spontaneous probes in order to maintain the focus of the interview and keep the respondent engaged in the topic at hand.

2. That the characteristics of respondents do not appear to be related to the quality of the data they can provide.

Our model tested the impact of gender, age, and education level on the quality of the data generated by the probes. None of these variables were found to be statistically significant factors in explaining the quality of the data generated by the probes. In other words, all sectors of society were equally likely to provide data of use in understanding how survey questions operated and hence could be improved (or left unaltered if no problems were found with the question).

3. That one of our interviewers produced more useful data than the other two.

One of the Finnish interviewers produced more useful data than the other two interviewers. The reason cannot be found in any difference in the probing strategy employed. It was not that this interviewer used more or fewer or different probes from the other two. Equally, because the Finnish interviewers coded a mixture of their own and each other’s interviews, it cannot be suggested that one interviewer was more generous with rating usefulness than the others. We cannot explain why the probes posed by this interviewer produced more useful data than those posed by the other two interviewers.
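The odds ratios quoted in the discussion of Table 6 come from exponentiating log-odds coefficients. The following is a minimal sketch using the one coefficient the text reports (1.212 for spontaneous probes); the function names are illustrative, and nothing here reproduces the authors' software.

```python
import math


def odds_ratio(log_odds_coefficient: float) -> float:
    """Convert a logit-model coefficient to an odds ratio."""
    return math.exp(log_odds_coefficient)


def percent_change_in_odds(log_odds_coefficient: float) -> float:
    """Percentage change in the odds implied by a coefficient (negative = lower)."""
    return 100.0 * (odds_ratio(log_odds_coefficient) - 1.0)


# Coefficient for spontaneous probes as quoted in the text.
spontaneous = 1.212
print(round(odds_ratio(spontaneous), 2))  # 3.36
```

Because the models treat “very useful” as the reference category, an odds ratio above 1 indicates higher odds of a less useful response. A coefficient below ln(0.5), about −0.693, corresponds to odds more than 50% lower, which is how the statement about conditional probes can be read.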


The initial purpose of this study was to develop a “best practice” for the use of verbal probes in cognitive interviews. Drawing on 49 interviews conducted in two countries, we analyzed almost 3,000 verbal probes to discover which types of probes were the most useful and whether certain respondents provided more useful data. With scripted anticipated probes as the reference, scripted conditional probes proved to be of the most value in terms of the usefulness of the data produced. Spontaneous probes were of least use in this regard, while emergent probes and functional remarks were as useful as scripted anticipated probes.

Cognitive interviewing is still, in many respects, in its infancy. While the techniques used have become more systematic and have started to become embedded in the design phases of many well-resourced survey teams, they are not widespread in criminological studies. Nevertheless, there is a growing recognition that such exercises can (and do) lead to improvements in the design of survey questions, both for cross-national surveys and for those surveys undertaken in just one country or culture. Cognitive interviewing should be employed in any survey, and especially in criminological surveys, since processes, institutions, organizations, and terminology (1) are not consistent between countries and (2) vary over time. A good round of cognitive interviews will both improve the questions fielded and help interpret the resulting data, producing insights into how some respondents responded to specific items and the terms within them. Of course, such exercises mean that an additional process needs to be built into the design phase, but this is a small price to pay for the improvements yielded.

Further Reading

Beatty, P. C., & Willis, G. B. (2007). Research synthesis: The practice of cognitive interviewing. Public Opinion Quarterly, 71(2), 287–311.

Farrall, S., Priede, C., Ruuskanen, E., Jokinen, A., Galev, T., Arcai, M., & Maffei, S. (2012). Using cognitive interviews to refine translated survey questions: An example from a cross-national crime survey. International Journal of Social Research Methodology, 15(6), 467–483.

Fitzgerald, R., Widdop, S., Gray, M., & Collins, D. (2009). Testing for equivalence using cross-national cognitive interviewing. Centre for Comparative Social Surveys Working Paper Series. City University London, Centre for Comparative Social Surveys.

Tourangeau, R., Rips, L. J., & Rasinski, K. (2000). The psychology of survey response. Cambridge: Cambridge University Press.

Willis, G. (2015). The practice of cross-cultural cognitive interviewing. Public Opinion Quarterly, 79 (Special Issue), 359–395.

Willis, G. B. (2004). Cognitive interviewing revisited: A useful technique, in theory? In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, & E. Singer (Eds.), Methods for testing and evaluating survey questions. Chichester: Wiley.

Willis, G. B. (2005). Cognitive interviewing: A tool for improving questionnaire design. London: SAGE.

References


Ackerman, J. (2016). Over-reporting intimate partner violence in Australian survey research. British Journal of Criminology, 56(4), 646–667.

Beatty, P. C. (2004). The dynamics of cognitive interviewing. In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, & E. Singer (Eds.), Methods for testing and evaluating survey questions (pp. 45–66). Chichester: Wiley.

Beatty, P. C., Schechter, S., & Whitaker, K. (1997). Variation in cognitive interviewer behavior: Extent and consequences. Paper read at Proceedings on the Section on Survey Research Methods, American Statistical Association.

Beatty, P. C., & Willis, G. B. (2007). Research synthesis: The practice of cognitive interviewing. Public Opinion Quarterly, 71(2), 287–311.

Belson, W. A. (1986). Validity in survey research. Aldershot: Gower.

Blasius, J., & Thiessen, V. (2006). Assessing data quality and construct comparability in cross-national surveys. European Sociological Review, 22, 229–242.

Conrad, F. G., & Blair, J. (2009). Sources of error in cognitive interviews. Public Opinion Quarterly, 73(1), 32–55.

Conrad, F. G., Blair, J., & Tracy, E. (2000). Verbal reports are data! A theoretical approach to cognitive interviews. In Proceedings of the 1999 Federal Committee on Statistical Methodology Research Conference. Washington, DC: Office of Management and Budget.

De Maio, T. J., & Landreth, A. (2004). Do different cognitive interview techniques produce different results? In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, & E. Singer (Eds.), Methods for testing and evaluating survey questions (pp. 89–102). Chichester: Wiley.

Farrall, S., Priede, C., Ruuskanen, E., Jokinen, A., Galev, T., Arcai, M., & Maffei, S. (2012). Using cognitive interviews to refine translated survey questions: An example from a cross-national crime survey. International Journal of Social Research Methodology, 15(6), 467–483.

Ferraro, K. F., & LaGrange, R. (1987). The measurement of fear of crime. Sociological Inquiry, 57(1), 70–101.

Fitzgerald, R., Widdop, S., Gray, M., & Collins, D. (2009). Testing for equivalence using cross-national cognitive interviewing. Centre for Comparative Social Surveys Working Paper Series. City University London, Centre for Comparative Social Surveys.

Goerman, P., Caspar, R., Sha, M., et al. (2007). Census bilingual questionnaire research final round 2. Report US Census Study Series.

Hantrais, L. (1999). Contextualisation in cross-national comparative research. International Journal of Social Research Methodology, 2(2), 93–108.

Heath, A., Martin, J., & Spreckelsen, T. (2009). Cross-national comparability of survey attitude measures. International Journal of Public Opinion Research, 21(3), 293–315.

King, G., Murray, C. J. L., Salomon, J. A., & Tandon, A. (2004). Enhancing the validity and cross-cultural comparability of measurement in survey research. American Political Science Review, 98(1), 191–207.

Medina, T. R., Smith, S. N., & Long, S. J. (2009). Measurement models matter: Implicit assumptions and cross-national research. International Journal of Public Opinion Research, 21(3), 333–361.

Priede, C., & Farrall, S. (2011). Comparing results from different styles of cognitive interviewing: “Verbal Probing” vs. “Thinking Aloud.” International Journal of Social Research Methodology, 14(4), 271–287.

Priede, C., Jokinen, A., Ruuskanen, E., & Farrall, S. (2014). Which probes are most useful when undertaking cognitive interviews? International Journal of Social Research Methodology, 17(5), 559–568.

Raudenbush, S., & Bryk, A. (1992). Hierarchical linear models: Applications and data analysis methods. London: SAGE.

Scherer, K. R., Wranik, T., Sangsue, J., Tran, V., & Scherer, U. (2004). Emotions in everyday life: Probability of occurrence, risk factors, appraisal and reaction patterns. Social Science Information, 43(4), 499–570.

Schuman, H., & Presser, S. (1981). Questions and answers in attitude surveys. London: SAGE.

Smith, T. W. (2004). Developing and evaluating cross-national survey instruments. In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, & E. Singer (Eds.), Methods for testing and evaluating survey questions (pp. 431–452). Chichester: Wiley.

Smith, T. W. (2009). Editorial: Comparative survey research. International Journal of Public Opinion Research, 21(3), 267–270.

Thompson, M. E. (2008). International surveys: Motives and methodologies. Survey Methodology, 34(2), 131–141.

Tourangeau, R., Rips, L. J., & Rasinski, K. (2000). The psychology of survey response. Cambridge: Cambridge University Press.

Tyler, T. (2006). Why people obey the law. Princeton: Princeton University Press.

Willis, G. B. (2004). Cognitive interviewing revisited: A useful technique, in theory? In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, & E. Singer (Eds.), Methods for testing and evaluating survey questions (pp. 23–44). Chichester: Wiley.

Willis, G. B. (2005). Cognitive interviewing: A tool for improving questionnaire design. London: SAGE.

Willis, G. (2015). The practice of cross-cultural cognitive interviewing. Public Opinion Quarterly, 79 (Special Issue), 359–395.

Willis, G., & Zahnd, E. (2007). Questionnaire design from a cross-cultural perspective. Journal of Health Care for the Poor and Underserved, 18(4 Suppl.), 197–217.

Notes


(1.) Unconscious processes and biases may, of course, be present.

(2.) See Hantrais (1999, pp. 94–97) on the epistemological stance that comparative researchers have taken and the extent to which such research is “cross-national” or “international comparative research” (pp. 97–99).

(3.) Although it is not discussed herein, readers are referred to the work of Gary King (King et al., 2004). King’s work attempts to design survey questions using vignettes to ensure cross-national equivalence of meaning.

(4.) The European Social Survey is an academically led repeat cross-sectional social survey that has been running since 2001. It has been designed to chart and explore shifts in the attitudes, beliefs, and behavior patterns of European citizens. Funded through the European Commission’s Framework Programmes, the European Science Foundation, and national funding bodies in each country, it won the Descartes Prize for Research in 2005. See