Using Card Sorting To Test Information Architecture
Jeff Sauro • March 19, 2013
Card sorting is a popular method for improving the organization of websites and software.
There are several software packages that allow you to conduct card sorting quickly and remotely, including solutions from UserZoom and Optimal Workshop. So it shouldn't be surprising that around half of UX practitioners reported using the method in 2011.
We'll cover Card Sorting in detail at the Denver UX boot camp, but here are some questions and answers to help make the method more approachable.
When do you use Card Sorting?
When you don't know what to label a group or what items go together, you've got a good case for running a card sort. There are two types of card sorts: open and closed. Open sorts allow users to assign the labels to the groups and vary the number of groups. Closed sorts have a fixed number of categories and established group names. I tend to favor open sorts as they allow users more freedom in assigning groups and provide valuable insights on naming and labels. Closed card sorts are best used when you have a fixed number of categories or can't change the labels because of design or organization constraints.
Card sorting is a great method to use after you've run a usability test or tree test and identified the navigation as a key pain point. Card sorting can be used for all platforms, as intuitive navigation is essential on websites, software, TV menus, and almost anything with more than one feature. What's nice is that you can have participants perform a card sort on desktop software regardless of the platform and still get valuable insights into the intuitiveness of menus and the application organization.
What are the pros and cons of conducting an online unmoderated card sort vs. an in-person study?
The biggest advantage of an unmoderated IA study is the same as for an unmoderated usability test. You are able to generate much more precise metrics, as you can run more participants for less cost and in less time. The cost is losing the utterances, the pauses, and the ability to probe users about the problems they encounter, as well as the chance to identify mismatches between the design model and the user's mental model.
What's more, a good moderator can probe a user on difficulties and encourage them to think aloud. Not all of this is lost with unmoderated studies. The inclusion of questions regarding which items were difficult and specific questions about categories or options helps to offset the loss. As with most UX methods, it's not an "either/or" choice. The approach we like to take is to run a few users in-person (or moderated remote) to get some deeper insights, and then run more users in an unmoderated test to see how much the findings hold up and to generate more precise estimates of findability.
How do you select which items to test in a card sort?
This is the same question you have with a standard usability test when selecting tasks. There are usually hundreds to thousands of items or use cases, often spread across dozens of departments, that you want or need to test. There are a few ways to handle these situations. One solution is to use the data from a tree test to identify the poor areas and focus on those. If you don't have that data or you have too many poor areas, then consider running a top-tasks study or looking at log data in order to understand the popular sections. If your taxonomy is reasonably small (fewer than 100 items), including all of them seems reasonable. Finally, picking a decent cross section from across multiple branches in your navigation to represent a group also works.
Can you describe the dendrogram card sorting results? How do you interpret and use dendrograms?
A dendrogram is the quintessential cluster analysis output. It is a visualization of the distances between nodes using a mathematical technique based on something called a similarity matrix. The more times an item is sorted together with another item, the more similar they are. They then appear closer in proximity in the dendrogram. The figure below is a selection of a dendrogram from an open card sort on Target.com.
Figure 1: Part of a dendrogram from an open card sort of 40 items on Target.com. Items next to each other within the same color band tended to be sorted together by the 75 participants.
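To make the similarity matrix less abstract, here is a minimal sketch of how one could be built. It is not the computation any particular card sorting package uses. The item names, the three participants, and the function name are all hypothetical. Similarity is assumed to be the share of participants who placed two items in the same group.

```python
from collections import Counter
from itertools import combinations

def similarity_matrix(sorts):
    """Return {(item_a, item_b): share of participants who grouped them together}.

    sorts: one entry per participant; each entry is a list of groups,
    and each group is a list of item names. Pairs are stored in
    alphabetical order so each pair has a single key.
    """
    pair_counts = Counter()
    for groups in sorts:
        for group in groups:
            for a, b in combinations(sorted(set(group)), 2):
                pair_counts[(a, b)] += 1
    n = len(sorts)
    return {pair: count / n for pair, count in pair_counts.items()}

# Hypothetical data: three participants sorting four items.
sorts = [
    [["Socks", "Shirts"], ["Laptop", "Headphones"]],
    [["Socks", "Shirts", "Headphones"], ["Laptop"]],
    [["Socks", "Shirts"], ["Laptop", "Headphones"]],
]

sim = similarity_matrix(sorts)
for pair, share in sorted(sim.items(), key=lambda kv: -kv[1]):
    print(pair, round(share, 2))
```

Pairs that never appear together are simply absent (similarity of zero). A hierarchical clustering routine would then typically work from a distance such as 1 minus the similarity to draw the dendrogram.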
How many items would you recommend testing per card sort?
It's difficult to ask a user to be vigilant for more than 45-60 minutes in an unmoderated study. A large number of items can also be very daunting and can lead to more error or carelessness. The total time will depend on the complexity of the categories and how comfortable users are with the domain and the software they use to sort. In two recent studies from eCommerce websites, the median time for users to sort 40 items into 10 categories was 11 minutes, and 22 minutes for the sorting of 100 items. This also included a few minutes for users to answer questions about items they had problems with. Keep in mind that these are median times, so half the participants took longer than the 11 and 22 minutes.
How many participants do you suggest for a card sort?
For any sample size question, it initially comes down to what the outcome metric is. For card sorts, there are usually two outcomes: the dendrogram, and to a lesser extent the percentage of users labeling a category with a similar name and the most difficult items to sort.
The latter two are proportions and we use the table below based on a 95% level of confidence. For example, to have a margin of error of around 10% you should plan on having around 93 participants.
Table 1: Sample size for proportions used in card sorting (95% confidence and assumes percentages of 50%).
|Sample Size||Margin of Error (+/-)|
The computations are explained in Chapter 3 and Chapter 6 of Quantifying the User Experience.
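As a rough illustration of where a figure like 93 can come from, the sketch below assumes an adjusted-Wald style calculation at 95% confidence and a proportion of 50%. That assumption is ours and is not stated above, though it does reproduce the 93 participants for a 10% margin. The function names are hypothetical.

```python
import math

Z_95 = 1.96  # two-sided critical value for 95% confidence

def sample_size_for_margin(margin, z=Z_95, p=0.5):
    """Participants needed for a given margin of error (as a proportion).

    Standard sample size for a proportion, minus z squared to account
    for the adjusted-Wald correction (an assumption of this sketch).
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2 - z ** 2)

def margin_for_sample_size(n, z=Z_95):
    """Approximate margin of error at p = 0.5 for n participants."""
    return z * math.sqrt(0.25 / (n + z ** 2))

needed = sample_size_for_margin(0.10)
print(needed)
print(round(margin_for_sample_size(needed), 3))
```

Running the second function over a few candidate sample sizes is a quick way to see how slowly the margin shrinks as participants are added.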
Dendrograms are pricklier to nail down, and there is little data to help. Tullis and Wood (2004) provided data from one study in which Fidelity's employees sorted the usability department's taxonomy. They saw diminishing returns and stable patterns with sample sizes between 20 and 30. Interpret this with caution, as complex taxonomies may generate different results.
Do you have any strategies for incorporating follow up survey questions with card sorting exercises? How do these help to supplement the card sorting results?
We like to ask which items were the most difficult to sort after the study, and get comments about the categories and general experience. This is the opportunity to make up for not having direct interaction with a user. It can provide more reasons behind the more precise numbers.
Can you share with us your process of how you go from the results to a report or presentation?
First, we examine the raw dendrogram and then adjust the number of clusters. This is a bit of an art, as you want to minimize single items (runts) that don't hang well together. There are objective methods for determining what a grouping is, but these often require just as much judgment and can be arbitrarily applied.
For a closed sort, we count the number of times each item was associated with each grouping. Items are almost always associated with multiple groups, so we show at least the three most popular choices. For example, the table below shows items from a card sort on Target.com's navigation (data is not proprietary). The first two items have high first-choice agreement (65% and 63%, respectively) and the bottom two items have low first-choice agreement (27% and 23%, respectively). Notice that the same percentage of users placed "Tiddliwinks Blue Dot Plush" in "Toys" and "Kids."
Table 2: Example of common item groupings from the Target.com Open Card sort. For example, most users placed the Black Griffin Case for Apple iPod Touch in an "Electronics" group with "Entertainment" as a distant second. The proportions were more evenly split for the latter two items in the table.
Black Griffin Case for Apple iPod touch
PC Diablo 3 Computer Game
Marvel POWER ATTACK AVENGER: IRONMAN
Tiddliwinks Blue Dot Plush Monkey
For an Open Card sort, we do the same grouping, but we first take the raw category names users generate and sort them into similar categories. We do this by hand, which takes time, but it's worth it. For example, we group "Men's wear," "Men's Dept" and "Men's" all into the same grouping of "Men." The name we'd pick would come from whichever label was used the most in an aggregated category. This allows us to associate items with groups, as shown in the table above.
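The bookkeeping after the hand-coding step can be sketched in a few lines. Everything here is hypothetical and for illustration only: the label mapping, the items, the placements, and the function names. The mapping itself still has to be built by a person. The code only tallies the result.

```python
from collections import Counter, defaultdict

# Hand-built mapping from raw participant labels to an aggregated category.
LABEL_TO_CATEGORY = {
    "Men's wear": "men",
    "Men's Dept": "men",
    "Men's": "men",
    "Toys": "toys",
    "Toys & Games": "toys",
}

# One (item, raw label) pair per participant placement (made-up data).
placements = [
    ("Crew Socks", "Men's"),
    ("Crew Socks", "Men's wear"),
    ("Crew Socks", "Men's"),
    ("Crew Socks", "Toys"),
    ("Plush Monkey", "Toys"),
    ("Plush Monkey", "Toys & Games"),
]

def display_names(placements, mapping):
    """Name each aggregated category after its most-used raw label."""
    label_counts = defaultdict(Counter)
    for _, raw_label in placements:
        label_counts[mapping[raw_label]][raw_label] += 1
    return {cat: counts.most_common(1)[0][0] for cat, counts in label_counts.items()}

def top_groups(placements, mapping, k=3):
    """For each item, the k most common categories and their share of placements."""
    per_item = defaultdict(Counter)
    for item, raw_label in placements:
        per_item[item][mapping[raw_label]] += 1
    result = {}
    for item, counts in per_item.items():
        total = sum(counts.values())
        result[item] = [(cat, n / total) for cat, n in counts.most_common(k)]
    return result

names = display_names(placements, LABEL_TO_CATEGORY)
top = top_groups(placements, LABEL_TO_CATEGORY)
print(names)
print(top)
```

The same `top_groups` tally works for a closed sort if the mapping simply maps each fixed category name to itself.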
Wherever possible, we generate confidence intervals around whatever we can, such as the items that were the most difficult to sort or the percentage of users who came up with a category. The confidence intervals show the level of precision in our estimates. When a stakeholder asks if the sample size is large enough, we refer to the width of the confidence intervals, and ask whether we'd make a different decision about findability if an item was at the low or higher end of the interval.
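As one way to compute such an interval, the sketch below uses an adjusted-Wald interval for a proportion at 95% confidence. The choice of method is an assumption of this sketch, and the count of 20 out of 75 participants flagging an item as difficult is made up for illustration.

```python
import math

def adjusted_wald_interval(successes, n, z=1.96):
    """Adjusted-Wald confidence interval for a proportion.

    Adds z^2 / 2 successes and z^2 / 2 failures before computing a
    standard Wald interval, then clips the result to [0, 1].
    """
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

# Hypothetical: 20 of 75 participants said an item was difficult to sort.
low, high = adjusted_wald_interval(20, 75)
print(f"{low:.1%} to {high:.1%}")
```

If the decision about an item would be the same at both ends of the printed range, the sample is probably large enough for that decision.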