Determine the appropriate number of segments
The goal of Cluster Analysis is to group respondents (e.g., consumers) into segments based on needs, benefits, and/or behaviors. The tool tries to achieve this goal by looking for respondents that are similar, putting them together in a cluster or segment, and separating them from other, dissimilar, respondents. The researcher compares the segments and provides a descriptive label for each.
First, go to the Data > Manage tab, select examples from the Load data of type dropdown, and press the Load button. Then select the toothpaste dataset. The dataset contains information from 60 consumers who were asked to respond to six questions to determine their attitudes towards toothpaste. The scores shown for variables v1-v6 indicate the level of agreement with statements on a 7-point scale where 1 = strongly disagree and 7 = strongly agree.
We first establish the number of segments/clusters in the data using Hierarchical Cluster Analysis. Ward’s method with Squared Euclidean distance is often used to determine how (dis)similar individuals are. These are the default values in Radiant but they can be changed if desired. The most important information from this analysis is provided by the plots, so we will focus our attention there.
Select variables v1 through v6 in the Variables box and click the Estimate button or press CTRL-enter (CMD-enter on mac) to generate results. Note that Hierarchical Cluster Analysis can be time-consuming and memory intensive for large datasets. If your dataset has more than 5,000 observations, make sure to increase the value in the Max cases input to the appropriate number. The Dendrogram shown below provides information to help you determine the most appropriate number of clusters (or segments).
Hierarchical cluster analysis starts with many segments, as many as there are respondents, and in a stepwise (i.e., hierarchical) process merges the most similar respondents or groups until only one segment remains. To determine the appropriate number of segments, look for a jump along the vertical axis of the plot. At that point two dissimilar segments have been joined. The measure along the vertical axis indicates the level of heterogeneity within the segments that have been formed. The purpose of clustering is to create homogeneous groups and to avoid segments with heterogeneous characteristics, needs, etc. Since the most obvious jump in heterogeneity occurs when we go from 3 to 2 segments, we choose 3 segments (i.e., we avoid creating a heterogeneous segment).
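The merging process and the "jump" in heterogeneity can also be inspected outside the app. The following is a minimal sketch in base R, not Radiant's own code. It uses simulated stand-in data (not the toothpaste data), and it assumes that hclust with method = "ward.D" on squared Euclidean distances is a reasonable approximation of the settings described above.

```r
# Simulated stand-in for survey data: 60 respondents, 6 items, 3 latent groups.
# This is NOT the toothpaste data; it only illustrates the mechanics.
set.seed(1234)
centers <- rbind(c(6, 2, 6, 2, 2, 6), c(2, 6, 2, 6, 6, 2), c(4, 4, 4, 4, 4, 4))
group <- rep(1:3, each = 20)
ratings <- centers[group, ] + matrix(rnorm(60 * 6, sd = 0.5), nrow = 60)
colnames(ratings) <- paste0("v", 1:6)

# Standardize the variables, then compute squared Euclidean distances
d <- dist(scale(ratings))^2

# Ward's method, paired here with the squared distances (an assumption about
# how the defaults described above map onto stats::hclust)
hc <- hclust(d, method = "ward.D")

# hc$height holds the heterogeneity at each merge, from the first merge to the
# last. The final element is the merge from 2 segments to 1, the one before it
# the merge from 3 to 2, and so on.
n <- nrow(ratings)
last_merges <- data.frame(
  segments_before = 5:2,
  segments_after = 4:1,
  height = hc$height[(n - 4):(n - 1)]
)
print(last_merges)

# Draw the dendrogram; a large vertical gap marks a merge of dissimilar groups
plot(as.dendrogram(hc), leaflab = "none", ylab = "Within-cluster heterogeneity")
```

A large increase in height between two consecutive rows of last_merges corresponds to the jump along the vertical axis of the dendrogram.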
Another plot that can be used to determine the number of segments is a scree plot. This is a plot of the within-cluster heterogeneity on the vertical axis and the number of segments on the horizontal axis. Again, Hierarchical cluster analysis starts with many segments and groups respondents together until only one segment is left. The scree plot is created by selecting Scree (and Change) from the Plot(s) dropdown menu. If Plot cutoff is set to 0 we see results for all possible cluster solutions. To make the plot easier to evaluate, we can set Plot cutoff to, for example, 0.05 (i.e., show only solutions that have Within-cluster heterogeneity above 5%).
Reading the plot from left to right we see that within-segment heterogeneity increases sharply when we move from 3 to 2 segments. This is also clear from the Change in within-cluster heterogeneity plot (i.e., Change). To avoid creating a heterogeneous segment we, again, choose 3 segments. Now that we have determined the appropriate number of segments to extract we can use either Cluster > Hierarchical or Cluster > K-clustering to generate the final cluster solution.
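The scree and change values discussed above can be approximated by hand from the merge heights. The sketch below is an illustration under stated assumptions, not Radiant's implementation: the data are simulated, heterogeneity is expressed relative to the one-segment solution, and the names scree, change, and cutoff are chosen here for illustration only.

```r
# Simulated stand-in data (NOT the toothpaste data): 60 respondents, 6 items
set.seed(1234)
centers <- rbind(c(6, 2, 6, 2, 2, 6), c(2, 6, 2, 6, 6, 2), c(4, 4, 4, 4, 4, 4))
ratings <- centers[rep(1:3, each = 20), ] + matrix(rnorm(60 * 6, sd = 0.5), nrow = 60)
hc <- hclust(dist(scale(ratings))^2, method = "ward.D")

# Reverse the merge heights so that element k is the height of the merge
# that leaves k segments, then scale by the largest value (assumption)
height <- rev(hc$height)
scree <- data.frame(
  segments = seq_along(height),
  heterogeneity = height / max(height)
)

# Keep only solutions with relative heterogeneity above the cutoff
cutoff <- 0.05
scree <- scree[scree$heterogeneity > cutoff, ]
print(scree)

# Relative change in heterogeneity when moving from k + 1 to k segments
k <- nrow(scree)
change <- data.frame(
  step = paste0(scree$segments[-1], "-", scree$segments[-k]),
  rel_change = (scree$heterogeneity[-k] - scree$heterogeneity[-1]) /
    scree$heterogeneity[-1]
)
print(change)

# Scree plot with the number of segments decreasing from left to right
plot(
  scree$segments, scree$heterogeneity, type = "b",
  xlim = rev(range(scree$segments)),
  xlab = "Number of segments", ylab = "Within-cluster heterogeneity"
)
```

Setting cutoff to 0 keeps every possible solution, which mirrors the behavior described for Plot cutoff above.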
To download the plots click the download button on the top-right of the screen.
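To turn off standardization of the variables, make sure the Standardize box is un-checked.
To store cluster membership in the dataset, select the number of clusters to create in Number of clusters, then provide a name for the variable that will contain cluster assignment information, and finally, press the Store button.
If the data include variables of type factor, gower distance will automatically be selected. For more information on the gower distance and R-package see the package vignette.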
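These options have rough equivalents in base R. The sketch below assumes simulated stand-in data, a hypothetical variable name (hclus3) for the stored cluster membership, and, for mixed data, Gower dissimilarities from cluster::daisy as a swapped-in implementation. It is not necessarily the package or code Radiant relies on.

```r
# Simulated stand-in data (NOT the toothpaste data): 60 respondents, 6 items
set.seed(1234)
centers <- rbind(c(6, 2, 6, 2, 2, 6), c(2, 6, 2, 6, 6, 2), c(4, 4, 4, 4, 4, 4))
ratings <- centers[rep(1:3, each = 20), ] + matrix(rnorm(60 * 6, sd = 0.5), nrow = 60)
survey <- as.data.frame(ratings)
names(survey) <- paste0("v", 1:6)

# With and without standardizing the variables before computing distances
hc_std <- hclust(dist(scale(survey))^2, method = "ward.D")
hc_raw <- hclust(dist(survey)^2, method = "ward.D")

# Store cluster membership for a 3-segment solution in a new variable;
# the name "hclus3" is a hypothetical choice
survey$hclus3 <- factor(cutree(hc_std, k = 3))
print(table(survey$hclus3))

# Mixed numeric and factor variables: Gower dissimilarities via cluster::daisy
# (swapped-in implementation; skipped if the cluster package is unavailable)
if (requireNamespace("cluster", quietly = TRUE)) {
  mixed <- data.frame(
    survey[, c("v1", "v2")],
    brand = factor(sample(c("a", "b"), 60, replace = TRUE)) # made-up factor
  )
  d_gower <- cluster::daisy(mixed, metric = "gower")
  hc_gower <- hclust(d_gower, method = "average")
}
```

The linkage method used with the Gower dissimilarities ("average") is an arbitrary choice for this sketch.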
Add code to Report > Rmd to (re)create the analysis by clicking the icon on the bottom left of your screen or by pressing ALT-enter on your keyboard.
If a plot was created it can be customized using ggplot2 commands or with gridExtra. See example below and Data > Visualize for details.
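A sketch of what such a customization might look like is shown here. It assumes that the radiant.multivariate and ggplot2 packages are installed, that the toothpaste data ship with radiant.multivariate, and that hclus() and its plot method accept the arguments shown (vars, plots, custom). These details should be checked against the package documentation before use.

```r
# Runs only when both packages are available; otherwise nothing is executed
if (requireNamespace("radiant.multivariate", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  library(radiant.multivariate)
  library(ggplot2)

  # Estimate the hierarchical cluster solution on v1 through v6
  # (argument names are assumptions, see the note above)
  result <- hclus(toothpaste, vars = "v1:v6")

  # custom = TRUE is expected to return a ggplot object that can be
  # extended with regular ggplot2 commands such as labs()
  p <- plot(result, plots = "scree", custom = TRUE) +
    labs(caption = "Data used from ...")
  print(p)
}
```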
To add, for example, a sub-title to a dendrogram plot use title(sub = "Data used from ..."). See the R graphics documentation for additional information.
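As a small illustration with base graphics, the sketch below draws a dendrogram and then adds a sub-title. It uses the built-in USArrests data as a stand-in and is not tied to how Radiant draws its own dendrogram.

```r
# Built-in data used as a stand-in; standardize and cluster as before
hc <- hclust(dist(scale(USArrests))^2, method = "ward.D")

# Plot as a dendrogram object (no default sub-title is drawn), then add one
plot(as.dendrogram(hc), leaflab = "none", ylab = "Within-cluster heterogeneity")
title(sub = "Data used from ...")
```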
For an overview of related R-functions used by Radiant to conduct cluster analysis see Multivariate > Cluster.
The key function from the stats package used in the hclus tool is hclust.