Class 9: Candy Mini-Project

Exploratory Analysis of Halloween Candy

Author
Affiliation

UC San Diego

Published

October 26, 2021

Modified

February 2, 2026

1 Background

Halloween candy bucket

In this mini-project, you will explore FiveThirtyEight’s Halloween Candy dataset. FiveThirtyEight, sometimes rendered as just 538, is an American website that focuses mostly on opinion poll analysis, politics, economics, and sports blogging. They recently ran a rather large poll to determine which candy their readers like best. From their website: “While we don’t know who exactly voted, we do know this: 8,371 different IP addresses voted on about 269,000 randomly generated candy matchups”.

So what is the top ranked snack-sized Halloween candy? What made some candies more desirable than others? Was it price? Maybe it was just sugar content? Were they chocolate? Did they contain peanuts or almonds? How about crisped rice or other biscuit-esque component, like a Kit Kat or malted milk ball? Was it fruit flavored? Was it made of hard candy, like a lollipop or a strawberry bon bon? Was there nougat? What even is nougat? I know I like nougat, but I still have no real clue what the damn thing is.

Your task is to explore their candy dataset to find out answers to these types of questions - but most of all your job is to have fun, learn by doing hands on data analysis, and hopefully make this type of analysis less frightening for the future! Let’s get started.

1.1 Learning Objectives

Building on your work with PCA in the previous labs (UK Foods, RNA-seq and beast cancer biopsy analysis), this mini-project gives you the opportunity to apply your data analysis skills to a new, tastier dataset. By the end of this lab, you should be able to:

  • Import and explore an unfamiliar dataset to quickly characterize its structure and identify variables requiring special handling;
  • Create and iteratively refine effective visualizations including ranked bar plots and scatter plots with ggrepel() labels, and interactive plotly() graphics to explore and communicate patterns in data;
  • Generate and interpret correlation matrices to identify relationships between variables;
  • Conduct principal component analysis (PCA) with appropriate scaling, interpret variance explained, and create score plots (PC1 vs PC2) to reveal clustering structure in high-dimensional data;
  • Interpret PCA loadings to understand which original variables drive separation along principal components, connecting these back to the correlation structure;
  • Synthesize multiple analyses (exploration, correlation, PCA) to draw substantive, data-driven conclusions.

2 Importing candy data

First things first, let’s get the data from the FiveThirtyEight GitHub repo. You can either read from the URL directly or download this candy-data.csv file and place it in your project directory. Either way we need to load it up with read.csv() and inspect the data to see exactly what we’re dealing with.

candy_file <- "candy-data.csv"

candy = ____(____, row.names=___)
head(candy)

2.1 What is in the dataset?

The dataset includes all sorts of information about different kinds of candy. For example, is a candy chocolaty? Does it have nougat? How does its cost compare to other candies? How many people prefer one candy over another?

According to 538 the columns in the dataset include:

  • chocolate: Does it contain chocolate?
  • fruity: Is it fruit flavored?
  • caramel: Is there caramel in the candy?
  • peanutyalmondy: Does it contain peanuts, peanut butter or almonds?
  • nougat: Does it contain nougat?
  • crispedricewafer: Does it contain crisped rice, wafers, or a cookie component?
  • hard: Is it a hard candy?
  • bar: Is it a candy bar?
  • pluribus: Is it one of many candies in a bag or box?
  • sugarpercent: The percentile of sugar it falls under within the data set.
  • pricepercent: The unit price percentile compared to the rest of the set.
  • winpercent: The overall win percentage according to 269,000 matchups (more on this in a moment).

We will take a whirlwind tour of this dataset and in the process answer the questions highlighted in red throughout this page that aim to guide your exploration process. We will then wrap up by trying Principal Component Analysis (PCA) on this dataset to get yet more experience with this important multivariate method. It will yield a kind of “Map of Halloween Candy Space”. How cool is that! Let’s explore…

  • Q1. How many different candy types are in this dataset?
  • Q2. How many fruity candy types are in the dataset?

The functions dim(), nrow(), table() and sum() may be useful for answering the first 2 questions.

2.2 What is your favorite candy?

One of the most interesting variables in the dataset is winpercent. For a given candy this value is the percentage of people who prefer this candy over another randomly chosen candy from the dataset (what 538 term a matchup). Higher values indicate a more popular candy.

We can find the winpercent value for Twix by using its name to access the corresponding row of the dataset. This is because the dataset has each candy name as rownames (recall that we set this when we imported the original CSV file). For example the code for Twix is:

candy["Twix", ]$winpercent
[1] 81.64291

An alternate way to do this is with the help of the dplyr package that we have seen previously:

library(dplyr)

candy |> 
  filter(row.names(candy)=="Twix") |> 
  select(winpercent)
  • Q3. What is your favorite candy (other than Twix) in the dataset and what is it’s winpercent value?
  • Q4. What is the winpercent value for “Kit Kat”?
  • Q5. What is the winpercent value for “Tootsie Roll Snack Bars”?

Side-note: the skimr::skim() function

There is a useful skim() function in the skimr package that can help give you a quick overview of a given dataset. Let’s install this package and try it on our candy data.

library("skimr")
skim(candy)
Click to show output
Data summary
Name candy
Number of rows 85
Number of columns 12
_______________________
Column type frequency:
numeric 12
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
chocolate 0 1 0.44 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
fruity 0 1 0.45 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
caramel 0 1 0.16 0.37 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
peanutyalmondy 0 1 0.16 0.37 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
nougat 0 1 0.08 0.28 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
crispedricewafer 0 1 0.08 0.28 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hard 0 1 0.18 0.38 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
bar 0 1 0.25 0.43 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
pluribus 0 1 0.52 0.50 0.00 0.00 1.00 1.00 1.00 ▇▁▁▁▇
sugarpercent 0 1 0.48 0.28 0.01 0.22 0.47 0.73 0.99 ▇▇▇▇▆
pricepercent 0 1 0.47 0.29 0.01 0.26 0.47 0.65 0.98 ▇▇▇▇▆
winpercent 0 1 50.32 14.71 22.45 39.14 47.83 59.86 84.18 ▃▇▆▅▂

From your use of the skim() function use the output to answer the following:

  • Q6. Is there any variable/column that looks to be on a different scale to the majority of the other columns in the dataset?
  • Q7. What do you think a zero and one represent for the candy$chocolate column?

Look at the “Variable type” print out from the skim() function. Most variables (i.e. columns) are on the zero to one scale but not all. Some columns such as chocolate are exclusively either zero or one values.

3 Exploratory analysis

A good place to start any exploratory analysis is with a histogram. You can do this most easily with the base R function hist(). Alternatively, you can use ggplot() with geom_hist(). Either works well in this case and (as always) its your choice.

  • Q8. Plot a histogram of winpercent values using both base R an ggplot2.
  • Q9. Is the distribution of winpercent values symmetrical?
  • Q10. Is the center of the distribution above or below 50%?
  • Q11. On average is chocolate candy higher or lower ranked than fruit candy?
  • Q12. Is this difference statistically significant?

Hint: The chocolate, fruity, nougat etc. columns indicate if a given candy has this feature (i.e. one if it has nougat, zero if it does not etc.). We can turn these into logical (a.k.a. TRUE/FALSE) values with the as.logical() function. We can then use this logical vector to access the corresponding candy rows (those with TRUE values). For example to get the winpercent values for all nougat containing candy we can use the code: candy$winpercent[as.logical(candy$nougat)]. In addition the functions mean() and t.test() should help you answer the last two questions here.


    Welch Two Sample t-test

data:  chocolate and fruity
t = 6.2582, df = 68.882, p-value = 2.871e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 11.44563 22.15795
sample estimates:
mean of x mean of y 
 60.92153  44.11974 

4 Overall Candy Rankings

Let’s use the base R order() function together with head() to sort the whole dataset by winpercent. Or if you have been getting into the tidyverse and the dplyr package you can use the arrange() function together with head() to do the same thing and answer the following questions:

  • Q13. What are the five least liked candy types in this set?
  • Q14. What are the top 5 all time favorite candy types out of this set?

Hint: Using base R we could use head(candy[order(candy$winpercent),], n=5), whilst using dplyr we have: candy |> arrange(winpercent) |> head(5). Which approach do you prefer and why?

To examine more of the dataset in this vain we can make a barplot to visualize the overall rankings. We will use an iterative approach to building a useful visualization by getting a rough starting plot and then refining and adding useful details in a stepwise process.

  • Q15. Make a first barplot of candy ranking based on winpercent values.

HINT: Use the aes(winpercent, rownames(candy)) for your first ggplot like so:

library(____)

ggplot(____) + 
  ___(___, rownames(candy)) +
  geom_____()

A first rough bar chart that we need to improve

  • Q16. This is quite ugly, use the reorder() function to get the bars sorted by winpercent?

HINT: You can use aes(winpercent, reorder(rownames(candy),winpercent)) to improve your plot.

An improved bar chart with y-axis ordering

4.0.1 Time to add some useful color

Let’s setup a color vector (that signifies candy type) that we can then use for some future plots. We start by making a vector of all black values (one for each candy). Then we overwrite chocolate (for chocolate candy), brown (for candy bars) and red (for fruity candy) values.

my_cols=rep("black", nrow(candy))
my_cols[as.logical(candy$chocolate)] = "chocolate"
my_cols[as.logical(candy$bar)] = "brown"
my_cols[as.logical(candy$fruity)] = "pink"

Now let’s try our barplot with these colors. Note that we use fill=my_cols for geom_col(). Experiment to see what happens if you use col=mycols.

ggplot(candy) + 
  aes(winpercent, reorder(rownames(candy),winpercent)) +
  geom_col(fill=my_cols) 

A third bar chart with color for major candy types

Now, for the first time, using this plot we can answer questions like:

  • Q17. What is the worst ranked chocolate candy?
  • Q18. What is the best ranked fruity candy?

5 Taking a look at pricepercent

What about value for money? What is the best candy for the least money? One way to get at this would be to make a plot of winpercent vs the pricepercent variable. The pricepercent variable records the percentile rank of the candy’s price against all the other candies in the dataset. Lower values are less expensive and higher values are more expensive.

To this plot we will add text labels so we can more easily identify a given candy. There is a regular geom_label() that comes with ggplot2. However, as there are quite a few candies in our dataset lots of these labels will be overlapping and hard to read. To help with this we can use the geom_text_repel() function from the ggrepel package.

library(ggrepel)

# How about a plot of win vs price
ggplot(___) +
  aes(___, ___, label=rownames(candy)) +
  geom_point(col=my_cols) + 
  geom_text_repel(col=my_cols, size=3.3, max.overlaps = 5)

A first scatter plot of winpercent vs pricepercent

  • Q19. Which candy type is the highest ranked in terms of winpercent for the least money - i.e. offers the most bang for your buck?
  • Q20. What are the top 5 most expensive candy types in the dataset and of these which is the least popular?

Hint: To see which candy is the most expensive (and which is the least expensive) we can order() the dataset by pricepercent.

ord <- order(candy$pricepercent, decreasing = TRUE)
head( candy[ord,c(11,12)], n=5 )

5.1 Optional

  • Q21. Make a barplot again with geom_col() this time using pricepercent and then improve this step by step, first ordering the x-axis by value and finally making a so called “dot chat” or “lollipop” chart by swapping geom_col() for geom_point() + geom_segment().

A first barchat of pricepercent

# Make a lollipop chart of pricepercent
ggplot(candy) +
  aes(pricepercent, reorder(rownames(candy), pricepercent)) +
  geom_segment(aes(yend = reorder(rownames(candy), pricepercent), 
                   xend = 0), col="gray40") +
    geom_point()

A first lollipop chart of pricepercent

One of the most interesting aspects of this chart is that a lot of the candies share the same ranking, so it looks like quite a few of them are the same price.

6 Exploring the correlation structure

Now that we’ve explored the dataset a little, we’ll see how the variables interact with one another. We’ll use correlation and view the results with the corrplot package to plot a correlation matrix.

library(corrplot)
corrplot 0.95 loaded
cij <- cor(candy)
corrplot(cij)

A corrplot of Pearson correlation

  • Q22. Examining this plot what two variables are anti-correlated (i.e. have minus values)?
  • Q23. Similarly, what two variables are most positively correlated?

HINT: Do you like chocolaty fruity candies?

7 Principal Component Analysis

Let’s apply PCA using the prcomp() function to our candy dataset remembering to set the scale=TRUE argument.

Side-note: Feel free to examine what happens if you leave this argument out (i.e. use the default scale=FALSE). Then examine the summary(pca) and pca$rotation[,1] component and see that it is dominated by winpercent (which is after all measured on a very different scale than the other variables).

pca <- ____(candy, ____)
summary(___)
Importance of components:
                          PC1    PC2    PC3     PC4    PC5     PC6     PC7
Standard deviation     2.0788 1.1378 1.1092 1.07533 0.9518 0.81923 0.81530
Proportion of Variance 0.3601 0.1079 0.1025 0.09636 0.0755 0.05593 0.05539
Cumulative Proportion  0.3601 0.4680 0.5705 0.66688 0.7424 0.79830 0.85369
                           PC8     PC9    PC10    PC11    PC12
Standard deviation     0.74530 0.67824 0.62349 0.43974 0.39760
Proportion of Variance 0.04629 0.03833 0.03239 0.01611 0.01317
Cumulative Proportion  0.89998 0.93832 0.97071 0.98683 1.00000

Now we can plot our main PCA score plot of PC1 vs PC2.

plot(pca$____[,____])

A first PCA score plot that we need to improve

We can change the plotting character and add some color:

plot(pca$x[,1:2], col=my_cols, pch=16)

A slightly better PCA score plot

We can make a much nicer plot with the ggplot2 package but it is important to note that ggplot works best when you supply an input data.frame that includes a separate column for each of the aesthetics you would like displayed in your final plot. To accomplish this we make a new data.frame here that contains our PCA results with all the rest of our candy data. We will then use this for making plots below

# Make a new data-frame with our PCA results and candy data
my_data <- cbind(candy, pca$x[,1:3])
p <- ggplot(my_data) + 
        aes(x=___, y=___, 
            size=winpercent/100,  
            text=rownames(my_data),
            label=rownames(my_data)) +
        geom_point(col=___)

p

Our first ggplot PCA score plot

Again we can use the ggrepel package and the function ggrepel::geom_text_repel() to label up the plot with non overlapping candy names like. We will also add a title and subtitle like so:

library(ggrepel)

p + geom_text_repel(size=3.3, col=my_cols, max.overlaps = 7)  + 
  theme(legend.position = "none") +
  labs(title="Halloween Candy PCA Space",
       subtitle="Colored by type: chocolate bar (dark brown), chocolate other (light brown), fruity (red), other (black)",
       caption="Data from 538")

A refined and polished PCA score plot

If you want to see more candy labels you can change the max.overlaps value to allow more overlapping labels or pass the ggplot object p to plotly like so to generate an interactive plot that you can mouse over to see labels:

library(plotly)
ggplotly(p)

Let’s finish by taking a quick look at PCA our loadings. Do these make sense to you? Notice the opposite effects of chocolate and fruity and the similar effects of chocolate and bar (i.e. we already know they are correlated).

ggplot(pca$rotation) +
  aes(___, ___) +
  geom____()

A PCA loadings plot

  • Q24. Complete the code to generate the loadings plot above. What original variables are picked up strongly by PC1 in the positive direction? Do these make sense to you? Where did you see this relationship highlighted previously?

8 Summary

Starting with basic data import and exploration, you characterized the structure of the candy dataset and identified key variables. You then built increasingly sophisticated visualizations—from simple histograms to ranked bar plots to labeled scatter plots—to reveal patterns in candy popularity and pricing.

The correlation analysis showed you how candy attributes relate to each other (chocolate and fruity are essentially mutually exclusive, for instance), and this set the stage for PCA. By reducing the original variables down to a few PCs, you created a “map of candy space” where similar candies cluster together. The PCA loadings revealed that the primary axis of variation separates chocolate-based candies from fruity ones - the same pattern we observed in the correlation matrix, now visualized in a single, interpretable plot.

  • Q25. Based on your exploratory analysis, correlation findings, and PCA results, what combination of characteristics appears to make a “winning” candy? How do these different analyses (visualization, correlation, PCA) support or complement each other in reaching this conclusion?

Optional extension questions

losers = candy[which(candy$winpercent < 50),]
winners = candy[which(candy$winpercent >= 50),]
  • Q26. Are popular candies more expensive? In other words: is price significantly different between “winners” and “losers”? List both average values and a P-value along with your answer.

  • Q27. Are candies with more sugar more likely to be popular? What is your interpretation of the means and P-value in this case?