candy_file <- "candy-data.csv"
candy = ____(____, row.names=___)
head(candy)Class 9: Candy Mini-Project
Exploratory Analysis of Halloween Candy
1 Background
In this mini-project, you will explore FiveThirtyEight’s Halloween Candy dataset. FiveThirtyEight, sometimes rendered as just 538, is an American website that focuses mostly on opinion poll analysis, politics, economics, and sports blogging. They recently ran a rather large poll to determine which candy their readers like best. From their website: “While we don’t know who exactly voted, we do know this: 8,371 different IP addresses voted on about 269,000 randomly generated candy matchups”.
So what is the top ranked snack-sized Halloween candy? What made some candies more desirable than others? Was it price? Maybe it was just sugar content? Were they chocolate? Did they contain peanuts or almonds? How about crisped rice or other biscuit-esque component, like a Kit Kat or malted milk ball? Was it fruit flavored? Was it made of hard candy, like a lollipop or a strawberry bon bon? Was there nougat? What even is nougat? I know I like nougat, but I still have no real clue what the damn thing is.
Your task is to explore their candy dataset to find out answers to these types of questions - but most of all your job is to have fun, learn by doing hands on data analysis, and hopefully make this type of analysis less frightening for the future! Let’s get started.
1.1 Learning Objectives
Building on your work with PCA in the previous labs (UK Foods, RNA-seq and beast cancer biopsy analysis), this mini-project gives you the opportunity to apply your data analysis skills to a new, tastier dataset. By the end of this lab, you should be able to:
- Import and explore an unfamiliar dataset to quickly characterize its structure and identify variables requiring special handling;
- Create and iteratively refine effective visualizations including ranked bar plots and scatter plots with
ggrepel()labels, and interactiveplotly()graphics to explore and communicate patterns in data; - Generate and interpret correlation matrices to identify relationships between variables;
- Conduct principal component analysis (PCA) with appropriate scaling, interpret variance explained, and create score plots (PC1 vs PC2) to reveal clustering structure in high-dimensional data;
- Interpret PCA loadings to understand which original variables drive separation along principal components, connecting these back to the correlation structure;
- Synthesize multiple analyses (exploration, correlation, PCA) to draw substantive, data-driven conclusions.
2 Importing candy data
First things first, let’s get the data from the FiveThirtyEight GitHub repo. You can either read from the URL directly or download this candy-data.csv file and place it in your project directory. Either way we need to load it up with read.csv() and inspect the data to see exactly what we’re dealing with.
2.1 What is in the dataset?
The dataset includes all sorts of information about different kinds of candy. For example, is a candy chocolaty? Does it have nougat? How does its cost compare to other candies? How many people prefer one candy over another?
According to 538 the columns in the dataset include:
- chocolate: Does it contain chocolate?
- fruity: Is it fruit flavored?
- caramel: Is there caramel in the candy?
- peanutyalmondy: Does it contain peanuts, peanut butter or almonds?
- nougat: Does it contain nougat?
- crispedricewafer: Does it contain crisped rice, wafers, or a cookie component?
- hard: Is it a hard candy?
- bar: Is it a candy bar?
- pluribus: Is it one of many candies in a bag or box?
- sugarpercent: The percentile of sugar it falls under within the data set.
- pricepercent: The unit price percentile compared to the rest of the set.
- winpercent: The overall win percentage according to 269,000 matchups (more on this in a moment).
We will take a whirlwind tour of this dataset and in the process answer the questions highlighted in red throughout this page that aim to guide your exploration process. We will then wrap up by trying Principal Component Analysis (PCA) on this dataset to get yet more experience with this important multivariate method. It will yield a kind of “Map of Halloween Candy Space”. How cool is that! Let’s explore…
- Q1. How many different candy types are in this dataset?
- Q2. How many fruity candy types are in the dataset?
The functions
dim(),nrow(),table()andsum()may be useful for answering the first 2 questions.
2.2 What is your favorite candy?
One of the most interesting variables in the dataset is winpercent. For a given candy this value is the percentage of people who prefer this candy over another randomly chosen candy from the dataset (what 538 term a matchup). Higher values indicate a more popular candy.
We can find the winpercent value for Twix by using its name to access the corresponding row of the dataset. This is because the dataset has each candy name as rownames (recall that we set this when we imported the original CSV file). For example the code for Twix is:
candy["Twix", ]$winpercent[1] 81.64291
An alternate way to do this is with the help of the dplyr package that we have seen previously:
library(dplyr)
candy |>
filter(row.names(candy)=="Twix") |>
select(winpercent)- Q3. What is your favorite candy (other than Twix) in the dataset and what is it’s
winpercentvalue? - Q4. What is the
winpercentvalue for “Kit Kat”? - Q5. What is the
winpercentvalue for “Tootsie Roll Snack Bars”?
Side-note: the skimr::skim() function
There is a useful skim() function in the skimr package that can help give you a quick overview of a given dataset. Let’s install this package and try it on our candy data.
library("skimr")
skim(candy)Click to show output
| Name | candy |
| Number of rows | 85 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| numeric | 12 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| chocolate | 0 | 1 | 0.44 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| fruity | 0 | 1 | 0.45 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| caramel | 0 | 1 | 0.16 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| peanutyalmondy | 0 | 1 | 0.16 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| nougat | 0 | 1 | 0.08 | 0.28 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| crispedricewafer | 0 | 1 | 0.08 | 0.28 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hard | 0 | 1 | 0.18 | 0.38 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| bar | 0 | 1 | 0.25 | 0.43 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| pluribus | 0 | 1 | 0.52 | 0.50 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
| sugarpercent | 0 | 1 | 0.48 | 0.28 | 0.01 | 0.22 | 0.47 | 0.73 | 0.99 | ▇▇▇▇▆ |
| pricepercent | 0 | 1 | 0.47 | 0.29 | 0.01 | 0.26 | 0.47 | 0.65 | 0.98 | ▇▇▇▇▆ |
| winpercent | 0 | 1 | 50.32 | 14.71 | 22.45 | 39.14 | 47.83 | 59.86 | 84.18 | ▃▇▆▅▂ |
From your use of the skim() function use the output to answer the following:
- Q6. Is there any variable/column that looks to be on a different scale to the majority of the other columns in the dataset?
- Q7. What do you think a zero and one represent for the
candy$chocolatecolumn?
Look at the “Variable type” print out from the skim() function. Most variables (i.e. columns) are on the zero to one scale but not all. Some columns such as chocolate are exclusively either zero or one values.
3 Exploratory analysis
A good place to start any exploratory analysis is with a histogram. You can do this most easily with the base R function hist(). Alternatively, you can use ggplot() with geom_hist(). Either works well in this case and (as always) its your choice.
- Q8. Plot a histogram of
winpercentvalues using both base R an ggplot2. - Q9. Is the distribution of
winpercentvalues symmetrical? - Q10. Is the center of the distribution above or below 50%?
- Q11. On average is chocolate candy higher or lower ranked than fruit candy?
- Q12. Is this difference statistically significant?
Hint: The
chocolate,fruity,nougatetc. columns indicate if a given candy has this feature (i.e. one if it has nougat, zero if it does not etc.). We can turn these into logical (a.k.a. TRUE/FALSE) values with theas.logical()function. We can then use this logical vector to access the corresponding candy rows (those with TRUE values). For example to get thewinpercentvalues for all nougat containing candy we can use the code:candy$winpercent[as.logical(candy$nougat)]. In addition the functionsmean()andt.test()should help you answer the last two questions here.
Welch Two Sample t-test
data: chocolate and fruity
t = 6.2582, df = 68.882, p-value = 2.871e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
11.44563 22.15795
sample estimates:
mean of x mean of y
60.92153 44.11974
4 Overall Candy Rankings
Let’s use the base R order() function together with head() to sort the whole dataset by winpercent. Or if you have been getting into the tidyverse and the dplyr package you can use the arrange() function together with head() to do the same thing and answer the following questions:
- Q13. What are the five least liked candy types in this set?
- Q14. What are the top 5 all time favorite candy types out of this set?
Hint: Using base R we could use
head(candy[order(candy$winpercent),], n=5), whilst using dplyr we have:candy |> arrange(winpercent) |> head(5). Which approach do you prefer and why?
To examine more of the dataset in this vain we can make a barplot to visualize the overall rankings. We will use an iterative approach to building a useful visualization by getting a rough starting plot and then refining and adding useful details in a stepwise process.
- Q15. Make a first barplot of candy ranking based on
winpercentvalues.
HINT: Use the
aes(winpercent, rownames(candy))for your first ggplot like so:
library(____)
ggplot(____) +
___(___, rownames(candy)) +
geom_____()- Q16. This is quite ugly, use the
reorder()function to get the bars sorted bywinpercent?
HINT: You can use
aes(winpercent, reorder(rownames(candy),winpercent))to improve your plot.
4.0.1 Time to add some useful color
Let’s setup a color vector (that signifies candy type) that we can then use for some future plots. We start by making a vector of all black values (one for each candy). Then we overwrite chocolate (for chocolate candy), brown (for candy bars) and red (for fruity candy) values.
my_cols=rep("black", nrow(candy))
my_cols[as.logical(candy$chocolate)] = "chocolate"
my_cols[as.logical(candy$bar)] = "brown"
my_cols[as.logical(candy$fruity)] = "pink"Now let’s try our barplot with these colors. Note that we use fill=my_cols for geom_col(). Experiment to see what happens if you use col=mycols.
ggplot(candy) +
aes(winpercent, reorder(rownames(candy),winpercent)) +
geom_col(fill=my_cols) Now, for the first time, using this plot we can answer questions like:
- Q17. What is the worst ranked chocolate candy?
- Q18. What is the best ranked fruity candy?
5 Taking a look at pricepercent
What about value for money? What is the best candy for the least money? One way to get at this would be to make a plot of winpercent vs the pricepercent variable. The pricepercent variable records the percentile rank of the candy’s price against all the other candies in the dataset. Lower values are less expensive and higher values are more expensive.
To this plot we will add text labels so we can more easily identify a given candy. There is a regular geom_label() that comes with ggplot2. However, as there are quite a few candies in our dataset lots of these labels will be overlapping and hard to read. To help with this we can use the geom_text_repel() function from the ggrepel package.
library(ggrepel)
# How about a plot of win vs price
ggplot(___) +
aes(___, ___, label=rownames(candy)) +
geom_point(col=my_cols) +
geom_text_repel(col=my_cols, size=3.3, max.overlaps = 5)- Q19. Which candy type is the highest ranked in terms of
winpercentfor the least money - i.e. offers the most bang for your buck? - Q20. What are the top 5 most expensive candy types in the dataset and of these which is the least popular?
Hint: To see which candy is the most expensive (and which is the least expensive) we can
order()the dataset bypricepercent.
ord <- order(candy$pricepercent, decreasing = TRUE) head( candy[ord,c(11,12)], n=5 )
5.1 Optional
- Q21. Make a barplot again with
geom_col()this time usingpricepercentand then improve this step by step, first ordering the x-axis by value and finally making a so called “dot chat” or “lollipop” chart by swappinggeom_col()forgeom_point()+geom_segment().
# Make a lollipop chart of pricepercent
ggplot(candy) +
aes(pricepercent, reorder(rownames(candy), pricepercent)) +
geom_segment(aes(yend = reorder(rownames(candy), pricepercent),
xend = 0), col="gray40") +
geom_point()One of the most interesting aspects of this chart is that a lot of the candies share the same ranking, so it looks like quite a few of them are the same price.
6 Exploring the correlation structure
Now that we’ve explored the dataset a little, we’ll see how the variables interact with one another. We’ll use correlation and view the results with the corrplot package to plot a correlation matrix.
library(corrplot)corrplot 0.95 loaded
cij <- cor(candy)
corrplot(cij)- Q22. Examining this plot what two variables are anti-correlated (i.e. have minus values)?
- Q23. Similarly, what two variables are most positively correlated?
HINT: Do you like chocolaty fruity candies?
7 Principal Component Analysis
Let’s apply PCA using the prcomp() function to our candy dataset remembering to set the scale=TRUE argument.
Side-note: Feel free to examine what happens if you leave this argument out (i.e. use the default
scale=FALSE). Then examine thesummary(pca)andpca$rotation[,1]component and see that it is dominated bywinpercent(which is after all measured on a very different scale than the other variables).
pca <- ____(candy, ____)
summary(___)Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.0788 1.1378 1.1092 1.07533 0.9518 0.81923 0.81530
Proportion of Variance 0.3601 0.1079 0.1025 0.09636 0.0755 0.05593 0.05539
Cumulative Proportion 0.3601 0.4680 0.5705 0.66688 0.7424 0.79830 0.85369
PC8 PC9 PC10 PC11 PC12
Standard deviation 0.74530 0.67824 0.62349 0.43974 0.39760
Proportion of Variance 0.04629 0.03833 0.03239 0.01611 0.01317
Cumulative Proportion 0.89998 0.93832 0.97071 0.98683 1.00000
Now we can plot our main PCA score plot of PC1 vs PC2.
plot(pca$____[,____])We can change the plotting character and add some color:
plot(pca$x[,1:2], col=my_cols, pch=16)We can make a much nicer plot with the ggplot2 package but it is important to note that ggplot works best when you supply an input data.frame that includes a separate column for each of the aesthetics you would like displayed in your final plot. To accomplish this we make a new data.frame here that contains our PCA results with all the rest of our candy data. We will then use this for making plots below
# Make a new data-frame with our PCA results and candy data
my_data <- cbind(candy, pca$x[,1:3])p <- ggplot(my_data) +
aes(x=___, y=___,
size=winpercent/100,
text=rownames(my_data),
label=rownames(my_data)) +
geom_point(col=___)
pAgain we can use the ggrepel package and the function ggrepel::geom_text_repel() to label up the plot with non overlapping candy names like. We will also add a title and subtitle like so:
library(ggrepel)
p + geom_text_repel(size=3.3, col=my_cols, max.overlaps = 7) +
theme(legend.position = "none") +
labs(title="Halloween Candy PCA Space",
subtitle="Colored by type: chocolate bar (dark brown), chocolate other (light brown), fruity (red), other (black)",
caption="Data from 538")If you want to see more candy labels you can change the max.overlaps value to allow more overlapping labels or pass the ggplot object p to plotly like so to generate an interactive plot that you can mouse over to see labels:
library(plotly)
ggplotly(p)Let’s finish by taking a quick look at PCA our loadings. Do these make sense to you? Notice the opposite effects of chocolate and fruity and the similar effects of chocolate and bar (i.e. we already know they are correlated).
ggplot(pca$rotation) +
aes(___, ___) +
geom____()- Q24. Complete the code to generate the loadings plot above. What original variables are picked up strongly by PC1 in the positive direction? Do these make sense to you? Where did you see this relationship highlighted previously?
8 Summary
Starting with basic data import and exploration, you characterized the structure of the candy dataset and identified key variables. You then built increasingly sophisticated visualizations—from simple histograms to ranked bar plots to labeled scatter plots—to reveal patterns in candy popularity and pricing.
The correlation analysis showed you how candy attributes relate to each other (chocolate and fruity are essentially mutually exclusive, for instance), and this set the stage for PCA. By reducing the original variables down to a few PCs, you created a “map of candy space” where similar candies cluster together. The PCA loadings revealed that the primary axis of variation separates chocolate-based candies from fruity ones - the same pattern we observed in the correlation matrix, now visualized in a single, interpretable plot.
- Q25. Based on your exploratory analysis, correlation findings, and PCA results, what combination of characteristics appears to make a “winning” candy? How do these different analyses (visualization, correlation, PCA) support or complement each other in reaching this conclusion?
Optional extension questions
losers = candy[which(candy$winpercent < 50),]
winners = candy[which(candy$winpercent >= 50),]Q26. Are popular candies more expensive? In other words: is price significantly different between “winners” and “losers”? List both average values and a P-value along with your answer.
Q27. Are candies with more sugar more likely to be popular? What is your interpretation of the means and P-value in this case?