Barry Grant < http://thegrantlab.org/teaching/ >
2021-10-26 (19:58:47 PDT on Tue, Oct 26)
In this mini-project, you will explore FiveThirtyEight’s Halloween Candy dataset. FiveThirtyEight, sometimes rendered as just 538, is an American website that focuses mostly on opinion poll analysis, politics, economics, and sports blogging. They recently ran a rather large poll to determine which candy their readers like best. From their website: “While we don’t know who exactly voted, we do know this: 8,371 different IP addresses voted on about 269,000 randomly generated candy matchups”.
So what is the top ranked snack-sized Halloween candy? What made some candies more desirable than others? Was it price? Maybe it was just sugar content? Were they chocolate? Did they contain peanuts or almonds? How about crisped rice or other biscuit-esque component, like a Kit Kat or malted milk ball? Was it fruit flavored? Was it made of hard candy, like a lollipop or a strawberry bon bon? Was there nougat? What even is nougat? I know I like nougat, but I still have no real clue what the damn thing is.
Your task is to explore their candy dataset to find out answers to these types of questions - but most of all your job is to have fun, learn by doing hands on data analysis, and hopefully make this type of analysis less frightining for the future! Let’s get started.
First things first, let’s get the data from the FiveThirtyEight GitHub repo. You can either read from the URL directely or download this candy-data.csv file and place it in your project directory. Either way we need to load it up with read.csv()
and inspect the data to see exactly what we’re dealing with.
<- "candy-data.csv"
candy_file
= ____(____, row.names=1)
candy head(candy)
The dataset includes all sorts of information about different kinds of candy. For example, is a candy chocolaty? Does it have nougat? How does its cost compare to other candies? How many people prefer one candy over another?
According to 538 the columns in the dataset include:
We will take a whirlwind tour of this dataset and in the process answer the questions highlighted in red throught this page that aim to guide your exploration process. We will then wrap up by trying Principal Component Analysis (PCA) on this dataset to get yet more experience with this important multivariate method. It will yield a kind of “Map of Hallowen Candy Space”. How cool is that! Let’s explore…
The functions
dim()
,nrow()
,table()
andsum()
may be useful for answering the first 2 questions.
One of the most interesting variables in the dataset is winpercent
. For a given candy this value is the percentage of people who prefer this candy over another randomly chosen candy from the dataset (what 538 term a matchup). Higher values indicate a more popular candy.
We can find the winpercent
value for Twix by using its name to access the corresponding row of the dataset. This is because the dataset has each candy name as rownames
(recall that we set this when we imported the original CSV file). For example the code for Twix is:
"Twix", ]$winpercent candy[
## [1] 81.64291
winpercent
value?winpercent
value for “Kit Kat”?winpercent
value for “Tootsie Roll Snack Bars”?Side-note: the skimr::skim() function
There is a useful
skim()
function in the skimr package that can help give you a quick overview of a given dataset. Let’s install this package and try it on our candy data.library("skimr") skim(candy)
Data summary Name candy Number of rows 85 Number of columns 12 _______________________ Column type frequency: numeric 12 ________________________ Group variables None Variable type: numeric
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist chocolate 0 1 0.44 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆ fruity 0 1 0.45 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆ caramel 0 1 0.16 0.37 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂ peanutyalmondy 0 1 0.16 0.37 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂ nougat 0 1 0.08 0.28 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁ crispedricewafer 0 1 0.08 0.28 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁ hard 0 1 0.18 0.38 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂ bar 0 1 0.25 0.43 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂ pluribus 0 1 0.52 0.50 0.00 0.00 1.00 1.00 1.00 ▇▁▁▁▇ sugarpercent 0 1 0.48 0.28 0.01 0.22 0.47 0.73 0.99 ▇▇▇▇▆ pricepercent 0 1 0.47 0.29 0.01 0.26 0.47 0.65 0.98 ▇▇▇▇▆ winpercent 0 1 50.32 14.71 22.45 39.14 47.83 59.86 84.18 ▃▇▆▅▂
From your use of the skim()
function use the output to answer the following:
candy$chocolate
column?Hint: look at the “Variable type” print out from the
skim()
function. Most varables (i.e. columns) are on the zero to one scale but not all. Some columns such aschocolate
are exclusively either zero or one values.
A good place to start any exploratory analysis is with a histogram. You can do this most easily with the base R function hist()
. Alternatively, you can use ggplot()
with geom_hist()
. Either works well in this case and (as always) its your choice.
winpercent
valueswinpercent
values symmetrical?Hint: The
chocolate
,fruity
,nougat
etc. columns indicate if a given candy has this feature (i.e. one if it has nougart, zero if it does not etc.). We can turn these into logical (a.k.a. TRUE/FALSE) values with theas.logical()
function. We can then use this logical vector to access the coresponding candy rows (those with TRUE values). For example to get thewinpercent
values for all nougat contaning candy we can use the code:candy$winpercent[as.logical(candy$nougat)]
. In addation the functionsmean()
andt.test()
should help you answer the last two questions here.
##
## Welch Two Sample t-test
##
## data: chocolate and fruity
## t = 6.2582, df = 68.882, p-value = 2.871e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 11.44563 22.15795
## sample estimates:
## mean of x mean of y
## 60.92153 44.11974
Let’s use the base R order()
function together with head()
to sort the whole dataset by winpercent
. Or if you have been getting into the tidyverse and the dplyr package you can use the arrange()
function together with head()
to do the same thing and answer the following questions:
Hint: Using base R we could use
head(candy[order(candy$winpercent),], n=5)
, whilst using dplyr we have:candy %>% arrange(winpercent) %>% head(5)
. Which apprach do you prefer and why?
To examine more of the dataset in this vain we can make a barplot to visualize the overall rankings. We will use an iterative approach to building a useful visulization by getting a rough starting plot and then refining and adding useful details in a stepwise process.
winpercent
values.HINT: Use the
aes(winpercent, rownames(candy))
for your first ggplot like so:
library(____)
ggplot(____) +
aes(winpercent, rownames(candy)) +
geom_____()
reorder()
function to get the bars sorted by winpercent
?HINT: You can use
aes(winpercent, reorder(rownames(candy),winpercent))
to improve your plot.
Let’s setup a color vector (that signifies candy type) that we can then use for some future plots. We start by making a vector of all black values (one for each candy). Then we overwrite chocolate (for chocolate candy), brown (for candy bars) and red (for fruity candy) values.
=rep("black", nrow(candy))
my_colsas.logical(candy$chocolate)] = "chocolate"
my_cols[as.logical(candy$bar)] = "brown"
my_cols[as.logical(candy$fruity)] = "pink" my_cols[
Now let’s try our barplot with these colors. Note that we use fill=my_cols
for geom_col()
. Experement to see what happens if you use col=mycols
.
ggplot(candy) +
aes(winpercent, reorder(rownames(candy),winpercent)) +
geom_col(fill=my_cols)
Now, for the first time, using this plot we can answer questions like:
- Q17. What is the worst ranked chocolate candy?
- Q18. What is the best ranked fruity candy?
What about value for money? What is the the best candy for the least money? One way to get at this would be to make a plot of winpercent
vs the pricepercent
variable. The pricepercent
variable records the percentile rank of the candy’s price against all the other candies in the dataset. Lower vales are less expensive and high values more expensive.
To this plot we will add text labels so we can more easily identify a given candy. There is a regular geom_label()
that comes with ggplot2. However, as there are quite a few candys in our dataset lots of these labels will be overlapping and hard to read. To help with this we can use the geom_text_repel()
function from the ggrepel package.
library(ggrepel)
# How about a plot of price vs win
ggplot(candy) +
aes(winpercent, pricepercent, label=rownames(candy)) +
geom_point(col=my_cols) +
geom_text_repel(col=my_cols, size=3.3, max.overlaps = 5)
## Warning: ggrepel: 50 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
winpercent
for the least money - i.e. offers the most bang for your buck?Hint: To see which candy is the most expensive (and which is the least expensive) we can
order()
the dataset bypricepercent
.<- order(candy$pricepercent, decreasing = TRUE) ord head( candy[ord,c(11,12)], n=5 )
geom_col()
this time using pricepercent
and then improve this step by step, first ordering the x-axis by value and finally making a so called “dot chat” or “lollipop” chart by swapping geom_col()
for geom_point()
+ geom_segment()
.# Make a lollipop chart of pricepercent
ggplot(candy) +
aes(pricepercent, reorder(rownames(candy), pricepercent)) +
geom_segment(aes(yend = reorder(rownames(candy), pricepercent),
xend = 0), col="gray40") +
geom_point()
One of the most interesting aspects of this chart is that a lot of the candies share the same ranking, so it looks like quite a few of them are the same price.
Now that we’ve explored the dataset a little, we’ll see how the variables interact with one another. We’ll use correlation and view the results with the corrplot package to plot a correlation matrix.
library(corrplot)
## corrplot 0.90 loaded
<- cor(candy)
cij corrplot(cij)
HINT: Do you like chocolaty fruity candies?
Let’s apply PCA using the prcom()
function to our candy dataset remembering to set the scale=TRUE
argument.
Side-note: Feel free to examine what happens if you leave this argument out (i.e. use the default
scale=FALSE
). Then examine thesummary(pca)
andpca$rotation[,1]
component and see that it is dominated bywinpercent
(which is after all measured on a very different scale than the other variables).
<- ____(candy, ____)
pca summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0788 1.1378 1.1092 1.07533 0.9518 0.81923 0.81530
## Proportion of Variance 0.3601 0.1079 0.1025 0.09636 0.0755 0.05593 0.05539
## Cumulative Proportion 0.3601 0.4680 0.5705 0.66688 0.7424 0.79830 0.85369
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.74530 0.67824 0.62349 0.43974 0.39760
## Proportion of Variance 0.04629 0.03833 0.03239 0.01611 0.01317
## Cumulative Proportion 0.89998 0.93832 0.97071 0.98683 1.00000
Now we can plot our main PCA score plot of PC1 vs PC2.
plot(pca$____[,____])
We can change the plotting character and add some color:
plot(pca$x[,1:2], col=my_cols, pch=16)
We can make a much nicer plot with the ggplot2 package but it is important to note that ggplot works best when you supply an input data.frame that includes a separate column for each of the aesthetics you would like displayed in your final plot. To accomplish this we make a new data.frame here that contains our PCA results with all the rest of our candy data. We will then use this for making plots below
# Make a new data-frame with our PCA results and candy data
<- cbind(candy, pca$x[,1:3]) my_data
<- ggplot(my_data) +
p aes(x=PC1, y=PC2,
size=winpercent/100,
text=rownames(my_data),
label=rownames(my_data)) +
geom_point(col=my_cols)
p
Again we can use the ggrepel package and the function ggrepel::geom_text_repel()
to label up the plot with non overlapping candy names like. We will also add a title and subtitle like so:
library(ggrepel)
+ geom_text_repel(size=3.3, col=my_cols, max.overlaps = 7) +
p theme(legend.position = "none") +
labs(title="Halloween Candy PCA Space",
subtitle="Colored by type: chocolate bar (dark brown), chocolate other (light brown), fruity (red), other (black)",
caption="Data from 538")
If you want to see more candy labels you can change the max.overlaps
value to allow more overlapping labels or pass the ggplot object p
to plotly like so to generate an interactive plot that you can mouse over to see labels:
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
ggplotly(p)
Let’s finish by taking a quick look at PCA our loadings. Do these make sense to you? Notice the opposite effects of chocolate
and fruity
and the similar effects of chocolate
and bar
(i.e. we already know they are correlated).
par(mar=c(8,4,2,2))
barplot(pca$rotation[,1], las=2, ylab="PC1 Contribution")
HINT. pluribus means the candy comes in a bag or box of multiple candies.