Barry Grant < http://thegrantlab.org/teaching/ >
2021-11-23 (12:41:08 on Tue, Nov 23)
The goal of this hands-on mini-project is to examine and compare the Covid-19 vaccination rates around San Diego.
We will start by downloading the most recently dated “Statewide COVID-19 Vaccines Administered by ZIP Code” CSV file from: https://data.ca.gov/dataset/covid-19-vaccine-progress-dashboard-data-by-zip-code
Side-Note: With some web browsers this file may display as a new web page rather than simply download to your computer. In this case select “Save As” from your browser to obtain the CSV file.
Whilst you are on this website have a look at the Data Dictionary file that explains the various columns within the CSV file that you just downloaded.
There are also important notes about the limitations of the data. For example: “These data do NOT include doses administered by the following federal agencies who received vaccine allocated directly from CDC: Indian Health Service, Veterans Health Administration, Department of Defense, and the Federal Bureau of Prisons.” One obvious implication here would be that Zip code areas that include military bases will likely show artificially low vaccination rates. We will bare this in mind for later.
Be sure to move your downloaded CSV file to your project directory and then read/import into an R object called vax
. We will use this data to answer all the questions below.
# Import vaccination data
<- read.csv( ___ )
vax head(vax)
Q1. What column details the total number of people fully vaccinated?
Q2. What column details the Zip code tabulation area?
Q3. What is the earliest date in this dataset?
Q4. What is the latest date in this dataset?
Look at the first and last entry of the vax$as_of_date
column.
As we have done previously, let’s call the skim()
function from the skimr package to get a quick overview of this dataset:
::skim(vax) skimr
Name | vax |
Number of rows | 81144 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 5 |
numeric | 9 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
as_of_date | 0 | 1 | 10 | 10 | 0 | 46 | 0 |
local_health_jurisdiction | 0 | 1 | 0 | 15 | 230 | 62 | 0 |
county | 0 | 1 | 0 | 15 | 230 | 59 | 0 |
vem_source | 0 | 1 | 15 | 26 | 0 | 3 | 0 |
redacted | 0 | 1 | 2 | 69 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
zip_code_tabulation_area | 0 | 1.00 | 93665.11 | 1817.39 | 90001 | 92257.75 | 93658.50 | 95380.50 | 97635.0 | ▃▅▅▇▁ |
vaccine_equity_metric_quartile | 4002 | 0.95 | 2.44 | 1.11 | 1 | 1.00 | 2.00 | 3.00 | 4.0 | ▇▇▁▇▇ |
age12_plus_population | 0 | 1.00 | 18895.04 | 18993.94 | 0 | 1346.95 | 13685.10 | 31756.12 | 88556.7 | ▇▃▂▁▁ |
age5_plus_population | 0 | 1.00 | 20875.24 | 21106.05 | 0 | 1460.50 | 15364.00 | 34877.00 | 101902.0 | ▇▃▂▁▁ |
persons_fully_vaccinated | 8256 | 0.90 | 9456.49 | 11498.25 | 11 | 506.00 | 4105.00 | 15859.00 | 71078.0 | ▇▂▁▁▁ |
persons_partially_vaccinated | 8256 | 0.90 | 1900.61 | 2113.07 | 11 | 200.00 | 1271.00 | 2893.00 | 20185.0 | ▇▁▁▁▁ |
percent_of_population_fully_vaccinated | 8256 | 0.90 | 0.42 | 0.27 | 0 | 0.19 | 0.44 | 0.62 | 1.0 | ▇▆▇▆▂ |
percent_of_population_partially_vaccinated | 8256 | 0.90 | 0.10 | 0.10 | 0 | 0.06 | 0.07 | 0.11 | 1.0 | ▇▁▁▁▁ |
percent_of_population_with_1_plus_dose | 8256 | 0.90 | 0.50 | 0.26 | 0 | 0.30 | 0.53 | 0.70 | 1.0 | ▅▅▇▇▃ |
NA
values there in the persons_fully_vaccinated
column? persons_fully_vaccinated
values are missing (to 2 significant figures)? What does the following code do: sum( is.na(vax$persons_fully_vaccinated) )
and how would this help you answer Q6 and Q7?
One of the “character” columns of the data is as_of_date
, which contains dates in the Year-Month-Day format.
Dates and times can be annoying to work with at the best of times. However, in R we have the excellent lubridate package, which can make life allot easier. Here is a quick example to get you started:
library(lubridate)
What is today’s date (at the time I am writing this obviously)
today()
## [1] "2021-11-23"
The as_of_date
column of our data is currently not that usable. For example we can’t easily do math with it like answering the simple question how many days have passed since data was first recorded:
# This will give an Error!
today() - vax$as_of_date[1]
However if we convert our date data into a lubridate format things like this will be much easier as well as plotting time series data later on.
# Specify that we are using the year-month-day format
$as_of_date <- ymd(vax$as_of_date) vax
Now we can do math with dates. For example: How many days have passed since the first vaccination reported in this dataset?
today() - vax$as_of_date[1]
## Time difference of 322 days
Using the last and the first date value we can now determine how many days the dataset span?
$as_of_date[nrow(vax)] - vax$as_of_date[1] vax
## Time difference of 315 days
One of the numeric columns in the dataset (namely vax$zip_code_tabulation_area
) are actually ZIP codes - a postal code used by the United States Postal Service (USPS). In R we can use the zipcodeR package to make working with these codes easier. For example, let’s install and then load up this package and to find the centroid of the La Jolla 92037 (i.e. UC San Diego) ZIP code area.
library(zipcodeR)
geocode_zip('92037')
Calculate the distance between the centroids of any two ZIP codes in miles, e.g.
zip_distance('92037','92109')
More usefully, we can pull census data about ZIP code areas (including median household income etc.). For example:
reverse_zipcode(c('92037', "92109") )
Optional: We can use this reverse_zipcode()
to pull census data later on for any or all ZIP code areas we might be interested in.
# Pull data for all ZIP codes in the dataset
#zipdata <- reverse_zipcode( vax$zip_code_tabulation_area )
We could also access socioeconomic data for different ZIP code areas in a similar way if we wanted to investigate factors that might be correlated with different vaccine uptake rates.
Another informative data exploration might be to plot the various values along with the ZIP codes latitude and longitude values on a map using a package like leafelet or using ggplot2 itself similar to this post. For now we will leave this as an optional extension exercise.
Let’s now focus in on the San Diego County area by restricting ourselves first to vax$county == "San Diego"
entries. We have two main choices on how to do this. The first using base R the second using the dplyr package:
# Subset to San Diego county only areas
<- vax[ ___ , ] sd
Using dplyr the code would look like this:
library(dplyr)
<- filter(vax, county == "San Diego")
sd
nrow(sd)
## [1] 4922
Using dplyr is often more convenient when we are subsetting across multiple criteria - for example all San Diego county areas with a population of over 10,000.
.10 <- filter(vax, county == "San Diego" &
sd> 10000) age5_plus_population
The functions unique()
and length()
should help with the first question here and the which.max()
function should help with the second question.
Using dplyr select all San Diego “county” entries on “as_of_date” “2021-11-16” and use this for the following questions.
UC San Diego resides in the 92037 ZIP code area and is listed with an age 5+ population size of 36,144.
<- filter(sd, zip_code_tabulation_area=="92037")
ucsd 1,]$age5_plus_population ucsd[
## [1] 36144
ggplot(ucsd) +
aes(___,
+
___) geom____() +
geom_line(group=1) +
ylim(c(0,1)) +
labs(___, y="Percent Vaccinated")
This plot shows an initial slow roll out in January into Febuary (likely due to limited vaccine availability). This is followed with rapid ramp up until a clear slowing trend from June time onward. Interpertation beyond this requies context from other zip code areas to answer questions such as: is this trend representative of other areas? Are more people fully vaccinated in this area compared to others? Etc.
Let’s return to the full dataset and look across every zip code area with a population at least as large as that of 92037 on as_of_date “2021-11-16”.
# Subset to all CA areas with a population as large as 92037
.36 <- filter(vax, age5_plus_population > 36144 &
vax== "2021-11-16")
as_of_date
#head(vax.36)
geom_hline()
function?## [1] 0.6629812
Q17. What is the 6 number summary (Min, 1st Qu., Median, Mean, 3rd Qu., and Max) of the “Percent of Population Fully Vaccinated” values for ZIP code areas with a population as large as 92037 (La Jolla) as_of_date “2021-11-16”?
Q18. Using ggplot generate a histogram of this data.
## Warning: Removed 2 rows containing missing values (geom_bar).
%>% filter(as_of_date == "2021-11-16") %>%
vax filter(zip_code_tabulation_area=="92040") %>%
select(percent_of_population_fully_vaccinated)
age5_plus_population > 36144
.36.all <- filter(vax, ___)
vax.
ggplot(vax.36.all) +
aes(___,
percent_of_population_fully_vaccinated, group=zip_code_tabulation_area) +
geom_line(alpha=0.2, color=___) +
ylim(___) +
labs(x=___, y=___,
title=___,
subtitle=___) +
geom_hline(yintercept = ___, linetype=___)
Q21. How do you feel about traveling for Thanksgiving and meeting for in-person class next Week?
Here we use the sessionInfo()
function to report on our R systems setup at the time of document execution.
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.3.5 dplyr_1.0.7 zipcodeR_0.3.3 lubridate_1.8.0
## [5] labsheet_0.1.2
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.2 sass_0.4.0 tidyr_1.1.4 bit64_4.0.5
## [5] jsonlite_1.7.2 bslib_0.3.1 assertthat_0.2.1 sp_1.4-6
## [9] highr_0.9 blob_1.2.2 yaml_2.2.1 tidycensus_1.1
## [13] pillar_1.6.4 RSQLite_2.2.8 lattice_0.20-45 glue_1.5.0
## [17] uuid_1.0-3 digest_0.6.28 rvest_1.0.2 colorspace_2.0-2
## [21] htmltools_0.5.2 pkgconfig_2.0.3 raster_3.5-2 purrr_0.3.4
## [25] scales_1.1.1 terra_1.4-11 tzdb_0.2.0 tigris_1.5
## [29] tibble_3.1.6 proxy_0.4-26 farver_2.1.0 generics_0.1.1
## [33] ellipsis_0.3.2 withr_2.4.2 cachem_1.0.6 repr_1.1.3
## [37] skimr_2.1.3 magrittr_2.0.1 crayon_1.4.2 memoise_2.0.0
## [41] maptools_1.1-2 evaluate_0.14 fansi_0.5.0 xml2_1.3.2
## [45] foreign_0.8-81 class_7.3-19 tools_4.1.2 hms_1.1.1
## [49] lifecycle_1.0.1 stringr_1.4.0 munsell_0.5.0 compiler_4.1.2
## [53] jquerylib_0.1.4 e1071_1.7-9 rlang_0.4.12 classInt_0.4-3
## [57] units_0.7-2 grid_4.1.2 rappdirs_0.3.3 labeling_0.4.2
## [61] base64enc_0.1-3 rmarkdown_2.11 gtable_0.3.0 codetools_0.2-18
## [65] DBI_1.1.1 curl_4.3.2 R6_2.5.1 knitr_1.36
## [69] rgdal_1.5-27 fastmap_1.1.0 bit_4.0.4 utf8_1.2.2
## [73] KernSmooth_2.23-20 readr_2.1.0 stringi_1.7.5 Rcpp_1.0.7
## [77] vctrs_0.3.8 sf_1.0-4 tidyselect_1.1.1 xfun_0.28
Powered by labsheet