(Part 1)
UC San Diego
R is a powerful data programming language that you can use to explore and understand data in an open-ended, highly interactive, iterative way.
Learning R will give you the freedom to experiment and problem solve during data analysis — exactly what we need as bioinformaticians.
Before delving into working with real data in R, we need to learn some basics of the R language. In this section, we’ll learn how to do simple calculations in R, assign values to variables, and call functions.
Then, in part 2, we’ll look in more detail at R’s vectors and data.frames, along with the critically important topic of vectorization that underpins how we approach many problems in R.
Below is a interactive “Example Exercise” code chunk that you can edit and run by clicking the > Run Code button.
Give it a try:
Note that you can click on line 2 and edit the code if you wish.
We will use these types of “code chunks” to ask you to complete some missing code, typically indicated by the blank ______ place holder.
For example, complete the code below so your answer sums to 10:
Hint
Your job here is to replace the ______
blank with your answer, in this case a number, so the final answer sums to 10. Give it a try and be sure to run the code chunk to see if you are correct.
Solution: note no high level math is required for this class ;-)
15 + 3 + 2
OK, so R is a big calculator but it is also so much more than that. For example, we will see soon that R can do complicated statistical analysis on big datasets that you would never want to use a calculator for.
What’s more R allows you to do your analysis in a robust and re-usable way so you can automate things later.
At the heart of this is saving your answers as you go along (a.k.a. object assignment).
Lets make an assignment and then inspect the object you just created.
All R statements where you create objects have this form:
objectName <- value
and in my head I hear, e.g., “x gets 12”.
You will make lots of assignments and the little arrow operator <-
is a pain to type. Don’t be lazy and use =
, although it would work, because it will just sow confusion later when we learn about functions.
In RStudio folks will often utilize the keyboard shortcut:
Alt
+-
(that is pushing theAlt
andminus
keys together or theOption
andminus
keys if you are on a Mac).
Object names cannot start with a digit and can not contain certain other “special” characters such as a comma or a space.
You will be wise to adopt a convention for demarcating words in names so as to avoid spaces, for example:
i_use_snake_case
other.people.use.periods
evenOthersUseCamelCase
only-crazy-folks-use-kabab-case
Let’s make another assignment
and then try to inspect this object:
What happened? What about the 2nd line? Did you run both chunks? Can you fix this?
When we program we enter into an implicit contract with the computer. The computer will do tedious computation for you. In return, you will be completely precise in your instructions.
Typos matter. Case matters. The order that you do things matters.
Therefore get better at typing!
Being productive in R is all about using functions - we call functions when we read a dataset, perform a statistical analysis, make a fancy visualization or do just about anything else.
All functions are accessed like so:
functionName(arg1 = val1, arg2 = val2, and so on)
Notice that we use the =
sign here for setting function inputs (a.k.a. setting function “arguments”) and that each one is separated by a comma.
Let’s try using the seq()
function, which makes regular sequences of numbers:
You can always access the help/documentation of a particular function by using the help(myFunction)
function, e.g. help(seq)
. Try it out above.
The previous seq(1, 10)
example above also demonstrates something about how R resolves function arguments.
You can always specify in name = value
form e.g. seq(from=1, to=10, by=1)
. But if you do not, R attempts to resolve by position (1st argument, 2nd argument, etc.).
So in seq(1, 10)
it is assumed that we want a sequence from = 1
that goes to = 10
. Since we didn’t specify step size, the default value of by
in the function definition is used, which ends up being 1 in this case.
What does the following command do?
Run it and see. Could you of figured out this result from the documentation accessed via the command help(seq)
?
One very useful part of a functions documentation is the Examples section (typically found at the very end of the documentation entry).
You can copy and paste these into your session or execute them all in one go with the examples()
function, e.g.
R has built in help accessible with the question mark or help()
function.
However, it is rather heavy on nerd-speak and takes some getting used to.
Remember Google and ChatGPT are your friends. Asking Google and ChatGPT is often the quickest way to get the help you need before you are full on nerd-speak fluent.
We will go into more details of vectors later 😉
Vectors are the most fundamental data structures in R. Their job is to hold a contiguous collection of “elements” of the same “type”.
For example a set of numbers (what we call numeric type), words (character type) or True and False vales (logical type).
To create vectors we combine values with the function c()
:
Vectors are the basis of one of R’s most important features: vectorization. Vectorization allows us to loop over all elements in a vector without the need to write an explicit loop.
For example, R’s arithmetic operators (and tones of other functions) are all vectorized:
Unlike other languages, R allows us to completely forgo explicitly looping over vectors with a for loop.
Later on, we’ll see other methods used for more explicit looping. For now I want you to appreciate that the vectorized approach is not only more clear and readable but it’s also computationally faster.
An index is a number (or logical) vector that specifies which element, or subset of elements, in a vector you want to retrieve.
Basically, we use indexing to get to certain elements of a vector:
We can change specific vector elements by combining indexing and assignment. For example:
Our next session will cover more on vectors, and the other major R data structures (data.frames and lists).
A data.frame is a table (or two-dimensional array-like structure) where columns can contain different types of data (numeric, character, logical, etc.).
Data.frames are ideal for managing and analyzing datasets with rows and columns like we find in spreadsheets.
We can use numeric or logical index values for subseting a data.frame, similar to those we used for vectors but with one number for rows and the second for columns.
You can also access specific columns via their column name:
R will show red text in the console pane in three different situations:
Errors: When the red text is a legitimate error, it will be prefaced with “Error in…” and will try to explain what went wrong. Generally when there’s an error, the code will not run.
Warnings: When the red text is a warning, it will be prefaced with “Warning:’ and R will try to explain why there’s a warning. Generally your code will still work, but with some caveats.
Messages: When the red text doesn’t start with either “Error” or “Warning”, it’s just a friendly message. You’ll see these messages when you load R packages for example.
R reports errors, warnings, and messages in a glaring red font, which makes it seem like R is scolding you.
Don’t panic, it doesn’t necessarily mean anything is wrong and indeed it is not always a bad thing. However, you should read this text to make sure things are as you expect.
Over time we will learn more “R speak” and get more comfortable generally working in the console. Again Google and ChatGPT can help you interpret these important messages.
Well done for getting this far !
Feel free to close this now 😀