Essential Statistics with R and Quarto
Introduction: Demo
- Department of Mathematics and Computational Sciences, University of Zimbabwe
- Business and Systems Analyst at @Optimum
- Biostatistician and infectious disease modeler.
What is R?
- R is an open source programming language and environment.
- It is designed for data analysis, graphical display and data simulations.
- It is one of the world’s leading statistical programming environments.
Why use R?
- R is open-source! This means that it is free, and constantly being updated and improved.
- It is compatible. R works on most existing operating systems.
- R can help you create tables, produce graphs and do your statistics, all within the same program. So with R, there is no need to use more than one program to manage data for your publications. Everything can happen in one single program.
- More and more scientists are using R every year. This means that its capacities are constantly growing and will continue to increase over time. This also means that there is a big online community of people that can help with any problems you have in R.
Question 1
Which of the following statements about R are True:
✓R is open-source
✗You need to pay a monthly fee to Posit in order to use R
✗R is mainly used by software developers
✗R is used for game development and general applications
Question 2
R Stands for React:
✗TRUE
✓FALSE
R as a One stop shop for data science workflows
Always saves my day!
very flexible
Amazing Community
all you need
RStudio is an integrated development environment (IDE) for R. Basically, it’s a place where you can easily use the R language, visualize tables and figures and even run all your statistical analyses. We recommend using it instead of the traditional command line as it provides great visual aid and a number of useful tools that you will learn more about over the course of this workshop.
It includes a console, and a syntax-highlighting editor that supports direct code execution with tools for plotting, history, debugging and workspace management.
It integrates with R (and other programming languages) to provide a lot of useful features:
- RStudio supports authoring HTML, PDF, Word, and presentation documents
- RStudio supports version control with Git (including GitHub) and Subversion
- RStudio makes it easy to start new or find existing projects
- RStudio supports interactive graphics with Shiny and ggvis
There are other IDEs for R: Atom, Visual Studio, Jupyter Notebook, and JupyterLab.
Open RStudio
The RStudio interface
When you open RStudio for the first time, the screen is divided into three main pane groups:
- Console, Terminal, Job group;
- Environment, History, Connections group; and
- Files, Plots, Packages, Help, Viewer group.
Once you open a script or create a new one (File > New File > R Script, or Ctrl/Cmd + Shift + N), a fourth pane group, the Script pane, will appear.
[Figures: the RStudio interface, pane 1, pane 2, and the RStudio cheatsheet]
Writing an R script
An R script is a text file that contains all of the commands you will use. Once written and saved, your R script will allow you to make changes and re-run analyses with little effort.
Question 3
Which of the following statements about RStudio is FALSE:
✗RStudio supports authoring HTML, PDF, Word, and presentation documents.
✗RStudio supports version control with Git (including GitHub) and Subversion.
✗RStudio makes it easy to start new or find existing projects
✓RStudio does not support interactive graphics with Shiny and ggvis
- Use the # symbol to denote comments in scripts.
- The # symbol tells R to ignore anything remaining on a given line of the script when running commands.
- Since comments are ignored when running a script, they allow you to leave yourself notes in your code or tell collaborators what you did.
- A script with comments is a good step towards reproducible science, and annotating someone’s script is a good way to learn. Try to be as detailed as possible!
# This is a comment, not a command
Header
It is recommended that you use comments to put a header at the beginning of your script with essential information: project name, author, date, version of R
## R for Actuaries training ##
## session 1 - introduction to basic R
## Author: Bongani Ncube
## Date:
## R version 4.2.2
Section Heading
You can use four # signs in a row to create section headings to help organize your script. This allows you to move quickly between sections and hide sections. For example:
#### Housekeeping ####
Housekeeping
- The first command at the top of all scripts should be rm(list=ls()). This will clear R’s memory, and will help prevent errors such as using old data that has been left in your workspace.
A<-"Test" # Put some data in the workspace
A <- "Test" # Add some spaces to organize your code!
A = "Test" # You can do this, but it does not mean you should
# Check objects in the workspace
ls()
# [1] "A"
A
# [1] "Test"
# Clean workspace
rm(list=ls())
A
# Error: object 'A' not found
Important Reminders
- R is ready for commands when you see the chevron ‘>’ displayed in the console. If the chevron isn’t displayed, it means you typed an incomplete command and R is waiting for more input. Press ESC to exit and get R ready for a new command.
- R is case sensitive. i.e. “A” is a different object than “a”
A<-10 # uppercase A
a<-5 # lowercase a is a separate object
a
A
rm(list=ls()) # Clears R workspace again
Question 4
Which of the following symbols is used for comments in R:
✓#
✗%--%
✗//
✗/*
Using R as a calculator
The first thing to know about the R console is that you can use it as a calculator.
Arithmetic Operators
- Additions
- Subtractions
- Multiplications
- Divisions
- Exponents
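As a quick illustration of the operators listed above, each one can be typed directly into the console. The numbers here are arbitrary and chosen only for demonstration.

```r
# One example per arithmetic operator
1 + 1    # addition, gives 2
10 - 1   # subtraction, gives 9
2 * 2    # multiplication, gives 4
8 / 2    # division, gives 4
2^3      # exponent, gives 8
```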
Exercise 1
Use R to calculate the following skill testing question:
\(2 + 16 * 24 - 56\)
Exercise 2
Use R to calculate the following skill testing equation:
\(2 + 16 * 24 - 56 / (2 + 1) - 457\)
Pay attention to the order of operations when thinking about this question!
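Note that R always follows the standard order of operations: parentheses first, then exponents, then multiplication and division, and finally addition and subtraction.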
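A small sketch of how operator priority changes a result. The values are made up for illustration and deliberately differ from the exercises.

```r
# Multiplication is evaluated before addition
2 + 3 * 4      # gives 14

# Parentheses override the default priority
(2 + 3) * 4    # gives 20

# Exponents are evaluated before multiplication
2 * 3^2        # gives 18

# Exponents also bind tighter than a leading minus sign
-2^2           # gives -4, not 4
```

When in doubt, adding parentheses makes the intended order explicit.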
Objects
You have learned so far how to use R as a calculator to obtain various numerical values. However, it can get tiresome to always write the same code down in the R console, especially if you have to use some values repeatedly. This is where the concept of object becomes useful.
R is an object-oriented programming language. What this means is that we can allocate a name to values we’ve created to save them in our workspace. An object is composed of three parts:
- a value we are interested in,
- an identifier and
- the assignment operator.
- The value can be almost anything we want: a number, the result of a calculation, a string of characters, a data frame, a plot or a function.
- The identifier is the name you assign to the value. Whenever you want to refer to this value, you simply type the identifier in the R console and R will return its value. Identifiers can include only letters, numbers, periods and underscores, and should always begin with a letter.
- The assignment operator resembles an arrow (<-) and is used to link the value to the identifier.
The following code clarifies these ideas:
Here, (2 + 6) / 2 is the value you want to save as an object. The identifier mean_x is assigned to this value. Typing mean_x returns the value of the calculation (i.e. 4). You have to be careful when typing the identifier because R is case-sensitive: writing mean_x is not the same as writing MEAN_X. You can see that the assignment operator <- creates an explicit link between the value and the identifier. It always points from the value to the identifier. Note that it is also possible to use the equal sign = as the assignment operator.
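The example described above can be written out as follows. This is a minimal sketch reconstructed from the description (the value, the identifier mean_x and the case-sensitivity point all come from the text).

```r
# Assign the result of a calculation to the identifier mean_x
mean_x <- (2 + 6) / 2

# Typing the identifier returns the stored value
mean_x
# [1] 4

# R is case-sensitive: MEAN_X is a different name, and nothing has
# been assigned to it, so exists() reports that it is not defined
exists("MEAN_X")
```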
Name
- Try having short and explicit names for your variables. Naming a variable var is not very informative.
- Use an underscore (_), or a dot (.) to separate words within a name and try to be consistent!
- Avoid using names of existing functions and variables (e.g., c, table, T, etc.)
Space
- Add spaces around all operators (=, +, -, <-, etc.) to make the code more readable.
- Always put a space after a comma, and never before (like in regular English).
Question 4
Which of the following symbols is an assignment operator in R:
✗#
✓<-
✗:=
✗==
Question 5
Which of the following is not a good practice when naming variables:
✗Use an underscore (_), or a dot (.) to separate words within a name and try to be consistent!
✓Use spaces to separate words within a name when you are under pressure
✗Try having short and explicit names for your variables
✓You can use names of existing functions and variables
Question 6
How do you create a variable named x with a numeric value of 5:
✓x<-5
✗x->5
✗int x=5
✗x:5
Question 6
Which one of the following is not an arithmetic operator in R:
✓$
✗+
✗-
✗^
CHALLENGE 1
Create an object with a value of 1 + 1.718282 (Euler’s number) and name it euler_value.
Create an object with the area of a circle that has a radius of 26 cm.
Data types and structure
Core data types in R
Data types define how the values are stored in R. We can obtain the type of an object using the function typeof() (and its mode using mode()). The core data types are:
- Numeric-type with integer and double values
- Character-type (always between " ")
- Logical-type
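The core types listed above can be checked with typeof(). The values below are arbitrary examples.

```r
# Numeric values are stored as double by default
typeof(3.14)     # "double"
typeof(5)        # "double", even though it looks like a whole number

# The L suffix asks R to store the value as an integer
typeof(5L)       # "integer"

# Text must be written between quotes
typeof("hello")  # "character"

# Logical values are TRUE or FALSE
typeof(TRUE)     # "logical"
```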
Question 6
Which one of the following is not a data type:
✗Numeric
✗Character
✗logical
✓vector
Data structure in R: scalars
Until this moment, we have created objects that had just one element inside them. An object that has just a single value or unit like a number or a text string is called a scalar.
combinations of scalars
By creating combinations of scalars, we can create data with different structures in R.
Using R to analyze your data is an important aspect of this software. Data comes in different forms and can be grouped in distinct categories. Depending on the nature of the values enclosed inside your data or object, R classifies them accordingly. The following figure illustrates common objects found in R.
Data structure in R: vectors
- A vector is an entity consisting of several scalars stored in a single object.
- All values in a vector must be of the same mode: numeric, character or logical.
- Character vectors include text strings or a mix of text strings and other modes (R converts the other values to text). You need to use "" to delimit elements in a character vector.
- Logical vectors include TRUE/FALSE entries only.
- A vector whose elements all share one type is called an atomic vector. A single value (usually a constant) is simply an atomic vector of length one.
- When you have more than one value in a vector, you need a way to tell R to group all these values to create a vector. The trick here is to use the c() function in this format: vector.name <- c(value1, value2, value3, ...).
- The function c() means combine or concatenate. It is a quick and easy function so remember it!
examples in R
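A minimal sketch of the three vector modes, built with c(). The object names and values are hypothetical and chosen only for illustration.

```r
# Numeric vector
ages <- c(21, 34, 19)

# Character vector: each element is delimited by quotes
colours <- c("blue", "red", "green")

# Logical vector
passed <- c(TRUE, TRUE, FALSE)

# Mixing modes: R converts every element to character
mixed <- c(1, "two", TRUE)
mixed
# [1] "1"    "two"  "TRUE"
```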
Question 8
How do you create a vector of all the vowels?
✗["a", "e", "i", "o", "u"]
✗("a", "e", "i", "o", "u")
✓c("a", "e", "i", "o", "u")
✗{"a", "e", "i", "o", "u"}
Creating vectors of sequential values:
a:b
The a:b operator takes two numeric scalars a and b as arguments, and returns a vector of numbers from the starting point a to the ending point b, in steps of 1 unit:
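A short sketch of the a:b operator, with illustrative endpoints and hypothetical object names:

```r
# Count up from 1 to 8 in steps of 1
counting_up <- 1:8
counting_up
# [1] 1 2 3 4 5 6 7 8

# When a is larger than b, the sequence counts down
counting_down <- 7:3
counting_down
# [1] 7 6 5 4 3
```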
seq()
seq() allows us to create a sequence, like a:b, but also allows us to specify either the size of the steps (the by argument), or the total length of the sequence (the length.out argument):
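A sketch of both ways of calling seq(), using made-up endpoints and hypothetical object names:

```r
# Fix the step size with the by argument
by_two <- seq(from = 1, to = 10, by = 2)
by_two
# [1] 1 3 5 7 9

# Fix the number of values with the length.out argument
five_values <- seq(from = 0, to = 1, length.out = 5)
five_values
# [1] 0.00 0.25 0.50 0.75 1.00
```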
rep()
rep() allows you to repeat a scalar (or vector) a specified number of times, or to a desired length:
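A sketch of rep() with the times and length.out arguments. The values and object names are illustrative only.

```r
# Repeat a scalar three times
three_fives <- rep(x = 5, times = 3)
three_fives
# [1] 5 5 5

# Repeat a whole vector three times
repeated_pair <- rep(x = c(1, 2), times = 3)
repeated_pair
# [1] 1 2 1 2 1 2

# Repeat a vector until the result has seven elements
to_length <- rep(x = 1:3, length.out = 7)
to_length
# [1] 1 2 3 1 2 3 1
```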
CHALLENGE 2
- Create a vector containing the first five odd numbers (starting from 1) and name it odd_n.
- Create a vector containing any five cities you know in Zimbabwe
Operations using vectors
- Vectors can be used for calculations. The only difference is that when a vector has more than 1 element, the operation is applied to all elements of the vector. The following example clarifies this.
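A minimal sketch of element-wise calculations, using a hypothetical vector x:

```r
x <- c(1, 2, 3, 4, 5)

# The operation is applied to every element
x * 2    # 2 4 6 8 10
x^2      # 1 4 9 16 25

# Two vectors of the same length are combined element by element
x + x    # 2 4 6 8 10
```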
Data structure in R: lists
A list allows you to:
- gather a variety of objects under one name in an ordered way
- these objects can be matrices, vectors, data frames, even other lists
- a list is a kind of super data type
- you can store practically any piece of information in it!
example in R
indexing lists
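A minimal sketch of creating and indexing a list. The list name, element names and values are all hypothetical.

```r
# A list can hold objects of different types and lengths
my_list <- list(
  numbers = c(1, 2, 3),
  city = "Harare",
  flags = c(TRUE, FALSE)
)

# Extract one element by name with $ or [[ ]]
my_list$city           # "Harare"
my_list[["numbers"]]   # 1 2 3

# Extract by position, then index inside the extracted vector
my_list[[1]][2]        # 2

# Single brackets keep the result as a list of length one
my_list[1]
```

The difference between [[ ]] and [ ] matters in practice: the first returns the stored object itself, the second returns a smaller list.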
Data structure in R: matrices
- While vectors have one dimension, matrices have two dimensions, determined by rows and columns.
- Like vectors and scalars, matrices can contain only one type of data: numeric, character, or logical.
There are many ways to create your own matrix. Let us start with a simple one:
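One simple way is the matrix() function. The sketch below uses arbitrary values and a hypothetical object name.

```r
# Fill a 5 x 2 matrix with the numbers 1 to 10.
# By default, values are filled column by column.
my_matrix <- matrix(data = 1:10, nrow = 5, ncol = 2)
my_matrix

# dim() returns the number of rows and columns
dim(my_matrix)   # 5 2
```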
We can also combine multiple vectors using cbind() and rbind():
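A sketch with two hypothetical vectors of equal length:

```r
col_1 <- c(1, 2, 3)
col_2 <- c(4, 5, 6)

# cbind() places each vector in its own column
by_columns <- cbind(col_1, col_2)
dim(by_columns)   # 3 2

# rbind() places each vector in its own row
by_rows <- rbind(col_1, col_2)
dim(by_rows)      # 2 3
```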
calculations with matrices
As with vectors, operations with matrices work just fine:
The product of the matrices is:
Division of matrices
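A sketch of these calculations with two hypothetical 2 x 2 matrices. It assumes that the product and division meant here are the element-wise ones, as with vectors; the linear-algebra product uses a separate operator, %*%.

```r
mat_1 <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)
mat_2 <- matrix(5:8, nrow = 2)   # columns are (5, 6) and (7, 8)

# Element-wise operations: each cell is combined with the matching cell
mat_1 + mat_2
mat_1 * mat_2    # element-wise product
mat_1 / mat_2    # element-wise division

# Matrix (linear algebra) product, for comparison
mat_1 %*% mat_2
```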
CHALLENGE 3
- Create an object containing a matrix with 2 rows and 3 columns, with values from 1 to 6, sorted per column.
- Create another object with a matrix with 2 rows and 3 columns, with the names of six animals you like.
Data structure in R: data frames
- A data frame is a group of vectors of the same length (i.e. the same number of elements). Columns are always variables and rows are observations, cases, sites or replicates.
- Unlike a matrix, a data frame can contain different modes saved in different columns (but always the same mode in a column).
It is in this format that ecological data are usually stored. The following example shows a fictitious dataset representing 4 sites where soil pH and the number of plant species were recorded. There is also a “fertilised” variable (fertilized or not). Let’s have a look at the creation of a data frame.
site_id | soil_pH | num_sp | fertilised |
---|---|---|---|
A1.01 | 5.6 | 17 | yes |
A1.02 | 7.3 | 23 | yes |
B1.01 | 4.1 | 15 | no |
B1.02 | 6.0 | 7 | no |
example in R
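A minimal sketch that builds the dataset shown in the table, one vector per column. The data frame name soil_fertilisation_data is the one used in Challenge 5.

```r
# One vector per column, all of the same length
site_id <- c("A1.01", "A1.02", "B1.01", "B1.02")
soil_pH <- c(5.6, 7.3, 4.1, 6.0)
num_sp <- c(17, 23, 15, 7)
fertilised <- c("yes", "yes", "no", "no")

# Combine the vectors into a data frame
soil_fertilisation_data <- data.frame(site_id, soil_pH, num_sp, fertilised)
soil_fertilisation_data
```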
Note how the data frame integrated the name of the objects as column names
Indexing a vector
Typing an object’s name in R returns the complete object. But what if our object is a huge data frame with millions of entries? It can easily become confusing to identify specific elements of an object. R allows us to extract only part of an object. This is called indexing. We specify the position of values we want to extract from an object with brackets [ ]. The following code illustrates the concept of indexation with vectors.
Let’s do it in R
- Create a vector to use:
- To obtain the value in the second position, we do as follows:
- We can also obtain values for multiple positions within a vector with c()
- We can remove values pertaining to particular positions from a vector using the minus (-) sign before the position value
- If you select a position that is not in the numeric vector
There is no sixth value in this vector so R returns a missing value (i.e. NA). NA stands for ‘Not available’.
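The steps above can be sketched as follows. The vector even_n and its five values are hypothetical, chosen so that a sixth position does not exist.

```r
# Create a vector to use
even_n <- c(2, 4, 6, 8, 10)

# Value in the second position
even_n[2]            # 4

# Values in several positions at once, using c()
even_n[c(2, 4)]      # 4 8

# Everything except the first and second positions
even_n[-c(1, 2)]     # 6 8 10

# A position that does not exist returns NA
even_n[6]            # NA
```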
CHALLENGE 4
Using the vector num_vector and our indexing abilities:
- Extract the 4th value of the num_vector vector.
- Extract the 1st and 3rd values of the num_vector vector.
- Extract all values of the num_vector vector excluding the 2nd and 4th values.
- Extract from the 6th to the 10th value.
Naming vectors
Inspect my_vector using:
- the attributes() function
- the length() function
- the str() function
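A minimal sketch of naming a vector and inspecting it. The contents of my_vector and the names given to its elements are assumptions made for illustration.

```r
# Create a vector, then attach a name to each element
my_vector <- c(1, 2, 3)
names(my_vector) <- c("first", "second", "third")

# Named elements can be extracted by name as well as by position
my_vector["second"]

# Inspect the vector
attributes(my_vector)   # a list containing the names
length(my_vector)       # 3
str(my_vector)          # compact summary of type, size and names
```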
Indexing a data frame
For data frames, the concept of indexation is similar, but we usually have to specify two dimensions: the row and column numbers. The R syntax is dataframe[row number, column number].
Here are a few examples of data frame indexation. Note that the first four operations are also valid for indexing matrices.
We can subset columns from it using the column names:
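A sketch of both kinds of indexing, using the soil dataset from the table shown earlier (recreated here so the block runs on its own). The particular rows and columns extracted are arbitrary.

```r
soil_fertilisation_data <- data.frame(
  site_id = c("A1.01", "A1.02", "B1.01", "B1.02"),
  soil_pH = c(5.6, 7.3, 4.1, 6.0),
  num_sp = c(17, 23, 15, 7),
  fertilised = c("yes", "yes", "no", "no")
)

# Indexing by position (these four also work on matrices)
soil_fertilisation_data[1, ]          # first row, all columns
soil_fertilisation_data[, 3]          # third column, all rows
soil_fertilisation_data[2, 3]         # row 2, column 3: 23
soil_fertilisation_data[c(2, 4), 3]   # rows 2 and 4 of column 3: 23 7

# Indexing by column name
soil_fertilisation_data$num_sp
soil_fertilisation_data[, c("site_id", "soil_pH")]
```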
A quick note on logical statements
R gives you the possibility to test logical statements, i.e. to evaluate whether a statement is true or false. You can compare objects with the following logical operators:
Operator | Description |
---|---|
< | less than |
<= | less than or equal to |
> | greater than |
>= | greater than or equal to |
== | exactly equal to |
!= | not equal to |
x | y | x OR y |
x & y | x AND y |
The following examples illustrate how to use these operators properly.
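A minimal sketch of the operators in the table, using two hypothetical values x and y:

```r
x <- 5
y <- 10

x < y              # TRUE
x == y             # FALSE
x != y             # TRUE

# Combine conditions with & (AND) and | (OR)
x > 3 & y > 20     # FALSE: only the first condition holds
x > 3 | y > 20     # TRUE: at least one condition holds

# Comparisons on a vector are applied to every element
c(1, 6, 12) >= 6   # FALSE TRUE TRUE
```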
testing conditions
We can, for instance, test if values within a vector or a matrix are numeric:
Or whether they are of the character type:
And, also, if they are vectors:
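These checks are usually done with the is.*() family of functions. A sketch with hypothetical objects:

```r
num_values <- c(1, 2, 3)
char_values <- c("a", "b")
num_matrix <- matrix(1:4, nrow = 2)

# Are the values numeric?
is.numeric(num_values)     # TRUE
is.numeric(num_matrix)     # TRUE

# Are they of the character type?
is.character(num_values)   # FALSE
is.character(char_values)  # TRUE

# Is the object a vector? A matrix has dimensions, so it is not.
is.vector(num_values)      # TRUE
is.vector(num_matrix)      # FALSE
```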
CHALLENGE 5
- Extract the num_sp column from soil_fertilisation_data and multiply its values by the first four values of the num_vector vector.
- After that, write a statement that checks if the values you obtained are greater than 25. Refer to challenge 9 to complete this challenge.
R packages
- R packages extend the functionality of R by providing additional functions, and can be downloaded for free from the internet.
Install and load an R package
The ggplot2 package is a very popular package for data visualisation.
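A typical pattern is to install a package once and then load it in each session. In the sketch below the install line is left commented out so that re-running the script does not reinstall the package, and the availability check (with its hypothetical variable pkg_available) is an optional safeguard added for illustration.

```r
# Install once per machine (needs an internet connection)
# install.packages("ggplot2")

# Check whether the package is installed, then load it for this session
pkg_available <- requireNamespace("ggplot2", quietly = TRUE)
if (pkg_available) {
  library(ggplot2)
}
```

Installation is needed only once, whereas library() must be called in every new R session before the package's functions can be used.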