This is an R Markdown document. See this website for more information.
We’ll start by entering data directly into R. If we wanted the numbers one through six…
x <- c(1, 2, 3, 4, 5, 6)
Wow! That was easy. Here we’ve used the c function, with the arguments one through six separated by commas, to create a numeric vector, and the assignment operator <- to assign it to an object we’ve named x. We’ve just run our first R expression.
We can check what is stored in x just by typing x.
x
## [1] 1 2 3 4 5 6
What about this?
x <- seq(from = 1, to = 6)
x
## [1] 1 2 3 4 5 6
The seq function is another way of creating vectors. Here we’ve used its named arguments from and to.
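seq has other useful named arguments too, such as by (the step size) and length.out (the number of values to produce). For example, this should give the numbers from one to six in steps of 0.5:
seq(from = 1, to = 6, by = 0.5)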
And this?
x <- 1:6
x
## [1] 1 2 3 4 5 6
If we just want a vector of consecutive integers the colon operator is the easiest option.
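The colon can also count downwards when the first number is the larger one:
6:1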
Often we’ll need to get some tabular data into R. The function read.csv reads data into R but expects the data to be formatted in a particular way: a plain text file with values separated by commas and the first line containing the column names.
seedling_density <- read.csv("seedling_density.csv")
Let’s look at the first six rows of data with the head function.
head(seedling_density)
## SITE SEVERITY GRAZED ALTITUDE QUADRAT SPECIES SEEDLING_DENSITY
## 1 B1011O Unburnt FALSE 1860 1 ASTTRY 0
## 2 B1011O Unburnt FALSE 1860 1 GREAUS 0
## 3 B1011O Unburnt FALSE 1860 1 PHESQU 0
## 4 B1011O Unburnt FALSE 1860 2 ASTTRY 0
## 5 B1011O Unburnt FALSE 1860 2 GREAUS 0
## 6 B1011O Unburnt FALSE 1860 2 PHESQU 0
The function read.csv has stored the data as a data.frame.
In RStudio we can easily get help for functions by searching in the help pane.
The R language has a package for just about everything. install.packages("rmarkdown") will install the rmarkdown package. rmarkdown extends R so we can compile our R Markdown files and turn them into documents!
install.packages("rmarkdown", repos = "https://cloud.r-project.org")
Yay, it installed!
Now we have to load the package with the library function.
library(rmarkdown)
We’ve used head already. tail will show us the last six rows instead of the first six.
tail(seedling_density)
## SITE SEVERITY GRAZED ALTITUDE QUADRAT SPECIES SEEDLING_DENSITY
## 175 C2131O Medium TRUE 1720 4 ASTTRY 5
## 176 C2131O Medium TRUE 1720 4 GREAUS 20
## 177 C2131O Medium TRUE 1720 4 PHESQU 10
## 178 C2131O Medium TRUE 1720 5 ASTTRY 5
## 179 C2131O Medium TRUE 1720 5 GREAUS 10
## 180 C2131O Medium TRUE 1720 5 PHESQU 20
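Both head and tail also take an n argument if we want something other than six rows. For example, just the first three:
head(seedling_density, n = 3)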
It’s a data.frame, right? The class function will tell us if this is the case.
class(seedling_density)
## [1] "data.frame"
Yes, yes it is.
This class stuff is interesting; let’s investigate further.
class(1)
## [1] "numeric"
class("a")
## [1] "character"
class(factor("a"))
## [1] "factor"
Cool!
How long is this data.frame?
length(seedling_density)
## [1] 7
That looks like the number of columns.
Let’s make sure though.
ncol(seedling_density)
## [1] 7
The colnames function can show us the column names.
colnames(seedling_density)
## [1] "SITE" "SEVERITY" "GRAZED"
## [4] "ALTITUDE" "QUADRAT" "SPECIES"
## [7] "SEEDLING_DENSITY"
So how many rows are there?
nrow(seedling_density)
## [1] 180
Wow, that’s a lot.
Even though it’s actually a list, a data.frame can behave like a matrix. In that case the dim function should tell us the number of rows and columns, i.e., the dimensions.
dim(seedling_density)
## [1] 180 7
It does indeed.
Let’s see how it’s all put together then.
str(seedling_density)
## 'data.frame': 180 obs. of 7 variables:
## $ SITE : Factor w/ 12 levels "B1011O","B1041O",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ SEVERITY : Factor w/ 4 levels "High","Low","Medium",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ GRAZED : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ ALTITUDE : int 1860 1860 1860 1860 1860 1860 1860 1860 1860 1860 ...
## $ QUADRAT : int 1 1 1 2 2 2 3 3 3 4 ...
## $ SPECIES : Factor w/ 3 levels "ASTTRY","GREAUS",..: 1 2 3 1 2 3 1 2 3 1 ...
## $ SEEDLING_DENSITY: int 0 0 0 0 0 0 0 0 0 1 ...
Wow that’s interesting stuff.
We can subset a data.frame with square brackets. To get the equivalent of the head function we do the following.
seedling_density[1:6, ]
## SITE SEVERITY GRAZED ALTITUDE QUADRAT SPECIES SEEDLING_DENSITY
## 1 B1011O Unburnt FALSE 1860 1 ASTTRY 0
## 2 B1011O Unburnt FALSE 1860 1 GREAUS 0
## 3 B1011O Unburnt FALSE 1860 1 PHESQU 0
## 4 B1011O Unburnt FALSE 1860 2 ASTTRY 0
## 5 B1011O Unburnt FALSE 1860 2 GREAUS 0
## 6 B1011O Unburnt FALSE 1860 2 PHESQU 0
Rows are before the comma and columns are after it.
So the first six rows of only the first six columns look like this.
seedling_density[1:6, 1:6]
## SITE SEVERITY GRAZED ALTITUDE QUADRAT SPECIES
## 1 B1011O Unburnt FALSE 1860 1 ASTTRY
## 2 B1011O Unburnt FALSE 1860 1 GREAUS
## 3 B1011O Unburnt FALSE 1860 1 PHESQU
## 4 B1011O Unburnt FALSE 1860 2 ASTTRY
## 5 B1011O Unburnt FALSE 1860 2 GREAUS
## 6 B1011O Unburnt FALSE 1860 2 PHESQU
We already stored the numbers one through six in a variable called x. We could use that object instead of 1:6.
seedling_density[x, x]
## SITE SEVERITY GRAZED ALTITUDE QUADRAT SPECIES
## 1 B1011O Unburnt FALSE 1860 1 ASTTRY
## 2 B1011O Unburnt FALSE 1860 1 GREAUS
## 3 B1011O Unburnt FALSE 1860 1 PHESQU
## 4 B1011O Unburnt FALSE 1860 2 ASTTRY
## 5 B1011O Unburnt FALSE 1860 2 GREAUS
## 6 B1011O Unburnt FALSE 1860 2 PHESQU
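Columns can also be selected by name rather than by position, which is often easier to read:
head(seedling_density[, c("SITE", "SPECIES", "SEEDLING_DENSITY")])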
Since data.frames are lists we can refer to their columns with a dollar sign.
seedling_density$SEEDLING_DENSITY
## [1] 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 50 10 0 50 5 0 20 10
## [24] 0 10 20 0 10 20 0 0 0 5 5 1 1 10 1 0 50 1 5 20 5 5 10
## [47] 1 0 1 0 0 1 0 0 5 0 0 10 0 0 0 0 0 0 0 0 0 0 0
## [70] 0 1 0 0 5 20 5 0 0 5 10 0 10 0 0 0 0 0 1 0 0 0 0
## [93] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 10 0 5 0 10 5 10 10
## [116] 5 0 0 0 0 0 5 0 0 5 0 0 20 0 0 10 0 5 5 0 1 5 50
## [139] 5 20 20 0 5 50 0 5 50 0 1 50 1 10 0 50 5 0 1 5 0 10 1
## [162] 0 50 20 0 20 20 0 20 20 0 20 5 20 5 20 10 5 10 20
The subset function is handy for subsetting rows based on logical expressions. In this case we extract the rows that have the value “Unburnt” in the SEVERITY column.
seedling_density_unburnt <- subset(seedling_density, SEVERITY == "Unburnt")
Note the double equals sign, ==. It means ‘is equal to?’, as opposed to a single equals sign, =, which, like <-, you could think of as ‘make equal to’.
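We can also combine logical expressions inside subset, using & for ‘and’ and | for ‘or’. For example, the unburnt rows for just one species:
head(subset(seedling_density, SEVERITY == "Unburnt" & SPECIES == "ASTTRY"))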
We can also use negative indices to remove columns or rows.
(seedling_density_unburnt <- seedling_density_unburnt[, -2])
## SITE GRAZED ALTITUDE QUADRAT SPECIES SEEDLING_DENSITY
## 1 B1011O FALSE 1860 1 ASTTRY 0
## 2 B1011O FALSE 1860 1 GREAUS 0
## 3 B1011O FALSE 1860 1 PHESQU 0
## 4 B1011O FALSE 1860 2 ASTTRY 0
## 5 B1011O FALSE 1860 2 GREAUS 0
## 6 B1011O FALSE 1860 2 PHESQU 0
## 7 B1011O FALSE 1860 3 ASTTRY 0
## 8 B1011O FALSE 1860 3 GREAUS 0
## 9 B1011O FALSE 1860 3 PHESQU 0
## 10 B1011O FALSE 1860 4 ASTTRY 1
## 11 B1011O FALSE 1860 4 GREAUS 1
## 12 B1011O FALSE 1860 4 PHESQU 0
## 13 B1011O FALSE 1860 5 ASTTRY 0
## 14 B1011O FALSE 1860 5 GREAUS 0
## 15 B1011O FALSE 1860 5 PHESQU 0
## 91 C1071O TRUE 1700 1 ASTTRY 0
## 92 C1071O TRUE 1700 1 GREAUS 0
## 93 C1071O TRUE 1700 1 PHESQU 0
## 94 C1071O TRUE 1700 2 ASTTRY 0
## 95 C1071O TRUE 1700 2 GREAUS 0
## 96 C1071O TRUE 1700 2 PHESQU 0
## 97 C1071O TRUE 1700 3 ASTTRY 0
## 98 C1071O TRUE 1700 3 GREAUS 0
## 99 C1071O TRUE 1700 3 PHESQU 0
## 100 C1071O TRUE 1700 4 ASTTRY 0
## 101 C1071O TRUE 1700 4 GREAUS 0
## 102 C1071O TRUE 1700 4 PHESQU 0
## 103 C1071O TRUE 1700 5 ASTTRY 0
## 104 C1071O TRUE 1700 5 GREAUS 0
## 105 C1071O TRUE 1700 5 PHESQU 0
This removes the SEVERITY column, which we don’t need, since it now contains only one value in our subset. Note the parentheses wrapped around the whole expression; these force the result to be printed as well as assigned.
Let’s read in a new dataset of rainforest leaves from New Caledonia.
leaves <- read.csv("leaves.csv")
What do these data look like?
str(leaves)
## 'data.frame': 199 obs. of 9 variables:
## $ PLOT : Factor w/ 3 levels "ae2","can","cm1": 1 1 1 1 1 1 1 1 1 1 ...
## $ TREE : Factor w/ 68 levels "1","102","106",..: 30 30 30 31 31 31 32 32 32 33 ...
## $ LEAF : int 1 2 3 1 2 3 1 2 3 1 ...
## $ AREA : num 687473 1204104 623927 145695 161048 ...
## $ LENGTH : num 1436 1984 1381 629 661 ...
## $ PETIOLE_LENGTH: num 78.4 81 65.4 11.3 12.6 ...
## $ MARGIN_ENTIRE : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ CURVE : Factor w/ 3 levels "L","N","R": 3 1 1 2 1 2 2 2 2 2 ...
## $ WEIGHT : num 1.732 1.903 3.834 0.774 0.807 ...
Often we’d like to work with transformed data. Like a natural log maybe?
lnAREA <- log(leaves$AREA)
Now we have a new object that is the natural log of leaf area.
It’s not great to have these floating around on their own though; we should really stick them back in the data.frame.
leaves <- cbind(leaves, lnAREA)
str(leaves)
## 'data.frame': 199 obs. of 10 variables:
## $ PLOT : Factor w/ 3 levels "ae2","can","cm1": 1 1 1 1 1 1 1 1 1 1 ...
## $ TREE : Factor w/ 68 levels "1","102","106",..: 30 30 30 31 31 31 32 32 32 33 ...
## $ LEAF : int 1 2 3 1 2 3 1 2 3 1 ...
## $ AREA : num 687473 1204104 623927 145695 161048 ...
## $ LENGTH : num 1436 1984 1381 629 661 ...
## $ PETIOLE_LENGTH: num 78.4 81 65.4 11.3 12.6 ...
## $ MARGIN_ENTIRE : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ CURVE : Factor w/ 3 levels "L","N","R": 3 1 1 2 1 2 2 2 2 2 ...
## $ WEIGHT : num 1.732 1.903 3.834 0.774 0.807 ...
## $ lnAREA : num 13.4 14 13.3 11.9 12 ...
This combines the original data with the new columns. cbind stands for column bind.
But we could skip these steps and just add transformations straight to the data.frame.
leaves$clnAREA <- lnAREA - mean(lnAREA)
str(leaves)
## 'data.frame': 199 obs. of 11 variables:
## $ PLOT : Factor w/ 3 levels "ae2","can","cm1": 1 1 1 1 1 1 1 1 1 1 ...
## $ TREE : Factor w/ 68 levels "1","102","106",..: 30 30 30 31 31 31 32 32 32 33 ...
## $ LEAF : int 1 2 3 1 2 3 1 2 3 1 ...
## $ AREA : num 687473 1204104 623927 145695 161048 ...
## $ LENGTH : num 1436 1984 1381 629 661 ...
## $ PETIOLE_LENGTH: num 78.4 81 65.4 11.3 12.6 ...
## $ MARGIN_ENTIRE : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ CURVE : Factor w/ 3 levels "L","N","R": 3 1 1 2 1 2 2 2 2 2 ...
## $ WEIGHT : num 1.732 1.903 3.834 0.774 0.807 ...
## $ lnAREA : num 13.4 14 13.3 11.9 12 ...
## $ clnAREA : num 0.95 1.51 0.853 -0.601 -0.501 ...
This will center the logged leaf area vector around zero.
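We can check that the centring worked: the mean of the new column should be (numerically) zero.
mean(leaves$clnAREA)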
We can get rid of the extra object we don’t need now.
rm(lnAREA)
Now let’s have a closer look at the indexing variables for the leaves dataset, PLOT, TREE and LEAF. How many rows are there?
nrow(leaves)
## [1] 199
So how many unique leaf ID’s are there?
unique(leaves$LEAF)
## [1] 1 2 3 4
That’s no good. There are only four unique leaf numbers; ideally we’d like an identifier with as many unique values as there are rows.
What about the combination of TREE and LEAF?
nrow(unique(leaves[2:3]))
## [1] 199
That’s more like it.
Let’s create a proper leaf ID variable based on the TREE and LEAF columns.
leaves$LEAF_ID <- do.call(paste, leaves[2:3])
The do.call function feeds the two columns of leaves to paste as a list of arguments.
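With only two columns involved, we could have built the same vector by calling paste directly (here we just peek at the first few values):
head(paste(leaves$TREE, leaves$LEAF))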
And the result is…
leaves$LEAF_ID
## [1] "AR0 1" "AR0 2" "AR0 3" "BP0 1" "BP0 2" "BP0 3" "BW0 1"
## [8] "BW0 2" "BW0 3" "BW1 1" "BW1 2" "BW1 3" "CA0 1" "CA0 2"
## [15] "CA0 3" "CA1 1" "CA1 2" "CA1 3" "CA2 1" "CA2 2" "CN0 1"
## [22] "CN0 2" "CN0 3" "CS0 1" "CS0 2" "CS0 3" "EY0 1" "EY0 2"
## [29] "EY0 3" "GA0 1" "GA0 2" "GA0 3" "GA1 1" "GA1 2" "GA1 3"
## [36] "GA2 1" "GA2 2" "GA2 3" "GA2 4" "L5 1" "L5 2" "L5 3"
## [43] "L6 1" "L6 2" "L6 3" "L8 1" "L8 2" "L8 3" "LP0 1"
## [50] "LP0 2" "LP0 3" "PD0 1" "PD0 2" "PF0 1" "PF0 2" "PF0 3"
## [57] "PF1 1" "PF1 2" "PF1 3" "PF2 1" "PF2 2" "PF2 3" "PF3 1"
## [64] "PF3 2" "PF3 3" "PL1 1" "PL1 2" "PL1 3" "PT0 1" "PT0 2"
## [71] "PT0 3" "PT1 1" "PT1 2" "PT1 3" "PT2 1" "PT2 2" "PT2 3"
## [78] "SA0 1" "SA0 2" "SA0 3" "SCH0 2" "SCH0 3" "SM0 1" "SM0 2"
## [85] "SM0 3" "SM1 1" "SM1 2" "SM1 3" "SN0 1" "SN0 2" "SN0 3"
## [92] "SP0 1" "SP0 2" "SP0 3" "ST0 1" "ST0 2" "ST0 3" "SYZ0 1"
## [99] "SYZ0 2" "TR0 1" "TR0 2" "TR0 3" "AN 1" "AN 2" "DM 1"
## [106] "DM 2" "DM 3" "L77 1" "L77 2" "L77 3" "LN 1" "LN 2"
## [113] "LN 3" "LN 4" "N1 1" "N1 2" "N1 3" "1 1" "1 2"
## [120] "1 3" "3 1" "3 2" "3 3" "46 1" "46 2" "46 3"
## [127] "50 1" "50 2" "50 3" "54 1" "54 2" "54 3" "55 1"
## [134] "55 2" "55 3" "67 1" "67 2" "67 3" "86 1" "86 2"
## [141] "86 3" "87 1" "87 2" "87 3" "102 1" "102 2" "102 3"
## [148] "106 1" "106 2" "106 3" "111 1" "111 2" "111 3" "113 1"
## [155] "113 2" "113 3" "114 1" "114 2" "114 3" "114 4" "118 1"
## [162] "118 2" "118 3" "119 1" "119 2" "119 3" "120 1" "120 2"
## [169] "120 3" "127 1" "127 2" "127 3" "129 1" "129 2" "129 3"
## [176] "135 1" "135 2" "149 1" "149 2" "149 3" "150 1" "150 2"
## [183] "150 3" "151 1" "151 2" "151 3" "157 1" "157 2" "157 3"
## [190] "158 1" "158 2" "158 3" "163 1" "163 2" "165 1" "165 2"
## [197] "165 3" "168 1" "168 2"
It worked!
So the number of unique values of this new column should now be the same as the number of rows in the leaves data.frame.
length(unique(leaves$LEAF_ID)) == nrow(leaves)
## [1] TRUE
Let’s learn how to merge two data frames with a common variable. We’ll use the seedling density dataset, and a new dataset of plant community cover values collected at the same time.
fire_community <- read.csv("fire_community.csv")
Let’s look at the structure.
str(fire_community)
## 'data.frame': 10 obs. of 14 variables:
## $ SITE : Factor w/ 10 levels "B1011O","B1041O",..: 6 5 7 10 1 4 2 8 9 3
## $ ACANOV: num 0 0 0 0.1 0.2 0 0 0.2 0.5 0.5
## $ AUSVEL: num 0 0.8 0.2 0 0 0 0 0 0 0.5
## $ CARHEB: num 0 0 0 0.2 0 0.5 0.2 0.7 0.5 0.5
## $ EUCFOR: num 0 0 0 0 0.3 0.7 0 0 0 0
## $ GERANT: num 0 0.1 0.7 0.2 0.2 0.6 0.9 0 0 0.2
## $ LEPFIL: num 0 0 0 0 0 0.1 0 0 0 0.1
## $ PHESQU: num 0 0 0.6 0.3 0 0.5 0 2.5 0 0
## $ PIMALP: num 0 0.1 0.2 0.2 0.4 0.7 0.9 0.9 3.8 4.1
## $ RANVIC: num 0.9 0.9 1.5 0.3 0.5 2.1 0 0.2 0.6 0.9
## $ RYTNUD: num 0.4 0 0.2 0 0 1 0 0 3.1 1
## $ SCLBIF: num 0.4 0.4 0 0 0.4 0 0 0.1 0 0
## $ TRISPI: num 0.3 0 0 0.1 0.2 0 0 0 0 0
## $ VIOBET: num 0 0.1 0.1 0.5 0 0.6 0.3 0.5 0 0.5
The merge function will by default join two data frames indexed by their common columns.
seedling_community <- merge(
fire_community[, c(1, 8)],
subset(seedling_density, SPECIES == "PHESQU")[, -6]
)
Here we have merged the cover data for the species PHESQU from the fire_community dataset with the seedling density data for the same species, indexing by SITE, the column common to both data.frames. We drop the SPECIES column with -6 because there is now only one species present in our merged data table.
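If we want to be explicit about the key column (or if it had different names in the two data frames), merge lets us name it with the by argument (or by.x and by.y). The call below should be equivalent to the one above:
seedling_community <- merge(
  fire_community[, c("SITE", "PHESQU")],
  subset(seedling_density, SPECIES == "PHESQU")[, -6],
  by = "SITE"
)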
Let’s see if it worked.
str(seedling_community)
## 'data.frame': 50 obs. of 7 variables:
## $ SITE : Factor w/ 10 levels "B1011O","B1041O",..: 1 1 1 1 1 2 2 2 2 2 ...
## $ PHESQU : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SEVERITY : Factor w/ 4 levels "High","Low","Medium",..: 4 4 4 4 4 1 1 1 1 1 ...
## $ GRAZED : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ ALTITUDE : int 1860 1860 1860 1860 1860 1860 1860 1860 1860 1860 ...
## $ QUADRAT : int 3 2 4 1 5 2 4 5 1 3 ...
## $ SEEDLING_DENSITY: int 0 0 0 0 0 0 0 0 0 0 ...
Wonderful!
We can write this new dataset to a file with the write.csv function, so that we can email it to someone.
write.csv(seedling_community, file = "seedling_community.csv")
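By default write.csv also writes the row names as an extra first column. If we’d rather not have those in the file, we can turn them off:
write.csv(seedling_community, file = "seedling_community.csv", row.names = FALSE)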
We can also save the object in R’s native file format, .Rdata.
save(seedling_community, file = "seedling_community.Rdata")
Let’s get rid of it from our workspace.
rm(seedling_community)
And now reload it with the load function.
load(file = "seedling_community.Rdata")
We often want to count the frequency of things, and quite often the frequency of combinations of factors. Let’s crosstabulate a new dataset of kangaroo behaviour.
kangaroos <- read.csv("kangaroos.csv")
So what does it look like?
str(kangaroos)
## 'data.frame': 3718 obs. of 5 variables:
## $ DATE_TIME: int 1266169500 1266169500 1266169500 1266169500 1266169500 1266169500 1266169500 1266169500 1266171300 1266171300 ...
## $ SPECIES : Factor w/ 2 levels "red","western grey": 2 2 1 1 2 2 1 1 2 2 ...
## $ BEHAVIOUR: Factor w/ 11 levels "Crouching","Drinking",..: 3 3 3 3 10 10 10 10 3 3 ...
## $ AIR_TEMP : num 18.3 18.3 18.3 18.3 18.3 18.3 18.3 18.3 18.1 18.1 ...
## $ HUMIDITY : int 78 78 78 78 78 78 78 78 80 80 ...
Crosstabulation is achieved with the function table. Let’s crosstabulate date/time by species by behaviour.
behav_freq <- table(kangaroos[1:3])
The table function gives us an array of class table, with as many dimensions as the number of factors.
Let’s have a look at how it is structured.
str(behav_freq)
## 'table' int [1:355, 1:2, 1:11] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "dimnames")=List of 3
## ..$ DATE_TIME: chr [1:355] "1266169500" "1266171300" "1266172200" "1266173100" ...
## ..$ SPECIES : chr [1:2] "red" "western grey"
## ..$ BEHAVIOUR: chr [1:11] "Crouching" "Drinking" "Foraging" "Grooming" ...
But we can easily convert the array into something more palatable like a data.frame.
behav_freq <- as.data.frame(behav_freq)
So now it should be a data.frame.
str(behav_freq)
## 'data.frame': 7810 obs. of 4 variables:
## $ DATE_TIME: Factor w/ 355 levels "1266169500","1266171300",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ SPECIES : Factor w/ 2 levels "red","western grey": 1 1 1 1 1 1 1 1 1 1 ...
## $ BEHAVIOUR: Factor w/ 11 levels "Crouching","Drinking",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Freq : int 0 0 0 0 0 0 0 0 0 0 ...
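If all we want is a quick two-way count, say of species by behaviour, we can pass table just those two columns:
table(kangaroos$SPECIES, kangaroos$BEHAVIOUR)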
Let’s have another look at the leaves dataset.
head(leaves)
## PLOT TREE LEAF AREA LENGTH PETIOLE_LENGTH MARGIN_ENTIRE CURVE
## 1 ae2 AR0 1 687473 1435.938 78.42 Y R
## 2 ae2 AR0 2 1204104 1984.151 81.04 Y L
## 3 ae2 AR0 3 623927 1380.783 65.37 Y L
## 4 ae2 BP0 1 145695 628.989 11.26 Y N
## 5 ae2 BP0 2 161048 661.247 12.60 Y L
## 6 ae2 BP0 3 126758 597.103 12.87 Y N
## WEIGHT lnAREA clnAREA LEAF_ID
## 1 1.7322 13.44078 0.9500288 AR0 1
## 2 1.9028 14.00125 1.5104973 AR0 2
## 3 3.8340 13.34379 0.8530397 AR0 3
## 4 0.7743 11.88927 -0.6014783 BP0 1
## 5 0.8067 11.98946 -0.5012913 BP0 2
## 6 0.6302 11.75004 -0.7407140 BP0 3
Time to calculate some summary statistics of the leaves vectors. First, we’ll work out what the mean leaf length is.
mean(leaves$LENGTH)
## [1] 1022.656
Similarly, we can calculate the median…
median(leaves$LENGTH)
## [1] 946.104
And some quantiles.
quantile(leaves$LENGTH)
## 0% 25% 50% 75% 100%
## 326.068 704.633 946.104 1200.787 2790.411
By default, the quantile function returns the 0, 25, 50, 75 and 100% quantiles, but we can ask for other quantiles by passing a vector of probabilities through the probs argument. For example:
quantile(leaves$LENGTH, probs = c(0.025, 0.5, 0.975))
## 2.5% 50% 97.5%
## 440.0971 946.1040 2306.9389
Ok, we would also like to know something about the variability of the data. How about we calculate the standard deviation?
sd(leaves$LENGTH)
## [1] 461.63
And if we want to calculate the precision of our estimate of the population mean, we can calculate the standard error. Surely this will be as simple as using the se function, right?
se(leaves$LENGTH)
## Error in se(leaves$LENGTH): could not find function "se"
Oh dear, that was unexpected…
R doesn’t have a built-in function for calculating the standard error of the mean. :(
All is not lost. We know how to calculate the standard error bit by bit… we just need to work out the standard deviation and the sample size. We are pretty much expert coders now, so without too much trouble we should be able to create our own function that will do our bidding.
se <- function(x) {
s <- sd(x)
n <- length(x)
s / sqrt(n)
}
We’ve just defined a function!
The se function expects us to provide a numeric vector, x. The first thing se does is to calculate the standard deviation of x, storing the result in the object s. It then calculates the sample size (the number of elements of x) and stores this in the object n. The final line of the function uses these intermediate objects to calculate the standard error, and, because it’s the last line, it is also the value that the function will return. It’s useful to note that the objects s and n only exist temporarily, and effectively disappear once the function has finished doing its thing.
Let’s try it out.
se(leaves$LENGTH)
## [1] 32.72408
Remarkable. Now that we’ve defined the se function, we no longer receive that error.
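One thing to be aware of: if x contains missing values, sd(x) will return NA, and so will our se. A slightly more defensive version (just a sketch of one option) could drop the NAs first:
se <- function(x) {
  x <- x[!is.na(x)]  # drop any missing values before calculating
  sd(x) / sqrt(length(x))
}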
When creating, summarising, manipulating or analysing data, there are often tasks that we will perform again and again. For example, we might want to:

* repeat a scalar or vector a number of times;
* apply a particular function to subsets of data, e.g. to each column or row of a matrix or data.frame, or each element of a list or vector; or
* run certain lines of code a pre-specified number of times, or while some condition is true.
Let’s begin with the first point above, and take a look at one of the simplest ways to repeat values, with the rep function.
How about the numbers one through ten, repeated four times?
rep(1:10, times = 4)
## [1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3
## [24] 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
We get a different vector if we use the each argument rather than times.
rep(1:10, each = 4)
## [1] 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6
## [24] 6 7 7 7 7 8 8 8 8 9 9 9 9 10 10 10 10
There are some built-in functions that apply functions to each column or row of a matrix or data frame. For example, the colMeans function calculates the mean of each column of the object you pass to it:
colMeans(leaves[4:6])
## AREA LENGTH PETIOLE_LENGTH
## 349999.72362 1022.65645 24.71251
Another function that performs operations on each column of a data frame or matrix is the summary function. This returns summaries of each column, the details of which depend on the column’s class. For example, summaries for factors give level frequencies, while summaries for numeric vectors give the min, mean, max and quartiles. Let’s take a look at the summary of leaves.
summary(leaves)
## PLOT TREE LEAF AREA LENGTH
## ae2:102 114 : 4 Min. :1 Min. : 35598 Min. : 326.1
## can: 15 GA2 : 4 1st Qu.:1 1st Qu.: 160107 1st Qu.: 704.6
## cm1: 82 LN : 4 Median :2 Median : 284449 Median : 946.1
## 1 : 3 Mean :2 Mean : 350000 Mean :1022.7
## 102 : 3 3rd Qu.:3 3rd Qu.: 481882 3rd Qu.:1200.8
## 106 : 3 Max. :4 Max. :1366244 Max. :2790.4
## (Other):178
## PETIOLE_LENGTH MARGIN_ENTIRE CURVE WEIGHT lnAREA
## Min. : 1.680 N: 15 L: 43 Min. :0.0740 Min. :10.48
## 1st Qu.: 9.345 Y:184 N:120 1st Qu.:0.4964 1st Qu.:11.98
## Median : 19.420 R: 36 Median :0.8925 Median :12.56
## Mean : 24.713 Mean :1.2654 Mean :12.49
## 3rd Qu.: 32.300 3rd Qu.:1.7329 3rd Qu.:13.09
## Max. :135.580 Max. :9.7833 Max. :14.13
##
## clnAREA LEAF_ID
## Min. :-2.01070 Length:199
## 1st Qu.:-0.50717 Class :character
## Median : 0.06756 Mode :character
## Mean : 0.00000
## 3rd Qu.: 0.59471
## Max. : 1.63683
##
Well that’s great, but what about all the other functions? What if we wanted to apply the se function we created earlier to these columns? It turns out R has a family of functions that makes this relatively painless: the *apply functions, including apply, lapply, sapply, tapply and mapply. We’ll only look at apply and tapply today. They can be daunting at first, but they are very powerful: they apply a function to the elements (or columns, rows, etc.) of your object all in one go, rather than handling each element in turn.
Let’s dive in and take a look at the apply function.
We need to provide three arguments to the apply function.

* The first is the object to which we want to apply our function. For the apply function this must be a data frame or an array.
* The second argument specifies the “margin” (i.e. the dimension) that we want to apply the function over. For a data frame, this should be either 1 (which means we will apply the function to the first dimension, i.e. the rows), 2 (apply to the second dimension, i.e. the columns), or c(1, 2) (meaning we will apply it to the intersection of rows and columns, i.e. to each cell).
* The third required argument is the function that you want to apply to the specified margin.
Here goes then…
apply(leaves[4:6], 2, se)
## AREA LENGTH PETIOLE_LENGTH
## 17979.89481 32.72408 1.50087
Onwards to tapply. When we use tapply, the vector is first split into subsets according to one or more grouping variables, and the function is then applied to each subset. For example, we could calculate the mean leaf length for each leaf curve direction.
(curve_means <- tapply(leaves$LENGTH, leaves$CURVE, mean))
## L N R
## 1149.9834 943.0866 1135.8041
Here the second argument specifies the grouping variable (in this case leaves$CURVE), and so the vector leaves$LENGTH is split into subsets according to the direction they curve, prior to the mean function being applied.
We mentioned above that more than one grouping variable can be provided. Let’s demonstrate this by grouping our leaf length data according to leaf margin type as well as curvature, and then once again calculating group means.
with(leaves, tapply(LENGTH, list(MARGIN_ENTIRE, CURVE), mean))
## L N R
## N 1057.939 853.8015 896.782
## Y 1156.887 951.2035 1149.864
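Since tapply takes any function, we can just as easily use the se function we wrote earlier to get a standard error for each curve group:
tapply(leaves$LENGTH, leaves$CURVE, se)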
Plotting univariate data in R is typically really easy. Often you just need to know what type of plot you want to make and then specify the vector of data you want to plot.
Like a boxplot for instance.
boxplot(leaves$LENGTH)
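boxplot also accepts a formula, which is handy for comparing groups. For example, leaf length split by curve direction:
boxplot(LENGTH ~ CURVE, data = leaves)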
Or a simple histogram.
hist(leaves$LENGTH)
A really common plot to summarise multidimensional data is a scatterplot matrix. The pairs function is what you want in this case.
pairs(leaves[4:6])
The most general plotting function is unsurprisingly called plot. Give it two vectors of data and it will give a simple scatterplot. Let’s have a go using the temperature data that is part of the kangaroos dataset.
First a simple scatterplot.
plot(kangaroos[, 1], kangaroos[, 4])
Since it is a time series maybe we’d like to plot it as a line?
plot(kangaroos[, 1], kangaroos[, 4], type = "l")
How about some axis labels?
plot(kangaroos[, 1], kangaroos[, 4], type = "l", xlab = "Time",
ylab = "Temperature (Celsius)")
Let’s increase the y-axis limits and put the tick labels in the right orientation with las = 1.
plot(kangaroos[, 1], kangaroos[, 4], type = "l", xlab = "Time",
ylab = "Temperature (Celsius)", ylim = c(0, 50), las = 1)
It’d be nice if we could adjust those margins and fit another axis on the right-hand side with the temperature in Fahrenheit. For pre-plotting adjustments like that, we turn to par. This function can set all sorts of graphical parameters before we even draw anything on the screen. The mar argument sets the size of each margin in units of text line heights. We specify the margins with a length-four vector in the order: bottom, left, top, right.
par(mar = c(5, 5, 1, 5))
plot(kangaroos[, 1], kangaroos[, 4], type = "l", xlab = "Time",
ylab = "Temperature (Celsius)", ylim = c(0, 50), las = 1)
That looks like enough room for a new axis.
Up until now, plot has been drawing our axes and axis labels for us. But we can use the axis function to draw our own as well; mtext will do the equivalent job for the axis label.
par(mar=c(5, 5, 1, 5))
plot(kangaroos[, 1], kangaroos[, 4], type = "l", xlab = "Time",
ylab = "Temperature (Celsius)", ylim = c(0, 50), las = 1)
axis(side = 4, at = 0:10 * 5, labels = (0:10 * 5 * 1.8) + 32, las = 1)
mtext(side = 4, text = "Temperature (Fahrenheit)", line = 3)
We can turn off the automatic x-axis provided by plot by setting the x-axis type to ‘none’ with xaxt = "n". Then we can use axis as we did before, to give us an x-axis in more sensible units (dates and times rather than raw numbers).
par(mar=c(5, 5, 1, 5))
plot(kangaroos[, 1], kangaroos[, 4], type = "l", xlab = "Time",
ylab = "Temperature (Celsius)", ylim = c(0, 50), las = 1, xaxt = "n")
axis(side = 4, at = 0:10 * 5, labels = (0:10 * 5 * 1.8) + 32, las = 1)
mtext(side = 4, text = "Temperature (Fahrenheit)", line = 3)
axis(side = 1, labels = pretty(as.POSIXct(kangaroos[, 1], origin = "1970-01-01")),
at = pretty(as.POSIXct(kangaroos[, 1], origin = "1970-01-01")))