Data Analysis and Managment in R

Introducing R and RStudio

This is an R markdown document. See this website for more information.

Getting data into R

We’ll start be entering data directly into R. If we wanted the numbers one through six…

x <- c(1, 2, 3, 4, 5, 6)

Wow! That was easy. Here we’ve used the c function with the arguments one through six, separated by commas, to assign a numeric vector, with the assignment operator <- to the object we’ve named x. We’ve just run our first R expression.

We can check what is stored in x just by typing x.

## [1] 1 2 3 4 5 6

What about this?

x <- seq(from = 1, to = 6)
x

## [1] 1 2 3 4 5 6

The seq function is another function for creating vectors. Unlike c it has named arguments.

And this?

x <- 1:6
x

## [1] 1 2 3 4 5 6

If we just want a vector of consecutive integers the colon operator is the easiest option.

Often we’ll need to get some tabular data into R. The function read.csv reads data into R but expects the data to be formatted in a particular way; in a plain text file with values separated by commas and the first line containing the column names.

seedling_density <- read.csv("seedling_density.csv")

Let’s look at the first six rows of data with the head function.

head(seedling_density)

##     SITE SEVERITY GRAZED ALTITUDE QUADRAT SPECIES SEEDLING_DENSITY
## 1 B1011O  Unburnt  FALSE     1860       1  ASTTRY                0
## 2 B1011O  Unburnt  FALSE     1860       1  GREAUS                0
## 3 B1011O  Unburnt  FALSE     1860       1  PHESQU                0
## 4 B1011O  Unburnt  FALSE     1860       2  ASTTRY                0
## 5 B1011O  Unburnt  FALSE     1860       2  GREAUS                0
## 6 B1011O  Unburnt  FALSE     1860       2  PHESQU                0

The function read.csv has stored the data as a data.frame.

Getting Help in R

In RStudio we can easily get help for functions by searcing in the help pane.

Extending R with R packages

Th R language has a package for just about everything. install.packages("rmarkdown") will install the rmarkdown package. rmarkdown extends R so we can compile our R markdown files and turn them into documents!

install.packages("rmarkdown", repos = "https://cloud.r-project.org")

Yay, it installed!

Now we have to load the package with the library function.

library(rmarkdown)

Interrogating data

We’ve used head already. tail will show us the last six rows instead of the first six.

tail(seedling_density)

##       SITE SEVERITY GRAZED ALTITUDE QUADRAT SPECIES SEEDLING_DENSITY
## 175 C2131O   Medium   TRUE     1720       4  ASTTRY                5
## 176 C2131O   Medium   TRUE     1720       4  GREAUS               20
## 177 C2131O   Medium   TRUE     1720       4  PHESQU               10
## 178 C2131O   Medium   TRUE     1720       5  ASTTRY                5
## 179 C2131O   Medium   TRUE     1720       5  GREAUS               10
## 180 C2131O   Medium   TRUE     1720       5  PHESQU               20

It’s a data.frame right? class will tell us if this is the case.

class(seedling_density)

## [1] "data.frame"

Yes, yes it is.

This class stuff is interesting let’s investigate more.

class(1)

## [1] "numeric"

class("a")

## [1] "character"

class(factor("a"))

## [1] "factor"

Cool!

How long is this data.frame?

length(seedling_density)

## [1] 7

That looks like the number of columns.

Let’s make sure though.

ncol(seedling_density)

## [1] 7

The colnames function can show us the column names.

colnames(seedling_density)

## [1] "SITE"             "SEVERITY"         "GRAZED"          
## [4] "ALTITUDE"         "QUADRAT"          "SPECIES"         
## [7] "SEEDLING_DENSITY"

So how many rows are there?

nrow(seedling_density)

## [1] 180

Wow, that’s a lot.

Even though it’s actually a list, a data.frame can behave like a matrix. In that case the dim function should tell us the number of rows and cols, i.e., the dimensions.

dim(seedling_density)

## [1] 180   7

It does indeed.

Let’s see how it’s all put together then.

str(seedling_density)

## 'data.frame':    180 obs. of  7 variables:
##  $ SITE            : Factor w/ 12 levels "B1011O","B1041O",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ SEVERITY        : Factor w/ 4 levels "High","Low","Medium",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ GRAZED          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ ALTITUDE        : int  1860 1860 1860 1860 1860 1860 1860 1860 1860 1860 ...
##  $ QUADRAT         : int  1 1 1 2 2 2 3 3 3 4 ...
##  $ SPECIES         : Factor w/ 3 levels "ASTTRY","GREAUS",..: 1 2 3 1 2 3 1 2 3 1 ...
##  $ SEEDLING_DENSITY: int  0 0 0 0 0 0 0 0 0 1 ...

Wow that’s interesting stuff.

Subsetting data

We can subset a data.frame with square brackets. To get the equivalent of the head function we do the following.

seedling_density[1:6, ]

##     SITE SEVERITY GRAZED ALTITUDE QUADRAT SPECIES SEEDLING_DENSITY
## 1 B1011O  Unburnt  FALSE     1860       1  ASTTRY                0
## 2 B1011O  Unburnt  FALSE     1860       1  GREAUS                0
## 3 B1011O  Unburnt  FALSE     1860       1  PHESQU                0
## 4 B1011O  Unburnt  FALSE     1860       2  ASTTRY                0
## 5 B1011O  Unburnt  FALSE     1860       2  GREAUS                0
## 6 B1011O  Unburnt  FALSE     1860       2  PHESQU                0

Rows are before the comma and columns are after it.

So, the first six rows of only the first six columns looks like this.

seedling_density[1:6, 1:6]

##     SITE SEVERITY GRAZED ALTITUDE QUADRAT SPECIES
## 1 B1011O  Unburnt  FALSE     1860       1  ASTTRY
## 2 B1011O  Unburnt  FALSE     1860       1  GREAUS
## 3 B1011O  Unburnt  FALSE     1860       1  PHESQU
## 4 B1011O  Unburnt  FALSE     1860       2  ASTTRY
## 5 B1011O  Unburnt  FALSE     1860       2  GREAUS
## 6 B1011O  Unburnt  FALSE     1860       2  PHESQU

We already stored the numbers one through six in a variable called x. We could use that object instead of 1:6

seedling_density[x, x]

##     SITE SEVERITY GRAZED ALTITUDE QUADRAT SPECIES
## 1 B1011O  Unburnt  FALSE     1860       1  ASTTRY
## 2 B1011O  Unburnt  FALSE     1860       1  GREAUS
## 3 B1011O  Unburnt  FALSE     1860       1  PHESQU
## 4 B1011O  Unburnt  FALSE     1860       2  ASTTRY
## 5 B1011O  Unburnt  FALSE     1860       2  GREAUS
## 6 B1011O  Unburnt  FALSE     1860       2  PHESQU

Since data.frames are lists we can refer to their columns with a dollar sign.

seedling_density$SEEDLING_DENSITY

##   [1]  0  0  0  0  0  0  0  0  0  1  1  0  0  0  0 50 10  0 50  5  0 20 10
##  [24]  0 10 20  0 10 20  0  0  0  5  5  1  1 10  1  0 50  1  5 20  5  5 10
##  [47]  1  0  1  0  0  1  0  0  5  0  0 10  0  0  0  0  0  0  0  0  0  0  0
##  [70]  0  1  0  0  5 20  5  0  0  5 10  0 10  0  0  0  0  0  1  0  0  0  0
##  [93]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  5 10  0  5  0 10  5 10 10
## [116]  5  0  0  0  0  0  5  0  0  5  0  0 20  0  0 10  0  5  5  0  1  5 50
## [139]  5 20 20  0  5 50  0  5 50  0  1 50  1 10  0 50  5  0  1  5  0 10  1
## [162]  0 50 20  0 20 20  0 20 20  0 20  5 20  5 20 10  5 10 20

The subset function is handy for subseting rows based on logical expressions. In this case we extract the rows that for the fire_severity column, have the value unburnt.

seedling_density_unburnt <- subset(seedling_density, SEVERITY == "Unburnt")

Note the double equals sign. They mean ‘is equal to?’ rather than one equals sign, =, which like a <-, which you could think of as ‘make equal to’.

We can also use negative indices to remove columns or rows.

(seedling_density_unburnt <- seedling_density_unburnt[, -2])

##       SITE GRAZED ALTITUDE QUADRAT SPECIES SEEDLING_DENSITY
## 1   B1011O  FALSE     1860       1  ASTTRY                0
## 2   B1011O  FALSE     1860       1  GREAUS                0
## 3   B1011O  FALSE     1860       1  PHESQU                0
## 4   B1011O  FALSE     1860       2  ASTTRY                0
## 5   B1011O  FALSE     1860       2  GREAUS                0
## 6   B1011O  FALSE     1860       2  PHESQU                0
## 7   B1011O  FALSE     1860       3  ASTTRY                0
## 8   B1011O  FALSE     1860       3  GREAUS                0
## 9   B1011O  FALSE     1860       3  PHESQU                0
## 10  B1011O  FALSE     1860       4  ASTTRY                1
## 11  B1011O  FALSE     1860       4  GREAUS                1
## 12  B1011O  FALSE     1860       4  PHESQU                0
## 13  B1011O  FALSE     1860       5  ASTTRY                0
## 14  B1011O  FALSE     1860       5  GREAUS                0
## 15  B1011O  FALSE     1860       5  PHESQU                0
## 91  C1071O   TRUE     1700       1  ASTTRY                0
## 92  C1071O   TRUE     1700       1  GREAUS                0
## 93  C1071O   TRUE     1700       1  PHESQU                0
## 94  C1071O   TRUE     1700       2  ASTTRY                0
## 95  C1071O   TRUE     1700       2  GREAUS                0
## 96  C1071O   TRUE     1700       2  PHESQU                0
## 97  C1071O   TRUE     1700       3  ASTTRY                0
## 98  C1071O   TRUE     1700       3  GREAUS                0
## 99  C1071O   TRUE     1700       3  PHESQU                0
## 100 C1071O   TRUE     1700       4  ASTTRY                0
## 101 C1071O   TRUE     1700       4  GREAUS                0
## 102 C1071O   TRUE     1700       4  PHESQU                0
## 103 C1071O   TRUE     1700       5  ASTTRY                0
## 104 C1071O   TRUE     1700       5  GREAUS                0
## 105 C1071O   TRUE     1700       5  PHESQU                0

This removes fire_severity, which we don’t need, since we only have one value for fire_severity in the subset. Note the brackets around the expression. This forces printing of the object as well as the assignment.

Creating and combining data

Let’s read in a new dataset of rainforest leaves from New Caledonia.

leaves <- read.csv("leaves.csv")

What do these data look like.

str(leaves)

## 'data.frame':    199 obs. of  9 variables:
##  $ PLOT          : Factor w/ 3 levels "ae2","can","cm1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ TREE          : Factor w/ 68 levels "1","102","106",..: 30 30 30 31 31 31 32 32 32 33 ...
##  $ LEAF          : int  1 2 3 1 2 3 1 2 3 1 ...
##  $ AREA          : num  687473 1204104 623927 145695 161048 ...
##  $ LENGTH        : num  1436 1984 1381 629 661 ...
##  $ PETIOLE_LENGTH: num  78.4 81 65.4 11.3 12.6 ...
##  $ MARGIN_ENTIRE : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ CURVE         : Factor w/ 3 levels "L","N","R": 3 1 1 2 1 2 2 2 2 2 ...
##  $ WEIGHT        : num  1.732 1.903 3.834 0.774 0.807 ...

Often we’d like to work with transformed data. Like a natural log maybe?

lnAREA <- log(leaves$AREA)

Now we have a new object that is the natural log of leaf area.

It’s not that good to have these floating around though, we should really stick’em back in the data.frame.

leaves <- cbind(leaves, lnAREA)
str(leaves)

## 'data.frame':    199 obs. of  10 variables:
##  $ PLOT          : Factor w/ 3 levels "ae2","can","cm1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ TREE          : Factor w/ 68 levels "1","102","106",..: 30 30 30 31 31 31 32 32 32 33 ...
##  $ LEAF          : int  1 2 3 1 2 3 1 2 3 1 ...
##  $ AREA          : num  687473 1204104 623927 145695 161048 ...
##  $ LENGTH        : num  1436 1984 1381 629 661 ...
##  $ PETIOLE_LENGTH: num  78.4 81 65.4 11.3 12.6 ...
##  $ MARGIN_ENTIRE : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ CURVE         : Factor w/ 3 levels "L","N","R": 3 1 1 2 1 2 2 2 2 2 ...
##  $ WEIGHT        : num  1.732 1.903 3.834 0.774 0.807 ...
##  $ lnAREA        : num  13.4 14 13.3 11.9 12 ...

This combines the original data with the new columns. cbind stands for column bind.

But we could skip these steps and just add transformations straight to the data.frame.

leaves$clnAREA <- lnAREA - mean(lnAREA)
str(leaves)

## 'data.frame':    199 obs. of  11 variables:
##  $ PLOT          : Factor w/ 3 levels "ae2","can","cm1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ TREE          : Factor w/ 68 levels "1","102","106",..: 30 30 30 31 31 31 32 32 32 33 ...
##  $ LEAF          : int  1 2 3 1 2 3 1 2 3 1 ...
##  $ AREA          : num  687473 1204104 623927 145695 161048 ...
##  $ LENGTH        : num  1436 1984 1381 629 661 ...
##  $ PETIOLE_LENGTH: num  78.4 81 65.4 11.3 12.6 ...
##  $ MARGIN_ENTIRE : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ CURVE         : Factor w/ 3 levels "L","N","R": 3 1 1 2 1 2 2 2 2 2 ...
##  $ WEIGHT        : num  1.732 1.903 3.834 0.774 0.807 ...
##  $ lnAREA        : num  13.4 14 13.3 11.9 12 ...
##  $ clnAREA       : num  0.95 1.51 0.853 -0.601 -0.501 ...

This will center the logged leaf area vector around zero.

We can get rid of the extra object we don’t need now.

rm(lnAREA)

Now lets have a closer look at the indexing variables for the leaves dataset, PLOT, TREE and LEAF. How many rows are there?

nrow(leaves)

## [1] 199

So how many unique leaf ID’s are there?

unique(leaves$LEAF)

## [1] 1 2 3 4

That’s no good? Ideally that should have been the same as the number of rows.

What about the combination of TREE and LEAF?

nrow(unique(leaves[2:3]))

## [1] 199

That’s more like it.

Let’s create proper a leaf ID variable based on the TREE and LEAF columns.

leaves$LEAF_ID <- do.call(paste, leaves[2:3])

The do.call function feeds the two columns of leaves to paste as a list of arguments.

And the result is…

leaves$LEAF_ID

##   [1] "AR0 1"  "AR0 2"  "AR0 3"  "BP0 1"  "BP0 2"  "BP0 3"  "BW0 1" 
##   [8] "BW0 2"  "BW0 3"  "BW1 1"  "BW1 2"  "BW1 3"  "CA0 1"  "CA0 2" 
##  [15] "CA0 3"  "CA1 1"  "CA1 2"  "CA1 3"  "CA2 1"  "CA2 2"  "CN0 1" 
##  [22] "CN0 2"  "CN0 3"  "CS0 1"  "CS0 2"  "CS0 3"  "EY0 1"  "EY0 2" 
##  [29] "EY0 3"  "GA0 1"  "GA0 2"  "GA0 3"  "GA1 1"  "GA1 2"  "GA1 3" 
##  [36] "GA2 1"  "GA2 2"  "GA2 3"  "GA2 4"  "L5 1"   "L5 2"   "L5 3"  
##  [43] "L6 1"   "L6 2"   "L6 3"   "L8 1"   "L8 2"   "L8 3"   "LP0 1" 
##  [50] "LP0 2"  "LP0 3"  "PD0 1"  "PD0 2"  "PF0 1"  "PF0 2"  "PF0 3" 
##  [57] "PF1 1"  "PF1 2"  "PF1 3"  "PF2 1"  "PF2 2"  "PF2 3"  "PF3 1" 
##  [64] "PF3 2"  "PF3 3"  "PL1 1"  "PL1 2"  "PL1 3"  "PT0 1"  "PT0 2" 
##  [71] "PT0 3"  "PT1 1"  "PT1 2"  "PT1 3"  "PT2 1"  "PT2 2"  "PT2 3" 
##  [78] "SA0 1"  "SA0 2"  "SA0 3"  "SCH0 2" "SCH0 3" "SM0 1"  "SM0 2" 
##  [85] "SM0 3"  "SM1 1"  "SM1 2"  "SM1 3"  "SN0 1"  "SN0 2"  "SN0 3" 
##  [92] "SP0 1"  "SP0 2"  "SP0 3"  "ST0 1"  "ST0 2"  "ST0 3"  "SYZ0 1"
##  [99] "SYZ0 2" "TR0 1"  "TR0 2"  "TR0 3"  "AN 1"   "AN 2"   "DM 1"  
## [106] "DM 2"   "DM 3"   "L77 1"  "L77 2"  "L77 3"  "LN 1"   "LN 2"  
## [113] "LN 3"   "LN 4"   "N1 1"   "N1 2"   "N1 3"   "1 1"    "1 2"   
## [120] "1 3"    "3 1"    "3 2"    "3 3"    "46 1"   "46 2"   "46 3"  
## [127] "50 1"   "50 2"   "50 3"   "54 1"   "54 2"   "54 3"   "55 1"  
## [134] "55 2"   "55 3"   "67 1"   "67 2"   "67 3"   "86 1"   "86 2"  
## [141] "86 3"   "87 1"   "87 2"   "87 3"   "102 1"  "102 2"  "102 3" 
## [148] "106 1"  "106 2"  "106 3"  "111 1"  "111 2"  "111 3"  "113 1" 
## [155] "113 2"  "113 3"  "114 1"  "114 2"  "114 3"  "114 4"  "118 1" 
## [162] "118 2"  "118 3"  "119 1"  "119 2"  "119 3"  "120 1"  "120 2" 
## [169] "120 3"  "127 1"  "127 2"  "127 3"  "129 1"  "129 2"  "129 3" 
## [176] "135 1"  "135 2"  "149 1"  "149 2"  "149 3"  "150 1"  "150 2" 
## [183] "150 3"  "151 1"  "151 2"  "151 3"  "157 1"  "157 2"  "157 3" 
## [190] "158 1"  "158 2"  "158 3"  "163 1"  "163 2"  "165 1"  "165 2" 
## [197] "165 3"  "168 1"  "168 2"

It worked!

So now the length of the unique values of this now column should be the same as the number of rows in the leaves data.frame.

length(unique(leaves$LEAF_ID)) == nrow(leaves)

## [1] TRUE

Lets learn how to merge two data frames with a common variable. We’ll use the seedling density dataset, and a new dataset plant community cover values collected at the same time.

fire_community <- read.csv("fire_community.csv")

Let’s look at the structure.

str(fire_community)

## 'data.frame':    10 obs. of  14 variables:
##  $ SITE  : Factor w/ 10 levels "B1011O","B1041O",..: 6 5 7 10 1 4 2 8 9 3
##  $ ACANOV: num  0 0 0 0.1 0.2 0 0 0.2 0.5 0.5
##  $ AUSVEL: num  0 0.8 0.2 0 0 0 0 0 0 0.5
##  $ CARHEB: num  0 0 0 0.2 0 0.5 0.2 0.7 0.5 0.5
##  $ EUCFOR: num  0 0 0 0 0.3 0.7 0 0 0 0
##  $ GERANT: num  0 0.1 0.7 0.2 0.2 0.6 0.9 0 0 0.2
##  $ LEPFIL: num  0 0 0 0 0 0.1 0 0 0 0.1
##  $ PHESQU: num  0 0 0.6 0.3 0 0.5 0 2.5 0 0
##  $ PIMALP: num  0 0.1 0.2 0.2 0.4 0.7 0.9 0.9 3.8 4.1
##  $ RANVIC: num  0.9 0.9 1.5 0.3 0.5 2.1 0 0.2 0.6 0.9
##  $ RYTNUD: num  0.4 0 0.2 0 0 1 0 0 3.1 1
##  $ SCLBIF: num  0.4 0.4 0 0 0.4 0 0 0.1 0 0
##  $ TRISPI: num  0.3 0 0 0.1 0.2 0 0 0 0 0
##  $ VIOBET: num  0 0.1 0.1 0.5 0 0.6 0.3 0.5 0 0.5

Merging data frames

The merge function will by default join two data frames indexed by their common columns.

seedling_community <- merge(
  fire_community[, c(1, 8)],
  subset(seedling_density, SPECIES == "PHESQU")[, -6]
)

Here we have merged the cover data for the species PHESQU from the fire_community dataset with the seedling density data for the same species, indexing by SITE, the column common to both data.frames. We drop the SPECIES column with -6 because there is now only one species present in our merged data table.

Let’s see if it worked.

str(seedling_community)

## 'data.frame':    50 obs. of  7 variables:
##  $ SITE            : Factor w/ 10 levels "B1011O","B1041O",..: 1 1 1 1 1 2 2 2 2 2 ...
##  $ PHESQU          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SEVERITY        : Factor w/ 4 levels "High","Low","Medium",..: 4 4 4 4 4 1 1 1 1 1 ...
##  $ GRAZED          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ ALTITUDE        : int  1860 1860 1860 1860 1860 1860 1860 1860 1860 1860 ...
##  $ QUADRAT         : int  3 2 4 1 5 2 4 5 1 3 ...
##  $ SEEDLING_DENSITY: int  0 0 0 0 0 0 0 0 0 0 ...

Wonderful!

We can write this new dataset to a file so that we can email it to someone with the write.csv function.

write.csv(seedling_community, file = "seedling_community.csv")

We can also save the object in R’s native file format .Rdata

save(seedling_community, file = "seedling_community.Rdata")

Let’s get rid of it from our workspace.

rm(seedling_community)

And now reload it with the load function.

load(file = "seedling_community.Rdata")

Cross-tabulation

We often want to count the frequency of things and quite often the frequency of combinations of factors. Lets crosstabulate a new dataset of kangaroo behaviour.

kangaroos <- read.csv("kangaroos.csv")

So what does it look like?

str(kangaroos)

## 'data.frame':    3718 obs. of  5 variables:
##  $ DATE_TIME: int  1266169500 1266169500 1266169500 1266169500 1266169500 1266169500 1266169500 1266169500 1266171300 1266171300 ...
##  $ SPECIES  : Factor w/ 2 levels "red","western grey": 2 2 1 1 2 2 1 1 2 2 ...
##  $ BEHAVIOUR: Factor w/ 11 levels "Crouching","Drinking",..: 3 3 3 3 10 10 10 10 3 3 ...
##  $ AIR_TEMP : num  18.3 18.3 18.3 18.3 18.3 18.3 18.3 18.3 18.1 18.1 ...
##  $ HUMIDITY : int  78 78 78 78 78 78 78 78 80 80 ...

Crosstabulation is achieved with the function table. Let’s crosstabulate date/time by species by behaviour.

behav_freq <- table(kangaroos[1:3])

The table function gives us an array of class table, with as many dimensions as the number of factors.

Lets have a look at how it is structured.

str(behav_freq)

##  'table' int [1:355, 1:2, 1:11] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ DATE_TIME: chr [1:355] "1266169500" "1266171300" "1266172200" "1266173100" ...
##   ..$ SPECIES  : chr [1:2] "red" "western grey"
##   ..$ BEHAVIOUR: chr [1:11] "Crouching" "Drinking" "Foraging" "Grooming" ...

But we can easily convert the array into something more palatable like a data.frame.

behav_freq <- as.data.frame(behav_freq)

So now it should be a data.frame.

str(behav_freq)

## 'data.frame':    7810 obs. of  4 variables:
##  $ DATE_TIME: Factor w/ 355 levels "1266169500","1266171300",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ SPECIES  : Factor w/ 2 levels "red","western grey": 1 1 1 1 1 1 1 1 1 1 ...
##  $ BEHAVIOUR: Factor w/ 11 levels "Crouching","Drinking",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Freq     : int  0 0 0 0 0 0 0 0 0 0 ...

Summary statistics

Let’s have another look at the leaves dataset.

head(leaves)

##   PLOT TREE LEAF    AREA   LENGTH PETIOLE_LENGTH MARGIN_ENTIRE CURVE
## 1  ae2  AR0    1  687473 1435.938          78.42             Y     R
## 2  ae2  AR0    2 1204104 1984.151          81.04             Y     L
## 3  ae2  AR0    3  623927 1380.783          65.37             Y     L
## 4  ae2  BP0    1  145695  628.989          11.26             Y     N
## 5  ae2  BP0    2  161048  661.247          12.60             Y     L
## 6  ae2  BP0    3  126758  597.103          12.87             Y     N
##   WEIGHT   lnAREA    clnAREA LEAF_ID
## 1 1.7322 13.44078  0.9500288   AR0 1
## 2 1.9028 14.00125  1.5104973   AR0 2
## 3 3.8340 13.34379  0.8530397   AR0 3
## 4 0.7743 11.88927 -0.6014783   BP0 1
## 5 0.8067 11.98946 -0.5012913   BP0 2
## 6 0.6302 11.75004 -0.7407140   BP0 3

Time to calculate some summary statistics of the leaves vectors. First, we’ll work out the mean of leaf length is.

mean(leaves$LENGTH)

## [1] 1022.656

Similarly, we can calculate the median…

median(leaves$LENGTH)

## [1] 946.104

And some quantiles.

quantile(leaves$LENGTH)

##       0%      25%      50%      75%     100% 
##  326.068  704.633  946.104 1200.787 2790.411

By default, the quantile function returns the 0, 25, 50, 75 and 100% quantiles, but we can ask for other quantiles by passing a vector of probabilities through the probs argument. For example:

quantile(leaves$LENGTH, probs = c(0.025, 0.5, 0.975))

##      2.5%       50%     97.5% 
##  440.0971  946.1040 2306.9389

Ok, we would also like to know something about the variability of the data. How about we calculate the standard deviation.

sd(leaves$LENGTH)

## [1] 461.63

And if we want to calculate the precision of our estimate of the population mean, we can calculate the standard error. Surely this will be as simple as using the se function, right?

se(leaves$LENGTH)

## Error in se(leaves$LENGTH): could not find function "se"

Oh dear, that was unexpected…

R doesn’t have a built-in function for calculating the standard error of the mean. :(

Functions

All is not lost. We know how to calculate the standard error bit by bit… we just need to work out the standard deviation and the sample size. We are pretty much expert coders now, so without too much trouble we should be able to create our own function that will do our bidding.

se <- function(x) {
  s <- sd(x)
  n <- length(x)
  s / sqrt(n)
}

We’ve just defined a function!

The se function expects us to provide a numeric vector, x. The first thing se does is to calculate the standard deviation of x, storing the result in the object s. It then calculates the sample size (the number of elements of x) and stores this in the object n. The final line of the function uses these intermediate objects to calcualte the standard error, and, because it’s the last line, it is also the value that the function will return. It’s useful to note that the objects s and n only exist temporarily, and effectively disappear once the function has finished doing its thing.

Let’s try it out.

se(leaves$LENGTH)

## [1] 32.72408

Remarkable. Now that we’ve defined the se function, we no longer receive that error.

Repetition & Iteration

When creating, summarising, manupulating or analysing data, there are often tasks that we will perform again and again. For example, we might want to: * repeat a scalar or vector a number of times; * apply a particular function to subsets of data, e.g. to each column or row of a matrix or data.frame, or each element of a list or vector; or * run certain lines of code a pre-specified number of times, or while some condition is true.

Let’s begin with the first point above, and take a look at one of the simplest ways to repeat values, with the rep function.

How about the numbers one though ten four times

rep(1:10, times = 4)

##  [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3
## [24]  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10

We get a different vector if we use the each argument rather than times.

rep(1:10, each = 4)

##  [1]  1  1  1  1  2  2  2  2  3  3  3  3  4  4  4  4  5  5  5  5  6  6  6
## [24]  6  7  7  7  7  8  8  8  8  9  9  9  9 10 10 10 10

There are a some built-in functions that apply functions to each column or row of a matrix or data frame. For example, the colMeans function calculates the mean each column of the object you pass to it:

colMeans(leaves[4:6])

##           AREA         LENGTH PETIOLE_LENGTH 
##   349999.72362     1022.65645       24.71251

Another function that performs operations on each column of a data frame or matrix is the summary function. This returns summaries of each column, the details of which depend on the column’s class. For example, summaries for factors give level frequencies, while summaries for numeric vectors give the min, mean, max and quartiles. Let’s take a look at the summary of leaves.

summary(leaves)

##   PLOT          TREE          LEAF        AREA             LENGTH      
##  ae2:102   114    :  4   Min.   :1   Min.   :  35598   Min.   : 326.1  
##  can: 15   GA2    :  4   1st Qu.:1   1st Qu.: 160107   1st Qu.: 704.6  
##  cm1: 82   LN     :  4   Median :2   Median : 284449   Median : 946.1  
##            1      :  3   Mean   :2   Mean   : 350000   Mean   :1022.7  
##            102    :  3   3rd Qu.:3   3rd Qu.: 481882   3rd Qu.:1200.8  
##            106    :  3   Max.   :4   Max.   :1366244   Max.   :2790.4  
##            (Other):178                                                 
##  PETIOLE_LENGTH    MARGIN_ENTIRE CURVE       WEIGHT           lnAREA     
##  Min.   :  1.680   N: 15         L: 43   Min.   :0.0740   Min.   :10.48  
##  1st Qu.:  9.345   Y:184         N:120   1st Qu.:0.4964   1st Qu.:11.98  
##  Median : 19.420                 R: 36   Median :0.8925   Median :12.56  
##  Mean   : 24.713                         Mean   :1.2654   Mean   :12.49  
##  3rd Qu.: 32.300                         3rd Qu.:1.7329   3rd Qu.:13.09  
##  Max.   :135.580                         Max.   :9.7833   Max.   :14.13  
##                                                                          
##     clnAREA           LEAF_ID         
##  Min.   :-2.01070   Length:199        
##  1st Qu.:-0.50717   Class :character  
##  Median : 0.06756   Mode  :character  
##  Mean   : 0.00000                     
##  3rd Qu.: 0.59471                     
##  Max.   : 1.63683                     
##

Well that’s great, but what about all the other functions? What if we wanted to apply the se function we created earlier to these columns? It turns out R has a family of functions that makes this relatively painless. These are the *apply functions, including apply, lapply, sapply, tapply, mapply. We’ll only look at apply, and tapply today. They can be daunting at first, but very powerful. That is, they apply functions to elements (or col, row, etc.) of your object all in one go, rather than handling each element in turn.

Let’s dive in and take a look at the apply function.

We need to provide three arguments to the apply function. * The first is the object to which we want to apply our function. For the apply function this must be a data frame or an array. * The second argument specifies the “margin” (i.e. the dimension) that we want to apply the function over. For a data frame, this should be either 1 (which means we will apply to function to the first dimension, i.e. the rows), 2 (apply to the second dimension, i.e. columns), or c(1, 2) (meaning we will apply it to the intersection of rows and columns, i.e. to each cell). * The third required argument is the function that you want to apply to the specified margin.

Here goes then…

apply(leaves[4:6], 2, se)

##           AREA         LENGTH PETIOLE_LENGTH 
##    17979.89481       32.72408        1.50087

Onwards to tapply. When we use tapply before applying the function, the vector is subset according to one or more grouping variables. For example, we could calculate the mean leaf length for each leaf curve direction.

(curve_means <- tapply(leaves$LENGTH, leaves$CURVE, mean))

##         L         N         R 
## 1149.9834  943.0866 1135.8041

Here the second argument specifies the grouping variable (in this case leaves$CURVE), and so the vector leaves$LENGTH is split into subsets according to the direction they curve, prior to the mean function being applied.

We mentioned above that more than one grouping variable can be provided. Let’s demonstrate this by grouping our leaf length data according leaf margin type as well as curvature, and then once again calculating group means.

with(leaves, tapply(LENGTH, list(MARGIN_ENTIRE, CURVE), mean))

##          L        N        R
## N 1057.939 853.8015  896.782
## Y 1156.887 951.2035 1149.864

Graphics

Plotting univariate data in R is typically really easy. Often you just need to know what type of plot you want to make and the specify the vector of data you want plot.

Like a boxplot for instance.

boxplot(leaves$LENGTH)

Or a simple histogram.

hist(leaves$LENGTH)

A really common plot to summarise multidimensional data is a scatterplot matrix. The pairs function is what you want in this case.

pairs(leaves[4:6])

The most general plotting function is unsurprisingly called plot. Give it two vectors of data and it will give a simple scatterplot. Let’s have a go using the temperature data that is part of the kangaroos dataset.

First a simple scatterplot.

plot(kangaroos[, 1], kangaroos[, 4])

Since it is a time series maybe we’d like to plot it as a line?

plot(kangaroos[, 1], kangaroos[, 4], type = "l")

How about some axis labels?

plot(kangaroos[, 1], kangaroos[, 4], type = "l", xlab = "Time",
  ylab = "Temperature (Celsius)")

Lets increase the y axis limits and make the tick labels in the right orientation with las=1

plot(kangaroos[, 1], kangaroos[, 4], type = "l", xlab = "Time",
  ylab = "Temperature (Celsius)", ylim = c(0, 50), las = 1)

It’d be nice if we could adjust those margins and fit another axis on the right hand side with the temperature in Fahrenheit. For pre-plotting adjustments like that, we turn to par. This function can set all sorts of graphical parameters before we even draw anything on the screen. The mar argument will help us determine the size of the margin in units of text line heights, parallel to the axis. We specify the margins with a length four vector in the order: bottom, left, top, right.

par(mar = c(5, 5, 1, 5))

plot(kangaroos[, 1], kangaroos[, 4], type = "l", xlab = "Time",
  ylab = "Temperature (Celsius)", ylim = c(0, 50), las = 1)

That looks like enough room for a new axis.

Up till now, plot was plotting our axes and axis labels for us. But we can use the axis function to plot our own as well. mtext will do the equivalent job for the axis label.

par(mar=c(5, 5, 1, 5))

plot(kangaroos[, 1], kangaroos[, 4], type = "l", xlab = "Time",
  ylab = "Temperature (Celsius)", ylim = c(0, 50), las = 1)

axis(side = 4, at = 0:10 * 5, labels = (0:10 * 5 * 1.8) + 32, las = 1)

mtext(side = 4, text = "Temperature (Fahrenheit)", line = 3)

We can turn off the automatic x-axis provided by plot, by setting the x-axis type to ‘none’, xaxt = "n". Then we can use axis as we did before, to give us more sensible axis in units.

par(mar=c(5, 5, 1, 5))

plot(kangaroos[, 1], kangaroos[, 4], type = "l", xlab = "Time", 
  ylab = "Temperature (Celsius)", ylim = c(0, 50), las = 1, xaxt = "n")

axis(side = 4, at = 0:10 * 5, labels = (0:10 * 5 * 1.8) + 32, las = 1)

mtext(side = 4, text = "Temperature (Fahrenheit)", line = 3)

axis(side = 1, labels = pretty(as.POSIXct(kangaroos[, 1], origin = "1970-01-01")),
  at = pretty(as.POSIXct(kangaroos[, 1], origin = "1970-01-01")))