Basic data analysis and plotting using R

Using tidyverse and ggplot2 to analyse and plot data

Now that you have a basic understanding of how R works, and what it is used for, let’s put our newly learned skills to practical use.

When working with data in our studies, we can consider R to be the central software for data management, analysis and plotting. It is quite robust for this purpose, in that it also reads in data of many different types and from many different sources.¹

Working with behavioural data in R

In this section we will directly work with some participant data from the HAVEN study.

HAVEN was a study investigating the relationship between cognition, neurovascular health and platelet function in older adults. For the purposes of this workshop, we will be using a subset of the data from the study, specifically focusing on simple demographical and behavioural data.

We will firstly plot the data using a variety of plots to understand how certain graphs are preferred for visualizing specific relationships. We will then perform basic statistical tests to understand whether the relationships between the variables are significant.

Load and display the data

We first need to load in and read the data. The first dataset is in the participant_data.csv file within the /data folder. This file format is called ‘comma separated values’ or CSV, which we can read into R using the read.csv function. We firstly assign the data to a variable called sub_data (short for subject data), which we can now work directly with.

In the code below, we are downloading the data from a GitHub repository and saving it to our working directory. The first two lines are only required for the online workshop where we run code in the browser; in R you can just run the bottom line.
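A sketch of what that code might look like (the repository URL below is a placeholder, not the workshop’s real address):

```r
# Only needed for the online, browser-based workshop:
url <- "https://raw.githubusercontent.com/<user>/<repo>/main/data/participant_data.csv"  # placeholder URL
download.file(url, "data/participant_data.csv")

# In R you can just run this line (assuming the file sits in /data):
sub_data <- read.csv("data/participant_data.csv")
```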

Naming variables

It is good practice to name your variables appropriately. For example, I assigned the data to sub_data instead of data because data is already the name of a built-in R function. This means that if you use data as a variable name, it will mask the built-in function data(), which can cause confusion and errors in your code. An alternative is to use df (short for data frame) or dat (short for data), but these are generic and do not provide any information about the data (which is particularly useful when working with multiple datasets). So, I would recommend using sub_data or participant_data instead.

Just to see if we have loaded in the correct file and correctly assigned it, we can display the first few rows of sub_data using the head function. This is a pre-built function in R which will show the first 6 rows of the data. You can also specify how many rows you want to see by adding a number in brackets, e.g., head(sub_data, 10) will show the first 10 rows.
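For example:

```r
head(sub_data)      # first 6 rows (the default)
head(sub_data, 10)  # first 10 rows
```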

So you can see we have different variables, including age, gender, height, weight, bmi and lesion_volume.

We can calculate summary statistics for each variable using the summary function. The summary statistics will give us the mean, range etc. for each.
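For example:

```r
summary(sub_data)  # mean, median, range etc. for every column
```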

We can now use the $ operator to access a specific column. For example, if we wanted to see the age of each participant, we can run:
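```r
sub_data$age  # the age of every participant
```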

Data Visualization using ggplot2

An important practice when working with data is to have a basic understanding of what the data looks like. This is important for several reasons: seeing the distribution of the data, identifying outliers, and understanding the relationships between different variables - even basic ones!

For example, if we wanted to understand whether people who are older have a higher lesion volume (possibly as a result of aging) we can create a scatter plot of age and total lesion volume, and add a line of best fit. We will use the ggplot2 package, which - as mentioned - is a powerful and flexible plotting system.

Scatter plot

Let’s create a scatter plot of age and lesion volume:
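A sketch of the plotting code (the axis labels are my own additions):

```r
library(ggplot2)

ggplot(sub_data, aes(x = age, y = lesion_volume)) +
  geom_point() +
  geom_smooth(method = "lm") +   # line of best fit from a linear model
  labs(x = "Age (years)", y = "Total lesion volume")
```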

Here we use the ggplot2 package: the geom_point function creates the points, and the geom_smooth function adds a line of best fit. The method = "lm" argument specifies that we want to use a simple linear regression model for the line of best fit.

So, you can see a positive correlation, namely that (generally) the older you are, the higher the total lesion volume will be.

If we wanted to understand whether there is a difference in height and weight between males and females, we can also do this using a scatter plot, but grouped by gender:
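A sketch of the grouped scatter plot:

```r
ggplot(sub_data, aes(x = height, y = weight, colour = gender)) +
  geom_point() +
  labs(x = "Height (m)", y = "Weight (kg)")
```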

The colour argument within aes separates the data by the variable assigned. You can see - as we would expect - that males are generally towards the top and right (i.e., taller and heavier).

Boxplot

To see the specific differences between males and females, we can create a boxplot, which visualizes the median and quartiles in each group:
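A sketch of the boxplot code, here for height (the same pattern works for weight):

```r
ggplot(sub_data, aes(x = gender, y = height, fill = gender)) +
  geom_boxplot() +
  labs(x = "Gender", y = "Height (m)")
```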

Again, we use ggplot2 to structure the plot, and the geom_boxplot function to actually create the boxes. You can see that indeed, men are taller and weigh more than women in the group.

Histogram

Histograms are useful if we want to understand the spread of a particular variable across the group. For example, if we wanted to see the distribution of BMI across the group, we can use the geom_histogram function. This will create a histogram of the BMI variable, and - using the geom_density function - we can also add a density curve to it.

Let’s plot a histogram of BMI:
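A sketch of the histogram code (plotting on the density scale so the density curve overlays correctly; the bin count and colours are my own choices):

```r
ggplot(sub_data, aes(x = bmi)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20,
                 fill = "grey80", colour = "black") +
  geom_density(colour = "blue") +
  labs(x = "BMI", y = "Density")
```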

So you can see that the most common BMI is around 25, with the variable approximately normally distributed.

Basic statistics and significance testing

So we have been able to visualise the data, and now have a general idea of some of the trends. But how can we test whether these relationships are significant? Can we statistically say that men are taller than women? Can we say that the older you are the more lesions you have?

To answer these questions, let’s firstly get some more summary statistics, this time for height and weight specifically. We firstly group the data by gender and run the summarise function to calculate the mean and standard deviation for each group.
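A sketch of the grouped summary:

```r
library(dplyr)

sub_data %>%
  group_by(gender) %>%
  summarise(
    n = n(),
    mean_height = mean(height), sd_height = sd(height),
    mean_weight = mean(weight), sd_weight = sd(weight)
  )
```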

You can see that the mean height for women (n=29) is 1.64 and men (n=23) is 1.80 (metres), whilst the mean weight is 65.9 and 85.5 (kg) respectively.

T-test

But is this difference significant? To find out the statistical difference between the mean of two groups we can simply run a t-test. We do this in R using the t.test function, which will calculate the t-statistic and p-value for us. The default is a two-tailed t-test, which tests whether the means of two groups are significantly different from each other without direction. We can also run a one-tailed t-test, but we need to specify the direction of the difference.
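A sketch of both versions of the test (the one-tailed direction assumes gender levels are ordered alphabetically, females first):

```r
# Two-tailed (default): are the group means different at all?
t.test(height ~ gender, data = sub_data)
t.test(weight ~ gender, data = sub_data)

# One-tailed: does the first factor level (females, if levels are
# alphabetical) have the *greater* mean weight?
t.test(weight ~ gender, data = sub_data, alternative = "greater")
```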

So, in either case, we can see that there is a statistically significant difference in height and weight between males and females. We can also see that the p-value for the one-tailed test - where we test the alternative hypothesis that women weigh more than men - is 1. This makes sense, as it is not the case!

Correlation

Similarly, if we want to see whether there is a significant relationship between two continuous variables, i.e., between age and lesion volume across the entire group, we can run a correlation. We can do this using the cor.test function, which will give us the correlation coefficient (r) and the corresponding p-value.
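A sketch of the correlation test, saving the result so we can inspect it afterwards:

```r
cor_result <- cor.test(sub_data$age, sub_data$lesion_volume)
cor_result
```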

As with the t.test function, when we run the cor.test it actually generates a list of values. The str function will show this structure.
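For example (assuming the correlation result was saved as cor_result):

```r
str(cor_result)  # cor_result from cor.test(sub_data$age, sub_data$lesion_volume)
```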

T-test structure

Run this code to see the structure of the t-test that we ran above:
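A sketch, saving the t-test result first so we can inspect it:

```r
t_result <- t.test(height ~ gender, data = sub_data)
str(t_result)
```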

You can also see that in addition to the various statistics, it also mentions the method, which is Pearson’s product-moment correlation. This is the default method used in R, but you can also use Spearman’s rank correlation or other correlation methods by changing the method argument.

Using the names function, we can see the different data recorded, and see that it contains several components, including the correlation coefficient (r), p-value, confidence intervals and t-statistic.

We can then extract the important values (the r and p-values) from the cor_result object using the $ operator.
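For example (assuming the result was saved as cor_result):

```r
names(cor_result)                # the components recorded in the object

r_value <- cor_result$estimate   # the correlation coefficient
p_value <- cor_result$p.value
```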

And we can re-plot this using ggplot as we did above, but also adding the r and p-values to the plot:
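A sketch of the annotated plot (assuming the correlation result was saved as cor_result; the annotation position is my own choice):

```r
ggplot(sub_data, aes(x = age, y = lesion_volume)) +
  geom_point() +
  geom_smooth(method = "lm") +
  annotate("text",
           x = min(sub_data$age), y = max(sub_data$lesion_volume),
           hjust = 0,
           label = paste0("r = ", round(cor_result$estimate, 2),
                          ", p = ", signif(cor_result$p.value, 2)))
```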

So, whilst there is a significant relationship between age and lesion volume (excluding other predictors), we can see one or two outliers in our data. What would the data look like if we removed them?

We need to remove those with lesion volumes greater than 5. In R there are often several ways to perform the same operation, here are some examples for this case:

# Remove rows where lesion_volume exceeds 5
sub_data_filtered <- sub_data[sub_data$lesion_volume <= 5, ]

# Alternatively, using dplyr
library(dplyr)
sub_data_filtered <- sub_data %>%
  filter(lesion_volume <= 5)

# Or using the subset function
sub_data_filtered <- subset(sub_data, lesion_volume <= 5)

Let’s use the first approach and then plot the relationship with a line of best fit:
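A sketch of the filtered plot:

```r
ggplot(sub_data_filtered, aes(x = age, y = lesion_volume)) +
  geom_point() +
  geom_smooth(method = "lm")
```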

The linear regression model lm(lesion_volume ~ age, data = sub_data_filtered) is equivalent to a correlation test, which we can see by running the cor.test function again on the filtered data.
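```r
cor.test(sub_data_filtered$age, sub_data_filtered$lesion_volume)
```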

So, the relationship is still significant, but less so than before.

Finally, we may also want to see if there is a difference between males and females with lesion volume and age. If so, we would expect to see a significant difference in the correlations between the two groups. Let’s run a correlation for males and females with lesion volume separately, and plot them on the same graph:
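A sketch of the per-gender tests and plot (the "M"/"F" coding and the result names cor_male/cor_female are assumptions; adjust to match your data):

```r
cor_male   <- cor.test(~ age + lesion_volume, data = subset(sub_data, gender == "M"))
cor_female <- cor.test(~ age + lesion_volume, data = subset(sub_data, gender == "F"))

ggplot(sub_data, aes(x = age, y = lesion_volume, colour = gender)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```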

But as we are performing two independent tests, we need to correct for multiple comparisons. We can do this using the p.adjust function, which will adjust the p-values for us. The most common method is the Bonferroni correction, which multiplies each p-value by the number of tests performed (equivalently, it divides the significance threshold by the number of tests).
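For example (assuming the per-gender results were saved as cor_male and cor_female, hypothetical names):

```r
p.adjust(c(cor_male$p.value, cor_female$p.value), method = "bonferroni")
```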

We can see that there is a significant relationship between age and total lesion volume for females, but not for males.

Regression models

In the HAVEN study, we were interested in understanding the influence of different factors upon memory function. To do this we can look at participants’ reaction times and accuracy during an episodic memory task using the rt_data dataset.

The episodic memory task

The episodic memory task was a recall task, where about an hour prior, participants watched a video (actually an episode of the TV series ‘Outnumbered’). Inside the MRI scanner, they were presented with two pictures representing scenes from the episode, and they were asked to select the one which happened first.

Episodic memory task design from the HAVEN study

Let’s load in the data:
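A sketch of the loading step (the path and filename are assumed from the dataset name):

```r
rt_data <- read.csv("data/rt_data.csv")  # assumed path/filename
```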

Run head to see the data:
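```r
head(rt_data)
```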

And summary to see the summary statistics:
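```r
summary(rt_data)
```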

You can see that it is essentially an expanded version of the previous dataset, which includes accuracy and mean reaction time for correct and incorrect responses. Our goal is to create a model to describe how various factors influence participants’ performance on the task.

As before, let’s plot the relationship between age and performance on the task, measured by the percentage accuracy.

Look at your data

Remember, before running any modelling analyses, it is always a good idea to look at your data and its relationships, even if it is rudimentary!

Note that this time the plotting code is significantly longer than before. This is to demonstrate how you can customise your plots to make them look more suitable for a scientific publication.

In this code, we firstly create a custom theme for the plot, which increases the font size and removes the grid lines, before plotting the data using our newly made theme.
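A sketch of that longer plotting code (the column name correct_percent for task accuracy is an assumption, as are the specific theme settings):

```r
# Custom theme: larger font, no grid lines
my_theme <- theme_classic(base_size = 16) +
  theme(panel.grid = element_blank())

ggplot(rt_data, aes(x = age, y = correct_percent)) +  # assumed column name
  geom_point(size = 2) +
  geom_smooth(method = "lm", colour = "black") +
  labs(x = "Age (years)", y = "Accuracy (%)") +
  my_theme
```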

We can see that older participants tend to have more brain lesions and perform worse on the task. However, there’s no direct relationship between the number of lesions and task accuracy. This raises an important question:

Is it age itself that explains lower performance, or do age-related brain changes (like lesions) play a role?

Creating a linear regression model for task performance

Regression models help us understand how different independent variables (“predictors”) influence a dependent (outcome) variable. In our case, we want to see how the predictors age and lesion number influence task performance.

We will keep things simple and just add each predictor separately - first age, then lesion number, and then both together in the same model. Doing this helps us understand how much each variable contributes on its own and whether one explains the effect of the other.

  • If age still predicts performance even when we include lesion number, it suggests that age has its own independent effect on accuracy.

  • If the effect of age disappears when we include lesion number, it suggests that lesions might be the reason older adults perform worse.

Let’s start with one predictor - age. The lm function - as previously used - creates a linear model, and the str function shows the structure of the model.
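A sketch of the single-predictor model (correct_percent is an assumed column name for accuracy):

```r
model_age <- lm(correct_percent ~ age, data = rt_data)
str(model_age)       # the model object is a list of many components
summary(model_age)   # coefficients, p-values and R-squared
```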

You can see that a lot happens underneath the hood! Importantly, we can use the summary function to see the results of the model, including the coefficients, p-values and R-squared value.

The coefficients tell us how much the dependent variable (correct percent) changes for each unit increase in the predictor variable (age). The p-value \(Pr(>|t|)\) tells us whether this relationship is statistically significant. Importantly, the age coefficient is significant (p = 0.0089), meaning that as people get older, their accuracy tends to decrease.

Now let’s run a model with just lesion number:
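A sketch (lesion_number is an assumed column name):

```r
model_lesion <- lm(correct_percent ~ lesion_number, data = rt_data)
summary(model_lesion)
```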

The p-value is 0.419, which is not statistically significant. This means we do not have evidence that lesion number is reliably related to performance.

Let’s now add lesion number to our model, firstly as a fixed effect. This allows us to answer the question:

Does lesion number help explain performance even when we already account for age?
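A sketch of the combined model, with both predictors entered together:

```r
model_both <- lm(correct_percent ~ age + lesion_number, data = rt_data)
summary(model_both)
```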

This model shows that age predicts performance (p = 0.0118), even when we include lesion number. Moreover, lesion number still does not independently explain performance once we account for age. This suggests that age, not lesion number, is the key factor driving the drop in task accuracy.

Calculating group differences using ANOVA and Tukey’s HSD

ANOVAs are commonly used to determine whether there is a significant difference between multiple groups (>2). In our data, we already know that there is a negative correlation between age and task performance, but is there a difference between age groups? We can use an ANOVA to test this.

Firstly, let’s create the age groups, which will be subjects in their 50s, 60s and 70s.

The piping operator

The piping operator %>% is a powerful feature of the dplyr package that allows you to chain together multiple operations. It basically translates to ‘and then’.

So in our code, we are creating a new column called age_group which is based on the age of the participants, those in their 50s (n=17), 60s (n=19) and 70s (n=12). We then group by these groups, and then summarise the data to get the count, minimum and maximum age within each.
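A sketch of that code (the decade cut-offs assume all participants are between 50 and 79):

```r
library(dplyr)

rt_data <- rt_data %>%
  mutate(age_group = case_when(
    age < 60 ~ "50s",
    age < 70 ~ "60s",
    TRUE     ~ "70s"
  ))

rt_data %>%
  group_by(age_group) %>%
  summarise(n = n(), min_age = min(age), max_age = max(age))
```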

Let’s now get the average accuracy within each group:
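```r
rt_data %>%
  group_by(age_group) %>%
  summarise(mean_accuracy = mean(correct_percent))  # assumed column name
```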

Following the plot earlier, we can see that the average accuracy also decreases with age, even after splitting by age group. But is there a significant difference? We can run an ANOVA to test this using the aov function. This function takes two arguments, the variables - consisting of the dependent variable (correct percentage) and independent variable (age groups) - and the data.
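A sketch of the ANOVA:

```r
anova_result <- aov(correct_percent ~ age_group, data = rt_data)
summary(anova_result)
```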

The results tell us that there is evidence that at least one age group has a significantly different mean accuracy from the others. But it doesn’t tell us which one(s) specifically.

To find this out, we can run a post-hoc test. We will run Tukey’s HSD (Honestly Significant Difference) test, which is a common post-hoc test used after ANOVA specifically for this reason.
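A sketch (assuming the ANOVA model from above was saved as anova_result):

```r
# anova_result: the aov model fitted above
TukeyHSD(anova_result)
```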

Contrary to the ANOVA, Tukey’s HSD tells us that there is no significant difference between any of the groups. This is partly because it examines each pair of group means, correcting for multiple comparisons using a family-wise error rate (FWER) correction.

There was a trend for the 50s group to be more accurate than the 70s group, but this narrowly misses significance (p = 0.06). So, we can conclude that there is in fact no significant difference in accuracy between age groups.

Footnotes

  1. Kabacoff, R. I. (2022). R in action: data analysis and graphics with R and Tidyverse. Simon and Schuster.