tl;dr: This outlines how archaeologists could use Principal Component Analysis, realise its true potential, and use it as a “starter” and not a “means to an end” in analysing Geometric Morphometric data, and data more generally.
With an increase in user-friendly statistical software – capable of reading many different formats of shape data (beyond just .tps files) – all with increasingly sophisticated graphical outputs and visualisations, researchers (including ourselves!) are beginning to feel more confident in trying their hand at geometric morphometric methodologies (GMM henceforth). Anyone who has signed up to weekly archaeology and GMM “alerts”, e.g. from ScienceDirect, will be all too familiar with the exponential increase in article output on GMM. And this should be praised, of course! It provides researchers with a replicable method for testing hypotheses about shape where lineal measurements are not robust enough (not that lineal measurements should always be abandoned!). Much of the literature on GMM and archaeology features some form of Principal Component Analysis (PCA). But what is PCA? How useful is PCA? And how can we make the most of PCA? Are we becoming too reliant on it? Do we even know enough about it?
This blog-post stems from two experiences: 1) the number of amendments I have made to my own PCAs over the last few months, as I have realised that these could have been more useful as visual descriptors of analysis, and 2) one Friday night spent article-skimming (too cool, right?) and grappling with a PCA which got through peer-review and was, quite frankly, incomprehensible. This is a post which I hope will provide food for thought when you are making your own. I also apologise for the distinct lack of diagrams, i.e. no diagrams, ahead. It is rather text heavy, but there are plenty of examples out on the web and in articles. Much of my knowledge comes from Davis (1986) and Harper (1999) – there are some terrific diagrams contained within these references, I promise!
This post is the first of a number of blog-posts introducing methods typically used by shape-lovers in their analyses. If you fancy contributing please do contact us (firstname.lastname@example.org). Maybe you’ll disagree with some of the things said here, or perhaps I’ve missed something – please let us know below! As I said, I am not an expert on this, I’m just trying to collate everything I have learned about PCA into one blog as a handy go-to guide (and you thought I was doing this for you guys, right?).
What is Principal Component Analysis (PCA)?
Principal Component Analysis is a statistical technique which finds “components”, or hypothetical variables, which account for as much variance as possible within a multivariate dataset, i.e. a dataset with more than one variable (Davis, 1986; Harper, 1999). These newly-created hypothetical variables are linear combinations of the original variables, and are used to make the data easier to explore and visualise; strong patterns between different data-points can be visualised, and other underlying variables such as size (for morphometric data) can be analysed alongside them. While it can be used for data which features only two dimensions, i.e. two variables (e.g. maximum height and maximum thickness), it is typically used for datasets of three dimensions or greater, when the data become difficult to view as a point-cloud. In order to reduce the number of dimensions the method re-expresses the variation on a new co-ordinate system, in which every data-point receives a new set of coordinates (its principal component scores). These new axes, or “principal components”, represent combinations of the variables analysed and are uncorrelated with each other, with the first principal component representing the main source of variation, the second principal component representing the second main source of variation, and so forth. A great interactive visualisation of this technique can be seen in Powell’s online guide: http://setosa.io/ev/principal-component-analysis/. It is important to remember that PCA does not take into account any form of group structure or subdivisions (males and females, tool-types etc.) – other ordination techniques including Discriminant Function Analysis will be more applicable if you need to do this. In shape-based literature you may come across “relative warps” – fret not! These are principal components visualised with vectors or thin-plate spline transformation grids.
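To make the description above a little more concrete, here is a minimal sketch of a covariance-based PCA in Python with NumPy. The measurements are invented for illustration, and the `pca` helper is a hypothetical name rather than a function from any morphometrics package; real GMM software will also handle steps such as Procrustes alignment that are not shown here.

```python
import numpy as np

# Hypothetical data: 6 artefacts x 3 measurements (length, width, thickness in mm).
X = np.array([
    [120.0, 70.0, 30.0],
    [110.0, 65.0, 28.0],
    [ 95.0, 60.0, 22.0],
    [130.0, 80.0, 35.0],
    [ 85.0, 50.0, 20.0],
    [105.0, 62.0, 27.0],
])

def pca(data):
    """Return (scores, coefficients, eigenvalues) for a covariance-based PCA."""
    centred = data - data.mean(axis=0)        # centre every variable on zero
    # Singular value decomposition of the centred data:
    # the rows of vt are the directions of the principal components.
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    eigenvalues = s**2 / (data.shape[0] - 1)  # variance along each component
    scores = centred @ vt.T                   # coordinates on the new axes
    return scores, vt.T, eigenvalues

scores, coefficients, eigenvalues = pca(X)

print("Eigenvalues (largest first):", np.round(eigenvalues, 3))
print("Covariance of the scores (off-diagonals should be ~0):")
print(np.round(np.cov(scores, rowvar=False), 6))
```

The covariance matrix of the scores comes out diagonal, which is the "uncorrelated with each other" property in numerical form, and the diagonal entries are the eigenvalues in decreasing order.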
When can we use Principal Component Analysis?
Principal Component Analysis has long been employed for the analysis of traditional/lineal measurements, and over the last two decades GMM studies using different landmark and semi-landmark-based methods, on biological and non-biological material, have adopted PCA techniques as well. Whether the data are a set of lineal measurements, two-dimensional outlines of handaxes, Folsom points, or ceramic vessels, or even three-dimensional surfaces of crania, Principal Component Analysis can provide an initial (and I stress initial!) examination of changes in variables and shape.
Some fundamentals of Principal Component Analysis
I’ll summarise this in a set of bullet points, to make it as clear as possible to understand:
- Each principal component has a vector of principal component coefficients (ranging from -1.0 to 1.0), which are used to create the principal component scores and indicate the direction of the corresponding PC axis in relation to the coordinate system of the original variables;
- The principal component variances (otherwise known as eigenvalues) sum to the total of the variances of all the original variables (i.e. the total variance). In many outputs the variance of a principal component will take the form of a percentage of the total variance;
- Many outputs will be able to represent the shape changes/deformations along the principal component axis, e.g. Principal Component 1 may represent a transformation from bottom/distal-heavy stone tools to proximal-heavy stone tools;
- Outputs often include a “scree plot”, a plot of eigenvalues, which indicates the number of significant components which should be considered. Once this curve starts to flatten out, the remaining components may be regarded as insignificant to shape change.
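The first two bullet points can be checked numerically. The sketch below uses randomly generated, purely hypothetical data and an eigen-decomposition of the covariance matrix; it assumes a covariance-based PCA in which each coefficient vector is scaled to unit length, which is what keeps the coefficients between -1.0 and 1.0.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 40 specimens x 5 correlated variables.
X = rng.normal(size=(40, 5)) @ rng.normal(size=(5, 5))

centred = X - X.mean(axis=0)
cov = np.cov(centred, rowvar=False)

# eigh returns eigenvalues in ascending order, so re-sort largest first.
eigenvalues, coefficients = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
coefficients = coefficients[:, order]   # one column of coefficients per component

scores = centred @ coefficients         # principal component scores
total_variance = X.var(axis=0, ddof=1).sum()
percent = 100 * eigenvalues / eigenvalues.sum()

# A text-only stand-in for a scree plot: eigenvalue and percentage per component.
for i, (ev, pct) in enumerate(zip(eigenvalues, percent), start=1):
    print(f"PC{i}: eigenvalue = {ev:.3f} ({pct:.1f}% of total variance)")

print("Sum of eigenvalues:", round(float(eigenvalues.sum()), 6))
print("Total variance of original variables:", round(float(total_variance), 6))
```

The last two printed numbers agree, which is the "eigenvalues sum to the total variance" point, and the per-component percentages are what a scree plot displays graphically.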
Hanging in there? It’ll get easier from here on, I promise.
How many principal components should you consider?
Short answer: there is no correct answer. Some people consider all principal components which account for more than 1% of variance, or which have an eigenvalue greater than 1.0 (Kaiser’s criterion); some people just use the first two components. It depends on how much variance you wish to describe. Someone once described it to me as a town: how many streets (i.e. principal components) should you take to get to know a town? It depends on the size of the streets, and the percentage of the town you want to understand from those streets. If a town is made up of two main streets, is that enough to understand the town? You need to select enough components or streets (if you’re still following the analogy) to explain as much of the variance as you are comfortable with. A “good” PCA plot, for archaeology anyway, typically analyses the first three principal components (i.e. PC1 vs. PC2, PC2 vs. PC3, PC1 vs. PC3), with these components typically accounting for more than 75% of shape variance. When in doubt use common sense! Publications will not let you publish hundreds of these plots, so think of the audience, but most importantly use your results appropriately.
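The rules of thumb above are easy to compare side by side. This is a minimal sketch using made-up eigenvalues; `n_components_for` is a hypothetical helper name. One assumption worth flagging: Kaiser’s criterion is normally applied to eigenvalues from a PCA of the correlation matrix (where the average eigenvalue is 1.0), so it may not be meaningful for covariance-based PCAs of shape coordinates.

```python
import numpy as np

def n_components_for(eigenvalues, target=0.75):
    """Smallest number of leading components whose cumulative share of the
    variance reaches `target` (a proportion between 0 and 1)."""
    share = np.asarray(eigenvalues, dtype=float)
    share = share / share.sum()
    cumulative = np.cumsum(share)
    # Index of the first component at which the running total reaches the target.
    count = int(np.searchsorted(cumulative, target)) + 1
    return min(count, len(share))   # guard against floating-point shortfall

# Hypothetical eigenvalues from a correlation-matrix PCA of five variables.
eigenvalues = np.array([2.5, 1.5, 0.6, 0.3, 0.1])
share = eigenvalues / eigenvalues.sum()

print("Components needed for 75% of variance:", n_components_for(eigenvalues, 0.75))
print("Components needed for 90% of variance:", n_components_for(eigenvalues, 0.90))
print("Components with more than 1% of variance:", int(np.sum(share > 0.01)))
print("Components passing Kaiser's criterion:", int(np.sum(eigenvalues > 1.0)))
```

The different rules can give quite different answers for the same eigenvalues, which is really the point of the town analogy: the choice depends on how much of the variance you need to describe.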
Displaying and analysing the PCA results
The plots should be designed to be informative, whilst being clear at the same time. The suggestions below should not be treated as gospel, but rather designed to just make you think about your plot, and the information that can be displayed.
- PCA plots should have equal axes in order to accurately display the transformation of data; this needs to be better emphasised as many studies fail to modify their graphs (programs like PAST have recently added “equal axes” to their display settings – there is now no excuse!).
- Convex hulls can be added in order to display the entire range and distribution among different groups along the principal components displayed. 95% confidence ellipses (ellipses expected to enclose roughly 95% of a group’s points, assuming the scores are approximately normally distributed) are an alternative method to display the distribution of points within groups, and are more appropriate when you have some really far-out outliers.
- Some programs will not display the percentage of variance (inertia) explained by each principal component/relative warp; if possible, add these to the axis labels, just to allow the reader to fully understand the principal component plot.
- As with other graphs, include a legend and labels where appropriate.
- Visualising the changes in shape, whether at the extremes of the axes or by highlighting the shape of individual examples, is always a great way of communicating what is happening within a PCA plot, and it saves on words!
- Preference: only use colour when appropriate! Rainbow PCAs may look cool, but when you’re dealing with many different groups it’ll get trippy. Maybe try symbols instead of colours?
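Several of the suggestions above (equal axes, convex hulls, percentages on the axis labels, a legend, and symbols rather than colours) can be combined in one plot. The following is only a sketch, assuming Matplotlib is available; the scores, group names and percentages are invented, and `convex_hull` is a hypothetical helper written out here so that no further libraries are needed.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")   # render off-screen; remove this line for interactive use
import matplotlib.pyplot as plt

def convex_hull(points):
    """Andrew's monotone chain: hull vertices of 2-D points, counter-clockwise."""
    pts = sorted(map(tuple, points))
    if len(pts) <= 2:
        return np.array(pts)

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half(seq):
        chain = []
        for p in seq:
            # Drop points that would make a clockwise (or straight) turn.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return np.array(half(pts) + half(reversed(pts)))

rng = np.random.default_rng(1)
# Hypothetical PC1/PC2 scores for two made-up groups.
groups = {
    "Group A": rng.normal([-0.05, 0.00], 0.02, size=(15, 2)),
    "Group B": rng.normal([0.04, 0.01], 0.03, size=(15, 2)),
}
markers = {"Group A": "o", "Group B": "^"}   # symbols instead of colours
percent = (52.3, 21.7)                       # made-up % of variance for PC1, PC2

fig, ax = plt.subplots(figsize=(6, 6))
for name, pts in groups.items():
    ax.scatter(pts[:, 0], pts[:, 1], marker=markers[name],
               facecolors="none", edgecolors="black", label=name)
    hull = convex_hull(pts)
    outline = np.vstack([hull, hull[:1]])    # repeat first vertex to close the hull
    ax.plot(outline[:, 0], outline[:, 1], color="grey", linewidth=1)

ax.set_aspect("equal", adjustable="datalim")  # one unit on PC1 = one unit on PC2
ax.set_xlabel(f"PC1 ({percent[0]:.1f}% of variance)")
ax.set_ylabel(f"PC2 ({percent[1]:.1f}% of variance)")
ax.legend()
# fig.savefig("pca_plot.svg")  # uncomment to write an editable .svg file
```

Setting the aspect to "equal" is the key design choice: without it the plot is stretched to fill the figure and distances along PC2 look larger than they are relative to PC1.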
So you’ve made the PCA graph… Congratulations! Seeing visual differences in the distribution of the points? At this part you may be tempted to just document the PCA plot, discuss the findings, and conclude. There is, however, so much more you can do with the data! If you take the PC scores…
- You can document variability in the principal components in other forms of visual descriptors. How about a box-plot? It’ll give you another visual descriptor of how tightly clustered the data are along the first few principal components, and lets you see how tight different groups are along certain PCs;
- Test for statistical significance: maybe perform a MANOVA and see if there is actual statistical significance among the first twenty principal components. Or maybe perform Canonical Variate Analysis? You’re spoilt for choice with some statistical programs… (remember though that statistical significance is not always archaeological significance!);
- Regression: plot the main source of shape variation against other factors like size, or other forms of data, e.g. latitude or time.
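Two of these follow-ups can be sketched with NumPy alone: the numbers behind a box-plot of PC scores by group, and a simple least-squares regression of PC1 on size. Everything here is invented for illustration (the scores, the sizes, the group labels), and a MANOVA is deliberately left out because it needs a dedicated statistics package.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical inputs: a size measure and PC1 scores for 30 specimens in 2 groups.
size = rng.uniform(50, 150, size=30)
pc1 = 0.001 * (size - 100) + rng.normal(0, 0.01, size=30)
group = np.array(["A"] * 15 + ["B"] * 15)

# The numbers a box-plot would draw: quartiles of PC1 within each group.
for g in np.unique(group):
    q1, median, q3 = np.percentile(pc1[group == g], [25, 50, 75])
    print(f"Group {g}: Q1 = {q1:.4f}, median = {median:.4f}, Q3 = {q3:.4f}")

# Ordinary least-squares regression of PC1 on size.
slope, intercept = np.polyfit(size, pc1, deg=1)
predicted = slope * size + intercept
residual_ss = np.sum((pc1 - predicted) ** 2)
total_ss = np.sum((pc1 - pc1.mean()) ** 2)
r_squared = 1 - residual_ss / total_ss

print(f"PC1 = {slope:.5f} * size + {intercept:.5f}")
print(f"R^2 = {r_squared:.3f}")
```

The R² value gives a rough sense of how much of the variation along PC1 tracks size; whether that matters depends, as ever, on the research question.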
Most importantly it depends on what your hypothesis is and what your research questions are. Do not just do a technique because you can; do it because it is relevant.
Just have fun exploring the PCA options of the different statistical programs. Many programs allow you to save your files as an .svg meaning you can truly customise it further in a variety of programs! But no rainbows, please?
References
Davis, J.C. 1986. Statistics and Data Analysis in Geology. John Wiley & Sons, New York.
Harper, D.A.T. (ed.). 1999. Numerical Palaeobiology. John Wiley & Sons, New York.