Diagnostics are crucial for patient survival. Being able to diagnose terminal diseases such as cancer at earlier stages drastically increases the chance of survival. To do this, we need to know what signals to look for. This is where liquid biopsy is useful: it isolates biomarkers from the blood that are known to correlate with different cancers. Previously, I used logistic regression to look at a small subset of biomarkers and determine whether they are good indicators of cancer. This time, I am going to determine whether the data set I am working with can be visually separated into cancer and healthy groups, and which variables play the biggest role in determining cancer. On top of that, I am going to determine which factors contribute most to each specific cancer.
The K-medoids clustering algorithm is very similar to K-means, except that it uses an actual data point (the medoid) as the center of each cluster instead of calculating distances from a centroid. Once it has chosen the medoids, it tries swapping each medoid with each non-medoid point and calculates the resulting total dissimilarity between the points and their nearest medoid. It repeats this process iteratively until no swap lowers the total dissimilarity. The benefit is that the clustering becomes more robust and less sensitive to the effects of outlier points.
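The swap procedure described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the implementation used for this analysis: the Manhattan distance, the naive "first k points" initialization, and the toy `points` list are all assumptions made for the example.

```python
from itertools import product


def manhattan(a, b):
    """Dissimilarity between two points (sum of absolute differences)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def total_cost(points, medoids, dist=manhattan):
    """Sum of each point's dissimilarity to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def k_medoids(points, k, dist=manhattan):
    """Greedy swap search; returns (medoid indices, cluster labels)."""
    medoids = list(range(k))  # naive start: the first k points (assumption)
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        # Try replacing each medoid slot with each non-medoid point.
        for slot, candidate in product(range(k), range(len(points))):
            if candidate in medoids:
                continue
            trial = medoids[:slot] + [candidate] + medoids[slot + 1:]
            cost = total_cost(points, trial, dist)
            if cost < best:  # keep the swap only if it lowers the total cost
                medoids, best, improved = trial, cost, True
    # Label each point with the slot of its nearest medoid.
    labels = [min(range(k), key=lambda j: dist(p, points[medoids[j]]))
              for p in points]
    return medoids, labels


# Hypothetical toy data: two tight groups plus one outlier at (30, 30).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (20.0, 20.0), (21.0, 20.5), (20.5, 21.0),
          (30.0, 30.0)]
medoids, labels = k_medoids(points, k=2)
```

Because every medoid must be one of the observed points, the outlier is assigned to the nearer group but cannot drag that group's center toward itself the way it would shift a mean-based centroid.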
To determine the composition of the two clusters that we saw, I decided to use a PCA biplot, with different colors representing the different types of cancer. This way we can hopefully determine whether the two clusters are meaningful for separating the cohort into cancer and healthy patients.
The PCA shows that the two distinct clusters seen previously each contain a mix of healthy patients and patients with all of the different cancer types. This could be due to having too many biomarkers in the analysis, so I will repeat the process with a more filtered set of features. Since the clusters were not separated by cancer type, I also checked whether they were separated by gender. Looking at the second PCA biplot, we can see that this is not the case.
To reduce the amount of “noise” in my data set, I decided to reduce the number of variables in my analysis. To do this, I took the most important features from PC1, PC2, and PC3 and used those to create a new filtered data set. The new filtered data set contained only 32 biomarkers instead of 42.
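One way to pick "the most important features" from the first three components is to rank features by the absolute value of their loadings. The sketch below assumes that criterion, which may differ from the one actually used here. The helper `top_loading_features`, the `top_n` cutoff, and the synthetic columns `m0` to `m5` are all hypothetical.

```python
import numpy as np


def top_loading_features(X, names, n_components=3, top_n=2):
    """Union of the top_n features by |loading| on each leading component."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize each column
    _, _, vt = np.linalg.svd(Z, full_matrices=False)  # rows of vt are PC loadings
    keep = set()
    for pc in vt[:n_components]:
        strongest = np.argsort(np.abs(pc))[::-1][:top_n]  # largest |loading| first
        keep.update(strongest.tolist())
    return [names[i] for i in sorted(keep)]


# Synthetic stand-in data (not the biomarker table): three columns driven by
# one latent signal, two columns by another, and one independent column.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
columns = [a + 0.1 * rng.normal(size=200) for _ in range(3)]
columns += [b + 0.1 * rng.normal(size=200) for _ in range(2)]
columns += [rng.normal(size=200)]
X = np.column_stack(columns)
names = ["m0", "m1", "m2", "m3", "m4", "m5"]

selected = top_loading_features(X, names)   # feature names to keep
X_filtered = X[:, [names.index(n) for n in selected]]
```

Taking the union across components means a feature is kept if it matters for any of the leading PCs, so the filtered set can hold fewer than `n_components * top_n` features when the same feature ranks highly on more than one component.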
Since filtering features based on principal component contribution did not give the cluster separation that I was looking for, I will resort to Recursive Feature Elimination (RFE) and then use non-metric multidimensional scaling (NMDS) to visualize the clusters.
RFE runs a random forest algorithm on the features and target that I give it. During the RFE process, it cross-validates the accuracy of the predictions over a range of subset sizes that I specify. Here we see that RFE starts by making predictions with only 1 variable and keeps going until it is making predictions with all of my variables as features. My CV plot shows that accuracy reaches an optimum with 11 biomarkers. I will now filter my existing data set down to only these features.
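For readers who want to try the same idea, the sketch below runs cross-validated RFE with a random forest using scikit-learn's `RFECV`. It is not the code behind the CV plot described above. The data come from `make_classification` as a synthetic stand-in for the biomarker table, and the fold count, tree count, and feature count are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for the biomarker table (not the data analyzed here).
# With shuffle=False the informative features are the first three columns.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    step=1,                    # drop one feature per elimination round
    cv=5,                      # 5-fold cross-validation
    scoring="accuracy",
    min_features_to_select=1,  # evaluate subset sizes from 1 up to all features
)
selector.fit(X, y)

n_selected = selector.n_features_   # subset size with the best mean CV accuracy
selected_idx = [i for i, keep in enumerate(selector.support_) if keep]
X_filtered = selector.transform(X)  # data restricted to the selected features
```

`selected_idx` plays the role of the list of retained biomarkers, and `X_filtered` is the reduced table that would be passed on to the NMDS and PCA plots.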
Once the features were selected, I decided to visualize the differences between the cancer types using both an NMDS plot and a PCA biplot. The reasoning behind plotting the data with NMDS was to see whether I could find any non-specific orientation of the data that might lead to a separation between the groups in the cohort. Looking at the NMDS plot, there are no significant differences in clustering between the 4 groups, even with the newly filtered data table.
# Conclusion

Fortunately, looking at the new PCA biplot, we do see more separation between the groups. There still seems to be a lot of overlap between all groups, but it is easier to tell which biomarkers play a more significant role in determining whether a patient has cancer or is healthy. Looking at the graph, CA19-9, CA125, OPN, Prolactin, and Ferritin seem to signal cancer, while MIF, FGF2, Total.PSA, Leptin, and possibly sVEGFR1 and sHer2 correlate with patients being healthy.