Statistical Analysis of Different Methods
Used to Optimise EPA Yield
College of Engineering
Abstract – Certain fatty acids have long been known, by both consumers and producers, to help in the prevention of many
common diseases. However, the best way to produce these beneficial acids commercially is still poorly understood.
This study models the effect of different factors on the yield of EPA (eicosapentaenoic acid), a beneficial fatty acid.
Statistical analysis showed that the PI metal solution had no significant effect on the overall EPA yield at the two
levels tested. Temperature and pH, on the other hand, did influence the EPA yield. The NaCl concentration was not
found to affect the EPA yield. Further experimentation is required to determine the optimal values of each factor.
Keywords – EPA yield, Statistical analysis, Dataset,
Cells, Factors, Significance, Effect.
Cardiovascular diseases and cancers are among the most common illnesses in the U.K.
Over 150,000 people die from cancer each year [1] and an estimated 7 million people
currently live with some form of cardiovascular disease [2]. Another highly prevalent
condition in the U.K. is arrhythmia (heart rhythm problems), which more than 2 million
people experience annually [3]. These diseases can seriously affect all aspects of a
patient’s life, limiting physical ability, mental state, relationships and home life.
The large number of cases also places considerable strain on health services. As a result,
treatment and prevention techniques are in high demand from both patients and caregivers,
leading to increased interest from researchers.
Several previous studies [4-6] have shown that the ω-3 polyunsaturated fatty acids
EPA (eicosapentaenoic acid) and DHA (docosahexaenoic acid) play a role in the
prevention of the aforementioned diseases. A primary source of these fatty acids is
fish oil, which is also the current commercial source for production. However, this
form of production has limitations [7] and so a new method is needed.
This study uses the experimental data published by the University of Hong Kong [8]
on the growth of Nitzschia laevis. The data were collected by subculturing cells in a
medium (whose composition reflects the current optimum for growth, as determined in
previous studies), with precautions taken to avoid precipitation. Statistical analysis
is applied first to data on EPA yields at different levels of PI metal solution, and
then to data on EPA yields where both the medium components and environmental factors
are varied. The study aims to determine how best to increase EPA yield in production by
establishing, with statistical evidence, what effects different factors have on the
cells' growth.
A. Abbreviations and Acronyms
CaCl2 Calcium chloride
mg/L Milligrams per litre
mL/L Millilitres per litre
g/L Grams per litre
°C Degrees Celsius
pH A logarithmic value, and so unitless
Materials and Methods
A. Experiment 1
Experiment 1 was carried out to identify whether the PI metal solution had an effect
on the EPA yield. Only one variable, the PI value, was changed. The other variables
were held constant, with NaCl fixed at 16 g/L, CaCl2 at 0.204 g/L, temperature at
22 °C and pH at 7.5. Twenty tests were carried out at a PI value of 13.5 mL/L and a
separate 20 at a PI value of 4.5 mL/L. The EPA yield of each sample was measured for
comparison.
B. Experiment 2
The scope of this experiment was larger than that of the first; it consisted of 27 tests.
Here the EPA yield of the cells was measured to study the effects of the medium
component (NaCl) along with temperature and pH.
C. Statistical Analysis
Quantitative results are presented to two decimal places. Hypothesis tests used an
alpha level of 0.05, so p-values <0.05 were considered statistically significant.
An alpha level of 0.1 was assumed for data modelling.
IV. Results and Discussion
A. Experiment 1, Data comparison
Once the datasets at each PI level had been obtained, the first step was to determine whether any major differences occurred between the results. A clear quantitative description of the data allowed quick, simple comparisons.
Table 1.1, Data descriptions at each PI metal solution level (EPA yield in mg/L)
Data description       PI = 4.5 mL/L    PI = 13.5 mL/L
Mean                   217.17           219.03
Median                 217.50           220.15
Standard deviation     3.82             5.20
Lower quartile         214.10           215.20
Upper quartile         219.85           222.75
Interquartile range    5.75             7.55
The values in table 1.1 show a relatively small difference between the means of the two datasets, and the same is true of the medians. This suggests that the average EPA yield is not greatly affected by a change in PI metal solution, and so this is taken as the null hypothesis for further testing. The descriptive values at PI 13.5 mL/L (dataset 2) are higher overall than those at PI 4.5 mL/L (dataset 1); in particular, the standard deviation (SD) and interquartile range (IQR) are noticeably larger. The SD expresses how far the data points typically lie from the mean, so the larger value suggests that the EPA yields at PI 13.5 mL/L are more widely spread than those at PI 4.5 mL/L.
To represent the spread of each dataset visually and analyze the results further, boxplots (figure 1) were created.
Figure 1, Boxplots of each dataset.
Figure 1 supports the observation that dataset 2 covers a larger range of values, as seen from the larger size of the second plot. The shape of each distribution was also inferred from the plots, with dataset 1 showing a symmetrical distribution. Dataset 2 arguably has a slight negative skew, but this must be investigated further before it is taken to be the final distribution.
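Summary statistics of the kind listed in table 1.1 can be computed from raw yields along the following lines. This is a minimal sketch rather than the procedure used in the study: the yields are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since the source does not state how its quartiles were obtained. The second function applies the Freedman-Diaconis rule, used for the histograms in this section, to the same IQR.

```python
import math
import statistics


def describe(sample):
    """Return the summary statistics of the kind reported in table 1.1."""
    # Quartile convention is an assumption; other conventions shift q1/q3 slightly.
    q1, median, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    return {
        "mean": statistics.mean(sample),
        "median": median,
        "sd": statistics.stdev(sample),  # sample SD (n - 1 denominator)
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }


def freedman_diaconis(sample):
    """Return (bin width, number of bins) from the Freedman-Diaconis rule."""
    iqr = describe(sample)["iqr"]
    width = 2 * iqr / len(sample) ** (1 / 3)
    bins = math.ceil((max(sample) - min(sample)) / width)
    return width, bins


# Hypothetical EPA yields in mg/L, for illustration only (not the study data).
yields = [210.0, 212.0, 214.0, 216.0, 218.0, 220.0, 222.0, 224.0]
summary = describe(yields)
width, bins = freedman_diaconis(yields)
print(summary)
print(f"bin width = {width:.2f} mg/L, bins = {bins}")
```

Rounding the bin count up is a design choice of this sketch so that the histogram covers the whole range of the data.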
Absolute frequency histograms (figures 2.1 and 2.2) were plotted to further evaluate the distribution of each dataset.
Figure 2.1 – Absolute frequency histogram of PI metal solution 4.5 mL/L.
Figure 2.2 – Absolute frequency histogram of PI metal solution 13.5 mL/L.
The appropriate number of bins to best visualize the distribution was calculated using the Freedman-Diaconis rule [9]:
$$\text{Bin width} = \frac{2\,\mathrm{IQR}}{\sqrt[3]{n}}$$
where IQR is the interquartile range of the sample and n is the number of observations in the sample. Rounded to the nearest whole number, dataset 1 needed 4 bins whereas dataset 2 required 6.
Figure 2.1 confirms that dataset 1 follows a symmetrical distribution. Datasets with symmetrical histograms have approximately equal means and medians, which is confirmed in table 1.1. To check the suggestion from the boxplot that dataset 2 is negatively skewed, the shape of its histogram was studied. The histogram in figure 2.2 shows a clear left skew, with a steady frequency for 3 bins before an increase occurs. A left-skewed histogram suggests that the mean of the data is less than the median; this is true for dataset 2, as shown in table 1.1. Plotting relative frequencies gave the same results concerning the shape of the distribution (the sample size of each dataset was the same).
B. Experiment 1, Claim test
As identified from comparing the data, there is not a large difference between the medians of the two datasets (the claim). Hypothesis testing was carried out to test the statistical significance of this claim. For the purpose of the hypothesis test, no assumptions were made about the distribution of the data. As the sample size was small (<30), a non-parametric test was deemed most appropriate. The test was two-sample and two-tailed, and because of its non-parametric nature a sign test was used.
The null hypothesis of the test expressed the claim:
H0: M1 = M2
That is, the median of dataset 1 is statistically the same as the median of dataset 2.
The alternate hypothesis would suggest that the PI metal level had a significant effect on EPA yield:
H1: M1 ≠ M2
An alpha level of 0.05 was chosen for the test. The test gave a p-value of 0.8238; as this is >0.05 we fail to reject the null
hypothesis, which is supported by the h value of 0.
The difference between the medians of the two datasets can therefore be
taken as statistically insignificant.
This suggests that the PI metal solution does not have an effect on the EPA yield.
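The sign test can be sketched in a few lines. The sketch assumes the two samples are compared pairwise (a sign test needs equal-length samples) and uses the exact binomial distribution. The function names are illustrative, not taken from the source. As an aside, a split of 9 signs against 11 among 20 pairs gives a two-sided p-value of about 0.8238, which would be consistent with the reported value; the split itself is an inference and is not stated in the source.

```python
from math import comb


def sign_test_p_value(n_pos, n_neg):
    """Two-sided exact sign test p-value from counts of positive and negative differences."""
    n = n_pos + n_neg
    k = min(n_pos, n_neg)
    # Under H0 each sign is equally likely, so the smaller count is Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def sign_test(x, y):
    """Paired sign test on two equal-length samples; tied pairs are dropped."""
    diffs = [a - b for a, b in zip(x, y)]
    n_pos = sum(d > 0 for d in diffs)
    n_neg = sum(d < 0 for d in diffs)
    return sign_test_p_value(n_pos, n_neg)


# A 9-to-11 split over 20 pairs (an assumed split, see the note above).
p = sign_test_p_value(9, 11)
print(f"p = {p:.4f}, reject H0 at alpha = 0.05: {p < 0.05}")
```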
To assess further the finding that we fail to reject the null, additional testing was
carried out. Parametric testing allowed this, but before it could be carried out
assumptions about the distribution had to be made.
To determine how the data were distributed, a normal probability plot of each dataset was made (figures 3 and 4).
Figure 3, Normal probability plot of dataset 1
As the points of dataset 1 follow the line closely and do not follow a trend of their own,
they can be said to be linear.
The data can therefore be said to be compatible with a normal distribution.
Figure 4, Normal probability plot of dataset 2.
Despite the box and histogram plots of dataset 2 showing a left skew, its normal
probability plot (figure 4) suggests it is normally distributed. This difference between the representations of the
distribution is likely because the skew is only slight and not
large enough to be observed on a normal probability plot. Although the points do not sit directly on
the red line, they do not follow a trend of their own and so the shape can be
concluded to be linear. For the purpose
of confirming the failure to reject the null hypothesis, dataset 2 will be assumed to
be normally distributed.
QQ plots (figure 5) were also used to confirm the distribution of the data. For both datasets the points lie close to the
line and do not follow a trend of their own, so the data can be said to be normally distributed.
Figure 5, QQ plots of the datasets.
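The points of a normal QQ plot can be generated as below. This is a sketch under assumptions: the plotting positions (i - 0.5)/n are one common convention and may differ from the one used for figure 5, and the correlation between theoretical and observed quantiles is offered only as a rough numerical companion to the visual check described above. The sample is hypothetical.

```python
from statistics import NormalDist, mean


def qq_points(sample):
    """Pair each ordered observation with its theoretical standard normal quantile."""
    n = len(sample)
    ordered = sorted(sample)
    # Plotting positions (i - 0.5) / n are an assumed convention.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


def qq_correlation(sample):
    """Correlation of the QQ points; values near 1 mean the points lie close to a line."""
    pts = qq_points(sample)
    mx = mean(x for x, _ in pts)
    my = mean(y for _, y in pts)
    cov = sum((x - mx) * (y - my) for x, y in pts)
    sx = sum((x - mx) ** 2 for x, _ in pts) ** 0.5
    sy = sum((y - my) ** 2 for _, y in pts) ** 0.5
    return cov / (sx * sy)


# Hypothetical EPA yields in mg/L, for illustration only.
yields = [211.2, 214.0, 215.5, 216.8, 217.4, 218.1, 219.6, 221.0, 223.3]
print(f"QQ correlation = {qq_correlation(yields):.3f}")
```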
The distribution of the data was then assumed to be normal. However, as the data are unpaired, the
variances of the two datasets must be shown to be statistically similar before a
parametric Student's t-test can be conducted.
The sample variances, implied by the standard deviations in table 1.1, can be seen to differ.
Despite this, a test was conducted to assess the statistical significance of the
difference between the variances.
The following hypotheses were set:
H0: The datasets have normal distributions and similar variances.
H1: The datasets have unequal variances.
The test produced an h-value of 0, so we failed to reject the null and a Student's
t-test was deemed appropriate.
Using the t-test, the means of the datasets were compared. The null hypothesis was that
the means of the two datasets are equal; the alternate hypothesis would suggest that
the PI metal level had an effect on the EPA yield.
The t-test gave a p-value of 0.2359 (>0.05) and an h value of
0. Once again this means we fail to
reject the null hypothesis, and so the means of the two datasets are not
statistically different. As with the medians,
the similarity of the means suggests that the PI metal solution does not have an effect on
the EPA yield.
The test also
gives a confidence interval of -5.04 to 1.32; as 0 lies within this interval, it
confirms that we fail to reject the null hypothesis.
Overall, the data can be said to show that changing the PI metal solution does not
affect the average EPA yield from the growth of the cells.
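The logic of the equal-variance t-test can be illustrated from the summary statistics in table 1.1. This is a sketch, not the computation performed in the study: it works from rounded means and standard deviations rather than the raw data, so it need not reproduce the reported p-value or confidence interval exactly. The critical value 2.024 (two-sided 5% level, 38 degrees of freedom) is taken from standard t-tables and is an assumption of the sketch.

```python
import math

# Summary statistics from table 1.1 (EPA yield in mg/L); 20 tests per dataset.
n1, mean1, sd1 = 20, 217.17, 3.82   # PI = 4.5 mL/L
n2, mean2, sd2 = 20, 219.03, 5.20   # PI = 13.5 mL/L

# Two-sided 5% critical value of Student's t for 38 degrees of freedom,
# from standard tables (assumed here, not given in the source).
T_CRIT_38 = 2.024


def pooled_t(n1, mean1, sd1, n2, mean2, sd2, t_crit):
    """Equal-variance two-sample t statistic and confidence interval for mean1 - mean2."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return diff / se, (diff - t_crit * se, diff + t_crit * se)


t_stat, (lower, upper) = pooled_t(n1, mean1, sd1, n2, mean2, sd2, T_CRIT_38)
reject_null = abs(t_stat) > T_CRIT_38
print(f"t = {t_stat:.2f}, 95% CI = ({lower:.2f}, {upper:.2f}), reject H0: {reject_null}")
```

The decision rule is the same one used in the text: the null hypothesis is retained when the interval for the difference in means contains 0.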
C. Experiment 2, surface model
Experiment 2 investigates which factors influence the response of the cells in producing
EPA. A boxplot of the data was plotted
to show the range of the results and check for any outliers (figure 6). One outlier was identified at sample 26 and
so it was removed from the data before any further testing was conducted. The
outlier was dropped as it did not majorly affect
the results but would have affected the assumptions.
Figure 6, Boxplot of the data of experiment 2
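A boxplot usually flags as outliers the points lying more than 1.5 times the IQR beyond the quartiles. The sketch below applies that rule numerically. Both the factor of 1.5 and the quartile convention are assumptions, since the source does not state which settings produced figure 6, and the yields are hypothetical.

```python
import statistics


def boxplot_outliers(sample, k=1.5):
    """Return the values lying outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in sample if x < lower or x > upper]


# Hypothetical EPA yields in mg/L, for illustration only.
yields = [210.0, 211.0, 212.0, 213.0, 214.0, 215.0, 216.0, 240.0]
print(boxplot_outliers(yields))
```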
The technique of multiple least squares was
adopted to model the optimal operating conditions of the experiment (figure 7). It was important to fit the correct model to
the data; in this case the equation below was used:
Here the response gives the EPA yield, and X1, X2 and X3 correspond
to the pH, NaCl and temperature values respectively.
Figure 7, linear regression model
The model (figure 7) calculates an equation that minimizes the sum of squared distances
between the data points and a fitted line (the least squares method). B0 represents the y-intercept of the graph. B1
gives the slope of the line, that is, the change in the response per unit change in the corresponding variable.
The model produces an adjusted R2 value of 0.77, a
statistical measure of how close the points lie to the regression line that takes
direct account of the number of predictors. As
this value is relatively close to 1, the model can be said to fit the data
reasonably well.
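The least squares fit and the adjusted R2 can be sketched as follows. Several assumptions apply: the model here is a plain first-order model with two predictors (the study's own model equation is not reproduced in this text), the data are hypothetical, and the coefficients are found by solving the normal equations directly, which is adequate for a small, well-conditioned example. The adjusted R2 uses the standard formula 1 - (1 - R2)(n - 1)/(n - p - 1), with n observations and p predictors.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_least_squares(rows, y):
    """Return [b0, b1, ...] minimising the sum of squared residuals."""
    design = [[1.0, *row] for row in rows]  # leading 1.0 gives the intercept b0
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coeffs, row):
    """Evaluate the fitted model at one set of predictor values."""
    return coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], row))


def adjusted_r2(y, fitted, n_predictors):
    """Adjusted R-squared: R-squared penalised for the number of predictors."""
    n = len(y)
    mean_y = sum(y) / n
    ss_res = sum((a - f) ** 2 for a, f in zip(y, fitted))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


# Hypothetical (pH, temperature) settings and EPA yields, for illustration only.
rows = [(7.0, 18.0), (7.5, 22.0), (8.0, 26.0), (7.0, 26.0), (8.0, 18.0), (7.5, 18.0)]
y = [150.0, 171.0, 186.0, 176.0, 166.0, 154.0]
coeffs = fit_least_squares(rows, y)
fitted = [predict(coeffs, row) for row in rows]
print(coeffs, adjusted_r2(y, fitted, n_predictors=2))
```

Because the penalty grows with the number of predictors, removing a predictor can move the adjusted R2 in either direction, depending on how much explanatory power is lost.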
Once the model was produced, it was important to check that the form
of the model was appropriate. Linear
modelling is only appropriate for normally distributed data and so this was
checked (figure 8).
Figure 8, Normal probability plot of experiment 2
As in experiment 1, the points lie close to the line and so
the data can be said to be normally distributed; the median EPA yield can therefore be
taken to be approximately the same as the mean.
Any bias present in the model must also be checked, bias here meaning that a
parameter estimate takes too extreme a value.
The bias of the model was checked through visual inspection of a
residuals vs predicted plot (figure 9).
Figure 9, Predicted vs Residuals of experiment 2
The plot (figure
9) shows that the points are scattered without order; no substantial
trend can be seen along the horizontal axis. Because of this, it can be said that the assumptions
of the linear model hold and the model is not biased.
The overall accuracy of the model was also checked with an actual
vs fitted plot.
Figure 10, showing Actual vs fitted points of data
from experiment 2
Figure 10 shows that the points sit close to the 45-degree line, and
so the model can be said to be a ‘good fit’.
D. Experiment 2, Refining the model
In order to refine the model and give the
data the best possible fit against the regression line (shown by an increased
adjusted R-squared value), statistically insignificant variables had to be discarded.
The alpha value was taken to be 0.1 and each p-value was checked against it. Variables with a p-value <0.1 were statistically significant and only these were carried through to the refined model. The intercept remains in the model even if it is found to be statistically insignificant, as the model must always intercept the y axis at some point. In this model only 3 variables were found to be statistically significant and were retained.
Figure 11, refined model
Figure 11 shows a simplified model of the data, excluding any variables that did not have a significant effect on the EPA yield. As with the original model, plots were made to test the accuracy and bias of the model.
Figure 12, Normal probability plot of the simplified model
Figure 14, Actual vs Predicted plot of data for model 2
The same conclusions can be drawn from figures 12, 13 and 14 as for the first model. The plots show that the new model is accurate, not biased and a good fit.
A simplified model should theoretically show an increased adjusted R-squared value, yet in this case it decreases. As the decrease is very small (0.028), it can be assumed that this is due to an error in the data collection. Analysis of the data collection and the results suggests that the large changes in temperature may be the source of the error. During data collection the temperature varied by a relatively large amount (8 degrees), and it may also have been difficult to keep the temperature constant throughout the experiment, leading to changes in the results.
Figure 15, multiple regression surface model
A response surface model of the simplified model was created (figure 15). It plots temperature and pH against EPA yield, as these were the only factors found to have an effect on the yield. NaCl was not found to influence EPA yield and is therefore not included in the model. The response surface model can be used to predict and optimize the yield.
E. Conclusion
Statistical analysis of experiment 1 showed that the EPA yield is not significantly affected by the PI metal solution at the 2 levels tested. As it is not a factor that needs to be optimized, the cheapest level of PI metal solution should be used for the production of EPA. On the other hand, analysis of experiment 2 showed that both temperature and pH do have a significant effect on the yield. Both of these parameters should be optimized (using values from the surface model) throughout production so that the greatest yield is produced most efficiently. Like the PI metal solution, the NaCl level was not found to have a significant effect on the yield, and so again the cheapest concentration should be used.
V. Reflection
Previous third party studies have already identified some of the optimal concentrations of substances for the growth of the cells. This study was completed in order to see how environmental factors and other substances that have not been studied affect growth. Although the results show that the EPA yield is not affected by the PI metal solution, it was only tested at two different levels. On reflection, more levels of PI metal solution could have been tested to give a more representative set of results. As the temperature varied widely, the temperatures should be kept within a limited range if the experiment is repeated. This should increase the accuracy of the results. pH was found to influence the growth of the cells and so must be carefully monitored throughout experiments. If the experiment were to be repeated, more focus should be placed on varying the influential factors identified in this study. This would allow a more in-depth study of the optimal conditions.
References
1. BHF Cardiovascular Disease Statistics – U.K. Factsheet. BHF estimate based on GP patient data and latest UK health surveys with CVD fieldwork. Updated 16 August 2017; cited December 2017.
2. Cancer Research UK Data and Statistics. London: Cancer Research UK; 2014.
Cancer mortality rates; updated 2014; cited December 2017. Available from: http://www.cancerresearchuk.org/health-professional/cancer-statistics/mortality.
3. NHS Choices Health A-Z. U.K.: NHS Choices; 2015. Arrhythmia; updated July 2015; cited December 2017. Available from: https://www.nhs.uk/conditions/arrhythmia/
4. Bang HO, Dyerberg J. Lipid metabolism and ischemic heart disease in Greenland Eskimos. Draper HH, editor. Advances in nutrition research. 1980. p. 1-22.
5. Goodstine SL, Zheng T, Holford TR, Ward BA, Carter D, Owens PH, Mayne ST. Dietary (n-3)/(n-6) fatty acid ratio: possible relationship to premenopausal but not postmenopausal breast cancer risk in U.S. women. J Nutr. 2003;133:1409–1414.
6. Leaf A, Kang JX, Xiao YF, Billman GE. Clinical prevention of sudden cardiac death by n-3 polyunsaturated fatty acids and mechanism of prevention of arrhythmias by n-3 fish oils. Circulation. 2003;107:2646–2652.
7. Siriwardhana N, Kalupahana NS, Moustaid-Moussa N. Health benefits of n-3 polyunsaturated fatty acids: eicosapentaenoic acid and docosahexaenoic acid. Adv Food Nutr Res. 2012;65:211-22.
8. Wen Z. A high yield and productivity strategy for eicosapentaenoic acid production by the diatom Nitzschia laevis in heterotrophic culture (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. 2001.
9. Freedman D, Diaconis P. On the histogram as a density estimator: L2 theory. Probability Theory and Related Fields. Heidelberg: Springer Berlin. 1981;57(4):453–476.