Statistical Analysis of Different Methods Used to Optimise EPA Yield

Elena Peach

Research Centre

College of Engineering

Swansea University

Swansea UK

Abstract – Fatty acids have long been known, by both consumers and producers, to help prevent many common diseases. However, the best way to produce these beneficial acids commercially is still poorly understood. This study models the effect of different factors on the yield of EPA (eicosapentaenoic acid), a beneficial fatty acid. Statistical analysis found that the PI metal solution has no statistically significant effect on the overall EPA yield, whereas temperature and pH do influence it. The NaCl concentration was not found to affect the EPA yield. Further experimentation is required to determine the optimal value of each factor.

Keywords – EPA yield, Statistical analysis, Dataset,

Cells, Factors, Significance, Effect.

I. Introduction

Cardiovascular diseases and cancers are among the most common illnesses in the U.K. Over 150,000 people die from cancer each year [2] and an estimated 7 million people currently live with some form of cardiovascular disease [1]. Another highly prevalent condition in the U.K. is arrhythmia (heart rhythm problems), which more than 2 million people experience annually [3]. These diseases can seriously affect all aspects of a patient’s life, limiting physical ability, mental state, relationships and home living. The large number of cases also places considerable strain on health services. As a result, treatment and prevention techniques are in high demand from both patients and caregivers, leading to increased interest from researchers.

Several previous studies [4-6] have shown that the ω-3 polyunsaturated fatty acids EPA (eicosapentaenoic acid) and DHA (docosahexaenoic acid) help in the prevention of the aforementioned diseases. A primary source of these fatty acids is fish oil, which is also the current commercial source for production. However, this form of production has limitations [7], so a new method is needed.

This study uses the experimental data published by the University of Hong Kong [8] on the growth of Nitzschia laevis. The data were collected by subculturing cells in a medium (with the current optimum growth contents determined through previous studies), with precautions taken to avoid precipitation. Statistical analysis is applied first to data for EPA yields at different levels of PI metal solution, and then to data for EPA yields where both a medium component and environmental factors are varied. The study aims to determine how best to increase EPA yield in production by establishing, with statistical evidence, what effects different factors have on cell growth.

II. The Paper

A. Abbreviations and Acronyms

EPA      Eicosapentaenoic acid
DHA      Docosahexaenoic acid
pH       Power of Hydrogen
NaCl     Sodium Chloride
CaCl2    Calcium Chloride
PI       Propidium Iodide
IQR      Interquartile range
SD       Standard Deviation
P-Value  Probability value

B. Units

· mg/L   Milligrams per litre
· mL/L   Millilitres per litre
· g/L    Grams per litre
· °C     Degrees Celsius
· pH     A logarithmic value, so unitless

III. Materials and Methods

A. Experiment 1

Experiment 1 was completed to identify whether the PI metal solution had an effect on the EPA yield. Only one variable, the PI value, was changed. The other variables were held constant, with NaCl fixed at 16 g/L, CaCl2 at 0.204 g/L, temperature at 22 °C and pH at 7.5. Twenty tests were carried out at a PI value of 13.5 mL/L and a separate 20 at a PI value of 4.5 mL/L. The EPA yield of each sample was measured for comparison.

B. Experiment 2

This experiment was larger than the first, consisting of 27 tests. Here the EPA yield of the cells was measured to study the effects of a medium component (NaCl) along with temperature and pH.

C. Statistical Analysis

Quantitative results are presented to two decimal places. Hypothesis tests used an alpha level of 0.05, with p-values <0.05 considered statistically significant. An alpha level of 0.1 was used for data modelling.
IV. Results and Discussion
A. Experiment 1, Data comparison
Once the datasets at each PI level had been obtained, the first step was to determine whether any major differences occurred between the results. Descriptive statistics allowed quick, simple comparisons.
Table 1.1, showing data descriptions at each PI metal solution

Data description       EPA yield in mg/L    EPA yield in mg/L
                       PI = 4.5 mL/L        PI = 13.5 mL/L
Mean                   217.17               219.03
Median                 217.50               220.15
Standard deviation     3.82                 5.20
Lower quartile         214.10               215.20
Upper quartile         219.85               222.75
Interquartile range    5.75                 7.55

Table 1.1 shows a relatively small difference between the means of the two datasets (1.86 mg/L), and the same holds for the medians (2.65 mg/L). This suggests that the average EPA yield is not greatly affected by a change in PI metal solution, so this is taken as the null hypothesis for further testing. The descriptive values at PI 13.5 mL/L (dataset 2) are higher overall than those at PI 4.5 mL/L (dataset 1); in particular, the SD and IQR are noticeably larger. The SD expresses how far the data points typically lie from the mean, so the larger value indicates that the EPA yields at PI 13.5 mL/L are more widely spread than at PI 4.5 mL/L.
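
As an illustration of how the quantities in table 1.1 can be obtained, the sketch below computes them with Python's standard library. The function name `describe` and the example yields are hypothetical and are not the study data. Quartile conventions also differ between software packages, so values may not match another tool exactly.

```python
import statistics

def describe(yields):
    """Return the summary statistics listed in table 1.1 for one sample."""
    # "inclusive" treats the sample minimum and maximum as the 0% and 100% points.
    q1, _, q3 = statistics.quantiles(yields, n=4, method="inclusive")
    return {
        "mean": statistics.mean(yields),
        "median": statistics.median(yields),
        "sd": statistics.stdev(yields),  # sample SD (n - 1 denominator)
        "lower_quartile": q1,
        "upper_quartile": q3,
        "iqr": q3 - q1,
    }

# Hypothetical EPA yields in mg/L, for illustration only.
example = [212.0, 215.5, 217.0, 218.5, 222.0]
summary = describe(example)
print({name: round(value, 2) for name, value in summary.items()})
```

Applying the same function to each dataset would give one column of the table per call.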
To represent the spread of each dataset visually and analyze the results further, boxplots (figure 1) were created.
Figure 1, Boxplots of each dataset.
Figure 1 confirms that dataset 2 covers a larger range of values, as seen by the larger size of the second plot. The shape of each distribution was also inferred from the plots, with dataset 1 showing a symmetrical distribution. Dataset 2 can be argued to have a slight negative skew, but this must be investigated further before being taken as the final distribution.
Absolute frequency histograms (figure 2.1, 2.2) were plotted to
further evaluate the distribution of each dataset.
Figure 2.1 – Absolute
frequency histogram of PI metal solution 4.5 mL/L.
Figure 2.2 – Absolute
frequency histogram of PI metal solution 13.5 mL/L.
The appropriate number of bins to best visualize each distribution was calculated using the Freedman-Diaconis rule [9]:
Bin size = $2 \cdot \mathrm{IQR} / n^{1/3}$
where IQR is the interquartile range and n is the number of observations in the sample.
Rounded to the nearest whole number, dataset 1 needed 4 bins whereas dataset 2 required 6.
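
A minimal sketch of the Freedman-Diaconis calculation is given here, using the IQR values from table 1.1 and n = 20. The function names are hypothetical. The range of each dataset is not reported in this paper, so `data_range` is an assumed input and the bin counts above cannot be reproduced from the table alone. The sketch rounds the bin count up, which is one common convention.

```python
import math

def fd_bin_width(iqr, n):
    """Freedman-Diaconis bin width: 2 * IQR / n^(1/3)."""
    return 2.0 * iqr / n ** (1.0 / 3.0)

def fd_bin_count(data_range, iqr, n):
    """Number of bins needed to cover data_range (max - min), rounded up."""
    return math.ceil(data_range / fd_bin_width(iqr, n))

# IQR values taken from table 1.1; each dataset has 20 observations.
width_1 = fd_bin_width(5.75, 20)
width_2 = fd_bin_width(7.55, 20)
print(round(width_1, 2), round(width_2, 2))
```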
Figure 2.1 reiterates that dataset 1 follows a symmetrical distribution. Symmetrical histograms come from datasets whose mean and median are approximately equal, which is confirmed in table 1.1. To check the boxplot's suggestion that dataset 2 is negatively skewed, the shape of its histogram was studied. The histogram in figure 2.2 shows a clear left-sided skew, with a steady frequency for 3 bins before an increase occurs. A left skew suggests that the mean of the data is less than the median, which is true for dataset 2 as shown in table 1.1.
Plotting relative frequencies gave the same results concerning the shape of the distribution (the sample size of each dataset was the same).
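
The comparison of mean and median can also be put on a numerical scale. The sketch below uses Pearson's second skewness coefficient, 3 × (mean − median) / SD, with the values from table 1.1. This coefficient was not part of the original analysis and is offered only as a possible cross-check on the visual reading of the plots.

```python
def pearson_median_skewness(mean, median, sd):
    """Pearson's second skewness coefficient: 3 * (mean - median) / SD."""
    return 3.0 * (mean - median) / sd

# Summary values from table 1.1 (mg/L).
skew_1 = pearson_median_skewness(217.17, 217.50, 3.82)  # PI = 4.5 mL/L
skew_2 = pearson_median_skewness(219.03, 220.15, 5.20)  # PI = 13.5 mL/L
print(round(skew_1, 2), round(skew_2, 2))
```

Both coefficients are negative, and the one for dataset 2 is larger in magnitude, which agrees with the stronger left skew seen in figure 2.2.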
B. Experiment 1, Claim test
As identified from comparing the data, there is not a large difference between the medians of the two datasets (the claim). Hypothesis testing was completed to test the statistical significance of this claim.
For the purpose of the hypothesis test, no assumptions on the distribution of the data were made. As the sample size was small (<30), a non-parametric test was deemed most appropriate. The test was two-sample and two-tailed and, because of its non-parametric nature, a sign test was used.
The null hypothesis of the test matched the claim:
H0: M1 = M2
That is, the median of dataset 1 is statistically the same as the median of dataset 2.
The alternate hypothesis would suggest that PI metal level had a significant effect on EPA yield:
H1: M1 ≠ M2
For the test an alpha level of 0.05 was chosen.
The results gave a p-value of 0.8238. As this is >0.05, we fail to reject the null hypothesis, as also indicated by the h value of 0. The difference between the medians of the two datasets can therefore be taken as statistically insignificant. This suggests that the PI metal solution does not have an effect on the EPA yield.
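
For reference, a two-sided exact sign test can be sketched as follows. The sign test compares observations pair by pair, so the sketch assumes the 20 yields of each dataset are matched in order; the function names are hypothetical. The split of positive and negative differences is not reported in this paper, but a split of 9 against 11 across 20 pairs gives a p-value of 0.8238, which is consistent with the value quoted above.

```python
from math import comb

def count_signs(sample_a, sample_b):
    """Count positive and negative paired differences (ties are ignored)."""
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    return sum(d > 0 for d in diffs), sum(d < 0 for d in diffs)

def sign_test_p_value(n_positive, n_negative):
    """Two-sided exact sign test p-value under H0: P(positive) = 0.5."""
    n = n_positive + n_negative
    k = min(n_positive, n_negative)
    # Probability of k or fewer of the rarer sign under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Assumed split of signs (not reported in the paper).
print(round(sign_test_p_value(9, 11), 4))
```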

To further assess the decision not to reject the null hypothesis, parametric testing was also completed. Before this could be done, assumptions about the distribution had to be made.

To determine how the data were distributed, a normal probability plot of each dataset was made (figures 3 and 4).

Figure 3, Normal probability plot of dataset 1

The points of dataset 1 lie close to the line and show no systematic trend of their own, so they can be said to be linear. The data can therefore be said to be compatible with a normal distribution.

Figure 4, Normal probability plot of dataset 2.

Despite the boxplot and histogram of dataset 2 showing a left skew, its normal probability plot (figure 4) suggests it is normally distributed. This difference between the representations is likely because the skewness is only slight and not large enough to be observed on a normal probability plot. Although the points do not sit directly on the red line, they show no trend of their own, so the shape can be concluded to be linear. For the purpose of confirming the failure to reject the null hypothesis, dataset 2 will be assumed to be normally distributed.

QQ plots (figure 5) were also used to confirm the distribution of the data. For both datasets the points lie close to the line and show no trend of their own, so the data can be said to be normally distributed.

Figure 5, QQ plots of the datasets.

The distribution of the data was then assumed to be normal. However, as the data are unpaired, the variances of the two datasets must be shown to be statistically similar before a parametric Student's t-test can be conducted. The standard deviations in table 1.1 show that the sample variances differ. Despite this, a test was conducted to assess the statistical significance of the difference between the variances.

The hypotheses were set as:

H0: The datasets have normal distributions and similar variances.

H1: The datasets have unequal variances.

The test produced an h value of 0, so we failed to reject the null hypothesis and a Student's t-test was deemed appropriate.
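
The variance test used is not named in the text. Assuming it was a two-sample F-test, the sketch below shows the calculation from the standard deviations in table 1.1. The critical value of about 2.53 for the upper 2.5% point of F(19, 19) is quoted from standard tables and is an assumption that should be checked before reuse.

```python
def variance_ratio(sd_a, sd_b):
    """F statistic for comparing two variances, larger variance on top."""
    var_a, var_b = sd_a ** 2, sd_b ** 2
    return max(var_a, var_b) / min(var_a, var_b)

# ASSUMPTION: approximate upper 2.5% point of F with (19, 19) degrees of
# freedom, giving a two-sided test at alpha = 0.05. Verify against tables.
F_CRITICAL = 2.53

# Standard deviations from table 1.1 (n = 20 in each dataset).
f_stat = variance_ratio(3.82, 5.20)
reject_equal_variances = f_stat > F_CRITICAL
print(round(f_stat, 3), reject_equal_variances)
```

Under these assumptions the ratio falls below the critical value, which agrees with the h value of 0 reported above.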

In the t-test the means of the datasets were compared:

H0: μ1 = μ2, where μ1 and μ2 are the means of datasets 1 and 2.

The alternate hypothesis would suggest that PI metal level had an effect on the mean EPA yield:

H1: μ1 ≠ μ2

The t-test gave a p-value of 0.2359 (>0.05) and an h value of 0. Once again we fail to reject the null hypothesis, so the means of the two datasets are not statistically different. As with the medians, the similarity in means suggests that the PI metal solution does not have an effect on EPA yield.

The test also gives a confidence interval for the difference in means of -5.04 to 1.32. As 0 lies within this interval, it confirms that we fail to reject the null hypothesis.
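
The structure of the pooled-variance t-test can be sketched from the summary values in table 1.1. The function name is hypothetical, and the critical value 2.024 is the two-tailed 5% point of Student's t with 38 degrees of freedom from standard tables. Because the table values are rounded, a statistic recomputed this way will not exactly reproduce the p-value and interval reported above, which came from the full datasets.

```python
import math

def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two-sample t statistic assuming equal variances (pooled)."""
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    std_error = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean_a - mean_b) / std_error, df, std_error

T_CRITICAL = 2.024  # two-tailed, alpha = 0.05, 38 degrees of freedom

# Means and standard deviations from table 1.1.
t_stat, df, std_error = pooled_t(217.17, 3.82, 20, 219.03, 5.20, 20)
reject_equal_means = abs(t_stat) > T_CRITICAL
print(round(t_stat, 3), df, reject_equal_means)
```

The magnitude of the statistic is below the critical value, so this check leads to the same decision as the reported test.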

In conclusion, the data show no statistically significant evidence that changing the PI metal solution affects the average EPA yield from the growth of the cells.

C. Experiment 2, surface model

Experiment 2 investigates which factors influence the response of the cells in producing EPA. A boxplot of the data was plotted to show the range of the results and check for any outliers (figure 6). One outlier was identified at sample 26, so it was removed from the data before any further testing was conducted. We dropped the outlier as it did not majorly affect the results but would have affected assumptions.

Figure 6, Boxplot of the data of experiment 2
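
Boxplots usually flag points lying more than 1.5 × IQR beyond the quartiles. The sketch below applies that rule, assuming it is the one behind figure 6. The function name and the example yields are hypothetical and are not the experiment 2 data.

```python
import statistics

def boxplot_outliers(values, whisker=1.5):
    """Return (sample number, value) pairs outside the boxplot whiskers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    # Sample numbers start at 1 to match how samples are referred to in the text.
    return [(i, v) for i, v in enumerate(values, start=1) if v < low or v > high]

# Hypothetical EPA yields in mg/L, for illustration only.
yields = [150.0, 160.0, 155.0, 158.0, 162.0, 157.0, 240.0]
flagged = boxplot_outliers(yields)
print(flagged)
```

Returning the sample number alongside the value makes it straightforward to drop the flagged sample before fitting a model.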

The technique of multiple least squares was adopted to model the optimal operating conditions of the experiment (figure 7). It was important to fit the correct model to the data; in this case the below equation was used:

Where Y gives the EPA yield, and X1, X2 and X3 correspond to the pH, NaCl and temperature values respectively.

Figure 7, linear regression model

The model (figure 7) calculates an equation that minimizes the sum of squared distances between the data points and the fitted line (the least squares method). B0 represents the y intercept of the graph. B1 gives the slope of the line relating its variable to the EPA yield.
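
To show what a multiple least squares fit involves, the sketch below solves the normal equations for a model with an intercept and one slope per factor. It assumes a first-order model in pH, NaCl and temperature purely for illustration, and the settings and response are hypothetical; the terms of the model actually fitted in this study may differ.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def fit_least_squares(rows, y):
    """Return [B0, B1, ...] minimizing the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 gives the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical settings: (pH, NaCl in g/L, temperature in degrees C).
rows = [(7.0, 10.0, 18.0), (8.0, 10.0, 22.0), (7.0, 20.0, 26.0),
        (8.5, 15.0, 18.0), (7.5, 12.0, 24.0), (8.0, 18.0, 20.0)]
# Hypothetical response built from known coefficients so the fit can be checked.
y = [5.0 + 2.0 * ph - 0.5 * nacl + 1.5 * temp for ph, nacl, temp in rows]
coeffs = fit_least_squares(rows, y)
print([round(c, 3) for c in coeffs])
```

Because the example response is exactly linear, the fit recovers the coefficients used to build it. With real data the residuals would be non-zero and would feed the checks described in the rest of this section.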

The model produces an adjusted R2 value of 0.77, a statistical measure of how closely the points lie to the regression line that takes direct account of the number of predictors. As this value is relatively close to 1, it can be said that the model fits the data quite well.
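
The adjustment for the number of predictors follows the standard formula, adjusted R2 = 1 − (1 − R2)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. In the sketch below the R2 value of 0.8 and the predictor counts are hypothetical; n = 26 corresponds to the 27 tests with one outlier removed.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical R^2 of 0.8 with 26 observations.
print(round(adjusted_r_squared(0.8, 26, 3), 4))
print(round(adjusted_r_squared(0.8, 26, 9), 4))
```

For the same R2, a model with more predictors receives a lower adjusted value, which is why the adjusted figure is preferred when comparing models of different size.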

Once the model was produced, it was important to check that this type of model is appropriate. Linear modelling is only appropriate for normally distributed data, so this was checked (figure 8).

Figure 8, Normal probability plot of experiment 2

Similarly to experiment 1, the points lie close to the line, so the data can be said to be normally distributed and the median EPA yield can be taken to be approximately the same as the mean.

Any bias present in the model must also be checked, bias being a systematic over- or under-estimate by the model. The bias of the model was checked through visual inspection of a residual vs predicted plot (figure 9).

Figure 9, Predicted vs Residuals of experiment 2

The plot (figure 9) shows a large amount of disorder in the data: the residuals are scattered without any substantial trend along the horizontal axis. Due to this, it can be said that the assumptions of the linear model hold and the model is not biased.

The overall accuracy of the model was also checked with an actual vs fitted plot.

Figure 10, showing Actual vs fitted points of data from experiment 2

Figure 10 shows the points sit close to the 45 degree line, so the model can be said to be a ‘good fit’.

D. Experiment 2, Refining the model

To refine the model and give the data the best possible fit against the regression line (shown by an increased adjusted R2 value), statistically insignificant variables had to be discarded. The alpha value was set at 0.1 and each variable's p-value was checked against it. Variables with a p-value <0.1 were statistically significant and only these were carried through to the refined model. The intercept remains in the model even if it is found to be statistically insignificant, as the model must always intercept the y axis at some point. In this model only 3 variables were found to be statistically significant and were carried forward.
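The selection rule described above can be written as a short filter. The term names and p-values below are hypothetical and do not come from the fitted model; only the alpha level of 0.1 and the rule of always keeping the intercept are taken from the text.

```python
ALPHA = 0.1  # alpha level used for data modelling

def select_terms(p_values, alpha=ALPHA, always_keep=("intercept",)):
    """Keep terms with p < alpha, plus any term that must always stay."""
    return [name for name, p in p_values.items()
            if name in always_keep or p < alpha]

# Hypothetical p-values, for illustration only.
p_values = {
    "intercept": 0.40,
    "pH": 0.02,
    "NaCl": 0.55,
    "temperature": 0.001,
    "pH^2": 0.08,
}
kept = select_terms(p_values)
print(kept)
```

After filtering, the model would be refitted with only the retained terms and its diagnostic plots checked again.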
Figure 11, refined model
The above figure (11) shows a simplified model of the data, excluding any variables that did not have a significant effect on the EPA yield.
Similar to the original model, plots were made to test the accuracy and bias of the model.
Figure 12, Normal probability plot of the simplified model
Figure 14, Actual vs Predicted plot of data for model 2
The same conclusions can be drawn from figures 12, 13 and 14 as for the first model. The plots show the new model is accurate, not biased and a good fit.
A simplified model would be expected to show an increased adjusted R2 value, yet in this case it decreases. As the decrease is very small (0.028), it is assumed to be due to an error in the data collection.
Analysis of the data collection and results suggests that the large changes in temperature may be the source of the error. During data collection the temperature varied by a relatively large amount (8 degrees), and it may also have been difficult to keep the temperature constant throughout the experiment, leading to changes in results.
Figure 15, multiple regression surface model
A response
surface model of the simplified model was created (figure 15). It plots the temperature and pH against EPA
yield as these were the only factors found to have an effect on the yield. NaCl was not found to influence EPA yield and
is therefore not included in the model.
The response surface model can be used to predict and optimize the
yield.
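One simple way to use a fitted surface for optimization is a grid search over the two influential factors. In the sketch below the quadratic surface, its coefficients and the search ranges are entirely hypothetical stand-ins for the fitted model, which is not reproduced in the text.

```python
def predicted_yield(ph, temp):
    """HYPOTHETICAL response surface; replace with the fitted model."""
    return 200.0 - 4.0 * (ph - 8.0) ** 2 - 0.5 * (temp - 20.0) ** 2

def grid_search(ph_values, temp_values):
    """Return (best predicted yield, pH, temperature) over the grid."""
    return max((predicted_yield(p, t), p, t)
               for p in ph_values for t in temp_values)

# Hypothetical search ranges: pH 7.0 to 9.0, temperature 16 to 24 degrees C.
ph_grid = [7.0 + 0.1 * i for i in range(21)]
temp_grid = [16.0 + 0.5 * i for i in range(17)]
best_yield, best_ph, best_temp = grid_search(ph_grid, temp_grid)
print(round(best_yield, 2), round(best_ph, 1), round(best_temp, 1))
```

Any optimum found this way is only a prediction within the tested ranges and would need confirming by further experiment.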
E. Conclusion
Statistical analysis of experiment 1 showed that the EPA yield is not significantly affected by the PI metal solution at the two levels tested. As it is not a factor that needs to be optimized, the cheapest level of PI metal solution should be used for the production of EPA.
On the other hand, analysis of experiment 2 showed that both temperature and pH have a significant effect on the yield. Both parameters should be held at their optimal values (taken from the surface model) throughout production so that the greatest yield is produced most efficiently.
Similarly to the PI metal solution, the NaCl level was not found to have a significant effect on the yield, so again the cheapest concentration should be used.
V. Reflection
Previous third-party studies have already identified some of the optimal concentrations of substances for the growth of the cells. This study was completed to see how environmental factors and other substances that have not been studied affect growth. Although the results show that the EPA yield is not affected by the PI metal solution, it was only tested at two levels. On reflection, more levels of PI metal solution could have been tested to give a more representative set of results.
As the temperature varied widely, on repeating the experiment the temperatures should be kept within a limited range. This should increase the accuracy of the results. pH was found to influence the growth of the cells and so must be carefully monitored throughout experiments. If the experiment were to be repeated, more focus should be placed on altering the influencing factors identified in this study. This would allow a more in-depth study of the optimal conditions.
References
1. BHF Cardiovascular Disease Statistics – U.K. Factsheet. BHF estimate based on GP patient data and latest UK health surveys with CVD fieldwork. Updated 16 August 2017; cited December 2017.
2. Cancer Research UK Data and Statistics. London: Cancer Research UK; 2014. Cancer mortality rates; updated 2014; cited December 2017. Available from: http://www.cancerresearchuk.org/health-professional/cancer-statistics/mortality
3. NHS Choices Health A-Z. U.K.: NHS Choices; 2015. Arrhythmia; updated July 2015; cited December 2017. Available from: https://www.nhs.uk/conditions/arrhythmia/
4. Bang HO, Dyerberg J. Lipid metabolism and ischemic heart disease in Greenland Eskimos. Draper HH, editor. Advances in nutrition research. 1980. p. 1-22.
5. Goodstine SL, Zheng T, Holford TR, Ward BA, Carter D, Owens PH, Mayne ST. Dietary (n-3)/(n-6) fatty acid ratio: possible relationship to premenopausal but not postmenopausal breast cancer risk in U.S. women. J Nutr. 2003;133:1409–1414.
6. Leaf A, Kang JX, Xiao YF, Billman GE. Clinical prevention of sudden cardiac death by n-3 polyunsaturated fatty acids and mechanism of prevention of arrhythmias by n-3 fish oils. Circulation. 2003;107:2646–2652.
7. Siriwardhana N, Kalupahana NS, Moustaid-Moussa N. Health benefits of n-3 polyunsaturated fatty acids: eicosapentaenoic acid and docosahexaenoic acid. Adv Food Nutr Res. 2012;65:211-22.
8. Wen Z. A high yield and productivity strategy for eicosapentaenoic acid production by the diatom Nitzschia laevis in heterotrophic culture. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. 2001.
9. Freedman D, Diaconis P. On the histogram as a density estimator: L2 theory. Probability Theory and Related Fields. Heidelberg: Springer Berlin. 1981;57(4):453–476.