Statistical Analysis of Different Methods
Used to Optimise EPA Yield
College of Engineering
Abstract— Fatty acids have been known to help in the prevention of many
common diseases, a benefit recognised by both consumers and producers for some time. However, the best way to
produce these beneficial acids commercially is still poorly understood. This study models the effect of different
factors on the yield of EPA (eicosapentaenoic acid), a beneficial fatty acid. Through
statistical analysis it was concluded that the PI metal solution has no effect
on the overall EPA yield. On the other
hand, the temperature and pH values do influence the EPA yield. The NaCl concentration was not found to
affect the EPA yield. Further
experimentation is required to determine the optimal values of each factor.
Keywords – EPA yield, Statistical analysis, Dataset,
Cells, Factors, Significance, Effect.
Cardiovascular diseases and cancers are among
the most common illnesses in the U.K.
Over 150,000 people die from cancer each year [1] and an
estimated 7 million people currently live with some form of cardiovascular
disease [2]. Another highly prevalent
condition in the U.K. is arrhythmia (heart rhythm problems), which more than 2 million people
experience annually [3]. These diseases can have serious effects on
all aspects of a patient's life, limiting their physical ability, mental state,
relationships and home life. The large
number of cases also places considerable strain on health
services. As a result, treatment and prevention techniques are in high demand from
both patients and caregivers, leading to increased interest from researchers.
Several previous studies [4-6]
have shown that the ω-3
polyunsaturated fatty acids EPA (eicosapentaenoic acid) and DHA
(docosahexaenoic acid) help to prevent the aforementioned
diseases. A primary source of these
fatty acids is fish oil, which is also the current commercial source for
production. However, this form of
production has limitations [7] and so a new method is needed.
This study draws on
the experimental data published by the University of Hong Kong [8]
on the growth of Nitzschia laevis.
The data were collected by subculturing cells in a medium (with the current
optimum growth contents determined through previous studies), with precautions
taken to avoid precipitation. Statistical analysis is applied
first to data for EPA yields at different levels of PI metal solution, and
then to data for EPA yields where both the
medium components and environmental factors are varied. This study
aims to determine how best to increase EPA yield in production by establishing,
with statistical evidence, what effects different factors have on the cells' growth.
A. Abbreviations and Acronyms
CaCl2    Calcium chloride
mg/L     Milligrams per litre
mL/L     Millilitres per litre
g/L      Grams per litre
°C       Degrees Celsius
pH       A logarithmic value and so unitless
Materials and Methods
A. Experiment 1
Experiment 1 was carried out to identify whether the PI metal
solution had an effect on the EPA yield.
The experiment consisted of changing one variable, the PI value. Other variables were held constant, with NaCl
fixed at 16 g/L, CaCl2 at 0.204 g/L, temperature at 22 °C
and pH at 7.5. Twenty tests were
carried out at a PI value of 13.5 mL/L and a separate twenty at a PI
value of 4.5 mL/L. The EPA yield of
each sample was measured for comparison.
B. Experiment 2
This
experiment was larger than the first, consisting of 27 tests. Here the EPA yield of the cells was measured
to study the effects of the medium component (NaCl) along with temperature and pH.
C. Statistical Analysis
Quantitative results are presented to two
decimal places. Values compared using hypothesis testing assume an alpha level of
0.05. Where the p-value is greater than 0.05 we fail to reject the null
hypothesis, supported by the h value of 0.
The difference between the medians of each dataset can therefore be
taken as statistically insignificant.
This suggests that the PI metal solution does not have an effect on the
EPA yield. To assess the claim that we fail to reject the null, further testing was
completed. Parametric testing allowed
this, but before it could be completed assumptions about the distribution had to
be checked.
To determine how
the data were distributed, a normal probability plot of each dataset was made (figures 3 and 4).
Figure 3, Normal probability plot of dataset 1
As the points of dataset 1 follow the line closely and do not follow their own
trend, they can be said to be linear.
The data can therefore be said to be compatible with a normal distribution.
Figure 4, Normal probability plot of dataset 2.
Despite the box and histogram plots of dataset 2 showing a left skew, its normal
probability plot (figure 4) suggests it is normally distributed. This difference between the representations of the
distribution is likely because the skewness is only slight and not
large enough to be observed on a normal probability plot. Although the points do not sit directly on
the red line, they do not follow their own trend and so the shape can be
concluded to be linear. For the purpose
of confirming that we fail to reject the null hypothesis, dataset 2 will be assumed to
be normally distributed.
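The coordinates behind a normal probability plot can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the plotting positions (i - 0.5)/n are one common convention and are an assumption here, the sample values are made up, and the correlation of the plotted points is offered only as a rough numerical companion to the visual check of linearity.

```python
from statistics import NormalDist, mean


def normal_plot_points(sample):
    """Return (theoretical quantile, ordered value) pairs for a normal probability plot."""
    ordered = sorted(sample)
    n = len(ordered)
    # Plotting positions (i - 0.5) / n: one common convention, assumed here.
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    quantiles = [NormalDist().inv_cdf(p) for p in probs]
    return list(zip(quantiles, ordered))


def linearity(points):
    """Pearson correlation of the plotted points; values near 1 indicate a near-straight line."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Made-up illustrative values, not the study's measurements.
example = [10.2, 11.5, 9.8, 12.1, 10.9, 11.0, 10.4, 11.8, 9.5, 10.7]
points = normal_plot_points(example)
r = linearity(points)
```

Points that curve away from the line at one end, as a skewed sample would, lower this correlation, although a slight skew may barely change it.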
QQ plots (figure 5) were also used to confirm the distribution of the data. For both datasets the points lie close to the
line and do not follow their own trend, so the data can be said to be normally
distributed.
Figure 5, QQ plots of the datasets.
The distribution of the data was then assumed to be normal. However, as the data are unpaired, before a
parametric Student's t-test can be conducted the variances of each dataset must
be shown to be statistically similar.
Table 1 shows that the actual values of the variances differ.
Despite this, a test was conducted to assess the statistical significance of the
difference between the variances.
The following hypotheses were set:
H0: The datasets have normal distributions and similar variances.
H1: The datasets have unequal variances.
The test produced an h value of 0 and so we failed to reject the null, and a Student's
t-test was deemed appropriate.
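The quantities behind these two steps can be sketched as below. The text does not name the variance test that was run, so the variance ratio (F statistic) here is an assumption, and the input values are made up for illustration. The pooled statistic is the standard equal-variance two-sample form, which is what the similar-variances check is meant to justify.

```python
from math import sqrt
from statistics import mean, variance


def variance_ratio(a, b):
    """F statistic: the larger sample variance divided by the smaller."""
    va, vb = variance(a), variance(b)
    return max(va, vb) / min(va, vb)


def pooled_t(a, b):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled variance weights each sample variance by its degrees of freedom.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    se = sqrt(sp2 * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / se, df


# Made-up illustrative samples, not the study's EPA yields.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 3.0, 4.0, 5.0, 6.0]
f_stat = variance_ratio(group_a, group_b)
t_stat, dof = pooled_t(group_a, group_b)
```

Turning either statistic into a p-value requires the F or t distribution, which a statistics package would normally supply.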
Using the t-test, the means of the datasets were compared. The
alternate hypothesis would suggest that the PI metal level had an effect on the EPA
yield. The t-test gave a p-value of 0.2359 (>0.05) and an h value of
0. Once again this means we fail to
reject the null hypothesis, and so the means of the two datasets are not statistically
different. As with the medians,
the similarity in means suggests the PI metal solution does not have an effect on
the EPA yield. The test also
gives a confidence interval of -5.04 to 1.32; as 0 lies within this interval it
confirms that we fail to reject the null hypothesis.
Overall, the data can be said to show that changing the PI metal solution does not
affect the average EPA yield from the growth of the cells.
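The decision logic applied to the t-test output can be written out in a few lines. This sketch assumes the h value is a reject flag (1 when the null hypothesis is rejected, 0 otherwise); the function names are illustrative, while the p-value and interval are those reported above.

```python
def reject_null(p_value, alpha=0.05):
    """Reject the null hypothesis only when the p-value falls below alpha."""
    return p_value < alpha


def interval_contains_zero(lower, upper):
    """A confidence interval for a difference that spans 0 is consistent with no difference."""
    return lower <= 0.0 <= upper


p_value = 0.2359      # reported t-test p-value
ci = (-5.04, 1.32)    # reported confidence interval for the difference in means

h = int(reject_null(p_value))          # assumed meaning of h: 1 = reject, 0 = fail to reject
spans_zero = interval_contains_zero(*ci)
```

Both checks point the same way here: h is 0 and the interval spans 0, so the null hypothesis is not rejected.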
C. Experiment 2, surface model
Experiment 2 investigates what factors influence the response of the cells in producing
EPA. A boxplot of the data was plotted
to show the range of the results and check for any outliers (figure 6). One outlier was identified at sample 26 and
so it was removed from the data before any further testing was conducted. The
outlier was dropped as it did not majorly affect
the results but would have affected the assumptions.
Figure 6, Boxplot of the data of experiment 2
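A boxplot usually flags outliers with a whisker rule, and a minimal sketch of one such rule is given below. The 1.5 x IQR fences and the quartile method are assumptions (the text does not state which rule the plotting software applied), and the values are made up rather than taken from experiment 2.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (sample number, value) pairs lying outside the k * IQR fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    # Sample numbers start at 1 to match how samples are referred to in the text.
    return [(i, v) for i, v in enumerate(values, start=1) if v < low or v > high]


# Made-up illustrative yields; the last value is deliberately extreme.
yields = [10, 11, 12, 13, 14, 15, 16, 17, 40]
flagged = iqr_outliers(yields)
cleaned = [v for i, v in enumerate(yields, start=1) if i not in {n for n, _ in flagged}]
```

Fitting the model with and without the flagged sample is a simple way to check that dropping it does not materially change the results.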
The technique of multiple least squares was
adopted to model the optimal operating conditions of the experiment (figure 7). It was important to fit the correct model to
the data. In the equation used, the response
gives the EPA yield, and X1, X2 and X3 correspond
to the pH, NaCl and temperature values respectively.
Figure 7, linear regression model
The model (figure 7) calculates an equation that minimizes the distance
between the data points and a fitted line (least squares method). B0 represents the y-intercept of the graph. B1
gives the slope of the fitted line with respect to the corresponding predictor.
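As an illustration of the least squares method, the sketch below fits a first-order model, yield = B0 + B1*X1 + B2*X2 + B3*X3, by solving the normal equations. The first-order form is an assumption made for this example; the study's model may include further terms. The data are synthetic, generated from known coefficients so that the fit can be checked, and are not the published measurements.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_linear(rows, y):
    """Least squares coefficients [B0, B1, B2, B3] for rows of (X1, X2, X3)."""
    design = [[1.0, *row] for row in rows]  # leading 1.0 carries the intercept B0
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic design: pH, NaCl (g/L) and temperature at two made-up levels each.
rows = [(ph, nacl, temp)
        for ph in (7.0, 8.0) for nacl in (10.0, 20.0) for temp in (18.0, 22.0)]
true_b = [2.0, 0.5, -0.1, 0.3]  # hypothetical coefficients used to generate y
y = [true_b[0] + true_b[1] * r[0] + true_b[2] * r[1] + true_b[3] * r[2] for r in rows]
coeffs = fit_linear(rows, y)
```

Because the synthetic responses contain no noise, the fitted coefficients match the generating ones to within rounding; with real data the residuals would be non-zero.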
The model produces an adjusted R2 value of 0.77, a
statistical measure of the distances of the points to the regression line that
takes direct account of the number of predictors. As
this value is relatively close to 1, it can be said that the model fits the data
reasonably well. Once the model was produced, it was important to check that the nature
of the model is appropriate. Linear
modelling is only appropriate for normally distributed data and so this was
checked (figure 8).
Figure 8, Normal probability plot of experiment 2
As in experiment 1, the points lie close to the line and so
the data can be said to be normally distributed. As a normal distribution is
symmetric, the median EPA yield can be taken to be approximately the same as the mean.
Any bias present in the model must also be checked, bias meaning that a
parameter estimate is systematically too high or too low.
The bias of the model was checked through visual inspection of a
residual vs predicted plot (figure 9).
Figure 9, Predicted vs Residuals of experiment 2
The plot (figure
9) shows a large amount of scatter in the residuals, and no substantial
trend can be seen along the horizontal axis. Due to this, it can be said that the assumptions
of the linear model hold and the model is not biased.
The overall accuracy of the model was also checked with an actual
vs fitted plot.
Figure 10, showing Actual vs fitted points of data
from experiment 2
Figure 10 shows that the points lie close to the 45-degree line and
so the model can be said to be a ‘good fit’.
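The fit measures discussed in this section can be computed directly from the actual and fitted values. The sketch below uses the standard definitions of the residuals, R2 and adjusted R2; the numbers are made up for illustration and are not the study's data.

```python
def residuals(actual, fitted):
    """Residual = actual value minus fitted value, as plotted against the predictions."""
    return [a - f for a, f in zip(actual, fitted)]


def r_squared(actual, fitted):
    """Proportion of the variation in the response explained by the model."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum(e ** 2 for e in residuals(actual, fitted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Penalise R2 for the number of predictors p, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Made-up illustrative values with a single predictor.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
fitted = [1.1, 1.9, 3.2, 3.8, 5.0]
r2 = r_squared(actual, fitted)
adj_r2 = adjusted_r_squared(r2, n=len(actual), p=1)
```

Adding a predictor always raises or maintains R2, whereas adjusted R2 rises only if the new predictor improves the fit by more than the penalty, which is why it is the measure to watch when variables are added or removed.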
D. Experiment 2, Refining the model
In order to refine the model and give the
data the best possible fit against the regression line (shown by an increased
adjusted R2 value), statistically insignificant variables had to be discarded.
The alpha value was set to 0.1 and the p-value of each variable was compared
against it. Variables with a p-value