Statistical Analysis of Different Methods Used to Optimise EPA Yield

Elena Peach

Research Centre

College of Engineering

Swansea University

Swansea UK

Abstract— Fatty acids have been known to help in the prevention of many

common diseases, a benefit recognised by both consumers and producers for some time. However, the best way to produce these beneficial acids commercially is still poorly understood. This study models the effect of different factors on the yield of EPA (eicosapentaenoic acid), a beneficial fatty acid. Statistical analysis showed no significant effect of the PI metal solution on the overall EPA yield. Temperature and pH, on the other hand, do influence the EPA yield, whereas the NaCl concentration was not found to affect it. Further experimentation is required to determine the optimal value of each factor.

Keywords – EPA yield, Statistical analysis, Dataset,

Cells, Factors, Significance, Effect.

I. Introduction

Cardiovascular diseases and cancers are among the most common illnesses in the U.K. Over 150,000 people die from cancer each year [1] and an estimated 7 million people currently live with some form of cardiovascular disease [2]. Arrhythmia (heart rhythm problems) is also highly prevalent in the U.K., with more than 2 million people experiencing it annually [3]. These diseases can seriously affect all aspects of a patient’s life, limiting physical ability, mental state, relationships and home life. The large number of cases also places considerable strain on health services. As a result, treatment and prevention techniques are in high demand from both patients and caregivers, which has led to increased interest from researchers.

Several previous studies [4-6] have shown that the ω-3 polyunsaturated fatty acids EPA (eicosapentaenoic acid) and DHA (docosahexaenoic acid) help to prevent the aforementioned diseases. A primary source of these fatty acids is fish oil, which is also the current commercial source for production. However, this form of production has limitations [7], and so a new method is needed.

This study uses the experimental data published by the University of Hong Kong [8] on the growth of Nitzschia laevis. The data were collected by subculturing cells in a medium (whose composition for optimum growth was determined in previous studies), with precautions taken to avoid precipitation. Statistical analysis is applied first to data on EPA yields at different levels of PI metal solution, and then to data on EPA yields where both the medium components and the environmental factors are varied. The study aims to determine how best to increase EPA yield in production by establishing, with statistical evidence, what effects different factors have on the cells’ growth.

II. The Paper

A. Abbreviations and Acronyms

EPA Eicosapentaenoic acid
DHA Docosahexaenoic acid
pH Power of Hydrogen
NaCl Sodium Chloride
CaCl2 Calcium Chloride
PI Propium Iodide
IQR Interquartile range
SD Standard Deviation
P-Value Probability value

B. Units

· mg/L Milligrams per litre
· mL/L Millilitres per litre
· g/L Grams per litre
· °C Degrees Celsius
· pH A logarithmic value and so unitless

III. Materials and Methods

A. Experiment 1

Experiment 1 was carried out to identify whether the PI metal solution had an effect on the EPA yield. Only one variable, the PI value, was changed. The other variables were held constant, with NaCl fixed at 16 g/L, CaCl2 at 0.204 g/L, temperature at 22 °C and pH at 7.5. Twenty tests were carried out at a PI value of 13.5 mL/L and a separate twenty at a PI value of 4.5 mL/L. The EPA yield of each sample was measured for comparison.

B. Experiment 2

This experiment was larger than the first, consisting of 27 tests. Here the EPA yield of the cells was measured to study the effects of the medium component (NaCl) along with temperature and pH.

C. Statistical Analysis

Quantitative results are presented to two

decimal places. Values compared using hypothesis testing assume alpha levels of

0.05 where p-values 0.05 we fail to reject the null

hypothesis, supported by the h value of 0.

The difference between the medians of each dataset can therefore be

taken as statistically insignificant.

This suggests that the PI metal solution does not have an effect on the

EPA yield.

To further assess the failure to reject the null hypothesis, additional parametric testing was carried out. Before this could be done, assumptions about the distribution of the data had to be checked. To determine how the data were distributed, a normal probability plot of each dataset was made (figures 3 and 4).

Figure 3, Normal probability plot of dataset 1

The points of dataset 1 follow the reference line closely and show no systematic trend of their own, so the plot can be regarded as linear. The data are therefore compatible with a normal distribution.
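The visual check described above can be sketched numerically. The snippet below is a minimal illustration, not the analysis used in this study: the sample is synthetic placeholder data (the measured EPA yields are not reproduced here), and `scipy.stats.probplot` is assumed as a stand-in for whatever plotting tool produced the figures. It returns the ordered data against the theoretical normal quantiles, together with the correlation `r` of the points with the fitted reference line; a value of `r` close to 1 corresponds to points lying close to the line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic placeholder standing in for the 20 EPA yields of one dataset
# (assumed values, not the experimental data).
sample = rng.normal(loc=250.0, scale=5.0, size=20)

# probplot returns the theoretical quantiles and the ordered sample,
# plus the least-squares reference line and its correlation coefficient r.
(theoretical_q, ordered), (slope, intercept, r) = stats.probplot(sample, dist="norm")

print(f"correlation with the reference line: r = {r:.3f}")
```

Plotting `ordered` against `theoretical_q` gives the normal probability plot itself; the coefficient `r` only summarises how linear that plot is and does not replace the visual inspection.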

Figure 4, Normal probability plot of dataset 2.

Although the box and histogram plots of dataset 2 show a left skew, its normal probability plot (figure 4) suggests that it is normally distributed. This difference between the representations is likely because the skew is only slight and not large enough to be visible on a normal probability plot. Although the points do not sit directly on the red line, they show no systematic trend of their own, and so the shape can be concluded to be linear. For the purpose of confirming the failure to reject the null hypothesis, dataset 2 will be assumed to be normally distributed.

QQ plots (figure 5) were also used to check the distribution of the data. For both datasets the points lie close to the line and show no systematic trend of their own, so both can be regarded as normally distributed.

Figure 5, QQ plots of the datasets.

The distribution of the data was therefore assumed to be normal. However, as the data are unpaired, the variances of the two datasets must be shown to be statistically similar before a parametric Student’s t-test can be conducted. Table 1 shows that the sample variances differ, so a test was conducted to assess whether this difference is statistically significant.

A null hypothesis was set:
H0: The datasets have normal distributions and similar variances.
H1: The datasets have unequal variances.

The test produced an h-value of 0, so we failed to reject the null hypothesis and a Student’s t-test was deemed appropriate.
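A variance comparison of this kind can be sketched as follows. The text does not name the test that was used, so the two-sample F-test here is an assumption, and `variance_f_test` is a hypothetical helper written for illustration. The returned `h` flag mirrors the 0/1 decision quoted in the text: 0 means the null hypothesis of equal variances is not rejected at the chosen alpha.

```python
import numpy as np
from scipy import stats

def variance_f_test(x, y, alpha=0.05):
    """Two-sided F-test for equal variances of two normal samples (illustrative)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Ratio of the unbiased sample variances.
    f_stat = np.var(x, ddof=1) / np.var(y, ddof=1)
    dfx, dfy = x.size - 1, y.size - 1
    # Two-sided p-value from the F distribution.
    cdf = stats.f.cdf(f_stat, dfx, dfy)
    p = 2.0 * min(cdf, 1.0 - cdf)
    h = int(p < alpha)  # 1 = reject equal variances, 0 = fail to reject
    return f_stat, p, h
```

Called as `variance_f_test(dataset1, dataset2)`, it returns the F statistic, the p-value and the decision flag. The F-test is sensitive to departures from normality, which is one reason the normality checks come first.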

In the t-test the means of the datasets were compared:
H0: M1 = M2
The alternative hypothesis would suggest that the PI metal level had an effect on the mean EPA yield:
H1: M1 ≠ M2

The t-test returned a p-value of 0.2359 (> 0.05) and an h-value of 0. Once again we fail to reject the null hypothesis, so the means of the two datasets are not statistically different. As with the medians, the similarity in means suggests that the PI metal solution does not have an effect on EPA yield.

The test also gives a confidence interval of -5.04 to 1.32 for the difference in means. As 0 lies within this interval, it is consistent with the failure to reject the null hypothesis.
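The link between the p-value, the h-value and the confidence interval can be illustrated with a short sketch. It assumes a pooled-variance (equal variance) two-sample t-test, which is consistent with the variance check above but is not confirmed by the text; `pooled_t_test` is a hypothetical helper, and the measured yields are not reproduced here.

```python
import numpy as np
from scipy import stats

def pooled_t_test(x, y, alpha=0.05):
    """Pooled two-sample t-test with a confidence interval for mean(x) - mean(y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    t_stat, p = stats.ttest_ind(x, y, equal_var=True)
    n1, n2 = x.size, y.size
    df = n1 + n2 - 2
    # Pooled variance and standard error of the difference in means.
    sp2 = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / df
    se = np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    diff = x.mean() - y.mean()
    half_width = stats.t.ppf(1.0 - alpha / 2.0, df) * se
    h = int(p < alpha)  # 1 = reject equal means, 0 = fail to reject
    return p, h, (diff - half_width, diff + half_width)
```

For a two-sided test at level alpha, the interval contains 0 exactly when the null hypothesis is not rejected, so the interval and the h-value always agree.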

In conclusion, the data show no statistically significant evidence that changing the PI metal solution affects the average EPA yield from the growth of the cells.

C. Experiment 2, surface model

Experiment 2 investigates which factors influence the response of the cells in producing EPA. A boxplot of the data was plotted to show the range of the results and to check for outliers (figure 6). One outlier was identified at sample 26, and it was removed from the data before any further testing was conducted. The outlier was dropped because it did not greatly affect the results but would have affected the model assumptions.

Figure 6, Boxplot of the data of experiment 2
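Boxplot outliers are conventionally the points lying more than 1.5 times the IQR beyond the quartiles. The sketch below assumes that convention (the text does not state the whisker length used), and `iqr_outliers` is a hypothetical helper. It returns the positions of the flagged samples so that they can be inspected before being removed.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return indices of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (values < lower) | (values > upper)
    return np.flatnonzero(mask)
```

Note that the returned indices are zero-based, so a sample numbered 26 in a one-based list would appear as index 25.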

Multiple least squares regression was adopted to model the optimal operating conditions of the experiment (figure 6). It was important to fit the correct model to the data; in this case the equation below was used:

where Y gives the EPA yield, and X1, X2 and X3 correspond to the pH, NaCl and temperature values respectively.

Figure 7, linear regression model

The model (figure 7) is fitted by finding the equation that minimises the sum of squared distances between the data points and the fitted line (the least squares method). B0 represents the y-intercept of the graph, and B1 gives the slope of the line relating a predictor to the response.
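The least squares fit can be sketched as below. The exact model equation is not reproduced in this text, so the sketch assumes the simplest case, a first-order model with an intercept and one coefficient per predictor; any interaction or squared terms in the actual model would need extra columns in the design matrix. `fit_least_squares` is a hypothetical helper, not the routine used in the study.

```python
import numpy as np

def fit_least_squares(X, y):
    """Fit y = b0 + b1*x1 + ... + bk*xk by ordinary least squares.

    X has one row per test and one column per predictor
    (assumed here: pH, NaCl, temperature). Returns [b0, b1, ..., bk].
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that the first coefficient is the intercept.
    design = np.column_stack([np.ones(X.shape[0]), X])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs
```

The fitted values are then `design @ coeffs`, and the residuals are the differences between the measured yields and those fitted values.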

The model produces an adjusted R2 value of 0.77. Adjusted R2 is a statistical measure of how closely the points lie to the regression line that takes direct account of the number of predictors. As this value is relatively close to 1, the model can be said to fit the data reasonably well.
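For reference, adjusted R2 is computed from the ordinary R2 as 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sketch below implements that standard formula; `adjusted_r2` is a hypothetical helper name, and the inputs are whatever measured and fitted yields are available.

```python
import numpy as np

def adjusted_r2(y, y_hat, n_predictors):
    """Adjusted R-squared for fitted values y_hat (intercept not counted as a predictor)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    n = y.size
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
```

Unlike the ordinary R2, this value can fall when a predictor that explains little is added, which is why it is a useful guide when discarding variables.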

Once the model was produced, it was important to check that this type of model is appropriate. Linear modelling of this kind assumes normally distributed data, and so this was checked (figure 8).

Figure 8, Normal probability plot of experiment 2

As in experiment 1, the points lie close to the line, so the data can be regarded as normally distributed. For a normal distribution the median equals the mean, so the median EPA yield can be taken to be approximately equal to the mean.

Any bias present in the model must also be checked, bias here meaning that a parameter estimate is systematically too extreme. The bias of the model was checked through visual inspection of a residuals vs predicted plot (figure 9).

Figure 9, Predicted vs Residuals of experiment 2

The plot (figure 9) shows that the residuals are scattered with a large amount of disorder; no substantial trend can be seen along the horizontal axis. Because of this, it can be said that the assumptions of the linear model hold and the model is not biased.

The overall accuracy of the model was also assessed with an actual vs fitted plot (figure 10).

Figure 10, Actual vs fitted points of data from experiment 2

Figure 10 shows that the points sit close to the 45-degree line, and so the model can be said to be a ‘good fit’.

D. Experiment 2, Refining the model

To refine the model and give the data the best possible fit to the regression line (shown by an increased adjusted R2 value), statistically insignificant variables had to be discarded.

The alpha value was assumed to be 0.1 and this was checked against the p

value. Variables with a p-value