Data Analysis on Research Paper

For our first-year Data Interpretation and Evaluation of Science module, we were assigned a research paper on engineering antibodies for a longer serum half-life. We had to break the paper down and analyze it using the methods taught in the module. Here are some of the points I raised.

 

The first two paragraphs of the Results and Discussion state two hypotheses:

Hypothesis 1: (Based on our molecular modeling analysis) we, therefore, hypothesized that amino acid substitutions at positions 250, 314, and 428 would affect the interaction with FcRn without affecting the pH-dependence of this interaction (marked “H1” in the margin of the paper).

Hypothesis 2: Since the amino acids at and around these three positions are conserved among all four mouse and human IgG subtypes, we also hypothesized that the FcRn-binding phenotype resulting from an amino acid substitution in the IgG2 subtype could be transferred to the other three IgG subtypes (marked “H2” in the margin of the paper).

With a brief justification, the principal type of logic apparent in each hypothesis is:

  1. Hypothesis 1: Inductive logic is suggested. The hypothesis is derived from the results of the general molecular modeling analysis: the premise had already been established through observation, and a more elaborate, specific hypothesis was built on that accepted premise.
  2. Hypothesis 2: Inductive logic is implied. The hypothesis generalizes from a specific observation about the “amino acids at and around” the stated positions: it proposes that a phenotype observed in one subtype can also occur in the other subtypes.

 

I observed several issues with the graphic presentation of the data in Figure 2 that could mislead readers. The X-axis has no label, so readers cannot easily tell what each letter on the X-axis stands for. The error bars do not show the minimum values, which prevents readers from estimating the error or uncertainty in the data. Graphs A and C have false zeros, which make the differences between values look larger than they are and leave these graphs out of proportion with graph B. Finally, some data values exceed the range of the Y-axis, which can make the values on some graphs hard to read accurately.

 

Two inferences are drawn concerning the properties of amino acid residues at positions 250 and 428 that are required for increased binding. The first inference (underlined on page 2 and marked “I1” in the margin of the paper) concerns hydrogen bonding at position 250. The second inference (underlined on page 2 and marked “I2” in the margin of the paper) is that at position 428, large hydrophobic amino acids confer better FcRn binding. The second inference is not consistent with the data shown in Figure 2C. Leucine, a hydrophobic, aliphatic amino acid, has the lowest mean channel fluorescence, which indicates the best binding; however, this does not hold for all the hydrophobic amino acids. Some non-hydrophobic amino acids (K and R) confer better binding than most hydrophobic amino acids. Moreover, leucine is not one of the largest hydrophobic amino acids, which are the aromatic ones. Only one of those largest amino acids shows a lower mean and better binding; the others have a higher mean and do not confer better binding.

 

Data from a separate pharmacokinetic study comparing the T250Q mutant and wild-type antibodies were not included in the paper, but were provided in an Excel spreadsheet (“Coursework SRP Q4 data”). The values are not in order of time, and the corresponding time values are not shown because they are not relevant to the calculations. After reorganizing the data in the given spreadsheet, I obtained the following results:

  • Sample standard deviation of the
    a. T250Q Cmax values = 13.3204
    b. wild type Cmax values = 11.6026
  • Population standard deviation of the
    a. T250Q Cmax values = 13.0965
    b. wild type Cmax values = 11.4076
  • Standard error of the mean (SEM) of the
    a. T250Q Cmax values = 2.43197
    b. wild type Cmax values = 2.11834
  • Mean of the
    a. T250Q Cmax values = 36.34
    b. wild type Cmax values = 36.94
  • Assuming the Cmax values are normally distributed and of similar variance:
    a. The wild type and T250Q Cmax data are not significantly different (P-value > 0.05).
    b. The P-value for an unpaired, two-tailed t-test = 0.852597215
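
The figures above are linked by a few standard formulas, which can be sketched in Python as a cross-check. This is only an illustration working from the reported summary values, not the spreadsheet itself. The group size n = 30 is an assumption: the spreadsheet size is not stated here, and 30 is inferred from the ratio of population SD to sample SD, which equals sqrt((n - 1) / n). The function names are my own.

```python
import math

# Reported summary values from the list above (means and sample SDs).
MEAN_T250Q, SD_T250Q = 36.34, 13.3204
MEAN_WT, SD_WT = 36.94, 11.6026

# ASSUMPTION: n = 30 per group. Not stated in the text; inferred from the
# ratio of population SD to sample SD, which is sqrt((n - 1) / n).
N = 30


def sem(sample_sd, n):
    """Standard error of the mean from a sample SD."""
    return sample_sd / math.sqrt(n)


def population_sd(sample_sd, n):
    """Convert a sample SD (n - 1 divisor) to a population SD (n divisor)."""
    return sample_sd * math.sqrt((n - 1) / n)


def pooled_t_statistic(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Unpaired t statistic assuming similar variances (pooled variance)."""
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, df


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t, df, steps=2000):
    """Two-tailed P-value: 1 minus the area between -|t| and +|t|.

    The area is found by Simpson's rule on [0, |t|] and doubled,
    using the symmetry of the t distribution. steps must be even.
    """
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (h / 3) * total


t_stat, df = pooled_t_statistic(MEAN_T250Q, SD_T250Q, N, MEAN_WT, SD_WT, N)
p_value = two_tailed_p(t_stat, df)
print(f"SEM: T250Q = {sem(SD_T250Q, N):.5f}, wild type = {sem(SD_WT, N):.5f}")
print(f"t = {t_stat:.4f}, df = {df}, P = {p_value:.4f}")
```

Because the means are rounded to two decimal places, the P-value from this sketch should land close to, but not exactly on, the spreadsheet value.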

 

Following the calculations above, I then constructed a histogram of the clearance rate (CL) values for each of the following:

a. the T250Q mutant protein

b. the wild-type protein

The T250Q mutant protein histogram shows a positively skewed distribution that does not resemble a normal distribution. The sample size also did not exceed 30, so these data are better suited to a non-parametric statistical test. The wild-type protein histogram, on the other hand, does resemble a normal distribution; these data are therefore suitable for a parametric statistical test.
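
A visual judgment of skew like this one can be backed by a number. The sketch below bins values into histogram counts and computes the moment coefficient of skewness (positive for a right-skewed shape, near zero for a symmetric one). The two data lists are hypothetical values made up for illustration; they are not the CL values from the spreadsheet, and the function names are my own.

```python
import math
from collections import Counter


def histogram_counts(values, bin_width):
    """Count values per bin, keyed by the lower edge of each bin."""
    counts = Counter(math.floor(v / bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def skewness(values):
    """Moment coefficient of skewness (population form, n divisor)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# HYPOTHETICAL data for illustration only (not the coursework CL values).
right_skewed = [1.0, 1.1, 1.2, 1.2, 1.3, 1.4, 1.5, 1.8, 2.4, 3.5]
symmetric = [1.0, 1.5, 1.5, 2.0, 2.0, 2.0, 2.5, 2.5, 3.0]

for name, data in (("right-skewed", right_skewed), ("symmetric", symmetric)):
    print(name, histogram_counts(data, 0.5), f"skewness = {skewness(data):.3f}")
```

A long right tail pulls the skewness above zero, which matches the positively skewed shape described for the T250Q histogram.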

 

I noticed that references 1 (Morrison et al., 1984), 7 (Medesan et al., 1997), and 12 (Saper et al., 1991) are missing the page numbers, the volume, and the second author, respectively. I revised and corrected them using a literature search in PubMed. Here are the revised references:

  1. Morrison, S. L., Johnson, M. J., Herzenberg, L. A., and Oi, V. T. (1984) Proc. Natl. Acad. Sci. U. S. A. 81, 6851–6855
  7. Medesan, C., Matesoi, D., Radu, C., Ghetie, V., and Ward, E. S. (1997) J. Immunol. 158, 2211–2217
  12. Saper, M. A., Bjorkman, P. J., and Wiley, D. C. (1991) J. Mol. Biol. 219, 277–319

 

The sample measurements shown in Figure 2 were recorded sequentially in the order ACB. The authors of the paper noticed that most samples in the second half of the experiment had less competitive binding than the starting control. As a result, they became concerned that some of the mutations shown in Figure 2 may have affected protein stability, and that samples were partially denaturing (i.e. losing their active structure) before they were assayed. If true, denaturation could have led to inactive proteins being used in later assays. To test the authors’ concerns using the same assay, I would suggest repeating this experiment in different orders (ABC, CAB, and BAC). This way, the authors could observe at which point of each run the protein behavior starts to change. Knowing these points could help identify which mutations were responsible for affecting protein stability and causing denaturation. It could also help the authors pick the most suitable order and eliminate the factors that cause the mutant proteins to denature before they are assayed.

 

Lastly, in the legend to Table 1 on page 3, the authors state that “The standard error of the mean (SEM) is shown for each parameter. Two-tailed t-tests were used to compare the statistical significance of differences” between groups. Here, SEM is used rather than the standard deviation because SEM estimates how far the sample mean is likely to be from the population mean, whereas standard deviation describes the spread of individual sample values around the mean. The t-test assesses the means of the samples rather than the individual sample values. In addition, the term “unpaired” refers to the two separate, distinct groups whose means are compared in an unpaired t-test. “Two-tailed” means that both ends of the distribution are considered. I would also conclude that an unpaired, two-tailed t-test is not a suitable statistical test for these data. A t-test assesses whether the means of only two data sets differ significantly. ANOVA would be a more suitable test for three or more data sets.
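
To make the last point concrete, here is a minimal sketch of how a one-way ANOVA compares three or more groups in a single test: it takes the ratio of the variance between group means to the variance within groups (the F statistic). The three groups are hypothetical numbers chosen for easy arithmetic; they are not values from Table 1, and the function name is my own.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of individual values around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# HYPOTHETICAL groups for illustration only (not data from the paper).
example_groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat, df_between, df_within = one_way_anova_f(example_groups)
print(f"F = {f_stat:.2f} with ({df_between}, {df_within}) degrees of freedom")
```

The F statistic is then compared against the F distribution with those degrees of freedom to get a P-value, so all groups are tested at once instead of through repeated pairwise t-tests.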