Are these two groups different? We compare two samples in terms of statistical parameters such as mean, variance, etc.

t-test: mean comparison.

\(\{x_1, x_2, x_3, x_4, \dots, x_{n_1} \}\) is from \(N(\mu_1, \sigma_1^2)\) and

\(\{y_1, y_2, y_3, y_4, \dots, y_{n_2} \}\) is from \(N(\mu_2, \sigma_2^2)\)

x <- c(5.12, 5.28, 4.98, 4.62, 4.95, 4.35, 4.83, 6.15, 5.83, 5.43)
y <- c(5.37, 5.63, 5.89, 5.92, 5.91, 6.15, 6.29, 6.41, 6.88, 7.12)

qqnorm(x);

qqnorm(y);


ttest.res <- t.test(x,y)
ttest.res$p.value
[1] 0.0006073053

\(t-score = \frac{\bar{x}-\bar{y}}{ \sqrt{ s^2_x/n_1+s^2_y/n_2 } }\)

distribution is \(t_{(\nu)}\)

\(\nu = \frac{(s^2_x/n_1+s^2_y/n_2)^2}{(s^2_x/n_1)^2/(n_1-1)+(s^2_y/n_2)^2/(n_2-1)}\)









Linear regression

\(\hat{y}=\beta_0 + \beta_1 x\)

\(\{x_1, x_2, x_3, x_4, \dots, x_{n_1} \}\) is from \(N(\beta_0+\beta_1\cdot 0, \sigma^2)\) and

\(\{y_1, y_2, y_3, \dots, y_{n_2} \}\) is from \(N(\beta_0+\beta_1\cdot 1, \sigma^2)\)

Since the estimator’s of \(\beta_0\) and \(\beta_1\) are calculated by using \(x_i\) and \(y_j\), we also have probability distributions for the estimators of \(\beta_0\) and \(\beta_1\). \(\beta_1\) estimator’s distribution is another \(t\)-distribution. Using this we can test

\(H_0:\beta_1=0\)

If rejected, \(i.e.\), \(P-\)value is low enough, we can say that means are different.

library(ggplot2)
rm(data_4_lin_reg)
object 'data_4_lin_reg' not found
data_4_lin_reg <- data.frame(name = rep("x", length(x)), values = x)
data_4_lin_reg <- rbind(data_4_lin_reg, data.frame(name = rep("y", length(y)), values = y))
ggplot(data = data_4_lin_reg, aes(x=name, y=values))+geom_boxplot()




# lin_reg2 <- glm(data = data_4_lin_reg, values ~ name, family=gaussian)
# summary(lin_reg2)
var.test(x, y, alternative = "two.sided")

    F test to compare two variances

data:  x and y
F = 1.0127, num df = 9, denom df = 9, p-value = 0.9853
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.2515393 4.0771072
sample estimates:
ratio of variances 
          1.012696 
lin_reg <- lm(data = data_4_lin_reg, values ~ name)
coef(summary(lin_reg))[2,4]
[1] 0.0006072601
y_1=1.4*y-2.0
var.test(x, y_1, alternative = "two.sided")

    F test to compare two variances

data:  x and y_1
F = 0.51668, num df = 9, denom df = 9, p-value = 0.3395
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.1283364 2.0801567
sample estimates:
ratio of variances 
         0.5166815 
ttest.res <- t.test(x,y_1)
ttest.res$p.value
[1] 0.0001267858

rm(data_4_lin_reg)
data_4_lin_reg <- data.frame(name = rep("x", length(x)), values = x)
data_4_lin_reg <- rbind(data_4_lin_reg, data.frame(name = rep("y", length(y_1)), values = y_1))
#data_4_lin_reg

ggplot(data = data_4_lin_reg, aes(x=name, y=values))+geom_boxplot()

lin_reg <- lm(data = data_4_lin_reg, values ~ name)
coef(summary(lin_reg))[2,4]
[1] 9.585896e-05









General linear regressions

1. Linear regression.

\(Y|_{x} \sim N(\beta_0 + \beta_1 x,\sigma)\)

\(E(Y|_{x})=\beta_0 +\beta_1 x\)

Probability distribution, \(f(y|x;(\beta_0, \beta_1 ))=n(\beta_0 + \beta_1 x,\sigma)\)


ind.var <- seq(1, 10, by=1)
dep.var <- ind.var + rnorm(10, 0, 1.5)
linreg.res <- lm(data = data.frame(x=ind.var, y= dep.var), y ~ x)
coef(summary(linreg.res))
              Estimate Std. Error    t value     Pr(>|t|)
(Intercept) -0.2844622  1.0760642 -0.2643543 0.7981897442
x            0.9472229  0.1734235  5.4619079 0.0006001724
interc = coef(summary(linreg.res))[1,1]; slope = coef(summary(linreg.res))[2,1]

ggplot(data = data.frame(x=ind.var, y= dep.var))+geom_point(aes(x=ind.var, y=dep.var)) + 
  geom_line(aes(x=ind.var, y=interc + slope*ind.var   )) +
  geom_line(data = data.frame(x=dnorm(seq(-3.5,0,by=0.01), 0, 1.5)+3, 
                              y=seq(-3.5,0,by=0.01)+interc + slope*3), aes(x=x, y= y  )) + 
  geom_line(data = data.frame(x=dnorm(seq(0,3.5,by=0.01), 0, 1.5)+3,
                              y=seq(0,3.5,by=0.01)+interc + slope*3), aes(x=x, y= y  )) +
  geom_line(data = data.frame(x=dnorm(seq(-3.5,0,by=0.01), 0, 1.5)+5, 
                              y=seq(-3.5,0,by=0.01)+interc + slope*5), aes(x=x, y= y  )) + 
  geom_line(data = data.frame(x=dnorm(seq(0,3.5,by=0.01), 0, 1.5)+5, 
                              y=seq(0,3.5,by=0.01)+interc + slope*5), aes(x=x, y= y  )) +
  geom_line(data = data.frame(x=dnorm(seq(-3.5,0,by=0.01), 0, 1.5)+7, 
                              y=seq(-3.5,0,by=0.01)+interc + slope*7), aes(x=x, y= y  )) + 
  geom_line(data = data.frame(x=dnorm(seq(0,3.5,by=0.01), 0, 1.5)+7, 
                              y=seq(0,3.5,by=0.01)+interc + slope*7), aes(x=x, y= y  ))

NA
NA

How to find \(\beta_0\) and \(\beta_1\).

Data \(\{(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4), (x_5, y_5)\}\)

- Minimize \(SSE\)

\(SSE = F(\beta_0, \beta_1 ) = \sum_1^5 (\beta_0 + \beta_1x_i - y_i)^2\); multivariable (geometry) calculus problem.

- Maximize likelihood function, \(L\)

\(L = \prod_1^5 \frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{(\beta_0 + \beta_1x_i - y_i)^2}{2\sigma^2})\). Or, minimize \(-ln(L)\) which is the same as minimizing \(SSE\).

- Solve \(X^TX\cdot B=X^TY\)

\(X=\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_5 \end{bmatrix}\), \(B=\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}\), \(Y=\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_5 \end{bmatrix}\).





linreg.res <- glm(data = data.frame(x=ind.var, y= dep.var), y ~ x, family=gaussian )
coef(summary(linreg.res))
              Estimate Std. Error    t value     Pr(>|t|)
(Intercept) -0.2844622  1.0760642 -0.2643543 0.7981897442
x            0.9472229  0.1734235  5.4619079 0.0006001724

2. Logistic regression.

\(Y|x \sim B(n,p|_x).\) We assume that \(E(Y|x) = n\cdot inv.logit(\beta_0 +\beta_1 x)\).

Since \(E(Y|x)=n\cdot p|_x\), \(p|_x=inv.logit(\beta_0 +\beta_1 x)\). Or,

\(logit(p|_x)=\beta_0 +\beta_1 x\).

Probability distribution,

\(f(1|x)=p|_x\) and \(f(0|x)=1-p|_x\)

\[L=\prod_{y_i=0}(1-p|_{x_i})\cdot \prod_{y_i=1}(p|_{x_i})\]

\[\ln L = \sum_{i}((1-y_i)\ln (1-p|_{x_i}) + y_i\ln (p|_{x_i}))\]

\[ \left( \frac{\hat{\beta_i}}{s.e.(\hat{\beta_i})} \right)^2 \sim \chi^2 \]

library(ggplot2)
ind.var_bin <- seq(-3, 3, by=0.5)
#p <- inv.logit(ind.var_bin)

dep.var_bin <- c(0,0,0,1,0,1,0,1,1,0,1,1,1)

binreg.res <- glm(data = data.frame(x=ind.var_bin, y= dep.var_bin), y ~ x, family=binomial(link=logit) )
coef(summary(binreg.res))
             Estimate Std. Error   z value   Pr(>|z|)
(Intercept) 0.2331067  0.6881342 0.3387518 0.73479673
x           0.8053177  0.4440151 1.8137168 0.06972133
tt=seq(-3.2, 3.2, by = 0.1)
ggplot(data=data.frame(x=ind.var_bin, y=dep.var_bin))+geom_point(aes(x=x,y=y)) + 
  geom_line(data=data.frame(x=tt, y=inv.logit(coef(summary(binreg.res))[1,1]+coef(summary(binreg.res))[2,1]*tt)), aes(x=x,y=y)) 


success =  data.frame(count=rep(0, length(ind.var_bin)))
for (ii in (1:length(ind.var_bin))){
  
  p0 <- inv.logit(coef(summary(binreg.res))[1,1]+coef(summary(binreg.res))[2,1]*ind.var_bin[ii])
  success$count[ii] = rbinom(1,100, p0)
  
}

ggplot(data = data.frame(x=ind.var_bin, count=success$count)) + geom_point(aes(x=x,y=count/100)) + geom_line(data=data.frame(x=tt, y=inv.logit(coef(summary(binreg.res))[1,1]+coef(summary(binreg.res))[2,1]*tt)), aes(x=x,y=y)) 



# 
# x0 <- seq(1:100)/100 
# y0 <- 10*dbinom(seq(1:100), 100, p0)
# 
# ggplot(data=data.frame(x=x0, y=y0))+geom_point(aes(x=y,y=x), pch=15, size=0.5)

  # geom_line(data = data.frame(x=dbinom(seq(1:100), 100, 1.5)+7, 
  #                             y=seq(-3.5,0,by=0.01)+interc + slope*7), aes(x=x, y= y  )) + 
  # geom_line(data = data.frame(x=dnorm(seq(0,3.5,by=0.01), 0, 1.5)+7, 
  #                             y=seq(0,3.5,by=0.01)+interc + slope*7), aes(x=x, y= y  ))









3. Poisson regression.

\(Y|_x \sim Poisson(e^{\beta_0 +\beta_1 x})\)

\(E(Y|_x)=e^{\beta_0 +\beta_1 x}\)

\[f(y|_x)=\frac{e^{(\beta_0 +\beta_1 x)y} \exp(-e^{\beta_0 +\beta_1 x})}{y!}\]

\[\ln L = \sum_{i}\left( (\beta_0+\beta_1 x_i)y_i - e^{\beta_0+\beta_1 x_i} - \ln (y_i!) \right)\]

\[ \ln (E(Y|_x)) = \beta_0 + \beta_1 x. \] Linear again!!

pois_reg <- glm(data=data, y ~ x, family=poisson)





slope=1.5
x=seq(0, 1, by=0.1)
y=rpois(11, exp(1+slope*x))


pois_reg <- glm(data=data.frame(x=x,y=y), y ~ x, family=poisson)
p_slope = coef(summary(pois_reg))[2,1]; p_interc <- coef(summary(pois_reg))[1,1]
coef(summary(pois_reg))
             Estimate Std. Error  z value     Pr(>|z|)
(Intercept) 0.9086504  0.2974696 3.054599 0.0022536163
x           1.5121244  0.4194391 3.605111 0.0003120198
lin_reg <- glm(data=data.frame(x=x,y=y), y ~ x, family=gaussian)
l_slope = coef(summary(lin_reg))[2,1]; l_interc <- coef(summary(lin_reg))[1,1]
coef(summary(lin_reg))
            Estimate Std. Error  t value   Pr(>|t|)
(Intercept) 1.636364   1.439123 1.137056 0.28488131
x           8.545455   2.432561 3.512945 0.00658806
ggplot(data = data.frame(x=x,y=y))+geom_point(aes(x=x,y=y)) +
  geom_line( data=data.frame(x=seq(0, 1, by=0.05), y=exp(1+slope*seq(0, 1, by=0.05)))     , aes(x=x,y=y)) +
  geom_line( data=data.frame(x=seq(0, 1, by=0.05), y=exp(p_interc+p_slope*seq(0, 1, by=0.05))), aes(x=x,y=y), color = "red") +
  geom_line( data=data.frame(x=seq(0, 1, by=0.05), y=l_interc+l_slope*seq(0, 1, by=0.05)), aes(x=x,y=y), color = "blue")

NA
NA
NA





Regressions to compare two groups: P-value calculation.

- State marker expression data.

What’s the population? What is the random variable? Samples?

Example:

  • population of virginia who had heart attack.
  • Median of NK\(\kappa\)B expression from all Treg cells.
  • 16 people.

What is the distribution?

  • one side open ended counting data: Poisson, Negative binomial distribution
  • data transformed by \(\text{asinh}=\ln (x+\sqrt{x^2+1}) \sim \ln 2 + \ln x \text{ for} x>>1\)
  • Gaussian.

Use T-test or linear regression

ggplot(data = data.frame(x=seq(-10, 100, by=1), y=asinh(seq(-10, 100, by=1))))+geom_line(aes(x=x,y=y)) + coord_fixed(ratio = 4/1) +
  labs(title = paste("y = asinh(x)")) 









- Proportion data. Cluster proportions.

What is the distribution?

- Binomial distribution. Logistic regression or Quasi-logistic regression.

\[ P(x=k)= {n \choose x}p(p+k\phi)^{k-1}(1-p-k\phi)^{n-k}\]

- If \(X \sim B(n,p)\), \(X \sim N(np, npq)\), approximately (\(n\geq 15\)). Then, t-test or linear regression can be applied.

ggplot(data = data.frame(x=seq(0.01,0.99, by=0.01), y=logit( seq(0.01,0.99, by=0.01) )))+geom_line(aes(x=x,y=y)) + coord_fixed(ratio = 1/9) +
  labs(title = paste("y = logit(x)"))

- \[ \ln\left( \frac{proportion}{1-proportion} \right) \sim Gaussian\]. Apply t-test or linear regression to log odds ratio.

Weights

- Maximize likelihood function, \(L\)

\[L = \prod_1^n \left( \frac{1}{\sqrt{2\pi}\sigma}\exp \left(-\frac{(\beta_0 + \beta_1x_i - y_i)^2}{2\sigma^2} \right) \right)^{w_i}.\]

\[ -\ln L = Constant \cdot \sum^n_1w_i (\beta_0 + \beta_1x_i - y_i)^2\]

---
title: "Regressions for differential analysis"
output: html_notebook
---

## Are these two groups different? We compare two samples in terms of statistical parameters such as mean, variance, etc.

## **t-test: mean comparison.**

## $\{x_1, x_2, x_3, x_4, \dots, x_{n_1} \}$ is from $N(\mu_1, \sigma_1^2)$ and 
## $\{y_1, y_2, y_3, y_4, \dots, y_{n_2} \}$ is from $N(\mu_2, \sigma_2^2)$


```{r}
x <- c(5.12, 5.28, 4.98, 4.62, 4.95, 4.35, 4.83, 6.15, 5.83, 5.43)
y <- c(5.37, 5.63, 5.89, 5.92, 5.91, 6.15, 6.29, 6.41, 6.88, 7.12)

qqnorm(x);
qqnorm(y);

ttest.res <- t.test(x,y)
ttest.res$p.value

```
## $t-score = \frac{\bar{x}-\bar{y}}{ \sqrt{ s^2_x/n_1+s^2_y/n_2  }  }$ 
## distribution is $t_{(\nu)}$
## $\nu = \frac{(s^2_x/n_1+s^2_y/n_2)^2}{(s^2_x/n_1)^2/(n_1-1)+(s^2_y/n_2)^2/(n_2-1)}$


<br />
<br />
<br />
<br />

<br />
<br />
<br />
<br />

# **Linear regression**
## $\hat{y}=\beta_0 + \beta_1 x$

## $\{x_1, x_2, x_3, x_4, \dots, x_{n_1} \}$ is from $N(\beta_0+\beta_1\cdot 0, \sigma^2)$ and 
## $\{y_1, y_2, y_3, \dots, y_{n_2} \}$ is from $N(\beta_0+\beta_1\cdot 1, \sigma^2)$
## Since the estimator's of $\beta_0$ and $\beta_1$ are calculated by using $x_i$ and $y_j$, we also have probability distributions for the estimators of $\beta_0$ and $\beta_1$. $\beta_1$ estimator's distribution is another $t$-distribution. Using this we can test
## $H_0:\beta_1=0$
## If rejected, $i.e.$, $P-$value is low enough, we can say that means are different.

```{r}
library(ggplot2)
rm(data_4_lin_reg)
data_4_lin_reg <- data.frame(name = rep("x", length(x)), values = x)
data_4_lin_reg <- rbind(data_4_lin_reg, data.frame(name = rep("y", length(y)), values = y))
ggplot(data = data_4_lin_reg, aes(x=name, y=values))+geom_boxplot()



# lin_reg2 <- glm(data = data_4_lin_reg, values ~ name, family=gaussian)
# summary(lin_reg2)
```







```{r}
var.test(x, y, alternative = "two.sided")

lin_reg <- lm(data = data_4_lin_reg, values ~ name)
coef(summary(lin_reg))[2,4]
```



```{r}
y_1=1.4*y-2.0



```



```{r}
var.test(x, y_1, alternative = "two.sided")


```

```{r}
ttest.res <- t.test(x,y_1)
ttest.res$p.value
```

```{r}

rm(data_4_lin_reg)
data_4_lin_reg <- data.frame(name = rep("x", length(x)), values = x)
data_4_lin_reg <- rbind(data_4_lin_reg, data.frame(name = rep("y", length(y_1)), values = y_1))
#data_4_lin_reg

ggplot(data = data_4_lin_reg, aes(x=name, y=values))+geom_boxplot()
```

```{r}
lin_reg <- lm(data = data_4_lin_reg, values ~ name)
coef(summary(lin_reg))[2,4]

```
<br />
<br />
<br />
<br />

<br />
<br />
<br />
<br />

# **General linear regressions**



## **1. Linear regression.** 
## $Y|_{x} \sim N(\beta_0 + \beta_1 x,\sigma)$
## $E(Y|_{x})=\beta_0 +\beta_1 x$
## Probability distribution, $f(y|x;(\beta_0, \beta_1 ))=n(\beta_0 + \beta_1 x,\sigma)$

```{r}

ind.var <- seq(1, 10, by=1)
dep.var <- ind.var + rnorm(10, 0, 1.5)

```


```{r}
linreg.res <- lm(data = data.frame(x=ind.var, y= dep.var), y ~ x)
coef(summary(linreg.res))


interc = coef(summary(linreg.res))[1,1]; slope = coef(summary(linreg.res))[2,1]

ggplot(data = data.frame(x=ind.var, y= dep.var))+geom_point(aes(x=ind.var, y=dep.var)) + 
  geom_line(aes(x=ind.var, y=interc + slope*ind.var   )) +
  geom_line(data = data.frame(x=dnorm(seq(-3.5,0,by=0.01), 0, 1.5)+3, 
                              y=seq(-3.5,0,by=0.01)+interc + slope*3), aes(x=x, y= y  )) + 
  geom_line(data = data.frame(x=dnorm(seq(0,3.5,by=0.01), 0, 1.5)+3,
                              y=seq(0,3.5,by=0.01)+interc + slope*3), aes(x=x, y= y  )) +
  geom_line(data = data.frame(x=dnorm(seq(-3.5,0,by=0.01), 0, 1.5)+5, 
                              y=seq(-3.5,0,by=0.01)+interc + slope*5), aes(x=x, y= y  )) + 
  geom_line(data = data.frame(x=dnorm(seq(0,3.5,by=0.01), 0, 1.5)+5, 
                              y=seq(0,3.5,by=0.01)+interc + slope*5), aes(x=x, y= y  )) +
  geom_line(data = data.frame(x=dnorm(seq(-3.5,0,by=0.01), 0, 1.5)+7, 
                              y=seq(-3.5,0,by=0.01)+interc + slope*7), aes(x=x, y= y  )) + 
  geom_line(data = data.frame(x=dnorm(seq(0,3.5,by=0.01), 0, 1.5)+7, 
                              y=seq(0,3.5,by=0.01)+interc + slope*7), aes(x=x, y= y  ))


```
## How to find $\beta_0$ and $\beta_1$.
## Data $\{(x_1, y_1),  (x_2, y_2),  (x_3, y_3), (x_4, y_4), (x_5, y_5)\}$
### - Minimize $SSE$
### $SSE = F(\beta_0, \beta_1 ) = \sum_1^5 (\beta_0 + \beta_1x_i - y_i)^2$; multivariable (geometry) calculus problem.

### - Maximize likelihood function, $L$
### $L = \prod_1^5 \frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{(\beta_0 + \beta_1x_i - y_i)^2}{2\sigma^2})$. Or, minimize $-ln(L)$ which is the same as minimizing $SSE$.

### - Solve $X^TX\cdot B=X^TY$
#### $X=\begin{bmatrix}
1 & x_1 \\ 
1 & x_2 \\ 
\vdots & \vdots \\ 
1 & x_5 
\end{bmatrix}$, $B=\begin{bmatrix} \beta_0 \\
\beta_1
\end{bmatrix}$, $Y=\begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_5
\end{bmatrix}$.

<br />
<br />
<br />
<br />



```{r}
linreg.res <- glm(data = data.frame(x=ind.var, y= dep.var), y ~ x, family=gaussian )
coef(summary(linreg.res))
```


## **2. Logistic regression.**
## $Y|x \sim B(n,p|_x).$ We assume that $E(Y|x) = n\cdot inv.logit(\beta_0 +\beta_1 x)$.
## Since $E(Y|x)=n\cdot p|_x$, $p|_x=inv.logit(\beta_0 +\beta_1 x)$. Or,
## $logit(p|_x)=\beta_0 +\beta_1 x$.
## Logit is called the link function for logistic regression. 
## Probability distribution, 
### $f(1|x)=p|_x$ and $f(0|x)=1-p|_x$

### $$L=\prod_{y_i=0}(1-p|_{x_i})\cdot \prod_{y_i=1}(p|_{x_i})$$
### $$\ln L = \sum_{i}((1-y_i)\ln (1-p|_{x_i}) + y_i\ln (p|_{x_i}))$$
### $$ \left( \frac{\hat{\beta_i}}{s.e.(\hat{\beta_i})} \right)^2 \sim \chi^2 $$  

```{r}
library(ggplot2)
ind.var_bin <- seq(-3, 3, by=0.5)
#p <- inv.logit(ind.var_bin)

dep.var_bin <- c(0,0,0,1,0,1,0,1,1,0,1,1,1)

binreg.res <- glm(data = data.frame(x=ind.var_bin, y= dep.var_bin), y ~ x, family=binomial(link=logit) )
coef(summary(binreg.res))

tt=seq(-3.2, 3.2, by = 0.1)
ggplot(data=data.frame(x=ind.var_bin, y=dep.var_bin))+geom_point(aes(x=x,y=y)) + 
  geom_line(data=data.frame(x=tt, y=inv.logit(coef(summary(binreg.res))[1,1]+coef(summary(binreg.res))[2,1]*tt)), aes(x=x,y=y)) 

success =  data.frame(count=rep(0, length(ind.var_bin)))
for (ii in (1:length(ind.var_bin))){
  
  p0 <- inv.logit(coef(summary(binreg.res))[1,1]+coef(summary(binreg.res))[2,1]*ind.var_bin[ii])
  success$count[ii] = rbinom(1,100, p0)
  
}

ggplot(data = data.frame(x=ind.var_bin, count=success$count)) + geom_point(aes(x=x,y=count/100)) + geom_line(data=data.frame(x=tt, y=inv.logit(coef(summary(binreg.res))[1,1]+coef(summary(binreg.res))[2,1]*tt)), aes(x=x,y=y)) 


# 
# x0 <- seq(1:100)/100 
# y0 <- 10*dbinom(seq(1:100), 100, p0)
# 
# ggplot(data=data.frame(x=x0, y=y0))+geom_point(aes(x=y,y=x), pch=15, size=0.5)

  # geom_line(data = data.frame(x=dbinom(seq(1:100), 100, 1.5)+7, 
  #                             y=seq(-3.5,0,by=0.01)+interc + slope*7), aes(x=x, y= y  )) + 
  # geom_line(data = data.frame(x=dnorm(seq(0,3.5,by=0.01), 0, 1.5)+7, 
  #                             y=seq(0,3.5,by=0.01)+interc + slope*7), aes(x=x, y= y  ))



```
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />


## **3. Poisson regression.** 
## $Y|_x \sim Poisson(e^{\beta_0 +\beta_1 x})$
## $E(Y|_x)=e^{\beta_0 +\beta_1 x}$
## $$f(y|_x)=\frac{e^{(\beta_0 +\beta_1 x)y} \exp(-e^{\beta_0 +\beta_1 x})}{y!}$$

### $$\ln L = \sum_{i}\left( (\beta_0+\beta_1 x_i)y_i - e^{\beta_0+\beta_1 x_i} - \ln (y_i!) \right)$$

### $$ \ln (E(Y|_x)) = \beta_0 + \beta_1 x. $$ Linear again!!


```{r}
pois_reg <- glm(data=data, y ~ x, family=poisson)
```

<br />
<br />
<br />
<br />

```{r}
slope=1.5
x=seq(0, 1, by=0.1)
y=rpois(11, exp(1+slope*x))


pois_reg <- glm(data=data.frame(x=x,y=y), y ~ x, family=poisson)
p_slope = coef(summary(pois_reg))[2,1]; p_interc <- coef(summary(pois_reg))[1,1]
coef(summary(pois_reg))
lin_reg <- glm(data=data.frame(x=x,y=y), y ~ x, family=gaussian)
l_slope = coef(summary(lin_reg))[2,1]; l_interc <- coef(summary(lin_reg))[1,1]
coef(summary(lin_reg))

ggplot(data = data.frame(x=x,y=y))+geom_point(aes(x=x,y=y)) +
  geom_line( data=data.frame(x=seq(0, 1, by=0.05), y=exp(1+slope*seq(0, 1, by=0.05)))     , aes(x=x,y=y)) +
  geom_line( data=data.frame(x=seq(0, 1, by=0.05), y=exp(p_interc+p_slope*seq(0, 1, by=0.05))), aes(x=x,y=y), color = "red") +
  geom_line( data=data.frame(x=seq(0, 1, by=0.05), y=l_interc+l_slope*seq(0, 1, by=0.05)), aes(x=x,y=y), color = "blue")



```

<br />
<br />
<br />
<br />


# **Regressions to compare two groups: P-value calculation.**

## - State marker expression data.

### What's the population? What is the random variable? Samples?

### Example: 
- population of virginia who had heart attack. 
- Median of NK$\kappa$B expression from all Treg cells.
- 16 people.

### What is the distribution? 
- one side open ended counting data: Poisson, Negative binomial distribution
- data transformed by $\text{asinh}=\ln (x+\sqrt{x^2+1}) \sim \ln 2 + \ln x \text{ for} x>>1$
- Gaussian.

### Use T-test or linear regression

```{r}
ggplot(data = data.frame(x=seq(-10, 100, by=1), y=asinh(seq(-10, 100, by=1))))+geom_line(aes(x=x,y=y)) + coord_fixed(ratio = 4/1) +
  labs(title = paste("y = asinh(x)")) 

```

<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />

##  - Proportion data. Cluster proportions.

### What is the distribution? 

#### - Binomial distribution. **Logistic regression** or Quasi-logistic regression.
#### $$ P(x=k)= {n \choose x}p(p+k\phi)^{k-1}(1-p-k\phi)^{n-k}$$
#### - If $X \sim B(n,p)$, $X \sim N(np, npq)$, approximately ($n\geq 15$). Then, **t-test** or linear regression can be applied.

```{r}
ggplot(data = data.frame(x=seq(0.01,0.99, by=0.01), y=logit( seq(0.01,0.99, by=0.01) )))+geom_line(aes(x=x,y=y)) + coord_fixed(ratio = 1/9) +
  labs(title = paste("y = logit(x)"))
```
#### - $$ \ln\left( \frac{proportion}{1-proportion}  \right) \sim Gaussian$$. Apply *t-test or linear regression* to **log odds ratio**.  


## Weights
### - Maximize likelihood function, $L$
### $$L = \prod_1^n \left( \frac{1}{\sqrt{2\pi}\sigma}\exp \left(-\frac{(\beta_0 + \beta_1x_i - y_i)^2}{2\sigma^2} \right) \right)^{w_i}.$$
### $$ -\ln L = Constant \cdot \sum^n_1w_i (\beta_0 + \beta_1x_i - y_i)^2$$
<!-- ### Random numbers -->

<!-- # ```{r} -->
<!-- # # temp <- rpois(10000, 500) -->
<!-- # # #temp -->
<!-- # # hist(temp) -->
<!-- # #  -->
<!-- # #  -->
<!-- # # tt=seq(-3,3,by=0.01) -->
<!-- # # plot(dnorm(tt,0,1),tt) -->
<!-- # ``` -->


<!-- # ```{r} -->
<!-- # temp_log <- log(temp) -->
<!-- # hist(temp_log) -->
<!-- # ``` -->

<!-- # ```{r} -->
<!-- # # qqnorm(temp_log[1:50]); qqline(temp_log[1:50], col = 2) -->
<!-- # #  -->
<!-- # #  -->
<!-- # # y <- rt(20000, df = 5) -->
<!-- # # qqnorm(y); qqline(y, col = 2) -->
<!-- # ``` -->


<!-- # ```{r} -->
<!-- #  -->
<!-- # # bin_temp <- rbinom(100, 100, 0.1) -->
<!-- # # hist(bin_temp/100) -->
<!-- # #  -->
<!-- # # hist(inv.logit(bin_temp/100)) -->
<!-- # #  -->
<!-- # # qqnorm(inv.logit(bin_temp/100)); qqline(inv.logit(bin_temp/100), col = 2) -->
<!-- #  -->
<!-- # ``` -->

