class: middle, inverse

.leftcol30[
<center>
<img src="https://github.com/emse-madd-gwu/emse-madd-gwu.github.io/raw/master/images/madd_hex_sticker.png" width=250>
</center>
]

.rightcol70[

# Week 9: .fancy[Uncertainty]

### EMSE 6035: Marketing Analytics for Design Decisions

### John Paul Helveston

### October 27, 2021

]

---

# Pilot Analysis Report

### Assignment is now [posted](https://madd.seas.gwu.edu/2021-Fall/p4-pilot-analysis.html)

### Due 11/07 (that's 11 days from now)

---

# Quick correction from last week

**Observations** - Height of students (inches):

```
#> [1] 65 69 66 67 68 72 68 69 63 70
```

a) Let's say we know that the height of students, `\(\tilde{x}\)`, in a classroom follows a normal distribution. A professor obtains the above height measurements of students in her classroom. What is the log-likelihood that `\(\tilde{x} \sim \mathcal{N} (68, 4)\)`? In other words, compute `\(\ln \mathcal{L} (\mu = 68, \sigma = 4)\)`.

--

b) Compute the log-likelihood function using the same standard deviation `\((\sigma = 4)\)` but with the following different values for the mean, `\(\mu: 66, 67, 68, 69, 70\)`. How do the results compare? Which value for `\(\mu\)` produces the highest log-likelihood?

---

.leftcol[

## Computing the<br>_likelihood_

Load the data

```r
x <- c(65, 69, 66, 67, 68, 72, 68, 69, 63, 70)
```

Compute the value of f(x) for each x

```r
f_x <- dnorm(x, 68, 4)
```

The likelihood is the product of the values in `f_x`

```r
prod(f_x)
```

```
#> [1] 1.447528e-11
```

]

--

.rightcol[

## Computing the<br>_log-likelihood_

Take the log of the likelihood

```r
log(prod(f_x))
```

```
#> [1] -24.95858
```

The way we typically compute the log-likelihood is by summing the log of the values in `f_x`

```r
sum(log(f_x))
```

```
#> [1] -24.95858
```

]

---

.leftcol[

```r
library(tidyverse)

# Create a vector of values for the mean
means <- c(66, 67, 68, 69, 70)

# Compute the log-likelihood using different values for the mean:
L1 <- sum(log(dnorm(x, means[1], 4)))
L2 <- sum(log(dnorm(x, means[2], 4)))
L3 <- sum(log(dnorm(x, means[3], 4)))
L4 <- sum(log(dnorm(x, means[4], 4)))
L5 <- sum(log(dnorm(x, means[5], 4)))
logLiks <- c(L1, L2, L3, L4, L5)

# Plot the result:
df <- data.frame(means, logLiks)
df %>%
  ggplot(aes(x = means, y = logLiks)) +
  geom_line() +
  geom_point() +
  theme_bw() +
  labs(
    x = "Mean Value",
    y = "Log-likelihood Values"
  )
```

]

.rightcol[

<img src="figs/unnamed-chunk-7-1.png" width="432" />

]

---

class: inverse, middle

# Week 9: .fancy[Uncertainty]

### 1. Computing uncertainty
### 2. Reshaping data
### BREAK
### 3. Cleaning pilot data
### 4. Estimating pilot data models

---

class: inverse, middle

# Week 9: .fancy[Uncertainty]

### 1. .orange[Computing uncertainty]
### 2. Reshaping data
### BREAK
### 3. Cleaning pilot data
### 4. Estimating pilot data models

---

class: center

# Maximum likelihood estimation

<center>
<img src="images/mle1.png" width=100%>
</center>

---

<center>
<img src="images/mle2.png" width=90%>
</center>

---

class: middle, center

### The _curvature_ of the log-likelihood function is related to the hessian

<center>
<img src="images/covariance.png" width=500>
</center>

---

class: middle, center

### The _curvature_ of the log-likelihood function is related to the hessian

<center>
<img src="images/covariance2.png" width=900>
</center>

---

class: middle, center

### We usually report parameter uncertainty ("standard errors") with `\(\sigma\)` values

<center>
<img src="images/uncertainty.png" width=1100>
</center>

---

class: inverse

# Practice Question 1

Suppose we estimate a model and get the following results:

$$
\hat{\beta} = `\begin{bmatrix} -0.4 \\ 0.5 \end{bmatrix}`
$$

$$
\nabla_{\beta}^2 \ln(\mathcal{L}) = `\begin{bmatrix} -6000 & 60 \\ 60 & -700 \end{bmatrix}`
$$

a) Use the hessian to compute the standard errors for `\(\hat{\beta}\)`

b) Use the standard errors to compute a 95% confidence interval around `\(\hat{\beta}\)`

---

# .center[Simulating uncertainty]

We can use the coefficients and hessian from a model to obtain draws that reflect parameter uncertainty

.leftcol[

```r
beta <- c(-0.7, 0.1, -4.0)
hessian <- matrix(c(
    -6000,   50,   60,
       50, -700,   50,
       60,   50, -300),
    ncol = 3, byrow = TRUE)
```

]

.rightcol[

```r
covariance <- -1*solve(hessian)
draws <- MASS::mvrnorm(10^5, beta, covariance)
head(draws)
```

```
#>            [,1]       [,2]      [,3]
#> [1,] -0.7219066 0.12844865 -3.979129
#> [2,] -0.7056323 0.05400818 -3.990519
#> [3,] -0.6969429 0.07709515 -4.027605
#> [4,] -0.7003554 0.07658722 -4.063454
#> [5,] -0.7061336 0.09679411 -4.110877
#> [6,] -0.6947397 0.15654305 -3.869518
```

]

---

# .center[Simulating uncertainty]

We can use the coefficients and hessian from a model to obtain draws that reflect parameter uncertainty

.cols3[

```r
hist(draws[, 1])
```

<img src="figs/unnamed-chunk-12-1.png" width="522.144" />

]

.cols3[

```r
hist(draws[, 2])
```

<img src="figs/unnamed-chunk-13-1.png" width="522.144" />

]

.cols3[

```r
hist(draws[, 3])
```

<img src="figs/unnamed-chunk-14-1.png" width="522.144" />

]

---

class: inverse

# Practice Question 2

.leftcol[

Suppose we estimate the following utility model describing preferences for cars:

$$
u_j = \alpha p_j + \beta_1 x_j^{mpg} + \beta_2 x_j^{elec} + \varepsilon_j
$$

a) Generate 10,000 draws of the model coefficients using the estimated coefficients and hessian. Use the `mvrnorm()` function from the `MASS` library.

b) Use the draws to compute the mean and 95% confidence intervals of each parameter estimate.

]

.rightcol[

The estimated model produces the following results:

Parameter | Coefficient
----------|------------
`\(\alpha\)` | -0.7
`\(\beta_1\)` | 0.1
`\(\beta_2\)` | -0.4

Hessian:

$$
`\begin{bmatrix} -6000 & 50 & 60 \\ 50 & -700 & 50 \\ 60 & 50 & -300 \end{bmatrix}`
$$

]

---

class: center

## Download the [logitr-cars](https://github.com/emse-madd-gwu/logitr-cars) repo from GitHub

<center>
<img src="images/logitr-cars.png" width=900>
</center>

---

# .center[Computing and visualizing uncertainty]

<br>

.rightcol80[

## 1. Open `logitr-cars`

## 2. Open `code/5.1-uncertainty.R`

]

---

class: inverse, middle

# Week 9: .fancy[Uncertainty]

### 1. Computing uncertainty
### 2. .orange[Reshaping data]
### BREAK
### 3. Cleaning pilot data
### 4. Estimating pilot data models

---

# .center[Names, Values, and Observations]

- Variable **Name**: The name of something you can measure
- Variable **Value**: One instance of a measured variable
- **Observation**: A set of associated measurements across multiple variables

--

.code100[

```r
head(fed_spend_long)
```

```
#> # A tibble: 6 × 3
#>   department  year rd_budget_mil
#>   <chr>      <dbl>         <dbl>
#> 1 DOD         1976         35696
#> 2 NASA        1976         12513
#> 3 DOE         1976         10882
#> 4 HHS         1976          9226
#> 5 NIH         1976          8025
#> 6 NSF         1976          2372
```

]

---

# "Long" format data

- Each **variable** has its own **column**
- Each **observation** has its own **row**

<center>
<img src="images/tidy-data.png" width = "1000">
</center>

---

.leftcol[

# "Long" format data

- Each **variable** has its own **column**
- Each **observation** has its own **row**

]

.rightcol[

```
#> # A tibble: 6 × 3
#>   department  year rd_budget_mil
#>   <chr>      <dbl>         <dbl>
#> 1 DOD         1976         35696
#> 2 NASA        1976         12513
#> 3 DOE         1976         10882
#> 4 HHS         1976          9226
#> 5 NIH         1976          8025
#> 6 NSF         1976          2372
```

]

<center>
<img src="images/tidy-data.png" width = "1000">
</center>

---

.leftcol40[.code70[

# "Long" format

```
#> # A tibble: 6 × 3
#>   department  year rd_budget_mil
#>   <chr>      <dbl>         <dbl>
#> 1 DOD         1976         35696
#> 2 NASA        1976         12513
#> 3 DOE         1976         10882
#> 4 HHS         1976          9226
#> 5 NIH         1976          8025
#> 6 NSF         1976          2372
```

]]

.rightcol60[.code70[

# "Wide" format

```
#> # A tibble: 6 × 8
#>    year   DHS   DOC   DOD   DOE   DOT   EPA   HHS
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  1976     0   819 35696 10882  1142   968  9226
#> 2  1977     0   837 37967 13741  1095   966  9507
#> 3  1978     0   871 37022 15663  1156  1175 10533
#> 4  1979     0   952 37174 15612  1004  1102 10127
#> 5  1980     0   945 37005 15226  1048   903 10045
#> 6  1981     0   829 41737 14798   978   901  9644
```

]]

---

.center[.font130[**"Long" format: variable names describe the values below them**]]

.leftcol40[.code70[

## "Long" format

```
#> # A tibble: 6 × 3
#>   department  year rd_budget_mil
#>   <chr>      <dbl>         <dbl>
#> 1 DOD         1976         35696
#> 2 NASA        1976         12513
#> 3 DOE         1976         10882
#> 4 HHS         1976          9226
#> 5 NIH         1976          8025
#> 6 NSF         1976          2372
```

]]

.rightcol60[.code70[

## "Wide" format

```
#> # A tibble: 6 × 8
#>    year   DHS   DOC   DOD   DOE   DOT   EPA   HHS
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  1976     0   819 35696 10882  1142   968  9226
#> 2  1977     0   837 37967 13741  1095   966  9507
#> 3  1978     0   871 37022 15663  1156  1175 10533
#> 4  1979     0   952 37174 15612  1004  1102 10127
#> 5  1980     0   945 37005 15226  1048   903 10045
#> 6  1981     0   829 41737 14798   978   901  9644
```

]]

---

# **Quick practice 1**: "long" or "wide" format?

**Description**: Tuberculosis cases in various countries

.code100[

```
#> # A tibble: 6 × 4
#>   country      year  cases population
#>   <chr>       <dbl>  <dbl>      <dbl>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
```

]

---

# **Quick practice 2**: "long" or "wide" format?

**Description**: Word counts by character type in "Lord of the Rings" trilogy

.code90[

```
#> # A tibble: 9 × 4
#>   Film                       Race   Female  Male
#>   <chr>                      <chr>   <dbl> <dbl>
#> 1 The Fellowship Of The Ring Elf      1229   971
#> 2 The Fellowship Of The Ring Hobbit     14  3644
#> 3 The Fellowship Of The Ring Man         0  1995
#> 4 The Return Of The King     Elf       183   510
#> 5 The Return Of The King     Hobbit      2  2673
#> 6 The Return Of The King     Man       268  2459
#> 7 The Two Towers             Elf       331   513
#> 8 The Two Towers             Hobbit      0  2463
#> 9 The Two Towers             Man       401  3589
```

]

---

# **Quick practice 3**: "long" or "wide" format?
**Description**: Word counts by character type in "Lord of the Rings" trilogy

.code60[

```
#> # A tibble: 18 × 4
#>    Film                       Race   Gender Word_Count
#>    <chr>                      <chr>  <chr>       <dbl>
#>  1 The Fellowship Of The Ring Elf    Female       1229
#>  2 The Fellowship Of The Ring Elf    Male          971
#>  3 The Fellowship Of The Ring Hobbit Female         14
#>  4 The Fellowship Of The Ring Hobbit Male         3644
#>  5 The Fellowship Of The Ring Man    Female          0
#>  6 The Fellowship Of The Ring Man    Male         1995
#>  7 The Return Of The King     Elf    Female        183
#>  8 The Return Of The King     Elf    Male          510
#>  9 The Return Of The King     Hobbit Female          2
#> 10 The Return Of The King     Hobbit Male         2673
#> 11 The Return Of The King     Man    Female        268
#> 12 The Return Of The King     Man    Male         2459
#> 13 The Two Towers             Elf    Female        331
#> 14 The Two Towers             Elf    Male          513
#> 15 The Two Towers             Hobbit Female          0
#> 16 The Two Towers             Hobbit Male         2463
#> 17 The Two Towers             Man    Female        401
#> 18 The Two Towers             Man    Male         3589
```

]

---

class: inverse, center, middle

# Reshaping data with
## `pivot_longer()` and `pivot_wider()`

---

## .center[From "long" to "wide" with `pivot_wider()`]

<center>
<img src="images/tidy-spread.png" width=600>
</center>

---

## .center[From "long" to "wide" with `pivot_wider()`]

.leftcol45[

```r
head(fed_spend_long)
```

```
#> # A tibble: 6 × 3
#>   department  year rd_budget_mil
#>   <chr>      <dbl>         <dbl>
#> 1 DOD         1976         35696
#> 2 NASA        1976         12513
#> 3 DOE         1976         10882
#> 4 HHS         1976          9226
#> 5 NIH         1976          8025
#> 6 NSF         1976          2372
```

]

.rightcol55[

```r
fed_spend_wide <- fed_spend_long %>%
*  pivot_wider(
*    names_from = department,
*    values_from = rd_budget_mil)

head(fed_spend_wide)
```

```
#> # A tibble: 6 × 7
#>    year   DHS   DOC   DOD   DOE   DOT   EPA
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  1976     0   819 35696 10882  1142   968
#> 2  1977     0   837 37967 13741  1095   966
#> 3  1978     0   871 37022 15663  1156  1175
#> 4  1979     0   952 37174 15612  1004  1102
#> 5  1980     0   945 37005 15226  1048   903
#> 6  1981     0   829 41737 14798   978   901
```

]

---

## .center[From "wide" to "long" with `pivot_longer()`]

<center>
<img src="images/tidy-gather.png" width=600>
</center>

---

## .center[From "wide" to "long" with `pivot_longer()`]

.leftcol45[

```r
names(fed_spend_wide)
```

```
#> [1] "year" "DHS" "DOC" "DOD" "DOE" "DOT" "EPA" "HHS" "Interior" "NASA" "NIH" "NSF" "Other" "USDA" "VA"
```

]

.rightcol55[

```r
fed_spend_long <- fed_spend_wide %>%
*  pivot_longer(
*    cols = DHS:VA,
*    names_to = "department",
*    values_to = "rd_budget_mil")

head(fed_spend_long)
```

```
#> # A tibble: 6 × 3
#>    year department rd_budget_mil
#>   <dbl> <chr>              <dbl>
#> 1  1976 DHS                    0
#> 2  1976 DOC                  819
#> 3  1976 DOD                35696
#> 4  1976 DOE                10882
#> 5  1976 DOT                 1142
#> 6  1976 EPA                  968
```

]

---

## Can also set `cols` by selecting which columns _not_ to use

.leftcol45[

```r
names(fed_spend_wide)
```

```
#> [1] "year" "DHS" "DOC" "DOD" "DOE" "DOT" "EPA" "HHS" "Interior" "NASA" "NIH" "NSF" "Other" "USDA" "VA"
```

]

.rightcol55[

```r
fed_spend_long <- fed_spend_wide %>%
  pivot_longer(
*    cols = -year,
    names_to = "department",
    values_to = "rd_budget_mil")

head(fed_spend_long)
```

```
#> # A tibble: 6 × 3
#>    year department rd_budget_mil
#>   <dbl> <chr>              <dbl>
#> 1  1976 DHS                    0
#> 2  1976 DOC                  819
#> 3  1976 DOD                35696
#> 4  1976 DOE                10882
#> 5  1976 DOT                 1142
#> 6  1976 EPA                  968
```

]

---

class: inverse
20:00
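
Before the exercise, here's a minimal round-trip sketch of the two verbs on a toy data frame (the values are invented for illustration, not taken from `pv_cells.csv` or `milk_production.csv`):

```r
library(tidyverse)

# Toy long-format data (hypothetical values, for illustration only)
long <- tibble(
  department    = c("DOD", "NASA", "DOD", "NASA"),
  year          = c(1976, 1976, 1977, 1977),
  rd_budget_mil = c(35696, 12513, 37967, 12890)
)

# Long -> wide: one column per department
wide <- long %>%
  pivot_wider(names_from = department, values_from = rd_budget_mil)

# Wide -> long: back to one row per (year, department) observation
long2 <- wide %>%
  pivot_longer(cols = -year, names_to = "department",
               values_to = "rd_budget_mil")
```

The round trip recovers the same observations (column order aside), which is a quick way to check your `names_to` / `values_to` arguments.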
# Your turn: Long <--> Wide

Open the `practice.Rmd` file. Under "In Class Question 1", write code to read in the following two files:

- `pv_cells.csv`: Data on solar photovoltaic cell production by country
- `milk_production.csv`: Data on milk production by state

Now modify the format of each:

- If the data are in "wide" format, convert them to "long" with `pivot_longer()`
- If the data are in "long" format, convert them to "wide" with `pivot_wider()`

---

class: inverse, center

# .fancy[Break]
05:00
---

class: inverse, middle

# Week 9: .fancy[Uncertainty]

### 1. Computing uncertainty
### 2. Reshaping data
### BREAK
### 3. .orange[Cleaning pilot data]
### 4. Estimating pilot data models

---

class: center

## Download the [formr4conjoint](https://github.com/jhelvy/formr4conjoint) repo from GitHub

<center>
<img src="images/formr4conjoint.png" width=900>
</center>

---

# .center[Cleaning formr survey data]

<br>

.rightcol80[

## 1. Open `formr4conjoint.Rproj`

## 2. Open `code/data_cleaning.R`

]

---

class: inverse
20:00
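
One step you will likely need, sketched here on made-up data (the `respID` and `cbc_q` column names are hypothetical; your formr export will differ): pivoting the one-row-per-respondent survey responses into one row per choice question.

```r
library(tidyverse)

# Hypothetical wide survey export: one row per respondent,
# one column per conjoint question (names invented for this sketch)
survey <- tibble(
  respID = c(1, 2),
  cbc_q1 = c(2, 1),  # alternative chosen in question 1
  cbc_q2 = c(3, 3),
  cbc_q3 = c(1, 2)
)

# One row per respondent-question pair, holding the chosen alternative
choices <- survey %>%
  pivot_longer(
    cols      = starts_with("cbc_q"),
    names_to  = "qID",
    values_to = "choice") %>%
  mutate(qID = parse_number(qID))
```

From here you would join these choices onto your design file to flag the chosen alternative in each choice set.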
# Your Turn

## As a team, pick up where you left off last week and create a `choiceData` data frame in a "long" format

---

class: inverse, middle

# Week 9: .fancy[Uncertainty]

### 1. Computing uncertainty
### 2. Reshaping data
### BREAK
### 3. Cleaning pilot data
### 4. .orange[Estimating pilot data models]

---

# .center[Estimating pilot data models]

<br>

.rightcol80[

## 1. Open `formr4conjoint.Rproj`

## 2. Open `code/modeling.R`

]

---

class: inverse

# Your Turn

## As a team:

1. Use your `choiceData` data frame to estimate preliminary choice models.
2. Interpret your model coefficients with uncertainty.
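
---

# .center[Interpreting coefficients with uncertainty]

A minimal sketch of the "interpret with uncertainty" step, using the approach from the "Simulating uncertainty" slides. The `beta` and `hessian` values here are the illustrative ones from those slides, not estimates from your model:

```r
# Illustrative coefficients and hessian (swap in your model's)
beta <- c(-0.7, 0.1, -4.0)
hessian <- matrix(c(
  -6000,   50,   60,
     50, -700,   50,
     60,   50, -300),
  ncol = 3, byrow = TRUE)

# Covariance matrix is the negative inverse of the hessian;
# standard errors are the square roots of its diagonal
covariance <- -1 * solve(hessian)
se <- sqrt(diag(covariance))

# Draws reflecting parameter uncertainty
draws <- MASS::mvrnorm(10^4, beta, covariance)

# Mean and 95% confidence interval for each coefficient
ci <- apply(draws, 2, quantile, probs = c(0.025, 0.5, 0.975))
```

A coefficient whose 95% interval excludes zero is distinguishable from zero at that confidence level.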