Mărimea eșantionului și semnificația statistică pentru testele chi-pătrat

(Acest articol a fost publicat pentru prima dată pe Jason Bryerși a contribuit cu drag la R-Bloggers). (Puteți raporta problema despre conținutul de pe această pagină aici)

Doriți să vă împărtășiți conținutul pe R-Bloggers? Faceți clic aici dacă aveți un blog sau aici dacă nu.

În această postare vom explora relația dintre dimensiunea eșantionului (n) și semnificație statistică pentru chi-pătrat () test. Reamintim că din distribuția normală, construim un interval de încredere folosind:

unde Z. este statistica testului și:

unde s este eșantionul abatere standard. De obicei nul este zero caz în care respingem nul Ipoteză atunci când încrederea nu se întinde pe zero. Dacă dorim să construim un interval de încredere de 95%, atunci . Presupunând că abaterea standard a eșantionului este constantă indiferent de mărimea eșantionului (o presupunere corectă), atunci ca n crește eroarea standard scade. Următoarele calculează intervalul de încredere pentru n variind de la 10 la 400 presupunând o eșantion de abatere standard de 0,15 și 95% nivel de încredere. Când apoi .

# Define some parameters
sig_level <- .95  # Significance level, 95% here
es <- 0.15        # Effect size in standard units
null_val <- 0     # null value

#' Calculate the standard error
#' 
#' This function will calculate the standard error from a vector of observations or with a given
#' sample standard deviation and sample size.
#' 
#' @param x numeric vector of observations.
#' @param sigma the sample standard deviation.
#' @param n sample size.
standard_error <- function(x, sigma = sd(x), n = length(x)) {
    if(!missing(x)) { # Some basic error checking
        if(sigma != sd(x)) { warning('The sample standard deviation (sigma) is not equal to sd(x)')}
        if(n != length(x)) { warning('The sample size (n) is not equal to length(x).' )}
    }
    return(sigma / sqrt(n))
}
# Create a data.frame with varying sample sizes and the corresponding standard error
df <- data.frame(
    n = 10:400,
    se = standard_error(sigma = 1, n = 10:400)
)
cv <- abs(qnorm((1 - sig_level) / 2)) # Critical value (z test statistic)
df$ci_low <- es - cv * df$se
df$ci_high <- es + cv * df$se
df$sig <- null_val < df$ci_low | null_val > df$ci_high
min_n <- df$n(df$sig) |> min()
ggplot(df, aes(x = n, y = se, color = sig)) + 
    geom_path() +
    geom_point() +
    scale_color_brewer(paste0('p < ', (1 - sig_level)), type="qual", palette = 6) +
    ggtitle(paste0('Minumum n for p < ', (1 - sig_level), ': ', min_n),
            subtitle = paste0('effect size: ', es, '; null value: ', null_val))

Chi-squared () statistica testului este definită ca:

unde O este numărul observat și E este numărul așteptat. Spre deosebire de eroarea standard pentru date numerice, n nu este în mod explicit în formulă și, prin urmare, îl face un pic mai dificil să se determine că mărimea eșantionului de impact a respins nul ipoteză. Mai mult decât atât, deoarece chi-pătratul este calculat din numărul de celule într-un tabel cu lungime și dimensiune variabilă (unu sau două dimensiuni în mod specific), determinând modul în care n Impacturi respingând nul sau nu necesită mai mulți parametri.

Răspunzând la întrebarea cât de mare face n trebuie să fie detectarea unui rezultat semnificativ statistic (adică pentru a respinge nul Ipoteză) este arbitrat ca putere. În timp ce pentru calcularea puterii pentru datele numerice a avut un parametru, abaterea standard a eșantionului, aici trebuie să luăm în considerare proporția de observații în diferite celule. De exemplu, considerați că avem o variabilă cu trei niveluri și ne așteptăm ca proporția de observații din cele trei grupuri să fie de 33%, 25%și, respectiv, 42%. Dacă dimensiunea eșantionului nostru este de 100, atunci ne așteptăm să existe 33, 25 și 42 și observații pentru cele trei categorii. Această funcție va calcula, pentru diferite dimensiuni a eșantionului p-valoare. Există alți parametri care sunt documentați mai jos. O plot Funcția este de asemenea definită folosind cadrul orientat cu obiecte S3.

#' Calculate p-value from a chi-squared test with varying sample sizes
#'
#' This algorithm will start with an initial sample size (`n_start`) and perform a chi-squared test
#' with a vector of counts equal to `n * probs`. This will repeat increasing the sample size by
#' `n_step` until the p-value from the chi-squared test is less than `p_stop`.
#'
#' @param vector of cell probabilities. The sum of the values must equal 1.
#' @param sig_level significance level.
#' @param p_stop the p-value to stop estimating chi-squared tests.
#' @param max_n maximum n to attempt if `p_value` is never less than `p_stop`.
#' @param min_cell_size minimum size per cell to perform the chi-square test.
#' @param n_start the starting sample size.
#' @param n_step the increment for each iteration.
#' @return a data.frame with three columns: n (sample size), p_value, and sig (TRUE if
#'         p_value < sig_level).
#' @importFrom DescTools power.chisq.test CramerV
chi_squared_power <- function(
        probs,
        sig_level = 0.05,
        p_stop = 0.01,
        power = 0.80,
        power_stop = 0.90,
        max_n = 100000,
        min_cell_size = 10,
        n_start = 10,
        n_step = 10
) {
    if(sum(probs) != 1) { # Make sure the sum is equal to 1
        stop('The sum of the probabilities must equal 1.')
    } else if(length(unique(probs)) == 1) {
        stop('All the probabilities are equal.')
    }

    n <- n_start
    p_values <- numeric()
    power_values <- numeric()
    df <- ifelse(is.vector(probs),
                 length(probs) - 1,
                 min(dim(probs)) - 1) # Degrees of freedom
    repeat {
        x <- (probs * n) |> round()
        if(all(x > min_cell_size)) {
            cs <- chisq.test(x, rescale.p = TRUE, simulate.p.value = FALSE)
            p_values(n / n_step) <- cs$p.value
            pow <- DescTools::power.chisq.test(
                n = n,
                w = DescTools::CramerV(as.table(x)),
                df = df,
                sig.level = sig_level
            )
            power_values(n / n_step) <- pow$power
            if((cs$p.value < p_stop & pow$power > power_stop) | n > max_n) {
                break;
            }
        } else {
            p_values(n / n_step) <- NA
            power_values(n / n_step) <- NA
        }
        n <- n + n_step
    }
    result <- data.frame(n = seq(10, length(p_values) * n_step, n_step),
                         p_value = p_values,
                         sig = p_values < sig_level,
                         power = power_values)
    class(result) <- c('chisqpower', 'data.frame')
    attr(result, 'probs') <- probs
    attr(result, 'sig_level') <- sig_level
    attr(result, 'p_stop') <- p_stop
    attr(result, 'power') <- power
    attr(result, 'power_stop') <- power_stop
    attr(result, 'max_n') <- max_n
    attr(result, 'n_step') <- n_step
    return(result)
}

#' Plot the results of chi-squared power estimation
#'
#' @param x result of (chi_squared_power()).
#' @param plot_power whether to plot the power curve.
#' @param plot_p whether to plot p-values.
#' @param digits number of digits to round to.
#' @param segement_color color of the lines marking where power and p values exceed threshold.
#' @param sgement_linetype linetype of the lines marking where power and p values exceed threshold.
#' @param p_linetype linetype for the p-values.
#' @param power_linetype linetype for the power values.
#' @param title plot title. If missing a title will be automatically generated.
#' @parma ... currently not used.
#' @return a ggplot2 expression.
plot.chisqpower <- function(
        x,
        plot_power = TRUE,
        plot_p = TRUE,
        digits = 4,
        segment_color="grey60",
        segment_linetype = 1,
        p_linetype = 1,
        power_linetype = 2,
        title,
        ...
) {
    pow <- attr(x, 'power')

    p <- ggplot(x(!is.na(x$p_value),), aes(x = n, y = p_value))

    if(plot_power) {
        if(any(x$power > pow, na.rm = TRUE)) {
            min_n_power <- min(x(x$power > pow,)$n, na.rm = TRUE)
            p <- p +
                geom_segment(
                    x = 0,
                    xend = min_n_power,
                    y = pow,
                    yend = pow,
                    color = segment_color,
                    linetype = segment_linetype) +
                ggplot2::annotate(
                    geom = 'text',
                    x = 0,
                    y =  pow,
                    label = paste0('Power=",  pow),
                    vjust = -1,
                    hjust = 0) +
                geom_segment(
                    x = min_n_power,
                    xend = min_n_power,
                    y = pow,
                    yend = 0,
                    color = segment_color,
                    linetype = segment_linetype) +
                ggplot2::annotate(
                    geom = "text',
                    x = min_n_power, y = 0,
                    label = paste0('n = ', prettyNum(min_n_power, big.mark = ',')),
                    vjust = 1,
                    hjust = -0.1)
        }
        p <- p +
            geom_path(
                aes(y = power),
                color="#7570b3",
                linetype = power_linetype)
    }
    if(plot_p) {
        if(any(x$sig, na.rm = TRUE)) {
            p <- p +
                geom_segment(
                    x = 0,
                    xend = min(x(x$sig,)$n, na.rm = TRUE),
                    y = attr(x, 'sig_level'),
                    yend = attr(x, 'sig_level'),
                    color = segment_color,
                    linetype = segment_linetype) +
                ggplot2::annotate(
                    geom = 'text',
                    x = 0,
                    y =  attr(x, 'sig_level'),
                    label = paste0('p = ',  attr(x, 'sig_level')),
                    vjust = -1,
                    hjust = 0) +
                geom_segment(
                    x = min(x(x$sig,)$n, na.rm = TRUE),
                    xend = min(x(x$sig,)$n, na.rm = TRUE),
                    y = attr(x, 'sig_level'),
                    yend = 0,
                    color = segment_color,
                    linetype = segment_linetype) +
                ggplot2::annotate(
                    geom = 'text',
                    x = min(x(x$sig,)$n, na.rm = TRUE),
                    y = 0,
                    label = paste0('n = ', prettyNum(min(x(x$sig,)$n, na.rm = TRUE), big.mark = ',')),
                    vjust = 1,
                    hjust = -0.1)
        }
        p <- p +
            geom_path(
                alpha = 0.7,
                linetype = p_linetype)
            # geom_point(aes(color = sig), size = 1) +
            # scale_color_brewer(paste0('p < ', attr(x, 'sig_level')), type="qual", palette = 6)
    }

    if(missing(title)) {
        if(any(x$power > pow, na.rm = TRUE) & any(x$sig, na.rm = TRUE)) {
            min_n <- min(x(x$sig & x$power > pow,)$n, na.rm = TRUE)
            title <- paste0('Smallest n where p < ', attr(x, 'sig_level'), ' and power > ', pow, ': ',
                            prettyNum(min_n, big.mark = ','))
        } else {
            title <- paste0('No n found where p < ', attr(x, 'sig_level'), ' and power > ', pow)
        }
    }

    p <- p +
        ylim(c(0, 1)) +
        ylab('') +
        xlab('Sample Size') +
        ggtitle(title,
                subtitle = paste0('Probabilities: ', paste0(round(attr(x, 'probs'), digits = digits), collapse=", ")))

    return(p)
}

Revenind la exemplul nostru de mai sus, unde proporțiile celulare sunt de 33%, 25%și 42%, am avea nevoie pentru a respinge nul ipoteză.

csp1 <- chi_squared_power(probs =  c(.33, .25, .42))
csp1(csp1$sig,)$n |> min(na.rm = TRUE) # Smallest n that results in p < 0.05

plot(csp1)

În următorul exemplu, avem diferențe mult mai mici între celule cu 25%, 25%, 24%și 26%. În acest exemplu Înainte de a respinge nul ipoteză.

csp3 <- chi_squared_power(probs = c(.25, .25, .24, .26), max_n = 20000)
csp3(csp3$sig,)$n |> min(na.rm = TRUE) # Smallest n that results in p < 0.05

plot(csp3)

Această funcție va funcționa și cu date bidimensionale (adică în două variabile). Următorul exemplu de la Agresti (2007) analizează afilierea politică pe sexe (vezi documentația de ajutor pentru chisq.test().).

M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(M) <- list(gender = c("Femal", "Male"),
                    party = c("Democrat", "Independent", "Republican"))
M

       party
gender  Democrat Independent Republican
  Femal      762         327        468
  Male       484         239        477

sum(M)

Testul chi-pătrat sugerează că ar trebui să respingem nul Test de ipoteză.

chisq.test(M)

    Pearson's Chi-squared test

data:  M
X-squared = 30.07, df = 2, p-value = 2.954e-07

DescTools::CramerV(M) # Effect size

DescTools::power.chisq.test(n = sum(M),
                            w = DescTools::CramerV(M),
                            df = min(dim(M)) - 1,
                            sig.level = 1 - sig_level)

     Chi squared power calculation 

              w = 0.1044358
              n = 2757
             df = 1
      sig.level = 0.05
          power = 0.9997872

NOTE: n is the number of observations

Agresti a avut o dimensiune a eșantionului de 2757, dar putem pune întrebarea care este dimensiunea minimă a eșantionului ar avea nevoie pentru a detecta semnificația statistică? Mai întâi, transformăm numărul în proporții, apoi putem folosi chi_squared_power() Funcție pentru a găsi dimensiunea minimă a eșantionului pentru a respinge nul Test de ipoteză.

M_prob <- M / sum(M) # Convert the counts to percentages
csp4 <- chi_squared_power(probs = M_prob)
plot(csp4)

Pentru o aplicație mai robustă pentru estimarea puterii pentru multe teste statistice, consultați pachetul PWSRR R și aplicația strălucitoare corespunzătoare.

Mărimea eșantionului și semnificația statistică pentru testele chi-pătrat

LĂSAȚI UN MESAJ Renunțați la răspuns