The CGPfunctions package includes functions that I find useful for
teaching statistics especially to novices (as well as an opportunity to
sharpen my own R skills). I only write functions when I have a real need
– no theory – just help for actually practicing the art. They typically
are not “new” methods but rather wrappers around either base R or other
packages and are very task focused. This vignette covers one function
from the package that tries to help users (especially students) do one
thing well by pulling together pieces from a variety of places in
R
. Plot2WayANOVA
, which as the name implies
conducts a 2 way ANOVA and plots the results. I always try and find the
right balance between keeping the number of dependencies to a minimum
and not reinventing the wheel and writing functions that others have
done for me. The function makes use of the following non base r
packages.
ggplot2
as the work horse for all the actual
plottingcar
for it’s ability to compute Type II sums of
squares, we’ll address why that’s important in more detail later in the
scenario. We’ll also make use of it’s leveneTest
.sjstats
which takes out ANOVA table and gives us other
important information such as the effect sizes (η2 and ω2 ) through use of its
anova_stats
function. Prior to this version I had been
using my own local function but this runs rings around what I could
do.broomExtra::glance
will also help us grab very
important results like R2 and display themDescTool::PostHocTest
for accomplishing post hoc
testsThe ANOVA (Analysis of Variance) family of statistical techniques
allow us to compare mean differences of one outcome (dependent) variable
across two or more groups (levels) of one or more independent variables
(factor). It is also true that ANOVA is a special case of the GLM or
regression models so as the number of levels increase it might make more
sense to try one of those approaches. The 2 Way ANOVA allows for
comparisons of mean differences across 2 independent variables
factors
with a varying numbers of levels
in
each factor
.
If you prefer a more regression based approach with a very similar
plotted result I highly recommend the interactions
package
which I was unaware of until just recently. It is available through
CRAN.
The Plot2WayANOVA
function conducts a classic analysis
of variance (ANOVA) in a sane and defensible, albeit opinionated,
manner, not necessarily the only one. It’s real strength (I
hope) lies in the fact that it is pulled together in one function
and more importantly allows you to visualize the results concurrently
with no additional work.
Imagine that you are interested in understanding whether a car’s fuel
efficiency (mpg) varies based upon the type of transmission (automatic
or manual) and the number of cylinders the engine has. Let’s imagine
that the mtcars
data set is actually a random sample of 32
cars from different manufacturers and use the mean mpg
grouped by am
and cyl
to help inform our
thinking. While we expect variation across our sample we’re interested
in whether the differences between the means by grouping of transmission
type and cylinders is significantly different than what we would expect
in random variation across the data.
In simplistic terms we want to know whether am
matters,
cyl
matters or if it depends on the interaction of the two.
It’s this interaction term that typically confuses novices or is
difficult to “see”. That’s where a good interaction graph can hopefully
play a key role, and that’s what the Plot2WayANOVA
focuses
on.
There’s no lack or tools or capabilities in base R or in the many packages to do this task. What this function tries to do is pull together the disparate pieces with a set of sane defaults and a simple interface to work with it. At its simplest you would require the library and then enter this command:
Plot2WayANOVA(formula = mpg ~ am * cyl, dataframe = mtcars)
which lays our question out in R’s vernacular with a formula and a
dataframe. Optionally we can specify a different confidence level and
choose a line or a bar graph. Over time the function has gained a
plethora of formatting options.
“Under the hood”, however there’s a lot of nice features at work.
am
and cyl
is equal.
There’s some debate as to how much imbalance is permissible. But you’ll
be given fair warning if there is any.aov
or Anova
you’ll be presented with
information about effect sizes like eta squared η2. They’re calculated
and appended as additional columns. If you’re unfamiliar with them and
want to know more especially where the numbers come from I recommend a
good introductory stats text. I recommend Learning Statistics with
R LSR see Table
14-1 on page 432.# Install from CRAN
install.packages("CGPfunctions")
# Or the development version from GitHub
# install.packages("devtools")
devtools::install_github("ibecav/CGPfunctions")
then load the library.
The only two required parameters are a formula and a dataframe (like) object. If we run the function in its simplest form here’s what we get.
Plot2WayANOVA(formula = mpg ~ am * cyl, dataframe = mtcars)
#>
#> Converting am to a factor --- check your results
#>
#> Converting cyl to a factor --- check your results
#>
#> --- WARNING! ---
#> You have one or more cells with less than 3 observations.
#> # A tibble: 6 × 3
#> am cyl count
#> <fct> <fct> <int>
#> 1 0 4 3
#> 2 0 6 4
#> 3 0 8 12
#> 4 1 4 8
#> 5 1 6 3
#> 6 1 8 2
#>
#> --- WARNING! ---
#> You have an unbalanced design. Using Type II sum of
#> squares, to calculate factor effect sizes eta and omega.
#> Your two factors account for 0.684 of the type II sum of
#> squares.
#> etasq | partial.etasq | omegasq | partial.omegasq | epsilonsq | cohens.f
#> ------------------------------------------------------------------------
#> 0.049 | 0.133 | 0.036 | 0.086 | 0.036 | 0.392
#> 0.602 | 0.656 | 0.571 | 0.598 | 0.578 | 1.382
#> 0.034 | 0.096 | 0.009 | 0.023 | 0.009 | 0.326
#> | | | | |
#>
#> etasq | term | sumsq | df | meansq | statistic | p.value | power
#> ------------------------------------------------------------------------
#> 0.049 | am | 36.767 | 1 | 36.767 | 3.999 | 0.056 | 0.515
#> 0.602 | cyl | 456.401 | 2 | 228.200 | 24.819 | < .001 | 1.000
#> 0.034 | am:cyl | 25.437 | 2 | 12.718 | 1.383 | 0.269 | 0.298
#> | Residuals | 239.059 | 26 | 9.195 | | |
#>
#> Table of group means
#> # A tibble: 6 × 15
#> # Groups: am [2]
#> am cyl TheMean TheSD TheSEM CIMuliplier LowerBoundCI UpperBoundCI
#> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 4 22.9 1.45 0.839 4.30 19.3 26.5
#> 2 0 6 19.1 1.63 0.816 3.18 16.5 21.7
#> 3 0 8 15.0 2.77 0.801 2.20 13.3 16.8
#> 4 1 4 28.1 4.48 1.59 2.36 24.3 31.8
#> 5 1 6 20.6 0.751 0.433 4.30 18.7 22.4
#> 6 1 8 15.4 0.566 0.400 12.7 10.3 20.5
#> # ℹ 7 more variables: LowerBoundSEM <dbl>, UpperBoundSEM <dbl>,
#> # LowerBoundSD <dbl>, UpperBoundSD <dbl>, N <int>, LowerBound <dbl>,
#> # UpperBound <dbl>
#>
#> Post hoc tests for all effects that were significant
#>
#> Posthoc multiple comparisons of means: Scheffe Test
#> 95% family-wise confidence level
#>
#> $cyl
#> diff lwr.ci upr.ci pval
#> 6-4 -4.756706 -10.029278 0.5158655 0.09684 .
#> 8-4 -7.329581 -11.723391 -2.9357710 0.00024 ***
#> 8-6 -2.572874 -7.620978 2.4752288 0.64828
#>
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Testing Homogeneity of Variance with Brown-Forsythe
#> *** Possible violation of the assumption ***
#> Brown-Forsythe Test for Homogeneity of Variance using median
#> Df F value Pr(>F)
#> group 5 2.736 0.04086 *
#> 26
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Testing Normality Assumption with Shapiro-Wilk
#>
#> Shapiro-Wilk normality test
#>
#> data: MyAOV_residuals
#> W = 0.96277, p-value = 0.3263
#>
#> Bayesian analysis of models in order
#> # A tibble: 4 × 4
#> model bf support margin_of_error
#> <chr> <dbl> <chr> <dbl>
#> 1 am + cyl 4445285. " data support is extreme" 0.0130
#> 2 am + cyl + am:cyl 3022148. " data support is extreme" 0.0147
#> 3 cyl 2274206. " data support is extreme" 0.0000000127
#> 4 am 86.6 " data support is very strong" 0.0000000188
#>
#> Interaction graph plotted...
In the console you’ll receive a series of messages chronicling your
progress and any diagnostic information. In this case am
and cyl
are being coerced to factors and you’re being
prompted to make sure that’s what is intended.
Next you receive a warning because you have a very unbalanced design. There are only two 8 cylinder cars with a manual transmission and twelve 8 cylinder cars with automatics. Whereas there are eight 4 cylinders with manual and only three that are automatics. Imbalance in our design worries us for two reasons. One is that it causes a lack of statistical power and creates some math challenges in deciding how to divide up the sums of the squared differences. This data set causes the more troublesome worry. Are the number of cylinders and manual versus automatic related systematically which would call our whole design into question. Make sure you can answer questions about which is at work here, or make sure you have a balanced design.
A table follows that is intended to summarize the findings. We’ll discuss it more later when we examine the plot.
The overall measures table can be very handy for comparing numerous
models. For example how does the AIC number change if we were to
eliminate am
?
The table of group means is useful for looking at summary by group. Want the best gas mileage? Buy a 4 cylinder manual car.
In our simple example the only statistically significant effect is for the main effect of number of cylinders. Accordingly the Scheffe test is run against the three types of cars with 4, 6, or 8 cylinders and we can see that with a difference of 7.3 mpg eight and four cylinder cars are statistically significant even as we control for the multiple simultaneous comparisons.
The next step is to test homogeneity of variance also known as (homoscedasticity). Since the math in our ANOVA rely on the assumption that the variance for the different groupings of cars is more or less equal, we need to check that assumption.
We’ll use the Brown-Forsythe test. When you run the
leveneTest
in R the default is actually a Brown-Forsythe,
to get a true Levene you must specify center = mean
.
Brown-Forsythe is actually more robust since it tests differences from
the median. Not surprisingly when we consult out table of group results
we have some reason for concern sine the standard deviations vary
widely.
Finally, let’s address the assumption that our errors or residuals are normally distributed. We’re looking for evidence that our residuals are skewed or tailed or otherwise misshapen in a way that would influence our results. Surprisingly, there is actually quite a bit of controversy on this point since on the one hand we have strong reason to believe that our sample will be imperfect and that our population will not necessarily be “perfectly normal” either. Some argue that some simple plotting is all that is necessary looking for an unspecifiable amount of non normality that will trigger a search for the source. Other prefer a more formal approach using one or more statistical tests.
Plot2WayANOVA
runs the most common test of the normality
assumption (there are many) the Shapiro-Wilk test The statistics look
good, no strong evidence in the data we have.
The default settings for the resultant plot are deliberately minimalistic, allowing you to focus visually on the pattern of means and the connecting lines. If you’re already used to looking at this sort of plot it is immediately apparent from the separation between the lines that the number of cylinders is having a significant impact on mileage. Automatic versus manual transmission seems to have less impact (judged by the relative lack of slope except for 4 cylinder models) and there does seem to be at least the start of an interaction between the two. (bear in mind this is a small data set and we are very unbalanced).
One other easy tip is warranted. Order matters and sometimes it is
helpful to run the command simply reversing the order of the independent
variables to help you better “see” the results,
e.g. Plot2WayANOVA(formula = mpg ~ cyl * am, dataframe = mtcars)
Note that if you want to “save” all these tables of data and
information all you need to do is store the results in an object as in
MyResults <- Plot2WayANOVA(formula = mpg ~ am * cyl, dataframe = mtcars)
then MyResults
can be accessed as a list.
Let’s make some changes that are likely to be quite common:
Plot2WayANOVA(formula = mpg ~ cyl * am,
dataframe = mtcars,
confidence = .99,
title = "MPG by cylinders and type transmission",
xlab = "Cylinders",
ylab = "Miles per gallon",
mean.label = TRUE,
mean.shape = 22,
posthoc.method = "lsd",
errorbar.display = "SEM"
)
#> # A tibble: 6 × 3
#> cyl am count
#> <fct> <fct> <int>
#> 1 4 0 3
#> 2 4 1 8
#> 3 6 0 4
#> 4 6 1 3
#> 5 8 0 12
#> 6 8 1 2
#> etasq | partial.etasq | omegasq | partial.omegasq | epsilonsq | cohens.f
#> ------------------------------------------------------------------------
#> 0.602 | 0.656 | 0.571 | 0.598 | 0.578 | 1.382
#> 0.049 | 0.133 | 0.036 | 0.086 | 0.036 | 0.392
#> 0.034 | 0.096 | 0.009 | 0.023 | 0.009 | 0.326
#> | | | | |
#>
#> etasq | term | sumsq | df | meansq | statistic | p.value | power
#> ------------------------------------------------------------------------
#> 0.602 | cyl | 456.401 | 2 | 228.200 | 24.819 | < .001 | 1.000
#> 0.049 | am | 36.767 | 1 | 36.767 | 3.999 | 0.056 | 0.515
#> 0.034 | cyl:am | 25.437 | 2 | 12.718 | 1.383 | 0.269 | 0.298
#> | Residuals | 239.059 | 26 | 9.195 | | |
#> # A tibble: 6 × 15
#> # Groups: cyl [3]
#> cyl am TheMean TheSD TheSEM CIMuliplier LowerBoundCI UpperBoundCI
#> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 0 22.9 1.45 0.839 9.92 14.6 31.2
#> 2 4 1 28.1 4.48 1.59 3.50 22.5 33.6
#> 3 6 0 19.1 1.63 0.816 5.84 14.4 23.9
#> 4 6 1 20.6 0.751 0.433 9.92 16.3 24.9
#> 5 8 0 15.0 2.77 0.801 3.11 12.6 17.5
#> 6 8 1 15.4 0.566 0.400 63.7 -10.1 40.9
#> # ℹ 7 more variables: LowerBoundSEM <dbl>, UpperBoundSEM <dbl>,
#> # LowerBoundSD <dbl>, UpperBoundSD <dbl>, N <int>, LowerBound <dbl>,
#> # UpperBound <dbl>
#>
#> Posthoc multiple comparisons of means : Fisher LSD
#> 99% family-wise confidence level
#>
#> $cyl
#> diff lwr.ci upr.ci pval
#> 6-4 -6.920779 -10.994590 -2.8469685 7.0e-05 ***
#> 8-4 -11.563636 -14.958479 -8.1687941 6.6e-10 ***
#> 8-6 -4.642857 -8.543234 -0.7424803 0.0028 **
#>
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Brown-Forsythe Test for Homogeneity of Variance using median
#> Df F value Pr(>F)
#> group 5 2.736 0.04086 *
#> 26
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Shapiro-Wilk normality test
#>
#> data: MyAOV_residuals
#> W = 0.96277, p-value = 0.3263
#> # A tibble: 4 × 4
#> model bf support margin_of_error
#> <chr> <dbl> <chr> <dbl>
#> 1 cyl + am 4505215. " data support is extreme" 0.0128
#> 2 cyl + am + cyl:am 3158150. " data support is extreme" 0.0575
#> 3 cyl 2274206. " data support is extreme" 0.0000000127
#> 4 am 86.6 " data support is very strong" 0.0000000188
Please don’t fail to notice how liberal Fisher’s LSD is compared to Scheffe especially given we’ve demanded more confidence
Although the defaults are minimalistic to allow you to focus on the interaction pattern the function also has any number of optional ways of increasing complexity and showing more information.
Let’s make some custom changes that are more uncommon:
# Create a new dataset
library(dplyr)
library(ggplot2)
library(stringi)
newmpg <- mpg %>%
filter(cyl != 5) %>%
mutate(am = stringi::stri_extract(trans, regex = "auto|manual"))
Plot2WayANOVA(formula = hwy ~ am * cyl,
dataframe = newmpg,
ylab = "Highway mileage",
xlab = "Transmission type",
plottype = "line",
offset.style = "wide",
overlay.type = "box",
mean.label = TRUE,
mean.shape = 20,
mean.size = 3,
mean.label.size = 3,
show.dots = TRUE,
errorbar.display = "SD",
ggtheme = ggplot2::theme_minimal(),
ggplot.component = theme(axis.text.x = element_text(size=14, color="darkblue"))
)
#> etasq | partial.etasq | omegasq | partial.omegasq | epsilonsq | cohens.f
#> ------------------------------------------------------------------------
#> 0.004 | 0.010 | 0.002 | 0.005 | 0.002 | 0.100
#> 0.547 | 0.552 | 0.542 | 0.544 | 0.543 | 1.111
#> 0.005 | 0.012 | 0.001 | 0.003 | 0.001 | 0.109
#> | | | | |
#>
#> etasq | term | sumsq | df | meansq | statistic | p.value | power
#> ---------------------------------------------------------------------------
#> 0.004 | am | 33.447 | 1 | 33.447 | 2.235 | 0.136 | 0.321
#> 0.547 | cyl | 4135.686 | 2 | 2067.843 | 138.208 | < .001 | 1.000
#> 0.005 | am:cyl | 39.796 | 2 | 19.898 | 1.330 | 0.267 | 0.289
#> | Residuals | 3351.458 | 224 | 14.962 | | |
#> # A tibble: 6 × 15
#> # Groups: am [2]
#> am cyl TheMean TheSD TheSEM CIMuliplier LowerBoundCI UpperBoundCI
#> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 auto 4 28.3 4.15 0.648 2.02 27.0 29.6
#> 2 auto 6 22.9 3.64 0.487 2.00 21.9 23.9
#> 3 auto 8 17.2 2.81 0.368 2.00 16.5 18.0
#> 4 manual 4 29.3 4.86 0.768 2.02 27.8 30.9
#> 5 manual 6 22.7 3.87 0.807 2.07 21.0 24.3
#> 6 manual 8 19.5 4.62 1.33 2.20 16.6 22.4
#> # ℹ 7 more variables: LowerBoundSEM <dbl>, UpperBoundSEM <dbl>,
#> # LowerBoundSD <dbl>, UpperBoundSD <dbl>, N <int>, LowerBound <dbl>,
#> # UpperBound <dbl>
#>
#> Posthoc multiple comparisons of means: Scheffe Test
#> 95% family-wise confidence level
#>
#> $cyl
#> diff lwr.ci upr.ci pval
#> 6-4 -5.270233 -7.323725 -3.216741 1.4e-12 ***
#> 8-4 -10.045433 -12.164702 -7.926165 < 2e-16 ***
#> 8-6 -4.775200 -6.906869 -2.643532 9.7e-10 ***
#>
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Brown-Forsythe Test for Homogeneity of Variance using median
#> Df F value Pr(>F)
#> group 5 1.946 0.08778 .
#> 224
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Shapiro-Wilk normality test
#>
#> data: MyAOV_residuals
#> W = 0.97548, p-value = 0.0005039
#> # A tibble: 4 × 4
#> model bf support margin_of_error
#> <chr> <dbl> <chr> <dbl>
#> 1 cyl 8.86e39 " data support is extreme" 6.85e- 8
#> 2 am + cyl 4.23e39 " data support is extreme" 2.04e- 2
#> 3 am + cyl + am:cyl 1.19e39 " data support is extreme" 1.72e- 2
#> 4 am 7.89e 2 " data support is extreme" 2.82e-11
Many thanks to Danielle Navarro and the book * Learning Statistics with R.*
If you like CGPfunctions, please consider Filing a GitHub issue by leaving feedback here, or by contacting me at ibecav at gmail.com by email.
I hope you’ve found this useful. I am always open to comments, corrections and suggestions.
Chuck
This
work is licensed under a
Creative
Commons Attribution-ShareAlike 4.0 International License.