Description
This is the dataset you will be working with:
food <- readr::read_csv("https://wilkelab.org/DSC385/datasets/food_coded.csv")
food
## # A tibble: 125 × 61
## GPA Gender breakfast calor…¹ calor…² calor…³ coffee comfo…⁴ comfo…⁵ comfo…⁶
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 2.4 2 1 430 NaN 315 1 none we don… 9
## 2 3.654 1 1 610 3 420 2 chocol… Stress… 1
## 3 3.3 1 1 720 4 420 2 frozen… stress… 1
## 4 3.2 1 1 430 3 420 2 Pizza,… Boredom 2
## 5 3.5 1 1 720 2 420 2 Ice cr… Stress… 1
## 6 2.25 1 1 610 3 980 2 Candy,… None, … 4
## 7 3.8 2 1 610 3 420 2 Chocol… stress… 1
## 8 3.3 1 1 720 3 420 1 Ice cr… I eat … 1
## 9 3.3 1 1 430 NaN 420 1 Donuts… Boredom 2
## 10 3.3 1 1 430 3 315 2 Mac an… Stress… 1
## # … with 115 more rows, 51 more variables: cook <dbl>,
## # comfort_food_reasons_coded…12 <dbl>, cuisine <dbl>, diet_current <chr>,
## # diet_current_coded <dbl>, drink <dbl>, eating_changes <chr>,
## # eating_changes_coded <dbl>, eating_changes_coded1 <dbl>, eating_out <dbl>,
## # employment <dbl>, ethnic_food <dbl>, exercise <dbl>,
## # father_education <dbl>, father_profession <chr>, fav_cuisine <chr>,
## # fav_cuisine_coded <dbl>, fav_food <dbl>, food_childhood <chr>, …
A detailed data dictionary for this dataset is available here (https://wilkelab.org/DSC385/datasets/food_codebook.pdf). The dataset was originally downloaded from Kaggle, and you can find additional information about the dataset here (https://www.kaggle.com/borapajo/foodchoices/version/5).
Question: Is GPA related to student income, the father’s educational level, or the student’s perception of what an ideal diet is?
To answer this question, first prepare a cleaned dataset that contains only the four relevant data columns, properly cleaned so that numerical values are stored as numbers and categorical values are represented by humanly readable words or phrases. For categorical variables with an inherent order, make sure the levels are in the correct order.
In your introduction, carefully describe each of the four relevant data columns. In your analysis, provide a summary of each of the four columns, using summary() for numerical variables and table() for categorical variables.
Then, make one visualization each for student income, father’s educational level, and ideal diet, and answer the question separately for each visualization. The three visualizations can be of the same type.
Hints:
1. Use case_when() to recode categorical variables.
2. Use fct_relevel() to arrange categorical variables in the right order.
3. Use as.numeric() to convert character strings into numerical values. It is fine to ignore warnings about NAs introduced by coercion.
4. NaN stands for Not a Number and can be treated like NA. You do not need to replace NaN with NA.
5. When using table(), provide the argument useNA = "ifany" to make sure missing values are counted:
table(…, useNA = "ifany").
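The hints above can be tried out on a small example before touching the real data. The sketch below uses a made-up data frame (toy) with invented values and simplified labels ("low", "medium", "high"); none of these names or values come from the food dataset, they are only there to show how the four functions fit together.

```r
library(dplyr)
library(forcats)

# Made-up stand-in data; the values and labels are assumptions for illustration only
toy <- data.frame(
  GPA = c("3.5", "2.9", "Unknown", "3.8"),
  income = c(2, 1, NaN, 3)
)

toy_clean <- toy %>%
  mutate(
    # as.numeric() turns non-numeric strings such as "Unknown" into NA;
    # the coercion warning is expected and suppressed here
    GPA = suppressWarnings(as.numeric(GPA)),
    # case_when() maps numeric codes to readable labels;
    # values that match no condition (here NaN) become NA
    income = case_when(
      income == 1 ~ "low",
      income == 2 ~ "medium",
      income == 3 ~ "high"
    ),
    # fct_relevel() puts the levels in their natural order instead of alphabetical order
    income = fct_relevel(income, "low", "medium", "high")
  )

summary(toy_clean$GPA)
table(toy_clean$income, useNA = "ifany")
```

Without the fct_relevel() step the levels would sort alphabetically ("high", "low", "medium"), which is the wrong order for plots and tables.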
Approach: The approach is straightforward. First, we make a boxplot of GPA for each group within each factor. Then we run hypothesis tests to look for differences in mean GPA between groups. The way the analysis is set up calls for a one-way ANOVA on each factor, so that is what we do. Responses are independent, so that assumption is satisfied; we still need to check for common variance and normality. ANOVA was chosen over notched boxplots because the notches extended past the hinges when rendered, which made for ugly visualizations, and because ANOVA is a more rigorous way to test for differences between groups on a quantitative response variable.
Analysis:
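Below is a breakdown of each of the four columns after data cleaning:
library(tidyverse)

data <- food %>%
  transmute(
    GPA = GPA,
    income = case_when(
      income == 1 ~ 'Less than $15,000',
      income == 2 ~ '$15,001 to $30,000',
      income == 3 ~ '$30,001 to $50,000',
      income == 4 ~ '$50,001 to $70,000',
      income == 5 ~ '$70,001 to $100,000',
      income == 6 ~ 'Higher than $100,000'
    ),
    # The case_when() recoding of these two columns is missing from this listing,
    # so they are passed through unchanged here.
    father_education = father_education,
    ideal_diet_coded = ideal_diet_coded
  ) %>%
  transmute(
    GPA = as.numeric(GPA),
    # Move the lowest bracket first; the remaining labels already sort in ascending order
    income = fct_relevel(as.factor(income), 'Less than $15,000'),
    father_education = as.factor(father_education),
    ideal_diet_coded = as.factor(ideal_diet_coded)
  )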
summary(data$GPA)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   2.200   3.200   3.500   3.416   3.700   4.000       5
table(data$income, useNA = 'ifany')
##
## Less than $15,000 $15,001 to $30,000 $30,001 to $50,000
## 6 7 17
## $50,001 to $70,000 $70,001 to $100,000 Higher than $100,000
## 20 33 41
## <NA>
## 1
table(data$father_education, useNA = 'ifany')
##
## 4 34 12
table(data$ideal_diet_coded, useNA = 'ifany')
##
## Adding veggies/eating healthier food/adding fruit
## 44
## Balance
## 17
## Current diet
## 13
## Home cooked/organic
## 15
## Less sugar
## 6
## More protein
## 16
## Portion control
## 11
## Unclear
## 3
data %>%
  drop_na(income) %>%
  drop_na(GPA) %>%
  ggplot(aes(y = income, x = GPA)) +
  geom_boxplot(varwidth = TRUE, fill = '#6092F5') +
  theme(legend.position = "none")
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = GPA ~ father_education, data = data)
##
## $father_education
## diff lwr upr p adj
Lastly, we have the student’s ideal diet. These results look much like the results from student income, with no real observable differences between groups, partly because the wide range of answers allowed in the survey leaves each group with a small sample size. Students who chose “unclear” appear to have lower GPAs, perhaps from just being lazy, but with only 3 “unclear” responses this difference could easily be due to chance.
As promised, let’s run a one-way ANOVA on each factor. Income, with a p-value of .663, confirms the analysis of income above: there is no statistically significant difference in mean GPA between income groups. Father’s education level, with a p-value of .00715, which is below the significance level of .05, shows strong evidence for a difference in means between groups. Ideal diet, with a p-value of .295, is well above .05, so there is no significant evidence for a difference in means between groups, even with our promising “unclear” observation.
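The ANOVA and Tukey steps can be sketched on simulated data. The data frame sim, its group labels, and the chosen means and standard deviations below are all invented for illustration; this is not the code that produced the p-values reported here, only an outline of the same kind of analysis using base R's aov() and TukeyHSD().

```r
set.seed(1)

# Simulated stand-in data (an assumption, not the food dataset):
# three groups, one of which has a lower mean
sim <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 20)),
  y = c(rnorm(20, 3.4, 0.3), rnorm(20, 3.4, 0.3), rnorm(20, 3.0, 0.3))
)

# One-way ANOVA: does the mean of y differ between groups?
fit <- aov(y ~ group, data = sim)
summary(fit)

# Extract the p-value of the group term from the ANOVA table
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
p_value

# Tukey's honest significant differences: which pairs of groups differ?
tukey <- TukeyHSD(fit)
tukey
```

For the real data the formula would take the form GPA ~ income, GPA ~ father_education, or GPA ~ ideal_diet_coded, one fit per factor.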
To rely on the conclusions above, we need to check the ANOVA assumptions by testing for common variance (Levene’s test) and normality (Shapiro-Wilk test) for each factor. While each factor passes the common variance test, income level and ideal diet do not pass the normality test, showing strong evidence of deviation from the normal distribution. Luckily, there is a nonparametric alternative to one-way ANOVA, the Kruskal-Wallis rank sum test, which can be used when ANOVA assumptions are not met. For these two factors, Kruskal-Wallis also fails to reject the null hypothesis of no difference between groups, in agreement with the ANOVA results.
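A minimal sketch of these checks, again on simulated data rather than the food dataset, might look like the following. Two assumptions are worth flagging: Levene's test is computed by hand here in its median-centered form (a one-way ANOVA on absolute deviations from each group's median) so that the example needs only base R, and the Shapiro-Wilk test is applied to the ANOVA residuals, which may differ from how the tests were run for the results reported above.

```r
set.seed(2)

# Simulated stand-in data (an assumption, not the food dataset)
sim <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 20)),
  y = c(rnorm(20, 3.4, 0.3), rnorm(20, 3.4, 0.3), rnorm(20, 3.0, 0.3))
)

# Levene's test, median-centered version: one-way ANOVA on the absolute
# deviations of each observation from its group median
sim$abs_dev <- abs(sim$y - ave(sim$y, sim$group, FUN = median))
levene_p <- summary(aov(abs_dev ~ group, data = sim))[[1]][["Pr(>F)"]][1]

# Shapiro-Wilk test of normality on the ANOVA residuals
fit <- aov(y ~ group, data = sim)
shapiro_p <- shapiro.test(residuals(fit))$p.value

# Kruskal-Wallis rank sum test: nonparametric alternative to one-way ANOVA
kw <- kruskal.test(y ~ group, data = sim)

c(levene = levene_p, shapiro = shapiro_p, kruskal = kw$p.value)
```

A small p-value in the first two tests signals a violated assumption, in which case the Kruskal-Wallis result is the one to lean on.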