To use data from an R package (for instance, the bechdel data from the fivethirtyeight package), load the relevant package with library(). The data frame will then be available.
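As a minimal sketch, assuming the fivethirtyeight package has already been installed (the calls to dim() and head() are added here only to show that the data frame is available):

```r
# Assumes the package was installed earlier, for example with
# install.packages("fivethirtyeight")
library(fivethirtyeight)

# Once the package is loaded, bechdel can be used like any other data frame
dim(bechdel)    # number of rows and columns
head(bechdel)   # first few rows
```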
We have provided the nnyfs.Rds data file on the course data page.
Suppose you have downloaded this data file into a directory on your computer called data, which is a subdirectory of the directory where you plan to do your work (perhaps called 431-nnyfs).
Open RStudio and create a new project in the 431-nnyfs directory on your computer. You should see a data subdirectory in the Files window in RStudio after the project is created.
Now, read the nnyfs.Rds file into a new tibble in R called nnyfs_new with the following command:
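A minimal sketch of such a command, assuming the readr package is installed and that the file sits at data/nnyfs.Rds relative to the project directory (the path is inferred from the setup steps above, so adjust it if your file lives elsewhere):

```r
library(readr)

# Assumed location: the data subdirectory of the 431-nnyfs project
nnyfs_new <- read_rds("data/nnyfs.Rds")

# Print the tibble to check its dimensions and column types
nnyfs_new
```

Because an .Rds file stores R objects as they were saved, the column types (including factors) come back unchanged.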
# A tibble: 1,518 × 45
    SEQN sex    age_child race_eth    educ_child language sampling_wt income_pov
   <dbl> <fct>      <dbl> <fct>            <dbl> <fct>          <dbl>      <dbl>
 1 71917 Female        15 3_Black No…          9 English       28299.       0.21
 2 71918 Female         8 3_Black No…          2 English       15127.       5
 3 71919 Female        14 2_White No…          8 English       29977.       5
 4 71920 Female        15 2_White No…          8 English       80652.       0.87
 5 71921 Male           3 2_White No…         NA English       55592.       4.34
 6 71922 Male          12 1_Hispanic           6 English       27365.       5
 7 71923 Male          12 2_White No…          5 English       86673.       5
 8 71924 Female         8 4_Other Ra…          2 English       39549.       2.74
 9 71925 Male           7 1_Hispanic           0 English       42333.       0.46
10 71926 Male           8 3_Black No…          2 English       15307.       1.57
# ℹ 1,508 more rows
# ℹ 37 more variables: age_adult <dbl>, educ_adult <fct>, respondent <fct>,
# salt_used <fct>, energy <dbl>, protein <dbl>, sugar <dbl>, fat <dbl>,
# diet_yesterday <fct>, water <dbl>, plank_time <dbl>, height <dbl>,
# weight <dbl>, bmi <dbl>, bmi_cat <fct>, arm_length <dbl>, waist <dbl>,
# arm_circ <dbl>, calf_circ <dbl>, calf_skinfold <dbl>,
# triceps_skinfold <dbl>, subscapular_skinfold <dbl>, active_days <dbl>, …
Using read_csv to read in a comma-separated version of a data file
We have provided the nnyfs.csv data file on the course data page.
Suppose you have downloaded this data file into a directory on your computer called data, which is a subdirectory of the directory where you plan to do your work (perhaps called 431-nnyfs).
Open RStudio and create a new project in the 431-nnyfs directory on your computer. You should see a data subdirectory in the Files window in RStudio after the project is created.
Now, read the nnyfs.csv file into a new tibble in R called nnyfs_new2 with the following command:
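A minimal sketch of such a command, assuming the readr package is installed and that the file sits at data/nnyfs.csv relative to the project directory (the path is inferred from the setup steps above):

```r
library(readr)

# Assumed location: the data subdirectory of the 431-nnyfs project.
# read_csv() guesses each column's type and reports its guesses in a message.
nnyfs_new2 <- read_csv("data/nnyfs.csv")
```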
Rows: 1518 Columns: 45
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (18): sex, race_eth, language, educ_adult, respondent, salt_used, diet_y...
dbl (27): SEQN, age_child, educ_child, sampling_wt, income_pov, age_adult, e...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nnyfs_new2
# A tibble: 1,518 × 45
    SEQN sex    age_child race_eth    educ_child language sampling_wt income_pov
   <dbl> <chr>      <dbl> <chr>            <dbl> <chr>          <dbl>      <dbl>
 1 71917 Female        15 3_Black No…          9 English       28299.       0.21
 2 71918 Female         8 3_Black No…          2 English       15127.       5
 3 71919 Female        14 2_White No…          8 English       29977.       5
 4 71920 Female        15 2_White No…          8 English       80652.       0.87
 5 71921 Male           3 2_White No…         NA English       55592.       4.34
 6 71922 Male          12 1_Hispanic           6 English       27365.       5
 7 71923 Male          12 2_White No…          5 English       86673.       5
 8 71924 Female         8 4_Other Ra…          2 English       39549.       2.74
 9 71925 Male           7 1_Hispanic           0 English       42333.       0.46
10 71926 Male           8 3_Black No…          2 English       15307.       1.57
# ℹ 1,508 more rows
# ℹ 37 more variables: age_adult <dbl>, educ_adult <chr>, respondent <chr>,
# salt_used <chr>, energy <dbl>, protein <dbl>, sugar <dbl>, fat <dbl>,
# diet_yesterday <chr>, water <dbl>, plank_time <dbl>, height <dbl>,
# weight <dbl>, bmi <dbl>, bmi_cat <chr>, arm_length <dbl>, waist <dbl>,
# arm_circ <dbl>, calf_circ <dbl>, calf_skinfold <dbl>,
# triceps_skinfold <dbl>, subscapular_skinfold <dbl>, active_days <dbl>, …
If you also want to convert the character variables to factors, as you will often want to do before analyzing the data, you should instead use:
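One way to do this, sketched under the assumption that the readr, dplyr, and forcats packages are installed and that the file sits at data/nnyfs.csv, is to pipe the imported tibble into mutate() and apply as_factor() to every character column:

```r
library(readr)
library(dplyr)
library(forcats)

# Read the CSV, then convert every character column to a factor.
# across(where(is.character), ...) selects only the character columns,
# so the numeric columns are left untouched.
nnyfs_new3 <- read_csv("data/nnyfs.csv") |>
  mutate(across(where(is.character), as_factor))
```

as_factor() orders the levels by their first appearance in the data, while base R's factor() would sort them alphabetically.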
Rows: 1518 Columns: 45
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (18): sex, race_eth, language, educ_adult, respondent, salt_used, diet_y...
dbl (27): SEQN, age_child, educ_child, sampling_wt, income_pov, age_adult, e...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nnyfs_new3
# A tibble: 1,518 × 45
    SEQN sex    age_child race_eth    educ_child language sampling_wt income_pov
   <dbl> <fct>      <dbl> <fct>            <dbl> <fct>          <dbl>      <dbl>
 1 71917 Female        15 3_Black No…          9 English       28299.       0.21
 2 71918 Female         8 3_Black No…          2 English       15127.       5
 3 71919 Female        14 2_White No…          8 English       29977.       5
 4 71920 Female        15 2_White No…          8 English       80652.       0.87
 5 71921 Male           3 2_White No…         NA English       55592.       4.34
 6 71922 Male          12 1_Hispanic           6 English       27365.       5
 7 71923 Male          12 2_White No…          5 English       86673.       5
 8 71924 Female         8 4_Other Ra…          2 English       39549.       2.74
 9 71925 Male           7 1_Hispanic           0 English       42333.       0.46
10 71926 Male           8 3_Black No…          2 English       15307.       1.57
# ℹ 1,508 more rows
# ℹ 37 more variables: age_adult <dbl>, educ_adult <fct>, respondent <fct>,
# salt_used <fct>, energy <dbl>, protein <dbl>, sugar <dbl>, fat <dbl>,
# diet_yesterday <fct>, water <dbl>, plank_time <dbl>, height <dbl>,
# weight <dbl>, bmi <dbl>, bmi_cat <fct>, arm_length <dbl>, waist <dbl>,
# arm_circ <dbl>, calf_circ <dbl>, calf_skinfold <dbl>,
# triceps_skinfold <dbl>, subscapular_skinfold <dbl>, active_days <dbl>, …
Note that, for example, sex and race_eth are now listed as factor (<fct>) variables rather than character (<chr>) variables. One place where this distinction between character and factor variables matters is when you summarize the data.
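To illustrate the difference, here is a small sketch using a made-up four-row tibble (hypothetical values, not the nnyfs data), assuming the dplyr and forcats packages are installed:

```r
library(dplyr)
library(forcats)

# Hypothetical toy data standing in for a character column such as sex
toy <- tibble(sex = c("Female", "Male", "Female", "Female"))

# For a character variable, summary() reports only length, class, and mode
summary(toy$sex)

# After conversion to a factor, summary() reports a count for each level
toy_fct <- toy |> mutate(sex = as_factor(sex))
summary(toy_fct$sex)
```

The factor version gives a usable frequency table (here, 3 Female and 1 Male), while the character version says nothing about the values themselves.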