DACSS 603, Spring 2022
PART II (Final Project)
1. What is your research question for the final project?
Is there a relationship between level of education (profile_educ5) and how one rates Planned Parenthood (ftpp)?
Information source: ANES 2020 Social Media Study
library(readr)  # provides read_csv()
my_data2 <- read_csv(file = "C:\\Users\\erink\\OneDrive\\Documents\\2020Social_Media.csv")
head(my_data2)
# A tibble: 6 x 13
  caseid profile_gender profile_age profile_racethnicity profile_educ5
   <dbl>          <dbl>       <dbl>                <dbl>         <dbl>
1   3824              2          32                    3             1
2    235              2          63                    1             3
3   1286              1          41                    1             3
4   4981              1          52                    2             5
5   1183              2          67                    1             3
6   3158              1          46                    2             4
# ... with 8 more variables: profile_marital <dbl>,
#   profile_income <dbl>, profile_region4 <dbl>,
#   profile_region9 <dbl>, profile_metro <dbl>, profile_relig <dbl>,
#   profile_born <dbl>, ftpp <dbl>
2. What is your hypothesis (i.e. an answer to the research question) that you want to test?
I expect that those with higher levels of education are more likely to rate Planned Parenthood highly.
3. Present some exploratory analysis. In particular:
a. Numerically summarize (e.g. with the summary() function) the variables of interest (the outcome, the explanatory variable, the control variables).
summary(my_data2)
     caseid     profile_gender   profile_age   profile_racethnicity
 Min.   :   1   Min.   :1.000   Min.   :18.0   Min.   :1.000
 1st Qu.:1459   1st Qu.:1.000   1st Qu.:36.0   1st Qu.:1.000
 Median :2916   Median :1.000   Median :51.0   Median :1.000
 Mean   :2915   Mean   :1.495   Mean   :50.5   Mean   :1.636
 3rd Qu.:4371   3rd Qu.:2.000   3rd Qu.:65.0   3rd Qu.:2.000
 Max.   :5830   Max.   :2.000   Max.   :80.0   Max.   :4.000
 profile_educ5   profile_marital  profile_income  profile_region4
 Min.   :1.000   Min.   :1.000   Min.   : 1.00   Min.   :1.000
 1st Qu.:3.000   1st Qu.:1.000   1st Qu.: 7.00   1st Qu.:2.000
 Median :3.000   Median :1.000   Median :11.00   Median :3.000
 Mean   :3.397   Mean   :2.549   Mean   :10.36   Mean   :2.671
 3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:14.00   3rd Qu.:4.000
 Max.   :5.000   Max.   :6.000   Max.   :18.00   Max.   :4.000
 profile_region9  profile_metro    profile_relig     profile_born
 Min.   :1.00    Min.   :0.0000   Min.   :-7.000   Min.   :-7.0000
 1st Qu.:3.00    1st Qu.:1.0000   1st Qu.: 1.000   1st Qu.:-1.0000
 Median :5.00    Median :1.0000   Median : 9.000   Median : 1.0000
 Mean   :5.23    Mean   :0.8313   Mean   : 6.706   Mean   : 0.8106
 3rd Qu.:8.00    3rd Qu.:1.0000   3rd Qu.:12.000   3rd Qu.: 2.0000
 Max.   :9.00    Max.   :1.0000   Max.   :14.000   Max.   : 2.0000
      ftpp
 Min.   : -7.00
 1st Qu.: 28.00
 Median : 61.00
 Mean   : 56.98
 3rd Qu.: 88.00
 Max.   :100.00
I am mostly interested in level of education (profile_educ5) and the rating of Planned Parenthood (ftpp). The dataset initially had 521 variables. I removed most of them but kept gender, age, race, marital status, income, region, and religion, so that I can see how those variables factor into my research question.
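The summary above reports a minimum of -7 for ftpp, profile_relig, and profile_born. A rating of -7 cannot be a real answer on a scale whose maximum is 100, so these negative values are probably missing-data codes. That is an assumption that should be confirmed against the ANES codebook. A minimal sketch of recoding them to NA before analysis, using a made-up toy data frame in place of my_data2:

```r
# Hypothetical toy data standing in for my_data2; the values are made up.
toy <- data.frame(
  profile_educ5 = c(1, 3, 3, 5, 4, 2),
  ftpp          = c(50, -7, 85, 100, 0, 61)
)

# Assumption: negative values are missing-data codes, not real ratings.
toy$ftpp[toy$ftpp < 0] <- NA

# The summary now reports the NA separately instead of folding -7 into the mean.
summary(toy$ftpp)
mean(toy$ftpp, na.rm = TRUE)
```

If the assumption holds, the same recoding would apply to the other variables with negative minimums. Leaving the codes in place would pull the mean of ftpp downward.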
b. Plot the relationships between key variables. You can do this any way you want, but one straightforward way of doing this would be with the pairs() function or other scatter plots / box plots. Interpret what you see.
install.packages("ggplot2")
library(ggplot2)
pairs(~profile_educ5 + profile_relig + ftpp, data=my_data2)
ggplot(data = my_data2) +
  geom_point(mapping = aes(x = ftpp, y = profile_educ5))
ggplot(data = my_data2) +
  geom_smooth(mapping = aes(x = ftpp, y = profile_educ5))
install.packages("lsr")
library(lsr)
install.packages("PerformanceAnalytics")
library(PerformanceAnalytics)
chart.Correlation(my_data2,histogram=TRUE)
The geom_point plot isn’t very useful at the moment, likely because education takes only five values and the points pile on top of one another; I will have to make some adjustments to be able to read it. The geom_smooth plot does suggest that people who rated Planned Parenthood very highly tend to have high levels of education, but there is also another peak among people who gave Planned Parenthood a low rating.
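One possible adjustment is to treat education as a grouping variable on the x-axis and show the distribution of ratings within each level. The sketch below uses randomly generated toy data, not the ANES file, so its shape says nothing about the real relationship. It only illustrates the plotting approach: a boxplot per education level with lightly jittered, semi-transparent points on top.

```r
library(ggplot2)

set.seed(603)  # reproducible made-up data
# Hypothetical stand-in for my_data2 with the two variables of interest.
toy <- data.frame(
  profile_educ5 = sample(1:5, 200, replace = TRUE),
  ftpp          = sample(0:100, 200, replace = TRUE)
)

# factor() makes ggplot2 draw one box per education level.
p <- ggplot(toy, aes(x = factor(profile_educ5), y = ftpp)) +
  geom_boxplot(outlier.shape = NA) +                      # hide outliers; the jittered points show them
  geom_jitter(width = 0.2, height = 0, alpha = 0.3) +     # spread points sideways only
  labs(x = "Education level (profile_educ5)",
       y = "Planned Parenthood rating (ftpp)")
print(p)
```

Setting height = 0 keeps the ratings at their true values, so the jitter only separates points that share an education level.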
I need to review which models best suit this type of data, since education has only five levels. I also want to explore the chart.Correlation() output further as I dig deeper into this project.
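Two candidate approaches for a five-level predictor are sketched below. This is not a settled choice for the project. The first is a Spearman rank correlation, which uses only the ordering of the education levels. The second is a linear model with education entered as a factor, which estimates a separate mean rating for each level. The data are simulated with an arbitrary built-in positive trend, so the output is illustrative only.

```r
set.seed(603)  # reproducible made-up data
n    <- 300
educ <- sample(1:5, n, replace = TRUE)
# Made-up outcome with an arbitrary upward trend, clipped to the 0-100 range.
ftpp <- pmin(100, pmax(0, round(40 + 5 * educ + rnorm(n, sd = 25))))
toy  <- data.frame(profile_educ5 = educ, ftpp = ftpp)

# Option 1: rank-based correlation; exact = FALSE because the data contain ties.
rho <- cor.test(toy$ftpp, toy$profile_educ5, method = "spearman", exact = FALSE)
rho$estimate
rho$p.value

# Option 2: education as a categorical predictor (level 1 is the baseline).
fit <- lm(ftpp ~ factor(profile_educ5), data = toy)
summary(fit)  # each coefficient is a difference in mean rating from level 1
anova(fit)    # overall F-test for any difference across the five levels
```

Control variables such as age or income could be added to the right-hand side of the lm() formula once the missing-data codes are handled.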